This article describes how to do a k-means cluster analysis in Displayr. K-means cluster analysis is a method for grouping similar cases into clusters. The resulting clusters differ from one another, while the cases within each cluster are broadly similar.
Requirements
- A data set containing the variables that you want to use as inputs to the cluster analysis segmentation.
- Familiarity with the Structure and Value Attributes of Variable Sets.
Method
1. From the toolbar, select Anything > Advanced Analysis > Cluster > K-Means Cluster Analysis. A cluster analysis object will be added to the current page.
2. From the object inspector, select the inputs (clustering variables) from the Variables dropdown in the Inputs section. For this example, we've selected 11 behavioral/attitudinal statements on mobile technology. The statements were asked on a 5-point agree/disagree scale. We'll use the top 2 box responses to each of the statements as the inputs to our k-means cluster analysis.
You can also use any other numeric variables that may differentiate respondents and so help define the clusters.
Note that if the variables are grouped in a Variable Set, then the Variable Set may be selected instead, which is more convenient than selecting multiple variables.
3. Select the number of clusters that you want to create in the Number of clusters field. We've selected 3 clusters for this example, but you can choose any value you want here.
4. Optional: Modify any of the other input settings as desired. For this example, we'll leave the default values selected. Options include:
Missing data (see Missing Data Options):
- Error if missing data
- Exclude cases with missing data
- Use partial data - This is the default
- Imputation (replace missing values with estimates)
Algorithm:
- Batch - This is the default and is the only algorithm that can accommodate weights or missing values. Refer to the Technical Details section below.
- Hartigan-Wong - Refer to kmeans for more information on this and the algorithms below
Output:
- Means table - Show the cluster means. Best if wanting to export to another program
- Segment profiling table - Show the composition of the Profiling variables within predicted clusters. More options to control the appearance are described here.
- Weight - Where a weight has been set for the output, the calibrated weight is used. See Weights in R.
- Cluster labels - An optional comma-separated list used to name the clusters predicted by the k-means model
- Profiling variables - Select other variables or variable sets to crosstab with the segment cluster
5. Click the Calculate button (or tick the Automatic checkbox so that the analysis will re-run automatically if any changes are made).
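The Top 2 Box inputs described in step 2 are indicators of whether a respondent chose one of the two most favorable scale points. In Displayr this recoding is handled through the variable set's structure and value attributes, so the Python sketch below only illustrates the idea. It assumes a hypothetical coding of 1 (strongly disagree) to 5 (strongly agree) and uses None for a missing response. The function names and the data are made up for illustration.

```python
def top2box(rating, top_codes=(4, 5)):
    """Return 1.0 if the rating is in the top two boxes, 0.0 if not, None if missing.

    Assumes codes 4 and 5 are the two 'agree' points of a 5-point scale.
    """
    if rating is None:
        return None
    return 1.0 if rating in top_codes else 0.0


def column_share(rows, j):
    """Share of non-missing cases with a Top 2 Box response on variable j."""
    values = [row[j] for row in rows if row[j] is not None]
    return sum(values) / len(values)


# Three illustrative respondents answering three statements (made-up data).
responses = [
    [5, 2, 4],
    [3, None, 1],
    [4, 4, 5],
]

# Each rating becomes a 0/1 indicator; missing responses stay missing.
recoded = [[top2box(r) for r in row] for row in responses]

# The mean of a 0/1 indicator is the Top 2 Box proportion for that statement.
shares = [column_share(recoded, j) for j in range(3)]
print(recoded)
print(shares)
```

Because each input is a 0/1 indicator, a cluster mean on that input can be read directly as the proportion of the cluster giving a Top 2 Box response.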
Interpreting the Results
The standard table of means output shown above lists each of the clustering variables in the rows and shows the mean Top 2 Box percentage for each of the clusters.
- The size of each cluster (n) is shown in the column header.
- The red and blue highlights indicate whether the Top 2 Box score is higher (blue) or lower (red) than the overall mean. The colors are also scaled to provide additional differentiation (darker shades of red/blue are farther from the mean).
- Means in bold font are significantly higher or lower than the overall mean.
- The R-Squared value shows the proportion of variance in the cluster assignment that is explained by each of the clustering variables. In the example above, we can see that there are 4 statements that have a greater impact on the segment/cluster predictions than do the remaining variables.
- The p-value shows which statement variables are significant in the model.
- Where weights are provided, the percentages show weighted data but the n does not.
- The Variance Explained is a multivariate R-squared statistic, which is sometimes known as omega-squared in the cluster analysis literature.
- The Calinski-Harabasz statistic can be useful when selecting the number of segments (higher is better). However, it should not be relied upon as the ultimate arbiter of the number of segments, as it is not particularly scientific.
See our Data Story Guide article Interpreting Cluster Analysis Outputs for more guidance on interpretation.
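To make the last two statistics more concrete, the Python sketch below computes them from their textbook definitions for complete, unweighted data: variance explained as the between-cluster sum of squares divided by the total sum of squares, and Calinski-Harabasz as the ratio of between-cluster to within-cluster mean squares. The function names and data are hypothetical, and Displayr's own calculation (which also handles weights and missing data) may differ in detail.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Unweighted mean of a list of points, variable by variable."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]


def cluster_fit_stats(points, labels):
    """Return (variance explained, Calinski-Harabasz) for a given clustering."""
    n = len(points)
    clusters = sorted(set(labels))
    k = len(clusters)
    grand = mean_point(points)
    # Total sum of squares around the overall mean.
    tss = sum(sq_dist(p, grand) for p in points)
    # Within-cluster sum of squares around each cluster's own mean.
    wss = 0.0
    for c in clusters:
        members = [p for p, l in zip(points, labels) if l == c]
        center = mean_point(members)
        wss += sum(sq_dist(p, center) for p in members)
    bss = tss - wss  # between-cluster sum of squares
    variance_explained = bss / tss
    calinski_harabasz = (bss / (k - 1)) / (wss / (n - k))
    return variance_explained, calinski_harabasz


# Made-up data: two tight, well-separated clusters of two cases each.
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
labels = [0, 0, 1, 1]
ve, ch = cluster_fit_stats(points, labels)
print(ve, ch)
```

Both statistics reward clusters that are tight and far apart. Only Calinski-Harabasz adjusts for the number of clusters, which is why it is the one used when comparing solutions with different numbers of segments.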
Saving Cluster Membership
Each respondent can be assigned to a cluster in Displayr by first selecting the K-Means Cluster Analysis output and then selecting Inputs > Save Variable(s) > Cluster Membership.
See How to Save K-Means Cluster Membership for details.
Technical Details
The Batch algorithm works as follows:
- The Hartigan-Wong k-means algorithm is first run as if Missing data were set to Exclude cases with missing data, so the initial clusters are found using complete cases only.
- Cases are assigned to the most similar cluster. Where Missing data is set to use partial data (the default), this means that cases that were ignored by Hartigan-Wong are now included in the analysis.
- The cluster centers are updated. Where weights have been applied, this means that the cluster centers now reflect weights (they were ignored by Hartigan-Wong).
- The previous two steps are repeated until either the maximum number of iterations, iter.max (which defaults to 100), has been exceeded or the Omega-Squared does not increase.
The analysis uses the kmeans function from the stats R package.
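The assign-and-update loop of the Batch algorithm can be sketched in Python as follows. This is an illustration under stated assumptions rather than Displayr's implementation, which runs in R. The starting centers are passed in as a stand-in for the Hartigan-Wong result on complete cases. Similarity under partial data is assumed to be the mean squared difference over the variables a case answered. Omega-Squared is assumed to be one minus the ratio of weighted within-cluster to weighted total sums of squares over observed values. All function names and data are hypothetical, and each case is assumed to have at least one non-missing value.

```python
def partial_distance(case, center):
    """Mean squared difference over the variables the case actually answered."""
    pairs = [(x, c) for x, c in zip(case, center) if x is not None]
    return sum((x - c) ** 2 for x, c in pairs) / len(pairs)


def weighted_centers(cases, weights, labels, old_centers):
    """Weighted mean of the non-missing values in each cluster, per variable."""
    centers = []
    for c, old in enumerate(old_centers):
        center = []
        for j in range(len(old)):
            obs = [(x[j], w) for x, w, l in zip(cases, weights, labels)
                   if l == c and x[j] is not None]
            total_w = sum(w for _, w in obs)
            # Keep the previous value if nobody in the cluster answered variable j.
            center.append(sum(v * w for v, w in obs) / total_w if total_w > 0 else old[j])
        centers.append(center)
    return centers


def omega_squared(cases, weights, labels, centers):
    """1 - weighted within-cluster SS / weighted total SS, observed values only."""
    n_vars = len(centers[0])
    grand = []
    for j in range(n_vars):
        obs = [(x[j], w) for x, w in zip(cases, weights) if x[j] is not None]
        grand.append(sum(v * w for v, w in obs) / sum(w for _, w in obs))
    wss = tss = 0.0
    for x, w, l in zip(cases, weights, labels):
        for j in range(n_vars):
            if x[j] is not None:
                wss += w * (x[j] - centers[l][j]) ** 2
                tss += w * (x[j] - grand[j]) ** 2
    return 1.0 - wss / tss


def batch_kmeans(cases, weights, start_centers, iter_max=100):
    """Repeat assign/update until iter_max is reached or the fit stops increasing."""
    centers = [list(c) for c in start_centers]
    k = len(centers)
    best, labels = float("-inf"), None
    for _ in range(iter_max):
        # Assign every case, including those with missing values, to the nearest center.
        new_labels = [min(range(k), key=lambda c: partial_distance(x, centers[c]))
                      for x in cases]
        # Update the centers so that they reflect the weights.
        new_centers = weighted_centers(cases, weights, new_labels, centers)
        fit = omega_squared(cases, weights, new_labels, new_centers)
        if fit <= best:
            break
        best, labels, centers = fit, new_labels, new_centers
    return labels, centers, best


# Made-up Top 2 Box style data; None marks a missing response.
cases = [[1.0, 1.0, 1.0], [1.0, None, 1.0], [0.0, 0.0, 0.0],
         [None, 0.0, 0.0], [1.0, 1.0, 0.0]]
weights = [1.0, 2.0, 1.0, 1.0, 0.5]
start_centers = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]  # stand-in for the Hartigan-Wong step
labels, centers, fit = batch_kmeans(cases, weights, start_centers)
print(labels, centers, fit)
```

In this example the two cases with missing values are still assigned to a cluster, and the third variable's center in the first cluster is pulled toward the case with weight 2.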