How to Do K-Means Cluster Analysis

This article describes how to do a k-means cluster analysis in Displayr. K-means cluster analysis groups similar cases into clusters, so that the cases within a cluster are broadly similar to each other and the clusters differ from one another.

Requirements

A data set containing the variables that you want to use as inputs to the cluster analysis segmentation.
Familiarity with the Structure and Value Attributes of Variable Sets.

Method

1. From the toolbar, select Anything > Advanced Analysis > Cluster > K-Means Cluster Analysis, or in the Report tree, select + > Advanced Analysis > Cluster > K-Means Cluster Analysis. A cluster analysis object will be added to the current page.

2. In the Properties, select the inputs (clustering variables) from the Variables dropdown in the Data tab. For this example, we've selected 11 behavioral/attitudinal statements on mobile technology. The questions were asked on a 5-point agree/disagree scale, and we'll use the top-2-box responses to each statement as inputs to our k-means cluster analysis. You can also use any other numeric variables that may differentiate respondents and therefore help define the clusters. Note that if the variables are grouped in a Variable Set, you can select the Variable Set instead, which is more convenient than selecting multiple variables.

3. Enter the number of clusters that you want to create in the Number of clusters field. We've selected 3 clusters for this example, but you can choose any value you want here.

4. Optional: Modify any of the other input settings as desired. For this example, we'll leave the default values selected. Options include:

Missing data (see Missing Data Options):
- Error if missing data
- Exclude cases with missing data
- Use partial data - This is the default
- Imputation (replace missing values with estimates)
Algorithm:
- Batch - This is the default and is the only algorithm that can accommodate weights or missing values. Refer to the Technical Details section below.
- Hartigan-Wong - Refer to kmeans for more information on this and the algorithms below
- Forgy
- Lloyd
- MacQueen
Output:
- Means
- Means table - Show the cluster means. Best if wanting to export to another program
- Segment profiling table - Show the composition of the Profiling variables within predicted clusters. More options to control the appearance are described here.
Weight - Where a weight has been set for the output, the calibrated weight is used. See Weights in R.
Cluster labels - An optional comma-separated list used to name the clusters predicted by the k-means model
Profiling variables - Select other variables or variable sets to crosstab with the segment cluster

5. Click the Calculate button (or tick the Calculate automatically checkbox so that the analysis will re-run automatically if any changes are made).
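
The top-2-box inputs used in step 2 can be illustrated outside Displayr with a short Python sketch. It assumes answers are coded 1 (strongly disagree) to 5 (strongly agree) and that a missing answer is stored as None; the data and the function name top_2_box are hypothetical. In Displayr itself this recoding is handled through the Variable Set's Structure and Value Attributes rather than with code.

```python
# Hypothetical responses: one row per respondent, one column per statement,
# coded 1 (strongly disagree) to 5 (strongly agree). None marks a missing answer.
responses = [
    [5, 4, 2],
    [3, 1, 4],
    [4, None, 5],
]

def top_2_box(value, scale_max=5):
    """Return 1 for an answer in the top two scale points, 0 otherwise, None if missing."""
    if value is None:
        return None
    return 1 if value >= scale_max - 1 else 0

# Binary clustering inputs: the mean of each column is the top-2-box proportion.
inputs = [[top_2_box(v) for v in row] for row in responses]
print(inputs)
```

Averaging a column of these 0/1 values within a cluster gives the kind of top-2-box percentage reported in the table of means.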

Interpreting the Results

The standard table of means output shown above lists the clustering variables in the rows and shows the Top 2 Box mean percentage for each cluster.

The size of each cluster (n) is shown in the column header.
The red and blue highlights indicate whether the Top 2 Box score is higher (blue) or lower (red) than the overall mean. The colors are also scaled to provide some additional differentiation (darker shades of red/blue are farther from the mean).
Means in bold font are significantly higher or lower than the overall mean.
The R-Squared value shows the proportion of variance in each clustering variable that is explained by the cluster assignment. In the example above, we can see that 4 statements differentiate the segments/clusters more strongly than the remaining variables.
The p-value shows which statement variables are significant in the model.
Where weights are provided, the percentages show weighted data, but the n does not.
The Variance Explained is a multivariate R-squared statistic, which is sometimes known as omega-squared in the cluster analysis literature.
The Calinski-Harabasz statistic can be useful when selecting the number of segments (higher is better); however, it is a heuristic rather than a formal test, so it should not be relied upon as the ultimate arbiter of the number of segments.
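
As a rough illustration of the last two statistics, the sketch below computes them from sums of squares using the textbook definitions: variance explained is the between-cluster sum of squares divided by the total, and Calinski-Harabasz is the ratio of between-cluster to within-cluster sums of squares, each divided by its degrees of freedom. It assumes complete, unweighted data. The function names and the toy data are hypothetical, and Displayr's own calculation may differ in how it handles weights and missing values.

```python
def column_means(rows):
    """Mean of each variable (column) across the given cases."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sum_of_squares(rows, center):
    """Sum of squared deviations of every value from the given center."""
    return sum((x - c) ** 2 for row in rows for x, c in zip(row, center))

def fit_statistics(data, membership):
    """Return (variance explained, Calinski-Harabasz) for a given clustering."""
    clusters = {}
    for row, label in zip(data, membership):
        clusters.setdefault(label, []).append(row)
    n, k = len(data), len(clusters)
    total_ss = sum_of_squares(data, column_means(data))
    within_ss = sum(sum_of_squares(rows, column_means(rows))
                    for rows in clusters.values())
    between_ss = total_ss - within_ss
    variance_explained = between_ss / total_ss
    calinski_harabasz = (between_ss / (k - 1)) / (within_ss / (n - k))
    return variance_explained, calinski_harabasz

# Toy data: two tight, well-separated clusters of two cases each.
data = [[0, 0], [0, 2], [10, 10], [10, 12]]
membership = [0, 0, 1, 1]
variance_explained, calinski_harabasz = fit_statistics(data, membership)
print(variance_explained, calinski_harabasz)
```

Because the two toy clusters are far apart relative to their spread, almost all of the variance is between clusters, so both statistics are high.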

See our Data Story Guide article Interpreting Cluster Analysis Outputs for more guidance on interpretation.

Saving Cluster Membership

To assign each respondent to a cluster in Displayr, select the K-Means Cluster Analysis output, then in the Properties click Data > Save Variable(s) > Cluster Membership.

See How to Save K-Means Cluster Membership for details.

Technical Details

The Batch algorithm works as follows:

1. The Hartigan-Wong k-means algorithm is used to find clusters, with Missing data set to Exclude cases with missing data.
2. Cases are assigned to the most similar cluster. Where Missing data is set to Use partial data (the default), this means that cases that were ignored by Hartigan-Wong are now included in the analysis.
3. The cluster centers are updated. Where weights have been applied, this means that the cluster centers now reflect the weights (which were ignored by Hartigan-Wong).
4. Steps 2 and 3 are repeated until either the maximum number of iterations, iter.max (which defaults to 100), has been exceeded or the Omega-Squared does not increase.
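
The assign-and-update loop described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Displayr's implementation. The Hartigan-Wong starting solution is replaced by starting centers supplied by the caller. Similarity is taken to be squared distance over the variables a case has answered. The stopping rule tracks the weighted within-cluster sum of squares, used here as a stand-in for checking that Omega-Squared no longer increases. All function names and data are hypothetical.

```python
def distance(row, center):
    """Squared distance using only the variables this case answered."""
    return sum((x - c) ** 2 for x, c in zip(row, center) if x is not None)

def assign(data, centers):
    """Assign every case, including those with partial data, to its nearest center."""
    return [min(range(len(centers)), key=lambda j: distance(row, centers[j]))
            for row in data]

def update(data, weights, membership, centers):
    """Recompute each center as the weighted mean of the observed values."""
    new_centers = []
    for j, old in enumerate(centers):
        center = []
        for v, old_value in enumerate(old):
            pairs = [(row[v], w) for row, w, m in zip(data, weights, membership)
                     if m == j and row[v] is not None]
            total_w = sum(w for _, w in pairs)
            # Keep the old value if no observed data supports this coordinate.
            center.append(sum(x * w for x, w in pairs) / total_w
                          if total_w else old_value)
        new_centers.append(center)
    return new_centers

def batch_kmeans(data, weights, start_centers, iter_max=100):
    """Repeat assignment and update until the fit stops improving or iter_max is hit."""
    centers = [list(c) for c in start_centers]
    membership = assign(data, centers)
    best = float("inf")
    for _ in range(iter_max):
        centers = update(data, weights, membership, centers)
        membership = assign(data, centers)
        # Weighted within-cluster sum of squares over observed values.
        objective = sum(w * distance(row, centers[m])
                        for row, w, m in zip(data, weights, membership))
        if objective >= best:
            break
        best = objective
    return membership, centers

# Toy data: None marks a missing value; the second case carries a weight of 3.
data = [[1.0, 1.0], [2.0, None], [9.0, 8.0], [None, 10.0]]
weights = [1.0, 3.0, 1.0, 1.0]
start_centers = [[0.0, 0.0], [10.0, 10.0]]  # stands in for the Hartigan-Wong result
membership, centers = batch_kmeans(data, weights, start_centers)
print(membership, centers)
```

In this toy run the two cases with missing values are still assigned to clusters, and the first coordinate of the first center is pulled toward the heavily weighted case.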