This article describes how to do a k-means cluster analysis in Displayr. The k-means cluster analysis algorithm is a method for grouping similar cases into groups, or clusters. The final clusters will be different from each other, while the cases within a cluster are broadly similar to each other.
Requirements
- A data set containing the variables that you want to use as inputs to the cluster analysis segmentation
- Familiarity with the Structure and Value Attributes of Variable Sets
Method
1. From the toolbar, select Anything > Advanced Analysis > Cluster > K-Means Cluster Analysis. A cluster analysis object will be added to the current page.
2. From the object inspector, select the inputs (clustering variables) from the Variables dropdown in the Inputs section. For this example, we've selected 11 behavioral/attitudinal statements on mobile technology. Questions were asked as a 5-point agree/disagree scale. We'll use the top 2 box responses to each of the statements as the inputs to our k-means cluster analysis.
You can use any other numeric variables as clustering variables that can potentially provide differentiation between the respondents and therefore help define the clusters.
Note that if the variables are grouped in a Variable Set, then the Variable Set may be selected instead, which is more convenient than selecting multiple variables.
3. Select the number of clusters that you want to create in the Number of clusters field. We've selected 3 clusters for this example, but you can choose any value you want here.
4. Optional: Modify any of the other input settings as desired. For this example, we'll leave the default values selected. Options include:
- Missing data (see Missing Data Options):
  - Error if missing data
  - Exclude cases with missing data
  - Use partial data - This is the default
  - Imputation (replace missing values with estimates)
- Algorithm:
  - Batch - This is the default and is the only algorithm that can accommodate weights or missing values. Refer to the Technical Details section below.
  - Hartigan-Wong - Refer to kmeans for more information on this and the algorithms below
  - Forgy
  - Lloyd
  - MacQueen
- Output:
  - Means
  - Means table - Show the cluster means. Best if wanting to export to another program
  - Segment profiling table - Show the composition of the Profiling variables within predicted clusters. More options to control the appearance are described here.
- Cluster labels - An optional comma-separated list used to name the clusters predicted by the k-means model
- Profiling variables - Select other variables or variable sets to crosstab with the segment cluster
5. Click the Calculate button (or tick the Automatic checkbox so that the analysis will re-run automatically if any changes are made).
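For readers who want to see what the steps above do to the data, here is a minimal Python sketch. It is an illustration only, not Displayr's implementation: the function names (top2box, kmeans), the example responses, and the choice to start from the first k distinct rows are all assumptions made for this example. It recodes 5-point responses into Top 2 Box indicators and then runs a plain Lloyd-style k-means on complete, unweighted data.

```python
def top2box(response, scale_max=5):
    """Return 1 if the response falls in the top two boxes of the scale, else 0."""
    return 1 if response >= scale_max - 1 else 0


def squared_distance(row, center):
    """Squared Euclidean distance between a case and a cluster center."""
    return sum((x - m) ** 2 for x, m in zip(row, center))


def kmeans(rows, k, iter_max=100):
    """Plain Lloyd-style k-means for complete, unweighted data.

    Assumption for this sketch: the starting centers are the first k
    distinct rows, which keeps the example deterministic.
    """
    distinct = list(dict.fromkeys(tuple(r) for r in rows))
    if len(distinct) < k:
        raise ValueError("need at least k distinct cases")
    centers = [list(c) for c in distinct[:k]]
    assignment = None
    for _ in range(iter_max):
        # Assign every case to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(row, centers[c]))
            for row in rows
        ]
        if new_assignment == assignment:
            break  # no case changed cluster, so the solution is stable
        assignment = new_assignment
        # Move each center to the mean of the cases assigned to it.
        for c in range(k):
            members = [r for r, a in zip(rows, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


# Hypothetical 5-point agree/disagree responses: 6 respondents, 3 statements.
responses = [
    [5, 4, 1],
    [4, 5, 2],
    [1, 2, 5],
    [2, 1, 4],
    [5, 5, 1],
    [1, 1, 5],
]
inputs = [[top2box(x) for x in row] for row in responses]
assignment, centers = kmeans(inputs, k=2)
print(assignment)
print(centers)
```

Each center is the Top 2 Box proportion for a statement within that cluster, which is the same quantity the table of means reports.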
Interpreting the Results
The standard table of means output shown above lists each of the clustering variables in the rows and shows the mean Top 2 Box percentage for each of the clusters.
- The size of each cluster (n) is shown in the column header.
- The red and blue highlights indicate whether the Top 2 Box score is higher (blue) or lower (red) than the overall mean. The colors are also scaled to provide additional differentiation (darker shades of red/blue are farther from the mean).
- Means in bold font are significantly higher/lower than the mean score.
- The R-Squared value shows the proportion of variance in the cluster assignment that is explained by each of the clustering variables. In the example above, we can see that 4 statements have a greater impact on the segment/cluster predictions than the remaining variables.
- The p-value shows which statement variables are significant in the model.
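The core of the table of means can be reproduced by hand. The sketch below is a simplified illustration with made-up data and a hypothetical function name (means_table). It computes the cluster sizes, the overall mean of each variable, and each cluster's mean, and flags whether a cluster mean is higher or lower than the overall mean. It does not reproduce Displayr's significance tests, R-Squared, or p-values.

```python
def means_table(rows, assignment, k):
    """Return cluster sizes, overall means, and per-cluster means."""
    n_vars = len(rows[0])
    overall = [sum(r[j] for r in rows) / len(rows) for j in range(n_vars)]
    sizes = [assignment.count(c) for c in range(k)]
    table = []
    for c in range(k):
        members = [r for r, a in zip(rows, assignment) if a == c]
        # An empty cluster has no mean, so report NaN for it.
        table.append([
            sum(r[j] for r in members) / len(members) if members else float("nan")
            for j in range(n_vars)
        ])
    return sizes, overall, table


def direction(cluster_mean, overall_mean):
    """Label a cluster mean relative to the overall mean (blue/red in the output)."""
    if cluster_mean > overall_mean:
        return "higher"
    if cluster_mean < overall_mean:
        return "lower"
    return "same"


# Hypothetical Top 2 Box indicators (1 = agree) and cluster memberships.
rows = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
assignment = [0, 0, 1, 1]

sizes, overall, table = means_table(rows, assignment, k=2)
for c, means in enumerate(table):
    flags = [direction(m, o) for m, o in zip(means, overall)]
    print(f"Cluster {c + 1} (n = {sizes[c]}):", means, flags)
```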
Saving Cluster Membership
Individual respondents can be assigned to the individual clusters in Displayr by first selecting the k-Means Cluster Analysis output and then selecting Inputs > Save Variable(s) > Cluster Membership. A new categorical variable is added to the top of the data set called "Segment/Cluster memberships from r.output". Locate the new variable in the Data Sets tree and hover over it to preview the respondent level membership data or drag the variable onto the page to create a table.
This segment/cluster variable can be used for profiling against your demographic variables. Once you've identified the key differences between your clusters, try to come up with names that describe each cluster. You can then add these names to the cluster variable by selecting the variable in the Data Sets tree, clicking the Labels button under Properties on the right, and entering the cluster names in the Label column. Click OK to save the cluster names.
Technical Details
The Batch algorithm works as follows:
- The Hartigan-Wong k-means algorithm is used to find clusters, with Missing data set to Exclude cases with missing data.
- Cases are assigned to the most similar cluster. Where Missing data is set to use partial data (the default), this means that cases that were ignored by Hartigan-Wong are now included in the analysis.
- The cluster centers are updated. Where weights have been applied, this means that the cluster centers now reflect weights (they were ignored by Hartigan-Wong).
- The previous two steps are repeated until either the maximum number of iterations, iter.max (which defaults to 100), has been exceeded, or the Omega-Squared does not increase.
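The loop described above can be sketched in Python as follows. This is a rough illustration under several assumptions, not Displayr's code: the starting centers are supplied directly instead of coming from a Hartigan-Wong run on complete cases, the distance for partial data is taken to be the mean squared difference over the variables a case has answered, and a weighted within-cluster sum of squares stands in for Omega-Squared as the stopping criterion. All function names are hypothetical.

```python
def partial_distance(row, center):
    """Mean squared difference over the variables observed in `row`.

    Missing values are None. Assumes every case has at least one
    observed value.
    """
    pairs = [(x, m) for x, m in zip(row, center) if x is not None]
    return sum((x - m) ** 2 for x, m in pairs) / len(pairs)


def assign(rows, centers):
    """Assign each case, including cases with missing values, to its most similar cluster."""
    return [
        min(range(len(centers)), key=lambda c: partial_distance(row, centers[c]))
        for row in rows
    ]


def update_centers(rows, weights, assignment, centers):
    """Recompute each center as the weighted mean of its observed values."""
    new_centers = []
    for c, old in enumerate(centers):
        center = []
        for j, old_value in enumerate(old):
            observed = [
                (r[j], w)
                for r, w, a in zip(rows, weights, assignment)
                if a == c and r[j] is not None
            ]
            total_w = sum(w for _, w in observed)
            # Keep the old value if no weighted data is available.
            center.append(
                sum(x * w for x, w in observed) / total_w if total_w > 0 else old_value
            )
        new_centers.append(center)
    return new_centers


def weighted_within_ss(rows, weights, assignment, centers):
    """Stand-in fit measure (lower is better); not Omega-Squared itself."""
    return sum(
        w * partial_distance(r, centers[a])
        for r, w, a in zip(rows, weights, assignment)
    )


def batch_kmeans(rows, weights, initial_centers, iter_max=100):
    centers = [list(c) for c in initial_centers]
    assignment = assign(rows, centers)
    best = weighted_within_ss(rows, weights, assignment, centers)
    for _ in range(iter_max):
        new_centers = update_centers(rows, weights, assignment, centers)
        new_assignment = assign(rows, new_centers)
        fit = weighted_within_ss(rows, weights, new_assignment, new_centers)
        if fit >= best:
            break  # the fit no longer improves
        centers, assignment, best = new_centers, new_assignment, fit
    return assignment, centers


# Hypothetical data: None marks a missing answer; the last case has weight 2.
rows = [
    [1.0, 1.0, 0.0],
    [1.0, None, 0.0],
    [0.0, 0.0, 1.0],
    [None, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]
weights = [1.0, 1.0, 1.0, 1.0, 2.0]
initial_centers = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

assignment, centers = batch_kmeans(rows, weights, initial_centers)
print(assignment)
print(centers)
```

In this example the second value of the first center moves from 1 to 1/3, not 1/2, because the last case counts twice. That is the effect of the weighted center update.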
Next
How to Analyze Data by Groups/Segments
How to Do Latent Class Analysis
How to Create a Segmentation Comparison Table
How to Do Mixed Mode Cluster Analysis in Displayr
How to Save K-Means Cluster Membership
How to Do Hierarchical Cluster Analysis in Displayr