How to Assign Respondents to Clusters/Segments in a New Data File – Displayr Help

Sometimes you may wish to create segments of your respondents and use them to classify respondents in a different survey or in a later wave of your tracker. You can, in essence, reuse your original segmentation model to classify respondents in the new data into those segments. This allows you to build a segmentation using your preliminary data and use it to classify all respondents in the final version of the dataset. Or you can create a segmentation using one survey/dataset and use it to classify totally different respondents in a different dataset (as long as it meets the requirements below).

This article covers

Method - Using Latent Class Analysis
Method - Using R-based predictive models

Requirements

A document with one of the following types of segmentation models:
- Latent Class Analysis
- Trees
- k-means
- Mixture Models for Regression
- Most machine learning models - such as Random Forest
A new dataset with variables that correspond to the original variables used in the model. See the Additional Notes section below on how to create a new predictive model if your new data set does not have corresponding variables.
There are further data requirements based on what analysis you originally created:
- For all analyses:
  - Different models may have different data requirements. To ensure your data works with all models, we recommend coding the new variables so that each category's label and value match those of the corresponding category in the original variable (e.g., if Category 1 has a value of 1 in the original data, the corresponding category in the new variable must be labeled "Category 1" and given a value of 1). Labels need to be consistent between the Value Attributes and the summary table for the variable.
  - If there are new codes or missing data for a non-Latent Class Analysis, see Technical Details below.
- Additional requirements for Latent Class Analysis include:
  - The reference Names of the corresponding variables must match in the new data file if using these analyses.
  - The code frame must match exactly; in addition to labels and values matching, all codes must be present in the value attributes, even if there were no respondents in the original data.
  - Because there's always a random element in the algorithm, the order of the categories of the variables used must be exactly the same. That is, when you create a summary table for each variable, the row order must match that of the original variable.
  - There must be at least 1 respondent in each category that was included before. That is, if you use a variable set with a "Don't know" category, for example, where there was 1 respondent who selected it in the original dataset, you also need to have at least 1 respondent for that category in the new dataset.
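
The code-frame requirements above can be checked before you swap in the new data file. The following is a minimal sketch in plain R, assuming each variable is available as an R factor; check_code_frame, q1_original, and q1_new are hypothetical names invented for this illustration and are not part of Displayr.

```r
# Hypothetical helper (not a Displayr function): compare the code frame of an
# original variable with the corresponding variable in the new data.
check_code_frame <- function(original, new) {
  list(
    # Same set of category labels, regardless of order
    same_labels = setequal(levels(original), levels(new)),
    # Same labels in exactly the same order
    same_order = identical(levels(original), levels(new)),
    # Categories with respondents originally but none in the new data
    empty_in_new = setdiff(levels(droplevels(original)),
                           levels(droplevels(new)))
  )
}

# Toy data standing in for the same question in two data files
q1_original <- factor(c("Agree", "Disagree", "Don't know", "Agree"),
                      levels = c("Agree", "Disagree", "Don't know"))
q1_new <- factor(c("Disagree", "Agree", "Agree"),
                 levels = c("Agree", "Disagree", "Don't know"))

result <- check_code_frame(q1_original, q1_new)
print(result)
```

In this toy example the labels and their order match, but "Don't know" has no respondents in the new data, which would violate the last Latent Class Analysis requirement.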

Method - Using Latent Class Analysis

If your segmentation was not R-based (for example, Latent Class Analysis), your data set will contain an automatically created variable named Latent Class Analysis that holds the segment memberships. This variable is updated when the data is updated. For example, a three-segment latent class solution for a sample size of 400 is shown below. To allocate people in a new data file using these segments:

Select your data file in the Data Sources tree and, in the object inspector under Data, select Update to update your document with the new data file (note all variable requirements above).
You will notice that the latent class analysis will show an error:

To keep the same segmentation that you initially created, do not regrow the tree.
The Latent Class Analysis variable in the project that shows segment membership has now automatically updated, allocating people in the new data file to the segments based on the original model created.

Method - Using R-based predictive models

The output below is from a multinomial logit (MNL) model (select Advanced Analysis > Regression > Multinomial Logit from the toolbar or in the Report tree) that predicts segment membership based on firmographics.

The goal is to now predict segment membership in a new data file containing the same firmographic predictor variables.

Select the model output and, in the object inspector, make sure that Calculate automatically is unchecked. Otherwise, your model will automatically be rebuilt using the new data.
OPTIONAL: You can copy this to a different Document to use it in a different Report, if needed.
Get the new dataset into Displayr. You can either:
1. Update your current data set with the new data set. This is recommended if you are replacing the data with different versions or if you don't need the old one, and your variables are named the same:
  - In the Data Sources tree, select your Data Set name.
  - In the object inspector, click Update.
2. Or import the new dataset as another data source. This is recommended if you have other things in your report that you want to keep as-is based on your original data set, or if your data sets are dissimilar:
  - In the Data Sources tree, click the icon at the top.
  - When prompted, select the appropriate import method (see How to Import Data Into Displayr).
In the data set with the new data, hover and select + > Custom Code > R > Nominal.

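In the Code Editor, paste in the code below. In this example, the model is named glm, and the new variables Q1 and S2 correspond to the original variables q1 and q2: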

### Specify which model to use
# Replace glm below with the Name of your model
themodel = glm

### Create a table of the raw data to use for prediction
# Replace the contents of data.frame() with the new variables, listed in the
# same order as the variables in the model, and assign each one the old
# variable Name, like: OldVariable = NewVariable
# Note: if the names are the same, you can remove the "= NewVariable" part
thenewdata = data.frame(q1 = Q1,
                        q2 = S2,
                        q3,
                        q4,
                        q5)

### Use the predict function to predict the outcome with the new data
# You can use use.names = TRUE to see the outcome labels instead of the group
# number, but all groups must be present in the data to do this
predict(themodel, newdata = thenewdata, use.names = FALSE)

In the object inspector, under Data, select Labels to access the Value Attributes, enter any labels you want, and press OK.
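
The renaming step in the steps above matters because prediction functions in R look up predictors by name. The following is a minimal, self-contained sketch of the same rename, predict, and label pattern using base R's glm on invented data. It is an illustration only: a Displayr model output is a different kind of object, and toy_model, new_file, and the segment labels are assumptions made up for this example.

```r
# Toy "original" data: a two-segment outcome and two predictors (all invented).
set.seed(1)
original <- data.frame(q1 = rnorm(100), q2 = rnorm(100))
original$segment <- rbinom(100, 1, plogis(original$q1 - original$q2))

# Stand-in for the saved model; this is base R glm, not a Displayr output.
toy_model <- glm(segment ~ q1 + q2, data = original, family = binomial)

# Toy "new" data file in which the same questions are named Q1 and S2.
new_file <- data.frame(Q1 = rnorm(5), S2 = rnorm(5))

# Rename as OldVariable = NewVariable so predict() can find q1 and q2.
thenewdata <- data.frame(q1 = new_file$Q1, q2 = new_file$S2)

# Predicted probability of segment 1, then a hard assignment at 0.5.
prob <- predict(toy_model, newdata = thenewdata, type = "response")
assigned <- ifelse(prob > 0.5, 1, 0)

# Attach readable labels to the numeric codes (labels are invented).
segments <- factor(assigned, levels = c(0, 1),
                   labels = c("Segment A", "Segment B"))
print(segments)
```

The final factor() call plays the same role as entering labels in the Value Attributes: the predictions are stored as numbers, and the labels are attached afterwards.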

Additional Notes

Often, the survey you want to segment does not contain all of the questions used to create the original segmentation. In this case, you can build a predictive model that predicts segment membership (after the segments have been created using one of the original segmentation models). Instead of including all of the original variables used to create the segmentation as predictors, you can include either:

A completely different set of variables that are available in the original data set and the new data set. For example, demographics or other variables from a customer database that are not specific to a survey/data set.
A subset of the variables used to create the segments. Tip: if you are building a predictive model based on exactly the same variables as used to create segments, you are making a mistake and should instead use one of the methods above.
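
As a rough sketch of the first option, the code below finds the variables that exist in both data files and fits a model on just those. It uses base R's glm with two invented segments to stay self-contained; with more than two segments you would need a multinomial model such as the MNL described earlier. All data and names here (original, new_file, employees, revenue, region) are hypothetical.

```r
# Toy data files: 'original' has the segments, 'new_file' does not.
set.seed(2)
original <- data.frame(
  segment   = factor(sample(c("A", "B"), 200, replace = TRUE)),
  employees = rnorm(200, 50, 10),
  revenue   = rnorm(200, 100, 20),
  q1        = rnorm(200)  # survey question that is missing from new_file
)
new_file <- data.frame(
  employees = rnorm(10, 50, 10),
  revenue   = rnorm(10, 100, 20),
  region    = sample(c("North", "South"), 10, replace = TRUE)
)

# Predictors must exist in both files; the outcome itself is excluded.
shared <- intersect(setdiff(names(original), "segment"), names(new_file))

# Build segment ~ employees + revenue and fit a two-segment logit.
model_formula <- reformulate(shared, response = "segment")
fit <- glm(model_formula, data = original, family = binomial)

# glm models the probability of the second factor level ("B").
prob_b <- predict(fit, newdata = new_file, type = "response")
predicted <- ifelse(prob_b > 0.5, "B", "A")
print(shared)
```

Because the toy segments are random, the fitted model has no real predictive power; the point is only the workflow of restricting predictors to the shared variables.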

Technical Details

If the variables in the new data set have missing values or new categories that weren't originally in the model, different algorithms will handle this differently.

For K-means:

Missing values will still get a prediction.
The Value in the Value Attributes for new categories will be used in the model to assign a cluster.

For Random Forest:

Cases with missing values or new categories are not predicted. However, you can relabel new categories to one of the old ones to have the model treat those values the same.
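
Before predicting, it can help to see which cases have missing values or new categories, and to relabel where that makes sense. The following is a minimal sketch in plain R on invented data; the variable names and the assumption that "Sth" means "South" are made up for illustration.

```r
# Toy predictor as seen by the original model and in the new data file.
original_region <- factor(c("North", "South", "North", "East"))
new_region <- factor(c("North", "Sth", NA, "West", "East"))

# Flag cases that have a missing value or a category the model never saw.
is_missing  <- is.na(new_region)
is_new_code <- !is_missing & !(new_region %in% levels(original_region))
print(table(missing = is_missing, new_code = is_new_code))

# Relabel a new category to an existing one ("Sth" is assumed to mean "South").
recoded <- as.character(new_region)
recoded[which(recoded == "Sth")] <- "South"

# Rebuild the factor on the original levels; unmatched values become NA.
recoded <- factor(recoded, levels = levels(original_region))
print(recoded)
```

After recoding, only the originally missing case and the unmatched "West" case remain NA, so those are the cases a model that skips missing values and new categories would leave unpredicted.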

How to Do K-means Cluster Analysis

How to Save K-Means Cluster Membership
