Introduction
This tool automatically combines the categories of one variable based on how similar their distributions are when compared to a second variable. It is useful when a variable has a large number of categories and you want to reduce them to a smaller set of combined categories that show similar patterns.
For example, consider the following table which shows categories of Kickstarter projects in the columns, and whether the projects were successful, failed, cancelled, suspended, or live in the rows:
We may want to create new categories by combining those Kickstarter categories which have the most similar pattern of success and failure. The results are as follows:
The categories Music, Comics, Theatre, and Dance have been combined as those with the highest relative success rate. Food, Crafts, Journalism, Fashion, and Technology have been combined as the categories with the highest relative failure rate. The remaining combined categories show patterns in between these two.
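The "pattern" being compared is simply each category's distribution over the second variable. The following is a minimal sketch of that idea, not the tool's own code. The function name `outcome_profiles` and the records are hypothetical and are not real Kickstarter data.

```python
from collections import Counter, defaultdict

def outcome_profiles(records):
    """Return {category: {state: proportion}} from (category, state) pairs."""
    counts = defaultdict(Counter)
    for category, state in records:
        counts[category][state] += 1
    profiles = {}
    for category, states in counts.items():
        total = sum(states.values())
        profiles[category] = {state: n / total for state, n in states.items()}
    return profiles

# Hypothetical records for illustration only.
records = [
    ("Music", "successful"), ("Music", "successful"), ("Music", "failed"),
    ("Dance", "successful"), ("Dance", "successful"), ("Dance", "failed"),
    ("Food", "failed"), ("Food", "failed"), ("Food", "successful"),
]

profiles = outcome_profiles(records)
for category, profile in profiles.items():
    print(category, profile)
```

In this toy data Music and Dance have identical profiles, so they are natural candidates for combining, while Food has a different one.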
Requirements
- A categorical variable with categories you want to combine by considering the patterns present when compared to a second variable. For instance, you may have a variable that contains categories for the occupation of respondents in a survey, and you may want to group those occupations based on those with similar income distributions or age distributions.
Method
Automatically Combine Categories > By Pattern (CHAID) > Any Categories
Use this option if you want each category to be checked against all other possible categories during each CHAID iteration.
- From the Data Sets tree, select the variable whose categories you want to combine and click the + sign to the right of the variable
- From the Insert Variable(s) menu, select Ready-Made New Variables > Automatically Combine Categories > By Pattern (CHAID) > Any Categories
(As an alternative, you can also select Anything > Data > Variables > New > Ready-Made New Variables > Automatically Combine Categories > By Pattern (CHAID) > Any Categories.)
- From the following menu, select the variable you want to use to determine which categories are combined. The categories that are most similar when compared to the second variable are combined.
- Click OK. The new variable will appear in the Data Sets tree immediately below the variable whose categories are being combined (main_category in this example).
- Drag the new variable onto the page
The results are as follows:
- Crosstab the combined categories variable with the variable you used to identify the patterns.
The results are as follows:
Automatically Combine Categories > By Pattern (CHAID) > Adjacent Categories
Sometimes you may want to allow only adjacent categories, rather than all categories, to be considered for combining. For example, consider the following table, which examines people's attitudes to mass transportation spending by their political views. While it may make theoretical sense to combine extreme conservatives with conservatives on the issue, it probably would not make as much sense to combine extreme liberals with extreme conservatives.
- From the Data Sets tree, select the variable whose categories you want to combine and click the + sign to the right of the variable
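The difference between the two options comes down to which pairs of categories are eligible to be merged. Here is a rough sketch of that distinction. The function `candidate_pairs`, its `mode` argument and the category labels are all hypothetical and are not part of the product.

```python
from itertools import combinations

def candidate_pairs(categories, mode):
    """List the pairs of categories that may be merged under each option."""
    if mode == "adjacent":
        # Only neighbours in the category order are eligible.
        return list(zip(categories, categories[1:]))
    if mode == "any":
        # Every pair of categories is eligible.
        return list(combinations(categories, 2))
    raise ValueError(f"unknown mode: {mode!r}")

# Hypothetical ordered labels for the political views variable.
views = ["Extremely liberal", "Liberal", "Moderate",
         "Conservative", "Extremely conservative"]

adjacent = candidate_pairs(views, "adjacent")
any_pairs = candidate_pairs(views, "any")
print(len(adjacent), "adjacent pairs;", len(any_pairs), "pairs in total")
```

With adjacent categories only, the two extremes can never be merged directly. They can end up together only if every category between them has been merged first.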
- From the Insert Variable(s) menu, select Ready-Made New Variables > Automatically Combine Categories > By Pattern (CHAID) > Adjacent Categories
- Select the secondary variable that you want to use to identify the patterns
- Click OK.
- Drag the new variable onto the page
The results are as follows:
- Crosstab the combined categories variable with the variable you used to identify the patterns.
The results are as follows:
OPTIONS
To change the default settings, click on the combined categories variable in the Data Sets tree.
The results are as follows:
Variable: This is the variable whose categories you wish to combine.
Combine by: Choose the approach you wish to use for combining categories. If you do not wish to use CHAID, you can change this to By Value if you are combining numeric data, or By Geography if your data contains geographic locations (zip codes, states, cities, etc.).
Based on: This is the variable that you want to compare with the Variable above. The categories of the variable selected in Variable will be combined based on how similar their distributions are across this Based on variable.
Weight: Select a weight variable here if you wish to apply the weighted version of CHAID. Categories will then be combined based on the weighted distributions.
CHAID ALGORITHM SETTINGS
Combine: Whether a category can be combined only with adjacent categories or with any other category. If this is set to Adjacent categories, only adjacent categories are considered for combining. If this is set to Any categories, each category is checked against all other possible categories during each CHAID iteration. If Using variable set structure is chosen, the behaviour is deduced from the variable structure: only adjacent categories are considered for an ordinal variable, while any categories are considered for a nominal variable.
Use Exhaustive CHAID: This controls whether the Exhaustive CHAID algorithm will be used. Exhaustive CHAID takes longer than standard CHAID because it searches a larger set of category combinations, but it tends to produce a better result. The default value is Usually, which means that Exhaustive CHAID will always be used unless your Variable has so many categories that the exhaustive algorithm is likely to be very slow. If you do have a large number of categories and Exhaustive CHAID is not applied, you will receive a message in the top right of your screen. In this case you can ensure the exhaustive algorithm is applied by changing this setting to Yes.
Minimum category size: The CHAID algorithm will not produce new categories which have fewer than this many cases. It will always ensure smaller categories are combined with their most similar category, regardless of the statistical significance of that particular combination.
Alpha level to combine categories: This is the significance level for combining categories. Each potential pair of categories to be combined is associated with a p-value, and two categories will not be combined if their p-value is lower than this level. This setting is not used in the Exhaustive CHAID algorithm (so it will only have an effect if you change Use Exhaustive CHAID to No).
Alpha level to validate final combined categories: This is the significance level used to assess the final CHAID solution. If the p-value for the final solution is larger than this value, all of the categories will be combined into a single category because there is insufficient variation between the categories at this level. If you obtain a single category from this feature, consider using a different selection in the Based on menu which has a greater level of variation with the main Variable, or increase the value of this setting.
Multiple Comparison adjustment: This setting determines whether or not a Bonferroni correction is made when evaluating the final combined category solution. That is, it affects the p-value that is checked against the Alpha level to validate final combined categories. This correction will tend to be more conservative when using the Exhaustive CHAID algorithm, as it conducts a much greater number of statistical tests.
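To illustrate how the last two settings interact, here is a minimal sketch of a generic Bonferroni correction followed by the validation check. How the tool counts the number of tests is not documented here, so `n_tests` is simply a parameter. The function names and the default `alpha=0.05` are assumptions for the example.

```python
def bonferroni_adjust(p_value, n_tests):
    """Bonferroni-adjusted p-value: multiply by the number of tests, capped at 1."""
    return min(1.0, p_value * n_tests)

def keep_solution(p_value, n_tests, alpha=0.05, adjust=True):
    """Return True if the final solution passes validation.

    False corresponds to collapsing everything into a single category.
    """
    checked = bonferroni_adjust(p_value, n_tests) if adjust else p_value
    return checked <= alpha

# The same raw p-value passes with few tests and fails with many.
for n_tests in (3, 10):
    print(n_tests, "tests:", keep_solution(0.01, n_tests))
```

This is why a solution that passes with the adjustment turned off, or with a small number of tests, can be collapsed into a single category when many tests have been conducted.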
Technical details
CHAID stands for Chi-square Automatic Interaction Detection. It is an algorithm which has traditionally been used to create decision trees with multi-way splits of categorical data. It repeatedly applies Chi-squared tests to evaluate how similar pairs of categories are when compared to a second variable. See Kass, G. V. (1980)[1] and Biggs, D., Ville, B., and Suen, E. (1991)[2] for more details.
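As an illustration of a single pairwise comparison, the sketch below computes the Pearson Chi-squared statistic for two categories. It is an unweighted toy version, not the tool's implementation. The function `pair_chi2` and the counts are hypothetical. The p-value shortcut `exp(-stat / 2)` is exact only when there are 2 degrees of freedom, which holds here because there are three outcomes.

```python
import math

def pair_chi2(a, b):
    """Pearson Chi-squared statistic and degrees of freedom for two rows of counts."""
    col_totals = [x + y for x, y in zip(a, b)]
    n = sum(col_totals)
    stat = 0.0
    for row in (a, b):
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            if col_total == 0:
                continue  # an outcome that neither category has adds nothing
            expected = row_total * col_total / n
            stat += (observed - expected) ** 2 / expected
    df = sum(1 for t in col_totals if t > 0) - 1
    return stat, df

# Hypothetical counts of (successful, failed, cancelled) projects.
music, theatre, food = [50, 40, 10], [55, 35, 10], [20, 70, 10]

for name, other in (("Theatre", theatre), ("Food", food)):
    stat, df = pair_chi2(music, other)
    p_value = math.exp(-stat / 2)  # upper-tail probability, valid only for df == 2
    print(f"Music vs {name}: chi2={stat:.2f}, df={df}, p={p_value:.4f}")
```

A large p-value means the two categories have similar distributions and are good candidates for merging. A small p-value means they differ.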
The standard CHAID algorithm uses a fixed level of significance to determine if a merge should be conducted, and whether or not to stop merging categories.
The exhaustive CHAID algorithm generates a set of potential solutions by always merging the two least significantly different categories until only two categories remain. It then chooses from all of those solutions by identifying the solution with the smallest p-value.
When weights are used, we use the Second Order Rao-Scott Test of Independence [3] instead of the standard Chi-squared test.
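The two merging strategies described above can be sketched roughly as follows. This is a simplified, unweighted illustration and not the tool's implementation. It ignores the minimum category size, the final validation step, Bonferroni adjustment and tie-breaking. It assumes every category has at least one case, and it treats the unmerged starting table as one of the candidate solutions. All function names, the `alpha_merge=0.05` default and the counts are hypothetical.

```python
import math
from itertools import combinations

def chi2_sf(x, df):
    """Upper-tail probability of a Chi-squared variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        term = total = 1.0
        for r in range(1, df // 2):
            term *= (x / 2) / r
            total += term
        return math.exp(-x / 2) * total
    chi = math.sqrt(x)
    term = chi
    total = chi if df >= 3 else 0.0
    for r in range(2, (df - 1) // 2 + 1):
        term *= x / (2 * r - 1)
        total += term
    return math.erfc(chi / math.sqrt(2)) + math.sqrt(2 / math.pi) * math.exp(-x / 2) * total

def p_value(rows):
    """Pearson Chi-squared p-value for a table given as rows of counts."""
    col_totals = [sum(col) for col in zip(*rows)]
    n = sum(col_totals)
    stat = 0.0
    for row in rows:
        for observed, col_total in zip(row, col_totals):
            if col_total:
                expected = sum(row) * col_total / n
                stat += (observed - expected) ** 2 / expected
    df = (len(rows) - 1) * (sum(1 for t in col_totals if t) - 1)
    return chi2_sf(stat, df) if df > 0 else 1.0

def least_different_pair(groups):
    """Return (p, a, b) for the pair of groups with the largest pairwise p-value."""
    return max((p_value([groups[a], groups[b]]), a, b)
               for a, b in combinations(groups, 2))

def merge(groups, a, b):
    """Return a new dict in which groups a and b are replaced by their union."""
    merged = {k: v for k, v in groups.items() if k not in (a, b)}
    merged[a + b] = [x + y for x, y in zip(groups[a], groups[b])]
    return merged

def standard_chaid(groups, alpha_merge=0.05):
    """Merge the most similar pair until every remaining pair differs at alpha_merge."""
    while len(groups) > 1:
        p, a, b = least_different_pair(groups)
        if p < alpha_merge:
            break
        groups = merge(groups, a, b)
    return groups

def exhaustive_chaid(groups):
    """Merge down to two groups, then return the solution with the smallest p-value."""
    best, best_p = groups, p_value(list(groups.values()))
    while len(groups) > 2:
        _, a, b = least_different_pair(groups)
        groups = merge(groups, a, b)
        p = p_value(list(groups.values()))
        if p < best_p:
            best, best_p = groups, p
    return best, best_p

# Hypothetical counts of (successful, failed, cancelled) projects per category.
counts = {
    ("Music",): [50, 40, 10], ("Theatre",): [55, 35, 10],
    ("Food",): [20, 70, 10], ("Crafts",): [22, 68, 10],
    ("Games",): [70, 110, 20],
}

print("standard:  ", list(standard_chaid(counts)))
print("exhaustive:", list(exhaustive_chaid(counts)[0]))
```

Groups are keyed by tuples of the original category names, so a merged group's key records which categories it contains.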
Differences to SPSS CHAID
When the exhaustive CHAID algorithm evaluates very small p-values, the SPSS algorithm can in some cases stop searching for solutions earlier than the one available here. As a result, the algorithm we use here will tend to find solutions that are more significant than those produced in SPSS. The result is that the algorithm used here will combine more categories. This situation tends to arise when there is a very high level of significance between the two variables before the algorithm begins.
In some cases, the exhaustive CHAID algorithm can encounter two possible category merges which have equal p-values, which we refer to as a tie. This algorithm will attempt to break the tie by re-examining these merges within the larger set of categories at that stage of the algorithm (i.e., given the set of merges that have happened so far). SPSS has not documented the mechanism that its algorithm uses to break ties. Such ties are rare in practice as they require identical test statistics.
Next
How to Automatically Combine Categories By Value
How to Automatically Combine Categories - By Geography