This tool automatically combines categories in one variable based on how similar they are in distribution when compared to another variable. It is used when you have a variable with a large number of categories and you want to combine the categories of that variable by considering the patterns present when compared to a second variable.
For example, consider the following table which shows categories of Kickstarter projects in the columns, and whether the projects were successful, failed, canceled, suspended, or live in the rows:
We may want to create new categories by combining those Kickstarter categories which have the most similar pattern of success and failure. The results are as follows:
The categories Music, Comics, Theatre, and Dance have been combined as those with the highest relative success rate; Food, Crafts, Journalism, Fashion, and Technology have been combined as those with the highest relative failure rate; and the remaining combined categories fall between these two patterns.
The CHAID algorithm is used to obtain the solution. It can optionally be restricted so that only adjacent categories are considered as candidates for combining. When that restriction is in place, the algorithm also supports identifying some categories to be handled differently: those identified as unordered are free to combine with any other category.
Requirements
- A categorical variable with categories you want to combine by considering the patterns present when compared to a second variable. For instance, you may have a variable that contains categories for the occupation of respondents in a survey, and you may want to group those occupations based on those with similar income distributions or age distributions.
Method
Automatically Combine Categories > By Pattern (CHAID) > Any Categories
Use this option if you want each category to be checked against all other possible categories during each CHAID iteration.
- From the Data Sets tree, select the variable whose categories you want to combine and click the sign to the right of the variable.
- From the Insert Variable(s) menu, select Ready-Made New Variables > Automatically Combine Categories > By Pattern (CHAID) > Any Categories.
(As an alternative, you can also select Anything > Data > Variables > New > Ready-Made New Variables > Automatically Combine Categories > By Pattern (CHAID) > Any Categories).
- From the following menu, select the variable you want to use to determine which categories are combined. The categories that are most similar when compared to this second variable are combined.
- Click OK. The new variable will appear immediately below the variable in the Data Sets tree whose categories we are combining (main_category).
- Drag the new variable onto the page.
The results are as follows:
- Crosstab the combined categories variable with the variable you used to identify the patterns.
The results are as follows:
Automatically Combine Categories > By Pattern (CHAID) > Adjacent Categories
Sometimes you may want to allow only adjacent categories, instead of all categories, to be considered for combining. For example, consider the following table which examines people's attitudes on mass transportation spending by their political views. While it may make theoretical sense to combine extreme conservatives with conservatives on the issue, it probably would not make as much sense to combine extreme liberals with extreme conservatives.
- From the Data Sets tree, select the variable whose categories you want to combine and click the sign to the right of the variable.
- From the Insert Variable(s) menu, select Ready-Made New Variables > Automatically Combine Categories > By Pattern (CHAID) > Adjacent Categories.
- Select the secondary variable that you want to use to identify the patterns.
- Click OK.
- Drag the new variable onto the page.
The results are as follows:
- Crosstab the combined categories variable with the variable you used to identify the patterns.
The results are as follows:
Options
To change the default settings, click on the combined categories variable in the Data Sets tree.
The results are as follows:
Variable This is the variable whose categories you wish to combine.
Combine by Choose the approach you wish to use for combining categories. If you do not wish to use CHAID and want to use an alternative approach, you can change this to By Value if combining numeric data, or By Geography if your data contains geographic locations (zip codes, states, cities, etc).
Based on This is the variable that you want to compare with the first Variable above. The categories of the variable selected in Variable will be combined based on the similarity of their distributions of this Based on variable.
Weight Select a weight variable here if you wish to apply the weighted version of CHAID. This will combine categories based on the weighted distributions.
CHAID ALGORITHM SETTINGS
Combine
The option to choose which pairs of categories are permissible to combine. The options are:
- Any categories: It is permissible for each category to combine with any other category.
- Adjacent categories: Each category may only combine with adjacent categories, unless one or more categories are specified in the Unordered categories control. The categories specified in that control are permitted to combine with any other category and are not restricted to adjacent categories.
- Adjacent categories unless missing value code: The same behavior as Adjacent categories, except that any categories coded with a value of NaN in the Value Attributes will always be considered as unordered.
- Using variable set structure: The permissible combine options are determined by the Variable type of the input variable. If the input variable is Categorical then Any categories are permissible to combine. If the input variable is Ordered Categorical then Adjacent categories are permissible to combine.
Unordered categories This control only appears if the Combine option is Adjacent categories or Adjacent categories unless missing value code. It lets you specify particular categories that should be treated as unordered and allowed to combine with any other category. Categories not entered here can only be combined with adjacent ones. This is appropriate if the input variable contains ordinal values on a scale but some options are not ordered. For example, if the categories are 'Strongly agree', 'Agree', 'Neutral', 'Disagree', 'Strongly disagree', "Don't know" and 'I refuse to answer', then "Don't know" and 'I refuse to answer' can be entered here so that they are permitted to combine with any of the other categories. Type the two category labels into this control, separated with a ';' or ','. Note that if Combine is set to Adjacent categories unless missing value code, then any category which is coded with a value of NaN in the Value Attributes will always be considered as unordered.
Use Exhaustive CHAID This controls whether the Exhaustive CHAID algorithm will be used. Exhaustive CHAID will take longer than a standard CHAID because it searches a larger set of category combinations, but it tends to produce a better result. The default value is Usually, which means that Exhaustive CHAID will always be used unless your Variable has so many categories that the exhaustive algorithm is likely to be really slow. If you do have a large number of categories and exhaustive CHAID is not applied, you will receive a message in the top right of your screen. In this case, you can ensure the exhaustive algorithm is applied by changing this setting to Yes.
Minimum category size The CHAID algorithm will not produce new categories which have fewer than this many cases. It will always ensure smaller categories are combined with their most similar category regardless of the statistical significance of that particular combination.
Alpha level to combine categories This is the significance level for combining categories. Each potential pair of categories to be combined is associated with a p-value, and two categories will not be combined if their p-value is lower than this level. This setting is not used in the exhaustive CHAID algorithm (so it will only have an effect if you change Use Exhaustive CHAID to No).
Alpha level to validate final combined categories This is the significance level used to assess the final CHAID solution. If the p-value for the final solution is larger than this value, all of the categories will be combined into a single category because there is insufficient variation between the categories at this level. If you obtain a single category from this feature, consider choosing a different variable in the Based on menu, one that varies more with the main Variable, or increase the value of this setting.
Multiple Comparison adjustment This setting determines whether or not a Bonferroni correction is made when evaluating the final combined category solution. That is, it affects the p-value used to check against the Alpha level to validate final combined categories. This correction will tend to be more conservative when using the exhaustive CHAID algorithm as it conducts a much greater number of statistical tests.
Technical details
CHAID stands for Chi-square automatic interaction detection. It is an algorithm which has traditionally been used to create decision trees with multi-way splits of categorical data. It employs repeated application of Chi-squared tests to evaluate how similar pairs of categories are when compared to a second variable. See Kass, G. V. (1980)^{[1]} and Biggs, D., Ville, B., and Suen, E. (1991)^{[2]} for more details.
The standard CHAID algorithm uses a fixed level of significance to determine if a merge should be conducted, and whether or not to stop merging categories.
The exhaustive CHAID algorithm generates a set of potential solutions by always merging the two least significantly different categories until only two categories remain. It then chooses from all of those solutions by identifying the solution with the smallest p-value.
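The exhaustive search described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used by this tool: it assumes unweighted counts, allows any pair of categories to merge, ignores the minimum category size and the Bonferroni adjustment, and treats the unmerged table as one of the candidate solutions. The function names (`chi2_sf`, `pearson_p`, `exhaustive_merge`) and the example counts are hypothetical.

```python
import math
from itertools import combinations


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with integer df."""
    if x <= 0 or df <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term = total = 1.0
        for k in range(1, df // 2):
            term *= (x / 2) / k
            total += term
        return math.exp(-x / 2) * total
    # Odd df: erfc term plus a finite series.
    term, total = 1.0, 0.0
    for k in range(1, (df - 1) // 2 + 1):
        if k > 1:
            term *= x / (2 * k - 1)
        total += term
    tail = math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * total
    return math.erfc(math.sqrt(x / 2)) + tail


def pearson_p(rows):
    """p-value of the Pearson chi-squared test for a table of counts."""
    col_totals = [sum(col) for col in zip(*rows)]
    n = sum(col_totals)
    stat = 0.0
    for row in rows:
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    used_cols = sum(1 for t in col_totals if t > 0)
    return chi2_sf(stat, (len(rows) - 1) * (used_cols - 1))


def exhaustive_merge(table):
    """table maps a category label to its counts on the second variable."""
    groups = {(label,): list(counts) for label, counts in table.items()}
    candidates = []
    while True:
        # Record the current solution and its overall p-value.
        candidates.append((pearson_p(list(groups.values())), sorted(groups)))
        if len(groups) <= 2:
            break
        # Merge the least significantly different pair (largest p-value).
        a, b = max(combinations(groups, 2),
                   key=lambda pair: pearson_p([groups[pair[0]], groups[pair[1]]]))
        groups[a + b] = [x + y for x, y in zip(groups.pop(a), groups.pop(b))]
    # Choose the solution with the smallest overall p-value.
    return min(candidates, key=lambda c: c[0])


table = {"A": [80, 20], "B": [78, 22], "C": [20, 80], "D": [25, 75]}
best_p, best_groups = exhaustive_merge(table)
print(best_groups)  # [('A', 'B'), ('C', 'D')]
```

In this made-up table, A and B have nearly identical distributions, as do C and D, so the two-group solution has the smallest p-value.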
When weights are used, the second-order survey-weight-adjusted test of independence of Rao and Scott (1984)^{[3]} is used instead of the standard Pearson Chi-squared test.
Bonferroni adjustments
If the Multiple comparison adjustment option is selected, the test that assesses the significance of the final combined categories from the CHAID algorithm is adjusted for the number of pairwise tests conducted while combining categories. This is a Bonferroni-type adjustment, and it is computed differently for the standard and exhaustive CHAID algorithms. The standard algorithm terminates when no pairwise test has a p-value above the significance level, whereas the exhaustive algorithm keeps combining categories until only two remain. From the set of states generated by the exhaustive algorithm, the state with the smallest p-value is considered the optimal configuration and becomes the final combined category solution.
The sections below give details of the Bonferroni adjustments used for the standard CHAID algorithm and the exhaustive CHAID algorithm. Each section gives the adjustment for the Combine option when Any categories may combine and when only Adjacent categories may combine. The latter also has a more refined adjustment for when some Unordered categories are specified. Because categories that may combine with any other category open up more possibilities, a larger Bonferroni adjustment is required in that case. Define the initial number of categories in the variable as \(c\) and the final number of categories in the combined solution as \(r\), where \(c\) and \(r\) are integers with \(1 \le r \le c\). The Bonferroni adjustment is denoted \(B(c,r)\) in each of the scenarios below.
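As a rough illustration of how a multiplier \(B(c,r)\) might enter the final test, the sketch below uses the conventional Bonferroni form: multiply the raw p-value by the multiplier, cap at 1, and compare with the validation alpha level. That this tool applies the multiplier in exactly this way is an assumption; the function names and the default alpha of 0.05 are hypothetical.

```python
def bonferroni_adjusted_p(p_value: float, multiplier: int) -> float:
    """Scale a raw p-value by the multiplier B(c, r), capped at 1."""
    return min(1.0, p_value * multiplier)


def keep_solution(p_value: float, multiplier: int, alpha: float = 0.05) -> bool:
    """True if the adjusted p-value does not exceed the validation alpha level."""
    return bonferroni_adjusted_p(p_value, multiplier) <= alpha


# A raw p-value of 0.01 passes at alpha = 0.05 on its own,
# but not once it is multiplied by B = 7.
print(keep_solution(0.01, 1))  # True
print(keep_solution(0.01, 7))  # False
```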
Standard algorithm
The standard algorithm follows the Bonferroni adjustment approach used in Kass (1980)^{[1]}. Here the adjustment considers the number of possible arrangements from reducing \(c\) categories into \(r\) categories. When all categories are allowed to combine with any other category (Any categories selected in the Combine control), this count is given by a standard result on set partitions. In particular, the Stirling number of the second kind^{[4]}, \(\left\{ \begin{smallmatrix} c\\ r \end{smallmatrix} \right\}\), gives the number of ways to partition a set of \(c\) categories into \(r\) non-empty subsets, and it takes the role of the Bonferroni adjustment value for the case of Any categories:
\[\begin{align} B(c,r) = \left\{ \begin{matrix} c\\ r \end{matrix} \right\} = \frac{1}{r!}\sum_{i = 0}^r (-1)^i\binom{r}{i}(r - i)^c, \qquad \left\{ \begin{matrix} c\\ c \end{matrix} \right\} = 1, \quad \text{ and for } c \ge 1, \quad \left\{ \begin{matrix} c\\ 1 \end{matrix} \right\} = 1.\end{align}\]
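The closed-form sum above is straightforward to evaluate with exact integer arithmetic. The following is a small sketch; the function name `stirling2` is chosen here for illustration.

```python
from math import comb, factorial


def stirling2(c: int, r: int) -> int:
    """Number of ways to partition c labelled categories into r non-empty groups."""
    if r < 0 or r > c:
        return 0
    total = sum((-1) ** i * comb(r, i) * (r - i) ** c for i in range(r + 1))
    return total // factorial(r)  # the alternating sum is always divisible by r!


print(stirling2(4, 2))  # 7
print(stirling2(5, 3))  # 25
```

For example, four categories can be reduced to two groups in 7 ways: four splits of one category against the other three, plus three splits into two pairs.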
When only adjacent categories are permitted to combine (that is, Adjacent categories is selected in Combine, no Unordered categories are specified, and no missing values have been coded), the adjustment is given by:
\[\begin{align} B(c,r) = \binom{c - 1}{r - 1}.\end{align}\]
When Unordered categories are specified and an Adjacent categories combine option is selected (and/or missing values are coded in the case of Adjacent categories unless missing value code), the Bonferroni adjustment becomes a combination of the two above. Assuming there are \(u\) unordered categories (including categories coded as missing), with \(1\le u \lt c\), the Bonferroni adjustment is:
\[\begin{align} B(c, r, u) = \sum_{s = 0}^u \binom{c - u - 1}{r - s - 1}\sum_{i = 0}^{u-s}\binom{u}{i}\left\{ \begin{matrix} u - i\\ s \end{matrix} \right\} (r - s)^i\end{align}\]
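The double sum above can be evaluated directly. The sketch below is an illustration with hypothetical function names; the `binom` helper returns 0 for out-of-range arguments, which is the convention assumed for terms such as \(\binom{c-u-1}{r-s-1}\) when \(r - s - 1\) is negative.

```python
from math import comb, factorial


def stirling2(n: int, k: int) -> int:
    """Stirling number of the second kind, with S(0, 0) = 1."""
    if k < 0 or k > n:
        return 0
    total = sum((-1) ** i * comb(k, i) * (k - i) ** n for i in range(k + 1))
    return total // factorial(k)


def binom(n: int, k: int) -> int:
    """Binomial coefficient that is 0 outside 0 <= k <= n."""
    return comb(n, k) if 0 <= k <= n else 0


def bonferroni_standard(c: int, r: int, u: int) -> int:
    """B(c, r, u) for c initial categories, r final categories, u unordered ones."""
    total = 0
    for s in range(u + 1):
        inner = sum(binom(u, i) * stirling2(u - i, s) * (r - s) ** i
                    for i in range(u - s + 1))
        total += binom(c - u - 1, r - s - 1) * inner
    return total


print(bonferroni_standard(5, 3, 1))  # 12
```

Two quick checks on the formula: setting \(u = 0\) gives \(\binom{c-1}{r-1}\), the purely adjacent case, and setting \(u = 1\) gives \(\binom{c-2}{r-2} + r\binom{c-2}{r-1}\).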
Exhaustive algorithm
The exhaustive algorithm follows the Bonferroni adjustment approach used in Biggs, D., Ville, B., and Suen, E. (1991)^{[2]}. Here the adjustment considers the number of tests conducted as the algorithm traverses from the full set of \(c\) categories down to two categories.
When all categories are allowed to combine with any other category (Any categories selected in the Combine control), the Bonferroni adjustment is:
\[\begin{align} B(c,r) = \sum_{k = 2}^c \binom{k}{2}\end{align}\]
When only adjacent categories are permitted to combine (that is, Adjacent categories is selected in Combine, no Unordered categories are specified, and no missing values have been coded), the adjustment is given by:
\[\begin{align} B(c,r) = \binom{c}{2}.\end{align}\]
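As a worked check of the two exhaustive adjustments, take \(c = 5\) initial categories. The reading used here is an interpretation of the formulas rather than something stated in the references: a stage with \(k\) categories contributes \(\binom{k}{2}\) candidate pairs when any categories may combine, and \(k - 1\) candidate pairs when only adjacent ones may.

\[\begin{align} \text{Any categories:}\quad \sum_{k=2}^{5}\binom{k}{2} &= 1 + 3 + 6 + 10 = 20 = \binom{6}{3}, \\ \text{Adjacent categories:}\quad \sum_{k=2}^{5}(k-1) &= 1 + 2 + 3 + 4 = 10 = \binom{5}{2}.\end{align}\]

In general, \(\sum_{k=2}^{c}\binom{k}{2} = \binom{c+1}{3}\), so the Any categories adjustment grows roughly with the cube of \(c\), while the Adjacent categories adjustment grows with its square. Neither depends on \(r\).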
When Unordered categories are specified and an Adjacent categories combine option is selected (and/or missing values are coded in the case of Adjacent categories unless missing value code), the Bonferroni adjustment becomes a combination of the two above. Assuming there are \(u\) unordered categories (including categories coded as missing), with \(1\le u \lt c\), the Bonferroni adjustment is:
\[\begin{align} B(c, r, u) = \binom{c - u}{2} + \sum_{i = 0}^{u - 1}\frac{c - i}{2} \left( 2 c - u - 1 - i\right).\end{align}\]
Differences to SPSS CHAID
When the exhaustive CHAID algorithm evaluates very small p-values, the SPSS algorithm can in some cases stop searching for solutions earlier than the one available here. As a result, the algorithm we use here will tend to find solutions that are more significant than those produced in SPSS. The result is that the algorithm used here will combine more categories. This situation tends to arise when there is a very high level of significance between the two variables before the algorithm begins.
In some cases, the exhaustive CHAID algorithm can encounter two possible category merges which have equal p-values, which we refer to as a tie. This algorithm will attempt to break the tie by re-examining these merges within the larger set of categories at that stage of the algorithm (i.e. given the current set of merges that have happened so far). SPSS has not documented the mechanism that their algorithm uses to break ties. Such ties are rare in practice as they require identical test statistics.
Next
How to Automatically Combine Categories - By Value
How to Automatically Combine Categories - By Geography
How to Recode into Existing or New Variables
How to Recode Variables Using Category Midpoints
How to Recode Values (Capping) in Numeric Variables
How to Recode Low Values (Capping) in Numeric Variables
How to Recode Numeric Variable(s) from Code/Category Midpoints
References
- Kass, G. V. (1980). An Exploratory Technique for Investigating Large Quantities of Categorical Data. Applied Statistics, 29, 2, 119-127. doi: https://doi.org/10.2307/2986296
- Biggs, D., de Ville, B., and Suen, E. (1991). A Method of Choosing Multiway Partitions for Classification and Decision Trees. Journal of Applied Statistics, 18, 1, 49-62. doi: https://doi.org/10.1080/02664769100000005
- Rao, J. N. K. and Scott, A. J. (1984). On Chi-Squared Tests for Multiway Contingency Tables with Cell Proportions Estimated from Survey Data. The Annals of Statistics, 12, 1, 46-60. doi: https://doi.org/10.1214/aos/1176346391
- Stirling Numbers of the second kind (2022). Retrieved June 9, 2022, from https://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind