When analyzing numeric data (e.g., Age in years or Income in $), you may want to group the values into categories and analyze those instead of the numeric values themselves. You can use the Convert To > Ordinal automation in Displayr to achieve this (though there are other methods available too). This automation supports various methods of categorizing the data and, if needed, can automatically update the cut-offs for each category based on new data. This article describes how to go from a numeric variable to one that bands these values into buckets (percentiles for the number of hours spent on a home phone in a typical week, in this example).
Requirements
A data set loaded into a Displayr document
Method
There are four different methods available for combining numeric values in this tool:
- Tidy categories divides the range of values into tidy categories, which are intervals of 2, 5, 10, 20, 50, and so on.
- Percentiles divides the range into percentiles. This is useful when you want to create categories which contain even proportions of the cases in your data.
- Equally spaced categories divides the range of values into categories that all have the same range. This is similar to Tidy categories, but has additional options for customization.
- Custom categories divides the range of values into ranges of your choice. This is useful when you want to specify uneven ranges of values.
Automatically Combine Categories into Percentiles
Let's say you want to compute percentiles for the variable How many SMS sent in a typical week.
- Select the variable in the Data Sources tree.
- Hover and select + > Convert To > Ordinal > Percentiles.
- Customize the options in the New Categories and New Labels sections.
- OPTIONAL: If you make any changes in the object inspector, click Calculate to recalculate the variable.
For example, I updated New Categories > Percentages to "25", and if I drag the variable onto the page, the following summary table appears:
Automatically Combine Categories into Equally Spaced Categories
Let's say you want to compute equally spaced categories for the variable How many SMS sent in a typical week.
- Select the variable in the Data Sources tree.
- Hover and click + > Convert To > Ordinal > Equally Spaced Categories.
- Update any options. In this example, we change the New Categories > Number of categories to 3. The default is 2.
- OPTIONAL: Click Calculate if any updates are made.
A new variable will be added to your Data Sources tree.
The results are as follows:
Automatically Divide Categories into Tidy Categories
Let's say you want to compute tidy categories for the variable How many SMS sent in a typical week. With tidy categories, the range of values in each category is always 2, 5, or 10 (or one of these multiplied by 10, 100, and so on), and the same range is used for every new category.
- Select the variable in the Data Sources tree.
- Hover and click + > Convert To > Ordinal > Tidy Categories.
- Select the options you want. In this example, we changed the Target number of categories to 10 and the Label style to Inequality notation.
A new variable will appear in the Data Sources tree. I can drag it onto the page and see the tidied categories:
Note: If the Target number of categories setting is changed to 3 or 5, the resulting categories won't change. This is because the algorithm for finding tidy categories always finds category ranges that are 2, 5, or 10, or multiples of 10 of these, and so some combinations are not always possible.
Automatically Combine Categories into Custom Categories
- Select the variable in the Data Sources tree.
- Hover and click + > Convert To > Ordinal > Custom Categories.
- Select the options you want. In this example, we changed the Cut points to 0,25,50,100 (comma separated) and the Category boundary to Start of range.
A new variable will be added to the Data Sources tree. The summarized results will look like:
Options
The settings that are available for Convert To > Ordinal change depending on which method you choose, and whether the input data are categorical. Settings that are available for all methods are shown first, and method-specific settings are shown below.
Variable(s) Choose which variable(s) are being used to create new categories.
Use numbers found in category labels This option will only appear if there are any categorical variables selected in Variable(s). When this is ticked, the tool will try to identify numeric values from the data labels and use those values when forming categories. If this is not ticked, the tool will ignore the category labels and will instead use the underlying data values for each category.
Labels contain This option will also only appear if there are any categorical variables selected in Variable(s). This option allows you to communicate the nature of the values in the labels:
Single values This causes the tool to assume that your labels always contain single values. Any labels which contain more than one numeric value will not be combined.
Ranges of values This causes the tool to assume your data labels describe ranges of values. The tool will expect to find pairs of labels, with the exception of the lowest and highest values in the range.
Method The method to be used to identify ranges of values to use for creating new categories. The different methods are described above.
Category boundary Whether the value at the start of each range or the value at the end of each range is included in the new category. For example, if dividing the interval up into ranges of 10, the tool can create a category that is 10 to 19 (Start of range) or 11 to 20 (End of range).
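The effect of the two boundary settings can be sketched in a few lines of Python. This is an illustration only, not Displayr's implementation; the function name `assign_category` and the cut points are assumptions made up for the example.

```python
from bisect import bisect_left, bisect_right

def assign_category(value, cuts, boundary="start"):
    """Return the index of the category that `value` falls into.

    cuts     -- sorted cut points, e.g. [0, 10, 20, 30]
    boundary -- "start": a category includes its lower cut point, [lo, hi)
                "end":   a category includes its upper cut point, (lo, hi]
    An index of -1 or len(cuts) - 1 means the value is outside all categories.
    """
    if boundary == "start":
        return bisect_right(cuts, value) - 1
    return bisect_left(cuts, value) - 1

cuts = [0, 10, 20, 30]  # hypothetical cut points
print(assign_category(10, cuts, "start"))  # 1: 10 opens the second category
print(assign_category(10, cuts, "end"))    # 0: 10 closes the first category
```

Only values that sit exactly on a cut point change category between the two settings; everything strictly between two cut points lands in the same place either way.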
Label style Determine the style of the labels for the new categories.
- Tidy labels describes the range of values that is contained within each range in English. For example, 10 to 20.
- Inequality notation describes the range of values that is contained within each range using greater-than and less-than symbols. For example, 10 to <21.
- Interval notation describes the range of values using interval notation. For example, [10, 20) describes a range of values which include the value 10 and range up to 20 but does not include the value of 20. Similarly, (10, 20] describes a range that does not include 10 but includes all values greater than 10, up to and including the value of 20.
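As a small illustration of how interval notation relates to the Category boundary setting, the following Python sketch formats a range both ways. The helper `interval_label` is hypothetical and is not part of Displayr.

```python
def interval_label(lo, hi, boundary="start"):
    """Format a range in interval notation.

    A square bracket marks an end point that is included in the range,
    a round bracket marks one that is excluded.
    """
    if boundary == "start":
        return f"[{lo}, {hi})"
    return f"({lo}, {hi}]"

print(interval_label(10, 20, "start"))  # [10, 20)
print(interval_label(10, 20, "end"))    # (10, 20]
```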
Use open-ended labels When ticked, this setting produces open-ended category labels at the start and end of the range. For example, an open-ended label would be Less than 10, whereas a non-open-ended label would be 0 to 9.
Number prefix / Number suffix Allows you to add text before and after the numeric value in each new label. For example, you may wish to add a dollar sign or other currency symbol in front of each number in the new labels. This is not available if using Ranges of values, where the text to place before and after is drawn from the original category labels.
Decimals in label Choose the number of decimals that are displayed in the new category labels.
Decimals symbol / Thousands symbol These options are only available when there are categorical variables selected in Variable(s). They allow you to communicate the number format of the data labels. For example, the convention in the United States and some other countries is to use a period to mark the decimal position and commas to separate units of thousands, resulting in numbers of the form 10,000.50. In other countries, the convention is reversed, so that the same value is written as 10.000,50.
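To see why the tool needs to be told which symbols are in use, here is a minimal Python sketch that reads a number from a label under either convention. The function `parse_number` and its parameter names are assumptions for illustration, not Displayr settings or APIs.

```python
def parse_number(text, decimals_symbol=".", thousands_symbol=","):
    """Convert a formatted number string to a float.

    The thousands symbol is dropped first, then the decimals symbol is
    replaced with the period that Python's float() expects.
    """
    cleaned = text.replace(thousands_symbol, "").replace(decimals_symbol, ".")
    return float(cleaned)

# Period for decimals, comma for thousands
print(parse_number("10,000.50"))
# Reversed convention: comma for decimals, period for thousands
print(parse_number("10.000,50", decimals_symbol=",", thousands_symbol="."))
```

Both calls yield the same value. Without the symbol settings, the second string would be misread.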
Tidy categories
Target number of categories This option allows you to specify how many categories should be created when using the Tidy categories option. Note that the Tidy categories algorithm always tries to make ranges of values which are intervals of 2, 5, 10, 20, 50, etc, so it will not always be able to achieve the desired number of categories. For more control, you can try using the Equally spaced categories method instead.
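The reason a target cannot always be met can be shown with a rough Python sketch. It assumes a simple rule (pick the tidy width whose category count is closest to the target) that stands in for Displayr's actual algorithm, which is not documented here. The names `TIDY_WIDTHS`, `count_categories`, and `tidy_width` are hypothetical.

```python
import math

# Candidate widths: 2, 5, 10, 20, 50, 100, ... (assumed upper limit for the sketch)
TIDY_WIDTHS = [m * 10 ** e for e in range(0, 6) for m in (2, 5, 10)]

def count_categories(lo, hi, width):
    """Categories of the given width needed to cover lo..hi,
    with the first category starting at a multiple of the width."""
    first_edge = math.floor(lo / width) * width
    return math.floor((hi - first_edge) / width) + 1

def tidy_width(lo, hi, target):
    """Pick the tidy width whose category count is closest to the target."""
    return min(TIDY_WIDTHS, key=lambda w: abs(count_categories(lo, hi, w) - target))

# For values from 0 to 90, targets of 4, 5 and 6 all give the same width
for target in (4, 5, 6, 10):
    width = tidy_width(0, 90, target)
    print(target, width, count_categories(0, 90, width))
```

Because only a handful of widths are allowed, several different targets collapse onto the same set of categories, which is the behavior described above.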
Percentiles
Percentages This option allows you to control how categories are created when using the Percentiles method. You can enter a single number here, to divide the range up into even percentiles. For example, entering 10 will create the 10, 20, 30, etc percentiles. Alternatively, if you don't want even percentiles, you can enter a comma-separated list of percentiles. For example, you can enter 50, 75, 85, 90, 95, 99, 100 to create these uneven percentiles.
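The two ways of filling in Percentages can be sketched in Python as follows. This assumes a linear-interpolation definition of a percentile, which may differ from the one Displayr uses; the functions `percentile` and `percentile_cut_points` are hypothetical names for the example.

```python
def percentile(sorted_values, p):
    """Percentile p (0 to 100) of a sorted list, using linear interpolation."""
    rank = (len(sorted_values) - 1) * p / 100
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction

def percentile_cut_points(values, percentages):
    """Cut points for a single whole number (even steps) or a list of percentiles."""
    if isinstance(percentages, int):
        # A single number such as 25 expands to 25, 50, 75, 100
        percentages = list(range(percentages, 100, percentages)) + [100]
    ordered = sorted(values)
    return [percentile(ordered, p) for p in percentages]

values = list(range(101))  # made-up data: 0, 1, ..., 100
print(percentile_cut_points(values, 25))            # [25.0, 50.0, 75.0, 100.0]
print(percentile_cut_points(values, [50, 75, 90]))  # [50.0, 75.0, 90.0]
```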
Equally spaced categories
Number of categories Specify how many categories will be created. Alternatively, rather than dividing the range into a set number of categories, you can set the Increment option to divide the range into categories that have a set width.
Start point / End point Use these options to determine where you want the range to start and end. This range will then be divided up according to the Number of categories setting, or the Increment setting.
Increment Use this option if you want to specify how wide each interval should be instead of specifying how many categories you wish to create.
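How Start point, End point, Number of categories, and Increment interact can be sketched in Python. This is a simplified illustration under the assumption that exactly one of the two sizing options is used; `equal_edges` is a hypothetical helper, not a Displayr function.

```python
import math

def equal_edges(start, end, number_of_categories=None, increment=None):
    """Return the edges of equally spaced categories between start and end.

    Give exactly one of number_of_categories or increment.
    """
    if (number_of_categories is None) == (increment is None):
        raise ValueError("give exactly one of number_of_categories or increment")
    if increment is None:
        increment = (end - start) / number_of_categories
    else:
        # Enough categories to reach the end point; the last edge may overshoot it
        number_of_categories = math.ceil((end - start) / increment)
    return [start + i * increment for i in range(number_of_categories + 1)]

print(equal_edges(0, 90, number_of_categories=3))  # edges 0, 30, 60, 90
print(equal_edges(0, 90, increment=20))            # edges 0, 20, ..., 100
```

With a fixed number of categories the width follows from the range, whereas with a fixed increment the number of categories follows from the range.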
Custom categories
Cut points Use this field to specify which values should define the new categories.
Always include highest and lowest values When this option is ticked, the entire range of values will always be included in the set of categories even if the highest and lowest values are not entered in Cut points. Turning this option off will result in missing values for any numeric values that fall outside the range of values specified in Cut points. This is useful if you want to exclude values that are too low or too high (e.g. outliers, respondents providing unrealistically high or low values in a survey).
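The effect of this option can be sketched in Python. The sketch assumes Start of range boundaries and assumes that the top category also includes its upper end; the helpers `build_edges` and `assign` and the sample data are made up for illustration and do not reflect Displayr's internals.

```python
from bisect import bisect_right

def build_edges(cut_points, data, include_extremes=True):
    """Sorted category edges, optionally widened to cover all of the data."""
    edges = sorted(cut_points)
    if include_extremes:
        if min(data) < edges[0]:
            edges.insert(0, min(data))
        if max(data) > edges[-1]:
            edges.append(max(data))
    return edges

def assign(value, edges):
    """Return the (lower, upper) range containing value, or None if outside."""
    if value < edges[0] or value > edges[-1]:
        return None  # becomes a missing value
    index = min(bisect_right(edges, value) - 1, len(edges) - 2)
    return (edges[index], edges[index + 1])

data = [3, 25, 60, 100, 140]   # hypothetical values
cut_points = [0, 25, 50, 100]  # the cut points used in the example above

with_extremes = build_edges(cut_points, data, include_extremes=True)
without_extremes = build_edges(cut_points, data, include_extremes=False)
print(assign(140, with_extremes))     # (100, 140)
print(assign(140, without_extremes))  # None
```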
Ranges of values
Number of categories Specify how many categories will be created. When the input data are categorical, and the category labels contain ranges, this tool will combine adjacent ranges into this many categories.
Start of range / End of range Many examples of range categories, like Age or Income questions from surveys, contain ranges that are open-ended at the start and end of the range. For example, an Age question may start with 18 years or younger and end with 65 years and older. If combining such range categories into Equally spaced categories, the algorithm does not know the start and end points for the range of values. These fields allow you to communicate values to use as the highest and lowest in the range. For example, if you conduct a survey, and the youngest respondents who were allowed to complete the survey were 13 years old, entering 13 in Start of range would have the effect of converting 18 years or younger to 13 to 18 from the perspective of finding ranges of values in this algorithm.
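The idea of closing open-ended labels can be sketched in Python. The code assumes whole numbers in the labels and that only the first and last labels are open-ended; `labels_to_ranges`, the sample labels, and the End of range value of 99 are all assumptions for illustration, not Displayr's parsing rules.

```python
import re

def labels_to_ranges(labels, start_of_range, end_of_range):
    """Turn ordered range labels into (low, high) pairs.

    A first or last label with a single number is treated as open-ended
    and closed using start_of_range or end_of_range.
    """
    ranges = []
    for position, label in enumerate(labels):
        numbers = [int(n) for n in re.findall(r"\d+", label)]
        if len(numbers) == 2:
            ranges.append((numbers[0], numbers[1]))
        elif len(numbers) == 1 and position == 0:
            ranges.append((start_of_range, numbers[0]))
        elif len(numbers) == 1 and position == len(labels) - 1:
            ranges.append((numbers[0], end_of_range))
        else:
            ranges.append(None)  # label does not describe a usable range
    return ranges

labels = ["18 years or younger", "19 to 34", "35 to 64", "65 years and older"]
print(labels_to_ranges(labels, start_of_range=13, end_of_range=99))
# [(13, 18), (19, 34), (35, 64), (65, 99)]
```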
Next
How to Automatically Combine Categories - By Pattern (CHAID)