This article describes how to extract categories from text data with a list-like format, with or without delimiters (e.g., commas) separating the items. For example, you may have spontaneous/unaided awareness data in a format like: SBUX, Pete's, caribou. When analyzing the data, you want to be sure that respondents who mention SBUX and those who mention Starbucks are both categorized as mentioning Starbucks in the final variable. The List of Items feature automatically identifies the items in a list and groups them, along with their variations, into a final list of items. You can also set up a workflow to update this categorization when the data is updated.
Using this feature, you can go from uncategorized verbatim (raw) text responses listing things to a variable set that classifies those items (and any of their misspellings) into a master item list that you can use in your analyses.
This article is broken into the following sections:
- Create List of Items output
- Review the List of Items output
- Refine category variants
- Saving categories into a variable
- Combining saved categories variables
- Technical Details
Requirements
- A Displayr document.
- One or more Text variables. Text variables are represented by an "A" next to the variable name.
- See Finding the Best Text Analysis for your Data to confirm this is the best solution for you, given your data and desired outputs.
Create List of Items output
- Select the text variable that you wish to classify in the Data Sources tree.
- In the object inspector, select Data > Text Analysis > Categorize list of items.
- An output will be created in your document.
- OPTIONAL: Add other text variables that you'd like to use in the same list of items categorization in the object inspector > Data Sources > Text variable(s) dropdown.
Review the List of Items output
The initial output usually needs some tweaking before the list of categories is final. The example below shows a list of items categorization output for a survey question asking which mobile phone company is used. The Categories section, expanded below, shows a table of the categories on the left and the raw and transformed text on the right. Each category is distinguished by a unique shading and shape, and the raw text going into that category has a matching format on the right. You'll notice that both AT&T and Att are listed as categories; these can be combined into AT&T by modifying some of the settings as described later.
The Diagnostics section at the bottom of the picture, which is collapsed by default but can be expanded, shows diagnostic information for each processing step. You can expand each section to see a description of what it contains and how to use this information to refine your analysis. Variant suggestions is typically the most useful of these and is explained in detail later.
It's a good idea to also review the object inspector > Data settings that are used by the algorithm to come up with categories. These include fields like:
- Minimum category size - Anything that does not meet this threshold will be classified as "Unclassified". It is set to 1 by default so that all items are classified, but you can set that to a higher number if you want to be more conservative as to what gets categorized.
- The Delimiters / Split Text section has options for common delimiters like semicolons and commas, but you can also specify your own in Other. There are also settings as to how to separate items when there are no delimiters.
- The Spelling Correction section is where you can turn off automatic spelling correction or specify how you want spelling to be corrected. If you want certain misspellings to NOT be corrected, you can add those here as well.
- The Categories to Discard section is where you can Add categories to discard, if needed.
Any changes to the above will automatically recalculate the output so you can see how your modifications affect the results.
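To make the effect of these settings more concrete, here is a minimal Python sketch of the general idea: split each response on delimiters, count the resulting items, drop discarded categories, and label anything below the minimum category size as "Unclassified". This is only an illustration and not Displayr's algorithm; the function names and parameters (`split_items`, `categorize`, `min_category_size`, `discard`) are hypothetical, and spelling correction is left out.

```python
import re
from collections import Counter

def split_items(response, delimiters=",;"):
    """Split one raw response into trimmed, non-empty items."""
    pattern = "[" + re.escape(delimiters) + "]"
    return [part.strip() for part in re.split(pattern, response) if part.strip()]

def categorize(responses, delimiters=",;", min_category_size=1, discard=()):
    """Return one list of category labels per response (hypothetical helper)."""
    split = [split_items(r, delimiters) for r in responses]
    # Count items case-insensitively across all responses.
    counts = Counter(item.lower() for items in split for item in items)
    discard_keys = {d.lower() for d in discard}
    result = []
    for items in split:
        labels = []
        for item in items:
            key = item.lower()
            if key in discard_keys:
                continue  # discarded categories are dropped entirely
            # Items rarer than the threshold fall into "Unclassified".
            labels.append(key if counts[key] >= min_category_size else "Unclassified")
        result.append(labels)
    return result

responses = ["Verizon, AT&T", "verizon; Sprint", "none"]
print(categorize(responses, min_category_size=2, discard=["none"]))
# [['verizon', 'Unclassified'], ['verizon', 'Unclassified'], []]
```

With a minimum size of 1 (the default described above), every item keeps its own category; raising it to 2 moves the items mentioned only once into "Unclassified".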
Refine category variants
In the Diagnostics section, there is a Variant suggestions section (see picture below). This is a table listing all the current categories and the items from the raw text that go into them, such as Alltdl and Alltell being categorized as Alltel. You can select the table of variants in the output, copy it, and paste it into the Required Categories > Add required phrases or variants window in the object inspector. This ensures those variants and categories are always mapped together when the output is recalculated, even if changing other settings would otherwise change the results.
You can also use the Add required phrases or variants button to combine categories manually. Notice in the Variant suggestions above AT, AT&T, and Atnt are listed as separate categories in the first column, but all are surely referring to AT&T. You can add the text from the AT and Atnt rows as variants to AT&T in order to group those categories and variants into AT&T. To combine categories:
- In the object inspector, click Required Categories > Add required phrases or variants.
- In the first column, specify the final category you want to use, in this case AT&T.
- In each column to the right of that category, paste in each variant that you want to map to this category. In this example, this includes the variants on the AT&T row in Variant suggestions as well as the text in the AT and Atnt rows (see below).
- Click OK to close the editor and Calculate if Calculate automatically is not ticked.
The results in the output will automatically update based on the changes above. You can see below that the Frequency (1245) and Variants (50) of the AT&T Category have increased from the initial calculation (of 806 and 19, respectively) in the Review the List of Items output section above.
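Conceptually, the required phrases or variants table is a lookup from each variant to its final category. The sketch below illustrates that idea in Python, assuming case-insensitive matching; it is not how Displayr implements the feature, and the names `required`, `build_lookup`, and `apply_required` are hypothetical.

```python
from collections import Counter

# Hypothetical table: each final category followed by its variants,
# mirroring the rows described in the steps above.
required = {
    "AT&T": ["AT", "Atnt", "Att"],
    "Alltel": ["Alltdl", "Alltell"],
}

def build_lookup(required):
    """Map every category name and variant (case-insensitively) to its category."""
    lookup = {}
    for category, variants in required.items():
        for text in [category, *variants]:
            lookup[text.casefold()] = category
    return lookup

def apply_required(items, lookup):
    """Replace known variants with their category; leave other items unchanged."""
    return [lookup.get(item.casefold(), item) for item in items]

lookup = build_lookup(required)
items = ["Att", "AT&T", "Alltdl", "Atnt", "Verizon"]
mapped = apply_required(items, lookup)
print(mapped)           # ['AT&T', 'AT&T', 'Alltel', 'AT&T', 'Verizon']
print(Counter(mapped))  # AT&T now counts 3 items instead of 1
```

Merging the AT and Atnt rows into AT&T raises both the frequency and the number of variants for that category, which is the same effect described for the output above.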
Saving categories into a variable
Once your list of items categorization is complete, you can save the categories into a variable set to use in further analyses. IMPORTANT: if your list of categories changes later, your saved variable set will become invalid, and you will have to resave into a new variable set and relink that to any analyses. If you plan on updating your text response data at a later date or expect your list of items to change in the future, and you are only categorizing a single text variable, it's recommended that you set up a workflow to manage categorizing the list of items in the Text Categorization module. See the Create QCodes File documentation for how to import your saved category variable(s) into the Text Categorization module.
- Select the List of Items output, and from the object inspector, go to Data > Save Variable(s).
- OPTIONAL: Adjust the Maximum number of unique categories to save.
- OPTIONAL: Adjust the Maximum number of categories per case to save.
- Click one of the following options:
- Categories - Save all categories classified from each response. This will create a Binary-Multi (Compact) variable set with a variable for each Phrase/Item found in each case. Where there are multiple input variables, a separate set of variables is added for each one.
- First category - Save a Nominal variable to the data set containing just the first category mentioned. Where there are multiple input variables, the first category of each will be saved as a separate variable.
These new variable sets can then be used to create tables and further outputs. For example, clicking Categories creates a new variable set with all the categories classified from each response.
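The difference between the two save options can be sketched in Python as follows. This is an illustration of the resulting data shapes only, assuming one saved column per mention position; `save_categories`, `save_first_category`, and `max_per_case` are hypothetical names, and `None` stands in for a missing value.

```python
def save_categories(per_case, max_per_case=3):
    """One column per mention position, padded or truncated to max_per_case."""
    return [
        (cats[:max_per_case] + [None] * max_per_case)[:max_per_case]
        for cats in per_case
    ]

def save_first_category(per_case):
    """A single column holding only the first category of each case."""
    return [cats[0] if cats else None for cats in per_case]

# Categories already extracted for three cases (the last one mentioned nothing).
per_case = [["Starbucks", "Caribou"], ["Starbucks"], []]

print(save_categories(per_case, max_per_case=2))
# [['Starbucks', 'Caribou'], ['Starbucks', None], [None, None]]
print(save_first_category(per_case))
# ['Starbucks', 'Starbucks', None]
```

In this sketch, `max_per_case` plays the role of the Maximum number of categories per case to save setting: mentions beyond that position are not kept.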
Combining saved categories variables
You can combine saved categories from a list of items where you've used multiple text variables in the analysis or with other variable sets in your data set (if you've partially categorized a list of items elsewhere).
For categories from multiple text variables
If you used multiple text variables, such as multiple spontaneous awareness variables, as inputs when creating your list of items categorization:
When you Save Variables from the output, a variable for each input variable is created:
However, you may want to show these variable sets as a single list of categories in a table:
Follow these steps:
- Select the first "Categories from..." variable in the Data Sources tree.
- From the object inspector, update the Structure to Nominal - Multi.
- Repeat Steps 1-2 for the remaining "Categories from..." variable(s).
- Select the "Categories from..." variables in the Data Sources tree.
- Hover and click + > Convert To > Binary - Multi > Values Greater Than 0 (formerly named Binary Variable(s)).
- Click Yes or No, depending on whether you'd like to handle missing data in the variables.
- A new Binary-Multi variable set will appear just below the original variables, containing the combined data.
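The result of these steps can be pictured with a small Python sketch: several "Categories from..." tables, each holding one row of mentions per case, are collapsed into a single table with one 0/1 column per category. This is an assumption-laden illustration rather than what Displayr runs internally; `to_binary_multi` is a hypothetical name, `None` stands in for missing values, and the Yes/No missing-data choice is not modeled.

```python
def to_binary_multi(variable_sets):
    """Collapse several mention tables into one 0/1 table per category.

    variable_sets: list of tables; each table is a list of rows (one per
    case), and each row is a list of category labels or None.
    """
    n_cases = len(variable_sets[0])
    categories = sorted(
        {v for table in variable_sets for row in table for v in row if v is not None}
    )
    combined = []
    for i in range(n_cases):
        # A case is "in" a category if any mention in any table names it.
        mentioned = {v for table in variable_sets for v in table[i] if v is not None}
        combined.append({c: int(c in mentioned) for c in categories})
    return categories, combined

# Two hypothetical input variables, three cases each.
from_q1 = [["Starbucks", None], ["Caribou", "Starbucks"], [None, None]]
from_q2 = [["Caribou", None], [None, None], ["Peet's", None]]

categories, combined = to_binary_multi([from_q1, from_q2])
print(categories)   # ['Caribou', "Peet's", 'Starbucks']
print(combined[2])  # {'Caribou': 0, "Peet's": 1, 'Starbucks': 0}
```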
If you'd instead like to see all of the variable sets broken out by the variable number (e.g., Mention 1, Mention 2, etc.):
You will need to create binary variables for each variable set, and then follow the steps in How to Combine Separate Questions into a Grid in Displayr, paying close attention to the labels and order of the variables.
For categories from a different categorization variable set
Sometimes you may manually categorize some or all of the text outside of the list of items output. You can combine that with the categories from the list of items to analyze all the results together. Note that this applies only to text responses with multiple themes. To do so:
- Ensure you have a Binary-Multi structured variable set with the previous classifications.
- Follow the instructions in For categories from multiple text variables to convert your List of items variable set(s) into a single Binary-Multi structure.
- In the Data Sources pane, select both Binary-Multi variable sets and click + > Calculate Across Variables > Maximum. This will take the maximum value of each matching variable/category between the sets (so if a response is in a category in either set it will be classified in that category in the final variable set), while including all unmatched variables between the sets as well.
You will end up with a final Binary-Multi variable set listing all the categories from both categorizations, and each response will be assigned to a category if it was assigned to that category in either variable set.
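The Maximum calculation described above can be sketched in Python like this. It is a minimal illustration, assuming each variable set is a list of per-case dictionaries of 0/1 values and that a category missing from one set is treated as 0 there; `maximum_across` is a hypothetical name, not a Displayr function.

```python
def maximum_across(set_a, set_b):
    """Element-wise maximum of two 0/1 tables, keeping unmatched categories."""
    if len(set_a) != len(set_b):
        raise ValueError("both sets must have one row per case")
    # Keep every category from either set, in first-seen order.
    categories = list(dict.fromkeys(c for row in [*set_a, *set_b] for c in row))
    return [
        {c: max(row_a.get(c, 0), row_b.get(c, 0)) for c in categories}
        for row_a, row_b in zip(set_a, set_b)
    ]

# Hypothetical manual coding and list-of-items coding for two cases.
manual = [{"AT&T": 1, "Verizon": 0}, {"AT&T": 0, "Verizon": 0}]
automatic = [{"AT&T": 0, "Sprint": 1}, {"AT&T": 1, "Sprint": 0}]

print(maximum_across(manual, automatic))
# [{'AT&T': 1, 'Verizon': 0, 'Sprint': 1}, {'AT&T': 1, 'Verizon': 0, 'Sprint': 0}]
```

Because the values are 0 or 1, taking the maximum is equivalent to a logical OR: a case is in a category if either coding placed it there.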
Technical Details
The variables created by using Save Variable(s) > Categories or First Category will become invalid and need to be deleted and recreated if the output has changed, either because the input text variable(s) were modified or updated or because the input settings were changed. If the list of final categories has not changed or the number of categories in each response has not exceeded the number of Phrases saved, you can modify the underlying R code for each variable to ignore this restriction:
Comment out lines 3-7 (the if() and stop() calls). If the conditions above do not hold, bypassing this check can cause problems and errors with the structure of the variable.
Next
Finding the Best Text Analysis for Your Data
How to Tidy Categories When Automatically Classifying Into an Item List
Text Analysis - Create QCodes File
How To Automatically Classify Unstructured Text Data
How To Automatically Classify Unstructured Text Data Into an Entity List