This article describes how to use a subset of text data that was classified using Displayr's text categorization tool to train Automatic Categorization of Unstructured Text Data, so that the existing categorization is used to predict categories for the remaining text.
Requirements
You will need the following:
- A Text variable. Text variables are represented by an A next to the variable in the Data Sources tree.
- A Nominal or Binary-Multi variable set in the same data set as the text variable to use as the categories to train the analysis. See How to Classify Text Data or How to Manually Classify Text Data for more information.
- Note, this training variable set should contain all the themes (or categories) you want in your final result, but only needs a subset of the responses classified. This could be a random selection of the data or the first wave of a tracking study, for example.
- You cannot have a theme named "NA".
- See Additional Details for more information on how much text data should be classified in the existing categorization.
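The naming requirement above can be checked on an exported list of theme labels before the analysis is run. The following is a minimal Python sketch, not part of Displayr: the function name `invalid_theme_names` and the sample labels are hypothetical, and it assumes that only the exact label "NA" (ignoring surrounding spaces) causes a problem.

```python
def invalid_theme_names(themes):
    """Return the theme labels that are exactly "NA" once spaces are trimmed.

    Assumption: only the exact label "NA" is a problem. Displayr's own
    check may be stricter or looser than this.
    """
    return [t for t in themes if t.strip() == "NA"]


# Hypothetical theme labels exported from the training variable set.
themes = ["Price", "Service", "NA", "Quality"]
print(invalid_theme_names(themes))
```

Any label the function returns would need to be renamed in the training variable set before it is used.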
Method
- Go to Anything > Advanced Analysis > Text Analysis > Automatic Categorization > Unstructured Text from the toolbar.
- From Data > Data Source > Text variable, select the original, uncategorized text variable in your data set.
- In Data > Categories > Existing categorization, select the variable set that was created using Displayr's text categorization tool.
- Click Calculate if Calculate automatically is not already ticked.
The output shows how many responses were automatically categorized by the model (the Predicted column) and how accurate those predictions are when compared with your originally classified responses (the Accuracy column).
Save variables from categorization
Save your automatic categorization into a variable set to use in tables and other outputs by following the steps below.
- Select the automatic categorization output on the page.
- Click Data > Save Variable(s) > Categories or First Category from the object inspector.
- Categories: Save variables to the data set containing the categories. Where there are multiple input variables, multiple sets of variables are added for each.
- First category: Save a variable to the data set containing the first category mentioned. Where there are multiple input categories, the first category of each will be saved as a separate variable.
NOTE: The variables created when using Save Variable(s) > Categories or First Category may become invalid if the output changes, either because the input text variable was modified or because the input settings were modified. If this happens, delete and recreate the variables.
Additional Details
There's no hard and fast rule on how much text data should be classified manually or using AI-assistance first. It comes down to the specifics of the text that's being classified and how much the responses vary. For example, a brand list will be much simpler to automatically categorize into a list of brands as the text responses vary less and the categories are more straightforward. Full, unstructured text responses will require more classified cases, particularly if the responses are detailed or vary more between cases.
In general, you can use these rules of thumb:
- Every category that should be present in the final automatic categorization must be included in the existing manual/AI-assisted classification.
- Each theme should have a representative amount of data classified into it. What's "representative" can be tough to pin down precisely, but the responses included in each theme should represent the breadth of responses in the original text data for that theme.
- Very roughly, this means about 20-30% of the text responses should be classified, although this will vary from categorization to categorization.
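To see where an existing categorization sits relative to the rough 20-30% guide, the share of classified responses can be computed from exported data. This is a minimal Python sketch under stated assumptions: the function name `classified_share` and the data layout (one set of theme names per response, empty when the response is not yet classified) are hypothetical and are not a Displayr export format.

```python
def classified_share(assignments):
    """Fraction of responses that have at least one theme assigned.

    `assignments` holds one entry per response: a set of theme names,
    which is empty when the response has not been classified yet.
    """
    if not assignments:
        return 0.0
    classified = sum(1 for themes in assignments if themes)
    return classified / len(assignments)


# Hypothetical data: 10 responses, 3 of them classified so far.
assignments = [{"Price"}, {"Service", "Price"}, {"Quality"}] + [set()] * 7
share = classified_share(assignments)
print(f"{share:.0%} of responses classified")
```

The result is only a starting point: a low share can be enough for simple text such as brand lists, while detailed responses may need more.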
Using an existing categorization to train the automatic categorization tool is often an iterative process. Tuning the tool will likely take a few rounds of manual or AI-assisted categorizing, running the automatic process, and checking the results. This is essentially training the model for text data the tool has never seen before, so it may involve a bit of back and forth.
The machine learning algorithm may not be able to fit every response into one of the existing categories. In those instances, the response will not be classified. To train on the existing categorization, the algorithm has to compare text that has been coded into a category with text that has not, so each theme needs at least one response classified into it, but not all of them.
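The per-theme condition described above can be checked on exported training data before running the analysis. The sketch below is a hypothetical Python illustration, not Displayr's own validation: the function name `themes_unusable_for_training` and the data layout are assumptions, and it counts only responses that already have at least one theme assigned.

```python
def themes_unusable_for_training(assignments, themes):
    """Map each problem theme to a short description of what is wrong.

    A theme is flagged when none of the classified responses are in it,
    or when every classified response is in it.
    Assumption: a response counts as classified if its set is non-empty.
    """
    classified = [a for a in assignments if a]
    problems = {}
    for theme in themes:
        count = sum(1 for a in classified if theme in a)
        if count == 0:
            problems[theme] = "no responses classified into this theme"
        elif count == len(classified):
            problems[theme] = "all classified responses are in this theme"
    return problems


# Hypothetical data: two classified responses and one unclassified.
themes = ["Price", "Service", "Quality"]
assignments = [{"Price"}, {"Price", "Service"}, set()]
for theme, issue in themes_unusable_for_training(assignments, themes).items():
    print(f"{theme}: {issue}")
```

In this sample, "Price" and "Quality" are flagged, while "Service" has some classified responses in it and some outside it.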
To capture those responses, first examine which responses haven't been classified. Then refine the existing classification by adding new themes with a representative sample of the text responses classified into them, by classifying more text responses into the existing themes to train the automatic categorization with a more diverse data set, or by a combination of both. There is no blanket threshold, but the more responses you classify into a theme, the better the model will be able to fit the data and "predict" which unclassified text should go into it.
Refining the existing categorization with a wider range of text responses and re-running the automatic categorization can often be an iterative process as you train the model to classify all of the responses correctly.