This article describes how to use a subset of manually or semi-automatically categorized text data...
... to train Automatic Categorization of Unstructured Text Analysis to work smarter:
You will need the following:
- A Text variable to categorize. Text variables are marked with a small "a" next to the variable in the Data Sets tree.
- A Nominal variable or Binary-Multi variable set, in the same data set as the text variable, to use as the categories that train the analysis. Note that this training variable (or variable set) should contain all the categories you want in your final result, but it only needs to categorize a subset of the responses. It also cannot contain a category named NA.
- Once you have manually or semi-automatically categorized a subset of text data (this could be a random selection of the data or the first wave of a tracking study), go to Anything > Advanced Analysis > Text Analysis > Automatic Categorization > Unstructured Text from the toolbar.
- See Additional Details for more information on how much text data should be categorized in the existing categorization.
- From Inputs > DATA SOURCE > Text variable, select the original, uncategorized text variable. It is marked with a small "a" next to the variable in the Data Sets tree.
- In Inputs > CATEGORIES > Existing categorization, select the variable that was created during the manual or semi-automatic categorization step that has already been performed.
- Click Calculate if Automatic is not already ticked.
- The output shows how many responses the model automatically placed in each category (the Predicted column) and how accurate those predictions are when compared against your originally categorized responses (the Accuracy column).
- Save your categories into a variable set to use in tables and other outputs by selecting the automatic categorization output and clicking Inputs > SAVE VARIABLE(S) > Categories or First Category from the object inspector.
- Categories: Save variables to the data set containing the categories. Where there are multiple input variables, multiple sets of variables are added for each.
- First category: Save a variable to the data set containing the first category mentioned. Where there are multiple input categories, the first category of each will be saved as a separate variable.
NOTE: The variables created using SAVE VARIABLE(S) > Categories and First Category may become invalid if the output changes, either because the input text variable was modified or because the input settings were modified. If that happens, delete and recreate them.
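As a rough illustration of what the Predicted and Accuracy columns summarize, the sketch below counts how many responses a model placed in each category and how often its decision agrees with the hand coding. This is one plausible reading of those columns, not the tool's actual calculation; the function name `per_category_summary` and the data layout are hypothetical.

```python
def per_category_summary(manual, predicted, categories):
    """Return {category: (predicted_count, accuracy)}.

    manual:    {response_id: set of categories}, hand-coded responses only
    predicted: {response_id: set of categories}, every response
    (Both layouts are assumptions made for this sketch.)
    """
    summary = {}
    for cat in categories:
        # Predicted: number of responses the model placed in this category.
        predicted_count = sum(1 for cats in predicted.values() if cat in cats)
        # Accuracy: share of hand-coded responses where the model's yes/no
        # decision for this category matches the hand coding.
        agree = sum(
            1
            for rid, cats in manual.items()
            if (cat in cats) == (cat in predicted.get(rid, set()))
        )
        accuracy = agree / len(manual) if manual else None
        summary[cat] = (predicted_count, accuracy)
    return summary


# Three hand-coded responses, five responses in total.
manual = {1: {"Price"}, 2: {"Service"}, 3: {"Price", "Service"}}
predicted = {
    1: {"Price"},
    2: {"Price"},
    3: {"Price", "Service"},
    4: {"Service"},
    5: set(),
}
summary = per_category_summary(manual, predicted, ["Price", "Service"])
for cat, (count, acc) in summary.items():
    print(cat, count, round(acc, 2))
```

A low accuracy for a category is a hint that the hand-coded examples for it do not yet cover the range of wording found in the data.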
There's no hard and fast rule on how much text data should be categorized manually or semi-automatically first. It comes down to the specifics of the text that's being categorized and how much the responses vary. For example, a brand list will be much simpler to automatically categorize into a list of brands as the text responses vary less and the categories are more straightforward. Full, unstructured text responses will require more cases, particularly if the responses are detailed or vary more between cases.
In general, you can use these rules of thumb:
- Every category that should be present in the final automatic categorization must be included in the existing manual/semi-automatic categorization.
- Each category should have a representative amount of data categorized into it. What's "representative" can be tough to pin down precisely, but the responses included in each category should cover the breadth of responses in the original text data for that category.
- Very roughly, this means about 20-30% of the text responses should be categorized, although this will vary from categorization to categorization.
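Before running the automatic categorization, it can help to sanity-check the existing categorization against these rules of thumb and the requirements stated elsewhere in this article (no category named NA, and at least one but not all responses in each category). The sketch below assumes the coding has been exported into a plain Python dictionary; `check_training_variable` and its arguments are hypothetical names, not part of the tool.

```python
def check_training_variable(coded, total_responses, categories):
    """Return (coverage, problems) for an existing categorization.

    coded:           {response_id: set of categories}, hand-coded responses only
    total_responses: number of responses in the text variable
    categories:      every category wanted in the final result
    """
    problems = []
    if "NA" in categories:
        problems.append("a category is named NA")
    for cat in categories:
        n = sum(1 for cats in coded.values() if cat in cats)
        if n == 0:
            problems.append(f"{cat!r} has no coded responses")
        elif n == len(coded):
            # Assumption: "not all responses" is read here as
            # "not all of the hand-coded responses".
            problems.append(f"{cat!r} contains every coded response")
    # Share of responses that have been hand coded (rough guide: 0.2 to 0.3).
    coverage = len(coded) / total_responses
    return coverage, problems


coded = {1: {"Price"}, 2: {"Service"}, 3: {"Price"}}
coverage, problems = check_training_variable(
    coded, total_responses=10, categories=["Price", "Service", "Quality"]
)
print(f"coverage: {coverage:.0%}")
print(problems)
```

The coverage figure is only a guide: simple text such as brand lists may need less coding, and detailed free text may need more.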
Using an existing categorization to train the automatic categorization tool is very often an iterative process. Tuning the tool may take a few rounds of manually or semi-automatically categorizing, running the automatic process, and checking the results. This is essentially training the model on text data the tool has never seen before, so it may involve a bit of back and forth.
Not every response will necessarily fit into one of the existing categories according to the model. When that happens, the response is left uncategorized. To train itself on the existing categorization, the algorithm has to compare text that has and has not been coded into a category, so each category needs at least one, but not all, of the responses coded into it.
To capture those responses, first examine which responses haven't been categorized. Then refine the existing categorization in one of three ways: add new categories with a representative sample of the text responses categorized into them, categorize more text responses into the existing categories so the automatic categorization trains on a more diverse data set, or do both. There is no blanket threshold, but the more responses you categorize into a category, the better the model will be able to fit the data and "predict" which uncategorized text should go into it.
Refining the existing categorization with a wider range of text responses and re-running the automatic categorization can often be an iterative process as you train the model to categorize all of the responses correctly.
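The "examine which responses haven't been categorized" step can be sketched outside the tool as follows, assuming the text and the saved categories have been exported into Python dictionaries. The helper names `uncategorized` and `common_words` are hypothetical, and the word count is just a crude way to spot themes that may deserve a new category.

```python
from collections import Counter


def uncategorized(responses, predicted):
    """Return {response_id: text} for responses with no category assigned.

    responses: {response_id: text}
    predicted: {response_id: set of categories}
    """
    return {rid: text for rid, text in responses.items() if not predicted.get(rid)}


def common_words(texts, top=5):
    """Count longer words across the given texts to hint at missing themes."""
    words = Counter()
    for text in texts:
        words.update(w for w in text.lower().split() if len(w) > 3)
    return words.most_common(top)


responses = {
    1: "Too expensive for what you get",
    2: "Staff were friendly",
    3: "Delivery arrived late",
    4: "Late delivery again",
}
predicted = {1: {"Price"}, 2: {"Service"}, 3: set()}  # response 4 has no entry

leftover = uncategorized(responses, predicted)
print(leftover)
print(common_words(leftover.values()))
```

Responses that share vocabulary, like the two delivery comments here, are candidates for a new category in the next round of manual coding.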