This article describes how to use a subset of manually or semi-automatically categorized text data...
... to train Automatic Categorization of Unstructured Text Analysis to work smarter:
You will need the following:
- A Text variable containing the responses to categorize. Text variables are represented by a small "a" next to the variable in the Data Sets tree.
- A Nominal variable or Binary-Multi variable set in the same data set as the text variable to use as the categories to train the analysis. Note that this training variable (or variable set) should contain every category you want in your final result, but it only needs to categorize a subset of the responses.
- Once you have manually or semi-automatically categorized a subset of text data (this could be a random selection of the data or the first wave of a tracking study), go to Anything > Advanced Analysis > Text Analysis > Automatic Categorization > Unstructured Text from the toolbar.
- From Inputs > DATA SOURCE > Text variable, select the original, uncategorized text variable from your Data Sets tree. It is marked with a small "a".
- In Inputs > CATEGORIES > Existing categorization, select the variable (or variable set) that was created when you manually or semi-automatically categorized the subset.
- Click Calculate if Automatic is not already ticked.
- The output shows how many responses the model categorized automatically (the Predicted column) and how accurate those predictions are, judged against your originally categorized responses (the Accuracy column):
- Save your categories into a variable set to use in tables and other outputs by selecting the automatic categorization output and clicking Inputs > SAVE VARIABLE(S) > Categories from the object inspector.
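To make the two columns concrete, here is a minimal Python sketch of how a Predicted count and an Accuracy figure could be tallied outside the tool. It is not the product's own calculation: the `summarize` function, the `manual` and `predicted` field names, and the definition of accuracy (agreement between the model and the hand coding, measured only on hand-coded responses) are all assumptions made for illustration.

```python
def summarize(responses, categories):
    """Return {category: (predicted_count, accuracy)}.

    Hypothetical data layout: each response is a dict with
      "manual":    set of categories assigned by hand, or None if not coded
      "predicted": set of categories assigned by the model
    """
    summary = {}
    # Accuracy can only be judged on responses that were coded by hand.
    coded = [r for r in responses if r["manual"] is not None]
    for cat in categories:
        # Number of responses the model placed into this category.
        predicted_count = sum(cat in r["predicted"] for r in responses)
        # Assumed definition: share of hand-coded responses where the model
        # and the hand coding agree on membership of this category.
        agree = sum((cat in r["predicted"]) == (cat in r["manual"]) for r in coded)
        accuracy = agree / len(coded) if coded else None
        summary[cat] = (predicted_count, accuracy)
    return summary


responses = [
    {"manual": {"Price"}, "predicted": {"Price"}},
    {"manual": {"Service"}, "predicted": {"Service"}},
    {"manual": {"Price"}, "predicted": set()},
    {"manual": None, "predicted": {"Price"}},
    {"manual": None, "predicted": set()},
]
summary = summarize(responses, ["Price", "Service"])
```

In this made-up data, the model misses one hand-coded Price response, so Price scores lower than Service even though more responses were predicted into it.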
Note the following:
Based on the machine learning, not every response may fit into one of the existing categories. Such responses are left uncategorized. To train itself on the existing categorization, the algorithm has to compare text that has been coded into a category with text that has not, so each category needs at least one response coded into it, but not all of them.
To capture those responses, first examine which responses haven't been categorized. Then refine the existing categorization by adding new categories (each with a representative sample of text responses coded into it), by coding more text responses into the existing categories so that the automatic categorization is trained on a more diverse data set, or by a combination of both. There is no blanket threshold for how many responses to code, but the more responses you code into a category, the better the model can fit the data and predict which uncategorized text belongs in it.
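If you export the text and the saved categories for review, the first step above can be sketched in a few lines of Python. This is only an illustration under assumed inputs: `review_uncategorized`, `coded_counts`, and the list-of-sets layout are hypothetical names and structures, not part of the product.

```python
from collections import Counter


def review_uncategorized(texts, predicted):
    """Return the texts that the model left without any category."""
    return [text for text, cats in zip(texts, predicted) if not cats]


def coded_counts(manual_codes, categories):
    """Count hand-coded responses per category, smallest first.

    Categories near the top of the result have the fewest training
    examples and are candidates for more manual coding.
    """
    counts = Counter(cat for cats in manual_codes for cat in cats)
    return sorted(((cat, counts[cat]) for cat in categories),
                  key=lambda pair: pair[1])


# Made-up example data: one set of categories per response.
texts = ["Too expensive", "Staff were rude",
         "Parking is hard to find", "Great value"]
predicted = [{"Price"}, {"Service"}, set(), {"Price"}]
manual_codes = [{"Price"}, {"Service"}, set(), set()]

to_review = review_uncategorized(texts, predicted)
thin_first = coded_counts(manual_codes, ["Price", "Service", "Location"])
```

Here the parking comment is the only response left for review, and the hypothetical Location category has no hand-coded examples yet, which suggests where additional manual coding would help most.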
Refining the existing categorization with a wider range of text responses and re-running the automatic categorization is often an iterative process as you train the model to categorize all of the responses correctly.
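Before each re-run, it can help to confirm the requirement noted above: every category needs at least one response coded into it, but not all of them. The following Python sketch checks this on exported codes. The function name `find_untrainable` and the data layout (one set of categories per response, empty for uncoded responses) are assumptions for illustration only.

```python
def find_untrainable(codes, categories):
    """Return {category: reason} for categories the model cannot learn from.

    codes: list with one set of hand-assigned categories per response
           (an empty set means the response has not been coded).
    """
    problems = {}
    total = len(codes)
    for cat in categories:
        n = sum(cat in c for c in codes)
        if n == 0:
            # No examples of text that belongs in the category.
            problems[cat] = "no responses coded into this category"
        elif n == total:
            # No contrasting examples of text that does not belong.
            problems[cat] = "all responses coded into this category"
    return problems


# Made-up example data.
codes = [{"Price", "Any comment"}, {"Any comment"},
         {"Any comment"}, {"Any comment"}]
problems = find_untrainable(codes, ["Price", "Service", "Any comment"])
```

In this example, Service has no coded responses and "Any comment" covers every response, so both would need attention before training, while Price passes the check.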