This article describes how to use a subset of manually or semi-automatically categorized text data to train Automatic Categorization of Unstructured Text Analysis to work smarter.
You will need the following:
- a Text variable on which to perform manual coding. Text variables are shown with a small "a" next to the variable in the Data Sets tree.
- a subset of manually or semi-automatically categorized text data.
- Once you have manually or semi-automatically categorized a subset of text data (this could be a random selection of the data or the first wave of a tracking study), go to Anything > Advanced Analysis > Text Analysis > Automatic Categorization > Unstructured Text from the toolbar.
- From Inputs > DATA SOURCE > Text variable, select the original, uncategorized text variable, which is marked with a small "a" in the Data Sets tree.
- In Inputs > CATEGORIES > Existing categorization, select the variable that was created during the manual or semi-automatic categorization.
- Click Calculate if Automatic is not already ticked.
- The output shows how many responses the model categorized automatically (the Predicted column) and how accurate those predictions are when compared with your originally categorized responses (the Accuracy column).
- Save your categories into a variable set to use in tables and other outputs by selecting the automatic categorization output and clicking Inputs > SAVE VARIABLE(S) > Categories from the object inspector.
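The idea behind these steps can be illustrated outside the product with a small sketch. The Python below is not the algorithm the Automatic Categorization feature uses; it is a minimal word-count (naive Bayes) classifier, written only to show how a categorized subset can train a model that then categorizes the remaining responses. All names (train, predict, labeled, unlabeled) and the sample responses are hypothetical, and the accuracy here is measured on the same responses used for training, which is a simplification.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and split on whitespace; a real tool would do far more cleaning.
    return text.lower().split()


def train(labeled):
    """Count words per category from (text, category) pairs."""
    word_counts = defaultdict(Counter)
    cat_counts = Counter()
    for text, cat in labeled:
        cat_counts[cat] += 1
        word_counts[cat].update(tokenize(text))
    vocab = {tok for counts in word_counts.values() for tok in counts}
    return word_counts, cat_counts, vocab


def predict(model, text):
    """Return the category with the highest smoothed log-probability."""
    word_counts, cat_counts, vocab = model
    total_docs = sum(cat_counts.values())
    # Words never seen in training carry no information, so skip them.
    tokens = [tok for tok in tokenize(text) if tok in vocab]
    scores = {}
    for cat, n_docs in cat_counts.items():
        total_words = sum(word_counts[cat].values())
        score = math.log(n_docs / total_docs)
        for tok in tokens:
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            score += math.log((word_counts[cat][tok] + 1) / (total_words + len(vocab)))
        scores[cat] = score
    return max(scores, key=scores.get)


# Hypothetical data: a hand-categorized subset and the remaining responses.
labeled = [
    ("price too high", "Price"),
    ("very expensive price", "Price"),
    ("friendly staff", "Service"),
    ("staff were helpful", "Service"),
]
unlabeled = ["the price is expensive", "helpful friendly people"]

model = train(labeled)

# Rough analogue of the Predicted column: responses assigned to each category.
predicted = Counter(predict(model, text) for text in unlabeled)

# Rough analogue of the Accuracy column: agreement with the existing categorization.
correct = sum(predict(model, text) == cat for text, cat in labeled)
accuracy = correct / len(labeled)

print(dict(predicted), accuracy)
```

The sketch mirrors the inputs above: labeled plays the role of the existing categorization, and unlabeled plays the role of the uncategorized responses in the text variable.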
Note the following:
The model may not be able to fit every response into one of the existing categories. When that happens, the response is left uncategorized.
To capture those responses, first examine which responses haven't been categorized. Then refine the existing categorization in one or both of the following ways: add new categories, each with a representative sample of text responses categorized into it, or categorize more text responses into the existing categories so that the automatic categorization is trained on a more diverse data set.
Refining the existing categorization with a wider range of text responses and re-running the automatic categorization can often be an iterative process as you train the model to categorize all of the responses correctly.
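
The refine-and-re-run loop described in these notes can be sketched in a few lines of Python. This is an illustration only, assuming a deliberately simple keyword-overlap classifier that abstains when a response shares no words with any category; it is not how the product decides to leave a response uncategorized. The function names, the min_overlap parameter, and the sample responses are all hypothetical.

```python
from collections import defaultdict


def train(labeled):
    """Collect the set of words seen in each category's categorized examples."""
    words = defaultdict(set)
    for text, cat in labeled:
        words[cat].update(text.lower().split())
    return words


def categorize(words, text, min_overlap=1):
    """Return the best-matching category, or None if nothing matches well enough."""
    tokens = set(text.lower().split())
    scores = {cat: len(tokens & vocab) for cat, vocab in words.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_overlap else None


def uncategorized(words, responses):
    """List the responses the model could not place in any category."""
    return [r for r in responses if categorize(words, r) is None]


# Hypothetical starting point: two categories, a few categorized examples.
labeled = [("price too high", "Price"), ("friendly staff", "Service")]
responses = [
    "price is high",
    "staff were friendly",
    "parking was full",
    "no parking nearby",
]

# First pass: see which responses fall outside the existing categories.
before = uncategorized(train(labeled), responses)

# Refinement: add a new category with a representative example, then re-run.
labeled.append(("parking was full", "Parking"))
after = uncategorized(train(labeled), responses)

print(before, after)
```

In this toy data set, one added example is enough for the second pass to place the remaining parking response. Real data usually needs several rounds of examining the uncategorized responses and extending the categorization.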