How To Automatically Classify Unstructured Text Data

This tool uses a basic Large Language Model to categorize open-ended text responses, and it can use partially categorized data to predict the categories of unclassified responses. Note that our Displayr AI Text Categorization module is a far superior tool for creating categories from text data because it is built using Displayr AI. However, that module classifies data using the category label alone, whereas the Unstructured Text output can use partially classified data to predict categories for the remaining data. This may be preferable if your responses contain very obscure themes.

This article guides you through automatically classifying unstructured text data. It will take you from unstructured verbatim (raw) text responses:

To a state where the verbatims are automatically categorized:

Requirements

You will need a Text variable to perform automatic coding. Text variables are represented by an A next to the variable in the Data Sources tree:


See Finding the Best Text Analysis for Your Data to determine if this is the best method for your data.

Method

In the Data Sources tree, select the variable that contains the unstructured text.
From the Object Inspector, go to Data > Text Analysis > Categorize Unstructured Text.
Select the output on the page and, in the Object Inspector, under Data > Categories > Category Creation, select Create New Categorization.
Under Data > Categories > Number of categories, enter the number of categories you want the output to contain. The default is 10.
Click Calculate if Calculate automatically is not ticked.

If you instead want to use an existing categorization from a different variable in your data set to train the automatic categorization algorithm, see How to Automatically Classify Text Data Using an Existing Categorization to Train a Model.

Translate the text for categorization

If your text data is in another language, or in multiple languages, you can also translate the responses as part of the classification.

In the Object Inspector, under Data > Translate (Google Cloud Translation), specify the Source language. If your text variable contains more than one language, select Specify with variable, and then select the nominal variable that lists the language of each response in the Source language variable dropdown. If that variable has missing values for any cases, Displayr will make a best guess about the language of those cases.
Specify the Output language that you would like the responses to be translated into.
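
The per-case language handling described above can be pictured with a short Python sketch. It is only an illustration of the fallback logic, not the code used by Displayr or Google Cloud Translation. The function `resolve_source_languages`, the stand-in detector `guess_language`, and the example data are all hypothetical.

```python
from typing import Callable, Optional


def resolve_source_languages(
    texts: list[str],
    language_variable: list[Optional[str]],
    guess_language: Callable[[str], str],
) -> list[str]:
    """Pick a source language for each case.

    Uses the value from the language variable when it is present and
    falls back to a guess based on the text when the value is missing.
    """
    if len(texts) != len(language_variable):
        raise ValueError("texts and language_variable must have the same length")
    return [
        lang if lang else guess_language(text)
        for text, lang in zip(texts, language_variable)
    ]


# Hypothetical stand-in detector; a real detector would be far more thorough.
def guess_language(text: str) -> str:
    return "Spanish" if "ñ" in text or "¿" in text else "English"


texts = ["Great service", "¿Dónde está mi pedido?", "Très bon produit"]
language_variable = ["English", None, "French"]  # None marks a missing value
resolved = resolve_source_languages(texts, language_variable, guess_language)
```

Here the second case has no language recorded, so its language is guessed from the text, while the other two cases use the values supplied in the language variable.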

Save categories from automatic categorization

To save the categorizations* for use in tables and other outputs:

Select the automatic categorization output on your page.
In the Object Inspector, click Data > Save Variable(s) and choose either Categories or First Category.
1. Categories: Saves variables containing the categories to the data set. Where there are multiple input variables, a separate set of variables is added for each.
2. First Category: Saves a variable containing the first category mentioned to the data set. Where there are multiple input variables, the first category of each is saved as a separate variable.
[Optional]: To see the proportions of people for each category, you can drag the new variable set to the page to create a summary table.
[Optional]: To see the raw unstructured text verbatims alongside their categories, you can create a raw data table for the variable sets. See How to Create a Raw Data Table From Variable(s).
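
The difference between the two save options can be pictured with a small Python sketch. It only illustrates the shape of the saved data and is not Displayr's implementation. It assumes that Categories is stored as one 0/1 indicator per category; the category names, the example assignments, and the helper names `to_category_indicators` and `to_first_category` are hypothetical.

```python
from collections import Counter

# Hypothetical result of an automatic categorization: each response is
# assigned zero or more categories, in the order they were mentioned.
assigned = [
    ["Price", "Service"],
    ["Service"],
    [],
    ["Quality", "Price"],
]
categories = ["Price", "Quality", "Service"]


def to_category_indicators(assigned, categories):
    """'Categories' style (assumed): one 0/1 variable per category."""
    return {
        cat: [1 if cat in row else 0 for row in assigned]
        for cat in categories
    }


def to_first_category(assigned):
    """'First Category' style: one variable holding the first category, or None."""
    return [row[0] if row else None for row in assigned]


indicators = to_category_indicators(assigned, categories)
first = to_first_category(assigned)

# Proportion of responses in each category, as a summary table would report.
proportions = {cat: sum(flags) / len(assigned) for cat, flags in indicators.items()}

# Count of responses by first category, ignoring uncategorized responses.
first_counts = Counter(c for c in first if c is not None)
```

In this sketch a response can count toward several categories in `indicators`, but it contributes to only one category in `first`, which is why the two options can give different proportions in a summary table.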

*NOTE: Variables created using Save Variable(s) > Categories or First Category may become invalid if the output changes, either because the input text variable was modified or because the input settings were modified. In that case, delete the variables and create them again.

See Also

How to Automatically Classify Text Data Using an Existing Categorization to Train a Model

Finding the Best Text Analysis for Your Data

How To Automatically Classify Unstructured Text Data Into an Entity List

How to Automatically Classify Lists of Items