When a data file has no metadata, categorical data is often imported as a Text variable instead of a Nominal (categorical) variable. Displayr can automatically code simple text variables and keep the responses alphabetically ordered. "Simple" here means the text can be converted to a category directly with little manipulation. This is especially useful for CSV files, where categorical data is often stored as labels rather than as numeric values (e.g. a question such as “What is your favorite animal?” would have data values of “Ants”, “Dogs”, “Cats”), but the same situation can also arise with poorly formatted data exports.
Notably, this feature is distinct from Displayr's text categorization tool and automatic text categorization outputs. See Finding the Best Text Analysis for Your Data for an overview of the other methods for coding text data.
Requirements
- A variable set with a Structure of Text or Text – Multi where the text data is simple (each response is either identical or closely similar to the category it should be assigned to).
Method
How to convert a single simple text variable to a nominal variable
- Select the variable in the Data Sources tree.
- In the object inspector, change the Structure from Text to Nominal.
How to convert multiple simple text variables to a nominal variable set with a shared code frame
- Select the text variables that should have a shared code frame in the Data Sources tree.
- Right-click and select Combine.
- The text variables have now been combined into a single Text – Multi variable set.
- In the object inspector, change the Structure for the new variable set under Data > Properties from Text – Multi to Nominal – Multi.
Technical Details
How converting works
Key points
- Converting a Text variable to a Nominal variable via Structure will automatically code the text into categories.
- When working with categorical data in Displayr, the data is stored as numbers and each category is assigned a number. This is known as the code frame and is found in the object inspector > Data > Properties > Values of a variable (set).
- When automatically coding multiple text variables that are related, first use Combine to combine them into a Text – Multi variable set, and then change the Structure to Nominal – Multi (or any other categorical type). This ensures responses from all variables are auto-coded at once and alphabetically ordered.
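To illustrate why related variables are combined before converting, here is a minimal Python sketch. It is not Displayr's implementation; the function names `build_code_frame` and `code_responses` and the sample data are hypothetical. It shows that coding two variables separately can give the same response different values, while pooling the responses first gives one shared, alphabetically ordered code frame.

```python
def build_code_frame(responses):
    """Map each distinct response to a value, in alphabetical order.

    Case and leading/trailing spaces are ignored when comparing responses.
    """
    keys = sorted({r.strip().casefold() for r in responses if r.strip()})
    return {key: value for value, key in enumerate(keys, start=1)}


def code_responses(responses, frame):
    """Replace each text response with its value from the code frame."""
    return [frame.get(r.strip().casefold()) for r in responses]


# Two related text variables (hypothetical data).
first_pet = ["Dogs", "Cats", "dogs"]
second_pet = ["Ants", "Dogs", "Birds"]

# Coded separately, "dogs" receives a different value in each variable.
separate_first = build_code_frame(first_pet)
separate_second = build_code_frame(second_pet)

# Coded together, every variable shares one code frame.
shared = build_code_frame(first_pet + second_pet)
coded_first = code_responses(first_pet, shared)
coded_second = code_responses(second_pet, shared)
```

With the shared frame, "Dogs" has the same value in both variables, which is what makes the combined variable set usable as a single categorical question.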
The auto-coding rules behind the scenes
- Leading/trailing spaces and capitalization are ignored - For example, “ dogs” and “Dogs ” will both be coded as the same category.
- The label that occurs most often will become the category label - For example, if the responses were “coke”, “COKE”, “Coke” and “Coke” the auto-coded question would use “Coke” as the label for the category for those responses as it occurs twice.
- Categories are automatically alphabetized by default when created - This is both in the Value Attributes dialog and on tables.
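The three rules above can be sketched in Python as follows. This is an illustration only, not Displayr's actual code: the function name `auto_code` is hypothetical, and two behaviors are assumptions that the rules do not specify, namely that a tie between equally common spellings goes to the spelling seen first, and that blank responses are left uncoded (`None`).

```python
from collections import Counter, defaultdict


def auto_code(responses):
    """Return (code_frame, coded_values) for a list of raw text responses."""
    # Rule 1: group responses while ignoring outer spaces and capitalization.
    variants = defaultdict(Counter)
    for raw in responses:
        text = raw.strip()
        if text:
            variants[text.casefold()][text] += 1

    # Rule 2: the most frequent spelling in each group becomes the label.
    # Assumption: ties go to the spelling encountered first.
    labels = {key: counts.most_common(1)[0][0] for key, counts in variants.items()}

    # Rule 3: categories are ordered alphabetically and numbered from 1.
    value_of = {key: i for i, key in enumerate(sorted(labels), start=1)}
    code_frame = {value_of[key]: labels[key] for key in sorted(labels)}

    # Assumption: blank responses are left uncoded (None).
    coded = [value_of.get(raw.strip().casefold()) for raw in responses]
    return code_frame, coded


responses = ["coke", "COKE", "Coke", "Coke", " dogs", "Dogs ", "Dogs", "Ants"]
code_frame, coded = auto_code(responses)
print(code_frame)  # {1: 'Ants', 2: 'Coke', 3: 'Dogs'}
print(coded)       # [2, 2, 2, 2, 3, 3, 3, 1]
```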
How Displayr deals with changes in the data file
- Whenever the source text variables are updated (from either an updated data file or due to an edit within Displayr), the code frame is automatically re-coded.
- Whenever converted variables are combined into a multi-variable set, their code frames change to include unique responses from all other input text variables.
- Whenever converted variables are split out of a multi-variable set into their own single variables, their code frames stop including responses from the other text variables and only include their own responses. Importantly, their category values stay the same.
- Existing text responses always keep their same category value (e.g. if “Ants” was originally the first alphabetical response with an auto-coded value of 1, and “Aardvarks” appeared in the new data, “Ants” would remain with a value of 1, and “Aardvarks” would get a new unique value).
- The category labels may change if a different spelling of a response becomes the most frequent one (e.g. if the new responses were “coke”, “COKE”, “Coke”, “Coke”, “coke” and “coke”, the new label would be “coke”).
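The update behavior described above can be sketched in Python as follows. This is a hedged illustration rather than Displayr's implementation: the function name `update_code_frame` is hypothetical, and giving a new category the next unused integer is an assumption (the behavior described only requires that the new value be unique).

```python
from collections import Counter, defaultdict


def update_code_frame(existing_values, responses):
    """Re-code responses while keeping the values of existing categories.

    existing_values maps a normalized response to its category value.
    Returns (values, labels), where labels maps a value to its display label.
    """
    values = dict(existing_values)
    variants = defaultdict(Counter)
    for raw in responses:
        text = raw.strip()
        if text:
            variants[text.casefold()][text] += 1

    # Existing categories keep their value; new ones get a new unique value.
    # Assumption: the new value is the next unused integer.
    next_value = max(values.values(), default=0) + 1
    for key in sorted(variants):
        if key not in values:
            values[key] = next_value
            next_value += 1

    # Labels are recomputed, so the most frequent spelling may change.
    labels = {values[key]: counts.most_common(1)[0][0] for key, counts in variants.items()}
    return values, labels


original = ["Ants", "coke", "COKE", "Coke", "Coke"]
values, labels = update_code_frame({}, original)

updated = ["Ants", "Aardvarks", "coke", "COKE", "Coke", "Coke", "coke", "coke"]
new_values, new_labels = update_code_frame(values, updated)
print(new_values)  # {'ants': 1, 'coke': 2, 'aardvarks': 3}
print(new_labels)  # {1: 'Ants', 3: 'Aardvarks', 2: 'coke'}
```

In this sketch “Ants” keeps the value 1 even though “Aardvarks” now sorts before it, and the label for value 2 changes from “Coke” to “coke” because the lowercase spelling has become the most frequent.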
Next
Frequently Asked Questions about Text Analysis
Finding the Best Text Analysis for Your Data