This article explains how to use the Setup Text Analysis function to clean and prepare your raw text data for further analysis. It takes you from uncategorized raw text responses to cleaned, processed verbatims with fewer unique words, giving you a more manageable and meaningful collection to analyze.
Requirements
You will need a Text variable to perform automatic categorization. Text variables are represented by an A next to the variable in the Data Sources tree.
Method
1. From the toolbar, go to Anything > Advanced Analysis > Text Analysis > Advanced > Setup Text Analysis.
2. From the object inspector, go to Data > Text variable and either select your text variable from the drop-down or drag it from the Data Sources tree into the Text variable field.
3. Select any of the following options to tidy your text data:
- Correct spelling: This process involves replacing incorrectly spelled words with the correct word. The algorithm uses an English-language dictionary to check the words. It also tries to account for words that are incorrectly spelled but which occur very frequently in the text (to avoid trying to correct things like brand names which probably don't appear in the dictionary).
- Perform stemming: Stemming removes plurals, tenses, and other suffixes from the ends of words to find the root word, or stem. For example, the words lover, loved, and loving all have the same stem, lov. Displayr's stemming heuristics replace words that share a stem with the most common word in your text belonging to that stem.
- Replace synonyms: Replacing words with synonyms is another standard processing step. Words with the same meaning will be replaced with the variant that occurs most frequently in the data set.
- Replace these words/phrases: Specify words to replace using the syntax <current word>:<replacement word>. For example, in a study of the airline market, you may want to replace all appearances of baggage with luggage; to do so, enter baggage:luggage in this field. Separate multiple replacements with commas.
- Remove these words/phrases: In some cases, your sample will contain frequent words that you don't want to keep. Type them in here, separated by commas, to add them to the standard list of stopwords. A stopword is a common English word that does not convey much meaning by itself (e.g., 'the', 'of', and 'a'); removing stopwords is so common in text analysis that it is done automatically. It is useful to run Anything > Advanced Analysis > Text Analysis > Advanced > Search alongside the setup: the setup output shows you the most frequent words, but to decide what to do with them (e.g., whether to merge them with other words) you need to see the context in which they appear, which the search feature provides.
- Phrases: By default, tokens (the words or phrases that form the units of analysis) are single words, so the output of this cleaning is usually a collection of single words. To treat groups of words, such as word pairs, as single tokens, type the phrases in here, separated by commas.
- Minimum frequency as a percentage of cases: This sets how often a word must appear, as a percentage of cases, before it is included in your analysis. It is the most powerful option for reducing the number of words: if you have too many, raise the threshold.
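To make the Perform stemming option above more concrete, here is a minimal Python sketch of the idea it describes: group words by stem, then replace each word with the most frequent variant in its group. This is not Displayr's implementation. The suffix list, the `crude_stem` helper, and the alphabetical tie-break are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict

# Illustrative suffix list only; real stemmers use far richer rules.
SUFFIXES = ("ing", "ed", "er", "es", "s", "e")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def replace_with_most_common(words):
    """Replace every word with the most frequent word sharing its stem."""
    counts = Counter(words)

    # Group the distinct words by their (crude) stem.
    groups = defaultdict(list)
    for word in counts:
        groups[crude_stem(word)].append(word)

    # Pick the most frequent variant per stem; ties are broken alphabetically.
    representative = {}
    for variants in groups.values():
        best = min(variants, key=lambda w: (-counts[w], w))
        for variant in variants:
            representative[variant] = best

    return [representative[word] for word in words]


words = ["loved", "loving", "loved", "lover", "flights", "flight", "flight"]
print(replace_with_most_common(words))
```

In this sketch, lover, loved, and loving all reduce to the stem lov, and each is replaced by loved because it is the most frequent of the three in the sample.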
For other advanced options, see Text Analysis - Advanced - Setup Text Analysis in our Technical Guide.
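The Replace, Remove, Phrases, and Minimum frequency options in step 3 can be pictured with a small Python sketch. It is an approximation of the behavior described above, not Displayr's code: the function names (`parse_replacements`, `tokenize`, `tidy`) are hypothetical, replacements are applied to whole tokens only, and the frequency threshold is assumed to mean the percentage of responses that contain a token at least once.

```python
import re
from collections import Counter


def parse_replacements(spec):
    """Parse 'baggage:luggage, hostess:attendant' into a dict."""
    pairs = {}
    for item in spec.split(","):
        if ":" in item:
            old, new = item.split(":", 1)
            pairs[old.strip().lower()] = new.strip().lower()
    return pairs


def split_list(spec):
    """Split a comma-separated field into clean lowercase entries."""
    return [entry.strip().lower() for entry in spec.split(",") if entry.strip()]


def tokenize(text, phrases):
    """Lowercase the text and split it into tokens, keeping phrases whole."""
    text = text.lower()
    for phrase in phrases:
        joined = phrase.replace(" ", "_")
        pattern = r"\b" + re.escape(phrase) + r"\b"
        text = re.sub(pattern, lambda match, j=joined: j, text)
    return [token.replace("_", " ") for token in re.findall(r"[a-z_']+", text)]


def tidy(responses, replace_spec="", remove_spec="", phrase_spec="", min_pct=0.0):
    """Apply replacements, removals, phrases, and a minimum-frequency filter."""
    replacements = parse_replacements(replace_spec)
    removed = set(split_list(remove_spec))
    phrases = split_list(phrase_spec)

    tokenized = []
    for response in responses:
        tokens = [replacements.get(t, t) for t in tokenize(response, phrases)]
        tokenized.append([t for t in tokens if t not in removed])

    # Assumption: count each token once per response (case), not per mention.
    case_counts = Counter(t for tokens in tokenized for t in set(tokens))
    threshold = min_pct / 100 * len(responses)
    keep = {t for t, n in case_counts.items() if n >= threshold}
    return [[t for t in tokens if t in keep] for tokens in tokenized]


responses = [
    "Lost my baggage twice",
    "Great leg room and friendly crew",
    "Baggage fees are too high",
    "The leg room was fine",
]
tidied = tidy(
    responses,
    replace_spec="baggage:luggage",
    remove_spec="my, are, was, the, and",
    phrase_spec="leg room",
    min_pct=50,
)
print(tidied)
```

With a threshold of 50 percent of cases, only luggage and leg room survive in this sample, which shows how raising the minimum frequency quickly shrinks the set of words carried into the analysis.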
4. Once you've tidied your verbatim data, scroll down in the object inspector and save it into new variables for use in your analysis via Data > Save Variable(s):
- Categories: Save variables to the data set containing the categories. Where there are multiple input variables, multiple sets of variables are added for each.
- First category: Save a variable to the data set containing the first category mentioned. Where there are multiple input categories, the first category of each will be saved as a separate variable.
- Sentiment: Save a variable that scores each text response in an attempt to quantify how positive or negative it is.
- Tidied Text: Save a variable to the data set that contains the tidied text.
NOTE: Variables created using Save Variable(s) > Categories and First category may become invalid, and need to be deleted and recreated, if the output changes because either the input text variable or the input settings have been modified.
Technical Details
A standard list of stopwords is removed automatically by this function. If you'd like to keep stopwords in your text, you can modify the underlying code via Data > Show Advanced Options > R Code > Edit Code. You will need to add the remove.stopwords argument to the arguments list and set it equal to FALSE.
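To illustrate what switching off stopword removal changes, here is a short Python sketch. It does not reproduce the R code behind the function: the `remove_stopwords` parameter only mirrors the remove.stopwords argument mentioned above, and the stopword list is a tiny made-up sample rather than Displayr's standard list.

```python
# Tiny illustrative stopword list; an assumption, not Displayr's standard list.
STOPWORDS = {"the", "of", "a", "and", "was", "to"}


def tokenize(text, remove_stopwords=True):
    """Split text into lowercase words, optionally dropping stopwords."""
    words = text.lower().split()
    if remove_stopwords:
        words = [word for word in words if word not in STOPWORDS]
    return words


response = "The price of the ticket was a surprise"
print(tokenize(response))                          # stopwords dropped
print(tokenize(response, remove_stopwords=False))  # stopwords kept
```

Keeping stopwords preserves every word of the response, which can matter when short function words carry meaning in your data.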
Next
How to Create a Term Document Matrix
How to Show Sentiment in Word Clouds
How to Calculate Sentiment Scores for Open-Ended Responses
How to Compare Sentiment by Entities
How to Create a Predictive Tree for Text Analysis