How to Use Text Analytics to Tidy a Word Cloud

By default, a word cloud visualizes every word in the text responses. However, some words are not meaningful for analysis, and several words often mean the same thing and should be analyzed as a single keyword. This article describes how to go from a word cloud that contains nearly all words in the responses:

To a tidied word cloud that filters words by frequency, merges synonyms, and removes specific words that aren't useful:

Requirements

You will need a Text variable in order to perform text analysis and word cloud creation. Text variables are represented by an "A" next to the variable in the Data Sources tree:

Method

From the Report tree or toolbar, go to Anything > Advanced Analysis > Text Analysis > Advanced > Setup Text Analysis.
In the object inspector, select the text variable you would like to use for the word cloud from Data > Text Analysis Options > Text variable, or drag your text variable from the Data Sources tree into Data > Text Analysis Options > Text variable.
From there, you can tidy your text data based on any of the following options:
- Correct spelling: This replaces incorrectly spelled words with the correct word. The algorithm uses an English-language dictionary to check the words. It also tries to avoid correcting words that are not in the dictionary but occur very frequently in the text (such as brand names).
- Perform stemming: Stemming removes plurals, tenses, and other suffixes from the ends of words to work out the root word, or stem. For example, the words lover, loved, and loving all share the stem lov. Displayr's stemming heuristics replace words that share a stem with the most common word in your text belonging to that stem.
- Replace synonyms: Replacing words with synonyms is another standard processing step. Words with the same meaning will be replaced with the variant that occurs most frequently in the data set.
- Replace these words/phrases: Words to replace can also be specified. For example, in a study of the airline market, you may want to replace all appearances of baggage with luggage. In Displayr, enter replacements in this field using the syntax <current word>:<replacement word>. So, to replace the word baggage with the word luggage, use baggage:luggage. Separate multiple replacements with commas.
- Remove these words/phrases: In some cases, your sample will contain frequent words that you don't want to keep. In Displayr, just type them here, separated by commas. This adds them to the standard list of stopwords. A stopword is a common English word that does not convey much meaning by itself (e.g., 'the', 'of', and 'a'). Removing stopwords is so common in text analysis that it is done automatically. A useful way to work in Displayr is to use Anything > Advanced Analysis > Text Analysis > Advanced > Search while you set up the text analysis. The setup shows you the most frequent words, but to decide what to do with them (e.g., whether to merge them with other words), you need to know the context in which they appear, and the text analysis search feature shows you that context.
- Minimum frequency as a percentage of cases: This sets the bar for how often a word must appear in the text before it will be included in your analysis. This is the most powerful tool in terms of reducing the sheer number of words to include. If you have too many words then just raise the bar.
- Minimum frequency: Unticking Minimum frequency as a percentage of cases allows you to enter a numeric value for the minimum number of times a word must appear to be included in the text analysis.
Once you have tidied your data, save the tidied text as a new text variable. From the object inspector, go to Data > Save Variables and click Tidied Text.
You'll notice that a new variable has been inserted in your Data Sources tree with Tidied in its name. To create a word cloud from this tidied variable, go to Visualization > Text Analysis > Word Cloud from the toolbar.
In Data > Data Source > Rows, select the tidied text variable or drag and drop it from the Data Sources tree, and you'll now have a tidied word cloud.
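
To build intuition for what the Replace, Remove, and Minimum frequency options do to the text before it reaches the word cloud, here is a minimal Python sketch. It is not Displayr's implementation. The function names, the tiny stopword list, and the sample responses are all hypothetical. It also assumes that the percentage option counts the cases containing a word while the numeric option counts total occurrences. Spelling correction, stemming, and synonym replacement are left out.

```python
import re
from collections import Counter

# Tiny stand-in for a standard stopword list (assumption, not Displayr's list).
STOPWORDS = {"the", "of", "a", "and", "but", "my", "was", "were", "is", "to"}


def parse_replacements(spec):
    """Parse 'current:replacement' pairs separated by commas."""
    pairs = {}
    for item in spec.split(","):
        if ":" in item:
            current, replacement = item.split(":", 1)
            pairs[current.strip().lower()] = replacement.strip().lower()
    return pairs


def parse_removals(spec):
    """Parse a comma-separated list of words to remove."""
    return {word.strip().lower() for word in spec.split(",") if word.strip()}


def tidy(responses, replace_spec="", remove_spec=""):
    """Return one list of tidied words per response (case)."""
    replacements = parse_replacements(replace_spec)
    removals = STOPWORDS | parse_removals(remove_spec)
    tidied = []
    for text in responses:
        tokens = re.findall(r"[a-z']+", text.lower())
        tokens = [replacements.get(t, t) for t in tokens]
        tidied.append([t for t in tokens if t not in removals])
    return tidied


def frequent_words(tidied, min_percent=None, min_count=None):
    """Keep words that meet either a percentage-of-cases or a count threshold."""
    case_counts = Counter()   # number of cases containing each word
    total_counts = Counter()  # total occurrences of each word
    for tokens in tidied:
        case_counts.update(set(tokens))
        total_counts.update(tokens)
    if min_percent is not None:
        threshold = len(tidied) * min_percent / 100
        return {w: n for w, n in case_counts.items() if n >= threshold}
    threshold = min_count or 0
    return {w: n for w, n in total_counts.items() if n >= threshold}


# Hypothetical responses from an airline study.
responses = [
    "The airline lost my baggage",
    "Luggage was late and the staff were rude",
    "Friendly staff, but my baggage was damaged",
    "Nothing to report",
]
tidied = tidy(responses, "baggage:luggage", "nothing, report")
print(frequent_words(tidied, min_percent=50))  # {'luggage': 3, 'staff': 2}
print(frequent_words(tidied, min_count=3))     # {'luggage': 3}
```

Raising min_percent or min_count shrinks the set of words that survive, which mirrors the effect of raising the bar in the Minimum frequency options.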
