This article describes how to go from a word cloud that contains nearly all verbatim responses:
To a tidied word cloud that is based on the frequency of responses, with synonyms merged and unhelpful words removed:
You will need a Text variable in order to perform text analysis and word cloud creation. Text variables are represented by a small a next to the variable in the Data Sets tree:
- From the toolbar, go to Anything > Advanced Analysis > Text Analysis > Advanced > Setup Text Analysis.
- Select the text variable you would like to use for the word cloud from Inputs > Text Analysis Options > Text variable, or drag it from the Data Sets tree into that field.
- From there, you can tidy your text data based on any of the following options:
- Correct spelling: This process involves replacing incorrectly spelled words with the correct word. The algorithm uses an English-language dictionary to check the words. It also tries to account for words that are incorrectly spelled but which occur very frequently in the text (to avoid trying to correct things like brand names which probably don't appear in the dictionary).
- Perform stemming: Stemming removes plurals, tenses, and other suffixes from the ends of words to work out the root word, or stem. For example, the words lover, loved, and loving all have the same stem, which is lov. Displayr's stemming heuristics replace words that share a stem with the most common word in your text belonging to that stem.
- Replace synonyms: Replacing words with synonyms is another standard processing step. Words with the same meaning will be replaced with the variant that occurs most frequently in the data set.
- Replace these words/phrases: Words to replace can also be specified. For example, in a study of the airline market, you may want to replace all appearances of baggage with luggage. In Displayr, this is done in this field using the following syntax: <current word>:<replacement word>. So to replace the word baggage with the word luggage, use baggage:luggage. Separate multiple replacements with commas.
- Remove these words/phrases: In some cases, your sample will contain frequent words which you don't want to keep. In Displayr, just type them here, separated by commas. This adds them to the standard list of stopwords. A stopword is a common English word that does not convey much meaning by itself (e.g., 'the', 'of', and 'a'). Removing stopwords is so common in text analysis that it is done automatically. A useful way to work in Displayr is to use Anything > Advanced Analysis > Text Analysis > Advanced > Search while you set up the text analysis. The setup shows you the most frequent words, but to decide what to do with them (e.g., whether to merge them with other words), you need to know the context in which they appear, and the text analysis search feature makes that easy to check.
- Phrases: By default, tokens (the words or phrases that form the units of analysis) are single words, so in most cases the output of this cleaning will be a collection of single words. However, it is also possible to explicitly list phrases as tokens. If you want to treat word pairs together as phrases, type them in the Phrases field, separated by commas.
- Minimum frequency as a percentage of cases: This sets the bar for how often a word must appear in the text before it will be included in your analysis. This is the most powerful tool in terms of reducing the sheer number of words to include. If you have too many words then just raise the bar.
- Minimum frequency: Unticking Minimum frequency as a percentage of cases allows you to enter a numeric value for the minimum number of times a word must appear to be included in the text analysis.
- Once you have tidied your data, you will save the tidied text as a new text variable. From the object inspector go to Inputs > SAVE VARIABLES and click Tidied Text.
- You'll notice that a new variable has been inserted in your Data Sets tree with Tidied in its name. To create a word cloud from this tidied variable, go to Anything > Visualization > Legacy Charts > Word Cloud.
- In Inputs > DATA SOURCE > Rows, select the tidied text variable or drag and drop it from the Data Sets tree and you'll now have a tidied word cloud.
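To make the tidying options above more concrete, here is a minimal Python sketch of what replacing words, removing words, keeping phrases together, and applying a minimum frequency might look like outside Displayr. It is an illustration only and is not Displayr's implementation: the function names (`parse_replacements`, `tokenize`, `tidy`), the tiny stopword list, and the choice to count a word once per response are all assumptions made for this example. Spelling correction, stemming, and automatic synonym replacement are left out.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not Displayr's actual list).
STOPWORDS = {"the", "of", "a", "and", "my", "was", "is", "to"}


def parse_replacements(spec):
    """Parse a string such as 'baggage:luggage, pricey:expensive' into a dict."""
    pairs = {}
    for item in spec.split(","):
        if ":" in item:
            old, new = item.split(":", 1)
            pairs[old.strip().lower()] = new.strip().lower()
    return pairs


def tokenize(text, phrases):
    """Lowercase the text and split it into tokens, keeping listed phrases whole."""
    text = text.lower()
    for phrase in sorted(phrases, key=len, reverse=True):
        joined = phrase.replace(" ", "_")
        # Join the words of a phrase with underscores so they survive splitting.
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", lambda m: joined, text)
    return [t.replace("_", " ") for t in re.findall(r"[a-z_']+", text)]


def tidy(responses, replacements="", remove="", phrases="", min_count=1):
    """Return one list of tidied tokens per response."""
    repl = parse_replacements(replacements)
    stop = STOPWORDS | {w.strip().lower() for w in remove.split(",") if w.strip()}
    phrase_list = [p.strip().lower() for p in phrases.split(",") if p.strip()]

    tokenized = []
    for response in responses:
        tokens = tokenize(response, phrase_list)
        tokens = [repl.get(t, t) for t in tokens]         # replace words
        tokens = [t for t in tokens if t not in stop]     # remove stopwords
        tokenized.append(tokens)

    # Assumption: a token is counted once per response (case) that contains it.
    case_counts = Counter(t for tokens in tokenized for t in set(tokens))
    keep = {t for t, n in case_counts.items() if n >= min_count}
    return [[t for t in tokens if t in keep] for tokens in tokenized]


responses = [
    "My baggage was lost",
    "Lost luggage and poor customer service",
    "Great customer service",
]
tidied = tidy(
    responses,
    replacements="baggage:luggage",
    remove="great",
    phrases="customer service",
    min_count=2,
)
print(tidied)
# [['luggage', 'lost'], ['lost', 'luggage', 'customer service'], ['customer service']]
```

In this sketch, raising `min_count` plays the same role as raising the minimum frequency: the word poor appears in only one response, so it drops out once the threshold is 2.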