How to Extract Keywords and Phrases from Text

The most robust text analysis tool in Displayr is the Text Categorization module, which uses Displayr AI to classify text data (i.e., code open-ends) more accurately. However, sometimes you may want a more rudimentary analysis of your text data, such as a word cloud. The Setup Text Analysis tool cleans and parses text data for use in a word cloud or word count analysis, so that the words used are more relevant. It corrects spelling, removes stopwords, and reduces responses to their most important words and phrases. Note that this tool does not use Displayr AI, though you could create a Custom AI Variable or Custom AI Output to do something similar. This article describes how to go from a text variable with messy data:

To an output that processes the text data to extract only the keywords and phrases, making it more manageable and meaningful for rudimentary text analyses like word counts and word clouds:

Requirements

A Text variable. Text variables are represented by an A next to the variable in the Data Sources tree:

Method

1. From the toolbar, go to Anything > Advanced Analysis > Text Analysis > Advanced > Setup Text Analysis.
2. From Properties, go to Data > Text variable and either select your text variable from the drop-down or drag it from the Data Sources tree into the Text variable field.
3. From there, you have multiple options that you can select to tidy your text data:
- Correct spelling: This replaces incorrectly spelled words with the correct word. The algorithm checks words against an English-language dictionary. It also tries to leave alone words that are not in the dictionary but occur very frequently in the text, to avoid "correcting" things like brand names.
- Perform stemming: Stemming removes plurals, tenses, and other suffixes from the ends of words to work out the root word, or stem. For example, the words lover, loved, and loving all have the same stem, which is lov. Displayr's stemming heuristics replace words that share a stem with the most common word in your text belonging to that stem.
- Replace synonyms: Replacing words with synonyms is another standard processing step. Words with the same meaning will be replaced with the variant that occurs most frequently in the data set.
- Replace these words/phrases: You can also specify words to replace. For example, in a study of the airline market, you may want to replace all appearances of baggage with luggage. Enter each replacement in this field using the syntax <current word>:<replacement word>, so to replace the word baggage with the word luggage, use baggage:luggage. Separate multiple replacements with commas.
- Remove these words/phrases: In some cases, your sample will contain frequent words that you don't want to keep. Type them in here, separated by commas, to add them to the standard list of stopwords. A stopword is a common English word that does not convey much meaning by itself (e.g., 'the', 'of', and 'a'). Removing stopwords is so common in text analysis that it is done automatically. A useful way to work in Displayr is to use Anything > Advanced Analysis > Text Analysis > Advanced > Search while you set up the text analysis. The setup output shows you the most frequent words, but to decide what to do with them (e.g., whether to merge them with other words), you need to know the context in which they appear, which is easy to check with the text analysis search feature.
- Phrases: By default, tokens (the words or phrases that form the units of analysis) are usually single words, so in most cases the output of this cleaning will be a collection of single words. If you want word pairs or longer phrases to be treated as single tokens, type them into the Phrases field, separated by commas.
- Minimum frequency as a percentage of cases: This sets the bar for how often a word must appear in the text before it will be included in your analysis. This is the most powerful tool in terms of reducing the sheer number of words to include. If you have too many words then just raise the bar.

For other advanced options, see Text Analysis - Advanced - Setup Text Analysis in our Technical Guide.
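To make the replacement, removal, and minimum frequency options more concrete, here is a minimal sketch in plain base R. It is not Displayr's implementation: the example responses, the tiny stopword list, and the helper tidy_response are all hypothetical, and spelling correction, stemming, and synonyms are left out entirely.

```r
# Illustrative only: base R, not the code Displayr runs.
responses <- c("The baggage was lost",
               "Lost my luggage at the airport",
               "Great service and friendly staff",
               "The staff lost my baggage")

# Parse a "<current word>:<replacement word>" list as typed into the field.
replacement.spec <- "baggage:luggage, great:good"
pairs <- strsplit(trimws(strsplit(replacement.spec, ",")[[1]]), ":")
replacements <- setNames(vapply(pairs, `[`, "", 2),
                         vapply(pairs, `[`, "", 1))

# Hypothetical stopword list, plus extra words the user wants removed.
stopwords <- c("the", "my", "at", "and", "was")
remove.words <- c("airport")

# Hypothetical helper: tokenize, apply replacements, drop unwanted words.
tidy_response <- function(text) {
    tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
    tokens <- tokens[nzchar(tokens)]
    hit <- tokens %in% names(replacements)
    tokens[hit] <- unname(replacements[tokens[hit]])
    tokens[!tokens %in% c(stopwords, remove.words)]
}

tidied <- lapply(responses, tidy_response)

# Percentage of cases (responses) that mention each token at least once.
counts <- table(unlist(lapply(tidied, unique)))
pct.cases <- setNames(100 * as.vector(counts) / length(responses),
                      names(counts))

# Keep only tokens that appear in at least 50% of cases.
min.frequency <- 50
kept <- names(pct.cases)[pct.cases >= min.frequency]
print(kept)
```

Raising min.frequency in this sketch shrinks kept, which mirrors how raising the minimum frequency setting reduces the number of words in the output.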

4. Once you've tidied your verbatim data, you can scroll down in Properties and save it into a new variable to be used in the analysis via Data > Save Variable(s):

- Categories: Save variables to the data set containing the unique words (which can be used as categories). Where there are multiple input variables, a separate set of variables is added for each.
- First category: Save a variable to the data set containing the first unique word (category) mentioned. Where there are multiple input categories, the first category of each will be saved as a separate variable.
- Sentiment: Save a variable containing a score for each text response that attempts to quantify how positive or negative the response is.
- Tidied Text: Save a variable to the data set that contains the tidied text.

NOTE: Variables created using Save Variable(s) > Categories and First category may become invalid if the output changes, whether because the input text variable or the input settings were modified. If this happens, delete and recreate them.

If you'd like to extract the frequencies or other information from the output table, you can do that via a Calculation > Custom Code item. Use R code like the following to pull the underlying data from the output, where text.analysis.setup is the Name of your text analysis item on the page:

attr(text.analysis.setup,"ChartData")

To get only the first two columns of the categories and their frequencies, use the following R code:

attr(text.analysis.setup,"ChartData")[,1:2]
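Once you have the two-column table, you may want to sort it or keep only the most frequent categories. The sketch below shows one way to do that. It builds a hypothetical stand-in for text.analysis.setup so that it runs outside Displayr. The column names and values are invented for illustration, and it assumes only that the first column holds the categories and the second their frequencies, as described above.

```r
# Hypothetical stand-in so the sketch runs outside Displayr.
# In Displayr, omit this block and use the Name of your own output.
text.analysis.setup <- structure(list(), ChartData = data.frame(
    Category = c("luggage", "staff", "delay", "seat"),
    Frequency = c(42, 17, 29, 8),
    Variants = c("luggage, baggage", "staff", "delay, delayed", "seat, seats"),
    stringsAsFactors = FALSE))

# Pull the categories and their frequencies (first two columns).
freq <- attr(text.analysis.setup, "ChartData")[, 1:2]

# Sort from most to least frequent, then keep the top three categories.
freq <- freq[order(as.numeric(freq[, 2]), decreasing = TRUE), ]
top.categories <- head(freq, 3)
print(top.categories)
```

In a Calculation > Custom Code item, the last expression evaluated (here top.categories) is what you would typically want displayed as the result.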

Technical Details

This tool automatically removes a standard list of stopwords. If you'd like to keep stopwords in your text, you can modify the underlying Data > Attributes > Code in Properties to do this: add the remove.stopwords argument to the arguments list and set it to FALSE.

See Also

How to Create a Term Document Matrix

How to Show Sentiment in Word Clouds

How to Calculate Sentiment Scores for Open-Ended Responses

How to Compare Sentiment by Entities

How to Create a Predictive Tree for Text Analysis

How to Use Text Analytics to Tidy a Word Cloud

Text Analysis - Advanced - Setup Text Analysis