Sometimes it's helpful to look not just at the most frequent words or phrases in text but also at the sentiment of those words, to see how positive or negative the responses are as a whole. This article describes how to go from a standard Word Cloud:
to one that is color-coded based on the positive (green) and negative (red) sentiment of the word:
Requirements
A data file that contains a variable with the phrases you wish to use to create the Word Cloud. Text variables are represented by an A next to the variable in the Data Sources tree:
If you want to reproduce the Word Cloud from above, you can do so by pressing Data Sources > + > R and using the code below (you will give it a name as well):
load(url("http://varianceexplained.org/files/trump_tweets_df.rda"))
trump_tweets_df$text = gsub("http.*", "", trump_tweets_df$text, useBytes = TRUE)
trump_tweets_df
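To see what the gsub() line in the code above does before running it on the full data set, here is a minimal sketch on a few made-up strings (the example tweets are hypothetical and not taken from the data file). It uses the same pattern, which removes everything from the first "http" to the end of the text, so any words that follow a link are dropped as well.

```r
# Made-up example strings, for illustration only
tweets = c("Great rally tonight! https://t.co/abc123",
           "No link in this one",
           "Watch: http://example.com/video and more text")

# Same pattern as the data set code: delete from the first "http" to the end of the string
cleaned = gsub("http.*", "", tweets, useBytes = TRUE)

# Show the original and cleaned text side by side
print(data.frame(original = tweets, cleaned = cleaned))
```

If text after a link matters for your analysis, a narrower pattern would be needed; the pattern here is kept as it is in the code above.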
Please note these steps require a Displayr license.
Method
- First, we will tidy the data to make it faster for the sentiment analysis to run. From the toolbar, go to Anything > Advanced Analysis > Text Analysis > Advanced > Setup Text Analysis.
- Select your text variable from the Text variable dropdown.
- [Optional]: Update additional settings per step 3 in How to Set Up Text for Analysis.
- From the object inspector, click Save Variable(s) > Sentiment to calculate sentiment scores based on the text analysis and save scores as a new variable in the data set.
- From the toolbar, go to Calculation > Custom Code and draw a box on the page.
- Paste the code below in the R Code box in the object inspector and edit as needed:
#### Flag each response as positive or negative
#identify the variable with the sentiment scores to use; put its label in backticks below
phrase.sentiment = `Sentiment scores from text.analysis.setup`
#collapse all positive sentiments (1 or more) to 1 and all negative sentiments (-1 or less) to -1
phrase.sentiment[phrase.sentiment >= 1] = 1
phrase.sentiment[phrase.sentiment <= -1] = -1
#### Get data about the top words found in each response
#identify the specific text analysis setup that you want to use
text.analysis.setup = text.analysis.setup
#pull off the top words from the text
final.tokens = text.analysis.setup$final.tokens
#pull off the counts of the top words
counts = text.analysis.setup$final.counts
#create a binary table to flag if each top word is present in each response
td = t(vapply(flipTextAnalysis:::decodeNumericText(text.analysis.setup$transformed.tokenized),
              function(x) { as.integer(final.tokens %in% x) },
              integer(length(final.tokens))))
#### Translate response sentiment to word-level
#use the sentiment score from the response to give each top word in the response
#the sentiment score of the overall response
phrase.word.sentiment = sweep(td, 1, phrase.sentiment, "*")
#if top word is not in the response, make the sentiment for that word missing
phrase.word.sentiment[td == 0] = NA
#### See if word is statistically significant to positive or negative sentiment
#for each top word, calculate statistics
word.mean = apply(phrase.word.sentiment, 2, FUN = mean, na.rm = TRUE) #average
word.sd = apply(phrase.word.sentiment, 2, FUN = sd, na.rm = TRUE) #standard deviation
word.n = apply(!is.na(phrase.word.sentiment), 2, FUN = sum, na.rm = TRUE) #number of responses containing the word
word.se = word.sd / sqrt(word.n) #calculate standard error
word.z = word.mean / word.se #calculate z-score
word.z[word.n <= 3 | is.na(word.se)] = 0 #if word has 3 or fewer mentions, z-score is 0
#### create a final table of all the top words along with their sentiment and z-score
words = text.analysis.setup$final.tokens
x = data.frame(word = words,
               freq = counts,
               "Sentiment" = word.mean,
               "Z-Score" = word.z,
               Length = nchar(words))
#sort the table based on the mentions of the words descending
word.data = x[order(counts, decreasing = TRUE), ]
#### Calculate the color
#get number of words to show in cloud
n = nrow(word.data)
#create initial list of the color of each word as grey
colors = rep("grey", n)
#change the color of the word if it's statistically significant based on z-score
colors[word.data$Z.Score < -1.96] = "Red"
colors[word.data$Z.Score > 1.96] = "Green"
#### Create the word cloud
#load the R package with the wordcloud function
library(wordcloud2)
#create the wordcloud
wordcloud2(data = word.data[, -3],
           fontFamily = "Arial",
           backgroundColor = "transparent",
           color = colors,
           size = 0.8)
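To make the word-level sentiment and z-score steps in the code above easier to follow, here is a small self-contained sketch on toy data. The six responses, their sentiment values, and the three words are hypothetical; the list of words per response stands in for the decoded text analysis output. The 1.96 cutoff used above is approximately qnorm(0.975), the two-sided 95% point of the normal distribution, so the sketch computes the threshold that way.

```r
# Hypothetical toy data (not from the tweets): 6 responses and 3 top words
final.tokens = c("great", "bad", "news")
responses = list(c("great", "news"), "great", c("bad", "news"),
                 "great", "bad", c("great", "news"))
# Response-level sentiment, already collapsed to -1 / 0 / 1
phrase.sentiment = c(1, 1, -1, 1, -1, 0)

# Binary table: rows are responses, columns are words, 1 = word present
td = t(vapply(responses, function(x) as.integer(final.tokens %in% x),
              integer(length(final.tokens))))
colnames(td) = final.tokens

# Give each word the sentiment of the response it appears in, NA where absent
phrase.word.sentiment = sweep(td, 1, phrase.sentiment, "*")
phrase.word.sentiment[td == 0] = NA

# Per-word statistics, following the same logic as the code above
word.mean = colMeans(phrase.word.sentiment, na.rm = TRUE)
word.sd = apply(phrase.word.sentiment, 2, sd, na.rm = TRUE)
word.n = colSums(!is.na(phrase.word.sentiment))
word.se = word.sd / sqrt(word.n)
word.z = word.mean / word.se
word.z[word.n <= 3 | is.na(word.se)] = 0

# Color words whose z-score passes the two-sided 95% threshold
threshold = qnorm(0.975)
colors = rep("grey", length(final.tokens))
colors[word.z < -threshold] = "Red"
colors[word.z > threshold] = "Green"

print(data.frame(word = final.tokens, n = word.n, mean = word.mean,
                 z = word.z, color = colors))
```

In this toy data only "great" appears in more than three responses, so it is the only word that can be colored; "bad" and "news" stay grey because their z-scores are set to 0 by the minimum-mentions rule.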
Technical Notes
Note that the above code uses the function wordcloud2() to create the word cloud. If you are plotting a lot of words in a small space, some words may be left out so that the cloud fits the area. There is no warning when words are removed, and the words removed are not necessarily the ones with a small frequency. You may notice that the colors are wrong in your word cloud when this happens. Take care to set the font size in the code, or to resize the output, so that all of the words you would like to show will fit.
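One way to reduce the chance of words being dropped is to plot fewer words. The sketch below is an assumption about how that could be done, not a feature of wordcloud2(): it trims the table and the color vector with the same index so they stay aligned. The small word.data and colors objects are hypothetical stand-ins for the ones built in the Method code, and max.words is an illustrative variable name rather than a wordcloud2() argument.

```r
# Hypothetical stand-ins for word.data (sorted by count, descending) and colors
word.data = data.frame(word = c("great", "news", "bad", "fake", "win"),
                       freq = c(40, 25, 12, 9, 3))
colors = c("Green", "grey", "Red", "Red", "grey")

# Number of words to keep (illustrative name, not a wordcloud2 argument)
max.words = 3

# Keep the first max.words rows, and subset the colors with the same index
keep = seq_len(min(max.words, nrow(word.data)))
word.data.small = word.data[keep, ]
colors.small = colors[keep]

print(cbind(word.data.small, color = colors.small))
```

The trimmed word.data.small and colors.small could then be passed to wordcloud2() in place of word.data and colors. This does not guarantee that every remaining word fits, so the font size and output size still need checking.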
An alternative to using the wordcloud2() function is to use the wordcloud() function instead. There are fewer customizations (such as custom shapes, backgrounds, custom rotations, font family, and some foreign languages), but you will be warned if words are left out of the cloud. You can also set a maximum number of words to show within the function (rather than manipulating the data beforehand). To use this wordcloud function, replace all of the code underneath the #### Create the word cloud section with the code below:
#load the R package with the wordcloud function
library(wordcloud)
#create the wordcloud
wordcloud(words = word.data$word, freq = word.data$freq, #provide the words and counts
          random.order = FALSE, #keep words in order
          scale = c(7, .5), #set the high and low font size parameters
          rot.per = .1, #proportion of words that are vertical (.1 = 10%)
          min.freq = 0, #minimum threshold for counts for plotting words
          max.words = Inf, #maximum number of words to plot
          colors = colors, #vector of colors for plot
          ordered.colors = TRUE) #TRUE means that colors above should correspond to order of words provided
Documentation and more examples are available for wordcloud2() and wordcloud() online.