The content of this case study was inspired by David Robinson of the blog Variance Explained. David used sentiment analysis and other techniques to investigate the authorship of tweets sent from Trump's account on X (formerly known as Twitter) during the 2016 US presidential primary race. He hypothesized that Trump himself sent the angrier tweets, while his assistant(s) sent the rest.
A QPack containing all of the analyses that are discussed on this page can be downloaded here and imported into Displayr.
Data Set
The tweets used in this case study are stored in an archive on the web. This archive can be added to your Displayr document via Data Sources > Plus (+) > Custom Code > R - Text, using the code below:
load(url("http://varianceexplained.org/files/trump_tweets_df.rda"))
tweet.data <- trump_tweets_df
cleanFun <- function(htmlString) {
return(gsub("<.*?>", "", htmlString, useBytes = TRUE))
}
tweet.data$statusSource <- cleanFun(tweet.data$statusSource)
tweet.data
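Once the data is loaded, a couple of quick checks confirm its size and show which devices sent the tweets (the exact figures will depend on the version of the archive):
nrow(tweet.data)                 # number of tweets in the archive
table(tweet.data$statusSource)   # devices used to send them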
The data set contains a sample of 1512 tweets that originated from Trump's Twitter account @realDonaldTrump. Each case in the data set represents one tweet. The variables contain the tweet text itself, as well as various kinds of metadata provided by Twitter. The variables considered in this case study are:
- text - the raw text from each tweet.
- favoriteCount - a count of the number of Twitter users who had marked the tweet as their favorite at the time the data set was originally obtained from Twitter. This is a measure of how engaged Twitter users were with the contents of the tweets. Typical values for this data set are in the tens of thousands.
- Source - a variable that records the device used to send the tweet (it appears as statusSource in the raw data). The two main devices in the data set are iPhone and Android. This does not distinguish between different devices of the same type, so two people tweeting from iPhones cannot be differentiated here.
Text and Word Cloud
The original text from the tweets can be viewed by creating a table of the text variable; a word cloud is then created by selecting Visualization > Text Analysis > Word Cloud. The contents of the text can be viewed in the table below:
By using Displayr's Word Cloud feature we can view the words in proportion to how often they occur in the text. This gives us an overview of the themes:
We can see prominent signals of "thank" (from tweets thanking people for attending events) and "crooked" (from one of Trump's favorite slogans, attacking "Crooked Hillary"). The visualization is dynamic: you can combine words by dragging and dropping one on top of another, and you can remove words that are not of interest by dragging them to the right-hand section of the cloud (labeled Ignore when viewed in Displayr). For example, to count Hillary and Clinton as the same concept in the word cloud, drag Hillary on top of Clinton. Clinton then becomes larger (its count now includes any text that mentions either word), and Hillary no longer appears as a separate word in the cloud.
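The counts underlying a word cloud like this can be approximated directly in R. The sketch below is not Displayr's implementation, just an illustration: it lowercases the text, splits it into words, and tabulates the most frequent ones:
# Split the lowercased tweet text into words and count them
words <- unlist(strsplit(tolower(tweet.data$text), "[^a-z']+"))
words <- words[words != ""]                   # drop empty tokens
sort(table(words), decreasing = TRUE)[1:20]   # the 20 most frequent words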
Categorization
Using Displayr's categorization tools, we can assign tweets to categories manually, or with the built-in tool at Anything > Advanced Analysis > Text Analysis > Text Categorization, which uses an AI algorithm to find similar responses in a "smarter" way than merely searching for exact keywords. Manual categorization remains the most accurate way to classify text (so long as the person doing it does a good job!). In this example, the original tweet text has been coded into a variable set called Tweet Themes. The themes that we identified in this case study are:
- Speeches, interviews, rallies - for tweets announcing speeches and rallies, or tweets that thank people for attending these rallies.
- About the media (negative) - for tweets that denounced bad or biased media coverage.
- Attacks on Clinton, Obama, Democrats, supporters - for tweets that attack Trump's opponents on the other side of politics.
- Positives about Trump or his campaign - for tweets that show positive messages about the candidate or his campaign progress.
- Policies, slogans, calls to arms - for tweets that promote Trump's policies or repeat his (positive) slogans.
- Other - for tweets that talk about the Olympics, or otherwise don't fit the set of themes that we have used here.
The coded variable set can then be shown in a table with other variables from the data set. One important measure that we consider throughout this case study is the favoriteCount variable, which counts the number of times each tweet was marked as a favorite by a Twitter user who read it. This can be used as a measure of engagement: it shows how interested in, or approving of, a given tweet readers were.
The table shows that tweets about Speeches, interviews, rallies generated a significantly lower level of engagement, and that there is not much differentiation among the other themes that were coded.
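For illustration, the same comparison could be computed in R with tapply. The themes variable below is a random placeholder, since the real codes live in the Tweet Themes variable set inside Displayr:
# 'themes' does not exist in the raw data; these labels are random
# placeholders purely so the line below runs - in practice they come
# from the manual coding in Displayr.
themes <- factor(sample(c("Speeches, interviews, rallies",
                          "About the media (negative)", "Other"),
                        nrow(tweet.data), replace = TRUE))
tapply(tweet.data$favoriteCount, themes, mean)   # mean favorites per theme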
In this example, only a subset of 309 tweets out of the 1512 has been coded. This is so that the remaining tweets can be automatically coded using a predictive algorithm, as shown further below in the Automatic Categorization section.
Text Setup and Cleaning
More automated kinds of text analysis (like sentiment analysis and automatic classification) typically require an initial setup and cleaning phase where the text is first broken up into a collection of individual words (tokenized), and then each word is processed to determine if it should be kept in the analysis, modified, or excluded. The general approach is to try to reduce the total number of words being included where possible. Common cleaning techniques include:
- Stopword removal, which is the removal of common words like the, to, and of which don't tend to convey a lot of information about the meaning of the text by themselves.
- Spelling correction, which is where misspelled words are replaced with the most likely correction found in the text.
- Stemming, where words are replaced by their root words or stems. In English, this largely involves the removal of suffixes (a short R sketch of these steps follows this list).
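Here is a minimal sketch of these steps, assuming the SnowballC package is installed for stemming; spelling correction is omitted because it requires a dictionary-based package:
library(SnowballC)   # Porter stemmer; assumed installed

tokens <- unlist(strsplit(tolower(tweet.data$text), "[^a-z']+"))
tokens <- tokens[tokens != ""]

# Stopword removal with a tiny illustrative list (real lists are much longer)
stop.words <- c("the", "to", "of", "and", "a", "in", "is", "for", "on", "at")
tokens <- tokens[!tokens %in% stop.words]

# Stemming: reduce words to their roots, e.g. "attacking" -> "attack"
tokens <- wordStem(tokens, language = "english")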
In Displayr, the initial cleaning is done using Anything > Advanced Analysis > Text Analysis > Advanced > Setup Text Analysis. This option should be used first before using the other Text Analysis options available in the menu.
This does stopword removal automatically and has options for spelling correction and stemming. Further options are available for manual changes to the text. For instance, in this case study we have:
- Removed some of the text that comes from website links in the tweets, like https, co, and amp, by adding these into the Remove these words section of the options.
- Replaced all instances of clinton with the word hillary, so as to have all mentions of Hillary Clinton treated as a single entity in the text, by adding clinton:hillary into the Replace these words section of the options. We have done the same for sanders and bernie.
- Increased the Minimum word frequency to 10, thereby excluding any words that occur fewer than ten times in the text. This helps to remove words that do not convey a lot of information, including misspelled words that were not captured by the spelling correction.
The result is a table which shows the words and their frequencies. By looking at the table, further decisions could be made about words to remove or combine.
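A rough approximation of that frequency table, including the replacements and the minimum-frequency cut-off, can be sketched in R (this mirrors the options above, but is not Displayr's implementation):
tokens <- unlist(strsplit(tolower(tweet.data$text), "[^a-z']+"))
tokens <- tokens[tokens != ""]

# Mirror the Replace these words option
tokens[tokens == "clinton"] <- "hillary"
tokens[tokens == "sanders"] <- "bernie"

# Mirror the Remove these words option for link fragments
tokens <- tokens[!tokens %in% c("https", "co", "amp")]

# Mirror the Minimum word frequency option of 10
counts <- table(tokens)
sort(counts[counts >= 10], decreasing = TRUE)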
When you use Setup Text Analysis, an output will appear in your Pages section. This output stores information about the cleaning and processing that has been done, and it can then be used as an input to other analyses, as discussed in the sections below.
Sentiment Analysis
Sentiment analysis quantifies the tone of a piece of text by identifying and scoring positive and negative words. In Displayr, a variable containing the sentiment score for each tweet is created by selecting the setup item (see above) and then running Anything > Advanced Analysis > Text Analysis > Sentiment. The new variable can be used in crosstabs with other variable sets from the data set.
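To illustrate the idea (this is not Displayr's algorithm), a simple lexicon-based score counts positive words and subtracts negative ones; the word lists below are tiny placeholders:
# Tiny illustrative lexicons - real sentiment lexicons contain
# thousands of words; Displayr's actual lexicon is not shown here
positive <- c("great", "win", "thank", "best", "amazing")
negative <- c("crooked", "bad", "dishonest", "weak", "sad")

score.tweet <- function(text) {
    words <- unlist(strsplit(tolower(text), "[^a-z']+"))
    sum(words %in% positive) - sum(words %in% negative)
}
sentiment <- sapply(tweet.data$text, score.tweet, USE.NAMES = FALSE)

# Relate sentiment to engagement and to the sending device,
# foreshadowing the crosstabs discussed below
cor(sentiment, tweet.data$favoriteCount)
tapply(sentiment, tweet.data$statusSource, mean)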
From the sentiment scores, we find (unsurprisingly) that the average sentiment tends to be highest for tweets coded as Positives about Trump or his campaign and Speeches, interviews, rallies. On the other hand, scores for tweets classified as Attacks on Clinton, Obama, Democrats, supporters are significantly lower than average, and are in fact negative (meaning that, on average, these tweets contain more negative words than positive words).
By crosstabulating the sentiment scores by the favoriteCount, we find a small but significant negative correlation, indicating that tweets with more negative language tended to engage Trump's Twitter followers more.
Finally, by considering the variable called Source, which shows us what kind of device was used to author the tweets, we find a significant difference in the sentiment between those tweets sent by an Android, and those sent by an iPhone. This is the key result that David Robinson discussed in his blog post - those tweets sent by the Android tend to be much more negative.
While it is possible that there are multiple people publishing tweets on this account for each type of device, the result here shows a very different tone for tweets being sent out by the different devices.
Term Document Matrix
The term document matrix represents the text as a table in which each row corresponds to one of the text responses (here, tweets) and each column is a binary variable for one of the words used in the analysis, taking a value of 1 when the word is present in that row's text and 0 when it is not. This is a way of communicating the outcomes of the text setup and cleaning phase to other algorithms. If you want to design your own custom analysis, it can be useful to have the term document matrix computed explicitly within your project, and this can be done using Anything > Advanced Analysis > Text Analysis > Advanced > Term Document Matrix.
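Conceptually, the matrix takes only a few lines of R to build. The sketch below uses a small illustrative vocabulary; Displayr derives the actual vocabulary from the Setup Text Analysis output:
vocab <- c("hillary", "thank", "crooked", "media", "bernie")  # illustrative vocabulary

# One row per tweet, one binary (0/1) column per word
token.list <- strsplit(tolower(tweet.data$text), "[^a-z']+")
tdm <- sapply(vocab, function(w)
    as.integer(sapply(token.list, function(tk) w %in% tk)))
head(tdm)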
The automatic coding tool that is described below uses the term document matrix explicitly as one of its inputs. The predictive tree computes the term document matrix in the background for its own calculation and does not rely on the presence of the term document matrix as an item in the report.
Note that the original version of the term document matrix shown in the webinar displayed the full contents of the term document matrix as a table. This turned out to be an inefficient way to store this data, particularly for larger data sets, and so the term document matrix now displays information about the underlying matrix rather than displaying its contents in full.
Predictive Tree
A predictive tree based on the text can be created using Visualization > Text Analysis > Predictive Tree. This is similar to Classification And Regression Trees (CART), which is designed for creating a predictive tree between variables in the data file (as opposed to using the text).
In this case study, we used the favoriteCount as the Outcome, or variable to be predicted. Each branch of the tree shows where the presence of a particular word in a tweet predicts a much higher or lower average number of favorites. The width of each branch shows how many tweets are included in that part of the sample, and the color of the branch indicates the average value of the outcome variable, with darker reds indicating lower average values, and lighter reds and blues indicating higher ones. The tree diagram is interactive: hovering over a node shows additional information about the sample and outcome variable at that node, and clicking a node hides or shows that part of the tree.
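Displayr's Predictive Tree is its own implementation, but a CART-style tree fitted with the rpart package over a binary term matrix conveys the same idea (the vocabulary below is illustrative):
library(rpart)   # CART implementation; assumed installed

# Rebuild the binary term matrix from the previous sketch
vocab <- c("hillary", "bernie", "thank", "media", "crooked", "spending")
token.list <- strsplit(tolower(tweet.data$text), "[^a-z']+")
tdm <- sapply(vocab, function(w)
    as.integer(sapply(token.list, function(tk) w %in% tk)))

# Regression tree predicting favorites from word presence
tree.data <- data.frame(favoriteCount = tweet.data$favoriteCount, tdm)
fit <- rpart(favoriteCount ~ ., data = tree.data, method = "anova")
print(fit)   # each split shows a word that shifts the mean favorite count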
The tree shows significantly higher numbers of favorites for tweets that talk about Hillary Clinton, and an even higher average favorite count for tweets that use both the words hillary and spending. Similarly high scores were observed for tweets containing the words bernie, law, and united.
Other variables can be analyzed by changing the Outcome selection.
Automatic Categorization
Finally, we can automatically categorize the remaining tweets using a predictive algorithm for Unstructured Text via Anything > Advanced Analysis > Text Analysis > Automatic Categorization > Unstructured Text.
By using our manually coded variable set as the Existing Categorization, the tool will map the remaining responses to the code frame we have created.
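The general idea of mapping uncoded text onto an existing code frame can be sketched with a k-nearest-neighbours classifier over the term document matrix. This is only an illustration; Displayr's tool uses its own, more sophisticated algorithm, and the labels below are random placeholders:
library(class)   # provides knn(); assumed installed

vocab <- c("hillary", "bernie", "thank", "media", "crooked")
token.list <- strsplit(tolower(tweet.data$text), "[^a-z']+")
tdm <- sapply(vocab, function(w)
    as.integer(sapply(token.list, function(tk) w %in% tk)))

# Hypothetical split: suppose the first 309 tweets carry the manual codes.
# The labels are random placeholders so the sketch runs; in practice
# they come from the Tweet Themes variable set.
coded <- seq_len(309)
themes <- factor(sample(c("Speeches", "Attacks", "Other"), length(coded),
                        replace = TRUE))

predicted <- knn(train = tdm[coded, ], test = tdm[-coded, ], cl = themes, k = 5)
table(predicted)   # predicted themes for the uncoded tweets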