This article describes how to use a term document matrix or a sparse matrix to feed your text analysis into a statistical algorithm, such as a random forest model, for further analysis.
Requirements
Please note this requires the Data Stories module or a Displayr license.
Method
- From the toolbar, select Calculation > Custom Code.
- Enter the R code below in Data > R Code:
# Our package containing the Random Forest routine
library(flipMultivariates)
# The package needed to convert the sparse matrix
library(tm)
# Convert the sparse matrix before use
tdm <- as.matrix(term.document.matrix)
# Ensure the column names are appropriate for use in an R model
colnames(tdm) <- make.names(colnames(tdm))
# Combine the outcome variable with the term document matrix
df <- data.frame(statusSource = statusSource, tdm)
# Create the R Formula which describes the relationship we are interrogating
f <- formula(paste0("statusSource ~ ", paste0(colnames(tdm), collapse = "+")))
# Run the random forest model
rf <- RandomForest(f, df)
The code above first converts the term document matrix (term.document.matrix) into a standard R matrix and makes its column names valid for use in a model. It then combines the matrix with the dependent variable (statusSource) in a data frame and builds an R formula that relates the dependent variable to every column of the term document matrix. You will need to replace each mention of term.document.matrix with the name of your own term document matrix or sparse matrix, and each mention of statusSource with the name of your dependent variable.
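To see what the column renaming and formula steps do before running them on real data, the following minimal sketch reproduces them on a small made-up matrix using base R only. The names toy.tdm, toy.outcome, toy.df and toy.f, as well as the terms and outcome values, are hypothetical and used purely for illustration. The sketch assumes the matrix has one row per document and one column per term, and it stops short of fitting a model.

```r
# Toy stand-in for the converted term document matrix (hypothetical data):
# one row per document, one column per term.
toy.tdm <- matrix(c(1, 0, 2,
                    0, 1, 0,
                    1, 1, 0,
                    0, 0, 3),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(NULL, c("good", "not-bad", "2nd")))

# Hypothetical dependent variable with one value per document
toy.outcome <- factor(c("iPhone", "Android", "iPhone", "Android"))

# Terms such as "not-bad" and "2nd" are not valid R names, so rewrite them
colnames(toy.tdm) <- make.names(colnames(toy.tdm), unique = TRUE)

# The dependent variable must line up with the rows of the matrix
stopifnot(length(toy.outcome) == nrow(toy.tdm))

# Combine the outcome with the terms and build the formula from the column names
toy.df <- data.frame(outcome = toy.outcome, toy.tdm)
toy.f <- formula(paste0("outcome ~ ", paste0(colnames(toy.tdm), collapse = " + ")))

print(colnames(toy.tdm))
print(toy.f)
```

In this sketch, "not-bad" becomes "not.bad" and "2nd" becomes "X2nd", so every term in the formula matches a column in the data frame. If the stopifnot check fails on your own data, the dependent variable and the matrix do not cover the same set of documents, and the model should not be run until they do.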
Once the Random Forest model has run using the R code above, you will see an output similar to the one below, but based on your own text data: