This article describes how to create a predictive tree that shows how an outcome variable is predicted by the results of text analysis. The tree describes which words and phrases from the text play a significant role in determining the value of the outcome. The text analysis must be set up first, as described in How to Set Up Text for Analysis; all settings for processing the text are controlled by that setup item.
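To make the idea of a word "determining the value of the outcome" concrete, the following is a minimal sketch of a single tree split on word presence. It is an illustration only, not Displayr's implementation: the function name `best_word_split`, the whitespace tokenization, and the sample responses and ratings are all hypothetical. It assumes a numeric outcome and picks the word whose presence or absence leaves the smallest within-group squared error.

```python
from statistics import mean


def best_word_split(texts, outcomes):
    """Find the single word whose presence best separates a numeric outcome.

    Returns (word, mean_with_word, mean_without_word, sse), or None if no
    word splits the responses into two non-empty groups.
    """
    # Hypothetical, very simple tokenization: lower-case and split on spaces.
    docs = [set(t.lower().split()) for t in texts]
    vocab = set().union(*docs)

    best = None
    for word in sorted(vocab):
        with_word = [y for d, y in zip(docs, outcomes) if word in d]
        without = [y for d, y in zip(docs, outcomes) if word not in d]
        if not with_word or not without:
            continue  # the word does not split the sample
        m_with, m_without = mean(with_word), mean(without)
        # Sum of squared errors within the two groups after the split.
        sse = (sum((y - m_with) ** 2 for y in with_word)
               + sum((y - m_without) ** 2 for y in without))
        if best is None or sse < best[3]:
            best = (word, m_with, m_without, sse)
    return best


# Made-up responses and satisfaction ratings, for illustration only.
texts = [
    "slow service and rude staff",
    "great food fast service",
    "rude waiter",
    "great value",
    "food was cold and slow",
    "great friendly staff",
]
outcomes = [2, 9, 1, 8, 3, 9]

word, mean_with, mean_without, sse = best_word_split(texts, outcomes)
print(word, round(mean_with, 2), round(mean_without, 2))
```

A full tree repeats this kind of search within each resulting branch; the actual tree is built from the processed text produced by the setup item rather than from raw whitespace-separated words.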
Requirements
A Displayr document containing a setup text analysis object.
Method
- From the toolbar, select Visualization > Text Analysis > Predictive Tree.
- Under Data > Outcome select a Numeric, Nominal, or Ordinal variable containing the data that is to be predicted.
- Under Data > Setup item select a setup text analysis object.
- Change any other settings according to your requirements (see options below).
- Ensure the Calculate automatically box is checked, or click Calculate.
OPTIONAL: The following items can be modified to adjust the output:
- Output:
  - Sankey: A Sankey Tree, which is a graphical representation of the tree that provides information about the sample contained in each of the tree branches.
  - Tree: A more traditional, but plain-looking, tree diagram.
  - Text: Textual information which contains the details of how the tree is split at each branch.
  - Table: A table that displays the outcome variable next to the original text from the text analysis, and the transformed text from the text analysis. This allows you to see the outcomes that were given for each text response.
- Missing data - See Missing Data Options.
- Pruning - The type of post-pruning applied to the tree. Choices are:
  - Minimum error: Prune back leaf nodes to create the tree with the smallest cross-validation error.
  - Smallest tree: Prune to create the smallest tree with cross-validation error at most 1 standard error greater than the minimum error.
  - None: Retain the tree as it has been built. Note that choosing this option without Early stopping is prone to overfitting.
- Early stopping - Whether to stop splitting nodes before the fit stops improving. Setting this may decrease the time to build the tree, potentially at the cost of not finding the tree with the best accuracy.
- Outcome category labels - Whether to shorten category labels from categorical outcome variables. The choices are:
  - Full labels: The complete labels.
  - Abbreviated labels: Labels that have been shortened by taking the first few letters from each word.
  - Letters: Letters from the alphabet where "a" corresponds to the first category, "b" to the second category, and so on.
- Allow long-running calculations - Predictors with m categories require evaluation of 2^(m - 1) split points, which may cause calculations to run for a long time. For example, a predictor with 31 categories would require 2^30 (over one billion) split points. Checking this box allows categorical variables with more than 30 categories to be included in Predictors.
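The difference between the Minimum error and Smallest tree choices under Pruning can be illustrated with a short sketch. This is not Displayr's code: the `Candidate` type, the function names, and the cross-validation figures are hypothetical, and the sketch assumes that "1 standard error" refers to the standard error of the minimum-error tree, which is a common convention for this rule.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """One candidate pruned tree with its cross-validation results (made up)."""
    leaves: int
    cv_error: float
    cv_se: float


def minimum_error(candidates):
    """Pick the tree with the smallest cross-validation error.

    Ties are broken in favour of the tree with fewer leaves.
    """
    return min(candidates, key=lambda c: (c.cv_error, c.leaves))


def smallest_tree(candidates):
    """Pick the smallest tree within 1 standard error of the minimum error.

    Assumption: the standard error used is that of the minimum-error tree.
    """
    best = minimum_error(candidates)
    threshold = best.cv_error + best.cv_se
    eligible = [c for c in candidates if c.cv_error <= threshold]
    return min(eligible, key=lambda c: c.leaves)


# Hypothetical cross-validation table for trees of increasing size.
candidates = [
    Candidate(leaves=1, cv_error=1.00, cv_se=0.05),
    Candidate(leaves=2, cv_error=0.70, cv_se=0.05),
    Candidate(leaves=4, cv_error=0.58, cv_se=0.04),
    Candidate(leaves=7, cv_error=0.55, cv_se=0.04),
    Candidate(leaves=12, cv_error=0.60, cv_se=0.05),
]

print("Minimum error:", minimum_error(candidates).leaves, "leaves")
print("Smallest tree:", smallest_tree(candidates).leaves, "leaves")
```

With these made-up figures, Minimum error keeps the 7-leaf tree, while Smallest tree accepts the 4-leaf tree because its error (0.58) is within one standard error of the minimum (0.55 + 0.04). The smaller tree trades a small amount of cross-validated accuracy for a simpler, easier-to-read result.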