How to Create a Classification and Regression Tree (CART)

A Classification and Regression Tree (CART) is a predictive model that explains how the values of an outcome variable can be predicted from the values of other variables. The CART output is a decision tree in which each branch is a split on a predictor variable and each end node contains a prediction for the outcome variable.
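To make the structure concrete, here is a minimal Python sketch of such a tree and of how a prediction is read from it. The tree, its split values, and the "Cola A" / "Cola B" labels are invented for illustration only; they are not output from the software, and the nested-dictionary layout is just one possible way to represent a tree.

```python
# Hypothetical tree: internal nodes split on one predictor,
# end nodes (leaves) hold a prediction for the outcome variable.
tree = {
    "predictor": "Age",
    "threshold": 35,  # made-up split: ages below 35 go left
    "left": {"prediction": "Cola A"},
    "right": {
        "predictor": "Gender",
        "categories": {"Female"},  # made-up split: these categories go left
        "left": {"prediction": "Cola B"},
        "right": {"prediction": "Cola A"},
    },
}


def predict(node, row):
    """Follow the splits from the root until an end node is reached."""
    while "prediction" not in node:
        value = row[node["predictor"]]
        if "threshold" in node:
            # Numeric predictor: compare against the split value.
            go_left = value < node["threshold"]
        else:
            # Categorical predictor: check membership in the left-hand group.
            go_left = value in node["categories"]
        node = node["left"] if go_left else node["right"]
    return node["prediction"]


print(predict(tree, {"Age": 50, "Gender": "Female"}))
```

Each respondent follows exactly one path from the root to an end node, so the tree can also be read as a set of if-then rules.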

This article describes how to create a Classification and Regression Tree (CART). The example below shows an interactive tree created with the Sankey output option, using 'Preferred Cola' as the Outcome variable and 'Age', 'Gender', and 'Exercise Frequency' as the Predictor variables.

Requirements

An Outcome variable to be predicted (e.g., Preferred Cola).
Predictor variables to be considered as predictors of the outcome variable (e.g., Age, Gender, Exercise Frequency). Note: Predictors that are considered to be uninformative are automatically excluded from the model.

Method

From the Report tree, select + > Advanced Analysis > Machine Learning > Classification and Regression Trees (CART).
In Properties, go to the Data tab.
On the Outcome menu, select the variable to be predicted by the predictor variables.
Select the predictor variable(s) from the Predictor(s) list.
OPTIONAL: Select the algorithm you wish to use in Data > Algorithm. By default, the CART algorithm is used.
OPTIONAL: Select the desired Output type:
- Sankey: An interactive tree (as shown above). This is the default.
- Tree: A greyscale tree plot.
- Text: A text representation of the tree.
- Prediction-Accuracy Table: A table comparing the observed and predicted values of the outcome.
- Cross Validation: A plot of the cross-validation accuracy versus the size of the tree in terms of the number of leaves.
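As a rough illustration of what a prediction-accuracy table contains, the following Python sketch cross-tabulates observed against predicted categories. The function name and the sample data are hypothetical, and the layout of the table produced by the software may differ.

```python
from collections import Counter


def prediction_accuracy_table(observed, predicted):
    """Cross-tabulate observed (rows) against predicted (columns) outcomes."""
    counts = Counter(zip(observed, predicted))
    labels = sorted(set(observed) | set(predicted))
    table = {o: {p: counts[(o, p)] for p in labels} for o in labels}
    # Overall accuracy is the share of cases on the diagonal.
    accuracy = sum(counts[(label, label)] for label in labels) / len(observed)
    return table, accuracy


# Made-up observed and predicted outcomes for four respondents.
observed = ["A", "A", "B", "B"]
predicted = ["A", "B", "B", "B"]
table, accuracy = prediction_accuracy_table(observed, predicted)
print(table)
print(accuracy)
```

Cells on the diagonal count correct predictions; off-diagonal cells show which categories the model confuses.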
OPTIONAL: Select the desired Missing Data treatment. (See Missing Data Options).
OPTIONAL: Select Variable names to display variable names in the output instead of labels.
OPTIONAL: Select the type of post-pruning applied to the tree from the Pruning menu:
- Minimum error: Prune back leaf nodes to create the tree with the smallest cross-validation error.
- Smallest tree: Prune to create the smallest tree whose cross-validation error is at most 1 standard error greater than the minimum error.
- None: Retain the tree as it has been built. Note that choosing this option without Early stopping (see below) is prone to overfitting.
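The difference between the Minimum error and Smallest tree options can be sketched as follows. This is an illustration of the two selection rules as described above, not the software's implementation; the function name, the tuple layout, and the cross-validation figures are all assumptions made for the example.

```python
def choose_tree_size(cv_results, rule="Minimum error"):
    """Pick a tree size from cross-validation results.

    cv_results: list of (n_leaves, cv_error, cv_std_error) tuples,
    one per candidate pruned tree (hypothetical layout).
    """
    best = min(cv_results, key=lambda r: r[1])
    if rule == "Minimum error":
        # The tree with the smallest cross-validation error.
        return best[0]
    if rule == "Smallest tree":
        # The smallest tree whose error is within 1 standard error of the minimum.
        cutoff = best[1] + best[2]
        eligible = [r for r in cv_results if r[1] <= cutoff]
        return min(eligible, key=lambda r: r[0])[0]
    raise ValueError(f"Unknown rule: {rule}")


# Made-up cross-validation results: (leaves, error, standard error).
cv_results = [(1, 0.50, 0.04), (3, 0.33, 0.03), (5, 0.31, 0.03), (9, 0.32, 0.03)]
print(choose_tree_size(cv_results, "Minimum error"))
print(choose_tree_size(cv_results, "Smallest tree"))
```

With these figures the two rules select different trees: the 1-standard-error rule accepts a slightly higher error in exchange for a simpler tree.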
OPTIONAL: Tick the Early stopping box if you wish to stop splitting nodes when the fit stops improving.
OPTIONAL: To shorten category labels from categorical predictor variables, select the desired option from the Predictor category labels menu:
- Full labels: The complete labels.
- Abbreviated labels: Labels that have been shortened by taking the first few letters from each word.
- Letters: Letters from the alphabet where "a" corresponds to the first category, "b" to the second category, and so on.
OPTIONAL: To shorten category labels of the outcome variable, select the desired option from the Outcome category labels menu. The choices are the same as for the Predictor category labels menu above.
OPTIONAL: Select Allow long-running calculations to allow categorical variables with more than 30 categories to be included among the Predictor variables. Note that a predictor with m categories requires evaluating roughly 2^(m - 1) candidate splits, so enabling this option may cause calculations to run for a long time.
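To see why predictors with many categories are slow, the Python sketch below enumerates every way of dividing a set of categories into two non-empty groups. It relies on the standard combinatorial result that m categories give 2^(m - 1) - 1 such divisions, which is of the same order as the figure quoted above. Whether the software evaluates every division or uses a shortcut is not stated here, so treat this as an upper-bound illustration; the function names and category values are hypothetical.

```python
from itertools import combinations


def binary_splits(categories):
    """List each way of dividing the categories into two non-empty groups."""
    first, rest = categories[0], categories[1:]
    splits = []
    # Keep the first category on the left so mirror-image splits are not
    # counted twice; stop before the right-hand group would be empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = (first,) + extra
            right = tuple(c for c in rest if c not in extra)
            splits.append((left, right))
    return splits


def split_count(m):
    """Number of two-group divisions of m categories."""
    return 2 ** (m - 1) - 1


print(len(binary_splits(("Never", "Monthly", "Weekly", "Daily"))))
print(split_count(30))
```

The count doubles with each additional category, which is why the 30-category limit applies unless long-running calculations are allowed.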