Gradient Boosting creates a predictive model for either regression or classification from an ensemble of underlying tree or linear models.
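To make the idea concrete, here is a minimal sketch of gradient boosting for regression, assuming squared-error loss and one-split "stump" base learners. It is purely illustrative; xgboost's actual implementation adds regularization, second-order gradients, subsampling, and much more.

```python
# A minimal sketch of gradient boosting for regression, assuming squared-error
# loss and one-split "stump" base learners (illustrative, not xgboost itself).

def fit_stump(x, r):
    """Find the single threshold split on x that best fits residuals r."""
    best = None
    for t in sorted(set(x)):
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - lm) ** 2 for ri in left) + sum((ri - rm) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def boost(x, y, rounds=20, lr=0.3):
    """Each round fits a stump to the current residuals and adds a
    shrunken copy of it to the ensemble."""
    base = sum(y) / len(y)              # start from the mean outcome
    pred = [base] * len(y)
    stumps = []
    for _ in range(rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        s = fit_stump(x, resid)
        stumps.append(s)
        pred = [pi + lr * s(xi) for pi, xi in zip(pred, x)]
    return lambda xi: base + lr * sum(s(xi) for s in stumps)

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.8, 4.1, 4.0]
model = boost(x, y)      # predictions near 1 for x <= 3, near 4 above
```

The key move is that every new learner is fit to the residuals of the ensemble so far, so each round corrects what the previous rounds got wrong.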
Requirements
- Familiarity with the Structure and Value Attributes of Variable Sets
- Predictor variables (aka features or independent variables) - these can be numeric, categorical, or binary
- An outcome variable (aka dependent variable) - this variable can be numeric or categorical
Method
From the toolbar, go to Anything > Advanced Analysis > Machine Learning > Gradient Boosting, or in the Report tree select + > Advanced Analysis > Machine Learning > Gradient Boosting.
In the object inspector, select your Outcome variable.
In Predictor(s), select your predictor variable(s). The fastest way to do this is to select all of them in the Data Sources tree and drag them into the Predictor(s) box.
From the Algorithm menu, select Gradient Boosting.
From the Object Inspector > Data > Output menu, select Importance. This will tell you which variables have the most predictive power. Other output options include:
Accuracy - Produces measures of the goodness of model fit. For categorical outcomes, the breakdown by category is shown.
Prediction-Accuracy Table - Produces a table relating the observed and predicted outcome. Also known as a confusion matrix.
Detail - Text output from the underlying xgboost package.
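As an illustration of what the Prediction-Accuracy Table reports, the same table can be built by cross-tabulating observed and predicted categories. A minimal sketch, assuming a binary churn outcome (the observed and predicted values here are made up):

```python
# A sketch of a prediction-accuracy (confusion) table for a binary outcome.
from collections import Counter

observed  = ["churn", "stay", "stay", "churn", "stay", "churn", "stay", "stay"]
predicted = ["churn", "stay", "churn", "churn", "stay", "stay", "stay", "stay"]

table = Counter(zip(observed, predicted))    # (observed, predicted) -> count
labels = sorted(set(observed))
for obs in labels:
    print(obs, [table[(obs, pred)] for pred in labels])

accuracy = sum(o == p for o, p in zip(observed, predicted)) / len(observed)
print("overall accuracy:", accuracy)         # 6 of 8 correct -> 0.75
```

The diagonal cells count correct predictions; off-diagonal cells show which categories the model confuses.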
Missing data - Lets you decide how to treat missing data. You can either exclude cases with missing data, abort model creation if there are missing values (Error if missing data), or impute missing values (Imputation (replace missing values with estimates)). In this example, we excluded cases with missing values.
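The exclusion and imputation treatments can be sketched in a few lines, assuming None marks a missing value (illustrative only, not the actual handling inside xgboost):

```python
# A sketch of two missing-data strategies: listwise exclusion vs. mean imputation.
rows = [[1.0, 2.0], [None, 3.0], [4.0, None], [5.0, 6.0]]

# Exclude cases with missing data: drop any row containing a None.
complete = [r for r in rows if None not in r]

# Imputation: replace each missing value with its column mean,
# computed from the observed values in that column.
def impute(rows):
    cols = list(zip(*rows))
    means = [sum(v for v in c if v is not None) / sum(v is not None for v in c)
             for c in cols]
    return [[m if v is None else v for v, m in zip(r, means)] for r in rows]

imputed = impute(rows)    # the two Nones become their column means
```

Exclusion shrinks the sample (here from four cases to two), while imputation keeps every case at the cost of estimated values.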
Random Seed - Allows you to check how consistent the model is when you change the starting point for random selection. You can be more confident in your model if it produces consistent results regardless of the seed you use.
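A seed-sensitivity check can be sketched as follows, assuming the fit involves a random train/test split. The "model" here is a deliberately trivial majority-class predictor; the point is re-fitting under several seeds and comparing the scores:

```python
# A sketch of checking model stability across random seeds.
import random

def fit_and_score(data, seed):
    rng = random.Random(seed)               # the seed controls the random split
    shuffled = data[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    train, test = shuffled[:half], shuffled[half:]
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)   # trivial stand-in model
    return sum(y == majority for _, y in test) / len(test)

data = [(i, "stay" if i % 3 else "churn") for i in range(30)]
scores = [fit_and_score(data, seed) for seed in (1, 2, 3, 4, 5)]
print(scores)    # similar scores across seeds suggest a stable model
```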
Booster provides two methods for improving the model's accuracy: gbtree and gblinear. Machine learning boosting is most often done with an underlying tree model, although you should try both options to see which one is better.
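One way to see why the choice of booster matters: with squared-error loss, boosting linear base learners converges to the single least-squares fit, so it cannot capture non-linear structure the way boosted trees can. A sketch of that behavior (illustrative only, not xgboost's gblinear):

```python
# Boosting with a linear base learner: each round fits a least-squares line
# to the residuals. The ensemble converges to the single OLS fit of y on x.

def fit_line(x, r):
    """Ordinary least-squares line through (x, r)."""
    n = len(x)
    mx, mr = sum(x) / n, sum(r) / n
    b = sum((xi - mx) * (ri - mr) for xi, ri in zip(x, r)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = mr - b * mx
    return lambda xi: a + b * xi

def boost_linear(x, y, rounds=50, lr=0.3):
    pred = [0.0] * len(y)
    lines = []
    for _ in range(rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        f = fit_line(x, resid)
        lines.append(f)
        pred = [pi + lr * f(xi) for pi, xi in zip(pred, x)]
    return lambda xi: lr * sum(f(xi) for f in lines)

x = [1, 2, 3, 4]
y = [2.0, 4.1, 5.9, 8.0]       # roughly y = 2x
model = boost_linear(x, y)     # converges to the OLS line through the data
```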
OPTIONAL: Variable names - Displays Variable Names in the output instead of labels.
OPTIONAL: Grid search - Whether to search the parameter space in order to tune the model. If not checked, the default xgboost parameters are used. Checking this usually produces a more accurate predictor, at the cost of longer runtime.
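A grid search amounts to exhaustively evaluating every combination in a parameter grid and keeping the best. In the sketch below, score() is a hypothetical stand-in for cross-validated accuracy; a real search would refit the model for each parameter combination:

```python
# A sketch of grid search as exhaustive evaluation over a small parameter grid.
import itertools

def score(depth, lr):
    # Hypothetical: pretend accuracy peaks at depth=4, lr=0.1.
    return 1.0 - abs(depth - 4) * 0.05 - abs(lr - 0.1)

grid = {"depth": [2, 4, 6], "lr": [0.05, 0.1, 0.3]}
best = max(
    (dict(zip(grid, combo)) for combo in itertools.product(*grid.values())),
    key=lambda params: score(**params),
)
print(best)    # {'depth': 4, 'lr': 0.1}
```

The runtime cost is the product of the grid sizes, which is why tuning takes longer than a single fit.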
OPTIONAL: Under Save Variable(s), click Predicted values to save the predicted values from the model. This will create a new variable called something like Predicted values from model. You can use this variable to evaluate the model's accuracy.
OPTIONAL: Click Visualization > Chart > Stacked Column Chart. For Rows, select your outcome variable, and for Columns select the variable with the predicted values saved in the previous step. In this example, the model does a better job predicting which customers won't churn, though the 73% accuracy for churners is still respectable.