Gradient Boosting creates a predictive model for either regression or classification from an ensemble of underlying tree or linear regression models. This article describes how to use Gradient Boosting in Displayr.
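To make the idea of an ensemble concrete, here is a minimal sketch of boosting for a numeric outcome with a single predictor. It is not how Displayr or xgboost implement it: the one-split trees ("stumps"), the function names, the learning rate, and the data are all assumptions chosen for illustration. Each round fits a small tree to the errors left by the rounds before it and adds a fraction of that tree's prediction to the running total.

```python
# Illustrative gradient boosting for regression with one-split trees (stumps).
# All names, settings, and data here are hypothetical.

def fit_stump(x, residuals):
    """Find the split on x that minimizes the squared error of the residuals."""
    thresholds = sorted(set(x))[:-1]
    if not thresholds:
        raise ValueError("x needs at least two distinct values")
    best = None
    for threshold in thresholds:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def fit_boosted(x, y, n_rounds=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * predict_stump(stump, xi)
                       for xi, pi in zip(x, predictions)]
    return base, stumps, learning_rate


def predict_boosted(model, xi):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, xi) for s in stumps)


def mean_squared_error(y, predictions):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y)


# Made-up data: the outcome jumps when the predictor exceeds 3.
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

model = fit_boosted(x, y)
baseline_mse = mean_squared_error(y, [sum(y) / len(y)] * len(y))
boosted_mse = mean_squared_error(y, [predict_boosted(model, xi) for xi in x])
```

In this sketch `boosted_mse` measures fit on the same data used for training, so it only shows that the ensemble improves on the mean; it says nothing about accuracy on new cases.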
- Familiarity with the Structure and Value Attributes of Variable Sets
- Predictor variables (aka features or independent variables) - these can be numeric, categorical, or binary
- An outcome variable (aka dependent variable) - this variable can be numeric or categorical
- From the toolbar, go to Anything > Advanced Analysis > Machine Learning > Gradient Boosting
- In the object inspector, select your Outcome variable.
- In Predictor(s), select your predictor variable(s). The fastest way to do this is to select them all in the Data Sets tree and drag them into the Predictor(s) box.
- From the Algorithm menu, select Gradient Boosting.
- From the Output menu, select Importance. This will tell you which variables have the most predictive power. Other output options include:
- Accuracy - Produces measures of the goodness of model fit. For categorical outcomes the breakdown by category is shown.
- Prediction-Accuracy Table - Produces a table relating the observed and predicted outcome. Also known as a confusion matrix.
- Detail - Text output from the underlying xgboost package.
- Missing data - lets you decide how you want to treat missing data. You can either Exclude cases with missing data, abort model creation if there are missing values (Error if missing data), or impute missing values (Imputation (replace missing values with estimates)). In this example, we excluded cases with missing values.
- Random Seed - allows you to see how consistent the model is if you change the starting point for random selection. You can be more confident in your model if it gives consistent results no matter what seed you use.
- Booster - sets the type of underlying model that is boosted: gbtree (trees) or gblinear (linear models). Boosting is most often done with an underlying tree model, although you should try both options to see which one is better.
- OPTIONAL: Variable names - Displays Variable Names in the output instead of labels.
- OPTIONAL: Grid search - Whether to search the parameter space in order to tune the model. If not checked, the default parameters of xgboost are used. Checking this option will usually create a more accurate predictor, at the cost of taking a longer time to run.
- OPTIONAL: Under SAVE VARIABLE(S), click Predicted values to save the predicted values from the model. This will create a new variable called something like Predicted values from model. You can use this variable to evaluate the accuracy of the model.
- OPTIONAL: Click Visualization > Chart > Stacked Column Chart. For Rows, select your outcome variable, and for Columns select the variable with the predicted values created in the previous step. In this example, the model does a better job predicting which customers won't churn, though the 73% accuracy for churners is still respectable.
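If you export the outcome and the saved predicted values, the Prediction-Accuracy Table and the accuracy by category can be reproduced by hand. The following is a rough sketch rather than Displayr's own calculation: the function names are hypothetical and the labels are made up for illustration, not taken from the churn example.

```python
from collections import Counter


def prediction_accuracy_table(observed, predicted):
    """Count the cases for each (observed, predicted) pair of categories."""
    return Counter(zip(observed, predicted))


def accuracy_by_category(observed, predicted):
    """Share of cases in each observed category that were predicted correctly."""
    totals = Counter(observed)
    correct = Counter(o for o, p in zip(observed, predicted) if o == p)
    return {category: correct[category] / totals[category] for category in totals}


# Made-up labels: five observed "No" cases followed by three observed "Yes" cases.
observed = ["No", "No", "No", "No", "No", "Yes", "Yes", "Yes"]
predicted = ["No", "No", "No", "No", "Yes", "Yes", "Yes", "No"]

table = prediction_accuracy_table(observed, predicted)
per_category = accuracy_by_category(observed, predicted)
overall = sum(o == p for o, p in zip(observed, predicted)) / len(observed)
```

Here `table[("No", "Yes")]` counts cases observed as "No" but predicted as "Yes", and `per_category` gives the breakdown that the Accuracy output shows for categorical outcomes.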