This article describes how to use Gradient Boosting in Displayr.
- Predictor variables (aka features or independent variables) - these can be numeric, categorical or binary.
- An outcome variable (aka dependent variable) - this variable can be numeric or categorical. One big advantage of Gradient Boosting is that it can be used for both kinds of problem: regression (a numeric outcome) and classification (a categorical outcome).
- From the toolbar, go to Anything > Advanced Analysis > Machine Learning > Gradient Boosting
- In the object inspector, select your Outcome variable.
- In Predictor(s), select your predictor variable(s). The fastest way to do this is to select them all in the Data Sets tree and drag them into the Predictor(s) box.
- From the Algorithm menu, select Gradient Boosting.
- From the Output menu, select Importance. This will tell you which variables have the most predictive power.
- The Output menu offers other options as well. The Detail option gives you additional information about the model. If you are more interested in the overall accuracy of the model than in predictor importance, select Prediction-Accuracy table or Accuracy. In this example, Accuracy reports the percentage of cases correctly predicted as churn or not-churn.
- Missing data lets you decide how to treat missing data. You can exclude cases with missing data (Exclude cases with missing data), abort model creation if there are missing values (Error if missing data), or impute missing values (Imputation (replace missing values with estimates)). In this example, we excluded cases with missing values.
- Sort by importance sorts the variables in the table in order of importance. This is usually preferred.
- Seed sets the starting point for random selection, which lets you see how consistent the model is. You can be more confident in your model if it gives consistent results no matter which seed you use.
- Booster provides two methods for improving the accuracy of the model: gbtree and gblinear. Boosting is most often done with an underlying tree model (gbtree), although you should try both options to see which one performs better.
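The three Missing data choices can be illustrated outside Displayr with a minimal Python sketch. Everything in it is an assumption made for illustration: the `handle_missing` function, the strategy names and the toy rows are hypothetical, and simple mean replacement stands in for whatever estimates Displayr's Imputation option actually produces.

```python
from statistics import mean


def handle_missing(rows, strategy):
    """Apply one of three missing-data strategies to a list of dict rows.

    None marks a missing value. The strategy names are hypothetical
    labels that mirror the three menu options, not Displayr identifiers.
    """
    def has_missing(row):
        return any(value is None for value in row.values())

    if strategy == "exclude":
        # Keep only the complete cases.
        return [row for row in rows if not has_missing(row)]
    if strategy == "error":
        # Refuse to continue if any value is missing.
        if any(has_missing(row) for row in rows):
            raise ValueError("data contains missing values")
        return rows
    if strategy == "impute":
        # Assumption: replace each missing value with its column mean.
        # Displayr's own imputation may work differently.
        columns = list(rows[0].keys())
        fill = {c: mean(r[c] for r in rows if r[c] is not None) for c in columns}
        return [{c: fill[c] if r[c] is None else r[c] for c in columns} for r in rows]
    raise ValueError(f"unknown strategy: {strategy}")


# Toy data, invented for this example.
rows = [
    {"tenure": 12, "charges": 70.0},
    {"tenure": None, "charges": 50.0},
    {"tenure": 24, "charges": None},
]

complete_cases = handle_missing(rows, "exclude")
imputed = handle_missing(rows, "impute")
print(len(complete_cases), "complete case(s)")
print(imputed)
```

Excluding cases is the simplest choice but shrinks the sample, which is why it is worth comparing against imputation when many rows have gaps.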
- Under SAVE VARIABLE(S), click Predicted values to save the predicted values from the model. This will create a new variable called something like Predicted values from model. We will use this variable to evaluate the accuracy of the model.
- Click Insert > Chart > Stacked Column Chart. For Rows, select your outcome variable; for Columns, select the variable with the predicted values. In this example, the model does a better job predicting which customers won't churn, though the 73% accuracy for churners is still respectable.
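The comparison that the stacked column chart shows can also be computed by hand. The sketch below is a minimal illustration, assuming the outcome and the saved predicted values have been exported as two equal-length lists. The `accuracy_by_class` function and the toy labels are hypothetical and are not the data behind the figures quoted above.

```python
from collections import Counter


def accuracy_by_class(observed, predicted):
    """Cross-tabulate observed vs. predicted labels.

    Returns the cell counts, the share of each observed class that was
    predicted correctly, and the overall share of correct predictions.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    counts = Counter(zip(observed, predicted))  # (observed, predicted) -> n
    totals = Counter(observed)                  # observed class -> n
    per_class = {label: counts[(label, label)] / totals[label] for label in totals}
    overall = sum(counts[(label, label)] for label in totals) / len(observed)
    return counts, per_class, overall


# Toy labels, invented for this example.
observed = ["No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"]
predicted = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No"]

counts, per_class, overall = accuracy_by_class(observed, predicted)
print(per_class)
print(overall)
```

Reporting accuracy per class rather than only overall matters when one class is much rarer than the other, as is common with churn, because a high overall figure can hide poor performance on the rare class.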