This article describes how to perform logistic regression (also known as binary logit).
- Predictor variables (aka features or independent variables) - these can be ordinal, categorical, numeric, or binary.
- An outcome variable (aka dependent variable) - this variable must be binary. In Displayr, the best data format for this type is Nominal: Mutually exclusive categories, with values of “0” and “1”.
- Preliminary data checking:
- Hold down the CTRL key and click on each of your predictor and outcome variables from the Data Sets tree. Once they are selected, drag them (as a group) onto a page.
- The specifics of how to perform a preliminary check of the data depend very much on the data set and problem being studied, but you can start by taking the following examples into consideration:
- The outcome variable should contain two categories, 'No' and 'Yes' (or 0 and 1).
- Check for missing data in the outcome and predictor variables. If data is missing, investigate why and decide whether the variable is still viable to use.
- Tidy any data labels. For example, if the "Senior Citizen" variable in the above example only had labels of 0 and 1, you can:
- Click on the "Senior Citizen" variable in the Data Sets tree.
- Click the DATA VALUES > Labels button in the object inspector, change the 0 to 'No' and the 1 to 'Yes' in the Label column, and press OK. The table will automatically update to show the changes you have made to the underlying data.
- Tidy any variable labels by right-clicking on the variable in the Data Sets tree, and clicking Rename.
- Make sure that the categories are ordered sensibly to make interpretation easier. To reorder, select a category in a table, click the three grey lines that appear to its right (if you don't see them, click again), and drag the category to the desired location. If you accidentally merge categories, just click the Undo arrow at the top-left of the screen.
- For numeric variables, you can add the maximum and minimum values. Select the numeric-based table(s), and then, in the object inspector, go to Statistics > Cells and click on Maximum and then Minimum, which adds these statistics to the table.
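The checks above are done through the Displayr interface, but the same logic can be sketched in code. The following is a minimal Python illustration, not Displayr functionality: the records, the variable names (`churn`, `senior_citizen`, `tenure`), and the helper `check_variable` are all hypothetical.

```python
# Hypothetical records standing in for a data set; None marks a missing value.
rows = [
    {"churn": 1, "senior_citizen": 0, "tenure": 12},
    {"churn": 0, "senior_citizen": 1, "tenure": 48},
    {"churn": 0, "senior_citizen": 0, "tenure": None},
    {"churn": 1, "senior_citizen": 1, "tenure": 3},
]


def check_variable(rows, name):
    """Summarise one variable: missing count, distinct values, min and max."""
    values = [row[name] for row in rows]
    present = [v for v in values if v is not None]
    return {
        "missing": len(values) - len(present),
        "distinct": sorted(set(present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }


# One summary per variable, keyed by variable name.
summary = {name: check_variable(rows, name) for name in rows[0]}

# The outcome variable should contain exactly the two categories 0 and 1.
outcome_is_binary = summary["churn"]["distinct"] == [0, 1]

# Relabel 0/1 codes as 'No'/'Yes', mirroring the label tidy-up step.
labels = {0: "No", 1: "Yes"}
senior_labels = [labels[row["senior_citizen"]] for row in rows]
```

Here `summary["tenure"]` reports one missing value alongside the minimum and maximum, which is the kind of issue the preliminary check is meant to surface before modeling.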
- Create training (estimation), validation, and testing samples:
- From the toolbar, go to Anything > Filtering > Model Checking > Filters for Train-Validation-Test Split.
- Enter the percentage of the sample that you would like to randomly select for the training set.
- Enter the percentage of the sample that you would like to randomly select for the validation set. The same percentage is applied to the testing set, and the training, validation, and testing percentages must sum to 100%. For example, 70% training with 15% validation leaves 15% for testing.
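To make the split rule concrete, here is a minimal Python sketch of a random train-validation-test assignment in which the testing share equals the validation share. It only illustrates the idea and is not how Displayr builds its filters; the function name `split_filters`, the sample labels, and the seed are assumptions.

```python
import random


def split_filters(n_rows, train_pct, validation_pct, seed=0):
    """Randomly label each row as Training, Validation, or Testing.

    The testing percentage is set equal to the validation percentage,
    so train_pct + 2 * validation_pct must equal 100.
    """
    test_pct = validation_pct
    if train_pct + validation_pct + test_pct != 100:
        raise ValueError("Training, validation, and testing must sum to 100%")

    # Shuffle row positions with a fixed seed so the split is reproducible.
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    n_train = round(n_rows * train_pct / 100)
    n_validation = round(n_rows * validation_pct / 100)

    assignment = [None] * n_rows
    for position, index in enumerate(indices):
        if position < n_train:
            assignment[index] = "Training"
        elif position < n_train + n_validation:
            assignment[index] = "Validation"
        else:
            assignment[index] = "Testing"
    return assignment


# Hypothetical example: 200 rows split 70% / 15% / 15%.
samples = split_filters(200, 70, 15)
```

Each row receives exactly one label, so the three groups behave like mutually exclusive filters that can be applied to the model and to its diagnostics.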
- Create a preliminary model:
- From the toolbar, go to Anything > Advanced Analysis > Regression > Binary Logit (binary logit is another name for logistic regression).
- In the object inspector, select your binary Outcome variable.
- In Predictor(s), select your predictor variable(s). The fastest way to do this is to select them all in the Data Sets tree and drag them into the Predictor(s) box.
- In this example, we will use the split sample approach from the previous section. Scroll down to the bottom of the Inputs tab in the object inspector and set FILTERS & WEIGHTS > Filter(s) to Training sample.
- Ensure Automatic is ticked so that your model updates whenever you modify any of the inputs.
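For readers who want to see what a binary logit model estimates, the sketch below fits one by gradient ascent on the log-likelihood using only the Python standard library. This is an illustrative technique, not Displayr's estimation routine; the functions `fit_binary_logit` and `predict_probability`, the learning rate, the iteration count, and the toy training data are all assumptions.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_binary_logit(X, y, learning_rate=0.1, iterations=5000):
    """Estimate an intercept and coefficients by gradient ascent.

    weights[0] is the intercept; weights[1:] match the predictor columns.
    """
    weights = [0.0] * (len(X[0]) + 1)
    for _ in range(iterations):
        gradient = [0.0] * len(weights)
        for features, outcome in zip(X, y):
            z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
            error = outcome - sigmoid(z)  # observed minus predicted probability
            gradient[0] += error
            for j, x in enumerate(features, start=1):
                gradient[j] += error * x
        # Step along the average gradient of the log-likelihood.
        weights = [w + learning_rate * g / len(X) for w, g in zip(weights, gradient)]
    return weights


def predict_probability(weights, features):
    """Predicted probability that the outcome equals 1."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return sigmoid(z)


# Hypothetical training sample with one numeric predictor.
X_train = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
y_train = [0, 0, 1, 0, 1, 0, 1, 1]

weights = fit_binary_logit(X_train, y_train)
```

Only the training rows are passed to the fitting function, which corresponds to setting the filter to the training sample so that the other rows remain available for checking the model.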
- Compute the prediction accuracy tables:
- Click on the model output (which should look like the table above).
- From the toolbar, go to Anything > Advanced Analysis > Regression > Diagnostic > Prediction-Accuracy Table.
- Move this below the regression output and resize it to fit half the width of the screen.
- With the prediction-accuracy table still selected, go to object inspector > Inputs > FILTERS & WEIGHTS > Filter(s) and set it to Training sample so that the calculation is based only on the training sample.
- From the toolbar, click Duplicate, drag the new copy of the table to the right, and change its filter to Validation sample. This new table shows the predictive accuracy based on the validation sample (i.e., the out-of-sample prediction accuracy). The result should be similar to the first prediction-accuracy table; a much lower accuracy in the validation sample suggests the model is overfitting the training data.
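A prediction-accuracy table cross-tabulates observed against predicted categories. The Python sketch below shows one way such a table and its overall accuracy could be computed for the training and validation samples. It is an illustration only: the function `prediction_accuracy_table`, the 0.5 probability cutoff, and the observed values and probabilities are assumptions, not output from Displayr.

```python
def prediction_accuracy_table(observed, probabilities, cutoff=0.5):
    """Count (observed, predicted) pairs and return the table and accuracy.

    A row is predicted as 1 when its probability is at least the cutoff
    (the 0.5 default is an assumption made for this sketch).
    """
    table = {(o, p): 0 for o in (0, 1) for p in (0, 1)}
    for outcome, probability in zip(observed, probabilities):
        predicted = 1 if probability >= cutoff else 0
        table[(outcome, predicted)] += 1
    correct = table[(0, 0)] + table[(1, 1)]
    return table, correct / len(observed)


# Hypothetical observed outcomes and predicted probabilities per sample.
train_observed = [0, 0, 1, 1, 1, 0]
train_probabilities = [0.2, 0.6, 0.8, 0.7, 0.4, 0.1]
validation_observed = [1, 0, 0, 1]
validation_probabilities = [0.9, 0.3, 0.55, 0.35]

train_table, train_accuracy = prediction_accuracy_table(
    train_observed, train_probabilities
)
validation_table, validation_accuracy = prediction_accuracy_table(
    validation_observed, validation_probabilities
)
```

Comparing `train_accuracy` with `validation_accuracy` is the code equivalent of placing the two tables side by side.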