How to Do Logistic Regression

This article describes how to run a logistic regression (also known as binary logit) in Displayr. This analysis can be used in driver analysis (see our ebook for more detail) or as a means to create a predictive model for an outcome variable with two values/categories.

Requirements

Familiarity with the Structure and Value Attributes of Variable Sets, and how they are used in regression models per our Driver Analysis ebook.
Predictor variables (aka features or independent variables) - these can be ordinal, categorical, numeric, or binary.
An outcome variable (aka dependent variable) - this variable must be binary. This can be a variable that is structured as Nominal: Mutually exclusive categories, with values of “0” and “1”, or a variable within a variable set that is structured as a Binary-Multi.

Method

1. Preliminary data checking

Hold down the CTRL key and click on each of your predictor and outcome variables from the Data Sources tree. Once they are selected, drag them (as a group) onto a page to create summary tables.
The specifics of a preliminary data check depend heavily on the data set and the problem being studied, but the following checks are a good starting point:
- The outcome variable should contain two categories, 'No' and 'Yes' (or 0 and 1).
- Check for missing data in any outcome or predictor variables. If missing data is present, investigate the reasons and determine whether this variable is viable to use.
- Tidy any data labels. For example, if a "Senior Citizen" variable only has labels of 0 and 1, you can:
  - Click on the "Senior Citizen" variable in the Data Sources tree.
  - In Properties, go to Data > Attributes > Values & Labels (or the Categories button, depending on the variable's structure), change the 0 to 'No' and the 1 to 'Yes' in the Label column, and press OK. The table automatically updates to show the changes made to the underlying data.
- Tidy any variable labels by right-clicking on the variable in the Data Sources tree and clicking Rename.
- For categorical variables, make sure the categories are ordered sensibly to make interpretation easier. The category listed in the first row is the reference category: it is part of the base model, and no coefficient is estimated for it. To reorder, select a category in a table, click the three grey lines to its right, and drag the category to the desired location.
- For numeric variables, you can add the maximum and minimum values to the table to look for outliers. Select the numeric-based table(s), and then, in Properties, go to Data > Statistics > Cells and click on Maximum and Minimum, which adds these statistics to the table.
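
For readers who want to see what these checks amount to outside Displayr, they can be sketched in a few lines of Python. This is only an illustration and not Displayr's implementation; the toy records, the column names (`churn`, `senior_citizen`, `tenure`), and the helper `check_column` are all made up for the example.

```python
# Toy records standing in for a data set; values and column names are hypothetical.
rows = [
    {"churn": 1, "senior_citizen": 0, "tenure": 2},
    {"churn": 0, "senior_citizen": 1, "tenure": 45},
    {"churn": 0, "senior_citizen": 0, "tenure": None},  # missing predictor value
    {"churn": 1, "senior_citizen": 1, "tenure": 7},
    {"churn": 0, "senior_citizen": 0, "tenure": 600},   # suspiciously large value
]

def check_column(rows, name):
    """Return the missing count, distinct values, and min/max for one column."""
    present = [r[name] for r in rows if r[name] is not None]
    return {
        "missing": len(rows) - len(present),
        "distinct": sorted(set(present)),
        "min": min(present),
        "max": max(present),
    }

outcome = check_column(rows, "churn")
tenure = check_column(rows, "tenure")

# The outcome variable must contain exactly two categories (here coded 0 and 1).
outcome_is_binary = outcome["distinct"] == [0, 1]

# Relabel the 0/1 codes as No/Yes, mirroring the label-tidying step.
labels = {0: "No", 1: "Yes"}
senior_labels = [labels[r["senior_citizen"]] for r in rows]

print("binary outcome:", outcome_is_binary)
print("tenure missing:", tenure["missing"], "min:", tenure["min"], "max:", tenure["max"])
```

In this toy data, the missing `tenure` value and the maximum of 600 are the kind of results that would prompt a closer look before modeling.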

2. Create estimation, validation, and testing samples

In the Data Sources tree, hover over a variable and select +> Filter > Filters for Train-Validation-Test Samples.
Enter the % of the sample that you would like to randomly select for the training set.
Enter the % of the sample that you would like to randomly select for the validation set. The same % is applied to the testing set, so the training, validation, and testing percentages must total 100% (for example, 70% for training leaves 15% each for validation and testing).
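
The arithmetic behind the split can be sketched as follows. This is a minimal illustration using only the Python standard library, not how Displayr builds its filters; the function name `assign_samples`, the seed, and the sample size of 200 are assumptions made for the example.

```python
import random

def assign_samples(n_cases, train_pct, validation_pct, seed=0):
    """Randomly label each case as training, validation, or test.

    The test set gets the same percentage as the validation set, so
    train_pct + 2 * validation_pct must equal 100.
    """
    if train_pct + 2 * validation_pct != 100:
        raise ValueError("training + validation + test must total 100%")
    n_train = round(n_cases * train_pct / 100)
    n_validation = round(n_cases * validation_pct / 100)
    # Whatever remains after training and validation becomes the test set.
    labels = (
        ["training"] * n_train
        + ["validation"] * n_validation
        + ["test"] * (n_cases - n_train - n_validation)
    )
    # Shuffle with a fixed seed so the assignment is random but reproducible.
    random.Random(seed).shuffle(labels)
    return labels

samples = assign_samples(200, train_pct=70, validation_pct=15)
print(samples.count("training"), samples.count("validation"), samples.count("test"))
```

With 200 cases and a 70/15/15 split, the three groups contain 140, 30, and 30 cases.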

3. Create a preliminary model

From the toolbar, go to Anything > Advanced Analysis > Regression > Binary Logit, or in the Report tree select +> Advanced Analysis > Regression > Binary Logit.
In Properties, select your binary Outcome variable.
In Predictor(s), select your predictor variable(s). The fastest way to do this is to select all of them in the Data Sources tree and drag them into the Predictor(s) box.
This example uses the split-sample approach from the previous section. In Properties, change Data > Filters & Weight > Filter(s) to Training sample.
Ensure Calculate Automatically is checked so that your model updates whenever you modify any of the inputs.
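
To make concrete what fitting on the training sample means, here is a small self-contained sketch of a binary logit with a single predictor. It estimates the coefficients by plain gradient ascent on the log-likelihood, which is a stand-in chosen for brevity and not necessarily the estimation method Displayr uses. The data, the function names, and the learning-rate settings are hypothetical.

```python
import math

def sigmoid(z):
    """Map a linear predictor to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_binary_logit(xs, ys, learning_rate=0.1, n_iter=5000):
    """Fit P(y = 1) = sigmoid(b0 + b1 * x) by gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        # Observed outcome minus predicted probability for each case.
        errors = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        b0 += learning_rate * sum(errors) / n
        b1 += learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
    return b0, b1

# Hypothetical cases: (predictor, outcome, sample label).
cases = [
    (1, 0, "training"), (2, 0, "training"), (3, 1, "training"), (4, 0, "training"),
    (5, 1, "training"), (6, 0, "training"), (7, 1, "training"), (8, 1, "training"),
    (2, 0, "validation"), (7, 1, "validation"),
]

# Apply the training filter: only training cases are used for estimation.
train_x = [x for x, y, s in cases if s == "training"]
train_y = [y for x, y, s in cases if s == "training"]

b0, b1 = fit_binary_logit(train_x, train_y)
print("intercept:", round(b0, 3), "slope:", round(b1, 3))

# Predicted probabilities for the held-out validation cases.
for x, y, s in cases:
    if s == "validation":
        print("x =", x, "observed =", y, "predicted P(y=1) =", round(sigmoid(b0 + b1 * x), 3))
```

The validation cases play no part in estimation here; they are only scored afterwards, which is what keeps the out-of-sample accuracy check meaningful.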

4. Compute the prediction accuracy tables

Click on the model output created in the previous step.
From the toolbar, go to Anything > Advanced Analysis > Regression > Diagnostic > Prediction-Accuracy Table, or in the Report tree select +> Advanced Analysis > Regression > Diagnostic > Prediction-Accuracy Table.
Move this below the regression output and resize it to fit half the width of the screen.
With the prediction-accuracy table still selected, go to Data > Filters & Weight > Filter(s) in Properties and set it to Training sample. The table then shows the accuracy of the predictions for the training sample (i.e., the in-sample accuracy).
Right-click the output and select Duplicate.
Drag the new copy of the table to the right, and change the filter to Validation sample. This new table shows the predictive accuracy based on the validation sample (i.e., the out-of-sample prediction accuracy). The result should be similar to the first prediction-accuracy table.
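
Conceptually, a prediction-accuracy table cross-tabulates observed against predicted categories. The sketch below shows one way to compute such a table and the overall accuracy for each sample. The observed values and predicted probabilities are invented for illustration, the function name is hypothetical, and the 0.5 cutoff for classifying a case as 1 is an assumption rather than a documented Displayr setting.

```python
def prediction_accuracy_table(observed, predicted_prob, cutoff=0.5):
    """Cross-tabulate observed vs. predicted categories and return overall accuracy."""
    # Keys are (observed, predicted) pairs; values are case counts.
    table = {(o, p): 0 for o in (0, 1) for p in (0, 1)}
    for obs, prob in zip(observed, predicted_prob):
        pred = 1 if prob >= cutoff else 0
        table[(obs, pred)] += 1
    correct = table[(0, 0)] + table[(1, 1)]
    return table, correct / len(observed)

# Hypothetical observed outcomes and model probabilities for each sample.
training_observed = [0, 0, 1, 0, 1, 0, 1, 1]
training_prob = [0.1, 0.2, 0.4, 0.45, 0.6, 0.7, 0.8, 0.9]
validation_observed = [0, 1, 1, 0]
validation_prob = [0.3, 0.8, 0.4, 0.2]

train_table, train_accuracy = prediction_accuracy_table(training_observed, training_prob)
valid_table, valid_accuracy = prediction_accuracy_table(validation_observed, validation_prob)

print("in-sample accuracy:", train_accuracy)
print("out-of-sample accuracy:", valid_accuracy)
```

In this made-up example both samples come out at 75% accuracy, which is the pattern to look for. A validation accuracy well below the training accuracy would suggest the model is overfitting the training sample.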

Next

How To Automatically Remove Outliers from Regression and GLMs

How to Run Ordered Logit Regression

How to Save Probabilities of Each Response of Regression Models
