This article describes how to run a logistic regression (also known as a binary logit) in Displayr. This analysis can be used in driver analysis (see our ebook for more detail) or to create a predictive model for an outcome variable with two values/categories.
Requirements
- Familiarity with the Structure and Value Attributes of Variable Sets, and how they are used in regression models, as described in our Driver Analysis ebook.
- Predictor variables (aka features or independent variables) - these can be ordinal, categorical, numeric, or binary.
- An outcome variable (aka dependent variable) - this variable must be binary. It can be a variable structured as Nominal: Mutually exclusive categories with values of "0" and "1", or a variable within a variable set that is structured as a Binary-Multi.
Please note these steps require a Displayr license.
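If it helps to see what a suitable data set looks like in code, below is a minimal simulated example in R (which can be run in any R console, or in an R-based calculation in Displayr). Every name and value here (churn, SeniorCitizen, Tenure, Contract, Churn) is invented for this sketch and is not part of the article's data; the point is simply that the outcome ends up with exactly two values (0 and 1) and the predictors can be a mix of binary, numeric, and categorical.

```r
# Simulated data for illustration only; all names and values are assumptions.
set.seed(42)
n <- 200
churn <- data.frame(
  SeniorCitizen = rbinom(n, 1, 0.2),                                # binary predictor
  Tenure        = round(runif(n, 1, 72)),                           # numeric predictor
  Contract      = sample(c("Monthly", "Annual"), n, replace = TRUE) # categorical predictor
)

# A binary outcome loosely related to tenure, initially recorded as "Yes"/"No"
churn$Churn <- ifelse(runif(n) < plogis(1 - 0.05 * churn$Tenure), "Yes", "No")

# A binary logit needs a two-valued outcome; recode it to 0/1
churn$Churn <- ifelse(churn$Churn == "Yes", 1, 0)

# Confirm the outcome really has exactly two values
stopifnot(length(unique(churn$Churn)) == 2)
```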
Method
1. Preliminary data checking
- Hold down the CTRL key and click on each of your predictor and outcome variables from the Data Sources tree. Once they are selected, drag them (as a group) onto a page to create summary tables.
- The specifics of how to perform a preliminary check of the data depend very much on the data set and the problem being studied, but the following checks are a good starting point (an R sketch of equivalent checks follows this list):
- The outcome variable should contain exactly two categories, such as 'No' and 'Yes' (or 0 and 1).
- Check for missing data in the outcome and predictor variables. If missing data is present, investigate why and decide whether the variable is still suitable to use.
- Tidy any data labels. For example, if a "Senior Citizen" variable only has labels of 0 and 1, you can:
- Click on the "Senior Citizen" variable in the Data Sources tree.
- Click the Data > Properties > Labels button in the object inspector, change the 0 to 'No' and the 1 to 'Yes' in the Label column, and press OK. The table will automatically update to show the changes made to the underlying data.
- Tidy any variable labels by right-clicking on the variable in the Data Sources tree, and clicking Rename.
- For categorical variables, make sure the categories are ordered sensibly to make interpretation easier. The category listed in the first row is used as the reference (base) category of the model, so no coefficient is estimated for it. To reorder, select a category in the table, click on the three grey lines that appear to its right, and drag it to the desired position.
- For numeric variables, you can add the maximum and minimum values to the table to look for outliers. Select the numeric-based table(s), and then, in the object inspector, go to Data > Statistics > Cells and click on Maximum and Minimum, which adds these statistics to the table.
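As a rough guide to what these checks amount to, here is a hedged R sketch using the simulated churn data from the Requirements section; none of the variable names come from the article.

```r
# Assumes the simulated `churn` data frame from the Requirements sketch.

# The outcome should have exactly two categories (and ideally no missing values)
table(churn$Churn, useNA = "ifany")

# Count missing values in the outcome and every predictor
colSums(is.na(churn))

# Tidy data labels: give a 0/1 predictor readable category labels
churn$SeniorCitizen <- factor(churn$SeniorCitizen, levels = c(0, 1), labels = c("No", "Yes"))

# Order categories sensibly: the first level becomes the reference (no coefficient estimated)
churn$Contract <- relevel(factor(churn$Contract), ref = "Monthly")

# Check numeric predictors for outliers via their minimum and maximum
range(churn$Tenure, na.rm = TRUE)
```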
2. Create training, validation, and testing samples
- From the toolbar, go to Anything > Filter > Model Checking > Filters for Train-Validation-Test Split.
- Enter the percentage of the sample that you would like to randomly allocate to the training set.
- Enter the percentage of the sample that you would like to randomly allocate to the validation set. The same percentage will be applied to the testing set, and the training, validation, and testing percentages must total 100% (see the code sketch after this list for an equivalent random split).
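In code, a random train/validation/test split can be sketched like this. The 50/25/25 percentages below are placeholders rather than a recommendation, and the filter names are invented for the sketch.

```r
# Assumes the `churn` data frame from the earlier sketches.
set.seed(123)  # make the random split reproducible

groups <- sample(c("Training", "Validation", "Testing"),
                 size = nrow(churn), replace = TRUE,
                 prob = c(0.50, 0.25, 0.25))

# Logical filters, analogous to the filter variables Displayr creates
training.sample   <- groups == "Training"
validation.sample <- groups == "Validation"
testing.sample    <- groups == "Testing"
```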
3. Create a preliminary model
- From the toolbar, go to Anything > Advanced Analysis > Regression > Binary Logit.
- In the object inspector, select your binary Outcome variable.
- In Predictor(s), select your predictor variable(s). The fastest way to do this is to select them all in the Data Sources tree and drag them into the Predictor(s) box.
- In this example, we will use the split sample approach from the previous section. Change Data > Filters & Weight > Filter(s) to Training sample.
- Ensure Calculate Automatically is checked so that your model updates whenever you modify any of the inputs (an equivalent model fit in code is sketched below).
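For reference, the equivalent model fit in plain R is a generalized linear model with a binomial family, estimated on the training sample only. This is a sketch of the underlying calculation, using the invented names from the earlier sketches, not a reproduction of Displayr's Binary Logit output.

```r
# Assumes `churn` and `training.sample` from the earlier sketches.
model <- glm(Churn ~ SeniorCitizen + Tenure + Contract,
             data   = churn,
             subset = training.sample,          # fit on the training sample only
             family = binomial(link = "logit")) # binary logit

summary(model)  # coefficients, standard errors, and p-values
```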
4. Compute the prediction-accuracy tables
- Click on the model output created in the previous step.
- From the toolbar, go to Anything > Advanced Analysis > Regression > Diagnostic > Prediction-Accuracy Table.
- Move this below the regression output and resize it to fit half the width of the screen.
- With the prediction-accuracy table still selected, go to object inspector > Data > Filters & Weight > Filter(s) and set it to Training sample. This shows the accuracy of the predictions for the training sample (i.e., the in-sample accuracy).
- Right-click the output and select Duplicate.
- Drag the new copy of the table to the right, and change the filter to Validation sample. This new table shows the predictive accuracy based on the validation sample (i.e., the out-of-sample prediction accuracy) and should be broadly similar to the first prediction-accuracy table (a code sketch of equivalent accuracy tables follows this list).
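A hedged sketch of the same idea in R: classify each case at a 0.5 probability cut-off and cross-tabulate observed against predicted outcomes, once for the training (in-sample) cases and once for the validation (out-of-sample) cases. Displayr's Prediction-Accuracy Table may differ in its details; this simply shows the logic.

```r
# Assumes `model`, `churn`, `training.sample`, and `validation.sample` from the earlier sketches.

# Predicted class for every case, using a 0.5 probability cut-off
predicted <- as.integer(predict(model, newdata = churn, type = "response") >= 0.5)

# In-sample accuracy: training cases only
table(Observed  = churn$Churn[training.sample],
      Predicted = predicted[training.sample])

# Out-of-sample accuracy: validation cases only
table(Observed  = churn$Churn[validation.sample],
      Predicted = predicted[validation.sample])

# Overall percentage correct in the validation sample
mean(predicted[validation.sample] == churn$Churn[validation.sample])
```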
Next
How To Automatically Remove Outliers from Regression and GLMs
How to Run Ordered Logit Regression
How to Save Probabilities of Each Response of Regression Models