Logistic regression, also known as binary logit and binary logistic regression, is a particularly useful predictive modeling technique, beloved in both the machine learning and the statistics communities. It is used to predict outcomes involving two options (e.g., buy versus not buy).
In this article, I explain how to interpret the standard outputs from logistic regression, focusing on those that allow us to work out whether the model is good, and how it can be improved. These outputs are pretty standard and can be generated by all major data science and statistics tools (R, Python, Stata, SAS, SPSS, Displayr, Q). In this article, I review prediction accuracy, pseudo-r-squareds, AIC, the coefficient table, and analysis of variance.
Prediction accuracy
The most basic diagnostic of a logistic regression is predictive accuracy. To understand this, we need to look at the prediction-accuracy table (also known as the classification table, hit-miss table, and confusion matrix). The table below shows the prediction-accuracy table produced by Displayr's logistic regression. At the base of the table, you can see that the percentage of correct predictions is 79.05%. This tells us that, for the 3,522 observations (people) used in the model, the model correctly predicted whether someone churned 79.05% of the time. Is this a good result? The answer depends a bit on context. In this case, 79.05% is not quite as good as it might sound.
Starting with the No row of the table, we can see that 2,301 people did not churn and were correctly predicted not to have churned, whereas only 274 people who did not churn were predicted to have churned. Hovering over each cell of the table shows additional information as percentages; here, the model correctly predicted non-churn for 89% (2,301 of 2,575) of those who did not churn. So far so good.
Now, look at the second row. It shows that among people who churned, the model was only marginally more likely to predict they churned than not (i.e., 483 versus 464). So, among people who churned, the model correctly predicted churn only 51% of the time.
If you sum the first row, you can see that 2,575 people did not churn. However, if you sum the first column, you can see that the model predicted 2,765 people did not churn. What's going on here? As most people did not churn, the model can get some easy wins by defaulting to predicting that people do not churn. There is nothing wrong with the model doing this. It is the optimal thing to do. But it is important to keep this in mind when evaluating the accuracy of any predictive model. If the groups being predicted are not of equal size, the model can get away with just predicting that people are in the larger category, so it is always important to check the accuracy separately for each group (i.e., churners and non-churners). It is for this reason that you need to be skeptical when people try to impress you with the accuracy of predictive models; when predicting a rare outcome, it is easy for a model to predict accurately (by always predicting against the rare outcome).
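To make this check concrete, here is a minimal sketch in Python using scikit-learn. The data is synthetic and deliberately imbalanced (roughly three quarters in the "No" class), standing in for the churn data above, so the exact numbers will differ:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Synthetic, imbalanced data: most observations are in class 0 ("did not churn").
rng = np.random.default_rng(0)
n = 3522
X = rng.normal(size=(n, 3))
y = (X[:, 0] + rng.normal(size=n) > 0.9).astype(int)

model = LogisticRegression().fit(X, y)
pred = model.predict(X)

cm = confusion_matrix(y, pred)               # rows = actual, columns = predicted
overall = cm.trace() / cm.sum()              # overall prediction accuracy
per_class = cm.diagonal() / cm.sum(axis=1)   # accuracy within each actual class

print(f"Overall accuracy: {overall:.1%}")
print(f"Accuracy among actual No:  {per_class[0]:.1%}")
print(f"Accuracy among actual Yes: {per_class[1]:.1%}")
```

The row-wise percentages are exactly the per-group accuracies discussed above: a model that always predicts the larger class can look good overall while scoring close to 0% on the smaller class.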
Out-of-sample prediction accuracy
The accuracy discussed above is computed using the same data used to fit the model. A more thorough way to assess prediction accuracy is to perform the calculation using data not used to train the model, which tests whether the model's accuracy is likely to hold up in the "real world". The table below shows the model's prediction accuracy when applied to 1,761 observations that were not used when fitting the logistic regression. The good news here is that in this case, the prediction accuracy has improved a smidge to 79.1%. This is a bit of a fluke: typically, we would expect out-of-sample prediction accuracy to be lower, often substantially lower.
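A sketch of the same idea with a holdout set, again on synthetic data. The two-thirds/one-third split mirrors the 3,522 estimation and 1,761 validation observation counts above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the churn data set.
rng = np.random.default_rng(0)
X = rng.normal(size=(5283, 3))
y = (X[:, 0] + rng.normal(size=5283) > 0.9).astype(int)

# Hold out a third of the observations, fit on the rest, score on the holdout.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0, stratify=y)

model = LogisticRegression().fit(X_train, y_train)
print(f"In-sample accuracy:     {model.score(X_train, y_train):.1%}")
print(f"Out-of-sample accuracy: {model.score(X_test, y_test):.1%}")
```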
R-squared and pseudo-r-squared
The footer of the table below shows that the model's R-squared is 0.1898. This is interpreted exactly as the R-squared in linear regression, and it tells us that this model explains only 19% of the variation in churning.
Although the R-squared is a valid statistic for logistic regression, it is not widely used, as there are many situations in which better models can have lower R-squared values. A variety of pseudo-r-squared statistics are used instead. The footer for this table shows one of these, McFadden's rho-squared. Like r-squared statistics, these statistics take values from 0 to 1, with higher values indicating a better model. They are preferred over traditional R-squared because they are guaranteed to increase as the model's fit improves. The disadvantage of pseudo-r-squared statistics is that they are only useful when compared to other models fit to the same data set (i.e., it is not possible to say if 0.2564 is a good value for McFadden's rho-squared or not).
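McFadden's rho-squared is computed as 1 minus the ratio of the model's log-likelihood to that of an intercept-only null model. A minimal sketch with statsmodels, which exposes the same quantity directly as prsquared (the data here is synthetic):

```python
import numpy as np
import statsmodels.api as sm

# Synthetic binary-outcome data standing in for the churn data.
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(3522, 3)))
y = (X[:, 1] + rng.normal(size=3522) > 0.9).astype(int)

fit = sm.Logit(y, X).fit(disp=0)

# McFadden's rho-squared: 1 - (log-likelihood of the fitted model) /
# (log-likelihood of an intercept-only "null" model).
mcfadden = 1 - fit.llf / fit.llnull
print(f"McFadden's rho-squared: {mcfadden:.4f}")  # same value as fit.prsquared
```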
AIC
The Akaike information criterion (AIC) is a measure of the quality of the model and is shown at the bottom of the output above. This is one of the two best ways of comparing alternative logistic regressions (i.e., logistic regressions with different predictor variables). The way it is used is that all else being equal, the model with the lower AIC is superior. The AIC is generally better than pseudo-r-squareds for comparing models, as it takes into account the complexity of the model (i.e., all else being equal, the AIC favors simpler models, whereas most pseudo-r-squared statistics do not).
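Concretely, AIC = 2k - 2ln(L), where k is the number of estimated parameters and ln(L) is the model's maximized log-likelihood, so extra parameters must buy enough log-likelihood to pay for themselves. A sketch of comparing two nested logistic regressions on synthetic data:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data in which only the first predictor actually matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(3522, 3))
y = (X[:, 0] + rng.normal(size=3522) > 0.9).astype(int)

# Fit a model with one predictor and a model with all three.
small = sm.Logit(y, sm.add_constant(X[:, :1])).fit(disp=0)
large = sm.Logit(y, sm.add_constant(X)).fit(disp=0)

# AIC = 2k - 2*ln(L): the 2k term penalizes extra parameters, so a lower
# AIC means the added complexity paid for itself in log-likelihood.
for name, fit in [("1 predictor ", small), ("3 predictors", large)]:
    print(f"{name}: AIC = {fit.aic:.1f}, log-likelihood = {fit.llf:.1f}")
```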
The AIC is also often better for comparing models than out-of-sample predictive accuracy, which can be quite an insensitive and noisy metric. The AIC is less noisy because:
- There is no random component in it, whereas the out-of-sample predictive accuracy is sensitive to which data points were randomly selected for the estimation and validation (out-of-sample) data.
- It takes into account all of the probabilities. That is, when using out-of-sample predictive accuracy, both a 51% prediction and a 99% prediction have the same weight in the final calculation. By contrast, with the AIC, the 99% prediction leads to a lower AIC than the 51% prediction (i.e., the AIC takes into account the probabilities, rather than just the Yes or No prediction of the outcome variable).
The AIC is only useful for comparing relatively similar models. If comparing qualitatively different models, such as a logistic regression versus a decision tree, or a very simple logistic regression versus a complicated one, out-of-sample predictive accuracy is the better metric, as the AIC makes some strong assumptions about how to compare models, and the more different the models, the less robust these assumptions become.
The table of coefficients
The table of coefficients from above has been repeated below. When making an initial check of a model it is usually most useful to look at the column called z, which shows the z-statistics. The way we read this is that the further a value is from 0, the stronger its role as a predictor. So, in this case, we can see that the Tenure variable is the strongest predictor. The negative sign tells us that as tenure increases, the probability of churning decreases. We can also see that Monthly Charges is the weakest predictor, as its z is closest to 0. Further, the p-value for monthly charges is greater than the traditional cutoff of 0.05 (i.e., it is not "statistically significant", to use the common albeit dodgy jargon). All the other predictors are "significant". To get a more detailed understanding of how to read this table, we need to focus on the Estimate column, which I've gone to town on in How to Interpret Logistic Regression Coefficients.
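As a sketch of where these columns come from, statsmodels prints the same estimate, standard error, z, and p-value columns. The predictors below are synthetic stand-ins for Tenure and Monthly Charges, built so that the first is a strong negative predictor and the second a weak one:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic stand-ins for two predictors: a strong negative one ("Tenure")
# and a weak one ("MonthlyCharges").
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Tenure": rng.normal(size=3522),
    "MonthlyCharges": rng.normal(size=3522),
})
latent = -1.2 * X["Tenure"] + 0.05 * X["MonthlyCharges"] + rng.logistic(size=3522)
y = (latent > 0).astype(int)

fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)

# The "z" column is Estimate / Std. Error: the further from 0, the stronger
# the predictor; its sign gives the direction of the effect.
print(fit.summary())
```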
Analysis of Variance (ANOVA)
With logistic regressions involving categorical predictors, the table of coefficients can be difficult to interpret. In particular, when the model includes predictors with more than two categories, we have multiple estimates, p-values, and z-statistics for a single predictor. This is doubly problematic. First, it can be hard to get your head around how to interpret them. Second, for complicated reasons beyond the scope of this article, it is possible for none or only some of a categorical predictor's individual coefficients to be statistically significant while the predictor is jointly significant (significant when assessed as a whole), and vice versa.
This problem is addressed by performing an analysis of variance (ANOVA) on the logistic regression. Sometimes these will be created as a separate table, as in the case of Displayr's ANOVA table, shown below. In other cases, the results will be integrated into the main table of coefficients (SPSS does this with its Wald tests). Typically, these will show either the results of a likelihood-ratio (LR) test or a Wald test.
The example below confirms that all the predictors other than Monthly Charges are significant. We can also make some broad conclusions about relative importance by looking at the LR Chisq column, but when doing so keep two things in mind about this statistic (and also about the Wald statistic shown by some other products, such as SPSS Statistics): (1) we cannot meaningfully compute ratios, so it is not the case that Tenure is almost twice as important as Contract; and (2) the more categories in any of the predictors, the less valid these comparisons become.
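The likelihood-ratio test behind the LR Chisq column is straightforward to reproduce by hand: fit the model with and without the categorical predictor and compare twice the difference in log-likelihoods to a chi-squared distribution. A sketch on synthetic data (the variable names are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Synthetic data with a three-category predictor ("contract").
rng = np.random.default_rng(0)
n = 3522
df = pd.DataFrame({
    "contract": rng.choice(["Monthly", "OneYear", "TwoYear"], size=n),
    "tenure": rng.normal(size=n),
})
logit = -0.8 * df["tenure"] + (df["contract"] == "Monthly") * 1.0
df["churn"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Likelihood-ratio test for the joint significance of the contract dummies:
# compare the full model to one with the predictor dropped entirely.
full = smf.logit("churn ~ tenure + C(contract)", data=df).fit(disp=0)
reduced = smf.logit("churn ~ tenure", data=df).fit(disp=0)

lr_chisq = 2 * (full.llf - reduced.llf)
df_diff = full.df_model - reduced.df_model  # 2 here: 3 categories - 1
p_value = stats.chi2.sf(lr_chisq, df_diff)
print(f"LR Chisq = {lr_chisq:.2f}, df = {df_diff:.0f}, p = {p_value:.4g}")
```

This tests all the dummy coefficients for the predictor jointly, which is exactly what the individual z-statistics in the coefficient table cannot do.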
Other outputs
The outputs described above are the standard outputs, and they will typically lead to the identification of key problems. However, they are by no means exhaustive, and many other, more technical, outputs can be examined, leading to conclusions not detectable in the standard outputs. This is one of the ugly sides of building predictive models: there is always something more that can be checked, so you can never be 100% sure that your model is as good as it can be...