How to Troubleshoot Regression Problems

This article lists some errors that may come up when performing regression analysis and how to fix them. If you're new to regression analysis, you'll also want to read through the How Does Linear Regression Work? article in our Data Story Guide, as some of the errors below occur when the requirements of the model are not met.

One or more of the independent variables contains no variation

This error message indicates that a variable contains no variation and thus it is impossible to estimate the regression model. Most commonly, this is caused by dummy coding, with a variable being created for a category containing no respondents. Merge the offending category with another category to solve the problem. Please note that Displayr automatically filters respondents who have missing data on any of the variables in the regression, so a category that has respondents in the raw data may still be empty in the data used for estimation.
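
The check described above can be sketched outside Displayr. The following Python sketch is only an illustration, not Displayr's implementation: the function name constant_columns and the data are hypothetical, and missing values are assumed to be coded as None. It drops respondents with missing data on any variable first, then reports the variables that have no variation left.

```python
def constant_columns(data):
    """Return names of variables with no variation among complete cases.

    `data` maps a variable name to a list of values, one per respondent.
    Missing values are assumed to be coded as None (an assumption of this sketch).
    """
    names = list(data)
    rows = zip(*(data[name] for name in names))
    # Mimic listwise deletion: keep respondents with no missing values.
    complete = [row for row in rows if None not in row]
    flagged = []
    for i, name in enumerate(names):
        if len({row[i] for row in complete}) <= 1:
            flagged.append(name)
    return flagged


# Hypothetical data: the only respondent in "region_west" has a missing age,
# so that dummy variable is constant once incomplete cases are removed.
survey = {
    "age": [34, 51, None, 28, 45],
    "region_north": [1, 0, 0, 1, 0],
    "region_south": [0, 1, 0, 0, 1],
    "region_west": [0, 0, 1, 0, 0],
}
print(constant_columns(survey))
```

Any variable reported by such a check is a candidate for merging with another category.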

All coefficients are NaN or 0

Most likely, there is some kind of problem with your data. For example:

The range of one of your independent variables is too large (e.g., in the thousands or more), which causes numerical precision problems. Dividing the variable by, say, 2 times its standard deviation typically solves this problem and, as an added benefit, gives a coefficient that is directly comparable with the coefficients of categorical variables.
Perfect multicollinearity. Inspect the VIF columns (see Coefficients and related statistics).
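
The rescaling suggested above (dividing a variable by 2 times its standard deviation) can be sketched in Python as follows. This is a minimal illustration using the sample standard deviation; the function name scale_by_two_sd and the income values are hypothetical.

```python
import statistics


def scale_by_two_sd(values):
    """Divide each value by 2 times the sample standard deviation."""
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("variable has no variation, so it cannot be rescaled")
    return [value / (2 * sd) for value in values]


# Hypothetical incomes in dollars: a range in the tens of thousands.
income = [32000, 45000, 58000, 71000, 120000]
scaled = scale_by_two_sd(income)

# After rescaling, the standard deviation is 0.5 whatever the original units.
print(statistics.stdev(scaled))
```

The rescaled variable would then replace the original as a predictor in the regression.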

A coefficient is zero but it shows as being statistically significant

The range of the independent variables is too large (e.g., in the thousands or more), which causes numerical precision problems. Dividing each variable by, say, 2 times its standard deviation typically solves this problem and, as an added benefit, gives coefficients that are directly comparable with the coefficients of categorical variables.

Lots of NaNs in the outputs

Most likely, you have included independent variables that induce perfect multicollinearity. Inspect the VIF columns (see Coefficients and related statistics).
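
Perfect multicollinearity means that the predictor columns, together with the intercept, are linearly dependent. As a rough way to confirm this outside Displayr, the Python sketch below computes the rank of the design matrix by Gaussian elimination. It is an illustration under stated assumptions, not how Displayr detects the problem: the function names, the tolerance, and the data are all hypothetical.

```python
def matrix_rank(rows, tol=1e-9):
    """Rank of a matrix (list of rows) via Gaussian elimination with pivoting."""
    m = [[float(value) for value in row] for row in rows]
    if not m:
        return 0
    n_cols = len(m[0])
    rank = 0
    for col in range(n_cols):
        # Choose the remaining row with the largest entry in this column.
        pivot = max(range(rank, len(m)), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot: this column depends on earlier ones
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
        if rank == len(m):
            break
    return rank


def has_perfect_multicollinearity(predictors):
    """True if the predictors plus an intercept column are linearly dependent."""
    # One row per respondent: an intercept of 1 followed by the predictor values.
    design = [[1, *row] for row in zip(*predictors)]
    return matrix_rank(design) < len(predictors) + 1


# Hypothetical predictors: x3 is exactly x1 + x2.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
x3 = [a + b for a, b in zip(x1, x2)]

print(has_perfect_multicollinearity([x1, x2, x3]))  # True
print(has_perfect_multicollinearity([x1, x2]))      # False
```

If the check reports a dependency, removing one of the variables involved is the usual remedy.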

Implausibly big numbers

If you are using an Ordered Logit Model, the problem is most likely caused by categories that contain too few values. Merge some of the categories and the problem should be resolved.

Error in incomplete Beta function

Possibly the included independent variables have induced perfect multicollinearity (i.e., they are linearly dependent).

An independent variable has no variation in a category of the dependent question

There are usually three solutions to this problem. Either merge smaller categories of categorical independent questions (see the table provided in the output), merge categories in the dependent question, or change the type of the dependent question (e.g., from categorical to numeric).

When regression models take too long to estimate

If your regression model seems to take too long to run, you can often reduce the run time by rethinking the scope of the model you are trying to estimate.

Typically, regression models take a long time to run if you:

Included Categorical predictors with a large number of values. In Displayr, categorical variables are dummy coded, with a separate dummy variable created for the unique values of the variable. Check that the Variable Type for predictor variables like Income is Numeric, and not Categorical or Ordered Categorical. If Income is set to Categorical or Ordered Categorical, its unique values become separate predictor variables in your model. This means that if you have 1000 unique incomes, you will end up with 999 predictors, instead of just 1 as you intended.
Included too many Independent questions. It is generally bad research practice to include hundreds or thousands of predictor variables in your model. Before you run your regression, decide which variables might reasonably have an effect on the outcome variable. For example, you would not use variables like shoe size or hair color to predict income, so don't include them in your model. Avoid using options like Use All when selecting your predictors.
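
The arithmetic behind the first point can be sketched in Python. This is a simplified illustration of dummy coding with one category left out as the reference, consistent with the 1000-to-999 example above; the function name predictor_count and the income values are hypothetical, and this is not Displayr code.

```python
def predictor_count(values, variable_type):
    """Number of predictors a variable adds to the model.

    A Numeric variable adds one predictor. A categorical variable is assumed
    to add one dummy variable per unique value, minus one reference category.
    """
    if variable_type == "Numeric":
        return 1
    return len(set(values)) - 1


# Hypothetical data: 1000 respondents, each with a different income.
incomes = [30000 + 50 * i for i in range(1000)]

print(predictor_count(incomes, "Categorical"))  # 999
print(predictor_count(incomes, "Numeric"))      # 1
```

Running a count like this over every predictor before estimating the model shows which variables are inflating its size.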

Other unexpected results: check for multicollinearity

The presence of large correlations between independent variables can create problems with the stability of the regression coefficients. The technical term for the problem is multicollinearity. There are sophisticated tests for checking the presence or absence of multicollinearity, such as Variance Inflation Factors (VIF); see How to Create Regression Multicollinearity Table (VIF). A simpler way is to run correlations between the independent variables.

For example:

The predictor variables are Gender, Age, and PhD, and the dependent variable is Salary in Thousands. As a rule of thumb, a correlation of 0.90 or higher between two variables suggests multicollinearity. Whenever this happens, one easy remedy is to simply remove one of the highly correlated variables from your analysis. In this example, none of the correlations are that high, so multicollinearity isn't suggested.
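
The rule of thumb above can be applied with a few lines of Python. The sketch below computes Pearson correlations for every pair of predictors and lists the pairs at or above 0.90. The function names and the data are hypothetical (they are not the data from the example above), and treating strong negative correlations the same as positive ones is an assumption of this sketch.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def high_correlations(data, threshold=0.90):
    """Return (name, name, r) for each pair with |r| at or above the threshold."""
    flagged = []
    for a, b in combinations(data, 2):
        r = pearson(data[a], data[b])
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged


# Hypothetical predictors: YearsExperience rises almost in step with Age.
predictors = {
    "Age": [25, 32, 41, 47, 53, 60],
    "YearsExperience": [3, 9, 20, 24, 31, 38],
    "PhD": [0, 1, 0, 1, 0, 1],
}
print(high_correlations(predictors))
```

With this made-up data, only the Age and YearsExperience pair is flagged, so one of those two variables would be a candidate for removal.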

How to Do Driver Analysis

How to Create a Prediction-Accuracy Table

How to Create a Goodness-of-Fit Plot

How to Test Residual Heteroscedasticity of Regression Models

How to Save Predicted Values of Regression Models

How to Save Fitted Values of Regression Models

How to Save Probabilities of Each Response of Regression Models

How to Test Residual Normality (Shapiro-Wilk) of Regression Models

How to Test Residual Serial Correlation (Durbin-Watson) of Regression Models

How to Save Residuals of Regression Models