This article lists some errors that may come up when performing regression analysis and how to fix them. If you're new to regression analysis, you'll also want to read through the How Does Linear Regression Work? article in our Data Story Guide as some of the errors below are due to failed requirements for the model.
One or more of the independent variables contains no variation
This error message indicates that a variable contains no variation and it is thus impossible to estimate the regression model. Most commonly, this is caused by dummy coding, with a variable being created for a category containing no respondents. Collapse the offending category in the Outputs Tab to solve the problem. Please note that Displayr automatically filters respondents who have missing data on any of the variables in the regression.
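Outside Displayr, the same condition can be checked by hand. The following is a minimal sketch, not Displayr's own implementation: it assumes the data is a list of row dictionaries with `None` for missing values, and the function name `constant_columns` and the column names are hypothetical. It drops rows with any missing value, as described above, and then reports columns that are left with a single distinct value.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def constant_columns(rows: List[Row], columns: List[str]) -> List[str]:
    """Return the columns that have no variation among complete cases."""
    # Keep only rows with no missing values on any regression variable,
    # mimicking the filtering of respondents with missing data.
    complete = [r for r in rows if all(r[c] is not None for c in columns)]
    # A column with fewer than two distinct values has no variation.
    return [c for c in columns if len({r[c] for r in complete}) < 2]


# Hypothetical dummy-coded data (values are made up for illustration).
rows = [
    {"age": 25, "region_north": 1, "region_south": 0},
    {"age": 40, "region_north": 0, "region_south": 0},
    {"age": 31, "region_north": 1, "region_south": 0},
    {"age": None, "region_north": 0, "region_south": 1},  # dropped: missing age
]
flagged = constant_columns(rows, ["age", "region_north", "region_south"])
print(flagged)  # ['region_south']
```

Here the only respondent in the "south" category has a missing age, so once that row is filtered out the dummy variable for that category is constant.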
All coefficients are NaN or 0
Most likely, there is some kind of problem with your data. For example:
- The range of one of your independent variables is too large (e.g., in the thousands or more), which causes numerical precision problems. Dividing each such variable by, say, 2 times its standard deviation typically solves this problem and, as an added benefit, gives coefficients that are directly comparable with the coefficients of categorical variables.
- Perfect multicollinearity. Inspect the VIF columns (see Coefficients and related statistics).
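The rescaling suggested in the first bullet can be sketched as follows. This is an illustrative example only: the helper name `scale_by_two_sd` and the income values are made up, and it assumes the sample standard deviation is the one intended.

```python
import statistics
from typing import List


def scale_by_two_sd(values: List[float]) -> List[float]:
    """Divide each value by twice the sample standard deviation."""
    two_sd = 2 * statistics.stdev(values)
    if two_sd == 0:
        raise ValueError("variable has no variation and cannot be rescaled")
    return [v / two_sd for v in values]


# Hypothetical variable with a large range.
income = [32000.0, 45000.0, 51000.0, 78000.0, 120000.0]
scaled = scale_by_two_sd(income)

# After rescaling, the standard deviation is 0.5 by construction.
print(round(statistics.stdev(scaled), 6))  # 0.5
```

The rescaled variable would then be used in the regression in place of the original one.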
A coefficient is zero but shows as being statistically significant
The range of the independent variables is too large (e.g., in the thousands or more), which causes numerical precision problems. Dividing each variable by, say, 2 times its standard deviation typically solves this problem and, as an added benefit, gives coefficients that are directly comparable with the coefficients of categorical variables.
Lots of NaNs in the outputs
Most likely, you have included independent variables that induce perfect multicollinearity. Inspect the VIF columns (see Coefficients and related statistics).
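Perfect multicollinearity means that at least one predictor is an exact linear combination of the others (including the intercept). One way to test for this outside Displayr is to compare the rank of the design matrix with its number of columns. The sketch below is an assumption-laden illustration, not Displayr's method: the function names, the tolerance of `1e-9`, and the data are all hypothetical.

```python
from typing import List


def matrix_rank(matrix: List[List[float]], tol: float = 1e-9) -> int:
    """Rank via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]  # work on a copy
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Pick the remaining row with the largest absolute entry in this column.
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot: column depends on earlier columns
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


def has_perfect_multicollinearity(predictors: List[List[float]]) -> bool:
    """True if the predictors plus an intercept column are linearly dependent."""
    design = [[1.0] + row for row in predictors]
    return matrix_rank(design) < len(design[0])


# Hypothetical data: in `dependent`, the third column is the sum of the first two.
dependent = [[1.0, 2.0, 3.0], [2.0, 1.0, 3.0], [4.0, 0.0, 4.0],
             [3.0, 5.0, 8.0], [0.0, 2.0, 2.0]]
independent = [[1.0, 2.0, 5.0], [2.0, 1.0, 3.0], [4.0, 0.0, 1.0],
               [3.0, 5.0, 2.0], [0.0, 2.0, 7.0]]

print(has_perfect_multicollinearity(dependent))    # True
print(has_perfect_multicollinearity(independent))  # False
```

If the check returns True, removing one of the linearly dependent predictors is the usual remedy.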
Implausibly big numbers
If you are using an Ordered Logit Model, the problem is likely due to there being categories that contain too few values. Collapse some of the categories and the problem should be resolved.
Error in incomplete Beta function
Possibly the included independent variables have induced perfect multicollinearity (i.e., they are linearly dependent).
An independent variable has no variation in a category of the dependent question
There are usually three solutions to this problem: collapse smaller categories of categorical independent questions (see the table provided in the output), collapse categories in the dependent question, or change the type of the dependent question (e.g., from categorical to numeric).
When regression models take too long to estimate
If your regression model seems to take too long to execute, you can often reduce the time by rethinking the scope of the problem you are trying to solve.
Typically regression models take a long time to run if you:
- Included Categorical predictors with a large number of values. In Displayr, one dummy variable is created for each unique value of a categorical variable (other than the reference category). Check that the Variable Type for predictor variables like Income is Numeric, and not Categorical or Ordered Categorical. If it is Categorical or Ordered Categorical, each unique value of Income will become a separate predictor variable in your model. This means that if you have 1000 unique incomes, you will end up with 999 predictors, instead of just 1 as you intended.
- Included too many Independent questions. It is generally bad research practice to include 100s or 1000s of predictor variables in your model. Before you run your regression, decide which variables might reasonably have an effect on the outcome variable. For example, you would not use variables like shoe size or hair color to predict income, so don't include them in your model. Avoid using options like Use All when selecting your predictors.
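To see how quickly categorical predictors inflate a model, you can count the dummy variables each column would produce before running the regression. This is a rough sketch under the assumption that one category serves as the reference, so a column with k unique values yields k - 1 dummy variables; the function name `dummy_counts` and the data are hypothetical.

```python
from typing import Dict, List


def dummy_counts(data: Dict[str, List[object]]) -> Dict[str, int]:
    """Dummy variables each column would produce if treated as categorical.

    Assumes one category is used as the reference, so a column with
    k unique values yields k - 1 dummy variables.
    """
    return {name: max(len(set(values)) - 1, 0) for name, values in data.items()}


# Hypothetical data: 1000 respondents, each with a distinct income.
data = {
    "income": [20000 + 37 * i for i in range(1000)],
    "gender": ["F", "M"] * 500,
}
counts = dummy_counts(data)
print(counts)  # {'income': 999, 'gender': 1}
```

A count in the hundreds for a variable such as income is a strong hint that it should be treated as Numeric instead.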
Other unexpected results: check for multicollinearity
The presence of large correlations between independent variables can create problems with the stability of the regression coefficients. The technical term for the problem is multicollinearity. There are sophisticated tests for checking the presence or absence of multicollinearity, such as Variance Inflation Factors (VIF); see How to Create Regression Multicollinearity Table (VIF). A simpler way is to run correlations.
The predictor variables are Gender, Age and PhD and the dependent variable is Salary in Thousands. As a rule of thumb, a correlation of 0.90 or higher between two variables suggests multicollinearity. Whenever this happens, one easy remedy is to remove one of the highly correlated variables from your analysis. In this example, none of the correlations are that high, so multicollinearity isn't suggested.
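The rule of thumb above can also be applied programmatically. The sketch below computes pairwise Pearson correlations between predictors and lists any pair at or above the 0.90 threshold. The function names and the data values are made up for illustration; they are not the figures from the example described above.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)


def high_correlations(
    data: Dict[str, List[float]], threshold: float = 0.90
) -> List[Tuple[str, str, float]]:
    """Pairs of predictors whose absolute correlation meets the threshold."""
    flagged = []
    for a, b in combinations(data, 2):
        r = pearson(data[a], data[b])
        if abs(r) >= threshold:
            flagged.append((a, b, r))
    return flagged


# Hypothetical predictor values (made up; not taken from the example above).
predictors = {
    "gender": [0, 1, 0, 1, 1, 0, 1, 0],
    "age": [25, 32, 41, 38, 29, 50, 46, 35],
    "phd": [0, 0, 1, 0, 1, 1, 0, 0],
}
print(high_correlations(predictors))  # []
```

An empty list means no pair reaches the threshold. Adding a predictor that is a linear transformation of an existing one, such as age measured in months, would cause that pair to be flagged.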