Linear Regression models the linear relationship between a dependent variable and one or more independent variables. The linear regression option is most commonly used when the dependent variable is continuous.
Requirements
- Understanding of How Does Linear Regression Work?
- Familiarity with the Structure and Value Attributes of Variable Sets, and how they are used in regression models per our Driver Analysis ebook.
- Predictor variables (aka features or independent variables) - these can be numeric or binary. To use categorical variables in regression, you need to create a separate dummy variable for each category and use those instead. For example, if Employment Category has three categories - manager, custodial, and clerical - you can create three new variables called manager, custodial, and clerical (a minimal sketch is shown after this list).
- An outcome variable (aka dependent variable) - this variable must be numeric.
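For illustration, here is a minimal R sketch of dummy coding. The data frame dat and the employment column are hypothetical names, not part of the software's own workflow:

```r
# Hypothetical data frame with a three-category employment variable.
dat <- data.frame(
  employment = factor(c("manager", "custodial", "clerical", "manager"))
)

# One 0/1 dummy variable per category.
dat$manager   <- as.numeric(dat$employment == "manager")
dat$custodial <- as.numeric(dat$employment == "custodial")
dat$clerical  <- as.numeric(dat$employment == "clerical")

# When entering these into a regression, leave one out as the reference
# category, e.g. include only manager and custodial.
```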
Method
From the Report tree select + > Advanced Analysis > Regression > Linear Regression.
In the object inspector, select your numeric Outcome variable.
In Predictor(s), select your predictor variable(s). The fastest way to do this is to select all of them in the Data Sources tree and drag them into the Predictor(s) box. If you are using dummy variables created from a categorical variable, be sure to leave one out as the reference category.
From Algorithm, choose Regression.
From Regression type, choose Linear.
By default, the Output option is Summary. This output gives you the Regression coefficients table, R-Squared, the AIC fit value, the missing value treatment, and other information. There are several other options you can choose from, but you should use Summary if you are primarily interested in the regression equation and the percentage of variance the model accounts for (a minimal base-R sketch of these quantities follows this list). Other options include:
Detail - Typical R output, some additional information compared to Summary, but without the pretty formatting.
ANOVA - Analysis of variance table containing the results of Chi-squared likelihood ratio tests for each predictor.
Relative Importance Analysis - The results of a relative importance analysis. See here and the references for more information. Note that categorical predictors are not converted to be numeric, unlike in Driver (Importance) Analysis - Relative Importance Analysis.
Shapley Regression - See here and the references for more information. This option is only available for Linear Regression. Note that categorical predictors are not converted to be numeric, unlike in Driver (Importance) Analysis - Shapley.
Jaccard Coefficient - Computes the relative importance of the predictor variables against the outcome variable with the Jaccard Coefficient. See Driver (Importance) Analysis - Jaccard Coefficient. This option requires both binary outcomes and predictor variables.
Correlation - Computes the relative importance of the predictor variables against the outcome variable via the bivariate Pearson product moment correlations. See Driver (Importance) Analysis - Correlation and references therein for more information.
Effects Plot - Plots the relationship between each of the Predictors and the Outcome.
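If you want to see where the Summary quantities come from, here is a minimal base-R sketch. The simulated data frame dat and the variable names y, x1, and x2 are illustrative assumptions, not the product's own code:

```r
# Simulated data for illustration only.
set.seed(123)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
dat$y <- 1 + 2 * dat$x1 - 0.5 * dat$x2 + rnorm(100)

# Fit the linear regression.
fit <- lm(y ~ x1 + x2, data = dat)

summary(fit)$coefficients  # Estimate, Std. Error, t value, Pr(>|t|)
summary(fit)$r.squared     # R-Squared
AIC(fit)                   # AIC fit value
```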
Missing data gives you several options for how to treat missing values. The default is to exclude cases with missing values. This is usually the preferred choice unless the regression model contains variables with high percentages of missing values. In cases like that, you might consider excluding one or more of those variables, or using Multiple imputation if you want to keep those variables in your regression model with estimated values.
Note: Multiple Imputation is not supported with automated outlier removal. Using the two options together will result in an error. Either change the missing value handling option or set the Automated outlier removal percentage to zero.
OPTIONAL: Tick Robust standard errors. This computes robust standard errors that are not affected by violations of the constant-variance assumption (i.e., heteroscedasticity). See Robust Standard Errors.
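The product computes these internally; for reference, a common way to obtain heteroscedasticity-robust standard errors in R uses the sandwich and lmtest packages. This is a sketch under assumed names and is not necessarily the exact estimator used here:

```r
# Sketch only: heteroscedasticity-robust standard errors via sandwich/lmtest.
library(sandwich)
library(lmtest)

# Simulated data with heteroscedastic errors, for illustration only.
set.seed(1)
dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
dat$y <- 1 + 2 * dat$x1 + rnorm(200, sd = 1 + abs(dat$x1))

fit <- lm(y ~ x1 + x2, data = dat)

# Coefficient tests using a robust (HC1) covariance matrix.
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))
```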
Click Calculate.
Additional Options
- Correction - The multiple comparisons correction applied when computing the p-values of the post-hoc comparisons.
- Variable names - Displays Variable Names in the output instead of labels
- Weight - Where a weight has been set for the R Output, it will automatically be applied when the model is estimated. By default, the weight is assumed to be a sampling weight, and the standard errors are estimated using Taylor series linearization. See Weights, Effective Sample Size, and Design Effects.
- Filter - The data is automatically filtered using any filters prior to estimating the model.
- Crosstab Interaction - Optional variable to test for interaction with other variables in the model. The interaction variable is treated as a categorical variable. The coefficients in the table are computed by running separate regressions for each level of the interaction variable (a rough sketch follows this list). To evaluate whether a coefficient is significantly higher (blue) or lower (red), we perform a t-test comparing it to the coefficient estimated from the remaining data, as described in Driver Analysis. P-values are corrected for multiple comparisons across the whole table (excluding the NET column). The P-value in the subtitle is calculated using a likelihood-ratio test comparing the pooled model with no interaction variable to a model in which all predictors interact with the interaction variable.
- Automated outlier removal percentage - A numeric value between 0 and 50 (including 0 but not 50) specifying the percentage of the data that is removed from the analysis as outliers. If zero is selected, no outlier removal is performed, and a standard regression is applied to the entire (possibly filtered) dataset. If a non-zero value is selected, the regression model is fitted twice. The first model uses the entire dataset (after filters have been applied) and identifies the observations that generate the largest residuals. The user-specified percentage of cases with the largest residuals is then removed, and the regression model is refitted on this reduced dataset (a rough sketch of this two-pass procedure follows this list). The specific residual used varies depending on the regression Type.
- When the regression type is Linear: The studentized residual in an unweighted regression and the Pearson residual in a weighted regression. The Pearson residual in the weighted case adjusts appropriately for the provided survey weights.
The studentized residual is the distance between the observed and fitted value for each point, standardized (adjusted) for the influence of the point and an externally adjusted variance calculation. The studentized deviance residual is the contribution of the fitted point to the likelihood, standardized in the same way. The Pearson residual in the weighted case is the distance between the observed and fitted value, adjusted appropriately for the provided survey weights. See the rstudent function in R and Davison and Snell (1991) for details of these calculations.
- Stack data - Whether the input data should be stacked before analysis. Stacking can be desirable when each individual in the data set has multiple cases and an aggregate model is desired. If this option is chosen, the Outcome needs to be a single variable set; similarly, the Predictor(s) need to be a single variable set. In the process of stacking, the data reduction is inspected. Any constructed NETs are removed unless they are composed of source values that are mutually exclusive of other codes, such as the result of merging two categories.
- Random seed - Seed used to initialize the (pseudo)random number generator for the model fitting algorithm. Different seeds may lead to slightly different answers, but should normally not make a large difference.
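As a rough illustration of the Crosstab Interaction idea, separate regressions can be fitted within each level of a grouping variable and their coefficients compared. The names dat, y, x1, x2, and group are hypothetical, and this is not the exact procedure used by the product:

```r
# Simulated data for illustration only.
set.seed(2)
dat <- data.frame(group = rep(c("A", "B"), each = 50),
                  x1 = rnorm(100), x2 = rnorm(100))
dat$y <- 1 + 2 * dat$x1 + ifelse(dat$group == "B", 1, 0) * dat$x1 + rnorm(100)

# One regression per level of the interaction variable.
by_level <- lapply(split(dat, dat$group),
                   function(d) coef(lm(y ~ x1 + x2, data = d)))
do.call(rbind, by_level)  # one row of coefficients per level of group
```

And here is a rough sketch of the two-pass automated outlier removal for an unweighted linear regression, reusing the simulated dat above. The removal percentage pct is an assumed example value:

```r
pct  <- 5                                        # example removal percentage
fit1 <- lm(y ~ x1 + x2, data = dat)              # first pass on all data
res  <- abs(rstudent(fit1))                      # studentized residuals
keep <- res <= quantile(res, 1 - pct / 100)      # drop the largest pct% of cases
fit2 <- lm(y ~ x1 + x2, data = dat[keep, ])      # refit on the reduced data
summary(fit2)
```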
Worked Example
Below we perform a regression analysis of salary based on demographics and credentials. You can find the sample data set here.
- From the Report tree select + > Advanced Analysis > Regression > Linear Regression.
- In the Object Inspector > Data > Outcome, select your numeric dependent variable - the example below uses Salary.
- In Predictor(s), select your predictor variable(s) - the example below uses Gender, Age, and whether the respondent has a PhD.
- From Algorithm, choose Regression.
- From Regression type, choose Linear.
- From Output select Summary.
The results are as follows:
- The column labeled Estimate contains the partial regression coefficients used in the prediction equation. These estimates are also referred to as coefficients or parameters.
- The Standard Error column quantifies the uncertainty of the estimates. The standard error for PhD: Yes is relatively small compared to the Estimate, which tells us that the Estimate is quite precise, as is also indicated by the high t (which is Estimate / Standard Error) and the small p-value.
- t tests are applied to each partial regression coefficient to test whether the coefficient is different from zero in the population.
- R-Squared can be interpreted as the proportion of variance in the dependent variable that is explained by a linear combination of the independent variables. R-Squared can be used to assess the goodness of fit of the model: a larger number indicates that the model captures more of the variation in the dependent variable. See Regression Diagnostics for more information.
Interpreting the equation
The multiple regression equation accommodates more than one independent variable:
Predicted Y = a + b1 * X1 + b2 * X2 + ... + bn * Xn
b1 represents the expected change in the dependent variable when X1 increases by one, with all other independent variables held constant. The same holds true for b2, b3, and so forth.
The prediction equation is derived from the Estimates column. The prediction equation in this example is:
Salary in Thousands = -6.95 + 11.09 * Gender:Male + 0.85 * Age + 36.43 * PhD: Yes
The coefficient 0.85 for Age represents the expected change in Salary in Thousands when Age increases by one, provided that Gender:Male and PhD:Yes remain fixed. For example, consider two people who both have a PhD and are female, one being a year older than the other. You would then expect the older person to earn $850 more in salary. Thus, a coefficient gives the net effect of a predictor, controlling for the other variables in the equation.
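You can check this interpretation by evaluating the equation directly. Below is a quick R sketch using the coefficients from the Estimate column above; the ages 40 and 41 are arbitrary illustrative values:

```r
# Worked-example equation, with coefficients taken from the output above.
predict_salary <- function(male, age, phd) {
  -6.95 + 11.09 * male + 0.85 * age + 36.43 * phd
}

# Two women with PhDs, one year apart in age:
predict_salary(male = 0, age = 41, phd = 1) -
  predict_salary(male = 0, age = 40, phd = 1)  # 0.85, i.e. about $850
```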
Also, notice that the p-value for Gender:Male is .101, which means the variable is not significant when we hold the other two variables constant. We therefore remove it and rerun the analysis.
The results are as follows:
The final prediction equation is:
Salary in Thousands = -3.92 + 0.89 * Age + 38.09 * PhD: Yes
For example, if two people are 25 years old, but only one of them has a PhD, you would expect the person with a PhD to earn $38,090 more in salary than the person without one.
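A quick check of this calculation using the final equation:

```r
# Final prediction equation, applied to two 25-year-olds who differ only
# in PhD status.
salary <- function(age, phd) -3.92 + 0.89 * age + 38.09 * phd

salary(25, phd = 1) - salary(25, phd = 0)  # 38.09, i.e. about $38,090
```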
Next
How to Create a Prediction-Accuracy Table
How to Troubleshoot Regression Problems
How to Create a Goodness-of-Fit Plot
How to Test Residual Heteroscedasticity of Regression Models
How to Save Predicted Values of Regression Models
How to Save Fitted Values of Regression Models
How to Save Probabilities of Each Response of Regression Models
How to Test Residual Normality (Shapiro-Wilk) of Regression Models
How to Test Residual Serial Correlation (Durbin-Watson) of Regression Models