Linear Regression models the linear relationship between a dependent variable and one or more independent variables. The linear regression option is most commonly used when the dependent variable is continuous. This article describes how to create and run a linear regression model in Displayr.
- Understanding of How Does Linear Regression Work?
- Familiarity with the Structure and Value Attributes of Variable Sets, and how they are used in regression models per our Driver Analysis ebook.
- Predictor variables (aka features or independent variables) - these can be numeric or binary. To use categorical variables in regression, you need to create a separate dummy variable for each category and use those instead (e.g. if Employment Category has three categories (manager, custodial, clerical) you can create three new variables called manager, custodial and clerical).
- An outcome variable (aka dependent variable) - this variable must be numeric.
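The dummy-variable step described above can be sketched outside Displayr as follows. This is a minimal illustration in plain Python, not Displayr's own implementation: the helper name `make_dummies`, its `reference` parameter, and the sample data are all hypothetical.

```python
def make_dummies(values, reference=None):
    """Return {category: [0/1, ...]} for every category except `reference`."""
    categories = sorted(set(values))
    return {
        cat: [1 if v == cat else 0 for v in values]
        for cat in categories
        if cat != reference
    }


# Hypothetical Employment Category data, for illustration only.
employment = ["manager", "clerical", "custodial", "clerical", "manager"]

# One dummy variable per category (manager, custodial, clerical).
all_dummies = make_dummies(employment)

# The set that would go into the model: one category is left out
# so that it can serve as the reference category.
model_dummies = make_dummies(employment, reference="clerical")

print(all_dummies["manager"])
print(sorted(model_dummies))
```

Each case has exactly one dummy equal to 1 across the full set, which is why one of them has to be dropped before the variables are used together as predictors.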
- From the toolbar, go to Anything > Advanced Analysis > Regression > Linear Regression
- In the object inspector, select your numeric Outcome variable.
- In Predictor(s), select your predictor variable(s). The fastest way to do this is to select them all in the Data Sets tree and drag them into the Predictor(s) box, but if you are using dummy variables you created from a categorical variable, be sure to leave one out to serve as the reference category.
- From Algorithm, choose Regression.
- From Regression type, choose Linear.
- From Output, the default is Summary. This output gives you the Regression coefficients table, R-Squared, the AIC fit value, missing value treatment, and other information. There are several other options you can choose from but you should use Summary if you are primarily interested in the regression equation and the percentage of variance the model accounts for. Other options include:
- Detail - Typical R output, some additional information compared to Summary, but without the pretty formatting.
- ANOVA - Analysis of variance table containing the results of Chi-squared likelihood ratio tests for each predictor.
- Relative Importance Analysis - The results of a relative importance analysis. See here and the references for more information. Note that categorical predictors are not converted to be numeric, unlike in Driver (Importance) Analysis - Relative Importance Analysis.
- Shapley Regression - See here and the references for more information. This option is only available for Linear Regression. Note that categorical predictors are not converted to be numeric, unlike in Driver (Importance) Analysis - Shapley.
- Jaccard Coefficient - Computes the relative importance of the predictor variables against the outcome variable with the Jaccard Coefficients. See Driver (Importance) Analysis - Jaccard Coefficient. This option requires both binary variables for the outcome variable and the predictor variables.
- Correlation - Computes the relative importance of the predictor variables against the outcome variable via the bivariate Pearson product moment correlations. See Driver (Importance) Analysis - Correlation and references therein for more information.
- Effects Plot - Plots the relationship between each of the Predictors and the Outcome
Missing data gives you several options for how to treat missing values. The default is to exclude cases with missing values. This is usually the preferred choice unless the regression model contains variables with high percentages of missing values. In cases like that, you might consider excluding one or more of those variables, or use Multiple imputation if you want to keep those variables in your regression model with estimated values.
Note: Multiple Imputation is not supported with automated outlier removal. Using the two options together will result in an error. Either change the missing value handling option or set the Automated outlier removal percentage to zero.
- OPTIONAL: Tick Robust standard errors. This computes standard errors that are robust to violations of the assumption of constant variance (i.e., heteroscedasticity). See Robust Standard Errors.
- Click Calculate.
- Correction - The multiple comparisons correction applied when computing the p-values of the post-hoc comparisons.
- Variable names - Displays Variable Names in the output instead of labels
- Weight - Where a weight has been set for the R Output, it will automatically be applied when the model is estimated. By default, the weight is assumed to be a sampling weight, and the standard errors are estimated using Taylor series linearization. See Weights, Effective Sample Size and Design Effects.
- Filter - The data is automatically filtered using any filters prior to estimating the model.
- Crosstab Interaction - Optional variable to test for interaction with other variables in the model. The interaction variable is treated as a categorical variable. Coefficients in the table are computed by creating separate regressions for each level of the interaction variable. To evaluate whether a coefficient is significantly higher (blue) or lower (red), we perform a t-test of the coefficient compared to the coefficient using the remaining data as described in Driver Analysis. P-values are corrected for multiple comparisons across the whole table (excluding the NET column). The P-value in the sub-title is calculated using a likelihood ratio test between the pooled model with no interaction variable, and a model where all predictors interact with the interaction variable.
Automated outlier removal percentage - A numeric value between 0 and 50 (including 0 but not 50) that specifies the percentage of the data to remove from the analysis as outliers. All regression types except Multinomial Logit support this feature. If the value is zero, no outlier removal is performed and a standard regression output for the entire (possibly filtered) dataset is returned. If the value is non-zero, the regression model is fitted twice. The first regression model uses the entire dataset (after filters have been applied) and identifies the observations that generate the largest residuals. The user-specified percentage of cases with the largest residuals is then removed. The regression model is refitted on this reduced dataset and the output is returned. The specific residual used varies depending on the regression Type.
- Linear: The studentized residual in an unweighted regression and the Pearson residual in a weighted regression. The Pearson residual in the weighted case adjusts appropriately for the provided survey weights.
The studentized residual computes the distance between the observed and fitted value for each point and standardizes (adjusts) it based on the influence of the point and an externally adjusted variance calculation. The studentized deviance residual computes the contribution the fitted point makes to the likelihood and standardizes (adjusts) it based on the influence of the point and an externally adjusted variance calculation. The Pearson residual in the weighted case computes the distance between the observed and fitted value and adjusts appropriately for the provided survey weights. See the rstudent function in R and Davison and Snell (1991) for the specifics of the calculations.
- Stack data - Whether the input data should be stacked before analysis. Stacking can be desirable when each individual in the data set has multiple cases and an aggregate model is desired. If this option is chosen then the Outcome needs to be a single . Similarly, the Predictor(s) need to be a single . In the process of stacking, the data reduction is inspected. Any constructed NETs are removed unless comprised of source values that are mutually exclusive to other codes, such as the result of merging two categories.
- Random seed - Seed used to initialize the (pseudo)random number generator for the model fitting algorithm. Different seeds may lead to slightly different answers, but should normally not make a large difference.
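The two-pass procedure described under Automated outlier removal percentage can be sketched as follows for the unweighted case with a single predictor. This is a rough illustration under stated assumptions, not Displayr's code: the function names and data are hypothetical, the number of cases to remove is rounded down (the rounding rule is not stated above), and the weighted (Pearson residual) case is not covered.

```python
import math


def fit_simple(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def studentized_residuals(x, y):
    """Externally studentized residuals for a one-predictor model."""
    n = len(x)
    a, b = fit_simple(x, y)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in resid)
    out = []
    for xi, e in zip(x, resid):
        h = 1 / n + (xi - mean_x) ** 2 / sxx         # leverage (influence)
        s2_loo = (sse - e ** 2 / (1 - h)) / (n - 3)  # variance with this case left out
        out.append(e / math.sqrt(s2_loo * (1 - h)))
    return out


def fit_with_outlier_removal(x, y, percent):
    """Fit, drop the `percent` of cases with the largest residuals, refit."""
    if not 0 <= percent < 50:
        raise ValueError("percent must be at least 0 and less than 50")
    n_remove = int(len(x) * percent / 100)  # assumption: round down
    if n_remove == 0:
        return fit_simple(x, y), list(range(len(x)))
    t = studentized_residuals(x, y)
    order = sorted(range(len(x)), key=lambda i: abs(t[i]))
    keep = sorted(order[: len(x) - n_remove])
    return fit_simple([x[i] for i in keep], [y[i] for i in keep]), keep


# Hypothetical data: roughly y = 1 + 2x, with one gross outlier at index 4.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [3.1, 4.9, 7.2, 8.8, 30.0, 13.1, 14.9, 17.2, 18.8, 21.1]

(intercept, slope), kept = fit_with_outlier_removal(x, y, percent=10)
print(kept)
print(round(slope, 2))
```

With these illustrative numbers the first fit flags the case at index 4, and the refit on the remaining nine cases recovers a slope close to 2.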
Below we perform a regression analysis of salary based on other demographics and credentials.
- Select Anything > Advanced Analysis > Regression > Linear Regression
- In Outcome, select your numeric dependent variable - the example below uses Salary.
- In Predictor(s), select your predictor variable(s) - the example below uses Gender, Age, and whether the respondent has a PhD.
- From Algorithm, choose Regression.
- From Regression type, choose Linear.
- From Output select Summary.
The results are as follows:
- The column labelled Estimate contains the partial regression coefficients used in the prediction equation. These estimates are also known as the coefficients and parameters.
- The Standard Error column quantifies the uncertainty of the estimates. A standard error that is small relative to its Estimate tells us that the Estimate is quite precise, as is also indicated by a high t (which is Estimate / Standard Error) and a small p-value. An R-Squared close to 1 likewise suggests that the model fits the data well.
- t tests are applied to each partial regression coefficient to test whether the coefficient is different from zero in the population.
- R-Squared can be interpreted as the proportion of variance in the dependent variable that is explained by a linear combination of the independent variables. For instance, in a model with YearsExperience as the only independent variable, an R-Squared of 0.957 means that YearsExperience accounts for 95.7% of the variance in Salary.
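The relationship between the Estimate, Standard Error, t and R-Squared values described above can be checked by hand for a one-predictor model. The sketch below uses plain Python and made-up numbers (the experience and salary figures are hypothetical and are not the data behind the output in this article); the function name `simple_regression_summary` is likewise an illustration.

```python
import math


def simple_regression_summary(x, y):
    """Slope estimate, its standard error, t and R-Squared for y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual (unexplained) and total sums of squares.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    std_error = math.sqrt(sse / (n - 2) / sxx)
    return {
        "estimate": slope,
        "std_error": std_error,
        "t": slope / std_error,       # t is Estimate / Standard Error
        "r_squared": 1 - sse / sst,   # proportion of variance explained
    }


# Hypothetical years of experience and salary (in thousands).
years = [1, 2, 3, 4, 5, 6]
salary = [40, 44, 51, 54, 61, 64]

summary = simple_regression_summary(years, salary)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A small standard error relative to the estimate produces a large t, and a small residual sum of squares relative to the total produces an R-Squared near 1.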
Interpreting the equation
The multiple regression equation accommodates more than one independent variable:
Predicted Y = a + b1 * X1 + b2 * X2 + ... + bn * Xn
b1 represents the expected change in the dependent variable when X1 increments by one, all other independent variables being equal. The same holds true for b2, b3, and so forth.
The prediction equation is derived from the Estimates column. The prediction equation in this example is:
Salary in Thousands = -6.95 + 11.09 * Gender:Male + 0.85 * Age + 36.43 * PhD: Yes
The coefficient 0.85 for Age represents the expected change in Salary in Thousands when Age increases by one, provided that Gender:Male and PhD:Yes remain fixed. For example, think of two people who both have a Ph.D. and are both female, one being one year older than the other. You would then expect the older person to earn 850 dollars more than the younger one. Thus, a coefficient gives the net effect of a predictor, controlling for the other variables in the equation.
Also, notice that the p-value for Gender:Male is .101, which means the variable is not significant when we hold the other two variables constant. We will eliminate it from the model and rerun the analysis.
The results are as follows:
The final prediction equation is:
Salary in Thousands = -3.92 + 0.89 * Age + 38.09 * PhD: Yes
For example, if two people are 25 years old, but only one of them has a Ph.D., you would expect the person with a Ph.D. to have $38,090 more salary than the person without the Ph.D.
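The two worked comparisons above can be reproduced by plugging values into the prediction equations. The sketch below is only an arithmetic check in plain Python: the coefficients are those quoted in this article, while the function names and the 0/1 coding of Gender:Male and PhD:Yes are assumptions made for illustration.

```python
def predict_full(male, age, phd):
    """First model: Salary in Thousands from Gender:Male, Age and PhD:Yes."""
    return -6.95 + 11.09 * male + 0.85 * age + 36.43 * phd


def predict_final(age, phd):
    """Final model, after dropping Gender:Male."""
    return -3.92 + 0.89 * age + 38.09 * phd


# Two women with a Ph.D., one year apart in age (first model).
age_gap = predict_full(male=0, age=41, phd=1) - predict_full(male=0, age=40, phd=1)

# Two 25-year-olds, only one of whom has a Ph.D. (final model).
phd_gap = predict_final(age=25, phd=1) - predict_final(age=25, phd=0)

# Convert from thousands to dollars.
print(round(age_gap * 1000))  # 850
print(round(phd_gap * 1000))  # 38090
```

Because the other predictors are held at the same values in each pair, the difference in predicted salary is exactly the coefficient of the predictor that changes.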