Linear Regression models the linear relationship between a dependent variable and one or more independent variables. The linear regression option is most commonly used when the dependent variable is continuous. This article describes how to create and run a linear regression model in Displayr.
- Familiarity with the Structure and Value Attributes of Variable Sets, and how they are used in regression models per our Driver Analysis ebook.
- Predictor variables (aka features or independent variables) - these can be numeric or binary. To use a categorical variable in regression, you need to create a separate dummy variable for each category and use those instead. For example, if Employment Category has three categories (manager, custodial, clerical), you can create three new variables called manager, custodial and clerical.
- An outcome variable (aka dependent variable) - this variable must be numeric.
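As an illustration of the dummy-variable requirement above, here is a minimal Python sketch of what dummy coding produces. It is not a Displayr feature or API: the function name `dummy_code`, its `reference` argument, and the sample data are hypothetical and exist only for this example.

```python
def dummy_code(values, reference=None):
    """Return {category: [0/1, ...]} with one indicator column per category.

    If `reference` is given, that category's column is omitted so that it can
    serve as the reference category in a regression.
    """
    categories = sorted(set(values))
    return {
        cat: [1 if v == cat else 0 for v in values]
        for cat in categories
        if cat != reference
    }


# Hypothetical Employment Category values for five respondents.
employment = ["manager", "clerical", "custodial", "clerical", "manager"]

# One dummy variable per category (three columns).
all_dummies = dummy_code(employment)

# The same coding with "clerical" left out as the reference category.
predictors = dummy_code(employment, reference="clerical")
```

Each column contains 1 where the respondent belongs to that category and 0 otherwise. Leaving one column out avoids the perfect collinearity that arises when every category has its own dummy variable alongside an intercept.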
- From the toolbar, go to Anything > Advanced Analysis > Regression > Linear Regression.
- In the object inspector, select your numeric Outcome variable.
- In Predictor(s), select your predictor variable(s). The fastest way to do this is to select them all in the Data Sets tree and drag them into the Predictor(s) box. If you are using dummy variables you created from a categorical variable, be sure to leave one out to serve as the reference category.
- From Algorithm, choose Regression.
- From Regression type, choose Linear.
- From Output, choose the type of output. The default is Summary, which gives you the regression coefficients table, R-squared, the AIC fit value, missing value treatment, and other information. Use Summary if you are primarily interested in the regression equation and the percentage of variance the model accounts for. Other options include:
- Detail - Typical R output, with some additional information compared to Summary but without the formatted presentation.
- ANOVA - Analysis of variance table containing the results of Chi-squared likelihood ratio tests for each predictor.
- Relative Importance Analysis - The results of a relative importance analysis. See here and the references for more information. Note that categorical predictors are not converted to be numeric, unlike in Driver (Importance) Analysis - Relative Importance Analysis.
- Shapley Regression - See here and the references for more information. This option is only available for Linear Regression. Note that categorical predictors are not converted to be numeric, unlike in Driver (Importance) Analysis - Shapley.
- Jaccard Coefficient - Computes the relative importance of the predictor variables against the outcome variable using Jaccard coefficients. See Driver (Importance) Analysis - Jaccard Coefficient. This option requires the outcome variable and all predictor variables to be binary.
- Correlation - Computes the relative importance of the predictor variables against the outcome variable via the bivariate Pearson product moment correlations. See Driver (Importance) Analysis - Correlation and references therein for more information.
- Effects Plot - Plots the relationship between each of the Predictors and the Outcome.
- Missing data gives you several options for how to treat missing values. The default is to exclude cases with missing values. This is usually the preferred choice unless the regression model contains variables with high percentages of missing values. In that case, consider excluding one or more of those variables, or use Multiple imputation if you want to keep those variables in your regression model with estimated values.
- OPTIONAL: Tick Robust standard errors. This computes standard errors that are robust to violations of the assumption of constant variance (i.e., heteroscedasticity). See Robust Standard Errors.
- Click Calculate.
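To make concrete what the regression coefficients and R-squared in the Summary output represent, here is a minimal Python sketch of ordinary least squares. It is an illustration under stated assumptions, not how Displayr computes its output: the function name `fit_ols` and the data are hypothetical, and the sketch ignores weights, missing data, and robust standard errors.

```python
def fit_ols(rows, y):
    """Fit y = b0 + b1*x1 + ... by solving the normal equations (X'X) b = X'y.

    Returns (coefficients, r_squared); coefficients[0] is the intercept.
    """
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * yi for x, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    coefs = [A[i][k] / A[i][i] for i in range(k)]
    # R-squared: share of the outcome's variance explained by the fit.
    fitted = [sum(c * v for c, v in zip(coefs, x)) for x in X]
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return coefs, 1.0 - ss_res / ss_tot


# Hypothetical data: one numeric predictor and one binary (dummy) predictor.
rows = [(1, 0), (2, 1), (3, 0), (4, 1), (5, 0)]
outcome = [3.1, 7.9, 7.0, 12.2, 10.8]
coefs, r_squared = fit_ols(rows, outcome)
```

The coefficients give the regression equation, and `r_squared` is the proportion of variance in the outcome that the model accounts for. Solving the normal equations directly keeps the sketch short; production software typically uses more numerically stable decompositions.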
- Correction - The multiple comparisons correction applied when computing the p-values of the post-hoc comparisons.
- Variable names - Displays Variable Names in the output instead of labels.
- Weight - Where a weight has been set for the R Output, it will automatically be applied when the model is estimated. By default, the weight is assumed to be a sampling weight, and the standard errors are estimated using Taylor series linearization. See Weights, Effective Sample Size and Design Effects.
- Filter - The data is automatically filtered using any applied filters before the model is estimated.
- Crosstab Interaction - Optional variable to test for interaction with other variables in the model. The interaction variable is treated as a categorical variable. Coefficients in the table are computed by creating separate regressions for each level of the interaction variable. To evaluate whether a coefficient is significantly higher (blue) or lower (red), we perform a t-test of the coefficient compared to the coefficient using the remaining data as described in Driver Analysis. P-values are corrected for multiple comparisons across the whole table (excluding the NET column). The P-value in the sub-title is calculated using a likelihood ratio test between the pooled model with no interaction variable, and a model where all predictors interact with the interaction variable.
- Automated outlier removal percentage - A numeric value from 0 up to, but not including, 50 that specifies the percentage of the data to remove from the analysis as outliers. All regression types except Multinomial Logit support this feature. If the value is zero, no outlier removal is performed and a standard regression output is produced for the entire (possibly filtered) dataset. If the value is non-zero, the regression model is fitted twice. The first model uses the entire dataset (after filters have been applied) and identifies the observations that generate the largest residuals. The user-specified percentage of cases with the largest residuals is then removed, and the model is refitted on this reduced dataset and the output returned. The specific residual used depends on the Regression type.
- Linear: The studentized residual in an unweighted regression and the Pearson residual in a weighted regression. The Pearson residual in the weighted case adjusts appropriately for the provided survey weights.
The studentized residual computes the distance between the observed and fitted value for each point and standardizes (adjusts) it based on the influence of the point and an externally adjusted variance calculation. The studentized deviance residual computes the contribution the fitted point makes to the likelihood and standardizes (adjusts) it based on the influence of the point and an externally adjusted variance calculation. The Pearson residual in the weighted case computes the distance between the observed and fitted value and adjusts appropriately for the provided survey weights. See the rstudent function in R and Davison and Snell (1991) for the specifics of the calculations.
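The two-pass procedure described above can be sketched as follows for the unweighted case with a single predictor. This is a simplified illustration, not Displayr's implementation: the function names, the sample data, and the choice to round the number of removed cases down are all assumptions made for this example, and the weighted (Pearson residual) case is not covered.

```python
import math


def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - slope * x_bar, slope


def studentized_residuals(x, y):
    """Externally studentized residuals for a one-predictor linear model."""
    n = len(x)
    intercept, slope = fit_line(x, y)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(e * e for e in resid)
    result = []
    for xi, e in zip(x, resid):
        h = 1.0 / n + (xi - x_bar) ** 2 / sxx  # leverage (influence) of the point
        # Residual variance estimated with this point left out
        # (n - 3 = n - 2 fitted parameters - 1 removed point).
        s2_loo = (sse - e * e / (1.0 - h)) / (n - 3)
        result.append(e / math.sqrt(s2_loo * (1.0 - h)))
    return result


def refit_without_outliers(x, y, percent):
    """Fit, drop the `percent` of cases with the largest residuals, then refit."""
    if not 0 <= percent < 50:
        raise ValueError("percent must be in [0, 50)")
    n = len(x)
    if percent == 0:
        return fit_line(x, y), list(range(n))
    t = studentized_residuals(x, y)
    n_remove = int(n * percent / 100)  # rounding down is an assumption
    by_size = sorted(range(n), key=lambda i: abs(t[i]))
    keep = sorted(by_size[: n - n_remove])
    return fit_line([x[i] for i in keep], [y[i] for i in keep]), keep


# Hypothetical data lying close to y = 1 + 2x, with one outlier at x = 5.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [3.1, 4.9, 7.2, 8.8, 30.0, 13.1, 14.9, 17.0, 19.2, 20.8]
(intercept, slope), keep = refit_without_outliers(x, y, 10)
```

With 10 cases and a removal percentage of 10, one case is dropped before the second fit. Ranking by the studentized rather than the raw residual accounts for the fact that high-leverage points pull the fitted line toward themselves and so tend to have smaller raw residuals.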
- Stack data - Whether the input data should be stacked before analysis. Stacking can be desirable when each individual in the data set has multiple cases and an aggregate model is desired. If this option is chosen then the Outcome needs to be a single . Similarly, the Predictor(s) need to be a single . In the process of stacking, the data reduction is inspected. Any constructed NETs are removed unless comprised of source values that are mutually exclusive to other codes, such as the result of merging two categories.
- Random seed - Seed used to initialize the (pseudo)random number generator for the model fitting algorithm. Different seeds may lead to slightly different answers, but should normally not make a large difference.