This article describes how to run Shapley Regression in Displayr. Shapley Regression is also known as Shapley Value Regression and is a leading method for driver analysis. It calculates the importance of different predictors in explaining an outcome variable and is prized for its ability to address multicollinearity.
Requirements
- Familiarity with the Structure and Value Attributes of Variable Sets, and how they are used in regression models, as described in our Driver Analysis eBook.
- An Outcome variable to be predicted.
- Predictor variable(s) to be considered as predictors of the outcome variable.
Refer to the Requirements section of How to Do Driver Analysis and the eBook for more information about best practices when structuring your variables. Consider the following structures for the Outcome and Predictor variables:
- As numeric (e.g. Numeric or Numeric - Multi in Structure).
- Coded so that higher levels of performance/satisfaction have higher numbers/values (this isn't a technical requirement, but it makes interpretation easier).
Often driver analysis is performed using data for multiple brands at the same time. Traditionally this is addressed by creating a new data file that stacks the data from each brand on top of each other (see Stacking Data). However, when performing driver analysis in Displayr, the data can be automatically stacked.
Please note these steps require a Displayr license.
Method
- From the toolbar, go to Anything > Advanced Analysis > Regression > Driver Analysis.
- In the object inspector, select the Outcome and Predictor(s).
- Change Output to Shapley Regression.
- The model output will look similar to the below:
[Example output: a table with an Importance column and a Raw score column for each predictor]
To interpret this output, look first at the Importance column, which shows the estimated importance of each driver; here, 'Network coverage' is the most important. The absolute values of the importance scores add up to 100. Note the negative value for 'Cancel your subscription/plan'. This is a special feature of our Shapley Regression: in the background, a traditional linear regression is run, and its signs are applied to the Shapley importance scores as a way of alerting you that some of the effects may be negative. You can turn this feature off by selecting Absolute importance scores from object inspector > Inputs > Linear Regression.
The second column shows the Raw score, which is the same as Importance except that, rather than adding up to 100, the raw scores add up to the R-squared statistic, which in this case is 0.3871 (shown in the footer). Thus we can say that 'Network coverage', for example, explains 7.3% of the variance in Net Promoter Score (the outcome variable).
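Conceptually, Shapley Regression attributes the model's R-squared to the predictors by averaging each predictor's marginal contribution to R-squared over all subsets (equivalently, all orderings) of the other predictors. The R sketch below illustrates that calculation on a built-in data set; it is a minimal illustration of the general method, not Displayr's internal code, and the function name shapley_importance and the example variables are our own.

```r
# Simplified Shapley importance: average each predictor's marginal
# contribution to R-squared across all subsets of the other predictors.
shapley_importance <- function(outcome, predictors, data) {
  p <- length(predictors)
  r2 <- function(vars) {
    if (length(vars) == 0) return(0)
    summary(lm(reformulate(vars, response = outcome), data = data))$r.squared
  }
  importance <- setNames(numeric(p), predictors)
  for (j in seq_len(p)) {
    others <- predictors[-j]
    for (k in 0:length(others)) {
      # Shapley weight for subsets of size k
      weight <- factorial(k) * factorial(p - k - 1) / factorial(p)
      for (s in combn(others, k, simplify = FALSE)) {
        importance[j] <- importance[j] +
          weight * (r2(c(s, predictors[j])) - r2(s))
      }
    }
  }
  importance  # the raw scores; they sum to the full model's R-squared
}

raw <- shapley_importance("mpg", c("wt", "hp", "disp"), mtcars)
round(100 * raw / sum(raw), 1)  # rescaled to sum to 100, as in the output
```

Because adding a predictor to a linear regression can never reduce R-squared, these raw contributions are non-negative; as noted above, the signs shown in the Displayr output come from the separate linear regression run in the background.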
Additional Options
- Robust standard errors - Computes standard errors that are robust to violations of the assumption of constant variance (i.e., heteroscedasticity); see Robust Standard Errors (and the sketch after this list). This option is only available when the Regression Type is Linear.
- Missing data - See Missing Data Options.
- Correction - The multiple comparisons correction applied when computing the p-values of the post-hoc comparisons.
- Variable names - Displays variable names in the output instead of labels.
- Absolute importance scores - Whether the absolute values of the importance scores should be displayed (see the note on negative values above).
- Crosstab Interaction - An optional variable to test for interaction with the other variables in the model. The interaction variable is treated as a categorical variable. Coefficients in the table are computed by fitting a separate regression for each level of the interaction variable. To evaluate whether a coefficient is significantly higher (blue) or lower (red), we perform a t-test comparing the coefficient with the coefficient estimated from the remaining data, as described in Driver Analysis (a simplified sketch appears after this list). P-values are corrected for multiple comparisons across the whole table (excluding the NET column). The p-value in the subtitle is calculated using a likelihood ratio test between the pooled model with no interaction variable and a model in which all predictors interact with the interaction variable.
- Automated outlier removal percentage - A numeric value between 0 and 50 (including 0 but excluding 50) specifying the percentage of the data to remove from the analysis as outliers. All regression types except Multinomial Logit support this feature. If zero is selected, no outlier removal is performed and a standard regression is fitted to the entire (possibly filtered) data set. If a non-zero value is selected, the regression model is fitted twice: the first model uses the entire data set (after filters have been applied) and identifies the observations with the largest residuals; the specified percentage of cases with the largest residuals is then removed, and the model is refitted on the reduced data set (a simplified sketch appears after this list). Note: Automated outlier removal is not supported with Multiple imputation of missing values; using them together will result in an error. Either change the missing data option or set the Automated outlier removal percentage to zero. The specific residual used depends on the Regression Type:
- Linear: The studentized residual in an unweighted regression and the Pearson residual in a weighted regression; in the weighted case, the Pearson residual adjusts appropriately for the provided survey weights.
- Binary Logit and Ordered Logit: A type of surrogate residual from the sure R package (see Greenwell, McCarthy, Boehmke, and Liu (2018)[1] for more details). Binary Logit uses the resids function with the jitter parametrization, while Ordered Logit uses the resids function with the latent parametrization to exploit the ordered logit structure.
- NBD Regression and Poisson Regression: A studentized deviance residual in an unweighted regression and the Pearson residual in a weighted regression.
- Quasi-Poisson Regression: A type of quasi-deviance residual via the rstudent function in an unweighted regression and the Pearson residual in a weighted regression.
The studentized residual measures the distance between the observed and fitted value for each point, standardized (adjusted) for the point's influence using an externally adjusted variance calculation. The studentized deviance residual measures the contribution of the fitted point to the likelihood, standardized in the same way. The Pearson residual in the weighted case measures the distance between the observed and fitted value and adjusts appropriately for the provided survey weights. See the rstudent function in R and Davison and Snell (1991)[2] for details of the calculations.
- Stack data - Whether the input data should be stacked before analysis. Stacking can be desirable when each individual in the data set has multiple cases and an aggregate model is desired. If this option is chosen, the Outcome needs to be supplied as a single variable set, and similarly the Predictor(s) need to be supplied as a single variable set. In the process of stacking, the data reduction is inspected, and any constructed NETs are removed unless they are composed of source values that are mutually exclusive of other codes (such as the result of merging two categories).
- Random seed - The seed used to initialize the (pseudo)random number generator for the model-fitting algorithm. Different seeds may lead to slightly different answers but should not normally make a large difference.
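Regarding the Robust standard errors option above: the article does not specify the exact estimator, so as a point of reference, the sketch below shows one standard way to compute heteroscedasticity-consistent standard errors for a linear model in R, using the sandwich and lmtest packages. It is illustrative only and should not be read as Displayr's implementation.

```r
library(sandwich)  # heteroscedasticity-consistent covariance estimators
library(lmtest)    # coeftest() for re-testing coefficients

fit <- lm(mpg ~ wt + hp, data = mtcars)
# Replace the usual covariance matrix with a robust (HC1) estimate
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))
```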
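Regarding Crosstab Interaction: the per-level comparison described above amounts to fitting the model within one level of the interaction variable and within the remaining data, then testing the difference between the two estimates of a coefficient. The sketch below illustrates the idea with a simple two-sample t-style statistic; the function and argument names are our own, the p-value shown is uncorrected, and the exact test and multiple-comparison correction are as described in Driver Analysis.

```r
# Compare a coefficient estimated within one level of a grouping variable
# against the same coefficient estimated from the remaining data.
compare_coefficient <- function(formula, data, group, level, term) {
  in_level <- data[[group]] == level
  fit_in  <- lm(formula, data = data[in_level, ])
  fit_out <- lm(formula, data = data[!in_level, ])
  b1 <- coef(fit_in)[term];  se1 <- coef(summary(fit_in))[term, "Std. Error"]
  b2 <- coef(fit_out)[term]; se2 <- coef(summary(fit_out))[term, "Std. Error"]
  t_stat <- (b1 - b2) / sqrt(se1^2 + se2^2)
  df <- fit_in$df.residual + fit_out$df.residual  # simple df choice
  c(difference = unname(b1 - b2),
    p = unname(2 * pt(-abs(t_stat), df)))  # uncorrected p-value
}

compare_coefficient(mpg ~ wt + hp, mtcars, group = "am", level = 1, term = "wt")
```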
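Regarding Automated outlier removal percentage: for an unweighted Linear regression, where the studentized residual is available via R's rstudent function, the two-pass procedure can be sketched as follows. The names are hypothetical and this is a simplified illustration (assuming complete cases), not Displayr's implementation.

```r
# Fit once, drop the specified percentage of cases with the largest
# absolute studentized residuals, then refit on the reduced data.
remove_outliers_and_refit <- function(formula, data, percent) {
  fit1 <- lm(formula, data = data)
  res <- abs(rstudent(fit1))                 # studentized residuals
  n_drop <- floor(nrow(data) * percent / 100)
  if (n_drop > 0) {
    keep <- rank(-res, ties.method = "first") > n_drop
    data <- data[keep, ]
  }
  lm(formula, data = data)                   # refit on the reduced data
}

fit <- remove_outliers_and_refit(mpg ~ wt + hp, mtcars, percent = 5)
```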
References
- Greenwell, B. M., McCarthy, A. J., Boehmke, B. C., & Liu, D. (2018). Residuals and Diagnostics for Binary and Ordinal Regression Models: An Introduction to the sure Package. The R Journal, 10(1), 281. https://journal.r-project.org/archive/2018/RJ-2018-004/
- Davison, A. C., & Snell, E. J. (1991). Residuals and diagnostics. In Statistical Theory and Modelling: In Honour of Sir David Cox, FRS (eds. D. V. Hinkley, N. Reid, and E. J. Snell). Chapman & Hall.