Partial Least Squares (PLS) is a popular method for relative importance analysis in fields where the data typically includes more predictors than observations. It is a dimension reduction technique with some similarities to principal component analysis. The predictor variables are mapped to a smaller set of variables, and a regression against the outcome variable is then performed within that smaller space. In principal component analysis, the dimension reduction ignores the outcome variable. The PLS procedure, in contrast, aims to choose new mapped variables that maximally explain the outcome variable.
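The contrast with principal component analysis can be illustrated with a small base R sketch. It is not part of the Displayr workflow and does not reproduce what plsr does internally. It assumes simulated data (all values and the names X, y, pca.w and pls.w are made up for illustration) and looks only at the first direction each method would choose: the PCA direction depends on the predictors alone, while the first PLS direction is proportional to the covariance between the predictors and the outcome.

```r
# Simulated data: 200 observations, 5 predictors (all values are made up).
set.seed(1)
n <- 200
X <- matrix(rnorm(n * 5), nrow = n)
X[, 1] <- 3 * X[, 1]              # predictor 1 has the largest variance
y <- X[, 5] + rnorm(n, sd = 0.1)  # but only predictor 5 drives the outcome

# Center the predictors and the outcome.
Xc <- scale(X, center = TRUE, scale = FALSE)
yc <- y - mean(y)

# PCA direction: leading eigenvector of the predictor covariance (ignores y).
pca.w <- eigen(cov(Xc))$vectors[, 1]

# First PLS direction: proportional to X'y, so it depends on the outcome.
pls.w <- drop(crossprod(Xc, yc))
pls.w <- pls.w / sqrt(sum(pls.w^2))

# Absolute correlation of each one-dimensional score with the outcome.
pca.cor <- abs(cor(drop(Xc %*% pca.w), yc))
pls.cor <- abs(cor(drop(Xc %*% pls.w), yc))
c(PCA = pca.cor, PLS = pls.cor)
```

In this setup the PCA direction follows the high-variance predictor, whereas the PLS direction leans toward the predictor that is actually related to the outcome, so its score correlates more strongly with y.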
Requirements
A Displayr document with data. In this example, we are going to load the data found in this link.
Please note these steps require a Displayr license.
Method
- Drag the variable "Q6" onto the Page from the Data Sources tree to create a summary table. This produces a table showing the breakdown of the respondents by category. It includes a Don't Know category that doesn't fit in the ordered scale from Love to Hate.
- To remove Don't Know, click on Q6 in the Data Sources tree, then in the object inspector, go to Data > Properties > Missing Values.
- Change Missing Values for the Don't Know category to Exclude from analyses, which produces the table below:
- Restructure the variable to Numeric by selecting Q6 in the Data Sources tree, and from the object inspector, go to Data > Properties > Structure > Numeric, which produces a table that looks like this:
- To create the PLS model, select Calculation > Custom Code from the toolbar.
- Click onto the page to place the custom calculation.
- Paste the following snippet into the R Code editor:
dat = data.frame(Q6_, Q5_0_, Q5_1_, Q5_2_, Q5_3_, Q5_4_, Q5_5_, Q5_6_, Q5_7_, Q5_8_,
Q5_9_, Q5_10_, Q5_11_, Q5_12_, Q5_13_, Q5_14_, Q5_15_, Q5_16_, Q5_17_,
Q5_18_, Q5_19_, Q5_20_, Q5_21_, Q5_22_, Q5_23_, Q5_24_, Q5_25_, Q5_26_,
Q5_27_, Q5_29_, Q5_28_, Q5_30_, Q5_31_, Q5_32_, Q5_33_)
library(pls)
library(flipFormat)
library(flipTransformations)
dat = AsNumeric(ProcessQVariables(dat), binary = FALSE, remove.first = FALSE)
pls.model = plsr(Q6_ ~ ., data = dat, validation = "CV")
The first statement builds a data frame with Q6_ as the outcome variable (strength of preference for a brand), followed by 34 predictor variables, each indicating whether the respondent perceives the brand to have a particular characteristic. In your project, these variables can be dragged across from the Data Sources tree on the left into the R Code editor rather than typed in one by one.
Next, the three libraries are loaded. The pls package contains the function that estimates the PLS model, while Displayr's publicly available packages flipFormat and flipTransformations help transform and tidy the data. Because the pls package requires numeric inputs, the variables are converted from categorical to numeric. In the final line, the plsr function fits the model with cross-validation and stores it in pls.model.
- Adding the following lines recreates the model with the optimal number of dimensions:
# Find the number of dimensions with lowest cross validation error
cv = RMSEP(pls.model)
best.dims = which.min(cv$val[estimate = "adjCV", , ]) - 1
# Rerun the model
pls.model = plsr(Q6_ ~ ., data = dat, ncomp = best.dims)
Q6_ is the outcome variable in this example. Replace it on the last line of code with your own outcome variable.
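The reason for subtracting 1 when computing best.dims can be seen in a small sketch. It assumes, as is the default for RMSEP in the pls package, that the first error in the list belongs to the intercept-only model with zero components. The vector adj.cv and its values are hypothetical and only stand in for the adjusted cross-validation errors.

```r
# Hypothetical adjusted CV errors, one per model size (values are made up).
# The first entry is the intercept-only model with zero components.
adj.cv <- c("(Intercept)" = 1.20, "1 comps" = 0.95, "2 comps" = 0.88,
            "3 comps" = 0.91, "4 comps" = 0.93)

# which.min returns the position of the smallest error (position 3 here).
# Subtracting 1 converts that position into a number of components.
best.dims <- unname(which.min(adj.cv)) - 1
best.dims
```

If the intercept-only model happens to have the lowest error, best.dims is 0, and refitting the model would likely need separate handling.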
- Finally, extract the useful information and format the output by adding the following lines of code:
coefficients = coef(pls.model)
The regression coefficients are normalized so their absolute sum is 100. The labels are added and the result is sorted.
sum.coef = sum(sapply(coefficients, abs))
coefficients = coefficients * 100 / sum.coef
names(coefficients) = TidyLabels(Labels(dat)[-1])
coefficients = sort(coefficients, decreasing = TRUE)
The results below show Reliable and Fun are positive predictors of preference, Unconventional and Sleepy are negative predictors, and Tough has little relevance.
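The normalization and sorting steps can be checked on a toy vector. This is a minimal base R sketch: the attribute labels and coefficient values are invented for illustration and are not results from the example data.

```r
# Hypothetical raw coefficients for five attributes (values are made up).
raw <- c(A = 0.30, B = 0.20, C = 0.01, D = -0.15, E = -0.34)

# Rescale so the absolute values sum to 100, keeping each sign.
scaled <- raw * 100 / sum(abs(raw))

# Sort from most positive to most negative.
scaled <- sort(scaled, decreasing = TRUE)
round(scaled, 1)
```

After rescaling, each value can be read as a signed share of the total absolute effect, so a coefficient near zero (C here) marks an attribute with little relevance.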