How to Do Principal Component Analysis in Displayr

Principal Component Analysis (PCA) is a technique for taking many variables and creating a new, smaller set of variables that captures as much of the variation in the data as possible. This article describes how to run a Principal Component Analysis in Displayr. This is useful when you want to identify redundant concepts in new product testing or reduce the number of options or questions presented to people in a survey.

Requirements

See this article in the Data Story Guide for an introduction to PCA.
Familiarity with the Structure and Value Attributes of Variable Sets.
A data set containing several Numeric, Numeric - Multi, or Binary - Multi variables that you want to combine and reduce down to a smaller number of variables or components. For this example, we'll use a series of attitudinal questions about mobile device attributes on a 1-5 scale where 1="Strongly agree" and 5="Strongly disagree".

Method - Running the PCA

From the toolbar, select Anything > Advanced Analysis > Dimension Reduction > Principal Component Analysis. Alternatively, in the Report tree, select + > Advanced Analysis > Dimension Reduction > Principal Component Analysis.
In Properties, select your input variables and/or variable sets from the Variables drop-down.
Click the Calculate button to run the PCA.

Method - Removing Redundancies

Where multiple variables have very high positive or negative correlations on a component in the loadings table (also called the Rotated Component Matrix), the implication is that they are redundant, and some of them can be removed. The basic process is:

Identify sets of concepts with high correlations to a particular component. Those will be highlighted by long bars beneath each component. Typically, when interpreting the component matrix, correlations with magnitudes of less than 0.3 or 0.4 are considered trivial.
Rank order the concepts according to their overall suitability (e.g., taking into account average appeal, fit with the brand, and other relevant criteria).
Retain the concepts that have the highest rank order on a particular component.

It can be helpful to check that changing the number of components does not greatly alter conclusions.
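
The first step of the process above can be sketched outside Displayr. The following Python snippet is a minimal illustration, not Displayr's own code: the function name salient_loadings, the concept names, and the loading values are all hypothetical, and the 0.4 cut-off is simply one of the rule-of-thumb values mentioned above.

```python
import numpy as np

def salient_loadings(loadings, names, threshold=0.4):
    """Group variables by component, keeping those with |loading| >= threshold.

    Within each component, variables are ordered from the largest to the
    smallest absolute loading.
    """
    loadings = np.asarray(loadings, dtype=float)
    groups = {}
    for j in range(loadings.shape[1]):
        col = loadings[:, j]
        order = np.argsort(-np.abs(col))  # largest magnitude first
        groups[j + 1] = [
            (names[i], float(col[i])) for i in order if abs(col[i]) >= threshold
        ]
    return groups

# Hypothetical loadings for five concepts on two components (made-up values)
names = ["Concept A", "Concept B", "Concept C", "Concept D", "Concept E"]
loadings = [
    [0.85, 0.10],
    [0.80, -0.05],
    [0.12, 0.78],
    [-0.20, 0.74],
    [0.25, 0.15],
]

groups = salient_loadings(loadings, names)
for component, members in groups.items():
    print(f"Component {component}: {members}")
```

In this made-up example, Concepts A and B load together on the first component and Concepts C and D on the second, so each pair is a candidate for the ranking and retention steps. Concept E does not load strongly on either component.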

Additional Settings

Normalize variables - if selected, the correlation matrix will be used instead of the covariance matrix. This is checked by default.
Create binary variables from categories - if selected, unordered categorical variables will be represented as binary variables. Otherwise, their Value Attributes are used. Numeric - Multi variable sets are treated according to their numeric values and not converted to binary. This is unchecked by default.
Rule for selecting components - the following options are available:

Kaiser rule - keeps components with eigenvalues greater than 1. If the unscaled covariance matrix is used instead of the correlation matrix, components with eigenvalues exceeding the mean eigenvalue are retained. This is selected by default.
Eigenvalue over - keeps components with eigenvalues greater than a user-specified number. If the unscaled covariance matrix is used instead of the correlation matrix, components with eigenvalues exceeding a multiple of the eigenvalue mean are retained.
Number of components - manually select the number of components to keep.
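
The selection rules above can be illustrated with a short Python sketch. This is an approximation of the logic as described in this article, not Displayr's implementation: the function n_components_to_keep and its parameters are hypothetical names, and the data are simulated. With cutoff=1.0 the function mimics the Kaiser rule, and other values of cutoff mimic the Eigenvalue over option.

```python
import numpy as np

def n_components_to_keep(data, normalize=True, cutoff=1.0):
    """Count components whose eigenvalue exceeds the cut-off.

    normalize=True  -> correlation matrix, threshold is `cutoff` itself
    normalize=False -> covariance matrix, threshold is `cutoff` times the
                       mean eigenvalue
    """
    data = np.asarray(data, dtype=float)
    if normalize:
        matrix = np.corrcoef(data, rowvar=False)
    else:
        matrix = np.cov(data, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(matrix)[::-1]  # largest first
    threshold = cutoff if normalize else cutoff * eigenvalues.mean()
    return eigenvalues, int(np.sum(eigenvalues > threshold))

# Simulated data: six variables driven by two independent underlying factors
rng = np.random.default_rng(0)
f1 = rng.normal(size=500)
f2 = rng.normal(size=500)
data = np.column_stack(
    [f1 + 0.3 * rng.normal(size=500) for _ in range(3)]
    + [f2 + 0.3 * rng.normal(size=500) for _ in range(3)]
)

eig_corr, k_corr = n_components_to_keep(data)                  # Kaiser rule
eig_cov, k_cov = n_components_to_keep(data, normalize=False)   # covariance version
print(k_corr, k_cov)
```

The eigenvalues of a correlation matrix always sum to the number of variables, so their mean is 1. This is why "greater than 1" and "greater than the mean eigenvalue" are the same rule applied to the two kinds of matrix.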

Rotation method - rotations of the principal components are used to produce solutions where the loadings tend to be closer to 0, 1, or -1, making interpretation of the solution easier.

The Varimax, Quartimax, and Equamax rotations are orthogonal, which means that the components produced are always uncorrelated with one another.
The Promax and Oblimin rotations are oblique, meaning that the components can be correlated with one another.
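
As a rough illustration of what an orthogonal rotation does, the following Python sketch applies a commonly used SVD-based varimax iteration to a small loadings matrix. It is not Displayr's code: the functions varimax and flip_signs are hypothetical names, the loadings are made up, and the sketch omits the row (Kaiser) normalization that some implementations apply before rotating. The sign flip shown is one plausible reading of making the largest loading on each component positive.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Rotate loadings with an orthogonal matrix found by varimax iteration."""
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    R = np.eye(k)          # start from no rotation
    criterion = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        # Gradient-like target matrix for the varimax criterion
        target = Lr**3 - Lr @ np.diag(np.sum(Lr**2, axis=0)) / p
        u, s, vt = np.linalg.svd(L.T @ target)
        R = u @ vt         # closest orthogonal matrix
        new_criterion = s.sum()
        if new_criterion < criterion * (1 + tol):
            break          # no meaningful improvement
        criterion = new_criterion
    return L @ R, R

def flip_signs(loadings):
    """Flip each component so its largest-magnitude loading is positive."""
    L = np.array(loadings, dtype=float)
    rows = np.argmax(np.abs(L), axis=0)
    signs = np.sign(L[rows, np.arange(L.shape[1])])
    signs[signs == 0] = 1.0
    return L * signs

# Hypothetical unrotated loadings: four variables, two components
unrotated = np.array([
    [0.7, 0.5],
    [0.6, 0.6],
    [0.7, -0.5],
    [0.6, -0.6],
])

rotated, R = varimax(unrotated)
rotated = flip_signs(rotated)
print(np.round(rotated, 2))
```

Because the rotation matrix is orthogonal, each variable's communality (the sum of its squared loadings across components) is unchanged by the rotation. Only the way that variance is shared out between the components changes.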

After rotation, components with large negative loadings will have their signs flipped, so that the largest loadings are positive, making interpretation easier.

Missing data - determines how to handle missing data. See Missing Data Options for more details.
Output - the following PCA output options are available:

Loadings Table - displays a table of the component loadings, which is sometimes referred to as a Pattern matrix. This is selected by default.
Structure Matrix - displays the structure matrix, which is the loadings matrix multiplied by the correlations between the components.
Variance Explained - displays the eigenvalues of the original, unrotated components, along with the variance explained, and cumulative variance explained.
An eigenvalue is a number produced by the mathematics of determining the new principal components. It represents the amount of variance in the original data that is captured by that component. The percentage figures in the top row represent the percentage of variance captured by each component, and they are calculated by dividing each eigenvalue by the total of the eigenvalues of all components (before the smallest ones are discarded). PCA is a process of finding a new, smaller set of variables that captures as much variance as possible, so if you want 5 new components, you are picking the 5 components that have the largest eigenvalues.
Component Plot - displays a scatterplot of the loadings of the first two principal components.
Scree Plot - displays a chart of the eigenvalues of the correlation or covariance matrix.
Detailed Output - shows more details on the results, including the loadings, structure matrix, variable communalities, sum of squared loadings, and score weights.
2D Scatterplot - shows the data charted with axes of the first 2 components and labeled according to Grouping Variable.
Sort coefficients by size - when displaying loadings or the structure matrix, sort the components according to their size.
Suppress small coefficients - when displaying loadings or the structure matrix, replace small values with blank spaces to facilitate interpretation.
Absolute value below - in tables, cells which have absolute values smaller than the entered value will be replaced with blank spaces.
Variable names - if checked, displays variable names in the output instead of variable labels.
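
The arithmetic behind the Variance Explained output described above can be reproduced in a few lines of Python. This is a sketch under stated assumptions: the function name variance_explained is hypothetical, and the eigenvalues are made-up values for a four-variable correlation matrix (so they sum to 4).

```python
import numpy as np

def variance_explained(eigenvalues):
    """Return eigenvalues (largest first), % variance, and cumulative %."""
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    pct = 100.0 * ev / ev.sum()   # each eigenvalue over the total of all of them
    return ev, pct, np.cumsum(pct)

# Hypothetical eigenvalues of a 4-variable correlation matrix
eigenvalues = [2.0, 1.2, 0.5, 0.3]
ev, pct, cum = variance_explained(eigenvalues)

for i, (e, p, c) in enumerate(zip(ev, pct, cum), start=1):
    print(f"Component {i}: eigenvalue={e:.2f}, variance={p:.1f}%, cumulative={c:.1f}%")
```

With these made-up values, the first two components account for 50% and 30% of the variance, or 80% cumulatively. They are also the two components the Kaiser rule would keep, because only their eigenvalues exceed 1.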