Principal Components Analysis (PCA) is a technique for taking many variables and creating a new, smaller set of variables which aims to capture as much of the variation in the data as possible. This article describes how to run a Principal Components Analysis in Displayr. This is useful when you want to identify redundant concepts in new product testing or reduce the options/questions presented to people in a survey.
- See this blog post for an introduction to PCA.
- Familiarity with the Structure and Value Attributes of Variable Sets.
- A data set containing several Numeric, Numeric - Multi, or Binary - Multi variables that you want to combine and reduce down to a smaller number of variables or components. For this example, we'll use a series of attitudinal questions about mobile device attributes on a 1-5 scale where 1="Strongly agree" and 5="Strongly disagree".
Method - Running the PCA
1. From the toolbar menu, select Anything > Advanced Analysis > Dimension Reduction > Principal Component Analysis.
2. From the object inspector on the right, select your input variables and/or variable sets from the Variables drop-down.
3. Click the Calculate button to run the PCA.
Method - Removing Redundancies
Where multiple variables have very high positive or negative correlations with a component in the loadings table above (also called the Rotated Component Matrix), the implication is that they are redundant and some of them can be removed. The basic process is:
- Identify sets of concepts with high correlations to a particular component. Those will be highlighted as having long bars underneath each component. Typically, when interpreting the component matrix, correlations with a magnitude of less than 0.3 or 0.4 are regarded as being trivial.
- Rank order the concepts according to their overall suitability (e.g., taking into account average appeal, fit with the brand and other relevant criteria).
- Retain the concepts that have the highest rank order on a particular component.
It can be helpful to check that changing the number of components does not greatly alter conclusions.
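The first step of this process can be sketched outside Displayr. The example below is a minimal illustration in Python, not Displayr's own logic: the concept labels, the loadings, the 0.4 cutoff, and the `redundant_sets` helper are all hypothetical, and it makes the simplifying assumption that each concept is assigned only to the component on which it loads most strongly.

```python
import numpy as np

# Hypothetical rotated loadings: rows are concepts, columns are components.
labels = ["Concept A", "Concept B", "Concept C", "Concept D"]
loadings = np.array([
    [0.85, 0.10],
    [0.80, -0.05],
    [0.15, 0.75],
    [-0.20, -0.70],
])

def redundant_sets(loadings, labels, threshold=0.4):
    """Group concepts by the component they load on most strongly.

    Loadings whose magnitude is below `threshold` are treated as trivial,
    so a concept with no loading at or above it is left out of every group.
    """
    groups = {}
    for row, label in zip(loadings, labels):
        strongest = int(np.argmax(np.abs(row)))  # sign is ignored
        if abs(row[strongest]) >= threshold:
            groups.setdefault(strongest, []).append(label)
    return groups

groups = redundant_sets(loadings, labels)
for component, members in groups.items():
    print(f"Component {component + 1}: {members}")
```

Each group is a candidate set of redundant concepts; which ones to retain still depends on the ranking described in the steps above.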
Normalize variables - if selected, the correlation matrix will be used instead of the covariance matrix. This is checked by default.
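To see why this choice matters, here is a small Python sketch using simulated data (the data and variable names are assumptions for illustration only). It shows that the correlation matrix is the covariance matrix of the standardized variables, and that without normalization a variable measured on a much larger scale dominates the first component.

```python
import numpy as np

# Simulated data: three independent variables on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])

# Standardize each variable to mean 0 and standard deviation 1.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

corr = np.corrcoef(X, rowvar=False)   # used when variables are normalized
cov = np.cov(X, rowvar=False)         # used when they are not
cov_of_z = np.cov(Z, rowvar=False)    # covariance of standardized data

# Share of total variance taken by the largest eigenvalue in each case.
share_cov = np.linalg.eigvalsh(cov).max() / np.trace(cov)
share_corr = np.linalg.eigvalsh(corr).max() / np.trace(corr)
print(f"largest component share, covariance:  {share_cov:.2f}")
print(f"largest component share, correlation: {share_corr:.2f}")
```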
Create binary variables from categories - if selected, unordered categorical variables will be represented as binary variables. Otherwise, they are treated according to their numeric values (their Value Attributes) and not converted to binary. This is unchecked by default.
Rule for selecting components - the following options are available:
- Kaiser rule - keeps components with eigenvalues greater than 1. If the unscaled covariance matrix is used instead of the correlation matrix, components with eigenvalues greater than the mean eigenvalue are kept. This is selected by default.
- Eigenvalue over - keeps components with eigenvalues greater than a user-specified number. If the unscaled covariance matrix is used instead of the correlation matrix, components with eigenvalues greater than a multiple of the eigenvalue mean are kept.
- Number of components - manually select the number of components to keep.
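The three rules above can be expressed as a short Python sketch. This is an illustration of the rules as described here, not Displayr's implementation; the `select_components` function, its parameter names, and the example eigenvalues are hypothetical.

```python
import numpy as np

def select_components(eigenvalues, rule="kaiser", cutoff=1.0, n=None,
                      normalized=True):
    """Return how many components to keep under the given rule.

    rule="kaiser":     eigenvalues greater than 1
    rule="eigenvalue": eigenvalues greater than `cutoff`
    rule="number":     exactly `n` components
    When normalized is False (covariance matrix), the threshold is
    expressed as a multiple of the mean eigenvalue instead.
    """
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    if rule == "number":
        return int(n)
    multiple = 1.0 if rule == "kaiser" else cutoff
    scale = 1.0 if normalized else eigenvalues.mean()
    return int(np.sum(eigenvalues > multiple * scale))

# Hypothetical eigenvalues from a correlation matrix of four variables.
corr_eigenvalues = [2.0, 1.1, 0.6, 0.3]
print(select_components(corr_eigenvalues))                              # Kaiser rule
print(select_components(corr_eigenvalues, rule="eigenvalue", cutoff=1.5))
print(select_components(corr_eigenvalues, rule="number", n=3))
```

For a correlation matrix the mean eigenvalue is always 1, which is why the covariance version of each rule compares against the mean eigenvalue.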
Rotation method - rotations of the principal components are used to produce solutions where the loadings tend to be closer to 0, 1, or -1, making interpretation of the solution easier.
The Varimax, Quartimax, and Equamax rotations are orthogonal, which means that the components produced are always uncorrelated with one another.
The Promax and Oblimin rotations are oblique, meaning that the components can be correlated with one another.
After rotation, components with large negative loadings will have signs flipped, so that the largest loadings are positive, to make interpretation easier.
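As an illustration of an orthogonal rotation followed by sign flipping, the Python sketch below applies the commonly used SVD-based varimax iteration to a small hypothetical loadings matrix. It is not Displayr's implementation: the `varimax` and `flip_signs` functions are written for this example, and the flipping rule (make the largest-magnitude loading on each component positive) is an assumption about how the behaviour described above could be realized.

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-8):
    """Orthogonally rotate a (variables x components) loadings matrix."""
    p, k = loadings.shape
    rotation = np.eye(k)
    previous = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Update direction used by the standard SVD-based varimax iteration.
        target = rotated**3 - (gamma / p) * rotated @ np.diag(
            (rotated**2).sum(axis=0))
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt  # nearest orthogonal matrix to the update
        if s.sum() < previous * (1 + tol):
            break  # no meaningful improvement, stop iterating
        previous = s.sum()
    return loadings @ rotation, rotation

def flip_signs(loadings):
    """Flip any component whose largest-magnitude loading is negative."""
    rows = np.abs(loadings).argmax(axis=0)
    signs = np.sign(loadings[rows, np.arange(loadings.shape[1])])
    signs[signs == 0] = 1.0
    return loadings * signs

# Hypothetical unrotated loadings for four variables on two components.
unrotated = np.array([
    [0.7, 0.5],
    [0.8, 0.4],
    [0.6, -0.6],
    [0.5, -0.7],
])
rotated, rotation = varimax(unrotated)
rotated = flip_signs(rotated)
print(np.round(rotated, 2))
```

Because the rotation matrix is orthogonal, each variable's communality (the sum of its squared loadings) is unchanged; only the way that variance is split between components differs.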
Missing data - determines how to handle missing data. See Missing Data Options for more detail.
Output - the following PCA output options are available:
- Loadings Table - displays a table of the component loadings, which is sometimes referred to as a Pattern matrix. This is selected by default.
- Structure Matrix - displays the structure matrix, which is the loadings matrix multiplied by the correlations between the components.
- Variance Explained - displays the eigenvalues of the original, unrotated components, along with the variance explained, and cumulative variance explained.
An eigenvalue is a number that comes out of the mathematics of determining the new principal components. It represents the amount of variance in the original data that is captured by that component. The percentage figures in the top row represent the percentage of variance represented by each component; they are calculated by dividing each eigenvalue by the total of the eigenvalues of all of the components (before the smallest ones are discarded). PCA is a process of finding a new, smaller set of variables which captures as much variance as possible. So if you want 5 new components, you are picking the 5 new variables which have the largest eigenvalues.
- Component Plot - displays a scatterplot of the loadings of the first two principal components.
- Scree Plot - displays a chart of the eigenvalues of the correlation or covariance matrix.
- Detailed Output - shows more details on the results, including the loadings, structure matrix, variable communalities, sum of squared loadings, and score weights.
- 2D Scatterplot - shows the data charted with axes of the first 2 components and labelled according to Grouping Variable.
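The arithmetic behind the Variance Explained output can be shown with a few lines of Python. The eigenvalues below are hypothetical values chosen for illustration; the calculation follows the description above, dividing each eigenvalue by the total of all eigenvalues.

```python
import numpy as np

# Hypothetical eigenvalues for all components, largest first.
eigenvalues = np.array([2.0, 1.0, 0.6, 0.4])

# Percentage of variance explained by each component.
pct_variance = 100 * eigenvalues / eigenvalues.sum()

# Running total across components.
cumulative = np.cumsum(pct_variance)

for i, (ev, pct, cum) in enumerate(zip(eigenvalues, pct_variance, cumulative), 1):
    print(f"Component {i}: eigenvalue {ev:.2f}, {pct:.1f}% of variance, "
          f"{cum:.1f}% cumulative")
```

Keeping the first two components in this example would retain three quarters of the total variance.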
Sort coefficients by size - when displaying loadings or the structure matrix, sort the components according to their size.
Suppress small coefficients - when displaying loadings or the structure matrix, replace small values with blank spaces to facilitate interpretation.
Absolute value below - in tables, cells which have absolute values smaller than the entered value will be replaced with blank spaces.
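The effect of suppressing small coefficients can be mimicked with a short Python sketch. The `format_loadings` helper, the variable labels, and the 0.4 cutoff are hypothetical and only illustrate the idea of blanking cells whose absolute value is below the entered value.

```python
import numpy as np

def format_loadings(loadings, labels, cutoff=0.4):
    """Render loadings as text rows, blanking cells with |value| < cutoff."""
    rows = []
    for label, row in zip(labels, loadings):
        cells = [f"{v:6.2f}" if abs(v) >= cutoff else " " * 6 for v in row]
        rows.append(f"{label:<12}" + " ".join(cells))
    return rows

# Hypothetical loadings for two variables on two components.
labels = ["Battery", "Screen"]
loadings = np.array([
    [0.85, 0.10],
    [0.25, -0.70],
])

rows = format_loadings(loadings, labels)
for line in rows:
    print(line)
```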
Variable names - if checked, displays variable names in the output instead of variable labels.