This article describes how to conduct a Principal Components Analysis (PCA) on a text variable, using Google's Universal Sentence Encoder to convert the text to numeric data. The results are presented in a modified loadings table that shows the cross-correlation between the principal component scores and an augmented document term matrix, which makes the components easier to interpret.
Requirements
A Displayr document with a text variable.
Please note these steps require a Displayr license.
Method
- Go to Anything > Advanced Analysis > Text Analysis > Advanced > Principal Components Analysis (Text).
- Under Data > Text Variable select a text variable.
- Make any other selections or changes to the settings you require such as component selection rules and/or rotations (see details below).
- Ensure the Calculate automatically box is checked, or click Calculate.
OPTIONAL: The following settings can be updated to modify the output:
- Truncate cases when characters exceed - Truncate cases whose number of characters exceeds this number. This is done as the algorithm expects text data to consist of one sentence per case and may not function correctly when cases are too long. If any cases are truncated due to this setting (a warning will be shown), it is likely that the data does not conform to this assumption (one sentence per case) and is not appropriate for this analysis.
- Rule for selecting components - Method for determining the number of principal components to keep in the analysis:
- Number of components - Manually select the number of components to keep.
- Eigenvalues over - Keep components with eigenvalues greater than a user-specified number.
- Rotation method:
- Varimax - A Varimax-type rotation of the principal components is used by default. It produces solutions where the cross-correlation of the principal component scores and an augmented document term matrix has entries closer to 0, 1, or -1, making the solution easier to interpret.
- None - The original principal component scores are used in creating the cross-correlation matrix with an augmented document term matrix for each case.
- Sort coefficients by size - When displaying loadings or the structure matrix, sort the components according to their size.
- Suppress small coefficients - When displaying loadings or the structure matrix, replace small values with blank spaces to facilitate interpretation.
- Absolute value below - In tables, cells that have absolute values smaller than this will be replaced with blank spaces.
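The effect of these settings can be sketched in a few lines of Python. This is an illustration only, not Displayr's implementation: the function names, the eigenvalues, the terms, and the coefficients are all hypothetical, and the row ordering used here (largest absolute coefficient first) is an assumption, since the exact sort order is not described above.

```python
def truncate_cases(cases, max_chars):
    """Cut each case to max_chars characters and count how many were cut."""
    n_truncated = sum(1 for case in cases if len(case) > max_chars)
    return [case[:max_chars] for case in cases], n_truncated


def select_components(eigenvalues, rule="eigenvalues_over", value=1.0):
    """Return how many leading components to keep under the chosen rule."""
    ordered = sorted(eigenvalues, reverse=True)
    if rule == "number_of_components":
        return min(int(value), len(ordered))
    if rule == "eigenvalues_over":
        return sum(1 for ev in ordered if ev > value)
    raise ValueError(f"unknown rule: {rule}")


def format_coefficients(table, absolute_value_below=0.3, sort_by_size=True):
    """Blank out small coefficients; optionally sort rows by largest |value|."""
    rows = list(table.items())
    if sort_by_size:
        # Assumed ordering: rows with the largest absolute coefficient first.
        rows.sort(key=lambda item: max(abs(c) for c in item[1]), reverse=True)
    return [
        (term, [f"{c:.2f}" if abs(c) >= absolute_value_below else "" for c in coefs])
        for term, coefs in rows
    ]


# Hypothetical inputs, for illustration only.
cases = ["Great service", "Too expensive for what you get"]
truncated, n_truncated = truncate_cases(cases, max_chars=20)
if n_truncated:
    print(f"Warning: {n_truncated} case(s) were truncated")

eigenvalues = [2.9, 1.4, 0.8, 0.5]
kept = select_components(eigenvalues, rule="eigenvalues_over", value=1.0)

coefficients = {
    "price": [0.12, 0.81],
    "service": [0.74, -0.05],
    "staff": [0.66, 0.21],
}
formatted = format_coefficients(coefficients, absolute_value_below=0.3)
for term, cells in formatted:
    print(term.ljust(10), " ".join(cell.rjust(6) for cell in cells))
```

With the eigenvalue threshold set to 1.0, only the first two hypothetical components are kept, and every coefficient with an absolute value below 0.3 is shown as a blank cell.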
Extracting the Principal Component Scores
To extract the (possibly rotated) principal component scores from this output as a variable into your Data Set, take these steps:
- Select the Principal Components Analysis (Text) output and then select the Data tab in the object inspector.
- Click the Save Variable(s) > Components/Dimensions button.
A new variable set is created in your Data Set, containing one variable of principal component scores for each component. These can then be used like any other variable set to create tables and further outputs.
Technical details:
- Rotations:
- The Varimax-type rotation is orthogonal, meaning that the components produced are always uncorrelated with one another.
- Scores:
- Principal components analysis can be used to create a new set of variables that give each case's value on each of the identified components. The initial analysis is done without rotation, and the scores are computed using the Regression method. If a Varimax-type rotation is specified (the default), the principal component scores are rotated with an orthogonal matrix to maximize the variance of the cross-correlation matrix of the rotated principal component scores against the augmented document term matrix presented in the output. The scores can be saved by going to Data > Save Variable(s) > Components/Dimensions.
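To make the rotation idea concrete, here is a minimal Python sketch for the two-component case. It is not Displayr's algorithm: it replaces the Varimax procedure with a brute-force search over whole-degree rotation angles, and the criterion it maximizes (the variance of the squared cross-correlations, summed over components) is an assumed, Varimax-style stand-in, because the exact criterion is not given above. The scores and the two 0/1 term columns are made-up data.

```python
import math
from statistics import mean, pstdev, pvariance


def correlation(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (pstdev(x) * pstdev(y))


def rotate(scores_1, scores_2, angle):
    """Apply a 2 x 2 orthogonal rotation to a pair of score vectors."""
    c, s = math.cos(angle), math.sin(angle)
    r1 = [c * a - s * b for a, b in zip(scores_1, scores_2)]
    r2 = [s * a + c * b for a, b in zip(scores_1, scores_2)]
    return r1, r2


def criterion(components, term_columns):
    """Assumed Varimax-style criterion on the cross-correlation matrix."""
    total = 0.0
    for comp in components:
        squared = [correlation(comp, col) ** 2 for col in term_columns]
        total += pvariance(squared)
    return total


# Made-up data: two uncorrelated, unit-variance score vectors, deliberately
# rotated by 30 degrees so the starting solution is hard to read.
unrotated_1, unrotated_2 = rotate(
    [1.0, 1.0, -1.0, -1.0], [1.0, -1.0, 1.0, -1.0], math.radians(30)
)
term_a = [1, 1, 0, 0]  # hypothetical indicator: case mentions term A
term_b = [1, 0, 1, 0]  # hypothetical indicator: case mentions term B
terms = [term_a, term_b]

# Brute-force search over whole-degree angles in [0, 90).
best_degrees = max(
    range(90),
    key=lambda d: criterion(rotate(unrotated_1, unrotated_2, math.radians(d)), terms),
)
rot_1, rot_2 = rotate(unrotated_1, unrotated_2, math.radians(best_degrees))

for name, comp in (("Component 1", rot_1), ("Component 2", rot_2)):
    print(name, [round(correlation(comp, col), 2) for col in terms])
print("Correlation between rotated components:", round(correlation(rot_1, rot_2), 6))
```

Because the rotation is orthogonal and the starting scores are uncorrelated with equal variance, the rotated scores stay uncorrelated, while their cross-correlations with the term columns move toward 0, 1, or -1.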