This article describes how to run a Principal Components Analysis (PCA) on a text variable, using Google's Universal Sentence Encoder to convert the text to numeric data. The results are presented in a modified loadings table that shows the cross-correlation between the principal component scores and an augmented document term matrix, which makes the components easier to interpret.
A Displayr document with a text variable.
- Go to Anything > Advanced Analysis > Text Analysis > Advanced > Principal Components Analysis (Text).
- Under Inputs > Text Variable select a Text variable.
- Change any other settings you require, such as the component selection rule or the rotation method (see details below).
- Ensure the Automatic box is checked, or click Calculate.
OPTIONAL: The following settings can be updated to modify the output:
- Truncate cases when characters exceed - Cases with more characters than this number are truncated. The algorithm expects one sentence per case and may not function correctly when cases are too long. If any cases are truncated by this setting (a warning is shown when this happens), it is likely that the data does not conform to the one-sentence-per-case assumption and is not appropriate for this analysis.
- Rule for selecting components - Method for determining the number of principal components to keep in the analysis:
  - Number of components - Manually select the number of components to keep.
  - Eigenvalue over - Keep components with eigenvalues greater than a user-specified number.
- Rotation method:
  - Varimax - A Varimax-type rotation of the principal components is used by default. It produces solutions where the cross-correlation of the principal component scores and an augmented document term matrix has entries closer to 0, 1, or -1, making the solution easier to interpret.
  - None - The original, unrotated principal component scores are used to create the cross-correlation matrix with an augmented document term matrix.
- Sort coefficients by size - When displaying loadings or the structure matrix, sort the coefficients according to their size.
- Suppress small coefficients - When displaying loadings or the structure matrix, replace small values with blank spaces to facilitate interpretation.
  - Absolute value below - In tables, cells that have absolute values smaller than this will be replaced with blank spaces.
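
The settings above can be illustrated with a minimal Python sketch. It is not Displayr's implementation: the function names, the example eigenvalues and coefficients, and the 512-character limit are all hypothetical, and sorting rows by their largest absolute coefficient is only one plausible reading of "sort by size".

```python
def truncate_cases(cases, max_chars):
    """Cut each case to at most max_chars characters; count how many were cut."""
    truncated = [text[:max_chars] for text in cases]
    n_cut = sum(len(text) > max_chars for text in cases)
    return truncated, n_cut


def select_components(eigenvalues, rule, value):
    """Return how many leading components to keep.

    rule="number": keep `value` components (capped at the number available).
    rule="eigenvalue": keep components whose eigenvalue is greater than `value`.
    """
    ordered = sorted(eigenvalues, reverse=True)
    if rule == "number":
        return min(int(value), len(ordered))
    if rule == "eigenvalue":
        return sum(ev > value for ev in ordered)
    raise ValueError(f"unknown rule: {rule!r}")


def format_loadings(loadings, sort_by_size=True, suppress_below=None):
    """Render a {term: [coefficients]} table as (term, [cell strings]) rows.

    Assumption: rows are ordered by their largest absolute coefficient.
    Cells with an absolute value below `suppress_below` become blanks.
    """
    rows = list(loadings.items())
    if sort_by_size:
        rows.sort(key=lambda item: max(abs(c) for c in item[1]), reverse=True)
    table = []
    for term, coefs in rows:
        cells = [
            "" if suppress_below is not None and abs(c) < suppress_below
            else f"{c:.2f}"
            for c in coefs
        ]
        table.append((term, cells))
    return table


# Hypothetical inputs, chosen only to exercise each setting.
cases, n_cut = truncate_cases(["Great service", "x" * 600], max_chars=512)
keep = select_components([2.4, 1.3, 0.8, 0.5], rule="eigenvalue", value=1.0)
table = format_loadings(
    {"price": [0.12, -0.71], "service": [0.83, 0.05]},
    suppress_below=0.4,
)
```

With these inputs one case is truncated, two components are kept (eigenvalues 2.4 and 1.3), and the table lists "service" before "price" with the 0.05 and 0.12 cells blanked.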
- The Varimax-type rotation is orthogonal, meaning that the components produced are always uncorrelated with one another.
- Principal components analysis can be used to create a new set of variables that give each case's value on the components that have been identified. The initial analysis is done without rotation, and the scores are computed using the Regression method. If a Varimax-type rotation is specified (the default), the principal component scores are rotated with an orthogonal matrix to maximize the variance of the cross-correlation matrix between the rotated principal component scores and the document-term-type matrix presented in the output. The scores can be saved by going to Inputs > SAVE VARIABLES > Components/Dimensions.
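
The rotation idea described in these notes can be sketched for a toy case with two components. This is an illustration under stated assumptions, not the algorithm Displayr runs: the scores and the 0/1 term columns are made-up data, the "augmented" part of the document term matrix is not reproduced, and a simple grid search over the rotation angle stands in for a real Varimax routine. The criterion used here, the variance of the squared correlations summed over components, is the usual Varimax form and is assumed rather than taken from the source.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rotate_scores(pc1, pc2, angle):
    """Apply a 2-D orthogonal rotation to a pair of component scores."""
    c, s = math.cos(angle), math.sin(angle)
    r1 = [c * a - s * b for a, b in zip(pc1, pc2)]
    r2 = [s * a + c * b for a, b in zip(pc1, pc2)]
    return r1, r2


def cross_correlation(scores, term_columns):
    """Correlate every term column with every component score column."""
    return {
        term: [pearson(column, score) for score in scores]
        for term, column in term_columns.items()
    }


def varimax_criterion(table):
    """Sum, over components, of the variance of the squared correlations."""
    n_components = len(next(iter(table.values())))
    total = 0.0
    for j in range(n_components):
        squared = [row[j] ** 2 for row in table.values()]
        mean = sum(squared) / len(squared)
        total += sum((v - mean) ** 2 for v in squared) / len(squared)
    return total


def best_rotation(pc1, pc2, term_columns, steps=90):
    """Grid-search the angle in [0, pi/2) that maximizes the criterion."""
    def score(angle):
        rotated = rotate_scores(pc1, pc2, angle)
        return varimax_criterion(cross_correlation(rotated, term_columns))
    return max((k * math.pi / (2 * steps) for k in range(steps)), key=score)


# Hypothetical unrotated scores for four cases: uncorrelated, equal variance.
root2 = math.sqrt(2)
pc1 = [0.0, -root2, root2, 0.0]
pc2 = [root2, 0.0, 0.0, -root2]

# Hypothetical document term matrix: 1 if the case mentions the term.
terms = {"price": [1, 0, 1, 0], "service": [1, 1, 0, 0]}

unrotated = cross_correlation([pc1, pc2], terms)
angle = best_rotation(pc1, pc2, terms)
rot1, rot2 = rotate_scores(pc1, pc2, angle)
rotated = cross_correlation([rot1, rot2], terms)
still_uncorrelated = abs(pearson(rot1, rot2)) < 1e-9
```

Before rotation every correlation in this toy table has an absolute value of about 0.71, so neither component is easy to label. After rotation each term correlates at about 1 or -1 with one component and about 0 with the other, and the rotated scores remain uncorrelated because the rotation is orthogonal and the scores have equal variance.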