Correspondence analysis is a technique that summarizes the patterns in a table of data as a visualization. Tables with many rows and columns can be difficult to read, which makes patterns in the data hard to identify. Correspondence analysis makes it easier to see the story in your data.
This article describes how to go from a data table containing multiple rows and columns:
To a correspondence analysis output in which you can better visualize the patterns in the data:
- A Displayr document
- A table with multiple rows and columns containing data that are all on the same scale. This includes crosstabs showing counts, percentages, or averages, grids of data created from binary variables, and even raw numeric data.
1. Create a table that you want to use as an input to the correspondence analysis. See the requirements section above for table specifications. In this example, we'll use the table shown above, which represents device ownership among different income groups.
2. Select Anything > Advanced Analysis > Dimension Reduction > Correspondence Analysis of a Table.
3. From the object inspector on the right, select your table from the Input table(s) drop-down box.
5. OPTIONAL: Remove any additional rows or columns that correspond to 'NET' or 'Total' by adding the corresponding row/column labels in the Rows to ignore and Columns to ignore options. These should typically not be included in the analysis, and Displayr removes 'NET', 'Total', and 'SUM' by default.
6. OPTIONAL: Customize your title, colors, fonts, and gridlines using the settings under Chart.
7. Click the Calculate button to generate the correspondence analysis output which will appear as a scatterplot on your page.
8. OPTIONAL: Instead of a scatterplot, you can create a Moonplot visualization by selecting 'Moonplot' from the Output drop-down box in the object inspector.
9. OPTIONAL: Adjust the settings further in the object inspector:
- Trend lines - When multiple tables are used as input, there is an option to show trend lines between corresponding points across different tables.
- Switch rows and columns - Whether or not to transpose the input data source.
- Bubble Chart
- Bubble sizes - A numeric vector of sizes for the bubbles with names equal to the row labels.
- Bubble colors - A numeric vector of values with names equal to the row labels. A divergent color scale will be constructed using the range of the values as end points. The center of the color scale can be either the median of the values or zero. Bubbles will be colored according to the corresponding value. The colors at the ends of the color scale can be specified in controls under the Chart tab.
- Bubble legend title - Title of the legend showing bubble sizes.
- Normalization - The method used to normalize the coordinates of the correspondence analysis chart. This blog post explains the differences between the normalization options. Options are:
- Principal (default option) - charts the principal coordinates (i.e., the standard coordinates multiplied by the singular values) for both rows and columns.
- Row principal - charts rows in principal coordinates and columns in standard coordinates.
- Row principal (scaled) - the same as Row principal, except that columns are scaled by the first singular value so as to appear on a similar scale to rows.
- Column principal - charts columns in principal coordinates and rows in standard coordinates.
- Column principal (scaled) - the same as Column principal, except that rows are scaled by the first singular value so as to appear on a similar scale to columns.
- Symmetrical (½) - charts the standard coordinates multiplied by the square roots of the singular values for both rows and columns.
- None - charts the standard coordinates for both rows and columns.
- Focus - The label of a row or column on which to focus the output. The axes will be rotated so that the label lies along the first dimension. This means that the entirety of the variance due to the label is visible in a 2-dimensional plot. This is useful if the analysis is intended to explain the relationship between the focus label and all other labels, rather than the general relationship between all labels. Note that the first dimension will no longer explain the maximum amount of variance. The second dimension explains the maximum amount of remaining variance while staying perpendicular to the first dimension.
- Supplementary - A comma delimited list of rows and/or columns which are not used to fit the low-dimensional space, but are plotted in the space. This article describes the uses of supplementary points.
- Horizontal dimension/Vertical dimension - The dimensions to plot on the horizontal and vertical axes respectively. Since dimensions are output in order of decreasing variance, the first and second dimensions are usually of most interest.
- Flip horizontally/Flip vertically - Whether to reverse (i.e. invert the sign of) the output coordinates for the specified dimension. This may allow better visualization, especially when comparing maps that are similar apart from reflections.
- Rows to ignore/Columns to ignore - The names of any rows or columns to be removed from the table prior to analysis.
- Use logos for rows - When this option is selected, the user can replace the labels in the scatterplot with logos. The logos should be supplied as a comma-separated list of URLs.
- Maximum row labels to plot/Maximum column labels to plot - These options limit the number of labels shown. This is useful when there are many points with overlapping labels. The remaining points will be shown without labels.
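
The normalization options listed above differ only in how the standard coordinates are rescaled by the singular values. The following is a minimal NumPy sketch of that relationship, not Displayr's own implementation. The function name `correspondence_analysis`, the example `counts` table, and the use of the option labels as dictionary keys are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical 4 x 3 table of counts (not the table used in this article).
counts = np.array([
    [20, 10,  5],
    [15, 25, 10],
    [10, 20, 30],
    [ 5, 10, 40],
], dtype=float)


def correspondence_analysis(table, normalization="Principal"):
    """Return (row_coords, col_coords, singular_values) for a two-way table.

    Assumes every row and column has a non-zero total.
    """
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                  # correspondence matrix
    r = P.sum(axis=1)                # row masses
    c = P.sum(axis=0)                # column masses

    # Standardized residuals from the independence model.
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)

    U, sv, Vt = np.linalg.svd(S, full_matrices=False)

    # At most min(rows, cols) - 1 dimensions carry information.
    k = min(N.shape) - 1
    U, sv, V = U[:, :k], sv[:k], Vt.T[:, :k]

    # Standard coordinates: singular vectors divided by sqrt of the masses.
    row_std = U / np.sqrt(r)[:, None]
    col_std = V / np.sqrt(c)[:, None]

    ones = np.ones_like(sv)
    # (row scaling, column scaling) applied to the standard coordinates.
    scaling = {
        "Principal": (sv, sv),
        "Row principal": (sv, ones),
        "Row principal (scaled)": (sv, ones * sv[0]),
        "Column principal": (ones, sv),
        "Column principal (scaled)": (ones * sv[0], ones),
        "Symmetrical (½)": (np.sqrt(sv), np.sqrt(sv)),
        "None": (ones, ones),
    }
    row_scale, col_scale = scaling[normalization]
    return row_std * row_scale, col_std * col_scale, sv


rows, cols, sv = correspondence_analysis(counts, "Row principal")
# Share of total inertia (variance) explained by each dimension.
explained = sv**2 / np.sum(sv**2)
```

In this sketch, the first two columns of `rows` and `cols` are the horizontal and vertical positions of the points, and flipping a dimension amounts to multiplying the same column of both arrays by -1.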