A Scatterplot uses dots to represent values for two different variables (typically numeric). The position of each dot indicates values for an individual data point along the axes. Dots can also be labeled, color-coded, and sized based on other variables. A gallery of examples can be found here.
This article describes how to create a scatter plot visualization that displays the values of two different variables (When exercising and At work or school) as points. The data for each point is represented by its horizontal (x) and vertical (y) positions on the visualization. Two methods are discussed: creating a scatter and customizing the colors on a scatter.
Requirements
Please note this requires the Data Stories module or a Displayr license.
- A Scatter, Labeled Scatter, and Small Multiples Scatter can each take either a table or variables (if using a Displayr license) as input. More specific information on the formatting of inputs is listed in the technical documentation: Visualization - Scatter - Scatter. The table used in this example is below:
Method - Create a Scatter
- From the toolbar, go to the Visualization icon > Scatter > Scatter.
- Click on the page to place the visualization.
- From the object inspector, go to Data > Data Source and select either an existing output on a Page (such as a table or R output) in the Data drop-down, or select the variables from your Data Sources tree to represent the X coordinates and Y coordinates.
- Click Calculate.
- Optional: The row labels are displayed when hovering over each point. To show these labels on the chart, change Chart > Appearance > Show labels > On chart.
- Optional: You can drag labels around to reposition them if needed. This positioning will be remembered upon recalculation, but will reset if the data changes.
- Optional: You can use other options on the Chart tab to further customize the colors, fonts, axes, hover, legends, and plot area. You can also add lines with arrows, trend lines, annotations, and quadrants. A list of other specific customization articles for Scatter charts is provided under Next below, including How to Create a Quad Chart.
Method - Customize colors on a Scatter
To customize colors, you will need to change the input data, either by specifying a Data > Data Source > Colors variable (if using variables in your scatter) or by reformatting your input table to include a 4th column that specifies the category or value for the coloring. You can do this by manually inputting the full table into Data Source > Paste or type data, by creating a custom R table to add in the new columns, or by using Combine Tables to combine the input and a Pasted Table of the size and color. The legend for the colors will be ordered as the categories appear in the data, so I've reordered my table as well. The final table for my example is below:
- For this example, I used the table above in Data > Data Source > Data.
- Adding the size column will create bubbles on the plot. To make these smaller and remove the legend, set Chart > Data Series > Marker size to 1 and uncheck Legend > Show bubble legend (if applicable).
- You can use Data Series > Color palette to set a color palette for the visualization. You may also want to reset Data Series > Opacity to 1.
- Optional: You can use numeric values for colors and get a gradient coloring scale by changing Chart > Appearance > Treat colors variable as > Numeric scale and selecting your Data Series > Color palette.
- Optional: To color brands by their brand color instead, change the 4th column from Caffeine to Brand and copy in the brand names. Then create an R Visualization Template using Color palette > Named colors to use in the visualization's Appearance > Template field.
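The table preparation described in this method can be sketched outside Displayr. The following is a minimal illustration in plain Python, not Displayr's own code: it appends a size column and a color column to a two-column table, then reorders the rows so that the categories first appear in the order wanted for the legend. The function names, brand labels, and numbers are all hypothetical.

```python
def add_size_and_color(table, sizes, colors):
    """Append a size (column 3) and a color category (column 4) to each row."""
    if not (len(table) == len(sizes) == len(colors)):
        raise ValueError("sizes and colors need one entry per table row")
    return {
        label: list(xy) + [size, color]
        for (label, xy), size, color in zip(table.items(), sizes, colors)
    }


def legend_order(table):
    """Return the color categories in order of first appearance in column 4."""
    order = []
    for row in table.values():
        if row[3] not in order:
            order.append(row[3])
    return order


def reorder_rows(table, wanted_order):
    """Sort rows so categories first appear in wanted_order (stable sort)."""
    rank = {category: i for i, category in enumerate(wanted_order)}
    return dict(sorted(table.items(), key=lambda item: rank[item[1][3]]))


# Hypothetical input: row label -> [x, y]. All values are made up.
base = {"Brand A": [12.0, 30.0], "Brand B": [25.0, 18.0], "Brand C": [40.0, 22.0]}

full = add_size_and_color(base, sizes=[1, 1, 1], colors=["Low", "High", "Low"])
reordered = reorder_rows(full, ["High", "Low"])

print(legend_order(full))
print(legend_order(reordered))
```

Because the sort is stable, rows within the same category keep their original relative order, so only the grouping changes.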
Technical Input Notes:
The following is an explanation of the various inputs a Scatter visualization can take; more technical detail is in Visualization - Scatter - Scatter.
Scatter plots accept tables supplied using either Paste or type data or existing output on the page. These are expected to be tables where each row of the input data is shown as a separate point.
Columns 1 and 2: control the x and y coordinates, respectively.
Column 3: if provided, controls the sizes of the points.
Column 4: if provided, controls the colors of the points.
Column .... : Additional columns in the table can be referred to for use with annotations.
Rownames: When the input table contains rownames, these will be used as the data labels.
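To make the column roles concrete, here is a minimal sketch in plain Python of how such a table could be read row by row. It is an illustration of the layout only, not how Displayr processes the table internally; the `Point` type, the `table_to_points` helper, and the sample data are hypothetical.

```python
from typing import NamedTuple, Optional


class Point(NamedTuple):
    label: str
    x: float
    y: float
    size: Optional[float]
    color: Optional[str]


def table_to_points(row_names, rows):
    """Turn each table row into one point, following the column roles above."""
    points = []
    for label, row in zip(row_names, rows):
        x, y = float(row[0]), float(row[1])             # columns 1 and 2
        size = float(row[2]) if len(row) > 2 else None  # column 3, if provided
        color = str(row[3]) if len(row) > 3 else None   # column 4, if provided
        points.append(Point(label, x, y, size, color))
    return points


# Hypothetical data: the labels and numbers are made up for illustration.
row_names = ["Brand A", "Brand B", "Brand C"]
rows = [
    [12.0, 30.0, 1.0, "High"],
    [25.0, 18.0, 1.0, "Low"],
    [40.0, 22.0, 1.0, "High"],
]

points = table_to_points(row_names, rows)
for p in points:
    print(p)
```

A table with only two columns yields points whose `size` and `color` are `None`, mirroring the "if provided" wording for columns 3 and 4.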
If multiple tables are selected, each one is expected to be in the same format as described above, and the row names and column names must be the same across all tables. Note that the default format of the input data for scatter plots is different from other visualizations, so Row/Column manipulations may not behave as expected. In these cases, you may want to select Input data contains y-values in multiple columns.
Alternatively, a user can assign X coordinates, Y coordinates, Sizes and Colors to be variables or outputs. This option is more flexible because each of these 4 components can be separately assigned instead of being extracted from the same table. However, it is also more complicated because the behavior may change slightly depending on the inputs chosen.
- Inputs are variables. This is the simplest use case; a marker is shown for each entry in the variables (i.e., the variables are expected to be the same length).
- Inputs are tables. In this case, if the tables are simple 1-column tables, they will behave exactly the same as variables. However, where they have additional attributes, the chart will attempt to use these as well. If the tables have row labels, these will be used as the labels for the data points. It is also possible to explicitly use the row labels as X coordinates by selecting Use category labels instead of values. Where this is selected and a banner is used, the span labels are used instead of the row labels. If the Y coordinates input is a 2-dimensional table, then the columns will be treated as separate data series (i.e., in different colors). If the Y coordinates table contains multiple statistics, then these may be used in the annotations.
- X or Y coordinates are a Standard R Regression model output. In this case either the regression coefficients or the importance scores are used as the data input. This is useful in particular for creating Quad Maps from a Driver Analysis output.
Input data contains y-values in multiple columns. When this is selected, each cell in the input table is shown as a separate point. The values in the table are used as the y-coordinates, whereas the x-coordinates are taken from the row labels. Each column is shown as a separate group, with the colors of the groups controlled by the color palette (under Data series). All points will be shown with the same size. If the table contains multiple statistics, these can be used to add annotations to the chart.
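The cell-to-point mapping for this option can be sketched as follows. This is an illustrative plain-Python reading of the description above, not Displayr's implementation; the `cells_to_points` helper, the brand labels, and the numbers are hypothetical, while the column names reuse the two variables from this article.

```python
def cells_to_points(row_labels, column_names, values):
    """Expand a table so that every cell becomes one point.

    The row label supplies the x-coordinate, the cell value supplies the
    y-coordinate, and the column name identifies the group (data series).
    """
    points = []
    for col_index, group in enumerate(column_names):
        for row_label, row in zip(row_labels, values):
            points.append({"group": group, "x": row_label, "y": row[col_index]})
    return points


# Hypothetical table: 2 rows by 2 columns gives 4 points.
row_labels = ["Brand A", "Brand B"]
column_names = ["When exercising", "At work or school"]
values = [
    [10.0, 35.0],
    [22.0, 14.0],
]

points = cells_to_points(row_labels, column_names, values)
for point in points:
    print(point)
```

Each group would then receive one color from the palette, and no size information is carried, consistent with all points being drawn at the same size.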
Next
How to Add Logos to a Scatter Plot
How to Create an Importance vs Performance Scatterplot in Displayr