How to Create a Sankey Diagram

Sankey diagrams show the sizes of the links (flows) between different items, called nodes. Nodes are organized into groups, also known as stages or levels. These diagrams are useful for seeing how much of something flows from one item to another.

This article describes how to create a Sankey visualization, which shows the flows between nodes, from either a set of variables:

sankey variables.png

or a data table:

sankey table.png

Requirements

You will need either:

A data set of raw data in the Data Sources pane with at least two variables of any type.
A table with at least one row for each full path of the Sankey, with or without a count/weight column at the end. Examples of possible formats are listed under Data Table formats below. This can be a Raw Data Table, R table, summary.table in your Data Sources pane, Pasted table, or native drag and drop table.

Method

Either in the Report tree hover > + menu or from the toolbar (if you are on a Page), go to Visualization > Exotic > Sankey.
In the object inspector, go to Data > Data Source, and select the type of data source you wish to use to create the Sankey diagram.
- If you wish to use an existing table, go to Input table and select the desired table from the drop-down menu.
- If you wish to use variables, go to Variables and select the desired variables from the drop-down menu. Alternatively, you can drag and drop the variables from the Data Sources tree into the menu itself. In this example, we have selected Gender and Preferred cola as variables.
OPTIONAL: If using a table as the input, you may need to check Last column contains weights if the last column of the table contains the flow values to tally for the Sankey.
Click Calculate.
OPTIONAL: Specify the maximum number of categories by entering a number in Maximum number of categories.

You can customize the look of the diagram by going to the object inspector > Chart and adjusting the settings for Appearance, Labels, and Hover text. More detail is found in our technical documentation here.

Settings for Chart > Appearance > Links colored by are as follows:

None: all links are shown in grey.
Source: links are shown in the same color as the source node (left).
Target: links are shown in the same color as the target node (right).
First variable: (or, if using a table, the left-most column in your table) Similar to Source, but each node also takes the color of the node it is linked to on its left. If a node is linked to several such nodes, it takes the color of the one whose link has the largest weight.
Last variable: (or, if using a table, the right-most column in your table) Similar to First variable, but colors are taken from the Target node and are determined by looking at downstream links (those to the right).
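
The First variable rule can be easier to follow with a small worked example. The Python sketch below is only an illustration of the rule as described above, not the code the visualization itself runs. The link weights, the colors, and the function name color_by_first_variable are all hypothetical. It also assumes node names are unique across stages and that ties are resolved in favor of the first link seen, neither of which is stated in this article.

```python
# Hypothetical links for one pair of stages: (source node, target node, weight).
stage_links = [
    [
        ("Male", "Coke", 30),
        ("Female", "Coke", 45),
        ("Male", "Pepsi", 40),
        ("Female", "Pepsi", 25),
    ],
]

# Hypothetical colors assigned to the nodes of the first variable.
first_stage_colors = {"Male": "blue", "Female": "orange"}


def color_by_first_variable(stage_links, first_stage_colors):
    """Give each node the color of the upstream node it is most heavily linked to."""
    colors = dict(first_stage_colors)
    for links in stage_links:  # stages must be ordered left to right
        heaviest = {}  # target node -> (source node, weight of heaviest link)
        for source, target, weight in links:
            # Assumption: on a tie, the first link seen wins.
            if target not in heaviest or weight > heaviest[target][1]:
                heaviest[target] = (source, weight)
        for target, (source, _) in heaviest.items():
            colors[target] = colors[source]
    return colors


colors = color_by_first_variable(stage_links, first_stage_colors)
print(colors)
```

In this made-up data, Coke receives its heaviest link from Female and Pepsi from Male, so each takes that node's color. Last variable would work the same way in the opposite direction, starting from the right-most stage.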

NOTE: An error will occur if more than 20 variables are selected. It is generally advisable to show a relatively small number (e.g., 4 or 5).

Technical Notes

Data Table formats

If you'd like to use a data table as the input to the Sankey, there are several ways to format the data. Generally, each column is a group of nodes (a stage, such as Gender and Preferred cola above), and each row lists one combination of categories (nodes). The table needs at least 2 columns and must list every combination of nodes you want to plot. You can also include a last column containing the size of the flow, or weight. Possible examples are below:

1. Raw case-level data for each path

A table of "raw data" in the sense that each combination of categories is repeated in a new row for each observation and the flows will automatically be tallied by the visualization:
pasted nodes.png
The flows in the following Sankey are the number of rows containing each specific combination: raw data sankey.png
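
To make the tallying concrete, here is a minimal Python sketch of how raw rows could be counted into flows. It only illustrates the idea and is not how the visualization is implemented. The rows and the function name tally_links are hypothetical and do not reproduce the table in the screenshot.

```python
from collections import Counter

# Hypothetical raw table: one row per observation, one column per stage.
rows = [
    ("a", "d"),
    ("a", "d"),
    ("a", "e"),
    ("b", "d"),
    ("b", "e"),
    ("b", "e"),
]


def tally_links(rows):
    """Count how many rows pass through each link between adjacent stages."""
    links = Counter()
    for row in rows:
        # Pair each node with the node in the next column of the same row.
        for stage, (source, target) in enumerate(zip(row, row[1:])):
            links[(stage, source, target)] += 1
    return links


links = tally_links(rows)
for (stage, source, target), count in sorted(links.items()):
    print(f"stage {stage}: {source} -> {target} = {count}")
```

Each repeated row adds one to the flow for its combination, so the two ("a", "d") rows here give an a -> d flow of 2.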

2. A row for each full path of links and a count value for that path

A table of each combination of categories, plus a final column with the count for that row, which is tallied in place of counting the row as one. Paths can be repeated, and the counts will automatically be added together for duplicate paths. If the table is formatted with a count column, you will also need to check Data > Data Source > Last column contains weights:
raw data plus weight.png
You'll see the a -> d flow is now bigger because one of its rows has a weight of 5: raw with weight sankey.png
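
The way duplicate paths are combined can be sketched in a few lines of Python. This is an illustration under assumed data, not the visualization's own code. The rows and the function name sum_weighted_paths are hypothetical, with the last element of each row standing in for the count/weight column.

```python
from collections import defaultdict

# Hypothetical table: each row is a full path followed by its weight.
weighted_rows = [
    ("a", "d", 5),
    ("a", "d", 1),
    ("a", "e", 1),
    ("b", "d", 1),
    ("b", "e", 1),
]


def sum_weighted_paths(rows):
    """Add up the last-column weights for every repeated path."""
    totals = defaultdict(float)
    for *path, weight in rows:
        # All columns except the last identify the path.
        totals[tuple(path)] += weight
    return dict(totals)


path_totals = sum_weighted_paths(weighted_rows)
for path, total in sorted(path_totals.items()):
    print(" -> ".join(path), "=", total)
```

Here the two a -> d rows (weights 5 and 1) combine into a single flow of 6, while every other path keeps its weight of 1.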