Introduction
This article describes how to calculate Jaccard Coefficients in Displayr using R.
Jaccard coefficients, also known as Jaccard indexes or Jaccard similarities, are measures of the similarity or overlap between a pair of binary variables. In Displayr, you can calculate them for variables in your data by using Anything > Advanced Analysis > Regression > Driver Analysis and selecting Inputs > Output > Jaccard Coefficient. However, you can also calculate them using R, which is what this article focuses on.
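As a minimal illustration of the definition, the coefficient for two binary variables is the number of cases where both are 1, divided by the number of cases where at least one is 1. The two vectors below are made-up data, not taken from the data set used in this article:

```r
# Hypothetical ownership indicators for six respondents (made-up data)
x <- c(1, 1, 0, 1, 0, 0)
y <- c(1, 0, 0, 1, 1, 0)

both   <- sum(x == 1 & y == 1)  # cases where both variables are 1
either <- sum(x == 1 | y == 1)  # cases where at least one variable is 1

jaccard.xy <- both / either
print(jaccard.xy)  # 0.5
```

Cases where both variables are 0 do not enter the calculation at all, which is what distinguishes the Jaccard coefficient from a simple matching rate.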
Requirements
A Data Set containing the binary variables you want to compare
Method
To calculate Jaccard coefficients for a set of binary variables, you can use the following:
- Select Calculation > Custom Code.
- Paste the code below into the R CODE section on the right.
- Change line 8 of the code so that input.variables contains the Name of each variable you want to include. The variable Name can be found by hovering over the variable in the Data Sets pane, or by selecting the variable and looking under Properties > GENERAL > Name.
The code for the Jaccard coefficients is:
Jaccard = function (x, y) {
    M.11 = Sum(x == 1 & y == 1)
    M.10 = Sum(x == 1 & y == 0)
    M.01 = Sum(x == 0 & y == 1)
    return (M.11 / (M.11 + M.10 + M.01))
}

input.variables = data.frame(Q6_01, Q6_02, Q6_03, Q6_04, Q6_05, Q6_06, Q6_07, Q6_08, Q6_09)

m = matrix(data = NA, nrow = length(input.variables), ncol = length(input.variables))
for (r in 1:length(input.variables)) {
    for (c in 1:length(input.variables)) {
        if (c == r) {
            m[r,c] = 1
        } else if (c > r) {
            m[r,c] = Jaccard(input.variables[,r], input.variables[,c])
        }
    }
}

variable.names = sapply(input.variables, attr, "label")
colnames(m) = variable.names
rownames(m) = variable.names

jaccards = m
In this code:
- I have defined a function called Jaccard. The function takes any two variables and calculates the Jaccard coefficient for those two variables. A function is a set of instructions that can be used elsewhere in the code. Particularly for more complicated blocks of code, writing a function like this can make your code more efficient and easier to read and check for mistakes.
- If there are missing values, the Sum function excludes any case with a missing value in either variable of a pair from the coefficient for that pair.
- input.variables contains a data frame which has each of the variables you want to analyze as the columns.
- Initially, I created a matrix full of missing values as a place to store my calculations.
- I have used two for loops to go through and calculate the Jaccard coefficients and fill up the top half of the matrix.
- The bottom half of the matrix is left empty. In Displayr, missing values are displayed as empty cells. As the bottom half of the matrix would be identical to the top half, empty cells help us to read the results more easily.
- I have used the sapply function to obtain the label of each variable so that the labels can be displayed as the row labels (rownames) and column labels (colnames) of the table. In this case, sapply uses the attr function to obtain the label attribute of each variable. Because R variables do not natively carry metadata such as labels, Displayr stores that metadata in each variable's attributes so that it can be retrieved later if necessary.
The result is a table that contains all of the Jaccard coefficients for each pair of variables.
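Sum is not a base R function, so the code above will not run unchanged in a plain R session. The sketch below is one way to reproduce the same steps in base R, assuming that sum with na.rm = TRUE is an acceptable stand-in for the pairwise exclusion of missing values described above. The data frame toy, its labels, and the function name jaccard.base are all hypothetical and exist only for this illustration:

```r
# Base R stand-in for the Jaccard function (hypothetical name).
# Any comparison involving a missing value evaluates to NA or FALSE,
# and na.rm = TRUE drops the NAs, so a case with a missing value in
# either variable adds nothing to any of the three counts.
jaccard.base <- function(x, y) {
    m.11 <- sum(x == 1 & y == 1, na.rm = TRUE)
    m.10 <- sum(x == 1 & y == 0, na.rm = TRUE)
    m.01 <- sum(x == 0 & y == 1, na.rm = TRUE)
    m.11 / (m.11 + m.10 + m.01)
}

# Made-up binary data with some missing values
toy <- data.frame(a = c(1, 1, 0, 1, NA),
                  b = c(1, 0, 0, 1, 1),
                  c = c(0, 1, 1, NA, 1))

# Attach label attributes by hand; in Displayr these are already present
attr(toy$a, "label") <- "Brand A"
attr(toy$b, "label") <- "Brand B"
attr(toy$c, "label") <- "Brand C"

# Fill the diagonal and the upper half; the lower half stays NA
n <- ncol(toy)
m.toy <- matrix(NA, nrow = n, ncol = n)
for (i in seq_len(n)) {
    for (j in seq_len(n)) {
        if (i == j) {
            m.toy[i, j] <- 1
        } else if (j > i) {
            m.toy[i, j] <- jaccard.base(toy[, i], toy[, j])
        }
    }
}

# Use the label attributes as row and column names
toy.labels <- sapply(toy, attr, "label")
rownames(m.toy) <- toy.labels
colnames(m.toy) <- toy.labels
print(m.toy)
```

Note that if neither variable in a pair ever takes the value 1, the denominator is zero and R returns NaN for that pair.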
Visualize the results
A heatmap is an ideal way to visualize tables of coefficients like this. To create a heatmap for this data in Displayr:
- Select Toolbar > Visualization > Heatmap.
- Under Inputs > DATA SOURCE, click into Output in 'Pages' and select the output for the Jaccard coefficients that was created above.
- Tick Automatic.
You'll get a result that looks like the following. With the blue default color palette, the largest Jaccard coefficients will be the darkest blue. Looking for dark patches off the diagonal of the table allows you to locate the pairs of products which have the biggest overlap according to the Jaccard index. In this case we see strong overlaps between iPhone, iPod, and iPad owners in the top left, and between Samsung owners and people who own non-Mac computers over to the right.
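If you prefer a ranked list to scanning the heatmap by eye, the off-diagonal cells can also be sorted in R. The sketch below uses a small made-up matrix (jaccards.toy, with hypothetical labels and values) in the same shape as the table produced above: 1 on the diagonal, coefficients above it, and missing values below it.

```r
# Made-up coefficient matrix; labels and values are for illustration only
labels <- c("Brand A", "Brand B", "Brand C")
jaccards.toy <- matrix(NA, nrow = 3, ncol = 3, dimnames = list(labels, labels))
diag(jaccards.toy) <- 1
jaccards.toy[1, 2] <- 0.60
jaccards.toy[1, 3] <- 0.10
jaccards.toy[2, 3] <- 0.25

# Row and column positions of the cells above the diagonal
idx <- which(upper.tri(jaccards.toy), arr.ind = TRUE)

# One row per pair of variables, sorted from largest to smallest overlap
pairs <- data.frame(row = rownames(jaccards.toy)[idx[, "row"]],
                    column = colnames(jaccards.toy)[idx[, "col"]],
                    jaccard = jaccards.toy[idx],
                    stringsAsFactors = FALSE)
pairs <- pairs[order(pairs$jaccard, decreasing = TRUE), ]
print(pairs)
```

The first rows of pairs correspond to the darkest off-diagonal cells in the heatmap.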
See Also
How to Run a Linear Regression in Displayr