Want to save a heap of time narrowing down thousands of crosstabs to just the ones you need? This post describes how to create lots of crosstabs and automatically delete the ones with no significant results, then looks at creating a heatmap to summarize the crosstabs.
What makes a table interesting?
Two crosstabs are shown below. Which of these is interesting? If we are going to automate the process of sifting through crosstabs, we need to decide what makes a table interesting. The simplest way to do this is to see if there are meaningful differences between the percentages, reading across the rows. Instead of reading every table manually, you can automate the process using tests of statistical significance. In our example below, I’ve used arrows and font colors to denote significant results. There are other ways to show this, such as using letters to indicate which columns differ from others. Looking at these tables, only one of them jumps out as “interesting” and I bet you picked it right away.
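To make "tests of statistical significance" concrete, here is a minimal sketch of one common test: a pooled two-proportion z-test comparing the same row across two columns. This is an illustration only. The function name and the counts are hypothetical, and the tests Displayr actually applies to a given table may differ.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for the difference between two proportions."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis that both columns are equal.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value


# Hypothetical counts: 60 of 100 respondents in one column agree, versus 40 of 100 in another.
z, p = two_proportion_z(60, 100, 40, 100)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A large absolute z (and a correspondingly small p) is what an arrow or colored font on a cell would be flagging.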
Creating lots of crosstabs
In this post, I'm using a file about mobile phones called phone.sav. The first step is to create lots of crosstabs. You can find instructions here: How to Create Lots of Crosstabs.
In this example, I've selected all of the variables to place in the rows, and various five-point agreement scale variables to place in the columns: Allows to keep in touch, Technology fascinating, ..., Would like to do mobile banking with phone.
A new folder will appear in the Report tree, and you'll now have many pages, each containing tables crosstabbing all the variable sets in your project by the key variable sets selected. If you are using the phone.sav data set that I'm using, you will have almost 2,000 crosstabs!
Adding additional statistics to the crosstabs
By default, your crosstabs will display Column %. You can add other statistics one table at a time or, if you prefer, modify the statistics for all of the tables at once by selecting the folder in the Report tree that contains your crosstabs. To see the p-values or z-statistics on any of the tables, click on the table, and from Data > Statistics > Cells, select Statistics – Cells > p and/or z-statistic.
Delete crosstab tables for different significance levels
If you created many crosstabs and only want to keep the ones that are significant at a particular level, for example p <= 0.05, you can follow the steps in How To Automatically Delete Tables With No Significant Results.
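The logic behind this kind of filter can be sketched in a few lines. This is not the Displayr script itself. It assumes you already have the smallest p-value found in each table, and it assumes that a table counts as "significant" when that smallest p-value is at or below the cutoff. The table names and values are hypothetical.

```python
ALPHA = 0.05  # significance cutoff used in this post

# Hypothetical data: table name -> smallest p-value found in any cell of that table.
smallest_p_by_table = {
    "Age by Allows to keep in touch": 0.0004,
    "Gender by Allows to keep in touch": 0.21,
    "Occupation by Technology fascinating": 0.03,
}


def tables_to_keep(p_by_table, alpha=ALPHA):
    """Return the names of tables with at least one result at or below alpha."""
    return [name for name, p in p_by_table.items() if p <= alpha]


kept = tables_to_keep(smallest_p_by_table)
print(kept)
```

Everything not in `kept` is a candidate for deletion.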
Creating a heatmap summarizing all the tables
While the first approach wins points for effectiveness and ease, it loses points for being binary. In this approach, tables are either black or white, significant or insignificant. There’s no allowance for shades of grey. Luckily, we can enter a technicolor world with an even more powerful approach: using heatmaps to summarize thousands of crosstabs!
You can run one of Displayr's built-in functions to first identify interesting tables and then automatically create a heatmap to summarize the results. See How to Identify Interesting Tables for instructions.
The heatmap below summarizes all of my crosstabs. Each colored box shows the degree of statistical significance, measured by something called a z-Statistic. The darker the box is shaded, the more significant the underlying table is. Darker blue indicates higher z-Statistics, and the z-Statistics are capped at 5 (i.e., any value greater than 5 is changed to 5, as beyond 5 the differences are immaterial). Make sure you check out the Why does the heatmap use z-Statistics? section below for the technical details on why the heatmap uses z-Statistics rather than p-values!
What can we glean from this? For example:
- The first column shows how other questions in the study were related to agreeing with Allows to keep in touch. Reading down this column we can quickly see that agreement with this attitude can be predicted by Work status, Occupation, and Age. Reading across the rows we can see that there is no other attitude that is related to these three.
- If we scroll down further, we see a white diagonal line of boxes. This is where each attitude is crosstabbed with the others. Putting aside the white, note how there are a lot more dark cells here. This tells us that the attitudes are highly correlated with each other. Note, however, that there is a lot of variation. Some of the cells are much darker than others, telling us that we can likely group together similar attitudes (e.g., using PCA or cluster analysis).
- Scrolling even further down, you will see that the blue gets very, very pale for most, but not all, of the variables relating to behavior. We can see two things here. First, the attitudes and behavior in these examples are not closely related. Second, there are a small number of stronger relationships meriting more exploration (i.e., the very dark cells).
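The capping at 5 mentioned above can be sketched as follows. The variable pairs and z values are hypothetical, and taking the absolute value first is an assumption on my part (that the shading reflects the strength of a result rather than its direction).

```python
Z_CAP = 5.0  # the cap described in this post


def shade_value(z, cap=Z_CAP):
    """Value used to shade a heatmap cell: the magnitude of z, capped."""
    # Assumption: only the size of z matters for shading, not its sign.
    return min(abs(z), cap)


# Hypothetical z-Statistics, one per (row variable, column variable) crosstab.
raw_z = {
    ("Age", "Allows to keep in touch"): 7.9,
    ("Gender", "Technology fascinating"): -1.2,
    ("Occupation", "Allows to keep in touch"): 5.0,
}

capped = {pair: shade_value(z) for pair, z in raw_z.items()}
for pair, value in capped.items():
    print(pair, value)
```

With the cap in place, one enormous z-Statistic cannot wash out the color scale for every other cell.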
Why does the heatmap use z-Statistics?
There are a few issues with the traditional approach of deleting tables that exceed the p-value cutoff for significance (0.05). One of the main issues is that, as p-values get smaller and closer to 0, it becomes difficult to compare them without having to squint at a lot of decimal places. In the table below, you can see the p-values in the third row of numbers. If you zoom in on the Strongly disagree column, the p-values could be anything from 0.003 to 0.000000001; it’s simply impossible to tell.
But wait, there’s a solution! The z-Statistics contain the same information as the p-values, except re-scaled to make comparison easier. Check the table below for a handy guide to converting p-values to z-Statistics. The key benefit here is that the difference between p-values of 0.0001 and 0.0000001 is much bigger when viewed as z-Statistics, making it much easier to understand the practical differences between the two.
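The conversion can also be sketched in code. This minimal example assumes a two-sided test and uses the standard normal distribution from Python's standard library. The function name `p_to_z` is my own, and the result is the absolute value of z, since a p-value alone does not carry the direction of the difference.

```python
from statistics import NormalDist


def p_to_z(p_value):
    """Convert a two-sided p-value (0 < p < 1) to the absolute z-Statistic."""
    # Half of a two-sided p-value sits in each tail; inv_cdf of the lower
    # tail is negative, so flip the sign.
    return -NormalDist().inv_cdf(p_value / 2)


for p in (0.05, 0.01, 0.001, 0.0001, 0.0000001):
    print(f"p = {p:<8g}  z = {p_to_z(p):.2f}")
```

Under these assumptions, p-values of 0.0001 and 0.0000001 differ only from the fourth decimal place onward, yet they map to z-Statistics of roughly 3.9 and 5.3, which is a gap you can see at a glance.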