A client complained about the performance of a dashboard, so we investigated. This article explains how optimizing the way a filter was computed reduced its calculation time from 13.82 seconds to 0.55 seconds in a large data file containing more than 2 million cases.
Diagnosing the bottleneck
The initial version of the dependency graph showed that almost all the calculations in the document were using a common filter variable, which took 13.82 seconds to calculate. This filter was the bottleneck for the document.
The filter was in a data set of 2,190,693 cases. A common misunderstanding is that very slow times are inevitable with data sets in the millions. However, things can often be done to speed up large documents dramatically.
The filter was built by combining conditions on several text variables, using functions including trimws and rowSums.
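The client's actual code and data aren't reproduced here, but a minimal sketch of that kind of filter, with hypothetical variable names and categories, might look like this:

```r
# Hypothetical sketch only: COUNTRY and REGION are text variables, and
# usage.variables is a variable set (a data frame of text variables).
filter <- trimws(COUNTRY) %in% c("United States", "Canada") &
          trimws(REGION) %in% c("California", "Texas", "Ontario") &
          rowSums(sapply(usage.variables, trimws) == "Yes") > 0
```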
Removing a redundant variable: 8% time saving
The filter code contained an unnecessary calculation. Each country's regions are unique in the data set, so the condition on REGION already determines the country, and the line of code referencing COUNTRY can be removed from the filter. Removing it dropped the time from 13.82 to 12.66 seconds.
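In terms of the hypothetical sketch above, the change is simply to drop the COUNTRY condition:

```r
# Because each region belongs to exactly one country, the REGION condition
# already implies the COUNTRY one, so the COUNTRY line can be dropped.
filter <- trimws(REGION) %in% c("California", "Texas", "Ontario") &
          rowSums(sapply(usage.variables, trimws) == "Yes") > 0
```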
Removing trimws: 32% time saving
The trimws function removes whitespace (e.g., spaces, tabs) from the beginning and end of text (e.g., turning " California " into "California"). It is a data-cleaning operation designed to protect against dirty data, and it would be fine in a smaller data set. However, given the number of cases in the data set and the number of variables in the filter code, whitespace was being checked for and removed 13 million times every time the filter was updated.
Removing the trimws calls brought the time to calculate the filter down to 8.67 seconds.
Initially, it also led to the wrong results being calculated, as the raw data contained blank spaces at the end of some variable labels. The data had to be cleaned to rectify this.
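One way to keep the protection without paying for it on every recalculation is to clean the data once, upstream of the filter (again using the hypothetical variables from the sketch above):

```r
# Strip stray leading/trailing whitespace from the raw text variables once...
REGION <- trimws(REGION)
usage.variables[] <- lapply(usage.variables, trimws)

# ...so the filter itself no longer needs any trimws calls.
filter <- REGION %in% c("California", "Texas", "Ontario") &
          rowSums(usage.variables == "Yes") > 0
```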
Replacing rowSums: 73% time saving
A non-obvious problem in the code was the use of QAllowLargeResultObject, a special Displayr function that allows more than 128MB of data to be returned by an R variable or calculation. That is a lot of data for a filter variable, even with such a big data file, suggesting something strange was happening.
By commenting out lines of code and timing the calculation, it became clear that the main problem was the line of code beginning with rowSums. Replacing it with more efficient code reduced the time to 2.36 seconds.
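The exact replacement isn't reproduced here, but in terms of the running sketch, the idea is to avoid comparing the whole variable set (a data frame) in one go and instead work on its columns as plain vectors:

```r
# Before (sketch): comparing the whole data frame at once builds a large
# intermediate matrix that inherits the data frame's row names.
# filter <- REGION %in% c("California", "Texas", "Ontario") &
#           rowSums(usage.variables == "Yes") > 0

# After (sketch): compare one column at a time and combine the logical
# vectors, so no large row-named intermediate object is ever created.
filter <- REGION %in% c("California", "Texas", "Ontario") &
          Reduce(`|`, lapply(usage.variables, function(v) v == "Yes"))
```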
Why did this make such a difference? The original code referred to a variable set, which was returned as a data frame, and data frames carry row names, which take up memory. Due to a quirk of R, these row names, which were just the row numbers, were converted to character strings, and with over 2 million rows this consumed a lot of memory.
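The effect of character row names is easy to see outside Displayr. The standalone snippet below (not the original dashboard code) compares a data frame whose row names have been materialized as strings against a plain matrix holding the same values:

```r
n <- 2190693                               # the number of cases in the data file

df <- data.frame(x = numeric(n))
rownames(df) <- as.character(seq_len(n))   # row names stored as character strings

m <- matrix(numeric(n), ncol = 1)          # the same values, with no row names

print(object.size(df), units = "auto")     # well over 100 MB, mostly the row names
print(object.size(m), units = "auto")      # roughly 17 MB (n doubles at 8 bytes each)
```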
Using an unstacked data file: 76% time saving
The next big time saving came from realizing that a stacked data file was being used in the calculations when it was not required. The file had been stacked to deal with multiple response and grid data. While stacking is a sensible approach when preparing data for BI tools, it is rarely necessary in Displayr, which has native support for multiple response and grid data (e.g., the variable set structures Nominal - Multi, Binary - Multi, and Pick Any - Grid).
Cumulative time saving: 96%
The cumulative effect of all these changes was to reduce the time from 13.82 seconds to 0.55 seconds, a saving of 96%.
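The headline figure follows directly from the first and last timings:

```r
round(100 * (13.82 - 0.55) / 13.82)  # 96% cumulative time saving
```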
If you find a slow bottleneck and need some help, please reach out to us at support@displayr.com, and we'll try to help!