This article describes how to optimize the speed of Displayr documents.
A Displayr Account
Determinants of speed
Your machine and its resources: Despite running in a web browser, Displayr is a very resource-intensive application, and for many customers it's the most complex web application they use. It tests the limits of web browsers and of the PC it's running on.
The strain on the PC's resources grows with dashboard and dataset size, so closing other programs frees those resources for Chrome to use.
Similarly, Chrome extensions also consume resources on a PC. It's worth starting Chrome in Incognito (private) mode, where extensions don't run by default, and seeing if the issue improves.
To speed web pages up, Chrome includes an option to use the graphics card in any PC to draw web pages. We can double check it's turned on by opening Chrome settings, typing "hardware" (without quotes) into the search box and checking the "Use hardware acceleration when available" option is highlighted.
If that still doesn't work then it might be worth trying a different browser (such as Firefox or the newest version of Edge which Microsoft has also released for Windows 7). Firefox and Chrome should be identical for Displayr. If they're not then we know it's a setting in Chrome that needs fixing.
If the issue still persists, the next step is to run some browser benchmarks from Chrome in normal mode, which will give us a better understanding of your computer. The benchmarks take a total of 10-15 minutes.
How many apps you have open
How many tabs you have open
The amount of data
The amount of data, which includes all of:
- The file size.
- The number of observations/cases.
- The number of variables in the original data file(s).
- The number of derived variables.
- The number of characters in text variables.
- How the data files have been created. In particular, often SPSS files are filled with blank spaces.
- The number of data sets.
- The size of images.
The number of outputs
The more tables, charts, and, in particular, Calculations a document contains, the slower Displayr will be.
The number of calculations on a page
When a page is viewed, everything on that page will automatically update if so required. Consequently, the more calculations on a page, the slower the document will be when the page is viewed (as everything updates). Note, though, that Displayr uses caching (described below), and will only recompute things that it has not had to compute recently.
The number of variables used on a page
For example, if you have a grid question that contains 200 variables, it is going to be slower to compute than a table with a smaller number of variables. This becomes more noticeable as the number of cases in a data set becomes larger.
The size of data passed between objects
Consider creating a pie chart of gender. If the visualization is based on a table which contains the percentages of gender categories, much less data needs to be supplied to the visualization than if the visualization needed to compute these percentages from the raw data for the gender variable. In a typical project you won't be able to tell the difference if just summarizing gender, but when you've got hundreds of variables versus a small number of tables, the differences add up.
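To make the size difference concrete, here is a minimal sketch in plain R that compares a raw variable with the small table of percentages derived from it. The simulated gender variable and its 100,000 cases are assumptions made for illustration; Displayr's internal data transfer is not modelled here.

```r
# Hypothetical raw variable with 100,000 cases.
set.seed(1)
gender <- sample(c("Male", "Female"), 1e5, replace = TRUE)

# The summary a pie chart actually needs: one percentage per category.
gender_pct <- prop.table(table(gender)) * 100

# Memory footprint of each object, in bytes.
raw_size   <- object.size(gender)
table_size <- object.size(gender_pct)

raw_size
table_size
```

The table holds two numbers regardless of how many cases are in the data, whereas the raw variable grows with every additional case.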
The number of Calculations, including visualizations and multivariate methods
Calculations are slower relative to other things that are calculated in Displayr. The Visualization submenu and almost all the advanced methods (e.g., Regression, Machine Learning, the entirety of the Anything submenu) are running R in the background. There are three different aspects to the slowness of Calculations:
- R itself is not a fast language. Yes, it's awesome and flexible, but this comes at a price.
- There is a small overhead that occurs each time you use R, of around 0.1 seconds. When you have multiple Calculations that need to be computed one after another this can build up.
- There is a bit of additional overhead related to moving data to and from Calculations.
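As a back-of-the-envelope illustration of how the fixed overhead builds up, the sketch below multiplies the roughly 0.1 second figure quoted above by a hypothetical number of Calculations that run one after another. The count of 40 is an assumption, not a measurement.

```r
# Approximate fixed cost of each R call, taken from the figure quoted above.
overhead_per_calculation <- 0.1  # seconds

# Hypothetical number of Calculations that must run one after another.
n_sequential <- 40

# Total fixed overhead in seconds, before any real computation happens.
total_overhead <- overhead_per_calculation * n_sequential
total_overhead
```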
The number of dependencies
For example, if you have a visualization that depends on a table and a list box, and the table is computed in R, and this table is computed based on 100 R variables, and these are computed from other tables, you have a lot of dependencies. The more dependencies, the slower things get.
The structure of your dependency graph
The dependency graph is the relationship between all the objects. That is, if A needs to compute before B, and B before C, then A -> B -> C is the dependency graph (some people draw the arrows in the other direction...). Sometimes people inadvertently create very inefficient dependency graphs. For example, let's say you create one Calculation and have every other Calculation linked to it. If you then conduct a trivial modification to this one Calculation, it will cause everything else to update. Similarly, if you have a long chain of Calculations they will all need to be executed in sequence, which will be slower than if you create a structure that permits them to be calculated in parallel.
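The difference between a long chain and a structure that allows parallel work can be sketched with a toy model in plain R. The two graphs and the helper functions below are hypothetical, and the code only counts the longest run of sequential steps; it does not reproduce how Displayr actually schedules calculations.

```r
# Each element names an object; its value lists the objects it depends on.
# Both graphs are hypothetical and contain four Calculations.
chain <- list(A = character(0), B = "A", C = "B", D = "C")
fan   <- list(A = character(0), B = "A", C = "A", D = "A")

# Number of sequential steps needed before `node` is available:
# one more than the deepest of its inputs.
depth <- function(graph, node) {
  inputs <- graph[[node]]
  if (length(inputs) == 0) return(1)
  1 + max(vapply(inputs, function(i) depth(graph, i), numeric(1)))
}

# Longest chain in a graph: the minimum number of sequential rounds.
longest_chain <- function(graph) {
  max(vapply(names(graph), function(n) depth(graph, n), numeric(1)))
}

longest_chain(chain)  # 4: every Calculation waits for the previous one
longest_chain(fan)    # 2: B, C and D can run side by side once A is done
```

Note that the second graph has the other problem described above: any change to A forces B, C and D to update, so a shorter chain is not automatically a better structure.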
Your implicit caching strategy
Once a particular result has been calculated, Displayr only updates it if it has a reason to believe that the inputs might have changed (e.g., if the raw data file has been updated or a control has changed). Different ways of structuring calculations can make different use of this feature.
For example, let's say that you have a Calculation that you are referring to in lots of different places in your document. Further, let's say that 99% of the data in that Calculation never changes, but there is a small section of data that does change. When you change the small section, all the other things in the document that depend on this Calculation will have to be recalculated.
The number and complexity of derived variables
- If you duplicate a variable that is in the raw data file it will rarely have an effect (unless you do it a lot). This is because the actual data isn't copied. Rather, any calculations that use the duplicated variable are just based off the original data.
- R variables. These are generally orders of magnitude slower than duplicated variables. Consequently, while you should use them, you should not use them to do things that can be done natively in Displayr (more about this below).
- Banners. These are derived variables, so do run slower.
If you automatically update a data set, every output that depends on it then needs to be updated. Consequently, with automatic updates that occur regularly, you will regularly experience periods of slowness.
The heaviness of use
If you click on a page that triggers lots of calculations, and then move onto a new page before the first page has been calculated, and then click on another, and so on, Displayr will get slower and slower until it completes the calculations.
Similarly, if you have two people in a document in Edit mode at the same time, if one user does something resource intensive (e.g., exports all the pages to PowerPoint, which requires every calculation to be updated), the other user will experience slowness.
In View mode, if you have a large number of users and they are doing more computationally expensive actions, such as filtering or using Explore mode, this will use up more resources and, at some point, this will affect users. However, the effect of multiple users in View mode is relatively minor (having 50 or more users interacting at the same time should not be a problem).
Strategies for increasing the speed of Displayr
Send us examples that are slow
When people experience great slowness in documents it is often because they are using Displayr in a way that was not envisaged by its engineers. By sharing your examples of slowness with us, we can often either:
- Identify opportunities for improving Displayr.
- Show you better ways of doing things.
Edit things on pages with nothing on them
Consider the situation where you have a table, and this table is linked to multiple Calculations. Each time you edit the table, the outputs are updated, and this takes time. If you instead move the table to a separate page, edit it there, and then return it to the original page, Displayr will be faster while you are editing.
Put intermediary tables on separate pages
If you have an intermediary table for a calculation which is often re-run (e.g., you have combo boxes to change filters), it is best to put the intermediary table on a different page to the combo boxes and final output. The reason is that when you change a combo box which changes a table on the current page, Displayr re-draws anything that has changed on the current page, even if that item has been placed outside the page margins where nobody is meant to see it. Drawing tables can take more time than you'd expect for large tables or large projects, so if you notice slow updates when combo boxes are changed, try moving intermediary tables to a different page to avoid the re-drawing.
Use the Displayr User Interface rather than R code where possible
Calculations and R Variables are extremely flexible. But, they are slow and there are risks of typos in R code. Most data management and a lot of analysis is more sophisticated, easier, and computationally more efficient if done using Displayr's inbuilt tools. In particular:
- Creating crosstabs and summary tables of variables in data sets.
- Filtering, in terms of creation and using the in-built menu in View mode.
- Recoding variables, both into existing variables and new values.
- Merging categories.
- Dealing with missing values.
- Management of labels (i.e., make use of Displayr's ability to store a Label in addition to a Name).
- Using Variable Sets rather than variables when referencing data in R code.
- Data File Relationship(s) rather than writing joins in code.
Use duplicated original variables, rather than construct variables
Duplicating variables that are in the raw data, and then modifying these (e.g., merging categories, recoding), is much faster than using banners and other constructed variables.
If wanting to create filters from a Nominal or Ordinal variable, the most computationally-efficient way to do this is to:
- Duplicate the variable once for each category.
- Combine them: Anything > Combine
- One by one, click on each of the variables, and rename it to each of the categories, and then click on Object Inspector > DATA VALUES > Select categories and choose the corresponding Count this value categories.
Remove column comparisons from tables
Column comparisons are computationally expensive as they involve comparisons of all pairs of columns.
View the document
When a document is viewed in View mode, lots of results are cached. The next person to view it in View mode will typically have a faster experience. Similarly, if you apply a filter, in View mode, it will also be cached, making the next person to use that filter have a faster experience.
Keep the data in View mode longer
By default, a document in View mode stays live for 10 minutes after it was last viewed. While it is live, results are cached, so they do not need to be recomputed. Once the document returns to sleep, the results need to be re-computed. However, if you modify the sleep settings (click on the cog at the top-right and select Document Settings > Properties), you can delay how long it takes to go back to sleep. (This will use your view time credits up at a faster rate.)
You can also schedule times for the document to wake automatically. See Automatically Updating Calculations, R Variables, and R Data Sets.
Split data to avoid unnecessary calculations and to exploit caching
Many data objects can be split into multiple parts. A data file can be split into multiple data files, by case or variable, or both. A variable set can be split into smaller variable sets. A table into multiple tables. A Calculation into multiple Calculations.
While the general advice is to minimize the number of objects, there are situations where it can be useful to instead create multiple separate objects. In particular:
- Where a lot of the data is passed to calculations, but then not needed. For example, if you have a table of a grid question that contains 20 brands, but you are only using one of the brands in the visualization, it's a good idea to duplicate the data set, remove the irrelevant brands, and then create the table. In this case the computational saving achieved will more than compensate for the additional derived variables.
- Where one part of a data object changes regularly (e.g., maybe it is being updated, or maybe it is linked to a control or filter), and the other parts do not. In such situations if you split the data object so that the bit that doesn't change is able to cache, you can achieve substantial performance benefits. But, if doing this do check it makes a difference, as the potential gain may be swamped by the resulting increased size of the document.
Store the results of computationally-expensive tables as Calculations
Consider the situation where you have a large table that has lots of data as inputs and is slow to compute, and this table is then used as an input into other tables and visualizations. If the large table does not need to be regularly updated, you can instead:
- Create a Calculation, linking it to the table (by typing the name of the table in as the R CODE).
- Uncheck Automatic.
- Link all your other tables and analyses to this Calculation.
When you do this, Displayr only has to retrieve the results of the table, which is typically much faster than re-creating the whole table from scratch.
Minimize the number of Calculations and R Variables
Combine multiple Calculations (particularly if duplicating)
When creating a page on a dashboard it is often useful to have lots of Calculations, with each having a distinct role. This tends to lead to cleaner logic. But, it also slows down performance. Once you have a set of Calculations that are computing what you want, it is often useful to combine them back into one or a small number of Calculations. This is particularly useful if you plan on duplicating the page multiple times.
Deliberately creating redundant code
Let's say you have three Calculations, where the first one is computed and then Calculations 2 and 3 are computed using the first one as an input. It is often the case that it will be more efficient to include the code from the first Calculation at the beginning of the second and third Calculations, and then remove the first Calculation. That is, the reduction in time achieved by reducing the number of outputs will make up for the duplication of code. This is partly due to Displayr performing things in parallel and partly due to the overhead associated with Calculations.
Note, however, that an even better strategy will be to abstract the common code by writing a function, and having that function appear as a separate Calculation used by the other two Calculations. The key thing to appreciate about doing this is that while you still have three outputs, the function itself will be cached, so will not have any real effect on performance.
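A minimal sketch of the function-based approach in plain R is shown below. The helper name rescale_scores and the sample data are hypothetical; in Displayr, each commented section would sit in its own Calculation, with the second and third referring to the first.

```r
# Calculation 1 (hypothetical): holds only the shared function.
rescale_scores <- function(x) {
  x <- x[!is.na(x)]                  # drop missing values
  (x - min(x)) / (max(x) - min(x))   # rescale to the range 0 to 1
}

# Hypothetical input data shared by the two Calculations below.
raw_scores <- c(3, NA, 7, 5)

# Calculation 2 (hypothetical): average of the rescaled scores.
mean_scaled <- mean(rescale_scores(raw_scores))

# Calculation 3 (hypothetical): share of rescaled scores above the midpoint.
share_above_mid <- mean(rescale_scores(raw_scores) > 0.5)

mean_scaled
share_above_mid
```

The shared logic is written once, but neither downstream Calculation has to wait for an intermediate result to be computed and passed along.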
Common reference tables or lists
In support, we commonly see pages with lots of visualizations, where each one of these visualizations is based on a table, and each of these tables is a part of a larger table (e.g., maybe the larger table contains data on brands, and each small table is for a separate brand). A much better way of structuring this is to make each of the visualizations pull data from the larger table, using ROW MANIPULATIONS > Rows to show and COLUMN MANIPULATIONS > Columns to show.
Even if not using Displayr's inbuilt visualizations, using a common reference table or `list`, which is then referenced by lots of other outputs, is often an effective way of reducing the number of Calculations.
Consider the situation where you have a Calculation that computes one table, `tbl`, and some text summarizing the table, `dscr`. The orthodox way of dealing with this in R is via a list: `out <- list(table = tbl, description = dscr)`. However, if you want to hook the table up to a visualization, you would need to create a separate Calculation with code `out$table`, and refer to this in your visualization.
An alternative way of doing the same thing is via attributes. For example:
attr(tbl, 'description') <- dscr
tbl
This returns the table, which can then be referenced directly in visualizations without any need for an intermediary table. The description is then extracted by referring to attr(tbl, 'description') in R code.
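To compare the two approaches side by side, here is a minimal sketch in plain R. The table contents and the description text are made up for illustration, and the object names out_list and out_attr are hypothetical.

```r
# Hypothetical table and description.
tbl <- matrix(c(40, 60, 55, 45), nrow = 2,
              dimnames = list(c("Male", "Female"), c("Brand A", "Brand B")))
dscr <- "Brand A skews female; Brand B skews male."

# List approach: the table is nested, so a consumer needs out_list$table.
out_list <- list(table = tbl, description = dscr)

# Attribute approach: the object is still the table itself,
# with the description attached as metadata.
out_attr <- tbl
attr(out_attr, "description") <- dscr

is.matrix(out_list)            # FALSE: a list, not a table
is.matrix(out_attr)            # TRUE: usable wherever a table is expected
attr(out_attr, "description")  # retrieves the attached text
```

One caveat worth knowing about base R: many operations that build a new object (subsetting with `[`, for example) drop custom attributes, so the description is safest when read from the original object.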
Hacking visualizations and advanced analyses to take direct feeds of the data
Calculations created automatically, such as visualizations, have user controls in the Object Inspector where you hook up the data. Alternatively, you can bypass these controls and modify the R code to extract and format the data as per your needs. This can lead to the elimination of Calculations, as any manipulation that you were doing in intermediary Calculations is instead done as a part of the final Calculation.
When linking an Autofit table to a table you have computed in a Calculation, you could instead use the CreateCustomTable() function to perform the table computation and render the table in a single output.
Keep your Document running for longer
In Settings, Displayr has options that govern when a document goes to sleep in View and Edit modes. If you increase the time in these fields, Displayr will need to wake up less often (but your usage costs will increase). Additionally, you can control when Displayr automatically updates, so as to prevent it updating when you are working (see Automatically Updating Calculations, R Variables, and R Data Sets).
Writing efficient R code
If you are using for loops or lots of if statements in your R code, this will slow things down, whether the code runs in Displayr or elsewhere. The key time savers to know how to exploit are vectorized math, apply, sapply, lapply, and aggregate.
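The sketch below contrasts a loop with its vectorized equivalent, followed by short examples of sapply and aggregate. All data and object names are hypothetical, and actual speed differences will depend on the size of the data and the version of R.

```r
set.seed(1)
x <- runif(1e5)

# Loop version: visits each element and branches with `if`.
loop_result <- numeric(length(x))
for (i in seq_along(x)) {
  if (x[i] > 0.5) {
    loop_result[i] <- x[i]^2
  } else {
    loop_result[i] <- 0
  }
}

# Vectorized version: one expression over the whole vector.
vec_result <- ifelse(x > 0.5, x^2, 0)

# Both approaches give the same answer.
all.equal(loop_result, vec_result)

# sapply applies a function to each element of a list without an explicit loop.
concepts <- list(concept_a = c(4, 5, 3), concept_b = c(2, 2, 5))
concept_means <- sapply(concepts, mean)

# aggregate computes a summary for each group in a data frame.
ratings <- data.frame(brand = c("X", "X", "Y", "Y"), score = c(1, 3, 2, 6))
brand_means <- aggregate(score ~ brand, data = ratings, FUN = mean)
```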
More aggressive data management strategies
Use SPSS .sav files instead of Excel and CSV files
One way to do this is to import the Excel or CSV files into Q and then write them back out as SPSS .sav files.
Where you have a lot of looped data (e.g., sets of variables for lots of different product concepts), by stacking the data and using filtering you can massively reduce the number of Calculations.
A document can be split into multiple documents, each containing different data sets and/or outputs. By creating hyperlinks between the documents, you can give the end user the feeling of a single document.
Let's say you have a large database with a million records, and you want users to be able to filter the data by age and gender. Rather than add the data set with all million records, you can instead aggregate the data (e.g., using R code), storing only one row of data for each age and gender combination (i.e., the first row will be males under 18, etc.), using a weight in all the analyses.
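A minimal sketch of this kind of aggregation in plain R follows. The simulated data, the column names, and the purchased outcome are all assumptions made for illustration, and 1,000 records stand in for the million mentioned above.

```r
# Hypothetical respondent-level data.
set.seed(42)
n <- 1000
raw <- data.frame(
  age       = sample(c("Under 18", "18 to 34", "35 or more"), n, replace = TRUE),
  gender    = sample(c("Male", "Female"), n, replace = TRUE),
  purchased = rbinom(n, 1, 0.3)
)

# Each original record counts once; summing this gives the cell weight.
raw$weight <- 1

# One row per age and gender combination, with the number of records
# (the weight) and the number of purchasers in each cell.
agg <- aggregate(cbind(weight, purchased) ~ age + gender, data = raw, FUN = sum)
agg$rate <- agg$purchased / agg$weight

nrow(agg)  # at most six rows instead of 1,000

# Weighting the cell rates recovers the overall rate from the raw data.
weighted.mean(agg$rate, agg$weight)
mean(raw$purchased)
```

Filtering by age or gender then only has to touch a handful of rows, and applying the weight keeps the totals consistent with the full data.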
SQL and Data Cubes
R code can be used to query databases and data cubes, which can be faster than performing some operations with huge databases directly in Displayr.
Hacks as of January 2020
The tactics listed here work as of the day of writing, but our engineering team will make them unnecessary in the future:
- Avoid using banners on huge projects if they are slow.
- Create single tables or visualizations that contain lots of smaller results, rather than have each result as a separate output. For example, if wanting to display a row, column, or grid-shaped layout of single figures, consider using an Autofit table (which is created using Insert > Data > Paste Table, selecting the input table in the Output in 'Pages' section, and then ticking the Autofit tick box).