Whenever Displayr creates an output (e.g., a calculation or a visualization), we can think of the time taken as being determined by three things:
- The time taken to create the output. For example, if computing an average, the time taken to calculate the average.
- The amount of data that needs to be used in the calculation. For example, if we are calculating the Average of 1,000 cases of data, then we have 1,000 values. This data all needs to be "moved" to where the calculations are performed.
- The distance that the data needs to be moved. Broadly speaking, we can think of this as being:
  - Movements involving variables and tables. E.g., when a variable set is dragged onto a page, causing a variable to be created. These movements essentially take 0 seconds and can be ignored.
  - Anything that involves R calculations, including R variables. These require the data to be moved to an R server. This takes a very small amount of time if the size of the data is small, but it's not 0.
  - To and from the Displayr Cloud Drive. This typically takes longer than moving data to an R server.
  - To or from an external source (e.g., a SQL query of your company's database). This is a long distance, so it takes a long time.
  - From Displayr's servers to your browser. This is typically the longest distance of all.
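The "amount of data" component can be made concrete with a quick sketch in plain R (outside Displayr): the raw data being moved is often vastly larger than the result computed from it.

```r
# Sketch in plain R: ten million doubles occupy roughly 80 MB,
# while their average is a single number of a few dozen bytes.
x <- runif(1e7)               # 10 million random numbers
print(object.size(x))         # roughly 80 MB
print(object.size(mean(x)))   # a few dozen bytes
```

Whichever of the distances above the data has to travel, moving a single summary number is essentially free, while moving the raw 80 MB is not.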
Simple worked example
Consider the following calculation that creates 10 million random numbers:
random.numbers = runif(10000000)
And consider a second calculation that computes their average:
mean(random.numbers)
At the time this article was written, the dependency graph for this was as follows, suggesting a best case of 3.75 + 2.23 = 5.98 seconds to calculate:
If we look at the raw R output, we can see that of the 3.75 seconds taken by the first calculation, only 0.36 seconds were spent actually generating the random numbers (i.e., Time executing code); the rest was spent transferring data (0.80 seconds) and on "overhead" (2.59 seconds).
An alternative way of structuring the same calculations is to have the first calculation instead just return the sum of the random numbers and the number of random numbers:
random.numbers = runif(10000000)
results = list(numerator = Sum(random.numbers),
               denominator = length(random.numbers))
We can then compute the result using:
results$numerator / results$denominator
The dependency graph reveals that everything is much, much faster.
When we look at the raw R output, we can see that we've massively reduced the time spent transferring data (from 0.80 to 0.01 seconds) and the overhead (from 2.59 to 0.17 seconds). The time executing code is greater, but that's because the first calculation is now doing some of the work previously done by the second calculation.
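The two ways of structuring the calculation give the same answer, which you can verify in plain R (using base R's sum() in place of Displayr's Sum()):

```r
# Verify that numerator/denominator reproduces the average computed directly.
random.numbers = runif(10000000)
results = list(numerator = sum(random.numbers),
               denominator = length(random.numbers))
restructured = results$numerator / results$denominator

# The restructured calculation matches computing the average in one step.
all.equal(restructured, mean(random.numbers))  # TRUE
```

So the restructuring changes only where the work happens and how much data moves between the two calculations, not the result.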
More realistic worked example
The screenshot below is from a simplified version of the campground dashboard. It shows 18 visualizations.
The Cooking visualization, for example, which shows 14%, is a number visualization, and its value is derived from two variables:
- A variable measuring satisfaction (Q4d), which (due to some dodgy file preparation) has two categories:
  - Very dissatisfied+Somewhat dissatisfied+Neither satisfied nor dissatisfied
  - Satisfied
- A dynamic filter variable, called `Campsite filter`, linked to a combo box.
An efficient way of creating the visualizations is to create one (Visualization > Number > Number in Oval), set Data source to Use an existing R output, and enter the following code, which calculates the percentage of the filtered data that are satisfied:
Average(Q4d[`Campsite filter` == 1] == "Satisfied")
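To see what this calculation does, here is a plain-R sketch with made-up stand-in data (the values below are hypothetical, not the real survey): averaging a logical vector gives its mean, i.e. the proportion of TRUEs.

```r
# Hypothetical stand-in data for 1,000 respondents (not the real survey).
set.seed(123)
satisfaction = sample(c("Satisfied", "Dissatisfied"), 1000, replace = TRUE)
campsite.filter = sample(c(0, 1), 1000, replace = TRUE)

# Averaging the logical comparison gives the proportion of the
# filtered respondents who are "Satisfied".
pct.satisfied = mean(satisfaction[campsite.filter == 1] == "Satisfied")
pct.satisfied  # a proportion between 0 and 1
```

The key point for what follows is that both full-length variables are needed as inputs, even though the output is a single number.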
While this is an elegant solution, it's far from optimal in terms of the amount of data moved around. Every time one of the visualizations needs to be updated, it needs to be sent the variable measuring satisfaction, which contains 1,000 values, and the filter, which also contains 1,000 values.
However, we can instead reduce both the size of the data being moved and the distance it travels if we: