Results computed by Displayr sometimes differ from those produced by other software. The main causes of this are:
- Differences in the data
- Differences in how statistics are computed
- Differences in how statistical tests are calculated
The final section of this article discusses how to get help from Displayr in tracking down the explanation for such differences.
Differences in the data
Sometimes differences in results are caused by different data files (e.g., different sample sizes, different metadata). Please see What To Do When a Result Looks Wrong for more information about how to track down such differences.
Differences in how statistics are computed
Displayr rounds halves up to the nearest integer (e.g., 2.5 is rounded to 3). IBM SPSS products default to rounding halves to the nearest even number (e.g., 2.5 is rounded to 2). Some crosstab products use two-stage rounding: they first round to a fixed number of decimal places and then round that result to the nearest integer.
If a weight is applied in one program but not in another, the results will differ. The way that different programs treat weights also affects significance tests (discussed below).
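The effect of these rounding rules can be checked with a minimal Python sketch. It assumes that "rounds up" means halves are rounded up and that two-stage rounding rounds halves up at both stages; the function names are hypothetical and are not part of Displayr or SPSS.

```python
from decimal import Decimal, ROUND_HALF_UP, ROUND_HALF_EVEN

def round_half_up(value: str) -> int:
    """Round to the nearest integer, with halves rounded up (assumed Displayr-style)."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_UP))

def round_half_even(value: str) -> int:
    """Round to the nearest integer, with halves going to the even neighbor."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN))

def round_two_stage(value: str) -> int:
    """Round to one decimal place first, then round that result to an integer."""
    one_decimal = Decimal(value).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
    return int(one_decimal.quantize(Decimal("1"), rounding=ROUND_HALF_UP))

# Values are passed as strings so that they are represented exactly.
for v in ("2.5", "3.5", "2.45"):
    print(v, round_half_up(v), round_half_even(v), round_two_stage(v))
# 2.5  -> half up 3, half even 2, two-stage 3
# 3.5  -> half up 4, half even 4, two-stage 4
# 2.45 -> half up 2, half even 2, two-stage 3
```

Two of the three values produce different integers depending on the rule, which is enough to make a table of percentages disagree by one point between programs.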
Differences in means are generally attributable to differences in the data (e.g., the data having been recoded inconsistently, such as with different treatment of missing values, application of filters).
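The standard formula for the sample standard deviation is:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
where $x_i$ is the value of the $i$th of $n$ observations and $\bar{x}$ is their mean.
This formula does not take weights into account. A simple modification of the formula is to treat the $i$th observation's weight, $w_i$, as representing a frequency, which leads to:
$$s = \sqrt{\frac{\sum_{i=1}^{n} w_i (x_i - \bar{x}_w)^2}{\left(\sum_{i=1}^{n} w_i\right) - 1}}$$
where $\bar{x}_w = \sum_{i=1}^{n} w_i x_i / \sum_{i=1}^{n} w_i$ is the weighted mean.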
This formula is widely used (e.g., in SPSS Statistics). However, it is incorrect in situations where the weights reflect the probability of a respondent being selected in a survey. For example, if the average weight is 1 this leads to a different standard deviation than if the average weight is 2, even though multiplying every weight by the same constant does not change the relative importance of any respondent. This formula is only used if Q has been used to modify the setting for the document (where Weights and significance in Statistical Assumptions has been set to Un-weighted sample size in tests). Otherwise, Displayr instead uses the following formula:
Effective sample sizes and design effects
Displayr uses a more exact formula for calculating effective sample size than most other software (see Weights, Effective Sample Size and Design Effects on the Q wiki).
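For comparison, the approximation used by many other programs is Kish's effective sample size, (Σw)² / Σw². The sketch below illustrates that approximation only; it is not Displayr's more exact formula, and the function names are hypothetical. It assumes that missing and zero weights are dropped before the calculation.

```python
def kish_effective_sample_size(weights):
    """Kish's approximation: (sum of weights)^2 / (sum of squared weights)."""
    usable = [w for w in weights if w is not None and w > 0]
    total = sum(usable)
    return total * total / sum(w * w for w in usable)

def design_effect(weights):
    """Ratio of the number of usable cases to the effective sample size."""
    usable = [w for w in weights if w is not None and w > 0]
    return len(usable) / kish_effective_sample_size(usable)

weights = [0.5, 0.5, 1.0, 1.0, 2.0, 2.0, None, 0.0]
print(kish_effective_sample_size(weights))  # about 4.67 from 6 usable cases
print(design_effect(weights))               # about 1.29

# Rescaling every weight by a constant leaves the effective sample size unchanged.
doubled = [2 * w for w in weights if w]
print(kish_effective_sample_size(doubled))  # about 4.67 again
```

When all weights are equal, the approximation returns the number of cases, so any gap between the effective and actual sample sizes comes from variation in the weights.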
Typically, different sample sizes will be caused by:
- Actual differences in the data.
- Weights containing missing values or 0s (Q automatically excludes these from sample size computations, even if Un-weighted sample sizes in tests has been selected in Q).
- Rounding. Some programs round sample size data. For example, SPSS either defaults to rounding sample sizes (e.g., in Custom Tables) or gives the user options for controlling this.
- Displayr's treatment of missing data on multiple response questions.
- Selecting the wrong statistics in Displayr (e.g., Row sample size instead of Sample size or vice versa).
Differences in percentages
Differences in percentages on Nominal and Nominal - Multi variable sets are generally attributable to differences in the data (e.g., the data having been recoded inconsistently, such as with different treatment of missing values, application of filters) or to different bases. This is most easily assessed by comparing sample sizes.
Differences in percentages on Binary - Multi, Binary - Compact, and Binary - Grid variable sets are typically explained by:
- Differences in the data (e.g., the data having been recoded inconsistently, such as with different treatment of missing values, application of filters). This is most easily assessed by comparing sample sizes.
- Where multiple response data is stored in the Binary - Compact format, some programs count repeated values twice. For example, in SPSS if one person chose the fourth option in two separate variables, they count as two people in the percentage of responses section, whereas in Displayr they do not. That is, Displayr computes percentages of respondents, whereas SPSS computes percentages of responses.
- Different definitions applying to NET, Base and Total columns. In Displayr, NET and SUM are often not the same as a Base or Total in other programs, except when the data is Nominal.
- Binary - Compact variable sets having the wrong variable set structure (e.g., Binary - Multi).
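The difference between percentages of respondents and percentages of responses can be illustrated with a small sketch. The data layout (one row per respondent, one entry per variable, each holding the code of a chosen option) and the function names are assumptions made for illustration; the second function is a simplified stand-in for a responses-based calculation and does not reproduce any program's exact output.

```python
# Each row is one respondent; each entry is the option code stored in one variable.
respondents = [
    [4, 4],      # chose option 4 in two separate variables
    [1, 4],
    [2, None],   # second variable is empty
    [1, 2],
]

def percent_of_respondents(data, option):
    """Count each respondent at most once, however often the option is repeated."""
    chose = sum(1 for row in data if option in row)
    return 100.0 * chose / len(data)

def percent_of_responses(data, option):
    """Count every stored value, so a repeated option is counted repeatedly."""
    responses = [value for row in data for value in row if value is not None]
    return 100.0 * responses.count(option) / len(responses)

print(percent_of_respondents(respondents, 4))  # 50.0 (2 of 4 respondents)
print(percent_of_responses(respondents, 4))    # about 42.9 (3 of 7 responses)
```

Both the numerator and the base change between the two calculations, which is why comparing sample sizes is a quick way to spot which one another program is reporting.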
Differences between regression models produced in different programs tend to relate to:
- Treatment of missing values.
- Treatment of weights. Most statistics programs (e.g., SPSS) interpret weights as being frequency weights, unless specific instructions are given to interpret the data in other ways. Displayr assumes that the weights are sampling weights. Identifying if the cause of the problem is weights is best achieved by running the analysis without weights.
- Whether standard errors are robust or not.
- Selection of the type of regression model (e.g., using linear regression in one program and binary logit in another).
- Differences in the intercepts of regression models where missing is set to 'Use partial data (pairwise)'.
- Differences in the standard errors of weighted regression models where missing is set to 'Use partial data (pairwise)'.
Principal Components Analysis (PCA)
Differences in PCA results between Displayr and other programs, and also between Q's different PCA implementations, are due to:
- Different treatments of missing values. For example, whether the analysis is based on pairwise correlations or not.
- Arbitrary choices of sign. For example, one program may show all the loadings as being the negative of the loadings shown by another program.
- Specific algorithm (e.g., whether PCA or factor analysis is being conducted).
- Local optima in rotations. PCA itself is an exact algorithm; provided that the above issues are addressed, the results should be the same. However, the rotation methods are not exact, and different programs can find different solutions.
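The arbitrary choice of sign can be verified directly: if a vector is a principal component direction (an eigenvector of the correlation matrix), so is its negative. The sketch below uses a hypothetical 2 x 2 correlation matrix and a hypothetical align_sign helper showing one possible convention for making two programs' outputs comparable; it is not how any particular program chooses signs.

```python
import math

def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def is_eigenvector(matrix, vector, eigenvalue, tol=1e-12):
    """Check that matrix * vector equals eigenvalue * vector."""
    product = mat_vec(matrix, vector)
    return all(abs(p - eigenvalue * v) < tol for p, v in zip(product, vector))

def align_sign(vector):
    """Flip the vector so that its largest-magnitude entry is positive."""
    pivot = max(vector, key=abs)
    return [-x for x in vector] if pivot < 0 else list(vector)

# Correlation matrix of two variables with correlation 0.6 (illustrative values).
corr = [[1.0, 0.6], [0.6, 1.0]]
component = [1 / math.sqrt(2), 1 / math.sqrt(2)]  # first component, eigenvalue 1.6
flipped = [-x for x in component]

print(is_eigenvector(corr, component, 1.6))           # True
print(is_eigenvector(corr, flipped, 1.6))             # True
print(align_sign(flipped) == align_sign(component))   # True
```

Applying the same sign convention to the output of both programs before comparing them removes this source of apparent disagreement.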
Latent class models, mixture models, cluster analysis, trees (segmentation)
There is no reason to expect different programs to get the same results for latent class, cluster analysis, or tree models. This is because:
- Most companies use slightly different statistical models, even though they have the same name. For example:
  - There are multiple widely used k-means algorithms, and the defaults used in, for example, SPSS, R, Displayr, and Q, are all different.
  - There are multiple different mixture modeling algorithms (e.g., latent class). For example, when attempting to specify the same basic model, Displayr may use Maximum Likelihood estimation, Latent Gold may use Posterior Mode estimation and Sawtooth may use Hierarchical Bayesian estimation.
  - There are dozens and perhaps hundreds of different tree algorithms.
- Even when the same algorithm is used, minor differences in implementation make results inconsistent. For example:
  - Most mixture algorithms (e.g., latent class analysis) involve some element of randomization, and differences in which random numbers are generated can change outputs.
  - Different programs have slightly different stopping rules.
More generally, the reason that segmentation algorithms give different results is that they all return only approximate solutions, as it is computationally infeasible to find the best solution for all but the most trivial problems. For example, with 1,000 respondents and 5 segments, there are more than 8,250,291,000,000 possible segmentations. As latent class allows for people to be partially in multiple segments, it permits an infinite number of possible segmentations. Consequently, when computer programs try to find segments, they start by making a few guesses. One of the key differences between programs is how they make those guesses, with things like the order of cases and variables in the data file helping to determine the initial guesses.
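The size of the search space can be checked with a short calculation. The figure of roughly 8,250,291,000,000 matches the number of ways to choose 5 respondents out of 1,000 (for instance, as starting points for the segments). The number of ways to assign all 1,000 respondents to 5 non-empty segments is a Stirling number of the second kind, which is far larger. The sketch below computes both; the function name is hypothetical.

```python
from math import comb, factorial

def stirling_second_kind(n, k):
    """Number of ways to split n items into k non-empty, unlabeled groups.

    Uses the inclusion-exclusion formula
    S(n, k) = (1 / k!) * sum_{j=0..k} (-1)^j * C(k, j) * (k - j)^n.
    """
    total = sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1))
    return total // factorial(k)

print(comb(1000, 5))  # 8250291250200

segmentations = stirling_second_kind(1000, 5)
print(len(str(segmentations)))  # 697 digits

# Small sanity check: 4 people can be split into 2 non-empty groups in 7 ways.
print(stirling_second_kind(4, 2))  # 7
```

With a count that is hundreds of digits long, no program can evaluate every possible segmentation, which is why all of them rely on starting guesses and iterative improvement.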
Choice of test
Most programs use slightly different statistical tests. In particular, Displayr does not default to the tests that are standard in SPSS, Quantum, and Survey Reporter, but equivalent tests can often be selected by modifying the options in the object inspector > PROPERTIES > Significance.
The role of weights
How weights are treated can have a major impact on computations of statistical significance. Most statistics programs treat all weights as frequency weights (e.g., SPSS Base, R). Most market research programs assume that weights are sampling weights and use a calibrated weight when computing statistical tests (e.g., Quantum, Survey Reporter, Wincross, Uncle). Most specialized survey analysis programs (e.g., SPSS Complex Samples, R's survey package) use special-purpose variance estimation algorithms for dealing with weights. Displayr uses a combination of special-purpose variance estimation and calibrated weights in its analysis (which is used and when is discussed for each test).
Additionally, some packages, such as SPSS Custom Tables, automatically round weighted data to whole numbers prior to performing tests.
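The sketch below illustrates why the interpretation of weights matters for tests, using the standard error of a weighted proportion. It is a simplified illustration under stated assumptions, not the calculation used by Displayr or any other program: the frequency interpretation uses the sum of the weights as the sample size, while the second function substitutes Kish's effective sample size as a stand-in for approaches that treat weights as sampling weights. All function names are hypothetical.

```python
import math

def weighted_proportion(flags, weights):
    """Weighted share of cases where the flag is true."""
    return sum(w for f, w in zip(flags, weights) if f) / sum(weights)

def se_frequency_weights(flags, weights):
    """Standard error when weights are read as counts of identical cases."""
    p = weighted_proportion(flags, weights)
    n = sum(weights)
    return math.sqrt(p * (1 - p) / n)

def se_effective_n(flags, weights):
    """Standard error using Kish's effective sample size (an assumption here)."""
    p = weighted_proportion(flags, weights)
    n_eff = sum(weights) ** 2 / sum(w * w for w in weights)
    return math.sqrt(p * (1 - p) / n_eff)

flags = [1, 1, 0, 0, 1, 0]
weights = [0.5, 0.5, 1.0, 1.0, 2.0, 2.0]
doubled = [2 * w for w in weights]

# Doubling every weight shrinks the frequency-weight standard error...
print(se_frequency_weights(flags, weights), se_frequency_weights(flags, doubled))
# ...but leaves the effective-sample-size version unchanged.
print(se_effective_n(flags, weights), se_effective_n(flags, doubled))
```

Because a smaller standard error makes differences look more significant, two programs can agree on every percentage and still disagree on which differences are flagged.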
Multiple comparison corrections
By default, Displayr does not use multiple comparison corrections, whereas Q by default uses the False Discovery Rate Correction (FDR).
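To show how much a false discovery rate correction can change which results are flagged, the sketch below implements the Benjamini-Hochberg procedure. This is an assumption: the text above names only the False Discovery Rate Correction, and the exact implementation in Q may differ. The function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Work from the largest p-value down so each adjusted value is monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

p_values = [0.01, 0.04, 0.03, 0.20]
corrected = benjamini_hochberg(p_values)
print(corrected)  # approximately [0.04, 0.0533, 0.0533, 0.20]

print(sum(p <= 0.05 for p in p_values))   # 3 results significant without correction
print(sum(p <= 0.05 for p in corrected))  # 1 result significant with correction
```

In this example, two of the three results that are significant at the 0.05 level without a correction are no longer significant once the correction is applied, so the default setting alone can explain differences between programs.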
When using repeated measures data, particular care should be taken, as different programs often make different assumptions about how to treat likely violations of the normality assumptions.
Upper versus lower case letters in Column Comparisons
Some programs show all results using uppercase letters when performing Column comparisons. Some programs use lowercase to indicate results between 0.05 and 0.1 levels of significance and uppercase for p-values less than or equal to 0.05. Displayr and Q use lowercase for results less than 0.001 and uppercase for more significant results.
Displayr uses Corrected p when determining whether to assign letters or not and whether these letters are UPPERCASE or lowercase. This is the same as p, unless the false discovery rate correction has been applied.
Treatment of 0% and 100%
Some programs do not compute significance when performing comparisons involving either 0% or 100% (e.g., SPSS Custom Tables).
Sometimes differences between the results of different programs are caused by bugs. If you suspect a bug, please contact us!
Obtaining assistance from Displayr to reconcile differences
If you require assistance in reconciling results obtained in Displayr with those obtained in other programs, please:
- Review this page and check that the issue is not described here.
- Send an email to firstname.lastname@example.org containing the following:
  - A link to the page in your document.
  - If using proprietary or internally-developed software, the actual algorithms used in the testing (i.e., the code). If these are not available, detailed formulas are needed. Please note that short descriptions such as "t-tests were used" or "chi-square tests were used" are not useful, as there are dozens of such tests, and there are no standard versions of these tests (e.g., the tests in introductory statistics books and on Wikipedia are rarely used in commercial software). Similarly, descriptions written in non-technical language, such as descriptions referring to things like the "average" or the "total", are too ambiguous to be useful, or,
  - If using well-known commercial software:
    - The name of the application used to conduct the testing (e.g., SPSS Version 12).
    - Information about any specific tests/options selected in the program. That is, either the scripts used to conduct the testing or, if using a program with a graphical user interface, screenshots of the options selected.
    - Information from the program's technical manual about how the tests are computed, which includes either the specific formulas used or references to formulas in books or journals.
  - Detailed technical outputs provided by the other program. That is, most data analysis programs will contain options to export various information used in the calculation of significant results. For example, Quantum exports a tstat.dmp file. Please note that providing us with one or two crosstabs and noting inconsistencies in terms of what is marked as statistically significant does not constitute a detailed technical output. A detailed technical output needs to contain one or more of z, t or p statistics/values.
  - Tables created by the other program.
  - A list of a few specific examples explaining the differences. E.g., "On table 1 from the Quantum outputs you can see that it shows the 18 to 24s are significantly lower in their preference for Coke, but Q is not showing this."