Sometimes, you will need to update your document with new observations/cases (versus updating the same records with new variables). This can occur when you build your report as you are collecting your data. Though, it regularly happens for tracking studies where you want to combine waves of data together into a single data set.
Displayr has a built-in feature that compares and matches the variables in the data sets to be able to add data sets on top of one another and keep any unmatched variables. This feature not only produces the final merged data file, but also shows detailed information on how the matching and merging is happening for each variable. You can review this output to ensure that the same data is being merged together from both files and identify any conflicts in values, labels, etc. This feature also has the added benefit of being able to merge more than one file at once. This article reviews all of the functionality for the Data > Data Set > Combine > By Case feature.
- Select the Data Sets
- Select the Variable Matching method
- Set the Variables to include or exclude from the data set or matching
- Set the preference for Names and Labels
- Reading the Output
- Automatic updating
- Add the data set to your document
In this example, we have a Cola Tracking - January to September 2017 data set and a Cola Tracking - October to December 2017 data set which we wish to combine together so that we can compare the data together.
Requirements
- Files to be combined must be stored in the Displayr Cloud Drive. These files can be added by clicking the Initials icon at the top right > Displayr Cloud Drive > + Upload. See How to Use the Displayr Cloud Drive for details.
- Only .SAV files can be combined using this method.
- See Data - Data Set - Combine - By Case for further technical details.
Method
Select the Data Sets
1. In the toolbar, select > Data > Data Set > Combine > By Case.
2. Under Data > Data sets > Data set 1, type the name of the first file from the Cloud Drive to combine. In this example, we will enter Cola Tracking - January to September 2017.sav (including the file extension).
3. Under Data > Data sets > Data set 2, type the name of the second file from the Cloud Drive to combine. In this example, we will enter Cola Tracking - October to December 2017.sav (including the file extension).
4. [OPTIONAL] If you have more data sets to combine, the object inspector will keep adding a further data set field to give you this option.
5. Type the name of the new data set you wish to create in the Combined data set name field. Here, we will enter Cola Tracking - January to December 2017 (without the file extension):
Select the Variable Matching method
By default, Displayr will perform automatic variable matching based on variable names, variable labels and value labels.
1. [OPTIONAL] If you want more control over the matching process, you can uncheck Automatically determine what to match and choose which matching methods to use.
2. [OPTIONAL] The degree to how close these methods match is set in the Minimum fuzzy match percentage field. When this is set to 100, names and labels will only match if they are exact matches. The lower this value, the more it will attempt to match based on similarity rather than exactness. It should be noted that lowering the threshold too far could result in unintentional combining of variables that have similar names or labels, but should not actually be combined.
Examples
- If you know that both data sets have the exact same variable names for the corresponding variables, you can instead untick Match by variable labels and Match by value labels to only have Match by variable names checked. Setting Minimum fuzzy match percentage to 100 would then ensure only those with exactly the same variable names across both data sets will be matched.
- If, on the other hand, the variable names are not the same for corresponding questions, but the variable labels are, you would instead need to ensure that Match by value labels is ticked, and you may need to lower the fuzzy match value.
Set the Variables to include or exclude from the data set or matching
By default all variables are included across all input data sets.
1. [OPTIONAL] Tick Combined variables must contain input from at least one data set (select to apply) if you wish to only include variables that appear in both data sets, for example, according to the matching criteria above.
2. [OPTIONAL] Alternatively, if you only wish to include combined variables from specific data sets, you can untick the Include combined variables that contain input from data set option for the relevant data set.
3. [OPTIONAL] When using the above option, you can then specify the Variables to manually include from the combined data set.
4. [OPTIONAL] You can also specify the Variables to manually omit from the combined data set.
When the fuzzy match method is used for matching or there are numerous inconsistencies in variable names and labels across the data sets being combined, you additionally have the option of setting up rules to help match variables correctly.
5. [OPTIONAL] The Variables to manually combine option can be used to specify variables that should be matched but were not able to be automatically using the above criteria.
6. [OPTIONAL] The Variables that should not be combined option should be used when variables are automatically matched but should not be.
7. When manually specifying these variables to include or exclude from the combined data set or matching, you can do so by using any of the below methods:
- List the variable names as a list separated by a comma, e.g.
Q1,Q2
. - Set a range of variable names that also include all variables in between based on data set order, e.g.
Q1-Q6
. - Use wildcards that include any variables that, for example, start with a variable name prefix, e.g.
Q2_*
. - When specifying a variable from a specific data set, you append the data set value in brackets, e.g.
Q2(2)
to indicate Q2 from data set 2. - You can combine any of these methods by separating them with a comma.
For example, specifying d2,d2(2)
would ensure that d2 in both data sets would be left as separate variables in the combined data set.
Set the preference for Names and Labels
When the labels or names for matching variables are not identical, you can set how they should be treated in the combined data set.
1. The Prefer names and labels from the option allows you to use the names and labels from either the First data set or Last data set.
2. And When there are multiple labels for the same value, you can instruct to either Create new values for the labels or Use label from preferred data set.
In this example, d1 has a different value label for 65+ vs 65 or more. Using the Create new values for the labels setting will append the latter as a new category with a new value.
If we instead select the Use label from preferred data set setting, it will map it with the original category and value:
Reading the Output
The merge output on the page summarizes how all of the variables in the files are matched and merged together. It is color coded to draw your attention to how and why things are matched, as well as draw your attention to any data issues that you may need to review. All variables in the combined data set will be listed in the output. The coloring of the merged variables are as follows:
- Pink represents a manual match based on the Variables to manually combine option.
- Orange represents any fuzzy matches based on the Automatic Variable Matching section.
- Blue represents any differences in label or value.
Any variable in the output that has been matched is expandable and will display details of what each variable matches to and how this match was performed.
- Displayr will always add a categorical variable to the file named mergesrc and labeled Sources of cases which identifies the source file for each record.
- By default, this list will be truncated when there are many variables, but you can select Diagnostics > Variables from Combined Data in the object inspector to output the entire list of variables.
Automatic updating
If your input data sets will be periodically updated in the Cloud Drive, you can set this as a repeatable workflow that automatically combines the latest input data sets and exports the combined data file to the Cloud Drive. This can be achieved by adding a schedule to your Data - By Case output.
Add the data set to your document
Once the output has run, it will save the combined data set to your Displayr Cloud Drive using the name you specified. You can then do one of the following to add this data set to your document, depending on your requirements.
Add a new data set
1. Click the icon in the Data Sources tree.
2. Select Displayr Cloud Drive.
3. Select the combined data set from the list of files and click OK.
Update an existing data set
1. Select the original data set that you would like to update with the merged file in the Data Sources tree.
2. From the object inspector, click Update.
3. Select Displayr Cloud Drive.
4. Select the combined data set from the list of files and click OK.
5. A Data Difference Warning may appear by alerting you of any changes to the data set. Review these results and click Accept or Remind Me Later.
6. The data set and any connected outputs will now be updated with the combined data.
Tips
- If possible, it is generally preferable to obtain a single data file that has already been merged than to merge data files yourself. This is because:
- It can be difficult to manually construct a Date variable.
- It can be difficult when questionnaires have changed between waves.
- It requires the person doing it to have a good understanding of data (as ill-considered decisions can have large ramifications).
Next
How to Combine Data Files by Adding New Variables
How to Use the Displayr Cloud Drive
Multiple Data Sets Video Guide