How to Combine Data Files by Adding New Records

Sometimes, you will need to update your document with new observations/cases (versus updating the same records). This regularly happens for tracking studies where you want to combine waves of data together into a single data set, or when you build your report as you are collecting your data.

Displayr has a built-in feature that compares and matches the variables in .sav data files (see Requirements if using Excel files) to be able to add the data sets on top of one another and add any additional variables that can't be matched. This feature not only produces the final merged data file, but also shows detailed information on how the matching and merging are happening for each variable (via the grey, pink, orange, and blue variables). You can review this output to ensure that the same data is being merged together from both files and identify any conflicts in values, labels, etc. This feature also has the added benefit of being able to merge more than two files at once. This article reviews all of the functionality for the Combine > By Case feature.

Select the .sav data set files from the Displayr Cloud Drive
Select the Variable Matching method
Set the Variables to include or exclude from the data set or matching
Set the preference for Names and Labels
Reading the Output
Automatic updating
Import or Update data sets in your Document

In this example, we have a Cola Tracking - January to September 2017 data set and a Cola Tracking - October to December 2017 data set, which we wish to combine together so that we can compare the data together.

Requirements

Only .SAV files can be combined using this method.
- If needing to combine Excel files, it's usually easiest to do this manually in Excel. Otherwise, you can import the files into a Displayr document and use Share > Export Data > To Displayr Cloud Drive as SPSS File (.SAV) to convert them to .sav files to use with the tool.
Files to be combined must be stored in the Displayr Cloud Drive. These files can be added by clicking the Initials icon at the top right > Displayr Cloud Drive > + Upload. See How to Use the Displayr Cloud Drive for details.
Path to files on the Displayr Cloud Drive is case-sensitive and can have spaces. Folders can be separated using one backslash \. For example: Example Folder\Cola Tracking - January to September 2017.sav
If the relevant files are located within a Displayr Cloud Drive folder with configured access, ensure your document also has access to this folder by including the user groups that can access it in either the permissions of the document's folder or the document itself. See How to Manage User Groups and Permissions in Displayr and How to Use the Displayr Cloud Drive for more details.
See Data - Data Set - Combine - By Case for further technical details.

Method

Since reformatting data can be data-intensive, you might consider doing this in a separate data preparation document from the document where you create your Report.

Select the .sav data set files from the Displayr Cloud Drive

1. Insert a new merge output by clicking on the name of any data set in the Data Sources pane, and from Properties , select Combine > By Case.

2. Under Data > Data Sets > Data set 1, type the name of the first file from the Cloud Drive to combine. In this example, we will enter Cola Tracking - January to September 2017.sav (including the file extension). Note folders and files are case-sensitive, and folders are separated with one backslash \ i.e., Example Folder\Cola Tracking - January to September 2017.sav.

File names are case-sensitive and must match the name of the file in the Cloud Drive exactly.

3. Under Data > Data Sets > Data set 2, type the name of the second file from the Cloud Drive to combine. In this example, we will enter Cola Tracking - October to December 2017.sav (including the file extension). Note folders and files are case-sensitive, and folders are separated with one backslash \ i.e., Example Folder\Cola Tracking - October to December 2017.sav.

4. [OPTIONAL] If you have more data sets to combine, Properties > Data > Data Sets will keep adding a further data set field to give you this option.

5. Type the name of the new data set you wish to create in the Combined data set name field. Here, we will enter Cola Tracking - January to December 2017 (without the file extension):

To add the file to a folder in the Cloud Drive, use one backslash (\) to add the file inside the folder: Example Folder\Cola Tracking - January to December 2017

Select the Variable Matching method

By default, Displayr will perform automatic variable matching based on variable names, variable labels, and value labels.

1. [OPTIONAL] If you want more control over the matching process, you can uncheck Automatically determine what to match and choose which matching methods to use.

2. [OPTIONAL] The degree to how close these methods match is set in the Minimum fuzzy match percentage field. When this is set to 100, names and labels will only match if they are exact matches. The lower this value, the more it will attempt to match based on similarity rather than exactness. It should be noted that lowering the threshold too far could result in unintentional combining of variables that have similar names or labels, but should not actually be combined.

Examples

If you know that both data sets have the exact same variable names for the corresponding variables, you can instead untick Match by variable labels and Match by value labels to only have Match by variable names checked. Setting Minimum fuzzy match percentage to 100 would then ensure only those with exactly the same variable names across both data sets will be matched.
If, on the other hand, the variable names are not the same for corresponding questions, but the variable labels are, you would instead need to ensure that Match by value labels is ticked, and you may need to lower the fuzzy match value.

Set the Variables to include or exclude from the data set or matching

By default, all variables are included across all input data sets.

1. [OPTIONAL] Tick Combined variables must contain input from at least one data set (select to apply) if you wish to only include variables that appear in both data sets, for example, according to the matching criteria above.

2. [OPTIONAL] Alternatively, if you only wish to include combined variables from specific data sets, you can untick the Include combined variables that contain input from data set option for the relevant data set.

3. [OPTIONAL] When using the above option, you can then specify the Variables to manually include from the combined data set.

4. [OPTIONAL] You can also specify the Variables to manually omit from the combined data set.

When the fuzzy match method is used for matching, or there are numerous inconsistencies in variable names and labels across the data sets being combined, you additionally have the option of setting up rules to help match variables correctly.

5. [OPTIONAL] The Variables to manually combine option can be used to specify variables that should be matched but were not able to be automatically matched using the above criteria.

6. [OPTIONAL] The Variables that should not be combined option should be used when variables are automatically matched, but should not be.

7. When manually specifying these variables to include or exclude from the combined data set or matching, you can do so by using any of the following methods:

List the variable names as a list separated by a comma, e.g. Q1,Q2.
Set a range of variable names that also include all variables in between based on the data set order, e.g. Q1-Q6.
Use wildcards that include any variables that, for example, start with a variable name prefix, e.g. Q2_*.
When specifying a variable from a specific data set, you append the data set value in brackets, e.g. Q2(2) to indicate Q2 from Data Set 2.
You can combine any of these methods by separating them with a comma.

For example, specifying d2,d2(2) would ensure that d2 in both data sets would be left as separate variables in the combined data set.

Set the preference for Names and Labels

When the labels or names for matching variables are not identical, you can set how they should be treated in the combined data set.

1. The Prefer names and labels from the option allows you to use the names and labels from either the First data set or Last data set.

2. And When there are multiple labels for the same value, you can instruct to either Create new values for the labels or Use label from preferred data set.

In this example, d1 has a different value label for 65+ vs 65 or more. Using the Create new values for the labels setting will append the latter as a new category with a new value.

If we instead select the Use label from preferred data set setting, it will map it with the original category and value:

Reading the Output

The merge output on the page summarizes how all of the variables in the files are matched and merged together. It is color-coded to draw your attention to how and why things are matched, as well as draw your attention to any data issues that you may need to review. All variables in the combined data set will be listed in the output. The coloring of the merged variables is as follows:

Pink represents a manual match based on the Variables to manually combine option.
Orange represents any fuzzy matches based on the Automatic Variable Matching section.
Blue represents any differences in label or value.

Any variable in the output that has been matched is expandable and will display details of what each variable matches to and how this match was performed.

Displayr will always add a categorical variable to the file named mergesrc and labeled Sources of cases, which identifies the source file for each record.
By default, this list will be truncated when there are many variables, but you can select Diagnostics > Variables from Combined Data in Properties to output the entire list of variables.

Automatic updating

If your input data sets will be periodically updated in the Cloud Drive, you can set this as a repeatable workflow that automatically combines the latest input data sets and exports the combined data file to the Cloud Drive. This can be achieved by adding a schedule to your Data - By Case output.

Import or Update the data sets in your document

Once the output has run, it will save the combined data set to your Displayr Cloud Drive using the name you specified. You can then do one of the following to add this data set to your document, depending on your requirements.

Add a new data set

Click the icon in the Data Sources tree.
Select Displayr Cloud Drive.
Select the combined data set from the list of files and click OK.

Update an existing data set

Select the original data set that you would like to update with the merged file in the Data Sources tree.
From Properties , click Update.
Select Displayr Cloud Drive.
Select the combined data set from the list of files and click OK.
A Data Difference Warning may appear, alerting you to any changes to the data set. Review these results and click Accept or Remind Me Later.
The data set and any connected outputs will now be updated with the combined data.

Tips

If possible, it is generally preferable to obtain a single data file that has already been merged than to merge data files yourself. This is because:
- It can be difficult to manually construct a Date variable.
- It can be difficult when questionnaires have changed between waves.
- It requires the person doing it to have a good understanding of data (as ill-considered decisions can have large ramifications).
If you get a 'file does not exist' error, please be sure to double-check your path on the Displayr Cloud Drive for case sensitivity and extra spaces. You also should only use one backslash to separate folders \ (unlike when using R).

How to Combine Data Files by Adding New Variables

How to Use the Displayr Cloud Drive

Multiple Data Sets Video Guide

Articles in this section

Requirements

Method