This article describes how to remove duplicate records from your data set by creating a filter variable that flags them. You can create the filter with either the Data Quality > Duplicates automation or custom R code. The filter flags duplicate responses based on one or more variables; you then select it in the Data Editor and delete the flagged rows.
Requirements
- A Document with a data set that contains duplicate records.
- One or more ID variables to de-duplicate on. In this example, we have a variable called RespondentID.
Method
1. Create your filter variable that identifies duplicates. Either:
- Use the automation: select the variable(s) to deduplicate on in the Data Sources tree and click + > Data Quality > Duplicates.
- Or use R code: select any variable in your Data Sources tree and click + > Custom Code > R > Numeric. In the object inspector, check Usable as a filter, then enter the code below into the R Code editor. This flags every repeat of a value after its first occurrence, so the first instance is kept (more code examples are in the Notes below):
#flag duplicated cases based on the variable RespondentID - keep the first instance
#replace RespondentID below with the id variable name you want to deduplicate on
duplicated(RespondentID)
2. Because Displayr doesn't change the underlying raw data file, it has to map edits and deletions based on either a unique variable (such as a ResponseID) or the case (row) number in the data set. To set the Unique identifier for the data set, select the name of your data set in the Data Sources tree. Then, in the object inspector > General > General > Unique identifier field, select either your unique variable name or [Use case number] to use the row number of your data set.
3. Now you can use the Data Editor to filter the data and delete the records. Select any variables in your Data Sources tree that you wish to view as raw data, and right-click > View in Data Editor.
4. In the Data Editor's Filter dropdown, select the duplicates filter variable created in Step 1. All the rows selected by the filter will appear in green.
5. Now, click the row header > Delete Row(s) Matching Filter to delete these cases from your data set.
Notes
If you wish to keep the last instance of a duplicate, enter the below instead:
#flag duplicated cases based on the variable RespondentID - keep the last instance
#replace RespondentID below with the id variable name you want to deduplicate on
duplicated(RespondentID, fromLast=TRUE)
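To see the difference between the two options, here is a small sketch you can run in any R session. The ID values are made up for illustration and are not from a real data set; inside Displayr you would refer to your own variable by name instead of creating it.

```r
# Hypothetical ids for illustration only; "A2" appears three times
RespondentID <- c("A1", "A2", "A2", "A3", "A2")

# TRUE marks each repeat AFTER the first occurrence (first instance is kept)
flag_keep_first <- duplicated(RespondentID)

# TRUE marks each repeat BEFORE the last occurrence (last instance is kept)
flag_keep_last <- duplicated(RespondentID, fromLast = TRUE)

print(flag_keep_first)  # FALSE FALSE  TRUE FALSE  TRUE
print(flag_keep_last)   # FALSE  TRUE  TRUE FALSE FALSE
```

In both cases the rows flagged TRUE are the ones the filter selects for deletion. Note that duplicated() treats missing values as equal to each other, so rows with a missing ID after the first will also be flagged.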
If you'd like to deduplicate your data set on more than one variable, you can use the paste function to combine the values of the variables you want to deduplicate on. For example, if you have a data set that has multiple rows for respondent-brand pairs, but you only want to keep one row per pair, you could use:
#flag duplicated cases based on two variables (RespondentID and Brand) - keep the first instance
#replace RespondentID and Brand below with the names of the two variables you want to use
#you can deduplicate on more than two variables by adding more variable names separated by
#commas within the paste() parentheses
duplicated(paste(RespondentID, Brand))
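As a rough illustration of how the combined key behaves, the sketch below uses made-up respondent and brand values (assumptions for this example only). It also shows an alternative that passes the variables as a data frame, which gives the same flags here and avoids any chance of two different pairs pasting to the same text.

```r
# Hypothetical values for illustration only
RespondentID <- c(1, 1, 1, 2, 2)
Brand <- c("Coke", "Pepsi", "Coke", "Coke", "Coke")

# paste() builds one key per row, e.g. "1 Coke", "1 Pepsi"
key <- paste(RespondentID, Brand)

# TRUE marks each respondent-brand pair after its first occurrence
flag <- duplicated(key)
print(flag)  # FALSE FALSE  TRUE FALSE  TRUE

# Alternative: compare whole rows of a data frame instead of pasted text
flag_df <- duplicated(data.frame(RespondentID, Brand))
print(identical(flag, flag_df))  # TRUE for this example
```

If your values can contain spaces, consider adding a separator that never occurs in the data, for example paste(RespondentID, Brand, sep = "|"), so that distinct pairs cannot produce the same key.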
Next
How to De-duplicate Raw Data Using R
How to Delete Cases From a Data Set