This article describes how to remove duplicates from your data set by creating an R filter that can be used for deletion.
Requirements
- A Document with a data set that contains duplicate records
- An ID variable to de-duplicate on. Each value should identify exactly one record, so any value that appears more than once indicates a duplicate. In this example, we have a variable called UID.
Please note these steps require a Displayr license.
Method
1. Either hover over any variable in your Data Sources tree > Plus (+) > Insert Variable(s) > Custom Code > R - Numeric, or from the toolbar, select the icon > Data > Variables > New > Custom Code > R - Numeric.
2. If you wish to keep the first instance of a duplicate, enter the code below under Data > R CODE in the object inspector:
#flag duplicated cases based on the variable UID - keep the first instance
#replace UID below with the id variable name you want to deduplicate on
duplicated(UID)
If you wish to keep the last instance of a duplicate, enter the code below instead:
#flag duplicated cases based on the variable UID - keep the last instance
#replace UID below with the id variable name you want to deduplicate on
duplicated(UID, fromLast=TRUE)
Replace UID with the Name of your unique identifier variable.
3. Rename your new variable by going to General > GENERAL and editing the name to Duplicates.
4. Tick Usable as a filter and Hidden except in the data tree in the object inspector.
5. Select the name of your data set from the Data Sources tree.
6. Go to the object inspector > General > GENERAL > Unique identifier and select [Use case number].
7. Select any variables in your Data Sources tree that you wish to view as raw data, and right-click > View in Data Editor.
8. Select your Duplicates filter variable in the Data Editor's Filter dropdown. All rows selected by the filter will appear in green.
9. Now click the row header > Delete Row(s) Matching Filter to delete these cases from your data set.
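To see which rows each version of the code in step 2 flags, here is a minimal sketch in plain R. The ID values and the names flag_first and flag_last are made up for illustration and are not part of the Displayr steps. In Displayr you would refer to your own variable directly.

```r
# Hypothetical ID values standing in for the UID variable
UID <- c(101, 102, 101, 103, 102, 101)

# TRUE for every repeat after the first occurrence,
# so deleting the flagged rows keeps the first instance
flag_first <- duplicated(UID)

# TRUE for every occurrence before the last one,
# so deleting the flagged rows keeps the last instance
flag_last <- duplicated(UID, fromLast = TRUE)

print(flag_first)
print(flag_last)

# Number of rows the filter would select for deletion
print(sum(flag_first))
```

Both versions flag the same number of rows. They differ only in which copy of each duplicated record is left unflagged. Checking the count before deleting is a quick way to confirm the filter selects what you expect.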
Notes
If you'd like to deduplicate your data set on more than one variable, you can use the paste function to combine the values of those variables into a single key. For example, if your data set has multiple rows for the same respondent-brand pair but you only want to keep one row per pair, you could use:
#flag duplicated cases based on two variables (UID and Brand) - keep the first instance
#replace UID and Brand below with the names of the two variables you want to use
#you can deduplicate on more than two variables by adding more variable names separated by
#commas within the paste() parentheses
duplicated(paste(UID, Brand))
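As a rough illustration of the combined key, the sketch below uses made-up respondent and brand values. The names key, flag_pairs and flag_rows, and the "|" separator, are assumptions for the example. An explicit separator is worth considering because paste() joins values with a space by default, so values that themselves contain spaces could produce the same key for different pairs.

```r
# Hypothetical respondent IDs and brands
UID   <- c(1, 1, 2, 2, 1)
Brand <- c("Coke", "Pepsi", "Coke", "Coke", "Coke")

# Build one key per row; "|" is an arbitrary separator chosen
# because it is unlikely to occur in the data
key <- paste(UID, Brand, sep = "|")

# TRUE for each repeated respondent-brand pair after its first occurrence
flag_pairs <- duplicated(key)
print(flag_pairs)

# An alternative that avoids building a text key:
# duplicated() also works row-wise on a data frame
flag_rows <- duplicated(data.frame(UID, Brand))
print(flag_rows)

# Why the separator can matter: these two different pairs
# give the same key with the default space separator
print(paste("a b", "c") == paste("a", "b c"))
```

In this example, rows 4 and 5 are flagged because the pairs 2-Coke and 1-Coke have already appeared earlier in the data.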
Next
How to De-duplicate Raw Data Using R
How to Delete Cases From a Data Set
How to Randomly Remove A Subset of Cases From a Data Set
How to View or Edit Raw Data in Displayr