This article describes how to remove duplicates from your data set by creating an R filter that can be used for deletion.
- A Document with a data set that contains duplicate records.
- An ID variable to de-duplicate against, that is, one whose value should be unique for each record. In this example, we have a variable called UID.
1. Either hover over any variable in your Data Sets tree > Plus (+) > Insert Variable(s) > Custom Code > R - Numeric, or else from the toolbar menu, select Anything > Data > Variables > New > Custom Code > R - Numeric.
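2. In the object inspector, go to Properties > R CODE and enter code that flags the duplicate records. If you wish to keep the first instance of a duplicate, flag every later occurrence of the same UID, for example with R's duplicated function: duplicated(UID).
If you wish to keep the last instance of a duplicate, flag every earlier occurrence instead, for example: duplicated(UID, fromLast = TRUE).
Replace UID with the Name of your unique identifier variable.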
3. Name your new variable by going to Properties > GENERAL and editing the name to Duplicates.
4. In the object inspector, tick Usable as a filter and Hidden except in the data tree.
5. Select the name of your data set from the Data Sets tree.
6. Go to the object inspector > Properties > GENERAL > Unique identifier and select [Use case number].
7. Select any variables in your Data Sets tree that you wish to view as raw data, and right-click > View in Data Editor.
8. Select your filter variable in the Data Editor's Filter drop-down so that all the rows selected by the filter will appear in green.
9. Now click the row header > Delete Row(s) Matching Filter to delete these cases from your data set.
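The code for step 2 can be sketched as follows. This is a minimal example, assuming the filter is built on base R's duplicated function; the UID values are made-up sample data, and flag_keep_first and flag_keep_last are illustrative names only. In Displayr, presumably only the relevant duplicated(...) expression would be entered under R CODE, with UID referring to your own identifier variable.

```r
# Hypothetical sample values standing in for the UID variable.
UID <- c(101, 102, 101, 103, 102, 101)

# Keep the first instance: every later repeat of a UID is flagged TRUE.
flag_keep_first <- duplicated(UID)

# Keep the last instance: scanning from the end flags the earlier repeats instead.
flag_keep_last <- duplicated(UID, fromLast = TRUE)

# A numeric variable would hold these flags as 1 (duplicate) and 0 (keep).
print(as.numeric(flag_keep_first))
print(as.numeric(flag_keep_last))
```

Rows flagged 1 are the ones the filter selects, so they are the rows removed in step 9.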
If you'd like to de-duplicate your data set on more than one variable, you can use the paste function to combine the values of the variables that together should be unique. For example, if your data set has multiple rows for the same respondent-brand pair but you only want to keep one, you could paste the respondent and brand values together and flag repeats of the combined value.
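A minimal sketch of this approach, assuming two hypothetical variables named RespondentID and Brand (substitute the names of your own variables) and made-up sample values:

```r
# Hypothetical sample data: one row per respondent-brand record.
RespondentID <- c(1, 1, 2, 2, 1)
Brand <- c("Coke", "Pepsi", "Coke", "Coke", "Coke")

# Combine the two variables into a single key; the separator keeps
# values such as (1, "12") and (11, "2") from colliding.
pair_key <- paste(RespondentID, Brand, sep = "_")

# Flag every repeat of a respondent-brand pair after its first occurrence.
flag_duplicate_pair <- duplicated(pair_key)
print(as.numeric(flag_duplicate_pair))
```

As with a single identifier, adding fromLast = TRUE to duplicated would keep the last row of each pair rather than the first.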