How to Randomly Choose a Subset of Cases From a Data Set

There are occasions when you have collected more records than a survey needs and want to randomly remove the surplus, or when you simply want a random subset of records to work with. This article explains how to create a filter for a random sample of respondents in your data set. You can then use this filter to refine tables or analyses and, if needed, to remove cases from the data.

Requirements

A Document with a data set.
For the second method, a filter variable that identifies the subgroup you want to sample from. In this example, we use a variable called Males 25-29.

Method - Random filter across all respondents

1. Hover over any variable in your Data Sources tree and click Plus (+) > Custom Code > R > Numeric.

2. In the Code panel, paste the code below:

##code to modify
#specify any variable in the dataset (this is used to calculate how many respondents are in the data)
id = UniqueID
#specify number of respondents to randomly select
select = 10

##standard code
#set the seed so randomization doesn't change if calculated again later
set.seed(123)
#calculate total sample size
ss = length(id)
#select the random respondents/rows in the data
indices = sample.int(ss, select)
#create an empty filter
filter = rep(0, ss)
#change the random selection values in the filter to 1
filter[indices] = 1
#return the final filter
filter

3. Name your new variable by going to Properties > General > General > Name and editing the name to Random sample.

4. Tick Usable as a filter and Hidden (except in variables and code) under Data > Attributes in Properties.

5. [OPTIONAL]: Use this filter to delete those random selections. See How to Remove Cases From Raw Data Using a Filter.

6. [OPTIONAL]: Use this filter to show only the random selection in your table or analysis by choosing it from the Filters & Weight > Filter(s) dropdown.

7. [OPTIONAL]: If you instead want a filter that excludes the random selection from a table or analysis, change lines 14-17 of the code to:

#create a filter including everyone
filter = rep(1, ss)
#change the random selection values in the filter to 0 to filter out
filter[indices] = 0
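
To check the logic outside Displayr, the following is a minimal sketch in plain R. It assumes a simulated ID vector of 200 cases in place of the UniqueID variable; the vector and the helper name make_random_filter are hypothetical and not part of Displayr. It wraps the standard code in a function and confirms that exactly the requested number of cases is flagged and that the filter-out version is the complement of the filter-in version.

```r
# Hypothetical stand-in for the UniqueID variable: 200 simulated cases
id = 1:200
select = 10

# Hypothetical helper wrapping the standard code above
make_random_filter = function(n_cases, n_select, seed = 123, keep = TRUE) {
    set.seed(seed)                             # fixed seed gives a repeatable selection
    indices = sample.int(n_cases, n_select)    # rows chosen without replacement
    filter = rep(if (keep) 0 else 1, n_cases)  # start with nobody (or everybody) flagged
    filter[indices] = if (keep) 1 else 0       # flip the randomly chosen rows
    filter
}

filter_in = make_random_filter(length(id), select)
filter_out = make_random_filter(length(id), select, keep = FALSE)

sum(filter_in)                    # 10 cases selected
all(filter_in + filter_out == 1)  # TRUE: the two filters are complements
```

Because both calls use the same seed, they pick the same rows, which is why the two filters split the data cleanly. Note that sample.int() raises an error if select is larger than the number of cases.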

Method - Random filter across a subgroup of respondents

1. Hover over any variable in your Data Sources tree and click Plus (+) > Custom Code > R > Numeric.

2. In the Code panel, paste the code below:

##code to modify
#specify a filter variable of your subgroup
subgroup = `Males 25-29`
#specify label of those selected in the subgroup variable
selected = "Selected"
#specify number of respondents to randomly select
select = 10

##standard code
#set the seed so randomization doesn't change if calculated again later
set.seed(123)
#get the list of rows of the subgroup in the data
subgroup_rows = which(subgroup == selected)
#select the random respondents/rows from those rows (indexing avoids sample() treating a single row number as 1:n)
indices = subgroup_rows[sample.int(length(subgroup_rows), select)]
#create an empty filter
filter = rep(0, length(subgroup))
#change the random selection values in the filter to 1
filter[indices] = 1
#return the final filter
filter

3. Name your new variable by going to Properties > General > General > Name and editing the name to Random sample of Males 25-29.

4. Tick Usable as a filter and Hidden (except in variables and code) under Data > Attributes in Properties.

5. [OPTIONAL]: Use this filter to delete those random selections. See How to Remove Cases From Raw Data Using a Filter.

6. [OPTIONAL]: Use this filter to show only the random selection in your table or analysis by choosing it from the Filters & Weight > Filter(s) dropdown.

7. [OPTIONAL]: If you instead want a filter that excludes the random selection from a table or analysis, change lines 16-19 of the code to:

#create a filter including everyone
filter = rep(1, length(subgroup))
#change the random selection values in the filter to 0 to filter out
filter[indices] = 0
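
One edge case is worth knowing about when sampling from a subgroup. When sample() is given a single number, base R treats it as the range 1:x, not as a one-element vector, so a subgroup with only one member could return rows outside the subgroup. The sketch below assumes a simulated subgroup variable; the data and the helper name sample_rows are hypothetical. It shows the safe pattern of indexing into the row list with sample.int(), and it also caps the request at the subgroup size.

```r
# Hypothetical stand-in for a subgroup filter variable with 50 cases
subgroup = rep("Not selected", 50)
subgroup[c(4, 17, 33)] = "Selected"
selected = "Selected"
select = 10   # deliberately larger than the subgroup

# Hypothetical helper: sample safely from a list of row numbers
sample_rows = function(rows, n_select, seed = 123) {
    set.seed(seed)
    n_select = min(n_select, length(rows))    # cannot select more rows than exist
    rows[sample.int(length(rows), n_select)]  # sample positions, then look up the rows
}

subgroup_rows = which(subgroup == selected)
indices = sample_rows(subgroup_rows, select)

filter = rep(0, length(subgroup))
filter[indices] = 1

sum(filter)         # 3, because only three cases are in the subgroup
sample_rows(17, 1)  # 17, whereas sample(17, 1) could return any of 1 to 17
```

Capping the request is a design choice: without it, asking for more cases than the subgroup contains stops with an error. That may be preferable if you would rather be alerted to a subgroup that is too small.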

How to Remove Duplicate Cases From a Data Set

How to De-duplicate Raw Data Using R

How to Delete Observations From a Data Set

How to View or Edit Raw Data in Displayr