How to Automatically Check and Prepare Your Data with Data Preparation Agent

The old adage of spending 90% of your time cleaning your data and 10% of your time analyzing your data has been preached by data workers for decades. Now you can spend less time doing monotonous data preparation tasks and instead use the Data Preparation Agent to automatically clean and tidy your data. This feature can be used on its own or within Chat or Research Agent to generate a report. This improves the quality of analyses and gives you more time to actually dig into findings.

This article explains how to run the Data Preparation Agent and describes the checks that are included as part of data preparation. There is also a Frequently Asked Questions section at the end.

Requirements

A Displayr document with a data set.
Displayr AI needs to be enabled. See Opting In and Out of Displayr AI.

Method

You can run the Data Preparation Agent within Chat or Research Agent before automatically creating a report, or on its own to prepare the data without creating a report. To run it with the Research Agent, see Research Agent. To run it as part of a "happy path" in Chat, see Chat.

To run it alone, see the steps below.

In the Data Sources tree, right-click the name of your data set and select Data Preparation Agent.
You will be presented with a list of preparation steps (in sequential order) to run on your data. You can search for and modify which specific checks to run, if desired. See the Data preparation steps below for a general explanation of the checks. Click Continue.
Data Preparation Agent will then review your data and prepare it for analysis according to your chosen steps. This process reviews data for every variable in your data set, and may take several minutes for large data sets.
When the agent is finished, there will be a new Page named "Data Preparation" in your Report tree outlining the changes made for each check. You can review these and modify/delete any variables in the Data Sources tree as needed.

Data preparation steps

The Data Preparation Agent runs through the following data preparation steps, though not necessarily in the order listed here. More specifics are available in our technical reference.

Split categorical grid variable sets (Nominal-Multi and Ordinal-Multi) that are incorrectly grouped together.
Find and combine individual variables that are similar and should be combined into a grid variable set.
Ensure categorical variable sets are correctly marked as ordered (Ordinal) or not ordered (Nominal). For example, if you have a nominal (Pick One) variable set that contains an ordered list, the structure will be changed to an ordinal (Pick One) variable set.
Convert dates stored as text, ordinal, or nominal variables to Date/Time variables.
Identify nominal/ordinal variables to treat as waves in statistical tests.
Create better names for variable sets (e.g., How old are you → Age).
In addition to creating better names, add the question number if it exists to the variable set name (e.g., How old are you → Q3. Age).
For ordered categorical (Ordinal-Multi) variable sets:
- Create variables to show how many Don't Know or Not Applicable observations were excluded.
- Change value attributes to exclude Don't Knows and Not Applicable responses from analyses.
- For those on a rating scale:
  - Ensure the rating scale uses ascending values from worst to best (e.g., Strongly disagree = 1 to Strongly agree = 5).
  - Create a variable set for the top 2 boxes.
  - Create a numeric version of the variable set so you can analyze average scores.
Identify Net Promoter Score (NPS) variable sets and create a new numeric version of each that is recoded to produce NPS scores when used in tables. See How to Recode Net Promoter Score (NPS) Variable(s) for more details.
Set the Unique Identifier for the data set (required to manually edit and/or delete raw data). If no suitable variable is found, the agent flags duplicate values in other ID variables for your review.
Create variables that flag cases where variable sets fail Data Quality checks, including:
- Text variable responses that are blank or don't have at least one letter in the response.
- Straight-lining behavior in multi-item scale variable sets.
- Survey completion times that are abnormally short. See the Frequently Asked Questions below for how the number of flags is assigned and how cases are potentially deleted.
- Rows of data where 30% or more of the data quality checks listed above were failed are deleted. See Identifying and Restoring Deleted Cases for information about reviewing and restoring deleted cases.
Hide variable sets that contain no responses.
Use automatic text categorization to create variables from text variables. If you want to check and/or edit the categorization, see How to Refine and Edit Text Themes After Classification.
Automatically tag variables that look like weight variables as Usable as a weight.
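
The data quality checks in the list above can be illustrated with a short Python sketch. This is not the agent's implementation. The function names are hypothetical, straight-lining is assumed to mean identical answers across every item in a scale, and the 30% share is assumed to be the number of failed checks divided by the number of checks run for a respondent.

```python
def text_is_poor(response):
    """Flag a text response that is blank or contains no letters."""
    return response is None or not any(ch.isalpha() for ch in response)


def is_straight_lined(ratings):
    """Flag a multi-item scale where every item received the same answer.

    Assumption: straight-lining means identical answers across at least
    two items; the agent's actual rule may differ.
    """
    return len(ratings) > 1 and len(set(ratings)) == 1


def should_delete(flags, threshold=0.30):
    """Return True when the share of failed checks meets the threshold.

    Assumption: the share is failed checks divided by all checks run
    for the respondent.
    """
    if not flags:
        return False
    return sum(flags) / len(flags) >= threshold


respondent_flags = [
    text_is_poor("12345"),                # digits only, no letters
    text_is_poor("Great service"),        # contains letters
    is_straight_lined([3, 3, 3, 3, 3]),   # same answer throughout
    is_straight_lined([1, 4, 2, 5, 3]),   # varied answers
]
print(respondent_flags)
print(should_delete(respondent_flags))
```

In this example two of the four checks fail, so the respondent meets the assumed 30% threshold and would be a candidate for deletion.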

Frequently Asked Questions

Why does it automatically add Top 2 Box and NPS variable sets?

Large crosstabs from Nominal-Multi or Ordinal-Multi questions can be difficult to interpret. Converting these to Top 2 Box and NPS formats is a standard research practice to simplify analysis and improve the quality of insights.
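
As a rough illustration of what these derived variables contain, the Python sketch below recodes a 5-point rating into a Top 2 Box indicator and a 0 to 10 likelihood-to-recommend answer into NPS values. It assumes the standard NPS convention (promoters 9-10, passives 7-8, detractors 0-6) and a scale where higher values are better. The function names and sample data are hypothetical, and this is not the agent's code.

```python
def top_2_box(rating, scale_max=5):
    """Return 1 if the rating is one of the top two scale points, else 0."""
    return 1 if rating >= scale_max - 1 else 0


def nps_recode(likelihood):
    """Recode a 0-10 likelihood-to-recommend answer for NPS.

    Promoters (9-10) become 100, passives (7-8) become 0, and
    detractors (0-6) become -100, so the average of the recoded
    values is the NPS.
    """
    if likelihood >= 9:
        return 100
    if likelihood >= 7:
        return 0
    return -100


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


ratings = [5, 4, 3, 2, 5]          # hypothetical 5-point ratings
likelihoods = [10, 9, 8, 7, 3]     # hypothetical 0-10 answers

top_2_share = mean([top_2_box(r) for r in ratings])
nps = mean([nps_recode(x) for x in likelihoods])
print(top_2_share)
print(nps)
```

Each result is a single number per question, which is easier to compare across groups than a full distribution of response categories.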

How does it tag weight variables?

The Data Preparation Agent evaluates variable labels and names, looking for the word "weight" or a variant or abbreviation of it, such as "wts", "wtd", "weights", and "weight_demographics". It then checks the raw values of these variables to confirm that they are numeric, are equal to or greater than 0, and contain decimals (i.e., are not integers). If all of these checks pass, the variable is tagged as a weight.
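
The checks described above can be sketched in Python as follows. This is a simplified illustration rather than the agent's actual logic. The regular expression covers only the example terms named in this article, "contain decimals" is assumed to mean that at least one value is not a whole number, and the function name is hypothetical.

```python
import re

# Assumption: a simple pattern covering the examples named in the article.
WEIGHT_NAME = re.compile(r"weight|wts|wtd", re.IGNORECASE)


def looks_like_weight(name, label, values):
    """Return True if a variable passes the name and value checks."""
    # Check 1: the name or label mentions "weight" or an abbreviation.
    if not (WEIGHT_NAME.search(name) or WEIGHT_NAME.search(label)):
        return False
    # Check 2: every value is numeric (bool is excluded explicitly
    # because it is a subclass of int in Python).
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in values
    )
    if not values or not numeric:
        return False
    # Check 3: no negative values.
    non_negative = all(v >= 0 for v in values)
    # Check 4: at least one value has a fractional part.
    has_decimals = any(v % 1 != 0 for v in values)
    return non_negative and has_decimals


print(looks_like_weight("wts", "Final weights", [0.82, 1.15, 1.0]))
print(looks_like_weight("q5", "Age", [0.5, 1.5]))
print(looks_like_weight("wtd", "", [1, 2, 1]))
print(looks_like_weight("weight", "", [-0.5, 1.2]))
```

Only the first call passes every check. The others fail on the name, on the absence of decimals, and on a negative value, respectively.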

How are flags assigned to speeders?

If respondents completed the survey in an abnormally short amount of time, they will be assigned the following number of flags:

Respondents who completed the survey in less than half the median duration receive one flag.
Respondents who completed the survey in less than a third of the median duration receive two flags.
Respondents who completed the survey in less than a quarter of the median duration receive three flags.

These flags help to identify lazy respondents and bots in your data set, and are taken into consideration when the agent deletes poor-quality cases from your data set.
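
The thresholds above can be expressed as a small Python function. This is a minimal sketch, assuming durations are measured in seconds and the median is taken across all respondents in the data set. The function name and sample durations are hypothetical.

```python
from statistics import median


def speeder_flags(duration, median_duration):
    """Return the number of speeding flags for one respondent."""
    if duration < median_duration / 4:
        return 3
    if duration < median_duration / 3:
        return 2
    if duration < median_duration / 2:
        return 1
    return 0


# Hypothetical completion times in seconds.
durations = [600, 600, 640, 700, 250, 180, 120]
median_duration = median(durations)

flags = [speeder_flags(d, median_duration) for d in durations]
print(median_duration)
print(flags)
```

With a median of 600 seconds, the cut-offs are 300, 200, and 150 seconds, so the last three respondents receive one, two, and three flags.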

How are variables identified as waves for statistical testing?

The Data Preparation Agent evaluates variable labels and names, looking for common tracking study terminology, and also examines the variable's values to determine whether it represents a survey wave. Because it uses AI rather than simple keyword matching, it can distinguish between a variable named "Survey Wave" (identified as a wave) and a question like "Which wave of technological innovation has had the most impact on your industry?" (not identified as a wave). When a match is found, the variable set is tagged as "Treat as wave" so that significance testing compares results across time periods.

The following are examples of variable names/labels that are identified as waves for statistical testing:

Wave / Survey Wave / survey_wave
Quarter
Month
Half year / half_year
Year
Financial Year / fin-year
Fieldwork Period / fw_period
Fieldwork Wave / fw_wave
Tracking Period / tracking_period
Sweep
Data Collection Round / dc_round

Next

Chat

Research Agent

How to Check and Clean Data After Importing
