The old adage that you spend 90% of your time cleaning data and 10% analyzing it has been repeated by data workers for decades. The Data Preparation Agent lets you spend less time on monotonous data preparation tasks by automatically cleaning and tidying your data. You can use it on its own, or within Chat or Research Agent to generate a report. Automating preparation improves the quality of analyses and gives you more time to dig into findings.
This article explains how to run the Data Preparation Agent and describes the checks that are included as part of data preparation. There is also a Frequently Asked Questions section at the end.
Requirements
- A Displayr document with a data set.
- Displayr AI needs to be enabled. See Opting In and Out of Displayr AI.
Method
You can run the Data Preparation Agent in two ways: within Chat or Research Agent, before a report is created automatically, or on its own, to prepare the data without creating a report. To run it with the Research Agent, see Research Agent. To run it as part of a "happy path", see Chat.
To run it on its own, follow the steps below.
- In the Data Sources tree, right-click the name of your data set and select Data Preparation Agent.
- You will be presented with a list of preparation steps (in sequential order) to run on your data. If desired, you can search for specific checks and change which ones are run. See the "Data preparation steps" section below for a general explanation of the checks. Click Continue.
- Data Preparation Agent will then review your data and prepare it for analysis according to your chosen steps. This process reviews data for every variable in your data set, and may take several minutes for large data sets.
- When the agent is finished, there will be a new Page named "Data Preparation" in your Report tree outlining the changes made for each check. You can review these and modify/delete any variables in the Data Sources tree as needed.
Data preparation steps
The following is a list of the data preparation steps the automation runs through, not necessarily in the order shown. More specifics are available in our technical reference.
- Split categorical grid variable sets (Nominal-Multi and Ordinal-Multi) that are incorrectly grouped together.
- Find and combine individual variables that are similar and should be combined into a grid variable set.
- Ensure categorical variable sets are correctly marked as ordered (Ordinal) or not ordered (Nominal). For example, if you have a nominal (Pick One) variable set that contains an ordered list, the structure will be changed to an ordinal (Pick One) variable set.
- Create better names for variable sets (e.g., How old are you → Age).
- In addition to creating better names, add the question number, if one exists, to the variable set name (e.g., How old are you → Q3. Age).
- For ordered categorical (Ordinal-Multi) variable sets:
  - Create variables to show how many Don't Know or Not Applicable observations were excluded.
  - Change value attributes to exclude Don't Know and Not Applicable responses from analyses.
  - For those on a rating scale:
    - Ensure the rating scale uses ascending values from worst to best (e.g., Strongly disagree = 1 to Strongly agree = 5).
    - Create a variable set for the top 2 boxes.
    - Create a numeric version of the variable set so you can analyze average scores.
- Identify Net Promoter Score (NPS) variable sets, specifically, and create a new numeric NPS version that is recoded to produce NPS scores when used in tables. See How to Recode Net Promoter Score (NPS) Variable(s) for more details.
- Set the Unique Identifier for the data set (required to manually edit and/or delete raw data). If no suitable variable is found, the agent flags duplicates in other ID variables for your review.
- Create variables that flag cases where variable sets fail Data Quality checks, including:
  - Text variable responses that are blank or don't have at least one letter in the response.
  - Straight-lining behavior in multi-item scale variable sets.
- Delete rows of data where 30% or more of the data quality flags listed above failed. See Identifying and Restoring Deleted Cases for information about reviewing and restoring deleted cases.
- Hide variable sets that contain no responses.
- Use automatic text categorization to create variables from text variables. If you want to check and/or edit the categorization, see How to Refine and Edit Text Themes After Classification.
- Automatically tag variables that look like weight variables as Usable as a weight.
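The data quality checks and the 30% deletion rule in the list above can be illustrated with a short Python sketch. This is not the agent's actual implementation: the function names are hypothetical, and the sketch assumes that straight-lining means giving the same answer to every answered item in a scale, and that the 30% is calculated as failed flags divided by all flags for a case.

```python
def is_poor_text(response):
    """Flag a text response that is blank or contains no letters."""
    return response is None or not any(ch.isalpha() for ch in response)


def is_straight_lined(ratings):
    """Flag a multi-item scale where every answered item has the same value.

    Assumption: unanswered items are None and are ignored.
    """
    answered = [r for r in ratings if r is not None]
    return len(answered) > 1 and len(set(answered)) == 1


def should_delete(flags, threshold_percent=30):
    """Return True when at least threshold_percent of the flags failed.

    flags is a list of booleans for one case, where True means the check
    failed. Integer arithmetic avoids floating-point rounding at the
    30% boundary.
    """
    if not flags:
        return False
    return sum(flags) * 100 >= threshold_percent * len(flags)


# One hypothetical respondent: two text answers and two rating grids.
case_flags = [
    is_poor_text("12345"),              # no letters, so flagged
    is_poor_text("Good value"),         # fine
    is_straight_lined([3, 3, 3, 3]),    # same answer throughout, so flagged
    is_straight_lined([4, 2, 5, 3]),    # fine
]
delete_case = should_delete(case_flags)  # 2 of 4 flags failed (50%)
```

In this example half of the flags fail, so the case meets the 30% threshold and would be a candidate for deletion.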
Frequently Asked Questions
Why does it automatically add Top 2 Box and NPS variable sets?
Large crosstabs from Nominal-Multi or Ordinal-Multi questions can be difficult to interpret. Converting these to Top 2 Box and NPS formats is a standard research practice to simplify analysis and improve the quality of insights.
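To show why these versions simplify analysis, here is a minimal Python sketch of the two recodes. It is an illustration only, not the agent's code: the function names are hypothetical, the Top 2 Box example assumes a 5-point scale coded 1 to 5, and the NPS example assumes the conventional 0 to 10 likelihood-to-recommend scale, with 9 and 10 as promoters and 0 to 6 as detractors.

```python
def top_2_box(rating, scale_max=5):
    """Return 1 if the rating is one of the top two scale points, else 0."""
    if rating is None:
        return None
    return 1 if rating >= scale_max - 1 else 0


def nps_recode(likelihood):
    """Recode a 0-10 rating so that the average of the recoded values is the NPS.

    Promoters (9-10) become 100, passives (7-8) become 0,
    and detractors (0-6) become -100.
    """
    if likelihood is None:
        return None
    if likelihood >= 9:
        return 100
    if likelihood <= 6:
        return -100
    return 0


def average(values):
    """Average the non-missing values."""
    valid = [v for v in values if v is not None]
    return sum(valid) / len(valid)


satisfaction = [5, 4, 3, 2, 1]
recommend = [10, 9, 8, 7, 3]

top_2_share = average([top_2_box(r) for r in satisfaction])   # 0.4, i.e. 40%
nps = average([nps_recode(r) for r in recommend])             # 20.0
```

Each respondent is reduced to a single number, so a crosstab shows one Top 2 Box percentage or one NPS per column instead of a row for every scale point.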
How does it tag weight variables?
The Data Preparation Agent scans variable labels and names for the word "weight" or a related term, such as "wts", "wtd", "weights", and "weight_demographics". It then checks the raw values of each matching variable: the values must be numeric, must be equal to or greater than 0, and must contain decimals (i.e., are not integers). If all of these checks pass, the variable is tagged as a weight.
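The checks described above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the agent's actual logic: the function name and the search pattern are hypothetical, missing values are assumed to be None, and "contain decimals" is interpreted as at least one value being a non-integer.

```python
import re

# Assumed pattern: matches "weight", "weights", "weight_demographics",
# "wts" and "wtd" anywhere in a name or label, ignoring case.
WEIGHT_PATTERN = re.compile(r"weight|wts|wtd", re.IGNORECASE)


def looks_like_weight(name, label, values):
    """Return True if a variable passes the name and value checks for a weight."""
    # Check 1: the name or label mentions a weight-like term.
    if not (WEIGHT_PATTERN.search(name) or WEIGHT_PATTERN.search(label)):
        return False

    observed = [v for v in values if v is not None]
    if not observed:
        return False

    # Check 2: every observed value is numeric (booleans are excluded).
    if not all(isinstance(v, (int, float)) and not isinstance(v, bool)
               for v in observed):
        return False

    # Check 3: no negative values.
    non_negative = all(v >= 0 for v in observed)

    # Check 4: at least one value has a decimal part.
    has_decimals = any(not float(v).is_integer() for v in observed)

    return non_negative and has_decimals


looks_like_weight("wts", "Survey weight", [0.82, 1.15, 1.0])   # passes all checks
looks_like_weight("wts", "Survey weight", [1, 2, 3])           # integers only, so not tagged
```

A name match alone is not enough: a variable called "wts" that holds only whole numbers is not tagged, because weights produced by a weighting procedure are rarely integers.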