The basic workflow for checking and cleaning a data set involves working through the activities described in the sections on this page. Apart from importing the data first, the steps can be done in any order. This article builds on the basic Check Your Data steps. Once you've checked and cleaned the data, you may also want to tidy up the data and document before diving into your analysis and reporting.
Requirements
- A data set imported into a new Displayr document, see Get Your Data Into Displayr.
- You've gone through the steps in Check Your Data.
- You understand how Sample Size, NETs, and SUMs are calculated in Displayr, see How to Investigate a Sample Size or NET that is Too Small.
Method - Checking the data file
Sometimes data files contain errors that can make data analysis difficult and, sometimes, impossible. There is an automation that can check for the most common problems, see How to Check for Errors in Data File Construction.
This script will scan through the data in your project, looking for common errors in the data file setup. It will present a set of tables that highlight the errors so that you can address them, or if they are serious, ask your data provider to fix them and send you a new copy of your data.
Errors that the script tries to identify include:
- When a variable is the wrong Variable Type. For example, numeric data is stored in a Text variable.
- Incorrect Missing Data settings in binary variables.
- Missing labels.
A full list is available in the documentation for the script.
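The first error type above, numeric data stored in a Text variable, can be illustrated outside Displayr with a minimal Python sketch. This is not the automation's own logic; the column names and the `looks_numeric` helper are hypothetical and exist only to show the idea.

```python
def looks_numeric(values):
    """Return True if every non-empty text value parses as a number."""
    non_empty = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not non_empty:
        return False  # nothing to judge from
    for v in non_empty:
        try:
            float(v)
        except ValueError:
            return False
    return True


# Hypothetical raw columns, all stored as text in the data file.
columns = {
    "age": ["34", "51", "", "28"],
    "city": ["Sydney", "Chicago", "London", ""],
}

# Text columns whose contents are entirely numeric are candidates
# for a change of Variable Type.
suspect = [name for name, vals in columns.items() if looks_numeric(vals)]
print(suspect)
```

Here only `age` is flagged, because every non-empty value in it parses as a number.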
Method - Checking the data
Run the automation called Tables for Data Checking, which creates tables containing results that are automatically identified as requiring attention (e.g., tables with very small cell counts). See How to Create Tables for Data Checking.
You can review all of the variable sets to be sure they are as you expect. You can run the Summary Tables automation to automatically create a table from all the variable sets in your data, see How to Create Summary Tables. Then, you can review the tables and address any problems. In particular:
- Check that the NET value is sensible. See NET is not 100%.
- Look at the Sample Size at the bottom of the table, as it will often highlight data integrity issues. If it shows a range of values (e.g., Sample Size = from 120 to 139), this indicates that different cells in the table vary in their Sample Size. Use Statistics > Cells > Sample Size, Count to explore this in more detail. Where the range starts at a very low number (e.g., Sample Size n = 0 to 139), this generally indicates either a problem with the Value Attributes, or that the NET or SUM on the table should be hidden (right-click on it and select Hide).
- If the variable set is not showing in the table as you would like, see Checking Variable Set structure below.
See How to Investigate a Sample Size or NET that is Too Small for more background information on how these statistics are calculated and potential solutions based on how you'd like to present the data. The following sections also provide various ways of cleaning the data.
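To see why a table can report a range of sample sizes, the following Python sketch counts the non-missing responses behind each cell. It is a simplified illustration under assumed data, not Displayr's actual calculation; the brand labels and the `sample_size` helper are hypothetical.

```python
# Hypothetical grid question: one list of responses per table cell,
# with None marking a missing response.
responses = {
    "Brand A": [1, 0, 1, None, 1],
    "Brand B": [0, 1, None, None, 1],
    "Brand C": [None, None, None, None, None],
}


def sample_size(values):
    """Count the non-missing responses behind a cell."""
    return sum(v is not None for v in values)


sizes = {label: sample_size(vals) for label, vals in responses.items()}
smallest, largest = min(sizes.values()), max(sizes.values())

# A single value when all cells agree, otherwise a range.
summary = (f"n = {largest}" if smallest == largest
           else f"n = from {smallest} to {largest}")
print(sizes)
print(summary)
```

A cell with no valid responses, like `Brand C` here, drags the lower end of the range to 0, which is the pattern that usually points to a Value Attributes problem or a NET that should be hidden.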
Hiding irrelevant data
You can manually hide a variable or variable set by selecting it in the Data Sets pane and clicking Hide. You can also automate this process, see How to Hide Uninteresting Data.
Checking Variable Set structure
You should review each variable set to ensure the variable groupings and Structure are as you'd like. Next to the name of each variable set in the Data Sets tree (and other fields), there is an icon denoting the structure of the variable set, see Variable Sets for a reference of how each Structure is presented in a table. If the variables have been correctly grouped, but the Structure is wrong, change the Structure in the object inspector.
Reviewing the Value Attributes
If the color of the variable set is orange in the Data Sets tree, that means you need to review the Value Attributes (in general, you should review the Value Attributes anyway if you are new to Displayr or if you have obtained data from a new supplier). See the section "A variable set's value attributes also determine how it is analyzed" in Variable Sets for how to set the value attributes.
You can also recode data by editing the contents of the Value column. In particular, if a category is not checked in the Missing Data column, the Value shown will be used when computing averages, medians, and other non-categorical statistics. Consequently, it is often useful to replace the Value in the data set with some other more useful value. Common things to change are:
- Using a Value that more accurately reflects the label (e.g., a mid-point).
- Setting categories that correspond to missing data to 0, where it can be deduced that the data is missing because 0 is the appropriate answer (e.g., if respondents were asked their purchase frequency only if they were aware of the product).
- Setting the value for Don't know responses to NaN.
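The effect of these changes on an average can be sketched in Python. This is an illustration of the arithmetic only, not of how Displayr stores Value Attributes; the category codes, spend bands, and midpoints below are assumptions made up for the example.

```python
import math
from statistics import mean

# Hypothetical category codes as stored in the data file:
#   1 = "Less than $20", 2 = "$20 to $40", 3 = "$40 to $60", 9 = "Don't know"
# Replacement values: band midpoints (assuming a lower bound of $0 for the
# first band), and NaN so that "Don't know" is left out of the average.
recode = {1: 10.0, 2: 30.0, 3: 50.0, 9: math.nan}

raw = [1, 2, 2, 3, 9, 3]
recoded = [recode[code] for code in raw]


def mean_ignoring_nan(values):
    """Average the values, skipping any NaN entries."""
    kept = [v for v in values if not math.isnan(v)]
    return mean(kept)


# The mean of the raw codes is distorted by the arbitrary code 9.
print(mean(raw))
# The mean of the recoded values is in dollars and excludes "Don't know".
print(mean_ignoring_nan(recoded))
```

The raw-code average has no meaningful unit, while the recoded average can be read as an approximate spend in dollars.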
You can manually edit the Label and Value for the data, see How to Recode into Existing or New Variables. We also have some automations that can help in the Anything > Data > Variables > Modify > Recode menu, such as:
- How to Recode a Variable Using Category Midpoints
- How to Recode Numeric Variable(s) from Code/Category Midpoints
- How to Reverse Scales in Variable Sets
- How to Remove Don't Know Categories
- How to Recode High Values (Capping) in Numeric Variables
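Two of the recodes in this list, capping and scale reversal, come down to simple arithmetic. The Python sketch below shows that arithmetic under assumed data; it does not reproduce what the Displayr automations do internally, and the helper names `cap` and `reverse_scale` are hypothetical.

```python
def cap(values, upper):
    """Replace any value above `upper` with `upper` (capping)."""
    return [min(v, upper) for v in values]


def reverse_scale(values, low, high):
    """Flip a scale so that `low` becomes `high` and vice versa."""
    return [low + high - v for v in values]


# Hypothetical weekly spend with one extreme value.
spend = [12, 40, 950, 33]
print(cap(spend, 100))

# Hypothetical ratings on a 1 to 5 scale.
rating = [1, 2, 5, 4]
print(reverse_scale(rating, 1, 5))
```

Capping limits the influence of the extreme value on averages without removing the respondent, and reversing keeps the distance between scale points unchanged.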
Tidying Names and Labels
If the names of questions appear to be very short, it may be possible to obtain better question names from the labels in the raw data, see How to Suggest Better Variable Names from Source Labels.
Similarly, if labels appear messy, with information about the question included, then it may be possible to tidy them up, see How to Remove Truncated Text from Variable Labels.
Checking questionnaire skips
There are several ways of checking questionnaire skips:
- Create filters from questions that were used to determine the skips and apply these to tables.
- Open the variable set and upstream variable sets in the Data Editor, then sort by any variables that are used as skips (which aligns the other variables' rows with the variable used in the skips).
- If using an Enterprise Displayr license, you can create a QScript to Check for Invalid Data.
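The logic behind a skip check can be sketched in Python as follows. This is not a QScript and does not use any Displayr API; the respondent records, field names, and the `skip_violations` helper are hypothetical, and the example assumes that purchase frequency should be answered only by respondents who are aware of the product.

```python
# Hypothetical respondents: purchase_freq should be answered only when
# aware is True. None marks a question that was not answered.
respondents = [
    {"id": 1, "aware": True, "purchase_freq": 3},
    {"id": 2, "aware": False, "purchase_freq": None},
    {"id": 3, "aware": False, "purchase_freq": 2},    # answered, should have skipped
    {"id": 4, "aware": True, "purchase_freq": None},  # skipped, should have answered
]


def skip_violations(rows, condition, target):
    """Return ids that answered `target` when they should have skipped it,
    and ids that skipped `target` when they should have answered it."""
    answered_wrongly = [r["id"] for r in rows
                        if not r[condition] and r[target] is not None]
    missing_wrongly = [r["id"] for r in rows
                       if r[condition] and r[target] is None]
    return answered_wrongly, missing_wrongly


answered_wrongly, missing_wrongly = skip_violations(
    respondents, "aware", "purchase_freq")
print(answered_wrongly, missing_wrongly)
```

Applying a filter to a table, as in the first option above, surfaces the same two groups: responses that appear where the filter says there should be none, and gaps where responses are expected.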
Changing data (i.e., changing a respondent's values)
There are a variety of different tools for changing respondents' data, including:
- Manually changing the values in the Data Editor, see How to View or Edit Raw Data in Displayr. There are a few tricks to doing it quickly:
  - You can sort or filter the data for all variables in the editor by any column shown in the editor.
  - You can use the Search bar in the Data Sets pane to find the variables that you want to edit by name and drag them to the editor if it's already open.
- Recoding data (see Reviewing the Value Attributes above).
- Using automations, in particular:
Back-coding 'other specifies'
See How to Back Code Variables in Displayr.
Deleting cases
If, when cleaning and checking the data, you identify cases (respondents) that should be deleted, you can delete them manually or in bulk from the Data Editor pane, see How to Delete Cases From a Data Set. Keep in mind that deleted cases are not removed from the data file, but they are excluded from any analyses. You can restore deleted rows by right-clicking and selecting Revert Deleted Rows.
For further help with cleaning data see: How to Fix Metadata Issues in Displayr.