A popular approach to handling missing data is imputation, which replaces each missing value with an estimate.
Click here for more information on when to use imputation.
Requirements
A data set loaded into Displayr with at least one instance of missing data.
Method
- Select the variable(s) with missing data that you would like to impute in the Data Sources tree.
- Click + > Data Quality > Impute Missing Values.
This adds an imputed variable for each variable selected in the first step, with "imputed" in its Name and Label.
To change how the imputation is performed, select the new variable, or a variable within the new variable set, and update any of the following settings in the object inspector:
- OPTIONAL: Auxiliary variables: You can add additional variables to this dropdown box to use the data from those variables in the imputation. These variables' data are used to inform the imputation but are not themselves added to the data set.
- OPTIONAL: Seed: This is the random number seed used in the imputation. Changing this number will produce different imputed values.
- OPTIONAL: Method:
- Try mice: The imputation will initially try to use the mice (Multivariate Imputation by Chained Equations in R) algorithm, and if this is not successful, it will attempt to use the hotdeck algorithm.
- Hot Deck: Force the imputation to only use the hotdeck algorithm.
- Mice: Force the imputation to only use the mice algorithm.
- Click Calculate.
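The "Try mice" fallback and the role of the seed can be sketched in a few lines of Python. This is an illustration only, not Displayr's implementation: the function names are hypothetical, the primary method is a stand-in that always fails, and the fallback simply draws a random observed value from the same column, which is far cruder than the real hotdeck algorithm.

```python
import random

def failing_primary(rows, rng):
    # Hypothetical stand-in for a primary method that hits a technical error.
    raise RuntimeError("primary imputation failed")

def random_donor_fill(rows, rng):
    # Hypothetical stand-in fallback: fill each missing cell (None) with a
    # value drawn at random from the observed values in the same column.
    columns = list(zip(*rows))
    filled = []
    for row in rows:
        new_row = []
        for col_index, value in enumerate(row):
            if value is None:
                donors = [v for v in columns[col_index] if v is not None]
                value = rng.choice(donors)
            new_row.append(value)
        filled.append(new_row)
    return filled

def impute_with_fallback(rows, primary, fallback, seed):
    # Try the primary method first; if it raises, use the fallback.
    # Each attempt gets its own generator built from the same seed.
    try:
        return primary(rows, random.Random(seed)), "primary"
    except Exception:
        return fallback(rows, random.Random(seed)), "fallback"

rows = [[1, 10], [2, None], [None, 30], [4, 40]]
result_a, used_a = impute_with_fallback(rows, failing_primary, random_donor_fill, seed=123)
result_b, used_b = impute_with_fallback(rows, failing_primary, random_donor_fill, seed=123)
```

Running the imputation twice with the same seed gives identical results, which is why keeping the seed fixed makes an imputation reproducible.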
Technical Details
By default, data is imputed using the mice R package with its default settings, which applies Multivariate Imputation by Chained Equations (predictive mean matching) [1]. Take care that each variable has the correct data type, as this has a significant impact on the algorithm. If mice encounters a technical error, imputation is instead performed by hot-decking via the hot.deck package in R [2].
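To give a feel for predictive mean matching, here is a deliberately simplified Python sketch for one numeric variable with a single predictor. It is not the mice algorithm: mice also draws the regression parameters at random and cycles through all variables with missing data, both of which are omitted here. The function name `pmm_impute`, the donor count `k` and the example data are assumptions made for illustration.

```python
import random

def pmm_impute(x, y, k=5, seed=123):
    # Simplified predictive mean matching: y contains None where missing,
    # x is a fully observed predictor.
    rng = random.Random(seed)
    observed = [i for i, value in enumerate(y) if value is not None]

    # Fit y = intercept + slope * x by least squares on the observed cases.
    n = len(observed)
    mean_x = sum(x[i] for i in observed) / n
    mean_y = sum(y[i] for i in observed) / n
    sxx = sum((x[i] - mean_x) ** 2 for i in observed)
    sxy = sum((x[i] - mean_x) * (y[i] - mean_y) for i in observed)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    def predict(xi):
        return intercept + slope * xi

    imputed = list(y)
    for i, value in enumerate(y):
        if value is None:
            target = predict(x[i])
            # The k observed cases whose predicted values are closest
            # to the prediction for the missing case become donors.
            donors = sorted(observed, key=lambda j: abs(predict(x[j]) - target))[:k]
            # Copy the actual observed value of one randomly chosen donor.
            imputed[i] = y[rng.choice(donors)]
    return imputed

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, None, 8.2, 9.8, None, 14.1, 16.0]
completed = pmm_impute(x, y, k=3)
```

Because each imputed value is copied from a real case, the imputed values always lie within the range of observed data, which is one reason the method is sensitive to each variable's data type.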
When imputation is used with regression, cases with missing values in the outcome variable are excluded from the analysis after imputation [3].
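The "impute, then exclude" step can be pictured with the following minimal Python sketch. The helper name `drop_imputed_outcomes`, the column names and the data are hypothetical; the sketch only shows which cases are kept, not how Displayr carries out the regression.

```python
def drop_imputed_outcomes(original_rows, imputed_rows, outcome):
    # Keep only the cases whose outcome was observed in the original data.
    # Predictors in the kept cases may still hold imputed values.
    return [
        imputed
        for original, imputed in zip(original_rows, imputed_rows)
        if original[outcome] is not None
    ]

original = [
    {"income": 50, "age": 30},
    {"income": None, "age": 41},   # outcome missing
    {"income": 65, "age": None},   # predictor missing
]
after_imputation = [
    {"income": 50, "age": 30},
    {"income": 58, "age": 41},
    {"income": 65, "age": 35},
]
analysis_rows = drop_imputed_outcomes(original, after_imputation, outcome="income")
```

Here the second case is dropped because its outcome was imputed, while the third case is kept with its imputed predictor.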
Note that although imputation can reduce bias in parameter estimates, it can lead to misleading statistical inference (e.g., when calculations use the sample size after imputation instead of the number of cases actually observed).
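A small arithmetic example shows why the sample size matters. It is a sketch with made-up numbers, and it assumes for simplicity that the standard deviation is the same before and after imputation, which will not hold exactly in practice.

```python
import math

def standard_error(sd, n):
    # Standard error of a mean: standard deviation divided by sqrt(n).
    return sd / math.sqrt(n)

sd = 10.0                  # assumed standard deviation
n_observed = 60            # cases with real data
n_after_imputation = 100   # cases once missing values are filled in

se_actual = standard_error(sd, n_observed)         # about 1.29
se_naive = standard_error(sd, n_after_imputation)  # 1.0
```

Treating the 40 imputed cases as if they were real observations shrinks the standard error from about 1.29 to 1.0, so confidence intervals look narrower and tests look more significant than the observed data justify.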
The new variables are imputed jointly. This means that if you make changes to one of them, the others will change too.
There are some technical limitations with regard to how you can change the new variables:
- You cannot add or remove variables from the Variables dropdown.
- You cannot change the order of variables in the Variables dropdown.
- If you wish to delete any of the imputed variables, you must delete them all together because they are linked.
References
- Stef van Buuren and Karin Groothuis-Oudshoorn (2011), "mice: Multivariate Imputation by Chained Equations in R", Journal of Statistical Software, 45:3, 1-67.
- Skyler J. Cranmer and Jeff Gill (2013). We Have to Be Discrete About This: A Non-Parametric Imputation Technique for Missing Categorical Data. British Journal of Political Science, 43, pp 425-449.
- von Hippel, Paul T. 2007. "Regression With Missing Y's: An Improved Strategy for Analyzing Multiply Imputed Data."
Next
How to Turn Off Missing Data Selection for Specific Values
How To Check for Missing Data Using Plot by Case
How To Check for Missing Data Using Plot of Patterns