Missing Data and Multiple Imputation
Overview
Data that we plan to analyze are often incomplete. Study design strategies should ideally be set up to obtain complete data in the first place through questionnaire design, interviewer training, study protocol development, real-time data checking, or re-contacting participants to obtain complete data. When obtaining complete data is not feasible, proxy reports or the collection of characteristics associated with the missing values can help. Missing data can be categorized in multiple ways. Perhaps the most troubling are the data missing on entire observations (e.g., due to selection bias) or on entire variables that have been omitted from the study design. Somewhat more tractable, but still potentially problematic, are data missing on a subset of variables that are missing for a subset of the observations. In this case, it can be useful to label those observations without missing data as “complete cases” and those with some missing data as “partial cases.” Ideally, we hope that the amount of missing data is limited, in which case we will rely less heavily on our assumptions about the pattern of missing data. Missing data can bias study results because they distort the effect estimate of interest (e.g. β). Missing data are also problematic if they decrease the statistical power by effectively decreasing the sample size, or if they complicate comparisons across models that differ in both the analysis strategy and the number of included observations.
Description
The amount of bias potentially introduced by missing data depends on the type of missing data.
-
What you hope for: Missing completely at random (MCAR). By stating that data are MCAR, we assume that the missing values are not systematically different from the values we did observe. For example, imagine a standardized test which randomly assigns a subset of questions to each student. We could reasonably assume that the characteristics of students receiving different versions of the test would be similar, given large enough sample sizes. Even though some of the questions will have missing data, we have a clear understanding of the random process leading to these missing data patterns.
-
Second best: Missing at random (MAR). When data are MAR, the missing values are systematically different from the observed values, but the systematic differences are fully accounted for by measured covariates. In this situation we can use what we know about partial cases to compensate for bias due to missing data. For example, imagine a pop quiz administered on a single day to all students, with complete data among those present and missing data for all who were absent. By linking to the full enrollment and attendance records, we see that quiz scores were lower on average among students with a poor attendance record, and there was more missing data for this group. Yet if we assume that being absent on quiz day was random after you account for the prior attendance record, we can use the available data to extend what we know about observed scores to the missing scores.
-
The worst: Non-ignorable (NI) missing data, also sometimes labeled not missing at random (NMAR) or informative missing data. Concerns about NI data may be raised when missing values are thought to differ systematically from observed values in ways that measured covariates cannot account for. This can happen if (1) the missing value itself influences the probability of missingness or (2) some unmeasured quantity predicts both the value of the missing variable and the probability of missingness. Building on the example given above, let’s consider an optional quiz for which scores will be displayed publicly. Students who are apprehensive about their quiz score may avoid participating. They may have an unobserved history of low scores on practice quizzes, or the high level of anxiety itself may hinder their performance. In either case, the characteristics of those abstaining from the quiz would make it difficult to identify a comparable group of students who completed the quiz. Other examples could include loss to follow-up as a direct result of illness in a prospective health study, or study assessments that were incomplete due to participant symptoms during the procedure.
How can we distinguish MCAR, MAR, and NI missing data? In reality, we often have to rely on prior knowledge and assumptions. Showing that observed characteristics are similar among those with and without missing data can help to support an MCAR assumption. However, we cannot usually rule out NI missing data, since these are defined by a systematic difference across unmeasured quantities. Often, the best we can do is to investigate how sensitive our results are to different missing data assumptions.
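To make the three mechanisms concrete, the sketch below simulates the quiz example. Everything numeric in it is an assumption chosen for illustration: the 30% share of students with poor attendance, the score distributions, the missingness probabilities, and the function name `simulate`. It compares the true mean score with the naive mean of observed scores and with a mean that reweights each attendance group by its share of enrolled students.

```python
import random
from statistics import mean

def simulate(mechanism, n=20000, seed=1):
    """Simulate quiz scores under one missingness mechanism.

    Returns (true mean, naive observed mean, attendance-adjusted mean).
    All numeric settings are illustrative assumptions.
    """
    rng = random.Random(seed)
    scores = []                        # all scores, known only in a simulation
    observed = {True: [], False: []}   # observed scores, keyed by poor attendance
    counts = {True: 0, False: 0}       # enrolled students per attendance group
    for _ in range(n):
        poor = rng.random() < 0.3
        score = rng.gauss(60 if poor else 75, 10)
        if mechanism == "MCAR":        # unrelated to any characteristic
            p_missing = 0.3
        elif mechanism == "MAR":       # depends only on measured attendance
            p_missing = 0.6 if poor else 0.1
        elif mechanism == "NI":        # depends on the unseen score itself
            p_missing = 0.6 if score < 65 else 0.1
        else:
            raise ValueError(mechanism)
        scores.append(score)
        counts[poor] += 1
        if rng.random() >= p_missing:
            observed[poor].append(score)
    naive = mean(observed[True] + observed[False])
    # Reweight each group's observed mean by its share of all enrolled students.
    adjusted = sum(counts[g] / n * mean(observed[g]) for g in (True, False))
    return mean(scores), naive, adjusted

for mechanism in ("MCAR", "MAR", "NI"):
    truth, naive, adjusted = simulate(mechanism)
    print(f"{mechanism:4s} truth={truth:.1f} naive={naive:.1f} adjusted={adjusted:.1f}")
```

Under these assumptions the naive mean is close to the truth only for MCAR, adjusting for attendance recovers the truth under MAR, and neither estimate is rescued under NI. In real data the true mean is unknown, so this kind of comparison is only available in simulation or as a sensitivity analysis.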
Another way to categorize missing data patterns is as monotone or arbitrary, a distinction that has practical implications in planning your strategy to address missing data. The most concise definition of monotone missing data that I’ve seen is that the data can be arranged so that the following is true: if Variable J is missing, then Variable K is also missing for all K>J. This is often depicted visually as an array with observations as rows and variables as columns, in which a triangular or square block of data is missing from the lower right corner. I can most easily imagine a monotone missing data pattern arising from loss to follow-up: everyone with missing values at a particular study visit has dropped out and is also missing those values at all subsequent visits. Monotone missing data are in some ways simpler to work with, but this pattern is often suggestive of NI missing data if not by design.
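The definition of a monotone pattern can be checked mechanically. The sketch below is one possible approach, assuming the data are held as a list of rows with `None` marking a missing value; the function name `monotone_order` is hypothetical. It relies on the fact that, in a monotone pattern, the sets of observations missing each variable are nested, so sorting variables by their number of missing values must yield a valid ordering if one exists.

```python
def monotone_order(rows):
    """Return a column order that makes missingness monotone, or None.

    rows: list of equal-length lists, with None marking a missing value.
    """
    n_cols = len(rows[0])
    # For each column, the set of row indices where the value is missing.
    missing = [
        {i for i, row in enumerate(rows) if row[j] is None}
        for j in range(n_cols)
    ]
    # Nested sets are ordered by size, so sort columns by missing count.
    order = sorted(range(n_cols), key=lambda j: len(missing[j]))
    # Each column's missing rows must also be missing in the next column.
    for a, b in zip(order, order[1:]):
        if not missing[a] <= missing[b]:
            return None
    return order

# Dropout across three visits: once a visit is missed, later visits are too.
dropout = [[5.1, 4.9, 5.3], [6.0, 6.2, None], [4.4, None, None]]
# Arbitrary pattern: each variable is missing for a different person.
arbitrary = [[5.1, None, 5.3], [None, 6.2, 6.1]]

print(monotone_order(dropout))
print(monotone_order(arbitrary))
```

A returned list gives the column positions in an order where "missing at J" implies "missing at every later K"; `None` signals an arbitrary pattern.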
Options for analysis
Options for dealing with missing data are relatively easy to implement in standard software. Comparisons across multiple methods may reveal that results are robust to the assumptions made about missing data, or they may provide extreme cases that likely surround the truth.
1. Complete case (aka listwise deletion) is often the default, provided that missing data are coded in a way that the software recognizes (e.g., “.”). This approach discards partial cases, and is asymptotically unbiased if data are MCAR.
2. Missing values can be treated as a separate category. Using this approach for confounders may allow for residual confounding if the missing category is not homogeneous. (Note: if you decide to use this approach with continuous variables by replacing missing values with the mean, consider adding an interaction term between the predictor of interest and the indicator of missingness to minimize bias.)
3. Censoring is a strategy commonly used for longitudinal data in a proportional hazards model when the outcome is missing. When the outcome can no longer be observed for certain individuals, those individuals are simply removed from the comparisons going forward. Another type of censoring may take the form of a “floor” or “ceiling” beyond which data are missing. Censoring-related strategies use the available information and may be appropriate for extreme NI missing data.
4. Single imputation essentially consists of filling in the missing data with plausible values. Single imputation strategies differ in their strengths and weaknesses:
-
Impute to mean or median (simply filling in a typical value for all missing data may be biased, but it limits the leverage of missing data)
-
Impute based on regression analysis (accounts for MAR data, but is optimistic because the regression error term is not carried forward)
-
Stochastic regression imputation (like above but appropriately adds uncertainty)
-
Hot deck imputation (non-parametric approach based on matching partial and complete cases)
-
Cold deck (like above, but matched to external data)
-
Carry forward/carry backward (for longitudinal data with relatively stable characteristics)
-
Interpolation/extrapolation (for longitudinal trends, usually assumes linearity)
-
Worst-case analysis (commonly used for outcomes, e.g. missing data are replaced with the “worst” value under NI assumption)
5. Multiple imputation relies on regression models to predict the missingness and missing values, and incorporates uncertainty through an iterative approach. Key advantages over a complete case analysis are that it preserves N without introducing bias if data are MAR, and provides correct SEs that reflect the uncertainty due to missing values.
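The trade-offs among the simpler options can be seen in a small simulation. The sketch below is illustrative only: the data-generating model (y = 2 + 1.5x + noise), the MAR rule that makes y missing more often when x > 0, and the function names are all assumptions. It reports the mean and variance of y after a complete case analysis, mean imputation, regression imputation, and stochastic regression imputation.

```python
import random
from statistics import mean, variance

def fit_line(xs, ys):
    """Least-squares fit of y on x; returns (intercept, slope, residual SD)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (rss / (len(xs) - 2)) ** 0.5

def compare_single_imputations(n=5000, seed=2):
    """Return {strategy: (mean of y, variance of y)} for simulated MAR data."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [2 + 1.5 * x + rng.gauss(0, 1) for x in xs]   # true mean 2, variance 3.25
    # MAR: y is missing more often when the fully observed x is positive.
    gone = [rng.random() < (0.7 if x > 0 else 0.1) for x in xs]
    obs_x = [x for x, g in zip(xs, gone) if not g]
    obs_y = [y for y, g in zip(ys, gone) if not g]
    a, b, s = fit_line(obs_x, obs_y)
    y_bar = mean(obs_y)
    filled = {
        "complete case": obs_y,
        "mean imputation": [y_bar if g else y for y, g in zip(ys, gone)],
        "regression imputation": [
            a + b * x if g else y for x, y, g in zip(xs, ys, gone)
        ],
        "stochastic regression": [
            a + b * x + rng.gauss(0, s) if g else y
            for x, y, g in zip(xs, ys, gone)
        ],
    }
    return {name: (mean(v), variance(v)) for name, v in filled.items()}

for name, (m, v) in compare_single_imputations().items():
    print(f"{name:24s} mean={m:.2f} variance={v:.2f}")
```

In this setup the complete case and mean imputation estimates of the mean are biased because the data are MAR rather than MCAR, regression imputation recovers the mean but understates the variance, and adding the residual error back restores the variance. None of these single-imputation variants reflects uncertainty in the imputation model itself.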
Tips for implementing multiple imputation
-
Input variables to include: any that predict whether data are missing as well as variables that are correlated with the value of the missing data. Often this includes exposure, covariates, outcome, and other available data on study administration or on proxies for the variable with missing data
-
Consider transformations to improve normality of variables with missing data or to enforce restrictions (e.g. log-transformation to force positive values only)
-
Include interactions or nonlinear forms if they improve the models predicting missingness or missing values
-
Diminishing returns make 5-10 imputed datasets sufficient in most situations (but some recommend as few as 3 or as many as 20)
-
Set a seed number in order to get reproducible results (otherwise, results will vary slightly from one run to the next)
-
Make sure data are logically consistent after MI (avoid “impossible” combinations e.g. never-smokers with a non-zero value for pack-years)
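Several of these tips (the number of imputed datasets, setting a seed) can be illustrated with a bare-bones version of the procedure. The sketch below is not how any particular package implements multiple imputation: it assumes a single incomplete variable y, a single complete predictor x, a linear imputation model, and a bootstrap of the complete cases as a simple way to vary the imputation model across datasets. The function names and simulated data are hypothetical. The estimates are pooled with Rubin's rules, where the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance.

```python
import random
from statistics import mean, variance

def fit_line(xs, ys):
    """Least-squares fit of y on x; returns (intercept, slope, residual SD)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (rss / (len(xs) - 2)) ** 0.5

def make_example_data(n=2000, seed=7):
    """Simulated data (an assumption): y is None more often when x > 0."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [2 + 1.5 * x + rng.gauss(0, 1) for x in xs]
    ys = [None if rng.random() < (0.7 if x > 0 else 0.1) else y
          for x, y in zip(xs, ys)]
    return xs, ys

def multiply_impute_mean(xs, ys, m=10, seed=2024):
    """Pooled estimate and SE of the mean of y across m imputed datasets."""
    rng = random.Random(seed)          # fixed seed makes the run reproducible
    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    estimates, within = [], []
    for _ in range(m):
        # Refit the imputation model on a bootstrap sample of complete cases.
        boot = [rng.choice(complete) for _ in complete]
        a, b, s = fit_line([x for x, _ in boot], [y for _, y in boot])
        # Fill each missing y with a prediction plus random residual error.
        filled = [y if y is not None else a + b * x + rng.gauss(0, s)
                  for x, y in zip(xs, ys)]
        estimates.append(mean(filled))
        within.append(variance(filled) / len(filled))
    # Rubin's rules: total variance = within + (1 + 1/m) * between.
    total = mean(within) + (1 + 1 / m) * variance(estimates)
    return mean(estimates), total ** 0.5

xs, ys = make_example_data()
estimate, se = multiply_impute_mean(xs, ys, m=10, seed=2024)
print(f"pooled mean={estimate:.2f} SE={se:.3f}")
```

Calling `multiply_impute_mean` twice with the same seed returns identical results, while changing the seed shifts the pooled estimate slightly, which is the run-to-run variation the seed tip refers to.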
Readings
Textbooks & Chapters
Allison, P.D. (2002) Missing Data. Sage Publications.
A nice brief text that builds up to multiple imputation and includes strategies for maximum likelihood approaches and for working with informative missing data
Little, R.J.A. and Rubin, D.B. (1987) Statistical Analysis with Missing Data. J. Wiley & Sons, New York.
Rubin, D.B. (1987) Multiple Imputation for Nonresponse in Surveys. J. Wiley & Sons, New York.
Schafer, J.L. (1997) Analysis of Incomplete Multivariate Data. Chapman & Hall, London.
Gelman, A. and Hill, J. (2007) Ch 25: Missing-data imputation in Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, New York.
https://publicifsv.sund.ku.dk/~nk/epiF14/Glymour_DAGs.pdf
Methodological Articles
Use of multiple imputation in the epidemiologic literature
Author(s): MA Klebanoff, SR Cole
Journal: American journal of epidemiology
Year published: 2008
What do we do with missing data? Some options for analysis of incomplete data
Author(s): TE Raghunathan
Journal: Annu Rev Public Health
Year published: 2004
Author(s): GJ van der Heijden, AR Donders, T Stijnen, KG Moons
Journal: J Clin Epidemiol
Year published: 2006
Author(s): JA Sterne, IR White, JB Carlin, M Spratt, P Royston, MG Kenward, AM Wood, JR Carpenter
Journal: BMJ
Year published: 2009
Author(s): PD Faris, WA Ghali, R Brant, CM Norris, PD Galbraith, ML Knudtson
Journal: J Clin Epidemiol
Year published: 2002
Software/Programming Articles
Author(s): RM Yucel
Journal: J Stat Software
Year published: 2011
Author(s): NJ Horton, K Kleinman
Journal: Am Stat
Year published: 2007
Application Articles
Association of black carbon with cognition among children in a prospective birth cohort study
Author(s): SF Suglia, A Gryparis, RO Wright, J Schwartz, RJ Wright
Journal: Am J Epidemiol
Year published: 2008
Survival associated with two sets of diagnostic criteria for congestive heart failure
Author(s): GD Schellenbaum, TD Rea, SR Heckbert, NL Smith, T Lumley, VL Roger, et al.
Journal: Am J Epidemiol
Year published: 2004
Early-life and adult socioeconomic status and inflammatory risk markers in adulthood
Author(s): RA Pollitt, JS Kaufman, KM Rose, AV Diez-Roux, D Zeng, G Heiss
Journal: Eur J Epidemiol
Year published: 2007
Author(s): N Krieger, JT Chen, JH Ware, A Kaddour
Journal: Cancer Causes Control
Year published: 2008
Author(s): GS Lovasi, JW Quinn, VA Rauh, FP Perera, HF Andrews, R Garfinkel, L Hoepner, R Whyatt, A Rundle
Journal: Am J Public Health
Year published: 2011
Software
https://stefvanbuuren.name/fimd/
Description: All standard statistical programs can be used to implement missing data techniques, though some allow for more sophisticated techniques than others. We recommend this webpage by Stef van Buuren, which provides an annotated list of the software and packages that can be used to implement missing data techniques.
Websites
Statistical Computing Seminars: Multiple Imputation in Stata, Part 1
Website overview: This webpage is hosted by UCLA’s Institute for Digital Research and Education. This particular page is the first of a two part series on implementing multiple imputation techniques in Stata.
Website overview: This website is solely devoted to missing data. It has information on courses, books and workshops, as well as discussion groups and other helpful tips on how to address missing data.
Website overview: This website is a companion to the book “Flexible Imputation of Missing Data” by Stef Van Buuren. This website contains an overview, course materials as well as helpful information for implementing missing data techniques in numerous software packages such as R, Stata, S-Plus, SAS and SPSS.
http://support.sas.com/rnd/app/da.html
http://cran.r-project.org/web/packages/mi/index.html
http://cran.r-project.org/web/packages/mitools/index.html
Courses
Missing Data and Multiple Imputation
Host/program: The Epidemiology and Population Health Summer Institute at Columbia University (EPIC)
Software used: SAS and Stata