University of Warwick
Publications service & WRAP

Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study


Marshall, A. (Andrea), Altman, Douglas G., Royston, Patrick and Holder, Roger L. (2010) Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study. BMC Medical Research Methodology, Vol.10 (Article 7). ISSN 1471-2288

Text: WRAP_Marshall_Comparison_Techniques.pdf - Draft Version (1248Kb)
Official URL: http://dx.doi.org/10.1186/1471-2288-10-7

Abstract

Background: There is no consensus on the most appropriate approach to handling missing covariate data within prognostic modelling studies. A simulation study was therefore performed to assess the effects of different missing data techniques on the performance of a prognostic model.

Methods: Datasets were generated to resemble the skewed distributions seen in a motivating breast cancer example. Multivariate missing data were imposed on four covariates using four different mechanisms: missing completely at random (MCAR), missing at random (MAR), missing not at random (MNAR) and a combination of all three mechanisms. Five amounts of incomplete cases from 5% to 75% were considered. Complete case analysis (CC), single imputation (SI) and five multiple imputation (MI) techniques available within the R statistical software were investigated: a) data augmentation (DA) approach assuming a multivariate normal distribution, b) DA assuming a general location model, c) regression switching imputation, d) regression switching with predictive mean matching (MICE-PMM) and e) flexible additive imputation models. A Cox proportional hazards model was fitted and appropriate estimates for the regression coefficients and model performance measures were obtained.

Results: Performing a CC analysis produced unbiased regression estimates but inflated standard errors, which affected the significance of the covariates in the model with 25% or more missingness. Using SI underestimated the variability, resulting in poor coverage even with 10% missingness. Of the MI approaches, applying MICE-PMM produced, in general, the least biased estimates and better coverage for the incomplete covariates, and better model performance for all mechanisms. However, this MI approach still produced biased regression coefficient estimates for the incomplete skewed continuous covariates when 50% or more cases had missing data imposed with an MCAR, MAR or combined mechanism. When the missingness depended on the incomplete covariates, i.e. MNAR, estimates were biased with more than 10% incomplete cases for all MI approaches.

Conclusion: The results from this simulation study suggest that performing MICE-PMM may be the preferred MI approach provided that less than 50% of the cases have missing data and the missing data are not MNAR.
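
The three missingness mechanisms named in the Methods can be illustrated with a minimal Python sketch. This is not the authors' simulation design: the log-normal covariate `x`, the fully observed covariate `z`, the logistic missingness models and every function name below are assumptions chosen for illustration. The sketch also reports only the mean of the observed values of `x`, which is a much cruder summary than the Cox regression coefficients assessed in the study.

```python
import math
import random


def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))


def simulate_covariates(n, rng):
    """Draw (x, z) pairs: x is skewed (log-normal) and will be made
    incomplete; z is correlated with x and stays fully observed."""
    data = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        x = math.exp(0.5 * z + rng.gauss(0.0, 0.5))
        data.append((x, z))
    return data


def missing_probability(x, z, mechanism, rate):
    """Probability that x is missing for one case (illustrative choices).

    The logistic terms average 0.5 here, so each mechanism gives an overall
    missing fraction of about `rate`; this requires rate <= 0.5.
    """
    if mechanism == "MCAR":  # unrelated to any data
        return rate
    if mechanism == "MAR":   # depends only on the observed covariate z
        return 2.0 * rate * logistic(2.0 * z)
    if mechanism == "MNAR":  # depends on the value of x itself
        return 2.0 * rate * logistic(2.0 * math.log(x))
    raise ValueError(f"unknown mechanism: {mechanism}")


def impose_missingness(data, mechanism, rate, rng):
    """Return the x values, with None wherever a value was set to missing."""
    return [
        None if rng.random() < missing_probability(x, z, mechanism, rate) else x
        for x, z in data
    ]


def summarise(data, rate=0.25, seed=2010):
    """Map each mechanism to (fraction missing, mean of the observed x)."""
    rng = random.Random(seed)
    full_mean = sum(x for x, _ in data) / len(data)
    summary = {"full": (0.0, full_mean)}
    for mechanism in ("MCAR", "MAR", "MNAR"):
        xs = impose_missingness(data, mechanism, rate, rng)
        observed = [x for x in xs if x is not None]
        frac_missing = 1.0 - len(observed) / len(xs)
        summary[mechanism] = (frac_missing, sum(observed) / len(observed))
    return summary


data = simulate_covariates(20000, random.Random(7))
summary = summarise(data)
for name, (frac, mean_obs) in summary.items():
    print(f"{name:5s} missing={frac:.3f}  mean of observed x={mean_obs:.3f}")
```

Under these assumed settings the observed mean under MCAR should stay close to the full-data mean, while under MAR and MNAR it is pulled downwards because larger values of `x` are more likely to be missing. The difference between the two is that MAR missingness can be explained by the observed `z`, whereas MNAR missingness depends on the unobserved value itself.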

Item Type: Journal Article
Subjects: Q Science > QA Mathematics
R Medicine > RC Internal medicine
Divisions: Faculty of Medicine > Warwick Medical School > Health Sciences
Faculty of Medicine > Warwick Medical School
Library of Congress Subject Headings (LCSH): Analysis of covariance, Prognosis -- Mathematical models, Missing observations (Statistics), Regression analysis -- Mathematical models
Journal or Publication Title: BMC Medical Research Methodology
Publisher: BioMed Central Ltd.
ISSN: 1471-2288
Date: 19 January 2010
Volume: Vol.10
Number: Article 7
Identification Number: 10.1186/1471-2288-10-7
Status: Peer Reviewed
Publication Status: Published
Access rights to Published version: Open Access
Funder: Cancer Research UK (CRUK), Medical Research Council (Great Britain) (MRC)
URI: http://wrap.warwick.ac.uk/id/eprint/3001
