How to Decide Which Imputation Method to Use? Our Vote for Realistic Simulation Comparisons
Maria Thurow¹, Markus Pauly*¹
Abstract
In statistical survey analysis, (partial) non-response is an unavoidable part of data acquisition. Handling missing values during data preparation and data analysis is therefore a non-trivial task for which many different approaches exist. But how should one decide which to choose?
This talk first explains why it is important to have (many) realistic data sets available (for a particular context) to allow an appropriate and objective comparison of methods. Sometimes, however, not enough suitable benchmark data sets are available, e.g. due to privacy issues. Simulation studies are a potential solution, but it is not always clear which simulation models are suitable for generating realistic data. One challenge is that potentially unrealistic assumptions have to be made about the underlying distributions. Focusing on the introductory question of which imputation method to choose, we propose some ways to answer it by means of simulation studies.
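To make the idea of a simulation-based comparison concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the procedure proposed in the talk: data are resampled from a complete reference sample instead of a parametric model, values are deleted completely at random (MCAR), and two deliberately simple imputation rules are compared by how well the variance is preserved. All function names (`ampute_mcar`, `impute_mean`, `impute_random_draw`, `run_simulation`) are hypothetical, and the reference sample in the usage lines is a synthetic stand-in for a real data set.

```python
import numpy as np

def ampute_mcar(x, rate, rng):
    """Delete each entry independently with probability `rate` (MCAR assumption)."""
    mask = rng.random(x.shape) < rate
    x_miss = x.copy()
    x_miss[mask] = np.nan
    return x_miss, mask

def impute_mean(x_miss):
    """Replace every missing entry by the mean of the observed entries."""
    x_imp = x_miss.copy()
    x_imp[np.isnan(x_miss)] = np.nanmean(x_miss)
    return x_imp

def impute_random_draw(x_miss, rng):
    """Replace every missing entry by a random draw from the observed entries."""
    x_imp = x_miss.copy()
    missing = np.isnan(x_miss)
    x_imp[missing] = rng.choice(x_miss[~missing], size=missing.sum())
    return x_imp

def run_simulation(reference, rate=0.3, n_rep=200, seed=0):
    """Average ratio var(imputed) / var(complete) per method over n_rep repetitions."""
    rng = np.random.default_rng(seed)
    ratios = {"mean": [], "random_draw": []}
    for _ in range(n_rep):
        # Assumption: bootstrap from the reference sample instead of
        # drawing from an assumed parametric distribution.
        sample = rng.choice(reference, size=reference.size, replace=True)
        x_miss, _ = ampute_mcar(sample, rate, rng)
        ratios["mean"].append(np.var(impute_mean(x_miss)) / np.var(sample))
        ratios["random_draw"].append(
            np.var(impute_random_draw(x_miss, rng)) / np.var(sample)
        )
    return {name: float(np.mean(vals)) for name, vals in ratios.items()}

# Usage: a skewed synthetic sample standing in for a real reference data set.
reference = np.random.default_rng(1).gamma(2.0, size=500)
print(run_simulation(reference))
```

A ratio close to 1 means the variance estimate after imputation is close to the one from the complete data; mean imputation should shrink it noticeably, which hints at why accuracy on single values and accuracy of later parameter estimates need to be assessed separately.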
In particular, we investigate various imputation methods, including modern machine learning approaches, with regard to their imputation accuracy and its impact on parameter estimates in the analysis phase after imputation. Since there is no single agreed-upon measure of imputation accuracy in theory or practice, we study several such measures: beyond the most common ones, the normalized root mean squared error (NRMSE) and the proportion of false classifications (PFC), we put a special focus on (distribution) distance measures. The aim is to deliver guidelines for correctly assessing distributional accuracy after imputation and its potential effect on inference and parameter estimates.
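The three kinds of accuracy measures mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the definitions used in the talk: the NRMSE here is normalized by the variance of the true values at the imputed positions (one of several conventions in use), and the two-sample Kolmogorov-Smirnov statistic is chosen only as one example of a distribution distance. The function names are hypothetical.

```python
import numpy as np

def nrmse(true, imputed, mask):
    """Root mean squared error over imputed entries, normalized by the
    variance of the true values at those entries (assumed convention)."""
    diff = true[mask] - imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2) / np.var(true[mask])))

def pfc(true, imputed, mask):
    """Share of imputed categorical entries that differ from the true category."""
    return float(np.mean(true[mask] != imputed[mask]))

def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute gap
    between the empirical distribution functions of a and b."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / a.size
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

# Usage on synthetic data: mean imputation of a numeric variable.
rng = np.random.default_rng(0)
x_true = rng.normal(size=1000)
mask = rng.random(1000) < 0.3            # entries treated as missing
x_imp = x_true.copy()
x_imp[mask] = x_true[~mask].mean()       # impute with the observed mean

print("NRMSE:", nrmse(x_true, x_imp, mask))
print("KS distance:", ks_distance(x_true, x_imp))
```

NRMSE and PFC compare imputed values with the true values entry by entry, whereas a distance such as the Kolmogorov-Smirnov statistic compares the whole distribution before and after imputation, so the two kinds of measures can rank the same imputation methods differently.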
1: TU Dortmund University, Research Center Trustworthy Data Science and Security - Germany