Missing data is a common problem across engineering fields; it often introduces bias and sparseness that impede rigorous data analysis. Many imputation methods have been proposed and are widely used to address this problem. However, prior methods often require distributional assumptions and prior knowledge of the data, which can be difficult to satisfy in engineering research. In contrast, fractional hot-deck imputation (FHDI) is an assumption-free imputation method with broad applicability in engineering domains. FHDI's internal parameters and its impact on statistical and machine learning methods, however, remain poorly understood. This study therefore investigates the behavior and impact of FHDI on prediction methods, including the generalized additive model, support vector machine, extremely randomized trees, and artificial neural network, using four practical datasets (appliance energy, air quality, phenotypes, and weather). Results show that FHDI improves prediction accuracy more than a simple naive method that fills in missing data with the mean value of each attribute, and that FHDI has an asymptotically positive effect on prediction accuracy as response rates decrease. As for optimal settings, 30 to 35 is recommended for FHDI's internal categorization number and 5 for the number of FHDI donors, which is consistent with Rubin's recommendation.
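To make the comparison in the abstract concrete, the following is a minimal sketch of the two ideas, not the authors' implementation. `mean_impute` mirrors the naive baseline (column means). `toy_fractional_hot_deck` is a heavily simplified, hypothetical illustration of the hot-deck idea: it assumes a single fully observed covariate `x`, discretizes it into `n_categories` quantile cells, and fills each missing `y` with an equally weighted combination of up to `n_donors` observed values from the same cell. The function names, the equal fractional weights, and the single-covariate setting are all assumptions made for illustration; the actual FHDI procedure is more involved.

```python
import numpy as np


def mean_impute(X):
    """Naive baseline: replace each NaN with the mean of its column's observed values."""
    X = np.asarray(X, dtype=float)
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)


def toy_fractional_hot_deck(x, y, n_categories=3, n_donors=5, seed=0):
    """Toy hot-deck sketch (hypothetical, NOT the full FHDI algorithm).

    Assumes x is fully observed and y has at least one observed value.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float).copy()

    # Discretize x into cells using interior quantile edges.
    probs = np.linspace(0.0, 1.0, n_categories + 1)[1:-1]
    edges = np.quantile(x, probs)
    cells = np.digitize(x, edges)

    # Mask of originally missing responses; donors are drawn only from observed ones.
    missing = np.isnan(y)
    for i in np.flatnonzero(missing):
        donors = np.flatnonzero((cells == cells[i]) & ~missing)
        if donors.size == 0:
            # Assumed fallback: use all observed units when the cell has no donor.
            donors = np.flatnonzero(~missing)
        k = min(n_donors, donors.size)
        chosen = rng.choice(donors, size=k, replace=False)
        # Equal fractional weights 1/k over the selected donors (an assumption).
        weights = np.full(k, 1.0 / k)
        y[i] = np.dot(weights, y[chosen])
    return y
```

For example, `toy_fractional_hot_deck(x, y, n_categories=30, n_donors=5)` would loosely correspond to the settings recommended in the abstract, although the recommendation itself was derived for the real FHDI method rather than this sketch.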
|Number of pages||11|
|Journal||IEEE Transactions on Knowledge and Data Engineering|
|Publication status||Published - 2020 Dec 1|
Bibliographical note
Funding Information:
This research is supported by research funding from the Department of Civil, Construction, and Environmental Engineering of Iowa State University. The parallel computing research reported herein is partially supported by the HPC@ISU equipment at ISU, some of which has been purchased through funding provided by the US National Science Foundation under MRI grant number CNS 1229081 and CRI grant number 1205413. I. Cho’s research is also supported by the US National Science Foundation under grant CBET-1605275, and J. Im’s research is supported by the National Research Foundation (NRF) Korea, NRF-2018R1D1A1B07045220. The data sharing of Dr. Lawrence and Dr. Cetin is appreciated.
© 1989-2012 IEEE.
All Science Journal Classification (ASJC) codes
- Information Systems
- Computer Science Applications
- Computational Theory and Mathematics