Missing Data Imputation Based on Accuracy of Binary Classification

Jumlong Vongprasert


Published: Jan 26, 2021

Keywords:

Missing Data Imputation, Binary Classification

Jumlong Vongprasert

Applied Statistics Department, Faculty of Science, Ubon Ratchathani Rajabhat University, Ubon Ratchathani

Abstract

The purpose of this study was to compare the accuracy of binary classification under five missing data imputation methods: Support Vector Machines (SVM), Neural Networks (NN), Random Forests (RF), Multiple Imputation (MI), and Bagged Tree Imputation (BTI). Three data sets were used: 1) one with 7 categorical and 9 continuous independent variables, 2) one with 9 categorical independent variables, and 3) one with 9 continuous independent variables. The comparisons were made across the following conditions: 1) the three data sets; 2) three types of missing data: Missing Completely at Random (MCAR), Missing at Random (MAR), and Not Missing at Random (NMAR); and 3) six percentages of missing data (5, 10, 15, 20, 25, and 30). We analyzed which imputation method most influences the classifiers' accuracy. Overall, the best imputations were obtained using RF and SVM. Under MAR and MCAR the best imputations were obtained using SVM, and under NMAR they were obtained using RF.
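
As an illustration of the three types of missing data named in the abstract, the following minimal Python sketch marks a chosen percentage of a variable's entries as missing. The function names (`missing_mask`, `rank_weights`), the `aux` argument, and the rank-based weighting are hypothetical choices made here for illustration only; the abstract does not state how missingness was generated in the study.

```python
import random


def rank_weights(values):
    """Weight each position by the rank of its value (smallest value -> 1)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    weights = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        weights[i] = float(rank)
    return weights


def missing_mask(x, rate, mechanism, aux=None, seed=0):
    """Return a list of booleans; True marks entries of x to treat as missing.

    rate is a fraction, e.g. 0.05 for 5% missing data.
    """
    rng = random.Random(seed)
    n = len(x)
    k = round(rate * n)  # number of entries to mask

    if mechanism == "MCAR":
        # Every entry has the same chance of being masked.
        weights = [1.0] * n
    elif mechanism == "MAR":
        # Chance of masking depends on another, fully observed variable.
        if aux is None:
            raise ValueError("MAR needs an observed auxiliary variable `aux`")
        weights = rank_weights(aux)
    elif mechanism == "NMAR":
        # Chance of masking depends on the value that goes missing itself.
        # (Assumption: larger values are more likely to be masked.)
        weights = rank_weights(x)
    else:
        raise ValueError(f"unknown mechanism: {mechanism!r}")

    # Weighted sampling without replacement (Efraimidis-Spirakis keys):
    # draw u ~ U(0, 1), set key = u ** (1 / w), keep the k largest keys.
    keys = [rng.random() ** (1.0 / w) for w in weights]
    ranked = sorted(range(n), key=lambda i: keys[i], reverse=True)
    chosen = set(ranked[:k])
    return [i in chosen for i in range(n)]
```

Calling `missing_mask(x, 0.05, "MCAR")` through `missing_mask(x, 0.30, "NMAR")` would cover the six percentages listed above. Any of the compared imputation methods could then be applied to the masked entries before fitting the classifier.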

Issue

Vol. 31 No. 1 (2021): January-March, 2021

Section

Applied Science Research Articles

The articles published are the opinion of the author only. The author is responsible for any legal consequences that may arise from the article.

