Missing Data Imputation Based on Accuracy of Binary Classification

Jumlong Vongprasert

Abstract

This study compared the accuracy of binary classification after missing data were imputed with five methods: Support Vector Machines (SVM), Neural Networks (NN), Random Forests (RF), Multiple Imputation (MI) and Bagged Tree Imputation (BTI). Three data sets were used, containing 1) 7 categorical and 9 continuous independent variables, 2) 9 categorical independent variables and 3) 9 continuous independent variables. The comparisons covered the following conditions: 1) the three data sets; 2) three types of missing data: Missing Completely at Random (MCAR), Missing at Random (MAR) and Not Missing at Random (NMAR); and 3) six percentages of missing data (5, 10, 15, 20, 25 and 30). We analyzed which imputation method most influences the classifiers’ accuracy. Overall, the best imputations were obtained using RF and SVM; under MCAR and MAR the best imputations were obtained using SVM, and under NMAR they were obtained using RF.
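The three missingness mechanisms and six missing-data percentages named in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how missing values were generated, so everything here is an assumption made for illustration: the function name make_missing, the column names x and y, and the simplified rule that MAR and NMAR delete the rows with the largest values of the driving variable.

```python
import random


def make_missing(rows, target, mechanism, percent, driver=None, seed=0):
    """Return a copy of `rows` (a list of dicts) in which `percent` percent
    of the `target` values are replaced by None.

    Illustrative assumptions, not the study's procedure:
      MCAR: rows are chosen uniformly at random.
      MAR:  rows with the largest values of a fully observed `driver`
            column lose their `target` value.
      NMAR: rows with the largest `target` values lose that value,
            so missingness depends on the unobserved value itself.
    """
    n = len(rows)
    k = n * percent // 100  # number of values to delete
    if mechanism == "MCAR":
        chosen = set(random.Random(seed).sample(range(n), k))
    elif mechanism == "MAR":
        if driver is None:
            raise ValueError("MAR needs a fully observed driver column")
        order = sorted(range(n), key=lambda i: rows[i][driver], reverse=True)
        chosen = set(order[:k])
    elif mechanism == "NMAR":
        order = sorted(range(n), key=lambda i: rows[i][target], reverse=True)
        chosen = set(order[:k])
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    return [
        dict(row, **{target: None}) if i in chosen else dict(row)
        for i, row in enumerate(rows)
    ]


# Hypothetical toy data: y depends on x plus noise.
rng = random.Random(42)
data = []
for _ in range(100):
    x = rng.gauss(0.0, 1.0)
    data.append({"x": x, "y": x + rng.gauss(0.0, 0.5)})

# The grid of conditions from the abstract: 3 mechanisms x 6 percentages.
for mechanism in ("MCAR", "MAR", "NMAR"):
    for percent in (5, 10, 15, 20, 25, 30):
        damaged = make_missing(data, "y", mechanism, percent, driver="x")
        n_missing = sum(row["y"] is None for row in damaged)
        print(mechanism, percent, n_missing)
```

Each damaged copy would then be passed to an imputation method and a binary classifier, and the resulting accuracies compared across the grid. A probabilistic deletion rule (for example, a missingness probability that increases with the driver) would be a more realistic choice than the hard cut-off used here.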

Section
Applied Science Research Articles
