Evaluation of missing value handling methods in machine learning for emergency department mortality prediction

Narawish Kophimai
Krisanarach Nitisiri
Pariwat Phugoen
Kanchana Sethanan
Kuo-Jui Wu

Abstract

Missing data remains a significant challenge in emergency medicine, particularly in mortality prediction models. This study investigates five distinct missing value handling methods applied to various machine learning algorithms using a dataset of 331,151 emergency department records from a Thai hospital (2016–2021). The study evaluates complete case analysis, zero imputation, mean imputation, k-Nearest Neighbors (kNN) imputation, and MissForest, combined with logistic regression, decision tree, random forest, Light Gradient Boosting Machine (LightGBM), and Extreme Gradient Boosting (XGBoost). The results indicate that XGBoost with zero imputation delivers the best performance, achieving an accuracy of 0.8659, precision of 0.8726, recall of 0.8659, F1-score of 0.8681, and an AUC ranging from 0.8848 to 0.9947 across eight prediction classes. Furthermore, tree-based models demonstrated greater stability across different missing value handling methods, whereas linear models were more sensitive to imputation techniques. These findings suggest that strategic selection of missing data handling approaches can significantly enhance the reliability of mortality predictions in emergency care settings.
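
As a rough illustration of four of the five handling methods named above (complete case analysis, zero imputation, mean imputation, and kNN imputation), the following minimal sketch applies each to a tiny, made-up set of records. The feature names, the values, the helper function names, and the simplified kNN distance (mean squared difference over features observed in both rows) are assumptions for illustration only and are not taken from the study. MissForest is omitted because it requires a random forest implementation.

```python
from math import sqrt
from statistics import mean

# Hypothetical toy records: [heart_rate, systolic_bp, spo2].
# None marks a missing value. These numbers are invented for illustration.
records = [
    [88.0, 120.0, 98.0],
    [None, 95.0, 91.0],
    [110.0, None, 94.0],
    [72.0, 130.0, None],
    [95.0, 110.0, 96.0],
]


def complete_cases(rows):
    """Complete case analysis: keep only rows with no missing value."""
    return [row[:] for row in rows if None not in row]


def zero_impute(rows):
    """Zero imputation: replace every missing value with 0.0."""
    return [[0.0 if v is None else v for v in row] for row in rows]


def mean_impute(rows):
    """Mean imputation: replace a missing value with its column's observed mean."""
    means = [mean(v for v in col if v is not None) for col in zip(*rows)]
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


def knn_impute(rows, k=2):
    """Simplified kNN imputation.

    Each gap is filled with the mean of that feature over the k nearest rows
    that observe it. Distance uses only features observed in both rows.
    Assumes every feature is observed in at least one other row.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return float("inf")
        return sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = []
    for i, row in enumerate(rows):
        new_row = row[:]
        for j, value in enumerate(row):
            if value is None:
                # Candidate donors: other rows where feature j is observed.
                donors = sorted(
                    (distance(row, other), other[j])
                    for m, other in enumerate(rows)
                    if m != i and other[j] is not None
                )
                new_row[j] = mean(val for _, val in donors[:k])
        filled.append(new_row)
    return filled


if __name__ == "__main__":
    print("complete cases:", complete_cases(records))
    print("zero imputed:  ", zero_impute(records))
    print("mean imputed:  ", mean_impute(records))
    print("kNN imputed:   ", knn_impute(records, k=2))
```

Complete case analysis shrinks the dataset, whereas the three imputation variants keep every row and differ only in the value written into each gap. In practice a library implementation would normally be used instead of hand-written helpers like these.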

Article Details

How to Cite
Kophimai, N., Nitisiri, K., Phugoen, P., Sethanan, K., & Wu, K.-J. (2025). Evaluation of missing value handling methods in machine learning for emergency department mortality prediction. Engineering and Applied Science Research, 52(5), 532–540. Retrieved from https://ph01.tci-thaijo.org/index.php/easr/article/view/261515
Section
ORIGINAL RESEARCH
