Comparison of Efficiency for Imbalanced Data Classification via Simulation

Kantana La-orsirikul
Prapasiri Ratchaprapapornkul
Surasak Kao-Iean

Abstract

This research uses simulation to compare the efficiency of two approaches to handling class imbalance, oversampling and hybrid methods, and to compare the performance of three classification techniques: random forest, logistic regression, and support vector machine. The simulated data are highly imbalanced, and most of the predictor variables are categorical. The simulation factors are the sample size, the ratio of categorical to continuous predictor variables, and the odds ratio. The results show that balancing the data with oversampling before classification gave higher accuracy, sensitivity, and specificity than the hybrid method at every sample size. In addition, random forest applied to the balanced data had the highest accuracy, sensitivity, and specificity, averaging 0.996, 0.999, and 0.998, respectively. Moreover, logistic regression classified less accurately as the number of categorical variables increased. The results can serve as a guideline for choosing a data balancing method appropriate to the data conditions encountered in practice.
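To make the balancing step and the three evaluation criteria concrete, here is a minimal Python sketch. It is illustrative only: the abstract does not say which oversampling variant the authors used, so plain random oversampling (duplicating minority rows) is an assumption, and the function names `random_oversample` and `classification_metrics` are hypothetical. The metrics follow the standard confusion-matrix definitions, with the minority class treated as the positive class.

```python
import random


def random_oversample(rows, labels, minority=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes are equal in size.

    Assumption: simple random oversampling with replacement; the study may
    have used a different oversampling technique.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    if not minority_idx:
        raise ValueError("no minority-class rows to resample")
    # Number of extra minority rows needed to match the majority count.
    deficit = max(0, len(majority_idx) - len(minority_idx))
    extra = [rng.choice(minority_idx) for _ in range(deficit)]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity, and specificity from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
    }


if __name__ == "__main__":
    # Toy data: nine majority rows (label 0) and one minority row (label 1).
    rows = [[i] for i in range(10)]
    labels = [0] * 9 + [1]
    balanced_rows, balanced_labels = random_oversample(rows, labels)
    print(balanced_labels.count(0), balanced_labels.count(1))

    # Toy predictions, only to show how the three criteria are computed.
    print(classification_metrics([1, 1, 0, 0, 0], [1, 0, 0, 0, 1]))
```

On highly imbalanced data, accuracy alone can look high even when the minority class is rarely detected, which is why sensitivity and specificity are reported alongside it. In a real evaluation the oversampling should be applied to the training data only, so that duplicated rows do not leak into the test set.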

Article Details

Section
Research Article
