Application of Binary Whale Optimization Algorithm for Solving Imbalanced Data Problems


Jakkrit Polrob
Benjawan Rodjanadid
Jessada Tanthanuch
Eckart Schulz

Abstract

This research aims to develop a novel undersampling algorithm that combines ideas from the whale and binary whale optimization algorithms with K-nearest neighbor classification, in order to solve imbalanced data problems. Twelve datasets with imbalance ratios ranging from 1.82 to 42.01 were selected from the Knowledge Extraction based on Evolutionary Learning (KEEL) repository and the imbalanced-learn repository to evaluate the novel algorithm. Each dataset was first split into two parts, a training set and a testing set. Whereas the minority class of each training set remained untouched, its majority class was processed by the proposed algorithm, with the parameter of the K-nearest neighbor classifier fixed at K = 1, to obtain an optimal representative subset of the majority class. A support vector machine classifier was then trained on the reduced training set for performance assessment. The proposed algorithm showed the best overall performance when compared with three other undersampling methods, namely the random undersampling, cluster centroid, and near-miss algorithms, with the following average performance measures: Accuracy = 0.8596, F1 score = 0.6255, G-mean = 0.8941, AUROC = 0.9363, AUPRC = 0.6978, Sensitivity = 0.9444, Precision = 0.5271, MCC = 0.6204, and Kappa = 0.5695.
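The sketch below (not the authors' code) illustrates the baseline evaluation pipeline described above using scikit-learn and imbalanced-learn: the training split is undersampled with the random undersampling, cluster centroids, and near-miss methods, a support vector machine is trained on the reduced set, and the reported metrics are computed on the untouched test set. The choice of the "ecoli" benchmark dataset, the 80/20 split, the RBF-kernel SVM, and the NearMiss version are illustrative assumptions, and the proposed BWOA-based subset selector itself is not reproduced here.

```python
# Minimal sketch of the comparison pipeline, under the assumptions stated above.
from imblearn.datasets import fetch_datasets
from imblearn.under_sampling import RandomUnderSampler, ClusterCentroids, NearMiss
from imblearn.metrics import geometric_mean_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import (accuracy_score, f1_score, roc_auc_score,
                             average_precision_score, recall_score,
                             precision_score, matthews_corrcoef, cohen_kappa_score)

# "ecoli" is one of the imbalanced-learn benchmark sets; its minority class
# is expected to carry the label 1 (the default positive label below).
data = fetch_datasets()["ecoli"]
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, stratify=data.target, random_state=0)

samplers = {
    "random undersampling": RandomUnderSampler(random_state=0),
    "cluster centroids": ClusterCentroids(random_state=0),
    "near-miss": NearMiss(version=1),
}

for name, sampler in samplers.items():
    # Reduce only the training set; the testing set stays untouched.
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    clf = SVC(kernel="rbf").fit(X_res, y_res)
    y_pred = clf.predict(X_test)
    scores = clf.decision_function(X_test)  # ranking scores for AUROC/AUPRC
    print(name,
          "Acc=%.4f" % accuracy_score(y_test, y_pred),
          "F1=%.4f" % f1_score(y_test, y_pred),
          "G-mean=%.4f" % geometric_mean_score(y_test, y_pred),
          "AUROC=%.4f" % roc_auc_score(y_test, scores),
          "AUPRC=%.4f" % average_precision_score(y_test, scores),
          "Sens=%.4f" % recall_score(y_test, y_pred),
          "Prec=%.4f" % precision_score(y_test, y_pred),
          "MCC=%.4f" % matthews_corrcoef(y_test, y_pred),
          "Kappa=%.4f" % cohen_kappa_score(y_test, y_pred))
```

The proposed method would replace the samplers above with a selector that searches, via binary whale positions scored by a 1-nearest-neighbor fitness, for a representative subset of the majority class before the SVM is trained.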

Article Details

Section
Research Article

References

S. Fotouhi, S. Asadi, and M. W. Kattan, “A comprehensive data level analysis for cancer diagnosis on imbalanced data,” J. Biomed. Inform., vol. 90, Feb. 2019, Art. no. 103089, doi: 10.1016/j.jbi.2018.12.003.

N. M. Mqadi, N. Naicker, and T. Adeliyi, “Solving misclassification of the credit card imbalance problem using near miss,” Math. Probl. Eng., vol. 2021, Jul. 2021, Art. no. 7194728, doi: 10.1155/2021/7194728.

W. Kesornsit, V. Lorchirachoonkul, and J. Jitthavech, “Imbalanced data problem solving in classification of diabetes patients,” (in Thai), KKU Res. J. (Graduate Studies), vol. 18, no. 3, pp. 11–21, 2018.

A. Fernández, S. García, M. Galar, R. C. Prati, B. Krawczyk, and F. Herrera, Learning from Imbalanced Data Sets. Cham, Switzerland: Springer, 2018.

H. Yu, J. Ni, and J. Zhao, “ACOSampling: An ant colony optimization-based undersampling method for classifying imbalanced DNA microarray data,” Neurocomputing, vol. 101, pp. 309–318, 2013.

V. López, I. Triguero, C. J. Carmona, S. García, and F. Herrera, “Addressing imbalanced classification with instance generation techniques: IPADE-ID,” Neurocomputing, vol. 126, pp. 15–28, 2014.

H.-J. Kim, N.-O. Jo, and K.-S. Shin, “Optimization of cluster-based evolutionary undersampling for the artificial neural networks in corporate bankruptcy prediction,” Expert Syst. Appl., vol. 59, pp. 226–234, 2016.

J. Li et al., “Adaptive swarm balancing algorithms for rare-event prediction in imbalanced healthcare data,” PLoS One, vol. 12, no. 7, 2017, Art. no. e0180830, doi: 10.1371/journal.pone.0180830.

V. Kumar and D. Kumar, “Binary whale optimization algorithm and its application to unit commitment problem,” Neural Comput. Appl., vol. 32, no. 7, pp. 2095–2123, 2020.

M. M. Mafarja and S. Mirjalili, “Hybrid whale optimization algorithm with simulated annealing for feature selection,” Neurocomputing, vol. 260, pp. 302–312, 2017.

A. G. Hussien, A. E. Hassanien, E. H. Houssein, S. Bhattacharyya, and M. Amin, “S-shaped binary whale optimization algorithm for feature selection,” in Recent Trends in Signal and Image Processing (Advances in Intelligent Systems and Computing), vol. 727, S. Bhattacharyya, A. Mukherjee, H. Bhaumik, S. Das, and K. Yoshida, Eds., Singapore: Springer, 2019, pp. 79–87.

G. I. Sayed, A. Darwish, and A. E. Hassanien, “Binary whale optimization algorithm and binary moth flame optimization with clustering algorithms for clinical breast cancer diagnoses,” J. Classif., vol. 37, no. 1, pp. 66–96, 2020.

A. G. Hussien, A. E. Hassanien, E. H. Houssein, M. Amin, and A. T. Azar, “New binary whale optimization algorithm for discrete optimization problems,” Eng. Optim., vol. 52, no. 6, pp. 945–959, 2020.

D. W. Aha, D. Kibler, and M. K. Albert, “Instance-based learning algorithms,” Mach. Learn., vol. 6, no. 1, pp. 37–66, 1991.

C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn., vol. 20, no. 3, pp. 273–297, 1995.

R. Akbani, S. Kwek, and N. Japkowicz, “Applying support vector machines to imbalanced datasets,” in Proc. 15th Eur. Conf. Mach. Learn. (ECML 2004), Pisa, Italy, Sep. 2004, pp. 39–50.

S. Mishra, “Handling imbalanced data: SMOTE vs. random undersampling,” Int. Res. J. Eng. Technol., vol. 4, no. 8, pp. 317–320, 2017.

The Imbalanced-learn Developers. “ClusterCentroids.” IMBALANCED-LEARN.org. https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.ClusterCentroids.html (accessed Mar. 3, 2022).

J. Zhang and I. Mani, “kNN approach to unbalanced data distributions: A case study involving information extraction,” presented at the ICML 2003 Workshop on Learning from Imbalanced Data Sets (II), Washington, DC, USA, Aug. 21, 2003.

A. Orriols-Puig and E. Bernadó-Mansilla, “Evolutionary rule-based systems for imbalanced data sets,” Soft Comput., vol. 13, no. 3, pp. 213–225, 2009.

S. Mirjalili and A. Lewis, “The whale optimization algorithm,” Adv. Eng. Softw., vol. 95, pp. 51–67, 2016.

J. S. Akosa, “Predictive accuracy: A misleading performance measure for highly imbalanced data,” presented at the SAS Global Forum 2017, Orlando, FL, USA, Apr. 2–5, 2017, Paper 942–2017.

D. Chicco and G. Jurman, “The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation,” BMC Genomics, vol. 21, no. 1, pp. 1–13, 2020.

J. Cohen, “A coefficient of agreement for nominal scales,” Educ. Psychol. Meas., vol. 20, no. 1, pp. 37–46, 1960.

T. Fawcett, “An introduction to ROC analysis,” Pattern Recognit. Lett., vol. 27, no. 8, pp. 861–874, 2006.

Precision-Recall, scikit-learn 1.2.2, 2023. [Online]. Available: https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html

K. Battula, “Research of machine learning algorithms using K-fold cross validation,” Int. J. Eng. Adv. Technol., vol. 8, no. 6S, pp. 215–218, 2021.

Imbalanced data sets, KEEL, 2011. [Online]. Available: http://www.keel.es/

fetch_datasets, The imbalanced-learn developers, 2018. [Online]. Available: https://imbalanced-learn.org/stable/references/generated/imblearn.datasets.fetch_datasets.html