Correlation-based feature selection and Smote-Tomek Link to improve the performance of machine learning methods on cancer disease prediction

Lalu Ganda Rady Putra
Khairani Marzuki
Hairani Hairani

Abstract

Indonesia is an archipelago with the fourth-largest population in the world, at 283 million people. In Indonesia, breast cancer ranks first among cancers and contributes the most deaths. Deaths from breast cancer can be reduced through screening and early detection, which lower the risk of the disease reaching a more severe stage. Early detection can delay the growth of cancer cells and increase the chances of recovery. This research proposed a machine learning-based application that lets people screen for and detect breast cancer early on their own, based on perceived symptoms. Such an application requires very high accuracy to minimize prediction errors, so this research focused on finding the machine learning configuration that predicts breast cancer with the lowest error rate. Specifically, it aimed to improve the performance of classification methods in breast cancer prediction by combining correlation feature selection with Smote-Tomek Link hybrid sampling. The Support Vector Machine (SVM) and Naive Bayes classification methods were each evaluated with this combination. Smote-Tomek Link hybrid sampling balanced the data while minimizing noise in the synthetic data it created. Correlation feature selection kept only the input attributes that were strongly correlated (≥ 0.6) with the class attribute. The SVM method with hybrid sampling and correlation feature selection achieved the best performance, with an accuracy of 96.80%, sensitivity of 96.80%, and specificity of 96.80%, outperforming both the Naive Bayes method and the previous studies used for comparison. Thus, the Smote-Tomek Link hybrid sampling approach and correlation feature selection improved the performance of the SVM and Naive Bayes methods for breast cancer prediction.
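
The two preprocessing steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy data, the use of the absolute Pearson correlation, the binary-class setting, and the choice to drop both members of each Tomek link are all assumptions made for illustration. Only the 0.6 threshold comes from the abstract. In practice, a library such as imbalanced-learn (its `SMOTETomek` class) would normally be used instead of hand-written code.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def select_features(X, y, threshold=0.6):
    """Column indices whose |correlation| with the class is >= threshold.

    Taking the absolute value is an assumption; the abstract only says ">= 0.6".
    """
    return [j for j in range(len(X[0]))
            if abs(pearson([row[j] for row in X], y)) >= threshold]


def smote(X, y, minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority rows by interpolating from a minority
    row towards one of its k nearest minority neighbours.
    Assumes at least two minority rows."""
    rng = random.Random(seed)
    pts = [row for row, label in zip(X, y) if label == minority]
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(pts)
        neighbours = sorted((p for p in pts if p is not base),
                            key=lambda p: math.dist(p, base))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def tomek_link_indices(X, y):
    """Indices of rows in a Tomek link: two rows that are each other's
    nearest neighbour and carry different labels."""
    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: math.dist(X[i], X[j]))
    nn = [nearest(i) for i in range(len(X))]
    return {i for i, j in enumerate(nn) if nn[j] == i and y[i] != y[j]}


def smote_tomek(X, y, minority, k=3, seed=0):
    """Oversample the minority class to parity (binary labels assumed),
    then remove both members of every Tomek link."""
    n_min = sum(1 for label in y if label == minority)
    n_new = max((len(y) - n_min) - n_min, 0)
    X_bal = [list(r) for r in X] + smote(X, y, minority, n_new, k, seed)
    y_bal = list(y) + [minority] * n_new
    drop = tomek_link_indices(X_bal, y_bal)
    keep = [i for i in range(len(X_bal)) if i not in drop]
    return [X_bal[i] for i in keep], [y_bal[i] for i in keep]


# Hypothetical toy data: two features, four majority rows (0), two minority rows (1).
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 6.0], [4.0, 2.0], [8.0, 4.0], [9.0, 5.5]]
y = [0, 0, 0, 0, 1, 1]

selected = select_features(X, y)            # only feature 0 passes the 0.6 cut
X_bal, y_bal = smote_tomek(X, y, minority=1)
print(selected, len(X_bal), y_bal.count(0), y_bal.count(1))
```

On this toy data the two classes are well separated, so no Tomek links are found and the balanced set simply holds four rows per class. To avoid leaking information into the evaluation, resampling of this kind is usually applied to the training split only.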

Article Details

How to Cite
Lalu Ganda Rady Putra, Khairani Marzuki, & Hairani, H. (2023). Correlation-based feature selection and Smote-Tomek Link to improve the performance of machine learning methods on cancer disease prediction. Engineering and Applied Science Research, 50(6), 577–583. Retrieved from https://ph01.tci-thaijo.org/index.php/easr/article/view/253528
Section
ORIGINAL RESEARCH
