Performance evaluation of supervised learning algorithms with various training data sizes and missing attributes
Abstract
Supervised learning is a machine learning technique used to build predictive models from labelled data. This article seeks to identify high-performance supervised learning algorithms under varied training data sizes and varied numbers of attributes, also taking into account the time spent on prediction. The study evaluated seven algorithms, Boosting, Random Forest, Bagging, Naive Bayes, K-Nearest Neighbours (K-NN), Decision Tree, and Support Vector Machine (SVM), on seven standard benchmark data sets from the University of California, Irvine (UCI), using two evaluation metrics and experimental settings that vary the training data size and remove key attributes. Our findings reveal that Bagging, Random Forest, and SVM are overall the three most accurate algorithms. However, when key attribute values may be missing, K-NN is recommended because its performance is affected the least. Alternatively, when the training data may not be large enough, Naive Bayes is preferable because it is the least sensitive to training data size. The algorithms are characterized on a two-dimensional chart based on prediction performance and computation time. This chart is intended to help a novice user choose a method appropriate to their needs. Based on this chart, Bagging and Random Forest are in general the two most recommended algorithms because of their high performance and speed.
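The experimental design summarized in the abstract (varying the training data size, removing a key attribute, and recording prediction time) can be illustrated with a minimal sketch. This is not the authors' code or data: it uses a hand-written K-NN classifier on synthetic two-class data instead of the UCI benchmarks, and every name in it (`make_data`, `knn_predict`, `drop_attribute`, `evaluate`, `run_experiment`) is hypothetical. Attribute 0 is assumed to play the role of the "key" attribute.

```python
import random
import time


def make_data(n, seed=0):
    """Synthetic two-class data (an assumption, not a UCI data set).

    Attribute 0 is strongly informative (the 'key' attribute),
    attribute 1 is weakly informative, attribute 2 is pure noise.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = [rng.gauss(3.0 * label, 1.0),
             rng.gauss(1.0 * label, 1.0),
             rng.gauss(0.0, 1.0)]
        rows.append((x, label))
    return rows


def knn_predict(train, x, k=5):
    """Plain K-NN: squared Euclidean distance, majority vote (k is odd)."""
    nearest = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)),
    )[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def drop_attribute(rows, index):
    """Simulate a missing attribute by removing one column everywhere."""
    return [([v for i, v in enumerate(x) if i != index], y) for x, y in rows]


def evaluate(train, test, k=5):
    """Return (accuracy, seconds spent on prediction)."""
    start = time.perf_counter()
    correct = sum(knn_predict(train, x, k) == y for x, y in test)
    elapsed = time.perf_counter() - start
    return correct / len(test), elapsed


def run_experiment(train_sizes=(20, 100, 400), n_test=200):
    """Measure accuracy for each training size, with and without attribute 0."""
    pool = make_data(max(train_sizes), seed=1)
    test = make_data(n_test, seed=2)
    results = {}
    for size in train_sizes:
        train = pool[:size]
        acc_full, secs = evaluate(train, test)
        acc_missing, _ = evaluate(drop_attribute(train, 0),
                                  drop_attribute(test, 0))
        results[size] = {
            "accuracy": acc_full,
            "accuracy_without_key": acc_missing,
            "predict_seconds": secs,
        }
    return results


if __name__ == "__main__":
    for size, row in run_experiment().items():
        print(size, row)
```

Plotting accuracy against prediction time for several classifiers evaluated this way would give a chart of the kind the abstract describes; the numbers from this toy setup say nothing about the paper's actual results.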
Article Details
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.