Large-scale Data Classification based on K-means Clustering and Deep Learning
Abstract
Classifying large data sets typically suffers from two problems: long processing times and the large amount of training data needed to maintain high accuracy. To address these problems, this study investigates a method for classifying large data that reduces the amount of training data without sacrificing classification performance. The proposed method shrinks the training set by combining K-means clustering with deep learning. Its effectiveness was evaluated using accuracy and the area under the ROC curve (AUC). The method was compared with the original deep learning method trained on 80% and 90% of the total data, and with the original deep learning method trained on the same amount of training data as the proposed method. The results show that the proposed method significantly reduces the size of the training data: with less than 1% of the total data used for training, it still achieved high average accuracy and high average AUC. For normally distributed data of size 1,000,000 × 5 (N × features), the proposed method achieved an average accuracy of 97.4878% and an average AUC of 0.9735. Compared with the deep learning method trained on about 80% and 90% of the total data, the proposed method achieved comparable classification performance, while its processing time was 2–4 times shorter.
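The abstract does not spell out how K-means and the network are combined, so the following is only a minimal sketch under stated assumptions: each class is clustered separately, the cluster centroids (labelled with their class) replace the full training set, and a small one-hidden-layer NumPy network stands in for the deep learning model. The data are synthetic, and every name and parameter here (`kmeans`, `train_mlp`, `k = 20`, the class means) is hypothetical rather than taken from the article.

```python
import numpy as np

def kmeans(X, k, rng, iters=20):
    """Plain Lloyd's algorithm; returns k centroids of X."""
    centers = X[rng.choice(len(X), size=k, replace=False)]  # copy of k random rows
    for _ in range(iters):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers

def train_mlp(X, y, rng, hidden=8, lr=0.05, epochs=1000):
    """One-hidden-layer network (tanh, sigmoid output), full-batch gradient descent."""
    W1 = rng.normal(scale=0.1, size=(X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.1, size=hidden)
    b2 = 0.0
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)
        p = 1.0 / (1.0 + np.exp(-(H @ W2 + b2)))
        g = (p - y) / len(y)                   # d(mean cross-entropy)/d(logit)
        gH = np.outer(g, W2) * (1.0 - H ** 2)  # backpropagate through tanh
        W2 -= lr * (H.T @ g)
        b2 -= lr * g.sum()
        W1 -= lr * (X.T @ gH)
        b1 -= lr * gH.sum(axis=0)
    return W1, b1, W2, b2

def logits(params, X):
    W1, b1, W2, b2 = params
    return np.tanh(X @ W1 + b1) @ W2 + b2

def auc_score(y, scores):
    """Rank-based (Mann-Whitney) AUC; assumes no tied scores."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n1 = y.sum()
    n0 = len(y) - n1
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)

rng = np.random.default_rng(0)
n, d, k = 20000, 5, 20                          # assumed sizes, not the article's
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d)) + 3.0 * y[:, None]  # two Gaussian classes

idx = rng.permutation(n)
pool, test = idx[:16000], idx[16000:]           # 80% pool, 20% held out

# Reduce the pool: k centroids per class become the whole training set.
X_red = np.vstack([kmeans(X[pool][y[pool] == c], k, rng) for c in (0, 1)])
y_red = np.repeat([0, 1], k)

params = train_mlp(X_red, y_red, rng)
test_logits = logits(params, X[test])
accuracy = np.mean((test_logits > 0) == y[test])
auc = auc_score(y[test], test_logits)
reduced_fraction = len(X_red) / n

print(f"training fraction={reduced_fraction:.4f} accuracy={accuracy:.4f} AUC={auc:.4f}")
```

In this sketch the network sees only 40 centroids, 0.2% of the synthetic data, which mirrors the "less than 1%" setting in the abstract. The scores it prints describe this toy problem only and say nothing about the article's reported results.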
Article Details

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Published articles reflect the opinions of the authors only. The authors are responsible for any legal consequences that may arise from their articles.