Toxicity Posts and Hateful Comments Detection in Thai Language Using Supervised Ensemble Classification

Authors

  • Sutthisak Sukhamsri, Department of Information Technology, Faculty of Science and Agricultural Technology, Rajamangala University of Technology Lanna Tak

DOI:

https://doi.org/10.14456/rmutlengj.2025.1

Keywords:

Natural Language Processing, Toxicity Posts, Word Vectorization, Thai Language Corpus, Ensemble Model

Abstract

Social media platforms are communities where people gather and freely express their opinions on any topic that interests them. However, negative, toxic, and hateful posts or comments often trigger hostile arguments or an unpleasant atmosphere in these communities. For that reason, systems that monitor posts on social media are an essential topic in natural language processing, especially in multilingual research. In this study, we propose an improved method for classifying toxic and hateful content in the Thai language, trained and validated on a dataset of 2,160 posts from the Thai toxicity Twitter corpus. We adopt an ensemble approach that combines XGBoost, multinomial naive Bayes, logistic regression, support vector machine, and random forest classifiers. The ensemble classifier improved on the previous study on the same dataset, achieving 0.7808 precision, 0.7778 recall, and 0.7721 F1 under weighted averaging, and 0.8235 under binary F1 scoring.
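The abstract names the five base classifiers and reports both weighted and binary F1, but it does not state how the base predictions are combined. The sketch below is therefore only an illustration under stated assumptions: it assumes hard majority voting, uses made-up toy labels (1 = toxic, 0 = non-toxic) rather than data from the study, and all function and variable names are hypothetical. It shows how the two F1 variants differ: binary F1 scores only the toxic class, while weighted F1 averages the per-class F1 values in proportion to class support.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists by hard majority vote, one label per post."""
    combined = []
    for votes in zip(*predictions_per_model):
        # With an odd number of binary voters there can be no tie.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


def f1_for_label(y_true, y_pred, label):
    """F1 score treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(
        f1_for_label(y_true, y_pred, label) * count / total
        for label, count in support.items()
    )


# Toy gold labels for six posts (illustrative only, not from the corpus).
y_true = [1, 1, 1, 0, 0, 1]

# Hypothetical predictions standing in for the five base classifiers.
base_predictions = [
    [1, 1, 0, 0, 1, 1],  # stand-in for XGBoost
    [1, 0, 1, 0, 0, 1],  # stand-in for multinomial naive Bayes
    [1, 1, 1, 1, 0, 0],  # stand-in for logistic regression
    [0, 1, 1, 0, 0, 0],  # stand-in for support vector machine
    [1, 1, 0, 0, 1, 0],  # stand-in for random forest
]

ensemble_pred = majority_vote(base_predictions)
binary_f1 = f1_for_label(y_true, ensemble_pred, label=1)
weighted = weighted_f1(y_true, ensemble_pred)

print("ensemble:", ensemble_pred)
print("binary F1:", round(binary_f1, 4))
print("weighted F1:", round(weighted, 4))
```

In practice the base predictions would come from models fitted on vectorized Thai text. Other combination rules, such as soft voting or stacking, would fit the abstract's description equally well.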

Published

2025-04-24

How to Cite

Sukhamsri, Sutthisak. (2025). Toxicity Posts and Hateful Comments Detection in Thai Language Using Supervised Ensemble Classification. RMUTL Engineering Journal, 10(1), 1–8. https://doi.org/10.14456/rmutlengj.2025.1

Section

Research Article