Thai Word Segmentation using a Replacing the English Alphabet Approach to Enhance Thai Text Sentiment Analysis


Vuttichai Vichianchai
Sumonta Kasemvilas

Abstract

Thai word segmentation is an important method that is used in several document analysis applications. Dictionary-based techniques are popular for Thai word segmentation because of their high accuracy. However, these techniques are prone to errors, especially when some words are not in the dictionary. One solution to this problem is to add more vocabulary to the dictionary, but traditional techniques still cannot segment misspelled words. Therefore, this research proposes a new Thai word segmentation method that replaces Thai letters with English letters. Replacing the English alphabet (REA) is a novel approach that generates short English character sequences in various formats that follow the same Thai writing structures. This approach improves the accuracy of Thai word segmentation and thereby the accuracy of Thai text classification and sentiment analysis. The evaluation uses Thai social media messages and Thai post comments on Pantip, each labeled by sentiment (positive, neutral, or negative). The REA approach combined with the TF-G and RF techniques performs better than the other methods tested, and the experimental results compare acceptably with those of earlier well-known studies.
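The abstract does not give the actual letter mapping used by REA, so the following is only a minimal sketch of the general idea of replacing Thai letters with English letters. It assumes a hypothetical scheme in which each Thai character is replaced by one Latin letter naming its character class, based on its Unicode code point: C for consonants, L for leading vowels, V for other vowel signs, and T for tone marks. The function names `thai_char_class` and `to_pattern` are illustrative and do not come from the article.

```python
def thai_char_class(ch: str) -> str:
    """Map one character to a Latin class letter (hypothetical scheme, not the paper's)."""
    cp = ord(ch)
    if 0x0E01 <= cp <= 0x0E2E:   # Thai consonants, KO KAI .. HO NOKHUK
        return "C"
    if 0x0E40 <= cp <= 0x0E44:   # leading vowels written before the consonant
        return "L"
    if 0x0E30 <= cp <= 0x0E39:   # vowel signs written after, above, or below
        return "V"
    if 0x0E48 <= cp <= 0x0E4B:   # the four tone marks
        return "T"
    return ch                    # anything else is left unchanged


def to_pattern(text: str) -> str:
    """Replace every Thai letter in text with its class letter."""
    return "".join(thai_char_class(ch) for ch in text)


if __name__ == "__main__":
    print(to_pattern("ไปกิน"))   # LCCVC
    print(to_pattern("น้ำ"))     # CTV
```

Under this assumed scheme, words that share a writing structure produce the same short English sequence, which is the kind of pattern a segmenter could match without requiring every word, or every spelling of it, to be present in a dictionary.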

Article Details

How to Cite
Vichianchai, V., & Kasemvilas, S. (2024). Thai Word Segmentation using a Replacing the English Alphabet Approach to Enhance Thai Text Sentiment Analysis. Journal of Applied Informatics and Technology, 6(2), 158–178. https://doi.org/10.14456/jait.2024.10
Section
Research Article
