LCS-based Thai Trending Keyword Extraction from Online News

Kietikul Jearanaitanakij; Nattapong Kueakool; Puwadol Limwanichsin; Tiwat Kullawan; Chankit Yongpiyakul

doi:10.14456/nuej.2022.14

Published: Nov 23, 2022

Keywords:

Longest common substring; Natural language processing; Online news; Thai trending keyword; Varying-length keyword

Kietikul Jearanaitanakij

Department of Computer Engineering, School of Engineering, King Mongkut's Institute of Technology Ladkrabang

Nattapong Kueakool

InfoQuest Limited

Puwadol Limwanichsin

InfoQuest Limited

Tiwat Kullawan

InfoQuest Limited

Chankit Yongpiyakul

InfoQuest Limited

Abstract

A trending keyword is a word or phrase that is mentioned most frequently during the current period. Extracting trending keywords from Thai online news is not trivial. A keyword that is too short may carry no specific meaning, because it may be merely a common word that contributes nothing to interpretation. A longer common keyword conveys meaning better, but the running time needed to extract long keywords from a collection of documents may not stay within a reasonable bound. The problem addressed in this research is to find varying-length trending keywords in Thai online news within a reasonable running time. We propose a novel method that extracts trending keywords by applying the longest common substring (LCS) algorithm. Common keywords with a high occurrence frequency are selected as the trending keywords. The proposed method inherits its reasonable running time from the dynamic programming technique of the LCS algorithm. Experimental results on Thai online news from various agencies indicate that the proposed method achieves higher precision than the char-N-gram and word-N-gram strategies.
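The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how documents are paired, filtered, or ranked, so the pairwise comparison, the helper name `trending_candidates`, and the parameters `min_len` and `top_k` are assumptions made only for illustration. The first function is the standard dynamic-programming longest common substring routine, which runs in time proportional to the product of the two input lengths.

```python
from collections import Counter
from itertools import combinations


def longest_common_substring(a: str, b: str) -> str:
    """Return one longest contiguous substring shared by a and b.

    Classic dynamic programming: curr[j] holds the length of the common
    suffix of a[:i] and b[:j]. Time is O(len(a) * len(b)).
    """
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def trending_candidates(texts, min_len=3, top_k=3):
    """Hypothetical helper (an assumption, not from the paper).

    Counts how often each pairwise longest common substring occurs and
    returns the most frequent ones as (substring, count) tuples.
    """
    counts = Counter()
    for a, b in combinations(texts, 2):
        common = longest_common_substring(a, b).strip()
        if len(common) >= min_len:  # skip keywords too short to be meaningful
            counts[common] += 1
    return counts.most_common(top_k)


# Toy headlines, invented for this example.
headlines = [
    "flood warning in bangkok",
    "bangkok flood warning raised",
    "new flood warning today",
]
candidates = trending_candidates(headlines)
```

The comparison works character by character, so it needs no word segmentation and the extracted keyword can have any length. Python strings are sequences of code points, so the same functions accept Thai text, which is written without spaces between words.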

How to Cite

Jearanaitanakij, K., Kueakool, N., Limwanichsin, P., Kullawan, T., & Yongpiyakul, C. (2022). LCS-based Thai Trending Keyword Extraction from Online News. Naresuan University Engineering Journal, 17(2), 54–61. https://doi.org/10.14456/nuej.2022.14

Issue

Vol. 17 No. 2 (2022): July-December 2022

Section

Research Paper

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
