Fast Hybrid Approach for Thai News Summarization

Authors

  • Kietikul Jearanaitanakij, Department of Computer Engineering, School of Engineering, King Mongkut’s Institute of Technology Ladkrabang
  • Suratan Boonpong, Department of Computer Engineering, School of Engineering, King Mongkut’s Institute of Technology Ladkrabang
  • Kirttiphoom Teainnagrm, Department of Computer Engineering, School of Engineering, King Mongkut’s Institute of Technology Ladkrabang
  • Thanakrit Thonglor, Department of Computer Engineering, School of Engineering, King Mongkut’s Institute of Technology Ladkrabang
  • Tiwat Kullawan, Dataxet Limited
  • Chankit Yongpiyakul, Dataxet Limited

DOI:

https://doi.org/10.55003/ETH.410307

Keywords:

News summarization, Natural language processing, TextRank, TF.IDF, mT5 model, mBART model

Abstract

News summarization presents a significant challenge in Natural Language Processing (NLP). Lengthy news articles not only consume valuable time but can also obscure the key points. An ideal news summarizer should swiftly produce a succinct summary while retaining the essence of the information conveyed by the news writer. While intelligent chatbots such as ChatGPT and Gemini offer user-friendly text summarization, their underlying Large Language Models (LLMs) cannot be downloaded for private use. Moreover, deploying them in a business process can be expensive, both in pay-per-use costs and in response time. The objective of this research is to develop a private Thai news summarization model that effectively extracts the sentences encapsulating the main idea and then summarizes them abstractively. The proposed model consists of two components: the first extracts a contiguous region containing important sentences using the TextRank algorithm, while the second employs a fine-tuned mBART as the LLM to generate an abstractive summary from the extracted sentences. In other words, the proposed model extracts an important news region before passing it to mBART. This approach produces a news summary that retains the key information and has a syntactic style akin to natural Thai. We evaluate summarization quality with ROUGE scores and BERTScore (precision, recall, and F1-score). Experimental results on the ThaiSum dataset show relatively high ROUGE scores and BERTScore for the proposed model compared with most other approaches. Furthermore, the model significantly reduces runtime, keeping it within a reasonable limit.
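
For illustration, the sketch below wires the two stages together, assuming the article has already been split into sentences (e.g., with PyThaiNLP, which the paper uses for Thai text processing). The TF-IDF sentence similarity, the fixed-size window heuristic for choosing the contiguous region, and the generation settings are illustrative assumptions rather than details taken from the paper, and the public facebook/mbart-large-50 checkpoint merely stands in for the authors’ fine-tuned mBART.

    import networkx as nx
    from pythainlp.tokenize import word_tokenize  # Thai word segmentation; PyThaiNLP is cited by the paper
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

    def textrank_window(sentences, window=5):
        """Rank sentences with TextRank (PageRank over a sentence-similarity
        graph) and return the contiguous window with the highest total score.
        The window size is an illustrative assumption."""
        tfidf = TfidfVectorizer(tokenizer=word_tokenize).fit_transform(sentences)
        graph = nx.from_numpy_array(cosine_similarity(tfidf))
        scores = nx.pagerank(graph)  # node i corresponds to sentences[i]
        starts = range(max(1, len(sentences) - window + 1))
        best = max(starts, key=lambda i: sum(scores[j]
                   for j in range(i, min(i + window, len(sentences)))))
        return sentences[best:best + window]

    # Placeholder checkpoint: the paper fine-tunes mBART on ThaiSum, so the
    # authors' own weights would be loaded here instead.
    tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")
    model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
    tokenizer.src_lang = "th_TH"

    def summarize(sentences):
        """Stage 1: extract the important contiguous region with TextRank.
        Stage 2: let mBART rewrite that region as an abstractive summary."""
        region = " ".join(textrank_window(sentences))
        inputs = tokenizer(region, return_tensors="pt",
                           truncation=True, max_length=1024)
        ids = model.generate(**inputs, num_beams=4, max_length=140,
                             forced_bos_token_id=tokenizer.lang_code_to_id["th_TH"])
        return tokenizer.batch_decode(ids, skip_special_tokens=True)[0]

Scoring the resulting summaries against ThaiSum references can then be done with standard ROUGE and BERTScore implementations (e.g., the rouge-score and bert-score Python packages).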

References

M. Allahyari, S. Pouriyeh, M. Assefi, S. Safaei, E. D. Trippe, J. B. Gutierrez and K. Kochut, “Text Summarization Techniques: A Brief Survey,” International Journal of Advanced Computer Science and Applications, vol. 8, no. 10, pp. 397–405, 2017, doi: 10.14569/IJACSA.2017.081052.

K. F. Wong, M. Wu and W. Li, “Extractive summarization using supervised and semi-supervised learning,” in Proc. 22nd International Conference on Computational Linguistics, Manchester, UK, Aug. 18–22, 2008, pp. 985–992, doi: 10.3115/1599081.1599205.

V. Qazvinian, D. R. Radev, S. M. Mohammad, B. Dorr, D. M. Zajic, M. Whidby and T. Moon, “Generating Extractive Summaries of Scientific Paradigms,” Journal of Artificial Intelligence Research, vol. 46, pp. 165–201, 2013, doi: 10.1613/jair.3732.

A. Jain, D. Bhatia and M. K. Thakur, “Extractive Text Summarization Using Word Vector Embedding,” in 2017 International Conference on Machine Learning and Data Science (MLDS), Noida, India, Dec. 14–15, 2017, pp. 51–55, doi: 10.1109/MLDS.2017.12.

R. Nallapati, B. Zhou, C. Santos, Ç. Gülçehre and B. Xiang, “Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond,” in Proc. 20th SIGNLL Conference on Computational Natural Language Learning, Berlin, Germany, Aug. 11–12, 2016, pp. 280–290, doi: 10.18653/V1/K16-1028.

J. Tan, X. Wan and J. Xiao, “Abstractive Document Summarization with a Graph-Based Attentional Neural Model,” in Proc. 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, Jul. 30–Aug. 4, 2017, pp. 1171–1181, doi: 10.18653/v1/P17-1108.

S. Gehrmann, Y. Deng and A. Rush, “Bottom-Up Abstractive Summarization,” in Proc. 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, Oct. 31–Nov. 4, 2018, pp. 4098–4109, doi: 10.18653/v1/D18-1443.

Y. Zhang, D. Li, Y. Wang, Y. Fang and W. Xiao, “Abstract Text Summarization with a Convolutional Seq2seq Model,” Applied Sciences, vol. 9, no. 8, 2019, Art. no. 1665, doi: 10.3390/app9081665.

Z. Hao, J. Ji, T. Xie and B. Xue, “Abstractive Summarization Model with a Feature-Enhanced Seq2Seq Structure,” in 2020 5th Asia-Pacific Conference on Intelligent Robot Systems (ACIRS), Singapore, Jul. 17–19, 2020, pp. 163–167, doi: 10.1109/ACIRS49895.2020.9162627.

Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis and L. Zettlemoyer, “Multilingual Denoising Pre-training for Neural Machine Translation,” Transactions of the Association for Computational Linguistics, vol. 8, pp. 726–742, 2020, doi: 10.1162/tacl_a_00343.

M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov and L. Zettlemoyer, “BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension,” in Proc. 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 7871–7880. [Online]. Available: https://aclanthology.org/2020.acl-main.703.pdf

M. Yang, X. Wang, Y. Lu, J. Lv, Y. Shen and C. Li, “Plausibility promoting generative adversarial network for abstractive text summarization with multi-task constraint,” Information Sciences, vol. 521, pp. 46–61, 2020, doi: 10.1016/j.ins.2020.02.040.

M. Yang, C. Li, Y. Shen, Q. Wu, Z. Zhao and X. Chen, “Hierarchical Human-Like Deep Neural Networks for Abstractive Text Summarization,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 6, pp. 2744–2757, 2021, doi: 10.1109/TNNLS.2020.3008037.

L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua and C. Raffel, “mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer,” in Proc. 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Jun. 6–11, 2021, pp. 483–498. [Online]. Available: https://aclanthology.org/2021.naacl-main.41.pdf.

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li and P. J. Liu, “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer,” Journal of Machine Learning Research, vol. 21, no. 140, pp. 1–67, 2020.

W. Phatthiyaphaibun, K. Chaovavanich, C. Polpanumas, A. Suriyawongkul, L. Lowphansirikul and P. Chormai, “PyThaiNLP: Thai Natural Language Processing in Python,” in Proc. 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), Singapore, Dec. 6, 2023, pp. 25–36, doi: 10.18653/v1/2023.nlposs-1.4.

P. Ngamcharoen, N. Sanglerdsinlapachai and P. Vejjanugraha, “Automatic Thai Text Summarization Using Keyword-Based Abstractive Method,” in 2022 17th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), Chiang Mai, Thailand, Nov. 5–7, 2022, pp. 1–5, doi: 10.1109/iSAI-NLP56921.2022.9960265.

P. Chormai, P. Prasertsom and A. Rutherford, “AttaCut: a fast and accurate neural Thai word segmenter,” arXiv, vol. abs/1911.07056, pp. 1–13, 2019, doi: 10.48550/arXiv.1911.07056.

DeepCut: A Thai word tokenization library using Deep Neural Network, Zenodo, Sep. 23, 2019, doi: 10.5281/zenodo.3457707.

R. Mihalcea and P. Tarau, “TextRank: Bringing Order into Text,” in Proc. 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, Jul. 25–26, 2004, pp. 404–411.

S. Brin and L. Page, “The anatomy of a large-scale hypertextual Web search engine,” Computer Networks and ISDN Systems, vol. 30, no. 1–7, pp. 107–117, 1998, doi: 10.1016/S0169-7552(98)00110-X.

J. Leskovec, A. Rajaraman and J. D. Ullman, “Data mining,” in Mining of Massive Datasets, 2nd ed. Cambridge, UK: Cambridge University Press, 2014, ch. 1, sec. 1.3.1, pp. 8–9.

C. Y. Lin, “ROUGE: A Package for Automatic Evaluation of Summaries,” in Proc. Workshop on Text Summarization Branches Out, Barcelona, Spain, Jul. 25–26, 2004, pp. 74–81.

N. Chumpolsathien, “Using Knowledge Distillation from Keyword Extraction to Improve the Informativeness of Neural Cross-lingual Summarization,” M.S. thesis, Beijing Institute of Technology, Beijing, China, 2020.

Published

2024-09-30

How to Cite

[1]
K. Jearanaitanakij, S. Boonpong, K. Teainnagrm, T. Thonglor, T. Kullawan, and C. Yongpiyakul, “Fast Hybrid Approach for Thai News Summarization”, Eng. & Technol. Horiz., vol. 41, no. 3, p. 410307, Sep. 2024.

Issue

Vol. 41 No. 3 (2024)

Section

Research Articles