A hybrid approach to Pali Sandhi segmentation using BiLSTM and rule-based analysis
Abstract
Pali Sandhi is a phonetic transformation that joins two words into a new word: the phonemes at the boundary of the neighbouring words are changed and merged. Pali Sandhi word segmentation is more challenging than Thai word segmentation because Pali is a highly inflected language. This study proposes a novel approach that uses a bidirectional long short-term memory (BiLSTM) model to predict splitting locations by classifying sample Sandhi words into five classes. We then applied the rules corresponding to each class to rectify the words at the splitting locations. We identified 6,345 Pali Sandhi words from the Dhammapada Atthakatha. We evaluated the proposed model by the accuracy of its splitting locations, comparing the predictions with the reference answers in the dataset. Results showed that 92.20% of the splitting locations were correct, 1.10% of the Pali Sandhi words were predicted to have no splitting location and 5.83% did not match the answers (incomplete segmentation).
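The second, rule-based stage of such a pipeline can be illustrated with a minimal Python sketch. It assumes that a classifier (for example a BiLSTM) has already assigned one label to each character of a Sandhi word. The label set below (0 for "no split", 1 to 4 for a repair rule), the function names `segment` and `evaluate`, and the three sample words are hypothetical choices made for illustration; they are not the five classes, the rules or the data used in the study.

```python
import unicodedata
from collections import Counter

# Hypothetical labels (NOT the classes used in the study):
#   0 = no split at this character
#   1 = plain boundary directly after this character
#   2 = boundary after this character; the left word lost a final "a"
#   3 = this "ā" was merged from a + a
#   4 = this "o" was merged from a + u


def segment(word, labels):
    """Split `word` at the first non-zero label and undo the Sandhi there.

    Returns a (left, right) tuple, or None if no split location is labelled.
    """
    word = unicodedata.normalize("NFC", word)  # one code point per letter
    if len(word) != len(labels):
        raise ValueError("need exactly one label per character")
    for i, label in enumerate(labels):
        if label == 0:
            continue
        ch, left, right = word[i], word[:i], word[i + 1:]
        if label == 1:
            return left + ch, right
        if label == 2:
            return left + ch + "a", right
        if label == 3:
            return left + "a", "a" + right
        if label == 4:
            return left + "a", "u" + right
        raise ValueError(f"unknown label {label}")
    return None


def evaluate(samples):
    """Tally outcomes in three categories similar to those in the abstract.

    Simplification: the whole restored word pair is compared with the
    reference answer, not only the split position.
    """
    tally = Counter()
    for word, labels, gold in samples:
        gold = tuple(unicodedata.normalize("NFC", g) for g in gold)
        pred = segment(word, labels)
        if pred is None:
            tally["no_split"] += 1
        elif pred == gold:
            tally["correct"] += 1
        else:
            tally["mismatch"] += 1
    return tally


if __name__ == "__main__":
    # Illustrative samples with hand-written labels standing in for
    # the output of a trained model.
    samples = [
        ("nopeti", [0, 4, 0, 0, 0, 0], ("na", "upeti")),
        ("tatrāyaṃ", [0, 0, 0, 0, 3, 0, 0, 0], ("tatra", "ayaṃ")),
        ("yassindriyāni", [0, 0, 0, 2] + [0] * 9, ("yassa", "indriyāni")),
    ]
    for word, labels, _ in samples:
        print(word, "->", segment(word, labels))
    print(dict(evaluate(samples)))
```

Separating the two stages in this way keeps the learned component small: the model only has to decide where a boundary is and which kind of change occurred there, while the restoration of the original word endings is handled by explicit rules.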
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.