Natural Language Processing for Improving Misspelled Words Produced by Character Recognition

ทวีศักดิ์  คุ้มภักดี; สัจจาภรณ์  ไวจรรยา; ณัฐโชติ  พรหมฤทธิ์

doi:10.14456/kkuscij.2023.12


Published: Jul 19, 2023


Keywords:

misspelling correction, character recognition, Thai syllable pronunciation, string similarity measurement with BLEU score, Thai handwriting

ทวีศักดิ์ คุ้มภักดี

Department of Computing, Faculty of Science, Silpakorn University, Sanam Chandra Palace Campus

สัจจาภรณ์ ไวจรรยา

Department of Computing, Faculty of Science, Silpakorn University, Sanam Chandra Palace Campus

ณัฐโชติ พรหมฤทธิ์

Department of Computing, Faculty of Science, Silpakorn University, Sanam Chandra Palace Campus

Abstract

This article presents a method for correcting misspelled words produced by Thai handwritten character recognition. Recognizing Thai handwriting is challenging because handwriting differs from person to person, so the recognition output may contain errors, namely words that cannot be pronounced or wrong words that need to be corrected. Natural language processing is applied to improve the character recognition output. The input is the text obtained from Thai handwritten character recognition, which is entered on a web application page and sent for correction. The correction step uses knowledge of Thai syllable pronunciation rules to fix the incorrect recognition results. After correction, the text is recombined and searched for similar words, starting from the maximum number of syllables in a Thai word, 7 syllables, and working down to 1 syllable. A match requires a similarity of at least 66% for 1-syllable words, 80% for words of 2 to 3 syllables, and 90% for words of 4 to 7 syllables, computed with the Python library difflib. The correction is evaluated with the unigram BLEU score, which gave a score of 0.66 on the sample text. When the process finishes, the corrected result is displayed on the web application page.
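The matching and scoring steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `min_similarity`, `best_match`, and `unigram_bleu`, the use of `difflib.SequenceMatcher.ratio()` as the similarity measure, the candidate word list, and the choice of tokens for BLEU are all assumptions made for illustration. Only the thresholds (66%, 80%, 90%), the 7-syllable maximum, and the use of difflib and unigram BLEU come from the abstract.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def min_similarity(syllables: int) -> float:
    """Minimum similarity from the abstract, keyed by syllable count (1 to 7)."""
    if not 1 <= syllables <= 7:
        raise ValueError("syllable count must be between 1 and 7")
    if syllables == 1:
        return 0.66
    if syllables <= 3:
        return 0.80
    return 0.90


def best_match(candidate, syllables, lexicon):
    """Return the most similar lexicon word that meets the threshold, else None.

    Assumption: similarity is SequenceMatcher.ratio() over characters.
    """
    threshold = min_similarity(syllables)
    best_word, best_ratio = None, 0.0
    for word in lexicon:
        ratio = SequenceMatcher(None, candidate, word).ratio()
        if ratio >= threshold and ratio > best_ratio:
            best_word, best_ratio = word, ratio
    return best_word


def unigram_bleu(reference, hypothesis):
    """Unigram BLEU: clipped unigram precision times the brevity penalty.

    Both arguments are token lists; what counts as a token is an assumption.
    """
    if not hypothesis:
        return 0.0
    ref_counts = Counter(reference)
    hyp_counts = Counter(hypothesis)
    # Each hypothesis token is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[token]) for token, count in hyp_counts.items())
    precision = clipped / len(hypothesis)
    # The penalty applies only when the hypothesis is shorter than the reference.
    if len(hypothesis) > len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(reference) / len(hypothesis))
    return penalty * precision
```

As an example, `best_match("กาน", 1, ["การ"])` returns `"การ"`: the two strings share two of three characters, so the ratio is about 0.67, which just clears the 66% threshold for one-syllable words. Under the 80% threshold for two-syllable words the same pair would be rejected.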

How to Cite

คุ้มภักดี ท., ไวจรรยา ส., & พรหมฤทธิ์ ณ. (2023). Natural Language Processing for Improving Misspelled Words Produced by Character Recognition. KKU Science Journal, 51(2), 126–141. https://doi.org/10.14456/kkuscij.2023.12

Issue

Vol. 51 No. 2 (2023): May - August 2023

Section

Research Article

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

