Natural Language Processing to Improve the Errors Caused by the Optical Character Recognition


Taweesak Khumphakdee
Sajjaporn Waijanya
Nuttachot Promrit

Abstract

This article presents a method for correcting errors in the output of handwritten Thai character recognition. Recognizing handwritten Thai characters is challenging because handwriting varies from person to person, so the recognized text may contain errors such as unpronounceable words or incorrect words that need to be corrected. The article applies natural language processing techniques to improve these recognition results. The input is text produced by handwritten Thai character recognition, which is entered into a web application for correction. The correction process first uses knowledge of Thai phonetics to repair inaccuracies in the recognized characters. The corrected texts are then combined, and similar words are identified. Matching starts with the longest words in the Thai language, which have seven syllables, and proceeds down to one-syllable words. A word is accepted as a match only if its similarity is at least 66% for one-syllable words, 80% for words of two to three syllables, and 90% for words of four to seven syllables. Similarity is computed with the Python library "difflib", and the effectiveness of the correction is evaluated with the unigram BLEU score; the sample text achieved a score of 0.66. When the process completes, the corrected results are displayed in the web application.
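The matching and evaluation steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract names only "difflib" and the similarity thresholds, so the use of `SequenceMatcher.ratio()` as the similarity measure, the greedy longest-first scan, the pre-segmented syllable input, the lexicon grouped by syllable count, and all function names (`min_similarity`, `best_match`, `correct`, `unigram_bleu`) are assumptions made for illustration. Romanized toy strings stand in for Thai text.

```python
import math
from collections import Counter
from difflib import SequenceMatcher

MAX_SYLLABLES = 7  # longest word length considered, per the abstract


def min_similarity(syllables: int) -> float:
    """Minimum similarity required for a word of the given syllable count."""
    if syllables == 1:
        return 0.66
    if 2 <= syllables <= 3:
        return 0.80
    if 4 <= syllables <= MAX_SYLLABLES:
        return 0.90
    raise ValueError("syllable count must be between 1 and 7")


def best_match(candidate, lexicon, syllables):
    """Return the most similar lexicon word at or above the threshold, else None."""
    threshold = min_similarity(syllables)
    best, best_ratio = None, 0.0
    for word in lexicon:
        # Assumption: difflib's SequenceMatcher ratio serves as the similarity.
        ratio = SequenceMatcher(None, candidate, word).ratio()
        if ratio >= threshold and ratio > best_ratio:
            best, best_ratio = word, ratio
    return best


def correct(syllables, lexicon_by_count):
    """Greedy scan (an assumed strategy): try 7-syllable windows first, then shorter."""
    out, i = [], 0
    while i < len(syllables):
        for n in range(min(MAX_SYLLABLES, len(syllables) - i), 0, -1):
            chunk = "".join(syllables[i:i + n])
            match = best_match(chunk, lexicon_by_count.get(n, []), n)
            if match is not None:
                out.append(match)
                i += n
                break
        else:
            # No window matched: keep the syllable unchanged and move on.
            out.append(syllables[i])
            i += 1
    return out


def unigram_bleu(candidate, reference):
    """Unigram BLEU: clipped unigram precision times the brevity penalty."""
    if not candidate:
        return 0.0
    ref_counts = Counter(reference)
    clipped = sum(min(count, ref_counts[token])
                  for token, count in Counter(candidate).items())
    precision = clipped / len(candidate)
    if len(candidate) > len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(reference) / len(candidate))
    return penalty * precision


# Hypothetical toy data: syllables from a recognizer and a small lexicon.
recognized = ["com", "pu", "tor", "sci", "ense"]
lexicon = {3: ["computer"], 2: ["science"]}
corrected = correct(recognized, lexicon)
score = unigram_bleu(corrected, ["computer", "science"])
```

In this toy run, "computor" and "sciense" both clear the 80% threshold for two- to three-syllable words and are replaced by their lexicon entries. The abstract does not say how tokens are defined for the BLEU computation, so the word-level tokens used here are also an assumption.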

Article Details

How to Cite
Khumphakdee, T., Waijanya, S., & Promrit, N. (2023). Natural Language Processing to Improve the Errors Caused by the Optical Character Recognition. KKU Science Journal, 51(2), 126–141. https://doi.org/10.14456/kkuscij.2023.12
Section
Research Articles
