PDF Extraction Based on Lexical Analysis for Thai Texts

Santipong Thaiprayoon
Choochart Haruechaiyasak
Alisa Kongthon

Abstract

Today, one of the most widely used digital document file formats is the Portable Document Format, or PDF. The main advantage of PDF is the ability to create and share documents across platforms with different operating systems and hardware environments. Although many tools exist for generating PDF files from text documents, there is no standard tool for converting PDF files back into text with 100% accuracy. The errors are mainly caused by the misplacement of some characters in the resulting text. For the Thai language, the problem is intensified by the complex lexeme structure, i.e., the character composition, of Thai words. In this paper, we first survey PDF extraction tools that are suitable for the Thai language. To further improve the quality of the extracted text, we propose an approach called PDF-PP (Thai PDF Post Processor), which performs text cleansing based on lexical analysis. Experimental results on a large corpus show that the proposed Thai PDF-PP can improve the accuracy of the extracted text to as much as 99.78%.
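To make the idea of character misplacement concrete, the following is a minimal sketch of rule-based cleanup for extracted Thai text. It is not the PDF-PP implementation, whose rules are not given in this abstract. The function name clean_thai_text, the two rules, and the character classes are assumptions chosen for illustration: the sketch removes stray spaces between a consonant and a following combining mark, and swaps a tone mark that appears before an above or below vowel so that the vowel comes first.

```python
import re

# Assumed character classes from the Unicode Thai block (illustrative only,
# not the rule set used by PDF-PP).
VOWEL_MARKS = "\u0E31\u0E34-\u0E3A"  # vowels written above or below a consonant
TONE_MARKS = "\u0E48-\u0E4B"         # the four tone marks
CONSONANTS = "\u0E01-\u0E2E"         # ko kai through ho nokhuk

# A tone mark immediately followed by an above/below vowel is in the wrong order.
_TONE_BEFORE_VOWEL = re.compile(f"([{TONE_MARKS}])([{VOWEL_MARKS}])")

# Spaces or tabs between a consonant and a combining mark are extraction debris.
_SPACE_BEFORE_MARK = re.compile(
    f"([{CONSONANTS}])[ \t]+([{VOWEL_MARKS}{TONE_MARKS}])"
)


def clean_thai_text(text: str) -> str:
    """Apply two hypothetical cleanup rules to extracted Thai text."""
    # Rule 1: reattach a combining mark separated from its consonant.
    text = _SPACE_BEFORE_MARK.sub(r"\1\2", text)
    # Rule 2: reorder "tone mark + vowel" into "vowel + tone mark".
    text = _TONE_BEFORE_VOWEL.sub(r"\2\1", text)
    return text


if __name__ == "__main__":
    garbled = "\u0E17 \u0E48\u0E35"  # consonant, space, tone mark, vowel
    print(clean_thai_text(garbled) == "\u0E17\u0E35\u0E48")
```

A full post-processor would need more rules than these two (for example, for leading vowels and for sara am), and it would need a way to measure accuracy against reference text; the sketch only shows the general shape of a lexical cleanup pass.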

Article Details

Section
ACTIS Article