PDF Extraction Based on Lexical Analysis for Thai Texts
Abstract
Today, one of the most widely used digital document file formats is the Portable Document Format (PDF). The main advantage of PDF is the ability to create and share documents across platforms with different operating systems and hardware environments. Although many tools exist for generating PDF files from text documents, there is no standard tool for converting PDF files into text with 100% accuracy. The errors are mainly caused by the misplacement of some characters in the resulting text. For the Thai language, the problem is intensified by the complex lexeme structure, i.e., character composition, of Thai words. In this paper, we first survey PDF extraction tools that are suitable for the Thai language. To further improve the quality of the extracted text, we propose an approach called PDF-PP (Thai PDF Post Processor), which performs text cleansing based on lexical analysis. Experimental results on a large corpus show that the proposed Thai PDF-PP can improve the accuracy of the extracted text to as much as 99.78%.
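To illustrate the kind of character misplacement described in the abstract, the following is a minimal sketch of one possible cleansing rule. It is not the PDF-PP algorithm itself, which the abstract does not specify. It assumes that one common extraction error is a tone mark emitted before an above or below vowel on the same consonant, and it restores the usual order (vowel mark first, then tone mark). The names `reorder_marks`, `VOWEL_MARKS`, and `TONE_MARKS` are hypothetical and chosen for illustration only.

```python
import re

# Assumption: only these two groups of Thai combining marks are handled.
# Above/below vowels: U+0E31 and U+0E34..U+0E39.
VOWEL_MARKS = "\u0e31\u0e34\u0e35\u0e36\u0e37\u0e38\u0e39"
# Tone marks: U+0E48..U+0E4B.
TONE_MARKS = "\u0e48\u0e49\u0e4a\u0e4b"

# A run of two or more marks from either group.
MARK_RUN = re.compile("[" + VOWEL_MARKS + TONE_MARKS + "]{2,}")


def _rank(ch):
    # Vowel marks sort before tone marks.
    return 0 if ch in VOWEL_MARKS else 1


def reorder_marks(text):
    """Within each run of marks, place vowel marks before tone marks.

    sorted() is stable, so marks within the same group keep their order.
    """
    return MARK_RUN.sub(
        lambda m: "".join(sorted(m.group(), key=_rank)), text
    )


# Example: THO THAHAN + MAI EK + SARA II (tone mark misplaced before the vowel).
broken = "\u0e17\u0e48\u0e35"
fixed = reorder_marks(broken)
```

A real post-processor would need many more rules (for example, handling decomposed SARA AM or spurious spaces inside words); this sketch covers a single ordering rule under the stated assumption.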
Article Details
Section
ACTIS Article
It is the policy of ACTISNU to own the copyright to the published contributions on behalf of the interests of ACTISNU, its authors, and their employers, and to facilitate the appropriate reuse of this material by others. To comply with the Copyright Law, authors are required to sign an ACTISNU copyright transfer form before publication. This form, a copy of which appears in this journal (or website), returns to authors and their employers full rights to reuse their material for their own purposes.