CycleAugment: Efficient data augmentation strategy for handwritten text recognition in historical document images

Sarayut Gonwirat
Olarik Surinta
http://orcid.org/0000-0002-0644-1435

Abstract

Predicting sequence patterns in handwritten text images is a challenging problem due to varied writing styles, insufficient training data, and background noise in the images. The combination of a convolutional neural network (CNN) and a recurrent neural network (RNN), known as the CRNN architecture, is the most successful sequence learning method for handwritten text recognition systems. For handwritten text recognition in historical Thai document images, we first trained nine different CRNN architectures, both from scratch and with transfer learning, to identify the most effective technique. We discovered that transfer learning does not significantly outperform training from scratch. Second, we examined training the CRNN model with basic transformation-based data augmentation techniques: shifting, rotation, and shearing. These techniques indeed yielded higher accuracy than training without augmentation, although the improvement was not significant. The conventional training strategy aims to find the global minimum and does not always resolve overfitting. Third, we therefore proposed a cyclical data augmentation strategy, called CycleAugment, to discover many local minima and prevent overfitting. In each cycle, the training loss decreases rapidly toward a local minimum. The CycleAugment strategy allows the CRNN model to learn from input images both with and without data augmentation, exposing it to many input patterns. Hence, CycleAugment consistently achieved the best performance compared with the other strategies. Finally, we prevented image distortion by applying a simple technique to short word images and achieved better performance on the historical Thai document image dataset.
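As a rough illustration of the ideas in the abstract, the sketch below shows one plausible reading of a cyclical augmentation schedule: training proceeds in fixed-length cycles, the learning rate is annealed within each cycle so the loss drops quickly toward a local minimum, and augmentation is toggled between cycles so the model sees both augmented and clean images. All function names, the cosine annealing schedule, the odd/even toggling rule, and the shift transform are assumptions for illustration only; they are not taken from the paper itself.

```python
import numpy as np

def cosine_lr(epoch_in_cycle, cycle_len, lr_max=0.01):
    # Cosine-annealed learning rate within one cycle; restarting the
    # schedule at each cycle boundary drives a rapid loss decrease
    # toward a (new) local minimum, as in warm-restart training.
    return 0.5 * lr_max * (1.0 + np.cos(np.pi * epoch_in_cycle / cycle_len))

def use_augmentation(epoch, cycle_len):
    # Hypothetical toggling rule: augment on odd-numbered cycles,
    # train on the original (clean) images on even-numbered cycles,
    # so the model learns both input patterns.
    return (epoch // cycle_len) % 2 == 1

def shift_image(img, dx, dy):
    # Simple shift augmentation: translate by (dx, dy) pixels and
    # zero-pad the exposed border (no wrap-around).
    h, w = img.shape
    out = np.zeros_like(img)
    out[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        img[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return out

def cycle_augment_schedule(n_epochs, cycle_len):
    # For each epoch, report (learning rate, augment?) pairs that a
    # training loop could consume.
    return [(cosine_lr(e % cycle_len, cycle_len), use_augmentation(e, cycle_len))
            for e in range(n_epochs)]
```

A training loop would consult `cycle_augment_schedule` each epoch, applying `shift_image` (and, analogously, rotation and shearing) only when the augmentation flag is set. The restart of the learning-rate schedule at each cycle is what lets every cycle converge quickly to its own local minimum.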

Article Details

How to Cite
Gonwirat, S., & Surinta, O. (2022). CycleAugment: Efficient data augmentation strategy for handwritten text recognition in historical document images. Engineering and Applied Science Research, 49(4), 505–520. Retrieved from https://ph01.tci-thaijo.org/index.php/easr/article/view/246869
Section
ORIGINAL RESEARCH
