A novel technique for Thai document plagiarism detection using syntactic parse trees
Abstract
Plagiarism is a serious offense, and under the rules of most Thai universities all parties involved are penalized. The lack of effective tools for detecting plagiarism in the Thai language is a problem for academic and research institutes in Thailand. A practical framework and detection tool would help foster academic integrity and honesty. This paper presents an effective alternative method for detecting plagiarism in Thai academic articles using a syntactic parse tree (SPT) technique. The main concept of this method is the dynamic weighting of each sentence according to the roles of its words. In experiments, SPT was compared empirically with four existing approaches and tools: tri-grams, semantic role labeling (SRL), Turnitin, and Akarawisut. It yielded comparable or higher precision and recall in all four plagiarism cases studied: word-by-word copying, word reordering, modifier insertion, and synonym replacement. SPT shows promise and should be incorporated into similarity comparison tools to improve the accuracy of plagiarism detection in the Thai language.
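The idea of weighting a sentence by the roles of its words can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract gives no formulas, so the role labels, the `ROLE_WEIGHTS` values, and both function names are assumptions made for illustration only. English placeholder tokens stand in for Thai words, which would first need word segmentation and parsing. A tri-gram overlap is included only as a point of contrast.

```python
from typing import Dict, List, Set, Tuple

Token = Tuple[str, str]  # (word, syntactic role); roles would come from a parser

# Hypothetical weights: core roles count more than modifiers (assumed values).
ROLE_WEIGHTS: Dict[str, float] = {
    "subject": 3.0, "verb": 3.0, "object": 3.0, "modifier": 1.0,
}


def role_weighted_similarity(source: List[Token], suspect: List[Token]) -> float:
    """Weighted share of source tokens that reappear, with the same role, in suspect."""
    suspect_set: Set[Token] = set(suspect)
    total = sum(ROLE_WEIGHTS.get(role, 1.0) for _, role in source)
    if total == 0:
        return 0.0
    matched = sum(
        ROLE_WEIGHTS.get(role, 1.0) for word, role in source
        if (word, role) in suspect_set
    )
    return matched / total


def trigram_similarity(a: List[str], b: List[str]) -> float:
    """Jaccard overlap of word tri-grams; sensitive to word order."""
    def grams(words: List[str]) -> Set[Tuple[str, ...]]:
        return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

    ga, gb = grams(a), grams(b)
    if not ga and not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


source = [("student", "subject"), ("copied", "verb"),
          ("report", "object"), ("yesterday", "modifier")]
# Same sentence, reordered and with one inserted modifier.
suspect = [("yesterday", "modifier"), ("student", "subject"),
           ("quickly", "modifier"), ("copied", "verb"), ("report", "object")]

spt_like = role_weighted_similarity(source, suspect)
tri = trigram_similarity([w for w, _ in source], [w for w, _ in suspect])
print(spt_like, tri)
```

In this toy case the role-weighted score is unaffected by the reordering and the inserted modifier, while the tri-gram overlap drops to zero because no three-word window survives. Synonym replacement would additionally require a lexical resource to map words to shared senses, which this sketch does not attempt.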
Article Details
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.