Text plagiarism classification using syntax based linguistic features
Affiliation: 1. Department of Computer Science and Engineering, Amrita School of Engineering, Bengaluru, Amrita Vishwa Vidyapeetham, Amrita University, India; 2. Department of Mathematics, Amrita School of Engineering, Bengaluru, Amrita Vishwa Vidyapeetham, Amrita University, India
Abstract: This work models document-level text plagiarism detection as a binary classification problem: given a suspicious-source document pair, classify it as plagiarized or non-plagiarized. The objective is to explore how effective syntax-based linguistic features, extracted with shallow natural language processing techniques, are for this classification task. Shallow syntactic features, namely part-of-speech tags and chunks, are used after pre-processing and filtering to prune irrelevant information. The work further proposes treating this classification phase as an intermediate stage that follows candidate source retrieval and precedes exhaustive passage-level detection. A two-phase feature selection approach is proposed, which improves classification effectiveness by selecting an appropriate set of features as input to machine-learning-based classifiers. The approach is evaluated under smaller and larger test conditions, using the corpus of plagiarized short answers (PSA) and plagiarism instances collected from the PAN corpus, respectively. Under both test conditions, performance is evaluated using general as well as advanced classification metrics. Another main contribution is an analysis of the dependence and impact of the extracted features with respect to the type and complexity of plagiarism imposed on the documents. The results are compared with two state-of-the-art approaches and significantly outperform these baselines. This in turn reflects the cogency of syntactic linguistic features in document-level plagiarism classification, especially for instances close to manual or real plagiarism scenarios.
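The pair-level classification idea in the abstract can be illustrated with a minimal sketch. It is not the authors' method: it assumes the documents are already part-of-speech tagged, and the tag filter `CONTENT_TAGS`, the two feature names, and the fixed threshold in `classify` are all hypothetical stand-ins. The paper itself uses chunk features, a two-phase feature selection step, and trained machine learning classifiers, none of which are reproduced here.

```python
# Hypothetical tag filter; the abstract does not say which tags are kept.
CONTENT_TAGS = {"NN", "NNS", "VB", "VBD", "VBZ", "JJ", "RB"}


def filter_tagged(tagged):
    """Keep lowercased (word, tag) pairs whose tag is in CONTENT_TAGS."""
    return [(word.lower(), tag) for word, tag in tagged if tag in CONTENT_TAGS]


def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def tag_ngrams(tagged, n=2):
    """Return the sequence of POS-tag n-grams for a tagged document."""
    tags = [tag for _, tag in tagged]
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def pair_features(suspicious, source):
    """Compute illustrative syntax-based features for one document pair."""
    return {
        # Overlap of filtered (word, tag) pairs.
        "word_tag_overlap": jaccard(filter_tagged(suspicious),
                                    filter_tagged(source)),
        # Overlap of POS-tag bigrams over the unfiltered tag sequence.
        "tag_bigram_overlap": jaccard(tag_ngrams(suspicious),
                                      tag_ngrams(source)),
    }


def classify(features, threshold=0.5):
    """Toy decision rule standing in for a trained classifier."""
    score = sum(features.values()) / len(features)
    return "plagiarized" if score >= threshold else "non-plagiarized"


# Tiny hand-tagged examples (invented for illustration only).
source_doc = [("The", "DT"), ("cat", "NN"), ("chased", "VBD"),
              ("the", "DT"), ("mouse", "NN")]
copied_doc = list(source_doc)
unrelated_doc = [("Stocks", "NNS"), ("fell", "VBD"), ("sharply", "RB")]

copied_label = classify(pair_features(copied_doc, source_doc))
unrelated_label = classify(pair_features(unrelated_doc, source_doc))
```

In a realistic setting the feature dictionary would be turned into a vector and passed to a trained classifier; the threshold rule only keeps the sketch self-contained.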
This article is indexed in ScienceDirect and other databases.