首页 | 本学科首页   官方微博 | 高级检索  
     


Cross-language plagiarism detection
Authors:Martin Potthast  Alberto Barrón-Cedeño  Benno Stein  Paolo Rosso
Affiliation:1.Web Technology and Information Systems (Webis),Bauhaus-Universit?t Weimar,Weimar,Germany;2.Natural Language Engineering Lab, ELiRF,Universidad Politécnica de Valencia,Valencia,Spain
Abstract:Cross-language plagiarism detection deals with the automatic identification and extraction of plagiarism in a multilingual setting. In this setting, a suspicious document is given, and the task is to retrieve all sections from the document that originate from a large, multilingual document collection. Our contributions in this field are as follows: (1) a comprehensive retrieval process for cross-language plagiarism detection is introduced, highlighting the differences to monolingual plagiarism detection, (2) state-of-the-art solutions for two important subtasks are reviewed, (3) retrieval models for the assessment of cross-language similarity are surveyed, and, (4) the three models CL-CNG, CL-ESA and CL-ASA are compared. Our evaluation is of realistic scale: it relies on 120,000 test documents which are selected from the corpora JRC-Acquis and Wikipedia, so that for each test document highly similar documents are available in all of the six languages English, German, Spanish, French, Dutch, and Polish. The models are employed in a series of ranking tasks, and more than 100 million similarities are computed with each model. The results of our evaluation indicate that CL-CNG, despite its simple approach, is the best choice to rank and compare texts across languages if they are syntactically related. CL-ESA almost matches the performance of CL-CNG, but on arbitrary pairs of languages. CL-ASA works best on “exact” translations but does not generalize well.
Keywords:
本文献已被 SpringerLink 等数据库收录!
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号