Construction of an aligned monolingual treebank for studying semantic similarity
Authors:Erwin Marsi  Emiel Krahmer
Affiliation:1. Department of Computer and Information Science, Norwegian University of Science and Technology, Sem Sælands vei 7-9, 7491, Trondheim, Norway
2. Tilburg Center for Cognition and Communication (TiCC), Tilburg University, P.O. Box 90153, 5000 LE, Tilburg, The Netherlands
Abstract:Modern paraphrase research would benefit from large corpora with detailed annotations. However, currently these corpora are still thin on the ground. In this paper, we describe the development of such a corpus for Dutch, which takes the form of a parallel monolingual treebank consisting of over 2 million tokens and covering various text genres, including both parallel and comparable text. This publicly available corpus is richly annotated with alignments between syntactic nodes, which are also classified using five different semantic similarity relations. A quarter of the corpus is manually annotated, and this informs the development of an automatic tree aligner used to annotate the remainder of the corpus. We argue that this corpus is the first of this size and kind, and offers great potential for paraphrasing research.
This article is indexed in SpringerLink and other databases.