Construction of an aligned monolingual treebank for studying semantic similarity |
| |
Authors: | Erwin Marsi, Emiel Krahmer |
| |
Affiliation: | 1. Department of Computer and Information Science, Norwegian University of Science and Technology, Sem Sælands vei 7-9, 7491, Trondheim, Norway 2. Tilburg Center for Cognition and Communication (TiCC), Tilburg University, P.O. Box 90153, 5000 LE, Tilburg, The Netherlands
|
| |
Abstract: | Modern paraphrase research would benefit from large corpora with detailed annotations, but such corpora are still scarce. In this paper, we describe the development of such a corpus for Dutch: a parallel monolingual treebank of over 2 million tokens that covers various text genres and includes both parallel and comparable text. This publicly available corpus is richly annotated with alignments between syntactic nodes, which are also classified using five different semantic similarity relations. A quarter of the corpus is manually annotated, and these manual annotations inform the development of an automatic tree aligner, which is used to annotate the remainder of the corpus. We argue that this corpus is the first of this size and kind, and that it offers great potential for paraphrasing research. |
| |
Keywords: | |
This article is indexed in SpringerLink and other databases. |
|