共查询到20条相似文献,搜索用时 592 毫秒
1.
近十年来,依存句法分析由于具有表示形式简单、灵活、分析效率高等特点,得到了学术界广泛关注。为了支持汉语依存句法分析研究,国内同行分别标注了几个汉语依存句法树库。然而,目前还没有一个公开、完整、系统的汉语依存句法数据标注规范,并且已有的树库标注工作对网络文本中的特殊语言现象考虑较少。为此,该文充分参考了已有的数据标注工作,同时结合实际标注中遇到的问题,制定了一个新的适应多领域多来源文本的汉语依存句法数据标注规范。我们制定规范的目标是准确刻画各种语言现象的句法结构,同时保证标注一致性。利用此规范,我们已经标注了约3万句汉语依存句法树库。 相似文献
2.
短语树到依存树的自动转换研究 总被引:1,自引:0,他引:1
不同标注体系的树库之间的相互转换是计算语言学研究的重要内容之一。本文在总结国内外几种树库标注体系及相互转换实践的基础上,结合清华汉语树库(Tsinghua Chinese Treebank ,简称TCT) 标注体系的特点,提出了一种将TCT从短语结构转换成依存结构(Dependency Structure) 的算法。这种算法充分利用了TCT具有的功能、结构的双重标记,转换得到的依存树不仅包含了各个节点之间相互依存的层次关系,更包含了相互依存的两个节点的具体的依存关系类型。我们对转换的效果进行了抽样评估,准确率可以达到97137 %。 相似文献
3.
由于对越南语的研究工作相对较少,因此还没有建立规模相对较大的依存树库。相对于已经拥有了形态丰富、语料成熟的汉语,越南语的依存句法分析要困难得多,所以该文提出了一种借助汉-越双语词对齐语料构建越南语依存树库的方法。首先对汉语-越南语句子对进行词对齐处理,然后对汉语句子进行依存句法分析。最后结合越南语本身的语言特点和有关的语法规则将汉语的依存关系通过汉-越双语词对齐关系映射到越南语句子中,从而生成越南语的依存树库。实验表明,该方法简化了人工收集和标注越南语依存树库的过程,节省了人力和构建树库的时间。实验结果表明,该方法相比采用机器学习的方法准确率明显提高。 相似文献
4.
5.
旨在探索利用语言学手段来提高句法分析精度的可能性.采用MaltParser和自建的汉语依存树库进行相关汉语依存句法分析实验.通过对句法分析结果的分析,找出影响句法分析精度的主要因素,并据此对树库中处理某些语言结构的方式进行修改.然后再对得到的句法分析数据进行进一步分析,以确定所用方法的有效性.结果表明,无标记依存句法分析精度提高了5.5%,有标记依存句法分析精度提高了7.5%. 相似文献
6.
7.
In this work, we report large-scale semantic role annotation of arguments in the Turkish dependency treebank, and present the first comprehensive Turkish semantic role labeling (SRL) resource: Turkish Proposition Bank (PropBank). We present our annotation workflow that harnesses crowd intelligence, and discuss the procedures for ensuring annotation consistency and quality control. Our discussion focuses on syntactic variations in realization of predicate-argument structures, and the large lexicon problem caused by complex derivational morphology. We describe our approach that exploits framesets of root verbs to abstract away from syntax and increase self-consistency of the Turkish PropBank. The issues that arise in the annotation of verbs derived via valency changing morphemes, verbal nominals, and nominal verbs are explored, and evaluation results for inter-annotator agreement are provided. Furthermore, semantic layer described here is aligned with universal dependency (UD) compliant treebank and released to enable more researchers to work on the problem. Finally, we use PropBank to establish a baseline score of 79.10 F1 for Turkish SRL using the mate-tool (an open-source SRL tool based on supervised machine learning) enhanced with basic morphological features. Turkish PropBank and the extended SRL system are made publicly available. 相似文献
8.
9.
汉语依存树库的建设相对其他语言如英语,在规模和质量上还有一些差距。树库标注需要付出很大的人力物力,并且保证树库质量也比较困难。该文尝试通过规则和统计相结合的方法,将宾州汉语短语树库Penn Chinese Treebank转化为哈工大依存树库HIT-IR-CDT的体系结构,从而增大现有依存树库的规模。将转化后的树库加入HIT-IR-CDT,训练和测试依存句法分析器的性能。实验表明,加入少量经转化后的树库后,依存句法分析器的性能有所提高;但加入大量树库后,性能反而下降。经过细致分析,作为一种利用多种树库提高依存句法分析器性能的方法,短语转依存还存在很多需要深入研究的方面。 相似文献
10.
Quy T. Nguyen Yusuke Miyao Ha T. T. Le Nhung T. H. Nguyen 《Language Resources and Evaluation》2018,52(1):269-315
Treebanks are important resources for researchers in natural language processing. They provide training and testing materials so that different algorithms can be compared. However, it is not a trivial task to construct high-quality treebanks. We have not yet had a proper treebank for such a low-resource language as Vietnamese, which has probably lowered the performance of Vietnamese language processing. We have been building a consistent and accurate Vietnamese treebank to alleviate such situations. Our treebank is annotated with three layers: word segmentation, part-of-speech tagging, and bracketing. We developed detailed annotation guidelines for each layer by presenting Vietnamese linguistic issues as well as methods of addressing them. Here, we also describe approaches to controlling annotation quality while ensuring a reasonable annotation speed. We specifically designed an appropriate annotation process and an effective process to train annotators. In addition, we implemented several support tools to improve annotation speed and to control the consistency of the treebank. The results from experiments revealed that both inter-annotator agreement and accuracy were higher than 90%, which indicated that the treebank is reliable. 相似文献
11.
Bharat Ram Ambati Tejaswini Deoskar Mark Steedman 《Language Resources and Evaluation》2018,52(1):67-100
In this paper, we present an approach for automatically creating a combinatory categorial grammar (CCG) treebank from a dependency treebank for the subject–object–verb language Hindi. Rather than a direct conversion from dependency trees to CCG trees, we propose a two stage approach: a language independent generic algorithm first extracts a CCG lexicon from the dependency treebank. An exhaustive CCG parser then creates a treebank of CCG derivations. We also discuss special cases of this generic algorithm to handle linguistic phenomena specific to Hindi. In doing so we extract different constructions with long-range dependencies like coordinate constructions and non-projective dependencies resulting from constructions like relative clauses, noun elaboration and verbal modifiers. 相似文献
12.
Sanjay Chatterji Tanaya Mukherjee Sarkar Pragati Dhang Samhita Deb Sudeshna Sarkar Jayshree Chakraborty Anupam Basu 《Language Resources and Evaluation》2014,48(3):443-477
Dependency grammar is considered appropriate for many Indian languages. In this paper, we present a study of the dependency relations in Bangla language. We have categorized these relations in three different levels, namely intrachunk relations, interchunk relations and interclause relations. Each of these levels is further categorized and an annotation scheme has been developed. Both syntactic and semantic features have been taken into consideration for describing the relations. In our scheme, there are 63 such syntactico–semantic relations. We have verified the scheme by tagging a corpus of 4167 Bangla sentences to create a treebank (KGPBenTreebank). 相似文献
13.
14.
构建藏语依存树库是实现藏语句法分析的重要基础,对藏语本体研究和信息处理具有重要价值。基于此,该文提出了一种基于树库转换的藏语依存树库构建方法。该方法首先扩充了前期构建的藏语短语结构树库,然后根据藏语短语结构树和依存树的特征设计树库转换规则,实现藏语短语结构树到依存结构树的初步转换,最后对自动转换结果进行人工校验,得到了2.2万句藏语依存树。为了对转换结果做出量化评价,该文抽取了依存树库中5%的依存树,对其依存关系进行校验和统计,最终依存关系的准确率达到89.36%,中心词的准确率达到92.09%。此外,该文使用基于神经网络的句法分析模型验证了依存树库的有效性。在该模型上,UAS值和LAS值分别达到83.62%和81.90%。研究证明,使用半自动的树库转换方法能够有效地完成藏语依存树库构建工作。 相似文献
15.
汉语句法树库标注体系 总被引:16,自引:10,他引:16
语料库的句法标注是语料库语言学研究的前沿课题。本文在研究和总结国内外句法树库标注实践的基础上,提出了一套汉语真实文本的句法树标注体系。它以完整的层次结构树为基础,对句法树上的每个非终结符节点都给出两个标记:成分标记和关系标记,形成双标记集的句法信息描述体系。目前,这两个标记集分别包含了16和27个标记,对汉语句子的不同句法组合的外部功能分布和内部组合特点进行了详细描述。在此基础上,我们开发完成了100万词规模的汉语句法树库TCT,对其中各种复杂语言现象的标注实践显示了这套标注体系具有很好的信息覆盖率和语料适应性。 相似文献
16.
17.
18.
19.
Katri Haverinen Jenna Nyblom Timo Viljanen Veronika Laippala Samuel Kohonen Anna Missilä Stina Ojala Tapio Salakoski Filip Ginter 《Language Resources and Evaluation》2014,48(3):493-531
In this paper, we present the final version of a publicly available treebank of Finnish, the Turku Dependency Treebank. The treebank contains 204,399 tokens (15,126 sentences) from 10 different text sources and has been manually annotated in a Finnish-specific version of the well-known Stanford Dependency scheme. The morphological analyses of the treebank have been assigned using a novel machine learning method to disambiguate readings given by an existing tool. As the second main contribution, we present the first open source Finnish dependency parser, trained on the newly introduced treebank. The parser achieves a labeled attachment score of 81 %. The treebank data as well as the parsing pipeline are available under an open license at http://bionlp.utu.fi/. 相似文献