首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
短语树到依存树的自动转换研究   总被引:1,自引:0,他引:1  
不同标注体系的树库之间的相互转换是计算语言学研究的重要内容之一。本文在总结国内外几种树库标注体系及相互转换实践的基础上,结合清华汉语树库(Tsinghua Chinese Treebank ,简称TCT) 标注体系的特点,提出了一种将TCT从短语结构转换成依存结构(Dependency Structure) 的算法。这种算法充分利用了TCT具有的功能、结构的双重标记,转换得到的依存树不仅包含了各个节点之间相互依存的层次关系,更包含了相互依存的两个节点的具体的依存关系类型。我们对转换的效果进行了抽样评估,准确率可以达到97137 %。  相似文献   

2.
汉语句法树库标注体系   总被引:16,自引:10,他引:16  
语料库的句法标注是语料库语言学研究的前沿课题。本文在研究和总结国内外句法树库标注实践的基础上,提出了一套汉语真实文本的句法树标注体系。它以完整的层次结构树为基础,对句法树上的每个非终结符节点都给出两个标记:成分标记和关系标记,形成双标记集的句法信息描述体系。目前,这两个标记集分别包含了16和27个标记,对汉语句子的不同句法组合的外部功能分布和内部组合特点进行了详细描述。在此基础上,我们开发完成了100万词规模的汉语句法树库TCT,对其中各种复杂语言现象的标注实践显示了这套标注体系具有很好的信息覆盖率和语料适应性。  相似文献   

3.
由于对越南语的研究工作相对较少,因此还没有建立规模相对较大的依存树库。相对于已经拥有了形态丰富、语料成熟的汉语,越南语的依存句法分析要困难得多,所以该文提出了一种借助汉-越双语词对齐语料构建越南语依存树库的方法。首先对汉语-越南语句子对进行词对齐处理,然后对汉语句子进行依存句法分析。最后结合越南语本身的语言特点和有关的语法规则将汉语的依存关系通过汉-越双语词对齐关系映射到越南语句子中,从而生成越南语的依存树库。实验表明,该方法简化了人工收集和标注越南语依存树库的过程,节省了人力和构建树库的时间。实验结果表明,该方法相比采用机器学习的方法准确率明显提高。  相似文献   

4.
该文简要回顾了中文信息处理30年的主要成果,以及近20年来中文信息处理中的计算语言学研究的状况。该文分析了汉语与英语的主要差异,讨论了语言的共性与个性。该文表示了对于中文大规模语料的词性标注、树库建设的质疑。该文提出未来的中文语言资源建设的一些设想,期望一些新的尝试,提出以语义取代现有的句法,以深度标注取代现有的浅层标注,具体将包括标注的目标的定点化,内容的多样化,步骤的阶段化,标注人员的大众化、群体化。文章还提出了未来发展的关键点 技术的融合,人本计算。  相似文献   

5.
汉语树库的构建   总被引:11,自引:7,他引:11  
本文讨论了汉语树库构建的若干基础问题, 包括一个适合于自动分析和人工标注的汉语句法标记集、汉语树库加工处理规范和人机互助的树库加工模型, 介绍了一个已经实现的汉语自动句法标注系统, 和在此基础上进行的一些树库构建实验, 最后提出了构建大规模汉语树库的设想。  相似文献   

6.
7.
汉语学习者依存句法树库为非母语者语料提供依存句法分析,对第二语言教学与研究,以及面向第二语言的句法分析、语法改错等相关研究有重要意义.然而,现有的汉语学习者依存句法树库数量较少,且在标注方面仍存在一些问题.为此,该文提出一个依存句法标注规范,搭建在线标注平台,并开展汉语学习者依存句法标注.该文重点介绍了数据选取、标注流...  相似文献   

8.
Dependency grammar is considered appropriate for many Indian languages. In this paper, we present a study of the dependency relations in Bangla language. We have categorized these relations in three different levels, namely intrachunk relations, interchunk relations and interclause relations. Each of these levels is further categorized and an annotation scheme has been developed. Both syntactic and semantic features have been taken into consideration for describing the relations. In our scheme, there are 63 such syntactico–semantic relations. We have verified the scheme by tagging a corpus of 4167 Bangla sentences to create a treebank (KGPBenTreebank).  相似文献   

9.
近十年来,依存句法分析由于具有表示形式简单、灵活、分析效率高等特点,得到了学术界广泛关注。为了支持汉语依存句法分析研究,国内同行分别标注了几个汉语依存句法树库。然而,目前还没有一个公开、完整、系统的汉语依存句法数据标注规范,并且已有的树库标注工作对网络文本中的特殊语言现象考虑较少。为此,该文充分参考了已有的数据标注工作,同时结合实际标注中遇到的问题,制定了一个新的适应多领域多来源文本的汉语依存句法数据标注规范。我们制定规范的目标是准确刻画各种语言现象的句法结构,同时保证标注一致性。利用此规范,我们已经标注了约3万句汉语依存句法树库。  相似文献   

10.
汉语句子的组块分析体系   总被引:26,自引:1,他引:25  
周强  孙茂松  黄昌宁 《计算机学报》1999,22(11):1158-1165
介绍了一种描述能力介于线性词序列和完整句法树表示之间的浅层句法知识描述体系-组块分析体系,并详细讨论了其中两大部分;词界块和成分组的基本内容及其自动识别算法,在此基础上,提出了一种分阶段构造汉语树库的新设想,即先构造组块库,再构造树库,进行了一系列句法分析和知识获取实验,包括1)自然识别汉语最长名词短语;2)自动获取汉语句法知识等。所有这些工作都证明了这种知识描述体系的实用性和有效性。  相似文献   

11.
树库是自然语言处理中一项重要的基础资源,现有树库基本上都是单视图树,支持短语结构语法或者依存语法。该文提出一套基于依存语法的多视图汉语树库标注体系,仅需标注中心语和语法角色两类信息,之后可以自动地推导出描述句法结构所需的短语结构功能和层次信息,从而可以在不增加标注工作量的前提下获得更多语法信息。基于该体系,构建了北京大学多视图汉语树库(PMT)1.0版,含有64000句、140万词,支持短语结构语法和依存语法两个视图。  相似文献   

12.
林丽 《中文信息学报》2013,27(6):201-209
越南是中国的重要邻国,相应的越南语海量信息处理正日益凸显其重要性。参考国内外有关框架语义标注的研究和实践,我们在构建越南语新闻语料库,对越南语文本进行分词和词性标注、命名实体标注等研究的基础上,尝试构建越南语框架语义知识库并初步探索了框架语义标注在越南语新闻事件抽取中的应用。  相似文献   

13.
In this paper, we present an approach for automatically creating a combinatory categorial grammar (CCG) treebank from a dependency treebank for the subject–object–verb language Hindi. Rather than a direct conversion from dependency trees to CCG trees, we propose a two stage approach: a language independent generic algorithm first extracts a CCG lexicon from the dependency treebank. An exhaustive CCG parser then creates a treebank of CCG derivations. We also discuss special cases of this generic algorithm to handle linguistic phenomena specific to Hindi. In doing so we extract different constructions with long-range dependencies like coordinate constructions and non-projective dependencies resulting from constructions like relative clauses, noun elaboration and verbal modifiers.  相似文献   

14.
In the field of constituency parsing, there exist multiple human-labeled treebanks which are built on non-overlapping text samples and follow different annotation standards. Due to the extreme cost of annotating parse trees by human, it is desirable to automatically convert one treebank (called source treebank) to the standard of another treebank (called target treebank) which we are interested in. Conversion results can be manually corrected to obtain higher-quality annotations or can be directly used as additional training data for building syntactic parsers. To perform automatic treebank conversion, we divide constituency parses into two separate levels: the part-of-speech (POS) and syntactic structure (bracketing structures and constituent labels), and conduct conversion on these two levels respectively with a feature-based approach. The basic idea of the approach is to encode original annotations in a source treebank as guide features during the conversion process. Experiments on two Chinese treebanks show that our approach can convert POS tags and syntactic structures with the accuracy of 96.6 and 84.8 %, respectively, which are the best reported results on this task.  相似文献   

15.
本着构建维吾尔语依存树库的目的,该文根据黏着性语言的结构特点及其在依存属性中对依存角色的影响,提出构建维吾尔语依存树库时需要考虑的几点要素。其包含依存粒度的确定、维吾尔语依存关系、标注原则、依存树结构以及标注工具的设计与实现。然后根据《维吾尔语依存树库标注手册》人工标注了3 400多条句子并从三个角度对依存树库信息做了统计分析。  相似文献   

16.
该文针对古汉语文本小、句简短、模式性强的结构特点,提出了一种基于“词-词性”匹配模式获取的快速树库构建方法,将句法标注过程规约为获取候选匹配模式、制定句法转换规则、自动生成句法树和最终人工校对四个步骤。该方法可大大缩减人工标注工作量,节省树库构建的工程成本,且所获取的匹配规则在古汉语教学研究中具有一定的实用价值。  相似文献   

17.
We describe annotation of multiword expressions (MWEs) in the Prague dependency treebank, using several automatic pre-annotation steps. We use subtrees of the tectogrammatical tree structures of the Prague dependency treebank to store representations of the MWEs in the dictionary and pre-annotate following occurrences automatically. We also show a way to measure reliability of this type of annotation.  相似文献   

18.
汉语依存树库的建设相对其他语言如英语,在规模和质量上还有一些差距。树库标注需要付出很大的人力物力,并且保证树库质量也比较困难。该文尝试通过规则和统计相结合的方法,将宾州汉语短语树库Penn Chinese Treebank转化为哈工大依存树库HIT-IR-CDT的体系结构,从而增大现有依存树库的规模。将转化后的树库加入HIT-IR-CDT,训练和测试依存句法分析器的性能。实验表明,加入少量经转化后的树库后,依存句法分析器的性能有所提高;但加入大量树库后,性能反而下降。经过细致分析,作为一种利用多种树库提高依存句法分析器性能的方法,短语转依存还存在很多需要深入研究的方面。  相似文献   

19.
Corpora with high-quality linguistic annotations are an essential component in many NLP applications and a valuable resource for linguistic research. For obtaining these annotations, a large amount of manual effort is needed, making the creation of these resources time-consuming and costly. One attempt to speed up the annotation process is to use supervised machine-learning systems to automatically assign (possibly erroneous) labels to the data and ask human annotators to correct them where necessary. However, it is not clear to what extent these automatic pre-annotations are successful in reducing human annotation effort, and what impact they have on the quality of the resulting resource. In this article, we present the results of an experiment in which we assess the usefulness of partial semi-automatic annotation for frame labeling. We investigate the impact of automatic pre-annotation of differing quality on annotation time, consistency and accuracy. While we found no conclusive evidence that it can speed up human annotation, we found that automatic pre-annotation does increase its overall quality.  相似文献   

20.
In this work, we report large-scale semantic role annotation of arguments in the Turkish dependency treebank, and present the first comprehensive Turkish semantic role labeling (SRL) resource: Turkish Proposition Bank (PropBank). We present our annotation workflow that harnesses crowd intelligence, and discuss the procedures for ensuring annotation consistency and quality control. Our discussion focuses on syntactic variations in realization of predicate-argument structures, and the large lexicon problem caused by complex derivational morphology. We describe our approach that exploits framesets of root verbs to abstract away from syntax and increase self-consistency of the Turkish PropBank. The issues that arise in the annotation of verbs derived via valency changing morphemes, verbal nominals, and nominal verbs are explored, and evaluation results for inter-annotator agreement are provided. Furthermore, semantic layer described here is aligned with universal dependency (UD) compliant treebank and released to enable more researchers to work on the problem. Finally, we use PropBank to establish a baseline score of 79.10 F1 for Turkish SRL using the mate-tool (an open-source SRL tool based on supervised machine learning) enhanced with basic morphological features. Turkish PropBank and the extended SRL system are made publicly available.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号