首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 531 毫秒
1.
2.
We have constructed a large scale and detailed database of lexical types in Japanese from a treebank that includes detailed linguistic information. The database helps treebank annotators and grammar developers to share precise knowledge about the grammatical status of words that constitute the treebank, allowing for consistent large-scale treebanking and grammar development. In addition, it clarifies what lexical types are needed for precise Japanese NLP on the basis of the treebank. In this paper, we report on the motivation and methodology of the database construction.
Melanie SiegelEmail:
  相似文献   

3.
针对现代藏语句法,在参照宾大中文树库的基础上,构建藏语短语句法树库,并建立了树库编辑工具,为藏汉机器翻译服务。在短语句法树库的基础上,提出一种融合藏语句法特征的藏汉机器翻译方法。实验分析结果表明,该方法可以很好地应用于藏汉机器翻译系统。  相似文献   

4.
This article describes how a treebank of ungrammatical sentences can be created from a treebank of well-formed sentences. The treebank creation procedure involves the automatic introduction of frequently occurring grammatical errors into the sentences in an existing treebank, and the minimal transformation of the original analyses in the treebank so that they describe the newly created ill-formed sentences. Such a treebank can be used to test how well a parser is able to ignore grammatical errors in texts (as people do), and can be used to induce a grammar capable of analysing such sentences. This article demonstrates these two applications using the Penn Treebank. In a robustness evaluation experiment, two state-of-the-art statistical parsers are evaluated on an ungrammatical version of Sect. 23 of the Wall Street Journal (WSJ) portion of the Penn treebank. This experiment shows that the performance of both parsers degrades with grammatical noise. A breakdown by error type is provided for both parsers. A second experiment retrains both parsers using an ungrammatical version of WSJ Sections 2–21. This experiment indicates that an ungrammatical treebank is a useful resource in improving parser robustness to grammatical errors, but that the correct combination of grammatical and ungrammatical training data has yet to be determined.  相似文献   

5.
构建藏语依存树库是实现藏语句法分析的重要基础,对藏语本体研究和信息处理具有重要价值。基于此,该文提出了一种基于树库转换的藏语依存树库构建方法。该方法首先扩充了前期构建的藏语短语结构树库,然后根据藏语短语结构树和依存树的特征设计树库转换规则,实现藏语短语结构树到依存结构树的初步转换,最后对自动转换结果进行人工校验,得到了2.2万句藏语依存树。为了对转换结果做出量化评价,该文抽取了依存树库中5%的依存树,对其依存关系进行校验和统计,最终依存关系的准确率达到89.36%,中心词的准确率达到92.09%。此外,该文使用基于神经网络的句法分析模型验证了依存树库的有效性。在该模型上,UAS值和LAS值分别达到83.62%和81.90%。研究证明,使用半自动的树库转换方法能够有效地完成藏语依存树库构建工作。  相似文献   

6.
该文针对古汉语文本小、句简短、模式性强的结构特点,提出了一种基于“词-词性”匹配模式获取的快速树库构建方法,将句法标注过程规约为获取候选匹配模式、制定句法转换规则、自动生成句法树和最终人工校对四个步骤。该方法可大大缩减人工标注工作量,节省树库构建的工程成本,且所获取的匹配规则在古汉语教学研究中具有一定的实用价值。  相似文献   

7.
汉语依存树库的建设相对其他语言如英语,在规模和质量上还有一些差距。树库标注需要付出很大的人力物力,并且保证树库质量也比较困难。该文尝试通过规则和统计相结合的方法,将宾州汉语短语树库Penn Chinese Treebank转化为哈工大依存树库HIT-IR-CDT的体系结构,从而增大现有依存树库的规模。将转化后的树库加入HIT-IR-CDT,训练和测试依存句法分析器的性能。实验表明,加入少量经转化后的树库后,依存句法分析器的性能有所提高;但加入大量树库后,性能反而下降。经过细致分析,作为一种利用多种树库提高依存句法分析器性能的方法,短语转依存还存在很多需要深入研究的方面。  相似文献   

8.
短语树到依存树的自动转换研究   总被引:1,自引:0,他引:1  
不同标注体系的树库之间的相互转换是计算语言学研究的重要内容之一。本文在总结国内外几种树库标注体系及相互转换实践的基础上,结合清华汉语树库(Tsinghua Chinese Treebank ,简称TCT) 标注体系的特点,提出了一种将TCT从短语结构转换成依存结构(Dependency Structure) 的算法。这种算法充分利用了TCT具有的功能、结构的双重标记,转换得到的依存树不仅包含了各个节点之间相互依存的层次关系,更包含了相互依存的两个节点的具体的依存关系类型。我们对转换的效果进行了抽样评估,准确率可以达到97137 %。  相似文献   

9.
In this paper, we present the final version of a publicly available treebank of Finnish, the Turku Dependency Treebank. The treebank contains 204,399 tokens (15,126 sentences) from 10 different text sources and has been manually annotated in a Finnish-specific version of the well-known Stanford Dependency scheme. The morphological analyses of the treebank have been assigned using a novel machine learning method to disambiguate readings given by an existing tool. As the second main contribution, we present the first open source Finnish dependency parser, trained on the newly introduced treebank. The parser achieves a labeled attachment score of 81 %. The treebank data as well as the parsing pipeline are available under an open license at http://bionlp.utu.fi/.  相似文献   

10.
奚建清  罗强 《计算机工程》2007,33(3):172-174
提出了一种基于隐马尔可夫模型(HMM)的介词短语界定模型,通过HMM的介词短语边界自动识别和依存语法错误校正2个处理阶段,较好地完成了对一个经过分词和词性标注的句子进行介词短语界定任务,为更进一步的句法分析工作打下良好的基础。试验结果显示:该模型的识别正确率达到了86.5%(封闭测试)和77.7%(开放测试),取得了令人满意的结果。  相似文献   

11.
In this paper, we present an approach for automatically creating a combinatory categorial grammar (CCG) treebank from a dependency treebank for the subject–object–verb language Hindi. Rather than a direct conversion from dependency trees to CCG trees, we propose a two stage approach: a language independent generic algorithm first extracts a CCG lexicon from the dependency treebank. An exhaustive CCG parser then creates a treebank of CCG derivations. We also discuss special cases of this generic algorithm to handle linguistic phenomena specific to Hindi. In doing so we extract different constructions with long-range dependencies like coordinate constructions and non-projective dependencies resulting from constructions like relative clauses, noun elaboration and verbal modifiers.  相似文献   

12.
汉语树库的构建   总被引:11,自引:7,他引:11  
本文讨论了汉语树库构建的若干基础问题, 包括一个适合于自动分析和人工标注的汉语句法标记集、汉语树库加工处理规范和人机互助的树库加工模型, 介绍了一个已经实现的汉语自动句法标注系统, 和在此基础上进行的一些树库构建实验, 最后提出了构建大规模汉语树库的设想。  相似文献   

13.
树库是自然语言处理中一项重要的基础资源,现有树库基本上都是单视图树,支持短语结构语法或者依存语法。该文提出一套基于依存语法的多视图汉语树库标注体系,仅需标注中心语和语法角色两类信息,之后可以自动地推导出描述句法结构所需的短语结构功能和层次信息,从而可以在不增加标注工作量的前提下获得更多语法信息。基于该体系,构建了北京大学多视图汉语树库(PMT)1.0版,含有64000句、140万词,支持短语结构语法和依存语法两个视图。  相似文献   

14.
Treebanks are important resources for researchers in natural language processing. They provide training and testing materials so that different algorithms can be compared. However, it is not a trivial task to construct high-quality treebanks. We have not yet had a proper treebank for such a low-resource language as Vietnamese, which has probably lowered the performance of Vietnamese language processing. We have been building a consistent and accurate Vietnamese treebank to alleviate such situations. Our treebank is annotated with three layers: word segmentation, part-of-speech tagging, and bracketing. We developed detailed annotation guidelines for each layer by presenting Vietnamese linguistic issues as well as methods of addressing them. Here, we also describe approaches to controlling annotation quality while ensuring a reasonable annotation speed. We specifically designed an appropriate annotation process and an effective process to train annotators. In addition, we implemented several support tools to improve annotation speed and to control the consistency of the treebank. The results from experiments revealed that both inter-annotator agreement and accuracy were higher than 90%, which indicated that the treebank is reliable.  相似文献   

15.
为提升依存分析并分析影响其精度的相关因素,该文构建了大规模中文通用依存树库和中等规模领域依存树库。基于这一系列树库,通过句法分析实验考察质量、规模、领域差异等因素对中文依存分析的影响,实验结果表明: (1)树库规模和质量均与句法分析精度成正相关关系,质量应先于规模因素被优先考虑;(2)通用树库和领域树库之间的差异程度与前者对后者的替代性成相关关系;(3)两种树库混合使用的效果同样与领域差异有关。
  相似文献   

16.
句法分析是自然语言处理的基础技术,主流的由数据驱动的神经网络句法分析模型需要大规模的标注数据,但是通过人工标注扩展树库成本很高,因此如何利用现有标注树库进行数据增强成为研究焦点。在汉语句法分析的数据增强任务中,对于给定的标注树库,要求数据增强所生成的句子满足如下条件: 第一,要求生成句具有多样化且完整的句法树结构;第二,要求生成句具有合理的语义。对此,我们首次提出基于词汇化树邻接语法的数据增强方法。针对第一个需求,该文设计实现基于词汇化树邻接语法的词汇化树抽取算法与句法树合成算法,基于该语法可以在句法树之间进行“接插”和“替换”的操作,从而推导生成新的句法树,并且用语言学的知识保证生成句符合语法规则且具有完整的句法树结构。针对第二个需求,该文利用语言模型对生成句进行语义合理性评估,选取语义合理的句子作为最终的增强数据,从而获取高质量的标注树库。我们以汉语为例开展研究,在汉语树库CTB5上进行句法分析的数据增强评测实验。实验结果显示,在小样本(CTB5的20%)实验中,通过该方法得到的增强数据使依存句法分析和成分句法分析的精度分别提高1.39%和2.14%。在鲁棒性实验中,该文通过构建扩展测试集进行评测实验,在扩展测试集上,通过该方法得到的增强数据使依存句法分析和成分句法分析的精度分别提高1.43%和0.44%,表现出更好的鲁棒性。  相似文献   

17.
旨在探索利用语言学手段来提高句法分析精度的可能性.采用MaltParser和自建的汉语依存树库进行相关汉语依存句法分析实验.通过对句法分析结果的分析,找出影响句法分析精度的主要因素,并据此对树库中处理某些语言结构的方式进行修改.然后再对得到的句法分析数据进行进一步分析,以确定所用方法的有效性.结果表明,无标记依存句法分析精度提高了5.5%,有标记依存句法分析精度提高了7.5%.  相似文献   

18.
汉语句法树库标注体系   总被引:16,自引:10,他引:16  
语料库的句法标注是语料库语言学研究的前沿课题。本文在研究和总结国内外句法树库标注实践的基础上,提出了一套汉语真实文本的句法树标注体系。它以完整的层次结构树为基础,对句法树上的每个非终结符节点都给出两个标记:成分标记和关系标记,形成双标记集的句法信息描述体系。目前,这两个标记集分别包含了16和27个标记,对汉语句子的不同句法组合的外部功能分布和内部组合特点进行了详细描述。在此基础上,我们开发完成了100万词规模的汉语句法树库TCT,对其中各种复杂语言现象的标注实践显示了这套标注体系具有很好的信息覆盖率和语料适应性。  相似文献   

19.
In this article we present a Spanish grammar implemented in the Linguistic Knowledge Builder system and grounded in the theoretical framework of Head-driven Phrase Structure Grammar. The grammar is being developed in an international multilingual context, the DELPH-IN Initiative, contributing to an open-source repository of software and linguistic resources for various Natural Language Processing applications. We will show how we have refined and extended a core grammar, derived from the LinGO Grammar Matrix, to achieve a broad-coverage grammar. The Spanish DELPH-IN grammar is the most comprehensive grammar for Spanish deep processing, and it is being deployed in the construction of a treebank for Spanish of 60,000 sentences based in a technical corpus in the framework of the European project METANET4U (Enhancing the European Linguistic Infrastructure, GA 270893GA; http://www.meta-net.eu/projects/METANET4U/.) and a smaller treebank of about 15,000 sentences based in a corpus from the press.  相似文献   

20.
In the field of constituency parsing, there exist multiple human-labeled treebanks which are built on non-overlapping text samples and follow different annotation standards. Due to the extreme cost of annotating parse trees by human, it is desirable to automatically convert one treebank (called source treebank) to the standard of another treebank (called target treebank) which we are interested in. Conversion results can be manually corrected to obtain higher-quality annotations or can be directly used as additional training data for building syntactic parsers. To perform automatic treebank conversion, we divide constituency parses into two separate levels: the part-of-speech (POS) and syntactic structure (bracketing structures and constituent labels), and conduct conversion on these two levels respectively with a feature-based approach. The basic idea of the approach is to encode original annotations in a source treebank as guide features during the conversion process. Experiments on two Chinese treebanks show that our approach can convert POS tags and syntactic structures with the accuracy of 96.6 and 84.8 %, respectively, which are the best reported results on this task.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号