Similar Literature
1.
In this paper, we propose a new approach for automatically acquiring translation templates from unannotated bilingual spoken language corpora. Two basic algorithms are adopted: a grammar induction algorithm, and an alignment algorithm using bracketing transduction grammar. The approach is unsupervised, statistical, and data-driven, and employs no parsing procedure. The acquisition procedure consists of two steps. First, semantic groups and phrase structure groups are extracted from both the source language and the target language. Second, an alignment algorithm based on bracketing transduction grammar aligns the phrase structure groups. The aligned phrase structure groups are post-processed, yielding translation templates. Preliminary experimental results show that the algorithm is effective.

2.
This paper presents a robust parsing approach which is designed to address the issue of syntactic errors in text. The approach is based on the concept of an error grammar, which is a grammar of ungrammatical sentences. An error grammar is derived from a conventional grammar on the basis of an analysis of a corpus of observed ill-formed sentences. A robust parsing algorithm is presented which is applied after a conventional bottom-up parsing algorithm has failed. This algorithm combines a rule from the error grammar with rules from the normal grammar to arrive at a parse for an ungrammatical sentence. This algorithm is applied to 50 test sentences, with encouraging results.

3.
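As a basic language database and knowledge base, a corpus is the foundation on which natural language processing methods are built. With the widespread use of statistical methods in natural language processing, corpus construction has become an important research topic. Automatic word segmentation is an indispensable preliminary step for syntactic parsing, and its performance directly affects parsing quality. Based on a statistical analysis of an 850,000-byte Tibetan corpus and a study of the distribution and grammatical functions of Tibetan words, this paper introduces the model of a dictionary-based automatic Tibetan word segmentation system and presents the structure of the segmentation dictionary, the case-particle chunking algorithm, and the restoration algorithm. The system lays a foundation for research on Tibetan input methods, the construction of Tibetan electronic dictionaries, Tibetan character and word frequency statistics, the design and implementation of search engines, the development of machine translation systems, network information security, Tibetan corpus construction, and Tibetan semantic analysis.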
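
The abstract above does not spell out its chunking and restoration algorithms, so they cannot be reproduced here. As a stand-in, the following minimal sketch shows forward maximum matching, a common baseline for dictionary-based segmentation. The function name, the `max_len` parameter, and the toy lexicon are all assumptions made for illustration; the input is a list of syllables and the lexicon is a set of syllable tuples.

```python
def forward_max_match(syllables, lexicon, max_len=4):
    """Greedy longest-match segmentation (a generic baseline, not the paper's algorithm)."""
    tokens = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first and shrink until the lexicon matches.
        for size in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = tuple(syllables[i:i + size])
            # A single syllable is always accepted so the loop makes progress.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy data: Latin letters stand in for Tibetan syllables.
toy_lexicon = {("a", "b"), ("b", "c", "d")}
segments = forward_max_match(["a", "b", "c", "d"], toy_lexicon)
```

Because the match is greedy from the left, the sketch commits to `("a", "b")` and never considers `("b", "c", "d")`; this kind of ambiguity is one reason practical systems add further processing on top of plain dictionary lookup.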

4.
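Structural Analysis and Extraction of Chinese Term Definitions   Total citations: 13, self-citations: 2, citations by others: 13
The work described in this paper is an applied study built on research in Chinese syntactic parsing, and it examines from a theoretical standpoint how terms are defined. The forms that term definitions take provide template structures and construction patterns in terms of Chinese grammatical structure; they can serve as a data foundation for knowledge discovery research and as a grammatical knowledge system for a specific domain. We segment and part-of-speech tag a corpus from the electronics and computer domains, apply a syntactic parsing tool to identify the phrasal constituents of each sentence, and, based on Chinese sentence patterns, summarize the structural characteristics of term definitions and automatically extract definition templates. Finally, using the data and concept descriptions already established, we give an algorithm for term discovery.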

5.
In a statistical machine translation system (SMTS), decoding is the process of finding the most likely translation under a statistical model, according to previously learned parameters. The success of an SMTS depends strongly on the quality of its decoder. Most SMTSs published in the current literature use approaches based on traditional optimization methods and heuristics. Over the last few years, however, the use of metaheuristics has increased rapidly. These techniques have been shown to solve difficult search problems efficiently in a wide range of applications.

This paper proposes a new approach based on evolutionary hybrid algorithms to translate sentences in a specific technical context. The algorithm has been enhanced by adaptive parameter control. The tests are carried out on Spanish sentences, which are translated into English.

The experimental results validate the superior performance of our method compared with a statistical greedy decoder. We also compare our new approach with existing public-domain general translators.

6.
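Syntactic parsing is a fundamental technology in natural language processing. Mainstream data-driven neural parsing models require large-scale annotated data, but extending treebanks through manual annotation is expensive, so how to use existing annotated treebanks for data augmentation has become a research focus. In data augmentation for Chinese parsing, given an annotated treebank, the generated sentences must satisfy two conditions: first, they must have diverse and complete syntactic tree structures; second, they must be semantically plausible. To this end, we propose, for the first time, a data augmentation method based on lexicalized tree-adjoining grammar. For the first requirement, the paper designs and implements a lexicalized-tree extraction algorithm and a syntactic-tree synthesis algorithm based on lexicalized tree-adjoining grammar. With this grammar, "adjunction" and "substitution" operations can be performed between syntactic trees to derive new trees, and linguistic knowledge ensures that the generated sentences are grammatical and have complete syntactic tree structures. For the second requirement, the paper uses a language model to evaluate the semantic plausibility of the generated sentences and selects the plausible ones as the final augmented data, yielding a high-quality annotated treebank. Taking Chinese as the example, we run data augmentation experiments for parsing on the Chinese treebank CTB5. In the low-resource setting (20% of CTB5), the augmented data improve the accuracy of dependency parsing and constituency parsing by 1.39% and 2.14%, respectively. In the robustness experiment, the paper builds an extended test set for evaluation; on it, the augmented data improve the accuracy of dependency parsing and constituency parsing by 1.43% and 0.44%, respectively, demonstrating better robustness.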

7.
The Gsearch system allows the selection of sentences by syntactic criteria from text corpora, even when these corpora contain no prior syntactic markup. This is achieved by means of a fast chart parser, which takes as input a grammar and a search expression specified by the user. Gsearch features a modular architecture that can be extended straightforwardly to give access to new corpora. The Gsearch architecture also allows interfacing with external linguistic resources (such as taggers and lexical databases). Gsearch can be used with graphical tools for visualizing the results of a query.

8.
The importance of the parsing task for NLP applications is well understood. However, developing parsers remains difficult because of the complexity of the Arabic language. Most parsers are based on syntactic grammars that describe the syntactic structures of a language. The development of these grammars is laborious and time-consuming. In this paper we present our method for building an Arabic parser based on an induced grammar, a PCFG. We first induce the PCFG from an Arabic Treebank. Then, we implement the parser that assigns a syntactic structure to each input sentence. The parser is tested on 1650 sentences extracted from the treebank. We calculate the precision, recall, and F-measure. Our experimental results show the efficiency of the proposed parser for parsing Modern Standard Arabic sentences (precision: 83.59%, recall: 82.98%, F-measure: 83.23%).
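
The abstract does not say how the PCFG is induced. A common approach, assumed here, is relative-frequency estimation: count each production in the treebank and divide by the count of its left-hand side. The sketch below illustrates that idea on a hypothetical two-tree English toy treebank; the nested-tuple tree format and the function names are assumptions for illustration, not the authors' implementation or data.

```python
from collections import Counter


def extract_rules(tree):
    """Yield (lhs, rhs) productions from a tree written as (label, child, ...)."""
    label, *children = tree
    # A child is either a terminal string or a subtree whose first item is its label.
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from extract_rules(child)


def induce_pcfg(treebank):
    """Estimate P(lhs -> rhs) as count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter()
    lhs_counts = Counter()
    for tree in treebank:
        for lhs, rhs in extract_rules(tree):
            rule_counts[(lhs, rhs)] += 1
            lhs_counts[lhs] += 1
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}


# Hypothetical toy treebank (English words used only for readability).
toy_treebank = [
    ("S", ("NP", "she"), ("VP", ("V", "reads"))),
    ("S", ("NP", "he"), ("VP", ("V", "reads"), ("NP", "books"))),
]
pcfg = induce_pcfg(toy_treebank)
```

In this toy grammar the probabilities of all rules sharing a left-hand side sum to 1, which is the defining property of a PCFG; a real treebank would additionally need preprocessing that the sketch omits.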

9.
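To address the shortcomings of traditional machine-learning methods in protein-protein interaction information extraction, an extraction method that incorporates shallow parsing is proposed. The method first applies shallow parsing to candidate sentences, including phrase chunking, appositive analysis, coordinate-structure analysis, and sentence splitting. After this step, each sentence is divided into several separate grammatical units. A maximum-entropy classifier is then applied to each grammatical unit to extract protein interaction information. The method achieves an F1 score of 62.1% on the BC-PPI corpus. Comparative experiments show that it effectively reduces false positives and false negatives and improves extraction performance.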

10.
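Data-Oriented Parsing Techniques   Total citations: 7, self-citations: 1, citations by others: 7
Data-Oriented Parsing (DOP) was first proposed by Scha (1990). The technique embodies the hypothesis that human language comprehension and production depend on concrete past language experience rather than on abstract grammar rules. The DOP framework can be divided into three parts: (1) building an annotated corpus that contains previously successful analyses as language experience; (2) extracting fragment units from the corpus to construct the analysis of new language input; (3) computing the probability of the analysis. The DOP model is built on a corpus containing a large number of linguistic phenomena, and it treats the annotated corpus as a grammar. When a new linguistic phenomenon is input, the system assembles an analysis by combining fragment units from the corpus, and the most probable analysis is estimated from the co-occurrence frequencies of all fragment units. This paper discusses in detail corpus annotation, the definition of fragment units, compositional analysis, and probability computation.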

11.
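Syntactic parsing is one of the important fundamental research problems in natural language processing. In recent years, parsing methods based on statistical learning models have attracted wide attention, and many models and algorithms have been proposed. Starting from the types of learning models and algorithms used, this paper systematically summarizes and classifies the mainstream and state-of-the-art methods, focusing on analyzing and comparing the ideas behind each class of models and algorithms, and surveys the current state of research on Chinese parsing. Finally, it discusses future research directions and trends in parsing.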

12.
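The methods currently used to evaluate the output quality of machine translation systems are mainly BLEU (proposed by IBM), TER, and METEOR, which base their judgments on features such as lexical recurrence rate, the edit distance between the translation and a reference translation, and linguistic knowledge, respectively. These methods have limitations when it comes to judging the perplexity of Chinese sentences. This paper therefore proposes deriving the perplexity of a Chinese sentence on the basis of dependency parsing, by analyzing both the syntax and the semantics of the sentence and of its main trunk. Experiments show that the accuracy of this method is 4% higher than that of a BLEU method improved by translation weighting.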
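
To make the "lexical recurrence" idea behind BLEU concrete, here is a minimal sketch of its core component, the modified (clipped) n-gram precision, for a single reference. It deliberately omits the brevity penalty and the geometric mean over n-gram orders, and it is not the dependency-based method the abstract proposes. The function names are assumptions for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


candidate = "the the the the the the the".split()
reference = "the cat is on the mat".split()
unigram_precision = modified_precision(candidate, reference, 1)
```

Here the candidate repeats "the" seven times but the reference contains it only twice, so clipping gives 2/7 rather than 7/7. The example also shows why surface overlap alone says little about whether a sentence is well formed.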

13.
GTB (the Grammar Tool Box) is the tool that underpins our investigations into generalised parsing. Our goal is to produce a system that supports systematic investigation of various styles of generalised parsing in a way that allows meaningful comparisons between them in a repeatable and easily accessible fashion whilst also allowing: (i) new theoretical ideas to be generated and explored; (ii) production quality parsers to be generated and (iii) humane pedagogy. GTB comprises a language (LC) with various kinds of built-in grammar and automata related objects, and a set of black-box methods written in C++ that provide implementations of grammar transforms, automata construction algorithms, parsing and recognition algorithms, and a variety of visualisation aids. In this paper we focus on the overall rationale for the GTB framework; the GTB design goals; and some detailed operational flows that are supported by GTB.

14.
In this paper, we consider probabilistic context-free grammars, a class of generative devices that has been successfully exploited in several applications of syntactic pattern matching, especially in statistical natural language parsing. We investigate the problem of training probabilistic context-free grammars on the basis of distributions defined over an infinite set of trees or an infinite set of sentences by minimizing the cross-entropy. This problem has applications in cases of context-free approximation of distributions generated by more expressive statistical models. We show several interesting theoretical properties of probabilistic context-free grammars that are estimated in this way, including the previously unknown equivalence between the grammar cross-entropy with the input distribution and the so-called derivational entropy of the grammar itself. We discuss important consequences of these results involving the standard application of the maximum-likelihood estimator on finite tree and sentence samples, as well as other finite-state models such as hidden Markov models and probabilistic finite automata.

15.
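A HowNet-Based Method for Determining the Sentiment Orientation of Chinese Sentences   Total citations: 4, self-citations: 0, citations by others: 4
Dang Lei (党蕾), Zhang Lei (张蕾). Application Research of Computers (《计算机应用研究》), 2010, 27(4): 1370-1372
To address the low accuracy of HowNet-based methods for determining the sentiment orientation of Chinese sentences, a method combining negation pattern matching with dependency parsing is proposed. The study analyzes modifier polarity and negation-sharing patterns, determines quantitative values for modifiers and extended polarity as well as the scope of negation sharing, introduces dependency-grammar distance as a factor in computing sentiment orientation, and improves the sentence polarity algorithm after negation pattern matching. Experimental results show that the method performs well.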

16.
17.
Controlled natural languages (CNL) with a direct mapping to formal logic have been proposed to improve the usability of knowledge representation systems, query interfaces, and formal specifications. Predictive editors are a popular approach to solve the problem that CNLs are easy to read but hard to write. Such predictive editors need to be able to “look ahead” in order to show all possible continuations of a given unfinished sentence. Such lookahead features, however, are difficult to implement in a satisfying way with existing grammar frameworks, especially if the CNL supports complex nonlocal structures such as anaphoric references. Here, methods and algorithms are presented for a new grammar notation called Codeco, which is specifically designed for controlled natural languages and predictive editors. A parsing approach for Codeco based on an extended chart parsing algorithm is presented. A large subset of Attempto Controlled English has been represented in Codeco. Evaluation of this grammar and the parser implementation shows that the approach is practical, adequate and efficient.

18.
In this paper, we propose a new natural language acquisition model (called EBNLA) based on explanation-based learning (EBL). To apply EBL to the natural language acquisition domain, suitable universal linguistic principles are incorporated as domain theory. The domain theory consists of two parts: static and dynamic. The static part, which is assumed to be invariant and innate to the model, includes theta theory in government-binding theory and universal feature instantiation principles in generalized phrase structure grammar. The dynamic part contains context-free grammar rules as well as syntactic and thematic features of lexicons. In parsing (problem solving), both parts work together to parse input sentences. When parsing fails, learning is triggered to enrich and generalize the dynamic part by obeying the principles in the static part. By introducing EBL and the universal linguistic principles, portability of the model and learnability of knowledge in the real-world natural language acquisition domain can be improved.

19.
This paper describes a parsing algorithm for Tree Adjoining Grammar (TAG) and its parallel implementation on the Connection Machine. TAG is a formalism for natural language that employs trees as the basic grammar structures. Parsing involves the application of two operations, called adjunction and substitution, to produce derived tree structures. Sequential parsing algorithms for TAGs run in time quadratic in the grammar size, which is impractical for the very large grammars currently being developed for natural language. This paper presents two parallel algorithms, one running in time nearly linear in the grammar size, and the other running in time logarithmic in the grammar size. Both parallel algorithms were implemented on a Connection Machine CM-2 and performance measurements were obtained for varying grammar sizes. This research was supported in part by NSF Grant BNS-9022010, by the ARO Center for Excellence in Artificial Intelligence, University of Pennsylvania, and by the Army High Performance Computing Research Center (AHPCRC), University of Minnesota.

20.
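A Java-Based Stack Decoder for Natural Language Translation   Total citations: 1, self-citations: 1, citations by others: 1
Decoding is an important step in a statistical natural language translation system. The decoder's task is to use the language model and translation model learned from training text to determine the most likely translation of a source sentence. Its input is the translation model, the language model, and a source-language sentence; its output is the most likely target sentence (translation) for that source sentence. Because the number of possible target sentences is very large, a decoding algorithm can usually search only a small portion of them. This paper presents a decoder based on the stack algorithm and implemented in Java. The Java platform makes cross-platform deployment convenient and is highly secure, open, and robust. The key points in implementing the decoder are the decoding algorithm and the choice of parameters.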
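
The decoder described above is written in Java and its details are not given in the abstract. The following Python sketch only illustrates the general idea of stack decoding under strong simplifying assumptions: translation is monotone and word-for-word, the language model is a bigram table with a fixed fallback probability, and one stack is kept per number of source words covered. The function name, the `stack_size` parameter, the fallback value, and the Spanish-English toy models are all hypothetical.

```python
import heapq
import math


def stack_decode(source, tm, lm, stack_size=5, unseen_lm=1e-6):
    """Simplified stack decoder: stacks[k] holds hypotheses covering k source words."""
    # A hypothesis is (log probability, tuple of target words so far).
    stacks = [[] for _ in range(len(source) + 1)]
    stacks[0].append((0.0, ("<s>",)))
    for k, word in enumerate(source):
        # Pruning: only the best `stack_size` hypotheses are expanded,
        # so just a small part of the search space is explored.
        for score, target in heapq.nlargest(stack_size, stacks[k]):
            for translation, p_tm in tm.get(word, {}).items():
                p_lm = lm.get((target[-1], translation), unseen_lm)
                new_score = score + math.log(p_tm) + math.log(p_lm)
                stacks[k + 1].append((new_score, target + (translation,)))
    if not stacks[-1]:
        return None  # some source word had no translation option
    _, best_target = max(stacks[-1])
    return list(best_target[1:])  # drop the sentence-start marker


# Hypothetical toy models.
toy_tm = {
    "casa": {"house": 0.8, "home": 0.2},
    "verde": {"green": 0.9, "unripe": 0.1},
}
toy_lm = {
    ("<s>", "house"): 0.5,
    ("<s>", "home"): 0.5,
    ("house", "green"): 0.4,
    ("home", "green"): 0.1,
}
best = stack_decode(["casa", "verde"], toy_tm, toy_lm)
```

The `stack_size` parameter is the kind of setting the abstract alludes to when it says parameter choice matters: a small value speeds up decoding but can prune the hypothesis that would have scored best overall. Because the sketch is monotone, the output keeps source word order; handling reordering would require tracking source coverage in each hypothesis.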
