Similar literature
 20 similar documents found; search took 15 ms
1.
    
The growing number of large data sets generated by information technologies provides a great opportunity to better understand emerging topics in human society. Retrieving real‐world events from such data, particularly free‐text data, is a complicated task in Natural Language Processing and Location‐based Social Networks. In this work, we propose a new approach that recognizes geo‐referenced high‐level events/activities mentioned in web sources by adopting open gazetteers: OpenStreetMap and Google Maps. Our approach, demonstrated on sampled news articles, identifies events associated with the relevant topics using latent Dirichlet allocation. This research is an essential step towards recommendation systems, urban planning, and monitoring.

2.
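In current statistical approaches to translation, the size of the bilingual corpus and the accuracy of word alignment strongly affect the performance of the translation system. Large corpora can improve word alignment accuracy and system performance, but at the cost of a heavier system load; research on statistical machine translation therefore looks for other ways to improve performance in addition to using large corpora. To address this, a method for applying a bilingual dictionary in statistical machine translation is proposed. It improves word alignment accuracy, yields higher-quality translations, and alleviates the data sparsity problem to some extent.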

3.
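In conference settings, translating a speaker's speech into text in another language by means of speech recognition and machine translation is important for cross-lingual communication and has become an active research topic. This paper addresses the translation of technical terms and industry jargon that arise from the domain-specific nature of conferences, and proposes a domain personalization method that incorporates external dictionary knowledge. Specifically, it first adopts an encoding strategy that combines placeholders with concatenation-based fusion; by introducing external dictionary knowledge, this improves translation accuracy for entity words and technical terms while preserving the fluency of the output. Second, it proposes a classification-based adaptive strategy that personalizes domain-specific side-branch parameters, improving translation quality in conference-related domains while maintaining general-domain quality. Finally, based on these schemes, a domain-personalized automatic training system is designed. Experiments on Chinese-English sports, business, and medical conference translation tasks show that the system improves BLEU by 9.22 points on average without degrading general-domain translation.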

4.
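A method for building parallel corpora for machine translation of low-resource languages   Total citations: 1, self-citations: 0, citations by others: 1
The training effectiveness of neural machine translation models depends largely on the size and quality of the parallel corpus. Apart from a few widely used languages, the construction of high-quality parallel corpora between Chinese and low-resource languages has lagged behind. Existing parallel corpora for low-resource languages are mostly built from web resources using automatic sentence alignment, and have many limitations in text quality and domain coverage. Manual translation can produce high-quality parallel corpora, but relevant experience and methods are lacking. This paper, starting from machine...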

5.
The real difficulty in developing practical NLP systems is that we do not know in advance what actual instances of knowledge should be used in the application system, even though we know in advance what types of knowledge are required. An effective method for extracting linguistic knowledge from corpora is needed. We propose automatic linguistic knowledge acquisition from sublanguage corpora. The system combines existing linguistic knowledge and human intervention with corpus-based techniques. The algorithm involves a Gradual Approximation, which works to converge linguistic knowledge gradually towards desirable results. We conducted three experiments. The first experiment revealed the characteristics of this algorithm, and the others demonstrated its effectiveness on a real corpus. The results show the algorithm is promising, though some problems remain: the practical problem of setting the parameters, the formalism problem of including more linguistic features, and the combination with other linguistic clues for further development. We would like to continue the research to perform further experiments and to improve the algorithm.

6.
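A title reflects the soul of an article; grasping the title accurately makes it possible to understand the article's central content quickly. This paper uses a statistical machine translation approach to build a machine translation platform, uses the platform to translate titles in the aviation domain, and evaluates it in open and closed tests with the NIST international evaluation tool. The results show that the statistical approach is effective for translating domain-specific titles.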

7.
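A disambiguation method for polysemous words based on Bayesian classification and a machine-readable dictionary   Total citations: 3, self-citations: 0, citations by others: 3
Polysemy is pervasive in natural language, and the success rate of word sense disambiguation is an important performance indicator for natural language processing software such as machine translation, information retrieval, and text classification. A disambiguation method for polysemous words based on Bayesian classification and a machine-readable dictionary is proposed; it resolves ambiguity through training on a small corpus together with the semantic definitions of the ambiguous word in the machine-readable dictionary. Experiments show that the algorithm achieves high disambiguation accuracy when the size of the annotated corpus is limited.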
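
The abstract above gives no implementation details, so the following is only a minimal sketch of the general idea: a naive Bayes sense classifier trained on a small labeled corpus, with each sense's dictionary definition added as one extra pseudo-context. The function names, the add-one smoothing, and the way glosses are mixed in are all assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter, defaultdict


def train(labeled_contexts, glosses):
    """Build naive Bayes counts.

    labeled_contexts: list of (sense, [context words]) from a small corpus.
    glosses: {sense: [definition words]} taken from a dictionary.
    """
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    for sense, words in labeled_contexts:
        sense_counts[sense] += 1
        word_counts[sense].update(words)
    # Assumption: each dictionary gloss counts as one extra context of its sense.
    for sense, words in glosses.items():
        sense_counts[sense] += 1
        word_counts[sense].update(words)
    vocab = {w for counts in word_counts.values() for w in counts}
    return sense_counts, word_counts, vocab


def disambiguate(context, model):
    """Return the sense with the highest add-one-smoothed log posterior."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, n in sense_counts.items():
        score = math.log(n / total)
        denom = sum(word_counts[sense].values()) + len(vocab)
        for w in context:
            score += math.log((word_counts[sense][w] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


# Toy English example for the ambiguous word "bank" (hypothetical data).
labeled = [
    ("finance", ["money", "deposit", "loan"]),
    ("river", ["water", "shore", "fishing"]),
]
glosses = {
    "finance": ["institution", "money", "account"],
    "river": ["land", "beside", "water"],
}
model = train(labeled, glosses)
print(disambiguate(["deposit", "money"], model))
```

Treating glosses as pseudo-contexts is one simple way to let dictionary knowledge compensate for a small annotated corpus; other weightings are equally plausible.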

8.
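Word sense selection based on X-structure uses the X-structure in which a word occurs and compares it with the X-structures in the usage section of a dictionary; the sense of the word is determined by comparing the structures and the similarity of the other words in them, with word similarity computed using WordNet. The method needs only a few training examples and avoids the need for large corpora and the data sparsity problems of traditional methods based on word co-occurrence.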

9.
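Automatic paraphrasing of Chinese sentences   Total citations: 4, self-citations: 1, citations by others: 3
In transfer-based spoken-language machine translation, the diversity and irregularity of spoken language increase the processing burden on the transfer module. In addition, the lack of bilingual corpora and of linguists who know both languages makes developing translation knowledge difficult or costly. To address these problems, we propose automatically paraphrasing source-language sentences before translation, aiming to spread the burden of the transfer module by strengthening source-language processing. This paper describes the development of the Chinese sentence paraphrasing module in a Chinese-Japanese spoken-language machine translation system. After analyzing the goals of paraphrasing spoken sentences, the authors propose a paraphrasing method based on template matching and a semi-automatic method for acquiring paraphrase templates from a paraphrase corpus. The design and implementation of the paraphrasing module, together with the evaluation experiments and results, are also presented.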

10.
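This paper introduces the current state of multilingual development in Xinjiang, covering Uyghur, Kazakh, and Kyrgyz corpora, machine translation, and Uyghur speech recognition, with a focus on Xinjiang's research institutions for multilingual intelligent processing and each institution's main research directions and topics.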

11.
This paper presents an approach to avoiding translation by utilising a context-sensitive dictionary look-up system to assist in the comprehension of on-line texts in foreign languages. The system employs state-of-the-art finite-state technology for the context analysis and uses converted bilingual printed dictionaries as its primary resource of lexical information. The printed dictionaries have been opened up for the system by analysing the dictionary structure and parsing the type-setting tapes. The lexical information has been validated and augmented by corpus-based lexicographic revision, paying special attention to the treatment of multiword lexemes, for which formalised local grammars have been added to the augmented dictionaries. Although the system is currently implemented as a device for comprehension assistance, the same technology and methods can be employed to select information from the dictionaries relevant for translation processes.

12.
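Knowledge-based post-processing for Manchu recognition   Total citations: 1, self-citations: 0, citations by others: 1
To improve the overall text recognition rate for handwritten Manchu, a Manchu linguistic knowledge base is proposed that combines rule-based and statistical methods, relying mainly on a constructed corpus and supplemented by rules. It is applied to the posterior probability statistics of candidate characters in the results of handwritten Manchu character recognition. Tests on a small sample set indicate that the method is fairly effective.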

13.
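Bilingual parallel corpora are an important foundation for building high-quality statistical machine translation systems. Unlike the traditional strategy of improving translation quality by enlarging the parallel corpus, this paper aims to exploit the potential of existing resources as fully as possible to improve statistical machine translation performance. It proposes a method for selecting and optimizing statistical machine translation training data based on an information retrieval model: by selecting sentences from the existing training data that are similar to the text to be translated to form a training subset, it obtains translation results comparable to or even better than those obtained with all the data, without increasing computational resources. Adding the selected subset to the original training data to optimize the distribution of the training data further improves translation quality. Experiments show that the method is highly effective at exploiting existing data resources to improve statistical machine translation performance.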
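
The abstract above does not say which retrieval model is used, so the sketch below assumes a plain TF-IDF vector space with cosine similarity as a stand-in. It shows how sentence pairs whose source side resembles the text to be translated could be picked out as a training subset; `select_training_subset` and `top_k` are hypothetical names introduced here.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return TF-IDF vectors (dicts) for tokenized docs, plus the IDF table."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}
    vectors = [{w: tf * idf[w] for w, tf in Counter(d).items()} for d in docs]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def select_training_subset(train_pairs, test_sentences, top_k):
    """Keep, for each test sentence, the top_k most similar training pairs.

    train_pairs: list of (source_tokens, target_tokens).
    Words unseen in the training data get weight 0 (an assumption).
    """
    sources = [src for src, _ in train_pairs]
    vectors, idf = tfidf_vectors(sources)
    selected = set()
    for sent in test_sentences:
        query = {w: tf * idf.get(w, 0.0) for w, tf in Counter(sent).items()}
        ranked = sorted(
            range(len(sources)),
            key=lambda i: cosine(query, vectors[i]),
            reverse=True,
        )
        selected.update(ranked[:top_k])
    return [train_pairs[i] for i in sorted(selected)]


# Toy data (hypothetical).
train_pairs = [
    (["the", "cat", "sleeps"], ["le", "chat", "dort"]),
    (["stock", "prices", "fell"], ["les", "cours", "ont", "chuté"]),
    (["a", "cat", "eats"], ["un", "chat", "mange"]),
]
subset = select_training_subset(train_pairs, [["my", "cat", "sleeps"]], top_k=1)
print(subset)
```

The selected subset can be used on its own or added back to the full training set to shift its distribution toward the test text, which are the two uses the abstract describes.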

14.
    
This article discusses the various aspects of designing a system for eliciting knowledge about language from informants. For each design aspect, various options for implementation are presented, along with their pros, cons, and repercussions for other parts of the knowledge elicitation system. A running example throughout the text is taken from the paradigmatic morphology elicitation module of a system called Boas, which elicits knowledge to support a machine translation system. The main point of the article is an argument about the necessity to analyze the design choice space for complex natural language processing (NLP) systems early, comprehensively, and overtly.

15.
Statistical machine translation (SMT) has proven to be an interesting pattern recognition framework for automatically building machine translation systems from available parallel corpora. In the last few years, research in SMT has been characterized by two significant advances. The first is the popularization of the so-called phrase-based statistical translation models, which allow local contextual information to be incorporated into the translation models. The second is the availability of larger and larger parallel corpora, which are composed of millions of sentence pairs and tens of millions of running words. Since phrase-based models basically consist of statistical dictionaries of phrase pairs, their estimation from very large corpora is a very costly task that yields a huge number of parameters, which must be stored in memory. The handling of millions of model parameters and a similar number of training samples has become a bottleneck in the field of SMT, as well as in other well-known pattern recognition tasks such as speech recognition or handwriting recognition, to name just a few. In this paper, we propose a general framework that deals with the scaling problem in SMT without introducing significant time overhead, by combining different scaling techniques. This new framework is based on the use of counts instead of probabilities, and on the concept of cache memory.
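
The two ideas named at the end of the abstract above, counts instead of probabilities and a cache, can be illustrated with a small sketch. It is not the paper's framework: the class name, the relative-frequency estimate, and the LRU eviction policy are assumptions chosen to keep the example short.

```python
from collections import Counter, OrderedDict


class CountPhraseTable:
    """Stores raw phrase-pair counts and derives probabilities on demand."""

    def __init__(self, cache_size=1000):
        self.pair_counts = Counter()   # (source phrase, target phrase) -> count
        self.src_counts = Counter()    # source phrase -> total count
        self.cache = OrderedDict()     # recently computed probabilities (LRU)
        self.cache_size = cache_size

    def add(self, src, tgt, n=1):
        """Record n more observations of a phrase pair (cheap, incremental)."""
        self.pair_counts[(src, tgt)] += n
        self.src_counts[src] += n
        # Cached estimates for this source phrase are now stale; drop them.
        for key in [k for k in self.cache if k[0] == src]:
            del self.cache[key]

    def prob(self, src, tgt):
        """Relative-frequency estimate of p(tgt | src), cached after first use."""
        key = (src, tgt)
        if key in self.cache:
            self.cache.move_to_end(key)
            return self.cache[key]
        total = self.src_counts[src]
        p = self.pair_counts[key] / total if total else 0.0
        self.cache[key] = p
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the least recently used entry
        return p


table = CountPhraseTable(cache_size=2)
table.add("la casa", "the house", 3)
table.add("la casa", "the home", 1)
print(table.prob("la casa", "the house"))
```

Keeping counts rather than normalized probabilities means new training data only increments integers, and normalization is deferred to the phrase pairs that are actually queried.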

16.
17.
    
In conventional algorithms, the lack of entity information, reference, and semantic relations in the current corpus leads to low precision and efficiency in constructing cross‐language bilingual mappings. To solve this problem, this paper draws on natural language processing and machine translation technology to establish a parallel corpus for information extraction, based on the OntoNotes corpus, by combining automatic extraction and manual adjustment. To verify the validity of the constructed parallel corpus, a comparative experiment was carried out on it. The corpus entity alignment rate, anaphora absence, and syntactic structure were analysed in detail based on statistics. The data set performs well in language processing and machine translation. The parallel corpus for information extraction constructed in this paper can produce highly precise, stable, and efficient information in the process of bilingual mapping, and provides an effective parallel corpus for the study of bilingual mapping in machine translation.

18.
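Fang Lin, Cheng Jingyun. Journal of Software (《软件学报》), 1995, 6(10): 637-641
A tree grammar is a high-dimensional grammar that can express the construction rules of complex objects in two or more dimensions, and it has broad application prospects in areas such as pattern recognition and graphical languages. Building on the concepts of tree grammars, this paper introduces the notions of labeled trees, connection labels, and labeled tree grammars, constructs matching and recognition algorithms for labeled trees, and solves the problem of constructing a parser for labeled tree grammars.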

19.
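Building and using a comprehensive language knowledge base   Total citations: 15, self-citations: 4, citations by others: 15
The scale and quality of a language knowledge base determine the success or failure of a natural language processing system. After 18 years of effort, the Institute of Computational Linguistics at Peking University has accumulated a series of sizable, high-quality language data resources: a grammatical information dictionary of contemporary Chinese, a large-scale corpus with basic annotation, a semantic dictionary of contemporary Chinese, a Chinese concept dictionary, bilingual corpora aligned at different unit levels, term banks for several specialized domains, a phrase structure rule base for contemporary Chinese, a corpus of classical Chinese poetry, and so on. This research will integrate these resources into a comprehensive language knowledge base. Integrating different language data resources requires overcoming the "gaps" between them. Besides a unified, user-friendly interface and convenient application programming interfaces, the planned knowledge base will provide software tools that support knowledge mining, helping existing language data resources evolve from primary products into more deeply processed ones. It will also provide various mechanisms for knowledge dissemination and information services, so that the comprehensive language knowledge base offers all-round, multi-level support for language information processing research, research in linguistics proper, and language teaching.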

20.
This paper presents a new method for utilizing translator knowledge bases for machine translation systems. Translator knowledge to be stored and utilized consists of translationally equivalent pattern pairs: surface-level phrasal, clausal, and sentential correspondences between the source and target languages. This knowledge will be utilized to translate domain-specific idiomatic, nonstandard, or ungrammatical expressions. The proposed method has been implemented in an adaptive English to Japanese machine translation system, HICATS/EJ, as one of its customization facilities.
