Found 20 similar documents; search time: 15 ms
1.
The growing number of large data sets generated by information technologies provides a great opportunity to better understand emerging topics in human society. Retrieving real-world events from such data, particularly free-text data, is a complicated task in Natural Language Processing and Location-based Social Networks. In this work, we propose a new approach that recognizes geo-referenced high-level events/activities mentioned in web sources by using open gazetteers: OpenStreetMap and Google Maps. Demonstrated on sampled news articles, our approach identifies events associated with the relevant topics using latent Dirichlet allocation. This research is an essential step towards recommendation systems, urban planning, and monitoring.
2.
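In current statistical translation methods, the size of the bilingual corpus and the accuracy of word alignment strongly affect the performance of the translation system. Although a large-scale corpus can improve word alignment accuracy and system performance, it does so at the cost of increased system load. Current research on statistical machine translation therefore looks for other ways to improve system performance in addition to using large-scale corpora. To address these issues, this paper proposes a method for applying a bilingual dictionary in statistical machine translation. The method not only improves word alignment accuracy but also yields higher-quality translations, and it alleviates the data sparsity problem to some extent.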
3.
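In conference settings, translating a speaker's speech into text in another language by means of speech recognition and machine translation is of great importance for cross-lingual communication and has become a current research focus. To address the problem of translating the technical terms and industry jargon that arise from the industry-specific nature of conferences, this paper proposes a domain personalization method that incorporates external dictionary knowledge. Specifically, it first adopts an encoding strategy that combines placeholders with concatenation-based fusion; by introducing external dictionary knowledge, this improves the translation accuracy of entity words and technical terms while preserving the fluency of the output. Second, it proposes a classification-based personalized adaptation strategy with domain-specific side-branch parameters, which improves translation quality in conference-related domains while maintaining general-domain translation quality. Finally, based on these schemes, an automatic domain-personalization training system is designed. Experimental results show that on Chinese-English sports, business, and medical conference translation tasks, the system improves BLEU by 9.22 points on average without degrading general-domain translation.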
4.
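A Method for Constructing Parallel Corpora for Low-Resource Language Machine Translation (total citations: 1; self-citations: 0; citations by others: 1)
The training effectiveness of neural machine translation models depends largely on the size and quality of the parallel corpus. Apart from a few common languages, the construction of high-quality parallel corpora between Chinese and low-resource languages has long lagged behind. Most existing low-resource-language parallel corpora are built from web resources using automatic sentence alignment, and they have many limitations in text quality, domain coverage, and other respects. Manual translation can produce high-quality parallel corpora, but relevant experience and methods are lacking. This paper, from machine...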
5.
The real difficulty in developing practical NLP systems is that we do not know in advance what actual instances of knowledge should be used in the application system, even though we know in advance what types of knowledge are required. An effective method for extracting linguistic knowledge from corpora is needed. We propose automatic linguistic knowledge acquisition from sublanguage corpora. The system combines existing linguistic knowledge and human intervention with corpus-based techniques. The algorithm involves a Gradual Approximation, which converges linguistic knowledge gradually towards desirable results. We conducted three experiments. The first revealed the characteristics of this algorithm, and the others demonstrated its effectiveness on a real corpus. The results show that the algorithm is promising, though some problems remain: the practical problem of setting the parameters, the formalism problem of including more linguistic features, and the combination with other linguistic clues for further development. We would like to continue the research, perform further experiments, and improve the algorithm.
6.
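A title reflects the soul of an article; grasping the title accurately allows a reader to understand the article's central content quickly. This paper builds a machine translation platform using statistical machine translation methods, uses the platform to translate titles in the aviation domain, and runs open and closed tests on the platform with the international NIST evaluation tool. The test results show that the statistical method is effective for translating domain-specific titles.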
7.
8.
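Word sense selection based on X structure uses the X structure in which a word occurs and compares it with the X structures in the usage section of the dictionary; the sense of the word is determined by comparing the structures and the similarity of the other words in them. Similarity between words is computed using WordNet. This method requires only a small number of training examples and avoids the problems of traditional word co-occurrence methods, such as the need for large corpora and data sparsity.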
9.
10.
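This paper introduces the current state of multilingual development in Xinjiang, covering Uyghur-Kazakh-Kyrgyz corpora, machine translation, Uyghur speech recognition, and related fields. It focuses on Xinjiang's research institutions for multilingual intelligent processing and on each institution's main research directions and topics.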
11.
This paper presents an approach to avoiding translation by utilising a context-sensitive dictionary look-up system to assist in the comprehension of on-line texts in foreign languages. The system employs state-of-the-art finite-state technology for the context analysis and uses converted bilingual printed dictionaries as its primary resource of lexical information. The printed dictionaries have been opened up for the system by analysing the dictionary structure and parsing the type-setting tapes. The lexical information has been validated and augmented by corpus-based lexicographic revision, paying special attention to the treatment of multiword lexemes, for which formalised local grammars have been added to the augmented dictionaries. Although the system is currently implemented as a device for comprehension assistance, the same technology and methods can be employed to select information from the dictionaries relevant for translation processes.
12.
13.
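Bilingual parallel corpora are an important foundation for building high-quality statistical machine translation systems. Unlike the traditional strategy of improving translation quality by enlarging the bilingual parallel corpus, this paper aims to exploit the potential of existing resources as fully as possible to improve statistical machine translation performance. It proposes a method for selecting and optimizing statistical machine translation training data based on an information retrieval model. By selecting sentences from the existing training data that are similar to the text to be translated and using them as a training subset, the method obtains translation results comparable to, or even better than, those obtained with all the data, without increasing computational resources. Adding the selected subset back to the original training data to optimize the distribution of the training data further improves translation quality. Experiments show that the method is very effective at using existing data resources to improve statistical machine translation performance.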
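The selection idea summarized in this abstract (retrieving the training sentences most similar to the text to be translated) can be illustrated with a minimal sketch. The abstract does not specify the retrieval model, so the TF-IDF weighting, cosine similarity, whitespace tokenization, and all names below (`tfidf_vectors`, `select_training_subset`, `top_k`) are assumptions made for illustration only, not the authors' implementation.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return TF-IDF vectors (as dicts) for tokenized docs, plus the IDF table."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF; an assumed weighting, not taken from the paper.
    idf = {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}
    vecs = [{tok: tf * idf[tok] for tok, tf in Counter(doc).items()} for doc in docs]
    return vecs, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_training_subset(train_pairs, test_sentences, top_k=2):
    """For each test sentence, keep the top_k most similar training pairs.

    train_pairs is a list of (source, target) strings; similarity is
    computed on the source side only. Returns the union of the selected
    pairs in their original corpus order.
    """
    sources = [src.lower().split() for src, _ in train_pairs]
    vecs, idf = tfidf_vectors(sources)
    selected = set()
    for sent in test_sentences:
        counts = Counter(sent.lower().split())
        # Words unseen in the training data get weight 0.
        query = {tok: tf * idf.get(tok, 0.0) for tok, tf in counts.items()}
        ranked = sorted(range(len(vecs)),
                        key=lambda i: cosine(query, vecs[i]),
                        reverse=True)
        selected.update(ranked[:top_k])
    return [train_pairs[i] for i in sorted(selected)]
```

A translation system would then be trained on the returned subset, or on the original data augmented with it, as the abstract describes.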
14.
This article discusses the various aspects of designing a system for eliciting knowledge about language from informants. For each design aspect, various options for implementation are presented, along with their pros, cons, and repercussions for other parts of the knowledge elicitation system. A running example throughout the text is taken from the paradigmatic morphology elicitation module of a system called Boas, which elicits knowledge to support a machine translation system. The main point of the article is an argument about the necessity to analyze the design choice space for complex natural language processing (NLP) systems early, comprehensively, and overtly.
15.
D. Ortiz-Martínez, I. García-Varea, F. Casacuberta. Pattern Recognition Letters, 2008, 29(8): 1145
Statistical machine translation (SMT) has proven to be an interesting pattern recognition framework for automatically building machine translation systems from available parallel corpora. In the last few years, research in SMT has been characterized by two significant advances. The first is the popularization of the so-called phrase-based statistical translation models, which make it possible to incorporate local contextual information into the translation models. The second is the availability of larger and larger parallel corpora, which are composed of millions of sentence pairs and tens of millions of running words. Since phrase-based models basically consist of statistical dictionaries of phrase pairs, their estimation from very large corpora is a very costly task that yields a huge number of parameters to be stored in memory. The handling of millions of model parameters and a similar number of training samples has become a bottleneck in the field of SMT, as well as in other well-known pattern recognition tasks such as speech recognition or handwriting recognition, to name just a few. In this paper, we propose a general framework that deals with the scaling problem in SMT without introducing significant time overhead by combining different scaling techniques. This new framework is based on the use of counts instead of probabilities, and on the concept of cache memory.
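The two ingredients named at the end of this abstract, storing counts instead of probabilities and using a cache, can be illustrated with a small sketch. The abstract gives no implementation details, so the class `CountPhraseTable`, the relative-frequency estimate p(t|s) = count(s, t) / count(s), and the LRU eviction policy are all assumptions chosen for illustration, not the framework proposed in the paper.

```python
from collections import Counter, OrderedDict


class CountPhraseTable:
    """Phrase table that stores raw counts and derives probabilities on demand."""

    def __init__(self, cache_size=1000):
        self.pair_counts = Counter()   # count(source phrase, target phrase)
        self.src_counts = Counter()    # count(source phrase)
        self.cache = OrderedDict()     # recently used probabilities (LRU order)
        self.cache_size = cache_size

    def add(self, src, tgt, count=1):
        """Update counts incrementally; no renormalization pass is needed."""
        self.pair_counts[(src, tgt)] += count
        self.src_counts[src] += count
        # Cached probabilities for this source phrase are now stale.
        for key in [k for k in self.cache if k[0] == src]:
            del self.cache[key]

    def prob(self, src, tgt):
        """Relative-frequency estimate of p(tgt | src), served from cache if present."""
        key = (src, tgt)
        if key in self.cache:
            self.cache.move_to_end(key)      # mark as most recently used
            return self.cache[key]
        total = self.src_counts[src]
        p = self.pair_counts[key] / total if total else 0.0
        self.cache[key] = p
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)   # evict the least recently used entry
        return p
```

Because only counts are stored, new training samples can be folded in with `add` without recomputing the whole table, while the cache keeps repeated lookups cheap.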
16.
17.
Honghua He. Expert Systems, 2019, 36(5)
In conventional algorithms, the lack of entity information, reference, and semantic relations in the current corpus leads to low precision and efficiency in constructing cross-language bilingual mappings. To solve this problem, this paper draws on natural language processing and machine translation technology to establish a parallel corpus for information extraction based on the OntoNotes corpus, combining automatic extraction with manual adjustment. To verify the validity of the constructed parallel corpus, a comparative experiment was carried out on it. The entity alignment rate, anaphora absence, and syntactic structure of the corpus were analysed in detail based on statistics. The data set performs well in language processing and machine translation. The parallel corpus for information extraction constructed in this paper can produce highly precise, stable, and efficient information in the process of bilingual mapping, and thus provides an effective parallel corpus for machine translation research on bilingual mapping.
18.
19.
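Construction and Utilization of a Comprehensive Language Knowledge Base (total citations: 15; self-citations: 4; citations by others: 15)
The scale and quality of a language knowledge base determine the success or failure of a natural language processing system. After 18 years of effort, the Institute of Computational Linguistics at Peking University has accumulated a series of large-scale, high-quality language data resources: a grammatical information dictionary of contemporary Chinese, a large-scale basic annotated corpus, a semantic dictionary of contemporary Chinese, a Chinese concept dictionary, bilingual corpora aligned at different unit levels, terminology banks for several specialized domains, a phrase structure rule base for contemporary Chinese, a corpus of classical Chinese poetry, and so on. This research will integrate these language data resources into a comprehensive language knowledge base. When integrating different language data resources, the "gaps" between them must be overcome. In addition to a unified, user-friendly interface and a convenient application programming interface, the planned comprehensive language knowledge base will provide software tools that support knowledge mining, so that the existing language data resources can keep developing from primary products into deeply processed products. It will also provide multiple mechanisms for knowledge dissemination and information services, so that the comprehensive language knowledge base can offer all-round, multi-level support for research on language information processing, research on linguistics itself, and language teaching.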
20.
This paper presents a new method for utilizing translator knowledge bases for machine translation systems. Translator knowledge to be stored and utilized consists of translationally equivalent pattern pairs: surface-level phrasal, clausal, and sentential correspondences between the source and target languages. This knowledge will be utilized to translate domain-specific idiomatic, nonstandard, or ungrammatical expressions. The proposed method has been implemented in an adaptive English to Japanese machine translation system, HICATS/EJ, as one of its customization facilities.