首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
N-gram模型综述   总被引:1,自引:0,他引:1  
N-gram模型是自然语言处理中最常用的语言模型之一,广泛应用于语音识别、手写识别、拼写纠错、机器翻译和搜索引擎等众多任务.但是N-gram模型在训练和应用时经常会出现零概率问题,导致无法获得良好的语言模型,因此出现了拉普拉斯平滑、卡茨回退和Kneser-Ney平滑等平滑方法.在介绍了这些平滑方法的基本原理后,使用困惑度作为度量标准去比较了基于这几种平滑方法所训练出的语言模型.  相似文献   

2.
The paper presents a new approach for the introduction of computational science into high level school curricula. It also discusses a set of real life problems that are appropriate for these curricula because they can be described through simple models. The computer based simulation of these systems require an ad hoc environment, including a programming language, suitable for this target age. The paper proposes a new environment, the ORESPICS environment, including a new programming language. The sequential part of the language integrates the classical imperative constructs with a simple set of graphical primitives, mostly taken from the LOGO language. The concurrent part of the language is based on the message passing paradigm. The solutions of some classical problems through ORESPICS are shown.  相似文献   

3.
We present a new generative model of natural language, the latent words language model. This model uses a latent variable for every word in a text that represents synonyms or related words in the given context. We develop novel methods to train this model and to find the expected value of these latent variables for a given unseen text. The learned word similarities help to reduce the sparseness problems of traditional n-gram language models. We show that the model significantly outperforms interpolated Kneser–Ney smoothing and class-based language models on three different corpora. Furthermore the latent variables are useful features for information extraction. We show that both for semantic role labeling and word sense disambiguation, the performance of a supervised classifier increases when incorporating these variables as extra features. This improvement is especially large when using only a small annotated corpus for training.  相似文献   

4.
This paper proposes a new technique to test the performance of spoken dialogue systems by artificially simulating the behaviour of three types of user (very cooperative, cooperative and not very cooperative) interacting with a system by means of spoken dialogues. Experiments using the technique were carried out to test the performance of a previously developed dialogue system designed for the fast-food domain and working with two kinds of language model for automatic speech recognition: one based on 17 prompt-dependent language models, and the other based on one prompt-independent language model. The use of the simulated user enables the identification of problems relating to the speech recognition, spoken language understanding, and dialogue management components of the system. In particular, in these experiments problems were encountered with the recognition and understanding of postal codes and addresses and with the lengthy sequences of repetitive confirmation turns required to correct these errors. By employing a simulated user in a range of different experimental conditions sufficient data can be generated to support a systematic analysis of potential problems and to enable fine-grained tuning of the system.  相似文献   

5.
郭涛  曲宝胜  郭勇 《电脑学习》2011,(2):113-116
本文简单介绍了自然语言处理发展的现状,讨论了自然语言处理模型,将其分为三大类:分析模型、统计模型及混合模型。具体介绍了分析模型原理及存在的问题,重点讨论了各种统计模型的特点及局限性,最后简单介绍了混合模型,并指出目前自然语言处理技术中存在的问题。  相似文献   

6.
The solution of complex problems concerning wave propagation in plane and axisymmetric continua can be obtained through the use of discrete methods of analysis. Invariably these methods introduce mathematical models involving thousands of degrees of freedom. Programming the solution to these problems in a standard procedural language such as FORTRAN generally results in a large lag time between problem inception and program completion. In many instances a shorter development time which results in a moderately efficient program is a more economical and hence, a more desirable approach. The purpose of this paper is to demonstrate that a high level procedural language such as POST is a much more natural language than FORTRAN in which to express the solution of wave propagation problems. The lumped parameter method of analysis is used to determine the solution.  相似文献   

7.
A comparison of the concurrency models of Ada and occam is presented. This comparison is performed in terms of an Ada-to-occam compiler. A subset of the Ada programming language may be compiled allowing a study of how programs and algorithms expressed using the Ada concurrency model can be mapped to the occam concurrency model. The resultant occam code may then be executed on a transputer. This paper describes the Ada and occam concurrency models, highlights the major differences between them, discusses the problems encountered in trying to map concurrent Ada programs to equivalent occam programs as a result of these differences and explains how these problems were overcome in the present compiler.  相似文献   

8.
Modelling frameworks provide models with support components that handle tasks such as visualisation, data management and model integration. Within these broad requirements different approaches to framework development are possible. Tarsier is a modelling framework that supports the development of models in a high-level language, such as C++. This approach allows Tarsier model developers to craft object oriented solutions to large modelling problems. ICMS is a software system that supports the development of models in a custom modelling language that allows modellers with little programming experience to develop, integrate and visualise catchment models. Both frameworks provide sophisticated tools for model linking, data management, and data analysis and visualisation. By focusing on different user groups, Tarsier and ICMS have evolved into quite different environments, yet both satisfy the definition of a modelling framework. This paper concentrates on the components within each framework and the strengths and weaknesses of the different approaches.  相似文献   

9.
近些年来,条件概率模型的研究得到了很大的发展。在对序列标注类问题进行处理时,条件模型逐渐开始取代产生式模型,其应用领域相当广泛,条件概率模型可应用到图像识别、自然语言处理、入侵检测等问题上。条件随机场模型(Conditional Random Fields,CRFs)模型是条件模型中的代表模型,也是条件模型中现在研究得最多的模型之一。它避免了产生式模型的缺点,而且克服了前期最大熵模型标记偏置的缺陷,由此得到广泛的运用。在利用CRFs作具体应用研究时发现,单纯利用CRFs模型进行实际运用取得的效果并没有达到最好,所以在每个应用中均进行了改进。本文主要研究军用文书分词、军事命名实体识别、入侵检测等方面,所做的改进都在模型应用的基础上更进一步提高了系统的性能。  相似文献   

10.
This article describes the various knowledge sources that, in general, are required to handle multimodal human-machine interaction efficiently: these are called the task, user, dialogue, environment and system models. The first part discusses the content of these models. Special emphasis is given on problems that occur when speech is combined with other modalities. The second part focuses on spoken language characteristics and proposes an adapted semantic representation for the task model. It also describes a stochastic method to collect and process the information related to this model. The conclusion discusses an extension of such a stochastic method to multimodality.  相似文献   

11.
12.
Future expert systems for understanding physical mechanisms will probably employ causal models as the foundation of their expertise, and the problem of acquiring these causal models is important. This paper explores one possibility, that of acquiring causal models by understanding natural language explanations of these mechanisms. It identifies six different research issues in which understanding an explanation requires knowledge-based reasoning, and proposes approaches to these problems within an integrated natural language-based causal model acquisition system.  相似文献   

13.
随着自然语言处理(NLP,natural language processing)技术的快速发展,语言模型在文本分类和情感分析中的应用不断增加。然而,语言模型容易遭到盗版再分发,对模型所有者的知识产权造成严重威胁。因此,研究者着手设计保护机制来识别语言模型的版权信息。现有的适用于文本分类任务的语言模型水印无法与所有者身份相关联,且鲁棒性不足以及无法再生成触发集。为了解决这些问题,提出一种新的适用于文本分类任务模型的黑盒水印方案,可以远程快速验证模型所有权。将模型所有者的版权消息和密钥通过密钥相关的哈希运算消息认证码(HMAC,hash-based message authentication code)得到版权消息摘要,由HMAC得到的消息摘要可以防止被伪造,具有很强的安全性。从原始训练集各个类别中随机挑选一定的文本数据,将摘要与文本数据结合构建触发集,并在训练过程中对语言模型嵌入水印。为了评估水印的性能,在IMDB电影评论、CNEWS中文新闻文本分类数据集上对3种常见的语言模型嵌入水印。实验结果表明,在不影响原始模型测试精度的情况下,所提出的水印验证方案的准确率可以达到 100%。即使在模型微调和剪枝等常见攻击下,也能表现出较强的鲁棒性,并且具有抗伪造攻击的能力。同时,水印的嵌入不会影响模型的收敛时间,具有较高的嵌入效率。  相似文献   

14.
随着自然语言处理(NLP)领域中预训练技术的快速发展,将外部知识引入到预训练语言模型的知识驱动方法在NLP任务中表现优异,知识表示学习和预训练技术为知识融合的预训练方法提供了理论依据。概述目前经典预训练方法的相关研究成果,分析在新兴预训练技术支持下具有代表性的知识感知的预训练语言模型,分别介绍引入不同外部知识的预训练语言模型,并结合相关实验数据评估知识感知的预训练语言模型在NLP各个下游任务中的性能表现。在此基础上,分析当前预训练语言模型发展过程中所面临的问题和挑战,并对领域发展前景进行展望。  相似文献   

15.
《Applied Soft Computing》2008,8(2):1005-1017
Statistical language models are very useful tools to improve the recognition accuracy of optical character recognition (OCR) systems. In previous systems, segmentation by maximum word matching, semantic class segmentation, or trigram language models have been used. However, these methods have some disadvantages, such as inaccuracies due to a preference for longer words (which may be erroneous), failure to recognize word dependencies, complex semantic training data segmentation, and a requirement of high memory.To overcome these problems, we propose a novel bigram Markov language model in this paper. This type of model does not have large word preferences and does not require semantically segmented training data. Furthermore, unlike trigram models, the memory requirement is small. Thus, the scheme is suitable for handheld and pocket computers, which are expected to be a major future application of text recognition systems.However, due to a simple language model, the bigram Markov model alone can introduce more errors. Hence in this paper, a novel algorithm combining bigram Markov language models with heuristic fuzzy rules is described. It is found that the recognition accuracy is improved through the use of the algorithm, and it is well suited to mobile and pocket computer applications, including as we will show in the experimental results, the ability to run on mobile phones.The main contribution of this paper is to show how fuzzy techniques as linguistic rules can be used to enhance the accuracy of a crisp recognition system, and still have low computational complexity.  相似文献   

16.
近年来,随着深度学习的快速发展,面向自然语言处理领域的预训练技术获得了长足的进步。早期的自然语言处理领域长期使用Word2Vec等词向量方法对文本进行编码,这些词向量方法也可看作静态的预训练技术。然而,这种上下文无关的文本表示给其后的自然语言处理任务带来的提升非常有限,并且无法解决一词多义问题。ELMo提出了一种上下文相关的文本表示方法,可有效处理多义词问题。其后,GPT和BERT等预训练语言模型相继被提出,其中BERT模型在多个典型下游任务上有了显著的效果提升,极大地推动了自然语言处理领域的技术发展,自此便进入了动态预训练技术的时代。此后,基于BERT的改进模型、XLNet等大量预训练语言模型不断涌现,预训练技术已成为自然语言处理领域不可或缺的主流技术。文中首先概述预训练技术及其发展历史,并详细介绍自然语言处理领域的经典预训练技术,包括早期的静态预训练技术和经典的动态预训练技术;然后简要梳理一系列新式的有启发意义的预训练技术,包括基于BERT的改进模型和XLNet;在此基础上,分析目前预训练技术研究所面临的问题;最后对预训练技术的未来发展趋势进行展望。  相似文献   

17.
Large databases of linguistic annotations are used for testing linguistic hypotheses and for training language processing models. These linguistic annotations are often syntactic or prosodic in nature, and have a hierarchical structure. Query languages are used to select particular structures of interest, or to project out large slices of a corpus for external analysis. Existing languages suffer from a variety of problems in the areas of expressiveness, efficiency, and naturalness for linguistic query. We describe the domain of linguistic trees and discuss the expressive requirements for a query language. Then we present a language that can express a wide range of queries over these trees, and show that the language is first-order complete over trees.  相似文献   

18.
Formulation of qualitative models for complex decision problems exhibiting less structure, more imprecision and uncertainty is not adequately addressed in DSS research. Typical characteristics and requirements of such problems prohibit the development of DSS using knowledge based system development methodologies. This paper presents a methodology for formulation of qualitative models using fuzzy logic to handle the imprecision and uncertainty in the problem domain. The problem domain, in this methodology, is represented using problem-solving knowledge, environmental knowledge, and control knowledge components. A high level non-procedural language for representing these components of knowledge is illustrated using a project selection and resource allocation problem. The paper also describes the implementation of a prototype decision support environment based on this methodology.  相似文献   

19.
自然语言处理中的评测任务引导和推动着技术、模型和方法上的研究。近年来,新的评测数据集和评测任务不断被提出,与此同时,现有评测暴露的一系列问题也限制了自然语言处理技术的进步。该文从自然语言处理评测的概念、构成、发展和意义出发,分类综述了主流自然语言处理评测的任务和特点,进而总结归纳了自然语言处理评测中的问题及其成因。最后,该文参照人类语言能力评测规范,提出类人机器语言能力评测的概念,并从信度、难度、效度三个方面提出了一系列类人机器语言能力评测的基本原则和实施设想,并对评测技术的未来发展进行了展望。  相似文献   

20.
ABSTRACT

The understanding and acquisition of a language in a real-world environment is an important task for future robotics services. Natural language processing and cognitive robotics have both been focusing on the problem for decades using machine learning. However, many problems remain unsolved despite significant progress in machine learning (such as deep learning and probabilistic generative models) during the past decade. The remaining problems have not been systematically surveyed and organized, as most of them are highly interdisciplinary challenges for language and robotics. This study conducts a survey on the frontier of the intersection of the research fields of language and robotics, ranging from logic probabilistic programming to designing a competition to evaluate language understanding systems. We focus on cognitive developmental robots that can learn a language from interaction with their environment and unsupervised learning methods that enable robots to learn a language without hand-crafted training data.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号