Similar Articles
19 similar articles found (search time: 218 ms)
1.
This paper proposes a statistical language model based on a mixture of words and word senses, studies its performance in word-sense tagging and Mandarin Chinese speech recognition, and compares it against a traditional word-sense model and a word-based language model. The proposed model describes the relationship between word senses and words more accurately than the traditional word-sense model and achieves lower perplexity in word-sense tagging; in Mandarin continuous speech recognition it outperforms the word-based trigram model while requiring less storage space.
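For reference (this is the standard definition, not a formula quoted from the paper): the perplexity used to compare such language models over a held-out text $w_1 \dots w_N$ is

    \mathrm{PP} = P(w_1 \cdots w_N)^{-1/N} = \exp\!\Big(-\frac{1}{N}\sum_{i=1}^{N} \ln P(w_i \mid h_i)\Big),

where $h_i$ is the history conditioning $w_i$; a lower perplexity indicates a less confused model.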

2.
A Statistical Model for Analyzing Chinese Semantic Dependency Relations
李明琴  李涓子  王作英  陆大 《计算机学报》2004,27(12):1679-1687
This paper presents a statistical semantic analyzer that discovers the semantic dependency relations in Chinese sentences; these relations can be used to represent both the meaning and the structure of a sentence. The analyzer was trained and tested on a one-million-word corpus annotated with semantic dependency relations (the Semantic Dependency Network corpus, SDN), and several experiments were designed and run to analyze its performance. The results show that the analyzer performs well in open domains, with an accuracy roughly comparable to that of Chinese syntactic parsers.

3.
Construction of a Chinese Corpus Based on Semantic Dependency Relations
Corpora are an important resource for knowledge acquisition in natural language processing. Taking sentence understanding as its starting point, this paper discusses several fundamental issues in designing and building a large-scale Chinese corpus based on semantic dependency relations: the choice of annotation scheme, the definition of the relation set, the design of annotation tools, and quality control during annotation. The corpus is designed at a scale of one million word tokens and uses 70 semantic and syntactic dependency relations to annotate the semantic structure of sentences on top of text already tagged with semantic classes. Its distinguishing feature is that it combines the semantic relation system of HowNet (《知网》) with concrete language use, effectively describing the dependency relations between words in real linguistic contexts. Once completed, it will provide stronger knowledge-base support for applications such as sentence understanding and content-based information retrieval.
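As an illustration only (the relation labels below are invented placeholders, not the corpus's actual 70-relation inventory), a semantic dependency annotation of the kind described can be stored as head-dependent-relation triples:

    # Hypothetical sketch: one sentence annotated with semantic dependency triples.
    # The relation labels are placeholders; the corpus defines its own inventory.
    from dataclasses import dataclass

    @dataclass
    class Dependency:
        head: str       # governing word
        dependent: str  # dependent word
        relation: str   # semantic or syntactic dependency label

    sentence = ["他", "喝", "茶"]  # "He drinks tea"
    annotation = [
        Dependency(head="喝", dependent="他", relation="agent"),
        Dependency(head="喝", dependent="茶", relation="patient"),
    ]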

4.
Sparse data severely degrades the results of sentence-structure parsing models, and syntactic structure combines semantic content with syntactic form. Building on semantic-structure annotation, this paper proposes a word-clustering model and algorithm based on semantic collocation relations and constructs a head-driven statistical parsing model based on semantic classes. The model not only substantially alleviates the data-sparseness problem but also clearly improves parsing performance. Experiments show that the semantic-class-based head-driven parsing model achieves a recall of 88.26% and a precision of 88.73%, an 8.39% improvement in the combined metric.

5.
Existing text-classification methods ignore the semantic dependency information between words within a text and therefore require large amounts of training data. To address this, TextSGN, a graph-network text-classification model based on semantic dependency analysis, is proposed. The text is first parsed into a semantic dependency graph, and the nodes (individual words) and edges (dependency relations) are encoded with word embeddings and one-hot vectors. On this basis, an SGN network block is proposed to mine semantic dependency information quickly: it defines message passing at the structural level to update the nodes and edges of the graph, so that the network converges faster. Classification models were trained and tested on several public datasets. The results show that TextSGN reaches 95.2% accuracy on short-text classification, a 3.6% improvement over the next-best method.

6.
Construction of an HNC Semantic Annotation Model
谢法奎  张全 《计算机科学》2009,36(5):238-240
This paper introduces a human-machine collaborative model for semantic annotation of Chinese corpora based on HNC theory. It first analyzes the content of HNC semantic annotation and then defines the annotation workflow. Because the annotation is highly complex, machine annotation assists manual annotation at the main stages of the workflow: a maximum-entropy model is used for semantic-chunk segmentation, reaching 83.78% precision and 91.17% recall, and an instance-based model is used for sentence-category determination, reaching 51.64% accuracy. An HNC semantically annotated corpus has been built with this model and currently contains 400,000 characters.

7.
卢志茂  刘挺  李生 《自动化学报》2006,32(2):228-236
To achieve automatic word-sense tagging of unrestricted Chinese text, this paper adopts a new tagging method based on unsupervised machine-learning strategies. Four word-sense disambiguation models were built and their test results compared. The best-performing model combines two unsupervised learning strategies and uses dependency-grammar analysis to select context feature words. The final tagging method can train the models on large-scale corpora, largely alleviating the data-sparseness problem, and offers high tagging accuracy and good extensibility, making it suitable for word-sense tagging of large-scale text.

8.
Chinese Part-of-Speech Tagging Based on SVMTool
SVMTool is a sequence-labeling tool built on support vector machines (SVMs); it is simple, flexible, and efficient, and can incorporate a large number of linguistic features. This paper applies SVMTool to Chinese part-of-speech tagging, improving the accuracy of a hidden-Markov-model baseline by 2.07%. To address the low accuracy on out-of-vocabulary words, character- and word-level features are added, including the radical components of Chinese characters and word-reduplication features, whose feasibility is analyzed theoretically. Experiments show that with these features the tagging accuracy on out-of-vocabulary words rises by 1.16% and the average error rate falls by 7.40%.
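A minimal sketch of the two added feature types; RADICALS below is a stand-in for a real character-to-radical dictionary, and the feature names are ours:

    # Sketch: radical and reduplication features for Chinese POS tagging.
    RADICALS = {"江": "氵", "河": "氵", "打": "扌"}  # placeholder lookup table

    def radical_features(word):
        # one feature per character: the radical it is built from, if known
        return [f"rad={RADICALS.get(ch, 'UNK')}" for ch in word]

    def reduplication_features(word):
        feats = []
        if len(word) == 2 and word[0] == word[1]:    # AA pattern, e.g. 看看
            feats.append("redup=AA")
        if len(word) == 4 and word[:2] == word[2:]:  # ABAB pattern, e.g. 研究研究
            feats.append("redup=ABAB")
        return feats

    print(radical_features("江河"), reduplication_features("看看"))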

9.
赵晨光  蔡东风 《计算机应用》2010,30(6):1671-1672
To improve the accuracy of word-sense disambiguation, this paper proposes a disambiguation strategy based on an improved vector space model (VSM). After extracting feature vectors, the model takes syntactic, morphological, and semantic factors into account when computing context similarity, and introduces collocation constraints to improve the algorithm. In open tests, word-sense tagging accuracy exceeds 80%. The results show that the method describes context information more comprehensively and benefits further semantic analysis.
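A minimal sketch of the core idea, with invented placeholder vectors: the sense whose profile is most similar to the context vector is selected.

    # Sketch: choose the word sense whose vector best matches the context vector.
    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    context_vec = [0.2, 0.7, 0.1]              # features extracted from the context
    sense_vecs = {"sense_1": [0.1, 0.8, 0.0],  # placeholder sense profiles
                  "sense_2": [0.9, 0.1, 0.2]}

    best_sense = max(sense_vecs, key=lambda s: cosine(context_vec, sense_vecs[s]))
    print(best_sense)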

10.
This paper studies latent semantic analysis (LSA) and the techniques for applying it to continuous speech recognition. An LSA model is built on the WSJ0 text corpus and interpolated with a 3-gram model to construct a statistical language model that incorporates semantic information. To further optimize the mixed model, a k-means clustering algorithm whose centroids are initialized with a density function is proposed to cluster the vector space of the LSA model. Continuous-speech-recognition experiments on the WSJ0 corpus show that the LSA + 3-gram mixed model reduces the word error rate by 13.3% relative to the standard 3-gram.
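A sketch of the centroid-initialization idea, under our own assumption that density is estimated by counting neighbors within a fixed radius (the paper's exact density function is not reproduced here):

    # Sketch: density-based centroid initialization for k-means over LSA vectors.
    import numpy as np

    def density_init(X, k, radius=1.0):
        # density of each point = number of neighbors within `radius` (assumed heuristic)
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        density = (d < radius).sum(axis=1)
        centroids = [X[np.argmax(density)]]
        for _ in range(k - 1):
            # next centroid: a dense point far from those already chosen
            dist = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
            centroids.append(X[np.argmax(density * dist)])
        return np.stack(centroids)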

11.
We show the results of studying models of the Russian language constructed with recurrent artificial neural networks for systems of automatic recognition of continuous speech. We construct neural network models with different numbers of elements in the hidden layer and perform linear interpolation of the neural network models with the baseline trigram language model. The resulting models were used at the stage of rescoring the N-best list. In our experiments on recognizing continuous Russian speech with an extra-large vocabulary (150 thousand word forms), rescoring the 50-best list with the neural network language models interpolated with the trigram model gave a relative word error rate reduction of 14%.
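A minimal sketch of the rescoring stage, assuming placeholder scorers rnn_logprob(word, history) and trigram_logprob(word, history) (both names are ours, not the paper's):

    # Sketch: rerank an N-best list with an interpolated RNN/trigram language model.
    import math

    def rescore(nbest, rnn_logprob, trigram_logprob, lam=0.5, lm_weight=10.0):
        # nbest: list of (hypothesis_words, acoustic_score) pairs;
        # lam and lm_weight are placeholder weights tuned on development data.
        def lm_score(words):
            # interpolate in the probability domain, sum logs over the hypothesis
            return sum(math.log(lam * math.exp(rnn_logprob(w, words[:i])) +
                                (1 - lam) * math.exp(trigram_logprob(w, words[:i])))
                       for i, w in enumerate(words))
        return max(nbest, key=lambda h: h[1] + lm_weight * lm_score(h[0]))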

12.
A Language Model Adaptation Method Based on Trigram Register-Feature Classification
Starting from the differences between written and spoken language, this paper proposes register adaptation for language models. The adaptation uses several count-level interpolation algorithms. An interpolation algorithm that takes Katz smoothing into account assigns weights according to the reliability of each trigram unit, while the adaptation algorithm based on trigram register-feature classification assigns weights dynamically according to the register tendency of each trigram unit, with several weight-generating functions considered. Phonetic-to-character conversion experiments on spoken-language corpora show that these adaptation algorithms improve the baseline models to varying degrees; the algorithm that combines unit reliability and register tendency performs best, reducing the character error rate by 50.2% and 23.7% relative to the paper's two baselines.

13.
Spelling speech recognition can be applied for several purposes including enhancement of speech recognition systems and implementation of name retrieval systems. This paper presents a Thai spelling analysis to develop a Thai spelling speech recognizer. The Thai phonetic characteristics, alphabet system and spelling methods have been analyzed. As a training resource, two alternative corpora, a small spelling speech corpus and an existing large continuous speech corpus, are used to train hidden Markov models (HMMs). Then their recognition results are compared to each other. To solve the problem of utterance speed difference between spelling utterances and continuous speech utterances, the adjustment of utterance speed has been taken into account. Two alternative language models, bigram and trigram, are used for investigating performance of spelling speech recognition. Our approach achieves up to 98.0% letter correction rate, 97.9% letter accuracy and 82.8% utterance correction rate when the language model is trained based on trigram and the acoustic model is trained from the small spelling speech corpus with eight Gaussian mixtures.

14.
In statistical language models, how to integrate diverse linguistic knowledge in a general framework that captures long-distance dependencies is a challenging issue. In this paper, an improved language model incorporating linguistic structure into a maximum entropy framework is presented. The proposed model combines a trigram with the structural knowledge of base phrases: the trigram captures the local relations between words, while the base-phrase structure represents the long-distance relations between syntactic structures. Knowledge of syntax, semantics, and vocabulary is integrated into the maximum entropy framework. Experimental results show that the proposed model improves language model perplexity by 24% and increases the sign language recognition rate by about 3% compared with the trigram model.
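For context (the standard log-linear form, not a formula quoted from the paper), a maximum entropy language model assigns

    P(w \mid h) = \frac{1}{Z(h)} \exp\Big(\sum_i \lambda_i f_i(h, w)\Big), \qquad Z(h) = \sum_{w'} \exp\Big(\sum_i \lambda_i f_i(h, w')\Big),

where the features $f_i$ would here encode both the trigram and the base-phrase structure constraints, and the weights $\lambda_i$ are trained to match the empirical feature expectations.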

15.
A bit of progress in language modeling
In the past several years, a number of different language modeling improvements over simple trigram models have been found, including caching, higher-order n-grams, skipping, interpolated Kneser–Ney smoothing, and clustering. We present explorations of variations on, or of the limits of, each of these techniques, including showing that sentence mixture models may have more potential. While all of these techniques have been studied separately, they have rarely been studied in combination. We compare a combination of all techniques together to a Katz smoothed trigram model with no count cutoffs. We achieve perplexity reductions between 38 and 50% (1 bit of entropy), depending on training data size, as well as a word error rate reduction of 8.9%. Our perplexity reductions are perhaps the highest reported compared to a fair baseline.
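For reference, the interpolated Kneser–Ney estimate mentioned has the standard recursive form (with absolute discount $D$, counts $c(\cdot)$, and $N_{1+}(h\,\bullet)$ the number of distinct words observed after history $h$):

    P_{\mathrm{KN}}(w \mid h) = \frac{\max(c(h,w) - D,\ 0)}{c(h)} + \frac{D \cdot N_{1+}(h\,\bullet)}{c(h)}\, P_{\mathrm{KN}}(w \mid h'),

where $h'$ is the history with its oldest word dropped; at the lower orders, continuation counts replace raw counts.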

16.
We present a unified probabilistic framework for statistical language modeling which can simultaneously incorporate various aspects of natural language, such as local word interaction, syntactic structure and semantic document information. Our approach is based on a recent statistical inference principle we have proposed, the latent maximum entropy principle, which allows relationships over hidden features to be effectively captured in a unified model. Our work extends previous research on maximum entropy methods for language modeling, which only allow observed features to be modeled. The ability to conveniently incorporate hidden variables allows us to extend the expressiveness of language models while alleviating the necessity of pre-processing the data to obtain explicitly observed features. We describe efficient algorithms for marginalization, inference and normalization in our extended models. We then use these techniques to combine two standard forms of language models: local lexical models (Markov N-gram models) and global document-level semantic models (probabilistic latent semantic analysis). Our experimental results on the Wall Street Journal corpus show that we obtain an 18.5% reduction in perplexity compared to the baseline trigram model with Good-Turing smoothing.

17.
Applied Soft Computing, 2008, 8(2): 1005-1017
Statistical language models are very useful tools for improving the recognition accuracy of optical character recognition (OCR) systems. Previous systems have used segmentation by maximum word matching, semantic class segmentation, or trigram language models. However, these methods have disadvantages, such as inaccuracies due to a preference for longer (possibly erroneous) words, failure to capture word dependencies, complex semantic segmentation of training data, and high memory requirements.

To overcome these problems, we propose a novel bigram Markov language model in this paper. This type of model does not strongly prefer long words and does not require semantically segmented training data. Furthermore, unlike trigram models, its memory requirement is small, so the scheme is suitable for handheld and pocket computers, which are expected to be a major future application of text recognition systems.

However, because the language model is simple, the bigram Markov model alone can introduce more errors. Hence, this paper describes a novel algorithm combining bigram Markov language models with heuristic fuzzy rules. The recognition accuracy is improved by the algorithm, and it is well suited to mobile and pocket computer applications, including, as the experimental results show, the ability to run on mobile phones.

The main contribution of this paper is to show how fuzzy techniques, in the form of linguistic rules, can be used to enhance the accuracy of a crisp recognition system while keeping computational complexity low.
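A minimal sketch, under our own simplifying assumptions, of the bigram part: a Viterbi search over per-position OCR word candidates (the candidate lists and scorer are placeholders; the fuzzy-rule combination is not shown).

    # Sketch: Viterbi search over OCR candidates scored with a bigram model.
    def viterbi(candidates, bigram_logprob):
        # candidates[i] = [(word, ocr_logscore), ...] for text position i;
        # bigram_logprob(prev, word) = log P(word | prev); "<s>" marks the start.
        prev = {"<s>": (0.0, [])}
        for cands in candidates:
            cur = {}
            for word, ocr in cands:
                # best predecessor state for this candidate word
                q, (score, path) = max(prev.items(),
                                       key=lambda kv: kv[1][0] + bigram_logprob(kv[0], word))
                cur[word] = (score + ocr + bigram_logprob(q, word), path + [word])
            prev = cur
        return max(prev.values(), key=lambda v: v[0])[1]  # best word sequence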

18.
In this paper, we present a stochastic language-modeling tool which aims at retrieving variable-length phrases (multigrams), assuming n-gram dependencies between them, hence the name of the model: n-multigram. The estimation of the probability distribution of the phrases is intermixed with a phrase-clustering procedure in a way which jointly optimizes the likelihood of the data. As a result, the language data are iteratively structured at both a paradigmatic and a syntagmatic level in a fully integrated way. We evaluate the 2-multigram model as a statistical language model on ATIS, a task-oriented database consisting of air travel reservations. Experiments show that the 2-multigram model allows a reduction of 10% of the word error rate on ATIS with respect to the usual trigram model, with 25% fewer parameters than in the trigram model. In addition, we illustrate the ability of this model to merge semantically related phrases of different lengths into a common class.

19.
At present, corpora for specific task domains are sparse and hard to collect, which severely limits the portability of dialogue systems. The goal of this paper is to use a small amount of training data collected online to adapt the language model quickly and thus effectively improve the recognition rate of a dialogue system in a new task domain. After revising the traditional cache model, this paper proposes a cache language model with decay over history units, adapts it on incrementally collected online data, and linearly interpolates it with a general language model. In a dialogue system the history unit is the dialogue turn, so the model can also be called a cache language model with dialogue-turn decay. Experiments in two entirely different task domains, Summer Palace tour guidance and train-ticket booking, show that with fewer than 1,000 adaptation sentences the recognition error rate drops by 47.8% and 74.0% respectively in supervised mode, and by 30.1% and 51.1% in unsupervised mode, compared with the non-adaptive model.
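A minimal sketch of the decaying-cache idea, assuming an exponential decay over dialogue turns (the decay form and constants are our assumption, not the paper's):

    # Sketch: cache LM whose word counts decay with the age of their dialogue turn,
    # linearly interpolated with a general background language model.
    from collections import defaultdict

    class DecayingCacheLM:
        def __init__(self, decay=0.9, lam=0.2):
            self.decay = decay  # per-turn decay factor (assumed exponential decay)
            self.lam = lam      # interpolation weight of the cache component
            self.turns = []     # list of word lists, one per dialogue turn

        def add_turn(self, words):
            self.turns.append(list(words))

        def cache_prob(self, word):
            weights = defaultdict(float)
            for age, turn in enumerate(reversed(self.turns)):
                for token in turn:
                    weights[token] += self.decay ** age  # newest turn weighs most
            total = sum(weights.values())
            return weights[word] / total if total else 0.0

        def prob(self, word, general_prob):
            # linear interpolation with the general language model
            return self.lam * self.cache_prob(word) + (1 - self.lam) * general_prob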
