Similar Documents
19 similar documents found.
1.
In a character recognition system, the post-processing module plays an important role in further improving the text recognition rate. Targeting the linguistic characteristics of Japanese, this paper builds a hybrid language model that combines statistical methods with rules and implements a post-processing system for Japanese text recognition. The system first uses the Viterbi algorithm to obtain the optimal output of the statistical model, compares it with the recognition result supplied by the front-end recognizer to locate suspicious characters, and then applies context-based word matching and a grammar rule base to detect and correct errors at those positions. Experiments show that the post-processing system reduces the error rate on recognized printed Japanese text by 21.4% on average.

2.
Research on post-processing for Chinese text recognition based on HMM
This paper describes post-processing for Chinese text recognition with an HMM (Hidden Markov Model), combining the Chinese language model and the isolated-character recognition model so as to make full use of the information provided by the character recognizer. The parameters of the language model are estimated from a corpus; the parameters of the character recognition model are conditional probabilities which, by theoretical analysis, can be converted into posterior probabilities for solution. Based on an analysis of the character recognition results on the training set, a statistical method is proposed to estimate the posterior probabilities of candidate characters. Experiments with the HMM on offline handwritten Chinese text recognition show that post-processing performance depends not only on the language model but also on accurate estimation of the posterior probabilities.
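
As a rough illustration of the kind of posterior estimation the abstract mentions, the sketch below maps a recognizer's candidate distance scores to posterior-like probabilities with a softmax over the candidate set; the candidate characters, distances, and temperature parameter are all hypothetical, and the paper's own statistical estimation method may well differ.

```python
import math

def candidate_posteriors(distances, temperature=1.0):
    """Map a recognizer's candidate distance scores to posterior-like
    probabilities with a softmax over the candidate set.  This is only an
    illustrative stand-in for the paper's statistical estimation method."""
    weights = [math.exp(-d / temperature) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical candidate list for one character position:
# each tuple is (candidate_character, recognizer_distance).
candidates = [("识", 0.8), ("只", 1.1), ("织", 1.9)]
posteriors = candidate_posteriors([d for _, d in candidates])
for (ch, _), p in zip(candidates, posteriors):
    print(ch, round(p, 3))
```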

3.
Post-processing for Chinese text recognition based on HMM
This paper describes post-processing for Chinese text recognition with an HMM, combining the Chinese language model and the isolated-character recognition model so as to make full use of the information provided by the character recognizer. The parameters of the language model are estimated from a corpus; the parameters of the character recognition model are conditional probabilities which, by theoretical analysis, can be converted into posterior probabilities for solution. Based on an analysis of the character recognition results on the training set, a statistical method is proposed to estimate the posterior probabilities of candidate characters. Experiments with the HMM on offline handwritten Chinese text recognition show that post-processing performance depends not only on the language model but also on accurate estimation of the posterior probabilities.

4.
Research on post-processing for Chinese character recognition based on an N-gram language model
To improve the recognition rate of Chinese text, this paper combines a statistical N-gram language model with the probabilistic model of an isolated-character recognizer so as to make full use of the information the recognizer provides. The method treats a character sequence with definite boundaries (usually a sentence) as one processing unit and, using statistically obtained character co-occurrence probabilities and distance values, applies the Viterbi algorithm to post-process recognized Chinese text automatically. Experiments show that post-processing raises the average character recognition accuracy from 97.62% to 98.71%.
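
A minimal sketch of this style of Viterbi post-processing, assuming per-position candidate sets from the character recognizer and two user-supplied scoring callables (`bigram_logp`, `recog_logp`); it is not the paper's implementation and omits the distance-value information the abstract mentions.

```python
def viterbi_postprocess(candidate_lists, bigram_logp, recog_logp):
    """Choose the best character sequence over per-position candidate sets,
    combining character-bigram log-probabilities with recognizer scores.
    `candidate_lists[t]` is the list of candidate characters at position t;
    `bigram_logp(prev, cur)` and `recog_logp(t, cur)` are assumed callables."""
    # best[t][c] = (score, backpointer) for candidate c at position t
    best = [{c: (recog_logp(0, c), None) for c in candidate_lists[0]}]
    for t in range(1, len(candidate_lists)):
        layer = {}
        for cur in candidate_lists[t]:
            score, prev_best = max(
                (best[t - 1][prev][0] + bigram_logp(prev, cur), prev)
                for prev in candidate_lists[t - 1]
            )
            layer[cur] = (score + recog_logp(t, cur), prev_best)
        best.append(layer)
    # Trace back the best-scoring path.
    cur = max(best[-1], key=lambda c: best[-1][c][0])
    path = [cur]
    for t in range(len(candidate_lists) - 1, 0, -1):
        cur = best[t][cur][1]
        path.append(cur)
    return "".join(reversed(path))
```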

5.
The dynamic coverage of immune recognizers is closely related to the effectiveness of an artificial immune system: as few recognizers as possible should cover as much of the NONSELF space as possible. Based on the principle of dynamic recognizer coverage, this paper represents immune recognizers with logic programs and uses the update property of logic programs to eliminate redundancy in the NONSELF space identified by the recognizers, reduce recognizer density, and enhance their dynamic coverage.

6.
Existing scene text recognizers are easily confused by blurred text images, which degrades their performance in practical applications. In recent years researchers have therefore proposed various scene text image super-resolution models as pre-processors for scene text recognition, in order to improve the quality of the input images. However, real-world training samples for the scene text image super-resolution task are hard to collect; moreover, existing models only learn to transform low-resolution (LR) text images into high-resolution (HR) ones and ignore the blur patterns from HR to LR images. This paper proposes a blur-pattern-aware module that learns blur patterns from existing real-world HR-LR text image pairs and transfers them to other HR images, generating LR images with varying degrees of degradation. The proposed module can produce a large number of HR-LR image pairs for scene text image super-resolution models, compensating for the shortage of training data and thereby significantly improving performance. Experimental results show that, when equipped with the proposed blur-pattern-aware module, scene text image super-resolution methods improve further; for example, the recognition accuracy of the SOTA method TG improves by 5.8% when evaluated with the CRNN text recognizer.

7.
Research on continuous speech recognition methods for languages with scarce corpus resources
Because minority languages have their own characteristics, existing continuous speech recognition methods cannot simply be applied to them. Taking Mongolian as an example, this paper discusses the construction of acoustic and language models and implements a Mongolian speech recognition system on the continuous speech recognizer of the Advanced Telecommunications Research Institute International (ATR) in Japan. The paper focuses on language modeling: based on the agglutinative nature of Mongolian, it proposes building a multi-class N-gram model with similar-word clustering. Experimental results show that, with the proposed language model, recognition accuracy improves by 5.5% over the traditional word N-gram method.
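
To make the class-based idea concrete, here is a hedged sketch of how a class-based bigram model assigns probability to a word sequence; the dictionaries `word2class`, `p_word_given_class`, and `p_class_bigram` are hypothetical placeholders for statistics estimated from a clustered corpus, not the paper's actual model.

```python
import math

def class_bigram_logprob(sentence, word2class, p_word_given_class, p_class_bigram):
    """Log-probability of a word sequence under a class-based bigram model:
    P(w_i | w_{i-1}) is approximated by P(w_i | C(w_i)) * P(C(w_i) | C(w_{i-1}))."""
    logp = 0.0
    prev_class = "<s>"                     # sentence-start class
    for w in sentence:
        c = word2class[w]                  # class assigned by word clustering
        logp += math.log(p_word_given_class[(w, c)])
        logp += math.log(p_class_bigram[(prev_class, c)])
        prev_class = c
    return logp
```

Grouping words into classes pools the counts of rare words, which is why this kind of model helps when the corpus is small.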

8.
A preliminary study on automatic recognition of handwritten Chinese text based on linguistic knowledge
This paper first analyzes, from the viewpoint of information cost, the amount of information needed to recognize one Chinese character. The study shows that an isolated-character recognition algorithm is an equal-probability model and requires the most information. Chinese text can therefore be treated as a Markov model in which the occurrence of the current character depends only on the preceding m characters. From text statistics, a large amount of linguistic statistical information is obtained, and on this basis a sentence-based automatic text recognition method using linguistic knowledge is designed. During recognition, matching of the current character is carried out only within the set of characters that can follow the previous one; after a sentence has been recognized, it is processed with linguistic knowledge before the result is output. …

9.
Under realistic imaging conditions, the projected shape (silhouette) of a moving 3D target changes, so its recognizability also varies. To cope with such difficult situations, this paper defines the concepts of the dynamic feature space and the dynamic recognizability of patterns, and discusses the necessity of building multi-scale characteristic-view feature models of 3D targets as well as the rationality of using general constraints on target motion when recognizing target image sequences. On this basis, a multi-scale intelligent recursive recognition method (MUSIRR) for image sequences of moving 3D targets is proposed. An intelligent recognizer combining neural networks with a logic decision module is constructed, with BP and RBF networks as its basic units. In the training stage, the recognizer uses regular moment invariants of the targets' multi-scale binary characteristic-view models as sample feature vectors. In the recognition stage, while recursively recognizing the target image sequence, the algorithm makes full use of the fact that target pose cannot change abruptly and of reasonable constraints on the imaging process, thereby improving the recognition rate. Compared with 3D target recognition methods in the literature based on single-scale characteristic views, the proposed method has a simpler training process, needs fewer characteristic-view model samples, and can effectively handle not only single frames but also image sequences. Large-scale simulation experiments on several classes of aircraft targets confirm the rationality and effectiveness of the method.

10.
Building on hidden Markov models and statistical language models, this work focuses on segmentation of online handwritten Kazakh, classification of connected segments, and a dedicated feature extraction technique. The system first takes the main strokes of a connected segment, after the delayed strokes have been removed, as the input to the HMM recognizer, and then looks up the connected-segment classification dictionary using the index of the recognized main stroke and the delayed-stroke tags to obtain the recognition result of the corresponding connected segment. Removing the delayed strokes of connected segments effectively reduces the number of models that must be built, which increases recognition speed and avoids the problems caused by character segmentation.

11.
This paper describes the use of a neural network language model for large vocabulary continuous speech recognition. The underlying idea of this approach is to attack the data sparseness problem by performing the language model probability estimation in a continuous space. Highly efficient learning algorithms are described that enable the use of training corpora of several hundred million words. It is also shown that this approach can be incorporated into a large vocabulary continuous speech recognizer using a lattice rescoring framework, with very little additional processing time. The neural network language model was thoroughly evaluated in a state-of-the-art large vocabulary continuous speech recognizer on several international benchmark tasks, in particular the NIST evaluations on broadcast news and conversational speech recognition. The new approach is compared to four-gram back-off language models trained with modified Kneser–Ney smoothing, which has often been reported to be the best known smoothing method. Usually the neural network language model is interpolated with the back-off language model. In that way, consistent word error rate reductions were achieved for all considered tasks and languages, ranging from 0.4% to almost 1% absolute.
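
A minimal sketch of the interpolation step described above, assuming both language models expose a `prob(word, history)` method; the weight `lam` is a placeholder that would normally be tuned on held-out data, and the sketch is not the paper's rescoring code.

```python
import math

def interpolated_logprob(word, history, nn_lm, backoff_lm, lam=0.5):
    """Linear interpolation of a neural-network LM and a back-off n-gram LM,
    as typically used when rescoring lattice hypotheses.  `nn_lm` and
    `backoff_lm` are assumed to expose a prob(word, history) method."""
    p = lam * nn_lm.prob(word, history) + (1.0 - lam) * backoff_lm.prob(word, history)
    return math.log(p)
```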

12.
A new context processing method for Chinese character recognition combining characters and words
Exploiting the complementarity between character-level and word-level information, a context processing method combining characters and words is proposed. On top of isolated-character recognition, a forward-backward search algorithm first performs character-bigram context processing on a relatively large candidate set, which improves the text recognition rate while making the candidate set more effective; word-bigram context processing is then carried out on the reduced candidate set. The method effectively improves the text recognition rate while keeping the processing speed acceptable. Experiments on offline handwritten Chinese text (about 66,000 characters) show that character-bigram processing raises the text recognition rate from 81.58% to 94.50% and the top-10 cumulative accuracy from 94.33% to 98.25%; word-bigram processing then further raises the recognition rate to 95.75%.
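
The following sketch illustrates one way a forward-backward pass over per-position candidate sets can re-rank and prune candidates before a word-level pass, in the spirit of the abstract; `bigram_p` and `recog_p` are assumed probability functions, and this is an illustration of the general technique rather than the paper's exact search algorithm.

```python
def prune_candidates(candidate_lists, bigram_p, recog_p, keep=5):
    """Forward-backward pass over per-position candidate sets; candidates are
    re-ranked by their posterior marginals and only the top `keep` are kept
    for a later word-level pass."""
    T = len(candidate_lists)
    # Forward probabilities.
    fwd = [{c: recog_p(0, c) for c in candidate_lists[0]}]
    for t in range(1, T):
        fwd.append({
            cur: recog_p(t, cur) * sum(
                fwd[t - 1][prev] * bigram_p(prev, cur)
                for prev in candidate_lists[t - 1])
            for cur in candidate_lists[t]
        })
    # Backward probabilities.
    bwd = [dict() for _ in range(T)]
    bwd[T - 1] = {c: 1.0 for c in candidate_lists[T - 1]}
    for t in range(T - 2, -1, -1):
        bwd[t] = {
            cur: sum(bigram_p(cur, nxt) * recog_p(t + 1, nxt) * bwd[t + 1][nxt]
                     for nxt in candidate_lists[t + 1])
            for cur in candidate_lists[t]
        }
    # Keep the candidates with the largest marginal at each position.
    pruned = []
    for t in range(T):
        ranked = sorted(candidate_lists[t],
                        key=lambda c: fwd[t][c] * bwd[t][c], reverse=True)
        pruned.append(ranked[:keep])
    return pruned
```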

13.
Application of a multi-layer DGMM recognizer to Chinese Sign Language recognition
吴江琴  高文  陈熙霖  马继涌 《软件学报》2000,11(11):1430-1439
Sign language is the language used by deaf people: a relatively stable symbolic expression system in which hand shapes and movements are supplemented by facial expressions and postures, a language that relies on motion and vision for communication. The goal of sign language recognition research is to let machines "understand" the language of the deaf. Combined with sign language synthesis, it forms a human-computer sign language translation system that helps deaf people communicate with their surroundings. Sign language recognition is the problem of recognizing dynamic gesture signals. Considering real-time performance and recognition efficiency, the system uses a Cyberglove data glove as the input device and DGMM (dynamic Gaussian mixture model) as the recognition technique and, according to the characteristics of Chinese Sign Language, adopts a multi-layer recognizer in the recognition module. It can recognize 274 entries of the Chinese Sign Language dictionary with a recognition rate of 97.4%. Compared with a recognition system based on a single DGMM, this model achieves essentially the same accuracy but is markedly faster.

14.
There is significant lexical difference—words and usage of words—between spontaneous/colloquial language and the written language. This difference affects the performance of spoken language recognition systems that use statistical language models or context-free grammars, because these models are based on the written language rather than the spoken form. There are many filler phrases and colloquial phrases that appear solely or more often in spontaneous and colloquial speech. Chinese languages perhaps exemplify such a difference, as many colloquial forms of the language, such as Cantonese, exist strictly in spoken forms and are different from the written standard Chinese, which is based on Mandarin. A conventional way of dealing with this issue is to add colloquial terms manually to the lexicon. However, this is time-consuming and expensive. Meanwhile, supervised learning requires manual tagging of large corpora, which is also time-consuming. We propose an unsupervised learning method to find colloquial terms and classify filler and content phrases in spontaneous and colloquial Chinese, including Cantonese. We propose using frequency, strength, and spread measures of character pairs and groups to automatically extract frequent, out-of-vocabulary colloquial terms to add to a standard Chinese lexicon. An unsegmented and unannotated corpus is segmented with the augmented lexicon. We then propose a Markov classifier to classify Chinese characters into either content or filler phrases in an iterative training method. This method is task-independent and can extract even mixed-language terms. We show the effectiveness of our method on both a natural language query processing task and an adaptive Cantonese language-modeling task. The precision for content phrase extraction and classification is around 80%, with a recall of 99%, and the precision for filler phrase extraction and classification is around 99.5%, with a recall of approximately 89%. The web search precision using these extracted content words is comparable to that of search results with content phrases selected by humans. We adapt a language model trained from written texts with the Hong Kong Newsgroup corpus. It outperforms both the standard Chinese language model and the Cantonese language model. It also performs better than a language model trained simply by concatenating the two sets of standard and colloquial texts.
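
As a hedged illustration of the frequency and spread measures mentioned above, the sketch below counts adjacent character pairs and the number of distinct documents each pair occurs in, then proposes frequent, widely spread, out-of-vocabulary pairs as colloquial-term candidates; the thresholds and the restriction to character pairs (rather than longer groups) are simplifying assumptions, not the paper's exact measures.

```python
from collections import Counter, defaultdict

def pair_statistics(documents):
    """Count how often each adjacent character pair occurs (frequency) and in
    how many distinct documents it appears (a simple 'spread' measure)."""
    freq = Counter()
    doc_spread = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for a, b in zip(text, text[1:]):
            pair = a + b
            freq[pair] += 1
            doc_spread[pair].add(doc_id)
    return freq, {p: len(docs) for p, docs in doc_spread.items()}

def candidate_colloquial_terms(documents, lexicon, min_freq=20, min_spread=5):
    """Pairs that are frequent, widely spread, and not already in the lexicon
    are proposed as colloquial-term candidates to augment the lexicon."""
    freq, spread = pair_statistics(documents)
    return [p for p in freq
            if p not in lexicon and freq[p] >= min_freq and spread[p] >= min_spread]
```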

15.
This paper presents a new Bayesian-based method for unconstrained offline handwritten Chinese text line recognition. In this method, a sample of a real character or non-character in realistic handwritten text lines is jointly recognized by a traditional isolated-character recognizer and a character verifier, which requires only a moderate number of handwritten text lines for training. To improve its ability to distinguish between real characters and non-characters, the isolated-character recognizer is negatively trained using a linear discriminant analysis (LDA)-based strategy, which employs the outputs of a traditional MQDF classifier and the LDA transform to re-compute the posterior probability of isolated character recognition. In tests with 383 text lines from the HIT-MW database, the proposed method achieved character-level recognition rates of 71.37% without any language model and 80.15% with a bigram language model. These promising results show the effectiveness of the proposed method for unconstrained offline handwritten Chinese text line recognition.

16.
Research on post-processing techniques for online handwritten Chinese character recognition
This paper presents a computational language model combining rules and statistics for post-processing in online handwritten Chinese character recognition. A large-vocabulary statistical Markov language model and a quantified linguistic-rule model are integrated into one language decoder through a word-lattice technique. The post-processing method consists of three stages: word-lattice generation, language decoding, and a cache-based self-learning mechanism. The language decoder uses the Viterbi search algorithm to find the best sentence candidate. The technique has been applied to the online handwritten Chinese character recognition system of an HPC (handheld PC) pen-based computer, achieving a character recognition rate of 91.3%.

17.
A hybrid named entity recognizer for Turkish
Named entity recognition is an important subfield of the broader research area of information extraction from textual data. Yet named entity recognition research on Turkish texts is still rare compared to related research carried out on other languages such as English, Spanish, Chinese, and Japanese. In this study, we present a hybrid named entity recognizer for Turkish, which is based on a manually engineered rule-based recognizer that we have proposed. Since rule-based systems for specific domains require their knowledge sources to be manually revised when ported to other domains, we enrich our rule-based recognizer and turn it into a hybrid recognizer so that it learns from annotated data when available and improves its knowledge sources accordingly. The hybrid recognizer is originally engineered for generic news texts, but with its learning capability it can also be applied to financial news texts, historical texts, and child stories without human intervention. Both the hybrid recognizer and its rule-based predecessor are evaluated on the same corpora, and the hybrid recognizer achieves better results than its predecessor. The proposed hybrid named entity recognizer is significant since it is the first hybrid recognizer proposed for Turkish that addresses the above porting problem, considering that Turkish has different structural properties from widely studied languages such as English and that very limited information extraction research has been conducted on Turkish texts. Moreover, the use of the proposed hybrid recognizer for semantic video indexing is shown in a case study on Turkish news videos. The textual and video corpora used throughout the paper were compiled and annotated by the authors due to the lack of publicly available annotated corpora for information extraction research on Turkish texts.

18.
The present work is an attempt to develop a robust character recognizer for Telugu texts. We aim at designing a recognizer that exploits the inherent characteristics of the Telugu script. Our proposed method uses wavelet multi-resolution analysis to extract features and an associative memory model to accomplish the recognition task. Our system learns the style and font from the document itself and then recognizes the remaining characters in the document. The major contributions of the present study can be outlined as follows. It is a robust OCR system for Telugu printed text. It avoids a separate feature extraction process and exploits the inherent characteristics of Telugu characters through a careful selection of the wavelet basis function, which extracts the invariant features of the characters. It uses a Hopfield-based dynamic neural network (DNN) for learning and recognition. This is important because it overcomes the inherent difficulties of memory limitation and spurious states in the Hopfield network. The DNN has been shown to be efficient for associative memory recall. Although it is normally not suitable for image processing applications, the multi-resolution analysis reduces the sizes of the images to make the DNN applicable to the present domain. Our experimental results are extremely promising.

19.
A language model combining statistics and rules for speech recognition
王轩  王晓龙  张凯 《自动化学报》1999,25(3):309-315
After analyzing rule-based and statistics-based language models in speech recognition systems, this paper proposes a composite language model in which the rules are quantified. The model avoids the shortcoming that rule-based methods cannot scale to large-scale real text, while also improving the statistical model's ability to handle long-distance constraints and linguistic recursion. With the composite language model, the accuracy of a speaker-independent isolated-word speech recognition system covering 60,000 entries improves over the word TRIGRAM model alone by 4.9% (male voices) and 3.5% (female voices).
