Similar Documents
20 matching records found (search time: 15 ms)
1.
Large vocabulary recognition of on-line handwritten cursive words
This paper presents a writer-independent system for large vocabulary recognition of on-line handwritten cursive words. The system first uses a filtering module, based on simple letter features, to quickly reduce a large reference dictionary (lexicon) to a more manageable size; the reduced lexicon is subsequently fed to a recognition module. The recognition module uses a temporal representation of the input, instead of a static two-dimensional image, thereby preserving the sequential nature of the data and enabling the use of a Time-Delay Neural Network (TDNN); such networks have previously been successful in the continuous speech recognition domain. Explicit segmentation of the input words into characters is avoided by sequentially presenting the input word representation to the neural network-based recognizer. The outputs of the recognition module are collected and converted into a string of characters that is matched against the reduced lexicon using an extended Damerau-Levenshtein function. Trained on 2,443 unconstrained word images (11k characters) from 55 writers and using a 21k lexicon, the system reached top-5 word recognition rates of 97.9% and 82.4% on writer-dependent and writer-independent tests, respectively.
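The extended Damerau-Levenshtein function used for lexicon matching is not spelled out in the abstract; as a reference point, the restricted (optimal string alignment) variant that such extensions build on can be sketched as:

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance:
    minimum number of insertions, deletions, substitutions, and adjacent
    transpositions needed to turn string a into string b."""
    m, n = len(a), len(b)
    # d[i][j] = distance between prefixes a[:i] and b[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]
```

Matching the recognizer's output string against each lexicon entry with such a function tolerates the character-level substitutions, drops, and swaps typical of cursive recognition.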

2.
In speech emotion recognition research, most existing deep-learning-based methods do not model features in both the time and frequency domains of speech, and they suffer from long training times and limited recognition accuracy. A spectrogram is a special image, converted from the speech signal, that carries both time- and frequency-domain information. To fully extract emotional features from both domains of the spectrogram, this paper proposes a speech emotion recognition model based on parameter transfer and a convolutional recurrent neural network. The model takes the spectrogram as network input, introduces the AlexNet architecture and transfers its pre-trained convolutional-layer weights, then reshapes the feature maps output by the convolutional network and feeds them into an LSTM (Long Short-Term Memory) network for training. Experimental results show that the proposed method speeds up network training and improves emotion recognition accuracy.
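As a rough illustration of the model's input, a log-magnitude spectrogram can be computed from a raw waveform with a short-time Fourier transform; the FFT size and hop length below are illustrative choices, not values from the paper:

```python
import numpy as np

def log_spectrogram(signal, n_fft=256, hop=128):
    """Frame the signal, apply a Hann window, and take the log-magnitude
    of the FFT of each frame -> a (freq_bins, time_frames) image."""
    win = np.hanning(n_fft)
    starts = range(0, len(signal) - n_fft + 1, hop)
    frames = np.stack([signal[s:s + n_fft] * win for s in starts])
    mag = np.abs(np.fft.rfft(frames, axis=1))  # (frames, n_fft//2 + 1)
    return np.log1p(mag).T                     # (freq_bins, frames)

x = np.random.randn(16000)   # 1 s of synthetic "audio" at 16 kHz
S = log_spectrogram(x)       # image-like array, ready for a CNN input
```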

3.
In order to determine priorities for improving timing in synthetic speech, this study examines the roles of segmental duration prediction and of the phonological symbolic representation in the perceptual quality of a text-to-speech system. In perception experiments using German speech synthesis, two standard duration models (Klatt rules and CART) were tested. The input to these models consisted of a symbolic representation derived either from a database or from a text-to-speech system. The results of the perception experiments show that different duration models can only be distinguished when the symbolic representation is appropriate. Given the relative importance of the symbolic representation, post-lexical segmental rules were investigated, with the outcome that listeners differ in their preferences regarding the degree of segmental reduction. In conclusion, before fine-tuning duration prediction, it is important to derive an appropriate phonological symbolic representation in order to improve timing in synthetic speech.
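The Klatt rule model referenced above predicts each segment's duration by applying rule percentages to the stretchable part of the duration while keeping the minimum duration fixed. A minimal sketch (the inherent duration, minimum duration, and rule percentage below are made-up illustrative values, not from the study):

```python
def klatt_duration(inherent_ms, min_ms, rule_percentages):
    """Klatt-style duration rule: each applicable rule scales the
    stretchable part (inherent - minimum) by a percentage; the minimum
    duration is never compressed. 100 means 'no change'."""
    prcnt = 1.0
    for p in rule_percentages:
        prcnt *= p / 100.0
    return (inherent_ms - min_ms) * prcnt + min_ms

# e.g. a vowel with inherent 130 ms and minimum 50 ms,
# shortened to 60% by a (hypothetical) non-phrase-final rule:
d = klatt_duration(130, 50, [60])
```

Cascading several rules multiplies their percentages, which is what makes the rule system easy to inspect but hard to tune, one motivation for learned alternatives such as CART.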

4.
Automatic recognition of children's speech is a challenging topic in computer-based speech recognition systems. The conventional feature extraction method, Mel-frequency cepstral coefficients (MFCC), is not efficient for children's speech recognition. This paper proposes a novel fuzzy-based discriminative feature representation to address the recognition of Malay vowels uttered by children. Owing to age-dependent variation of the acoustical speech parameters, the performance of automatic speech recognition (ASR) systems degrades on children's speech. To solve this problem, this study addresses the representation of relevant and discriminative features for children's speech recognition. The methods include extraction of MFCCs with a narrower filter bank followed by a fuzzy-based feature selection method. The proposed feature selection provides relevant, discriminative, and complementary features; to achieve this, conflicting objective functions measuring the goodness of the features must be fulfilled, so a fuzzy formulation of the problem and fuzzy aggregation of the objectives are used to address the uncertainties involved. The proposed method can reduce dimensionality without compromising the speech recognition rate. To assess its capability, the study analyzed six Malay vowels from recordings of 360 children aged 7 to 12. Upon extracting the features, two well-known classification methods, MLP and HMM, were employed for the speech recognition task, with optimal parameter adjustment performed to adapt each classifier to the experiments. The experiments were conducted in a speaker-independent manner. The proposed method performed better than conventional MFCCs and a number of conventional feature selection methods in the children's speech recognition task. The fuzzy-based feature selection allowed flexible selection of the MFCCs with the best discriminative ability, enhancing the differences between the vowel classes.

5.
With the continuing progress of large-scale automatic speech recognition, computer-assisted pronunciation training built on ASR has advanced as well; as a supplement to traditional teaching, it greatly compensates for the shortage of traditional educational resources and for the inability of traditional methods to give learners timely feedback. Detecting and evaluating the mispronunciations of second-language (L2) learners is one of the more important research topics in computer-assisted pronunciation training. To address the lack of annotated L2 mispronunciation data for the mispronunciation detection task, this paper proposes a method based on acoustic phoneme embeddings and Siamese networks. Pairs of speech features carrying pairing information are taken as system input, and a neural network maps the features to a high-level representation in which different phonemes are expected to be separable. A Siamese network is introduced during training to adjust and optimize the distance between the two output phoneme vectors according to whether they come from the same phoneme class, with the optimization driven by a corresponding loss function. Results show that a Siamese network with a cosine-based maximum-margin distance loss achieved 89.93% accuracy, outperforming the other methods in the experiments. Applied to the mispronunciation detection task, the method also achieved a diagnostic accuracy of 89.19% without using annotated L2 mispronunciation data for training.

6.
This paper presents a new technique to enhance the performance of the input interface of spoken dialogue systems, based on a procedure that combines during speech recognition the advantages of using prompt-dependent language models with those of using a language model independent of the prompts generated by the dialogue system. The technique proposes a new speech recognizer, termed a contextual speech recognizer, that uses a prompt-independent language model to allow recognizing any kind of sentence permitted in the application domain, while at the same time using contextual information (in the form of prompt-dependent language models) to take into account that some sentences are more likely to be uttered than others at a particular moment of the dialogue. The experiments show that the technique clearly enhances the performance of the input interface of a previously developed dialogue system based exclusively on prompt-dependent language models. Most importantly, in comparison with a standard speech recognizer that uses just one prompt-independent language model without contextual information, the proposed recognizer increases the word accuracy and sentence understanding rates by 4.09% and 4.19% absolute, respectively. These scores are slightly better than those obtained using linear interpolation of the prompt-independent and prompt-dependent language models used in the experiments.
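The linear interpolation baseline mentioned at the end combines the two language models at the probability level; a one-line sketch, where the interpolation weight `lam` is an assumed free parameter to be tuned on held-out data:

```python
def interpolate(p_prompt: float, p_general: float, lam: float) -> float:
    """Linearly interpolated word probability:
    P(w|h) = lam * P_prompt(w|h) + (1 - lam) * P_general(w|h).
    p_prompt:  probability under the prompt-dependent LM
    p_general: probability under the prompt-independent LM"""
    return lam * p_prompt + (1 - lam) * p_general

# A word strongly predicted by the current prompt context keeps a high
# score even if the general model finds it rare:
p = interpolate(0.5, 0.1, lam=0.5)
```

The paper's contextual recognizer instead keeps the two models separate inside the decoder, which is why it can slightly outperform this interpolation.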

7.
目的视觉目标的形状特征表示和识别是图像领域中的重要问题。在实际应用中,视角、形变、遮挡和噪声等干扰因素造成识别精度较低,且大数据场景需要算法具有较高的学习效率。针对这些问题,本文提出一种全尺度可视化形状表示方法。方法在尺度空间的所有尺度上对形状轮廓提取形状的不变量特征,获得形状的全尺度特征。将获得的全部特征紧凑地表示为单幅彩色图像,得到形状特征的可视化表示。将表示形状特征的彩色图像输入双路卷积网络模型,完成形状分类和检索任务。结果通过对原始形状加入旋转、遮挡和噪声等不同干扰的定性实验,验证了本文方法具有旋转和缩放不变性,以及对铰接变换、遮挡和噪声等干扰的鲁棒性。在通用数据集上进行形状分类和形状检索的定量实验,所得准确率在不同数据集上均超过对比算法。在MPEG-7数据集上精度达到99.57%,对比算法的最好结果为98.84%。在铰接和射影变换数据集上皆达到100%的识别精度,而对比算法的最好结果分别为89.75%和95%。结论本文提出的全尺度可视化形状表示方法,通过一幅彩色图像紧凑地表达了全部形状信息。通过卷积模型既学习了轮廓点间的形状特征关系,又学习了不同尺度间的形状特征关系。本文方法...  相似文献   

8.
Some applications of speech recognition, such as automatic directory information services, require very large vocabularies. In this paper, we focus on the task of recognizing surnames in an Interactive telephone-based Directory Assistance Services (IDAS) system, which exceeds other large vocabulary applications in terms of complexity and vocabulary size. We present a method for building compact networks in order to reduce the search space in very large vocabularies using Directed Acyclic Word Graphs (DAWGs). Furthermore, trees, graphs, and full-forms (whole words with no merging of nodes) are compared in a straightforward way under the same conditions, using the same decoder and the same vocabularies. Experimental results showed that, as we move from full-form lexicons to trees and then to graphs, the size of the recognition network is reduced, as is the recognition time; recognition accuracy is nevertheless retained, since the same phoneme combinations are involved. Subsequently, we refine the N-best hypothesis list provided by the speech recognizer by applying context-dependent phonological rules. Thus, a small N in the N-best hypothesis list produces multiple solutions sufficient to retain high accuracy while achieving real-time response. Recognition tests with a vocabulary of 88,000 surnames corresponding to 123,313 distinct pronunciations proved the efficiency of the approach. For N = 3 (a value that ensures fast performance), the recognition accuracy was 70.27% before the application of rules; after applying phonological rules, recognition performance rose to 86.75%.
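The compaction measured when moving from full-form lexicons to trees comes from sharing common prefixes; a toy sketch with hypothetical surnames (a DAWG would additionally merge shared suffixes, shrinking the network further):

```python
def trie_node_count(words):
    """Insert words into a prefix tree and count its nodes; sharing
    common prefixes is what shrinks the recognition network relative
    to a full-form lexicon, where every word contributes all its arcs."""
    root, count = {}, 0
    for w in words:
        node = root
        for ch in w:
            if ch not in node:
                node[ch] = {}
                count += 1
            node = node[ch]
    return count

# Hypothetical surnames sharing the prefix "papa":
words = ["papadopoulos", "papadakis", "papandreou"]
full_form = sum(map(len, words))  # arcs with no sharing at all
shared = trie_node_count(words)   # arcs after prefix merging
```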

9.
This paper proposes a novel method based on Spectral Regression (SR) for efficient scene recognition. First, a new SR approach, called Extended Spectral Regression (ESR), is proposed to perform manifold learning on a huge number of data samples. Then, an efficient Bag-of-Words (BOW) based method is developed which employs ESR to encapsulate local visual features with their semantic, spatial, scale, and orientation information for scene recognition. In many applications, such as image classification and multimedia analysis, a training set contains a huge number of low-level feature samples, which prohibits direct application of SR to perform manifold learning on such a dataset. In ESR, we first group the samples into tiny clusters, and then devise an approach to reduce the size of the similarity matrix for graph learning. In this way, subspace learning on the graph Laplacian of a vast dataset becomes computationally feasible on a personal computer. In ESR-based scene recognition, we first propose an enhanced low-level feature representation which combines the scale, orientation, spatial position, and local appearance of a local feature. Then, ESR is applied to embed the enhanced low-level image features. The ESR-based feature embedding not only generates a low-dimensional feature representation but also integrates various aspects of the low-level features into the compact representation. The bag-of-words is then generated from the embedded features for image classification. Comparative experiments on open benchmark datasets for scene recognition demonstrate that the proposed method outperforms baseline approaches. It is suitable for real-time applications on mobile platforms, e.g. tablets and smart phones.

10.

Subjects recalled items from short‐term memory by speaking into a speech recognizer. Two experiments examined the effects of the type of feedback provided by the device during this data entry task. Three types of feedback were compared, varying in modality (auditory or visual), timing (concurrent or terminal), and specificity (verbal or nonverbal). Recognizer performance was better with concurrent feedback than with terminal feedback, and better with nonverbal feedback than with verbal feedback. In terms of the efficiency of memory (the number of errors and the rate of data throughput), performance was more impaired by concurrent verbal feedback than by nonverbal feedback. Two main functional features of feedback in automatic speech recognition were identified: (1) the degree of similarity between the feedback and the phonologically‐coded information held in short‐term memory, which points to the dangers of spoken feedback and, to a lesser extent, of verbal visual feedback; and (2) the extent to which prompting is required to establish the timeliness of data input, a feature that is especially important with isolated‐word speech recognition.

11.
A conventional approach to noise-robust speech recognition consists of employing a speech enhancement pre-processor prior to recognition. However, such a pre-processor usually introduces artifacts that limit the recognition performance improvement. In this paper we discuss a framework for improving the interconnection between speech enhancement pre-processors and a recognizer. The framework relies on recent proposals for increasing robustness by replacing the point estimate of the enhanced features with a distribution having a dynamic (i.e. time-varying) feature variance. We have recently proposed a model for the dynamic feature variance consisting of a dynamic feature variance root obtained from the pre-processor, multiplied by a weight representing the pre-processor uncertainty, with adaptation data used to optimize the uncertainty weight. The formulation of the method is general and could be used with any speech enhancement pre-processor. However, we observed that in the case of noise reduction based on spectral subtraction or related approaches, adaptation can fail because the proposed model is weak at accurately representing the actual dynamic feature variance. The dynamic feature variance changes according to the level of the speech sound, which varies with the HMM states. Therefore, we propose improving the model by introducing HMM state dependency. We achieve this by using a cluster-based representation, i.e. the Gaussians of the acoustic model are grouped into clusters and a different pre-processor uncertainty weight is associated with each cluster. Experiments with various pre-processors and recognition tasks prove the generality of the proposed integration scheme and show that the proposed extension improves performance with various speech enhancement pre-processors.

12.
许友亮, 张连海, 屈丹, 牛铜. 《计算机工程》, 2012, 38(11): 160-162, 166
This paper proposes a phonological attribute detection method based on long-span information, implemented with a two-level (low and high) time-delay neural network (TDNN). The low-level TDNN detects phonological attributes from short-term features; the high-level TDNN, building on the low-level detection results, fuses information over a longer time span. Experimental results show that introducing long-span features raises the attribute detection rate by about 3%, and when the attribute posterior probabilities are used as observation features in a phoneme recognition system, the long-span features improve the recognition result by about 1.7%.
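A TDNN sees each frame together with a delayed window of its neighbors; the frame-splicing step that feeds such a network can be sketched as follows (the window of three frames on each side is an assumed value, not taken from the paper):

```python
import numpy as np

def stack_context(feats, left=3, right=3):
    """Splice each frame with its neighbors (TDNN-style delay window).
    feats: (T, D) array -> (T, D * (left + 1 + right)) array,
    with edge frames padded by repetition."""
    T, D = feats.shape
    padded = np.concatenate([np.repeat(feats[:1], left, axis=0),
                             feats,
                             np.repeat(feats[-1:], right, axis=0)])
    return np.stack([padded[t:t + left + 1 + right].reshape(-1)
                     for t in range(T)])

F = np.arange(20.0).reshape(10, 2)  # 10 frames of 2-dim features
X = stack_context(F)                # (10, 14): 7-frame windows
```

The high-level TDNN in the paper effectively repeats this idea on the low-level outputs, widening the effective temporal span.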

13.
An improved algorithm for computing MFCC speech features
This paper proposes an improved algorithm for computing Mel-frequency cepstral coefficient (MFCC) features. The algorithm uses the warped discrete Fourier transform (WDFT) to increase the spectral resolution of the low-frequency part of the speech signal, making it better match the characteristics of the human auditory system, and applies weighted filter bank analysis (WFBA) to improve the robustness of the MFCCs. Phoneme recognition results on the DR1 subset of the TIMIT continuous-speech corpus show that the proposed improved algorithm achieves a better recognition rate than the conventional MFCC algorithm.
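The final step of any MFCC pipeline, unchanged by the WDFT/WFBA modifications described above, is a DCT-II over the log filter-bank energies; a sketch using 26 filter-bank channels and 13 cepstra (assumed, though common, counts):

```python
import numpy as np

def cepstra(log_energies, n_ceps=13):
    """Final MFCC step: an (unnormalized) DCT-II of the log filter-bank
    energies decorrelates the channels; the first n_ceps coefficients
    are kept as the cepstral features."""
    M = len(log_energies)
    n = np.arange(M)
    return np.array([np.sum(log_energies *
                            np.cos(np.pi * k * (2 * n + 1) / (2 * M)))
                     for k in range(n_ceps)])

E = np.log(np.ones(26) * 2.0)  # flat (toy) filter-bank output
c = cepstra(E)                  # flat spectrum -> energy only in c[0]
```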

14.
Recognition of easily confusable phonemes in Mandarin Chinese
Targeting the acoustic characteristics of easily confusable phonemes in Mandarin speech recognition, this paper applies wavelet packet decomposition to perceptual linear prediction (PLP) features and proposes a new feature extraction algorithm that describes the spectral characteristics of confusable phonemes more precisely. Gaussian mixture models are used to classify the new acoustic features and thereby discriminate the phonemes. Experimental results show that the new feature parameters outperform conventional PLP features, reducing the recognition error rate by more than 30%.

15.
In this paper we demonstrate the use of prosody models for developing speech systems in Indian languages. Duration and intonation models developed using feedforward neural networks are considered as prosody models. Labelled broadcast news data in the languages Hindi, Telugu, Tamil, and Kannada is used for developing the neural network models for predicting duration and intonation. Features representing positional, contextual, and phonological constraints are used for developing the prosody models. The use of the prosody models is illustrated with speech recognition, speech synthesis, speaker recognition, and language identification applications. Autoassociative neural networks and support vector machines are used as classification models for developing the speech systems. The performance of the speech systems is shown to improve when the prosodic features are combined with a popular spectral feature set consisting of Weighted Linear Prediction Cepstral Coefficients (WLPCCs).

16.
Objective: Deep-learning-based multi-focus image fusion methods mainly use a convolutional neural network (CNN) to classify pixels as focused or defocused. The supervised learning process typically relies on synthetic datasets, and the accuracy of the label data directly affects classification accuracy, which in turn affects the accuracy of the subsequently hand-crafted fusion rules and the quality of the fused all-in-focus image. To let the fusion network adaptively adjust its fusion rules, this paper proposes a multi-focus image fusion algorithm based on self-learned fusion rules. Method: An autoencoder architecture extracts features while simultaneously learning fusion and reconstruction rules, yielding an unsupervised end-to-end fusion network. The initial decision map of the multi-focus images is fed in as a prior so the network learns rich image detail, and a local strategy combining the structural similarity index measure (SSIM) and mean squared error (MSE) is added to the loss function to ensure more accurate image reconstruction. Results: The model was evaluated subjectively and objectively on public datasets such as Lytro to verify the soundness of the fusion algorithm's design. Subjectively, the model not only fuses focused regions well and effectively avoids artifacts in the fused image, but also preserves enough detail for a natural, clear visual result. Objectively, compared in quantitative terms with the fused images of other mainstream multi-focus fusion algorithms, it achieves the best average scores in entropy, Qw, correlation coefficient, and visual information fidelity: 7.4574, 0.9177, 0.9788, and 0.8908, respectively. Conclusion: The proposed fusion algorithm for multi-focus images not only self-learns and adjusts its fusion rules but also produces fused images comparable to existing methods, helping to further the understanding of deep-learning-based multi-focus image fusion.
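The loss combining SSIM and MSE can be sketched as below; note this computes SSIM globally over the whole image rather than with the paper's local (windowed) strategy, and the mixing weight `alpha` is an assumption:

```python
import numpy as np

def ssim_global(x, y, c1=1e-4, c2=9e-4):
    """Single-window (global) SSIM between two images scaled to [0, 1]."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return (((2 * mx * my + c1) * (2 * cov + c2)) /
            ((mx * mx + my * my + c1) * (vx + vy + c2)))

def fusion_loss(pred, target, alpha=0.5):
    """Mix structural dissimilarity (1 - SSIM) with pixel error (MSE),
    in the spirit of the paper's combined reconstruction loss."""
    mse = ((pred - target) ** 2).mean()
    return alpha * (1 - ssim_global(pred, target)) + (1 - alpha) * mse

img = np.random.rand(32, 32)
```

A perfect reconstruction drives both terms to zero, while SSIM keeps the network honest about structure even when per-pixel error is small.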

17.
Artificial intelligence has seen wide application across many fields in recent years. To address the time-consuming manual weighing and cumbersome pricing workflow in supermarkets and food markets, this paper proposes an automatic fruit recognition algorithm based on an attention-augmented YOLOv5 model. First, to improve recognition accuracy for fruits that differ only in local features while sharing similar global features, SENet (squeeze-and-excitation networks) is added after the SPP (spatial pyramid pooling) layer of YOLOv5; the attention mechanism automatically learns the importance of each feature channel, strengthening features useful for the fruit recognition task and suppressing useless ones. Second, because GIoU cannot accurately express the overlap relationship when the predicted box and the target box overlap, the original GIoU bounding-box regression loss is replaced with CIoU, which additionally considers the aspect ratios of the target and predicted boxes and the distance between their center points, bringing the predicted box closer to the ground truth and improving prediction accuracy. Experimental results show that the improved model markedly improves fruit recognition in common scenarios, reaching a mean average precision (mAP) of 99.10% and a recognition speed of 82 FPS, which meets the needs of practical applications.
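The CIoU metric adopted in place of GIoU augments IoU with a center-distance term and an aspect-ratio consistency term; a plain-Python sketch of the CIoU score for two boxes in corner format (the loss used in training would be 1 - CIoU):

```python
import math

def ciou(box1, box2):
    """Complete IoU for boxes given as (x1, y1, x2, y2):
    CIoU = IoU - center_dist^2 / enclosing_diag^2 - alpha * v,
    where v penalizes aspect-ratio mismatch."""
    x1, y1, x2, y2 = box1
    X1, Y1, X2, Y2 = box2
    iw = max(0.0, min(x2, X2) - max(x1, X1))
    ih = max(0.0, min(y2, Y2) - max(y1, Y1))
    inter = iw * ih
    union = (x2 - x1) * (y2 - y1) + (X2 - X1) * (Y2 - Y1) - inter
    iou = inter / union
    # squared center distance over squared enclosing-box diagonal
    cw = max(x2, X2) - min(x1, X1)
    ch = max(y2, Y2) - min(y1, Y1)
    rho2 = ((x1 + x2 - X1 - X2) ** 2 + (y1 + y2 - Y1 - Y2) ** 2) / 4.0
    c2 = cw * cw + ch * ch
    # aspect-ratio consistency term
    v = (4 / math.pi ** 2) * (math.atan((X2 - X1) / (Y2 - Y1)) -
                              math.atan((x2 - x1) / (y2 - y1))) ** 2
    alpha = v / (1 - iou + v) if v > 0 else 0.0
    return iou - rho2 / c2 - alpha * v
```

Unlike plain IoU, the score stays informative (and negative) for non-overlapping boxes, giving the regressor a gradient toward the target.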

18.
In contrast to speech recognition, whose speech features have been extensively explored in the research literature, feature extraction in Sign Language Recognition (SLR) is still a very challenging problem. In this paper we present a methodology for feature extraction in Brazilian Sign Language (BSL, or LIBRAS in Portuguese) that explores the phonological structure of the language and relies on an RGB-D sensor for obtaining intensity, position, and depth data. From the RGB-D images we obtain seven vision-based features. Each feature is related to one, two, or three structural elements in BSL. We investigate this relation between the extracted features and structural elements based on the shape, movement, and position of the hands. Finally, we employ Support Vector Machines (SVM) to classify signs based on these features and linguistic elements. The experiments show that the attributes of these elements can be successfully recognized in terms of the features obtained from the RGB-D images, with individual accuracy results above 80% on average. The proposed feature extraction methodology and the decomposition of signs into their phonological structure are a promising approach to help expert systems designed for SLR.

19.

This paper proposes a speaker recognition system using acoustic features that are based on spectral-temporal receptive fields (STRFs). The STRF is derived from physiological models of the mammalian auditory system in the spectral-temporal domain. With the STRF, a signal is expressed by rate (in Hz) and scale (in cycles/octave); the rate and scale specify the temporal response and spectral response, respectively. This paper uses the proposed STRF-based feature to perform speaker recognition. First, the energy of each scale is calculated using the STRF representation. A logarithmic operation is then applied to the scale energies. Finally, a discrete cosine transform is applied to generate the proposed STRF feature. This paper also presents a feature set that combines the proposed STRF feature with conventional Mel frequency cepstral coefficients (MFCCs). Support vector machines (SVMs) are adopted as the speaker classifiers. To evaluate the performance of the proposed speaker recognition system, experiments on 36-speaker recognition were conducted. Compared with the MFCC baseline, the proposed feature set increases the speaker recognition rates by 3.85% and 18.49% on clean and noisy speech, respectively. The experimental results demonstrate the effectiveness of adopting the STRF-based feature in speaker recognition.

20.
In end-to-end deep-learning speech recognition models, the fixed-length speech frames used as model input cause a loss of temporal information and of some high-frequency information, leading to low recognition rates and poor robustness. To address these problems, this paper proposes a model combining a residual network with a bidirectional long short-term memory network. The model takes spectrograms as input, designs parallel convolutional layers within the residual network to extract features at different scales, fuses these features, and finally classifies them with connectionist temporal classification (CTC), yielding an end-to-end speech recognition model. Experimental results show that on the Aishell-1 corpus the model's character error rate (WER) is 2.52% lower than that of a conventional end-to-end model, with good robustness.
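A CTC output layer can be decoded greedily by collapsing consecutive repeated labels and then removing blanks; a minimal sketch of this final step (treating label id 0 as the blank is an assumed convention):

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Greedy CTC decoding: collapse consecutive repeats, then drop
    blanks. frame_ids is the best label id per frame, e.g. the argmax
    of the network's per-frame softmax output."""
    out, prev = [], None
    for t in frame_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

# blank-separated repeats survive; adjacent repeats collapse:
labels = ctc_greedy_decode([0, 1, 1, 0, 1, 2, 2, 0])
```

This is how a frame-synchronous CTC model emits a character sequence shorter than its input without any explicit segmentation.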
