
1.

A segment-based interpretation of HMM/ANN hybrids

《Computer Speech and Language》2007,21(3):562-578

Here we seek to understand the similarities and differences between two speech recognition approaches, namely the HMM/ANN hybrid and the posterior-based segmental model. Both these techniques create local posterior probability estimates and combine these estimates into global posteriors – but they are built on somewhat different concepts and mathematical derivations. The HMM/ANN hybrid combines the local estimates via a formulation that is inherited from the generative HMM concept, while the components of the segment-based model correspond quite directly to the two subtasks of phonetic decoding: segmentation and classification. In this paper we attempt to identify the corresponding components of the segmental model within the hybrid model, with the intent of gaining an insight from this unusual point of view. As regards one of these components, the segment-based phone posteriors, we show that the independence-based product rule combination applied in the hybrid produces strongly biased estimates. As for the other component, the segmentation probability factor, we argue that it is present in the hybrid thanks to the bias of the product rule – that is, the product rule goes wrong in such a special way that it helps the model find the best segmentation of the input. To prove this assertion, we combine this bias with the posterior estimates obtained by averaging, and find that the resulting ‘averaging hybrid’ slightly outperforms the standard one on a phone recognition task and a word recognition task as well. Overall we conclude that the contribution of the product rule to the decoding process is just as important for the segmentation subtask as it is for the segment classification subtask.
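The contrast between the product rule and averaging can be made concrete with a toy calculation. The sketch below is only an illustration under simplifying assumptions and is not the paper's formulation: the function names `product_rule` and `averaging` are hypothetical, each frame posterior is a plain dictionary, and no priors or transition probabilities are involved.

```python
import math

def product_rule(frame_posteriors, cls):
    """Multiply the per-frame posteriors of `cls` (independence assumption).

    Computed in the log domain to avoid underflow on long segments.
    """
    return math.exp(sum(math.log(frame[cls]) for frame in frame_posteriors))

def averaging(frame_posteriors, cls):
    """Average the per-frame posteriors of `cls` over the segment."""
    return sum(frame[cls] for frame in frame_posteriors) / len(frame_posteriors)

# Hypothetical segment: ten frames, each giving class "a" a posterior of 0.6.
frames = [{"a": 0.6, "b": 0.4}] * 10
print(product_rule(frames, "a"))  # shrinks as the segment gets longer
print(averaging(frames, "a"))     # stays at the per-frame level
```

In this toy setting the product depends strongly on segment length while the average does not, which gives some intuition for why the two rules score competing segmentations differently.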

2.

Contextual human action recognition based on conditional random fields


朱文球 刘强 《计算机工程与应用》2008,44(28):180-183

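We propose HMCRF, a new human action recognition method based on conditional random fields and the hidden Markov model (HMM). Existing action recognition methods all use the hidden Markov model and its variants, whose most prominent shortcoming is the requirement that observations be mutually independent. Conditional models can easily represent contextual dependencies, support efficient and exact inference via dynamic programming, and have parameters that can be trained by convex optimization. We apply conditional graphical models to action recognition, and extensive experiments show that the proposed method clearly outperforms ordinary linear-chain CRFs and HMMs in recognition accuracy.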

3.

Acoustic modeling problem for automatic speech recognition system: conventional methods (Part I)

Rajesh Kumar Aggarwal Mayank Dave 《International Journal of Speech Technology》2011,14(4):297-308

In automatic speech recognition (ASR) systems, the speech signal is captured and parameterized at the front end and evaluated at the back end using the statistical framework of the hidden Markov model (HMM). The performance of these systems depends critically on both the type of models used and the methods adopted for signal analysis. Researchers have proposed a variety of modifications and extensions for HMM-based acoustic models to overcome their limitations. In this review, we summarize most of the research work related to HMM-ASR which has been carried out during the last three decades. We present all these approaches under three categories, namely conventional methods, refinements and advancements of HMM. The review is presented in two parts (papers): (i) An overview of conventional methods for acoustic phonetic modeling, (ii) Refinements and advancements of acoustic models. Part I explores the architecture and working of the standard HMM with its limitations. It also covers different modeling units, language models and decoders. Part II presents a review of the advances and refinements of the conventional HMM techniques along with the current challenges and performance issues related to ASR.

4.

Acoustic quality normalization for robust automatic speech recognition

Ghulam Muhammad 《International Journal of Speech Technology》2007,10(4):175-182

An automatic speech recognition (ASR) system suffers from variation in the acoustic quality of the input speech. Speech may be produced in noisy environments, and different speakers have their own speaking styles. Variations can be observed even in the same utterance and for the same speaker in different moods. All these uncertainties and variations should be normalized to obtain a robust ASR system. In this paper, we apply and evaluate different approaches to acoustic quality normalization in an utterance for robust ASR. Several HMM (hidden Markov model)-based systems using utterance-level, word-level, and monophone-level normalization are evaluated against an HMM-SM (subspace method)-based system using monophone-level normalization for normalizing variations and uncertainties in an utterance. SM can represent variations of fine structures in sub-words as a set of eigenvectors, and so has better performance at the monophone level than HMM. Experimental results show that word accuracy is significantly improved by the HMM-SM-based system with monophone-level normalization compared to that of the typical HMM-based system with utterance-level normalization in both clean and noisy conditions. Experimental results also suggest that monophone-level normalization using SM has better performance than that using HMM.

5.

Nonlinear normalization of input patterns to speaker variability in speech recognition neural networks

Isar Nejadgholi Seyyed Ali Seyyedsalehi 《Neural computing & applications》2009,18(1):45-55

The issue of input variability resulting from speaker changes is one of the most crucial factors influencing the effectiveness of speech recognition systems. A solution to this problem is adaptation or normalization of the input, in a way that all the parameters of the input representation are adapted to that of a single speaker, and a kind of normalization is applied to the input pattern against the speaker changes, before recognition. This paper proposes three such methods in which some effects of the speaker changes that influence the speech recognition process are compensated. In all three methods, a feed-forward neural network is first trained for mapping the input into codes representing the phonetic classes and speakers. Then, among the 71 speakers used in training, the one showing the highest percentage of phone recognition accuracy is selected as the reference speaker so that the representation parameters of the other speakers are converted to the corresponding speech uttered by him. In the first method, the error back-propagation algorithm is used for finding the optimal point of every decision region relating to each phone of each speaker in the input space for all the phones and all the speakers. The distances between these points and the corresponding points related to the reference speaker are employed for offsetting the speaker change effects and the adaptation of the input signal to the reference speaker. In the second method, using the error back-propagation algorithm and maintaining the reference speaker data as the desirable speaker output, we correct all the speech signal frames, i.e., the train and the test datasets, so that they coincide with the corresponding speech of the reference speaker. In the third method, another feed-forward neural network is applied inversely for mapping the phonetic classes and speaker information to the input representation. 
The phonetic output retrieved from the direct network along with the reference speaker data are given to the inverse network. Using this information, the inverse network yields an estimation of the input representation adapted to the reference speaker. In all three methods, the final speech recognition model is trained using the adapted training data, and is tested by the adapted testing data. Implementing these methods and combining the final network results with un-adapted network based on the highest confidence level, an increase of 2.1, 2.6 and 3% in phone recognition accuracy on the clean speech is obtained from the three methods, respectively.

6.

Chinese entity mention detection based on multi-level feature integration

张海雷 曹菲菲 陈文亮 任飞亮 王会珍 朱靖波 《中文信息学报》2007,21(5):126-130

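Entity mention detection (EMD) is the task of identifying mentions of entities in text, including named, nominal, and pronominal mentions. This paper proposes a Chinese entity mention detection method based on multi-level feature integration: it exploits the feature-integration capability of the conditional random field model to combine features at several levels, including characters, pinyin, words and part-of-speech tags, various lists of proper names, and frequency statistics, to improve detection performance. The method uses a pipeline framework that annotates the attributes of each entity mention in three stages. A mention detection system based on this method took part in the Chinese EMD evaluation of the 2007 Automatic Content Extraction (ACE07) program, where its ACE Value ranked second.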

7.

Conditional models for contextual human motion recognition (total citations: 1; self-citations: 0; citations by others: 1)

Cristian Sminchisescu Atul Kanaujia Dimitris Metaxas 《Computer Vision and Image Understanding》2006,104(2-3):210

We describe algorithms for recognizing human motion in monocular video sequences, based on discriminative conditional random fields (CRFs) and maximum entropy Markov models (MEMMs). Existing approaches to this problem typically use generative structures like the hidden Markov model (HMM). Therefore, they have to make simplifying, often unrealistic assumptions on the conditional independence of observations given the motion class labels and cannot accommodate rich overlapping features of the observation or long-term contextual dependencies among observations at multiple timesteps. This makes them prone to myopic failures in recognizing many human motions, because even the transition between simple human activities naturally has temporal segments of ambiguity and overlap. The correct interpretation of these sequences requires more holistic, contextual decisions, where the estimate of an activity at a particular timestep could be constrained by longer windows of observations, prior and even posterior to that timestep. This would not be computationally feasible with an HMM, which requires the enumeration of a number of observation sequences exponential in the size of the context window. In this work we follow a different philosophy: instead of restrictively modeling the complex image generation process – the observation, we work with models that can unrestrictedly take it as an input, hence condition on it. Conditional models like the proposed CRFs seamlessly represent contextual dependencies and have computationally attractive properties: they support efficient, exact recognition using dynamic programming, and their parameters can be learned using convex optimization. 
We introduce conditional graphical models as complementary tools for human motion recognition and present an extensive set of experiments that show not only how these can successfully classify diverse human activities like walking, jumping, running, picking or dancing, but also how they can discriminate among subtle motion styles like normal walks and wander walks.

8.

A heterogeneous speech feature vectors generation approach with hybrid HMM classifiers

Virender Kadyan Archana Mantri R. K. Aggarwal 《International Journal of Speech Technology》2017,20(4):761-769

Automatic speech recognition (ASR) systems play a vital role in human–machine interaction. ASR systems face the challenge of performance degradation due to inconsistency between the training and testing phases. This occurs due to the extraction and representation of erroneous, redundant feature vectors. This paper proposes three different combinations at the speech feature vector generation phase and two hybrid classifiers at the modeling phase. In the feature extraction phase, MFCC, RASTA-PLP, and PLP are combined in different ways. In the modeling phase, the mean and variance are calculated to generate the inter- and intra-class feature vectors. These feature vectors are further adopted by an optimization algorithm to generate refined feature vectors with a traditional statistical technique. This approach uses GA + HMM and DE + HMM techniques to produce refined model parameters. The experiments are conducted on datasets of large vocabulary isolated Punjabi lexicons. The simulation results show a performance improvement using MFCC and the DE + HMM technique when compared with RASTA-PLP and PLP using hybrid HMM classifiers.

9.

Modeling state duration in HMMs and combining different feature parameters

郭庆 柴海新 吴文虎 《计算机研究与发展》1999,36(3):257-262

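In the conventional hidden Markov model, the probability that the model stays in a given state for a certain duration decreases exponentially as the duration grows. This paper characterizes state duration using time-dependent state transition probabilities. First, the modified hidden Markov model and the conventional one are compared and analyzed using the same feature vectors. Second, comparative experiments are carried out on combinations of different feature vectors. In addition, when combining different parameters, the paper takes into account the influence of the different feature parameters and their dimensionality on the output probability of the observation vector.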
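The exponential decay of state duration in a conventional HMM follows directly from the constant self-transition probability: staying exactly d frames means taking the self-loop d - 1 times and then leaving. The sketch below illustrates only this standard geometric form; the function names are hypothetical, and the time-dependent transition probabilities proposed in the paper are not reproduced here.

```python
def duration_pmf(self_loop_prob, d):
    """P(state is occupied for exactly d frames) in a conventional HMM.

    With a constant self-transition probability a, this is the geometric
    distribution a**(d - 1) * (1 - a), which decays exponentially in d.
    """
    if d < 1:
        raise ValueError("duration must be at least one frame")
    return self_loop_prob ** (d - 1) * (1.0 - self_loop_prob)

def expected_duration(self_loop_prob):
    """Mean of the geometric duration distribution, 1 / (1 - a)."""
    return 1.0 / (1.0 - self_loop_prob)

# Hypothetical self-loop probability of 0.8.
for d in range(1, 6):
    print(d, duration_pmf(0.8, d))
print("mean duration:", expected_duration(0.8))
```

Because the most probable duration under this form is always a single frame, it is a poor match for speech units with a typical length of several frames, which is the motivation for modeling duration explicitly.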

10.

Inference Methods for CRFs with Co-occurrence Statistics

Ľubor Ladický Chris Russell Pushmeet Kohli Philip H. S. Torr 《International Journal of Computer Vision》2013,103(2):213-225

The Markov and Conditional random fields (CRFs) used in computer vision typically model only local interactions between variables, as this is generally thought to be the only case that is computationally tractable. In this paper we consider a class of global potentials defined over all variables in the CRF. We show how they can be readily optimised using standard graph cut algorithms at little extra expense compared to a standard pairwise field. This result can be directly used for the problem of class based image segmentation which has seen increasing recent interest within computer vision. Here the aim is to assign a label to each pixel of a given image from a set of possible object classes. Typically these methods use random fields to model local interactions between pixels or super-pixels. One of the cues that helps recognition is global object co-occurrence statistics, a measure of which classes (such as chair or motorbike) are likely to occur in the same image together. There have been several approaches proposed to exploit this property, but all of them suffer from different limitations and typically carry a high computational cost, preventing their application on large images. We find that the new model we propose produces a significant improvement in the labelling compared to just using a pairwise model and that this improvement increases as the number of labels increases.

11.

A vocal tract spectrum conversion method based on the two-factor Gaussian process dynamic model

孙新建 张雄伟 杨吉斌 曹铁勇 钟新毅 《自动化学报》2014,40(6):1198-1207

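The two-factor Gaussian process latent variable model (TF-GPLVM) previously proposed by the authors does not account for the dynamic characteristics of speech when used for voice conversion, and requires many parameters to be estimated during training. To address these problems, a hidden Markov model (HMM) is introduced to model the dynamic characteristics of speech, and the HMM hidden states are used to perform a probabilistic soft classification of each speech frame with respect to semantic content, yielding a two-factor Gaussian process dynamic model (TF-GPDM) with higher separation accuracy and a lower computational load. Based on this model, a new vocal tract spectrum conversion scheme based on speaker feature replacement is designed. Subjective and objective experimental results show that, compared both with conventional statistical mapping and frequency warping conversion methods and with the two-factor Gaussian process latent variable model method, the proposed method improves speech quality and conversion similarity, and achieves a better balance between the two.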

12.

On Acoustic Diversification Front-End for Spoken Language Identification

Khe Chai Sim Haizhou Li 《IEEE transactions on audio, speech, and language processing》2008,16(5):1029-1037

The parallel phone recognition followed by language model (PPRLM) architecture represents one of the state-of-the-art spoken language identification systems. A PPRLM system comprises multiple parallel subsystems, where each subsystem employs a phone recognizer with a different phone set for a particular language. The phone recognizer extracts phonotactic attributes from the speech input to characterize a language. The multiple parallel subsystems are devised to capture the phonetic diversification available in the speech input. Alternatively, this paper investigates a new approach for building a PPRLM system that aims at improving the acoustic diversification among its parallel subsystems by using multiple acoustic models. These acoustic models are trained on the same speech data with the same phone set but using different model structures and training paradigms. We examine the use of various structured precision (inverse covariance) matrix modeling techniques as well as the maximum likelihood and maximum mutual information training paradigms to produce complementary acoustic models. The results show that acoustic diversification, which requires only one set of phonetically transcribed speech data, yields similar performance improvements compared to phonetic diversification. In addition, further improvements were obtained by combining both diversification factors. The best performing system reported in this paper combined phonetic and acoustic diversifications to achieve EERs of 4.71% and 8.61% on the 2003 and 2005 NIST LRE sets, respectively, compared to 5.77% and 9.94% using phonetic diversification alone.
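The equal error rate (EER) quoted above is the operating point at which the false acceptance rate and the false rejection rate coincide. The following is a minimal sketch of how such a figure could be estimated from detection scores; the function name and the toy scores are hypothetical, and the official NIST LRE scoring procedure is not reproduced here.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is >= the threshold. The estimate is
    taken at the threshold where the false acceptance rate (FAR) and the
    false rejection rate (FRR) are closest, as the mean of the two rates.
    """
    best_gap, best_eer = None, None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer

# Hypothetical scores: higher means "more likely the target language".
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(targets, nontargets))
```

With only a handful of trials the estimate is coarse; on real evaluation sets the rates are usually interpolated along the detection error trade-off curve.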

13.

Validation of phonetic transcriptions in the context of automatic speech recognition

Christophe Van Bael Henk van den Heuvel Helmer Strik 《Language Resources and Evaluation》2007,41(2):129-146


14.

Environmental Independent ASR Model Adaptation/Compensation by Bayesian Parametric Representation

Wang X. O'Shaughnessy D. 《IEEE transactions on audio, speech, and language processing》2007,15(4):1204-1217

The mismatch between system training and operating conditions can seriously deteriorate the performance of automatic speech recognition (ASR) systems. Various techniques have been proposed to solve this problem in a specified speech environment. Employment of these techniques often involves modification of the ASR system structure. In this paper, we propose an environment-independent (EI) ASR model parameter adaptation approach based on Bayesian parametric representation (BPR), which is able to adapt ASR models to new environments without changing the structure of an ASR system. The parameter set of BPR is optimized by a maximum joint likelihood criterion which is consistent with that of the hidden Markov model (HMM)-based ASR model through an independent expectation-maximization (EM) procedure. Variations of the proposed approach are investigated in the experiments designed in two different speech environments: one is the noisy environment provided by the AURORA 2 database, and the other is the network environment provided by the NTIMIT database. Performances of the proposed EI ASR model compensation approach are compared to those of the cepstral mean normalization (CMN) approach, which is one of the standard techniques for additive noise compensation. The experimental results show that performances of ASR models in different speech environments are significantly improved after being adapted by the proposed BPR model compensation approach.
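For readers unfamiliar with the CMN baseline mentioned above, its usual per-utterance form simply subtracts the mean of each cepstral coefficient over the utterance. The sketch below assumes that form; the function name and the list-of-lists feature layout are hypothetical, and it does not describe the BPR approach itself.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-utterance mean from every cepstral coefficient.

    `frames` is a list of feature vectors (one list of floats per frame),
    all belonging to the same utterance.
    """
    if not frames:
        return []
    num_frames = len(frames)
    dim = len(frames[0])
    means = [sum(frame[k] for frame in frames) / num_frames for k in range(dim)]
    return [[frame[k] - means[k] for k in range(dim)] for frame in frames]

# Hypothetical two-frame utterance with two coefficients per frame.
utterance = [[1.0, 2.0], [3.0, 6.0]]
print(cepstral_mean_normalize(utterance))
```

After normalization each coefficient has zero mean over the utterance, so any constant offset added to the cepstra is removed.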

15.

‘Early recognition’ of polysyllabic words in continuous speech

《Computer Speech and Language》2007,21(1):54-71

Humans are able to recognise a word before its acoustic realisation is complete. This is in contrast to conventional automatic speech recognition (ASR) systems, which compute the likelihood of a number of hypothesised word sequences, and identify the words that were recognised on the basis of a trace back of the hypothesis with the highest eventual score, in order to maximise efficiency and performance. In the present paper, we present an ASR system, SpeM, based on principles known from the field of human word recognition that is able to model the human capability of ‘early recognition’ by computing word activation scores (based on negative log likelihood scores) during the speech recognition process. Experiments on 1463 polysyllabic words in 885 utterances showed that 64.0% (936) of these polysyllabic words were recognised correctly at the end of the utterance. For 81.1% of the 936 correctly recognised polysyllabic words the local word activation allowed us to identify the word before its last phone was available, and 64.1% of those words were already identified one phone after their lexical uniqueness point. We investigated two types of predictors for deciding whether a word is considered as recognised before the end of its acoustic realisation. The first type is related to the absolute and relative values of the word activation, which trade false acceptances for false rejections. The second type of predictor is related to the number of phones of the word that have already been processed and the number of phones that remain until the end of the word. The results showed that SpeM’s performance increases if the amount of acoustic evidence in support of a word increases and the risk of future mismatches decreases.

16.

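An English spoken pronunciation learning system based on speech recognition and the mobile phone platform (total citations: 1; self-citations: 0; citations by others: 1)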

涂惠燕 陈一宁 《计算机应用与软件》2011,(9)

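This paper studies a practical application scheme for an English learning system on the mobile phone platform based on speech recognition technology. The system uses the HMM (hidden Markov model) and the Viterbi algorithm as its modeling and algorithmic basis; to cope with the constraints of the mobile phone platform, the algorithm design and implementation are modified to reduce computation time while maintaining recognition accuracy.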
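As background for the Viterbi algorithm named above, the sketch below shows the textbook log-domain form for a discrete HMM. It is a generic illustration only: the function signature, the dictionary-based model layout, and the toy two-state model are assumptions, and none of the mobile-platform optimizations described in the paper are reproduced.

```python
import math

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely state path and its log-probability."""
    # delta[s]: best log-score of any path that ends in state s at the current frame
    delta = {s: log_start[s] + log_emit[s][observations[0]] for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_delta, pointers = {}, {}
        for s in states:
            # Best predecessor of s given the scores at the previous frame
            prev = max(states, key=lambda p: delta[p] + log_trans[p][s])
            new_delta[s] = delta[prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        delta = new_delta
        backpointers.append(pointers)
    # Trace back from the best final state
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, delta[last]

# Hypothetical two-state model with symbols "x" and "y".
log = math.log
states = ["A", "B"]
log_start = {"A": log(0.5), "B": log(0.5)}
log_trans = {"A": {"A": log(0.7), "B": log(0.3)},
             "B": {"A": log(0.3), "B": log(0.7)}}
log_emit = {"A": {"x": log(0.9), "y": log(0.1)},
            "B": {"x": log(0.1), "y": log(0.9)}}
print(viterbi(["x", "x", "y", "y"], states, log_start, log_trans, log_emit))
```

Working with log-probabilities replaces products by sums, which avoids numerical underflow on long utterances and is cheaper on constrained hardware.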

17.

A path-stack algorithm for optimizing dynamic regimes in a statistical hidden dynamic model of speech

《Computer Speech and Language》2000,14(2):101-114

In this paper we report our recent research whose goal is to improve the performance of a novel speech recognizer based on an underlying statistical hidden dynamic model of phonetic reduction in the production of conversational speech. We have developed a path-stack search algorithm which efficiently computes the likelihood of any observation utterance while optimizing the dynamic regimes in the speech model. The effectiveness of the algorithm is tested on the speech data in the Switchboard corpus, in which the optimized dynamic regimes computed from the algorithm are compared with those from exhaustive search. We also present speech recognition results on the Switchboard corpus that demonstrate improvements of the recognizer’s performance compared with the use of the dynamic regimes heuristically set from the phone segmentation by a state-of-the-art hidden Markov model (HMM) system.

18.

Real-time lip reading system for isolated Korean word recognition

Jongju Shin Daijin Kim 《Pattern recognition》2011,44(3):559-571

This paper proposes a real-time lip reading system (consisting of a lip detector, lip tracker, lip activation detector, and word classifier), which can recognize isolated Korean words. Lip detection is performed in several stages: face detection, eye detection, mouth detection, mouth end-point detection, and active appearance model (AAM) fitting. Lip tracking is then undertaken via a novel two-stage lip tracking method, where the model-based Lucas-Kanade feature tracker is used to track the outer lip, and then a fast block matching algorithm is used to track the inner lip. Lip activation detection is undertaken through a neural network classifier, whose input is a combination of the lip motion energy function and the first dominant shape feature. In the last step, input words are defined and recognized by three different classifiers: HMM, ANN, and K-NN. We combine the proposed lip reading system with an audio-only automatic speech recognition (ASR) system to improve the word recognition performance in noisy environments. We then demonstrate the potential applicability of the combined system for use within hands-free in-vehicle navigation devices. Results from experiments undertaken on 30 isolated Korean words using the K-NN classifier at a speed of 15 fps demonstrate that the proposed lip reading system achieves a 92.67% word correct rate (WCR) for person-dependent tests, and a 46.50% WCR for person-independent tests. Also, the combined audio-visual ASR system increases the WCR from 0% to 60% in a noisy environment.

19.

Isolated word recognition by neural network models with cross-correlation coefficients for speech dynamics

Jianxiong Wu Chorkin Chan 《IEEE transactions on pattern analysis and machine intelligence》1993,15(11):1174-1185

This paper presents an artificial neural network (ANN) for speaker-independent isolated word speech recognition. The network consists of three subnets in concatenation. The static information within one frame of speech signal is processed in the probabilistic mapping subnet that converts an input vector of acoustic features into a probability vector whose components are estimated probabilities of the feature vector belonging to the phonetic classes that constitute the words in the vocabulary. The dynamics capturing subnet computes the first-order cross correlation between the components of the probability vectors to serve as the discriminative feature derived from the interframe temporal information of the speech signal. These dynamic features are passed for decision-making to the classification subnet, which is a multilayer perceptron (MLP). The architectures of these three subnets are described, and the associated adaptive learning algorithms are derived. The recognition results for a subset of the DARPA TIMIT speech database are reported. The correct recognition rate of the proposed ANN system is 95.5%, whereas that of the best of continuous hidden Markov model (HMM)-based systems is only 91.0%.
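One plausible reading of the first-order cross correlation described above is the average product of class probability i at one frame and class probability j at the next frame. The sketch below implements that reading only; the abstract does not give the exact definition, so the formula, the normalization by the number of frame pairs, and the function name are all assumptions.

```python
def first_order_cross_correlation(prob_frames):
    """Average p_t[i] * p_{t+1}[j] over all adjacent frame pairs.

    `prob_frames` is a list of per-frame probability vectors of equal length K.
    Returns a K x K matrix (list of lists) of lag-one cross-correlation values.
    """
    if len(prob_frames) < 2:
        raise ValueError("at least two frames are needed for a lag-one statistic")
    num_classes = len(prob_frames[0])
    corr = [[0.0] * num_classes for _ in range(num_classes)]
    for current, following in zip(prob_frames, prob_frames[1:]):
        for i in range(num_classes):
            for j in range(num_classes):
                corr[i][j] += current[i] * following[j]
    num_pairs = len(prob_frames) - 1
    return [[value / num_pairs for value in row] for row in corr]

# Hypothetical three-frame word: class 0 first, then class 1 twice.
frames = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(first_order_cross_correlation(frames))
```

Under this reading the matrix has a fixed size regardless of word length, so it can be fed to a classifier such as an MLP as a summary of how phonetic classes follow one another.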

20.

Learning general model for activity recognition with limited labelled data

《Expert systems with applications》2017

Activity recognition has been a hot topic for decades, from scientific research to the development of off-the-shelf commercial products. Since people perform activities differently, a general model built from the activity data of various users is required before deployment for personal use, in order to avoid overfitting. However, annotating a large amount of activity data is expensive and time-consuming. In this paper, we build a general model for activity recognition with a limited amount of labelled data. We combine Latent Dirichlet Allocation (LDA) and AdaBoost to jointly train a general activity model with partially labelled data. After that, when AdaBoost is used for online prediction, we combine it with graphical models (such as HMM and CRF) to exploit the temporal information in human activities to smooth out the accidental misclassifications. Experiments with publicly available datasets show that we are able to obtain an accuracy of more than 90% with 1% labelled data.