期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Advances in transcription of broadcast news and conversational telephone speech within the combined EARS BBN/LIMSI system

Matsoukas S. Gauvain J.-L. Adda G. Colthurst T. Chia-Lin Kao Kimball O. Lamel L. Lefevre F. Ma J.Z. Makhoul J. Nguyen L. Prasad R. Schwartz R. Schwenk H. Bing Xiang 《IEEE transactions on audio, speech, and language processing》2006,14(5):1541-1556

相似文献

2.

深度神经网络在维吾尔语大词汇量连续语音识别中的应用

麦麦提艾力·吐尔逊戴礼荣《数据采集与处理》2015,30(2):365-371

研究将深度神经网络有效地应用到维吾尔语大词汇量连续语音识别声学建模中的两种方法：深度神经网络与隐马尔可夫模型组成混合架构模型(Deep neural network hidden Markov model, DNN-HMM),代替高斯混合模型进行状态输出概率的计算;深度神经网络作为前端的声学特征提取器提取瓶颈特征(Bottleneck features, BN),为传统的GMM-HMM(Gaussian mixture model-HMM)声学建模架构提供更有效的声学特征(BN-GMM-HMM)。实验结果表明,DNN-HMM模型和BN- GMM-HMM模型比GMM-HMM基线模型词错误率分别降低了8.84%和5.86%,两种方法都取得了较大的性能提升。相似文献

3.

Finding consensus in speech recognition: word error minimization and other applications of confusion networks 总被引：3，自引：0，他引：3

Lidia Mangu Eric Brill Andreas Stolcke 《Computer Speech and Language》2000,14(4):373

We describe a new framework for distilling information from word lattices to improve the accuracy of the speech recognition output and obtain a more perspicuous representation of a set of alternative hypotheses. In the standard MAP decoding approach the recognizer outputs the string of words corresponding to the path with the highest posterior probability given the acoustics and a language model. However, even given optimal models, the MAP decoder does not necessarily minimize the commonly used performance metric, word error rate (WER). We describe a method for explicitly minimizing WER by extracting word hypotheses with the highest posterior probabilities from word lattices. We change the standard problem formulation by replacing global search over a large set of sentence hypotheses with local search over a small set of word candidates. In addition to improving the accuracy of the recognizer, our method produces a new representation of a set of candidate hypotheses that specifies the sequence of word-level confusions in a compact lattice format. We study the properties of confusion networks and examine their use for other tasks, such as lattice compression, word spotting, confidence annotation, and reevaluation of recognition hypotheses using higher-level knowledge sources. 相似文献

4.

基于MTL-DNN系统融合的混合语言模型语音识别方法

范正光屈丹李华张文林《数据采集与处理》2017,32(5):1012-1021

基于混合语言模型的语音识别系统虽然具有可以识别集外词的优点,但是集外词识别准确率远低于集内词。为了进一步提升混合语音识别系统的识别性能,本文提出了一种基于互补声学模型的多系统融合方法。首先,通过采用不同的声学建模单元,构建了两套基于隐马尔科夫模型和深层神经网络(Hidden Markov model and deep neural network, HMM-DNN）的混合语音识别系统;然后,针对这两种识别任务之间的关联性,采用多任务学习(Multi-task learning DNN, MTL-DNN)思想,实现DNN网络输入层和隐含层的共享,并通过联合训练提高建模精度。最后,采用ROVER(Recognizer output voting error reduction)方法对两套系统的输出结果进行融合。实验结果表明,相比于单任务学DNN(Single-task learning DNN, STL-DNN）建模方式,MTL-DNN可以获得更好的识别性能;将两个系统的输出进行融合,能够进一步降低词错误率。相似文献

5.

基于拼音约束联合学习的汉语语音识别

梁仁凤余正涛高盛祥黄于欣郭军军许树理《中文信息学报》2022,36(10):167-172

当前的语音识别模型在英语、法语等表音文字中已取得很好的效果。然而,汉语是一种典型的表意文字,汉字与语音没有直接的对应关系,但拼音作为汉字读音的标注符号,与汉字存在相互转换的内在联系。因此,在汉语语音识别中利用拼音作为解码时的约束,可以引入一种更接近语音的归纳偏置。该文基于多任务学习框架,提出一种基于拼音约束联合学习的汉语语音识别方法,以端到端的汉字语音识别为主任务,以拼音语音识别为辅助任务,通过共享编码器,同时利用汉字与拼音识别结果作为监督信号,增强编码器对汉语语音的表达能力。实验结果表明,相比基线模型,该文提出的方法取得了更优的识别效果,词错误率降低了2.24%。相似文献

6.

Discriminative n-gram language modeling

《Computer Speech and Language》2007,21(2):373-392

This paper describes discriminative language modeling for a large vocabulary speech recognition task. We contrast two parameter estimation methods: the perceptron algorithm, and a method based on maximizing the regularized conditional log-likelihood. The models are encoded as deterministic weighted finite state automata, and are applied by intersecting the automata with word-lattices that are the output from a baseline recognizer. The perceptron algorithm has the benefit of automatically selecting a relatively small feature set in just a couple of passes over the training data. We describe a method based on regularized likelihood that makes use of the feature set given by the perceptron algorithm, and initialization with the perceptron’s weights; this method gives an additional 0.5% reduction in word error rate (WER) over training with the perceptron alone. The final system achieves a 1.8% absolute reduction in WER for a baseline first-pass recognition system (from 39.2% to 37.4%), and a 0.9% absolute reduction in WER for a multi-pass recognition system (from 28.9% to 28.0%). 相似文献

7.

Combining Spectral Representations for Large-Vocabulary Continuous Speech Recognition 总被引：1，自引：0，他引：1

Garau G. Renals S. 《IEEE transactions on audio, speech, and language processing》2008,16(3):508-518

相似文献

8.

Stereo-Based Stochastic Mapping for Robust Speech Recognition

Afify M. Cui X. Gao Y. 《IEEE transactions on audio, speech, and language processing》2009,17(7):1325-1334

We present a stochastic mapping technique for robust speech recognition that uses stereo data. The idea is based on constructing a Gaussian mixture model for the joint distribution of the clean and noisy features and using this distribution to predict the clean speech during testing. The proposed mapping is called stereo-based stochastic mapping (SSM). Two different estimators are considered. One is iterative and is based on the maximum a posteriori (MAP) criterion while the other uses the minimum mean square error (MMSE) criterion. The resulting estimators are effectively a mixture of linear transforms weighted by component posteriors, and the parameters of the linear transformations are derived from the joint distribution. Compared to the uncompensated result, the proposed method results in 45% relative improvement in word error rate (WER) for digit recognition in the car. In the same setting, SSM outperforms SPLICE and gives similar results to MMSE compensation of Huang A 66% relative improvement in word error rate (WER) is observed when applied in conjunction with multistyle training (MST) for large vocabulary English speech recognition in a real environment. Also, the combination of the proposed mapping with CMLLR leads to about 38% relative improvement in performance compared to CMLLR alone for real field data. 相似文献

9.

Random forests and the data sparseness problem in language modeling

《Computer Speech and Language》2007,21(1):105-152

Language modeling is the problem of predicting words based on histories containing words already hypothesized. Two key aspects of language modeling are effective history equivalence classification and robust probability estimation. The solution of these aspects is hindered by the data sparseness problem.Application of random forests (RFs) to language modeling deals with the two aspects simultaneously. We develop a new smoothing technique based on randomly grown decision trees (DTs) and apply the resulting RF language models to automatic speech recognition. This new method is complementary to many existing ones dealing with the data sparseness problem. We study our RF approach in the context of n-gram type language modeling in which n − 1 words are present in a history. Unlike regular n-gram language models, RF language models have the potential to generalize well to unseen data, even when histories are longer than four words. We show that our RF language models are superior to the best known smoothing technique, the interpolated Kneser–Ney smoothing, in reducing both the perplexity (PPL) and word error rate (WER) in large vocabulary state-of-the-art speech recognition systems. In particular, we will show statistically significant improvements in a contemporary conversational telephony speech recognition system by applying the RF approach only to one of its many language models. 相似文献

10.

Towards speech recognizer assessment using a human reference standard

SJ Cox PW Linford WB Hill RD Johnston 《Computer Speech and Language》1998,12(4):375-391

The measurement of the word error rate (WER) of a speech recognizer is valuable for the development of new algorithms but provides only the most limited information about the performance of the recognizer. We propose the use of a human reference standard to assess the performance of speech recognizers, so that the performance of a recognizer could be quoted as being equivalent to the performance of a human hearing speech which is subject to X dB of degradation. This approach should have the major advantage of being independent of the database and speakers used for testing. Furthermore, it would allow factors beyond the word error rate to be measured, such as the performance within an interactive speech system. In this paper, we report on preliminary work to explore the viability of this approach. This has consisted of recording a suitable database for experimentation, devising a method of degrading the speech in a controlled way and conducting two set of experiments on listeners to measure their responses to degraded speech to establish a reference. Results from these experiments raise several questions about the technique but encourage us to experiment with comparisons with automatic recognizers. 相似文献

11.

基于TDNN-FSMN的蒙古语语音识别技术研究

王勇和飞龙高光来《中文信息学报》2018,32(9):28-34

为了提高蒙古语语音识别性能,该文首先将时延神经网络融合前馈型序列记忆网络应用于蒙古语语音识别任务中,通过对长序列语音帧建模来充分挖掘上下文相关信息;此外研究了前馈型序列记忆网络“记忆”模块中历史信息和未来信息长度对模型的影响;最后分析了融合的网络结构中隐藏层个数及隐藏层节点数对声学模型性能的影响。实验结果表明,时延神经网络融合前馈型序列记忆网络相比深度神经网络、时延神经网络和前馈型序列记忆网络具有更好的性能,单词错误率与基线深度神经网络模型相比降低22.2%。相似文献

12.

Backoff hierarchical class n-gram language models: effectiveness to model unseen events in speech recognition

《Computer Speech and Language》2007,21(1):88-104

In this paper, we introduce the backoff hierarchical class n-gram language models to better estimate the likelihood of unseen n-gram events. This multi-level class hierarchy language modeling approach generalizes the well-known backoff n-gram language modeling technique. It uses a class hierarchy to define word contexts. Each node in the hierarchy is a class that contains all the words of its descendant nodes. The closer a node to the root, the more general the class (and context) is. We investigate the effectiveness of the approach to model unseen events in speech recognition. Our results illustrate that the proposed technique outperforms backoff n-gram language models. We also study the effect of the vocabulary size and the depth of the class hierarchy on the performance of the approach. Results are presented on Wall Street Journal (WSJ) corpus using two vocabulary set: 5000 words and 20,000 words. Experiments with 5000 word vocabulary, which contain a small numbers of unseen events in the test set, show up to 10% improvement of the unseen event perplexity when using the hierarchical class n-gram language models. With a vocabulary of 20,000 words, characterized by a larger number of unseen events, the perplexity of unseen events decreases by 26%, while the word error rate (WER) decreases by 12% when using the hierarchical approach. Our results suggest that the largest gains in performance are obtained when the test set contains a large number of unseen events. 相似文献

13.

不完全匹配的语音和文本语句级对齐

徐锴陶冶李辉《计算机系统应用》2023,32(4):300-307

语音文本自动对齐技术广泛应用于语音识别与合成、内容制作等领域,其主要目的是将语音和相应的参考文本在语句、单词、音素等级别的单元进行对齐,并获得语音与参考文本之间的时间对位信息.最新的先进对齐方法大多基于语音识别,一方面,准确率受限于语音识别效果,识别字错误率高时文语对齐精度明显下降,识别字错误率对对齐精度影响较大;另一方面,这种对齐方法不能有效处理不完全匹配的长篇幅语音和文本的对齐.该文提出一种基于锚点和韵律信息的文语对齐方法,通过基于边界锚点加权的片段标注将语料划分为对齐段和未对齐段,针对未对齐段使用双门限端点检测方法提取韵律信息,并检测语句边界,降低了基于语音识别的对齐方法对语音识别效果的依赖程度.实验结果表明,与目前先进的基于语音识别的文语对齐方法比较,即使在识别字错误率为0.52时,该文所提方法的对齐准确率仍能提升45%以上;在音频文本不匹配程度为0.5时,该文所提方法能提高3%. 相似文献

14.

Comparative analysis of Dysarthric speech recognition: multiple features and robust templates

Revathi Arunachalam Nagakrishnan R. Sasikaladevi N. 《Multimedia Tools and Applications》2022,81(22):31245-31259

Research on recognizing the speeches of normal speakers is generally in practice for numerous years. Nevertheless, a complete system for recognizing the speeches of persons with a speech impairment is still under development. In this work, an isolated digit recognition system is developed to recognize the speeches of speech-impaired people affected with dysarthria. Since the speeches uttered by the dysarthric speakers are exhibiting erratic behavior, developing a robust speech recognition system would become more challenging. Even manual recognition of their speeches would become futile. This work analyzes the use of multiple features and speech enhancement techniques in implementing a cluster-based speech recognition system for dysarthric speakers. Speech enhancement techniques are used to improve speech intelligibility or reduce the distortion level of their speeches. The system is evaluated using Gamma-tone energy (GFE) features with filters calibrated in different non-linear frequency scales, stock well features, modified group delay cepstrum (MGDFC), speech enhancement techniques, and VQ based classifier. Decision level fusion of all features and speech enhancement techniques has yielded a 4% word error rate (WER) for the speaker with 6% speech intelligibility. Experimental evaluation has provided better results than the subjective assessment of the speeches uttered by dysarthric speakers. The system is also evaluated for the dysarthric speaker with 95% speech intelligibility. WER is 0% for all the digits for the decision level fusion of speech enhancement techniques and GFE features. This system can be utilized as an assistive tool by caretakers of people affected with dysarthria.

相似文献

15.

语音识别中的一种说话人聚类算法 总被引：1，自引：1，他引：1

肖述才欧智坚王作英《中文信息学报》2005,19(4):85-89

本文介绍了稳健语音识别中的一种说话人聚类算法,包括它在语音识别中的作用和具体的用法,聚类中常用的特征、距离测度,聚类的具体实现步骤等。我们从两个方面对该算法的性能进行了测试,一是直接计算句子聚类的正确率,二是对说话人自适应效果的改进的作用,即比较使用此算法后系统性能的改进进行评价。实验表明:在使用GLR 距离作为距离测度的时候,该算法对句子的聚类正确率达85169 %;在识别实验中,该聚类算法的使用,使得用于说话人自适应的数据更加充分,提高了自适应的效果,系统的误识率已经接近利用已知说话人信息进行自适应时的误识率。相似文献

16.

Feature enhancement by deep LSTM networks for ASR in reverberant multisource environments

《Computer Speech and Language》2014,28(4):888-902

This article investigates speech feature enhancement based on deep bidirectional recurrent neural networks. The Long Short-Term Memory (LSTM) architecture is used to exploit a self-learnt amount of temporal context in learning the correspondences of noisy and reverberant with undistorted speech features. The resulting networks are applied to feature enhancement in the context of the 2013 2nd Computational Hearing in Multisource Environments (CHiME) Challenge track 2 task, which consists of the Wall Street Journal (WSJ-0) corpus distorted by highly non-stationary, convolutive noise. In extensive test runs, different feature front-ends, network training targets, and network topologies are evaluated in terms of frame-wise regression error and speech recognition performance. Furthermore, we consider gradually refined speech recognition back-ends from baseline ‘out-of-the-box’ clean models to discriminatively trained multi-condition models adapted to the enhanced features. In the result, deep bidirectional LSTM networks processing log Mel filterbank outputs deliver best results with clean models, reaching down to 42% word error rate (WER) at signal-to-noise ratios ranging from −6 to 9 dB (multi-condition CHiME Challenge baseline: 55% WER). Discriminative training of the back-end using LSTM enhanced features is shown to further decrease WER to 22%. To our knowledge, this is the best result reported for the 2nd CHiME Challenge WSJ-0 task yet. 相似文献

17.

基于知识蒸馏和生成对抗网络的远场语音识别

邬龙黎塔王丽颜永红《软件学报》2019,30(S2):25-34

为了进一步利用近场语音数据来提高远场语音识别的性能,提出一种基于知识蒸馏和生成对抗网络相结合的远场语音识别算法.该方法引入多任务学习框架,在进行声学建模的同时对远场语音特征进行增强.为了提高声学建模能力,使用近场语音的声学模型（老师模型）来指导远场语音的声学模型（学生模型）进行训练.通过最小化相对熵使得学生模型的后验概率分布逼近老师模型.为了提升特征增强的效果,加入鉴别网络来进行对抗训练,从而使得最终增强后的特征分布更逼近近场特征.AMI数据集上的实验结果表明,该算法的平均词错误率（WER）与基线相比在单通道的情况下,在没有说话人交叠和有说话人交叠时分别相对下降5.6%和4.7%.在多通道的情况下,在没有说话人交叠和有说话人交叠时分别相对下降6.2%和4.1%.TIMIT数据集上的实验结果表明,该算法获得了相对7.2%的平均词错误率下降.为了更好地展示生成对抗网络对语音增强的作用,对增强后的特征进行了可视化分析,进一步验证了该方法的有效性. 相似文献

18.

HMM/SVM segmentation and labelling of Arabic speech for speech recognition applications

Hamza Frihia Halima Bahi 《International Journal of Speech Technology》2017,20(3):563-573

Building a large vocabulary continuous speech recognition (LVCSR) system requires a lot of hours of segmented and labelled speech data. Arabic language, as many other low-resourced languages, lacks such data, but the use of automatic segmentation proved to be a good alternative to make these resources available. In this paper, we suggest the combination of hidden Markov models (HMMs) and support vector machines (SVMs) to segment and to label the speech waveform into phoneme units. HMMs generate the sequence of phonemes and their frontiers; the SVM refines the frontiers and corrects the labels. The obtained segmented and labelled units may serve as a training set for speech recognition applications. The HMM/SVM segmentation algorithm is assessed using both the hit rate and the word error rate (WER); the resulting scores were compared to those provided by the manual segmentation and to those provided by the well-known embedded learning algorithm. The results show that the speech recognizer built upon the HMM/SVM segmentation outperforms in terms of WER the one built upon the embedded learning segmentation of about 0.05%, even in noisy background. 相似文献

19.

Automatic Word Decompounding for ASR in a Morphologically Rich Language: Application to Amharic

《IEEE transactions on audio, speech, and language processing》2009,17(5):863-873

This paper investigates a data-driven word decompounding algorithm for use in automatic speech recognition. An existing algorithm, called “Morfessor,” has been enhanced in order to address the problem of increased phonetic confusability arising from word decompounding by incorporating phonetic properties and some constraints on recognition units derived from forced alignments experiments. Speech recognition experiments have been carried out on a broadcast news task for the Amharic language to validate the approach. The out of vocabulary (OOV) word rates were reduced by 35% to 50% and a small reduction in word error rate (WER) has been achieved. The algorithm is relatively language independent and requires minimal adaptation to be applied to other languages. 相似文献

20.

基于持续时间分布的鲁棒语速估计方法

张东宾杜利民《微计算机应用》2006,27(3):297-301

分析了语音识别系统中语速的定义与估计方法,提出了基于持续时间分布的鲁棒语速估计法，该方法首先利用非对称高斯分布来描述音子持续时间模型,然后采用中段平均规格化偏差来估计语速,应用该方法的估计语速和真实语速相关系数高达0，96，在此基础上,分别针对慢速和快速语句的特点,提出了动态词惩罚因子策略和动态帧移策略,分别可使系统误识率下降10，1%和9，9%。相似文献