Found 20 similar documents (search time: 296 ms)
1.
Channel distortion is one of the major factors that degrade the performance of automatic speech recognition (ASR) systems. Current compensation methods generally assume that the channel distortion is a constant or slowly varying bias, either within an utterance or globally. However, this assumption does not hold in more complex circumstances, when the speech recordings being recognized come from many different unknown channels and have parts of the spectrum completely removed (e.g. band-limited speech). On the one hand, different channels may cause different distortions; on the other, the distortion caused by a given channel varies over the speech frames when parts of the speech spectrum are removed completely. As a result, the performance of current methods is limited in complex environments. To solve this problem, we propose a unified framework in which the channel distortion is first divided into two subproblems, namely spectrum missing and magnitude changing. The two types of distortion are then compensated with different techniques in two steps. In the first step, the speech bandwidth is detected for each utterance and the acoustic models are synthesized from clean models to compensate for spectrum missing. In the second step, the constant term of the distortion is estimated via the expectation-maximization (EM) algorithm and subtracted from the means of the synthesized model to further compensate for magnitude changing. Several databases are chosen to evaluate the proposed framework. The speech in these databases is recorded over different channels, including various microphones and band-limited channels. Moreover, to simulate more types of spectrum missing, various low-pass and band-pass filters are used to process the speech from the chosen databases. Although these databases and their filtered versions make the channel conditions more challenging for recognition, experimental results show that the proposed framework can substantially improve the performance of ASR systems in complex channel environments.
2.
This article describes a novel method that models the correlation among acoustic observations in contiguous speech segments. The basic idea behind the method is that acoustic observations are conditioned not only on the phonetic context but also on the preceding acoustic segment observation. The correlation between consecutive acoustic observations is modeled by mean trajectory polynomial segment models (PSM). This method is an extension of conventional segment modeling approaches in that it describes the correlation of acoustic observations not only inside segments but also between contiguous segments. It is also a generalization of phonetic context (e.g., triphone) modeling approaches because it can model acoustic context and phonetic context at the same time. Using the proposed method in a speaker-independent phoneme classification test resulted in a 7 to 9% relative reduction of error rate as compared with the traditional triphone segmental model system and a 31% reduction as compared with a similar triphone hidden Markov model (HMM) system.
3.
4.
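Based on a study of the DTW algorithm, a new parallel segmented pruning method is proposed that significantly reduces the computational cost of DTW. The method is applied to a place-name recognition system; tests show that it markedly shortens recognition time, offers strong real-time performance and a high recognition rate, and is suitable as the main algorithm for small speech recognition products.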
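For readers unfamiliar with DTW, the following is a minimal sketch of the basic dynamic-programming recurrence, with an optional band constraint as one generic way of reducing the amount of computation. It is not the parallel segmented pruning method of the abstract, whose details are not given here; the function name `dtw_distance`, the `window` parameter, and the absolute-difference local cost on 1-D sequences are all assumptions made for illustration.

```python
import math


def dtw_distance(a, b, window=None):
    """DTW distance between two 1-D numeric sequences.

    `window` is a hypothetical band half-width: cell (i, j) is evaluated
    only when |i - j| <= window, which skips part of the cost matrix.
    """
    n, m = len(a), len(b)
    if window is None:
        window = max(n, m)          # no restriction: full matrix
    window = max(window, abs(n - m))  # the band must reach the final cell
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        lo = max(1, i - window)
        hi = min(m, i + window)
        for j in range(lo, hi + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor paths.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]
```

A narrower `window` evaluates fewer cells at the risk of missing the unconstrained optimum, which is the usual trade-off behind pruning schemes of this kind.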
5.
This paper considers the separation and recognition of overlapped speech sentences assuming single-channel observation. A system based on a combination of several different techniques is proposed. The system uses a missing-feature approach for improving crosstalk/noise robustness, a Wiener filter for speech enhancement, hidden Markov models for speech reconstruction, and speaker-dependent/-independent modeling for speaker and speech recognition. We develop the system on the Speech Separation Challenge database, involving a task of separating and recognizing two mixed sentences without assuming advance knowledge of the identity of the speakers or of the signal-to-noise ratio. The paper is an extended version of a previous conference paper submitted for the challenge.
6.
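An improved tone modeling method based on articulatory features is proposed and applied in the one-pass decoding of a stochastic segment model. Based on the pronunciation characteristics of Mandarin, seven articulatory features that distinguish Chinese vowel and consonant information are defined. With these as targets, hierarchical multilayer perceptrons compute the posterior probabilities that the speech signal belongs to each of 35 articulatory-feature classes, and these probabilities are used as articulatory features together with traditional prosodic features for tone modeling. Following the decoding characteristics of the stochastic segment model, the tone model probability score is computed for the paths that survive two-level pruning and, after weighting, added to the overall path probability score. Experiments on the "863-test" test set show that the recognition accuracy of the tone model using the new articulatory feature set improves by 3.11%, and that the character error rate of the stochastic segment model drops from 13.67% to 12.74% after tone information is incorporated. This demonstrates the feasibility of applying tone information to stochastic segment models.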
7.
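Research and Implementation of Speech Recognition in a Train Ticket Query System (total citations: 5; self-citations: 0; citations by others: 5)
The article first introduces the framework of speech recognition in a train ticket query system and describes in detail the process of recognizing train numbers and station names by voice using Microsoft Speech SDK technology. Finally, the recognition system is tested and analyzed in terms of recognition rate and robustness; the experiments show that the speech recognition system is stable and practical.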
8.
Enzo Mumolo 《Pattern Recognition》2002,35(10):2181-2191
A fast algorithm which aims at performing texture analysis of time-frequency images for denoising purposes is described in this paper. Time-frequency images are built using the peaks of the amplitude spectrum computed on a noisy speech signal. Using texture analysis, we can look at the spectral 2D information on a large scale, thus allowing the correction of spectral continuity by restoring peaks corrupted by noise which can appear as missing or modified. The algorithm has been used in preprocessors of speech processing systems. In fact, we report interesting results obtained with this algorithm in speech enhancement and HMM speech recognition tasks, especially for noise types which are quite difficult to treat with conventional algorithms, such as micro-interruptions or bursts of tonal noise at random frequencies.
9.
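Deep Speech Signal and Information Processing: Research Progress and Prospects (total citations: 1; self-citations: 0; citations by others: 1)
The paper first gives a brief introduction to deep learning and then reviews in detail the research progress in its main directions within speech signal and information processing, including speech recognition, speech synthesis, and speech enhancement. For speech recognition, it covers acoustic modeling based on deep neural networks, model training on big data, and speaker adaptation techniques. For speech synthesis, it covers several synthesis methods based on deep learning models. For speech enhancement, it covers several typical enhancement schemes based on deep neural networks. Finally, the paper looks ahead to possible future research hotspots for deep learning in speech signal and information processing.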
10.
Dictionary-based syntactic pattern recognition of strings attempts to recognize a transmitted string X*, by processing its noisy version, Y, without sequentially comparing Y with every element X in the finite (but possibly large) dictionary, H. The best estimate X+ of X* is defined as that element of H which minimizes the generalized Levenshtein distance (GLD) D(X, Y) between X and Y, for all X ∈ H. The non-sequential PR computation of X+ involves a compact trie-based representation of H. In this paper, we show how we can optimize this computation by incorporating breadth first search schemes on the underlying graph structure. This heuristic emerges from the trie-based dynamic programming recursive equations, which can be effectively implemented using a new data structure called the linked list of prefixes that can be built separately or “on top of” the trie representation of H. The new scheme does not restrict the number of errors in Y to be merely a small constant, as is done in most of the available methods. The main contribution is that our new approach can be used for generalized GLDs and not merely for 0/1 costs. It is also applicable when all possible correct candidates need to be known, and not just the best match. These constitute the cases when the “cutoffs” cannot be used in the DFS trie-based technique (Shang and Merrett in IEEE Trans Knowl Data Eng 8(4):540–547, 1996). The new technique is compared with the DFS trie-based technique (Risvik in United Patent 6377945 B1, 23 April 2002; Shang and Merrett in IEEE Trans Knowl Data Eng 8(4):540–547, 1996) using three large and small benchmark dictionaries with different errors. In each case, we demonstrate marked improvements with regard to the operations needed up to 21%, while at the same time maintaining the same accuracy. Additionally, some further improvements can be obtained by introducing the knowledge of the maximum number or percentage of errors in Y.
B. John Oommen was born in Coonoor, India on September 9, 1953. He obtained his B.Tech. degree from the Indian Institute of Technology, Madras, India in 1975. He obtained his M.E. from the Indian Institute of Science in Bangalore, India in 1977. He then went on for his M.S. and Ph.D., which he obtained from Purdue University in West Lafayette, Indiana in 1979 and 1982 respectively. He joined the School of Computer Science at Carleton University in Ottawa, Canada, in the 1981–1982 academic year. He is still at Carleton and holds the rank of Full Professor. His research interests include Automata Learning, Adaptive Data Structures, Statistical and Syntactic Pattern Recognition, Stochastic Algorithms and Partitioning Algorithms. He is the author of more than 255 refereed journal and conference publications and is a Fellow of the IEEE. Dr. Oommen is on the Editorial Board of the IEEE Transactions on Systems, Man and Cybernetics, and Pattern Recognition. Ghada Badr was born in Alexandria, Egypt in 1973. She received her B.Sc. and M.Sc. degrees in Computer Science with honors from Alexandria University, Faculty of Engineering, Alexandria, Egypt, in 1996 and 2001 respectively. She completed her Ph.D. at the School of Computer Science at Carleton University, Ottawa, Canada, in April 2006. She was also a research assistant at Moubarak City for Scientific Research, Information Research Institute (IRI), Egypt, during the period 1997–2001. Her fields of expertise are Advanced/Adaptive Data Structures, Syntactic and Structural Pattern Recognition, Artificial Intelligence, Exact/Approximate String Matching Algorithms, and Information Retrieval. She has authored more than 10 refereed journal and conference publications and is a co-inventor of one patent.
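The idea of sharing dynamic-programming work across dictionary words that have a common prefix can be sketched as follows. This is only the plain depth-first trie baseline with unit (0/1) edit costs and a simple cutoff, not the breadth-first scheme or the linked list of prefixes proposed in the abstract above; the names `build_trie` and `best_match` and the nested-dictionary trie layout are assumptions made for illustration.

```python
def build_trie(words):
    """Build a nested-dict trie; the empty-string key marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = word
    return root


def best_match(trie, y):
    """Return (word, distance) minimising the unit-cost Levenshtein distance to y."""
    best = [None, float("inf")]

    def visit(node, prev_row):
        # prev_row[j] is the distance between the current prefix and y[:j].
        if "" in node and prev_row[-1] < best[1]:
            best[0], best[1] = node[""], prev_row[-1]
        for ch, child in node.items():
            if ch == "":
                continue
            row = [prev_row[0] + 1]
            for j in range(1, len(y) + 1):
                row.append(min(row[j - 1] + 1,            # insertion
                               prev_row[j] + 1,           # deletion
                               prev_row[j - 1] + (y[j - 1] != ch)))  # substitution
            # Cutoff: no descendant can score below the smallest entry of this row.
            if min(row) < best[1]:
                visit(child, row)

    visit(trie, list(range(len(y) + 1)))
    return best[0], best[1]
```

Each trie node computes one row of the edit-distance table, so the rows for a shared prefix such as "spe" are computed once for all words that begin with it. The cutoff shown is valid only when a single best match under unit costs is wanted, which is the restriction the abstract says its approach removes.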
11.
12.
Most speech enhancement methods based on short-time spectral modification are generally expressed as a spectral gain depending on the estimate of the local signal-to-noise ratio (SNR) on each frequency bin. Several studies have analyzed the performance of a priori SNR estimation algorithms to improve speech quality and to reduce speech distortions. In this paper, we concentrate on the analysis of over- and underestimation of the a priori SNR in speech enhancement and noise reduction systems. We first show that conventional approaches such as the decision-directed approach proposed by Ephraim and Malah lead to a biased estimator for the a priori SNR. To reduce this bias, our strategy relies on the introduction of a correction term in the a priori SNR estimate depending on the current state of both the available a posteriori SNR and the estimated a priori one. The proposed solution leads to a bias-compensated a priori SNR estimate, and allows the output speech signal to be estimated finely, so that it is very close to the original one on each frequency bin. Such a refinement procedure in the a priori SNR estimate can be inserted in any type of spectral gain function to improve the output speech quality. Objective tests under various environments in terms of the Normalized Covariance Metric (NCM) criterion, the Coherence Speech Intelligibility Index (CSII) criterion, the segmental SNR criterion and the Perceptual Evaluation of Speech Quality (PESQ) measure are presented, showing the superiority of the proposed method compared to competitive algorithms.
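As background, the conventional decision-directed estimate that the abstract takes as its starting point can be sketched for a single frequency bin as below. The bias-correction term proposed in the paper is not specified in the abstract and is not included. The function name, the smoothing factor `alpha = 0.98`, the floor `xi_min`, the use of a Wiener gain to form the previous clean-speech estimate, and the first-frame initialisation are all assumptions made for illustration.

```python
def decision_directed_snr(noisy_power, noise_power, alpha=0.98, xi_min=1e-3):
    """Track the a priori SNR of one frequency bin across frames.

    noisy_power: sequence of noisy-speech power values |Y|^2, one per frame.
    noise_power: noise power estimate for this bin (assumed constant here).
    """
    xi_track = []
    prev_clean_power = None
    for power in noisy_power:
        gamma = power / noise_power            # a posteriori SNR
        ml_term = max(gamma - 1.0, 0.0)        # instantaneous estimate
        if prev_clean_power is None:
            xi = ml_term                       # assumed first-frame initialisation
        else:
            # Weighted mix of the previous frame's clean-speech estimate
            # and the current instantaneous estimate.
            xi = alpha * prev_clean_power / noise_power + (1.0 - alpha) * ml_term
        xi = max(xi, xi_min)
        gain = xi / (1.0 + xi)                 # Wiener gain (one possible choice)
        prev_clean_power = gain * gain * power
        xi_track.append(xi)
    return xi_track
```

Because `alpha` is close to 1, the estimate leans heavily on the previous frame, which smooths the track but also makes it lag behind changes in the signal; this dependence on the previous frame is the kind of behaviour that a bias analysis would examine.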
13.
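To realize robot voice control while avoiding interference from environmental noise, the study proposes a method for recognizing robot voice control commands based on Mel-frequency cepstral coefficient (MFCC) feature extraction and deep neural networks. Experimental results show that, compared with other speech enhancement methods, the speech enhancement method based on deep neural networks and harmonic enhancement achieves higher segmental signal-to-noise ratio and perceptual evaluation of speech quality scores. Compared with other features, the proposed improved MFCC feature significantly reduces the character error rate of speech recognition, and adding the improved deep neural network-hidden Markov model reduces the character error rate further. At 20 dB, the average character error rates of this feature and of the improved deep neural network-hidden Markov model are 24.9% and 22.1%, respectively, both lower than those of the other methods. These results show that the proposed speech recognition method can accurately recognize noisy speech and improves the robot's ability to recognize voice control commands.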
14.
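This paper presents research on a method for extracting the original speech from a noisy speech signal, implemented on a multimedia computer using multimedia processing technology, together with the design and development of the method.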
15.
Kuan-Yu Chen 《Information Processing Letters》2005,96(6):197-201
We study several fundamental problems arising from biological sequence analysis. Given a sequence of real numbers, we present two linear-time algorithms, one for locating the “longest” sum-constrained segment, and the other for locating the “shortest” sum-constrained segment. These two algorithms are based on the same framework and run in an online manner, hence they are capable of handling data stream inputs. Our algorithms also yield online linear-time solutions for finding the longest and shortest average-constrained segments by a simple reduction.
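To make the problem concrete, here is a minimal sketch that finds the longest segment whose sum is at least a given lower bound, using prefix sums and a stack of candidate start positions. It runs in linear time but is offline, so it is not the online algorithm of the abstract above; the one-sided constraint (a lower bound only), the function name, and the half-open `(start, end)` return convention are assumptions made for illustration.

```python
def longest_segment_with_sum_at_least(values, lower):
    """Return (start, end) of the longest values[start:end] with sum >= lower.

    Returns None when no non-empty segment satisfies the constraint.
    """
    prefix = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)

    # Only positions where the prefix sum reaches a new strict minimum
    # can be the start of a longest qualifying segment.
    starts = []
    for i, p in enumerate(prefix):
        if not starts or p < prefix[starts[-1]]:
            starts.append(i)

    best = None
    # Scan end positions from the right; each start is popped at most once.
    for j in range(len(prefix) - 1, 0, -1):
        while starts and prefix[j] - prefix[starts[-1]] >= lower:
            i = starts.pop()
            if i < j and (best is None or j - i > best[1] - best[0]):
                best = (i, j)
    return best
```

An average constraint can be handled in the same way: subtracting the target average from every element turns "average at least A" into "sum at least 0", which is the kind of simple reduction the abstract refers to.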
16.
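The design and implementation of an industrial speech recognition system using a speech enhancement approach based on acoustic-wave coupling and cascaded speech enhancement modules are analyzed and studied, and the algorithm is modeled. Experimental data are analyzed while comparing the spectral subtraction and MMSE-LSA speech enhancement algorithms, and the recognition rate of the industrial robot speech recognition system in noisy environments is improved.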
17.
18.
19.
《Computer Speech and Language》2014,28(2):619-628
Post-filtering can be used in mobile communications to improve the quality and intelligibility of speech. Energy reallocation with a high-pass type filter has been shown to work effectively in improving the intelligibility of speech in difficult noise conditions. This paper introduces a post-filtering algorithm that adapts to the background noise level as well as to the fundamental frequency of the speaker and models the spectral effects observed in natural Lombard speech. The introduced method and another post-filtering technique were compared to unprocessed telephone speech in subjective listening tests in terms of intelligibility and quality. The results indicate that the proposed method outperforms the reference method in difficult noise conditions.