1.

《Computer Speech and Language》2001,15(2):151-174

This paper presents a new architecture for automatic speech recognition systems which is characterized by the division of the spectral domain of the speech signal into several independent frequency bands. This model is based on the psycho-acoustic work of Fletcher (1953), who proposed a similar principle for the human auditory system. Jont B. Allen published a paper in 1994 in which he summarized the work of Fletcher and also proposed to adapt the multi-band paradigm to automatic speech recognition (ASR) (Allen, 1994). Many researchers have since studied this principle and built such ASR systems. The goal of this paper is to analyse some of the most important issues in the design of a multi-band ASR system in order to determine which architecture it should have in which environment. Two other major problems are then considered: how to train multi-band systems and how to use them for continuous ASR.

2.

Pervasive speech recognition

《Pervasive Computing, IEEE》2004,3(4):78-81

As mobile computing devices grow smaller and as in-car computing platforms become more common, we must augment traditional methods of human-computer interaction. Although speech interfaces have existed for years, the constrained system resources of pervasive devices, such as limited memory and processing capabilities, present new challenges. We provide an overview of embedded automatic speech recognition (ASR) on the pervasive device and discuss its ability to help us develop pervasive applications that meet today's marketplace needs. ASR recognizes spoken words and phrases. State-of-the-art ASR uses a phoneme-based approach for speech modeling: it gives each phoneme (or elementary speech sound) in the language under consideration a statistical representation expressing its acoustic properties.

3.

Speaker-independent ASR for Modern Standard Arabic: effect of regional accents

Ghania Droua-Hamdani Sid-Ahmed Selouani Malika Boudraa 《International Journal of Speech Technology》2012,15(4):487-493

This paper deals with a speaker-independent Automatic Speech Recognition (ASR) system for continuous speech. This ASR system has been developed for Modern Standard Arabic (MSA) using recordings from six regions taken from the ALGerian Arabic Speech Database (ALGASD), and has been designed using Hidden Markov Models. The main purpose of this study is to investigate the effect of regional accent on speech recognition rates. First, the experiment assessed the general performance of the model on the speech data of the six regions; the recognition results were then examined in detail to observe how ASR performance deteriorates with the regional variation in the speech material. The results show that ASR performance is clearly affected by the regional accents of the speakers.

4.

Prosody modification for speech recognition in emotionally mismatched conditions

Vishnu Vidyadhara Raju Vegesna Krishna Gurugubelli Anil Kumar Vuppala 《International Journal of Speech Technology》2018,21(3):521-532

A degradation in the performance of automatic speech recognition (ASR) systems is observed in mismatched training and testing conditions. One of the reasons for this degradation is the presence of emotions in the speech. The main objective of this work is to improve the performance of ASR in the presence of emotional conditions using prosody modification. The influence of different emotions on the prosody parameters is exploited in this work. Emotion conversion methods are employed to generate the word level non-uniform prosody modified speech. Modification factors for prosodic components such as pitch, duration and energy are used. The prosody modification is done in two ways. Firstly, emotion conversion is done at the testing stage to generate the neutral speech from the emotional speech. Secondly, the ASR is trained with the generated emotional speech from the neutral speech. In this work, the presence of emotions in speech is studied for the Telugu ASR systems. A new database, the IIIT-H Telugu speech corpus, is collected to build the large vocabulary neutral Telugu speech ASR system. The emotional speech samples from the IITKGP-SESC Telugu corpus are used for testing it. The emotions of anger, happiness and compassion are considered during the evaluation. An improvement in the performance of ASR systems is observed in the prosody modified speech.

5.

An endpoint detection method for noisy Chinese speech recognition (total citations: 4; self-citations: 0; citations by others: 4)

王朋塔维娜陈树中《计算机工程》2003,29(17):120-121,135

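One cause of recognition errors in speech recognition systems is inaccurate endpoint detection. At high signal-to-noise ratios, correctly locating the endpoints of speech is not difficult. Most practical speech recognition systems, however, must operate at low signal-to-noise ratios, where conventional endpoint detection methods, such as energy-based endpoint detection, do not work effectively. This paper uses an improved hidden Markov model (HMM) for speech detection so as to adapt to changes in the noise. Experimental results show that the method achieves highly accurate endpoint detection for noisy speech.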
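
For orientation, the conventional energy-based baseline that the abstract contrasts against can be sketched as follows. This is not the paper's improved HMM method. The function names, frame sizes, and threshold ratio are illustrative assumptions, and the noise floor is assumed to be measurable from the first few frames.

```python
from typing import List, Optional, Sequence, Tuple


def frame_energies(samples: Sequence[float], frame_len: int, hop: int) -> List[float]:
    """Mean squared amplitude of each (possibly overlapping) frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def detect_endpoints(
    samples: Sequence[float],
    frame_len: int = 160,    # assumed frame length in samples
    hop: int = 80,           # assumed hop size in samples
    noise_frames: int = 5,   # leading frames assumed to contain noise only
    ratio: float = 4.0,      # assumed threshold: multiple of the noise floor
) -> Optional[Tuple[int, int]]:
    """Return (start_sample, end_sample) of the active region, or None."""
    energies = frame_energies(samples, frame_len, hop)
    if len(energies) <= noise_frames:
        return None
    # Estimate the noise floor from the leading frames.
    noise_floor = sum(energies[:noise_frames]) / noise_frames
    threshold = ratio * max(noise_floor, 1e-12)
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return None
    # Span from the first active frame's start to the last active frame's end.
    return active[0] * hop, active[-1] * hop + frame_len
```

Because the threshold is tied to a noise floor estimated once at the start, a sketch like this has no way to follow noise that changes during the utterance, which is the weakness the abstract attributes to energy-based methods.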

6.

Bayesian on-line spectral change point detection: a soft computing approach for on-line ASR

M. F. R. Chowdhury S.-A. Selouani D. O’Shaughnessy 《International Journal of Speech Technology》2012,15(1):5-23

Current automatic speech recognition (ASR) works in off-line mode and needs prior knowledge of the stationary or quasi-stationary test conditions for expected word recognition accuracy. These requirements limit the application of ASR for real-world applications where test conditions are highly non-stationary and are not known a priori. This paper presents an innovative frame dynamic rapid adaptation and noise compensation technique for tracking highly non-stationary noises and its application for on-line ASR. The proposed algorithm is based on a soft computing model using Bayesian on-line inference for spectral change point detection (BOSCPD) in unknown non-stationary noises. BOSCPD is tested with the MCRA noise tracking technique for on-line rapid environmental change learning in different non-stationary noise scenarios. The test results show that the proposed BOSCPD technique reduces the delay in spectral change point detection significantly compared to the baseline MCRA and its derivatives. The proposed BOSCPD soft computing model is tested for joint additive and channel distortions compensation (JAC)-based on-line ASR in unknown test conditions using non-stationary noisy speech samples from the Aurora 2 speech database. The simulation results for the on-line ASR show significant improvement in recognition accuracy compared to the baseline Aurora 2 distributed speech recognition (DSR) in batch-mode.

7.

End-to-end speech recognition with a hybrid connectionist temporal classification/attention mechanism

陈聪贺杰陈佳《控制工程》2021,28(3):585-591

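To improve the accuracy of conventional automatic speech recognition (ASR) systems, a design method is proposed for an end-to-end ASR system that uses a hybrid connectionist temporal classification (CTC)/attention mechanism based on hidden Markov models. First, to address the difficulties of strong continuity and large vocabulary in recognizing observable time-varying speech sequences, the speech recognition process is modeled with a hidden Markov model, which parameterizes the speech recognition model. Second, the CTC objective function is used as an auxiliary task to train the attention-model encoder within a multi-objective learning framework, which reduces the degree of approximation in the sequence-level CTC objective and improves recognition accuracy. Finally, simulation experiments on a self-built speech recognition corpus verify the advantages of the proposed algorithm in recognition efficiency and accuracy.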

8.

Unsupervised Adaptation of Categorical Prosody Models for Prosody Labeling and Speech Recognition

《IEEE transactions on audio, speech, and language processing》2009,17(1):138-149

Automatic speech recognition (ASR) systems rely almost exclusively on short-term segment-level features (MFCCs), while ignoring higher level suprasegmental cues that are characteristic of human speech. Recent experiments have shown, however, that categorical representations of prosody, such as those based on the Tones and Break Indices (ToBI) annotation standard, can be used to enhance speech recognizers. Even so, categorical prosody models are severely limited in scope and coverage due to the lack of large corpora annotated with the relevant prosodic symbols (such as pitch accent, word prominence, and boundary tone labels). In this paper, we first present an architecture for augmenting a standard ASR with symbolic prosody. We then discuss two novel, unsupervised adaptation techniques for improving, respectively, the quality of the linguistic and acoustic components of our categorical prosody models. Finally, we implement the augmented ASR by enriching ASR lattices with the adapted categorical prosody models. Our experiments show that the proposed unsupervised adaptation techniques significantly improve the quality of the prosody models; the adapted prosodic language and acoustic models reduce binary pitch accent (presence versus absence) classification error rate by 13.8% and 4.3%, respectively (relative to the seed models) on the Boston University Radio News Corpus, while the prosody-enriched ASR exhibits a 3.1% relative reduction in word error rate (WER) over the baseline system.
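
Since the result above is stated as a relative reduction in WER, a minimal sketch of how WER and a relative reduction are commonly computed may help. It assumes the usual definition of WER as word-level edit distance divided by reference length. The function names are illustrative and are not taken from the paper.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance counting substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of an error rate with respect to a baseline."""
    return (baseline - improved) / baseline
```

A relative reduction is a fraction of the baseline error, not a difference in percentage points: under this definition, going from a 20% to a 15% error rate is a 25% relative reduction.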

9.

AUTOMATIC SPEECH RECOGNITION

Louis Fried 《Information Systems Management》1996,13(1):29-37

Automatic speech recognition (ASR) technology provides a natural interface for mission-critical multimedia applications. This article discusses the state of ASR technology, selection of an ASR system, and an approach for developing ASR applications.

10.

Turkish Broadcast News Transcription and Retrieval

《IEEE transactions on audio, speech, and language processing》2009,17(5):874-883


11.

Dual stream speech recognition using articulatory syllable models

Antti Puurula Dirk Van Compernolle 《International Journal of Speech Technology》2010,13(4):219-230

Recent theoretical developments in neuroscience suggest that sublexical speech processing occurs via two parallel processing pathways. According to this Dual Stream Model of Speech Processing, speech is processed both as sequences of speech sounds and articulations. We attempt to revise the “beads-on-a-string” paradigm of Hidden Markov Models in Automatic Speech Recognition (ASR) by implementing a system for dual stream speech recognition. A baseline recognition system is enhanced by modeling of articulations as sequences of syllables. An efficient and complementary model to HMMs is developed by formulating Dynamic Time Warping (DTW) as a probabilistic model. The DTW Model (DTWM) is improved by enriching syllable templates with constrained covariance matrices, data imputation, clustering and mixture modeling. The resulting dual stream system is evaluated on the N-Best Southern Dutch Broadcast News benchmark. Promising results are obtained for DTWM classification and ASR tests. We provide a discussion on the remaining problems in implementing dual stream speech recognition.
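
As background for the DTW Model mentioned above, classic template-based DTW can be sketched as follows. This is the textbook dynamic-programming recurrence with a Euclidean local cost, not the probabilistic DTWM of the paper. The names `dtw_distance` and `classify` and the template dictionary are hypothetical.

```python
import math
from typing import Dict, Sequence

Frame = Sequence[float]


def dtw_distance(a: Sequence[Frame], b: Sequence[Frame]) -> float:
    """Accumulated cost of the best monotonic alignment between a and b."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            cost[i][j] = local + min(cost[i - 1][j],      # a advances, b repeats
                                     cost[i][j - 1],      # b advances, a repeats
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def classify(frames: Sequence[Frame], templates: Dict[str, Sequence[Frame]]) -> str:
    """Return the label of the template with the smallest DTW distance."""
    return min(templates, key=lambda label: dtw_distance(frames, templates[label]))
```

The warping lets a template match an utterance spoken faster or slower without penalty for the repeated frames, which is why DTW suits templates of variable-length units such as syllables.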

12.

Sentence-level alignment of imperfectly matched speech and text

徐锴陶冶李辉《计算机系统应用》2023,32(4):300-307

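Automatic speech-text alignment is widely used in speech recognition and synthesis, content production, and other fields. Its main purpose is to align speech with the corresponding reference text at the level of units such as sentences, words, and phonemes, and to obtain the timing correspondence between the speech and the reference text. Most state-of-the-art alignment methods are based on speech recognition. On the one hand, their accuracy is limited by recognition quality: when the character error rate is high, alignment accuracy drops markedly. On the other hand, such methods cannot effectively align long speech and text that do not fully match. This paper proposes a speech-text alignment method based on anchor points and prosodic information. Segment labeling weighted by boundary anchors divides the corpus into aligned and unaligned segments. For the unaligned segments, a dual-threshold endpoint detection method extracts prosodic information and detects sentence boundaries, which reduces the dependence of recognition-based alignment on recognition quality. Experimental results show that, compared with current state-of-the-art recognition-based speech-text alignment methods, the proposed method still improves alignment accuracy by more than 45% even when the character error rate is 0.52, and improves it by 3% when the degree of audio-text mismatch is 0.5.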

13.

Acoustic modeling problem for automatic speech recognition system: advances and refinements (Part II)

Rajesh Kumar Aggarwal M. Dave 《International Journal of Speech Technology》2011,14(4):309-320

In automatic speech recognition (ASR) systems, hidden Markov models (HMMs) have been widely used for modeling the temporal speech signal. As discussed in Part I, the conventional acoustic models used for ASR have many drawbacks, such as weak duration modeling and poor discrimination. This paper (Part II) presents a review of the techniques that have been proposed in the literature for refining standard HMM methods to cope with their limitations. Current advancements related to this topic are also outlined. The approaches emphasized in this part of the review are the connectionist approach, explicit duration modeling, discriminative training and margin-based estimation methods. Further, various challenges and performance issues such as environmental variability, tied mixture modeling, and handling of distant speech signals are analyzed along with the directions for future research.

14.

A Novel Uncertainty Decoding Rule With Applications to Transmission Error Robust Speech Recognition

Ion V. Haeb-Umbach R. 《IEEE transactions on audio, speech, and language processing》2008,16(5):1047-1060

In this paper, we derive an uncertainty decoding rule for automatic speech recognition (ASR), which accounts for both corrupted observations and inter-frame correlation. The conditional independence assumption, prevalent in hidden Markov model-based ASR, is relaxed to obtain a clean speech posterior that is conditioned on the complete observed feature vector sequence. This is a more informative posterior than one conditioned only on the current observation. The novel decoding is used to obtain a transmission-error robust remote ASR system, where the speech capturing unit is connected to the decoder via an error-prone communication network. We show how the clean speech posterior can be computed for communication links being characterized by either bit errors or packet loss. Recognition results are presented for both distributed and network speech recognition, where in the latter case common voice-over-IP codecs are employed.

15.

End-to-end accented Mandarin recognition with a hybrid CTC/attention architecture

杨威胡燕《计算机应用研究》2021,38(3):755-759

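To address multi-accent recognition in Mandarin speech recognition, a hybrid end-to-end model combining connectionist temporal classification (CTC) and multi-head attention is proposed, together with multi-objective training and joint decoding. Experimental analysis shows that as the CTC weight in the hybrid architecture decreases and the encoder deepens, the hybrid model learns better on the accented dataset. An encoder-decoder network 48 layers deep was also trained, and the resulting model outperforms all previous end-to-end models, reaching a 5.6% character error rate and a 26.2% sentence error rate on the 200 h open-source accented dataset released by Datatang (数据堂). The experiments demonstrate that the proposed end-to-end model achieves a higher recognition rate than ordinary end-to-end models and is relatively advanced in handling accented Mandarin recognition.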
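
The abstract does not give its formulas. The "CTC weight" in multi-objective training and joint decoding is commonly written as a linear interpolation, and the sketch below assumes that form. The symbols $\lambda$, $X$ (input features) and $Y$ (output label sequence) are notation introduced here, not taken from the paper.

```latex
% Multi-objective training: interpolate the two losses with CTC weight \lambda
\mathcal{L}_{\mathrm{MTL}}
  = \lambda \, \mathcal{L}_{\mathrm{CTC}}
  + (1 - \lambda) \, \mathcal{L}_{\mathrm{att}},
  \qquad 0 \le \lambda \le 1

% Joint decoding: score each hypothesis Y with both branches
\hat{Y} = \arg\max_{Y} \Big\{ \lambda \log p_{\mathrm{CTC}}(Y \mid X)
  + (1 - \lambda) \log p_{\mathrm{att}}(Y \mid X) \Big\}
```

Under this assumed form, lowering $\lambda$ shifts weight from the CTC branch to the attention branch, which is the direction the abstract reports as beneficial on accented data.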

16.

Merge-Weighted Dynamic Time Warping for Speech Recognition

张湘莉兰骆志刚李明《计算机科学技术学报》2014,29(6):1072-1082

Obtaining training material for rarely used English words and common given names from countries where English is not spoken is difficult due to excessive time, storage and cost factors. By considering pe...

17.

Environmental Independent ASR Model Adaptation/Compensation by Bayesian Parametric Representation

Wang X. O'Shaughnessy D. 《IEEE transactions on audio, speech, and language processing》2007,15(4):1204-1217

The mismatch between system training and operating conditions can seriously deteriorate the performance of automatic speech recognition (ASR) systems. Various techniques have been proposed to solve this problem in a specified speech environment. Employment of these techniques often involves modification of the ASR system structure. In this paper, we propose an environment-independent (EI) ASR model parameter adaptation approach based on Bayesian parametric representation (BPR), which is able to adapt ASR models to new environments without changing the structure of an ASR system. The parameter set of BPR is optimized by a maximum joint likelihood criterion which is consistent with that of the hidden Markov model (HMM)-based ASR model through an independent expectation-maximization (EM) procedure. Variations of the proposed approach are investigated in the experiments designed in two different speech environments: one is the noisy environment provided by the AURORA 2 database, and the other is the network environment provided by the NTIMIT database. Performances of the proposed EI ASR model compensation approach are compared to those of the cepstral mean normalization (CMN) approach, which is one of the standard techniques for additive noise compensation. The experimental results show that performances of ASR models in different speech environments are significantly improved after being adapted by the proposed BPR model compensation approach.
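
The CMN baseline named above can be sketched in a few lines. This shows per-utterance CMN in its simplest form, assuming the features arrive as a list of equal-length cepstral vectors. It is not the BPR approach of the paper, and the function name is illustrative.

```python
from typing import List, Sequence


def cepstral_mean_normalize(frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """Subtract the per-utterance mean from every cepstral coefficient."""
    if not frames:
        return []
    n = len(frames)
    dim = len(frames[0])
    # Mean of each coefficient over all frames of the utterance.
    means = [sum(frame[d] for frame in frames) / n for d in range(dim)]
    return [[frame[d] - means[d] for d in range(dim)] for frame in frames]
```

Any offset that is constant across the utterance in the cepstral domain cancels out, so two recordings that differ only by such an offset normalize to the same features.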

18.

Collecting and evaluating speech recognition corpora for 11 South African languages

Jaco Badenhorst Charl van Heerden Marelie Davel Etienne Barnard 《Language Resources and Evaluation》2011,45(3):289-309

We describe the Lwazi corpus for automatic speech recognition (ASR), a new telephone speech corpus which contains data from the eleven official languages of South Africa. Because of practical constraints, the amount of speech per language is relatively small compared to major corpora in world languages, and we report on our investigation of the stability of the ASR models derived from the corpus. We also report on phoneme distance measures across languages, and describe initial phone recognisers that were developed using this data. We find that a surprisingly small number of speakers (fewer than 50) and around 10 to 20 h of speech per language are sufficient for the purposes of acceptable phone-based recognition.

19.

Tibetan speech recognition based on end-to-end technology

王庆楠郭武解传栋《模式识别与人工智能》2017,30(4):359-364

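End-to-end large-scale continuous speech recognition based on connectionist temporal classification (CTC) is currently an active research topic. This paper applies it to Tibetan speech recognition and obtains better performance than the mainstream bidirectional long short-term memory network. End-to-end speech recognition does not require linguistic knowledge such as a pronunciation dictionary, so recognition performance cannot be guaranteed. This paper proposes incorporating existing linguistic knowledge into end-to-end acoustic modeling, using tied triphones as modeling units to address the sparsity of modeling units, which greatly improves the discriminability and robustness of acoustic modeling. Experiments on a Tibetan test set show that the proposed method improves the recognition rate of the CTC-based acoustic model and verify the effectiveness of combining linguistic knowledge with end-to-end acoustic modeling.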

20.

A study on the challenges and opportunities of speech recognition for Bengali language

Mridha M. F. Ohi Abu Quwsar Hamid Md Abdul Monowar Muhammad Mostafa 《Artificial Intelligence Review》2022,55(4):3431-3455

Speech recognition is a fascinating process that offers the opportunity to interact with and command the machine in the field of human-computer interaction. A speech recognition system is language-dependent, constructed directly on the linguistic and textual properties of a language. Automatic speech recognition (ASR) systems are currently being used to translate speech to text flawlessly. Although ASR systems are well established for international languages, their implementation for the Bengali language has not reached an acceptable state. In this research work, we review the current status of research on Bengali ASR systems. We then present the challenges most often encountered while constructing a Bengali ASR system. We split them into language-dependent and language-independent challenges and suggest how each may be overcome. Following a rigorous investigation and highlighting the challenges, we conclude that Bengali ASR systems require specific construction of ASR architectures based on the Bengali language’s grammatical and phonetic structure.
