期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

A syntactic procedure for the recognition of glottal pulses in continuous speech

R. De Mori P. Laface V.A. Makhonine M. Mezzalama 《Pattern recognition》1977,9(4):181-189

Speech prosody contains an important structural information for performing speech analysis and for extracting syntactic nuclei from spoken sentences. This paper describes a procedure based on a multichannel system of epoch filters for recognizing the pulses of glottal chord vibrations by an analysis of the speech waveform. Recognition is performed by a stochastic finite state automaton automatically inferred after experiments. 相似文献

2.

Topic tracking language model for speech recognition

Shinji Watanabe Tomoharu Iwata Takaaki Hori Atsushi Sako Yasuo Ariki 《Computer Speech and Language》2011,25(2):440-461

In a real environment, acoustic and language features often vary depending on the speakers, speaking styles and topic changes. To accommodate these changes, speech recognition approaches that include the incremental tracking of changing environments have attracted attention. This paper proposes a topic tracking language model that can adaptively track changes in topics based on current text information and previously estimated topic models in an on-line manner. The proposed model is applied to language model adaptation in speech recognition. We use the MIT OpenCourseWare corpus and Corpus of Spontaneous Japanese in speech recognition experiments, and show the effectiveness of the proposed method. 相似文献

3.

Pseudo pitch synchronous analysis of speech with applications to speaker recognition

《IEEE transactions on audio, speech, and language processing》2006,14(2):467-478

The fine spectral structure related to pitch information is conveyed in Mel cepstral features, with variations in pitch causing variations in the features. For speaker recognition systems, this phenomenon, known as "pitch mismatch" between training and testing, can increase error rates. Likewise, pitch-related variability may potentially increase error rates in speech recognition systems for languages such as English in which pitch does not carry phonetic information. In addition, for both speech recognition and speaker recognition systems, the parsing of the raw speech signal into frames is traditionally performed using a constant frame size and a constant frame offset, without aligning the frames to the natural pitch cycles. As a result the power spectral estimation that is done as part of the Mel cepstral computation may include artifacts. Pitch synchronous methods have addressed this problem in the past, at the expense of adding some complexity by using a variable frame size and/or offset. This paper introduces Pseudo Pitch Synchronous (PPS) signal processing procedures that attempt to align each individual frame to its natural cycle and avoid truncation of pitch cycles while still using constant frame size and frame offset, in an effort to address the above problems. Text independent speaker recognition experiments performed on NIST speaker recognition tasks demonstrate a performance improvement when the scores produced by systems using PPS are fused with traditional speaker recognition scores. In addition, a better distribution of errors across trials may be obtained for similar error rates, and some insight regarding of role of the fundamental frequency in speaker recognition is revealed. Speech recognition experiments run on the Aurora-2 noisy digits task also show improved robustness and better accuracy for extremely low signal-to-noise ratio (SNR) data. 相似文献

4.

A cache-based natural language model for speech recognition 总被引：4，自引：0，他引：4

Kuhn R. De Mori R. 《IEEE transactions on pattern analysis and machine intelligence》1990,12(6):570-583

Speech-recognition systems must often decide between competing ways of breaking up the acoustic input into strings of words. Since the possible strings may be acoustically similar, a language model is required; given a word string, the model returns its linguistic probability. Several Markov language models are discussed. A novel kind of language model which reflects short-term patterns of word use by means of a cache component (analogous to cache memory in hardware terminology) is presented. The model also contains a 3g-gram component of the traditional type. The combined model and a pure 3g-gram model were tested on samples drawn from the Lancaster-Oslo/Bergen (LOB) corpus of English text. The relative performance of the two models is examined, and suggestions for the future improvements are made 相似文献

5.

Web-based possibilistic language models for automatic speech recognition

《Computer Speech and Language》2014,28(4):923-939

相似文献

6.

Morphology-based language modeling for conversational Arabic speech recognition

Katrin Kirchhoff Dimitra Vergyri Jeff Bilmes Kevin Duh Andreas Stolcke 《Computer Speech and Language》2006,20(4):589-608

Language modeling for large-vocabulary conversational Arabic speech recognition is faced with the problem of the complex morphology of Arabic, which increases the perplexity and out-of-vocabulary rate. This problem is compounded by the enormous dialectal variability and differences between spoken and written language. In this paper, we investigate improvements in Arabic language modeling by developing various morphology-based language models. We present four different approaches to morphology-based language modeling, including a novel technique called factored language models. Experimental results are presented for both rescoring and first-pass recognition experiments. 相似文献

7.

k-TSS language models in speech recognition systems

《Computer Speech and Language》2001,15(2):127-148

The aim of this work is to show the ability of stochastic regular grammars to generate accurate language models which can be well integrated, allocated and handled in a continuous speech recognition system. For this purpose, a syntactic version of the well-known n -gram model, called k -testable language in the strict sense (k -TSS), is used. The complete definition of a k -TSS stochastic finite state automaton is provided in the paper. One of the difficulties arising in representing a language model through a stochastic finite state network is that the recursive schema involved in the smoothing procedure must be adopted in the finite state formalism to achieve an efficient implementation of the backing-off mechanism. The use of the syntactic back-off smoothing technique applied to k -TSS language modelling allowed us to obtain a self-contained smoothed model integrating several k -TSS automata in a unique smoothed and integrated model, which is also fully defined in the paper. The proposed formulation leads to a very compact representation of the model parameters learned at training time: probability distribution and model structure. The dynamic expansion of the structure at decoding time allows an efficient integration in a continuous speech recognition system using a one-step decoding procedure. An experimental evaluation of the proposed formulation was carried out on two Spanish corpora. These experiments showed that regular grammars generate accurate language models (k -TSS) that can be efficiently represented and managed in real speech recognition systems, even for high values of k, leading to very good system performance. 相似文献

8.

Leveraging topical and positional cues for language modeling in speech recognition

Chiu Hsuan-Sheng Chen Kuan-Yu Chen Berlin 《Multimedia Tools and Applications》2014,72(2):1465-1481

Multimedia Tools and Applications - This paper investigates language modeling with topical and positional information for large vocabulary continuous speech recognition. We first compare among a... 相似文献

9.

Emergent artificial intelligence approaches for pattern recognition in speech and language processing

Rodrigo Capobianco Guido José Carlos Pereira Jan Frans Willem Slaets 《Computer Speech and Language》2010,24(3):431-432

相似文献

10.

Emotion recognition from speech using sub-syllabic and pitch synchronous spectral features

Shashidhar G. Koolagudi Sreenivasa Rao Krothapalli 《International Journal of Speech Technology》2012,15(4):495-511

In this work, spectral features extracted from sub-syllabic regions and pitch synchronous analysis are proposed for speech emotion recognition. Linear prediction cepstral coefficients, mel frequency cepstral coefficients and the features extracted from high amplitude regions of spectrum are used to represent emotion specific spectral information. These features are extracted from consonant, vowel and transition regions of each syllable to study the contribution of these regions toward recognition of emotions. Consonant, vowel and the transition regions are determined using vowel onset points. Spectral features extracted from each pitch cycle, are also used to recognize emotions present in speech. The emotions used in this study are: anger, fear, happy, neutral and sad. The emotion recognition performance using sub-syllabic speech segments are compared with the results of conventional block processing approach, where entire speech signal is processed frame by frame. The proposed emotion specific features are evaluated using simulated emotion speech corpus, IITKGP-SESC (Indian Institute of Technology, KharaGPur-Simulated Emotion Speech Corpus). The emotion recognition results obtained using IITKGP-SESC are compared with the results of Berlin emotion speech corpus. Emotion recognition systems are developed using Gaussian mixture models and auto-associative neural networks. The purpose of this study is to explore sub-syllabic regions to identify the emotions embedded in a speech signal, and if possible, to avoid processing of entire speech signal for emotion recognition without serious compromise in the performance. 相似文献

11.

Audiovisual speech recognition for Kannada language using feed forward neural network

Shashidhar R. Patilkulkarni S. 《Neural computing & applications》2022,34(18):15603-15615

Neural Computing and Applications - Audiovisual speech recognition is one of the promising technologies in a noisy environment. In this work, we develop the database for Kannada Language and... 相似文献

12.

Monaural speech separation based on MAXVQ and CASA for robust speech recognition 总被引：1，自引：0，他引：1

Peng Li Yong Guan Shijin Wang Bo Xu Wenju Liu 《Computer Speech and Language》2010,24(1):30-44

Robustness is one of the most important topics for automatic speech recognition (ASR) in practical applications. Monaural speech separation based on computational auditory scene analysis (CASA) offers a solution to this problem. In this paper, a novel system is presented to separate the monaural speech of two talkers. Gaussian mixture models (GMMs) and vector quantizers (VQs) are used to learn the grouping cues on isolated clean data for each speaker. Given an utterance, speaker identification is firstly performed to identify the two speakers presented in the utterance, then the factorial-max vector quantization model (MAXVQ) is used to infer the mask signals and finally the utterance of the target speaker is resynthesized in the CASA framework. Recognition results on the 2006 speech separation challenge corpus prove that this proposed system can improve the robustness of ASR significantly. 相似文献

13.

Continuous speech recognition and syntactic processing in Iranian Farsi language

M. Sheikhan M. Tebyani M. Lotfizad 《International Journal of Speech Technology》1997,1(2):135-141

In this paper, the architecture of the first Iranian Farsi continuous speech recognizer and syntactic processor is introduced. In this system, by extracting suitable features of speech signal (cepstral, delta-cepstral, energy and zero-crossing rate) and using a hydrid architecture of neural networks (a Self-Organizing Feature Map, SOFM, at the first stage and a Multi-Layer Perceptron, MLP, at the second stage) the Iranian Farsi phonemes are recognized. Then the string of phonemes are corrected, segmented and converted to formal text by using a non-stochastic method. For syntactic processing, the symbolic (by using artificial intelligence techniques) and connectionist (by using artificial neural networks) approaches are used to determine the correctness, position and the kind of syntactic errors in Iranian Farsi sentences, as well. 相似文献

14.

基音周期检测的希尔伯特-黄变换方法

焦蓓曾以成毛燕湖《计算机工程与应用》2015,(1):204-207,227

根据语音信号非平稳非线性的时变特点,提出了一种基于希尔伯特-黄变换的基音周期检测法。该方法不需要对语音信号进行短时平稳假设,能自适应地对信号进行分解,具有很高的时频分辨率(不受Heisenberg不确定原理的制约)。利用短时能量对语音进行清浊音判断,再经过经验模态分解将信号分解为若干固有模态函数,然后对每个固有模态函数进行希尔伯特变换求其瞬时幅值与瞬时频率,根据基音特点对分解得到的固有模态函数加权求和突出基音周期信息,最后采用自相关平方法进行基音检测。实验表明,该方法较传统的基音检测法精度有所提高,且鲁棒性较好。相似文献

15.

A study on the challenges and opportunities of speech recognition for Bengali language

Mridha M. F. Ohi Abu Quwsar Hamid Md Abdul Monowar Muhammad Mostafa 《Artificial Intelligence Review》2022,55(4):3431-3455

Speech recognition is a fascinating process that offers the opportunity to interact and command the machine in the field of human-computer interactions. Speech recognition is a language-dependent system constructed directly based on the linguistic and textual properties of any language. Automatic speech recognition (ASR) systems are currently being used to translate speech to text flawlessly. Although ASR systems are being strongly executed in international languages, ASR systems’ implementation in the Bengali language has not reached an acceptable state. In this research work, we sedulously disclose the current status of the Bengali ASR system’s research endeavors. In what follows, we acquaint the challenges that are mostly encountered while constructing a Bengali ASR system. We split the challenges into language-dependent and language-independent challenges and guide how the particular complications may be overhauled. Following a rigorous investigation and highlighting the challenges, we conclude that Bengali ASR systems require specific construction of ASR architectures based on the Bengali language’s grammatical and phonetic structure.

相似文献

16.

对含噪语音进行基频检测的方法

下载免费PDF全文

杨帅宋刚《计算机工程与应用》2009,45(31):128-129

提出一种对含噪语音进行基频检测的新方法。先对含噪语音进行小波去噪,然后再经过预处理后,采用归一化的AMDF算法对语音进行基频提取,后期对基频信号采用搜索试探方法进行平滑处理,通过实验表明,该方法比传统方法有更好的鲁棒性,尤其在低信噪比的情况下。相似文献

17.

Integration of multiple acoustic and language models for improved Hindi speech recognition system

R. K. Aggarwal M. Dave 《International Journal of Speech Technology》2012,15(2):165-180

Despite the significant progress of automatic speech recognition (ASR) in the past three decades, it could not gain the level of human performance, particularly in the adverse conditions. To improve the performance of ASR, various approaches have been studied, which differ in feature extraction method, classification method, and training algorithms. Different approaches often utilize complementary information; therefore, to use their combination can be a better option. In this paper, we have proposed a novel approach to use the best characteristics of conventional, hybrid and segmental HMM by integrating them with the help of ROVER system combination technique. In the proposed framework, three different recognizers are created and combined, each having its own feature set and classification technique. For design and development of the complete system, three separate acoustic models are used with three different feature sets and two language models. Experimental result shows that word error rate (WER) can be reduced about 4% using the proposed technique as compared to conventional methods. Various modules are implemented and tested for Hindi Language ASR, in typical field conditions as well as in noisy environment. 相似文献

18.

基于Conformer的端到端语音识别方法

胡从刚申艺翔孙永奇赵思聪《计算机应用研究》2024,41(7)

针对Conformer编码器的声学输入网络对FBank语音信息提取不足和通道特征信息缺失问题,提出一种RepVGG-SE-Conformer的端到端语音识别方法。首先,利用RepVGG的多分支结构增强模型的语音信息提取能力,而在模型推理时通过结构重参数化将多分支融合为单分支,以降低计算复杂度、加快模型推理速度。然后,利用基于压缩和激励网络的通道注意力机制弥补缺失的通道特征信息,以提高语音识别准确率。最后,在公开数据集Aishell-1上的实验结果表明：相较于Conformer,所提出方法的字错误率降低了10.67%,验证了方法的先进性。此外,RepVGG-SE声学输入网络能够有效提高多种Transformer变体的端到端语音识别模型整体性能,具有很好的泛化能力。相似文献

19.

A computational auditory scene analysis system for speech segregation and robust speech recognition 总被引：2，自引：1，他引：1

Yang Shao Soundararajan Srinivasan Zhaozhang Jin DeLiang Wang 《Computer Speech and Language》2010,24(1):77-93

A conventional automatic speech recognizer does not perform well in the presence of multiple sound sources, while human listeners are able to segregate and recognize a signal of interest through auditory scene analysis. We present a computational auditory scene analysis system for separating and recognizing target speech in the presence of competing speech or noise. We estimate, in two stages, the ideal binary time–frequency (T–F) mask which retains the mixture in a local T–F unit if and only if the target is stronger than the interference within the unit. In the first stage, we use harmonicity to segregate the voiced portions of individual sources in each time frame based on multipitch tracking. Additionally, unvoiced portions are segmented based on an onset/offset analysis. In the second stage, speaker characteristics are used to group the T–F units across time frames. The resulting masks are used in an uncertainty decoding framework for automatic speech recognition. We evaluate our system on a speech separation challenge and show that our system yields substantial improvement over the baseline performance. 相似文献

20.

高斯超向量-支持向量机鉴别性语种识别系统

梁春燕安茂波刘振业索宏彬汪俊杰《计算机工程与应用》2013,49(2):174-176,180

支持向量机在语种识别技术中获得了广泛的研究和应用,并且达到和传统混合高斯模型相当的性能。高斯超向量-支持向量机系统将高斯混合模型与支持向量机有效地结合起来,采用高斯超向量核函数,以支持向量机作为后端分类器。重点介绍基于高斯超向量-支持向量机的语种识别系统,并和传统的高斯混合模型系统进行比较。在美国国家标准技术研究院2003年和2007年语种识别评测数据集上进行实验。实验结果表明,高斯超向量-支持向量机系统相对于混合高斯模型建模的方法,在长时数据上有较明显的性能优势。相似文献