Similar Documents
20 similar documents found.
1.
We have examined various aspects of how to produce synthetic speech. There are numerous applications for such synthetic speech, most of them starting from a textual input, i.e., text-to-speech (TTS). Given the large amount of text in databases and the public's need to access information efficiently, synthetic speech is a natural way to deliver that information. A major application of the future will be speech-to-speech translation, in which a person speaking in one language will be able to converse automatically with someone using another language: ASR would transcribe the original speech to a textual form in language A, then an automatic text translator would map that text to language B, and finally a TTS system for this second language would generate the output speech.
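As a rough illustration of the speech-to-speech translation chain described above (ASR, then text translation, then TTS), here is a minimal sketch; the `recognize`, `translate`, and `synthesize` functions are hypothetical placeholders, not any real engine's API.

```python
# Minimal sketch of an ASR -> MT -> TTS speech-to-speech pipeline.
# All three stage functions are hypothetical placeholders; a real system
# would call actual ASR, machine-translation, and TTS engines.

def recognize(audio: bytes, lang: str) -> str:
    """ASR: transcribe speech in `lang` to text (placeholder)."""
    return "hello world"          # stand-in transcript

def translate(text: str, src: str, tgt: str) -> str:
    """MT: map text from language `src` to language `tgt` (placeholder)."""
    return "hallo welt"           # stand-in translation

def synthesize(text: str, lang: str) -> bytes:
    """TTS: generate speech audio for `text` in `lang` (placeholder)."""
    return text.encode("utf-8")   # stand-in waveform

def speech_to_speech(audio: bytes, src: str, tgt: str) -> bytes:
    transcript = recognize(audio, src)            # speech in language A -> text A
    translated = translate(transcript, src, tgt)  # text A -> text B
    return synthesize(translated, tgt)            # text B -> speech in language B

if __name__ == "__main__":
    print(speech_to_speech(b"...", src="en", tgt="de"))
```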

2.

The majority of automatic speech recognition (ASR) systems are trained on neutral speech, and their performance degrades in the presence of emotional content. Recognizing these emotions in human speech is a crucial aspect of human-machine interaction. In the first stage, combined spectral and differenced prosody features are used for emotion recognition. Emotion recognition here does more than improve ASR performance in its own right: based on the emotion recognized from the input speech, the corresponding adapted emotive ASR model is selected for evaluation in the second stage. This adapted emotive ASR model is built from existing neutral speech together with emotive speech generated synthetically by prosody modification. This work studies the importance of an emotion recognition block at the front end along with emotive-speech adaptation of the ASR models. Speech samples from the IIIT-H Telugu speech corpus were used to build the large-vocabulary ASR systems, and emotional speech samples from the IITKGP-SESC Telugu corpus were used for evaluation. The adapted emotive speech models yielded better performance than the existing neutral speech models.
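A minimal sketch of the two-stage idea, assuming a front-end emotion classifier that gates which adapted ASR model decodes the utterance; the feature extractor, classifier, and per-emotion decoders below are hypothetical stand-ins.

```python
# Sketch of a two-stage scheme: an emotion recognizer at the front end
# selects an emotion-adapted ASR model for decoding. The classifier,
# feature extractor, and per-emotion models are hypothetical stand-ins.

from typing import Callable, Dict, List

def extract_features(audio: List[float]) -> List[float]:
    """Placeholder for the combined spectral + differenced prosody features."""
    return audio[:10]

def classify_emotion(features: List[float]) -> str:
    """Placeholder emotion classifier (stage 1)."""
    return "angry"

# One adapted ASR decoder per emotion (stage 2); neutral is the fallback.
asr_models: Dict[str, Callable[[List[float]], str]] = {
    "neutral": lambda f: "<neutral decode>",
    "angry":   lambda f: "<angry-adapted decode>",
    "happy":   lambda f: "<happy-adapted decode>",
}

def recognize(audio: List[float]) -> str:
    feats = extract_features(audio)
    emotion = classify_emotion(feats)
    decoder = asr_models.get(emotion, asr_models["neutral"])
    return decoder(feats)

print(recognize([0.0] * 100))
```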


3.
In this article, we concentrate on spectral estimation techniques that are useful in extracting the features to be used by an automatic speech recognition (ASR) system. As an aid to understanding the spectral estimation process for speech signals, we adopt the source-filter model of speech production as presented in X. Huang et al. (2001), wherein speech is divided into two broad classes: voiced and unvoiced. Voiced speech is quasi-periodic, consisting of a fundamental frequency corresponding to the pitch of the speaker, as well as its harmonics. Unvoiced speech is stochastic in nature and is best modeled as white noise convolved with an infinite impulse response filter.
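The source-filter dichotomy lends itself to a short sketch: a quasi-periodic impulse train (voiced) and white noise (unvoiced) each excite an all-pole IIR filter standing in for the vocal tract. The filter coefficients are illustrative, not taken from the article.

```python
# Sketch of the source-filter model: voiced speech as a quasi-periodic
# pulse train and unvoiced speech as white noise, each driving an
# all-pole (IIR) vocal-tract filter. Coefficients are illustrative.

import numpy as np
from scipy.signal import lfilter

fs = 8000                       # sample rate (Hz)
n = fs // 4                     # quarter second of samples
f0 = 100                        # fundamental frequency for voiced source (Hz)

# Voiced source: impulse train at the pitch period.
voiced_src = np.zeros(n)
voiced_src[:: fs // f0] = 1.0

# Unvoiced source: zero-mean white noise.
rng = np.random.default_rng(0)
unvoiced_src = rng.standard_normal(n)

# Toy all-pole vocal-tract filter (one resonance; stable poles inside |z| = 1).
a = [1.0, -1.3, 0.8]            # denominator -> infinite impulse response
b = [1.0]

voiced = lfilter(b, a, voiced_src)
unvoiced = lfilter(b, a, unvoiced_src)
print(voiced[:5], unvoiced[:5])
```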

4.
Research on an Automatic Scoring System for Spoken English Tests Using Multi-Feature Fusion   Cited by: 1 (self-citations: 0, other citations: 1)
This paper addresses the question-and-answer task in large-scale spoken English tests, using a multi-feature fusion approach for scoring. Taking the speech recognition transcript as the object of study, three classes of features are extracted for scoring: similarity features, syntactic features, and acoustic features. In total, nine features characterize the relationship between a candidate's answer and the expert score from different perspectives. Among the similarity features, an improved Manhattan distance is used as the similarity measure. A keyword coverage feature based on edit distance is also proposed, which fully accounts for word-variation phenomena in the recognition transcript and provides a basis for giving candidates an objective and fair score. All extracted features are fused with a multiple linear regression model to produce the machine score. Experimental results show that the extracted features are highly effective for machine scoring, and the per-candidate scoring performance of the system reaches 98.4% of that of expert scoring.
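Two of the described ingredients can be sketched directly: the edit-distance-based keyword coverage feature, which tolerates recognized word variants, and fusion of the features into a machine score by multiple linear regression. The matching threshold and toy data are assumptions, not values from the paper.

```python
# Sketch of an edit-distance-based keyword coverage feature (tolerant of
# recognition word variants) and feature fusion with multiple linear
# regression. Thresholds and data are illustrative.

import numpy as np

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via one-row dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def keyword_coverage(keywords, transcript_words, max_norm_dist=0.34):
    """Fraction of keywords matched by some transcript word, where a match
    allows a small normalized edit distance to absorb word variants."""
    hits = 0
    for kw in keywords:
        best = min(edit_distance(kw, w) / max(len(kw), len(w))
                   for w in transcript_words)
        hits += best <= max_norm_dist
    return hits / len(keywords)

print(keyword_coverage(["recognize"], ["recognise", "speech"]))  # variant hit

# Fusion: fit machine scores to expert scores with multiple linear regression.
X = np.array([[0.8, 0.6, 0.7], [0.4, 0.5, 0.3], [0.9, 0.8, 0.9]])  # features
y = np.array([85.0, 60.0, 92.0])                  # expert scores
X1 = np.hstack([X, np.ones((len(X), 1))])         # add intercept column
w, *_ = np.linalg.lstsq(X1, y, rcond=None)        # least-squares fit
print(X1 @ w)                                     # fused machine scores
```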

5.
An Intelligent Speech Generation System with Text Generation Capability   Cited by: 1 (self-citations: 0, other citations: 1)
陈芳, 袁保宗 《电子学报》1997, 25(10): 5-8
An intelligent speech generation system studies not only the usual text-to-speech conversion process but also the generation of the text that conversion requires. This paper presents an intelligent speech generation system with text generation capability. The system produces correct text through the steps of topic selection, text planning, text organization, grammatical realization, and text formation, and then performs text-to-speech conversion on the generated text to produce speech output with high naturalness and intelligibility.

6.
There has been progress in improving speech recognition using a tightly-coupled modality such as lip movement, and in using additional input interfaces to improve recognition of commands in multimodal human-computer interfaces such as speech- and pen-based systems. However, there has been little work that attempts to improve the recognition of spontaneous, conversational speech by adding information from a loosely-coupled modality. This study investigated that idea by integrating information from gaze into an automatic speech recognition (ASR) system. A probabilistic framework for multimodal recognition was formalised and applied to the specific case of integrating gaze and speech. Gaze-contingent ASR systems were developed from a baseline ASR system by redistributing language model probability mass according to visual attention. These systems were tested on a corpus of matched eye movements and related spontaneous conversational British English speech segments (n = 1355) for a visual-based, goal-driven task. The best performing systems had word error rates similar to the baseline ASR system and showed an increase in keyword spotting accuracy. The approach may be useful for developing robust speech-centric multimodal decoding systems.
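A minimal sketch of the redistribution step, assuming gaze yields an attention distribution over on-screen object names: probability mass is interpolated toward attended words and renormalized. The interpolation weight and vocabulary are illustrative.

```python
# Sketch of gaze-contingent language modeling: redistribute LM probability
# mass toward words naming visually attended objects, then renormalize.
# The interpolation weight and vocabularies are illustrative.

lm = {"circle": 0.10, "square": 0.10, "move": 0.40, "stop": 0.40}  # baseline LM
gaze = {"circle": 0.9, "square": 0.1}     # visual attention over screen objects

alpha = 0.3                               # mass shifted toward gazed-at words
boosted = {w: (1 - alpha) * p + alpha * gaze.get(w, 0.0)
           for w, p in lm.items()}
z = sum(boosted.values())                 # renormalize to a distribution
gaze_lm = {w: p / z for w, p in boosted.items()}
print(gaze_lm)   # "circle" gains mass; other words shrink proportionally
```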

7.
Future wireless multimedia terminals will have a variety of applications that require speech recognition capabilities. We consider a robust distributed speech recognition system where representative parameters of the speech signal are extracted at the wireless terminal and transmitted to a centralized automatic speech recognition (ASR) server. We propose two unequal error protection schemes for the ASR bit stream and demonstrate the satisfactory performance of these schemes for typical wireless cellular channels. In addition, a "soft-feature" error concealment strategy is introduced at the ASR server that uses "soft-outputs" from the channel decoder to compute the marginal distribution of only the reliable features during likelihood computation at the speech recognizer. This soft-feature error concealment technique reduces the ASR error rate by more than a factor of 2.5 for certain channels. Also considered is a channel decoding technique with source information that improves ASR performance.
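The soft-feature idea can be sketched as missing-feature likelihood computation with a diagonal Gaussian: dimensions whose channel-decoder reliability falls below a threshold are marginalized out, which for a diagonal Gaussian means simply skipping them. The threshold and model parameters below are assumptions.

```python
# Sketch of "soft-feature" error concealment: during likelihood computation,
# dimensions flagged unreliable by the channel decoder are marginalized out
# of a diagonal-covariance Gaussian (each marginal integrates to 1, so the
# dimension is skipped). Threshold and parameters are illustrative.

import math

def log_likelihood(x, mean, var, reliability, threshold=0.8):
    """Diagonal Gaussian log-likelihood over reliable dimensions only."""
    ll = 0.0
    for xi, mu, v, r in zip(x, mean, var, reliability):
        if r < threshold:
            continue                     # marginalize: skip unreliable feature
        ll += -0.5 * (math.log(2 * math.pi * v) + (xi - mu) ** 2 / v)
    return ll

x = [1.2, -0.4, 3.0]                     # received features
rel = [0.95, 0.30, 0.90]                 # soft-output reliabilities per feature
print(log_likelihood(x, mean=[1.0, 0.0, 2.5], var=[1.0, 1.0, 2.0],
                     reliability=rel))
```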

8.
The quality of synthetic speech is affected by two factors: intelligibility and naturalness. At present, synthesized speech may be highly intelligible, but often sounds unnatural. Speech intelligibility depends on the synthesizer's ability to reproduce the formants, the formant bandwidths, and formant transitions, whereas speech naturalness is thought to depend on the excitation waveform characteristics for voiced and unvoiced sounds. Voiced sounds may be generated by a quasiperiodic train of glottal pulses of specified shape exciting the vocal tract filter. It is generally assumed that the glottal source and the vocal tract filter are linearly separable and do not interact. However, this assumption is often not valid, since it has been observed that appreciable source-tract interaction can occur in natural speech. Previous experiments in speech synthesis have demonstrated that the naturalness of synthetic speech does improve when source-tract interaction is simulated in the synthesis process. The purpose of this paper is two-fold: (1) to present an algorithm for automatically measuring source-tract interaction for voiced speech, and (2) to present a simple speech production model that incorporates source-tract interaction into the glottal source model. This glottal source model controls: (1) the skewness of the glottal pulse, and (2) the amount of the first formant ripple superimposed on the glottal pulse. A major application of the results of this paper is the modeling of vocal disorders.
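A loose sketch of a glottal source with the two controls named above, pulse skewness and first-formant ripple, follows; it uses a Rosenberg-style pulse rather than the paper's model, so the shapes and parameters are assumptions.

```python
# Simplified sketch of a glottal pulse with two controls: pulse skewness
# and a first-formant ripple superimposed on the pulse. This is a
# Rosenberg-style shape, not the paper's exact source model.

import numpy as np

def glottal_pulse(n=200, skew=0.6, ripple_amp=0.05, f1_cycles=6.0):
    """One glottal cycle of n samples; `skew` sets the opening/closing
    split, `ripple_amp` scales a sinusoid with f1_cycles ripple periods
    per glottal cycle (a crude first-formant ripple)."""
    n_open = int(n * skew)                       # opening phase length
    n_close = n - n_open                         # closing phase length
    t_open = np.linspace(0, np.pi / 2, n_open)
    t_close = np.linspace(0, np.pi / 2, n_close)
    pulse = np.concatenate([np.sin(t_open),      # rising (opening) segment
                            np.cos(t_close)])    # falling (closing) segment
    t = np.linspace(0, 1, n, endpoint=False)
    ripple = ripple_amp * np.sin(2 * np.pi * f1_cycles * t)
    return pulse + ripple * pulse                # ripple scaled by pulse amplitude

p = glottal_pulse()
print(p.max(), p.argmax())   # more skew -> peak arrives later in the cycle
```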

9.
钱兆鹏  肖克晶  刘蝉  孙悦 《电子学报》2020,48(5):840-845
Electrolaryngeal (EL) speech suffers from multiple defects, including a monotonous fundamental frequency, a mechanical voice quality, and strong radiated noise, which severely degrade its intelligibility and naturalness; the problem is especially acute for tonal languages such as Mandarin Chinese. Recognition of Mandarin EL speech is prone to consonant confusion, and the recognition output carries no tone information. This paper therefore designs a pinyin spelling corrector and a tone annotation tool on top of the recognition output, and combines them with a Tacotron-2-based TTS system to convert EL speech into normal speech. Objective evaluations show that the pinyin spelling corrector improves pinyin accuracy, and that tone annotation achieves high accuracy when contextual semantic information is available. Subjective listening tests show that the proposed method improves the intelligibility and naturalness of Mandarin EL speech at different linguistic levels. The results demonstrate that the proposed method can convert toneless EL speech into normal speech and outperforms conventional voice conversion methods.

10.
Articulation errors seriously reduce speech intelligibility and the ease of spoken communication. Speech-language pathologists manually identify articulation error patterns based on their clinical experience, which is a time-consuming and expensive process. This study proposes an automatic pronunciation error identification system that uses a novel dependence network (DN) approach. In order to derive a subject's articulatory information, a photo naming task is performed to obtain the subject's speech patterns. Based on clinical knowledge about speech evaluation, a DN scheme is used to model the relationships of a test word, a subject, a speech pattern, and an articulation error pattern. To integrate the DN into automatic speech recognition (ASR), a pronunciation confusion network is proposed to model the probability of the DN and is then used to guide the search space of the ASR. Further, to increase the accuracy of the ASR, an appropriate threshold based on a histogram of pronunciation errors is selected in order to disregard rare pronunciation errors. Finally, the articulation error patterns are identified by integrating the likelihoods of the DNs of each phoneme. The results of this study indicate that it is feasible to clinically implement this dependence network approach to achieve satisfactory performance in articulation evaluation.

11.
The two methods described for giving voices to computers recognize the importance of economical storage of speech information and extensive vocabularies, and consequently are based on principles of speech synthesis. The first, formant synthesis, generates connected speech from low-bit-rate representations of spoken words. The second, text synthesis, produces connected speech solely from printed English text. For both methods the machine must contain stored knowledge of fundamental rules of language and acoustic constraints of human speech. Formant synthesis from an input information rate of about 1000 bits per second is demonstrated, as is text synthesis from a rate of about 75 bits per second. To give the reader an opportunity to evaluate some of the results described, a sample recording is available; see Appendix A for details.

12.
This paper addresses intonation synthesis combining both statistical and generative models to manipulate fundamental frequency (F0) contours in the framework of HMM-based speech synthesis. An F0 contour is represented as a superposition of micro, accent, and register components at logarithmic scale, in light of the Fujisaki model. The three component sets are extracted from a speech corpus by a pitch decomposition algorithm built on a functional F0 model, and a separate context-dependent (CD) HMM is trained for each component. At synthesis time, CDHMM-generated micro, accent, and register components are superimposed to form F0 contours for the input text. Objective and subjective evaluations are carried out on a Japanese speech corpus. Compared with the conventional approach, this method demonstrates improved naturalness by achieving better local and global F0 behaviors, and it exhibits a link between phonology and phonetics, making it possible to flexibly control intonation on the fly by using given marking information to manipulate the parameters of the functional F0 model.
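A minimal sketch of the superposition itself: register, accent, and micro components added at logarithmic scale and exponentiated back to Hz. The component shapes are illustrative, not CDHMM-generated.

```python
# Sketch of the F0 representation described above: a log-F0 contour formed
# by superimposing register, accent, and micro components, in the spirit of
# the Fujisaki framework. Component shapes are illustrative.

import numpy as np

n = 200                                   # frames in the utterance
t = np.linspace(0.0, 1.0, n)

register = np.full(n, np.log(120.0))      # slowly varying base (phrase) level
accent = 0.3 * np.exp(-((t - 0.4) ** 2) / 0.01)   # one accent bump
rng = np.random.default_rng(0)
micro = 0.01 * rng.standard_normal(n)     # small segmental perturbations

log_f0 = register + accent + micro        # superposition at logarithmic scale
f0 = np.exp(log_f0)                       # back to Hz
print(f0.min(), f0.max())                 # ~120 Hz base rising on the accent
```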

13.
张丹娜 《电声技术》2013,37(10):30-32
In view of the current state of bus stop announcers in China, this paper proposes a design for a fully automatic voice stop announcer based on GPS and TTS. It analyzes the announcer's characteristics and presents its overall architecture and a concrete terminal design. Built around an stm32f103vet6 processor with an ARM Cortex-M3 core, the system reads vehicle position information and communicates wirelessly with the control center by combining GPS and GPRS, while speech synthesis technology enables fully automatic and accurate announcement of stop names and service messages.

14.
刘伟  谢建志 《信号处理》2017,33(2):229-235
The quality of the speech corpus is a key factor in the performance of text-to-speech (TTS) synthesis. Recording a TTS corpus takes about six months, during which the speaker's recording state must remain consistent: neither timbre nor energy may vary greatly, which is difficult for the speaker to sustain. This paper therefore presents a speech energy equalization method, comprising a time-domain envelope fluctuation detection algorithm and a frame energy averaging algorithm, to correct energy inconsistencies that arise after a TTS corpus is recorded. First, the relevant energy and fluctuation parameters of reference speech are analyzed and used as a template; next, the envelope fluctuation detection algorithm checks whether each speech sample to be adjusted is acceptable; finally, according to the frame energy averaging criterion, the time-domain amplitudes of all acceptable samples are adjusted to keep the overall energy of the corpus as consistent as possible. Experimental results show that the proposed energy equalization method effectively improves TTS corpus quality and is of practical engineering value.
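The frame energy averaging step can be sketched as a single time-domain gain chosen so an utterance's mean frame energy matches a template value measured from reference speech; the frame length and template energy below are assumptions.

```python
# Sketch of frame-energy equalization: scale each utterance in the time
# domain so its mean frame energy matches a template value taken from the
# reference recordings. Frame length and template value are illustrative.

import numpy as np

def mean_frame_energy(x, frame_len=256):
    """Average per-frame energy over non-overlapping frames."""
    n_frames = len(x) // frame_len
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    return float(np.mean(np.sum(frames ** 2, axis=1)))

def equalize_energy(x, template_energy, frame_len=256):
    """Apply one time-domain gain so the utterance's mean frame energy
    matches the template (energy scales with the square of amplitude)."""
    e = mean_frame_energy(x, frame_len)
    return x * np.sqrt(template_energy / e)

rng = np.random.default_rng(0)
utt = 0.2 * rng.standard_normal(8000)     # stand-in recorded utterance
ref_energy = 25.0                         # template from reference speech
out = equalize_energy(utt, ref_energy)
print(mean_frame_energy(out))             # ~25.0 after equalization
```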

15.
A Description Model for Vocoder Excitation Parameters   Cited by: 2 (self-citations: 0, other citations: 2)
In low-bit-rate vocoders, the description of the voiced excitation directly affects the quality of the reconstructed speech. To improve speech quality, a two-level model of the voiced excitation spectral shape is proposed. Building on the peak-magnitude model of the voiced excitation spectrum used in the mixed excitation linear prediction (MELP) vocoder, the model adds a description of the overall spectrum, together with a corresponding vector quantization method. Tests show that the model noticeably improves the quality of the reconstructed speech, with a subjective naturalness score of 63%, and the improvement is especially pronounced for male voices. The model retains the accuracy of the original description in the low-frequency band of the excitation spectrum while greatly improving accuracy over the full band, and can be applied to a variety of vocoders.

16.
In automatic speech recognition (ASR) based on hidden Markov models (HMMs), it is necessary to obtain a spectral approximation with a reduced set of representation coefficients. The author introduces multitapering into the speech parameterisation scheme, together with a modification of the usual mel frequency cepstrum coefficient (MFCC) processing scheme based on wavelets on intervals (wavelet frequency coefficients, WFC). Phoneme recognition performance improvements over the MFCC baseline were experimentally verified on data from a speech database using multitapering and the WFC.
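Multitaper spectral estimation itself is easy to sketch: average periodograms taken with orthogonal Slepian (DPSS) tapers to lower estimator variance; a mel filterbank and cepstral step would follow in a full parameterisation. The time-bandwidth product NW and taper count are illustrative.

```python
# Sketch of multitaper spectral estimation: average periodograms computed
# with several orthogonal Slepian (DPSS) tapers to reduce the variance of
# the spectrum estimate. NW and the taper count are illustrative.

import numpy as np
from scipy.signal.windows import dpss

def multitaper_spectrum(x, nw=4.0, k=7):
    """Average of k DPSS-tapered periodograms of the frame x."""
    tapers = dpss(len(x), NW=nw, Kmax=k)       # shape (k, len(x))
    spectra = np.abs(np.fft.rfft(tapers * x, axis=1)) ** 2
    return spectra.mean(axis=0)

rng = np.random.default_rng(0)
frame = np.sin(2 * np.pi * 0.1 * np.arange(400)) + 0.1 * rng.standard_normal(400)
spec = multitaper_spectrum(frame)
print(spec.argmax())   # peak near bin 0.1 * 400 = 40
```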

17.
Design of a Uyghur Speech Corpus Based on Minimal Synthesis Units   Cited by: 1 (self-citations: 1, other citations: 0)
To build a concatenative Uyghur text-to-speech system with a small footprint and good intelligibility and naturalness, text design, recording, speech annotation, and corpus construction were carried out in light of the characteristics of the Uyghur language. A syllable corpus was built with the syllable as the basic synthesis unit; to handle the synthesis of syllables absent from that corpus, a phoneme corpus with the phoneme as the synthesis unit was also built. Experimental results show that the concatenative Uyghur TTS system using syllables and phonemes as minimal synthesis units achieves good intelligibility in addition to a relatively small corpus size.

18.
Semantics deals with the organization of meanings and the relations between sensory signs or symbols and what they denote or mean. Computational semantics performs a conceptualization of the world using computational processes for composing a meaning representation structure from available signs and their features present, for example, in words and sentences. Spoken language understanding (SLU) is the interpretation of signs conveyed by a speech signal. SLU and natural language understanding (NLU) share the goal of obtaining a conceptual representation of natural language sentences. Specific to SLU is the fact that signs to be used for interpretation are coded into signals along with other information such as speaker identity. Furthermore, spoken sentences often do not follow the grammar of a language; they exhibit self-corrections, hesitations, repetitions, and other irregular phenomena. SLU systems contain an automatic speech recognition (ASR) component and must be robust to noise due to the spontaneous nature of spoken language and the errors introduced by ASR. Moreover, ASR components output a stream of words with no structure information like punctuation and sentence boundaries. Therefore, SLU systems cannot rely on such markers and must perform text segmentation and understanding at the same time.

19.
A new modification of the spectral subtraction algorithm is presented which enables operating entirely in the time domain and is thus suitable for realization in analog integrated circuits. The noise spectrum is obtained during speechless intervals and stored for spectral subtraction when speech is present in the signal. The frequency range of interest of the speech signal is divided into narrow frequency bands by means of a bank of band-pass filters. For each frequency band the noise model is realized as an auxiliary signal multiplied by a particular weight. A subsystem is presented that produces an output signal whose power is equal to the difference between the input signal power and the noise model power for each frequency channel, thereby realizing the spectral subtraction. Circuits to achieve the described operation are outlined. Finally, simulation results of the noise removal algorithm are shown in the form of a spectrogram and the results showing improvement in automatic speech recognition are given.
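A digital sketch of the per-band power subtraction (the original operates in analog hardware): each band's noise power, estimated during a speechless interval, is subtracted from the band's input power and the band is rescaled accordingly. The Butterworth filter bank and band edges are assumptions.

```python
# Digital sketch of per-band power subtraction: estimate each band's noise
# power during a speechless interval, then scale that band so its output
# power equals input power minus noise power. Filter bank and band edges
# are illustrative.

import numpy as np
from scipy.signal import butter, lfilter

fs = 8000
bands = [(300, 800), (800, 1600), (1600, 3000)]    # band edges in Hz

def bandpass(x, lo, hi):
    b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    return lfilter(b, a, x)

rng = np.random.default_rng(0)
noise = 0.3 * rng.standard_normal(fs)              # speechless interval
speech = np.sin(2 * np.pi * 1000 * np.arange(fs) / fs)
noisy = speech + 0.3 * rng.standard_normal(fs)     # speech-present signal

out = np.zeros_like(noisy)
for lo, hi in bands:
    n_band = bandpass(noise, lo, hi)
    x_band = bandpass(noisy, lo, hi)
    p_noise = np.mean(n_band ** 2)                 # stored noise model
    p_in = np.mean(x_band ** 2)
    p_out = max(p_in - p_noise, 0.0)               # subtract powers, floor at 0
    gain = np.sqrt(p_out / p_in) if p_in > 0 else 0.0
    out += gain * x_band                           # attenuated band recombined

print(np.mean(noisy ** 2), np.mean(out ** 2))      # output power reduced
```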

20.
Recent advances in the automatic recognition of audiovisual speech   Cited by: 11 (self-citations: 0, other citations: 0)
Visual speech information from the speaker's mouth region has been successfully shown to improve noise robustness of automatic speech recognizers, thus promising to extend their usability in the human computer interface. In this paper, we review the main components of audiovisual automatic speech recognition (ASR) and present novel contributions in two main areas: first, the visual front-end design, based on a cascade of linear image transforms of an appropriate video region of interest, and subsequently, audiovisual speech integration. On the latter topic, we discuss new work on feature and decision fusion combination, the modeling of audiovisual speech asynchrony, and incorporating modality reliability estimates into the bimodal recognition process. We also briefly touch upon the issue of audiovisual adaptation. We apply our algorithms to three multisubject bimodal databases, ranging from small- to large-vocabulary recognition tasks, recorded in both visually controlled and challenging environments. Our experiments demonstrate that the visual modality improves ASR over all conditions and data considered, though less so for visually challenging environments and large vocabulary tasks.
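Decision fusion with a reliability weight reduces to a log-linear combination of per-class stream scores; a sketch follows, with the weights and likelihood values purely illustrative.

```python
# Sketch of decision fusion with modality reliability weights: combine
# audio and visual class log-likelihoods with a stream weight that rises
# with the estimated reliability of each modality. Values are illustrative.

import numpy as np

def fuse(audio_loglik, visual_loglik, audio_weight):
    """Weighted log-linear combination; audio_weight in [0, 1], visual
    gets the complement. Weight the more reliable stream more heavily."""
    scores = audio_weight * audio_loglik + (1 - audio_weight) * visual_loglik
    return int(np.argmax(scores))

audio = np.array([-12.0, -9.5, -11.0])    # per-class audio log-likelihoods
visual = np.array([-8.0, -10.0, -7.5])    # per-class visual log-likelihoods

print(fuse(audio, visual, audio_weight=0.9))  # clean audio: audio dominates (class 1)
print(fuse(audio, visual, audio_weight=0.3))  # noisy audio: visual dominates (class 2)
```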
