Similar Documents
 20 similar documents found.
1.
This work attempts to convert given neutral speech to a target emotional style using signal processing techniques. Sadness and anger emotions are considered in this study. For emotion conversion, we propose signal processing methods to process neutral speech in three ways: (i) modifying the energy spectra, (ii) modifying the source features and (iii) modifying the prosodic features. Energy spectra of different emotions are analyzed, and a method has been proposed to modify the energy spectra of neutral speech after dividing the speech into different frequency bands. For the source part, epoch strength and epoch sharpness are extensively studied. A new method has been proposed for modification and incorporation of epoch strength and epoch sharpness parameters using appropriate modification factors. Prosodic features like pitch contour and intensity have also been modified in this work. New pitch contours corresponding to the target emotions are derived from the pitch contours of neutral test utterances, and these new pitch contours are incorporated into the neutral utterances. Intensity modification is done by dividing neutral utterances into three equal segments and modifying the intensities of these segments separately, according to the modification factors suitable for the target emotions. Subjective evaluation using mean opinion scores has been carried out to evaluate the quality of the converted emotional speech. Although the modified speech does not completely resemble the target emotion, these subjective tests demonstrate the potential of the proposed methods to change the style of the speech.
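As a concrete illustration of the intensity-modification step described above, the sketch below splits a neutral utterance into three equal segments and scales each by an emotion-dependent factor. The gain values, the simple linear scaling, and the peak normalisation are assumptions made for illustration only; the paper derives its modification factors from an analysis of emotional speech.

```python
import numpy as np

# Hypothetical per-segment gain factors for each target emotion;
# the paper derives suitable factors from emotional speech analysis,
# the numbers here are placeholders for illustration only.
SEGMENT_GAINS = {
    "anger":   (1.3, 1.2, 1.4),
    "sadness": (0.8, 0.7, 0.6),
}

def modify_intensity(samples: np.ndarray, emotion: str) -> np.ndarray:
    """Split a neutral utterance into three equal segments and scale
    the amplitude of each segment by an emotion-specific factor."""
    gains = SEGMENT_GAINS[emotion]
    n = len(samples)
    bounds = [0, n // 3, 2 * n // 3, n]
    out = samples.astype(float).copy()
    for g, (start, stop) in zip(gains, zip(bounds[:-1], bounds[1:])):
        out[start:stop] *= g
    # Prevent clipping after amplification (assumes samples in [-1, 1]).
    peak = np.max(np.abs(out))
    if peak > 1.0:
        out /= peak
    return out
```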

2.
This paper reports progress in the synthesis of conversational speech, from the viewpoint of work carried out on the analysis of a very large corpus of expressive speech in normal everyday situations. With recent developments in concatenative techniques, speech synthesis has overcome the barrier of realistically portraying extra-linguistic information by using the actual voice of a recognizable person as a source for units, combined with minimal use of signal processing. However, the technology still faces the problem of expressing paralinguistic information, i.e., the variety in the types of speech and laughter that a person might use in everyday social interactions. Paralinguistic modification of an utterance portrays the speaker's affective states and shows his or her relationship with the listener through variations in the manner of speaking, by means of prosody and voice quality. These inflections are carried on the propositional content of an utterance, and can perhaps be modeled by rule, but they are also expressed through nonverbal utterances, the complexity of which may be beyond the capabilities of many current synthesis methods. We suggest that this problem may be solved by the use of phrase-sized utterance units taken intact from a large corpus.

3.
Recently, studies on emotion recognition technology have been conducted in the fields of natural language processing, speech signal processing, image data processing, and brain wave analysis, with the goal of letting the computer understand ambiguous information such as emotion or sensibility. This paper statistically studies the features of Japanese and English emotional expressions based on an emotion-annotated parallel corpus and proposes a method to estimate the emotion of emotional expressions in a sentence. The proposed method identifies the words or phrases that carry emotion, which we call emotional expressions, and estimates their emotion category by focusing on three kinds of features: the part of speech of the emotional expression, the position of the emotional expression, and the part of speech of the morphemes immediately before and after the target emotional expression.

4.
The present paper investigates multiword expressions (MWEs) in spoken language and possible ways of identifying MWEs automatically in speech corpora. Two MWEs that emerged from previous studies and that occur frequently in Dutch are analyzed to study their pronunciation characteristics and compare them to those of other utterances in a large speech corpus. The analyses reveal that these MWEs display extreme pronunciation variation and reduction, i.e., many phonemes and even syllables are deleted. Several measures of pronunciation reduction are calculated for these two MWEs and for all other utterances in the corpus. Five of these measures are more than twice as high for the MWEs, thus indicating considerable reduction. One overall measure of pronunciation deviation is then calculated and used to automatically identify MWEs in a large speech corpus. The results show that neither this overall measure nor frequency of co-occurrence alone is suitable for identifying MWEs. The best results are obtained by using a metric that combines overall pronunciation reduction with weighted frequency. In this way, recurring “islands of pronunciation reduction” that contain (potential) MWEs can be identified in a large speech corpus.
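The abstract does not spell out the combined metric, so the following sketch assumes one plausible form: the overall pronunciation-deviation score of a candidate word sequence weighted by the logarithm of its frequency of occurrence. The function names and the exact weighting are illustrative, not the paper's formula.

```python
import math
from collections import Counter

def mwe_scores(candidates, reduction):
    """Rank candidate word sequences by combining pronunciation
    reduction with weighted frequency.

    candidates: list of word-tuple candidates (e.g. bigrams/trigrams)
                extracted from the corpus, one entry per occurrence
    reduction:  dict mapping candidate -> overall pronunciation
                deviation score (higher = more reduced)

    The combination used here (reduction times log frequency) is an
    illustrative assumption; the paper only states that the best metric
    weights overall pronunciation reduction by frequency.
    """
    freq = Counter(candidates)
    scores = {}
    for cand, f in freq.items():
        if cand in reduction:
            scores[cand] = reduction[cand] * math.log(1 + f)
    # Highest-scoring candidates are the "islands of pronunciation
    # reduction" that are likely to contain multiword expressions.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```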

5.
This study focuses on the perception of emotion and attitude in speech. The ability to identify vocal expressions of emotion and/or attitude in speech material was investigated. Systematic perception experiments were carried out to determine optimal values for the acoustic parameters pitch level, pitch range and speech rate. Speech was manipulated by varying these parameters around the values found in a selected subset of the speech material, which consisted of two sentences spoken by a male speaker expressing seven emotions or attitudes: neutrality, joy, boredom, anger, sadness, fear, and indignation. Listening tests were carried out with this speech material, and optimal values for pitch level, pitch range, and speech rate were derived for generating speech expressing emotion or attitude from a neutral utterance. These values were perceptually tested in re-synthesized speech and in synthetic speech generated from LPC-coded diphones.
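A minimal sketch of how pitch level and pitch range might be varied around the values of a reference utterance is given below. Re-centring the contour on its median is an assumption of this sketch, not necessarily the manipulation used in the study, and speech-rate changes would require an additional time-scaling step (e.g. PSOLA/WSOLA), which is not shown.

```python
import numpy as np

def manipulate_f0(f0: np.ndarray, level_shift: float, range_factor: float) -> np.ndarray:
    """Shift the pitch level and expand/compress the pitch range of an
    F0 contour (in Hz, 0 for unvoiced frames).

    The contour is re-centred on its median before scaling, which is one
    common way to vary pitch level and range independently; the exact
    manipulation used in the study may differ.
    """
    voiced = f0 > 0
    out = f0.astype(float).copy()
    if not np.any(voiced):
        return out
    median = np.median(f0[voiced])
    out[voiced] = (f0[voiced] - median) * range_factor + median + level_shift
    return out
```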

6.
We investigate whether accent identification is more effective for English utterances embedded in a different language as part of a mixed code than for English utterances that are part of a monolingual dialogue. Our focus is on Xhosa and Zulu, two South African languages for which code-mixing with English is very common. In order to carry out our investigation, we extract English utterances from mixed-code Xhosa and Zulu speech corpora, as well as comparable utterances from an English-only corpus by Xhosa and Zulu mother-tongue speakers. Experiments using automatic accent identification systems show that identification is substantially more accurate for the utterances originating from the mixed-code speech. These findings are supported by a corresponding set of perceptual experiments in which human subjects were asked to identify the accents of recorded utterances. We conclude that accent identification is more successful for these utterances because accents are more pronounced for English embedded in mother-tongue speech than for English spoken as part of a monolingual dialogue by non-native speakers. Furthermore, we find that this is true for human listeners as well as for automatic identification systems.

7.
This paper presents a novel data-driven expressive speech animation synthesis system with phoneme-level controls. The system is based on a pre-recorded facial motion capture database, in which an actress was directed to recite a pre-designed corpus with four facial expressions (neutral, happiness, anger and sadness). Given new phoneme-aligned expressive speech and its emotion modifiers as inputs, a constrained dynamic programming algorithm searches for the best-matched captured motion clips in the processed facial motion database by minimizing a cost function. Users can optionally specify ‘hard constraints’ (motion-node constraints for expressing phoneme utterances) and ‘soft constraints’ (emotion modifiers) to guide this search process. We also introduce a phoneme–Isomap interface for visualizing and interacting with phoneme clusters, which are typically composed of thousands of facial motion capture frames. On top of this visualization interface, users can conveniently remove contaminated motion subsequences from a large facial motion dataset. Facial animation synthesis experiments and objective comparisons between synthesized facial motion and captured motion show that the system is effective for producing realistic expressive speech animations.
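The constrained dynamic-programming search can be sketched as a Viterbi-style minimisation over per-phoneme candidate clips, where hard constraints restrict the candidate set for a slot and soft constraints enter through the cost functions. The cost functions are left abstract here, since the paper's actual formulation is richer; this only shows the search structure under those assumptions.

```python
def search_motion_path(candidates, node_cost, trans_cost):
    """Minimal dynamic-programming sketch of the clip-selection step.

    candidates: list (one entry per phoneme position) of lists of motion-clip
                ids; hard constraints are applied beforehand by restricting a
                slot to the user-pinned clip.
    node_cost:  node_cost(t, clip) -> mismatch between clip and the target
                phoneme/emotion at position t (soft constraints live here)
    trans_cost: trans_cost(prev_clip, clip) -> smoothness penalty
    """
    # best[t][clip] = (accumulated cost, back-pointer to previous clip)
    best = [{c: (node_cost(0, c), None) for c in candidates[0]}]
    for t in range(1, len(candidates)):
        layer = {}
        for c in candidates[t]:
            prev, prev_cost = min(
                ((p, pc[0] + trans_cost(p, c)) for p, pc in best[t - 1].items()),
                key=lambda x: x[1],
            )
            layer[c] = (prev_cost + node_cost(t, c), prev)
        best.append(layer)
    # Back-trace the cheapest path through the candidate lattice.
    clip, _ = min(best[-1].items(), key=lambda kv: kv[1][0])
    path = [clip]
    for t in range(len(candidates) - 1, 0, -1):
        clip = best[t][clip][1]
        path.append(clip)
    return list(reversed(path))
```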

8.
The work presented in this paper is focused on the development of a simulated emotion database, particularly for excitation source analysis. The presence of simultaneous electroglottogram (EGG) recordings for each emotion utterance helps to accurately analyze the variations in the source parameters for different emotions. The paper describes the development of a comparatively large simulated emotion database for three emotions (Anger, Happy and Sad), along with neutrally spoken utterances, in three languages (Tamil, Malayalam and Indian English). Emotion utterances in each language are recorded from 10 speakers, in multiple sessions for Tamil and Malayalam. Unlike existing simulated emotion databases, emotionally biased utterances are used for recording instead of emotionally neutral utterances. Based on emotion recognition experiments, the emotions elicited from emotionally biased utterances are found to show better emotion discrimination than those from emotionally neutral utterances. Also, based on a comparative experimental analysis, the speech and EGG utterances of the proposed simulated emotion database are found to preserve the same general trend in the excitation source characteristics (instantaneous F0 and strength of excitation) for different emotions as the classical German emotion speech-EGG database (EmoDb). Finally, the emotion recognition rates obtained for the proposed speech-EGG emotion database, using a conventional mel frequency cepstral coefficient and Gaussian mixture model based emotion recognition system, are found to be comparable with those of the existing German (EmoDb) and IITKGP-SESC Telugu speech emotion databases.
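A rough sketch of the conventional MFCC/GMM baseline mentioned at the end of the abstract is shown below: one GMM is trained per emotion and test utterances are classified by maximum average log-likelihood. The sample rate, number of cepstral coefficients and mixture components are assumed values, not the settings used in the paper.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_mfcc(path, sr=16000, n_mfcc=13):
    """Frame-level MFCCs; the feature settings here are assumptions."""
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (frames, n_mfcc)

def train_emotion_gmms(train_files, n_components=32):
    """Fit one GMM per emotion on pooled MFCC frames.
    train_files: dict mapping emotion label -> list of wav paths."""
    models = {}
    for emotion, paths in train_files.items():
        frames = np.vstack([extract_mfcc(p) for p in paths])
        models[emotion] = GaussianMixture(n_components=n_components,
                                          covariance_type="diag").fit(frames)
    return models

def classify(path, models):
    """Pick the emotion whose GMM gives the highest average log-likelihood."""
    frames = extract_mfcc(path)
    return max(models, key=lambda e: models[e].score(frames))
```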

9.
Research on emotion recognition from speech signals (语音信号中的情感识别研究)   Total citations: 25 (self-citations: 0, citations by others: 25)
赵力, 钱向民, 邹采荣, 吴镇扬. 《软件学报》 (Journal of Software), 2001, 12(7): 1050-1055
This paper proposes methods for recognizing emotional features from speech signals. A total of 300 emotional utterances conveying joy, anger, surprise, and sadness were collected from 5 speakers, and 10 emotional features were extracted from this speech material. Three emotion recognition methods based on principal component analysis are proposed for speech signals. Using these methods, recognition performance roughly approaching normal human performance was obtained.
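One plausible instantiation of a PCA-based recognizer over the ten prosodic features is sketched below: project onto the leading principal components and classify by the nearest emotion centroid in that space. The paper proposes three PCA-based variants; this sketch does not claim to reproduce any of them exactly, and the number of retained components is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

def train_pca_classifier(features, labels, n_components=3):
    """features: (n_utterances, 10) matrix of prosodic features; labels: one
    emotion label per utterance. Projects onto the leading principal
    components and stores a centroid per emotion -- one simple way to build
    a PCA-based recognizer."""
    pca = PCA(n_components=n_components).fit(features)
    projected = pca.transform(features)
    labels = np.asarray(labels)
    centroids = {lab: projected[labels == lab].mean(axis=0)
                 for lab in set(labels.tolist())}
    return pca, centroids

def predict(pca, centroids, x):
    """Classify one 10-dimensional feature vector by nearest centroid."""
    z = pca.transform(np.asarray(x).reshape(1, -1))[0]
    return min(centroids, key=lambda lab: np.linalg.norm(z - centroids[lab]))
```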

10.
Speech is an effective medium for expressing emotions and attitudes through language. Detecting the emotional content of a speech signal and identifying the emotions in speech utterances are important tasks for researchers. Speech emotion recognition has been considered an important research area over the last decade, and many researchers have been attracted to the automated analysis of human affective behaviour. As a result, a number of systems, algorithms, and classifiers have been developed for identifying the emotional content of a person's speech. In this study, the available literature on databases, features, and classifiers for speech emotion recognition in assorted languages is surveyed.

11.
Spelling speech recognition can be applied for several purposes, including enhancement of speech recognition systems and implementation of name retrieval systems. This paper presents a Thai spelling analysis used to develop a Thai spelling speech recognizer. The Thai phonetic characteristics, alphabet system and spelling methods are analyzed. As training resources, two alternative corpora, a small spelling speech corpus and an existing large continuous speech corpus, are used to train hidden Markov models (HMMs), and their recognition results are compared. To address the difference in utterance speed between spelling utterances and continuous speech utterances, an adjustment of utterance speed is taken into account. Two alternative language models, bigram and trigram, are used for investigating the performance of spelling speech recognition. Our approach achieves up to a 98.0% letter correction rate, 97.9% letter accuracy and an 82.8% utterance correction rate when the language model is trained on trigrams and the acoustic model is trained from the small spelling speech corpus with eight Gaussian mixtures.

12.
In order to study the relationship between emotion and intonation, a new technique is introduced for the extraction of the dominant pitches within speech utterances and the quasi-musical analysis of the multipitch structure. After the distribution of fundamental frequencies over the entire utterance has been obtained, the underlying pitch structure is determined using an unsupervised "cluster" (Gaussian mixtures) algorithm. The technique normally results in 3-6 pitch clusters per utterance that can then be evaluated in terms of their inherent dissonance, harmonic "tension", and "major or minor modality". Stronger dissonance and tension were found in utterances with negative affect, relative to utterances with positive affect. Most importantly, utterances that were evaluated as having positive or negative affect had significantly different modality values. Factor analysis showed that the measures involving multiple pitches were distinct from other acoustical measures, indicating that the pitch substructure is an independent factor contributing to the affective valence of speech prosody.
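The pitch-clustering step can be sketched as fitting Gaussian mixtures to the voiced F0 values of an utterance and selecting the number of clusters in the 3-6 range. Clustering on a semitone scale and using BIC for model selection are assumptions of this sketch; the resulting cluster means could then be compared pairwise to assess dissonance, tension, and modality as the abstract describes.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def dominant_pitches(f0_hz, candidate_k=range(3, 7)):
    """Cluster the voiced F0 values of an utterance into a small number of
    dominant pitches with a Gaussian mixture, choosing the number of
    clusters (3-6) by BIC. Log-frequency (semitone) scaling and BIC-based
    selection are assumptions of this sketch."""
    voiced = np.asarray(f0_hz, dtype=float)
    voiced = voiced[voiced > 0]
    semitones = 12 * np.log2(voiced / 55.0)          # relative to A1 = 55 Hz
    X = semitones.reshape(-1, 1)
    best = min((GaussianMixture(n_components=k).fit(X) for k in candidate_k),
               key=lambda g: g.bic(X))
    # Return the cluster means converted back to Hz, plus their weights.
    means_hz = 55.0 * 2 ** (best.means_.ravel() / 12)
    return sorted(zip(means_hz, best.weights_.ravel()))
```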

13.
Emphasis plays an important role in expressive speech synthesis by highlighting the focus of an utterance to draw the attention of the listener. We present a hidden Markov model (HMM)-based emphatic speech synthesis model, with the ultimate objective of synthesizing corrective feedback in a computer-aided pronunciation training (CAPT) system. We first analyze contrastive (neutral versus emphatic) speech recordings, examining the changes in the acoustic features of emphasis at different prosodic locations and the local prominence of emphasis. Based on this analysis, we develop a perturbation model that predicts the changes of the acoustic features from neutral to emphatic speech with high accuracy. Building on the perturbation model, we then develop an HMM-based emphatic speech synthesis model. Different from previous work, the HMM model is trained on a neutral corpus, but context features and additional acoustic-feature-related features are used while growing the decision tree. The output of the perturbation model can then be used to supervise the HMM model to synthesize emphatic speech, instead of applying the perturbation model directly at the back end of a neutral speech synthesis model. In this way, the demand for an emphasis corpus is reduced and the quality degradation caused by speech modification algorithms is avoided. The experiments indicate that the proposed emphatic speech synthesis model improves the emphasis quality of synthesized speech while maintaining a high degree of naturalness.
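The perturbation model itself is not specified in the abstract. As a heavily simplified stand-in, the sketch below trains a ridge regression on parallel neutral/emphatic recordings to predict the change in prosodic features, and adds that predicted change to the neutral features to obtain emphatic targets for the synthesizer. The real model, its feature set, and its training procedure are more elaborate.

```python
import numpy as np
from sklearn.linear_model import Ridge

class PerturbationModel:
    """Toy stand-in for a neutral-to-emphatic perturbation model: a linear
    regressor mapping neutral prosodic features (plus any context features
    appended to them) to the change in F0, duration, and energy."""

    def __init__(self):
        self.reg = Ridge(alpha=1.0)

    def fit(self, neutral_feats, emphatic_feats):
        # Train on parallel neutral/emphatic recordings of the same text.
        neutral = np.asarray(neutral_feats)
        deltas = np.asarray(emphatic_feats) - neutral
        self.reg.fit(neutral, deltas)
        return self

    def predict_targets(self, neutral_feats):
        # Predicted emphatic targets that could supervise the HMM synthesizer.
        neutral = np.asarray(neutral_feats)
        return neutral + self.reg.predict(neutral)
```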

14.
Speech emotion recognition is a challenging research topic in speech processing with broad application prospects. This paper explores one of its key problems: generating effective feature representations for emotion recognition. Emotional feature representations of the speech signal are generated from four perspectives: (1) low-level acoustic features, including energy, fundamental frequency, voice quality, and spectrum-related features, together with statistics computed over these low-level features; (2) features obtained by transforming cepstral acoustic features into distances to emotion-dependent Gaussian mixture models; (3) features obtained by transforming acoustic features according to an acoustic dictionary; and (4) features obtained by converting acoustic features into Gaussian supervectors. Experiments compare the individual performance of each feature type for emotion recognition, and fusion of the different features is also attempted. Finally, the different acoustic features are compared on emotion datasets in several languages, including the IEMOCAP English corpus, the CASIA Chinese corpus, and the Berlin German corpus. On the IEMOCAP dataset the system reaches a recognition accuracy of 71.9%, exceeding the best previously reported result on this dataset.
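The fourth representation, Gaussian supervectors, is commonly built by relevance-MAP adaptation of a universal background GMM to each utterance, followed by concatenation of the adapted means. The sketch below follows that standard recipe; the relevance factor and mixture count are assumed values rather than the paper's settings, and only the means (not weights or covariances) are adapted.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_frames, n_components=64):
    """Universal background model fitted on pooled frames from many utterances."""
    return GaussianMixture(n_components=n_components,
                           covariance_type="diag").fit(background_frames)

def gmm_supervector(ubm, frames, relevance=16.0):
    """Relevance-MAP adaptation of the UBM means to one utterance, followed
    by concatenation of the adapted means (the usual GMM-supervector recipe;
    the relevance factor here is an assumed value)."""
    post = ubm.predict_proba(frames)                  # (T, K) responsibilities
    n_k = post.sum(axis=0)                            # soft counts per mixture
    first = post.T @ frames                           # (K, D) weighted sums
    alpha = (n_k / (n_k + relevance))[:, None]        # adaptation coefficients
    ex = np.where(n_k[:, None] > 0,
                  first / np.maximum(n_k[:, None], 1e-10), 0.0)
    adapted_means = alpha * ex + (1.0 - alpha) * ubm.means_
    return adapted_means.ravel()                      # supervector of length K*D
```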

15.
Automatic perception of human affective behaviour from facial expressions, together with recognition of intentions and social goals from dialogue contexts, would greatly enhance natural human-robot interaction. This research concentrates on intelligent neural network based facial emotion recognition and Latent Semantic Analysis based topic detection for a humanoid robot. The work first incorporates the Facial Action Coding System, which describes physical cues and anatomical knowledge of facial behaviour, for the detection of neutral and six basic emotions from real-time posed facial expressions. Feedforward neural networks are used to implement upper and lower facial Action Unit (AU) analysers that recognise six upper and 11 lower facial actions, including Inner and Outer Brow Raiser, Lid Tightener, Lip Corner Puller, Upper Lip Raiser, Nose Wrinkler, and Mouth Stretch. An artificial neural network based facial emotion recogniser then takes the 17 derived Action Units as inputs to decode neutral and six basic emotions from facial expressions. Moreover, in order for the robot to make appropriate responses based on the detected affective facial behaviours, Latent Semantic Analysis is used to focus on the underlying semantic structure of the data and to go beyond linguistic restrictions to identify topics embedded in the users’ conversations. The overall development is integrated with a modern humanoid robot platform under its Linux C++ SDKs. The work presented here shows great potential for developing personalised intelligent agents/robots with emotional and social intelligence.
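The final decoding stage takes the 17 detected Action Units as input to a neural network that outputs one of the seven categories. The sketch below uses scikit-learn's MLPClassifier with a single hidden layer as a placeholder topology; the network architecture and training setup are assumptions, not the ones reported in the paper.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

EMOTIONS = ["neutral", "happiness", "sadness", "anger",
            "fear", "surprise", "disgust"]

def train_au_emotion_net(au_vectors, labels):
    """au_vectors: (n_samples, 17) activations of the 6 upper + 11 lower
    facial Action Units produced by the AU analysers; labels: one of the
    seven categories above. The single hidden layer of 24 units is an
    assumed placeholder topology."""
    net = MLPClassifier(hidden_layer_sizes=(24,), max_iter=2000)
    net.fit(np.asarray(au_vectors), labels)
    return net

def decode_emotion(net, au_vector):
    """Decode one 17-dimensional AU vector into an emotion label."""
    return net.predict(np.asarray(au_vector).reshape(1, -1))[0]
```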

16.
The present work investigates the following specific research questions concerning vocal emotion recognition: whether vocal expressions of discrete emotions (i) can be distinguished from no-emotion (i.e. neutral), (ii) can be distinguished from one another, (iii) whether surprise, which is actually a cognitive component that can accompany any emotion, can also be recognized as a distinct emotion, and (iv) whether emotions can be recognized cross-lingually. This study provides more information regarding the nature and function of emotion and will help in developing a generalized voice emotion recognition system, which would increase the efficiency of human-machine interaction systems. In this work an emotional utterance database is created with 140 acted utterances per speaker, consisting of short sentences in six full-blown basic emotions plus neutral, in five native languages of Assam. The database is validated by a listening test. Four feature sets are extracted, based on WPCC2 (Wavelet-Packet-Cepstral-Coefficients computed by method 2), MFCC (Mel-Frequency-Cepstral-Coefficients), tfWPCC2 (Teager-energy-operated-in-Transform-domain WPCC2) and tfMFCC. The Gaussian Mixture Model (GMM) is used as the classifier. The performance of these feature sets is compared with respect to classification accuracy in two experiments: (i) text- and speaker-independent vocal emotion recognition in individual languages, and (ii) cross-lingual vocal emotion recognition. tfWPCC2 is a new wavelet feature set proposed by the same authors in a recent paper presented at a national seminar in India, as cited in the references.

17.
In large-vocabulary continuous speech recognition, high-quality training speech data is the foundation and prerequisite of all recognition work, and the ability to select material covering more phonetic phenomena is key to improving recognition performance. This paper collects a large number of spoken sentences from various Uyghur colloquial communication platforms and, taking into account coarticulation effects and the applicability of common words, filters the material according to an evaluation function. The selected corpus contains more balanced and efficient triphone coverage and encompasses a more comprehensive range of phonetic phenomena, laying a solid foundation for training accurate and reliable acoustic models.
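The evaluation function used for filtering is not given in the abstract; the sketch below shows one common greedy selection scheme as a stand-in, favouring sentences whose triphones are still poorly covered, normalised by sentence length. The scoring rule and target corpus size are illustrative assumptions.

```python
from collections import Counter

def triphones(phones):
    """All triphone units in a phone sequence."""
    return [tuple(phones[i:i + 3]) for i in range(len(phones) - 2)]

def select_sentences(sentences, target_size):
    """Greedy corpus selection: repeatedly pick the sentence whose triphones
    are least covered so far, normalised by sentence length.

    sentences: list of (sentence_id, phone_list) pairs."""
    coverage = Counter()
    selected = []
    remaining = list(sentences)
    while remaining and len(selected) < target_size:
        def gain(item):
            tris = triphones(item[1])
            if not tris:
                return 0.0
            # Rare or unseen triphones contribute more to the score.
            return sum(1.0 / (1 + coverage[t]) for t in tris) / len(tris)
        best = max(remaining, key=gain)
        selected.append(best[0])
        coverage.update(triphones(best[1]))
        remaining.remove(best)
    return selected
```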

18.
Prosody conversion from neutral speech to emotional speech   Total citations: 1 (self-citations: 0, citations by others: 1)
Emotion is an important element in expressive speech synthesis. Unlike traditional discrete emotion simulations, this paper attempts to synthesize emotional speech using "strong", "medium", and "weak" classifications. Three models are tested: a linear modification model (LMM), a Gaussian mixture model (GMM), and a classification and regression tree (CART) model. The linear modification model directly modifies sentence F0 contours and syllabic durations according to acoustic distributions of emotional speech, such as F0 topline, F0 baseline, durations, and intensities. Further analysis shows that emotional speech is also related to stress and linguistic information. Unlike the linear modification method, the GMM and CART models try to map the subtle prosody distributions between neutral and emotional speech. While the GMM uses only the acoustic features, the CART model integrates linguistic features into the mapping. A pitch target model optimized to describe Mandarin F0 contours is also introduced. For all conversion methods, a deviation of perceived expressiveness (DPE) measure is created to evaluate the expressiveness of the output speech. The results show that the LMM gives the worst results among the three methods. The GMM method is more suitable for a small training set, while the CART method gives better emotional speech output when trained with a large context-balanced corpus. The methods discussed in this paper indicate ways to generate emotional speech in speech synthesis. The objective and subjective evaluation processes are also analyzed. These results support the use of neutral-semantic-content text in databases for emotional speech synthesis.
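The linear modification model can be sketched as a remapping of the neutral F0 contour between the measured neutral baseline/topline and the emotional baseline/topline, plus a uniform scaling of syllabic durations. The linear interpolation below is one straightforward reading of the LMM; the exact mapping and the direction of the duration factor in the paper may differ.

```python
import numpy as np

def linear_modify_f0(f0, src_top, src_base, tgt_top, tgt_base):
    """Linearly remap a neutral sentence F0 contour (Hz, 0 = unvoiced) so
    that its topline/baseline match the topline/baseline measured on
    emotional speech."""
    out = np.asarray(f0, dtype=float).copy()
    voiced = out > 0
    # Position of each frame between the neutral baseline and topline...
    rel = (out[voiced] - src_base) / max(src_top - src_base, 1e-6)
    # ...re-expressed between the emotional baseline and topline.
    out[voiced] = tgt_base + rel * (tgt_top - tgt_base)
    return out

def scale_durations(syllable_durations, factor):
    """Uniformly stretch or compress syllabic durations by an
    emotion-dependent factor (the factor values are assumptions)."""
    return [d * factor for d in syllable_durations]
```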

19.
Emotions should play an important role in the design of interfaces because people interact with machines as if they were social actors. This paper presents a literature review on affective expression through speech, music and body language, summarizing the quality and quantity of their parameters and successful examples of synthesis. Moreover, a model for the convincingness of affective expressions, based on Fogg and Hsiang Tseng (1999), was developed and tested. The empirical data did not support the original model, and therefore this paper proposes a new model based on the appropriateness and intensity of the expressions. Furthermore, the experiment investigated whether the type of emotion (happiness, sadness, anger, surprise, fear and disgust), knowledge about the source (human or machine), the level of abstraction (natural face, computer-rendered face and matrix face) and the medium of presentation (visual, audio/visual, audio) of an affective expression influence its convincingness and distinctness. Only the type of emotion and multimedia presentation had an effect on convincingness. The distinctness of an expression depends on the abstraction and the medium through which it is presented.

20.
Emotional speech synthesis is one of the active research topics in affective computing and speech signal processing, and accurate speech emotion analysis is a prerequisite for synthesizing high-quality emotional speech. This paper adopts the PAD (pleasure-arousal-dominance) emotion model as the quantitative model for emotion analysis, performs emotion analysis and clustering on the speech in an emotional corpus, and obtains a PAD parameter model for each emotion. Emotional speech synthesized by an HMM-based speech synthesis system is then corrected with parameters from the PAD model, making the emotional parameters of the synthesized speech more accurate and thus improving the quality of emotional speech synthesis. Experiments show that the method improves the naturalness and emotional clarity of synthesized speech, and also performs well across different speakers of the same gender.
