首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 46 毫秒
1.
Traditional dialogue systems use a fixed silence threshold to detect the end of users’ turns. Such a simplistic model can result in system behaviour that is both interruptive and unresponsive, which in turn affects user experience. Various studies have observed that human interlocutors take cues from speaker behaviour, such as prosody, syntax, and gestures, to coordinate smooth exchange of speaking turns. However, little effort has been made towards implementing these models in dialogue systems and verifying how well they model the turn-taking behaviour in human–computer interactions. We present a data-driven approach to building models for online detection of suitable feedback response locations in the user's speech. We first collected human–computer interaction data using a spoken dialogue system that can perform the Map Task with users (albeit using a trick). On this data, we trained various models that use automatically extractable prosodic, contextual and lexico-syntactic features for detecting response locations. Next, we implemented a trained model in the same dialogue system and evaluated it in interactions with users. The subjective and objective measures from the user evaluation confirm that a model trained on speaker behavioural cues offers both smoother turn-transitions and more responsive system behaviour.  相似文献   

2.
In this paper we will be concerned with the role played by prosody in language learning and by the speech technology already available as commercial product or as prototype, capable to cope with the task of helping language learner in improving their knowledge of a second language from the prosodic point of view. The paper has been divided into two separate sections: Section One, dealing with Rhythm and all related topics; Section Two dealing with Intonation. In the Introduction we will argue that the use of ASR (Automatic Speech Recognition) as Teaching Aid should be under-utilized and should be targeted to narrowly focussed spoken exercises, disallowing open-ended dialogues, in order to ensure consistency of evaluation. Eventually, we will support the conjoined use of ASR technology and prosodic tools to produce GOP useable for linguistically consistent and adequate feedback to the student. This will be illustrated by presenting State of the Art for both sections, with systems well documented in the scientific literature of the respective field.  相似文献   

3.
为预测英语文语转换(Text-to-Speech,TIS)系统中韵律生成模块的韵律边界,通过在中间短语、语调短语和语句后分别插入不同长度的停顿,产生使合成语音具有与真人语音类似的韵律结构.通过采用基于语块的中间短语切分,以中间短语为基本单位,生成一个语调短语边界预测的学习语料库,然后采用转换式学习法进行标注学习,从而实现韵律边界的切分.在对真人语料库进行测试的实验中,标注正确率达到81.32%,通过在学习中增加语调短语音节数和标点符号的约束规则,可进一步提高标注正确率.  相似文献   

4.
Past research has demonstrated that there are cognitive processing costs associated with comprehension of speech generated by text-to-speech synthesizers, relative to comprehension of natural speech. This finding has important performance implications for the many applications that use such systems. The purpose of this study was to ascertain whether certain characteristics of synthetic speech slow on-line, real-time cognitive processing. Whereas past research has focused on the phonemic acoustic structure of synthetic speech, we manipulated prosodic, syntactic, and semantic cues in a task requiring participants to recall sentences spoken either by a human or by one of two speech synthesizers. The findings were interpreted to suggest that inappropriate prosodic modeling in synthetic speech was the major source of a performance differential between natural and synthetic speech. Prosodic cues, along with others, guide the parsing of speech and provide redundancy. When these cues are absent or inaccurate, the additional burden placed on working memory may exceed its capacity, particularly in time-limited, demanding tasks. Actual or potential applications of this research include improvement of text-to-speech output systems in warning systems, feedback devices in aerospace vehicles, educational and training modules, aids for the handicapped, consumer products, and technologies designed to increase the functional independence of older adults.  相似文献   

5.
With the advent of prosody annotation standards such as tones and break indices (ToBI), speech technologists and linguists alike have been interested in automatically detecting prosodic events in speech. This is because the prosodic tier provides an additional layer of information over the short-term segment-level features and lexical representation of an utterance. As the prosody of an utterance is closely tied to its syntactic and semantic content in addition to its lexical content, knowledge of the prosodic events within and across utterances can assist spoken language applications such as automatic speech recognition and translation. On the other hand, corpora annotated with prosodic events are useful for building natural-sounding speech synthesizers. In this paper, we build an automatic detector and classifier for prosodic events in American English, based on their acoustic, lexical, and syntactic correlates. Following previous work in this area, we focus on accent (prominence, or ldquostressrdquo) and prosodic phrase boundary detection at the syllable level. Our experiments achieved a performance rate of 86.75% agreement on the accent detection task, and 91.61% agreement on the phrase boundary detection task on the Boston University Radio News Corpus.  相似文献   

6.
This article describes the design, implementation and evaluation of an educational video game that helps individuals with Down syndrome to improve their speech skills, specifically those related to prosody. Special attention has been paid to the design of the user interface, taking into account the cognitive, learning, and attentional limitations of people with Down syndrome. The learning content is conveyed by activities of production and perception of prosodic phenomena, aimed at increasing their communicative competence. These activities are introduced within the narrative of a video game so that the players do not conceive the tool as a mere succession of learning activities, but so that they learn and improve their speech while playing. The evaluation strategy that has been followed involves real users and combines different evaluation activities. Results show a high level of acceptance by participants and also by professionals, speech therapists, and special education teachers.  相似文献   

7.
8.
Higher quality synthesized speech is required for widespread use of text-to-speech (TTS) technology, and the prosodic pattern is the key feature that makes synthetic speech sound unnatural and monotonous, which mainly describes the variation of pitch. The rules used in most Chinese TTS systems are constructed by experts, with weak quality control and low precision. In this paper, we propose a combination of clustering and machine learning techniques to extract prosodic patterns from actual large mandarin speech databases to improve the naturalness and intelligibility of synthesized speech. Typical prosody models are found by clustering analysis. Some machine learning techniques, including Rough Set, Artificial Neural Network (ANN) and Decision tree, are trained for fundamental frequency and energy contours, which can be directly used in a pitch-synchronous-overlap-add-based (PSOLA-based) TTS system. The experimental results showed that synthesized prosodic features greatly resembled their original counterparts for most syllables.  相似文献   

9.
韵律结构的自动预测是高自然度文语转换(TTS)系统的关键组成部分,直接影响到合成语音的自然度和表现力。该文建立了一个同时具有语法信息与韵律结构标注的汉语语料库。在这一语料库的基础上,对汉语的韵律结构组成、韵律结构与语法语义之间的关系进行了分析,并进行了预测试验。研究发现,汉语的韵律结构虽与语法结构不同,但是有着密切的联系,韵律结构可以通过语法结构进行预测。韵律结构除与语法结构有关之外,还要受到语句语义的制约。  相似文献   

10.
Since intelligibility of synthetic speech is no longer the main criterion on which to base quality judgements, reliable methods for prosody evaluation become more important. We propose a method called PURR (Prosody Unveiling through Restricted Representation) to evaluate the prosodic component of a synthesis system without the interference of other system components. In PURR, the stimuli are reduced to their prosodic content. The method has proven to be suitable for test designs with naıve listeners. It can be used for comparative studies as well as for diagnostic analyses and is, therefore, a useful tool for basic research on the perception of prosodic phenomena. In this paper we first describe how the best signal manipulation method was determined using perception tests. The appropriateness of the resulting signal is further assessed in a recognition test of syntactic structure. We then report on further validations of the proposed method: different ways of synthetic prosody modelling are evaluated, both in comparison with human prosody and amongst different synthesis systems.  相似文献   

11.
Automatic speech recognition (ASR) systems rely almost exclusively on short-term segment-level features (MFCCs), while ignoring higher level suprasegmental cues that are characteristic of human speech. However, recent experiments have shown that categorical representations of prosody, such as those based on the Tones and Break Indices (ToBI) annotation standard, can be used to enhance speech recognizers. However, categorical prosody models are severely limited in scope and coverage due to the lack of large corpora annotated with the relevant prosodic symbols (such as pitch accent, word prominence, and boundary tone labels). In this paper, we first present an architecture for augmenting a standard ASR with symbolic prosody. We then discuss two novel, unsupervised adaptation techniques for improving, respectively, the quality of the linguistic and acoustic components of our categorical prosody models. Finally, we implement the augmented ASR by enriching ASR lattices with the adapted categorical prosody models. Our experiments show that the proposed unsupervised adaptation techniques significantly improve the quality of the prosody models; the adapted prosodic language and acoustic models reduce binary pitch accent (presence versus absence) classification error rate by 13.8% and 4.3%, respectively (relative to the seed models) on the Boston University Radio News Corpus, while the prosody-enriched ASR exhibits a 3.1% relative reduction in word error rate (WER) over the baseline system.  相似文献   

12.
韵律特征的转换是语音转换系统中的重要组成部分,对转换后的语音的自然度起着至关重要的作用.介绍韵律特征结构,指出了韵律特征建模与控制转换研究的主要进展,讨论现有常用转换方法的过程机理和优缺点.在此基础上,对韵律转换的研究情况进行了归纳分析和展望.  相似文献   

13.
语音是人类与智能手机或智能家电等现代智能设备进行通信的一种常用而有效的方式。随着计算机和网络技术的显著进步,语音识别系统得到了广泛的应用,它可以将用户发出的语音指令解释为智能设备上可以理解的数字指令或信号,实现用户与这些设备的远程交互功能。近年来,深度学习技术的进步推动了语音识别系统发展,使得语音识别系统的精度和可用性不断提高。然而深度学习技术自身还存在未解决的安全性问题,例如对抗样本。对抗样本是指在模型的预测阶段,通过对预测样本添加细微的扰动,使模型以高置信度给出一个错误的目标类别输出。目前对于对抗样本的攻击及防御研究主要集中在计算机视觉领域而忽略了语音识别系统模型的安全问题,当今最先进的语音识别系统由于采用深度学习技术也面临着对抗样本攻击带来的巨大安全威胁。针对语音识别系统模型同样面临对抗样本的风险,本文对语音识别系统的对抗样本攻击和防御提供了一个系统的综述。我们概述了不同类型语音对抗样本攻击的基本原理并对目前最先进的语音对抗样本生成方法进行了全面的比较和讨论。同时,为了构建更安全的语音识别系统,我们讨论了现有语音对抗样本的防御策略并展望了该领域未来的研究方向。  相似文献   

14.
A degradation in the performance of automatic speech recognition systems (ASR) is observed in mismatched training and testing conditions. One of the reasons for this degradation is due to the presence of emotions in the speech. The main objective of this work is to improve the performance of ASR in the presence of emotional conditions using prosody modification. The influence of different emotions on the prosody parameters is exploited in this work. Emotion conversion methods are employed to generate the word level non-uniform prosody modified speech. Modification factors for prosodic components such as pitch, duration and energy are used. The prosody modification is done in two ways. Firstly, emotion conversion is done at the testing stage to generate the neutral speech from the emotional speech. Secondly, the ASR is trained with the generated emotional speech from the neutral speech. In this work, the presence of emotions in speech is studied for the Telugu ASR systems. A new database of IIIT-H Telugu speech corpus is collected to build the large vocabulary neutral Telugu speech ASR system. The emotional speech samples from IITKGP-SESC Telugu corpus are used for testing it. The emotions of anger, happiness and compassion are considered during the evaluation. An improvement in the performance of ASR systems is observed in the prosody modified speech.  相似文献   

15.
本文对自然言语的韵律组织中的不确定性及其对合成语音自然度的影响进行了初步探讨,并在此基础上,提出在韵律预测中用最小错误概率准则代替传统的最大生成概率准则,从而在预测结果中保留多种等价的韵律实现。本文还进一步提出一种将基于最小错误准则的韵律预测与单元选择结合的算法,首先根据最小错误准则在所有候选单元中筛选出最不可能造成韵律错误的样本,然后再依据最平滑拼接准则从各种韵律等价的路径中选出一条能达到最平滑拼接的作为最后输出。  相似文献   

16.
Building a text corpus suitable to be used in corpus-based speech synthesis is a time-consuming process that usually requires some human intervention to select the desired phonetic content and the necessary variety of prosodic contexts. If an emotional text-to-speech (TTS) system is desired, the complexity of the corpus generation process increases. This paper presents a study aiming to validate or reject the use of a semantically neutral text corpus for the recording of both neutral and emotional (acted) speech. The use of this kind of texts would eliminate the need to include semantically emotional texts into the corpus. The study has been performed for Basque language. It has been made by performing subjective and objective comparisons between the prosodic characteristics of recorded emotional speech using both semantically neutral and emotional texts. At the same time, the performed experiments allow for an evaluation of the capability of prosody to carry emotional information in Basque language. Prosody manipulation is the most common processing tool used in concatenative TTS. Experiments of automatic recognition of the emotions considered in this paper (the "Big Six emotions") show that prosody is an important emotional indicator, but cannot be the only manipulated parameter in an emotional TTS system-at least not for all the emotions. Resynthesis experiments transferring prosody from emotional to neutral speech have also been performed. They corroborate the results and support the use of a neutral-semantic-content text in databases for emotional speech synthesis.  相似文献   

17.
The presence of disfluencies in spontaneous speech, while poses a challenge for robust automatic recognition, also offers means for gaining additional insights into understanding a speaker's communicative and cognitive state. This paper analyzes disfluencies in children's spontaneous speech, in the context of spoken dialog based computer game play, and addresses the automatic detection of disfluency boundaries. Although several approaches have been proposed to detect disfluencies in speech, relatively little work has been done to utilize visual information to improve the performance and robustness of the disfluency detection system. This paper describes the use of visual information along with prosodic and language information to detect the presence of disfluencies in a child's computer-directed speech and shows how these information sources can be integrated to increase the overall information available for disfluency detection. The experimental results on our children's multimodal dialog corpus indicate that disfluency detection accuracy of over 80% can be obtained by utilizing audio-visual information. Specifically, results showed that the addition of visual information to prosody and language features yield relative improvements in disfluency detection error rates of 3.6% and 6.3%, respectively, for information fusion at the feature level and decision level.  相似文献   

18.
In this paper we demonstrate the use of prosody models for developing speech systems in Indian languages. Duration and intonation models developed using feedforward neural networks are considered as prosody models. Labelled broadcast news data in the languages Hindi, Telugu, Tamil and Kannada is used for developing the neural network models for predicting the duration and intonation. The features representing the positional, contextual and phonological constraints are used for developing the prosody models. In this paper, the use of prosody models is illustrated using speech recognition, speech synthesis, speaker recognition and language identification applications. Autoassociative neural networks and support vector machines are used as classification models for developing the speech systems. The performance of the speech systems has shown to be improved by combining the prosodic features along with one popular spectral feature set consisting of Weighted Linear Prediction Cepstral Coefficients (WLPCCs).  相似文献   

19.
This paper describes a new Korean Text-to-Speech (TTS) system based on a large speech corpus. Conventional concatenative TTS systems still produce machine-like synthetic speech. The poor naturalness is caused by excessive prosodic modification using a small speech database. To cope with this problem, we utilized a dynamic unit selection method based on a large speech database without prosodic modification. The proposed TTS system adopts triphones as synthesis units. We designed a new sentence set maximizing phonetic or prosodic coverage of Korean triphones. All the utterances were segmented automatically into phonemes using a speech recognizer. With the segmented phonemes, we achieved a synthesis unit cost of zero if two synthesis units were placed consecutively in an utterance. This reduces the number of concatenating points that may occur due to concatenating mismatches. In this paper, we present data concerning the realization of major prosodic variations through a consideration of prosodic phrase break strength. The phrase break was divided into four kinds of strength based on pause length. Using phrase break strength, triphones were further classified to reflect major prosodic variations. To predict phrase break strength on texts, we adopted an HMM-like Part-of-Speech (POS) sequence model. The performance of the model showed 73.5% accuracy for 4-level break strength prediction. For unit selection, a Viterbi beam search was performed to find the most appropriate triphone sequence, which has the minimum continuation cost of prosody and spectrum at concatenating boundaries. From the informal listening test, we found that the proposed Korean corpus-based TTS system showed better naturalness than the conventional demisyllable-based one.  相似文献   

20.
提出一种基于时域基音同步叠加TD-PSOLA算法的情感语音合成系统。根据情感语音库分析总结情感规则,在此基础上利用TD-PSOLA算法对中性语音的韵律参数进行改变,并提出一种能够对基频曲线尾部形状改变的方法,使句子表达出丰富的情感。实验表明,合成出的语音具有明显的情感色彩,证明了该系统能以简单明了的方式实现情感语音的合成,有助于提高人脸语音动画表达的丰富性和生动性。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号