Similar Articles (20 results)
1.
This work focuses on the development of expressive text-to-speech synthesis techniques for a Chinese spoken dialog system, where the expressivity is driven by the message content. We adapt the three-dimensional pleasure-displeasure, arousal-nonarousal and dominance-submissiveness (PAD) model for describing expressivity in input text semantics. The context of our study is based on response messages generated by a spoken dialog system in the tourist information domain. We use the $P$ (pleasure) and $A$ (arousal) dimensions to describe expressivity at the prosodic word level based on lexical semantics. The $D$ (dominance) dimension is used to describe expressivity at the utterance level based on dialog acts. We analyze contrastive (neutral versus expressive) speech recordings to develop a nonlinear perturbation model that incorporates the PAD values of a response message to transform neutral speech into expressive speech. Two levels of perturbation are implemented: local perturbation at the prosodic word level and global perturbation at the utterance level. Perceptual experiments involving 14 subjects indicate that the proposed approach can significantly enhance expressivity in response generation for a spoken dialog system.
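The two-level perturbation scheme can be sketched as bounded nonlinear scalings of neutral prosody by the PAD values (a minimal one-dimensional illustration; the scaling constants and the tanh form are assumptions, not the paper's fitted model):

```python
import math

def perturb_prosody(f0_hz, dur_s, p, a, d):
    """Perturb a prosodic word's neutral F0 and duration.

    p, a in [-1, 1] drive the word-level (local) perturbation;
    d in [-1, 1] drives a global utterance-level scaling.
    tanh keeps each perturbation bounded (a nonlinear mapping, as in
    the paper's perturbation model, but with made-up constants).
    """
    local_f0 = f0_hz * (1.0 + 0.2 * math.tanh(p + a))   # pleasure/arousal raise pitch
    local_dur = dur_s * (1.0 - 0.1 * math.tanh(a))      # higher arousal -> faster
    global_gain = 1.0 + 0.1 * math.tanh(d)              # dominance scales the utterance
    return local_f0 * global_gain, local_dur

# A positively valenced, aroused, mildly dominant response message:
f0, dur = perturb_prosody(200.0, 0.30, p=0.8, a=0.5, d=0.3)
```

With positive P and A the word's pitch rises and its duration shortens relative to the neutral rendering, while the positive D applies a small utterance-wide gain on top.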

2.
This paper proposes a technique for generating an optimal mother wavelet from the LPC trajectory, with special reference to speech recognition, together with a new wavelet-based model for speech signal processing. Lower-order linear predictor coefficients (LPC) are related to the vocal tract area near the lips, the articulating organ, so the trajectory of the second LPC is proposed for generating the mother wavelet. The observation interval is the pitch period, which represents one complete cycle of the speech waveform, and LPC of order 10 are evaluated for each pitch-synchronous (PS) segment. A mother wavelet is generated separately for each word utterance; this creates a multidimensional space for speech words and increases recognition accuracy. The wavelet transform (WT) coefficients are evaluated with respect to the generated mother wavelet for each word utterance and stored as a template along with that mother wavelet. The database consists of 30 word utterances recorded locally with a sound recorder. In recognition mode, the external word utterance is scanned and divided into PS segments, the trajectory of the second LPC is tracked, and WT coefficients are evaluated with respect to the mother wavelet of each word in the vocabulary and compared with each word's template. The results indicate 100% recognition accuracy.
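The per-segment LPC analysis behind the second-coefficient trajectory can be sketched with a plain autocorrelation-method Levinson-Durbin recursion (a self-contained sketch; the fixed-length segments and the synthetic signal stand in for true pitch-synchronous segmentation of recorded speech):

```python
import math
import random

def autocorr(x, lag):
    return sum(x[i] * x[i - lag] for i in range(lag, len(x)))

def lpc(x, order):
    """Levinson-Durbin recursion on the autocorrelation sequence.
    Returns coefficients a1..a_order of x[n] ~ sum_j a_j * x[n-j]."""
    r = [autocorr(x, k) for k in range(order + 1)]
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                      # reflection coefficient
        a_new = a[:]
        a_new[i] = k
        for j in range(1, i):
            a_new[j] = a[j] - k * a[i - j]
        a = a_new
        err *= (1.0 - k * k)
    return a[1:]

# Synthetic "voiced" signal; equal-length chunks stand in for PS segments.
random.seed(0)
signal = [math.sin(2 * math.pi * 0.05 * n) + 0.01 * random.gauss(0, 1)
          for n in range(400)]
segments = [signal[i:i + 80] for i in range(0, 400, 80)]
a2_trajectory = [lpc(seg, 10)[1] for seg in segments]  # second LPC per segment
```

The list `a2_trajectory` is the kind of per-segment trajectory from which the paper derives its word-specific mother wavelet.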

3.
4.
Dialog sentiment analysis aims to identify the sentiment of each utterance in a conversation and plays a key role in analyzing e-commerce customer-service data. Unlike sentiment analysis of isolated sentences, the sentiment of an utterance in a dialog depends on its context within the conversation. Existing methods mainly use recurrent neural networks and attention mechanisms to model the relations between utterances, but ignore the characteristics of the dialog as a whole. Within a multi-task learning framework, this paper proposes a novel method that jointly infers the topic distribution of a dialog and the sentiment of each utterance. The dialog's topic distribution, as global information, is embedded into the representation of every word and utterance, so that each word and utterance acquires its meaning under the specific dialog topic. Experimental results on e-commerce customer-service dialogs show that the proposed model makes full use of dialog topic information: compared with baseline models that ignore topic information, it clearly improves Macro-F1.

5.
Spoken language understanding (SLU) aims at extracting meaning from natural language speech. Over the past decade, a variety of practical goal-oriented spoken dialog systems have been built for limited domains. SLU in these systems ranges from understanding predetermined phrases through fixed grammars, extracting some predefined named entities, and extracting users' intents for call classification, to combinations of users' intents and named entities. In this paper, we present the SLU system of VoiceTone® (a service provided by AT&T where AT&T develops, deploys and hosts spoken dialog applications for enterprise customers). The SLU system extracts both intents and named entities from the users' utterances. For intent determination, we use statistical classifiers trained from labeled data, and for named entity extraction we use rule-based fixed grammars. The focus of our work is to exploit data and use machine learning techniques to create scalable SLU systems which can be quickly deployed for new domains with minimal human intervention. These objectives are achieved by 1) using the predicate-argument representation of the semantic content of an utterance; 2) extending statistical classifiers to seamlessly integrate hand-crafted classification rules with the rules learned from data; and 3) developing an active learning framework to minimize the human labeling effort for quickly building the classifier models and adapting them to changes. We present an evaluation of this system using two deployed applications of VoiceTone®.

6.
Automatic detection of a user's interest in spoken dialog plays an important role in many applications, such as tutoring systems and customer service systems. In this study, we propose a decision-level fusion approach using acoustic and lexical information to accurately sense a user's interest at the utterance level. Our system consists of three parts: acoustic/prosodic model, lexical model, and a model that combines their decisions for the final output. We use two different regression algorithms to complement each other for the acoustic model. For lexical information, in addition to the bag-of-words model, we propose new features including a level-of-interest value for each word, length information using the number of words, estimated speaking rate, silence in the utterance, and similarity with other utterances. We also investigate the effectiveness of using more automatic speech recognition (ASR) hypotheses (n-best lists) to extract lexical features. The outputs from the acoustic and lexical models are combined at the decision level. Our experiments show that combining acoustic evidence with lexical information improves level-of-interest detection performance, even when lexical features are extracted from ASR output with high word error rate.
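The decision-level combination can be sketched as a weighted average of per-model level-of-interest scores (the fixed weight and the simple averaging of the two complementary acoustic regressors are illustrative assumptions; the paper learns the combination from data):

```python
def fuse(acoustic_scores, lexical_score, w_acoustic=0.6):
    """Late (decision-level) fusion of level-of-interest scores in [0, 1].

    The two acoustic regressors complement each other, so their outputs
    are averaged first; the acoustic/lexical weight is a stand-in value.
    """
    acoustic = sum(acoustic_scores) / len(acoustic_scores)
    return w_acoustic * acoustic + (1.0 - w_acoustic) * lexical_score

# Two acoustic regressor outputs and one lexical-model output:
fused = fuse([0.7, 0.9], 0.5)
```

Because the fusion happens on the models' decisions rather than their features, a noisy lexical score (e.g. from high-WER ASR output) degrades the fused estimate only in proportion to its weight.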

7.
Estimating reliable class-conditional probability is the prerequisite to implementing Bayesian classifiers, and how to estimate the probability density functions (PDFs) is also a fundamental problem for other probabilistic induction algorithms. The finite mixture model (FMM) is able to represent arbitrarily complex PDFs by using a mixture of multimodal distributions, but it assumes that the component mixtures follow a given distribution, which may not hold for real-world data. This paper presents a non-parametric kernel mixture model (KMM) based probability density estimation approach, in which the data sample of a class is assumed to be drawn from several unknown independent hidden subclasses. Unlike traditional FMM schemes, we simply use the k-means clustering algorithm to partition the data sample into several independent components, and the regional density diversities of the components are combined using Bayes' theorem. On the basis of the proposed kernel mixture model, we present a three-step Bayesian classifier comprising partitioning, structure learning, and PDF estimation. Experimental results show that KMM improves the quality of the PDFs estimated by the conventional kernel density estimation (KDE) method, and that KMM-based Bayesian classifiers outperform existing Gaussian, GMM, and KDE-based Bayesian classifiers.
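The KMM idea, partition a class's sample with k-means, fit a kernel density per component, and recombine the component densities weighted by their priors, can be sketched in one dimension (per-component bandwidths via Silverman's rule are an assumption; the paper's full three-step classifier is not reproduced):

```python
import math
import random
import statistics

def kmeans_1d(xs, k, iters=20):
    """Plain 1-D k-means; returns the k partitions of xs."""
    centers = random.sample(xs, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda c: abs(x - centers[c]))].append(x)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return groups

def kde(x, sample, h):
    """Gaussian kernel density estimate at x."""
    return sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in sample) / (
        len(sample) * h * math.sqrt(2 * math.pi))

def kmm_density(x, xs, k=2):
    """KMM-style density: per-component KDEs combined by component priors."""
    groups = [g for g in kmeans_1d(xs, k) if g]
    return sum(
        (len(g) / len(xs)) *
        kde(x, g, max(1.06 * statistics.pstdev(g) * len(g) ** -0.2, 1e-3))
        for g in groups)

random.seed(1)
data = ([random.gauss(0, 1) for _ in range(100)] +
        [random.gauss(5, 1) for _ in range(100)])
```

Each hidden subclass (here, the two k-means components) gets its own locally adapted bandwidth, which is what lets the combined estimate track regional density diversity better than a single global KDE.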

8.
In this paper, we propose a new method for estimating camera motion parameters based on optical flow models. Camera motion parameters are generated using linear combinations of optical flow models. The proposed method first creates these optical flow models; linear decompositions are then performed on the input optical flows calculated from adjacent images in the video sequence to estimate the coefficient of each optical flow model. These coefficients are applied to the parameters used to create each optical flow model, and the camera motion parameters implied by the adjacent images are estimated through a linear composition of the weighted parameters. We demonstrate that the proposed method estimates the camera motion parameters accurately, at low computational cost, and robustly in the presence of noise in the video sequence being analyzed.
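The linear decomposition step reduces to least squares over the model flows; with two basis flows the normal equations can be solved in closed form (the pan and zoom toy models below are illustrative, not the paper's actual model set):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def decompose(flow, basis):
    """Least-squares coefficients (c0, c1) such that
    flow ~= c0 * basis[0] + c1 * basis[1], via the 2x2 normal equations."""
    b0, b1 = basis
    g00, g01, g11 = dot(b0, b0), dot(b0, b1), dot(b1, b1)
    r0, r1 = dot(flow, b0), dot(flow, b1)
    det = g00 * g11 - g01 * g01
    return ((r0 * g11 - r1 * g01) / det, (r1 * g00 - r0 * g01) / det)

# Toy 2x2 pixel grid, (u, v) per pixel flattened: a pan model and a zoom model.
pan = [1, 0, 1, 0, 1, 0, 1, 0]        # uniform rightward motion
zoom = [-1, -1, 1, -1, -1, 1, 1, 1]   # vectors pointing outward from center
observed = [0.5 * p + 0.25 * z for p, z in zip(pan, zoom)]
c_pan, c_zoom = decompose(observed, (pan, zoom))
```

Here the observed flow is an exact combination of the two models, so the recovered coefficients are exactly 0.5 and 0.25; with a real, noisy flow field the same formula gives the least-squares best fit, and the coefficients then weight each model's camera parameters in the final composition.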

9.
Social media contains a large amount of conversational text, and in these conversations the speakers' emotions and intents are usually correlated. Moreover, the overall structure of a dialog also affects its emotions and intents, so the two should be learned jointly. This paper therefore proposes a dialog-structure-based joint learning model of emotion and intent. It exploits the correlation between the latent emotions and intents within a dialog, and uses the relation between the dialog's internal structure and the speakers' emotions and intents, to improve per-utterance emotion and intent classification in multi-turn dialogs. An attention mechanism additionally draws on the preceding and following turns to account for the influence of context on dialog emotion. Experiments show that the joint learning model effectively improves utterance-level emotion and intent classification.

10.
We present a technique for accurate automatic visible speech synthesis from textual input. When provided with a speech waveform and the text of a spoken sentence, the system produces accurate visible speech synchronized with the audio signal. To develop the system, we collected motion capture data from a speaker's face during production of a set of words containing all diviseme sequences in English. The motion capture points from the speaker's face are retargeted to the vertices of the polygons of a 3D face model. When synthesizing a new utterance, the system locates the required sequence of divisemes, shrinks or expands each diviseme based on the desired phoneme segment durations in the target utterance, then moves the polygons in the regions of the lips and lower face to correspond to the spatial coordinates of the motion capture data. The motion mapping is realized by a key-shape mapping function learned from a set of viseme examples in the source and target faces. A well-posed numerical algorithm estimates the shape blending coefficients. Time warping and motion vector blending at the juncture of two divisemes and the algorithm to search the optimal concatenated visible speech are also developed to provide the final concatenative motion sequence. Copyright © 2004 John Wiley & Sons, Ltd.

11.
Computerized human motion simulation allows generation of dynamic human motions on computers. Biomechanical stresses can be estimated using the motions generated on a computer without actually collecting joint coordinate data. A two-dimensional whole-body lifting simulation model is presented in this paper. The model assumes that humans perform lifting activities based on minimization of physical work, subject to various constraints. The simulation method contains three major computation units: trajectory formation unit, dynamics of motion unit, and nonlinear optimization unit. The trajectory formation unit generates smooth polynomials representing motion characteristics of human lifting. Kinematics and kinetics are calculated in the dynamics unit. Objective and constraint functions are evaluated in the optimization unit. Optimal motions are generated by minimizing the objective function, subject to the constraints. Computation methods of the three units and simulation results are presented.
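The trajectory formation unit's smooth polynomials can be illustrated with the standard rest-to-rest quintic (minimum-jerk) profile, which has zero velocity and acceleration at both endpoints (the lift heights and duration below are made-up values, not from the paper):

```python
def min_jerk(x0, xf, total_time, t):
    """Rest-to-rest quintic polynomial trajectory from x0 to xf.

    The normalized profile 10*tau^3 - 15*tau^4 + 6*tau^5 has zero
    velocity and acceleration at tau = 0 and tau = 1, a standard
    smooth-trajectory choice for simulated lifting motion.
    """
    tau = t / total_time
    s = 10 * tau**3 - 15 * tau**4 + 6 * tau**5
    return x0 + (xf - x0) * s

# Hand height during a lift from 0.5 m to 1.5 m over 2 s, sampled at 10 Hz:
traj = [min_jerk(0.5, 1.5, 2.0, t / 10) for t in range(21)]
```

Kinematics and kinetics evaluated on such a trajectory feed the dynamics unit, and the optimizer then adjusts the polynomial parameters subject to the constraints.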

12.
Automatic emotion recognition from speech signals is one of the important research areas, which adds value to machine intelligence. Pitch, duration, energy and Mel-frequency cepstral coefficients (MFCC) are the widely used features in the field of speech emotion recognition. A single classifier or a combination of classifiers is used to recognize emotions from the input features. The present work investigates the performance of features based on autoregressive (AR) parameters, which include gain and reflection coefficients, in addition to the traditional linear prediction coefficients (LPC), for recognizing emotions from speech signals. The classification performance of the AR-parameter features is studied using discriminant, k-nearest neighbor (KNN), Gaussian mixture model (GMM), back-propagation artificial neural network (ANN) and support vector machine (SVM) classifiers, and we find that the reflection-coefficient features recognize emotions better than the LPC. To improve the emotion recognition accuracy, we propose a class-specific multiple classifiers scheme, composed of multiple parallel classifiers, each optimized for one class. The classifier for each emotional class is built from a feature identified from a pool of features and a classifier identified from a pool of classifiers that together optimize recognition of that particular emotion. The outputs of the classifiers are combined by a decision-level fusion technique. The experimental results show that the proposed scheme improves the emotion recognition accuracy, with further improvement obtained when MFCC features are included in the pool of features.
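The class-specific scheme, one classifier optimized per emotion with decision-level fusion on top, can be sketched with stand-in scoring functions (the feature-to-emotion mappings below are invented for illustration, not learned models):

```python
def classify(features, per_class_models):
    """Each emotion class has its own (feature, classifier) pair tuned for it;
    decision-level fusion picks the class whose dedicated model is most
    confident on the input."""
    scores = {label: model(features) for label, model in per_class_models.items()}
    return max(scores, key=scores.get), scores

# Stand-in per-class scorers over two toy features, each in [0, 1]:
models = {
    "anger": lambda f: f["energy"],                    # keyed on high energy
    "sadness": lambda f: 1.0 - f["pitch"],             # keyed on low pitch
    "joy": lambda f: 0.5 * (f["energy"] + f["pitch"]),
}
label, scores = classify({"energy": 0.9, "pitch": 0.8}, models)
```

In the paper each per-class model would be the feature/classifier pair (e.g. reflection coefficients with an SVM) that best separates that emotion from the rest; the fusion rule above simply takes the most confident dedicated model.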

13.
Dynamic external and internal articulator motions are integrated into a low-cost, data-driven three-dimensional talking head in this paper. External and internal articulations are defined and calibrated from video streams and videofluoroscopy to a generic 3D talking-head model. Three deformation modes, reflecting the pronunciation characteristics of the muscular soft tissue of the lips and tongue, the up-down movements of the chin, and the relatively fixed articulators, are set up and integrated. Shape-blending functions among the segmented phonemes of natural speech input are synthesized in an utterance. Animations of confusable phonemes and minimal pairs were shown to English teachers and learners in a perception test. The results show that the proposed method reflects real phonetic pronunciation realistically.

14.
Objective: To address the difficulty of acquiring quadruped motion data, this paper establishes a fast and easy-to-use pipeline for reconstructing and authoring quadruped motion, and proposes a real-time low-dimensional motion generation method for quadrupeds. Method: First, a low-dimensional physics solver based on particles, rigid bodies, and springs is built, and the quadruped skeleton is abstracted into a low-dimensional physical model. Second, footprint constraints are established according to the gait pattern, and the motion of the whole-body physical particles is solved limb by limb from the feet upward. Finally, from the particle positions corrected by general constraints, the full-body animation skeleton joints are computed to generate the target motion. Results: Experiments on quadrupeds with different gaits, body shapes, and styles show that the method reaches a generation speed of 330 frames/s with good visual quality and generality. Conclusion: The method's input data is easy to learn and acquire, its computation is real-time and stable, and it can quickly generate visually plausible multi-style motion data.
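The particle/rigid-body/spring solver at the core of the method can be illustrated with a single damped spring integrated by semi-implicit Euler (a one-dimensional stand-in with made-up constants, not the full quadruped model):

```python
def step(pos, vel, anchor, k=50.0, rest=1.0, damp=2.0, mass=1.0, dt=0.01):
    """One semi-implicit Euler step for a point mass on a damped spring
    anchored at `anchor`: update velocity from the spring and damping
    forces first, then update position with the new velocity."""
    stretch = (pos - anchor) - rest
    force = -k * stretch - damp * vel
    vel += force / mass * dt
    pos += vel * dt
    return pos, vel

# A particle released past the spring's rest length settles at pos = 1.0:
pos, vel = 2.0, 0.0
for _ in range(2000):
    pos, vel = step(pos, vel, anchor=0.0)
```

Semi-implicit Euler is a common choice for such low-dimensional mass-spring models because it stays stable at the large fixed time steps needed for real-time generation; footprint and other constraints would then correct the particle positions before the animation skeleton is recovered.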

15.
We present a novel approach to synthesizing accurate visible speech based on searching and concatenating optimal variable-length units in a large corpus of motion capture data. Based on a set of visual prototypes selected on a source face and a corresponding set designated for a target face, we propose a machine learning technique to automatically map the facial motions observed on the source face to the target face. To model the long-distance coarticulation effects in visible speech, a large-scale corpus that covers the most common syllables in English was collected, annotated and analyzed. For any input text, a search algorithm locates the optimal sequences of concatenated units for synthesis. A new algorithm to adapt lip motions from a generic 3D face model to a specific 3D face model is also proposed. A complete, end-to-end visible speech animation system is implemented based on the approach. This system is currently used in more than 60 kindergarten through third-grade classrooms to teach students to read using a lifelike conversational animated agent. Both subjective and objective evaluations of the visible speech produced by the animation system show that the proposed approach is accurate and powerful for visible speech synthesis.

16.
We describe a novel single-ended algorithm constructed from models of speech signals, including clean and degraded speech, and speech corrupted by multiplicative noise and temporal discontinuities. Machine learning methods are used to design the models, including Gaussian mixture models, support vector machines, and random forest classifiers. Estimates of the subjective mean opinion score (MOS) generated by the models are combined using hard or soft decisions produced by a classifier that has learned to match the input signal with the models. Test results show the algorithm outperforming ITU-T P.563, the current state-of-the-art standard single-ended algorithm. Employed in a distributed double-ended measurement configuration, the proposed algorithm is more effective than P.563 in assessing the quality of noise reduction systems and provides functionality not available with P.862 PESQ, the current double-ended standard algorithm.

17.
We propose an on-line algorithm to segment foreground from background in videos captured by a moving camera. In our algorithm, temporal model propagation and spatial model composition are combined to generate foreground and background models, and likelihood maps are computed based on the models. An energy minimization technique is then applied to the likelihood maps for segmentation. In the temporal step, block-wise models are transferred from the previous frame using motion information, and pixel-wise foreground/background likelihoods and labels in the current frame are estimated from the models. In the spatial step, another set of block-wise foreground/background models is constructed based on the models and labels given by the temporal step, and the corresponding per-pixel likelihoods are also generated. A graph-cut algorithm performs segmentation based on the foreground/background likelihood maps, and the segmentation result is employed to update the motion of each segment in a block; the temporal model propagation and spatial model composition steps are re-evaluated based on the updated motions, making the procedure iterative. We tested our framework on various challenging videos involving large camera and object motions, significant background changes, and clutter.

18.
Objective: Videos shot at sea contain large textureless regions, and traditional video stabilization methods based on feature-point detection and tracking often perform poorly on them. We therefore propose a maritime video stabilization algorithm based on steady optical flow estimation. Method: Building on hierarchical block matching, the algorithm introduces a smoothness constraint to compute block-level optical flow, which quickly yields an approximate flow field for maritime video; an energy function based on the steady optical flow is then optimized to stabilize the video efficiently. Results: Three experiments were conducted: a comparison of optical flow estimation runtime, a comparison of stabilization runtime, and a user study. Compared with SteadyFlow, which can also handle maritime video, our flow estimation is about 10 times faster than SteadyFlow's motion estimation, and the whole stabilization pipeline is more than 70% faster. The algorithm stabilizes maritime video effectively and produces stable output. Conclusion: A maritime video stabilization algorithm based on steady optical flow estimation is proposed; compared with traditional methods, it is better suited to stabilizing maritime video.

19.
This paper proposes a novel integrated dialog simulation technique for evaluating spoken dialog systems. A data-driven user simulation technique for simulating user intention and utterance is introduced: a novel user intention modeling and generation method based on a linear-chain conditional random field, a two-phase data-driven domain-specific user utterance simulation method, and a linguistic knowledge-based ASR channel simulation method. Evaluation metrics are introduced to measure the quality of the user simulation at the intention and utterance levels. Experiments using these techniques evaluated the performance and behavior of dialog systems designed for car navigation dialogs and a building guide robot; the results show that our approach is easy to set up and behaves similarly to real human users.
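Data-driven intention simulation can be sketched, in its simplest form, as sampling from intention-transition probabilities estimated from dialog logs (a first-order Markov chain stand-in for the paper's linear-chain CRF; the intents and probabilities below are invented for a car-navigation-like domain):

```python
import random

# Transition probabilities that would, in practice, be estimated from
# annotated dialog corpora (invented values for illustration).
TRANSITIONS = {
    "greet":     [("ask_route", 0.7), ("ask_poi", 0.3)],
    "ask_route": [("confirm", 0.6), ("ask_poi", 0.2), ("bye", 0.2)],
    "ask_poi":   [("confirm", 0.5), ("ask_route", 0.3), ("bye", 0.2)],
    "confirm":   [("bye", 0.8), ("ask_route", 0.2)],
    "bye":       [],  # terminal intention
}

def simulate(start="greet", max_turns=10):
    """Sample one simulated user's intention sequence."""
    intents, state = [start], start
    while TRANSITIONS[state] and len(intents) < max_turns:
        nxt, probs = zip(*TRANSITIONS[state])
        state = random.choices(nxt, weights=probs)[0]
        intents.append(state)
    return intents

random.seed(7)
dialog = simulate()
```

A CRF conditions each intention on richer dialog features rather than on the previous intention alone, but the simulated sequences are consumed the same way: each sampled intention is passed to the utterance simulator and then through the ASR channel model.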

20.
A person's speaking style, consisting of such attributes as voice, choice of vocabulary, and the physical motions employed, not only expresses the speaker's identity but also emphasizes the content of an utterance. Speech combining these aspects of speaking style becomes more vivid and expressive to listeners. Recent research on speaking-style modeling has paid more attention to speech signal processing; this work instead focuses on text processing for idiolect extraction and generation, modeling a specific person's speaking style for text-to-speech (TTS) conversion. The first stage adopts a statistical method to automatically detect candidate idiolects from a personalized, transcribed speech corpus. Based on the categorization of the detected candidates, superfluous idiolects are extracted using a fluency measure, while the remaining candidates are regarded as nonsuperfluous idiolects. In idiolect generation, the input text is converted into a target text with a particular speaker's speaking style via insertion of superfluous idiolects or synonym substitution of nonsuperfluous idiolects. To evaluate the proposed methods, experiments were conducted on a Chinese corpus collected and transcribed from the speech files of three Taiwanese politicians. The results show that the proposed method can effectively convert a source text into a target text with a personalized speaking style.
