Similar Documents (20 results)
1.
2.
This paper presents an expressive voice conversion model (DeBi-HMM) used as a post-processing stage of a text-to-speech (TTS) system for expressive speech synthesis. DeBi-HMM takes its name from the duration-embedded structure of its two HMMs, which model the source and target speech signals, respectively. Joint estimation of the source and target HMMs is exploited for spectrum conversion from neutral to expressive speech. A gamma distribution is embedded as the duration model for each state in the source and target HMMs. Prosodic conversion is achieved with expressive-style-dependent decision trees. The STRAIGHT algorithm is adopted for analysis and synthesis. A set of small speech databases, one per expressive style, is designed and collected to train the DeBi-HMM voice conversion models. Several experiments with statistical hypothesis testing are conducted to evaluate the quality of the synthetic speech as perceived by human subjects. Compared with previous voice conversion methods, the proposed method shows encouraging potential for expressive speech synthesis.
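The duration-embedded idea can be pictured in a few lines: during synthesis, each HMM state is held for a gamma-distributed number of frames rather than following the implicit geometric duration of a standard HMM. The sketch below is purely illustrative; the state count, gamma parameters, and two-dimensional "spectral" means are invented, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state left-to-right model; two-dimensional "spectral" means.
state_means = np.array([[1.0, 0.2], [0.5, 0.8], [0.1, 1.5]])
gamma_shape = np.array([4.0, 6.0, 3.0])   # k: larger k = tighter duration spread
gamma_scale = np.array([2.0, 1.5, 2.5])   # theta: mean duration = k * theta frames

def synthesize(state_sequence=(0, 1, 2)):
    """Emit frames; each state is held for a gamma-sampled number of frames."""
    frames = []
    for s in state_sequence:
        duration = max(1, int(round(rng.gamma(gamma_shape[s], gamma_scale[s]))))
        frames.extend([state_means[s]] * duration)   # hold this state's spectrum
    return np.array(frames)

print(synthesize().shape)   # (total_frames, 2)
```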

3.
We have developed a real-time gesture recognition system whose models can be taught with a single demonstration. The system can therefore adapt to a new gesture performer quickly, but it cannot raise its recognition rates even when a gesture is taught many times, because it cannot exploit all of the teaching data. To cope with this problem, we average the teaching data. First, the best frame correspondence between the teaching data and the model is obtained with Continuous DP. Then the average and variance are calculated for each frame of the model. We show the effectiveness of our method in experiments. Takuichi Nishimura: He is a researcher at the Multi-modal Function Tsukuba Laboratory and the Information Basis Function Laboratory of the Real World Computing Partnership. He has worked on motion image understanding, multi-modal human-computer interfaces, multi-modal information retrieval, and mobile robot navigation. He completed the master's course of the University of Tokyo in 1992. Hiroaki Yabe: He is from SHARP Corporation, working as a researcher at the Multi-modal Function Tsukuba Laboratory and the Information Basis Function Tsukuba Laboratory of the Real World Computing Partnership. He has worked on motion image understanding, multi-modal human-computer interfaces, and multi-modal information retrieval. He completed the master's course of the University of Tokyo in 1995. Ryuichi Oka, Ph.D.: He is chief of the Multi-modal Function Tsukuba Laboratory and the Information Basis Function Laboratory at the Tsukuba Research Center of the Real World Computing Partnership (RWC Japan), which started in 1992. His research interests include motion image understanding, spontaneous speech understanding, self-organising information bases, multi-modal human-computer interfaces, multi-modal information retrieval, mobile robots, integration of symbol and pattern, and super-parallel computation. He received his Ph.D. degree in Engineering from the University of Tokyo.
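The averaging step can be pictured as follows: align each teaching sequence to the model frame by frame, then pool the aligned frames per model frame. The sketch below uses plain DTW as a stand-in for Continuous DP and synthetic one-dimensional data; it is illustrative only.

```python
import numpy as np

def dtw_path(model, seq):
    """Cheapest monotone alignment as a list of (model_idx, seq_idx) pairs."""
    n, m = len(model), len(seq)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(model[i - 1] - seq[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j), (i, j - 1), (i - 1, j - 1)], key=lambda p: cost[p])
    return path[::-1]

def average_model(model, teachings):
    """Per-model-frame mean and variance of all aligned teaching frames."""
    buckets = [[] for _ in model]
    for seq in teachings:
        for mi, si in dtw_path(model, seq):
            buckets[mi].append(seq[si])
    means = np.array([np.mean(b, axis=0) for b in buckets])
    variances = np.array([np.var(b, axis=0) for b in buckets])
    return means, variances

rng = np.random.default_rng(0)
model = np.sin(np.linspace(0, 3, 30))[:, None]             # a 30-frame, 1-D model
teachings = [model + rng.normal(0, 0.1, model.shape) for _ in range(5)]
means, variances = average_model(model, teachings)
print(means.shape, variances.shape)                        # (30, 1) (30, 1)
```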

4.
5.
In this paper, we aim to recognize a set of dance gestures from contemporary ballet. Our input data are the motion trajectories of the joints of a dancing body, provided by a motion-capture system. Direct use of the original signals is unreliable and expensive, so we propose a tool for non-uniform sub-sampling of spatiotemporal signals. The key to our approach is the use of a deformable model to provide a compact and efficient representation of the motion trajectories. Our dance gesture recognition method involves a set of hidden Markov models (HMMs), each related to the motion trajectory of one joint. Movements are then recognized by matching the resulting gesture models against the input data via the HMMs. We validated our recognition system on 12 fundamental movements from contemporary ballet performed by four dancers. This revised version was published online in November 2004 with corrections to the section numbers. Ballet Atlantique Régine Chopinot.
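Non-uniform sub-sampling can be approximated very simply by keeping only the samples where the trajectory bends, thinning out near-linear stretches. The sketch below is a toy stand-in for the paper's deformable-model representation; the angle threshold and the test curve are arbitrary.

```python
import numpy as np

def subsample(traj, angle_thresh=0.3):
    """traj: (T, d) array of joint positions; returns indices of kept samples."""
    kept = [0]
    for t in range(1, len(traj) - 1):
        v1 = traj[t] - traj[kept[-1]]          # motion since the last kept sample
        v2 = traj[t + 1] - traj[t]             # instantaneous motion
        denom = np.linalg.norm(v1) * np.linalg.norm(v2)
        if denom == 0:
            continue
        angle = np.arccos(np.clip(v1 @ v2 / denom, -1.0, 1.0))
        if angle > angle_thresh:               # a bend: keep this sample
            kept.append(t)
    kept.append(len(traj) - 1)
    return np.array(kept)

t = np.linspace(0, 2 * np.pi, 200)
traj = np.stack([np.cos(t), np.sin(3 * t)], axis=1)   # Lissajous-like test path
print(len(subsample(traj)), "of", len(traj), "samples kept")
```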

6.
A feature-based method for detecting landmarks in facial images was designed. The method is based on extracting oriented edges and constructing edge maps at two resolution levels. Edge regions with a characteristic edge pattern form the landmark candidates. The method ensures invariance to facial expressions when detecting eyes, whereas nose and mouth detection is degraded by expressions of happiness and disgust.
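A hedged sketch of the first stage, oriented edge extraction: compute image gradients, quantize edge orientation into a few bins, and keep only pixels with sufficient gradient magnitude. Landmark candidates would then be regions whose binned edges match a characteristic pattern. The gradient operator, bin count, and threshold below are assumptions, not the paper's choices.

```python
import numpy as np

def oriented_edge_map(img, n_orient=4, mag_thresh=0.2):
    """img: 2D float array in [0, 1]; returns per-pixel orientation bin or -1."""
    gy, gx = np.gradient(img)
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi                     # orientation, not direction
    bins = np.floor(ang / np.pi * n_orient).astype(int) % n_orient
    return np.where(mag > mag_thresh, bins, -1)          # -1 = no edge here

img = np.zeros((32, 32)); img[16:, :] = 1.0              # synthetic horizontal step edge
emap = oriented_edge_map(img)
print(np.unique(emap[emap >= 0]))                        # the edge falls in one bin
```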

7.
The Journal of Supercomputing - In recent years, low back pain rehabilitation exercises have been widely performed for spine-related illnesses. To facilitate rehabilitation exercises, pose-based...

8.
9.
Dancers express their feelings and moods through gestures and body movements. We seek to extend this mode of expression by dynamically and automatically adjusting music and lighting in the dance environment to reflect the dancer's arousal state. Our intention is to offer a space that performance artists can use as a creative tool extending the grammar of dance. To enable dynamic manipulation of lighting and music, the performance space is augmented with several sensors: physiological sensors worn by the dancer to measure her arousal state, and pressure sensors installed in a floor mat to track the dancer's location and movements. Data from these sensors are passed to a three-layered architecture. Layer 1 is a sensor analysis system that analyzes and synthesizes the physiological and pressure sensor signals. Layer 2 comprises intelligent systems that adapt lighting and music to portray the dancer's arousal state: the intelligent on-stage lighting system dynamically adjusts on-stage lighting direction and color, the intelligent virtual lighting system adapts the virtual lighting in the projected imagery, and the intelligent music system unobtrusively adjusts the music. Layer 3 translates the high-level adjustments made by the intelligent systems in layer 2 into the appropriate lighting board, image rendering, and audio box commands. Furthermore, the resulting artifact is the DigitalBeing, a personal signature of the dancer in digital space. In this paper, we describe this architecture in detail, as well as the equipment and control systems used.
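The three-layer flow can be summarized in code. Everything below is schematic: the sensor ranges, the arousal formula, the hue and tempo mappings, and the command strings are invented placeholders, not the system's actual interfaces.

```python
def layer1_fuse(heart_rate, skin_conductance):
    """Layer 1: reduce raw physiological signals to an arousal score in [0, 1]."""
    hr = min(max((heart_rate - 60) / 80, 0.0), 1.0)    # assumed range: 60-140 bpm
    sc = min(max(skin_conductance / 20.0, 0.0), 1.0)   # assumed range: 0-20 uS
    return 0.5 * hr + 0.5 * sc

def layer2_adapt(arousal):
    """Layer 2: choose high-level lighting and music adjustments."""
    hue = 240 - int(240 * arousal)        # calm blue (240) toward agitated red (0)
    tempo_scale = 0.8 + 0.6 * arousal     # slow music down or speed it up
    return {"light_hue": hue, "tempo_scale": tempo_scale}

def layer3_commands(adjust):
    """Layer 3: translate adjustments into device-level command strings."""
    return [f"lightboard SET_HUE {adjust['light_hue']}",
            f"audiobox SET_TEMPO {adjust['tempo_scale']:.2f}"]

for cmd in layer3_commands(layer2_adapt(layer1_fuse(heart_rate=110,
                                                    skin_conductance=12))):
    print(cmd)
```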

10.
Artificial Intelligence, 2007, 171(8-9): 568-585
Head pose and gesture offer several conversational grounding cues and are used extensively in face-to-face interaction among people. To recognize visual feedback accurately, humans often use contextual knowledge from previous and current events to anticipate when feedback is most likely to occur. In this paper we describe how contextual information can be used to predict visual feedback and improve recognition of head gestures in human-computer interfaces. Lexical, prosodic, timing, and gesture features can be used to predict a user's visual feedback during conversational dialog with a robotic or virtual agent. In non-conversational interfaces, context features based on user-interface system events can improve detection of head gestures for dialog box confirmation or document browsing. Our user study with prototype gesture-based components indicates quantitative and qualitative benefits of gesture-based confirmation over conventional alternatives. Using a discriminative approach to contextual prediction and multi-modal integration, head gesture detection was improved by context features even when the topic of the test set differed significantly from that of the training set.
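The discriminative fusion idea, in miniature: combine a vision-based gesture score with context features in a single classifier, so weak visual evidence can be tipped over the decision threshold by a favorable dialog state. The feature set, weights, and bias below are invented for illustration; in the paper they would be learned from labeled interaction data.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Features: [vision_nod_score, agent_just_asked_yes_no_question, dialog_box_open]
w = np.array([2.5, 1.8, 1.2])   # assumed weights; learned from data in practice
b = -2.0                        # assumed bias

def nod_probability(features):
    """P(head nod | vision score + dialog context), discriminatively fused."""
    return sigmoid(w @ np.asarray(features, dtype=float) + b)

print(nod_probability([0.4, 1, 0]))  # weak visual evidence + supportive context
print(nod_probability([0.4, 0, 0]))  # same visual evidence, no context
```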

11.
Expressive text-to-speech (TTS) synthesis should contribute to the pleasantness, intelligibility, and speed of speech-based human-machine interactions that use TTS. We describe a TTS engine which can be directed, via text markup, to use a variety of expressive styles, here questioning, contrastive emphasis, and conveying good and bad news. Differences among these styles lead us to investigate two approaches to expressive TTS, a "corpus-driven" and a "prosodic-phonology" approach. Each speaker records 11 h (excluding silences) of "neutral" sentences. In the corpus-driven approach, the speaker also records 1-h corpora in each expressive style; these segments are tagged by style for use during search, and decision trees for determining f0 contours and timing are trained separately for each of the neutral and expressive corpora. In the prosodic-phonology approach, rules translating certain expressive markup elements to tones and break indices (ToBI) are determined manually, and the ToBI elements are used in single f0 and duration trees covering all expressions. Tests show that listeners correctly identify the style of synthesized speech at rates ranging from 70% for "conveying bad news" to 85% for "yes-no questions". Further improvements are demonstrated through the use of speaker-pooled f0 and duration models.
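The prosodic-phonology approach amounts to a rule table from expressive markup to ToBI annotations. The toy mapping below illustrates the shape of such rules; the specific tones and break indices are illustrative guesses, not the paper's rules.

```python
# Expressive markup element -> ToBI elements (all entries invented for illustration).
TOBI_RULES = {
    "yes-no-question": {"phrase_accent": "H-", "boundary_tone": "H%", "break": 4},
    "contrastive-emphasis": {"pitch_accent": "L+H*", "break": 1},
    "bad-news": {"phrase_accent": "L-", "boundary_tone": "L%", "break": 4},
}

def markup_to_tobi(style, word, phrase_final):
    """Attach ToBI elements to a word according to its expressive markup."""
    rule = TOBI_RULES.get(style, {})
    tags = {"word": word, "break": rule.get("break", 1) if phrase_final else 1}
    if "pitch_accent" in rule:
        tags["accent"] = rule["pitch_accent"]
    if phrase_final and "boundary_tone" in rule:
        tags["boundary"] = rule["phrase_accent"] + " " + rule["boundary_tone"]
    return tags

print(markup_to_tobi("yes-no-question", "ready", phrase_final=True))
print(markup_to_tobi("contrastive-emphasis", "this", phrase_final=False))
```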

12.
Johnson, M.L. Computer, 1991, 24(7): 30-34
The development and implementation of an expert system that determines the tempo and articulations of Bach fugues are described. The rules in the knowledge base are based on the expertise of two professional performers. The system's input is a numeric representation of the fugue. The system processes the input using a transition graph, a data structure consisting of nodes where data is stored and edges that connect the nodes. The transition graph recognizes rhythmic patterns in the input; once the system identifies a pattern, it applies a specific rule or performs a procedure. System output consists of a listing of tempo and articulation instructions. To validate the expert system, its output was compared with versions of fugues edited by one of the two experts consulted in developing the system. In tests with six fugues, the expert system generated the same editing instructions 85 to 90% of the time.
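The recognize-then-apply loop can be caricatured with a sliding-window matcher standing in for the transition graph: known rhythmic patterns trigger editing rules. Both the patterns and the rules below are invented for illustration.

```python
# Pattern of note durations (in sixteenth notes) -> editing instruction.
PATTERNS = {
    (1, 1, 2): "detached eighths, keep tempo steady",
    (4, 2, 2): "legato, slight ritenuto",
}

def scan(durations):
    """Slide over the duration sequence and report every pattern match."""
    hits = []
    for i in range(len(durations)):
        for pattern, rule in PATTERNS.items():
            if tuple(durations[i:i + len(pattern)]) == pattern:
                hits.append((i, rule))
    return hits

voice = [4, 2, 2, 1, 1, 2, 1, 1, 2]   # a made-up rhythmic line
for pos, rule in scan(voice):
    print(f"note {pos}: {rule}")
```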

13.
14.
An approach to the analysis of dynamic facial images for the purpose of estimating and resynthesizing dynamic facial expressions is presented. The approach exploits a sophisticated generative model of the human face originally developed for realistic facial animation. The face model, which may be simulated and rendered at interactive rates on a graphics workstation, incorporates physics-based synthetic facial tissue and a set of anatomically motivated facial muscle actuators. The estimation of dynamic facial muscle contractions from video sequences of expressive human faces is considered. An estimation technique that uses deformable contour models (snakes) to track the nonrigid motions of facial features in video images is developed. The technique estimates muscle actuator controls with sufficient accuracy to permit the face model to resynthesize transient expressions.
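A snake tracks a feature by repeatedly trading image evidence against contour smoothness. Below is a deliberately simplified greedy update (full implementations minimize a continuous energy functional with separate elasticity and bending terms); the weighting and the synthetic edge image are arbitrary.

```python
import numpy as np

def snake_step(points, edge_map, alpha=0.5):
    """One greedy update: each point moves at most one pixel, to the neighbor
    with the best balance of edge strength (image energy) and closeness to the
    midpoint of its contour neighbors (smoothness). The contour is closed."""
    new_pts = points.copy()
    for k, (y, x) in enumerate(points):
        prev, nxt = points[k - 1], points[(k + 1) % len(points)]
        best, best_cost = (y, x), np.inf
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                cy, cx = y + dy, x + dx
                if not (0 <= cy < edge_map.shape[0] and 0 <= cx < edge_map.shape[1]):
                    continue
                smooth = np.sum((np.array([cy, cx]) - (prev + nxt) / 2.0) ** 2)
                cost = -edge_map[cy, cx] + alpha * smooth   # seek edges, stay smooth
                if cost < best_cost:
                    best, best_cost = (cy, cx), cost
        new_pts[k] = best
    return new_pts

edge = np.zeros((64, 64)); edge[32, :] = 1.0               # synthetic horizontal edge
contour = np.array([[31, x] for x in range(8, 56, 8)])     # start one pixel above it
contour = snake_step(contour, edge)
print(contour[:, 0])                                       # points snap to row 32
```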

15.
Lip synchronization of 3D face models is now used in a multitude of important fields. It brings a more human, social, and dramatic reality to computer games, films, and interactive multimedia, and is growing in use and importance. A high level of realism is demanded in applications such as computer games and cinema, yet authoring lip syncing with complex and subtle expressions remains difficult and fraught with problems of realism. This research proposes a lip-syncing method for a realistic, expressive 3D face model. Animating lips requires a 3D face model capable of representing the myriad shapes the human face takes on during speech, and a method for producing the correct lip shape at the correct time. The paper presents a 3D face model designed to support lip syncing aligned with an input audio file. It deforms using a Raised Cosine Deformation (RCD) function grafted onto the input facial geometry, and is based on the MPEG-4 Facial Animation (FA) standard. The paper proposes a method to animate the 3D face model over time, creating animated lip syncing from a canonical set of visemes for all pairwise combinations of a reduced phoneme set called ProPhone. The research also integrates emotion, drawing on the Ekman model and Plutchik's wheel, with emotive eye movements implemented through the Emotional Eye Movements Markup Language (EEMML), to produce a realistic 3D face model.
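A raised-cosine deformation displaces vertices near a chosen center by an amount that falls off smoothly to zero at a given radius. The sketch below shows one plausible form of such a function; whether it matches the paper's exact RCD formulation is an assumption, and the vertex data are random placeholders.

```python
import numpy as np

def rcd_deform(vertices, center, direction, amplitude, radius):
    """Displace (N, 3) vertices along `direction`, weighted by a raised cosine
    of the distance to `center`; vertices beyond `radius` are untouched."""
    d = np.linalg.norm(vertices - center, axis=1)
    w = np.where(d < radius, 0.5 * (1.0 + np.cos(np.pi * d / radius)), 0.0)
    return vertices + w[:, None] * amplitude * np.asarray(direction, dtype=float)

# Pull lip-region vertices downward for an open-mouth viseme (toy geometry).
verts = np.random.default_rng(1).uniform(-1, 1, size=(100, 3))
opened = rcd_deform(verts, center=np.zeros(3), direction=[0, -1, 0],
                    amplitude=0.3, radius=0.8)
print(np.abs(opened - verts).max())   # maximum displacement <= amplitude
```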

16.
Clustering is an underspecified task: there are no universal criteria for what makes a good clustering. This is especially true for relational data, where similarity can be based on the features of individuals, the relationships between them, or a mix of both. Existing methods for relational clustering have strong, often implicit, biases in this respect. In this paper, we introduce a novel dissimilarity measure for relational data. It is the first approach to incorporate a wide variety of types of similarity, including similarity of attributes, similarity of relational context, and proximity in a hypergraph. We experimentally evaluate the proposed dissimilarity measure on both clustering and classification tasks, using data sets of very different types. Considering the quality of the obtained clusterings, the experiments demonstrate that (a) using this dissimilarity in standard clustering methods consistently gives good results, whereas other measures work well only on data sets that match their bias; and (b) on most data sets, the novel dissimilarity outperforms even the best of the existing ones. On the classification tasks, the proposed method outperforms the competitors on the majority of data sets, often by a large margin. Moreover, we show that learning the appropriate bias in an unsupervised way is a very challenging task, and that the existing methods offer at best a marginal gain over the proposed measure, and can even hurt performance. Finally, we show that the asymptotic complexity of the proposed dissimilarity measure is similar to that of existing state-of-the-art approaches. The results confirm that the proposed dissimilarity measure is versatile enough to capture relevant information regardless of whether it comes from the attributes of vertices, their proximity, or their connectedness, even without parameter tuning.
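The flavor of such a composite measure can be shown in a few lines: mix an attribute-space distance with a dissimilarity of relational context, here taken as one minus the Jaccard overlap of neighborhoods. This is a drastically simplified stand-in; the paper's measure also covers hypergraph proximity and is normalized more carefully.

```python
import numpy as np

def dissimilarity(u, v, attrs, adj, w_attr=0.5, w_ctx=0.5):
    """attrs: node -> feature vector; adj: node -> set of neighbors."""
    d_attr = np.linalg.norm(attrs[u] - attrs[v])          # attribute distance
    nu, nv = adj[u], adj[v]
    jaccard = len(nu & nv) / len(nu | nv) if nu | nv else 1.0
    d_ctx = 1.0 - jaccard                                 # context dissimilarity
    return w_attr * d_attr + w_ctx * d_ctx

attrs = {1: np.array([0.1, 0.9]), 2: np.array([0.2, 0.8]), 3: np.array([0.9, 0.1])}
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}
print(dissimilarity(1, 2, attrs, adj))   # similar attributes: small value
print(dissimilarity(1, 3, attrs, adj))   # dissimilar attributes: larger value
```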

17.
This paper presents a new approach to tracking hand rotation and various grasping gestures through an infrared camera. Owing to the complexity and ambiguity of an observed hand shape, it is difficult to estimate hand configuration and orientation simultaneously from a silhouette image of a grasping gesture. This paper proposes a dynamic shape model for hand grasping gestures that uses cylindrical manifold embedding to analyze variations of hand shape across configurations between two key hand poses and across the simultaneous circular view change caused by hand rotation. An arbitrary hand shape between the two key poses, seen from any view, can be generated from a point in the cylindrical manifold embedding after learning nonlinear generative models from the embedding space to the corresponding observed hand shapes. The cylindrical manifold embedding model is extended to various grasping gestures by decomposing it into multiple cylindrical manifold embeddings through grasping-style analysis. Grasping gestures with simultaneous hand rotation are tracked using particle filters on the manifold space with grasping-style estimation. Experimental results on synthetic and real data indicate that the proposed model can accurately track various grasping gestures with hand rotation. The approach may be applied to advanced user interfaces in dark environments by using images beyond the visible spectrum.
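The geometry of the embedding is easy to picture: a hand state is a point on a cylinder, with the angular coordinate encoding circular view change under rotation and the height encoding progress between the two key poses. The sketch below fakes the learned nonlinear generative map with a smooth function just to make the idea concrete; everything numeric is invented.

```python
import numpy as np

def embed(view_angle_rad, grasp_progress):
    """Map (rotation angle, 0..1 pose interpolation) onto the cylinder."""
    theta = view_angle_rad % (2 * np.pi)      # circular view coordinate
    h = np.clip(grasp_progress, 0.0, 1.0)     # between the two key poses
    return theta, h

def fake_generative_model(theta, h, n_points=16):
    """Stand-in for the learned nonlinear map from the cylinder to a silhouette."""
    t = np.linspace(0, 2 * np.pi, n_points, endpoint=False)
    r = 1.0 + 0.3 * h * np.cos(3 * t + theta)   # shape morphs with pose and view
    return np.stack([r * np.cos(t), r * np.sin(t)], axis=1)

shape = fake_generative_model(*embed(np.pi / 4, 0.5))
print(shape.shape)   # (16, 2) silhouette points
```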

18.
This paper presents a study on the importance of short-term speech parameterizations for expressive statistical parametric synthesis. Assuming a source-filter model of speech production, the analysis covers spectral parameters, here defined as features representing a minimum-phase synthesis filter, and excitation parameters, features used to construct the signal that is fed into the minimum-phase synthesis filter to generate speech. In the first part, different spectral and excitation parameters applicable to statistical parametric synthesis are tested to determine which are the most emotion dependent. The analysis uses two proposed methods for measuring the relative emotion dependency of each feature: one based on K-means clustering, and another based on Gaussian mixture modeling for emotion identification. Two commonly used parameterizations of the short-term speech spectral envelope, the Mel cepstrum and the Mel line spectral pairs, are considered. As excitation parameters, the anti-causal cepstrum, the time-smoothed group delay, and band-aperiodicity coefficients are examined. According to the analysis, the line spectral pairs are the most emotion-dependent parameters, and among the excitation features the band-aperiodicity coefficients correlate most strongly with the speaker's emotion. The most emotion-dependent parameters according to this analysis were then selected to train an expressive statistical parametric synthesizer using a speaker and language factorization framework. Subjective test results indicate that the spectral parameters have a larger impact on the perceived emotion of the synthesized speech than the excitation parameters.
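One concrete way to rank features by emotion dependency, loosely in the spirit of the K-means method mentioned above: cluster the feature frames without labels, then measure how well the clusters align with the emotion labels (purity). The data and the scoring choice below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=50):
    """Plain k-means; returns a cluster label for each row of X."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def purity(cluster_labels, emotion_labels):
    """Fraction of frames whose cluster's majority emotion matches their own."""
    total = 0
    for c in np.unique(cluster_labels):
        emotions = emotion_labels[cluster_labels == c]
        total += np.bincount(emotions).max()
    return total / len(emotion_labels)

# Toy data: two emotions, a feature that partially separates them.
X = np.vstack([rng.normal(0, 1, (200, 4)), rng.normal(1.5, 1, (200, 4))])
y = np.repeat([0, 1], 200)
print(purity(kmeans(X, 2), y))   # closer to 1.0 = more emotion dependent
```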

19.
XML is the standard data interchange format, and XSLT is the W3C proposed standard for transforming and restructuring XML documents. It turns out that XSLT has very powerful query capabilities as well. However, due to its complex syntax and the lack of a formal specification, deciding whether two XSLT stylesheets yield the same result is not trivial, even for an XSLT subset. We isolate such a fragment, powerful enough to express several interesting queries and to manipulate XML documents, and show how to translate its stylesheets into queries expressed in a properly extended version of TAX, a powerful XML query algebra, for which we provide a collection of equivalence rules. It then becomes possible to reason about XSLT equivalences by translating XSLT stylesheets into XTAX expressions and statically verifying their equivalence by means of the mentioned equivalence rules.
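The overall strategy is to translate both queries into an algebra and rewrite them into a normal form using equivalence rules; the toy version below mimics this on a tiny selection/projection algebra. The operators and both rewrite rules are classic relational identities chosen for illustration, not TAX or XTAX rules.

```python
def normalize(expr):
    """Apply rewrite rules to a nested tuple expression until a fixpoint."""
    if not isinstance(expr, tuple):
        return expr
    expr = tuple(normalize(e) for e in expr)
    # Rule 1: selection commutes with projection: sel(p, proj(f, X)) == proj(f, sel(p, X)).
    if expr[0] == "sel" and isinstance(expr[2], tuple) and expr[2][0] == "proj":
        _, p, (_, f, x) = expr
        return normalize(("proj", f, ("sel", p, x)))
    # Rule 2: nested selections commute; order them canonically.
    if expr[0] == "sel" and isinstance(expr[2], tuple) and expr[2][0] == "sel":
        _, p1, (_, p2, x) = expr
        if p1 > p2:
            return normalize(("sel", p2, ("sel", p1, x)))
    return expr

q1 = ("sel", "price>10", ("proj", "title", ("sel", "lang=en", "books")))
q2 = ("proj", "title", ("sel", "lang=en", ("sel", "price>10", "books")))
print(normalize(q1) == normalize(q2))   # True: equivalent up to the rules
```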

20.
We introduce a novel approach to the synthesis of software models based on identifying deterministic finite state automata. Our approach consists of three contributions. First, we argue that in order to model software, one should focus mainly on observed executions (positive data) and use randomly generated failures (negative data) only for testing consistency. We present a new greedy heuristic for this purpose and show how to integrate it into the state-of-the-art evidence-driven state-merging (EDSM) algorithm. Second, we apply the enhanced EDSM algorithm to iteratively reduce the size of the problem; however, during each iteration the evidence is divided over the states, and hence the effectiveness of the algorithm decreases. When EDSM becomes too weak, we propose to tackle the reduced identification problem using satisfiability solvers. Third, when the amount of positive data is small, we solve the identification problem several times by randomizing the greedy heuristic and combine the solutions using a voting scheme. The interaction between these contributions proved crucial for solving hard software-model synthesis benchmarks. Our implementation, called DFASAT, won the StaMinA competition.
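The core mechanics of state-merging identification fit in a short sketch: build a prefix tree acceptor from the positive samples, then greedily merge pairs of states, folding away any nondeterminism the merge creates, and keep a merge only if the machine still rejects every negative sample. Real EDSM additionally scores candidate merges by the evidence they share and works in a red-blue framework; that is omitted here, and the sample strings are invented.

```python
import itertools

def prefix_tree(positives):
    """Prefix tree acceptor: state 0 is the start; one state per distinct prefix."""
    trans, accept, nxt = {}, set(), 1
    for s in positives:
        q = 0
        for a in s:
            if (q, a) not in trans:
                trans[(q, a)] = nxt
                nxt += 1
            q = trans[(q, a)]
        accept.add(q)
    return trans, accept

def accepts(trans, accept, s):
    q = 0
    for a in s:
        if (q, a) not in trans:
            return False
        q = trans[(q, a)]
    return q in accept

def merge(trans, accept, p, q):
    """Merge q into p, folding away nondeterminism. The smaller id stays as the
    class representative, so the start state keeps id 0."""
    parent = {}
    def find(x):
        while parent.get(x, x) != x:
            x = parent[x]
        return x
    def union(a, b):
        ra, rb = sorted((find(a), find(b)))
        if ra != rb:
            parent[rb] = ra
    union(p, q)
    changed = True
    while changed:
        changed = False
        seen = {}
        for (s, a), t in trans.items():
            key, ft = (find(s), a), find(t)
            if key in seen and seen[key] != ft:
                union(seen[key], ft)      # fold: conflicting targets must merge too
                changed = True
                break
            seen[key] = ft
    return ({(find(s), a): find(t) for (s, a), t in trans.items()},
            {find(x) for x in accept})

pos = ["", "aa", "aaaa", "b", "aab"]      # invented positive samples
neg = ["a", "aaa", "ba"]                  # invented negative samples
trans, accept = prefix_tree(pos)
for p, q in itertools.combinations(sorted({0} | set(trans.values())), 2):
    t2, a2 = merge(trans, accept, p, q)
    if all(accepts(t2, a2, s) for s in pos) and not any(accepts(t2, a2, s) for s in neg):
        trans, accept = t2, a2            # keep the merge: still consistent
states = {0} | {s for s, _ in trans} | set(trans.values())
print(len(states), "states after merging")
```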
