Similar Documents
 20 similar documents found (search time: 15 ms)
1.
We propose a system for detecting the active speaker in cluttered and reverberant environments where more than one person speaks and moves. Rather than using only audio information, the system utilizes audiovisual information from multiple acoustic and video sensors that feed separate audio and video tracking modules. The audio module operates using a particle filter (PF) and an information-theoretic framework to provide accurate acoustic source location under reverberant conditions. The video subsystem combines in 3-D a number of 2-D trackers based on a variation of Stauffer's adaptive background algorithm with spatiotemporal adaptation of the learning parameters and a Kalman tracker in a feedback configuration. Extensive experiments show that gains are to be expected when fusion of the separate modalities is performed to detect the active speaker.  相似文献   
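As a rough illustration of the audio tracking module described above, the sketch below implements a generic bootstrap particle filter for 2-D source tracking in Python. It is not the paper's information-theoretic formulation: the Gaussian observation model, the room bounds, and all parameter values are assumptions made for this example.

```python
import numpy as np

def particle_filter_track(observations, n_particles=500, motion_std=0.05,
                          obs_std=0.2, rng=None):
    """Bootstrap particle filter for 2-D source tracking.

    `observations` is a (T, 2) array of noisy position estimates standing in
    for the acoustic likelihood; the real system scores particles with an
    information-theoretic acoustic model instead of this Gaussian stand-in.
    """
    rng = np.random.default_rng(rng)
    T = len(observations)
    particles = rng.uniform(-1.0, 1.0, size=(n_particles, 2))  # assumed room in [-1, 1]^2
    weights = np.full(n_particles, 1.0 / n_particles)
    estimates = np.zeros((T, 2))

    for t in range(T):
        # Predict: random-walk motion model.
        particles += rng.normal(0.0, motion_std, size=particles.shape)
        # Update: weight particles by a Gaussian observation likelihood.
        d2 = np.sum((particles - observations[t]) ** 2, axis=1)
        weights *= np.exp(-0.5 * d2 / obs_std**2) + 1e-300
        weights /= weights.sum()
        estimates[t] = weights @ particles
        # Systematic resampling when the effective sample size drops too low.
        if 1.0 / np.sum(weights**2) < n_particles / 2:
            positions = (rng.random() + np.arange(n_particles)) / n_particles
            idx = np.minimum(np.searchsorted(np.cumsum(weights), positions),
                             n_particles - 1)
            particles = particles[idx]
            weights = np.full(n_particles, 1.0 / n_particles)
    return estimates

# Toy usage: a speaker moving across the room, observed with noise.
true_path = np.linspace([-0.8, -0.8], [0.8, 0.8], 100)
noisy_obs = true_path + np.random.default_rng(0).normal(0, 0.2, true_path.shape)
print(particle_filter_track(noisy_obs)[-1])  # estimate near (0.8, 0.8)
```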

2.
Emotive audio–visual avatars are virtual computer agents that have the potential to significantly improve the quality of human-machine interaction and human-human communication. However, the understanding of human communication has not yet advanced to the point where it is possible to make realistic avatars that demonstrate interactions with natural-sounding emotive speech and realistic-looking emotional facial expressions. In this paper, we propose the technical approaches of a novel multimodal framework leading to a text-driven emotive audio–visual avatar. Our primary work focuses on emotive speech synthesis, realistic emotional facial expression animation, and the co-articulation between speech gestures (i.e., lip movements) and facial expressions. A general framework for emotive text-to-speech (TTS) synthesis using a diphone synthesizer is designed and integrated into a generic 3-D avatar face model. Guided by this framework, we developed a realistic 3-D avatar prototype. A rule-based emotive TTS synthesis module based on the Festival-MBROLA architecture has been designed to demonstrate the effectiveness of the framework design. Subjective listening experiments were carried out to evaluate the expressiveness of the synthetic talking avatar.  相似文献
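To illustrate what a rule-based emotive layer on top of a diphone synthesizer can look like, here is a hedged Python sketch that rescales durations and pitch targets of MBROLA-style .pho entries according to a small emotion table. The rule values and the emotion set are invented for the example and are not the rules used in the paper.

```python
# Illustrative emotion rules: scale factors for duration and F0.
# These specific values are assumptions for the sketch, not the paper's rules.
EMOTION_RULES = {
    "neutral": {"dur": 1.00, "f0": 1.00},
    "happy":   {"dur": 0.85, "f0": 1.15},  # faster, higher pitch
    "sad":     {"dur": 1.25, "f0": 0.90},  # slower, lower pitch
    "angry":   {"dur": 0.90, "f0": 1.10},
}

def apply_emotion(pho_lines, emotion):
    """Rewrite MBROLA-style .pho lines ("phoneme duration_ms [pos% f0_hz]...")
    by scaling durations and pitch targets according to a simple rule table."""
    rule = EMOTION_RULES[emotion]
    out = []
    for line in pho_lines:
        parts = line.split()
        phoneme, dur = parts[0], float(parts[1])
        targets = [float(x) for x in parts[2:]]
        new = [phoneme, str(round(dur * rule["dur"]))]
        for i, val in enumerate(targets):
            # Even positions are percentages (kept), odd positions are F0 values (scaled).
            new.append(str(round(val if i % 2 == 0 else val * rule["f0"])))
        out.append(" ".join(new))
    return out

neutral = ["h 60", "e 90 50 120", "l 70", "ou 180 20 118 80 110"]
print("\n".join(apply_emotion(neutral, "sad")))
```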

3.
Advances in computer processing power and emerging algorithms are allowing new ways of envisioning human-computer interaction. Although the benefit of audio-visual fusion for affect recognition is expected from both the psychological and engineering perspectives, most existing approaches to automatic human affect analysis are unimodal: the information processed by the computer system is limited to either face images or speech signals. This paper focuses on the development of a computing algorithm that uses both audio and visual sensors to detect and track a user's affective state to aid computer decision making. Using our multistream fused hidden Markov model (MFHMM), we analyzed coupled audio and visual streams to detect four cognitive states (interest, boredom, frustration, and puzzlement) and seven prototypical emotions (neutral, happiness, sadness, anger, disgust, fear, and surprise). The MFHMM allows building an optimal connection among multiple streams according to the maximum entropy principle and the maximum mutual information criterion. Person-independent experimental results from 20 subjects in 660 sequences show that the MFHMM approach outperforms face-only HMM, pitch-only HMM, energy-only HMM, and independent HMM fusion under clean and varying audio-channel noise conditions.  相似文献
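A full MFHMM is beyond a short example, but the weighted log-linear combination below shows the basic idea of fusing per-stream HMM scores for audio-visual affect classification. This is a simplified decision-level stand-in with made-up numbers, not the coupled-stream inference described in the entry.

```python
import numpy as np

def fuse_streams(loglik_by_stream, weights):
    """Combine per-class log-likelihoods from several observation streams.

    loglik_by_stream: dict stream_name -> array of shape (n_classes,)
                      holding log P(observations_stream | class HMM).
    weights:          dict stream_name -> stream reliability weight.
    Returns the index of the winning class under a weighted log-linear fusion,
    a common simplification of fused-HMM decoding (not the exact MFHMM).
    """
    streams = list(loglik_by_stream)
    combined = sum(weights[s] * np.asarray(loglik_by_stream[s]) for s in streams)
    return int(np.argmax(combined)), combined

# Toy example with 3 emotion classes and two streams (face, pitch).
# The numbers are made up; in practice each row comes from per-class HMMs
# scored on the audio or visual observation sequence.
logliks = {
    "face":  np.array([-120.0, -118.5, -130.0]),
    "pitch": np.array([-300.0, -310.0, -295.0]),
}
weights = {"face": 0.6, "pitch": 0.4}   # e.g., down-weight a noisy audio channel
best, scores = fuse_streams(logliks, weights)
print(best, scores)
```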

4.
This paper addresses the problem of distant speech acquisition in multiparty meetings using multiple microphones and cameras. Microphone array beamforming techniques present a potential alternative to close-talking microphones by providing speech enhancement through spatial filtering. Beamforming techniques, however, rely on knowledge of the speaker location. In this paper, we present an integrated approach in which an audio-visual multiperson tracker is used to track active speakers with high accuracy. Speech enhancement is then achieved using microphone array beamforming followed by a novel postfiltering stage. Finally, speech recognition is performed to evaluate the quality of the enhanced speech signal. The approach is evaluated on data recorded in a real meeting room for stationary-speaker, moving-speaker, and overlapping-speech scenarios. The results show that the speech enhancement and recognition performance achieved with our approach is significantly better than that of a single table-top microphone and comparable to a lapel microphone for some of the scenarios. The results also indicate that the audio-visual system performs significantly better than the audio-only system, both in terms of enhancement and recognition. This shows that the accurate speaker tracking provided by the audio-visual sensor array is beneficial for improving recognition performance in a microphone-array-based speech recognition system.  相似文献
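The core of the enhancement stage described above is array beamforming toward the tracked speaker position. Below is a minimal time-domain delay-and-sum beamformer in Python (the paper follows it with a novel post-filter, which is not sketched here); the function signature and the frequency-domain fractional-delay implementation are choices made for this illustration.

```python
import numpy as np

C = 343.0  # speed of sound, m/s

def delay_and_sum(signals, mic_positions, source_position, fs):
    """Time-domain delay-and-sum beamformer.

    signals:         (n_mics, n_samples) array of synchronized microphone signals
    mic_positions:   (n_mics, 3) microphone coordinates in meters
    source_position: (3,) speaker location, e.g. from an audio-visual tracker
    fs:              sampling rate in Hz
    Fractional delays are applied in the frequency domain as a phase shift.
    """
    signals = np.asarray(signals, dtype=float)
    n_mics, n_samples = signals.shape
    dists = np.linalg.norm(mic_positions - source_position, axis=1)
    delays = (dists - dists.min()) / C          # relative propagation delays
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    out = np.zeros(n_samples)
    for m in range(n_mics):
        spectrum = np.fft.rfft(signals[m])
        # Advance each channel by its delay so the target source adds coherently.
        out += np.fft.irfft(spectrum * np.exp(2j * np.pi * freqs * delays[m]),
                            n=n_samples)
    return out / n_mics
```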

5.
The presence of disfluencies in spontaneous speech, while posing a challenge for robust automatic recognition, also offers a means of gaining additional insight into a speaker's communicative and cognitive state. This paper analyzes disfluencies in children's spontaneous speech in the context of spoken-dialog-based computer game play and addresses the automatic detection of disfluency boundaries. Although several approaches have been proposed to detect disfluencies in speech, relatively little work has been done to utilize visual information to improve the performance and robustness of disfluency detection. This paper describes the use of visual information along with prosodic and language information to detect the presence of disfluencies in a child's computer-directed speech and shows how these information sources can be integrated to increase the overall information available for disfluency detection. Experimental results on our children's multimodal dialog corpus indicate that disfluency detection accuracy of over 80% can be obtained by utilizing audio-visual information. Specifically, adding visual information to prosodic and language features yields relative reductions in disfluency detection error rates of 3.6% and 6.3% for information fusion at the feature level and decision level, respectively.  相似文献
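The feature-level versus decision-level comparison in the entry can be made concrete with a small sketch. The example below uses synthetic features and plain logistic-regression classifiers as stand-ins for the paper's prosodic, language, and visual models; only the two fusion schemes themselves are the point.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for per-utterance prosodic, language, and visual features
# (the real corpus and feature definitions are those of the paper, not shown here).
n = 400
y = rng.integers(0, 2, n)                        # 1 = disfluent boundary, 0 = fluent
prosody = rng.normal(y[:, None], 1.0, (n, 5))
language = rng.normal(y[:, None], 1.2, (n, 8))
visual = rng.normal(y[:, None], 1.5, (n, 4))
train, test = slice(0, 300), slice(300, n)

# Feature-level fusion: concatenate all modalities, train a single classifier.
X_all = np.hstack([prosody, language, visual])
feat_clf = LogisticRegression(max_iter=1000).fit(X_all[train], y[train])
feat_acc = feat_clf.score(X_all[test], y[test])

# Decision-level fusion: one classifier per modality, average the posteriors.
probs = []
for X in (prosody, language, visual):
    clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    probs.append(clf.predict_proba(X[test])[:, 1])
dec_pred = (np.mean(probs, axis=0) > 0.5).astype(int)
dec_acc = np.mean(dec_pred == y[test])

print(f"feature-level accuracy: {feat_acc:.3f}, decision-level accuracy: {dec_acc:.3f}")
```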

6.
Starting from the characteristics of the Data Structures course, this paper offers several suggestions for studying Data Structures efficiently.  相似文献

7.

To recognize objects of the unseen classes, most existing Zero-Shot Learning (ZSL) methods first learn a compatible projection function between the common semantic space and the visual space based on the data of source seen classes, then directly apply it to the target unseen classes. However, for data in the wild, distributions between the source and target domain might not match well, thus causing the well-known domain shift problem. Based on the observation that visual features of test instances can be separated into different clusters, we propose a new visual structure constraint on class centers for transductive ZSL, to improve the generality of the projection function (i.e., alleviate the above domain shift problem). Specifically, three different strategies (symmetric Chamfer distance, bipartite matching distance, and Wasserstein distance) are adopted to align the projected unseen semantic centers and the visual cluster centers of test instances. We also propose two new training strategies to handle data in the wild, where many unrelated images may exist in the test dataset. This realistic setting has never been considered in previous methods. Extensive experiments demonstrate that the proposed visual structure constraint consistently brings substantial performance gains and that the new training strategies make it generalize well for data in the wild. The source code is available at https://github.com/raywzy/VSC.  相似文献
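Of the three alignment strategies mentioned, the symmetric Chamfer distance is the simplest to sketch. The function below computes it between projected unseen-class semantic centers and visual cluster centers; it is only the distance term, not the full transductive training procedure, and the toy points are invented.

```python
import numpy as np

def symmetric_chamfer(projected_centers, cluster_centers):
    """Symmetric Chamfer distance between two point sets in the visual space:
    the mean nearest-neighbor distance in both directions. Minimizing this term
    pulls projected unseen-class semantic centers toward the visual cluster
    centers of the test instances."""
    a = np.asarray(projected_centers)[:, None, :]   # (n_a, 1, d)
    b = np.asarray(cluster_centers)[None, :, :]     # (1, n_b, d)
    d = np.linalg.norm(a - b, axis=-1)              # pairwise distances (n_a, n_b)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

# Toy usage: cluster centers could come from k-means on test visual features.
proj = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
clus = np.array([[0.1, -0.1], [0.9, 1.2], [2.2, 0.4]])
print(symmetric_chamfer(proj, clus))
```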

8.
Tracking objects that undergo abrupt appearance changes and heavy occlusions is a challenging problem that conventional tracking methods can barely handle. To address the problem, we propose an online structure learning algorithm with three layers: an object is represented by a mixture of online structure models (OSMs), which are learnt from block-based online random forest classifiers (BORFs). BORFs are able to handle occlusion because they model local appearances of the target. To further improve tracking accuracy and reliability, the algorithm utilizes mixture relational models (MRMs) as multi-mode context information to integrate BORFs into OSMs. Furthermore, the mixture construction of OSMs effectively avoids over-fitting and is more flexible in describing targets. By fusing BORFs with MRMs, OSMs capture the discriminative parts of the target, which guarantees the reliability and robustness of our tracker. In addition, OSMs incorporate block occlusion reasoning to update the BORFs and MRMs, which deals effectively with appearance changes and drifting. Experiments on challenging videos show that the proposed tracker performs better than several state-of-the-art algorithms.  相似文献
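As a rough sketch of block-based appearance scoring with occlusion reasoning, the Python function below splits a candidate patch into a grid, scores each block with its own classifier, and excludes low-confidence (presumably occluded) blocks from the overall confidence. The grid size, threshold, and dummy brightness-based scorers are assumptions; the actual tracker uses online random forests and relational models.

```python
import numpy as np

def blockwise_confidence(patch, block_scorers, grid=(3, 3), occlusion_thresh=0.3):
    """Score a candidate image patch with one classifier per block and reason
    about occlusion: blocks whose score falls below `occlusion_thresh` are
    treated as occluded and excluded from the overall confidence (and would be
    frozen rather than updated in an online tracker).

    patch:         2-D grayscale array for the candidate target region
    block_scorers: list of callables, one per block, mapping a flattened block
                   to a confidence in [0, 1] (stand-ins for per-block online
                   random forest classifiers)
    Returns (overall_confidence, per_block_scores, occlusion_mask).
    """
    rows, cols = grid
    h, w = patch.shape[0] // rows, patch.shape[1] // cols
    scores = np.zeros(rows * cols)
    for i in range(rows):
        for j in range(cols):
            block = patch[i * h:(i + 1) * h, j * w:(j + 1) * w].ravel()
            scores[i * cols + j] = block_scorers[i * cols + j](block)
    occluded = scores < occlusion_thresh
    visible = scores[~occluded]
    overall = visible.mean() if visible.size else 0.0
    return overall, scores, occluded

# Toy usage with dummy scorers that respond to mean brightness.
dummy_scorers = [lambda b: float(np.clip(b.mean(), 0, 1)) for _ in range(9)]
patch = np.random.default_rng(1).random((30, 30))
patch[20:, :] = 0.0   # simulate an occluder covering the bottom third
print(blockwise_confidence(patch, dummy_scorers))
```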

9.
Data Structures is a required foundational course for computer science students. Based on the current state of teaching, this paper presents several suggestions for studying the course, offering guidance and help to computer science students learning data structures.  相似文献

10.
The Data Structures course is an important foundational course for computer science and related disciplines. It is not only the basis of general programming, but also an essential foundation for designing and implementing compilers, operating systems, database systems, and other system software and large applications, and it is a practical course that exercises programming ability. Studying data structures not only provides the necessary background for subsequent courses; more importantly, it further improves students' software design and programming skills. However, students generally report that the course is hard to learn, heavily theoretical, and difficult to practice. Starting from how to learn the Data Structures course better, this paper discusses the key points, difficulties, and study methods involved in learning data structures.  相似文献

11.
By presenting an example implementation of linear data structures based on Visual Basic programming, this paper studies and discusses how arrays and user-defined data types can be used to describe and construct linear data structures such as linked lists, stacks, and queues.  相似文献
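The entry's examples are written in Visual Basic; the Python sketch below illustrates the same idea in a language-neutral way: describing a stack, a circular queue, and a linked list purely with arrays and index bookkeeping. Class names and capacities are arbitrary choices for the example.

```python
class ArrayStack:
    """Stack over a fixed-size array (mirrors the array-based examples)."""
    def __init__(self, capacity=10):
        self.data = [None] * capacity
        self.top = -1                       # index of the current top element
    def push(self, x):
        if self.top + 1 == len(self.data):
            raise OverflowError("stack full")
        self.top += 1
        self.data[self.top] = x
    def pop(self):
        if self.top < 0:
            raise IndexError("stack empty")
        x, self.top = self.data[self.top], self.top - 1
        return x

class ArrayQueue:
    """Circular queue over a fixed-size array, using head/count bookkeeping."""
    def __init__(self, capacity=10):
        self.data = [None] * capacity
        self.head = 0
        self.count = 0
    def enqueue(self, x):
        if self.count == len(self.data):
            raise OverflowError("queue full")
        self.data[(self.head + self.count) % len(self.data)] = x
        self.count += 1
    def dequeue(self):
        if self.count == 0:
            raise IndexError("queue empty")
        x = self.data[self.head]
        self.head = (self.head + 1) % len(self.data)
        self.count -= 1
        return x

class StaticLinkedList:
    """Linked list stored in parallel arrays (value + next index), the usual
    way to describe a linked list in a language exposing only arrays/records."""
    def __init__(self, capacity=10):
        self.value = [None] * capacity
        self.next = list(range(1, capacity)) + [-1]   # chain of free cells
        self.free = 0                                  # first free cell
        self.head = -1                                 # empty list
    def push_front(self, x):
        if self.free == -1:
            raise OverflowError("list full")
        cell, self.free = self.free, self.next[self.free]
        self.value[cell], self.next[cell] = x, self.head
        self.head = cell
    def to_list(self):
        out, cell = [], self.head
        while cell != -1:
            out.append(self.value[cell])
            cell = self.next[cell]
        return out

s, q, l = ArrayStack(), ArrayQueue(), StaticLinkedList()
for i in (1, 2, 3):
    s.push(i); q.enqueue(i); l.push_front(i)
print(s.pop(), q.dequeue(), l.to_list())   # 3 1 [3, 2, 1]
```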

12.
13.
苏杭丽 《计算机时代》2010,(6):62-63,66
Drawing on experience in teaching the Data Structures course, this article summarizes the main teaching difficulties and, on that basis, studies the teaching of abstract data type definitions and data structure algorithms. It proposes a "four-step" method for defining abstract data types and a "graphical" method for teaching data structure algorithms, and analyzes common problems in implementing these algorithms. These methods have achieved good results in teaching practice.  相似文献

14.

15.
16.
Structure Learning of Decomposable Markov Networks with Missing Data   Cited by: 14 (self-citations: 0, other citations: 14)
王双成  苑森淼 《计算机学报》2004,27(9):1221-1228
Learning the structure of decomposable Markov networks from data with missing values is an important and difficult research problem: missing data confuses the dependency relationships among variables, so reliable structure learning cannot be performed directly. Combining a maximum likelihood tree with Gibbs sampling, this paper iteratively corrects and adjusts the randomly initialized missing values and the maximum likelihood tree to obtain a repaired, complete data set. On this basis, the decomposable Markov network structure is learned from the basic dependency relationships among variables using the idea of dependency analysis, avoiding the low efficiency and reliability of existing methods for handling missing data and for learning decomposable Markov network structure. Experimental results show that the method can effectively learn decomposable Markov network structure from data with missing values.  相似文献
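A minimal sketch of the two ingredients named above: a Chow-Liu-style maximum likelihood tree built from pairwise mutual information, preceded by a naive random imputation of missing values. The paper iterates Gibbs sampling of the missing data with tree re-estimation; this example performs only a single imputation pass and is not the full method.

```python
import numpy as np
from itertools import combinations

def mutual_information(x, y):
    """Empirical mutual information between two discrete variables."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            pxy = np.mean((x == a) & (y == b))
            if pxy > 0:
                mi += pxy * np.log(pxy / (np.mean(x == a) * np.mean(y == b)))
    return mi

def maximum_likelihood_tree(data):
    """Chow-Liu style maximum likelihood tree: a maximum spanning tree over
    pairwise mutual information, built with a simple Prim-like loop.
    Returns a list of undirected edges (i, j)."""
    n_vars = data.shape[1]
    mi = {(i, j): mutual_information(data[:, i], data[:, j])
          for i, j in combinations(range(n_vars), 2)}
    in_tree, edges = {0}, []
    while len(in_tree) < n_vars:
        best = max(((i, j) for i, j in mi
                    if (i in in_tree) != (j in in_tree)),
                   key=lambda e: mi[e])
        edges.append(best)
        in_tree |= set(best)
    return edges

def impute_missing(data, rng=None):
    """Very naive initialization of missing entries (NaN) by sampling observed
    values of the same column; the paper instead alternates Gibbs sampling of
    the missing values with re-estimation of the maximum likelihood tree."""
    rng = np.random.default_rng(rng)
    data = data.astype(float).copy()
    for j in range(data.shape[1]):
        col = data[:, j]
        missing = np.isnan(col)
        col[missing] = rng.choice(col[~missing], size=missing.sum())
    return data

# Toy usage: 4 binary variables with a chain dependency and 10% missing values.
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, 500)
x1 = (x0 + (rng.random(500) < 0.1)) % 2
x2 = (x1 + (rng.random(500) < 0.1)) % 2
x3 = rng.integers(0, 2, 500)
data = np.column_stack([x0, x1, x2, x3]).astype(float)
data[rng.random(data.shape) < 0.1] = np.nan
print(maximum_likelihood_tree(impute_missing(data, rng)))
```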

17.
Application of Multimedia CAI in Teaching Data Structure Algorithms   Cited by: 2 (self-citations: 0, other citations: 2)
This paper introduces the characteristics of multimedia CAI and applies it to the Data Structures course. Using the multimedia authoring tool AUTHORWARE, the execution of algorithms that are difficult to understand is dynamically simulated and traced through graphics, text, and sound, stimulating students' interest in learning in a vivid, realistic, and flexible way and reducing the teacher's workload in class.  相似文献

18.
A compiler-compiler for visual languages is presented. It has been designed as a framework for building visual programming environments that translate schemas into textual representation as well as into programs representing the deep meaning of schemas. The deep semantics is implemented by applying attribute grammars to schema languages; attribute dependencies are implemented as methods of Java classes. Unlike compiler-compilers of textual languages, a large part of the framework is needed for support of interactive usage of a visual language.  相似文献   
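Attribute rules attached to node types can be sketched compactly. The Python example below (standing in for the framework's Java classes) evaluates a single synthesized `code` attribute bottom-up over a toy flowchart-like schema; the node kinds and rules are invented for the illustration, and a real schema would be a graph rather than a tree.

```python
# Each node type contributes one synthesized attribute, `code`, computed from
# the attributes of its children -- the same shape as attribute rules attached
# to productions in an attribute grammar, here written as plain functions.

ATTRIBUTE_RULES = {
    # node kind -> rule(node, child_codes) returning the synthesized `code`
    "start":    lambda node, kids: "\n".join(kids),
    "assign":   lambda node, kids: f"{node['var']} = {node['expr']}",
    "decision": lambda node, kids: (
        f"if {node['cond']}:\n    {kids[0]}\nelse:\n    {kids[1]}"),
}

def synthesize(node):
    """Evaluate the synthesized `code` attribute bottom-up over a schema tree."""
    kids = [synthesize(child) for child in node.get("children", [])]
    return ATTRIBUTE_RULES[node["kind"]](node, kids)

# A toy flowchart-like schema: start -> assign, decision(assign, assign).
schema = {
    "kind": "start",
    "children": [
        {"kind": "assign", "var": "x", "expr": "0"},
        {"kind": "decision", "cond": "x < 10",
         "children": [
             {"kind": "assign", "var": "y", "expr": "x + 1"},
             {"kind": "assign", "var": "y", "expr": "x - 1"},
         ]},
    ],
}
print(synthesize(schema))
```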

19.
A Bayesian network is a graphical model that describes potential dependency relationships among uncertain variables. Learning Bayesian networks from complete data sets is an active research topic. This paper analyzes common theoretical methods for constructing Bayesian networks from complete data sets.  相似文献
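One common score-based approach can be sketched as follows: greedy parent selection under a BIC score along a fixed variable ordering (K2-style), on complete discrete data. This is one illustrative method, not a summary of the specific approaches analyzed in the entry; the toy data and ordering are assumptions.

```python
import numpy as np
from itertools import product

def bic_family_score(data, child, parents):
    """BIC score of one node given a candidate parent set, for discrete data.
    data is an (N, n_vars) integer array of complete observations."""
    n = len(data)
    child_vals = np.unique(data[:, child])
    parent_vals = [np.unique(data[:, p]) for p in parents]
    loglik, n_params = 0.0, 0
    for combo in product(*parent_vals):      # one parent configuration at a time
        mask = np.ones(n, dtype=bool)
        for p, v in zip(parents, combo):
            mask &= data[:, p] == v
        n_pa = mask.sum()
        if n_pa == 0:
            continue
        counts = np.array([(data[mask, child] == v).sum() for v in child_vals])
        probs = counts / n_pa
        loglik += (counts[counts > 0] * np.log(probs[counts > 0])).sum()
        n_params += len(child_vals) - 1
    return loglik - 0.5 * np.log(n) * n_params

def learn_structure(data, order):
    """Greedy parent selection along a fixed variable ordering (K2-style):
    for each node, add the single best-scoring parent until the BIC score
    stops improving. Returns {child: [parents]}."""
    parents = {v: [] for v in order}
    for i, child in enumerate(order):
        candidates = list(order[:i])
        score = bic_family_score(data, child, parents[child])
        improved = True
        while improved and candidates:
            gains = [(bic_family_score(data, child, parents[child] + [c]), c)
                     for c in candidates]
            best_score, best_c = max(gains)
            improved = best_score > score
            if improved:
                parents[child].append(best_c)
                candidates.remove(best_c)
                score = best_score
    return parents

# Toy usage: a chain x0 -> x1 -> x2, plus an independent x3.
rng = np.random.default_rng(1)
x0 = rng.integers(0, 2, 1000)
x1 = (x0 + (rng.random(1000) < 0.1)) % 2
x2 = (x1 + (rng.random(1000) < 0.1)) % 2
x3 = rng.integers(0, 2, 1000)
data = np.column_stack([x0, x1, x2, x3])
print(learn_structure(data, order=[0, 1, 2, 3]))
```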

20.
We present Peax, a novel feature-based technique for interactive visual pattern search in sequential data, like time series or data mapped to a genome sequence. Visually searching for patterns by similarity is often challenging because of the large search space, the visual complexity of patterns, and the user's perception of similarity. For example, in genomics, researchers try to link patterns in multivariate sequential data to cellular or pathogenic processes, but a lack of ground truth and high variance makes automatic pattern detection unreliable. We have developed a convolutional autoencoder for unsupervised representation learning of regions in sequential data that can capture more visual details of complex patterns compared to existing similarity measures. Using this learned representation as features of the sequential data, our accompanying visual query system enables interactive feedback-driven adjustments of the pattern search to adapt to the users' perceived similarity. Using an active learning sampling strategy, Peax collects user-generated binary relevance feedback. This feedback is used to train a model for binary classification, to ultimately find other regions that exhibit patterns similar to the search target. We demonstrate Peax's features through a case study in genomics and report on a user study with eight domain experts to assess the usability and usefulness of Peax. Moreover, we evaluate the effectiveness of the learned feature representation for visual similarity search in two additional user studies. We find that our models retrieve significantly more similar patterns than other commonly used techniques.  相似文献
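To show what a convolutional autoencoder for fixed-length signal windows looks like, here is a small PyTorch sketch. The window length, layer sizes, and training loop are illustrative assumptions and do not reproduce Peax's actual architecture; the point is that the latent vectors become the features used for similarity search and relevance feedback.

```python
import torch
from torch import nn

class ConvAutoencoder1d(nn.Module):
    """A small 1-D convolutional autoencoder for fixed-length signal windows.
    The layer sizes and window length (here 120 bins) are illustrative choices,
    not the architecture used by Peax."""
    def __init__(self, window=120, latent=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=5, stride=2, padding=2),   # 120 -> 60
            nn.ReLU(),
            nn.Conv1d(8, 16, kernel_size=5, stride=2, padding=2),  # 60 -> 30
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(16 * (window // 4), latent),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent, 16 * (window // 4)),
            nn.ReLU(),
            nn.Unflatten(1, (16, window // 4)),
            nn.ConvTranspose1d(16, 8, kernel_size=4, stride=2, padding=1),  # 30 -> 60
            nn.ReLU(),
            nn.ConvTranspose1d(8, 1, kernel_size=4, stride=2, padding=1),   # 60 -> 120
            nn.Sigmoid(),
        )

    def forward(self, x):              # x: (batch, 1, window)
        z = self.encoder(x)            # latent features used for similarity search
        return self.decoder(z), z

# Unsupervised training on unlabeled windows; the latent vectors z then serve as
# the features for nearest-neighbor pattern retrieval and relevance feedback.
model = ConvAutoencoder1d()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
windows = torch.rand(256, 1, 120)      # stand-in for real genomic signal windows
for _ in range(5):                     # a few toy epochs
    recon, _ = model(windows)
    loss = nn.functional.mse_loss(recon, windows)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

_, features = model(windows)
print(features.shape)                  # (256, 10) learned representation
```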
