20 similar documents found; search took 31 ms
1.
Maganti H.K. Gatica-Perez D. McCowan I. 《IEEE transactions on audio, speech, and language processing》2007,15(8):2257-2269
This paper addresses the problem of distant speech acquisition in multiparty meetings, using multiple microphones and cameras. Microphone array beamforming techniques present a potential alternative to close-talking microphones by providing speech enhancement through spatial filtering. Beamforming techniques, however, rely on knowledge of the speaker location. In this paper, we present an integrated approach in which an audio-visual multiperson tracker is used to track active speakers with high accuracy. Speech enhancement is then achieved using microphone array beamforming followed by a novel postfiltering stage. Finally, speech recognition is performed to evaluate the quality of the enhanced speech signal. The approach is evaluated on data recorded in a real meeting room for stationary speaker, moving speaker, and overlapping speech scenarios. The results show that the speech enhancement and recognition performance achieved using our approach is significantly better than that of a single table-top microphone and, for some scenarios, comparable to that of a lapel microphone. The results also indicate that the audio-visual system performs significantly better than the audio-only system, in terms of both enhancement and recognition. This shows that the accurate speaker tracking provided by the audio-visual sensor array is beneficial for improving recognition performance in a microphone array-based speech recognition system.
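The beamforming stage described in this abstract can be illustrated with a minimal delay-and-sum sketch (an illustration only, not the paper's system: `delay_and_sum` and its parameter names are hypothetical, and the novel postfilter is omitted):

```python
import numpy as np

def delay_and_sum(signals, mic_positions, source_pos, fs, c=343.0):
    """Align each microphone channel by its propagation delay to the
    (tracked) source position, then average the aligned channels.

    signals:       (n_mics, n_samples) array of synchronized recordings
    mic_positions: (n_mics, 3) microphone coordinates in metres
    source_pos:    (3,) estimated speaker position, e.g. from a tracker
    fs:            sampling rate in Hz
    c:             speed of sound in m/s
    """
    dists = np.linalg.norm(mic_positions - source_pos, axis=1)
    # Integer sample delays relative to the closest microphone.
    delays = np.round((dists - dists.min()) / c * fs).astype(int)
    n = signals.shape[1]
    out = np.zeros(n)
    for sig, d in zip(signals, delays):
        out[: n - d if d else n] += sig[d:]  # advance later-arriving channels
    return out / len(signals)
```

In-phase channels add coherently while noise from other directions averages out, which is the spatial filtering effect the abstract relies on.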
2.
《Neural Networks, IEEE Transactions on》2009,20(12):1898-1910
3.
Zhihong Zeng Jilin Tu Pianfetti B.M. Huang T.S. 《Multimedia, IEEE Transactions on》2008,10(4):570-577
Advances in computer processing power and emerging algorithms are allowing new ways of envisioning human-computer interaction. Although the benefit of audio-visual fusion for affect recognition is expected from both psychological and engineering perspectives, most existing approaches to automatic human affect analysis are unimodal: the information processed by the computer system is limited to either face images or speech signals. This paper focuses on the development of a computing algorithm that uses both audio and visual sensors to detect and track a user's affective state to aid computer decision making. Using our multistream fused hidden Markov model (MFHMM), we analyzed coupled audio and visual streams to detect four cognitive states (interest, boredom, frustration, and puzzlement) and seven prototypical emotions (neutral, happiness, sadness, anger, disgust, fear, and surprise). The MFHMM allows the building of an optimal connection among multiple streams according to the maximum entropy principle and the maximum mutual information criterion. Person-independent experimental results from 20 subjects in 660 sequences show that the MFHMM approach outperforms face-only HMM, pitch-only HMM, energy-only HMM, and independent HMM fusion, under both clean and varying audio channel noise conditions.
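The independent-HMM fusion baseline the MFHMM is compared against amounts to combining per-stream scores after classification. A minimal sketch, assuming each stream's HMM has already produced per-class log-likelihoods (`fuse_stream_scores` and the uniform weighting are illustrative, not the paper's method; the MFHMM instead couples the streams inside the model):

```python
import numpy as np

def fuse_stream_scores(log_likelihoods, weights=None):
    """Decision-level fusion of per-stream classifier scores.

    log_likelihoods: (n_streams, n_classes) log-likelihoods, e.g. one row
                     each for face-, pitch- and energy-based HMMs.
    weights:         optional per-stream reliability weights (an assumption
                     here; uniform if not given).
    Returns the index of the winning class.
    """
    ll = np.asarray(log_likelihoods, dtype=float)
    if weights is None:
        weights = np.ones(ll.shape[0]) / ll.shape[0]
    fused = np.asarray(weights) @ ll  # weighted sum of log-likelihoods
    return int(np.argmax(fused))
```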
4.
Talantzis F. Pnevmatikakis A. Constantinides A.G. 《IEEE transactions on systems, man, and cybernetics. Part B, Cybernetics》2008,38(3):799-807
We propose a system for detecting the active speaker in cluttered and reverberant environments where more than one person speaks and moves. Rather than using only audio information, the system utilizes audiovisual information from multiple acoustic and video sensors that feed separate audio and video tracking modules. The audio module operates using a particle filter (PF) and an information-theoretic framework to provide accurate acoustic source location under reverberant conditions. The video subsystem combines in 3-D a number of 2-D trackers based on a variation of Stauffer's adaptive background algorithm with spatiotemporal adaptation of the learning parameters and a Kalman tracker in a feedback configuration. Extensive experiments show that gains are to be expected when fusion of the separate modalities is performed to detect the active speaker.
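The bootstrap particle-filter cycle underlying an audio module like the one above can be sketched in one dimension (an illustration under simplified assumptions: a Gaussian likelihood stands in for the paper's information-theoretic scoring, and all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, weights, observation, motion_std, obs_std):
    """One predict/update/resample cycle of a bootstrap particle filter,
    here tracking a 1-D source position from a noisy location observation."""
    # Predict: diffuse particles with a random-walk motion model.
    particles = particles + rng.normal(0.0, motion_std, size=particles.shape)
    # Update: reweight each particle by the observation likelihood.
    weights = weights * np.exp(-0.5 * ((observation - particles) / obs_std) ** 2)
    weights = weights / weights.sum()
    # Resample: draw a fresh, equally weighted particle set.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))
```

Repeating the cycle concentrates the particle cloud around the source location even when individual observations are noisy.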
5.
Yildirim S. Narayanan S. 《IEEE transactions on audio, speech, and language processing》2009,17(1):2-12
The presence of disfluencies in spontaneous speech, while posing a challenge for robust automatic recognition, also offers a means for gaining additional insight into a speaker's communicative and cognitive state. This paper analyzes disfluencies in children's spontaneous speech, in the context of spoken-dialog-based computer game play, and addresses the automatic detection of disfluency boundaries. Although several approaches have been proposed to detect disfluencies in speech, relatively little work has been done to utilize visual information to improve the performance and robustness of the disfluency detection system. This paper describes the use of visual information along with prosodic and language information to detect the presence of disfluencies in a child's computer-directed speech, and shows how these information sources can be integrated to increase the overall information available for disfluency detection. Experimental results on our children's multimodal dialog corpus indicate that disfluency detection accuracy of over 80% can be obtained by utilizing audio-visual information. Specifically, the addition of visual information to prosody and language features yields relative improvements in disfluency detection error rates of 3.6% and 6.3% for information fusion at the feature level and decision level, respectively.
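The two fusion levels compared in this abstract can be sketched as follows (a toy illustration: the function names and the 0.5 decision threshold are assumptions, and the paper's actual feature sets are far richer):

```python
import numpy as np

def fuse_features(prosody, language, visual):
    """Feature-level fusion: concatenate per-frame feature vectors from
    the three information sources into one vector for a single classifier."""
    return np.concatenate([prosody, language, visual])

def fuse_decisions(posteriors):
    """Decision-level fusion: average the per-classifier posterior
    probabilities of 'disfluency boundary' and threshold the result."""
    return bool(np.mean(posteriors) >= 0.5)
```

Feature-level fusion lets one classifier model cross-modal correlations; decision-level fusion keeps the modality-specific classifiers independent and combines only their outputs.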
6.
Kanda T. Miyashita T. Osada T. Haikawa Y. Ishiguro H. 《Robotics, IEEE Transactions on》2008,24(3):725-735
Identifying the extent to which the appearance of a humanoid robot affects human behavior toward it is important. We compared participants' impressions of, and behaviors toward, two real humanoid robots in simple human-robot interactions. The two robots have different appearances but are controlled to perform the same recorded utterances and motions, adjusted via a motion-capturing system. We conducted an experiment with 48 human participants who individually interacted with the two robots, and also with a human for reference. The results revealed that the different appearances did not affect participants' verbal behaviors, but did affect nonverbal behaviors such as distance and delay of response. These differences are explained by two factors: impressions and attributions.
7.
Stiefelhagen R. Ekenel H.K. Fugen C. Gieselmann P. Holzapfel H. Kraft F. Nickel K. Voit M. Waibel A. 《Robotics, IEEE Transactions on》2007,23(5):840-851
In this paper, we present our work in building technologies for natural multimodal human-robot interaction. We present our systems for spontaneous speech recognition, multimodal dialogue processing, and visual perception of a user, which includes localization, tracking, and identification of the user, recognition of pointing gestures, and recognition of a person's head orientation. Each of the components is described in the paper, and experimental results are presented. We also present several experiments on multimodal human-robot interaction, such as interaction using speech and gestures, automatic determination of the addressee during human-human-robot interaction, as well as interactive learning of dialogue strategies. The work and the components presented here constitute the core building blocks for audiovisual perception of humans and multimodal human-robot interaction used for the humanoid robot developed within the German research project (Sonderforschungsbereich) on humanoid cooperative robots.
8.
9.
Pavel Grigorenko Ando Saabas Enn Tyugu 《Electronic Notes in Theoretical Computer Science》2005,141(4):137
A compiler-compiler for visual languages is presented. It has been designed as a framework for building visual programming environments that translate schemas into textual representation as well as into programs representing the deep meaning of schemas. The deep semantics is implemented by applying attribute grammars to schema languages; attribute dependencies are implemented as methods of Java classes. Unlike compiler-compilers of textual languages, a large part of the framework is needed for support of interactive usage of a visual language.
10.
Full-Body Compliant Human–Humanoid Interaction: Balancing in the Presence of Unknown External Forces
This paper proposes an effective framework for human-humanoid robot physical interaction. Its key component is a new control technique for full-body balancing in the presence of external forces, which is presented and then validated empirically. We have adopted an integrated system approach to develop humanoid robots, and herein we describe the importance of replicating human-like capabilities and responses during human-robot interaction in this context. Our balancing controller provides gravity compensation, making the robot passive and thereby facilitating safe physical interaction. The method operates by setting an appropriate ground reaction force and transforming these forces into full-body joint torques. It handles an arbitrary number of force interaction points on the robot and does not require force measurement at the contact points of interest. It requires neither inverse kinematics nor inverse dynamics, and it can adapt to uneven ground surfaces. Because it operates as a force control process, it can accommodate simultaneous control processes using force-, velocity-, or position-based control. Forces are distributed over supporting contact points in an optimal manner, and joint redundancy is resolved by damping injection in the context of passivity. We present various force interaction experiments using our full-sized bipedal humanoid platform, including compliant balancing even when affected by unknown external forces, which demonstrate the effectiveness of the method.
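The core force-to-torque transformation this abstract describes (mapping desired contact forces through the contact Jacobians, with no inverse kinematics or inverse dynamics) can be sketched as follows (illustrative only; `contact_torques` is a hypothetical name, and the optimal distribution of forces over contacts is omitted):

```python
import numpy as np

def contact_torques(jacobians, contact_forces):
    """Map desired contact forces to full-body joint torques via the
    contact Jacobians: tau = sum_i J_i^T f_i.

    jacobians:      list of (3, n_joints) Jacobians, one per contact point
    contact_forces: list of (3,) desired forces at those points
    """
    tau = None
    for J, f in zip(jacobians, contact_forces):
        t = J.T @ f  # torque contribution of this contact
        tau = t if tau is None else tau + t
    return tau
```

Because the mapping is just a sum of Jacobian-transpose terms, it extends to an arbitrary number of contact points, matching the claim in the abstract.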
11.
In this paper, we present an approach for recognizing pointing gestures in the context of human–robot interaction. In order to obtain input features for gesture recognition, we perform visual tracking of head, hands and head orientation. Given the images provided by a calibrated stereo camera, color and disparity information are integrated into a multi-hypothesis tracking framework in order to find the 3D-positions of the respective body parts. Based on the hands’ motion, an HMM-based classifier is trained to detect pointing gestures. We show experimentally that the gesture recognition performance can be improved significantly by using information about head orientation as an additional feature. Our system aims at applications in the field of human–robot interaction, where it is important to do run-on recognition in real-time, to allow for robot egomotion and not to rely on manual initialization.
12.
《Neural Networks, IEEE Transactions on》2008,19(12):2032-2043
13.
Vedad Hulusic Carlo Harvey Kurt Debattista Nicolas Tsingos Steve Walker David Howard Alan Chalmers 《Computer Graphics Forum》2012,31(1):102-131
In recent years, research in the three-dimensional sound generation field has been primarily focused upon new applications of spatialized sound. In the computer graphics community, the use of such techniques is most commonly found applied to virtual, immersive environments. However, the field is more varied and diverse than this, and other research tackles the problem in a more complete, and computationally expensive, manner. Furthermore, the simulation of light and sound wave propagation is still unachievable at a physically accurate spatio-temporal quality in real time. Although the Human Visual System (HVS) and the Human Auditory System (HAS) are exceptionally sophisticated, they also contain certain perceptual and attentional limitations. Researchers, in fields such as psychology, have been investigating these limitations for several years and have produced findings which may be exploited in other fields. This paper provides a comprehensive overview of the major techniques for generating spatialized sound and, in addition, discusses perceptual and cross-modal influences to consider. We also describe current limitations and provide an in-depth look at emerging topics in the field.
14.
《Neural Networks, IEEE Transactions on》2009,20(6):992-1008
15.
Despite its great importance, there has been no general consensus on how to model the trends in time-series data. Compared to traditional approaches, neural networks (NNs) have shown some promise in time-series forecasting. This paper investigates how to best model trend time series using NNs. Four different strategies (raw data, raw data with time index, detrending, and differencing) are used to model various trend patterns (linear, nonlinear, deterministic, stochastic, and breaking trend). We find that with NNs differencing often gives meritorious results regardless of the underlying data generating processes (DGPs). This finding is also confirmed by the real gross national product (GNP) series.
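The differencing strategy this study favors is simple to state in code (a generic sketch of the preprocessing step and its inverse, not the paper's experimental pipeline):

```python
import numpy as np

def difference(series):
    """First-difference a series: model the changes rather than the levels."""
    x = np.asarray(series, dtype=float)
    return x[1:] - x[:-1]

def undifference(diffs, first_value):
    """Invert differencing, e.g. to bring an NN's forecasts of the
    differenced series back to the original scale."""
    return np.concatenate([[first_value], first_value + np.cumsum(diffs)])
```

Training the NN on the differenced series removes the trend component, which is why the strategy works across the different data-generating processes listed above.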
16.
The Takagi-Sugeno (T-S) model of fuzzy delay systems with impulses is first presented in this paper. By means of classical analysis methods and the Razumikhin technique, criteria for uniform stability and uniform asymptotic stability of T-S fuzzy delay systems with impulses are obtained. Three numerical examples are also discussed to illustrate the efficiency of the obtained results.
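The T-S model structure referred to above blends rule-local models by normalized membership weights. A minimal inference sketch (the delay and impulse dynamics analyzed in the paper are not modeled here):

```python
import numpy as np

def ts_fuzzy_output(memberships, local_outputs):
    """Takagi-Sugeno inference: the global output is the membership-
    weighted average of the rule-local model outputs,
    y = (sum_i h_i * y_i) / (sum_i h_i)."""
    h = np.asarray(memberships, dtype=float)
    y = np.asarray(local_outputs, dtype=float)
    return float(h @ y / h.sum())
```

In a T-S system each `local_outputs[i]` would itself come from a local linear model, so the global dynamics interpolate smoothly between the rules.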
17.
18.
Bing Chen Xiao-Ping Liu Shao-Cheng Tong Chong Lin 《Fuzzy Systems, IEEE Transactions on》2008,16(3):652-663
This paper discusses the stabilization of Takagi-Sugeno (T-S) fuzzy systems with bounded and time-varying input delay. The robust stabilization via state feedback is first addressed, and delay-dependent stabilization conditions are proposed in terms of LMIs. Observer-based feedback stabilization is also discussed for T-S fuzzy input delay systems without uncertainties. A separate design principle is developed. Some illustrative examples are given to show the effectiveness and the feasibility of the proposed methods.
19.
Bo Liu Tianguang Chu Long Wang Guangming Xie 《Automatic Control, IEEE Transactions on》2008,53(4):1009-1013
This note studies the controllability of a leader-follower network of dynamic agents linked via neighbor rules. The leader is a particular agent acting as an external input to steer the other member agents. Based on switched control system theory, we derive a simple controllability condition for the network with switching topology, which indicates that the controllability of the whole network does not need to rely on that of the network for every specific topology. This merit provides convenience and flexibility in design and application of multiagent networks. For the fixed topology case, we show that the network is uncontrollable whenever the leader has an unbiased action on every member, regardless of the connectivity of the members themselves. This gives new insight into the relation between the controllability and the connectivity of the leader-follower network. We also give a formula for formation control of the network.
20.
《Robotics, IEEE Transactions on》2008,24(6):1274-1288