Similar Literature
20 similar documents were retrieved (search time: 31 ms).
1.
This paper addresses the problem of distant speech acquisition in multiparty meetings, using multiple microphones and cameras. Microphone array beamforming techniques present a potential alternative to close-talking microphones by providing speech enhancement through spatial filtering. Beamforming techniques, however, rely on knowledge of the speaker location. In this paper, we present an integrated approach in which an audio-visual multiperson tracker is used to track active speakers with high accuracy. Speech enhancement is then achieved using microphone array beamforming followed by a novel postfiltering stage. Finally, speech recognition is performed to evaluate the quality of the enhanced speech signal. The approach is evaluated on data recorded in a real meeting room for stationary-speaker, moving-speaker, and overlapping-speech scenarios. The results show that the speech enhancement and recognition performance achieved with our approach is significantly better than that of a single table-top microphone and comparable to that of a lapel microphone in some of the scenarios. The results also indicate that the audio-visual system performs significantly better than an audio-only system, in terms of both enhancement and recognition. This shows that the accurate speaker tracking provided by the audio-visual sensor array is beneficial for improving recognition performance in a microphone array-based speech recognition system.
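
A minimal illustration of the array-steering idea behind the enhancement step is sketched below as a plain delay-and-sum beamformer aimed at a known (tracked) speaker position. This is a generic sketch, not the paper's beamformer or postfilter; the function name, geometry inputs, and default speed of sound are assumptions.

```python
import numpy as np

def delay_and_sum(signals, mic_positions, speaker_position, fs, c=343.0):
    """Steer a microphone array toward a known speaker by delay-and-sum.

    signals          : (n_mics, n_samples) synchronized recordings
    mic_positions    : (n_mics, 3) microphone coordinates in metres
    speaker_position : (3,) tracked speaker position in metres
    fs               : sampling rate in Hz; c is the speed of sound in m/s
    """
    n_mics, n_samples = signals.shape
    # Propagation delay from the speaker to each microphone.
    delays = np.linalg.norm(mic_positions - speaker_position, axis=1) / c
    # Advance each channel so the direct-path arrivals line up, then average.
    shifts = np.round((delays - delays.min()) * fs).astype(int)
    enhanced = np.zeros(n_samples)
    for m in range(n_mics):
        enhanced += np.roll(signals[m], -shifts[m])
    return enhanced / n_mics
```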

2.
A novel model is presented to learn bimodally informative structures from audio–visual signals. The signal is represented as a sparse sum of audio–visual kernels. Each kernel is a bimodal function consisting of synchronous snippets of an audio waveform and a spatio–temporal visual basis function. To represent an audio–visual signal, the kernels can be positioned independently and arbitrarily in space and time. The proposed algorithm uses unsupervised learning to form dictionaries of bimodal kernels from audio–visual material. The basis functions that emerge during learning capture salient audio–visual data structures. In addition, it is demonstrated that the learned dictionary can be used to locate sources of sound in the movie frame. Specifically, in sequences containing two speakers, the algorithm can robustly localize a speaker even in the presence of severe acoustic and visual distracters.
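
As a rough, audio-only illustration of representing a signal as a sparse sum of shiftable kernels, the sketch below runs greedy matching pursuit over a fixed set of 1-D kernels. The actual model uses learned, joint audio-visual kernels; this simplified decomposition, its function name, and the unit-norm kernel assumption are illustrative only.

```python
import numpy as np

def matching_pursuit(signal, kernels, n_atoms=10):
    """Greedy sparse decomposition of a 1-D signal over shiftable kernels.

    signal  : (n_samples,) audio signal
    kernels : list of 1-D unit-norm arrays, each shorter than the signal
    Returns a list of (kernel_index, shift, amplitude) atoms and the residual.
    """
    residual = signal.astype(float).copy()
    atoms = []
    for _ in range(n_atoms):
        best = None
        for k, kern in enumerate(kernels):
            # Inner product of the residual with every shift of this kernel.
            corr = np.correlate(residual, kern, mode="valid")
            shift = int(np.argmax(np.abs(corr)))
            amp = corr[shift]
            if best is None or abs(amp) > abs(best[2]):
                best = (k, shift, amp)
        k, shift, amp = best
        # Subtract the selected atom from the residual.
        residual[shift:shift + len(kernels[k])] -= amp * kernels[k]
        atoms.append(best)
    return atoms, residual
```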

3.
Advances in computer processing power and emerging algorithms are allowing new ways of envisioning human-computer interaction. Although the benefit of audio-visual fusion for affect recognition is expected from both the psychological and the engineering perspectives, most existing approaches to automatic human affect analysis are unimodal: the information processed by the computer system is limited to either face images or speech signals. This paper focuses on the development of a computing algorithm that uses both audio and visual sensors to detect and track a user's affective state to aid computer decision making. Using our multistream fused hidden Markov model (MFHMM), we analyzed coupled audio and visual streams to detect four cognitive states (interest, boredom, frustration and puzzlement) and seven prototypical emotions (neutral, happiness, sadness, anger, disgust, fear and surprise). The MFHMM allows the building of an optimal connection among multiple streams according to the maximum entropy principle and the maximum mutual information criterion. Person-independent experimental results from 20 subjects in 660 sequences show that the MFHMM approach outperforms face-only HMM, pitch-only HMM, energy-only HMM, and independent HMM fusion, under both clean and varying audio-channel noise conditions.
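
The MFHMM couples the streams inside a single model; as a much simpler, hedged illustration of the baseline it is compared against (independent per-stream models whose scores are combined), the sketch below fuses per-stream log-likelihoods with fixed weights. The stream names, weights, and scores are hypothetical.

```python
import numpy as np

def fuse_stream_loglikelihoods(loglik_per_stream, stream_weights):
    """Combine per-class log-likelihoods from several streams.

    loglik_per_stream : (n_streams, n_classes) log-likelihoods, one row per
                        stream (e.g. face, pitch, energy) from its own model
    stream_weights    : (n_streams,) non-negative weights summing to 1
    Returns the index of the winning class under the weighted combination.
    """
    loglik = np.asarray(loglik_per_stream, dtype=float)
    w = np.asarray(stream_weights, dtype=float)
    fused = w @ loglik          # weighted sum of log-likelihoods per class
    return int(np.argmax(fused))

# Hypothetical example: three streams scoring four cognitive states.
scores = [[-10.2, -11.5, -9.8, -12.0],   # face stream
          [-11.0, -10.1, -10.9, -11.3],  # pitch stream
          [-10.5, -10.8, -10.2, -11.9]]  # energy stream
print(fuse_stream_loglikelihoods(scores, [0.5, 0.3, 0.2]))
```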

4.
We propose a system for detecting the active speaker in cluttered and reverberant environments where more than one person speaks and moves. Rather than using only audio information, the system utilizes audiovisual information from multiple acoustic and video sensors that feed separate audio and video tracking modules. The audio module operates using a particle filter (PF) and an information-theoretic framework to provide accurate acoustic source location under reverberant conditions. The video subsystem combines in 3-D a number of 2-D trackers based on a variation of Stauffer's adaptive background algorithm with spatiotemporal adaptation of the learning parameters and a Kalman tracker in a feedback configuration. Extensive experiments show that gains are to be expected when fusion of the separate modalities is performed to detect the active speaker.
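
A bootstrap particle filter is one standard way to realize the audio module's PF-based localization; the sketch below shows a single predict-update-resample cycle over 2-D candidate positions with a user-supplied acoustic likelihood. It is a generic sketch, not the paper's information-theoretic formulation, and the motion model and resampling threshold are assumptions.

```python
import numpy as np

def particle_filter_step(particles, weights, likelihood_fn, motion_std=0.05):
    """One predict-update-resample cycle of a bootstrap particle filter.

    particles     : (n, 2) candidate source positions in the room plane
    weights       : (n,) normalized particle weights
    likelihood_fn : maps an (n, 2) array of positions to (n,) likelihoods
                    computed from the current audio frame (user supplied)
    """
    n = len(particles)
    # Predict: random-walk motion model for a possibly moving speaker.
    particles = particles + np.random.normal(0.0, motion_std, particles.shape)
    # Update: reweight particles by the acoustic likelihood of each position.
    weights = weights * likelihood_fn(particles)
    weights = weights / weights.sum()
    # Resample when the effective sample size collapses.
    if 1.0 / np.sum(weights ** 2) < n / 2:
        idx = np.random.choice(n, size=n, p=weights)
        particles, weights = particles[idx], np.full(n, 1.0 / n)
    estimate = np.average(particles, axis=0, weights=weights)
    return particles, weights, estimate
```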

5.
The presence of disfluencies in spontaneous speech, while posing a challenge for robust automatic recognition, also offers a means of gaining additional insight into a speaker's communicative and cognitive state. This paper analyzes disfluencies in children's spontaneous speech, in the context of spoken-dialog-based computer game play, and addresses the automatic detection of disfluency boundaries. Although several approaches have been proposed to detect disfluencies in speech, relatively little work has been done to utilize visual information to improve the performance and robustness of disfluency detection. This paper describes the use of visual information along with prosodic and language information to detect the presence of disfluencies in a child's computer-directed speech and shows how these information sources can be integrated to increase the overall information available for disfluency detection. The experimental results on our children's multimodal dialog corpus indicate that disfluency detection accuracy of over 80% can be obtained by utilizing audio-visual information. Specifically, the results show that the addition of visual information to prosody and language features yields relative improvements in disfluency detection error rates of 3.6% and 6.3% for information fusion at the feature level and decision level, respectively.
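
The contrast between feature-level and decision-level fusion mentioned above can be illustrated with a toy sketch: one classifier trained on concatenated audio and visual features versus one classifier per modality whose posteriors are averaged. The logistic-regression models, equal decision weights, and feature matrices are assumptions, not the paper's actual detectors.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def feature_level_fusion(X_audio, X_visual, y):
    """Train one classifier on concatenated audio and visual feature vectors."""
    X = np.hstack([X_audio, X_visual])
    return LogisticRegression(max_iter=1000).fit(X, y)

def decision_level_fusion(X_audio, X_visual, y):
    """Train one classifier per modality and average their posteriors."""
    clf_a = LogisticRegression(max_iter=1000).fit(X_audio, y)
    clf_v = LogisticRegression(max_iter=1000).fit(X_visual, y)
    def predict(Xa, Xv):
        proba = 0.5 * clf_a.predict_proba(Xa) + 0.5 * clf_v.predict_proba(Xv)
        return proba.argmax(axis=1)   # 1 = disfluency boundary, 0 = fluent
    return predict
```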

6.
Identifying the extent to which the appearance of a humanoid robot affects human behavior toward it is important. We compared participant impressions of, and behaviors toward, two real humanoid robots in simple human-robot interactions. The two robots have different appearances but are controlled to perform the same recorded utterances and motions, which are adjusted using a motion-capturing system. We conducted an experiment with 48 human participants who individually interacted with the two robots, and also with a human for reference. The results revealed that the different appearances did not affect participants' verbal behaviors, but they did affect nonverbal behaviors such as distance and delay of response. These differences are explained by two factors: impressions and attributions.

7.
In this paper, we present our work in building technologies for natural multimodal human-robot interaction. We present our systems for spontaneous speech recognition, multimodal dialogue processing, and visual perception of a user, which includes localization, tracking, and identification of the user, recognition of pointing gestures, and recognition of a person's head orientation. Each of the components is described in the paper and experimental results are presented. We also present several experiments on multimodal human-robot interaction, such as interaction using speech and gestures, the automatic determination of the addressee during human-human-robot interaction, as well as interactive learning of dialogue strategies. The work and the components presented here constitute the core building blocks for audiovisual perception of humans and multimodal human-robot interaction used for the humanoid robot developed within the German research project (Sonderforschungsbereich) on humanoid cooperative robots.

9.
A compiler-compiler for visual languages is presented. It has been designed as a framework for building visual programming environments that translate schemas into textual representations as well as into programs representing the deep meaning of the schemas. The deep semantics is implemented by applying attribute grammars to schema languages; attribute dependencies are implemented as methods of Java classes. Unlike compiler-compilers for textual languages, a large part of the framework is needed to support interactive use of a visual language.

10.
This paper proposes an effective framework for human-humanoid robot physical interaction. Its key component is a new control technique for full-body balancing in the presence of external forces, which is presented and then validated empirically. We have adopted an integrated system approach to develop humanoid robots. Herein, we describe the importance of replicating human-like capabilities and responses during human-robot interaction in this context. Our balancing controller provides gravity compensation, making the robot passive and thereby facilitating safe physical interactions. The method operates by setting an appropriate ground reaction force and transforming these forces into full-body joint torques. It handles an arbitrary number of force interaction points on the robot. It does not require force measurement at the contact points of interest. It requires neither inverse kinematics nor inverse dynamics. It can adapt to uneven ground surfaces. It operates as a force control process and can therefore accommodate simultaneous control processes using force-, velocity-, or position-based control. Forces are distributed over supporting contact points in an optimal manner. Joint redundancy is resolved by damping injection in the context of passivity. We present various force interaction experiments using our full-sized bipedal humanoid platform, including compliant balancing even when affected by unknown external forces, which demonstrate the effectiveness of the method.
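
The central mapping described above, turning desired contact forces into whole-body joint torques, is commonly realized with a Jacobian-transpose relation plus gravity compensation; a minimal sketch of that relation follows. It omits the paper's optimal force distribution and damping-based redundancy resolution, and all variable names are assumptions.

```python
import numpy as np

def contact_force_to_torques(jacobians, desired_forces, gravity_torque):
    """Map desired contact wrenches to whole-body joint torques.

    jacobians      : list of (6, n_joints) contact Jacobians, one per
                     interaction/support point (linear + angular rows)
    desired_forces : list of (6,) desired wrenches at those points
    gravity_torque : (n_joints,) torques that compensate gravity
    Returns the commanded joint torque vector.
    """
    tau = np.array(gravity_torque, dtype=float)
    for J, f in zip(jacobians, desired_forces):
        # Jacobian-transpose mapping: a wrench applied at a contact point
        # is equivalent to J^T f at the joints.
        tau += J.T @ np.asarray(f, dtype=float)
    return tau
```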

11.
In this paper, we present an approach for recognizing pointing gestures in the context of human–robot interaction. In order to obtain input features for gesture recognition, we perform visual tracking of the head, the hands, and the head orientation. Given the images provided by a calibrated stereo camera, color and disparity information are integrated into a multi-hypothesis tracking framework in order to find the 3D positions of the respective body parts. Based on the hands' motion, an HMM-based classifier is trained to detect pointing gestures. We show experimentally that the gesture recognition performance can be improved significantly by using information about head orientation as an additional feature. Our system aims at applications in the field of human–robot interaction, where it is important to perform run-on recognition in real time, to allow for robot egomotion, and not to rely on manual initialization.
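
A hedged sketch of the HMM-based classification step, assuming the hmmlearn library: one Gaussian HMM is trained on pointing sequences and one on background motion, and a segment is labeled by comparing log-likelihoods. The feature layout, number of states, and two-model scheme are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_gesture_hmm(sequences, n_states=5):
    """Fit a Gaussian HMM to a list of (T_i, n_features) feature sequences,
    e.g. 3-D hand positions, optionally augmented with head orientation."""
    X = np.vstack(sequences)
    lengths = [len(s) for s in sequences]
    model = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
    return model.fit(X, lengths)

def is_pointing(sequence, pointing_hmm, background_hmm):
    """Classify a segment by comparing the log-likelihoods of the two models."""
    return pointing_hmm.score(sequence) > background_hmm.score(sequence)
```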

12.
This paper presents learning multilayer Potts perceptrons (MLPotts) for data-driven function approximation. A Potts perceptron is composed of a receptive field and a $K$-state transfer function that is generalized from the sigmoid-like transfer functions of traditional perceptrons. An MLPotts network is organized to perform translation from a high-dimensional input to the sum of multiple postnonlinear projections, each with its own postnonlinearity realized by a weighted $K$-state transfer function. MLPotts networks span a function space that theoretically covers the network functions of multilayer perceptrons. Compared with traditional perceptrons, weighted Potts perceptrons realize more flexible postnonlinear functions for nonlinear mappings. Numerical simulations show that MLPotts learning by the Levenberg–Marquardt (LM) method significantly improves traditional supervised learning of multilayer perceptrons for data-driven function approximation.

13.
In recent years, research in the three‐dimensional sound generation field has been primarily focussed upon new applications of spatialized sound. In the computer graphics community, the use of such techniques is most commonly found in virtual, immersive environments. However, the field is more varied and diverse than this, and other research tackles the problem in a more complete, and computationally expensive, manner. Furthermore, the simulation of light and sound wave propagation is still unachievable at a physically accurate spatio‐temporal quality in real time. Although the Human Visual System (HVS) and the Human Auditory System (HAS) are exceptionally sophisticated, they also contain certain perceptual and attentional limitations. Researchers, in fields such as psychology, have been investigating these limitations for several years and have come up with findings which may be exploited in other fields. This paper provides a comprehensive overview of the major techniques for generating spatialized sound and, in addition, discusses perceptual and cross‐modal influences to consider. We also describe current limitations and provide an in‐depth look at the emerging topics in the field.

14.
There have been many computational models mimicking the visual cortex that are based on spatial adaptations of unsupervised neural networks. In this paper, we present a new model, called the neuronal cluster, which includes spatial as well as temporal weights in its unified adaptation scheme. The “in-place” nature of the model is based on two biologically plausible learning rules, the Hebbian rule and lateral inhibition. We present a mathematical demonstration that the temporal weights are derived from the delay in lateral inhibition. By training on natural videos, this model can develop spatio–temporal features such as orientation-selective cells, motion-sensitive cells, and spatio–temporal complex cells. The unified nature of the adaptation scheme allows us to construct a multilayered and task-independent attention selection network which uses the same learning rule for edge, motion, and color detection, and this network can be used for attention selection in both static and dynamic scenes.
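
The two learning rules named above can be illustrated with a minimal numpy sketch: a Hebbian weight update applied only to the neurons that survive a hard, winner-take-all form of lateral inhibition. This ignores the model's temporal weights and layered structure; the normalization step and the top-k inhibition are assumptions.

```python
import numpy as np

def hebbian_with_lateral_inhibition(W, x, lr=0.01, k_winners=1):
    """One update of a layer of linear neurons with Hebbian learning and
    hard lateral inhibition (only the top-k responding neurons learn).

    W : (n_neurons, n_inputs) weight matrix, rows kept near unit norm
    x : (n_inputs,) input patch (e.g. a whitened video patch)
    """
    y = W @ x                                   # neuron responses
    winners = np.argsort(y)[-k_winners:]        # lateral inhibition: top-k only
    for i in winners:
        W[i] += lr * y[i] * x                   # Hebbian update
        W[i] /= np.linalg.norm(W[i]) + 1e-12    # keep weights bounded
    return W, y
```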

15.
Despite its great importance, there has been no general consensus on how to model the trends in time-series data. Compared to traditional approaches, neural networks (NNs) have shown some promise in time-series forecasting. This paper investigates how to best model trend time series using NNs. Four different strategies (raw data, raw data with time index, detrending, and differencing) are used to model various trend patterns (linear, nonlinear, deterministic, stochastic, and breaking trend). We find that, with NNs, differencing often gives meritorious results regardless of the underlying data generating processes (DGPs). This finding is also confirmed by the real gross national product (GNP) series.
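
A hedged sketch of the differencing strategy, which the paper finds most robust: difference the series, fit a small neural network on lagged differences, and add the predicted difference back to the last observation. The lag order, network size, and use of scikit-learn's MLPRegressor are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def forecast_with_differencing(series, n_lags=4):
    """Difference a trending series, fit an MLP on lagged differences,
    and return a one-step-ahead forecast on the original scale."""
    series = np.asarray(series, dtype=float)
    diff = np.diff(series)                      # remove the trend
    X = np.array([diff[i:i + n_lags] for i in range(len(diff) - n_lags)])
    y = diff[n_lags:]
    model = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000,
                         random_state=0).fit(X, y)
    next_diff = model.predict(diff[-n_lags:].reshape(1, -1))[0]
    return series[-1] + next_diff               # undo the differencing
```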

16.
The Takagi-Sugeno (T-S) model of fuzzy delay systems with impulses is first presented in this paper. By means of classical analysis methods and the Razumikhin technique, criteria of uniform stability and uniform asymptotic stability for T-S fuzzy delay systems with impulses are obtained. Three numerical examples are also discussed to illustrate the efficiency of the obtained results.
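
For reference, the generic rule form of a T-S fuzzy delay system with impulses can be written as below; the notation (A_i, B_i, delay tau, impulse maps I_k, normalized memberships h_i) is the textbook one and is assumed here rather than taken from the paper.

```latex
% Plant rule i (i = 1,...,r):
%   IF z_1(t) is M_{i1} and ... and z_p(t) is M_{ip}, THEN
%     \dot{x}(t) = A_i x(t) + B_i x(t-\tau)                 for t \neq t_k,
%     \Delta x(t_k) = x(t_k^+) - x(t_k^-) = I_k(x(t_k^-))   at impulse times t_k.
% Blending the rules with normalized memberships h_i(z(t)) gives, between impulses,
\dot{x}(t) \;=\; \sum_{i=1}^{r} h_i\bigl(z(t)\bigr)\,\bigl[\,A_i\,x(t) + B_i\,x(t-\tau)\,\bigr],
\qquad t \neq t_k .
```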

17.
18.
This paper discusses the stabilization of Takagi-Sugeno (T-S) fuzzy systems with bounded, time-varying input delay. Robust stabilization via state feedback is first addressed, and delay-dependent stabilization conditions are proposed in terms of LMIs. Observer-based feedback stabilization is also discussed for T-S fuzzy input-delay systems without uncertainties. A separate design principle is developed. Some illustrative examples are given to show the effectiveness and feasibility of the proposed methods.

19.
This note studies the controllability of a leader-follower network of dynamic agents linked via neighbor rules. The leader is a particular agent acting as an external input to steer the other member agents. Based on switched control system theory, we derive a simple controllability condition for the network with switching topology, which indicates that the controllability of the whole network need not rely on that of the network under every specific topology. This merit provides convenience and flexibility in the design and application of multiagent networks. For the fixed-topology case, we show that the network is uncontrollable whenever the leader has an unbiased action on every member, regardless of the connectivity of the members themselves. This gives new insight into the relation between the controllability and the connectivity of the leader-follower network. We also give a formula for formation control of the network.
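
For the fixed-topology case, controllability of a linear pair (A, B) is checked with the classical Kalman rank test; a minimal sketch follows. Interpreting A as the follower-to-follower block of the network dynamics and B as the leader's coupling column is an assumption about the usual neighbor-rule model, not the note's exact formulation.

```python
import numpy as np

def is_controllable(A, B, tol=1e-9):
    """Kalman rank test: (A, B) is controllable iff the controllability
    matrix [B, AB, ..., A^(n-1) B] has full row rank n."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    if B.ndim == 1:                      # allow a single-leader input vector
        B = B[:, None]
    n = A.shape[0]
    blocks = [B]
    for _ in range(n - 1):
        blocks.append(A @ blocks[-1])
    ctrb = np.hstack(blocks)
    return np.linalg.matrix_rank(ctrb, tol=tol) == n
```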

20.
This paper proposes a dynamic model of the swimming of elongated fish suited to the online control of biomimetic eel-like robots. The approach can be considered an extension of Lighthill's original reactive “large elongated body theory” to 3-D self-propulsion, to which a resistive empirical model has been added. While all the mathematical fundamentals have been detailed by Boyer et al. (http://www.irccyn.ec-nantes.fr/hebergement/Publications/2007/3721.pdf, 2007), this paper essentially focuses on the numerical validation and calibration of the model and the study of swimming gaits. The proposed model is coupled to an algorithm that allows us to compute the motion of the fish head and the field of internal control torque from knowledge of the imposed internal strain fields. Based on the Newton–Euler formalism of robot dynamics, this algorithm runs faster than real time. As far as precision is concerned, many tests obtained with several planar and 3-D gaits are reported and compared (in the planar case) with a Navier–Stokes solver, which until now has been devoted to planar swimming. The comparisons are very encouraging: in all the cases we tested, the differences between our simplified and reference simulations do not exceed 10%.
