Similar Documents
20 similar documents found (search time: 515 ms)
2.
This paper presents a method for separating the speech of individual speakers from the combined speech of two speakers. The main objective of this work is to demonstrate the significance of combined excitation-source-based temporal processing and short-time-spectrum-based spectral processing for separating the speech produced by individual speakers. Speech in a two-speaker environment is simultaneously collected over two spatially separated microphones. The speech signals are first subjected to excitation source information (linear prediction residual) based temporal processing. In temporal processing, the speech of each speaker is enhanced with respect to the other by relatively emphasizing the speech around the instants of significant excitation of the desired speaker, using a speaker-specific weight function. To further improve the separation, the temporally processed speech is subjected to spectral processing. This involves enhancing the regions around the pitch and harmonic peaks of short-time spectra computed from the temporally processed speech; the pitch estimate is obtained from the temporally processed speech itself. The performance of the proposed method is evaluated using (i) objective quality measures: percentage of energy loss, percentage of noise residue, signal-to-noise ratio (SNR) gain, and perceptual evaluation of speech quality (PESQ), and (ii) a subjective quality measure: mean opinion score (MOS). Experimental results are reported for both real and synthetic speech mixtures. The SNR gain and MOS values show that the proposed combined temporal and spectral processing method provides average improvements of 5.83% and 8.06%, respectively, over the best-performing individual temporal or spectral processing methods.
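The temporal stage above relies on the linear prediction (LP) residual as its excitation-source information. As a rough illustration (not the paper's speaker-weighting scheme; the helper names `lpc` and `lp_residual` are ours), the residual can be obtained by inverse-filtering the signal with LP coefficients estimated by the autocorrelation method:

```python
import numpy as np

def lpc(frame, order):
    """Estimate LP coefficients a[0..order] (a[0] = 1) by the
    autocorrelation method using the Levinson-Durbin recursion."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
        err *= 1.0 - k * k
    return a

def lp_residual(signal, order=10):
    """Inverse-filter the signal with its own LP coefficients; the output
    approximates the excitation source, with large values around the
    instants of significant excitation."""
    a = lpc(signal, order)
    return np.convolve(signal, a, mode="full")[:len(signal)]
```

For voiced speech, the residual peaks near glottal closure instants, which is what the speaker-specific weight functions in the paper emphasize.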

3.
We present a method for capturing the skeletal motions of humans using a sparse set of potentially moving cameras in an uncontrolled environment. Our approach is able to track multiple people even in front of cluttered and non-static backgrounds, and with unsynchronized cameras of varying image quality and frame rate. We rely entirely on optical information and do not make use of additional sensor information (e.g. depth images or inertial sensors). Our algorithm simultaneously reconstructs the skeletal pose parameters of multiple performers and the motion of each camera. This is facilitated by a new energy functional that captures the alignment of the model and the camera positions with the input videos in an analytic way. The approach can be adopted in many practical applications to replace complex and expensive motion capture studios with a few consumer-grade cameras, even in uncontrolled outdoor scenes. We demonstrate this on challenging multi-view video sequences captured with unsynchronized and moving (e.g. mobile-phone or GoPro) cameras.

4.
Various data mining methods have been developed in recent years for hepatitis study using a large temporal and relational database made available to the research community. In this work we introduce a novel temporal abstraction method to this study by detecting and exploiting temporal patterns and relations between events in viral hepatitis, such as "event A happened slightly before event B, and B ended simultaneously with event C". We developed algorithms that first detect significant temporal patterns in temporal sequences and then identify temporal relations between these patterns. Many findings obtained by applying data mining methods to transactions/graphs of temporal relations were shown to be significant through physician evaluation and by matching against publications in Medline.
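Relations like "A happened before B" or "B ended together with C" can be made precise with Allen's interval algebra over event intervals. A minimal sketch (the function name and relation labels are illustrative, not the paper's notation):

```python
def temporal_relation(a, b):
    """Classify how interval a relates to interval b (Allen's algebra).
    Intervals are (start, end) pairs with start < end."""
    (a0, a1), (b0, b1) = a, b
    if a1 < b0:
        return "before"
    if a1 == b0:
        return "meets"
    if a0 < b0 < a1 < b1:
        return "overlaps"
    if a0 == b0 and a1 < b1:
        return "starts"
    if b0 < a0 and a1 < b1:
        return "during"
    if b0 < a0 and a1 == b1:
        return "finishes"
    if a0 == b0 and a1 == b1:
        return "equals"
    # the remaining six relations are inverses of the above with roles swapped
    return temporal_relation(b, a) + "-inverse"
```

Mining then amounts to counting which of these relation labels occur significantly often between detected event patterns.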

5.
Abstract— A method is proposed to measure and characterize motion artifacts in matrix displays. By using a fast, V(λ)-corrected photodiode and a data-acquisition system, accurate measurements of the temporal luminance behavior (step response) are recorded. The motion artifacts of LCD and PDP displays are predicted from these measurements using properties of the human visual system. The method is validated with perceptual evaluation experiments, for which a new evaluation protocol is established. Finally, new measures are proposed to quantify the motion-rendering performance of these matrix displays.

6.
Event cameras, or neuromorphic cameras, mimic the human perception system in that they measure the per-pixel intensity change rather than the actual intensity level. In contrast to traditional cameras, such cameras capture new information about the scene at MHz rates in the form of sparse events. The high temporal resolution comes at the cost of losing the familiar per-pixel intensity information. In this work we propose a variational model that accurately models the behaviour of event cameras, enabling reconstruction of intensity images with arbitrary frame rate in real time. Our method is formulated on a per-event basis, where we explicitly incorporate information about the asynchronous nature of events via an event manifold induced by the relative timestamps of events. In our experiments we verify that solving the variational model on the manifold produces high-quality images without explicitly estimating optical flow. This paper is an extended version of our previous work (Reinbacher et al. in British machine vision conference (BMVC), 2016) and contains additional details of the variational model, an investigation of different data terms, and a quantitative evaluation of our method against competing methods as well as synthetic ground-truth data.
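For intuition about the event representation, here is naive per-pixel integration (not the paper's variational model on the event manifold): each event (x, y, t, p) signals that the log-intensity at pixel (x, y) changed by a contrast threshold ±c at time t.

```python
import numpy as np

def integrate_events(events, shape, c=0.1):
    """Naive reconstruction: replay events in time order and accumulate
    each polarity step in a log-intensity image."""
    log_img = np.zeros(shape)
    for x, y, t, p in sorted(events, key=lambda e: e[2]):
        log_img[y, x] += c * p   # p is +1 (brighter) or -1 (darker)
    return np.exp(log_img)       # back to linear intensity
```

Without regularization, sensor noise accumulates in such a direct integration; that gap is exactly what the variational data terms and the manifold formulation address.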

7.
Summary  The decomposition of a program into a control part, which is concerned only with determining the flow of control, and a kernel part, which is concerned only with computing output values, is proposed. It is shown that such a kernel-control decomposition is easily made and forms a useful and intuitive basis for analyzing and optimizing programs. A sequence of four progressively more abstract formal models of programs is developed, based on the concept of kernel-control decomposition. The application of these models to the study of program equivalence, termination, and optimization is outlined. The most general of the formal models, that of a set of programs forming a control structure class, formalizes a broad notion of equivalence of control structures.

This work was performed at the University of Texas at Austin under NSF Grants GJ-778, GJ-36424, and MCS 75-16858.

8.
Video salient object detection must combine spatial and temporal information to continuously locate motion-related salient objects in a video sequence; its core problem is how to efficiently characterize the spatio-temporal features of moving objects. Most existing video salient object detection algorithms extract temporal features using optical flow, ConvLSTM, or 3D convolution, and lack the ability to learn temporal information continuously. To this end, we design a robust spatial-temporal progressive learning network (STPLNet) to efficiently locate salient objects in video sequences. In the spatial domain, a U-shaped structure encodes and decodes each video frame; in the temporal domain, the network progressively encodes moving-object features by learning the main-body and deformation-region features of moving objects between frames, capturing the objects' temporal correlation and motion tendency. In comparison experiments against 13 mainstream video salient object detection algorithms on four public datasets, the proposed model achieves the best results on several metrics (max F, S-measure (S), MAE) while running at near real-time speed.

9.
Shadow removal for videos is an important and challenging vision task. In this paper, we present a novel shadow removal approach for videos captured by freely moving cameras using illumination transfer optimization. We first detect the shadows of the input video using interactive fast video matting. Then, based on the shadow detection results, we decompose the input video into overlapping 2D patches, and find coherent correspondences between the shadow and non-shadow patches via a discrete optimization technique built on a patch similarity metric. We finally remove the shadows of the input video sequences using an optimized illumination transfer method, which reasonably recovers the illumination information of the shadow regions and produces spatio-temporally consistent shadow-free videos. We also process the shadow boundaries to make the transition between shadow and non-shadow regions smooth. Compared with previous works, our method can handle videos captured by freely moving cameras and achieves better shadow removal results. We validate the effectiveness of the proposed algorithm via a variety of experiments.

10.
Learning to Perceive and Act by Trial and Error
This article considers adaptive control architectures that integrate active sensory-motor systems with decision systems based on reinforcement learning. One unavoidable consequence of active perception is that the agent's internal representation often confounds external world states. We call this phenomenon perceptual aliasing and show that it destabilizes existing reinforcement learning algorithms with respect to the optimal decision policy. We then describe a new decision system that overcomes these difficulties for a restricted class of decision problems. The system incorporates a perceptual subcycle within the overall decision cycle and uses a modified learning algorithm to suppress the effects of perceptual aliasing. The result is a control architecture that learns not only how to solve a task but also where to focus its visual attention in order to collect necessary sensory information.
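A minimal, hypothetical illustration of perceptual aliasing: two world states emit the same observation but demand different actions, so no memoryless (reactive) policy over observations can match a policy over true states. All names and reward values below are invented for the example.

```python
# Hypothetical two-state world: both states look like observation "o".
REWARD = {("s0", "left"): 1.0, ("s0", "right"): -1.0,
          ("s1", "left"): -1.0, ("s1", "right"): 1.0}
OBSERVE = {"s0": "o", "s1": "o"}   # aliased: the states are indistinguishable

def best_reactive_value():
    """Best average reward of a policy mapping observations to actions.
    Both states share one observation, so one action must serve both."""
    return max((REWARD[("s0", a)] + REWARD[("s1", a)]) / 2
               for a in ("left", "right"))

def best_state_value():
    """Best average reward when the true state is observable."""
    return sum(max(REWARD[(s, a)] for a in ("left", "right"))
               for s in ("s0", "s1")) / 2
```

The reactive policy earns 0 on average while the state-aware one earns 1; this gap is what the paper's perceptual subcycle tries to close by actively gathering disambiguating sensory information.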

11.
Space-time super-resolution
We propose a method for constructing a video sequence of high space-time resolution by combining information from multiple low-resolution video sequences of the same dynamic scene. Super-resolution is performed simultaneously in time and in space. By "temporal super-resolution," we mean recovering rapid dynamic events that occur faster than the regular frame rate. Such dynamic events are not visible (or else are observed incorrectly) in any of the input sequences, even when played in "slow motion." The spatial and temporal dimensions are very different in nature, yet are interrelated. This leads to interesting visual trade-offs in time and space and to new video applications. These include: 1) treatment of spatial artifacts (e.g., motion blur) by increasing the temporal resolution, and 2) combination of input sequences of different space-time resolutions (e.g., NTSC, PAL, and even high-quality still images) to generate a high-quality video sequence. We further analyze and compare characteristics of temporal super-resolution to those of spatial super-resolution. These include: the video cameras needed to obtain increased resolution; the upper bound on resolution improvement via super-resolution; and the temporal analogue of the spatial "ringing" effect.
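A toy example of the temporal side (the real method solves a joint space-time reconstruction problem; this only shows why sub-frame offsets between cameras carry extra temporal information): two cameras film at the same frame rate, but camera B triggers half a frame period later, so interleaving their frames doubles the effective frame rate.

```python
def interleave(frames_a, frames_b):
    """Merge two streams sampled at rate r, with stream B offset by half
    a frame period relative to stream A, into one stream at rate 2r."""
    merged = []
    for fa, fb in zip(frames_a, frames_b):
        merged.extend([fa, fb])
    return merged
```

With real footage the offsets are not exact sub-frame multiples and exposures overlap, which is why the paper poses reconstruction as solving a system of space-time constraints rather than simple interleaving.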

12.
We propose a novel framework for automatically discovering and learning behavioural context for video-based complex behaviour recognition and anomaly detection. Our work differs from most previous efforts on learning visual context in that our model learns multi-scale spatio-temporal rather than static context. Specifically, three types of behavioural context are investigated: behaviour spatial context, behaviour correlation context, and behaviour temporal context. To that end, the proposed framework consists of an activity-based semantic scene segmentation model for learning behaviour spatial context, and a cascaded probabilistic topic model for learning both behaviour correlation context and behaviour temporal context at multiple scales. These behaviour context models are deployed for recognising non-exaggerated multi-object interactive and co-existence behaviours in public spaces. In particular, we develop a method for detecting subtle behavioural anomalies against the learned context. The effectiveness of the proposed approach is validated by extensive experiments carried out using data captured from complex and crowded outdoor scenes.

15.
In this paper, we present a new unsupervised method to classify a set of multichannel (MC) signals with unknown events. Each signal is characterized by a sequence of events where the number of events, the start times, and the durations between events can change randomly. The proposed method assists the expert in the classification and event detection of MC signals, which is usually a tedious and difficult task. To this end, the problem of classifying MC signals characterized by a succession of events is first analyzed by transforming the MC signals into a set of easily interpretable temporal sequences. The algorithm detects events by means of an optimal unsupervised classification; it is not necessary to know the nature of the events or to formulate hypotheses about their behavior. A set of multichannel electromyographic (EMG) signals with events is then generated and used to test the proposed method.

16.
We present a new video‐based performance cloning technique. After training a deep generative network using a reference video capturing the appearance and dynamics of a target actor, we are able to generate videos where this actor reenacts other performances. All of the training data and the driving performances are provided as ordinary video segments, without motion capture or depth information. Our generative model is realized as a deep neural network with two branches, both of which train the same space‐time conditional generator, using shared weights. One branch, responsible for learning to generate the appearance of the target actor in various poses, uses paired training data, self‐generated from the reference video. The second branch uses unpaired data to improve generation of temporally coherent video renditions of unseen pose sequences. Through data augmentation, our network is able to synthesize images of the target actor in poses never captured by the reference video. We demonstrate a variety of promising results, where our method is able to generate temporally coherent videos, for challenging scenarios where the reference and driving videos consist of very different dance performances.

17.
Object Detection Using the Statistics of Parts
In this paper we describe a trainable object detector and its instantiations for detecting faces and cars at any size, location, and pose. To cope with variation in object orientation, the detector uses multiple classifiers, each spanning a different range of orientation. Each of these classifiers determines whether the object is present at a specified size within a fixed-size image window. To find the object at any location and size, these classifiers scan the image exhaustively. Each classifier is based on the statistics of localized parts. Each part is a transform from a subset of wavelet coefficients to a discrete set of values. Such parts are designed to capture various combinations of locality in space, frequency, and orientation. In building each classifier, we gathered the class-conditional statistics of these part values from representative samples of object and non-object images. We trained each classifier to minimize classification error on the training set by using AdaBoost with Confidence-Weighted Predictions (Schapire and Singer, 1999). In detection, each classifier computes the part values within the image window and looks up their associated class-conditional probabilities. The classifier then makes a decision by applying a likelihood ratio test. For efficiency, the classifier evaluates this likelihood ratio in stages. At each stage, the classifier compares the partial likelihood ratio to a threshold and decides whether to cease evaluation, labeling the input as non-object, or to continue further evaluation. The detector orders these stages of evaluation from a low-resolution to a high-resolution search of the image. Our trainable object detector achieves reliable and efficient detection of human faces and passenger cars with out-of-plane rotation.
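The staged evaluation can be sketched as follows (illustrative, with hypothetical thresholds; the paper derives its stage ordering from a coarse-to-fine resolution search):

```python
def cascade_classify(stage_llrs, stage_thresholds, final_threshold):
    """Evaluate a likelihood-ratio test in stages.

    stage_llrs: for each stage, the log-likelihood ratios of the parts
        evaluated at that stage.
    After each stage, the partial sum is compared with that stage's
    threshold; falling below it means early rejection (non-object)."""
    llr = 0.0
    for llrs, threshold in zip(stage_llrs, stage_thresholds):
        llr += sum(llrs)
        if llr < threshold:
            return False        # cease evaluation: label as non-object
    return llr >= final_threshold
```

Most image windows contain background and are rejected after the cheap early stages, which is what makes exhaustive scanning affordable.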

18.
In this paper, we propose a new video summarization procedure that produces a dynamic (video) abstract of the original video sequence. Our technique compactly summarizes video data by preserving its original temporal characteristics (visual activity) and semantically essential information. It relies on adaptive nonlinear sampling: the local sampling rate is directly proportional to the amount of visual activity in localized sub-shot units of the video. To obtain very short yet semantically meaningful summaries, we also present an event-oriented abstraction scheme in which two semantic events, emotional dialogue and violent action, are characterized and abstracted into the video summary before all other events. If the length of the summary permits, other non-key events are then added. The resulting video abstract is highly compact.
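The adaptive nonlinear sampling can be sketched like this (a simplification: activity is approximated here by mean absolute frame difference rather than the paper's sub-shot analysis, and the function name is illustrative):

```python
import numpy as np

def activity_sample(frames, budget):
    """Pick about `budget` frame indices so that sampling density is
    proportional to local visual activity. Taking equal steps along the
    cumulative-activity curve lands densely where the video changes fast
    and sparsely where it is static."""
    frames = np.asarray(frames, dtype=float).reshape(len(frames), -1)
    diffs = np.abs(np.diff(frames, axis=0)).mean(axis=1)
    activity = np.concatenate([[0.0], diffs]) + 1e-9  # floor keeps cumsum increasing
    cum = np.cumsum(activity)
    targets = np.linspace(cum[0], cum[-1], budget)
    # duplicates may collapse, so the result can be slightly under budget
    return sorted({int(i) for i in np.searchsorted(cum, targets)})
```

A static scene followed by rapid action thus contributes few frames from the static half and most of the budget to the active half.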

20.
This research is concerned with a gradient descent training algorithm for a target network that makes use of a helper feed-forward network (FFN) to represent the cost function required for training the target network. A helper FFN is trained because the cost relation for the target is not differentiable. The transfer function of the trained helper FFN provides a differentiable cost function of the parameter vector for the target network, allowing gradient search methods to find the optimum values of the parameters. The method is applied to the training of discrete recurrent networks (DRNNs) that are used as a tool for classification of temporal sequences of characters from some alphabet and identification of a finite state machine (FSM) that may have produced all the sequences. Classification of sequences input to the DRNN is based on the terminal state of the network after the last element in the input sequence has been processed. If the DRNN is to be used for classifying sequences, the terminal states for class 0 sequences must be distinct from the terminal states for class 1 sequences. The cost value to be used in training must therefore be a function of this disjointedness and no more. The outcome is a cost relationship that is not continuous but discrete; therefore derivative-free methods have to be used, or alternatively the method suggested in this paper. In the latter case, the transfer function of the helper FFN trained using the cost function is a differentiable function that can be used in training the DRNN by gradient descent.

Acknowledgement. This work was supported by a discovery grant from the Government of Canada. The comments made by the reviewers are also greatly appreciated and have proven to be quite useful.
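The core idea, fitting a differentiable surrogate to a non-differentiable cost so that gradient descent becomes possible, can be shown in miniature. Here a quadratic polynomial stands in for the helper FFN and a 1-D threshold classifier stands in for the DRNN; everything below is a hypothetical toy, not the paper's setup.

```python
import numpy as np

# Discrete cost: the number of misclassified points as a function of a 1-D
# decision threshold w (predict class 1 when x >= w). It is piecewise
# constant, so its derivative is zero or undefined everywhere.
DATA = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.6, 1), (1.2, 1), (2.0, 1)]

def discrete_cost(w):
    return sum((x >= w) != bool(y) for x, y in DATA)

# "Helper network": fit a smooth surrogate to sampled (w, cost) pairs.
ws = np.linspace(-3.0, 3.0, 61)
cs = np.array([discrete_cost(w) for w in ws], dtype=float)
coef = np.polyfit(ws, cs, 2)    # smooth, differentiable stand-in
dcoef = np.polyder(coef)

# Gradient descent on the surrogate, which the raw discrete cost itself
# would never permit (its gradient is zero almost everywhere).
w = -3.0
for _ in range(500):
    w -= 0.05 * np.polyval(dcoef, w)
```

Descending the surrogate drives the threshold into the region where the discrete cost is minimal, mirroring how the helper FFN's transfer function guides the DRNN's parameters.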
