Similar Documents (20 results)
1.
2.
Pornographic video detection based on multimodal fusion is an effective approach to filtering pornography. However, existing methods lack an accurate representation of audio semantics and pay little attention to the characteristics of pornographic audio. In this paper, we propose a novel framework that fuses an audio vocabulary with visual features for pornographic video detection. The novelty of our approach lies in three aspects: an audio semantics representation method based on energy envelope units (EEUs) and bag-of-words (BoW), a periodicity-based audio segmentation algorithm, and a periodicity-based video decision algorithm. The first, the EEU+BoW representation method, describes audio semantics via an audio vocabulary constructed by k-means clustering of EEUs. The latter two complement each other to make full use of the periodicities in pornographic audio. Using the periodicity-based audio segmentation algorithm, audio streams are divided into EEU sequences; after these EEUs are classified, videos are judged pornographic or not by the periodicity-based video decision algorithm. Before fusion, two support vector machines are applied to the audio-vocabulary-based and visual-features-based methods, respectively. To fuse their results, a keyframe is selected from each EEU according to its beginning and ending positions, and an integrated weighting scheme together with the periodicity-based video decision algorithm yields the final detection results. Experimental results show that our approach outperforms the traditional approach based only on visual features and achieves satisfactory performance: the true positive rate reaches 94.44% at a false positive rate of 9.76%.
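To make the vocabulary construction concrete, here is a minimal sketch of the EEU bag-of-words step, assuming each EEU has already been segmented and described by a fixed-length feature vector (the paper's exact EEU features are not reproduced; `build_audio_vocabulary` and `bow_histogram` are illustrative names):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_audio_vocabulary(eeu_features, vocab_size=64, seed=0):
    """Cluster EEU feature vectors (n_eeus, n_dims) into an audio
    vocabulary of codewords via k-means, as the abstract describes."""
    return KMeans(n_clusters=vocab_size, n_init=10, random_state=seed).fit(eeu_features)

def bow_histogram(vocab, eeu_features):
    """Quantize one video's EEUs against the vocabulary and return a
    normalized bag-of-words histogram, usable as SVM input."""
    words = vocab.predict(eeu_features)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```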

3.
4.
5.
Existing audio-visual eye-fixation detection algorithms use a two-stream structure to extract audio and visual features separately and then fuse them into the final prediction map. However, the audio and visual information in a dataset may be unrelated, so directly fusing audio and visual features when they are inconsistent lets the audio information negatively affect the visual features. To address this problem, this paper proposes an eye-fixation detection network based on audio-visual consistency (Audio-visual Consistency Network, AVCN). To verify its reliability, AVCN is added on top of an existing audio-visual fixation detection model: it makes a binary judgment on the consistency of the extracted audio and video features, outputting the fused audio-visual features as the final prediction when the two agree, and the visually dominated features otherwise. Experiments on six open datasets show that adding AVCN improves the overall metrics.
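A minimal sketch of the consistency-gated decision, assuming precomputed audio and visual features projected to a common dimension; the cosine-similarity threshold below is a simple stand-in for AVCN's learned binary judgment:

```python
import torch
import torch.nn.functional as F

def consistency_gated_output(audio_feat, visual_feat, fused_feat, threshold=0.5):
    """When audio and visual features agree (cosine similarity above the
    threshold), output the fused features; otherwise fall back to the
    visually dominated features, mirroring the gating described above.
    fused_feat and visual_feat are assumed to share the same shape."""
    a = F.normalize(audio_feat.flatten(1), dim=1)
    v = F.normalize(visual_feat.flatten(1), dim=1)
    consistent = (a * v).sum(dim=1) > threshold          # per-sample judgment
    mask = consistent.view(-1, *([1] * (fused_feat.dim() - 1))).float()
    return mask * fused_feat + (1.0 - mask) * visual_feat
```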

6.
Audio-Visual Event Recognition in Surveillance Video Sequences
In the automated surveillance field, automatic scene analysis and understanding systems typically consider only visual information, while other modalities, such as audio, are disregarded. This paper presents a new method that integrates audio and visual information for scene analysis in a typical surveillance scenario, using only one camera and one monaural microphone. Visual information is analyzed by a standard visual background/foreground (BG/FG) modelling module, enhanced with a novelty detection stage and coupled with an audio BG/FG modelling scheme. These processes detect separate audio and visual patterns representing unusual unimodal events in a scene. Audio and visual data are then integrated by exploiting the synchrony between such events. The audio-visual (AV) association is carried out online, without the need for training sequences, and is based on the computation of a characteristic feature called the audio-video concurrence matrix, which allows AV events to be detected, segmented, and discriminated between. Experimental tests involving classification and clustering of events demonstrate the potential of the proposed approach, also in comparison with results obtained from the single modalities and without considering synchrony.
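The concurrence idea can be sketched as a simple synchrony score between unimodal event tracks; binary per-frame activity matrices are assumed here, and the paper's actual matrix construction is more involved:

```python
import numpy as np

def concurrence_matrix(audio_events, video_events):
    """audio_events: (Na, T) and video_events: (Nv, T) binary activity
    matrices. Entry (i, j) is the fraction of co-active frames over the
    union of active frames, a Jaccard-style synchrony score in [0, 1]."""
    A = np.asarray(audio_events, dtype=float)
    V = np.asarray(video_events, dtype=float)
    inter = A @ V.T                                   # co-active frame counts
    union = A.sum(1, keepdims=True) + V.sum(1) - inter
    return inter / np.maximum(union, 1.0)
```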

7.
Multimedia event detection (MED) is a challenging problem because of the heterogeneous content and variable quality found in large collections of Internet videos. To study the value of multimedia features and fusion for representing and learning events from a set of example video clips, we created SESAME, a system for video SEarch with Speed and Accuracy for Multimedia Events. SESAME includes multiple bag-of-words event classifiers based on single data types: low-level visual, motion, and audio features; high-level semantic visual concepts; and automatic speech recognition. Event detection performance was evaluated for each event classifier. The performance of low-level visual and motion features was improved by the use of difference coding. The accuracy of the visual concepts was nearly as strong as that of the low-level visual features. Experiments with a number of fusion methods for combining the event detection scores from these classifiers revealed that simple fusion methods, such as arithmetic mean, perform as well as or better than other, more complex fusion methods. SESAME’s performance in the 2012 TRECVID MED evaluation was one of the best reported.
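The fusion finding is easy to illustrate: arithmetic-mean late fusion over per-classifier scores, with a per-classifier z-scoring step added here on the assumption that scores must first be brought to a common range:

```python
import numpy as np

def mean_fusion(score_matrix):
    """score_matrix: (n_classifiers, n_videos) raw detection scores.
    Z-score each classifier's outputs, then average across classifiers."""
    s = np.asarray(score_matrix, dtype=float)
    s = (s - s.mean(axis=1, keepdims=True)) / (s.std(axis=1, keepdims=True) + 1e-8)
    return s.mean(axis=0)
```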

8.
Toward semantic indexing and retrieval using hierarchical audio models
Semantic-level content analysis is a crucial issue in achieving efficient content retrieval and management. We propose a hierarchical approach that models the statistical characteristics of audio events over a time series to accomplish semantic context detection. Two stages, audio event and semantic context modeling, are devised to bridge the semantic gap between physical audio features and semantic concepts. In this work, hidden Markov models (HMMs) are used to model four representative audio events in action movies: gunshot, explosion, engine, and car braking. At the semantic-context level, Gaussian mixture models (GMMs) and ergodic HMMs are investigated to fuse the characteristics of and correlations between the audio events, providing cues for detecting gunplay and car-chase scenes, the two semantic contexts we focus on. Promising experimental results demonstrate the effectiveness of the proposed approach and show that the framework provides a foundation for semantic indexing and retrieval. Moreover, the two fusion schemes are compared, and the relations between audio events and semantic contexts are studied.
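A hedged sketch of the two-stage pipeline, using hmmlearn as a stand-in for the paper's models: one HMM per audio event scores feature sequences, and the resulting event-likelihood vectors would feed a context-level GMM or ergodic HMM (whose training is omitted here):

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

EVENTS = ["gunshot", "explosion", "engine", "car_braking"]

def train_event_hmms(train_seqs, n_states=3):
    """train_seqs: {event_name: list of (T_i, D) feature sequences}.
    Fits one GaussianHMM per representative audio event."""
    hmms = {}
    for name in EVENTS:
        X = np.vstack(train_seqs[name])
        lengths = [len(seq) for seq in train_seqs[name]]
        hmms[name] = GaussianHMM(n_components=n_states).fit(X, lengths)
    return hmms

def event_likelihoods(hmms, seq):
    """Per-event log-likelihoods for one audio segment; a context-level
    model over these vectors detects e.g. gunplay or car-chase scenes."""
    return np.array([hmms[name].score(seq) for name in EVENTS])
```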

9.
Multi-modal emotion recognition lacks an explicit mapping between emotion states and audio/image features, so extracting effective emotion information from audio-visual data remains a challenging issue. In addition, noise and data redundancy are not modeled well, so emotion recognition models often suffer from low efficiency. Deep neural networks (DNNs) excel at feature extraction and highly non-linear feature fusion, and cross-modal noise modeling has great potential for addressing data pollution and data redundancy. Inspired by this, our paper proposes a deep weighted fusion method for audio-visual emotion recognition. First, we perform cross-modal noise modeling on the audio and video data, which eliminates most of the data pollution in the audio channel and the data redundancy in the visual channel: the noise modeling is implemented by voice activity detection (VAD), and the visual redundancy is removed by aligning the speech regions in the audio and visual data. We then extract audio emotion features and visual expression features with two feature extractors. The audio extractor, audio-net, is a 2D CNN that accepts Mel-spectrogram images as input; the facial expression extractor, visual-net, is a 3D CNN fed with facial expression image sequences. To train the two convolutional neural networks efficiently on a small dataset, we adopt transfer learning. Next, we employ a deep belief network (DBN) for highly non-linear fusion of the multi-modal emotion features, training the feature extractors and the fusion network synchronously. Finally, emotion classification is performed by a support vector machine on the output of the fusion network. By jointly considering cross-modal feature fusion, denoising, and redundancy removal, our fusion method shows excellent performance on the selected dataset.
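The VAD-driven denoising step can be sketched with a simple frame-energy threshold; this is an illustrative stand-in, not the paper's actual VAD, and the returned spans would be used to keep only the speech regions in both the audio and the matching video frames:

```python
import numpy as np

def energy_vad(signal, sr, frame_ms=25, hop_ms=10, thresh_db=-35.0):
    """Return (start_sample, end_sample) spans where frame energy exceeds
    the threshold, i.e. a crude voice-activity detector."""
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    spans, active, start = [], False, 0
    for i in range(0, len(signal) - frame, hop):
        e = 10 * np.log10(np.mean(signal[i:i + frame] ** 2) + 1e-12)
        if e > thresh_db and not active:
            active, start = True, i
        elif e <= thresh_db and active:
            active = False
            spans.append((start, i + frame))
    if active:
        spans.append((start, len(signal)))
    return spans
```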

10.
This paper proposes a new two-phase approach to robust text detection that integrates visual appearance with geometric reasoning rules. In the first phase, geometric rules are used to achieve a high recall rate. Specifically, a robust stroke width transform (RSWT) feature is proposed to better recover the stroke width by additionally considering the crossing of two strokes and the continuity of letter borders. In the second phase, a classification scheme based on visual appearance features rejects false alarms while maintaining the recall rate. To learn a better classifier from multiple visual appearance features, a novel classification method called double soft multiple kernel learning (DS-MKL) is proposed. DS-MKL is motivated by a novel kernel-margin perspective on multiple kernel learning and can effectively suppress the influence of noisy base kernels. Comprehensive experiments on the benchmark ICDAR2005 competition dataset demonstrate the effectiveness of the proposed two-phase text detection approach over state-of-the-art approaches, with a performance gain of up to 4.4% in terms of F-measure.
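The kernel-combination idea behind MKL methods can be sketched as a weighted sum of precomputed Gram matrices fed to a kernel SVM; the simplex-normalized weights below are a generic stand-in, not the DS-MKL optimization itself, and `K_color`, `K_texture`, `K_hog` are hypothetical base kernels:

```python
import numpy as np
from sklearn.svm import SVC

def combine_kernels(kernels, weights):
    """Weighted sum of precomputed Gram matrices, one per feature type,
    with weights normalized onto the simplex as in standard MKL."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * K for wi, K in zip(w, kernels))

# Usage sketch with a precomputed combined kernel:
# K = combine_kernels([K_color, K_texture, K_hog], [0.2, 0.3, 0.5])
# clf = SVC(kernel="precomputed").fit(K, labels)
```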

11.
Objective: Visual saliency plays an important role in many vision-driven applications, and these application areas are shifting from 2D to 3D vision, so saliency models based on RGB-D data have attracted wide attention. Unlike saliency in 2D images, RGB-D saliency involves cues from multiple modalities. Complementary and competing relationships exist among these cues, and how to exploit and fuse them effectively remains a challenge; traditional fusion models struggle to take full advantage of them. This work therefore studies multimodal cue fusion in the formation of RGB-D saliency. Method: We propose an RGB-D saliency detection model based on a superpixel-level conditional random field (CRF). Saliency cues of different modalities are extracted, including planar, depth, and motion cues. A CRF is built over superpixels, and a global energy function combining the influence of the multimodal cues with a smoothness constraint on neighbouring saliency values is designed as the optimization objective, characterizing the interaction among the cues. The weighting factors of the cues in the energy function are learned by a convolutional neural network. Results: Experiments on two public RGB-D video saliency datasets compare the model with six saliency detection methods; the proposed model outperforms the state of the art on all datasets and metrics. Relative to the second-best results, its AUC (area under curve), sAUC (shuffled AUC), SIM (similarity), PCC (Pearson correlation coefficient), and NSS (normalized scanpath saliency) scores improve by 2.3%, 2.3%, 18.9%, 21.6%, and 56.2% on the IRCCyN dataset, and by 2.0%, 1.4%, 29.1%, 10.6%, and 23.3% on the DML-iTrack-3D dataset. An internal comparison further verifies that the proposed fusion method outperforms traditional fusion methods. Conclusion: The CRF and CNN in the proposed model exploit the strengths of the different modal cues and fuse them effectively, improving saliency detection performance and benefiting vision-driven applications.
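A toy version of the global energy function described above, with unary terms tying each superpixel's saliency to weighted multimodal cues and a pairwise term smoothing neighbouring superpixels; the cue weights, learned by a CNN in the paper, are plain constants here:

```python
import numpy as np

def crf_energy(s, cues, weights, neighbor_pairs, lam=0.5):
    """s: (N,) superpixel saliency values; cues: list of (N,) cue maps
    (planar, depth, motion, ...); neighbor_pairs: (i, j) adjacent pairs.
    Lower energy means better agreement with cues plus local smoothness."""
    unary = sum(w * np.sum((s - c) ** 2) for w, c in zip(weights, cues))
    pairwise = sum((s[i] - s[j]) ** 2 for i, j in neighbor_pairs)
    return unary + lam * pairwise
```

Minimizing this quadratic energy, for instance by solving the corresponding linear system, yields the fused saliency map.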

12.
Personality recognition and analysis is an important topic in personality computing, with significant applications in human behaviour analysis, artificial intelligence, human-computer interaction, and personalized recommendation; in recent years it has become a multidisciplinary research focus across psychology, cognitive science, and computer science. This paper introduces the personality-type representation theories and personality recognition databases relevant to the task, reviews audio-visual personality feature extraction techniques such as hand-crafted and deep features, classifies and summarizes in detail the multimodal fusion methods for audio-visual personality recognition, and concludes with an outlook on the development trends of multimodal personality recognition from audio-visual information.

13.
We present a system for multimedia event detection. The developed system characterizes complex multimedia events based on a large array of multimodal features and classifies unseen videos by effectively fusing diverse responses. We present three major technical innovations. First, we explore novel visual and audio features across multiple semantic granularities, including building, often in an unsupervised manner, mid-level and high-level features upon low-level features to enable semantic understanding. Second, we show a novel latent SVM model that learns and localizes discriminative high-level concepts in cluttered video sequences. In addition to improving detection accuracy over existing approaches, it enables a unique summary for every retrieval through its use of high-level concepts and temporal evidence localization; the resulting summary provides some transparency into why the system classified the video as it did. Finally, we present novel fusion learning algorithms and a methodology to improve fusion learning under limited training data. Thorough evaluation on the large TRECVID MED 2011 dataset showcases the benefits of the presented system.
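The temporal-evidence-localization idea can be sketched as a latent sliding-window score: the clip score is the best window response of a linear model, and the argmax window is the evidence surfaced in the summary. This is a simplification of the latent SVM inference, with mean-pooled window features as an assumption:

```python
import numpy as np

def latent_window_score(w, frame_feats, win=30, stride=10):
    """frame_feats: (T, D) per-frame features. Returns the best window
    score under weight vector w and the (start, end) span achieving it,
    usable as localized evidence for the detection."""
    best, span = -np.inf, (0, min(win, len(frame_feats)))
    for t in range(0, max(1, len(frame_feats) - win + 1), stride):
        pooled = frame_feats[t:t + win].mean(axis=0)
        score = float(w @ pooled)
        if score > best:
            best, span = score, (t, t + win)
    return best, span
```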

14.
This paper first introduces common wearable sensors, smart wearable devices, and their key application areas. Since multi-sensor systems are defined by the presence of more than one modality or channel (e.g., visual, audio, environmental, and physiological signals), fusion methods for multi-modality and multi-location sensors are surveyed. Although several works have reviewed the state of the art in information fusion or deep learning, each tackles only one aspect of sensor fusion applications, which leaves a gap in comprehensive understanding. We therefore take a more holistic approach in order to provide a suitable starting point from which to develop a full understanding of the fusion methods used with wearable sensors. Specifically, this review surveys the most important aspects of multi-sensor applications for human activity recognition, including recent additions to the field such as unsupervised learning and transfer learning. Finally, the open research issues that need further investigation and improvement are identified and discussed.

15.
This paper proposes a novel representation space for multimodal information, enabling fast and efficient retrieval of video data. We suggest describing documents not directly by selected multimodal features (audio, visual, or text), but by their cross-document similarities with respect to their multimodal characteristics. This idea leads us to propose a particular form of dissimilarity space adapted to the asymmetric classification problem, and in turn to the query-by-example and relevance-feedback paradigm widely used in information retrieval. Based on the proposed dissimilarity space, we define various strategies for fusing modalities through a kernel-based learning approach. The problem of automatically setting the kernel to adapt the learning process to the query is also discussed. The properties of our strategies are studied and validated on artificial data. In a second phase, a large annotated video corpus (i.e., TRECVID-05) indexed by visual, audio, and text features is used to evaluate the overall performance of the dissimilarity space and fusion strategies. The results confirm the validity of the proposed approach for representing and retrieving multimodal information in a real-time framework.
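A minimal sketch of the dissimilarity-space construction: each video is described by its distances to a set of reference documents, one block per modality, rather than by its raw features. The prototype sets and the Euclidean metric are assumptions; the paper's space is further adapted to the asymmetric setting:

```python
import numpy as np
from scipy.spatial.distance import cdist

def dissimilarity_representation(feats_by_mod, protos_by_mod):
    """feats_by_mod / protos_by_mod: {modality: (N, D_m) / (P_m, D_m)}.
    Returns (N, sum_m P_m): each column is a cross-document dissimilarity
    to one prototype, concatenated across modalities."""
    blocks = [cdist(feats_by_mod[m], protos_by_mod[m], metric="euclidean")
              for m in sorted(feats_by_mod)]
    return np.hstack(blocks)
```

Modalities can then be fused by learning a kernel (e.g. an RBF) over selected blocks of this representation.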

16.
Learning modality-fused representations and processing unaligned multimodal sequences are meaningful and challenging tasks in multimodal emotion recognition. Existing approaches use directional pairwise attention or a message hub to fuse the language, visual, and audio modalities. However, these fusion methods are often quadratic in complexity with respect to the modal sequence length, introduce redundant information, and are inefficient. In this paper, we propose an efficient neural network that learns modality-fused representations with a CB-Transformer (LMR-CBT) for multimodal emotion recognition from unaligned multimodal sequences. Specifically, we first perform feature extraction for the three modalities separately to obtain the local structure of each sequence. Then, we design an innovative asymmetric transformer with cross-modal blocks (CB-Transformer) that enables complementary learning across modalities, mainly divided into local temporal learning, cross-modal feature fusion, and global self-attention representation. In addition, we concatenate the fused features with the original features to classify the emotions of the sequences. Finally, we conduct word-aligned and unaligned experiments on three challenging datasets: IEMOCAP, CMU-MOSI, and CMU-MOSEI. The experimental results show the superiority and efficiency of our proposed method in both settings; compared with mainstream methods, our approach reaches the state of the art with a minimal number of parameters.
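A hedged PyTorch sketch of a cross-modal block in this spirit: one modality's sequence queries another via multi-head attention, so the sequences need not be word-aligned. The actual asymmetric CB-Transformer design differs, and the dimensions are placeholders:

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                nn.Linear(4 * dim, dim))

    def forward(self, target, source):
        """target (B, Tt, dim) supplies queries; source (B, Ts, dim)
        supplies keys/values, allowing unaligned sequence lengths."""
        h, _ = self.attn(target, source, source)
        x = self.norm1(target + h)
        return self.norm2(x + self.ff(x))
```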

17.
Audio-visual saliency detection methods adopt a two-stream network structure; when the audio and visual signals are inconsistent, the audio information in the two-stream network negatively affects the video information and weakens the visual features of objects. In addition, traditional fusion schemes ignore the relative importance of feature attributes. Addressing these problems of the two-stream network, this paper proposes a multi-stream audio-visual saliency algorithm based on visual information compensation (MSAVIC). First, a separate video encoding branch is added to the two-stream network, preserving the complete object appearance and motion information in the video signal. Second, a feature fusion strategy combines the video encoding features with the audio-visual saliency features, enhancing the expression of visual information and compensating for it when audio and vision disagree. Theoretical analysis and experimental results show that MSAVIC exceeds other methods by about 2% on four datasets, demonstrating strong saliency detection performance.
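A minimal sketch of the compensation step, assuming the audio-visual features and the video-only branch features are spatial maps of matching resolution (channel sizes are placeholders): the two are concatenated and re-projected so visual content survives even when the audio disagrees:

```python
import torch
import torch.nn as nn

class VisualCompensation(nn.Module):
    """Fuse audio-visual saliency features with a video-only branch via a
    1x1 convolution, echoing the feature fusion strategy described above."""
    def __init__(self, c_av=256, c_vid=256):
        super().__init__()
        self.fuse = nn.Conv2d(c_av + c_vid, c_av, kernel_size=1)

    def forward(self, av_feat, video_feat):
        # av_feat: (B, c_av, H, W); video_feat: (B, c_vid, H, W)
        return torch.relu(self.fuse(torch.cat([av_feat, video_feat], dim=1)))
```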

18.
Saliency detection has been shown to be an effective and reliable approach for extracting regions of interest (ROIs) in remote sensing images. However, most existing saliency detection methods employing multiple saliency cues ignore the intrinsic relationships between different cues and do not distinguish their diverse contributions to the final saliency map. In this paper, we propose a novel self-adaptive multiple-feature fusion model for saliency detection in remote sensing images that exploits these relationships to improve the accuracy of ROI extraction. First, we consider multiple feature channels, namely colour, intensity, texture, and global contrast, to produce primary feature maps; in particular, we design a novel method based on the dual-tree complex wavelet transform to generate texture feature pyramids for remote sensing images. Then, we introduce a novel self-adaptive multiple-feature fusion method based on low-rank matrix recovery, in which the significance of each feature map is ranked by low-rank constraint recovery and the contributions of the features are allocated adaptively to produce the final saliency map. Experimental results demonstrate that our proposal outperforms state-of-the-art methods.
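A simplified stand-in for the low-rank weighting idea: stack the vectorized cue maps as columns, split off a low-rank component capturing their shared structure, and weight each cue inversely to its residual. The paper's actual low-rank matrix recovery formulation is more principled than this plain SVD sketch:

```python
import numpy as np

def lowrank_cue_weights(cue_maps, rank=1):
    """cue_maps: (K, H, W) primary feature maps. Returns normalized
    per-cue fusion weights favouring cues close to the shared low-rank
    structure."""
    X = np.stack([m.ravel() for m in cue_maps], axis=1)   # (H*W, K)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    L = U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank]       # shared structure
    residual = np.linalg.norm(X - L, axis=0)              # per-cue deviation
    w = 1.0 / (residual + 1e-8)
    return w / w.sum()

def fuse_saliency(cue_maps):
    """Weighted combination of the cue maps into one saliency map."""
    return np.tensordot(lowrank_cue_weights(cue_maps), np.asarray(cue_maps), axes=1)
```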

19.
In this paper, we present a probabilistic multi-task learning approach for visual saliency estimation in video. In our approach, the problem of visual saliency estimation is modeled by simultaneously considering the stimulus-driven and task-related factors in a probabilistic framework. In this framework, a stimulus-driven component simulates the low-level processes in the human vision system using multi-scale wavelet decomposition and unbiased feature competition, while a task-related component simulates the high-level processes to bias the competition of the input features. Different from existing approaches, we propose a multi-task learning algorithm to learn the task-related “stimulus-saliency” mapping functions for each scene. The algorithm also learns various fusion strategies, which are used to integrate the stimulus-driven and task-related components to obtain the visual saliency. Extensive experiments were carried out on two public eye-fixation datasets and one regional saliency dataset. Experimental results show that our approach outperforms eight state-of-the-art approaches remarkably.
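The stimulus-driven front end can be sketched with an off-the-shelf multi-scale wavelet decomposition, using per-scale detail energy as unbiased low-level features. PyWavelets is an assumption here; the paper's feature competition and learned mapping functions are not reproduced:

```python
import numpy as np
import pywt

def wavelet_energy_features(gray_frame, wavelet="db2", levels=3):
    """Decompose a grayscale frame and return one normalized contrast-
    energy map per scale (summed over the H/V/D detail orientations)."""
    coeffs = pywt.wavedec2(gray_frame, wavelet, level=levels)
    feats = []
    for detail in coeffs[1:]:                  # (cH, cV, cD) per level
        energy = sum(np.abs(c) for c in detail)
        feats.append(energy / (energy.max() + 1e-8))
    return feats
```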

20.
In the Semantic Web vision of the World Wide Web, content will not only be accessible to humans but will also be available in machine-interpretable form as ontological knowledge bases. Ontological knowledge bases enable formal querying and reasoning and, consequently, a main research focus has been the investigation of how deductive reasoning can be utilized in ontological representations to enable more advanced applications. However, purely logical methods have not yet proven very effective, for several reasons. First, there is still the unsolved problem of scaling reasoning to Web scale. Second, logical reasoning has problems with uncertain information, which is abundant in Semantic Web data due to its distributed and heterogeneous nature. Third, the construction of ontological knowledge bases suitable for advanced reasoning techniques is complex, which ultimately results in a lack of expressive real-world datasets with large amounts of instance data. From another perspective, the more expressive structured representations open up new opportunities for data mining, knowledge extraction, and machine learning techniques. If one accepts that part of the knowledge already lies in the data, inductive methods appear promising, in particular since they can inherently handle noisy, inconsistent, uncertain, and missing data. While there has been broad coverage of inducing concept structures from less structured sources (text, Web pages), as in ontology learning, given the problems mentioned above we focus on new methods for dealing with Semantic Web knowledge bases, relying on statistical inference over their standard representations. We argue that machine learning research has a wide variety of methods to offer, applicable to different expressivity levels of Semantic Web knowledge bases: ranging from weakly expressive but widely available knowledge bases in RDF to highly expressive first-order knowledge bases, this paper surveys statistical approaches to mining the Semantic Web. We specifically cover similarity- and distance-based methods, kernel machines, multivariate prediction models, relational graphical models, and first-order probabilistic learning approaches, and discuss their applicability to Semantic Web representations. Finally, we present selected experiments conducted on Semantic Web mining tasks for some of the algorithms presented before. This is intended to show the breadth and general potential of this exciting new research and application area for data mining.
