Fusing audio vocabulary with visual features for pornographic video detection
Affiliation: 1. School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798, Singapore; 2. School of Computer, Central China Normal University, Luoyu Road 152, Wuhan, China; 3. Department of Information and Computing Sciences, Utrecht University, Princetonplein 5, Utrecht, The Netherlands; 4. School of Computing, Informatics, Decision System Engineering, Arizona State University, Tempe, AZ 85287, USA
Abstract: Pornographic video detection based on multimodal fusion is an effective approach for filtering pornography. However, existing methods lack an accurate representation of audio semantics and pay little attention to the characteristics of pornographic audio. In this paper, we propose a novel framework that fuses an audio vocabulary with visual features for pornographic video detection. The novelty of our approach lies in three aspects: an audio semantics representation method based on energy envelope units (EEUs) and bag-of-words (BoW), a periodicity-based audio segmentation algorithm, and a periodicity-based video decision algorithm. The first, named the EEU+BoW representation method, describes audio semantics via an audio vocabulary constructed by k-means clustering of EEUs. The latter two complement each other to make full use of the periodicities in pornographic audio. Using the periodicity-based audio segmentation algorithm, audio streams are divided into EEU sequences. After these EEUs are classified, videos are judged to be pornographic or not by the periodicity-based video decision algorithm. Before fusion, two support vector machines are applied, one for the audio-vocabulary-based method and one for the visual-features-based method. To fuse their results, a keyframe is selected from each EEU according to its beginning and ending positions, and then an integrated weighted scheme and the periodicity-based video decision algorithm are adopted to yield the final detection results. Experimental results show that our approach outperforms the traditional one based only on visual features and achieves satisfactory performance: the true positive rate reaches 94.44% at a false positive rate of 9.76%.
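The EEU+BoW representation described above can be sketched roughly as follows. This is a minimal illustration only: the abstract does not specify the EEU feature extraction or distance metric, so the feature vectors, vocabulary size, and plain Euclidean k-means here are all assumptions.

```python
import numpy as np

def build_vocabulary(eeu_features, k, iters=20, seed=0):
    """Cluster EEU feature vectors with k-means to form an audio vocabulary.

    eeu_features: (n, d) array, one row per energy envelope unit (hypothetical
    features; the paper's actual EEU descriptors are not given in the abstract).
    Returns (k, d) cluster centers, each acting as one "audio word".
    """
    rng = np.random.default_rng(seed)
    centers = eeu_features[rng.choice(len(eeu_features), k, replace=False)]
    for _ in range(iters):
        # assign every EEU to its nearest audio word
        dists = np.linalg.norm(eeu_features[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its assigned EEUs
        for j in range(k):
            members = eeu_features[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

def bow_histogram(eeu_features, centers):
    """Represent an audio segment as a normalized BoW histogram of audio words.

    The histogram would then be fed to an SVM classifier, as in the paper.
    """
    dists = np.linalg.norm(eeu_features[:, None] - centers[None], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / max(hist.sum(), 1.0)
```

With a vocabulary in hand, each audio segment becomes a fixed-length histogram regardless of how many EEUs it contains, which is what makes SVM classification straightforward.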
This article is indexed in ScienceDirect and other databases.
|