Found 20 similar documents; search took 62 ms
1.
Nikolaos Sarafianos Theodoros Giannakopoulos Sergios Petridis 《Multimedia Tools and Applications》2016,75(1):115-130
Speaker diarization aims to automatically answer the question “who spoke when” given a speech signal. In this work, we have focused on applying the FLsD approach, a semi-supervised version of Fisher Linear Discriminant analysis, both in the audio and the video signals to form a complete multimodal speaker diarization system. Extensive experiments have proven that the FLsD method boosts the performance of the face diarization task (i.e. the task of discovering faces over time given only the visual signal). In addition, we have proven through experimentation that applying the FLsD method for discriminating between faces is also independent of the initial feature space and remains relatively unaffected as the number of faces increases. Finally, a fusion method is proposed that leads to performance improvement in comparison to the best individual modality, which is the audio signal.
2.
3.
Medeni Soysal K. Berker Loğoğlu Mashar Tekin Ersin Esen Ahmet Saracoğlu Banu Oskay Acar Ezgi Can Ozan Tuğrul K. Ateş Hakan Sevimli Müge Sevinç İlkay Atıl Savaş Özkan Mehmet Ali Arabacı Seda Tankız Talha Karadeniz Duygu Önür Sezin Selçuk A. Aydın Alatan Tolga Çiloğlu 《Multimedia Tools and Applications》2014,72(3):2787-2832
Concept detection stands as an important problem for efficient indexing and retrieval in large video archives. In this work, the KavTan System, which performs high-level semantic classification in one of the largest TV archives of Turkey, is presented. In this system, concept detection is performed using generalized visual and audio concept detection modules that are supported by video text detection, audio keyword spotting and specialized audio-visual semantic detection components. The performance of the presented framework was assessed objectively over a wide range of semantic concepts (5 high-level, 14 visual, 9 audio, 2 supplementary) by using a significant amount of precisely labeled ground truth data. KavTan System achieves successful high-level concept detection performance in unconstrained TV broadcast by efficiently utilizing multimodal information that is systematically extracted from both spatial and temporal extent of multimedia data.
4.
Chin-Feng Lai Yueh-Min Huang Jiann-Liang Chen Wen Ji Min Chen 《Multimedia Systems》2011,17(4):299-311
In the array of mobile communication techniques, the application of a mobile phone combined with television is a new technique
under development. As TV program is a real-time video/audio service, in comparison with either traditional video/audio file
downloads or network video/audio streams, there are more technical difficulties to be overcome, in particular, how to satisfy
the playback functions of TV programs in hand-held device. OpenCore is a multimedia framework, which has recently been widely
applied in hand-held devices, but it does not offer functions of mobile TV. To solve this problem, this study incorporates
the function of mobile TV into the OpenCore framework, in order to support both formats of TV signals, i.e. DVB-H and DVB-T.
The incorporated function, DVB-H/T, has different characteristics, so that users can select TV signals according to their
receiving environments and fulfill their needs in TV programs selection.
5.
Current interactive services for digital TV are limited. They basically display a Web page alongside the TV program, which enhances the viewer's experience by providing extra information about the TV program. We define new interactive services for digital TV, which provide DVD-like interactivity to TV viewers. These services enable viewers to control the content and final presentation of a TV program. Some of the attractive applications of our services include parental management, multilingual audio, multiangle video, video in video, etc. The challenge in implementing these services is in transmitting an extra audio or video stream (called incidental) along with the main streams of the TV program. In the first part of this paper, we present a framework for adding the incidental streams to the original transmission stream without increasing the required bandwidth, degrading the picture quality of the main streams, or violating the compatibility of the transmitted stream with standard TV receivers. In the second part of this paper, we explore the two basic mechanisms of the presented framework: traffic characterization and admission control. We present methods for implementing these mechanisms. Using our methods, one can determine whether a TV transmission network has the capability of sending an incidental stream or not. Simulations were conducted to test the validity of our method. The results verify that our method successfully transmits the incidental streams without any discrepancy and without affecting the quality of the main streams.
6.
Jian-quan Ouyang Hua Nie Min Zhang Zezhou Li Yongzhou Li 《Computers & Electrical Engineering》2011,37(6):991-1008
Sixty-four percent of consumers believe television advertising still has the greatest impact on them. Accurate, real-time identification of TV advertising is therefore valuable to government and advertisement providers. Because a multi-modal method takes full account of both video and audio information, this paper handles composite fingerprinting in a unified framework for advertising identification. The Improved Harris Combining Motion feature, which is based on the differences between adjacent video frames, produces the video fingerprint. Meanwhile, the proposed FIR filter based Fast Audio Fingerprint focuses on extracting the differences between equivalent bands from adjacent frames. This multi-modal framework combines the audio and video fingerprints in a weighted manner. Experimental results show that, compared with current methods, both the audio and the video fingerprints offer higher discrimination, stronger robustness and lower time complexity. Moreover, the multi-modal fingerprint can enhance the performance of either single fingerprint.
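The weighted combination of audio and video fingerprints described in the abstract above can be illustrated with a minimal sketch. The abstract does not give the matching rule or the weights, so everything here is an assumption made for illustration: fingerprints are treated as equal-length bit strings, similarity is the fraction of matching bits, and the names `hamming_similarity`, `fused_score` and `video_weight` are hypothetical.

```python
def hamming_similarity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length bit strings agree."""
    if len(a) != len(b) or not a:
        raise ValueError("fingerprints must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def fused_score(video_query: str, video_ref: str,
                audio_query: str, audio_ref: str,
                video_weight: float = 0.5) -> float:
    """Weighted sum of the video and audio similarities (assumed fusion rule)."""
    if not 0.0 <= video_weight <= 1.0:
        raise ValueError("video_weight must lie in [0, 1]")
    video_sim = hamming_similarity(video_query, video_ref)
    audio_sim = hamming_similarity(audio_query, audio_ref)
    return video_weight * video_sim + (1.0 - video_weight) * audio_sim


# A perfect video match and a complete audio mismatch, weighted equally.
score = fused_score("1100", "1100", "1010", "0101", video_weight=0.5)
```

A single weight keeps the fused score in the same [0, 1] range as each individual similarity, so one decision threshold can be applied regardless of how the two modalities are balanced.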
7.
Kapsouras I. Tefas A. Nikolaidis N. Peeters G. Benaroya L. Pitas I. 《Multimedia Tools and Applications》2017,76(2):2223-2242
Multimedia Tools and Applications - Multimodal clustering/diarization tries to answer the question “who spoke when” by using audio and visual information. Diarization consists of two...
8.
9.
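This paper describes a device that supports simultaneous playback of multiple audio/video channels. It allows multimedia equipment such as computers or televisions to play several video pictures at once while producing audio outputs that do not interfere with one another. In the device, a data buffer module is connected to multiple audio/video reading modules, and the outputs of the audio/video reading modules are connected to an audio output interface and a video presentation module, respectively.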
10.
11.
《IEEE transactions on audio, speech, and language processing》2010,18(1):141-157
12.
Yasuo Ariki 《New Generation Computing》2000,18(4):341-357
Because of media digitization, a large amount of information such as speech, audio and video data is produced every day.
In order to retrieve data from these databases quickly and precisely, multimedia technologies for structuring and retrieving
of speech, audio and video data are strongly required. In this paper, we overview the multimedia technologies such as structuring
and retrieval of speech, audio and video data, speaker indexing, audio summarization and cross media retrieval existing today
for TV news database. The main purpose of structuring is to produce tables of contents and indices from audio and video data
automatically. In order to make these technologies feasible, first, processing units such as words on audio data and shots
on video data are extracted. On a second step, they are meaningfully integrated into topics. Furthermore, the units extracted
from different types of media are integrated for higher functions.
Yasuo Ariki, Ph.D.: He is a Professor in the Department of Electronics and Informatics at the Ryukoku University. He received his B.E., M.E.
and Ph.D. in information science from Kyoto University in 1974, 1976 and 1979, respectively. He had been an Assistant in Kyoto
University from 1980 to 1990, and stayed at Edinburgh University as visiting academic from 1987 to 1990. His research interests
are in speech and image recognition and in information retrieval and database. He is a member of IPSJ, IEICE, ASJ, Soc. Artif.
Intel. and IEEE.
13.
A new type of local area network operating system named CrossoverNet is developed for effective control of digital devices and audio visual devices such as video cameras, TV displays, video discs and video cassette recorders. A remarkable feature of CrossoverNet is that it can handle and transfer both digital and analog information, including data, video and voice, in an integrated manner. To ensure secure operations of the entire network, we formalize CrossoverNet in a precise mathematical way by defining object, job, task and process as data types belonging to CrossoverNet. We also present a practical installation example of the CrossoverNet system.
14.
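Most existing videos contain only mono audio and lack the sense of space that two-channel audio provides. To address this problem, this paper proposes a two-channel audio generation method based on multimodal perception. Building on an analysis of the visual information in a video, it fuses the video's spatial information with the audio content and automatically adds spatialization features to the original mono audio, generating two-channel audio that is closer to a real listening experience. We first use an improved audio-video fusion analysis network with an encoder-decoder structure to encode the mono-audio video, then perform multi-scale fusion of the video and audio features and jointly analyze the video and audio information, so that the two-channel audio gains spatial information absent from the original mono audio; finally, the two-channel audio corresponding to the video is generated. Experimental results on public datasets show that the method generates better two-channel audio than existing models, with improvements on both the STFT distance and the ENV distance metrics.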
15.
Prosodic and other Long-Term Features for Speaker Diarization (total citations: 1; self-citations: 0; citations by others: 1)
《IEEE transactions on audio, speech, and language processing》2009,17(5):985-993
16.
17.
Jinqiao Wang Lingyu Duan Qingshan Liu Hanqing Lu Jin J.S. 《Multimedia, IEEE Transactions on》2008,10(3):393-408
With the advance of digital video recording and playback systems, the request for efficiently managing recorded TV video programs is evident so that users can readily locate and browse their favorite programs. In this paper, we propose a multimodal scheme to segment and represent TV video streams. The scheme aims to recover the temporal and structural characteristics of TV programs with visual, auditory, and textual information. In terms of visual cues, we develop a novel concept named program-oriented informative images (POIM) to identify the candidate points correlated with the boundaries of individual programs. For audio cues, a multiscale Kullback-Leibler (K-L) distance is proposed to locate audio scene changes (ASC), and accordingly ASC is aligned with video scene changes to represent candidate boundaries of programs. In addition, latent semantic analysis (LSA) is adopted to calculate the textual content similarity (TCS) between shots to model the inter-program similarity and intra-program dissimilarity in terms of speech content. Finally, we fuse the multimodal features of POIM, ASC, and TCS to detect the boundaries of programs including individual commercials (spots). Towards effective program guide and attracting content browsing, we propose a multimodal representation of individual programs by using POIM images, key frames, and textual keywords in a summarization manner. Extensive experiments are carried out over an open benchmarking dataset TRECVID 2005 corpus and promising results have been achieved. Compared with the electronic program guide (EPG), our solution provides a more generic approach to determine the exact boundaries of diverse TV programs even including dramatic spots.
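The idea of locating audio scene changes with a multiscale K-L distance, as mentioned in the abstract above, can be sketched roughly as follows. The abstract does not state the formulation, so this is only one plausible reading under stated assumptions: a one-dimensional audio feature sequence, a Gaussian fitted to the window on each side of a candidate point, a symmetrised K-L divergence, and a plain average over window lengths. The names `gaussian_kl`, `symmetric_kl`, `multiscale_kl` and the default `scales` are hypothetical.

```python
import math
from statistics import fmean, pvariance


def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL(P || Q) for two univariate Gaussians."""
    return 0.5 * (math.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q
                  - 1.0)


def symmetric_kl(left, right, eps=1e-9):
    """Symmetrised K-L divergence between Gaussians fitted to two windows."""
    mu_l, mu_r = fmean(left), fmean(right)
    # eps keeps the variances strictly positive for constant windows.
    var_l, var_r = pvariance(left) + eps, pvariance(right) + eps
    return (gaussian_kl(mu_l, var_l, mu_r, var_r)
            + gaussian_kl(mu_r, var_r, mu_l, var_l))


def multiscale_kl(feature, t, scales=(4, 8, 16)):
    """Average the symmetric K-L distance around index t over several window lengths."""
    scores = []
    for w in scales:
        if t - w < 0 or t + w > len(feature):
            continue  # skip scales whose windows do not fit in the sequence
        scores.append(symmetric_kl(feature[t - w:t], feature[t:t + w]))
    return fmean(scores) if scores else 0.0


# Toy feature track whose statistics change abruptly at index 16.
feature = [0.0, 0.1] * 8 + [5.0, 5.1] * 8
at_change = multiscale_kl(feature, 16)
inside_scene = multiscale_kl(feature, 8)
```

In this toy track the score at index 16 is large while the score inside the first segment is essentially zero, which is the behaviour a scene-change detector would threshold on.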
18.
19.
20.
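The Fibre Channel Audio Video protocol (FC-AV) defines how video information is transmitted over a Fibre Channel (FC) network and provides an interface standard for interconnecting video devices; it represents the development trend of video transmission technology in avionics systems. This work studies the basic principles of FC-AV in depth and, based on an FPGA, implements key techniques including video acquisition, organization of the container system, organization and encapsulation of FC data frames, and display control of the container system. A hardware interface module was developed and a video transmission network was built, verifying the method of transmitting image information over Fibre Channel, solving the problem of long-distance image transmission, and laying a foundation for using Fibre Channel to transmit image information in avionics applications.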