1.

Audio-visual speaker diarization using Fisher linear semi-discriminant analysis

Nikolaos Sarafianos Theodoros Giannakopoulos Sergios Petridis 《Multimedia Tools and Applications》2016,75(1):115-130

Speaker diarization aims to automatically answer the question “who spoke when” given a speech signal. In this work, we have focused on applying the FLsD approach, a semi-supervised version of Fisher Linear Discriminant analysis, to both the audio and the video signals to form a complete multimodal speaker diarization system. Extensive experiments have proven that the FLsD method boosts the performance of the face diarization task (i.e. the task of discovering faces over time given only the visual signal). In addition, we have proven through experimentation that applying the FLsD method for discriminating between faces is also independent of the initial feature space and remains relatively unaffected as the number of faces increases. Finally, a fusion method is proposed that leads to performance improvement in comparison to the best individual modality, which is the audio signal.

2.

An overview of automatic speaker diarization systems

Tranter S.E. Reynolds D.A. 《IEEE transactions on audio, speech, and language processing》2006,14(5):1557-1565

3.

Multimodal concept detection in broadcast media: KavTan

Medeni Soysal K. Berker Loğoğlu Mashar Tekin Ersin Esen Ahmet Saracoğlu Banu Oskay Acar Ezgi Can Ozan Tuğrul K. Ateş Hakan Sevimli Müge Sevinç İlkay Atıl Savaş Özkan Mehmet Ali Arabacı Seda Tankız Talha Karadeniz Duygu Önür Sezin Selçuk A. Aydın Alatan Tolga Çiloğlu 《Multimedia Tools and Applications》2014,72(3):2787-2832

Concept detection is an important problem for efficient indexing and retrieval in large video archives. In this work, the KavTan System, which performs high-level semantic classification in one of the largest TV archives of Turkey, is presented. In this system, concept detection is performed using generalized visual and audio concept detection modules that are supported by video text detection, audio keyword spotting and specialized audio-visual semantic detection components. The performance of the presented framework was assessed objectively over a wide range of semantic concepts (5 high-level, 14 visual, 9 audio, 2 supplementary) by using a significant amount of precisely labeled ground truth data. The KavTan System achieves successful high-level concept detection performance in unconstrained TV broadcast by efficiently utilizing multimodal information that is systematically extracted from both the spatial and temporal extent of multimedia data.

4.

Design and integration of the OpenCore-based mobile TV framework for DVB-H/T wireless network

Chin-Feng Lai Yueh-Min Huang Jiann-Liang Chen Wen Ji Min Chen 《Multimedia Systems》2011,17(4):299-311

Among mobile communication techniques, combining a mobile phone with television is a new technique under development. Because a TV program is a real-time video/audio service, it poses more technical difficulties than either traditional video/audio file downloads or network video/audio streams, in particular how to provide the playback functions of TV programs on a hand-held device. OpenCore is a multimedia framework that has recently been widely applied in hand-held devices, but it does not offer mobile TV functions. To solve this problem, this study incorporates mobile TV functionality into the OpenCore framework in order to support both TV signal formats, i.e. DVB-H and DVB-T. Because the two formats have different characteristics, users can select the TV signal that suits their receiving environment and their needs in TV program selection.

5.

Data transmission schemes for DVD-like interactive TV

Azimi M. Nasiopoulos P. Ward R.K. 《Multimedia, IEEE Transactions on》2006,8(4):856-865

Current interactive services for digital TV are limited. They basically display a Web page alongside the TV program, which enhances the viewer's experience by providing extra information about the TV program. We define new interactive services for digital TV, which provide DVD-like interactivity to TV viewers. These services enable viewers to control the content and final presentation of a TV program. Some of the attractive applications of our services include parental management, multilingual audio, multiangle video, video in video, etc. The challenge in implementing these services is in transmitting an extra audio or video stream (called incidental) along with the main streams of the TV program. In the first part of this paper, we present a framework for adding the incidental streams to the original transmission stream without increasing the required bandwidth, degrading the picture quality of the main streams, or violating the compatibility of the transmitted stream with standard TV receivers. In the second part of this paper, we explore the two basic mechanisms of the presented framework: traffic characterization and admission control. We present methods for implementing these mechanisms. Using our methods, one can determine whether a TV transmission network has the capability of sending an incidental stream or not. Simulations were conducted to test the validity of our method. The results verify that our method successfully transmits the incidental streams without any discrepancy and without affecting the quality of the main streams.

6.

Fusing audio-visual fingerprint to detect TV commercial advertisement

Jian-quan Ouyang Hua Nie Min Zhang Zezhou Li Yongzhou Li 《Computers & Electrical Engineering》2011,37(6):991-1008

Sixty-four percent of consumers believe television advertising still has the greatest impact on them. There is therefore strong demand from government and advertisement providers for accurate, real-time TV advertising identification. Because a multi-modal method takes full account of both video and audio information, this paper handles composite fingerprinting in a unified framework for advertising identification. The Improved Harris Combining Motion feature, which is based on the differences between adjacent video frames, produces the video fingerprint. Meanwhile, the proposed FIR-filter-based Fast Audio Fingerprint focuses on extracting the differences between the equivalent bands of adjacent frames. The multi-modal framework then combines the audio and video fingerprints in a weighted manner. Experimental results show that, compared with current methods, both the audio and the video fingerprint offer higher discrimination, stronger robustness and lower time complexity. Moreover, the multi-modal fingerprint can enhance the performance of either single-modality fingerprint.
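The two ideas named in this abstract, fingerprint bits derived from differences between corresponding bands of adjacent frames and a weighted combination of audio and video evidence, can be illustrated with a minimal Python sketch. Everything specific here is an assumption for illustration only: the function names, the sign-based bit rule, the bit-agreement similarity, and the default weight of 0.5 are not taken from the paper, and the sketch does not reproduce its FIR-filter-based Fast Audio Fingerprint or its Improved Harris Combining Motion feature.

```python
def band_difference_bits(frames):
    """Hypothetical fingerprint: one bit per band and frame pair,
    1 if the band's energy rose relative to the previous frame, else 0."""
    bits = []
    for prev, cur in zip(frames, frames[1:]):
        bits.extend(1 if c > p else 0 for p, c in zip(prev, cur))
    return bits


def bit_similarity(bits_a, bits_b):
    """Fraction of matching bits (1.0 means identical fingerprints)."""
    if not bits_a or len(bits_a) != len(bits_b):
        raise ValueError("fingerprints must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(bits_a, bits_b))
    return matches / len(bits_a)


def fused_score(audio_sim, video_sim, audio_weight=0.5):
    """Weighted combination of two per-modality similarities in [0, 1].
    The weight is an assumed tuning parameter, not a value from the paper."""
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * audio_sim + (1.0 - audio_weight) * video_sim


# Toy band energies (3 bands, 3 frames) for a reference clip and a noisy query.
reference = [[1.0, 4.0, 2.0], [2.0, 3.0, 2.5], [1.5, 3.5, 2.0]]
query = [[1.1, 4.2, 2.1], [2.2, 3.1, 2.4], [1.4, 3.6, 2.2]]

audio_sim = bit_similarity(band_difference_bits(reference),
                           band_difference_bits(query))
score = fused_score(audio_sim, video_sim=0.5)
```

In this toy case the small perturbations in `query` do not flip any rise/fall decision, so the audio fingerprints agree on every bit; a weak video similarity then pulls the fused score down in proportion to its weight.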

7.

Multimodal speaker clustering in full length movies

Kapsouras I. Tefas A. Nikolaidis N. Peeters G. Benaroya L. Pitas I. 《Multimedia Tools and Applications》2017,76(2):2223-2242

Multimodal clustering/diarization tries to answer the question “who spoke when” by using audio and visual information. Diarization consists of two...

8.

Indexing for reuse of TV news shots

M. Bertini A. Del Bimbo 《Pattern recognition》2002,35(3):581-591

9.

Design of a device supporting simultaneous playback of multiple audio and video streams

叶惠仙《计算机时代》2011,(12):13-15

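This paper introduces a device that supports simultaneous playback of multiple audio/video streams. The device lets multimedia equipment such as a computer or a television play several video pictures at once while producing audio outputs that do not interfere with one another. In the device, a data buffer module is connected to multiple audio/video reading modules, and the outputs of the audio/video reading modules are connected to an audio output interface and a video presentation module, respectively.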

10.

Step-by-step and integrated approaches in broadcast news speaker diarization

《Computer Speech and Language》2006,20(2-3):303-330

11.

BIC-Based Speaker Segmentation Using Divide-and-Conquer Strategies With Application to Speaker Diarization

《IEEE transactions on audio, speech, and language processing》2010,18(1):141-157

In this paper, we propose three divide-and-conquer approaches for Bayesian information criterion (BIC)-based speaker segmentation. The approaches detect speaker changes by recursively partitioning a large analysis window into two sub-windows and recursively verifying the merging of two adjacent audio segments using $\Delta\mathrm{BIC}$, a widely-adopted distance measure of two audio segments. We compare our approaches to three popular distance-based approaches, namely, Chen and Gopalakrishnan's window-growing-based approach, Siegler's fixed-size sliding window approach, and Delacourt and Wellekens's DISTBIC approach, by performing computational cost analysis and conducting speaker change detection experiments on two broadcast news data sets. The results show that the proposed approaches are more efficient and achieve higher segmentation accuracy than the compared distance-based approaches. In addition, we apply the segmentation approaches discussed in this paper to the speaker diarization task. The experiment results show that a more effective segmentation approach leads to better diarization accuracy.
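To make the distance measure mentioned in this abstract concrete, here is a minimal Python sketch of a ΔBIC comparison between two adjacent segments. It is a simplified illustration, not the paper's divide-and-conquer procedure: it assumes each segment is modelled by a single Gaussian with diagonal covariance (full covariance is more common in practice), it uses the sign convention that a positive value favours two separate speakers, and the names `delta_bic` and `penalty_weight` are hypothetical.

```python
import math


def _log_det_diag(frames):
    """Log-determinant of the maximum-likelihood diagonal covariance,
    i.e. the sum over dimensions of log(variance)."""
    n = len(frames)
    dims = len(frames[0])
    total = 0.0
    for d in range(dims):
        column = [frame[d] for frame in frames]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n
        total += math.log(max(var, 1e-12))  # floor avoids log(0)
    return total


def delta_bic(seg_a, seg_b, penalty_weight=1.0):
    """Delta-BIC between modelling seg_a and seg_b with two Gaussians
    versus one pooled Gaussian. Positive values favour a speaker change
    (assumed sign convention)."""
    n_a, n_b = len(seg_a), len(seg_b)
    n = n_a + n_b
    dims = len(seg_a[0])
    # Log-likelihood gain from splitting the pooled segment in two.
    gain = 0.5 * (n * _log_det_diag(seg_a + seg_b)
                  - n_a * _log_det_diag(seg_a)
                  - n_b * _log_det_diag(seg_b))
    # Extra parameters of the second model: one mean and one variance per dimension.
    extra_params = 2 * dims
    penalty = 0.5 * penalty_weight * extra_params * math.log(n)
    return gain - penalty


# Toy 1-D "features": two segments with the same statistics, one shifted.
same_1 = [[0.0], [1.0]] * 50
same_2 = [[0.0], [1.0]] * 50
shifted = [[10.0], [11.0]] * 50

no_change = delta_bic(same_1, same_2)   # negative: merging is preferred
change = delta_bic(same_1, shifted)     # positive: a change point is likely
```

With identical statistics the split gains nothing and the penalty makes the value negative, so the segments would be merged; with a shifted mean the pooled variance grows sharply and the value becomes positive.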

12.

Multimedia technologies for structuring and retrieval of TV news

Yasuo Ariki 《New Generation Computing》2000,18(4):341-357

Because of media digitization, a large amount of information such as speech, audio and video data is produced every day. In order to retrieve data from these databases quickly and precisely, multimedia technologies for structuring and retrieval of speech, audio and video data are strongly required. In this paper, we give an overview of the multimedia technologies existing today for TV news databases, such as structuring and retrieval of speech, audio and video data, speaker indexing, audio summarization and cross media retrieval. The main purpose of structuring is to produce tables of contents and indices from audio and video data automatically. In order to make these technologies feasible, first, processing units such as words in audio data and shots in video data are extracted. In a second step, they are meaningfully integrated into topics. Furthermore, the units extracted from different types of media are integrated for higher functions. Yasuo Ariki, Ph.D.: He is a Professor in the Department of Electronics and Informatics at Ryukoku University. He received his B.E., M.E. and Ph.D. in information science from Kyoto University in 1974, 1976 and 1979, respectively. He was an Assistant at Kyoto University from 1980 to 1990, and stayed at Edinburgh University as a visiting academic from 1987 to 1990. His research interests are in speech and image recognition and in information retrieval and databases. He is a member of IPSJ, IEICE, ASJ, Soc. Artif. Intel. and IEEE.

13.

CrossoverNet: A computer graphics/video crossover LAN system

Tosiyasu L. Kunii Yukari Shirota 《The Visual computer》1986,2(2):78-89

A new type of local area network operating system named CrossoverNet is developed for effective control of digital devices and audio visual devices such as video cameras, TV displays, video discs and video cassette recorders. A remarkable feature of CrossoverNet is that it can handle and transfer both digital and analog information, including data, video and voice, in an integrated manner. To ensure secure operations of the entire network, we formalize CrossoverNet in a precise mathematical way by defining object, job, task and process as data types belonging to CrossoverNet. We also present a practical installation example of the CrossoverNet system.

14.

A two-channel audio generation method based on multimodal perception

官丽 尹康 樊梦佳 薛昆 解凯 《计算技术与自动化》2022,(4):157-165

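Most existing videos contain only mono audio and lack the sense of space that two-channel audio provides. To address this problem, this paper proposes a two-channel audio generation method based on multimodal perception. Building on an analysis of the visual information in the video, the method fuses the video's spatial information with the audio content and automatically adds spatial features to the original mono audio, generating two-channel audio that is closer to a real listening experience. We first adopt an improved audio-video fusion analysis network with an encoder-decoder structure to encode the mono-audio video, then fuse the video and audio features at multiple scales and analyze the video and audio information jointly, so that the two-channel audio carries spatial information that the original mono audio lacks, and finally generate the two-channel audio corresponding to the video. Experimental results on public datasets show that the method generates better two-channel audio than existing models, with improvements on both the STFT distance and the ENV distance metrics.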

15.

Prosodic and other Long-Term Features for Speaker Diarization (cited by: 1; self-citations: 0; citations by others: 1)

《IEEE transactions on audio, speech, and language processing》2009,17(5):985-993

Speaker diarization is defined as the task of determining “who spoke when” given an audio track and no other prior knowledge of any kind. The following article shows how a state-of-the-art speaker diarization system can be improved by combining traditional short-term features (MFCCs) with prosodic and other long-term features. First, we present a framework to study the speaker discriminability of 70 different long-term features. Then, we show how the top-ranked long-term features can be combined with short-term features to increase the accuracy of speaker diarization. The results were measured on standardized datasets (NIST RT) and show a consistent improvement of about 30% relative in diarization error rate compared to the best system presented at the NIST evaluation in 2007.

16.

An AVS-based embedded audio-video synchronization method

陈健 赵岩 陈贺新 《计算机工程》2009,35(3):240-241

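Audio-video synchronization is a key technology for applications such as digital television broadcasting and multimedia communication. This paper proposes an audio-video synchronization method based on AVS combined with embedding techniques. Compressed audio data is embedded into the AVS video coding system, which keeps audio and video synchronized throughout transmission or storage and during decoding and playback at the receiver. Experimental results show that the method achieves complete audio-video synchronization and reduces the overhead spent on synchronization.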

17.

A Multimodal Scheme for Program Segmentation and Representation in Broadcast Video Streams

Jinqiao Wang Lingyu Duan Qingshan Liu Hanqing Lu Jin J.S. 《Multimedia, IEEE Transactions on》2008,10(3):393-408

With the advance of digital video recording and playback systems, the need to manage recorded TV video programs efficiently is evident, so that users can readily locate and browse their favorite programs. In this paper, we propose a multimodal scheme to segment and represent TV video streams. The scheme aims to recover the temporal and structural characteristics of TV programs with visual, auditory, and textual information. In terms of visual cues, we develop a novel concept named program-oriented informative images (POIM) to identify the candidate points correlated with the boundaries of individual programs. For audio cues, a multiscale Kullback-Leibler (K-L) distance is proposed to locate audio scene changes (ASC), and accordingly ASC is aligned with video scene changes to represent candidate boundaries of programs. In addition, latent semantic analysis (LSA) is adopted to calculate the textual content similarity (TCS) between shots to model the inter-program similarity and intra-program dissimilarity in terms of speech content. Finally, we fuse the multimodal features of POIM, ASC, and TCS to detect the boundaries of programs including individual commercials (spots). Towards an effective program guide and attractive content browsing, we propose a multimodal representation of individual programs by using POIM images, key frames, and textual keywords in a summarization manner. Extensive experiments are carried out over an open benchmarking dataset, the TRECVID 2005 corpus, and promising results have been achieved. Compared with the electronic program guide (EPG), our solution provides a more generic approach to determine the exact boundaries of diverse TV programs even including dramatic spots.
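The abstract does not say how its multiscale K-L distance is computed, so the following Python sketch only illustrates the general idea under stated assumptions: each window of a one-dimensional audio feature is modelled as a Gaussian, the symmetric K-L divergence is taken between the windows before and after a candidate point, and the values for several window lengths are averaged. The function names, the window lengths, and the averaging rule are all hypothetical.

```python
def gaussian_fit(values):
    """Maximum-likelihood mean and variance of a window (variance floored)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, max(var, 1e-9)


def symmetric_kl(left, right):
    """Symmetric K-L divergence KL(p||q) + KL(q||p) between the
    Gaussians fitted to two windows of scalar features."""
    mu_l, var_l = gaussian_fit(left)
    mu_r, var_r = gaussian_fit(right)
    diff_sq = (mu_l - mu_r) ** 2
    return ((var_l + diff_sq) / (2.0 * var_r)
            + (var_r + diff_sq) / (2.0 * var_l)
            - 1.0)


def multiscale_kl(signal, center, scales=(4, 8, 16)):
    """Average symmetric K-L distance around `center` over several
    window lengths; scales that do not fit in the signal are skipped."""
    distances = []
    for width in scales:
        if center - width < 0 or center + width > len(signal):
            continue
        distances.append(symmetric_kl(signal[center - width:center],
                                      signal[center:center + width]))
    if not distances:
        raise ValueError("no window length fits around this position")
    return sum(distances) / len(distances)


# Toy feature track: a quiet scene followed by a louder one at index 32.
signal = [0.0, 1.0] * 16 + [10.0, 11.0] * 16

inside_scene = multiscale_kl(signal, center=16)   # windows share statistics
at_boundary = multiscale_kl(signal, center=32)    # windows straddle the change
```

A boundary detector along these lines would scan `center` across the track and flag peaks of the distance as candidate audio scene changes; using several window lengths is meant to make the peak less sensitive to any single choice of scale.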

18.

An Information Theoretic Approach to Speaker Diarization of Meeting Data (cited by: 1; self-citations: 0; citations by others: 1)

Vijayasenan D. Valente F. Bourlard H. 《IEEE transactions on audio, speech, and language processing》2009,17(7):1382-1393

19.

Multimodal genre classification of TV programs and YouTube videos (cited by: 1; self-citations: 1; citations by others: 0)

Hazım Kemal Ekenel Tomas Semela 《Multimedia Tools and Applications》2013,63(2):547-567

20.

Research on FC-based digital video transmission technology for avionics

王红春《微机发展》2010,(5):250-252,F0003

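The Fibre Channel Audio Video protocol (FC-AV) defines how video information is transmitted over a Fibre Channel (FC) network and provides an interface standard for interconnecting video devices; it represents the development trend of video transmission technology in avionics systems. This work studies the basic principles of FC-AV in depth and, based on an FPGA, implements key techniques including video capture, organization of the container system, organization and encapsulation of FC data frames, and display control for the display container system. A set of hardware interface modules was developed and a video transmission network was built, which validated the method of transmitting image information over Fibre Channel, addressed the difficulty of transmitting image information over long distances, and laid a foundation for the use of Fibre Channel image transmission in avionics.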