Similar Documents
20 similar documents found (query time: 964 ms)
1.
2.
With the advance of digital video recording and playback systems, the demand for efficiently managing recorded TV video programs is evident, so that users can readily locate and browse their favorite programs. In this paper, we propose a multimodal scheme to segment and represent TV video streams. The scheme aims to recover the temporal and structural characteristics of TV programs from visual, auditory, and textual information. In terms of visual cues, we develop a novel concept named program-oriented informative images (POIM) to identify candidate points correlated with the boundaries of individual programs. For audio cues, a multiscale Kullback-Leibler (K-L) distance is proposed to locate audio scene changes (ASC), and ASC is then aligned with video scene changes to represent candidate program boundaries. In addition, latent semantic analysis (LSA) is adopted to calculate the textual content similarity (TCS) between shots to model the inter-program similarity and intra-program dissimilarity in terms of speech content. Finally, we fuse the multimodal features of POIM, ASC, and TCS to detect program boundaries, including individual commercials (spots). Toward an effective program guide and attractive content browsing, we propose a multimodal representation of individual programs that summarizes each program with POIM images, key frames, and textual keywords. Extensive experiments are carried out on the open TRECVID 2005 benchmark corpus, with promising results. Compared with the electronic program guide (EPG), our solution provides a more generic approach to determining the exact boundaries of diverse TV programs, even including dramatic spots.
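The multiscale K-L distance idea for audio scene change detection can be sketched with a toy single-feature version: fit a Gaussian to the frames on each side of a candidate point at several window sizes and sum the symmetric K-L distances. The paper's actual acoustic features and scales are not specified here, so this is only an illustrative sketch.

```python
import numpy as np

def gaussian_kl(mu1, var1, mu2, var2):
    # Symmetric KL divergence between two univariate Gaussians.
    kl12 = 0.5 * (var1 / var2 + (mu2 - mu1) ** 2 / var2 - 1 + np.log(var2 / var1))
    kl21 = 0.5 * (var2 / var1 + (mu1 - mu2) ** 2 / var1 - 1 + np.log(var1 / var2))
    return kl12 + kl21

def multiscale_kl(feature, scales=(50, 100, 200)):
    """Score each frame by the KL distance between the Gaussians fitted to
    its left and right windows, summed over several window scales."""
    n = len(feature)
    scores = np.zeros(n)
    for w in scales:
        for t in range(w, n - w):
            left, right = feature[t - w:t], feature[t:t + w]
            scores[t] += gaussian_kl(left.mean(), left.var() + 1e-8,
                                     right.mean(), right.var() + 1e-8)
    return scores / len(scales)
```

Peaks in the score indicate candidate audio scene changes; a real system would run this over frame-level acoustic features rather than a raw scalar signal.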

3.
4.

Language resources for studying doctor–patient interaction are rare, primarily due to the ethical issues related to recording real medical consultations. Rarer still are resources that involve more than one healthcare professional in consultation with a patient, despite many chronic conditions requiring multiple areas of expertise for effective treatment. In this paper, we present the design, construction and output of the Patient Consultation Corpus, a multimodal corpus of simulated consultations between a patient portrayed by an actor, and at least two healthcare professionals with different areas of expertise. As well as the transcribed text from each consultation, the corpus also contains audio and video where for each consultation: the audio consists of individual tracks for each participant, allowing for clear identification of speakers; the video consists of two framings for each participant—upper-body and face—allowing for close analysis of behaviours and gestures. Having presented the design and construction of the corpus, we then go on to briefly describe how the multi-modal nature of the corpus allows it to be analysed from several different perspectives.


5.
Identifying the active speaker in a video of a distributed meeting can be very helpful for remote participants to understand the dynamics of the meeting. A straightforward application of such analysis is to stream a high-resolution video of the speaker to the remote participants. In this paper, we present the challenges we met while designing a speaker detector for the Microsoft RoundTable distributed meeting device, and propose a novel boosting-based multimodal speaker detection (BMSD) algorithm. Instead of separately performing sound source localization (SSL) and multiperson detection (MPD) and subsequently fusing their individual results, the proposed algorithm fuses audio and visual information at the feature level, using boosting to select features from a combined pool of audio and visual features simultaneously. The result is a very accurate speaker detector with extremely high efficiency. In experiments that include hundreds of real-world meetings, the proposed BMSD algorithm reduces the error rate of the SSL-only approach by 24.6% and of the SSL-MPD fusion approach by 20.9%. To the best of our knowledge, this is the first real-time multimodal speaker detection algorithm deployed in commercial products.
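Feature-level fusion via boosting, as opposed to decision-level fusion, can be illustrated with a minimal AdaBoost over a pooled feature matrix in which audio and visual columns compete directly for selection. This is a generic sketch, not the BMSD feature set or weak learners.

```python
import numpy as np

def train_stumps(X, y, rounds=10):
    """Minimal AdaBoost with decision stumps over a pooled feature matrix.
    Each round picks the single (feature, threshold, sign) with the lowest
    weighted error, so audio and visual columns are selected jointly."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    stumps = []
    for _ in range(rounds):
        best = None
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = np.where(sign * (X[:, j] - thr) > 0, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign, pred)
    # Reweight samples toward the mistakes of the chosen stump.
        err, j, thr, sign, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        stumps.append((alpha, j, thr, sign))
    return stumps

def predict(stumps, X):
    score = sum(a * np.where(s * (X[:, j] - t) > 0, 1, -1)
                for a, j, t, s in stumps)
    return np.sign(score)
```

Inspecting which columns the stumps select shows which modality carried the discriminative information for a given dataset.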

6.
This paper proposes a novel representation space for multimodal information, enabling fast and efficient retrieval of video data. We suggest describing documents not directly by selected multimodal features (audio, visual, or text), but rather by their cross-document similarities with respect to those multimodal characteristics. This idea leads us to propose a particular form of dissimilarity space adapted to the asymmetric classification problem, and in turn to the query-by-example and relevance-feedback paradigm widely used in information retrieval. Based on the proposed dissimilarity space, we then define various strategies to fuse modalities through a kernel-based learning approach. The problem of automatically setting the kernel to adapt the learning process to the queries is also discussed. The properties of our strategies are studied and validated on artificial data. In a second phase, a large annotated video corpus (i.e., TRECVID-05), indexed by visual, audio, and text features, is used to evaluate the overall performance of the dissimilarity space and fusion strategies. The obtained results confirm the validity of the proposed approach for the representation and retrieval of multimodal information in a real-time framework.
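A dissimilarity-space representation of this kind can be sketched as follows: each item is described by its distances to a set of prototype documents, and kernel-based learning then operates on those vectors. The prototype set, metric, and kernel here are illustrative assumptions.

```python
import numpy as np

def dissimilarity_space(X, prototypes):
    """Represent each item by its distances to a set of prototypes,
    instead of by its raw multimodal feature vector."""
    return np.array([[np.linalg.norm(x - p) for p in prototypes] for x in X])

def rbf_kernel(D, gamma=0.5):
    # Kernel computed on the dissimilarity vectors, ready for any
    # kernel-based learner (e.g., an SVM for relevance feedback).
    sq = ((D[:, None, :] - D[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)
```

A practical system would compute one dissimilarity per modality and fuse them, which is where the paper's fusion strategies come in.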

7.
Most existing videos contain only monaural audio and lack the sense of space that binaural audio provides. To address this, this paper proposes a binaural audio generation method based on multimodal perception: building on an analysis of the visual information in a video, it fuses the video's spatial information with the audio content and automatically adds spatialization cues to the original mono audio, generating binaural audio closer to a real listening experience. We first adopt an improved audio-video fusion analysis network with an encoder-decoder structure to encode the mono video, then perform multiscale fusion of the video and audio features and analyze the two streams jointly, so that the generated binaural audio carries spatial information absent from the original mono track, finally producing the binaural audio corresponding to the video. Experimental results on public datasets show that the method outperforms existing models for binaural audio generation, improving on both the STFT distance and the ENV distance metrics.
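The two evaluation metrics mentioned above (STFT distance and ENV distance) can be sketched as follows, assuming an L2 distance between magnitude spectrograms and between frame-wise RMS amplitude envelopes; the paper's exact definitions may differ.

```python
import numpy as np

def stft_mag(x, win=256, hop=128):
    # Magnitude STFT via Hann-windowed, frame-wise real FFT.
    frames = [x[i:i + win] * np.hanning(win)
              for i in range(0, len(x) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

def stft_distance(pred, ref):
    """Mean squared error between magnitude spectrograms (assumed form)."""
    return float(np.mean((stft_mag(pred) - stft_mag(ref)) ** 2))

def env_distance(pred, ref, win=256, hop=128):
    """Distance between amplitude envelopes, approximated here by
    frame-wise RMS energy (assumed form of the ENV metric)."""
    def env(x):
        return np.array([np.sqrt(np.mean(x[i:i + win] ** 2))
                         for i in range(0, len(x) - win + 1, hop)])
    return float(np.mean((env(pred) - env(ref)) ** 2))
```

In a binaural-generation setting these distances would be computed per output channel (or on the left/right difference signal) against the ground truth.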

8.
In this paper, a unified framework for multimodal content retrieval is presented. The proposed framework supports retrieval of rich media objects as unified sets of different modalities (image, audio, 3D, video, and text) by efficiently combining all monomodal heterogeneous similarities into a global one according to an automatic weighting scheme. A multimodal space is then constructed to capture the semantic correlations among the modalities. In contrast to existing techniques, the proposed method is also able to handle external multimodal queries by embedding them into the already constructed multimodal space, following a space-mapping procedure based on submanifold analysis. In our experiments with five real multimodal datasets, we show the superiority of the proposed approach over competitive methods.
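An automatic weighting scheme for combining monomodal similarities might look like the following sketch, where each modality is weighted by the spread of its normalized similarity scores. This heuristic is a hypothetical stand-in for the paper's actual scheme.

```python
import numpy as np

def fuse_similarities(sims):
    """Combine per-modality similarity matrices into one global similarity.
    Hypothetical automatic weighting: normalize each modality to [0, 1],
    then weight by the variance of its scores -- a flat similarity matrix
    carries little ranking information and gets a small weight."""
    norm = []
    for s in sims:
        s = np.asarray(s, dtype=float)
        span = s.max() - s.min()
        norm.append((s - s.min()) / span if span > 0 else np.zeros_like(s))
    weights = np.array([n.var() for n in norm])
    if weights.sum() > 0:
        weights = weights / weights.sum()
    else:
        weights = np.full(len(norm), 1.0 / len(norm))
    return sum(w * n for w, n in zip(weights, norm))
```

The fused matrix can then drive nearest-neighbor retrieval, with each query compared against all objects in the global similarity space.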

9.
Audio-Visual People Diarization (AVPD) is an original framework that simultaneously improves audio, video, and audiovisual diarization results. Following a literature review of people diarization for both audio and video content and its limitations, including our own contributions, we describe a proposed method for associating audio and video information by using co-occurrence matrices, and present experiments conducted on a corpus containing TV news, TV debates, and movies. Results show the effectiveness of the overall diarization system and confirm the gains that audio information can bring to video indexing and vice versa.
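Associating audio-speaker clusters with face clusters through a co-occurrence matrix can be sketched as follows; this is a minimal greedy version, and the paper's association procedure may be more elaborate.

```python
import numpy as np

def associate(audio_labels, face_labels, n_audio, n_face):
    """Build a co-occurrence matrix between audio-speaker and face clusters
    over synchronized frames, then greedily associate each face cluster
    with the audio cluster it co-occurs with most often."""
    C = np.zeros((n_audio, n_face), dtype=int)
    for a, f in zip(audio_labels, face_labels):
        C[a, f] += 1
    mapping = {f: int(np.argmax(C[:, f])) for f in range(n_face)}
    return C, mapping
```

With the mapping in hand, labels from one modality can correct or refine the diarization of the other.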

10.
The design and performance of a low bit-rate video telephony service for mobile third-generation (3G) systems is presented. The ITU-T G.723.1 speech coding and the ITU-T H.263 video coding recommendations are used, as proposed by the ITU-T H.324 low bit-rate multimedia communications recommendation. The target bit-rate for the H.324 service is 64 kb/s. The design is performed in conjunction with that of a wideband code division multiple access (W-CDMA) radio transmission technology (RTT) system, proposed by the European Space Agency (ESA) for the satellite component of the ITU IMT-2000 standard. Most of the results could also be applied to 3G terrestrial systems. The use of concatenated channel coding with convolutional inner coding and Reed-Solomon (RS) outer coding is investigated. Service designs based on equal error protection (EEP) and unequal error protection (UEP) schemes for the audio and video sources are compared. Simulation of the proposed video telephony services shows that significantly more graceful video and audio degradation is obtained with the proposed UEP scheme than with a more straightforward EEP method. The UEP scheme significantly reduces the occurrence of highly annoying audio and video artefacts, allowing satellite-based video telephony services that are compatible with current Internet-based applications.

11.
12.
Pornographic video detection based on multimodal fusion is an effective approach for filtering pornography. However, existing methods lack an accurate representation of audio semantics and pay little attention to the characteristics of pornographic audio. In this paper, we propose a novel framework that fuses an audio vocabulary with visual features for pornographic video detection. The novelty of our approach lies in three aspects: an audio semantics representation method based on an energy envelope unit (EEU) and bag-of-words (BoW), a periodicity-based audio segmentation algorithm, and a periodicity-based video decision algorithm. The first, named the EEU+BoW representation method, describes audio semantics via an audio vocabulary, which is constructed by k-means clustering of EEUs. The latter two components work together to make full use of the periodicities in pornographic audio. Using the periodicity-based audio segmentation algorithm, audio streams are divided into EEU sequences. After these EEUs are classified, videos are judged to be pornographic or not by the periodicity-based video decision algorithm. Before fusion, two support vector machines are applied to the audio-vocabulary-based and visual-features-based methods, respectively. To fuse their results, a keyframe is selected from each EEU according to its beginning and ending positions, and an integrated weighting scheme together with the periodicity-based video decision algorithm yields the final detection results. Experimental results show that our approach outperforms the traditional one based only on visual features, achieving a true positive rate of 94.44% at a false positive rate of 9.76%.
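Periodicity in the energy envelope, which the segmentation and decision algorithms above exploit, can be scored with a simple autocorrelation sketch; this is a hypothetical stand-in for the paper's algorithms.

```python
import numpy as np

def energy_envelope(x, win=50):
    # Short-time energy envelope of an audio signal.
    n = len(x) // win
    return np.array([np.mean(x[i * win:(i + 1) * win] ** 2) for i in range(n)])

def periodicity_score(env):
    """Score periodicity as the highest normalized autocorrelation value
    at a non-zero lag: near 1 for strongly periodic envelopes."""
    e = env - env.mean()
    denom = float(np.dot(e, e))
    if denom == 0:
        return 0.0
    ac = np.correlate(e, e, mode='full')[len(e):] / denom  # positive lags
    return float(ac.max()) if len(ac) else 0.0
```

A segmenter could cut the stream where this score (over a sliding window) rises above a threshold, yielding the EEU-like units the abstract describes.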

13.
Three studies of collaborative activity were conducted as part of research in developing multimedia technology to support collaboration. One study surveyed users' opinions of their use of video conference rooms. Users indicated that the availability of the video conference rooms was too limited, audio quality needed improvement, and a shared drawing space was needed. A second study analyzed videotapes of a work group when meeting face-to-face, video conferencing, and phone conferencing. The analyses found that the noticeable audio delay in video conferencing made it difficult for the participants to manage turn-taking and coordinate eye glances. In the third study, a distributed team was observed under three conditions: using their existing collaboration tools, adding a desktop conferencing prototype (audio, video, and shared drawing tool), and subtracting the video capability from the prototype. Qualitative and quantitative data were collected by videotaping the team, interviewing the team members individually, and recording their usage of the phone, electronic mail, face-to-face meetings, and desktop conferencing. The team's use of the desktop conferencing prototype dropped significantly when the video capability was removed. Analysis of the videotape data showed how the video channel was used to help mediate their interaction and convey visual information. Desktop conferencing apparently reduced e-mail usage and was perceived to reduce the number of shorter, two-person, face-to-face meetings.

14.
In this paper, we introduce an embedded-hardware, low-complexity JPEG 2000 video coding system. The hardware implementation exploits temporal redundancy by coding differential frames, which are arranged in an adaptive group of pictures (GOP) structure. The GOP structure is determined by statistical analysis of previously coded differential frames. We present a hardware video coding system that applies this interframe coding scheme in a digital signal processor environment. The system consists of a microprocessor (an ADSP-BF533 Blackfin processor from Analog Devices) and a JPEG 2000 compression engine (ADV202). This straightforward system is well suited for encoding surveillance-type video at low cost.
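An adaptive GOP decision driven by statistics of previously coded differential frames can be sketched as a simple thresholding rule; the factor and statistic here are illustrative assumptions, not the system's actual rule.

```python
def choose_mode(history, current, factor=2.5):
    """Decide whether the next frame is coded intra (starting a new GOP)
    or as a differential frame.

    history: residual energies of the differential frames coded so far
             in the current GOP; current: energy of the new residual."""
    if not history:
        return "intra"  # nothing to predict from: start a GOP
    mean_energy = sum(history) / len(history)
    if current > factor * mean_energy:
        return "intra"  # large residual suggests a scene change
    return "diff"
```

In the described system, intra frames and residuals alike would be handed to the ADV202 JPEG 2000 engine; only the GOP decision runs on the host processor.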

15.
Automatic parsing of news video based on multimodal analysis (cited by 1: 0 self, 1 other)
Wang Weiqiang, Gao Wen. 《软件学报》 (Journal of Software), 2001, 12(9): 1271-1278
This paper proposes a method for automatically parsing news video that combines visual, audio, and textual information, and discusses in detail the extraction of audio features and the algorithm that integrates the multiple modalities. The use of multimodal information effectively compensates for the limitations of segmenting news items by image analysis alone, making the method more broadly applicable to news items presented in different styles. On a test set of 184,100 frames, the system achieved 95.1% recall and 93.3% precision in detecting news-item boundaries. The experimental results demonstrate the effectiveness and robustness of the method.

16.
In peer-to-peer (P2P) video-on-demand (VoD) systems, a scalable source coding is a promising solution to provide heterogeneous peers with different video quality. In this paper, we present a systematic study on the throughput maximization problem in P2P VoD applications. We apply network coding to scalable P2P systems to eliminate the delivery redundancy. Since each peer receives distinct packets, a peer with a higher throughput can reconstruct the video at a higher quality. We maximize the throughput in the existing buffer-forwarding P2P VoD systems using a fully distributed algorithm. We demonstrate in the simulations that the proposed distributed algorithm achieves a higher throughput compared to the proportional allocation scheme or the equal allocation scheme. The existing buffer-forwarding architecture has a limitation in total upload capacity. Therefore we propose a hybrid-forwarding P2P VoD architecture to improve the throughput by combining the buffer-forwarding approach with the storage-forwarding approach. The throughput maximization problem in the hybrid-forwarding architecture is also solved using a fully distributed algorithm. We demonstrate that the proposed hybrid-forwarding architecture greatly improves the throughput compared to the existing buffer-forwarding architecture. In addition, by adjusting the priority weight at each peer, we can implement the differentiated throughput among different users within a video session in the buffer-forwarding architecture, and the differentiated throughput among different video sessions in the hybrid-forwarding architecture.   相似文献   
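The proportional allocation baseline that the paper compares against can be sketched in a few lines; the distributed throughput-maximization algorithm itself is more involved.

```python
def proportional_allocation(upload_capacity, demands):
    """Split a serving peer's upload capacity across requesting peers in
    proportion to their demands (a baseline scheme, not the paper's
    distributed optimizer)."""
    total = sum(demands)
    if total == 0:
        return [0.0 for _ in demands]
    return [upload_capacity * d / total for d in demands]
```

Under scalable coding with network coding, the bandwidth each peer receives this way translates directly into the video quality it can reconstruct, which is why the allocation rule matters.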

17.
18.
Multimodal sensing, recognizing and browsing group social dynamics (cited by 1: 0 self, 1 other)
Group social dynamics are crucial for determining whether a meeting was well organized and its conclusions well reasoned. In this paper, we propose multimodal approaches for sensing, recognizing, and browsing social dynamics, specifically human semantic interactions and group interests, in small group meetings. Unlike physical interactions (e.g., turn-taking and addressing), the human interactions considered here incorporate semantics, i.e., a user's intention or attitude toward a topic. Group interests are defined as episodes in which participants engage in an emphatic and heated discussion. We adopt multiple sensors, such as video cameras, microphones, and motion sensors, for meeting capture. Multimodal methods are proposed for human interaction recognition and group interest recognition based on a variety of features. A graphical user interface, the MMBrowser, is presented for browsing group social dynamics. Experimental results demonstrate the feasibility of the proposed approaches.

19.
This paper studies the quality of multimedia content at very low bitrates. We carried out subjective experiments for assessing audiovisual, audio-only, and video-only quality. We selected content and encoding parameters that are typical of mobile applications. Our focus was the MPEG-4 AVC (a.k.a. H.264) and AAC coding standards. Based on these data, we first analyze the influence of video and audio coding parameters on quality. We then investigate the optimal trade-off between bits allocated to audio and to video under global bitrate constraints. Finally, we explore models for the interactions between audio and video in terms of perceived audiovisual quality.
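Searching for the optimal audio/video bit allocation under a global bitrate constraint can be sketched as follows; the additive model and logarithmic quality curves are illustrative assumptions, not the models fitted in the paper.

```python
import math

def best_split(total_kbps, audio_opts, audio_quality, video_quality):
    """Exhaustively search the audio/video bitrate split that maximizes a
    simple additive audiovisual quality model q = qa(a) + qv(total - a)."""
    best = None
    for a in audio_opts:
        v = total_kbps - a
        if v <= 0:
            continue
        q = audio_quality(a) + video_quality(v)
        if best is None or q > best[0]:
            best = (q, a, v)
    return best  # (quality, audio kb/s, video kb/s)
```

Real audiovisual quality is known to involve cross-modal interaction terms, which is exactly what the paper's models explore beyond this additive toy version.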

20.
A Multimodal and Multilevel Ranking Scheme for Large-Scale Video Retrieval (cited by 2: 0 self, 2 other)
A critical issue in large-scale multimedia retrieval is how to develop an effective framework for ranking the search results. This problem is particularly challenging for content-based video retrieval due to issues such as short text queries, insufficient training samples, fusion of multimodal contents, and large-scale learning over huge media collections. In this paper, we propose a novel multimodal and multilevel (MMML) ranking framework to attack this challenging ranking problem. We represent the video retrieval task with graphs and suggest a graph-based semi-supervised ranking (SSR) scheme, which can learn effectively from small samples and smoothly integrate multimodal resources for ranking. To make the semi-supervised ranking solution practical for large-scale retrieval tasks, we propose a multilevel ranking framework that unifies several different ranking approaches in a cascade fashion. We have conducted empirical evaluations of our solution on automatic search tasks over the TRECVID 2005 benchmark testbed. The promising results show that our ranking solutions are effective and highly competitive with the state-of-the-art solutions in the TRECVID evaluations.
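A graph-based semi-supervised ranking scheme in the manifold-ranking style can be sketched as follows: relevance scores from a few labeled examples are propagated over a similarity graph. This is a generic SSR iteration; the paper's graphs and kernels are not reproduced here.

```python
import numpy as np

def ssr(W, y, alpha=0.85, iters=100):
    """Semi-supervised ranking on a similarity graph W with seed scores y,
    iterating f <- alpha * S f + (1 - alpha) * y, where
    S = D^{-1/2} W D^{-1/2} is the symmetrically normalized graph."""
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))
    f = y.astype(float).copy()
    for _ in range(iters):
        f = alpha * S @ f + (1 - alpha) * y
    return f  # final scores rank all nodes by relevance to the seeds
```

In a cascade, a cheap ranker would first prune the collection, and this propagation would rerank only the surviving candidates, keeping the graph small.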
