Related Articles
20 related articles found.
1.
2.
To improve the accuracy of video semantic information extraction, a news video semantic extraction framework based on multimodal features is proposed. Topic caption text is extracted from the video, the audio track is classified and passed through speech recognition, and web pages related to the news video are retrieved with a search engine using the topic caption text. Finally, the web page text is used to correct the speech recognition output, so that the cross-modal fusion of caption information and speech transcripts improves the accuracy of video semantic extraction. Tests on a medium-scale news video corpus (with accompanying news web pages) demonstrate the effectiveness of the method; after correction, speech recognition accuracy reaches about 65%.
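The correction step can be pictured with a small, hypothetical Python sketch (not the authors' implementation): the web pages retrieved via the caption keywords supply a domain vocabulary, and any ASR token close to a vocabulary entry is snapped to that spelling.

```python
import difflib

def correct_asr_with_web_text(asr_tokens, web_text_tokens, cutoff=0.8):
    """Replace ASR tokens with the closest token seen in related web pages.

    A naive illustration of cross-modal correction: tokens already present in
    the web-page vocabulary are kept, otherwise the closest vocabulary entry
    (by difflib similarity) is substituted when it is similar enough.
    """
    vocabulary = set(web_text_tokens)
    corrected = []
    for token in asr_tokens:
        if token in vocabulary:
            corrected.append(token)   # already consistent with the web text
            continue
        candidates = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        corrected.append(candidates[0] if candidates else token)
    return corrected

# Toy example: the misrecognised "Obana" is fixed by the web-page vocabulary.
print(correct_asr_with_web_text(
    ["president", "Obana", "visits", "Berlin"],
    ["president", "Obama", "arrives", "in", "Berlin", "today"]))
# -> ['president', 'Obama', 'visits', 'Berlin']
```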

3.
In this paper, a unified framework for multimodal content retrieval is presented. The proposed framework supports retrieval of rich media objects as unified sets of different modalities (image, audio, 3D, video and text) by efficiently combining all monomodal heterogeneous similarities into a global one according to an automatic weighting scheme. Then, a multimodal space is constructed to capture the semantic correlations among multiple modalities. In contrast to existing techniques, the proposed method is also able to handle external multimodal queries by embedding them into the already constructed multimodal space, following a space-mapping procedure based on submanifold analysis. In our experiments with five real multimodal datasets, we show the superiority of the proposed approach over competitive methods.
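As a rough illustration of the fusion step only, the sketch below combines per-modality similarity matrices into one global matrix; the uniform default weights are a stand-in for the paper's automatic weighting scheme, and the multimodal-space construction and submanifold embedding are not reproduced.

```python
import numpy as np

def global_similarity(monomodal_sims, weights=None):
    """Combine per-modality similarity matrices into a single global matrix.

    monomodal_sims: list of (n_objects x n_objects) similarity matrices, one
    per modality (image, audio, 3D, video, text). Each matrix is min-max
    normalised so heterogeneous scales become comparable, then a weighted sum
    is taken (weights default to uniform here).
    """
    sims = [np.asarray(s, dtype=float) for s in monomodal_sims]
    if weights is None:
        weights = np.ones(len(sims))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    normed = [(s - s.min()) / (s.max() - s.min() + 1e-12) for s in sims]
    return sum(w * s for w, s in zip(weights, normed))

image_sim = np.array([[1.0, 0.2, 0.4], [0.2, 1.0, 0.1], [0.4, 0.1, 1.0]])
text_sim  = np.array([[1.0, 0.7, 0.3], [0.7, 1.0, 0.2], [0.3, 0.2, 1.0]])
print(global_similarity([image_sim, text_sim], weights=[0.6, 0.4]))
```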

4.
We present a multimodal document alignment framework, which highlights existing alignment relationships between documents that are discussed and recorded during multimedia events such as meetings. These relationships, which should help index the archives of such events, are detected using various techniques from natural language processing and information retrieval. The main alignment strategies studied are based on thematic, quotation and reference relationships. At the analysis level, the alignment framework was applied at several levels of document granularity, requiring specific document segmentation techniques. Our framework, which is language independent, was evaluated on corpora in French and English, including meetings and scientific presentations. The satisfactory evaluation results obtained at several stages show the importance of our approach in bridging the gap between meeting documents, independently of the language and domain. They also highlight the utility of multimodal alignment in advanced applications, e.g. multimedia document browsing and content-based/temporal-based searching.
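A toy sketch of the thematic alignment strategy only (my simplification; the paper's actual techniques are richer): transcript segments and document sections are linked when their bag-of-words cosine similarity passes a threshold.

```python
import math
import re
from collections import Counter

def _bow(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))

def _cosine(a, b):
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def thematic_alignment(transcript_segments, document_sections, threshold=0.2):
    """Return (segment index, section index, score) triples whose bag-of-words
    cosine similarity exceeds the threshold -- a crude stand-in for the
    thematic alignment relationships described in the abstract."""
    links = []
    for i, seg in enumerate(transcript_segments):
        for j, sec in enumerate(document_sections):
            score = _cosine(_bow(seg), _bow(sec))
            if score >= threshold:
                links.append((i, j, round(score, 3)))
    return links

print(thematic_alignment(
    ["we discussed the budget for next year"],
    ["Budget proposal for the next fiscal year", "Minutes of the previous meeting"]))
```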

5.
Nowadays, mashup services and especially metasearch engines play an increasingly important role on the Web. Most users use them, directly or indirectly, to access and aggregate information from more than one data source. As with other search systems, the effectiveness of a metasearch engine is mainly determined by the quality of the results it returns in response to user queries. Since these services do not maintain their own document index, they exploit multiple search engines using a rank aggregation method in order to classify the collected results. However, the rank aggregation methods proposed until now utilize a very limited set of parameters regarding these results, such as the total number of the exploited resources and the rankings they receive from each individual resource. In this paper we present QuadRank, a new rank aggregation method, which takes into consideration additional information regarding the query terms, the collected results and the data correlated to each of these results (title, textual snippet, URL, individual ranking and others). We have implemented and tested QuadRank in a real-world metasearch engine, QuadSearch, a system developed as a testbed for algorithms related to the broader problem of metasearching. The name QuadSearch refers to the current number of exploited engines (four). We have exhaustively tested QuadRank for both effectiveness and efficiency in the real-world search environment of QuadSearch and also using a task from the recent TREC-2009 conference. The results we present in our experiments reveal that in most cases QuadRank outperformed all component engines, another metasearch engine (Dogpile) and two successful rank aggregation methods, Borda Count and the Outranking Approach.
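For context, a minimal sketch of the Borda Count baseline mentioned at the end of the abstract (not of QuadRank itself, which additionally exploits query terms, titles, snippets and URLs):

```python
from collections import defaultdict

def borda_count(rankings):
    """Aggregate several ranked result lists with the classic Borda count.
    Each list contributes (list length - position) points to every item it
    ranks; unranked items receive nothing from that list."""
    scores = defaultdict(float)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] += n - position
    return sorted(scores, key=scores.get, reverse=True)

# Three component engines return partially overlapping result lists.
engines = [
    ["a.com", "b.com", "c.com"],
    ["b.com", "a.com", "d.com"],
    ["a.com", "d.com", "b.com"],
]
print(borda_count(engines))   # a.com ranked first overall
```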

6.
Multimodal representation learning has gained increasing importance in various real-world multimedia applications. Most previous approaches focused on exploring inter-modal correlation by learning a common or intermediate space in a conventional way, e.g. Canonical Correlation Analysis (CCA). These works neglected the exploration of fusing multiple modalities at a higher semantic level. In this paper, inspired by the success of deep networks in multimedia computing, we propose a novel unified deep neural framework for multimodal representation learning. To capture the high-level semantic correlations across modalities, we adopt deep learning features as the image representation and topic features as the text representation, respectively. In joint model learning, a 5-layer neural network is designed, with supervised pre-training enforced in the first 3 layers for intra-modal regularization. Extensive experiments on the benchmark Wikipedia and MIR Flickr 25K datasets show that our approach achieves state-of-the-art results compared to both shallow and deep models in multimodal and cross-modal retrieval.
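To make the CCA baseline concrete (this sketches the conventional common-space approach the abstract contrasts against, not the proposed deep network; the features below are random placeholders with an artificial shared latent component):

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n = 200
latent = rng.normal(size=(n, 10))                              # shared semantics
image_feats = np.hstack([latent, rng.normal(size=(n, 118))])   # stand-in CNN features
text_feats = np.hstack([latent + 0.1 * rng.normal(size=(n, 10)),
                        rng.normal(size=(n, 54))])             # stand-in topic features

# Project both modalities into a common space learned by CCA.
cca = CCA(n_components=10).fit(image_feats, text_feats)
img_c, txt_c = cca.transform(image_feats, text_feats)

# Cross-modal retrieval in the common space: rank texts against an image query.
q = img_c[0] / np.linalg.norm(img_c[0])
db = txt_c / np.linalg.norm(txt_c, axis=1, keepdims=True)
print(np.argsort(-db @ q)[:5])   # the paired text (index 0) should rank near the top
```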

7.
In the classic rank aggregation (RA) problem, we are given L input lists with potentially inconsistent orders of n elements; our goal is to find a single order of all elements that minimizes the total number of disagreements with the given orders. The problem is well known to be NP-hard, already for L=4. We consider a generalization of RA, where each list is associated with a set of orderings, and our goal is to choose one ordering per list and to find a permutation of the elements that minimizes the total number of disagreements with the chosen orderings. For the case in which the lists completely overlap, i.e. each list contains all n elements, we show that a simple Greedy algorithm yields a (2−2/L)-approximation for generalized RA. The case in which the lists only partially overlap, i.e. each list contains a subset of the n elements, is much harder to approximate. In fact, we show that RA with multiple orderings per list and partial overlaps cannot be approximated within any bounded ratio.
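A small sketch of the objective being minimized, in the full-overlap, single-ordering-per-list setting; the brute force over permutations is only for illustration, since the problem is NP-hard in general.

```python
from itertools import combinations, permutations

def disagreements(order, lists):
    """Total number of pairwise disagreements between a candidate order and
    the input lists (Kendall-tau-style objective of classic rank aggregation)."""
    position = {item: i for i, item in enumerate(order)}
    total = 0
    for lst in lists:
        pos_in_list = {item: i for i, item in enumerate(lst)}
        for a, b in combinations(order, 2):
            # A disagreement: the pair is ordered differently in the two rankings.
            if (position[a] - position[b]) * (pos_in_list[a] - pos_in_list[b]) < 0:
                total += 1
    return total

lists = [list("abc"), list("acb"), list("bac"), list("abc")]
best = min(permutations("abc"), key=lambda p: disagreements(p, lists))
print(best, disagreements(best, lists))   # ('a', 'b', 'c') with 2 disagreements
```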

8.
9.
Hypermedia composite templates define generic structures of nodes and links to be added to a document composition, providing spatio-temporal synchronization semantics. This paper presents EDITEC, a graphical editor for hypermedia composite templates. EDITEC templates are based on the XTemplate 3.0 language. The editor was designed to offer a user-friendly visual approach. It presents a new method that provides several options for representing iteration structures graphically, in order to specify a certain behavior to be applied to a set of generic document components. The editor provides a multi-view environment, giving the user complete control of the composite template during the authoring process. Composite templates can be used in NCL documents for embedding spatio-temporal semantics into NCL contexts. NCL is the standard declarative language used for the production of interactive applications in the Brazilian digital TV system and ITU H.761 IPTV services. Hypermedia composite templates could also be used in other hypermedia authoring languages, offering new types of compositions with predefined semantics.

10.
Multimodal function optimization, where the aim is to locate more than one solution, has attracted growing interest, especially in the evolutionary computing research community. To evaluate experimentally the strengths and weaknesses of multimodal optimization algorithms, it is important to use test functions representing different characteristics and various levels of difficulty. The available selection of multimodal test problems is, however, rather limited, and no general framework exists. This paper describes an attempt to construct a software framework which includes a variety of easily tunable test functions. The aim is to provide a general and easily expandable environment for testing different methods of multimodal optimization. Several function families with different characteristics are included. The framework implements new parameterizable function families for generating desired landscapes. Additionally, the framework implements a selection of well-known test functions from the literature, which can be rotated and stretched. The software module can easily be imported into any optimization algorithm implementation compatible with the C programming language. As an application example, 8 optimization approaches are compared by their ability to locate several global optima over a set of 16 functions with different properties generated by the proposed module. The effects of function regularity, dimensionality and number of local optima on the performance of different algorithms are studied.
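The framework itself is a C module; purely as an illustration of what a "parameterizable multimodal landscape" means, here is a hypothetical Python toy in which the number, positions, heights and widths of the peaks are all tunable.

```python
import numpy as np

def gaussian_peaks(x, centers, heights, widths):
    """Tunable multimodal test landscape: the value at each point is the
    highest of several Gaussian humps, so the listed heights are exactly the
    local optimum values when the humps do not overlap strongly."""
    x = np.atleast_2d(x).astype(float)
    value = np.zeros(len(x))
    for c, h, w in zip(centers, heights, widths):
        d2 = np.sum((x - np.asarray(c, dtype=float)) ** 2, axis=1)
        value = np.maximum(value, h * np.exp(-d2 / (2.0 * w ** 2)))
    return value

# Four peaks in 2-D, two of them global optima of height 1.0.
centers = [(0.2, 0.2), (0.8, 0.8), (0.2, 0.8), (0.8, 0.2)]
heights = [1.0, 1.0, 0.6, 0.6]
widths  = [0.05, 0.05, 0.1, 0.1]
print(gaussian_peaks([(0.2, 0.2), (0.5, 0.5)], centers, heights, widths))
```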

11.
We tackle the crucial challenge of fusing different modalities of features for multimodal sentiment analysis. Mainly based on neural networks, existing approaches largely model multimodal interactions in an implicit and hard-to-understand manner. We address this limitation with inspiration from quantum theory, which contains principled methods for modeling complicated interactions and correlations. In our quantum-inspired framework, the word interactions within a single modality and the interactions across modalities are formulated with superposition and entanglement, respectively, at different stages. The complex-valued neural network implementation of the framework achieves results comparable to state-of-the-art systems on two benchmark video sentiment analysis datasets. In the meantime, the model directly produces unimodal and bimodal sentiment outputs that help interpret the entangled decision.
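As a purely conceptual aid (my own toy, not the paper's complex-valued network), a sentence-level "superposition" can be pictured as a unit complex vector built from weighted word vectors, from which a density matrix is formed for later measurement-style readout.

```python
import numpy as np

def superposition_state(word_vectors, weights):
    """Toy illustration of the superposition idea: a sentence state is a unit
    complex vector formed as a weighted sum of unit-normalised word vectors."""
    state = np.zeros(word_vectors.shape[1], dtype=complex)
    for vec, w in zip(word_vectors, weights):
        state += w * (vec / np.linalg.norm(vec))
    return state / np.linalg.norm(state)

rng = np.random.default_rng(1)
words = rng.normal(size=(3, 8))                      # three placeholder word embeddings
phases = np.exp(1j * rng.uniform(0, 2 * np.pi, 3))   # complex weights (amplitudes)
sentence = superposition_state(words, phases)
rho = np.outer(sentence, sentence.conj())            # density matrix of the pure state
print(np.trace(rho).real)                            # trace is 1 for a pure state
```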

12.
In human–robot interaction, the robot controller must reactively adapt to sudden changes in the environment (due to unpredictable human behaviour). This often requires operating in different modes and managing sudden signal changes from heterogeneous sensor data. In this paper, we present a multimodal sensor-based controller, enabling a robot to adapt to changes in the sensor signals (here, changes in the human collaborator's behaviour). Our controller is based on a unified task formalism and, in contrast with classical hybrid vision–force–position control, it enables smooth transitions and weighted combinations of the sensor tasks. The approach is validated in a mock-up industrial scenario, where pose, vision (from both a traditional camera and a Kinect), and force tasks must be realized either exclusively or simultaneously, for human–robot collaboration.
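A simplified stand-in (not the paper's task formalism) for how smooth transitions and weighted combinations of sensor tasks could look: each task proposes a velocity command, and time-varying activations fade tasks in and out instead of switching them abruptly.

```python
import numpy as np

def blended_command(task_velocities, activations):
    """Weighted combination of per-sensor task commands (e.g. pose, vision, force).
    Activations in [0, 1] are normalised so the blended command is a convex
    combination of the active tasks."""
    w = np.clip(np.asarray(activations, dtype=float), 0.0, 1.0)
    if w.sum() == 0:
        return np.zeros_like(np.asarray(task_velocities[0], dtype=float))
    w = w / w.sum()
    return sum(wi * np.asarray(v, dtype=float) for wi, v in zip(w, task_velocities))

def smooth_activation(t, t_start, duration):
    """Cosine ramp from 0 to 1 over [t_start, t_start + duration]."""
    s = np.clip((t - t_start) / duration, 0.0, 1.0)
    return 0.5 * (1.0 - np.cos(np.pi * s))

# Fade the force task in at t = 2 s while the vision task stays active.
for t in [1.0, 2.5, 4.0]:
    cmd = blended_command(
        [np.array([0.05, 0.0, 0.0]),      # vision task velocity (m/s)
         np.array([0.0, 0.0, -0.02])],    # force task velocity (m/s)
        [1.0, smooth_activation(t, 2.0, 2.0)])
    print(t, cmd)
```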

13.
Wang, Meihua; Mai, Jiaming; Liang, Yun; Cai, Ruichu; Fu, Tom Zhengjia; Zhang, Zhenjie. Multimedia Tools and Applications, 2018, 77(9): 11259-11276

Traditional dehazing techniques, a well-studied topic in image processing, are now widely used to eliminate haze effects from individual images. However, state-of-the-art dehazing algorithms may not provide sufficient support to video analytics, a crucial pre-processing step for video-based decision making systems (e.g., robot navigation), due to the poor coherence and low processing efficiency of present algorithms. This paper presents a new framework, particularly designed for video dehazing, to output coherent results in real time, with two novel techniques. Firstly, we decompose the dehazing algorithm into three generic components, namely the transmission map estimator, the atmospheric light estimator and the haze-free image generator. They can be processed simultaneously by multiple threads in a distributed system, such that processing efficiency is optimized by automatic CPU resource allocation based on the workloads. Secondly, a cross-frame normalization scheme is proposed to enhance the coherence among consecutive frames, by sharing the atmospheric light parameters across consecutive frames in the distributed computation platform. The combination of the above three components enables our framework to generate highly consistent and accurate dehazing results in real time, using only 5 PCs connected by Ethernet.
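As an illustration of the haze-free image generator component only, assuming the standard haze formation model I(x) = J(x) t(x) + A (1 - t(x)); the two estimators and the cross-frame sharing of A are outside this sketch.

```python
import numpy as np

def recover_haze_free(hazy, transmission, atmospheric_light, t_min=0.1):
    """Invert the standard haze model: J = (I - A) / max(t, t_min) + A.
    The transmission map t and atmospheric light A would come from the other
    two pipeline components (with A shared across consecutive frames for
    coherence, per the abstract)."""
    hazy = hazy.astype(float)
    t = np.clip(transmission, t_min, 1.0)[..., None]   # broadcast over RGB channels
    J = (hazy - atmospheric_light) / t + atmospheric_light
    return np.clip(J, 0.0, 255.0)

frame = np.full((4, 4, 3), 200.0)       # toy hazy frame
t_map = np.full((4, 4), 0.5)            # toy transmission map
A = np.array([220.0, 220.0, 220.0])     # toy atmospheric light
print(recover_haze_free(frame, t_map, A)[0, 0])   # -> [180. 180. 180.]
```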


14.
Multimedia Tools and Applications - Robust multimodal fusion is one of the challenging research problems in semantic scene understanding. In real-world applications, the fusion system can overcome...

15.
16.
The rapidly expanding field of multimedia communications has fueled significant research and development work in the area of real-time video encoding. Dedicated hardware solutions have reached maturity and cost-efficient hardware encoders are being developed by several manufacturers. However, software solutions based on a general purpose processor or a programmable digital signal processor (DSP) have significant merits. Toward this objective, we have developed a flexible framework for video encoding that yields very good computation-performance tradeoffs. The proposed framework consists of a set of optimized core components: motion estimation (ME), the discrete cosine transform (DCT), quantization, and mode selection. Each of the components can be configured to achieve a desired computation-performance tradeoff. The components can be assembled to obtain encoders with varying degrees of computational complexity. Computation control has been implemented within the proposed framework to allow the resulting algorithms to adapt to the available computational resources. The proposed framework was applied to MPEG-2 and H.263 encoding using Intel's Pentium/MMX desktop processor. Excellent speed-performance tradeoffs were obtained
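A minimal sketch of one configurable core component, block-matching motion estimation with the sum-of-absolute-differences (SAD) criterion; this is illustrative only and not the framework's optimized MMX implementation, where cheaper suboptimal searches would typically replace the exhaustive one.

```python
import numpy as np

def full_search_sad(ref, cur, block=8, search=4):
    """Exhaustive block-matching ME: for each block of the current frame,
    find the displacement within +/-search pixels that minimises SAD
    against the reference frame."""
    h, w = cur.shape
    vectors = {}
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            cur_blk = cur[by:by + block, bx:bx + block].astype(int)
            best, best_mv = None, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = by + dy, bx + dx
                    if y < 0 or x < 0 or y + block > h or x + block > w:
                        continue
                    ref_blk = ref[y:y + block, x:x + block].astype(int)
                    sad = np.abs(cur_blk - ref_blk).sum()
                    if best is None or sad < best:
                        best, best_mv = sad, (dy, dx)
            vectors[(by, bx)] = best_mv
    return vectors

rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(16, 16), dtype=np.uint8)
cur = np.roll(ref, shift=(0, 2), axis=(0, 1))   # reference shifted right by 2 pixels
print(full_search_sad(ref, cur))                # blocks away from the left edge recover (0, -2)
```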

17.
Multimedia Tools and Applications - Utilizing cloud services in running large-scale video surveillance systems is not uncommon. However, special attention should be given to data security and...

18.
19.
Parallel neural networks for multimodal video genre classification
Improvements in digital technology have made possible the production and distribution of huge quantities of digital multimedia data. Tools for high-level multimedia documentation are becoming indispensable to efficiently access and retrieve desired content from such data. In this context, automatic genre classification provides a simple and effective solution to describe multimedia contents in a structured and well understandable way. We propose in this article a methodology for classifying the genre of television programmes. Features are extracted from four informative sources, which include visual-perceptual information (colour, texture and motion), structural information (shot length, shot distribution, shot rhythm, shot clusters duration and saturation), cognitive information (face properties, such as number, positions and dimensions) and aural information (transcribed text, sound characteristics). These features are used for training a parallel neural network system able to distinguish between seven video genres: football, cartoons, music, weather forecast, newscast, talk show and commercials. Experiments conducted on more than 100 h of audiovisual material confirm the effectiveness of the proposed method, which reaches a classification accuracy rate of 95%.

Born in 1975, Maurizio Montagnuolo received his Laurea degree in Telecommunications Engineering from the Polytechnic of Turin in 2004, after developing his thesis at the RAI Research Centre. Currently, he is attending the Ph.D. course in “Business and Management” at the University of Turin, in collaboration with RAI and supported by EuriX S.r.l., Turin. His main research interests concern the semantic classification of audiovisual content.

Alberto Messina is with the RAI (Radiotelevisione Italiana) Centre for Research and Technological Innovation (CRIT), Turin. He began his collaboration with RAI as a research engineer in 1996, when he completed his MS Thesis in Electronic Engineering (at Politecnico di Torino) on objective quality evaluation of MPEG2 video coding. After starting his career as a designer of RAI's Multimedia Catalogue, he has been involved in several internal and international research projects in the field of digital archiving, with particular emphasis on automated documentation and automated production. His current interests range from file formats and metadata standards to content analysis and information extraction algorithms, where he now concentrates his main focus. Recently, he has started promising research activities concerning semantic information extraction from the numerical analysis of audiovisual material, particularly the conceptual characterisation of multimedia objects, genre classification of multimedia items and automatic editorial segmentation of TV programmes. He is also the author of technical and scientific publications in this subject area. He has extensive collaborations with the local University of Torino (Computer Science Department), which include common research projects and student tutorship. To complete his scientific formation, he has recently decided to take a PhD in the area of Computer Science. He is an active member of several EBU projects including P/TVFILE, P/MAG and P/CP, and chairman of the P/SCAIE project dealing with automatic metadata extraction techniques. He is currently working in the EU PrestoSpace project in the Metadata Access and Delivery area. He has served as a Programme Committee Member in a Special Track of the 10th Conference of the Italian Association of Artificial Intelligence, and in the First Workshop on Ambient Media Delivery and Interactive Television (AMDIT08).
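A hypothetical sketch of the "parallel networks" idea from the abstract above: one small classifier per informative source (visual, structural, cognitive, aural), with their class probabilities averaged to pick the genre. The random placeholder features and the generic MLPs are my assumptions; the paper's actual features, architecture and fusion are richer.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
genres = ["football", "cartoons", "music", "weather forecast",
          "newscast", "talk show", "commercials"]
n = 140
y = rng.integers(0, len(genres), size=n)

# One feature matrix per informative source, each with its own dimensionality.
sources = {"visual": 20, "structural": 8, "cognitive": 5, "aural": 12}
features = {name: rng.normal(size=(n, d)) + y[:, None] * 0.3
            for name, d in sources.items()}

# Train one small network per source in parallel (conceptually).
nets = {name: MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000,
                            random_state=0).fit(X, y)
        for name, X in features.items()}

# Fuse the parallel networks by averaging their class-probability outputs.
proba = np.mean([nets[name].predict_proba(features[name]) for name in sources], axis=0)
print(genres[int(np.argmax(proba[0]))])
```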

20.