首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 21 毫秒
1.
本文对基于内容的音频检索提出了一种分级方法,第一级:用HMM对音频事件的统计特性建模;第二级:用SVM结合一些音频事件对特定语义场景建模,完成对语义场景的检索。实验证明,HMM和SVM的结合对音频语义级场景的检索达到比较理想的效果。  相似文献   

2.

视频-文本检索作为一项被广泛应用于现实生活中的多模态检索技术受到越来越多的研究者的关注. 近来, 大部分视频文本工作通过利用大规模预训练模型中所学到的视觉与语言之间的匹配关系来提升文本视频间跨模态检索效果. 然而, 这些方法忽略了视频、文本数据都是由一个个事件组合而成. 倘若能捕捉视频事件与文本事件之间的细粒度相似性关系, 将能帮助模型计算出更准确的文本与视频之间的语义相似性关系, 进而提升文本视频间跨模态检索效果. 因此, 提出了一种基于CLIP生成多事件表示的视频文本检索方法(CLIP based multi-event representation generation for video-text retrieval, CLIPMERG). 首先, 通过利用大规模图文预训练模型CLIP的视频编码器(ViT)以及文本编码器(Tansformer)分别将视频、文本数据转换成视频帧token序列以及文本的单词token序列;然后, 通过视频事件生成器(文本事件生成器)将视频帧token序列(单词token序列)转换成k个视频事件表示(k个文本事件表示);最后, 通过挖掘视频事件表示与文本事件表示之间的细粒度关系以定义视频、文本间的语义相似性关系. 在3个常用的公开视频文本检索数据集MSR-VTT, DiDeMo, LSMDC上的实验结果表明所提的CLIPMERG优于现有的视频文本检索方法.

  相似文献   

3.
Sports video annotation is important for sports video semantic analysis such as event detection and personalization. In this paper, we propose a novel approach for sports video semantic annotation and personalized retrieval. Different from the state of the art sports video analysis methods which heavily rely on audio/visual features, the proposed approach incorporates web-casting text into sports video analysis. Compared with previous approaches, the contributions of our approach include the following. 1) The event detection accuracy is significantly improved due to the incorporation of web-casting text analysis. 2) The proposed approach is able to detect exact event boundary and extract event semantics that are very difficult or impossible to be handled by previous approaches. 3) The proposed method is able to create personalized summary from both general and specific point of view related to particular game, event, player or team according to user's preference. We present the framework of our approach and details of text analysis, video analysis, text/video alignment, and personalized retrieval. The experimental results on event boundary detection in sports video are encouraging and comparable to the manually selected events. The evaluation on personalized retrieval is effective in helping meet users' expectations.  相似文献   

4.
Automatic audio content recognition has attracted an increasing attention for developing multimedia systems, for which the most popular approaches combine frame-based features with statistic models or discriminative classifiers. The existing methods are effective for clean single-source event detection but may not perform well for unstructured environmental sounds, which have a broad noise-like flat spectrum and a diverse variety of compositions. We present an automatic acoustic scene understanding framework that detects audio events through two hierarchies, acoustic scene recognition and audio event recognition, in which the former is preceded by following dominant audio sources and in turn helps infer non-dominant audio events within the same scene through modeling their occurrence correlations. On the scene recognition hierarchy, we perform adaptive segmentation and feature extraction for every input acoustic scene stream through Eigen-audiospace and an optimized feature subspace, respectively. After filtering background, scene streams are recognized by modeling the observation density of dominant features using a two-level hidden Markov model. On the audio event recognition hierarchy, scene knowledge is characterized by an audio context model that essentially describes the occurrence correlations of dominant and non-dominant audio events within this scene. Monte Carlo integration and gradient descent techniques are employed to maximize the likelihood and correctly tag each audio event. To the best of our knowledge, this is the first work that models event correlations as scene context for robust audio event detection from complex and noisy environments. Note that according to the recent report, the mean accuracy for the acoustic scene classification task by human listeners is only around 71 % on the data collected in office environments from the DCASE dataset. None of the existing methods performs well on all scene categories and the average accuracy of the best performances of the recent 11 methods is 53.8 %. The proposed method averagely achieves an accuracy of 62.3 % on the same dataset. Additionally, we create a 10-CASE dataset by manually collecting 5,250 audio clips of 10 scene types and 21 event categories. Our experimental results on 10-CASE show that the proposed method averagely achieves the enhanced performance of 78.3 %, and the average accuracy of audio event recognition can be effectively improved by capturing dominant audio sources and reasoning non-dominant events from the dominant ones through acoustic context modeling. In the future work, exploring the interactions between acoustic scene recognition and audio event detection, and incorporating other modalities to improve the accuracy are required to further advance the proposed framework.  相似文献   

5.
6.
In this paper we present a framework for unified, personalized access to heterogeneous multimedia content in distributed repositories. Focusing on semantic analysis of multimedia documents, metadata, user queries and user profiles, it contributes to the bridging of the gap between the semantic nature of user queries and raw multimedia documents. The proposed approach utilizes as input visual content analysis results, as well as analyzes and exploits associated textual annotation, in order to extract the underlying semantics, construct a semantic index and classify documents to topics, based on a unified knowledge and semantics representation model. It may then accept user queries, and, carrying out semantic interpretation and expansion, retrieve documents from the index and rank them according to user preferences, similarly to text retrieval. All processes are based on a novel semantic processing methodology, employing fuzzy algebra and principles of taxonomic knowledge representation. The first part of this work presented in this paper deals with data and knowledge models, manipulation of multimedia content annotations and semantic indexing, while the second part will continue on the use of the extracted semantic information for personalized retrieval.
Stefanos KolliasEmail:
  相似文献   

7.
We propose a method for characterizing sound activity in fixed spaces through segmentation, indexing, and retrieval of continuous audio recordings. Regarding segmentation, we present a dynamic Bayesian network (DBN) that jointly infers onsets and end times of the most prominent sound events in the space, along with an extension of the algorithm for covering large spaces with distributed microphone arrays. Each segmented sound event is indexed with a hidden Markov model (HMM) that models the distribution of example-based queries that a user would employ to retrieve the event (or similar events). In order to increase the efficiency of the retrieval search, we recursively apply a modified spectral clustering algorithm to group similar sound events based on the distance between their corresponding HMMs. We then conduct a formal user study to obtain the relevancy decisions necessary for evaluation of our retrieval algorithm on both automatically and manually segmented sound clips. Furthermore, our segmentation and retrieval algorithms are shown to be effective in both quiet indoor and noisy outdoor recording conditions.   相似文献   

8.
为了支持企业在决策时从企业数据中通过检索获得有意义的数据,提出了基于语义模型的语义检索方法。该方法首先基于概念树描述语义模型,通过概念映射将数据源与语义模型进行语义关联。在此基础上,建立语义模型和支持描述逻辑推理的知识模型之间的映射,通过调用描述逻辑推理机完成语义检索,检索结果再通过语义模型映射对应数据源信息,最终返回语义一致益于决策的数据视图。  相似文献   

9.
Modeling Content for Semantic-Level Querying of Multimedia   总被引:2,自引:0,他引:2  
Many semantic content-based models have been developed for modeling video and audio in order to enable information retrieval based on semantic content. The level of querying of the media depends upon the semantic aspects modeled. This paper proposes a semantic content-based model for semantic-level querying that makes full use of the explicit media structure, objects, spatial relationships between objects, events and actions involving objects, temporal relationships between events and actions, and integration between syntactic and semantic information.  相似文献   

10.
11.
Social media networks contain both content and context-specific information. Most existing methods work with either of the two for the purpose of multimedia mining and retrieval. In reality, both content and context information are rich sources of information for mining, and the full power of mining and processing algorithms can be realized only with the use of a combination of the two. This paper proposes a new algorithm which mines both context and content links in social media networks to discover the underlying latent semantic space. This mapping of the multimedia objects into latent feature vectors enables the use of any off-the-shelf multimedia retrieval algorithms. Compared to the state-of-the-art latent methods in multimedia analysis, this algorithm effectively solves the problem of sparse context links by mining the geometric structure underlying the content links between multimedia objects. Specifically for multimedia annotation, we show that an effective algorithm can be developed to directly construct annotation models by simultaneously leveraging both context and content information based on latent structure between correlated semantic concepts. We conduct experiments on the Flickr data set, which contains user tags linked with images. We illustrate the advantages of our approach over the state-of-the-art multimedia retrieval techniques.  相似文献   

12.
We present our studies on the application of Coupled Hidden Markov Models(CHMMs) to sports highlights extraction from broadcast video using both audio and video information. First, we generate audio labels using audio classification via Gaussian mixture models, and video labels using quantization of the average motion vector magnitudes. Then, we model sports highlights using discrete-observations CHMMs on audio and video labels classified from a large training set of broadcast sports highlights. Our experimental results on unseen golf and soccer content show that CHMMs outperform Hidden Markov Models(HMMs) trained on audio-only or video-only observations. Next, we study how the coupling between the two single-modality HMMs offers improvement on modelling capability by making refinements on the states of the models. We also show that the number of states optimized in this fashion also gives better classification results than other number of states. We conclude that CHMMs provide a promising tool for information fusion techniques in the sports domain for audio-visual event detection and analysis.  相似文献   

13.
While most existing sports video research focuses on detecting event from soccer and baseball etc., little work has been contributed to flexible content summarization on racquet sports video, e.g. tennis, table tennis etc. By taking advantages of the periodicity of video shot content and audio keywords in the racquet sports video, we propose a novel flexible video content summarization framework. Our approach combines the structure event detection method with the highlight ranking algorithm. Firstly, unsupervised shot clustering and supervised audio classification are performed to obtain the visual and audio mid-level patterns respectively. Then, a temporal voting scheme for structure event detection is proposed by utilizing the correspondence between audio and video content. Finally, by using the affective features extracted from the detected events, a linear highlight model is adopted to rank the detected events in terms of their exciting degrees. Experimental results show that the proposed approach is effective.  相似文献   

14.
互联网上存在海量数据,如何在大量的信息中查找到有用信息就变成了一个至关重要的问题。语义网为解决这一问题带来了曙光。然而当今网络现状与语义网之间存在巨大差距,即海量非结构化的页面内容难直接转化为语义的知识。提出了一种基于文档内容的语义标注方法,利用本体所表达的语义环境,即本体知识相关词汇及其所处的语义上下文环境在文档中出现频率,实现对文档的语义标注。实验显示方法取得良好的效果,但受本体知识质量和标注文档质量两个因素影响较大。  相似文献   

15.
16.
语义视频检索综述   总被引:4,自引:1,他引:4  
视频内容检索是多媒体应用的一个活跃研究方向,现有的内容检索技术大多是基于低层次特征的。这些非语义的低层特征难以理解,与人思维中的高层语义概念相差甚远,严重影响视频内容检索系统的易用性。低层特征和高层语义概念间的语义鸿沟很难逾越。如何跨越语义鸿沟,用语义概念检索视频内容是目前基于内容视频检索最具挑战性的研究方向。本文介绍语义视频检索出现的背景,分析语义鸿沟出现的原因,对现有尝试跨越语义鸿沟的主要方法进行综述;评述了相关技术的优缺点,探讨了各方法将来可能的研究发展方向以及视频语义检索近期、长期可能的技术突破点。  相似文献   

17.
Automatic annotation of semantic events allows effective retrieval of video content. In this work, we present solutions for highlights detection in sports videos. This application is particularly interesting for broadcasters, since they extensively use manual annotation to select interesting highlights that are edited to create new programmes. The proposed approach exploits the typical structure of a wide class of sports videos, namely, those related to sports which are played in delimited venues with playfields of well known geometry, like soccer, basketball, swimming, track and field disciplines, and so on. For this class of sports, a modeling scheme based on a limited set of visual cues and on finite state machines (FSM) that encode the temporal evolution of highlights is presented. Algorithms for model checking and for visual cues estimation are discussed, as well as applications of the representation to different sport domains.  相似文献   

18.
随着互联网与多媒体技术的迅猛发展,网络数据的呈现形式由单一文本扩展到包含图像、视频、文本、音频和3D模型等多种媒体,使得跨媒体检索成为信息检索的新趋势.然而,"异构鸿沟"问题导致不同媒体的数据表征不一致,难以直接进行相似性度量,因此,多种媒体之间的交叉检索面临着巨大挑战.随着深度学习的兴起,利用深度神经网络模型的非线性建模能力有望突破跨媒体信息表示的壁垒,但现有基于深度学习的跨媒体检索方法一般仅考虑图像和文本两种媒体数据之间的成对关联,难以实现更多种媒体的交叉检索.针对上述问题,提出了跨媒体深层细粒度关联学习方法,支持多达5种媒体类型数据(图像、视频、文本、音频和3D模型)的交叉检索.首先,提出了跨媒体循环神经网络,通过联合建模多达5种媒体类型数据的细粒度信息,充分挖掘不同媒体内部的细节信息以及上下文关联.然后,提出了跨媒体联合关联损失函数,通过将分布对齐和语义对齐相结合,更加准确地挖掘媒体内和媒体间的细粒度跨媒体关联,同时利用语义类别信息增强关联学习过程的语义辨识能力,提高跨媒体检索的准确率.在两个包含5种媒体的跨媒体数据集PKU XMedia和PKU XMediaNet上与现有方法进行实验对比,实验结果表明了所提方法的有效性.  相似文献   

19.
Highlight detection is a fundamental step in semantics based video retrieval and personalized sports video browsing. In this paper, an effective hidden Markov models (HMMs) based soccer video event detection method based on a hierarchical video analysis framework is proposed. Soccer video shots are classified into four coarse mid-level semantics: global, median, close-up and audience. Global and local motion information is utilized for the refinement of coarse mid-level semantics. Sequential soccer video is segmented into event clips. Both the temporal transitions of the mid-level semantics and the overall features of an event clip are fused using HMMs to determine the type of event. Highlight detection performance of dynamic Bayesian networks (DBN), conditional random fields (CRF) and the proposed HMM based approach are compared. The average F-score of our highlights (including goal, shoot, foul and placed kick) detection approach is 82.92%, which outperforms that of DBN and CRF by 9.85% and 11.12% respectively. The effects of number of hidden states, overall features, and the refinement of mid-level semantics on the event detection performance are also discussed.  相似文献   

20.
In this paper, an audio-driven algorithm for the detection of speech and music events in multimedia content is introduced. The proposed approach is based on the hypothesis that short-time frame-level discrimination performance can be enhanced by identifying transition points between longer, semantically homogeneous segments of audio. In this context, a two-step segmentation approach is employed in order to initially identify transition points between the homogeneous regions and subsequently classify the derived segments using a supervised binary classifier. The transition point detection mechanism is based on the analysis and composition of multiple self-similarity matrices, generated using different audio feature sets. The implemented technique aims at discriminating events focusing on transition point detection with high temporal resolution, a target that is also reflected in the adopted assessment methodology. Thereafter, multimedia indexing can be efficiently deployed (for both audio and video sequences), incorporating the processes of high resolution temporal segmentation and semantic annotation extraction. The system is evaluated against three publicly available datasets and experimental results are presented in comparison with existing implementations. The proposed algorithm is provided as an open source software package in order to support reproducible research and encourage collaboration in the field.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号