Similar Literature
 20 similar documents found (search time: 140 ms)
1.
Efficient modeling of actions is critical for recognizing human actions. Recently, the bag of video words (BoVW) representation, in which features computed around spatiotemporal interest points are quantized into video words based on their appearance similarity, has been widely and successfully explored. The performance of this representation, however, is highly sensitive to two main factors: the granularity, and therefore the size, of the vocabulary, and the space in which features and words are clustered, i.e., the distance measure between data points at different levels of the hierarchy. The goal of this paper is to propose a representation and learning framework that addresses both of these limitations. We present a principled approach to learning a semantic vocabulary from a large number of video words using Diffusion Maps embedding. As opposed to the flat vocabularies used in traditional methods, we propose to exploit the hierarchical nature of feature vocabularies representative of human actions. Spatiotemporal features computed around interest points in videos form the lowest level of representation. Video words are then obtained by clustering those spatiotemporal features. Each video word is then represented by a vector of Pointwise Mutual Information (PMI) values between that video word and the training video clips, and is treated as a mid-level feature. At the highest level of the hierarchy, our goal is to further cluster the mid-level features while exploiting semantically meaningful distance measures between them. We conjecture that the mid-level features produced by similar video sources (action classes) must lie on a certain manifold. To capture the relationship between these features, and to retain it during clustering, we propose to use diffusion distance as a measure of similarity between them. The underlying idea is to embed the mid-level features into a lower-dimensional space so as to construct a compact yet discriminative high-level vocabulary.
Unlike supervised vocabulary construction approaches and unsupervised methods such as pLSA and LDA, Diffusion Maps can capture the local relationships between the mid-level features on the manifold. We have tested our approach on diverse datasets and have obtained very promising results.
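The PMI-based mid-level features described above can be sketched as follows. This is a hedged illustration: the abstract does not specify the exact counting or smoothing used, so the helper below computes plain PMI over word-clip co-occurrence counts.

```python
import math

def pmi_vectors(counts):
    """counts[w][c] = number of times video word w occurs in training clip c.
    Returns, for each word, a vector of PMI values against each clip --
    one mid-level feature per video word (zeros for unseen pairs)."""
    words = sorted(counts)
    clips = sorted({c for w in counts for c in counts[w]})
    total = sum(counts[w].get(c, 0) for w in words for c in clips)
    p_w = {w: sum(counts[w].values()) / total for w in words}
    p_c = {c: sum(counts[w].get(c, 0) for w in words) / total for c in clips}
    vecs = {}
    for w in words:
        vecs[w] = [
            math.log((counts[w].get(c, 0) / total) / (p_w[w] * p_c[c]))
            if counts[w].get(c, 0) > 0 else 0.0
            for c in clips
        ]
    return vecs
```

Each resulting vector could then be fed to the diffusion-distance clustering stage the abstract describes.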

2.
谢飞, 龚声蓉, 刘纯平, 季怡. Computer Science (《计算机科学》), 2015, 42(11): 293-298
Visual-word-based human action recognition improves accuracy by injecting mid-level semantic information into the features. However, mutual interference between foreground and background during visual-word extraction weakens the expressive power of the visual words. This paper proposes a visual-word generation method that combines local and global features. The method first detects the foreground human region with a saliency map, applies the proposed dynamic threshold matrix to detect spatiotemporal interest points in the human region under region-specific thresholds, and computes 3D-SIFT features around those points to describe local information. On this basis, a histogram-of-optical-flow feature describes the global motion of the action. Local and global features are then fused into visual words via spectral clustering. Experiments show that, compared with popular local-feature visual-word generation methods, the proposed method raises the recognition rate above the average by 6.4% on the simple-background KTH dataset and by 6.5% on the complex-background UCF dataset.

3.
4.
Efficiently representing and recognizing the semantic classes of the subregions of large-scale high spatial resolution (HSR) remote-sensing images are challenging and critical problems. Most of the existing scene classification methods concentrate on feature coding approaches with handcrafted low-level features or on low-level unsupervised feature learning, which essentially prevents them from better recognizing the semantic categories of the scene due to their limited mid-level feature representation ability. In this article, to overcome the inadequate mid-level representation, a patch-based spatial-spectral hierarchical convolutional sparse auto-encoder (HCSAE) algorithm, based on deep learning, is proposed for HSR remote-sensing imagery scene classification. The HCSAE framework uses an unsupervised hierarchical network based on a sparse auto-encoder (SAE) model. In contrast to the single-level SAE, the HCSAE framework utilizes the significant features from the single-level algorithm in a feedforward, fully connected manner to the maximum extent, which adequately represents the scene semantics at the high level of the HCSAE. To ensure robust feature learning and extraction during the SAE feature extraction procedure, a 'dropout' strategy is also introduced. The experimental results using the UC Merced data set with 21 classes and a Google Earth data set with 12 classes demonstrate that the proposed HCSAE framework can provide better accuracy than the traditional scene classification methods and the single-level convolutional sparse auto-encoder (CSAE) algorithm.
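The 'dropout' strategy mentioned above can be illustrated with a minimal sketch. This is inverted dropout on a plain list of activations; the rate and its placement inside the SAE layers are assumptions, not the paper's configuration.

```python
import random

def dropout(activations, p, training=True, seed=0):
    """Inverted dropout: during training, zero each unit with probability p
    and scale survivors by 1/(1-p) so the expected activation is unchanged.
    At test time the activations pass through untouched."""
    if not training or p == 0.0:
        return list(activations)
    rng = random.Random(seed)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

Scaling at training time (rather than at test time) keeps inference a plain forward pass, which is the usual reason this variant is preferred.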

5.
6.
In this paper, we develop a content-based video classification approach to support semantic categorization, high-dimensional indexing, and multi-level access. Our contributions are fourfold: (a) First, we present a hierarchical video database model that captures the structures and semantics of video contents in databases. One advantage of this model is that it provides a framework for automatic mapping from high-level concepts to low-level representative features. (b) Second, we propose a set of useful techniques for exploiting the basic units (e.g., shots or objects) to access the videos in a database. (c) Third, we suggest a learning-based semantic classification technique to exploit the structures and semantics of video contents in a database. (d) Fourth, we develop a cluster-based indexing structure that both speeds up query-by-example and organizes databases to support more effective browsing. Applications of the proposed multi-level video database representation and indexing structures to MPEG-7 are also discussed.

7.
Computer Engineering (《计算机工程》), 2007, 33(13)
With the continued development of distributed computing, the traditional role-based access control (RBAC) model can no longer meet the requirements of distributed security. Building on Microsoft's code access security, this paper derives an evidence-based code access control (EBCAC) model and a formal description of it; the model enables access control at a lower level of the system. An improved design scheme for an evidence-based code access control system is then proposed, and an example of defending against luring attacks is given.

8.
9.
魏维, 叶斌, 张元茂. Computer Engineering (《计算机工程》), 2007, 33(13): 218-220, 229
This paper studies techniques for representing the semantic content of video from both the visual and the audio channels. Key frames are selected with a frame-slicing strategy that reflects temporal semantic constraints and semantic change; spatial content is selected with a spatiotemporal attention model, and a classifier performs basic semantic recognition on the selected regions. Stochastic models are built for audio segments over different time spans to represent audio semantic content and extract basic audio semantics. Experiments show that this representation describes the semantic content of video concisely and extracts basic video semantics effectively.

10.
Objective: Survey articles on bag-of-visual-words (BoVW) methods for image scene classification are still rare in journals at home and abroad. To give researchers a comprehensive view of BoVW methods for scene classification, this paper systematically summarizes the existing work. Method: Drawing on a large body of domestic and international literature, the existing BoVW approaches to image scene classification (mainly single-scene classification) are summarized and compared along several dimensions: the selection of low-level features and the generation of local patch features, the construction of the visual dictionary, the histogram representation of BoVW features, and visual-word optimization. Results: The paper reviews the development of the BoVW model, categorizes the existing variants, compares the strengths and weaknesses of the common methods, summarizes performance evaluation methodology, compiles the standard scene datasets in common use, and reports the best accuracy achieved on each. Conclusion: BoVW-based image scene classification remains a burgeoning research topic in computer vision and has made considerable progress at home and abroad; research no longer merely applies the model to describe image content directly, but increasingly considers the differences between images and text. Although many problems in applying the BoVW model to scene classification remain to be solved, they in no way diminish the importance of this line of research.

11.
12.
Highlight detection is a fundamental step in semantics-based video retrieval and personalized sports video browsing. In this paper, an effective hidden Markov model (HMM) based soccer video event detection method, built on a hierarchical video analysis framework, is proposed. Soccer video shots are classified into four coarse mid-level semantics: global, median, close-up, and audience. Global and local motion information is utilized to refine the coarse mid-level semantics. Sequential soccer video is segmented into event clips. Both the temporal transitions of the mid-level semantics and the overall features of an event clip are fused using HMMs to determine the type of event. The highlight detection performance of dynamic Bayesian networks (DBN), conditional random fields (CRF), and the proposed HMM-based approach are compared. The average F-score of our highlight detection approach (covering goal, shot, foul, and placed kick) is 82.92%, which outperforms DBN and CRF by 9.85% and 11.12%, respectively. The effects of the number of hidden states, the overall features, and the refinement of mid-level semantics on event detection performance are also discussed.
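The HMM decoding step — inferring the most likely event-state sequence from the observed mid-level shot labels — can be sketched with a standard Viterbi decoder. The states, observations, and probabilities below are illustrative stand-ins, not the paper's trained model.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for a sequence of
    observed shot labels, via dynamic programming over path probabilities."""
    # V[t][s] = (best probability of reaching state s at step t, best path)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        prev, cur = V[-1], {}
        for s in states:
            p, path = max(
                (prev[r][0] * trans_p[r][s] * emit_p[s][o], prev[r][1] + [s])
                for r in states
            )
            cur[s] = (p, path)
        V.append(cur)
    return max(V[-1].values())[1]
```

In the paper's setting the observations would be the refined mid-level semantics (global / median / close-up / audience) of consecutive shots, with one HMM per event type.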

13.
A Survey of Semantic Video Retrieval   (total citations: 4; self-citations: 1; cited by others: 4)
Content-based video retrieval is an active research direction in multimedia applications. Most existing content retrieval techniques are based on low-level features. These non-semantic low-level features are hard to interpret and far removed from the high-level semantic concepts of human thought, which severely limits the usability of video content retrieval systems. The semantic gap between low-level features and high-level semantic concepts is difficult to bridge; how to cross it and retrieve video content by semantic concepts is currently the most challenging direction in content-based video retrieval. This paper introduces the background of semantic video retrieval, analyzes the causes of the semantic gap, and surveys the main existing attempts to bridge it; it reviews the strengths and weaknesses of the related techniques and discusses their likely future research directions, as well as the near-term and long-term technical breakthroughs that semantic video retrieval may achieve.

14.
A Relevance Feedback Method Integrating Visual Features and Semantic Information   (total citations: 1; self-citations: 0; cited by others: 1)
To make effective use of the semantic classification information and visual features in an image retrieval system, a Bayesian relevance feedback method integrating visual features and semantic information is proposed. First, the image database is partitioned into small clusters by a semantically supervised visual-feature clustering algorithm, so that the data within each cluster are visually similar and share the same semantic class. Positive and negative feedback examples are then labeled at the cluster level, which differs markedly from conventional relevance feedback on individual images. Finally, a Bayesian classifier over visual features and a Bayesian classifier over semantics each adjust the similarity distances. Experiments on an image database show that high retrieval precision is reached with only a few feedback rounds.
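Cluster-level feedback as described above can be sketched minimally. The multiplicative scaling rule below is an assumption standing in for the paper's Bayesian distance correction; only the cluster-level (rather than per-image) labeling is taken from the abstract.

```python
def refine_distances(dist, cluster_of, pos_clusters, neg_clusters, alpha=0.5):
    """Cluster-level relevance feedback: shrink the query distance of every
    image whose cluster was marked relevant, enlarge it for images in
    negative clusters, and leave unlabeled clusters untouched."""
    refined = {}
    for img, d in dist.items():
        c = cluster_of[img]
        if c in pos_clusters:
            refined[img] = d * alpha
        elif c in neg_clusters:
            refined[img] = d / alpha
        else:
            refined[img] = d
    return refined
```

Because one click labels a whole cluster, far fewer feedback rounds are needed than with per-image labeling, which matches the experimental claim above.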

15.
16.
Based on statistical theory, a general method for multi-granularity semantic analysis of video is proposed, unifying multi-level semantic analysis with multi-modal information fusion. To represent temporal content, a key-frame selection strategy with temporal semantic-context constraints and an attention selection model are first proposed; after basic visual semantics are recognized, a multi-layer visual semantic analysis framework extracts the visual semantics. Hidden Markov models (HMMs) and Bayesian decision-making are then applied for audio semantic understanding. Finally, a two-layer biologically inspired multi-modal fusion scheme fuses the semantic information. Experimental results show that the method effectively fuses multi-modal features and extracts video semantics at different granularities.

17.
18.
This paper proposes a method for extracting the semantics of local image features based on the expectation-maximization (EM) algorithm. Local image features are first extracted and their occurrence frequencies over a visual vocabulary are counted, representing each image as a bag-of-words model. Latent semantic analysis from text processing is then introduced to build a mapping model from low-level image features to high-level image semantics; the EM algorithm fits the probabilistic model, yielding the latent-semantic probability distribution of the local image features. Finally, the images' distributions over the latent semantics extracted by this model are used for image analysis and understanding. Compared with other semantics-based image understanding methods, this method requires no manual annotation: it mines local latent semantics directly from low-level image features in an unsupervised way, obtaining both the local semantic information and its spatial distribution, and can therefore model scenes better. To validate the semantics obtained, experiments were conducted on 15 scene categories; the results show good classification accuracy.
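The EM fitting step described above can be sketched with a compact pLSA-style implementation. This is a generic EM for a word-by-document count matrix, not the paper's exact model: the number of latent semantics, iteration count, and initialization are assumptions.

```python
import numpy as np

def plsa(N, K, iters=200, seed=0):
    """EM for pLSA on a word-by-document count matrix N (W x D):
    alternately estimates P(w|z) and P(z|d) for K latent semantics."""
    rng = np.random.default_rng(seed)
    W, D = N.shape
    p_w_z = rng.random((W, K))
    p_w_z /= p_w_z.sum(axis=0)
    p_z_d = rng.random((K, D))
    p_z_d /= p_z_d.sum(axis=0)
    for _ in range(iters):
        # E-step: responsibilities P(z | w, d), shape (W, D, K)
        post = p_w_z[:, None, :] * p_z_d.T[None, :, :]
        post /= post.sum(axis=2, keepdims=True) + 1e-12
        # M-step: re-estimate both factors from expected counts
        Nz = N[:, :, None] * post
        p_w_z = Nz.sum(axis=1)
        p_w_z /= p_w_z.sum(axis=0) + 1e-12
        p_z_d = Nz.sum(axis=0).T
        p_z_d /= p_z_d.sum(axis=0) + 1e-12
    return p_w_z, p_z_d
```

The columns of `p_z_d` are the per-image latent-semantic distributions that the abstract says are used for subsequent analysis and scene classification.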

19.
肖琳, 陈博理, 黄鑫, 刘华锋, 景丽萍, 于剑. Journal of Software (《软件学报》), 2020, 31(4): 1079-1089
Since the rise of big data, multi-label classification has been an important problem with many practical applications, such as text classification, image recognition, video annotation, and multimedia information retrieval. Traditional multi-label text classification algorithms treat labels as symbols without semantic information. In many cases, however, a text's labels carry specific semantics, and the semantic information of the labels corresponds to the content of the document. To establish and exploit this correspondence, a LAbel Semantic Attention Multi-label Classification (LASA) method is proposed; it relies on the document text and the corresponding labels, and shares word representations between documents and labels. For document embedding, a bidirectional long short-term memory network (Bi-LSTM) produces a hidden representation of each word, and a label semantic attention mechanism assigns a weight to each word in the document, capturing its importance to the current label. In addition, since labels are often correlated in the semantic space, using label semantic information also accounts for label correlation. Experimental results on standard multi-label text classification datasets show that the proposed method effectively captures the important words and outperforms state-of-the-art multi-label text classification algorithms.
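A minimal sketch of the label-semantic attention step described above: dot-product scoring in a shared word/label space followed by a softmax. The Bi-LSTM hidden states are replaced here by plain word vectors for illustration, so the vectors and dimensions are assumptions.

```python
import math

def label_attention(word_vecs, label_vec):
    """Score each word by its dot product with the label embedding (both
    live in the same shared space), softmax the scores into attention
    weights, and return the weights plus the attention-weighted document
    representation."""
    scores = [sum(wi * li for wi, li in zip(w, label_vec)) for w in word_vecs]
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(label_vec)
    doc = [sum(w[j] * a for w, a in zip(word_vecs, weights)) for j in range(dim)]
    return weights, doc
```

Words aligned with the label embedding receive higher weights, which is how the mechanism "captures the important words" for each label.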

20.
The HMM adapts well, learns automatically, and performs well at predicting stochastic time-series data. Scenes are a basic feature of soccer video: scene transitions reflect how soccer video is shot and edited, and express its semantics. A video semantic analysis framework based on scene analysis and HMMs is proposed for recognizing certain semantic events in soccer video. To overcome the large errors in previous video scene analysis based on dominant color and other low-level features, scene analysis based on a visual attention model is further proposed. Experimental results show that the event recognition method based on scene analysis and HMMs recognizes placed-kick events in soccer video well.


Copyright © Beijing Qinyun Technology Development Co., Ltd. (北京勤云科技发展有限公司)  京ICP备09084417号