Similar Literature
20 similar articles found (search time: 31 ms)
1.
Based on local keypoints extracted as salient image patches, an image can be described as a "bag-of-visual-words" (BoW), and this representation has proven promising for object and scene classification. The performance of BoW features in semantic concept detection for large-scale multimedia databases depends on various representation choices. In this paper, we conduct a comprehensive study of the representation choices of BoW, including vocabulary size, weighting scheme, stop-word removal, feature selection, spatial information, and visual bi-grams. We offer practical insights into how to optimize BoW performance by making appropriate representation choices. For the weighting scheme, we elaborate a soft-weighting method to assess the significance of a visual word to an image. We show experimentally that soft-weighting outperforms other popular weighting schemes such as TF-IDF by a large margin. Our extensive experiments on TRECVID data sets also indicate that the BoW feature alone, with appropriate representation choices, already produces highly competitive concept detection performance. Based on our empirical findings, we further apply our method to detect a large set of 374 semantic concepts. The detectors, as well as the features and detection scores on several recent benchmark data sets, are released to the multimedia community.
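The soft-weighting idea is compact enough to sketch. Below is a minimal NumPy illustration, assuming a flat codebook matrix of centroids, a k-nearest-word vote decayed by rank as 1/2^rank, and a simple inverse-distance similarity; the exact decay and similarity function in the paper may differ:

```python
import numpy as np

def soft_weight_histogram(descriptors, vocabulary, k=4):
    """Soft-weighting sketch: each local descriptor votes for its k
    nearest visual words, the vote decaying with neighbour rank."""
    n_words = vocabulary.shape[0]
    hist = np.zeros(n_words)
    for d in descriptors:
        dists = np.linalg.norm(vocabulary - d, axis=1)
        nearest = np.argsort(dists)[:k]
        for rank, w in enumerate(nearest):
            sim = 1.0 / (1.0 + dists[w])       # distance -> similarity (assumed form)
            hist[w] += sim / (2.0 ** rank)     # rank-decayed vote
    return hist / max(hist.sum(), 1e-12)       # L1-normalised BoW histogram
```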

2.
The method based on bag-of-visual-words (BoW) derived from local keypoints has recently shown promise for video annotation. The visual word weighting scheme has a critical impact on the performance of the BoW method. In this paper, we propose a new visual word weighting scheme referred to as emerging patterns weighting (EP-weighting). The EP-weighting scheme can efficiently capture the co-occurrence relationships of visual words and improve the effectiveness of video annotation. The proposed scheme first finds emerging patterns (EPs) of visual keywords in the training dataset, and then performs an adaptive weight assignment for each visual word according to the EPs. The adjusted BoW features are used to train classifiers for video annotation. A systematic performance study on the TRECVID corpus covering 20 semantic concepts shows that the proposed scheme is more effective than other popular weighting schemes.
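For intuition, here is a toy sketch of an EP-style weight adjustment, assuming each training clip's BoW is reduced to the set of visual-word ids it contains; real EP mining and the paper's weight-update rule are considerably richer:

```python
from collections import Counter
from itertools import combinations

def ep_weights(pos_bows, neg_bows, min_growth=2.0):
    """Toy EP-weighting: find visual-word pairs whose co-occurrence
    support grows sharply from negative to positive training clips,
    then boost the words involved. Each bow is the set of visual-word
    ids present in one training clip."""
    def pair_support(bows):
        counts = Counter()
        for bow in bows:
            for pair in combinations(sorted(bow), 2):
                counts[pair] += 1
        n = max(len(bows), 1)
        return {pair: c / n for pair, c in counts.items()}

    sup_pos, sup_neg = pair_support(pos_bows), pair_support(neg_bows)
    boost = Counter()
    for pair, s in sup_pos.items():
        growth = s / max(sup_neg.get(pair, 0.0), 1e-6)
        if growth >= min_growth:               # an "emerging" pattern
            for word in pair:
                boost[word] = max(boost[word], growth)
    return boost                               # word id -> multiplicative boost
```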

3.
To address the large number of duplicate and near-duplicate videos on the web, this paper proposes a fast and accurate web video duplicate detection method based on shot-level comparison and locality-sensitive hashing (LSH). Video similarity is judged by the ratio of matched shots to the total number of shots in the query video. In addition, LSH, a well-known approximate nearest-neighbor search technique, is applied at the shot level to find similar shots quickly and thus speed up detection. Taking shots as the retrieval unit, the shots of all videos in the database are pooled into a new dataset; each shot of the seed (query) video is issued as a query request, the LSH-based approximate nearest-neighbor search retrieves all shots matching the query shot, and the returned results are finally fused to obtain the set of videos that duplicate or nearly duplicate the query. Experiments on the CC_WEB_VIDEO dataset of 12,790 videos show that the method achieves better detection performance than existing methods.
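A minimal sketch of this shot-level pipeline, assuming random-hyperplane LSH as the hash family (the specific LSH variant and parameters used in the paper are not given in the abstract):

```python
import numpy as np
from collections import Counter

class HyperplaneLSH:
    """Random-hyperplane LSH table over shot-level feature vectors."""
    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_bits, dim))
        self.table = {}

    def _key(self, v):
        return tuple(int(b) for b in (self.planes @ v > 0))

    def index(self, video_id, v):
        self.table.setdefault(self._key(v), []).append(video_id)

    def query(self, v):
        return self.table.get(self._key(v), [])

def video_similarity(query_shots, lsh):
    """Similarity criterion from the abstract: the fraction of query
    shots that find at least one matching shot in a candidate video."""
    hits = Counter()
    for shot in query_shots:
        for vid in set(lsh.query(shot)):
            hits[vid] += 1
    n = max(len(query_shots), 1)
    return {vid: c / n for vid, c in hits.items()}
```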

4.
The bag-of-words (BoW) representation built on a vocabulary tree is currently the mainstream approach in image retrieval. To address the loss of spatial context information in the traditional vocabulary tree, an image retrieval method based on a spatial-context-weighted vocabulary tree is proposed. Within the vocabulary tree framework, the method first generates a spatial context description for each SIFT point. It then weights the matching scores between SIFT points by their spatial context similarity to obtain the similarity between images. Finally, retrieval is completed by ranking these similarities. Experimental results show that the method substantially improves image retrieval performance while scaling well to large image databases.

5.
Many recent image retrieval methods are based on the “bag-of-words” (BoW) model with some additional spatial consistency checking. This paper proposes a more accurate similarity measurement that takes the spatial layout of visual words into account in an offline manner. The similarity measurement is embedded in the standard pipeline of the BoW model and improves two aspects of the model: i) latent visual words are added to a query based on spatial co-occurrence, to improve recall; and ii) weights of reliable visual words are increased, to improve precision. The combination of these methods leads to a more accurate measurement of image similarity. This is similar in concept to the combination of query expansion and spatial verification, but requires no query-time processing, which would be too expensive to apply to the full list of ranked results. Experimental results demonstrate the effectiveness of the proposed method on three public datasets.
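The offline query-expansion step (i) can be sketched as follows, assuming a precomputed word-by-word spatial co-occurrence matrix with a zero diagonal; the damping weight alpha and the number of latent words added per query word are illustrative parameters, not values from the paper:

```python
import numpy as np

def expand_query(bow, cooc, top_m=5, alpha=0.3):
    """Add latent visual words to a query BoW vector based on spatial
    co-occurrence. cooc is a precomputed word-by-word co-occurrence
    matrix (zero diagonal assumed); alpha damps the latent votes."""
    expanded = bow.astype(float)
    for w in np.flatnonzero(bow):
        latent = np.argsort(cooc[w])[::-1][:top_m]   # top co-occurring words
        expanded[latent] += alpha * bow[w] * cooc[w, latent]
    return expanded / max(expanded.sum(), 1e-12)
```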

6.
7.
An object recognition method based on an optimized bag-of-words model
To address the shortcomings of existing bag-of-words object recognition methods, the feature representation, visual vocabulary, and image representation are optimized to improve recognition accuracy. HUE histograms and SIFT descriptors are used to describe the color and shape around interest points, respectively, realizing both feature-level and image-level fusion of the two features within the bag-of-words model. The K-means++ clustering algorithm is introduced to generate the visual vocabulary, and soft weighting is used to map feature vectors to visual words when forming the image histogram. Experimental results show that the method yields higher object recognition accuracy and that the results are insensitive to the fusion weights of the two features.
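A rough sketch of the codebook and histogram construction, assuming concatenation as the feature-level fusion operator and the same rank-decayed soft weighting shown under item 1; both choices are assumptions for illustration, and in practice the codebook would be trained over a whole corpus rather than one image's features:

```python
import numpy as np
from sklearn.cluster import KMeans

def fused_bow_histogram(sift, hue, n_words=500, k=4):
    """Feature-level fusion (concatenation of per-keypoint SIFT and HUE
    descriptors) followed by a k-means++ codebook and a soft-weighted
    histogram."""
    feats = np.hstack([sift, hue])                 # one row per interest point
    km = KMeans(n_clusters=n_words, init="k-means++", n_init=1).fit(feats)
    dists = km.transform(feats)                    # point-to-centre distances
    hist = np.zeros(n_words)
    for row in dists:
        for rank, w in enumerate(np.argsort(row)[:k]):
            hist[w] += 1.0 / (2.0 ** rank)         # soft weighting as in item 1
    return hist / hist.sum()
```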

8.
Efficient near-duplicate image detection is important for applications in which feature extraction and matching must be performed online. Most image representations aimed at conventional image retrieval are either computationally expensive to extract and match or limited in robustness. To address this problem, we propose an effective and efficient local-based representation method that encodes an image as a binary vector, called Local-based Binary Representation (LBR). Local regions are extracted densely from the image, and each region is converted to a simple and effective feature describing its texture. A statistical histogram is calculated over all the local features and then encoded to a binary vector as the holistic image representation. The proposed binary representation jointly exploits local region texture and the global visual distribution of the image, on top of which a similarity measure can be applied to detect near-duplicate images effectively. The binary encoding scheme not only greatly speeds up online computation but also reduces memory cost in real applications. In experiments, the precision, recall, and computational time of the proposed method are compared with other state-of-the-art image representations, and LBR shows clear advantages on online near-duplicate image detection and video keyframe detection tasks.
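A schematic reading of the LBR pipeline: dense local regions, a cheap per-region texture code, a global histogram, and a final binarization. The concrete texture code below (mean-thresholded block bits) and the median-threshold binarization are stand-ins for the paper's own definitions:

```python
import numpy as np

def lbr_signature(gray, grid=8, n_codes=64):
    """LBR-style sketch: dense local regions -> cheap 6-bit texture code
    -> global histogram -> binary signature."""
    h, w = gray.shape
    codes = []
    for i in range(0, h - grid + 1, grid):
        for j in range(0, w - grid + 1, grid):
            block = gray[i:i + grid, j:j + grid].astype(float)
            bits = (block > block.mean()).flatten()[:6]     # 6-bit texture code
            codes.append(int("".join(str(int(b)) for b in bits), 2))
    hist, _ = np.histogram(codes, bins=n_codes, range=(0, n_codes))
    return hist > np.median(hist)            # binary holistic representation

def hamming_similarity(sig_a, sig_b):
    """Similarity between two binary signatures (fast XOR/pop-count)."""
    return 1.0 - np.count_nonzero(sig_a ^ sig_b) / sig_a.size
```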

9.
In recent years, bag-of-words (BoW) video representations have achieved promising results in human action recognition in videos. By vector-quantizing local spatio-temporal (ST) features, the BoW video representation brings simplicity and efficiency, but also limitations. First, the discretization of the feature space in BoW inevitably results in ambiguity and information loss in the video representation. Second, there exists no universal codebook for the BoW representation; the codebook needs to be rebuilt whenever the video corpus changes. To tackle these issues, this paper explores a localized, continuous, and probabilistic video representation. Specifically, the proposed representation encodes the visual and motion information of an ensemble of local ST features of a video into a distribution estimated by a generative probabilistic model. Furthermore, the probabilistic video representation naturally gives rise to an information-theoretic distance metric between videos. This makes the representation readily applicable to most discriminative classifiers, such as nearest-neighbor schemes and kernel-based classifiers. Experiments on two datasets, KTH and UCF Sports, show that the proposed approach delivers promising results.
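One natural instantiation of this idea is a Gaussian mixture over the local ST features together with a Monte-Carlo symmetrized KL divergence as the information-theoretic distance; the paper's exact generative model and distance may differ:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_video_model(st_features, n_components=8):
    """Fit a generative model (here a GMM) over a video's local
    spatio-temporal features instead of vector-quantising them."""
    return GaussianMixture(n_components=n_components).fit(st_features)

def symmetric_kl(gm_p, gm_q, n_samples=2000):
    """Monte-Carlo estimate of the symmetrised KL divergence between two
    fitted mixtures, usable as a distance in nearest-neighbour or
    kernel-based classifiers."""
    x, _ = gm_p.sample(n_samples)
    kl_pq = np.mean(gm_p.score_samples(x) - gm_q.score_samples(x))
    y, _ = gm_q.sample(n_samples)
    kl_qp = np.mean(gm_q.score_samples(y) - gm_p.score_samples(y))
    return 0.5 * (kl_pq + kl_qp)
```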

10.
A new method for video retrieval by example video
In the context of a system developed by the authors, a new video similarity measurement model is proposed, its implementation is described in detail, and experimental results are reported. Compared with existing algorithms, it builds on human subjective visual judgment and introduces a number of factors that influence similarity, such as ordering, speed, and interference factors, and thus reflects the degree of similarity between videos more comprehensively and systematically. The algorithm is also granularity-adaptive and can compare video structures at different levels; in a video retrieval system, it can be applied widely to queries in which the user submits an example.

11.
Relevance feedback in video retrieval
Content-based video retrieval is an important research area that has attracted the attention of many researchers. A common query mode is retrieval by example video, but how to define whether two videos are similar remains an open problem, which limits the applicability of retrieval systems. Moreover, because video content is complex, different users may focus on different aspects of the same video during retrieval. User feedback is therefore incorporated: when users are unsatisfied with the query results, the results can be refined to emphasize their needs. Based on characteristics of human visual psychology, a video similarity measurement model is introduced that measures inter-video similarity at multiple levels, such as shots and whole videos, and from multiple perspectives of visual judgment. On this basis, a relevance feedback method operating at multiple granularities, the shot level and the video level, is proposed. The whole process is automatic and flexibly refines the retrieval results according to user feedback.
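The abstract does not spell out the feedback update rule; a classic Rocchio-style refinement over shot- or video-level feature vectors is sketched here as one standard possibility:

```python
import numpy as np

def rocchio_update(query, liked, disliked, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style refinement: move the query vector toward shots or
    videos the user marked as relevant and away from rejected ones."""
    q = alpha * np.asarray(query, dtype=float)
    if len(liked):
        q += beta * np.mean(liked, axis=0)
    if len(disliked):
        q -= gamma * np.mean(disliked, axis=0)
    return q
```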

12.
Recently, image representation based on the bag-of-visual-words (BoW) model has been widely applied in image and vision domains. In BoW, a visual codebook of visual words is defined, usually by clustering local features, so that any novel image can be represented by the occurrences of the visual words it contains. Given a set of images, we argue that the significance of each image is determined by the significance of its visual words. Traditionally, the significance of visual words is defined by term frequency-inverse document frequency (tf-idf), which does not necessarily capture the intrinsic visual context. In this paper, we propose a new scheme of latent visual context learning (LVCL). The visual context among images and visual words is formulated from latent semantic context and visual link graph analysis. With LVCL, the importance of visual words and images can be distinguished from each other, which facilitates image-level applications such as image re-ranking and canonical image selection. We validate our approach on text-query-based search results returned by Google Image. Experimental results demonstrate the effectiveness and potential of LVCL in image re-ranking and canonical image selection over state-of-the-art approaches.
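As one plausible reading of "visual link graph analysis", a PageRank-style random walk over a visual-word link graph yields a significance score per word; the graph construction and propagation rule in the paper may differ:

```python
import numpy as np

def word_significance(adj, damping=0.85, iters=50):
    """PageRank-style random walk over a visual-word link graph: words
    strongly linked to other significant words receive high scores."""
    n = adj.shape[0]
    col_sums = adj.sum(axis=0).astype(float)
    col_sums[col_sums == 0] = 1.0             # avoid division by zero
    p = adj / col_sums                        # column-stochastic transitions
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - damping) / n + damping * (p @ r)
    return r                                  # significance per visual word
```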

13.

As one of the key technologies in content-based near-duplicate detection and video retrieval, video sequence matching judges whether two videos contain duplicate or near-duplicate segments. Despite considerable research effort in recent years, precisely and efficiently matching sequences among videos (which may be subject to complex audio-visual transformations) in a large-scale database remains a challenging task. To address this problem, this paper proposes a multiscale video sequence matching (MS-VSM) method, which gradually detects and locates similar segments between videos from coarse to fine scales. At the coarse scale, it uses the Maximum Weight Matching (MWM) algorithm to rapidly select several candidate reference videos from the database for a given query. For each candidate video, its most similar segment with respect to the query is then obtained at the middle scale by the Constrained Longest Ascending Matching Subsequence (CLAMS) algorithm and used to judge whether the candidate contains a near-duplicate. If so, the precise locations of the near-duplicate segments in both the query and the reference video are determined at the fine scale by bi-directional scanning that checks the matching similarity at the segments' boundaries. As such, the MS-VSM method achieves excellent near-duplicate detection accuracy and localization precision with very high processing efficiency. Extensive experiments show that it outperforms several state-of-the-art methods remarkably on several benchmarks.
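The middle-scale step hinges on finding the longest run of frame matches that ascends in both timelines. Below is a minimal sketch of that core, with the CLAMS-specific constraints (e.g., temporal-gap bounds) omitted:

```python
from bisect import bisect_left

def longest_ascending_match(pairs):
    """Length of the longest subsequence of frame matches (q_time, r_time)
    that is strictly ascending in both the query and reference timelines."""
    # Sort by query time; for ties, reference time descending, so two
    # matches sharing a query frame can never both be selected.
    pairs = sorted(pairs, key=lambda p: (p[0], -p[1]))
    tails = []                        # patience sorting over reference time
    for _, j in pairs:
        k = bisect_left(tails, j)
        if k == len(tails):
            tails.append(j)
        else:
            tails[k] = j
    return len(tails)
```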

14.
Objective: Copy image retrieval methods based on the bag-of-words model are currently the most effective. However, information loss in local-feature quantization weakens the discriminative power of visual words and increases false visual-word matches, which degrades copy retrieval. To address the visual-word mismatch problem, a copy image retrieval method based on neighboring context is proposed; it disambiguates visual words through the contextual relationships of local features, improves their discriminability, and thereby improves copy retrieval. Method: First, the feature points around a local feature point are selected as its context according to distance and scale relationships; the selected points are called neighbor feature points. A context descriptor is then built for the local feature from the neighbor points' information and their relations to it. Matched local-feature pairs are verified by computing the similarity of their context descriptors. Finally, image similarity is measured by the number of correctly matched feature points, and candidate images are returned accordingly. Result: Experiments on the Copydays dataset compare the method with a baseline. With 100 k distractor images, mAP improves by 63% over the baseline; when the distractor set grows from 100 k to 1 M, the baseline's mAP drops by 9% while the proposed method's drops by only 3%. Conclusion: The proposed copy image retrieval method is robust to image editing operations such as rotation, image overlay, scaling, and cropping, and can be effectively applied to fields such as image forensics and image deduplication.
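A simple sketch of the neighbor-context idea: build a descriptor for each keypoint from its scale-compatible nearest neighbors and use it to verify visual-word matches. The concrete encoding below (scale-normalized distances plus scale ratios) is an illustrative stand-in for the paper's construction:

```python
import numpy as np

def context_descriptor(points, scales, idx, k=8):
    """Context descriptor for keypoint idx from its k nearest
    scale-compatible neighbours (assumes at least k such neighbours)."""
    p, s = points[idx], scales[idx]
    d = np.linalg.norm(points - p, axis=1)
    d[idx] = np.inf                                   # exclude the point itself
    compatible = (scales > 0.5 * s) & (scales < 2.0 * s)
    d[~compatible] = np.inf                           # drop incompatible scales
    nn = np.argsort(d)[:k]
    return np.hstack([d[nn] / (s + 1e-9), scales[nn] / s])

def verify_match(ctx_a, ctx_b, thresh=0.5):
    """Keep a visual-word match only if the context descriptors agree."""
    return np.linalg.norm(ctx_a - ctx_b) < thresh
```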

15.
Based on the analysis of temporal slices, we propose novel approaches for the clustering and retrieval of video shots. Temporal slices are a set of two-dimensional (2-D) images extracted along the time dimension of an image volume; they encode a rich set of visual patterns for similarity measurement. In this paper, we first demonstrate that tensor histogram features extracted from temporal slices are suitable for motion retrieval. We then integrate tensor and color histograms to construct a two-level hierarchical clustering structure: each cluster in the top level contains shots with similar color, while each cluster in the bottom level consists of shots with similar motion. The constructed structure is then used for cluster-based retrieval. The proposed approaches are found to be particularly useful for sports games, where motion and color are important visual cues when searching and browsing for desired video shots.
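Extracting temporal slices is straightforward; the sketch below cuts the central horizontal and vertical slices from an image volume of shape (T, H, W). Taking only the central row and column is a simplification; the paper analyzes a richer set of slices and computes tensor histograms over them:

```python
import numpy as np

def temporal_slices(volume):
    """Extract two temporal slices from an image volume of shape
    (T, H, W): 2-D images cut along the time axis."""
    t, h, w = volume.shape
    horizontal = volume[:, h // 2, :]   # (T, W): central row over time
    vertical = volume[:, :, w // 2]     # (T, H): central column over time
    return horizontal, vertical
```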

16.
17.
Status and research progress of semantic video retrieval
This paper surveys visual features of images such as color, texture, shape, and motion information, spatio-temporal relationship analysis, and multi-feature object extraction and similarity measures; analyzes video semantic extraction and semantic query and retrieval; and discusses the performance evaluation of semantic video retrieval, open problems, and directions for future development.

18.
As an effective technique for managing and exploring large-scale video collections, personalized video search has received great attention in recent years. One of the key problems in the related technique development is how to design and evaluate similarity measures. Most existing approaches simply adopt the traditional Euclidean distance or its variants. Consequently, they generally suffer from two main disadvantages: (1) low effectiveness: retrieval accuracy is poor, partly because very little research has been carried out on designing an effective fusion scheme for integrating multimodal information (e.g., text, audio, and visual) from video sequences; and (2) poor scalability: the development of video similarity metrics is largely disconnected from that of the relevant database access methods (indexing structures). This article presents a new distance metric, called personalized video distance, that effectively fuses information about individual preference and multimodal properties into a compact signature. Moreover, a novel hashing-based indexing structure is designed to facilitate a fast retrieval process and better scalability. A set of comprehensive empirical studies has been carried out on two large video test collections with carefully designed queries of different complexities. We observe significant improvements over existing techniques on various aspects.
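One plausible reading of the fusion step: weight each modality's features by the user's preference before concatenating into a compact, normalized signature. The preference dictionary and the concatenation-based fusion below are assumptions, not the paper's definition:

```python
import numpy as np

def personalized_signature(text_f, audio_f, visual_f, prefs):
    """Weight each modality's feature vector by the user's preference,
    then fuse by concatenation into a compact normalised signature."""
    sig = np.hstack([prefs["text"] * np.asarray(text_f),
                     prefs["audio"] * np.asarray(audio_f),
                     prefs["visual"] * np.asarray(visual_f)])
    return sig / (np.linalg.norm(sig) + 1e-12)
```

Such signatures could then be placed in a hashing-based index (for instance, an LSH table like the one sketched under item 3) to obtain sub-linear query time.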

19.
In image retrieval frameworks based on the bag-of-words model, images often contain a large number of SIFT feature points, many of which are not distinctive, so the efficiency and performance of the retrieval system suffer. Based on the properties of SIFT feature points and the principle of visual saliency, a local symmetry measure for SIFT feature points is proposed, and a symmetry-based SIFT point filtering method and weighting strategy are embedded in the image retrieval framework to improve the utilization of SIFT points. Experimental results on the Oxford Buildings dataset show that the proposed symmetry-based SIFT point selection strategy effectively improves image retrieval performance.
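A toy proxy for the symmetry measure: score each keypoint by the correlation between its surrounding patch and that patch's left-right mirror, then filter. The paper defines its own local symmetry measure; this is only illustrative:

```python
import numpy as np

def local_symmetry(gray, kp, radius=8):
    """Correlation between the patch around keypoint kp=(x, y) and its
    left-right mirror, in [-1, 1]; higher means more symmetric."""
    x, y = int(kp[0]), int(kp[1])
    p = gray[y - radius:y + radius, x - radius:x + radius].astype(float)
    if p.shape != (2 * radius, 2 * radius):
        return 0.0                         # too close to the image border
    m = p[:, ::-1]                         # mirrored patch
    p, m = p - p.mean(), m - m.mean()
    denom = np.sqrt((p ** 2).sum() * (m ** 2).sum()) + 1e-12
    return float((p * m).sum() / denom)

def filter_keypoints(gray, kps, thresh=0.3):
    """Keep only keypoints whose neighbourhood is sufficiently symmetric."""
    return [kp for kp in kps if local_symmetry(gray, kp) >= thresh]
```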

20.