Similar Documents
 Found 18 similar documents (search time: 156 ms)
1.
A Comparison of Hash Functions for Web Information Gathering   Total citations: 4 (self: 0, other: 4)
In Web information gathering (crawling), the system must determine whether a candidate page is already in the set of collected pages. To keep collection fast, this test is implemented with hash functions. Based on a sequence of more than 20 million URLs, a large-scale experimental evaluation compared the first-order and second-order hash collision rates of the functions Tianlhash, ELFhash, hflp, hf, and Strhash. The results show that Strhash and Tianlhash perform best and are recommended, and that ELFhash outperforms hflp and hf in the tests. Using second-order hashing, the 天罗 (Tianluo) Web information gathering system occupies only a few megabytes of memory, greatly increasing collection speed and reducing database load.
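Of the functions compared above, ELFhash is a well-known string hash from the SysV ELF object format. A minimal Python sketch of it, together with a first-order collision-rate measurement of the kind such an evaluation performs, might look like the following (the URL list and table size are illustrative; Tianlhash and Strhash are specific to the paper and not reproduced here):

```python
def elf_hash(s: str, table_size: int = 1 << 20) -> int:
    """Classic ELFhash over a string, reduced modulo the table size."""
    h = 0
    for ch in s:
        h = (h << 4) + ord(ch)
        g = h & 0xF0000000
        if g:
            h ^= g >> 24
        h &= ~g & 0xFFFFFFFF  # keep h within 32 bits
    return h % table_size

def collision_rate(keys, hash_fn, table_size):
    """First-order collision rate: fraction of keys whose slot is
    already occupied by a different key."""
    seen = {}
    collisions = 0
    for k in keys:
        slot = hash_fn(k, table_size)
        if slot in seen and seen[slot] != k:
            collisions += 1
        else:
            seen[slot] = k
    return collisions / len(keys)

# Illustrative URL set, far smaller than the paper's 20-million-URL sequence.
urls = [f"http://example.com/page{i}" for i in range(10000)]
rate = collision_rate(urls, elf_hash, 1 << 16)
```

A lower first-order collision rate means fewer probes per lookup, which is what makes one function preferable to another for crawl-time duplicate detection.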

2.
Firewalls control access to Internet information resources through URL filtering. To support URL filtering on high-speed firewalls, this paper proposes a bitmap method that improves the hash-table data structure of the URL filter and speeds up hash-table lookup, and a fast compression method that reduces the filter's space usage. After applying the bitmap and fast-compression improvements together with a cache optimization, an experimental evaluation of the URL filter found that the average filtering speed improved by 253.7% and the space usage fell by 25.7%.

3.
贾建伟, 陈崚. 《计算机科学》 2016, 43(6): 254-256, 311
When b-bit hash functions are used to approximate the Jaccard similarity of two sets, they cannot distinguish well among multiple elements whose Jaccard similarity to the input element is uniformly high (close to 1). To improve the accuracy of the data sketch and the performance of similarity-based applications, a set-similarity approximation algorithm based on the parity of the data sketch is proposed. After minwise hash functions are applied to obtain two permuted sets, two n-bit indicator vectors record the parity with which elements of the permuted sets appear, and the Jaccard similarity of the original sets is estimated from these two parity vectors. The parity sketch is analyzed under both a Markov-chain model and a Poisson model, and the two analyses are shown to be equivalent. Experiments on the Enron dataset show that the proposed parity sketch is more accurate than traditional b-bit hash functions and performs better in two applications: duplicate-document detection and association-rule mining.
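The minwise-hashing baseline that the parity sketch builds on can be sketched as follows; the hash family, signature length, and the two example sets are illustrative choices, not the paper's experimental setup:

```python
import random

def minhash_signature(items, hash_funcs):
    """One minwise-hash value per function: the minimum hash over the set."""
    return [min(h(x) for x in items) for h in hash_funcs]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of agreeing positions -- an unbiased estimate of the
    Jaccard similarity |A ∩ B| / |A ∪ B|."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

# A simple linear-congruential hash family over a large prime modulus.
rng = random.Random(42)
P = (1 << 61) - 1
funcs = []
for _ in range(256):
    a, b = rng.randrange(1, P), rng.randrange(P)
    funcs.append(lambda x, a=a, b=b: (a * hash(x) + b) % P)

A = set(range(0, 80))
B = set(range(20, 100))   # true Jaccard = 60 / 100 = 0.6
est = estimate_jaccard(minhash_signature(A, funcs),
                       minhash_signature(B, funcs))
```

A b-bit scheme would keep only the low b bits of each minwise value, shrinking the signature at the cost of the discrimination loss the paper's parity vectors are designed to recover.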

4.
Feature matching is a fundamental problem in image recognition. Common approaches use greedy linear-scan matching, which suits only low-dimensional data; once the dimensionality exceeds a certain level, their efficiency drops sharply, sometimes to no better than brute-force linear scan. This paper proposes a binary-feature matching method based on min-hashing. Min-hash mappings split the original feature set into many subsets, turning the problem of finding neighbors within one very large set into that of finding neighbors within a small set, which reduces the computation. A min-hash family built on the Jaccard distance maximally preserves the property that vector pairs similar in the original data remain similar after hashing. Experiments show that, applied to binary features, this matching method outperforms a KD-Tree.

5.
A hash function takes a string of arbitrary length as input and outputs a fixed-length pseudorandom string. Because the output is fixed for a given input, it is a deterministic function. The fixed output length is short, typically a few dozen bytes: MD5, widely used in industry, outputs 16 bytes, and SHA-1 outputs 20 bytes. Since a hash function's input space is an infinite set while its output space is finite, a hash function cannot be a one-to-one mapping. The longer the output, the higher the security.
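The digest lengths quoted above can be checked directly with Python's standard hashlib:

```python
import hashlib

# Fixed-length digests regardless of input length:
# MD5 -> 16 bytes, SHA-1 -> 20 bytes.
md5_digest = hashlib.md5(b"any input, of any length").digest()
sha1_digest = hashlib.sha1(b"any input, of any length").digest()
print(len(md5_digest), len(sha1_digest))   # 16 20
```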

6.
A multi-layer Count-Min sketch is constructed to summarize hierarchical structure in data streams. A family of pairwise-independent XOR hash functions is defined over the multi-layer data domain U*, and stream tuples are mapped into a three-dimensional L×D×W counting array, where L is the number of layers, D is the number of hash functions drawn uniformly at random from the family, and W is the range of each hash function. On top of this structure, a breadth-first query strategy finds multi-layer frequent itemsets and estimates their frequencies. Experiments show that this structure substantially improves on simply stacking multiple independent Count-Min sketches in update time, storage space, and estimation accuracy.
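A single-layer Count-Min sketch, the building block the entry stacks L times, might be sketched as follows (the seeded-tuple hashing stands in for the paper's XOR hash family, and the depth/width values are illustrative):

```python
import random

class CountMin:
    """Single-layer Count-Min sketch: D hash rows of width W each;
    an item's estimated count is the minimum over its D counters."""
    def __init__(self, depth: int, width: int, seed: int = 0):
        rng = random.Random(seed)
        self.width = width
        self.seeds = [rng.randrange(1 << 32) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        for row, s in enumerate(self.seeds):
            yield row, hash((s, item)) % self.width

    def update(self, item, count: int = 1):
        for row, col in self._slots(item):
            self.table[row][col] += count

    def query(self, item) -> int:
        # Never underestimates; overestimation shrinks as width grows.
        return min(self.table[row][col] for row, col in self._slots(item))

cm = CountMin(depth=4, width=1024)
for _ in range(5):
    cm.update("a")
cm.update("b", 3)
```

The multi-layer version would keep one such array per level of the hierarchy and answer range/prefix queries by a breadth-first descent through the levels.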

7.
Most request-dispatching algorithms for Web cluster services hash the request URL and schedule the request content for load balance according to some rule. This paper proposes URLALLOC, an algorithm based on lexicographic ordering of URLs: all URLs are sorted lexicographically and partitioned into k*n buckets, which are then ordered by access traffic and assigned in complementary segments so that the Web load is distributed as evenly as possible across the back-end servers. Simulation results show that URLALLOC balances load better than existing URL-hashing methods.
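A rough sketch of the lexicographic-bucket idea follows. This is a simplification under stated assumptions: the bucket count k*n and the round-robin bucket-to-server assignment are illustrative stand-ins, not the paper's exact URLALLOC, which additionally reorders buckets by measured traffic and pairs them in complementary segments:

```python
def build_url_allocator(urls, n_servers: int, k: int = 4):
    """Sort URLs lexicographically, cut the sorted list into k*n equal
    ranges, and assign ranges to servers round-robin, so each server
    serves k non-adjacent lexicographic slices."""
    ordered = sorted(urls)
    total = k * n_servers
    size = max(1, len(ordered) // total)
    boundaries = []
    assignment = {}
    for b in range(total):
        start = b * size
        boundaries.append(ordered[start] if start < len(ordered) else None)
        assignment[b] = b % n_servers

    def route(url: str) -> int:
        # Last bucket whose lower boundary is <= url (linear for clarity).
        idx = 0
        for b, bound in enumerate(boundaries):
            if bound is not None and url >= bound:
                idx = b
        return assignment[idx]

    return route

urls = [f"/p{i:04d}" for i in range(1000)]
route = build_url_allocator(urls, n_servers=4)
counts = {}
for u in urls:
    s = route(u)
    counts[s] = counts.get(s, 0) + 1
```

With uniform traffic this yields near-equal shares per server; the traffic-aware reordering in the paper handles the realistic case where some URL ranges are much hotter than others.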

8.
In a hash table, when two distinct words are mapped to the same slot, we call it a collision. Collisions slow down dictionary lookup, so perfect hash functions, which avoid collisions entirely, are widely used in applications with strict lookup-performance requirements. This paper proposes an algorithm for constructing perfect hash functions for large-scale dictionaries based on multi-level correlation graphs. Every character of each dictionary word (except the first) is smoothed into two characters by two smoothing functions, and a multi-level correlation graph is built for the smoothed dictionary. Because the node degrees of this graph are small and evenly distributed, a perfect hash function is easier to generate. Experiments show that the algorithm scales to large dictionaries, achieves a fill factor close to 1, and uses less working space than existing algorithms.

9.
A Survey of Bloom Filters and Their Applications   Total citations: 10 (self: 2, other: 10)
A Bloom Filter represents a data set as a bit string and efficiently supports hashed membership lookups on set elements. This paper surveys the Bloom Filter and its variants, discusses their practicality, and describes in some detail their use in the P2P network file-storage system OceanStore and in text-retrieval systems. Directions for further research are noted.
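The bit-string representation described above can be sketched in a few lines; the bit-array size, number of hash functions, and the SHA-1-based hash derivation are illustrative choices:

```python
import hashlib

class BloomFilter:
    """Bit-string set representation: k hash functions set/check k bits.
    Membership tests may return false positives, never false negatives."""
    def __init__(self, m_bits: int, k_hashes: int):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k positions by salting one cryptographic hash.
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

bf = BloomFilter(m_bits=8192, k_hashes=4)
bf.add("ocean")
bf.add("store")
```

Storing only m bits regardless of element size is what makes the structure attractive for distributed systems such as OceanStore, where full membership lists would be too large to exchange.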

10.
When image-based 3D reconstruction is run on large image collections, pairwise matching of every image pair is time-consuming. To address this, a method is proposed that builds a global hash feature for each image and filters out invalid image pairs, greatly reducing computation time and improving the matching efficiency of large-scale 3D reconstruction. The fast hash-matching algorithm proceeds in several steps: building image hash features, building an initial match graph, selecting candidate match pairs, and hash matching. Experimental results show that the method significantly accelerates image matching in 3D reconstruction.

11.
Two Hash Functions That Work Well on URLs   Total citations: 32 (self: 2, other: 32)
李晓明, 凤旺森. 《软件学报》 2004, 15(2): 179-184
In Web information-processing research, very large URL sequences often need to be hashed. For two typical application scenarios, information lookup in Web structure analysis and load balancing in parallel search engines, a large-scale experimental evaluation was conducted on a sequence of more than 20 million URLs. It shows that ELFhash, recommended in much of the literature as an excellent string hash, does not in fact hash URLs well, and two functions that do hash URLs well are recommended instead.

12.
Many current perfect hashing algorithms suffer from the problem of pattern collisions. In this paper, a perfect hashing technique that uses array-based tries and a simple sparse matrix packing algorithm is introduced. This technique eliminates all pattern collisions, and, because of this, it can be used to form ordered minimal perfect hashing functions on extremely large word lists. This algorithm is superior to other known perfect hashing functions for large word lists in terms of function-building efficiency, pattern-collision avoidance, and retrieval-function complexity. It has been successfully used to form an ordered minimal perfect hashing function for the entire 24,481-element Unix word list without resorting to segmentation. The item lists addressed by the perfect hashing function can be ordered in any manner, including alphabetically, to easily allow other forms of access to the same list.

13.
Hashing has been widely applied to large-scale image retrieval. Among existing algorithms, unsupervised hashing is popular because it needs no semantic labels for the database images. Shift-invariant kernel locality-sensitive hashing (SKLSH) is a representative unsupervised algorithm, but it generates its hash functions randomly, without considering how well each generated function actually retrieves; SKLSH may therefore produce hash functions with poor retrieval performance. This paper proposes bit-selection hashing (BSH), which selects among the hash functions produced by SKLSH according to their concrete retrieval performance on three criteria: similarity consistency, information content, and code independence. BSH then uses a greedy selection method to find the best combination of hash functions. Retrieval experiments comparing BSH with other representative hashing algorithms on two real image databases show that BSH clearly improves retrieval accuracy over the original SKLSH and the other algorithms.
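SKLSH itself draws hash functions from a shift-invariant kernel family; as a simpler illustration of the general situation BSH addresses (randomly generated binary hash functions whose individual quality varies), here is the related random-hyperplane LSH scheme. The dimensions, vectors, and bit count are illustrative, and this is not the paper's SKLSH construction:

```python
import random

def random_hyperplane_hashes(dim: int, n_bits: int, seed: int = 0):
    """n_bits random Gaussian hyperplanes; each hash bit is the sign of
    a projection, so similar vectors tend to agree on more bits."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def encode(vec, planes):
    """Binary code: one sign bit per hyperplane."""
    return [1 if sum(w * x for w, x in zip(p, vec)) >= 0 else 0
            for p in planes]

def hamming(a, b) -> int:
    return sum(x != y for x, y in zip(a, b))

planes = random_hyperplane_hashes(dim=32, n_bits=64)
v = [1.0] * 32
u = [1.0] * 31 + [1.1]   # nearly identical direction to v
w = [-1.0] * 32          # opposite direction to v

code_v, code_u, code_w = encode(v, planes), encode(u, planes), encode(w, planes)
```

Because each bit is drawn independently at random, some bits carry little information or duplicate others; scoring and greedily selecting bits, as BSH does, keeps only the informative, mutually independent ones.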

14.
With the rapid development of the Internet, recent years have seen the explosive growth of social media. This brings great challenges in performing efficient and accurate image retrieval on a large scale. Recent work shows that using hashing methods to embed high-dimensional image features and tag information into Hamming space provides a powerful way to index large collections of social images. By learning hash codes through a spectral graph partitioning algorithm, spectral hashing (SH) has shown promising performance among various hashing approaches. However, modeling the relations among images only by pairwise simple graphs is incomplete, as it ignores higher-order relationships. In this paper, we utilize a probabilistic hypergraph model to learn hash codes for social image retrieval. A probabilistic hypergraph model offers a higher-order representation among social images by connecting more than two images in one hyperedge. Unlike a normal hypergraph model, a probabilistic hypergraph model considers not only the grouping information, but also the similarities between vertices in hyperedges. Experiments on Flickr image datasets verify the performance of our proposed approach.

15.

Explosive growth of big data demands efficient and fast algorithms for nearest neighbor search. Deep learning-based hashing methods have proved their efficacy in learning advanced hash functions suited to nearest neighbor search in large image datasets. In this work, we present a comprehensive review of the deep learning-based supervised hashing methods suggested to date by various researchers, particularly for image datasets, to generate advanced hash functions. We categorize prior works into a five-tier taxonomy based on: (i) the design of the network architecture, (ii) the training strategy based on the nature of the dataset, (iii) the type of loss function, (iv) the similarity measure and, (v) the nature of quantization. Further, the datasets used in prior works are reported and compared based on various challenges in the characteristics of the images they contain. Lastly, future directions such as incremental hashing, cross-modality hashing, and guidelines to improve the design of hash functions are discussed. Based on our comparative review, generative adversarial network-based hashing models outperform other methods, because they leverage more data in the form of both real-world and synthetically generated data. Furthermore, triplet-based loss functions learn better discriminative representations by pushing similar patterns together and dissimilar patterns away from each other. This study and its observations should be useful for researchers and practitioners working in this emerging research field.


16.
Learning-based hashing methods are becoming the mainstream for large-scale visual search. They consist of two main components: hash-code learning for training data and hash-function learning for encoding new data points. The performance of a content-based image retrieval system crucially depends on the feature representation, and Convolutional Neural Networks (CNNs) have proved effective for extracting high-level visual features for large-scale image retrieval. In this paper, we propose a Multiple Hierarchical Deep Hashing (MHDH) approach for large-scale image retrieval. MHDH integrates multiple hierarchical non-linear transformations with a hidden neural-network layer for hash-code generation. The learned binary codes represent latent concepts that connect to class labels. Extensive experiments on two popular datasets demonstrate the superiority of MHDH over both supervised and unsupervised hashing methods.

17.
Binary code is a special representation of data. With a binary format, a hashing framework can be built and large amounts of data can be indexed for fast search and retrieval. Many supervised hashing approaches learn hash functions from data with supervised information, to retrieve semantically similar samples; this supervised information can be generated from external data other than pixels. Conventional supervised hashing methods assume a fixed relationship between the Hamming distance and the similar (dissimilar) labels. This assumption imposes an overly rigid requirement on learning and makes similar and dissimilar pairs hard to distinguish. In this paper, we adopt a large-margin principle and define a Hamming margin to formulate this relationship. At the same time, inspired by the support vector machine, which achieves strong generalization by maximizing the margin of its decision surface, we propose a binary hash function in the same manner. A loss function is constructed from these two kinds of margins and minimized by a block coordinate descent method. Experiments show that our method outperforms state-of-the-art hashing methods.

18.
Hashing methods have received significant attention for effective and efficient large-scale similarity search in the computer vision and information retrieval communities. However, most existing cross-view hashing methods focus on either similarity preservation of the data or cross-view correlation. In this paper, we propose a graph-regularized supervised cross-view hashing (GSCH) that preserves the semantic correlation and the intra-view and inter-view similarity simultaneously. In particular, GSCH uses intra-view similarity to estimate the inter-view similarity structure. We further propose a sequential learning approach to derive the hash function for each view. Experimental results on benchmark datasets against state-of-the-art methods show the effectiveness of the proposed method.


Copyright © 北京勤云科技发展有限公司 (Beijing Qinyun Technology Development Co., Ltd.)   京ICP备09084417号