Similar literature
Found 20 similar articles; search took 15 ms.
1.
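Quickly and effectively finding similar documents in a massive document collection is an important and time-consuming problem. Existing document similarity search algorithms first identify a set of candidate documents and then rank the candidates by relevance to find the most relevant ones. This paper proposes Hub-N, a similarity search algorithm based on document topology, which converts document similarity search into a graph search problem and applies pruning techniques to narrow the range of documents scanned and improve search efficiency. Experiments verify the effectiveness and feasibility of the algorithm.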

2.
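In graph similarity search, graph edit distance is a widely used measure, and its computational performance largely determines the performance of graph similarity search algorithms. To address the poor performance of traditional graph edit distance algorithms caused by a large number of redundant mappings and a large search space, an improved graph edit distance algorithm is proposed. The algorithm first partitions the vertices of a graph into equivalence classes and computes mapping codes from them to identify equivalent mappings; it then defines mapping completeness to update the priority of equivalent mappings and selects a primary mapping for expansion; next, it designs an efficient heuristic function and proposes a lower-bound computation based on mapping codes to obtain the optimal mapping quickly. Finally, the improved graph edit distance algorithm is extended to graph similarity search. Experimental results on different data sets show that the algorithm achieves better search performance, reducing the search space by up to 49% and improving speed by about 29%.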

3.
String similarity search and join are two important operations in data cleaning and integration, which extend traditional exact search and exact join operations in databases by tolerating errors and inconsistencies in the data. They have many real-world applications, such as spell checking, duplicate detection, entity resolution, and webpage clustering. Although these two problems have been extensively studied in the past decade, there is no thorough survey. In this paper, we present a comprehensive survey on string similarity search and join. We first give the problem definitions and introduce widely-used similarity functions to quantify the similarity. We then present an extensive set of algorithms for string similarity search and join. We also discuss their variants, including approximate entity extraction, type-ahead search, and approximate substring matching. Finally, we provide some open datasets and summarize some research challenges and open problems.
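As a rough illustration of the threshold-based string similarity search described in this abstract, the sketch below uses edit distance as the similarity function and a simple length filter before verification. The function names and the linear scan are assumptions made for illustration; the survey itself covers far more efficient index-based algorithms that are not reproduced here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity_search(query, strings, tau):
    """Return the strings within edit distance tau of query (linear scan)."""
    results = []
    for s in strings:
        # Length filter: if the lengths differ by more than tau,
        # the edit distance must exceed tau, so skip verification.
        if abs(len(s) - len(query)) > tau:
            continue
        if edit_distance(query, s) <= tau:
            results.append(s)
    return results
```

For example, `similarity_search("vldb", ["pvldb", "vldb", "icde", "sigmod"], 1)` keeps only the strings that are at most one edit away from the query.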

4.
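Based on an analysis of the PageRank algorithm, this paper argues that PageRank can feasibly be applied to similarity search over scientific literature, and proposes an improved algorithm to address PageRank's shortcomings. The algorithm combines analysis of document content with analysis of citation relationships between documents to compute an overall similarity between documents, improving the accuracy of search results. Experiments verify the effectiveness and feasibility of the algorithm.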
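For orientation, here is a minimal sketch of plain PageRank by power iteration over a citation graph. It is not the improved algorithm of this abstract, whose combination with content similarity is not specified in the text; the graph encoding, the damping factor of 0.85, and the iteration count are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain PageRank by power iteration.

    links maps each paper to the list of papers it cites.
    Papers with no outgoing citations spread their rank evenly.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: distribute its rank over all nodes.
                share = damping * rank[v] / n
                for u in nodes:
                    new_rank[u] += share
        rank = new_rank
    return rank
```

In a citation graph such as `{"A": ["C"], "B": ["C"], "C": []}`, the paper cited by both others ends up with the highest rank.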

5.
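As more and more Web services are registered with UDDI registries, finding a suitable Web service is becoming as difficult as finding a suitable page on the Internet. Existing techniques mainly retrieve the description information of Web services in a UDDI registry through keyword matching. However, because Web service descriptions are very sparse, traditional information retrieval techniques do not perform well, so an ontology-based Web service retrieval technique is proposed. Building on existing improvements to the Web service retrieval process, it makes full use of the Web service description information obtained from the UDDI registry, uses an ontology to describe its internal relationships, and on that basis applies ontology similarity techniques to compare and match Web service descriptions.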

6.
An efficient similarity query algorithm for multivariate time series   Total citations: 1, self-citations: 0, citations by others: 1
周大镯  吴晓丽  闫红灿 《计算机应用》2008,28(10):2541-2543
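To perform similarity queries on multivariate time series (MTS) efficiently, a similarity query algorithm based on a distance-based index structure (Dbis) is proposed. Principal component analysis is used to reduce the dimensionality of the MTS data; the principal component sequences of the MTS are clustered, and the centroid of each cluster is chosen as a reference point; each cluster is then mapped to a one-dimensional space according to its reference point, so that a B+-tree can be used for indexed queries. Similarity between MTS is measured with the extended Frobenius norm (Eros). Experiments on a stock data set verify the efficiency of the Dbis algorithm.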

7.
Improving the recall of information retrieval systems for similarity search in time series databases is of great practical importance. In the manufacturing domain, these systems are used to query large databases of manufacturing process data that contain terabytes of time series data from millions of parts. This allows domain experts to identify parts that exhibit specific process faults. In practice, the search often amounts to an iterative query–response cycle in which users define new queries (time series patterns) based on results of previous queries. This is a well-documented phenomenon in information retrieval and not unique to the manufacturing domain. Indexing manufacturing databases to speed up the exploratory search is often not feasible as it may result in an unacceptable reduction in recall. In this paper, we present a novel adaptive search algorithm that refines the query based on relevance feedback provided by the user. Additionally, we propose a mechanism that allows the algorithm to self-adapt to new patterns without requiring any user input. As the search progresses, the algorithm constructs a library of time series patterns that are used to accurately find objects of the target class. Experimental validation of the algorithm on real-world manufacturing data shows that the recall for the retrieval of fault patterns is considerably higher than that of other state-of-the-art adaptive search algorithms. Additionally, its application to publicly available benchmark data sets shows that these results are transferable to other domains.

8.
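This paper studies a new approach to analyzing coal mine gas emission based on time series similarity search, and proposes a PPR-based similarity search method for coal mine gas monitoring data. The experiments use real gas monitoring data from the Yuhua coal mine, with information loss and similarity query efficiency as the evaluation metrics. Comparative experiments against time series similarity search algorithms based on the discrete Fourier transform (DFT) and the discrete wavelet transform (DWT) show that, at the same compression ratio, the three methods incur similar information loss, but the average query efficiency of the PPR-based similarity search algorithm is 32% and 34% higher than that of the DFT-based and DWT-based methods, respectively. The PPR algorithm is therefore well suited to similarity search over gas monitoring data.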

9.
Efficient similarity search for market basket data   Total citations: 2, self-citations: 0, citations by others: 2
Several organizations have developed very large market basket databases for the maintenance of customer transactions. New applications, e.g., Web recommendation systems, present the requirement for processing similarity queries in market basket databases. In this paper, we propose a novel scheme for similarity search queries in basket data. We develop a new representation method, which, in contrast to existing approaches, is proven to provide correct results. New algorithms are proposed for the processing of similarity queries. Extensive experimental results, for a variety of factors, illustrate the superiority of the proposed scheme over the state-of-the-art method. Edited by R. Ng. Received: August 6, 2001 / Accepted: May 21, 2002 Published online: September 25, 2002

10.
陈湘涛  丁平尖  王晶 《计算机应用》2014,34(9):2604-2607
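Existing similarity search algorithms usually ignore the time factor. To address this, PDSim, a meta-path-based dynamic similarity search algorithm for heterogeneous information networks, is proposed. PDSim first computes the link matrix of entities under a given meta-path to obtain the ratio of meta-path instance counts between entities, and computes a time difference degree based on differences in creation time; on this basis, it derives a measure of dynamic similarity in the heterogeneous information network for the given meta-path. In several similarity search examples, PDSim captures how entities' interests change over time; when applied to clustering, its normalized mutual information (NMI) clustering accuracy improves by 0.17% to 9.24% over the PathSim and PCRW methods. Experimental results show that, compared with traditional link-based similarity search algorithms, PDSim significantly improves the efficiency and user satisfaction of dynamic similarity search in heterogeneous information networks, and is a similarity search method for studying how entities change dynamically over time.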
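The abstract does not give the PDSim formula, so the sketch below only illustrates PathSim, the static baseline it is compared against: for a symmetric meta-path, the similarity of two entities is twice the number of path instances between them divided by the sum of their self-path counts. The half-path count matrix (for example, authors by venues) and the function names are assumptions for illustration, and the time difference degree of PDSim is deliberately left out.

```python
def commuting_matrix(half_path):
    """M = A * A^T, where A[i][k] counts half-path instances
    from entity i to intermediate object k (e.g. author i, venue k)."""
    n = len(half_path)
    return [[sum(a * b for a, b in zip(half_path[i], half_path[j]))
             for j in range(n)]
            for i in range(n)]


def pathsim(m, x, y):
    """PathSim score of entities x and y under the commuting matrix m."""
    denom = m[x][x] + m[y][y]
    return 2.0 * m[x][y] / denom if denom else 0.0
```

With this normalization, two entities that have identical half-path counts score 1.0, while an entity that is merely well connected does not dominate every ranking.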

11.
Finding similar items in a large and unstructured dataset is a challenging task in many applications of data science, such as searching, indexing, and retrieval. With the increasing data volume and demand for real-time responses, similarity search has gained much attention. In this paper, a parallel computational approach for similarity search using Bloom filters (PCASSB) is proposed, which uses a Bloom filter to represent the features of a document and to compare them with the user's query. Query features are stored in an integer query array (IQA), an array of integers. PCASSB, an approximate similarity search technique, has been implemented on a graphics processing unit with compute unified device architecture as the programming platform. To compute the similarity score between the query and the reference dataset, the Dice coefficient has been used as a baseline method. The accuracy of the results generated by PCASSB is compared with the baseline method and other state‐of‐the‐art methods. The experimental results show that the proposed technique is quite effective in processing a large number of text documents, as it takes less computational time.
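The following single-threaded sketch shows the general idea of scoring a query against a document represented by a Bloom filter, next to the exact Dice coefficient used as the baseline. It is not PCASSB itself: the GPU implementation and the integer query array are not modeled, and the class name, filter size, hash construction, and scoring function are assumptions for illustration.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter; a Python int serves as the bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def dice(a, b):
    """Exact Dice coefficient of two feature collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))


def approx_dice(doc_filter, doc_size, query_features):
    """Dice-style score using Bloom filter membership tests.

    False positives can only raise the hit count, so this score
    is never below the exact Dice coefficient.
    """
    query = set(query_features)
    if doc_size == 0 and not query:
        return 1.0
    hits = sum(1 for feature in query if feature in doc_filter)
    return 2.0 * hits / (doc_size + len(query))
```

Because a Bloom filter has no false negatives, every feature that was added is always reported as present; the approximation error comes only from false positives, which become rarer as the filter grows.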

12.
Recent years have witnessed an increased interest in computing cosine similarity in many application domains. Most previous studies require the specification of a minimum similarity threshold to perform the cosine similarity computation. However, it is usually difficult for users to provide an appropriate threshold in practice. Instead, in this paper, we propose to search top-K strongly correlated pairs of objects as measured by the cosine similarity. Specifically, we first identify the monotone property of an upper bound of the cosine measure and exploit a diagonal traversal strategy for developing a TOP-DATA algorithm. In addition, we observe that a diagonal traversal strategy usually leads to more I/O costs. Therefore, we develop a max-first traversal strategy and propose a TOP-MATA algorithm. A theoretical analysis shows that TOP-MATA has the advantages of saving the computations for false-positive item pairs and can significantly reduce I/O costs. Finally, experimental results demonstrate the computational efficiencies of both TOP-DATA and TOP-MATA algorithms. Also, we show that TOP-MATA is particularly scalable for large-scale data sets with a large number of items.
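To make the top-K formulation concrete, the sketch below finds the K most similar pairs by brute force, scoring every pair and keeping the K largest. It is only a reference baseline under assumed names and data layout; it does not implement the upper-bound pruning or the traversal strategies of TOP-DATA and TOP-MATA.

```python
import heapq
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_pairs(vectors, k):
    """Return the k highest-scoring (score, name_a, name_b) triples.

    vectors maps an object name to its vector. Every pair is scored,
    so the cost is quadratic in the number of objects.
    """
    scored = ((cosine(vectors[a], vectors[b]), a, b)
              for a, b in combinations(sorted(vectors), 2))
    return heapq.nlargest(k, scored)
```

No similarity threshold is needed here: the caller asks for K pairs directly, which is the usability point the abstract makes.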

13.
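To reduce the influence of noisy data on finding the best-matching sequence, and to avoid the Euclidean distance's sensitivity to shape and its requirement that sequences be of equal length, a time series similarity search algorithm for noisy data is proposed. The SPC method is used to remove noisy data from the sequences; the DTW distance is used as the measure, with normalization applied to bring the sequences to the same resolution; and the LB_Keogh lower-bound function is used to filter the set of candidate sequences. Simulation results show that, when the threshold is small, the algorithm matches sequences containing noisy data well.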
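The filter-and-verify step named in this abstract can be sketched as follows: LB_Keogh gives a cheap lower bound on the DTW distance, so any candidate whose bound already exceeds the threshold is discarded without running DTW. The sketch assumes equal-length series, a squared point cost, and a warping band whose half-width equals the envelope radius; the SPC noise removal and the normalization step are not shown, and all names are illustrative.

```python
import math


def dtw(a, b, r):
    """DTW distance with a warping band of half-width r (squared point cost)."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - r), min(m, i + r) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # step in both
    return math.sqrt(cost[n][m])


def lb_keogh(query, candidate, r):
    """Lower bound on dtw(query, candidate, r) for equal-length series."""
    n = len(query)
    total = 0.0
    for i, c in enumerate(candidate):
        # Envelope of the query within r positions of index i.
        window = query[max(0, i - r):min(n, i + r + 1)]
        upper, lower = max(window), min(window)
        if c > upper:
            total += (c - upper) ** 2
        elif c < lower:
            total += (c - lower) ** 2
    return math.sqrt(total)


def range_search(query, candidates, r, threshold):
    """Names of candidates whose DTW distance to query is within threshold."""
    matches = []
    for name, series in candidates.items():
        if lb_keogh(query, series, r) > threshold:
            continue  # pruned: the true DTW distance cannot be smaller
        if dtw(query, series, r) <= threshold:
            matches.append(name)
    return matches
```

Because the bound never overestimates the banded DTW distance, pruning with it does not lose any true match; it only saves the quadratic DTW computation for candidates that are clearly too far away.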

14.
15.
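To address the many mismatches between the semantic descriptions of words in HowNet (《知网》) and people's subjective perception of vocabulary, and making full use of the rich knowledge available on the web, a method for computing lexical semantic similarity that fuses HowNet with search engines is proposed. First, taking into account the inclusion relationship between a word and its sememes, an improved concept similarity method is used to obtain a preliminary word semantic similarity. Then, a search-engine-based double relevance detection algorithm and pointwise mutual information are used to obtain a further semantic similarity result. Finally, a fitting function is designed and batch gradient descent is used to learn the weight parameters that fuse the similarity results of the first two steps. Experimental results show that, compared with the improved methods based solely on HowNet or solely on search engines, the fused method improves both the Spearman and Pearson coefficients by 5%, and also improves the match between the semantic descriptions of specific words and people's subjective perception of vocabulary, confirming that incorporating web knowledge into concept similarity computation can effectively improve the computation of Chinese lexical semantic similarity.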
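The pointwise mutual information step can be illustrated with search-engine hit counts: if two words co-occur on more pages than independence would predict, their PMI is positive. The sketch below assumes that page counts for each word, for both words together, and for the whole index are available; the function and parameter names are hypothetical, and the double relevance detection algorithm and the learned fusion weights of the abstract are not reproduced.

```python
import math


def pmi(hits_x, hits_y, hits_xy, total_pages):
    """Pointwise mutual information (base 2) from page hit counts.

    PMI = log2( p(x, y) / (p(x) * p(y)) ), with each probability
    estimated as a hit count divided by total_pages.
    """
    if min(hits_x, hits_y, hits_xy) <= 0:
        return float("-inf")  # no evidence of co-occurrence
    return math.log2(hits_xy * total_pages / (hits_x * hits_y))
```

A value of zero means the two words co-occur exactly as often as chance would suggest; a raw PMI score like this would still need to be rescaled before being blended with a similarity in the range 0 to 1.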

16.
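Traditional time series representation methods all reduce the data to some degree, which destroys essential characteristics of time series such as nonlinearity and fractality, and thereby increases the error of time series similarity matching. FSPA, a high-precision representation method for random non-stationary time series, is proposed. The method applies fractal theory and the R/S method to existing time series representation methods, preserving the important nonlinear and fractal characteristics of the time series while also achieving dimensionality reduction. Experiments on both synthetic and real data show that the method achieves higher precision and requires less storage space.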

17.
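Approximate query of activity trajectories is the process of retrieving, from a set of trajectories annotated with keywords, the activity trajectories that are closest to a set of query points and satisfy the keyword requirements of those query points. Because GAT (Grid index for Activity Trajectories) cannot query massive sets of activity trajectories, GAT is extended into GATH (GAT on Hadoop), an approximate query technique suited to massive activity trajectories. Compared with GAT, GATH uses two new index structures for pruning: its grid index performs spatial pruning starting from the bottom-level cells, in keeping with the characteristics of massive data, and its inverted index performs keyword-based pruning. Experimental results confirm that, compared with GAT, GATH effectively shortens index construction time and improves pruning efficiency.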

18.
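Similarity search in graph databases is a very important research topic. Graph similarity matching is a graph isomorphism decision problem and is NP-complete, and traditional high-overhead search methods can no longer meet the needs of complex graph queries; moreover, because of the complexity and particular nature of graph databases, existing optimization algorithms cannot be used directly. To improve search efficiency in graph databases, an index-based similarity search algorithm is proposed. It builds a feature index from the frequent structures in the database, allowing the algorithm to filter out large numbers of dissimilar graphs efficiently and accurately and avoiding exact matching between graphs, that is, graph isomorphism computation. Finally, the algorithm is applied to a chemical database, and the experimental results demonstrate the effectiveness and feasibility of the method.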

19.
Time delay estimation is a general issue in both the signal processing and process control fields. Neither the offline step impulse response-based methods nor the least squares methods used in the control field estimate time delay directly from real running data. Although signal processing methods such as correlation calculation, coherence analysis, and least mean square methods evaluate the delay directly from the signals, they are mainly suitable for two signals that differ only by a time delay and an attenuation factor. In this article, an estimation method is proposed that is based directly on the real running input and output data of a control plant. In many cases, the input and output signals of a plant show rough monotonicity with respect to each other. Exploiting this feature, we estimate the delay by comparing the trends of the two signals. Furthermore, the method is extended to an adaptive method for estimating piecewise time-varying delay by means of a sliding window and a forgetting factor. Experiments on a real plant show the good performance of our methods. Simulation experiments demonstrate that our basic method performs better than CCF or coherence analysis for a nonlinear plant, and that the adaptive one performs better than least mean square methods for signals related by a transfer function other than a pure time delay.

20.
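FastMap, SparseMap, and BoostMap are regarded as embedding methods applicable to any metric space. However, earlier researchers overestimated their applicability: they are not suitable for keyword-based metric spaces. To evaluate their applicability in keyword spaces, they are instantiated in a keyword-based similarity search scenario, and their embedding quality is studied by combining the embedding methods with locality-sensitive hashing. Experimental results on different data sets are reported, focusing on precision, recall, stress, and distance preservation efficiency. The results show that their embedding quality in keyword-based metric spaces is poor, leading to the conclusion that they are not applicable to all metric spaces; the reasons for the poor results are also analyzed.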
