首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
田雪  朱晓杰  申培松  陈驰  邹洪 《软件学报》2016,27(6):1566-1576
随着云计算的广泛应用,数据中心的数据量急速增加,同时,用户文档通常包含隐私敏感信息,需要先加密然后上传到云服务器,面对如此大量的密文数据,现有技术在大数据量的密文数据上的检索效率很低.针对此问题,本文提出在大数据下的基于相似查询树的密文检索方法(MRSE-SS),该方法通过设置聚类中心和成员之间的最大距离对文档向量进行聚类,并把中心向量看成n维超球体的球心,最大距离作为半径,再逐步将小聚类聚合成大聚类.使用该方法构建的密文文档集合,在查询阶段仅需检索查询向量相邻的聚类即可获得理想的查询结果集合,从而提高了密文检索的效率.本文还以《软件学报》期刊最近10年的论文作为样本进行了实验,数据集中选取2900篇文章和4800个关键词,实验结果显示,当文档集个数呈指数增长的时候,检索时间仅呈线性增长,并且检索结果的关联性比传统检索方法更强.  相似文献   

2.
Efficient Similarity Search over Future Stream Time Series   总被引:2,自引:0,他引:2  
With the advance of hardware and communication technologies, stream time series is gaining ever-increasing attention due to its importance in many applications such as financial data processing, network monitoring, Web click-stream analysis, sensor data mining, and anomaly detection. For all of these applications, an efficient and effective similarity search over stream data is essential. Because of the unique characteristics of the stream, for example, data are frequently updated and real-time response is required, the previous approaches proposed for searching through archived data may not work in the stream scenarios. Especially, in the cases where data often arrive periodically for various reasons (for example, the communication congestion or batch processing), queries on such incomplete time series or even future time series may result in inaccuracy using traditional approaches. Therefore, in this paper, we propose three approaches, polynomial, Discrete Fourier Transform (DFT), and probabilistic, to predict the unknown values that have not arrived at the system and answer similarity queries based on the predicted data. We also apply efficient indexes, that is, a multidimensional hash index and a B+-tree, to facilitate the prediction and similarity search on future time series, respectively. Extensive experiments demonstrate the efficiency and effectiveness of our methods for prediction and answering queries.  相似文献   

3.
With the rapid increase in both centralized video archives and distributed WWW video resources, content-based video retrieval is gaining its importance. To support such applications efficiently, content-based video indexing must be addressed. Typically, each video is represented by a sequence of frames. Due to the high dimensionality of frame representation and the large number of frames, video indexing introduces an additional degree of complexity. In this paper, we address the problem of content-based video indexing and propose an efficient solution, called the ordered VA-file (OVA-file) based on the VA-file. OVA-file is a hierarchical structure and has two novel features: 1) partitioning the whole file into slices such that only a small number of slices are accessed and checked during k nearest neighbor (kNN) search and 2) efficient handling of insertions of new vectors into the OVA-file, such that the average distance between the new vectors and those approximations near that position is minimized. To facilitate a search, we present an efficient approximate kNN algorithm named ordered VA-LOW (OVA-LOW) based on the proposed OVA-file. OVA-LOW first chooses possible OVA-slices by ranking the distances between their corresponding centers and the query vector, and then visits all approximations in the selected OVA-slices to work out approximate kNN. The number of possible OVA-slices is controlled by a user-defined parameter delta. By adjusting delta, OVA-LOW provides a trade-off between the query cost and the result quality. Query by video clip consisting of multiple frames is also discussed. Extensive experimental studies using real video data sets were conducted and the results showed that our methods can yield a significant speed-up over an existing VA-file-based method and (distance with high query result quality. Furthermore, by incorporating temporal correlation of video content, our methods achieved much more efficient performance  相似文献   

4.
高效时序相似搜索技术   总被引:6,自引:0,他引:6  
时序相似搜索被认为是将来最有前途的技术之一.然而,时序数据是典型的高维海量数据,如何开发高效算法非常关键.文中概述了时序相似搜索技术的研究现状和进展以及研究的主要内容,讨论了该技术的几个重要应用范例,并对一些典型算法进行了定量分析;然后晕点论述了高效时序相似搜索的关键技术,包括边界过滤、三角不等式修剪、多辨析率检索方法、过滤精炼方案等.最后讨论并分析了时序的近似相似搜索技术.上述所有技术通过对比,其正面和反面都被深入分析.最后指出了存在的问题和未来的研究热点和方向.  相似文献   

5.
陈珂  洪银杰  陈刚 《软件学报》2012,23(6):1588-1601
基于可能世界的不确定集合的相似查询,从语义上或者从计算方法的角度来看,都有别于传统的确定型集合上的技术.由于集合中的项存在不确定性,即一个项出现在集合中是有一定概率的,使得传统处理集合的技术不再适用.提出了一个基于可能世界的集合期望相似度的度量公式.在期望的度量公式中,如果一对集合(X,Y)的期望相似度大于给定的阈值τ∈(0,1),则被称为相似集合对.一般的算法,在基于可能世界的情况下计算不确定集合的期望相似度,其复杂度是指数级的.提出了利用动态规划来计算集合期望相似度的算法,该算法的复杂度是多项式级别,极大地减少了计算时间.实验结果表明了基于该算法查询的可用性和高性能.  相似文献   

6.
戴东波  熊赟  朱扬勇 《软件学报》2010,21(4):718-731
序列数据在文本、Web访问日志文件、生物数据库中普遍存在,对其进行相似性查找是一种重要的获取和分析知识的手段.基于参考集索引技术是一类解决序列相似性查找的有效方法,主要思想是找到序列数据库中的少数序列作为参考集,通过参考集过滤掉数据库中与查询序列不相关的数据,从而高效地回答查询.在现有基于参考集索引技术的基础上,提出一种过滤能力更强的序列相似性查询算法IRI(improved reference indexing).首先,充分利用了先前的查询结果集来加速当前的查询,其次考虑了基于序列特征的上界和下界,使得应用参考集进行过滤的上下界更紧,过滤能力进一步加强.最后,为了避免候选集中费时的编辑距离计算,则只计算前缀序列间的编辑距离,从而进一步加速算法运行.实验采用真实的DNA序列和蛋白质序列数据,结果表明,算法IRI在查询性能上明显优于现有的基于参考集索引方法RI(reference indexing).  相似文献   

7.
一种有效的量化交易数据相似性搜索方法   总被引:7,自引:0,他引:7  
量化交易数据与一般交易数据的不同之处在于它在各个维上的值是数值型而不是二值型的。研究这种数据的有效的相似性搜索方法是一个重要而具有挑战性的课题,提出了一个新的相似性度量函数Hsim(),这个度量函数可以较好地克服Lp等传统的距离函数在高维空间中的缺点,并能将二值型和数值型数据距离的计算整合到一个统一的框架中去。结合量化交易数据的特点,构造了定义在该函数上的相似性索引结构,并对建立在该索引结构上的相似性查询方法进行了阐述。实验表明,这种搜索方法对量化交易数据的相似性搜索有较高的修剪率,能大大地加快搜索的速度。  相似文献   

8.
相似性搜索是从数据库中检索出同给定数据对象相似的数据对象,已有的基于R-tree的相似性搜索,当搜索空间的维的个数较小时效率较高,但当搜索空间的维的个数较大时则效率很低.针对此问题,提出了新的度量空间分割方法和索引结构pgh-tree,利用数据对象与很少几个固定参考对象的距离之差进行数据分割和索引,产生一个平衡的索引树.在此基础上,提出了新的算法,利用查询数据对象与固定参考对象的距离之差过滤掉大部分的不相关数据,具有较小的I/O代价和距离计算复杂性,平均复杂性为θ(n^0.58),是目前复杂性最小的相似性搜索算法.另外还讨论了基于pgh-tree的最近相邻点搜索策略.  相似文献   

9.
利用高维数据空间合理划分,提出一种简单有效的KNN检索算法-LBD。通过聚类将数据划分成多个子集空间,对每个聚类子集内的高维向量,利用距离和位码定义简化表示形式。KNN搜索时,首先利用距离信息确定候选范围,然后利用某些维上的位码不相同信息进一步缩小搜索范围,提高剪枝效率。位码字符串比较时,按照维度贡献优先顺序,大大加快非候选点过滤。LBD利用特殊的B+树组织,降低I/O和距离计算代价。采用模拟数据和真实数据,实验验证了LBD具有更高的检索效率。  相似文献   

10.
This paper presents how to embed AI mechanisms in pervasive spaces to produce more intelligent, adaptive, and convenient environments. The paper also concentrated on computational intelligence in particular, fuzzy systems and neural networks because this emerging AI domain covers most of the requirements for this task.  相似文献   

11.
文章在代数格及其运算的基础上,深入研究格的嵌入,使得建立在格上的范例相似度量,在范例复用所做的属性替换后,不但遵循一致、严密、完备的理论框架,而且能进一步提高相似度量的精度。最后提供了该方法在时间序列原子模式相似度量中的应用。  相似文献   

12.
吴枫  仲妍  吴泉源  贾焰  杨树强 《软件学报》2009,20(10):2867-2884
相似性搜索在股票交易行情、网络安全、传感器网络等众多领域应用广泛.由于这些领域中产生的数据具有无限的、连续的、快速的、实时的特性,所以需要适合数据流上的在线相似性搜索算法.首先,在具有或不具有全局约束条件下,分别提出了没有索引结构的DTW(dynamic time warping)下限函数LB_seg_WFglobalLB_seg_WF,它们是一种分段DTW技术,能够处理数据流上的非等长序列间在线相似性匹配问题.然后,为了进一步提高LB_seg_WFglobalLB_seg_WF的近似程度,提出了一系列的改进方法.最后,针对流上使用LB_seg_WFglobalLB_seg_WF可能会出现连续失效的情况,分别提出了DTW的下限函数LB_WFglobal(具有全局约束条件)和上限函数UB_WF、下限函数LB_WF(不具有全局约束条件).通过增量方式快速估计DTW,极大地减少了估计DTW的冗余计算量.通过理论分析和统计实验,验证了该方法的有效性.  相似文献   

13.
This paper presents a new algorithm for Nonlinear Dimensionality Reduction (NLDR). Our algorithm is developed under the conceptual framework of compatible mapping. Each such mapping is a compound of a tangent space projection and a group of splines. Tangent space projection is estimated at each data point on the manifold, through which the data point itself and its neighbors are represented in tangent space with local coordinates. Splines are then constructed to guarantee that each of the local coordinates can be mapped to its own single global coordinate with respect to the underlying manifold. Thus, the compatibility between local alignments is ensured. In such a work setting, we develop an optimization framework based on reconstruction error analysis, which can yield a global optimum. The proposed algorithm is also extended to embed out of samples via spline interpolation. Experiments on toy data sets and real-world data sets illustrate the validity of our method.  相似文献   

14.
在归纳常见的句子相似度计算方法后,基于《人民日报》3.4万余份文本训练了用于语义相似度计算的词向量模型,并设计了一种融合词向量的多特征句子相似度计算方法。该方法在词方面,考虑了句子中重叠的词数和词的连续性,并运用词向量模型测量了非重叠词间的相似性;在结构方面,考虑了句子中重叠词的语序和两个句子的长度一致性。实验部分设计实现了4种句子相似度计算方法,并开发了相应的实验系统。结果表明:提出的算法能够取得相对较好的实验结果,对句子中词的语义特征和句子结构特征进行组合处理和优化,能够提升句子相似度计算的准确性。  相似文献   

15.
In the past, similarity search for audio data has largely been focused on music. Recent digitization efforts in some of the larger animal sound archives bring other types of audio recordings into the focus of interest. Although recordings in animal sound archives are usually very well annotated by metadata, it is almost impossible to manually annotate all sounds made by animals in each recording. Complementary to classical text-based querying of databases that exploit available annotations, algorithms capable of automatically finding sections of recordings similar to a given query fragment provide a promising approach for content-based navigation. In our work, we present algorithms for feature extraction, as well as indexing and retrieval of animal sound recordings. Making use of a concept from image processing, the structure tensor, our feature extraction algorithm is adapted to the typical curve-like spectral features that are characteristic for many types of animal sounds. We propose a method for similarity search in animal sound databases which is obtained by adding a novel ranking scheme to an existing inverted file based approach for multimedia retrieval. Evaluation of our methods is based on recordings from the Animal Sound Archive, Berlin.  相似文献   

16.
17.
In this paper, we consider how to search for a mobile evader in a large heterogeneous region when sensors are used for detection. Sensors are modeled using probability of detection. Due to environmental effects, this probability will not be constant over the entire region. We map this problem to a graph-search problem, and even though deterministic graph search is NP-complete, we derive a tractable optimal probabilistic search strategy. We do this by defining the problem as a dynamic game played on a Markov chain. We prove that this strategy is optimal in the sense of Nash. Simulations of an example problem illustrate our approach and verify our claims.  相似文献   

18.
Local Search (LS) has proven to be an efficient optimisation technique in clustering applications and in the minimisation of stochastic complexity of a data set. In the present paper, we propose two ways of organising LS in these contexts, the Multi-operator Local Search (MOLS) and the Adaptive Multi-Operator Local Search (AMOLS), and compare their performance to single operator (random swap) LS method and repeated GLA (Generalised Lloyd Algorithm). Both of the proposed methods use several different LS operators to solve the problem. MOLS applies the operators cyclically in the same order, whereas AMOLS adapts itself to favour the operators which manage to improve the result more frequently. We use a large database of binary vectors representing strains of bacteria belonging to the family Enterobacteriaceae and a binary image as our test materials. The new techniques turn out to be very promising in these tests.  相似文献   

19.
A very challenging problem in the genetics domain is to infer haplotypes from genotypes. This process is expected to identify genes affecting health, disease and response to drugs. One of the approaches to haplotype inference aims to minimise the number of different haplotypes used, and is known as haplotype inference by pure parsimony (HIPP). The HIPP problem is computationally difficult, being NP-hard. Recently, a SAT-based method (SHIPs) has been proposed to solve the HIPP problem. This method iteratively considers an increasing number of haplotypes, starting from an initial lower bound. Hence, one important aspect of SHIPs is the lower bounding procedure, which reduces the number of iterations of the basic algorithm, and also indirectly simplifies the resulting SAT model. This paper describes the use of local search to improve existing lower bounding procedures. The new lower bounding procedure is guaranteed to be as tight as the existing procedures. In practice the new procedure is in most cases considerably tighter, allowing significant improvement of performance on challenging problem instances.  相似文献   

20.
Reverse Nearest Neighbor Search in Metric Spaces   总被引:7,自引:0,他引:7  
Given a set {cal D} of objects, a reverse nearest neighbor (RNN) query returns the objects o in {cal D} such that o is closer to a query object q than to any other object in {cal D}, according to a certain similarity metric. The existing RNN solutions are not sufficient because they either 1) rely on precomputed information that is expensive to maintain in the presence of updates or 2) are applicable only when the data consists of "Euclidean objects” and similarity is measured using the L_2 norm. In this paper, we present the first algorithms for efficient RNN search in generic metric spaces. Our techniques require no detailed representations of objects, and can be applied as long as their mutual distances can be computed and the distance metric satisfies the triangle inequality. We confirm the effectiveness of the proposed methods with extensive experiments.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号