首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 46 毫秒
1.
移动对象连续k近邻(CKNN)查询是指给定一个连续移动的对象集合,对于任意一个k近邻查询q,实时计算查询qk近邻并在查询有效时间内对查询结果进行实时更新.现实生活中,交通出行、社交网络、电子商务等领域许多基于位置的应用服务都涉及移动对象连续k近邻查询这一基础问题.已有研究工作解决连续k近邻查询问题时,大多需要通过多次迭代确定一个包含k近邻的查询范围,而每次迭代需要根据移动对象的位置计算当前查询范围内移动对象的数量,整个迭代过程的计算代价占查询代价的很大部分.为此,提出了一种基于网络索引和混合高斯函数移动对象分布密度的双重索引结构(grid GMM index,GGI),并设计了移动对象连续k近邻增量查询算法(incremental search for continuous k nearest neighbors,IS-CKNN).GGI索引结构的底层采用网格索引对海量移动对象进行维护,上层构建混合高斯模型模拟移动对象在二维空间中的分布.对于给定的k近邻查询q,IS-CKNN算法能够基于混合高斯模型直接确定一个包含qk近邻的查询区域,减少了已有算法求解该区域的多次迭代过程;当移动对象和查询q位置发生变化时,进一步提出一种高效的增量查询策略,能够最大限度地利用已有查询结果减少当前查询的计算量.最后,在滴滴成都网约车数据集以及两个模拟数据集上进行大量实验,充分验证了算法的性能.  相似文献   

2.
Various methods and techniques have been proposed in past for improving performance of queries on structured and unstructured data. The paper proposes a parallel B-Tree index in the MapReduce framework for improving efficiency of random reads over the existing approaches. The benefit of using the MapReduce framework is that it encapsulates the complexity of implementing parallelism and fault tolerance from users and presents these in a user friendly way. The proposed index reduces the number of data accesses for range queries and thus improves efficiency. The B-Tree index on MapReduce is implemented in a chained-MapReduce process that reduces intermediate data access time between successive map and reduce functions, and improves efficiency. Finally, five performance metrics have been used to validate the performance of proposed index for range search query in MapReduce, such as, varying cluster size and, size of range search query coverage on execution time, the number of map tasks and size of Input/Output (I/O) data. The effect of varying Hadoop Distributed File System (HDFS) block size and, analysis of the size of heap memory and intermediate data generated during map and reduce functions also shows the superiority of the proposed index. It is observed through experimental results that the parallel B-Tree index along with a chained-MapReduce environment performs better than default non-indexed dataset of the Hadoop and B-Tree like Global Index (Zhao et al., 2012) in MapReduce.  相似文献   

3.
Skyline queries are extensively incorporated in various real-life applications by filtering uninteresting data objects. Sometimes, a skyline query may return so many results because it cannot control the retrieval conditions especially for highdimensional datasets. As an extension of skyline query, the kdominant skyline query reduces the control of the dimension by controlling the value of the parameter kto achieve the purpose of reducing the retrieval objects. In addition, with the continuous promotion of Bigdata applications, the data we acquired may not have the entire content that people wanted for some practically reasons of delivery failure, no power of battery, accidental loss, so that the data might be incomplete with missing values in some attributes. Obviously, the k-dominant skyline query algorithms of incomplete data depend on the user definition in some degree and the results cannot be shared. Meanwhile, the existing algorithms are unsuitable for directly used to the incomplete big data. Based on the above situations, this paper mainly studies k-dominant skyline query problem over incomplete dataset and combines this problem with the distributed structure like MapReduce environment. First, we propose an index structure over incomplete data, named incomplete data index based on dominate hierarchical tree (ID-DHT). Applying the bucket strategy, the incomplete data is divided into different buckets according to the dimensions of missing attributes. Second, we also put forward query algorithm for incomplete data in MapReduce environment, named MapReduce incomplete data based on dominant hierarchical tree algorithm (MR-ID-DHTA). The data in the bucket is allocated to the subspace according to the dominant condition by Map function. Reduce function controls the data according to the key value and returns the k-dominant skyline query result. The effective experiments demonstrate the validity and usability of our index structure and the algorithm.  相似文献   

4.
徐超 《计算机应用研究》2023,40(1):198-203+208
信息化社会带来了数据量的快速增长,但也导致数据的完整性和价值密度越来越低,如何从海量不完整数据中快速找到有价值的信息用于用户个性化推荐,已经成为研究的热点。结合上界值剪枝算法,提出一种用户敏感top-k dominating(TKD)查询方法(USTD)。该方法首先结合用户兴趣度,重新定义了不完整数据集上对象的支配关系及其度量方法,计算对象的权重支配分数;然后以重新定义的支配关系,证明了依据上界值可剪枝的条件;最后将上界值剪枝条件融于USTD的查询过程,以从数据集中快速找到前k个满足用户兴趣的数据。实验结果表明,USTD算法在查询速度上有一定提升,在查询结果的评分上具有显著优势。  相似文献   

5.
Multi-dimensional top-k dominating queries   总被引:1,自引:0,他引:1  
The top-k dominating query returns k data objects which dominate the highest number of objects in a dataset. This query is an important tool for decision support since it provides data analysts an intuitive way for finding significant objects. In addition, it combines the advantages of top-k and skyline queries without sharing their disadvantages: (i) the output size can be controlled, (ii) no ranking functions need to be specified by users, and (iii) the result is independent of the scales at different dimensions. Despite their importance, top-k dominating queries have not received adequate attention from the research community. This paper is an extensive study on the evaluation of top-k dominating queries. First, we propose a set of algorithms that apply on indexed multi-dimensional data. Second, we investigate query evaluation on data that are not indexed. Finally, we study a relaxed variant of the query which considers dominance in dimensional subspaces. Experiments using synthetic and real datasets demonstrate that our algorithms significantly outperform a previous skyline-based approach. We also illustrate the applicability of this multi-dimensional analysis query by studying the meaningfulness of its results on real data.  相似文献   

6.
A data warehouse can store very large amounts of data that should be processed in parallel in order to achieve reasonable query execution times. The MapReduce programming model is a very convenient way to process large amounts of data in parallel on commodity hardware clusters. A very popular query used in data warehouses is star‐join. In this paper, we present a fast and efficient star‐join query execution algorithm built on top of a MapReduce framework called Hadoop. By using dynamic filters against dimension tables, the algorithm needs a single scan of the fact table, which means a significant reduction of input/output operations and computational complexity. Also, the algorithm requires only two MapReduce iterations in total–one to build the filters against dimension tables and one to scan the fact table. Our experiments show that the proposed algorithm performs much better than the existing solutions in terms of execution time and input/output. Copyright © 2014 John Wiley & Sons, Ltd.  相似文献   

7.
移动对象的连续范围查询是许多基于位置的服务的核心问题。针对该问题,提出一种面向大规模移动对象并发范围查询的分布式搜索方法。首先,设计了一种由全局网格索引(GGI)和局部弹性四叉树构成的移动对象分布式动态索引(DDI)结构。其次,提出了一种基于DDI结构的分布式查询算法(DSA),该算法首先引入了一种在移动对象和查询点的位置连续变化的情况下的查询结果增量更新策略;然后,在增量更新过程中引入一种面向多并发查询的共享计算优化策略,该策略能够根据已有计算结果对移动对象范围查询结果进行增量搜索。最后,基于德国路网模拟了3个具有不同空间分布的移动对象数据集,将DSA与NS(Naive Search)、GI(Grid Index)和分布式混合索引(DHI)进行对比。实验结果表明,与性能最好的对比算法DHI相比,DSA的初始查询时间减少了22.7%,增量查询时间减少了15.2%,性能优于对比算法。  相似文献   

8.
雷斌  许嘉  谷峪  于戈 《软件学报》2013,24(S2):188-199
以无线传感器网络为代表的新型数据应用和以图像处理为基础的传统数据应用都产生了大规模的概率数据.在概率数据的管理中,Top-k相似性连接操作返回最相似的k 对概率数据,具有重要应用价值.直方图是最常用的概率数据模型之一,而EMD(Earth Mover’s Distance)距离因其较强的鲁棒性可更准确地量化直方图概率数据之间的相似性.然而EMD距离的计算却具有三次方的时间复杂度,给基于EMD距离的Top-k 相似性连接带来巨大挑战.基于流行的MapReduce并行处理框架,利用EMD距离对偶线性规划问题的优良特性,提出了两种大规模概率数据上基于EMD距离的Top-k相似性连接算法.首先提出基于块嵌套循环连接思想的基本解决方法,命名为Top-k BNLJ算法.进而改进数据划分策略,提出基于数据局部性进行数据划分的Top-k DLPJ 算法,有效降低了MapReduce作业执行过程中的数据传输量.使用大规模真实数据集对两种算法进行评估,证实了本文提出的Top-k DLPJ算法的高效性和处理大规模数据集时的良好扩展性.  相似文献   

9.
刘义  景宁  陈荦  熊伟 《软件学报》2013,24(8):1836-1851
针对大规模空间数据的高性能k-近邻连接查询处理,研究了MapReduce框架下基于R-树索引的k-近邻连接查询处理。首先利用无依赖并行和串行同步计算的形式化定义抽象了MapReduce并行编程模型,基于此并行计算模型抽象,分别提出了 R-树索引快速构建算法和基于 R-树的并行 k-近邻连接算法。在索引构建过程中,提出一种采样算法以快速确立空间划分函数,使得索引构建符合无依赖并行和串行同步计算抽象,在MapReduce框架下非常容易进行表达。在k-近邻连接查询过程中,基于构建的分布式R-树索引,引入k-近邻扩展框限定查询范围并进行数据划分,然后利用 R-树索引进行 k-近邻连接查询,提高了查询效率。从理论上分析了所提出算法的通信和计算代价。实验与分析结果表明,该算法在真实数据集的查询上具有良好的效率和可扩展性能,可以很好地支持大规模空间数据的k-近邻连接查询处理,具有良好的实用价值。  相似文献   

10.
李晨  申德荣  朱命冬  寇月  聂铁铮  于戈 《软件学报》2016,27(9):2278-2289
互联网上每天都会产生大量的带地理位置标签和时间标签的信息,比如微博、新闻、团购等等,如何在众多的信息中找到在时间和空间地理位置上都满足用户查询需求的信息十分重要.针对这一需求,提出了一种对地理位置和时间信息的k近邻查询(ST-kNN查询)处理方法.首先,利用时空相似度对数据对象的地理位置变量和时间变量进行映射变换,将数据对象映射到新的三维空间中,用三维空间中两点之间的距离相似度来近似代替两个对象之间实际的时空相似度;然后,针对这个三维空间设计了一种ST-Rtree(spatial temporal rtree)索引,该索引综合了空间因素和时间因素,保证在查询时每个对象至多遍历1次;最后,在该索引的基础上提出了一种精确的k近邻查询算法,并通过一次计算确定查询结果范围,从而找到前k个结果,保证了查询的高效性.基于大量数据集的实验,证明了该查询处理方法的高效性.  相似文献   

11.
k代表轮廓查询是从传统轮廓查询中衍生出来的一类查询.给定多维数据集合D,轮廓查询从D中找到所有不被其他对象支配的对象,将其返回给用户,便于用户结合自身偏好选择高质量对象.然而,轮廓对象规模通常较大,用户需要从大量数据中进行选择,导致选择速度和质量无法得到保证.与传统轮廓查询相比,k代表轮廓查询从所有轮廓对象中选择“代表性”最强的k个对象返回给用户,有效地解决了传统轮廓查询存在的这一问题.给定滑动窗口W和连续查询q,q监听窗口中的数据.当窗口滑动时,查询q返回窗口中,组合支配面积最大的k个对象.现有算法的核心思想是:实时监测当前窗口中的轮廓对象集合,当轮廓对象集合更新时,算法更新k代表轮廓.然而,实时监测窗口中,轮廓集合的计算代价通常较大.此外,当轮廓集合规模较大时,从中选择k代表轮廓的计算代价是同样巨大的,导致已有算法无法在高速流环境下使用.针对上述问题,提出了ρ-近似k代表轮廓查询.为了支持该查询,提出了查询处理框架PAKRS(predict-basedapproximatekrepresentativeskyline).首先,PAKRS利用高速流的特性对当前窗口进行划分,根据划分结...  相似文献   

12.
基于MapReduce的分布式近邻传播聚类算法   总被引:2,自引:0,他引:2  
随着信息技术迅速发展,数据规模急剧增长,大规模数据处理非常具有挑战性.许多并行算法已被提出,如基于MapReduce的分布式K平均聚类算法、分布式谱聚类算法等.近邻传播(affinity propagation,AP)聚类能克服K平均聚类算法的局限性,但是处理海量数据性能不高.为有效实现海量数据聚类,提出基于MapReduce的分布式近邻传播聚类算法——DisAP.该算法先将数据点随机划分为规模相近的子集,并行地用AP聚类算法稀疏化各子集,然后融合各子集稀疏化后的数据再次进行AP聚类,由此产生的聚类代表作为所有数据点的聚类中心.在人工合成数据、人脸图像数据、IRIS数据以及大规模数据集上的实验表明:DisAP算法对数据规模有很好的适应性,在保持AP聚类效果的同时可有效缩减聚类时间.  相似文献   

13.
可伸缩的增量连续k近邻查询处理   总被引:7,自引:0,他引:7  
廖巍  熊伟  王钧  景宁  钟志农 《软件学报》2007,18(2):268-278
针对基于TPR树(time-parameterized R-tree)索引的大量并发CKNN(continuous k-nearest neighbor)查询处理,提出了一种可伸缩的增量连续k近邻查询处理(scalable processing of incremental continuous k-nearest neighbor queries,简称SI-CNN)框架,通过引入搜索区域进行预裁剪以减少查询更新所需要的TPR树节点访问代价,并引入了增量结果表以保存候选对象,批量地更新查询结果集,具有良好的可伸缩性.基于SI-CNN框架提出了一种增量更新的SI-CNN查询处理算法,能够基于上次查询结果增量的更新查询,支持查询集合中加入或删除查询和移动对象数据集的插入、删除等动态更新操作.实验结果与分析表明,基于SI-CNN框架的SI-CNN算法可以很好地支持大量并发的CKNN查询处理,具有良好的实用价值.  相似文献   

14.
为了提高数据挖掘中异常检测算法在数据量增大时的准确度、灵敏度和执行效率,本文提出了一种基于MapReduce框架和Local Outlier Factor (LOF)算法的并行异常检测算法(MR-DLOF)。首先,将存放在Hadoop分布式文件系统(HDFS)上的数据集逻辑地切分为多个数据块。然后,利用MapReduce原理将各个数据块中的数据并行处理,使得每个数据点的k-邻近距离和LOF值的计算仅在单个块中执行,从而提高了算法的执行效率;同时重新定义了k-邻近距离的概念,避免了数据集中存在大于或等于k个重复点而导致局部密度为无穷大的情况。最后,将LOF值较大的数据点合并重新计算其LOF值,从而提高算法准确度和灵敏度。通过真实数据集验证了MR-DLOF算法的有效性、高效性和可扩展性。  相似文献   

15.
利用MapReduce模型可自动编写串行程序及编程接口简单的优点,实现量子进化算法在MapReduce模型下的并行化,提出基于MapReduce模型的并行量子进化算法MRQEA,并将其部署到Hadoop云计算平台上运行。对0-1背包问题的测试结果证明,MRQEA算法在处理大型数据集时具有良好的加速比和并行效率。  相似文献   

16.
Distance-based outliers: algorithms and applications   总被引:20,自引:0,他引:20  
This paper deals with finding outliers (exceptions) in large, multidimensional datasets. The identification of outliers can lead to the discovery of truly unexpected knowledge in areas such as electronic commerce, credit card fraud, and even the analysis of performance statistics of professional athletes. Existing methods that we have seen for finding outliers can only deal efficiently with two dimensions/attributes of a dataset. In this paper, we study the notion of DB (distance-based) outliers. Specifically, we show that (i) outlier detection can be done efficiently for large datasets, and for k-dimensional datasets with large values of k (e.g., ); and (ii), outlier detection is a meaningful and important knowledge discovery task. First, we present two simple algorithms, both having a complexity of , k being the dimensionality and N being the number of objects in the dataset. These algorithms readily support datasets with many more than two attributes. Second, we present an optimized cell-based algorithm that has a complexity that is linear with respect to N, but exponential with respect to k. We provide experimental results indicating that this algorithm significantly outperforms the two simple algorithms for . Third, for datasets that are mainly disk-resident, we present another version of the cell-based algorithm that guarantees at most three passes over a dataset. Again, experimental results show that this algorithm is by far the best for . Finally, we discuss our work on three real-life applications, including one on spatio-temporal data (e.g., a video surveillance application), in order to confirm the relevance and broad applicability of DB outliers. Received February 15, 1999 / Accepted August 1, 1999  相似文献   

17.
Uncertain data are common due to the increasing usage of sensors, radio frequency identification(RFID), GPS and similar devices for data collection. The causes of uncertainty include limitations of measurements, inclusion of noise, inconsistent supply voltage and delay or loss of data in transfer. In order to manage, query or mine such data, data uncertainty needs to be considered. Hence,this paper studies the problem of top-k distance-based outlier detection from uncertain data objects. In this work, an uncertain object is modelled by a probability density function of a Gaussian distribution. The naive approach of distance-based outlier detection makes use of nested loop. This approach is very costly due to the expensive distance function between two uncertain objects. Therefore,a populated-cells list(PC-list) approach of outlier detection is proposed. Using the PC-list, the proposed top-k outlier detection algorithm needs to consider only a fraction of dataset objects and hence quickly identifies candidate objects for top-k outliers. Two approximate top-k outlier detection algorithms are presented to further increase the efficiency of the top-k outlier detection algorithm.An extensive empirical study on synthetic and real datasets is also presented to prove the accuracy, efficiency and scalability of the proposed algorithms.  相似文献   

18.
An important property of today’s big data processing is that the same computation is often repeated on datasets evolving over time, such as web and social network data. While repeating full computation of the entire datasets is feasible with distributed computing frameworks such as Hadoop, it is obviously inefficient and wastes resources. In this paper, we present HadUP (Hadoop with Update Processing), a modified Hadoop architecture tailored to large-scale incremental processing with conventional MapReduce algorithms. Several approaches have been proposed to achieve a similar goal using task-level memoization. However, task-level memoization detects the change of datasets at a coarse-grained level, which often makes such approaches ineffective. Instead, HadUP detects and computes the change of datasets at a fine-grained level using a deduplication-based snapshot differential algorithm (D-SD) and update propagation. As a result, it provides high performance, especially in an environment where task-level memoization has no benefit. HadUP requires only a small amount of extra programming cost because it can reuse the code for the map and reduce functions of Hadoop. Therefore, the development of HadUP applications is quite easy.  相似文献   

19.
基于Hadoop平台协同过滤推荐算法   总被引:1,自引:1,他引:0  
针对协同过滤推荐算法在数据稀疏性及在大数据规模下系统可扩展性的两个问题, 在分析研究Hadoop分布式平台与协同过滤推荐算法后, 提出了一种基于Hadoop平台实现协同过滤推荐算法的优化方案. 实验证明, 在Hadoop平台上通过MapReduce结合Hbase数据库实现算法, 能够有效地提高协同过滤推荐算法在大数据规模下的执行效率, 从而能够进一步地搭建低成本高性能、动态扩展的分布式推荐引擎.  相似文献   

20.
蒋涛  张彬  余法红  柳晴  周傲英 《软件学报》2015,26(9):2297-2310
不同于传统的k-Skyband 查询方法,提出一种相互k-Skyband 查询(MkSB),它从对称角度执行Skyline查询,找出所有既在q的动态k-Skyband(DkSB)中又在q的反向k-Skyband(RkSB)中的数据对象.进一步地,为了更好地支持用户决策和数据分析,排序操作被引入到MkSB算法中.因为MkSB 需要执行q的DkSB 和反向RkSB,故它需要遍历索引多次,从而导致了大量冗余的I/O 开销.利用信息重用技术和若干有效的修剪方法,MkSB 将多次的索引搜索合并成单次,极大地降低了I/O访问次数.同时,证明了基于窗口查询的MkSB(WMkSB)算法具有最低的I/O 代价.在真实与合成数据集上的实验结果表明,所提出的算法是有效的且明显胜过基于BBS 的算法,尤其WMkSB 算法具有极少的I/O 开销,通常能够减少95%以上的冗余I/O.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号