Similar Documents
20 similar documents found.
1.
Inexact subgraph matching based on type-isomorphism was introduced by Berry et al. [J. Berry, B. Hendrickson, S. Kahan, P. Konecny, Software and algorithms for graph queries on multithreaded architectures, in: Proc. IEEE International Parallel and Distributed Computing Symposium, IEEE, 2007, pp. 1–14] as a generalization of the exact subgraph matching problem. Enumerating small subgraph patterns in very large graphs is a core problem in the analysis of social networks, bioinformatics data sets, and other applications. This paper describes a MapReduce algorithm for subgraph type-isomorphism matching. The MapReduce computing framework is designed for distributed computing on massive data sets, and the new algorithm leverages MapReduce techniques to enable processing of graphs with billions of vertices. The paper also introduces a new class of walk-level constraints for narrowing the set of matches. Constraints meeting criteria defined in the paper are useful for specifying more precise patterns and for improving algorithm performance. Results are provided on a variety of graphs, with sizes ranging up to billions of vertices and edges, including graphs that follow a power law degree distribution.

2.
Clustering is a useful data mining technique which groups data points such that the points within a single group have similar characteristics, while the points in different groups are dissimilar. Density-based clustering algorithms such as DBSCAN and OPTICS are one kind of widely used clustering algorithms. As there is an increasing trend of applications dealing with vast amounts of data, clustering such big data is a challenging problem. Recently, parallelizing clustering algorithms on a large cluster of commodity machines using the MapReduce framework has received a lot of attention. In this paper, we first propose a new density-based clustering algorithm, called DBCURE, which is robust in finding clusters with varying densities and suitable for parallelization with MapReduce. We next develop DBCURE-MR, a parallelized version of DBCURE using MapReduce. While traditional density-based algorithms find each cluster one by one, our DBCURE-MR finds several clusters together in parallel. We prove that both DBCURE and DBCURE-MR find the clusters correctly based on the definition of density-based clusters. Our experimental results with various data sets confirm that DBCURE-MR finds clusters efficiently without being sensitive to clusters with varying densities and scales up well with the MapReduce framework.
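
A minimal sketch of the density-based notion the abstract builds on (plain DBSCAN-style definitions in local Python, not DBCURE itself, whose neighborhood definition differs): a point is a core point if its eps-neighborhood holds at least min_pts points, and clusters grow by connecting core points.

    def eps_neighborhood(points, idx, eps):
        """Indices of all points within distance eps of points[idx] (1-D for brevity)."""
        return [j for j, q in enumerate(points) if abs(points[idx] - q) <= eps]

    def dbscan(points, eps, min_pts):
        """Tiny DBSCAN: label core-point neighborhoods, -1 marks noise."""
        labels = [-1] * len(points)
        cluster = 0
        for i in range(len(points)):
            if labels[i] != -1:
                continue
            seeds = eps_neighborhood(points, i, eps)
            if len(seeds) < min_pts:           # not a core point
                continue
            while seeds:
                j = seeds.pop()
                if labels[j] == -1:
                    labels[j] = cluster
                    nbrs = eps_neighborhood(points, j, eps)
                    if len(nbrs) >= min_pts:   # expand only through core points
                        seeds.extend(nbrs)
            cluster += 1
        return labels

    data = [0.0, 0.2, 0.3, 5.0, 5.1, 9.9]
    print(dbscan(data, eps=0.5, min_pts=2))    # [0, 0, 0, 1, 1, -1]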

3.
With advances in data acquisition and storage technology, fields such as social networks, bioinformatics, and traffic navigation are producing large graph data that is massive in scale, complex in internal structure, and diverse in query requirements. Traditional single-machine, in-memory graph processing methods cannot meet the demands of managing such large graphs. The development of scalable computing platforms offers a feasible technical solution for large graph data management. This paper first analyzes the different types of queries over large graph data, then focuses on large graph data management approaches based on relational databases, the MapReduce computing framework, the BSP (Bulk Synchronous Parallel) computing model, and third-party outsourced servers, and finally discusses possible future research directions.

4.
With the increasing size and complexity of available databases, existing machine learning and data mining algorithms are facing a scalability challenge. In many applications, the number of features describing the data can be extremely high, which hinders further exploration or can even make it infeasible. In fact, many of these features are redundant or simply irrelevant. Hence, feature selection plays a key role in helping to overcome the problem of information overload, especially in big data applications. Since many complex datasets can be modeled by graphs of interconnected labeled elements, in this work we are particularly interested in feature selection for subgraph patterns. In this paper, we propose MR-SimLab, a MapReduce-based approach for subgraph selection from large input subgraph sets. In many applications, it is easy to compute pairwise similarities between the labels of graph nodes. Our approach leverages such rich information to compute an approximate subgraph matching score by aggregating the elementary label similarities between the matched nodes. Based on the aggregated similarity scores, our approach selects a small subset of informative representative subgraphs. We provide a distributed implementation of our algorithm on top of the MapReduce framework that optimizes the computational efficiency of our approach for big data applications. We experimentally evaluate MR-SimLab on real datasets. The obtained results show that our approach is scalable and that the selected subgraphs are informative.
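
A minimal sketch of the label-similarity aggregation described above (the similarity table and all names are illustrative assumptions, not MR-SimLab's API): a candidate match is scored by averaging the pairwise label similarities of the matched nodes.

    LABEL_SIM = {            # hypothetical pairwise label similarities
        ("kinase", "kinase"): 1.0,
        ("kinase", "phosphatase"): 0.6,
        ("receptor", "receptor"): 1.0,
    }

    def label_sim(a, b):
        """Symmetric lookup with a default of 0 for unrelated labels."""
        return LABEL_SIM.get((a, b), LABEL_SIM.get((b, a), 1.0 if a == b else 0.0))

    def match_score(mapping, pattern_labels, graph_labels):
        """Average label similarity over the node pairs of a candidate match."""
        sims = [label_sim(pattern_labels[p], graph_labels[g]) for p, g in mapping.items()]
        return sum(sims) / len(sims) if sims else 0.0

    # Example: pattern nodes {0, 1} matched to graph nodes {"u", "v"}.
    print(match_score({0: "u", 1: "v"},
                      {0: "kinase", 1: "receptor"},
                      {"u": "phosphatase", "v": "receptor"}))   # -> 0.8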

5.
MapReduce in MPI for Large-scale graph algorithms

6.

Recent trends in big data have shown that the amount of data continues to increase at an exponential rate. This trend has inspired many researchers over the past few years to explore new research directions related to multiple areas of big data. The widespread popularity of big data processing platforms that use the MapReduce framework has created a growing demand to further optimize their performance for various purposes. In particular, enhancing resource and job scheduling is becoming critical, since scheduling fundamentally determines whether applications can achieve their performance goals in different use cases. Scheduling plays an important role in big data, mainly in reducing the execution time and cost of processing. This paper surveys the research undertaken in the field of scheduling in big data platforms. Moreover, it analyzes scheduling in MapReduce along two aspects: taxonomy and performance evaluation. The research progress in MapReduce scheduling algorithms is also discussed. The limitations of existing MapReduce scheduling algorithms and the future research opportunities they leave open are pointed out for easy identification by researchers. Our study can serve as a benchmark for expert researchers proposing novel MapReduce scheduling algorithms, while for novice researchers it can be used as a starting point.


7.
The Skyline query is a typical multi-objective optimization query with wide applications in multi-objective optimization, data mining, and related fields. Most existing Skyline query processing algorithms assume that the data set resides on a single database server, and the algorithms are usually designed as serial algorithms for that single server. With the rapid growth of data volumes, especially in the big data setting, traditional single-machine serial Skyline algorithms can no longer meet users' needs. Based on the popular distributed parallel programming framework MapReduce, this paper studies parallel Skyline query algorithms for large data sets. Considering the factors that affect MapReduce computation, the existing angle-based partitioning strategy is improved into a Balanced Angular partitioning strategy; at the same time, to reduce the computation in the Reduce phase, a strategy of pre-filtering the data on the Map side is proposed. Experimental results show that the proposed Skyline query algorithm significantly improves system performance.
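
A minimal sketch of the Map-side pre-filtering idea (assumed names and a local simulation, not the paper's code): each mapper keeps only its local skyline, so dominated points never reach the Reduce phase, and a reducer merges the local skylines.

    def dominates(p, q):
        """p dominates q if p is <= q on all dimensions and < on at least one (minimization)."""
        return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

    def local_skyline(points):
        """Keep only non-dominated points; quadratic, fine for a per-mapper partition."""
        result = []
        for p in points:
            if any(dominates(q, p) for q in result):
                continue
            result = [q for q in result if not dominates(p, q)] + [p]
        return result

    def map_phase(partition):
        # Each mapper emits only its local skyline (the pre-filtering step).
        return local_skyline(partition)

    def reduce_phase(local_skylines):
        # A reducer merges the candidates and computes the global skyline.
        merged = [p for sk in local_skylines for p in sk]
        return local_skyline(merged)

    partitions = [[(1, 5), (2, 6), (4, 4)], [(3, 3), (5, 1), (6, 2)]]
    print(reduce_phase([map_phase(part) for part in partitions]))   # [(1, 5), (3, 3), (5, 1)]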

8.
9.
MapReduce是一个能够对大规模数据进行分布式处理的框架,目前被各个领域广泛应用。在提供MapReduce服务的集群中,如何保证不同优先级用户的截止时间限定是MapReduce作业调度问题的一个挑战。针对这一问题,提出了一个基于排队网络的多优先级作业调度算法(MPSA)。首先分析和归纳了基于MapReduce模型的算法,提出了三种常见模式,采用Jackson排队网络对基于MapReduce模型的算法建立了数学模型,应用该网络模型可以求出不同优先级队列对资源的需求;随后使用AR(1)模型进行预测,使算法可以动态地适应不同的用户访问量;利用二分查找算法,分步计算出不同优先级在map阶段和reduce阶段分配的槽位数;最后实现了在MapReduce模型中应用的实时调度算法。实验结果表明,与传统的FIFO和公平调度算法相比,本文提出的算法在用户到达率和任务规模变化的情况下,可以更加有效地满足不同优先级用户的截止时间限定。  相似文献   
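
The binary-search step can be illustrated with a deliberately simplified model (an assumption for illustration only, not the paper's Jackson-network formulation): treat c slots as c independent M/M/1 servers that split the arrival rate evenly, and find the smallest c whose mean response time meets the deadline.

    import math

    def mean_response_time(arrival_rate, service_rate, slots):
        """Mean response time when `slots` M/M/1 servers evenly split the arrivals.
        Returns infinity if the system would be unstable."""
        per_slot = arrival_rate / slots
        if per_slot >= service_rate:
            return math.inf
        return 1.0 / (service_rate - per_slot)

    def min_slots_for_deadline(arrival_rate, service_rate, deadline, max_slots=10_000):
        """Binary search for the smallest slot count meeting the deadline
        (response time decreases monotonically in the slot count)."""
        lo, hi = 1, max_slots
        if mean_response_time(arrival_rate, service_rate, hi) > deadline:
            raise ValueError("deadline unreachable even with max_slots")
        while lo < hi:
            mid = (lo + hi) // 2
            if mean_response_time(arrival_rate, service_rate, mid) <= deadline:
                hi = mid
            else:
                lo = mid + 1
        return lo

    # Example: 40 map tasks/s arriving, each slot finishes 5 tasks/s, 0.5 s deadline.
    print(min_slots_for_deadline(arrival_rate=40.0, service_rate=5.0, deadline=0.5))   # 14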

10.
To address the low computational efficiency and poor scalability of existing feature selection algorithms for Internet of Things big data, a system architecture based on an improved Artificial Bee Colony (ABC) algorithm for feature selection is proposed. The architecture consists of four layers and can efficiently aggregate useful data while discarding unneeded data. The whole system is built on the Hadoop platform, MapReduce, and the improved ABC algorithm: the improved ABC algorithm selects the features, while MapReduce provides the parallel algorithms that efficiently process large data sets. The system is implemented with MapReduce tools and uses particle filtering to remove noise. The proposed algorithm is compared with similar methods and evaluated in terms of efficiency, accuracy, and throughput on ten different data sets. The results show that, compared with several other recent algorithms, the proposed algorithm is more scalable and efficient for feature selection.

11.
王飞  秦小麟  刘亮  沈尧 《计算机科学》2015,42(5):204-210
The k-nearest-neighbor (kNN) join is a common operation in spatial databases, and processing it involves two complex operations: the join and the nearest-neighbor query. Traditional centralized kNN join algorithms can no longer cope with today's explosively growing data volumes, so designing distributed kNN join algorithms has become an urgent problem. Existing distributed kNN join algorithms all consist of multiple serial rounds of MapReduce jobs, and every MapReduce job has to read from and write to the distributed file system; MapReduce therefore cannot effectively express the dependencies among multiple jobs, which makes these algorithms inefficient. This paper first proposes a dataflow-based computing framework built on top of MapReduce that models the data processing procedure as a dataflow graph. On this framework, an efficient kNN join algorithm is proposed: it uses a space-filling curve to map multi-dimensional data to one dimension, turning the kNN join into one-dimensional range queries. Experimental results show that the algorithm scales well and is more efficient than existing algorithms.
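
A minimal sketch of the space-filling-curve mapping (a Z-order/Morton curve is assumed here; the paper does not necessarily use this exact curve): interleave the bits of the coordinates so that points close in 2-D tend to be close on the resulting 1-D key, and a kNN probe becomes a range scan around the query key.

    import bisect

    def z_order_key(x, y, bits=16):
        """Interleave the bits of two non-negative integer coordinates (Morton code)."""
        key = 0
        for i in range(bits):
            key |= ((x >> i) & 1) << (2 * i)       # even bit positions take x
            key |= ((y >> i) & 1) << (2 * i + 1)   # odd bit positions take y
        return key

    def knn_via_zorder(points, query, k, window=4):
        """Collect candidates from a 1-D range around the query's Z-order key,
        then rank them by true distance. A real algorithm would grow the window
        until correctness is guaranteed; this only illustrates the mapping idea."""
        keyed = sorted((z_order_key(x, y), (x, y)) for x, y in points)
        pos = bisect.bisect_left([key for key, _ in keyed], z_order_key(*query))
        lo, hi = max(0, pos - window), min(len(keyed), pos + window)
        candidates = [p for _, p in keyed[lo:hi]]
        candidates.sort(key=lambda p: (p[0] - query[0]) ** 2 + (p[1] - query[1]) ** 2)
        return candidates[:k]

    pts = [(1, 1), (2, 3), (7, 7), (3, 2), (8, 1), (2, 2)]
    print(knn_via_zorder(pts, query=(2, 2), k=3))   # [(2, 2), (3, 2), (2, 3)]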

12.
An efficient algorithm for discovering frequent subgraphs
Over the years, frequent itemset discovery algorithms have been used to find interesting patterns in various application areas. However, as data mining techniques are being increasingly applied to nontraditional domains, existing frequent pattern discovery approaches cannot be used. This is because the transaction framework that is assumed by these algorithms cannot be used to effectively model the data sets in these domains. An alternate way of modeling the objects in these data sets is to represent them using graphs. Within that model, one way of formulating the frequent pattern discovery problem is that of discovering subgraphs that occur frequently over the entire set of graphs. We present a computationally efficient algorithm, called FSG, for finding all frequent subgraphs in large graph data sets. We experimentally evaluate the performance of FSG using a variety of real and synthetic data sets. Our results show that despite the underlying complexity associated with frequent subgraph discovery, FSG is effective in finding all frequently occurring subgraphs in data sets containing more than 200,000 graph transactions and scales linearly with respect to the size of the data set.

13.
High-Utility Itemset Mining (HUIM) has been considered a major issue in recent decades since it reveals profit strategies for use in industry decision-making. Most existing works have focused on mining high-utility itemsets from databases, yielding a large number of patterns; however, exact decisions are still challenging to make from such large amounts of discovered knowledge. Closed High-Utility Itemset Mining (CHUIM) provides a smart way to present concise high-utility itemsets that can be more effective for making correct decisions. However, none of the existing works have focused on handling large-scale databases so as to integrate the knowledge discovered from several distributed databases. In this paper, we first present a large-scale information fusion architecture to integrate closed high-utility patterns discovered from several distributed databases. A generic composite model is used to cluster transactions by their relevant correlations, which ensures the correctness and completeness of the fusion model. The well-known MapReduce framework is then deployed in the developed DFM-Miner algorithm to handle big datasets for information fusion and integration. Experiments compare the approach with the state-of-the-art CHUI-Miner and CLS-Miner algorithms for mining closed high-utility patterns, and the results indicate that the designed model handles large-scale databases well with lower memory usage. Moreover, the designed MapReduce framework speeds up the mining of closed high-utility patterns in the developed fusion system.
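
A minimal sketch of the utility notion behind HUIM (illustrative data and names, not from the paper): the utility of an itemset is the sum, over transactions containing it, of quantity times unit profit, and "high-utility" means this sum reaches a user-given threshold.

    # Transactions map item -> purchased quantity; PROFIT gives unit profit per item.
    PROFIT = {"a": 5, "b": 2, "c": 1}
    TRANSACTIONS = [
        {"a": 1, "b": 4},
        {"a": 2, "c": 3},
        {"b": 1, "c": 6},
    ]

    def utility(itemset, transactions=TRANSACTIONS, profit=PROFIT):
        """Sum of quantity * unit profit over transactions that contain the whole itemset."""
        total = 0
        for t in transactions:
            if all(item in t for item in itemset):
                total += sum(t[item] * profit[item] for item in itemset)
        return total

    print(utility({"a"}))        # 5*1 + 5*2 = 15
    print(utility({"a", "b"}))   # only the first transaction: 5*1 + 2*4 = 13
    # An itemset is "high-utility" if utility(itemset) >= a user-given minimum utility.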

14.
iMapReduce: A Distributed Computing Framework for Iterative Computation
Iterative computation is pervasive in many applications such as data mining, web ranking, graph analysis, online social network analysis, and so on. These iterative applications typically involve massive data sets containing millions or billions of data records. This creates a demand for distributed computing frameworks that can process massive data sets on a cluster of machines. MapReduce is an example of such a framework. However, MapReduce lacks built-in support for iterative processing, which requires parsing data sets repeatedly. Besides specifying MapReduce jobs, users have to write a driver program that submits a series of jobs and performs convergence testing at the client. This paper presents iMapReduce, a distributed framework that supports iterative processing. iMapReduce allows users to specify the iterative computation with separate map and reduce functions, and provides support for automatic iterative processing within a single job. More importantly, iMapReduce significantly improves the performance of iterative implementations by (1) reducing the overhead of creating new MapReduce jobs repeatedly, (2) eliminating the shuffling of static data, and (3) allowing asynchronous execution of map tasks. We implement an iMapReduce prototype based on Apache Hadoop, and show that iMapReduce can achieve up to 5 times speedup over Hadoop for implementing iterative algorithms.
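
A minimal sketch of the client-side driver loop the abstract contrasts iMapReduce with (an in-process simulation in plain Python; the names and the toy per-iteration job are assumptions): the driver resubmits a map/reduce pass until a convergence test passes, which is exactly the repeated-job overhead iMapReduce eliminates.

    # Toy driver loop: submit one "MapReduce job" per iteration until convergence.
    LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # hypothetical example graph

    def map_phase(ranks):
        """Emit (target, rank share) pairs along the static link structure."""
        for page, outlinks in LINKS.items():
            share = ranks[page] / len(outlinks)
            for target in outlinks:
                yield target, share

    def reduce_phase(pairs, damping=0.85):
        """Sum the shares per page and apply the damping factor."""
        totals = {page: 0.0 for page in LINKS}
        for page, share in pairs:
            totals[page] += share
        return {page: (1 - damping) / len(LINKS) + damping * s for page, s in totals.items()}

    def driver(eps=1e-6, max_iters=100):
        ranks = {page: 1.0 / len(LINKS) for page in LINKS}
        for _ in range(max_iters):                      # one "job" per loop iteration
            new_ranks = reduce_phase(map_phase(ranks))
            if max(abs(new_ranks[p] - ranks[p]) for p in ranks) < eps:   # convergence test
                return new_ranks
            ranks = new_ranks
        return ranks

    print(driver())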

15.
应毅  任凯  刘亚军 《计算机科学》2018,45(Z11):353-355
Traditional log analysis techniques hit a computational bottleneck when processing massive data. To address this problem, a log analysis scheme based on big data technology is studied: multiple machines share the storage, analysis, and mining of log files; a parallel web log analysis engine is built on the open-source Hadoop framework; and the IP statistics algorithm and the anomaly detection algorithm are re-implemented under the MapReduce model. Experiments show that using big data technology in data-intensive computation clearly improves the execution efficiency of the algorithms and increases the scalability of the system.
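
A minimal sketch of the IP-statistics step under the MapReduce model (a local simulation; the log format and function names are assumptions, not the paper's implementation): mappers emit (ip, 1) for every log line and reducers sum the counts per IP.

    from collections import defaultdict

    LOG_LINES = [                       # hypothetical access-log fragment
        '10.0.0.1 - - [01/Jan/2018] "GET /index.html" 200',
        '10.0.0.2 - - [01/Jan/2018] "GET /login" 404',
        '10.0.0.1 - - [01/Jan/2018] "POST /login" 200',
    ]

    def mapper(line):
        """Emit (ip, 1); the client IP is assumed to be the first whitespace field."""
        yield line.split()[0], 1

    def reducer(ip, counts):
        """Sum the per-IP counts produced by the mappers."""
        yield ip, sum(counts)

    def run_job(lines):
        grouped = defaultdict(list)     # the framework's shuffle/group-by-key step
        for line in lines:
            for ip, one in mapper(line):
                grouped[ip].append(one)
        return dict(kv for ip, counts in grouped.items() for kv in reducer(ip, counts))

    print(run_job(LOG_LINES))           # {'10.0.0.1': 2, '10.0.0.2': 1}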

16.
The massive data produced by the rapid growth of the Internet era poses a huge challenge to traditional clustering methods, and improving clustering algorithms to extract useful information has become a current research focus. K-Medoids is a common partition-based clustering algorithm; its advantage is that it handles outliers and noisy points well, but it suffers from sensitivity to the initial centers, a tendency to fall into local optima, and CPU and memory bottlenecks when processing big data. To solve these problems, a genetic-algorithm-based K-Medoids clustering method under the MapReduce architecture is proposed. The population evolution of the genetic algorithm mitigates the initial-center sensitivity of K-Medoids, and on this basis the genetic K-Medoids algorithm is parallelized with MapReduce to improve efficiency. Experiments on labeled data sets show that the MapReduce- and genetic-algorithm-based K-Medoids algorithm running on a Hadoop cluster effectively improves the quality and efficiency of clustering.
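
A minimal sketch of the evolutionary idea (assumed encoding: a chromosome is a set of k medoid indices; the operators here are illustrative, not the paper's): fitness is the total distance of points to their nearest medoid, and mutation swaps one medoid for a random non-medoid.

    import random

    def cost(points, medoid_idx):
        """Total distance of every point to its nearest medoid (lower is better)."""
        return sum(min(abs(p - points[m]) for m in medoid_idx) for p in points)

    def mutate(medoid_idx, n_points):
        """Replace one medoid with a random non-medoid index."""
        child = list(medoid_idx)
        pos = random.randrange(len(child))
        choices = [i for i in range(n_points) if i not in child]
        child[pos] = random.choice(choices)
        return child

    def genetic_k_medoids(points, k=2, pop_size=20, generations=50, seed=0):
        random.seed(seed)
        population = [random.sample(range(len(points)), k) for _ in range(pop_size)]
        for _ in range(generations):
            population.sort(key=lambda ind: cost(points, ind))
            survivors = population[: pop_size // 2]          # selection
            offspring = [mutate(ind, len(points)) for ind in survivors]
            population = survivors + offspring
        best = min(population, key=lambda ind: cost(points, ind))
        return [points[i] for i in best]

    data = [1.0, 1.2, 0.9, 8.0, 8.3, 7.9]   # two obvious 1-D clusters
    print(genetic_k_medoids(data, k=2))      # medoids near 1.0 and 8.0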

17.
Frequent Itemset Mining (FIM) is one of the most important data mining tasks and is the foundation of many others. In the big data era, centralized FIM algorithms cannot meet the needs of FIM over big data in terms of time and space, so Distributed Frequent Itemset Mining (DFIM) algorithms have been designed to meet these challenges. In this paper, Local Global and Redistribution Mining, the two main paradigms of DFIM algorithms, are discussed; two algorithms of these paradigms on MapReduce, named LG and RM, are proposed, MapReduce being a popular distributed computing model, and related work is also discussed. The experimental results show that the RM algorithm has better performance in terms of computation and scalability across sites, and can serve as a basis for designing MapReduce-based DFIM algorithms. This paper also discusses the main ideas for improving DFIM algorithms based on MapReduce.
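
A minimal sketch of the Local Global idea (a SON-style two-phase scheme is assumed here for illustration; LG's actual details may differ): each partition mines its locally frequent itemsets in a first pass, and a second pass counts the union of local candidates globally.

    from itertools import combinations
    from collections import Counter

    def frequent_itemsets(transactions, min_count, max_size=2):
        """Brute-force frequent itemsets up to max_size within one data partition."""
        counts = Counter()
        for t in transactions:
            for size in range(1, max_size + 1):
                for itemset in combinations(sorted(t), size):
                    counts[itemset] += 1
        return {i for i, c in counts.items() if c >= min_count}

    def local_global_fim(partitions, global_min_count):
        # Local phase: candidates frequent in at least one partition.
        per_part = global_min_count // len(partitions)
        candidates = set()
        for part in partitions:
            candidates |= frequent_itemsets(part, max(1, per_part))
        # Global phase: count every candidate over all partitions.
        totals = Counter()
        for part in partitions:
            for t in part:
                for cand in candidates:
                    if set(cand) <= set(t):
                        totals[cand] += 1
        return {c: n for c, n in totals.items() if n >= global_min_count}

    parts = [[{"a", "b"}, {"a", "c"}], [{"a", "b"}, {"b", "c"}]]
    print(local_global_fim(parts, global_min_count=3))   # e.g. {('a',): 3, ('b',): 3}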

18.
王晓峰  于卓  赵健  曹泽轩 《计算机工程》2022,48(6):182-192+199
The maximum clique problem is a classic combinatorial optimization problem with wide applications in protein function prediction, winner determination in auctions, video object segmentation, and other fields. As graph instances grow, the maximum clique problem becomes harder to solve, and algorithms for ordinary-sized instances are gradually being replaced by algorithms for large-scale instances. This paper introduces the technical foundations for solving the maximum clique problem on large-scale graph instances and summarizes maximum clique algorithms for large-scale graphs; against the background of big data computing, it analyzes the advantages and disadvantages of MapReduce and Spark frameworks that combine single-level and multi-level graph partitioning methods. In addition, it compares the application scenarios of the k-core and k-community methods, summarizes the strengths and weaknesses of different classes of algorithms from the perspective of algorithm taxonomy, reviews deterministic algorithms for the large-scale maximum clique problem, and compares the performance of representative algorithms on public data sets. Based on the analysis, the paper points out the aspects that different algorithms need to focus on when solving the large-scale maximum clique problem, and looks ahead to future research directions in intelligent optimization algorithms, hierarchical deep reinforcement learning methods, and graph-structure phase-transition analysis.
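
A minimal sketch of the k-core reduction mentioned above (standard peeling, written here as an illustration): since every vertex of a clique of size k has degree at least k-1 inside the clique, repeatedly deleting vertices of degree below k-1 safely shrinks the graph before the clique search.

    def k_core(adj, k):
        """Peel vertices of degree < k until none remain; adj maps vertex -> set of neighbors."""
        adj = {v: set(nbrs) for v, nbrs in adj.items()}
        changed = True
        while changed:
            changed = False
            for v in [v for v, nbrs in adj.items() if len(nbrs) < k]:
                for u in adj.pop(v):
                    adj[u].discard(v)
                changed = True
        return adj

    # Looking for a clique of size >= 4, so keep only the (4-1)-core.
    graph = {
        1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {1, 2, 3, 5},
        5: {4, 6}, 6: {5},
    }
    print(sorted(k_core(graph, k=3)))   # [1, 2, 3, 4] -- vertices 5 and 6 are peeled away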

19.
As a parallel programming framework, MapReduce can process scalable and parallel applications with large scale datasets. The executions of Mappers and Reducers are independent of each other: there is no communication among Mappers, nor among Reducers. When the final result is much smaller than the original data, it is a waste of time to process the unpromising intermediate data. We observe that this waste can be significantly reduced by simple communication mechanisms that enhance the performance of MapReduce. In this paper, we propose ComMapReduce, an efficient framework that extends and improves MapReduce for big data applications in the cloud. ComMapReduce can effectively obtain certain shared information with efficient lightweight communication mechanisms. Three basic communication strategies, Lazy, Eager and Hybrid, and two optimization communication strategies, Prepositive and Postpositive, are proposed to obtain the shared information and effectively process big data applications. We also illustrate the implementations of three typical applications with large scale datasets on ComMapReduce. Our extensive experiments demonstrate that ComMapReduce outperforms MapReduce in all metrics without affecting the existing characteristics of MapReduce.
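
A minimal sketch of the shared-information idea (a top-k query with a globally shared score threshold is assumed for illustration; ComMapReduce's actual coordination uses its own lightweight communication mechanisms): mappers consult a shared filter value and drop records that can no longer make the final answer.

    import heapq

    SHARED_FILTER = {"threshold": float("-inf")}   # stands in for the shared value a
                                                   # coordinator would distribute to mappers

    def mapper(partition, k):
        """Emit only records whose score beats the shared filter value."""
        local_top = heapq.nlargest(k, partition, key=lambda rec: rec[1])
        if len(local_top) == k:
            # Publish a tighter bound: the k-th best local score.
            SHARED_FILTER["threshold"] = max(SHARED_FILTER["threshold"], local_top[-1][1])
        return [rec for rec in partition if rec[1] >= SHARED_FILTER["threshold"]]

    def reducer(candidates, k):
        """Merge the surviving candidates into the global top-k."""
        return heapq.nlargest(k, candidates, key=lambda rec: rec[1])

    parts = [[("a", 0.9), ("b", 0.1), ("c", 0.7)], [("d", 0.2), ("e", 0.8), ("f", 0.3)]]
    survivors = [rec for part in parts for rec in mapper(part, k=2)]
    print(reducer(survivors, k=2))   # [('a', 0.9), ('e', 0.8)]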

20.
With the emergence of massive big data, clustering algorithms need new computing models to improve computation speed and running efficiency. This paper proposes DGP-DE-K-mediods (Dynamic Gemini Population based DE-K-mediods), a differential-evolution K-medoids clustering algorithm based on dynamic twin subpopulations. DGP-DE-K-mediods uses the dynamic twin-population method so that the clustering algorithm maintains population diversity while avoiding falling into local optima; it adopts the Differential Evolution (DE) algorithm to strengthen its ability to reach the global optimum; it parallelizes DGP-DE-K-mediods on the Hadoop cloud platform to speed up the algorithm; and it describes the programming process of the MapReduce-based parallel clustering algorithm. The algorithm is evaluated with UIC big data classification case data and a network intrusion detection big data application. Experimental results show that, compared with existing clustering algorithms, DGP-DE-K-mediods has clear advantages in detection accuracy and running time.
