Similar Documents
19 similar documents found (search time: 453 ms)
1.
In the MapReduce programming model for heterogeneous environments, Reduce task scheduling is essentially random: tasks are usually assigned without considering either data locality or each node's compute capability for the task at hand. To address this, a Self-Adaptive Reduce Scheduling algorithm for heterogeneous environments (SARS) is proposed. SARS first selects the rack holding the largest share of the Reduce task's input data; it then chooses the execution node within that rack by jointly considering the volume of task data a node holds, the node's compute capability, and the node's current busy state. Experimental results show that SARS reduces both the network overhead and the execution time of Reduce tasks.
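
As a rough illustration of the selection rule sketched in the abstract, the following Java fragment scores candidate nodes in the chosen rack by local data volume, compute capability, and busy state. The `NodeInfo` fields, the multiplicative score, and the busy penalty are illustrative assumptions, not the paper's actual formula.

```java
import java.util.Comparator;
import java.util.List;

// Illustrative sketch of a SARS-style Reduce placement rule.
public class SarsSketch {
    record NodeInfo(String id, long localBytes, double computePower, boolean busy) {}

    // Prefer the node holding the most task data, weighted by compute power,
    // and penalize nodes that are currently busy (weights are assumptions).
    static NodeInfo pickNode(List<NodeInfo> rackNodes) {
        return rackNodes.stream()
                .max(Comparator.comparingDouble(n ->
                        n.localBytes() * n.computePower() * (n.busy() ? 0.5 : 1.0)))
                .orElseThrow();
    }

    public static void main(String[] args) {
        List<NodeInfo> nodes = List.of(
                new NodeInfo("n1", 800, 1.0, true),
                new NodeInfo("n2", 600, 2.0, false),
                new NodeInfo("n3", 900, 1.2, false));
        System.out.println("chosen: " + pickNode(nodes).id()); // n2: 600 * 2.0 = 1200
    }
}
```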

2.
When a laboratory system processes massive raw data, practical scenarios often exhibit high sampling rates and high skewness, which the two-phase partitioning algorithm cannot handle effectively when balancing Reducer load in a homogeneous environment. To address this, MapReduce parallelization is introduced to raise the utilization of sampled data in the laboratory system, and the ICSC (Improved Cluster Split Combination) partition scheduling algorithm is adopted to cope with high skew and high sampling rates. Experiments show that the two-phase-partitioning MapReduce load balancing algorithm effectively reduces the idle time of Mapper and Reducer nodes. As data skew grows, execution time stays essentially flat, i.e., skew has little influence on running time; as the sampling rate grows, ICSC partition scheduling likewise keeps the lowest time overhead among the compared models. The algorithm thus weakens dependencies between Reducer nodes, improves the execution efficiency and fault tolerance of MapReduce jobs, and efficiently balances the load of data processing in laboratory systems on the MapReduce framework.
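
The abstract does not give ICSC's internals, so the sketch below shows only the generic split-and-combine idea the name suggests: key clusters larger than the per-reducer target are split, and the fragments are packed greedily onto the least-loaded reducer. All names and the greedy policy are assumptions, not the paper's algorithm.

```java
import java.util.*;

// Generic split-and-combine partitioning sketch (not the paper's ICSC).
public class SplitCombineSketch {
    public static long[] partition(long[] clusterSizes, int reducers) {
        long total = Arrays.stream(clusterSizes).sum();
        long cap = (total + reducers - 1) / reducers;    // target load per reducer
        List<Long> fragments = new ArrayList<>();
        for (long s : clusterSizes)                      // split oversized clusters
            for (; s > 0; s -= cap) fragments.add(Math.min(s, cap));
        fragments.sort(Comparator.reverseOrder());       // largest fragments first
        PriorityQueue<long[]> loads = new PriorityQueue<>(
                Comparator.comparingLong((long[] a) -> a[1]));
        for (int r = 0; r < reducers; r++) loads.add(new long[]{r, 0});
        long[] result = new long[reducers];
        for (long f : fragments) {                       // combine onto least-loaded
            long[] least = loads.poll();
            least[1] += f;
            result[(int) least[0]] = least[1];
            loads.add(least);
        }
        return result;
    }

    public static void main(String[] args) {
        // One hot cluster of 90 plus three small ones, spread over 3 reducers.
        System.out.println(Arrays.toString(partition(new long[]{90, 10, 10, 10}, 3)));
        // -> [40, 40, 40]
    }
}
```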

3.
Most machine learning models solve for their optimal parameters by iterative computation, yet the MapReduce model's deficiencies in iterative computation have kept it from wide use there. To resolve this conflict, a parallel iterative model called MRI, suitable for solving model parameters, is proposed and implemented on top of MapReduce. MRI keeps the Map and Reduce phases and adds an Iterate phase with an associated communication protocol, which handles parameter updating, distribution, and iteration control across iterations. By enhancing the MapReduce state machine, node tasks are reused, avoiding the overhead of repeatedly creating, initializing, and reclaiming tasks in every iteration. Data caching on task nodes preserves data locality, and an in-memory block cache added on Map nodes further speeds up training-set loading, improving overall iteration efficiency. Experiments with gradient descent show that MRI outperforms plain MapReduce on parallel iterative computation.
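
Below is a minimal in-memory sketch of the Map/Reduce/Iterate pattern, using gradient descent on a one-parameter model; MRI's communication protocol, state-machine changes, and caching layers are not reproduced.

```java
import java.util.List;

// In-memory sketch of the Map/Reduce/Iterate pattern, fitting y = w * x
// by gradient descent on a tiny cached dataset.
public class MriSketch {
    record Point(double x, double y) {}

    public static void main(String[] args) {
        List<Point> cached = List.of(new Point(1, 2), new Point(2, 4), new Point(3, 6));
        double w = 0.0, lr = 0.02;
        for (int iter = 0; iter < 200; iter++) {
            // Map phase: each task (one here) computes a partial gradient
            // over its locally cached partition.
            double grad = 0;
            for (Point p : cached) grad += 2 * (w * p.x() - p.y()) * p.x();
            // Reduce phase: partial gradients from all tasks would be summed.
            // Iterate phase: update the parameter and redistribute it to the
            // still-alive tasks instead of launching a fresh job.
            w -= lr * grad / cached.size();
        }
        System.out.printf("learned w = %.3f (expected 2.0)%n", w);
    }
}
```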

4.
To effectively reduce the response time of computational tasks in a cloud computing system while keeping the compute nodes load-balanced, the data distribution algorithm is critical. This paper proposes a data distribution strategy for parallel image computation on master-slave cloud architectures: a node performance function expresses each node's processing capability, task data is apportioned according to the performance ratios among nodes, and the order of data transmission is determined by link bandwidth. Simulation results show that the algorithm suits cloud environments and clearly improves the system's data processing efficiency.
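
A small sketch of the proportional apportioning the strategy describes: data is divided among slave nodes in the ratio of their performance scores. The performance function itself (CPU, memory, load, and so on) is left abstract here.

```java
import java.util.Arrays;

// Sketch: split a task's data among nodes proportionally to performance scores.
public class ProportionalSplit {
    static long[] split(long totalBytes, double[] perf) {
        double sum = Arrays.stream(perf).sum();
        long[] share = new long[perf.length];
        long assigned = 0;
        for (int i = 0; i < perf.length; i++) {
            share[i] = (long) (totalBytes * perf[i] / sum);
            assigned += share[i];
        }
        share[0] += totalBytes - assigned;   // give the rounding remainder to one node
        return share;
    }

    public static void main(String[] args) {
        // Three nodes, the second twice as capable as the others.
        System.out.println(Arrays.toString(split(1000, new double[]{1.0, 2.0, 1.0})));
        // -> [250, 500, 250]
    }
}
```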

5.
Hadoop has become a basic platform for cloud computing research, with MapReduce as its computational model for distributed big data processing. Targeting data distribution, data locality, and job execution flow of MapReduce on heterogeneous clusters, a DAG-based MapReduce scheduling algorithm is proposed. Cluster nodes are grouped by compute capability, MapReduce jobs are converted into a DAG model, and the upward-rank computation is refined so that it is more accurate on heterogeneous clusters and yields a more reasonable task priority order. By weighing node compute capability, data locality, and cluster utilization, the algorithm picks suitable data nodes on which to place and run tasks, shortening the completion time of the current task. Experiments show that the algorithm distributes data sensibly, effectively improves data locality, reduces communication overhead, and shortens the schedule length of the whole job set, thereby raising cluster utilization.
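
The upward rank mentioned above is, in its standard form, rank_u(t) = cost(t) + max over successors s of (comm(t, s) + rank_u(s)). The sketch below computes that standard form on a toy DAG; the paper's heterogeneity-aware refinement of the cost terms is not reproduced.

```java
import java.util.*;

// Standard upward-rank (rank_u) computation for DAG task priorities.
public class UpwardRank {
    static double rankU(int t, double[] cost, double[][] comm,
                        List<List<Integer>> succ, double[] memo) {
        if (memo[t] >= 0) return memo[t];
        double best = 0;
        for (int s : succ.get(t))
            best = Math.max(best, comm[t][s] + rankU(s, cost, comm, succ, memo));
        return memo[t] = cost[t] + best;
    }

    public static void main(String[] args) {
        // Tiny DAG: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3.
        double[] cost = {2, 3, 4, 1};
        double[][] comm = new double[4][4];
        comm[0][1] = 1; comm[0][2] = 2; comm[1][3] = 1; comm[2][3] = 1;
        List<List<Integer>> succ = List.of(List.of(1, 2), List.of(3), List.of(3), List.of());
        double[] memo = new double[4];
        Arrays.fill(memo, -1);
        // Higher rank = scheduled earlier; the entry task 0 ranks highest (10.0).
        for (int t = 0; t < 4; t++)
            System.out.println("rank_u(" + t + ") = " + rankU(t, cost, comm, succ, memo));
    }
}
```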

6.
Data skew is one of the factors that most seriously degrades MapReduce performance. Existing remedies require the user either to supply a partition function tailored to the application type or to write an extra sampling pass for MapReduce, adding to the user's burden. To remove this burden, a load-balancing strategy based on pressure statistics is proposed. The strategy exploits MapReduce's shuffle phase: statistics are gathered while reducers fetch their data, yielding the global data distribution. Using this distribution, the system reschedules heavily loaded nodes to balance the whole cluster, with no extra input from the user. In addition, to accommodate different application types at the upper layer, a pressure feedback mechanism is introduced to further improve scheduling performance. Experiments show the proposed load-balancing strategy outperforms the default one.
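
A sketch of the statistics step only: per-mapper key histograms, gathered while reducers fetch data during shuffle, are merged into a global distribution that a balancer could then act on. The paper's pressure metric and feedback mechanism are not reproduced.

```java
import java.util.*;

// Merge per-mapper key histograms into a global distribution.
public class ShuffleStats {
    static Map<String, Long> merge(List<Map<String, Long>> perMapper) {
        Map<String, Long> global = new HashMap<>();
        for (Map<String, Long> m : perMapper)
            m.forEach((k, v) -> global.merge(k, v, Long::sum));
        return global;
    }

    public static void main(String[] args) {
        List<Map<String, Long>> stats = List.of(
                Map.of("a", 100L, "b", 5L),
                Map.of("a", 120L, "c", 7L));
        Map<String, Long> global = merge(stats);
        // A key holding far more than total/reducers signals a hot reducer
        // that the balancer should offload.
        long total = global.values().stream().mapToLong(Long::longValue).sum();
        global.forEach((k, v) -> System.out.printf("%s: %d (%.0f%% of load)%n",
                k, v, 100.0 * v / total));
    }
}
```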

7.
Li Lingjuan, Zhang Min. 《微机发展》 (Microcomputer Development), 2011, (2): 43-46, 50
Cloud computing offers a cheap and efficient solution for storing and analyzing massive data, so research on data mining algorithms in cloud environments has both theoretical and practical value. Focusing on association-rule mining in the cloud, this paper reviews the concept of cloud computing, the Hadoop platform, the MapReduce programming model, and the classical Apriori algorithm. On that basis, to realize parallel data mining in the cloud, the Apriori algorithm is improved and the execution flow of the improved algorithm on Hadoop's MapReduce programming model is given. A simple frequent-itemset mining example demonstrates the improved algorithm's efficiency and practicality.
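
The counting step of Apriori parallelizes naturally under MapReduce: map emits (candidate, 1) for each candidate itemset contained in a transaction, and reduce sums the counts. The sketch below shows this in miniature with in-memory collections rather than a real Hadoop job; it is not the paper's implementation.

```java
import java.util.*;
import java.util.stream.*;

// Miniature of the MapReduce-parallelizable candidate-counting step of Apriori.
public class AprioriCountSketch {
    public static void main(String[] args) {
        List<Set<String>> transactions = List.of(
                Set.of("milk", "bread"), Set.of("milk", "eggs"),
                Set.of("milk", "bread", "eggs"));
        List<Set<String>> candidates = List.of(
                Set.of("milk", "bread"), Set.of("milk", "eggs"), Set.of("bread", "eggs"));
        int minSupport = 2;

        // Map: emit one (candidate, 1) pair per containment; Reduce: sum per candidate.
        Map<Set<String>, Long> counts = transactions.stream()
                .flatMap(t -> candidates.stream().filter(t::containsAll))
                .collect(Collectors.groupingBy(c -> c, Collectors.counting()));

        counts.forEach((c, n) -> {
            if (n >= minSupport) System.out.println(c + " -> " + n);
        });
        // Frequent here: {milk, bread} = 2 and {milk, eggs} = 2.
    }
}
```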

8.
One focus of research on frequent-pattern mining over uncertain datasets is improving the time and space efficiency of the mining algorithms, and with today's ever-growing data volumes, applications demand even more of them. For the model of frequent-pattern mining over dynamic uncertain data streams, a MapReduce-based parallel mining algorithm built on the AT-Mine algorithm is presented. The algorithm mines all frequent patterns from a sliding window with at most two MapReduce passes; in the experiments, a single pass sufficed in most cases, and the data could be spread evenly across nodes by data volume. Experiments confirm that the algorithm improves time efficiency by an order of magnitude.

9.
In cloud computing environments, the security of rapidly growing masses of data draws increasing attention. Block ciphers are an effective means of securing massive data, but their efficiency at very large data volumes is a pressing concern. A parallel block-cipher mechanism based on MapReduce is proposed that lets standard block-cipher algorithms run on large clusters, raising the throughput of encrypting and decrypting massive data through parallelization; several commonly used parallel modes of operation are also designed. Experiments show the approach scales well and executes efficiently, making it suitable for protecting massive data in the cloud and laying a foundation for further work.
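
As one example of a parallelizable mode of operation (the abstract does not say which modes the paper designs), CTR encrypts every 16-byte block independently given its counter value, so fixed-size chunks can be handled by independent workers (map tasks in the paper's setting; a parallel stream here). The chunking and counter arithmetic below are illustrative; a real deployment would also use a random IV.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.stream.IntStream;

// Sketch: AES-CTR over independent chunks, each worker starting at its
// own counter offset so results match a single sequential encryption.
public class ParallelCtrSketch {
    static byte[] ivPlus(byte[] iv, long blocks) {    // big-endian counter addition
        byte[] out = iv.clone();
        for (int i = out.length - 1; i >= 0 && blocks != 0; i--) {
            long sum = (out[i] & 0xFF) + (blocks & 0xFF);
            out[i] = (byte) sum;
            blocks = (blocks >>> 8) + (sum >>> 8);    // propagate the carry
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        byte[] key = new byte[16], iv = new byte[16], data = new byte[1 << 20];
        new SecureRandom().nextBytes(key);            // zero IV kept only for the demo
        int chunk = 64 * 1024;                        // multiple of the 16-byte block
        byte[] out = new byte[data.length];
        IntStream.range(0, data.length / chunk).parallel().forEach(i -> {
            try {
                Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
                c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
                        new IvParameterSpec(ivPlus(iv, (long) i * chunk / 16)));
                c.doFinal(data, i * chunk, chunk, out, i * chunk);
            } catch (Exception e) { throw new RuntimeException(e); }
        });
        System.out.println("encrypted " + out.length + " bytes; first bytes: "
                + Arrays.toString(Arrays.copyOf(out, 4)));
    }
}
```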

10.
Data Communication Optimization Algorithms in the LS SIMD C Compiler
1 Introduction. Ideal automatic program parallelization systems still face many intractable problems, so the prevailing approach to parallel computing is to write parallel programs in a parallel language and have the compiler translate them into node programs for execution. By the granularity of parallel execution, parallel languages divide into task-parallel languages (mainly for general-purpose computation) and data-parallel languages (mainly for scientific numerical computation); HPF is a typical data-parallel language. In a data-parallel language, the parallelism of program execution is already specified by the programmer according to the data dependences in the program, so determining data distribution and optimizing data communication become the key factors in the execution efficiency of parallel programs. Data distribution proceeds roughly in two stages: dependence analysis of the source program's data first yields a distribution onto abstract processors, and that distribution is then mapped onto physical processors. The distribution is usually determined in one of several ways: in one, the programmer specifies the abstract data distribution and the compiler…

11.
One of the important classes of computational problems is problem-oriented workflow applications executed in a distributed computing environment. A problem-oriented workflow application can be represented by a directed graph whose vertices are tasks and whose arcs are data flows. For a problem-oriented workflow application, we can obtain a priori estimates of the task execution times and of the amount of data to be transferred between the tasks. A distributed computing environment designed for the execution of such tasks in a certain subject domain is called a problem-oriented environment. To use the resources of the distributed computing environment efficiently, special scheduling algorithms are applied. Nowadays, a great number of such algorithms have been proposed. Some of them (like the DSC algorithm) take into account the specific features of problem-oriented workflow applications. Others (like the Min-Min algorithm) take into account the multicore structure of the nodes of the computational network. However, none of them takes both factors into account. In this paper, a mathematical model of a problem-oriented computing environment is constructed, and a new problem-oriented scheduling (POS) algorithm is proposed. The POS algorithm takes into account both the specifics of problem-oriented jobs and the multicore structure of the computing system's nodes. Results of computational experiments comparing the POS algorithm with other known scheduling algorithms are presented.

12.
Several classes of scientific and commercial applications require the execution of a large number of independent tasks. One highly successful and low-cost mechanism for acquiring the necessary computing power for these applications is the "public-resource computing", or "desktop Grid" paradigm, which exploits the computational power of private computers. So far, this paradigm has not been applied to data mining applications for two main reasons. First, it is not straightforward to decompose a data mining algorithm into truly independent sub-tasks. Second, the large volume of the involved data makes it difficult to handle the communication costs of a parallel paradigm. This paper introduces a general framework for distributed data mining applications called Mining@home. In particular, we focus on one of the main data mining problems: the extraction of closed frequent itemsets from transactional databases. We show that it is possible to decompose this problem into independent tasks, which however need to share a large volume of the data. We thus introduce a data-intensive computing network, which adopts a P2P topology based on super peers with caching capabilities, aiming to support the dissemination of large amounts of information. Finally, we evaluate the execution of a pattern extraction task on such a network.

13.
The paper presents a complete solution for modeling scientific and business workflow applications, static and just-in-time QoS selection of services, and workflow execution in a real environment. The workflow application is modeled as a directed acyclic graph where nodes denote tasks and edges denote dependencies between the tasks. The BeesyCluster middleware is used to allow providers to publish services from sequential or parallel applications, from their servers or clusters. Optimization algorithms are proposed to select a capable service for each task so that a global criterion is optimized, such as the product of workflow execution time and cost, a linear combination of the two, or execution time minimized under a cost constraint. The paper presents implementation details of the multithreaded workflow execution engine implemented in JEE. Several tests were performed for three different optimization goals for two business and scientific workflow applications. Finally, the overhead of the solution is presented.
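
For a single independent task, selection under the time-times-cost criterion reduces to picking the service with the smallest product, as in the sketch below; the paper's algorithms additionally honor workflow dependencies and global constraints, which this sketch deliberately omits.

```java
import java.util.List;

// Sketch: pick the service minimizing execution time * cost for one task.
public class ServicePick {
    record Service(String name, double time, double cost) {}

    static Service best(List<Service> options) {
        Service best = options.get(0);
        for (Service s : options)
            if (s.time() * s.cost() < best.time() * best.cost()) best = s;
        return best;
    }

    public static void main(String[] args) {
        List<Service> options = List.of(
                new Service("cheap-slow", 10, 1),    // product 10
                new Service("fast-pricey", 2, 6),    // product 12
                new Service("balanced", 4, 2));      // product 8
        System.out.println(best(options).name());    // balanced
    }
}
```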

14.
In parallel and distributed applications, it is very likely that object-oriented languages, such as Java and Ruby, and large-scale semistructured data written in XML will be employed. However, because of their inherent dynamic memory management, parallel and distributed applications must sometimes suspend the execution of all tasks running on the processors. This adversely affects their execution on the parallel and distributed platform. In this paper, we propose a new task scheduling method called CP/MM (Critical Path/Memory Management) which can efficiently schedule tasks for applications requiring memory management. The underlying concept is to consider the cost due to memory management when the task scheduling system allocates ready (executable) coarse-grain tasks, or macro-tasks, to processors. We have developed three task scheduling modules, including CP/MM, for a task scheduling system which is implemented on a Java RMI (Remote Method Invocation) communication infrastructure. Our experimental results show that CP/MM can successfully prevent high-priority macro-tasks from being affected by the garbage collection arising from memory management, so that CP/MM can efficiently schedule distributed programs whose critical paths are relatively long.

15.
Network overlays support the execution of distributed applications, hiding lower level protocols and the physical topology. This work presents DiVHA: a distributed virtual hypercube algorithm that allows the construction and maintenance of a self-healing overlay network based on a virtual hypercube. DiVHA keeps logarithmic properties even when the number of nodes is not a power of two, presenting a scalable alternative for connecting distributed resources. DiVHA assumes a dynamic fault situation, in which nodes fail and recover continuously, leaving and joining the system. The algorithm is formally specified, and the latency for detecting changes and the subsequent reconstruction of the topology is proved to be bounded. An actual overlay network based on DiVHA called HyperBone was implemented and deployed on PlanetLab. HyperBone offers services such as monitoring and routing, allowing the execution of Grid applications across the Internet. HyperBone also includes a procedure for detecting groups of stable nodes, which allowed the execution of parallel applications on a virtual hypercube built on top of PlanetLab.
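
The logarithmic properties rest on the standard hypercube neighborhood rule: in a d-dimensional hypercube, a node's neighbors are the ids obtained by flipping each of its d bits, giving O(log N) degree and routing distance. The sketch below shows only that rule; DiVHA's maintenance protocol for node counts that are not powers of two is not reproduced here.

```java
// Neighbors of a node in a d-dimensional hypercube: flip each bit of its id.
public class HypercubeNeighbors {
    static int[] neighbors(int node, int dims) {
        int[] nb = new int[dims];
        for (int d = 0; d < dims; d++) nb[d] = node ^ (1 << d);   // flip bit d
        return nb;
    }

    public static void main(String[] args) {
        // Node 5 (binary 101) in a 3-cube: neighbors 100, 111, 001 = 4, 7, 1.
        for (int n : neighbors(5, 3)) System.out.print(n + " ");
        System.out.println();
    }
}
```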

16.
Load sharing in large, heterogeneous distributed systems allows users to access vast amounts of computing resources scattered around the system and may provide substantial performance improvements to applications. We discuss the design and implementation issues in Utopia, a load sharing facility specifically built for large and heterogeneous systems. The system has no restriction on the types of tasks that can be remotely executed, involves few application changes and no operating system change, supports a high degree of transparency for remote task execution, and incurs low overhead. The algorithms for managing resource load information and task placement take advantage of the clustering nature of large-scale distributed systems; centralized algorithms are used within host clusters, and directed graph algorithms are used among the clusters to make Utopia scalable to thousands of hosts. Task placements in Utopia exploit the heterogeneous hosts and consider varying resource demands of the tasks. A range of mechanisms for remote execution is available in Utopia that provides varying degrees of transparency and efficiency. A number of applications have been developed for Utopia, ranging from a load sharing command interpreter, to parallel and distributed applications, to a distributed batch facility. For example, an enhanced Unix command interpreter allows arbitrary commands and user jobs to be executed remotely, and a parallel make facility achieves speed-ups of 15 or more by processing a collection of tasks in parallel on a number of hosts.

17.
Scheduling large-scale applications in heterogeneous grid systems is a fundamental NP-complete problem that is critical for obtaining good performance and execution cost. Achieving high performance in a grid system requires effective task partitioning, resource management, and load balancing. The heterogeneous and dynamic nature of a grid, as well as the diverse demands of applications running on it, makes grid scheduling a major task. Existing schedulers in wide-area heterogeneous systems require a large amount of information about the application and the grid environment to produce reasonable schedules. However, this information may not be available, may be too expensive to collect, or may increase the scheduler's runtime overhead to the point that the scheduler is rendered ineffective. We believe that no single scheduler is appropriate for all grid systems and applications. This is because data-parallel applications, in which further data partitioning is possible, can be improved through efficient resource management, smart resource selection, and load balancing, whereas in functional (non-divisible-task) parallel applications such partitioning is either impossible, difficult, or expensive in terms of performance. In this paper, we propose a scheduler for data parallel applications (SDPA) which offers an efficient task partitioning and load balancing strategy for data parallel applications in a grid environment. The proposed SDPA offers two major features: maintaining job priority even if an insufficient number of free resources is available, and pre-task assignment to cut the idle time of nodes. The SDPA selects nodes smartly according to the nature of the task and the nodes' resource availability. Simulation results reveal that SDPA outperforms strategies reported in the reviewed literature in terms of execution time, throughput, and waiting time.
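
The pre-task assignment idea can be sketched as each node keeping a short local queue, so the next task is already staged when the current one finishes and the node never idles waiting for the scheduler. The queue depth, ordering, and node choice below are simplifications, not the SDPA rules.

```java
import java.util.*;

// Sketch of pre-task assignment: stage the next task on each node ahead of time.
public class PreAssignSketch {
    public static void main(String[] args) {
        Deque<String> tasks = new ArrayDeque<>(List.of("t1", "t2", "t3", "t4", "t5", "t6"));
        Map<String, Deque<String>> nodeQueues = new LinkedHashMap<>();
        for (String node : List.of("nodeA", "nodeB")) nodeQueues.put(node, new ArrayDeque<>());
        while (!tasks.isEmpty() || nodeQueues.values().stream().anyMatch(q -> !q.isEmpty())) {
            for (Deque<String> q : nodeQueues.values())
                while (q.size() < 2 && !tasks.isEmpty()) q.add(tasks.poll()); // stage ahead
            for (Map.Entry<String, Deque<String>> e : nodeQueues.entrySet()) {
                String done = e.getValue().poll();   // node finishes its current task;
                if (done != null)                    // the staged one starts immediately
                    System.out.println(e.getKey() + " completed " + done);
            }
        }
    }
}
```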

18.
Swarm intelligence systems accomplish swarm-level application tasks through information exchange among neighboring individuals, offering good robustness and flexibility. At the same time, most developers find it difficult to describe the distributed, parallel interaction mechanisms among individuals. Some high-level languages let users program parallel swarm-intelligence computing tasks in a serial mindset, from a global system perspective, without considering low-level interaction details such as communication protocols and data distribution. However, the large semantic gap between these user-facing, globally declarative application programs and the parallel execution logic of individuals makes compilation complex and application development inefficient. This paper presents a compilation system and supporting tools that translate high-level swarm-intelligence applications into safe, efficient distributed implementations. Through parallel-information identification, computation partitioning, and interaction-information generation, the compiler turns globally oriented, serially programmed swarm-intelligence applications into parallel object code executed independently by individuals, so users need not understand the complex interaction mechanisms among individuals. A standardized intermediate representation is designed that converts complex swarm-intelligence computing tasks into sequences of standardized semantic modules composed of swarm-intelligence operators and input/output variables; it represents source-program information in a platform-independent form and shields the heterogeneity of target hardware platforms. The compilation system was deployed and tested on a case-study swarm-intelligence platform; the results show that it effectively compiles swarm-intelligence applications into platform-executable object code and improves development efficiency, and the generated code outperforms existing compilers on a series of benchmarks.

19.
To address the low testing efficiency encountered in face-recognition algorithm research, a general distributed big-data test platform was designed and implemented using distributed technology. To speed up both the large-scale testing of face-recognition algorithms and the statistical computation over test results, a distributed parallel execution architecture was designed on RabbitMQ, with distributed parallel computation performed by the MapReduce framework of a Hadoop cluster. The test platform is developed with the Java Spring framework; test code and test images are hosted on the HDFS file system of the Hadoop cluster, separating the testing business from the test platform and improving the platform's generality. The platform supports not only distributed execution of a single test task but also concurrent execution of multiple test tasks, and it manages test tasks and the associated code and data effectively. Compared with traditional testing methods, the platform improves test efficiency more than tenfold, and the larger the number of test images, the more pronounced the improvement. The platform is general across testing businesses and scalable in capacity, and offers a useful reference for large-scale data testing of other AI algorithms.
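
A generic sketch of the dispatch pattern described: a message queue decouples the platform from worker nodes that pull test tasks and run them in parallel. A real deployment would use RabbitMQ queues; the in-process BlockingQueue and the poison-pill shutdown here are stand-ins, not the platform's implementation.

```java
import java.util.concurrent.*;

// Queue-based dispatch sketch: workers pull test tasks until a STOP marker.
public class DispatchSketch {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        ExecutorService workers = Executors.newFixedThreadPool(3);
        for (int w = 0; w < 3; w++)
            workers.submit(() -> {
                try {
                    String task;
                    while (!(task = queue.take()).equals("STOP"))
                        System.out.println(Thread.currentThread().getName() + " ran " + task);
                    queue.put("STOP");               // propagate shutdown to peer workers
                } catch (InterruptedException ignored) {}
            });
        for (int i = 1; i <= 6; i++) queue.put("test-batch-" + i);
        queue.put("STOP");
        workers.shutdown();
        workers.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```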
