Similar literature
Found 20 similar documents; search took 31 ms
1.
To solve the load imbalance problem of a solution-adaptive finite element application program on a distributed memory multicomputer, nodes of a refined finite element graph can be remapped to processors, or the load of a refined finite element graph can be redistributed based on the current load of each processor. For the former case, remapping can be performed by fast mapping algorithms. For the latter case, a load-balancing algorithm can be applied to balance the computational load of each processor. In this paper, three tree-based parallel load-balancing methods, the MCSTLB method, the BTLB method, and the CBTLB method, are proposed to deal with the load imbalance problems of solution-adaptive finite element application programs. To evaluate the performance of the proposed methods, we have implemented them along with three mapping methods, the AE/ORB method, the AE/MC method, and the MLkP method, on an SP2 parallel machine. Three criteria, the execution time of mapping/load-balancing methods, the execution time of a solution-adaptive finite element application program under different mapping/load-balancing methods, and the speedups achieved by mapping/load-balancing methods for a solution-adaptive finite element application program, are used for the performance evaluation. The experimental results show that 1) if the initial mapping is performed by a mapping method and the same mapping method and the load-balancing methods are used in each refinement to balance the load of processors, the execution time of an application program under a load-balancing method is always shorter than that under the mapping method, and 2) the execution time of an application program under the CBTLB method is shorter than that under the BTLB method and the MCSTLB method.

2.
To efficiently execute a finite element program on a 2D torus, we need to map nodes of the corresponding finite element graph to processors of a 2D torus such that each processor has approximately the same amount of computational load and the communication among processors is minimized. If nodes of a finite element graph do not increase during the execution of a program, the mapping only needs to be performed once. However, if a finite element graph is solution-adaptive, that is, nodes of a finite element graph increase discretely due to the refinement of some finite elements during the execution of a program, a dynamic load-balancing algorithm has to be performed many times in order to balance the computational load of processors while keeping the communication cost as low as possible. In this paper, we propose a parallel dynamic load-balancing algorithm (LB) to deal with the load imbalance problem of a solution-adaptive finite element program on a 2D torus. The algorithm uses an iterative approach to achieve load balancing. We have implemented the proposed algorithm along with two parallel mapping algorithms, parallel orthogonal recursive bisection (ORB) and parallel recursive mincut bipartitioning (MC), on a simulated 2D torus. Three criteria, the execution time of load-balancing algorithms, the computation time of an application program under different load-balancing algorithms, and the total execution time of an application program (under several refinement phases), are used for performance evaluation. Simulation results show that (1) the execution of LB is faster than those of MC and ORB; (2) the mappings of LB are better than those of ORB and MC; and (3) the speedups of LB are better than those of ORB and MC.

3.
To efficiently execute a finite element application program on a distributed memory multicomputer, we need to distribute nodes of a finite element graph to processors of a distributed memory multicomputer as evenly as possible and minimize the communication cost of processors. This partitioning problem is known to be NP-complete. Therefore, many heuristics have been proposed to find satisfactory sub-optimal solutions. Based on these heuristics, many graph partitioners have been developed. Among them, Jostle, Metis, and Party are considered the best graph partitioners available to date. In order to minimize the total cut-edges, these three graph partitioners generally allow 3% to 5% load imbalance among processors. This is a tradeoff between the communication cost and the computation cost of the partitioning problem. In this paper, we propose an optimization method, the dynamic diffusion method (DDM), to balance the 3% to 5% load imbalance allowed by these three graph partitioners while minimizing the total cut-edges among partitioned modules. To evaluate the proposed method, we compare the performance of the dynamic diffusion method with the directed diffusion method and the multilevel diffusion method on an IBM SP2 parallel machine. Three 2D and two 3D irregular finite element graphs are used as test samples. For each test sample, 3% and 5% load imbalance situations are tested. From the experimental results, we draw the following conclusions. (1) The dynamic diffusion method can improve the partition results of these three partitioners in terms of the total cut-edges and the execution time of a Laplace solver in most test cases, while the directed diffusion method and the multilevel diffusion method may fail in many cases. (2) The optimization results of the dynamic diffusion method are better than those of the directed diffusion method and the multilevel diffusion method in terms of the total cut-edges and the execution time of a Laplace solver for most test cases. (3) The dynamic diffusion method can balance the load of processors for all test cases.
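
As a rough illustration of the general diffusion idea that this family of methods builds on, the Python sketch below implements plain first-order diffusion on a processor graph. It is not the DDM described in the abstract, and it ignores cut-edges entirely. The function name `diffuse`, the parameter `alpha`, and the 4-processor ring are assumptions made only for this example.

```python
def diffuse(loads, neighbors, alpha=0.25, tol=1e-6, max_iters=10_000):
    """First-order diffusion (illustrative, not the paper's DDM).

    In every iteration each processor moves alpha * (load difference)
    towards or away from each neighbor. With a symmetric neighbor
    relation the total load is conserved.
    """
    loads = [float(x) for x in loads]
    for iteration in range(max_iters):
        # Stop once the heaviest and lightest processors are close enough.
        if max(loads) - min(loads) <= tol:
            return loads, iteration
        new_loads = loads[:]
        for i, nbrs in neighbors.items():
            for j in nbrs:
                new_loads[i] += alpha * (loads[j] - loads[i])
        loads = new_loads
    return loads, max_iters


# Hypothetical ring of 4 processors with a small initial imbalance.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
balanced, iters = diffuse([105, 100, 97, 98], ring)
```

The step size `alpha` has to be small enough for the iteration to converge on the chosen topology; 0.25 works for this ring but is not a general recommendation.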

4.
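Xia Ying  Li Hongxu 《计算机应用》(Journal of Computer Applications) 2017,37(9):2439-2442
Unordered trees are often used to model semi-structured data, and mining frequent subtrees from them helps uncover hidden knowledge. Traditional frequent subtree mining methods often output large sets of frequent subtrees that carry redundant information, which reduces the efficiency of subsequent operations. To address this shortcoming, an algorithm for mining cover patterns (MCRP) is proposed. First, trees are encoded using a breadth-first child-count encoding; then, all candidate subtrees are generated by edge extension based on the maximal prefix encoding sequence; finally, the set of cover patterns is output on the basis of the frequent subtree set and the concept of δ'-cover. Compared with traditional algorithms for mining frequent closed tree patterns and maximal frequent tree patterns, the algorithm outputs fewer frequent subtrees while preserving the information of all frequent subtrees, and improves processing efficiency by 15% to 25%. Experimental results show that the proposed algorithm effectively reduces the size of the output set of frequent subtrees, reduces redundant information, and is highly feasible in practice.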

5.
In current computer architectures, the communication performance between threads varies depending on the memory hierarchy. This performance difference must be considered when mapping parallel applications to processor cores. In parallel applications based on the shared memory paradigm, the communication is difficult to detect because it is implicit. Furthermore, dynamic mapping introduces several challenges, since it needs to find a suitable mapping and migrate the threads with a low overhead during the execution of the application. We propose a mechanism to detect the communication pattern of shared memory applications by monitoring cache coherence protocols. We also propose heuristics that, combined with our communication detection mechanism, allow the mapping to be performed dynamically by the operating system. Experiments with the NAS Parallel Benchmarks showed a reduction of up to 13.9% of the execution time, 30.5% of the cache misses and 39.4% of the number of invalidation messages.

6.
Summary: This paper presents the formal definition of TOMAL (Task-Oriented Microprocessor Applications Language), a programming language intended for real-time systems running on small processors. The formal definition addresses all aspects of the language. Because some modes of semantic definition seem particularly well-suited to certain aspects of a language, and not as suitable for others, the formal definition employs several complementary modes of definition. The primary definition is axiomatic and is employed to define most statements of the language. Simple, denotational (but not lattice-theoretic) semantics complement the axiomatic semantics to define type-related features, such as binding of names to types, data type coercions, and evaluation of expressions. Together, the axiomatic and denotational semantics define all features of the sequential language. An operational definition is used to define real-time execution, and to extend the axiomatic definition to account for all aspects of concurrent execution. Semantic constraints, sufficient to guarantee conformity of a program with the axiomatic definition, can be checked by analysis of a TOMAL program at compilation.

7.
Computational Grids are emerging as a new infrastructure for high performance computing. Since the resources in a Grid can be heterogeneous and distributed, mesh-based applications require a mesh partitioner that considers both processor and network heterogeneity. We have developed a heterogeneous mesh partitioner, called PaGrid. PaGrid uses a multilevel graph partitioning approach, augmented by execution time load balancing in the final uncoarsening phase. We show that minimization of total communication cost (e.g., as used by JOSTLE) can lead to significant load being placed on processors connected by slow links, which results in higher application execution times. Therefore, PaGrid balances the estimated execution time of the application across processors. PaGrid performance is compared with two existing mesh partitioners, METIS 4.0 and JOSTLE 3.0, for mapping several application meshes to two models of heterogeneous computational Grids. PaGrid is found to produce significantly better partitions than JOSTLE and slightly better partitions than METIS in most cases, in terms of estimated application execution time averaged over a large number of runs with different random number seeds.

8.
9.
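Building on a study of methods for measuring program code similarity, a query-matching algorithm for program code based on an XML store is proposed. Because the XML store saves XML files in a tree structure, the algorithm measures similarity by querying the DVM trees in the XML store to determine whether programs contain subtrees with the same structure. Finally, a series of experiments on a prototype system further demonstrates the feasibility and effectiveness of the proposed algorithm for measuring program code similarity in practice.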

10.
Amoth  Thomas R.  Cull  Paul  Tadepalli  Prasad 《Machine Learning》2001,44(3):211-243
Tree patterns are natural candidates for representing rules and hypotheses in many tasks such as information extraction and symbolic mathematics. A tree pattern is a tree with labeled nodes where some of the leaves may be labeled with variables, whereas a tree instance has no variables. A tree pattern matches an instance if there is a consistent substitution for the variables that allows a mapping of subtrees to matching subtrees of the instance. A finite union of tree patterns is called a forest. In this paper, we study the learnability of tree patterns from queries when the subtrees are unordered. The learnability is determined by the semantics of matching as defined by the types of mappings from the pattern subtrees to the instance subtrees. We first show that unordered tree patterns and forests are not exactly learnable from equivalence and subset queries when the mapping between subtrees is one-to-one onto, regardless of the computational power of the learner. Tree and forest patterns are learnable from equivalence and membership queries for the one-to-one into mapping. Finally, we connect the problem of learning tree patterns to inductive logic programming by describing a class of tree patterns called Clausal trees that includes non-recursive single-predicate Horn clauses and show that this class is learnable from equivalence and membership queries.
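
To make the matching semantics concrete, the following Python sketch checks whether an unordered tree pattern matches an instance under a one-to-one into (injective) mapping of child subtrees. It only illustrates matching, not the learning algorithms of the paper. The tree representation `(label, [children])`, the convention that variable leaves start with `?`, and the names `canon`, `match`, and `_assign` are all assumptions of this example.

```python
def canon(tree):
    """Canonical form of an unordered tree: children sorted recursively."""
    label, children = tree
    return (label, tuple(sorted(canon(c) for c in children)))


def match(pattern, instance, subst=None):
    """Return a consistent variable substitution, or None if no match.

    Each pattern child must map to a distinct instance child
    (one-to-one into); extra instance children are allowed.
    """
    subst = dict(subst or {})
    p_label, p_children = pattern
    i_label, i_children = instance
    if p_label.startswith("?") and not p_children:
        # Variable leaf: bind it, or check the earlier binding.
        bound = canon(instance)
        if p_label in subst:
            return subst if subst[p_label] == bound else None
        subst[p_label] = bound
        return subst
    if p_label != i_label or len(p_children) > len(i_children):
        return None
    return _assign(list(p_children), list(i_children), subst)


def _assign(p_children, free, subst):
    """Backtracking search for an injective child-to-child mapping."""
    if not p_children:
        return subst
    first, rest = p_children[0], p_children[1:]
    for k, candidate in enumerate(free):
        extended = match(first, candidate, subst)
        if extended is not None:
            result = _assign(rest, free[:k] + free[k + 1:], extended)
            if result is not None:
                return result
    return None


# Hypothetical pattern f(?x, g(?x)) and two instances.
pattern = ("f", [("?x", []), ("g", [("?x", [])])])
good = ("f", [("g", [("a", [])]), ("a", []), ("b", [])])
bad = ("f", [("g", [("b", [])]), ("a", [])])
result_good = match(pattern, good)
result_bad = match(pattern, bad)
```

Requiring `len(p_children) == len(i_children)` instead of `<=` would give the one-to-one onto variant. The backtracking search is exponential in the worst case, which is acceptable for a sketch but not for large trees.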

11.
Observations on using genetic algorithms for dynamic load-balancing   Total citations: 2, self-citations: 0, citations by others: 2
Load-balancing problems arise in many applications, but, most importantly, they play a special role in the operation of parallel and distributed computing systems. Load-balancing deals with partitioning a program into smaller tasks that can be executed concurrently and mapping each of these tasks to a computational resource such as a processor (e.g., in a multiprocessor system) or a computer (e.g., in a computer network). By developing strategies that can map these tasks to processors in a way that balances out the load, the total processing time will be reduced with improved processor utilization. Most of the research on load-balancing has focused on static scenarios that, in most cases, employ heuristic methods. However, genetic algorithms have gained immense popularity over the last few years as a robust and easily adaptable search technique. The work proposed here investigates how a genetic algorithm can be employed to solve the dynamic load-balancing problem. A dynamic load-balancing algorithm is developed whereby optimal or near-optimal task allocations can “evolve” during the operation of the parallel computing system. The algorithm considers other load-balancing issues such as threshold policies, information exchange criteria, and interprocessor communication. The effects of these and other issues on the success of the genetic-based load-balancing algorithm as compared with the first-fit heuristic are outlined.

12.
We present a parallel formulation for enumerative search in high dimensional spaces and apply it to planning paths for a 6-dof manipulator robot. Participating processors perform local A* search towards the goal configuration. To exploit all the processors at their maximum capacity at all times, a dynamic load-balancing scheme matches idle and busy processors for load transfer. For comparison purposes, we have also implemented an existing parallel static load-balancing formulation based on regular domain decomposition. Both methods achieved almost linear speed-up in our experiments. The two methods follow different search strategies in parallel and the implementation of the existing method (with tuned space decomposition) was more time efficient on average. However, the planning time of that method is highly dependent on the distribution of the search space among the processors and its tuned decomposition varies for different obstacle placements. Empirical selection of the space decomposition parameters for the existing method does not guarantee minimal planning time in all environments and leads to slower planning than our dynamic load-balancing method in some cases. The performance of the developed dynamic method is independent of the obstacle placements and the method can achieve consistent speed-up in all environments.

13.
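Because information propagation models underpin community mining and community influence research, this paper proposes an information propagation model that incorporates user interests and designs a frequent-subtree-based method for mining micro-level information propagation patterns. First, based on a graph representation of the microblog social network and multi-label modeling of users, the mining of micro-level information propagation patterns is transformed into a frequent subtree mining problem. Then, to handle the fact that a single node in the microblog social network graph carries multiple labels, a frequent subtree mining algorithm for trees with multi-label nodes (MLTreeMiner) is designed. Finally, combined with a topic extraction method, MLTreeMiner is used to mine information propagation patterns. Experiments on synthetic datasets show that MLTreeMiner mines frequent subtrees from multi-label node trees efficiently. Experiments on real Sina Weibo data also verify the effectiveness of the method.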

14.
In this paper we introduce our estimation method for parallel execution times, based on identifying separate “parts” of the work done by parallel programs. Our run time analysis works without any source code inspection. The time of parallel program execution is expressed in terms of the sequential work and the parallel penalty. We measure these values for different problem sizes and numbers of processors and estimate them for unknown values in both dimensions using statistical methods. This allows us to predict parallel execution time for unknown inputs and non-available processor numbers with high precision. Our prediction methods require orders of magnitude fewer data points than existing approaches. We verified our approach on parallel machines ranging from a multicore computer to a peta-scale supercomputer.

15.
Load balancing algorithms are designed essentially to distribute the load equally across processors and maximize their utilization while minimizing the total task execution time. In order to achieve these goals, the load-balancing mechanism should be “fair” in distributing the load across the different processors. This implies that the difference between the heaviest-loaded and the lightest-loaded processors should be minimized. Therefore, the load information on each processor must be kept up to date so that the load-balancing mechanism can be more effective. In this work, we present an application-independent dynamic algorithm for scheduling tasks and load balancing in message passing systems. We propose a DAG-based Dynamic Load Balancing algorithm for Real time applications (DAG-DLBR) that is designed to work dynamically to cope with possible changes in the load that might occur during runtime. This algorithm addresses the challenge of devising a load balancing scheme which judiciously deals with the hybrid execution of an existing real-time application (represented by a Directed Acyclic Graph (DAG)) together with newly arriving jobs. The main objective of this algorithm is to reduce response times of the newly arriving jobs while maintaining the time constraints of the existing DAG. To evaluate the performance of the DAG-DLBR algorithm, a comparison with the performance of two common dynamic load balancing algorithms is presented. This comparison is performed by evaluating, experimentally, the execution time of different load balancing algorithms on a homogeneous real parallel machine. In addition, the values of load imbalance, the execution time, and the communication overhead time are evaluated analytically using different benchmarks as test-bed workloads. These workloads cover a wide range of dynamic applications with different task types. Experimental results illustrate the improved performance of the DAG-DLBR algorithm compared with the distributed and hierarchical-based algorithms by at least 12% and 19%, respectively. This improvement holds for all workloads, even highly dependent ones. The DAG-DLBR algorithm achieves lower computation time than both the distributed and the hierarchical-based algorithms for 4, 8, 12 and 16 processors.
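
The fairness criterion stated in this abstract, a small gap between the heaviest-loaded and lightest-loaded processors, can be sketched in a few lines of Python. The code below is a simple greedy "assign to the currently lightest processor" heuristic together with the gap metric. It is not DAG-DLBR, it ignores task dependencies and deadlines, and the names `load_spread` and `assign_to_lightest` as well as the task costs are hypothetical.

```python
import heapq


def load_spread(loads):
    """Gap between the heaviest-loaded and the lightest-loaded processor."""
    return max(loads) - min(loads)


def assign_to_lightest(task_costs, num_procs):
    """Greedy placement: each arriving task goes to the lightest processor.

    Returns (assignment, loads), where assignment maps task index to
    processor index and loads[p] is the final load of processor p.
    """
    heap = [(0.0, p) for p in range(num_procs)]  # (current load, processor)
    heapq.heapify(heap)
    assignment = {}
    for task, cost in enumerate(task_costs):
        load, proc = heapq.heappop(heap)
        assignment[task] = proc
        heapq.heappush(heap, (load + cost, proc))
    loads = [0.0] * num_procs
    for load, proc in heap:
        loads[proc] = load
    return assignment, loads


# Hypothetical independent tasks arriving in this order on 2 processors.
assignment, loads = assign_to_lightest([5, 3, 8, 2, 4], 2)
spread = load_spread(loads)
```

For this particular input the two processors end up with equal load, but the greedy rule gives no such guarantee in general.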

16.
The problem of mapping the parallel bottom-up execution of Datalog programs to an interconnected network of processors is studied. The parallelization is achieved by using hash functions that partition the set of instantiations for the rules. We first examine this problem in an environment where the number of processors and the interconnection topology are known, and communication between program segments residing at non-adjacent processors is not permitted. An algorithm is presented that decides whether a given Datalog program can be mapped onto such an architecture. We then relax the constraint on the architecture by allowing program segments residing at non-adjacent processors to communicate. A theory of approximate mappings is developed, and an algorithm to obtain the closest approximate mapping of a given Datalog program onto a given architecture is presented.
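
The idea of partitioning rule instantiations with a hash function can be illustrated on the classic transitive-closure program `path(X,Y) :- edge(X,Y)` and `path(X,Y) :- edge(X,Z), path(Z,Y)`. The Python sketch below simulates processors sequentially and partitions the recursive rule on the join variable `Z`. It does not model the interconnection topology or the mapping algorithms of the paper. The hash function `h`, the function `parallel_tc`, and the use of integer constants are assumptions of this example.

```python
def h(value, num_procs):
    """Hypothetical hash function on integer constants."""
    return value % num_procs


def parallel_tc(edges, num_procs):
    """Simulated hash-partitioned bottom-up evaluation of transitive closure.

    Processor i evaluates exactly those instantiations of
    path(X,Y) :- edge(X,Z), path(Z,Y) for which h(Z) == i.
    """
    # Each processor stores the edge(X,Z) facts it may need to join.
    local_edges = [[] for _ in range(num_procs)]
    for x, z in edges:
        local_edges[h(z, num_procs)].append((x, z))

    # A shared set stands in for duplicate elimination (a simulation shortcut).
    path = set(edges)
    inbox = [[] for _ in range(num_procs)]
    for z, y in path:
        inbox[h(z, num_procs)].append((z, y))

    while any(inbox):
        outbox = [[] for _ in range(num_procs)]
        for i in range(num_procs):
            for z, y in inbox[i]:
                for x, z2 in local_edges[i]:
                    if z2 == z and (x, y) not in path:
                        path.add((x, y))
                        # A new path(x,y) joins where it plays path(Z,Y), Z = x.
                        outbox[h(x, num_procs)].append((x, y))
        inbox = outbox
    return path


# Hypothetical chain graph 1 -> 2 -> 3 -> 4 on two simulated processors.
closure = parallel_tc([(1, 2), (2, 3), (3, 4)], 2)
```

Every message in the sketch may travel between any pair of processors, which corresponds to the relaxed setting in which non-adjacent processors are allowed to communicate.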

17.
This work investigates the operator mapping problem for in-network stream-processing. In a stream-processing application, a tree of operators is applied, in steady-state mode, to datasets that are continuously updated at different locations in the network. The goal is to generate updated final results at a desired rate. In in-network stream-processing, dataset updates and operator computations are performed by servers distributed in a network. We consider the problem of mapping operators to these servers in the case of multiple concurrent stream-processing applications. In this case, different operator trees corresponding to different applications may share common subtrees, so that intermediate results can be reused by different applications. This work provides complexity results for different versions of the operator mapping problem, which can be formulated as integer linear programs. Several polynomial-time heuristics are proposed for a particularly relevant version of the problem, which is NP-hard. These heuristics are compared and evaluated via simulation. The results demonstrate the importance of mapping the operators to appropriate processors, and the importance of sharing common sub-trees across operator trees.

18.
《Control Engineering Practice》2007,15(11):1321-1331
Enterprise control system integration between business systems, manufacturing execution systems and shop-floor process-control systems remains a key issue for facilitating the deployment of plant-wide information control systems for practical e-business-to-manufacturing industry-led issues. Achievement of the integration-in-manufacturing paradigm based on centralized/distributed hardware/software automation architectures is evolving using the intelligence-in-manufacturing paradigm addressed by IMS industry-led R&D initiatives. The remaining goal is to define and experiment with the next generation of manufacturing systems, which should be able to cope with the high degree of complexity required to implement agility, flexibility and reactivity in customized manufacturing. This introductory paper summarizes some key problems, trends and accomplishments in manufacturing plant control before emphasizing for practical purposes some rationales and forecasts in deploying automation over networks, holonic manufacturing execution systems and their related agent-based technology, and applying formal methods to ensure dependable control of these manufacturing systems.

19.
Research has been conducted to determine how distributed computations can be mapped to multiprocessors to minimize execution time. The approach described here, known as post-game analysis, incrementally changes the program partitioning between successive program runs. Post-game analysis differs from conventional iterative refinement or controlled opportunistic perturbation in that no abstract program models or any single objective function are employed to determine the relative merits of two alternative mappings. Multiple optimization subgoals are formulated, based on actual timing data gathered during program execution. Heuristics, based on various optimization subgoals, are then applied to propose changes to the current mapping. Finally, a mapping generation process which prioritizes and resolves conflicting proposals is applied. Results obtained from simulations show that post-game analysis consistently outperforms random placement, load-balancing, and clustering algorithms by 15%. Few iterations are required for simulations involving more than 200 processes and 64 sites. A rule-based architecture enables incremental strategy refinement, thus making post-game analysis easily tailorable to programs written in many concurrent programming paradigms and multiprocessor architectures.

20.
In this paper, a processor allocation mechanism for NoC-based chip multiprocessors is presented. Processor allocation is a well-known problem in parallel computer systems and aims to allocate the processing nodes of a multiprocessor to different tasks of an input application at run time. The proposed mechanism targets optimizing the on-chip communication power/latency and relies on two procedures: processor allocation and task migration. Allocation is done by a fast heuristic algorithm that allocates the free processors to the tasks of an incoming application when a new application begins execution. The task-migration algorithm is activated when some application completes execution and frees up the allocated resources. Task migration uses the recently deallocated processors and tries to rearrange the current tasks in order to find a better mapping for them. The proposed method can also capture the dynamic traffic pattern of the network and perform task migration based on the current communication demands of the tasks. Consequently, task migration adapts the task mapping to the current network status. We adopt a non-contiguous processor allocation strategy in which the tasks of the input application are allowed to be mapped onto disjoint regions (groups of processors) of the network. We then use virtual point-to-point circuits, a state-of-the-art fast on-chip connection designed for networks-on-chip, to virtually connect the disjoint regions and make the communication latency/power closer to the values offered by contiguous allocation schemes. The experimental results show considerable improvement over existing allocation mechanisms.
