首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 171 毫秒
1.
In parallel and distributed applications, it is very likely that object‐oriented languages, such as Java and Ruby, and large‐scale semistructured data written in XML will be employed. However, because of their inherent dynamic memory management, parallel and distributed applications must sometimes suspend the execution of all tasks running on the processors. This adversely affects their execution on the parallel and distributed platform. In this paper, we propose a new task scheduling method called CP/MM (Critical Path/Memory Management) which can efficiently schedule tasks for applications requiring memory management. The underlying concept is to consider the cost due to memory management when the task scheduling system allocates ready (executable) coarse‐grain tasks, or macro‐tasks, to processors. We have developed three task scheduling modules, including CP/MM, for a task scheduling system which is implemented on a Java RMI (Remote Method Invocation) communication infrastructure. Our experimental results show that CP/MM can successfully prevent high‐priority macro‐tasks from being affected by the garbage collection arising from memory management, so that CP/MM can efficiently schedule distributed programs whose critical paths are relatively long. Copyright © 2005 John Wiley & Sons, Ltd.  相似文献   

2.
Many software applications require the construction and manipulation of graphs. In standard programming languages, this is accomplished using low‐level mechanisms such as pointer manipulation or array indexing. In contrast, graph productions are a convenient high‐level visual notation for coding graph modifications. A graph production replaces one subgraph by another subgraph. Graph productions can define a graph grammar and graph language, or can directly transform an input graph into an output graph. Graph transformation has been applied in many areas, including the definition of visual languages and their tools, the construction of software development environments, the definition of constraint programming algorithms, the modeling of distributed systems, and the construction of neural networks. One application is presented in detail: the interpretation of mathematical notation in scanned document images. The graph models the set of mathematical symbols, and their spatial and logical relationships. This graph is transformed by productions written in the PROGRES language. Copyright © 1999 John Wiley & Sons, Ltd.  相似文献   

3.
In this paper, a systematic and unified treatment of computational task models for parallel sparse Cholesky factorization is presented. They are classified as fine-, medium-, and large-grained graph models. In particular, a new medium-grained model based on column-oriented tasks is introduced, and it is shown to correspond structurally to the filled graph of the given sparse matrix. The task scheduling problem for the various task graphs is also discussed. A practical algorithm to schedule the column tasks of the medium-grained model for multiple processors is described. It is based on a heuristic critical path scheduling method. This will give an overall scheme for parallel sparse Cholesky factorization, appropriate for parallel machines with shared-memory architecture like the Denelcor HEP.  相似文献   

4.
We describe an efficient and scalable symmetric iterative eigensolver developed for distributed memory multi‐core platforms. We achieve over 80% parallel efficiency by major reductions in communication overheads for the sparse matrix‐vector multiplication and basis orthogonalization tasks. We show that the scalability of the solver is significantly improved compared to an earlier version, after we carefully reorganize the computational tasks and map them to processing units in a way that exploits the network topology. We discuss the advantage of using a hybrid OpenMP/MPI programming model to implement such a solver. We also present strategies for hiding communication on a multi‐core platform. We demonstrate the effectiveness of these techniques by reporting the performance improvements achieved when we apply our solver to large‐scale eigenvalue problems arising in nuclear structure calculations. Because sparse matrix‐vector multiplication and inner product computation constitute the main kernels in most iterative methods, our ideas are applicable in general to the solution of problems involving large‐scale symmetric sparse matrices with irregular sparsity patterns. Copyright © 2013 John Wiley & Sons, Ltd.  相似文献   

5.
The traditional, serial, algorithm for finding the strongly connected components in a graph is based on depth first search and has complexity which is linear in the size of the graph. Depth first search is difficult to parallelize, which creates a need for a different parallel algorithm for this problem. We describe the implementation of a recently proposed parallel algorithm that finds strongly connected components in distributed graphs, and discuss how it is used in a radiation transport solver.  相似文献   

6.
The Block Conjugate Gradient algorithm (Block‐CG) was developed to solve sparse linear systems of equations that have multiple right‐hand sides. We have adapted it for use in heterogeneous, geographically distributed, parallel architectures. Once the main operations of the Block‐CG (Tasks) have been collected into smaller groups (subjobs), each subjob is matched by the middleware MJMS (MPI Jobs Management System) with a suitable resource selected among those which are available. Moreover, within each subjob, concurrency is introduced at two different levels and with two different granularities: the coarse‐grained parallelism to perform independent tasks and the fine‐grained parallelism within the execution of a task. We refer to this algorithm as to multi‐grained distributed implementation of the parallel Block‐CG. We compare the performance of a parallel implementation with the one of the distributed implementation running on a variety of Grid computing environments. The middleware MJMS—developed by some of the authors and built on top of Globus Toolkit and Condor‐G—was used for co‐allocation, synchronization, scheduling and resource selection. Copyright © 2010 John Wiley & Sons, Ltd.  相似文献   

7.
A.  U.  M.  K. 《Future Generation Computer Systems》2005,21(8):1275-1284
For the solution of sparse linear systems from circuit simulation whose coefficient matrices include a few dense rows and columns, a parallel iterative algorithm with distributed Schur complement preconditioning is presented. The parallel efficiency of the solver is increased by transforming the equation system into a problem without dense rows and columns as well as by exploitation of parallel graph partitioning methods. The costs of local, incomplete LU decompositions are decreased by fill-in reducing reordering methods of the matrix and a threshold strategy for the factorization. The efficiency of the parallel solver is demonstrated with real circuit simulation problems on PC clusters.  相似文献   

8.
The exploitation of parallelism in a message passing platform implies a previous modeling phase of the parallel application as a task graph, which properly reflects its temporal behavior. In this paper, we analyze the classical task graph models of the literature and their drawbacks when modeling message passing programs with an arbitrary task structure. We define a new task graph model called temporal task interaction graph (TTIG) that integrates the classical models used in the literature. The TTIG allows us to explicitly capture the ability of concurrency of adjacent tasks for applications where adjacent tasks can communicate at any point inside them. A mapping strategy is developed from this model, which minimizes the expected execution time by properly exploiting task parallelism. The effectiveness of this approach has been proved in different experimentation scopes for a wide range of message passing applications.  相似文献   

9.
We consider a graph theoretical model and study a parallel implementation of the well-known Gaussian elimination method on parallel distributed memory architectures, where the communication delay for the transmission of an elementary data is higher than the computation time of an elementary instruction. We propose and analyze two low-complexity algorithms for scheduling the tasks of the parallel Gaussian elimination on an unbounded number of completely connected processors. We compare these two algorithms with a higher-complexity general-purpose scheduling algorithm, the DSC heuristic, proposed by A. Gerasoulis and T. Yang (1993)  相似文献   

10.
Graph grammar concepts are applied for the design of consistent states and manipulations modeling actions, transactions and schedules for simultaneous executions on data base systems. Especially we show how to use Church-Rosser Theorems for graph grammars to handle synchronization problems for data base systems in a multi-user environment. In this paper theses concepts are applied to a data base system for a library. For reasons of presentation the system is restricted to the basic features of a library but can be extended to more comfortable systems including also generalized data base systems. All manipulation rules are shown to preserve consistency and it is analysed exactly which of theses rules are collateral and for which of them further synchronization mechanisms are necessary. While consistent states are modelled as graphs and manipulation rules (resp. actions) as graph productions transactions correspond to sequences of productions, and schedules to sequences of parallel productions. The application of a schedule to a consistent state yields a “semantical schedule” which transforms the given state to some other state of the system. Sufficient conditions are given for a schedule to be consistent which means that consistent states are transformed into consistent states. Moreover we are able to define a degree of non-parallelism for schedules such that optimal schedules are those with minimal degree with respect to all equivalent ones. Applying results from graph grammar theory we are able to show that for each schedule there is a unique optimal one. A similar result is shown for semantical schedules where the semantical degree of non-parallelism is smaller than the syntactical one in general.  相似文献   

11.
We construct a parallel algorithm, suitable for distributed memory architectures, of an explicit shock-capturing finite volume method for solving the two-dimensional shallow water equations. The finite volume method is based on the very popular approximate Riemann solver of Roe and is extended to second order spatial accuracy by an appropriate TVD technique. The parallel code is applied to distributed memory architectures using domain decomposition techniques and we investigate its performance on a grid computer and on a Distributed Shared Memory supercomputer. The effectiveness of the parallel algorithm is considered for specific benchmark test cases. The performance of the realization measured in terms of execution time and speedup factors reveals the efficiency of the implementation.  相似文献   

12.
时延Petri网分布式模拟的先行值研究   总被引:1,自引:0,他引:1  
先行值计算是提高时延Petri网并行模拟性能的一个好的方法。给出了时延Petri网的先行值计算的四种基本结构,对于存在循环的复杂的Petri网结构给出了预测图算法,通过预测图,能够很容易求出静态和动态先行值,在并行模拟中利用先行值可以分析出存在并发和阻塞的结构,从而为网分块在并行机的结点上运行奠定了基础。  相似文献   

13.
As highly parallel multicore machines become commonplace, programs must exhibit more concurrency to exploit the available hardware. Many multithreaded programming models already encourage programmers to create hundreds or thousands of short‐lived threads that interact in complex ways. Programmers need to be able to analyze, tune, and troubleshoot these large‐scale multithreaded programs. To address this problem, we present ThreadScope: a tool for tracing, visualizing, and analyzing massively multithreaded programs. ThreadScope extracts the machine‐independent program structure from execution trace data from a variety of tracing tools and displays it as a graph of dependent execution blocks and memory objects, enabling identification of synchronization and structural problems, even if they did not occur in the traced run. It also uses graph‐based analysis to identify potential problems. We demonstrate the use of ThreadScope to view program structure, memory access patterns, and synchronization problems in three programming environments and seven applications. Copyright © 2009 John Wiley & Sons, Ltd.  相似文献   

14.
在多核处理器上,事务存储是一种有望取代锁的同步手段。软件事务存储不需要增加额外硬件支持,就可以充分利用当前商业多核处理器的多线程能力。提出一种软件事务存储实现算法VectorSTM,该算法不需要使用原子操作。VectorSTM采用分布的向量时钟来跟踪各线程事务执行情况,能够提供更高的并发度。对事务存储基准程序STAMP的测试表明,VectorSTM在性能或者语义上比软件事务存储算法TL2和RingSTM有优势。  相似文献   

15.

Community detection (or clustering) in large-scale graphs is an important problem in graph mining. Communities reveal interesting organizational and functional characteristics of a network. Louvain algorithm is an efficient sequential algorithm for community detection. However, such sequential algorithms fail to scale for emerging large-scale data. Scalable parallel algorithms are necessary to process large graph datasets. In this work, we show a comparative analysis of our different parallel implementations of Louvain algorithm. We design parallel algorithms for Louvain method in shared memory and distributed memory settings. Developing distributed memory parallel algorithms is challenging because of inter-process communication and load balancing issues. We incorporate dynamic load balancing in our final algorithm DPLAL (Distributed Parallel Louvain Algorithm with Load-balancing). DPLAL overcomes the performance bottleneck of the previous algorithms and shows around 12-fold speedup scaling to a larger number of processors. We also compare the performance of our algorithm with some other prominent algorithms in the literature and get better or comparable performance . We identify the challenges in developing distributed memory algorithm and provide an optimized solution DPLAL showing performance analysis of the algorithm on large-scale real-world networks from different domains.

  相似文献   

16.
We present the design and implementation of a parallel algorithm for computing Gröbner bases on distributed memory multiprocessors. The parallel algorithm is irregular both in space and time: the data structures are dynamic pointer-based structures and the computations on the structures have unpredictable duration. The algorithm is presented as a series of refinements on atransition rule program, in which computation proceeds by nondeterministic invocations of guarded commands. Two key data structures, a set and a priority queue, are distributed across processors in the parallel algorithm. The data structures are designed for high throughput and latency tolerance, as appropriate for distributed memory machines. The programming style represents a compromise between shared-memory and message-passing models. The distributed nature of the data structures shows through their interface in that the semantics are weaker than with shared atomic objects, but they still provide a shared abstraction that can be used for reasoning about program correctness. In the data structure design there is a classic trade-off between locality and load balance. We argue that this is best solved by designing scheduling structures in tandem with the state data structures, since the decision to replicate or partition state affects the overhead of dynamically moving tasks.This work was supported in part by the Advanced Research Projects Agency of the Department of Defense monitored by the Office of Naval Research under contract DABT63-92-C-0026, by AT&T, and by the National Science Foundation through an Infrastructure Grant (number CDA-8722788) and a Research Initiation Award (number CCR-9210260). The information presented here does not necessarily reflect the position or the policy of the Government and no official endorsement should be inferred.  相似文献   

17.
In this paper we present a new parallel multi-frontal direct solver, dedicated for the hp Finite Element Method (hp-FEM). The self-adaptive hp-FEM generates in a fully automatic mode, a sequence of hp-meshes delivering exponential convergence of the error with respect to the number of degrees of freedom (d.o.f.) as well as the CPU time, by performing a sequence of hp refinements starting from an arbitrary initial mesh. The solver constructs an initial elimination tree for an arbitrary initial mesh, and expands the elimination tree each time the mesh is refined. This allows us to keep track of the order of elimination for the solver. The solver also minimizes the memory usage, by de-allocating partial LU factorizations computed during the elimination stage of the solver, and recomputes them for the backward substitution stage, by utilizing only about 10% of the computational time necessary for the original computations. The solver has been tested on 3D Direct Current (DC) borehole resistivity measurement simulations problems. We measure the execution time and memory usage of the solver over a large regular mesh with 1.5 million degrees of freedom as well as on the highly non-regular mesh, generated by the self-adaptive hphp-FEM, with finite elements of various sizes and polynomial orders of approximation varying from p=1p=1 to p=9p=9. From the presented experiments it follows that the parallel solver scales well up to the maximum number of utilized processors. The limit for the solver scalability is the maximum sequential part of the algorithm: the computations of the partial LU factorizations over the longest path, coming from the root of the elimination tree down to the deepest leaf.  相似文献   

18.
Transactional memory (TM) is an approach to concurrency control that aims to make writing parallel programs both effective and simple. The approach has been initially proposed for nondistributed multiprocessor systems, but it is gaining popularity in distributed systems to synchronize tasks at large scales. Efficiency and scalability are often the key issues in TM research; thus, performance benchmarks are an important part of it. However, while standard TM benchmarks like the Stanford Transactional Applications for Multi‐Processing suite and STMBench7 are available and widely accepted, they do not translate well into distributed systems. Hence, the set of benchmarks usable with distributed TM systems is very limited, and must be padded with microbenchmarks, whose simplicity and artificial nature often makes them uninformative or misleading. Therefore, this paper introduces Helenos, a realistic, complex, and comprehensive distributed TM benchmark based on the problem of the Facebook inbox, an application of the Cassandra distributed store.  相似文献   

19.
In order to minimize the execution time of a parallel application running on a heterogeneously distributed computing system, an appropriate mapping scheme is needed to allocate the application tasks to the processors. The general problem of mapping tasks to machines is a well‐known NP‐hard problem and several heuristics have been proposed to approximate its optimal solution. In this paper we propose a static graph‐based mapping algorithm, called Heterogeneous Multi‐phase Mapping (HMM), which permits suboptimal mapping of a parallel application onto a heterogeneous computing distributed system by using a local search technique together with a tabu search meta‐heuristic. HMM allocates parallel tasks by exploiting the information embedded in the parallelism forms used to implement an application, and considering an affinity parameter, that identifies which machine in the heterogeneous computing system is most suitable to execute a task. We compare HMM with some leading techniques and with an exhaustive mapping algorithm. We also give an example of mapping of two real applications using HMM. Experimental results show that HMM performs well demonstrating the applicability of our approach. Copyright © 2005 John Wiley & Sons, Ltd.  相似文献   

20.
The algorithmic and implementation principles are explored in gainfully exploiting GPU accelerators in conjunction with multicore processors on high-end systems with large numbers of compute nodes, and evaluated in an implementation of a scalable block tridiagonal solver. The accelerator of each compute node is exploited in combination with multicore processors of that node in performing block-level linear algebra operations in the overall, distributed solver algorithm. Optimizations incorporated include: (1) an efficient memory mapping and synchronization interface to minimize data movement, (2) multi-process sharing of the accelerator within a node to obtain balanced load with multicore processors, and (3) an automatic memory management system to efficiently utilize accelerator memory when sub-matrices spill over the limits of device memory. Results are reported from our novel implementation that uses MAGMA and CUBLAS accelerator software systems simultaneously with ACML (2013)  [2] for multithreaded execution on processors. Overall, using 940 nVidia Tesla X2090 accelerators and 15,040 cores, the best heterogeneous execution delivers a 10.9-fold reduction in run time relative to an already efficient parallel multicore-only baseline implementation that is highly optimized with intra-node and inter-node concurrency and computation–communication overlap. Detailed quantitative results are presented to explain all critical runtime components contributing to hybrid performance.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号