Similar Documents
20 similar documents found.
1.
Although various strategies have been developed for scheduling parallel applications with independent tasks, very little work exists on scheduling tightly coupled parallel applications in cluster environments. In this paper, we compare four strategies that use performance models of tightly coupled parallel applications to schedule them on clusters. In addition to algorithms based on existing popular optimization techniques, we propose a new algorithm, called Box Elimination, that searches the space of performance model parameters to determine the best schedule of machines. By means of real and simulation experiments, we evaluated the algorithms on single-cluster and multi-cluster setups. We show that our Box Elimination algorithm generates schedules up to 80% more efficient than those of the other algorithms. We also show that the execution times of the schedules produced by our algorithm are more robust against performance modeling errors. Copyright © 2009 John Wiley & Sons, Ltd.
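To illustrate the flavor of model-based schedule selection (a minimal sketch, not the paper's Box Elimination algorithm), the C fragment below assumes a toy performance model t(n) = W/n + C·n, in which computation shrinks with the number of machines while communication overhead grows, and simply picks the machine count with the smallest modeled time; the parameters `work` and `comm` are invented for illustration:

```c
#include <stdio.h>

/* Hypothetical performance model for a tightly coupled application:
 * t(n) = W/n + C*n, i.e., computation shrinks with more machines while
 * per-step communication grows. Illustrative only; the paper's Box
 * Elimination algorithm searches the model-parameter space itself. */
static double model_time(double work, double comm, int n) {
    return work / n + comm * n;
}

int main(void) {
    const double work = 1e6, comm = 50.0;  /* assumed model parameters */
    int best_n = 1;
    double best_t = model_time(work, comm, 1);
    for (int n = 2; n <= 128; n++) {
        double t = model_time(work, comm, n);
        if (t < best_t) { best_t = t; best_n = n; }
    }
    printf("best machine count: %d (modeled time %.1f)\n", best_n, best_t);
    return 0;
}
```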

2.
Triggered by ever-increasing advances in processor and networking technology, a cluster of PCs connected by a high-speed network has become a viable and cost-effective platform for the execution of computation-intensive parallel multithreaded applications. However, two research issues must be tackled in the scheduling problem for PC cluster computing: (1) how to reduce the communication overhead of executing a multithreaded application on the cluster; and (2) how to exploit the heterogeneity, which is unavoidable in an evolving PC cluster, for the application. In this paper, we propose a duplication-based approach to scheduling tasks/threads on a heterogeneous cluster of PCs. In duplication-based scheduling, critical tasks are redundantly scheduled to more than one machine in order to reduce the number of inter-task communication operations; the start times of succeeding tasks are also reduced. The duplication process is guided by the system's heterogeneity, in that critical tasks are scheduled or replicated on faster machines. The algorithm has been implemented in our experimental application parallelization system for generating multithreaded parallel code executable on a cluster of Pentium PCs. Our experiments, using three numerical applications and one protocol processing kernel (multithreading per request), indicate that the heterogeneity of a PC cluster can indeed be exploited to optimize the execution of parallel multithreaded programs.
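A minimal sketch of why duplication helps (illustrative only, not the paper's algorithm): for a parent-child task pair, redundantly executing the parent on the child's machine removes the inter-machine message from the child's earliest start time. All costs below are assumed values:

```c
#include <stdio.h>

/* Toy two-task chain: parent -> child with communication cost comm.
 * Without duplication the child waits for the message; with the parent
 * redundantly executed on the child's (faster) machine, the message is
 * eliminated. Sketch only, not the paper's full scheduling algorithm. */
int main(void) {
    double parent_cost_m0 = 10.0;  /* parent on its original machine  */
    double parent_cost_m1 = 6.0;   /* same task on the faster machine */
    double comm = 8.0;             /* inter-machine message latency   */

    double start_no_dup = parent_cost_m0 + comm;  /* wait for message */
    double start_dup    = parent_cost_m1;         /* local duplicate  */

    printf("child start without duplication: %.1f\n", start_no_dup);
    printf("child start with duplication:    %.1f\n", start_dup);
    return 0;
}
```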

3.
In this paper we present a new environment called MERPSYS that allows simulation of parallel application execution time on cluster-based systems. The environment offers modeling of the application in the Java language, extended with methods representing message-passing communication routines. It also offers a graphical interface for building a system model that incorporates various hardware components, such as CPUs, GPUs and interconnects, and easily allows various formulas to model the execution and communication times of particular blocks of code. A simulator engine within the MERPSYS environment simulates execution of an application consisting of processes with various codes, to which distinct labels are assigned. The simulator runs one Java thread per label and scales computation and communication times accordingly. This approach allows fast coarse-grained simulation of large applications on large-scale systems. We have performed tests and verification of the simulator's results for three real parallel applications implemented with C/MPI and run on real HPC clusters: a master-slave code computing similarity measures of points in a multidimensional space, a geometric single-program-multiple-data application computing heat distribution, and a divide-and-conquer application performing merge sort. In all cases the simulator gave results very similar to the real ones on configurations tested with up to 1000 processes. Furthermore, it allowed us to predict execution times on configurations beyond the hardware resources available to us.
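The coarse-grained simulation idea can be sketched in a few lines of C (MERPSYS itself models applications in Java; the structure and the parameters `cpu_speed` and `bandwidth` here are hypothetical): each labeled block's time is its computation scaled by CPU speed plus its communication scaled by link bandwidth:

```c
#include <stdio.h>

/* Coarse-grained simulation in the spirit of MERPSYS (illustrative,
 * not its actual Java API): one modeled time per process label, with
 * computation scaled by CPU speed and communication by bandwidth. */
typedef struct { const char *label; double flops, bytes; } block_t;

int main(void) {
    block_t blocks[] = { {"master", 1e9, 4e8}, {"slave", 8e9, 1e8} };
    double cpu_speed = 2e9;   /* flop/s, assumed  */
    double bandwidth = 1e8;   /* bytes/s, assumed */
    for (int i = 0; i < 2; i++) {
        double t = blocks[i].flops / cpu_speed + blocks[i].bytes / bandwidth;
        printf("%s: simulated %.2f s\n", blocks[i].label, t);
    }
    return 0;
}
```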

4.
Recently, a series of parallel loop self-scheduling schemes have been proposed, especially for heterogeneous cluster systems. However, they employed the MPI programming model to construct the applications without considering whether the computing nodes have multicore architectures. As a result, every processor core has to communicate directly with the master node to request new tasks, despite the fact that cores on the same node can communicate with each other through the underlying shared memory. To reduce this communication overhead, in this paper we adopt a hybrid MPI and OpenMP programming model to design two-level parallel loop self-scheduling schemes. In the first level, each computing node runs an MPI process for inter-node communication. In the second level, each processor core runs an OpenMP thread to execute the iterations assigned to its node. Experimental results show that our method outperforms previous schemes.
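A minimal sketch of such a two-level scheme, assuming a simplified chunk-based protocol (the chunk size, tags, and message layout are invented for illustration): rank 0 dispatches loop chunks to one MPI process per node, and each process fans its chunk out to the node's cores with OpenMP:

```c
#include <mpi.h>
#include <omp.h>

#define N 100000
#define CHUNK 1000

/* Two-level self-scheduling sketch (hypothetical, simplified): rank 0
 * hands out loop chunks to one MPI process per node; inside each
 * process, OpenMP threads share the chunk through the node's memory. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                       /* master: dispatch chunks */
        int next = 0, done = 0;
        while (done < size - 1) {
            int req;
            MPI_Status st;
            MPI_Recv(&req, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
            int start = (next < N) ? next : -1;   /* -1 = no more work */
            next += CHUNK;
            if (start < 0) done++;
            MPI_Send(&start, 1, MPI_INT, st.MPI_SOURCE, 1, MPI_COMM_WORLD);
        }
    } else {                               /* worker: one per node */
        for (;;) {
            int dummy = 0, start;
            MPI_Send(&dummy, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            MPI_Recv(&start, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (start < 0) break;
            int end = (start + CHUNK < N) ? start + CHUNK : N;
            #pragma omp parallel for       /* second level: node cores */
            for (int i = start; i < end; i++) {
                /* loop body for iteration i */
            }
        }
    }
    MPI_Finalize();
    return 0;
}
```

Built with an MPI wrapper and OpenMP enabled (e.g., mpicc -fopenmp), this layout needs only one MPI process per node; intra-node distribution happens through shared memory rather than extra messages to the master, which is exactly the overhead the two-level scheme removes.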

5.
This paper presents a practical evaluation and comparison of three state-of-the-art parallel functional languages. The evaluation is based on implementations of three typical symbolic computation programs, with performance measured on a Beowulf-class parallel architecture. We assess three mature parallel functional languages: PMLS, a system for implicitly parallel execution of ML programs; GPH, a mainly implicit parallel extension of Haskell; and Eden, a more explicit parallel extension of Haskell designed for both distributed and parallel execution. While all three languages employ a completely implicit approach to communication, each takes a different approach to specifying and controlling parallelism, ranging from explicit identification of processes as language constructs (Eden), through annotation of potential parallelism (GPH), to automatic detection of parallel skeletons in sequential code (PMLS). We present detailed performance measurements of all three systems on a widely available parallel architecture: a Beowulf cluster of low-cost commodity workstations. We use three representative symbolic applications: a matrix multiplication algorithm, an exact linear system solver, and a simple ray-tracer. Our results show how moderate speedups can be achieved with little or no change to the sequential code, and that parallel performance can be significantly improved, even within our high-level model of parallel functional programming, by controlling key aspects of the program such as load distribution and thread granularity.

6.
Hybrid CPU/GPU clusters have recently drawn considerable attention in high performance computing because of their excellent execution performance and energy efficiency. Many supercomputing sites in the newest TOP500 and Green500 lists are built from hybrid CPU/GPU clusters rather than CPU-only clusters. However, the programming complexity of hybrid CPU/GPU clusters is so high that most users hesitate to move to this new cluster computing platform. To resolve this problem, we propose a distributed PTX virtual machine, called BigGPU, for heterogeneous clusters. As its name suggests, this virtual machine is physically a distributed system aimed at re-compiling and executing PTX code in parallel by aggregating the CPUs and GPUs available in a computational cluster. With the support of this virtual machine, users can regard a hybrid CPU/GPU cluster as a single large-scale GPU. Consequently, they can develop applications using only CUDA, without combining MPI and multithreading APIs, while simultaneously using distributed CPUs and GPUs to solve the same problem. Moreover, they need not handle load balancing among heterogeneous processors, nor the device-memory and thread-configuration constraints of physical GPUs, because BigGPU supports a large-scale virtual device memory space and thread configuration. We also evaluate the execution performance of BigGPU in this paper. Our experimental results show that BigGPU can indeed effectively exploit the computational power of CPUs and GPUs to enhance the execution performance of users' CUDA programs.

7.
8.
In the ongoing quest for greater computational power, efficiently exploiting parallelism is of paramount importance. Architectural trends have shifted from improving single-threaded application performance, often achieved through instruction-level parallelism (ILP), to improving multithreaded application performance by supporting thread-level parallelism (TLP). Thus, multi-core processors incorporating two or more cores on a single die have become ubiquitous. To achieve concurrent execution on multi-core processors, applications must be explicitly restructured to exploit parallelism, either by programmers or by compilers. However, multithreaded parallel programming may introduce overhead due to communication among threads. Though some resources are shared among processor cores, current multi-core processors provide no explicit communication support for multithreaded applications that takes advantage of the proximity between cores. Currently, inter-core communication depends on cache coherence, resulting in demand-based cache line transfers with their inherent latency and overhead. In this paper, we explore two approaches to improve communication support for multithreaded applications. Prepushing is a software-controlled data forwarding technique that sends data to the destination's cache before it is needed, eliminating misses in the destination's cache and reducing coherence traffic on the bus. Software Controlled Eviction (SCE) improves thread communication by placing shared data in shared caches, so that it can be found in a much closer location than remote caches or main memory. Simulation results show significant performance improvement when these architectural optimizations are added to multi-core processors.

9.
In this study, we consider an environment composed of a heterogeneous cluster of multicore-based machines used to analyze satellite data. The workload involves large data sets and is subject to a deadline constraint. Multiple applications, each represented by a directed acyclic graph (DAG), are allocated to a dedicated heterogeneous distributed computing system. Each vertex in a DAG represents a task that needs to be executed, and task execution times vary substantially across machines. The goal of this research is to assign the tasks of the applications to a heterogeneous multicore-based parallel system in such a way that all applications complete before a common deadline and their completion times are robust against uncertainties in execution times. We define a measure that quantifies robustness in this environment. We design, compare, and evaluate five static resource allocation heuristics that attempt to maximize robustness. We consider six different scenarios, with different ratios of computation to communication and with loose and tight deadlines.
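For flavor, here is a greedy minimum-completion-time allocation in C (a baseline sketch, not one of the paper's five heuristics, and it ignores DAG precedence for brevity); the execution-time matrix and deadline are made up, and the slack between deadline and makespan stands in as a crude proxy for robustness:

```c
#include <stdio.h>

#define TASKS 6
#define MACH  3

/* Greedy min-completion-time allocation: exec[t][m] is the estimated
 * time of task t on machine m. Each task goes to the machine where it
 * would finish earliest; the deadline slack is a rough robustness
 * proxy. Illustrative values, not from the paper. */
int main(void) {
    double exec[TASKS][MACH] = {
        {4,6,9},{5,3,7},{8,8,2},{3,5,4},{6,2,5},{7,4,3}
    };
    double ready[MACH] = {0}, deadline = 15.0;
    for (int t = 0; t < TASKS; t++) {
        int best = 0;
        for (int m = 1; m < MACH; m++)
            if (ready[m] + exec[t][m] < ready[best] + exec[t][best]) best = m;
        ready[best] += exec[t][best];
        printf("task %d -> machine %d (finish %.1f)\n", t, best, ready[best]);
    }
    double makespan = 0;
    for (int m = 0; m < MACH; m++) if (ready[m] > makespan) makespan = ready[m];
    printf("makespan %.1f, slack vs. deadline %.1f\n", makespan, deadline - makespan);
    return 0;
}
```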

10.
For Part I, see ibid., pp. 170-180. In Part I, we presented a binding environment for the AND and OR parallel execution of logic programs. This environment was instrumental in rendering a compiler for the AND and OR parallel execution of logic programs machine independent. In this paper, we describe a compiler based on the Reduce-OR process model (ROPM) for the parallel execution of Prolog programs, and report its performance on five parallel machines: the Encore Multimax, the Sequent Symmetry, the NCUBE 2, the Intel i860 hypercube, and a network of Sun workstations. The compiler is part of a machine-independent parallel Prolog development system built on top of a run-time environment for parallel programming called the Chare kernel, and it runs unchanged on these multiprocessors. In keeping with the objectives behind the ROPM, the compiler supports both OR parallelism and independent AND parallelism in Prolog programs and is suitable for execution on both shared- and non-shared-memory machines. We discuss the performance of the Prolog compiler in some detail and describe how grain size can be used to deliver performance that is within 10% of the underlying sequential Prolog compiler on one processor and that scales linearly with increasing numbers of processors on problems exhibiting sufficient parallelism. The loose coupling between parallel and sequential components makes it possible to use the best available sequential compiler as the sequential component of our compiler.

11.
This paper describes dSTEP, a directive-based programming model for hybrid shared- and distributed-memory machines. The originality of our work lies in the definition and implementation of a unified high-level programming model addressing both data and computation distribution, providing particularly fine control over the computation. The goal is to improve programmer productivity while providing good performance in terms of execution time and memory usage. We define a generic compilation scheme for computation mapping and communication generation. We implement the solution in a source-to-source compiler together with a runtime library. We provide a series of optimizations to improve the performance of the generated code, with a special focus on reducing communication time. We evaluate our solution on several scientific kernels as well as on the more challenging NAS BT benchmark, and compare our results with the hand-written Fortran MPI and UPC implementations. The results show, first, that our solution makes it possible to express the non-trivial parallel execution of the NAS BT benchmark using the dSTEP directives. Second, they show that our generated MPI+OpenMP BT program runs with a speedup of 83.35 over the original NAS OpenMP C benchmark on a hybrid cluster composed of 64 quad-core nodes (256 cores). Overall, our solution dramatically reduces the programming effort while providing good execution time and memory usage. This programming model is suitable for a large variety of machines, such as multi-core and accelerator clusters.

12.
In parallel with changes in both the architecture domain (the move toward chip multiprocessors, or CMPs) and the application domain (the move toward increasingly data-intensive workloads), issues such as performance, energy efficiency and CPU availability are becoming increasingly critical. CPU availability can change dynamically for several reasons, such as thermal overload, an increase in transient errors, or operating system scheduling. An important question in this context is how to adapt, in a CMP, the execution of a given application to CPU availability changes at runtime. Our paper studies this problem, targeting the energy-delay product (EDP) as the main metric to optimize. We first observe that, in adapting the application execution to varying CPU availability, one needs to consider the number of CPUs to use, the number of application threads to accommodate, and the voltage/frequency levels to employ (if the CMP has this capability). We then propose using helper threads to adapt the application execution to CPU availability changes, with the goal of minimizing the EDP. The helper thread runs in parallel with the application's execution threads and tries to determine the ideal number of CPUs, number of threads, and voltage/frequency levels to employ at any given point in the execution. We illustrate this idea using four applications (Fast Fourier Transform, MultiGrid, LU decomposition and Conjugate Gradient) under different execution scenarios. The results collected through our experiments are very promising and indicate that significant EDP reductions are possible using helper threads. For example, we achieved up to 66.3%, 83.3%, 91.2%, and 94.2% savings in EDP when adjusting all the parameters properly for FFT, MG, LU, and CG, respectively. We also discuss how our approach can be extended to address multi-programmed workloads.
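The selection logic such a helper thread might run can be sketched as follows (with assumed analytic delay and power models, not the paper's measured values): evaluate each candidate configuration's energy-delay product and keep the minimum:

```c
#include <stdio.h>

/* EDP = energy x delay. A helper thread would periodically evaluate
 * candidate configurations (CPU count, V/f level) and switch to the
 * one minimizing predicted EDP. The models below are crude stand-ins:
 * delay ~ work/(cpus*freq), dynamic power ~ cpus*freq^3. */
typedef struct { int cpus; double freq; } config_t;

int main(void) {
    config_t cand[] = { {2, 1.0}, {4, 1.0}, {4, 2.0}, {8, 2.0} };
    double work = 100.0;                    /* normalized work units */
    double best_edp = -1.0;
    config_t best = cand[0];
    for (int i = 0; i < 4; i++) {
        double delay  = work / (cand[i].cpus * cand[i].freq);
        double energy = cand[i].cpus * cand[i].freq * cand[i].freq
                        * cand[i].freq * delay;
        double edp = energy * delay;
        if (best_edp < 0 || edp < best_edp) { best_edp = edp; best = cand[i]; }
    }
    printf("pick %d CPUs @ %.1f GHz (EDP %.1f)\n", best.cpus, best.freq, best_edp);
    return 0;
}
```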

13.
While the dataflow execution model can potentially uncover all forms and levels of parallelism in a program, in its traditional fine-grain form it does not exploit any form of locality. Recent evidence indicates that the exploitation of locality in dataflow programs could have a dramatic impact on performance. The current trend in the design of dataflow processors suggests a synthesis of traditional non-strict fine-grain instruction execution and strict coarse-grain execution in order to exploit locality. While an increase in instruction granularity favors the exploitation of locality within a single execution thread, the resulting grain size may increase latency among execution threads. We define fine-grain intrathread locality as a dynamic measure of instruction-level locality and quantify it using a set of numeric and non-numeric benchmarks. The results point to a very large degree of intrathread locality and a remarkable uniformity and consistency in the distribution of thread locality across a wide variety of benchmarks. Moving execution to a coarser granularity can increase the input latency of operands, which would have a detrimental effect on performance. We evaluate the latency incurred by partitioning fine-grain instructions into coarser-grain threads, and we define the concept of a cluster of fine-grain instructions to quantify coarse-grain input and output latencies. The results of our experiments offer compelling evidence that coarse-grain execution outperforms fine-grain execution on a significant number of numeric codes. These results suggest that the effect of increased instruction granularity on latency is minimal for a high percentage of the measured codes, and is in large part offset by available intrathread locality. Furthermore, simulation results indicate that strict or non-strict data structure access does not change the basic cluster characteristics.

14.
高岚, 赵雨晨, 张伟功, 王晶, 钱德沛. Journal of Software (软件学报), 2024, 35(2): 1028-1047.
Parallel computing has become the mainstream. In parallel computing systems, synchronization is one of the key design issues and is crucial for fully exploiting hardware performance. In recent years, the GPU (graphics processing unit) has developed rapidly as the most widely used accelerator, and many applications place ever higher demands on GPU thread synchronization. However, existing GPU systems struggle to efficiently support the complex thread synchronization found in real applications. Although researchers have proposed many methods for GPU thread synchronization and made considerable progress, the GPU's unique architecture and parallel execution model leave the study of GPU thread synchronization facing many challenges. This survey classifies thread synchronization in GPU parallel programming according to synchronization purpose and granularity. On this basis, centering on the expression and execution of GPU thread synchronization, it first analyzes and summarizes the key problems and challenges: synchronization that is hard to express efficiently, error-prone, and inefficient to execute. Then, for each synchronization granularity, it reviews recent academic and industrial research on GPU competitive and cooperative synchronization from two angles, synchronization expression methods and performance optimization methods, and analyzes and summarizes the existing approaches. Finally, it points out future research trends and prospects for GPU thread synchronization and suggests possible research directions, providing a reference for researchers in this area.

15.
One of the main reasons for using parallel evolutionary algorithms (PEAs) is to obtain efficient algorithms whose execution time is much lower than that of their sequential counterparts, in order, e.g., to tackle more complex problems. This naturally leads to measuring the speedup of the PEA. PEAs have sometimes been reported to provide super-linear performance for different problems, parameterizations, and machines. Super-linear speedup means that using “m” processors yields an algorithm that runs more than “m” times faster than the sequential version. However, reporting super-linear speedup is controversial, especially for the “traditional” research community, since unorthodox measurement practices could be suspected of causing this result. Therefore, we begin by offering a taxonomy of speedup in order to clarify what is being measured, and we analyze the sources of such a scenario. Finally, we study an assorted set of results. Our conclusion is that super-linear performance is possible for PEAs, theoretically and in practice, on both homogeneous and heterogeneous parallel machines.
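In the standard notation (consistent with the taxonomy the paper motivates), with T_1 the runtime on one processor and T_m the runtime on m processors:

```latex
S_m = \frac{T_1}{T_m}, \qquad E_m = \frac{S_m}{m}, \qquad
\text{super-linear speedup:}\ S_m > m \ \ (\text{equivalently } E_m > 1).
```

Part of the controversy is what T_1 denotes: the parallel algorithm run on one processor or the best sequential algorithm, which is exactly the distinction a speedup taxonomy must pin down.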

16.
In this paper, we describe lazy threads, a new approach for implementing multithreaded execution models on conventional machines. We show how they can implement a parallel call at nearly the efficiency of a sequential call. The central idea is to specialize the representation of a parallel call so that it can execute as a parallel-ready sequential call. This allows excess parallelism to degrade into sequential calls, with the attendant efficient stack management and direct transfer of control and data, yet a call that truly needs to execute in parallel gets its own thread of control. The efficiency of lazy threads is achieved through careful attention to storage management and a code generation strategy that allows us to represent potential parallel work with no overhead.
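The degradation idea can be caricatured in C (a hypothetical runtime stub, far simpler than the paper's compiler-supported mechanism): a potentially parallel call falls back to an ordinary sequential call whenever no idle processor is available:

```c
#include <stdio.h>
#include <stdbool.h>

/* Lazy-thread flavor, greatly simplified: a potentially parallel call
 * executes as a plain sequential call unless an idle processor can
 * take it as a real thread. Both runtime hooks below are stubs. */
static bool idle_processor_available(void) { return false; /* stub */ }
static void spawn_on_idle_processor(void (*f)(int), int arg) {
    f(arg);  /* stub: a real runtime would create a thread here */
}

static void child(int n) { printf("child(%d)\n", n); }

static void pcall(void (*f)(int), int arg) {
    if (idle_processor_available())
        spawn_on_idle_processor(f, arg);  /* true fork: own thread      */
    else
        f(arg);                           /* degrade to sequential call */
}

int main(void) {
    pcall(child, 42);   /* runs sequentially here: no idle processor */
    return 0;
}
```

The real mechanism avoids even the branch shown here by compiling the call so that it is "parallel-ready": the stub merely illustrates the excess-parallelism-degrades-to-a-call behavior.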

17.
The last five years have been a period of exponential growth in the number of machines connected to the Internet and in the speed at which these machines communicate. The infrastructure is now in place to consider a nationwide cluster of workstations as a viable parallel processing platform. In order to achieve acceptable performance on this kind of machine, performance prediction tools must provide information on where to place computational objects; incorrect object placement can result in poor performance and congestion in the network. This research develops a new paradigm for predicting performance in the wide area network (WAN) based cluster arena. Statistical samples of the performance of clusters and applications are used to build characteristic surfaces, which are then used to guide the placement of new applications. This prediction method is intended to minimize both the execution time of the application and the impact of the application on the nationwide virtual machine. Performance prediction tools are an important prerequisite to effectively utilizing WAN-based clusters.

18.
How to effectively use the abundant transistor resources provided by multicore processors to accelerate the execution of serial programs is a hot topic in current research. Thread-level speculation (TLS) aims to make full use of multicore resources and extract as much of the potential parallelism in serial code as possible. TLS has been applied effectively to parallelize many kinds of serial applications, but embedded applications have not yet been analyzed thoroughly with respect to speculative parallelization. We therefore select eight representative embedded applications and examine their potential for performance improvement under loop-level speculative parallelization, as well as their runtime characteristics (data dependences, thread granularity, and parallel coverage). Experimental results show that parallelizing embedded applications with thread-level speculation outperforms instruction-level parallelism techniques, with a maximum speedup of 13.29 in our experiments; in the embedded domain, the technique can effectively exploit the computing resources of 4 to 8 cores.
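A toy illustration of the speculate/validate/squash cycle in C (a software emulation with an invented access pattern; real TLS tracks read and write sets in hardware or a runtime): iterations run as if independent, recorded reads are checked for cross-iteration read-after-write conflicts, and on a violation the loop is squashed and re-executed sequentially:

```c
#include <stdio.h>
#include <string.h>

#define N 8

/* Toy flavor of loop-level speculation, illustrative only: execute
 * iterations as if independent, record which element each iteration
 * read, then squash and re-execute sequentially if any iteration read
 * an element written by an earlier iteration. */
int main(void) {
    int a[N] = {1,2,3,4,5,6,7,8}, backup[N], dep[N];
    memcpy(backup, a, sizeof a);

    for (int i = 0; i < N; i++) {      /* "parallel" speculative run    */
        int src = (i * 3) % N;         /* index this iteration reads    */
        dep[i] = src;                  /* in a real run, possibly stale */
        a[i] = a[src] + 1;
    }

    int violated = 0;                  /* validate recorded read set    */
    for (int i = 0; i < N; i++)
        if (dep[i] < i) violated = 1;  /* cross-iteration RAW conflict  */

    if (violated) {                    /* squash: restore and redo      */
        memcpy(a, backup, sizeof a);
        for (int i = 0; i < N; i++) a[i] = a[(i * 3) % N] + 1;
        printf("misspeculation: re-executed sequentially\n");
    } else {
        printf("speculation succeeded\n");
    }
    return 0;
}
```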

19.
We present a novel algorithm to partition large 3D meshes for GPU-accelerated decompression. Our formulation focuses on minimizing the vertices replicated between patches and balancing the number of faces per patch for efficient parallel computing. First we generate a topology model of the original mesh and remove vertex positions. Then we assign the centers of patches using geodesic farthest-point sampling and cluster the faces according to their geodesic distance to the centers. After the segmentation we swap boundary faces to fix jagged boundaries, and we store the boundary vertices so that the whole mesh is preserved. The decompression of each patch runs on a GPU thread, and we evaluate its performance on various large benchmarks. In practice, the GPU-based decompression algorithm runs more than 48x faster on an NVIDIA GeForce GTX 580 GPU than on a single CPU core.
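The center-selection step, farthest-point sampling, can be sketched in C (using Euclidean distance as a stand-in for the paper's geodesic distance, on made-up points): each new center is the point farthest from all centers chosen so far, and every point is assigned to its nearest center:

```c
#include <stdio.h>
#include <math.h>

#define NPTS 10
#define NCENTERS 3

/* Farthest-point sampling sketch: Euclidean distance stands in for
 * the paper's geodesic distance, and the points are made up. Each new
 * patch center is the point farthest from all chosen centers. */
int main(void) {
    double pts[NPTS][3] = {
        {0,0,0},{1,0,0},{2,0,0},{0,1,0},{1,1,0},
        {2,1,0},{0,2,0},{1,2,0},{2,2,0},{5,5,5}
    };
    int centers[NCENTERS] = {0};            /* seed with point 0 */
    int owner[NPTS];
    double mind[NPTS];
    for (int i = 0; i < NPTS; i++) { mind[i] = 1e30; owner[i] = 0; }

    for (int c = 0; c < NCENTERS; c++) {
        int cur = centers[c];
        for (int i = 0; i < NPTS; i++) {    /* relax distance to centers */
            double dx = pts[i][0] - pts[cur][0];
            double dy = pts[i][1] - pts[cur][1];
            double dz = pts[i][2] - pts[cur][2];
            double d = sqrt(dx*dx + dy*dy + dz*dz);
            if (d < mind[i]) { mind[i] = d; owner[i] = c; }
        }
        if (c + 1 < NCENTERS) {             /* farthest point is next center */
            int far = 0;
            for (int i = 1; i < NPTS; i++)
                if (mind[i] > mind[far]) far = i;
            centers[c + 1] = far;
        }
    }
    for (int i = 0; i < NPTS; i++)
        printf("point %d -> patch %d\n", i, owner[i]);
    return 0;
}
```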

20.
Research on the Real-Time Capabilities of Windows NT
This paper studies the core mechanisms of Windows NT that favor real-time processing, covering interrupt handling, thread scheduling, virtual memory management, and the I/O system. Through experiments, it measures a series of performance metrics, including NT's interrupt response time and interrupt loss rate, demonstrating that NT is a good soft real-time platform. Finally, it describes the application of NT to the software-based numerical control of the HRP series of rapid prototyping machines.
