首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Speculative multithreading (SpMT) promises to be an effective mechanism for parallelizing nonnumeric programs, which tend to have irregular and pointer-intensive data structures and complex flows of control. Proper thread formation is crucial for obtaining good speedup in an SpMT system. This paper presents a compiler framework for partitioning a sequential program into multiple threads for parallel execution in an SpMT system. This framework is very general and supports speculative threads, nonspeculative threads, loop-centric threads, and out-of-order thread spawning. It is therefore useful for compiling for a wide variety of SpMT architectures. For effective partitioning of programs, the compiler uses profiling, interprocedural pointer analysis, data dependence information, and control dependence information. The compiler is implemented on the SUIF-MachSUIF platform. A simulation-based evaluation of the generated threads shows that the use of nonspeculative threads and nonloop speculative threads provides a significant increase in speedup for nonnumeric programs.  相似文献   

2.
Results are reported for a series of experiments involving numerical curve tracking on a shared-memory parallel computer. Several algorithms exist for finding zeros or fixed points of nonlinear systems of equations that are globally convergent for almost all starting points, that is, with probability one. The essence of all such algorithms is the construction of an appropriate homotopy map and then the tracking of some smooth curve in the zero set of this homotopy map. HOMPACK is a mathematical software package implementing globally convergent homotopy algorithms with three different techniques for tracking a homotopy zero curve, and has separate routines for dense and sparse Jacobian matrices. The HOMPACK algorithms for sparse Jacobian matrices use a preconditioned conjugate gradient algorithm for the computation of the kernel of the homotopy Jacobian matrix, a required linear algebra step for homotopy curve tracking. A parallel version of HOMPACK is implemented on a shared-memory parallel computer with various levels and degrees of parallelism (e.g., linear algebra, function, and Jacobian matrix evaluation), and a detailed study is presented for each of these levels with respect to the speedup in execution time obtained with the parallelism, the time spent implementing the parallel code, and the extra memory allocated by the parallel algorithm.  相似文献   

3.
Distributed systems are an alternative to shared-memory multiprocessors for the execution of parallel applications.Panda is a run-time system that provides architectural support for efficient parallel and distributed programming. It supplies fast user-level threads and a means for transparent and coordinated sharing of objects across a homogeneous network. The paper motivates the major architectural choices that guided our design. The problem of sharing data in a distributed environment is discussed, and the performance of the mechanisms provided by thePanda prototype implementation is assessed.  相似文献   

4.
This paper examines a previously unanalyzed bistability phenomenon with respect to the number of threads that are doing useful work. This phenomenon is illustrated by a single work queue on a shared-memory machine. An analysis of designs that use two separate memory accesses to lock and unlock critical sections (split transaction) and that employ a first come/first serve queuing mechanism for shared-memory locations is presented. A bistability in the number of threads working, brought about by these conditions, is analyzed and experimentally demonstrated. A simple analysis is presented which predicts the throughput at a critical section of code as a function of the number of applied threads. The study concludes that the mean size of the work items that can be executed in parallel without the possibility of stalling is proportional to the square of the number of threads applied.  相似文献   

5.
The paper discusses issues pertinent to performance analysis of massively parallel systems. A model of parallel execution based on threads of control and events is then introduced. The key ingredient of this model is a measure of the communication complexity, which gives the number of events E as a function of the number of threads of control P and provides a signature of a parallel computation. Various consequences for speedup and load balancing are presented.  相似文献   

6.
This paper extends research into rhombic overlapping-connectivity interconnection networks into the area of parallel applications. As a foundation for a shared-memory non-uniform access bus-based multiprocessor, these interconnection networks create overlapping groups of processors, buses, and memories, forming a clustered computer architecture where the clusters overlap. This overlapping-membership characteristic is shown to be useful for matching parallel application communication topology to the architecture's bandwidth characteristics. Many parallel applications can be mapped to the architecture topology so that most or all communication is localized within an overlapping cluster, at the low latency of processor direct to cache (or memory) over a bus. The latency of communication between parallel threads does not degrade parallel performance or limit the graininess of applications. Parallel applications can execute with good speedup and scaling on a proposed architecture which is designed to obtain maximum advantage from the overlapping-cluster characteristic, and also allows dynamic workload migration without moving the instructions or data. Scalability limitations of bus-based shared-memory multiprocessors are overcome by judicious workload allocation schemes, that take advantage of the overlapping-cluster memberships. Bus-based rhombic shared-memory multiprocessors are examined in terms of parallel speedup models to explain their advantages and justify their use as a foundation for the proposed computer architecture. Interconnection bandwidth is maximized with bi-directional circular and segmented overlapping buses. Strategies for mapping parallel application communication topologies to rhombic architectures are developed. Analytical models of enhanced rhombic multiprocessor performance are developed with a unique bandwidth modeling technique, and are compared with the results of simulation.  相似文献   

7.
Transactional memory programs may have dynamic available parallelism, which is defined as the number of transactions that can be committed concurrently. Prior work presented adaptive concurrency control, which adapts the number of active threads at runtime, and thus the number of concurrently executing transactions, based on available parallelism. Reducing threads when available parallelism is low, and vice versa, improved speedup and reduced wasted work (in aborted transactions). However, prior work did not consider the case where individual threads exhibit dynamic available parallelism. Deactivating threads with low available parallelism, and vice versa, may improve speedup and reduce wasted work further. This paper introduces weighted adaptive concurrency control to exploit the variance in available parallelism between threads. Four algorithms are designed, implemented, and evaluated. They improve speedup and reduce wasted work over prior non-weighted algorithms in applications whose threads exhibit such variance, while maintaining performance parity in applications whose threads do not.  相似文献   

8.
This paper presents a new parallelization model, called coarse-grained thread pipelining, for exploiting speculative coarse-grained parallelism from general-purpose application programs in shared-memory multiprocessor systems. This parallelization model, which is based on the fine-grained thread pipelining model proposed for the superthreaded architecture, allows concurrent execution of loop iterations in a pipelined fashion with runtime data-dependence checking and control speculation. The speculative execution combined with the runtime dependence checking allows the parallelization of a variety of program constructs that cannot be parallelized with existing runtime parallelization algorithms. The pipelined execution of loop iterations in this new technique results in lower parallelization overhead than in other existing techniques. We evaluated the performance of this new model using some real applications and a synthetic benchmark. These experiments show that programs with a sufficiently large grain size compared to the parallelization overhead obtain significant speedup using this model. The results from the synthetic benchmark provide a means for estimating the performance that can be obtained from application programs that will be parallelized with this model. The library routines developed for this thread pipelining model are also useful for evaluating the correctness of the codes generated by the superthreaded compiler and in debugging and verifying the simulator for the superthreaded processor  相似文献   

9.
PRESTO is a programming system for writing object-oriented parallel programs in a multiprocessor environment. PRESTO provides the programmer with a set of pre-defined object types that simplify the construction of parallel programs. Examples of PRESTO objects are threads, which provide fine-grained control over a program's execution, and synchronization objects, which allow simultaneously executing threads to co-ordinate their activities. The goals of PRESTO are to provide a programming environment that makes it easy to express concurrent algorithms, to do so efficiently, and to do so in a manner that invites extensions and modifications. The first two goals, which are the focus of this paper, allow a programmer to use parallelism in a way that is naturally suited to the problem at hand, rather than being constrained by the limitations of a particular underlying kernel or hardware architecture. The third goal is touched upon but not emphasized in this paper. PRESTO is written in C++; it currently runs on the Sequent shared-memory multiprocessor on top of the Dynix operating system. In this paper we describe the system model, its applicability to parallel programming, experiences with the initial implementation, and some early performance measurements.  相似文献   

10.
In Solaris, threads are frequently relocated. The data associated with a relocated thread have to be moved from the cache of the old processor to the new processor. In order to avoid poor memory performance due to thread relocation, threads can be bound to processors—static scheduling. Finding a static schedule which results in maximum speedup is NP-hard. It is even difficult to determine if a static schedule is close to the optimal case or not. Here, a technique for predicting the speedup of multithreaded Solaris programs is presented. Based on an existing theoretical result, a lower bound on the maximal speedup is also obtained. The predicted speedup and the bound are based on recordings from a single-processor execution. When comparing the predictions with the real speedup using a multiprocessor with eight processors, we see that the predictions are very good. By comparing the speedup of a static schedule with the bound, we see that it is worthwhile to look for other schedules.  相似文献   

11.
Several useful compiler and program transformation techniques for the superthreaded architectures are presented in this paper. The superthreaded architecture adopts a thread pipelining execution model to facilitate runtime data dependence checking between threads, and to maximize thread overlap to enhance concurrency. In this paper, we present some important program transformation techniques to facilitate concurrent execution among threads, and to manage critical system resources such as the memory buffers effectively. We evaluate the effectiveness of those program transformation techniques by applying them manually on several benchmark programs, and using a trace-driven, cycle-by-cycle superthreaded processor simulator. The simulation results show that a superthreaded processor can achieve promising speedup for most of the benchmark programs.  相似文献   

12.
This paper presents the implementation and performance results of anand-parallel execution model of logic programs on a shared-memory multiprocessor. The execution model is meant for logic programs with “don't-know nondeterminism”, and handles binding conflicts by dynamically detecting dependencies among literals. The model also incorporates intelligent backtracking at the clause level. Our implementation of this model is based upon the Warren Abstract Machine (WAM); hence it retains most of the efficiency of the WAM for sequential segments of logic programs. Performance results on Sequent Balance 21000 show that on suitable programs, our parallel implementation can achieve linear speedup on dozens of processors. We also present an analysis of different overheads encountered in the implementation of the execution model.  相似文献   

13.
GPGPU加速器是当前提高图像处理算法性能的主流加速平台,但是,在GPGPU平台上,同一个程序充分利用硬件体系结构特征和软件特征的优化版本与简单实现版本在性能上会有数量级的差异。GPGPU加速器具有多维多层的大量执行线程和层次化存储体系结构,后者的不同层次具有不同的容量、带宽、延迟和访问权限。同时,图像处理应用程序具有复杂的计算操作、边界处理规则和数据访问特性。因此,任务的并发执行模式、线程的组织方式和并发任务到设备的映射不仅影响到程序的并发度、调度、通信和同步等特性,而且也会影响到访存的带宽、延迟等。因此,GPGPU平台上的程序优化是一个困难、复杂且效率较低的过程。本文提出基于语言扩展的领域编程模型:ParaC。ParaC编程环境利用高层语言扩展描述的程序语义信息,自动分析获取应用程序的操作信息、并发任务间的数据重用信息和访存信息等程序特征,同时结合硬件平台特征,利用基于领域先验知识驱动的编译优化模型自动生成GPGPU平台上的优化代码,最后,利用源源变换编译器生成标准OpenCL程序。本文在测试用例上的实验结果表明,ParaC在GPGPU平台上自动生成的优化版本相对于手工优化版本的加速比最高达到3.22倍,但代码行数只是后者的1.2%到39.68%。  相似文献   

14.
任务并行是并行程序设计的基础设计模式.但由于算法本身的复杂性及目标平台的特殊性,设计实现高效率的任务并行程序对程序员来说往往充满挑战.基于新兴的SW26010众核CPU,本文提出支持任务嵌套并行模式的通用运行时框架SWAN.SWAN对任务并行程序的实现提供了高层次的抽象,使程序员能够专注于算法逻辑本身而提高开发效率.在性能方面,SWAN框架对诸多共享资源进行了细粒度的划分,从而有效地避免了众多线程间对共享资源的高强度争用.本文还充分利用平台的高速访存机制,高速可控缓存和原子操作等特性,对SWAN框架的核心数据结构进行优化设计以降低其本身的性能开销.另外,SWAN还具备动态负载均衡能力使得各个处理器核心的资源得以充分利用.本文基于SWAN框架在目标平台上实现了若干典型的具有递归特性的嵌套并行算法,包括N-皇后问题,二叉树遍历,快速排序和凸包求解.实验表明,这些通过使用SWAN框架得以并行化的算法相对其串行版本取得了4.5至32倍的加速,充分说明了SWAN框架具有较高的实用性及性能.  相似文献   

15.
孙乔  黎雷生  赵海涛  赵慧  吴长茂 《软件学报》2021,32(8):2352-2364
任务并行是并行程序设计的基础设计模式.但由于算法本身的复杂性及目标平台的特殊性,设计实现高效率的任务并行程序对程序员来说往往充满挑战.基于新兴的SW26010众核CPU,提出了支持任务嵌套并行模式的通用运行时框架SWAN.SWAN对任务并行程序的实现提供了高层次的抽象,使程序员能够专注于算法逻辑本身而提高开发效率.在性能方面,SWAN框架对诸多共享资源进行了细粒度的划分,从而有效地避免了众多线程间对共享资源的高强度争用.充分利用平台的高速访存机制、高速可控缓存和原子操作等特性,对SWAN框架的核心数据结构进行优化设计以降低其本身的性能开销.SWAN还具备动态负载均衡能力,使各个处理器核心的资源得以充分利用.基于SWAN框架,在目标平台上实现了若干典型的具有递归特性的嵌套并行算法,包括N-皇后问题、二叉树遍历、快速排序和凸包求解.实验结果表明,这些通过使用SWAN框架得以并行化的算法相对于其串行版本取得了4.5~32倍的加速,充分说明了SWAN框架具有较高的实用性及性能.  相似文献   

16.
“Smart” is a new system-level simulation environment that was developed in order to evaluate and improve algorithms for distributed and parallel systems. In this paper we focus our discussion on the developing of new cache coherency mechanisms that were optimized to handle system-level effects such as process switching and task migration. The developing of new cache coherency protocols is a good example to demonstrate many of the important features of Smart, since system-level events have a major influence on the effectiveness of different cache coherency policies and the overall performance of multicache systems. The Smart simulation environment was built as a separate layer that extends existing multi-processing simulators, so we could take the advantage of using mature and reliable simulation engines. Smart also provides a friendly graphical user interface (GUI) that allows: (1) control of different system parameters and mechanisms such as the cache coherency protocol type, cache organization, scheduling policies of processes and threads, etc., (2) simulation of the execution of shared-memory parallel architecture and measuring different systems' performance parameters and (3) use as a powerful visual based debugging tool. Although this paper presents a version of Smart which is dedicated for shared-bus architectures, other libraries of the tool can simulate different parallel and distributed architectures as well.  相似文献   

17.
Simultaneous Multithreading (SMT) is a processor architectural technique that promises to significantly improve the utilization and performance of modern wide-issue superscalar processors. An SM T processor is capable of issuing multiple instructions from multiple threads to a processor's functional units each cycle. Unlike shared-memory multiprocessors, SMT provides and benefits from fine-grained sharing of processor and memory system resources; unlike current uniprocessors, SMT exposes and benefits from inter-thread instruction-level parallelism when hiding long-latency operations. Compiler optimizations are often driven by specific assumptions about the underlying architecture and implementation of the target machine, particularly for parallel processors. For example, when targeting shared-memory multiprocessors, parallel programs are compiled to minimize sharing, in order to decrease high-cost inter-processor communication. Therefore, optimizations that are appropriate for these conventional machines may be inappropriate for SMT, which can benefit from finegrained resource sharing within the processor. This paper reexamines several compiler optimizations in the context of simultaneous multithreading. We revisit three optimizations in this light: loop-iteration scheduling, software speculative execution, and loop tiling. Our results show that all three optimizations should be applied differently in the context of SMT architectures: threads should be parallelized with a cyclic, rather than a blocked algorithm; non-loop programs should not be software speculated, and compilers no longer need to be concerned about precisely sizing tiles to match cache sizes. By following these new guidelines, compilers can generate code that improves the performance of programs executing on SMT machines.  相似文献   

18.
19.
Yu  Hui  Jiang  Xin-Yu  Zhao  Jin  Qi  Hao  Zhang  Yu  Liao  Xiao-Fei  Liu  Hai-Kun  Mao  Fu-Bing  Jin  Hai 《计算机科学技术学报》2022,37(4):797-813

Many systems have been built to employ the delta-based iterative execution model to support iterative algorithms on distributed platforms by exploiting the sparse computational dependencies between data items of these iterative algorithms in a synchronous or asynchronous approach. However, for large-scale iterative algorithms, existing synchronous solutions suffer from slow convergence speed and load imbalance, because of the strict barrier between iterations; while existing asynchronous approaches induce excessive redundant communication and computation cost as a result of being barrier-free. In view of the performance trade-off between these two approaches, this paper designs an efficient execution manager, called Aiter-R, which can be integrated into existing delta-based iterative processing systems to efficiently support the execution of delta-based iterative algorithms, by using our proposed group-based iterative execution approach. It can efficiently and correctly explore the middle ground of the two extremes. A heuristic scheduling algorithm is further proposed to allow an iterative algorithm to adaptively choose its trade-off point so as to achieve the maximum efficiency. Experimental results show that Aiter-R strikes a good balance between the synchronous and asynchronous policies and outperforms state-of-the-art solutions. It reduces the execution time by up to 54.1% and 84.6% in comparison with existing asynchronous and the synchronous models, respectively.

  相似文献   

20.
This paper reports on a parallel implementation of a general 3D multi-block CFD code. The parallelization is achieved by using three strategies. Firstly, it is done on dual-processor PC-clusters where Windows NT systems are running. A multi-thread programming model is adopted for the multi-block code, where one thread corresponds to a block. Shared-memory is used for the exchange of inner-boundaries between neighboring blocks (threads) on the same node, while WinSockets are employed for those on different nodes. Secondly, the parallelization is extended to UNIX operating system. MPI is applied for all the message passing between different processors, including those on the same node. Thirdly, Pthreads (POSIX threads), a standardized application interface for threads, are adopted to take the advantage of the shared-memory feature of the SMP nodes, while MPI is only applied for the message passing between processors on different nodes. In all the strategies, a static load-balancing method is employed for equitable distribution of computational work to specified nodes. The parameters of the present code is studied in detail to facilitate the explanation of the speedup results. Two examples are provided to show the speedup and load balancing of the parallel calculation. Detailed comparison is made to evaluate the efficiency of different strategies.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号