Similar literature
 20 similar documents found (search time: 328 ms)
1.
In simultaneous multithreading (SMT) multiprocessors, using all the available threads (logical processors) to run a parallel loop is not always beneficial due to the interference between threads and parallel execution overhead. To maximize the performance of a parallel loop on an SMT multiprocessor, it is important to find an appropriate number of threads for executing the parallel loop. This article presents adaptive execution techniques that find a proper execution mode for each parallel loop in a conventional loop-level parallel program on SMT multiprocessors. A compiler preprocessor generates code that, based on dynamic feedback, automatically determines at run time the optimal number of threads for each parallel loop in the parallel application. We evaluate our technique using a set of standard numerical applications and running them on a real SMT multiprocessor machine with 8 hardware contexts. Our approach is general enough to work well with other SMT multiprocessor or multicore systems.
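
The run-time selection idea in this abstract can be illustrated with a minimal sketch. It is not the article's generated code: the function names and the cost model (`modeled_time`, with made-up constants for work and per-thread overhead) are assumptions chosen only to show how timing feedback could drive the choice of thread count.

```python
from typing import Callable, Iterable


def pick_thread_count(time_loop: Callable[[int], float],
                      candidates: Iterable[int]) -> int:
    """Return the candidate thread count with the smallest measured time.

    `time_loop(n)` is assumed to run (or model) one execution of the
    parallel loop with n threads and return its elapsed time.
    """
    best_n, best_t = None, float("inf")
    for n in candidates:
        t = time_loop(n)
        if t < best_t:
            best_n, best_t = n, t
    if best_n is None:
        raise ValueError("no candidate thread counts given")
    return best_n


def modeled_time(n: int) -> float:
    """Hypothetical cost model: perfectly divisible work plus an overhead
    that grows with every extra thread (interference, fork/join cost)."""
    work, overhead_per_thread = 8.0, 0.5
    return work / n + overhead_per_thread * (n - 1)


if __name__ == "__main__":
    # With 8 hardware contexts, the modeled optimum is below the maximum.
    print(pick_thread_count(modeled_time, range(1, 9)))
```

In a real program `time_loop` would time an actual loop execution; the point of the sketch is only that the fastest setting need not be the largest thread count.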

2.
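The high accuracy surface modeling (HASM) method, developed on the basis of surface theory, can build high-accuracy digital elevation models, but it requires solving the large-scale linear systems produced by discretizing partial differential equations. The resulting computational cost severely limits its use for simulating large data sets. Meanwhile, advances in modern GPU technology have made GPUs increasingly widely used for accelerating general-purpose computation. To speed up HASM simulation, high-accuracy surface modeling is combined with general-purpose GPU computing, and a GPU-accelerated HASM method is proposed. The finite-difference discretization in the HASM simulation and the solution of the resulting large-scale linear system are each decomposed on the GPU. The conjugate gradient (CG) and preconditioned conjugate gradient (PCG) methods are used to split the solve into independent tasks that can be processed in parallel, so that the computation is parallelized across a large number of concurrently running threads, each executing an independent task; this makes full use of the general-purpose computing power of modern GPUs. The parallel HASM algorithm is implemented with NVIDIA's Compute Unified Device Architecture (CUDA), and the GPU used is an NVIDIA Quadro 2000. The algorithm was applied in numerical experiments and in digital elevation model (DEM) simulation experiments for a real project area. The results show that, for the same surface-simulation accuracy, the GPU-accelerated HASM method achieves a speedup of more than one order of magnitude over the traditional CPU method.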
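
For readers unfamiliar with the conjugate gradient (CG) method named in this abstract, the following is a minimal serial sketch in plain Python. It is the textbook CG iteration for a symmetric positive definite system, not the paper's CUDA implementation; the helper names `dot` and `matvec` are assumptions. Those two helpers are the data-parallel operations that a GPU version would typically distribute across threads.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    """Dense matrix-vector product; each row is an independent task."""
    return [dot(row, x) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A, starting from x = 0."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x, with x = 0
    p = list(r)            # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break          # residual norm is small enough
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


if __name__ == "__main__":
    A = [[4.0, 1.0], [1.0, 3.0]]
    b = [1.0, 2.0]
    print(conjugate_gradient(A, b))  # close to [1/11, 7/11]
```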

3.
This paper describes an experimental study of three dataflow paradigms, namely, no dataflow, pipelined dataflow, and network dataflow, in multithreaded database transitive closure algorithms on shared memory multiprocessors. This study shows that the dataflow paradigm directly influences performance parameters such as the amount of interthread communication, how data are partitioned among the threads, whether access to each page of data is exclusive or shared, whether locks are needed for concurrency control, and how calculation termination is detected. The algorithm designed with no dataflow outperforms the algorithms with dataflow. Approximately linear speedup is achieved by the no dataflow algorithm with sufficient workload and primary memory. An exclusive access working set model and a shared access working set model describe the interactions between two or more threads' working sets when access to each page of data is exclusive or shared among the threads, respectively. These models are experimentally verified.

4.
Distributed systems are an alternative to shared-memory multiprocessors for the execution of parallel applications. Panda is a run-time system that provides architectural support for efficient parallel and distributed programming. It supplies fast user-level threads and a means for transparent and coordinated sharing of objects across a homogeneous network. The paper motivates the major architectural choices that guided our design. The problem of sharing data in a distributed environment is discussed, and the performance of the mechanisms provided by the Panda prototype implementation is assessed.

5.
As chip multiprocessors with simultaneous multithreaded cores are becoming commonplace, there is a need for simple approaches to exploit thread-level parallelism. In this paper, we consider thread-level speculation as a means to reap thread-level parallelism out of application binaries. We first investigate the tradeoffs between scheduling speculative threads on the same core and on different cores. While threads contend for the same resources using the former approach, the latter approach is plagued by the overhead for inter-core communication. Despite the impact of resource contention, our detailed simulations show that the first approach provides the best performance due to lower inter-thread communication cost. The key contribution of the paper is the proposed design and evaluation of the dual-thread speculation system. This design point has very low complexity and reaps most of the gains of a system. The work was carried out while Fredrik Warg was a graduate student at Chalmers University of Technology.

6.
Simultaneous Multithreading (SMT) is a processor architectural technique that promises to significantly improve the utilization and performance of modern wide-issue superscalar processors. An SMT processor is capable of issuing multiple instructions from multiple threads to a processor's functional units each cycle. Unlike shared-memory multiprocessors, SMT provides and benefits from fine-grained sharing of processor and memory system resources; unlike current uniprocessors, SMT exposes and benefits from inter-thread instruction-level parallelism when hiding long-latency operations. Compiler optimizations are often driven by specific assumptions about the underlying architecture and implementation of the target machine, particularly for parallel processors. For example, when targeting shared-memory multiprocessors, parallel programs are compiled to minimize sharing, in order to decrease high-cost inter-processor communication. Therefore, optimizations that are appropriate for these conventional machines may be inappropriate for SMT, which can benefit from fine-grained resource sharing within the processor. This paper reexamines several compiler optimizations in the context of simultaneous multithreading. We revisit three optimizations in this light: loop-iteration scheduling, software speculative execution, and loop tiling. Our results show that all three optimizations should be applied differently in the context of SMT architectures: threads should be parallelized with a cyclic, rather than a blocked, algorithm; non-loop programs should not be software speculated; and compilers no longer need to be concerned about precisely sizing tiles to match cache sizes. By following these new guidelines, compilers can generate code that improves the performance of programs executing on SMT machines.
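
The difference between the cyclic and blocked loop-iteration schedules mentioned in this abstract can be shown with a small sketch. The function names are assumptions, and the code only computes which iterations each thread would receive; it does not run anything in parallel.

```python
def blocked_partition(n_iters: int, n_threads: int) -> list[list[int]]:
    """Blocked schedule: each thread gets one contiguous chunk of iterations."""
    chunk = -(-n_iters // n_threads)  # ceiling division
    return [list(range(t * chunk, min((t + 1) * chunk, n_iters)))
            for t in range(n_threads)]


def cyclic_partition(n_iters: int, n_threads: int) -> list[list[int]]:
    """Cyclic schedule: thread t gets iterations t, t + T, t + 2T, ..."""
    return [list(range(t, n_iters, n_threads)) for t in range(n_threads)]


if __name__ == "__main__":
    print(blocked_partition(8, 4))  # [[0, 1], [2, 3], [4, 5], [6, 7]]
    print(cyclic_partition(8, 4))   # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Under the cyclic schedule, threads running at the same moment touch neighbouring iterations, which is the kind of fine-grained sharing the abstract says SMT can benefit from.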

7.
We present a novel and portable threads-based system for concurrent applications on shared- and distributed-memory environments. The Ariadne system provides stateful user-space threads that can be very effective in medium to coarse grained applications. The interface is the same for uniprocessors and multiprocessors. Sequential programs are readily converted into parallel programs for shared or distributed memory, with low development effort. Ariadne supports the development of customized schedulers, and offers a thread migration capability in distributed environments. Scheduling of computations at the threads level enables both task- and data-driven executions. Thread migration is a useful feature which turns remote memory accesses into local accesses, enables load-balancing and simplifies program development. Ariadne employs a unique runtime stack rewriting mechanism to migrate threads between homogeneous processors. Ariadne currently runs on the SPARC (SunOS 4.x, SunOS 5.x), Sequent Symmetry, Intel Paragon, Silicon Graphics IRIX and IBM RS/6000 environments. We present some examples of Ariadne programs, along with performance measurements. © 1998 John Wiley & Sons, Ltd.

8.
A tutorial on dependability and performance-related dependability models for multiprocessors is presented. Multiprocessors are classified as having shared-memory or distributed-memory architectures, and some fundamental dependability modeling concepts are introduced. Reliability models based on four types of reliability evaluation techniques (terminal, multiterminal, task-based, and network reliability) are examined. The status of research efforts on performance-related dependability is discussed, and the models' effectiveness is illustrated with a few numerical examples. A brief survey of software packages for dependability computation is included.

9.
A parallel processing methodology for high-speed dynamic simulation of controlled multibody mechanical systems is proposed for shared memory multiprocessor implementation. A dual-rate integration method is developed first to account for different time scales of mechanical and control subsystems and to employ different integration algorithms and step sizes that are best suited for individual subsystems. A parallel processing algorithm is designed for shared memory multiprocessors to exploit nested parallelism at a high as well as a medium level. Procedure dependency due to the recurrence relations in the Newton-Euler formulation is eliminated by utilizing a modified system graph that creates independent parallel threads. The effectiveness of the proposed approach is demonstrated using an example on an Alliant FX/8.

10.

The Louvain community detection algorithm is a hierarchical clustering method for a problem classified as NP-hard. Its execution time to find communities in large graphs is, therefore, a challenge. Parallelization is an effective solution for amortizing Louvain's execution time. In this paper, we propose an adaptive CUDA Louvain method (ACLM) algorithm that benefits from the graphics processing unit (GPU). ACLM uses the shared memory in GPU, as well as the optimal number of threads in the GPU blocks. These features minimize parallelization overhead and accelerate the calculation of modularity parameters. The proposed algorithm allocates threads to each block based on the number of required streaming multiprocessors (SMs) and warps on GPU. The implementation results show that ACLM can effectively accelerate the execution time by 77% compared to the competitive method in the large graph benchmarks.

11.
Programming distributed-memory multiprocessors and networks of workstations requires deciding what can execute concurrently, how processes communicate, and where data is placed. These decisions can be made statically by a programmer or compiler, or they can be made dynamically at run time. Using run-time decisions leads to a simpler interface (because decisions are implicit), and it can lead to better decisions (because more information is available). This paper examines the costs, benefits, and details of making decisions at run time. The starting point is explicit fine-grain parallelism with any number (even thousands) of threads. Five specific techniques are considered: (1) implicitly coarsening the granularity of parallelism, (2) using implicit communication implemented by a distributed shared memory, (3) overlapping computation and communication, (4) adaptively moving threads and data between nodes to minimize communication and balance load, and (5) dynamically remapping data to pages to avoid false sharing. Details are given on the performance of each of these techniques as well as on their overall performance for several scientific applications.

12.
In this paper, we present the first system that implements OpenMP on a network of shared-memory multiprocessors. This system enables the programmer to rely on a single, standard, shared-memory API for parallelization within a multiprocessor and between multiprocessors. It is implemented via a translator that converts OpenMP directives to appropriate calls to a modified version of the TreadMarks software distributed shared-memory (SDSM) system. In contrast to previous SDSM systems for SMPs, the modified TreadMarks system uses POSIX threads for parallelism within an SMP node. This approach greatly simplifies the changes required to the SDSM in order to exploit the intranode hardware shared memory. We present performance results for seven applications (Barnes-Hut, CLU, and Water from SPLASH-2, 3D-FFT from NAS, Red-Black SOR, TSP, and MGS) running on an SP2 with four four-processor SMP nodes. A comparison between the thread implementation and the original implementation of TreadMarks shows that using the hardware shared memory within an SMP node significantly reduces the amount of data and the number of messages transmitted between nodes and consequently achieves speedups that are up to 30% better than the original versions. We also compare SDSM against message passing. Overall, the speedups of multithreaded TreadMarks programs are within 7–30% of the MPI versions.

13.
We discuss the parallelization of an efficient algorithm for the partial stabilization of large-scale linear control systems in generalized state-space form. The algorithm is composed of highly parallel iterative schemes that appear in the computation of certain matrix functions. Here we evaluate different approaches to exploit parallelism at two levels, based on threads and processes. Our experimental results on a cluster of symmetric multiprocessors and a CC-NUMA platform show that the efficiency of the matrix operations underlying the iterative schemes carries over to the parallel implementation of the stabilization algorithm. Copyright © 2006 John Wiley & Sons, Ltd.

14.
In parallel with the changes in both the architecture domain (the move toward chip multiprocessors, CMPs) and the application domain (the move toward increasingly data-intensive workloads), issues such as performance, energy efficiency and CPU availability are becoming increasingly critical. The CPU availability can change dynamically due to several reasons such as thermal overload, increase in transient errors, or operating system scheduling. An important question in this context is how to adapt, in a CMP, the execution of a given application to CPU availability change at runtime. Our paper studies this problem, targeting the energy-delay product (EDP) as the main metric to optimize. We first argue that, in adapting the application execution to the varying CPU availability, one needs to consider the number of CPUs to use, the number of application threads to accommodate and the voltage/frequency levels to employ (if the CMP has this capability). We then propose to use helper threads to adapt the application execution to CPU availability change in general with the goal of minimizing the EDP. The helper thread runs parallel to the application execution threads and tries to determine the ideal number of CPUs, threads and voltage/frequency levels to employ at any given point in execution. We illustrate this idea using four applications (Fast Fourier Transform, MultiGrid, LU decomposition and Conjugate Gradient) under different execution scenarios. The results collected through our experiments are very promising and indicate that significant EDP reductions are possible using helper threads. For example, we achieved up to 66.3%, 83.3%, 91.2%, and 94.2% savings in EDP when adjusting all the parameters properly in applications FFT, MG, LU, and CG, respectively. We also discuss how our approach can be extended to address multi-programmed workloads.
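
The energy-delay product (EDP) that this abstract optimizes is the product of the energy consumed and the execution time. The sketch below shows one way a selection over (CPUs, threads, frequency) settings could look. It is not the paper's helper-thread mechanism; the `Config` type, the function names and all numbers are hypothetical and serve only to illustrate the metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """One candidate execution setting (all fields are illustrative)."""
    cpus: int
    threads: int
    freq_ghz: float


def edp(energy_joules: float, delay_seconds: float) -> float:
    """Energy-delay product: lower is better."""
    return energy_joules * delay_seconds


def best_config(samples: dict[Config, tuple[float, float]]) -> Config:
    """Pick the configuration whose (energy, delay) sample has minimal EDP."""
    if not samples:
        raise ValueError("no samples given")
    return min(samples, key=lambda c: edp(*samples[c]))


if __name__ == "__main__":
    # Hypothetical (energy in J, delay in s) measurements per setting.
    samples = {
        Config(4, 4, 2.0): (100.0, 10.0),  # EDP 1000
        Config(2, 4, 2.0): (70.0, 13.0),   # EDP 910
        Config(2, 2, 1.5): (50.0, 18.0),   # EDP 900
    }
    print(best_config(samples))
```

Note that the setting with the shortest delay is not the one with the lowest EDP in this made-up data, which is why the metric has to be evaluated jointly over all three parameters.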

15.
This paper describes an approach to system modeling based on heuristic mean value analysis. The virtues of the approach are conceptual simplicity and computational efficiency. The approach can be applied to a large variety of systems, and can handle features such as resource constraints, tightly and loosely coupled multiprocessors, distributed processing, and certain types of CPU priorities. Extensive validation results are presented, including truly predictive situations. The paper is intended primarily as a tutorial on the method and its applications, rather than as an exposition of research results.
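
As background for the term "mean value analysis", here is a minimal sketch of the standard exact MVA recursion for a closed, single-class queueing network. This is the textbook baseline, not the heuristic variant the paper describes, whose details the abstract does not give. The function name and parameters (`demands`, `think_time`) are assumptions.

```python
def mva(demands: list[float], n_customers: int, think_time: float = 0.0):
    """Exact MVA for a closed single-class network of queueing stations.

    demands[k] is the total service demand at station k.
    Returns (throughput, per-station mean queue lengths).
    """
    queue = [0.0] * len(demands)
    throughput = 0.0
    for n in range(1, n_customers + 1):
        # An arriving customer sees the queue left by n - 1 customers.
        resid = [d * (1.0 + q) for d, q in zip(demands, queue)]
        throughput = n / (think_time + sum(resid))
        # Little's law applied at each station.
        queue = [throughput * r for r in resid]
    return throughput, queue


if __name__ == "__main__":
    x, q = mva([1.0, 1.0], n_customers=2)
    print(x, q)  # throughput 2/3, one customer on average at each station
```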

16.
This paper describes an approach to system modeling based on heuristic mean value analysis. The virtues of the approach are conceptual simplicity and computational efficiency. The approach can be applied to a large variety of systems, and can handle features such as resource constraints, tightly and loosely coupled multiprocessors, distributed processing, and certain types of CPU priorities. Extensive validation results are presented, including truly predictive situations. The paper is intended primarily as a tutorial on the method and its applications, rather than as an exposition of research results.

17.
This paper extends research into rhombic overlapping-connectivity interconnection networks into the area of parallel applications. As a foundation for a shared-memory non-uniform access bus-based multiprocessor, these interconnection networks create overlapping groups of processors, buses, and memories, forming a clustered computer architecture where the clusters overlap. This overlapping-membership characteristic is shown to be useful for matching parallel application communication topology to the architecture's bandwidth characteristics. Many parallel applications can be mapped to the architecture topology so that most or all communication is localized within an overlapping cluster, at the low latency of processor direct to cache (or memory) over a bus. The latency of communication between parallel threads does not degrade parallel performance or limit the graininess of applications. Parallel applications can execute with good speedup and scaling on a proposed architecture which is designed to obtain maximum advantage from the overlapping-cluster characteristic, and also allows dynamic workload migration without moving the instructions or data. Scalability limitations of bus-based shared-memory multiprocessors are overcome by judicious workload allocation schemes that take advantage of the overlapping-cluster memberships. Bus-based rhombic shared-memory multiprocessors are examined in terms of parallel speedup models to explain their advantages and justify their use as a foundation for the proposed computer architecture. Interconnection bandwidth is maximized with bi-directional circular and segmented overlapping buses. Strategies for mapping parallel application communication topologies to rhombic architectures are developed. Analytical models of enhanced rhombic multiprocessor performance are developed with a unique bandwidth modeling technique, and are compared with the results of simulation.

18.
Noise and radiation-induced soft errors (transient faults) in computer systems have increased significantly over the last few years and are expected to increase even more as we move toward smaller transistor sizes and lower supply voltages. Fault detection and recovery can be achieved through redundancy. The emergence of chip multiprocessors (CMPs) makes it possible to execute redundant threads on a chip and provide relatively low-cost reliability. State-of-the-art implementations execute two copies of the same program as two threads (redundant multithreading), either on the same or on separate processor cores in a CMP, and periodically check results. Although this solution has favorable performance and reliability properties, every redundant instruction flows through a high-frequency complex out-of-order pipeline, thereby incurring a high power consumption penalty. This paper proposes mechanisms that attempt to provide reliability at a modest power and complexity cost. When executing a redundant thread, the trailing thread benefits from the information produced by the leading thread. We take advantage of this property and comprehensively study different strategies to reduce the power overhead of the trailing core in a CMP. These strategies include dynamic frequency scaling, in-order execution, and parallelization of the trailing thread.

19.
Transactional Memory is a concurrent programming API in which concurrent threads synchronize via transactions (instead of locks). Although this model has mostly been studied in the context of multiprocessors, it has attractive features for distributed systems as well. In this paper, we consider the problem of implementing transactional memory in a network of nodes where communication costs form a metric. The heart of our design is a new cache-coherence protocol, called the Ballistic protocol, for tracking and moving up-to-date copies of cached objects. For constant-doubling metrics, a broad class encompassing both Euclidean spaces and growth-restricted networks, this protocol has stretch logarithmic in the diameter of the network. Supported by NSF grant 0410042 and by grants from Intel Corporation and Sun Microsystems.

20.
Threads exhibit a simply expressed and powerful form of concurrency, easily exploitable in applications that run on both uni- and multi-processors, shared- and distributed-memory systems. This paper presents the design and implementation of Ariadne: a layered, C-based software architecture for multi-threaded distributed computing on a variety of platforms. Ariadne is a portable user-space threads system that runs on shared- and distributed-memory multiprocessors. Thread-migration is supported at the application level in homogeneous environments (e.g., networks of SPARCs and Sequent Symmetrys, Intel hypercubes). Threads may migrate between processes to access remote data, preserving locality of reference for computations with a dynamic data space. Ariadne can be tuned to specific applications through a customization layer. Support is provided for scheduling via a built-in or application-specific scheduler, and interfacing with any communications library. Ariadne currently runs on the SPARC (SunOS 4.x and SunOS 5.x), Sequent Symmetry, Intel i860, Silicon Graphics workstation (IRIX), and IBM RS/6000 environments. We present simple performance benchmarks comparing Ariadne to threads libraries in the SunOS 4.x and SunOS 5.x systems.
