Similar literature
1.
This paper reviews state-of-the-art research on parallel computational models and introduces the models developed over the past decades. According to the architectural features they target, especially memory organization, we classify these models into three generations and discuss the models and their characteristics on that basis. We believe that, with the ever-increasing speed gap between CPUs and memory systems, incorporating a non-uniform memory hierarchy into computational models will become unavoidable. With the emergence of multi-core CPUs, the parallelism hierarchy of current computing platforms is becoming increasingly complicated, and describing it in future computational models is becoming increasingly important. A semi-automatic toolkit that extracts model parameters and their values on real computers can reduce the complexity of model analysis, allowing more complicated models with more parameters to be adopted. Hierarchical memory and hierarchical parallelism will be two very important features to consider in future model design and research.

2.
Memory Complexity Analysis of Numerical Programs   Total citations: 12 (self-citations: 1, citations by others: 11)
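As more and more techniques are used to narrow the widening speed gap between processors and memory, the memory systems of computers are becoming increasingly complex. Today, any programmer, and especially a designer of numerical programs, will find it hard to obtain high performance without considering the characteristics of the memory system of the target computing platform. Using only the traditional way of evaluating algorithms, which starts from time complexity and space complexity, is therefore clearly insufficient to explain the large performance differences between different implementations of one algorithm on the same platform. The characteristics of the platform's memory system must be taken into account when analyzing the complexity of an algorithm. 孙家昶199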
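To illustrate why time and space complexity alone cannot explain such performance differences, the following minimal sketch counts cache misses for two traversals of the same matrix. The direct-mapped cache model, its parameters (`line_size`, `num_lines`), and all function names are illustrative assumptions and are not taken from the paper.

```python
def count_misses(addresses, line_size=8, num_lines=64):
    """Count misses in a hypothetical direct-mapped cache (sizes in elements)."""
    cache = [None] * num_lines          # one resident block per cache line
    misses = 0
    for addr in addresses:
        block = addr // line_size       # memory block holding this element
        slot = block % num_lines        # direct-mapped placement
        if cache[slot] != block:
            cache[slot] = block         # fetch the block, evicting the old one
            misses += 1
    return misses


def row_major(n):
    """Addresses touched when an n x n row-major matrix is walked row by row."""
    return [i * n + j for i in range(n) for j in range(n)]


def column_major(n):
    """Addresses touched when the same matrix is walked column by column."""
    return [i * n + j for j in range(n) for i in range(n)]


n = 128
row_misses = count_misses(row_major(n))
col_misses = count_misses(column_major(n))
print(row_misses, col_misses)
```

Both traversals perform the same n × n accesses and use the same storage, so their time and space complexity are identical, yet in this toy model the column-wise walk misses on every access while the row-wise walk misses once per cache line.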

3.
The current trend in the development of parallel programming models is to combine different well-established models into a single programming model in order to support efficient implementation of a wide range of real-world applications. The dataflow model in particular has recaptured the interest of the research community because of its ability to express parallelism efficiently. As a result, a number of recently proposed hybrid parallel programming models combine dataflow and traditional shared memory models. Their findings have influenced the introduction of task dependencies in the OpenMP 4.0 standard. This article presents DaSH, the first comprehensive benchmark suite for hybrid dataflow and shared memory programming models. DaSH features 11 benchmarks, each representing one of the Berkeley dwarfs that capture patterns of communication and computation common to a wide range of emerging applications. DaSH also includes sequential and shared-memory implementations based on OpenMP and Intel TBB to facilitate comparison between hybrid dataflow implementations and traditional shared memory implementations based on work-sharing and/or tasks. Finally, we use DaSH to evaluate three different hybrid dataflow models, identify their advantages and shortcomings, and motivate further research on their characteristics.
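The dataflow idea mentioned above, where a task runs as soon as the data it depends on is available, can be sketched in a few lines. This is a minimal illustration using Python's standard `graphlib` module; the task names, the `run_dataflow` helper, and the toy workload are assumptions for illustration and are not part of DaSH or of any model it evaluates.

```python
from graphlib import TopologicalSorter


def run_dataflow(tasks, deps):
    """Run each task once all tasks it depends on have produced a result.

    tasks: name -> function taking the results of its dependencies (in order)
    deps:  name -> list of task names whose results the task consumes
    """
    sorter = TopologicalSorter({name: deps.get(name, []) for name in tasks})
    sorter.prepare()
    results, order = {}, []
    while sorter.is_active():
        # Every task returned by get_ready() has all of its inputs available,
        # so a real runtime could execute this whole batch in parallel.
        for name in sorted(sorter.get_ready()):
            inputs = [results[d] for d in deps.get(name, [])]
            results[name] = tasks[name](*inputs)
            order.append(name)
            sorter.done(name)
    return results, order


# Hypothetical four-task graph: two independent loads feed an add, then a sum.
tasks = {
    "load_a": lambda: [1, 2, 3],
    "load_b": lambda: [10, 20, 30],
    "add": lambda a, b: [x + y for x, y in zip(a, b)],
    "total": lambda s: sum(s),
}
deps = {"add": ["load_a", "load_b"], "total": ["add"]}
results, order = run_dataflow(tasks, deps)
print(order, results["total"])
```

The sketch executes ready tasks one after another for simplicity; the point is only that ordering is derived from data dependencies rather than from program order, which is the same kind of information that task dependencies express in OpenMP 4.0.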

4.
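The development of parallel computational models has introduced more and more model parameters. This work studies the overall framework of DEMPAT, a software package for dynamically collecting and analyzing the parameters of parallel computational models, and implements a module that collects memory hierarchy parameters using hardware performance counters. Experiments show that the module obtains memory hierarchy parameters accurately and quickly, and that it is reasonably portable.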

5.
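One of the key bottlenecks in optimizing and parallelizing large-scale parallel applications is the increasingly deep and complex memory hierarchy of multi-core CPUs. This paper systematically analyzes and summarizes the locality design approaches in current mainstream multi-core CPUs and parallel programming languages, and proposes two kinds of locality: horizontal locality and vertical locality. From the perspective of these two kinds of locality, it analyzes in depth the locality mechanisms of the main current parallel programming languages, compares their strengths and weaknesses, and identifies the features that a new generation of parallel programming languages should have. In particular, it argues that a new language should provide design mechanisms that support both kinds of locality together.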

6.
Accesses Per Cycle (APC), Concurrent Average Memory Access Time (C-AMAT), and Layered Performance Matching (LPM) are three memory performance models that consider both data locality and memory access concurrency. The APC model measures the throughput of a memory architecture and therefore reflects the quality of service (QoS) of a memory system. The C-AMAT model provides a recursive expression for the memory access delay and can therefore be used to identify potential bottlenecks in a memory hierarchy. The LPM method transforms a global memory system optimization into localized optimizations at each memory layer by matching the data access demands of the applications with the underlying memory system design. These three models were proposed separately in prior work. This paper reexamines them under one coherent mathematical framework. More specifically, we present a new memory-centric view of data accesses. We divide the memory cycles at each memory layer into four distinct categories and use them to recursively define the memory access latency and concurrency along the memory hierarchy. This perspective offers new insights, with a clear formulation of memory performance that considers both locality and concurrency. Consequently, the performance model can be easily understood and applied in engineering practice. As such, the memory-centric approach helps establish a unified mathematical foundation for model-driven performance analysis and optimization of contemporary and future memory systems.
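The abstract does not state the formulas, so the following is only a sketch under stated assumptions. The recursive AMAT computation is the textbook form (hit time plus miss rate times the latency of the next level). The `c_amat` function uses the form commonly given in the C-AMAT literature, with hit time divided by hit concurrency plus pure miss rate times pure miss penalty divided by miss concurrency, and `apc` takes APC as the reciprocal of C-AMAT; both should be checked against the paper before being relied on. All parameter names and numbers are hypothetical.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Textbook AMAT for one level: H + MR * AMP (no concurrency)."""
    return hit_time + miss_rate * miss_penalty


def amat_recursive(levels, memory_latency):
    """AMAT of a hierarchy; levels is [(hit_time, miss_rate), ...] from L1 down."""
    latency = memory_latency
    for hit_time, miss_rate in reversed(levels):
        # The miss penalty of a level is the access time of the level below it.
        latency = amat(hit_time, miss_rate, latency)
    return latency


def c_amat(hit_time, hit_concurrency, pure_miss_rate,
           pure_miss_penalty, miss_concurrency):
    """Assumed C-AMAT form: H / C_H + pMR * pAMP / C_M."""
    return (hit_time / hit_concurrency
            + pure_miss_rate * pure_miss_penalty / miss_concurrency)


def apc(c_amat_value):
    """Assumed relationship: accesses per cycle is the reciprocal of C-AMAT."""
    return 1.0 / c_amat_value


# Hypothetical two-level hierarchy in front of a 100-cycle memory.
print(amat_recursive([(1, 0.25), (10, 0.5)], 100))
# Hypothetical concurrent layer: 4 overlapping hits, 2 overlapping pure misses.
print(c_amat(2, 4, 0.25, 40, 2))
```

With all concurrency terms set to 1 and every miss treated as a pure miss, `c_amat` reduces to the sequential `amat`, which is a quick sanity check on the assumed form.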

7.
This research defines and analyzes a methodology for deriving a performance model for SPMD hybrid parallel applications. Hybrid parallelism combines the shared memory and message passing computing models. This work extends the current practice of application performance modelling by developing a methodology for hybrid applications with the following procedures:
  • Creation of a model based on complexity analysis of an application code and its data structures.
  • Enhancement of a static complexity model by dynamic factors to capture execution time phenomena, such as memory hierarchy effects.
  • Quantitative analysis of model characteristics and the effects of perturbations in measured parameters.
These research results are presented in the context of a hybrid parallel implementation of a sparse linear algebra kernel. A model for this kernel is derived and analyzed using the methodology. Application of the model on two large parallel computing platforms provides case studies for the methodology. Operating system issues, machine balance factor, and memory hierarchy effects on model accuracy are examined. Copyright © 2007 John Wiley & Sons, Ltd.

8.
In recent years, high performance computing has undergone a deep transformation. In this paper, we review the state of parallel computation, with a detailed discussion of current and future research issues in parallel architectures and compilation methods, instruction-level parallelism, and optimization methods that improve the performance of the memory hierarchy.

9.
We present GPU implementations of two different nature-inspired optimization methods for well-known optimization problems. Ant Colony Optimization (ACO) is a two-stage population-based method modelled on the foraging behaviour of ants, while P systems provide a high-level computational modelling framework that combines the structure and dynamic aspects of biological systems (in particular, their parallel and non-deterministic nature). Our methods focus on exploiting data parallelism and the memory hierarchy, and obtain GPU speedup factors exceeding 20x for either of the two stages of the ACO algorithm and 16x for P systems, compared with sequential versions running single-threaded on a high-end CPU. Additionally, we compare performance between GPU generations to validate the hardware enhancements introduced by Nvidia’s Fermi architecture.

10.
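This paper studies programming for cluster systems, aiming to establish a parallel programming model that supports a virtual shared memory space and multiple ways of describing parallelism. It first proposes the concept of an abstract structured shared memory model, and on that basis builds a hierarchical parallel model that supports data parallelism, task parallelism, and object parallelism at the same time. These two models together form the parallel programming model of the parallel language TipC++. The paper also gives a preliminary discussion of performance optimization primitives, compiler optimization, and task scheduling based on this programming model.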

11.
In this paper we formalise three different views of a virtual shared memory system and show that they are equivalent. The formalisation starts with five basic component processes specified in the language of CSP [Hoa85], which can be adapted as necessary by two operations called labelling and clamping, and are combined in two basic ways: either they are chained, so that the output of one component becomes the input of the next, or they are put in parallel, so that their communications are arbitrarily interleaved. Using the laws of CSP we show that these basic processes and operators satisfy a number of algebraic equivalences, which enable us to prove equivalence of the different models of the memory system by reasoning entirely at the level of processes, instead of at the lower and more complicated level of events. As a result the proofs of equivalence of the different models are purely algebraic and very simple. The specification is intended to provide a general framework for any architecture using an interconnection network, such as the on-chip interconnect between macrocells or the networks of processor nodes connected by bit-serial interconnect which are described in [Jon93]. It addresses architecture-independent design issues such as access transparency, connectivity, addressing models and serialisability. By structuring it as a hierarchy of models it is hoped that the treatment of these many issues is made as clear and tractable as possible, whilst the proofs of equivalence ensure consistency. Funded by Esprit Project 7267/ OMI-Standards.

12.
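In a multi-core processor, the cores can access external memory concurrently, which provides a memory level parallelism capability that a uniprocessor does not have. For loops in irregular applications, traditional parallelization methods have difficulty identifying the parallelism, and so cannot fully exploit the memory level parallelism and parallel computing capability of multi-core processors. This paper discusses software-based exploitation of memory level parallelism on multi-core processors and proposes LLSM (loop level speculative multithreading), a speculative parallel multithreading algorithm that parallelizes loops in irregular applications. Test data on multi-core processors show that the algorithm effectively exploits both the memory level parallelism and the computing capability of multi-core processors. The paper also points out that the formula for memory level parallelism in a multi-core environment needs to account for thread synchronization overhead.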

13.
14.
15.
This paper presents a system for parallel execution of Prolog that supports both independent conjunctive parallelism and disjunctive parallelism. The system is intended for distributed memory architectures and is composed of a set of workers with a hierarchically structured scheduler. The execution model has been designed so that each worker's environment contains no references to terms in other environments, which reduces communication overhead. To guarantee that exploiting parallelism improves performance, granularity control has been introduced for each kind of parallelism. For conjunctive parallelism, PDP applies a control based on the estimation provided by CASLOG. The features of the system allow this control to be introduced without adding overhead. For disjunctive parallelism, PDP controls granularity with a heuristic-based method, which can be adapted to other parallel Prolog systems. Different scheduling policies have also been tested. The system has been implemented on a transputer network, and performance results show that it provides high speedup for coarse-grain parallel programs.

16.
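InfiniBand is one of the mainstream interconnect networks for current HPC systems. Its reliable connection transport service supports RDMA, atomic operations, and other features, and is therefore widely used by parallel programming models such as MPI. However, the message queue and buffer overhead needed to support reliable connections often grows sharply as the scale of parallelism increases, which limits how far applications can scale. To address the message scalability problem caused by this memory overhead, the paper first introduces the shared receive queue and extended reliable connection techniques from the perspective of InfiniBand transport optimization, and then proposes a grouped connection technique based on the parallel communication model. With these techniques, the memory overhead per node can be reduced by two orders of magnitude, and the overhead does not grow noticeably as the scale of parallelism increases.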
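The growth in connection state described above can be made concrete with some simple arithmetic. The sketch below assumes a fully connected job in which, with reliable connections, each process holds one queue pair per remote process, while with extended reliable connections one queue pair per remote node is enough. The 64 KiB of queue and buffer memory per queue pair, the job size, and the function names are hypothetical values chosen for illustration; the grouped connection technique proposed in the paper is not modelled.

```python
def rc_qps_per_process(nodes, procs_per_node):
    """Assumed fully connected RC: one queue pair per remote process."""
    return (nodes - 1) * procs_per_node


def xrc_qps_per_process(nodes):
    """Assumed XRC: one queue pair per remote node."""
    return nodes - 1


def connection_memory_mib(queue_pairs, bytes_per_qp):
    """Memory consumed by connection state, in MiB."""
    return queue_pairs * bytes_per_qp / 2**20


nodes, procs_per_node = 1024, 16      # hypothetical job size
bytes_per_qp = 64 * 1024              # hypothetical per-connection footprint

rc = rc_qps_per_process(nodes, procs_per_node)
xrc = xrc_qps_per_process(nodes)
print(rc, connection_memory_mib(rc, bytes_per_qp))
print(xrc, connection_memory_mib(xrc, bytes_per_qp))
```

Under these assumptions the per-process connection memory shrinks by a factor equal to the number of processes per node, which shows why the footprint of plain reliable connections scales with the total process count rather than with the node count.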

17.
Parallel loops account for the greatest amount of parallelism in numerical programs. Executing nested loops in parallel with low run-time overhead is thus very important for achieving high performance in parallel processing systems. However, in parallel processing systems with caches or local memories in their memory hierarchies, a “thrashing problem” may arise whenever data move back and forth between the caches or local memories of different processors. Previous techniques can only deal with the rather simple cases with one linear function in a perfectly nested loop. In this paper, we present a parallel program optimizing technique called hybrid loop interchange (HLI) for cases with multiple linear functions and loop-carried data dependences in the nested loop. With HLI we can easily eliminate or reduce the thrashing phenomena without reducing the program parallelism.
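The thrashing effect and the benefit of moving the parallel loop outward can be shown with a toy model. The sketch below applies plain loop interchange, not the HLI technique of the paper, to a nest with a dependence carried by the `i` loop. The rotating iteration-to-processor assignment stands in for run-time scheduling and is an assumption, as are the function name and the way cross-processor reads are counted.

```python
def run_nest(n, m, num_procs, parallel_outer):
    """Compute a[i][j] = a[i-1][j] + 1 and count cross-processor reads.

    parallel_outer=False: sequential i loop outside, parallel j loop inside,
        with the j iterations re-assigned to processors on every i step
        (modelled here by a rotating assignment).
    parallel_outer=True: interchanged nest, parallel j loop outside, so one
        processor runs the whole i loop for its column j.
    """
    a = [[0] * m for _ in range(n)]
    writer = [[None] * m for _ in range(n)]   # processor that wrote a[i][j]
    transfers = 0

    def body(i, j, proc):
        nonlocal transfers
        if i > 0:
            if writer[i - 1][j] != proc:
                transfers += 1                # operand sits in another cache
            a[i][j] = a[i - 1][j] + 1
        writer[i][j] = proc

    if parallel_outer:
        for j in range(m):                    # doall j
            for i in range(n):
                body(i, j, j % num_procs)
    else:
        for i in range(n):
            for j in range(m):                # doall j, rescheduled each i
                body(i, j, (i + j) % num_procs)
    return a, transfers


a_before, moves_before = run_nest(4, 6, 3, parallel_outer=False)
a_after, moves_after = run_nest(4, 6, 3, parallel_outer=True)
print(moves_before, moves_after, a_before == a_after)
```

The interchange is legal here because the only dependence runs along `i` within a fixed column `j`; after the interchange each column stays on one processor, so the values are unchanged while the data no longer migrate between caches.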

18.
何裕南, 安虹, 郭锐, 梁博. 《计算机科学》 (Computer Science), 2007, 34(1): 248-254
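CPU design is shifting from single-threaded, single-core architectures that exploit only instruction-level parallelism to multi-threaded, multi-core architectures that exploit thread-level parallelism. To date, however, there has been no portable and widely used open-source multi-core processor simulator, which limits high-quality research on such architectures. We developed OpenCMP, a multi-core processor architecture simulator, to support current and future research on the key technologies of multi-threaded multi-core architectures. The simulator abstracts the multi-core processor structure appropriately and provides an extensible, flexible simulation framework for mainstream multi-core architecture research, including simulation of out-of-order cores, in-order cores, and simultaneous multithreading cores, so that a larger multi-core design space can be studied comparatively. Taking a multi-core simulator that supports the transactional memory model as an example, this paper describes in detail how to design and implement a multi-core processor simulator by abstracting the most basic features and components of the multi-core structure and the transactional memory model and extending the single-core simulator SimpleScalar. Preliminary results show that, compared with existing multi-core simulators, this simulator gives better support for research on the transactional memory model and on multi-core architectures based on it.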

19.
Many problems in the operations research field cannot be solved to optimality within a reasonable amount of time with current computational resources. To find acceptable solutions to these computationally demanding problems, heuristic methods such as genetic algorithms are often developed. Parallel computing provides alternative design options for heuristic algorithms, as well as the opportunity to improve both the computational time and the solution quality of these heuristics. Heuristic algorithms may be designed to benefit from parallelism by taking advantage of the parallel architecture. This study examines the performance of the same global parallel genetic algorithm on two popular parallel architectures to investigate the interaction between the choice of parallel platform and genetic algorithm design. The paper develops computational experiments to compare algorithm development on a shared memory architecture and a distributed memory architecture, and the computational results illustrate the impact of platform choice on parallel heuristic methods. The results suggest that the performance of a parallel heuristic can be increased by considering the desired outcome and tailoring the development of the parallel heuristic to a specific platform based on the hardware and software characteristics of that platform.
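A global parallel genetic algorithm, as the term is usually understood, keeps a single population and distributes only the fitness evaluations across workers. The sketch below shows that structure with a thread pool from the Python standard library. The OneMax objective, the selection and mutation settings, and all names are illustrative assumptions and do not reproduce the algorithm or the platforms used in the study.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def fitness(bits):
    """OneMax: number of 1 bits (a toy objective chosen for illustration)."""
    return sum(bits)


def global_parallel_ga(pop_size=20, length=16, generations=30, workers=4, seed=0):
    """Return the best fitness seen in each generation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    history = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(generations):
            # Global model: one population; only evaluation is farmed out.
            scores = list(pool.map(fitness, pop))
            history.append(max(scores))
            ranked = [ind for _, ind in
                      sorted(zip(scores, pop), key=lambda p: p[0], reverse=True)]
            parents = ranked[: pop_size // 2]      # truncation selection
            children = [ranked[0][:]]              # elitism: keep the best
            while len(children) < pop_size:
                p1, p2 = rng.sample(parents, 2)
                cut = rng.randrange(1, length)
                child = p1[:cut] + p2[cut:]        # one-point crossover
                if rng.random() < 0.1:
                    child[rng.randrange(length)] ^= 1   # bit-flip mutation
                children.append(child)
            pop = children
    return history


history = global_parallel_ga()
print(history[0], history[-1])
```

Selection, crossover, and mutation stay sequential on the master, so the design only pays off when fitness evaluation dominates the run time. On a real platform the thread pool would be replaced by whatever the architecture offers, for example processes or message passing, which is the kind of platform-specific tailoring the study discusses.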

20.
Ming Hsiang Huang, Wuu Yang. 《Software》, 2020, 50(10): 1877-1904
OpenACC is a directive-based programming model that allows programmers to write graphics processing unit (GPU) programs by simply annotating parallel loops. However, OpenACC has poor support for irregular nested parallel loops, which are natural choices for expressing nested parallelism. We propose PFACC, a programming model similar to OpenACC. PFACC directives can be used to annotate parallel loops and to guide data movement between different levels of the memory hierarchy. Parallel loops can be arbitrarily nested or be placed inside functions that are called (possibly recursively) in other parallel loops. The PFACC translator translates C programs with PFACC directives into CUDA programs by inserting runtime iteration-sharing and memory allocation routines. The PFACC runtime iteration-sharing routine is a two-level mechanism. Thread blocks dynamically organize loop iterations into batches and execute the batches in depth-first order. Different thread blocks share iterations with one another through an iteration-stealing mechanism. PFACC generates CUDA programs with reasonable memory usage because of the depth-first execution order. The two-level iteration-sharing mechanism is implemented purely in software and fits well with the CUDA thread hierarchy. Experiments show that PFACC outperforms CUDA dynamic parallelism in terms of performance and code size on most benchmarks.
