Similar literature
 Found 20 similar documents (search time: 31 ms)
1.
Models of parallel computation: a survey and classification   Total citations: 5, self-citations: 1, citations by others: 5
This paper reviews the state of the art in parallel computational model research and introduces the models developed over the past decades. We classify these models into three generations according to the architectural features they target, especially memory organization, and discuss the models and their characteristics within that classification. We believe that, with the ever-increasing speed gap between the CPU and memory systems, incorporating a non-uniform memory hierarchy into computational models will become unavoidable. With the emergence of multi-core CPUs, the parallelism hierarchy of current computing platforms is becoming more and more complicated, and describing it in future computational models is becoming more and more important. A semi-automatic toolkit that extracts model parameters and their values on real computers can reduce the complexity of model analysis, allowing more complicated models with more parameters to be adopted. Hierarchical memory and hierarchical parallelism will be two very important features to consider in future model design and research.

2.
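Memory complexity analysis of numerical computing programs   Total citations: 12, self-citations: 1, citations by others: 11
As more and more techniques are used to narrow the widening speed gap between processors and memory, computer memory systems have become increasingly complex. Today, any programmer, and especially a designer of numerical computing programs, will find it hard to achieve high performance without considering the characteristics of the memory system of the target computing platform. Relying only on traditional algorithm evaluation methods, which start from time complexity and space complexity, is therefore clearly insufficient to explain the large performance differences among different implementations of the same algorithm on the same platform. The characteristics of the platform's memory system must be taken into account when analyzing algorithm complexity. Sun Jiachang 199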

3.
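One of the key bottlenecks in the performance optimization and parallelization of large-scale parallel applications is the ever deeper and more complex memory hierarchy of multi-core CPUs. This paper systematically analyzes and summarizes the locality design approaches in current mainstream multi-core CPUs and parallel programming languages, and proposes two kinds of locality: horizontal locality and vertical locality. From the perspective of these two kinds of locality, it analyzes in depth the locality mechanisms of the current major parallel programming languages, summarizes and compares their advantages and disadvantages, and points out the features that a new generation of parallel programming languages should have, with emphasis on the view that a new language should support both kinds of locality in its design.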

4.
This research defines and analyzes a methodology for deriving a performance model for SPMD hybrid parallel applications. Hybrid parallelism combines the shared memory and message passing computing models. This work extends the current practice of application performance modelling by developing a methodology for hybrid applications with the following procedures:
  • Creation of a model based on complexity analysis of an application code and its data structures.
  • Enhancement of a static complexity model by dynamic factors to capture execution time phenomena, such as memory hierarchy effects.
  • Quantitative analysis of model characteristics and the effects of perturbations in measured parameters.
These research results are presented in the context of a hybrid parallel implementation of a sparse linear algebra kernel. A model for this kernel is derived and analyzed using the methodology. Application of the model on two large parallel computing platforms provides case studies for the methodology. Operating system issues, machine balance factor, and memory hierarchy effects on model accuracy are examined. Copyright © 2007 John Wiley & Sons, Ltd.

5.
In recent years, high performance computing has undergone a deep transformation. In this paper, we review the state of parallel computation, with a detailed discussion of current and future research issues in parallel architectures and compilation methods, instruction-level parallelism, and optimization methods that improve the performance of the memory hierarchy.

6.
The current trend in the development of parallel programming models is to combine different well-established models into a single programming model in order to support efficient implementation of a wide range of real-world applications. The dataflow model in particular has recaptured the interest of the research community because of its ability to express parallelism efficiently. Thus, a number of recently proposed hybrid parallel programming models combine dataflow and traditional shared memory models. Their findings have influenced the introduction of task dependencies in the OpenMP 4.0 standard. This article presents DaSH, the first comprehensive benchmark suite for hybrid dataflow and shared memory programming models. DaSH features 11 benchmarks, each representing one of the Berkeley dwarfs that capture patterns of communication and computation common to a wide range of emerging applications. DaSH also includes sequential and shared-memory implementations based on OpenMP and Intel TBB to facilitate easy comparison between hybrid dataflow implementations and traditional shared memory implementations based on work-sharing and/or tasks. Finally, we use DaSH to evaluate three different hybrid dataflow models, identify their advantages and shortcomings, and motivate further research on their characteristics.

7.
We present GPU implementations of two different nature-inspired optimization methods for well-known optimization problems. Ant Colony Optimization (ACO) is a two-stage population-based method modelled on the foraging behaviour of ants, while P systems provide a high-level computational modelling framework that combines the structure and dynamic aspects of biological systems (in particular, their parallel and non-deterministic nature). Our methods focus on exploiting data parallelism and the memory hierarchy to obtain GPU speedups surpassing 20x for either of the two stages of the ACO algorithm, and 16x for P systems, when compared to sequential versions running on a single-threaded high-end CPU. Additionally, we compare performance between GPU generations to validate the hardware enhancements introduced by Nvidia’s Fermi architecture.

8.
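The development of parallel computation models has introduced more and more model parameters. This work studies the overall framework of DEMPAT, a software package for the dynamic collection and analysis of parallel computation model parameters, and implements a memory hierarchy parameter collection module based on hardware performance counters. Experiments show that the module obtains memory hierarchy parameters accurately and quickly and has good portability.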

9.
10.
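InfiniBand is currently one of the mainstream interconnects for HPC systems. Its reliable connection transport service is widely used by parallel programming models such as MPI because it supports features such as RDMA and atomic operations. However, the message queue and buffer overhead required to support reliable connections tends to grow sharply as the parallel scale increases, which limits application scale. To address the message scalability problem caused by this memory overhead, the paper first introduces the shared receive queue and extended reliable connection techniques from the perspective of InfiniBand transport optimization, and then proposes a grouped connection technique based on the parallel communication model. These techniques can reduce per-node memory overhead by two orders of magnitude, and the overhead does not grow noticeably as the parallel scale increases.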
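To make the scaling problem concrete, the following is a back-of-the-envelope sketch in Python, not the paper's method. It assumes a fully connected job, counts only connections to remote nodes, and uses the commonly described property that with extended reliable connection a process needs one connection per remote node rather than one per remote process. The function names and the example job size are hypothetical, and the grouped connection technique proposed in the paper is not modelled here.

```python
def rc_connections_per_node(nodes: int, procs_per_node: int) -> int:
    """Reliable connection: every local process connects to every remote process."""
    remote_processes = (nodes - 1) * procs_per_node
    return procs_per_node * remote_processes


def xrc_connections_per_node(nodes: int, procs_per_node: int) -> int:
    """Extended reliable connection (assumed): one connection per remote node per process."""
    remote_nodes = nodes - 1
    return procs_per_node * remote_nodes


if __name__ == "__main__":
    # Hypothetical job size, chosen only for illustration.
    n, p = 1024, 16
    print("RC connections per node: ", rc_connections_per_node(n, p))
    print("XRC connections per node:", xrc_connections_per_node(n, p))
```

Under these assumptions the connection count per node shrinks by a factor equal to the number of processes per node; the buffer memory saved in practice also depends on queue depths and on whether receive queues are shared.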

11.
Many problems in the operations research field cannot be solved to optimality within reasonable amounts of time with current computational resources. In order to find acceptable solutions to these computationally demanding problems, heuristic methods such as genetic algorithms are often developed. Parallel computing provides alternative design options for heuristic algorithms, as well as the opportunity to obtain performance benefits in both computational time and solution quality of these heuristics. Heuristic algorithms may be designed to benefit from parallelism by taking advantage of the parallel architecture. This study will investigate the performance of the same global parallel genetic algorithm on two popular parallel architectures to investigate the interaction of parallel platform choice and genetic algorithm design. The computational results of the study illustrate the impact of platform choice on parallel heuristic methods. This paper develops computational experiments to compare algorithm development on a shared memory architecture and a distributed memory architecture. The results suggest that the performance of a parallel heuristic can be increased by considering the desired outcome and tailoring the development of the parallel heuristic to a specific platform based on the hardware and software characteristics of that platform.

12.
Ming Hsiang Huang, Wuu Yang. Software, 2020, 50(10): 1877-1904
OpenACC is a directive-based programming model which allows programmers to write graphics processing unit (GPU) programs by simply annotating parallel loops. However, OpenACC has poor support for irregular nested parallel loops, which are natural choices to express nested parallelism. We propose PFACC, a programming model similar to OpenACC. PFACC directives can be used to annotate parallel loops and to guide data movement between different levels of the memory hierarchy. Parallel loops can be arbitrarily nested or be placed inside functions that are (possibly recursively) called in other parallel loops. The PFACC translator translates C programs with PFACC directives into CUDA programs by inserting runtime iteration-sharing and memory allocation routines. The PFACC runtime iteration-sharing routine is a two-level mechanism. Thread blocks dynamically organize loop iterations into batches and execute the batches in a depth-first order. Different thread blocks share iterations among one another with an iteration-stealing mechanism. PFACC generates CUDA programs with reasonable memory usage because of the depth-first execution order. The two-level iteration-sharing mechanism is implemented purely in software and fits well with the CUDA thread hierarchy. Experiments show that PFACC outperforms CUDA dynamic parallelism in terms of performance and code size on most benchmarks.

13.
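Simulation models are becoming increasingly complex, and, limited by the computing power and storage capacity of a single machine, simulations take longer and longer to run. The PDES (Parallel Discrete Event Simulation) strategy can speed up the execution of simulation programs and was therefore once a hot research topic. However, parallel simulation has not been widely adopted in industry, for three reasons: the lack of a modeling theory for parallel simulation, the unpredictability of parallel simulation performance, and the unpredictability of parallel program behavior. Building on a discussion of general methods for parallelizing simulators, this paper presents SensorSSF, an SSF-based parallel simulation environment for sensor networks. The design of SensorSSF follows two principles: scalability and simplicity. Scalability ensures that CPU execution time grows linearly with the problem size and with the complexity of the simulation model; simplicity allows simulation users to write efficient simulation programs without much knowledge of parallel programming. Experimental results show that SensorSSF scales well and has better time characteristics than NS2.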

14.
Lipasti, M.H.; Shen, J.P. Computer, 1997, 30(9): 59-66
Based on their research at Carnegie Mellon University, the authors argue for billion-transistor uniprocessors. They divide the important implementation problems into three components: instruction flow, register dataflow, and memory dataflow. They also argue for trace caches and advanced branch prediction. Their article, however, focuses on using massive speculation at all levels to improve performance. They claim that without this much speculation, future processors will be limited by true data dependences and will be unable to harvest enough instruction-level parallelism (ILP) to improve performance satisfactorily. Their investigations found large speedups on code that has traditionally not been amenable to finding ILP.

15.
Wilmarth, D.D. Computer, 1993, 26(8): 70-72
The parallel processing requirements of many computer applications, such as machine vision, radar, solar, and signal processing, are reviewed. The major hardware architectural features in optimizing parallel processing performance (interconnect topology, memory locality, and synchronization facilities) are discussed. The various parallel processing models available are also discussed. These include job-level parallelism, data-level parallelism, algorithm-level parallelism, loop-level parallelism, and compute clusters.

16.
Computer benchmarking is a common method for measuring the parameters of a computational model on a given computer. With the emergence of multicore computers, their evaluation has come under consideration. Since multicore computers can be viewed as parallel computers, the evaluation methods for parallel computers may be appropriate for them. However, because multicore architectures rely heavily on the cache hierarchy, new and different benchmarks are needed to evaluate them correctly. To this end, this paper presents a method for measuring the parameters of one of the most famous multicore computational models, namely Multi-Bulk Synchronous Parallel (Multi-BSP). This method measures the hardware latency parameters of multicore computers, namely communication latency (g_i) and synchronization latency (L_i), for all levels of the cache memory hierarchy in a bottom-up manner. By determining the parameters, the performance of algorithms on multicore architectures can be evaluated as a sequence.
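As a rough illustration of how measured per-level parameters might be used once they are known, the Python sketch below estimates the cost of one superstep as local work plus, for each level i of the hierarchy, the words moved at that level times g_i, plus L_i. This is a simplified BSP-style cost expression assumed here for illustration, not the measurement method of the paper and not the full Multi-BSP definition; the names `Level` and `superstep_cost` and all numeric values are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Level:
    """Measured parameters for one level of the memory hierarchy (hypothetical container)."""
    g: float  # communication cost per word at this level
    L: float  # synchronization cost at this level


def superstep_cost(work: float, words_moved: Sequence[float], levels: Sequence[Level]) -> float:
    """Estimate one superstep: local work plus sum over levels of (h_i * g_i + L_i)."""
    if len(words_moved) != len(levels):
        raise ValueError("need one word count per hierarchy level")
    communication = sum(h * level.g + level.L for h, level in zip(words_moved, levels))
    return work + communication


if __name__ == "__main__":
    # Made-up values for a two-level hierarchy, listed bottom-up.
    hierarchy = [Level(g=1.0, L=10.0), Level(g=4.0, L=100.0)]
    print(superstep_cost(work=1000.0, words_moved=[50.0, 20.0], levels=hierarchy))
```

Listing the levels bottom-up mirrors the order in which the abstract says the parameters are measured.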

17.
Quantum computing is emerging as a field of great theoretical interest. Its simulation is a problem with high memory and computational requirements, which makes the use of parallel platforms advisable. In this work we deal with the simulation of an ideal quantum computer on the Compute Unified Device Architecture (CUDA), as such a problem can benefit from the high computational capacity of Graphics Processing Units (GPUs). Modern GPUs are becoming very powerful computational architectures, which is causing a growing interest in their application to general-purpose computing. CUDA provides an execution model oriented towards a more general exploitation of the GPU, allowing it to be used as a massively parallel SIMT (Single-Instruction Multiple-Thread) multiprocessor. A simulator that takes memory reference locality into account is proposed, showing that achieving high performance depends strongly on the explicit exploitation of the memory hierarchy. Several strategies have been experimentally evaluated, obtaining good performance results in comparison with conventional platforms.

18.
Algorithms from scientific computing often exhibit a two-level parallelism based on potential method parallelism and potential system parallelism. We consider the parallel implementation of those algorithms on distributed memory machines. The two-level potential parallelism of algorithms is expressed in a specification consisting of an upper level hierarchy of multiprocessor tasks each of which has an internal structure of uniprocessor tasks. To achieve an optimal parallel execution time, the parallel execution of such a program requires an optimal scheduling of the multiprocessor tasks and an appropriate treatment of uniprocessor tasks. For an important subclass of structured method parallelism we present a scheduling methodology which takes data redistributions between multiprocessor tasks into account. As costs we use realistic parallel runtimes. The scheduling methodology is designed for an integration into a parallel compiler tool. We illustrate the multitask scheduling by several examples from numerical analysis.

19.
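Converting convolution into matrix multiplication is an efficient implementation approach on FPGAs, but existing conversion methods cannot adjust dynamically to different convolution parameters, which limits the parallelism of the convolution computation. This paper proposes a new dynamic remainder processing mapping model. The mapping model contains three sub-models: a feature mapping model, a weight mapping model, and an output mapping model. The feature mapping model converts feature values into a feature matrix, the weight mapping model converts weights into a weight matrix, the feature matrix and the weight matrix pass through a multiply-accumulate (MAC) array to produce the convolution result, and the output mapping model stores the result in memory. During convolution, the number of output channels is usually not an integer multiple of the number of rows of the MAC array; the three sub-models dynamically adjust their mapping according to the resulting remainder, improving the utilization of the MAC array. Experiments show that the dynamic remainder processing mapping model can increase the parallelism for the remainder by a factor of up to the convolution kernel size, giving the whole accelerator higher actual throughput and energy efficiency.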
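The general idea of turning a convolution into a matrix product can be sketched in plain Python as follows. This is a minimal software illustration of the standard patch-unfolding (im2col) approach for a single-channel input with stride 1 and no padding, not the FPGA mapping model of the paper. The function `mac_array_utilization` is a hypothetical helper that shows why a remainder appears when the output channel count is not a multiple of the MAC array row count under an assumed static mapping with one output channel per row; all names are illustrative.

```python
from math import ceil


def im2col(feature, k):
    """Unfold a 2-D feature map into rows, each a flattened k x k patch."""
    h, w = len(feature), len(feature[0])
    return [
        [feature[i + di][j + dj] for di in range(k) for dj in range(k)]
        for i in range(h - k + 1)
        for j in range(w - k + 1)
    ]


def conv_as_matmul(feature, kernels):
    """Convolve with several k x k kernels via a patch-matrix times weight-matrix product."""
    k = len(kernels[0])
    out_h, out_w = len(feature) - k + 1, len(feature[0]) - k + 1
    patches = im2col(feature, k)  # feature matrix: (out_h * out_w) x (k * k)
    weights = [[kern[di][dj] for di in range(k) for dj in range(k)] for kern in kernels]
    outputs = []
    for weight_row in weights:  # one output channel per weight row
        flat = [sum(p * q for p, q in zip(patch, weight_row)) for patch in patches]
        outputs.append([flat[r * out_w:(r + 1) * out_w] for r in range(out_h)])
    return outputs


def mac_array_utilization(out_channels, array_rows):
    """Fraction of MAC rows doing useful work when each pass maps one channel per row."""
    passes = ceil(out_channels / array_rows)
    return out_channels / (passes * array_rows)


if __name__ == "__main__":
    fmap = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(conv_as_matmul(fmap, [[[1, 0], [0, 1]], [[1, 1], [1, 1]]]))
    print(mac_array_utilization(out_channels=20, array_rows=16))
```

With 20 output channels on a 16-row array, the second pass in this static scheme uses only 4 rows; that idle capacity is the kind of remainder the abstract says the dynamic mapping is designed to recover.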

20.
A parallel computation model is an abstraction of the performance characteristics of parallel computers, and should evolve with the development of computational infrastructure. Heterogeneous CPU/Graphics Processing Unit (GPU) systems have been and will remain important platforms for scientific computing, which creates an urgent demand for new parallel computation models targeting this kind of supercomputer. In this research, we propose a parallel computation model called HLognGP to abstract the computation and communication features of heterogeneous platforms such as TH‐1A. All the substantial parameters of HLognGP are in vector form and deal with the new features of GPU clusters. A simplified version of the proposed model, HLog3GP, is mapped to a specific GPU cluster and verified with two typical benchmarks. Experimental results show that HLog3GP outperforms the other two evaluated models and models the new particularities of GPU clusters well. Copyright © 2015 John Wiley & Sons, Ltd.
