Similar Documents
20 similar documents found.
1.
Dynamic programming (DP) is a popular technique used to solve combinatorial search and optimization problems. This paper focuses on one type of DP, called nonserial polyadic dynamic programming (NPDP). Owing to the nonuniform data dependencies of NPDP, it is difficult to exploit either parallelism or locality. Worse still, the emerging multi/many-core architectures with small on-chip memory make these issues more challenging. In this paper, we address the challenges of exploiting the fine-grain parallelism and locality of NPDP on multicore architectures. We describe a latency-tolerant model and a percolation technique for programming on multicore architectures. On the algorithmic level, both parallelism and locality benefit from a specific data-dependence transformation of NPDP. Next, we propose a parallel pipelining algorithm that decomposes computation operators and percolates data through a memory hierarchy to create just-in-time locality. In order to predict the execution time, we formulate an analytical performance model of the parallel algorithm. The parallel pipelining algorithm achieves not only high scalability on the 160-core IBM Cyclops64, but also portable performance across the 8-core Sun Niagara and the quad-core Intel Clovertown.
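For concreteness, a minimal sketch (not the paper's code) of the kind of NPDP recurrence involved, c[i][j] = min over i&lt;k&lt;j of (c[i][k] + c[k][j] + w(i,k,j)): cells on the same anti-diagonal are mutually independent, which is the fine-grain parallelism a pipelined schedule can exploit. The weight function w() and the OpenMP wavefront loop are illustrative assumptions.

```c
#include <limits.h>

#define N 1024
static long c[N][N];   /* base case c[i][i+1] = 0; assumed pre-initialized */

/* placeholder weight function -- an assumption for illustration */
static long w(int i, int k, int j) { return (long)(j - i); }

void npdp(void)
{
    for (int d = 2; d < N; d++) {            /* sweep anti-diagonals outward */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i + d < N; i++) {    /* cells on one diagonal are independent */
            int j = i + d;
            long best = LONG_MAX;
            for (int k = i + 1; k < j; k++) {
                long v = c[i][k] + c[k][j] + w(i, k, j);
                if (v < best) best = v;
            }
            c[i][j] = best;
        }
    }
}
```

Every c[i][k] and c[k][j] read lies on an earlier (shorter-span) diagonal, so the outer loop carries all dependencies and each diagonal can be swept in parallel.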

2.
Heterogeneous many-core architectures offer a very high performance-per-watt ratio and have become an important direction in supercomputer architecture. However, the more complex parallel and memory hierarchies of many-core systems pose great challenges for programming and optimization, so research on parallel programming techniques for many-core systems matters both for lowering the difficulty of writing parallel applications on Chinese domestic many-core systems and for improving their performance. This paper proposes a unified, multi-mode parallel programming model comprising a heterogeneous-fusion accelerated computing model and an autonomous computing model programmed in a homogeneous style. Based on this model, the Parallel C language is designed, which can effectively describe the heterogeneous parallelism of domestic many-core systems. Compared with the MPI+X style used on other many-core systems, it gives both programming and system optimization a global view, and it is distinctive in multi-level locality description, one-sided messaging, and compatibility with existing multicore applications. A Parallel C compilation system was built on Open64, with full support for both the accelerated and autonomous computing models, implementing optimizations such as data layout with automatic DMA, compiler-directed thread proxying, and topology-aware collective communication. Results from micro-benchmarks and real applications on the Sunway TaihuLight system show that the Parallel C language and compilation system deliver good performance and scalability and can effectively support large applications.

3.
Molecular dynamics (MD) simulation has broad applications, and an increasing amount of computing power is needed to satisfy the large scale of real-world simulation. The advent of the many-core paradigm brings unprecedented computing power, but harvesting it remains a great challenge owing to MD's irregular memory-access pattern. To address this challenge, this paper presents a joint application/architecture study to enhance the scalability of MD on the Godson-T-like many-core architecture. First, a preprocessing approach leveraging an adaptive divide-and-conquer framework is designed to exploit locality through a memory hierarchy with software-controlled memory. Then three incremental optimization strategies are proposed to enhance on-chip parallelism for the Godson-T many-core processor: a novel data layout to improve data locality, an on-chip locality-aware parallel algorithm to enhance data reuse, and a pipelining algorithm to hide latency to shared memory. Experiments on a Godson-T simulator exhibit a strong-scaling parallel efficiency of 0.99 on 64 cores, which is confirmed by a field-programmable gate array emulator. The performance per watt of MD on Godson-T is also much higher than that on a 16-core Intel Core i7 symmetric multiprocessor (SMP), and 26 times higher than that on an 8-core, 64-thread Sun T2 processor. Detailed analysis shows that the optimizations that use architectural features to maximize data locality and enhance data reuse benefit scalability most. Furthermore, a hierarchical parallelization scheme is designed to map the MD algorithm to a Godson-T many-core cluster, and a simple performance model is derived which suggests that the optimization scheme is likely to scale well toward exascale. Certain architectural features are found to be essential for these optimizations, which could guide future hardware development.
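As background for the locality discussion, a hedged sketch of the linked-cell data structure commonly behind such divide-and-conquer MD preprocessing (an illustration, not the paper's Godson-T code): atoms are binned into cells so that neighbor search touches only adjacent cells, improving spatial locality.

```c
/* Bin atoms into ncell^3 cells of width box/ncell.
 * head[c] holds the first atom of cell c (-1 if empty);
 * next[a] chains to the next atom in the same cell. */
void build_cell_list(const double *x, const double *y, const double *z,
                     int natoms, double box, int ncell,
                     int *head, int *next)
{
    double cw = box / ncell;
    for (int c = 0; c < ncell * ncell * ncell; c++)
        head[c] = -1;                       /* mark all cells empty */
    for (int a = 0; a < natoms; a++) {
        int cx = (int)(x[a] / cw), cy = (int)(y[a] / cw), cz = (int)(z[a] / cw);
        if (cx >= ncell) cx = ncell - 1;    /* clamp atoms on the box edge */
        if (cy >= ncell) cy = ncell - 1;
        if (cz >= ncell) cz = ncell - 1;
        int c = (cz * ncell + cy) * ncell + cx;
        next[a] = head[c];                  /* push atom onto its cell's chain */
        head[c] = a;
    }
}
```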

4.
The recent trends in processor architecture show that parallel processing is moving into new areas of computing, in the form of many-core desktop processors and multi-processor systems-on-chip. This means that parallel processing is required in application areas that traditionally have not used parallel programs. This paper investigates the parallelism and scalability of an embedded image processing application. The major challenges faced when parallelizing the application were to extract enough parallelism from the application and to reduce load imbalance. The application has limited immediately available parallelism, and further extraction of parallelism is limited by small data sets and a relatively high parallelization overhead. Load balance is difficult to obtain due to the limited parallelism and is made worse by non-uniform memory latency. Three parallel OpenMP implementations of the application are discussed and evaluated. We show that, with some modifications, relative speedups in excess of 9 on a 16-CPU system can be reached.
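A hedged illustration of one load-balancing lever this situation calls for: with irregular per-row work and small data sets, OpenMP's schedule(dynamic) with a small chunk size trades scheduling overhead against imbalance. The kernel below is a generic smoothing filter, not the embedded application itself.

```c
/* Generic image filter; dynamic scheduling with a small chunk
 * absorbs irregular per-row cost at modest scheduling overhead. */
void filter_rows(float *out, const float *in, int rows, int cols)
{
    #pragma omp parallel for schedule(dynamic, 4)
    for (int r = 1; r < rows - 1; r++)
        for (int x = 1; x < cols - 1; x++)
            out[r * cols + x] = 0.25f * (in[(r - 1) * cols + x] +
                                         in[(r + 1) * cols + x] +
                                         in[r * cols + x - 1]  +
                                         in[r * cols + x + 1]);
}
```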

5.
The rapid development of technologies and applications in recent years poses high demands and challenges for high-performance computing. Because of their competitive performance/price ratio, heterogeneous many-core architectures are widely used in high-performance computing areas. GPU and Xeon Phi are two popular general-purpose many-core accelerators. In this paper, we demonstrate how heterogeneous many-core architectures, powered by multi-core CPUs, CUDA-enabled GPUs, and Xeon Phis, can be used as an efficient computational platform to accelerate popular option pricing algorithms. In order to make full use of the compute power of this architecture, we have used a hybrid computing model which consists of two types of data parallelism: worker level and device level. The worker-level data parallelism uses a distributed computing infrastructure for task distribution, while the device-level data parallelism uses both the multi-core CPUs and many-core accelerators for fast option pricing calculation. Experiments show that our implementations achieve good performance and scalability on this architecture and also outperform other state-of-the-art GPU-based solutions for Monte Carlo European/American option pricing and BSDE European option pricing.
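As a concrete instance of the device-level data parallelism described, here is a minimal OpenMP Monte Carlo European call pricer: each path is independent, so they distribute trivially. This is a CPU-only sketch under illustrative assumptions (POSIX rand_r, Box-Muller normals); the paper's implementations target GPUs and Xeon Phis.

```c
#include <math.h>
#include <stdlib.h>
#include <omp.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

double mc_euro_call(double S0, double K, double r, double sigma,
                    double T, long paths)
{
    double sum = 0.0;
    #pragma omp parallel reduction(+:sum)
    {
        /* one rand_r seed per thread; RNG quality is not the point here */
        unsigned int seed = 1234u + 97u * (unsigned)omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < paths; i++) {
            double u1 = (rand_r(&seed) + 1.0) / ((double)RAND_MAX + 2.0);
            double u2 = (rand_r(&seed) + 1.0) / ((double)RAND_MAX + 2.0);
            double z  = sqrt(-2.0 * log(u1)) * cos(2.0 * M_PI * u2); /* Box-Muller */
            double ST = S0 * exp((r - 0.5 * sigma * sigma) * T
                                 + sigma * sqrt(T) * z);             /* GBM terminal price */
            sum += ST > K ? ST - K : 0.0;                            /* call payoff */
        }
    }
    return exp(-r * T) * sum / (double)paths;   /* discounted mean payoff */
}
```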

6.
Biological sequence comparison is one of the most important tasks in Bioinformatics. Owing to the fast growth of databases that contain biological information, sequence comparison represents an important challenge for high-performance computing, especially when very long sequences are compared, i.e. the complete genomes of several organisms. The Smith–Waterman (SW) algorithm is an exact method based on dynamic programming to quantify local similarity between sequences. The inherent large parallelism of the algorithm makes it ideal for architectures supporting multiple dimensions of parallelism (TLP, DLP and ILP). Concurrently, there is a paradigm shift towards chip multiprocessors in computer architecture, which offer a huge amount of potential performance that can only be exploited efficiently if applications are effectively mapped and parallelized. In this work, we analyze how large-scale biology sequence comparison takes advantage of current and future multicore architectures. Our starting point is the performance analysis of the current multicore IBM Cell B.E. processor; we analyze two different SW implementations on the Cell B.E. Then, using simulation tools, we study the performance scalability when a many-core architecture is used for performing long DNA sequence comparison. We investigate the efficient memory organization that delivers the maximum bandwidth with the minimum cost. Our results show that a heterogeneous architecture can be an efficient alternative to execute challenging bioinformatic workloads. Copyright © 2011 John Wiley & Sons, Ltd.
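For reference, the core Smith–Waterman recurrence with a linear gap penalty, whose anti-diagonal independence supplies the TLP/DLP the authors exploit (a plain C sketch with assumed scoring constants, not their Cell B.E. implementation):

```c
#define MATCH     2
#define MISMATCH -1
#define GAP      -1

static int max3(int a, int b, int c)
{ int m = a > b ? a : b; return m > c ? m : c; }

/* H is an (n+1) x (m+1) row-major matrix, zero-initialized by the caller */
int smith_waterman(const char *a, int n, const char *b, int m, int *H)
{
    int best = 0;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++) {
            int s = (a[i-1] == b[j-1]) ? MATCH : MISMATCH;
            int h = max3(H[(i-1)*(m+1) + (j-1)] + s,    /* diagonal match/mismatch */
                         H[(i-1)*(m+1) + j]     + GAP,  /* gap in b */
                         H[i*(m+1)     + (j-1)] + GAP); /* gap in a */
            if (h < 0) h = 0;                           /* local-alignment floor */
            H[i*(m+1) + j] = h;
            if (h > best) best = h;
        }
    return best;   /* best local alignment score */
}
```

Because H[i][j] depends only on the three cells above/left, all cells on one anti-diagonal can be computed simultaneously, which is what SIMD and multi-threaded SW implementations exploit.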

7.
Scalable, parallel computers: Alternatives, issues, and challenges
The 1990s will be the era of scalable computers. By giving up uniform memory access, computers can be built that scale over a range of several thousand. These provide high peak announced performance (PAP) by using powerful, distributed CMOS microprocessor-primary memory pairs interconnected by a high-performance switch (network). The parameters that determine these structures and their utility include: whether hardware (a multiprocessor) or software (a multicomputer) is used to maintain a distributed, or shared virtual memory (DSM) environment; the power of computing nodes (these improve at 60% per year); the size and scalability of the switch; distributability (the ability to connect to geographically dispersed computers, including workstations); and all forms of software to exploit their inherent parallelism. To a great extent, viability is determined by a computer's generality: the ability to efficiently handle a range of work that requires varying processing (from serial to fully parallel), memory, and I/O resources. A taxonomy and evolutionary time line outlines the next decade of computer evolution, including distributed workstations, based on scalability and parallelism. Workstations can be the best scalable computers.

8.
Tiled multi-core architectures have become an important kind of multi-core design owing to their good scalability and low power consumption. Stream programming has been productively applied to a number of important application domains and provides an attractive way to exploit parallelism. However, the architectural characteristics of large numbers of cores, the memory hierarchy, and exposed communication between tiles present a performance challenge for stream programs running on tiled multi-cores. In this paper, we present StreamTMC, an efficient stream compilation framework that optimizes the execution of stream applications for tiled multi-cores. The framework is composed of three optimization phases. First, a software-pipelining schedule is constructed to exploit parallelism. Second, an efficient hybrid SPM/cache buffer-allocation algorithm and a data-copy elimination mechanism are proposed to improve the efficiency of data access. Last, a communication-aware mapping is proposed to reduce network communication and synchronization overhead. We implement the StreamTMC compiler on Godson-T, a 64-core tiled architecture, and conduct an experimental study to verify its effectiveness. The experimental results indicate that StreamTMC achieves an average improvement of 58% over the performance before optimization.

9.
The Saint-Venant equations describe the routing of unsteady open-channel flow, and in large-scale hydrological simulation software, solving them numerically is the biggest bottleneck in program run time. By analyzing the structure and computational hot spots of the serial program, we identified exploitable parallelism in the per-step simulation loops and instruction scheduling of this compute-intensive code, and designed an asynchronous master-slave parallel scheme for the heterogeneous many-core architecture of the Sunway TaihuLight supercomputer. The solver was ported, parallelized, and accelerated using MPI and the athread library; the slave-core compute sections were vectorized with SIMD; and communication bottlenecks were optimized with strategies such as double buffering. Tests show that the hot-spot functions run on average more than 3 times faster than before optimization. Within a scale of one million control cells, the speedup of the many-core-optimized parallel program grows nearly linearly, and it scales well across multiple Sunway nodes.
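A hedged sketch of the double-buffering pattern mentioned above: while one buffer is being computed on, the next tile is fetched into the other, hiding transfer latency behind computation. fetch_tile()/wait_tile()/compute_tile() are hypothetical placeholders standing in for the platform's asynchronous DMA (on Sunway, the athread get/put interface), not real API signatures.

```c
#define TILE 256

extern void fetch_tile(double *dst, int tile_id);  /* hypothetical async DMA start */
extern void wait_tile(double *dst);                /* hypothetical DMA completion wait */
extern void compute_tile(double *buf);             /* per-tile Saint-Venant update */

void process(int ntiles)
{
    static double bufA[TILE], bufB[TILE];
    double *cur = bufA, *nxt = bufB;

    fetch_tile(cur, 0);
    wait_tile(cur);
    for (int t = 0; t < ntiles; t++) {
        if (t + 1 < ntiles)
            fetch_tile(nxt, t + 1);       /* prefetch the next tile... */
        compute_tile(cur);                /* ...overlapped with this compute */
        if (t + 1 < ntiles)
            wait_tile(nxt);
        double *tmp = cur; cur = nxt; nxt = tmp;   /* swap the two buffers */
    }
}
```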

10.
Distributed Memory Multicomputers (DMMs), such as the IBM SP-2, the Intel Paragon, and the Thinking Machines CM-5, offer significant advantages over shared-memory multiprocessors in terms of cost and scalability. Unfortunately, utilizing all the available computational power in these machines involves a tremendous programming effort on the part of users, which creates a need for sophisticated compiler and run-time support for distributed-memory machines. In this paper, we explore a new compiler optimization for regular scientific applications: the simultaneous exploitation of task and data parallelism. Our optimization is implemented as part of the PARADIGM HPF compiler framework we have developed. The intuitive idea behind the optimization is the use of task parallelism to control the degree of data parallelism of individual tasks. The reason this provides increased performance is that data parallelism gives diminishing returns as the number of processors used is increased. By controlling the number of processors used for each data-parallel task in an application and by executing these tasks concurrently, we make program execution more efficient and, therefore, faster.
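To make the idea concrete, a small nested-OpenMP illustration (a swapped-in analogue, not PARADIGM's HPF output): two independent data-parallel loops run as concurrent tasks, each on its own thread team, instead of serializing both across all threads and paying the diminishing returns of pure data parallelism.

```c
#include <omp.h>

/* Two independent data-parallel loops run as concurrent tasks,
 * each on a 4-thread team, rather than one after the other on 8. */
void task_and_data(double *a, double *b, int n)
{
    omp_set_max_active_levels(2);          /* enable nested parallelism */
    #pragma omp parallel num_threads(2)    /* task level: two tasks */
    {
        if (omp_get_thread_num() == 0) {
            #pragma omp parallel for num_threads(4)   /* data level, task 1 */
            for (int i = 0; i < n; i++) a[i] *= 2.0;
        } else {
            #pragma omp parallel for num_threads(4)   /* data level, task 2 */
            for (int i = 0; i < n; i++) b[i] += 1.0;
        }
    }
}
```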

11.
The advent of multi-core/many-core chip technology offers both an extraordinary opportunity and a profound challenge. In particular, computer architects and system software designers have a unique opportunity to introduce new architectural features together with adequate compiler technology; together, they may have a profound impact. This paper presents a case study (using the 1-D Jacobi computation) of compiler-amenable performance optimization techniques on the many-core architecture Godson-T. Two unique features of the Godson-T architecture were chosen for this study: 1) chip-level globally addressable memory, in particular the scratchpad memories (SPM) local to the processing cores; and 2) fine-grain memory-based synchronization (e.g., full-empty bits). Leveraging state-of-the-art optimization methods for 1-D stencil parallelization (e.g., time tiling and variants), we developed and implemented a number of many-core optimizations for Godson-T. Our experimental study shows good performance in both execution-time speedup and scalability, validates the value of the globally accessible SPM and the fine-grain synchronization mechanism (full-empty bits) on Godson-T, and provides some useful guidelines for future compiler technology for many-core chip architectures.
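For reference, the 1-D Jacobi computation under study, in its plain untiled form (a sketch; the paper's optimizations, such as tiling into the SPM and full-empty-bit synchronization, restructure exactly this loop nest):

```c
/* Plain 1-D Jacobi: each time step averages each point with its
 * two neighbors, writing into the other time level. */
void jacobi_1d(double *a, double *b, int n, int steps)
{
    for (int t = 0; t < steps; t++) {
        for (int i = 1; i < n - 1; i++)
            b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0;   /* 3-point average */
        double *tmp = a; a = b; b = tmp;                 /* swap time levels */
    }
}
```

The memory-bound character of this kernel (three reads and one write per three flops) is what makes SPM tiling across time steps profitable.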

12.
A Parallel Compilation Framework for Heterogeneous Many-core Processors
Heterogeneous many-core processors are an important trend for processors in the high-performance computing field, but their more complex architecture makes the problem of programming difficulty even more acute. To address this problem, we propose a parallel compilation framework for heterogeneous many-core processors, built on the open-source compiler Open64, that automatically transforms programs into heterogeneous parallel programs. The framework consists of four main modules: a task-partitioning module that identifies program regions suitable for accelerated computation, implementing multi-dimensional parallelism recognition for nested loops; a data-layout module that places data between main memory and the SPM, implementing array-bound analysis and pointer-range analysis; a transfer-optimization module that implements data-transfer merging, transfer hoisting, packed transfers, array transposition, and other data-transfer optimizations; and a profit-evaluation module that implements a combined static/dynamic profit evaluation on top of a cost model. The framework has been implemented for the SW26010 processor, and test results show that it can transform a range of programs into a parallel form oriented to the heterogeneous many-core architecture with good speedup.

13.
Advances at an unprecedented rate in computer hardware and networking technologies have made many-core computing affordable and readily available in a matter of a few years. Nonetheless, this challenges programmers to build scalable parallel software. Optimizing parallel programs for a many-core platform is a multifaceted problem in which system and architectural factors must be taken into account. In this paper, we tackle this problem by implementing parallel programs with the different available programming paradigms and evaluating application behavior on the TILE64 many-core platform. Specifically, we investigate a hybrid producer-write plus consumer-read shared-memory programming paradigm for the implementation of a master-worker video decoder and encoder on this many-core platform. Experimental results show that the proposed implementation achieves competitive speedup, scaling well with the number of available cores, and up to a fourfold performance improvement over other implementations when decoding a sample 1080p video.
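A minimal POSIX-threads version of the producer-write plus consumer-read queue underlying such a master-worker design (a sketch under the assumption of a bounded buffer of frame work items, not the TILE64 implementation):

```c
#include <pthread.h>

#define QSIZE 64

typedef struct {
    void *items[QSIZE];          /* e.g., pointers to frame slices */
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
} queue_t;

void q_init(queue_t *q)
{
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_full, NULL);
    pthread_cond_init(&q->not_empty, NULL);
}

void q_put(queue_t *q, void *item)   /* master: produce a work item */
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QSIZE)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[q->tail] = item;
    q->tail = (q->tail + 1) % QSIZE;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

void *q_get(queue_t *q)              /* worker: consume a work item */
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    void *item = q->items[q->head];
    q->head = (q->head + 1) % QSIZE;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return item;
}
```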

14.
The short-range pair interaction consumes most of the CPU time in molecular dynamics (MD) simulations. The inherent sparsity of the computation makes it challenging to achieve a high-performance kernel on emerging many-core architectures. In this paper, we present a highly efficient short-range force kernel on Sunway, a novel many-core architecture with many unique features. The parallel efficiency of this algorithm on the Sunway many-core processor is strongly limited by poor data locality and write conflicts. To enhance data locality, we adopt a super-cluster-based neighbor list with a granularity that fits in the local memory of the computing cores. In the absence of a low-overhead locking mechanism, data-privatized force arrays are a more feasible way to avoid write conflicts, but they incur a large data-reduction overhead. We adopt a dual-slice partitioning scheme for both hardware resources and computing tasks, which uses on-chip data communication to reduce the reduction overhead and provide load balancing. Moreover, we exploit single-instruction-multiple-data (SIMD) parallelism and perform instruction reordering of the force kernel on this many-core processor. The experimental results show that the optimized force kernel obtains a speedup of 226x over the reference implementation and achieves 20% of the peak flop rate on the Sunway many-core processor.
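A hedged sketch of the baseline data-privatization pattern the paper starts from: each thread accumulates into a private force array to avoid write conflicts, then the copies are reduced; this reduction cost is exactly what the dual-slice scheme attacks. Generic OpenMP with a scalar per-pair force for brevity (real MD forces are 3-D vectors), not the Sunway kernel.

```c
#include <stdlib.h>
#include <omp.h>

void forces_privatized(double *f, int natoms,
                       const int *pi, const int *pj,
                       const double *pair_f, int npairs)
{
    int nthreads = omp_get_max_threads();
    double *priv = calloc((size_t)nthreads * natoms, sizeof(double));

    #pragma omp parallel
    {
        double *my = priv + (size_t)omp_get_thread_num() * natoms;
        #pragma omp for
        for (int p = 0; p < npairs; p++) {   /* Newton's third law: +f and -f */
            my[pi[p]] += pair_f[p];
            my[pj[p]] -= pair_f[p];
        }
        /* implicit barrier, then reduce the private copies */
        #pragma omp for
        for (int a = 0; a < natoms; a++) {
            double s = 0.0;
            for (int t = 0; t < nthreads; t++)
                s += priv[(size_t)t * natoms + a];
            f[a] += s;
        }
    }
    free(priv);
}
```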

15.
Benchmark evaluation of the IBM SP2 for parallel signal processing
This paper evaluates the IBM SP2 architecture, the AIX parallel programming environment, and the IBM message-passing library (MPL) through STAP (Space-Time Adaptive Processing) benchmark experiments. Only coarse-grain parallelism was exploited on the SP2 owing to its high communication overhead. A new parallelization scheme is developed for programming message-passing multicomputers. Parallel STAP benchmark structures are illustrated with domain decomposition, efficient mapping of partitioned programs, and optimization of collective communication operations. We measure SP2 performance in terms of execution time, Gflop/s rate, speedup over a single SP2 node, and overall system utilization. With 256 nodes, the Maui SP2 demonstrated its best performance of 23 Gflop/s in executing the High-Order Post-Doppler program, corresponding to 34% system utilization. We have conducted a scalability analysis to reveal the performance growth rate as a function of machine size and STAP problem size. Important lessons learned from these parallel processing benchmark experiments are discussed in the context of real-time, adaptive radar signal processing on massively parallel processors (MPPs).

16.
The Adapteva Epiphany many-core architecture comprises a 2D tiled mesh Network-on-Chip (NoC) of low-power RISC cores with minimal uncore functionality. It offers high computational energy efficiency for both integer and floating point calculations as well as parallel scalability. Yet despite the interesting architectural features, a compelling programming model has not been presented to date. This paper demonstrates an efficient parallel programming model for the Epiphany architecture based on the Message Passing Interface (MPI) standard. Using MPI exploits the similarities between the Epiphany architecture and a conventional parallel distributed cluster of serial cores. Our approach enables MPI codes to execute on the RISC array processor with little modification and achieve high performance. We report benchmark results for the threaded MPI implementation of four algorithms (dense matrix–matrix multiplication, N-body particle interaction, five-point 2D stencil update, and 2D FFT) and highlight the importance of fast inter-core communication for the architecture.
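For flavor, a condensed halo exchange for a 1-D block-distributed 5-point stencil, the kind of MPI pattern benchmarked here (a generic sketch; the rank layout, tags, and ghost-row convention are illustrative assumptions):

```c
#include <mpi.h>

/* Rows 1..lrows are owned by this rank; rows 0 and lrows+1 are ghosts.
 * Each step, swap one boundary row with each vertical neighbor. */
void halo_step(double *u, int lrows, int cols, int rank, int nprocs)
{
    MPI_Status st;
    int up = rank - 1, down = rank + 1;
    if (up >= 0)
        MPI_Sendrecv(&u[1 * cols], cols, MPI_DOUBLE, up, 0,      /* send top row   */
                     &u[0 * cols], cols, MPI_DOUBLE, up, 0,      /* recv top ghost */
                     MPI_COMM_WORLD, &st);
    if (down < nprocs)
        MPI_Sendrecv(&u[lrows * cols], cols, MPI_DOUBLE, down, 0,        /* send bottom row   */
                     &u[(lrows + 1) * cols], cols, MPI_DOUBLE, down, 0,  /* recv bottom ghost */
                     MPI_COMM_WORLD, &st);
}
```

On a mesh NoC like the Epiphany, the cost of exactly this nearest-neighbor exchange is what determines stencil scalability, hence the paper's emphasis on fast inter-core communication.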

17.
Moore's law will grant computer architects ever more transistors for the foreseeable future, and the challenge is how to use them to deliver efficient performance and flexible programmability. We propose a many-core architecture, Godson-T, to attack this challenge. On the one hand, Godson-T features a region-based cache coherence protocol, asynchronous data transfer agents, and hardware-supported synchronization mechanisms, providing full potential for highly efficient use of on-chip resources. On the other hand, Godson-T features a highly efficient runtime system, a Pthreads-like programming model, and versatile parallel libraries, which make this many-core design flexibly programmable. This hardware/software co-design methodology bridges high-end computing and mainstream programmers. Experimental evaluations are conducted on a cycle-accurate simulator of Godson-T. The results show that the proposed architecture has good scalability, fast synchronization, high computational efficiency, and flexible programmability.

18.
Typical biomolecular systems such as cellular membranes, DNA, and protein complexes are highly charged. Thus, efficient and accurate treatment of electrostatic interactions is of great importance in computational modeling of such systems. We have employed the GROMACS simulation package to perform extensive benchmarking of different commonly used electrostatic schemes on a range of computer architectures (Pentium-4, IBM Power4, and Apple/IBM G5), for single-processor and parallel performance up to 8 nodes. We have also tested scalability on four different interconnects: InfiniBand, Gigabit Ethernet, Fast Ethernet, and a nearly uniform memory architecture in which CPUs communicate by directly reading from or writing to other CPUs' local memory. It turns out that the particle-mesh Ewald (PME) method performs surprisingly well and offers competitive performance unless parallel runs on PC hardware with an older network infrastructure are needed. Lipid bilayers of 128, 512, and 2048 lipid molecules were used as the test systems, representing typical cases encountered in biomolecular simulations. Our results enable an accurate prediction of computational speed on most current computing systems, both for serial and parallel runs, and should be helpful in, for example, choosing the most suitable configuration for a small departmental computer cluster.

19.
As semiconductor manufacturing technology continues to improve, it is possible to integrate more and more transistors onto a single processor. Many-core processor design has resulted in part from the search to utilize this enormous transistor real estate. The Single-Chip Cloud Computer (SCC) is an experimental many-core processor created by Intel Labs. In this paper we analyze this innovative many-core system by running several workloads with distinctive parallelism characteristics. We investigate the effect on system performance by monitoring specific hardware performance counters, and then experiment with varying hardware configuration parameters such as the number of cores, clock frequency, and voltage levels. We execute the chosen workloads and collect timing, power-consumption, and energy-consumption information on this many-core research platform, enabling a comprehensive analysis of the behavior and scalability of the Intel SCC system under the introduced workloads in terms of performance and energy consumption. Our results show that the profiled parallel workload execution has a communication bottleneck on the Intel SCC system. Moreover, they indicate that the number of cores used for each workload should be chosen carefully to balance execution performance against energy efficiency for different applications.

20.
Highly regular many-core architectures are becoming more and more popular, as they suit inherently highly parallelizable applications such as most of the image and video processing domain. In this article, we present a novel many-core microprocessor ASIC architecture dedicated to embedded video and image processing applications. We propose a flexible many-core approach with two architectures: one implemented in 65 nm CMOS technology containing 16 open-source tiles, and the other implemented in 28 nm FD-SOI CMOS technology containing 64 open-source tiles. Each tile can choose its communication links depending on the most relevant overall parallelism scheme for a targeted application. Both chips are fully functional in simulation. The layouts are presented with frequency, area, and power-consumption results. Various case studies illustrate the proposed flexible many-core architectures, with a focus on architecture exploration, instantiated parallelization schemes, and timing performance.
