Similar Documents
20 similar documents found (search time: 484 ms)
1.
Heterogeneous many-core architectures offer a very high performance-to-power ratio and have become an important direction in supercomputer architecture. However, the more complex parallelism and memory hierarchies of many-core systems pose great challenges for programming and optimization, so research on parallel programming techniques for many-core systems matters both for lowering the difficulty of writing parallel applications on domestic many-core systems and for improving the performance of parallel programs. This paper proposes a unified multi-mode parallel programming model, comprising a heterogeneous, fused acceleration model and an autonomous computation model programmed in a homogeneous style. The Parallel C language is designed around this model and can effectively describe the heterogeneous parallelism of domestic many-core systems. Compared with the MPI+X usage mode on other many-core systems, it gives both programming and system optimization a global view, and is distinctive in multi-level locality description, one-sided messaging, and compatibility with existing multicore applications. A Parallel C compiler was built on Open64, fully supporting both the acceleration and autonomous computation models, with optimizations including data layout with automatic DMA, compiler-directed thread proxying, and topology-aware collective communication. Micro-benchmark and real-application measurements on the Sunway TaihuLight system show that the Parallel C language and compiler deliver good performance and scalability and can effectively support large applications.
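Parallel C itself is not specified in this abstract, so the sketch below shows only the MPI+X baseline it is compared against: MPI across nodes, OpenMP (the "X") within a node. Problem size and names are illustrative.

```cpp
// Baseline "MPI+X" pattern (X = OpenMP): one MPI rank per node,
// OpenMP threads across the node's cores. Illustrative only.
#include <mpi.h>
#include <omp.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int n = 1 << 20;                 // local problem size (assumed)
    std::vector<double> x(n, 1.0);
    double local = 0.0;

    // Intra-node parallelism via OpenMP threads.
    #pragma omp parallel for reduction(+ : local)
    for (int i = 0; i < n; ++i)
        local += x[i] * x[i];

    // Inter-node parallelism via MPI collectives.
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) std::printf("dot = %f on %d ranks\n", global, nranks);
    MPI_Finalize();
    return 0;
}
```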

2.
Compiler-parallelized applications are increasing in importance as moderate-scale multiprocessors become common. This paper evaluates how features of advanced memory systems (e.g., longer cache lines) impact memory system behavior for applications amenable to compiler parallelization. Using full-sized input data sets and applications taken from standard benchmark suites, we measure statistics such as speedups, synchronization and load imbalance, causes of cache misses, cache line utilization, data traffic, and memory costs. This exploration allows us to draw several conclusions. First, we find that larger granularity parallelism often correlates with good memory system behavior, good overall performance, and high speedup in these applications. Second, we show that when long (512 byte) cache lines are used, many of these applications suffer from false sharing and low cache line utilization. Third, we identify some of the common artifacts in compiler-parallelized codes that can lead to false sharing or other types of poor memory system performance, and we suggest methods for improving them. Overall, this study offers both an important snapshot of the behavior of applications compiled by state-of-the-art compilers and an increased understanding of the interplay between cache line size, program granularity, and memory performance in moderate-scale multiprocessors.
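The false-sharing effect the study measures can be reproduced in a few lines: a minimal sketch, assuming a 64-byte line and at most 64 threads (the paper's 512-byte lines only widen the effect).

```cpp
// False sharing: per-thread counters packed into one cache line bounce
// between caches on every remote write; padding each counter to its own
// line removes the coherence traffic.
#include <omp.h>
#include <cstdio>

constexpr int kLine = 64;                 // assumed cache line size (bytes)
struct Padded { alignas(kLine) long value; };

int main() {
    long packed[64] = {};                 // adjacent counters share lines
    Padded padded[64] = {};               // one counter per line

    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        for (long i = 0; i < 10000000; ++i)
            packed[t]++;                  // false sharing: hot line ping-pongs
    }
    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        for (long i = 0; i < 10000000; ++i)
            padded[t].value++;            // padded: line stays cache-local
    }
    std::printf("%ld %ld\n", packed[0], padded[0].value);
    return 0;
}
```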

3.
Compiler-parallelized applications are increasing in importance as moderate-scale multiprocessors become common. This paper evaluates how features of advanced memory systems (e.g., longer cache lines) impact memory system behavior for applications amenable to compiler parallelization. Using full-sized input data sets and applications taken from the SPEC, NAS, PERFECT, and RICEPS benchmark suites, we measure statistics such as speedups, memory costs, causes of cache misses, cache line utilization, and data traffic. This exploration allows us to draw several conclusions. First, we find that larger granularity parallelism often correlates with good memory system behavior, good overall performance, and high speedup in these applications. Second, we show that when long (512 byte) cache lines are used, many of these applications suffer from false sharing and low cache line utilization. Third, we identify some of the common artifacts in compiler-parallelized codes that can lead to false sharing or other types of poor memory system performance, and we suggest methods for improving them. Overall, this study offers both an important snapshot of the behavior of applications compiled by state-of-the-art compilers and an increased understanding of the interplay between cache line size, program granularity, and memory performance in moderate-scale multiprocessors.

4.
The performance gap between CPU and memory widens continuously. Choosing the best memory layout for each hardware architecture is increasingly important as more and more programs become memory bound. For portable codes that run across heterogeneous hardware architectures, the choice of the memory layout for data structures is ideally decoupled from the rest of a program. This can be accomplished via a zero-runtime-overhead abstraction layer, underneath which memory layouts can be freely exchanged. We present the low-level abstraction of memory access (LLAMA), a C++ library that provides such a data structure abstraction layer with example implementations for multidimensional arrays of nested, structured data. LLAMA provides fully C++ compliant methods for defining and switching custom memory layouts for user-defined data types. The library is extensible with third-party allocators. Providing two close-to-life examples, we show that the LLAMA-generated array of structs and struct of arrays layouts produce identical code with the same performance characteristics as manually written data structures. Integrations into the SPEC CPU® lbm benchmark and the particle-in-cell simulation PIConGPU demonstrate LLAMA's abilities in real-world applications. LLAMA's layout-aware copy routines can significantly speed up transfer and reshuffling of data between layouts compared with naive element-wise copying. LLAMA provides a novel tool for the development of high-performance C++ applications in a heterogeneous environment.
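For concreteness, here is the pair of layouts such an abstraction layer switches between, written out by hand; this is a conceptual sketch, not LLAMA's actual API.

```cpp
// The two layouts for a particle record: array of structs (AoS) versus
// struct of arrays (SoA). An abstraction layer lets the same kernel run
// over either without source changes.
#include <vector>
#include <cstddef>

struct ParticleAoS { double x, y, z, mass; };      // array of structs

struct ParticlesSoA {                              // struct of arrays
    std::vector<double> x, y, z, mass;
    explicit ParticlesSoA(std::size_t n) : x(n), y(n), z(n), mass(n) {}
};

// SoA keeps each field contiguous, which vectorizes well; AoS keeps each
// record contiguous, which suits random per-particle access.
double total_mass_aos(const std::vector<ParticleAoS>& p) {
    double m = 0.0;
    for (const auto& q : p) m += q.mass;           // strided over records
    return m;
}

double total_mass_soa(const ParticlesSoA& p) {
    double m = 0.0;
    for (double v : p.mass) m += v;                // unit-stride over field
    return m;
}
```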

5.
In this paper, we claim that memory migration is a useful mechanism for improving the execution of parallel applications in dynamic execution environments, but that its performance depends on related system components such as the processor scheduler. To show this, we evaluate the automatic memory migration mechanism provided by IRIX on Origin systems under different dynamic processor allocation policies when executing OpenMP parallel multiprogrammed workloads. We have focused the evaluation on the effects of the page migration mechanism on the CPU time consumed by each application, the processor allocation received, and the speedup. Results demonstrate that, if the processor scheduler is memory conscious, that is, it keeps the system as stable as possible, the automatic memory page migration mechanism provided by IRIX improves the CPU time consumed by OpenMP applications.

6.
A Thread-Based Data Prefetching Method
The adoption of multithreaded and multicore processors is limited by applications: today most applications, desktop applications in particular, are single-threaded programs that cannot exploit the multiple hardware contexts such processors provide for parallel execution. Using idle contexts to accelerate single-threaded applications is an active research area, focused mainly on improving the memory access efficiency and branch prediction accuracy of traditional sequential applications. In thread-based data prefetching, prefetching threads are extracted from the execution trace of the main thread. They run on idle hardware contexts, in parallel with the main thread, and bring data into memory levels closer to the processor before the main thread needs it. Thread-based prefetching can effectively handle problems that traditional data prefetching struggles with, such as irregular memory access patterns. This paper analyzes the memory access behavior of applications, combines it with control-flow handling, and designs and validates a thread-based data prefetching method, TDP. Simulation results show that TDP yields a performance improvement of about 7%.
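TDP extracts its prefetch slices automatically from the main thread's trace; the sketch below hand-writes an equivalent helper thread for an irregular, index-indirect access pattern. The run-ahead distance and the GCC/Clang __builtin_prefetch intrinsic are assumptions, not details from the paper.

```cpp
// Helper-thread prefetching sketch: the helper runs the address
// computation ahead of the main thread and issues prefetches.
#include <atomic>
#include <thread>
#include <vector>
#include <cstdio>

int main() {
    const int n = 1 << 22;
    std::vector<int> idx(n), data(n, 1);
    for (int i = 0; i < n; ++i) idx[i] = (i * 2654435761u) % n; // irregular

    std::atomic<int> head{0};            // main thread's progress
    const int kAhead = 64;               // run-ahead distance (assumed)

    std::thread helper([&] {             // prefetch slice: addresses only
        for (int i = 0; i < n; ++i) {
            while (i - head.load(std::memory_order_relaxed) > kAhead) {}
            __builtin_prefetch(&data[idx[i]]);   // GCC/Clang builtin
        }
    });

    long sum = 0;                        // main thread: the real work
    for (int i = 0; i < n; ++i) {
        sum += data[idx[i]];
        head.store(i, std::memory_order_relaxed);
    }
    helper.join();
    std::printf("%ld\n", sum);
    return 0;
}
```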

7.
On modern computers, the performance of programs is often limited by memory latency rather than by processor cycle time. To reduce the impact of memory latency, the restructuring compiler community has developed locality-enhancing program transformations such as loop permutation and tiling. These transformations work well for perfectly nested loops (loops in which all assignment statements are contained in the innermost loop), but their performance on codes such as matrix factorizations that contain imperfectly nested loops leaves much to be desired. In this paper, we propose an alternative approach called data-centric transformation. Instead of reasoning directly about the control structure of the program, a compiler using the data-centric approach chooses an order for the arrival of data elements in the cache, determines what computations should be performed when that data arrives, and generates the appropriate code. At runtime, program execution will automatically pull data into the cache in an order that corresponds approximately to the order chosen by the compiler; since statements that touch a data structure element are scheduled close together, locality is improved. The idea of data-centric transformation is very general, and in this paper, we discuss a particular transformation called data-shackling. We have implemented shackling in the SGI MIPSPro compiler, which already has a sophisticated implementation of control-centric transformations for locality enhancement. We present experimental results on the SGI Octane comparing the performance of the two approaches, and show that for dense numerical linear algebra codes, data-shackling does better by factors of two to five.
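A hand-written blocked transpose illustrates the kind of schedule a data-centric transformation arrives at: every statement touching a block runs while that block is cache-resident. The block size is an assumption; shackling derives such structure automatically.

```cpp
// Blocked transpose: visit the matrices block by block so both the read
// block of `a` and the write block of `b` stay cache-resident.
#include <vector>
#include <algorithm>
#include <cstddef>

void transpose_blocked(const std::vector<double>& a,
                       std::vector<double>& b, int n) {
    const int B = 64;                                  // block edge (assumed)
    for (int ii = 0; ii < n; ii += B)
        for (int jj = 0; jj < n; jj += B)
            for (int i = ii; i < std::min(ii + B, n); ++i)
                for (int j = jj; j < std::min(jj + B, n); ++j)
                    b[(std::size_t)j * n + i] = a[(std::size_t)i * n + j];
}
```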

8.
We describe and analyze parallelization techniques for the implementation of portable structured adaptive mesh applications on distributed memory parallel computers. Such methods are difficult to implement on parallel computers because they employ elaborate dynamic data structures to selectively capture localized irregular phenomena. Our infrastructure supports a set of layered abstractions that encapsulate low-level details of resource management, such as grid generation, interprocessor communication, and load balancing. Our layered design also provides the flexibility necessary to accommodate new applications and to fine-tune performance. This flexibility has enabled us to show that the uniformity restrictions imposed by a data parallel Fortran implementation (e.g., HPF) would significantly impact performance of structured adaptive mesh methods. We present computational results from eigenvalue computation arising in materials design.

9.
陈欢  谢健 《计算机科学》2012,39(106):392-395
With the spread of multicore processors, and to fully exploit multicore PCs, computing has been moving toward multicore architectures and multicore computing techniques. To speed up temperature interpolation on a 100 m × 100 m fine grid over the Hunan region, the Kriging interpolation algorithm is improved using OpenMP, the standard shared-memory parallel programming model. On multicore PCs with different core counts, mean temperature was interpolated over 100 m × 100 m and 500 m × 500 m grids of terrain data; this not only cut interpolation time and raised the algorithm's speedup, but also, once integrated into the operational system, markedly improved the system's response time and overall performance.
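The parallelization pattern is the standard one for grid interpolation: output cells are independent, so the outer loops parallelize directly. The sketch below uses inverse-distance weights as a stand-in for the kriging system actually solved in the paper.

```cpp
// OpenMP-parallel grid interpolation: each grid cell is independent.
// IDW weights here are an illustrative stand-in for kriging weights.
#include <omp.h>
#include <vector>
#include <cstddef>

struct Station { double x, y, t; };   // station position and temperature

void interpolate(const std::vector<Station>& st,
                 std::vector<double>& grid, int nx, int ny, double cell) {
    #pragma omp parallel for collapse(2) schedule(static)
    for (int i = 0; i < ny; ++i)
        for (int j = 0; j < nx; ++j) {
            double gx = j * cell, gy = i * cell, wsum = 0, tsum = 0;
            for (const auto& s : st) {
                double d2 = (s.x - gx) * (s.x - gx) + (s.y - gy) * (s.y - gy);
                double w = 1.0 / (d2 + 1e-12);   // IDW stand-in weight
                wsum += w;  tsum += w * s.t;
            }
            grid[(std::size_t)i * nx + j] = tsum / wsum;
        }
}
```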

10.
11.
We present the work on automatic parallelization of array-oriented programs for multi-core machines. Source programs written in standard APL are translated by a parallelizing APL-to-C compiler into parallelized C code, i.e. C mixed with OpenMP directives. We describe techniques such as virtual operations and data-partitioning used to effectively exploit parallelism structured around array-primitives. We present runtime performance data, showing the speedup of the resulting parallelized code, using different numbers of threads and different problem sizes, on a 4-core machine, for several examples.
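As an illustration of the compiler's target form (not its actual output), a parallelized APL sum-reduction `+/V` could lower to code like this:

```cpp
// Possible lowering of the APL reduction +/V to C++ with an OpenMP
// directive; illustrative only.
#include <omp.h>

double apl_plus_reduce(const double* v, long n) {
    double acc = 0.0;
    #pragma omp parallel for reduction(+ : acc) schedule(static)
    for (long i = 0; i < n; ++i)
        acc += v[i];            // array primitive maps to a parallel loop
    return acc;
}
```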

12.
This paper compares data distribution methodologies for scaling the performance of OpenMP on NUMA architectures. We investigate the performance of automatic page placement algorithms implemented in the operating system, runtime algorithms based on dynamic page migration, runtime algorithms based on loop scheduling transformations and manual data distribution. These techniques present the programmer with trade-offs between performance and programming effort. Automatic page placement algorithms are transparent to the programmer, but may compromise memory access locality. Dynamic page migration algorithms are also transparent, but require careful engineering and tuned implementations to be effective. Manual data distribution requires substantial programming effort and architecture-specific extensions to the API, but may localize memory accesses in a nearly optimal manner. Loop scheduling transformations may or may not require intervention from the programmer, but conform better to an architecture-agnostic programming paradigm like OpenMP. We identify the conditions under which runtime data distribution algorithms can optimize memory access locality in OpenMP. We also present two novel runtime data distribution techniques, one based on memory access traces and another based on affinity scheduling of parallel loops. These techniques can be used to effectively replace manual data distribution in regular applications. The results provide a proof of concept that it is possible to scale a portable shared-memory programming model up to more than 100 processors, without modifying the API and without exposing architectural details to the programmer.
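The automatic placement policies discussed build on the first-touch page placement most NUMA operating systems provide; a minimal sketch, assuming a first-touch policy and a static loop schedule:

```cpp
// First-touch placement: initializing under the same static schedule as
// the compute loop places each page on the node of the thread that will
// later use it.
#include <memory>
#include <cstddef>

void run(std::size_t n) {
    // new double[n] leaves the memory untouched, so pages are placed
    // only when a thread first writes them.
    std::unique_ptr<double[]> a(new double[n]);

    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i)
        a[i] = 0.0;                 // first touch: page lands on this node

    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i)
        a[i] = a[i] * 2.0 + 1.0;    // same schedule -> node-local accesses
}
```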

13.
To address the usability of the dataflow programming model, making it applicable to streaming applications with dynamic data exchange rates while preserving program parallelism, this paper designs a dataflow compilation system that combines dynamic scheduling with static optimization. The compiler takes source programs written in the COStream language as input; after analyzing the source program, it partitions it into coarse-grained subgraphs at the communication edges that carry dynamic rates and applies static optimizations within each subgraph. Using the estimated workload of each compute unit in a subgraph, it maps compute units onto processor cores and, through stage partitioning, assigns the compute units of a subgraph to pipeline stages. At run time each subgraph starts one thread per processor core; optimizing inter-thread communication avoids the synchronization overhead of multiple threads reading and writing the same memory region at once and reduces the number of thread context switches. Semaphores control synchronization among the threads within a subgraph, and execution of the subgraphs is scheduled dynamically, based on the run-time data exchange rates of each subgraph's compute units and the state of the current threads, to build a dynamic software pipeline, for which the corresponding multithreaded target code is generated. The compiler's performance is tested and analyzed on general-purpose x86-64 multicore processors. Results show that the system supports dataflow applications with dynamic data exchange rates, broadening its applicability while delivering a measurable speedup.
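A minimal sketch of the semaphore-based stage synchronization the abstract describes: two pipeline stages over a ring buffer, using C++20 counting semaphores. Buffer capacity and item count are illustrative.

```cpp
// Two pipeline stages synchronized with counting semaphores over a ring
// buffer; the producer may run at most kCap items ahead of the consumer.
#include <semaphore>
#include <thread>
#include <cstdio>

int main() {
    constexpr int kCap = 8, kItems = 100;
    int ring[kCap];
    std::counting_semaphore<kCap> slots(kCap);  // free slots
    std::counting_semaphore<kCap> items(0);     // filled slots

    std::thread producer([&] {                  // stage 1
        for (int i = 0; i < kItems; ++i) {
            slots.acquire();
            ring[i % kCap] = i * i;             // produce
            items.release();
        }
    });
    long sum = 0;
    for (int i = 0; i < kItems; ++i) {          // stage 2 (this thread)
        items.acquire();
        sum += ring[i % kCap];                  // consume
        slots.release();
    }
    producer.join();
    std::printf("%ld\n", sum);
    return 0;
}
```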

14.
In compiling applications for distributed memory machines, runtime analysis is required when data to be communicated cannot be determined at compile-time. One such class of applications requiring runtime analysis is block structured codes. These codes employ multiple structured meshes, which may be nested (for multigrid codes) and/or irregularly coupled (called multiblock or irregularly coupled regular mesh problems). In this paper, we present runtime and compile-time analysis for compiling such applications on distributed memory parallel machines in an efficient and machine-independent fashion. We have designed and implemented a runtime library which supports the runtime analysis required. The library is currently implemented on several different systems. We have also developed compiler analysis for determining data access patterns at compile time and inserting calls to the appropriate runtime routines. Our methods can be used by compilers for HPF-like parallel programming languages in compiling codes in which data distribution, loop bounds and/or strides are unknown at compile-time. To demonstrate the efficacy of our approach, we have implemented our compiler analysis in the Fortran 90D/HPF compiler developed at Syracuse University. We have experimented with a multi-block Navier-Stokes solver template and a multigrid code. Our experimental results show that our primitives have low runtime communication overheads and the compiler parallelized codes perform within 20% of the codes parallelized by manually inserting calls to the runtime library.
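Runtime analysis of this kind is commonly organized as an inspector/executor split; a single-process mock-up, with a gather loop standing in for the actual message passing such a library performs:

```cpp
// Inspector: run once, when loop bounds and index patterns become known
// at run time, to build a reusable communication schedule.
#include <vector>
#include <algorithm>

struct Schedule { std::vector<int> fetch; };   // indices owned elsewhere

Schedule inspect(const std::vector<int>& accessed, int own_lo, int own_hi) {
    Schedule s;
    for (int i : accessed)
        if (i < own_lo || i >= own_hi) s.fetch.push_back(i);
    std::sort(s.fetch.begin(), s.fetch.end());
    s.fetch.erase(std::unique(s.fetch.begin(), s.fetch.end()), s.fetch.end());
    return s;
}

// Executor (per time step): gather exactly the scheduled elements before
// running the loop. A real library exchanges these via messages.
void gather(const Schedule& s, const std::vector<double>& global,
            std::vector<double>& ghost) {
    ghost.clear();
    for (int i : s.fetch) ghost.push_back(global[i]);
}
```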

15.
The performance of irregular applications on modern computer systems is hurt by the wide gap between CPU and memory speeds because these applications typically under-utilize multi-level memory hierarchies, which help hide this gap. This paper investigates using data and computation reorderings to improve memory hierarchy utilization for irregular applications. We evaluate the impact of reordering on data reuse at different levels in the memory hierarchy. We focus on coordinated data and computation reordering based on space-filling curves and we introduce a new architecture-independent multi-level blocking strategy for irregular applications. For two particle codes we studied, the most effective reorderings reduced overall execution time by a factor of two and four, respectively. Preliminary experience with a scatter benchmark derived from a large unstructured mesh application showed that careful data and computation ordering reduced primary cache misses by a factor of two compared to a random ordering.
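A common concrete choice of space-filling curve is the Z-order (Morton) curve; a sketch of reordering particles along it, assuming 16 bits per axis, non-negative coordinates, and a given cell size (the abstract does not fix these details):

```cpp
// Reordering particles along a Z-order (Morton) curve: particles close
// in space become close in memory, improving locality of later sweeps.
#include <vector>
#include <algorithm>
#include <cstdint>

static std::uint32_t spread(std::uint32_t v) {   // insert 0 between bits
    v &= 0xFFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

static std::uint32_t morton2d(std::uint32_t x, std::uint32_t y) {
    return spread(x) | (spread(y) << 1);
}

struct Particle { float x, y; /* ... payload ... */ };

void reorder(std::vector<Particle>& p, float cell) {
    std::sort(p.begin(), p.end(),
              [cell](const Particle& a, const Particle& b) {
        return morton2d((std::uint32_t)(a.x / cell), (std::uint32_t)(a.y / cell))
             < morton2d((std::uint32_t)(b.x / cell), (std::uint32_t)(b.y / cell));
    });
}
```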

16.
The method of discontinuous finite element discrete ordinates, which involves inverting an operator by iteratively sweeping across a mesh from multiple directions, is commonly used to solve the time-dependent particle transport equation. GPUs offer great capability for scientific applications, but particle transport on unstructured grids poses several challenges when implemented on a GPU. This paper presents an efficient implementation of particle transport on an unstructured grid, in a 2D cylindrical Lagrangian coordinate system, on a fine-grained data-parallel GPU platform, addressing three aspects: determining the sweep order of elements for different angular directions; mapping the sweep calculation onto the GPU thread execution model; and using on-chip memory efficiently to improve performance. To the authors' knowledge, this is the first implementation of a general-purpose particle transport simulation with an unstructured grid on a GPU. Experimental results show that the speedup of an NVIDIA M2050 GPU with double-precision floating-point operations ranges from 11.03 to 17.96 over a serial implementation on an Intel Xeon X5355 and a Core Q6600.
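Determining the sweep order amounts to levelling the per-direction dependence graph: all elements whose upwind neighbours are finished form one wavefront, which can execute concurrently (e.g., one GPU launch per level). A host-side sketch of that levelling, with the dependence graph assumed given:

```cpp
// Kahn-style levelling: group mesh elements into wavefronts for one
// angular direction. `down[u]` lists downwind dependents of element u;
// `indeg[v]` counts v's unfinished upwind neighbours.
#include <vector>
#include <utility>

std::vector<std::vector<int>> sweep_levels(
        const std::vector<std::vector<int>>& down, std::vector<int> indeg) {
    std::vector<std::vector<int>> levels;
    std::vector<int> cur;
    for (int v = 0; v < (int)indeg.size(); ++v)
        if (indeg[v] == 0) cur.push_back(v);      // no upwind dependence
    while (!cur.empty()) {
        std::vector<int> next;
        for (int u : cur)
            for (int v : down[u])                 // retire u's out-edges
                if (--indeg[v] == 0) next.push_back(v);
        levels.push_back(std::move(cur));         // one concurrent wavefront
        cur = std::move(next);
    }
    return levels;
}
```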

17.
Apan Qasem  Josh Magee 《Software》2013,43(6):705-729
Translation Lookaside Buffers (TLBs) can play a critical role in improving the performance of emerging parallel workloads. Most current chip multiprocessor systems include multilevel TLBs and provide support for superpages both at the hardware and software level. Judicious use of superpages can significantly cut down the number of TLB misses and improve overall system performance. However, indiscriminate superpage allocation results in page fragmentation and increased application footprint, which often outweigh the benefits of reduced TLB misses. Previous research has explored policies for smart allocation of superpages from an operating system perspective. This paper presents a compiler-based strategy for automatic and profitable memory allocation via superpages. A significant advantage of a compiler-based approach is the availability of data-reuse information within an application. Our strategy employs data-locality analysis to estimate the TLB demands for both single-threaded and multi-threaded programs and uses this metric to apply selective superpage allocation. Apart from its obvious utility in improving TLB performance, this strategy can be used to improve the effectiveness of certain data-layout transformations and can be a useful tool in benchmarking and automatic performance tuning. We demonstrate the effectiveness of this strategy with experiments on three multicore platforms on a workload that contains both sequential and parallel applications.
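On Linux, one mechanism a compiler-selected allocation could target is transparent huge pages via madvise; a minimal sketch (the paper's strategy decides where such a request pays off, which this sketch does not):

```cpp
// Request superpage (huge page) backing for a hot region on Linux.
// madvise/MADV_HUGEPAGE are Linux-specific; the hint is advisory and
// falls back to base pages if huge pages are unavailable.
#include <sys/mman.h>
#include <cstdio>
#include <cstddef>

int main() {
    const std::size_t bytes = 64ul << 20;        // 64 MiB hot array (assumed)
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;

    if (madvise(p, bytes, MADV_HUGEPAGE) != 0)   // hint: back with 2 MiB pages
        std::perror("madvise");                  // non-fatal

    double* a = static_cast<double*>(p);
    for (std::size_t i = 0; i < bytes / sizeof(double); ++i)
        a[i] = 0.0;                              // touching maps the pages

    munmap(p, bytes);
    return 0;
}
```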

18.
宋广辉  郭绍忠  赵捷  陶小涵  李飞  许瑾晨 《软件学报》2023,34(12):5704-5723
Mixed precision has seen much progress in deep learning and in precision tuning and optimization, and a broad body of work shows that mixed-precision optimization for stencil computation is likewise a challenging direction. Meanwhile, the research results of the polyhedral model in automatic parallelization show that the model provides a good mathematical abstraction of loop nests, on top of which a series of loop transformations can be performed. Based on polyhedral compilation techniques, this paper designs and implements an automatic mixed-precision optimizer for stencil computation; through iteration-space partitioning, dataflow analysis, and schedule-tree transformations at the intermediate-representation level, it achieves, for the first time, source-to-source automatic generation of mixed-precision code for stencil computation. Experiments show that the automatically optimized code reduces precision redundancy while fully exploiting its parallel potential, improving program performance. Taking high-precision computation as the baseline, the maximum speedup is 1.76 with a geometric mean of 1.15 on an x86 platform, and the maximum speedup is 1.64 with a geometric mean of 1.20 on the new-generation domestic Sunway platform.
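A hand-written illustration of the kind of code such an optimizer generates: the iteration space of a 1-D Jacobi stencil split into a float region and a double region. The split point here is arbitrary; the paper's optimizer derives it from dataflow analysis.

```cpp
// Mixed-precision 1-D Jacobi sweep: part of the iteration space runs in
// float, the rest in double. Illustrative only; the split is hand-chosen.
#include <vector>

void jacobi_mixed(const std::vector<double>& in, std::vector<double>& out,
                  int n, int split) {
    // Low-sensitivity region: float arithmetic (assumed tolerable here).
    for (int i = 1; i < split; ++i) {
        float l = (float)in[i - 1], c = (float)in[i], r = (float)in[i + 1];
        out[i] = (l + c + r) * (1.0f / 3.0f);
    }
    // Sensitive region: keep full double precision.
    for (int i = split; i < n - 1; ++i)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
}
```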

19.
Many large scale applications have significant I/O requirements as well as computational and memory requirements. Unfortunately, the limited number of I/O nodes provided in a typical configuration of modern message-passing distributed-memory architectures such as the Intel Paragon and IBM SP-2 severely limits the I/O performance of these applications. We examine some software optimization techniques and evaluate their effects in five different I/O-intensive codes from both small and large application domains. Our goals in this study are twofold. First, we want to understand the behavior of large-scale data-intensive applications and the impact of I/O subsystems on their performance and vice versa. Second, and more importantly, we strive to determine the solutions for improving the applications' performance by a mix of software techniques. Our results reveal that different applications can benefit from different optimizations. For example, we found that some applications benefit from file layout optimizations whereas others take advantage of collective I/O. A combination of architectural and software solutions is normally needed to obtain good I/O performance. For example, we show that with a limited number of I/O resources, it is possible to obtain good performance by using appropriate software optimizations. We also show that beyond a certain level, imbalance in the architecture results in performance degradation even when using optimized software, thereby indicating the necessity of an increase in I/O resources.
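Collective I/O, one of the optimizations evaluated, lets the MPI library coalesce all ranks' requests before they reach the few I/O nodes; a minimal MPI-IO sketch with contiguous per-rank slices (file name and sizes are illustrative):

```cpp
// Collective write: every rank writes its slice with one collective call,
// allowing the library to merge requests into large sequential accesses.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 18;                        // doubles per rank
    std::vector<double> buf(n, (double)rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset off = (MPI_Offset)rank * n * sizeof(double);
    MPI_File_write_at_all(fh, off, buf.data(), n, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);     // collective variant
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}
```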

20.
In modern scientific computing communities, scientists are involved in managing massive amounts of very large data collections in a geographically distributed environment. Research in the area of grid computing has given us various ideas and solutions to address these requirements. Data grid mostly deals with large computational problems and provides geographically distributed resources for large-scale data-intensive applications that generate large data sets. Peer-to-peer (P2P) networks have also become a major research topic over the last few years. In a distributed P2P system, a discovery algorithm is required to locate specific information, applications, or users within the system. In this research work, we present our scientific data grid as a large P2P-based distributed system model. By using this model, we study various discovery algorithms for locating data sets in a data grid system. The algorithms we studied are based on the P2P architecture. We investigate these algorithms using our Grid Simulator developed using PARSEC. In this paper, we illustrate our scientific data grid model and our Grid Simulator. We then analyze the performance of the discovery algorithms relative to their average number of hops, success rates and bandwidth consumption.
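Among the P2P discovery algorithms such studies compare, TTL-limited flooding is the classic baseline; a small in-memory simulation sketch (topology and TTL are illustrative, not taken from the paper):

```cpp
// TTL-limited flooding: a query fans out to neighbours until its hop
// budget is spent. Counts how many peers the query reaches.
#include <vector>
#include <queue>
#include <utility>
#include <cstdio>

int flood(const std::vector<std::vector<int>>& adj, int origin, int ttl) {
    std::vector<bool> seen(adj.size(), false);
    std::queue<std::pair<int, int>> q;           // (peer, remaining ttl)
    q.push({origin, ttl});
    seen[origin] = true;
    int reached = 0;
    while (!q.empty()) {
        auto [u, t] = q.front(); q.pop();
        ++reached;
        if (t == 0) continue;                    // hop budget exhausted
        for (int v : adj[u])
            if (!seen[v]) { seen[v] = true; q.push({v, t - 1}); }
    }
    return reached;
}

int main() {
    // Tiny ring-with-chords topology for illustration.
    std::vector<std::vector<int>> adj = {
        {1, 4}, {0, 2}, {1, 3, 5}, {2, 4}, {3, 0}, {2}};
    std::printf("reached %d peers\n", flood(adj, 0, 2));
    return 0;
}
```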

