Similar Documents
20 similar documents found.
1.
Developing parallel codes that are both scalable and portable across different processor architectures is a challenging task. To address this challenge, we investigate accelerating the Elastodynamic Finite Integration Technique (EFIT) for modeling 2-D wave propagation in viscoelastic media using modern parallel computing devices (PCDs), such as multi-core CPUs (central processing units) and GPUs (graphics processing units). For that purpose we choose the industry open standard Open Computing Language (OpenCL) and the open-source toolkit PyOpenCL. The implementation is platform independent and can be used on AMD or NVIDIA GPUs as well as classical multi-core CPUs. The code is based on the Kelvin–Voigt mechanical model, which has the advantage of not requiring additional field variables. OpenCL performance can, in principle, be improved by eliminating global memory access latency through the use of local memory. Our main contribution is the implementation of local memory and an analysis of its performance relative to global memory on eight different computing devices (including Kepler, one of the fastest and most efficient high-performance computing technologies) with various operating systems. The full implementation of the code is included.
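
To make the local-memory optimization concrete, the sketch below shows an OpenCL C kernel that stages a tile of a 2-D field in __local memory before applying a simple 4-neighbour stencil, instead of re-reading every neighbour from __global memory. It is only a minimal illustration of the technique: the kernel, the names (field, out, nx, ny, TILE) and the placeholder stencil are assumptions, not the paper's EFIT/Kelvin–Voigt update.

```c
/* Minimal OpenCL C sketch of the local-memory optimization: stage a tile of
 * the field in __local memory so that neighbour accesses hit local memory
 * instead of __global memory. Assumes the host enqueues the kernel with a
 * (TILE, TILE) local work size and that nx and ny are multiples of TILE,
 * so every work-item reaches the barrier. The 4-neighbour average is a
 * placeholder, not the paper's EFIT/Kelvin-Voigt update. */
#define TILE 16

__kernel void stencil_local(__global const float *field,
                            __global float       *out,
                            const int nx, const int ny)
{
    __local float tile[TILE + 2][TILE + 2];      /* tile plus a 1-cell halo */

    const int gx = get_global_id(0);
    const int gy = get_global_id(1);
    const int lx = get_local_id(0) + 1;          /* index into the haloed tile */
    const int ly = get_local_id(1) + 1;

    /* Each work-item stages its own point; edge work-items also stage halo cells. */
    tile[ly][lx] = field[gy * nx + gx];
    if (lx == 1    && gx > 0)      tile[ly][0]        = field[gy * nx + gx - 1];
    if (lx == TILE && gx < nx - 1) tile[ly][TILE + 1] = field[gy * nx + gx + 1];
    if (ly == 1    && gy > 0)      tile[0][lx]        = field[(gy - 1) * nx + gx];
    if (ly == TILE && gy < ny - 1) tile[TILE + 1][lx] = field[(gy + 1) * nx + gx];
    barrier(CLK_LOCAL_MEM_FENCE);                /* make all local stores visible */

    /* Interior points only: placeholder 4-neighbour average read from local memory. */
    if (gx > 0 && gx < nx - 1 && gy > 0 && gy < ny - 1)
        out[gy * nx + gx] = 0.25f * (tile[ly][lx - 1] + tile[ly][lx + 1] +
                                     tile[ly - 1][lx] + tile[ly + 1][lx]);
}
```

Whether staging pays off depends on the device, which is exactly what the local-versus-global comparison across the eight devices measures.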

2.
A broad area in astronomy focuses on simulating extragalactic objects based on Very Long Baseline Interferometry (VLBI) radio-maps. Several algorithms in this area simulate the radio-maps that would be observed if they were emitted from a predefined extragalactic object. This work analyzes the performance and scaling of this kind of algorithm on multi-socket, multi-core architectures. In particular, we evaluate a sharing approach, a privatizing approach, and a hybrid approach on systems with a complex memory hierarchy that includes a shared Last Level Cache (LLC). In addition, we investigate which manual processes can be systematized and then automated in future work. The experiments show that the data-privatizing model scales efficiently on medium-scale multi-socket, multi-core systems (up to 48 cores), while the sharing approach, regardless of algorithmic and scheduling optimizations, is unable to reach acceptable scalability on more than one socket. However, the hybrid model with a specific level of data sharing provides the best scalability over all of the multi-socket, multi-core systems used.
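
To picture the difference between the evaluated strategies, the sketch below contrasts a sharing version (all threads update one shared map under atomics) with a privatizing version (each thread accumulates into its own copy and the copies are merged at the end), using OpenMP. The grid size, the sample count and the pixel_of/value_of helpers are hypothetical; this is not the authors' radio-map code.

```c
/* Sketch: "sharing" vs. "privatizing" accumulation into one output map with
 * OpenMP. NPIX, NSAMP and the pixel_of()/value_of() helpers are hypothetical
 * stand-ins for the radio-map computation. Compile with -fopenmp. */
#include <omp.h>
#include <stdlib.h>

#define NPIX  (1024 * 1024)
#define NSAMP (16L * 1024 * 1024)

static int    pixel_of(long s) { return (int)(s % NPIX); }
static double value_of(long s) { return (double)(s & 0xff); }

/* Sharing approach: all threads update the same grid, protected by atomics. */
void accumulate_shared(double *grid)
{
    #pragma omp parallel for
    for (long s = 0; s < NSAMP; s++) {
        #pragma omp atomic
        grid[pixel_of(s)] += value_of(s);
    }
}

/* Privatizing approach: each thread fills its own copy, then the copies are
 * merged; this trades extra memory for far less inter-socket coherence traffic. */
void accumulate_private(double *grid)
{
    #pragma omp parallel
    {
        double *priv = calloc(NPIX, sizeof(double));
        #pragma omp for nowait
        for (long s = 0; s < NSAMP; s++)
            priv[pixel_of(s)] += value_of(s);

        #pragma omp critical              /* serialize only the final merge */
        for (int p = 0; p < NPIX; p++)
            grid[p] += priv[p];
        free(priv);
    }
}
```

A hybrid variant, as evaluated above, would share one copy among the threads of a socket and privatize across sockets.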

3.
Multi-core processors have become the main direction of high-performance processor architecture research, and the way the cores are interconnected plays an important role in how well a multi-core processor's performance can be exploited. Aiming to lower the node degree, reduce the number of network links, and shorten the network diameter, this paper proposes a new hierarchical interconnection network for on-chip inter-core connection, the base-three hierarchical interconnection network (THIN). Its topology is simple, the node degree is low, the number of links is relatively small, and the network shows clear hierarchy and symmetry as well as good scalability. The static metrics and non-blocking latency of THIN and the 2-D Mesh are compared in depth; the results show that, for small network sizes, THIN is better suited than the 2-D Mesh for building on-chip inter-core communication networks.

4.
We developed new parameterized Particle-in-Cell (PIC) algorithms and data structures for emerging multi-core and many-core architectures. Four parameters allow this PIC code to be tuned to different hardware configurations. Particles are kept ordered at each time step. The first application of these algorithms is to NVIDIA graphics processing units, where speedups of about 15–25 compared to an Intel Nehalem processor were obtained for a simple 2D electrostatic code. Electromagnetic codes are expected to achieve higher speedups due to their greater computational intensity.
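
The central data-structure idea above is that particles are kept ordered by cell at every time step. A common way to do this on a CPU is a counting sort keyed by cell index, sketched below in plain C; the names (Particle, cell_of) and the unit-cell grid are assumptions, and the paper's GPU data structures are more elaborate than this.

```c
/* Sketch: keep particles ordered by cell with a counting sort keyed by cell
 * index. Generic CPU illustration only; the Particle layout, cell_of() and
 * the unit-cell grid are assumptions, not the paper's GPU data structures. */
#include <stdlib.h>
#include <string.h>

typedef struct { float x, y, vx, vy; } Particle;

/* Map a particle to its cell in an nx-by-ny grid of unit cells (assumed layout). */
static int cell_of(const Particle *p, int nx) {
    return (int)p->y * nx + (int)p->x;
}

void sort_by_cell(Particle *p, Particle *tmp, int n, int nx, int ny)
{
    int  ncell = nx * ny;
    int *start = calloc(ncell + 1, sizeof(int));

    for (int i = 0; i < n; i++)            /* 1. histogram cell occupancy      */
        start[cell_of(&p[i], nx) + 1]++;
    for (int c = 0; c < ncell; c++)        /* 2. prefix sum -> start offsets   */
        start[c + 1] += start[c];
    for (int i = 0; i < n; i++)            /* 3. scatter into cell order       */
        tmp[start[cell_of(&p[i], nx)]++] = p[i];

    memcpy(p, tmp, (size_t)n * sizeof(Particle));
    free(start);
}
```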

5.
We consider the energy saving problem for caches on a multi-core processor. Previous research on low-power processors has produced various methods to reduce power dissipation; tag reduction is one of them. This paper extends the tag reduction technique from a single-core processor to a multi-core processor and investigates the potential for energy saving on multi-core processors. We formulate our approach as an equivalent problem: find an assignment of all instruction pages in physical memory to a set of cores such that the tag-reduction conflicts for each core are mostly avoided or reduced. We then propose three algorithms using different heuristics for this assignment problem. We provide convincing experimental results by collecting data from a real operating system, rather than the traditional approach of using a processor simulator, which cannot simulate operating system functions and the full memory hierarchy. Experimental results show that our proposed algorithms can save total energy by up to 83.93% on an 8-core processor and by 76.16% on a 4-core processor on average, compared to the case where tag reduction is not used. They also significantly outperform the tag-reduction algorithm for a single-core processor.
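
Purely as an illustration of the assignment problem stated above (and not one of the paper's three algorithms), the hypothetical sketch below assigns each instruction page greedily to the core where it conflicts least with the pages already placed there, with "conflict" approximated as a differing high-order tag. The tag encoding, capacities, and cost function are all assumptions.

```c
/* Hypothetical greedy sketch of the page-to-core assignment problem: place
 * each instruction page on the core where it clashes least with the pages
 * already assigned there (a clash here means a differing high-order tag).
 * The tag encoding, the cost function and the fixed capacities are all
 * assumptions; this is not one of the paper's three algorithms and it
 * ignores load balancing. Assumes npages <= NCORES * MAXPAGES. */
#include <limits.h>

#define NCORES    8
#define MAXPAGES  4096               /* assumed per-core capacity            */
#define TAG_SHIFT 20                 /* assumed: tag = page number >> 20     */

static int conflict_cost(unsigned long page, const unsigned long *assigned, int n)
{
    int cost = 0;
    for (int i = 0; i < n; i++)
        if ((page >> TAG_SHIFT) != (assigned[i] >> TAG_SHIFT))
            cost++;                  /* a differing tag defeats tag reduction */
    return cost;
}

void assign_pages(const unsigned long *pages, int npages, int *core_of,
                  unsigned long core_pages[NCORES][MAXPAGES], int core_len[NCORES])
{
    for (int c = 0; c < NCORES; c++) core_len[c] = 0;

    for (int i = 0; i < npages; i++) {
        int best = 0, best_cost = INT_MAX;
        for (int c = 0; c < NCORES; c++) {
            if (core_len[c] == MAXPAGES) continue;       /* core is full */
            int cost = conflict_cost(pages[i], core_pages[c], core_len[c]);
            if (cost < best_cost) { best_cost = cost; best = c; }
        }
        core_of[i] = best;
        core_pages[best][core_len[best]++] = pages[i];
    }
}
```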

6.
The tension between software development costs and efficiency is especially high when considering parallel programs intended to run on a variety of architectures. In the domain of shared memory architectures and explicitly parallel programs, the authors have addressed this problem by defining a programming structure that eases the development of effectively portable programs. On each target multiprocessor, an effectively portable program runs almost as efficiently as a program fine-tuned for that machine. Additionally, its software development cost is close to that of a single program that is portable across the targets. Using this model, programs are defined in terms of data structure and partitioning-scheduling abstractions. Low software development cost is attained by writing source programs in terms of abstract interfaces and thereby requiring minimal modification to port; high performance is attained by matching (often dynamically) the interfaces to implementations that are most appropriate to the execution environment. The authors include results of a prototype used to evaluate the benefits and costs of this approach.
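
One plain-C way to picture the "abstract interface, multiple implementations" structure described above is a table of function pointers: the program is written against the interface, and the partitioning/scheduling implementation best suited to the machine is bound at run time. The names below (Scheduler, block_next, dynamic_next) are illustrative only, not the authors' abstractions.

```c
/* Sketch: program code depends only on an abstract partitioning/scheduling
 * interface; the implementation best suited to the machine is bound at run
 * time through a table of function pointers. Names are illustrative only. */
#include <stdio.h>

typedef struct Scheduler {
    const char *name;
    /* Return the next chunk [*lo, *hi) for worker `id`; 0 means no work left. */
    int (*next_chunk)(struct Scheduler *self, int id, long *lo, long *hi);
    long n, pos, chunk, nworkers;
} Scheduler;

/* Static blocking: worker id receives one contiguous block, exactly once. */
static int block_next(Scheduler *s, int id, long *lo, long *hi) {
    if (s->pos & (1L << id)) return 0;           /* already served this worker */
    s->pos |= (1L << id);
    long blk = (s->n + s->nworkers - 1) / s->nworkers;
    *lo = id * blk;
    *hi = (*lo + blk < s->n) ? *lo + blk : s->n;
    return *lo < *hi;
}

/* Dynamic chunking: workers grab fixed-size chunks until the range is drained
 * (a threaded version would update pos with an atomic fetch-and-add). */
static int dynamic_next(Scheduler *s, int id, long *lo, long *hi) {
    (void)id;
    if (s->pos >= s->n) return 0;
    *lo = s->pos; s->pos += s->chunk;
    *hi = (s->pos < s->n) ? s->pos : s->n;
    return 1;
}

int main(void)
{
    Scheduler impls[2] = {
        { "block",   block_next,   100, 0, 0,  4 },
        { "dynamic", dynamic_next, 100, 0, 16, 4 },
    };
    for (int i = 0; i < 2; i++) {                /* the choice could be made per machine */
        Scheduler *s = &impls[i];
        long lo, hi;
        while (s->next_chunk(s, 0, &lo, &hi))    /* caller sees only the interface */
            printf("%-7s worker 0: [%ld, %ld)\n", s->name, lo, hi);
    }
    return 0;
}
```

Porting then means supplying or selecting another implementation of the same interface, which is the low-development-cost property the abstract describes.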

7.
Non-uniform memory architectures with cache coherence (ccNUMA) are becoming increasingly common, not just for large-scale high performance platforms but also in the context of multi-core architectures. Under ccNUMA, data placement may influence overall application performance significantly as references resolved locally to a processor/core impose lower latencies than remote ones.
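
A standard way to influence data placement under ccNUMA is first-touch allocation: initialize data from the same threads that will later compute on it, so each page is physically allocated on the NUMA node of its eventual user. The OpenMP sketch below is a generic illustration of this (array size and update are placeholders), not code from the work above.

```c
/* Sketch: first-touch placement on a ccNUMA machine. The pages of a[] and b[]
 * are physically allocated on the node of the thread that first writes them,
 * so the compute loop mostly performs node-local accesses. Array size and
 * update are placeholders; build with -fopenmp. */
#include <omp.h>
#include <stdlib.h>

#define N (16L * 1024 * 1024)

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));

    /* First touch: same loop shape and schedule as the compute loop below. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = (double)i; }

    /* Compute: each thread revisits the pages it touched (and thus mapped) above. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) a[i] += 2.0 * b[i];

    free(a); free(b);
    return 0;
}
```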

8.
With Moore's law supplying billions of transistors on-chip, embedded systems are undergoing a transition from single-core to multi-core to exploit this high transistor density for high performance. However, the optimal layout of these multiple cores along with the memory subsystem (caches and main memory) to satisfy power, area, and stringent real-time constraints is a challenging design endeavor. The short time-to-market constraint of embedded systems exacerbates this design challenge and necessitates architectural modeling of embedded systems to reduce the time-to-market by expediting the mapping of target applications to devices/architectures. In this paper, we present a queueing theoretic approach for modeling multi-core embedded systems that provides a quick and inexpensive performance evaluation, both in terms of time and resources, as compared to developing multi-core simulators and running benchmarks on them. We verify our queueing theoretic modeling approach by running SPLASH-2 benchmarks on the SuperESCalar simulator (SESC). Results reveal that our queueing theoretic model qualitatively evaluates multi-core architectures accurately, with an average difference of 5.6% compared to the architectures' evaluations from the SESC simulator. Our modeling approach can be used for performance per watt and performance per unit area characterizations of multi-core embedded architectures, with varying numbers of processor cores and cache configurations, to provide a comparative analysis.
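
As a much-reduced flavour of queueing-based evaluation, the sketch below treats the memory path seen by the cores as a single M/M/1 server and reports utilization and mean response time as the core count grows. The arrival and service rates are placeholders, and the model is far simpler than the queueing network used in the paper.

```c
/* Toy flavour of queueing-based evaluation: treat the memory path shared by
 * the cores as a single M/M/1 server with aggregate arrival rate lambda and
 * service rate mu, so utilization rho = lambda/mu and mean response time
 * T = 1/(mu - lambda). The rates are placeholders, not values from the paper. */
#include <stdio.h>

int main(void)
{
    const double mu = 2.0e9;               /* assumed: 2 G memory requests/s served */
    const int ncores[] = { 1, 2, 4, 8 };

    for (int i = 0; i < 4; i++) {
        double lambda = ncores[i] * 0.2e9; /* assumed: 0.2 G requests/s per core */
        double rho = lambda / mu;
        if (rho >= 1.0) { printf("%d cores: saturated\n", ncores[i]); continue; }
        double T = 1.0 / (mu - lambda);    /* mean time in system, seconds */
        printf("%d cores: utilization %.2f, mean response %.2f ns\n",
               ncores[i], rho, T * 1e9);
    }
    return 0;
}
```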

9.
Unified Parallel C (UPC) is a parallel extension of ANSI C based on the Partitioned Global Address Space (PGAS) programming model, which provides a shared memory view that simplifies code development while still taking advantage of the scalability of distributed memory architectures. Therefore, UPC allows programmers to write parallel applications on hybrid shared/distributed memory architectures, such as multi-core clusters, in a more productive way, accessing remote memory by means of different high-level language constructs, such as assignments to shared variables or collective primitives. However, the standard UPC collectives library includes a reduced set of eight basic primitives with quite limited functionality. This work presents the design and implementation of extended UPC collective functions that overcome the limitations of the standard collectives library, allowing, for example, the use of a specific source and destination thread or defining the amount of data transferred by each particular thread. This library fulfills the demands made by the UPC developer community and implements portable algorithms, independent of the specific UPC compiler/runtime being used. The use of a representative set of these extended collectives has been evaluated using two applications and four kernels as case studies. The results obtained confirm the suitability of the new library to provide easier programming without trading off performance, thus achieving high productivity in parallel programming to harness the performance of hybrid shared/distributed memory architectures in high performance computing.

10.
李士刚  胡长军  王珏  李建江 《软件学报》2013,24(12):2782-2796
Low power consumption and low cost have made heterogeneous multi-core processors an important part of supercomputer computing resources. However, heterogeneous multi-cores feature high bandwidth and loosely coupled coherence, so obtaining ideal memory and compute performance requires paying more attention to low-level hardware details. This work implements CellMLP, a multi-level parallel programming model for the typical heterogeneous multi-core Cell BE processor. Through compiler directives that extend the C language, it supports data-parallel, task-parallel, and pipeline-parallel programming models and improves parallel programming productivity. For runtime support and optimization, data parallelism uses parallel SPE data transfers and double buffering to raise data-transfer bandwidth; task parallelism uses a novel hybrid task queue to support asynchronous task stealing, reducing contention among SPE threads and improving scalability; and pipeline parallelism, for the first time, uses a blocking signal-transfer mechanism to achieve low-overhead synchronization among SPE threads. Experiments on Stream, the NAS benchmarks, and BOTS show that CellMLP efficiently supports a variety of typical parallel applications. Performance comparisons with similar programming models, SARC and CellSs, show that CellMLP has clear advantages in effective data-transfer bandwidth and in support for irregular applications.

11.
In this paper, a programming model is presented which enables scalable parallel performance on multi-core shared memory architectures. The model has been developed for application to a wide range of numerical simulation problems. Such problems involve time stepping or iteration algorithms where synchronization of multiple threads of execution is required. It is shown that traditional approaches to parallelism including message passing and scatter-gather can be improved upon in terms of speed-up and memory management. Using spatial decomposition to create orthogonal computational tasks, a new task management algorithm called H-Dispatch is developed. This algorithm makes efficient use of memory resources by limiting the need for garbage collection and takes optimal advantage of multiple cores by employing a “hungry” pull strategy. The technique is demonstrated on a simple finite difference solver and results are compared to traditional MPI and scatter-gather approaches. The H-Dispatch approach achieves near linear speed-up with results for efficiency of 85% on a 24-core machine. It is noted that the H-Dispatch algorithm is quite general and can be applied to a wide class of computational tasks on heterogeneous architectures involving multi-core and GPGPU hardware.
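
The "hungry" pull idea, in which idle workers request the next spatial task as soon as they finish the current one, can be pictured with a shared atomic counter that threads fetch-and-increment to claim block indices. The sketch below (C11 atomics plus pthreads) is a generic pull dispatcher, not the H-Dispatch implementation; the task count and the work function are placeholders.

```c
/* Sketch of a pull-style dispatcher: idle workers claim the next block index
 * by fetch-and-incrementing a shared atomic counter, so work is handed out on
 * demand rather than pre-assigned. Generic illustration (C11 atomics plus
 * pthreads), not H-Dispatch itself; compile with -pthread. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTASKS   1000
#define NWORKERS 8

static atomic_long next_task = 0;

static void process_block(long id) { (void)id; /* placeholder for real work */ }

static void *worker(void *arg)
{
    long done = 0;
    for (;;) {
        long t = atomic_fetch_add(&next_task, 1);   /* claim the next block */
        if (t >= NTASKS) break;                     /* nothing left: stop    */
        process_block(t);
        done++;
    }
    printf("worker %ld processed %ld blocks\n", (long)(size_t)arg, done);
    return NULL;
}

int main(void)
{
    pthread_t tid[NWORKERS];
    for (long w = 0; w < NWORKERS; w++)
        pthread_create(&tid[w], NULL, worker, (void *)(size_t)w);
    for (int w = 0; w < NWORKERS; w++)
        pthread_join(tid[w], NULL);
    return 0;
}
```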

12.
Fortran 77 implementations of the Level 3 Basic Linear Algebra Subprograms (BLAS) in double precision, structured and tuned to achieve high performance on the IBM 3090 VF, are presented. The implementations are designed to exploit the memory hierarchy and the vector processor efficiently. Efficient cache reuse is provided by a method for matrix blocking adapted to the memory hierarchy. Vector registers and compound vector instructions are used efficiently through carefully designed Fortran code constructs. Performance results generally show speed comparable to the highly tuned IBM ESSL library. In some cases our implementations are actually faster than ESSL. The generality of the program design and the use of Fortran 77 make the implementations portable and well suited to serve as design platforms for other machines with similar architectures.
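
The core technique in this item is blocking for the memory hierarchy. The original code is Fortran 77 tuned for the IBM 3090 VF; purely as a generic illustration of the blocking idea, here is a plain C sketch of a cache-blocked matrix multiply, where the block size BS is an arbitrary tuning parameter.

```c
/* Generic cache-blocked matrix multiply, C += A * B, all n x n and row-major.
 * Illustrates the blocking-for-the-memory-hierarchy idea only; the paper's
 * Level 3 BLAS routines are Fortran 77 tuned for the IBM 3090 VF. */
#define BS 64                        /* block size: a cache-dependent tuning knob */

static int min_int(int a, int b) { return a < b ? a : b; }

void gemm_blocked(int n, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                /* multiply one BS x BS block; its operands stay cache-resident */
                for (int i = ii; i < min_int(ii + BS, n); i++)
                    for (int k = kk; k < min_int(kk + BS, n); k++) {
                        double aik = A[i * n + k];
                        for (int j = jj; j < min_int(jj + BS, n); j++)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}
```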

13.
Starting from the application perspective, this work analyzes and summarizes the core computational kernels found in various applications, optimizes them with a parallel computing model that matches multi-core processor chip architectures, and derives a reusable, high-performance, scalable software library that supports both efficient development of new applications and scalable program performance. Guided by a layered parallel computing model and taking an application-driven view of parallel performance optimization, a parallel algorithm design model oriented toward multi-core chip architectures is first proposed. On this basis, the parallel scan algorithm is analyzed and optimized, yielding a new g-scan algorithm with good scalability and high performance. The sparse linear algebra kernel, one of the 13 core computational kernels, is then studied in depth: a new sparse matrix-vector algorithm is designed and implemented using g-scan and applied to finite element analysis, which is widely used in structural engineering, greatly improving its execution efficiency.
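
To ground the sparse matrix-vector part of the summary, the sketch below shows the standard CSR (compressed sparse row) matrix-vector product that such work starts from; the g-scan-based reorganization proposed in the paper is not reproduced here.

```c
/* Standard CSR (compressed sparse row) sparse matrix-vector product y = A*x.
 * This is the baseline kernel referred to above, not the g-scan-based
 * reorganization proposed in the paper. row_ptr has nrows+1 entries;
 * col_idx[] and val[] store the nonzeros row by row. */
void spmv_csr(int nrows, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];   /* gather from x via the column index */
        y[i] = sum;
    }
}
```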

14.
With the popularity of column-store databases, modern multi-core CPUs, and general-purpose computing on graphics processing units (GPGPUs), there will be radical changes in how processing is done in the online analytical processing (OLAP) and data warehousing fields. Cube computation is a core and time-consuming problem that has been researched extensively. However, most algorithms have been proposed without considering the prevalent multi-core architectures and column storage. This paper presents a new parallel cube algorithm that takes advantage of multi-core architectures. We first propose a cache-conscious bottom-up computation (BUC) algorithm called CC-BUC that adopts an integrated bottom-up and breadth-first partitioning order. Each dimension is stored and processed separately. In processing each dimension, breadth-first data scanning and result output reduce memory I/O and enhance cache locality. Cache misses are confined to the scope of a single dimension, and translation lookaside buffer (TLB) misses are reduced. Based on CC-BUC, we give a multi-core architecture-based cube algorithm called MC-Cubing. Multiple partitions are processed simultaneously, with multiple threads executing in parallel inside each partition. MC-Cubing is well matched to multi-core architectures and exposes a high degree of parallelism. The layout and associated algorithms take advantage of single instruction, multiple data (SIMD) instructions and thread-level parallelism (TLP). We implement and demonstrate the effectiveness of MC-Cubing on two multi-core architectures: multi-core CPUs and GPUs. Experimental results show that MC-Cubing runs nearly six times faster than BUC on real datasets.

15.
This paper compares data distribution methodologies for scaling the performance of OpenMP on NUMA architectures. We investigate the performance of automatic page placement algorithms implemented in the operating system, runtime algorithms based on dynamic page migration, runtime algorithms based on loop scheduling transformations and manual data distribution. These techniques present the programmer with trade-offs between performance and programming effort. Automatic page placement algorithms are transparent to the programmer, but may compromise memory access locality. Dynamic page migration algorithms are also transparent, but require careful engineering and tuned implementations to be effective. Manual data distribution requires substantial programming effort and architecture-specific extensions to the API, but may localize memory accesses in a nearly optimal manner. Loop scheduling transformations may or may not require intervention from the programmer, but conform better to an architecture-agnostic programming paradigm like OpenMP. We identify the conditions under which runtime data distribution algorithms can optimize memory access locality in OpenMP. We also present two novel runtime data distribution techniques, one based on memory access traces and another based on affinity scheduling of parallel loops. These techniques can be used to effectively replace manual data distribution in regular applications. The results provide a proof of concept that it is possible to scale a portable shared-memory programming model up to more than 100 processors, without modifying the API and without exposing architectural details to the programmer.
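
One of the loop-scheduling ideas above can be shown in a few lines: if the initializing loop and every compute loop use the same static schedule over the same index range, each iteration is always executed by the thread that first touched its data, so locality is preserved without any API extension. The OpenMP sketch below is generic; the array size and update rule are placeholders.

```c
/* Sketch: locality through a consistent static schedule. Because every
 * parallel loop uses schedule(static) over the same index range, iteration i
 * is always run by the same thread, so the data it first touched stays on
 * that thread's node across all time steps. N, NSTEPS and the update rule
 * are placeholders; build with -fopenmp. */
#include <omp.h>
#include <stdlib.h>

#define N      (8L * 1024 * 1024)
#define NSTEPS 100

int main(void)
{
    double *u = malloc(N * sizeof(double));

    #pragma omp parallel for schedule(static)      /* first touch decides placement */
    for (long i = 0; i < N; i++) u[i] = 1.0;

    for (int step = 0; step < NSTEPS; step++) {
        #pragma omp parallel for schedule(static)  /* same owner thread every step */
        for (long i = 0; i < N; i++)
            u[i] = 0.5 * u[i] + 0.5;               /* placeholder update */
    }

    free(u);
    return 0;
}
```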

16.
何裕南  安虹  郭锐  梁博 《计算机科学》2007,34(1):248-254
CPU design is shifting from single-threaded, single-core structures that exploit only instruction-level parallelism toward multi-threaded, multi-core structures that exploit thread-level parallelism, yet there is still no portable and widely used open-source multi-core processor simulator, which limits high-quality research on such architectures. We developed OpenCMP, a multi-core processor architecture simulator, to support current and future research on key techniques of multi-threaded, multi-core processor architectures. The simulator abstracts the multi-core processor structure appropriately and provides an extensible, flexible simulation framework for mainstream multi-core architecture research, including simulation of out-of-order, in-order, and simultaneous multithreading cores, enabling comparative studies over a larger multi-core design space. Taking a multi-core simulator that supports transactional memory as an example, this paper describes in detail how to design and implement a multi-core processor simulator by abstracting the most basic features and components of multi-core structures and the transactional memory model and by extending the single-core simulator SimpleScalar. Preliminary results show that, compared with existing multi-core simulators, this simulator better supports research on the transactional memory model and on multi-core processor architectures based on it.

17.
Multi-core processors based on the cc-NUMA architecture are the future mainstream, and systems integrating a hundred processor cores will appear within a few years, yet existing system software cannot fully exploit the advantages of this architecture. This paper designs and implements a virtual machine prototype that hides the characteristics of the underlying cc-NUMA architecture from the operating system above it, so that the OS can run efficiently without modification and application development becomes easier. Experimental results show that Linux running inside a single virtualized NUMA node achieves very good performance.

18.
Energy-efficient single-processor and fully pipelined architectures for the lifting-based JPEG2000 5/3 two-dimensional (2D) discrete wavelet transform are presented. The single processor performs the row- and column-wise processing simultaneously, that is, the full 2D transform with 100% hardware utilisation. In addition, the architecture uses minimal embedded memory. The fully pipelined architecture is obtained by replicating the single-processor block according to the number of decomposition levels, with much lower memory requirements and higher throughput than the single processor when performing multi-level transforms. These architectures can be used directly in real-time image/video consumer applications to extend the battery life of portable systems.
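
For reference, the lifting form of the reversible 5/3 transform that these architectures implement consists of one predict and one update step per coefficient pair. The 1-D C sketch below shows those two steps only; the 2-D row/column pass, odd-length signals, and the memory-efficient scheduling that the hardware provides are omitted, and the shift-based floor divisions assume the usual arithmetic right shift.

```c
/* 1-D reversible 5/3 lifting step (predict, then update) on an even-length
 * signal, with simple symmetric boundary extension. Minimal reference sketch
 * only: the 2-D row/column pass, odd lengths and the memory-efficient
 * scheduling of the hardware architectures are omitted, and the >> operators
 * assume the usual arithmetic right shift for the floor divisions. */
void dwt53_forward(const int *x, int n, int *lo, int *hi)   /* n must be even */
{
    int half = n / 2;

    /* Predict: detail d[k] = x[2k+1] - floor((x[2k] + x[2k+2]) / 2) */
    for (int k = 0; k < half; k++) {
        int right = (2 * k + 2 < n) ? x[2 * k + 2] : x[n - 2];  /* mirror the edge */
        hi[k] = x[2 * k + 1] - ((x[2 * k] + right) >> 1);
    }

    /* Update: approximation s[k] = x[2k] + floor((d[k-1] + d[k] + 2) / 4) */
    for (int k = 0; k < half; k++) {
        int left = (k > 0) ? hi[k - 1] : hi[0];                 /* mirror the edge */
        lo[k] = x[2 * k] + ((left + hi[k] + 2) >> 2);
    }
}
```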

19.
That the influence of the PRAM model is ubiquitous in parallel algorithm design is as clear as the fact that it is technologically infeasible for the foreseeable future. The current generation of parallel hardware prominently features distributed memory and high-performance interconnection networks, very much the antithesis of the shared memory required for the PRAM model. It has been shown that, in spite of communication costs, for some problems very fast parallel algorithms are available for distributed-memory machines, from embarrassingly parallel problems to sorting and numerical analysis. In contrast, it is known that for other classes of problem, PRAM-style shared-memory simulation on a distributed-memory machine can, in theory, produce solutions of comparable performance to the best possible for such architectures. The Bulk Synchronous Parallel (BSP) model accurately represents most parallel machines, theoretical and actual, in an execution and cost model. We introduce a scalable, portable PRAM realization appropriate for BSP computers and a methodology for its use. Our system is fast and built upon the familiar sequential C++ coupled with the new standard BSP library of parallel computation and communication primitives. It is portable to and predictable on a vast number of parallel computers including workstation clusters, a 256-processor Cray T3D, an 8-node IBM SP/2 and a 4-node shared-memory SGI Power Challenge machine. Our approach achieves simplicity of programming over direct-mode BSP programming for reasonable overhead cost. We objectively compare optimized BSP and PRAM algorithms implemented with our C++ PRAM library and provide encouraging experimental results for our new style of programming. Copyright © 2000 John Wiley & Sons, Ltd.
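
The BSP style that the item above builds on can be seen in a few lines of the classic BSPlib C interface: registered memory, one-sided bsp_put, and superstep barriers with bsp_sync. The broadcast sketch below uses that standard interface, not the authors' C++ PRAM library.

```c
/* Minimal BSPlib sketch: process 0 broadcasts one integer to every process
 * with one-sided puts, separated by superstep barriers. This shows only the
 * classic BSPlib C interface, not the authors' C++ PRAM layer. */
#include <bsp.h>
#include <stdio.h>

static void spmd(void)
{
    bsp_begin(bsp_nprocs());

    int x = -1;
    bsp_push_reg(&x, sizeof(int));   /* make &x remotely writable */
    bsp_sync();                      /* registration takes effect next superstep */

    if (bsp_pid() == 0) {
        int v = 42;
        for (int p = 0; p < bsp_nprocs(); p++)
            bsp_put(p, &v, &x, 0, sizeof(int));   /* one-sided write into p's x */
    }
    bsp_sync();                      /* communication completes at the barrier */

    printf("process %d of %d received %d\n", bsp_pid(), bsp_nprocs(), x);
    bsp_pop_reg(&x);
    bsp_end();
}

int main(int argc, char **argv)
{
    bsp_init(spmd, argc, argv);      /* must precede all other BSPlib calls */
    spmd();
    return 0;
}
```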

20.
The increasing complexity of video coding algorithms places higher demands on processor performance, and multi-core processors provide a powerful platform for media data processing. This paper analyzes the characteristics of standard video coding algorithms and summarizes methods for accelerating video coding. Organized by the categories of symmetric, asymmetric, and hybrid multi-core processors, it introduces design methods and typical examples of parallel video coding on multi-core processors. It then summarizes the problems that may arise when designing video encoders for multi-core processors and points out future research directions.
