
1.

Global arrays: A nonuniform memory access programming model for high-performance computers (cited by: 1; self-citations: 1; citations by others: 0)

Jaroslaw Nieplocha Robert J. Harrison Richard J. Littlefield 《The Journal of supercomputing》1996,10(2):169-189

Portability, efficiency, and ease of coding are all important considerations in choosing the programming model for a scalable parallel application. The message-passing programming model is widely used because of its portability, yet some applications are too complex to code in it while also trying to maintain a balanced computation load and avoid redundant computations. The shared-memory programming model simplifies coding, but it is not portable and often provides little control over interprocessor data transfer costs. This paper describes an approach, called Global Arrays (GAs), that combines the better features of both other models, leading to both simple coding and efficient execution. The key concept of GAs is that they provide a portable interface through which each process in a MIMD parallel program can asynchronously access logical blocks of physically distributed matrices, with no need for explicit cooperation by other processes. We have implemented the GA library on a variety of computer systems, including the Intel Delta and Paragon, the IBM SP-1 and SP-2 (all message passers), the Kendall Square Research KSR-1/2 and the Convex SPP-1200 (nonuniform access shared-memory machines), the CRAY T3D (a globally addressable distributed-memory computer), and networks of UNIX workstations. We discuss the design and implementation of these libraries, report their performance, illustrate the use of GAs in the context of computational chemistry applications, and describe the use of a GA performance visualization tool. (An earlier version of this paper was presented at Supercomputing'94.)
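The access model in this abstract can be illustrated with a small single-process sketch in Python. This is not the GA library's interface: the class name `ToyGlobalArray`, its `get` and `put` methods, and the row-block distribution are all assumptions made for illustration, and the "processes" are simulated as block owners inside one program. The sketch only shows how a caller can read or write a logical block without knowing which owner holds which rows.

```python
class ToyGlobalArray:
    """Toy model: a rows x cols matrix split into contiguous row blocks,
    one block per simulated owner process (hypothetical, not the GA API)."""

    def __init__(self, rows, cols, nprocs):
        self.rows, self.cols = rows, cols
        # Row range [start, stop) owned by each simulated process.
        base, extra = divmod(rows, nprocs)
        self.ranges = []
        start = 0
        for p in range(nprocs):
            stop = start + base + (1 if p < extra else 0)
            self.ranges.append((start, stop))
            start = stop
        # Each owner stores only its own rows.
        self.local = [[[0.0] * cols for _ in range(hi - lo)]
                      for lo, hi in self.ranges]

    def _locate(self, row):
        """Return (owner, local_row_index) for a global row index."""
        for owner, (lo, hi) in enumerate(self.ranges):
            if lo <= row < hi:
                return owner, row - lo
        raise IndexError(row)

    def put(self, row_lo, col_lo, block):
        """Write a logical block whose top-left corner is (row_lo, col_lo)."""
        for i, src in enumerate(block):
            if col_lo + len(src) > self.cols:
                raise IndexError("block exceeds column range")
            owner, r = self._locate(row_lo + i)
            self.local[owner][r][col_lo:col_lo + len(src)] = src

    def get(self, row_lo, row_hi, col_lo, col_hi):
        """Read the logical block [row_lo, row_hi) x [col_lo, col_hi)."""
        out = []
        for row in range(row_lo, row_hi):
            owner, r = self._locate(row)
            out.append(self.local[owner][r][col_lo:col_hi])
        return out


ga = ToyGlobalArray(rows=5, cols=4, nprocs=2)
ga.put(2, 1, [[1.0, 2.0], [3.0, 4.0]])   # spans both simulated owners
block = ga.get(2, 4, 0, 4)
```

In this toy, the written block straddles the boundary between the two owners, yet the caller addresses it purely by global indices. A real implementation would replace the list lookups with one-sided remote transfers.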

2.

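Dynamic contention management strategy for software transactional memory

Lin Fei (林菲) 《Computer Engineering and Design》(计算机工程与设计) 2010,31(7)

Software transactional memory is a new programming technique that emerged to simplify parallel programming. To reduce the frequency of transaction conflicts in a software transactional memory system and thereby improve overall system performance, a new contention management strategy based on dynamic control and queue scheduling is proposed. The concept of contention intensity and the overall system framework are defined, and on this basis a method is given for dynamically adjusting contention intensity using runtime feedback. A design method for transaction serialization is also given, together with the issues to watch for in its implementation; transactions with a high conflict probability are serialized so that the same conflict does not recur. The model and algorithm were evaluated experimentally on commonly used benchmark data structures, and the results demonstrate the correctness and effectiveness of the algorithm.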
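The abstract does not give the formula for contention intensity, so the following Python sketch rests on assumptions: the intensity is modelled as an exponentially weighted moving average of abort outcomes, and the queue is replaced by a single lock. The names `ContentionManager`, `alpha`, and `threshold` are hypothetical. It shows only the general shape of feedback-driven serialization, not the paper's algorithm.

```python
import threading


class ContentionManager:
    """Tracks a contention-intensity estimate and serializes transactions
    when it is high. The update rule is an assumed moving average."""

    def __init__(self, alpha=0.5, threshold=0.5):
        self.alpha = alpha            # weight given to history (assumed)
        self.threshold = threshold    # serialize above this level (assumed)
        self.intensity = 0.0
        self._state_lock = threading.Lock()
        self._serial_lock = threading.Lock()  # stands in for the queue

    def record(self, aborted):
        """Feed back the outcome of one transaction attempt."""
        sample = 1.0 if aborted else 0.0
        with self._state_lock:
            self.intensity = (self.alpha * self.intensity
                              + (1.0 - self.alpha) * sample)

    def should_serialize(self):
        """True when recent attempts have aborted often enough."""
        with self._state_lock:
            return self.intensity > self.threshold

    def run(self, transaction):
        """Run `transaction` (a callable returning True on commit and
        False on abort), retrying until it commits."""
        while True:
            if self.should_serialize():
                # High contention: only one such transaction runs at a time.
                with self._serial_lock:
                    committed = transaction()
            else:
                committed = transaction()
            self.record(aborted=not committed)
            if committed:
                return
```

Each abort pushes the estimate up and each commit pulls it down, so serialization switches on under sustained conflict and off again once conflicts subside.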

3.

OpenCL performance portability for general‐purpose computation on graphics processor units: an exploration on cryptographic primitives

Giovanni Agosta Alessandro Barenghi Alessandro Di Federico Gerardo Pelosi 《Concurrency and Computation》2015,27(14):3633-3660

The modern trend toward heterogeneous many‐core architectures has led to high architectural diversity in both high performance and high‐end embedded systems. To effectively exploit the computational resources of such a wide range of architectures, programming languages and APIs such as OpenCL have become increasingly popular. Although OpenCL provides functional code portability and the ability to fine tune the application to the target hardware, providing performance portability is still an open problem. Thus, many research works have investigated the optimization of specific combinations of application and target platform. In this paper, we aim at leveraging the experience obtained in the implementation of algorithms from the cryptography domain to provide a set of guidelines for modern many‐core heterogeneous architecture performance portability and to establish a base on which domain‐specific languages and compiler transformations could be built in the near future. We study algorithmic choices and the effect of compiler transformations on three representative applications in the chosen domain on a set of seven target platforms. To estimate how well the application fits the architecture, we define a metric of computational intensity both for the architecture and the application implementation. Besides being useful to compare either different implementation or algorithmic choices and their fitness to a specific architecture, it can also be useful to the compiler to guide the code optimization process. Copyright © 2014 John Wiley & Sons, Ltd.

4.

Unified Programming Models for Heterogeneous High-Performance Computers

Ma Zi-Xuan Jin Yu-Yang Tang Shi-Zhi Wang Hao-Jie Xue Wei-Cheng Zhai Ji-Dong Zheng Wei-Min 《Journal of Computer Science and Technology》2023,38(1):211-218

Journal of Computer Science and Technology - Unified programming models can effectively improve program portability on various heterogeneous high-performance computers. Existing unified programming...

5.

Resource abstraction and data placement for distributed hybrid memory pool

Tingting CHEN Haikun LIU Xiaofei LIAO Hai JIN 《Frontiers of Computer Science》2021,15(3):153103

Emerging byte-addressable non-volatile memory (NVM) technologies offer higher density and lower cost than DRAM, at the expense of lower performance and limited write endurance. There have been many studies on hybrid NVM/DRAM memory management in a single physical server. However, how to manage hybrid memories efficiently in a distributed environment is still an open problem. This paper proposes Alloy, a memory resource abstraction and data placement strategy for an RDMA-enabled distributed hybrid memory pool (DHMP). Alloy provides simple APIs for applications to utilize DRAM or NVM resources in the DHMP, without being aware of the hardware details of the DHMP. We propose a hotness-aware data placement scheme, which combines hot data migration, data replication and write merging to improve application performance and reduce the cost of DRAM. We evaluate Alloy with several micro-benchmark workloads and public benchmark workloads. Experimental results show that Alloy can reduce the DRAM usage in the DHMP by up to 95%, while reducing the total memory access time by up to 57% compared with the state-of-the-art approaches.

6.

Direct distributed memory access for CMPs

Weiwei Fu Li Liu Tianzhou Chen 《Journal of Parallel and Distributed Computing》2014

On-chip distributed memory has emerged as a promising memory organization for future many-core systems, since it efficiently exploits memory level parallelism and can lighten the load on each memory module by providing a number of memory interfaces comparable to the number of on-chip cores. The packet-based memory access model (PDMA) has provided a scalable and flexible solution for distributed memory management, but suffers from complicated and costly on-chip network protocol translation and heavy interference among packets, which lead to unpredictable performance. In this paper we propose a direct distributed memory access (DDMA) model, in which remote memory can be directly accessed by local cores via remote-to-local virtualization, without network protocol translation. From the perspective of local cores, remote memory controllers (MC) can be directly manipulated through accessing the local agent MC, which is responsible for accessing remote memory through high-performance inter-tile communication. We further discuss the architectural support required by the DDMA model, including the memory interface design, work flow and the protocols involved. Simulation results of executing PARSEC benchmarks show that our DDMA architecture outperforms PDMA in terms of both average memory access latency and IPC, by 17.8% and 16.6% respectively on average. Besides, DDMA can better manage congested memory traffic, since a reduction of bandwidth in running memory-intensive SPEC2006 workloads only incurs an 18.9% performance penalty, compared with 38.3% for PDMA.

7.

Exploiting locality and tolerating remote memory access latency using thread migration

Stephen Jenks Jean-Luc Gaudiot 《International journal of parallel programming》1997,25(4):281-304

Much research has focused on reducing and/or tolerating remote memory access latencies on distributed-memory parallel computers. Caching remote data is intended to reduce average access latency by handling as many remote memory accesses as possible using local copies of the data in the cache. Data-flow and multithreaded approaches help programs tolerate the latency of remote memory accesses by allowing processors to do other work while remote operations take place. The thread migration technique described here is a multithreaded architecture where threads migrate to remote processors that contain data they need. By exploiting access locality, the threads often use several data items from that processor before migrating to other processors for more data. Because the threads migrate in search of data, the approach is called Nomadic Threads. A prototype runtime system has been implemented on the CM5 and is portable to other distributed memory parallel computers.

8.

Kokkos: Enabling manycore performance portability through polymorphic memory access patterns

《Journal of Parallel and Distributed Computing》2014,74(12):3202-3216

The manycore revolution can be characterized by increasing thread counts, decreasing memory per thread, and diversity of continually evolving manycore architectures. High performance computing (HPC) applications and libraries must exploit increasingly finer levels of parallelism within their codes to sustain scalability on these devices. A major obstacle to performance portability is the diverse and conflicting set of constraints on memory access patterns across devices. Contemporary portable programming models address manycore parallelism (e.g., OpenMP, OpenACC, OpenCL) but fail to address memory access patterns. The Kokkos C++ library enables applications and domain libraries to achieve performance portability on diverse manycore architectures by unifying abstractions for both fine-grain data parallelism and memory access patterns. In this paper we describe Kokkos’ abstractions, summarize its application programmer interface (API), present performance results for unit-test kernels and mini-applications, and outline an incremental strategy for migrating legacy C++ codes to Kokkos. The Kokkos library is under active research and development to incorporate capabilities from new generations of manycore architectures, and to address a growing list of applications and domain libraries.

9.

The local memory access sequence of multiple induction variables on distributed memory machines

Tsung-Chuan Huang Liang-Cheng Shiu 《Computers & Electrical Engineering》2004,30(3):231-244


10.

Grid applications on distributed memory architectures: Implementation and evaluation

Karl Solchenbach 《Parallel Computing》1988,7(3):341-356

It was shown in the paper of Solchenbach and Trottenberg (in this special issue) that grid algorithms are inherently parallel and that parallel grid algorithms for regular grids can be efficiently implemented on dm-mp systems using the concept of grid partitioning.

In this paper, we demonstrate that grid applications can be implemented quite easily on dm-mp systems if a hardware-independent process system exists and convenient tools (such as the SUPRENUM mapping and communications library) are available.

The evaluation of parallel grid algorithms shows that the multiprocessor speedup and efficiency for single grid applications depend on the communication/calculation performance ratio of the hardware, on the communication/calculation ratio of the algorithms, and on the process size. The efficiency of parallel multigrid algorithms additionally depends on the number of nodes.

11.

All‐uses testing of shared memory parallel programs

Cheer‐Sun D. Yang Lori L. Pollock 《Software Testing, Verification and Reliability》2003,13(1):3-24

Parallelism has become a way of life for many scientific programmers. A significant challenge in bringing the power of parallel machines to these programmers is providing them with a suite of software tools similar to the tools that sequential programmers currently utilize. Unfortunately, writing correct parallel programs remains a challenging task. In particular, automatic or semi‐automatic testing tools for parallel programs are lacking. This paper takes a first step in developing an approach to providing all‐uses coverage for parallel programs. A testing framework and theoretical foundations for structural testing are presented, including test data adequacy criteria and hierarchy, formulation and illustration of all‐uses testing problems, classification of all‐uses test cases for parallel programs, and both theoretical and empirical results with regard to what can be achieved with all‐uses coverage for parallel programs. Copyright © 2003 John Wiley & Sons, Ltd.

12.

An effective preprocessor for structured FORTRAN: The HENTRAN system

Giovanni Guida 《International journal of parallel programming》1981,10(4):283-297

This paper presents the new HENTRAN preprocessor for structured FORTRAN. The motivations and goals of the project are first outlined. The extended FORTRAN language implemented through HENTRAN is then described, and its adequacy and flexibility for structured programming are discussed. The basic architecture of the HENTRAN translator is described, and its main features concerning reliability, portability, and efficiency are discussed in comparison with other similar systems.

13.

The use of intermediate memories for low-latency memory access in supercomputer scalar units

Gurindar S. Sohi Wei-Chung Hsu 《The Journal of supercomputing》1990,4(1):5-21

One of the prime considerations for high scalar performance in supercomputers is a low memory latency. With the increasing disparity between main memory and CPU clock speeds, the use of an intermediate memory in the hierarchy becomes necessary. In this paper, we present an intermediate memory structure called a programmable cache. A programmable cache exploits structural locality to decrease the average memory access time. We evaluate the concept of a programmable cache by using the vector registers in the CRAY X-MP and Y-MP supercomputers as a programmable cache. Our results indicate that a programmable cache can be used profitably to reduce the memory latency if the pattern of references to a data structure can be determined at compile time. The work of the first author was supported in part by NSF Grant CCR-8706722.

14.

Binding environments for parallel logic programs in non-shared memory multiprocessors

John S. Conery 《International journal of parallel programming》1988,17(2):125-152

A method known as closed environments can be used to represent variable bindings for OR-parallel logic programs without relying on a shared memory or common address space. The representation is based on a procedure that transforms stack frames after unification, taking into account problems with common unbound ancestors and shared instances of complex terms. Closed environments were developed for the AND/OR Process Model, but may be applicable to other OR-parallel models.

15.

Generating efficient local memory access sequences for coupled subscripts in data-parallel programs

Tsung-Chuan Huang Liang-Cheng Shiu Yi-Jay Lin 《Information Sciences》2003,149(4):249-261


16.

Auxiliary stream for optimizing memory access of video decoders

LIU ShaoLi LI Ling CHEN YunJi HU WeiWu 《Science China Information Sciences》2014,57(1):1-10

Due to the ever increasing resolution and frame rate of mainstream video sequences, memory access has become the main performance bottleneck of video decoding. To reduce the required off-chip memory, many decoders employ on-chip cache. However, they cannot distinguish whether a data block is reusable due to the lack of information about undecoded Macro Blocks (MBs), thus often evicting reusable data from the cache and preserving non-reusable data in the cache, which leads to a waste of off-chip memory bandwidth. In this paper, we make full use of the cache from a novel perspective, i.e., an auxiliary bitstream. Concretely, since the memory access behavior of video decoding is determined in video encoding, the encoder can pack the memory access behaviors of video decoding as an auxiliary bitstream, which can inform the decoder whether a data block will be reused by future MBs. Hence, such an auxiliary stream can enable optimal management of the cache. To effectively reduce the size of the auxiliary bitstream, we propose an Auxiliary Prior Information Coding (APIC) method complying with the current video standards. For future video standards, we introduce a Super Block scan Order (SBO) for MB organization to further reduce the bitrate overhead of the auxiliary bitstream. The above ideas are evaluated on a number of representative video sequences. The additional prior information can reduce the required off-chip memory bandwidth for motion compensation by over 35% (for a 60 kB cache), while only causing less than 2.3% bitrate increase for high definition (HD) videos.
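The benefit of knowing in advance whether a block will be reused can be sketched in Python. This is a simplified illustration, not the paper's APIC or SBO scheme: the functions `lru_hits`, `reuse_hints`, and `hinted_hits` are hypothetical, the hint is a single "referenced again later" bit per access, and the hints are computed from the trace itself, standing in for what an encoder might place in an auxiliary stream.

```python
from collections import OrderedDict


def lru_hits(trace, capacity):
    """Hit count of a plain LRU cache over a trace of block ids."""
    cache, hits = OrderedDict(), 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)   # evict least recently used
            cache[block] = True
    return hits


def reuse_hints(trace):
    """hints[i] is True if trace[i] is referenced again later in the trace."""
    seen, hints = set(), [False] * len(trace)
    for i in range(len(trace) - 1, -1, -1):
        hints[i] = trace[i] in seen
        seen.add(trace[i])
    return hints


def hinted_hits(trace, hints, capacity):
    """LRU cache that uses the per-access reuse hint."""
    cache, hits = OrderedDict(), 0
    for block, reused in zip(trace, hints):
        if block in cache:
            hits += 1
            if reused:
                cache.move_to_end(block)
            else:
                del cache[block]            # last use: free the slot
        elif reused:
            if len(cache) >= capacity:
                cache.popitem(last=False)
            cache[block] = True
        # else: block is never reused, so it bypasses the cache
    return hits


trace = [1, 2, 9, 8, 1, 2]
plain = lru_hits(trace, capacity=2)
hinted = hinted_hits(trace, reuse_hints(trace), capacity=2)
```

On this trace the plain LRU cache gets no hits, because the one-off blocks 9 and 8 push out 1 and 2 before they are reused, while the hinted cache keeps 1 and 2 and hits on both.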

17.

A garbage collection algorithm for shared memory parallel processors

Jim Crammond 《International journal of parallel programming》1988,17(6):497-522

This paper describes a technique for adapting the Morris sliding garbage collection algorithm to execute on parallel machines with shared memory. The algorithm is described within the framework of an implementation of the parallel logic language Parlog. However, the algorithm is a general one and can easily be adapted to parallel Prolog systems and to other languages. The performance of the algorithm executing a few simple Parlog benchmarks is analyzed. Finally, it is shown how the technique for parallelizing the sequential algorithm can be adapted for a semi-space copying algorithm.

18.

Optimizing database architecture for the new bottleneck: memory access

Stefan Manegold Peter A. Boncz Martin L. Kersten 《The VLDB Journal The International Journal on Very Large Data Bases》2000,9(3):231-246

In the past decade, advances in the speed of commodity CPUs have far out-paced advances in memory latency. Main-memory access is therefore increasingly a performance bottleneck for many computer applications, including database systems. In this article, we use a simple scan test to show the severe impact of this bottleneck. The insights gained are translated into guidelines for database architecture, in terms of both data structures and algorithms. We discuss how vertically fragmented data structures optimize cache performance on sequential data access. We then focus on equi-join, typically a random-access operation, and introduce radix algorithms for partitioned hash-join. The performance of these algorithms is quantified using a detailed analytical model that incorporates memory access cost. Experiments that validate this model were performed on the Monet database system. We obtained exact statistics on events such as TLB misses and L1 and L2 cache misses by using hardware performance counters found in modern CPUs. Using our cost model, we show how the carefully tuned memory access pattern of our radix algorithms makes them perform well, which is confirmed by experimental results. Received April 20, 2000 / Accepted June 23, 2000
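The idea of a partitioned hash-join can be sketched in Python as follows. This is a simplified, single-pass illustration under stated assumptions: the function names `radix_partition` and `partitioned_hash_join` are hypothetical, rows are plain tuples, and the partitioning uses the low bits of Python's built-in `hash`. It does not reproduce the paper's tuned algorithms or the cache and TLB behaviour that motivates them; it only shows that both inputs are clustered on the same key bits so that each pair of clusters can be joined with a small hash table.

```python
def radix_partition(rows, key, bits):
    """Split rows into 2**bits clusters by the low `bits` bits of the key hash."""
    mask = (1 << bits) - 1
    parts = [[] for _ in range(1 << bits)]
    for row in rows:
        parts[hash(row[key]) & mask].append(row)
    return parts


def partitioned_hash_join(left, right, key_l, key_r, bits=4):
    """Equi-join two lists of tuples on left[key_l] == right[key_r]."""
    out = []
    # Matching keys hash to the same cluster index on both sides,
    # so only clusters with equal index need to be joined.
    for lpart, rpart in zip(radix_partition(left, key_l, bits),
                            radix_partition(right, key_r, bits)):
        # Build a small hash table on one cluster, probe with the other.
        table = {}
        for row in lpart:
            table.setdefault(row[key_l], []).append(row)
        for row in rpart:
            for match in table.get(row[key_r], []):
                out.append(match + row)
    return out


left = [(1, "a"), (2, "b"), (17, "c"), (2, "d")]
right = [(2, "x"), (17, "y"), (3, "z")]
joined = partitioned_hash_join(left, right, 0, 0, bits=2)
```

The point of partitioning is that each per-cluster hash table is small; in a memory-conscious implementation the number of bits would be chosen so that a cluster and its table fit in cache.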

19.

Quantifying and evaluating the space overhead for alternative C++ memory layouts

Peter F. Sweeney Michael Burke 《Software》2003,33(7):595-636

This paper develops a formalism that precisely characterizes when class tables are required for C++ memory layouts. A memory layout is a particular choice of data structures for implementing run‐time support for object‐oriented languages. We use this formalism to quantify and evaluate, on a set of benchmarks, the space overhead for a set of C++ memory layouts. In particular, this paper studies the space overhead due to three language features: virtual dispatch, virtual inheritance, and dynamic typing. To date, there has been no scientific quantification or evaluation of C++ memory layouts. Our approach can help C++ implementors. This work has already influenced the memory layout design choices in IBM's Visual Age C++ V5 compiler. Applying our approach to a set of five benchmarks, we demonstrate that the impact of object‐oriented space overhead can vary dramatically between applications (ranging from 0.42% to 99.79% for our benchmarks). In particular, applications whose object space is dominated by instances of classes that heavily use object‐oriented language features will be significantly impacted by the choice of a memory layout. Copyright © 2003 John Wiley & Sons, Ltd.

20.

A comparison of two paradigms for distributed shared memory

Willem G. Levelt M. Frans Kaashoek Henri E. Bal Andrew S. Tanenbaum 《Software》1992,22(11):985-1010

Two paradigms for distributed shared memory on loosely-coupled computing systems are compared: the shared data-object model as used in Orca, a programming language specially designed for loosely-coupled computing systems, and the shared virtual memory model. For both paradigms two systems are described, one using only point-to-point messages, the other using broadcasting as well. The two paradigms and their implementations are described briefly. Their performances are compared on four applications: the travelling-salesman problem, alpha-beta search, matrix multiplication and the all-pairs shortest-paths problem. Measurements were obtained on a system consisting of 10 MC68020 processors connected by an Ethernet. For comparison purposes, the applications have also been run on a system with physical shared memory. In addition, the paper gives measurements for the first two applications above when remote procedure call is used as the communication mechanism. The measurements show that both paradigms can be used efficiently for programming large-grain parallel applications, with significant speed-ups. The structured shared data-object model achieves the highest speed-ups and is easiest to program and to debug.