Similar Documents (20 results)
1.
This paper develops a formalism that precisely characterizes when class tables are required for C++ memory layouts. A memory layout is a particular choice of data structures for implementing run‐time support for object‐oriented languages. We use this formalism to quantify and evaluate, on a set of benchmarks, the space overhead for a set of C++ memory layouts. In particular, this paper studies the space overhead due to three language features: virtual dispatch, virtual inheritance, and dynamic typing. To date, there has been no scientific quantification or evaluation of C++ memory layouts. Our approach can help C++ implementors. This work has already influenced the memory layout design choices in IBM's Visual Age C++ V5 compiler. Applying our approach to a set of five benchmarks, we demonstrate that the impact of object‐oriented space overhead can vary dramatically between applications (ranging from 0.42% to 99.79% for our benchmarks). In particular, applications whose object space is dominated by instances of classes that heavily use object‐oriented language features will be significantly impacted by the choice of a memory layout. Copyright © 2003 John Wiley & Sons, Ltd.
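The per-object cost of virtual dispatch is easy to observe directly. Below is a minimal sketch (toy classes of our own, not the paper's benchmarks or formalism) showing how the vtable pointer added by a typical memory layout inflates small objects:

```cpp
#include <cstdio>

struct Plain   { int x; };                     // no object-oriented features
struct Virtual { int x; virtual ~Virtual() {} }; // virtual dispatch adds a vptr

int main() {
    // On a typical 64-bit implementation this prints 4 vs. 16: the vptr plus
    // alignment padding quadruples the object size. For applications whose
    // heaps are dominated by many such small, feature-heavy objects, this is
    // how space overhead toward the paper's upper range arises.
    std::printf("sizeof(Plain)=%zu sizeof(Virtual)=%zu\n",
                sizeof(Plain), sizeof(Virtual));
}
```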

2.
In C++, multi‐dimensional arrays are often used but the language provides limited native support for them. The language, in its Standard Library, supplies sophisticated interfaces for manipulating sequential data, but relies on its bare‐bones C heritage for arrays. The MultiArray library, a part of the Boost library collection, enhances a C++ programmer's tool set with versatile multi‐dimensional array abstractions. It includes a general array class template and native array adaptors that support idiomatic array operations and interoperate with C++ Standard Library containers and algorithms. The arrays share a common interface, expressed as a generic programming concept, in terms of which generic array algorithms can be implemented. We present the library design, introduce a generic interface for array programming, demonstrate how the arrays integrate with the C++ Standard Library, and discuss the essential aspects of their implementation. Copyright © 2004 John Wiley & Sons, Ltd.
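For illustration, here is a small usage sketch of Boost.MultiArray's public interface as described above (the shape and values are arbitrary examples):

```cpp
#include <boost/multi_array.hpp>
#include <cassert>

int main() {
    // A 3-dimensional array of doubles with shape 2 x 3 x 4.
    boost::multi_array<double, 3> a(boost::extents[2][3][4]);

    // Idiomatic element access via chained operator[].
    a[1][2][3] = 42.0;
    assert(a[1][2][3] == 42.0);

    // Shape and size queries make the array usable with Standard Library
    // algorithms over its element storage.
    assert(a.num_elements() == 2 * 3 * 4);
}
```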

3.
Agent technology is emerging as an important concept for the development of distributed complex systems. A number of mobile agent systems have been developed in the last decade; however, most of them support only Java mobile agents. In order to provide distributed applications with code mobility, this article presents the Mobile-C library, which allows the mobile agent platform Mobile-C to be embedded in an application to support mobile C/C++ codes carried by mobile agents. Mobile-C uses a C/C++ interpreter as its Agent Execution Engine (AEE). Using mobile C/C++ codes, it is easy to interface a variety of low-level hardware devices and legacy systems. Through the Mobile-C library, Mobile-C can run on heterogeneous platforms with various operating systems. The Mobile-C library has a small footprint to meet the stringent memory constraints of applications in mechatronic and embedded systems. It contains different categories of Application Programming Interfaces (APIs) in both binary and agent spaces to facilitate the design of mobile agent based applications. In addition, a rich set of existing APIs for the C/C++ interpreter employed as the AEE allows an application to have complete information and control over the mobile C/C++ codes residing in Mobile-C. With the synchronization mechanism provided by the Mobile-C library for both binary and agent spaces, simultaneous processes across both spaces can be coordinated to achieve the correct runtime order and avoid unexpected race conditions. Performance comparisons indicate that Mobile-C is about two times faster than JADE in agent migration. The application of the Mobile-C library is illustrated by dynamic runtime control of a mobile robot's behavior using mobile agents.

4.
We present automatic data layout transformation as an effective compiler performance optimization for memory-bound structured grid applications. Structured grid applications include stencil codes and other code structures using a dense, regular grid as the primary data structure. Fluid dynamics and heat distribution, which both solve partial differential equations on a discretized representation of space, are representative of many important structured grid applications. Using the information available through variable-length array syntax, standardized in C99 and other modern languages, we enable automatic data layout transformations for structured grid codes with dynamically allocated arrays. We also present how a tool can guide these transformations to statically choose a good layout given a model of the memory system, using a modern GPU as an example. A transformed layout that distributes concurrent memory requests among parallel memory system components provides substantial speedup for structured grid applications by improving their achieved memory-level parallelism. Even with the overhead of more complex address calculations, we observe up to 10.94X speedup over the original layout, and a 1.16X performance gain in the worst case.
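The core idea can be sketched in a simplified form (our own illustration, not the paper's compiler transformation): separate the logical grid index from the physical offset, so the layout can be swapped without touching the stencil code:

```cpp
#include <cstddef>
#include <vector>

// Two interchangeable layout policies mapping a logical (i, j) to an offset.
struct RowMajor {
    std::size_t nx, ny;
    std::size_t operator()(std::size_t i, std::size_t j) const { return i * ny + j; }
};
struct ColMajor {                 // a "transformed" layout the tool might pick
    std::size_t nx, ny;
    std::size_t operator()(std::size_t i, std::size_t j) const { return j * nx + i; }
};

// The stencil is written once against the logical index space; choosing
// RowMajor{nx, ny} or ColMajor{nx, ny} changes only the memory layout.
template <class Layout>
void stencil(std::vector<float>& out, const std::vector<float>& in, Layout at) {
    for (std::size_t i = 1; i + 1 < at.nx; ++i)
        for (std::size_t j = 1; j + 1 < at.ny; ++j)
            out[at(i, j)] = 0.25f * (in[at(i - 1, j)] + in[at(i + 1, j)]
                                   + in[at(i, j - 1)] + in[at(i, j + 1)]);
}
```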

5.
Compiler-directed locality optimization techniques are effective in reducing the number of cycles spent in off-chip memory accesses. Recently, methods have been developed that transform memory layouts of data structures at compile-time to improve spatial locality of nested loops beyond current control-centric (loop nest-based) optimizations. Most of these data-centric transformations use a single static (program-wide) memory layout for each array. A disadvantage of these static layout-based locality enhancement strategies is that they might fail to optimize codes that manipulate arrays, which demand different layouts in different parts of the code. We introduce a new approach, which extends current static layout optimization techniques by associating different memory layouts with the same array in different parts of the code. We call this strategy "quasidynamic layout optimization." In this strategy, the compiler determines memory layouts (for different parts of the code) at compile time, but layout conversions occur at runtime. We show that the possibility of dynamically changing memory layouts during the course of execution adds a new dimension to the data locality optimization problem. Our strategy employs a static layout optimizer module as a building block and, by repeatedly invoking it for different parts of the code, it checks whether runtime layout modifications bring additional benefits beyond static optimization. Our experiments indicate significant improvements in execution time over static layout-based locality enhancing techniques.
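A minimal sketch of the quasidynamic idea, assuming a square row-major array and a second program phase that prefers column-major access (an illustration, not the paper's optimizer):

```cpp
#include <cstddef>
#include <vector>

// Phase 1 traverses a (row-major); phase 2 prefers column order, so the
// compiler has scheduled a runtime layout conversion between the phases.
void convert_row_to_col(std::vector<double>& a, std::size_t n /* n x n */) {
    std::vector<double> tmp(a.size());
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            tmp[j * n + i] = a[i * n + j];  // the conversion cost at runtime...
    a.swap(tmp);                             // ...must be outweighed by the
}                                            // improved locality in phase 2
```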

6.
Concurrent C/C++ is a superset of C and C++ that provides parallel programming facilities based on message passing. Upon porting Concurrent C/C++ to a shared memory multiprocessor, the authors believed it would be appropriate to supplement Concurrent C/C++ with explicit facilities for synchronizing accesses to shared data structures. The capsule, which is a shared memory access mechanism designed especially for Concurrent C/C++ to match the C++ data abstraction facility called the class, is discussed. Capsules are like monitors but have significant advantages: they satisfy T. Bloom's (1979) criteria for expressiveness of synchronization conditions, support inheritance, allow operations to execute in parallel, and permit them to time out. The author evaluates existing shared memory mechanisms, describes capsules with examples, compares capsules with monitors, and discusses how capsules are implemented by the Concurrent C compiler.
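Capsules are a language construct of Concurrent C/C++ with no direct standard C++ equivalent; the monitor-style approximation below merely illustrates two capsule features highlighted above, expressive synchronization conditions and operations that can time out:

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

class BoundedCounter {
    std::mutex m;
    std::condition_variable cv;
    int value = 0;
public:
    void increment() {
        std::lock_guard<std::mutex> lock(m);
        ++value;
        cv.notify_all();                  // re-evaluate waiting conditions
    }
    // Block until an arbitrary synchronization condition holds, or give up
    // after the timeout -- capsule operations can time out similarly.
    bool wait_at_least(int threshold, std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(m);
        return cv.wait_for(lock, timeout, [&] { return value >= threshold; });
    }
};
```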

7.
Complex coupled multiphysics simulations are playing increasingly important roles in scientific and engineering applications such as fusion, combustion, and climate modeling. At the same time, extreme scales, increased levels of concurrency, and the advent of multicores are making programming of high‐end parallel computing systems on which these simulations run challenging. Although partitioned global address space (PGAS) languages attempt to address the problem by providing a shared memory abstraction for parallel processes within a single program, the PGAS model does not easily support data coupling across multiple heterogeneous programs, which is necessary for coupled multiphysics simulations. This paper explores how multiphysics‐coupled simulations can be supported by the PGAS programming model. Specifically, in this paper, we present the design and implementation of the XpressSpace programming system, which extends existing PGAS data sharing and data access models with a semantically specialized shared data space abstraction to enable data coupling across multiple independent PGAS executables. XpressSpace supports a global‐view style programming interface that is consistent with the PGAS memory model, and provides an efficient runtime system that can dynamically capture the data decomposition of global‐view data‐structures such as arrays, and enable fast exchange of these distributed data‐structures between coupled applications. In this paper, we also evaluate the performance and scalability of a prototype implementation of XpressSpace by using different coupling patterns extracted from real world multiphysics simulation scenarios, on the Jaguar Cray XT5 system at Oak Ridge National Laboratory. Copyright © 2013 John Wiley & Sons, Ltd.

8.
Many simulations in the physical sciences are expressed in terms of rectilinear arrays of variables. It is attractive to develop such simulations for use in 1‐, 2‐, 3‐ or arbitrary physical dimensions and also in a manner that supports exploitation of data‐parallelism on fast modern processing devices. We report on data layouts and transformation algorithms that support both conventional and data‐parallel memory layouts. We present our implementations expressed in both conventional serial C code as well as in NVIDIA's Compute Unified Device Architecture concurrent programming language for use on general purpose graphical processing units. We discuss: general memory layouts; specific optimizations possible for dimensions that are powers‐of‐two; and common transformations, such as inverting, shifting and crinkling. We present performance data for some illustrative scientific applications of these layouts and transforms using several current GPU devices and discuss the code and speed scalability of this approach. Copyright © 2010 John Wiley & Sons, Ltd.
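As one concrete instance of the power-of-two optimizations mentioned (a generic sketch, not the paper's code): a cyclic shift along a dimension of length 2^k reduces to bit masking instead of a comparatively expensive modulo:

```cpp
#include <cstdint>

// Cyclic shift of index i along a dimension whose length is a power of two.
// (i + shift) % len computed cheaply: len - 1 is an all-ones bit mask.
inline std::uint32_t shifted_index(std::uint32_t i, std::uint32_t shift,
                                   std::uint32_t len_pow2 /* must be 2^k */) {
    return (i + shift) & (len_pow2 - 1);
}
```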

9.

In the state-of-the-art parallel programming approaches OpenCL and CUDA, so-called host code is required for a program's execution. Efficiently implementing host code is often a cumbersome task, especially when executing OpenCL and CUDA programs on systems with multiple nodes, each comprising different devices, e.g., multi-core CPUs and graphics processing units: the programmer is responsible for explicitly managing node and device memory, for synchronizing computations with data transfers between devices of potentially different nodes, and for optimizing data transfers between device memories and nodes' main memories, e.g., by using pinned main memory to accelerate data transfers and by overlapping the transfers with computations. We develop the distributed OpenCL/CUDA abstraction layer (dOCAL), a novel high-level C++ library that simplifies the development of host code. dOCAL combines major advantages over the state-of-the-art high-level approaches: (1) it simplifies implementing both OpenCL and CUDA host code by providing a simple-to-use, high-level abstraction API; (2) it supports executing arbitrary OpenCL and CUDA programs; (3) it allows conveniently targeting the devices of different nodes by automatically managing node-to-node communication; (4) it simplifies implementing data transfer optimizations by providing different, specially allocated memory regions, e.g., pinned main memory for overlapping data transfers with computations; (5) it optimizes memory management by automatically avoiding unnecessary data transfers; (6) it enables interoperability between OpenCL and CUDA host code for systems with devices from different vendors. Our experiments show that dOCAL significantly simplifies the development of host code for heterogeneous and distributed systems, with a low runtime overhead.
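For context, the sketch below shows the kind of raw CUDA host code such a library abstracts away. It uses only standard CUDA runtime calls and is not dOCAL's API:

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// Explicit device memory management plus a pinned host buffer so the
// transfer can overlap with computation on other streams -- exactly the
// boilerplate the abstract describes as cumbersome.
void host_side_transfer(const float* src, std::size_t n) {
    float *pinned = nullptr, *dev = nullptr;
    cudaMallocHost((void**)&pinned, n * sizeof(float));  // pinned main memory
    cudaMalloc((void**)&dev, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    for (std::size_t i = 0; i < n; ++i) pinned[i] = src[i];
    // Asynchronous copy: may overlap with kernels running on other streams.
    cudaMemcpyAsync(dev, pinned, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(dev);
    cudaFreeHost(pinned);
}
```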


10.
Distributed memory architectures such as Linux clusters have become increasingly common but remain difficult to program. We target this problem and present a novel technique to automatically generate data distribution plans, and subsequently MPI implementations in C++, from programs written in a functional core language. The main novelty of our approach is that we support distributed arrays, maps, and lists in the same framework, rather than just arrays. We formalize distributed data layouts as types, which are then used both to search (via type inference) for optimal data distribution plans and to generate the MPI implementations. We introduce the core language and explain our formalization of distributed data layouts. We describe how we search for data distribution plans using an adaptation of the Damas–Milner type inference algorithm, and how we generate MPI implementations in C++ from such plans.

11.
Because of rapidly growing application demands and the emergence of networked storage systems, heterogeneous disk arrays are becoming increasingly common. Owing to its high performance and reliability at low cost, RAID5 is the most widely used RAID organization. Existing research on RAID5 organizations for heterogeneous disk arrays has focused mainly on fully utilizing disk storage space and on qualitative performance analysis. This paper proposes a data layout optimization method for RAID5 on heterogeneous disk arrays. The method takes full account of the relative capacity and performance of the heterogeneous disks, as well as the impact of parity-unit placement on RAID5 small-write performance, and can generate layouts whose load is uniformly or nearly uniformly distributed. Simulation results show that, under multi-user small-request access patterns, the optimized layout clearly outperforms a simple RAID5 layout and scales better.
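For reference, the sketch below gives the standard left-symmetric RAID5 mapping for homogeneous disks, i.e., the baseline whose rotating parity placement the paper's heterogeneous optimization generalizes (our illustration, not the paper's method):

```cpp
#include <cstddef>

struct Raid5Addr { std::size_t disk, stripe; };

// Left-symmetric RAID5: the parity unit rotates across all n disks so that
// small writes (which always touch the parity unit) spread evenly.
Raid5Addr map_block(std::size_t block, std::size_t n_disks) {
    std::size_t stripe      = block / (n_disks - 1);        // n-1 data units per stripe
    std::size_t parity_disk = (n_disks - 1) - stripe % n_disks;
    std::size_t d           = block % (n_disks - 1);        // position within stripe
    std::size_t disk        = (parity_disk + 1 + d) % n_disks;
    return {disk, stripe};
}
```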

12.
This paper presents a data layout optimization technique for sequential and parallel programs based on the theory of hyperplanes from linear algebra. Given a program, our framework automatically determines suitable memory layouts that can be expressed by hyperplanes for each array that is referenced. We discuss the cases where data transformations are preferable to loop transformations and show that under certain conditions a loop nest can be optimized for perfect spatial locality by using data transformations. We argue that data transformations can also optimize spatial locality for some arrays without distorting temporal/spatial locality exhibited by others. We divide the problem of optimizing data layout into two independent subproblems: 1) determining optimal static data layouts, and 2) determining data transformation matrices to implement the optimal layouts. By postponing the determination of the transformation matrix to the last stage, our method can be adapted to compilers with different default layouts. We then present an algorithm that considers optimizing parallelism and spatial locality simultaneously. Our results on eight programs on two distributed shared-memory multiprocessors, the Convex Exemplar SPP-2000 and the SGI Origin 2000, show that the layout optimizations are effective in optimizing spatial locality and parallelism.
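As a toy illustration of layouts expressed by hyperplanes (much simplified relative to the paper's framework): elements whose index vectors lie on the same layout hyperplane are stored contiguously, so the choice of hyperplane vector h selects the layout:

```cpp
#include <cstddef>

// For a 2-D array, h = (1,0) groups elements with equal i (row-major);
// h = (0,1) groups elements with equal j (column-major). Only these two
// hyperplane vectors are handled in this sketch.
std::size_t offset(int h0, int h1,
                   std::size_t i, std::size_t j,
                   std::size_t ni, std::size_t nj) {
    return (h0 == 1 && h1 == 0) ? i * nj + j   // hyperplanes i = const
                                : j * ni + i;  // hyperplanes j = const
}
```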

13.
This paper introduces NiHu, a C++ template library for boundary element methods (BEM). The library is capable of computing the coefficients of discretised boundary integral operators in a generic way with arbitrarily defined kernels and function spaces. NiHu’s template core defines the workflow of a general BEM algorithm independent of the specific application. The core provides expressive syntax, based on the operator notation of the BEM, reflecting the mathematics behind boundary elements in the C++ source code. The customisable Component library contains elements specific to particular applications such as different numerical integration techniques and regularisation methods. The library can be used for creating a standalone C++ application using external open source libraries, or compiling a Matlab toolbox through the MEX interface. By massively exploiting C++ template metaprogramming, NiHu generates optimised codes for specific applications, including heterogeneous problems. The paper introduces the main concepts of the novel development, demonstrates its versatility and flexibility and compares the implementation’s performance to that of other open source projects.

14.
In this paper, we discuss the role, design and implementation of smart containers in the SkePU skeleton library for GPU-based systems. These containers provide an interface similar to C++ STL containers but internally perform runtime optimization of data transfers and runtime memory management for their operand data on the different memory units. We discuss how these containers can help in achieving asynchronous execution for skeleton calls while providing implicit synchronization capabilities in a data consistent manner. Furthermore, we discuss the limitations of the original, already optimizing memory management mechanism implemented in SkePU containers, and propose and implement a new mechanism that provides stronger data consistency and improves performance by reducing communication and memory allocations. With several applications, we show that our new mechanism can achieve significantly (up to 33.4 times) better performance than the initial mechanism for page-locked memory on a multi-GPU based system.
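SkePU's actual implementation is not reproduced here; the deliberately simplified sketch below (with hypothetical names) shows the coherence-flag idea behind such smart containers: track where the current copy of the data lives and copy lazily, so redundant host/device transfers are skipped:

```cpp
#include <cstddef>
#include <vector>

class SmartVector {
    std::vector<float> host;
    bool host_valid = true, device_valid = false;
    void copy_to_device() { /* backend-specific transfer, elided */ }
    void copy_to_host()   { /* backend-specific transfer, elided */ }
public:
    explicit SmartVector(std::size_t n) : host(n) {}

    // Called before a skeleton runs on the GPU.
    void acquire_device(bool will_write) {
        if (!device_valid) { copy_to_device(); device_valid = true; }
        if (will_write) host_valid = false;    // host copy becomes stale
    }
    // Called before host code touches the data (implicit synchronization).
    void acquire_host(bool will_write) {
        if (!host_valid) { copy_to_host(); host_valid = true; }
        if (will_write) device_valid = false;  // device copy becomes stale
    }
};
```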

15.
Multidimensional array I/O in Panda 1.0
Large multidimensional arrays are a common data type in high-performance scientific applications. Without special techniques for handling input and output, I/O can easily become a large fraction of execution time for applications using these arrays, especially on parallel platforms. Our research seeks to provide scientific programmers with simpler and more abstract interfaces for accessing persistent multidimensional arrays, and to produce advanced I/O libraries supporting more efficient layout alternatives for these arrays on disk and in main memory. We have created the Panda (Persistence AND Arrays) I/O library as a result of developing interfaces and libraries for applications in computational fluid dynamics in the areas of checkpoint, restart, and time-step output data. In the applications we have studied, we find that a simple, abstract interface can be used to insulate programmers from physical storage implementation details, while providing improved I/O performance at the same time. (A preliminary version of this paper was presented at Supercomputing '94.)

16.
Verification methods for memory-manipulating C programs need to address not only well-typed programs that respect invariants such as the split-heap memory model, but also programs that access arbitrary memory objects through pointers, such as local variables, single struct fields, or array slices. We present a logic for memory layouts that covers these applications and show how proof obligations arising during verification can be discharged automatically using the layouts. The framework developed in this way is also suitable for reasoning about data structures manipulated by algorithms, which we demonstrate by verifying the Schorr-Waite graph marking algorithm.
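The Schorr-Waite algorithm itself is classic; a compact executable sketch of it (the algorithm the paper verifies, not the paper's logic or proofs) looks as follows:

```cpp
// Schorr-Waite marks a graph without a stack by temporarily reversing
// pointers: the chain of reversed links from p back to the root plays
// the role of the traversal stack.
struct Node {
    Node* left  = nullptr;
    Node* right = nullptr;
    bool marked  = false;   // node already visited
    bool flipped = false;   // currently exploring the right child
};

void schorr_waite(Node* root) {
    Node *t = root, *p = nullptr;
    while (true) {
        if (t != nullptr && !t->marked) {   // advance: push via pointer reversal
            t->marked = true;
            Node* next = t->left;
            t->left = p;                    // left field now points back to parent
            p = t;
            t = next;
        } else if (p == nullptr) {
            break;                          // back above the root: done
        } else if (!p->flipped) {           // swing from left child to right child
            p->flipped = true;
            Node* next = p->right;
            p->right = p->left;             // back pointer moves to the right field
            p->left = t;                    // restore the left child
            t = next;
        } else {                            // retreat: pop and restore right field
            Node* parent = p->right;
            p->right = t;
            t = p;
            p = parent;
        }
    }
}
```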

17.
In recent years, with the rapid development of embedded computing technology and the continual evolution of heterogeneous embedded hardware, improving the portability of operating systems and the code reusability of applications has become a clear trend. This paper designs a highly general platform abstraction layer: the application development interfaces between the Linux operating system, the "锐华" operating system, and the hardware platform are redesigned, providing users with standardized interfaces for developing a wide range of embedded applications. Evaluation demonstrates that the platform abstraction layer improves operating-system portability and application code reusability while providing reliable real-time behavior.
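A minimal sketch of the platform-abstraction-layer idea, with hypothetical interface names rather than the paper's actual API:

```cpp
#include <cstdint>

// Applications code against one standardized interface; each target OS
// (Linux, "锐华", ...) supplies its own implementation behind it, so the
// application ports without source changes.
struct PalThreadApi {
    virtual ~PalThreadApi() = default;
    virtual int  create_thread(void (*entry)(void*), void* arg, int priority) = 0;
    virtual void sleep_ms(std::uint32_t ms) = 0;
};
```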

18.
The application of graphics processing units (GPU) to solve partial differential equations is gaining popularity with the advent of improved computer hardware. Various lower level interfaces exist that allow the user to access GPU specific functions. One such interface is NVIDIA’s Compute Unified Device Architecture (CUDA) library. However, porting existing codes to run on the GPU requires the user to write kernels that execute on multiple cores, in the form of Single Instruction Multiple Data (SIMD). In the present work, a higher level framework, termed CU++, has been developed that uses object oriented programming techniques available in C++ such as polymorphism, operator overloading, and template metaprogramming. Using this approach, CUDA kernels can be generated automatically during compile time. Briefly, CU++ allows a code developer with just C/C++ knowledge to write computer programs that will execute on the GPU without any knowledge of specific programming techniques in CUDA. This approach is tremendously beneficial for Computational Fluid Dynamics (CFD) code development because it mitigates the necessity of creating hundreds of GPU kernels for various purposes. In its current form, CU++ provides a framework for parallel array arithmetic, simplified data structures to interface with the GPU, and smart array indexing. An implementation of heterogeneous parallelism, i.e., utilizing multiple GPUs to simultaneously process a partitioned grid system with communication at the interfaces using Message Passing Interface (MPI), has been developed and tested.
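CU++'s source is not shown in the abstract; the minimal host-side sketch below illustrates the underlying expression-template technique, where a + b is captured as a type at compile time and evaluated in a single fused loop (CU++ applies the same idea to emit one GPU kernel for a whole expression):

```cpp
#include <cstddef>
#include <vector>

// Expression node: a + b is represented as a type, not evaluated eagerly.
template <class L, class R>
struct Add {
    const L& l; const R& r;
    float operator[](std::size_t i) const { return l[i] + r[i]; }
};
// Unconstrained for brevity; a real library would constrain the operands.
template <class L, class R>
Add<L, R> operator+(const L& l, const R& r) { return {l, r}; }

struct Array {
    std::vector<float> data;
    float operator[](std::size_t i) const { return data[i]; }
    template <class E>
    Array& operator=(const E& e) {          // single fused evaluation loop
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
        return *this;
    }
};
// Usage: c = a + b;  -- one loop, no intermediate arrays allocated.
```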

19.
The manycore revolution can be characterized by increasing thread counts, decreasing memory per thread, and diversity of continually evolving manycore architectures. High performance computing (HPC) applications and libraries must exploit increasingly finer levels of parallelism within their codes to sustain scalability on these devices. A major obstacle to performance portability is the diverse and conflicting set of constraints on memory access patterns across devices. Contemporary portable programming models address manycore parallelism (e.g., OpenMP, OpenACC, OpenCL) but fail to address memory access patterns. The Kokkos C++ library enables applications and domain libraries to achieve performance portability on diverse manycore architectures by unifying abstractions for both fine-grain data parallelism and memory access patterns. In this paper we describe Kokkos’ abstractions, summarize its application programmer interface (API), present performance results for unit-test kernels and mini-applications, and outline an incremental strategy for migrating legacy C++ codes to Kokkos. The Kokkos library is under active research and development to incorporate capabilities from new generations of manycore architectures, and to address a growing list of applications and domain libraries.
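A small example in Kokkos' public API illustrating the two unified abstractions named above, fine-grain data parallelism via parallel_for and the memory access pattern via the View layout (sizes and values are arbitrary):

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int n = 1000, m = 32;
        // The layout is a template parameter, so the same source can be
        // compiled with whichever access pattern the target device prefers.
        Kokkos::View<double**, Kokkos::LayoutLeft> a("a", n, m);
        Kokkos::parallel_for("fill", n, KOKKOS_LAMBDA(const int i) {
            for (int j = 0; j < m; ++j) a(i, j) = i * 0.5 + j;
        });
    }
    Kokkos::finalize();
}
```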

20.
Computer-Aided Design, 1987, 19(5):257-265
In this paper, a new incremental algorithm for layout compaction is proposed. In addition to its linear time performance in terms of the number of rectangles in the layout, we describe how incremental compaction can serve as a valuable feature of a layout editor, and we present the design of such an editor. In the design of the editor, we describe how arrays can be used to implement quadtrees that represent VLSI layouts. Such a representation provides fast data access and low storage requirements.
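One standard way to realize quadtrees in flat arrays is shown below as a generic sketch (the paper's exact indexing scheme is not given in the abstract): children are located by arithmetic, so no pointers need to be stored:

```cpp
#include <cstddef>

// Heap-style quadtree indexing: the four children of the node at index i
// occupy four consecutive computed slots, giving fast access and low
// storage overhead compared with a pointer-based tree.
inline std::size_t child(std::size_t i, std::size_t quadrant /* 0..3 */) {
    return 4 * i + 1 + quadrant;
}
inline std::size_t parent(std::size_t i) {   // valid for i > 0
    return (i - 1) / 4;
}
```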
