
1.

LLAMA: The low-level abstraction for memory access

Bernhard Manfred Gruber Guilherme Amadio Jakob Blomer Alexander Matthes René Widera Michael Bussmann 《Software》2023,53(1):115-141

The performance gap between CPU and memory widens continuously. Choosing the best memory layout for each hardware architecture is increasingly important as more and more programs become memory bound. For portable codes that run across heterogeneous hardware architectures, the choice of the memory layout for data structures is ideally decoupled from the rest of a program. This can be accomplished via a zero-runtime-overhead abstraction layer, underneath which memory layouts can be freely exchanged. We present the low-level abstraction of memory access (LLAMA), a C++ library that provides such a data structure abstraction layer with example implementations for multidimensional arrays of nested, structured data. LLAMA provides fully C++ compliant methods for defining and switching custom memory layouts for user-defined data types. The library is extensible with third-party allocators. Providing two close-to-life examples, we show that the LLAMA-generated array of structs and struct of arrays layouts produce identical code with the same performance characteristics as manually written data structures. Integrations into the SPEC CPU® lbm benchmark and the particle-in-cell simulation PIConGPU demonstrate LLAMA's abilities in real-world applications. LLAMA's layout-aware copy routines can significantly speed up transfer and reshuffling of data between layouts compared with naive element-wise copying. LLAMA provides a novel tool for the development of high-performance C++ applications in a heterogeneous environment.
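To make the array-of-structs versus struct-of-arrays distinction concrete, here is a minimal hand-written sketch in plain C++. It does not use LLAMA's actual API; the names `Particle`, `AoS`, `SoA`, and `sum_x` are hypothetical and chosen only for illustration. Both containers expose the same accessors, so a kernel can be written once and the layout swapped by changing a template argument.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record type used only for this illustration.
struct Particle {
    float x;
    float mass;
};

// Array of structs: all fields of one particle are adjacent in memory.
class AoS {
public:
    explicit AoS(std::size_t n) : data_(n) {}
    std::size_t size() const { return data_.size(); }
    float& x(std::size_t i) { assert(i < size()); return data_[i].x; }
    float& mass(std::size_t i) { assert(i < size()); return data_[i].mass; }
private:
    std::vector<Particle> data_;
};

// Struct of arrays: each field is stored in its own contiguous array.
class SoA {
public:
    explicit SoA(std::size_t n) : x_(n), mass_(n) {}
    std::size_t size() const { return x_.size(); }
    float& x(std::size_t i) { assert(i < size()); return x_[i]; }
    float& mass(std::size_t i) { assert(i < size()); return mass_[i]; }
private:
    std::vector<float> x_;
    std::vector<float> mass_;
};

// Layout-agnostic kernel: works with any container offering x(i) and size().
template <typename Layout>
float sum_x(Layout& particles) {
    float total = 0.0f;
    for (std::size_t i = 0; i < particles.size(); ++i) {
        total += particles.x(i);
    }
    return total;
}
```

Calling `sum_x` on an `AoS` or an `SoA` instance compiles the same loop against a different memory layout, which is the kind of decoupling the abstract describes, here in a deliberately simplified form.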

2.

A linear algebra framework for automatic determination of optimal data layouts

Kandemir M. Choudhary A. Shenoy N. Banerjee P. Ramanujam J. 《Parallel and Distributed Systems, IEEE Transactions on》1999,10(2):115-135

This paper presents a data layout optimization technique for sequential and parallel programs based on the theory of hyperplanes from linear algebra. Given a program, our framework automatically determines suitable memory layouts that can be expressed by hyperplanes for each array that is referenced. We discuss the cases where data transformations are preferable to loop transformations and show that under certain conditions a loop nest can be optimized for perfect spatial locality by using data transformations. We argue that data transformations can also optimize spatial locality for some arrays without distorting temporal/spatial locality exhibited by others. We divide the problem of optimizing data layout into two independent subproblems: 1) determining optimal static data layouts, and 2) determining data transformation matrices to implement the optimal layouts. By postponing the determination of the transformation matrix to the last stage, our method can be adapted to compilers with different default layouts. We then present an algorithm that considers optimizing parallelism and spatial locality simultaneously. Our results on eight programs on two distributed shared-memory multiprocessors, the Convex Exemplar SPP-2000 and the SGI Origin 2000, show that the layout optimizations are effective in optimizing spatial locality and parallelism.

3.

Hypercubic storage layout and transforms in arbitrary dimensions using GPUs and CUDA

K. A. Hawick D. P. Playne 《Concurrency and Computation》2011,23(10):1027-1050

Many simulations in the physical sciences are expressed in terms of rectilinear arrays of variables. It is attractive to develop such simulations for use in 1‐, 2‐, 3‐ or arbitrary physical dimensions and also in a manner that supports exploitation of data‐parallelism on fast modern processing devices. We report on data layouts and transformation algorithms that support both conventional and data‐parallel memory layouts. We present our implementations expressed in both conventional serial C code as well as in NVIDIA's Compute Unified Device Architecture concurrent programming language for use on general purpose graphical processing units. We discuss: general memory layouts; specific optimizations possible for dimensions that are powers‐of‐two and common transformations, such as inverting, shifting and crinkling. We present performance data for some illustrative scientific applications of these layouts and transforms using several current GPU devices and discuss the code and speed scalability of this approach. Copyright © 2010 John Wiley & Sons, Ltd.

4.

Synthesizing MPI Implementations from Functional Data-Parallel Programs

Tristan Aubrey-Jones Bernd Fischer 《International journal of parallel programming》2016,44(3):552-573

Distributed memory architectures such as Linux clusters have become increasingly common but remain difficult to program. We target this problem and present a novel technique to automatically generate data distribution plans, and subsequently MPI implementations in C++, from programs written in a functional core language. The main novelty of our approach is that we support distributed arrays, maps, and lists in the same framework, rather than just arrays. We formalize distributed data layouts as types, which are then used both to search (via type inference) for optimal data distribution plans and to generate the MPI implementations. We introduce the core language and explain our formalization of distributed data layouts. We describe how we search for data distribution plans using an adaptation of the Damas–Milner type inference algorithm, and how we generate MPI implementations in C++ from such plans.

5.

Goalie: A Space Efficient System for VLSI Artwork Analysis

Szymanski T.G. Van Wyk C.J. 《Design & Test of Computers, IEEE》1985,2(3):64-72

Advances in VLSI have resulted in more and more complex circuitry, fueling the need for programs that analyze IC mask artwork. This article describes Goalie, an artwork analysis system, by explaining the algorithms used to support circuit extraction, Boolean geometric operations, connectivity analysis, capacitance measurement and design checking. Tests on several systems have shown that Goalie runs at least as fast as algorithms currently in use, but it requires less main memory, so large layouts can be handled on small computers, or even on personal workstations.

6.

Quasidynamic layout optimizations for improving data locality

Kadayif I. Kandemir M. 《Parallel and Distributed Systems, IEEE Transactions on》2004,15(11):996-1011

Compiler-directed locality optimization techniques are effective in reducing the number of cycles spent in off-chip memory accesses. Recently, methods have been developed that transform memory layouts of data structures at compile-time to improve spatial locality of nested loops beyond current control-centric (loop nest-based) optimizations. Most of these data-centric transformations use a single static (program-wide) memory layout for each array. A disadvantage of these static layout-based locality enhancement strategies is that they might fail to optimize codes that manipulate arrays, which demand different layouts in different parts of the code. We introduce a new approach, which extends current static layout optimization techniques by associating different memory layouts with the same array in different parts of the code. We call this strategy "quasidynamic layout optimization." In this strategy, the compiler determines memory layouts (for different parts of the code) at compile time, but layout conversions occur at runtime. We show that the possibility of dynamically changing memory layouts during the course of execution adds a new dimension to the data locality optimization problem. Our strategy employs a static layout optimizer module as a building block and, by repeatedly invoking it for different parts of the code, it checks whether runtime layout modifications bring additional benefits beyond static optimization. Our experiments indicate significant improvements in execution time over static layout-based locality enhancing techniques.
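The runtime layout conversion at the heart of this strategy can be pictured with a small sketch: one phase of a program scans a matrix by rows, a later phase scans it by columns, and the array is converted from row-major to column-major storage between the two. This is an illustration under assumed names (`Matrix`, `Order`, `to_column_major`), written by hand; it is not the compiler transformation described in the paper.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Order { RowMajor, ColMajor };

// Dense matrix whose storage order can differ between program phases.
struct Matrix {
    std::size_t rows, cols;
    Order order;
    std::vector<double> data;

    Matrix(std::size_t r, std::size_t c, Order o)
        : rows(r), cols(c), order(o), data(r * c, 0.0) {}

    // Map a logical (i, j) index to a storage offset under the current layout.
    std::size_t offset(std::size_t i, std::size_t j) const {
        assert(i < rows && j < cols);
        return order == Order::RowMajor ? i * cols + j : j * rows + i;
    }
    double& at(std::size_t i, std::size_t j) { return data[offset(i, j)]; }
    double at(std::size_t i, std::size_t j) const { return data[offset(i, j)]; }
};

// Runtime layout conversion: copy the same logical contents
// into a new matrix that uses column-major storage.
inline Matrix to_column_major(const Matrix& src) {
    Matrix dst(src.rows, src.cols, Order::ColMajor);
    for (std::size_t i = 0; i < src.rows; ++i) {
        for (std::size_t j = 0; j < src.cols; ++j) {
            dst.at(i, j) = src.at(i, j);
        }
    }
    return dst;
}
```

The copy itself costs a full pass over the array, so a conversion like this only pays off when the following phase gains more from the new layout than the conversion costs, which is the trade-off the abstract says the optimizer checks.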

7.

Recursive array layouts and fast matrix multiplication

Chatterjee S. Lebeck A.R. Patnala P.K. Thottethodi M. 《Parallel and Distributed Systems, IEEE Transactions on》2002,13(11):1105-1123

The performance of both serial and parallel implementations of matrix multiplication is highly sensitive to memory system behavior. False sharing and cache conflicts cause traditional column-major or row-major array layouts to incur high variability in memory system performance as matrix size varies. This paper investigates the use of recursive array layouts to improve performance and reduce variability. Previous work on recursive matrix multiplication is extended to examine several recursive array layouts and three recursive algorithms: standard matrix multiplication and the more complex algorithms of Strassen (1969) and Winograd. While recursive layouts significantly outperform traditional layouts (reducing execution times by a factor of 1.2-2.5) for the standard algorithm, they offer little improvement for Strassen's and Winograd's algorithms. For a purely sequential implementation, it is possible to reorder computation to conserve memory space and improve performance between 10 percent and 20 percent. Carrying the recursive layout down to the level of individual matrix elements is shown to be counterproductive; a combination of recursive layouts down to canonically ordered matrix tiles instead yields higher performance. Five recursive layouts with successively increasing complexity of address computation are evaluated and it is shown that addressing overheads can be kept in control even for the most computationally demanding of these layouts.
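One well-known recursive layout is Morton (Z-order) ordering, in which the linear offset of element (row, col) is formed by interleaving the bits of the two indices. The sketch below illustrates only that idea and is not the authors' implementation; `spread_bits` and `morton_index` are hypothetical names, and indices are assumed to fit in 16 bits.

```cpp
#include <cassert>
#include <cstdint>

// Insert a zero bit between each of the low 16 bits of v,
// so that bit k of the input ends up at bit 2k of the result.
inline std::uint32_t spread_bits(std::uint32_t v) {
    assert(v < (1u << 16));
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// Z-order offset: column bits occupy even positions, row bits odd positions.
inline std::uint32_t morton_index(std::uint32_t row, std::uint32_t col) {
    return (spread_bits(row) << 1) | spread_bits(col);
}
```

Given the abstract's finding that recursion down to individual elements is counterproductive, an index like this would more plausibly be applied to tile coordinates, with each tile stored in ordinary row-major or column-major order.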

8.

Quantifying and evaluating the space overhead for alternative C++ memory layouts

Peter F. Sweeney Michael Burke 《Software》2003,33(7):595-636

This paper develops a formalism that precisely characterizes when class tables are required for C++ memory layouts. A memory layout is a particular choice of data structures for implementing run‐time support for object‐oriented languages. We use this formalism to quantify and evaluate, on a set of benchmarks, the space overhead for a set of C++ memory layouts. In particular, this paper studies the space overhead due to three language features: virtual dispatch, virtual inheritance, and dynamic typing. To date, there has been no scientific quantification or evaluation of C++ memory layouts. Our approach can help C++ implementors. This work has already influenced the memory layout design choices in IBM's Visual Age C++ V5 compiler. Applying our approach to a set of five benchmarks, we demonstrate that the impact of object‐oriented space overhead can vary dramatically between applications (ranging from 0.42% to 99.79% for our benchmarks). In particular, applications whose object space is dominated by instances of classes that heavily use object‐oriented language features will be significantly impacted by the choice of a memory layout. Copyright © 2003 John Wiley & Sons, Ltd.

9.

Data-race and concurrent-write freedom are undecidable

《Computer Languages, Systems and Structures》2003,29(1-2):1-13

In a distributed shared memory system, sequential consistency is often assumed as the model for the memory, because it is a natural extension from multitasking in uniprocessor systems. Weaker consistency models allow greater concurrency, but programming is harder, because programs may produce unexpected results. Data-race-free (DRF) and concurrent-write-free (CWF) programs have the same set of possible executions both under a sequentially consistent memory and under some other, weaker-model memories. They can be written for a sequential memory and run unchanged under such a weaker-model memory. Since the sets of possible executions are the same, the run will only produce results that are possible under sequential consistency. This article proves the undecidability of both classes of concurrent programs in a language with if statements, loops, barriers, dynamic process creation, dynamic storage, and recursive data structures, under many models weaker than sequential consistency. Moreover, the article also proves that methods that only add synchronization statements to programs written for sequential consistency must produce some conservatively DRF or CWF programs.

10.

Isolating bugs in multithreaded programs using execution suppression

Dennis Jeffrey Yan Wang Chen Tian Rajiv Gupta 《Software》2011,41(11):1259-1288

Memory‐related program failures in multithreaded programs can be caused by a variety of bugs. Concurrency bugs can occur due to unexpected or incorrect thread interleavings during execution. Other kinds of memory bugs, such as buffer overflows and uninitialized reads, may also occur in multithreaded as well as single‐threaded programs. Most prior techniques for isolating these bugs are specialized, addressing only one type of concurrency bug or certain types of other memory bugs. The memory corruption caused by these bugs can also undergo significant propagation during program execution. When a program failure finally occurs due to memory corruption, the true root cause of the failure may be effectively concealed as significant portions of memory may have become corrupted. We propose a general framework that can isolate the root cause of any failure in a multithreaded program that involves memory corruption and reveals at least a subset of this memory corruption. This includes three important types of concurrency bugs (data races, atomicity violations, and order violations) as well as other kinds of memory bugs. To account for propagation of memory corruption, our approach uses a dynamic technique called ‘execution suppression’ that iteratively reveals memory corruption in a failing execution to isolate the true root cause of the failure. Copyright © 2011 John Wiley & Sons, Ltd.

11.

Composition-based Cache simulation for structure reorganization

Keoncheol Shin Hwansoo Han Kwang-Moo Choe 《Journal of Systems Architecture》2010,56(2-3):136-149

Finding the best data layout has been an ultimate goal of memory optimization. Even with a data access profile, heuristic algorithms are needed to reorganize data layout for better locality. The best layout could be found by running the given application with all possible data layouts and selecting the best performing layout. This approach, however, can incur too much overhead, particularly when the number of possible layouts is large. In this paper, we present a composition-based cache simulation for structure reorganization. Instead of running all possible layouts, we simulate only the primary subsets of layouts and compose the cache misses for all layouts by summing up the cache misses of component subsets. Our experiment with the composition-based cache simulation shows that the differences in the cache misses are within 10% of the full cache simulation for 4-way and 8-way set associative caches. In addition to the cache miss estimation, our heuristic algorithm takes account of the extra instruction overhead incurred by structure reorganization. Our experiment with several structure intensive benchmarks shows a 37% reduction in L1D read misses and a 28% reduction in L2 read misses. As a result, the execution times are also reduced by 19% on average.

12.

Automatic validation for binary translation

《Computer Languages, Systems and Structures》2015

Binary translation is an important technique for porting programs as it allows binary code for one platform to execute on another. It is widely used in virtual machines and emulators. However, implementing a correct (and efficient) binary translator is still very challenging because many delicate details must be handled smartly. Manually identifying mistranslated instructions in an application program is difficult, especially when the application is large. Therefore, automatic validation tools are needed urgently to uncover hidden problems in a binary translator. We developed a new validation tool for binary translators. In our validation tool, the original binary code and the translated binary code run simultaneously. Both versions of the binary code continuously send their architecture states and the stored values, which are the values stored into memory cells, to a third process, the validator. Since most mistranslated instructions will result in wrong architecture states during execution, our validator can catch most mistranslated instructions emitted by a binary translator by comparing the corresponding architecture states. Corresponding architecture states may differ due to (1) translation errors, (2) different (but correct) memory layouts, and (3) return values of certain system calls. The need to differentiate the three sources of differences makes comparing architecture states very difficult, if not impossible. In our validator, we take special care to make memory layouts exactly the same and make the corresponding system calls always return exactly the same values in the original and in the translated binaries. Therefore, any differences in the corresponding architecture states indicate mistranslated instructions emitted by the binary translator. Besides solving the architecture-state-comparison problems, we also propose several methods to speed up the automatic validation. The first is the validation-block method, which reduces the number of validations while keeping the accuracy of instruction-level validation. The second is quick validation, which provides extremely fast validation at the expense of less accurate error information. Our validator can be applied to different binary translators. In our experiment, the validator has successfully validated programs translated by static, dynamic, and hybrid binary translators.

13.

Data-flow analysis for sequential storage structures

Shudong WANG Wenjing YIN Yukun DONG Li ZHANG Hao LIU 《Journal of Software》2020,31(5):1276-1293

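Sequential storage structures such as arrays and contiguous memory dynamically allocated by malloc are used extensively in C programs, but most traditional data-flow analysis methods fail to adequately describe these structures and the operations on them. In particular, when a sequential storage structure is accessed through pointers, traditional analyses track only the points-to relations of pointers; they neither model the numeric offsets that pointers may take nor consider the unsafe out-of-bounds accesses that such offsets may cause, which makes the analysis of sequential storage structures imprecise. To address these shortcomings, we first build an abstract model of sequential storage structures and represent both the points-to relations and the offsets that arise when such structures are used together with pointers, yielding SeqMM, an abstract memory model for sequential storage structures. Second, we summarize the pointer-related transfer operations, predicate operations, and loops that traverse sequential storage structures in C programs, and propose a safe-range check to guarantee the safety of these operations. Next, we design interprocedural inference rules for the mapping between formal pointer parameters that reference sequential storage structures and the corresponding actual arguments at function calls. Finally, based on this analysis, we propose a memory-leak detection algorithm and apply it to five open-source C projects. The experimental results show that SeqMM effectively captures the sequential storage structures in C programs and the operations on them, that its data-flow analysis results can be used for memory-leak detection, and that it strikes a reasonable balance between efficiency and precision.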

14.

Program Replacement for Better Throughput

《IEEE transactions on pattern analysis and machine intelligence》1977,(5):369-374

This paper describes a method for increasing the batch-processing efficiency of medium and large scale scientific computer systems with a small main memory. A program-replacement algorithm, RESEP, is defined whose goal is to always keep the CPU busy. RESEP temporarily transfers selected programs from main memory to the secondary storage. Reallocation of such programs is also performed by this algorithm.

15.

Efficient Sequential Consistency Using Conditional Fences

Changhui Lin Vijay Nagarajan Rajiv Gupta 《International journal of parallel programming》2012,40(1):84-117

Among the various memory consistency models, the sequential consistency (SC) model is the most intuitive and enables programmers to reason about their parallel programs the best. Nevertheless, processor designers often choose to support relaxed memory consistency models because the weaker ordering constraints imposed by such models allow for more instructions to be reordered and enable higher performance. Programs running on machines supporting weaker consistency models can be transformed into ones in which SC is enforced. The compiler does this by computing a minimal set of memory access pairs whose ordering automatically guarantees SC. To ensure that these memory access pairs are not reordered, memory fences are inserted. Unfortunately, insertion of such memory fences can significantly slow down the program. We observe that the ordering of the minimal set of memory accesses that the compiler strives to enforce is typically already enforced in the normal course of program execution. A study we conducted on programs with compiler-inserted memory fences shows that only 8% of the executed instances of the memory fences are really necessary to ensure SC. Motivated by this study, we propose the conditional fence mechanism, known as C-Fence, which utilizes compiler information to decide dynamically if there is a need to stall at each fence, only stalling when necessary. Our experiments with SPLASH-2 benchmarks show that, with C-Fences and a centralized active table, programs can be transformed to enforce SC incurring only 12% slowdown, as opposed to 43% slowdown using normal fence instructions. Our approach requires very little hardware support (<350 bytes of on-chip storage) and it avoids the use of speculation and its associated costs. Furthermore, to ameliorate the contention in the centralized active table arising from the increasing number of processors, we also design a distributed active table, which further improves the performance of C-Fence for a larger number of processors.

16.

Automated verification of pointer programs in pointer logic

Zhifang WANG Yiyun CHEN Zhenming WANG Baojian HUA 《Frontiers of Computer Science in China》2008,2(4):380-397

Reasoning about pointer programs is difficult and challenging, while their safety is critical in software engineering. Storeless semantics pioneered by Jonkers provides a method to reason about pointer programs. However, the representation of memory states in Jonkers’ model is costly and redundant. This paper presents a new framework under a more efficient storeless model for automatically verifying properties of pointer programs related to the correct use of dynamically allocated memory such as absence of null dereferences, absence of dangling dereferences, absence of memory leaks, and preservation of structural invariants. The logic introduced here, Pointer Logic, is developed to achieve such goals. To demonstrate that Pointer Logic is a useful storeless approach to verification, the Schorr-Waite tree-traversal algorithm, which is always considered a key test for pointer formalizations, was verified via our analysis. Moreover, an experimental tool, plcc, was implemented to automatically verify a number of non-trivial pointer programs.

17.

Static detection of dynamic memory errors

Guangmei ZHANG Xiaowei LI 《Journal of Computer-Aided Design & Computer Graphics》2005,17(3):400-406

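Dynamic memory errors such as memory leaks and null pointer dereferences are common in programs written in C, C++, and other languages that support dynamic memory operations. Errors in dynamic memory management are the root cause of these dynamic memory errors. The static detection method first analyzes the program statically and applies path alias analysis to determine the intraprocedural and interprocedural path alias relations among dynamic memory variables. On that basis, it analyzes the dynamic memory operations that violate the dynamic memory management pattern in order to identify the dynamic memory errors in the program.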

18.

Automated verification of pointer programs in pointer logic

WANG Zhifang CHEN Yiyun WANG Zhenming HUA Baojian 《Frontiers of Computer Science》2008,2(4):380

Reasoning about pointer programs is difficult and challenging, while their safety is critical in software engineering. Storeless semantics pioneered by Jonkers provides a method to reason about pointer programs. However, the representation of memory states in Jonkers’ model is costly and redundant. This paper presents a new framework under a more efficient storeless model for automatically verifying properties of pointer programs related to the correct use of dynamically allocated memory such as absence of null dereferences, absence of dangling dereferences, absence of memory leaks, and preservation of structural invariants. The logic introduced here, Pointer Logic, is developed to achieve such goals. To demonstrate that Pointer Logic is a useful storeless approach to verification, the Schorr-Waite tree-traversal algorithm, which is always considered a key test for pointer formalizations, was verified via our analysis. Moreover, an experimental tool, plcc, was implemented to automatically verify a number of non-trivial pointer programs.

19.

Cache-Efficient Layouts of Bounding Volume Hierarchies

Sung-Eui Yoon Dinesh Manocha 《Computer Graphics Forum》2006,25(3):507-516


20.

Interactive microcomputer graphic methods for smoothing craft layouts

Veeravudhi Charumongkol 《Computers & Industrial Engineering》1990,19(1-4):304-308

Computer-aided layout planning has been under development since the early 1960s, and many layout programs are based on the powerful and well known program, CRAFT (Computerized Relative Allocation of Facilities Technique by Armour and Buffa 1963).

Unfortunately, the major drawback of these facilities layout programs is that they often create unrealistic and impractical designs since the resulting block layouts can have odd shapes. Most architectural building designs require that the rooms and departments be in the form of squares, rectangles or L-shapes. The traditional approach to rectifying this problem is to manually modify the block plans at the corresponding expense of time and effort.

This paper identifies the problems and proposes interactive graphic methods for smoothing CRAFT layouts. In addition, a new method for the assessment of layout efficiency is introduced to measure the quality of a layout before and after smoothing.