Similar Literature
20 similar documents found.
1.
Array syntax, which is supported in many technical programming languages, adds expressive power by allowing operations on and assignments to whole arrays and array sections. To compile an array assignment statement to a uniprocessor, the language processor must convert the statement into a loop that has the same meaning. This process is called scalarization. Scalarization presents a significant technical problem because an array assignment needs to be implemented as if all inputs are fetched before any outputs are stored. Since a loop intermixes loads and stores, the compiler typically allocates a temporary array to hold the intermediate result. Because these extra temporary arrays can cause performance problems in cache, many techniques have been developed to avoid their use or minimize their size. In this paper, we present a novel application of two compiler strategies, loop alignment and loop skewing, to address this problem. We show that these strategies can achieve the asymptotically minimal memory allocation for stencil computations. Our experiments with loop alignment and loop skewing demonstrate that they are extremely effective in improving the memory hierarchy performance of Fortran 90 array code on standard uniprocessors. The result should be applicable to other array languages, such as MATLAB.
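To make the scalarization problem concrete, here is a minimal C sketch (not from the paper) of a one-dimensional three-point stencil assignment: naive scalarization needs a full temporary array, while carrying the clobbered value in a scalar, in the spirit of loop alignment, reduces the extra storage to O(1).

```c
#include <stddef.h>

/* Naive scalarization of the Fortran 90 assignment
 *   a(2:n-1) = a(1:n-2) + a(3:n)
 * fetches all inputs into a temporary array t before storing any output. */
void scalarize_naive(double *a, double *t, size_t n) {
    for (size_t i = 1; i < n - 1; i++) t[i] = a[i - 1] + a[i + 1];
    for (size_t i = 1; i < n - 1; i++) a[i] = t[i];
}

/* Same semantics with O(1) extra storage: the value of a(i-1) that the
 * loop would otherwise clobber is carried in a scalar across iterations. */
void scalarize_aligned(double *a, size_t n) {
    double prev = a[0];              /* old a(i-1) */
    for (size_t i = 1; i < n - 1; i++) {
        double cur = a[i];           /* save before overwriting */
        a[i] = prev + a[i + 1];      /* a(i+1) is not yet written: still old */
        prev = cur;
    }
}
```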

2.
In distributed memory multicomputers, local memory accesses are much faster than those involving interprocessor communication. To reduce or even eliminate interprocessor communication, the array elements in programs must be carefully distributed to the local memory of processors for parallel execution. We devote our efforts to techniques for allocating the array elements of nested loops onto multicomputers in a communication-free fashion for parallelizing compilers. We first analyze the pattern of references among all arrays referenced by a nested loop, and then partition the iteration space into blocks without interblock communication. The arrays can be partitioned under the communication-free criteria with nonduplicate or duplicate data. Finally, a heuristic method is proposed for mapping the partitioned array elements and iterations onto fixed-size multicomputers with load balancing taken into account. Based on these methods, nested loops can execute without any communication overhead on distributed memory multicomputers. Moreover, the performance of the strategies with nonduplicate and duplicate data is studied for matrix multiplication.
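As a hedged illustration (not the paper's partitioning algorithm), the simplest communication-free case arises when every array reference in the loop body is aligned with the iteration index: block-partitioning the iteration space and the arrays identically then lets each processor run entirely on local data.

```c
/* Processor p of nproc owns the block [lo, hi) of both a and b, and
 * runs exactly those iterations; every operand it touches is local. */
void block_bounds(int p, int nproc, long n, long *lo, long *hi) {
    *lo = p * n / nproc;
    *hi = (long)(p + 1) * n / nproc;
}

void run_my_block(const double *a, double *b, long lo, long hi) {
    for (long i = lo; i < hi; i++)
        b[i] = 2.0 * a[i];   /* references aligned with i: no communication */
}
```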

3.
New load and store processing algorithms let memory-latency-tolerant architectures sustain thousands of in-flight instructions without scaling cycle-critical, fully associative load and store queues. These algorithms rely on redoing some stores after fetching cache-miss data from memory (to fix memory dependences). Doing so provides better power and area characteristics than constantly enforcing memory dependences among several loads and stores, many of which have unknown addresses.

4.
A significant source for enhancing application performance and for reducing power consumption in embedded processor applications is to improve the usage of the memory hierarchy. In this paper, a temporal and spatial locality optimization framework for nested loops is proposed, driven by parameterized cost functions. The considered loops can be imperfectly nested. New data layouts are propagated through the connected references and through the loop nests as constraints for optimizing the next connected reference in the same nest or in the other ones. Unlike many existing methods, special attention is paid to TLB (Translation Lookaside Buffer) effectiveness, since TLB misses can take from tens to hundreds of processor cycles. Our approach only considers active data, that is, array elements that are actually accessed by a loop, in order to prevent useless memory loads and take advantage of storage compression and temporal locality. Moreover, the same data transformation is not necessarily applied to a whole array. Depending on the referenced data subsets, the transformation can result in different data layouts for the same array. This can significantly improve performance since a priori incompatible references can be optimized simultaneously. Finally, the process considers not only the innermost loop level but all levels. Hence, large strides when control returns to the enclosing loop are avoided in several cases, and better optimization is provided in the case of a small index range of the innermost loop.
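Loop interchange is one classic instance of the kind of locality transformation such frameworks apply (a generic sketch, not the paper's specific method): in row-major C, it turns a large-stride inner loop, which is costly for both the cache and the TLB, into a stride-1 walk.

```c
#define N 1024

/* Before: the inner loop strides by N doubles, touching a new cache
 * line (and, for large N, a new page, hence a TLB miss) per access. */
double sum_large_stride(double a[N][N]) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

/* After interchange: stride-1 accesses reuse each cache line and page. */
double sum_unit_stride(double a[N][N]) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}
```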

5.
This paper presents a unified framework that optimizes out-of-core programs by exploiting locality and parallelism and reducing communication overhead. For out-of-core problems, where the data set sizes far exceed the size of the available in-core memory, it is particularly important to exploit the memory hierarchy by optimizing the I/O accesses. We present algorithms that consider both iteration space (loop) and data space (file layout) transformations in a unified framework. We show that the performance of an out-of-core loop nest containing references to out-of-core arrays can be improved by using a suitable combination of file layout choices and loop restructuring transformations. Our approach considers array references one by one and attempts to optimize each reference for parallelism and locality. When there are references for which parallelism optimizations do not work, communication is vectorized so that data transfer can be performed before the innermost loop. Results from hand-compiled codes on the IBM SP-2 and Intel Paragon distributed-memory message-passing architectures show that this approach reduces execution times and improves overall speedups. In addition, we extend the base algorithm to work with file layout constraints and show how it is useful for optimizing programs that consist of multiple loop nests.
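The core out-of-core idiom such optimizations target can be sketched as follows (an illustrative C sketch with an assumed tile size, not the authors' compiler output): data is staged through an in-core buffer one tile at a time, so the loop and file-layout transformations together determine how much I/O each tile needs.

```c
#include <stdio.h>
#include <stdlib.h>

enum { TILE = 1 << 20 };   /* assumed in-core buffer size, in elements */

/* Scale an out-of-core array stored in a binary file, tile by tile. */
void scale_out_of_core(const char *path, long n, double s) {
    double *buf = malloc(TILE * sizeof *buf);
    FILE *f = fopen(path, "r+b");
    if (!buf || !f) { free(buf); if (f) fclose(f); return; }
    for (long off = 0; off < n; off += TILE) {
        long m = (n - off < TILE) ? n - off : TILE;
        fseek(f, off * (long)sizeof(double), SEEK_SET);
        fread(buf, sizeof(double), (size_t)m, f);    /* stage tile in   */
        for (long i = 0; i < m; i++) buf[i] *= s;    /* in-core compute */
        fseek(f, off * (long)sizeof(double), SEEK_SET);
        fwrite(buf, sizeof(double), (size_t)m, f);   /* write tile back */
    }
    fclose(f);
    free(buf);
}
```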

6.
Dependences among loads and stores whose addresses are unknown hinder the extraction of instruction-level parallelism during the execution of a sequential program. Such ambiguous memory dependences can be overcome by memory dependence speculation, which enables a load or store to be speculatively executed before the addresses of all preceding loads and stores are known. Furthermore, multiple speculative stores to a memory location create multiple speculative versions of the location. Program order among the speculative versions must be tracked to maintain sequential semantics. A previously proposed approach, the Address Resolution Buffer (ARB), uses a centralized buffer to support speculative versions. Our proposal, called the Speculative Versioning Cache (SVC), uses distributed caches to eliminate the latency and bandwidth problems of the ARB. The SVC conceptually unifies cache coherence and speculative versioning by using an organization similar to snooping bus-based coherent caches. Our evaluation for the Multiscalar architecture shows that hit latency is an important factor affecting performance and that private cache solutions trade off hit rate for hit latency.

7.
8.
王淑栋  尹文静  董玉坤  张莉  刘浩 《软件学报》2020,31(5):1276-1293
Sequential storage structures, such as arrays and the contiguous memory blocks dynamically allocated by malloc, are used extensively in C programs, but most traditional data-flow analysis methods fail to describe these structures and the operations on them adequately. In particular, when a sequential storage structure is accessed through a pointer, traditional analyses track only the points-to relation; they ignore the numeric offset by which the pointer may move and do not consider the out-of-bounds safety problems such offsets can cause, which makes the analysis of sequential storage structures imprecise. To address these shortcomings, this paper first builds an abstract model of sequential storage structures that effectively represents both the points-to relation and the offset when pointers are combined with such structures, yielding the abstract memory model SeqMM. Second, it summarizes the pointer-related transfer operations, the predicate operations, and the loop operations that traverse sequential storage structures in C programs, and proposes a safe-range check to guarantee the safety of these operations. Next, interprocedural inference rules are designed for the mapping, at function calls, between formal pointer parameters that reference sequential storage structures and the actual arguments. Finally, based on this analysis, a memory-leak detection algorithm is proposed and applied to detect memory-leak defects in 5 open-source C projects. The experimental results show that SeqMM can effectively characterize sequential storage structures and the various operations on them in C programs, that its data-flow analysis results can be used for memory-leak detection, and that it strikes a reasonable balance between efficiency and precision.
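The following C fragment illustrates (as a hypothetical example, not taken from the paper) the pattern SeqMM is designed to track: a pointer into a malloc'd block carries both a points-to target and a numeric offset, the offset changes under pointer arithmetic, and each dereference is safe only while the offset stays within the allocated range.

```c
#include <stdlib.h>

void example(long n, long k) {
    double *buf = malloc(n * sizeof *buf);
    if (!buf) return;
    double *p = buf + k;               /* points-to: buf, offset: k      */
    p += 2;                            /* offset becomes k + 2           */
    if (p - buf >= 0 && p - buf < n)   /* safe-range check on the offset */
        *p = 0.0;                      /* in-bounds dereference          */
    free(buf);                         /* omitting this on a path leaks buf */
}
```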

9.
This paper introduces the concept of problem stores: static stores whose dependent loads often miss in the cache. Accurately identifying problem stores allows the early determination of addresses likely to cause later misses, potentially allowing for the development of novel, proactive prefetching and memory hierarchy management schemes. We present a detailed empirical characterization of problem stores using the SPEC2000 CPU benchmarks. The data suggests several key observations about problem stores. First, we find that the number of important problem stores is typically quite small; the worst 100 problem stores write the values that will lead to about 90% of non-cold misses for a variety of cache configurations. We also find that problem stores only account for 1 in 8 dynamic stores, though they result in 9 of 10 misses. Additionally, the problem stores' dependent loads miss in the L2 cache a larger fraction of the time than loads not dependent on problem stores. We also observe the set of problem stores is stable across a variety of cache configurations. Finally, we found that the instruction distance from problem store to miss and problem store to evict is often greater than one million instructions, but the value is often needed within 100,000 instructions of the eviction.

10.
Data dependence testing is the basic step in detecting loop-level parallelism in numerical programs. The problem is undecidable in the general case. Therefore, work has been concentrated on a simplified problem, affine memory disambiguation. In this simpler domain, array references and loop bounds are assumed to be linear integer functions of loop variables, and dataflow information is ignored. For this domain, we have shown that in practice the problem can be solved accurately and efficiently.(1) This paper studies empirically the effectiveness of this domain restriction: how many real references are affine and flow insensitive. We use Larus's llpp system(2) to find all the data dependences dynamically. We compare these to the results given by our affine memory disambiguation system. This system is exact for all the cases we see in practice. We show that while the affine approximation is reasonable, memory disambiguation is not a sufficient approximation for data dependence analysis. We propose extensions to improve the analysis. This research was supported in part by a fellowship from AT&T Bell Laboratories and by DARPA contract N00014-87-K-0828.
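A small taste of what an affine disambiguator decides (a generic GCD-test sketch, not the authors' full system): references a[x*i + b1] and a[y*j + b2] with linear integer subscripts can touch the same element only if gcd(x, y) divides b2 - b1.

```c
#include <stdbool.h>

static int gcd(int a, int b) {
    if (a < 0) a = -a;
    if (b < 0) b = -b;
    while (b) { int t = a % b; a = b; b = t; }
    return a;
}

/* May a[x*i + b1] and a[y*j + b2] ever alias (ignoring loop bounds)?
 * The linear Diophantine equation x*i - y*j = b2 - b1 has an integer
 * solution iff gcd(x, y) divides b2 - b1. */
bool may_alias(int x, int b1, int y, int b2) {
    int g = gcd(x, y);
    return g == 0 ? (b1 == b2) : ((b2 - b1) % g == 0);
}
```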

11.
12.
For a given input, the global trace generated by a parallel program in a shared-memory multiprocessing environment may change as the memory architecture and management policies change. A method is proposed for ensuring that a correct global trace is generated in the new environment. This method involves a new characterization of a parallel program that identifies its address change points and address affecting points. An extension of traditional process traces, called the intrinsic trace of each process, is developed. The intrinsic traces maximize the decoupling of program execution from simulation by describing the address flow graph and path expressions of each process program. At each point where an address is issued, the trace-driven simulator uses the intrinsic traces and the sequence of loads and stores before the current cycle to determine the next address. The mapping between load and store sequences and the next addresses to issue sometimes requires partial program reexecution. Programs that do not require partial program reexecution are called graph-traceable.

13.
Conventional dynamically scheduled processors often use a fully associative structure named the load/store queue (LSQ) to implement value communication between loads and older in-flight stores and to detect store-load order violations. But this in-flight forwarding accounts for only about 15% of all store-load communication, which makes the CAM-based microarchitecture the major bottleneck to scaling store-load communication further. This paper presents a new microarchitecture named ASW (short for active store window). It provides a new structure named the speculative active store window to implement more aggressive speculative store-load forwarding than the conventional LSQ. This structure can forward the data of committed stores to executing loads without accessing the L1 data cache, which is referred to as far forwarding in this paper. At the back-end of the pipeline, it uses in-order load re-execution filtered by the tagged SSBF (short for store sequence bloom filter) to verify the correctness of the store-load forwarding. The speculative active store window and the tagged store sequence bloom filter are both set-associative structures, which are more efficient and scalable than fully associative ones. Experiments show that this simpler and faster design outperforms a conventional load/store queue based design and the NoSQ design on most benchmarks by 10.22% and 8.71%, respectively.
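The store sequence Bloom filter is the key filtering structure; the sketch below shows the general Bloom-filter mechanics it builds on (a plain address-membership filter with assumed hash functions, without the paper's sequence-number tags): a "no" answer proves no conflicting store exists, so the load can skip re-execution.

```c
#include <stdbool.h>
#include <stdint.h>

enum { BITS = 1024 };
static uint8_t filt[BITS / 8];

/* Two assumed multiplicative hashes mapping an address to [0, BITS). */
static unsigned h1(uint64_t a) { return (unsigned)((a * 0x9E3779B97F4A7C15ull) >> 54); }
static unsigned h2(uint64_t a) { return (unsigned)((a * 0xC2B2AE3D27D4EB4Full) >> 54); }

void ssbf_insert(uint64_t addr) {            /* record a store address */
    filt[h1(addr) / 8] |= (uint8_t)(1u << (h1(addr) % 8));
    filt[h2(addr) / 8] |= (uint8_t)(1u << (h2(addr) % 8));
}

/* false => provably no store to addr was recorded: skip re-execution.
 * true  => possible conflict (may be a false positive): re-execute.  */
bool ssbf_may_conflict(uint64_t addr) {
    return (filt[h1(addr) / 8] >> (h1(addr) % 8) & 1)
        && (filt[h2(addr) / 8] >> (h2(addr) % 8) & 1);
}
```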

14.
Statically analyzing JavaScript applications often requires an analysis of JavaScript libraries because many JavaScript applications use libraries. However, static analysis techniques for JavaScript are not yet ready for analyzing libraries in a scalable and precise manner. Simply loading JavaScript libraries uses various dynamic features of JavaScript, which cause static analyzers to suffer from mutually intermingled problems of scalability and imprecision. In this paper, we present a loop-sensitive analysis (LSA) technique, which can improve analysis scalability when analyzing JavaScript libraries by enhancing the analysis precision of loops. The LSA technique distinguishes loop iterations when loop conditions can be determined to be either true or false precisely. We formalize LSA in the abstract interpretation framework in the presence of tricky language features such as exceptions, and we prove its soundness and precision theorems using Coq. We evaluate our LSA implementation with the analysis results of programs that use 5 JavaScript libraries and show that LSA significantly improves the analysis scalability and precision of an existing JavaScript static analyzer when analyzing JavaScript libraries. In addition, using the configurability of LSA, we experimentally show the correlation between scalability and precision in the analysis of JavaScript libraries. We found that scalably analyzing even simple programs that just load jQuery, the most popular JavaScript library, requires distinguishing not only the last 4 functions called but also at least 40 iterations of each loop in 2-level nested loops. Both the mechanization and implementation of LSA are publicly available.

15.
Reducing memory space requirements is important to many applications. For data-intensive applications, it may help avoid executing the program out-of-core. For high-performance computing, memory space reduction may improve the cache hit rate as well as performance. For embedded systems, it can reduce the memory requirement, the memory latency, and the energy consumption. This paper investigates program transformations which a compiler can use to reduce the memory space required for storing program data. In particular, the paper uses integer programming to model the problem of combining loop shifting, loop fusion, and array contraction to minimize the data memory required to execute a collection of multi-level loop nests. The integer programming problem is then reduced to an equivalent network flow problem which can be solved in polynomial time.
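Loop fusion and array contraction, two of the transformations the integer-programming model combines, are easy to see in a two-statement example (a generic sketch, not from the paper).

```c
/* Before: the producer and consumer are separate loops, so t must be
 * a full n-element array. */
void before(const double *a, const double *b, double *c, double *t, long n) {
    for (long i = 0; i < n; i++) t[i] = a[i] + b[i];
    for (long i = 0; i < n; i++) c[i] = t[i] * t[i];
}

/* After fusion the producer-consumer distance is zero, so the array t
 * contracts to a scalar: temporary storage drops from O(n) to O(1). */
void after(const double *a, const double *b, double *c, long n) {
    for (long i = 0; i < n; i++) {
        double t = a[i] + b[i];
        c[i] = t * t;
    }
}
```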

16.
Memory accesses introduce significant time overhead and power consumption because of the performance gap between processors and main memory. This paper describes and evaluates a technique, loop scheduling with memory access reduction (LSMAR), that replaces hidden redundant load operations with register operations in loop kernels and performs partial scheduling for newly generated register operations subject to register constraints. By exploiting the data dependences of memory access operations, the LSMAR technique can effectively reduce the number of memory accesses of loop kernels, thereby improving timing performance. The technique has been implemented in the Trimaran compiler and evaluated using a set of benchmarks from DSPstone and MiBench on the cycle-accurate simulator of the Trimaran infrastructure. The experimental results show that when the LSMAR technique is applied, the number of memory accesses can be reduced by 18.47% on average over the benchmarks, compared with when it is not applied. The measurements also indicate that the optimizations lead to only an average 1.41% increase in code size. With such small code size expansion, the technique is well suited to embedded systems compared with prior work.
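The hidden redundant loads that LSMAR targets typically arise across iterations; in the sketch below (a source-level analogue of what the technique does at the instruction level inside loop kernels, not the paper's algorithm), the value loaded as a[i+1] in one iteration is reloaded as a[i] in the next, and register promotion removes the reload.

```c
/* Before: each iteration loads both a[i] and a[i+1], although a[i]
 * was already loaded (as a[i+1]) by the previous iteration. */
void before(const double *a, double *b, long n) {
    for (long i = 0; i < n - 1; i++)
        b[i] = a[i] + a[i + 1];
}

/* After: the reused value is carried in a register-resident scalar,
 * so each iteration performs one load instead of two. */
void after(const double *a, double *b, long n) {
    if (n < 2) return;
    double r = a[0];
    for (long i = 0; i < n - 1; i++) {
        double nxt = a[i + 1];
        b[i] = r + nxt;
        r = nxt;                 /* no reload of a[i] next iteration */
    }
}
```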

17.
Presents a theoretical framework for automatically partitioning parallel loops to minimize cache coherency traffic on shared-memory multiprocessors. While several previous papers have looked at hyperplane partitioning of iteration spaces to reduce communication traffic, the problem of deriving the optimal tiling parameters for minimal communication in loops with general affine index expressions has remained open. Our paper solves this open problem by presenting a method for deriving an optimal hyperparallelepiped tiling of iteration spaces for minimal communication in multiprocessors with caches. We show that the same theoretical framework can also be used to determine optimal tiling parameters for both data and loop partitioning in distributed memory multicomputers. Our framework uses matrices to represent iteration and data space mappings and the notion of uniformly intersecting references to capture temporal locality in array references. We introduce the notion of data footprints to estimate the communication traffic between processors and use linear algebraic methods and lattice theory to compute precisely the size of data footprints. We have implemented this framework in a compiler for Alewife, a distributed shared-memory multiprocessor.
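A rectangular tiling of a 2-D stencil (the simplest special case of the hyperparallelepiped tilings the framework derives; Ti and Tj are assumed tile-size parameters) shows what is being optimized: each tile's data footprint is what the lattice-theoretic analysis sizes, and communication is confined to footprint overlaps at tile boundaries.

```c
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Each (ii, jj) tile would be assigned to one processor; only the
 * one-element halo around a tile lies in a neighbor's footprint. */
void tiled_stencil(int n, double a[n][n], double b[n][n], int Ti, int Tj) {
    for (int ii = 1; ii < n - 1; ii += Ti)
        for (int jj = 1; jj < n - 1; jj += Tj)
            for (int i = ii; i < MIN(ii + Ti, n - 1); i++)
                for (int j = jj; j < MIN(jj + Tj, n - 1); j++)
                    b[i][j] = 0.25 * (a[i-1][j] + a[i+1][j]
                                    + a[i][j-1] + a[i][j+1]);
}
```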

18.
Loops are the single largest source of parallelism in many applications. One way to exploit this parallelism is to execute loop iterations in parallel on different processors. Previous approaches to loop scheduling attempted to achieve the minimum completion time by distributing the workload as evenly as possible while minimizing the number of synchronization operations required. The authors consider a third dimension to the problem of loop scheduling on shared-memory multiprocessors: communication overhead caused by accesses to nonlocal data. They show that traditional algorithms for loop scheduling, which ignore the location of data when assigning iterations to processors, incur a significant performance penalty on modern shared-memory multiprocessors. They propose a new loop scheduling algorithm that attempts to simultaneously balance the workload, minimize synchronization, and co-locate loop iterations with the necessary data. They compare the performance of this new algorithm to other known algorithms by using five representative kernel programs on a Silicon Graphics multiprocessor workstation, a BBN Butterfly, a Sequent Symmetry, and a KSR-1, and show that the new algorithm offers substantial performance improvements, up to a factor of 4 in some cases. The authors conclude that loop scheduling algorithms for shared-memory multiprocessors cannot afford to ignore the location of data, particularly in light of the increasing disparity between processor and memory speeds.
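For contrast with the locality-aware algorithm, here is the flavor of a traditional dynamic scheme it improves on: guided self-scheduling sketched with C11 atomics (assumed simplifications: one shared counter, chunk size remaining/nproc). Early chunks are large to cut synchronization and late chunks shrink to balance load, but no attention is paid to where the data lives.

```c
#include <stdatomic.h>

static atomic_long next_iter;   /* shared loop counter, starts at 0 */

/* Returns the first iteration of a grabbed chunk and its length via
 * *len, or -1 when the loop is exhausted. */
long grab_chunk(long n, int nproc, long *len) {
    long i = atomic_load(&next_iter);
    for (;;) {
        if (i >= n) return -1;
        long c = (n - i) / nproc;         /* guided: shrinks as loop drains */
        if (c < 1) c = 1;
        if (atomic_compare_exchange_weak(&next_iter, &i, i + c)) {
            *len = c;
            return i;                     /* caller runs [i, i + c) */
        }                                 /* on failure, i was refreshed */
    }
}
```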

19.
System analysis is used to develop application software. However, the conventional approach has difficulty ensuring that the software achieves high user satisfaction. This research uses the user's memory load to measure the degree of satisfaction with application software. An information flow analysis model is proposed for analyzing the degree of memory load. Ten types of operations, comprising six internal operations and four external operations, are adopted to measure the memory load. The conventional data flow diagram (DFD) and state transition diagram (STD) are modified by adding more functional components to estimate the memory load. The information flow analysis is demonstrated by building qualitative analysis software that demands heavy mental work of its users but has a simple software architecture.

20.
In a recent study, we discovered that many single load/store operations in embedded applications can be parallelized and thus encoded simultaneously in a single-instruction multiple-data instruction, called the multiple load/store (MLS) instruction. In this work, we investigate the problem of utilizing MLS instructions to produce optimized machine code and propose an effective approach to the problem. Specifically, we formalize the MLS problem, that is, the problem of maximizing the use of MLS instructions with an unlimited register file size. Based on this analysis, we show that we can solve the problem efficiently by translating it into a variant of the problem of finding a maximum weighted path cover in a dynamic weighted graph. To handle the more realistic case of a finite register file size, our solution is then extended to take into account the constraints of register sequencing in MLS instructions and the limited register resources available in the target processor. We demonstrate the effectiveness of our approach experimentally using a set of benchmark programs. In summary, our approach can reduce the number of loads/stores by 13.3% on average, compared with the code generated by existing compilers. The total code size reduction is 3.6%. This code size reduction comes at almost no cost because the overall increase in compilation time as a result of our technique remains quite minimal.
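The enabling condition behind MLS encoding can be shown with a tiny predicate (a hypothetical sketch; the paper's actual formulation is a weighted path cover over candidate pairs): two loads or stores can share one multiple load/store instruction when their addresses are consecutive and their registers can be sequenced accordingly.

```c
#include <stdbool.h>
#include <stdint.h>

/* A memory operation as seen after address analysis: its address and
 * the register it targets. */
typedef struct { uint64_t addr; int reg; } memop_t;

/* Two ops can fold into one MLS instruction if their addresses are
 * consecutive words and the register numbers respect the required
 * sequencing (here simplified, as an assumption, to "consecutive"). */
bool mls_mergeable(memop_t a, memop_t b, unsigned word_bytes) {
    return b.addr - a.addr == word_bytes && b.reg == a.reg + 1;
}
```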
