共查询到20条相似文献,搜索用时 15 毫秒
1.
This paper applies unimodular transformations and tiling to improve data locality of a loop nest. Due to data dependences and reuse information, not all dimensions of the iteration space will and can be tiled. By using cones to represent data dependences and vector spaces to quantify data reuse in the program, a reuse-driven transformational approach is presented, which aims at maximizing the amount of data reuse carried in the tiled dimensions of the iteration space while keeping the number of tiled dimensions to a minimum (to reduce loop control overhead). In the special case of one single fully permutable loop nest, an algorithm is presented that tiles the program optimally so that all data reuse is carried in the tiled dimensions. In the general case of multiple fully permutable loop nests, data dependences can prevent all data reuse to be carried in the tiled dimensions. An algorithm is presented that aims at localizing data reuse in the tiled dimensions so that the reuse space localized has the largest dimensionality possible. 相似文献
2.
Pingali Venkata K. McKee Sally A. Hsieh Wilson C. Carter John B. 《International journal of parallel programming》2003,31(4):305-338
Data access costs contribute significantly to the execution time of applications with complex data structures. A the latency of memory accesses becomes high relative to processor cycle times, application performance is increasingly limited by memory performance. In some situations it is useful to trade increased computation costs for reduced memory costs. The contributions of this paper are three-fold: we provide a detailed analysis of the memory performance of seven memory-intensive benchmarks; we describe Computation Regrouping, a source-level approach to improving the performance of memory-bound applications by increasing temporal locality to eliminate cache and TLB misses; and, we demonstrate significant performance improvement by applying Computation Regrouping to our suite of seven benchmarks. Using Computation Regrouping, we observe a geometric mean speedup of 1.90, with individual speedups ranging from 1.26 to 3.03. Most of this improvement comes from eliminating memory tall time. 相似文献
3.
4.
The Data Locality of Work Stealing 总被引:1,自引:0,他引:1
This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines,
where movement of data to and from the cache is solely controlled by the hardware. We present lower and upper bounds on the
number of cache misses when using work stealing, and introduce a locality-guided work-stealing algorithm and its experimental
validation.
{As a lower bound, we show that a work-stealing application that exhibits good data locality on a uniprocessor may exhibit
poor data locality on a multiprocessor. In particular, we show a family of multithreaded computations G
n
whose members perform Θ(n) operations (work) and incur a constant number of cache misses on a uniprocessor, while even on two processors the total
number of cache misses soars to Ω(n) . On the other hand, we show a tight upper bound on the number of cache misses that nested-parallel computations, a large,
important class of computations, incur due to multiprocessing. In particular, for nested-parallel computations, we show that
on P processors a multiprocessor execution incurs an expected
more misses than the uniprocessor execution. Here m is the execution time of an instruction incurring a cache miss, s is the steal time, C is the size of cache, and T
∈
fty is the number of nodes on the longest chain of dependencies. Based on this we give strong execution time bounds for nested-parallel
computations using work stealing.}
For the second part of our results, we present a locality-guided work-stealing algorithm that improves the data locality
of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative
data-parallel applications show that the algorithm matches the performance of static-partitioning under traditional work loads
but improves the performance up to 50% over static partitioning under multiprogrammed work loads. Furthermore, locality-guided
work stealing improves the performance of work stealing up to 80%. 相似文献
5.
6.
A significant source for enhancing application performance and for reducing power consumption in embedded processor applications is to improve the usage of the memory hierarchy. In this paper, a temporal and spatial locality optimization framework of nested loops is proposed, driven by parameterized cost functions. The considered loops can be imperfectly nested. New data layouts are propagated through the connected references and through the loop nests as constraints for optimizing the next connected reference in the same nest or in the other ones. Unlike many existing methods, special attention is paid to TLB (Translation Lookaside Buffer) effectiveness since TLB misses can take from tens to hundreds of processor cycles. Our approach only considers active data, that is, array elements that are actually accessed by a loop, in order to prevent useless memory loads and take advantage of storage compression and temporal locality. Moreover, the same data transformation is not necessarily applied to a whole array. Depending on the referenced data subsets, the transformation can result in different data layouts for a same array. This can significantly improve the performance since a priori incompatible references can be simultaneously optimized. Finally, the process does not only consider the innermost loop level but all levels. Hence, large strides when control returns to the enclosing loop are avoided in several cases, and better optimization is provided in the case of a small index range of the innermost loop. 相似文献
7.
提出了一种可扩展的局部敏感哈希索引(SLSH),以解决高维动态数据索引中,由于数据集大小及分布特征无法确定而导致索引效率降低的问题.SLSH架构于E2LSH之上,继承了其对高维数据索引速度快,并可直接对欧式空间上的数据点进行索引的特点.为了使得哈希索引具有动态的相似性区分能力,SLSH修改了E2LSH的哈希族,通过哈希桶容量约束自适应调节哈希参数.因此对于分布密度动态变化的数据空间,SLSH也能够给出鲁棒的划分. 相似文献
8.
一种利用数据融合来提高局部性和减少伪共享的方法 总被引:6,自引:0,他引:6
某些应用程序不能通过数组内元素的重排优化获得性能提高 .针对这一问题 ,该文扩展了数组之间数据重组优化方法 ,着重分析了将多个数组的数据按一定方式进行融合来提高局部性和减少伪共享优化方法的特性 .文章针对几种典型的数组关联模式 ,提出了相应的数据融合方法 ,并建立了一组粗略的性能代价判别规则 ,以指导编译器有选择地融合数组以提高程序的全局优化效果 .根据在多个平台上的测试结果 ,该文还分析了数据融合优化方法在不同体系结构上的性能可移植性 ,并将体系结构特征加入到性能代价判别规则中 ,使得此优化方法能适用于不同的体系结构 .测试结果表明 ,数据融合优化方法对提高某些应用程序的性能 ,尤其是其在软件DSM体系结构上的性能 ,是非常有效的 相似文献
9.
Mü ller Christoph Frey Steffen Strengert Magnus Dachsbacher Carsten Ertl Thomas 《IEEE transactions on visualization and computer graphics》2009,15(4):605-617
We present a development environment for distributed GPU computing targeted for multi-GPU systems, as well as graphics clusters. Our system is based on CUDA and logically extends its parallel programming model for graphics processors to higher levels of parallelism, namely, the PCI bus and network interconnects. While the extended API mimics the full function set of current graphics hardware—including the concept of global memory—on all distribution layers, the underlying communication mechanisms are handled transparently for the application developer. To allow for high scalability, in particular for network-interconnected environments, we introduce an automatic GPU-accelerated scheduling mechanism that is aware of data locality. This way, the overall amount of transmitted data can be heavily reduced, which leads to better GPU utilization and faster execution. We evaluate the performance and scalability of our system for bus and especially network-level parallelism on typical multi-GPU systems and graphics clusters. 相似文献
10.
J. Bikker 《Computer Graphics Forum》2012,31(6):1936-1947
In this paper, we investigate the efficiency of ray queries on the CPU in the context of path tracing, where ray distributions are mostly random. We show that existing schemes that exploit data locality to improve ray tracing efficiency fail to do so beyond the first diffuse bounce, and analyze the cause for this. We then present an alternative scheme inspired by the work of Pharr et al. in which we improve data locality by using a data‐centric breadth‐first approach. We show that our scheme improves on state‐of‐the‐art performance for ray distributions in a path tracer. 相似文献
11.
12.
13.
14.
Java语言提供了同步锁、可重入锁和读写锁等几种锁机制,在并行程序设计中不同的数据结构使用这几种锁机制时获得的性能通常是不同的。为了在不同的锁机制之间进行自动转换,进而帮助程序员了解程序的性能,提出了一种面向Java锁机制的字节码自动重构框架,并基于该框架实现了字节码重构工具Lock2Lock。Lock2Lock在Quad中间表示的基础上对字节码进行静态分析,并对分析的结果进行一致性验证,通过Javassist完成字节码的重构。使用红黑树、消费者生产者程序以及SPECjbb2005 3个测试程序对Lock2Lock重构工具进行了测试,结果表明,Lock2Lock可以成功地实现从同步锁到可重入锁或读写锁的重构。 相似文献
15.
《Journal of Parallel and Distributed Computing》1995,27(2):183-200
Although the dataflow model has been shown to allow the exploitation of parallelism at all levels, research of the past decade has revealed several fundamental problems. Synchronization at the instruction level, token matching, coloring, and re-labeling operations have a negative impact on performance by significantly increasing the number of non-compute "overhead" cycles. Recently, many novel hybrid von-Neumann data driven machines have been proposed to alleviate some of these problems. The major objective has been to reduce or eliminate unnecessary synchronization costs through simplified operand matching schemes and increased task granularity. Moreover, the results from recent studies quantifying locality suggest sufficient spatial and temporal locality is present in dataflow execution to merit its exploitation. In this paper we present a data structure for exploiting locality in a data driven environment: the vector cell. A vector cell consists of a number of fixed length chunks of data elements. Each chunk is tagged with a presence bit, providing intra-chunk strictness and inter-chunk non-strictness to data structure access. We describe the semantics of the model, processor architecture and instruction set as well as a Sisal to dataflow vectorizing compiler back-end. The vector cell model is evaluated by comparing its performance to those of both a classical fine-grain dataflow processor employing I-structures and a conventional pipelined vector processor. Results indicate that the model is surprisingly resilient to long memory and communication latencies and is able to dynamically exploit the underlying parallelism across multiple processing elements at run time. 相似文献
16.
Currently there is a lot of interest in graph representations of software systems, as they provide a natural and flexible means to describe complex structures. The various visual sublanguages of the UML are perhaps the most obvious example of this. In [11] a graph representation of object-oriented programs was presented that enables one to describe refactoring operations (behaviour-preserving changes in the structure of a program) in a formal, concise way by graph rewriting productions. In general, however, a refactoring makes changes to a small part of a program, so the graph representation should only contain the information needed to carry out that refactoring. All other details are redundant and make the graph unnecessarily large for good visualization. A possible solution consists in using a hierarchical representation. Such a representation of object-oriented programs is presented in this paper. It is based on node-rewriting graph productions: each refinement step corresponds to a production. The construction is illustrated by applying it to a small Java simulation of a Local Area Network. 相似文献
17.
基于设计模式的重构技术 总被引:2,自引:0,他引:2
设计模式是编程过程中的设计经验,在应用程序的实现过程中可以利用设计模式,用重构的观念来对待设计模式的实现。采用测试优先的单元测试技术,可以保证重构的安全。介绍了这方面的知识,并举例说明了这种方法的可行性和有效性。 相似文献
18.
Defining functions by pattern matching over the arguments is advantageous for understanding and reasoning, but it tends to expose the implementation of a datatype. Significant effort has been invested in tackling this loss of modularity; however, decoupling patterns from concrete representations while maintaining soundness of reasoning has been a challenge. Inspired by the development of invertible programming, we propose an approach to program refactoring based on a right-invertible language rinv—every function has a right (or pre-) inverse. We show how this new design is able to permit a smooth incremental transition from programs with algebraic datatypes and pattern matching, to ones with proper encapsulation, while maintaining simple and sound reasoning. 相似文献
19.
Christopher Brown Marco Danelutto Kevin Hammond Peter Kilpatrick Archibald Elliott 《International journal of parallel programming》2014,42(4):564-582
This paper presents a new programming methodology for introducing and tuning parallelism in Erlang programs, using source-level code refactoring from sequential source programs to parallel programs written using our skeleton library, Skel. High-level cost models allow us to predict with reasonable accuracy the parallel performance of the refactored program, enabling programmers to make informed decisions about which refactorings to apply. Using our approach, we demonstrate easily obtainable, significant and scalable speedups of up to 21 on a 24-core machine over the sequential code. 相似文献
20.
Refactoring Tools: Fitness for Purpose 总被引:1,自引:0,他引:1
Refactoring tools can improve the speed and accuracy with which developers create and maintain software—but only if they are used. In practice, tools are not used as much as they could be; this seems to be because sometimes they do not align with the refactoring tactic preferred by most programmers, a tactic the authors call floss refactoring. They propose five principles that characterize successful floss-refactoring tools—principles that can help programmers to choose the most appropriate refactoring tools and also help toolsmiths to design tools that fit the programmer's purpose. 相似文献