Similar Literature
1.
GAGAN AGRAWAL  JOEL SALTZ 《Software》1997,27(5):519-545
Data parallel languages like High Performance Fortran (HPF) are emerging as the architecture-independent mode of programming distributed memory parallel machines. In this paper, we present the interprocedural optimizations required for compiling applications that have irregular data access patterns when they are coded in such data parallel languages. We have developed an Interprocedural Partial Redundancy Elimination (IPRE) algorithm for optimized placement of the runtime preprocessing routines and collective communication routines inserted for managing communication in such codes. We also present two new interprocedural optimizations: placement of scatter routines and use of coalescing and incremental routines. We then describe how program slicing can be used for further applying IPRE in more complex scenarios. We have done a preliminary implementation of the schemes presented here using the Fortran D compilation system as the necessary infrastructure. We present experimental results from two codes compiled using our system to demonstrate the efficacy of the presented schemes. ©1997 John Wiley & Sons, Ltd.

2.
Exploiting cache locality of parallel programs at runtime is a complementary approach to compiler optimization. This is particularly important for applications with dynamic memory access patterns. We propose a memory-layout oriented technique to exploit cache locality of parallel loops at runtime on Symmetric Multiprocessor (SMP) systems. Guided by application-dependent and target-architecture-dependent hints, our system, called Cacheminer, reorganizes and partitions a parallel loop using the memory-access space of its execution. Through effective runtime transformations, our system maximizes the data reuse in each partitioned data region assigned to a cache, and minimizes the data sharing among the partitioned data regions assigned to all caches. The executions of tasks in the partitions are scheduled in an adaptive and locality-preserving way to minimize the execution time of programs by trading off load balance and locality. We have implemented the Cacheminer runtime library on two commercial SMP servers and a SimCS simulated SMP. Our simulation and measurement results show that our runtime approach can achieve performance comparable to compiler optimizations for programs with regular computation and memory-access patterns, whose load balance and cache locality can be well optimized by tiling and other program transformations. Moreover, our experimental results show that our approach is able to significantly improve memory performance for applications with irregular computation and dynamic memory access patterns. These types of programs are usually hard to optimize with static compiler optimizations.

3.
The bandwidth mismatch between processor and main memory is a major throughput-limiting problem. Although streamed computations have predictable access patterns, their references have little temporal locality and are generally too long to cache. A memory and compiler co-optimization aimed at reducing low-level memory accesses using software and hardware locality optimizations is presented. We propose a scalable and predictable parallel memory based on a compiler synthesis of storage schemes for multi-dimensional arrays that are accessed by an arbitrary but known set of data access patterns. Using the algebra of non-singular Boolean matrices, we present an analysis of conflict-free access to (1) parallel memories, and (2) alignment networks. Finding a multi-pattern storage scheme is an NP-complete problem. An effective compiler heuristic is proposed for finding a storage matrix that minimizes overall memory access time. This applies to arbitrary linear patterns and arbitrary alignment networks. It is shown that the proposed heuristic finds an optimal storage scheme for parallel (1) FFT, and (2) bitonic sorting. The proposed storage scheme outperforms statically optimized storages in the case of power-of-2 multi-stride access. The case of non-power-of-2 strides is also addressed. The performance and scalability of the proposed parallel memory and its predictable access time are presented using numerical and multimedia algorithms. It is shown that a memory utilization above 83% is achieved by our storage scheme for 64 memories, which largely outperforms previous proposals. Our approach provides a tool for matching the storage pattern with the data access patterns needed for embedded systems running streamed computations with predictable data access patterns.

4.
Exploiting compile-time knowledge to improve memory bandwidth can produce noticeable improvements at runtime (1, 2). Allocating the data structure (1) to separate memories whenever the data may be accessed in parallel allows improvements in memory access time of 13 to 40%. We are concerned with synthesizing compiler storage schemes for minimizing array access conflicts in parallel memories for a set of compiler-predicted data access patterns. The access patterns can be easily found for many synchronous dataflow computations such as multimedia compression/decompression algorithms, DSP, vision, and robotics. A storage scheme is a mapping from array addresses into storages. Finding a conflict-free storage scheme for a set of data patterns is NP-complete. This problem is reducible to weighted graph coloring. Optimizing the storage scheme is investigated by using constructive heuristics, neural methods, and genetic algorithms. The details of the implementation of these different approaches are presented. Using realistic data patterns, simulation shows that memory utilization of 80% or higher can be achieved in the case of 20 data patterns over up to 256 parallel memories, i.e., a scalable parallel memory. The neural approach was fast in producing reasonably good solutions even for large problem sizes. Convergence of the proposed neural algorithm seems to be only slightly dependent on problem size. Genetic algorithms are recommended for advanced compiler optimization, especially for large problem sizes and for applications that are compiled once and run many times over different data sets. The solutions presented are also useful for other optimization problems.
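To make "storage scheme" and "conflict-free" concrete, here is a minimal sketch under stated assumptions: a toy 5 x 5 array, five parallel memories, and a hypothetical linear skewing function (`skewed_bank`) picked for illustration. It is not one of the heuristic, neural, or genetic methods the abstract describes; it only shows how a mapping from array indices to memories can be checked against a set of access patterns.

```python
from collections import Counter

M = 5  # number of parallel memories (assumed for this toy example)
N = 5  # the array is N x N (assumed)

def interleaved_bank(i, j):
    """Plain row-major interleaving: linear address modulo memory count."""
    return (i * N + j) % M

def skewed_bank(i, j):
    """Hypothetical linear skewing scheme, chosen for illustration only."""
    return (i + 2 * j) % M

# A small set of access patterns, each a list of (row, column) elements
# that the program wants to fetch in one parallel access.
PATTERNS = {
    "row": [(0, j) for j in range(N)],
    "column": [(i, 0) for i in range(N)],
    "diagonal": [(k, k) for k in range(N)],
    "anti-diagonal": [(k, N - 1 - k) for k in range(N)],
}

def conflicts(scheme, pattern):
    """Count extra accesses that are serialized because elements share a memory."""
    hits = Counter(scheme(i, j) for i, j in pattern)
    return sum(n - 1 for n in hits.values())
```

With these assumptions, `conflicts(interleaved_bank, PATTERNS["column"])` is 4 because every column element lands in the same memory, while `skewed_bank` is conflict-free for all four patterns. Finding such a mapping for an arbitrary pattern set is the hard part the abstract addresses.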

5.
Irregular access patterns are a major problem for today’s optimizing compilers. In this paper, a novel approach will be presented that enables transformations that were designed for regular loop structures to be applied to linked list data structures. This is achieved by linearizing access to a linked list, after which further data restructuring can be performed. Two subsequent optimization paths will be considered: annihilation and sublimation, which are driven by the occurring regular and irregular access patterns in the applications. These intermediate codes are amenable to traditional compiler optimizations targeting regular loops. In the case of sublimation, a run-time step is involved which takes the access pattern into account and thus generates a data instance specific optimized code. Both approaches are applied to a sparse matrix multiplication algorithm and an iterative solver: preconditioned conjugate gradient. The resulting transformed code is evaluated using the major compilers for the x86 platform, GCC and the Intel C compiler.

6.
This paper presents a data layout optimization technique for sequential and parallel programs based on the theory of hyperplanes from linear algebra. Given a program, our framework automatically determines suitable memory layouts that can be expressed by hyperplanes for each array that is referenced. We discuss the cases where data transformations are preferable to loop transformations and show that under certain conditions a loop nest can be optimized for perfect spatial locality by using data transformations. We argue that data transformations can also optimize spatial locality for some arrays without distorting temporal/spatial locality exhibited by others. We divide the problem of optimizing data layout into two independent subproblems: 1) determining optimal static data layouts, and 2) determining data transformation matrices to implement the optimal layouts. By postponing the determination of the transformation matrix to the last stage, our method can be adapted to compilers with different default layouts. We then present an algorithm that considers optimizing parallelism and spatial locality simultaneously. Our results on eight programs on two distributed shared-memory multiprocessors, the Convex Exemplar SPP-2000 and the SGI Origin 2000, show that the layout optimizations are effective in optimizing spatial locality and parallelism.

7.
We present a scalable parallelization scheme for high-order stencil computations that also optimizes memory behavior on multicore clusters. Our multilevel approach combines: (i) inter-node parallelization via spatial decomposition; (ii) inter-core parallelization via multithreading and explicit non-uniform memory access (NUMA) control; (iii) data locality optimizations through auto-tuned tiling for efficient use of hierarchical memory; and (iv) register blocking and data parallelism via single-instruction multiple-data techniques to utilize registers and exploit data locality. The scheme is applied to a sixth-order stencil based finite-difference time-domain code. Weak-scaling parallel efficiency is over 98% on 32,768 BlueGene/P processors. Multithreading with explicit NUMA control attains 9.9-fold speedup on a dual 12-core AMD Opteron system. Data locality optimizations achieve 7.7-fold reduction of the last level cache miss rate of Intel Nehalem, whereas register blocking increases data parallelism and thereby achieves 5.9 Gflops performance on a single core. Register blocking + multithreading optimizations achieve 5.8-fold speedup on a single quadcore Nehalem.

8.
On cc-NUMA multi-processors, the non-uniformity of main memory latencies motivates the need for co-location of threads and data. We call this special form of data locality geographical locality. In this article, we study the performance of a parallel PDE solver with adaptive mesh refinement (AMR). The solver is parallelized using OpenMP, and the adaptive mesh refinement makes dynamic load balancing necessary. Due to the dynamically changing memory access pattern caused by the runtime adaption, it is a challenging task to achieve a high degree of geographical locality. The main conclusions of the study are: (1) that geographical locality is very important for the performance of the solver, (2) that the performance can be improved significantly using dynamic page migration of misplaced data, (3) that a migrate-on-next-touch directive works well whereas the first-touch strategy is less advantageous for programs exhibiting dynamically changing memory access patterns, and (4) that the overhead for such migration is low compared to the total execution time.

9.
This paper presents a new compiler optimization algorithm that parallelizes applications for symmetric, shared-memory multiprocessors. The algorithm considers data locality, parallelism, and the granularity of parallelism. It uses dependence analysis and a simple cache model to drive its optimizations. It also optimizes across procedures by using interprocedural analysis and transformations. We validate the algorithm by hand-applying it to sequential versions of parallel Fortran programs operating over dense matrices. The programs initially were hand-coded to target a variety of parallel machines using loop parallelism. We ignore the user's parallel loop directives, and use known and implemented dependence and interprocedural analysis to find parallelism. We then apply our new optimization algorithm to the resulting program. We compare the original parallel program to the hand-optimized program, and show that our algorithm improves three programs, matches four programs, and degrades one program in our test suite on a shared-memory, bus-based parallel machine with local caches. This experiment suggests existing dependence and interprocedural array analysis can automatically detect user parallelism, and demonstrates that user parallelized codes often benefit from our compiler optimizations, providing evidence that we need both parallel algorithms and compiler optimizations to effectively utilize parallel machines.

10.
Mining association rules from large databases is very costly. We propose to develop parallel algorithms for this task on shared-memory multiprocessors (SMPs). All proposed parallel algorithms for other paradigms follow the conventional level-wise approach: they need as many iterations as the length of the maximum large itemset. To make matters worse, they impose a synchronization in every iteration, which would cause serious I/O contention on a shared-memory parallel system. An adaptive asynchronous parallel mining algorithm, APM, has been proposed for SMPs. All processors generate candidates dynamically and count itemset supports independently without synchronization. Two optimization techniques have been proposed for reducing database scanning and the number of candidates. The APM algorithm has been implemented on a Sun Enterprise 4000 shared-memory multiprocessor with 12 nodes. The experiments show that the optimizations are very effective and that APM has a substantial lead in performance over other proposed level-wise algorithms.
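The "conventional level-wise approach" that the abstract contrasts APM with can be sketched as follows. This is a minimal sequential Apriori-style illustration, not APM itself; the function name, the `min_support` parameter, and the returned pass counter are assumptions made for this example. It shows why the number of database scans tracks the length of the longest large itemset.

```python
from itertools import combinations

def levelwise_frequent_itemsets(transactions, min_support):
    """Return (frequent itemsets with support counts, number of database passes)."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    passes = 0
    k = 1
    while candidates:
        passes += 1
        counts = {c: 0 for c in candidates}
        for t in transactions:  # one full database scan per level
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c for c, n in counts.items() if n >= min_support}
        frequent.update({c: counts[c] for c in level})
        k += 1
        # Join frequent (k-1)-itemsets into k-itemsets, then prune any
        # candidate that has an infrequent (k-1)-subset.
        unions = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            u for u in unions
            if all(frozenset(s) in level for s in combinations(u, k - 1))
        ]
    return frequent, passes
```

For the transactions `["abc", "ab", "ac", "bc", "abc"]` with `min_support=2`, the longest frequent itemset has three items and the loop makes three passes. In a parallel level-wise variant, each pass would also end in a synchronization point, which is the cost the abstract says APM avoids.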

11.
Object-oriented techniques have been proffered as aids for managing complexity, enhancing reuse, and improving readability of irregular parallel applications. However, as performance is the major reason for employing parallelism, programmability and high performance must be delivered together. Using a suite of seven challenging irregular applications, the mature Illinois Concert system (a high-level concurrent object-oriented programming model), and an aggressive implementation (whole program compilation plus microsecond threading and communication primitives in the runtime), we evaluate what programming efforts are required to achieve high performance. For all seven applications, we achieve performance comparable to the best achievable via low-level programming means on large-scale parallel systems. In general, a high-level concurrent object-oriented programming model supported by aggressive implementation techniques can eliminate programmer management of many concerns: procedure and computation granularity, namespace management, and low-level concurrency management. Our study indicates that these concerns are fully automated for these applications. Decoupling these concerns makes managing the remaining fundamental concerns, data locality and load balance, much easier. In several cases, data locality and load balance for the complex algorithm and pointer data structures are automatically managed by the compiler and runtime, but in general programmer intervention was required. In a few cases, more detailed control is required, specifically explicit task priority, data consistency, and task placement. Our system integrates the expression of such information cleanly into the programming interface. Finally, only small changes to the sequential code were required to express concurrency and performance optimizations: less than 5 per cent of the source code lines were changed in all cases. This bodes well for supporting both sequential and parallel performance in a single code base. © 1998 John Wiley & Sons, Ltd.

12.
We present the design and implementation of a parallel algorithm for computing Gröbner bases on distributed memory multiprocessors. The parallel algorithm is irregular both in space and time: the data structures are dynamic pointer-based structures and the computations on the structures have unpredictable duration. The algorithm is presented as a series of refinements on a transition rule program, in which computation proceeds by nondeterministic invocations of guarded commands. Two key data structures, a set and a priority queue, are distributed across processors in the parallel algorithm. The data structures are designed for high throughput and latency tolerance, as appropriate for distributed memory machines. The programming style represents a compromise between shared-memory and message-passing models. The distributed nature of the data structures shows through their interface in that the semantics are weaker than with shared atomic objects, but they still provide a shared abstraction that can be used for reasoning about program correctness. In the data structure design there is a classic trade-off between locality and load balance. We argue that this is best solved by designing scheduling structures in tandem with the state data structures, since the decision to replicate or partition state affects the overhead of dynamically moving tasks. This work was supported in part by the Advanced Research Projects Agency of the Department of Defense monitored by the Office of Naval Research under contract DABT63-92-C-0026, by AT&T, and by the National Science Foundation through an Infrastructure Grant (number CDA-8722788) and a Research Initiation Award (number CCR-9210260). The information presented here does not necessarily reflect the position or the policy of the Government and no official endorsement should be inferred.

13.
Management of program data to improve data locality and reduce false sharing is critical for scaling performance on NUMA shared memory multiprocessors. We use HPF-like data decomposition directives to partition and place arrays in data-parallel applications on Hector, a shared-memory NUMA multiprocessor. We describe a compiler system for automating the partitioning and placement of arrays. The compiler exploits Hector's shared memory architecture to efficiently implement distributed arrays. Experimental results from a prototype implementation demonstrate the effectiveness of these techniques. They also demonstrate the magnitude of the performance improvement attainable when our compiler-based data management schemes are used instead of operating system data management policies; performance improves by up to a factor of 5.

14.
Indexing schemes for grids based on space-filling curves (e.g., Hilbert curves) find applications in numerous fields, ranging from parallel processing over data structures to image processing. Because of an increasing interest in discrete multidimensional spaces, indexing schemes for them have won considerable interest. Hilbert curves are the simplest and most popular space-filling indexing schemes. We extend the concept of curves with Hilbert property to arbitrary dimensions and present first results concerning their structural analysis that also simplify their applicability. We define and analyze in a precise mathematical way r-dimensional Hilbert curves for arbitrary r ≥ 2. Moreover, we generalize and simplify previous work and clarify the concept of Hilbert curves for multidimensional grids. As we show, curves with Hilbert property can be completely described and analyzed by "generating elements of order 1," thus, in comparison with previous work, reducing their structural complexity decisively. Whereas there is basically one Hilbert curve in the two-dimensional world, our analysis shows that there are 1536 structurally different simple three-dimensional Hilbert curves. Further results include generalizations of locality results for multidimensional indexings and an easy recursive computation scheme for multidimensional curves with Hilbert property. In addition, our formalism lays the groundwork for potential mechanized analysis of locality properties of multidimensional Hilbert curves. Received April 14, 1999, and in final form March 7, 2000.
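As a point of reference for the two-dimensional case only, the sketch below uses the widely known bit-manipulation conversion between a Hilbert index and grid coordinates. It does not reproduce the r-dimensional construction or the "generating elements" formalism from the abstract; the function names are assumptions, and the grid side `n` is assumed to be a power of two.

```python
def _rotate(n, x, y, rx, ry):
    """Rotate/flip a quadrant so the sub-curve has the right orientation."""
    if ry == 0:
        if rx == 1:
            x, y = n - 1 - x, n - 1 - y
        x, y = y, x
    return x, y

def index_to_xy(n, d):
    """Map Hilbert index d (0 <= d < n*n) to (x, y) on an n x n grid."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        x, y = _rotate(s, x, y, rx, ry)
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def xy_to_index(n, x, y):
    """Inverse mapping: grid cell (x, y) to its Hilbert index."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        x, y = _rotate(n, x, y, rx, ry)
        s //= 2
    return d
```

The locality property that makes such indexings useful is visible directly: consecutive indices always map to grid cells that are one step apart, so `index_to_xy(4, 3)` and `index_to_xy(4, 4)` give the neighbouring cells `(0, 1)` and `(0, 2)`.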

15.
An efficient approach to mining indirect associations
Discovering association rules is one of the important tasks in data mining. While most of the existing algorithms are developed for efficient mining of frequent patterns, it has been noted recently that some of the infrequent patterns, such as indirect associations, provide useful insight into the data. In this paper, we propose an efficient algorithm, called HI-mine, based on a new data structure, called HI-struct, for mining the complete set of indirect associations between items. Our experimental results show that HI-mine's performance is significantly better than that of the previously developed algorithm for mining indirect associations on both synthetic and real world data sets over practical ranges of support specifications.

16.
Lee, Stolfo, and Mok [1] previously reported the use of association rules and frequency episodes for mining audit data to gain knowledge for intrusion detection. The integration of association rules and frequency episodes with fuzzy logic can produce more abstract and flexible patterns for intrusion detection, since many quantitative features are involved in intrusion detection and security itself is fuzzy. We present a modification of a previously reported algorithm for mining fuzzy association rules, define the concept of fuzzy frequency episodes, and present an original algorithm for mining fuzzy frequency episodes. We add a normalization step to the procedure for mining fuzzy association rules in order to prevent one data instance from contributing more than others. We also modify the procedure for mining frequency episodes to learn fuzzy frequency episodes. Experimental results show the utility of fuzzy association rules and fuzzy frequency episodes for intrusion detection. © 2000 John Wiley & Sons, Inc.

17.
The purpose of this paper is to highlight the performance issues of matrix transposition algorithms for large matrices that relate to the Translation Lookaside Buffer (TLB) cache. Existing optimisation techniques such as coalesced access and the use of shared memory, necessary and beneficial as they are, are not sufficient to neutralise the problem. As the data problem size increases, these optimisations do not exploit data locality effectively enough to counteract the detrimental effects of TLB cache misses. We propose a new optimisation technique that counteracts the performance degradation of these algorithms and seamlessly complements current optimisations. Our optimisation is based on detailed analysis of enumeration schemes that can be applied to either individual matrix entries or blocks (sub-matrices). The key advantage of these enumeration schemes is that they do not incur matrix storage format conversion because they operate on canonical matrix layouts. In addition, several cache-efficient matrix transposition algorithms based on enumeration schemes are offered: an improved version of the in-place algorithm for square matrices, an out-of-place algorithm for rectangular matrices, and two 3D involutions. We demonstrate that the choice of the enumeration schemes and their parametrisation can have a direct and significant impact on the algorithm’s memory access pattern. Our in-place version of the algorithm delivers up to 100% performance improvement over the existing optimisation techniques. Meanwhile, for the out-of-place version we observe up to 300% performance gain over NVIDIA's algorithm. We also offer improved versions of two involution transpositions for 3D matrices that can achieve a performance increase of up to 300%. To the best of our knowledge, this is the first effective attempt to control the logical-to-physical block association through the design of enumeration schemes in the context of matrix transposition.
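The idea of applying an enumeration scheme to blocks of a matrix held in its canonical layout can be illustrated with a small sketch. Everything here is an assumption made for illustration: the code is plain sequential Python rather than GPU code, the function names are invented, and the `diagonal_blocks` order is a hypothetical alternative, not one of the schemes analysed in the paper. The point is only that the block visiting order is a separate, swappable choice and that no storage format conversion is involved.

```python
def row_major_blocks(nbi, nbj):
    """Visit blocks row by row (the default enumeration)."""
    return [(bi, bj) for bi in range(nbi) for bj in range(nbj)]

def diagonal_blocks(nbi, nbj):
    """Hypothetical alternative: visit blocks along wrapped diagonals."""
    return [(bi, (bi + k) % nbj) for k in range(nbj) for bi in range(nbi)]

def transpose_blocked(a, rows, cols, tile=4, enumerate_blocks=row_major_blocks):
    """Out-of-place transpose of a row-major rows x cols matrix stored in a flat list."""
    out = [0] * (rows * cols)
    nbi = -(-rows // tile)  # ceiling division: number of block rows
    nbj = -(-cols // tile)  # number of block columns
    for bi, bj in enumerate_blocks(nbi, nbj):
        i0, j0 = bi * tile, bj * tile
        for i in range(i0, min(i0 + tile, rows)):
            for j in range(j0, min(j0 + tile, cols)):
                # Element (i, j) of the input becomes (j, i) of the output.
                out[j * rows + i] = a[i * cols + j]
    return out
```

Both enumerations produce the same transposed matrix; they differ only in the order in which memory is touched, which is the property the abstract says determines the memory access pattern.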

18.
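GPGPU accelerators are currently the mainstream platform for improving the performance of image processing algorithms. On a GPGPU platform, however, a version of a program optimized to fully exploit hardware architecture and software characteristics can differ in performance by orders of magnitude from a naive implementation of the same program. GPGPU accelerators have a large number of execution threads organized in multiple dimensions and levels, together with a hierarchical memory architecture whose levels differ in capacity, bandwidth, latency, and access permissions. At the same time, image processing applications have complex computational operations, boundary handling rules, and data access characteristics. Consequently, the concurrent execution mode of tasks, the organization of threads, and the mapping of concurrent tasks to the device affect not only program properties such as degree of concurrency, scheduling, communication, and synchronization, but also memory access bandwidth and latency. Program optimization on GPGPU platforms is therefore a difficult, complex, and inefficient process. This paper proposes ParaC, a domain-specific programming model based on language extensions. The ParaC programming environment uses the program semantics described through high-level language extensions to automatically analyze and extract program characteristics such as the application's operation information, data reuse among concurrent tasks, and memory access information. Combining these with the characteristics of the hardware platform, it uses a compilation optimization model driven by domain prior knowledge to automatically generate optimized code for the GPGPU platform, and finally uses a source-to-source compiler to generate a standard OpenCL program. Experimental results on the test cases show that the optimized versions automatically generated by ParaC on the GPGPU platform achieve a speedup of up to 3.22x relative to hand-optimized versions, while the number of lines of code is only 1.2% to 39.68% of the latter.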

19.
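Most existing distributed parallel reasoning algorithms for RDF data need to launch multiple MapReduce jobs, and some are inefficient at reasoning over OWL rules whose antecedents contain multiple instance triples, so their overall reasoning efficiency is low. To address these problems, this paper proposes a Spark-based distributed parallel reasoning algorithm combined with TREAT (DPRS). The algorithm first uses the ontology of the RDF data to construct the alpha registers corresponding to the schema triples and a rule marking model. In the OWL reasoning phase, it implements the alpha phase of the TREAT algorithm with MapReduce. It then deduplicates the reasoning results, completing one round of reasoning over all OWL rules. Experiments show that the DPRS algorithm can perform parallel reasoning over large-scale data efficiently and correctly.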

20.
《Parallel Computing》1997,23(7):873-885
Ray tracing is a powerful technique to generate realistic images of 3D scenes. However, rendering complex scenes may easily exceed the processing and memory capabilities of a single workstation. Distributed processing offers a solution if the algorithm can be parallelized in an efficient way. In this paper a hybrid scheduling approach is presented that combines demand driven and data parallel techniques. Whether a task is processed demand driven or data parallel is decided by the data intensity of the task and the amount of data locality (coherence) that will be present in the task. By combining demand driven and data driven tasks, a better load balance may be achieved, while at the same time the communication is spread evenly across the network. This leads to a scalable and efficient parallel implementation of the ray tracing algorithm with little restriction on the size of the model data base to be rendered.
