首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
This paper presents a unified framework that optimizes out-of-core programs by exploiting locality and parallelism, and reducing communication overhead. For out-of-core problems where the data set sizes far exceed the size of the available in-core memory, it is particularly important to exploit the memory hierarchy by optimizing the I/O accesses. We present algorithms that consider both iteration space (loop) and data space (file layout) transformations in a unified framework. We show that the performance of an out-of-core loop nest containing references to out-of-core arrays can be improved by using a suitable combination of file layout choices and loop restructuring transformations. Our approach considers array references one-by-one and attempts to optimize each reference for parallelism and locality. When there are references for which parallelism optimizations do not work, communication is vectorized so that data transfer can be performed before the innermost loop. Results from hand-compiles on IBM SP-2 and Inter Paragon distributed-memory message-passing architectures show that this approach reduces the execution times and improves the overall speedups. In addition, we extend the base algorithm to work with file layout constraints and show how it is useful for optimizing programs that consist of multiple loop nests  相似文献   

2.
《Parallel Computing》2014,40(10):754-767
The processing of massive amounts of data on clusters with finite amount of memory has become an important problem facing the parallel/distributed computing community. While MapReduce-style technologies provide an effective means for addressing various problems that fit within the MapReduce paradigm, there are many classes of problems for which this paradigm is ill-suited. In this paper we present a runtime system for traditional MPI programs that enables the efficient and transparent out-of-core execution of distributed-memory parallel programs. This system, called BDMPI,1 leverages the semantics of MPI’s API to orchestrate the execution of a large number of MPI processes on much fewer compute nodes, so that the running processes maximize the amount of computation that they perform with the data fetched from the disk. BDMPI enables the development of efficient out-of-core parallel distributed memory codes without the high engineering and algorithmic complexities associated with multiple levels of blocking. BDMPI achieves significantly better performance than existing technologies on a single node as well as on a small cluster, and performs within 30% of optimized out-of-core implementations.  相似文献   

3.
核外计算中的几种I/O优化方法   总被引:1,自引:0,他引:1  
大数据量应用问题引入核外计算模式,由于访问磁盘数据的速度比较慢,I/O成为核外计算性能重要的限制因素.提出了一种使用运行库进行I/O优化的方法,给出了3种有效的优化策略:规则区域筛选、数据预取和边缘重用.编程人员可针对不同的应用问题使用相应的优化API来缩短程序执行时间.实验结果表明,通过减少I/O操作次数和内外存交换的数据量以及隐藏部分I/O操作延迟,有效提高了核外计算的性能.  相似文献   

4.
Image population analysis is the class of statistical methods that plays a central role in understanding the development, evolution, and disease of a population. However, these techniques often require excessive computational power and memory that are compounded with a large number of volumetric inputs. Restricted access to supercomputing power limits its influence in general research and practical applications. In this paper we introduce ISP, an Image-Set Processing streaming framework that harnesses the processing power of commodity heterogeneous CPU/GPU systems and attempts to solve this computational problem. In ISP, we introduce specially designed streaming algorithms and data structures that provide an optimal solution for out-of-core multiimage processing problems both in terms of memory usage and computational efficiency. ISP makes use of the asynchronous execution mechanism supported by parallel heterogeneous systems to efficiently hide the inherent latency of the processing pipeline of out-of-core approaches. Consequently, with computationally intensive problems, the ISP out-of-core solution can achieve the same performance as the in-core solution. We demonstrate the efficiency of the ISP framework on synthetic and real datasets.  相似文献   

5.
We propose a storage efficient, fast and parallelizable out-of-core framework for streaming computations of high resolution level sets. The fundamental techniques are skewing and tiling transformations of streamed level set computations which allow for the combination of interface propagation, re-normalization and narrow-band rebuild into a single pass over the data stored on disk. When combined with a new data layout on disk, this improves the overall performance when compared to previous streaming level set frameworks that require multiple passes over the data for each time-step. As a result, streaming level set computations are now CPU bound and consequently the overall performance is unaffected by disk latency and bandwidth limitations. We demonstrate this with several benchmark tests that show sustained out-of-core throughputs close to that of in-core level set simulations.  相似文献   

6.
In this paper, we present a framework for synthesizing I/O efficient out-of-core programs for block recursive algorithms, such as the fast Fourier transform (FFT) and block matrix transposition algorithms. Our framework uses an algebraic representation which is based on tensor products and other matrix operations. The programs are optimized for the striped Vitter and Shriver's two-level memory model in which data can be distributed using various cyclic(B) distributions in contrast to the normally used physical track distribution cyclic(Bd ), where Bd is the physical disk block size. We first introduce tensor bases to capture the semantics of block-cyclic data distributions of out-of-core data and also data access patterns to out-of-core data. We then present program generation techniques for tensor products and matrix transposition. We accurately represent the number of parallel I/O operations required for the synthesized programs for tensor products and matrix transposition as a function of tensor bases and data distributions. We introduce an algorithm to determine the data distribution which optimizes the performance of the synthesized programs. Further, we formalize the procedure of synthesizing efficient out-of-core programs for tensor product formulas with various block-cyclic distributions as a dynamic programming problem. We demonstrate the effectiveness of our approach through several examples. We show that the choice of an appropriate data distribution can reduce the number of passes to access out-of-core data by as large as eight times for a tensor product and the dynamic programming approach can largely reduce the number of passes to access out-of-core data for the overall tensor product formulas  相似文献   

7.
Our goal is to develop a robust out-of-core sorting program for a distributed-memory cluster. The literature contains two dominant paradigms for out-of-core sorting algorithms: merging-based and partitioning-based. We explore a third paradigm, that of oblivious algorithms. Unlike the two dominant paradigms, oblivious algorithms do not depend on the input keys and therefore lead to predetermined I/O and communication patterns in an out-of-core setting. Predetermined I/O and communication patterns facilitate overlapping I/O, communication, and computation for efficient implementation. We have developed several out-of-core sorting programs using the paradigm of oblivious algorithms. Our baseline implementation, 3-pass columnsort, was based on Leighton's columnsort algorithm. Though efficient in terms of I/O and communication, 3-pass columnsort has a restriction on the maximum problem size. As our first effort toward relaxing this restriction, we developed two implementations: subblock columnsort and M-columnsort. Both of these implementations incur substantial performance costs: subblock columnsort performs additional disk I/O, and M-columnsort needs substantial amounts of extra communication and computation. In this paper we present slabpose columnsort, a new oblivious algorithm that we have designed explicitly for the out-of-core setting. Slabpose columnsort relaxes the problem-size restriction at no extra I/O or communication cost. Experimental evidence on a Beowulf cluster shows that unlike subblock columnsort and M-columnsort, slabpose columnsort runs almost as fast as 3-pass columnsort. To the best of our knowledge, our implementations are the first out-of-core multiprocessor sorting algorithms that make no assumptions about the keys and produce output that is perfectly load balanced and in the striped order assumed by the Parallel Disk Model.  相似文献   

8.
We present a simple out-of-core algorithm for computing the Fast-Fourier Transform (FFT) needed to determine the two-dimensional potential of surface crystals with large-scale features, like faults, at ultra-high resolution, with around 109 grid points. This algorithm represents a proof of concept that a simple and easy-to-code, out-of-core algorithm can be easily implemented and used to solve large-scale problems on low-cost hardware. The main novelties of our algorithm are: (1) elapsed and I/O times decrease with the number of single records (lines) being read; (2) only basic reading and writing routines is necessary for making the out-of-core access. Our method can be easily extended to 3D and be applied to many grand-challenge problems in science and engineering, such as fluid dynamics.  相似文献   

9.
The block-cyclic data distribution is commonly used to organize array elements over the processors of a coarse-grained distributed memory parallel computer. In many scientific applications, the data layout must be reorganized at run-time in order to enhance locality and reduce remote memory access overheads. In this paper we present a general framework for developing array redistribution algorithms. Using this framework, we have developed efficient algorithms that redistribute an array from one block-cyclic layout to another. Block-cyclic redistribution consists of index set computation , wherein the destination locations for individual data blocks are calculated, and data communication , wherein these blocks are exchanged between processors. The framework treats both these operations in a uniform and integrated way. We have developed efficient and distributed algorithms for index set computation that do not require any interprocessor communication. To perform data communication in a conflict-free manner, we have developed direct indirect and hybrid algorithms. In the direct algorithm, a data block is transferred directly to its destination processor. In an indirect algorithm, data blocks are moved from source to destination processors through intermediate relay processors. The hybrid algorithm is a combination of the direct and indirect algorithms. Our framework is based on a generalized circulant matrix formalism of the redistribution problem and a general purpose distributed memory model of the parallel machine. Our algorithms sustain excellent performance over a wide range of problem and machine parameters. We have implemented our algorithms using MPI, to allow for easy portability across different HPC platforms. Experimental results on the IBM SP-2 and the Cray T3D show superior performance over previous approaches. When the block size of the cyclic data layout changes by a factor of K , the redistribution can be performed in O( log K) communication steps. This is true even when K is a prime number. In contrast, previous approaches take O(K) communication steps for redistribution. Our framework can be used for developing scalable redistribution libraries, for efficiently implementing parallelizing compiler directives, and for developing parallel algorithms for various applications. Redistribution algorithms are especially useful in signal processing applications, where the data access patterns change significantly between computational phases. They are also necessary in linear algebra programs, to perform matrix transpose operations. Received June 1, 1997; revised March 10, 1998.  相似文献   

10.
基于Marching Cubes重组的外存模型渐进压缩   总被引:7,自引:0,他引:7  
刘迎  蔡康颖  王文成  吴恩华 《计算机学报》2004,27(11):1457-1463
外存模型是指其规模远远超出内存容量的海量模型.为提高其存储、传输、显示等操作的效率,对外存模型进行渐进式的压缩是非常重要的.但当前已有的外存模型压缩算法都是单一层次的,不能做到渐进压缩.为此,该文提出一种针对外存模型的渐进压缩方法.能高效地压缩外存模型,并进行多分辨率的传输和显示.该方法首先将外存模型的包围盒空间按照八叉树形式进行剖分和层次化组织,使得最精细层次的各个立方块空间中的局部模型都能完全装入内存进行处理;然后,在各个立方块中对局部的模型进行基于Marching Cubes方式的重新拟合,并在此基础上建立各个局部的自适应八叉树;最后,基于各个局部自适应的八叉树,由粗至细渐进地遍历全局自适应八叉树的各个节点,并利用对内存模型能高效渐进压缩编码的先进方法进行编码压缩.实验表明,该方法对外存模型的压缩比达到了与处理内存模型相似的压缩比.高于目前的外存模型压缩方法,是第一个能渐进压缩外存模型的方法.  相似文献   

11.
The Array Management System (AMS) is an integrated set of array management tools designed to increase the productivity of technical programmers engaged in intensive matrix computational applications. These include analog circuit simulator, statistical analysis, dense or sparse equation solving, simulation, and in particular, the finite element program development. AMS is composed of a set of easy-to-use in-core and out-of-core data management subroutines written in FORTRAN 77. The in-core array management subroutines of AMS allow dynamic storage allocation to be accomplished with integer, real, and complex data with a minimum of programming effort. The out-of-core array management subroutines of AMS support simple operations to allow array transfer between in-core and out-of-core systems and allow different programs to access the same data. The out-of-core data management provides for a direct access database file to speed up the input/output operations. Multiple databases are allowed to be accessed by a program; this provides an easy way to share data and restart. This integrated database environment is suitable to be the kernel of a software project with several programmers and data communications among them.  相似文献   

12.
For large time-varying data sets, memory and disk limitations can lower the performance of visualization applications. Algorithms and data structures must be explicitly designed to handle these data sets in order to achieve more interactive rates. The Temporal Branch-on-Need Octree (T-BON) extends the three-dimensional branch-on-need octree for time-varying isosurface extraction. This data structure minimizes the impact of the I/O bottleneck by reading from disk only those portions of the search structure and data necessary to construct the current isosurface. By performing a minimum of I/O and exploiting the hierarchical memory found in modern CPUs, the T-BON algorithm achieves high performance isosurface extraction in time-varying fields. The paper extends earlier work on the T-BON data structure by including techniques for better memory utilization, out-of-core isosurface extraction, and support for nonrectilinear grids. Results from testing the T-BON algorithm on large data sets show that its performance is similar to that of the three-dimensional branch-on-need octree for static data sets while providing substantial advantages for time varying fields  相似文献   

13.
This work presents an optimization of MPI communications, called Dynamic-CoMPI, which uses two techniques in order to reduce the impact of communications and non-contiguous I/O requests in parallel applications. These techniques are independent of the application and complementaries to each other. The first technique is an optimization of the Two-Phase collective I/O technique from ROMIO, called Locality aware strategy for Two-Phase I/O (LA-Two-Phase I/O). In order to increase the locality of the file accesses, LA-Two-Phase I/O employs the Linear Assignment Problem (LAP) for finding an optimal I/O data communication schedule. The main purpose of this technique is the reduction of the number of communications involved in the I/O collective operation. The second technique, called Adaptive-CoMPI, is based on run-time compression of MPI messages exchanged by applications. Both techniques can be applied on every application, because both of them are transparent for the users. Dynamic-CoMPI has been validated by using several MPI benchmarks and real HPC applications. The results show that, for many of the considered scenarios, important reductions in the execution time are achieved by reducing the size and the number of the messages. Additional benefits of our approach are the reduction of the total communication time and the network contention, thus enhancing, not only performance, but also scalability.  相似文献   

14.
提出一种适于大规模地形绘制的光滑调度算法.采用四叉树分块结构和Z 型填充曲线组织地形数据,利用地形数据的局部连续性提高调度效率;设计一种内存空间分配算法调度地形数据,实现对恒定帧速率绘制算法的支持;通过可控调度区实现调度优先级计算,在内存空间需求和调度时间需求之间取得平衡;采取预估调度的策略实现平滑调度并有效减少绘制中的缺块现象.算法实现了地形场景漫游中的数据平滑调度,有效地避免了因内外存大数据量交换而引起的显示延迟和跳跃现象.实验结果表明,利用平滑调度算法,数据调度的准确性和稳定性有了较大提高,在地形漫  相似文献   

15.
核外计算是解决计算机内存不足的一种有效方法。该文面向分布式系统提出了核外存储模型,采用面向对象的方法,在工作站机群系统上实现了处理核外数组的编程接口,并且通过采取对局部数组文件分布进行优化、自适应的数据筛选和数据预取等方法,进一步提高了核外计算性能。  相似文献   

16.
Communication support for distributed collaborative applications   总被引:1,自引:0,他引:1  
The development of distributed, multimedia, collaborative applications requires the resolution of communication issues such as concurrency control and temporal and causal synchronization of traffic over related data streams. Existing transport and/or session-layer protocols do not include the desired support for multistream, multipoint communication. In this paper, we propose new communication abstractions and mechanisms that facilitate the implementation of the necessary coordination and concurrency control semantics in a collaborative application. We propose a protocol suite called themultiflow conversation protocol (MCP) for the realization of these abstractions and describe its prototype implementation in an internetwork of workstations. The paper also describes our experience with the prototype and results of a performance evaluation.  相似文献   

17.
Streaming simplification of tetrahedral meshes   总被引:1,自引:0,他引:1  
Unstructured tetrahedral meshes are commonly used in scientific computing to represent scalar, vector, and tensor fields in three dimensions. Visualization of these meshes can be difficult to perform interactively due to their size and complexity. By reducing the size of the data, we can accomplish real-time visualization necessary for scientific analysis. We propose a two-step approach for streaming simplification of large tetrahedral meshes. Our algorithm arranges the data on disk in a streaming, I/O-efficient format that allows coherent access to the tetrahedral cells. A quadric-based simplification is sequentially performed on small portions of the mesh in-core. Our output is a coherent streaming mesh which facilitates future processing. Our technique is fast, produces high quality approximations, and operates out-of-core to process meshes too large for main memory  相似文献   

18.
Abstract. Blockwise access to data is a central theme in the design of efficient external memory (EM) algorithms. A second important issue, when more than one disk is present, is fully parallel disk I/ O. In this paper we present a simple, deterministic simulation technique which transforms certain Bulk Synchronous Parallel (BSP) algorithms into efficient parallel EM algorithms. It optimizes blockwise data access and parallel disk I/ O and, at the same time, utilizes multiple processors connected via a communication network or shared memory. We obtain new improved parallel EM algorithms for a large number of problems including sorting, permutation, matrix transpose, several geometric and GIS problems including three-dimensional convex hulls (two-dimensional Voronoi diagrams), and various graph problems. We show that certain parallel algorithms known for the BSP model can be used to obtain EM algorithms that meet well known I /O complexity lower bounds for various problems, including sorting.  相似文献   

19.
This paper introduces a model for parallel computation, called thedistributed randomaccess machine (DRAM), in which the communication requirements of parallel algorithms can be evaluated. A DRAM is an abstraction of a parallel computer in which memory accesses are implemented by routing messages through a communication network. A DRAM explicitly models the congestion of messages across cuts of the network.We introduce the notion of aconservative algorithm as one whose communication requirements at each step can be bounded by the congestion of pointers of the input data structure across cuts of a DRAM. We give a simple lemma that shows how to shortcut pointers in a data structure so that remote processors can communicate without causing undue congestion. We giveO(lgn)-step, linear-processor, linear-space, conservative algorithms for a variety of problems onn-node trees, such as computing treewalk numberings, finding the separator of a tree, and evaluating all subexpressions in an expression tree. We giveO(lg2 n)-step, linear-processor, linear-space, conservative algorithms for problems on graphs of sizen, including finding a minimum-cost spanning forest, computing biconnected components, and constructing an Eulerian cycle. Most of these algorithms use as a subroutine a generalization of the prefix computation to trees. We show that any suchtreefix computation can be performed inO(lgn) steps using a conservative variant of Miller and Reif's tree-contraction technique.This research was supported in part by the Defense Advanced Research Projects Agency under Contract N00014-80-C-0622 and by the Office of Naval Research under Contract N00014-86-K-0593. Charles Leiserson is supported in part by an NSF Presidential Young Investigator Award with matching funds provided by AT&T Bell Laboratories and Xerox Corporation. Bruce Maggs is supported in part by an NSF Fellowship.  相似文献   

20.
基于共享虚拟存储(shared virtual memory,SVM)PC机群的大规模并行地理图像处理原型系统ParGIP(parallel geographical image processing)采用Client-Server计算模型,通过软件分布式共享存储(software distributed shared memory,software DSM)中间层将PC机群组织成一个逻辑上共享的内存的并行计算平台,地理图像处理可以充分利用ParGIP提供的大共享内存和并行处理能力来提高性能,缩短处理周期,从而解决传统单机串行方式下地理图像处理中内存匮乏和计算能力不足的问题,ParGIP还进一步将机群中各个结点上分布的磁盘组织起来,提供地理影像库所需的海量存储空间和并行I/O能力,测试结果表明,ParGIP的8机并行I/O带宽达到102.6MB/s,典型的图像处理算法获得了接近线性的加速比。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号