期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Effectively utilizing global cluster memory for large data-intensive parallel programs

Oleszkiewicz J. Xiao L. Yunhao Liu 《Parallel and Distributed Systems, IEEE Transactions on》2006,17(1):66-77

Large scientific parallel applications demand large amounts of memory space. Current parallel computing platforms schedule jobs without fully knowing their memory requirements. This leads to uneven memory allocation in which some nodes are overloaded. This, in turn, leads to disk paging, which is extremely expensive in the context of scientific parallel computing. To solve this problem, we propose a new peer-to-peer solution called parallel network RAM. This approach avoids the use of disk, better utilizes available RAM resources, and will allow larger problems to be solved while reducing the computational, communication, and synchronization overhead typically involved in parallel applications. We proposed several different parallel network RAM designs and evaluated the performance of each under different conditions. We discovered that different designs are appropriate in different situations. 相似文献

2.

Communication‐aware thread mapping using the translation lookaside buffer

Eduardo H. M. Cruz Matthias Diener Philippe O. A. Navaux 《Concurrency and Computation》2015,27(17):4970-4992

Threads of parallel applications need to communicate in order to fulfill their tasks. The communication performance between the cores in modern multi‐core architectures differs because of the memory and interconnection hierarchies. In these architectures, it is important to map the threads of parallel applications by taking into account the communication between them, to improve their performance and energy consumption. In parallel applications based on shared memory, communication is implicit, which makes it difficult to detect the communication pattern between the threads. In this paper, we introduce a new lightweight mechanism to detect the communication pattern between threads of shared memory applications using the translation lookaside buffer. Our mechanism relies on hardware features, which make it transparent to the programmer and allow the detection to be performed by the operating system during the execution of the application. We also developed a heuristic mapping algorithm that uses the detected pattern to dynamically map the threads to cores. Experiments were performed with applications from the NAS‐OMP and PARSEC parallel benchmark suites in a simulated machine as well as a real machine. Results show that our mechanism can substantially improve parallel application performance, as well as processor and DRAM energy consumption. Copyright © 2015 John Wiley & Sons, Ltd. 相似文献

3.

A general purpose subroutine for fast fourier transform on a distributed memory parallel machine

A. Dubey M. Zubair C. E. Grosch 《Parallel Computing》1994,20(12)

One issue which is central in developing a general purpose FFT subroutine on a distributed memory parallel machine is the data distribution. It is possible that different users would like to use the FFT routine with different data distributions. Thus there is a need to design FFT schemes on distributed memory parallel machines which can support a variety of data distributions. In this paper we present an FFT implementation on a distributed memory parallel machine which works for a number of data distributions commonly encountered in scientific applications. We have also addressed the problem of rearranging the data after computing the FFT. We have evaluated the performance of our implementation on a distributed memory parallel machine, the Intel iPSC/860. 相似文献

4.

Analyzing time-dimension communication characterizations for representative scientific applications on supercomputer systems

Juan CHEN Wenhao ZHOU Yong DONG Zhiyuan WANG Chen CUI Feihao WU Enqiang ZHOU Yuhua TANG 《Frontiers of Computer Science》2019,13(6):1228

Exascale computing is one of the major challenges of this decade, and several studies have shown that communications are becoming one of the bottlenecks for scaling parallel applications. The analysis on the characteristics of communications can effectively aid to improve the performance of scientific applications. In this paper, we focus on the statistical regularity in time-dimension communication characteristics for representative scientific applications on supercomputer systems, and then prove that the distribution of communication-event intervals has a power-law decay, which is common in scientific interests and human activities. We verify the distribution of communication-event intervals has really a power-lawdecay on the Tianhe-2 supercomputer, and also on the other six parallel systems with three different network topologies and two routing policies. In order to do a quantitative study on the power-law distribution, we exploit two groups of statistics: bursty vs. memory and periodicity vs. dispersion. Our results indicate that the communication events show a “strong-bursty and weak-memory” characteristic and the communication event intervals show the periodicity and the dispersion. Finally, our research provides an insight into the relationship between communication optimizations and time-dimension communication characteristics. 相似文献

5.

Parallel processing of adaptive meshes with load balancing

Das S.K. Harvey D.J. Biswas R. 《Parallel and Distributed Systems, IEEE Transactions on》2001,12(12):1269-1280

Many scientific applications involve grids that lack a uniform underlying structure. These applications are often also dynamic in nature in that the grid structure significantly changes between successive phases of execution. In parallel computing environments, mesh adaptation of unstructured grids through selective refinement/coarsening has proven to be an effective approach. However, achieving load balance while minimizing interprocessor communication and redistribution costs is a difficult problem. Traditional dynamic load balancers are mostly inadequate because they lack a global view of system loads across processors. In this paper, we propose a novel and general-purpose load balancer that utilizes symmetric broadcast networks (SBN) as the underlying communication topology and compare its performance with a successful global load balancing environment, called PLUM, specifically created to handle adaptive unstructured applications. Our experimental results on an IBM SP2 demonstrate that the SBN-based load balancer achieves lower redistribution costs than that under PLUM by overlapping processing and data migration 相似文献

6.

Image processing applications performance study on Cell BE and Blue Gene/L

Ali A. El‐Moursy Fadi N. Sibai 《Concurrency and Computation》2011,23(4):351-371

Two image processing applications, edge detection and image resizing, are studied in this paper on two HPC platforms namely the Cell BE and the Blue Gene/L machines. In this paper we focus on the performance scalability of the studied applications. Our results show that the scale of the problem to be solved highly affects the fitness of the platform. If the data set size is to fit into the Cell core, the fast on‐chip inter‐core communication of a multi‐core system pays back for its high technology design. On the other hand, the overhead of the distant communication in the massively parallel Blue Gene/L machine will only show its benefits for huge data set size that otherwise mandates multiple round‐trip data communications between the local memory of a core and main memory. Copyright © 2010 John Wiley & Sons, Ltd. 相似文献

7.

An introduction to parallel rendering

《Parallel Computing》1997,23(7):819-843

相似文献

8.

Compile-time optimization of near-neighbor communication for scalable shared-memory multiprocessors

David E. Hudak Santosh G. Abraham 《Journal of Parallel and Distributed Computing》1992,15(4)

Scalable shared-memory multiprocessor systems are typically NUMA (nonuniform memory access) machines, where the exploitation of the memory hierarchy is critical to achieving high performance. Iterative data parallel loops with near-neighbor communication account for many important numerical applications. In such loops, the communication of partial results stresses the memory system performance. In this paper, we develop data placement schemes that minimize communication time where the near-neighbor interaction is determined by a stencil. Under a given loop partition, our compile-time algorithm partitions global data into four classes for each processor, with each class requiring specific consistency maintenance requirements. The ADAPT (Automatic Data Allocation and Partitioning Tool) system was implemented to automatically partition parallel code segments for the BBN TC2000, a scalable shared-memory multiprocessor. ADAPT caches global arrays and maintains data consistency in software through instructions that flush data from private caches. Restructuring of a fluid flow code segment by ADAPT improved performance by a factor of more than 3 on the BBN TC2000. Features in current generation pipelined processors with multiple functional units permit the overlap of memory accesses with computation. Our experiments on the BBN TC2000 show that the degree of overlap is limited by architectural parameters, such as the number of CPU registers. 相似文献

9.

Detecting subgraph isomorphism with MapReduce

Péter Fehér Márk Asztalos Tamás Vajk Tamás Mészáros László Lengyel 《The Journal of supercomputing》2017,73(5):1810-1851

In recent years, the MapReduce framework has become one of the most popular parallel computing platforms for processing big data. MapReduce is used by companies such as Facebook, IBM, and Google to process or analyze massive data sets. Since the approach is frequently used for industrial solutions, the algorithms based on the MapReduce framework gained significant attention within the scientific community. The subgraph isomorphism is a fundamental graph theory problem. Finding small patterns in large graphs is a core challenge in the analysis of applications with big data sets. This paper introduces two novel algorithms, which are capable of finding matching patterns in arbitrary large graphs. The algorithms are designed for utilizing the easy parallelization technique offered by the MapReduce framework. The approaches are evaluated regarding their space and memory requirements. The paper also provides the applied data structure and presents formal analysis of the algorithms. 相似文献

10.

Efficient Algorithms for Block-Cyclic Redistribution of Arrays

Y. W. Lim P. B. Bhat V. K. Prasanna 《Algorithmica》1999,24(3-4):298-330

The block-cyclic data distribution is commonly used to organize array elements over the processors of a coarse-grained distributed memory parallel computer. In many scientific applications, the data layout must be reorganized at run-time in order to enhance locality and reduce remote memory access overheads. In this paper we present a general framework for developing array redistribution algorithms. Using this framework, we have developed efficient algorithms that redistribute an array from one block-cyclic layout to another. Block-cyclic redistribution consists of index set computation , wherein the destination locations for individual data blocks are calculated, and data communication , wherein these blocks are exchanged between processors. The framework treats both these operations in a uniform and integrated way. We have developed efficient and distributed algorithms for index set computation that do not require any interprocessor communication. To perform data communication in a conflict-free manner, we have developed direct indirect and hybrid algorithms. In the direct algorithm, a data block is transferred directly to its destination processor. In an indirect algorithm, data blocks are moved from source to destination processors through intermediate relay processors. The hybrid algorithm is a combination of the direct and indirect algorithms. Our framework is based on a generalized circulant matrix formalism of the redistribution problem and a general purpose distributed memory model of the parallel machine. Our algorithms sustain excellent performance over a wide range of problem and machine parameters. We have implemented our algorithms using MPI, to allow for easy portability across different HPC platforms. Experimental results on the IBM SP-2 and the Cray T3D show superior performance over previous approaches. When the block size of the cyclic data layout changes by a factor of K , the redistribution can be performed in O( log K) communication steps. This is true even when K is a prime number. In contrast, previous approaches take O(K) communication steps for redistribution. Our framework can be used for developing scalable redistribution libraries, for efficiently implementing parallelizing compiler directives, and for developing parallel algorithms for various applications. Redistribution algorithms are especially useful in signal processing applications, where the data access patterns change significantly between computational phases. They are also necessary in linear algebra programs, to perform matrix transpose operations. Received June 1, 1997; revised March 10, 1998. 相似文献

11.

Simulated Annealing Based Parallel State Assignment of Finite State Machines

《Journal of Parallel and Distributed Computing》1997,43(1):21-35

Simulated annealing is an effective tool in many optimization problems in VLSI CAD but its time requirements are prohibitive. In this paper, we report parallel algorithms for a well established simulated annealing based algorithm for the state assignment problem for finite state machines. Our parallel annealing strategy uses parallel moves by multiple processes, each performing local moves within its assigned subspace of the state encoding space. The novelty is in the dynamic repartitioning of the state space among processors, so that each processor gets to perform moves on the entire space over time. This is important to keep the quality of the parallel algorithm comparable to the serial algorithm. Our algorithm gives quality results within 0.05% of the serial algorithm on 64 processors. Our algorithms, ProperJEDI and PartJEDI, are portable across a wide range of MIMD machines. PartJEDI is memory scalable and is incrementally developed from ProperJEDI which is data replicated. For a large circuit, ProperJEDI reduces the runtime from 11 h to 10 min on a 64-processor machine. For the same circuit, PartJEDI reduces the runtime from 11 h to 20 min and memory requirement from 114 to 2 MB. 相似文献

12.

A slowdown model for applications executing on time-shared clustersof workstations

Figueira S.M. Berman F. 《Parallel and Distributed Systems, IEEE Transactions on》2001,12(6):653-670

Distributed applications executing on clustered environments typically share resources (computers and network links) with other applications. In such systems, application execution may be retarded by the competition for these shared resources. In this paper, we define a model that calculates the slowdown imposed on applications in time-shared multi-user clusters. Our model focuses on three kinds of slowdown: local slowdown, which synthesizes the effect of contention for CPU in a single workstation; communication slowdown, which synthesizes the effect of contention for the workstations and network links on communication costs; and aggregate slowdown, which determines the effect of contention on a parallel task caused by other applications executing on the entire cluster, i.e., on the nodes used by the parallel application. We verify empirically that this model provides an accurate estimate of application performance for a set of compute-intensive parallel applications on different clusters with a variety of emulated loads 相似文献

13.

Parallel Array Object I/O Support on Distributed Environments

Jenq Kuen Lee Ing-Kuen Tsaur San-Yih Hwang 《Journal of Parallel and Distributed Computing》1997,40(2):1425

This paper presents a parallel file object environment to support distributed array store on shared-nothing distributed computing environments. Our environment enables programmers to extend the concept of array distributions from memory levels to file levels. It allows parallel I/O that facilitates the distribution of objects in an application. When objects are read and/or written by multiple applications using different distributions, we present a novel scheme to help programmers to select the best data distribution pattern according to a minimum amount of remote data movements for the store of array objects on distributed file systems. Our selection scheme, to the best of our knowledge, is the first work to attempt to optimize the distribution patterns in the secondary storage for HPF-like programs with inter-application cases. This is especially important for a class of problems called multiple disciplinary optimization (MDO) problems. Our test bed is built on an 8-node DEC Farm connected with an ethernet, FDDI, or ATM switch. Our experimental results with scientific applications show that not only our parallel file system can provide aggregate bandwidths, but also our selection scheme effectively reduces the communication traffic for the system. 相似文献

14.

ADL-RAID: Energy-Saving Data Layout for Dynamic Loads

Youxi Wu Shengyuan Shi Jingyu Liu Huaizhong Zhu 《计算机系统科学与工程》2018,33(5):309-316

With the rapid development of information technology and the explosive growth of data, the scale of storage equipment is growing rapidly. The high energy consumption storage devices have become a serious problem for data centers. For data storage, the model of sequential data access a storage system is commonly used. Aiming at the characteristics of a sequential data storage system, this paper, we propose a structure, named Adapted to Dynamic Load based on Redundant Array Independent Disks (ADL-RAID) which is an effective energy-saving data layout for dynamic loads based on the existing Semi-Redundant Array Independent Disks (S-RAID). ADL-RAID inherits the local parallel energy-saving strategy, uses an address mapping mechanism, and allocates storage space to satisfy the performance requirements for the user requested application. By sensing different loads, ADL-RAID allocates storage space for its dynamic loads with the appropriate parallelism. One or several data disks are used when the load is minimized, and all the data disks are used in parallel when the load is maximized. Experimental results show that, for 100% continuous write request, ADL-RAID saves 33.6% energy consumption than S-RAID5 and improves write performance than S-RAID5 by 34.3%. Thus, ADL-RAID has higher availability and is ideal for sequential data storage applications. 相似文献

15.

Automatic mapping of parallel applications on multicore architectures using the Servet benchmark suite

Jorge González-Domínguez^{Author Vitae} Guillermo L. Taboada Author VitaeBasilio B. Fraguela Author Vitae María J. Martín Author VitaeJuan Touriño Author Vitae 《Computers & Electrical Engineering》2012,38(2):258-269

Servet is a suite of benchmarks focused on detecting a set of parameters with high influence on the overall performance of multicore systems. These parameters can be used for autotuning codes to increase their performance on multicore clusters. Although Servet has been proved to detect accurately cache hierarchies, bandwidths and bottlenecks in memory accesses, as well as the communication overhead among cores, up to now the impact of the use of this information on application performance optimization has not been assessed. This paper presents a novel algorithm that automatically uses Servet for mapping parallel applications on multicore systems and analyzes its impact on three testbeds using three different parallel programming models: message-passing, shared memory and partitioned global address space (PGAS). Our results show that a suitable mapping policy based on the data provided by this tool can significantly improve the performance of parallel applications without source code modification. 相似文献

16.

一种面向大规模并发的Gatherv优化方法

孙浩男王飞魏迪尹万旺史俊达《计算机工程与科学》2022,44(9):1542-1549

MPI不规则集合通信Gatherv为描述并行通信行为提供了极大的灵活性,但其不规则特性带来了较高的实现难度。现有方法存在通信热点突出、内存开销大和访存效率低等问题,难以满足当今大规模并行应用的性能需求。提出一种面向大规模并发的Gatherv优化方法,从优化等级、缓冲区管理等多个关键问题入手,将规则集合通信实现中常用的Binomial-Tree结构用于实现Gatherv,并提出消息链调度机制,进一步降低开销,提升优化效果。测试结果表明,该方法可以有效解决现有方法存在的性能问题,实现Gatherv集合通信性能在大规模并发条件下的高效可扩展。相似文献

17.

Communication Optimizations for Irregular Scientific Computations on Distributed Memory Architectures

Das R. Uysal M. Saltz J. Hwang Y. S. 《Journal of Parallel and Distributed Computing》1994,22(3)

This paper describes a number of optimizations that can be used to support the efficient execution of irregular problems on distributed memory parallel machines. These primitives (1) coordinate interprocessor data movement, (2) manage the storage of, and access to, copies of off-processor data, (3) minimize interprocessor communication requirements, and (4) support a shared name space. We present a detailed performance and scalability analysis of the communication primitives. This performance and scalability analysis is carried out using a workload generator, kernels from real applications, and a large unstructured adaptive application (the molecular dynamics code CHARMM). 相似文献

18.

Topology mapping of irregular parallel applications on torus-connected supercomputers

Jingjin Wu Xuanxing Xiong Eduardo Berrocal Jia Wang Zhiling Lan 《The Journal of supercomputing》2017,73(4):1691-1714

Supercomputers with ever increasing computing power are being built for scientific applications. As the system size scales up, so does the size of interconnect network. As a result, communication in supercomputers becomes increasingly expensive due to the long distance between nodes and network contention. Topology mapping, which maps parallel application processes onto compute nodes by considering network topology and application communication pattern, is an essential technique for communication optimization. In this paper, we study the topology mapping problem for torus-connected supercomputers, and present an analytical topology mapping algorithm for parallel applications with irregular communication patterns. We consider our problem as a discrete optimization problem in the geometric domain of a torus topology, and design an analytical mapping algorithm, which uses numerical solvers to compute the mapping. Experimental results show that our algorithm provides high-quality mappings on 3-dimensional torus, which significantly reduce the communication time by up to 72%. 相似文献

19.

Time and Energy Performance of Parallel Systems with Hierarchical Memory

Jędrzej M. Marszałkowski Maciej Drozdowski Jakub Marszałkowski 《Journal of Grid Computing》2016,14(1):153-170

In this paper we analyze the impact of memory hierarchies on time-energy trade-off in parallel computations. Contemporary computing systems have deep memory hierarchies with significantly different speeds and power consumptions. This results in nonlinear phenomena in the processing time and energy usage emerging when the size of the computation is growing. In this paper the nonlinear dependence of the time and energy on the size of the solved problem is formalized and verified using measurements in practical computer systems. Then it is applied to formulate a problem of minimum time and minimum energy scheduling parallel processing of divisible loads. Divisible load theory is a scheduling and performance model of data-parallel applications. Mathematical programming is exploited to solve the scheduling problem. A trade-off between energy and schedule length is analyzed and again nonlinear relationships between these two criteria are observed. Further performance analysis reveals that energy consumption and schedule length are ruled by a complex interplay between the costs and speeds of on-core and out-of-core computations, communication delays, and activating new machines. 相似文献

20.

Communication and memory optimal parallel data cube construction

Jin R. Vaidyanathan K. Yang G. Agrawal G. 《Parallel and Distributed Systems, IEEE Transactions on》2005,16(12):1105-1119

Data cube construction is a commonly used operation in data warehouses. Because of the volume of data that is stored and analyzed in a data warehouse and the amount of computation involved in data cube construction, it is natural to consider parallel machines for this operation. This paper addresses a number of algorithmic issues in parallel data cube construction. First, we present an aggregation tree for sequential (and parallel) data cube construction, which has minimally bounded memory requirements. An aggregation tree is parameterized by the ordering of dimensions. We present a parallel algorithm based upon the aggregation tree. We analyze the interprocessor communication volume and construct a closed form expression for it. We prove that the same ordering of the dimensions in the aggregation tree minimizes both the computational and communication requirements. We also describe a method for partitioning the initial array and prove that it minimizes the communication volume. Finally, in the cases when memory may be a bottleneck, we describe how tiling can help scale sequential and parallel data cube construction. Experimental results from implementation of our algorithms on a cluster of workstations show the effectiveness of our algorithms and validate our theoretical results. 相似文献