1.

Scalability aspects of parallel multigrid

J. Linden G. Lonsdale H. Ritzdorf A. Schüller 《Future Generation Computer Systems》1994,10(4):429-439

This paper summarizes theoretical and practical investigations into the effect of parallelization by grid-partitioning on the performance of multigrid methods for the solution of partial differential equations on general two-dimensional domains. Particular emphasis will be placed on the algorithmic scalability for MIMD distributed memory systems. Experimental results for two Navier-Stokes test problems, presented in the last section of the paper, show that the theoretically predicted dependency of the combined numerical and parallel efficiencies of multigrid methods on the number of processors employed is in fact very weak. This leads to the conclusion that multigrid is an appropriate candidate for solving partial differential equations on massively parallel machines.

2.

Scalable dynamic Monitoring,Analysis and Tuning Environment for parallel applications

P. Caymes-Scutari A. Morajko T. Margalef E. Luque 《Journal of Parallel and Distributed Computing》2010

Parallel/distributed systems are continuously growing. This enables applications to scale, either by tackling bigger problems in the same period of time or by solving the same problem in a shorter time. In consequence, the methodologies, approaches and tools related to the parallel paradigm should be brought up to date to support the increasing requirements of the applications and the users. MATE (Monitoring, Analysis and Tuning Environment) provides automatic and dynamic tuning for parallel/distributed applications. The tuning decisions are made according to performance models, which provide a fast means to decide what to improve in the execution. However, MATE presents some bottlenecks as the application grows, because the analysis process is carried out in a fully centralized manner. In this work, we propose a new approach to make MATE scalable. In addition, we present the experimental results and the analysis to validate the proposed approach against the original one.

3.

Large-scale parallel topology optimization using a dual-primal substructuring solver (Cited by: 2; self-citations: 2; citations by others: 0)

Anton Evgrafov Cory J. Rupp Kurt Maute Martin L. Dunn 《Structural and Multidisciplinary Optimization》2008,36(4):329-345

Parallel computing is an integral part of many scientific disciplines. In this paper, we discuss issues and difficulties arising when a state-of-the-art parallel linear solver is applied to topology optimization problems. Within the topology optimization framework, we cannot readjust domain decomposition to align with material decomposition, which leads to the deterioration of performance of the substructuring solver. We illustrate the difficulties with detailed condition number estimates and numerical studies. We also report the practical performance of the finite element tearing and interconnection/dual–primal solver for topology optimization problems and our attempts to improve it by applying additional scaling and/or preconditioning strategies. The performance of the method is finally illustrated with large-scale topology optimization problems coming from different optimal design fields: compliance minimization, design of compliant mechanisms, and design of elastic surface wave-guides. The authors acknowledge the support of the Air Force Office of Scientific Research (AFOSR) under grant FA9550-05-1-0046. The computational facility was obtained under the grant AFOSR-DURIP FA9550-05-1-0291.

4.

Programming support and scheduling for communicating parallel tasks

Jörg Dümmler Thomas Rauber Gudula Rünger 《Journal of Parallel and Distributed Computing》2013

Task-based programming models are beneficial for the development of parallel programs for several reasons. They provide a decoupling of the specification of parallelism from the scheduling and mapping to execution resources of a specific hardware platform, thus allowing a flexible and individual mapping. For platforms with a distributed address space, the use of parallel tasks, instead of sequential tasks, adds the additional advantage of a structuring of the program into communication domains that can help to reduce the overall communication overhead.

5.

Scalable parallel FFT for spectral simulations on a Beowulf cluster

P. Dmitruk L. -P. Wang W. H. Matthaeus R. Zhang D. Seckel 《Parallel Computing》2001,27(14):1921-1936

The implementation and performance of the multidimensional Fast Fourier Transform (FFT) on a distributed memory Beowulf cluster is examined. We focus on the three-dimensional (3D) real transform, an essential computational component of Galerkin and pseudo-spectral codes. The approach studied is a 1D domain decomposition algorithm that relies on a communication-intensive transpose operation involving P processors. Communication is based upon the standard portable message passing interface (MPI). We show that 1/P scaling for execution time at fixed problem size N³ (i.e., linear speedup) can be obtained provided that (1) the transpose algorithm is optimized for simultaneous block communication by all processors; and (2) communication is arranged for non-overlapping pairwise communication between processors, thus eliminating blocking when standard fast ethernet interconnects are employed. This method provides the basis for implementation of scalable and efficient spectral method computations of hydrodynamic and magneto-hydrodynamic turbulence on Beowulf clusters assembled from standard commodity components. An example is presented using a 3D passive scalar code.
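
The "non-overlapping pairwise communication" condition in this abstract can be illustrated with a small scheduling sketch. The abstract does not say which pairing the authors use; the XOR pairing below is one common choice, assumed here purely for illustration, and it requires P to be a power of two. The function name `pairwise_schedule` is hypothetical.

```python
def pairwise_schedule(p):
    """Return P-1 rounds of disjoint exchange pairs for P = 2^k processors.

    Assumption: XOR pairing, which is not necessarily the paper's scheme.
    """
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a power of two for the XOR pairing")
    rounds = []
    for step in range(1, p):
        # In round `step`, rank r exchanges its block with rank (r XOR step).
        # XOR with a fixed nonzero value is its own inverse, so every rank
        # appears in exactly one pair per round and no link is shared.
        rounds.append([(r, r ^ step) for r in range(p) if r < (r ^ step)])
    return rounds


if __name__ == "__main__":
    for step, pairs in enumerate(pairwise_schedule(4), start=1):
        print("round", step, pairs)
```

After P-1 rounds every pair of ranks has exchanged exactly once, which is what a distributed transpose needs, and within a round no processor takes part in more than one transfer.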

6.

Scalable 3D hybrid parallel Delaunay image-to-mesh conversion algorithm for distributed shared memory architectures

《Computer aided design》2017

In this paper, we present a scalable three-dimensional hybrid parallel Delaunay image-to-mesh conversion algorithm (PDR.PODM) for distributed shared memory architectures. PDR.PODM is able to explore parallelism early in the mesh generation process thanks to the aggressive speculative approach employed by the Parallel Optimistic Delaunay Mesh generation algorithm (PODM). In addition, it decreases the communication overhead and improves data locality by making use of a data partitioning scheme offered by the Parallel Delaunay Refinement algorithm (PDR). PDR.PODM supports fully functional volume grading by creating elements with varying size. Small elements are created near the boundary or inside the critical regions in order to capture the fine features while big elements are created in the rest of the mesh. We tested PDR.PODM on Blacklight, a distributed shared memory (DSM) machine in Pittsburgh Supercomputing Center. For the uniform mesh generation, we observed a weak scaling speedup of 163.8 and above for up to 256 cores as opposed to PODM whose weak scaling speedup is only 44.7 on 256 cores. PDR.PODM scales well on uniform refinement cases running on DSM supercomputers. The end result is that PDR.PODM can generate 18 million elements per second as opposed to 14 million per second in our earlier work. The varying size version sharply reduces the number of elements compared to the uniform version and thus reduces the time to generate the mesh while keeping the same fidelity.
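
The figures quoted in this abstract can be put on a common footing with a little arithmetic. The sketch below assumes the usual definition of parallel efficiency (speedup divided by core count) and treats 163.8 as the PDR.PODM speedup at 256 cores, which the abstract only gives as a lower bound; neither assumption comes from the paper itself.

```python
def efficiency(speedup, cores):
    """Parallel efficiency as speedup per core (standard definition, assumed)."""
    return speedup / cores


# Numbers taken from the abstract above.
pdr_podm_eff = efficiency(163.8, 256)   # PDR.PODM, lower bound
podm_eff = efficiency(44.7, 256)        # PODM
throughput_gain = 18e6 / 14e6           # elements per second, new vs. earlier work

if __name__ == "__main__":
    print(f"PDR.PODM efficiency: {pdr_podm_eff:.1%}")
    print(f"PODM efficiency:     {podm_eff:.1%}")
    print(f"Throughput gain:     {throughput_gain:.2f}x")
```

Under these assumptions PDR.PODM keeps roughly 64% efficiency on 256 cores against roughly 17% for PODM, and the reported throughput is about 1.29 times that of the earlier work.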

7.

Configurable parallel memory architecture for multimedia computers

Kimmo Jarno Timo Jarkko 《Journal of Systems Architecture》2002,47(14-15)

This paper presents a novel parallel memory architecture for multimedia computers. Applying a configurable or programmable addressing circuitry capable of parallel memory accesses, the memory management of multimedia applications can be enhanced. Necessary computer architecture changes to virtual address representation, paging, virtual memory, address computation circuitry and data permutation are discussed. These changes allow the memory to be partitioned for different access functions. In addition, the same memory area can be accessed by multiple access patterns. Therefore, a general-purpose computing system that is capable of exploiting the repeating memory access patterns in its applications can be built. Performance of the configurable parallel memory architecture (CPMA) is analyzed in the case of a selection of algorithms from a video encoder. These motion estimation algorithms and zigzag scanning benefit from the multiple memory access functions, which is apparent from the comparisons to the traditional sequential memory accesses.
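
To make the zigzag access pattern mentioned in this abstract concrete, here is a minimal sketch that generates the conventional zigzag visiting order of an n×n coefficient block, as used in block-transform video coding. It is assumed, not confirmed by the abstract, that this is the scan the authors mean; the function name `zigzag_order` is hypothetical and nothing here models the CPMA hardware itself.

```python
def zigzag_order(n=8):
    """Return (row, col) coordinates of an n x n block in zigzag scan order."""
    order = []
    for d in range(2 * n - 1):
        # All cells on anti-diagonal d satisfy row + col == d.
        cells = [(i, d - i) for i in range(n) if 0 <= d - i < n]
        # Even diagonals are walked from bottom-left to top-right,
        # odd diagonals from top-right to bottom-left.
        if d % 2 == 0:
            cells.reverse()
        order.extend(cells)
    return order


if __name__ == "__main__":
    print(zigzag_order(4))
```

In a row-major array consecutive cells of this order do not sit at a fixed stride, which is the kind of repeating but non-sequential pattern that a memory with multiple access functions is meant to serve.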

8.

A comparative workload-based methodology for performance evaluation of parallel computers (Cited by: 1; self-citations: 0; citations by others: 1)

E. Onbasioglu Y. Paker 《Future Generation Computer Systems》1997,12(6):521-545

A practical methodology for evaluating and comparing the performance of distributed memory Multiple Instruction Multiple Data (MIMD) systems is presented. The methodology determines machine parameters and program parameters separately, and predicts the performance of a given workload on the machines under consideration. Machine parameters are measured using benchmarks that consist of parallel algorithm structures. The methodology takes a workload-based approach in which a mix of application programs constitutes the workload. The performance of different systems is compared, under the given workload, using the ratio of their speeds. In order to validate the methodology, an example workload has been constructed and the time estimates have been compared with the actual runs, yielding good predicted values. Variations in the workload are analysed in terms of increase in problem sizes and changes in the frequency of particular algorithm groups. Utilization and scalability are used to compare the systems when the number of processors is increased. It has been shown that the performance of parallel computers is sensitive to changes in the workload and therefore any evaluation and comparison must consider a given user workload. The performance improvement that can be obtained by increasing the size of a distributed memory MIMD system depends on the characteristics of the workload as well as the parameters that characterize the communication speed of the parallel system.

9.

The design of an operating system for a scalable parallel computing engine

Paul Austin Kevin Murray Andy Wellings 《Software》1991,21(10):989-1013

There are substantial benefits to be gained from building computing systems from a number of processors working in parallel. One of the frequently-stated advantages of parallel and distributed systems is that they may be scaled to the needs of the user. This paper discusses some of the problems associated with designing a general-purpose operating system for a scalable parallel computing engine and then describes the solutions adopted in our experimental parallel operating system. We explain why a parallel computing engine composed of a collection of processors communicating through point-to-point links provides a suitable vehicle in which to realize the advantages of scaling. We then introduce a parallel-processing abstraction which can be used as the basis of an operating system for such a computing engine. We consider how this abstraction can be implemented and retain the ability to scale. As a concrete example of the ideas presented here we describe our own experimental scalable parallel operating-system project, concentrating on the Wisdom nucleus and the Sage file system. Finally, after introducing related work, we describe some of the lessons learnt from our own project.

10.

Estimating and optimizing performance for parallel programs (Cited by: 1; self-citations: 0; citations by others: 1)

Fahringer T. 《Computer》1995,28(11):47-56

The article describes P³T, a parameter-based performance prediction tool that estimates performance for parallel programs running on distributed-memory parallel architectures. P³T has been carefully designed to address all of the above performance estimation issues. To achieve high estimation accuracy, P³T aggressively exploits compiler analysis and optimization information. Our method is based on modeling loop iteration spaces, array access patterns, and data distributions by intersection and volume operations on n-dimensional polytopes. The most critical architecture-specific factors, such as cache line sizes, number of cache lines available, routing policy, start-up times, message transfer time per byte, and so forth, are modeled to reflect the performance impact of the target machine. P³T has been developed in the context of the Vienna Fortran Compilation System (VFCS), a state-of-the-art parallelization tool for distributed-memory systems. VFCS translates Fortran programs into explicitly parallel message-passing programs. P³T successfully guides the interactive and automatic restructuring of programs under this system. The article describes the underlying compilation and programming model and discusses the most critical design decisions made for P³T; in addition, it outlines the implementation of the parallel program parameters. Also described are the VFCS context under which P³T is applied and the P³T graphical user interface.

11.

The Duodirun merging algorithm: A new fast algorithm for parallel merging

Zhi-Jie Zheng 《Information Processing Letters》1983,17(3):167-168

12.

PIPORS: a parallel input parallel output register switching system

Der-Fu Tao Liang-Teh Lee 《Computers & Electrical Engineering》2004,30(6):427-440

To make data exchange fast enough to support current communication systems and networks, a high-speed switching system with low transmission delay and low data loss is required. Many researchers have used statistical time division multiplexing techniques to design switching systems with higher throughput. In such switching systems with n input/output ports, the internal execution speed must be n times faster than that of a system with a single input/output port. This design philosophy is not well suited to the growing demand for higher-speed systems. To overcome the drawbacks of the switching systems mentioned above, a novel architecture, the Parallel Input Parallel Output Register Switching System (PIPORS), is proposed in this paper. PIPORS is based on the interconnection of small distributed Shared Memory Modules (SMM) and the Shift Register Switch Array (SRSA). This construction accelerates the switching speed. In addition, the number of input/output ports of the system can easily be extended to provide a higher capacity in response to the fast-increasing amount of data transferred in the system. Three simple methods to extend the input/output ports and the capacity of the internal memory are presented. To evaluate the performance of the proposed system, we compare PIPORS with the Central Shared Memory Switching System (CSMS) with respect to the total amount of memory required, data loss probability, transmission delay and switching performance. The results show that PIPORS achieves better performance.

13.

Practical aspects and experiences Scalable massively parallel algorithms for computational nanoelectronics

Xiaodong Wang Vwani P. Roychowdhury Pratheep Balasingam 《Parallel Computing》1997,22(14):1931-1963

There is at present a worldwide effort to overcome the technological barriers to nanoelectronics. Microscopic simulation can significantly enhance our understanding of the physics of nanoscale structures, and constitutes a valuable tool for designing nanoelectronic functional devices. In nanodevices, novel physics effects are used to attain logic functionality which conventional technology cannot achieve. Therefore it is necessary to develop quantum-transport simulation methods which include novel physical effects. Moreover, simulation of realistic nanodevices requires enormous computing resources, necessitating parallel supercomputing. In this paper, we present massively parallel algorithms for simulating large-scale nanoelectronic networks based on the single-electron tunneling effect, which is arguably the quantum effect of greatest significance to nanoelectronic technology. A MIMD implementation of our simulation algorithm is carried out on a 64-processor nCUBE 2, and a SIMD implementation is carried out on a 16,384-processor MasPar MP-1. By exploiting massive parallelism, both parallel implementations achieve very high parallel efficiency and nearly linear scalability. The result of this work is that we are able to simulate large-scale nanoelectronic networks within a reasonable time period, which would be impractical on conventional workstations.

14.

Empirical performance modeling for parallel weather prediction codes

Hermann Mierendorff Wolfgang Joppich 《Parallel Computing》1999,25(13-14):2135-2148

Performance modeling for large industrial or scientific codes is of value for program tuning or for selection of new machines when benchmarking is not yet possible. We discuss an empirical method of estimating runtime for certain large parallel programs where computational work is estimated by regression functions based on measurements and time cost of communication is modeled by program analysis and benchmarks for communication primitives. The method is demonstrated with the local weather model (LM) of the German Weather Service (DWD) on SP-2, T3E, and SX-4. The method is an economic way of developing performance models because only a moderate number of measurements is required. The resulting model is sufficiently accurate even for very large test cases.

15.

Scheduling parallel Kalman filters with quantized deadlines

《Systems & Control Letters》2015

In this paper we explore the problem of scheduling parallel processes of Kalman filters to meet individual estimation error requirements. It is assumed that at each time-step measurements of only one process are received. We define real-time deadlines of transmissions and convert the problem into arranging a sequence of tasks with corresponding deadlines. To reduce computations, cycles of transmissions are calculated and virtual processes are introduced into scheduling. A sliding window method is then designed to adjust the processes against real-time disturbances in applications. Compared with algorithms proposed in Lin and Wang (2013), the proposed algorithm is able to schedule a feasible sequence adaptively within a short scheduling window and requires little computation.

16.

A fast parallel algorithm for the closest pair problem

Charles R. Dyer 《Information Processing Letters》1980,11(1):49-52

17.

The communication efficiency problem in parallel systems

Lin Hong, Chen Guoliang 《小型微型计算机系统》1996,(1)

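Massively parallel processing (MPP) emphasizes the scalability of parallel architectures and parallel algorithms. On a scalable parallel architecture, a scalable parallel algorithm should make effective use of a growing number of processors, and the effectiveness of an algorithm is usually measured by its processor efficiency at run time. A commonly overlooked factor is communication efficiency, which is an issue of general relevance. This paper defines communication efficiency, studies its relationship to processor efficiency, and, by analyzing the execution of a typical algorithm, examines the communication efficiency of several common parallel architectures. The results show that only the combination of processor efficiency and communication efficiency can fully evaluate the scalability of an algorithm and guide the design of parallel architectures.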

18.

Probabilistic performance analysis for parallel search techniques

Wei-Ming Lin Bo Yang 《International journal of parallel programming》1995,23(2):161-189

This paper discusses the performance analysis of two generic fundamental parallel search techniques on shared memory multi-processor systems in solving the constraint satisfaction problem (CSP). Probabilistic analysis on their expected computation steps needed and their inherent load-balancing capability is performed. Corresponding experimental results are also provided to verify the correctness of the proposed analysis. This fundamental analysis approach can be further applied to various advanced parallel search techniques or various problem solving techniques on parallel platforms. This research was supported in part by the University of Texas at San Antonio under the Faculty Research Award program.

19.

A study of the high scalability of the Beowulf-T cluster system

Zhu Yongzhi, Li Bingfeng, Wei Ronghui 《计算机科学》2008,35(2):298-300

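Parallel computing technology is one of the important indicators of a country's level of science and technology, and PC cluster computers are the cheapest high-performance computers. This paper builds a Beowulf-T cluster system and proposes formulas for computing speedup and efficiency on it. Tests on examples show that the cluster system is highly scalable.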

20.

Research on a performance model for heterogeneous database clusters

Wang Yuanzhen, Gong Weihua 《计算机科学》2006,33(6):106-108

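Database clusters are an effective parallel processing solution for OLTP applications. Because earlier performance evaluations of database clusters, especially heterogeneous ones, were incomplete, this paper studies a performance model for heterogeneous database clusters, analyzes the performance impact of heterogeneity in two resources, CPU and memory, and gives metrics for the parallelism of heterogeneous clusters along with formulas for evaluating system effectiveness. Finally, TPC-C experiments show that heterogeneous database clusters still offer good scalability, sublinear speedup, and cost-effective parallel processing services for OLTP workloads.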