Similar Literature
20 similar documents found (search time: 31 ms)
1.
Two parallel computer paradigms available today are multi-core accelerators such as the Sony, Toshiba and IBM Cell or Graphics Processing Units (GPUs), and massively parallel message-passing machines such as the IBM Blue Gene (BG). The solution of systems of linear equations is one of the most CPU-intensive steps in engineering and simulation applications and can greatly benefit from the multitude of processing cores and vectorisation on today's parallel computers. We parallelise the conjugate gradient (CG) linear equation solver on the Cell Broadband Engine and the IBM Blue Gene/L machine. We perform a scalability analysis of CG on both machines, across 1, 8 and 16 synergistic processing elements on the Cell and 1–32 cores on BG, with heptadiagonal matrices. The results indicate that the multi-core Cell system outperforms the massively parallel BG system by three to four times, owing to the Cell's higher communication bandwidth and accelerated vector processing capability.
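As a point of reference for the solver discussed above, the following is a minimal serial sketch of the conjugate gradient iteration for a symmetric positive-definite system; it is illustrative only and does not reflect the Cell or Blue Gene parallelisation evaluated in the paper.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-8, max_iter=1000):
    """Minimal serial CG for a symmetric positive-definite matrix A."""
    x = np.zeros_like(b)
    r = b - A @ x            # initial residual
    p = r.copy()             # initial search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)        # step length along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p    # update search direction
        rs_old = rs_new
    return x
```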

2.
Parallel computers are having a profound impact on computational science. Recently highly parallel machines have taken the lead as the fastest supercomputers, a trend that is likely to accelerate in the future. We describe some of these new computers, and issues involved in using them. We present elliptic PDE solutions currently running at 3.8 gigaflops, and an atmospheric dynamics model running at 1.7 gigaflops, on a 65 536-processor computer.

One intrinsic disadvantage of a parallel machine is the need to perform inter-processor communication. It is important to ensure that such communication time is maintained at a small fraction of computation time. We analyze standard multigrid algorithms in two and three dimensions from this point of view, indicating that performance efficiencies in excess of 95% are attainable under suitable conditions on moderately parallel machines. We also demonstrate that such performance is not attainable for multigrid on massively parallel computers, as indicated by an example of poor multigrid efficiency on 65 536 processors. The fundamental difficulty is the inability to keep 65 536 processors busy when operating on very coarse grids.

Most algorithms used for implementing applications on parallel machines have been derived directly from algorithms designed for serial machines. The previously mentioned multigrid example indicates that such ‘parallelized’ algorithms may not always be optimal. Parallel machines open the possibility of finding totally new approaches to solving standard tasks—intrinsically parallel algorithms. In particular, we present a class of superconvergent multiple scale methods that were motivated directly by massively parallel machines. These methods differ from standard multigrid methods in an intrinsic way, and allow all processors to be used at all times, even when processing on the coarsest grid levels. Their serial versions are not sensible algorithms. The idea that parallel hardware—the Connection Machine in this case—can lead to the discovery of new mathematical algorithms was surprising to us.
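To make the coarse-grid difficulty raised above concrete, here is a back-of-the-envelope sketch (with an assumed fine-grid size, not a figure from the paper) of how many of 65 536 processors can stay busy on each level of a standard 2-D multigrid hierarchy.

```python
# Illustrative only: counts how many processors can have at least one grid
# point to work on at each multigrid level, assuming an ideal distribution.
P = 65536                  # processors, as in the example above
N = 4096                   # hypothetical fine-grid points per side

level, points = 0, N * N
while points >= 1:
    busy = min(P, points)              # at best one grid point per processor
    print(f"level {level:2d}: {points:9d} points, {busy / P:6.1%} of processors busy")
    level += 1
    points //= 4                       # each coarsening halves each dimension
```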


3.
In recent years multi-core processors have seen broad adoption in application domains ranging from embedded systems through general-purpose computing to large-scale data centres. Simulation technology for multi-core systems, however, lags behind and does not provide the simulation speed required to effectively support design space exploration and parallel software development. While state-of-the-art instruction set simulators (ISS) for single-core machines reach or exceed the performance levels of speed-optimised silicon implementations of embedded processors, the same does not hold for multi-core simulators, where large performance penalties are to be paid. In this paper we develop a fast and scalable simulation methodology for multi-core platforms based on parallel and just-in-time (JIT) dynamic binary translation (DBT). Our approach can model large-scale multi-core configurations, does not rely on prior profiling, instrumentation, or compilation, and works for all binaries targeting a state-of-the-art embedded multi-core platform implementing the ARCompact instruction set architecture (ISA). We have evaluated our parallel simulation methodology against the industry-standard SPLASH-2 and EEMBC MultiBench benchmarks and demonstrate simulation speeds of up to 25,307 MIPS on a 32-core x86 host machine for as many as 2,048 target processors, whilst exhibiting minimal and near-constant overhead, including memory considerations.

4.
The synthetic-perturbation screening (SPS) methodology is based on an empirical approach; SPS introduces artificial perturbations into the MIMD program and captures the effects of such perturbations by using the modern branch of statistics called design of experiments. SPS can provide the basis of a powerful tool for screening MIMD programs for performance bottlenecks. This technique is portable across machines and architectures, and scales extremely well on massively parallel processors. The purpose of this paper is to explain the general approach and to extend it to address specific features that are the main sources of poor performance in the shared memory programming model. These include performance degradation due to load imbalance and insufficient parallelism, and overhead introduced by synchronizations and by accessing shared data structures. We illustrate the practicality of SPS by demonstrating its use on two very different case studies: a large image understanding benchmark and a parallel quicksort.
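The following toy sketch illustrates the general synthetic-perturbation idea in the simplest terms: artificial delays are switched on and off at candidate sites according to a two-level full factorial design, and the main effect of each site on run time is estimated by least squares. The workload, per-site costs, and slowdown factor are all hypothetical; this is not the SPS tool described in the paper.

```python
import itertools, time
import numpy as np

def program(perturb):
    """Dummy workload with three candidate sites; a perturbed site has its
    (hidden) cost doubled, so bottleneck sites show the largest effect."""
    t0 = time.perf_counter()
    for site, weight in enumerate([0.02, 0.10, 0.01]):   # hidden per-site costs (s)
        slowdown = 2.0 if perturb[site] else 1.0
        time.sleep(weight * slowdown)
    return time.perf_counter() - t0

levels = list(itertools.product([0, 1], repeat=3))        # 2^3 factorial design
times = np.array([program(p) for p in levels])
X = np.column_stack([np.ones(len(levels)), np.array(levels)])
coeffs = np.linalg.lstsq(X, times, rcond=None)[0]         # intercept + main effects
print("estimated main effect of perturbing each site:", coeffs[1:])
```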

5.
We explore the comparative performance of the Cray XMT and XMT‐2 massively multithreaded supercomputers. We use benchmarks to evaluate memory accesses for various types of loops. We also compare the performance of these machines on matrix multiply and on three previously implemented dynamic programming algorithms. It is shown that the relative performance of these machines depends on the size (number of processors) of the configuration, as well as on the size of the problem being evaluated. In particular, small configurations of the original XMT can sometimes show slightly better performance than larger configurations of the XMT‐2 for the same problem size. We note that, under heavy memory load, the performance of loops can saturate well before the maximum number of available processors is reached. This suggests that it may not always be useful to use the maximum number of processors for a specific run. We also show that manual restructuring of nested loops, including decreasing the parallelism, can result in major improvements in performance. The results in this paper indicate that careful exploration of the space of problem sizes, numbers of processors used, and choices of loop parallelization can yield substantial improvements in performance. These improvements can be very significant for production codes that run for extended periods of time. Copyright © 2012 John Wiley & Sons, Ltd.

6.
Current technological trends suggest that the high performance scientific machines of the future are very likely to consist of a large number (greater than 1024) of processors connected and communicating with each other in some as yet undetermined manner. Such an assembly of processors should behave as a single machine in obtaining numerical solutions to scientific problems. However, the appropriate way of organizing both the hardware and software of such an assembly of processors is an unsolved and active area of research. It is particularly important to minimize the organizational overhead of interprocessor communication, global synchronization, and contention for shared resources if the performance of a large number (n) of processors is to approach the desirable n times the performance of a single processor. In many situations, adding a processor actually decreases the performance of the overall system, since the extra organizational overhead is larger than the extra processing power added. The systolic loop architecture is a new multiple processor architecture which attempts to solve the problem of how to organize a large number of asynchronous processors into an effective computational system while minimizing the organizational overhead. This paper gives a brief overview of the basic systolic loop architecture, systolic loop algorithms for numerical computation, and a 64-processor implementation of the architecture, WATERLOOP V2/64, that is being used as a testbed for exploring the hardware, software, and algorithmic aspects of the architecture.

7.
We introduce a formalism which allows computer architecture to be treated as a formal optimization problem. We apply this to the design of shared memory parallel machines. Present parallel computers of this type support the programming model of a shared memory, but often process simultaneous accesses by several processors to the shared memory sequentially; theoretical computer science offers solutions to this problem that are provably fast and asymptotically optimal. However, the constants in these constructions seemed too large to make them competitive. We modify these constructions from an engineering perspective and improve the price/performance ratio by roughly a factor of 6. The resulting machine has a surprisingly good price/performance ratio even when compared with distributed memory machines. For almost all patterns of access by all processors into the shared memory, access is as fast as the access of only a single processor.

8.
In this paper, the development of a two-dimensional plasma fluid modeling code using the cell-centered finite-volume method and its parallel implementation on distributed memory machines is reported. Simulated discharge currents agree very well with the measured data in a planar dielectric barrier discharge (DBD). The parallel performance of simulating a helium DBD, solved with a generalized minimal residual method (GMRES) preconditioned by an additive Schwarz method (ASM) with different degrees of overlap, is investigated for the different modeling equations on a small and a large test problem, employing up to 128 processors. For the large test problem, almost linear speedup can be obtained using up to 128 processors. Finally, a large-scale realistic two-dimensional DBD problem is employed to demonstrate the capability of the developed fluid modeling code for simulating low-temperature plasma with complex chemical reactions.
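As a rough serial illustration of the solver combination mentioned above: with zero overlap, additive Schwarz reduces to a block-Jacobi preconditioner, which can be handed to a GMRES solver. The test matrix, block count, and sizes below are hypothetical and unrelated to the plasma model in the paper.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n, nblocks = 1024, 8                     # unknowns and "subdomains" (assumed)
A = sp.diags([-1, 2.1, -1], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

# One independent "subdomain solve" per diagonal block (zero-overlap Schwarz).
block = n // nblocks
inv_blocks = [np.linalg.inv(A[i*block:(i+1)*block, i*block:(i+1)*block].toarray())
              for i in range(nblocks)]

def apply_precond(r):
    z = np.empty_like(r)
    for i in range(nblocks):
        z[i*block:(i+1)*block] = inv_blocks[i] @ r[i*block:(i+1)*block]
    return z

M = spla.LinearOperator((n, n), matvec=apply_precond)
x, info = spla.gmres(A, b, M=M)
print("info =", info, "residual norm =", np.linalg.norm(A @ x - b))
```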

9.
A recent development in radio astronomy is to replace traditional dishes with many small antennas. The signals are combined to form one large, virtual telescope. The enormous data streams are cross-correlated to filter out noise. This is especially challenging, since the computational demands grow quadratically with the number of data streams. Moreover, the correlator is not only computationally intensive, but also very I/O intensive. The LOFAR telescope, for instance, will produce over 100 terabytes per day. The future SKA telescope will even require on the order of exaflops of processing, and petabits/s of I/O. A recent trend is to correlate in software instead of dedicated hardware, to increase flexibility and to reduce development effort. We evaluate the correlator algorithm on multi-core CPUs and many-core architectures, such as NVIDIA and ATI GPUs, and the Cell/B.E. The correlator is a streaming, real-time application, and is much more I/O intensive than applications that are typically implemented on many-core hardware today. We compare with the LOFAR production correlator on an IBM Blue Gene/P supercomputer. We investigate performance, power efficiency, and programmability. We identify several important architectural problems which cause architectures to perform suboptimally. Our findings are applicable to data-intensive applications in general. The processing power and memory bandwidth of current GPUs are highly imbalanced for correlation purposes. While the production correlator on the Blue Gene/P achieves a superb 96% of the theoretical peak performance, this is only 16% on ATI GPUs and 32% on NVIDIA GPUs. The Cell/B.E. processor, in contrast, achieves an excellent 92%. We found that the Cell/B.E. and NVIDIA GPUs are the most energy-efficient solutions; they run the correlator at least 4 times more energy efficiently than the Blue Gene/P. The research presented is an important pathfinder for next-generation telescopes.
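A minimal sketch of the correlation kernel the abstract refers to: for every pair of stations, the product of one signal with the complex conjugate of the other is accumulated over time, which is where the quadratic growth in work comes from. Station and sample counts are made up, and real correlators channelise the data and run in real time.

```python
import numpy as np

stations, samples = 8, 4096
rng = np.random.default_rng(0)
x = rng.standard_normal((stations, samples)) + 1j * rng.standard_normal((stations, samples))

# Upper-triangular matrix of visibilities; the pair count grows as stations**2 / 2.
vis = np.zeros((stations, stations), dtype=complex)
for i in range(stations):
    for j in range(i, stations):
        vis[i, j] = np.vdot(x[j], x[i])   # sum over time of x_i * conj(x_j)

print(vis[0, 1])
```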

10.
We survey some work concerned with small universal Turing machines, cellular automata, tag systems, and other simple models of computation. For example, it has been an open question for some time as to whether the smallest known universal Turing machines of Minsky, Rogozhin, Baiocchi and Kudlek are efficient (polynomial time) simulators of Turing machines. These are some of the most intuitively simple computational devices and previously the best known simulations were exponentially slow. We discuss recent work that shows that these machines are indeed efficient simulators. As a related result, we also find that Rule 110, a well-known elementary cellular automaton, is also efficiently universal. We also review a large number of old and new universal program size results, including new small universal Turing machines and new weakly, and semi-weakly, universal Turing machines. We then discuss some ideas for future work arising out of these, and other, results.

11.
The IBM Blue Gene/C parallel computer aims to demonstrate the feasibility of a cellular architecture computer with millions of concurrent threads of execution. One of the major challenges in this project is showing that applications can successfully scale to this massive amount of parallelism. In this paper we demonstrate that the simulation of protein folding using classical molecular dynamics falls in this category. Starting from the sequential version of a well known molecular dynamics code, we developed a new parallel implementation that exploited the multiple levels of parallelism present in the Blue Gene/C cellular architecture. We performed both analytical and simulation studies of the behavior of this application when executed on a very large number of threads. As a result, we demonstrate that this class of applications can execute efficiently on a large cellular machine.

12.
Molecular dynamics simulation is a class of applications that require reducing the execution time of fixed-size problems. This reduction in execution time is important to drug design and protein interaction studies. Many implementations of parallel molecular dynamics have been developed, but very little work has addressed issues related to the use of machines with 50,000 processors for modest-sized problems in the range of 50,000 atoms. Current massively parallel machines present a major obstacle to achieving good performance: communication overhead. In this paper we quantify the communication latency and network bandwidth necessary to achieve 30–40% efficiency on future message-passing machines with sizes on the order of tens of thousands of processors, for executing molecular dynamics problems with the same order of atoms. We derive an analytical model of a benchmark application that simulates a system of helium atoms executing on the Intel Touchstone Delta using an interaction decomposition method. This model is validated and used to extrapolate information on the startup time and network bandwidth. The results indicate that for an MPP with a four-dimensional mesh topology using 400 MHz processors, the communication startup time must be at most 30 clock cycles and the network bandwidth at least 2.3 GB/s. This configuration results in 30–40% efficiency of the MPP for a problem with 50,000 atoms executing on 50,000 processors.
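The following back-of-the-envelope sketch shows how the two parameters extrapolated in the paper, message startup time and network bandwidth, enter a simple efficiency model of the form T_comp / (T_comp + T_comm). The per-atom work, message count, and message size are invented for illustration; the paper's actual analytical model is more detailed.

```python
# Toy efficiency model; all workload and message parameters are hypothetical.
clock_hz            = 400e6   # 400 MHz processors, as in the abstract
flops_per_atom_step = 500.0   # assumed work per atom per timestep
atoms_per_proc      = 1       # 50,000 atoms on 50,000 processors
msgs_per_step       = 6       # assumed neighbour exchanges per step
bytes_per_msg       = 1024    # assumed message size

def efficiency(startup_cycles, bandwidth_gbs):
    t_comp = atoms_per_proc * flops_per_atom_step / clock_hz
    t_comm = msgs_per_step * (startup_cycles / clock_hz
                              + bytes_per_msg / (bandwidth_gbs * 1e9))
    return t_comp / (t_comp + t_comm)

# The regime discussed in the abstract: ~30 startup cycles and >= 2.3 GB/s.
print(f"efficiency ~ {efficiency(30, 2.3):.0%}")
```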

13.
Particle-in-cell simulations often suffer from load imbalance on parallel machines due to the competing requirements of the field-solve and particle-push computations. We propose a new algorithm that balances the two computations independently. The grid for the field-solve computation is statically partitioned. The particles within a processor's sub-domain(s) are dynamically balanced by migrating spatially compact groups of particles from heavily loaded processors to lightly loaded ones as needed. The algorithm has been implemented in the Quicksilver electromagnetic particle-in-cell code. We provide details of the implementation and present performance results for Quicksilver running models with up to a billion grid cells and particles on thousands of processors of a large distributed-memory parallel machine.
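A schematic sketch of the balancing idea described above: the grid partition stays fixed while groups of particles migrate from heavily loaded ranks to lightly loaded ones until the counts are roughly even. The group size, particle counts, and greedy rule below are assumptions, not details of the Quicksilver implementation.

```python
# Greedy particle migration: repeatedly ship one group from the most loaded
# rank to the least loaded rank until the spread is within one group.
particles_per_rank = [900, 120, 400, 180]      # hypothetical particle counts
group_size = 50                                # particles migrate in compact groups

target = sum(particles_per_rank) / len(particles_per_rank)
moves = []
while max(particles_per_rank) - min(particles_per_rank) > group_size:
    src = particles_per_rank.index(max(particles_per_rank))
    dst = particles_per_rank.index(min(particles_per_rank))
    particles_per_rank[src] -= group_size
    particles_per_rank[dst] += group_size
    moves.append((src, dst, group_size))

print("final counts:", particles_per_rank, "target ~", target)
print("migrations:", moves)
```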

14.
Multicore processors form the basis of most traditional high performance parallel processing architectures. Early experiences with these computers showed significant performance problems, both with regard to computation and inter‐process communication. The transition from Purple, an IBM POWER5‐based machine, to Cielo, a Cray XE6, as the main capability computing platform for the United States Department of Energy's Advanced Simulation and Computing campaign provides an opportunity to reexamine these issues after experiences with a few generations of multicore‐based machines. Experiences with Purple identified some important characteristics that led to strong performance of complex scientific application programs at very large scales. Herein, we compare the performance of some Advanced Simulation and Computing mission critical applications at capability scale across this transition to multicore processors. Copyright © 2012 John Wiley & Sons, Ltd.

15.
This paper examines measures for evaluating the performance of algorithms for single instruction stream–multiple data stream (SIMD) machines. The SIMD mode of parallelism involves using a large number of processors synchronized together. All processors execute the same instruction at the same time; however, each processor operates on a different data item. The complexity of parallel algorithms is, in general, a function of the machine size (number of processors), problem size, and type of interconnection network used to provide communications among the processors. Measures which quantify the effect of changing the machine-size/problem-size/network-type relationships are therefore needed. A number of such measures are presented and are applied to an example SIMD algorithm from the image processing problem domain. The measures discussed and compared include execution time, speed, parallel efficiency, overhead ratio, processor utilization, redundancy, cost effectiveness, speed-up of the parallel algorithm over the corresponding serial algorithm, and an additive measure called "sprice" which assigns a weighted value to computations and processors.
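For concreteness, here is a worked sketch of several of the classical measures listed above, using made-up timings and operation counts. The definitions follow common usage (speedup as T_serial/T_parallel, efficiency as speedup per processor, and so on); the paper-specific "sprice" measure is not reproduced.

```python
# All inputs are hypothetical; only the relationships between measures matter.
T_serial     = 120.0    # serial execution time (s)
T_parallel   = 5.0      # parallel execution time (s)
P            = 32       # number of processors
ops_parallel = 4.0e9    # operations executed by the parallel algorithm
ops_serial   = 3.2e9    # operations executed by the best serial algorithm

speedup     = T_serial / T_parallel
efficiency  = speedup / P                          # fraction of ideal P-fold speedup
overhead    = (P * T_parallel - T_serial) / T_serial   # extra processor-time relative to serial work
redundancy  = ops_parallel / ops_serial            # extra operations done to run in parallel
utilization = redundancy * efficiency              # one common definition

print(f"speedup={speedup:.1f}  efficiency={efficiency:.2f}  overhead={overhead:.2f}  "
      f"redundancy={redundancy:.2f}  utilization={utilization:.2f}")
```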

16.
Rsim: simulating shared-memory multiprocessors with ILP processors
The early 1990s saw several announcements of commercial shared-memory systems using processors that aggressively exploited instruction-level parallelism (ILP), including the MIPS R10000, Hewlett-Packard PA8000, and Intel Pentium Pro. These processors could potentially reduce memory read stalls by overlapping read latency with other operations, possibly changing the nature of performance bottlenecks in the system. The authors' experience with Rsim demonstrates that modeling ILP features is important even in shared-memory multiprocessor systems. In particular, current simple processor-based approximations cannot model significant performance effects for applications exhibiting parallel read misses. Further, recent shared-memory designs such as aggressive implementations of sequential consistency use the aggressive ILP-enhancing features of modern processors that simple processor-based simulators do not model. As microprocessor systems become more complex, the availability of shared infrastructure source code is likely to become increasingly crucial. The authors plan to release a new Rsim version shortly that will include instruction caches, TLBs, multimedia extensions, simultaneous multithreading, Rabbit fast simulation mode, and ports to Linux platforms.

17.
Two image processing applications, edge detection and image resizing, are studied in this paper on two HPC platforms, namely the Cell BE and the Blue Gene/L machines. We focus on the performance scalability of the studied applications. Our results show that the scale of the problem to be solved strongly affects which platform is the better fit. If the data set fits within a Cell core, the fast on-chip inter-core communication of a multi-core system pays back its high-technology design. On the other hand, the overhead of distant communication in the massively parallel Blue Gene/L machine only shows its benefits for huge data sets that would otherwise mandate multiple round-trip data transfers between the local memory of a core and main memory. Copyright © 2010 John Wiley & Sons, Ltd.
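As a reference point for one of the two applications named above, the following is a generic serial Sobel edge detector in NumPy; it is not the Cell or Blue Gene/L implementation compared in the paper.

```python
import numpy as np

def sobel_edges(img):
    """Return the gradient magnitude of a 2-D grayscale image."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = img.shape
    out = np.zeros_like(img, dtype=float)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            patch = img[y - 1:y + 2, x - 1:x + 2]
            gx, gy = np.sum(patch * kx), np.sum(patch * ky)
            out[y, x] = np.hypot(gx, gy)     # edge strength at (y, x)
    return out

edges = sobel_edges(np.random.rand(64, 64))
print(edges.max())
```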

18.
A true technological explosion has taken place in the computer hardware industry in the last few years. Words such as parallel processing, vector processors, array processors, pipelined machines, ‘number crunching’, and megaflops (Millions of FLoating-point OPerations per Second) are heard regularly. Computer manufacturers have responded to the needs of specific groups requiring, above all else, high speed arithmetic capability. The result is a host of new machines which are called in this paper ‘vector processors’. This paper will assess the applicability of vector processors to power flow and transient stability simulation programs and will indicate how these programs should be organized to run efficiently on these new machines. The approach taken is to survey the entire class of vector processors available now and in the near future, and to attempt to raise the reported low efficiency of sparsity-coded programs on large vector processors by reorganizing their sparse structure.

19.
Current supercomputing centers usually deploy a large-scale compute system together with an associated data analysis or visualization system. Multiple scenarios have driven the demand that some associated jobs co-execute on different machines. We propose a multi-domain coscheduling mechanism, providing the ability to coordinate execution between jobs on multiple resource management domains without manual intervention. We have evaluated our mechanism based on real job traces from Intrepid and Eureka, the production Blue Gene/P system and a cluster with the largest GPU installation, deployed at Argonne National Laboratory. The experimental results show that coscheduling can be achieved with limited impact on system performance under varying workloads.

20.
In this paper, we propose a new algorithm that analyzes the data dependency pattern in the first-order linear recurrence (FOLR) and transforms it into an algebraically equivalent expanded form so that it can be processed in parallel using threads on symmetric multiprocessor (SMP) machines. The transformation aims to eliminate the data dependencies in the naive nested form of the FOLR. However, as this transformation may result in extra multiplication operations, our algorithm examines the immanent overhead of the expanded form of the FOLR and generates a new hybrid form of the FOLR. The hybrid form combines the nested and appropriately expanded forms in order to make it suitable for parallel processing. The parallel algorithm based on the hybrid form of the FOLR is analytically examined and tested through implementation on SMP machines. The implementation details, such as the workload balancing between processors and the optimization of cache performance, are also discussed. The experimental results show that the parallel algorithm based on the hybrid form of the FOLR considerably improves the performance on SMP machines that have three or more processors.
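A schematic sketch of the transformation described above: because x[i] = a[i]·x[i-1] + b[i] is a composition of affine maps, each chunk of the sequence can pre-compose its own (A, B) pair independently, a short serial pass over the chunk maps supplies each chunk's starting value, and the chunks then expand their local results in parallel. The chunking and the thread pool below are illustrative only (CPython threads will not actually give a speedup), and the paper's hybrid nested/expanded form is not reproduced.

```python
from concurrent.futures import ThreadPoolExecutor

def compose_chunk(a, b):
    """Return (A, B) such that x_end = A * x_start + B over this chunk."""
    A, B = 1.0, 0.0
    for ai, bi in zip(a, b):
        A, B = ai * A, ai * B + bi
    return A, B

def expand_chunk(a, b, x_start):
    """Recompute every value in the chunk once its starting value is known."""
    out, x = [], x_start
    for ai, bi in zip(a, b):
        x = ai * x + bi
        out.append(x)
    return out

def parallel_folr(a, b, x0=0.0, chunks=4):
    n = len(a)
    bounds = [(i * n // chunks, (i + 1) * n // chunks) for i in range(chunks)]
    with ThreadPoolExecutor(chunks) as pool:
        # Phase 1 (parallel): each chunk composes its own affine map.
        maps = list(pool.map(lambda s: compose_chunk(a[s[0]:s[1]], b[s[0]:s[1]]), bounds))
        # Phase 2 (serial, one step per chunk): propagate chunk starting values.
        starts, x = [], x0
        for A, B in maps:
            starts.append(x)
            x = A * x + B
        # Phase 3 (parallel): expand each chunk from its starting value.
        parts = list(pool.map(lambda t: expand_chunk(a[t[0]:t[1]], b[t[0]:t[1]], t[2]),
                              [(lo, hi, s) for (lo, hi), s in zip(bounds, starts)]))
    return [v for part in parts for v in part]

a, b = [1.01] * 16, [0.5] * 16
serial, x = [], 0.0
for ai, bi in zip(a, b):
    x = ai * x + bi
    serial.append(x)
print(parallel_folr(a, b)[-1], serial[-1])   # the two forms agree (up to rounding)
```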
