1.

PARCSIM: a parallel computing simulator for scalable software optimization

Cámara Jesús, Cano José-Carlos, Cuenca Javier, Saura-Sánchez Mariano 《The Journal of supercomputing》2022,78(15):17231-17246

PARCSIM is a parallel software simulator that allows a user to capture, through a graphical interface, matrix algorithm schemes that solve scientific problems. With...

2.

The design of an operating system for a scalable parallel computing engine

Paul Austin, Kevin Murray, Andy Wellings 《Software》1991,21(10):989-1013

There are substantial benefits to be gained from building computing systems from a number of processors working in parallel. One of the frequently-stated advantages of parallel and distributed systems is that they may be scaled to the needs of the user. This paper discusses some of the problems associated with designing a general-purpose operating system for a scalable parallel computing engine and then describes the solutions adopted in our experimental parallel operating system. We explain why a parallel computing engine composed of a collection of processors communicating through point-to-point links provides a suitable vehicle in which to realize the advantages of scaling. We then introduce a parallel-processing abstraction which can be used as the basis of an operating system for such a computing engine. We consider how this abstraction can be implemented and retain the ability to scale. As a concrete example of the ideas presented here we describe our own experimental scalable parallel operating-system project, concentrating on the Wisdom nucleus and the Sage file system. Finally, after introducing related work, we describe some of the lessons learnt from our own project.

3.

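Performance Evaluation of Optimized Parallel Computing

Liu Jie, Chi Lihua, Hu Qingfeng 《Computer Engineering and Design》2000,21(6):4-7

The traditional performance evaluation model for parallel computing is speedup. This paper discusses the shortcomings and limitations of speedup and, on that basis, proposes a new performance evaluation model for optimized parallel computing (which we call optimized speedup). Optimized speedup is then used to analyze the performance of the NAS benchmark programs MG and FT on an IBM SP2 (66 MHz/wn).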
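
As a reference point for the traditional speedup model discussed in this abstract, the following Python sketch computes classical speedup S(p) = T(1)/T(p) and parallel efficiency E(p) = S(p)/p from measured run times. It does not implement the authors' optimized speedup, whose definition is not given here; the timings in the usage block are hypothetical.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Classical speedup S(p) = T(1) / T(p)."""
    if t_parallel <= 0:
        raise ValueError("parallel run time must be positive")
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, p: int) -> float:
    """Parallel efficiency E(p) = S(p) / p, the fraction of ideal linear speedup achieved."""
    if p < 1:
        raise ValueError("processor count must be at least 1")
    return speedup(t_serial, t_parallel) / p


if __name__ == "__main__":
    # Hypothetical timings in seconds: 1 processor vs. 8 processors.
    t1, t8 = 120.0, 20.0
    print(f"S(8) = {speedup(t1, t8):.2f}, E(8) = {efficiency(t1, t8, 8):.2f}")  # S(8) = 6.00, E(8) = 0.75
```

Efficiency is shown alongside speedup because the same speedup value means very different things on 8 and on 64 processors.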

4.

Soft computing system for bank performance prediction

《Applied Soft Computing》2008,8(1):305-315

This paper presents a soft computing based bank performance prediction system. It is an ensemble system whose constituent models are a multilayer feed-forward neural network trained with backpropagation (MLFF-BP), a probabilistic neural network (PNN), a radial basis function neural network (RBFN), a support vector machine (SVM), classification and regression trees (CART) and a fuzzy rule-based classifier. Principal component analysis (PCA) based hybrid neural networks, namely PCA-MLFF-BP, PCA-PNN and PCA-RBF, are also included as constituents of the ensemble. Moreover, GRNN and PNN were trained with a genetic algorithm to optimize their smoothing factors. Two ensembles are implemented: (i) one based on simple majority voting and (ii) one based on weighting. The system predicts the performance of a bank in the coming financial year from its financial data for the previous two years. Ten-fold cross-validation is performed in the training sessions, and the results are validated on an independent production set. The ensemble is shown to yield lower Type I and Type II errors than its constituent models. It also outperformed an earlier study [P.G. Swicegood, Predicting poor bank profitability: a comparison of neural network, discriminant analysis and professional human judgement, Ph.D. Thesis, Department of Finance, Florida State University, 1998] that used multivariate discriminant analysis (MDA), MLFF-BP and human judgment.
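
The two ensemble schemes named in this abstract, simple majority voting and weight-based voting, can be sketched as follows for a binary label (for example 1 = healthy bank, 0 = failing bank). The weighted variant assumes each model's weight is supplied externally, for instance its validation accuracy; the abstract does not say how the paper derives its weights, so `weights` is a hypothetical input.

```python
from __future__ import annotations

from collections import Counter
from typing import Sequence


def majority_vote(predictions: Sequence[int]) -> int:
    """Label predicted by most constituent models; ties go to the smallest label."""
    counts = Counter(predictions)
    best = max(counts.values())
    return min(label for label, c in counts.items() if c == best)


def weighted_vote(predictions: Sequence[int], weights: Sequence[float]) -> int:
    """Label with the largest summed model weight; ties go to the smallest label."""
    if len(predictions) != len(weights):
        raise ValueError("one weight per model is required")
    totals: dict[int, float] = {}
    for label, w in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + w
    best = max(totals.values())
    return min(label for label, t in totals.items() if t == best)


if __name__ == "__main__":
    # Hypothetical outputs of five constituent models for one bank.
    preds = [1, 0, 1, 1, 0]
    print(majority_vote(preds))                                  # 1
    print(weighted_vote(preds, [0.6, 0.9, 0.5, 0.55, 0.95]))     # 0
```

The example shows how the two schemes can disagree: three weaker models outvote two stronger ones under majority voting but not under weighting.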

5.

Extending Unix for scalable computing

DeBenedictis E.P., Johnson S.C. 《Computer》1993,26(11):43-53

Because it retrieves all instructions and data from a single memory, the von Neumann computer architecture has a fundamental speed limit. The scalable multicomputer architecture, which uses many microprocessors together to solve a single problem and can run at teraflop speeds, may be a solution. While teraflop processor technology is known, the scalable operating-system and I/O technology needed for those speeds is not. The authors describe how Unix can be extended to scalable computing, permitting teraflop speeds and offering parallel computing to users unfamiliar with parallel programming. They designed this technology into the system software of the Ncube-2, the predecessor of Ncube's announced teraflop parallel computer. The authors describe the system in detail and provide some performance results.

6.

A task-uncoordinated distributed dataflow model for scalable high performance parallel program execution

《Parallel Computing》2016

7.

A class of highly scalable optical crossbar-connected interconnection networks (SOCNs) for parallel computing systems

Webb B., Louri A. 《Parallel and Distributed Systems, IEEE Transactions on》2000,11(5):444-458

A class of highly scalable interconnect topologies called Scalable Optical Crossbar-Connected Interconnection Networks (SOCNs) is proposed. This class of networks combines tunable Vertical Cavity Surface Emitting Lasers (VCSELs), Wavelength Division Multiplexing (WDM) and a scalable, hierarchical network architecture to implement large-scale optical crossbar-based networks. A free-space and optical-waveguide-based crossbar interconnect using tunable VCSEL arrays is proposed for interconnecting processor elements within a local cluster. A similar WDM optical crossbar using optical fibers is proposed for implementing intercluster crossbar links. Combining the two technologies produces large-scale optical fan-out switches that could be used to implement relatively low-cost, large-scale, high-bandwidth, low-latency, fully connected crossbar clusters supporting up to hundreds of processors. An extension of the crossbar architecture is also proposed: a hybrid network architecture that is much more scalable and could connect thousands of processors in a multiprocessor configuration while maintaining low latency and high bandwidth. Such an architecture could be well suited to constructing relatively inexpensive, highly scalable, high-bandwidth and fault-tolerant interconnects for large-scale, massively parallel computer systems. This paper presents a thorough analysis of two example topologies, including a comparison with other popular networks. It also presents an overview of a proposed optical implementation and power budget, along with an analysis of proposed media access control protocols and their optical implementation.

8.

Empirical performance modeling for parallel weather prediction codes

Hermann Mierendorff, Wolfgang Joppich 《Parallel Computing》1999,25(13-14):2135-2148

Performance modeling for large industrial or scientific codes is valuable for program tuning, or for selecting new machines when benchmarking is not yet possible. We discuss an empirical method of estimating runtime for certain large parallel programs, in which computational work is estimated by regression functions based on measurements, and the time cost of communication is modeled by program analysis and benchmarks for communication primitives. The method is demonstrated with the local weather model (LM) of the German Weather Service (DWD) on the SP-2, T3E and SX-4. The method is an economical way of developing performance models because only a moderate number of measurements is required. The resulting model is sufficiently accurate even for very large test cases.
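
A minimal sketch of the two-part model this abstract describes, under assumptions not taken from the paper: per-process computation time is fitted by ordinary least squares as a linear function of the local problem size, and each message is charged a latency-plus-bandwidth (Hockney-style) cost α + β·m for m bytes. The function names, the single regression variable and all numbers are hypothetical.

```python
from __future__ import annotations

from typing import Sequence


def fit_linear(xs: Sequence[float], ys: Sequence[float]) -> tuple[float, float]:
    """Ordinary least-squares fit y = a + b*x; returns (a, b)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired measurements")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def predict_runtime(points_per_proc: float, comp_model: tuple[float, float],
                    messages: Sequence[int], alpha: float, beta: float) -> float:
    """Regression-based computation time plus alpha + beta*bytes for every message."""
    a, b = comp_model
    t_comp = a + b * points_per_proc
    t_comm = sum(alpha + beta * m for m in messages)
    return t_comp + t_comm


if __name__ == "__main__":
    # Hypothetical measured times (s) of one process at three local problem sizes.
    model = fit_linear([1000, 2000, 4000], [0.5, 0.9, 1.7])
    # Extrapolate to a larger case with two 1024-byte halo exchanges.
    print(predict_runtime(8000, model, [1024, 1024], alpha=1e-5, beta=1e-9))
```

Only a few measurements are needed to fit the computation term, which mirrors the economy argument in the abstract; the communication term comes from primitive benchmarks rather than full runs.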

9.

Contention-sensitive static performance prediction for parallel distributed applications

《Performance Evaluation》2006,63(4-5):265-277

Performance prediction for parallel applications running on heterogeneous clusters is difficult because of the unpredictable resource contention patterns found in such environments. Typically, components of a parallel application contend for resources among themselves and with entities external to the application, such as other processes running on the cluster's computers. A performance modeling approach should be able to represent these sources of contention and to produce an estimate of the execution time, preferably in polynomial time. This paper presents a polynomial-time static performance prediction approach in which the prediction takes the form of an interval of values instead of a single value. As the practical examples presented indicate, the extra information carried by an interval represents the variability of the underlying environment more accurately.
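
To illustrate why an interval carries more information than a point estimate, the sketch below propagates [best, worst] execution-time bounds through a simple program structure with plain interval arithmetic: sequential phases add their bounds, and a fork-join phase takes the maximum of each bound. This generic technique is chosen for illustration only; it is not the contention model of the paper, and the task times are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed interval [lo, hi] of possible execution times in seconds."""
    lo: float
    hi: float

    def __post_init__(self) -> None:
        if self.lo > self.hi:
            raise ValueError("lower bound exceeds upper bound")


def sequential(*phases: Interval) -> Interval:
    """Phases run one after another, so their bounds add."""
    return Interval(sum(p.lo for p in phases), sum(p.hi for p in phases))


def fork_join(*branches: Interval) -> Interval:
    """Branches run concurrently and then join: completion waits for the slowest."""
    return Interval(max(b.lo for b in branches), max(b.hi for b in branches))


if __name__ == "__main__":
    # Hypothetical: setup, three workers whose upper bounds reflect contention, then a reduction.
    setup = Interval(1.0, 1.5)
    workers = fork_join(Interval(4.0, 6.0), Interval(4.5, 9.0), Interval(3.0, 5.0))
    reduction = Interval(0.5, 2.0)
    print(sequential(setup, workers, reduction))  # Interval(lo=6.0, hi=12.5)
```

Both operations need only one pass over their operands, so evaluating such an expression stays polynomial in the size of the program structure.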

10.

PAWS: a performance evaluation tool for parallel computing systems

Pease D., Ghafoor A., Ahmad I., Andrews D.L., Foudil-Bey K., Karpinski T.E., Mikki M.A., Zerrouki M. 《Computer》1991,24(1):18-29

11.

MRPC: A high performance RPC system for MPMD parallel computing

Chi‐Chao Chang, Grzegorz Czajkowski, Thorsten Von Eicken 《Software》1999,29(1):43-66

MRPC is an RPC system that is designed and optimized for MPMD parallel computing. Existing systems based on standard RPC incur an unnecessarily high cost when used on high‐performance multi‐computers, limiting the appeal of RPC‐based languages in the parallel computing community. MRPC combines the efficient control and data transfer provided by Active Messages (AM) with a minimal multithreaded runtime system that extends AM with the features required to support MPMD. This approach introduces only the RPC overheads that an MPMD environment actually needs. MRPC has been integrated into Compositional C++ (CC++), a parallel extension of C++ that offers an MPMD programming model. Basic performance in MRPC is within a factor of two of that of Split‐C, a highly tuned SPMD language, and of other messaging layers. CC++ applications perform within a factor of two to six of comparable Split‐C versions, an order-of-magnitude improvement over previous CC++ implementations. Copyright © 1999 John Wiley & Sons, Ltd.

12.

A novel scalability metric about iso-area of performance for parallel computing

Huanliang Xiong, Guosun Zeng, Yuan Zeng, Wei Wang, Canghai Wu 《The Journal of supercomputing》2014,68(2):652-671

Scalability is an important performance metric of parallel computing, but traditional scalability metrics capture it from only one perspective, which makes it difficult to measure overall performance fully. This paper studies scalability metrics in depth. From the many performance parameters of parallel computing, a group of key parameters is chosen and normalized. The area of a Kiviat graph is then used to characterize the overall performance of parallel computing. On this basis, a novel scalability metric based on iso-area of performance is proposed, and the relationship between the new metric and traditional ones is analyzed. Finally, the metric is applied to the scalability of Cannon's matrix multiplication algorithm under the LogP model. The proposed metric is useful for improving parallel computing architectures and tuning parallel algorithm design.
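
The area of a Kiviat (radar) graph with n normalized metrics r_1, ..., r_n plotted on equally spaced axes is the sum of the n triangles between adjacent axes, A = ½ sin(2π/n) Σ r_i r_{i+1}, with indices taken mod n. The sketch below computes that area. Which metrics are chosen and how they are normalized are assumptions here, not the paper's selection; note also that the area depends on the order in which metrics are assigned to axes.

```python
from __future__ import annotations

import math
from typing import Sequence


def kiviat_area(values: Sequence[float]) -> float:
    """Area of the polygon formed by normalized metrics on n equally spaced axes."""
    n = len(values)
    if n < 3:
        raise ValueError("a Kiviat graph needs at least three axes")
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("metrics must be normalized to [0, 1]")
    # Each adjacent pair of axes spans a triangle with apex angle 2*pi/n.
    wedge = math.sin(2 * math.pi / n) / 2
    return wedge * sum(values[i] * values[(i + 1) % n] for i in range(n))


def max_area(n: int) -> float:
    """Area when every metric equals 1: the regular n-gon of unit radius."""
    return kiviat_area([1.0] * n)


if __name__ == "__main__":
    # Hypothetical normalized metrics, e.g. speedup, efficiency, utilization, 1 - overhead.
    metrics = [0.8, 0.6, 0.7, 0.9]
    print(kiviat_area(metrics) / max_area(len(metrics)))
```

By analogy with isoefficiency, one reading of an iso-area metric is to ask how the problem size must grow, as processors are added, to keep this area (or its ratio to the maximum) constant.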

13.

Abstractions for portable, scalable parallel programming

Alverson G.A., Griswold W.G., Lin C., Notkin D., Snyder L. 《Parallel and Distributed Systems, IEEE Transactions on》1998,9(1):71-86

14.

Rover: scalable location-aware computing

Banerjee S., Agarwal S., Kamel K., Kochut A., Kommareddy C., Nadeem T., Thakkar P., Bao Trinh, Youssef A., Youssef M., Larsen R.L., Udaya Shankar A., Agrawala A. 《Computer》2002,35(10):46-53

All the components necessary for realizing location-aware computing are available in the marketplace today. What has hindered the widespread deployment of location-based systems is the lack of an integration architecture that scales with user populations. The authors have completed the initial implementation of Rover, a system designed to achieve this sort of integration and to automatically tailor information and services to a mobile user's location. Their studies have validated Rover's underlying software architecture, which achieves system scalability through high-resolution, application-specific resource scheduling at the servers and network. The authors believe that this technology will greatly enhance the user experience in many places, including museums, amusement and theme parks, shopping malls, game fields, offices, and business centers. They designed the system specifically to scale to large user populations and expect its benefits to increase with them.

15.

Portability, predictability and performance for parallel computing: BSP in practice

Joy Reed, Kevin Parrott, Tim Lanfear 《Concurrency and Computation》1996,8(10):799-812

We report on practical experience using the Oxford BSP Library to parallelize a large electromagnetic code, the British Aerospace finite-difference time-domain code EMMA T:FD3D. The Oxford BSP Library is one of the first realizations of the Bulk Synchronous Parallel computational model to be targeted at numerically intensive scientific (typically Fortran) computing. The BAe EMMA code is one of the first large-scale applications to be parallelized using this library, and it is an important demonstration of the cost-effectiveness of the BSP approach. We illustrate how BSP cost-modelling techniques can be used to predict and optimize performance for single-source programs across different parallel platforms. We provide predicted and observed performance figures for an industrial-strength, single-source parallel code on a variety of real parallel architectures: shared-memory multiprocessors, workstation clusters and massively parallel platforms.
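
BSP cost modelling of the kind this abstract mentions charges each superstep w + h·g + l, where w is the maximum local computation on any processor, h the maximum number of words any processor sends or receives, g the machine's per-word communication cost and l its barrier synchronization cost; a program's predicted cost is the sum over its supersteps. The sketch below applies that standard formula to one single-source program on two machines. The machine parameters and superstep profiles are hypothetical, not figures from the EMMA study.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Machine:
    """BSP machine parameters, both expressed in units of local operation time."""
    g: float  # cost per word communicated
    l: float  # cost of one barrier synchronization


@dataclass(frozen=True)
class Superstep:
    w: float  # maximum local work over all processors
    h: int    # maximum words sent or received by any processor


def bsp_cost(steps: Sequence[Superstep], m: Machine) -> float:
    """Predicted cost: sum of w + h*g + l over all supersteps."""
    return sum(s.w + s.h * m.g + m.l for s in steps)


if __name__ == "__main__":
    program = [Superstep(w=1e6, h=2000), Superstep(w=5e5, h=8000)]
    # The same single-source program evaluated on two hypothetical platforms.
    for name, machine in {"cluster": Machine(g=50, l=1e5), "mpp": Machine(g=4, l=2e4)}.items():
        print(name, bsp_cost(program, machine))
```

Because the program profile (w, h per superstep) is separated from the machine parameters (g, l), the same analysis can be reused to compare platforms, which is the portability argument for single-source BSP codes.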

16.

Synchronization transformations for parallel computing

Pedro C. Diniz, Martin C. Rinard 《Concurrency and Computation》1999,11(13):773-802

This article describes a framework for synchronization optimizations and a set of transformations for programs that implement critical sections using mutual exclusion locks. The basic synchronization transformations take constructs that acquire and release locks and move these constructs both within and between procedures. They also eliminate acquire and release constructs that use the same lock and are adjacent in the program. The article also presents a synchronization optimization algorithm, lock elimination, that uses these transformations to reduce synchronization overhead. This algorithm locates computations that repeatedly acquire and release the same lock, then transforms them so that they acquire and release the lock only once. The goal is to reduce lock overhead by reducing the number of times computations acquire and release locks. But because the algorithm also increases the sizes of the critical sections, it may decrease the amount of available concurrency. The algorithm addresses this trade-off by providing several optimization policies, which differ in how much they increase the sizes of the critical sections. Experimental results from a parallelizing compiler for object-based programs illustrate the practical utility of the lock elimination algorithm. For three benchmark applications, the algorithm can dramatically reduce the number of times the applications acquire and release locks, which significantly reduces the time processors spend acquiring and releasing locks. The resulting overall performance improvements for these benchmarks range from no observable improvement to a 30% improvement. Copyright © 1999 John Wiley & Sons, Ltd.
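
The core lock-elimination idea in this abstract can be illustrated with Python's `threading.Lock`: a loop that acquires and releases the same lock on every iteration is rewritten to acquire it once around the whole loop. This is a hand-written sketch of the transformation, not the authors' compiler; `CountingLock` is a hypothetical helper added only to make the drop in acquisitions visible. As the abstract notes, the transformed version holds the lock longer and so exposes less concurrency.

```python
from __future__ import annotations

import threading


class CountingLock:
    """Wraps threading.Lock and counts how many times it is acquired."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.acquisitions = 0

    def __enter__(self) -> "CountingLock":
        self._lock.acquire()
        self.acquisitions += 1  # safe: updated while holding the lock
        return self

    def __exit__(self, *exc: object) -> None:
        self._lock.release()


def add_all_original(total: list[int], items: list[int], lock: CountingLock) -> None:
    """Before: one short critical section per item, i.e. many acquire/release pairs."""
    for x in items:
        with lock:
            total[0] += x


def add_all_lock_eliminated(total: list[int], items: list[int], lock: CountingLock) -> None:
    """After: the repeated critical sections on the same lock merged into one larger one."""
    with lock:
        for x in items:
            total[0] += x


def run(worker, n_threads: int = 4, n_items: int = 1000) -> tuple[int, int]:
    """Run worker on several threads; return (final total, number of lock acquisitions)."""
    total, lock = [0], CountingLock()
    items = list(range(n_items))
    threads = [threading.Thread(target=worker, args=(total, items, lock)) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0], lock.acquisitions


if __name__ == "__main__":
    print(run(add_all_original))         # (1998000, 4000)
    print(run(add_all_lock_eliminated))  # (1998000, 4)
```

Both versions compute the same result; only the number of acquire/release operations changes, which is exactly the overhead the optimization targets.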

17.

A scalable high-performance computing solution for networks on chips

《Micro, IEEE》2002,22(5):46-55

The Eclipse network-on-a-chip architecture uses a sophisticated parallel programming model, realized through multithreaded processors, interleaved memory modules and a high-capacity interconnection network, to support system-on-a-chip designs.

18.

Designing for parallel fuzzy computing

Ascia G., Catania V., Giacalone B., Russo M., Vita L. 《Micro, IEEE》1995,15(6):62

As the number of fuzzy logic applications increases, demand for faster architectures will grow. Our design for a VLSI fuzzy processor uses fuzzy inference techniques that optimize processing time. Preprocessing that reduces the number of rules to be processed, parallel computation of active rule degrees of activation, and scalability are major features of this architecture. The journal issue contains a concise summary of this article. The complete article is linked to Micro's home page on the World Wide Web (http://www.computer.org/pubs/micro/micro.htm)
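
Two ideas highlighted in this abstract, discarding rules that cannot fire before inference and computing the degrees of activation of the remaining rules independently of one another (hence in parallel in hardware), can be sketched in software. This sketch assumes triangular membership functions, the min operator for combining antecedents and a weighted-average (Sugeno-style) output; the abstract does not specify these choices, and `RULES` is a hypothetical rule base.

```python
from __future__ import annotations

from typing import Sequence

# A rule: one triangular set (a, b, c) per input, plus a crisp consequent value.
Rule = tuple[Sequence[tuple[float, float, float]], float]


def triangular(x: float, a: float, b: float, c: float) -> float:
    """Membership of x in a triangular fuzzy set with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


def active_rules(inputs: Sequence[float], rules: Sequence[Rule]) -> list[Rule]:
    """Preprocessing: keep only rules whose every antecedent has non-zero membership."""
    return [r for r in rules if all(a < x < c for x, (a, _, c) in zip(inputs, r[0]))]


def infer(inputs: Sequence[float], rules: Sequence[Rule]) -> float:
    active = active_rules(inputs, rules)
    if not active:
        raise ValueError("no rule fires for these inputs")
    # Each degree of activation depends only on its own rule, so these could run in parallel.
    degrees = [min(triangular(x, *mf) for x, mf in zip(inputs, ants)) for ants, _ in active]
    return sum(d * out for d, (_, out) in zip(degrees, active)) / sum(degrees)


RULES: list[Rule] = [
    ([(0, 5, 10), (0, 5, 10)], 10.0),
    ([(0, 5, 10), (5, 10, 15)], 20.0),
    ([(10, 15, 20), (0, 5, 10)], 99.0),  # cannot fire for the inputs below
]

if __name__ == "__main__":
    print(len(active_rules([4.0, 6.0], RULES)), infer([4.0, 6.0], RULES))
```

For inputs (4.0, 6.0) the third rule is discarded in preprocessing, and the remaining two fire with degrees 0.8 and 0.2, giving a weighted output of about 12.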

19.

Experiences using high performance computing for operational storm scale weather prediction

A. Sathye, G. Bassett, K. Droegemeier, M. Xue, K. Brewster 《Concurrency and Computation》1996,8(10):731-740

The Center for Analysis and Prediction of Storms (CAPS) has developed a storm-scale prediction model named the Advanced Regional Prediction System (ARPS). CAPS has tested the ARPS in an operational mode each spring since 1993 to evaluate the model and to involve the operational community in CAPS' development efforts. In this paper we describe our experiences using high performance computing in an operational setting.

20.

Highly scalable parallel algorithms for sparse matrix factorization

Gupta A., Karypis G., Kumar V. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(5):502-520

In this paper, we describe scalable parallel algorithms for symmetric sparse matrix factorization, analyze their performance and scalability, and present experimental results for up to 1,024 processors on a Cray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algorithms substantially improve the state of the art in parallel direct solution of sparse linear systems, both in terms of scalability and overall performance. It is well known that dense matrix factorization scales well and can be implemented efficiently on parallel computers. In this paper, we present the first algorithms to factor a wide class of sparse matrices (including those arising from two- and three-dimensional finite element problems) that are asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures. Our algorithms incur less communication overhead and are more scalable than any previously known parallel formulation of sparse matrix factorization. Although this paper discusses Cholesky factorization of symmetric positive definite matrices, the algorithms can be adapted for solving sparse linear least squares problems and for Gaussian elimination of diagonally dominant matrices that are almost symmetric in structure. An implementation of one of our sparse Cholesky factorization algorithms delivers up to 20 GFlops on a Cray T3D for medium-size structural engineering and linear programming problems. To the best of our knowledge, this is the highest performance ever obtained for sparse Cholesky factorization on any supercomputer.