Similar Documents
20 similar documents found (search time: 15 ms)
1.
Lazzerini, B. Micro, IEEE, 1989, 9(1): 57-65
The design principles of reduced-instruction-set computer (RISC) architectures as they apply to VLSI implementation for high-level languages (HLLs) are presented. The nature of general-purpose HLL computations is discussed in terms of static and dynamic program measurements, and the HLL features that need efficient support are identified. CISC (complex-instruction-set computer) and RISC approaches to general-purpose HLL computers are outlined, the effects of instruction-set reduction on both code size and execution time are evaluated, and the delayed-jump concept is introduced. The Berkeley RISC architecture is presented as an example.

2.
Effective high-level data management is becoming an important issue as more and more scientific applications manipulate huge amounts of secondary- and tertiary-storage data using parallel processors. A major problem with current solutions is that they either require a deep understanding of specific data storage architectures and file layouts to obtain the best performance (as in high-performance storage management systems and parallel file systems), or they sacrifice significant performance in exchange for ease of use and portability (as in traditional database management systems). We discuss the design, implementation, and evaluation of a novel application development environment for scientific computations. This environment includes a number of components that make it easy for programmers to code and run their applications without much programming effort and, at the same time, to harness the available computational and storage power of parallel architectures.

3.
Computing reduced-order models of controlled dynamical systems is of fundamental importance in many analysis and synthesis problems in systems and control theory. Algorithmic aspects of model reduction methods based on state-space truncation for linear discrete-time systems are addressed here. In contrast to the often-used approach of applying methods for continuous-time systems to discrete-time models via a bilinear transformation, we devise special algorithms for discrete-time systems; this is usually more reliable and efficient. All methods discussed require, in an initial stage, the computation of the Gramians of the system. Using an accelerated fixed-point iteration to compute full-rank factors of the Gramians yields favorable computational properties, particularly for non-minimal systems. The computations require only efficient implementations of basic linear algebra operations readily available on modern computer architectures. We discuss aspects of the parallel implementation of these methods and show their performance and scalability on distributed-memory computers. Our approach enables users to deal with very complex systems using relatively cheap infrastructure, such as a local PC or workstation network.
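As a rough illustration of the Gramian stage, here is a minimal NumPy sketch of a squared Smith fixed-point iteration for the discrete-time controllability Gramian, assuming A is Schur stable; note the authors iterate on low-rank Gramian factors rather than the dense matrix used here:

    import numpy as np

    def stein_gramian(A, B, tol=1e-12, max_iter=50):
        # Solve the Stein equation A X A^T - X + B B^T = 0 for the
        # controllability Gramian X of a stable discrete-time system (A, B)
        # via the squared Smith iteration (the error squares every step).
        X = B @ B.T                         # X_0 = B B^T
        Ak = A.copy()                       # A_0 = A
        for _ in range(max_iter):
            X_next = X + Ak @ X @ Ak.T      # X_{k+1} = X_k + A_k X_k A_k^T
            Ak = Ak @ Ak                    # A_{k+1} = A_k^2
            if np.linalg.norm(X_next - X) <= tol * np.linalg.norm(X_next):
                return X_next
            X = X_next
        return X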

4.
Concepts and implementation of parallel finite element analysis (total citations: 1; self-citations: 0; citations by others: 1)
The design of complex engineering systems such as advanced aircraft structures and offshore platforms requires continually increasing levels of detail in supporting analysis. The finite element method is widely used as a computational method with which to model physical systems in various engineering problems. For detailed analyses of complex designs, structural models composed of several thousand degrees of freedom are no longer uncommon. Such design activities require large-order finite element and/or finite difference models and impose heavy computational demands on both calculation speed and information management. The computer simulation of the nonlinear dynamic response of structures and the implementation of parallel FEM systems on high-speed multiprocessors have received considerable attention in recent years. The driving forces behind these activities include the reliable simulation of automotive and aircraft crash phenomena and the increased performance of computers. Most existing major structural analysis software systems were designed 10–20 years ago and have been optimized for sequential computers. Such systems are often not well structured to take maximum advantage of the recent and continuing revolution in parallel vector computing capabilities. These parallel vector architectures not only occur in the form of large supercomputers, but now also appear in minicomputers and even engineering workstations. To benefit from advances in parallel computers, software must be developed that takes maximum advantage of parallel processing.

5.
This article discusses the computational structure of the most effective methods for factoring integers and the computer architectures (existing and in use, proposed, and under construction) that efficiently perform the computations of these various methods. New developments in technology and in the pricing of computers are making it possible to build powerful parallel machines, at relatively low cost, that can substantially outperform standard computers on specific types of computations. The intent of this article is to use factoring, and computers for factoring, to provoke general thought about this matching of computer architectures to algorithms and computations. The author's research at Louisiana State University was supported in part by the National Science Foundation and the National Security Agency under grants NSF DCR 83-115-80 and NSA MDA904-85-H-0006.
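To make the notion of "computational structure" concrete (an illustrative sketch only; the article surveys several factoring methods), Pollard's rho reduces factoring to a tight loop of modular multiplications and gcd computations, exactly the kind of regular arithmetic kernel that specialized parallel hardware can accelerate:

    from math import gcd

    def pollard_rho(n, c=1):
        # Iterate x -> x^2 + c (mod n); once the tortoise and hare fall
        # into the same cycle modulo a prime factor of n, gcd reveals it.
        x = y = 2
        d = 1
        while d == 1:
            x = (x * x + c) % n       # tortoise: one step
            y = (y * y + c) % n       # hare: two steps
            y = (y * y + c) % n
            d = gcd(abs(x - y), n)
        return d if d != n else None  # None: retry with a different c

    print(pollard_rho(8051))          # 97 (8051 = 83 * 97)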

6.
In this paper we discuss the implementation of an ADI method for solving the diffusion equation on three parallel/vector computers, chosen to encompass a variety of architectures: the MPP, an SIMD machine with 16K bit-serial processors; the Flex/32, an MIMD machine with 20 processors; and the Cray/2, an MIMD machine with four vector processors. The Gaussian elimination algorithm is used to solve a set of tridiagonal systems on the Flex/32 and Cray/2, while the cyclic elimination algorithm is used to solve these systems on the MPP. The implementation of the method is discussed in relation to these architectures, and measures of the performance on each machine are given. Simple performance models are used to describe the performance; these models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.
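For reference, a minimal sketch of tridiagonal Gaussian elimination (the Thomas algorithm) of the kind usable on the Flex/32 and Cray/2; its forward recurrence is inherently sequential, which is why a bit-serial SIMD machine like the MPP uses cyclic elimination instead:

    import numpy as np

    def thomas(a, b, c, d):
        # Solve a tridiagonal system: a = sub-diagonal (n-1), b = diagonal (n),
        # c = super-diagonal (n-1), d = right-hand side (n).
        n = len(b)
        cp, dp = np.empty(n - 1), np.empty(n)
        cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
        for i in range(1, n):                       # forward elimination
            m = b[i] - a[i - 1] * cp[i - 1]
            if i < n - 1:
                cp[i] = c[i] / m
            dp[i] = (d[i] - a[i - 1] * dp[i - 1]) / m
        x = np.empty(n)
        x[-1] = dp[-1]
        for i in range(n - 2, -1, -1):              # back substitution
            x[i] = dp[i] - cp[i] * x[i + 1]
        return x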

7.
Timed Petri nets are used to model numerous types of large complex systems, especially computer architectures and communication networks. While formal analysis of such models is sometimes possible, discrete-event simulation remains the most general technique available for assessing a model's behavior. Simulation's computational requirements, however, can be massive, especially on the large complex models that defeat analytic methods. One way of meeting these requirements is by executing the simulation on a parallel machine. This paper describes simple techniques for the automated parallelization of timed Petri-net simulations. We address the issue of processor synchronization as well as the automated mapping, both static and dynamic, of the Petri net to the parallel architecture. As part of this effort we describe a new mapping algorithm, one that also applies to more general parallel computations. We establish analytic properties of the solution produced by the algorithm, including optimality on some regular topologies. The viability of our integrated approach is demonstrated empirically on the Intel iPSC/860 and Delta architectures on Petri-net-based simulations of parallel architectures.
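To fix ideas, a toy sequential discrete-event simulator for a timed Petri net (a sketch of the model class only, not the paper's parallelized simulator; transitions are (inputs, outputs, delay) triples with nonempty inputs and positive delays):

    import heapq
    from itertools import count

    def simulate_timed_pn(marking, transitions, horizon):
        # marking: {place: tokens}; transitions: [(inputs, outputs, delay)].
        events, tick, clock = [], count(), 0.0
        while clock <= horizon:
            # Fire enabled transitions: consume input tokens immediately,
            # schedule output tokens to appear after the firing delay.
            for inputs, outputs, delay in transitions:
                while all(marking.get(p, 0) >= n for p, n in inputs.items()):
                    for p, n in inputs.items():
                        marking[p] -= n
                    heapq.heappush(events, (clock + delay, next(tick), outputs))
            if not events:
                break
            # Advance the clock to the earliest completion, deposit tokens.
            clock, _, outputs = heapq.heappop(events)
            for p, n in outputs.items():
                marking[p] = marking.get(p, 0) + n
        return clock, marking

    # Pipeline: t1 moves tokens p1 -> p2, taking 1.0 time units per firing.
    print(simulate_timed_pn({"p1": 3}, [({"p1": 1}, {"p2": 1}, 1.0)], 10.0))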

8.
Talia, D. Computer, 2000, 33(9): 44-52
Cellular automata (CA) offer a powerful modeling approach for complex systems in which global behavior arises from the collective effect of many locally interacting, simple components. Several tools based on CA are providing meaningful results for real-world applications. Cellular automata represent an efficient paradigm for the computer solution of important problems in science and engineering. Moreover, the CA model lets researchers effectively use parallel computers to achieve scalable performance. As researchers use parallel computers to solve scientific problems, they will need problem representations (paradigms) suited to this class of computers. Abstract mathematical models that offer an implicitly parallel representation of problems better match those architectures, but could benefit from new high-level languages, environments, and techniques; these should support all the development steps of computational science applications while hiding architectural details from users. Computational science is also an interdisciplinary field in which many areas converge, and developing applications in this field requires the cooperation of people from different domains. Modeling and simulation using parallel cellular methods helps researchers cooperate by offering both a way to code an algorithm and an integrated environment for developing software.
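As a minimal example of the paradigm, here is one synchronous step of Conway's Game of Life in NumPy; since every cell applies the same local rule to its neighborhood, the grid can be partitioned among processors with only boundary exchanges between neighbors:

    import numpy as np

    def life_step(grid):
        # Count the 8 neighbors of every cell (periodic boundaries),
        # then apply the local birth/survival rule to all cells at once.
        n = sum(np.roll(np.roll(grid, i, axis=0), j, axis=1)
                for i in (-1, 0, 1) for j in (-1, 0, 1) if (i, j) != (0, 0))
        return ((n == 3) | ((grid == 1) & (n == 2))).astype(grid.dtype)

    grid = np.random.default_rng(0).integers(0, 2, size=(64, 64))
    grid = life_step(grid)   # each step is a uniform, fully parallel update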

9.
In massively parallel computers (MPCs), efficient communication among processors is critical to performance. This paper describes the initial implementation of the ComPaSS communication library to support scalable software development in MPCs. ComPaSS provides high-level global communication operations for both data manipulation and process control, many of which are based upon a small set of low-level communication primitives. The low-level operations of the ComPaSS library are provably optimal for a class of architectures representative of many commercial scalable systems, in particular those using wormhole routing and n-dimensional mesh network topologies. This paper concentrates on the multicast and multireceive components of the ComPaSS library, which are fundamental to implementing efficient high-level data parallel operations. The design of the multicast and multireceive primitives is described and an example of a data parallel application utilizing ComPaSS multicast is given. The scalability of these primitives is discussed, and improvements in performance resulting from use of the library on a 64-node nCUBE-2 are presented.
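One classical way a multicast can be made to scale is recursive doubling on a binomial tree (a sketch for intuition only, not claimed to be ComPaSS's actual wormhole/mesh algorithm): the source reaches all n nodes in ceil(log2 n) rounds, with every already-informed node forwarding in parallel.

    def binomial_multicast_schedule(n):
        # Return, for each round, the (sender, receiver) pairs among
        # nodes 0..n-1; the set of informed nodes doubles every round.
        rounds, reached = [], 1
        while reached < n:
            rounds.append([(src, src + reached)
                           for src in range(reached) if src + reached < n])
            reached *= 2
        return rounds

    print(binomial_multicast_schedule(8))
    # [[(0, 1)], [(0, 2), (1, 3)], [(0, 4), (1, 5), (2, 6), (3, 7)]]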

10.
Techniques are described for the automatic generation of self-scheduling parallel programs. Both scheduling algorithms and the concurrent components of applications are expressed in a high-level concurrent language. Partitioning and data dependency information are expressed by simple control statements, which may be generated either automatically or manually. A self-scheduling compiler, implemented as a source-to-source transformation, takes application code, control statements, and scheduling routines and generates a new program that can schedule its own execution on a parallel computer. The approach has several advantages compared to previous proposals. It generates programs that are portable over a wide range of parallel computers. There is no need to embed special control structures in application programs. The use of a high-level language to express applications and scheduling algorithms facilitates the development, modification, and reuse of parallel programs.
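The runtime behavior that such generated programs implement can be sketched with a shared work queue from which idle workers pull tasks (hypothetical names; the actual system is a source-to-source compiler, not a library):

    import queue
    from concurrent.futures import ThreadPoolExecutor

    def self_schedule(tasks, n_workers, run):
        # Idle workers repeatedly take the next task from a shared queue,
        # so load balancing happens at run time (results arrive unordered).
        q = queue.Queue()
        for t in tasks:
            q.put(t)
        results = []
        def worker():
            while True:
                try:
                    t = q.get_nowait()
                except queue.Empty:
                    return
                results.append(run(t))
        with ThreadPoolExecutor(n_workers) as ex:
            for _ in range(n_workers):
                ex.submit(worker)
        return results

    print(self_schedule(range(10), 4, lambda t: t * t))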

11.
Recent research in reduced instruction set computer architectures has emphasized the importance of the empirical approach to designing computer architectures: architectural features are analyzed for utility and cost with respect to the system software that uses them. This approach has resulted in architectural simulators that allow computer designers to vary the features of the architecture being simulated and to analyze how the addition or removal of these features affects the cost and performance of the architecture. In this paper we apply this technique to a new area: reconfigurable architectures. Our approach is to use an empirical methodology that emphasizes the interaction between the target software and the reconfigurability features of parallel architectures. We have developed a set of tools, the reconfigurable architecture workbench, that assists in this methodology by allowing parallel programs to be simulated on a target architecture in order to study the performance implications of various reconfigurability features. The workbench is based on a framework, the PCI model, which describes the range of parallel programs, parallel architectures, and reconfiguration features. We present details of the design and implementation of a prototype workbench, GT-RAW. GT-RAW is being used to study the utility of one dimension of reconfiguration for image processing and image understanding applications. We present an example of the experiments that are being conducted with GT-RAW as a demonstration of our empirical methodology.

12.
A number of recent studies have revealed that Optical Transpose Interconnection Systems (OTIS) are promising candidates for future high-performance parallel computers. In this paper, we present and evaluate two general methods for algorithm development on the OTIS. The proposed methods are general in the sense that no specific factor network or problem domain is assumed. They allow efficient mapping of a wide class of algorithms onto the OTIS. These methods are based on grids and pipelines, popular structures that support a vast body of parallel applications including linear algebra, divide-and-conquer algorithms, sorting, and FFT computation. Timing models for measuring the performance of the proposed methods are also provided. Through these models, the performance of various algorithms on the OTIS is evaluated and compared with their counterparts on conventional electronic interconnection systems. This study confirms the viability of the OTIS as an attractive alternative for large-scale parallel architectures. Finally, we show how the proposed methods can be used to design parallel algorithms for linear algebra on the OTIS.

13.
We describe portable software to simulate universal quantum computers on massively parallel computers. We illustrate the use of the simulation software by running various quantum algorithms on different computer architectures, such as an IBM BlueGene/L, an IBM Regatta p690+, a Hitachi SR11000/J1, a Cray X1E, an SGI Altix 3700, and clusters of PCs running Windows XP. We study the performance of the software by simulating quantum computers containing up to 36 qubits, using up to 4096 processors and up to 1 TB of memory. Our results demonstrate that the simulator exhibits nearly ideal scaling as a function of the number of processors and suggest that the simulation software described in this paper may also serve as a benchmark for testing high-end parallel computers.
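The central data structure in such a simulator is a vector of 2^n complex amplitudes; at 16 bytes per amplitude, 36 qubits require exactly 1 TB, matching the memory figure above. A minimal NumPy sketch of applying a single-qubit gate (the parallel version distributes this vector across processors):

    import numpy as np

    def apply_single_qubit_gate(state, gate, target, n_qubits):
        # View the 2**n amplitudes as an n-dimensional array with one
        # axis per qubit, then contract the 2x2 gate with the target axis.
        psi = state.reshape([2] * n_qubits)
        psi = np.moveaxis(psi, target, 0)
        psi = np.tensordot(gate, psi, axes=([1], [0]))
        psi = np.moveaxis(psi, 0, target)
        return psi.reshape(-1)

    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)      # Hadamard gate
    state = np.zeros(2 ** 3, dtype=complex)
    state[0] = 1.0                                    # |000>
    state = apply_single_qubit_gate(state, H, target=0, n_qubits=3)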

14.
Computer benchmarking is a common method for measuring the parameters of a computational model on any given computer. With the emergence of multicore computers, their evaluation has come under renewed consideration. Since these computers can be viewed as parallel computers, evaluation methods for parallel computers may seem appropriate for them. However, because multicore architectures depend heavily on the cache hierarchy, new and different benchmarks are needed to evaluate them correctly. To this end, this paper presents a method for measuring the parameters of one of the best-known multicore computational models, Multi-Bulk Synchronous Parallel (Multi-BSP). The method measures the hardware latency parameters of multicore computers, namely the communication latency (g_i) and the synchronization latency (L_i), for all levels of the cache memory hierarchy in a bottom-up manner. Once these parameters are determined, the performance of algorithms on multicore architectures can be evaluated.
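As a toy illustration of the flavor of such a micro-benchmark (not the paper's Multi-BSP procedure, which works level by level through the cache hierarchy), a synchronization cost in the spirit of L can be estimated by timing repeated barriers:

    import threading
    import time

    def barrier_latency(n_threads, reps=10000):
        # Average wall-clock time for all threads to pass one barrier;
        # thread start-up cost is amortized over many repetitions.
        barrier = threading.Barrier(n_threads)
        def worker():
            for _ in range(reps):
                barrier.wait()
        threads = [threading.Thread(target=worker) for _ in range(n_threads)]
        t0 = time.perf_counter()
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return (time.perf_counter() - t0) / reps

    print(barrier_latency(4))   # seconds per barrier among 4 threads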

15.
It is relatively clear how to map regular, repetitive, or grid-oriented computations onto SIMD architectures. It is not so clear, however, how to do this for irregular computations, even though there may be significant amounts of intrinsic parallelism in branch-free code. We study compilation techniques for this type of code when targeted to SIMD computers and illustrate their use on a simple model architecture. In this paper, we present one of the compilation techniques we have developed for SIMD computers, global register allocation, and demonstrate that it can effectively allocate registers for parallelizing irregular computations in branch-free code. This technique is an extension and modification of the register allocation via graph coloring approach used by sequential compilers. Our performance results validate our method.
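For context, here is the classical sequential idea being extended: allocate registers by coloring an interference graph so that no two simultaneously live values share a register (a greedy sketch, not the authors' SIMD-specific algorithm):

    def color_registers(interference, k):
        # interference: {vreg: set of vregs live at the same time};
        # k: number of machine registers. None marks a value to spill.
        order = sorted(interference,
                       key=lambda v: len(interference[v]), reverse=True)
        colors = {}
        for v in order:
            used = {colors[u] for u in interference[v] if u in colors}
            free = [c for c in range(k) if c not in used]
            colors[v] = free[0] if free else None
        return colors

    ig = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
    print(color_registers(ig, k=2))   # e.g. {'a': 0, 'b': 1, 'c': 1}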

16.
In the last 15 years we have seen, as a response to power and thermal limits of current chip technologies, an explosion in the use of multiple and even many computer cores on a single chip. Now, to further improve performance and energy efficiency when there are potentially hundreds of computing cores on a chip, we see a need for specialization of individual cores and for the development of heterogeneous manycore computer architectures. However, developing such heterogeneous architectures is a significant challenge. Therefore, we propose a design method to generate domain-specific manycore architectures based on the RISC-V instruction set architecture, and we automate the main steps of this method with software tools. The design method allows generation of manycore architectures with different configurations, including core augmentation through instruction extensions and custom accelerators. The method starts from developing applications in a high-level dataflow language and ends with generating synthesizable Verilog code and a cycle-accurate emulator for the generated architecture. We evaluate the design method and the software tools by generating several architectures specialized for two different applications and measuring their performance and hardware resource usage. Our results show that the design method can be used to generate specialized manycore architectures targeting applications from different domains. The specialized architectures show at least 3 to 4 times better performance than their general-purpose counterparts. In certain cases, replacing general-purpose components with specialized components saves hardware resources. Automating the method increases the speed of architecture development and facilitates design space exploration of manycore architectures.

17.
The analysis of complex neural network models via analytical techniques is often quite difficult due to the large numbers of components involved and the nonlinearities associated with these components. The authors present a framework for simulating neural networks as discrete event nonlinear dynamical systems. This includes neural network models whose components are described by continuous-time differential equations or by discrete-time difference equations. Specifically, the authors consider the design and construction of a concurrent object-oriented discrete event simulation environment for neural networks. The use of an object-oriented language provides the data abstraction facilities necessary to support modification and extension of the simulation system at a high level of abstraction. Furthermore, the ability to specify concurrent processing supports execution on parallel architectures. The use of this system is demonstrated by simulating a specific neural network model on a general-purpose parallel computer.

18.
The resolution of combinatorial optimization problems can greatly benefit from the parallel and distributed processing characteristic of neural network paradigms. Nevertheless, the fine-grain parallelism of the usual neural models cannot be implemented entirely efficiently either on general-purpose multicomputers or on networks of computers, which are nowadays the most common parallel computer architectures. Therefore, we present a parallel implementation of a modified Boltzmann machine in which the neurons are distributed among the processors of the multicomputer, which asynchronously compute the evolution of their subset of neurons using values for the other neurons that might not be up to date, thus reducing the communication requirements. Several alternatives for allowing the processors to work cooperatively are analyzed and their performance detailed. Among the proposed schemes, we have identified one that allows the corresponding Boltzmann machine to converge to high-quality solutions while providing a substantial acceleration over execution on uniprocessor computers.
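The unit update being distributed can be sketched as follows (a minimal single-process version; in the parallel scheme each processor sweeps only its own subset of neurons, and the states it reads for remote neurons may be stale):

    import numpy as np

    def boltzmann_sweep(s, W, b, T, rng):
        # Each unit i turns on with probability sigmoid(activation / T),
        # where the activation uses the current (possibly stale) states.
        for i in rng.permutation(len(s)):
            act = W[i] @ s + b[i]
            s[i] = 1.0 if rng.random() < 1.0 / (1.0 + np.exp(-act / T)) else 0.0
        return s

    rng = np.random.default_rng(0)
    W = rng.normal(size=(8, 8)); W = (W + W.T) / 2   # symmetric weights
    np.fill_diagonal(W, 0)                           # no self-connections
    s = rng.integers(0, 2, size=8).astype(float)
    s = boltzmann_sweep(s, W, np.zeros(8), T=1.0, rng=rng)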

19.
In this paper, we introduce an analytical technique based on queueing networks and Petri nets for the performance analysis of dataflow computations executed on the Manchester machine. The technique is also applicable to the analysis of parallel computations on multiprocessors. We characterize the parallelism in dataflow computations through a four-parameter characterization: the minimum parallelism, the maximum parallelism, the average parallelism, and the variance in parallelism. Through detailed investigation of our analytical models, we observe that the average parallelism is a good characterization of dataflow computations only as long as the variance in parallelism is small; significant differences in performance measures result when the variance in parallelism is comparable to or higher than the average parallelism.
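Given a parallelism profile (the number of executable operations at each unit time step), the four parameters are direct to extract; a small sketch, noting that for a unit-step profile the time-weighted average parallelism reduces to the ordinary mean:

    import numpy as np

    def parallelism_stats(profile):
        # profile[t] = number of operations executable at time step t.
        p = np.asarray(profile, dtype=float)
        return {"min": p.min(), "max": p.max(),
                "avg": p.mean(), "var": p.var()}

    print(parallelism_stats([1, 4, 8, 8, 2]))
    # {'min': 1.0, 'max': 8.0, 'avg': 4.6, 'var': 8.64}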

20.
Skillicorn, D.B. Computer, 1990, 23(12): 38-50
The major parallel architecture classes are considered: single-instruction multiple-data (SIMD) computers, tightly coupled multiple-instruction multiple-data (MIMD) computers, hypercuboid computers, and constant-valence MIMD computers. An argument that the PRAM model is universal over tightly coupled and hypercube systems, but not over constant-valence-topology, loosely coupled systems, is reviewed, showing precisely how the PRAM model is too powerful to permit broad universality. Ways in which a model of computation can be restricted to become universal over less powerful architectures are discussed. The Bird-Meertens formalism (R.S. Bird, 1989) is introduced, and it is shown how it can be used to express computations compactly. It is also shown that the Bird-Meertens formalism is universal over all four architecture classes and that nontrivial restrictions of functional programming languages exist that can be efficiently executed on disparate architectures. The use of the Bird-Meertens formalism as the basis for a programming language is discussed, and it is shown to be expressive enough for general programming. Other models and programming languages with architecture-independent properties are reviewed.
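The architecture-independence argument rests on the algebra of list homomorphisms: any computation of the form reduce(op) composed with map(f), with op associative, can be split into independent per-processor chunks whose partial results combine in any grouping. A small Python rendering of the idea (the formalism itself is notation, not tied to any language):

    from functools import reduce
    from operator import add

    def square(x):
        return x * x

    # BMF-style homomorphism: reduce(+) . map(square). Associativity of +
    # lets each chunk (e.g., one per processor) be handled independently.
    def sum_of_squares(chunks):
        partials = [reduce(add, map(square, chunk)) for chunk in chunks]
        return reduce(add, partials)

    print(sum_of_squares([[1, 2], [3, 4, 5]]))   # 55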
