Similar Documents
1.
Graphics processing units (GPUs) are powerful processors that can offer significant performance advantages over traditional CPUs, often outperforming them by orders of magnitude. The last decade has seen rapid advancement in GPU computational power and generality, and recent technologies make it possible to use GPUs as co-processors to CPUs. While the motivations for developing systems with GPUs are clear, little research in the real-time systems field has addressed integrating GPUs into real-time multiprocessor systems. We present two real-time analysis methods, addressing real-world platform constraints, for such an integration into a soft real-time multiprocessor system, and show that a GPU can be exploited to achieve greater total system performance.

2.
Partitioning of processors on a multiprocessor system involves logically dividing the system into processor partitions, in which programs can execute in parallel. Optimally setting the partition size can significantly improve the throughput of multiprocessor systems. The speedup characteristics of parallel programs can be defined by execution signatures. The execution signature of a parallel program on a multiprocessor system is the rate at which the program executes in the absence of other programs; it depends upon the number of allocated processors, the specific architecture, and the specific program implementation. Based on the execution signatures, this paper analyzes simple Markovian models of dynamic partitioning. The analysis shows that when there are at most two multiprocessor partitions, the optimal dynamic partition size that maximizes throughput can be found. Compared against other partitioning schemes, the dynamic partitioning scheme is shown to be the best in terms of throughput when the reconfiguration overhead is low; if the reconfiguration overhead is high, dynamic partitioning is to be avoided. An expression for the reconfiguration overhead threshold is derived. A general iterative partitioning technique is presented and shown to give maximum throughput for n partitions.
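To make the role of an execution signature concrete, here is a minimal sketch of the two-partition throughput search described above. The Amdahl-style rate function and the constants (16 processors, 10% serial fraction) are assumptions chosen for illustration; the paper's Markovian analysis and reconfiguration-overhead model are far richer than this.

```c
/* Minimal sketch: pick the two-partition split that maximizes total
 * throughput, assuming a hypothetical execution signature of the
 * Amdahl form rate(p) = p / (f*p + (1-f)). */
#include <stdio.h>

#define P 16          /* total processors (assumed) */
#define SERIAL_F 0.1  /* assumed serial fraction of the workload */

/* Execution signature: completion rate on p processors. */
static double rate(int p) {
    return (double)p / (SERIAL_F * p + (1.0 - SERIAL_F));
}

int main(void) {
    int best_k = 1;
    double best = 0.0;
    /* Try every split into two partitions of sizes k and P-k. */
    for (int k = 1; k < P; k++) {
        double throughput = rate(k) + rate(P - k);
        if (throughput > best) { best = throughput; best_k = k; }
    }
    printf("best split: %d + %d, throughput %.3f\n",
           best_k, P - best_k, best);
    return 0;
}
```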

3.
The increasing availability and use of supercomputers has highlighted the need for better software development techniques and tools. Supercomputers have traditionally been used extensively by engineers and scientists whose preference for Fortran is well established and recognized. However, in the parallel environment offered by the latest configurations of supercomputers, more sophisticated languages and tools are required. The present experiment is concerned with devising a syntax-directed integrated programming environment based on the language Actus, which enables a user to develop and debug programs before submitting them to an actual supercomputer. Actus is a high-level, Pascal-like language with SIMD parallel processing features. It enables the user to ignore the idiosyncrasies of the chosen supercomputer by abstracting the parallel processing detail to a higher level. The editing, compilation, and testing phases of program development are all integrated, providing a single consistent interface for all activities relating to program development.

4.
Parallel Computation of Electromagnetic Scattering Based on a Linux Cluster
Industrial applications, and military applications in particular, pose challenging demands on computational electromagnetics (CEM). An effective way to solve electrically large electromagnetic scattering problems (physical size/λ ≫ 1) is to apply parallel computing technology. This paper gives a parallel implementation of the MLFMA algorithm based on Linux cluster technology, together with computed examples of electromagnetic scattering from electrically large targets. Because this parallelization approach simply makes full use of existing workstations and is easy to program, it is a parallel implementation method well worth wider adoption.

5.
Computer Languages, 1996, 22(2-3): 181-192
An effective resolution multiprocessor can be built from distributed processing, logic programming, and interface elements. Widely used, portable components can be modularly composed into a portable parallel system that resists premature obsolescence through software evolution. A virtual multiprocessor offering common message-passing and configuration services integrates a distributed mesh of sequential resolution engines. Users configure and control the resolution engines and the virtual multiprocessor through a GUI, using an embedded command language to drive its facilities. Prolog programs either control parallel execution explicitly through message passing or rely on program transformation techniques to extract parallelism implicitly.

6.
IPS, a performance measurement system for parallel and distributed programs, is currently in its second implementation. IPS's model of parallel programs uses knowledge about the semantics of a program's structure to provide two important features. First, IPS provides a large amount of performance data about the execution of a parallel program, organized so that access to it is easy and intuitive. Second, IPS provides performance analysis techniques that automatically help guide the programmer to the location of program bottlenecks. The first implementation of IPS was a testbed for the basic design concepts, providing experience with a hierarchical program and measurement model, interactive program analysis, and automatic guidance techniques. It was built on the Charlotte distributed operating system. The second implementation, IPS-2, extends the basic system with new instrumentation techniques, an interactive and graphical user interface, and new automatic guidance analysis techniques. This implementation runs on 4.3BSD UNIX systems on the VAX, DECstation, Sun 4, and Sequent Symmetry multiprocessors.

7.
We consider the time-dependent demands for data movement that a parallel program makes on the architecture that executes it. The result is an architecture-independent metric that represents the temporal behavior of data-movement requirements. Programs are described as series of computations and data movements, and while message passing is not ruled out, we focus on explicit parallel programs using a fixed number of processes in a distributed shared-memory environment. Operations are assumed to be explicitly allocated to processors when the metric is applied, which might correspond to intermediate code in a parallelizing compiler. The metric is called the interprocess read (IR) temporal metric. A key to developing an architecture-independent temporal metric is modeling program execution time in an architecture-independent way. This is possible because well-synchronized parallel programs make coordinated progress above a certain level of granularity. Our execution time characterization takes into account barrier synchronization and critical sections. We illustrate the metric using instruction counts on simple code fragments and then on multiprocessor program traces (Splash benchmarks). Results of running the benchmarks on simulated network architectures show that the IR metric, at the time scale of network response, predicts performance better than whole-program measures.
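The core classification behind an interprocess read is simple: a read counts when the location was last written by a different process. The toy sketch below illustrates just that step over a hypothetical trace format; the paper's metric additionally bins these counts over time windows, which this does not attempt.

```c
/* Toy illustration of counting interprocess reads (IRs) over a
 * shared-memory access trace.  The trace format and addresses are
 * hypothetical; only the read/last-writer classification is shown. */
#include <stdio.h>

#define ADDRS 8

struct access { int proc; int addr; int is_write; };

int main(void) {
    struct access trace[] = {       /* hypothetical trace */
        {0, 3, 1}, {1, 3, 0}, {1, 3, 1},
        {0, 3, 0}, {0, 5, 1}, {0, 5, 0},
    };
    int last_writer[ADDRS];
    for (int a = 0; a < ADDRS; a++) last_writer[a] = -1;

    int irs = 0;
    for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++) {
        struct access *e = &trace[i];
        if (e->is_write) {
            last_writer[e->addr] = e->proc;
        } else if (last_writer[e->addr] != -1 &&
                   last_writer[e->addr] != e->proc) {
            irs++;              /* read of data produced elsewhere */
        }
    }
    printf("interprocess reads: %d\n", irs);  /* expect 2 */
    return 0;
}
```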

8.
Adaptive data partitioning (ADP), which reduces the execution time of parallel programs by reducing interprocessor communication for iterative parallel loops, is discussed. It is shown that ADP can be integrated into a communication-reducing back end for existing parallelizing compilers or as part of a machine-specific partitioner for parallel programs. A multiprocessor model to analyze program execution factors that lead to interprocessor communication, and a model of the iterative parallel loop to quantify communication patterns within a program, are defined. A vector notation is chosen to quantify communication across a global data set. Communication parameters are computed by examining the indexes of array accesses and are adjusted to reflect the underlying system architecture by compensating for cache line sizes. These values are used to generate rectangular and hexagonal partitions that reduce interprocessor communication.
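A back-of-the-envelope sketch of why partition shape matters for such loops: for a nearest-neighbour iterative loop over an N x N array cut into p x q rectangular blocks, traffic per iteration scales with the summed block perimeters. The numbers and cost model below are illustrative assumptions, not the paper's communication parameters.

```c
/* Compare communication volume of strip vs. square partitions for a
 * nearest-neighbour stencil over an N x N array (illustrative model:
 * interior boundaries exchanged once in each direction, no wrap). */
#include <stdio.h>

#define N 1024

/* Words communicated per iteration for a p x q block partition. */
static long comm_volume(int p, int q) {
    long vertical   = (long)(p - 1) * N;  /* rows crossing row cuts    */
    long horizontal = (long)(q - 1) * N;  /* columns crossing col cuts */
    return 2 * (vertical + horizontal);   /* both neighbours exchange  */
}

int main(void) {
    /* 16 processors: strips vs. squares. */
    printf("1 x 16 strips : %ld words\n", comm_volume(1, 16));
    printf("4 x 4  blocks : %ld words\n", comm_volume(4, 4));
    return 0;
}
```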

9.
10.
Parallel simulation on multiprocessor systems is an effective way to solve real-time simulation problems for large-scale continuous systems. The key problem in multiprocessor parallel simulation is how to effectively distribute a simulation task across the machines for concurrent execution and obtain a high speedup. This paper introduces PARSIM, a parallel simulation software support environment developed by the authors, which automatically converts a simulation program written for serial execution on a traditional uniprocessor into a parallel simulation program that executes efficiently and concurrently on a homogeneous multiprocessor system. Problems such as parallelism recognition and automatic task partitioning are discussed, and the corresponding algorithms and application examples are given.

11.
The amount of memory required by a parallel program may be spectacularly larger than the memory required by an equivalent sequential program, particularly for programs that use recursion extensively. Since most parallel programs are nondeterministic in behavior, even when computing a deterministic result, parallel memory requirements may vary from run to run, even with the same data. Hence, parallel memory requirements may be both large (relative to the memory requirements of an equivalent sequential program) and unpredictable. We assume that each parallel program has an underlying sequential execution order that may be used as a basis for predicting parallel memory requirements. We propose a simple restriction that is sufficient to ensure that any program that will run in n units of memory sequentially can run in mn units of memory on m processors, using a scheduling algorithm that is always within a factor of two of being optimal with respect to time. Any program can be transformed into one that satisfies the restriction, but some potential parallelism may be lost in the transformation. Alternatively, it is possible to define a parallel programming language in which only programs satisfying the restriction can be written.
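Written out as inequalities, the guarantees above take the following shape. This is a sketch using the standard greedy-scheduling (Graham/Brent) argument, which yields exactly a factor-of-two time bound; the notation (S for space, T for time, T-infinity for critical-path length) is ours, not the paper's.

```latex
% S_1, T_1: sequential space and work; T_inf: critical-path length;
% m: number of processors.  The space bound is the restriction's
% guarantee; the time chain is the standard greedy-schedule argument.
S_m \le m \cdot S_1,
\qquad
T_m \le \frac{T_1}{m} + T_\infty
    \le 2\max\!\left(\frac{T_1}{m},\, T_\infty\right)
    \le 2\, T_m^{\mathrm{opt}}.
```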

12.
Illegal pointer and array accesses are a major cause of failure for C programs. We present a technique called ‘guarding’ to catch illegal array and pointer accesses. Our implementation of guarding for C programs works as a source-to-source translator. Auxiliary objects called guards are added to a user program to monitor pointer and array accesses at run time; guards maintain attributes to catch out-of-bounds array accesses and accesses to deallocated memory. Our system has found a number of previously unreported errors in widely used Unix utilities and SPEC92 benchmarks. Many commonly used programs have bugs which may not always manifest themselves as a program crash, but may instead produce a subtly wrong answer. These programs are not routinely checked for run-time errors because the increase in execution time due to run-time checking can be very high. We present two techniques to handle the high cost of run-time checking of pointer and array accesses in C programs: ‘customization’ and ‘shadow processing’. Customization works by decoupling run-time checking from the original computation: a user program is customized for guarding by throwing away computation not relevant to guarding. We have explored using program slicing for customization, which can cut the overhead of guarding by up to half. Shadow processing uses idle processors in multiprocessor workstations to perform run-time checking in the background. A user program is instrumented to obtain a ‘main process’ and a ‘shadow process’. The main process performs computations from the original program, occasionally communicating a few key values to the shadow process; the shadow process follows the main process, checking pointer and array accesses. The overhead to the main process, which is what the user sees, is very low: almost always less than 10%.
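The guard idea can be sketched in a few lines: pair each object with bounds-and-liveness metadata and route accesses through a checking helper. The paper's system inserts such checks automatically by source-to-source translation; the names here (guard, checked_read) are illustrative, not its actual interface.

```c
/* Minimal sketch of a run-time guard for array/pointer accesses,
 * assuming hand-inserted checks rather than the paper's translator. */
#include <stdio.h>
#include <stdlib.h>

struct guard {
    int   *base;  /* start of the guarded object       */
    size_t len;   /* number of elements                */
    int    live;  /* cleared when the object is freed  */
};

static int checked_read(struct guard *g, size_t i) {
    if (!g->live) {
        fprintf(stderr, "guard: access to freed object\n");
        exit(1);
    }
    if (i >= g->len) {
        fprintf(stderr, "guard: index %zu out of bounds (len %zu)\n",
                i, g->len);
        exit(1);
    }
    return g->base[i];
}

int main(void) {
    int data[4] = {1, 2, 3, 4};
    struct guard g = { data, 4, 1 };
    printf("%d\n", checked_read(&g, 2));  /* in bounds   */
    checked_read(&g, 9);                  /* caught here */
    return 0;
}
```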

13.
Efficient performance tuning of parallel programs is often hard. Optimization is frequently done after the program is written, as a last effort to increase performance. With sequential programs, each (executed) code segment affects the completion time. In the case of a parallel program executed on a multiprocessor this is not always true, due to dependencies between the different threads: certain code segments of the execution may not affect the completion time of the program, and optimizing such segments will not improve performance. In this paper we present an approach to optimizing performance by finding the extended critical path of the multithreaded program. The extended critical path analysis is a generalization of critical path analysis in the sense that it also deals with more threads than processors. We have implemented the extended critical path analysis in a performance optimization tool. The tool allows the user to determine the extended critical path of a multithreaded application written for the Solaris operating system, for any number of processors, based on execution on a single-processor workstation.
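The basic longest-path step that the extended analysis generalizes can be shown over a small dependency DAG of thread segments. This toy sketch assumes nodes in topological order and made-up segment costs; the paper's analysis additionally models processor contention when threads outnumber processors, which is not captured here.

```c
/* Toy critical-path computation over a DAG of thread segments. */
#include <stdio.h>

#define N 5

int main(void) {
    int cost[N] = {3, 2, 4, 1, 2};   /* segment run times (assumed) */
    int dep[N][N] = {0};             /* dep[i][j]: i must finish before j */
    dep[0][1] = dep[0][2] = dep[1][3] = dep[2][3] = dep[3][4] = 1;

    int finish[N];
    for (int j = 0; j < N; j++) {
        int start = 0;               /* earliest start time */
        for (int i = 0; i < j; i++)
            if (dep[i][j] && finish[i] > start)
                start = finish[i];
        finish[j] = start + cost[j];
    }
    /* Segments off this path cannot shorten the run: 3+4+1+2 = 10. */
    printf("critical path length: %d\n", finish[N - 1]);
    return 0;
}
```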

14.
15.
Multiprocessor execution of functional programs
Functional languages have recently gained attention as vehicles for programming in a concise and elegant manner. In addition, it has been suggested that functional programming provides a natural methodology for programming multiprocessor computers. This paper describes research performed to demonstrate that multiprocessor execution of functional programs on current multiprocessors is feasible and results in a significant reduction in their execution times. Two implementations of the functional language ALFL were built on commercially available multiprocessors. Alfalfa is an implementation on the Intel iPSC hypercube multiprocessor, and Buckwheat is an implementation on the Encore Multimax shared-memory multiprocessor. Each implementation includes a compiler that performs automatic decomposition of ALFL programs and a run-time system that supports their execution. The compiler is responsible for detecting the inherent parallelism in a program and decomposing the program into a collection of tasks, called serial combinators, that can be executed in parallel. The abstract machine model supported by Alfalfa and Buckwheat is called heterogeneous graph reduction, a hybrid of graph reduction and conventional stack-oriented execution. This model supports parallelism, lazy evaluation, and higher-order functions while making efficient use of the processors in the system. The Alfalfa and Buckwheat runtime systems support dynamic load balancing, interprocessor communication (if required), and storage management. A large number of experiments were performed on Alfalfa and Buckwheat for a variety of programs. The results of these experiments, as well as the conclusions drawn from them, are presented.

16.
Mehdi Badii. Software, 1998, 28(5): 463-480
This paper presents a UNIX implementation of the multitasking functions of DYNIX, the operating system of the Sequent shared-memory multiprocessor computers. These functions support data and function partitioning, let the user run subprograms in parallel on the processors of a Sequent computer, and can synchronize, lock, and unlock data and program segments. As a result, the simulator allows users to develop their multitasking programs on a uniprocessor computer such as a SUN workstation and later port them to a Sequent computer. Further, the simulator adds a level of abstraction on top of UNIX for concurrent programming: its functions allow the user to handle the communication and synchronization of the processes in a program at a higher level of abstraction while concentrating on the design of multitasking algorithms. The simulator is applied to a parallel selection algorithm.
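To give a feel for this style of interface, here is a rough pthread-based sketch of an m_fork()-like helper that runs one function on several workers and waits for all of them. DYNIX's real functions differ in signature and semantics, and the paper's simulator is built on UNIX processes rather than threads; sim_fork is a hypothetical name for illustration only.

```c
/* Sketch of an m_fork()-style "run fn on K workers and join" helper,
 * using POSIX threads as a stand-in for the simulator's machinery. */
#include <pthread.h>
#include <stdio.h>

#define K 4

static void *worker(void *arg) {
    long id = (long)arg;
    printf("worker %ld running\n", id);
    return NULL;
}

/* Spawn K workers on fn and wait for all of them. */
static void sim_fork(void *(*fn)(void *)) {
    pthread_t t[K];
    for (long i = 0; i < K; i++)
        pthread_create(&t[i], NULL, fn, (void *)i);
    for (int i = 0; i < K; i++)
        pthread_join(t[i], NULL);
}

int main(void) {
    sim_fork(worker);
    return 0;
}
```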

17.
The paper addresses issues in the simulation and analysis of hierarchical multiprocessor systems oriented to database applications. Requirements for a parallel database system model are given, and a survey and comparative analysis of known parallel database system models are presented. A new multiprocessor database system model is introduced that allows arbitrary hierarchical multiprocessor configurations to be simulated and evaluated in the context of OLTP-class database applications. Examples of using the model for simulation studies of multiprocessor database systems are presented.

18.
Lock-freedom is a property of concurrent programs which states that, from any state of the program, eventually some process will complete its operation. It is a weaker property than the usual expectation that eventually all processes will complete their operations. By weakening their completion guarantees, lock-free programs increase the potential for parallelism, and hence make more efficient use of multiprocessor architectures than lock-based algorithms. However, lock-free algorithms, and reasoning about them, are considerably more complex. In this paper we present a technique for proving that a program is lock-free. The technique is designed to be as general as possible and is guided by heuristics that simplify the proofs. We demonstrate our theory by proving lock-freedom of two non-trivial examples from the literature. The proofs have been machine-checked by the PVS theorem prover, and we have developed proof strategies to minimise user interaction.
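The canonical shape of a lock-free operation is a compare-and-swap retry loop: an individual thread may retry forever, but a failed CAS means some other thread's CAS succeeded, so the system as a whole always makes progress. The textbook counter below illustrates this completion guarantee; it is not one of the paper's two examples.

```c
/* Lock-free counter increment via a C11 CAS retry loop. */
#include <stdatomic.h>
#include <stdio.h>

static _Atomic long counter = 0;

static void increment(void) {
    long old = atomic_load(&counter);
    /* Retry until our CAS wins.  Each failure reloads 'old' with the
     * current value and implies another thread completed its update,
     * which is exactly the lock-freedom progress property. */
    while (!atomic_compare_exchange_weak(&counter, &old, old + 1))
        ;
}

int main(void) {
    increment();
    printf("%ld\n", counter);
    return 0;
}
```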

19.
We present compiler analyses and optimizations for explicitly parallel programs that communicate through a shared address space. Any type of code motion on explicitly parallel programs requires a new kind of analysis to ensure that operations reordered on one processor cannot be observed by another. The analysis, called cycle detection, is based on work by Shasha and Snir and checks for cycles among interfering accesses. We improve the accuracy of their analysis by using additional information from synchronization analysis, which handles post-wait synchronization, barriers, and locks. We also make the analysis efficient by exploiting the common code image property of SPMD programs. We make no assumptions on the use of synchronization constructs: our transformations preserve program meaning even in the presence of race conditions, user-defined spin locks, or other synchronization mechanisms built from shared memory. However, programs that use linguistic synchronization constructs rather than their user-defined shared-memory counterparts will benefit from more accurate analysis and therefore better optimization. We demonstrate the use of this analysis for communication optimizations on distributed memory machines by automatically transforming programs written in a conventional shared-memory style into Split-C, a language with primitives for nonblocking memory operations and one-way communication. The optimizations include message pipelining, to allow multiple outstanding remote memory operations; conversion of two-way to one-way communication; and elimination of communication through data reuse. The performance improvements are as high as 20-35% for programs running on a CM-5 multiprocessor using the Split-C language as a global address layer. Even larger benefits can be expected on machines with higher communication latency relative to processor speed.
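The shape of the message-pipelining transformation can be sketched as follows: blocking remote reads become split-phase issue plus a single sync, so several remote operations overlap. issue_get and wait_all are hypothetical stand-ins for Split-C's actual non-blocking primitives, given degenerate local definitions here so the sketch runs.

```c
/* Message pipelining, sketched: fire all remote reads, pay latency
 * once.  A real runtime would issue non-blocking network gets in
 * issue_get() and block only in wait_all(); these stand-ins just
 * copy locally so the example is self-contained. */
#include <stdio.h>
#include <stddef.h>

static void issue_get(double *local, const double *remote) {
    *local = *remote;
}
static void wait_all(void) { }

static void gather(double *dst, const double *const remote[], size_t n) {
    /* Before the transformation: dst[i] = *remote[i]; stalls per read. */
    for (size_t i = 0; i < n; i++)
        issue_get(&dst[i], remote[i]);  /* fire all reads first...   */
    wait_all();                         /* ...then wait once for all */
}

int main(void) {
    double a = 1.0, b = 2.0, dst[2];
    double *src[] = { &a, &b };
    gather(dst, (const double *const *)src, 2);
    printf("%.1f %.1f\n", dst[0], dst[1]);
    return 0;
}
```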

20.
This paper presents a new compiler optimization algorithm that parallelizes applications for symmetric shared-memory multiprocessors. The algorithm considers data locality, parallelism, and the granularity of parallelism. It uses dependence analysis and a simple cache model to drive its optimizations, and it optimizes across procedures using interprocedural analysis and transformations. We validate the algorithm by hand-applying it to sequential versions of parallel Fortran programs operating over dense matrices. The programs were originally hand-coded to target a variety of parallel machines using loop parallelism. We ignore the user's parallel loop directives, use known and implemented dependence and interprocedural analysis to find parallelism, and then apply our new optimization algorithm to the resulting program. We compare the original parallel program to the hand-optimized program and show that our algorithm improves three programs, matches four programs, and degrades one program in our test suite on a shared-memory, bus-based parallel machine with local caches. This experiment suggests that existing dependence and interprocedural array analysis can automatically detect user parallelism, and it demonstrates that user-parallelized codes often benefit from our compiler optimizations, providing evidence that we need both parallel algorithms and compiler optimizations to effectively utilize parallel machines.
