Similar Documents
20 similar documents found (search time: 31 ms)
1.
As the level of detail and the complexity of a model increase, the computation time needed to simulate it grows rapidly. A parallel computer system consisting of many processors may be needed to generate the behavior of a model within a reasonable time. An algorithm for parallel combined continuous and discrete-event simulation has therefore been designed and implemented. In this article the algorithm is described and its performance is reported. The combined simulation algorithm extends the parallel discrete-event simulation algorithm Time Warp towards combined simulation. Experiments show that, compared to the sequential implementation of the model, a modest speed-up can be achieved.

2.
In this paper, we introduce a new Time Warp system called ROSS: Rensselaer's Optimistic Simulation System. ROSS is an extremely modular kernel that is capable of achieving event rates as high as 1,250,000 events per second when simulating a wireless telephone network model (PCS) on a quad-processor PC server. In a head-to-head comparison, we observe that ROSS outperforms the Georgia Tech Time Warp (GTW) system by up to 180% on a quad-processor PC server and up to 200% on the SGI Origin 2000. ROSS requires only a small, constant number of memory buffers beyond what the sequential simulation needs, for a constant number of processors. ROSS demonstrates for the first time that stable, highly efficient execution using little memory above what the sequential model would require is possible for low-event-granularity simulation models. The driving force behind these high-performance, low-memory-utilization results is the coupling of an efficient pointer-based implementation framework, Fujimoto's fast GVT algorithm for shared-memory multiprocessors, reverse computation, and the introduction of kernel processes (KPs). KPs lower fossil-collection overheads by aggregating processed event lists. This allows fossil collection to be done with greater frequency, thus lowering the overall memory necessary to sustain stable, efficient parallel execution. These characteristics make ROSS an ideal system for use in large-scale networking simulation models. The principal conclusion drawn from this study is that the performance of an optimistic simulator is largely determined by its memory usage.
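The kernel-process idea lends itself to a compact illustration. The sketch below is a minimal Python rendering, not ROSS code (the class and method names are hypothetical); it shows how merging the processed-event lists of several LPs lets fossil collection free memory in one pass over one aggregated list:

    # Minimal sketch of kernel processes (KPs) aggregating processed-event
    # lists for fossil collection. Names are hypothetical, not ROSS code.

    class KernelProcess:
        """Holds the merged processed-event list of several LPs."""
        def __init__(self, lp_ids):
            self.lp_ids = lp_ids
            self.processed = []          # (timestamp, event) pairs, in order

        def record(self, timestamp, event):
            self.processed.append((timestamp, event))

        def fossil_collect(self, gvt):
            """Free every event older than GVT in one pass over one list,
            instead of one pass per LP."""
            keep = [(t, e) for (t, e) in self.processed if t >= gvt]
            freed = len(self.processed) - len(keep)
            self.processed = keep
            return freed

    # Usage: 4 LPs share one KP, so fossil collection touches 1 list, not 4.
    kp = KernelProcess(lp_ids=[0, 1, 2, 3])
    for t in range(10):
        kp.record(t, f"event-{t}")
    print(kp.fossil_collect(gvt=5))      # frees the 5 events below GVT=5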

3.
Time Warp is an optimistic protocol for synchronizing parallel discrete event simulations. To achieve performance in a multiuser network of workstations (NOW) environment, Time Warp must continue to operate efficiently in the presence of external workloads caused by other users, processor heterogeneity, and irregular internal workloads caused by the simulation model. Any of these can cause a Time Warp program to become grossly unbalanced, resulting in slower execution. The key observation asserted in this article is that each of these performance problems, while different in source, has a similar manifestation: for a Time Warp program to be balanced, the amount of wall-clock time necessary to advance an LP one unit of simulation time should be about the same for all LPs. Using this observation, we devise a single algorithm that mitigates these performance problems and enables the "background" execution of Time Warp programs on heterogeneous distributed computing platforms in the presence of external as well as irregular internal workloads.
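The balance criterion stated in this abstract (equal wall-clock cost per unit of simulation time across all LPs) is easy to render concretely. The following minimal sketch computes each LP's advance rate and flags outliers as rebalancing candidates; the numbers and the 1.5x threshold are illustrative assumptions, not from the paper:

    # Sketch of the balance criterion: an LP's "advance rate" is wall-clock
    # seconds spent per unit of simulation time advanced. A balanced run has
    # roughly equal rates; outliers are candidates for migration to a faster
    # (or less loaded) host.

    def advance_rates(wall_seconds, sim_time_advanced):
        return [w / s for w, s in zip(wall_seconds, sim_time_advanced)]

    wall = [2.0, 2.1, 8.4, 1.9]          # seconds of wall clock per LP
    sim  = [100, 100, 100, 100]          # units of simulation time advanced

    rates = advance_rates(wall, sim)
    mean = sum(rates) / len(rates)
    for lp, r in enumerate(rates):
        if r > 1.5 * mean:               # threshold is an arbitrary choice
            print(f"LP {lp} is {r/mean:.1f}x the mean rate: rebalance")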

4.
Time Warp is an optimistic synchronization protocol used for parallel discrete event simulation. While Time Warp has the potential to reduce the execution time of large simulations, it has been plagued by a variety of problems, namely: 1. instability due to thrashing effects caused by echoing and cascading rollbacks; 2. memory bottlenecks due to state saving and excessive optimism; 3. inefficient algorithms for scheduling Time Warp processes on each processing node. These problems have inhibited the widespread use of Time Warp as a general-purpose synchronization algorithm. The general trend among researchers attempting to solve these problems has been to statically limit the optimism of Time Warp. Unfortunately, these attempts have achieved only limited success, because a static set of parameters may perform well for one simulation but not for another. This paper instead attacks the problems with adaptive mechanisms that control optimism using a performance index called useful work. Solutions are presented for the problems above by: 1. stabilizing Time Warp using adaptive bounded time windows; 2. reducing memory usage and overall execution time by using an adaptive mechanism to vary the checkpoint interval; 3. scheduling Time Warp processes with the useful-work parameter to favor more productive processes. The resulting synchronization mechanism, termed Parameterized Time Warp (PTW), provides an integrated adaptive solution to optimistic parallel discrete event simulation. Empirical work showing that PTW outperforms an equivalent Time Warp simulation executing under similar partitioning and load conditions is also presented.
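A rough sketch of how a useful-work index can drive an adaptive time window appears below. The controller shown (the thresholds, the multiplicative step, and the function name adapt_window) is a plausible stand-in for illustration, not the paper's actual mechanism:

    # Sketch of the useful-work index driving an adaptive time window.
    # Useful work = fraction of executed events that were committed (not
    # rolled back). The update rule below is an assumed controller.

    def adapt_window(window, committed, executed,
                     low=0.5, high=0.9, step=1.2):
        useful = committed / executed if executed else 1.0
        if useful < low:                 # too much rollback: rein in optimism
            window /= step
        elif useful > high:              # little rollback: allow more optimism
            window *= step
        return window, useful

    window = 100.0
    for committed, executed in [(40, 100), (85, 100), (95, 100)]:
        window, u = adapt_window(window, committed, executed)
        print(f"useful work {u:.2f} -> window {window:.1f}")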

5.
Time Warp synchronized parallel discrete event simulators are organized to operate asynchronously and aggressively, without explicit synchronization between the concurrently executing simulation objects. In place of an explicit synchronization mechanism, the concurrent simulators implement an independent but common virtual clock model and a rollback/recovery mechanism to restore causal order when out-of-order events are detected. When the critical path of execution of the simulation is balanced across these parallel threads of execution, this can result in a highly effective, lightweight synchronization mechanism for parallel simulation. However, imbalances in the workload across the threads can result in excessive rollback in some threads and slowed progress of the critical path. On small shared-memory multi-core systems, a lowest-time-stamp scheduling policy can effectively balance the workload; on larger many-core chips, however, conventional load balancing and workload migration will once again become necessary. Fortunately, emerging many-core chips contain some interesting features that can potentially be exploited to improve the performance of parallel simulations. In particular, the recently developed Intel Single-chip Cloud Computer (SCC) provides mechanisms for run-time control of the frequency and voltage settings of the chip. Furthermore, the frequency and voltage settings are independently set within different regions (called islands) of the chip. Thus, in a Time Warp simulation, one could increase the frequency of the cores executing threads on the critical path (those experiencing infrequent rollback) and decrease the frequency of the cores executing threads off the critical path (those experiencing excessive rollback). This paper investigates the run-time control and adjustment of core frequency in some contemporary x86 multi-core processors to identify platforms that can support the exploration of dynamic run-time control of core frequency settings. The results show that while all of the multi-core processors examined have software-controllable core frequency modulation capabilities, the cores are generally not fully independent once the system comes under load and are therefore unsuitable for these studies. Fortunately, one processor, the AMD X6 line, provides software control of core frequencies that can be fixed (by software) even as the system operates under load.
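The frequency-steering idea can be sketched in a few lines. The policy below, with made-up frequencies, threshold, and function name, assigns a high clock rate to cores that rarely roll back (likely on the critical path) and a low one to cores that roll back often; a real implementation would apply the result through the platform's DVFS interface:

    # Sketch of the frequency-steering idea: cores whose threads rarely roll
    # back get a higher frequency; cores rolling back often get a lower one.
    # Frequencies and cutoff are illustrative assumptions.

    def pick_frequency(rollback_rate, f_min=1.6e9, f_max=3.2e9, cutoff=0.2):
        """rollback_rate = rollbacks / events executed on that core."""
        return f_max if rollback_rate < cutoff else f_min

    for core, rate in enumerate([0.05, 0.40, 0.10, 0.65]):
        print(f"core {core}: rollback rate {rate:.2f} -> "
              f"{pick_frequency(rate)/1e9:.1f} GHz")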

6.
The behavior of n interacting processes synchronized by the "Time Warp" rollback mechanism is analyzed under the constraint that the total amount of memory available to execute the program is limited. In Time Warp, a protocol called "cancelback" has been proposed to reclaim storage when the system runs out of memory. A discrete-state, continuous-time Markov chain model of Time Warp augmented with the cancelback protocol is developed for a shared-memory system with n homogeneous processors and a homogeneous workload with constant message population. The model allows one to predict speedup as the amount of available memory is varied. The performance predicted by the model is validated through measurements on an operational Time Warp system executing on a shared-memory multiprocessor with a workload similar to that in the model. It is observed that if the sequential simulation requires m message buffers, Time Warp with only a small fraction of message buffers beyond m performs almost as well as Time Warp with unlimited memory.

7.
Warp processors are a novel architecture capable of autonomously optimizing an executing application by dynamically re-implementing critical kernels within the software as custom hardware circuits in an on-chip FPGA. Previous research on warp processing focused on low-power embedded systems, incorporating a low-end ARM processor as the main software execution resource. We provide a thorough analysis of the scalability of warp processing by evaluating several possible warp processor implementations, from low-power to high-performance, and by evaluating the potential for parallel execution of the partitioned software and hardware. We further demonstrate that even with a high-performance 1 GHz embedded processor, warp processing provides the equivalent performance of a 2.4 GHz processor. By further enabling parallel execution between the processor and the FPGA, warp processor execution provides the equivalent performance of a 3.2 GHz processor.

8.
The EGS4 code, developed at Stanford Linear Accelerator Center, simulates electron-photon cascading phenomena. The original code is inherently sequential, processing one particle at a time. This paper reports on a series of experiments in parallelizing different versions of EGS4. Our parallel experiments were run on a 30-processor Sequent Balance B21 and a 6-processor Symmetry S27. We have considered the following approaches for parallel execution of this application code:
(1) Original sequential version modified for parallel processing: 1 processor;
(2) Version 1 run multiprocessed: 1 to 29 processors;
(3) Sequential version modified for large-grain parallel processing: 1 processor;
(4) Version 3 run using the Sequent Microtasking Library: 1 to 29 processors.

For each approach, we discuss the relative advantages and disadvantages in the areas of coding effort, understandability, and portability, as well as performance, and outline a new parallelization approach, based on Large-Grain Data Flow techniques, that we are currently pursuing.


9.
The current trend in research on multithreading processors is toward chip multithreading (CMT), which exploits thread-level parallelism (TLP) and improves the performance of software built on traditional threading components, e.g., Pthreads. Commercially available processors already support simultaneous multithreading (SMT) on multicore processors, but they are based on the conventional sequential execution model and execute multiple threads in parallel under the control of an OS that handles interruptions. Moreover, few languages or programming techniques exist to utilize multicore processors effectively. We take another approach and develop a multithreading processor dedicated to TLP. Our processor, named Fuce, is based on continuation-based multithreading. A thread is defined as a block of sequentially ordered instructions that are executed without interruption. Every thread execution is triggered only by an event called a continuation. This paper first introduces the continuation-based multithread execution model and its processor architecture, then gives multithreaded programming techniques and the continuation-based multithreading language system CML. Finally, the performance of the Fuce processor is evaluated by means of clock-level software simulation.
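The continuation-based execution model can be approximated in ordinary software, as in the sketch below. This is a loose Python rendering of the idea, not Fuce hardware behavior or CML syntax: each "thread" runs to completion without interruption and becomes runnable only when its continuation is signaled:

    # Loose sketch of continuation-based multithreading: a thread is a block
    # that runs uninterrupted, triggered only by a continuation event.

    from collections import deque

    ready = deque()                       # continuations waiting to run

    def spawn(continuation, *args):
        """Signal a continuation: the target thread becomes runnable."""
        ready.append((continuation, args))

    def step_a():
        print("A runs without interruption, then triggers B")
        spawn(step_b, 42)

    def step_b(value):
        print("B triggered by continuation with", value)

    spawn(step_a)
    while ready:                          # event-driven scheduler loop
        fn, args = ready.popleft()
        fn(*args)                         # each thread runs uninterrupted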

10.
This paper examines measures for evaluating the performance of algorithms for single instruction stream-multiple data stream (SIMD) machines. The SIMD mode of parallelism uses a large number of processors synchronized together: all processors execute the same instruction at the same time, but each processor operates on a different data item. The complexity of parallel algorithms is, in general, a function of the machine size (number of processors), the problem size, and the type of interconnection network used to provide communications among the processors. Measures that quantify the effect of changing the machine-size/problem-size/network-type relationships are therefore needed. A number of such measures are presented and applied to an example SIMD algorithm from the image processing problem domain. The measures discussed and compared include execution time, speed, parallel efficiency, overhead ratio, processor utilization, redundancy, cost effectiveness, speed-up of the parallel algorithm over the corresponding serial algorithm, and an additive measure called "sprice" which assigns a weighted value to computations and processors.
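Several of the measures listed have standard textbook definitions that can be stated directly; the snippet below gives the usual forms of speedup, parallel efficiency, and cost (the processor-time product). The paper-specific "sprice" measure is omitted rather than guessed at, and the timings are illustrative:

    # Standard definitions of three of the measures named above.

    def speedup(t_serial, t_parallel):
        return t_serial / t_parallel

    def efficiency(t_serial, t_parallel, p):
        return speedup(t_serial, t_parallel) / p

    def cost(t_parallel, p):                 # processor-time product
        return p * t_parallel

    t1, tp, p = 64.0, 4.0, 32                # illustrative timings
    print(speedup(t1, tp))                   # 16.0
    print(efficiency(t1, tp, p))             # 0.5
    print(cost(tp, p))                       # 128.0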

11.
A task scheduling algorithm for parallel execution of logic programs on NUMA multiprocessors is proposed. The algorithm enforces a so-called fair polling policy. We show analytically that the proposed algorithm has a good isoefficiency function. Results from simulation on switch-based NUMA multiprocessors, the BBN Butterfly GP1000 and TC2000, corroborate the analysis. The proposed algorithm exhibits performance characteristics very similar to those of its counterpart that uses a shared memory, and it achieves reasonable speed-up on benchmarks using a non-constant-time task migration protocol. In addition, fair polling algorithms (with or without a shared memory) are shown to be consistently superior to several other known polling schemes that do not maintain fairness, for a variety of benchmark programs. (Supported by Air Force Office of Scientific Research grants AFOSR-88-0152 and AFOSR-91-0350.)

12.
J. N. Magee, S. C. Cheung. Software, 1991, 21(3): 235-250.
Clusters of workstations connected by local area networks are in common use in many organizations, but the combined processing power of these clusters is rarely exploited, owing to the lack of suitable parallel algorithms. The paper describes a parallel programming paradigm called supervisor-worker, suitable for the workstation environment, which can be used to speed up the execution of a large class of existing sequential programs. Simple formulae are developed to predict the speed-up of a parallel algorithm developed in this way. The predictions depend on two easily determined parameters of the sequential program and the characteristic communication cost of the workstation cluster. Consequently, it is possible to estimate the benefits of the parallel program before proceeding with detailed implementation. As an example, the parallel version of a travelling salesman program is developed and the measured speed-up compared with the prediction.
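The paper's exact formulae are not reproduced in the abstract, so the sketch below substitutes a generic Amdahl-style model in the same spirit: speedup predicted from a sequential fraction f and a per-worker communication cost c, both expressed as fractions of the sequential runtime. The model form and parameter values are assumptions for illustration only:

    # Assumed Amdahl-style prediction with a communication term, standing in
    # for the paper's (unstated) formulae.

    def predicted_speedup(p, f, c):
        """p workers, f = sequential fraction, c = communication cost per
        worker as a fraction of the sequential runtime."""
        return 1.0 / (f + (1.0 - f) / p + c * p)

    for p in (1, 4, 8, 16):
        print(p, round(predicted_speedup(p, f=0.05, c=0.002), 2))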

13.
Muse (Multi-sequential Prolog engines) is a simple and efficient approach to Or-parallel execution of Prolog programs. It is based on having several sequential Prolog engines, each with its own local address space, plus some shared memory space. It is currently implemented on a 7-processor machine with local/shared memory constructed at SICS, a 16-processor Sequent Symmetry, a 96-processor BBN Butterfly I, and a 45-processor BBN Butterfly II. The sequential SICStus Prolog system has been adapted for Or-parallel implementation; the extra overhead associated with this adaptation is very low in comparison with other approaches. The speed-up factor is very close to the number of processors in the system for a large class of problems. The goal of this paper is to present the Muse execution model, some of its implementation issues, a variant of Prolog suitable for multiprocessor implementations, and experimental results obtained from two different multiprocessor systems.

14.
Discrete event simulation is a methodology for studying the behavior of complex systems. Its drawback is that, in order to get reliable results, simulations usually have to be run over a long stretch of time. This time requirement can be reduced through the use of parallel or distributed computing systems. In this paper, we analyze the Time Warp synchronization protocol for parallel discrete event simulation and present an analytical model evaluating an upper bound on the completion time of a Time Warp simulation. In our analysis, we consider the case of a simulation model with homogeneous logical processes, where "homogeneous" means they have the same average event routine time and the same state saving cost. We then propose a methodology to determine when it is advantageous, in terms of completion time, to use a Time Warp synchronized simulation instead of a sequential one, for a simulation model with features matching those considered in our analysis. We answer this question without the need to generate the simulation code beforehand. Examples of the methodology's usage are reported for both a synthetic benchmark and a real-world model.

15.
SimSET is Monte Carlo simulation software for emission tomography. This paper describes a simple but effective scheme for parallel execution of SimSET using NetSolve, a client-server system for distributed computation. NetSolve (version 1.4.1) is "grid middleware" that enables a user (the client) to run specific computations remotely and simultaneously on a grid of networked computers (the servers). Since the servers do not have to be identical machines, computation may take place in a heterogeneous environment. To take advantage of diversity in machines and their workloads, a client-side scheduler was implemented for the Monte Carlo simulation. The scheduler partitions the total decay events by taking into account the inherent compute speeds and recent average workloads of the servers; i.e., it assigns more decay events to processors expected to give faster service and fewer to those expected to give slower service. When compute speeds and sustained workloads are taken into account, the speed-up is essentially linear in the number of equivalent "maximum-service" processors. One modification to the SimSET code (version 2.6.2.3) was made to ensure that the total number of decay events specified by the user is maintained in the distributed simulation; no other modifications to the standard SimSET code were made. Each processor runs the complete SimSET code on its assignment of decay events, independently of the others running simultaneously. Empirical results are reported for the simulation of a clinical-quality lung perfusion study.
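The partitioning rule described (more decay events to fast, lightly loaded servers) might look like the following. The speed/(1 + load) weighting is an assumed stand-in for the paper's formula; the rounding step illustrates the requirement that the user's total event count be preserved exactly:

    # Sketch of the client-side scheduler's partitioning rule. The weighting
    # function is an assumption; the remainder handling keeps the total exact.

    def partition_events(total, speeds, loads):
        weights = [s / (1.0 + l) for s, l in zip(speeds, loads)]
        scale = total / sum(weights)
        shares = [int(w * scale) for w in weights]
        shares[0] += total - sum(shares)     # absorb rounding remainder
        return shares

    # 3 servers: relative speeds and recent average workloads.
    print(partition_events(1_000_000, speeds=[1.0, 2.0, 1.5],
                           loads=[0.2, 1.0, 0.1]))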

16.
The performance of the Time Warp mechanism is experimentally evaluated when only a limited amount of memory is available to the parallel computation. An implementation of the cancelback protocol on a shared-memory architecture (a KSR machine) is used to evaluate the performance vs. memory tradeoff. The implementation supports canceling back more than one memory object when memory has been exhausted (the precise number is referred to as the salvage parameter) and incorporates a non-work-conserving processor scheduling technique to prevent starvation. Several synthetic and benchmark programs are used that provide interesting stress cases for evaluating the limited-memory behavior. The experiments are extensively monitored to determine the extent to which various factors may affect performance. Several observations are made by analyzing the behavior of Time Warp under limited memory: (1) depending on the available memory and the asymmetry in the workload, canceling back several memory objects at one time (i.e., a salvage parameter value of more than one) improves performance significantly by reducing certain overheads, although performance is relatively insensitive to the salvage parameter except at extreme values; (2) the speedup vs. memory curve for Time Warp programs has a well-defined knee, before which speedup increases very rapidly with memory and beyond which there is little performance gain with increased memory; (3) performance nearly equivalent to that with large amounts of memory can be achieved with only a modest amount of memory beyond that required for sequential execution, provided memory management overheads are small compared to the event granularity. These results indicate that, contrary to common belief, memory usage by Time Warp can be controlled within reasonable limits without any significant loss of performance.
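A minimal sketch of cancelback with a salvage parameter k is shown below: when memory is exhausted, the k messages furthest in the future (largest timestamps) are sent back in one operation, so per-invocation overhead is amortized over k objects. The data structures and function name are illustrative, not the paper's implementation:

    # Sketch of cancelback with a salvage parameter k. Structures are
    # illustrative assumptions.

    import heapq

    def cancelback(buffers, k):
        """buffers: list of (timestamp, msg). Returns (kept, cancelled)."""
        if k >= len(buffers):
            return [], buffers
        cancelled = heapq.nlargest(k, buffers)   # furthest in the future
        keep = [b for b in buffers if b not in cancelled]
        return keep, cancelled

    bufs = [(t, f"msg{t}") for t in (3, 9, 1, 7, 5)]
    kept, sent_back = cancelback(bufs, k=2)
    print(sent_back)                              # [(9, 'msg9'), (7, 'msg7')]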

17.
For Part I, see ibid., pp. 170-180. In Part I, we presented a binding environment for the AND and OR parallel execution of logic programs. This environment was instrumental in rendering a compiler for the AND and OR parallel execution of logic programs machine independent. In this paper, we describe a compiler based on the Reduce-OR process model (ROPM) for the parallel execution of Prolog programs, and report the performance of the compiler on five parallel machines: the Encore Multimax, the Sequent Symmetry, the NCUBE 2, the Intel i860 hypercube, and a network of Sun workstations. The compiler is part of a machine-independent parallel Prolog development system built on top of a run-time environment for parallel programming called the Chare kernel, and runs unchanged on these multiprocessors. In keeping with the objectives behind the ROPM, the compiler supports both OR and independent AND parallelism in Prolog programs and is suitable for execution on both shared- and non-shared-memory machines. We discuss the performance of the Prolog compiler in some detail and describe how grain size can be used to deliver performance that is within 10% of the underlying sequential Prolog compiler on one processor and scales linearly with increasing numbers of processors on problems exhibiting sufficient parallelism. The loose coupling between parallel and sequential components makes it possible to use the best available sequential compiler as the sequential component of our compiler.

18.
In order to exploit effectively the power of array and vector processors for the numerical solution of linear algebraic problems, it is desirable to express algorithms principally in terms of vector and matrix operations. Algorithms that manipulate vectors and matrices at component level are best suited for execution on single-processor hardware; often, however, it is difficult, if not impossible, to construct efficient versions of such algorithms suitable for execution on parallel hardware. A method for computing the eigenvalues of real unsymmetric matrices with real eigenvalue spectra is presented. The method is an extension of the one described in ref. [1]. The algorithm makes heavy use of vector inner-product evaluations, and the manipulation of individual components of vectors and matrices is kept to a minimum. Essentially, the method involves the construction of a sequence of biorthogonal transformation matrices whose combined effect is to diagonalise the matrix; the eigenvalues are the diagonal elements of the final diagonalised form. If the eigenvectors of the matrix are also required, the algorithm may be extended in a straightforward way. The effectiveness of the algorithm is demonstrated by applying the sequential version to several small matrices, and some comments are made about the time complexity of the parallel version.

19.

Centrality measures, or indicators of centrality, identify the most relevant nodes of a graph. Although optimized algorithms exist for computing most of them, they are still time consuming and can even be infeasible to apply to sufficiently large graphs, such as those representing social networks or extensive computer networks. In this paper, we present a parallel implementation in the C language of optimized algorithms for computing several indicators of centrality. Our parallel version greatly reduces the execution time of its sequential (non-parallel) counterpart. The proposed solution relies on threading, allowing for a theoretical improvement in performance close to the number of logical processors (cores) of the single computer on which it is running. Our software has been tested on several platforms, including the supercomputer Calendula, on which we achieved execution times close to 18 times faster when running our parallel implementation instead of our sequential one. Our solution is multi-platform and portable, working on any machine with several logical processors that is capable of compiling and running C language code.
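The node-partitioned scheme the paper describes can be sketched as follows, with Python standing in for the authors' C/threads implementation and closeness centrality (a standard measure) as the example; the tiny graph is illustrative. Each node's centrality is independent of the others, so the node set is simply split across a pool of workers, one per logical core:

    # Node-partitioned centrality: closeness of each node is computed
    # independently, so nodes are mapped across a worker pool.

    from concurrent.futures import ProcessPoolExecutor
    from collections import deque

    GRAPH = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}

    def closeness(source):
        """Closeness = (n - 1) / sum of BFS distances from source."""
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in GRAPH[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return source, (len(dist) - 1) / sum(dist.values())

    if __name__ == "__main__":
        with ProcessPoolExecutor() as pool:     # one worker per logical core
            for node, c in pool.map(closeness, GRAPH):
                print(node, round(c, 3))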


20.
Most parallel jobs cannot be fully parallelized. In a homogeneous parallel machine (one in which all processors are identical), the serial fraction of the computation has to be executed at the speed of one of the identical processors, limiting the speedup obtainable from parallelism. In a heterogeneous architecture, the sequential bottleneck can be greatly reduced by running the sequential part of the job, or even the critical tasks, on a faster processor. This paper uses Markov-chain-based models to analyze the performance of static and dynamic processor assignment policies for heterogeneous architectures. Parallel jobs are assumed to be described by acyclic directed task graphs. A new static processor assignment policy, called Largest Task First Minimum Finish Time (LTFMFT), is introduced. The analysis shows that this policy is very sensitive to the degree of heterogeneity of the architecture, and that it outperforms all the other policies analyzed. Three dynamic assignment disciplines are compared, and it is shown that, in heterogeneous environments, the disciplines that perform better are those that consider the structure of the task graph, and not only the service demands of the individual tasks. The performance of heterogeneous architectures is compared with cost-equivalent homogeneous ones, taking into account different scheduling policies. Finally, static and dynamic processor assignment disciplines are compared in terms of performance.
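The LTFMFT policy's name describes its core loop, which the sketch below renders for independent tasks: take the largest remaining task and place it on the processor that would finish it earliest given that processor's speed and backlog. Handling of precedence constraints from the paper's task graphs is omitted, and the demands and speeds are illustrative:

    # Sketch of the LTFMFT idea for independent tasks (precedence handling
    # from the paper's task graphs is omitted).

    def ltfmft(task_demands, speeds):
        finish = [0.0] * len(speeds)              # current finish time per CPU
        assignment = []
        for demand in sorted(task_demands, reverse=True):   # largest first
            # finish time if this task were appended to each processor
            candidate = [finish[p] + demand / speeds[p]
                         for p in range(len(speeds))]
            p = min(range(len(speeds)), key=lambda i: candidate[i])
            finish[p] = candidate[p]
            assignment.append((demand, p))
        return assignment, max(finish)            # schedule and makespan

    # Heterogeneous machine: one processor twice as fast as the others.
    plan, makespan = ltfmft([8, 5, 4, 3, 2, 2], speeds=[2.0, 1.0, 1.0])
    print(plan, makespan)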
