
1.

Compiler-Optimized Simulation of Large-Scale Applications on High Performance Architectures

《Journal of Parallel and Distributed Computing》2002,62(3):393-426

In this paper, we propose and evaluate practical, automatic techniques that exploit compiler analysis to facilitate simulation of very large message-passing systems. We use compiler techniques and a compiler-synthesized static task graph model to identify the subset of the computations whose values have no significant effect on the performance of the program, and to generate symbolic estimates of the execution times of these computations. For programs with regular computation and communication patterns, this information allows us to avoid executing or simulating large portions of the computational code during the simulation. It also allows us to avoid performing some of the message data transfers, while still simulating the message performance in detail. We have used these techniques to integrate the MPI-Sim parallel simulator at UCLA with the Rice dHPF compiler infrastructure. We evaluate the accuracy and benefits of these techniques for three standard message-passing benchmarks on a wide range of problem and system sizes. The optimized simulator has errors of less than 16% compared with direct program measurement in all the cases we studied, and typically much smaller errors. Furthermore, it requires factors of 5 to 2000 less memory and up to a factor of 10 less time to execute than the original simulator. These dramatic savings allow us to simulate regular message-passing programs on systems and problem sizes 10 to 100 times larger than is possible with the original simulator, or other current state-of-the-art simulators.

2.

Experiences with component‐oriented technologies in nuclear power plant simulators

Manuel Díaz Daniel Garrido Sergio Romero Bartolomé Rubio Enrique Soler José M. Troya 《Software》2006,36(13):1489-1512

This paper proposes the application of modern component‐oriented technologies to the development of nuclear power plant simulators. On the one hand, as a significant improvement on previous simulators, the new kernel is based on the Common Component Architecture (CCA). The use of such a high‐performance computing oriented component technology, together with a novel algorithm to automatically resolve simulation data dependencies, allows the efficient execution of both parallel and sequential simulation models. On the other hand, RT‐CORBA is employed in the development of the rest of the applications that comprise the simulator. This real‐time communication middleware not only makes the management of communications easier, but also provides the applications with real‐time capabilities. Software components used in these two ways, simulation models integrating the kernel and distributed applications from which the simulator is comprised, improve the evolution and maintenance of the entire system, as well as promoting code reusability in other projects. Copyright © 2006 John Wiley & Sons, Ltd.

3.

Grids with multiple batch systems for performance enhancement of multi-component and parameter sweep parallel applications

M. Sathish S. Ravi S. 《Future Generation Computer Systems》2010,26(2):217-227

In this work, we evaluate the benefits of using Grids with multiple batch systems to improve the performance of multi-component and parameter sweep parallel applications by reduction in queue waiting times. Using different job traces of different loads, job distributions and queue waiting times corresponding to three different queuing policies (FCFS, conservative and EASY backfilling), we conducted a large number of experiments using simulators of two important classes of applications. The first simulator models Community Climate System Model (CCSM), a prominent multi-component application and the second simulator models parameter sweep applications. We compare the performance of the applications when executed on multiple batch systems and on a single batch system for different system and application configurations. We show that there are a large number of configurations for which application execution using multiple batch systems can give improved performance over execution on a single system.
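The queue waiting time that this abstract refers to can be illustrated with a minimal sketch of the simplest of the three policies, FCFS (first come, first served). This is not the authors' simulator: the function name `fcfs_wait_times`, the `(arrival, procs, runtime)` job tuple, and the example numbers are all hypothetical, and backfilling is deliberately left out.

```python
import heapq


def fcfs_wait_times(jobs, total_procs):
    """Return the queue waiting time of each job under plain FCFS.

    jobs: list of (arrival, procs, runtime) tuples sorted by arrival time
          (a hypothetical trace format chosen for this sketch).
    total_procs: number of processors in the single batch system.
    """
    running = []          # min-heap of (end_time, procs) for running jobs
    free = total_procs    # processors currently idle
    clock = 0.0           # start time of the most recently started job
    waits = []
    for arrival, procs, runtime in jobs:
        if procs > total_procs:
            raise ValueError("job requests more processors than exist")
        # FCFS: a job can never start before the job ahead of it started.
        clock = max(clock, arrival)
        # Release processors of jobs that have already finished.
        while running and running[0][0] <= clock:
            _, released = heapq.heappop(running)
            free += released
        # Otherwise wait for running jobs to end until enough are free.
        while free < procs:
            end, released = heapq.heappop(running)
            clock = max(clock, end)
            free += released
        heapq.heappush(running, (clock + runtime, procs))
        free -= procs
        waits.append(clock - arrival)
    return waits


# Three jobs on a 4-processor system; the first job occupies all of it.
example_waits = fcfs_wait_times([(0, 4, 10), (1, 2, 5), (2, 2, 5)], 4)
```

In this example the second and third jobs both wait until the first job releases its processors at time 10. Comparing such waits across several independent queues is one way to picture why submitting to multiple batch systems could shorten the time to start.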

4.

Parallelized direct execution simulation of message-passing parallel programs

Dickens P.M. Heidelberger P. Nicol D.M. 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(10):1090-1105

As massively parallel computers proliferate, there is growing interest in finding ways by which performance of massively parallel codes can be efficiently predicted. This problem arises in diverse contexts such as parallelizing compilers, parallel performance monitoring, and parallel algorithm development. In this paper, we describe one solution where one directly executes the application code, but uses a discrete-event simulator to model details of the presumed parallel machine, such as operating system and communication network behavior. Because this approach is computationally expensive, we are interested in its own parallelization, specifically the parallelization of the discrete-event simulator. We describe methods suitable for parallelized direct execution simulation of message-passing parallel programs, and report on the performance of such a system, LAPSE (Large Application Parallel Simulation Environment), which we have built on the Intel Paragon. On all codes measured to date, LAPSE predicts performance well, typically within 10% relative error. Depending on the nature of the application code, we have observed low slowdowns (relative to natively executing code) and high relative speedups using up to 64 processors.

5.

Nonblocking checkpointing for optimistic parallel simulation: description and an implementation

Quaglia F. Santoro A. 《Parallel and Distributed Systems, IEEE Transactions on》2003,14(6):593-610

Describes a nonblocking checkpointing mode in support of optimistic parallel discrete event simulation. This mode allows real concurrency in the execution of state saving and other simulation specific operations (e.g., event list update, event execution) with the aim of removing the cost of recording state information from the completion time of the parallel simulation application. We present an implementation of a C library supporting nonblocking checkpointing on a Myrinet-based cluster, which demonstrates the practical viability of this checkpointing mode on standard off-the-shelf hardware. By the results of an empirical study on classical parameterized synthetic benchmarks, we show that, except for the case of minimal state granularity applications, nonblocking checkpointing allows improvement of the speed of the parallel execution, as compared to commonly adopted, optimized checkpointing methods based on the classical blocking mode. A performance study for the case of a personal communication system (PCS) simulation is additionally reported to point out the benefits from nonblocking checkpointing for a real world application.

6.

Studying energy trade offs in offloading computation/compilation in Java-enabled mobile devices

Chen G. Kang B.-T. Kandemir M. Vijaykrishnan N. Irwin M.J. Chandramouli R. 《Parallel and Distributed Systems, IEEE Transactions on》2004,15(9):795-809

Java-enabled wireless devices are preferred for various reasons. For example, users can dynamically download Java applications on demand. The dynamic download capability supports extensibility of the mobile client features and centralizes application maintenance at the server. Also, it enables service providers to customize features for the clients. In this work, we extend this client-server collaboration further by offloading some of the computations (i.e., method execution and dynamic compilation) normally performed by the mobile client to the resource-rich server in order to conserve energy consumed by the client in a wireless Java environment. In the proposed framework, the object serialization feature of Java is used to allow offloading of both method execution and bytecode-to-native code compilation to the server when executing a Java application. Our framework takes into account communication, computation, and compilation energies to decide where to compile and execute a method (locally or remotely), and how to execute it (using interpretation or just-in-time compilation with different levels of optimizations). As both computation and communication energies vary based on external conditions (such as the wireless channel state and user supplied inputs), our decision must be done dynamically when a method is invoked. Our experiments, using a set of Java applications executed on a simulation framework, reveal that the proposed techniques are very effective in conserving the energy of the mobile client.

7.

A simulator for adaptive parallel applications

Basile Schaeli Sebastian Gerlach Roger D. Hersch 《Journal of Computer and System Sciences》2008,74(6):983-999

Dynamically allocating computing nodes to parallel applications is a promising technique for improving the utilization of cluster resources. Detailed simulations can help identify allocation strategies and problem decomposition parameters that increase the efficiency of parallel applications. We describe a simulation framework supporting dynamic node allocation which, given a simple cluster model, predicts the running time of parallel applications taking CPU and network sharing into account. Simulations can be carried out without needing to modify the application code. Thanks to partial direct execution, simulation times and memory requirements are reduced. In partial direct execution simulations, the application's parallel behavior is retrieved via direct execution, and the duration of individual operations is obtained from a performance prediction model or from prior measurements. Simulations may then vary cluster model parameters, operation durations and problem decomposition parameters to analyze their impact on the application performance and identify the limiting factors. We implemented the proposed techniques by adding direct execution simulation capabilities to the Dynamic Parallel Schedules parallelization framework. We introduce the concept of dynamic efficiency to express the resource utilization efficiency as a function of time. We verify the accuracy of our simulator by comparing the effective running time, respectively the dynamic efficiency, of parallel program executions with the running time, respectively the dynamic efficiency, predicted by the simulator under different parallelization and dynamic node allocation strategies.

8.

Harmless, a hardware architecture description language dedicated to real-time embedded system simulation

Rola Kassem Mikaël Briday Jean-Luc Béchennec Guillaume Savaton Yvon Trinquet 《Journal of Systems Architecture》2012,58(8):318-337


9.

Debugging mixed‐environment programs with Blink

Byeongcheol Lee Martin Hirzel Robert Grimm Kathryn S. McKinley 《Software》2015,45(9):1277-1306

Programmers build large‐scale systems with multiple languages to leverage legacy code and languages best suited to their problems. For instance, the same program may use Java for ease of programming and C to interface with the operating system. These programs pose significant debugging challenges, because programmers need to understand and control code across languages, which often execute in different environments. Unfortunately, traditional multilingual debuggers require a single execution environment. This paper presents a novel composition approach to building portable mixed‐environment debuggers, in which an intermediate agent interposes on language transitions, controlling and reusing single‐environment debuggers. We implement debugger composition in Blink, a debugger for Java, C, and the Jeannie programming language. We show that Blink is (i) simple: it requires modest amounts of new code; (ii) portable: it supports multiple Java virtual machines, C compilers, operating systems, and component debuggers; and (iii) powerful: composition eases debugging, while supporting new mixed‐language expression evaluation and Java native interface bug diagnostics. To demonstrate the generality of interposition, we build prototypes and demonstrate debugger language transitions with C for five of six other languages (Caml, Common Lisp, C#, Perl 5, Python, and Ruby) without modifications to their debuggers. Using real‐world case studies, we show that diagnosing language interface errors requires prior single‐environment debuggers to restart execution multiple times, whereas Blink directly diagnoses them with one execution. Copyright © 2014 John Wiley & Sons, Ltd.

10.

Architecture Independent Characterization of Embedded Java Workloads

Desai A. Singh J. 《Computer Architecture Letters》2009,8(1):29-32

This paper presents architecture independent characterization of embedded Java workloads based on the industry standard GrinderBench benchmark which includes different classes of real world embedded Java applications. This work is based on a custom built embedded Java virtual machine (JVM) simulator specifically designed for embedded JVM modeling and embodies domain specific details such as thread scheduling, algorithms used for native CLDC APIs and runtime data structures optimized for use in embedded systems. The results presented include dynamic execution characteristics, dynamic bytecode instruction mix, application and API workload distribution, object allocation statistics, instruction-set coverage, memory usage statistics and method code and stack frame characteristics.

11.

Message-passing environments for metacomputing

Matthias A. Brune Graham E. Fagg Michael M. Resch 《Future Generation Computer Systems》1999,15(5-6):699-712

In this paper, we present the three libraries PACX-MPI, PLUS, and PVMPI that provide message-passing between different high-performance computers in metacomputing environments. Each library supports the development and execution of distributed metacomputer applications.

The PACX-MPI approach offers a transparent interface for the communication between two or more MPI environments. PVMPI allows the user to spawn parallel processes under the MPI environment. The PLUS protocol bridges the gap between vendor-specific (e.g., MPL, NX, and PARIX) and vendor-independent message-passing environments (e.g., PVM and MPI). Moreover, it offers the ability to create and control processes at application runtime.

12.

Parallel-processing applications for data analysis in the social sciences

John Skvoretz Shelley A. Smith Chuck Baldwin 《Concurrency and Computation》1992,4(3):207-221

We examine parallel-processing applications to the analysis of large data sets typically used in social science research. Our research uses a parallel environment which makes it possible to have 1024 processors working simultaneously on a problem. The application is tested using various configurations of number of processors and block-size of data reads on the estimation of a linear model of earnings for the California portion of the 15% sample of the 1970 Census. Performance factors assessed include total execution time, speed-up and efficiency. Execution times are also compared with reference to execution times on an IBM 3081 using SPSS-X. Results indicate that optimal configurations of number of processors and data block-size can produce significantly faster execution times for linear model estimation on relatively large (80,000 cases) data sets. We also discuss other applications of parallel processing to statistical analyses commonly found in social science.
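The abstract names speed-up and efficiency as performance factors without defining them. A minimal sketch, assuming the standard textbook definitions (speed-up as serial time divided by parallel time, efficiency as speed-up divided by processor count), might look as follows. The function names and the timings in the example are hypothetical and are not taken from the paper.

```python
def speedup(t_serial, t_parallel):
    """Standard speed-up: how many times faster the parallel run is."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, num_procs):
    """Standard efficiency: speed-up per processor, 1.0 being ideal."""
    return speedup(t_serial, t_parallel) / num_procs


# Hypothetical timings: 100 s on one processor, 25 s on eight processors.
s = speedup(100.0, 25.0)        # 4.0
e = efficiency(100.0, 25.0, 8)  # 0.5
```

An efficiency well below 1.0, as in this made-up case, would indicate that adding processors is yielding diminishing returns, which is the kind of trade-off a search over processor counts and data block sizes is meant to expose.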

13.

Dynamically adapting to system load and program behavior in multiprogrammed multiprocessor systems

Iffat H. Kazi David J. Lilja 《Concurrency and Computation》2002,14(12):957-985

Parallel execution of application programs on a multiprocessor system may lead to performance degradation if the workload of a parallel region is not large enough to amortize the overheads associated with the parallel execution. Furthermore, if too many processes are running on the system in a multiprogrammed environment, the performance of the parallel application may degrade due to resource contention. This work proposes a comprehensive dynamic processor allocation scheme that takes both program behavior and system load into consideration when dynamically allocating processors. This mechanism was implemented on the Solaris operating system to dynamically control the execution of parallel C and Java application programs. Performance results show the effectiveness of this scheme in dynamically adapting to the current execution environment and program behavior, and that it outperforms a conventional time‐shared system. Copyright © 2002 John Wiley & Sons, Ltd.

14.

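Research and implementation of a task-level simulation environment for parallel processing

Lin Chengjiang Li Sanli 《Chinese Journal of Computers》1995,18(7):496-501

Parallel processing simulation supports the modeling and analysis of parallel systems, the simulated execution of parallel algorithms, and the performance evaluation of parallel environments. Using task-dependent simulation clocks and overlapping time slices, this paper builds a parallel multitasking model that supports both fully parallel and user-concurrent modes. Drawing on simulation experiments with different scheduling algorithms and interconnection structures, it focuses on the impact of task scheduling on system performance and on the relationship between interconnection network technology and communication overhead. The simulation environment also provides concurrency-degree curves for the simulated execution and task execution traces, which users can use to analyze and debug parallel programs.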

15.

JASAG: a gridification tool for agricultural simulation applications

M. Arroqui J. Rodriguez Alvarez H. Vazquez C. Machado C. Mateos A. Zunino 《Concurrency and Computation》2015,27(17):4716-4740

The Grid Computing paradigm aims to create a ‘virtual’ and powerful single computer with many distributed resources to solve resource intensive problems. The term ‘gridification’ involves the process of transforming a conventional application to run in a Grid environment. In that sense, the more automatic this process is, the easier it is for developers with low expertise in parallel and distributed computing to take advantage of these resources. To date, many semiautomatic gridifiers were built to support different gridification approaches and application code structures or anatomies. Furthermore, agricultural simulation applications have a particular common anatomy based on biophysical entities, such as animals, crops, and pastures, which are updated by actions, such as growing animals, growing crops, and growing pastures, along simulation execution. However, this anatomy is not fully supported by any of the existing gridifiers. Thus, this paper presents Agricultural Simulation Applications Gridifier (ASAG), a method for easy gridification of agricultural simulation applications, and its Java implementation, named Java ASAG (JASAG). The main design drivers of JASAG are middleware independence, separation of business logic and Grid behavior, and performance increase. An experimental evaluation showing the feasibility of the gridification method and its implementation is also reported, which resulted in speedups of up to 25 by using a real agricultural simulation application. Copyright © 2014 John Wiley & Sons, Ltd.

16.

A modernized version of a 1D soil vegetation atmosphere transfer model for improving its future use in land surface interactions studies

《Environmental Modelling & Software》2017

SimSphere is a land biosphere model that provides a mathematical representation of vertical ‘views’ of the physical mechanisms controlling Earth's energy and mass transfers in the soil/vegetation/atmosphere continuum. Herein, we present recent advancements introduced to SimSphere code, aiming at making its use more integrated to the automation of processes within High Performance Computing (HPC) that allows using the model at large scale. In particular, a new interface to the model is presented, so-called “SimSphere-SOA” which forms a command line land biosphere tool, a Web Service interface and a parameters verification facade that offers a standardised environment for specification execution and result retrieval of a typical model simulation based on Service Oriented Architecture (SOA). SimSphere-SOA library can now execute various simulations in parallel. This allows exploitation of the tool in a simple and efficient way in comparison to the currently distributed approach. In SimSphere-SOA, an Application Programming Interface (API) is also provided to execute simulations that can be publicly consumed. Finally this API is exported as a Web Service for remotely executing simulations through web based tools. This way a simulation by the model can be executed efficiently and subsequently the model simulation outputs may be used in any kind of relevant analysis required. The use of these new functionalities offered by SimSphere-SOA is also demonstrated using a “real world” simulation configuration file. The inclusion of those new functions in SimSphere is of considerable importance in the light of the model's expanding use worldwide as an educational and research tool.

17.

The simulation model partitioning problem: An adaptive solution based on self-clustering

《Simulation Modelling Practice and Theory》2017

This paper is about partitioning in parallel and distributed simulation, that is, decomposing the simulation model into a number of components and properly allocating them on the execution units. An adaptive solution based on self-clustering, that considers both communication reduction and computational load-balancing, is proposed. The implementation of the proposed mechanism is tested using a simulation model that is challenging both in terms of structure and dynamicity. Various configurations of the simulation model and the execution environment have been considered. The obtained performance results are analyzed using a reference cost model. The results demonstrate that the proposed approach is promising and that it can reduce the simulation execution time in both parallel and distributed architectures.

18.

RTS: A system to simulate the real time cost behaviour of parallel computations

Bin Qin Howard A. Sholl Reda A. Ammar 《Software》1988,18(10):967-985

In this paper, we present a software tool, RTS (real time simulator), that analyses the time cost behaviour of parallel computations through simulation. It is assumed in RTS that the computer system which supports the executions of parallel computations has a limited number of processors, that all processors have the same speed, and that they communicate with each other through a shared memory. In RTS, the time cost of a parallel computation is defined as a function of the input, the algorithm, the data structure, the processor speed, the number of processors, the processor power allocation, the communication and the execution environment. How RTS models the time cost is first discussed in the paper. In the model, a locking technique is used to manipulate the access to the shared memory, processing power is equally allocated among all the operations that are currently being performed in parallel in the computer system, and the number of operations in the execution environment of a parallel computation changes from time to time. How RTS works and how the simulation is used to do time cost analysis are also discussed.
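The equal allocation of processing power described in this abstract can be pictured with a small processor-sharing sketch. It is an illustration under stated assumptions, not the RTS implementation: all operations are assumed ready at time 0, an operation is assumed to use at most one processor at a time, shared-memory locking is ignored, and the name `equal_share_finish_times` and its parameters are hypothetical.

```python
def equal_share_finish_times(work, num_procs, speed=1.0):
    """Finish time of each operation when power is shared equally.

    work: amount of work per operation (all ready at time 0; assumption).
    num_procs: number of identical processors.
    speed: work units per time unit delivered by one processor.
    """
    remaining = {i: w for i, w in enumerate(work) if w > 0}
    finish = [0.0] * len(work)
    now = 0.0
    while remaining:
        active = len(remaining)
        # Equal share of total power, capped at one processor per
        # operation (the cap is an assumption of this sketch).
        rate = speed * min(1.0, num_procs / active)
        # The operation with the least remaining work finishes next.
        first = min(remaining, key=remaining.get)
        dt = remaining[first] / rate
        now += dt
        for i in list(remaining):
            remaining[i] -= rate * dt
            if i == first or remaining[i] <= 1e-9:
                finish[i] = now
                del remaining[i]
    return finish


# Three operations of 2, 4 and 6 work units on two unit-speed processors.
times = equal_share_finish_times([2.0, 4.0, 6.0], 2)
```

With three operations on two processors each operation initially runs at two thirds of full speed. Once the shortest one finishes, the remaining two each get a whole processor, which is how a changing number of active operations changes the time cost.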

19.

A NoC-based hybrid message-passing/shared-memory approach to CMP design

Mario R. Casu Massimo Ruo Roch Sergio V. Tota Maurizio Zamboni 《Microprocessors and Microsystems》2011,35(2):261-273

Future chip-multiprocessors (CMP) will integrate many cores interconnected with a high-bandwidth and low-latency scalable network-on-chip (NoC). However, the potential that this approach offers at the transport level needs to be paired with an analogous paradigm shift at the higher levels. In particular, the standard shared-memory programming model fails to address the requirements of scalability of the many-core era. Fast data exchange among the cores and low-latency synchronization are desirable but hard to achieve in practice due to the memory hierarchy. The message-passing paradigm permits instead direct data communication and synchronization between the cores. The shared-memory could still be used for the instruction fetch. Hence, we propose a hybrid approach that combines shared-memory and message passing in a single general-purpose CMP architecture that allows efficient execution of applications developed with both parallel programming approaches. Cores fetch instructions from a hierarchical memory and exchange their data through the same memory, for compatibility with existing software, or directly through the fast NoC. We developed a fast SystemC based cycle-accurate simulator for design space explorations that we used to evaluate the performance with real benchmarks. The various components have been RTL coded and mapped to a CMOS 45 nm technology to build a silicon area model that we used to select the best architectural configurations.

20.

Dynamic load‐balancing mechanism for distributed Java applications

Violeta Felea Bernard Toursel 《Concurrency and Computation》2006,18(3):305-331

Program environments or operating systems generally leave the decision on the allocation of program entities to the developer, offering either placement directives, or tools available through the manipulation of a graphical interface. These approaches cannot always take into account the dynamic behavior of applications, dynamicity in the execution environment or the heterogeneity of the execution platform. Transparent deployment algorithms are necessary for automating and optimizing application distribution. The Adaptive Distributed Applications in Java (ADAJ) project deals with placement and migration of Java objects. It automatically deploys parallel Java applications on a cluster of workstations using monitoring information about the application behavior. The transparency obtained through the integration of these tools in the middleware makes such an environment easy to use and improves efficiency. Copyright © 2005 John Wiley & Sons, Ltd.