首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 125 毫秒
1.
This paper describes the design and implementation of an Efficient Architecture for Running THreads (EARTH) runtime system for a multi‐processor/multi‐node cluster. The (EARTH) model was designed to support the efficient execution of parallel (multi‐threaded) programs with irregular fine‐grain parallelism using off‐the‐shelf computers. Implementing an EARTH runtime system requires an explicitly threaded runtime system. For portability, we built this runtime system on top of Pthreads under Linux and used sockets for inter‐node communication. Moreover, in order to make the best use of the resources available on a cluster of symmetric multi‐processors (SMP), this implementation enables the overlapping of communication and computation. We used Threaded‐C, a language designed to implement the programming model supported by the EARTH architecture. This language allows the expression of various levels of parallelism and provides the primitives needed to manage the required communication and synchronization. The Threaded‐C programming language supports irregular fine‐grain parallelism through a two‐level hierarchy of threads and fibers. It also provides various synchronization and communication constructs that reflect the nature of EARTH's fibers—non‐preemptive execution with data‐driven scheduling—as well as the extensive use of split‐phase transactions on EARTH to execute long‐latency operations. Copyright © 2003 John Wiley & Sons, Ltd.  相似文献   

2.
This paper advances the state-of-the-art in programming models for exploiting task-level parallelism on heterogeneous many-core systems, presenting a number of extensions to the OpenMP language inspired in the StarSs programming model. The proposed extensions allow the programmer to write portable code easily for a number of different platforms, relieving him/her from developing the specific code to off-load tasks to the accelerators and the synchronization of tasks. Our results obtained from the StarSs instantiations for SMPs, the Cell, and GPUs report reasonable parallel performance. However, the real impact of our approach in is the productivity gains it yields for the programmer.  相似文献   

3.
It is difficult to express the parallelism present in complex computations by using existing higher level abstractions such as MapReduce and Dryad. These computations include applications from wide variety of domains, like Artificial Intelligence, Decision Tree Algorithms, Association Rule Mining, Recommender Systems, Graph Algorithms, Clustering Algorithms, Compute Intensive Scientific Workflows, Optimization Algorithms, and so forth. Their execution graphs introduce new challenges in terms of programmer expressibility and runtime performance such as iterative and recursive computations, shared communication model, and so forth. We propose an extension to MapReduce, called Generate‐Map‐Reduce (GMR), targeted towards modeling these applications. GMR introduces a new Generate abstraction into the MapReduce framework that captures recursive computations. The runtime also supports iterative jobs and a distributed communication model by using shared data structures. We illustrate recursive computations with GMR by modeling complex applications such as simulated annealing, A* search, and adaptive quadrature computation that require recursive spawning of new tasks to handle variable degree of parallelism. GMR runtime supports caching of common data across iterations in memory and local disks. We illustrate how this caching helps in achieving significant speedup for iterative computations by modeling k‐means clustering. Copyright © 2013 John Wiley & Sons, Ltd.  相似文献   

4.
The development of service robots has gained more attention over the last years. Advanced robots have to cope with many different situations emerging at runtime, while executing complex tasks. They should be programmed as dynamically adaptive systems, capable of adapting themselves to the execution environment, including the computing, user, and physical environment. Recently, dynamic languages are becoming widely used because of the high runtime adaptability they offer. Therefore, we have analyzed the suitability of these languages to implement robotic systems with high runtime adaptability requirements, using Python as case study because of its maturity. To evaluate their suitability, we have implemented a reflective robotics framework that can be programmed in both Java and any dynamic language supported by the standard Java Scripting API. An example scenario has been developed using Python to show how its distinguishing meta‐programming features have facilitated the development of runtime‐adaptable robotics services. Copyright © 2012 John Wiley & Sons, Ltd.  相似文献   

5.
Fast multipole methods (FMMs) have complexity, are compute bound, and require very little synchronization, which makes them a favorable algorithm on next‐generation supercomputers. Their most common application is to accelerate N‐body problems, but they can also be used to solve boundary integral equations. When the particle distribution is irregular and the tree structure is adaptive, load balancing becomes a non‐trivial question. A common strategy for load balancing FMMs is to use the work load from the previous step as weights to statically repartition the next step. The authors discuss in the paper another approach based on data‐driven execution to efficiently tackle this challenging load balancing problem. The core idea consists of breaking the most time‐consuming stages of the FMMs into smaller tasks. The algorithm can then be represented as a directed acyclic graph where nodes represent tasks and edges represent dependencies among them. The execution of the algorithm is performed by asynchronously scheduling the tasks using the queueing and runtime for kernels runtime environment, in a way such that data dependencies are not violated for numerical correctness purposes. This asynchronous scheduling results in an out‐of‐order execution. The performance results of the data‐driven FMM execution outperform the previous strategy and show linear speedup on a quad‐socket quad‐core Intel Xeon system.Copyright © 2013 John Wiley & Sons, Ltd.  相似文献   

6.
The asynchronous partitioned global address space (APGAS) model is a programming model aiming at unifying programming on multicore and clusters, with good productivity. However, it currently lacks support for fault tolerance (FT) such that a single transient failure may render hours to months of computation useless.  相似文献   

7.
8.
The paper considers the modular programming with hierarchically structured multi-processor tasks on top of SPMD tasks for distributed memory machines. The parallel execution requires a corresponding decomposition of the set of processors into a hierarchical group structure onto which the tasks are mapped. The result is a multi-level group SPMD computation model with varying processor group structures. The advantage of this kind of mixed task and data parallelism is a potential to reduce the communication overhead and to increase scalability. We present a runtime library to support the coordination of hierarchically structured multi-processor tasks. The library exploits an extended parallel group SPMD programming model and manages the entire task execution including the dynamic hierarchy of processor groups. The library is built on top of MPI, has an easy-to-use interface, and leads to only a marginal overhead while allowing static planning and dynamic restructuring.  相似文献   

9.
While large‐scale parallel/distributed simulations are rapidly becoming critical research modalities in academia and industry, their efficient and scalable implementations continue to present many challenges. A key challenge is that the dynamic and complex communication/coordination required by these applications (dependent on the state of the phenomenon being modeled) are determined by the specific numerical formulation, the domain decomposition and/or sub‐domain refinement algorithms used, etc. and are known only at runtime. This paper presents Seine, a dynamic geometry‐based shared‐space interaction framework for scientific applications. The framework provides the flexibility of shared‐space‐based models and supports extremely dynamic communication/coordination patterns, while still enabling scalable implementations. The design and prototype implementation of Seine are presented. Seine complements and can be used in conjunction with existing parallel programming systems such as MPI and OpenMP. An experimental evaluation using an adaptive multi‐block oil‐reservoir simulation is used to demonstrate the performance and scalability of applications using Seine. Copyright © 2006 John Wiley & Sons, Ltd.  相似文献   

10.
Multiprocessor execution of functional programs   总被引:1,自引:0,他引:1  
Functional languages have recently gained attention as vehicles for programming in a concise and elegant manner. In addition, it has been suggested that functional programming provides a natural methodology for programming multiprocessor computers. This paper describes research that was performed to demonstrate that multiprocessor execution of functional programs on current multiprocessors is feasible, and results in a significant reduction in their execution times.Two implementations of the functional language ALFL were built on commercially available multiprocessors.Alfalfa is an implementation on the Intel iPSC hypercube multiprocessor, andBuckwheat is an implementation on the Encore Multimax shared-memory multiprocessor. Each implementation includes a compiler that performs automatic decomposition of ALFL programs and a run-time system that supports their execution. The compiler is responsible for detecting the inherent parallelism in a program, and decomposing the program into a collection of tasks, calledserial combinators, that can be executed in parallel.The abstract machine model supported by Alfalfa and Buckwheat is calledheterogeneous graph reduction, which is a hybrid of graph reduction and conventional stack-oriented execution. This model supports parallelism, lazy evaluation, and highe order functions while at the same time making efficient use of the processors in the system. The Alfalfa and Buckwheat runtime systems support dynamic load balancing, interprocessor communication (if required), and storage management. A large number of experiments were performed on Alfalfa and Buckwheat for a variety of programs. The results of these experiments, as well as the conclusions drawn from them, are presented.This research was supported in part by National Science Foundation grants DCR-8302018 and DCR-8521451, by a DARPA subcontract with SDC/Unisys, and by gifts from Burroughs Austin Research Center and the Intel Corporation.  相似文献   

11.
The arrival of multicore architectures has generated an interest in reformulating dense matrix computations as algorithms‐by‐blocks, where submatrices are units of data and computations with those blocks are units of computation. Rather than directly executing such an algorithm, a directed acyclic graph is generated at runtime that is then scheduled by a runtime system such as SuperMatrix. The benefit is a clear separation of concerns between the library and the heuristics for scheduling. In this paper, we show that this approach can be taken one step further using the same methodology and an ad hoc runtime to map algorithms‐by‐blocks to small clusters. With no change to the library code, and the application that uses it, the computational power of such small clusters can be utilized. An impressive performance on a number of small clusters is reported. As a proof of the flexibility of the solution, we report performance results on accelerated clusters based on graphics processors. We believe this to be a possible step towards programming many‐core architectures, as demonstrated by a port of the solution to Intel's Single‐chip Cloud Computer (Intel, Santa Clara, CA, USA). Copyright © 2012 John Wiley & Sons, Ltd.  相似文献   

12.
Threads of parallel applications need to communicate in order to fulfill their tasks. The communication performance between the cores in modern multi‐core architectures differs because of the memory and interconnection hierarchies. In these architectures, it is important to map the threads of parallel applications by taking into account the communication between them, to improve their performance and energy consumption. In parallel applications based on shared memory, communication is implicit, which makes it difficult to detect the communication pattern between the threads. In this paper, we introduce a new lightweight mechanism to detect the communication pattern between threads of shared memory applications using the translation lookaside buffer. Our mechanism relies on hardware features, which make it transparent to the programmer and allow the detection to be performed by the operating system during the execution of the application. We also developed a heuristic mapping algorithm that uses the detected pattern to dynamically map the threads to cores. Experiments were performed with applications from the NAS‐OMP and PARSEC parallel benchmark suites in a simulated machine as well as a real machine. Results show that our mechanism can substantially improve parallel application performance, as well as processor and DRAM energy consumption. Copyright © 2015 John Wiley & Sons, Ltd.  相似文献   

13.
Ivan Čukić 《Software》2016,46(12):1617-1656
There is a big class of problems that requires writing programs in an asynchronous manner. Cloud computing, service‐oriented architectures, multi‐core and heterogeneous systems all require programs to be written with asynchronous components. The necessity of concurrency and asynchronous execution brings in the added complexity of the inversion of control into the system, either through message passing or through event processing. In this paper, we introduce explicit programming language support for asynchronous programming that completely hides inversion of control. The presented programming model defines a common abstraction of the different types of tasks, both synchronous and asynchronous. It defines common imperative control constructs equivalent to those of the host programming language, along with a few more advanced ones for transactional and parallel execution that can universally work for any task type. It allows the programmer to implement the logic of an asynchronous system in a natural way by writing simple, seemingly, synchronous imperative code. We will show that the programs written using this approach are easier to understand by programmers. They are also easier to design automated tests for, and for performing computer‐based static analysis of the program logic. The principles behind this approach were tested in a couple of real‐world systems with worldwide user base. Our experience shows that it makes the complex code with a lot of interdependencies between asynchronously executed tasks easy to write and reason about. Copyright © 2016 John Wiley & Sons, Ltd.  相似文献   

14.
Constituent tasks of modern day Embedded Streaming Applications (ESAs), such as engine control systems, multimedia and software defined radios often exhibit execution behaviors that do not conform to conventional task models. ESAs consist of iterative, pipelined sequences of tasks that are conditioned by intra- and inter-iteration dependencies, and often have strict throughput and latency requirements. We model ESAs as dataflow graphs, where actors represent computational units, and directed edges represent communication channels between actors. Due to practical constraints like cost-effectiveness, power consumption and chip-area, multiple ESAs are run on a shared (multi-processor) platform. Thus rigorous timing analysis is required to verify whether individual ESAs meet their respective timing requirements.We look at response modeling, a compositional timing analysis approach wherein the local worst-case influence of runtime scheduling is represented within the constructs provided in dataflow. These local representations (called response models) can be composed together to construct a global understanding of an ESA’s worst-case execution which is then used to verify whether its real-time requirements are met. This paper proposes a generic response modeling technique for runtime scheduling of ESAs. We focus on preemptive Fixed Priority Scheduling (FPS) but also demonstrate that we can apply our technique to a wide range of runtime schedulers. In our experiments, we present academic and industrial case-studies that highlight the effectiveness of our approach in the timing analysis of ESAs with unconventional execution behavior.  相似文献   

15.
A scheduling technique is presented to minimize service delay of aperiodic tasks in hard real‐time systems that employ dynamic‐priority scheduling and do not allow task preemption. In a real‐time scheduling process, the execution of periodic tasks can be deferred as long as this does not cause other tasks to violate their time constraints. However, aperiodic tasks that usually have urgent missions should complete execution as early as possible. In this paper, it is assumed that aperiodic tasks also have time constraints. Thus, the problem of deciding whether an aperiodic task with an unpredictable arrival time can be scheduled successfully or not is difficult to solve because delaying periodic tasks may cause them to fail to meet their time constraints. We present a dynamic scheduling technique to solve this problem which makes use of the symmetric property of a schedule. The maximum possible idle slot is always reserved at every scheduling point so that aperiodic tasks can be serviced immediately if the reserved idle slot is big enough to service them. The proposed technique also maximizes utilization of idle slots by reserving them for the longest possible time span.  相似文献   

16.
Fine-grain MPI (FG-MPI) extends the execution model of MPI to allow for interleaved execution of multiple concurrent MPI processes inside an OS-process. It provides a runtime that is integrated into the MPICH2 middleware and uses light-weight coroutines to implement an MPI-aware scheduler. In this paper we describe the FG-MPI runtime system and discuss the main design issues in its implementation. FG-MPI enables expression of function-level parallelism, which along with a runtime scheduler, can be used to simplify MPI programming and achieve performance without adding complexity to the program. As an example, we use FG-MPI to re-structure a typical use of non-blocking communication and show that the integrated scheduler relieves the programmer from scheduling computation and communication inside the application and brings the performance part outside of the program specification into the runtime.  相似文献   

17.
That the influence of the PRAM model is ubiquitous in parallel algorithm design is as clear as the fact that it is technologically infeasible for the forseeable future. The current generation of parallel hardware prominently features distributed memory and high‐performance interconnection networks—very much the antithesis of the shared memory required for the PRAM model. It has been shown that, in spite of communication costs, for some problems very fast parallel algorithms are available for distributed‐memory machines—from embarassingly parallel problems to sorting and numerical analysis. In contrast it is known that for other classes of problem PRAM‐style shared‐memory simulation on a distributed‐memory machine can, in theory, produce solutions of comparable performance to the best possible for such architectures. The Bulk Synchronous Parallel (BSP) model accurately represents most parallel machines—theoretical and actual—in an execution and cost model. We introduce a scalable portable PRAM realization appropriate for BSP computers and a methodology for usage. Our system is fast and built upon the familiar sequential C++ coupled with the new standard BSP library of parallel computation and communication primitives. It is portable to and predictable on a vast number of parallel computers including workstation clusters, a 256‐processor Cray T3D, an 8‐node IBM SP/2 and a 4‐node shared‐memory SGI Power Challenge machine. Our approach achieves simplicity of programming over direct‐mode BSP programming for reasonable overhead cost. We objectively compare optimized BSP and PRAM algorithms implemented with our C++ PRAM library and provide encouraging experimental results for our new style of programming. Copyright © 2000 John Wiley & Sons, Ltd.  相似文献   

18.
Pilsung Kang 《Software》2018,48(3):385-401
Function call interception (FCI), or method call interception (MCI) in the object‐oriented programming domain, is a technique of intercepting function calls at program runtime. Without directly modifying the original code, FCI enables to undertake certain operations before and/or after the called function or even to replace the intercepted call. Thanks to this capability, FCI has been typically used to profile programs, where functions of interest are dynamically intercepted by instrumentation code so that the execution control is transferred to an external module that performs execution time measurement or logging operations. In addition, FCI allows for manipulating the runtime behavior of program components at the fine‐grained function level, which can be useful in changing an application's original behavior at runtime to meet certain execution requirements such as maintaining performance characteristics for different input problem sets. Due to this capability, however, some FCI techniques can be used as a basis of many security exploits for vulnerable systems. In this paper, we survey a variety of FCI techniques and tools along with their applications to diverse areas in the computing and software domains. We describe static and dynamic FCI techniques at large and discuss the strengths and weaknesses of different implementations in this category. In addition, we also discuss aspect‐oriented programming implementation techniques for intercepting method calls.  相似文献   

19.
Several classes of scientific and commercial applications require the execution of a large number of independent tasks. One highly successful and low‐cost mechanism for acquiring the necessary computing power for these applications is the ‘public‐resource computing’, or ‘desktop Grid’ paradigm, which exploits the computational power of private computers. So far, this paradigm has not been applied to data mining applications for two main reasons. First, it is not straightforward to decompose a data mining algorithm into truly independent sub‐tasks. Second, the large volume of the involved data makes it difficult to handle the communication costs of a parallel paradigm. This paper introduces a general framework for distributed data mining applications called Mining@home. In particular, we focus on one of the main data mining problems: the extraction of closed frequent itemsets from transactional databases. We show that it is possible to decompose this problem into independent tasks, which however need to share a large volume of the data. We thus introduce a data‐intensive computing network, which adopts a P2P topology based on super peers with caching capabilities, aiming to support the dissemination of large amounts of information. Finally, we evaluate the execution of a pattern extraction task on such network. Copyright © 2009 John Wiley & Sons, Ltd.  相似文献   

20.
This paper studies the optimal portfolio trading problem under the generalized second‐order autoregressive execution price model. The problem of minimizing expected execution cost under the proposed price model is formulated as a quadratic programming (QP) problem. For a risk‐averse trader, problem formulation under the second‐order stochastic dominance constraints results in a quadratically constrained QP problem. Under some conditions on the execution price model, it is proved that the portfolio trading problems for risk‐neutral and risk‐averse traders become convex programming problems, which have many theoretical and computational advantages over the general class of optimization problems. Extensive numerical illustrations are provided, which render the practical significance of the proposed execution price model and the portfolio trading problems.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号