Similar Documents
20 similar documents found.
1.
Rule-based systems have been hypothesized to contain only minimal parallelism. However, techniques to extract more parallelism from existing systems are being investigated. Among these methods, it is desirable to find those which balance the work being performed in parallel evenly among the rules, while decreasing the amount of work performed sequentially in each cycle. The automatic transformation of creating constrained copies of culprit rules accomplishes both of these goals. Rule-based systems are plagued by occasional rules which slow down the entire execution. These culprit rules require much more processing than others, causing other processors to idle while the culprit rules continue to match. By creating constrained copies of culprit rules and distributing them to their own processors, more parallelism is achieved, as evidenced by increased speed-up. This effect is shown to be specific to rule-based systems with certain characteristics, which are identified as being common within an important class of rule-based systems: expert database systems.
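As a rough illustration of the constrained-copy transformation, here is a minimal Python sketch (all names, such as `match_partition`, are hypothetical; systems of this kind are OPS-style production systems, not Python). The culprit rule's condition is copied N times, each copy constrained to one hash partition of a match attribute, so the copies jointly cover the original match set while matching on separate processors.

```python
# Hypothetical sketch of constrained rule copies -- not code from the paper.
from concurrent.futures import ProcessPoolExecutor

def culprit_condition(fact):
    return fact["value"] > 10              # stand-in for an expensive rule condition

def make_constrained_copy(k, n_copies):
    """Copy k of the rule only matches facts whose hashed key falls in
    partition k; together the n_copies copies match exactly what the
    original rule matched."""
    return lambda fact: hash(fact["key"]) % n_copies == k and culprit_condition(fact)

def match_partition(args):
    k, facts, n_copies = args              # each worker rebuilds its copy locally
    cond = make_constrained_copy(k, n_copies)
    return [f for f in facts if cond(f)]

if __name__ == "__main__":
    facts = [{"key": i, "value": i % 20} for i in range(100_000)]
    n = 4                                  # one constrained copy per processor
    with ProcessPoolExecutor(n) as pool:   # the whole fact list is shipped to each worker here
        parts = pool.map(match_partition, [(k, facts, n) for k in range(n)])
    matches = [f for part in parts for f in part]
```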

2.
Presents the hierarchical task graph (HTG) as an intermediate parallel program representation which encapsulates minimal data and control dependences, and which can be used for the extraction and exploitation of functional, or task-level, parallelism. The hierarchical nature of the HTG facilitates efficient task-granularity control during code generation, and thus applicability to a variety of parallel architectures. Emphasis is placed on the construction of the HTG at a given hierarchy level, the derivation of execution conditions for tasks which maximize task-level parallelism, and the optimization of these conditions to reduce the synchronization overhead imposed by data and control dependences. Algorithms for the formation of tasks and their execution conditions based on data and control dependence constraints are presented. The issue of optimizing such conditions is discussed, and optimization algorithms are proposed. The HTG is used as the intermediate representation of parallel Fortran and C programs for generating parallel source code as well as parallel machine code.
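To make the notion of an execution condition concrete, here is a toy Python data structure; it is an assumption-laden simplification, not the authors' HTG implementation. Each node records its data- and control-dependence predecessors, and its start condition is the conjunction derived from them, with the redundancy elimination hinting at the condition optimization the abstract describes.

```python
# Toy hierarchical task graph node -- illustrative only, field names assumed.
from dataclasses import dataclass, field

@dataclass
class HTGNode:
    name: str
    children: list = field(default_factory=list)   # nested sub-tasks (hierarchy)
    data_preds: set = field(default_factory=set)   # data-dependence predecessors
    ctrl_preds: set = field(default_factory=set)   # control-dependence predecessors

    def execution_condition(self):
        """A task may start once each data predecessor has produced its value
        and each control predecessor has resolved in its favour; a data term
        already implied by a control term is dropped -- the simplest instance
        of condition optimization."""
        terms = [f"done({p})" for p in sorted(self.data_preds - self.ctrl_preds)]
        terms += [f"taken({p})" for p in sorted(self.ctrl_preds)]
        return " and ".join(terms) or "true"

body = HTGNode("loop_body", data_preds={"init", "guard"}, ctrl_preds={"guard"})
print(body.execution_condition())   # done(init) and taken(guard)
```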

3.
This paper discusses the relationship between parallelism granularity and system overhead in dataflow computer systems, and indicates that a trade-off between them should be determined to obtain optimal efficiency of the overall system. On the basis of this discussion, a macro-dataflow computational model is established to exploit task-level parallelism. Working as a macro-dataflow computer, an Experimental Distributed Dataflow Simulation System (EDDSS) is developed to examine the effectiveness of the macro-dataflow computational model.

4.
The design process of complex Cyber-Physical Systems often relies on co-simulations of the system, involving the interaction of several simulated models of sub-systems. However, reaching real-time simulation is currently prevented by prohibitive CPU times when using existing single-threaded simulation tools. This paper investigates the problem of the efficient parallel co-simulation of hybrid dynamical systems. It introduces a finely grained co-simulation method enabling numerical integration speed-ups, obtained by partitioning the model into loosely coupled sub-systems with sparse communication between modules. The proposed scheme requires scheduling a large number of operations with a wide range of execution times. A suitable off-line scheduling algorithm, based on the input/output dynamics of the models, is proposed to minimize the simulation errors induced by the parallel execution. The scheme is finally tested on a phenomenological model of a combustion engine exported through the Functional Mockup Interface framework. Compared with the sequential case, it shows significant speed-ups while keeping the numerical integration accuracy under control.

5.
Until now, most results reported for parallelism in production systems (rule-based systems) have been simulation results; very few real parallel implementations exist. In this paper, we present results from our parallel implementation of OPS5 on the Encore multiprocessor. The implementation exploits very fine-grained parallelism to achieve significant speed-ups. For one of the applications, we achieve a 12.4-fold speed-up using 13 processes. Our implementation is also distinct from other parallel implementations in that we parallelize a highly optimized C-based implementation of OPS5. Running on a uniprocessor, our C-based implementation is 10–20 times faster than the standard Lisp implementation distributed by Carnegie Mellon University. In addition to presenting the performance numbers, the paper discusses the details of the parallel implementation: the data structures used, the amount of contention observed for shared data structures, and the techniques used to reduce such contention.
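One generic tactic for the contention problem mentioned here is to stripe a shared match table across many locks so that fine-grained workers rarely collide. The Python sketch below illustrates only that general tactic; the paper's actual C data structures and techniques may differ.

```python
# Generic lock-striping sketch -- an illustration, not the paper's technique.
import threading

N_STRIPES = 64                                   # many locks instead of one global lock
locks = [threading.Lock() for _ in range(N_STRIPES)]
table = [dict() for _ in range(N_STRIPES)]       # striped shared match-state table

def add_token(key, token):
    s = hash(key) % N_STRIPES                    # workers on different stripes never contend
    with locks[s]:
        table[s].setdefault(key, []).append(token)
```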

6.
Platzner  M. 《Computer》2000,33(4):58-60
Reconfigurable accelerators can improve processing time on combinatorial problems with fine-grained parallelism. Such problems contain a huge number of logical operations (NOT, AND and OR) that can be evaluated simultaneously, a characteristic that varies considerably from problem to problem. Because of this variability, such combinatorial problems are approached using instance-specific reconfiguration: hardware tailored to a specific algorithm and a specific set of input data. Boolean satisfiability (SAT for short) is a common combinatorial problem that exhibits fine-grained parallelism and whose instances vary considerably in structure. Its solution is thus an ideal candidate for improvements based on instance-specific reconfiguration. In fact, simulations of an instance-specific accelerator show potential speed-ups in execution time by a factor of up to 140,000 over a software solver. The authors detail the results of their prototype, which achieves an order-of-magnitude speed-up in the execution of difficult satisfiability problems.
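In software terms, instance-specific reconfiguration means specializing the evaluator to one concrete formula. The Python sketch below (hypothetical names; a real instance-specific accelerator compiles the formula's NOT/AND/OR structure into gates, not closures) conveys the idea.

```python
# Software analogue of instance-specific reconfiguration -- a sketch only.
def compile_instance(clauses):
    """clauses: CNF as lists of literals; +v means variable v, -v its negation.
    Returns a checker 'hard-wired' to this one SAT instance."""
    def satisfied(assignment):                   # assignment: dict var -> bool
        return all(
            any(assignment[abs(lit)] == (lit > 0) for lit in clause)
            for clause in clauses
        )
    return satisfied

# (x1 or not x2) and (x2 or x3)
check = compile_instance([[1, -2], [2, 3]])
print(check({1: True, 2: True, 3: False}))       # True
```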

7.
8.
9.
Abstract: This paper describes the results of a five-year research program on production systems carried out in Vienna by a large engineering group at the Alcatel-Elin research laboratory. In the course of our investigations, a real-time expert system shell, PAMELA, was developed. We describe features of PAMELA and discuss ways in which the matching algorithm can be sped up. It was found that the single-processor version is one of the most efficient forward chainers available worldwide. Since single-processor enhancements have been widely explored, we concentrated our efforts on multiple-processor architectures to provide further performance gains. Efficient exploitation of parallelism in production systems requires suitable algorithms for load balancing that do not simultaneously increase organisational or communication overheads. With this in mind, we developed a transputer-based architecture which serves as a testbed for our new parallel matching algorithm. The execution speeds of various expert systems based on PAMELA are presented.

10.
In this article, the problem of finding a tight estimate of the worst-case execution time (WCET) of a hard real-time program is addressed. The analysis is focused on straight-line code (without loops and recursive function calls), which is quite commonly found in synthesised code for embedded systems. A comprehensive timing analysis system covering both low-level (assembler instruction level) and high-level (programming language level) aspects is presented. The low-level analysis covers all speed-up mechanisms used in modern superscalar processors: pipelining, instruction-level parallelism and caching. The high-level analysis uses the results from the low level to compute the final WCET estimate. This is done by a heuristic that searches for the longest actually executable path in the control flow, based on the functional dependencies between various program parts.
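In its simplest form, the high-level step is a longest-path computation over a loop-free control-flow graph, using per-block worst-case costs supplied by the low-level analysis. The sketch below shows that baseline with invented costs; the paper's contribution is the heuristic that additionally prunes paths made infeasible by functional dependencies.

```python
# Baseline WCET bound for straight-line (loop-free) code: longest path in the
# control-flow DAG. Blocks and cycle counts are invented for illustration.
import functools

cfg = {"entry": ["a", "b"], "a": ["join"], "b": ["join"], "join": []}
cost = {"entry": 5, "a": 12, "b": 7, "join": 3}   # worst-case cycles per block

@functools.lru_cache(maxsize=None)
def wcet(block):
    succs = cfg[block]
    return cost[block] + (max(wcet(s) for s in succs) if succs else 0)

print(wcet("entry"))   # 20, via entry -> a -> join
```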

11.
Load sharing in large, heterogeneous distributed systems allows users to access vast amounts of computing resources scattered around the system and may provide substantial performance improvements to applications. We discuss the design and implementation issues in Utopia, a load sharing facility specifically built for large and heterogeneous systems. The system has no restriction on the types of tasks that can be remotely executed, involves few application changes and no operating system change, supports a high degree of transparency for remote task execution, and incurs low overhead. The algorithms for managing resource load information and task placement take advantage of the clustering nature of large-scale distributed systems: centralized algorithms are used within host clusters, and directed-graph algorithms are used among the clusters to make Utopia scalable to thousands of hosts. Task placements in Utopia exploit the heterogeneous hosts and consider the varying resource demands of the tasks. A range of mechanisms for remote execution is available in Utopia, providing varying degrees of transparency and efficiency. A number of applications have been developed for Utopia, ranging from a load sharing command interpreter, to parallel and distributed applications, to a distributed batch facility. For example, an enhanced Unix command interpreter allows arbitrary commands and user jobs to be executed remotely, and a parallel make facility achieves speed-ups of 15 or more by processing a collection of tasks in parallel on a number of hosts.
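A toy rendering of the two-level placement idea, with all hosts, loads, and thresholds invented (this is not Utopia's code): a centralized choice inside the home cluster, and a fallback scan standing in for the directed-graph algorithms used among clusters.

```python
# Invented example of cluster-structured task placement -- not Utopia itself.
clusters = {
    "c1": {"h1": 0.9, "h2": 0.8},     # host -> load index (made-up numbers)
    "c2": {"h3": 0.2, "h4": 0.5},
}

def place_task(home_cluster, threshold=0.75):
    # Centralized decision within the home cluster first...
    host, load = min(clusters[home_cluster].items(), key=lambda kv: kv[1])
    if load <= threshold:
        return host
    # ...then look across clusters (a linear scan stands in for the
    # inter-cluster directed-graph algorithms the abstract mentions).
    for name, hosts in clusters.items():
        if name != home_cluster:
            best, best_load = min(hosts.items(), key=lambda kv: kv[1])
            if best_load <= threshold:
                return best
    return host                        # fall back to the least-loaded local host

print(place_task("c1"))                # h3
```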

12.
The problem of the execution of plans of actions by a robot has inspired the conception of various representations. Some of them concern the problems of control theory and geometry involved in the execution of each task of a robot. We are interested in the task-level sequencing of such actions, from the point of view of planning in artificial intelligence and of languages for the synchronisation of tasks. Planning formalisms are often based on predicate logic and sometimes temporal logic. Robot programming languages at the task level have classical control structures derived from computer programming languages, as well as ones more specifically related to real-time execution. We propose a logical and temporal model of plans of actions augmented by an imperative control structure. On the basis of an interval-based temporal logic, we define a set of imperative control primitives that specify the temporal arrangement of the actions. Primitives for reacting to changes in the environment are then defined in the same formalism, in order to meet constraints concerning interaction with, and adaptation to, the external world. The application of the model in a simulation system is described, as well as its use in execution monitoring systems.

13.
This paper presents a practical evaluation and comparison of three state-of-the-art parallel functional languages. The evaluation is based on implementations of three typical symbolic computation programs, with performance measured on a Beowulf-class parallel architecture. We assess three mature parallel functional languages: PMLS, a system for implicitly parallel execution of ML programs; GPH, a mainly implicit parallel extension of Haskell; and Eden, a more explicit parallel extension of Haskell designed for both distributed and parallel execution. While all three languages employ a completely implicit approach to communication, each language takes a different approach to specifying and controlling parallelism, ranging from explicit identification of processes as language constructs (Eden), through annotation of potential parallelism (GPH), to automatic detection of parallel skeletons in sequential code (PMLS). We present detailed performance measurements of all three systems on a widely available parallel architecture: a Beowulf cluster of low-cost commodity workstations. We use three representative symbolic applications: a matrix multiplication algorithm, an exact linear system solver, and a simple ray-tracer. Our results show how moderate speedups can be achieved with little or no change to the sequential code, and that parallel performance can be significantly improved even within our high-level model of parallel functional programming by controlling key aspects of the program such as load distribution and thread granularity.

14.
This paper advances the state of the art in programming models for exploiting task-level parallelism on heterogeneous many-core systems, presenting a number of extensions to the OpenMP language inspired by the StarSs programming model. The proposed extensions allow the programmer to easily write portable code for a number of different platforms, relieving him/her of developing the specific code to off-load tasks to the accelerators and to synchronize tasks. Our results, obtained from the StarSs instantiations for SMPs, the Cell, and GPUs, report reasonable parallel performance. However, the real impact of our approach lies in the productivity gains it yields for the programmer.
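The essence of the StarSs-style extension is that a task declares its inputs and outputs and the runtime infers ordering and off-loading. The Python stand-in below mimics that contract with a hypothetical `spawn` helper; it is not OpenMP or StarSs syntax, and it ignores anti-dependences for brevity.

```python
# Hypothetical dataflow-task runtime echoing the StarSs idea -- a sketch only.
last_writer, tasks = {}, []

def spawn(fn, inputs=(), outputs=()):
    """Register a task; it depends on the last writer of each of its inputs."""
    t = {"fn": fn, "deps": {id(last_writer[d]) for d in inputs if d in last_writer}}
    tasks.append(t)
    for d in outputs:
        last_writer[d] = t

def run_all():
    done = set()
    while len(done) < len(tasks):      # naive executor: run whatever is ready
        for t in tasks:
            if id(t) not in done and t["deps"] <= done:
                t["fn"]()
                done.add(id(t))

spawn(lambda: print("produce A"), outputs=["A"])
spawn(lambda: print("consume A"), inputs=["A"])
run_all()                              # produce A, then consume A
```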

15.
Since its introduction in 1993, the Message Passing Interface (MPI) has become a de facto standard for writing High Performance Computing (HPC) applications on clusters and Massively Parallel Processors (MPPs). The recent emergence of multi-core processor systems presents a new challenge for established parallel programming paradigms, including those based on MPI. This paper presents a new Java messaging system called MPJ Express. Using this system, we exploit two levels of parallelism (messaging and threading) to improve application performance on multi-core processors. We refer to our approach as nested parallelism. This MPI-like Java library can support nested parallelism by using Java or Java OpenMP (JOMP) threads within an MPJ Express process. The practicality of this approach is assessed by porting to Java a massively parallel structure formation code from cosmology called Gadget-2. We introduce nested parallelism in the Java version of the simulation code and report good speed-ups. To the best of our knowledge, this is the first time this kind of hybrid parallelism has been demonstrated in a high-performance Java application.
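Nested parallelism can be sketched outside Java as processes (the MPJ-rank level) each running a thread pool inside (the JOMP level). The Python sketch below mirrors only the two-level structure, not MPJ Express itself.

```python
# Two-level (process + thread) parallel sum -- structural analogy only.
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def rank_work(args):
    rank, chunk, n_threads = args
    with ThreadPoolExecutor(n_threads) as threads:       # inner, thread level
        partials = threads.map(sum, [chunk[i::n_threads] for i in range(n_threads)])
    return sum(partials)

if __name__ == "__main__":
    data = list(range(1_000_000))
    ranks, threads_per_rank = 4, 2                       # outer, process level
    chunks = [data[r::ranks] for r in range(ranks)]
    with ProcessPoolExecutor(ranks) as procs:
        totals = procs.map(rank_work,
                           [(r, c, threads_per_rank) for r, c in enumerate(chunks)])
    print(sum(totals))                                   # 499999500000
```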

16.
The multiscalar architecture advocates a distributed processor organization and task-level speculation to exploit high degrees of instruction level parallelism (ILP) in sequential programs without impeding improvements in clock speeds. The main goal of this paper is to understand the key implications of the architectural features of distributed processor organization and task-level speculation for compiler task selection from the point of view of performance. We identify the fundamental performance issues to be: control flow speculation, data communication, data dependence speculation, load imbalance, and task overhead. We show that these issues are intimately related to a few key characteristics of tasks: task size, intertask control flow, and intertask data dependence. We describe compiler heuristics to select tasks with favorable characteristics. We report experimental results to show that the heuristics are successful in boosting overall performance by establishing larger ILP windows. We also present a breakdown of execution times to show that register wait, load imbalance, control flow squash, and conventional pipeline losses are significant for almost all the SPEC95 benchmarks.
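A crude scoring function conveys how heuristics might trade off the three task characteristics named above; the weights, thresholds, and fields below are invented, not the paper's.

```python
# Invented task-selection score: penalize tasks that are too small, have many
# hard-to-predict exits, or expose many registers other tasks must wait on.
def task_score(size, n_ctrl_exits, n_live_out_regs,
               min_size=20, w_ctrl=4.0, w_data=2.0):
    penalty = max(0, min_size - size)                # overhead dominates small tasks
    penalty += w_ctrl * max(0, n_ctrl_exits - 2)     # intertask control flow
    penalty += w_data * n_live_out_regs              # intertask data dependence
    return -penalty

candidates = {"loop_body": (48, 2, 3), "tiny_bb": (6, 1, 1), "big_region": (120, 5, 9)}
print(max(candidates, key=lambda k: task_score(*candidates[k])))   # loop_body
```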

17.
Heterogeneous chip multiprocessors
Heterogeneous (or asymmetric) chip multiprocessors present unique opportunities for improving system throughput, reducing processor power, and mitigating Amdahl's law. On-chip heterogeneity allows the processor to better match execution resources to each application's needs and to address a much wider spectrum of system loads, from low to high thread parallelism, with high efficiency.
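The resource-matching idea can be caricatured in a few lines; the core types, performance numbers, and demand metric below are all invented.

```python
# Invented example: send each phase to the smallest core that meets its demand,
# keeping the big cores free for the phases that actually need them.
core_types = [("little", 1.0), ("big", 3.0)]     # (name, relative performance)

def assign(ipc_demand):
    for name, perf in core_types:                # ordered smallest to biggest
        if perf >= ipc_demand:
            return name
    return core_types[-1][0]                     # saturating fallback

phases = {"io_wait": 0.4, "scalar": 0.9, "simd_kernel": 2.5}
print({p: assign(d) for p, d in phases.items()})
# {'io_wait': 'little', 'scalar': 'little', 'simd_kernel': 'big'}
```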

18.
Compile-time scheduling is one approach to extracting parallelism which has proved effective when the execution behavior is predictable. Unfortunately, the performance of most priority-based scheduling algorithms is computation dependent. Scheduling based on the concept of the earliest startable task produces reasonably short schedules only when the available parallelism is large enough to cover the communications. A priority-based decision is more effective when parallelism is low. We propose a scheduler in which the decision function combines two concepts: (1) task level as global priority and (2) earliest task first as local priority. The degree of dominance of one of the above concepts is controlled by a computation profile factor, the ratio of task parallelism to communication. It is shown that this factor is an upper bound on the deviation of the schedule length from the optimum. To tune the solution finish time, the above scheduler is applied iteratively to the computation graph. In each iteration, the newly generated schedule is used to sharpen the task levels, which contributes to finding shorter schedules in the next iteration. Evaluation is carried out for a wide category of computation graphs with communications for which optimum schedules are known. It is found that pure local scheduling and static priority-based scheduling deviate significantly from the optimum under specific problem instances. Our approach of adapting the scheduling decision to the computation profile is able to produce near-optimum solutions in far fewer iterations than other approaches.
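The combined decision function can be sketched directly; the blend formula below is one reading of the abstract, not the paper's exact definition. With a low profile factor (little parallelism relative to communication) the task-level term dominates; with a high factor the earliest-start term does.

```python
# Sketch of a blended global/local scheduling priority -- formula assumed.
def pick_next(ready, level, est, profile_factor):
    """level[t]: distance to the DAG exit (global, task-level priority);
    est[t]: earliest start time on the best processor (local priority)."""
    alpha = min(1.0, profile_factor)        # degree of dominance of the local rule
    return min(ready, key=lambda t: alpha * est[t] - (1 - alpha) * level[t])

level = {"a": 9, "b": 4, "c": 4}
est = {"a": 3, "b": 0, "c": 1}
print(pick_next({"a", "b", "c"}, level, est, profile_factor=0.2))   # a
```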

19.
Setting up and deploying complex applications on a Grid infrastructure is still challenging, and the programming models are rapidly evolving. Efficiently exploiting Grid parallelism is often not straightforward. In this paper, we report on the techniques used for deploying applications on the EGEE production Grid through four experiments from completely different scientific areas: nuclear fusion, astrophysics and medical imaging. These applications have in common the need to manipulate huge amounts of data, and all are computationally intensive. All the cases studied show that deploying data-intensive applications requires the development of more or less elaborate application-level workload management systems on top of the gLite middleware to efficiently exploit the EGEE Grid resources. In particular, the adoption of high-level workflow management systems eases the integration of large-scale applications while exploiting Grid parallelism transparently. Different approaches to scientific workflow management are discussed. The strategy of the MOTEUR workflow manager for dealing efficiently with complex data flows is detailed in particular. Without requiring specific application development, it leads to very significant speed-ups.

20.
Design and development costs for extremely large systems could be significantly reduced if only there were efficient techniques for evaluating design alternatives and predicting their impact on overall system performance metrics. Due to such systems' analytical intractability, simulation is the most common performance evaluation technique for them. However, the long execution times needed for sequential simulation models often hamper evaluation. The slow speeds of sequential model execution have led to growing interest in the use of parallel execution for simulating large-scale systems. Widespread use of parallel simulation, however, has been significantly hindered by a lack of tools for integrating parallel model execution into the overall framework of system simulation. Another drawback to the widespread use of simulation is the cost of model design and maintenance. The simulation environment the authors developed at UCLA attempts to address some of these issues. It consists of three primary components: a parallel simulation language called Parsec (parallel simulation environment for complex systems), its GUI, called Pave, and the portable runtime system that implements the simulation algorithms.

