首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 0 毫秒
1.
Barrier MIMD's are asynchronous multiple instruction stream, multiple data stream architectures capable of parallel execution of variable execution time instructions and arbitrary control flow (e.g., while loops and calls); however, they differ from conventional MIMD's in that the need for run-time synchronization is significantly reduced. The authors consider the problem of scheduling nested loop structures on a barrier MIMD. The basic approach employs loop coalescing, a technique for transforming a multiply-nested loop into a single loop. Loop coalescing is extended to nested triangular loops, in which inner loop bounds are functions of outer loop indices. In addition, a more efficient scheme to generate the original loop indices from the coalesced index is proposed for the case of constant loop bounds. These results are general, and can be applied to extend previous work using loop coalescing techniques. The authors concentrate on using loop coalescing for scheduling barrier MIMDs, and show how previous work in loop transformations and linear scheduling theory can be applied to this problem  相似文献   

2.
A parallel ray tracing algorithm is presented. It subdivides the seene into 3D regions, the adjacency of which is modelled by a connectivity graph of regions. Since with each region is associated a ray tracing process, this graph becomes a graph of processes, the edges of which represent the communications between processes. This graph of processes is suitably mapped onto a hypercube topology so as to minimize the communication cost. Static load balancing is performed and solutions are brought to the problems of network congestion and termination.This work has been supported byC 3 and by the CCETT (Centre Commun d'Etudes de Télédiffusion et Télécommunications) under contract 86ME46  相似文献   

3.
The multiflow trace scheduling compiler   总被引:3,自引:0,他引:3  
The Multiflow compiler uses the trace scheduling algorithm to find and exploit instruction-level parallelism beyond basic blocks. The compiler generates code for VLIW computers that issue up to 28 operations each cycle and maintain more than 50 operations in flight. At Multiflow the compiler generated code for eight different target machine architectures and compiled over 50 million lines of Fortran and C applications and systems code. The requirement of finding large amounts of parallelism in ordinary programs, the trace scheduling algorithm, and the many unique features of the Multiflow hardware placed novel demands on the compiler. New techniques in instruction scheduling, register allocation, memory-bank management, and intermediate-code optimizations were developed, as were refinements to reduce the overhead of trace scheduling. This article describes the Multiflow compiler and reports on the Multiflow practice and experience with compiling for instruction-level parallelism beyond basic blocks.  相似文献   

4.
Modern compilers employ sophisticated instruction scheduling techniques to shorten the number of cycles taken to execute the instruction stream. In addition to correctness, the instruction scheduler must also ensure that hardware resources are not oversubscribed in any cycle. For a contemporary processor implementation with multiple pipelines and complex resource usage restrictions, this is not an easy task. The complexity involved in reasoning about such resource hazards is one of the primary factors that constrain the instruction scheduler from performing certain kinds of transformations that can result in improved schedules. We extend a technique for detecting pipeline resource hazards based on finite state automata, to support the efficient implementation of such transformations that are essential for aggressive instruction scheduling beyond basic blocks. Although similar code transformations can be supported by other schemes such as reservation tables, our scheme is superior in terms of space and time. A global instruction scheduler using these techniques was implemented in the KSR compiler. This work was begun while the authors were at Kendall Square Research (KSR).  相似文献   

5.
A technique is presented for obtaining vector performance from a pipelined MIMD computer that does not have hardwired vector instructions. The specific computer in mind is the Denelcor HEP, but the technique might influence the use and possibly even the design of future machines with this type of architecture. This preliminary report presents the basic idea and demonstrates that it can be implemented. Buffering blocks of data to registers is used in conjunction with pipelined floating-point operations to achieve vector performance. Empirical evidence is presented to show that up to 5.8 megaflop performance is possible from the Denelcor HEP on very regular tasks such as matrix vector products. While this rate is not in the ‘super-computer’ range, it is certainly respectable given the hardware capabilities of the HEP (this machine is rated at 10 MIPS peak). This performance indicates that an apparently minor refinement to the architectural design could provide very efficient vector operations in addition to the parallelism and low-overhead synchronization already offered by the HEP architecture.  相似文献   

6.
We describe a MIMD multiprocessor simulator and application of that simulator to a multiprocessor of current interest, the S-1 MkIIa. The simulator runs on the CRAY-1 and is designed so that computational physics benchmarks are actually run and produce results. Simulator output from this run is fed into a second level (hardware) simulator which calculates the behavior of the multiprocessor. The simulator can simulate multiprocessors whose basic architecture is that of a few, large processors with or without data caches, sharing global memory through an interconnection switch. The simulator is applied to the investigation of the behavior of four problems on the S-1: The benchmark physics code SIMPLE, a conjugate gradient linear algebra problem, a simple Monte-Carlo problem, and a new method for neutron transport calculations.  相似文献   

7.
Two algorithms for barrier synchronization   总被引:5,自引:0,他引:5  
We describe two new algorithms for implementing barrier synchronization on a shared-memory multicomputer. Both algorithms are based on a method due to Brooks. We first improve Brooks' algorithm by introducing double buffering. Our dissemination algorithm replaces Brook's communication pattern with an information dissemination algorithm described by Han and Finkel. Our tournament algorithm uses a different communication pattern and generally requires fewer total instructions. The resulting algorithms improve Brook's original barrier by a factor of two when the number of processes is a power of two. When the number of processes is not a power of two, these algorithms improve even more upon Brooks' algorithm because absent processes need not be simulated. These algorithms share with Brooks' barrier the limitation that each of then processes meeting at the barrier must be assigned identifiersi such that 0i<n.  相似文献   

8.
This paper discusses the compile time task scheduling of parallel program running on cluster of SMP workstations. Firstly, the problem is stated formally and transformed into a graph parti-tion problem and proved to be NP-Complete. A heuristic algorithm MMP-Solver is then proposed to solve the problem. Experiment result shows that the task scheduling can reduce communication over-head of parallel applications greatly and MMP-Solver outperforms the existing algorithms.  相似文献   

9.
Most scientific applications rely on parallel multiprocessor computing to enhance performance. However, the irregular loops within these applications obstruct the parallelism analysis at compile-time. Rauchwerger et al. presented a run-time method to extract the hidden parallelism in a program using dependence chains. The relative overhead degrades this approach’s performance due to the mass storage requirement and huge array reference processing. In this study, a new predecessor/successor approach is developed in which high-level predecessor/successor information is recorded and processed efficiently. A predecessor/successor table is constructed first in the inspector phase so that only the successor iterations in the current wavefront need to be examined, instead of the entire loop iterations during wavefront scheduling. Usually, the performance of dependence chain approach degrades dramatically for a hot-spot access pattern, but our scheme works very efficiently in this case. The experimental results using synthetic code and real programs are presented to prove the superiority of the proposed approach.  相似文献   

10.
11.
同步开销是影响并行程序性能的一个重要方面,如果同步操作出现在循环中,将会使这种影响进一步扩大.为了降低循环中同步操作的开销,本文提出一种利用即时编译器外提Java程序中循环内同步操作的优化算法,并在实际的Java虚拟机中实现.该算法在保证程序语义不变的前提下,大量减少运行时实际执行的同步操作数量,降低同步开销,并能保证外提变换后同步代码块不会太大而降低程序的并发度.实验结果表明该算法能提高程序的整体性能,并且不降低程序的可扩放性.  相似文献   

12.
Enabled by RISC technologies, low-cost commodity microprocessors are performing at ever increasing levels, significantly via instruction level parallelism (ILP). This in turn increases the opportunities for their use in a variety of day-to-day applications ranging from the simple control of appliances such as microwave ovens, to sophisticated systems for cabin control in modern aircraft. Indeed, embedded applications such as these represent segments in the computer industry with great potential for growth. However, this growth is currently impeded by the lack of robust optimizing compiler technologies that support the assured, rapid and inexpensive prototyping of real-time software in the context of microprocessors with ILP. In this paper we describe a novel notation, TimeC, for specifying timing constraints in programs, independent of the base language being used to develop the embedded application; TimeC specifications are language independent and can be instrumented into imperative and object-oriented languages non-intrusively. As we will show, the program synthesis problem that arise out of Time_tract specifications, a subset of TimeC, are always tractable. In contrast, a range of specification mechanisms proposed earlier yield substantially intractable synthesis questions, thereby limiting their potential utility. We will compare the tractability and related expressive power issues between TimeC and some of the extant mechanisms for specifying properties of timed programs.  相似文献   

13.
This research responds to practical requirements in the porting of embedded software over platforms and the well-known multiprocessor anomaly. In particular, we consider the task scheduling problem when the system configuration changes. With mutual-exclusive resource accessing, we show that new violations of the timing constraints of tasks might occur even when a more powerful processor or device is adopted. The concept of scheduler stability and rules are then proposed to prevent scheduling anomaly from occurring in task executions that might be involved with task synchronization or I/O access. Finally, we explore policies for bounding the duration of scheduling anomalies.  相似文献   

14.
出具证明编译器在软件安全研究得到越来越多的关注,是程序验证研究的一个重要方向.但目前关于出具证明编译器的研究主要是在程序逻辑设计和定理自动化证明方面,很少关注编译优化对规范的影响.而编译优化是决定出具证明编译器是否能走向应用的关键因素之一.通过研究数据流优化的基本行为,提出利用数据流分析结果来变换规范的方法,以使原规范的约束准确而充分地施加于优化后的代码,并实现了一个包含多种优化和相应规范转换的编译器原型系统,展示了方法的可行性.  相似文献   

15.
An efficient method based on particle swarm optimization (PSO) is developed to solve the Multiprocessor Task Scheduling Problem (MPTSP). To efficiently execute parallelized programs on a multiprocessor environment, a scheduling problem must be solved to determine the assignment of tasks to the processors, the execution order of the tasks, and the starting time of each task, such that some optimality criteria are met. The scheduling problem is known as an NP-complete problem even when the target processors are fully connected and no communication delay is considered among the tasks in the task graph. The complexity of the scheduling problem depends on the number of tasks (N), the number of processors (M), the task processing time and the precedence constraints. The Directed Acyclic Graph (DAG) was exploited to represent the tasks and their precedence constraints. The proposed algorithm was compared with the Genetic Algorithm (GA) and the Duplication Scheduling Heuristic (DSH). We also provide a systematic investigation on the effect of varying problem settings. The results show that the proposed algorithm could not outperform the DSH while it could outperform the GA in some cases.  相似文献   

16.
Cross-layer optimization policy for QoS scheduling in computational grid   总被引:1,自引:0,他引:1  
This paper presents a cross-layer quality of service (QoS) optimization policy for computational grid. Efficient QoS management is critical for computational grid to meet heterogeneity and dynamics of resources and users’ requirements. There are different QoS metrics at different layers of computational grid. To improve perceived QoS by end users over computational grid, QoS supports can be addressed in different layers, including application layer, collective layer, fabric layer and so forth. The paper tackles cross-layer grid QoS optimization as optimization decomposition, each layer corresponds to a decomposed subproblem. The proposed policy produces an optimal set of grid resources, service compositions and user's payments at the fabric layer, collective layer and application layer respectively to maximize global grid QoS. The cross-layer optimization problem decomposes into three subproblems: grid resource allocation problem, service composing and user satisfaction degree maximization problem, all of which interact through the optimal variables for capacities of grid resources and service demand. In order to coordinate the subproblems, cross-layer QoS feedback mechanism is established to ensure different layer interactions. The simulations are conducted to validate the efficiency of the proposed policy.  相似文献   

17.
The application of object-oriented design methods to real-time embedded systems is seriously hindered by the lack of existing real-time scheduling techniques that can be seamlessly integrated into these methods. Preemption threshold scheduling (PTS) enables a scalable real-time system design and thus has been suggested as a solution to this problem. However, direct adoption of PTS may lead to long priority inversion since object-oriented real-time systems require synchronization considerations in order to maintain consistent object states. In this paper, we propose the dual ceiling protocol (DCP) in order to solve this problem. While DCP exploits both priority ceilings and preemption threshold ceilings, this is not a straightforward integration of existing real-time synchronization protocols for PTS. We present the rationale for the locking conditions of DCP and show that it leads to the least blocking and response times by comparison with other real-time synchronization protocols. We also present its blocking properties and schedulability analyses. We implemented PTS and DCP in a real-time object-oriented CASE tool and present the associated experimental results, which show that the proposed protocol is a viable solution that is superior to other real-time synchronization protocols for PTS.  相似文献   

18.
This paper presents a new particle swarm optimization (PSO) for the open shop scheduling problem. Compared with the original PSO, we modified the particle position representation using priorities, and the particle movement using an insert operator. We also implemented a modified parameterized active schedule generation algorithm (mP-ASG) to decode a particle position into a schedule. In mP-ASG, we can reduce or increase the search area between non-delay schedules and active schedules by controlling the maximum delay time allowed. Furthermore, we hybridized our PSO with beam search. The computational results show that our PSO found many new best solutions of the unsolved problems.  相似文献   

19.
In distributed query processing systems, load balancing plays an important role in maximizing system throughput. When queries can leverage cached intermediate results, improving the cache hit ratio becomes as important as load balancing in query scheduling, especially when dealing with computationally expensive queries. The scheduling policies must be designed to take into consideration the dynamic contents of the distributed caching infrastructure. In this paper, we propose and discuss several distributed query scheduling policies that directly consider the available cache contents by employing distributed multidimensional indexing structures and an exponential moving average approach to predicting cache contents. These approaches are shown to produce better query plans and faster query response times than traditional scheduling policies that do not predict dynamic contents in distributed caches. We experimentally demonstrate the utility of the scheduling policies using MQO, which is a distributed, Grid-enabled, multiple query processing middleware system we developed to optimize query processing for data analysis and visualization applications.  相似文献   

20.
We study the problem of simultaneously minimizing the makespan and the average weighted completion time for the precedence multiprocessor constrained scheduling problem with unit execution times and unit communication delays, known as the UET–UCT problem (Munier and König, Operations Research, 45(1), 145–148 (1997)). We propose a simple (16/9, 16/9)-approximation algorithm for the problem with an unrestricted number of machines. We improve our algorithm by adapting a technique first introduced by Aslam et al. (Proceedings of ACM-SODA, pp. 846–847, 1999) and provide a (1.745, 1.745)-approximate solution. For the considered scheduling problem, we prove the existence of a (1.445, 1.445)-approximate solution, improving the generic existence result of Aslam et al. (Proceedings of ACM-SODA, pp. 846–847, 1999). Also notice that our results for the case of an unrestricted number of processors hold for the more general scheduling problem with small communication delays (SCT problem), and for two other classical optimality criteria: maximum lateness and weighted lateness. Finally, we propose an approximation algorithm for the UET–UCT problem with a restricted number of processors.Research partially supported by the thematic network APPOL II (IST 2001-32007) of the European Union, the ACI-GRID2 project of the French government, and the MULT-APPROX project of the France-Berkeley Fund.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号