共查询到20条相似文献,搜索用时 15 毫秒
1.
Optimized Parallel Execution of Declarative Programs on Distributed Memory Multiprocessors 下载免费PDF全文
In this paper,we focus on the compiling implementation of parlalel logic language PARLOG and functional language ML on distributed memory multiprocessors.Under the graph rewriting framework, a Heterogeneous Parallel Graph Rewriting Execution Model(HPGREM)is presented firstly.Then based on HPGREM,a parallel abstact machine PAM/TGR is described.Furthermore,several optimizing compilation schemes for executing declarative programs on transputer array are proposed.The performance statistics on transputer array demonstrate the effectiveness of our model,parallel abstract machine,optimizing compilation strategies and compiler. 相似文献
2.
Xinmin Tian Jay P. Hoeflinger Grant Haab Yen-Kuang Chen Milind Girkar Sanjiv Shah 《Parallel Computing》2005,31(10-12):960
This paper presents the design and implementation of a parallelization framework and OpenMP runtime support in Intel® C++ & Fortran compilers for exploiting nested parallelism in applications using OpenMP pragmas or directives. We conduct the performance evaluation of two multimedia applications parallelized with OpenMP pragmas and compiled with the Intel C++ compiler on Hyper-Threading Technology (HT) enabled multiprocessor systems. The performance results show that the multithreaded code generated by the Intel compiler achieved a speedup up to 4.69 on 4 processors with HT enabled for five different input video sequences for the H.264 encoder workload, and a 1.28 speedup on an HT enabled single-CPU system and 1.99 speedup on an HT-enabled dual-CPU system for the audio–visual speech recognition workload. The performance gain due to exploiting nested parallelism for leveraging Hyper-Threading Technology is up to 70% for two multimedia workloads under different multiprocessor system configurations. These results demonstrate that hyper-threading benefits can be achieved by exploiting nested parallelism through Intel compiler and runtime system support for OpenMP programs. 相似文献
3.
We present an effective approach to performing data flow analysis in parallel and identify three types of parallelism inherent in this solution process: independent-problem parallelism, separate-unit parallelism and algorithmic parallelism. We present our investigations of Fortran procedures from thePerfect Benchmarks andnetlib libraries, which reveal structural characteristics of program flow graphs that are amenable to algorithmic parallelism. Previously, the utility of algorithmic parallelism had been explored using our parallel hybrid algorithm in the context of solving the Reaching Definitions problem for Fortran procedures. Here we present new refinements that optimize performance by increasing the grain size of the parallelism, to improve communication on distributed-memory machines. The empirical performance of our optimized and unoptimized hybrid algorithms for Reaching Definitions are compared on this large data set using an iPSC/2. Our empirical findings qualitatively validate the usefulness of algorithmic parallelism.This research was supported, in part, by National Science Foundation grants CCR-8920078 and CCR-9023628-1, 2/5. An earlier version of this paper appears inProceedings of the 6th ACM International Conference on Supecomputing (Washington, D.C., July 1992), pp. 236–247. 相似文献
4.
Compiling programs for distributed-memory multiprocessors 总被引:1,自引:0,他引:1
We describe a new approach to programming distributed-memory computers. Rather than having each node in the system explicitly programmed, we derive an efficient message-passing program from a sequential shared-memory program annotated with directions on how elements of shared arrays are distributed to processors. This article describes one possible input language for describing distributions and then details the compilation process and the optimization necessary to generate an efficient program.Research supported by Intel. 相似文献
5.
Acacio M.E. Gonzalez J. Garcia J.M. Duato J. 《Parallel and Distributed Systems, IEEE Transactions on》2004,15(8):755-768
Recent technology improvements allow multiprocessor designers to put some key components inside the processor chip, such as the memory controller, the coherence hardware, and the network interface/router. In this paper, we exploit such integration scale, presenting a novel node architecture aimed at reducing the long L2 miss latencies and the memory overhead of using directories that characterize cc-NUMA machines and limit their scalability. Our proposal replaces the traditional directory with a novel three-level directory architecture, as well as it adds a small shared data cache to each of the nodes of a multiprocessor system. Due to their small size, the first-level directory and the shared data cache are integrated into the processor chip in every node, which enhances performance by saving accesses to the slower main memory. Scalability is guaranteed by having the second and third-level directories out of the processor chip and using compressed data structures. A taxonomy of the L2 misses, according to the actions performed by the directory to satisfy them, is also presented. Using execution-driven simulations, we show that significant latency reductions can be obtained by using the proposed node architecture, which translates into reductions of more than 30 percent in several cases in the application execution time. 相似文献
6.
Ramaswamy S. Sapatnekar S. Banerjee P. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(11):1098-1116
Distributed Memory Multicomputers (DMMs), such as the IBM SP-2, the Intel Paragon, and the Thinking Machines CM-5, offer significant advantages over shared memory multiprocessors in terms of cost and scalability. Unfortunately, the utilization of all the available computational power in these machines involves a tremendous programming effort on the part of users, which creates a need for sophisticated compiler and run-time support for distributed memory machines. In this paper, we explore a new compiler optimization for regular scientific applications-the simultaneous exploitation of task and data parallelism. Our optimization is implemented as part of the PARADIGM HPF compiler framework we have developed. The intuitive idea behind the optimization is the use of task parallelism to control the degree of data parallelism of individual tasks. The reason this provides increased performance is that data parallelism provides diminishing returns as the number of processors used is increased. By controlling the number of processors used for each data parallel task in an application and by concurrently executing these tasks, we make program execution more efficient and, therefore, faster 相似文献
7.
Jaejin Lee Jung-Ho Park Honggyu Kim Changhee Jung Daeseob Lim SangYong Han 《Journal of Parallel and Distributed Computing》2010
In simultaneous multithreading (SMT) multiprocessors, using all the available threads (logical processors) to run a parallel loop is not always beneficial due to the interference between threads and parallel execution overhead. To maximize the performance of a parallel loop on an SMT multiprocessor, it is important to find an appropriate number of threads for executing the parallel loop. This article presents adaptive execution techniques that find a proper execution mode for each parallel loop in a conventional loop-level parallel program on SMT multiprocessors. A compiler preprocessor generates code that, based on dynamic feedbacks, automatically determines at run time the optimal number of threads for each parallel loop in the parallel application. We evaluate our technique using a set of standard numerical applications and running them on a real SMT multiprocessor machine with 8 hardware contexts. Our approach is general enough to work well with other SMT multiprocessor or multicore systems. 相似文献
8.
随着多核技术日益发展,并发程序通过引入Fork/Join并行性,将任务分解为更细粒度的子任务并行执行,从而充分利用多核处理器提供的计算性能。并发执行线程之间的交错可能产生隐匿的程序设计错误,因此有必要对此类并发程序的正确性进行分析。上下文定界分析方法是一种检测并发程序中隐匿错误的高效方法,计算线程有限次上下文切换内的可达状态,确定错误状态是否可达。针对Fork/Join并行性的并发程序的可达性分析思想如下:首先,动态并发程序被建模为可模拟线程Fork/Join操作的动态并发下推系统P;然后从P中提取模拟其k-定界执行的并发下推系统Pk。现有的上下文定界可达算法可解决提取后的并发下推系统的k-定界可达性问题。 相似文献
9.
Models for the constraints under which an application should be scaled, including constant problem-size scaling, memory-constrained scaling, and time-constrained scaling, are reviewed. A realistic method is described that scales all relevant parameters under considerations imposed by the application domain. This method leads to different conclusions about the effectiveness and design of large multiprocessors than the naive practice of scaling only the data set size. The primary example application is a simulation of galaxies using the Barnes-Hut hierarchical N -body method 相似文献
10.
This paper explores the macro data flow approach for solving numerical applications on distributed memory systems. We discuss the problems of this approach with a sophisticated ‘real life’ algorithm—the adaptive full multigrid method.
It is shown that the nonnumeric parts of the algorithm—the initialization, the termination and the mapping of processes to processors—are very important for the overall performance.
To avoid unnecessary global synchronization points we propose to use the distributed supervisors. We compare this solution with more centralized algorithms. The performance evaluation is done for nearest neighbour and bus connected multiprocessors using a simulation systems. 相似文献
11.
John S. Conery 《International journal of parallel programming》1988,17(2):125-152
A method known asclosed environments can be used to represent variable bindings for OR-parellel logic programs without relying on a shared memory or common address space. The representation is based on a procedure that trans-forms stack frames after unification, taking into account problems with common unbound ancestors and shared instances of complex terms. Closed environments were developed for the AND/OR Process Model, but may be applicable to other OR-parallel models. 相似文献
12.
Puente V. Gregorio J.-A. Beivide R. Izu C. 《Parallel and Distributed Systems, IEEE Transactions on》2003,14(5):487-501
This work presents the design and evaluation of an adaptive packet router aimed at supporting CC-NUMA traffic. We exploit a simple and efficient packet injection mechanism to avoid deadlock, which leads to a fully, adaptive routing by employing only three virtual channels. In addition, we selectively use output buffers for implementing the most utilized virtual paths in order to reduce head-of-line blocking. The careful implementation of these features has resulted in a good trade-off between the network performance and hardware cost. The outcome of this research is a high-performance adaptive router (HPAR), which adequately balances the needs of parallel applications: minimal network latency at low loads and high throughput at heavy loads. The paper includes an evaluation process in which HPAR is compared with other adaptive routers using FIFO input bufferring, with or without additional virtual channels to reduce head-of-line blocking. This evaluation contemplates both the VLSI costs of each router and their performance under synthetic and real application workloads. To make the comparison fair, all the routers use the same efficient deadlock avoidance mechanism. In all the experiments, HPAR exhibited the best response among all the routers tested. Moreover, the observed packet latencies were comparable to those exhibited by simpler routers. Therefore, HPAR can be considered as a suitable candidate to implement packet interchange in next generations of CC-NUMA multiprocessors. 相似文献
13.
《Journal of Systems Architecture》1999,45(10):781-790
The calculation of radiological paths is the most important part in statistical positron emission tomography image reconstruction algorithms. We present a new, faster algorithm which replaces Siddon's. Further code transformations on this algorithm prove to be beneficial in a Maximum Likelihood–Expectation Maximization reconstruction algorithm and the result is perfectly suitable for an implementation that exploits the VISual instruction set from Sun or other modern architectural extensions providing subword parallelism. The final speed-up achieved with this new algorithm and its subword parallel implementation is 13. Though smaller data formats are used in subword parallelism, the resulting images are as good as the original ones. 相似文献
14.
A vector clock is a valuable tool for maintaining run time concurrency information in parallel programs. A novel method is presented for modifying vector clocks to make them suitable for programs with nested fork join parallelism (having a variable number of tasks). The resulting kind of clock is called a clock tree, due to its tree structure. The clock tree method compares favorably with other timestamping methods for variable parallelism: task identifier reuse and task recycling. The worst case space requirements of clock trees equals the best case for the latter two methods, and the average size of a clock tree is much smaller than the size of a vector with task recycling. Furthermore, the algorithm for maintaining clock trees does not require a shared data structure and thus avoids the serialization bottleneck that task recycling suffers from 相似文献
15.
Modern society increasingly relies on mobile devices. This explains the growing demand for high quality software for such devices. To improve the efficiency of the development life-cycle, shortening time-to-market while keeping quality under control, mobile applications are typically developed by composing together ad-hoc developed components, services available on-line, and other third-party mobile applications. Applications are thus built as heterogeneous compositions, whose characteristics strongly depend on the components and services they integrate. To cope with unpredictable changes and failures, but also with the various settings offered by the plethora of available devices, mobile applications need to be as adaptive as possible. However, mainstream adaptation strategies are usually defined imperatively and require complex control strategies strongly intertwined with the application logic, yielding to applications that are difficult to build, maintain, and evolve. We address this issue by proposing a declarative approach to compose adaptive heterogeneous mobile applications. The advantages of this approach are demonstrated through an example inspired by an existing worldwide distributed mobile application, while the implementation of the proposed solution has been validated through a set of simulations and experiments aimed at illustrating its performance. 相似文献
16.
This paper presents new schedulability tests for preemptive global fixed-priority (FP) scheduling of sporadic tasks on identical multiprocessor platform. One of the main challenges in deriving a schedulability test for global FP scheduling is identifying the worst-case runtime behavior, i.e., the critical instant, at which the release of a job suffers the maximum interference from the jobs of its higher priority tasks. Unfortunately, the critical instant is not yet known for sporadic tasks under global FP scheduling. To overcome this limitation, pessimism is introduced during the schedulability analysis to safely approximate the worst-case. The endeavor in this paper is to reduce such pessimism by proposing three new schedulability tests for global FP scheduling. Another challenge for global FP scheduling is the problem of assigning the fixed priorities to the tasks because no efficient method to find the optimal priority ordering in such case is currently known. Each of the schedulability tests proposed in this paper can be used to determine the priority of each task based on Audsley’s approach. It is shown that the proposed tests not only theoretically dominate but also empirically perform better than the state-of-the-art schedulability test for global FP scheduling of sporadic tasks. 相似文献
17.
This paper presents a parallel execution model for exploiting AND-parallelism in Horn Clause logic programs. The model is
based upon the generator-consumer approach, and can be implemented efficiently with small run-time overhead. Other related
models that have been proposed to minimize the run-time overhead are unable to exploit the full parallelism inherent in the
generator-consumer approach. Furthermore, our model performs backtracking more intelligently than these models. We also present
two implementation schemes to realize our model: one has a coordinator to control the activities of processes solving different
literals in the same clause; and the other achieves synchronization by letting processes pass messages to each other in a
distributed fashion. Trade-offs between these two schemes are then discussed.
This work was supported by Army Research Office grant #DAAG29-84-K-0060 to the Artificial Intelligence Laboratory at the University
of Texas at Austin. 相似文献
18.
19.
The VMMP (virtual machine for multiprocessors) software package is presented. It provides a coherent set of services for parallel application programs running on diverse multiple input multiple data (MIMD) multiprocessors, including shared memory and message passing multiprocessors. The communication, synchronization, and data distribution requirements of parallel algorithms are analyzed. Related languages and tools are described. VMMP services are identified. VMMP implementation, coding and portability are discussed. Some measurements of the performance of VROMP application programs and VMMP overhead are given. Several hints for improving the performance of application programs are described 相似文献
20.
On the declarative and procedural semantics of logic programs 总被引:1,自引:0,他引:1
Teodor C. Przymusinski 《Journal of Automated Reasoning》1989,5(2):167-205