首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Although dataflow computers have many attractive features, skepticism exists concerning their efficiency in handling arrays (vectors) in high performance scientific computation. This paper outlines an efficient implementation scheme for arrays in applicative languages (such as VAL and SISAL) based on the principles of dataflow software pipelining. It illustrates how the fine-grain parallelism of dataflow approach can effectively handle large amount of data structured in applicative array operations. This is done through dataflow software pipelining between pairs of code blocks which act as producer-consumer of array values. To make effective use of the pipelined code mapping scheme, a compiler needs information concerning the overall program structure as well as the structure of each code block. An applicative language provides a basis for such analysis.

The program transformation techniques described here are developed primarily for the computationally intensive part of a scientific numerical program, which is usually formed by one or a few clusters of acyclic connected code blocks. Each code block defines an array value from several input arrays. We outline how mapping decisions of arrays can be based on a global analysis of attributes of the code blocks. We emphasize the role of overall program structure and the strategy of global optimization of the machine code structure. The structure of a proposed dataflow compiler based on the scheme described in this paper is outlined.  相似文献   


2.
Wavefront parallelism, in which parallelism is limited to hyperplanes in an iteration space, can arise when compilers apply tiling to loop nests to enhance locality. Previous approaches for scheduling wavefront parallelism focused on maximizing parallelism; balancing workloads, and reducing synchronization. In this paper, we show that on large-scale shared-memory multiprocessors, locality is a crucial factor. We make the distinction between intratile and intertile locality and show that as the number of processors grows, intertile locality becomes more important. We consider and experimentally evaluate existing strategies for scheduling wavefront parallelism. We show that dynamic self-scheduling can be efficiently used on a small number of processors, but performs poorly at large scale because it does not enhance intertile locality. By contrast, static scheduling strategies enhance intertile locality for small tiles, maintaining parallelism and resulting in better performance at large scale. Results from a Convex SPP1000 multiprocessor demonstrate the importance of taking intertile locality into account. Static scheduling outperforms dynamic self-scheduling by a factor of up to 2.3 on 30 processors  相似文献   

3.
Espasa  R. Valero  M. 《Micro, IEEE》1997,17(5):20-27
Simultaneous multithreaded vector architectures combine the best of data-level and instruction-level parallelism and perform better than either approach could separately. Our design achieves performance equivalent to executing 15 to 26 scalar instructions/cycle for numerical applications  相似文献   

4.
In this paper,a new loop transformation is proposed that can expolit paralleism in loops which cannot be found by traditional methods.Then the method is extended to show how to achieve maximum speedup of loops if there are infinite processors and how to balance the workload of parallel sections in loops if there is fixed number of processors.  相似文献   

5.
Exploiting the parallelism available in loops   总被引:1,自引:0,他引:1  
Lilja  D.J. 《Computer》1994,27(2):13-26
Because a loop's body often executes many times, loops provide a rich opportunity for exploiting parallelism. This article explains scheduling techniques and compares results on different architectures. Since parallel architectures differ in synchronization overhead, instruction scheduling constraints, memory latencies, and implementation details, determining the best approach for exploiting parallelism can be difficult. To indicate their performance potential, this article surveys several architectures and compilation techniques using a common notation and consistent terminology. First we develop the critical dependence ratio to determine a loop's maximum possible parallelism, given infinite hardware. Then we look at specific architectures and techniques. Loops can provide a large portion of the parallelism available in an application program, since the iterations of a loop may be executed many times. To exploit this parallelism, however, one must look beyond a single basic block or a single iteration for independent operations. The choice of technique depends on the underlying architecture of the parallel machine and the characteristics of each individual loop  相似文献   

6.
With the increasing amount of parallelism obtainable on multicore platforms, stream programming has been proposed as an effective solution for exposing distributed parallelization. Nonetheless, a pressing demand of scheduling task and data parallelism in stream programming exists that can accomplish robust multicore performance in the face of varying application characteristics. This paper addresses the problem of scheduling task and data parallelism in stream programming. We present StreamMDE, an asynchronous concurrency stream programming framework which offers a novel parallel programming model for scheduling task and data parallelism in the message-driven execution paradigm. A key property of this framework is exposing controlled-grained parallelism, which allows us to control the granularity of task and data parallelism in stream graph. Our empirical evaluation of StreamMDE shows that higher efficiency of mixed task and data parallelism in stream programming can be exploited with the appropriate granularity control. The framework bridges the gap between the parallel scale and the architecture of stream programs and facilitates in designing and coding stream features in different schedules.  相似文献   

7.
Multi-threaded programs on shared-memory hardware tend to be non-deterministic, which brings challenges to software debugging and testing. Current deterministic implementations eliminate nondeterminism of multi-threaded programs by trading much parallelism for determinism, which leads to low performance. Researchers typically improve parallelism by weakening determinism or introducing weak memory consistency models. However, weak determinism cannot deal with non-determinism caused by data races which are very common in multi-threaded programs. Weak memory consistency models impact the productivity of programming and may bring correctness problems of legacy programs. To address the problems, this paper presents a fully parallelized deterministic runtime, FPDet, which exploits parallelism of deterministic multi-threaded programs by preserving strong determinism and sequential memory consistency. FPDet creates a Working Set Memory (WSM) for each thread to make threads run independently for parallelism. FPDet guarantees determinism by redistributing memory blocks among threads’ WSMs in specified synchronization points. As a result, FPDet obtains parallelism and determinism simultaneously. To further exploit parallelism, we propose an Adaptive Budget Adjustment (ABA) mechanism to minimize wait time caused by thread synchronization.  相似文献   

8.
While recognition of the advantages of heterogeneous computing is steadily growing, the issues of programmability and portability hinder its exploitation. The introduction of the OpenCL standard was a major step forward in that it provides code portability, but its interface is even more complex than that of other approaches. In this paper, we present the Heterogeneous Programming Library (HPL), which permits the development of heterogeneous applications addressing both portability and programmability while not sacrificing high performance. This is achieved by means of an embedded language and data types provided by the library with which generic computations to be run in heterogeneous devices can be expressed. A comparison in terms of programmability and performance with OpenCL shows that both approaches offer very similar performance, while outlining the programmability advantages of HPL.  相似文献   

9.
Heterogeneous systems with nodes containing more than one type of computation units, e.g., central processing units (CPUs) and graphics processing units (GPUs), are becoming popular because of their low cost and high performance. In this paper, we have developed a Three-Level Parallelization Scheme (TLPS) for molecular dynamics (MD) simulation on heterogeneous systems. The scheme exploits multi-level parallelism combining (1) inter-node parallelism using spatial decomposition via message passing, (2) intra-node parallelism using spatial decomposition via dynamically scheduled multi-threading, and (3) intra-chip parallelism using multi-threading and short vector extension in CPUs, and employing multiple CUDA threads in GPUs. By using a hierarchy of parallelism with optimizations such as communication hiding intra-node, and memory optimizations in both CPUs and GPUs, we have implemented and evaluated a MD simulation on a petascale heterogeneous supercomputer TH-1A. The results show that MD simulations can be efficiently parallelized with our TLPS scheme and can benefit from the optimizations.  相似文献   

10.
We present techniques for exploiting fine-grained parallelism extracted from sequential programs on a fine-grained MIMD system. The system exploits fine-grained parallelism through parallel execution of instructions on multiple processors as well as pipelined nature of individual processors. The processors can communicate data values via globally shared registers as well as dedicated channel queues. Compilation techniques are presented to utilize these mechanisms. A scheduling algorithm has been developed to distribute operations among the processors in a manner that reduces communication among the processors. The compiler identifies data dependencies which require synchronization and enforces them using channel queues. Delays that may result by attempting write operations to a full channel queue are avoided by spilling values from channels to local registers. If an interprocessor data dependency does not require synchronization, then the data value is passed through a shared register or shared memory.Partially supported by National Science Foundation Presidential Young Investigator Award CCR-9157371 (CCR-9249143) to the University of Pittsburgh.  相似文献   

11.
In this paper, we address the matrix chain multiplication problem, i.e., the multiplication of several matrices. Although several studies have investigated the problem, our approach has some different points. First, we propose MapReduce algorithms that allow us to provide scalable computation for large matrices. Second, we transform the matrix chain multiplication problem from sequential multiplications of two matrices into a single multiplication of several matrices. Since matrix multiplication is associative, this approach helps to improve the performance of the algorithms. To implement the idea, we adopt multi-way join algorithms in MapReduce that have been studied in recent years. In our experiments, we show that the proposed algorithms are fast and scalable, compared to several baseline algorithms.  相似文献   

12.
《国际计算机数学杂志》2012,89(9):1051-1067
Semantic trees have often been used as a theoretical tool for showing the unsatisfiability of clauses in first-order predicate logic. Their practicality has been overshadowed, however, by other strategies. In this paper, we introduce unit clauses derived from resolutions when necessary to construct a semantic tree, leading to a strategy that combines the construction of semantic trees with resolution–refutation. The parallel semantic tree theorem prover, called PrHERBY, combines semantic trees and resolution–refutation methods. The parallel system is scalable by strategically selecting atoms with the help of dedicated resolutions. In addition, a parallel grounding scheme allows each system to have its own instance of generated atoms, thereby increasing the possibility of success. The PrHERBY system presented performs significantly better and generally finds proof using fewer atoms than the semantic tree prover, HERBY and its parallel version, PHERBY.  相似文献   

13.
A method is presented which aims to enhance the run-time performance of real-time production systems of utilising natural concurrency in the application knowledge base. This exploiting application parallelism (EAP) method includes an automated analysis of the knowledge base and the use of this analysis information to partition and execute rules on a novel parallel production system (PPS) architecture. Prototype analysis tools and a PPS simulator have been developed for the Inference ART environment in order to apply the method to a naval data-fusion problem. The results of this experimental investigation revealed that an average maximum of 12.06 rule-firings/cycle was possible but, due to serial bottlenecks inherent in the data-fusion problem, up to only 2.14 rule-firings/cycle was achieved overall. Limitations of the EAP method are discussed within the context of the experimental results and an enhanced method is investigated.  相似文献   

14.
15.
This paper addresses the problem of parallelism in 3-D object recognition and localization using a fine-grained SIMD architecture such as the Connection MachineTM. The input images of partially occluding three-dimensional objects. The objects are made up of piecewise compositions of curved surfaces. The surfaces of interest are cylindrical, conical and spherical surfaces since they cover a large portion of objects encountered in an industrial environment. Qualitative classification of surfaces based on the signs of the Mean and Gaussian curvature is used to come up with Dihedral feature junctions. Dihedral feature junctions are shown to be robust to occlusion and offer a viewpoint independent scheme for modeling the object models and are well suited for matching and pose determination. Hough clustering is chosen as the constraint propagation mechanism on account of the ease of parallelization. Issues regarding parallelism on the Connection MachineTM for every stage of the recognition and localization process such as segmentations, feature extraction matching and pose determination are discussed. Experimental results on the Connection MachineTM bring out the advantages of exploiting parallelism for 3-D object recognition and localization on a fine-grained SIMD architecture.  相似文献   

16.
We investigate the efficient iterative solution of large-scale sparse linear systems on shared-memory multiprocessors. Our parallel approach is based on a multilevel ILU preconditioner which preserves the mathematical semantics of the sequential method in ILUPACK. We exploit the parallelism exposed by the task tree corresponding to the nested dissection hierarchy (task parallelism), employ dynamic scheduling of tasks to processors to improve load balance, and formulate all stages of the parallel PCG method conformal with the computation of the preconditioner to increase data reuse. Results on a CC-NUMA platform with 16 processors reveal the parallel efficiency of this solution.  相似文献   

17.
A new technique for source-to-source transformation of sequential programs is described. It is shown that the transformed programs so generated provide significant speedups over the original program on vector processors and vector multiprocessors. The parallelism that arises when multiple instances of a program are executed on simultaneously available data sets is exploited. This is in contrast to the existing approaches that aim at detecting parallelism within a program. The analytic model is used to prove the optimality of the complete first policy for block selection for a class of program graphs known as nonregressive graphs. Analytic and simulation models of the technique clearly indicate the speedups that could be achieved when several data sets are available simultaneously, as is the case in many fields of interest  相似文献   

18.
The solution of Protein–Ligand Docking Problems can be approached through metaheuristics, and satisfactory metaheuristics can be obtained with hyperheuristics searching in the space of metaheuristics implemented inside a parameterized schema. These hyperheuristics apply several metaheuristics, resulting in high computational costs. To reduce execution times, a shared-memory schema of hyperheuristics is used with four levels of parallelism, two for the hyperheuristic and two for the metaheuristics. The parallel schema is executed in a many-core system in “native mode,” and the four-level parallelism allows us to take full advantage of the massive parallelism offered by this architecture and obtain satisfactory fitness and an important reduction in the execution time.  相似文献   

19.
The increase of computer performance continues to support the practice of large-scale optimization. Computers with multiple computing cores and vector processing capabilities are now widely available. We investigate how the recently introduced Advanced Vector Instruction (AVX) set on Intel-compatible architectures can be exploited in interior point methods for linear and nonlinear optimization. We focus on data structures and implementation techniques that utilize the new vector instructions. Our numerical experiments demonstrate that the AVX instruction set provides a significant performance boost in our implementation on large-scale problem that have significant fill-in in the sparse Cholesky factorization, achieving up to 100 gigaflops performance on a standard desktop computer on linear optimization problems for which the required Cholesky factorization is relatively dense.  相似文献   

20.
Shang  Junyuan  Niu  Chang  Huang  Junchu  Zhou  Zhiheng  Yang  Junmei  Xu  Shiting  Yang  Liu 《Applied Intelligence》2022,52(10):10917-10933
Applied Intelligence - Most existing domain adaptation methods require large amounts of data in the target domain to train the model and a relatively long time to adapt different domains. Recently,...  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号