Similar Documents
20 similar documents found (search time: 62 ms)
1.
Simultaneous multithreading (SMT) is a processor design that exploits both thread-level and instruction-level parallelism. In SMT processors, thread-level parallelism can come either from multithreaded, parallel programs or from individual, independent programs in a multiprogramming workload. Instruction-level parallelism comes from each single program or thread. Because it successfully (and simultaneously) exploits both types of parallelism, an SMT processor uses resources more efficiently, and both instruction throughput and program speedups are greater.

2.
Recent trends in processor architecture show that parallel processing is moving into new areas of computing in the form of many-core desktop processors and multiprocessor systems-on-chip. This means that parallel processing is required in application areas that traditionally have not used parallel programs. This paper investigates the parallelism and scalability of an embedded image processing application. The major challenges faced when parallelizing the application were to extract enough parallelism from it and to reduce load imbalance. The application has limited immediately available parallelism, and further extraction of parallelism is limited by small data sets and a relatively high parallelization overhead. Load balance is difficult to obtain due to the limited parallelism and is made worse by non-uniform memory latency. Three parallel OpenMP implementations of the application are discussed and evaluated. We show that, with some modifications, relative speedups in excess of 9 on a 16-CPU system can be reached.
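A minimal OpenMP sketch of the row-parallel pattern such an image-processing code might use (the stencil kernel, image size, and chunk size are invented for illustration, not taken from the paper); `schedule(dynamic)` hands out rows in small chunks, which helps with the load-imbalance problem the abstract describes:

```cpp
#include <omp.h>
#include <vector>
#include <cstdio>

int main() {
    const int H = 1024, W = 1024;
    std::vector<float> in(H * W, 1.0f), out(H * W, 0.0f);

    // Row-parallel 4-point stencil; dynamic scheduling counters non-uniform
    // per-row cost and memory latency.
    #pragma omp parallel for schedule(dynamic, 8)
    for (int y = 1; y < H - 1; ++y)
        for (int x = 1; x < W - 1; ++x)
            out[y * W + x] = 0.25f * (in[(y - 1) * W + x] + in[(y + 1) * W + x] +
                                      in[y * W + x - 1]  + in[y * W + x + 1]);

    printf("out[512][512] = %f\n", out[512 * W + 512]);
    return 0;  // compile with: g++ -fopenmp
}
```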

3.
Multi-class contour preserving classification is a contour conservancy technique that synthesizes two types of vectors, fundamental multi-class outpost vectors (FMCOVs) and additional multi-class outpost vectors (AMCOVs), at the decision boundary between data classes to improve the classification accuracy of feed-forward neural networks. However, the number of both types of new vectors is enormous, resulting in significantly prolonged training times. Reduced multi-class contour preserving classification provides three practical methods to lessen the number of FMCOVs and AMCOVs. Nevertheless, the three reduced multi-class outpost vector methods are serial and therefore of limited use on modern machines with multiple CPU cores or processors. This paper presents the methodologies and frameworks of three parallel reduced multi-class outpost vector methods that effectively utilize thread-level and process-level parallelism to (1) substantially lessen the number of FMCOVs and AMCOVs, (2) achieve speedups in execution time proportional to the number of available CPU cores or processors, and (3) significantly increase the classification performance (accuracy, precision, recall, and F1 score) of the feed-forward neural network. Experiments on balanced and imbalanced real-world multi-class data sets from the UCI machine learning repository confirm the reduction performance, speedups, and classification gains described above.

4.
Dependence analysis algorithms have been proposed to identify parallelism in programs with tree-like data structures. However, they cannot analyze the dependences between statements when a program's recursive data structures are cyclic. This paper presents a technique to identify parallelism in programs with cyclic graphs. The technique consists of three steps: (1) the traversal patterns by which loops or recursive procedures traverse graphs are identified, and the statements that construct the links of those traversal patterns are located through definition–use chains of the recursive data structures; (2) traversal-pattern-sensitive shape analysis is performed to estimate the possible shapes of the traversal patterns; (3) dependence analysis is performed, using the results of the shape analysis, to identify parallelism. This approach can identify parallelism in programs with cyclic data structures because many programs follow acyclic structures (i.e., traversal patterns) to access all nodes of a cyclic data structure. Once the traversal patterns are isolated from the overall data structure, dependence analysis can be applied to identify parallelism.
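A small invented illustration of the key observation (the data structure and update are not from the paper): a circular doubly linked list is cyclic as a graph, yet a loop that walks only the `next` links for one lap follows an acyclic traversal pattern, so each iteration writes a distinct node and the iterations are independent:

```cpp
#include <vector>
#include <cstdio>

struct Node { int value; Node* next; Node* prev; };  // cyclic via prev/next + wraparound

int main() {
    const int N = 8;
    std::vector<Node> nodes(N);
    for (int i = 0; i < N; ++i) {  // build a circular doubly linked list
        nodes[i].value = i;
        nodes[i].next = &nodes[(i + 1) % N];
        nodes[i].prev = &nodes[(i + N - 1) % N];
    }
    // Acyclic traversal pattern over a cyclic structure: follow only `next`,
    // stop after one lap. Each iteration touches disjoint memory, so a
    // dependence analysis aware of this pattern can parallelize the updates.
    Node* p = &nodes[0];
    do { p->value *= 2; p = p->next; } while (p != &nodes[0]);
    printf("node 3 = %d\n", nodes[3].value);
}
```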

5.
This paper defines an abstract interpreter for logic programs based on a system of asynchronous, independent processors that communicate only by passing messages. Each logic program is automatically partitioned and its pieces distributed to the available processors. This approach permits two distinct forms of parallelism. OR parallelism arises from evaluating nondeterministic choices simultaneously. AND parallelism arises when a computation involves independent, but necessary, subcomputations. Algorithms like quicksort, which follow a divide-and-conquer approach, usually exhibit this form of parallelism. The parallel interpreter achieves both forms of parallelism jointly.
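The divide-and-conquer AND parallelism mentioned above can be sketched with quicksort: the two recursive sorts are independent, necessary subcomputations. This OpenMP-task version is only a C++ analogue of the idea, not the paper's message-passing interpreter:

```cpp
#include <omp.h>
#include <algorithm>
#include <vector>
#include <cstdio>

// The two recursive calls work on disjoint ranges: AND parallelism.
void qsort_par(std::vector<int>& a, int lo, int hi) {
    if (hi - lo < 2) return;
    int pivot = a[lo + (hi - lo) / 2];
    int i = lo, j = hi - 1;
    while (i <= j) {
        while (a[i] < pivot) ++i;
        while (a[j] > pivot) --j;
        if (i <= j) std::swap(a[i++], a[j--]);
    }
    #pragma omp task shared(a) if (hi - lo > 1024)  // spawn one half as a task
    qsort_par(a, lo, j + 1);
    qsort_par(a, i, hi);
    #pragma omp taskwait
}

int main() {
    std::vector<int> a(100000);
    for (size_t k = 0; k < a.size(); ++k) a[k] = (int)((a.size() - k) * 2654435761u % 100000);
    #pragma omp parallel
    #pragma omp single
    qsort_par(a, 0, (int)a.size());
    printf("sorted: %d\n", (int)std::is_sorted(a.begin(), a.end()));
}
```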

6.
When two or more literals in the body of a Prolog clause are solved in (AND) parallel, their solutions need to be joined to compute solutions for the clause. This is often a difficult problem in parallel Prolog systems that exploit OR and independent AND parallelism in Prolog programs. In several AND/OR parallel systems proposed recently, this problem is side-stepped at the cost of unexploited OR parallelism in the program, in part due to the complexity of the backtracking algorithm beneath AND-parallel branches. In some cases, the data dependency graphs used by these systems cannot represent all the exploitable independent AND parallelism known at compile time. In this paper, we describe the compile-time analysis for an optimized join algorithm that supports independent AND parallelism in logic programs efficiently without leaving any OR parallelism unexploited. We then discuss how this analysis can be used to yield very efficient runtime behavior. We also discuss problems associated with a tree representation of the search space when arbitrarily complex data dependency graphs are permitted, and describe how these problems can be resolved by mapping the search space onto the data dependency graphs themselves. The algorithm has been implemented in a compiler for parallel Prolog based on the Reduce-OR process model and is suitable for implementing AND/OR systems on both shared- and non-shared-memory machines. Performance on benchmark programs exhibiting AND and OR parallelism is presented for one shared-memory machine and one message-passing machine. This work was supported in part by NSF Grants CCR-87-00988 and CCR-89-02496. A shorter version of this paper appears in the Proceedings of NACLP 1990.

7.
There are billions of lines of sequential code in today's software that do not benefit from the parallelism available in modern multicore architectures. Automatically parallelizing sequential code, to promote efficient use of the available parallelism, has been a research goal for some time now. This work proposes a new approach for achieving that goal. We created a new parallelizing compiler that analyses the read and write instructions, and the control-flow modifications, in programs to identify a set of dependencies between the instructions in the program. Afterwards, based on the generated dependency graph, the compiler rewrites and organizes the program in a task-oriented structure. Each parallel task is composed of instructions that cannot be executed in parallel with one another. A work-stealing-based parallel runtime is responsible for scheduling and for managing the granularity of the generated tasks. Furthermore, a compile-time granularity control mechanism avoids creating unnecessary data structures. This work focuses on the Java language, but the techniques are general enough to be applied to other programming languages. We evaluated our approach on 8 benchmark programs against OoOJava, achieving higher speedups; in some cases, values were close to those of a manual parallelization. The resulting parallel code also has the advantage of being readable and easily configured to further improve its performance manually.
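A toy, hand-written analogue of the task-oriented structure such a compiler produces (the paper's compiler derives it automatically from the dependency graph, and its runtime uses work stealing; `std::async` merely stands in for that runtime here):

```cpp
#include <future>
#include <cstdio>

int main() {
    // Original sequential program:
    //   int x = f();  int y = g();  int z = x + y;
    // x and y have no read/write or control-flow dependencies between them,
    // so they can become separate tasks; z is the join point.
    auto f = [] { long s = 0; for (int i = 0; i < 1000000; ++i) s += i; return (int)(s % 97); };
    auto g = [] { long p = 1; for (int i = 1; i < 1000; ++i) p = (p * i) % 101; return (int)p; };

    std::future<int> tx = std::async(std::launch::async, f);  // task 1
    std::future<int> ty = std::async(std::launch::async, g);  // task 2
    int z = tx.get() + ty.get();  // dependent instruction: waits for both tasks
    printf("z = %d\n", z);
}
```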

8.
Compiling scientific code using partial evaluation
Berlin, A.; Weise, D. Computer, 1990, 23(12): 25–37
The partial evaluation approach, which transforms a high-level program into a low-level program specialized for a particular application, exposing the parallelism inherent in the underlying numerical computation, is discussed. A prototype compiler that uses partial evaluation is described. Experiments with the compiler have shown that, for an important class of numerical programs, partial evaluation can provide marked performance improvements: measured speedups over conventionally compiled code range from 7 to 91 times. By coupling partial evaluation with parallel scheduling techniques, the low-level parallelism inherent in a computation can be exploited on heavily pipelined or parallel architectures. The approach has been demonstrated by applying a parallel scheduler to a partially evaluated program that simulates the motion of a nine-body solar system.
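A minimal C++ illustration of partial evaluation (constexpr templates stand in for the paper's specializer; the nine-body simulator is far richer): when part of the input is known statically, the loop unfolds at compile time into straight-line multiplies, the kind of specialized low-level code whose parallelism a scheduler can then exploit:

```cpp
#include <cstdio>

// General routine: the exponent is dynamic, so it loops at run time.
double pow_dyn(double x, int n) {
    double r = 1.0;
    for (int i = 0; i < n; ++i) r *= x;
    return r;
}

// Partially evaluated version: n is static data, so the recursion is
// unfolded at compile time into straight-line multiplications with no
// loop or branch left.
template <int N>
constexpr double pow_static(double x) {
    if constexpr (N == 0) return 1.0;
    else return x * pow_static<N - 1>(x);
}

int main() {
    printf("%f %f\n", pow_dyn(1.5, 5), pow_static<5>(1.5));  // same result
}
```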

9.
This paper presents a unified framework that optimizes out-of-core programs by exploiting locality and parallelism and by reducing communication overhead. For out-of-core problems, where the data set sizes far exceed the size of the available in-core memory, it is particularly important to exploit the memory hierarchy by optimizing the I/O accesses. We present algorithms that consider both iteration space (loop) and data space (file layout) transformations in a unified framework. We show that the performance of an out-of-core loop nest containing references to out-of-core arrays can be improved by using a suitable combination of file layout choices and loop restructuring transformations. Our approach considers array references one by one and attempts to optimize each reference for parallelism and locality. When there are references for which the parallelism optimizations do not work, communication is vectorized so that data transfer can be performed before the innermost loop. Results from hand-compiled codes on the IBM SP-2 and Intel Paragon distributed-memory message-passing architectures show that this approach reduces execution times and improves overall speedups. In addition, we extend the base algorithm to work with file layout constraints and show how it is useful for optimizing programs that consist of multiple loop nests.
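A schematic of the I/O-tiling idea (the file name, tile size, and scale-by-2 computation are made up for this sketch): read the out-of-core array one tile at a time so each tile is processed entirely in memory, turning scattered element accesses into large sequential I/O requests:

```cpp
#include <cstdio>
#include <vector>

int main() {
    const size_t TILE = 1 << 20;                 // elements per in-core tile (assumed)
    FILE* in  = fopen("bigarray.dat", "rb");     // hypothetical out-of-core array
    FILE* out = fopen("scaled.dat", "wb");
    if (!in || !out) return 1;

    std::vector<double> buf(TILE);
    size_t n;
    // One read, an in-core loop nest, one write per tile.
    while ((n = fread(buf.data(), sizeof(double), TILE, in)) > 0) {
        for (size_t i = 0; i < n; ++i)
            buf[i] *= 2.0;
        fwrite(buf.data(), sizeof(double), n, out);
    }
    fclose(in); fclose(out);
}
```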

10.
We investigate the claim that functional languages offer low-cost parallelism in the context of symbolic programs on modest parallel architectures. In our investigation we present the first comparative study of the construction of large applications in a parallel functional language, in our case Glasgow Parallel Haskell (GpH). The applications cover a range of application areas, use several parallel programming paradigms, and are measured on two very different parallel architectures. At the application level, the most significant result is that we are able to achieve modest wall-clock speedups (between factors of 2 and 10) over the optimised sequential versions for all but one of the programs. Speedups are obtained even for programs that were not written with the intention of being parallelised. These gains are achieved with relatively small programmer effort. One reason for the relative ease of parallelisation is the use of evaluation strategies, a new parallel programming technique that separates the algorithm from the co-ordination of parallel behaviour. At the language level we show that the combination of lazy and parallel evaluation is useful for achieving a high level of abstraction. In particular we can describe top-level parallelism, and also preserve module abstraction by describing parallelism over the data structures provided at the module interface ('data-oriented parallelism'). Furthermore, we find that the determinism of the language is helpful, as is the largely implicit nature of parallelism in GpH. Copyright © 1999 John Wiley & Sons, Ltd.

11.
The data parallel meta language (DPML) and its associated Fortran source-code rewriter (DP77) support architecture-independent, high-performance climate and weather prediction models. The language allows the data domains over which a program operates, the communication patterns required between elements of those data domains, and some or all of the calculations of a program to be expressed at a very high level. DPML uses explicit data parallelism to express the inherent parallelism of the models, with the result that programs are easily compilable into target machine code. DP77 uses information from the DPML program to translate Fortran routines into the host-specific Fortran form required for their parallel execution within the model. This paper describes the general strategy behind the development of DPML, discusses its language features using examples drawn from climate modelling, and provides details of the mechanism it uses for incorporating Fortran into data parallel programs. Encouraging results are reported for DPML versions of the standard weather benchmark models executing on vector, SIMD, and MIMD (shared-memory) machines. While the paper is set within the framework of climate modelling, the technique has wider implications.

12.
As vector lengths keep growing, SIMD extension units can exploit ever larger amounts of data-level parallelism, but the parallelism threshold that programs must meet rises with them. In existing auto-vectorizing compilers, if the analysis phase cannot extract enough data-level parallelism from the serial code to completely fill a vector register, the corresponding vector code-generation phase is never entered and the code is not vectorized. Longer vector lengths thus deprive programs with insufficient parallelism of the opportunity to be vectorized, degrading performance. To make fuller use of SIMD units, this paper introduces ISLP, a basic-block-oriented non-full-load (partial) vectorization method. Built on the open-source GCC compiler, the design and implementation of ISLP are described in detail from three aspects: parallelism detection, code generation, and the cost model. Experimental results on standard benchmark suites show that the method can effectively vectorize programs with insufficient superword-level parallelism and improve execution efficiency. The selected test cases achieve an average post-vectorization speedup of 1.14, an 11.8% performance improvement over the conventional SLP method.
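A hand-written illustration of the non-full (partial) vectorization idea, not GCC's actual ISLP implementation: only two independent multiplies are available, fewer than the four lanes of an SSE register, but they can still be packed and executed in one SIMD instruction, padding the unused lanes:

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstdio>

int main() {
    float a[2] = {1.0f, 2.0f}, b[2] = {3.0f, 4.0f}, c[2];
    // Only two lanes of real work exist; pad the upper two lanes with zeros
    // rather than giving up on vectorization entirely.
    __m128 va = _mm_set_ps(0.0f, 0.0f, a[1], a[0]);
    __m128 vb = _mm_set_ps(0.0f, 0.0f, b[1], b[0]);
    __m128 vc = _mm_mul_ps(va, vb);   // one SIMD op covers both multiplies
    float tmp[4];
    _mm_storeu_ps(tmp, vc);
    c[0] = tmp[0]; c[1] = tmp[1];
    printf("%f %f\n", c[0], c[1]);    // 3.0 8.0
}
```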

13.
Transformation to single assignment form is presented as a technique that enables the exploitation of fine-grain parallelism in programs. An efficient algorithm is presented for creating Single Assignment and Static Single Assignment code from unstructured FORTRAN codes, including programs with irreducible flow graphs. The algorithm transforms code directly, without requiring conversion to flow-graph form, and creates code of near-optimal quality with respect to both the number of names and the number of assignment statements added to the code. Experimental results show the degree of growth in storage and program length when creating single assignment code, and how that growth is contained using name reclamation. Further results show the extent of improved parallelization using single assignment code.
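A before/after miniature (invented, not from the paper's FORTRAN codes) of why single assignment exposes fine-grain parallelism: renaming removes the anti- and output dependences that reusing one variable created:

```cpp
#include <cstdio>

int main() {
    int a = 1, b = 2, c = 3, d = 4;
    // Before single assignment, x was reused, serializing the two chains:
    //   x = a + b;  y = x * 2;
    //   x = c + d;  z = x * 3;   // must wait until the old x has been read
    // After renaming, each name is assigned exactly once and the two
    // chains are independent:
    int x1 = a + b;
    int y  = x1 * 2;
    int x2 = c + d;   // no longer conflicts with the first chain
    int z  = x2 * 3;
    printf("y=%d z=%d\n", y, z);
}
```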

14.
The growing number of processing cores in a single CPU demands more parallelism from sequential programs. Yet in the past decades little work has succeeded in automatically exploiting enough parallelism, which casts a shadow over many-core architectures and automatic-parallelization research. Moreover, little work has actually tried to understand the nature, or amount, of the parallelism potentially available in programs. In this paper we analyze, at runtime, the dynamic data dependencies among superblocks of sequential programs. We designed a meta re-arrange buffer to measure and exploit the available parallelism, with which superblocks are dynamically analyzed, reordered, and dispatched to run in parallel on an ideal many-core processor while data dependencies and program correctness are maintained. In our experiments we observed that, with superblock reordering, the potential speedup ranged from 1.08 to 89.60. The results show that the potential parallelism of ordinary programs is still far from fully exploited by existing technologies, which makes automatic parallelization a promising research direction for many-core architectures.

15.
A new technique is described for estimating and understanding the speed improvement that can result from executing a program on a parallel computer. The technique requires no additional programming and minimal effort by a program's author. The analysis begins by tracing a sequential program. A parallelism analyzer then uses information from the trace to simulate parallel execution of the program. In addition to predicting parallel performance, the parallelism analyzer measures many aspects of a program's dynamic behavior. Measurements of six substantial programs are presented. These results indicate that the three symbolic programs differ substantially from the numeric programs and, as a consequence, cannot be automatically parallelized with the same compilation techniques.
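A toy version of the core computation such an analyzer performs (the trace below is invented): given traced operations with costs and dependencies, the ideal parallel time is the critical path, and the potential speedup is total work divided by that path:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct Op { double cost; std::vector<int> deps; };  // one traced operation

int main() {
    // A tiny dependence trace, already in execution (topological) order.
    std::vector<Op> trace = {
        {1.0, {}}, {2.0, {0}}, {3.0, {0}}, {1.0, {1, 2}}, {2.0, {3}},
    };
    std::vector<double> finish(trace.size());
    double work = 0.0, span = 0.0;
    for (size_t i = 0; i < trace.size(); ++i) {
        double ready = 0.0;                          // earliest start time
        for (int d : trace[i].deps) ready = std::max(ready, finish[d]);
        finish[i] = ready + trace[i].cost;
        work += trace[i].cost;
        span = std::max(span, finish[i]);            // critical path so far
    }
    printf("work=%.1f span=%.1f ideal speedup=%.2f\n", work, span, work / span);
}
```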

16.
As moderate-scale multiprocessors become widely used, we foresee an increased demand for effective compiler parallelization and efficient management of parallelism. While parallelizing compilers are achieving success at identifying parallelism, they are less adept at predetermining the degree of parallelism in different program phases. Thus, a compiler-parallelized application may execute on more processors than it can effectively use – a waste of computational resources that becomes more acute as the number of processors increases, particularly for systems used as multiprogrammed compute servers. This paper examines the dynamic parallelism behavior of multiprogrammed workloads using programs from the SPECfp95 and NAS benchmark suites, automatically parallelized by the Stanford SUIF compiler. Our results demonstrate that even the programs with good overall speedups display wide variability in the number of processors each phase (or loop) can exploit. We propose and evaluate a run-time system mechanism that dynamically adjusts the number of processors used by a compiler-parallelized application, responding to observed performance during the program's execution. Programs can thus adapt their processor usage as they run, responding both to poor parallelism within certain parts of their code and to heavy multiprogramming loads during execution. This mechanism improves workload performance by up to 33% over one-at-a-time runs of the workload programs. © 1998 John Wiley & Sons, Ltd.
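A minimal sketch of the run-time adaptation idea, with a deliberately simple policy invented for illustration (not the paper's mechanism): time a repeated phase and shrink the thread count while performance holds up, so poorly scaling phases release processors:

```cpp
#include <omp.h>
#include <cstdio>

// One parallel phase; returns elapsed time, writes its result via *out.
double phase(int nthreads, int n, double* out) {
    double sum = 0.0;
    double t0 = omp_get_wtime();
    #pragma omp parallel for num_threads(nthreads) reduction(+ : sum)
    for (int i = 0; i < n; ++i) sum += 1.0 / (1.0 + i);
    *out = sum;  // keep the computation observable
    return omp_get_wtime() - t0;
}

int main() {
    const int n = 1 << 24;
    int threads = omp_get_max_threads();
    double r;
    double t = phase(threads, n, &r);
    // Halve the processor set while the phase stays within 10% of its time.
    while (threads > 1) {
        double t_fewer = phase(threads / 2, n, &r);
        if (t_fewer <= 1.10 * t) { threads /= 2; t = t_fewer; }
        else break;
    }
    printf("settled on %d threads (%.3fs per phase, sum=%.3f)\n", threads, t, r);
    return 0;  // compile with: g++ -fopenmp
}
```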

17.
This paper presents a new programming methodology for introducing and tuning parallelism in Erlang programs, using source-level code refactoring from sequential source programs to parallel programs written using our skeleton library, Skel. High-level cost models allow us to predict with reasonable accuracy the parallel performance of the refactored program, enabling programmers to make informed decisions about which refactorings to apply. Using our approach, we demonstrate easily obtainable, significant, and scalable speedups of up to 21 over the sequential code on a 24-core machine.
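Skel's skeletons are Erlang; as a language-neutral analogue (C++ here, matching the other sketches in this listing), a "farm" skeleton is essentially a parallel map whose worker count is the tuning knob the cost models help choose:

```cpp
#include <algorithm>
#include <future>
#include <vector>
#include <cstdio>

// Farm skeleton analogue: apply `worker` to every input, nworkers at a time.
template <typename F>
std::vector<int> farm(F worker, const std::vector<int>& xs, size_t nworkers) {
    std::vector<int> out(xs.size());
    for (size_t base = 0; base < xs.size(); base += nworkers) {
        std::vector<std::future<int>> batch;
        for (size_t i = base; i < std::min(base + nworkers, xs.size()); ++i)
            batch.push_back(std::async(std::launch::async, worker, xs[i]));
        for (size_t i = 0; i < batch.size(); ++i) out[base + i] = batch[i].get();
    }
    return out;
}

int main() {
    std::vector<int> xs = {1, 2, 3, 4, 5, 6, 7, 8};
    auto ys = farm([](int x) { return x * x; }, xs, 4);  // 4 farm workers
    for (int y : ys) printf("%d ", y);
    printf("\n");
}
```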

18.
19.
A compositional method of constructing data dependency graphs for Ada programs is presented. These graphs are useful in a program development environment for analyzing data dependencies and tracking information flow within a program. Graphs for primitive program statements are combined to form graphs for larger program units. Composition rules are described for iteration, recursion, exception handling, and tasking, as well as for simpler Ada constructs. The correctness of the construction and the practicality of the technique are discussed.

20.
We describe an efficient parallel implementation of the push-relabel maximum flow algorithm for a shared-memory multiprocessor. Our main technical innovation is a method that allows the "global relabeling" heuristic to be executed concurrently with the main algorithm; this heuristic is essential for good performance in practice. We present performance results from a Sequent Symmetry for five input distributions. On these five distributions we achieve speedups in the range 6.2–8.8 with 16 processors, relative to the parallel program run on 1 processor (4.1–7.2 when compared to our best sequential program). We consider these speedups very good, and we provide evidence that hardware effects and insufficient parallelism in certain inputs are the main obstacles to achieving better performance.
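A sequential miniature of the push-relabel method for reference (the graph is invented; the paper's parallel version applies push/relabel to many active vertices at once and overlaps the global-relabeling heuristic with them):

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct Edge { int to, rev, cap; };  // rev: index of reverse edge at `to`

struct Graph {
    std::vector<std::vector<Edge>> adj;
    explicit Graph(int n) : adj(n) {}
    void add(int u, int v, int c) {
        adj[u].push_back({v, (int)adj[v].size(), c});
        adj[v].push_back({u, (int)adj[u].size() - 1, 0});
    }
};

int maxflow(Graph& g, int s, int t) {
    int n = (int)g.adj.size();
    std::vector<int> h(n, 0), ex(n, 0);
    h[s] = n;
    for (Edge& e : g.adj[s]) {                    // saturate all source edges
        ex[e.to] += e.cap; ex[s] -= e.cap;
        g.adj[e.to][e.rev].cap += e.cap; e.cap = 0;
    }
    bool active = true;
    while (active) {
        active = false;
        for (int u = 0; u < n; ++u) {             // scan for active vertices
            if (u == s || u == t || ex[u] <= 0) continue;
            active = true;
            int minh = 2 * n;
            for (Edge& e : g.adj[u]) {
                if (e.cap <= 0) continue;
                if (h[u] == h[e.to] + 1) {        // push excess downhill
                    int d = std::min(ex[u], e.cap);
                    e.cap -= d; g.adj[e.to][e.rev].cap += d;
                    ex[u] -= d; ex[e.to] += d;
                    if (ex[u] == 0) break;
                } else minh = std::min(minh, h[e.to]);
            }
            if (ex[u] > 0) h[u] = minh + 1;       // relabel: lift the vertex
        }
    }
    return ex[t];
}

int main() {
    Graph g(4);
    g.add(0, 1, 3); g.add(0, 2, 2); g.add(1, 2, 1); g.add(1, 3, 2); g.add(2, 3, 3);
    printf("max flow = %d\n", maxflow(g, 0, 3));  // expect 5
}
```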
