Similar literature
1.
Our work investigates how to map loops efficiently onto Coarse-Grained Reconfigurable Architecture (CGRA). This paper examines the properties of CGRA and builds MapReduce-inspired models for the loop parallelization problem. The proposed model has a more detailed performance metric and a more flexible unrolling scheme that can unroll different loop levels with different factors. A geometric-programming-based approach is proposed to solve the resulting loop parallelization optimization problem. The proposed approach can find the optimal unrolling factor for each loop level, resulting in better parallelization of loops. Experimental results show that the proposed approach achieves a performance gain of up to 44% over the state-of-the-art loop mapping scheme.

2.
A study is made of the problem of estimating interference in an imperative language with dynamic data structures. The authors focus on developing efficient and implementable methods for recursive data structures. In particular, they present interference analysis tools and parallelization techniques for imperative programs that contain dynamically updatable trees and directed acyclic graphs. The analysis methods are based on a regular-expression-like representation of the relationship between accessible nodes in the data structure. The authors have implemented their analysis, and they present some concrete examples that have been processed by this system.

3.
In recent years, the computational power of modern processors has been increasing mainly because of the increase in the number of processor cores. Computationally intensive applications can gain from this trend only if they employ parallelism, such as thread-level parallelization. Geometric simulations can employ thread-level parallelization because the main part of a geometric simulation can be divided into a set of mutually independent tasks. This approach is especially interesting for acoustic beam tracing because it is a computationally intensive task. This paper presents the parallelization of an existing beam-tracing simulation composed of three algorithms. Two of them are iterative algorithms, and they are parallelized with an already known technique. The most novel method is the parallelization of the third algorithm, the recursive octree generation. To check the performance of the multi-threaded parallelization, several tests are performed using three different computer platforms. On all of the platforms, the multi-threaded octree generation algorithm shows a significant speedup, which is linear when all of the threads are executed on the same processor.

4.

In this paper we present two strategies to enable “parallelization across the method” for spectral deferred corrections (SDC). Using standard low-order time-stepping methods in an iterative fashion, SDC can be seen as preconditioned Picard iteration for the collocation problem. Typically, a serial Gauß–Seidel-like preconditioner is used, computing updates for each collocation node one by one. The goal of this paper is to show how this process can be parallelized, so that all collocation nodes are updated simultaneously. The first strategy aims at finding parallel preconditioners for the Picard iteration and we test three choices using four different test problems. For the second strategy we diagonalize the quadrature matrix of the collocation problem directly. In order to integrate non-linear problems we employ simplified and inexact Newton methods. Here, we estimate the speed of convergence depending on the time-step size and verify our results using a non-linear diffusion problem.


5.
Lamport's parallelization algorithm (cf. [7]) is generalized to a broader class of loops, and the complexity of the transformation process has been estimated. It is shown that every loop can be parallelized using methods similar to those in [7]; moreover, they also have the property that all their inner loops are devoid of data dependencies, and so are fully parallelizable. Unfortunately, without restricting the nature of the loop to be parallelized, the negative solution to Hilbert's tenth problem (cf. [3]) can be applied to show that the parallelizing transformations are not computable. The class of affine loops was therefore introduced. This class is more general than that considered by Lamport, and it is shown that parallelizing transformations for affine loops are computable. In general, however, the complexity estimates for finding such loops suggest that the parallelization procedure will take longer than executing the original loop sequentially. It is further shown that, if the loop satisfies an additional, nondegeneracy condition, then the loop can be efficiently transformed.

Finally, although more generally applicable, these methods are best applied to vectorization problems.


6.
As multithreaded server applications and runtime systems prevail, garbage collection is becoming an essential feature to support high performance systems, especially those running data-intensive applications. The fundamental issue of garbage collector (GC) design is to maximize the recycled space with minimal time overhead. This paper proposes two innovative solutions: one to improve space efficiency, and the other to improve time efficiency. To achieve space efficiency, we propose the Space Tuner that utilizes the novel concept of allocation speed to reduce wasted space. Conventional static space partitioning techniques often lead to inefficient space utilization. The Space Tuner adjusts the heap partitioning dynamically such that when a collection is triggered, all space partitions are fully filled. To achieve time efficiency, we propose a novel parallelization method that reduces the compacting GC parallelization problem to a tree traversal parallelization problem. This method can be applied to both normal and large object compaction. Object compaction is hard to parallelize due to strong data dependencies: the source object cannot be moved to its target location until the object originally in the target location has been moved out. Our proposed algorithm overcomes these difficulties by dividing the heap into equal-sized blocks and parallelizing the movement of the independent blocks. It is noteworthy that these proposed algorithms are generic such that they can be utilized in different GC designs. The proposed techniques have been implemented in Apache Harmony JVM and we evaluated the proposed algorithms with SPECjbb and Dacapo benchmark suites. The experimental results demonstrate that our proposed algorithms greatly improve space utilization and that the corresponding parallelization schemes are scalable, which brings time efficiency.

7.
A large class of intensive numerical applications show an irregular structure, exhibiting an unpredictable runtime behavior. Two kinds of irregularity can be distinguished in these applications. First, irregular control structures, derived from the use of conditional statements on data only known at runtime. Second, irregular data structures, derived from computations involving sparse matrices, grids, trees, graphs, etc. Many of these applications exhibit a large amount of parallelism, but the above features usually make exploiting such parallelism very difficult. This paper discusses the effective parallelization of numerical irregular codes, focusing on the definition and use of data-parallel extensions to express the parallelism that they exhibit. We show that combining data distributions with storage structures makes it possible to obtain efficient parallel codes. Codes dealing with sparse matrices, finite element methods and molecular dynamics (MD) simulations are taken as working examples.

8.
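WU Jiagao, XIA Xuan, LIU Linfeng. 《计算机应用》 (Journal of Computer Applications), 2017, 37(5): 1282-1286
The growing number of devices equipped with Global Positioning System (GPS) functionality generates large volumes of spatio-temporal trajectory data, placing a heavy burden on data storage, transmission, and processing. Various trajectory compression methods have emerged to reduce this burden. A MapReduce-based parallel trajectory compression method is proposed. To address the problem that parallelization destroys the correlation between the trajectory before and after each segmentation point, the trajectory is first partitioned using two partitioning schemes whose segmentation points interleave; the trajectory segments are then distributed to multiple nodes for parallel compression; finally, the compression results are matched and merged. Performance tests and analysis show that the proposed parallel trajectory compression method greatly improves compression efficiency and completely eliminates the error caused by segmentation destroying the correlation around segmentation points.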

9.
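LI Yanbing, ZHAO Rongcai, LIU Xiaoxian, ZHAO Jie. 《软件学报》 (Journal of Software), 2014, 25(S2): 101-110
Existing OpenMP cost models are rather simple: they neither fully consider the execution details of OpenMP programs nor adapt to different ways of executing loops in parallel. To address these problems, the existing cost model in Open64, the most advanced production-grade optimizing compiler, is extended, and a cost model for profitability analysis in OpenMP automatic parallelization is built that takes a single parallel candidate loop as its object. In addition to improving Open64's original DOALL parallel cost model, the model adds a DOACROSS pipeline-parallel cost model and a DSWP parallel cost model. Experimental results show that the cost model evaluates the trend of the parallel execution overhead of loops well and provides effective support for profitability analysis in OpenMP automatic parallelization.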

10.
The automatic generation of 3D meshes for the finite element method (FEM) is still a bottleneck for the simulation of large fluid dynamic problems. Although today there are several algorithms that can generate good meshes without user intervention, in cases where the geometry changes during the calculation and thousands of meshes must be constructed, the computational cost of this process can exceed the cost of the FEM. There has been a lot of work in FEM parallelization and the algorithms work well in different parallel architectures, but at present there has not been much success in the parallelization of mesh generation methods. This paper will present a massive parallelization scheme for re-meshing with tetrahedral elements using the local modification algorithm. This method is frequently used to improve the quality of elements once the mesh has been generated, but we will show it can also be applied as a regeneration process, starting with the distorted and invalid mesh of the previous step. The parallelization is carried out using OpenCL and OpenMP in order to test the method in a multiple CPU architecture and also in Graphics Processing Units (GPUs). Finally, we present the speedup and quality results obtained in meshes with hundreds of thousands of elements and different parallel APIs.

11.
12.
We have parallelized the FASTA algorithm for biological sequence comparison using Linda, a machine-independent parallel programming language. The resulting parallel program runs on a variety of different parallel machines. A straightforward parallelization strategy works well if the amount of computation to be done is relatively large. When the amount of computation is reduced, however, disk I/O becomes a bottleneck which may prevent additional speed-up as the number of processors is increased. The paper describes the parallelization of FASTA, and uses FASTA to illustrate the I/O bottleneck problem that may arise when performing parallel database search with a fast sequence comparison algorithm. The paper also describes several program design strategies that can help with this problem. The paper discusses how this bottleneck is an example of a general problem that may occur when parallelizing, or otherwise speeding up, a time-consuming computation.

13.
This paper generalizes the widely used Nelder and Mead (Comput J 7:308–313, 1965) simplex algorithm to parallel processors. Unlike most previous parallelization methods, which are based on parallelizing the tasks required to compute a specific objective function given a vector of parameters, our parallel simplex algorithm uses parallelization at the parameter level. Our parallel simplex algorithm assigns to each processor a separate vector of parameters corresponding to a point on a simplex. The processors then conduct the simplex search steps for an improved point, communicate the results, and a new simplex is formed. The advantage of this method is that our algorithm is generic and can be applied, without re-writing computer code, to any optimization problem to which the non-parallel Nelder–Mead algorithm is applicable. The method is also easily scalable to any degree of parallelization up to the number of parameters. In a series of Monte Carlo experiments, we show that this parallel simplex method yields computational savings in some experiments up to three times the number of processors.

14.
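To parallelize iterative problems effectively, a process-oriented task parallelization design method is proposed. Its main idea is to parallelize a single iteration of the task-solving process. The process-oriented idea is applied to the parallel design of the K-means clustering algorithm, and the OpenMP programming model is used to verify the effectiveness of the method. Analysis of the experimental results shows that process-oriented parallel task execution is considerably more efficient than traditional serial execution and can be applied to the parallel design of iterative problems.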

15.
Many of the differential equations arising in science and engineering can be recast in the form of a matrix eigenvalue problem. Solution of this equation within the context of the Rayleigh-Ritz variational method may be viewed as one of the fundamental tasks of numerical analysis. Successive approximation approaches to the Rayleigh-Ritz problem seek to improve eigenvectors and eigenfunctions by sequentially refining a trial function. Parallelization of successive approximation approaches has been demonstrated numerous times in the literature; these studies addressed either the successive approximations or the matrix diagonalization levels of the algorithm. It is shown in this paper that these two strategies may be applied independently of one another, and the advantages of applying both parallelization levels simultaneously to the problem are discussed. Performance estimates for a two-tiered parallelization strategy are obtained by extrapolating from existing published performance data for which the two levels of parallelization were applied separately.

16.
When formulated as a system of linear inequalities, the image restoration problem yields huge, unstructured, sparse matrices even for images of small size. To solve the image restoration problem, we use the surrogate constraint methods that can work efficiently for large problems. Among variants of the surrogate constraint method, we consider a basic method performing a single block projection in each step and a coarse-grain parallel version making simultaneous block projections. Using several state-of-the-art partitioning strategies and adopting different communication models, we develop competing parallel implementations of the two methods. The implementations are evaluated based on the per-iteration performance and on the overall performance. The experimental results on a PC cluster reveal that the proposed parallelization schemes are quite beneficial.

17.
The paper describes a parallel implementation of a grand challenge problem: global atmospheric modeling. The novel contributions of our work include (1) a detailed investigation of opportunities for parallelism in atmospheric global modeling based on spectral solution methods, (2) the experimental evaluation of overheads arising from load imbalances and data movement for alternative parallelization methods, and (3) the development of a parallel code that can be monitored and steered interactively based on output data visualizations and animations of program functionality or performance. Code parallelization takes advantage of the relative independence of computations at different levels in the earth's atmosphere, resulting in parallelism of up to 40 processors, each independently performing computations for different atmospheric levels and requiring few communications between different levels across model time steps. Next, additional parallelism is attained within each level by taking advantage of the natural parallelism offered by the spectral computations being performed (e.g. taking advantage of independently computable terms in equations). Performance is measured on a 64-node KSR2 supercomputer. However, the parallel code has been ported to several shared memory parallel machines, including SGI multiprocessors, and has also been ported to distributed memory platforms like the IBM SP-2.

18.
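Graphlet Degree Vector (GDV) is an important method for studying biological networks; it reveals the correlation between each node in a biological network and its local network structure. However, as the number of automorphism orbits to be mined increases and biological networks grow in scale, the time complexity of the GDV method grows exponentially. To address this problem, the GDV method was parallelized using the Message Passing Interface (MPI) on the basis of the existing serial GDV method. In addition, the GDV method was improved, and the improved method was parallelized and optimized: it optimizes the computation when searching for the automorphism orbits of different nodes so as to avoid repeated computation, and it uses a load-balancing strategy to distribute tasks sensibly. Experimental results on simulated network data and real biological network data show that both the parallel GDV method and the improved parallel GDV method achieve good parallel performance, apply well to networks of different types and sizes, scale well, and effectively maintain high efficiency in finding automorphism orbits in networks.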

19.
An agent that must learn to act in the world by trial and error faces the reinforcement learning problem, which is quite different from standard concept learning. Although good algorithms exist for this problem in the general case, they are often quite inefficient and do not exhibit generalization. One strategy is to find restricted classes of action policies that can be learned more efficiently. This paper pursues that strategy by developing algorithms that can efficiently learn action maps that are expressible in k-DNF. The algorithms are compared with existing methods in empirical trials and are shown to have very good performance.

20.
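As the number of seismic stations has increased greatly, the volume of measurement data has also grown sharply. When faced with massive data, the traditional serial method for computing relative wave velocity changes is slow and time-consuming, and can no longer meet the needs of routine operations. To address this problem, a parallel method for computing relative wave velocity changes on massive data is proposed. By partitioning the seismic data set and scheduling the algorithm, the data set is distributed across a distributed cluster based on the Spark computing framework for parallel computation. Experiments...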
