期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

A simulator for adaptive parallel applications

Basile Schaeli Sebastian Gerlach Roger D. Hersch 《Journal of Computer and System Sciences》2008,74(6):983-999

Dynamically allocating computing nodes to parallel applications is a promising technique for improving the utilization of cluster resources. Detailed simulations can help identify allocation strategies and problem decomposition parameters that increase the efficiency of parallel applications. We describe a simulation framework supporting dynamic node allocation which, given a simple cluster model, predicts the running time of parallel applications taking CPU and network sharing into account. Simulations can be carried out without needing to modify the application code. Thanks to partial direct execution, simulation times and memory requirements are reduced. In partial direct execution simulations, the application's parallel behavior is retrieved via direct execution, and the duration of individual operations is obtained from a performance prediction model or from prior measurements. Simulations may then vary cluster model parameters, operation durations and problem decomposition parameters to analyze their impact on the application performance and identify the limiting factors. We implemented the proposed techniques by adding direct execution simulation capabilities to the Dynamic Parallel Schedules parallelization framework. We introduce the concept of dynamic efficiency to express the resource utilization efficiency as a function of time. We verify the accuracy of our simulator by comparing the effective running time, respectively the dynamic efficiency, of parallel program executions with the running time, respectively the dynamic efficiency, predicted by the simulator under different parallelization and dynamic node allocation strategies. 相似文献

2.

Design and Performance of Parallel and Distributed Approximation Algorithms for Maxcut

《Journal of Parallel and Distributed Computing》1997,46(1):48-61

We develop and experiment with a new parallel algorithm to approximate the maximum weight cut in a weighted undirected graph. Our implementation starts with the recent (serial) algorithm of Goemans and Williamson for this problem. We consider several different versions of this algorithm, varying the interior-point part of the algorithm in order to optimize the parallel efficiency of our method. Our work aims for an efficient, practical formulation of the algorithm with close-to-optimal parallelization. We analyze our parallel algorithm in the LogP model and predict linear speedup for a wide range of the parameters. We have implemented the algorithm using the message passing interface (MPI) and run it on several parallel machines. In particular, we present performance measurements on the IBM SP2, the Connection Machine CM5, and a cluster of workstations. We observe that the measured speedups are predicted well by our analysis in the LogP model. Finally, we test our implementation on several large graphs (up to 13,000 vertices), particularly on large instances of the Ising model. 相似文献

3.

A methodology for detailed performance modeling of reduction computations on SMP machines

Ruoming Gagan 《Performance Evaluation》2005,60(1-4):73-105

In this paper, we revisit the problem of performance prediction on SMP machines, motivated by the need for selecting parallelization strategy for random write reductions. Such reductions frequently arise in data mining algorithms.

In our previous work, we have developed a number of techniques for parallelizing this class of reductions. Our previous work has shown that each of the three techniques, full replication, optimized full locking, and cache-sensitive, can outperform others depending upon problem, dataset, and machine parameters. Therefore, an important question is, “Can we predict the performance of these techniques for a given problem, dataset, and machine?”.

This paper addresses this question by developing an analytical performance model that captures a two-level cache, coherence cache misses, TLB misses, locking overheads, and contention for memory. Analytical model is combined with results from micro-benchmarking to predict performance on real machines. We have validated our model on two different SMP machines. Our results show that our model effectively captures the impact of memory hierarchy (two-level cache and TLB) as well as the factors that limit parallelism (contention for locks, memory contention, and coherence cache misses). The difference between predicted and measured performance is within 20% in almost all cases. Moreover, the model is quite accurate in predicting the relative performance of the three parallelization techniques. 相似文献

4.

Efficient Parallelization of a Three-Dimensional Navier-Stokes Solver on MIMD Multiprocessors 总被引：1，自引：1，他引：0

Averbuch A. Epstein B. Ioffe L. Yavneh I. 《The Journal of supercomputing》2000,17(2):123-142

We present an efficient parallelization strategy for speeding up the computation of a high-accuracy 3-dimensional serial Navier-Stokes solver that treats turbulent transonic high-Reynolds flows. The code solves the full compressible Navier-Stokes equations and is applicable to realistic large size aerodynamic configurations and as such requires huge computational resources in terms of computer memory and execution time. The solver can resolve the flow properly on relatively coarse grids. Since the serial code contains a complex infrastructure typical for industrial code (which ensures its flexibility and applicability to complex configurations), then the parallelization task is not straightforward. We get scalable implementation on massively parallel machines by maintaining efficiency at a fixed value by simultaneously increasing the number of processors and the size of the problem.The 3-D Navier-Stokes solver was implemented on three MIMD message-passing multiprocessors (a 64-processors IBM SP2, a 20-processors MOSIX, and a 64-processors Origin 2000). The same code written with PVM and MPI software packages was executed on all the above distinct computational platforms. The examples in the paper demonstrate that we can achieve efficiency of about 60% for as many as 64 processors on Origin 2000 on a full-size 3-D aerodynamic problem which is solved on realistic computational grids. 相似文献

5.

Xen虚拟机资源调度优化算法

乔百友蔡仁翰陈东海王虹陈洋王国仁《软件学报》2014,25(S2):201-212

针对Xen虚拟化平台中虚拟机资源分配不合理的问题,提出了两种资源调度优化算法,即细粒度优化算法和粗粒度优化算法.细粒度优化算法主要解决单个物理节点上虚拟机资源分配不合理问题,能够根据物理节点上运行的各虚拟机的资源利用情况来调整资源分配量,适当增加利用率较高的虚拟机的资源,减少资源利用率低的虚拟机的资源,从而优化资源分配,提高资源利用效率,避免不必要的虚拟机迁移.粗粒度优化算法是针对集群中多个物理节点之间虚拟机负载不均衡问题而提出的.该算法结合粒子群优化技术,选择将集群系统中热点物理机上的部分虚拟机迁移到最适合的冷点物理机上,从而避免高载物理机宕机.实验结果表明,这两种资源调度优化算法能够有效解决虚拟机资源分配不合理的问题,具有较好的适用性和应用前景. 相似文献

6.

Runtime Support for Parallelization of Data-Parallel Applications on Adaptive and Nonuniform Computational Environments

Maher Kaddoura Sanjay Ranka 《Journal of Parallel and Distributed Computing》1997,43(2):163

In this paper we discuss the runtime support required for the parallelization of unstructured data-parallel applications on nonuniform and adaptive environments. We describe several optimization techniques for fast remapping of data and for reducing the amount of communications between machines when the computational load on the machines changes during the execution of data-parallel applications. The approach presented is reasonably general and is applicable to a wide variety of regular as well as irregular applications. We present performance results for the solution of an unstructured mesh on a cluster of workstations. 相似文献

7.

云环境中海量数据的并行分组密码体制研究

师金钢郑艳孙焕良栾方军《计算机科学与探索》2014,(2):161-170

云计算环境中,飞速增长的海量数据的安全性越来越受到关注,分组密码算法是保证海量数据安全性的一个有效手段,但面对超大规模的数据量其效率是一个备受关注的问题。提出了一种基于MapReduce架构的并行分组密码机制,能够使标准的分组密码算法应用于大规模的集群环境中,通过并行化来提高海量数据加密与解密的执行效率,并设计了常用的几种并行工作模式。实验证明,提出的算法具有良好的可扩展性和高效的执行性能,能够适用于云计算环境中海量数据的安全保密,为进一步的研究工作奠定了基础。相似文献

8.

基于三层混合编程模型的Petri网并行算法研究

周杰李文敬《计算机科学》2017,44(Z11):586-591, 595

为解决多核机群Petri网并行化过程中,运用MPI+OPenMP混合编程实现同步会出现死锁的问题,提出了基于三层混合编程模型的Petri网并行算法。首先,根据事务内存的同步优势,在多核机群环境下构建MPI+OPenMP+STM的三层编程模型;然后,对Petri网的几何模型与代数模型的并行化进行分析,建立MPI+OPenMP+STM三层结构的Petri网并行模型,并对三层混合编程模型的Petri网并行算法进行设计与分析;最后,通过示例进行编程验证,该算法的运行效率明显优于其他编程模式,而且Petri网的规模越大,其并行计算的效果就越明显。因此,该算法是多核机群环境下模拟Petri网并行运行的一种高效且可行的算法。相似文献

9.

面向多核的基于RSTM系统的冲突管理策略

下载免费PDF全文

魏广博张平黄国睿《计算机工程》2010,36(10):35-37

程序并行化是充分发挥多核处理器性能的有效手段。现有编程模型受锁、管道等同步方式的约束,并行度很难提高。针对上述问题,提出一种面向多核的基于Rochester软件事务存储(RSTM)系统的冲突管理策略,在现有编程语言中提供接口,通过事务方式提高程序并行度,以优先级方式解决2个事务发生冲突时的裁决问题,减少不必要的一致性验证,减小系统开销。相似文献

10.

基于Hadoop的协同过滤并行化算法

曹霞谢颖华《计算机系统应用》2018,27(5):166-170

在针对大数据的迅速增长,为了改善协同过滤算法的推荐效率,使得推荐精度越来越高,提出基于Hadoop平台的协同过滤并行化算法,将传统的基于用户的协同过滤在Hadoop平台下进行MapReduce编程模型,实现并行化.通过利用MovieLens公用数据集对改进前后的算法对比,验证了并行化的协同过滤效率更高,也更加适合大规模数据的推荐. 相似文献

11.

面向公有云的支持快速解密的CP-ABE方案

邹莉萍冯朝胜秦志光袁丁罗王平李敏《软件学报》2020,31(6):1817-1828

现有的密文策略基于属性加密CP-ABE（ciphertext-policy attribute-based encryption）算法普遍在解密时存在计算量过大、计算时间过长的问题，该问题造成CP-ABE难以应用和实施.针对该问题，将计算外包引入到方案的设计之中，提出一种面向公有云的基于Spark大数据平台的CP-ABE快速解密方案.在该方案中，专门根据CP-ABE的解密特点设计了解密并行化算法；利用并行化算法，将计算量较大的叶子节点解密和根节点解密并行化；之后，将并行化任务交给Spark集群进行处理.计算外包使得绝大多数解密工作由云服务器完成，用户客户端只需进行一次指数运算；而并行化处理则提高了解密速度.安全性分析表明，所提出的方案在一般群模型和随机预言模型下能对抗选择明文攻击. 相似文献

12.

Performance metrics for embedded parallel pipelines

Fleury M. Downton A.C. Clark A.F. 《Parallel and Distributed Systems, IEEE Transactions on》2000,11(11):1164-1185

A statistical approach to performance prediction is applied to a system development methodology for pipelines comprised of independent parallel stages. The methodology is aimed at distributed memory machines employing medium-grained parallelization. The target applications are continuous-flow embedded systems. The use of order statistics on this type of system is compared to previous practical usage which appears largely confined to traditional Non-Uniform Memory Access (NUMA) machines for loop parallelization. A range of suitable performance metrics which give upper bounds or estimates for task durations are discussed. The metrics have a practical role when included in prediction equations in checking fidelity to an application performance specification. An empirical study applies the mathematical findings to the performance of a multicomputer for a synchronous pipeline stage. The results of a simulation are given for larger numbers of processors. In a further simulation, the results are extended to take account of waiting-time distributions while data are buffered between stages of an asynchronous pipeline. Order statistics are also employed to estimate the degradation due to an output ordering constraint. Practical illustrations in the image communication and vision application domains are included. 相似文献

13.

Efficient parallelization of a parabolized flow solver

A. Povitsky 《Computers & Fluids》1998,27(8):985-1000

This article describes application of our theory of parallelization of implicit ADI schemes to parabolized flows. A parallel multi-domain version of a turbulent developing flow in a straight duct (case A) and viscous flow in a curved duct (case B) are presented. Semi-implicit and explicit methods for the determination of boundary values for the auxiliary ADI functions on the interfaces between the sub-domains are utilized. Numerical runs show that the proposed algorithm is valid in the regions with rapidly varying fields of governing variables (near-entrance region for the case A, region 30°<θ<60° for the case B) as well as in the regions with slow axial modification of the flowfield. The algorithm is suitable for small transverse velocity (case A) and for transverse velocity of order of streamwise velocity (case B). A simplified version of our theoretical model of parallel efficiency is developed and utilized for optimal multidomain partitioning. Computer runs of the multi-domain code are done on a Meiko CS and on a DEC Alpha farm with PVM communication software. The predictions of parallel efficiency obtained by the model compare well with those of actual computer runs. The parallelization parameters obtained are quite different for two considered MIMD machines. This fact confirms the importance of a priori estimation of parallelization efficiency of an algorithm and correct choice of a parallel computer. 相似文献

14.

On Exploiting Heterogeneity for Cluster Based Parallel Multithreading Using Task Duplication

Kwok Yu-Kwong 《The Journal of supercomputing》2003,25(1):63-72

Triggered by the ever increasing advancements in processor and networking technology, a cluster of PCs connected by a high-speed network has become a viable and cost-effective platform for the execution of computation intensive parallel multithreaded applications. However, there are two research issues to be tackled in the scheduling problem for PC cluster computing: (1) how to reduce the communication overhead of executing a multithreaded application on the cluster; (2) how to exploit the heterogeneity, which is unavoidable in an evolving PC cluster, for the application. In this paper, we propose to use a duplication based approach in scheduling tasks/threads to a heterogeneous cluster of PCs. In duplication based scheduling, critical tasks are redundantly scheduled to more than one machine, in order to reduce the number of inter-task communication operations. The start times of the succeeding tasks are also reduced. The task duplication process is guided given the system heterogeneity in that the critical tasks are scheduled or replicated in faster machines. The algorithm has been implemented in our experimental application parallelization system for generating multithreaded parallel code executable on a cluster of Pentium PCs. Our experiments, using three numerical applications and one protocol processing kernel (multithreading per request), have indicated that heterogeneity of PC cluster is indeed useful for optimizing the execution of parallel multithreaded programs. 相似文献

15.

Performance evaluation of a workstation cluster,TMC CM-5, and Intel Paragon/XP using a parallel homology analysis program

《Parallel Computing》1997,22(12):1677-1693

In the homology analysis, required memory size and computing time proportionally increase with the square of the number of amino acids in protein. In order to expand the limit of size and to carry out analysis effectively, we parallelized for homology analysis program adopting the wavefront method with tiling. The performance of a workstation cluster, CM5, and Paragon was compared by using our program.For performance evaluation of these machines, we have made a model of our parallel homology analysis computation. Using the model, we can understand the characteristics of these machines with respect to our program.Execution time estimated by the execution model is in good agreement with the observation. The averaged error of the estimation to the observation on a workstation cluster, CM5, and Paragon are 0.1%, 5%, and 2%. Almost linear speed-up has been observed from 1PU to 20PUs on them. The ratio of the performance of Paragon, CM5, and the workstation cluster is 1:0.38:0.22. This ratio is constant to the variety of the size of problem and the number of processors. 相似文献

16.

Parallel computation and FASTA: confronting the problem of parallel database search for a fast sequence comparison algorithm 总被引：4，自引：0，他引：4

P L Miller P M Nadkarni N M Carriero 《Computer applications in the biosciences》1991,7(1):71-78

We have parallelized the FASTA algorithm for biological sequence comparison using Linda, a machine-independent parallel programming language. The resulting parallel program runs on a variety of different parallel machines. A straight-forward parallelization strategy works well if the amount of computation to be done is relatively large. When the amount of computation is reduced, however, disk I/O becomes a bottleneck which may prevent additional speed-up as the number of processors is increased. The paper describes the parallelization of FASTA, and uses FASTA to illustrate the I/O bottleneck problem that may arise when performing parallel database search with a fast sequence comparison algorithm. The paper also describes several program design strategies that can help with this problem. The paper discusses how this bottleneck is an example of a general problem that may occur when parallelizing, or otherwise speeding up, a time-consuming computation. 相似文献

17.

Improved strong scaling of a spectral/finite difference gyrokinetic code for multi-scale plasma turbulence

《Parallel Computing》2015

Optimization techniques of a plasma turbulence simulation code GKV for improved strong scaling are presented. This work is motivated by multi-scale plasma turbulence extending over multiple spatio-temporal scales of electrons and ions, whose simulations based on the gyrokinetic theory require huge calculations of five-dimensional (5D) computational fluid dynamics by means of spectral and finite difference methods. First, we present the multi-layer domain decomposition of the multi-dimensional and multi-species problem, and segmented MPI-process mapping on 3D torus interconnects, which fully utilizes the bi-section bandwidth for data transpose and reduces the conflicts of simultaneous point-to-point communications. These techniques reduce the inter-node communication cost drastically. Second, pipelined computation-communication overlaps are implemented by using the OpenMP/MPI hybrid parallelization, which effectively mask the communication cost. Careful regulations of the pipeline length and of the thread granularity respectively suppress latencies of MPI, load imbalance and scheduling overheads of OpenMP. Thanks to the above optimizations, GKV achieves excellent strong scaling up to ∼600k cores with high computational performance 782.4 TFlops (8.29% of the theoretical peak) and high effective parallelization rate ∼99.99994% on K, which demonstrates its applicability and efficiency toward a million of cores. The optimized code realizes multi-scale plasma turbulence simulations covering electron and ion scales, and reveals cross-scale interactions of electron- and ion-scale turbulence. 相似文献

18.

Valid transformations: a new class of loop transformations forhigh-level synthesis and pipelined scheduling applications

Minjoong Rim Jain R. 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(4):399-410

In this paper we present a new class of loop optimizing transformations called valid transformations, which are suitable for fine-grain parallelization applications such as high-level synthesis of VLSI designs or compilers for super-scalar or VLIW machines. This class of transformations are different from existing ones in that valid transformations can be illegal. Nevertheless, if a transformation is valid, the transformed loop has a feasible pipeline schedule. We present an example valid transformation called loop expansion which can help produce cost-performance efficient designs and explore a larger design space for a satisfactory design. Several examples are used to demonstrate the efficacy of the proposed technique 相似文献

19.

随机算法异步并行化的效率分析

徐云陈国良张强峰顾钧《软件学报》2003,14(5):871-876

随机算法的执行时间具有不确定性,这种不确定性为随机算法的异步并行提供了良好的基础,已有许多计算实验表明了随机算法的异步并行可以达到线性甚至超线性的加速.对于求解SAT问题的随机算法RDP,研究了异步并行效率与运行时间分布和处理器数目之间的关系.应用一种单峰分布──分段线性分布模型来模拟随机算法的运行时间分布.理论分析和计算结果均表明:当处理器数目k较小和单峰位于分布的前部时,随机算法的异步并行具有近线性加速. 相似文献

20.

From the theory to the tools: parallel dynamic programming

D. Gonzlez F. Almeida J. Roda C. Rodríguez 《Concurrency and Computation》2000,12(1):21-34

Dynamic programming is an important paradigm that has been widely used to solve problems in various areas such as control theory, operation research, biology and computer science. We generalize the finite automaton formal model for dynamic programming deriving pipeline parallel algorithms. The optimality of these algorithms is established for the new class of non‐decreasing finite automata. As an intermediate step for the construction of a skeleton for the automatic parallelization of dynamic programming, we have developed a tool for the implementation of pipeline algorithms. The tool maps the processes in the pipeline in the target architecture following a mix of block and cyclic policies adapted to the grain of the machine. Based on the former tool, the automatic parallelization of dynamic programming is straightforward. The use of the model and its associated tools is illustrated with the Single Resource Allocation Problem. The performance and portability of these tools is compared with specific ‘hand made’ code written by experienced programmers. The experimental results on distributed memory and shared distributed memory architectures prove the scalability of the proposed paradigm and its associated tools. Copyright © 2000 John Wiley & Sons, Ltd. 相似文献