Similar articles
20 similar articles found (search time: 15 ms)
1.
This paper derives simple, yet fundamental formulas to describe the interplay between parallelism of an application, program performance, and energy consumption. Given the ratio of serial and parallel portions in an application and the number of processors, we derive optimal frequencies allocated to the serial and parallel regions in an application to either minimize the total energy consumption or minimize the energy-delay product. The impact of static power is revealed by considering the ratio between static and dynamic power and quantifying the advantages of adding to the architecture the capability to turn off individual processors and save static energy. We further determine the conditions under which one can obtain both energy and speed improvement, as well as the amount of improvement. While the formulas we obtain use simplifying assumptions, they provide valuable theoretical insights into energy-aware processor resource management. Our results form a basis for several interesting research directions in the area of energy-aware multicore processor architectures.

2.
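陈剑骏, 陈耀武. 《计算机工程》 (Computer Engineering), 2012, 38(12): 214-217
To address the selection and partitioning of parallel modules and the optimization of decoding speed in H.264 video decoding, a scalable parallel H.264 decoding algorithm is proposed for the TilePro64 multicore platform. The algorithm's internal functional modules are consolidated and partitioned, and the modules are assigned dynamically according to inter-core data dependencies to optimize parallel efficiency. Experimental results show that the algorithm performs well in decoding efficiency, degree of multicore parallelism, and decoding latency; compared with traditional parallel decoding algorithms, its parallel speedup is about 25% higher.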

3.
With energy consumption becoming one of the first-class optimization parameters in computer system design, compilation techniques that consider performance and energy simultaneously are expected to play a central role. In particular, compiling a given application code under performance and energy constraints is becoming an important problem. In this paper, we focus on an on-chip multiprocessor architecture and present a set of code optimization strategies. We first evaluate an adaptive loop parallelization strategy (i.e., a strategy that allows each loop nest to execute using a different number of processors if doing so is beneficial) and measure the potential energy savings when unused processors during execution of a nested loop are shut down (i.e., placed into a power-down or sleep state). Our results show that shutting down unused processors can lead to as much as 67 percent energy savings at the expense of up to 17 percent performance loss in a set of array-intensive applications. To eliminate this performance penalty, we also discuss and evaluate a processor preactivation strategy based on compile-time analysis of nested loops. Based on our experiments, we conclude that an adaptive loop parallelization strategy combined with idle processor shutdown and preactivation can be very effective in reducing energy consumption without increasing execution time. We then generalize our strategy and present an application parallelization strategy based on integer linear programming (ILP). Given an array-intensive application, our optimization strategy determines the number of processors to be used in executing each loop nest based on the objective function and additional compilation constraints provided by the user/programmer. Our initial experience with this constraint-based optimization strategy shows that it is very successful in optimizing array-intensive applications on on-chip multiprocessors under multiple energy and performance constraints.

4.
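To address the problems of serial programs that contain a large number of loops, a loop selection scheme based on thread-level speculation is proposed. The scheme selects loops optimally and builds a set of loops that can run in parallel. For the loops in this set, the code segments with high parallel efficiency are chosen for parallel execution to speed up the serial program. Experiments show that, compared with the usual approach of simply parallelizing inner or outer loops, the scheme raises the speedup of 9 benchmark codes by 23.8% on average, thereby improving the efficiency of running serial programs in parallel.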

5.
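吴悦, 雷超付, 杨洪斌. 《计算机工程》 (Computer Engineering), 2010, 36(9): 35-37, 40
To address the problems of serial programs that contain a large number of loops, a loop selection scheme based on thread-level speculation is proposed. The scheme selects loops optimally and builds a set of loops that can run in parallel. For the loops in this set, the code segments with high parallel efficiency are chosen for parallel execution to speed up the serial program. Experiments show that, compared with the usual approach of simply parallelizing inner or outer loops, the scheme raises the speedup of 9 benchmark codes by 23.8% on average, thereby improving the efficiency of running serial programs in parallel.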

6.
We have developed a high-performance version of the Monte Carlo particle transport simulation code MC4. The original application code, developed in Visual Basic for Applications (VBA) for Microsoft Excel, was first rewritten in the C programming language to improve code portability. Several pseudo-random number generators have also been integrated and studied. The new MC4 version was then parallelized for shared and distributed-memory multiprocessor systems using the Message Passing Interface. Two parallel pseudo-random number generator libraries (SPRNG and DCMT) have been seamlessly integrated. The performance speedup of parallel MC4 has been studied on a variety of parallel computing architectures including an Intel Xeon server with 4 dual-core processors, a Sun cluster consisting of 16 nodes of 2 dual-core AMD Opteron processors and a 200 dual-processor HP cluster. For large problem sizes, which are limited only by the physical memory of the multiprocessor server, the speedup results are almost linear on all systems. We have validated the parallel implementation against the serial VBA and C implementations using the same random number generator. Our experimental results on the transport and energy loss of electrons in a water medium show that the serial and parallel codes are equivalent in accuracy. The present improvements allow the study of higher particle energies with more accurate physical models, and improve statistics because more particle tracks can be simulated with low response time.

7.
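The brain storm optimization (BSO) algorithm is a recent swarm intelligence optimization algorithm, inspired by the way groups of people brainstorm to solve problems, and is suited to complex multimodal function optimization. However, BSO requires repeated iterations when searching for multimodal extrema, and on large datasets both its computational efficiency and its solution accuracy become too low. To address this, a Spark-based parallel BSO algorithm is designed and implemented: the clustering and new-solution generation steps, which have the highest computational complexity in BSO, are parallelized to improve speedup and computational efficiency. In particular, following the parallelization idea, the population is divided into several subgroups that evolve cooperatively; each subgroup generates new solutions independently to maintain population diversity and improve convergence speed. Finally, the parallel BSO algorithm is used to solve multimodal functions. Experiments show that, with 10 cores in total across the parallel nodes, the parallel BSO algorithm halves the computation time, achieves accuracy roughly on par with serial BSO, and converges markedly faster, which demonstrates the effectiveness of the parallel BSO.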

8.
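To meet the demand for real-time decoding of high-definition video, a decoding algorithm based on a multi-level parallel pipeline architecture is proposed. The algorithm first implements row-level parallelism over macroblock rows based on functional modules and balances the load across cores through a second partitioning of those modules. It then targets the filtering stage, which is costly in decoding, and uses the dependencies between macroblocks for multicore parallel processing, further optimizing the row-level parallel algorithm at a deeper level. Experimental results show that the algorithm achieves real-time decoding of 1080p full-HD streams on the TILEPro64 platform with a high parallel speedup of up to 10.01, an 80% improvement in speedup over existing parallel decoding algorithms.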

9.
It is worthwhile to spend a little extra energy when the performance gain outweighs the added energy. It also makes sense to relax a performance constraint slightly in order to save a large amount of energy. Trading a small amount of energy for a considerable gain in performance, or vice versa, is possible if the relationship between the performance and energy of parallel programs is known precisely. This work studies that relationship by recording the performance speedup and energy consumption of parallel programs as the number of cores on which the programs run is changed. We demonstrate that the performance improvement and the increased energy consumption have a linear negative correlation. In addition, this relationship can guide performance–energy adaptation under two assumptions. Our experiments show that the average correlation coefficients between performance and energy are higher than 97%. Furthermore, exchanging a performance loss of less than 6% for an energy reduction of more than 37% is feasible, and vice versa.

10.
We report on the parallelization of two widely used algorithms in computational physics: the Monte Carlo simulation of the Ising model and a cluster identification algorithm which is used for percolation or percolation-like problems. Both parallel algorithms were tested on a multi-transputer system using up to 128 processors. The results show that the algorithms can perform with a linear speedup. We propose a scaling law for the speedup and show that the speedup for both algorithms satisfies this scaling.

11.
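The HOG feature is a simple and efficient feature descriptor commonly used for object detection and widely applied in areas such as pedestrian detection, but it faces severe performance challenges when processing massive numbers of images. One solution is to accelerate pedestrian detection over massive image sets using the processor nodes of the Sunway TaihuLight (神威太湖之光) supercomputer. Two parallel schemes are used: in one, a processor handles 4 images at a time; in the other, it handles 256 images at a time. Extensive serial and parallel experiments show that the first scheme suits parallel processing of multiple high-resolution images, with a speedup of up to 83, while the second suits low-resolution images, with a speedup of up to 95. Both parallel designs scale well on the multiprocessor nodes of Sunway TaihuLight.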

12.
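Based on the key factors that affect the performance of parallel ant colony algorithms, this paper proposes an adaptive parallel ant colony algorithm. Two information exchange strategies are first proposed, one based on fitness and one based on distance, so that each processor adaptively chooses the processor with which it exchanges information; an adaptive update strategy is then used to update the pheromone. To strengthen the search ability of the algorithm, a method is also given for adaptively adjusting the information exchange period between processors according to the diversity of the solutions. Experiments on the TSP on the MPP machine 深腾1800 (DeepComp 1800) show that the algorithm converges well while maintaining an effective speedup.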

13.
This paper generalizes the widely used Nelder and Mead (Comput J 7:308–313, 1965) simplex algorithm to parallel processors. Unlike most previous parallelization methods, which are based on parallelizing the tasks required to compute a specific objective function given a vector of parameters, our parallel simplex algorithm uses parallelization at the parameter level. Our parallel simplex algorithm assigns to each processor a separate vector of parameters corresponding to a point on a simplex. The processors then conduct the simplex search steps for an improved point, communicate the results, and a new simplex is formed. The advantage of this method is that our algorithm is generic and can be applied, without re-writing computer code, to any optimization problem to which the non-parallel Nelder–Mead algorithm is applicable. The method is also easily scalable to any degree of parallelization up to the number of parameters. In a series of Monte Carlo experiments, we show that this parallel simplex method yields computational savings that in some experiments reach up to three times the number of processors.

14.
We address the task of measuring the relative speed (speedup) of two systems $A$ and $B$ for solving the same problem. For example, $B$ may be a parallel algorithm, parametrized by the number of processors used, whose running time has to be related to a serial standard algorithm $A$. If $A$ and/or $B$ are randomized or if we are interested in their performance on a (discrete) probability distribution of problem instances, the running times are described by random variables $T_A$ and $T_B$. The speedup of $B$ over $A$ is usually defined as $E[T_A]/E[T_B]$, where $E$ denotes the expected value. In many cases this definition is not appropriate for the user of $A$ or $B$, because the summation in $E[T_A]$ and $E[T_B]$ hides information about the speedup of individual runs. We propose an alternative speedup definition and present a set of intuitive functional equations, which any such function should fulfill. Finally, we prove that the weighted geometric mean is the only solution of these equations. Received: 1 July 1994 / 10 May 1996
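
A minimal sketch of the contrast this abstract draws, assuming paired running-time samples for the serial and the parallel system on the same problem instances. The function names and sample timings are hypothetical, and the equal default weights are an assumption rather than the paper's choice.

```python
import math
from statistics import mean


def ratio_of_means(t_serial, t_parallel):
    """Conventional speedup: mean serial time divided by mean parallel time."""
    return mean(t_serial) / mean(t_parallel)


def geometric_mean_speedup(t_serial, t_parallel, weights=None):
    """Weighted geometric mean of the per-run speedups t_serial[i] / t_parallel[i].

    Assumes run i of both systems solved the same problem instance.
    """
    if len(t_serial) != len(t_parallel):
        raise ValueError("both systems need the same number of runs")
    if weights is None:
        weights = [1.0] * len(t_serial)  # equal weights (an assumption)
    log_sum = sum(
        w * math.log(a / b) for w, a, b in zip(weights, t_serial, t_parallel)
    )
    return math.exp(log_sum / sum(weights))


# Hypothetical timings: one long run sped up 10x, one short run not at all.
serial_times = [100.0, 1.0]
parallel_times = [10.0, 1.0]

print(ratio_of_means(serial_times, parallel_times))          # dominated by the long run
print(geometric_mean_speedup(serial_times, parallel_times))  # balances both runs
```

With these sample timings the ratio of means is pulled toward the speedup of the long run, while the geometric mean treats both per-run speedups symmetrically, which is the kind of information about individual runs that the abstract says the expected-value definition hides.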

15.
A portable parallelization of the Cooley–Tukey FFT algorithm for MIMD multiprocessors is presented. The implementation uses the virtual machine for multiprocessors (VMMP) and PVM portable software packages. Since VMMP provides the same set of services on all target machines, a single version of the parallel FFT code was used for shared memory (25-processor Sequent Symmetry), shared bus (MOS-running distributed UNIX) and distributed memory multiprocessor (transputer network and 64-processor IBM SP2). The implementation is accompanied by a detailed performance analysis. The algorithm achieved high efficiencies on all target machines. The analysis indicates that most overheads are caused by the target architecture and not by VMMP or PVM inefficiencies. The portability analysis of the FFT provides several important insights. On the message passing architecture, the parallel FFT algorithm can obtain linearly increasing speedup with respect to the number of processors with only a moderate increase in the problem size. The parallel FFT can be executed by any number of processors, but generally the number of processors is much less than the length of the input data. The results indicate that the parallel FFT is portable: it achieves very good speedups on either a shared memory multiprocessor with high memory bandwidth or on a message passing multiprocessor without any change in the programs. © 1998 John Wiley & Sons, Ltd.

16.
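For multicore processor systems with independent DVFS, a parallel task scheduling optimization algorithm based on a K-thread low-energy model is proposed (Tasks Optimization based on Energy-Effectiveness Model, TO-EEM). Compared with traditional energy-saving scheduling of parallel tasks, the algorithm aims not only to reduce instantaneous processor power by lowering processor frequency, but also to account for thread blocking caused by synchronization and mutual exclusion between parallel tasks, allocating thread resources sensibly to reduce thread synchronization time and improve parallel performance. While guaranteeing a given parallel speedup, it raises resource utilization and lowers energy consumption, striking a trade-off between program energy and performance. Extensive simulation experiments show that the proposed task optimization algorithm delivers clear energy savings, effectively reduces processor power consumption, and consistently maintains linear speedup.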

17.
This paper reports on a parallel implementation of a general 3D multi-block CFD code. The parallelization is achieved by using three strategies. Firstly, it is done on dual-processor PC clusters running Windows NT. A multi-thread programming model is adopted for the multi-block code, where one thread corresponds to a block. Shared memory is used for the exchange of inner boundaries between neighboring blocks (threads) on the same node, while WinSockets are employed for those on different nodes. Secondly, the parallelization is extended to the UNIX operating system. MPI is applied for all the message passing between different processors, including those on the same node. Thirdly, Pthreads (POSIX threads), a standardized application interface for threads, are adopted to take advantage of the shared-memory feature of the SMP nodes, while MPI is only applied for the message passing between processors on different nodes. In all the strategies, a static load-balancing method is employed for equitable distribution of computational work to specified nodes. The parameters of the present code are studied in detail to facilitate the explanation of the speedup results. Two examples are provided to show the speedup and load balancing of the parallel calculation. A detailed comparison is made to evaluate the efficiency of the different strategies.

18.
Parallel and serial heuristics for the minimum set cover problem (cited 3 times in total: 0 self-citations, 3 by others)
We present a theoretical analysis and an experimental evaluation of four serial heuristics and four parallel heuristics for the minimum set cover problem. The serial heuristics trade off run time with the quality of the solution. The parallel heuristics are derived from one of the serial heuristics. These heuristics show considerable speedup when the number of processors is increased. The quality of the solution computed by the heuristics does not degrade with an increase in the number of processors. Research of both authors was supported by NSF Grant No. MIP-8807540.

19.

Models extending Amdahl’s law have been developed to study the energy consumption of parallel programs. In addition, it has been shown that the energy consumption of those programs also depends on the layout of on-chip resources, such as the power supply. Other extensions of Amdahl’s law have been developed to study the speedup of parallel programs on variable-frequency processors. Previous models have focused on the use of Turbo Boost in the parallel regions of a program, without considering that Turbo Boost also affects the sequential regions. Hence, to cover this gap, we present a model to analyze the energy consumption of parallel programs executed on Intel multicore processors with Turbo Boost frequencies. The model is an extension of Amdahl’s law, and it is validated with a double-precision matrix multiplication running on Intel multicore processors that enable Turbo Boost technology.
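
For orientation, the classic form of Amdahl's law that this abstract builds on can be sketched as follows. This is only the textbook speedup formula; it does not include the energy or Turbo Boost frequency terms of the paper's model, which the abstract does not give. The function name and the sample parallel fraction are hypothetical.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Classic Amdahl's law: speedup = 1 / ((1 - p) + p / n).

    p is the fraction of the run time that can be parallelized and
    n is the number of processors.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# Hypothetical program that is 90% parallelizable.
for n in (1, 2, 4, 8, 16):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

The serial fraction caps the achievable speedup at 1 / (1 - p) regardless of the processor count, which is why the frequency used in the sequential regions matters in models of this kind.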

20.
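Anisotropic etching of silicon is a complex process, and simulating it with cellular automata is very time-consuming. To accelerate the simulation, anisotropic silicon etching simulation on a graphics processing unit (GPU) is studied. Because directly parallelizing the serial algorithm yields low acceleration efficiency, an improved parallel simulation method is proposed. The method increases the workload of the parallel part and reduces memory management overhead, thereby improving acceleration. Experiments show that the method achieves a fairly good speedup.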

