首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
马春燕  吕炳旭  叶许姣  张雨 《软件学报》2023,34(7):3022-3042
随着多核处理器的普及应用,针对嵌入式遗留系统中串行代码的自动并行化方法是研究热点.其中,针对具有非完美嵌套结构、非仿射依赖关系特征的复杂嵌套循环的自动并行化方法存在技术挑战.提出了一种基于LLVMPass的复杂嵌套循环的自动并行化框架(CNLPF).首先,提出了一种复杂嵌套循环的表示模型,即循环结构树,并将嵌套循环的正则区域自动转换为循环结构树表示;然后,对循环结构树进行数据依赖分析,构建循环内和循环间的依赖关系;最后,基于OpenMP共享内存的编程模型生成并行的循环程序.针对SPEC2006数据集中包含近500个复杂嵌套循环的6个程序案例,分别对其进行复杂嵌套循环占比统计和并行性能加速测试.结果表明,提出的自动并行化框架可以处理LLVMPolly无法优化的复杂嵌套循环,增强了LLVM的并行编译优化能力,且该方法结合Polly的组合优化,比单独采用Polly优化的加速效果提升了9%-43%.  相似文献   

2.
低能耗软件设计中的性能无损电压调度技术研究   总被引:3,自引:2,他引:1  
合理地运用动态电压调整技术可以有效降低软件运行所需的能耗·从归纳分析电压调整特征入手,针对程序执行中存在电压调整特征差异的情况,提出了性能无损的低能耗电压调度问题·把该问题形式化为一个混合整数规划模型(MILP),提出了基于剖析结果的PGS算法和基于分析结果的ADS算法·实例分析表明所提出的方法能够有效实现性能无损的低能耗软件设计,模拟实验表明启发式算法可实现较好的近似解·  相似文献   

3.
An important issue for the efficient use of multiprocessor systems is the assignment of parallel processors to nested parallel loops. It is desirable for a processor assignment algorithm to be fast and always generate an optimal processor assignment. The paper proposes two efficient algorithms to decide the optimal number of processors assigned to each individual loop. Efficient parallel counterparts of these two algorithms are also presented. These algorithms not only always generate an optimal processor assignment, but also are much faster than the exiting optimal algorithm in the literature. The paper discusses improving the performance of parallel execution by transforming a nested parallel loop into a semantically equivalent one. Three loop transformations are investigated. It is observed that, in most cases, the parallel execution time is improved after applying these transformations  相似文献   

4.
多核处理器已广泛应用于高性能计算领域,如何有效地将传统串行程序转换为并行代码并减少程序中嵌套循环所占用时间仍是该领域的挑战性问题。本文首先基于多面体模型对嵌套循环进行依赖特征分析并实现瓦片分割,据此自动生成粗粒度并行代码。针对多核阵列处理器的结构特点,采用遗传算法生成通信优化的瓦片任务序列,在此基础上建立了有效的任务调度模型。最后将上述方法应用于LU分解,结果表明该方法与传统调度算法相比,在增加数据局部性、实现负载平衡方面具有更好效果。  相似文献   

5.
现有并行识别方法用于众核处理器时存在一定不足,当选择的循环并行维迭代数较少时可能导致严重地负载不均衡。针对这一问题,提出了一种面向众核处理器的多维并行识别方法,在现有并行识别方法无法做到较好的负载均衡时,选择嵌套循环的多个维进行并行,将多个并行维的迭代空间合并后再做任务划分,减少负载不均衡对程序并行效率的影响。此方法已在课题组开发的自动并行化系统中进行了实现,实际应用过程中能够提升一些应用程序在众核处理器上并行执行的效率。  相似文献   

6.
The high power consumption of modern processors becomes a major concern because it leads to decreased mission duration (for battery-operated systems), increased heat dissipation, and decreased reliability. While many techniques have been proposed to reduce power consumption for uniprocessor systems, there has been considerably less work on multiprocessor systems. In this paper, based on the concept of slack sharing among processors, we propose two novel power-aware scheduling algorithms for task sets with and without precedence constraints executing on multiprocessor systems. These scheduling techniques reclaim the time unused by a task to reduce the execution speed of future tasks and, thus, reduce the total energy consumption of the system. We also study the effect of discrete voltage/speed levels on the energy savings for multiprocessor systems and propose a new scheme of slack reservation to incorporate voltage/speed adjustment overhead in the scheduling algorithms. Simulation and trace-based results indicate that our algorithms achieve substantial energy savings on systems with variable voltage processors. Moreover, processors with a few discrete voltage/speed levels obtain nearly the same energy savings as processors with continuous voltage/speed, and the effect of voltage/speed adjustment overhead on the energy savings is relatively small.  相似文献   

7.
Warp processors are a novel architecture capable of autonomously optimizing an executing application by dynamically re-implementing critical kernels within the software as custom hardware circuits in an on-chip FPGA. Previous research on warp processing focused on low-power embedded systems, incorporating a low-end ARM processor as the main software execution resource. We provide a thorough analysis of the scalability of warp processing by evaluating several possible warp processor implementations, from low-power to high-performance, and by evaluating the potential for parallel execution of the partitioned software and hardware. We further demonstrate that even considering a high-performance 1 GHz embedded processor, warp processing provides the equivalent performance of a 2.4 GHz processor. By further enabling parallel execution between the processes and FPGA, the parallel warp processor execution provides the equivalent performance of a 3.2 GHz processor.  相似文献   

8.
《Parallel Computing》1999,25(13-14):1741-1783
Over the past two decades tremendous progress has been made in both the design of parallel architectures and the compilers needed for exploiting parallelism on such architectures. In this paper we summarize the advances in compilation techniques for uncovering and effectively exploiting parallelism at various levels of granularity. We begin by describing the program analysis techniques through which parallelism is detected and expressed in form of a program representation. Next compilation techniques for scheduling instruction level parallelism (ILP) are discussed along with the relationship between the nature of compiler support and type of processor architecture. Compilation techniques for exploiting loop and task level parallelism on shared-memory multiprocessors (SMPs) are summarized. Locality optimizations that must be used in conjunction with parallelization techniques for achieving high performance on machines with complex memory hierarchies are also discussed. Finally we provide an overview of compilation techniques for distributed memory machines that must perform partitioning of both code and data for parallel execution. Communication optimization and code generation issues that are unique to such compilers are also briefly discussed.  相似文献   

9.
This paper generalizes the widely used Nelder and Mead (Comput J 7:308–313, 1965) simplex algorithm to parallel processors. Unlike most previous parallelization methods, which are based on parallelizing the tasks required to compute a specific objective function given a vector of parameters, our parallel simplex algorithm uses parallelization at the parameter level. Our parallel simplex algorithm assigns to each processor a separate vector of parameters corresponding to a point on a simplex. The processors then conduct the simplex search steps for an improved point, communicate the results, and a new simplex is formed. The advantage of this method is that our algorithm is generic and can be applied, without re-writing computer code, to any optimization problem which the non-parallel Nelder–Mead is applicable. The method is also easily scalable to any degree of parallelization up to the number of parameters. In a series of Monte Carlo experiments, we show that this parallel simplex method yields computational savings in some experiments up to three times the number of processors.  相似文献   

10.
Corollaries to Amdahl's Law for Energy   总被引:1,自引:0,他引:1  
This paper studies the important interaction between parallelization and energy consumption in a parallelizable application. Given the ratio of serial and parallel portion in an application and the number of processors, we first derive the optimal frequencies allocated to the serial and parallel regions in the application to minimize the total energy consumption, while the execution time is preserved (i.e., speedup = 1). We show that dynamic energy improvement due to parallelization has a function rising faster with the increasing number of processors than the speed improvement function given by the well-known Amdahl's Law. Furthermore, we determine the conditions under which one can obtain both energy and speed improvement, as well as the amount of improvement. The formula we obtain capture the fundamental relationship between parallelization, speedup, and energy consumption and can be directly utilized in energy aware processor resource management. Our results form a basis for several interesting research directions in the area of power and energy aware parallel processing.  相似文献   

11.
A significant source for enhancing application performance and for reducing power consumption in embedded processor applications is to improve the usage of the memory hierarchy. In this paper, a temporal and spatial locality optimization framework of nested loops is proposed, driven by parameterized cost functions. The considered loops can be imperfectly nested. New data layouts are propagated through the connected references and through the loop nests as constraints for optimizing the next connected reference in the same nest or in the other ones. Unlike many existing methods, special attention is paid to TLB (Translation Lookaside Buffer) effectiveness since TLB misses can take from tens to hundreds of processor cycles. Our approach only considers active data, that is, array elements that are actually accessed by a loop, in order to prevent useless memory loads and take advantage of storage compression and temporal locality. Moreover, the same data transformation is not necessarily applied to a whole array. Depending on the referenced data subsets, the transformation can result in different data layouts for a same array. This can significantly improve the performance since a priori incompatible references can be simultaneously optimized. Finally, the process does not only consider the innermost loop level but all levels. Hence, large strides when control returns to the enclosing loop are avoided in several cases, and better optimization is provided in the case of a small index range of the innermost loop.  相似文献   

12.
We demonstrate approaches to the static parallelization of loops and recursions on the example of the polynomial product. Phrased as a loop nest, the polynomial product can be parallelized automatically by applying a space-time mapping technique based on linear algebra and linear programming. One can choose a parallel program that is optimal with respect to some objective function like the number of execution steps, processors, channels, etc. However,at best,linear execution time complexity can be atained. Through phrasing the polynomial product as a divide-and-conquer recursion, one can obtain a parallel program with sublinear execution time. In this case, the target program is not derived by an automatic search but given as a program skeleton, which can be deduced by a sequence of equational program transformations. We discuss the use of such skeletons, compare and assess the models in which loops and divide-and-conquer resursions are parallelized and comment on the performance properties of the resulting parallel implementations.  相似文献   

13.
In this paper, a processor allocation mechanism for NoC-based chip multiprocessors is presented. Processor allocation is a well-known problem in parallel computer systems and aims to allocate the processing nodes of a multiprocessor to different tasks of an input application at run time. The proposed mechanism targets optimizing the on-chip communication power/latency and relies on two procedures: processor allocation and task migration. Allocation is done by a fast heuristic algorithm to allocate the free processors to the tasks of an incoming application when a new application begins execution. The task-migration algorithm is activated when some application completes execution and frees up the allocated resources. Task migration uses the recently deallocated processors and tries to rearrange the current tasks in order to find a better mapping for them. The proposed method can also capture the dynamic traffic pattern of the network and perform task migration based on the current communication demands of the tasks. Consequently, task migration adapts the task mapping to the current network status. We adopt a non-contiguous processor allocation strategy in which the tasks of the input application are allowed to be mapped onto disjoint regions (groups of processors) of the network. We then use virtual point-to-point circuits, a state-of-the-art fast on-chip connection designed for network-on-chips, to virtually connect the disjoint regions and make the communication latency/power closer to the values offered by contiguous allocation schemes. The experimental results show considerable improvement over existing allocation mechanisms.  相似文献   

14.
This paper presents an algorithm for optimization of programs at the compilation stage that analyzes execution of various program segments at available processor frequencies and selects an energy-effective schedule of frequencies with regard to the constraints of the arising additional time of operation. This algorithm is complemented by an operating-system manager that controls setting of a desired processor mode and resolves conflicts in multi-program environment.  相似文献   

15.
一种面向异构众核处理器的并行编译框架   总被引:1,自引:0,他引:1  
异构众核处理器是面向高性能计算领域处理器发展的重要趋势,但其更为复杂的体系结构使得编程难的问题更加突出.针对这一问题,基于开源编译器Open64,提出了一种面向异构众核处理器的并行编译框架,将程序自动转换为异构并行程序.该框架主要包括4个模块:任务划分模块用来识别适合进行加速计算的程序段,实现了嵌套循环的多维并行识别方法;数据布局模块完成数据在主存和SPM之间的布局,实现了数组边界分析和指针范围分析;传输优化模块实现了数据传输合并、传输外提、打包传输、数组转置等多种数据传输优化方法;收益评估模块在构建代价模型的基础上实现了一种动静结合的收益评估方法.并且,基于SW26010处理器,对该编译框架进行了实现,测试结果表明,该编译框架能够实现一些程序以面向异构众核结构的并行变换,且获得较好的加速效果.  相似文献   

16.
As moderate-scale multiprocessors become widely used, we foresee an increased demand for effective compiler parallelization and efficient management of parallelism. While parallelizing compilers are achieving success at identifying parallelism, they are less adept at predetermining the degree of parallelism in different program phases. Thus, a compiler-parallelized application may execute on more processors than it can effectively use – a waste of computational resources that becomes more acute as the number of processors increases, particularly for systems used as multiprogrammed compute servers. This paper examines the dynamic parallelism behavior of multiprogrammed workloads using programs from the Specfp95 and Nas benchmark suites, automatically parallelized by the Stanford SUIF compiler. Our results demonstrate that even the programs with good overall speedups display wide variability in the number of processors each phase (or loop) can exploit. We propose and evaluate a run-time system mechanism that dynamically adjusts the number of processors used by a compiler-parallelized application, responding to observed performance during the program's execution. Programs can thus adapt processor usage as they run, responding both to poor parallelism within certain parts of their code, and also to heavy multiprogramming loads during the execution. This mechanism improves workload performance up to 33% over one-at-a-time runs of the workload programs. © 1998 John Wiley & Sons, Ltd.  相似文献   

17.
Applications in industry often have grown and improved over many years. Since their performance demands increase, they also need to benefit from the availability of multi-core processors. However, a reimplementation from scratch and even a restructuring of these industrial applications is very expensive, often due to high certification efforts. Therefore, a strategy for a systematic parallelization of legacy code is needed. We present a parallelization approach for hard real-time systems, which ensures a high reusage of legacy code and preserves timing analysability. To show its applicability, we apply it on the core algorithm of an avionics application as well as on the control program of a large construction machine. We create models of the legacy programs showing the potential of parallelism, optimize them and change the source codes accordingly. The parallelized applications are placed on a predictable multi-core processor with up to 18 cores. For evaluation, we compare the worst case execution times and their speedups. Furthermore, we analyse limitations coming up at the parallelization process.  相似文献   

18.
We address scheduling independent and precedence constrained parallel tasks on multiple homogeneous processors in a data center with dynamically variable voltage and speed as combinatorial optimization problems. We consider the problem of minimizing schedule length with energy consumption constraint and the problem of minimizing energy consumption with schedule length constraint on multiple processors. Our approach is to use level-by-level scheduling algorithms to deal with precedence constraints. We use a simple system partitioning and processor allocation scheme, which always schedules as many parallel tasks as possible for simultaneous execution. We use two heuristic algorithms for scheduling independent parallel tasks in the same level, i.e., SIMPLE and GREEDY. We adopt a two-level energy/time/power allocation scheme, namely, optimal energy/time allocation among levels of tasks and equal power supply to tasks in the same level. Our approach results in significant performance improvement compared with previous algorithms in scheduling independent and precedence constrained parallel tasks.  相似文献   

19.
Energy costs have become increasingly problematic for high performance processors, but the rising number of cores on-chip offers promising opportunities for energy reduction. Further, emerging architectures such as heterogeneous multicores present new opportunities for improved energy efficiency. While previous work has presented novel memory architectures, multithreading techniques, and data mapping strategies for reducing energy, consideration to thread generation mechanisms that take into account data locality for this purpose has been limited. This study presents methodologies for the joint partitioning of data and threads to parallelize sequential codes across an innovative heterogeneous multicore processor called the Passive/Active Multicore (PAM) for reducing energy consumption from on-chip data transport and cache access components while also improving execution time. Experimental results show that the design with automatic thread partitioning offered reductions in energy-delay product (EDP) of up to 48%.  相似文献   

20.
As technology improves and transistor feature sizes continue to shrink, the effects of on-chip interconnect wire latencies on processor clock speeds will become more important. In addition, as we reach the limits of instruction-level parallelism that can be extracted from application programs, there will be an increased emphasis on thread-level parallelism. To continue to improve performance, computer architects will need to focus on architectures that can efficiently support thread-level parallelism while minimizing the length of on-chip interconnect wires. The SCMP (Single-Chip Message-Passing) parallel computer system is one such architecture. The SCMP system includes up to 64 processors on a single chip, connected in a 2-D mesh with nearest neighbor connections. Memory is included on-chip with the processors and the architecture includes hardware support for communication and the execution of parallel threads. Since there are no global signals or shared resources between the processors, the length of the interconnect wires will be determined by the size of the individual processors, not the size of the entire chip. Avoiding long interconnect wires will allow the use of very high clock frequencies, which, when coupled with the use of multiple processors, will offer tremendous computational power.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号