Similar literature
20 similar documents found (search time: 15 ms)
1.
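To address the problems that the superword-level parallelism (SLP) algorithm cannot effectively handle large programs in which the proportion of parallel code is small, and that vectorizable code may contain code that is detrimental to vectorization, an improved SLP algorithm, NSLPO, is proposed. First, non-isomorphic statements that cannot be vectorized are transformed into isomorphic ones, locating the vectorization opportunities that SLP misses. Next, a maximum common subgraph is built by adding redundant nodes, and optimizations such as redundancy elimination yield the supplementary SLP graph after isomorphization, increasing the parallelism of the code. Finally, throttling is used to exclude code that is harmful to vectorization, so that such code is processed only in scalar form; vectorizing only the code that benefits from vectorization improves program efficiency as much as possible. Experiments on a widely used set of kernel benchmarks show that NSLPO outperforms SLP, reducing execution time by 9.1% on average.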

2.
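To address task scheduling across heterogeneous multi-core processors and better exploit the advantages of such platforms, this paper proposes duplicating tasks that are related to one another but reside on different processors, so that each heterogeneous multi-core processor can execute its tasks independently, reducing communication overhead between processors. A hybrid particle swarm optimization (HPSO) algorithm then schedules the tasks, which prevents any one processor from being assigned so many tasks that the system cannot produce results promptly and accurately. Experiments show that, compared with traditional heuristic allocation schemes and the common genetic algorithm (GA), the allocation scheme based on task duplication combined with HPSO has better solving ability and yields schedules with shorter execution times, giving it good practical value.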

3.
Multi-threaded programs on shared-memory hardware tend to be non-deterministic, which brings challenges to software debugging and testing. Current deterministic implementations eliminate nondeterminism of multi-threaded programs by trading much parallelism for determinism, which leads to low performance. Researchers typically improve parallelism by weakening determinism or introducing weak memory consistency models. However, weak determinism cannot deal with non-determinism caused by data races, which are very common in multi-threaded programs. Weak memory consistency models reduce programming productivity and may cause correctness problems for legacy programs. To address these problems, this paper presents a fully parallelized deterministic runtime, FPDet, which exploits parallelism of deterministic multi-threaded programs while preserving strong determinism and sequential memory consistency. FPDet creates a Working Set Memory (WSM) for each thread so that threads run independently for parallelism. FPDet guarantees determinism by redistributing memory blocks among threads' WSMs at specified synchronization points. As a result, FPDet obtains parallelism and determinism simultaneously. To further exploit parallelism, we propose an Adaptive Budget Adjustment (ABA) mechanism to minimize wait time caused by thread synchronization.

4.
While recognition of the advantages of heterogeneous computing is steadily growing, the issues of programmability and portability hinder its exploitation. The introduction of the OpenCL standard was a major step forward in that it provides code portability, but its interface is even more complex than that of other approaches. In this paper, we present the Heterogeneous Programming Library (HPL), which permits the development of heterogeneous applications addressing both portability and programmability while not sacrificing high performance. This is achieved by means of an embedded language and data types provided by the library, with which generic computations to be run on heterogeneous devices can be expressed. A comparison in terms of programmability and performance with OpenCL shows that both approaches offer very similar performance, while outlining the programmability advantages of HPL.

5.
In this paper, we deal with multiprocessor task scheduling with ready times and prespecified processor allocation. We consider an on‐line scenario where tasks arrive over time, and, at any point in time, the scheduler only has knowledge of the released tasks. An application of this problem arises in wavelength division multiplexing broadcasting where the main future will be in the so‐called one‐to‐many transmission. We propose algorithms to find lower bounds of the minimum makespan, and present experiments on various scenarios.

6.
A new method of exploiting parallelism between statements is presented. The method decomposes a process into separate sequential processes that are connected through queues of message buffers. The conditions for decomposition are analyzed, and a decomposition algorithm is developed. PL/I is used to describe processes.
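
The idea of sequential processes connected through queues of message buffers can be illustrated with a minimal Python sketch. It is not the decomposition algorithm from the abstract (which is described for PL/I); the names `stage` and `run_pipeline` and the `buffer_size` parameter are hypothetical, and the stages are assumed to be independent single-argument functions.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the message stream


def stage(func, inbox, outbox):
    """One sequential process: read a message, transform it, pass it on."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # propagate shutdown to the next stage
            return
        outbox.put(func(item))


def run_pipeline(funcs, items, buffer_size=4):
    """Connect the stages with bounded queues and collect the final output."""
    # Intermediate queues are bounded message buffers; the last queue is
    # unbounded so the feeding thread never blocks on undrained results.
    queues = [queue.Queue(maxsize=buffer_size) for _ in funcs] + [queue.Queue()]
    workers = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for w in workers:
        w.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(_DONE)
    for w in workers:
        w.join()
    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            return results
        results.append(item)


# Two statements that would run back to back now overlap across messages.
print(run_pipeline([lambda x: x + 1, lambda x: x * 2], range(5)))
```

Each stage stays strictly sequential, so message order is preserved; the parallelism comes only from different stages working on different messages at the same time.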

7.
Embedded applications are becoming increasingly complex and processing ever-increasing datasets. In the context of data-intensive embedded applications, there have been two complementary approaches to enhancing application behavior, namely, data locality optimizations and improving loop-level parallelism. Data locality needs to be enhanced to maximize the number of data accesses satisfied from the higher levels of the memory hierarchy. On the other hand, compiler-based code parallelization schemes require a fresh look for chip multiprocessors as interprocessor communication is much cheaper than off-chip memory accesses. Therefore, a compiler needs to minimize the number of off-chip memory accesses. This can be achieved by considering multiple loop nests simultaneously. Although compilers address these two problems, there is an inherent difficulty in optimizing both data locality and parallelism simultaneously. Therefore, an integrated approach that combines these two can generate much better results than each individual approach. Based on these observations, this paper proposes a constraint network (CN)-based formulation for data locality optimization and code parallelization. The paper also presents experimental evidence, demonstrating the success of the proposed approach, and compares our results with those obtained through previously proposed approaches. The experiments from our implementation indicate that the proposed approach is very effective in enhancing data locality and parallelization.

8.
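A design for a millimeter-wave radar data acquisition and processing system is proposed. The system's memory structure, the composition of its data communication channels, and its bus structure are analyzed; algorithm partitioning and the mapping and scheduling of algorithms onto multiple processors are discussed; and system performance is tested with two typical algorithms. Experimental results show that the system reaches the expected level in both processing capability and real-time performance.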

9.
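Lan Zhou, Sun Shixin 《计算机学报》 (Chinese Journal of Computers) 2007,30(3):454-462
Multiprocessor scheduling is a key factor in system performance, and scheduling algorithms based on task duplication are a relatively effective way to solve it. This paper analyzes several typical task-duplication-based algorithms and proposes a multiprocessor task allocation algorithm based on dynamic critical tasks (DCT). Focusing on overcoming the shortcomings of greedy algorithms, DCT dynamically computes task time parameters during scheduling, accurately identifies each processor's critical task, optimizes the schedule around the critical tasks, and improves the schedule step by step until it obtains the optimal scheduling result. Analysis and experiments show that DCT outperforms other existing algorithms of the same kind.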

10.
David M. Rogers 《Software》2023,53(1):99-114
Runtime scheduling and workflow systems are an increasingly popular algorithmic component in HPC because they allow full system utilization with relaxed synchronization requirements. There are so many special-purpose tools for task scheduling, one might wonder why more are needed. Use cases seen on the Summit supercomputer needed better integration with MPI and greater flexibility in job launch configurations. Preparation, execution, and analysis of computational chemistry simulations at the scale of tens of thousands of processors revealed three distinct workflow patterns. A separate job scheduler was implemented for each one using extremely simple and robust designs: file-based, task-list based, and bulk-synchronous. Comparison with existing methods shows unique benefits of this work, including simplicity of design, suitability for HPC centers, short startup time, and well-understood per-task overhead. All three new tools have been shown to scale to full utilization of Summit, and have been made publicly available with tests and documentation. This work presents a complete characterization of the minimum effective task granularity for efficient scheduler usage scenarios. These schedulers have the same bottlenecks, and hence similar task granularities, as those reported for existing tools following comparable paradigms.

11.
The hybrid flow-shop scheduling problem with multiprocessor tasks finds its applications in real-time machine-vision systems among others. Motivated by this application and the computational complexity of the problem, we propose a genetic algorithm in this paper. We first describe the implementation details, which include a new crossover operator. We then perform a preliminary test to set the best values of the control parameters, namely the population size, crossover rate and mutation rate. Next, given these values, we carry out an extensive computational experiment to evaluate the performance of four versions of the proposed genetic algorithm in terms of the percentage deviation of the solution from the lower bound value. The results of the experiments demonstrate that the genetic algorithm performs the best when the new crossover operator is used along with the insertion mutation. This genetic algorithm also outperforms the tabu search algorithm proposed in the literature for the same problem.

12.
To simplify the task of building distributed streaming applications, we propose a new abstraction for information flow – Infopipes. Infopipes make information flow primary, not an auxiliary mechanism that is hidden away. Systems are built by connecting predefined component Infopipes such as sources, sinks, buffers, filters, broadcasting pipes, and multiplexing pipes. The goal of Infopipes is not to hide communication, like an RPC system, but to reify it: to represent communication explicitly as objects that the program can interrogate and manipulate. Moreover, these objects represent communication in application-level terms, not in terms of network or process implementation.
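
To make the notion of reified information flow concrete, here is a small Python sketch in which pipeline elements are ordinary objects that can be connected and then inspected. It is only an illustration of the concept: the class names `Infopipe`, `Filter`, and `Sink`, the `connect` and `push` methods, and the `items_seen` counter are all hypothetical and are not taken from the Infopipes system itself.

```python
class Infopipe:
    """A pipeline element that can be connected to another and inspected."""

    def __init__(self, name):
        self.name = name
        self.downstream = None
        self.items_seen = 0  # queryable state about the flow through this pipe

    def connect(self, other):
        """Attach `other` downstream and return it, so calls can be chained."""
        self.downstream = other
        return other

    def push(self, item):
        """Accept one item and forward whatever `process` produces."""
        self.items_seen += 1
        for out in self.process(item):
            if self.downstream is not None:
                self.downstream.push(out)

    def process(self, item):
        """Default behaviour: pass the item through unchanged."""
        yield item


class Filter(Infopipe):
    """Forwards only the items for which `keep(item)` is true."""

    def __init__(self, name, keep):
        super().__init__(name)
        self.keep = keep

    def process(self, item):
        if self.keep(item):
            yield item


class Sink(Infopipe):
    """Terminal element that stores everything it receives."""

    def __init__(self, name):
        super().__init__(name)
        self.received = []

    def process(self, item):
        self.received.append(item)
        return ()  # nothing to forward


source = Infopipe("source")
evens = Filter("evens", lambda x: x % 2 == 0)
sink = Sink("sink")
source.connect(evens).connect(sink)
for i in range(6):
    source.push(i)
print(sink.received, evens.items_seen, sink.items_seen)
```

Because each connection is an object, the program can ask any element how much traffic it has handled, which is the kind of interrogation the abstract contrasts with RPC-style hiding of communication.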

13.
The hardware, system software, and scientific application examples with relative performances are reported for the ICAP-1, ICAP-2, ICAP-3 and ICAP-3090 experimental systems, with emphasis on motivation, strategy and accomplishments. These pioneering efforts are considered in the light of future large-scale computational applications, which will require parallel supercomputing power and also a new outlook on computational models.

14.
In this paper, we analyze the recurrences from the breakability of the dependence links formed in general multi-statements in a nested loop. The major findings include: (1) A sink variable renaming technique, which can reposition an undesired anti-dependence and/or output-dependence link, is capable of breaking an anti-dependence and/or output-dependence link. (2) For recurrences connected by only true dependences, a dynamic dependence concept and the derived technique are powerful in terms of parallelism exploitation. (3) By the employment of global dependence testing, a link-breaking strategy, Tarjan's depth-first search algorithm, and a topological sort, an algorithm for resolving a general multi-statement recurrence in a nested loop is proposed. Experiments with benchmarks taken from Vector loops showed that among 134 subroutines tested, 3 had their parallelism exploitation amended by our proposed method. That is, our algorithm increased the rate of parallelism exploitation of Vector loops by approximately 2.24%.
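
Two of the ingredients named in finding (3), Tarjan's depth-first search and topological sorting, can be sketched in Python as follows. This is not the paper's algorithm: it omits dependence testing and link breaking entirely, and the graph encoding (a dict mapping each statement to the statements that depend on it) and the names `tarjan_scc` and `schedule` are assumptions made for illustration.

```python
def tarjan_scc(graph):
    """Return strongly connected components in reverse topological order."""
    index, low = {}, {}
    stack, on_stack = [], set()
    sccs = []
    counter = 0

    def visit(v):
        nonlocal counter
        index[v] = low[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of a component
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            sccs.append(sorted(comp))

    for v in graph:
        if v not in index:
            visit(v)
    return sccs


def schedule(graph):
    """Order components so each one follows the components it depends on."""
    # Tarjan emits sink components first, so reversing gives a topological order.
    return list(reversed(tarjan_scc(graph)))


# Hypothetical dependence graph: S2 and S3 form a recurrence (a cycle).
deps = {"S1": ["S2"], "S2": ["S3"], "S3": ["S2", "S4"], "S4": []}
print(schedule(deps))  # [['S1'], ['S2', 'S3'], ['S4']]
```

A component with more than one statement marks a recurrence that must run serially unless a dependence link inside it can be broken; single-statement components without self-dependences are candidates for parallel execution.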

15.
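For real-time periodic tasks with constraint relations on multi-core embedded platforms, a scheduling algorithm based on task criticality factor and deadline, BVDS (Based on Value and Deadline Scheduling), is proposed. Guided by the principle of using processors effectively, the algorithm allocates processor resources, according to each processor's actual running state, to tasks that may complete before their deadlines. The algorithm works in two phases: the first phase builds a linked list of waiting tasks from each task's arrival time, criticality factor, and execution time; the second phase, during execution, assigns priorities by taking full account of the execution times of different tasks and the constraint relations between tasks. Experimental results show that, at the cost of a small amount of processor utilization, the algorithm effectively reduces the deadline miss rate.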

16.
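Segment-constrained path optimization algorithm for superword-level parallelism vector exploitation   Cited by: 1 (self-citations: 0; citations by others: 1)
Superword-level parallelism (SLP) is a vector parallelism exploitation method that targets basic blocks. Combined with loop unrolling it can uncover more parallelism, but it also produces too many exploitation paths. To address this, a segment-constrained SLP path optimization algorithm is proposed. Segment-wise redundancy elimination ensures that the segments remain isomorphic after redundancy is removed; SLP exploitation across segments constrains the exploitation paths; finally, pack adjustment handles overlapping memory accesses. Experimental results show that the method effectively strengthens SLP vectorization, with an average vectorization speedup close to 2 on the test programs.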

17.
Ming Hsiang Huang  Wuu Yang 《Software》2020,50(10):1877-1904
OpenACC is a directive-based programming model which allows programmers to write graphics processing unit (GPU) programs by simply annotating parallel loops. However, OpenACC has poor support for irregular nested parallel loops, which are natural choices to express nested parallelism. We propose PFACC, a programming model similar to OpenACC. PFACC directives can be used to annotate parallel loops and to guide data movement between different levels of the memory hierarchy. Parallel loops can be arbitrarily nested or be placed inside functions that would be (possibly recursively) called in other parallel loops. The PFACC translator translates C programs with PFACC directives into CUDA programs by inserting runtime iteration-sharing and memory allocation routines. The PFACC runtime iteration-sharing routine is a two-level mechanism. Thread blocks dynamically organize loop iterations into batches and execute the batches in a depth-first order. Different thread blocks share iterations among one another with an iteration-stealing mechanism. PFACC generates CUDA programs with reasonable memory usage because of the depth-first execution order. The two-level iteration-sharing mechanism is implemented purely in software and fits well with the CUDA thread hierarchy. Experiments show that PFACC outperforms CUDA dynamic parallelism in terms of performance and code size on most benchmarks.

18.
This paper presents the theoretical basis of a proof procedure that allows a high degree of parallel processing. The theoretical method is based upon Prawitz's improved proof procedure and Robinson's unification algorithm. The input to the method is a set of clauses (or alternatively, well-formed formulae). The output of the method is the solution to the problem, if it exists. To overcome the inefficiency of the theoretical approach, we outline the main steps of a practical proof procedure. Besides parallel processing, the proof procedure works with bit manipulation rather than symbol manipulation as found in most of the existing proof mechanisms.

19.
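As multiprocessor systems continue to grow in scale, saving energy has become an important and pressing problem. To this end, an online energy-saving real-time scheduling algorithm for random tasks on multiprocessor systems is proposed. Using statistical methods, it estimates from the arrival times and computation amounts of existing tasks the voltage/frequency at which a new task should run on an idle processor, so that tasks that have not yet arrived can meet their deadlines while saving energy. For the tasks executed on a single processor, it computes the average voltage/frequency required to execute them so that the execution speeds of all tasks are as balanced as possible; when some tasks cannot meet their deadlines, it raises the voltage/frequency of the tasks not yet executed. Experimental results show that, compared with the EDF, HVEA, MEG, and ME-MC algorithms, the algorithm has clear advantages in meeting deadlines and saving energy.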
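
The basic trade-off behind such schemes, running as slowly as the deadline allows, can be shown with a short Python sketch. It is not the statistical estimation described in the abstract; the function name `pick_frequency`, the discrete frequency levels, and the units (millions of cycles, seconds, MHz) are assumptions chosen for illustration.

```python
def pick_frequency(cycles, time_left, levels):
    """Return the lowest available frequency that still meets the deadline.

    cycles    -- remaining work, in millions of cycles (assumed unit)
    time_left -- time until the deadline, in seconds
    levels    -- available frequency settings, in MHz (hypothetical values)
    """
    if time_left <= 0:
        return None  # the deadline has already passed
    needed = cycles / time_left  # minimum average speed in MHz
    for freq in sorted(levels):
        if freq >= needed:
            return freq
    return None  # even the highest level would miss the deadline


levels = [400, 600, 800, 1000]
print(pick_frequency(300, 0.5, levels))  # needs 600 MHz, so 600 is chosen
print(pick_frequency(300, 0.2, levels))  # needs 1500 MHz, so None
```

When the function returns `None` for the current setting, a scheduler in the spirit of the abstract would have to raise the frequency of the tasks not yet executed or accept a missed deadline.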

20.
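A new super-instruction task invocation method is introduced. It is based on a single-chip multiprocessor architecture consisting of one master processor and three slave processors. In this system, end users do not need to master the complex, low-level specialized algorithms of the application domain; algorithm engineers can integrate those algorithms into the slave processors in advance, and end users only need to program the master processor. When an application on the master processor needs to call an algorithm routine, it simply sends a super instruction to the corresponding slave processor, which then completes the routine independently. This invocation method greatly simplifies multiprocessor task invocation and improves the parallel working capability of the whole system.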
