Similar literature
 Found 20 similar documents; search took 31 ms
1.
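This work studies three-dimensional shortwave ray-tracing computation under different compilers on a high-performance computer. Based on an analysis of the three-dimensional shortwave ray-tracing computation, serial and parallel versions of the software were designed and implemented, and executables were built with both the GCC and PGI compilers. The PGI-compiled executables ran markedly faster than the GCC-compiled ones.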

2.
This paper addresses embedded multiprocessor implementation of iterative, real-time applications, such as digital signal and image processing, that are specified as dataflow graphs. Scheduling dataflow graphs on multiple processors involves assigning tasks to processors (processor assignment), ordering the execution of tasks within each processor (task ordering), and determining when each task must commence execution. We consider three scheduling strategies: fully-static, self-timed and ordered transactions, all of which perform the assignment and ordering steps at compile time. Run time costs are small for the fully-static strategy; however, it is not robust with respect to changes or uncertainty in task execution times. The self-timed approach is tolerant of variations in task execution times, but pays the penalty of high run time costs, because processors need to explicitly synchronize whenever they communicate. The ordered transactions approach lies between the fully-static and self-timed strategies; in this approach the order in which processors communicate is determined at compile time and enforced at run time. The ordered transactions strategy retains some of the flexibility of self-timed schedules and at the same time has lower run time costs than the self-timed approach. In this paper we determine an order of processor transactions that is nearly optimal given information about task execution times at compile time, and for a given processor assignment and task ordering. The criterion for optimality is the average throughput achieved by the schedule. Our main result is that it is possible to choose a transaction order such that the resulting ordered transactions schedule incurs no performance penalty compared to the more flexible self-timed strategy, even when the higher run time costs implied by the self-timed strategy are ignored.

3.
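MapReduce is a data-processing framework consisting of a parallel programming model and its supporting system. Through well-defined interfaces and a runtime support library, it automatically executes large-scale computing tasks in parallel, and by hiding low-level implementation details it lowers the difficulty of parallel programming. Hadoop is currently the most popular open-source implementation of the MapReduce framework. The article first introduces the MapReduce parallel programming model and the operating principles and mechanisms of Hadoop, then examines in depth how a MapReduce job runs in a Hadoop system.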
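The map, shuffle and reduce stages named in the abstract above can be illustrated with a small single-process sketch. It is not Hadoop code and uses no Hadoop API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration, and the map calls that a real framework would distribute across nodes run sequentially here.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)


def word_count(documents):
    # A real framework would run the map calls in parallel on many nodes;
    # this sketch runs them one after another in a single process.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())
```

Calling `word_count(["a b a", "b c"])` counts words across both input splits; the user writes only the map and reduce functions, which is the sense in which the model hides the parallel machinery.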

4.
Partial reconfiguration (PR) of FPGAs can be used to dynamically extend and adapt the functionality of computing systems by swapping in and out HW tasks. To coordinate the on-demand task execution, we propose and implement a Run-Time System Manager (RTSM) for scheduling software (SW) tasks on available processor(s) and hardware (HW) tasks on any number of reconfigurable regions (RRs) of a partially reconfigurable FPGA. Fed with the initial partitioning of the application into tasks, the corresponding task graph, and the available task mappings, the RTSM controls system operation considering the status of each task and region (e.g. busy, idle, scheduled for reconfiguration/execution, etc.). Our RTSM supports task reuse and configuration prefetching to minimize reconfigurations, task movement among regions to efficiently manage the FPGA area, and region reservation for future reconfiguration and execution. We validate the correctness and portability of our RTSM by executing an image processing application on two Xilinx-based platforms: ZedBoard and XUPV5. We also perform a more extensive evaluation of its features using a simulation framework, and find that, despite the technology limitations, our approach can give promising results in terms of scheduling quality. Since our RTSM also supports the scheduling of parallel SW tasks, we use it to manage the execution of the entire parallel Edge Detection application on a desktop; we compare the application execution time with that using the OpenMP framework and find that with our RTSM execution is 2.4 times faster than the unoptimized OpenMP version. When processor affinity optimization is enabled for OpenMP, our RTSM and OpenMP are on par, indicating that the scheduling efficiency of our RTSM is competitive with this state-of-the-art scheduler, while additionally supporting the management of HW tasks.

5.
The ParaScope parallel programming environment   Total citations: 1, self-citations: 0, citations by others: 1
The ParaScope parallel programming environment, developed to support scientific programming of shared-memory multiprocessors, is described. It includes a collection of tools that use global program analysis to help users develop and debug parallel programs. The focus is on ParaScope's compilation system. The compilation system extends the traditional single-procedure compiler by providing a mechanism for managing the compilation of complete programs. The ParaScope editor brings both compiler analysis and user expertise to bear on program parallelization. The debugging system detects and reports timing-dependent errors, called data races, in execution of parallel programs. A project aimed at extending ParaScope to support programming in FORTRAN D, a machine-independent parallel programming language for use with both distributed-memory and shared-memory parallel computers, is described.

6.
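In cloud computing, task decomposition is an important means of increasing the parallelism of task execution. Existing task decomposition algorithms, when applied to complex decomposition problems, tend to produce overly coarse granularity and to become trapped in local optima. To address these shortcomings, a task decomposition algorithm (Improve Heuristic Algorithm, IHA) is proposed that combines tree-structured problem decomposition with a heuristic strategy. The algorithm first decomposes the task, then uses a formal method to convert the problem into a set of feasible operations, and finally uses an inference engine to dispatch tasks to the solution space for processing. The algorithm was validated by simulation in Cloudsim.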

7.
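Compiler Performance Testing and Analysis Based on Hardware Performance Counters   Total citations: 1, self-citations: 0, citations by others: 1
The performance monitoring unit of the Itanium 2 processor can capture microarchitectural events while a program is running. The SPEC2006 benchmark programs were compiled with different optimization options of the GNU Gcc, Intel Icc and HP-Opencc compilers and then run. Performance data collected through the CPU hardware performance counters (HPCs) were used to understand the characteristics of specific programs, to analyze and compare performance differences between the compilers, and to provide reference data for optimizing the HP-Opencc compiler. The experiments indicate that the branch prediction optimization in the HP-Opencc compiler can be improved further.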

8.
Real-time computer systems are often used in harsh environments, such as aerospace, and in industry. Such systems are subject to many transient faults while in operation. Checkpointing enables a reduction in the recovery time from a transient fault by saving intermediate states of a task in a reliable storage facility, and then, on detection of a fault, restoring from a previously stored state. The interval between checkpoints affects the execution time of the task. Whereas inserting more checkpoints and reducing the interval between them reduces the reprocessing time after faults, checkpoints have associated execution costs, and inserting extra checkpoints increases the overall task execution time. Thus, a trade-off between the reprocessing time and the checkpointing overhead leads to an optimal checkpoint placement strategy that optimizes certain performance measures. Real-time control systems are characterized by a timely, and correct, execution of iterative tasks within deadlines. The reliability is the probability that a system functions according to its specification over a period of time. This paper reports on the reliability of a checkpointed real-time control system, where any errors are detected at the checkpointing time. The reliability is used as a performance measure to find the optimal checkpointing strategy. For a single-task control system, the reliability equation over a mission time is derived using the Markov model. Detecting errors at the checkpointing time makes the reliability jitter with the number of checkpoints. This makes it necessary to apply other search algorithms to find the optimal number of checkpoints. By considering the properties of the reliability jittering, a simple algorithm is provided to find the optimal checkpoints effectively. Finally, the reliability model is extended to include multiple tasks by a task allocation algorithm.

9.
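Speculative multithreading partitions compiler-generated instructions into threads and, on the basis of control-flow and data-flow analysis, automatically parallelizes serial programs. A simulator is an effective means of checking a thread partitioning algorithm: it can verify the correctness of program results, evaluate the speedup of concurrent execution, and thereby indicate how reasonable the partitioning algorithm is. Using run-time statistics collected for the Olden Suite programs on a simulator, this work analyzes the register dependence problem that arises in thread partitioning and discusses, with examples, the main causes of register dependences. Finally, an improved thread partitioning method is proposed to address the register dependence problem.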

10.

Design space exploration of a configurable, heterogeneous system for a given application with required throughput searches for a combination of cores or softcores with different architectures that can be accommodated within the given ASIC or FPGA area and that achieves the required throughput and optimizes power consumption. For a soft real-time streaming application, modeled as a task graph with internally parallelizable streaming tasks, this requires assigning a core type and quantity and DVFS frequency level to each task, which implies task runtime and energy consumption, and mapping and scheduling the tasks, such that the throughput requirement is met. We tightly integrate such static scheduling for stream processing applications with design space exploration of the best heterogeneous core combination, and solve the resulting combined optimization problem by an integer linear program (ILP). We evaluate our solution for different numbers of core types on synthetic and application-based task graphs, and demonstrate improvements of up to 34.8% for ARM big and LITTLE cores, and 70.5% for 3 different core types.


11.
We describe an integrated compile time and run time system for efficient shared memory parallel computing on distributed memory machines. The combined system presents the user with a shared memory programming model. The run time system implements a consistent shared memory abstraction using memory access detection and automatic data caching. The compiler improves the efficiency of the shared memory implementation by directing the run time system to exploit the message passing capabilities of the underlying hardware. To do so, the compiler analyzes shared memory accesses and transforms the code to insert calls to the run time system that provide it with the access information computed by the compiler. The run time system is augmented with the appropriate entry points to use this information to implement bulk data transfer and to reduce the overhead of run time consistency maintenance. In those cases where the compiler analysis succeeds for the entire program, we demonstrate that the combined system achieves performance comparable to that produced by compilers that directly target message passing. If the compiler analysis is successful only for parts of the program, for instance, because of irregular accesses to some of the arrays, the resulting optimizations can be applied to those parts for which the analysis succeeds. If the compiler analysis fails entirely, we rely on the run time maintenance of shared memory and thereby avoid the complexity and the limitations of compilers that directly target message passing. The result is a single system that combines efficient support for both regular and irregular memory access patterns.

12.
In Linux, real-time tasks are supported by separating real-time task priorities from non-real-time task priorities. However, this separation of priority ranges may not be effective when real-time tasks make system calls that are handled by kernel threads. Thus, Linux is considered a soft real-time system. Moreover, kernel threads are configured to have static priorities for the sake of throughput. The static assignment of priorities to kernel threads causes trouble for real-time tasks that require kernel threads to be invoked to handle their system calls, because kernel threads do not discriminate between real-time and non-real-time tasks. We present a dynamic kernel thread scheduling mechanism with a weighted average priority inheritance protocol (PIP), a variation of the PIP. The scheduling algorithm assigns proper priorities to kernel threads at runtime by monitoring the activities of user-level real-time tasks. Experimental results show that the algorithm can greatly reduce the unexpected execution latency of real-time tasks.

13.
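In many-core systems, parallel tasks must be mapped to processors before they execute, a process known as task mapping. Because task mapping algorithms have a large effect on chip performance, many-core task mapping has become an active research topic in recent years. This paper surveys existing task mapping algorithms by system architecture (such as two-dimensional and three-dimensional many-core systems) and by optimization objective (such as communication overhead, power consumption and temperature), and discusses future trends in task mapping algorithms.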

14.
Fang Weiwei, Ding Shuai, Li Yangyang, Zhou Wenchen, Xiong Naixue. Wireless Networks, 2019, 25(5): 2851-2867

To cope with the computational and energy constraints of mobile devices, Mobile Edge Computing (MEC) has recently emerged as a new paradigm that provides IT and cloud-computing services at the mobile network edge in close proximity to mobile devices. This paper investigates the energy consumption problem for mobile devices in a multi-user MEC system with different types of computation tasks, random task arrivals, and unpredictable channel conditions. By jointly considering computation task scheduling, CPU frequency scaling, transmit power allocation and subcarrier bandwidth assignment, we formulate it as a stochastic optimization problem that aims to minimize the power consumption of mobile devices while maintaining the long-term stability of task queues. By leveraging the Lyapunov optimization technique, we propose an online control algorithm (OKRA) to solve the formulation. We prove that this algorithm is able to provide a deterministic worst-case latency guarantee for latency-sensitive computation tasks, and to strike a desirable tradeoff between power consumption and system stability by appropriately tuning the control parameter. Extensive simulations are carried out to verify the theoretical analysis and to illustrate the impact of critical parameters on algorithm performance.


15.
Mobile cloud computing (MCC) is an emerging technology to facilitate complex application execution on mobile devices. Mobile users are motivated to implement various tasks using their mobile devices for great flexibility and portability. However, such advantages are challenged by the limited battery life of mobile devices. This paper presents Cuckoo, a scheme of flexible compute-intensive task offloading in MCC for energy saving. Cuckoo seeks to balance the key design goals: maximize energy saving (technical feasibility) and minimize the impact on user experience with limited cost for offloading (realistic feasibility). Specifically, using a combination of static analysis and dynamic profiling, compute-intensive tasks are marked offline at fine granularity in the mobile application code. According to the network transmission technologies supported by the mobile device and the runtime network conditions, an online "task-bundled" strategy offloads these tasks to the MCC. In the task-hosted stage, we propose a skyline-based online resource scheduling strategy to satisfy the realistic feasibility of MCC. In addition, we adopt resource reservation to reduce the extra energy consumption caused by the task multi-offloading phenomenon. Further, we evaluate the performance of Cuckoo using real-life data sets on our MCC testbed. Our extensive experiments demonstrate that Cuckoo is able to balance energy consumption and execution performance. Copyright © 2016 John Wiley & Sons, Ltd.

16.
17.
The paper considers grid computing systems in which the resource management system (RMS) can divide service tasks into execution blocks (EBs) and send these blocks to different resources. To provide a desired level of service reliability, the RMS can assign the same EB to several independent resources for parallel (redundant) execution. With an optimal schedule for partitioning the service task and distributing it among resources, one can achieve the greatest possible expected service performance (i.e. least execution time), or reliability. For solving this optimization problem, the paper suggests an algorithm that is based on graph theory, a Bayesian approach, and an evolutionary optimization approach. A virtual tree-structure model is constructed in which failure correlation in common communication channels is taken into account. Illustrative examples are presented.

18.
Elastic DVS Management in Processors With Discrete Voltage/Frequency Modes   Total citations: 1, self-citations: 0, citations by others: 1
Applying classical dynamic voltage scaling (DVS) techniques to real-time systems running on processors with discrete voltage/frequency modes causes a waste of computational resources. In fact, whenever the ideal speed level computed by the DVS algorithm is not available in the system, to guarantee the feasibility of the task set, the processor speed must be set to the nearest level greater than the optimal one, thus underutilizing the system. Whenever the task set allows a certain degree of flexibility in specifying timing constraints, rate adaptation techniques can be adopted to balance performance (which is a function of task rates) versus energy consumption (which is a function of the processor speed). In this paper, we propose a new method that combines discrete DVS management with elastic scheduling to fully exploit the available computational resources. Depending on the application requirements, the algorithm can be set to improve performance or reduce energy consumption, thus enhancing the flexibility of the system. A reclaiming mechanism is also used to take advantage of early completions. To make the proposed approach usable in real-world applications, the task model is enhanced to consider some of the real CPU characteristics, such as discrete voltage/frequency levels, switching overhead, task execution times nonlinear with the frequency, and tasks with different power consumption. Implementation issues and experimental results for the proposed algorithm are also discussed.
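The rounding-up step described in the abstract above, and the unused capacity it leaves, can be sketched as follows. This is only an illustration of that one step, not the elastic method the paper proposes. Speeds are normalized to the range 0 to 1, and the helper names (`pick_discrete_speed`, `wasted_capacity`) and the example level set are assumptions made for this sketch.

```python
import bisect


def pick_discrete_speed(levels, ideal):
    """Return the lowest available speed level >= ideal.

    Returns None when even the fastest level is below the ideal speed,
    i.e. when no available level keeps the task set feasible.
    """
    ordered = sorted(levels)
    index = bisect.bisect_left(ordered, ideal)
    return ordered[index] if index < len(ordered) else None


def wasted_capacity(levels, ideal):
    """Speed margin left unused because the ideal level does not exist."""
    chosen = pick_discrete_speed(levels, ideal)
    return None if chosen is None else chosen - ideal


# Hypothetical processor with four normalized speed levels.
LEVELS = [0.25, 0.5, 0.75, 1.0]
```

With these hypothetical levels, an ideal speed of 0.6 is rounded up to 0.75, leaving a margin of about 0.15 that the task set does not need; that margin is the underutilization the abstract refers to.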

19.
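Thread-Level Speculation (TLS) is a thread-level automatic parallelization technique for accelerating serial programs on multicore processors. Loops have a regular structure and account for a large share of execution time, which makes them ideal candidates for extracting parallelism. However, deciding which loops to parallelize in order to improve program speedup is difficult. To solve this problem, the paper proposes a loop selection method based on performance prediction. Profiling information from a pre-execution of the program on a training input set is combined with various speculation factors to build a performance prediction model for loop structures. The prediction quantitatively estimates the speedup of speculatively parallelizing a loop and determines whether the loop is suitable for parallel execution at run time. Experimental results show that the proposed method effectively predicts the parallelism available in loops and, based on the prediction, accurately selects loops that benefit from speculative parallelization; the speedup on the Olden benchmark suite improves by 12.34% on average.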

20.

Whilst FPGAs have been used in cloud ecosystems, it is still extremely challenging to achieve high compute density when mapping heterogeneous multi-tasks on shared resources at runtime. This work addresses this by treating the FPGA resource as a service and employing multi-task processing at the high level, design space exploration and static off-line partitioning in order to allow more efficient mapping of heterogeneous tasks onto the FPGA. In addition, a new, comprehensive runtime functional simulator is used to evaluate the effect of various spatial and temporal constraints on both the existing and new approaches when varying system design parameters. A comprehensive suite of real high performance computing tasks was implemented on a Nallatech 385 FPGA card; the results show that our approach can provide on average 2.9× and 2.3× higher system throughput for compute and mixed intensity tasks, while 0.2× lower for memory intensive tasks due to external memory access latency and bandwidth limitations. The work has been extended by introducing a novel scheduling scheme to enhance temporal utilization of resources when using the proposed approach. Additional results for large queues of mixed intensity tasks (compute and memory) show that the proposed partitioning and scheduling approach can provide higher than 3× system speedup over previous schemes.


