期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

A task graph execution manager for reconfigurable multi-tasking systems

Juan Antonio Clemente Carlos González Javier Resano Daniel Mozos 《Microprocessors and Microsystems》2010,34(2-4):73-83

Reconfigurable hardware can be used to build multi-tasking systems that dynamically adapt themselves to the requirements of the running applications. This is especially useful in embedded systems, since the available resources are very limited and the reconfigurable hardware can be reused for different applications. In these systems computations are frequently represented as task graphs that are executed taking into account their internal dependencies and the task schedule. The management of the task graph execution is critical for the system performance. In this regard, we have developed two different versions, a software module and a hardware architecture, of a generic task graph execution manager for reconfigurable multi-tasking systems. The second version reduces the run-time management overheads by almost two orders of magnitude. Hence it is especially suitable for systems with exigent timing constraints. Both versions include specific support to optimize the reconfiguration process. 相似文献

2.

Efficient datapath merging for the overhead reduction of run-time reconfigurable systems

Mahmood Fazlali Ali Zakerolhosseini Georgi Gaydadjiev 《The Journal of supercomputing》2012,59(2):636-657

High latencies in FPGA reconfiguration are known as a major overhead in run-time reconfigurable systems. This overhead can be reduced by merging multiple data flow graphs representing different kernels of the original program into a single (merged) datapath that will be configured less often compared to the separate datapaths scenario. However, the additional hardware introduced by this technique increases the kernels execution time. In this paper, we present a novel datapath merging technique that reduces both the configuration and execution times of kernels mapped on the reconfigurable fabric. Experimental results show up to 13% reduction in the configuration and execution times of kernels from media-bench workloads, compared to previous art on datapath merging. When compared to conventional high-level synthesis algorithms, our proposal reduces kernels configuration and execution times by up to 48%. 相似文献

3.

Expressing embedded systems configurations at high abstraction levels with UML MARTE profile: Advantages,limitations and alternatives

Imran Rafiq Quadri Abdoulaye Gamatié Pierre Boulet Samy Meftali Jean-Luc Dekeyser 《Journal of Systems Architecture》2012,58(5):178-194

Embedded systems have become an essential aspect of our professional and personal lives. From avionics, transport and telecommunication systems to general commercial appliances such as smart phones, high definition TVs and gaming consoles; it is difficult to find a domain where these systems have not made their mark. Moreover, Systems-on-Chips (SoCs) which are considered as an integral solution for designing embedded systems, offer advantages such as run-time reconfiguration that can change system configurations during execution, depending upon Quality-of-Service (QoS) criteria such as performance and energy levels. This article deals with aspects related to modeling of these configurations, useful for describing various states of an embedded system, from both structural and operational viewpoints. Our proposal adapts a high abstraction level approach based on the principles of Model-Driven Engineering (MDE) and takes into account the UML MARTE profile for modeling of real-time and embedded systems. Elevating the design abstraction levels help to increase design productivity and achieve execution platform independence, among other advantages. The article details the current proposition of configurations in MARTE via some examples, and points out the advantages as well as some limitations, mainly concerning the semantic aspects of the defined concepts. Finally, we report our experiences on the modeling of an alternate notion of configurations and execution modes within the MARTE compliant Gaspard2 SoC Co-Design framework that has been successful for the design as well as implementation of FPGA based SoCs. 相似文献

4.

片上多核处理器容软错误执行模型

龚锐戴葵王志英《计算机学报》2008,31(11)

随着工艺的进步,微处理器将面临越来越严重的软错误威胁.文中提出了两种片上多核处理器容软错误执行模型:双核冗余执行模型DCR和三核冗余执行模型TCR.DCR在两个冗余的内核上以一定的时间间距运行两份相同的线程,store指令只有在进行了结果比较以后才能提交.每个内核增加了硬件实现的现场保存与恢复机制,以实现对软错误的恢复.文中选择的现场保存点有利于隐藏现场保存带来的时间开销,并且采用了特殊的机制保证恢复执行和原始执行过程中load数据的一致性.TCR执行模型通过在3个不同的内核上运行相同的线程实现对软错误的屏蔽.在检测到软错误以后,TCR可以进行动态重构,屏蔽被软错误破坏的内核.实验结果表明,与传统的软错误恢复执行模型CRTR相比,DCR和TCR对核间通信带宽的需求分别降低了57.5%和54.2%.在检测到软错误的情况下,DCR的恢复执行带来5.2%的性能开销,而TCR的重构带来的性能开销为1.3%.错误注入实验表明,DCR能够恢复99.69%的软错误,而TCR实现了对SEU(Single Event Upset)型故障的全面屏蔽. 相似文献

5.

Overhead Analysis of Scientific Workflows in Grid Environments

Prodan R. Fahringer T. 《Parallel and Distributed Systems, IEEE Transactions on》2008,19(3):378-393

Scientific workflows are a topic of great interest in the grid community that sees in the workflow model an attractive paradigm for programming distributed wide-area grid infrastructures. Traditionally, the grid workflow execution is approached as a pure best effort scheduling problem that maps the activities onto the grid processors based on appropriate optimization or local matchmaking heuristics such that the overall execution time is minimized. Even though such heuristics often deliver effective results, the execution in dynamic and unpredictable grid environments is prone to severe performance losses that must be understood for minimizing the completion time or for the efficient use of high-performance resources. In this paper, we propose a new systematic approach to help the scientists and middleware developers understand the most severe sources of performance losses that occur when executing scientific workflows in dynamic grid environments. We introduce an ideal model for the lowest execution time that can be achieved by a workflow and explain the difference to the real measured grid execution time based on a hierarchy of performance overheads for grid computing. We describe how to systematically measure and compute the overheads from individual activities to larger workflow regions and adjust well-known parallel processing metrics to the scope of grid computing, including speedup and efficiency. We present a distributed online tool for computing and analyzing the performance overheads in real time based on event correlation techniques and introduce several performance contracts as quality-of-service parameters to be enforced during the workflow execution beyond traditional best effort practices. We illustrate our method through postmortem and online performance analysis of two real-world workflow applications executed in the Austrian grid environment. 相似文献

6.

PartialRC: A Partial Recomputing Method for Efficient Fault Recovery on GPGPUs

下载免费PDF全文

徐新海杨学军薛京灵林宇斐林一松《计算机科学技术学报》2012,27(2):240-255

GPGPUs are increasingly being used to as performance accelerators for HPC (High Performance Computing) applications in CPU/GPU heterogeneous computing systems, including TianHe-1A, the world’s fastest supercomputer in the TOP500 list, built at NUDT (National University of Defense Technology) last year. However, despite their performance advantages, GPGPUs do not provide built-in fault-tolerant mechanisms to offer reliability guarantees required by many HPC applications. By analyzing the SIMT (single-instruction, multiple-thread) characteristics of programs running on GPGPUs, we have developed PartialRC, a new checkpoint-based compiler-directed partial recomputing method, for achieving efficient fault recovery by leveraging the phenomenal computing power of GPGPUs. In this paper, we introduce our PartialRC method that recovers from errors detected in a code region by partially re-computing the region, describe a checkpoint-based faulttolerance framework developed on PartialRC, and discuss an implementation on the CUDA platform. Validation using a range of representative CUDA programs on NVIDIA GPGPUs against FullRC (a traditional full-recomputing Checkpoint-Rollback-Restart fault recovery method for CPUs) shows that PartialRC reduces significantly the fault recovery overheads incurred by FullRC, by 73.5% when errors occur earlier during execution and 74.6% when errors occur later on average. In addition, PartialRC also reduces error detection overheads incurred by FullRC during fault recovery while incurring negligible performance overheads when no fault happens. 相似文献

7.

Tuning the victim selection policy of Intel TBB

《Journal of Systems Architecture》2015,61(10):584-591

The wide adoption of Chip Multiprocessors (CMPs) in almost all ICT segments has triggered a change in the way software needs to be developed. Parallel programming maximizes the performance and energy efficiency of CMPs, but also comes with a new set of challenges. Parallelization overheads can account for sub-linear speedups and can increase the energy consumption of applications. In past experiments we looked at specific operations such as spawning new tasks, dequeuing the task queue and task stealing for Intel TBB. Our results showed that failed steals account for the largest overhead. In this work, we focus on TBB’s victim selection policy. We implement a new occupancy-aware policy and we improve the implementation of the pseudo-random policy we proposed in a previous paper. We compare the results of our new policies against an “oracle scheme” as well as against TBB’s random victim selection approach. Our results show improvements in execution times and energy-efficiency of up to 11.23% and 14.72% respectively when compared to TBB’s default policy. 相似文献

8.

Efficient exception handling in Java bytecode-to-C ahead-of-time compiler for embedded systems

《Computer Languages, Systems and Structures》2008,34(4):170-183

One of the most promising approaches to Java acceleration in embedded systems is a bytecode-to-C ahead-of-time compiler (AOTC). It improves the performance of a Java virtual machine (JVM) by translating bytecode into C code, which is then compiled into machine code via an existing C compiler. One important design issue in AOTC is efficient exception handling. Since the excepting point and the exception handler may locate in different methods on a call stack, control transfer between them should be streamlined, while an exception would be an “exceptional” event, so it should not slow down normal execution paths. Previous AOTCs often employed a technique called stack cutting based on a setjmp()/longjmp() pair, which we found is involved with too much performance overheads. Also, when the AOTC and the interpreter are employed concurrently (e.g., some methods are AOTCed while other methods are interpreted), the performance of normal execution paths is affected more seriously. This paper proposes a simpler solution based on an exception check after each method call, merged with garbage collection check for reducing its overhead. Our evaluation results on SPECjvm98 on Sun's CVM indicate that our technique can improve the performance of stack cutting by more than 25%. A similar performance benefit can be noted on a hybrid execution environment of both the AOTC and the interpreter. 相似文献

9.

基于动态可重构技术的装备测试系统研究

下载免费PDF全文

罗东明马跃董旭连文涛《计算机测量与控制》2023,31(3):83-88

当前国内自动测试系统存在实时性差、测试资源冗余、成本高等问题,针对以上问题,提出了基于FPGA部分动态重构技术的自动测试系统,该系统基于FPGA动态可重构技术并结合嵌入式操作系统实现测试资源的动态管理,并开发了用于测试过程的硬件自动测试任务编程模型,提出了一种用于重构任务加载的ICAP控制器;该系统实现测试过程的并发执行,从而增强自动测试系统测试的实时性,进而提高测试的准确性与覆盖性。在验证试验中,将动态重构测试系统应用于自动测试实例中,试验结果表明硬件重构测试任务加载正常,各测试资源功能执行正确相似文献

10.

Configuration Reusing in On-Line Task Scheduling for Reconfigurable Computing Systems

下载免费PDF全文

Maisam Mansub Bassiri Hadi Shahriar Shahhoseini 《计算机科学技术学报》2011,26(3):463-473

Reconfigurable computing systems can be reconfigured at runtime and support partial reconfigurability which makes us able to execute tasks in a true multitasking manner.To manage such systems at runtime,a reconfigurable operating system is needed.The main part of this operating system is resource management unit which performs on-line scheduling and placement of hardware tasks at runtime.Reconfiguration overhead is an important obstacle that limits the performance of on-line scheduling algorithms in reconfigurable computing systems and increases the overall execution time.Configuration reusing (task reusing) can decrease reconfiguration overhead considerably,particularly in periodic applications or the applications in which the probability of tasks recurrence is high.In this paper,we present a technique called reusing-based scheduling (RBS),for on-line scheduling and placement in which configuration reusing is considered as a main characteristic in order to reduce reconfiguration overhead and decrease total execution time of the tasks.Several experiments have been conducted on the proposed algorithm.Obtained results show considerable improvement in overall execution time of the tasks. 相似文献

11.

Performance properties of vertically partitioned object-orientedsystems

Hufnagel S.P. Browne J.C. 《IEEE transactions on pattern analysis and machine intelligence》1989,15(8):935-946

A vertically partitioned structure for the design and implementation of object-oriented systems is proposed, and their performance is demonstrated. It is shown that the application-independent portion of the execution overheads in object-oriented systems can be less than the application-independent overheads in conventionally organized systems built on layered structures. Vertical partitioning implements objects through extended type managers. Two key design concepts result in performance improvement: object semantics can be used in the state management functions of an object type and atomicity is maintained at the type manager boundaries providing efficient recovery points. The performance evaluation is based on a case study of a simple but nontrivial distributed real-time system application 相似文献

12.

Cross‐profiling for Java processors

Walter Binder Martin Schoeberl Philippe Moret Alex Villazón 《Software》2009,39(18):1439-1465

Performance evaluation of embedded software is essential in an early development phase so as to ensure that the software will run on the embedded device's limited computing resources. The prevailing approaches either require the deployment of the software on the embedded target, which can be tedious and may be impossible in an early development phase, or rely on simulation, which can be very slow. In this article, we introduce a customizable cross‐profiling framework for embedded Java processors, including processors featuring a method cache. The developer profiles the embedded software in the host environment, completely decoupled from the target system, on any standard Java virtual machine, but the generated profiles represent the execution time metric of the target system. Our cross‐profiling framework is based on bytecode instrumentation. We identify several pointcuts in the execution of bytecode that need to be instrumented in order to estimate the CPU cycle consumption on the target system. An evaluation using the JOP embedded Java processor as target confirms that our approach reconciles high profile accuracy with moderate overhead. Our cross‐profiling framework also enables the performance evaluation of new processor architectures before they are implemented. As a case study, we explore the performance impact of various processor design choices and optimizations, such as different cache sizes or pipeline organizations, and come up with an improved processor design that yields speedups of up to 40% on standard Java benchmarks. Copyright © 2009 John Wiley & Sons, Ltd. 相似文献

13.

Scalability and Parallel Execution of Warp Processing: Dynamic Hardware/Software Partitioning

Roman Lysecky 《International journal of parallel programming》2008,36(5):478-492

Warp processors are a novel architecture capable of autonomously optimizing an executing application by dynamically re-implementing critical kernels within the software as custom hardware circuits in an on-chip FPGA. Previous research on warp processing focused on low-power embedded systems, incorporating a low-end ARM processor as the main software execution resource. We provide a thorough analysis of the scalability of warp processing by evaluating several possible warp processor implementations, from low-power to high-performance, and by evaluating the potential for parallel execution of the partitioned software and hardware. We further demonstrate that even considering a high-performance 1 GHz embedded processor, warp processing provides the equivalent performance of a 2.4 GHz processor. By further enabling parallel execution between the processes and FPGA, the parallel warp processor execution provides the equivalent performance of a 3.2 GHz processor. 相似文献

14.

Robust control reconfiguration of resource allocation systems with Petri nets and integer programming

Jun Li MengChu Zhou Tao Guo Yahui Gan Xianzhong Dai 《Automatica》2014

Supervisory control reconfiguration can handle the uncertainties including resource failures and task changes in discrete event systems. It was not addressed to exploit the robustness of closed-loop systems to accommodate some uncertainties in the prior studies. Such exploitation can cost-efficiently achieve reconfigurability and flexibility for real systems. This paper presents a robust reconfiguration method based on Petri nets (PNs) and integer programming for supervisory control of resource allocation systems (RASs) subject to varying resource allocation relationships. An allocation relationship is seen as a control specification while the execution processes requiring resources as an uncontrolled plant. First, a robust reconfiguration mechanism is proposed. It includes updating the P

P

-invariant-based supervisor and evolving the state of the closed-loop system. The latter adapts to the control specification changes by the self-regulation of the closed-loop system’s state. Next, two novel integer programming models for control reconfiguration are proposed, called a reconfiguration model with acceptability and reconfiguration one with specification correction. Since both models integrate the firability condition of transitions, no additional efforts are required for the state reachability analysis. Finally, a hospital emergency service system is used as an example to illustrate them. 相似文献

15.

Automatic compilation of loops to exploit operator parallelism onconfigurable arithmetic logic units

Ramasubramanian N. Subramanian R. Pande S. 《Parallel and Distributed Systems, IEEE Transactions on》2002,13(1):45-66

Configurable arithmetic logic units (ALUs) offer opportunities for adapting the underlying hardware to support the varying amount of parallelism in the computation. The problem of identifying the optimal parallel configurations (a configuration is defined as a given hardware implementation of different operators along with their multiplicities) at different steps in a program is a very complex issue but, if solved, allows the power of these ALUs to be maximally used. This paper focuses on developing an automatic compilation framework for configuration analysis to exploit operator parallelism within loop nests. The focus of this work is on performing configuration analysis to minimize costly reconfiguration overheads. In our framework, we initially carry out some operator and loop transformations to expose more opportunities for configuration reuse. We then present a two pass solution. The first pass attempts to generate either maximal cutsets (a cutset is defined as a group of statements that execute under a given configuration) or maximally parallel configurations by performing an analysis on the program dependency graph (PDG) of a loop nest. The second pass analyzes the trade-offs between the costs and benefits of reconfigurations across different cutsets and attempts to eliminate the reconfiguration overheads by merging cutsets. This methodology is implemented in the SUIF compilation system and is tested using some loops extracted from Perfect benchmarks and Livermore kernels. Good speedups are obtained, showing the merit of the proposed method. The method also scales well with the loop sizes and the amount of space available on FPGAs for configurable logic 相似文献

16.

Value Prediction and Speculative Execution on GPU

Shaoshan Liu Christine Eisenbeis Jean-Luc Gaudiot 《International journal of parallel programming》2011,39(5):533-552

GPUs and CPUs have fundamentally different architectures. It is conventional wisdom that GPUs can accelerate only those applications that exhibit very high parallelism, especially vector parallelism such as image processing. In this paper, we explore the possibility of using GPUs for value prediction and speculative execution: we implement software value prediction techniques to accelerate programs with limited parallelism, and software speculation techniques to accelerate programs that contain runtime parallelism, which are hard to parallelize statically. Our experiment results show that due to the relatively high overhead, mapping software value prediction techniques on existing GPUs may not bring any immediate performance gain. On the other hand, although software speculation techniques introduce some overhead as well, mapping these techniques to existing GPUs can already bring some performance gain over CPU. Based on these observations, we explore the hardware implementation of speculative execution operations on GPU architectures to reduce the software performance overheads. The results indicate that the hardware extensions result in almost tenfold reduction of the control divergent sequential operations with only moderate hardware (5–8%) and power consumption (1–5%) overheads. 相似文献

17.

Achieving middleware execution efficiency: hardware-assisted garbage collection operations

Jie Tang Shaoshan Liu Zhimin Gu Xiao-Feng Li Jean-Luc Gaudiot 《The Journal of supercomputing》2012,59(3):1101-1119

Although virtualization technologies bring many benefits to cloud computing environments, as the virtual machines provide more features, the middleware layer has become bloated, introducing a high overhead. Our ultimate goal is to provide hardware-assisted solutions to improve the middleware performance in cloud computing environments. As a starting point, in this paper, we design, implement, and evaluate specialized hardware instructions to accelerate GC operations. We select GC because it is a common component in virtual machine designs and it incurs high performance and energy consumption overheads. We performed a profiling study on various GC algorithms to identify the GC performance hotspots, which contribute to more than 50% of the total GC execution time. By moving these hotspot functions into hardware, we achieved an order of magnitude speedup and significant improvement on energy efficiency. In addition, the results of our performance estimation study indicate that the hardware-assisted GC instructions can reduce the GC execution time by half and lead to a 7% improvement on the overall execution time. 相似文献

18.

基于进程投机并行的运行时系统设计与优化

刘雷李晶陈莉冯晓兵《计算机工程》2014,(3):99-102,112

投机并行化是解决遗留串行代码并行化的重要技术,但以往投机并行化运行时系统面临着诸多的性能问题,如任务分配不均衡、通信频繁、冲突代价高,以及进程启动,结柬频繁而导致开销过高等。为此,提出一种基于进程实现的投机并行化运行时系统。采用隐式单程序多数据的并行任务划分和执行模式。通过实现重甩进程的投机任务调度策略和委托正确性检查技术,降低投机进程启动/结束和通信的开销,提高投机进程的利用率,同时利用守护进程与投机进程协同执行的方式,确保在投机进程出现异常情况时程序也能正确执行。实验结果表明,该基于进程实现的投机运行时系统比同类型系统的性能提高231%。相似文献

19.

EMERALDS: a small-memory real-time microkernel

Zuberi K.M. Shin K.G. 《IEEE transactions on pattern analysis and machine intelligence》2001,27(10):909-928

EMERALDS (Extensible Microkernel for Embedded, ReAL-time, Distributed Systems) is a real-time microkernel designed for small-memory embedded applications. These applications must run on slow (15-25 MHz) processors with just 32-128 kbytes of memory, either to keep production costs down in mass produced systems or to keep weight and power consumption low. To be feasible for such applications, the OS must not only be small in size (less than 20 kbytes), but also have low overhead kernel services. Unlike commercial embedded OSs which rely on carefully optimized code to achieve efficiency, EMERALDS takes the approach of redesigning the basic OS services of task scheduling, synchronization, communication, and system call mechanism by using characteristics found in small-memory embedded systems, such as small code size and a priori knowledge of task execution and communication patterns. With these new schemes, the overheads of various OS services are reduced 20-40 percent without compromising any OS functionality 相似文献

20.

A queueing theoretic approach for performance evaluation of low-power multi-core embedded systems

Arslan Munir Ann Gordon-Ross Sanjay Ranka Farinaz Koushanfar 《Journal of Parallel and Distributed Computing》2014

With Moore’s law supplying billions of transistors on-chip, embedded systems are undergoing a transition from single-core to multi-core to exploit this high transistor density for high performance. However, the optimal layout of these multiple cores along with the memory subsystem (caches and main memory) to satisfy power, area, and stringent real-time constraints is a challenging design endeavor. The short time-to-market constraint of embedded systems exacerbates this design challenge and necessitates the architectural modeling of embedded systems to reduce the time-to-market by expediting target applications to device/architecture mapping. In this paper, we present a queueing theoretic approach for modeling multi-core embedded systems that provides a quick and inexpensive performance evaluation both in terms of time and resources as compared to the development of multi-core simulators and running benchmarks on these simulators. We verify our queueing theoretic modeling approach by running SPLASH-2 benchmarks on the SuperESCalar simulator (SESC). Results reveal that our queueing theoretic model qualitatively evaluates multi-core architectures accurately with an average difference of 5.6% as compared to the architectures’ evaluations from the SESC simulator. Our modeling approach can be used for performance per watt and performance per unit area characterizations of multi-core embedded architectures, with varying number of processor cores and cache configurations, to provide a comparative analysis. 相似文献