Similar literature
1.
Power consumption is an increasingly important consideration in the design of mixed hardware/software systems. This work defines the notion of instruction subsetting and explores its use as a means of reducing power consumption at the system level of design. Instruction subsetting is defined as creating an application-specific instruction set processor (ASIP) from a more general processor, such as a DSP. Although not as effective as an ASIC solution, instruction subsetting provides much of the power savings while maintaining some level of programmability. Beyond energy savings, instruction subsetting also offers the opportunity to shorten the design cycle through the reuse of existing processor intellectual property, including behavioral and structural designs, hardware simulators, application code, and compilers. We synthesized 9 ASIPs through place and route and found that a poorly chosen instruction set may consume more than 4 times the energy of an ASIP with a proper instruction set choice. This finding will allow designers to consider another set of trade-offs in their hardware/software design space exploration.

2.
Dynamic instruction scheduling logic is quite complex and dissipates significant energy in microprocessors that support superscalar and out-of-order execution. We propose a novel microarchitectural technique to reduce the complexity and energy consumption of the dynamic instruction scheduling logic. The proposed method groups several instructions as a single issue unit and reduces the required number of ports and the size of the structure. This paper describes the microarchitecture mechanisms and shows evaluation results for energy savings and performance. These results reveal that the proposed technique can greatly reduce energy with almost no performance degradation, compared to the conventional dynamic instruction scheduling logic.

3.
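A survey of low-power compilation techniques   Total citations: 9, self-citations: 1, citations by others: 8
Hu Dinglei (胡定磊), Chen Shuming (陈书明). Acta Electronica Sinica (《电子学报》), 2005, 33(4): 676-682
Power consumption has become an important factor constraining the development of electronic systems. Power is dissipated by hardware as it runs software: both the data accesses and the instruction execution of the software cause the hardware to consume power. Through appropriate scheduling and optimization, a compiler can change the execution trace of the software on the hardware so that the hardware consumes less power when executing a given program. This paper broadly surveys research on low-power compilation from two aspects, how to estimate the power consumption of software and how to realize low-power compilation, with emphasis on dedicated low-power compilation techniques. Finally, it analyzes the open problems in current low-power compilation and predicts new directions for the field.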

4.
Conventional scheduling algorithms usually adjust the clock cycle duration to the execution time of the longest operations. This results in large slack times wasted in those cycles with faster operations. To reduce this wasted time, multi-cycle and chaining techniques have been employed. Chaining helps reduce the circuit latency if it is applied to the critical path operations, and multi-cycle operators usually result in smaller clock cycles. Both techniques are applied at the operation level, and thus their impact on the circuit performance is bounded by the selected latency. Additionally, they have limited reusability. The design methodology presented in this paper overcomes the limitations of previous techniques to obtain substantially faster circuits. It fragments some of the specification operations into several smaller ones that are handled independently. This way, some operations can begin before their predecessors have finished and can also be executed in several non-consecutive cycles. Furthermore, the fragmentation of operations favours the reusability of hardware resources, leading also to smaller designs.

5.
Available energy has become a critical design issue for increasingly complex real-time embedded systems. Phase Change Memory (PCM), with high density and low idle power, has recently been extensively studied as a promising alternative to DRAM. A hybrid PCM-DRAM main memory architecture has been proposed to leverage the low power of PCM and the high speed of DRAM. In this paper, we propose energy-aware real-time task scheduling strategies for hybrid PCM-DRAM based embedded systems. Given the execution time variation when a task is loaded into PCM or DRAM, we re-design the static table-driven scheduling for a set of fixed tasks, as well as the Rate-Monotonic (RM) and Earliest Deadline First (EDF) scheduling policies for periodic task sets. Furthermore, since the actual execution time can be much shorter than the worst-case execution time, we propose online schedulers which migrate the tasks between PCM and DRAM to optimize the energy consumption by utilizing the slack time resulting from the completed tasks. All the proposed algorithms minimize the number of task migrations from PCM to DRAM by ensuring that aperiodic tasks are not migrated while each periodic task instance can be migrated at most once. Experimental results show that our proposed scheduling algorithms satisfy the real-time constraints and significantly reduce the energy consumption.

6.
Many embedded multimedia systems employ special hardware blocks to co-process with the main processor. Even though efficient handling of such hardware blocks is critical to the overall performance of real-time multimedia systems, traditional real-time scheduling techniques cannot guarantee high-quality multimedia playback free of delay and jerkiness. This paper presents a hardware-aware rate monotonic scheduling (HA-RMS) algorithm to manage hardware tasks efficiently and handle special hardware blocks in the embedded multimedia system. HA-RMS prioritizes hardware tasks over software tasks not only to increase the hardware utilization of the system but also to reduce the output jitter of multimedia applications, which reduces the overall response time.

7.
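Xiao Peng (肖鹏), Hu Zhigang (胡志刚), Qu Xilong (屈喜龙). Journal on Communications (《通信学报》), 2015, 36(1): 149-158
As data centers grow in scale, high energy consumption has become an important problem in high-performance computing. To address the high energy consumption of data-intensive workflows, a method that introduces "virtual data access nodes" is proposed to quantitatively evaluate the data-access energy cost of workflow tasks, and a "minimum energy consumption path" heuristic is designed on this basis. Building on the classic HEFT and CPOP algorithms, two energy-aware scheduling algorithms (HEFT-MECP and CPOP-MECP) are designed and implemented by introducing this heuristic. Experimental results show that the heuristic scheduling algorithms based on the minimum energy consumption path effectively reduce the energy cost of data access operations, and that the heuristic adapts well to large data-intensive workflow tasks.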

8.
In a multiprocessor system-on-chip (MPSoC), tasks and communications should be scheduled carefully since their execution order affects the performance of the entire system. When we implement an MPSoC according to the scheduling result, we may find that the scheduling result is not correct or that timing constraints are not met unless the scheduling takes into account the delays of the MPSoC architecture. The unexpected scheduling results are mainly caused by inaccurate communication delays and/or the runtime scheduler's overhead. Due to the high complexity of the scheduling problem, most previous work neglects the inter-processor communication, or just assumes a fixed delay proportional to the communication volume, without taking into consideration subtle effects like communication congestion and synchronization delay, which may change dynamically throughout task execution. In this paper, we propose an accurate scheduling model of the hardware/software communication architecture to improve timing accuracy by taking into account the effects of dynamic software synchronization and detailed hardware resource constraints such as communication congestion and buffer sharing. We also propose a method for runtime scheduler implementation and consider its performance overhead in scheduling. In particular, we introduce efficient hardware and software scheduler architectures. Furthermore, we address the issue of centralized versus distributed implementation of the schedulers and investigate the pros and cons of the two implementations. Through experiments with significant demonstration examples, we show the effectiveness of the proposed approach.

9.
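Reducing system power consumption requires considering not only hardware factors but also the power consumption caused by software. To reduce the overall power consumption of a system, the hardware and software factors that affect it must first be identified. On the hardware side, power is reduced through the selection, design, and integration of hardware components; on the software side, the focus is on optimizing the elements most closely tied to power consumption, such as algorithms, instructions, and methods. These factors often constrain and influence one another. Designing a successful low-power system requires using analysis and experiment to identify the issues that must be considered in the low-power design of an embedded system built around the idea of hardware components.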

10.
The recent growth of applications in the emerging Internet of Things field is posing new challenges in the long-term deployment of sensing devices. Currently, system designers rely on energy harvesting to reduce battery size and extend system lifetime. While some system functions need a constant power supply, others can have their service adapted dynamically to the available harvested energy and harvesting power. Our proposed Torpor is a power-aware hardware scheduler which continuously monitors harvesting power and, in combination with its software runtime, dynamically activates system functions depending on the available energy and its rate of change. By performing a few key functions in hardware, Torpor incurs a very low power overhead during continuous monitoring, while the software runtime provides a high degree of flexibility to enable different scheduling policies. We implemented Torpor on an FPGA-based prototype and demonstrated that dynamic scheduling policies which take the harvesting power into account can have a 2× or more improvement in execution rates compared to static (input-power-independent) policies, while dynamic policies that are also aware of the system's power consumption can achieve a 1.5× improvement in execution rates compared to the ones that are not. The power consumption of Torpor's always-on hardware integrated on chip is estimated to be less than 4 μW, making it a very promising power-management add-on for microprocessors used in IoT nodes.

11.
High-performance scalar architectures that have the capability to issue multiple instructions per clock period are considered. The essential characteristics and the principal architectural tradeoffs in scientific array processors, very-long-instruction-word (VLIW) machines, the polycyclic architecture, and decoupled computers are examined. Array processors rely solely on static code scheduling done manually or by the compiler. The scheduling task is quite complex, and the resulting code may not be very efficient. In a VLIW, sophisticated compiler technology provides software solutions for functions traditionally done in hardware. The polycyclic architecture is similar to array processors in its structure but provides architectural support to the instruction scheduling task. In decoupled architectures the hardware changes the order of instruction execution at run time. This dynamic code scheduling capability does not come at the expense of additional control complexity.

12.
Exploiting instruction-level parallelism (ILP) is extremely important for achieving high performance in application specific instruction set processors (ASIPs) and embedded processors. Unlike conventional general purpose processors, ASIPs and embedded processors typically run a single application and hence must be optimized extensively for it in order to extract maximum performance. Further, the low power and low cost requirements of ASIPs may demand reuse of pipeline stages, resulting in pipelines with complex structural hazards. In such architectures, exploiting higher ILP is a major challenge to the designer. Existing techniques deal with either scheduling hardware pipelines to obtain higher throughput or software pipelining (an instruction scheduling technique for iterative computation) for exploiting greater ILP. We integrate these techniques to co-schedule hardware and software pipelines to achieve greater instruction throughput. In this paper, we develop the underlying theory of Co-Scheduling, called the Modulo-Scheduled Pipeline (or MS-Pipeline) theory. More specifically, we establish the necessary and sufficient condition for achieving the maximum throughput in a given pipeline operating under modulo scheduling. Further, we establish a sufficient condition to achieve a specified throughput, based on which we also develop a methodology for designing hardware pipelines that achieve such a throughput. We also present initial experimental results which help to establish the usefulness of MS-pipeline theory in software pipelining. As the proposed theory helps to analyze and improve the throughput of Modulo-Scheduled Pipelines (MS-pipelines), it is especially useful in designing ASIPs and embedded processors.

13.
In this paper we present a software programmable design flow that facilitates the implementation and integration of efficient digital pre-distortion (DPD) solutions on leading-edge field programmable gate arrays that combine industry-standard embedded processors and programmable logic fabric in one chip. In addition to software programmability, another key contribution of this design flow is the flexible partitioning of functionality among the hardware and software components, depending on the complexity of the DPD parameter estimation algorithm in use. We have applied processor-specific optimizations to the software implementation and used the Vivado high-level synthesis (HLS) tool as the design tool for the programmable logic. Furthermore, we have compared two different techniques for the integration of hardware and software components and chosen the one with the better area/latency trade-off. We present a comprehensive study reporting the DPD parameter update times when exploring the partitioning of the functionality among hardware and software. For low-complexity algorithms, we show that a software-only solution is applicable after carrying out the processor-specific software optimizations. For higher-complexity algorithms, we use Vivado HLS to accelerate the time-consuming blocks in the programmable logic, leading to a speed-up factor of up to 7× in the overall algorithm execution time. We present the performance results for two target devices. We also show that our accelerators use only a small portion of the programmable logic fabric on these devices and that a significant reduction of the system’s energy consumption can be obtained by leveraging the FPGA fabric.

14.
In this paper, we consider the increased performance that can be obtained by using, in concert, three previously proposed enhancements. These enhancements are aggressive dynamic (run time) instruction scheduling, the reuse of decoded instructions, and trace scheduling (both aggressive dynamic instruction scheduling and decoded instruction reuse have been used in commercial systems). We show that these three enhancements complement and support one another. Hence, while each of these enhancements has been shown to have merit in its own right, when used in concert, we claim the overall advantage is greater than that obtained by using any one singly. To support this claim, we present the results from running benchmarks representing several common multimedia kernels. Subsequent simulations show results of 7.3 instructions completed per cycle for the best-performing benchmark for a reasonably aggressive microarchitecture that combines trace scheduling of decoded instructions (i.e., decoded traces) with aggressive dynamic execution.

15.
The channel scheduling problem is to decide how to commit channels for transmitting data between nodes in wireless networks. It is one of the most important problems in wireless sensor networks. We aim to obtain a near-optimal solution with minimal energy consumption within a reasonable time. As the number of nodes in the network increases, however, the amount of calculation needed to find the solution becomes too high. Because the problem is NP-hard, it can be difficult to obtain an optimal solution in a reasonable execution time. Therefore, most recent studies of such problems focus on heuristic algorithms. In this paper, we propose efficient channel scheduling algorithms that obtain a near-optimal solution on the basis of three meta-heuristic algorithms: the genetic algorithm, Tabu search, and simulated annealing. In order to make the search more efficient, we propose some neighborhood generating methods for the proposed algorithms. We evaluate the performance of the proposed algorithms experimentally in terms of energy consumption and algorithm execution time. The experimental results show that the proposed algorithms are efficient for solving the channel scheduling problem in wireless sensor networks. Copyright © 2011 John Wiley & Sons, Ltd.

16.
As new applications in embedded communications and control systems push the computational limits of digital signal processing (DSP) functions, there will be an increasing need for software applications to be migrated to hardware in the form of a hardware-software codesign system. In many cases, access to the high-level source code may not be available. It is thus desirable to have a technology to translate the software binaries intended for processors to hardware implementations. This paper provides details on the retargetable FREEDOM compiler. The compiler automatically translates DSP software binaries to register-transfer level (RTL) VHDL and Verilog for implementation on field-programmable gate arrays (FPGAs) as standalone or system-on-chip implementations. We describe the underlying optimizations and some novel algorithms for alias analysis, data dependency analysis, memory optimizations, procedure call recovery, and back-end code scheduling. Experimental results on resource usage and performance are shown for several program binaries intended for the Texas Instruments C6211 DSP (VLIW) and the ARM922T reduced instruction set computer (RISC) processors. Implementation results for four kernels from the Simulink demo library and others from commonly used DSP applications, such as MPEG-4, Viterbi, and JPEG, are also discussed. The compiler-generated RTL code is mapped to Xilinx Virtex II and Altera Stratix FPGAs. We record overall performance gains of 1.5-26.9 for the hardware implementations of the kernels. Comparisons with the power aware compiler techniques (PACT) high-level synthesis compiler are used to show that software binaries can be used as intermediate representations from any high-level language and generate efficient hardware implementations.

17.
Energy-efficient design of battery-powered systems demands optimizations in both hardware and software. We present a modular approach for enhancing instruction level simulators with cycle-accurate simulation of energy dissipation in embedded systems. Our methodology has tightly coupled component models, which makes our approach more accurate. Performance and energy computed by our simulator are within a 5% tolerance of hardware measurements on the SmartBadge. We show how the simulation methodology can be used for hardware design exploration aimed at enhancing the SmartBadge with a real-time MPEG video feature. In addition, we present a profiler that relates energy consumption to the source code. Using the profiler we can quickly and easily redesign the MP3 audio decoder software to run in real time on the SmartBadge with low energy consumption. A performance increase of 92% and an energy consumption decrease of 77% over the original executable specification have been achieved.

18.
Switching activity and instruction cycles are two of the most important factors in power dissipation when the supply voltage is fixed. This paper studies the scheduling and assignment problems that minimize the total energy caused by both instruction processing and switching activities for applications with loops on multi-core, multi-Functional-Unit (multi-FU) architectures. An algorithm, EMPLS (Energy Minimization with Probability using Loop Scheduling), is proposed to minimize the total energy (E) while satisfying a timing constraint (L) with guaranteed probability (P). We perform scheduling and assignment simultaneously. Our approach shows better performance than approaches that consider scheduling and assignment in separate phases. Compared with previous work, our algorithm exhibits significant improvement in total energy reduction.

19.
Energy efficiency and capacity maximization are two of the most challenging issues to be addressed by current and future cellular networks. Significant research effort has recently been devoted to reducing the total energy consumption while maintaining or improving capacity, either by introducing more efficient hardware components or by developing innovative software techniques. In this paper we investigate a novel networking paradigm to address these problems. By capitalizing on the inherent delay tolerance of Internet-type services, we argue that significant energy savings can be achieved by postponing the communication of information to a later time instance with better networking conditions. We devise decentralized algorithms for the proposed postponement schemes and show the superior performance of such schemes over traditional cellular operation.

20.
Scheduling and binding are two tasks found in high-level synthesis of hardware as well as in compiling software. These tasks are performed on graphs that model the hardware or the software to be compiled to run on a specific processor. Scheduling focuses on determining the start execution time of each node in the graph. Binding is the task of assigning each node in the graph to a specific computational element. Performing binding before or after scheduling can preclude generating high-quality designs (hardware or binary code). This is true in particular in the era of design for low power. Not combining scheduling and binding can lead to designs with high switching activities and hence to high power consumption. To the best of our knowledge, there is no approach at this moment that addresses the problem of unifying scheduling and binding with an exact algorithm to produce designs with reduced power consumption. Known approaches to that problem are heuristics. That problem is NP-hard in general, since it is the composition of two NP-hard problems. Also, it has not yet been formulated in the literature. The problem becomes more complex when one has to deal with cyclic graphs and/or there are constraints to be met, such as timing constraints. For cyclic graphs, one has to integrate retiming in the unification of scheduling and binding. We propose a mathematical formulation of that problem. We extend this formulation to solve the problem of combining modulo scheduling, binding, and retiming under timing and resource constraints while reducing power consumption due to switching activities. The proposed approach is tested using known benchmarks. Based on the numerical results obtained, this approach is able to reduce power consumption by 33.24% on average, with an average run time of 33.83 s.
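
The link between binding and switching activity described in entry 20 can be illustrated with a minimal sketch. It assumes a deliberately simplified model that is not taken from the paper: the switching activity of a functional unit is approximated by the Hamming distance between the operands of consecutive operations bound to it. The operation names, operand values, unit names, and function names are all hypothetical.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two operand words differ."""
    return bin(a ^ b).count("1")


def switching_activity(binding, operands):
    """Total bit toggles at the unit inputs for a given binding.

    binding  -- dict: unit name -> list of operation ids in execution order
    operands -- dict: operation id -> operand value seen at the unit input
    (Assumed toy model: one operand per operation, no glitching.)
    """
    total = 0
    for ops in binding.values():
        # Compare each operation with the one that follows it on the same unit.
        for prev, nxt in zip(ops, ops[1:]):
            total += hamming(operands[prev], operands[nxt])
    return total


# Hypothetical schedule: step 1 runs a and b, step 2 runs c and d, on two adders.
operands = {"a": 0b0000, "b": 0b1111, "c": 0b0001, "d": 0b1110}

# Same schedule, two different bindings of operations to adders.
binding_1 = {"add0": ["a", "c"], "add1": ["b", "d"]}
binding_2 = {"add0": ["a", "d"], "add1": ["b", "c"]}

print(switching_activity(binding_1, operands))
print(switching_activity(binding_2, operands))
```

Under this toy model the two bindings share one schedule but differ in the number of toggled bits, which is the kind of interaction that motivates treating scheduling and binding together.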
