Similar Literature
A total of 20 similar documents were retrieved (search time: 31 ms).
1.
Computation-intensive DSP applications usually require parallel/pipelined processors in order to meet specific timing requirements. Data hazards are a major obstacle to high performance in pipelined systems. This paper presents a novel, efficient loop scheduling algorithm that reduces data hazards for such DSP applications. The algorithm has been embedded in a tool, called SHARP, which schedules a pipelined data flow graph onto multiple pipelined units while hiding the underlying data hazards and minimizing the execution time. This paper reports significant improvements on several well-known benchmarks, demonstrating the efficiency of the scheduling algorithm and the flexibility of the simulation tool.
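A minimal sketch of the kind of transformation such schedulers perform (not the SHARP algorithm itself): the loop below is software-pipelined by hand so that the load for iteration i+1 overlaps the multiply-accumulate of iteration i, hiding the load-use hazard. The array contents and coefficient are illustrative.

#include <stdio.h>

#define N 8

int main(void) {
    int a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    int coef = 3, acc = 0;

    /* Prologue: issue the first load before the steady-state kernel. */
    int loaded = a[0];

    /* Kernel: iteration i uses the value loaded in iteration i-1 and loads
     * the value needed by iteration i+1; a scheduler can now place the two
     * operations in the same cycle on separate pipelined units. */
    for (int i = 0; i < N - 1; i++) {
        int next = a[i + 1];    /* load for the next iteration              */
        acc += coef * loaded;   /* compute with the previously loaded value */
        loaded = next;
    }

    /* Epilogue: drain the last loaded value. */
    acc += coef * loaded;

    printf("acc = %d\n", acc);  /* 3 * (1+2+...+8) = 108 */
    return 0;
}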

2.
Most embedded systems have a limited amount of memory. In contrast, the memory requirements of the digital signal processing (DSP) and video processing codes (in nested loops, in particular) running on embedded systems are significant. This paper addresses the problem of estimating and reducing the amount of memory needed for transfers of data in embedded systems. First, the problem of estimating the region associated with a statement, i.e., the set of elements referenced by a statement during the execution of nested loops, is analyzed. For a fixed execution ordering, a quantitative analysis of the number of elements referenced is presented; exact expressions for uniformly generated references and close upper and lower bounds for nonuniformly generated references are derived. Second, in addition to presenting an algorithm that computes the total memory required, this paper also discusses the effect of transformations (that change the execution ordering) on the lifetimes of array variables, i.e., the time between the first and last accesses to a given array location. The term maximum window size is introduced, and quantitative expressions are derived to compute the maximum window size. A detailed analysis of the effect of unimodular transformations on data locality, including the calculation of the maximum window size, is presented.
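As a concrete illustration of what "the region associated with a statement" means, the sketch below brute-force counts the distinct locations touched by the reference A[i+j] inside a two-level loop nest; it is not the paper's closed-form estimation, and the loop bounds are arbitrary.

#include <stdio.h>

#define NI 16
#define NJ 8
#define LEN (NI + NJ - 1)   /* i+j ranges over 0 .. NI+NJ-2 */

int main(void) {
    static unsigned char touched[LEN];   /* zero-initialized access map */
    long refs = 0, distinct = 0;

    for (int i = 0; i < NI; i++)
        for (int j = 0; j < NJ; j++) {
            refs++;
            if (!touched[i + j]) {       /* first access to this location */
                touched[i + j] = 1;
                distinct++;
            }
        }

    printf("references executed: %ld\n", refs);     /* NI*NJ = 128       */
    printf("distinct locations : %ld\n", distinct); /* NI+NJ-1 = 23, the  */
    return 0;                                       /* size of the region */
}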

3.
We present a novel loop transformation technique, particularly well suited for optimizing embedded compilers, where an increase in compilation time is acceptable in exchange for a significant performance increase. The transformation optimizes loops containing nested conditional blocks. Specifically, it takes advantage of the fact that the Boolean value of the conditional expression, which determines the true/false paths, can be statically analyzed using a novel interval analysis technique that can evaluate conditional expressions in general polynomial form. The results of the interval analysis, combined with loop dependency information, are used to partition the iteration space of the nested loop. The loop nest is then decomposed so as to eliminate the conditional test, substantially reducing the execution time. Our technique completely eliminates the conditional from the loops (unlike previous techniques), which further facilitates the application of other optimizations and improves the overall speedup. Applying the proposed transformation to loop kernels taken from Mediabench, SPEC-2000, mpeg4, qsdpcm and gimp, we measured on average a 2.34X speedup on an UltraSPARC processor, a 2.92X speedup on an Intel Core Duo processor, a 2.44X speedup on a PowerPC G5 processor and a 2.04X speedup on an ARM9 processor. Performance improvement for entire applications was also promising: for 3 selected applications (mpeg-enc, mpeg-dec and qsdpcm) we measured a 15% speedup in the best case (5% on average) for the whole application.
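A minimal sketch of the underlying idea (without the paper's interval-analysis machinery): when the condition i < SPLIT is known statically to hold exactly for the first SPLIT iterations, the iteration space can be partitioned and the branch removed from both resulting loops. The bounds and the two arithmetic paths are made up for illustration.

#include <stdio.h>

#define N 16
#define SPLIT 10   /* value assumed to be provable at compile time */

int main(void) {
    int out[N];

    /* Original form: a conditional evaluated on every iteration. */
    for (int i = 0; i < N; i++) {
        if (i < SPLIT) out[i] = 2 * i;        /* "true" path  */
        else           out[i] = i - SPLIT;    /* "false" path */
    }

    /* Transformed form: two branch-free loops over the partitioned space. */
    int out2[N];
    for (int i = 0; i < SPLIT; i++) out2[i] = 2 * i;
    for (int i = SPLIT; i < N; i++) out2[i] = i - SPLIT;

    int same = 1;
    for (int i = 0; i < N; i++) same &= (out[i] == out2[i]);
    printf("results identical: %s\n", same ? "yes" : "no");
    return 0;
}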

4.
An embedded system is called a multi-mode embedded system if it performs multiple applications by dynamically reconfiguring the system functionality. Furthermore, it is called a multi-mode multi-task embedded system if it additionally supports multiple tasks to be executed within a mode. In this paper, we address an important HW/SW partitioning problem, namely the HW/SW partitioning of multi-mode multi-task embedded applications with timing constraints on tasks. The objective of the optimization problem is to find a minimal total system cost for the allocation/mapping of processing resources to the functional modules in tasks, together with a schedule that satisfies the timing constraints. The key to solving the problem lies in how fully the potential parallelism among module executions is exploited. However, because the search space of this parallelism is inherently very large, and in order to keep schedulability analysis tractable, prior HW/SW partitioning methods have not been able to fully exploit the potential parallel execution of modules. To overcome this limitation, we propose a set of comprehensive HW/SW partitioning techniques that solve the three subproblems of the partitioning problem simultaneously: (1) allocation of processing resources, (2) mapping of the processing resources to the modules in tasks, and (3) determination of an execution schedule of modules. Specifically, based on a precise measurement of the parallel execution and schedulability of modules, we develop a stepwise-refinement partitioning technique for single-mode multi-task applications, which aims to solve subproblems 1, 2 and 3 effectively in an integrated fashion. The proposed technique is then extended to solve the HW/SW partitioning problem of multi-mode multi-task applications (i.e., to find a globally optimized allocation/mapping of processing resources with a feasible execution schedule of modules). Experiments with a set of real-life applications show that the proposed techniques reduce the implementation cost by 19.0% and 17.0% for single- and multi-mode multi-task applications, respectively, compared with the conventional method.

5.
Yang Yang (杨阳). 《电子科技》 (Electronic Science and Technology), 2013, 26(6): 119-121, 136
To address the performance loss caused by frequent communication between the MPU and the DSP in heterogeneous dual-core embedded systems, this paper proposes an intelligent task controller suited to multimedia applications. The controller performs task management dynamically and effectively resolves the inter-processor communication problem. Using an ESL design methodology, a dual-core virtual prototyping platform is built, with the encoding of a 256×256-pixel JPEG image as the practical application. The results show that the intelligent controller reduces task management time by 15% to 68%.

6.
In this paper we introduce EVE (embedded vision/vector engine), with a FlexSIMD (flexible SIMD) architecture highly optimized for embedded vision. We show how EVE can be used to meet the growing requirements of embedded vision applications in a power- and area-efficient manner. EVE's SIMD features allow it to accelerate low-level vision functions (such as image filtering, color-space conversion, pyramids, and gradients). With the added flexibility of data accesses, EVE can also be used to accelerate many mid-level vision tasks (such as connected components, integral image, histogram, and Hough transform). Our experiments with a silicon implementation of EVE show that it performs many low- and mid-level vision functions with a 3–12x speed advantage over a C64x+ DSP, while consuming less power and area. EVE also achieves code size savings of 4–6x over a C64x+ DSP for regular loops. Thanks to its flexibility and programmability, we were able to implement two end-to-end vision applications on EVE and achieve more than a 5× application-level speedup over a C64x+. With EVE as a coprocessor next to a DSP or a general purpose processor, algorithm developers have the option to accelerate low- and mid-level vision functions on EVE. This gives them more room to innovate and use the DSP for new, more complex, high-level vision algorithms.
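For readers unfamiliar with the mid-level kernels listed above, the sketch below computes an integral image in plain C; it only illustrates the kind of computation such an accelerator targets and is not EVE code (no SIMD, no EVE-specific features).

#include <stdio.h>

#define W 4
#define H 3

int main(void) {
    unsigned char img[H][W] = { {1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12} };
    unsigned int ii[H][W];

    /* ii[y][x] = sum of img over the rectangle (0,0)..(y,x), using the
     * recurrence ii = pixel + left + top - top-left. */
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            unsigned int left    = (x > 0) ? ii[y][x - 1] : 0;
            unsigned int top     = (y > 0) ? ii[y - 1][x] : 0;
            unsigned int topleft = (x > 0 && y > 0) ? ii[y - 1][x - 1] : 0;
            ii[y][x] = img[y][x] + left + top - topleft;
        }

    printf("sum of whole image = %u\n", ii[H - 1][W - 1]);  /* 78 */
    return 0;
}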

7.
Run-Time Overhead of the RM Algorithm: Analysis and an Improved Approach
The rate-monotonic (RM) algorithm is the classic fixed-priority real-time scheduling algorithm. In embedded real-time systems, however, the workload often consists of many tasks with high activation rates and short execution times, so scheduling tasks directly with the RM algorithm lowers the resource utilization of the embedded system because of the context-switch overhead incurred in the real-time operating system. This paper analyzes the preemption relationships among tasks scheduled by the RM algorithm and builds a context-switch overhead model parameterized by task attributes. Based on this model, the run-time task-switching overhead introduced by RM scheduling is reduced by optimizing the tasks' release times. Experimental results verify the effectiveness of the strategy.
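A minimal sketch of why context-switch overhead matters in RM schedulability (this is not the paper's release-time optimization): the classic Liu-and-Layland utilization test, with each task's WCET inflated by two context switches, which is one common way to charge preemption cost. The task set and the switch cost are assumed values.

#include <math.h>
#include <stdio.h>

int main(void) {
    double wcet[]   = {1.0, 2.0, 3.0};     /* ms                            */
    double period[] = {10.0, 20.0, 50.0};  /* ms                            */
    double cs       = 0.1;                 /* context-switch cost, ms (assumed) */
    int n = 3;

    /* Utilization with each task charged two context switches per job. */
    double u = 0.0;
    for (int i = 0; i < n; i++)
        u += (wcet[i] + 2.0 * cs) / period[i];

    double bound = n * (pow(2.0, 1.0 / n) - 1.0);  /* ~0.7798 for n = 3 */
    printf("utilization with overhead = %.4f, RM bound = %.4f -> %s\n",
           u, bound, u <= bound ? "schedulable" : "not guaranteed");
    return 0;
}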

8.
Digital signal processing (DSP) applications involve processing long streams of input data. It is important to take into account this form of processing when implementing embedded software for DSP systems. Task-level vectorization, or block processing, is a useful dataflow graph transformation that can significantly improve execution performance by allowing subsequences of data items to be processed through individual task invocations. In this way, several benefits can be obtained, including reduced context switch overhead, increased memory locality, improved utilization of processor pipelines, and use of more efficient DSP oriented addressing modes. On the other hand, block processing generally results in increased memory requirements since it effectively increases the sizes of the input and output values associated with processing tasks. In this paper, we investigate the memory-performance trade-off associated with block processing. We develop novel block processing algorithms that carefully take into account memory constraints to achieve efficient block processing configurations within given memory space limitations. Our experimental results indicate that these methods derive optimal memory-constrained block processing solutions most of the time. We demonstrate the advantages of our block processing techniques on practical kernel functions and applications in the DSP domain.
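A minimal sketch of task-level block processing (not the memory-constrained algorithms of the paper): the same scaling task is invoked once per sample and then once per block of BLOCK samples, so the per-invocation overhead is paid 1/BLOCK as often, at the price of BLOCK-deep input/output buffers. The task body and block size are illustrative.

#include <stdio.h>

#define N      32
#define BLOCK  8

static int calls;                        /* counts task invocations */

static void scale_task(const int *in, int *out, int count) {
    calls++;                             /* models per-invocation overhead */
    for (int i = 0; i < count; i++)
        out[i] = 3 * in[i];
}

int main(void) {
    int in[N], out[N];
    for (int i = 0; i < N; i++) in[i] = i;

    calls = 0;
    for (int i = 0; i < N; i++)          /* one sample per invocation  */
        scale_task(&in[i], &out[i], 1);
    printf("per-sample: %d invocations\n", calls);   /* 32 */

    calls = 0;
    for (int i = 0; i < N; i += BLOCK)   /* BLOCK samples per invocation */
        scale_task(&in[i], &out[i], BLOCK);
    printf("per-block : %d invocations\n", calls);   /* 4 */
    return 0;
}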

9.
Memory-processor integration offers new opportunities for reducing the energy of a system. In the case of embedded systems, where memory access patterns can typically be profiled at design time, one solution consists of mapping the most frequently accessed addresses onto the on-chip SRAM to guarantee power and performance efficiency. In this work, we propose an algorithm for the automatic partitioning of on-chip SRAMs into multiple banks. Starting from the dynamic execution profile of an embedded application running on a given processor core, we synthesize a multi-banked SRAM architecture optimally fitted to the execution profile. The algorithm computes an optimal solution to the problem under realistic assumptions on the power cost metrics and with constraints on the number of memory banks. The partitioning algorithm is integrated with the physical design phase into a complete flow that allows the back-annotation of layout information to drive the partitioning process. Results collected on a set of embedded applications for the ARM processor show average energy savings of around 34%.
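A back-of-the-envelope illustration of why profile-driven banking saves energy; the linear energy-versus-size model, the 8 KB/24 KB split and the 90/10 access profile are assumptions, not the paper's cost metrics or its optimal algorithm.

#include <stdio.h>

int main(void) {
    double accesses     = 1e6;
    double hot_fraction = 0.9;   /* profiled: 90% of accesses hit a small hot region */
    double e_per_kb     = 1.0;   /* energy per access, per KB of the active bank     */
    double total_kb     = 32.0;

    /* One monolithic SRAM: every access activates the full 32 KB array. */
    double e_mono = accesses * e_per_kb * total_kb;

    /* Profile-driven split: a small 8 KB bank holds the hot addresses,
     * the remaining 24 KB bank is activated only for the cold 10%. */
    double e_banked = accesses * (hot_fraction * e_per_kb * 8.0 +
                                  (1.0 - hot_fraction) * e_per_kb * 24.0);

    printf("monolithic SRAM : %.3e\n", e_mono);
    printf("8KB + 24KB banks: %.3e (%.0f%% saving)\n",
           e_banked, 100.0 * (1.0 - e_banked / e_mono));
    return 0;
}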

10.
Excessively long configuration time is a major factor limiting the overall performance of reconfigurable systems, and sound task scheduling can effectively reduce it. Targeting coarse-grained dynamically reconfigurable systems (CGDRS) and streaming applications with data dependences, this paper proposes a three-dimensional task scheduling model. Based on this model, a task scheduling algorithm using a pre-configuration strategy (CPSA) is first designed; then, exploiting configuration reuse between tasks, interval configuration reuse and continuous configuration reuse strategies are proposed and used to improve the CPSA algorithm. Experimental results show that the CPSA algorithm effectively resolves scheduling deadlock, reduces the execution time of streaming applications, and improves the scheduling success rate. Compared with other scheduling algorithms, the average improvement in streaming application execution time ranges from 6.13% to 19.53%.

11.
Scheduling is one of the most often addressed optimization problems in DSP compilation, behavioral synthesis, and system-level synthesis research. With the rapid pace of changes in modern DSP applications requirements and implementation technologies, however, new types of scheduling challenges arise. This paper is concerned with the problem of scheduling blocks of computations in order to optimize the efficiency of their execution on programmable embedded systems under a realistic timing model of their processors. We describe an effective scheme for scheduling the blocks of any computation on a given system architecture and with a specified algorithm implementing each block. We also present algorithmic techniques for performing optimal block scheduling simultaneously with optimal architecture and algorithm selection. Our techniques address the block scheduling problem for both single- and multiple-processor system platforms and for a variety of optimization objectives including throughput, cost, and power dissipation. We demonstrate the practical effectiveness of our techniques on numerous designs and synthetic examples.

12.
To meet the demands of applications with large data volumes and complex algorithms, a high-performance embedded image processing system is built in which a high-performance DSP carries out the complex image processing algorithms while an FPGA serves as a coprocessor handling image capture, storage and display. The DSP and the FPGA are interconnected seamlessly at high speed through the EMIF interface. A triple-buffered read/write mechanism resolves the asynchronous clock domains of capture and display as well as the uncertainty in algorithm processing time. The C6455 software flow developed on BIOS and NDK is described, and statistics on the execution cycles of the system's image processing algorithms are presented. The system runs stably and reliably and has considerable practical value.
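A minimal sketch of the triple-buffer handoff described above, reduced to index management: three buffers let the capture side always have a free buffer to fill and the display side always grab the most recently completed frame, even though the two run on unrelated clocks. A real implementation would protect 'latest_idx' with the platform's interrupt or atomic rules; the frame loop and buffer count are illustrative.

#include <stdio.h>

enum { NBUF = 3 };

static int write_idx  = 0;   /* buffer the producer is currently filling  */
static int latest_idx = -1;  /* most recently completed frame, -1 = none  */
static int read_idx   = -1;  /* buffer the consumer is currently showing  */

static void producer_done(void) {            /* capture finished a frame  */
    latest_idx = write_idx;
    /* pick a buffer that is neither the new 'latest' nor being displayed */
    for (int i = 0; i < NBUF; i++)
        if (i != latest_idx && i != read_idx) { write_idx = i; break; }
}

static void consumer_start(void) {           /* display grabs the newest frame */
    if (latest_idx >= 0) read_idx = latest_idx;
}

int main(void) {
    for (int frame = 0; frame < 6; frame++) {
        producer_done();
        if (frame % 2 == 0) consumer_start(); /* consumer runs slower */
        printf("frame %d: write=%d latest=%d read=%d\n",
               frame, write_idx, latest_idx, read_idx);
    }
    return 0;
}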

13.
A distributed mobile DSP system consists of a group of mobile devices with different computing powers, connected by a wireless network. Parallel processing in such a distributed mobile DSP system can provide high computing performance. Because most mobile devices are battery powered, the lifetime of a mobile DSP system depends on both the battery behavior and the energy consumption characteristics of the tasks. In this paper, we present a systematic system model for task scheduling in mobile DSP systems equipped with Dynamic Voltage Scaling (DVS) processors and energy harvesting techniques. We propose three-phase algorithms to obtain task schedules with shorter total execution time while satisfying the system lifetime constraints. Simulations with randomly generated Directed Acyclic Graphs (DAGs) show that the proposed algorithms generate optimal schedules that satisfy the lifetime constraints.
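A minimal sketch of the DVS trade-off such schedulers reason about (not the paper's three-phase algorithms): under the common model where per-cycle energy scales with V^2 and clock frequency scales roughly with V, lowering the voltage level stretches execution time but cuts energy, and only levels that fit a given energy budget are acceptable. The voltage/frequency pairs, cycle count, capacitance and budget are assumed numbers.

#include <stdio.h>

int main(void) {
    double volts[] = {1.2, 1.0, 0.8};        /* V                        */
    double freqs[] = {600e6, 500e6, 400e6};  /* Hz, roughly f ~ V        */
    double cycles  = 3e8;                    /* task length in cycles    */
    double ceff    = 1e-9;                   /* effective switched capacitance */
    double budget  = 0.35;                   /* joules available for this task */

    for (int i = 0; i < 3; i++) {
        double time   = cycles / freqs[i];                      /* seconds */
        double energy = ceff * volts[i] * volts[i] * cycles;    /* joules  */
        printf("V=%.1f  time=%.3f s  energy=%.3f J  %s\n",
               volts[i], time, energy,
               energy <= budget ? "fits budget" : "exceeds budget");
    }
    return 0;
}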

14.
The co-synthesis of hardware-software systems for complex embedded applications has been studied extensively, with a focus on various system objectives such as high-speed performance and low power dissipation. One of the main challenges in the construction of multiprocessor systems for complex real-time applications is to provide high levels of system availability that satisfy users' expectations. Even though the area of hardware-software co-synthesis has been studied extensively in the recent past, the issues that specifically relate to design exploration for highly available architectures need to be addressed more systematically and in a manner that supports active user participation. In this paper, we propose a user-centric co-synthesis mechanism for generating gracefully degrading, heterogeneous multiprocessor architectures that fulfills the dual objectives of achieving real-time performance and ensuring high levels of system availability at acceptable cost. A flexible interface allows the user to specify rules that effectively capture the user's perceived availability expectations under different working conditions. We propose an algorithm to map these user requirements to the importance attached to the subset of services provided in any functional state. System availability is evaluated on the basis of these user-driven importance values and a CTMC model of the underlying fail-repair process. We employ a stochastic timing model in which all the relevant performance parameters, such as task execution times, data arrival times and data communication times, are taken to be random variables. A stochastic scheduling algorithm assigns start and completion time distributions to tasks. A hierarchical genetic algorithm optimizes the selection of resources, i.e., processors and buses, and the task allocations. We report the results of a number of experiments performed with representative task graphs. The analysis shows that the co-synthesis tool we have developed is effectively driven by the user's availability requirements as well as by the topological characteristics of the task graph, yielding high-quality architectures. We experimentally demonstrate the edge provided by a stochastic timing model in terms of performance assessment, resource utilization, system availability and cost. An erratum to this article is available at .

15.
Today's embedded applications often consist of multiple concurrent tasks. These tasks are decomposed into sub-tasks, which are in turn assigned and scheduled on multiple different processors to achieve Pareto-optimal performance/energy combinations. Previous work introduced systematic approaches for exploring performance-energy trade-offs for each individual task and used the exploration results at run time to fulfill system-level constraints. However, it did not exploit the fact that concurrent tasks can be executed in an overlapped fashion. In this paper, we propose a simple yet powerful on-line technique that overlaps tasks by run-time subtask re-scheduling. By doing so, a multiprocessor system with concurrent tasks can achieve better performance without extra energy consumption. We have applied our algorithm to a set of randomly generated task graphs, obtaining encouraging improvements over non-overlapped execution and lower overall energy consumption than a previous DVS method for real-time tasks. We have also demonstrated the algorithm on real-life video- and image-processing applications implemented on a dual-processor TI TMS320C6202 board: we achieved a reduction of 22–29% in application execution time, while the impact of the run-time scheduling overhead proved to be negligible (1.55%).

16.
This paper presents the hardware loop design of an audio-specific DSP core. The design supports hardware loops of both a single instruction and multiple instructions, implemented by the same circuit with a common instruction format, and allows hardware loops nested up to four levels deep. Experimental verification on an FPGA development platform shows that the design performs loop operations correctly and efficiently, providing the support the audio-specific DSP core needs for audio decoding.

17.
Dynamic Voltage Scaling (DVS) is one of the techniques used to obtain energy savings in real-time DSP systems. In many DSP systems, some tasks contain conditional instructions that have different execution times for different inputs. Because of the uncertainty in the execution times of these tasks, this paper models each varying execution time as a random variable and solves the Voltage Assignment with Probability (VAP) problem. The VAP problem involves finding a voltage level for each node of a data flow graph (DFG) in uniprocessor and multiprocessor DSP systems. This paper proposes two optimal algorithms, one for uniprocessor and one for multiprocessor DSP systems, that minimize the expected total energy consumption while satisfying the timing constraint with a guaranteed confidence probability. The experimental results show that our approach achieves significantly greater energy savings than previous work. For example, our algorithm for multiprocessors achieves an average improvement of 56.1% in total energy savings while satisfying the timing constraint with probability 0.80.
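A minimal sketch of the voltage-assignment-with-probability idea for a single node (not the paper's optimal algorithms): the execution time is a two-point random variable, a level is acceptable only if the deadline is met with at least the required confidence, and among acceptable levels the one with the lowest expected energy is kept. All numbers are assumptions.

#include <stdio.h>

int main(void) {
    /* execution time (ms) at the highest level: short path vs long path */
    double t_short = 4.0, t_long = 8.0, p_short = 0.7;
    double deadline = 10.0, confidence = 0.8;

    /* candidate levels: relative speed and relative energy per run (~ V^2) */
    double speed[]  = {1.0, 0.8, 0.6};
    double energy[] = {1.00, 0.64, 0.36};

    int best = -1;
    for (int i = 0; i < 3; i++) {
        double ts = t_short / speed[i], tl = t_long / speed[i];
        /* probability that the scaled execution time meets the deadline */
        double p_meet = (tl <= deadline) ? 1.0
                      : (ts <= deadline) ? p_short : 0.0;
        printf("level %d: P(meet)=%.2f  E=%.2f\n", i, p_meet, energy[i]);
        if (p_meet >= confidence && (best < 0 || energy[i] < energy[best]))
            best = i;
    }
    printf("chosen level: %d\n", best);   /* lowest energy with P >= 0.8 */
    return 0;
}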

18.
Implementation and Optimization of the KLT Tracking Algorithm on the DM642
The Kanade-Lucas-Tomasi (KLT) algorithm is a tracking algorithm based on image feature points, consisting of two parts: extracting feature points of the target object and tracking those feature points. This paper first describes the basic principle of the KLT algorithm and analyzes the main factors affecting its execution speed. The analysis shows that the operations of the KLT algorithm are concentrated in multiply-accumulate operations and loops, with image convolution and loops accounting for most of the execution time. Several optimization strategies are proposed for the characteristics of the TMS320DM642 DSP hardware platform. The implementation is optimized by configuring the compilation environment, arranging data types appropriately, eliminating memory dependences, using intrinsic functions, and decomposing nested loops. Experimental results show that the optimized code runs more than three times faster than before optimization.
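A minimal sketch, in plain C rather than TI intrinsics, of two of the optimizations listed above: 'restrict' removes the assumed aliasing between input and output so the compiler can software-pipeline the convolution, and the inner loop is unrolled by two to expose independent multiply-accumulates. The filter length, data and tap values are illustrative.

#include <stdio.h>

#define N    16
#define TAPS 4

static void conv(const short *restrict in, const short *restrict coef,
                 int *restrict out, int n) {
    for (int i = 0; i + TAPS <= n; i++) {
        int acc = 0;
        for (int k = 0; k < TAPS; k += 2)        /* inner loop unrolled by 2 */
            acc += coef[k] * in[i + k] + coef[k + 1] * in[i + k + 1];
        out[i] = acc;
    }
}

int main(void) {
    short in[N], coef[TAPS] = {1, 2, 3, 4};
    int out[N - TAPS + 1];
    for (int i = 0; i < N; i++) in[i] = (short)i;

    conv(in, coef, out, N);
    printf("out[0] = %d, out[1] = %d\n", out[0], out[1]);  /* 20, 30 */
    return 0;
}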

19.
In this paper, the problem of providing a fully predictable execution environment for critical and hard real-time applications on embedded and DSP-based platforms is studied from the viewpoint of system architecture and operation. We introduce a set of homogeneous models for time, signals and tasks, which further serve as a basis for describing the architecture and operation of a particular hard real-time kernel, "HARETICK". The kernel provides support for the concurrent operation of hard real-time tasks (the HRT execution environment), using non-preemptive scheduling algorithms, along with soft real-time tasks (the SRT environment), using classical, preemptive, priority-based scheduling algorithms. A set of applications has been developed to test the correct operation of the HARETICK kernel according to the theoretical models and to evaluate its ability to provide highly predictable execution for critical applications. Some of the main testing results are also discussed in the paper.

20.
Strict real-time processing and energy efficiency are required by high-performance Digital Signal Processing (DSP) applications. Scratch-Pad Memory (SPM), a software-controlled on-chip memory with small area and low energy consumption, has been widely used in many DSP systems. Various data placement algorithms have been proposed to manage data on SPMs effectively. However, none of them provides an optimal solution to the data placement problem for array data in loops. In this paper, we study the problem of how to optimally place array data in loops into multiple types of memory units such that the energy and time costs of memory accesses are minimized. We design a dynamic programming algorithm, Iterational Optimal Data Placement (IODP), to solve the data placement problem for loops on processor architectures with multiple types of memory units. According to the experimental results, the IODP algorithm reduces energy consumption by 20.04% and 8.98% compared with a random memory placement method and a greedy algorithm, respectively. It also reduces memory access time by 19.01% and 8.62% compared with the random placement method and the greedy approach.
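A simplified stand-in for memory-aware data placement (this is not the IODP algorithm): if each array either fits entirely in a small SPM with cheap accesses or stays in off-chip memory with expensive accesses, choosing the SPM contents is a 0/1 knapsack that dynamic programming solves exactly. The array sizes, access counts and per-access costs are assumptions.

#include <stdio.h>
#include <string.h>

#define NARR   4
#define SPM_KB 8

int main(void) {
    int  size_kb[NARR]  = {3, 4, 2, 5};              /* footprint of each array */
    long accesses[NARR] = {9000, 4000, 7000, 1000};  /* profiled access counts  */
    long cost_spm = 1, cost_ext = 10;                /* cost per access, assumed */

    /* best[c] = maximum access cost saved using at most c KB of SPM */
    long best[SPM_KB + 1];
    memset(best, 0, sizeof best);

    for (int a = 0; a < NARR; a++) {
        long save = accesses[a] * (cost_ext - cost_spm);
        for (int c = SPM_KB; c >= size_kb[a]; c--)   /* classic 0/1 knapsack DP */
            if (best[c - size_kb[a]] + save > best[c])
                best[c] = best[c - size_kb[a]] + save;
    }

    long base = 0;
    for (int a = 0; a < NARR; a++) base += accesses[a] * cost_ext;

    printf("all off-chip cost : %ld\n", base);
    printf("with SPM placement: %ld\n", base - best[SPM_KB]);
    return 0;
}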
