Found 20 similar documents; search took 31 milliseconds.
1.
Sissades Tongsima Chantana Chantrapornchai Edwin H.-M. Sha Nelson L. Passos 《The Journal of VLSI Signal Processing》1998,18(2):111-123
Computation-intensive DSP applications usually require parallel/pipelined processors in order to meet specific timing requirements. Data hazards are a major obstacle to the high performance of pipelined systems. This paper presents a novel, efficient loop scheduling algorithm that reduces data hazards for such DSP applications. The algorithm has been embedded in a tool, called SHARP, which schedules a pipelined data flow graph to multiple pipelined units while hiding the underlying data hazards and minimizing the execution time. This paper reports significant improvement for some well-known benchmarks, showing the efficiency of the scheduling algorithm and the flexibility of the simulation tool.
2.
Ramanujam J. Jinpyo Hong Kandemir M. Narayan A. Agarwal A. 《Signal Processing, IEEE Transactions on》2006,54(1):286-294
Most embedded systems have a limited amount of memory. In contrast, the memory requirements of the digital signal processing (DSP) and video processing codes (in nested loops, in particular) running on embedded systems are significant. This paper addresses the problem of estimating and reducing the amount of memory needed for transfers of data in embedded systems. First, the problem of estimating the region associated with a statement, i.e., the set of elements referenced by a statement during the execution of nested loops, is analyzed. For a fixed execution ordering, a quantitative analysis of the number of elements referenced is presented; exact expressions for uniformly generated references and close upper and lower bounds for nonuniformly generated references are derived. Second, in addition to presenting an algorithm that computes the total memory required, this paper also discusses the effect of transformations (that change the execution ordering) on the lifetimes of array variables, i.e., the time between the first and last accesses to a given array location. The term maximum window size is introduced, and quantitative expressions are derived to compute the maximum window size. A detailed analysis of the effect of unimodular transformations on data locality, including the calculation of the maximum window size, is presented.
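The two quantities analyzed above can be illustrated by brute force (the loop bounds and the single reference A[i+j] are invented for illustration; the paper derives closed-form expressions rather than simulating):

```python
# Brute-force versions of the paper's two quantities for one invented
# loop nest: for i in 0..n-1: for j in 0..m-1: use A[i+j].

def referenced_elements(n, m):
    """Number of distinct elements of A touched by the loop nest."""
    touched = set()
    for i in range(n):
        for j in range(m):
            touched.add(i + j)
    return len(touched)

def max_window_size(n, m):
    """Maximum number of array locations live at once, where a location
    is live between its first and last access (the 'window')."""
    first, last = {}, {}
    t = 0
    for i in range(n):
        for j in range(m):
            loc = i + j
            first.setdefault(loc, t)
            last[loc] = t
            t += 1
    peak = 0
    for time in range(t):
        live = sum(1 for loc in first if first[loc] <= time <= last[loc])
        peak = max(peak, live)
    return peak

print(referenced_elements(4, 4))  # uniformly generated reference: n + m - 1 = 7
print(max_window_size(4, 4))
```

For this uniformly generated reference the footprint matches the closed form n + m − 1, which is the kind of exact expression the paper derives.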
3.
Mohammad Ali Ghodrat Tony Givargis Alex Nicolau 《Design Automation for Embedded Systems》2009,13(3):193-221
We present a novel loop transformation technique, particularly well suited for optimizing embedded compilers, where an increase
in compilation time is acceptable in exchange for significant performance increase. The transformation technique optimizes
loops containing nested conditional blocks. Specifically, the transformation takes advantage of the fact that the Boolean
value of the conditional expression, determining the true/false paths, can be statically analyzed using a novel interval analysis
technique that can evaluate conditional expressions in general polynomial form. Results from interval analysis combined with loop dependency information are used to partition the iteration space of the nested loop. In such cases, the loop nest is decomposed so as to eliminate the conditional test, thus substantially reducing the execution time. Our technique completely
eliminates the conditional from the loops (unlike previous techniques) thus further facilitating the application of other
optimizations and improving the overall speedup. Applying the proposed transformation technique on loop kernels taken from
Mediabench, SPEC-2000, mpeg4, qsdpcm and gimp, on average we measured a 2.34X speedup when running on an UltraSPARC processor, a 2.92X speedup when running on an Intel
Core Duo processor, a 2.44X speedup when running on a PowerPC G5 processor and a 2.04X speedup when running on an ARM9 processor.
Performance improvement, taking the entire application into account, was also promising: for 3 selected applications (mpeg-enc, mpeg-dec and qsdpcm) we measured a 15% speedup in the best case (5% on average) for the whole application.
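A minimal illustration of the iteration-space partitioning idea (the kernel and its affine condition are invented; the paper handles conditionals in general polynomial form via interval analysis):

```python
# When the branch condition depends only on the loop index and its truth
# value can be resolved statically, the iteration space can be split at
# the boundary and the conditional removed from both resulting loops.

def kernel_with_branch(a, thresh):
    out = []
    for i in range(len(a)):
        if i < thresh:          # condition depends only on i
            out.append(a[i] * 2)
        else:
            out.append(a[i] + 1)
    return out

def kernel_partitioned(a, thresh):
    # The loop is split at the statically known boundary, so each
    # resulting loop body is branch-free.
    split = min(max(thresh, 0), len(a))
    out = [a[i] * 2 for i in range(split)]
    out += [a[i] + 1 for i in range(split, len(a))]
    return out

data = [1, 2, 3, 4, 5]
assert kernel_with_branch(data, 3) == kernel_partitioned(data, 3)  # [2, 4, 6, 5, 6]
```

The branch-free loops are what then enables further optimizations (vectorization, software pipelining) that the abstract alludes to.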
4.
An embedded system is called a multi-mode embedded system if it performs multiple applications by dynamically reconfiguring the system functionality. Further, the
embedded system is called a multi-mode multi-task embedded system if it additionally supports multiple tasks to be executed in a mode. In this paper, we address an important
HW/SW partitioning problem, that is, HW/SW partitioning of multi-mode multi-task embedded applications with timing constraints
of tasks. The objective of the optimization problem is to find a minimal total system cost of allocation/mapping of processing
resources to functional modules in tasks, together with a schedule that satisfies the timing constraints. The key to solving the problem lies in how fully the potential parallelism among the executions of modules can be exploited. However, because the search space of this parallelism is inherently very large, and to keep schedulability analysis tractable, prior HW/SW partitioning methods have not been able to fully exploit the potential parallel execution of modules. To overcome this limitation, we propose a set of comprehensive HW/SW partitioning techniques that solve the three
subproblems of the partitioning problem simultaneously: (1) allocation of processing resources, (2) mapping the processing resources to the modules in tasks, and (3) determining
an execution schedule of modules. Specifically, based on a precise measurement of the parallel execution and schedulability of modules, we develop a stepwise refinement partitioning technique for single-mode multi-task applications, which aims to solve subproblems (1), (2), and (3) effectively in an integrated fashion. The proposed technique is then extended to solve the HW/SW partitioning problem of multi-mode multi-task applications (i.e., to find a globally optimized allocation/mapping of processing resources with a feasible execution schedule of modules). Experiments with a set of real-life applications show that the proposed techniques reduce the implementation cost by 19.0% and 17.0% for single- and multi-mode multi-task applications, respectively, compared with the conventional method.
5.
To address the performance loss caused by frequent communication between the MPU and the DSP in heterogeneous dual-core embedded systems, this paper proposes an intelligent task controller for multimedia applications. The controller performs task management dynamically and effectively resolves inter-processor communication. Using an ESL design methodology, a dual-core virtual platform is built, with JPEG encoding of a 256×256-pixel image as the target application. The results show that the intelligent controller reduces task management time by 15%–68%.
6.
Jagadeesh Sankaran Ching-Yu Hung Branislav Kisačanin 《Journal of Signal Processing Systems》2014,75(2):95-107
In this paper we introduce EVE (embedded vision/vector engine), with a FlexSIMD (flexible SIMD) architecture highly optimized for embedded vision. We show how EVE can be used to meet the growing requirements of embedded vision applications in a power- and area-efficient manner. EVE's SIMD features allow it to accelerate low-level vision functions (such as image filtering, color-space conversion, pyramids, and gradients). With the added flexibility of data accesses, EVE can also be used to accelerate many mid-level vision tasks (such as connected components, integral image, histogram, and Hough transform). Our experiments with a silicon implementation of EVE show that it performs many low- and mid-level vision functions with a 3–12x speed advantage over a C64x+ DSP, while consuming less power and area. EVE also achieves code size savings of 4–6x over a C64x+ DSP for regular loops. Thanks to its flexibility and programmability, we were able to implement two end-to-end vision applications on EVE and achieve more than a 5x application-level speedup over a C64x+. With EVE as a coprocessor next to a DSP or a general-purpose processor, algorithm developers have the option to accelerate the low- and mid-level vision functions on EVE. This gives them more room to innovate and to use the DSP for new, more complex, high-level vision algorithms.
7.
Runtime Overhead Analysis and Improvement of the RM Scheduling Algorithm  Cited by 2 (self-citations: 0, other citations: 2)
The RM (rate-monotonic) algorithm is a classic fixed-priority real-time scheduling algorithm. In embedded real-time systems, however, the workload often consists of many tasks with high activation rates and short execution times, so directly applying RM scheduling reduces resource utilization because of the context-switch overhead incurred in the real-time operating system. This paper analyzes the preemption relationships among RM-scheduled tasks and builds a context-switch overhead model parameterized by task attributes. Based on this model, task release times are optimized to reduce the runtime context-switch overhead introduced by RM scheduling. Experimental results confirm the effectiveness of the strategy.
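The RM policy that the paper builds on can be sketched in a few lines: rate-monotonic priority assignment plus the classic Liu-Layland utilization bound as a sufficient schedulability test (the task sets are invented examples; the paper's contribution, the context-switch overhead model and release-time optimization, is not reproduced here):

```python
# Rate-monotonic basics: shorter period => higher priority, and the
# Liu-Layland utilization bound as a sufficient (not necessary)
# schedulability test for n periodic tasks.

def rm_priorities(tasks):
    """tasks: list of (period, wcet); returns tasks sorted highest priority first."""
    return sorted(tasks, key=lambda t: t[0])

def ll_bound_schedulable(tasks):
    """Sufficient test: total utilization U <= n * (2^(1/n) - 1)."""
    n = len(tasks)
    u = sum(c / p for p, c in tasks)
    return u <= n * (2 ** (1 / n) - 1)

tasks = [(50, 12), (40, 10), (30, 10)]
print(rm_priorities(tasks))         # the (30, 10) task gets the highest priority
print(ll_bound_schedulable(tasks))  # U ≈ 0.823 exceeds the n=3 bound ≈ 0.780
```

Note the bound is only sufficient: a task set failing it may still be schedulable, which is why exact response-time analysis (and, in this paper, explicit context-switch accounting) matters.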
8.
Ming-Yung Ko Chung-Ching Shen Shuvra S. Bhattacharyya 《Journal of Signal Processing Systems》2008,50(2):163-177
Digital signal processing (DSP) applications involve processing long streams of input data. It is important to take into account
this form of processing when implementing embedded software for DSP systems. Task-level vectorization, or block processing,
is a useful dataflow graph transformation that can significantly improve execution performance by allowing subsequences of
data items to be processed through individual task invocations. In this way, several benefits can be obtained, including reduced
context switch overhead, increased memory locality, improved utilization of processor pipelines, and use of more efficient
DSP-oriented addressing modes. On the other hand, block processing generally results in increased memory requirements, since
it effectively increases the sizes of the input and output values associated with processing tasks. In this paper, we investigate
the memory-performance trade-off associated with block processing. We develop novel block processing algorithms that carefully
take into account memory constraints to achieve efficient block processing configurations within given memory space limitations.
Our experimental results indicate that these methods derive optimal memory-constrained block processing solutions most of
the time. We demonstrate the advantages of our block processing techniques on practical kernel functions and applications
in the DSP domain.
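The memory-performance trade-off of block processing can be caricatured with a toy amortized-cost model (the cost model and all numbers are invented; the paper's algorithms operate on dataflow graphs and buffer bounds, not on this scalar model):

```python
# A larger block (vectorization) factor b amortizes the per-invocation
# overhead over more data items, but multiplies buffer sizes. This picks
# the largest b whose buffers fit a memory budget and reports the
# amortized cost per item.

def best_block_factor(per_invocation_overhead, per_item_cost,
                      buffer_bytes_per_item, memory_budget, max_b=1024):
    best = None
    for b in range(1, max_b + 1):
        if b * buffer_bytes_per_item > memory_budget:
            break  # buffers for block factor b no longer fit
        # amortized cost per data item with block factor b
        cost = per_item_cost + per_invocation_overhead / b
        if best is None or cost < best[1]:
            best = (b, cost)
    return best

b, cost = best_block_factor(per_invocation_overhead=100, per_item_cost=5,
                            buffer_bytes_per_item=8, memory_budget=512)
print(b, cost)  # the memory cap (512 / 8 = 64) limits the block factor
```

In this monotone model the memory budget is always the binding constraint, which mirrors the paper's point that block processing is limited by buffer memory rather than by diminishing performance returns alone.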
9.
Benini L. Macchiarulo L. Macii A. Poncino M. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2002,10(2):96-105
Memory-processor integration offers new opportunities for reducing the energy of a system. In the case of embedded systems, where memory access patterns can typically be profiled at design time, one solution consists of mapping the most frequently accessed addresses onto on-chip SRAM to guarantee power and performance efficiency. In this work, we propose an algorithm for the automatic partitioning of on-chip SRAMs into multiple banks. Starting from the dynamic execution profile of an embedded application running on a given processor core, we synthesize a multi-banked SRAM architecture optimally fitted to the execution profile. The algorithm computes an optimal solution to the problem under realistic assumptions on the power cost metrics and with constraints on the number of memory banks. The partitioning algorithm is integrated with the physical design phase into a complete flow that allows the back-annotation of layout information to drive the partitioning process. Results, collected on a set of embedded applications for the ARM processor, have shown average energy savings of around 34%.
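A greedy sketch of the profile-driven idea (the energy model, the one-word-per-address granularity, and the profile are invented; the paper computes an optimal partition under a realistic power model, not this greedy one):

```python
# Profile-driven bank assignment: the hottest addresses are placed in
# the cheapest-per-access bank first, until its capacity is exhausted.

def partition(profile, banks):
    """profile: {addr: access_count}; banks: list of (capacity_words,
    energy_per_access), cheapest first. Each address occupies one word.
    Returns (addr -> bank index, total energy)."""
    assignment, energy = {}, 0.0
    free = [cap for cap, _ in banks]
    # visit addresses from hottest to coldest
    for addr, count in sorted(profile.items(), key=lambda kv: -kv[1]):
        for i, (_cap, e) in enumerate(banks):
            if free[i] > 0:
                free[i] -= 1
                assignment[addr] = i
                energy += count * e
                break
    return assignment, energy

profile = {0x10: 900, 0x20: 50, 0x30: 400, 0x40: 10}
banks = [(2, 1.0), (4, 3.0)]   # small cheap bank, larger costlier bank
assign, energy = partition(profile, banks)
print(assign)   # the two hottest addresses land in bank 0
print(energy)
```

The greedy choice is optimal only for this simplified model (uniform word sizes, per-access energy independent of layout); the paper's algorithm also folds in layout back-annotation.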
10.
Excessively long configuration time is a major factor limiting the overall performance of reconfigurable systems, and sound task scheduling can effectively reduce it. For coarse-grained dynamically reconfigurable systems (CGDRS) and streaming applications with data dependencies, this paper proposes a three-dimensional task scheduling model. Based on this model, a configuration-prefetching task scheduling algorithm (CPSA) is designed; then, exploiting configuration reuse between tasks, interval and continuous configuration-reuse strategies are proposed and used to improve CPSA. Experimental results show that CPSA effectively avoids scheduling deadlock, shortens the execution time of streaming applications, and improves the scheduling success rate. Compared with other scheduling algorithms, it shortens streaming-application execution time by 6.13%–19.53% on average.
11.
Inki Hong Miodrag Potkonjak Marios Papaefthymiou 《Design Automation for Embedded Systems》1999,4(4):311-327
Scheduling is one of the most often addressed optimization problems in DSP compilation, behavioral synthesis, and system-level synthesis research. With the rapid pace of changes in modern DSP application requirements and implementation technologies, however, new types of scheduling challenges arise. This paper is concerned with the problem of scheduling blocks of computations in order to optimize the efficiency of their execution on programmable embedded systems under a realistic timing model of their processors. We describe an effective scheme for scheduling the blocks of any computation on a given system architecture and with a specified algorithm implementing each block. We also present algorithmic techniques for performing optimal block scheduling simultaneously with optimal architecture and algorithm selection. Our techniques address the block scheduling problem for both single- and multiple-processor system platforms and for a variety of optimization objectives including throughput, cost, and power dissipation. We demonstrate the practical effectiveness of our techniques on numerous designs and synthetic examples.
12.
13.
Jiayin Li Meikang Qiu Jian-Wei Niu Yongxin Zhu Meiqin Liu Tianzhou Chen 《Journal of Signal Processing Systems》2012,67(3):239-253
A distributed mobile DSP system consists of a group of mobile devices with different computing powers. These devices are connected
by a wireless network. Parallel processing in the distributed mobile DSP system can provide high computing performance. Due
to the fact that most mobile devices are battery-powered, the lifetime of a mobile DSP system depends on both the battery behavior and the energy consumption characteristics of tasks. In this paper, we present a systematic system model for task scheduling in a mobile DSP system equipped with Dynamic Voltage Scaling (DVS) processors and energy harvesting techniques. We propose three-phase algorithms to obtain task schedules with shorter total execution time while satisfying the system lifetime constraints. Simulations with randomly generated Directed Acyclic Graphs (DAGs) show that our proposed algorithms generate optimal schedules that satisfy the lifetime constraints.
14.
The co-synthesis of hardware–software systems for complex embedded applications has been studied extensively with focus on
various qualitative system objectives such as high-speed performance and low power dissipation. One of the main challenges in the construction of multiprocessor systems for complex real-time applications is to provide high levels of system availability that satisfy users' expectations. Even though the area of hardware-software co-synthesis has been studied extensively
in the recent past, the issues that specifically relate to design exploration for highly available architectures need to be
addressed more systematically and in a manner that supports active user participation.
In this paper, we propose a user-centric co-synthesis mechanism for generating gracefully degrading, heterogeneous multiprocessor
architectures that fulfill the dual objectives of achieving real-time performance as well as ensuring high levels of system
availability at acceptable cost. A flexible interface allows the user to specify rules that effectively capture the users’
perceived availability expectations under different working conditions. We propose an algorithm to map these user requirements
to the importance attached to the subset of services provided during any functional state. The system availability is evaluated
on the basis of these user-driven importance values and a CTMC model of the underlying fail-repair process. We employ a stochastic
timing model in which all the relevant performance parameters such as task execution times, data arrival times and data communication
times are taken to be random variables. A stochastic scheduling algorithm assigns start and completion time distributions
to tasks. A hierarchical genetic algorithm optimizes the selections of resources, i.e. processors and busses, and the task
allocations.
We report the results of a number of experiments performed with representative task graphs. Analysis shows that the co-synthesis
tool we have developed is effectively driven by the user’s availability requirements as well as by the topological characteristics
of the task graph to yield high quality architectures. We experimentally demonstrate the edge provided by a stochastic timing
model in terms of performance assessment, resource utilization, system-availability and cost.
An erratum to this article is available at .
15.
Today’s embedded applications often consist of multiple concurrent tasks. These tasks are decomposed into sub-tasks which
are in turn assigned and scheduled on multiple different processors to achieve the Pareto-optimal performance/energy combinations.
Previous work introduced systematic approaches to explore performance-energy trade-offs for each individual task
and used the exploration results at run-time to fulfill system-level constraints. However, they did not exploit the fact that
the concurrent tasks can be executed in an overlapped fashion. In this paper, we propose a simple yet powerful on-line technique
that performs task overlapping by run-time subtask re-scheduling. By doing so, a multiprocessor system with concurrent tasks
can achieve better performance without extra energy consumption. We have applied our algorithm to a set of randomly-generated
task graphs, obtaining encouraging improvements over non-overlapped execution, and also achieving less overall energy consumption
than a previous DVS method for real-time tasks. Then, we have demonstrated the algorithm on real-life video- and image-processing
applications implemented on a dual-processor TI TMS320C6202 board: We have achieved a reduction of 22–29% in the application
execution time, while the impact of run-time scheduling overhead proved to be negligible (1.55%).
16.
17.
Voltage Assignment with Guaranteed Probability Satisfying Timing Constraint for Real-time Multiprocessor DSP  Cited by 1 (self-citations: 0, other citations: 1)
Meikang Qiu Zhiping Jia Chun Xue Zili Shao Edwin H.-M. Sha 《The Journal of VLSI Signal Processing》2007,46(1):55-73
Dynamic Voltage Scaling (DVS) is one of the techniques used to obtain energy-saving in real-time DSP systems. In many DSP
systems, some tasks contain conditional instructions that have different execution times for different inputs. Due to the
uncertainties in execution time of these tasks, this paper models each varied execution time as a probabilistic random variable
and solves the Voltage Assignment with Probability (VAP) problem. The VAP problem involves finding a voltage level to be used for each node of a data flow graph (DFG) in uniprocessor
and multiprocessor DSP systems. This paper proposes two optimal algorithms, one for uniprocessor and one for multiprocessor
DSP systems, to minimize the expected total energy consumption while satisfying the timing constraint with a guaranteed confidence
probability. The experimental results show that our approach achieves significantly greater energy savings than previous work. For example,
our algorithm for multiprocessor systems achieves an average improvement of 56.1% in total energy savings while satisfying the timing constraint with 0.80 confidence probability.
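The core of the voltage-assignment-with-probability idea can be sketched for a single task (the cycle-count distribution, voltage levels, and the E ∝ V² energy model are invented for illustration; the paper's optimal algorithms handle whole DFGs on uni- and multiprocessor systems):

```python
# Each voltage level trades speed for energy. Pick the lowest-expected-
# energy level whose probability of finishing by the deadline is at
# least the required confidence.

def pick_voltage(cycle_dist, levels, deadline, confidence):
    """cycle_dist: {cycles: probability}; levels: list of (voltage, speed)
    with speed in cycles per time unit. Returns (voltage, expected_energy)
    or None if no level meets the confidence requirement."""
    best = None
    for v, speed in levels:
        # probability that the task finishes by the deadline at this level
        p_meet = sum(p for c, p in cycle_dist.items() if c / speed <= deadline)
        if p_meet + 1e-12 >= confidence:
            # expected energy, assuming energy per cycle scales as V^2
            e = sum(p * c * v * v for c, p in cycle_dist.items())
            if best is None or e < best[1]:
                best = (v, e)
    return best

cycle_dist = {100: 0.7, 200: 0.3}          # branch-dependent cycle counts
levels = [(1.0, 100.0), (1.5, 200.0)]      # (voltage, speed) pairs
print(pick_voltage(cycle_dist, levels, deadline=1.2, confidence=0.8))
```

Here the low-voltage level meets the deadline only with probability 0.7, below the required 0.8, so the higher level must be chosen; relaxing the confidence requirement is exactly what lets the paper's approach save energy.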
18.
Implementation and Optimization of the KLT Tracking Algorithm on the DM642  Cited by 2 (self-citations: 0, other citations: 2)
The Kanade-Lucas-Tomasi (KLT) algorithm is a feature-point-based tracking algorithm consisting of two parts: extracting feature points of the target object and tracking them. This paper first describes the principles of the KLT algorithm and analyzes the main factors limiting its execution speed. The analysis shows that KLT is dominated by multiply-accumulate operations and loops, with image convolution and loop execution taking most of the time. Targeting the hardware characteristics of the TMS320DM642 DSP, several optimization strategies are proposed: configuring the compilation environment, choosing appropriate data types, eliminating memory dependences, using intrinsic functions, and decomposing multi-level loops. Experimental results show that the optimized code runs more than three times faster than the original.
19.
Mihai V. Micea Vladimir I. Cretu 《AEUE-International Journal of Electronics and Communications》2005,59(5):2968
In this paper, the problem of providing a fully predictable execution environment for critical and hard real-time applications on embedded and DSP-based platforms is studied from the viewpoint of system architecture and operation. We introduce a set of homogeneous models for time, signals, and tasks, which serve as a basis for describing the architecture and operation of a particular hard real-time kernel, "HARETICK". The kernel supports the concurrent operation of hard real-time tasks (the HRT execution environment), using non-preemptive scheduling algorithms, alongside soft real-time tasks (the SRT environment), which use classical preemptive, priority-based scheduling algorithms. A set of applications has been developed to test the correct operation of the HARETICK kernel against the theoretical models and to evaluate its ability to provide highly predictable execution for critical applications. Some of the main test results are also discussed in the paper.
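A minimal non-preemptive scheduler in the spirit of the HRT environment described above (the job set is invented, and earliest-deadline-first dispatching is just one plausible non-preemptive policy; HARETICK's actual scheduling algorithms are richer):

```python
# Jobs are dispatched earliest-deadline-first and, once started, run to
# completion — the non-preemption that makes execution fully predictable.

def np_edf_schedule(jobs):
    """jobs: list of (release, wcet, deadline). Returns a list of
    (job, start, finish) or None if some deadline is missed."""
    pending = sorted(jobs, key=lambda j: j[0])
    schedule, t = [], 0
    while pending:
        ready = [j for j in pending if j[0] <= t]
        if not ready:
            t = pending[0][0]                    # idle until next release
            continue
        job = min(ready, key=lambda j: j[2])     # earliest deadline first
        start = t
        t += job[1]                              # runs non-preemptively
        if t > job[2]:
            return None                          # deadline miss
        schedule.append((job, start, t))
        pending.remove(job)
    return schedule

print(np_edf_schedule([(0, 2, 5), (1, 1, 3), (4, 2, 8)]))
```

Because a running job is never preempted, start and finish times are exact rather than bounded, which is the predictability property the kernel is built around.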
20.
Jun Zhang Tan Deng Qiuyan Gao Qingfeng Zhuge Edwin H.-M. Sha 《Journal of Signal Processing Systems》2013,72(3):151-164
Strict real-time processing and energy efficiency are required by high-performance Digital Signal Processing (DSP) applications. Scratch-Pad Memory (SPM), a software-controlled on-chip memory with small area and low energy consumption, has been widely used in many DSP systems. Various data placement algorithms are proposed to effectively manage data on SPMs. However, none of them can provide optimal solution of data placement problem for array data in loops. In this paper, we study the problem of how to optimally place array data in loops to multiple types of memory units such that the energy and time costs of memory accesses can be minimized. We design a dynamic programming algorithm, Iterational Optimal Data Placement (IODP), to solve data placement problem for loops for processor architectures with multiple types of memory units. According to the experimental results, the IODP algorithm reduced the energy consumption by 20.04 % and 8.98 % compared with a random memory placement method and a greedy algorithm, respectively. It also reduced the memory access time by 19.01 % and 8.62 % compared with a random memory placement method and a greedy approach. 相似文献