Similar literature
 Found 20 similar documents (search time: 793 ms)
1.
Hardware/software partitioning is a key issue in the design of embedded systems when performance constraints have to be met and chip area and/or power dissipation are critical. For that reason, diverse approaches to automatic hardware/software partitioning have been proposed since the early 1990s. In all approaches so far, the granularity during partitioning is fixed, i.e., either small system parts (e.g., base blocks) or large system parts (e.g., whole functions/processes) can be swapped at once during partitioning in order to find the best hardware/software tradeoff. Since the deployment of a fixed granularity is likely to result in suboptimum solutions, we present the first approach that features a flexible granularity during hardware/software partitioning. Our approach is comprehensive insofar as the estimation techniques that control partitioning, including our multigranularity performance estimation technique described here in detail, are adapted to the flexible partitioning granularity. In addition, our multilevel objective function is described. It allows us to trade off various design constraints/goals (performance/hardware area) against each other. As a result, our approach is applicable to a wider range of applications than approaches with a fixed granularity. We also show that our approach is fast and that the obtained hardware/software partitions are much more efficient (in terms of hardware effort, for example) than in cases where a fixed granularity is deployed.

2.
This paper presents two heuristics for automatic hardware/software partitioning of system-level specifications. Partitioning is performed at the granularity of blocks, loops, subprograms, and processes with the objective of performance optimization with a limited hardware and software cost. We define the metric values for partitioning and develop a cost function that guides partitioning towards the desired objective. We consider minimization of communication cost and improvement of the overall parallelism as essential criteria during partitioning. Two heuristics for hardware/software partitioning, formulated as a graph partitioning problem, are presented: one based on simulated annealing and the other on tabu search. Results of extensive experiments, including real-life examples, show the clear superiority of the tabu search based algorithm.
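
The abstract above does not give the cost function or the neighborhood used, so the following is only a minimal sketch of how a tabu search over hardware/software assignments might look. The per-task numbers (`sw_time`, `hw_time`, `hw_area`), the area budget, the penalty weight, and the single-flip neighborhood are all assumptions made for illustration, not the authors' formulation.

```python
def cost(assign, sw_time, hw_time, hw_area, area_budget, penalty=1000.0):
    """Total execution time plus a penalty for exceeding the hardware area budget."""
    time = sum(hw_time[i] if in_hw else sw_time[i] for i, in_hw in enumerate(assign))
    area = sum(hw_area[i] for i, in_hw in enumerate(assign) if in_hw)
    return time + penalty * max(0, area - area_budget)


def tabu_partition(sw_time, hw_time, hw_area, area_budget, iters=50, tenure=2):
    """Toy tabu search: each move flips one task between software and hardware."""
    n = len(sw_time)
    current = [False] * n  # start with every task in software
    best = current[:]
    best_cost = cost(current, sw_time, hw_time, hw_area, area_budget)
    tabu_until = [0] * n  # task i may not be flipped again until this step has passed

    for step in range(1, iters + 1):
        candidates = []
        for i in range(n):
            neighbor = current[:]
            neighbor[i] = not neighbor[i]
            c = cost(neighbor, sw_time, hw_time, hw_area, area_budget)
            # Aspiration criterion: a tabu move is allowed if it beats the best so far.
            if tabu_until[i] < step or c < best_cost:
                candidates.append((c, i, neighbor))
        if not candidates:
            continue
        # Take the best admissible move, even if it makes the current solution worse.
        c, i, current = min(candidates, key=lambda t: (t[0], t[1]))
        tabu_until[i] = step + tenure
        if c < best_cost:
            best, best_cost = current[:], c
    return best, best_cost
```

Accepting the best admissible move even when it is worse than the current solution, while forbidding recently flipped tasks, is what lets the search leave local minima; simulated annealing achieves a similar effect with randomized acceptance instead of a tabu list.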

3.
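A program execution trace (hereafter, trace) is a record of the instruction stream of a program's execution; a trace fully records the content and order of the instructions executed while the program runs. For most programs, a few relatively short hot traces determine overall system performance. This paper proposes a hardware/software partitioning method that extracts acceleration modules based on program execution traces. A hot-trace extraction algorithm is used to partition the critical traces of the system into hardware, and branch assertions are used to construct atomic execution units, achieving a high speedup at a small hardware cost. In the experiments reported in this paper, compared with instruction-level fine-grained partitioning using simulated annealing, the performance obtained is on average 9.6% higher and the final hardware area is 29% smaller.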

4.
5.
Exploiting instruction-level parallelism (ILP) is extremely important for achieving high performance in application specific instruction set processors (ASIPs) and embedded processors. Unlike conventional general purpose processors, ASIPs and embedded processors typically run a single application and hence must be optimized extensively for this in order to extract maximum performance. Further, low power and low cost requirements of ASIPs may demand reuse of pipeline stages, causing pipelines with complex structural hazards. In such architectures, exploiting higher ILP is a major challenge to the designer. Existing techniques deal with either scheduling hardware pipelines to obtain higher throughput or software pipelining (an instruction scheduling technique for iterative computation) for exploiting greater ILP. We integrate these techniques to co-schedule hardware and software pipelines to achieve greater instruction throughput. In this paper, we develop the underlying theory of Co-Scheduling, called the Modulo-Scheduled Pipeline (or MS-Pipeline) theory. More specifically, we establish the necessary and sufficient condition for achieving the maximum throughput in a given pipeline operating under modulo scheduling. Further, we establish a sufficient condition to achieve a specified throughput, based on which we also develop a methodology for designing the hardware pipelines that achieve such a throughput. Further, we present initial experimental results which help to establish the usefulness of MS-Pipeline theory in software pipelining. As the proposed theory helps to analyze and improve the throughput of Modulo-Scheduled Pipelines (MS-Pipelines), it is especially useful in designing ASIPs and embedded processors.

6.
7.
A cell architecture for high performance digit-serial computation is presented. The design of this cell is based on the feedforward of the carry digit, which allows a high level of pipelining to increase the throughput rate. This will give designers greater flexibility in finding the best tradeoff between hardware cost and throughput rate. The effects of the number of pipelining levels on the throughput rate and hardware cost are presented.

8.
With the de facto transformation of technology into nano-technology, more and more functional components can be embedded on a single silicon die, thus enabling high-degree pipelining operations such as those required for multimedia applications. In recent years, system-on-chip designs have migrated from fairly simple single processor and memory designs to relatively complicated systems with multiple processors, on-chip memories, standard peripherals, and other functional blocks. The communication between these IP blocks is becoming the dominant critical system path and performance bottleneck of system-on-chip designs. Network-on-chip architectures, such as Virtual Channel (2004), Black-bus (2004), Pirate (2004), AEthereal (2005), and VICHAR (2006) architectures, emerged as promising solutions for future system-on-chip communication architecture designs. However, these existing architectures all suffer from certain problems, including high area cost and communication latency and/or low network throughput. This paper presents a novel network-on-chip architecture, Pipelining Multi-channel Central Caching, to address the shortcomings of the existing architectures. By embedding a central cache into every switch of the network, blocked head packets can be removed from the input buffers and stored in the caches temporarily, thus alleviating the effect of head-of-line and deadlock problems and achieving higher network throughput and lower communication latency without paying the price of higher area cost. Experimental results showed that the proposed architecture exhibits both hardware simplicity and system performance improvement compared to the existing network-on-chip architectures.

9.
This paper presents an overview of global open Ethernet (GOE) architecture as a cost-effective Ethernet-based virtual private network (VPN) solution, and discusses a hardware and software implementation of a prototype system. Three main approaches have been proposed for a VPN solution on metro-area networks: resilient packet ring, Ethernet over multiprotocol label switching (EoMPLS), and virtual bridged local area network-tag stacking (Q-in-Q). None of these schemes can satisfy the following requirements at the same time: network topology flexibility, affordable network functionalities, low equipment cost, and low operational cost. The proposed GOE system is designed to solve the VPN management problems of these approaches, offering MPLS VPN functionality at the low cost of an Ethernet-based solution. The key components of GOE are: 1) a novel GOE tag for high-speed switching and 2) a novel routing and protection module via per-destination multiple rapid spanning tree protocol (PD-MRSTP). Via the analytical performance evaluation of EoMPLS, Q-in-Q, and GOE, we show that the memory cost and the network utilization of GOE are two to three times smaller and 22% higher than the other approaches, respectively. We have also developed a GOE prototype system and obtained the following hardware and software performance results. The GOE core switch delivered 100% of theoretical maximum throughput (10 G) with zero packet loss even with the field programmable gate array platform, and its 10-G port density is 1.5 times denser than the best currently available products. The GOE switch using PD-MRSTP also delivered a protection switching time of 1.975 ms, which was significantly faster than legacy Ethernet switches. These performance evaluation results prove that the proposed GOE system can be used as a cost-effective high-performance Ethernet-based VPN solution.

10.
In this paper, we introduce a new verification platform with ARM- and DSP-based multiprocessor architecture. Its simple communication interface with a crossbar switch architecture is suitable for a heterogeneous multiprocessor platform. The platform is used to verify the function and performance of a DVB-T baseband receiver using hardware and software partitioning techniques with a seamless hardware/software co-verification tool. We present a dual-processor platform with an ARM926 and a Teak DSP, but it cannot satisfy the ETSI DVB-T standard specification EN 300 744. Therefore, we propose a new multiprocessor strategy with an ARM926 and three Teak DSPs synchronized at 166 MHz to satisfy the required specification of DVB-T.

11.
A design space exploration methodology for 1-D FFT processors is proposed to find the best hardware architecture in a quantitative way during early design. The methodology includes architecture candidate collection, coarse-grained architecture selection, and circuit level design optimizations. We show how to select a better architecture from candidates including different architectures (SDF, SDC, MDF, MDC and memory-based) with different degrees of parallelism at different radices. The sub-level designs, including designs of the rotator and data scaling module, are introduced for further optimizations. As a proof of concept, an FFT processor for 4G, WLAN and future 5G is designed, supporting 16-4096 and 12-2400 point FFTs. A memory-based architecture with a 16-datapath mixed-radix butterfly unit is selected to satisfy the demands for 1 GS/s (4096) throughput. The synthesis result based on 65 nm technology shows that the silicon cost and power consumption are 1.46 mm² and 68.64 mW respectively. The proposed processor has better normalized throughput per area unit and normalized FFTs per energy unit than the available state-of-the-art designs.

12.
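Crude oil pipeline leak detection and location system based on negative pressure wave and flow balance   Total citations: 1 (self-citations: 1, citations by others: 0)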
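Pipeline transportation has become an indispensable part of modern society, so research on pipeline leak detection systems has great economic and social value. This paper introduces the detection principles and location algorithms of the negative pressure wave and flow balance methods, and designs a leak detection and location system that couples the two. Its hardware structure and software design approach are described in detail. Practical application shows that the system is low in cost, stable, and reliable, detects pipeline leaks quickly, and locates them accurately.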
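
The abstract above does not spell out its location algorithm. As a hedged illustration only, the sketch below uses the textbook negative-pressure-wave relation: if the pressure drop reaches the upstream sensor at time t_up and the downstream sensor at time t_down, and the wave travels at speed a in a pipe of length L, then the leak lies at x = (L + a·(t_up − t_down)) / 2 from the upstream end. The function name, the parameters, and the simplification that fluid velocity is negligible compared with the wave speed are assumptions, not details taken from the paper.

```python
def locate_leak(pipe_length_m, wave_speed_mps, t_upstream_s, t_downstream_s):
    """Estimate the leak position (metres from the upstream sensor).

    Assumes a constant wave speed and ignores the fluid's own velocity.
    """
    dt = t_upstream_s - t_downstream_s  # negative when the leak is nearer upstream
    x = (pipe_length_m + wave_speed_mps * dt) / 2.0
    if not 0.0 <= x <= pipe_length_m:
        raise ValueError("arrival times are inconsistent with a leak inside the pipe")
    return x
```

For example, with a 10 km pipe, a wave speed of 1000 m/s, and arrival times of 3 s upstream and 7 s downstream, the function places the leak 3 km from the upstream sensor. A flow-balance check (comparing inlet and outlet flow) could be used alongside this to confirm that a detected pressure drop is really a leak.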

13.
A new cell architecture for high performance digit-serial computation is presented. The design of this cell is based on the feed forward of the carry digit, which allows a high level of pipelining to increase the throughput rate with minimum latency. This will give designers greater flexibility in finding the best trade-off between hardware cost and throughput rate. A twin-pipe architecture to double the throughput rate of digit-serial/parallel multipliers is also presented. The effects of the number of pipelining levels and the twin architecture on the throughput rate and hardware cost are presented. A two's complement digit-serial/parallel multiplier which can operate on both negative and positive numbers is also presented.

14.
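Building on a fixed-point SoC (System-on-Chip) architecture composed of a general-purpose RISC processor core and attached fixed-point hardware accelerators, this paper proposes a novel word-length design method for fixed-point hardware accelerators based on statistical analysis. The method uses statistical parameters to compute mathematically the minimum word length that satisfies different signal-to-noise ratio requirements, which effectively reduces chip area, power consumption, and manufacturing cost, so that applications of high computational complexity can run on low-cost RISC-core SoC chips without a DSP coprocessor.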

15.
Novel architectures for 1-D and 2-D discrete wavelet transform (DWT) by using lifting schemes are presented in this paper. An embedded decimation technique is exploited to optimize the architecture for 1-D DWT, which is designed to receive an input and generate an output with the low- and high-frequency components of original data being available alternately. Based on this 1-D DWT architecture, an efficient line-based architecture for 2-D DWT is further proposed by employing parallel and pipeline techniques, which is mainly composed of two horizontal filter modules and one vertical filter module, working in parallel and pipeline fashion with 100% hardware utilization. This 2-D architecture is called fast architecture (FA) and can perform J levels of decomposition for an $N \times N$ image in approximately $2N^2(1 - 4^{-J})/3$ internal clock cycles. Moreover, another efficient generic line-based 2-D architecture is proposed by exploiting the parallelism among four subband transforms in lifting-based 2-D DWT, which can perform J levels of decomposition for an $N \times N$ image in approximately $N^2(1 - 4^{-J})/3$ internal clock cycles; hence, it is called high-speed architecture. The throughput rate of the latter is twice that of the former 2-D architecture, with only a small additional hardware cost. Compared with the works reported in previous literature, the proposed architectures for 2-D DWT are efficient alternatives in the tradeoff among hardware cost, throughput rate, output latency, control complexity, etc.
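
To make the two cycle-count expressions in the abstract above concrete, the short sketch below evaluates them for a given image size and decomposition depth. It also checks them against one possible reading of where the (1 − 4^(−J))/3 factor comes from: each level processes a quarter as many samples as the previous one, and the architectures consume a fixed number of samples per cycle. The figures of 2 and 4 samples per cycle are an assumption chosen to reproduce the formulas, not something stated in the abstract, and the function names are hypothetical.

```python
def fa_cycles(n, levels):
    """Approximate cycle count of the fast architecture: 2N^2(1 - 4^-J)/3."""
    return 2 * n * n * (1 - 4 ** -levels) / 3


def hsa_cycles(n, levels):
    """Approximate cycle count of the high-speed architecture: N^2(1 - 4^-J)/3."""
    return n * n * (1 - 4 ** -levels) / 3


def level_sum_cycles(n, levels, samples_per_cycle):
    """Sum the samples processed at every level and divide by the assumed input rate."""
    total_samples = sum((n // 2 ** j) ** 2 for j in range(levels))
    return total_samples / samples_per_cycle


if __name__ == "__main__":
    n, levels = 512, 3
    print("FA cycles: ", fa_cycles(n, levels))
    print("HSA cycles:", hsa_cycles(n, levels))
    print("Level sum at 2 samples/cycle:", level_sum_cycles(n, levels, 2))
    print("Level sum at 4 samples/cycle:", level_sum_cycles(n, levels, 4))
```

The level-by-level sum is a geometric series, N²(1 + 1/4 + … + 1/4^(J−1)) = 4N²(1 − 4^(−J))/3, which is why halving the cycle count corresponds to doubling the throughput rate.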

16.
Two-dimensional (2D) convolution is a basic operation in digital signal processing, especially in image and video applications. Although its computation is conceptually simple, a sum of products of constants by variables, its implementation is highly demanding in terms of computational power, especially when addressed to real-time embedded systems. This work brings an innovative approach oriented to dynamically reconfigurable hardware. A flexible 2D convolver is deployed on an SRAM-based FPGA split into two parts: a static region and a partially reconfigurable region (PRR). To provide a universal solution, all the configurable aspects of the convolver (kernel dimensions, operands resolution, constant coefficients, pipeline stages, etc.) are allocated in the PRR. In this way, the convolver can self-adapt its structure on the fly, according to the characteristics of the image to be processed each time. Although there are many research articles in the literature encompassing the design of 2D convolution computers, to the best of the authors’ knowledge, this is the first work that implements a 2D convolver based on run-time reconfigurable hardware, while other approaches synthesize it either directly in software or in hardware as fully static designs. This pioneering alternative, which exploits key implementation aspects like parallelism, pipelining, flexibility and functional density, surpasses both the computational performance of software solutions and the cost-effectiveness of static hardware designs, while delivering an outstanding level of adaptability. The balanced time-area trade-off achieved with this technology makes it appropriate for high-performance low-cost embedded systems.
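
The "sum of products of constants by variables" mentioned in the abstract above can be written out directly. The following is a plain software reference sketch, not the reconfigurable hardware design the abstract describes; the function name and the choice to compute only the "valid" output region (no padding) are assumptions made to keep the example short.

```python
def convolve2d_valid(image, kernel):
    """Direct 2D convolution over the region where the kernel fits entirely.

    image and kernel are lists of equal-length rows of numbers.
    """
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    # The kernel is flipped in both axes, as convolution (not correlation) requires.
                    acc += kernel[kh - 1 - i][kw - 1 - j] * image[r + i][c + j]
            row.append(acc)
        out.append(row)
    return out
```

Every output pixel costs kh × kw multiply-accumulate operations, so the work grows with both the image size and the kernel size. The inner two loops are independent across output pixels, which is the parallelism a hardware convolver can exploit.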

17.
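SoC hardware/software design of a face detection system based on the AdaBoost algorithm   Total citations: 1 (self-citations: 0, citations by others: 1)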
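The AdaBoost face detection algorithm is computationally intensive and difficult to run in real time in pure software on an embedded platform. This paper analyzes the performance of the AdaBoost detection algorithm and designs a suitable hardware/software partitioning scheme. Most of the algorithm's computation is moved to a hardware accelerator, which greatly increases detection speed. The paper describes a cycle-accurate model of the whole system. Simulation shows that the SoC solution is 11 times as fast as pure software and, at a 200 MHz clock frequency, can detect faces in 384×288 images at 28 frames per second.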

18.
Tecs is a test case development methodology for the functional validation of large electronic systems, typically consisting of several custom hardware and software components. The methodology determines a hierarchical top-down test case development process including test case specification, validation, partitioning and implementation. The test case development process addresses the functional validation of the system and its components such as ASICs, boards, hardware and software modules; it does not facilitate timing or performance verification. The system functions are used to define test cases at the system level and to derive sub-functions for the system components. Test cases are specified, using a special purpose formalism, and validated before they are applied to the system under test. Furthermore, we propose a technique to partition test cases corresponding to the partitioning of the system into sub-systems and components. This technique can significantly reduce system simulation time because it allows the full validation of system functions by simulation at the sub-system and component level. The system model need only be simulated with a reduced set of stimuli to validate the interfaces between sub-systems. We present a test case specification language and tools that support the proposed methodology. The validation of a switching function illustrates methodology, language, and tools.

19.
In this paper, we propose a methodology for accelerating application segments by partitioning them between reconfigurable hardware blocks of different granularity. Critical parts are sped up on the coarse-grain reconfigurable hardware for meeting the timing requirements of application code mapped on the reconfigurable logic. The reconfigurable processing units are embedded in a generic hybrid system architecture which can model a large number of existing heterogeneous reconfigurable platforms. The fine-grain reconfigurable logic is realized by an FPGA unit, while the coarse-grain reconfigurable hardware is realized by a high-performance data-path that we developed. The methodology mainly consists of three stages: the analysis, the mapping of the application parts onto fine and coarse-grain reconfigurable hardware, and the partitioning engine. A prototype software framework realizes the partitioning flow. In this work, the methodology is validated using five real-life applications. Analytical partitioning experiments show that the speedup relative to the all-FPGA mapping solution ranges from 1.5 to 4.0, while the specified timing constraints are satisfied for all the applications.

20.
In this paper, we present efficient VLSI architectures for the full-search block-matching motion estimation (BMME) algorithm. Given a search range, we partition it into sub-search arrays called tiles. By fully exploiting data dependency within a tile, efficient VLSI architectures can be obtained. Using the proposed VLSI architectures, all the block-matchings in a tile can be processed in parallel. All the tiles within a search range can be processed serially or concurrently depending on various requirements. With the consideration of processing speed, hardware cost, and I/O bandwidth, the optimal tile size for a specific video application is analyzed. By partitioning a search range into tiles with appropriate size, flexible VLSI designs with different throughput can be obtained. In this way, cost effective VLSI designs for a wide range of video applications, from H.261 to HDTV, can be achieved.
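
As a software illustration of the idea in the abstract above, the sketch below runs a full search over a ±p displacement range, visiting the range one tile at a time. It assumes the sum of absolute differences (SAD) as the matching criterion, which the abstract does not specify, and all names and parameters are hypothetical. In hardware, the candidates inside one tile would be evaluated in parallel; here they are simply looped over.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between an n x n block and its displaced candidate."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(n)
        for x in range(n)
    )


def full_search_tiled(cur, ref, bx, by, n, p, tile):
    """Return (best_sad, dx, dy) over displacements in [-p, p], visited tile by tile."""
    h, w = len(ref), len(ref[0])
    best = None
    for ty in range(-p, p + 1, tile):          # tiles are processed serially here
        for tx in range(-p, p + 1, tile):
            # Candidates within one tile: the part a parallel array would handle at once.
            for dy in range(ty, min(ty + tile, p + 1)):
                for dx in range(tx, min(tx + tile, p + 1)):
                    inside = (0 <= by + dy and by + dy + n <= h
                              and 0 <= bx + dx and bx + dx + n <= w)
                    if not inside:
                        continue  # candidate block would fall outside the reference frame
                    c = sad(cur, ref, bx, by, dx, dy, n)
                    if best is None or c < best[0]:
                        best = (c, dx, dy)
    return best
```

The `tile` parameter plays the role of the design knob discussed in the abstract: a larger tile means more candidates evaluated at once (more hardware, fewer passes), while a smaller tile trades speed for lower cost.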
