
1.

A hardware-software codesign methodology for DSP applications (total citations: 1, self-citations: 0, citations by others: 1)

Kalavade A. Lee E.A. 《Design & Test of Computers, IEEE》1993,10(3):16-28

The authors describe a systematic, heterogeneous design methodology using the Ptolemy framework for simulation, prototyping, and software synthesis of systems containing a mixture of hardware and software components. They focus on signal-processing systems in which the hardware typically consists of custom data paths, finite-state machines (FSMs), glue logic and programmable processors. The software is one or more embedded programs running on the programmable components.

2.

BOAR: an advanced HW/SW coemulation environment for DSP system development

Jouni Isoaho Vesa Köppä Jarkko Oksala Pasi Ojala 《Microprocessors and Microsystems》1997,20(10):2330-615

The BOAR emulation system is targeted to hardware/software (HW/SW) codevelopment of advanced embedded DSP and telecom systems. The challenge of the BOAR system is efficient customization of programmable hardware, and a dedicated partitioning routine to target applications and structures, which allows quite high overall system performance. The system allows multiple configurations for communication between processors and field programmable gate arrays (FPGAs), making the BOAR system an efficient tool for real-time HW/SW coverification. The reprogrammable hardware of the emulation tool is based on four Xilinx 4000-series devices, two Texas TMS320C50 signal processors and one Motorola MC68302 microcontroller. With current devices the BOAR hardware provides approximately 40–70 kgates of logic capacity in DSP applications. The emulation capacity can be expanded by connecting several similar boards in a chain. The system also has a versatile internal reprogrammable test environment for test bench development, performance evaluations and design debugging. The logic development environment is based on the Synopsys synthesis tools and automatic design management software, which performs resource mapping and performance-driven design partitioning between FPGAs. The emulation hardware is currently connected to logic and software development environments via an RS-232C bus. The BOAR emulation system has been found to be a very efficient platform for real-life prototyping of different types of DSP algorithms and systems, and for validating correct functionality of a VHDL macro library.

3.

Analysis and Research on Network Processors (total citations: 54, self-citations: 0, citations by others: 54)

谭章熹 林闯 任丰源 周文江 《软件学报 (Journal of Software)》2003,14(2):253-267

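As link rates increase, networks are also seeing a large number of new protocols and services. Traditional network equipment generally relies on either dedicated hardware chips or purely software-based solutions, which makes it difficult to satisfy both performance and flexibility requirements. For this reason, the network processor, a parallel programmable device, has been introduced into the processing layer of routers (switches). Based on ASIP technology, it is optimized for network program processing and combines characteristics of both hardware and software solutions. The network processor turns the classic "store-forward" structure into "store-process-forward", which makes complex QoS control and payload processing possible. From the two perspectives of the network processor itself and its applications, the paper introduces related research work, analyzes system characteristics and the challenges faced, and looks ahead to future development directions.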

4.

Hardware–software optimizations of reconfigurable multi-core processors for floating-point computations of large sparse matrices

Xiaofang Wang 《Journal of Real-Time Image Processing》2014,9(1):187-204

State-of-the-art field-programmable gate array (FPGA) technologies have provided exciting opportunities to develop more flexible, less expensive, and higher-performance floating-point computing platforms for embedded systems. To better harness the full power of FPGAs and to bring FPGAs to more system designers, we investigate unique advantages and optimization opportunities in both software and hardware offered by multi-core processors on a programmable chip (MPoPCs). In this paper, we present our hardware customization and software dynamic scheduling solutions for LU factorization of large sparse matrices on in-house developed MPoPCs. Theoretical analysis is provided to guide the design. Implementation results on an Altera Stratix III FPGA for five benchmark matrices of size up to 7,917 × 7,917 are presented. Our hardware customization alone can reduce the execution time by up to 17.22%. The integrated hardware–software optimization improves the speedup by an average of 60.30%.

5.

Adapting FreeRTOS for multicores: an experience report (total citations: 1, self-citations: 0, citations by others: 1)

James Mistry Matthew Naylor Jim Woodcock 《Software》2014,44(9):1129-1154

Multicore processors are ubiquitous. Their use in embedded systems is growing rapidly, and given the constraints on uniprocessor clock speeds, their importance in meeting the demands of increasingly processor-intensive embedded applications cannot be overstated. To harness this potential, system designers need to have available to them embedded operating systems with built-in multicore support for widely available embedded hardware. This paper documents our experience of adapting FreeRTOS, a popular embedded real-time operating system, to support multiple processors. A working multicore version of FreeRTOS that is able to schedule tasks on multiple processors as well as provide full mutual-exclusion support for use in concurrent applications is presented. Mutual exclusion is achieved in an almost completely platform-agnostic manner, preserving one of FreeRTOS's most attractive features: portability. Copyright © 2013 John Wiley & Sons, Ltd.

6.

ETA: experience with an Intel Xeon processor as a packet processing engine

Regnier G. Minturn D. McAlpine G. Saletore V.A. Foong A. 《Micro, IEEE》2004,24(1):24-31

Server-based networks have well-documented performance limitations. These limitations outline a major goal of Intel's embedded transport acceleration (ETA) project, the ability to deliver high-performance server communication and I/O over standard Ethernet and transmission control protocol/Internet protocol (TCP/IP) networks. By developing this capability, Intel hopes to take advantage of the large knowledge base and ubiquity of these standard technologies. With the advent of 10 gigabit Ethernet, these standards promise to provide the bandwidth required of the most demanding server applications. We use the term packet processing engine (PPE) as a generic term for the computing and memory resources necessary for communication-centric processing. Such PPEs have certain desirable attributes; the ETA project focuses on developing PPEs with such attributes, which include scalability, extensibility, and programmability. General-purpose processors, such as the Intel Xeon in our prototype, are extensible and programmable by definition. Our results show that software partitioning can significantly increase the overall communication performance of a standard multiprocessor server. Specifically, partitioning the packet processing onto a dedicated set of compute resources allows for optimizations that are otherwise impossible when time sharing the same compute resources with the operating system and applications.

7.

Acceleration of software algorithms using hardware/software co-design techniques

M.D. Edwards J. Forrest A.E. Whelan 《Journal of Systems Architecture》1997,42(9-10):697-707

Currently there is significant interest in the design and implementation of embedded systems where the hardware and software subsystems are developed concurrently in order to meet design constraints. We present a development environment for general-purpose systems, where the objective is to accelerate the performance of software-based applications, which are specified by C programs. Such programs may be partitioned into hardware and software subsystems: a speed-critical region of the software is implemented in an FPGA in order to provide the performance acceleration. We also discuss two versions of the underlying system hardware architecture. Practical examples are given to illustrate our approach.

8.

Embedding of a real time image stabilization algorithm on a parameterizable SoPC architecture: a chip multi-processor approach

Lionel Damez Loic Sieler Alexis Landrault Jean Pierre Dérutin 《Journal of Real-Time Image Processing》2011,6(1):47-58

Highly regular multi-processor architectures are suitable for inherently highly parallelizable applications, such as most of those in the image processing domain. Systems embedded in a single programmable chip platform (SoPC) allow hardware designers to tailor every aspect of the architecture in order to match the specific application needs. These platforms are now large enough to embed an increasing number of cores, allowing implementation of a multi-processor architecture with an embedded communication network. In this paper we present the parallelization and the embedding of a real time image stabilization algorithm on a SoPC platform. Our overall hardware implementation method is based upon meeting algorithm processing power requirements and communication needs with refinement of a generic parallel architecture model. Actual implementation is done by the choice and parameterization of readily available reconfigurable hardware modules and customizable commercially available IPs (Intellectual Property). We present both software and hardware implementation with performance results on a Xilinx SoPC target.

9.

A Monitoring-Based Embedded System Design Technique (total citations: 6, self-citations: 0, citations by others: 6)

吴百锋 彭澄廉 孙晓光 《计算机学报 (Chinese Journal of Computers)》2003,26(12):1728-1733

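The paper proposes a hardware/software co-design method for embedded systems. It uses a dataflow graph as the system model to describe the functional and performance requirements of the embedded system and, through a specific implementation structure, lets designers use a rapid prototyping platform and event-driven monitoring to measure precisely how well the target system realizes the system model. As a result, the hardware/software co-design process, especially system optimization and performance verification, can be carried out on the basis of precise and reliable measurement data. Compared with the commonly used co-design methods that rely on performance estimates of hardware and software components, this measurement-based design technique is better able to ensure a sound design result.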

10.

Exploring embedded-systems architectures with Artemis

Pimentel A.D. Hertzberger L.O. Lieverse P. van der Wolf P. Deprettere E.E. 《Computer》2001,34(11):57-63

Because embedded systems mostly target mass production and often run on batteries, they should be cheap to realize and power efficient. In addition, they require a high degree of programmability to provide real-time performance for multiple applications and standards. However, performance requirements as well as cost and power-consumption constraints demand that substantial parts of these systems be implemented in dedicated hardware blocks. As a result, their heterogeneous system architecture consists of components ranging from programmable processor cores to fully dedicated hardware components for time-critical application tasks. Increasingly, these designs yield heterogeneous embedded multiprocessor systems that reside together on a single chip. The heterogeneity of these highly programmable systems and the varying demands of their target applications greatly complicate system design. The increasing complexity of embedded-system architectures makes predicting performance behavior more difficult. Therefore, having the appropriate tools to explore different choices at an early design stage is increasingly important. The Artemis modeling and simulation environment aims to efficiently explore the design space of heterogeneous embedded-systems architectures at multiple abstraction levels and for a wide range of applications targeting these architectures. The authors describe their application of this methodology in two studies that showed promising results, providing useful feedback on a wide range of design decisions involving the architectures for the two applications.

11.

FPGA–DSP co-processing for feature tracking in smart video sensors

Matteo Tomasi Shrinivas Pundlik Gang Luo 《Journal of Real-Time Image Processing》2016,11(4):751-767

Motion estimation in videos is a computationally intensive process. A popular strategy for dealing with such a high processing load is to accelerate algorithms with dedicated hardware such as graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and digital signal processors (DSPs). Previous approaches addressed the problem using accelerators together with a general-purpose processor, such as an Acorn RISC Machine (ARM). In this work, we present a co-processing architecture using FPGA and DSP. A portable platform for motion estimation based on sparse feature point detection and tracking is developed for real-time embedded systems and smart video sensor applications. A Harris corner detection IP core is designed with a customized fine-grain pipeline on a Virtex-4 FPGA. The detected feature points are then tracked using the Lucas–Kanade algorithm in a DSP that acts as a co-processor for the FPGA. The hybrid system offers a throughput of 160 frames per second (fps) for VGA image resolution. We have also tested the benefits of our proposed solution (FPGA + DSP) in comparison with two other traditional architectures and co-processing strategies: hybrid ARM + DSP and DSP only. The proposed FPGA + DSP system offers a speedup of about 20 times and 3 times over ARM + DSP and DSP only configurations, respectively. A comparison of the Harris feature detection algorithm performance between different embedded processors (DSP, ARM, and FPGA) reveals that the DSP offers the best performance when scaling up from QVGA to VGA resolutions.

12.

FPGA-based module for SURF extraction (total citations: 1, self-citations: 0, citations by others: 1)

Tomáš Krajník Jan Šváb Sol Pedre Petr Čížek Libor Přeučil 《Machine Vision and Applications》2014,25(3):787-800

We present a complete hardware and software solution of an FPGA-based computer vision embedded module capable of carrying out the SURF image feature extraction algorithm. Aside from image analysis, the module embeds a Linux distribution that allows running programs specifically tailored for particular applications. The module is based on a Virtex-5 FXT FPGA which features powerful configurable logic and an embedded PowerPC processor. We describe the module hardware as well as the custom FPGA image processing cores that implement the algorithm’s most computationally expensive process, the interest point detection. The module’s overall performance is evaluated and compared to CPU and GPU-based solutions. Results show that the embedded module achieves comparable distinctiveness to the SURF software implementation running on a standard CPU while being faster and consuming significantly less power and space. Thus, it allows the SURF algorithm to be used in applications with power and spatial constraints, such as autonomous navigation of small mobile robots.

13.

Automatic Design of Application Specific Instruction Set Extensions Through Dataflow Graph Exploration

Clark Nathan Zhong Hongtao Tang Wilkin Mahlke Scott 《International journal of parallel programming》2003,31(6):429-449

General-purpose processors are often incapable of achieving the challenging cost, performance, and power demands of high-performance applications. To meet these demands, most systems employ a number of hardware accelerators to off-load the computationally demanding portions of the application. As an alternative to this strategy, we examine customizing the computation capabilities of a processor for a particular application. The processor is extended with hardware in the form of a set of custom function units and instruction set extensions. To effectively identify opportunities for creating custom hardware, a dataflow graph design space exploration engine heuristically identifies candidate computation subgraphs without artificially constraining their size or shape. The engine combines estimates of performance gain, cost, and inherent limitations of the processor to grow candidate graphs in profitable directions while pruning unprofitable paths. This paper describes the dataflow graph exploration engine and evaluates its effectiveness across a set of embedded applications.
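The idea of growing candidate subgraphs in profitable directions while pruning unprofitable ones can be illustrated with a small greedy sketch. This is not the engine described in the paper: the function name `grow_candidate`, the area budget, the one-cycle gain model, and the sample graph are all assumptions made only for illustration.

```python
def grow_candidate(succ, sw_cycles, area, seed, area_budget):
    """Greedily grow a connected subgraph of a dataflow graph from `seed`.

    succ:        node -> list of successor nodes (the dataflow edges)
    sw_cycles:   node -> cycles the operation takes in software (assumed estimate)
    area:        node -> hardware cost of the operation (assumed estimate, > 0)
    area_budget: hypothetical limit on the total cost of one custom function unit
    """
    # Treat edges as undirected so the candidate can grow toward
    # both producers and consumers of the nodes already chosen.
    neighbours = {n: set() for n in sw_cycles}
    for src, dsts in succ.items():
        for dst in dsts:
            neighbours[src].add(dst)
            neighbours[dst].add(src)

    chosen = {seed}
    used_area = area[seed]
    while True:
        frontier = set().union(*(neighbours[n] for n in chosen)) - chosen
        # Prune directions that would exceed the area budget.
        affordable = [n for n in frontier if used_area + area[n] <= area_budget]
        if not affordable:
            break
        # Prefer the node that saves the most cycles per unit of area;
        # the node name breaks ties so the result is deterministic.
        best = max(affordable, key=lambda n: (sw_cycles[n] / area[n], n))
        chosen.add(best)
        used_area += area[best]

    # Assumed gain model: the whole subgraph becomes one 1-cycle custom instruction.
    gain = sum(sw_cycles[n] for n in chosen) - 1
    return chosen, gain, used_area


# Hypothetical five-operation dataflow chain: ld -> mul -> add -> shl -> st
succ = {"ld": ["mul"], "mul": ["add"], "add": ["shl"], "shl": ["st"], "st": []}
sw_cycles = {"ld": 2, "mul": 3, "add": 1, "shl": 1, "st": 2}
area = {"ld": 8, "mul": 6, "add": 1, "shl": 1, "st": 8}

chosen, gain, used_area = grow_candidate(succ, sw_cycles, area, "mul", area_budget=8)
print(sorted(chosen), gain, used_area)
```

With these made-up numbers the candidate grows from `mul` to include `add` and `shl`, and stops when the only remaining neighbours (`ld` and `st`) would exceed the budget. A real exploration engine would weigh many more factors, such as register-file port limits and critical-path delay.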

14.

A highly flexible, distributed multiprocessor architecture for network processing

《Computer Networks》2003,41(5):563-586

Network processors (NPs) are an emerging field of programmable processors that are optimized to implement data plane packet processing networking functions. Unlike the general-purpose CPUs that rely heavily on caching for improving performance, the lack of locality in packet processing and need for high-performance I/O have forced designers to come up with innovative architectures that can hide memory latency while still processing packets at high data rates. Most of these NPs use some type of multiprocessing in combination with a hierarchy of memory types to achieve high performance. In addition, to keep up with packets arriving at high data rates over multiple incoming media interfaces, an NP must perform fast I/O and memory operations such as packet storage, table lookup, and extraction of fields in packet headers. We describe an architecture that uses a combination of distributed memory architecture and one or more multithreaded processors to achieve the necessary performance. We describe the challenges in programming such a processor including the issues related to consistency and maintaining packet ordering. We also present a programming model for generic network applications that uses software pipelines. We then demonstrate the use of the programming model in implementing two applications, namely, mapping traffic management algorithms onto a multithreaded architecture and an implementation of a media gateway based on voice-over-AAL2.
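To make the packet-ordering issue concrete, the sketch below shows one generic way to restore arrival order after packets have been processed out of order by parallel threads: tag each packet with a sequence number on arrival and release packets only when the next expected number is available. The abstract does not say how the described architecture maintains ordering, so the `ReorderBuffer` class and its interface are assumptions for illustration only.

```python
import heapq


class ReorderBuffer:
    """Releases packets in sequence-number order, whatever order they finish in."""

    def __init__(self):
        self._next_seq = 0   # sequence number that must be released next
        self._pending = []   # min-heap of (seq, packet) waiting for earlier packets

    def push(self, seq, packet):
        """Accept a processed packet and return the packets that can now be sent."""
        heapq.heappush(self._pending, (seq, packet))
        released = []
        # Drain the heap as long as its smallest entry is the one we are waiting for.
        while self._pending and self._pending[0][0] == self._next_seq:
            released.append(heapq.heappop(self._pending)[1])
            self._next_seq += 1
        return released


# Hypothetical completion order: packet 1 finishes before packet 0.
buf = ReorderBuffer()
first = buf.push(1, "pkt-1")    # held back, packet 0 has not finished yet
second = buf.push(0, "pkt-0")   # releases both packets in arrival order
third = buf.push(2, "pkt-2")    # next in sequence, released immediately
print(first, second, third)
```

The sketch assumes sequence numbers are unique and start at zero. It trades a small amount of buffering latency for the freedom to process packets on any thread.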

15.

Accelerating thread-intensive and explicit memory management programs with dynamic partial reconfiguration

Qianming Yang Mei Wen Nan Wu Chunyuan Zhang 《The Journal of supercomputing》2013,63(2):508-537

Recent research has shown that field programmable gate arrays (FPGAs) have a large potential for accelerating demanding applications, such as high-performance digital signal processing applications with low-volume markets. The loss of generality in the architecture is one disadvantage of using FPGAs; however, the reconfigurability of FPGAs allows reprogramming for other applications. Therefore, a uniform FPGA-based architecture, an efficient programming model, and a simple mapping method are paramount for the wide acceptance of FPGA technology. This paper presents MASALA, a dynamically reconfigurable FPGA-based accelerator for parallel programs written in thread-intensive and explicit memory management (TEMM) programming models. Our system uses a TEMM programming model to parallelize demanding applications, including application decomposition into separate thread blocks and compute and data load/store decoupling. Hardware engines are included in MASALA using partial dynamic reconfiguration modules, each of which encapsulates a thread process engine that implements the hardware’s thread functionality. A data dispatching scheme is also included in MASALA to enable the explicit communication of multiple memory hierarchies such as interhardware engines, host processors, and hardware engines. Finally, this paper illustrates a multi-FPGA prototype system of the presented architecture: MASALA-SX. A large synthetic aperture radar image formatting experiment shows that MASALA’s architecture facilitates the construction of a TEMM program accelerator by providing greater performance and less power consumption than current CPU platforms, without sacrificing programmability, flexibility, and scalability.

16.

FPGA emulation methodology for fast and accurate power estimation of embedded processors

《Journal of Systems Architecture》2017

Early estimation of application-specific power consumption has become one of the major constraints of modern ASIC design. While in early stages of the design process precise power consumption can only be obtained from very time-consuming gate-level (GTL) simulation, power estimation methodologies aim to reduce computational overhead by deriving models to approximate power consumption on higher levels. This work presents an FPGA-accelerated power estimation methodology for programmable processors based on a hybrid functional level (FLPA) and instruction level power analysis (ILPA) that can be mapped onto an FPGA together with the functional emulation. It enables fast and accurate estimation of application-specific power consumption and energy per task, which is crucial for power-aware design of embedded processor architectures. The approach allows both hardware and software designers to optimize their implementations not only for processing performance but also for power efficiency. The power emulation methodology and considerations for the FPGA implementation of the power estimation are described in detail. Model validation against GTL power simulation and results are given for a typical embedded RISC processor and a commercial-grade Application Specific Instruction Set Processor (ASIP). Power consumption models yield fast and accurate power estimation with a %MAE of less than 9% and NRMSE of less than 7%, enabling co-optimization of both hardware and software with respect to power consumption in early design stages.
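For readers unfamiliar with the two error figures quoted in this abstract, the sketch below shows one common way to compute them. The abstract does not define its metrics, so both definitions are assumptions: %MAE is taken as the mean absolute error relative to the mean reference value, and NRMSE as the root-mean-square error divided by the range of the reference values. The power values are made up for illustration.

```python
import math


def percent_mae(reference, estimate):
    """Mean absolute error as a percentage of the mean reference value (assumed definition)."""
    if not reference or len(reference) != len(estimate):
        raise ValueError("inputs must be non-empty and of equal length")
    mae = sum(abs(r - e) for r, e in zip(reference, estimate)) / len(reference)
    mean_ref = sum(reference) / len(reference)
    return 100.0 * mae / mean_ref


def nrmse(reference, estimate):
    """Root-mean-square error normalised by the reference range (assumed definition)."""
    if not reference or len(reference) != len(estimate):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    return math.sqrt(mse) / (max(reference) - min(reference))


# Hypothetical per-task power values in mW: gate-level reference vs. model estimate.
gate_level = [10.0, 20.0, 30.0, 40.0]
emulated = [11.0, 19.0, 32.0, 38.0]

mae_pct = percent_mae(gate_level, emulated)
nrmse_value = nrmse(gate_level, emulated)
print(f"%MAE = {mae_pct:.2f}%, NRMSE = {nrmse_value:.4f}")
```

Other normalisations are in use (for example, dividing the RMSE by the mean instead of the range), so figures from different papers are comparable only when the definitions match.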

17.

The application kernel approach—a novel approach for adding SMP support to uniprocessor operating systems

Simon Kågström Håkan Grahn Lars Lundberg 《Software》2006,36(14):1563-1583

The current trend of using multiprocessor computers for server applications requires operating system adaptations to take advantage of more powerful hardware. However, modifying large bodies of software is very costly and time consuming, and the cost of porting an operating system to a multiprocessor might not be justified by the potential performance benefits. In this paper we present a novel method, the application kernel approach, for adaptation of an existing uniprocessor kernel to multiprocessor hardware. Our approach considers the existing uniprocessor kernel as a ‘black box’, to which no or very small changes are made. Instead, the original kernel runs operating system services unmodified on one processor whereas the other processors execute applications on top of a small custom kernel. We have implemented the application kernel for the Linux operating system, which illustrates that the approach can be realized with fairly small resources. We also present an evaluation of the performance and complexity of our approach, where we show that it is possible to achieve good performance while at the same time keeping the implementation complexity low. Copyright © 2006 John Wiley & Sons, Ltd.

18.

Reconfigurable media processing

《Parallel Computing》2002,28(7-8):1111-1139

Multimedia processing is becoming increasingly important, with a wide variety of applications ranging from multimedia cell phones to high definition interactive television. Media processing techniques typically involve the capture, storage, manipulation and transmission of multimedia objects such as text, handwritten data, audio objects, still images, 2D/3D graphics, animation and full-motion video. A number of implementation strategies have been proposed for processing multimedia data. These approaches can be broadly classified into two major categories, namely (i) general purpose processors with programmable media processing capabilities, and (ii) dedicated implementations (ASICs). We have performed a detailed complexity analysis of the recent multimedia standard (MPEG-4), which has shown the potential for reconfigurable computing, which adapts the underlying hardware dynamically in response to changes in the input data or processing environment. We therefore propose a methodology for designing a reconfigurable media processor. This involves hardware–software co-design implemented in the form of a parser, profiler, recurring pattern analyzer, and spatial and temporal partitioner. The proposed methodology enables efficient partitioning of resources for complex and time-critical multimedia applications.

19.

The softening of hardware

Vahid F. 《Computer》2003,36(4):27-34

In the 1940s, when modern computing began, engineers tended to view computers and the programs running on them as unified entities. Now, after decades in which software and hardware developed along separate paths, we seem to have come full circle. The hardware on which our programs run is thanks to embedded systems. These systems force designers to work under incredibly tight constraints. To understand the technologies developed to satisfy these constraints, we must first distinguish the underlying embedded systems elements.

20.

Scalability and Parallel Execution of Warp Processing: Dynamic Hardware/Software Partitioning

Roman Lysecky 《International journal of parallel programming》2008,36(5):478-492

Warp processors are a novel architecture capable of autonomously optimizing an executing application by dynamically re-implementing critical kernels within the software as custom hardware circuits in an on-chip FPGA. Previous research on warp processing focused on low-power embedded systems, incorporating a low-end ARM processor as the main software execution resource. We provide a thorough analysis of the scalability of warp processing by evaluating several possible warp processor implementations, from low-power to high-performance, and by evaluating the potential for parallel execution of the partitioned software and hardware. We further demonstrate that even considering a high-performance 1 GHz embedded processor, warp processing provides the equivalent performance of a 2.4 GHz processor. By further enabling parallel execution between the processor and FPGA, the parallel warp processor execution provides the equivalent performance of a 3.2 GHz processor.