1.

Chen Shuangyan, Zhang Tiejun, Wang Donghui, Hou Chaohuan 《Microelectronics & Computer》2006,23(6):42-44,48

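Based on Xtensa, Tensilica's configurable, extensible, and integrable processor, this paper implements an improved version of 2.4 kbps MELP, the U.S. federal standard speech codec for secure voice telephony. After a reasonable processor configuration is selected, the algorithm is run on an instruction-set simulator. The most frequently used operations in the algorithm are identified, and new instructions are added and implemented in hardware to improve performance. The results show that, at the cost of some additional hardware logic, the total number of cycles the algorithm requires is reduced to 47% of that on the unmodified processor core.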
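The cycle reduction reported above follows the usual profile-then-accelerate reasoning, which is Amdahl's law applied per operation: only the cycles spent in operations that receive a custom instruction shrink. A minimal Python sketch, assuming a flat profile of cycles per operation and a fixed speedup per new instruction; the operation names and numbers are hypothetical and do not come from the paper.

```python
def remaining_cycle_fraction(profile, speedups):
    """Fraction of baseline cycles left after accelerating some operations.

    profile  -- dict: operation name -> baseline cycles spent in it
    speedups -- dict: operation name -> speedup of its (hypothetical) custom instruction
    """
    total = sum(profile.values())
    new_total = 0.0
    for op, cycles in profile.items():
        # Operations without a custom instruction keep their original cost.
        new_total += cycles / speedups.get(op, 1.0)
    return new_total / total


profile = {"fir_mac": 500, "vq_search": 300, "other": 200}  # hypothetical cycle counts
print(remaining_cycle_fraction(profile, {"fir_mac": 4, "vq_search": 3}))  # 0.425
```

Even with 80% of the cycles accelerated by 3x to 4x, the untouched 20% accounts for almost half of what remains, which is why choosing the right operations to profile and accelerate matters.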

2.

Architecture and Compiler Optimizations for Data Bandwidth Improvement in Configurable Processors (times cited: 1; self-citations: 0; citations by others: 1)

《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2006,14(9):986-997

Many commercially available embedded processors are capable of extending their base instruction set for a specific domain of applications. While steady progress has been made in the tools and methodologies of automatic instruction set extension for configurable processors, the limited data bandwidth available in the core processor (e.g., the number of simultaneous accesses to the register file) becomes a potential performance bottleneck. In this paper, we first present a quantitative analysis of the data bandwidth limitation in configurable processors, and then propose a novel low-cost architectural extension and associated compilation techniques to address the problem. Specifically, we embed a single control bit in the instruction op-codes to selectively copy the execution results to a set of hash-mapped shadow registers in the write-back stage. This can efficiently reduce the communication overhead due to data transfers between the core processor and the custom logic. We also present a novel simultaneous global shadow register binding with a hash function generation algorithm to take full advantage of the extension. The application of our approach leads to a nearly optimal performance speedup.
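To make the shadow-register idea concrete, here is a toy Python model of the write-back behavior described above: every result goes to the core register file, and results whose instruction has the control bit set are also copied into a small shadow set that custom logic can read without using core register-file ports. The modulo hash, the class and function names, and the invalidation of stale copies are assumptions for illustration; the paper generates its hash function jointly with shadow register binding.

```python
class ShadowRegisterFile:
    """Toy model: core registers plus a small set of hash-mapped shadow registers.

    Assumptions (not from the paper): the shadow index is dest % num_shadow,
    and a write without the copy bit invalidates a now-stale shadow copy.
    """

    def __init__(self, num_regs=32, num_shadow=4):
        self.regs = [0] * num_regs
        self.shadow = [None] * num_shadow  # each entry: (source register, value)

    def shadow_index(self, reg):
        return reg % len(self.shadow)

    def write_back(self, dest, value, copy_bit):
        """Write-back stage: always update the core register file; if the
        instruction's control bit is set, also copy the result to a shadow register."""
        self.regs[dest] = value
        idx = self.shadow_index(dest)
        if copy_bit:
            self.shadow[idx] = (dest, value)
        elif self.shadow[idx] is not None and self.shadow[idx][0] == dest:
            self.shadow[idx] = None  # toy-model safety: drop the stale copy

    def read_shadow(self, reg):
        """Custom logic reads an operand from the shadow set, bypassing core read ports."""
        entry = self.shadow[self.shadow_index(reg)]
        if entry is None or entry[0] != reg:
            raise KeyError(f"r{reg} is not held in a shadow register")
        return entry[1]


def custom_add4(rf, a, b, c, d):
    """Hypothetical 4-operand custom instruction: a and b use the two core
    read ports, c and d are read from shadow registers."""
    return rf.regs[a] + rf.regs[b] + rf.read_shadow(c) + rf.read_shadow(d)
```

Two registers that hash to the same shadow slot evict each other, which is the conflict a good binding and hash function must avoid for values the custom logic needs at the same time.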

3.

A Survey of Audio Solutions for Embedded Processors (times cited: 1; self-citations: 0; citations by others: 1)

Kong Ji, Long Huqiang 《Information Technology》2008,32(8)

This paper introduces today's mainstream audio solutions for embedded processors, including Tensilica Xtensa HiFi 2, ARM AudioDE, and ARC Sound Subsystem.

4.

A Survey of Automatic Custom Instruction Identification for Extensible Processors

Xiao Chenglong, Wang Shanshan, Wang Xinlin, Lin Jun, Wang Jingyue 《Acta Electronica Sinica》2020,48(8):1655-1664

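In recent years, extensible processors have been used increasingly in embedded systems. Augmenting an extensible processor with custom instructions preserves a degree of flexibility while meeting the high-performance and low-power demands of embedded applications. Automatic custom instruction identification is one of the key problems in extensible processor design. In light of the application domains and development trends of extensible processors, this paper reviews recent research progress in automatic custom instruction identification. On that basis, it summarizes each of the key steps involved (intermediate representation generation, custom instruction enumeration, custom instruction selection, and code transformation) and analyzes the strengths and difficulties of the different methods. It then summarizes and analyzes applications of extensible processors by application domain. Finally, it discusses future trends and research directions for automatic custom instruction identification.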
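The enumeration step named in this survey can be illustrated with a brute-force search over a tiny data flow graph: a candidate custom instruction is a set of nodes that is convex (no path leaves the set and re-enters it) and respects limits on register-file inputs and outputs. This Python sketch is a hypothetical illustration only; practical identification algorithms prune this exponential search, and the graph encoding (a dict from each node to its predecessors, with names not in the dict treated as external inputs) is an assumption.

```python
from itertools import combinations


def descendants(succ, node):
    """All nodes reachable from `node` (excluding itself in an acyclic graph)."""
    seen, stack = set(), [node]
    while stack:
        for nxt in succ[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def enumerate_candidates(preds, live_out, max_in=2, max_out=1):
    """Brute-force enumeration of convex node sets meeting I/O port limits."""
    nodes = list(preds)
    succ = {n: [] for n in nodes}
    for n, ps in preds.items():
        for p in ps:
            if p in succ:  # external inputs have no node of their own
                succ[p].append(n)
    reach = {n: descendants(succ, n) for n in nodes}
    found = []
    for k in range(1, len(nodes) + 1):
        for subset in combinations(nodes, k):
            s = set(subset)
            ins = {p for n in s for p in preds[n] if p not in s}
            outs = {n for n in s if n in live_out or any(m not in s for m in succ[n])}
            # Convex: no outside node reachable from s can reach back into s.
            below_s = set().union(*(reach[n] for n in s))
            convex = all(not (reach[u] & s) for u in below_s - s)
            if convex and len(ins) <= max_in and len(outs) <= max_out:
                found.append(frozenset(s))
    return found


# Hypothetical DFG for (a + b) * (a - b)
dfg = {"add": ["a", "b"], "sub": ["a", "b"], "mul": ["add", "sub"]}
print(enumerate_candidates(dfg, live_out={"mul"}))
```

With two inputs and one output allowed, only the single operations and the full three-node pattern survive; pairs such as add and mul fail because they need three register-file inputs.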

5.

Design and implementation of a high performance network security processor

Haixin Wang, Guoqiang Bai, Hongyi Chen 《International Journal of Electronics》2013,100(3):309-325

6.

Application Specific Processor Design for H.264 Decoder with a Configurable Embedded Processor

Jin Ho Han, Mi Young Lee, Younghwan Bae, Hanjin Cho 《ETRI Journal》2005,27(5):491-496

An application specific processor for an H.264 decoder with a configurable embedded processor is designed in this research. The motion compensation, inverse integer transform, inverse quantization, and entropy decoding algorithm of H.264 decoder software are optimized. We improved the performance of the processor with instruction‐level hardware optimization, which is tailored to configurable embedded processor architecture. The optimized instructions for video processing can be used in other video compression standards such as MPEG-1, MPEG-2, and MPEG-4. A significant performance improvement is achieved with high flexibility. Experimental results show that we could achieve 300% performance for the H.264 baseline profile level 2 decoder.
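The inverse integer transform mentioned above is a small, fixed butterfly network, which is why it suits instruction-level hardware optimization. Below is a scalar Python reference model of the 4x4 inverse transform as defined in the H.264 standard (row pass, column pass, then rounding with (x + 32) >> 6). It is not the paper's optimized implementation, and the function name is illustrative.

```python
def inverse_transform_4x4(d):
    """H.264 4x4 inverse integer transform of a block of scaled coefficients.

    d is a list of 4 rows of 4 ints; returns the 4x4 residual block.
    """
    def butterfly(v):
        # 1-D transform: only adds, subtracts, and shifts by one bit.
        e0 = v[0] + v[2]
        e1 = v[0] - v[2]
        e2 = (v[1] >> 1) - v[3]
        e3 = v[1] + (v[3] >> 1)
        return [e0 + e3, e1 + e2, e1 - e2, e0 - e3]

    rows = [butterfly(r) for r in d]                                   # horizontal pass
    cols = [butterfly([rows[i][j] for i in range(4)]) for j in range(4)]  # vertical pass
    # cols[j][i] holds the value at row i, column j; round and scale by 1/64.
    return [[(cols[j][i] + 32) >> 6 for j in range(4)] for i in range(4)]


dc_only = [[64, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
print(inverse_transform_4x4(dc_only))  # every residual sample equals 1
```

Python's `>>` on negative integers is an arithmetic (flooring) shift, which matches the behavior the standard specifies for these shifts.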

7.

Architectural techniques for accelerating subword permutations with repetitions

McGregor J.P. Lee R.B. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2003,11(3):325-335

We propose two new instructions, swperm and sieve, that can be used to efficiently complete an arbitrary bit-level permutation of an n-bit word with or without repetitions. Permutations with repetitions are rearrangements of an ordered set in which elements may replace other elements in the set; such permutations are useful in cryptographic algorithms. On a four-way superscalar processor, we can complete an arbitrary 64-bit permutation with repetitions of 1-bit subwords in 11 instructions and only four cycles using the two proposed instructions. For subwords of size 4 bits or greater, we can perform an arbitrary permutation with repetitions of a 64-bit register in a single cycle using a single swperm instruction. This improves upon previous results by requiring fewer instructions to permute 4-bit or larger subwords packed in a 64-bit register and fewer execution cycles for 1-bit subwords on wide superscalar processors. We also demonstrate that we can accelerate the performance of the popular DES block cipher using the proposed instructions. We obtain a DES performance improvement of at least 55% in constrained embedded environments and an improvement of 71% on a four-way superscalar processor when applying DES as a cryptographic hash function.
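As a reference for what a single permutation-with-repetitions instruction computes, this Python sketch models the operation on the sixteen 4-bit subwords of a 64-bit word: each output subword is copied from any input subword, so a source subword may appear several times or not at all. The selector-list encoding and the function name are assumptions; they are not the swperm instruction format from the paper.

```python
def permute_nibbles(word, selectors):
    """Permutation with repetitions of the sixteen 4-bit subwords of a 64-bit word.

    Output nibble i is copied from input nibble selectors[i]; nibble 0 is the
    least significant. Selector values may repeat (hypothetical encoding).
    """
    if len(selectors) != 16 or not all(0 <= s < 16 for s in selectors):
        raise ValueError("need sixteen selectors in the range 0..15")
    out = 0
    for i, sel in enumerate(selectors):
        nibble = (word >> (4 * sel)) & 0xF  # extract the chosen source subword
        out |= nibble << (4 * i)           # place it at output position i
    return out


reverse = list(range(15, -1, -1))
print(hex(permute_nibbles(0x0123456789ABCDEF, reverse)))  # 0xfedcba9876543210
```

In software this costs a shift, a mask, and an OR per subword, sixteen times over; that loop is the work the proposed instruction collapses into one operation.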

8.

Customization of an embedded RISC CPU with SIMD extensions for video encoding: A case study

V.A. Chouliaras, V.M. Dwyer, J.L. Nunez-Yanez, K. Nakos 《Integration, the VLSI Journal》2008,41(1):135-152

This work presents a detailed case study in customizing a configurable, extensible, 32-bit RISC processor with vector/SIMD instruction extensions for the efficient execution of block-based video-coding algorithms utilizing a proprietary co-design environment. In addition to the default Full-Search motion estimation of the MPEG-2 Test Model 5, fourteen fast ME algorithms were implemented in both scalar and vector form. Results demonstrate a reduction of up to 68% in the dynamic instruction count of the full search-based encoder whereas the fast motion estimation algorithms achieved a reduction in instruction count of nearly 90%, both accelerated via three 128-bit vector/SIMD instructions when compared to the scalar, reference implementation of the standard. We address in detail the profiling, vectorization and the development of these vector instruction set extensions, discuss in depth the implementation of a parametric vector accelerator that implements these instructions and show the introduction of that accelerator into a 32-bit RISC processor pipeline, in a closely-coupled configuration.
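The full-search motion estimation kernel that the vector instructions target reduces to computing a sum of absolute differences (SAD) for every candidate displacement. A scalar Python sketch, assuming frames stored as lists of rows and a square search window; a 128-bit SIMD instruction would process many of the pixel differences in the inner loop at once. The function names are illustrative, not from the paper.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at (bx, by)
    and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, search_range):
    """Exhaustive block matching over [-search_range, search_range]^2,
    skipping displacements that fall outside the reference frame.

    Returns (best SAD, dx, dy)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            if not (0 <= bx + dx and bx + dx + n <= width
                    and 0 <= by + dy and by + dy + n <= height):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

Full search evaluates (2 x search_range + 1)^2 candidates of n^2 differences each, which is why fast ME algorithms and SIMD SAD instructions both pay off.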

9.

Predictive system shutdown and other architectural techniques for energy efficient programmable computation

Srivastava M.B. Chandrakasan A.P. Brodersen R.W. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》1996,4(1):42-55

With the popularity of portable devices such as personal digital assistants and personal communicators, as well as with increasing awareness of the economic and environmental costs of power consumption by desktop computers, energy efficiency has emerged as an important issue in the design of electronic systems. While power efficient ASICs with dedicated architectures have addressed the energy efficiency issue for niche applications such as DSP, much of the computation continues to be implemented as software running on programmable processors such as microprocessors, microcontrollers, and programmable DSPs. This is true not only for general purpose computation on personal computers and workstations, but also for portable devices, application-specific systems, etc. In fact, firmware and embedded software executing on RISC and DSP processor cores that are embedded in ASICs has emerged as a leading implementation methodology for speech coding, modem functionality, video compression, communication protocol processing, etc. This paper describes architectural techniques for energy efficient implementation of programmable computation, particularly focussing on the computation needed in portable devices where event-driven user interfaces, communication protocols, and signal processing play a dominant role. Two key approaches described here are predictive system shutdown and extended voltage scaling. Results indicate that a large reduction in power consumption can be achieved over current day solutions with little or no loss in system performance.
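Predictive shutdown rests on one decision: turn the system off only when the coming idle period is expected to outlast the break-even time of shutting down and waking up. The Python sketch below uses an exponential-average idle-period predictor, a swapped-in, commonly used heuristic and not the predictor proposed in the paper; the class name, `alpha`, and the break-even value are assumptions.

```python
class IdlePredictor:
    """Exponential-average predictor of the next idle period.

    Swapped-in heuristic for illustration; not the paper's predictor.
    """

    def __init__(self, alpha=0.5, initial=0.0):
        self.alpha = alpha          # weight given to the most recent idle period
        self.prediction = initial   # predicted length of the next idle period

    def should_shut_down(self, break_even):
        # Shutting down saves energy only if the idle period outlasts the
        # break-even time that amortizes the shutdown and wake-up overhead.
        return self.prediction > break_even

    def observe(self, actual_idle):
        """Update the prediction once an idle period has ended."""
        self.prediction = self.alpha * actual_idle + (1 - self.alpha) * self.prediction


p = IdlePredictor(alpha=0.5)
p.observe(100.0)                  # a long idle period was just seen
print(p.should_shut_down(40.0))   # True: predicted idle time is 50.0
```

A larger `alpha` reacts faster to bursty, event-driven workloads but also mispredicts more often after a single outlier.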

10.

ASIC Design and Implementation of the TCP/IP Protocol (times cited: 1; self-citations: 0; citations by others: 1)

Chen Weiliang, Zhao Junchao, Wei Shaojun 《Microelectronics》2002,32(2):97-101

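This paper presents an ASIC design, a hardware TCP/IP protocol processor, that implements how the TCP/IP protocol suite transmits and processes TCP data and IP datagrams. It briefly introduces the TCP/IP protocol and concentrates on the system architecture of the TCP/IP protocol processor and the design of its modules. The hardware protocol processor increases the processing speed of IP datagrams. More importantly, it frees Internet data transmission from its traditional dependence on a host computer, so that Internet connections can be established without a computer system.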
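One per-datagram computation a hardware TCP/IP processor has to perform is the 16-bit ones'-complement Internet checksum over the IPv4 header (RFC 1071). This Python reference model shows what such logic computes; it is the standard algorithm, not the circuit from the paper, and the function name is illustrative.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum, as used in the IPv4 header (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # sum big-endian 16-bit words
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# IPv4 header with its checksum field (bytes 10-11) set to zero
header = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
print(hex(internet_checksum(header)))  # 0xb861
```

Running the same function over a header whose checksum field already holds the correct value yields 0, which is how a receiver validates the header; in hardware the additions map naturally onto an adder tree with end-around carry.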

11.

The Architecture and Development Flow of the S5 Software Configurable Processor

Jeffrey M. Arnold 《The Journal of VLSI Signal Processing》2007,47(1):3-14

A software configurable processor (SCP) is a hybrid device that couples a conventional processor datapath with programmable logic to allow application programs to dynamically customize the instruction set. SCP architectures can offer significant performance gains by exploiting data parallelism, operator specialization and deep pipelines. The S5000 is a family of high performance software configurable processors for embedded applications. The S5000 consists of a conventional 32-bit RISC processor coupled with a programmable Instruction Set Extension Fabric (ISEF). To develop an application for the S5 the programmer identifies critical sections to be accelerated, writes one or more extension instructions as functions in a variant of the C programming language, and accesses those functions from the application program. Performance gains of more than an order of magnitude over the unaccelerated processor can be achieved.

12.

Streaming processors for next-generation mobile imaging applications

《Communications Magazine, IEEE》2005,43(12):81-89

Next-generation mobile devices will continue to demand high processing power for imaging applications. The expected performance is in the class of supercomputers, but delivered with limited energy and memory bandwidth for embedded systems. This article advocates a streaming computation model that leverages the deterministic access patterns in imaging applications to deliver the necessary processing throughput. A reconfigurable datapath connects a set of functional units, forming a computation pipeline to offer energy efficiency. The architecture and implementation of a stream processor are presented along with the memory subsystem to support stream data transfers. The results show speedup ranging from a factor of 2 to 28 for imaging applications, offering favorable comparison against scalar processors.

13.

Multicore Flow Processor with Wire‐Speed Flow Admission Control

Kyeong‐Hwan Doo, Bin‐Yeong Yoon, Bhum‐Cheol Lee, Soon‐Seok Lee, Man Soo Han, Whan‐Woo Kim 《ETRI Journal》2012,34(6):827-837

We propose a flow admission control (FAC) for setting up a wire‐speed connection for new flows based on their negotiated bandwidth. It also terminates a flow that does not have a packet transmitted within a certain period determined by the users. The FAC can be used to provide reliable transmission for user datagram protocol (UDP) and transmission control protocol (TCP) applications. If the period of flows can be set to a short time period, we can monitor active flows that carry a packet over networks during the flow period. Such powerful flow management can also be applied to security systems to detect a denial‐of‐service attack. We implement a network processor called a flow management network processor (FMNP), which is the second generation of the device that supports FAC. It has forty reduced instruction set computer (RISC) core processors optimized for packet processing. It is fabricated in 65‐nm CMOS technology and has a 40‐Gbps process performance. We prove that a flow router equipped with an FMNP is better than legacy systems in terms of throughput and packet loss.
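The admission rule described above (admit a flow only if its negotiated bandwidth fits, and terminate flows that stay silent longer than a user-set period) can be sketched in software as a small flow table. This is a hypothetical Python illustration with assumed names and units (bits per second and seconds); the FMNP performs this per packet at wire speed in hardware.

```python
class FlowTable:
    """Toy flow admission control (hypothetical structure and names).

    A new flow is admitted only if its requested bandwidth fits in the
    remaining capacity; a flow with no packet for more than `idle_timeout`
    seconds is terminated and its bandwidth released.
    """

    def __init__(self, capacity_bps, idle_timeout):
        self.capacity = capacity_bps
        self.idle_timeout = idle_timeout
        self.flows = {}  # flow key -> [reserved bandwidth, time of last packet]

    def reserved(self):
        return sum(bw for bw, _ in self.flows.values())

    def expire(self, now):
        idle = [k for k, (_, last) in self.flows.items() if now - last > self.idle_timeout]
        for key in idle:
            del self.flows[key]

    def packet(self, key, now, requested_bps=0):
        """Handle one packet; return True if it is forwarded."""
        self.expire(now)
        if key in self.flows:
            self.flows[key][1] = now  # refresh the flow's activity timer
            return True
        if self.reserved() + requested_bps <= self.capacity:
            self.flows[key] = [requested_bps, now]
            return True
        return False  # admission rejected: not enough bandwidth left
```

Because expired flows release their reservation, a short timeout lets the table both reclaim bandwidth quickly and expose the set of currently active flows, which is what makes the same mechanism usable for denial-of-service detection.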

14.

A 16-bit Self-Timed ALU for Asynchronous Microprocessors

Guan Chao, Ge Yuanqing, Wu Rui, Zhou Runde 《Microelectronics》2001,31(5):342-346

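To meet the high-performance and low-power requirements of embedded microprocessor design, this paper proposes a 16-bit self-timed ALU for asynchronous microprocessors built from differential cascode voltage switch (DCVS) logic. Taking area, speed, power consumption, and the statistical distribution of instructions into account, the ALU achieves excellent performance.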

15.

A VLIW processor with reconfigurable instruction set for embedded applications

Lodi A. Toma M. Campi F. Cappelli A. Canegallo R. Guerrieri R. 《Solid-State Circuits, IEEE Journal of》2003,38(11):1876-1886

This paper describes a new architecture for embedded reconfigurable computing, based on a very-long instruction word (VLIW) processor enhanced with an additional run-time configurable datapath. The reconfigurable unit is tightly coupled with the processor, featuring an application-specific instruction-set extension. Mapping computation-intensive algorithmic portions onto the reconfigurable unit allows more efficient processing, thus leading to an improvement in both timing performance and power consumption. A test chip has been implemented in a standard 0.18-μm CMOS technology. Tests on a signal processing algorithmic benchmark showed speedups ranging from 4.3× to 13.5× and energy consumption reduced by up to 92%.

16.

A reconfigurable system featuring dynamically extensible embedded microprocessor, FPGA, and customizable I/O

Borgatti M. Lertora F. Foret B. Cali L. 《Solid-State Circuits, IEEE Journal of》2003,38(3):521-529

A system chip targeting image and voice processing and recognition application domains is implemented as a representative of the potential of using programmable logic in system design. It features an embedded reconfigurable processor built by joining a configurable and extensible processor core and an SRAM-based embedded field-programmable gate array (FPGA). Application-specific bus-mapped coprocessors and flexible input/output peripherals and interfaces can also be added and dynamically modified by reconfiguring the embedded FPGA. The architecture of the system is discussed as well as the design flows for pre- and post-silicon design and customization. The silicon area required by the system is 20 mm² in a 0.18-μm CMOS technology. The embedded FPGA accounts for about 40% of the system area.

17.

Top down structured parallelisation of embedded image processing applications

Downton A.C. Tregidgo R.W.S. Cuhadar A. 《Vision, Image and Signal Processing, IEE Proceedings -》1994,141(6):431-437

The authors present a general system design method which is intended to support parallelisation of complete image processing applications using MIMD processors. The approach is based upon the utilisation of a generic system level parallel processor architecture, the 'pipeline processor farm' (PPF), and is applicable to any embedded application with continuous input/output. The design method is illustrated using applications from the fields of computer vision and image coding. The design model accommodates several commonly exploited parallel processing paradigms, maps conveniently to the software structure of most image processing algorithms, provides incrementally scalable performance, and enables upper-bound speedups to be easily estimated from profiling data generated by the original sequential implementation of the application. It is believed that the approach has significant application in parallel embedded systems design, in the development environment, and in simulation work for computationally intensive image coding algorithms.
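The claim that upper-bound speedups can be estimated from sequential profiling data can be illustrated with a simplified model: each pipeline stage is a farm of workers that split the stage's work evenly, communication is free, and the slowest stage bounds throughput. This Python sketch is an assumption-laden approximation, not necessarily the exact estimate used in the paper; the function name is hypothetical.

```python
def ppf_speedup_bound(stage_times, workers):
    """Upper-bound throughput speedup of a pipeline of processor farms.

    stage_times -- per-item time of each stage, measured in the sequential program
    workers     -- number of processors farmed out to each stage
    Simplified model: work divides evenly and communication costs nothing.
    """
    sequential = sum(stage_times)
    # A farmed stage handles one item every t / n time units on average;
    # the slowest such stage limits the throughput of the whole pipeline.
    bottleneck = max(t / n for t, n in zip(stage_times, workers))
    return sequential / bottleneck


# Hypothetical profile: the middle stage costs three times the others.
print(ppf_speedup_bound([10, 30, 10], [1, 3, 1]))  # 5.0
```

The model also shows where to add processors incrementally: adding workers to any stage other than the bottleneck raises the processor count without raising the bound.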

18.

A 1.2-W, 2.16-GOPS/720-MFLOPS embedded superscalar microprocessor for multimedia applications

Kubosawa H. Takahashi H. Ando S. Asada Y. Asato A. Suga A. Kimura M. Higaki N. Miyake H. Sato T. Anbutsu H. Tsuda T. Yoshimura T. Amano I. Kai M. Mitarai S. 《Solid-State Circuits, IEEE Journal of》1998,33(11):1640-1648

We have designed a microprocessor that is based on a single instruction multiple data stream (SIMD) architecture. It features a two-way superscalar architecture for multimedia embedded systems that especially need to support MPEG2 video decoding/encoding and 3DCG image processing. This microprocessor meets all requirements of embedded systems, including (a) MPEG2 (MP@ML) decoding and graphic processing capabilities for three-dimensional images, (b) programming flexibility, and (c) low power consumption and low manufacturing cost. High performance was achieved by enhanced parallel processing capabilities while adopting a SIMD architecture and a two-way superscalar architecture. Programming flexibility was increased by providing 170 dedicated multimedia instructions. Low power consumption was achieved by utilizing advanced process technology and power-saving circuits. The processor supports a general-purpose RISC instruction set. This feature is important, as the processor will have to work as a controller of various target systems. The processor has been fabricated in a 0.21-μm four-metal CMOS technology on a 9.84×10.12 mm die. It performs 2.16 GOPS/720 MFLOPS at an operating frequency of 180 MHz, with a power consumption of 1.2 W and a power supply of 1.8 V.
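Dividing the quoted figures by the 180 MHz clock and the 1.2 W power budget gives the per-cycle throughput and efficiency they imply, assuming the quoted GOPS and MFLOPS values are peak rates at that clock:

```latex
\begin{aligned}
\frac{2.16\ \text{GOPS}}{180\ \text{MHz}} &= 12\ \text{operations per cycle}, \\
\frac{720\ \text{MFLOPS}}{180\ \text{MHz}} &= 4\ \text{floating-point operations per cycle}, \\
\frac{2.16\ \text{GOPS}}{1.2\ \text{W}} &= 1.8\ \text{GOPS/W}, \\
9.84\ \text{mm} \times 10.12\ \text{mm} &\approx 99.6\ \text{mm}^2 .
\end{aligned}
```

Twelve operations per cycle from a two-way superscalar core is only reachable through the SIMD datapath, consistent with the abstract's emphasis on combining the two.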

19.

Phase-Coupled Mapping of Data Flow Graphs to Irregular Data Paths

Steven Bashford, Rainer Leupers 《Design Automation for Embedded Systems》1999,4(2-3):119-165

Many software compilers for embedded processors produce machine code of insufficient quality. Since for most applications software must meet tight code speed and size constraints, embedded software is still largely developed in assembly language. In order to eliminate this bottleneck and to enable the use of high-level language compilers for embedded software as well, new code generation and optimization techniques are required. This paper describes a novel code generation technique for embedded processors with irregular data path architectures, such as typically found in fixed-point DSPs. The proposed code generation technique maps the data flow graph representation of a program into highly efficient machine code for a target processor modeled by instruction set behavior. High code quality is ensured by tight coupling of different code generation phases. In contrast to earlier works, mainly based on heuristics, our approach is constraint-based. An initial set of constraints on code generation is prescribed by the given processor model. Further constraints arise during code generation based on decisions concerning code selection, register allocation, and scheduling. Whenever possible, decisions are postponed until sufficient information about a good decision has been collected. The constraints are active in the "background" and guarantee local satisfiability at any point in time during code generation. This mechanism makes it possible to cope simultaneously with special-purpose registers and instruction level parallelism. We describe the detailed integration of code generation phases. The implementation is based on the constraint logic programming (CLP) language ECLiPSe. For a standard DSP, we show that the quality of generated code comes close to hand-written assembly code. Since the user can edit the input processor model, the code generation technique is also retargetable within a certain processor class.
This revised version was published online in July 2006 with corrections to the Cover Date.

20.

FPGA prototyping of a RISC processor core for embedded applications

Gschwind M. Salapura V. Maurer D. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2001,9(2):241-250

Application-specific processors offer an attractive option in the design of embedded systems by providing high performance for a specific application domain. In this work, we describe the use of a reconfigurable processor core based on a RISC architecture as a starting point for application-specific processor design. By using a common base instruction set, development cost can be reduced and design space exploration is focused on the application-specific aspects of performance. An important aspect of deploying any new architecture is verification, which usually requires lengthy software simulation of a design model. We show how hardware emulation based on programmable logic can be integrated into the hardware/software codesign flow. While previously hardware emulation required massive investment in design effort and special purpose emulators, an emulation approach based on high-density field-programmable gate array (FPGA) devices now makes hardware emulation practical and cost effective for embedded processor designs. To reduce development cost and avoid duplication of design effort, FPGA prototypes and ASIC implementations are derived from a common source: we show how to perform targeted optimizations to fully exploit the capabilities of the target technology while maintaining a common source base.