
1.

陈海燕 杨超 刘胜 刘仲 《电子学报》2016,44(2):241-246

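As SIMD (Single Instruction Multiple Data stream) DSPs (Digital Signal Processors) integrate more and more processing units on chip, the flexibility and bandwidth efficiency of parallel memory access have a growing influence on achieved computational performance. This paper analyzes in detail the memory-access problems that the radix-2 FFT (Fast Fourier Transform) parallel algorithm faces on a general SIMD DSP, uses simple partial-address XOR logic to perform the SIMD parallel memory address translation, and thereby achieves conflict-free SIMD parallel memory access for FFT computation. It also proposes several vector memory-access instructions with special shuffle modes, which completely eliminate the extra shuffle instructions otherwise needed for radix-2 FFT on SIMD architectures. Finally, the approach is applied to the optimized design of the vector memory (VM) in YHFT-Matrix2, a 16-way SIMD digital signal processor. Test results show that the VM optimized with this SIMD parallel memory structure achieves fully pipelined, conflict-free parallel memory access for FFT computation and 100% parallel memory bandwidth utilization at the cost of 18% additional hardware; compared with the design before optimization, FFTs of different sizes obtain speedups of 1.32 to 2.66.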
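The abstract above does not give the exact XOR mapping, so the following is only a minimal sketch of the general idea, under stated assumptions: 16 word-interleaved banks, and a bank index formed by XOR-folding the 4-bit fields of the address. The names `linear_bank`, `xor_bank`, and `conflict_free` are hypothetical and are not taken from the paper or from YHFT-Matrix2.

```python
BANKS = 16       # assumed number of SIMD lanes and memory banks
BANK_BITS = 4    # log2(BANKS)

def linear_bank(addr):
    """Plain low-order interleaving: bank = address mod 16."""
    return addr % BANKS

def xor_bank(addr):
    """Assumed XOR mapping: fold every 4-bit field of the address together."""
    bank = 0
    while addr:
        bank ^= addr & (BANKS - 1)
        addr >>= BANK_BITS
    return bank

def conflict_free(bank_fn, base, stride):
    """True if 16 lanes reading base + lane*stride all hit different banks."""
    banks = {bank_fn(base + lane * stride) for lane in range(BANKS)}
    return len(banks) == BANKS

# Power-of-two strides, as they occur between butterfly inputs in radix-2 FFT.
for k in range(7):
    stride = 1 << k
    print(stride, conflict_free(linear_bank, 0, stride),
          conflict_free(xor_bank, 0, stride))
```

With linear interleaving, a stride of 16 sends all 16 lanes to the same bank, whereas the folded XOR mapping keeps aligned power-of-two strides conflict-free in this model. The real design may use a different subset of address bits.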

2.

An Algorithm Adapted Autonomous Controlling Concept for a Parallel Single-Chip Digital Signal Processor

Johannes Kneip Mladen Berekovic Jens Peter Wittenburg Willm Hinrichs Peter Pirsch 《The Journal of VLSI Signal Processing》1997,16(1):31-40

Recent sub-micron semiconductor technology supports the monolithic integration of multiprocessor systems. High wiring density and short on-chip memory access cycles motivate novel architecture concepts, outperforming conventional parallel systems. An efficient controlling strategy is key to gaining high performance from limited silicon resources. In this paper, a controlling concept for a monolithic Autonomous Single-Instruction/Multiple Data (ASIMD) processor is presented, which combines the high parallelism of an SIMD approach with the flexibility of standard DSP architectures. To demonstrate the performance gains of the concept, a digital video signal processor, the HiPAR-DSP, has been implemented. It consists of an array of 4 or 16 datapaths, local memories for each datapath, a shared memory with concurrent data access in the shape of a matrix, and a central RISC controller. A three-stage execution autonomy has been implemented, consisting of conditional instructions, conditional skip of instructions by the data paths, and global evaluation of local conditions by the central controller. This allows efficient execution of data-dependent medium- and high-level algorithms with very low controlling overhead. A performance of up to two arithmetic gigaoperations per second is achieved for algorithms with irregular data flow or control flow for the 100 MHz clocked processor with 16 data paths.

3.

Memory systems for highly parallel computers

Tanimoto S.L. 《Proceedings of the IEEE. Institute of Electrical and Electronics Engineers》1991,79(4):403-415

Some recent developments in computer memories are discussed, and proposals are made to enhance the memory systems of highly parallel SIMD (single instruction multiple data) computers. The topics covered include increases in density and speed; adding new logic functions to memory chips; local autonomy in addressing; a parallel indexed RAM (PIRAM); impact of memory on numerical computation in SIMD; routing data among processing elements (PEs); bottlenecks; iconic and symbolic processing; a mesh-interfacing memory chip; bimodal memory system; and local address autonomy using PIRAMs.

4.

Instruction Set Extensions for Matrix Decompositions on Software Defined Radio Architectures

Murugappan Senthilvelan Mihai Sima Daniel Iancu Michael Schulte John Glossner 《Journal of Signal Processing Systems》2013,70(3):289-303

Emerging wireless applications consistently demand higher data rates. Unfortunately, it is challenging to achieve high data rates within the limited amount of available frequency spectrum. Hence, enhanced spectral efficiency and link reliability within the available frequency spectrum are of the utmost importance in current and next generation wireless protocols. To attain high spectral efficiency and link reliability, wireless protocols employ increasingly complex 2-dimensional techniques that involve computationally-intensive matrix operations. Multiple-Input Multiple-Output (MIMO) communication is an example of a promising technique employed by wireless protocols to deliver higher data rates at the cost of increased algorithmic complexity. Application Specific Integrated Circuits (ASICs) have traditionally been used to implement compute-intensive wireless protocols. The wireless industry has been gradually moving towards an alternative programmable platform called Software Defined Radio (SDR) due to its significant benefits, such as reduced development costs and accelerated time-to-market. The computationally-intensive matrix operations used in current and next generation wireless protocols are extremely expensive to implement in SDR platforms with conventional Digital Signal Processor (DSP) instruction sets. Hence there is a need for novel instructions, hardware designs and algorithm enhancements to enable higher spectral efficiency on SDR platforms. In this paper, we propose Single Instruction Multiple Data (SIMD) CoOrdinate Rotation DIgital Computer (CORDIC) instruction set extensions with CORDIC hardware support to speed up computationally-intensive matrix decomposition algorithms.
The CORDIC instruction set extensions have been implemented on the Sandbridge Sandblaster SB3000 SDR platform and evaluated on conventional algorithms used for decomposing a closed loop 4-by-4 Worldwide Interoperability for Microwave Access (WiMAX) MIMO channel into independent Single-Input Single-Output (SISO) channels. Our experimental results on the closed-loop MIMO channel decomposition using CORDIC instructions demonstrate more than 6x speedup over a Sandblaster baseline implementation that uses state-of-the-art SIMD DSP instructions. The CORDIC instructions also provide similar numerical accuracy when compared to the baseline implementation. The techniques we propose in this paper are also applicable to other SDR and embedded processor architectures.
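For readers unfamiliar with CORDIC, the following is a minimal sketch of the textbook vectoring mode, which rotates a vector onto the x-axis and is the kind of primitive a Givens rotation in a matrix decomposition needs. It is not the Sandblaster instruction semantics: the function name `cordic_vectoring`, the iteration count, and the use of floating point in place of fixed-point shifts are all assumptions made for illustration.

```python
import math

ITERATIONS = 16  # assumed iteration count; hardware would pick this per word length
ANGLES = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]

# Each micro-rotation stretches the vector; the accumulated gain is constant.
GAIN = 1.0
for i in range(ITERATIONS):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))

def cordic_vectoring(x, y):
    """Drive y toward 0 with shift-and-add steps; return (magnitude, angle).

    Assumes x > 0 so that the angle lies inside the CORDIC convergence range.
    """
    angle = 0.0
    for i in range(ITERATIONS):
        shift = 2.0 ** -i  # a right shift by i bits in fixed-point hardware
        if y >= 0:
            x, y, angle = x + y * shift, y - x * shift, angle + ANGLES[i]
        else:
            x, y, angle = x - y * shift, y + x * shift, angle - ANGLES[i]
    return x / GAIN, angle

magnitude, angle = cordic_vectoring(3.0, 4.0)
print(magnitude, angle)  # close to 5 and atan2(4, 3)
```

The returned angle can then be applied to the remaining elements of the two matrix rows with CORDIC rotation mode, which is how a Givens-based decomposition avoids explicit square roots and divisions.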

5.

Code Optimization Techniques in Embedded DSP Microprocessors

Stan Liao Srinivas Devadas Kurt Keutzer Steve Tjiang Albert Wang 《Design Automation for Embedded Systems》1998,3(1):59-73

We address the problem of code optimization for embedded DSP microprocessors. Such processors (e.g., those in the TMS320 series) have highly irregular datapaths, and conventional code generation methods typically result in inefficient code. In this paper we formulate and solve some optimization problems that arise in code generation for processors with irregular datapaths. In addition to instruction scheduling and register allocation, we also formulate the accumulator spilling and mode selection problems that arise in DSP microprocessors. We present optimal and heuristic algorithms that determine an instruction schedule simultaneously optimizing accumulator spilling and mode selection. Experimental results are presented.

6.

A Low-Complexity Echo Canceller Implemented on the TMS320DM642

高国坪 陈水仙 《电信科学》2009,25(2)

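A low-complexity echo canceller that copes with double-talk is implemented on the TMS320DM642 fixed-point DSP. On the algorithm side, the two echo path model (TEPM) is adopted as the basic framework to improve robustness under double-talk, and a multi-delay block frequency-domain adaptive filter (MDF) is used to estimate the impulse response of the echo path in order to reduce computational complexity. On the implementation side, the main operations are optimized at the linear-assembly level to exploit the DSP's parallel execution capability and the single instruction multiple data (SIMD) features of its instruction set. Tests show that the echo canceller can handle echoes of composite source signals (CSS) and real source signals with a 128 ms path delay at a CPU occupancy of 0.45%.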

7.

A 32-b RISC/DSP microprocessor with reduced complexity (cited 2 times in total: 0 self-citations, 2 citations by others)

Dolle M. Jhand S. Lehner W. Muller O. Schlett M. 《Solid-State Circuits, IEEE Journal of》1997,32(7):1056-1066

This paper presents a new 32-b reduced instruction set computer/digital signal processor (RISC/DSP) architecture which can be used as a general purpose microprocessor and in parallel as a 16-/32-b fixed-point DSP. This has been achieved by using RISC design principles for the implementation of DSP functionality. A DSP unit operates in parallel to an arithmetic logic unit (ALU)/barrel shifter on the same register set. This architecture provides the fast loop processing, high data throughput, and deterministic program flow absolutely necessary in DSP applications. Besides offering a basis for general purpose and DSP processing, the RISC philosophy offers a higher degree of flexibility for the implementation of DSP algorithms and achieves higher clock frequencies compared to conventional DSP architectures. The integrated DSP unit provides instruction set support for highly specialized DSP algorithms. Subword processing optimized for DSP algorithms has been implemented to provide maximum performance for 16-b data types. While creating a unified base for both application areas, we also minimized transistor count and reduced complexity by using a short instruction pipeline. A parallelism concept based on a varying number of instruction latency cycles made superscalar instruction execution superfluous.

8.

Hierarchical Memory Management Mechanism of the ADSP-BF535 and Its Performance Evaluation

杨波 王跃科 杨俊 邢克飞 《电子器件》2003,26(4):387-392

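The memory management mechanism and its performance directly determine the performance of a DSP. This article first analyzes the hierarchical memory management mechanism of the ADSP-BF535 and describes the memory of each region in detail. Second, it evaluates the performance of parallel instructions and FFT operations on the DSP's L1 and L2 memories. Third, it tests DMA data transfers between the various memories and gives concrete speed figures. The evaluation data show that the BF535 has excellent memory performance and provide an important data reference for engineering application design with Blackfin-series DSPs.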

9.

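Implementation of a Wireless Multimedia DSP Chip for Mobile Communication (cited 2 times in total: 2 self-citations, 0 citations by others)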

杨合圣 李清平 《现代电子技术》2006,29(9):17-20

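This paper describes the implementation of a wireless multimedia DSP chip for mobile communication. The resulting WM DSP chip supports communication instructions, for example for Viterbi and time synchronization, as well as multimedia instructions. The DSP can process variable-length data and execute four MACs in one cycle. It adopts parallel processing techniques such as SIMD, vector processing, and a DSP architecture, together with low-power features for wireless applications. The complete DSP chip includes test circuitry and various peripherals such as DMA, a bus arbiter, and timers. Excluding memory, it contains about 170,000 gates in total, and the clock frequency reaches 100 MHz.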

10.

A Novel Reconfigurable Processor Using Dynamically Partitioned SIMD for Multimedia Applications

Chun‐Gi Lyuh Jung‐Hee Suk Ik‐Jae Chun Tae Moon Roh 《ETRI Journal》2009,31(6):709-716

In this paper, we propose a novel reconfigurable processor using dynamically partitioned single‐instruction multiple‐data (DP‐SIMD) which is able to process multimedia data. The SIMD processor and parallel SIMD (P‐SIMD) processor, which is composed of a number of SIMD processors, are usually used these days. But these processors are inefficient because all processing units (PUs) should process the same operations all the time. Moreover, the PUs can process different operations only when every SIMD group operation is predefined. We propose a processor control method which can partition parallel processors into multiple SIMD‐based processors dynamically to enhance efficiency. For performance evaluation of the proposed method, we carried out the inverse transform, inverse quantization, and motion compensation operations of H.264 using processors based on SIMD, P‐SIMD, and DP‐SIMD. Experimental results show that the DP‐SIMD control method is more efficient than SIMD and P‐SIMD control methods by about 15% and 14%, respectively.

11.

The first MAJC microprocessor: a dual CPU system-on-a-chip

Kowalczyk A. Adler V. Amir C. Chiu F. Choon Ping Chng De Lange W.J. Yuefei Ge Ghosh S. Tan Canh Hoang Baoqing Huang Kant S. Kao Y.S. Cong Khieu Kumar S. Lan Lee Liebermensch A. Xin Liu Malur N.G. Martin A.A. Ngo H. Sung-Hun Oh Orginos I. Shih L. Sur B. Tremblay M. Tzeng A. Vo D. Zambere S. Jin Zong 《Solid-State Circuits, IEEE Journal of》2001,36(11):1609-1616

The first implementation of the MAJC architecture achieves high performance by using very long instruction word (VLIW), single instruction multiple data (SIMD), and chip multiprocessing. The chip integrates two processors, a memory controller, two high-speed parallel I/O interfaces, and a PCI controller. The chip, fabricated in a 0.22-μm CMOS process with six layers of copper interconnect, contains 13 million transistors and operates at 500 MHz. It is packaged in a 624-pin ceramic column grid array using flip-chip assembly technology.

12.

Exploiting Thread‐Level Parallelism in Lockstep Execution by Partially Duplicating a Single Pipeline

Jaegeun Oh Seok Joong Hwang Huong Giang Nguyen Areum Kim Seon Wook Kim Chulwoo Kim Jong‐Kook Kim 《ETRI Journal》2008,30(4):576-586

In most parallel loops of embedded applications, every iteration executes the exact same sequence of instructions while manipulating different data. This fact motivates a new compiler‐hardware orchestrated execution framework in which all parallel threads share one fetch unit and one decode unit but have their own execution, memory, and write‐back units. This resource sharing enables parallel threads to execute in lockstep with minimal hardware extension and compiler support. Our proposed architecture, called multithreaded lockstep execution processor (MLEP), is a compromise between the single‐instruction multiple‐data (SIMD) and symmetric multithreading/chip multiprocessor (SMT/CMP) solutions. The proposed approach is more favorable than a typical SIMD execution in terms of degree of parallelism, range of applicability, and code generation, and can save more power and chip area than the SMT/CMP approach without significant performance degradation. For the architecture verification, we extend a commercial 32‐bit embedded core AE32000C and synthesize it on Xilinx FPGA. Compared to the original architecture, our approach is 13.5% faster with a 2‐way MLEP and 33.7% faster with a 4‐way MLEP in EEMBC benchmarks which are automatically parallelized by the Intel compiler.

13.

A scalable instruction buffer and align unit for xDSPcore

Panis C. Grunbacher H. Nurmi J. 《Solid-State Circuits, IEEE Journal of》2004,39(7):1094-1100

Increasing mask costs and decreasing feature sizes together with productivity demand have led to the trend of platform design. Software programmable embedded cores are used to provide the necessary flexibility in integrated systems. Facing increasing system complexity, single-issue digital signal processors (DSPs) have been replaced by cores providing the execution of several instructions in parallel. The most common programming model for multi-issue DSP core architectures is Very Long Instruction Word (VLIW), which is based on static scheduling, enables minimization of the worst case execution time, and reduces core complexity. The drawback of traditional VLIW is poor code density, which leads to high program memory requirements and, therefore, requires a large silicon area for the DSP subsystem. To overcome this problem without limiting the core performance, a scalable long instruction word (xLIW) is introduced. A special align unit is used for implementing the xLIW program memory interface. In this paper, the align unit and its main architectural feature, a scalable instruction buffer, are introduced in detail. xLIW is part of a project for a parameterized DSP core.

14.

Compiler-Based Performance Evaluation of an SIMD Processor with a Multi-Bank Memory Unit

Hoseok Chang Junho Cho Wonyong Sung 《Journal of Signal Processing Systems》2009,56(2-3):249-260

The single instruction multiple data (SIMD) architecture is very efficient for executing arithmetic intensive programs, but frequently suffers from data-alignment problems. The data-alignment problem not only induces extra time overhead but also hinders automatic vectorization by the SIMD compiler. In this paper, we compare three on-chip memory systems for the SIMD architecture (single-bank, multi-bank, and multi-port) with respect to resolving the data-alignment problems. The single-bank memory is the simplest, but supports only aligned accesses. The multi-bank memory has slightly higher complexity, but enables unaligned accesses and stride accesses with a bank-conflict limitation. The multi-port memory is capable of both unaligned and stride accesses without any restriction, but requires considerably more expensive hardware. We also developed a vectorizing compiler that can conduct dynamic memory allocation and SIMD code generation. The performances of the three memory systems with our SIMD compiler are evaluated using several digital signal processing kernels and the MPEG2 encoder. The experimental results show that the multi-bank memory can carry out MPEG2 encoding 5.8 times faster, whereas the single-bank memory only achieves a 2.9 times speed-up when employed in a multimedia system with a 2-issue host processor and an 8-way SIMD coprocessor. The multi-port memory obviously shows the best performance, which is however an impractical improvement over the multi-bank memory when the hardware cost is considered.
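The "bank-conflict limitation" of a multi-bank memory can be illustrated with a small model. This is only a sketch under assumptions that the abstract does not state: 8 word-interleaved banks serving an 8-way SIMD access, with conflicting accesses to the same bank serialized one per cycle. The function name `access_cycles` is hypothetical.

```python
from collections import Counter

def access_cycles(base, stride, lanes=8, banks=8):
    """Cycles needed for one SIMD access in the assumed multi-bank model.

    Lane i reads word address base + i*stride; bank = address mod banks.
    Accesses that collide on a bank are assumed to take one cycle each.
    """
    per_bank = Counter((base + lane * stride) % banks for lane in range(lanes))
    return max(per_bank.values())

print(access_cycles(3, 1))  # unaligned, unit stride
print(access_cycles(5, 3))  # odd stride
print(access_cycles(0, 2))  # even stride
print(access_cycles(0, 8))  # stride equal to the number of banks
```

In this model an unaligned unit-stride access and any odd stride complete in one cycle, while a stride of 2 takes two cycles and a stride of 8 degenerates to eight, which is the kind of restriction a multi-port memory would not have.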

15.

A 2-μm CMOS 8-MIPS digital processor with parallel processing capability

《Solid-State Circuits, IEEE Journal of》1986,21(5):750-765

A 2-μm CMOS VLSI digital signal processor (DSP) family, the SP50, is described that is capable of eight million instructions per second and up to six concurrent operations in each instruction. Two DSPs, the PCB5010 and PCB5011, have been developed. Both are based on a common architecture which contains two 16-bit data buses, and a 16×16→40-bit multiplier accumulator and 16-bit ALU, both with multiprecision support in hardware. Also implemented are two static data RAMs (128×16 or 256×16), a data ROM (51×16), a 15-word three-port register file, three address computation units, and five serial and parallel I/O interfaces. The data path is controlled by an orthogonal instruction set, using 40-bit microcode words. The controller contains a five-level stack and an instruction repeat register, and can have either on-chip program memory (RAM: 32×40; ROM: 987×40) or off-chip program memory (up to 64K×40). Benchmarks show a two to sixfold improvement in overall performance over its predecessors.

16.

A VLSI VAX chip set

《Solid-State Circuits, IEEE Journal of》1984,19(5):663-674

VLSI technology has been used to compress the full functionality and comparable performance of the VAX 11/780 super-minicomputer into a 1.2 M transistor microprocessor chip set. There was no subsetting of the 304 instruction set and the 17 data types, nor reduction in hardware support for the 4 Gbyte virtual memory management architecture. The chip set supports an integral 8-Kbyte memory cache, a 13.3-Mbyte/s system bus, and sophisticated multiprocessing. High performance is achieved through microcode optimizations afforded by the large control store, tightly coupled address and data caches, the use of internal and external 32-bit datapaths, the extensive application of both microlevel and macrolevel pipelining, and the use of specialized hardware assists.

17.

Architecture Considerations for Multi-Format Programmable Video Processors

Jonah Probell 《Journal of Signal Processing Systems》2008,50(1):33-39

Many different video processor architectures exist. Its architecture gives a processor strength for a particular application. Hardwired logic yields the best performance/cost, but a programmable processor is important for applications that support multiple coding standards, proprietary functions, or future changes to application requirements. Programmable video processor architectures achieve best performance through the use of parallelism at the data (SIMD), instruction (VLIW), and multiprocessor level, and optimally sized ALU, multiplier, and load/store datapaths. Because low-cost memory architectures are not optimized for the random access patterns of video processing, the performance of video processors is often limited by memory bandwidth rather than processing resources. Careful data organization alleviates memory bandwidth limitations. When choosing a video processor it is important to consider many factors, particularly performance, cost, power consumption, programmability, and peripheral support.


18.

The impact of Mpact 2

Purcell S. 《Signal Processing Magazine, IEEE》1998,15(2):102-107

Mpact media processors enable powerful, flexible and cost-effective multimedia in a PC. A single chip replaces today's multiboard, multichip solutions for graphics, video, audio, and communications. The architecture combines a high-bandwidth RAMBUS memory, VLIW/SIMD (single instruction, multiple data) processing, standard buses, and software programmability for the cost of a modern graphics chip. Mpact architecture uses a modified VLIW style with two RISC-like instructions per VLIW. The instructions are either executed sequentially or concurrently based on a tag in the VLIW. Classical VLIW suffers from low code density due to unused instruction fields, but the Mpact modified VLIW has the same code density as RISC instructions. Additionally, the SIMD instructions improve code density by increasing the work done by each instruction. An 8 byte word size was chosen to balance vector and scalar performance and also to balance data and instruction bandwidth. A 9 bit byte was chosen to represent color-component differences in one byte and to represent 18 bit color or 18 bit audio samples in two bytes. Hardware-dithered rounding of quantization noise allows most audio to be processed in two byte precision. The maximal multiplier precision of 24×24 was chosen for audio requirements. The article reviews the first-generation Mpact media processor and then describes the multimedia performance goals and architecture of Chromatic's second-generation media processor architecture. It then presents newer modules of the architecture in more detail.

19.

Implementation and Optimization of the Canny Operator on a DSP

徐克强 梁光明 刘任任 王职军 谢俊 《现代电子技术》2014,(6):8-11

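To achieve fast cell-image segmentation on a DSP platform, the principle of the Canny operator is analyzed in detail and the algorithm is ported to the TI TMS320C6678 DSP, taking the processor's characteristics into account. For image-data exchange with external memory, the previous approach of accessing the image one gray value at a time is replaced by a vectorized data-packing method for the Gaussian filter, which increases parallelism. In addition, wide memory accesses are used during gradient and threshold computation to make reads and writes to external memory more efficient. The results show that the optimizations speed up the operator without changing the segmentation result, and they can serve as a reference for engineers porting and optimizing algorithms on DSP platforms.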

20.

The Circuits and Robust Design Methodology of the Massively Parallel Processor Based on the Matrix Architecture

Noda H. Tanizaki T. Gyohten T. Dosaka K. Nakajima M. Mizumoto K. Yoshida K. Iwao T. Nishijima T. Okuno Y. Arimoto K. 《Solid-State Circuits, IEEE Journal of》2007,42(4):804-812

Novel circuits and a design methodology for a massively parallel processor based on the matrix architecture are introduced. A fine-grained processing element (PE) circuit for high-throughput MAC operations based on Booth's algorithm enhances the performance of a 16-bit fixed-point signed MAC, which operates at up to 30.0 GOPS/W. The dedicated I/O interface circuits are designed for converting the direction of data access and supporting the interleaved memory architecture, and they are implemented to maximize the processor core efficiency. Power management techniques for suppressing current peaks and reducing average power consumption are proposed to enhance the robustness of the macro. The circuits and the design methodology proposed in this paper are attractive for achieving a high-performance and robust massively parallel SIMD processor core employed in multimedia SoCs.