Similar Documents
20 similar documents found
1.
For mobile intelligent robot applications, an 81.6 GOPS object recognition processor is implemented. Based on an analysis of the target application, the chip architecture and hardware features are determined. The proposed processor supports both task-level and data-level parallelism. Ten processing elements are integrated for task-level parallelism, and single instruction multiple data (SIMD) instructions are added to exploit data-level parallelism. A memory-centric network-on-chip (NoC) is proposed to support efficient pipelined task execution across the ten processing elements. It also provides coherence and consistency schemes tailored for 1-to-N and M-to-1 data transactions in a task-level pipeline. For further performance gain, a visual image processing memory is also implemented. The chip is fabricated in a 0.18-μm CMOS technology and computes the key-point localization stage of SIFT object recognition twice as fast as a 2.3 GHz Core 2 Duo processor.

2.
A new architecture for a single instruction stream, multiple data stream (SIMD) implementation of the LMS adaptive algorithm is investigated. It is denoted a ring architecture, due to its physical configuration, and it effectively solves the latency problem often associated with prediction error feedback in adaptive filters. The multiprocessor ring efficiently updates the filter input vector by operating as a pipeline structure, while behaving as a parallel structure in computing the filter output and applying the weight adaptation algorithm. Finally, individual processor timing and capacity considerations are examined.

3.
Scientific modeling with massively parallel SIMD computers
A number of scientific models are discussed that possess a high degree of inherent parallelism. For simulation purposes this is exploited by employing a massively parallel SIMD (single instruction multiple data) computer. The authors describe one such computer, the distributed array processor (DAP), and discuss the optimal mapping of a typical problem onto the computer architecture to best exploit the model parallelism. By focusing on specific models currently under study, they exemplify the types of problems which benefit most from a parallel implementation. The extent of this benefit is considered relative to implementation on a machine of conventional architecture.

4.
To address the low efficiency of sparse matrix-vector multiplication (SpMV) on edge devices, this work studies sparse matrix storage formats and field programmable gate array (FPGA) acceleration of SpMV, and proposes an acceleration method that combines a multi-port modified compressed sparse row format (MCSR) with task-level and data-level hardware optimization on an ARM+FPGA architecture. Multiple ports access data in parallel to raise computational parallelism; dataflow and loop pipelining provide inter- and intra-loop parallel acceleration; array partitioning and streaming enable fine-grained parallel buffering and computation; and in the ARM+FPGA architecture the ARM controls the system while the computation is offloaded to the FPGA for parallel acceleration. Experimental results show that the optimized ARM+FPGA scheme achieves up to a 10x speedup over an ARM-only scheme with acceptable additional resource consumption, and the speedup becomes more pronounced as the matrix size and the number of non-zeros grow. The results are of practical value for implementing SpMV on edge devices.
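The following is a minimal C sketch of the baseline CSR sparse matrix-vector product that the multi-port MCSR format builds on; the multi-port partitioning, FPGA dataflow pragmas, and ARM-side control described in the abstract are not shown, and all names are illustrative rather than taken from the paper.

```c
#include <stdio.h>

/* Baseline CSR SpMV: y = A*x. The MCSR variant in the paper would split these
 * arrays across several memory ports so the FPGA can fetch and multiply
 * several non-zeros per cycle; that partitioning is not shown here. */
void spmv_csr(int n_rows, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < n_rows; ++i) {        /* one output element per row */
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];    /* accumulate non-zero * x     */
        y[i] = sum;
    }
}

int main(void)
{
    /* 3x3 matrix [[4,0,1],[0,2,0],[3,0,5]] in CSR form */
    int    row_ptr[] = {0, 2, 3, 5};
    int    col_idx[] = {0, 2, 1, 0, 2};
    double val[]     = {4, 1, 2, 3, 5};
    double x[]       = {1, 2, 3};
    double y[3];

    spmv_csr(3, row_ptr, col_idx, val, x, y);
    printf("%g %g %g\n", y[0], y[1], y[2]);   /* expected: 7 4 18 */
    return 0;
}
```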

5.
Zhong Sheng. Acta Electronica Sinica, 2009, 37(7): 1546-1553
To meet the real-time processing requirements of giga-pixel frames, and to address the heavy computational load of the DCT and the insufficient parallelism of conventional processing, this paper proposes a data-parallel DCT implementation based on a SIMD PE array. Because the PE array itself is scalable, the method can be applied to embedded systems with different parallelism requirements. A data-parallel operation scheme based on PE identifiers is proposed, which solves the "PE autonomy" problem in local computation and also removes the time overhead of data addressing. The scheme is regular and concise, satisfies the strong regularity required by SIMD operation, and is in line with the development direction of parallel processing technology.
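As a reference point for the data-parallel mapping described above, the sketch below shows a plain 1-D 8-point DCT-II in C; in a PE-array scheme each PE would evaluate one output coefficient (or a strip of them) in lockstep, whereas here the outer loop plays that role sequentially. This is our own illustrative sketch, not the paper's implementation.

```c
#include <math.h>

#define N 8
static const double PI = 3.14159265358979323846;

/* Orthonormal 1-D DCT-II: X[k] = c(k) * sum_n x[n]*cos(pi*(2n+1)*k / (2N)).
 * The outer k loop is the work a SIMD PE array would distribute over PEs. */
void dct_1d(const double x[N], double X[N])
{
    for (int k = 0; k < N; ++k) {
        double ck  = (k == 0) ? sqrt(1.0 / N) : sqrt(2.0 / N);  /* scale factor */
        double sum = 0.0;
        for (int n = 0; n < N; ++n)
            sum += x[n] * cos(PI * (2 * n + 1) * k / (2.0 * N));
        X[k] = ck * sum;
    }
}
```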

6.
In most parallel loops of embedded applications, every iteration executes the exact same sequence of instructions while manipulating different data. This fact motivates a new compiler-hardware orchestrated execution framework in which all parallel threads share one fetch unit and one decode unit but have their own execution, memory, and write-back units. This resource sharing enables parallel threads to execute in lockstep with minimal hardware extension and compiler support. Our proposed architecture, called multithreaded lockstep execution processor (MLEP), is a compromise between the single-instruction multiple-data (SIMD) and symmetric multithreading/chip multiprocessor (SMT/CMP) solutions. The proposed approach is more favorable than a typical SIMD execution in terms of degree of parallelism, range of applicability, and code generation, and can save more power and chip area than the SMT/CMP approach without significant performance degradation. For architecture verification, we extend a commercial 32-bit embedded core, AE32000C, and synthesize it on a Xilinx FPGA. Compared to the original architecture, our approach is 13.5% faster with a 2-way MLEP and 33.7% faster with a 4-way MLEP in EEMBC benchmarks that are automatically parallelized by the Intel compiler.

7.
In order to achieve high computational performance and low power consumption, many modern microprocessors are equipped with special multimedia instructions and multi-core processing capabilities. The number of cores on a single chip roughly doubles every three years. Therefore, besides complexity reduction by smart algorithms such as fast macroblock mode selection, an effective algorithm for parallelizing H.264/AVC is also crucial for implementing a real-time encoder on a multi-core system. Such an algorithm serves to uniformly distribute the H.264/AVC encoding workload over several slower and simpler processor cores on a single chip. In this paper, we propose a new adaptive slice-size selection technique for efficient slice-level parallelism of H.264/AVC encoding on a multi-core processor, using fast macroblock mode selection as a pre-processing step. To this end, we propose a method that estimates the computational complexity of each macroblock from the preliminary macroblock mode selection. Simulation results with a number of test video sequences show that, without any noticeable degradation, the proposed fast macroblock mode selection reduces the total encoding time by about 57.30%. The proposed adaptive slice-level parallelism shows good parallel performance compared to conventional fixed slice-size parallelism. The method can be applied to many multi-core systems for real-time H.264 video encoding.
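A hedged sketch of the underlying load-balancing idea: given a per-macroblock complexity estimate (the paper derives one from a fast mode pre-selection pass), slice boundaries are chosen so that each core's slice carries roughly equal total complexity rather than an equal macroblock count. The greedy partitioning below is our own illustration under that assumption, not the authors' exact algorithm; all names are hypothetical.

```c
/* Choose n_cores-1 slice boundaries so cumulative complexity is balanced.
 * mb_cost[i]   : estimated complexity of macroblock i (illustrative input)
 * slice_end[s] : index of the last macroblock assigned to slice s          */
void balance_slices(const int *mb_cost, int n_mb, int n_cores, int *slice_end)
{
    long total = 0;
    for (int i = 0; i < n_mb; ++i)
        total += mb_cost[i];

    long acc = 0;
    int  s   = 0;
    for (int i = 0; i < n_mb && s < n_cores - 1; ++i) {
        acc += mb_cost[i];
        long target = (long)(s + 1) * total / n_cores;  /* cumulative target */
        if (acc >= target)
            slice_end[s++] = i;       /* close slice s at macroblock i       */
    }
    while (s < n_cores)
        slice_end[s++] = n_mb - 1;    /* last slice ends at the final MB     */
}
```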

8.
The adaptation process in digital filters requires extensive calculation. This computation makes adaptation a slow and time-consuming process. Simple but versatile parallel algorithms for adaptive filters, suitable for VLSI implementation, are in demand. In this paper a regular and modular parallel algorithm for an adaptive filter is presented. This parallel structure is based on the gradient vector estimation algorithm, which minimizes the mean square error. In the parallel method, the tap weights of the adaptive filter are updated every s steps, whereas in the recursive algorithms the tap weights are updated at each step. For the s-step update, bit strings of length s are used to derive the terms with which the tap weights of the adaptive filter are calculated. The algorithm presented computes the tap weights at time n+s as a function of the tap weights at time n, the inputs from time n+1 to n+s−1, and the desired output from time n+1 to n+s−1. The algorithm can also be mapped to a VLSI architecture that is both regular and modular and allows expansion of either the order of the filter or the degree of parallelism obtainable. A comparison between the performance of the sequential LMS algorithm, the fast exact LMS algorithm, and the parallel binary-structured LMS algorithm is presented.
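For reference, here is a minimal C sketch of the sequential (per-sample) LMS recursion that the block-parallel structure above reformulates; in the parallel algorithm the same weights would only be materialized every s samples. The step size mu and the tap count are illustrative assumptions, not values from the paper.

```c
/* One per-sample LMS step:
 *   y(n)   = w^T x(n)
 *   e(n)   = d(n) - y(n)
 *   w(n+1) = w(n) + mu * e(n) * x(n)
 * x points to the current input vector of length `taps`. */
void lms_step(double *w, const double *x, double d, double mu, int taps)
{
    double y = 0.0;
    for (int i = 0; i < taps; ++i)
        y += w[i] * x[i];             /* filter output                      */
    double e = d - y;                 /* prediction error                   */
    for (int i = 0; i < taps; ++i)
        w[i] += mu * e * x[i];        /* gradient-descent weight update     */
}
```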

9.
We deal with parallelism at the data level. We describe an implementation of the architectural technique called sub-word parallelism (SWP), which increases parallelism at the data-element level by partitioning a processor's data path. The specific implementation we focus on is based on the TigerSHARC DSP architecture, developed at Analog Devices, Inc. As a result of SWP, the same data path and computation units perform more than one computation on an N-element composite word; this composite word consists of multiple adjacent sub-words. SWP is quite common and exists in production versions of most major general-purpose microprocessors. We also present an implementation of an FIR filter on the TigerSHARC using data-level SWP as an example.
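A minimal sketch of the sub-word parallelism idea in portable C: one 32-bit word carries two 16-bit sub-words, and a single packed addition yields both lane results while carries are kept from crossing the lane boundary. Real SWP datapaths (such as those discussed for the TigerSHARC) do this in hardware; the bit masking below only emulates the effect and is not the paper's code.

```c
#include <stdint.h>
#include <stdio.h>

/* Packed add of two 16-bit lanes held in one 32-bit word (SWAR style):
 * low 15 bits of each lane are added normally, then the lane MSBs are patched
 * so no carry propagates from the low lane into the high lane. */
static uint32_t add_packed16(uint32_t a, uint32_t b)
{
    const uint32_t H = 0x80008000u;            /* MSB of each 16-bit lane    */
    uint32_t t = (a & ~H) + (b & ~H);          /* add low 15 bits per lane   */
    return t ^ ((a ^ b) & H);                  /* fix MSBs, drop cross-carry */
}

int main(void)
{
    uint32_t a = (7u << 16) | 40000u;          /* lanes: 7 and 40000         */
    uint32_t b = (5u << 16) | 30000u;          /* lanes: 5 and 30000         */
    uint32_t s = add_packed16(a, b);
    /* expected: hi = 12, lo = 70000 mod 65536 = 4464 (per-lane wraparound)  */
    printf("hi=%u lo=%u\n", (unsigned)(s >> 16), (unsigned)(s & 0xFFFFu));
    return 0;
}
```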

10.
Two techniques for efficient computation of filters that support time-varying coefficients are developed. These methods are forms of distributed arithmetic that encode the data, rather than the filter coefficients. The first approach efficiently computes scalar-vector products, with which a digital filter is easily implemented in a transpose-form structure. This method, based on digital coding, supports time-varying coefficients with no additional overhead. Alternatively, distributed-arithmetic schemes that encode the data stream in sliding blocks support efficient direct-form filter computation with time-varying coefficients. A combination of both of these techniques greatly reduces the computation required to implement LMS adaptive filters.

11.
High-Performance Code Generation Techniques for VLIW Architectures
DSP processors achieve high performance by adopting the VLIW architecture, which at the same time makes it harder for the compiler to generate assembly code for them. The code generator, as the code-generation component of the compiler, is the key to realizing the performance of a VLIW architecture. This paper therefore proposes and implements a code generator based on a retargetable compilation framework. The code generator fully exploits the architectural features of VLIW, supports SIMD instructions and predicated execution, and can generate assembly code with a high degree of instruction-level parallelism, significantly improving application performance.

12.
Preliminary results of a concise investigation of the performance benefits obtained by exploiting thread and data parallelism in fast motion estimation algorithms in MPEG-2 are presented. Thirteen such fast ME algorithms were implemented using both thread-parallel and data-parallel schemes to determine their computational requirements in an embedded environment. The results are then compared to both the default (non-parallelised, full-search) as well as their respective (non-parallelised, fast) versions. Results conclusively demonstrate that both thread- and data-level parallelism should be exploited for cases where full-search motion estimation is a requirement. By contrast, all fast methods demonstrate that wide data-parallel hardware provides little performance improvement over a conservative, 4-byte single-instruction, multiple-data (SIMD) sum-of-absolute-differences (SAD) coprocessor. In the context of portable, consumer applications, both sets of results strongly suggest a multi-core approach with moderate data-parallel infrastructure.
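For context, here is a scalar C sketch of the 16×16 sum-of-absolute-differences (SAD) kernel that dominates block-matching motion estimation; a 4-byte SIMD SAD coprocessor of the kind mentioned above essentially evaluates the inner loop four pixels at a time. This shows the reference behaviour only, not the coprocessor interface, and all names are illustrative.

```c
#include <stdint.h>
#include <stdlib.h>

/* SAD between a 16x16 block of the current frame and a candidate block of the
 * reference frame; `stride` is the width of the image rows in bytes. */
static unsigned sad_16x16(const uint8_t *cur, const uint8_t *ref, int stride)
{
    unsigned sad = 0;
    for (int y = 0; y < 16; ++y) {
        for (int x = 0; x < 16; ++x)
            sad += (unsigned)abs((int)cur[x] - (int)ref[x]);
        cur += stride;                /* advance both blocks one image row  */
        ref += stride;
    }
    return sad;
}
```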

13.
Multimedia SIMD extensions are commonly employed today to speed up media processing. When performing vectorization for SIMD architectures, one of the major issues is handling memory alignment. Prior studies focused either on vectorizing loops with all memory references properly aligned, or on introducing extra operations to deal with misaligned memory references. On the other hand, multi-core SIMD architectures require coarse-grain parallelism. It is therefore an important problem to study how to parallelize and vectorize loop nests with awareness of data misalignment. This paper presents a loop transformation scheme that maximizes the parallelism of outermost loops while reducing the misaligned memory references in innermost loops. The basic idea of our technique is to align each level of loops in the nest, subject to the constraints imposed by the dependence relations. To reduce data misalignment, we establish a mathematical model based on a concept of offset collection and propose an effective heuristic algorithm. For coarser-grain parallelism, we propose rules for analyzing the outermost loop. When transformations are applied, the inner loops are involved to maximize parallelism. To avoid introducing more data misalignment, the involved innermost loop is handled from the other loop levels. Experimental results show that 7% to 37% (on average 18.4%) of misaligned memory references can be removed. Simulations on CELL show that a 1.1x speedup can be reached by reducing the misaligned data, while a 6.14x speedup can be achieved by enhancing the parallelism for multi-core.

14.
Emerging Software Defined Radio (SDR) baseband platforms are based on multiple processors with massive parallelism. Although the computational power of these platforms would theoretically enable SDR solutions with advanced wireless signal processing, existing work still implements rather basic algorithms. For instance, current Multiple-Input Multiple-Output (MIMO) detector implementations are typically based on simple linear hard-output detection and not on advanced near-Maximum-Likelihood (ML) soft-output detection. However, only the latter enables exploiting the full potential of MIMO technology. In this work, we explore the feasibility of advanced soft-output near-ML MIMO detectors on massively parallel processors. Although such detectors are considered very challenging due to their high computational complexity, we combine architecture-friendly algorithm design, application-specific instructions, and instruction-level/data-level parallelism exploration to make SDR solutions feasible. We show that, by applying the proposed combination of techniques, it is possible to obtain SDR implementations that deliver data rates sufficient for future wireless systems. For example, a 2 × 4 Coarse Grain Array (CGA) processor with 16-way Single Instruction Multiple Data (SIMD) can deliver 192/368 Mbps throughput for 2 × 2 64/16-QAM transmissions. Finally, we estimate the area and power consumption of the programmable solution and compare it against a traditional Application Specific Integrated Circuit (ASIC) approach. This enables us to draw conclusions from the cost perspective.

15.
Based on the parallel computing capability of multi-core processors, a real-time H.264/AVC video encoding system for ultra-high-definition resolution (3840×2160) is designed and implemented. The system performs efficient memory management at the raw pixel input; the UHD encoder adopts a parallel scheme at the frame, slice, and instruction levels; and the bitstream output uses a FIFO buffer to control the transmission rate of RTP packets. Experimental results show that the encoding system can encode a UHD video source in parallel in real time and transmit it to an IP network in RTP encapsulation, where users can receive and play it back with a video player.

16.
Modular arithmetic is a building block for a variety of applications potentially supported on embedded systems. One approach to making modular arithmetic more efficient is to identify algorithmic modifications that enhance the parallelization of the target arithmetic, in order to exploit the properties of parallel devices and platforms. The Residue Number System (RNS) introduces data-level parallelism, enabling parallelization even for algorithms based on modular arithmetic with several data dependencies. However, the mapping of generic algorithms to full RNS-based implementations can be complex, and suitable hardware architectures that are scalable and adaptable to different demands are required. This paper proposes and discusses an architecture with scalability features for the parallel implementation of algorithms relying on modular arithmetic fully supported by the RNS. The systematic mapping of a generic modular arithmetic algorithm to the architecture is presented. It can be applied as a high-level synthesis step in an Application Specific Integrated Circuit (ASIC) or Field Programmable Gate Array (FPGA) design flow targeting modular arithmetic algorithms. An implementation of modular exponentiation and Elliptic Curve (EC) point multiplication, used in the Rivest-Shamir-Adleman (RSA) and EC cryptographic algorithms, on the Xilinx Virtex 4 and Altera Stratix II FPGA technologies suggests latency results in the same order of magnitude as the fastest hardware implementations of these operations known to date.
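A minimal sketch of the data-level parallelism RNS provides: with pairwise coprime moduli, addition and multiplication act independently on each residue channel, so the channels can be processed by parallel hardware. The moduli and values below are toy assumptions for illustration, and reconstruction via the Chinese Remainder Theorem is omitted.

```c
#include <stdint.h>
#include <stdio.h>

#define CH 3
static const uint32_t M[CH] = {13, 17, 19};      /* pairwise coprime moduli  */

/* Split an integer into one residue per channel. */
static void to_rns(uint32_t x, uint32_t r[CH])
{
    for (int i = 0; i < CH; ++i)
        r[i] = x % M[i];
}

/* Channel-wise multiplication: each channel is independent of the others,
 * which is exactly what allows a parallel (per-channel) implementation. */
static void rns_mul(const uint32_t a[CH], const uint32_t b[CH], uint32_t c[CH])
{
    for (int i = 0; i < CH; ++i)
        c[i] = (a[i] * b[i]) % M[i];
}

int main(void)
{
    uint32_t a[CH], b[CH], c[CH];
    to_rns(100, a);
    to_rns(23, b);
    rns_mul(a, b, c);                            /* residues of 100*23 = 2300 */
    /* expected: 2300 mod 13, 17, 19 = 12 5 1                                 */
    printf("%u %u %u\n", (unsigned)c[0], (unsigned)c[1], (unsigned)c[2]);
    return 0;
}
```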

17.
The parallel computing capability of many-core processors provides a basis for implementing a parallel AVS decoder. This paper proposes a parallel design that mixes functional parallelism and data parallelism, using two levels of parallelism: between frames and across macroblock rows. The work uses Tilera's Tile-Gx36 many-core processor and exploits its SIMD instruction set to optimize the dequantization, inverse transform, and interpolation modules. Experimental results show that the design achieves a good parallel speedup and can perform real-time decoding of one AVS high-definition stream on six cores.

18.
The Lucas-Kanade (LK) optical flow method is widely used in moving object detection and tracking, but its heavy computation and low speed make it hard to adapt to heterogeneous hardware platforms. To run the LK method efficiently on different platforms, a parallel LK optical flow algorithm based on the Open Computing Language (OpenCL) was designed. The algorithm achieves data parallelism by mapping the dense per-pixel computation over the 2-D image onto many threads, and uses OpenCL optimizations such as local (shared) memory to reduce data transfers between host memory and device memory. Experiments show that the algorithm achieves up to a 31x speedup over the LK implementation in the baseline OpenCV library on a multi-core CPU, while its speed is close to that of a CUDA-accelerated LK optical flow implementation. In addition, the portability of the accelerated algorithm was verified on a variety of devices.
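A minimal C sketch of the per-pixel Lucas-Kanade computation that such an OpenCL kernel parallelizes: each work-item accumulates gradient sums over a window around its pixel and solves the resulting 2×2 system for the flow (u, v). Pyramids, boundary handling, and the OpenCL host code are omitted; all names are illustrative and this is not the paper's implementation.

```c
#include <math.h>

/* Lucas-Kanade flow at pixel (px, py), assuming precomputed spatial gradients
 * Ix, Iy and temporal difference It, all stored as width-major float images.
 * `win` is the window half-size (window is (2*win+1)^2 pixels). */
void lk_flow_at(const float *Ix, const float *Iy, const float *It,
                int width, int px, int py, int win, float *u, float *v)
{
    float sxx = 0, sxy = 0, syy = 0, sxt = 0, syt = 0;
    for (int dy = -win; dy <= win; ++dy) {
        for (int dx = -win; dx <= win; ++dx) {
            int i = (py + dy) * width + (px + dx);
            sxx += Ix[i] * Ix[i];     /* structure-tensor entries            */
            sxy += Ix[i] * Iy[i];
            syy += Iy[i] * Iy[i];
            sxt += Ix[i] * It[i];     /* gradient-temporal correlations      */
            syt += Iy[i] * It[i];
        }
    }
    /* Solve [sxx sxy; sxy syy] [u v]^T = -[sxt syt]^T by Cramer's rule. */
    float det = sxx * syy - sxy * sxy;
    if (fabsf(det) < 1e-6f) { *u = *v = 0.0f; return; }   /* ill-conditioned */
    *u = (-syy * sxt + sxy * syt) / det;
    *v = ( sxy * sxt - sxx * syt) / det;
}
```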

19.
A Novel Macro-Block Group Based AVS Coding Scheme for Many-Core Processor
Implementation of video coding systems such as H.264/AVC and AVS on multi-core and many-core platforms is attracting much attention. Slice-level parallelism is popular in parallel video coding for its simplicity and flexibility; however, video quality degrades considerably because the partitioning into slices breaks the dependency between macro-blocks, especially on multi-core or many-core platforms. To address this problem, we propose a Macro-Block Group (MBG) parallel scheme for parallel AVS coding. In the proposed scheme, video frames are equally divided into rectangular MBG regions; each MBG consists of more rows and fewer columns of macro-blocks than in the slice-level scheme. Since the MBG is not syntactically supported by AVS, a vertical partitioning scheme is introduced. Additionally, we use mode-confining and motion-vector-difference adjusting techniques to keep consistency with the standard. Two MBG parallel schemes (5 × 9 MBG partition and 8 × 7 MBG partition) are developed on a TILE64 many-core platform, where P/B frames use the MBG parallel scheme and I frames use macro-block-level parallelism. Experimental results show that the 5 × 9 MBG partition scheme can reduce quality loss by 52% (IPPP) and 41% (IBBP) while keeping the same speed-up as slice-level parallelism. With more cores employed, the 8 × 7 MBG partition scheme gains a 23.9x speed-up compared with the single-core implementation and achieves coding performance similar to the 5 × 9 scheme.

20.
An efficient angle-rotation architecture, suitable for use in a digital mixer, is presented. The architecture employs a decomposition of the high-precision rotation angle into a coarse angle and a fine angle, and employs two processing stages, coarse and fine. Only the coarse stage employs a ROM, and that ROM is extremely small. Small multipliers are shown to suffice while providing full accuracy at the system output. A systematic development of the architecture and its supporting theory and assumptions is given, along with the rationale for determining its structure. A rigorous analysis of its error sources is presented, as well as the computation of bounds for the various errors they induce at the system output. This error analysis is shown to lead to a straightforward means of designing an efficient two-stage angle-rotation unit for a digital mixer, given input/output bit widths and performance specifications.
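A floating-point sketch of the coarse/fine decomposition idea: the rotation angle is split as theta = theta_c + theta_f, where theta_c indexes a small sine/cosine ROM and the small residual theta_f is applied with low-order terms (sin t ≈ t, cos t ≈ 1 − t²/2), so only small multipliers are needed. The ROM size and the use of double precision here are assumptions for illustration, not the paper's fixed-point design.

```c
#include <math.h>

#define COARSE_BITS 6                          /* assumed: 64-entry coarse ROM */
#define COARSE_SIZE (1 << COARSE_BITS)
static const double PI = 3.14159265358979323846;

/* Rotate the complex sample (re, im) by `theta` radians in two stages. */
void rotate(double re, double im, double theta, double *out_re, double *out_im)
{
    double step = 2.0 * PI / COARSE_SIZE;
    int    k    = (int)floor(theta / step + 0.5);   /* coarse ROM index       */
    double tc   = k * step;                          /* coarse angle           */
    double tf   = theta - tc;                        /* small fine residual    */

    /* Coarse stage: exact cos/sin values would be read from the small ROM. */
    double cc = cos(tc), sc = sin(tc);
    double r1 = re * cc - im * sc;
    double i1 = re * sc + im * cc;

    /* Fine stage: low-order approximation, valid because |tf| <= pi/COARSE_SIZE. */
    double cf = 1.0 - 0.5 * tf * tf;
    double sf = tf;
    *out_re = r1 * cf - i1 * sf;
    *out_im = r1 * sf + i1 * cf;
}
```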
