1.

Configurable parallel memory architecture for multimedia computers

Kimmo Jarno Timo Jarkko 《Journal of Systems Architecture》2002,47(14-15)

This paper presents a novel parallel memory architecture for multimedia computers. By applying configurable or programmable addressing circuitry capable of parallel memory accesses, the memory management of multimedia applications can be enhanced. The necessary computer architecture changes to virtual address representation, paging, virtual memory, address computation circuitry and data permutation are discussed. These changes allow the memory to be partitioned for different access functions. In addition, the same memory area can be accessed with multiple access patterns. Therefore, a general-purpose computing system that is capable of exploiting the repeating memory access patterns in its applications can be built. The performance of the configurable parallel memory architecture (CPMA) is analyzed for a selection of algorithms from a video encoder. These motion estimation algorithms and zigzag scanning benefit from the multiple memory access functions, as is apparent from comparisons with traditional sequential memory accesses.
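The abstract does not give the access functions used in CPMA, so the following is only a minimal sketch of the general idea of a module assignment function, using the classic skewed storage scheme as a stand-in. The names `linear_module`, `skewed_module`, and `is_conflict_free` are hypothetical. It shows why one memory area can serve more than one access pattern in parallel when the mapping from array coordinates to memory modules is chosen well.

```python
def linear_module(i, j, n):
    """Low-order interleaving: the module depends on the column index only."""
    return j % n


def skewed_module(i, j, n):
    """Skewed storage (illustrative assumption, not the paper's function):
    each row is rotated by one module relative to the previous row."""
    return (i + j) % n


def is_conflict_free(module_fn, coords, n):
    """True if every coordinate in the access pattern maps to a different
    memory module, so all elements can be fetched in one parallel access."""
    modules = [module_fn(i, j, n) for i, j in coords]
    return len(set(modules)) == len(modules)


def row_pattern(i, n):
    """n consecutive elements of row i."""
    return [(i, j) for j in range(n)]


def column_pattern(j, n):
    """n consecutive elements of column j."""
    return [(i, j) for i in range(n)]
```

With four modules, both mappings serve a row access without conflicts, but only the skewed mapping also serves a column access; under linear interleaving every element of a column lands in the same module and the access is serialized.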

2.

Compiler-Directed Code Restructuring for Improving Performance of MPSoCs

Guilin Chen Kandemir M. 《Parallel and Distributed Systems, IEEE Transactions on》2008,19(9):1201-1214

One of the critical goals in code optimization for multi-processor-system-on-a-chip (MPSoC) architectures is to minimize the number of off-chip memory accesses, because such accesses can be extremely costly in terms of both performance and power. While conventional data locality optimization techniques can be used to improve the data access pattern of each processor independently, such techniques usually do not consider locality for shared data. This paper proposes a strategy that reduces the number of off-chip references due to shared data. It achieves this goal by restructuring a parallelized application code so that a given data block is accessed by parallel processors within the same time frame, maximizing its reuse while it is in the on-chip memory space. This tends to minimize the number of off-chip references, since the accesses to a given data block are clustered within a short period of time during execution. Our approach employs a polyhedral tool that helps us isolate computations that manipulate a given data block. To test the effectiveness of our approach, we implemented it using a publicly available compiler infrastructure and conducted experiments with twelve data-intensive embedded applications. Our results show that optimizing data locality for shared data elements is very useful in practice.

3.

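A Simplified Two-Dimensional DWT Architecture Design

曹志研 季振洲 胡铭曾 《计算机工程》2007,33(23):228-229

A low-power two-dimensional discrete wavelet transform (DWT) architecture is designed for image compression in wireless sensor networks. The architecture implements a reduced-complexity (5,3) integer DWT and uses pipelining and delay lines so that, while achieving high computational throughput, data are used as efficiently as possible by the processing units, which reduces the number of accesses to on-chip and off-chip memory. The multi-level 2-D DWT is implemented by unfolding, which lets the next transform level start as early as possible and requires neither a large on-chip memory nor on-chip memory access operations. Simulation and an FPGA implementation verify that the system offers low complexity, low power consumption, and a small on-chip memory while meeting the required performance.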
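To make the (5,3) integer transform concrete, here is a minimal one-dimensional sketch of the reversible 5/3 lifting scheme in software. It is not the paper's hardware architecture: the function names, the restriction to even-length inputs, and the symmetric boundary extension are assumptions made for illustration. A 2-D transform would apply the same steps along rows and then along columns.

```python
def dwt53_forward(x):
    """One level of the reversible (5,3) integer DWT via lifting.

    Returns (s, d): the low-pass (approximation) and high-pass (detail)
    coefficients. Assumes an even-length list of integers and symmetric
    extension at both ends.
    """
    if len(x) < 2 or len(x) % 2 != 0:
        raise ValueError("this sketch needs an even-length input")
    even, odd = x[0::2], x[1::2]
    half = len(even)

    # Predict step: detail = odd sample minus the floor-average of its
    # two even neighbours (the right neighbour is mirrored at the end).
    d = []
    for n in range(half):
        right = even[n + 1] if n + 1 < half else even[n]
        d.append(odd[n] - ((even[n] + right) >> 1))

    # Update step: approximation = even sample plus a rounded quarter of
    # the two neighbouring details (the left detail is mirrored at the start).
    s = []
    for n in range(half):
        left = d[n - 1] if n > 0 else d[0]
        s.append(even[n] + ((left + d[n] + 2) >> 2))
    return s, d


def dwt53_inverse(s, d):
    """Undo dwt53_forward exactly by running the lifting steps in reverse."""
    half = len(s)
    even = []
    for n in range(half):
        left = d[n - 1] if n > 0 else d[0]
        even.append(s[n] - ((left + d[n] + 2) >> 2))
    x = []
    for n in range(half):
        right = even[n + 1] if n + 1 < half else even[n]
        x.append(even[n])
        x.append(d[n] + ((even[n] + right) >> 1))
    return x
```

Because each lifting step only adds or subtracts an integer computed from the other half of the samples, the inverse recovers the input exactly, and the arithmetic needs only additions and shifts, which is one reason this transform suits low-power hardware.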

4.

Random access schemes for efficient FPGA SpMV acceleration

《Microprocessors and Microsystems》2016

Utilizing hardware resources efficiently is vital to building the future generation of high-performance computing systems. The sparse matrix – dense vector multiplication (SpMV) kernel, which is notorious for its poor efficiency on conventional processors, is a key component in many scientific computing applications and increasing SpMV efficiency can contribute significantly to improving overall system efficiency. The major challenge in implementing SpMV efficiently is handling the input-dependent memory access patterns, and reconfigurable logic is a strong candidate for tackling this problem via memory system customization. In this work, we consider three schemes (all off-chip, all on-chip, caching) for servicing the irregular-access component of SpMV and investigate their effects on accelerator efficiency. To combine the strengths of on-chip and off-chip random accesses, we propose a hardware-software caching scheme named NCVCS that combines software preprocessing with a nonblocking cache to enable highly efficient SpMV accelerators with modest on-chip memory requirements. Our results from the comparison of the three schemes implemented as part of an FPGA SpMV accelerator show that our scheme effectively combines the high efficiency from on-chip accesses with the capability of working with large matrices from off-chip accesses.
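The input-dependent access pattern mentioned above is easiest to see in a plain software SpMV kernel. The sketch below assumes the common compressed sparse row (CSR) layout, which the abstract does not name; `spmv_csr` and its parameter names are illustrative. The reads of `x[col_idx[k]]` are the irregular-access component: their addresses depend on the matrix contents rather than on the loop counters.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    row_ptr[r]..row_ptr[r + 1] delimits the nonzeros of row r,
    col_idx[k] is the column of the k-th nonzero, values[k] its value.
    """
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            # values[] and col_idx[] are streamed sequentially, but the
            # read of x[] jumps wherever the column index points.
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y
```

For the 3×3 matrix with rows (1, 0, 2), (0, 3, 0), (4, 0, 5), the CSR arrays are `row_ptr = [0, 2, 3, 5]`, `col_idx = [0, 2, 1, 0, 2]`, and `values = [1, 2, 3, 4, 5]`.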

5.

Buffer structure optimized VLSI architecture for efficient hierarchical integer pixel motion estimation implementation

Haibing Yin Dong Sun Park Xiao Yun Zhang 《Journal of Real-Time Image Processing》2016,11(3):507-525

Integer pixel motion estimation (IME) is a crucial, high-complexity module in high-definition video encoders. Joint algorithm and architecture design must trade off multiple target parameters, including throughput, logic gate count, on-chip SRAM size, memory bandwidth, and rate-distortion performance. Data organization and on-chip buffer structure are crucial factors in IME architecture design, since they determine the trade-off among these targets. In this work, we combine global hierarchical search and local full search to propose a hardware-efficient IME algorithm, and then propose a VLSI architecture with an optimized on-chip buffer structure. The major contributions of this work are: (1) an improved hierarchical IME algorithm with presearch and deliberate data organization, (2) a multistage on-chip reference pixel buffer structure with high data reuse between integer and fractional pixel motion estimation, and (3) a highly reused and reconfigurable processing element structure. The optimized data organization and buffer structure achieve nearly 70% buffer saving with a PSNR degradation of less than 0.08 dB on average and 0.12 dB in the worst case, compared with a full-search-based architecture. At a hardware cost of 336 K and 382 K logic gates and 20 kB of SRAM, the proposed architecture achieves a throughput of 384 and 272 cycles per macroblock, at system frequencies of 95 and 264 MHz, for 1080p and QFHD video coding at 30 fps, respectively.

6.

Fast and deterministic hash table lookup using discriminative bloom filters

Kun Huang Gaogang Xie Rui Li Shuai Xiong 《Journal of Network and Computer Applications》2013,36(2):657-666

Hash tables are widely used in network applications, as they can achieve O(1) query, insert, and delete operations at moderate loads. However, at high loads, collisions are prevalent in the table, which increases the access time and induces non-deterministic performance. Slow rates and non-determinism can considerably hurt the performance and scalability of hash tables in multi-threaded parallel systems such as ASIC/FPGA and multi-core. It is therefore critical to keep hash operations fast and deterministic. This paper presents a novel fast collision-free hashing scheme using Discriminative Bloom Filters (DBFs) to achieve fast and deterministic hash table lookup. A DBF is a compact summary stored in on-chip memory. It is composed of an array of parallel Bloom filters organized by the discriminator. Each element lookup performs parallel membership checks on the on-chip DBF to produce a possible discriminator value. Then, the element plus the discriminator value is hashed to a possible bucket in an off-chip hash table to validate the match. This DBF-based scheme requires one off-chip memory access per lookup as well as less off-chip memory usage. Experiments show that our scheme achieves up to an 8.5-fold reduction in the number of off-chip memory accesses per lookup compared with previous schemes.
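The lookup flow described above can be sketched in software as follows. This is a simplified reading of the abstract, not the paper's design: the class names `BloomFilter` and `DBFTable`, the SHA-256-based hash helper, the filter sizes, and the insertion policy (take the first discriminator whose bucket is free) are all assumptions. A Python list stands in for the off-chip table, and the filters are checked one after another where hardware would check them in parallel.

```python
import hashlib


def _hash(data, salt, mod):
    """Deterministic hash of a string into range(mod); salt selects the function."""
    digest = hashlib.sha256(f"{salt}:{data}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % mod


class BloomFilter:
    def __init__(self, bits=256, hashes=3, seed=0):
        self.bits, self.hashes, self.seed = bits, hashes, seed
        self.array = [False] * bits

    def _positions(self, key):
        return [_hash(key, self.seed * 1000 + i, self.bits)
                for i in range(self.hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.array[pos] = True

    def __contains__(self, key):
        return all(self.array[pos] for pos in self._positions(key))


class DBFTable:
    """One Bloom filter per discriminator value (the on-chip summary)
    in front of a one-entry-per-bucket table (the off-chip memory)."""

    def __init__(self, buckets=1024, discriminators=4):
        self.filters = [BloomFilter(seed=d) for d in range(discriminators)]
        self.table = [None] * buckets
        self.offchip_reads = 0

    def _bucket(self, key, d):
        # The key together with its discriminator selects the bucket.
        return _hash(key, 10_000 + d, len(self.table))

    def insert(self, key, value):
        """Store a new key under the first discriminator whose bucket is free."""
        for d, bloom in enumerate(self.filters):
            b = self._bucket(key, d)
            if self.table[b] is None:
                self.table[b] = (key, value)
                bloom.add(key)
                return d
        raise RuntimeError("no collision-free bucket for this key")

    def lookup(self, key):
        for d, bloom in enumerate(self.filters):
            if key in bloom:                      # on-chip membership check
                self.offchip_reads += 1           # one off-chip validation
                entry = self.table[self._bucket(key, d)]
                if entry is not None and entry[0] == key:
                    return entry[1]
        return None
```

In this sketch a Bloom-filter false positive costs an extra off-chip read, so the read count per lookup is usually one but not guaranteed to be one; how the actual scheme bounds this is not stated in the abstract.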

7.

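VLSI Implementation of the AVS Fractional-Pixel Interpolation Algorithm

王方晴 王祖强 《电子技术应用》2010,36(4)

Based on the fractional-pixel interpolation algorithm used in AVS motion compensation, a new VLSI architecture is proposed that meets the requirements of real-time high-definition video decoding at level 6.2 of the AVS baseline profile (1920×1080, 4:2:2, 30 f/s). The AVS fractional-pixel interpolation algorithm is introduced. A new shift-register-based register file serves as the internal pixel memory, which improves parallel processing efficiency, and a systolic array is applied to the AVS interpolation filter, which effectively increases the speed of the motion-compensation interpolation computation.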

8.

Access pattern restructuring for memory energy

De La Luz V. Kadayif I. Kandemir M. Sezer U. 《Parallel and Distributed Systems, IEEE Transactions on》2004,15(4):289-303

Improving the memory energy consumption of programs that manipulate arrays is an important problem, as these codes spend large amounts of energy accessing off-chip memory. We propose a data-driven strategy to optimize the memory energy consumption in a banked memory system. Our compiler-based strategy modifies the original execution order of loop iterations in array-dominated applications to increase the length of the time period(s) in which memory banks are idle (i.e., not accessed by any loop iteration). To achieve this, it first classifies loop iterations according to their bank access patterns and then, with the help of a polyhedral tool, tries to bring the iterations with similar bank access patterns close together. Increasing the idle periods of memory banks brings two major benefits: first, it allows us to place more memory banks into low-power operating modes and, second, it enables us to use a more aggressive (i.e., more energy-saving) operating mode for a given bank instead of a less aggressive one. The proposed strategy can reduce memory energy consumption in both sequential and parallel applications. Our strategy has been implemented in an experimental compiler using a polyhedral tool and evaluated using nine array-dominated applications on both a cacheless system and a system with cache memory. Our experimental results indicate that the proposed strategy is very successful in reducing the memory system energy and improves the memory energy by as much as 36.8 percent over a strategy that uses low-power modes without optimizing the data access pattern. Our results also show that optimizations that target off-chip memory energy can generate very different results from those that target only cache locality.

9.

Scalable high-throughput variable block size motion estimation architecture

Stephen Warrington Wai-Yip Chan Subramania Sudharsanan 《Microprocessors and Microsystems》2009,33(4):319-325

Variable block size (VBS) motion compensated prediction (MCP) provides substantial rate-distortion performance gain over conventional fixed-block-size MCP and is a key feature of the H.264/AVC video coding standard. VBS-MCP requires the encoder to perform VBS motion estimation (VBSME), a computationally complex operation. In this paper, we propose a full-search VBSME architecture with high motion vector throughput. High performance is achieved by performing parallel computations for multiple pixels within a macroblock, as well as computing several candidate motion vector (MV) positions in parallel. Two implementations of the architecture are examined: a four-pixel-parallel implementation and a higher-performance 16-pixel-parallel implementation. A high degree of scalability is achieved by allowing for a variable-length processing element array, where more processing elements yield a higher degree of candidate MV parallelism. The proposed architecture achieves a throughput exceeding that of current full-search VBSME architectures.

10.

A reduced memory bandwidth and high throughput HDTV motion compensation decoder for H.264/AVC High 4:2:2 profile

Bruno Zatt Leandro M. de L. Silva Arnaldo Azevedo Luciano Agostini Altamiro Susin Sergio Bampi 《Journal of Real-Time Image Processing》2013,8(1):127-140

This article presents the HP422-MoCHA: an optimized Motion Compensation hardware architecture for the High 4:2:2 profile of the H.264/AVC video coding standard. The proposed design focuses on real-time decoding for HDTV 1080p (1,920 × 1,080 pixels) at 30 fps. It supports multiple sample bit-widths (8, 9, or 10 bits) and multiple chroma sub-sampling formats (4:0:0, 4:2:0, and 4:2:2) to provide an enhanced video quality experience. The architecture includes an optimized sample interpolator that processes luma and chroma samples in two parallel datapaths and features quarter-sample accuracy, bi-prediction and weighted prediction. HP422-MoCHA also includes a hardwired Motion Vector Predictor, supporting temporal and spatial direct predictions. A novel memory hierarchy implemented as a 3-D Cache reduces the frame memory access, providing, on average, reductions of 62% in bandwidth and 80% in clock cycles. The design was implemented in a Xilinx Virtex-II PRO FPGA, and also in an ASIC with a TSMC 0.18 μm standard cells technology. The ASIC implementation occupies 102 K equivalent gates and 56.5 KB of on-chip SRAM in a 3.8 × 3.4 mm² area. It presents a power consumption of 130 mW. Both implementations reach a maximum operation frequency of ~100 MHz, being able to motion compensate 37 bi-predictive or 69 predictive frames per second. The minimum required frequency to ensure real-time decoding for HD1080p at 30 fps is 82 MHz. Since HP422-MoCHA is the first Motion Compensation architecture for the High 4:2:2 profile found in the literature, a Main profile MoCHA was used for comparison purposes, showing the highest throughput among all presented works. However, the HP422-MoCHA architecture also reaches the highest throughput when compared with the other published Main profile MC solutions, even considering the significantly higher complexity of the High 4:2:2 profile.

11.

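Research on a Cache Optimization Algorithm for Real-Time Video Processing on DSPs

唐文佳 朱光喜 王曜 刘瑜 《小型微型计算机系统》2005,26(4):680-683

In DSP chips with a parallel very long instruction word (VLIW) architecture, the mismatch between CPU processing speed and off-chip data access speed causes CPU stalls and limits improvements in DSP system performance. To address this problem, a new data arrangement algorithm suited to video data processing on a DSP CPU is proposed based on the structure of the cache, and it is successfully applied to the optimization of an MPEG-4 program on the TriMedia PNX1301. System encoding results show that the method effectively reduces cache misses and the time overhead of off-chip data accesses. Under the same conditions, the algorithm improves system encoding performance by about 2 frames per second (CIF format).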

12.

DataScalar: A memory-centric approach to computing

《Journal of Systems Architecture》1999,45(12-13):1001-1022

Commodity microprocessors contain more on-chip memory with each successive generation, and will contain tens of megabytes within the decade. We describe a novel architecture that runs an unmodified uniprocessor program across multiple nodes, each of which contains a processor tightly integrated with a sizable memory. The execution of instructions is replicated, while the access of operands is distributed across the nodes. Each node accesses operands in its fast local memory and broadcasts them to the other nodes. This architecture exploits out-of-order execution and the fact that each chip has integrated processor and memory, to run memory-intensive, hard-to-parallelize programs more efficiently. In this paper, we describe an implementation with specific solutions to the unique problems that this architecture poses. Finally, we conclude by comparing simulation results of our implementation to more traditional equivalent systems. In our simulated implementation, five unmodified SPEC95 binaries ran – in most cases – considerably faster than in systems with more traditional memory systems.

13.

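Application-Oriented Evaluation and Improvement of Stream Memory Systems

汪芳 安虹 徐光 许牧 姚平 《小型微型计算机系统》2010,31(5)

Limited off-chip memory bandwidth is one of the bottlenecks restricting the performance of stream processors. Stream memory systems already use a variety of techniques to alleviate this problem, but current designs do not fully consider how application-specific memory access patterns affect effective bandwidth utilization. Through analysis and experiments, this paper evaluates how well the main design parameters of a stream memory system optimize different access patterns. On this basis, structural improvements are proposed for different degrees of stream access parallelism: support for wide issue and shortest-job-first scheduling is added, the locality and parallelism of memory accesses are fully exploited, and load balance is improved, which effectively increases off-chip bandwidth efficiency and the overall performance of stream programs.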

14.

General memory efficient packet matching FPGA architecture for future high-speed networks

《Microprocessors and Microsystems》2020

Packet classification (matching) is one of the critical operations in networking, widely used in many different devices and tasks ranging from switching or routing to a variety of monitoring and security applications such as firewalls or IDS. To satisfy the ever-growing performance demands of current and future high-speed networks, specially designed hardware-accelerated architectures implementing packet classification are necessary. These demands are now growing to such an extent that, in order to keep up with the rising throughputs of network links, FPGA-accelerated architectures are required to match multiple packets in every single clock cycle. To meet this requirement, a simple replication approach can be used: instantiate multiple copies of a processing pipeline that match incoming packets in parallel. However, simple replication of pipelines inevitably brings a significant increase in the utilization of FPGA resources of all types, which is especially costly for the rather scarce on-chip memories used in matching tables. We propose and examine a unique parallel hardware architecture for hash-based exact match classification of multiple packets in each clock cycle that reduces memory replication requirements. The core idea of the proposed architecture is to exploit the basic memory organization structure present in all modern FPGAs, where hundreds of individual block or distributed memory tiles are available and can be accessed (addressed) independently. This way, we are able to maintain a rather high throughput of matching multiple packets per clock cycle even without fully replicated memory resources in matching tables. Our results show that the designed approach can use on-chip memory resources very efficiently and even scales exceptionally well with increased capacities of match tables.
For example, the proposed architecture is able to achieve a throughput of more than 2 Tbps (over 3 000 Mpps) with an effective capacity of more than 40 000 IPv4 flow records at the cost of only a few hundred block memory tiles (366 BlockRAM for Xilinx or 672 M20K for Intel FPGAs) utilizing only a small fraction of available logic resources (around 68 000 LUTs for Xilinx or 95 000 ALMs for Intel).
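The two throughput figures quoted above, 2 Tbps and 3 000 Mpps, can be related with a short calculation. The sketch assumes minimum-size 64-byte Ethernet frames plus the standard 20 bytes of per-frame wire overhead (8 bytes of preamble and start delimiter, 12 bytes of inter-frame gap); the abstract does not state which packet size its figures refer to, so this is only one consistent reading. The function name `line_rate_bps` is illustrative.

```python
ETH_MIN_FRAME_BYTES = 64   # minimum Ethernet frame, including FCS
ETH_OVERHEAD_BYTES = 20    # preamble + SFD (8 bytes) + inter-frame gap (12 bytes)


def line_rate_bps(packets_per_second, frame_bytes=ETH_MIN_FRAME_BYTES):
    """Wire bandwidth occupied by a stream of equal-size Ethernet frames."""
    bits_per_packet = (frame_bytes + ETH_OVERHEAD_BYTES) * 8
    return packets_per_second * bits_per_packet


# 3 000 Mpps of minimum-size frames, expressed in Tbps.
rate_tbps = line_rate_bps(3_000_000_000) / 1e12
```

Under these assumptions each packet occupies 672 bits on the wire, so 3 000 Mpps corresponds to 2.016 Tbps, which matches "more than 2 Tbps (over 3 000 Mpps)".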

15.

Motion Analysis on the Micro Grained Array Processor

《Real》1997,3(2):101-110

Motion analysis plays a key role in video coding (e.g., video telephone, MPEG, HDTV) and computer vision systems (e.g., image segmentation, structure from motion). Motion estimation methods can be classified into three groups: matching-based, gradient-based, and frequency-based methods. The block matching algorithm (BMA) has been widely used for region matching in image coding, for example in MPEG (Moving Picture Experts Group). Optical flow computation based on the spatio-temporal constraint equation has been broadly used in image segmentation to compute each pixel's velocity on a moving object. For both of these tasks, dedicated ASIC systems have been developed and widely used. Unfortunately, such systems have the disadvantage of restricted adaptability. The Micro Grained Array Processor (MGAP), which is a fine-grained, mesh-connected, SIMD array processor being developed at Penn State University, can provide a more regular, flexible, and efficient approach for solving these two important computations in real time. In this paper, we propose a new data flow scheme for an efficient, systolic, full-search BMA on programmable array processors so that we can process as many adjacent template blocks as possible in unison in order to reduce the data memory accesses. In particular, we present an efficient implementation of the BMA on the MGAP. As a result, the BMA for the MPEG SIF video format (352 × 240 pixels) with a block size of 16 × 16 pixels, a displacement range of 16 pixels, and a frame rate of 30 frames/sec can be computed at real-time processing rates on the MGAP. We also show a real-time mapping to the MGAP of the optical flow computation for images of size 256 × 256 pixels.
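As a reference point for what the full-search BMA computes, here is a minimal sequential sketch. The sum of absolute differences (SAD) is assumed as the matching criterion, which the abstract does not specify, and the function names are illustrative. It performs the same search that the paper maps onto a systolic data flow, but without any of the parallelism or data reuse that the MGAP implementation is about.

```python
def sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences between the block at (bx, by) in the
    current frame and the block displaced by (dx, dy) in the reference frame."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
    return total


def full_search(cur, ref, bx, by, size, search_range):
    """Try every displacement within +/- search_range and return
    (best_sad, dx, dy). Candidates reaching outside the reference frame
    are skipped; ties keep the first candidate in scan order."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            inside = (0 <= by + dy and by + dy + size <= height and
                      0 <= bx + dx and bx + dx + size <= width)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

Every candidate displacement rereads almost the same reference pixels as its neighbour, which is why a data flow that shares those pixels among adjacent template blocks can cut memory accesses substantially.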

16.

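Design of a Memory Access Structure for Coarse-Grained Reconfigurable Arrays with Decoupled Memory Access and Computation

洪途 景乃锋 《计算机工程》2021,47(2):239-245

Coarse-grained reconfigurable array (CGRA) architectures combine flexibility with efficiency, but their high computational throughput also puts pressure on memory access. Given that the bandwidth of off-chip dynamic memory is relatively fixed, a memory access structure that decouples memory access from computation is designed. Control logic is integrated into a lightweight storage space, and the configurable storage space isolates the loop iterations of memory access from those of computation, which hides memory latency. The structure is also used to perform concatenation and alignment operations, in order to accommodate different ratios of compute frequency to memory access frequency and to optimize indirect accesses. Experimental results show that the structure achieves a 1.84× performance improvement on the target architecture, with out-of-order operation improving indirect access performance by 22% on average.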

17.

Optimizing memory access traffic via runtime thread migration for on-chip distributed memory systems

Weiwei Fu Tianzhou Chen Chao Wang Li Liu 《The Journal of supercomputing》2014,69(3):1491-1516

On-chip distributed memory systems have become an attractive solution for the massively parallel memory accesses found in future many-core processors. However, the increasing number of on-chip cores and memory controllers inevitably introduces many remote memory accesses, which generate a large amount of on-chip traffic and put great pressure on the interconnect. This paper tries to optimize on-chip memory access traffic via runtime thread migration. We first analyze memory access behaviors in multi-threaded applications and find that the memory access targets and volumes are similar during short periods, which makes runtime prediction feasible. Over long periods, however, the memory access targets exhibit great mobility, motivating us to dynamically move threads towards the data. Based on these observations, we propose a novel low-cost, distributed thread migration algorithm that adjusts thread placement in chains based on benefit estimation. We present details of the workflow, including the triggering and arbitration of migration requests and the procedures to determine the migration chains. Simulation results show that our algorithm achieves a system performance speedup of 11.5% and reduces average memory access latency by 11.0%. It can find a small number of effective thread migrations that optimize on-chip memory access traffic with acceptable hardware and runtime overheads.

18.

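High-Speed Packet Processing Based on Collision-Free Hashing with Midpoint Partitioning

张墨华 李戈 《计算机应用》2012,32(4):999-1002

High-speed packet inspection is achieved by storing all attack signatures in high-speed on-chip memory. To cope with limited on-chip memory space, a new trie structure based on a collision-free hash function with midpoint partitioning is proposed. It distributes the attack signature strings evenly among multiple groups at each level of the trie, which gives effective control over on-chip memory usage. Query operations are executed in a pipelined, parallel fashion within a single chip to obtain higher throughput. The space complexity of storing the midpoints is O(n), and the construction time of the hash table grows linearly with the number of attack signatures. Experimental results show that the method reduces on-chip memory requirements and that a signature match completes with only a single operation on the on-chip memory.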

19.

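Design of a Frame Buffer Architecture for a High-Performance Graphics and Image System

黄小虎 郑南宁 《小型微型计算机系统》1996,17(3):1-5

This paper presents a frame buffer architecture that supports high-performance graphics and image systems. The system uses three parallel techniques (SIMD, memory interleaving, and pipelining) together with a cache to increase the rate at which pixel data in the frame buffer are updated. The graphics processor can simultaneously access N/2 pixels (where N is the number of frame buffer modules) along a row, a column, or an arbitrary rectangular block. The Z-buffer in the system improves the efficiency of the three-dimensional hidden-surface removal algorithm.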

20.

Design and implementation of motion compensator in memory reduced HDTV decoder with embedded compression engine

Hongli Gao Fei Qiao Huazhong Yang 《Multimedia Tools and Applications》2012,56(3):597-614

In this paper, a low-cost compatible motion compensator is implemented and integrated into a macroblock-level three-stage-pipelined HDTV decoder, in which an embedded compression (EC) engine is realized as well. The decoder with the EC engine is designed to reduce power consumption and the memory bandwidth requirement, since memory accesses are reduced. In the motion compensator, a boundary judgment scheme for reference pixel fetching is proposed to provide seamless integration of block-based EC engines into the HDTV video decoder. Furthermore, a buffer sharing mechanism is adopted to reduce the extra memory requirement introduced by EC. The reference pixel fetching unit costs only 17.3 K logic gates when the working frequency is set to 166.7 MHz. On average, when decoding an HD1080 video sequence, a 30% memory access reduction and a 24% memory power consumption saving are achieved when a near-lossless EC algorithm is integrated in the video decoder. In other words, the proposed motion compensator makes the EC engine an integral part of a memory-reduced decoder without extra cost. Additionally, since the work in this paper is based on EC schemes, the EC design criteria are discussed, and several useful rules on the selection of an EC algorithm are given for video decoders with the corresponding VLSI architecture.