1.

黄君凯周桦常周林《电子技术应用》2006,32(5):115-116

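Based on the H.264/AVC video coding standard, this work analyzes the 4×4 integer transform and quantization core of the encoder module and optimizes its hardware implementation. After three design optimizations, and with little change in hardware cost, the maximum operating frequency of the core rises by 82%, from 30.7 MHz before optimization to 55.8 MHz, providing a reference for hardware implementations of the H.264/AVC standard.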

2.

DCLSA: A DCT Coefficient Layered Scrambling Algorithm for H.264/AVC

包先雨蒋建国李援《中国图象图形学报》2008,13(4):618-623

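Existing DCT coefficient encryption algorithms each have shortcomings in security, compression ratio, or signal-to-noise ratio. This paper proposes a new DCT coefficient layered scrambling algorithm (DCLSA) suited to H.264/AVC. Exploiting the characteristics of the 4×4 DCT in H.264/AVC, the algorithm first layers the DCT coefficients of the 4×4 blocks in the same macroblock across blocks to build a coefficient layering model, and then randomly scrambles the coefficients of different layers according to the security requirement, so that encryption happens during encoding. Performance comparisons and experimental results show that the algorithm offers higher security, a better compression ratio, and a good signal-to-noise ratio, making it suitable for secure network applications of H.264/AVC.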
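
The layering-then-scrambling idea in this abstract can be illustrated with a minimal sketch. The abstract does not specify the layering model, so the code assumes that a "layer" is the set of coefficients at the same position in each of the 16 4×4 blocks of a macroblock. The function names, the `key` and `layers` parameters, and the use of Python's `random` module as a keyed permutation source are all hypothetical and are not taken from the paper; `random` is not cryptographically secure and only stands in for a real keyed generator.

```python
import random

def _layer_permutation(key, pos, n_blocks):
    # Key-dependent permutation for one layer (illustrative, not secure).
    perm = list(range(n_blocks))
    random.Random(f"{key}:{pos}").shuffle(perm)
    return perm

def scramble_layers(blocks, key, layers):
    """blocks: list of 16 coefficient lists (16 values each) for one macroblock.
    layers: coefficient positions (0..15) whose values are permuted across blocks."""
    out = [list(b) for b in blocks]
    for pos in layers:
        perm = _layer_permutation(key, pos, len(blocks))
        for i, src in enumerate(perm):
            out[i][pos] = blocks[src][pos]
    return out

def unscramble_layers(blocks, key, layers):
    """Invert scramble_layers using the same key and layer list."""
    out = [list(b) for b in blocks]
    for pos in layers:
        perm = _layer_permutation(key, pos, len(blocks))
        for i, src in enumerate(perm):
            out[src][pos] = blocks[i][pos]
    return out
```

Choosing which positions go into `layers` is where a security level would be selected in this sketch: scrambling more layers hides more of the picture, while layers left out of the list pass through unchanged.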

3.

Research on Intra Prediction Mode Selection Algorithms for H.264/AVC  Total citations: 11, self-citations: 0, citations by others: 11

裴世保李厚强俞能海《计算机应用》2005,25(8):1808-1810

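H.264/AVC uses spatial-domain intra prediction to further improve coding efficiency, but the large number of intra prediction modes it supports greatly increases prediction complexity. This paper analyzes the intra prediction mode selection process in detail and proposes a fast Intra_4×4 mode selection algorithm for rate-distortion optimization (RDO) mode. Using the SATD (Sum of Absolute Transformed Differences) and the correlation between the prediction modes of neighboring blocks, the algorithm rules out in advance more than 65% of the unlikely Intra_4×4 modes and avoids unnecessary computation. This greatly reduces the complexity of intra prediction while essentially preserving the coding performance of H.264/AVC.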
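
As a rough illustration of SATD-based pruning, the sketch below computes a 4×4 SATD with a Hadamard transform and keeps only the lowest-cost candidate modes for the full RDO pass. It is an assumption-laden sketch, not the paper's algorithm: the neighbor-mode correlation test is omitted, the SATD is left unnormalized (some encoders halve it), and `shortlist_modes` with its `keep` parameter is a hypothetical helper.

```python
# 4x4 Hadamard matrix; it is symmetric, so H4 equals its transpose.
H4 = [[1, 1, 1, 1],
      [1, 1, -1, -1],
      [1, -1, -1, 1],
      [1, -1, 1, -1]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def satd_4x4(orig, pred):
    """Sum of absolute values of the Hadamard-transformed residual."""
    diff = [[orig[i][j] - pred[i][j] for j in range(4)] for i in range(4)]
    transformed = matmul(matmul(H4, diff), H4)
    return sum(abs(v) for row in transformed for v in row)

def shortlist_modes(orig, predictions, keep=3):
    """predictions maps a mode name to its predicted 4x4 block.
    Returns the `keep` modes with the lowest SATD, best first."""
    ranked = sorted(predictions, key=lambda m: satd_4x4(orig, predictions[m]))
    return ranked[:keep]
```

With the nine Intra_4×4 modes of H.264/AVC, keeping three candidates would discard about 67% of the modes before RDO; that figure is only an illustration of the scale involved, not the paper's rule.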

4.

Performance Analysis and Study of Multiple Inter Block Modes in H.264/AVC  Total citations: 6, self-citations: 0, citations by others: 6

成运戴葵王志英《计算机工程与应用》2005,41(5):33-36,156

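H.264/AVC is the latest international video standard. Compared with other video coding standards it has a strong advantage in coding efficiency: at the same reconstructed image quality, H.264/AVC saves 64.46%, 48.80%, and 38.62% of the bit rate relative to MPEG-2, H.263++, and MPEG-4 Part 2, respectively. This gain, however, comes at the cost of a huge increase in computation. After introducing the macroblock partition modes, motion estimation, and mode selection algorithm of H.264/AVC, the paper analyzes motion estimation and the 4×4 luma transform and quantization under each inter block mode, and then studies combinations of inter block modes experimentally. The results show that when only block modes 1 to 4 are used for inter coding, the luma PSNR at the same bit rate is on average 0.13 dB lower than when all block modes are used, while encoding time falls by about 40% on average.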

5.

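A Fast H.264/AVC Algorithm Based on Inter Coding Mode Selection with Early Termination on the Best Mode  Total citations: 1, self-citations: 1, citations by others: 0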

章国宝李亮《中国图象图形学报》2009,14(1):59-64

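By studying the distribution of the best modes in H.264/AVC inter coding, this paper proposes three early-termination strategies for mode selection. They test whether each of three classes of modes (SKIP, P16×16, and P16×8 together with P8×16) is the best coding mode and end the mode selection process at an appropriate point, which avoids computing the rate-distortion cost of the remaining unlikely modes and so reduces the computational complexity of inter coding. Experiments show that the algorithm cuts H.264/AVC encoding time by 50% to 70% while keeping coding performance essentially the same as full search: PSNR falls by about 0.17 dB on average and the bit rate rises by 2.42% on average.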

6.

A Fast Inter Block Partition Mode Selection Algorithm for H.264/AVC Based on Motion Search Preprocessing

王正宁诸昌钤赵晋华《计算机应用》2005,25(6):1313-1315

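To further reduce the inter prediction error of macroblocks, the new video coding standard H.264/AVC adopts flexible and varied block partition types. Unlike the fixed 16×16 and 8×8 block sizes specified in earlier video coding standards, H.264/AVC uses seven block sizes ranging from 16×16 down to 4×4. The improved coding efficiency comes with very high computational complexity, especially when rate-distortion full search is used. This paper proposes a fast inter block partition mode prediction algorithm based on motion search preprocessing and residual texture analysis, and shows experimentally that it effectively reduces computational complexity with almost no sacrifice in image quality or compression efficiency.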

7.

FPGA Implementation of Integer Transform and Quantization in H.264

罗军黄启俊常胜李昌盛《中国图象图形学报》2011,16(5):740-745

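With its excellent compression ratio and high image quality, H.264 is widely used in real-time network video communication, digital broadcast television, and high-definition video storage and playback. Transform and quantization is a basic module of the H.264 coding framework and an important step before entropy coding; its main role is to reduce the correlation among the input coefficients. Most earlier transform and quantization implementations were based on software or coprocessors, which limits speed and throughput, whereas hardware implementations have a large advantage on both counts, so studying the hardware implementation of H.264 transform and quantization has practical value. Using a high-speed parallel architecture, the authors implement the H.264 integer discrete cosine transform (IDCT) and quantization algorithms at register-transfer level (RTL) in a hardware description language, and verify the design in hardware on an Altera Cyclone II programmable logic device. The design consumes 10489 logic elements, reaches a maximum clock frequency of 184.88 MHz, and achieves a throughput of 2958 Mpixels/s. It transforms and quantizes one 4×4 matrix within a single clock cycle, which meets the requirements of high-speed, high-throughput data stream processing.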
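
The reported throughput appears to follow directly from the clock frequency and the one-block-per-cycle claim. The short check below uses only the figures quoted in the abstract; the variable names are illustrative.

```python
clock_mhz = 184.88          # maximum clock frequency reported in the abstract
pixels_per_cycle = 4 * 4    # one 4x4 block processed per clock cycle, as stated
throughput_mpixels = clock_mhz * pixels_per_cycle
print(f"{throughput_mpixels:.2f} Mpixels/s")  # prints "2958.08 Mpixels/s"
```

The result rounds to the 2958 Mpixels/s quoted above.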

8.

Verilog Design of the Integer DCT and Quantization Module in H.264/AVC

沈劲桐张卫《计算机与现代化》2013,(2):108-112,116

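The H.264/AVC video compression standard uses a 4×4 integer DCT and quantization method, which avoids data mismatch, improves precision, and gives high coding efficiency. This paper analyzes the H.264 integer DCT and quantization algorithms, converts the DCT into two fast butterfly operations to reduce computation, implements the integer DCT and quantization in the Verilog hardware description language, and synthesizes and simulates the design in Quartus II, obtaining correct results. The design runs at a clock frequency of 54.54 MHz with low resource usage and low power consumption.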
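
The "two butterfly passes" formulation can be sketched in software. The block below applies the well-known H.264 4×4 forward core transform matrix once along rows and once along columns using a 1-D butterfly, and includes a plain matrix-product reference for comparison. It is a software sketch under the assumption that the paper uses this standard decomposition; it says nothing about the paper's Verilog structure, and quantization is left out.

```python
# H.264 4x4 forward core transform matrix (Y = CF * X * CF^T).
CF = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]

def butterfly_1d(v):
    """1-D transform of four samples using only adds, subtracts and doubling."""
    a, b, c, d = v
    s0, s1 = a + d, b + c      # sums
    s2, s3 = b - c, a - d      # differences
    return [s0 + s1, 2 * s3 + s2, s0 - s1, s3 - 2 * s2]

def forward_transform_4x4(block):
    """Two butterfly passes: first over rows, then over columns."""
    rows = [butterfly_1d(r) for r in block]            # X * CF^T
    cols = [butterfly_1d(c) for c in zip(*rows)]       # CF * (X * CF^T), column-wise
    return [list(r) for r in zip(*cols)]               # transpose back to row order

def forward_transform_reference(block):
    """Direct matrix product, used only to check the butterfly version."""
    def matmul(p, q):
        return [[sum(p[i][k] * q[k][j] for k in range(4)) for j in range(4)]
                for i in range(4)]
    cf_t = [list(r) for r in zip(*CF)]
    return matmul(matmul(CF, block), cf_t)
```

Each butterfly needs 8 additions or subtractions and 2 doublings, compared with 16 multiplications and 12 additions for a naive 4-point matrix-vector product, which is the kind of saving the abstract refers to.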

9.

Reduced-Spatial-Resolution Transcoding from MPEG-2 to H.264 for Mobile Devices

夏定元袁卫军《计算机应用研究》2010,27(5):1991-1993

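To meet the low-bit-rate, low-resolution video stream requirements of mobile devices on wireless networks, this paper proposes a fast coding mode selection algorithm for reduced-spatial-resolution transcoding from MPEG-2 to H.264. After 2:1 downsampling of the video sequence, the distance and direction of the motion vectors from the MPEG-2 decoder are used to decide whether adjacent 8×8 blocks should be merged into 8×16, 16×8, or 16×16 blocks. If they cannot be merged, a combination of the decoder's AC coefficients is used to decide whether an 8×8 block should be split further into 8×4, 4×8, or 4×4 blocks. This covers every block size in H.264 coding while avoiding the heavy computation of the conventional rate-distortion selection algorithm. Experiments show that, at comparable video quality, the method saves 87.02% of the coding mode selection time on average.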

10.

DCLSA: A DCT Coefficient Layered Scrambling Algorithm for H.264/AVC

BAO Xian-yu JIANG Jian-guo LI Yuan 《中国图象图形学报》2008,(4)

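Existing DCT coefficient encryption algorithms each have shortcomings in security, compression ratio, or signal-to-noise ratio. This paper proposes a new DCT coefficient layered scrambling algorithm (DCLSA) suited to H.264/AVC. Exploiting the characteristics of the 4×4 DCT in H.264/AVC, the algorithm first layers the DCT coefficients of the 4×4 blocks in the same macroblock across blocks to build a coefficient layering model, and then randomly scrambles the coefficients of different layers according to the security requirement, so that encryption happens during encoding. Performance comparisons and experimental results show that the algorithm offers higher security, a better compression ratio, and a good signal-to-noise ratio, making it suitable for secure network applications of H.264/AVC.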

11.

A multi-streaming SIMD multimedia computing engine

Jih-Ching Chiu Yu-Liang Chou 《Microprocessors and Microsystems》2010,34(7-8):247-258

Current multimedia extensions provide a mechanism for general-purpose processors to meet the growing performance demand of multimedia applications. However, the computing performance of these extensions is often limited for the design conceptions of the single data stream. This paper presents an architecture called “multi-streaming SIMD” that enables current multimedia extensions to simultaneously manipulate multiple data streams. To efficiently and flexibly realize the proposed architecture, an operation cell is designed by fusing the logic gates and the storage cells together. Multiple operation cells then are connected to compose a register file with the ability of performing SIMD operations called “Multimedia Operation Storage Unit (MOSU)”. Further, many MOSUs are used to compose a multi-streaming SIMD computing engine that can simultaneously manipulate multiple data streams and exploit the subword parallelisms of the elements in each data stream. This paper also designs three instruction modes (global, coupling, and isolated modes) for programmers to dynamically configure the multi-streaming SIMD computing engine at the instruction level to manipulate different amounts of data streams. Simulation results show that when the multi-streaming SIMD architecture has four 4-register MOSUs, it provides a factor of 3.3×–5.5× performance enhancement for traditional MMX extensions on 12 multimedia kernels.

12.

A Tree-Matching Vectorization Algorithm for Multi-Cluster DSP Architectures

郭连伟郑启龙黄胜兵徐华叶《计算机系统应用》2015,24(10):142-147

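BWDSP is a new processor designed for high-performance computing. It adopts a multi-cluster very long instruction word (VLIW) architecture with SIMD support and has a rich instruction set. To make full use of the vectorization resources that BWDSP provides, a vectorization algorithm is urgently needed. Building on Open64, this paper studies and implements a SIMD compiler optimization algorithm for multi-cluster VLIW DSPs. The algorithm works on WHIRL, the Open64 intermediate language, and can make full use of BWDSP's rich hardware resources and vector instructions. Experiments show that, for loops that can be combined into double-word and single-word operations, the optimization achieves average speedups of 6× and 4×, respectively.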

13.

Implementation and Optimization of the Canny Edge Detection Algorithm on the Feiteng (飞腾) Platform

郭恒亮柴晓楠韩林赫晓慧商建东《计算机工程》2021,47(7):37-43

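To give the domestic Feiteng (飞腾) DSP platform support for a low-level image library, and to address the long computation time of the original Canny edge detection algorithm, this paper designs a parallel Canny gradient computation algorithm for the FT-M7002 platform. Building on the high-performance FT-M7002 architecture, it uses single-instruction multiple-data (SIMD) vectorization to strengthen the parallel processing capability of the DSP core instructions. Based on the hierarchy of the platform's vector memory, it analyzes the memory access pattern of the parallel gradient computation, resolves non-contiguous accesses by fetching from an offset start address, and combines this with double buffering to handle data transfer and computation. Experiments show that, at the same detection accuracy as the original Canny algorithm, overall running speed improves by a factor of 1.490 to 2.112 for 3×3, 5×5, and 7×7 kernels, narrowing the performance gap with mainstream accelerators in digital image processing.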

14.

GCC-Based Vector Instruction Set Extension for the High-Performance Matrix DSP

辛乃军陈旭灿孙海燕阳柳罗杰淡孝强王霁《计算机工程与科学》2012,34(1):58-63

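Automatic vectorization is a compiler optimization that increases program parallelism. As computing platforms built on SIMD-capable processors have become widespread, automatic vectorization has become a hot topic in compiler research. GCC is an open-source, cross-platform compiler. Building on GCC's internal auto-vectorization algorithm and on the architecture and instruction set of the Matrix chip, this work extends the GCC back end with the Matrix vector instruction set and implements basic auto-vectorization support. Tests show that the extended compiler supports the Matrix vector instruction set, performs basic automatic vectorization, and also supports developing Matrix-based parallel programs through built-in functions.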

15.

Concurrent warp execution: improving performance of GPU-likely SIMD architecture by increasing resource utilization

Hong Jun Choi Dong Oh Son Jong Myon Kim Cheol Hong Kim 《The Journal of supercomputing》2014,69(1):330-356

Hardware parallelism should be exploited to improve the performance of computing systems. Single instruction multiple data (SIMD) architecture has been widely used to maximize the throughput of computing systems by exploiting hardware parallelism. Unfortunately, branch divergence due to branch instructions causes underutilization of computational resources, resulting in performance degradation of SIMD architecture. Graphics processing unit (GPU) is a representative parallel architecture based on SIMD architecture. In recent computing systems, GPUs can process general-purpose applications as well as graphics applications with the help of convenient APIs. However, contrary to graphics applications, general-purpose applications include many branch instructions, resulting in serious performance degradation of GPU due to branch divergence. In this paper, we propose concurrent warp execution (CWE) technique to reduce the performance degradation of GPU in executing general-purpose applications by increasing resource utilization. The proposed CWE enables selecting co-warps to activate more threads in the warp, leading to concurrent execution of combined warps. According to our simulation results, the proposed architecture provides a significant performance improvement (5.85 % over PDOM, 91 % over DWF) with little hardware overhead.
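
The underutilization caused by branch divergence can be made concrete with a toy model. The sketch below is not the CWE mechanism from this abstract; it only assumes that a warp whose threads disagree on a branch runs the taken and not-taken paths one after the other under a mask, and that each path costs one issue slot per lane. The function name and parameters are illustrative.

```python
def simd_utilization(take_branch, lanes_per_warp):
    """take_branch holds one boolean per thread (True = takes the branch).
    Returns useful lane-slots divided by issued lane-slots."""
    useful = 0
    issued = 0
    for start in range(0, len(take_branch), lanes_per_warp):
        warp = take_branch[start:start + lanes_per_warp]
        taken = sum(warp)
        not_taken = len(warp) - taken
        # A divergent warp needs two serialized passes, a uniform warp one.
        passes = (taken > 0) + (not_taken > 0)
        useful += len(warp)
        issued += passes * lanes_per_warp
    return useful / issued

divergent = [True, False] * 4            # every warp of 4 threads is split 2/2
regrouped = [True] * 4 + [False] * 4     # same threads, grouped by branch outcome
print(simd_utilization(divergent, 4))    # prints 0.5
print(simd_utilization(regrouped, 4))    # prints 1.0
```

In this model the same eight threads use half of the lanes when every warp diverges and all of them when threads with the same outcome share a warp, which is the kind of gap that utilization-oriented techniques try to close.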

16.

Efficient multimedia coprocessor with enhanced SIMD engines for exploiting ILP and DLP

《Parallel Computing》2013,39(10):586-602

Multimedia applications have become increasingly important in daily computing. These applications are composed of heterogeneous regions of code mixed with data-level parallelism (DLP) and instruction-level parallelism (ILP). A standard solution for a multimedia coprocessor resembles of single-instruction multiple-data (SIMD) engines into architectures exploiting ILP at compile time, such as very long instruction word (VLIW) and transport triggered architecture (TTA). However, the ILP regions fail to scale with the increased vector length to achieve high performance in the DLP regions. Furthermore, the register-to-register nature of SIMD instructions causes current SIMD engines to have limitations in handling memory alignment, data reorganization, and control flow. Many supporting instructions such as data permutations, address generations, and loop branches, are required to aid in the execution of the real SIMD computation instructions. To mitigate these problems, we propose optimized SIMD engines that have the capabilities for combining VLIW or TTA processing with a unified scalar and long vector computations as well as efficient SIMD hardware for real computation. Our new architecture is based on TTA and is called multimedia coprocessor (MCP). This architecture includes following features: (1) a simple coprocessor structure with 8-way TTA, (2) cost-effective SIMD hardware capable of performing floating-point operations, (3) long vector capabilities built upon existing SIMD hardware and a single register file and processor data path for both scalar operands and vector elements, and (4) an optimized SIMD architecture that addresses the SIMD limitations. Our experimental evaluations show that MCP can outperform conventional SIMD techniques by an average of 39% and 12% in performance for multimedia kernels and applications, respectively.

17.

Fast implementation of dense stereo vision algorithms on a highly parallel SIMD architecture

Fouzhan Hosseini Amir Fijany Saeed Safari Jean-Guy Fontaine 《Journal of Real-Time Image Processing》2013,8(4):421-435

In this paper, we present faster than real-time implementation of a class of dense stereo vision algorithms on a low-power massively parallel SIMD architecture, the CSX700. With two cores, each with 96 Processing Elements, this SIMD architecture provides a peak computation power of 96 GFLOPS while consuming only 9 Watts, making it an excellent candidate for embedded computing applications. Exploiting full features of this architecture, we have developed schemes for an efficient parallel implementation with minimum of overhead. For the sum of squared differences (SSD) algorithm and for VGA (640 × 480) images with disparity ranges of 16 and 32, we achieve a performance of 179 and 94 frames per second (fps), respectively. For the HDTV (1,280 × 720) images with disparity ranges of 16 and 32, we achieve a performance of 67 and 35 fps, respectively. We have also implemented more accurate, and hence more computationally expensive variants of the SSD, and for most cases, particularly for VGA images, we have achieved faster than real-time performance. Our results clearly demonstrate that, by developing careful parallelization schemes, the CSX architecture can provide excellent performance and flexibility for various embedded vision applications.
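
For readers unfamiliar with SSD matching, the sketch below shows the basic computation on a single scanline: for a pixel in the left image, try each disparity in the search range and keep the one whose window in the right image has the smallest sum of squared differences. It is a scalar illustration under simplifying assumptions (1-D windows, rectified inputs, a hypothetical `ssd_disparity` helper) and does not reflect the paper's parallelization on the CSX700.

```python
def ssd_disparity(left, right, x, max_disp, half=1):
    """Return the disparity d in [0, max_disp) that minimizes the SSD between
    the window centered at left[x] and the window centered at right[x - d].
    Candidates whose window would leave the scanline are skipped."""
    best_d, best_cost = None, None
    for d in range(max_disp):
        if x - d - half < 0 or x + half >= len(left):
            continue
        cost = sum((left[x + k] - right[x - d + k]) ** 2
                   for k in range(-half, half + 1))
        if best_cost is None or cost < best_cost:
            best_d, best_cost = d, cost
    return best_d

# Toy scanline: the right view is the left view shifted by two pixels.
left = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]
right = left[2:] + [0, 0]
print(ssd_disparity(left, right, x=6, max_disp=4))   # prints 2
```

The work per pixel grows linearly with the disparity range, which is consistent with the roughly halved frame rates the abstract reports when the range doubles from 16 to 32.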

18.

Hardware Design of a Fast Integer DCT Algorithm for the AVS Standard

蔡海涛刘荣科《电子技术应用》2007,33(12):68-71

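This paper presents a method for simplifying the integer DCT matrix of the AVS standard, which speeds up the hardware implementation of the one-dimensional integer DCT. Based on this one-dimensional transform, and using module reuse and a pipelined design, the direct two-dimensional integer DCT completes within one clock cycle at an operating frequency of up to 160 MHz. Simulation results confirm the effectiveness of the algorithm.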

19.

Optimization of Single-Precision Floating-Point General Matrix Multiplication for Machine Translation on the ARMv8 Architecture

龚鸣清叶煌张鉴卢兴敬陈伟《计算机应用》2019,39(6):1557-1562

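To address the low efficiency of neural network inference on mobile smart devices with ARM processors, this paper proposes a set of optimizations for single-precision floating-point general matrix multiplication (SGEMM) on the ARMv8 architecture. First, it establishes that SGEMM efficiency on ARMv8 processors is limited by how the vector units are used, by the instruction pipeline, and by the probability of cache misses. Second, it addresses these three limits with three optimizations: inline-assembly vector instructions, data rearrangement, and data prefetching. Finally, it designs test experiments around three matrix shapes common in speech-oriented neural networks and runs them on the RK3399 hardware platform. The results show a single-core speed of 10.23 GFLOPS in square-matrix mode, 78.2% of the measured floating-point peak; 6.35 GFLOPS in tall-and-skinny-matrix mode, 48.1% of the measured peak; and 2.53 GFLOPS in consecutive-small-matrix mode, 19.2% of the measured peak. When the optimized SGEMM is deployed in a speech recognition neural network program, the program's actual recognition speed improves significantly.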

20.

Low-Power Scheduling for a Two-Dimensional SIMD Architecture

张倩《计算机工程》2009,35(10):273-275

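For a two-dimensional SIMD architecture, this paper proposes a low-power scheduling algorithm that dynamically switches off idle units with combined support from the compiler, the instruction set, and the architecture: the compiler optimizes the two-dimensional SIMD instructions, power instructions issue unit on/off signals, and the system receives the signals and acts on them. The algorithm schedules each type of functional unit separately and localizes unit usage. Simulator experiments show that the method saves about 15% of the whole system's energy consumption.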