1.

《微型机与应用》2014,(16):31-33

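To address the difficulty of compressing high-frame-rate, high-resolution images from photoelectric theodolites in real time, an H.264 parallel algorithm for the TMS320C6678 multicore processor is proposed, building on TI's open-source single-core H.264 encoding project and on multicore parallel algorithms. Multicore parallel video encoding is implemented on top of the single-core open-source project: each frame is divided evenly into multiple slices, and each DSP core processes one slice. Experimental results show that, compared with single-core encoding, the speedup of multicore parallel encoding grows linearly with the number of cores, and that real-time image compression for photoelectric theodolites on the TMS320C6678 is practical from an engineering standpoint.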
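
The even per-frame slice split described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `split_rows_evenly`, the choice of macroblock rows as the unit of division, and the 68-row frame are assumptions made here for the example (the TMS320C6678 does have eight cores).

```python
def split_rows_evenly(num_mb_rows: int, num_cores: int) -> list[range]:
    """Divide macroblock rows into contiguous slices, one per core.

    Hypothetical helper: slice sizes differ by at most one row.
    """
    if not 1 <= num_cores <= num_mb_rows:
        raise ValueError("need at least one row per core")
    base, extra = divmod(num_mb_rows, num_cores)
    slices = []
    start = 0
    for core in range(num_cores):
        # The first `extra` cores take one additional row each.
        size = base + (1 if core < extra else 0)
        slices.append(range(start, start + size))
        start += size
    return slices


# Assumed example: a frame with 68 macroblock rows on an 8-core device.
slices = split_rows_evenly(68, 8)
for core, rows in enumerate(slices):
    print(f"core {core}: rows {rows.start}..{rows.stop - 1}")
```

An even split keeps the scheduling trivial, but it balances load only when every row costs about the same to encode.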

2.

Multicore parallel algorithm for real-time H.264 encoding (total citations: 1; self-citations: 0; citations by others: 1)

冯飞龙 陈耀武 《计算机工程》2010,36(24):226-227

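For multicore real-time H.264 encoding architectures, a parallel algorithm based on adjacent macroblocks is proposed according to the data dependencies among encoding modules. It is combined with slice-level and macroblock-row-level parallelism to form a multi-granularity parallel encoding algorithm that increases the depth of data parallelism. Experimental results show that the algorithm effectively improves the parallel speedup while leaving image quality almost unchanged.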
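
The abstract does not spell out its adjacent-macroblock scheme, so the sketch below shows only the generic wavefront pattern commonly used for macroblock-level parallelism, under the assumption that a macroblock depends on its left, top-left, top, and top-right neighbours. The function name `wavefront_schedule` and the 4x3 grid are hypothetical.

```python
from collections import defaultdict


def wavefront_schedule(mb_cols: int, mb_rows: int) -> list[list[tuple[int, int]]]:
    """Group macroblocks (x, y) into waves that can run concurrently.

    Assumes dependencies on the left, top-left, top, and top-right
    neighbours. Macroblock (x, y) is placed in wave x + 2*y, so every
    neighbour it depends on falls in a strictly earlier wave.
    """
    waves = defaultdict(list)
    for y in range(mb_rows):
        for x in range(mb_cols):
            waves[x + 2 * y].append((x, y))
    return [waves[t] for t in sorted(waves)]


# Hypothetical 4x3 macroblock grid.
for step, wave in enumerate(wavefront_schedule(4, 3)):
    print(step, wave)
```

Each row starts two steps behind the row above it, which is what makes the top-right neighbour available in time.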

3.

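Multi-granularity parallel H.264 encoder based on homogeneous multicore processors (total citations: 2; self-citations: 0; citations by others: 2)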

于俊清 李江 魏海涛 《计算机学报》2009,32(6)

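H.264 achieves its low bit rate and high video quality at the cost of greater encoding complexity. Developing parallel encoding algorithms suited to multicore platforms is therefore an important way to raise encoding speed, and it matters greatly for real-time transmission and large-scale sharing of high-definition video. Using the open-source H.264 encoder X264 and building on slice-level and data-level parallel encoding algorithms, the authors analyze the reference relationships between frames and propose and implement a frame-level parallel algorithm with a variable number of B-frames. Based on the reference relationships between macroblocks, they design a pipeline-like macroblock-level parallel method. On an Intel homogeneous multicore platform, they propose an encoding scheme that combines four granularities of parallelism (frame, slice, macroblock, and data level) and develop a multi-granularity parallel H.264 encoder. Experimental results show that the encoder substantially improves the encoding speedup with only a small increase in bit rate, and that the encoded video meets high-quality requirements.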

4.

An X264 parallel encoding algorithm based on inter-frame and intra-frame macroblock-level parallelism

魏妃妃 梁久桢 韩军 《计算机工程与科学》2011,33(7):106

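The X264 encoder is analyzed against the H.264 standard with the aim of raising encoding speed and improving X264's real-time performance. After a detailed analysis of the data dependencies between macroblocks, and in view of the characteristics of inter-frame macroblock-level multithreaded encoding, a multithreaded parallel encoding algorithm based on both inter-frame and intra-frame macroblock-level parallelism is proposed. Building on the existing inter-frame macroblock-level multithreaded encoding and respecting the spatial correlation between macroblocks, the algorithm creates a separate thread for each macroblock row within an I-frame, achieving inter-frame and intra-frame macroblock-level parallel encoding and thus multi-granularity parallelism. Experimental results show that the algorithm encodes video sequences correctly with little change in peak signal-to-noise ratio while improving the encoding speedup, which strengthens the real-time performance of video encoding.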

5.

Adaptive slice partitioning algorithm for high-definition encoding

冯飞龙 陈耀武 《计算机工程》2010,36(23):226-228,233

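An adaptive slice partitioning algorithm is proposed for multicore real-time high-definition video encoding systems. Based on rate control and an entropy-coding complexity model, the algorithm uses intra prediction to obtain the texture complexity distribution of the current picture, predicts the distribution of its computational complexity, and partitions slices adaptively so that computational load is spread evenly across cores, improving multicore parallel encoding efficiency. Experimental results show that, compared with slice partitioning by a fixed number of macroblocks, the algorithm improves the parallel speedup more effectively.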
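
The idea of cutting slices by predicted cost rather than by a fixed macroblock count can be sketched as below. This is an illustration under stated assumptions, not the paper's model: the per-row cost values are made-up inputs standing in for the complexity prediction, and `balanced_slices` is a hypothetical greedy prefix-sum partitioner.

```python
from itertools import accumulate


def balanced_slices(row_cost: list[float], num_slices: int) -> list[tuple[int, int]]:
    """Cut rows into contiguous slices of roughly equal predicted cost.

    Hypothetical helper: returns (start, end) row indices, end exclusive.
    The k-th cut is placed at the first row where the running cost
    reaches k/num_slices of the total, while keeping every slice non-empty.
    """
    n = len(row_cost)
    if not 1 <= num_slices <= n:
        raise ValueError("need at least one row per slice")
    prefix = list(accumulate(row_cost))
    total = prefix[-1]
    cuts = [0]
    for k in range(1, num_slices):
        target = total * k / num_slices
        end = cuts[-1] + 1              # slice must contain at least one row
        max_end = n - (num_slices - k)  # leave a row for each remaining slice
        while end < max_end and prefix[end - 1] < target:
            end += 1
        cuts.append(end)
    cuts.append(n)
    return [(cuts[i], cuts[i + 1]) for i in range(num_slices)]


# Assumed costs: flat rows at the top, detailed texture in the middle.
costs = [1, 1, 1, 1, 4, 4, 2, 2]
print(balanced_slices(costs, 2))  # slices by predicted cost
print([(0, 4), (4, 8)])           # fixed-size split, for comparison
```

With these assumed costs the fixed split gives the two cores loads of 4 and 12, whereas the cost-based cut gives 8 and 8.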

6.

H.264 parallel encoding algorithm based on a heterogeneous multicore processor

吕明洲 陈耀武 《计算机工程》2012,38(16):35-39

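The H.264 video coding standard has high computational complexity, which makes real-time encoding of high-definition video difficult. To address this, an H.264 parallel encoding algorithm for the heterogeneous multicore DM6467 platform is proposed. Taking into account the dependencies among the DM6467's hardware acceleration engines and the characteristics of its memory, a macroblock-level parallel encoding algorithm is designed. By analyzing the characteristics of the multi-slice pipeline and the task allocation between the digital signal processor and the ARM core, optimizations that merge pipelines and balance load between cores are proposed. Experimental results show that the optimized encoder is 18% more efficient and can encode 1080P video in real time on the DM6467 platform.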

7.

Accelerated design and implementation of the HEVC intra prediction algorithm

王飞龙 刘新闯 刘鹏 辛晓斐 石鹏飞 《计算机应用与软件》2020,37(1):151-156

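The new generation of video coding standards achieves higher coding efficiency but also increases the amount of computation. Parallel HEVC (High Efficiency Video Coding) algorithms can raise encoding speed, and developing parallel encoding algorithms for multicore processors matters greatly for real-time transmission and large-scale real-time sharing of high-definition video. The data dependencies that arise when the intra prediction algorithm processes pixels are analyzed, and fine-grained parallelism based on prediction modes is designed. Blocks are processed in a pipeline to reduce the execution time of intra prediction. The algorithm is verified on a dynamically programmable, reconfigurable video array processor. Experimental results show that, compared with the official HM16.0 test standard, the signal-to-noise ratio improves by 10% and the execution time falls by about 70%.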

8.

Real-time high-definition SVC encoding based on multicore processors

黄亮 《计算机工程与应用》2013,49(13):170-174

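To address the high complexity of SVC (Scalable Video Coding), a parallel SVC encoding algorithm for high-definition video on the TileraGx36 multicore platform is proposed. Between layers, spatial-layer-level parallel encoding based on temporal-layer alignment is proposed. Within a layer, to balance encoding load dynamically across slices as picture content varies, a slice-level dynamic partitioning method driven directly by measured encoding time is proposed, and a multicore parallel filtering scheme is implemented for the strongly dependent deblocking filter module. Taking the platform's characteristics into account, a scheme for dynamically allocating processor cores is implemented. Experimental results show that the overall scheme reaches a parallel speedup above 19 and encodes video sequences of up to 720P resolution in real time.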

9.

H.264 intra prediction algorithm based on temporal correlation (total citations: 3; self-citations: 0; citations by others: 3)

胡少华 叶水生 周登峰 《计算机应用研究》2010,27(7):2770-2772

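To reduce the computational complexity of the H.264 encoder, a fast intra prediction mode selection algorithm is proposed. By analyzing the temporal and spatial correlation of intra prediction modes, it uses the intra prediction modes of macroblocks in already-encoded frames as candidate modes for the current macroblock, reducing computation. Experimental results show that encoding time falls by about 10% with little change in picture quality and bit rate, which helps real-time encoding.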

10.

Context-based reference frame selection algorithm in H.264

王英坤 徐伯庆 杨华 《计算机工程》2008,34(8):249-251

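Multiple reference frame selection improves coding efficiency in regions with uncovered background or newly appearing objects, but it does little to improve overall video coding performance. This paper proposes a context-based multiple reference frame selection algorithm that uses a macroblock's motion characteristics to determine the maximum number of reference frames searched for the current macroblock and, by analyzing the prediction result from the current reference frame, decides dynamically whether the next reference frame needs to be searched. Experimental results show that, with coding quality close to that of full search, the algorithm searches 2.10 to 2.65 times as fast as full search.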

11.

Parallel multigrid algorithms based on generic approximate sparse inverses: an SMP approach

Christos K. Filelis-Papadopoulos George A. Gravvanis 《The Journal of supercomputing》2014,67(2):384-407

New parallel computational techniques are introduced for the parallelization of Generic Approximate Sparse Inverse multigrid methods, based on Portable Operating System Interface for UniX (POSIX) threads, for multicore systems. Parallelization of the Generic Approximate Sparse Inverse Matrix (GenAspI) algorithm is achieved based on a new computational approach, namely “strip,” which utilizes the data independence of the rows assigned in each available processor. Additionally, new parallel computational techniques are proposed for the parallelization of a modified multigrid V-Cycle method, based on POSIX Threads, for multicore systems. The modified V-Cycle utilized a Parallel PGenAspI Preconditioned Bi-Conjugate Gradient STABilized (BiCGSTAB) as a coarse solver to ensure better parallel performance of the multigrid method. For parallelization purposes, a replication of the multigrid method function is executed on each processor with different index bands and with proper synchronization points to ensure less thread-creation overhead and to maximize parallel performance. Theoretical estimates on speedups and efficiency are also presented. Finally, numerical results for the performance of the PGenAspI algorithm and the PGenAspI–MGV method for solving classical two-dimensional boundary value problems on multicore computer systems are presented. The implementation issues of the proposed method are also discussed using POSIX threads on multicore systems.

12.

Load balancing algorithm for a multi-slice HEVC parallel encoder

刘欢 房胜 李哲 赵晴 《计算机工程与应用》2019,55(18):180-188

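To address the load imbalance that arises in HEVC parallel video encoders with uniform slice partitioning, a load balancing algorithm for a multi-slice HEVC parallel encoder is proposed. Starting from the encoding parameters, the relationship between encoding time and factors such as the quantization parameter, the number of reference frames, and the group of pictures is analyzed, and an encoding time prediction model based on encoding parameters is proposed. Using the encoding information of already-encoded frames that are adjacent in position and in temporal layer, together with the actual encoding parameters, the model predicts the encoding time of the current frame, and that prediction drives load balancing in the multi-slice HEVC parallel encoder. Experimental results show that, compared with the existing uniform slice partitioning method, the proposed method improves the speedup by about 9.23% with negligible loss in coding performance.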

13.

Using graphics processors to accelerate the computation of the matrix inverse (total citations: 1; self-citations: 1; citations by others: 0)

P. Ezzatti E. S. Quintana-Ortí A. Remón 《The Journal of supercomputing》2011,58(3):429-437

We study the use of massively parallel architectures for computing a matrix inverse. Two different algorithms are reviewed, the traditional approach based on Gaussian elimination and the Gauss–Jordan elimination alternative, and several high performance implementations are presented and evaluated. The target architecture is a current general-purpose multicore processor (CPU) connected to a graphics processor (GPU). Numerical experiments show the efficiency attained by the proposed implementations and how the computation of large-scale inverses, which only a few years ago would have required a distributed-memory cluster, takes only a few minutes on a hybrid architecture formed by a multicore CPU and a GPU.

14.

Performance‐based parallel loop self‐scheduling using hybrid OpenMP and MPI programming on multicore SMP clusters

Chao‐Tung Yang Chao‐Chin Wu Jen‐Hsiang Chang 《Concurrency and Computation》2011,23(8):721-744

Parallel loop self‐scheduling on parallel and distributed systems has been a critical problem and it is becoming more difficult to deal with in the emerging heterogeneous cluster computing environments. In the past, some self‐scheduling schemes have been proposed as applicable to heterogeneous cluster computing environments. In recent years, multicore computers have been widely included in cluster systems. However, previous research into parallel loop self‐scheduling did not consider certain aspects of multicore computers; for example, it is more appropriate for shared‐memory multiprocessors to adopt Open Multi‐Processing (OpenMP) for parallel programming. In this paper, we propose a performance‐based approach using hybrid OpenMP and MPI parallel programming, which partitions loop iterations according to the performance weighting of multicore nodes in a cluster. Because iterations assigned to one MPI process are processed in parallel by OpenMP threads run by the processor cores in the same computational node, the number of loop iterations allocated to one computational node at each scheduling step depends on the number of processor cores in that node. Experimental results show that the proposed approach performs better than previous schemes. Copyright © 2010 John Wiley & Sons, Ltd.
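
A single scheduling step of a performance-weighted split might look like the sketch below. It is a simplified stand-in, not the paper's scheme: it runs in plain Python with no MPI or OpenMP, the weights are assumed to be given (for instance, a node's core count scaled by a measured speed), and `weighted_chunks` is a hypothetical helper using largest-remainder rounding.

```python
def weighted_chunks(num_iterations: int, weights: list[float]) -> list[int]:
    """Split a batch of loop iterations across nodes in proportion to weight.

    Hypothetical helper: uses largest-remainder rounding so the chunk
    sizes are integers that sum exactly to num_iterations.
    """
    total = sum(weights)
    shares = [num_iterations * w / total for w in weights]
    chunks = [int(s) for s in shares]
    leftover = num_iterations - sum(chunks)
    # Hand the remaining iterations to the nodes with the largest fractions.
    by_fraction = sorted(
        range(len(weights)), key=lambda i: shares[i] - chunks[i], reverse=True
    )
    for i in by_fraction[:leftover]:
        chunks[i] += 1
    return chunks


# Assumed cluster: one 8-core node and two 4-core nodes of equal per-core speed.
print(weighted_chunks(1000, [8, 4, 4]))
```

In a self-scheduling scheme this split would be repeated on each new batch of remaining iterations, with the weights possibly updated from measured performance.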

15.

Using hybrid MPI and OpenMP programming to optimize communications in parallel loop self-scheduling schemes for multicore PC clusters

Chao-Chin Wu Lien-Fu Lai Chao-Tung Yang Po-Hsun Chiu 《The Journal of supercomputing》2012,60(1):31-61

Recently, a series of parallel loop self-scheduling schemes have been proposed, especially for heterogeneous cluster systems. However, they employed the MPI programming model to construct the applications without considering whether the computing node is multicore architecture or not. As a result, every processor core has to communicate directly with the master node for requesting new tasks no matter the fact that the processor cores on the same node can communicate with each other through the underlying shared memory. To address the problem of higher communication overhead, in this paper we propose to adopt hybrid MPI and OpenMP programming model to design two-level parallel loop self-scheduling schemes. In the first level, each computing node runs an MPI process for inter-node communications. In the second level, each processor core runs an OpenMP thread to execute the iterations assigned for its resident node. Experimental results show that our method outperforms the previous works.

16.

Toward fast Wyner-Ziv video decoding on multicore processors

Alberto Corrales-García José Luis Martínez Gerardo Fernández-Escribano Francisco J. Quiles 《Multimedia Tools and Applications》2014,68(3):717-745

The Wyner-Ziv video coding paradigm provides a framework where most of the complexity is moved from the encoder to the decoder. In this way, Wyner-Ziv coding efficiently supports multimedia services for mobile devices which have to capture, encode and send video. However, the complexity of the decoder is quite high and it should be reduced. This work presents several parallel Wyner-Ziv decoding algorithms aimed at reducing this high complexity. Considering the fact that technological advances provide us new hardware which supports parallel data processing, these algorithms efficiently distribute the burden of the complexity over the number of cores which are available in the architecture. Particularly four parallel approaches have been proposed and analyzed. In the first parallel approach, each bitplane of a frame could be decoded in a parallel way by a different core, achieving a time reduction of 33.21 % on average, although it depends on the number of bitplanes used. The second approach proposes a spatial distribution of each frame, avoiding dependences between bitplanes and then obtaining a time reduction of 67 % on average. The third approach executes each GOP in a parallel way, avoiding all synchronization dependences and achieving 71 % of time reduction on average, although the maximum performance is reached when the key frame buffer is full. Finally, the last approach distributes the burden of complexity over two levels, namely GOP and frame, in order to obtain the advantages of both: a negligible rate distortion penalty based on the GOP approach, and a low delay introduced by the spatial distribution approach. By using this parallel approach, the decoding time is reduced up to 76 %. In addition, by using parallel decoding, 60 % of the energy consumption is saved. The proposed methods are scalable for any multicore processor architecture and adaptable for different Wyner-Ziv decoding schemes.

17.

Video processing method based on the DM8168 multicore DSP processor

胡志权 杨斌 《单片机与嵌入式系统应用》2014,(8):39-41

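With the spread of 1080P high-definition and 4K ultra-high-definition video, video processing on traditional single-core DSP processors is falling short. TI has therefore introduced a multicore DSP processor designed for high-definition video processing; it contains four processors of different types and raises video processing to a higher level. This paper analyzes the processor's multicore DSP architecture and application development methods, and tests and analyzes inter-core coordination and load.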

18.

ParaStream: A parallel streaming Delaunay triangulation algorithm for LiDAR points on multicore architectures (total citations: 1; self-citations: 0; citations by others: 1)

Huayi Wu Xuefeng Guan Jianya Gong 《Computers & Geosciences》2011,37(9):1355-1363

This paper presents a robust parallel Delaunay triangulation algorithm called ParaStream for processing billions of points from nonoverlapped block LiDAR files. The algorithm targets ubiquitous multicore architectures. ParaStream integrates streaming computation with a traditional divide-and-conquer scheme, in which additional erase steps are implemented to reduce the runtime memory footprint. Furthermore, a kd-tree-based dynamic schedule strategy is also proposed to distribute triangulation and merging work onto the processor cores for improved load balance. ParaStream exploits most of the computing power of multicore platforms through parallel computing, demonstrating qualities of high data throughput as well as a low memory footprint. Experiments on a 2-Way-Quad-Core Intel Xeon platform show that ParaStream can triangulate approximately one billion LiDAR points (16.4 GB) in about 16 min with only 600 MB physical memory. The total speedup (including I/O time) is about 6.62 with 8 concurrent threads.

19.

Hierarchical barrier synchronization mechanism for many-core systems

臧照虎 李晨 王耀华 陈小文 郭阳 《计算机工程与科学》2022,44(11):1901-1908

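Synchronization plays an important role in ensuring data consistency and correctness among threads on multicore processors. As the number of cores grows, so does the cost of synchronization. Barrier synchronization is one of the main methods of multicore synchronization in parallel applications. Software synchronization typically takes thousands of cycles to synchronize multiple cores, and this high-latency, serialized synchronization significantly degrades the performance of multicore programs. Hardware barriers achieve lower synchronization latency than software barriers, but traditional centralized hardware barriers scale poorly and struggle to meet the synchronization needs of many-core processors. A hierarchical hardware barrier mechanism for many-core processors, HSync, is proposed. It consists of local barrier units and a global barrier unit that cooperate to achieve fast synchronization at low hardware cost. Experimental results show that, compared with a traditional centralized hardware barrier, the hierarchical mechanism improves many-core system performance by a factor of 1.13 and reduces network traffic by 74%.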