1.
Tero Rintaluoma 《电子元器件资讯》2010,(2)
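This article shows how ARM NEON technology can be used to improve and optimize the performance of a software-based H.264 video decoder. Several measurements were made with the ARM Profiler in RealView and on real hardware, and the corresponding figures are given for the H.264 and MPEG-4 decoders and the MPEG-4 encoder. Compared with the original ARM-optimized C code compiled for the Cortex-A8 processor architecture, the overall performance of the H.264 decoder on the Profiler improved by 54%.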
2.
3.
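Implementation of an H.264 Decoder on the OMAP Platform (cited by: 1; self-citations: 0; citations by others: 1)
An implementation scheme for an H.264 decoder on the OMAP5910 platform is presented. Because the OMAP5910 is a dual-core processor, the scheme follows its programming model and is optimized for the specific architecture: a client program on the ARM side controls the DSP, while an application on the DSP side performs the actual decoding. The decoder was then tested on images. Experimental results show that the decoder can meet the requirements of handheld devices.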
4.
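H.265 retains the H.264 coding architecture, and the deblocking filter remains an important option in the H.265 video coding standard; removing the blocking artifacts introduced by hybrid coding greatly improves video quality. However, because of H.265's super macroblocks, the deblocking parameters are nested layer by layer inside each small processing unit. This structure hinders parallelization across macroblock rows and makes it hard to exploit SIMD efficiently on the Cortex-A9 architecture. The paper first analyzes in detail the deblocking filter process of the H.265 standard and the difficulties of processing it in parallel, then proposes a deblocking filter structure that facilitates parallelism across macroblock rows, and finally optimizes it in Cortex-A9 assembly. In experiments based on HM14.0, the improved deblocking filter's share of the decoder's total computational complexity fell from 25% to 14%, substantially improving decoder performance and laying a foundation for real-time playback of high-resolution H.265 video on mobile devices.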
5.
6.
7.
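By analyzing the structure and complexity of an H.264 software decoder, the paper identifies the key points and difficulties in optimizing the decoder. Taking into account the performance characteristics of the TMS320DM642 DSP, it discusses in detail the optimization methods used for an H.264 decoder on that platform. These methods mainly involve increasing the parallelism of the program code and improving memory access efficiency, with a focus on key modules such as motion compensation and the IDCT. Experimental results show that the decoder can decode CIF-format video streams in real time.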
8.
9.
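This paper discusses how to implement an H.264 decoder on TI's TMS320C64x series of DSP chips. It presents basic methods for optimizing C code on the DAM6416P processing platform from 闻亭, along with the specific measures taken to optimize the H.264 decoder's C code on that platform. Experimental results show that the optimization approach is sound.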
10.
11.
An application specific processor for an H.264 decoder with a configurable embedded processor is designed in this research. The motion compensation, inverse integer transform, inverse quantization, and entropy decoding algorithm of H.264 decoder software are optimized. We improved the performance of the processor with instruction‐level hardware optimization, which is tailored to configurable embedded processor architecture. The optimized instructions for video processing can be used in other video compression standards such as MPEG 1, 2, and 4. A significant performance improvement is achieved with high flexibility. Experimental results show that we could achieve 300% performance for the H.264 baseline profile level 2 decoder.
12.
Viterbi decoding is widely used in many radio systems. Because of its large computational complexity, it is usually implemented with ASIC chips, FPGA chips, or optimized hardware accelerators. With the rapid development of multicore technology, multicore platforms have become a reasonable choice for software radio (SR) systems. The Cell Broadband Engine processor is a state-of-the-art multicore processor designed by Sony, Toshiba, and IBM. In this paper, we present a 64-state soft-input Viterbi decoder for a WiMAX SR baseband system based on the Cell processor. With one Synergistic Processor Element (SPE) of a Cell processor running at 3.2 GHz, our Viterbi decoder can achieve a throughput of up to 30 Mb/s when decoding the tail-biting convolutional code. The performance demonstrates that the proposed Viterbi decoding implementation is very efficient. Moreover, the Viterbi decoder can be easily integrated into the SR system and can provide a highly integrated SR solution. The optimization methodology in this module design can be extended to other modules on the Cell platform.
13.
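This paper introduces the characteristics of embedded machine vision systems based on an ARM+DSP architecture and analyzes the factors that limit their performance. Optimization schemes are discussed from the perspectives of the operating system and the application. By trimming the embedded Linux kernel and file system, extensively optimizing the application code, and making full use of the NEON acceleration technology specific to Cortex-A processors, the system boot time was shortened by 25 s and application execution speed increased by a factor of 2.5.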
14.
Shahriar Shahabuddin Janne Janhunen Markku Juntti Amanullah Ghazi Olli Silvén 《Analog Integrated Circuits and Signal Processing》2014,78(3):611-622
In order to meet the requirement of high data rates for next generation wireless systems, efficient implementations of receiver algorithms are essential. On the other hand, faster time-to-market motivates the investigation of programmable implementations. This paper presents a novel design of a programmable turbo decoder as an application-specific instruction-set processor (ASIP) using transport triggered architecture (TTA). The processor architecture is designed in such a manner that it can be programmed with high level language to support different suboptimal maximum a posteriori (MAP) algorithms in a single TTA processor. The design enables the designer to change the algorithms according to the frame error rate performance requirement. A quadratic polynomial permutation interleaver is used for contention-free memory access and to make the processor 3GPP LTE compliant. Several optimization techniques to enable real time processing on programmable platforms are introduced. The essential parts of the turbo decoding algorithm are designed with vector function units. Unlike most other turbo decoder ASIPs, high level language is used to program the processor to meet the time-to-market requirements. With a single iteration, 68.35 Mbps decoding speed is achieved for the max-log-MAP algorithm at a clock frequency of 210 MHz on 90 nm technology.
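For readers unfamiliar with the "suboptimal MAP algorithms" mentioned in this abstract, the following minimal sketch shows the textbook relationship between the exact log-domain operation (often written max*) and its max-log-MAP approximation. It is not taken from the paper; the function names `max_star` and `max_log` are hypothetical and chosen only for illustration.

```python
import math


def max_star(a: float, b: float) -> float:
    """Exact Jacobian logarithm: log(exp(a) + exp(b)), as used in log-MAP."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def max_log(a: float, b: float) -> float:
    """Max-log-MAP approximation: keep the max, drop the correction term."""
    return max(a, b)


if __name__ == "__main__":
    # The approximation error is largest when the two inputs are equal
    # and shrinks as they move apart.
    for a, b in [(0.0, 0.0), (1.0, 2.0), (-3.0, 4.0)]:
        print(a, b, max_star(a, b) - max_log(a, b))
```

The dropped correction term is never negative and never exceeds log 2, which is why the approximation costs some error-rate performance while replacing a logarithm and an exponential with a single comparison.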
15.
Paulin P.G. Pilkington C. Langevin M. Bensoudane E. Lyonnard D. Benny O. Lavigueur B. Lo D. Beltrame G. Gagne V. Nicolescu G. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2006,14(7):667-680
The MultiFlex system is an application-to-platform mapping tool that integrates heterogeneous parallel components (H/W or S/W) into a homogeneous platform programming environment. This leads to higher quality designs through encapsulation and abstraction. Two high-level parallel programming models are supported by the following MultiFlex platform mapping tools: a distributed system object component (DSOC) object-oriented message passing model and a symmetrical multiprocessing (SMP) model using shared memory. We demonstrate the combined use of the MultiFlex multiprocessor mapping tools, supported by high-speed hardware-assisted messaging, context-switching, and dynamic scheduling using the StepNP demonstrator multiprocessor system-on-chip platform, for two representative applications: 1) an Internet traffic management application running at 2.5 Gb/s and 2) an MPEG4 video encoder (VGA resolution, at 30 frames/s). For these applications, a combination of the DSOC and SMP programming models was used in interoperable fashion. After optimization and mapping, processor utilization rates of 85%-91% were demonstrated for the traffic manager. For the MPEG4 decoder, the average processor utilization was 88%.
16.
17.
18.
19.
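Current scheduling algorithms for signal processing applications on heterogeneous signal processing platforms optimize a single objective and produce schedules with unbalanced processor loads. To address these problems, a load balancing algorithm based on ant colony optimization is proposed. Combining the fast search and combinatorial optimization capabilities of ant colony optimization, the algorithm takes the schedule length of the signal processing application and processor load balance as its optimization objectives. It improves the initial pheromone matrix and the ants' traversal order, introduces a schedule-length heuristic factor and a load-balancing heuristic factor to improve the processor selection formula, and uses a roulette-wheel strategy to determine the processor assigned to each subtask, thereby completing the scheduling of the application. Simulation results show that the resulting schedules improve in both schedule length and load balance, make full use of each processor's performance, and raise the overall efficiency of the heterogeneous signal processing platform.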
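The roulette-wheel step described in this abstract can be sketched as follows. This is a generic illustration under stated assumptions, not the authors' formula: the abstract does not give the selection equation, so the product form, the exponents `alpha`, `beta`, and `gamma`, and the function name `select_processor` are all hypothetical.

```python
import random


def select_processor(pheromone, sched_heur, load_heur,
                     alpha=1.0, beta=1.0, gamma=1.0, rng=random):
    """Pick a processor index by roulette-wheel selection.

    Each processor's weight is assumed (for illustration only) to be
    pheromone**alpha * sched_heur**beta * load_heur**gamma.
    """
    weights = [
        (tau ** alpha) * (eta_s ** beta) * (eta_l ** gamma)
        for tau, eta_s, eta_l in zip(pheromone, sched_heur, load_heur)
    ]
    total = sum(weights)
    if total <= 0:
        # No usable information: fall back to a uniform random choice.
        return rng.randrange(len(weights))
    # Spin the wheel: a processor is chosen with probability weight / total.
    r = rng.random() * total
    cumulative = 0.0
    for idx, w in enumerate(weights):
        cumulative += w
        if r < cumulative:
            return idx
    return len(weights) - 1  # guard against floating-point round-off


if __name__ == "__main__":
    rng = random.Random(0)
    picks = [select_processor([1.0, 3.0], [1.0, 1.0], [1.0, 1.0], rng=rng)
             for _ in range(10000)]
    print(picks.count(1) / len(picks))  # fraction of picks for processor 1
```

Because selection is proportional rather than greedy, processors with lower weights are still chosen occasionally, which is what lets an ant colony keep exploring alternative assignments.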
20.
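As the smartphone and tablet markets place ever higher demands on chip performance and time to market, back-end engineers face growing design pressure. Traditional digital implementation flows are reaching their limits in meeting the power, frequency, and area requirements of today's SoC designs, so improving a chip's power, frequency, and area within a short time has become especially important. Based on the physical design of an ARM Cortex-A9 in SMIC's 40 nm low-power process, this paper describes in detail how to use Cadence's latest clock concurrent optimization technology, known as CCopt, to perform unified clock tree synthesis and physical optimization. The results show that the CCopt engine met its goals well: design frequency improved by 8%, and clock tree power and area were reduced. Cadence's latest CCopt engine offers significant advantages for implementing complex chip physical designs, shortening design cycles, and improving chip performance.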