
1.

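闫迪 帅玮祎 孙克 李晓宇 《电讯技术》2018,58(6)

Performing signal processing on general-purpose processors is one of the directions in which software-defined radio is developing, but existing shared-memory parallel programming (OpenMP) and direct thread-level parallelization struggle to accelerate signal processing. To parallelize serial algorithms, a multi-core pipeline method is introduced, and the real-time performance of the traditional serial method and the multi-core pipeline is analyzed and compared. To address synchronization in the multi-core pipeline, a distributed adaptive thread synchronization method is studied. Using a signal processing example, the real-time performance of the serial method and the multi-core pipeline is tested; the results show that the throughput of the multi-core pipeline is 2.1 times that of the serial method, a substantial gain in processing capability.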
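The multi-core pipeline idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `run_pipeline` and `_stage_worker` are hypothetical, each stage simply gets its own thread, and bounded queues provide plain back-pressure in place of the distributed adaptive synchronization method the paper studies.

```python
import queue
import threading

_STOP = object()  # sentinel marking the end of the input stream


def _stage_worker(func, inbox, outbox):
    """Apply func to every item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)  # propagate shutdown to the next stage
            return
        outbox.put(func(item))


def run_pipeline(stages, blocks, depth=4):
    """Run blocks through stages, one thread per stage, preserving order.

    depth bounds each inter-stage queue, so a fast stage blocks instead of
    running arbitrarily far ahead of a slow one (simple back-pressure).
    """
    queues = [queue.Queue(maxsize=depth) for _ in range(len(stages) + 1)]
    workers = [
        threading.Thread(target=_stage_worker, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(stages)
    ]

    def feed():
        # Feed from a separate thread so the caller can drain results
        # concurrently; otherwise the bounded queues could deadlock.
        for block in blocks:
            queues[0].put(block)
        queues[0].put(_STOP)

    feeder = threading.Thread(target=feed)
    for thread in workers + [feeder]:
        thread.start()

    results = []
    while True:
        out = queues[-1].get()
        if out is _STOP:
            break
        results.append(out)

    for thread in workers + [feeder]:
        thread.join()
    return results
```

In CPython, pure-Python stages are serialized by the global interpreter lock, so a real speedup would need stages that release it (for example native numeric or I/O calls) or a process-based variant of the same structure.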

2.

Parallelization Strategies and Performance Analysis of Media Mining Applications on Multi-Core Processors

Wenlong Li Xiaofeng Tong Tao Wang Yimin Zhang Yen-Kuang Chen 《Journal of Signal Processing Systems》2009,57(2):213-228

This paper studies how to parallelize the emerging media mining workloads on existing small-scale multi-core processors and future large-scale platforms. Media mining is an emerging technology to extract meaningful knowledge from large amounts of multimedia data, aiming at helping end users search, browse, and manage multimedia data. Many of the media mining applications are very complicated and require a huge amount of computing power. The advent of multi-core architectures provides the acceleration opportunity for media mining. However, to efficiently utilize the multi-core processors, we must effectively execute many threads at the same time. In this paper, we present how to explore the multi-core processors to speed up the computation-intensive media mining applications. We first parallelize two media mining applications by extracting the coarse-grained parallelism and evaluate their parallel speedups on a small-scale multi-core system. Our experiment shows that the coarse-grained parallelization achieves good scaling performance, but not perfect. When examining the memory requirements, we find that these coarse-grained parallelized workloads expose high memory demand. Their working set sizes increase almost linearly with the degree of parallelism, and the instantaneous memory bandwidth usage prevents them from perfect scalability on the 8-core machine. To avoid the memory bandwidth bottleneck, we turn to exploit the fine-grained parallelism and evaluate the parallel performance on the 8-core machine and a simulated 64-core processor. Experimental data show that the fine-grained parallelization demonstrates much lower memory requirements than the coarse-grained one, but exhibits significant read-write data sharing behavior. Therefore, the expensive inter-thread communication limits the parallel speedup on the 8-core machine, while excellent speedup is observed on the large-scale processor as fast core-to-core communication is provided via a shared cache. 
Our study suggests that (1) extracting the coarse-grained parallelism scales well on small-scale platforms, but poorly on large-scale systems; (2) exploiting the fine-grained parallelism is suitable to realize the power of large-scale platforms; (3) future many-core chips can provide shared cache and sufficient on-chip interconnect bandwidth to enable efficient inter-core communication for applications with significant amounts of shared data. In short, this work demonstrates that proper parallelization techniques are critical to the performance of multi-core processors. We also demonstrate that performance analysis is one of the important factors in parallelization. The parallelization principles, practice, and performance analysis methodology presented in this paper are also useful for anyone seeking to exploit thread-level parallelism in their applications.


3.

Scalable Programming Models for Massively Multicore Processors

McCool M.D. 《Proceedings of the IEEE. Institute of Electrical and Electronics Engineers》2008,96(5):816-831

Including multiple cores on a single chip has become the dominant mechanism for scaling processor performance. Exponential growth in the number of cores on a single processor is expected to lead in a short time to mainstream computers with hundreds of cores. Scalable implementations of parallel algorithms will be necessary in order to achieve improved single-application performance on such processors. In addition, memory access will continue to be an important limiting factor on achieving performance, and heterogeneous systems may make use of cores with varying capabilities and performance characteristics. An appropriate programming model can address scalability and can expose data locality while making it possible to migrate application code between processors with different parallel architectures and variable numbers and kinds of cores. We survey and evaluate a range of multicore processor architectures and programming models with a focus on GPUs and the Cell BE processor. These processors have a large number of cores and are available to consumers today, but the scalable programming models developed for them are also applicable to current and future multicore CPUs.

4.

Adaptive slice-level parallelism for H.264/AVC encoding using pre macroblock mode selection

《Journal of Visual Communication and Image Representation》2008,19(8):558-572

In order to achieve high computational performance and low power consumption, many modern microprocessors are equipped with special multimedia instructions and multi-core processing capabilities. The number of cores on a single chip doubles every three years. Therefore, besides complexity reduction by smart algorithms such as fast macroblock mode selection, an effective algorithm for parallelizing H.264/AVC is also crucial in implementing a real-time encoder on a multi-core system. This algorithm serves to uniformly distribute workloads for H.264/AVC encoding over several slower and simpler processor cores on a single chip. In this paper, we propose a new adaptive slice-size selection technique for efficient slice-level parallelism of H.264/AVC encoding on a multi-core processor using fast macroblock mode selection as a pre-processing step. For this we propose an estimation method for the computational complexity of each macroblock using pre macroblock mode selection. Simulation results, with a number of test video sequences, show that, without any noticeable degradation, the proposed fast macroblock mode selection reduces the total encoding time by about 57.30%. The proposed adaptive slice-level parallelism has good parallel performance compared to conventional fixed slice-size parallelism. The proposed method can be applied to many multi-core systems for real-time H.264 video encoding.

5.

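High-Density Integration and Single-Chip Multi-Core Systems and Their Research Progress

李东生 高明伦 《半导体技术》2012,37(2):89-95

Under strict constraints on volume, weight, and power consumption, system miniaturization faces multiple technical challenges; to meet the demands of high-density computing and miniaturization, high-density system integration and chip multiprocessors are crucial. This paper discusses high-density integration and chip multiprocessor technologies and their research progress, covering chip multiprocessors (CMP), networks-on-chip (NoC), 3D integrated circuits, and high-density packaging. Two development trends of CMPs are identified: large numbers of small cores, and hierarchical cluster structures. The paper points out that high-density integration design and high-density packaging design are gradually converging, providing the technical foundation for the physical implementation of single-chip multi-core systems and a hardware solution for ultimately realizing high-density computing and miniaturized systems.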

6.

Two Step Timing Synchronization Scheme for OFDM Signal in General Purpose Processor Based Software Defined Radio Receiver

Yuki Tanaka Mamiko Inamori Yukitoshi Sanada 《Wireless Personal Communications》2014,79(1):363-374

Software defined radio (SDR) is a technology that allows a single terminal to support various kinds of wireless systems by changing its software to reconfigure itself. A general purpose processor (GPP) based SDR receiver platform named Sora has been recently developed by Microsoft. In the GPP based SDR receiver, timing synchronization of an OFDM signal consumes a significant amount of computational resources in the GPP. In this paper, a timing synchronization scheme which uses delayed correlation and matched filtering for the GPP based SDR platform is evaluated. The two stage timing synchronization scheme reduces the computational complexity by limiting the timing range of matched filtering. The proposed scheme reduces the amount of data transmission between the memory and the GPP of the SDR platform. It is shown through an experiment that the proposed scheme reduces the number of cycles for timing synchronization by up to 30 %.
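The two-stage idea described above (a cheap delayed correlation for a coarse estimate, then matched filtering over a limited range) might be sketched as follows. This is only an illustration under stated assumptions, not the scheme evaluated in the paper: the function names, the preamble made of two identical halves, and the `search` window size are all hypothetical, and the signal is a plain list of complex samples.

```python
def delayed_correlation(rx, half_len):
    """Stage 1: coarse timing from the correlation between two repeated halves.

    Assumes the preamble consists of two identical halves of length half_len,
    so samples half_len apart correlate strongly inside the preamble.
    """
    best_n, best_metric = 0, -1.0
    for n in range(len(rx) - 2 * half_len + 1):
        acc = sum(
            rx[n + k] * rx[n + k + half_len].conjugate() for k in range(half_len)
        )
        if abs(acc) > best_metric:
            best_n, best_metric = n, abs(acc)
    return best_n


def matched_filter_refine(rx, preamble, coarse, search):
    """Stage 2: matched filtering restricted to coarse +/- search samples."""
    lo = max(0, coarse - search)
    hi = min(len(rx) - len(preamble), coarse + search)
    best_n, best_metric = lo, -1.0
    for n in range(lo, hi + 1):
        acc = sum(
            rx[n + k] * preamble[k].conjugate() for k in range(len(preamble))
        )
        if abs(acc) > best_metric:
            best_n, best_metric = n, abs(acc)
    return best_n


def two_stage_timing(rx, preamble, search=4):
    """Coarse estimate by delayed correlation, then a narrow matched-filter search."""
    coarse = delayed_correlation(rx, len(preamble) // 2)
    return matched_filter_refine(rx, preamble, coarse, search)
```

The saving comes from stage 2: the matched filter is evaluated at only `2 * search + 1` candidate offsets instead of at every sample position.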

7.

Modeling and Evaluating Non-shared Memory CELL/BE Type Multi-core Architectures for Local Image and Video Processing

Svetislav Momcilovic Leonel Sousa 《Journal of Signal Processing Systems》2011,62(3):301-318

Local processing, which is a dominant type of processing in image and video applications, requires huge computational power to be performed in real time. However, processing locality, in space and/or in time, allows data parallelism and data reuse to be exploited. Although it is possible to exploit these properties to achieve high performance image and video processing in multi-core processors, it is necessary to develop suitable models and parallel algorithms, in particular for non-shared memory architectures. This paper proposes an efficient and simple model for local image and video processing on non-shared memory multi-core architectures. This model adopts a single program multiple data approach, where data is distributed, processed and reused in an optimal way, regarding the data size, the number of cores and the local memory capacity. The model was experimentally evaluated by developing video local processing algorithms and programming the Cell Broadband Engine multi-core processor, namely for advanced video motion estimation and in-loop deblocking filtering. Furthermore, based on these experiences, the main challenges of vectorization and of reducing branch mispredictions and computational load imbalances are also addressed. The limits and advantages of the regular and adaptive algorithms are also discussed. Experimental results show the adequacy of the proposed model for local video processing, and that real-time operation is achieved even when processing the most demanding parts of advanced video coding. Full-pixel motion estimation is performed over high resolution video (720×576 pixels) at a rate of 30 frames per second, by considering large search areas and five reference frames.

8.

Implementation of a High Throughput 3GPP Turbo Decoder on GPU

Michael Wu Yang Sun Guohui Wang Joseph R. Cavallaro 《Journal of Signal Processing Systems》2011,65(2):171-183

Turbo code is a computationally intensive channel code that is widely used in current and upcoming wireless standards. General-purpose graphics processor unit (GPGPU) is a programmable commodity processor that achieves high performance computation power by using many simple cores. In this paper, we present a 3GPP LTE compliant Turbo decoder accelerator that takes advantage of the processing power of GPU to offer fast Turbo decoding throughput. Several techniques are used to improve the performance of the decoder. To fully utilize the computational resources on GPU, our decoder can decode multiple codewords simultaneously, divide the workload for a single codeword across multiple cores, and pack multiple codewords to fit the single instruction multiple data (SIMD) instruction width. In addition, we use shared memory judiciously to enable hundreds of concurrent multiple threads while keeping frequently used data local to keep memory access fast. To improve efficiency of the decoder in the high SNR regime, we also present a low complexity early termination scheme based on average extrinsic LLR statistics. Finally, we examine how different workload partitioning choices affect the error correction performance and the decoder throughput.

9.

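A Parallel Compression Algorithm for Mission Recording Data Based on Multi-Core Processors

许晋 胡泽林 杨智 王颖 《电子科技》2014,27(8):164-166,169

Data recording and playback is one of the key components of a mission electronic system. During mission execution, it records and stores real-time data for post-mission playback and analysis. As sensor performance steadily improves, the volume of recorded data grows accordingly. To reduce the storage size of the recorded data, the zlib library is used to compress the data in real time. To relieve the performance bottleneck of slow recording rates caused by compression, a parallel compression algorithm based on multi-core processors is proposed, which makes full use of the processor's computing power to perform multi-threaded parallel compression. Experiments show that the algorithm achieves substantial improvements in both speedup and compression performance.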
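A minimal sketch of multi-threaded zlib compression, assuming the simplest possible design: the data is split into fixed-size chunks and each chunk is compressed independently. The chunking scheme and the names `compress_parallel` and `decompress_chunks` are illustrative assumptions, not the algorithm from the paper.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_parallel(data, chunk_size=1 << 20, workers=4, level=6):
    """Compress data as independent zlib streams, one per fixed-size chunk.

    Returns the compressed chunks in their original order.
    """
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so the chunks can simply be
        # decompressed and concatenated later.
        return list(pool.map(lambda chunk: zlib.compress(chunk, level), chunks))


def decompress_chunks(blocks):
    """Inverse of compress_parallel: decompress each chunk and concatenate."""
    return b"".join(zlib.decompress(block) for block in blocks)
```

Compressing chunks independently trades some compression ratio for parallelism, because each chunk starts with an empty dictionary; larger chunks reduce that loss but leave fewer units of work to distribute.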

10.

Implementation of media processors

Pirsch P. Stolberg H.-J. Yan-Kuang Chen Kung S.Y. 《Signal Processing Magazine, IEEE》1997,14(4):48-51

Conventional standard processors do not correspond well to the characteristics of multimedia signal processing algorithms. Therefore, special architectural approaches are necessary for multimedia processors to deliver the required high processing power with efficient use of hardware resources. Programmable approaches offer a high degree of flexibility. In order to attain multimedia signal processor performance, architectural strategies for programmable processors are based on parallelization and adaptation principles. The future multimedia signal processor implementation hinges upon an optimal trade-off between the two design spaces, which can be effectively addressed by a codesign approach.

11.

Regularity-constrained floorplanning for multi-core processors

Xi Chen Jiang Hu Ning Xu 《Integration, the VLSI Journal》2014

Multi-core technology becomes a new engine that drives performance growth for both microprocessors and embedded computing. This trend requires chip floorplanners to consider regularity constraint since identical processing/memory cores are preferred to form an array in layout. In general, regularity facilitates modularity and therefore makes chip design planning easier. As chip core count keeps growing, pure manual floorplanning will be inefficient on the solution space exploration while conventional floorplanning algorithms do not address the regularity constraint for multi-core processors. In this work, we investigate how to enforce regularity constraint in a simulated annealing based floorplanner. We propose a simple and effective technique for encoding the regularity constraint in sequence-pairs. To the best of our knowledge, this is the first work on regularity-constrained floorplanning in the context of multi-core processor designs. Experimental comparisons with a semi-automatic method show that our approach yields an average of 12% less wirelength and mostly smaller area.

12.

Programmable processor implementations of K-best list sphere detector for MIMO receiver

Janne Janhunen Olli Silvén Markku Juntti 《Signal processing》2010,90(1):313-323

An increasing number of standards in wireless communications has encouraged the study of programmable processors as platforms for flexible receivers. A multiple-input multiple-output (MIMO) antenna system combined with orthogonal frequency division multiplexing (OFDM) technique has been introduced in many wireless communications standards, such as in the third generation long term evolution (3G LTE). The MIMO-OFDM system requires an efficient detector and a platform support for parallel processing of multiple subcarriers. A K-best list sphere detector (LSD) provides near optimal decoding performance and a fixed throughput, making it an interesting algorithm from the point of view of practical implementations. In this paper, we compare the implementations of the K-best LSD on four processor platforms: a digital signal processor (DSP), software defined radio (SDR), application-specific processor (ASP) and application-specific instruction-set processor (ASIP). The DSP is a popular very long instruction word (VLIW) device (TMS320C6455), the SDR processor employs multithreading and multiple cores (SB3500 core processor), the ASP is based on transport triggered architecture (TTA), while the ASIP is the SDR processor enhanced with a special instruction-set extension for sorting. A 2×2 MIMO antenna system with 64-quadrature amplitude modulation (64-QAM) is assumed. The chosen list sizes K=8 and 16 are based on simulation results carried out in MATLAB environment with the 3G LTE parameters. The proposed ASIP achieved a promising throughput of 32.0 Mbps, whereas the SDR implementation on the SB3500 core processor suffers from an inefficient software sorter. The ASP, in which the minimized hardware complexity has been the goal, achieves a throughput of 7.6 Mbps. However, more essential examination is related to the symbol time, which sets strict parallel processing requirements to the programmable processors.

13.

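Design and Implementation of a Fast Face Recognition System Based on Parallel Algorithms

许嘉诚 《无线互联科技》2020,(6):63-65

To speed up real-time face recognition and make full use of current multi-core processor resources, this article implements a fast face recognition system based on parallel algorithms in a Python environment. Facial features are exchanged with database storage in real time, and safe communication between subprograms is used to keep them synchronized. Real-time face detection and facial feature encoding extraction run in parallel, while the most similar face is matched at the same time. Experimental results show that the parallelized program is fast, accurate, and responsive in real time. The work offers an approach to large-scale face recognition processing and is also well suited to practical use.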

14.

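Research on Parallelization Strategies for the LTE Uplink Based on GPP

王海玲 彭涛 钱荣荣 《现代电信科技》2012,(7):22-26,31

The 3GPP LTE system must support very high air-interface data rates, which places stringent requirements on baseband signal processing time. Single-threaded serial processing can hardly meet the system requirements, so parallelization methods are needed to achieve high-speed data processing. On an Intel multi-core processor, uplink data were processed in parallel using OpenMP, taking into account the characteristics of the LTE uplink physical channel, and comparative tests verified that this approach achieves good results.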

15.

An Algorithm-Hardware-System Approach to VLIW Multimedia Processors

Mladen Berekovic Peter Pirsch Johannes Kneip 《The Journal of VLSI Signal Processing》1998,20(1-2):163-180

Very Long Instruction Word (VLIW) processor architectures for multimedia applications are discussed from an algorithm, hardware and system based point of view. VLIW processors show high flexibility and processing power, as well as a good utilization of resources by compiler-generated code, but their exclusive exploitation of instruction level parallelism (ILP) decreases in efficiency as the degree of parallelism increases. This is mainly caused by characteristics of multimedia algorithms, increasing wiring delays, compiler restrictions, and a widening gap between on-chip processing speed and available bandwidth to external memory. As new multimedia applications and standards continue to evolve (MPEG-4), the demand for higher processing power will continue. Therefore, parallel processing in all its available forms will have to be exploited to achieve significant performance improvements. We show that, due to the diminishing returns from a further increase in ILP, multimedia applications will benefit more from an additional exploitation of parallelism at thread-level. We examine how simultaneous multithreading (SMT), a novel architectural approach combining VLIW techniques with parallel processing of threads, can efficiently be used to further increase performance of typical multimedia workloads.

16.

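Evaluation of the Signaling Processing Capability of MIPS Instruction Set Multi-Core Processors

万志涛 《电信科学》2011,(Z1)

General-purpose high-performance processors are widely used for signaling processing but suffer from high power consumption. Low-power multi-core processors based on the MIPS instruction set have relatively high energy efficiency, but their signaling processing capability is unclear. This paper evaluates a processor's signaling processing capability using a memory-access-intensive method. A comparison of a MIPS multi-core processor with an x86 processor shows that the MIPS multi-core processor has advantages in both signaling processing capability and energy efficiency. Taking GTP as an example, implementations were built and performance-tested on a MIPS multi-core processor and on an x86 processor. The test results show that the evaluation method described here is reasonable, and also confirm that MIPS multi-core processors can be used for signaling processing, with significantly higher energy efficiency than general-purpose high-performance processors.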

17.

Optimizing energy-efficiency for program partitioning and mapping onto multi-core packet processing systems

Jing Huang Olga Ormond Di Ma Xiaojun Wang 《中国邮电高校学报(英文版)》2012

The sharp increase in bandwidth requirements and versatility of network applications has prompted packet processing systems to widely adopt a multi-core multi-threaded architectural design. A challenging issue when programming such a system is how to fully utilize the processing power in a pipeline-parallel topology. As the power consumption increases, maintaining the energy-efficiency of the whole system also becomes delicate. In this paper, we propose a strategy based on graph bi-partitioning (Bi-Par) to automatically map the programming code onto the multiple processing cores. The algorithm searches for an optimal configuration of the pipeline depth and the width of each pipeline stage. Steps taken to optimize the performance include iterations over the sub-tasks at the pipeline edges, and performing migration of tasks between cores to improve energy-efficiency. We designed a compiler framework to implement the algorithm, and used an experimental model to validate it. The simulation results show that our approach improves the energy-efficiency in all three benchmarks by between 8.04% and 34%, with a marginal loss in throughput.

18.

An attention controlled multi-core architecture for energy efficient object recognition

Joo-Young Kim Sejong Oh Seungjin Lee Minsu Kim Jinwook Oh Hoi-Jun Yoo 《Signal Processing: Image Communication》2010,25(5):363-376

In this paper, an attention controlled multi-core architecture is proposed for energy efficient object recognition. The proposed architecture employs two IP layers having different roles for energy efficient recognition processing: the attention/control IPs compute regions-of-interest (ROIs) of the entire image and control the multiple processing cores to perform local object recognition processing on selected areas. To this end, a task manager is proposed to perform dynamic scheduling of various ROI tasks from the attention IP to multiple cores in a unit of small-sized grid-tile. Thanks to a number of grid-tile threads generated by the task manager, the utilization of the multiple cores amounts to 92% on average. As a result, the proposed architecture achieves 2.1× energy reduction in multi-core recognition system by directing processing cores to focus on critical areas of the image with a 0.87 mJ attention processing. Finally, the proposed architecture is implemented in 0.13 μm CMOS technology and the fabricated chip verifies 3.2× lower energy dissipation per frame than the state-of-the-art object recognition processor.

19.

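A Survey of Power Optimization and Evaluation Techniques for Multi-Core Processors

邢立冬 《电子设计工程》2014,(12):100-103

Multi-core processors have become the mainstream of current processor design. Their parallel processing capability significantly improves processor performance; at the same time, their high level of integration significantly raises power consumption, which to some extent limits their development. This paper describes the basic theory of low-power design, commonly used low-power design techniques, and power evaluation techniques for multi-core processors. It also analyzes and summarizes the latest advances in low-power multi-core processor research, offering a useful reference for multi-core processor design.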

20.

Energy Aware Signal Processing for Software Defined Radio Baseband Implementation

Min Li David Novo Bruno Bougard Claude Desset Antoine Dejonghe Liesbet Van Der Perre Francky Catthoor 《Journal of Signal Processing Systems》2011,63(1):13-25

The fast-paced diversification and evolution of wireless communications require a wide variety of baseband implementations within a short time-to-market. Besides, the exponentially increased design complexity and design cost of deep sub-micron silicon make it highly desirable for designs to be reused as much as possible. This yields an increasing demand for reconfigurable/programmable baseband solutions. Implementing all baseband functionalities on programmable architectures, as foreseen in the tier-2 SDR, will become necessary in the future. However, the energy efficiency of SDR baseband platforms is a major concern. This brings a challenging gap that is continuously broadened by the exploding baseband complexity. We advocate a system level approach to bridge the gap. Specifically, we fully leverage the advantages (programmability) of SDR platforms to compensate for their disadvantages (energy efficiency). Highly flexible and dynamic baseband signal processing algorithms are designed and implemented to exploit the abundant dynamics in the environment and the user requirement. Instead of always performing the best effort, the baseband can dynamically and autonomously adjust its work load to optimize the average energy consumption. In this paper, we will introduce such baseband signal processing techniques optimized for SDR implementations. The methodology and design steps will be presented together with 3 representative case studies in HSDPA, WiMAX and 3GPP LTE.