Similar literature
1.
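The range-Doppler algorithm is a classic method for synthetic aperture radar (SAR) imaging. As demands on signal-processing performance have risen, multicore parallel processors have emerged, such as TI's C6678, an 8-core DSP. When an imaging algorithm is implemented on a multicore DSP, multicore multithreaded design, task allocation across cores, and the balance between computation and data transfer are the key issues affecting performance. A parallel design framework is implemented using data parallelism. Based on the characteristics of the range-Doppler algorithm, several parallel schemes are designed: pulse compression performed concurrently with data reception, cooperative processing of large-point pulse compression by all 8 cores, and independent processing of small-point tasks by each core. The multicore DSP parallel design greatly improves the processing performance of the range-Doppler algorithm.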

2.
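To implement complex processing algorithms such as target detection and recognition and high-precision measurement of target information, a single-chip multicore DSP, the TMS320C6678, is used to build a missile-borne, high-speed, multitask, real-time embedded processing platform. Using a parallel software design method based on a dataflow processing model, the system's processing tasks are distributed evenly across the processor cores to achieve real-time parallel processing and improve the functionality and performance of the missile-borne information processing system. Developing parallel software for multicore processors and fully exploiting their processing capability will become a new topic in software design for missile-borne embedded systems.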

3.
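Parallel computing is an important direction in high-performance computing. As demand for processing capability in signal processing, communications, and other fields keeps growing, parallel development techniques for DSPs have advanced rapidly. Multi-device parallelism and on-chip multicore designs can both effectively improve processing performance. Compared with a traditional single-core DSP, multicore parallel processing requires multitask parallel design, which makes system design more complex. After discussing the key techniques for signal-processing development on an 8-core processor, the paper designs a multicore parallel signal-processing scheme based on Round-Robin scheduling and discusses multicore synchronization, cache coherence, and parallel task allocation.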
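
As an illustration of the Round-Robin mode named in this abstract, the following minimal sketch deals task indices to cores in rotation. The function name `round_robin_assign`, the default of 8 cores, and the use of plain integers as tasks are assumptions made for illustration; the paper's actual implementation is not reproduced here.

```python
from collections import defaultdict


def round_robin_assign(num_tasks, num_cores=8):
    """Assign task i to core i % num_cores (hypothetical helper, illustration only)."""
    schedule = defaultdict(list)
    for task_id in range(num_tasks):
        # Tasks are handed out in rotation, so per-core load differs by at most one task.
        schedule[task_id % num_cores].append(task_id)
    return dict(schedule)


if __name__ == "__main__":
    for core, tasks in sorted(round_robin_assign(20).items()):
        print(f"core {core}: tasks {tasks}")
```

Round-robin needs no knowledge of task cost, which keeps the dispatcher simple; it balances load well only when tasks take roughly equal time.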

4.
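FFT and IFFT are the most commonly used algorithms in signal processing. As technical requirements rise, FFT and IFFT sizes keep growing, and signal processors are evolving from single-core designs toward multiprocessor and multicore parallelism. The paper studies parallel design methods for large-point FFT and IFFT: the IFFT is converted into an FFT computation, and a large-point FFT is split into small-point operations. Buffered and unbuffered large-point FFT and IFFT designs are implemented on the TI C6678 8-core processor. Through parallel design, large-point FFT and IFFT are computed in parallel on the 8-core processor. Overlapping computation with data transfer and parallelizing across cores improves processing performance.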
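
The two ideas in this abstract can be sketched in a few lines, under stated assumptions. The sketch uses the standard conjugation identity IFFT(X) = conj(FFT(conj(X)))/N and the standard Cooley-Tukey split of a length N = n1*n2 transform into small transforms; whether the paper uses exactly this decomposition is not stated. All function names are hypothetical, and a direct O(n^2) DFT stands in for an optimized small-point FFT kernel.

```python
import cmath


def dft(x):
    """Direct DFT; stands in for an optimized small-point FFT kernel."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]


def split_fft(x, n1, n2):
    """Length n1*n2 DFT built from small DFTs (Cooley-Tukey decomposition)."""
    n = n1 * n2
    if len(x) != n:
        raise ValueError("len(x) must equal n1 * n2")
    # Step 1: n2 independent length-n1 DFTs over strided sub-sequences.
    cols = [dft(x[c::n2]) for c in range(n2)]
    # Step 2: multiply by twiddle factors exp(-2*pi*i * c * k1 / n).
    for c in range(n2):
        for k1 in range(n1):
            cols[c][k1] *= cmath.exp(-2j * cmath.pi * c * k1 / n)
    # Step 3: n1 independent length-n2 DFTs across the sub-sequences.
    out = [0j] * n
    for k1 in range(n1):
        row = dft([cols[c][k1] for c in range(n2)])
        for k2 in range(n2):
            out[k1 + n1 * k2] = row[k2]
    return out


def ifft_via_fft(spectrum, n1, n2):
    """Inverse transform computed with the forward routine only."""
    n = len(spectrum)
    y = split_fft([v.conjugate() for v in spectrum], n1, n2)
    return [v.conjugate() / n for v in y]
```

The small transforms in steps 1 and 3 are independent of one another, which is what makes it possible to hand them to different cores.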

5.
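Design and implementation of a heterogeneous multicore DSP based on CostarⅡ   Total citations: 1 (self-citations: 1, citations by others: 0)
A high-performance embedded heterogeneous multicore DSP is designed and implemented based on the CostarⅡ DSP core. The design integrates four DSP cores and one RISC processor core. Each core has its own private memory; all cores share a data memory with a multi-bank parallel storage structure; the four DSP cores use a configurable shared program memory; and the cores synchronize and communicate through several mechanisms, including mailboxes, semaphores, and interrupts. To validate the design, a JPEG decoding algorithm was tested on the system, and the design passed FPGA verification. Test results show that the design offers a simple programming model and makes it easy to increase the parallelism of task execution.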

6.
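《信息技术》 (Information Technology), 2015, (7): 5-8
A digital signal processor (DSP) is a dedicated chip for high-speed processing of digital signals. Single-core DSPs can no longer meet the needs of the information industry, so multicore DSPs have emerged; they also raise the challenge of designing parallel software that uses a multicore platform correctly and efficiently. Based mainly on the Texas Instruments TMS320C6678 platform, the paper studies multicore DSP application development and briefly describes a dual-core parallel AVS video decoder implemented on this DSP.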

7.
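To achieve fast cell-image segmentation on a DSP platform, the principle of the Canny operator is analyzed in detail and the algorithm is ported to the TI TMS320C6678 DSP, taking the processor's features into account. For image data exchanged with external memory, the previous approach of accessing the image one gray value at a time is replaced by a vectorized data-packing method for Gaussian filtering that increases parallelism. In gradient computation and threshold computation, wide memory accesses are used to improve the efficiency of reading and writing external memory. Results show that the proposed optimizations speed up the operator without changing the segmentation result, and can serve as a reference for engineers porting and optimizing algorithms on DSP platforms.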

8.
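俞健, 周维超, 刘坤. 《半导体光电》 (Semiconductor Optoelectronics), 2012, 33(6): 902-905
For DSP+FPGA high-speed image processing systems, which handle large data volumes and complex computation, the paper proposes interconnecting the DSP and FPGA processors via the SRIO protocol, and further uses the MPMC controller in the FPGA to attach DDR2 SDRAM, giving the processors in the image processing system shared memory. By programming both the DSP and the FPGA, the method implements transfers using the memory-mapped I/O transaction (LSU) mode of the SRIO protocol; data transferred between the processors over the SRIO interface reaches 3.125 Gb/s. Experimental results show that the method achieves stable and reliable data transfer between processors, makes data exchange within the system flexible and fast, improves the DSP's co-processing capability, and meets the system's real-time requirements well.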

9.
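The paper proposes a processor architecture that unifies a DSP and a general-purpose CPU, and completes the design and tape-out verification of a homogeneous 4-core processor based on it. The processor uses a VLIW structure, supports a self-defined DSP instruction set, is compatible with the existing general-purpose MIPS 4KC instruction set, and supports parallel issue on up to 8 instruction lanes. Without changing the CPU's instruction encoding or execution order, it unifies DSP and CPU execution at the chip-architecture level, which suits embedded application development that handles signal and protocol processing for broadband communications and multimedia on a single platform. The processor core issues DSP and CPU instructions in parallel using leading and trailing parallel flag bits in the self-defined DSP instruction word together with a dedicated leading paralink instruction. On the homogeneous 4-core architecture, an on-chip inter-core data storage policy of global read and local write is adopted, sharing on-chip data while keeping hardware overhead under control. Simulation and tape-out verification results show that the proposed unified DSP/CPU architecture is feasible and has advantages in embedded applications such as broadband communications and multimedia.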

10.
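张宇帆, 陈颖, 方科, 费霞. 《电讯技术》 (Telecommunication Engineering), 2023, (4): 536-543
Multicore DSP cluster systems that use multicore digital signal processors (DSPs) as compute nodes are becoming a major trend. At present, because the hardware resources of the DSP cores are underused and memory-access bandwidth is limited, a gap remains between peak and achieved performance. Drawing on the rich instruction set architecture of the C66x core and the principles of arithmetic instruction scheduling, and using the assembly information provided by the compiler, the authors design and optimize a QR decomposition algorithm that pushes single-core DSP performance to its limit while reducing the time for matrix factorization. Using these optimization techniques, they design and implement a large-scale parallel QR decomposition model for a multicore DSP cluster and complete the factorization task on a distributed computing framework. The analysis shows that the optimized QR decomposition improves both computational efficiency and C66x single-core hardware utilization by more than twenty times, and that as the matrix size doubles repeatedly, the performance gain of the multicore DSP cluster over a single core becomes increasingly pronounced.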
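
For readers unfamiliar with QR decomposition, here is a minimal sketch of what the factorization computes. It uses modified Gram-Schmidt purely as an assumption for illustration: the abstract does not say which QR variant the authors optimized, and the function name `qr_mgs` is hypothetical. The input is assumed to have full column rank.

```python
import math


def qr_mgs(a):
    """QR of an m x n matrix (list of rows) by modified Gram-Schmidt.

    Returns (q, r) with q of size m x n having orthonormal columns and
    r of size n x n upper triangular, such that a = q * r.
    """
    m, n = len(a), len(a[0])
    # v[j] holds column j of the input and is updated in place.
    v = [[float(a[i][j]) for i in range(m)] for j in range(n)]
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        r[j][j] = math.sqrt(sum(t * t for t in v[j]))
        qj = [t / r[j][j] for t in v[j]]
        q_cols.append(qj)
        # The updates of the remaining columns are independent of one
        # another, so this loop is a natural candidate for parallel execution.
        for k in range(j + 1, n):
            r[j][k] = sum(qj[i] * v[k][i] for i in range(m))
            v[k] = [v[k][i] - r[j][k] * qj[i] for i in range(m)]
    q = [[q_cols[j][i] for j in range(n)] for i in range(m)]
    return q, r
```

Production implementations more often use Householder reflections or Givens rotations for better numerical behavior; the sketch only shows the structure of the result.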

11.
General-purpose multicore processors are being accepted in all segments of the industry, including signal processing and embedded space, as the need for more performance and general-purpose programmability has grown. Parallel processing increases performance by adding more parallel resources while maintaining manageable power characteristics. The implementations of multicore processors are numerous and diverse. Designs range from conventional multiprocessor machines to designs that consist of a "sea" of programmable arithmetic logic units (ALUs). In this article, we cover some of the attributes common to all multicore processor implementations and illustrate these attributes with current and future commercial multicore designs. The characteristics we focus on are application domain, power/performance, processing elements, memory system, and accelerators/integrated peripherals.

12.
Parallelization of Digital Signal Processing (DSP) software is an important trend in Multiprocessor System-on-Chip (MPSoC) implementation. The performance of DSP systems composed of parallelized computations depends on the scheduling technique, which must in general allocate computation and communication resources for competing tasks, and ensure that data dependencies are satisfied. In this paper, we formulate a new type of parallel task scheduling problem called Parallel Actor Scheduling (PAS) for MPSoC mapping of DSP systems that are represented as Synchronous Dataflow (SDF) graphs. In contrast to traditional SDF-based scheduling techniques, which focus on exploiting graph level (inter-actor) parallelism, the PAS problem targets the integrated exploitation of both intra- and inter-actor parallelism for platforms in which individual actors can be parallelized across multiple processing units. We first address a special case of the PAS problem in which all of the actors in the DSP application or subsystem being optimized are parallel actors (i.e., they can be parallelized to exploit multiple cores). For this special case, we develop and experimentally evaluate a two-phase scheduling framework with three work flows that involve particle swarm optimization (PSO) — PSO with a mixed integer programming formulation, PSO with simulated annealing, and PSO with a fast heuristic based on list scheduling. Then, we extend our scheduling framework to support the general PAS problem, which considers both parallel actors and sequential actors (actors that cannot be parallelized) in an integrated manner. We demonstrate that our PAS-targeted scheduling framework provides a useful range of trade-offs between synthesis time requirements and the quality of the derived solutions. 
We also demonstrate the performance of our scheduling framework from two aspects: simulations on a diverse set of randomly generated SDF graphs, and implementations of an image processing application and a software defined radio benchmark on a state-of-the-art multicore DSP platform.
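
The abstract mentions a fast heuristic based on list scheduling. The following is a generic sketch of classic list scheduling on a task graph, not the PAS framework itself: it ignores intra-actor parallelism and communication cost, and the longest-task-first priority rule, the function name `list_schedule`, and the data layout are all assumptions made for illustration.

```python
def list_schedule(durations, preds, num_procs):
    """Greedy list scheduling of a task graph on identical processors.

    durations: dict task -> execution time
    preds:     dict task -> iterable of predecessor tasks
    Returns (placement, makespan), where placement maps each task
    to a (processor, start_time) pair.
    """
    finish, placement = {}, {}
    proc_free = [0] * num_procs
    pending = set(durations)
    while pending:
        # A task is ready once all of its predecessors have been scheduled.
        ready = [t for t in pending
                 if all(p in finish for p in preds.get(t, ()))]
        if not ready:
            raise ValueError("dependency cycle detected")
        # Priority rule (an assumption): longest task first, ties by name.
        task = min(ready, key=lambda t: (-durations[t], t))
        data_ready = max((finish[p] for p in preds.get(task, ())), default=0)
        # Pick the processor on which the task can start earliest.
        proc = min(range(num_procs),
                   key=lambda p: max(proc_free[p], data_ready))
        start = max(proc_free[proc], data_ready)
        finish[task] = start + durations[task]
        proc_free[proc] = finish[task]
        placement[task] = (proc, start)
        pending.remove(task)
    return placement, max(finish.values())
```

List scheduling runs in polynomial time and gives no optimality guarantee in general, which is consistent with the abstract's framing of a trade-off between synthesis time and solution quality.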

13.
Including multiple cores on a single chip has become the dominant mechanism for scaling processor performance. Exponential growth in the number of cores on a single processor is expected to lead in a short time to mainstream computers with hundreds of cores. Scalable implementations of parallel algorithms will be necessary in order to achieve improved single-application performance on such processors. In addition, memory access will continue to be an important limiting factor on achieving performance, and heterogeneous systems may make use of cores with varying capabilities and performance characteristics. An appropriate programming model can address scalability and can expose data locality while making it possible to migrate application code between processors with different parallel architectures and variable numbers and kinds of cores. We survey and evaluate a range of multicore processor architectures and programming models with a focus on GPUs and the Cell BE processor. These processors have a large number of cores and are available to consumers today, but the scalable programming models developed for them are also applicable to current and future multicore CPUs.

14.
P. Lapsley, G. Blalock. IEEE Spectrum, 1996, 33(7): 74-78
The market for products based on digital signal processing (DSP) technology (wireless communication devices and PC multimedia peripherals, for example) is growing rapidly. Semiconductor manufacturers have responded to this demand with a bewildering array of DSP processors. Selecting the best one for a given application presents a difficult and time-consuming challenge for DSP system designers. Simple, familiar performance measures like MIPS and MOPS are misleading and neglect factors like memory usage, power consumption and application execution time. Complex alternatives, such as application benchmarks, suffer from limitations that virtually preclude fair comparisons. Fortunately, a compromise methodology that combines algorithm kernel benchmarking with application profiling yields good estimates of processor performance weighted to the target application.

15.
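岳梦云, 白冰. 《电子学报》 (Acta Electronica Sinica), 2000, 48(10): 2041-2046
This paper presents the microarchitecture definition of a digital signal processing system suited to motor vector control algorithms, including its instruction set definition, memory model, and mode of interaction with the host CPU. The design has two notable advantages: fixing some operands of multi-operand instructions effectively shortens instruction encodings and raises code density, and executing multi-cycle instructions in the background improves ALU parallel efficiency. The paper gives the instruction cycle counts of a typical FOC control algorithm implemented on the DSP (Digital Signal Processor) instruction set, as well as the circuit implementation of the architecture. Finally, using the ARM CORTEX-M0 and several mainstream DSPs as baselines, measured experimental data demonstrate the architecture's high energy efficiency: at a modest cost in circuit area, it greatly improves the running efficiency of embedded systems with an integrated DSP.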

16.
A 32-b RISC/DSP microprocessor with reduced complexity   Total citations: 2 (self-citations: 0, citations by others: 2)
This paper presents a new 32-b reduced instruction set computer/digital signal processor (RISC/DSP) architecture which can be used as a general purpose microprocessor and in parallel as a 16-/32-b fixed-point DSP. This has been achieved by using RISC design principles for the implementation of DSP functionality. A DSP unit operates in parallel to an arithmetic logic unit (ALU)/barrel shifter on the same register set. This architecture provides the fast loop processing, high data throughput, and deterministic program flow absolutely necessary in DSP applications. Besides offering a basis for general purpose and DSP processing, the RISC philosophy offers a higher degree of flexibility for the implementation of DSP algorithms and achieves higher clock frequencies compared to conventional DSP architectures. The integrated DSP unit provides instruction set support for highly specialized DSP algorithms. Subword processing optimized for DSP algorithms has been implemented to provide maximum performance for 16-b data types. While creating a unified base for both application areas, we also minimized transistor count and we reduced complexity by using a short instruction pipeline. A parallelism concept based on a varying number of instruction latency cycles made superscalar instruction execution superfluous.

17.
This paper examined how recent innovations in processor technology are pushing the limits for ATE applications. Various multicore programming techniques were discussed including task parallelism, data parallelism, and pipelining. In addition, an example of optimizing complex analysis was covered. The benefits of adopting multicore technology and parallel software architectures include a reduction in overall test time, more sophisticated simulation approaches, and the ability to analyze complex systems.

18.
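张然, 刘佩林. 《信息技术》 (Information Technology), 2011, 35(4): 14-18, 84
HD video coding has broad application prospects. To address the bottleneck of real-time processing of its large data volume, a parallel encoding system architecture based on the CUDA platform is proposed. Following the hardware and software characteristics of CUDA, a three-level parallel mechanism is adopted; a highly parallel fast ME search algorithm is proposed; and memory is allocated so that large volumes of data can be computed and accessed in real time. Experimental results show that the scheme offers high parallelism and a high encoding rate and can meet real-time encoding requirements for HD video.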

19.
Parallel computing is rapidly entering mainstream computing, and multicore processors can now be found in the heart of supercomputers, desktop computers, and laptops. Consequently, applications will increasingly need to be parallelized to fully exploit the multicore processor throughput gains that are becoming available. Unfortunately, writing parallel code is more complex than writing serial code. An introductory parallel computing course aims to introduce students to this technology shift and to explain that parallelism calls for a different way of thinking and new programming skills. The course covers theoretical topics and offers practical experience in writing parallel algorithms on state-of-the-art parallel computers, parallel programming environments, and tools.
