Similar literature
1.
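Banked memory and multiple address generation units (AGUs) are architectural features common to DSPs. Traditional C compilers do not optimize code for banked memory and multi-AGU structures, so the generated code fails to meet performance requirements, which limits the use of C compilers in digital signal processing. To address this problem, a pragma-based compiler optimization algorithm for banked memory and multi-AGU structures is proposed. The algorithm uses the dataflow information in definition-use chains and use-definition chains to assign AGUs to address computation instructions and memory access instructions, thereby increasing the instruction-level parallelism of the generated code. Experimental results show that the algorithm achieves good optimization results.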

2.
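In domestic Sunway (申威) high-performance multi-core server systems, the base compiler does not take the instruction characteristics of the domestic processor into account when generating code for memory accesses in applications. As a result, the memory address computation code it generates is inefficient, which hurts the performance of the processor. To fully exploit the processor's high-performance computing capability, a compiler optimization method that accelerates memory address computation is proposed. The optimization relies on the processor's support for arithmetic instructions with a scaling factor. In the legality check for memory address expressions in the compiler back end, it adds a legality-checking algorithm for address expressions in multiply-add form, which automatically identifies multiply-add operations in address expressions and verifies their legality. For address expressions that qualify, the code generation stage matches and emits arithmetic instructions with a scaling factor to compute the memory address quickly. This speeds up the issue and execution of memory access instructions and the generation of memory addresses in applications, improving memory access efficiency. The optimization was evaluated with the industry-standard SPEC CPU2006 benchmark suite. Compared with the unoptimized results on the two subsets, SPECspeed Integer and SPECspeed Float Point, the method improves average performance by 2.53% and 1.50%, respectively.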
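
The legality check described in this abstract can be illustrated with a small pattern matcher. The following is only a sketch under stated assumptions, not the authors' implementation: the expression classes (`Reg`, `Const`, `Mul`, `Add`), the function names, and the set of supported scale factors (`SUPPORTED_SCALES`) are all hypothetical, since the abstract does not specify the compiler's intermediate representation or the scale factors the processor supports.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union

# Hypothetical scale factors accepted by the target's scaled arithmetic
# instructions; the abstract does not list the real ones.
SUPPORTED_SCALES = frozenset({4, 8})


@dataclass(frozen=True)
class Reg:
    name: str


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class Mul:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Reg, Const, Mul, Add]


def match_scaled_index(expr: Expr) -> Optional[Tuple[Reg, int]]:
    """Return (index, scale) if expr is `reg * const` with a supported scale."""
    if not isinstance(expr, Mul):
        return None
    # Multiplication is commutative, so try both operand orders.
    for a, b in ((expr.left, expr.right), (expr.right, expr.left)):
        if isinstance(a, Reg) and isinstance(b, Const) and b.value in SUPPORTED_SCALES:
            return a, b.value
    return None


def match_scaled_add(expr: Expr) -> Optional[Tuple[Reg, Reg, int]]:
    """Return (base, index, scale) if expr is `base + index * scale`.

    A match means the address could be computed by one scaled-add
    instruction; None means the expression fails the legality check.
    """
    if not isinstance(expr, Add):
        return None
    # Addition is commutative, so the base register may be on either side.
    for base, other in ((expr.left, expr.right), (expr.right, expr.left)):
        if isinstance(base, Reg):
            scaled = match_scaled_index(other)
            if scaled is not None:
                return base, scaled[0], scaled[1]
    return None
```

For example, `match_scaled_add(Add(Reg("rb"), Mul(Reg("ri"), Const(8))))` identifies base `rb`, index `ri`, and scale 8, while a scale of 3 is rejected because it is not in the assumed set.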

3.
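To improve the ability of embedded processors to handle compute-intensive applications and to make custom instructions more adaptable, an automated, multi-task-oriented method for extracting custom instruction patterns is proposed. The method takes the set of dataflow graphs of the target applications' hot-spot code as its object of analysis, adjusts the priorities of the target tasks by weighting graph frequencies, and mines the frequent computation patterns hidden in the tasks' hot spots as custom instruction patterns. Results from optimization examples in security encryption and media processing show that the method improves the adaptability and utilization of custom instructions, and that its optimization effect is better than that of the traditional approach of designing for each application independently.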

4.
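Because resources are unevenly distributed in heterogeneous environments, mobile embedded computing platforms face slow computation, limited storage, and constrained screens and network bandwidth when communicating with desktop systems. This imbalance poses new challenges for collaborative work. This article uses a mapping mechanism to bridge the differences between heterogeneous platforms: the objects operated on in the shared collaborative workspace are mapped so that graph attributes or graph topology are simplified to suit resource-constrained platforms. Mapping comprises topology mapping and attribute mapping. Topology mapping simplifies the original graph by changing its topological structure, that is, it simplifies the drawing being edited on the server side and saves resources on the client. It includes three methods: subgraph mapping, path merging, and vertex compression. Attribute mapping complements topology mapping and converts or simplifies the graph by changing certain attribute values on it.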

5.
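Convolutional neural networks (CNNs) perform very well in image processing, speech recognition, natural language processing, and other fields. Large-scale neural network models often run into limits on computing, storage, and other resources, and sparse neural networks effectively ease the demand for computation and storage. Existing domain-specific accelerators can process sparse networks effectively, but they achieve high energy efficiency by tightly coupling algorithm and architecture, and so lose architectural flexibility. A coarse-grained dataflow architecture can support different neural network applications through flexible instruction scheduling. On this architecture, the regular computation of dense convolution lets different channels share one set of instructions. In sparse networks, however, weight sparsity means that some of these instructions are invalid instructions associated with zero values, and the existing instruction execution scheme cannot skip them automatically, which results in wasted computation. In addition, when irregular sparse networks are executed, the existing instruction mapping method leaves the compute array with an unbalanced load. These problems hinder performance improvements for sparse networks. On the premise that different channels share one set of instructions, an instruction control unit is added, based on the data and instruction characteristics of sparse networks, to detect and skip instructions associated with zero values in the weight data, and a load-balanced instruction mapping algorithm is used to resolve the unbalanced instruction execution in sparse networks. Experiments show that, compared with dense networks, sparse networks achieve an average 1.55x performance improvement and a 63.77% reduction in energy consumption. They also run 2.39x (Alexnet) and 2.28x (VGG16) faster than sparse networks implemented on a GPU (cuSparse), and 1.14x (Alexnet) and 1.23x (VGG16) faster than on Cambricon-X.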
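
The two ideas in this abstract, skipping instructions tied to zero weights and balancing the remaining work across the compute array, can be sketched in software. This is a rough illustration only: the function names are hypothetical, the paper realizes zero skipping in a hardware instruction control unit, and the greedy "largest job to the least-loaded processing element" heuristic below is a stand-in, since the abstract does not describe the actual load-balanced mapping algorithm.

```python
import heapq
from typing import Dict, List, Sequence


def effective_macs(weights: Sequence[float]) -> List[int]:
    """Positions of non-zero weights.

    Each position stands for one multiply-accumulate instruction in the
    shared instruction set; instructions at zero-weight positions are skipped.
    """
    return [i for i, w in enumerate(weights) if w != 0]


def balance_channels(channels: Dict[str, Sequence[float]], num_pes: int) -> Dict[int, List[str]]:
    """Assign channels to processing elements (PEs) by remaining work.

    Stand-in heuristic (an assumption, not the paper's algorithm): sort
    channels by surviving instruction count, largest first, and give each
    one to the currently least-loaded PE.
    """
    if num_pes <= 0:
        raise ValueError("num_pes must be positive")
    work = {name: len(effective_macs(w)) for name, w in channels.items()}
    # Min-heap of (current load, PE id) so the least-loaded PE pops first.
    heap = [(0, pe) for pe in range(num_pes)]
    heapq.heapify(heap)
    assignment: Dict[int, List[str]] = {pe: [] for pe in range(num_pes)}
    # Ties in work are broken by channel name to keep the result deterministic.
    for name in sorted(work, key=lambda n: (-work[n], n)):
        load, pe = heapq.heappop(heap)
        assignment[pe].append(name)
        heapq.heappush(heap, (load + work[name], pe))
    return assignment
```

With four channels whose non-zero weight counts are 3, 2, 1, and 1, two PEs end up with loads of 4 and 3 surviving instructions, whereas a naive split of the first two and last two channels would give loads of 5 and 2.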

6.
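Dataflow programming languages are domain-oriented programming languages that separate computation from communication and expose the parallelism of applications. The complexity of the underlying computing, storage, and communication resources in multi-core clusters poses new challenges for the performance of dataflow programs. To address the low resource utilization and poor scalability of dataflow programs running on multi-core clusters, this paper uses synchronous dataflow graphs as the intermediate representation and proposes and implements a hierarchical pipeline-parallel optimization method for multi-core clusters. The method comprises task partitioning and scheduling, hierarchical pipeline scheduling, and data locality optimization; after compiler optimization, it generates MPI-based target code that can execute in parallel. Task partitioning and scheduling uses the data and task parallelism in the program to map tasks onto compute cores, achieving load balance and low communication and synchronization overhead. Hierarchical pipeline scheduling uses the parallelism in the program to construct a low-latency pipeline schedule. Data locality optimization is a storage-oriented optimization that targets cache false sharing in data accesses. The experiments use a cluster of X86 multi-core processors as the platform and typical application algorithms from the media processing domain as test programs to analyze the hierarchical pipeline optimization. The experimental results demonstrate the effectiveness of the optimization method.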

7.
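This paper presents models of a static dataflow computer and of dataflow graphs, and from them builds a model of the relationship between instruction operation overhead and parallelism in a dataflow computer. Using this model, programs with various degrees of parallelism are evaluated to obtain the relationship between the actual parallelism at run time and the instruction operation overhead. The conclusion is that, for a given program, the actual parallelism it achieves on a system is determined solely by the instruction operation overhead of that system, namely MP/(OH+1), where MP is the average parallelism of the program and OH is the average instruction operation overhead. Operation overhead therefore has a severe impact on system performance in dataflow computers.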
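
The relationship MP/(OH+1) stated in this abstract is easy to evaluate numerically. The helper below is a minimal sketch; the function name and the input validation are assumptions added for illustration, and the units of OH follow whatever convention the paper uses, which the abstract does not spell out.

```python
def actual_parallelism(mp: float, oh: float) -> float:
    """Actual run-time parallelism MP / (OH + 1).

    mp: average parallelism of the program (MP in the abstract).
    oh: average instruction operation overhead (OH in the abstract).
    """
    if mp <= 0:
        raise ValueError("mp must be positive")
    if oh < 0:
        raise ValueError("oh must be non-negative")
    return mp / (oh + 1)
```

For a program with MP = 8, an overhead of OH = 0 leaves the full parallelism of 8, OH = 1 halves it to 4, and OH = 3 reduces it to 2, which shows how quickly overhead erodes the achievable parallelism.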

8.
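General-purpose computing model based on graphics processors   Cited by: 4 (self-citations: 4, citations by others: 0)
Based on the characteristics of GPU graphics processing, this paper analyzes the parallel processing mechanism and data mapping involved in applying GPUs to general-purpose computing, and proposes a mapping mechanism and a general design method for the GPU general-purpose computing model. It also tests GPU performance in terms of throughput, data stream processing capability, and basic mathematical operation capability, providing a reference for the design, implementation, and performance optimization of general-purpose GPU algorithms.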

9.
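Graph computing has become a mainstream application in big data processing, and acceleration with dedicated hardware can significantly improve its performance and energy efficiency. Writing and verifying hardware code is well known to be very time-consuming. General-purpose high-level synthesis (HLS) systems let users generate hardware structures automatically from high-level language features (such as those of C), but for irregular algorithms such as graph computing they still lack effective support for parallelism and memory access, and suffer from prominent problems such as unsatisfactory synthesis results and low efficiency. An efficient HLS method for graph computing is proposed. Taking into account the nested loops, random memory accesses, data conflicts, and power-law distribution characteristic of graph algorithms, it uses a dataflow architecture to implement efficient parallel pipelines and to keep the load balanced across processing elements. Through the programming primitives provided, the method converts general graph algorithms into a modular dataflow intermediate representation, which is then mapped onto parameterized hardware templates. An implementation on a Xilinx Virtex UltraScale+ XCVU9P verifies the correctness of the method. Experimental results for different types of graph algorithms on multiple datasets show that the method achieves a 7.9x to 30.6x performance improvement over Spatial, an HLS system in general use internationally.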

10.
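Code selection is a very important task in the code generation phase of a compiler. Its goal is to find an efficient mapping between machine-independent intermediate representation code and processor-specific machine instructions. To support the SIMD instructions of DSP processors, a code selection technique based on dataflow graphs (DFGs) is proposed, building on traditional code selection algorithms that use dataflow trees as the intermediate representation. It seeks an optimal cover of the entire DFG while discovering and exploiting SIMD instructions to the greatest possible extent.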

11.
Dataflow architecture has shown its advantages in many high-performance computing cases. In dataflow computing, large amounts of data are frequently transferred among processing elements through the network-on-chip (NoC), so the router design has a significant impact on the performance of dataflow architecture. Common routers are designed for control-flow multi-core architectures, and we find that they are not suitable for dataflow architecture. In this work, we analyze and extract the features of data transfers in the NoCs of dataflow architecture: multiple destinations, a high injection rate, and performance that is sensitive to delay. Based on these three features, we propose a novel and efficient NoC router for dataflow architecture. The proposed router supports multiple destinations, so it can deliver data to several destinations in a single transfer. Moreover, the router adopts an output buffer to maximize throughput and non-flit packets to minimize transfer delay. Experimental results show that the proposed router improves the performance of dataflow architecture by 3.6x over a state-of-the-art router.

12.
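A new polymorphic, efficient parallel array machine architecture, the Firefly 2 (萤火虫2号) array machine, is proposed. Its processing elements can run in both SIMD and MIMD modes, support an asynchronous execution mechanism, and can also perform distributed instruction-level parallel processing. The architecture adopts a hardware multithreading manager and an efficient communication mechanism, which enable the array machine to carry out thread-level, data-level, and distributed instruction-level parallel computation very efficiently. Notably, the stream processing performance of the array machine is comparable to that of application-specific integrated circuits. The architecture can also implement static and dynamic dataflow computation effectively, and can handle graphics, image, and digital signal processing tasks efficiently.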

13.
An experimental approach is chosen to investigate the performance of a fine-grained dataflow architecture for numerically intensive digital signal processing (DSP) applications. The focus is on the behavior of pipelined data-parallel algorithms. However, the granularity of the high-level language programming blocks is not explicitly optimized to balance computation and communication; a natural and logical fine-grained decomposition of problems is used instead. The authors interpret their empirical data by means of parameters such as the number of instructions per generic unit of computation, the density of precedence relations, and the serial fraction. The performance and limitations of fine-grained general-purpose dataflow computing are discussed.

14.
Synchronous dataflow architecture for network processors   Cited by: 1 (self-citations: 0, citations by others: 1)
Carlstrom, J.; Boden, T. IEEE Micro, 2004, 24(5): 10-18
Network processors are programmable, highly integrated communications circuits optimized to provide processing at high data and packet rates. The packet instruction set computer (PISC) architecture is a synchronous dataflow architecture developed for network processors. It uses a deep pipeline that contains two types of processing elements: PISC processors, which perform programmable data manipulation, and I/O processors, which provide access to shared resources such as look-up table memory, hardware accelerators, or coprocessors.

15.
The dataflow program graph execution model, or dataflow for short, is an alternative to the stored-program (von Neumann) execution model. Because it relies on a graph representation of programs, the strengths of the dataflow model are very much the complements of those of the stored-program one. In the last thirty or so years since it was proposed, the dataflow model of computation has been used and developed in very many areas of computing research: from programming languages to processor design, and from signal processing to reconfigurable computing. This paper is a review of the current state-of-the-art in the applications of the dataflow model of computation. It focuses on three areas: multithreaded computing, signal processing and reconfigurable computing.

16.
17.
We propose a new framework design for exploiting multi-core architectures in the context of visualization dataflow systems. Recent hardware advancements have greatly increased the levels of parallelism available with all indications showing this trend will continue in the future. Existing visualization dataflow systems have attempted to take advantage of these new resources, though they still have a number of limitations when deployed on shared memory multi-core architectures. Ideally, visualization systems should be built on top of a parallel dataflow scheme that can optimally utilize CPUs and assign resources adaptively to pipeline elements. We propose the design of a flexible dataflow architecture aimed at addressing many of the shortcomings of existing systems including a unified execution model for both demand-driven and event-driven models; a resource scheduler that can automatically make decisions on how to allocate computing resources; and support for more general streaming data structures which include unstructured elements. We have implemented our system on top of VTK with backward compatibility. In this paper, we provide evidence of performance improvements on a number of applications.

18.
It is widely believed that superscalar and superpipelined extensions of RISC style architecture will dominate future processor design, and that needs of parallel computing will have little effect on processor architecture. This belief ignores the issues of memory latency and synchronization, and fails to recognize the opportunity to support a general semantic model for parallel computing. Efforts to extend the shared-memory model using standard microprocessors have led to systems that implement no satisfactory model of computing, and present the programmer with a difficult interface on which to build parallel computing applications. A more satisfactory model for parallel computing may be obtained on the basis of functional programming concepts and the principles of modular software construction. We recommend that designs for computers be built on such a general semantic model of parallel computation. Multithreading concepts and dataflow principles can frame the architecture of these new machines.

19.
The design of specialized processing array architectures, capable of executing any given arbitrary algorithm, is proposed. An approach is adopted in which the algorithm is first represented in the form of a dataflow graph and then mapped onto the specialized processor array. The processors in this array execute the operations included in the corresponding nodes (or subsets of nodes) of the dataflow graph, while regular interconnections of these elements serve as edges of the graph. To speed up the execution, the proposed array allows the generation of computation fronts and their cancellation at a later time, depending on the arriving data operands; thus it is called a data-driven array. The structure of the basic cell and its programming are examined. Some design details are presented for two selected blocks, the instruction memory and the flag array. A scheme for mapping a dataflow graph (program) onto a hexagonally connected array is described and analyzed. Two distinct performance measures, mapping efficiency and array utilization, and some performance results are discussed.

20.
This paper presents an extended architecture and a scheduling algorithm for a dataflow computer aimed at real-time processing. From the real-time processing point of view, current dataflow computers have several problems which stem from their hardware mechanism for scheduling instructions based on data synchronization. This mechanism extracts as many eligible instructions as possible for execution of a program, then executes them in parallel. Hence, the computation in a dataflow computer is generally difficult to interrupt and schedule using software. To realize a controllable dataflow computation, two basic mechanisms are introduced for serializing concurrent processes and interrupting the execution of a process. A parallel and distributed algorithm for the scheduler is presented, with these two mechanisms, which controls and decides state transitions and execution order of the processes based on priority and execution depth, while still maintaining the number of running-state processes at a preferred value. To gear the scheduler algorithm to meet one of the requirements for real-time processing, such as time-constrained computing, a data-parallel algorithm for selection of the user-process with the current highest priority in O (x log x n) time is proposed, where n is the number of priority levels.
