Similar literature
 Found 19 similar documents; search took 125 ms
1.
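Core loops in data-intensive applications account for a large share of program execution time, and mapping them effectively onto coarse-grained reconfigurable architectures (CGRAs) remains a difficult open problem. To exploit as much application parallelism as possible on a CGRA, reduce loop memory-access overhead, and improve hardware resource utilization, this paper proposes a novel data-parallel optimization method for CGRA loop pipelining mapping. A new reconfigurable computing model, TMGC2, is defined to accelerate a loop with multiple parallel data pipelines. To keep the additional memory-bank conflicts caused by parallel execution from degrading CGRA performance, and to create favorable data conditions for the subsequent loop mapping, a memory-bank conflict elimination strategy is introduced to reorganize the data, and it is combined with a data reuse graph to achieve data-parallel optimization. Experiments show that applying the proposed method to existing CGRA loop pipelining mapping methods improves data throughput by 37.2% and resource utilization by 41.3%.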

2.
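Applying loop pipelining to coarse-grained reconfigurable architectures can yield significant performance gains. The key issues are loop control, pipeline synchronization, and effective use of memory. This paper describes a hardware implementation of autonomous loop pipelining on the coarse-grained reconfigurable architecture LEAP. The approach is built on a control unit that supports automatic scheduling of loop iterations, data-driven ALUs, and configurable static switch routing. By dynamically scheduling the operations in a loop, LEAP can exploit a higher degree of program parallelism, while distributed memory access and efficient data reuse improve bandwidth utilization. Experimental results show that LEAP achieves a speedup of 13.08x to 535.65x over a general-purpose processor.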

3.
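Developing a mapping tool for coarse-grained reconfigurable arrays is the key to mapping application algorithms correctly and effectively onto reconfigurable hardware and to running them correctly and efficiently there. We therefore designed and implemented such a mapping tool. This paper describes the design and implementation of the tool and presents the key technique in the implementation: placement. Finally, it reports the tool's mapping results for several test programs. The tests show that the placement algorithm produces correct and optimized results, and that the mapping tool is soundly designed and functionally correct.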

4.
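To optimize the mapping and acceleration performance of coarse-grained reconfigurable cell arrays more effectively, a spatial mapping and scheduling method without dependency constraints between row nodes is proposed. Under identical conditions, timed Petri nets are used to analyze the execution of several data-flow subgraphs that have already been partitioned and mapped onto the reconfigurable cell array according to the constraints. An example compares the execution results with and without row-node dependencies, and the results show that the spatial mapping method is feasible.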

5.
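ASCRA: an application-oriented reconfigurable compiler (in English)   Total citations: 1, self-citations: 0, citations by others: 1
Reconfigurable computing has been studied in many application domains, but for lack of high-level design tools, designers need deep software and hardware expertise to develop programs for GPP/RAU architectures, which has hindered large-scale adoption. This paper proposes the initial architecture of ASCRA, an application-oriented reconfigurable compiler that automatically maps C to VHDL and thereby addresses the automatic-compilation-tool bottleneck in reconfigurable computing. The ASCRA compiler focuses on hardware/software partitioning and on hardware-oriented optimizations such as systolic arrays and loop pipelining. A verification platform for the ASCRA compiler was designed and implemented on the ML505 development platform, and experiments report synthesis information for the VHDL code generated from core program segments.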

6.
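By defining core timing parameters, such as setup time and hold time, for mapping an algorithm's critical loops onto a reconfigurable array, and by analyzing practical problems such as limited memory bandwidth and the irregular topology of the algorithm's data-flow graph, this paper gives an optimization algorithm for the configuration timing model and proposes a descriptive form for parameters such as path characteristics, offering a new approach for automatic reconfigurable compilation. Verification results show that, when mapping deblocking, a critical loop of the H.264 video algorithm, the optimized mapping method improves performance by 43% over the original baseline.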

7.
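From the end of the last century to the beginning of this one, reconfigurable computing became a hot research area in computer architecture, and developing reconfigurable microprocessor architectures is one of the important future directions of microprocessor architecture research. This paper first gives a brief introduction to reconfigurable technology. On that basis, it analyzes design methods for reconfigurable microprocessors from three aspects: reconfiguration granularity, scope, and timing. Finally, it proposes a design approach based on "standard basic unit - configurable path".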

8.
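Traditional application-domain-oriented design methods for multi-core SoC architectures suffer from a large architecture exploration space and high design complexity. To address this, an architecture-template-based design method for coarse-grained reconfigurable SoC system architectures is proposed. The method is centered on architecture design; the architecture template is reusable and its parameters are configurable, which narrows the architecture design exploration space, improves architecture design efficiency, and reduces the complexity of developing the application compiler. Finally, taking the cryptographic processing domain as an example, the template parameters are instantiated to build a multi-core reconfigurable instruction set processor SoC system for cryptographic processing (Multi-RISP SoC). Experimental results show that the Multi-RISP SoC system performs on par with several typical reconfigurable platforms while being faster and more efficient to build.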

9.
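A design method for a sampling-type reconfigurable test system is proposed, and techniques for diversifying system functions by reconfiguring the control core and adding functional modules are studied. The paper describes in detail the architecture of the sampling-type reconfigurable test system, including the technical principles and implementation of the control-core mainboard, the digital signal processing module, the FPI/O module, the signal acquisition module, the synchronization and trigger module, and the data storage module.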

10.
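As computing demands in specific application domains keep growing, reconfigurable architectures have drawn wide attention for their structural flexibility and computational efficiency and have become a major focus of computer architecture research. Whether the computational characteristics of application-domain programs, loop characteristics in particular, can be analyzed accurately is crucial to designing a reconfigurable system that efficiently supports those characteristics. This paper proposes a loop characteristic analysis method oriented toward reconfigurable computing, gives an implementation of the method, and uses it to analyze the loop characteristics of a DES algorithm program. The method provides concise and intuitive support for reconfigurable system design, compiler optimization, task partitioning, and related problems.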

11.
In this paper, we present techniques for providing on-demand structural redundancy for Coarse-Grained Reconfigurable Arrays (CGRAs) and a calculus for determining the gains of reliability when applying these replication techniques from the perspective of safety-critical parallel loop program applications. Here, for protecting massively parallel loop computations against errors like soft errors, well-known replication schemes such as Dual Modular Redundancy (DMR) and Triple Modular Redundancy (TMR) must be applied to each single Processor Element (PE) rather than one based on application requirements for reliability and Soft Error Rates (SERs). Moreover, different voting options and signal replication schemes are investigated. It will be shown that hardware voting may be accomplished at negligible hardware cost, i.e., less than two percent area overhead per PE, for a class of reconfigurable processor arrays called Tightly Coupled Processor Arrays (TCPAs). As a major contribution of this paper, a formal analysis of the reliability achievable by each combination of replication and voting scheme for parallel loop executions on CGRAs, depending on a given SER and application timing characteristics (schedule), is elaborated. Using this analysis, error detection latencies may be computed, and proper decisions on which replication scheme to choose at runtime to guarantee a maximal probability of failure on demand can be derived. Finally, fault-simulation results are provided and compared with the formal analysis of reliability.
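As a rough illustration of the replication schemes named in this abstract, the sketch below shows software analogues of DMR comparison and TMR majority voting, plus the textbook reliability of a TMR group with a perfect voter, R_TMR = 3R^2 - 2R^3. This is not the paper's calculus, which also depends on the SER and the schedule; the function names `tmr_vote`, `dmr_check`, and `tmr_reliability` are hypothetical.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Majority vote over three replica outputs.

    Returns the value produced by at least two replicas,
    or None when all three disagree (uncorrectable).
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    return value if count >= 2 else None


def dmr_check(a, b):
    """DMR can only detect a mismatch between two replicas, not correct it."""
    return a == b


def tmr_reliability(r):
    """Probability that at least two of three independent replicas are correct.

    Assumes each replica is correct with probability r and that the voter
    itself never fails (a simplifying assumption, not taken from the paper).
    """
    return 3 * r**2 - 2 * r**3
```

With these assumptions, `tmr_reliability(r)` exceeds `r` only when `r > 0.5`, which is why replication helps only for replicas that are already fairly reliable.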

12.
Partitioning Methodology for Heterogeneous Reconfigurable Functional Units   Total citations: 1, self-citations: 0, citations by others: 1
A partitioning methodology between the reconfigurable hardware blocks of different granularity, which are embedded in a generic heterogeneous architecture, is presented. The fine-grain reconfigurable logic is realized by an FPGA unit, while the coarse-grain reconfigurable hardware is realized by a 2-Dimensional Array of Processing Elements. Critical parts, called kernels, are mapped on the coarse-grain reconfigurable logic for improving performance. The partitioning method is mainly composed of three steps: the analysis of the input code, the mapping onto the Coarse-Grain Reconfigurable Array, and the mapping onto the FPGA. The partitioning flow is implemented by a prototype software framework. Analytical partitioning experiments, using five real-world applications, show that the execution time speedup relative to an all-FPGA solution ranges from 1.4 to 5.0.

13.
14.
Coarse-grained reconfigurable array architectures have become increasingly popular due to their flexibility, scalability and performance. However, the mapping of programs on these architectures is characterized by huge complexity. This work presents a new methodology for effectively mapping applications on coarse-grained reconfigurable arrays. The core of this methodology comprises the scheduling and register allocation phases performed, for the first time in the case of CGRAs, in a single step. Additionally, modulo scheduling with backtracking capability is incorporated in this scheme. The main contributions of this work include a novel technique for minimizing the memory bandwidth bottleneck, a new priority scheme, and a new set of heuristics which target the maximization of Instruction Level Parallelism by efficiently managing the architecture’s resources. The overall approach is retargetable with respect to a parametric architecture template modelling a large number of architecture alternatives, and it has been automated with a prototype tool which permits experimental exploration. The experimental results showed that the achieved performance figures are very close to the most effective ones derived from the theoretical study of the architecture’s resources and the applications' requirements. Moreover, the application of the bandwidth optimization technique led to a 20–130% increase in operation parallelism. Finally, the experiments quantified the benefit from applying the new priority scheme and heuristics.

15.
Many-core accelerators are being deployed more frequently to improve system processing capabilities. In such systems, application mapping must be enhanced to maximize utilization of the underlying architecture. Especially in graphics processing units (GPUs), mapping kernels that are part of multi-kernel applications has a great impact on overall performance, since kernels may exhibit different characteristics on different CPUs and GPUs. While some kernels run faster on GPUs, others may perform better on CPUs. Thus, heterogeneous execution may yield better performance than executing the application only on a CPU or only on a GPU. In this paper, we investigate two approaches: a novel profiling-based adaptive kernel mapping algorithm to assign each kernel of an application to the proper device, and a Mixed-Integer Programming (MIP) implementation to determine optimal mapping. We utilize profiling information for kernels on different devices and generate a map that identifies which kernel should run where in order to improve the overall performance of an application. Initial experiments show that our approach can efficiently map kernels on CPUs and GPUs, and outperforms CPU-only and GPU-only approaches.
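The idea of a profiling-based kernel map can be sketched very roughly as follows. This is a deliberately simplified greedy version that assumes kernels are independent and ignores data-transfer costs between devices; it is not the adaptive algorithm or the MIP formulation from the paper, and the names `map_kernels` and `total_time` as well as the profile layout are hypothetical.

```python
def map_kernels(profile):
    """Assign each kernel to the device with the lowest profiled run time.

    profile: {kernel_name: {device_name: measured_time}}
    Returns: {kernel_name: device_name}
    Assumes independent kernels and no inter-device transfer cost.
    """
    return {kernel: min(times, key=times.get) for kernel, times in profile.items()}


def total_time(profile, mapping):
    """Sum of profiled times when kernels run sequentially under a mapping."""
    return sum(profile[kernel][device] for kernel, device in mapping.items())


# Made-up profiling numbers, for illustration only.
example_profile = {
    "k1": {"cpu": 4.0, "gpu": 1.0},
    "k2": {"cpu": 2.0, "gpu": 5.0},
}
example_mapping = map_kernels(example_profile)
```

Under these made-up numbers the mixed mapping runs `k1` on the GPU and `k2` on the CPU, which beats both single-device options. A realistic mapper would also have to charge for moving data whenever consecutive kernels land on different devices.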

16.
Super-resolution of images based on local correlations   Total citations: 6, self-citations: 0, citations by others: 6
An adaptive two-step paradigm for the super-resolution of optical images is developed in this paper. The procedure locally projects image samples onto a family of kernels that are learned from image data. First, an unsupervised feature extraction is performed on local neighborhood information from a training image. These features are then used to cluster the neighborhoods into disjoint sets for which an optimal mapping relating homologous neighborhoods across scales can be learned in a supervised manner. A super-resolved image is obtained through the convolution of a low-resolution test image with the established family of kernels. Results demonstrate the effectiveness of the approach.

17.
New standards in signal, multimedia, and network processing for embedded electronics are characterized by computationally intensive algorithms and by a need for high flexibility due to swiftly changing specifications. In order to meet the demanding challenges of increasing computational requirements and stringent constraints on area and power consumption in embedded engineering, there is a gradual trend towards coarse-grained parallel embedded processors. Furthermore, such processors are equipped with dynamic reconfiguration features for supporting time- and space-multiplexed execution of the algorithms. However, the formidable problem of efficiently mapping applications (mostly loop algorithms) onto such architectures has hindered their mass acceptance. In this paper we present (a) a highly parameterizable, tightly coupled, and reconfigurable parallel processor architecture together with the corresponding power breakdown and reconfiguration time analysis of a case study application, (b) a retargetable methodology for mapping of loop algorithms, (c) a co-design framework for modeling, simulation, and programming of such architectures, and (d) loosely coupled communication with the host processor.

18.
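A global automatic data partitioning algorithm for SIMD machines is proposed. The algorithm handles multiple non-perfectly nested loop nests in which array subscripts are linear expressions of the loop variables. First, the communication patterns in the computation are abstracted through data and iteration mappings; then formal conditions for identifying regular communication patterns are presented. Next, a data-iteration graph containing alignment information and the corresponding communication costs is built, and on top of it a heuristic algorithm is proposed to compute good data and iteration distributions so as to optimize the communication cost between processing elements. By analyzing the mappings of the multiple arrays involved in multiple loop nests and ...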

19.
In this paper we present a boundary integral equation method for the numerical conformal mapping of a bounded multiply connected region onto a radial slit region. The method is based on some uniquely solvable boundary integral equations with adjoint classical, adjoint generalized and modified Neumann kernels. These boundary integral equations are constructed from a boundary relationship satisfied by a function analytic on a multiply connected region. Some numerical examples are presented to illustrate the efficiency of the presented method.
