期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

杨超峰胡铭曾《计算机学报》1999,22(10):1119-1121

自８０年代末,处理器阵列研究的一个新方向是设计线性阵列．但是,目前文献中还没有提出一个实用的线性阵列设计方法．线性阵列可以分为有局部存储器线性阵列和无局部存储器线性阵列两类．在“有局部存储器线性阵列的综合”一文中,提出了一个运算时间最优的有局部存储器线性阵列的设计方法．作为续篇,文中将提出一个运算时间最优的无局部存储器线性阵列的设计方法相似文献

2.

有局部存储器线性阵列的综合 总被引：2，自引：2，他引：0

杨超峰胡铭曾《计算机学报》1999,22(8):828-839

自８０年代末,处理器阵列研究的一个新方向是设计线性阵列。在这方面,Ｌｅｅ和Ｋｅｄｅｍ作出了开创性工作,他们提出了一个线性阵列模型及一组正确设计应满足的条件,但是,至今文献中还没有提出一个实用的线性阵列的设计方法,按照Ｌｅｅ－Ｋｅｄｅｍ线性阵列模型,根据处理单元是否有局部存储器,线性阵列可以分为两类,对此,作者通过引入一个刻画算法的新参数－路径函数,提出上一个实用的,运算时间最优的有局部存储器线性阵相似文献

3.

无局部存在器线性阵列的综合

杨超峰胡铭曾《计算机学报》1999,22(10):1119-1120,F003

自８０年代末，处理阵列研究的一个新方向是设计线性阵列，但是，目前文献还没有提出一个实用的线性阵列设计方法，线性阵列可以分为有局部存储器线性阵列和无局部存储器线性阵列两类。相似文献

4.

面向数据驱动处理器阵列的自动综合 总被引：1，自引：0，他引：1

邬贵明窦勇王淼《计算机工程与科学》2009,31(Z1)

本文提出了一种数据驱动处理器阵列结构,该结构能有效平衡存储和计算,适合用于在FPGA上实现高性能的算法加速,同时提出了一个面向该结构的自动综合框架,通过该框架可以将常规循环有效地映射到数据驱动处理器阵列上。实验结果表明了该自动综合框架的有效性,且生成的设计性能优于通用处理器。相似文献

5.

灵活列选路模式下构造紧致逻辑阵列的高效算法

贺佩兰姜桂圆《小型微型计算机系统》2015,(2):360-364

当前成百上千的处理器可以集成到同一个芯片上,而高密度处理器阵列在高速并行处理的时候经常发生故障.一种有效的解决方法是构造一个不包含故障单元的逻辑阵列,使得原始任务能够继续执行.我们研究在灵活列选路模式下构造逻辑阵列的高效算法,使得所构造的逻辑阵列不仅规模最大而且互连网络长度尽可能短.我们提出的算法TCA首先使用现存算法构造一个最大逻辑阵列,之后优化各个逻辑列来减少阵列的互连网络长度,我们把优化每个逻辑列的问题转化为带权图上的最短路径问题求解.实验结果表明我们的方法显著减少了逻辑阵列互连网络长度. 相似文献

6.

Systolic阵列及其全局视图的代数描述

下载免费PDF全文

郑飞谢康林《计算机工程与科学》1992,14(3):25-32

在简要介绍Systolic阵列及其设计方法的基础上,讨论了一个基于生成函数的Systolic阵列全局数据交互作用活动的形式代数描述。它能够刻画Systolic阵列中数据流在任何时刻的速率和运动路径,并有功于Systolic阵列的说明、设计和验证。相似文献

7.

只用一个可组线性并行阵列实现一个典型信号处理过程

车文刚李衍达《计算机学报》1992,(6):475-478

本文在CFM(Configurable Function Method)模型下,给出只用一个VLSI并行阵列实现一个典型的信号处理过程的方法。这种CFM阵列的独特结果在于只用以矩阵算法为基础的一个线性VLSI阵列就可以完成一个完整的信号处理过程。本文试图用傅里叶变换,数字频域滤波,反傅里叶交换这一过程来说明CFM方法在实现不同算法时的可组性。相似文献

8.

容错处理器阵列的并行重构及VHDL实现

周美婷武继刚姜桂圆《小型微型计算机系统》2015,(2):375-380

网格连接的处理器阵列是一种应用广泛的高性能体系结构,而容错处理器阵列的重构技术是近年来的研究热点之一.现有的研究多数集中在串行重构算法上,忽视了该结构重构时内在的可并行性.本文根据阵列结构的特点设计了一种基于VHDL语言的重构算法,该算法从第一行的各个无故障处理器单元同时向下选路,具有潜在的并行性,.实验结果表明,与现有的串行算法相比,本文提出的并行算法同样能够生成最大规模的目标阵列并且当物理阵列大小为48×48,本文提出的并行算法加速重构将近20倍. 相似文献

9.

大矩阵QR分解的FPGA设计与实现

下载免费PDF全文

周杰陈啸洋赵建勋窦勇《计算机工程与科学》2010,32(10):34-37

大规模QR分解在信号处理、图像处理、计算结构力学等领域有着广泛的应用。大规模矩阵QR分解主要在高性能并行机上进行运算,目前还没有基于FPGA平台的加速实现。本文在分析快速Givens Rotation QR分解算法特征的基础上,提出并实现了一种细粒度并行QR分解算法,并在Altera StratixⅡ FPGA平台上实现可扩展QR分解线性阵列处理器。相对于单处理单元,该阵列处理器可取得近似线性加速比,显示了良好的可扩展性。在100 MHz频率下的性能测试结果表明,相对于2.0GHz的Pentium双核通用微处理器,该阵列处理器可取得19倍的加速比。相似文献

10.

容错处理器阵列的多逻辑列并行重构算法

章子凯武继刚姜文超刘竹松《计算机工程与科学》2018,40(1):24-33

处理器阵列的容错重构技术是片上网络多核、众核高性能体系结构的可靠性技术之一。现有的最大逻辑阵列并行重构技术仅对单条逻辑列的构造实现了并行化,而对多条逻辑列的同步并行仍未见可行算法。依据处理器阵列的潜在并行性,在分治策略的基础上,提出了一种阵列分块的并行重构算法。算法对处理器阵列实施横向分块划分,对每个阵列块进行并行重构,并对所得逻辑子阵列进行归并,实现了多条逻辑列的同步并行重构。与现有的并行算法相比,新算法同样能够生成最大逻辑列,并且减少了通信开销与计算中的数据冗余,有效提高了运行速度。实验结果表明,在物理阵列大小为64×64的处理器阵列上,运行速度比现有并行算法提高39.55%,并且具有良好的可扩展性。相似文献

11.

基于Pipeline的一类动态规划并行算法

何奇《计算机学报》1994,17(7):527-535

动态规划是解决组合优化问题的有效方法之一，本文基于Ｐｉｐｅｌｉｎｅ结构，提出并分析了三个相似的动态规划并行算法（求简单最短路径，求最长公共子串和解背包问题），获得了较理想的加速比、并行效率等指标，进而提出并讨论了这一类问题之动态规划并行处理的一般化思想及方法。相似文献

12.

Optimal synthesis of algorithm-specific lower-dimensional processorarrays

Ganapathy K.N. Wah B.W. 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(3):274-287

相似文献

13.

On high-speed computing with a programmable linear array

Peizong Lee Zvi M. Kedem 《The Journal of supercomputing》1990,4(3):223-249

It has been observed by many researchers that systolic arrays are very suitable for certain high-speed computations. Using a formal methodology, we present a design for a single simple programmable linear systolic array capable of solving large numbers of problems drawn from a variety of applications. The methodology is applicable to problems solvable by sequential algorithms that can be specified as nested for-loops of arbitrary depth. The algorithms of this form that can be computed on the array presented in this paper include 25 algorithms dealing with signal and image processing, algebraic computations, matrix arithmetic, pattern matching, database operations, sorting, and transitive closure. Assuming bounded I/O, for 18 of those algorithms the time and storage complexities are optimal, and therefore no improvement can be expected by using dedicated special-purpose linear systolic arrays designed for individual algorithms. We also describe another design which, using a sufficient large local memory and allowing data to be preloaded and unloaded, has an optimal processor/time product.An earlier version of this paper was presented at Supercomputing '88.This work was partially supported by ONR under the contract N00014-85-K-0046 and by NSF under Grant Number CCR-8906949. 相似文献

14.

APPLICATION-SPECIFIC ARRAY PROCESSORS FOR THE LONGEST COMMON SUBSEQUENCE PROBLEM OF THREE SEQUENCES ∗ †

《International Journal of Parallel, Emergent and Distributed Systems》2012,27(1):27-52

Abstract

We design linear time systolic-based parallel algorithms that run on two-dimensional arrays for both computing the length and recovering a longest common subsequence of three given sequences that are appropriate for very large-scale integration (VLSI) implementation. These problems have been qualified to be difficult to be solved in linear time in [14], and our approach, which generalizes the methods used for determining a longest common subsequence of two strings [28,38] to the case of three strings, enables to solve both problems in linear time. Given the three sequences A, B and C of length n, m and l (n ≤ m ≤ l;), we provide an algorithm that computes the length p of their longest common subsequence on a modular I/O bounded one-way two-dimensional array of nm processors in n + 2m + l? 1 time-steps. To compute a longest common subsequence of the three given strings, we show that each processor of the above array requires an O(min{n,p\) local storage to solve the problem in In + 1m + 1 + p — 2 time-steps. 相似文献

15.

Design of Processor Arrays for Reconfigurable Architectures

Fimmel Dirk Merker Renate 《The Journal of supercomputing》2001,19(1):41-56

This paper deals with the design of processor arrays for regular algorithms. The design is constrained by limited implementation cost characterizing a reconfigurable architecture. The objective of the design is to minimize the latency of the processor array. The presented approach to determine a scheduling function leading to the minimal latency of the processor array is formulated as a linear program that incorporates 1) the selection of modules to be implemented in processors to execute operations of the algorithm, 2) the binding of operations to modules, 3) the computation of the number of registers, 4) the limitation of implementation cost for modules and registers, 5) the determination of the size of partitions that allows to match the limited implementation cost. 相似文献

16.

On loop transformations for generalized cycle shrinking

Weijia Shang O'Keefe M.T. Fortes J.A.B. 《Parallel and Distributed Systems, IEEE Transactions on》1994,5(2):193-204

This paper describes several loop transformation techniques for extracting parallelism from nested loop structures. Nested loops can then be scheduled to run in parallel so that execution time is minimized. One technique is called selective cycle shrinking, and the other is called true dependence cycle shrinking. It is shown how selective shrinking is related to linear scheduling of nested loops and how true dependence shrinking is related to conflict-free mappings of higher dimensional algorithms into lower dimensional processor arrays. Methods are proposed in this paper to find the selective and true dependence shrinkings with minimum total execution time by applying the techniques of finding optimal linear schedules and optimal and conflict-free mappings proposed by W. Shang and A.B. Fortes 相似文献

17.

An Integrated Approach to the Design of Processor Arrays with Systolic Organization of Computations

L. D. Elfimova Yu. V. Kapitonova 《Cybernetics and Systems Analysis》2002,38(6):797-807

An integrated approach is proposed to the design of economically efficient and high-performance processor arrays with systolic organization of computations. The approach includes the construction of VLSI-oriented versions of locally recursive algorithms and synthesis of new architectures of processor arrays for transforming algorithms that maximally take into account fundamental restrictions of VLSI technology. Within the framework of this approach, strategies are developed for obtaining the above-mentioned algorithms and architectures. 相似文献

18.

Efficient control generation for mapping nested loop programs onto processor arrays

《Journal of Systems Architecture》2007,53(5-6):300-309

Processor array architectures are optimal platforms for computationally intensive applications. Such architectures are characterized by hierarchies of parallelism and memory structures, i.e. processor arrays apart from different levels of cache have a large number of processing elements (PE) where each PE can further contain sub-word parallelism. In order to handle large scale problems, balance local memory requirements with I/O-bandwidth, and use different hierarchies of parallelism and memory, one needs a sophisticated transformation called hierarchical partitioning. Innately the applications are data flow dominant and have almost no control flow, but the application of hierarchical partitioning techniques has the disadvantage of a more complex control flow. In a previous paper, the authors presented first time a methodology for the automated control path synthesis for the mapping of partitioned algorithms onto processor arrays. However, the control path contained complex multiplication and division operators. In this paper, we propose a significant extension to the methodology which reduces the hardware cost of the global controller and memory address generators by avoiding these costly operations. 相似文献