期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

The instruction systolic array and its relation to other models of parallel computers

《Parallel Computing》1988,7(1):25-39

In this paper we investigate the relationships between three different models of parallel computers based on mesh-connected arrays: the processor array (PA), which is an MIMD-array of independent processors, the instruction broadcasting array (IBA), where the instructions are broadcast to all the processors of a column and executed according to selector information which is broadcast to all the processors of a row, and the instruction systolic array (ISA), where the instructions are pumped through the array row by row and combined with selector information which is pumped through the array column by column. For every two of these models we determine tight bounds on the worst-case delay introduced by a transformation of a program on one model into an equivalent program on the other. The results show that the ISA concept combines the advantages of standard systolic arrays with those of the MIMD concept. Since in addition the ISA architecture has smaller area requirements than a corresponding systolic array or MIMD machine it is strong practical relevance. 相似文献

2.

Parallel sorting with cooperating heaps in a linear array of processors

Yen-Chun Lin Ferng-Ching Lin 《Parallel Computing》1990,16(2-3):273-278

A parallel sorting algorithm using cooperating heaps in a linear array of processors is presented. It can sort a sequence whose length is much larger than the number of processors. Because the output begins one step after all the items have been input, sorting n items requires 2n + 1 steps. Two independent modifications of the algorithm are possible; one tries to reduce the number of processors used, and the other can sort more items on the same array. 相似文献

3.

Optimal task assignment in homogeneous networks

Cheol-Hoon Lee Shin K.G. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(2):119-129

This paper considers the problem of assigning the tasks of a distributed application to the processors of a distributed system such that the sum of execution and communication costs is minimized. Previous work has shown this problem to be tractable for a system of two processors or a linear array of N processors, and for distributed programs of serial parallel structures. Here we focus on the assignment problem on a homogeneous network, which is composed of N functionally-identical processors, each with its own memory. Some processors in the network may have unique resources, such as data files or certain peripheral devices. Certain tasks may have to use these unique resources; they are called attached tasks. The tasks of a distributed program should therefore be assigned so as to make use of specific resources located at certain processors in the network while minimizing the amount of interprocessor communication. The assignment problem in such a homogeneous network is known to be NP-hard even for N=3, thus making it intractable for a network with a medium to large number of processors. We therefore focus on task assignment in general array networks, such as linear arrays, meshes, hypercubes, and trees. We first develop a modeling technique that transforms the assignment problem in an array or tree into a minimum-cut maximum-flow problem. The assignment problem is then solved for a general array or tree network in polynomial time 相似文献

4.

Large window median filtering on CLIP7

K.N. Mathews 《Pattern recognition letters》1983,1(5-6):341-346

An algorithm for finding the median, designed for use on byte wide architecture parallel array processors, is presented together with its implementation on CLIP7, a scanning array of such processors. 相似文献

5.

Minimum-diameter cyclic arrangements in mapping data-flow graphs onto VLSI arrays

P. Erdös I. Koren S. Moran G. M. Silberman S. Zaks 《Theory of Computing Systems》1988,21(1):85-98

Regular arrays of processing elements in VLSI have proved to be suitable for high-speed execution of many matrix operations. To execute an arbitrary computational algorithm on such processing arrays, it has been suggested mapping the given algorithm directly onto a regular array. The computational algorithm is represented by a data-flow graph whose nodes are to be mapped onto processors in the VLSI array. This study examines the complexity of mapping data-flow graphs onto square and hexagonal arrays of processors. We specifically consider the problem of routing data from processors in a given (source) sequence to another (target) sequence. We show that under certain conditions, the above problem is equivalent to the one of finding a minimum-diameter cyclic arrangement. The complexity of the latter problem is analyzed and upper and lower bounds on the number of intermediate rows of processors (between the source and target rows) are derived. 相似文献

6.

Compiling Array Expressions for Efficient Execution on Distributed-Memory Machines

S.K.S. Gupta S.D. Kaushik C.-H. Huang P. Sadayappan 《Journal of Parallel and Distributed Computing》1996,32(2):155

Array statements are often used to express data-parallelism in scientific languages such as Fortran 90 and High Performance Fortran. In compiling array statements for a distributed-memory machine, efficient generation of communication sets and local index sets is important. We show that for arrays distributed block-cyclically on multiple processors, the local memory access sequence and communication sets can be efficiently enumerated as closed forms using regular sections. First, closed form solutions are presented for arrays that are distributed using block or cyclic distributions. These closed forms are then used with avirtual processor approachto give an efficient solution for arrays with block-cyclic distributions. This approach is based on viewing a block-cyclic distribution as a block (or cyclic) distribution on a set of virtual processors, which are cyclically (or block-wise) mapped to physical processors. These views are referred to asvirtual-blockorvirtual-cyclicviews, depending on whether a block or cyclic distribution of the array on the virtual processors is used. The virtual processor approach permits different schemes based on the combination of the virtual processor views chosen for the different arrays involved in an array statement. These virtualization schemes have different indexing overhead. We present a strategy for identifying the virtualization scheme which will have the best performance. Performance results on a Cray T3D system are presented for hand-compiled code for array assignments. These results show that using the virtual processor approach, efficient code can be generated for execution of array statements involving block-cyclically distributed arrays. 相似文献

7.

Fast linear systolic matrix vector multiplication

《国际计算机数学杂志》2012,89(3-4):231-248

The systolic concept in the parallel architecture design proposed by the H. T. Kung [1,2] obtains high throughput and speedups. The linear array for the matrix vector multiplication executes the algorithm in 2n ? 1 time steps using 2n ? 1 processors. Although the speedup obtained is very high, the efficiency is very poor (typical values of 25% efficiency for problem size greater than 10). H. T. Kung proposed an idea for a linear systolic array using two data streams flowing in opposite directions. However, the processors in the array perform operations in every second time moment.

Attempts to improve this design have been made by many researchers. Nonlinear and folding transformations techniques [3,4,5] only decrease the number of processors used to half the size, but do not affect the time.

We propose the use of a fast linear systolic computation procedure to obtain a solution that uses 3n/2 processors and executes the algorithm in 3n/2 time steps for the same cells, the same communication and the same regular data flow as the H. T. Kung linear array. Only the algorithm is restructured and more efficiently organized. Now the processors are utilized in every time step and no idle steps are required. 相似文献

8.

Pipelined Diagnosis of Wafer-Scale Linear Arrays

《Journal of Parallel and Distributed Computing》1994,20(2):212-223

We present a comparison-based algorithm for identifying faulty and fault-free elements in a wafer-scale linear array of processors (or other logic elements). Only nearest neighbor communication is assumed to be possible between the processors in the array. Because the algorithm is simple and requires no storage of test vectors or test outcomes, it is ideally suited for implementation on the wafer to provide the capability for built-in production (or post production) testing. We show that, surprisingly, this method achieves high accuracy of diagnosis over a wide range of yields even though the diagnosis may be based on a high proportion of results produced by faulty processors. We also propose an improvement to the above algorithm which uses a processor diagnosed as fault-free by the basic algorithm as the starting point in improving the accuracy with which faulty processors are identified. Quantitative and qualitative reasoning validate the efficiency of these schemes. 相似文献

9.

Some results concerning linear iterative (systolic) arrays

Oscar H. Ibarra Michael A. Palis Sam M. Kim 《Journal of Parallel and Distributed Computing》1985,2(2):182-218

Characterizations of various types of linear iterative (systolic) arrays in terms of single processor sequential machines are given. Using these characterizations, new or improved results concerning the properties, power, and limitations of the different linear array models are proved. For example, a speed-up theorem is proved that is stronger than what has previously appeared in the literature and, moreover, works for arrays with one-way communication lines. Also investigated are the effects of augmenting the array with a supplemental control mechanism called global control. It is shown that in many cases arrays with global control can be simulated in real-time by arrays without global control. The result remains true even if one of the processors is augmented by a stack. Cases are also exhibited where the addition of global control makes the array strictly more powerful, even if the control lines are restricted to only a few processors of the array. 相似文献

10.

The efficient computation of ownership sets in HPF

Joisha P.G. Banerjee P. 《Parallel and Distributed Systems, IEEE Transactions on》2001,12(8):769-788

Ownership sets are fundamental to the partitioning of program computations across processors by the owner-computes rule. These sets arise due to the mapping of arrays onto processors. In this paper, we focus on how ownership sets can be efficiently determined in the context of the HPF language and show how the structure of these sets can be symbolically characterized in the presence of arbitrary array alignment and array distribution directives. Our starting point is a system of equalities and inequalities due to Ancourt et al. (1995) that captures the array mapping problem in HPF. We arrive at a refined system that enables us to efficiently solve for the ownership set using the Fourier-Motzkin Elimination technique and that requires the course vector as the only auxiliary vector. The formulation makes it possible to enumerate the elements of the ownership set exactly once, a feature that is very beneficial when such sets are applied to handle DO loops qualified by HPF's INDEPENDENT directive. We develop important and general properties pertaining to HPF alignments and distributions and show how they can be used to eliminate redundant communication due to array replication. Polynomial-time schemes that determine whether the ownership set of a particular processor, with respect to some array, is the empty set or whether the ownership set of every processor, with respect to some array, is the empty set, are presented. We show how distribution directives with unspecified processor meshes can be efficiently handled at compile time. We also show how to avoid the generation of communication code when pairs of array references are ultimately mapped onto the same processors. Experimental data demonstrating the improved code performance that the latter optimization enables is presented and discussed 相似文献

11.

Systolic阵列中的多维DFT方体向量法研究

骆洪青万麟瑞胡宏《计算机工程与设计》2002,23(1):55-57,63

以脉动式阵列（Systolic）应用研究为背景，分析了离散傅立叶变换（DFT）经典并行算法及其在阵列机上的特点；并针对该算法在处理机上并行实现的弱点，提出了在并行处理环境下适合大规模DFT的方体向量法。这种方法不需要在处理机之间进行数据转置，减少了处理之间的通信以及运算数据之间的依赖性，使变换能够在较大程度上异步进行，并摆脱了在操作数规模上的制约。文章还给出了在Systolic阵列上由方体向量法实现的三维DFT的具体例子。相似文献

12.

The Role of Associative Array Processers in Data Base Machine Architecture

Berra P.B. Oliver E. 《Computer》1979,12(3):53-61

With content addressing and parallel processing capabilities, associative array processors are potentially useful for data base management. 相似文献

13.

Defining the scientific level of an assigned publication set

I. A. Libkind 《Automatic Documentation and Mathematical Linguistics》2014,48(4):179-186

The approach, methods and algorithms for defining the scientific level of an assigned paper array are expounded. A scientific level is defined, both with respect to specific subject areas (Web of Science Categories), and as some integral indicator with respect to the entire spectrum of topics that correspond to the papers from an assigned array. Web of Science (WoS) indexed papers as well as the most important data elements that describe specific papers (journals, in which they are published; the workplaces of authors, research funds that supported these papers, etc.) are used as initial data. The following now classical bibliometric indicators are used as initial indicators: the number of publications, the impact factor of a journal, and the aggregate impact factor of a WoS subject category. Indicators derived from classical ones are introduced. The calculation of a scientific level is based on the joint application of derived and classical bibliometric indicators. The examples of the results from calculating a scientific level are presented. 相似文献

14.

Parallel computation of symbolic robot models and control laws: Theory and application to transputer networks

N. Kir&#x;anski T. Petrovi&#x; M. Vukobratovi&#x; 《野外机器人技术杂志》1993,10(3):345-368

New computer architectures based on large numbers of processors are now used in various application areas ranging from embedded systems to supercomputers. Efficient parallel processing algorithms are applied in a wide variety of applications such as simulation, robot control, and image synthesis. This article presents two novel parallel algorithms for computing robot inverse dynamics (as well as control laws) starting from customized symbolic robot models. To gain the most benefit from the concurrent processor architecture, the whole job is divided into a large number of simple tasks, each involving only a single floating-point operation. Although requiring sophisticated scheduling schemes, fine granularity of tasks was the key factor for achieving nearly maximum efficiency and speedup. The first algorithm resolves the scheduling problem for an array of pipelined processors. The second one is devoted to parallel processors connected by a complete crossbar interconnection network. The main feature of the proposed algorithms is that they take into account the communication delays between processors and minimize both the execution time and communication cost. To prove the theoretical results, the algorithms have been verified by experiments on an INMOS T800 transputer-based system. We used four transputers in serial and parallel configurations. The experimental results show that the most complicated dynamic control laws can be executed in a submilisecond time interval. © 1993 John Wiley & Sons, Inc. 相似文献

15.

Multiprocessor Ray Tracing 总被引：2，自引：0，他引：2

John G. Cleary Brian M. Wyvill Graham M. Birtwistle Reddy Vatti† 《Computer Graphics Forum》1986,5(1):3-12

A multiprocessor algorithm for ray tracing is described. The performance of the algorithm is analysed for a cubic and square array of processors with only local communication between near neighbours. Theoretical expressions for the speedup of the system as a function of the number of processors are derived. These analytic results are supported by simulations of ray tracing on a number of simple scenes with polygonal surfaces. It is found that a square network of processors generally performs better than a cubic network. Some comments are made on the construction of such a system using current (1985) microprocessor technology. 相似文献

16.

VLSI design for massively parallel signal processors

SY Kung Jurgen Annevelink 《Microprocessors and Microsystems》1983,7(10):461-468

相似文献

17.

Efficient reconfiguration algorithms for communication-aware three-dimensional processor arrays

Guiyuan Jiang Jigang Wu Jizhou Sun 《Parallel Computing》2013

Homogeneous processor arrays are emerging in tera-scale computation and effective fault tolerance techniques are essential to improving the reliability of such complex integrated circuits. We study the degradable processor arrays to achieve fault tolerance by employing reconfiguration. Three bypass schemes and three rerouting schemes are proposed to reconfigure three-dimensional processor arrays with defective processors to achieve target arrays without faults. A heuristic algorithm is proposed to construct a target array on the selected rows and columns. It is also proved that the proposed greedy plane rerouting algorithm (GPR) produces maximum target array. In addition, the problem of constructing the communication efficient array is considered in this paper. An algorithm is proposed to refine the communication among processors within the target array constructed by GPR. Experimental study shows that the proposed algorithm GPR produces target arrays with higher harvest and lower degradation on the host arrays with fault density no more than 5%. In addition, the communication performance is significantly optimized by reducing the number of long interconnects, and the average improvement is about 34% for all cases considered in this paper. 相似文献

18.

Precompilation of Fortran Programs to Facilitate Array Processing

Brode B. 《Computer》1981,14(9):46-51

The automatic recognition of array operations in Fortran programs can now be applied to attached array processors using vectorizing precompiler software. 相似文献

19.

一种阵列众核处理器的多级指令缓存结构

陈逸飞李宏亮刘骁高红光《计算机工程与科学》2018,40(4):571-579

阵列众核处理器由于其较高的计算性能和能效比已经被广泛应用于高性能计算领域。而要构建未来高性能计算系统处理器必须解决严峻的"访存墙"挑战以及核心协同问题。通常的阵列处理器中,核心多采用单线程结构,以减少开销,但是对访存提出了较高的要求。在阵列众核处理器中,在单核心中引入硬件同时多线程技术,针对实验中一级指令缓存命中率随着线程数增加而显著降低的问题,提出了一种面向阵列众核处理器的冗余指令缓存存储结构,基于该结构,提出采用FIFO及类LRU替换策略。通过上述优化的高速缓存结构设计,经实验模拟,双线程整体指令Cache失效率降低了25.2%,整体CPI性能提升了30.2%。相似文献

20.

Making Effective Decisions in Computer Architects Real-World: Lessons and Experiences with Godson-2 Processor Designs

下载免费PDF全文

Wei-Wu Hu Jian Wang 《计算机科学技术学报》2008,23(4):620-632

Although the design of many kinds of microprocessors has been under developing for several decades,the computer architecture R&D community lacks well documented lessons and experiences about design decisions in the research literature.In this paper,we systematically present the design decisions we made during the designing and prototyping of Godson-2 series processors.The 250MHz Godson-2B,450MHz Godson-2C,and 1GHz Godson-2E processors that implement 64-bit,four-issue,out-of-order architecture were taped out in 2003,2004,and 2005,respectively.Each processor triples its predecessor in the SPEC CPU2000 rates.Our first-hand experiences and lessons gained from these designs would provide unique perspectives and insights that are not available in any existing text books and/or published papers.We summarize 10 critical lessons and experiences based on hundreds of our attempts at architectural and design optimizations for performance improvement of Godson-2 series processors.The issues include silicon-simulation correlation,design balancing,performance optimizing,and pico-architecture tuning.We conclude that persistent improvement,attitude towards work-on-silicon design, and insightful understanding of software and fabrication process are the three most important factors for designing a high performance processor with low energy consumption. 相似文献