期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

李涛杨婷易学渊蒲林钱博文黄光新黄虎才韩俊刚《计算机工程与科学》2014,36(2):191-200

提出了一种新型的多态高效并行阵列机结构--萤火虫2号阵列机。该结构的处理单元可以在SIMD和MIMD两种模式下运行,兼有异步执行机制,还可以实现分布式指令级并行处理。采用了硬件的多线程管理器和高效通信机制,这些机制使得此种阵列机能够实现效率很高的线程级并行运算、数据级并行运算和分布式指令级并行运算。尤其值得指出的是,此种阵列机的流处理性能堪与专用集成电路匹敌。该结构还能有效实现静态与动态数据流计算,可以高效实现图形、图像和数字信号处理任务。相似文献

2.

面向数据流结构的指令映射优化方法

李易常成娟卢圣健江道忠范东睿叶笑春《计算机工程与科学》2019,41(1):9-13

在高性能计算领域,数据流是一类重要的计算结构,也在很多实际场景表现出很好的性能和适用性。在数据流计算模式中,程序是以数据流图来表示的,数据流计算中一个关键的问题是如何将数据流图映射到多个执行单元上。通过分析现有数据流结构的指令映射方法及其不足,提出了基于数据流结构的新型指令映射优化方法。主要是根据多地址共享数据包的特性对指令映射方法进行优化,延迟多地址共享数据路由包的拆分,减少网络拥堵。相似文献

3.

Accelerating Data Transfer in Dataflow Architectures Through a Look-Ahead Acknowledgment Mechanism

下载免费PDF全文

Feng Yu-Jing Li De-Jian Tan Xu Ye Xiao-Chun Fan Dong-Rui Li Wen-Ming Wang Da Zhang Hao Tang Zhi-Min 《计算机科学技术学报》2022,37(4):942-959

The dataflow architecture, which is characterized by a lack of a redundant unified control logic, has been shown to have an advantage over the control-flow architecture as it improves the computational performance and power efficiency, especially of applications used in high-performance computing (HPC). Importantly, the high computational efficiency of systems using the dataflow architecture is achieved by allowing program kernels to be activated in a simultaneous manner. Therefore, a proper acknowledgment mechanism is required to distinguish the data that logically belongs to different contexts. Possible solutions include the tagged-token matching mechanism in which the data is sent before acknowledgments are received but retried after rejection, or a handshake mechanism in which the data is only sent after acknowledgments are received. However, these mechanisms are characterized by both inefficient data transfer and increased area cost. Good performance of the dataflow architecture depends on the efficiency of data transfer. In order to optimize the efficiency of data transfer in existing dataflow architectures with a minimal increase in area and power cost, we propose a Look-Ahead Acknowledgment (LAA) mechanism. LAA accelerates the execution ow by speculatively acknowledging ahead without penalties. Our simulation analysis based on a handshake mechanism shows that our LAA increases the average utilization of computational units by 23.9%, with a reduction in the average execution time by 17.4% and an increase in the average power efficiency of dataflow processors by 22.4%. Crucially, our novel approach results in a relatively small increase in the area and power consumption of the on-chip logic of less than 0.9%. In conclusion, the evaluation results suggest that Look-Ahead Acknowledgment is an effective improvement for data transfer in existing dataflow architectures.

相似文献

4.

一种缓存数据流信息的处理器前端设计

刘炳涛王达叶笑春张浩范东睿张志敏《计算机研究与发展》2016,53(6):1221-1237

为了能够同时发掘程序的线程级并行性和指令级并行性,动态多核技术通过将数个小核重构为一个较强的虚拟核来适应程序多样的需求.通常这种虚拟核性能弱于占有等量芯片资源的原生核,一个重要的原因就是取指、译码和重命名等流水线的前端各阶段具有串行处理的特征较难经重构后协同工作.为解决此问题,提出了新的前端结构——数据流缓存,并给出与之配合的向量重命名机制.数据流缓存利用程序的数据流局部性,存储并重用指令基本块内的数据依赖等信息.处理器核利用数据流缓存能更好地发掘程序的指令级并行性并降低分支预测错误的惩罚,而动态多核技术中的虚拟核通过使用数据流缓存旁路传统的流水线前端各阶段,其前端难协同工作的问题得以解决.对SPEC CPU2006中程序的实验证明了数据流缓存能够以有限代价覆盖大部分程序超过90%的动态指令,然后分析了添加数据流缓存对流水线性能的影响.实验证明,在前端宽度为4条指令、指令窗口容量为512的配置下,采用数据流缓存的虚拟核性能平均提升9.4%,某些程序性能提升高达28%. 相似文献

5.

A Quantitative Analysis of Dataflow Program Execution - Preliminaries to a Hybrid Design

《Journal of Parallel and Distributed Computing》1993,18(3):314-326

While the dataflow execution model can potentially uncover all forms and levels of parallelism in a program, in its traditional fine grain form it does not exploit any form of locality. Recent evidence indicates that the exploitation of locality in dataflow programs could have a dramatic impact on performance. The current trend in the design of dataflow processors suggests a synthesis of traditional nonstrict fine grain instruction execution and strict coarse grain execution in order to exploit locality. While an increase in instruction granularity favors the exploitation of locality within a single execution thread, the resulting grain size may increase latency among execution threads. We define fine grain intrathread locality as a dynamic measure of instruction level locality and quantify it using a set of numeric and nonnumeric benchmarks. The results point to a very large degree of intrathread locality and a remarkable uniformity and consistency of the distribution of thread locality across a wide variety of benchmarks. As the execution is moved to a coarser granularity it can result in an increase of the input latency of operands that would have a detrimental effect on performance. We evaluate the resulting latency incurred through the partitioning of fine grain instructions into coarser grain threads. We define the concept of a cluster of fine grain instructions to quantify coarse grain input and output latencies. The results of our experiments offer compelling evidence that a coarse grain execution outperforms a fine grain grain one on a significant number of numeric codes. These results suggest that the effects of increased instruction granularity on latency is minimal for a high percentage of the measured codes, and in large part is offset by available intrathread locality. Furthermore, simulation results indicate that strict or nonstrict data structure access does not change the basic cluster characteristics. 相似文献

6.

The explicit token store

David E. Culler Gregory M. Papadopoulos 《Journal of Parallel and Distributed Computing》1990,10(4)

This paper presents an unusually simple approach to dynamic dataflow execution, called the Explicit Token Store (ETS) architecture, and its current realization in Monsoon. The essence of dynamic dataflow execution is captured by a simple transition on state bits associated with storage local to a processor. Low-level storage management is performed by the compiler in assigning nodes to slots in an activation frame, rather than dynamically in hardware. The processor is simple, highly pipelined, and quite general. There is exactly one instruction executed for each action on the dataflow graph. Thus, the machine-oriented ETS model provides new insight into the real cost of direct execution of dataflow graphs. 相似文献

7.

Study of a simulated stream machine for dataflow computation

Sukumar Ghosh Somprakash Bandyopadhyay Chandan Mazumdar 《Performance Evaluation》1986,6(4):269-291

In this paper, a stream-based dataflow architecture is proposed, and its simulation model, which has helped to evaluate the effectiveness of the proposed architectural concept, is discussed. The machine integrates the conventional Von Neumann type of control flow subsystem with a dataflow processing element of token storage type. The control flow unit tackles the dynamic nature of the stream structure including input/output whereas the dataflow unit does the computation part in an applicative style. A pipelined version of the stream machine is also discussed. The effectiveness of the machine is studied by running a few example programs in the simulated machine. The machine is expected to be useful in real time signal processing applications. 相似文献

8.

事件触发并发数据流模型 总被引：9，自引：0，他引：9

王瑞荣汪乐宇《软件学报》2003,14(3):409-414

DHDF(动态纯数据流)是许多图形化编程平台的核心.由于它的自然属性(数据驱动)与操作系统事件驱动模型不能很好地结合,导致了两个明显的不足:运行效率低,CPU占用率高;对外部事件响应速度慢,系统实时性差.提出了一种ECDF(事件触发并发数据流)模型,并给出了该模型的文法描述以及调度算法.ECDF模型通过引入多优先级线程以及事件触发机制,在很大程度上提高了系统的实时性与运行效率.以测试系统为背景,对有关应用实例进行测试与分析,结果表明,与DHDF模型相比,ECDF模型使系统的性能在不同条件下都得到了相应的提高.该模型特别适用于处理突发性高速数据流,也适用于Reactive系统设计. 相似文献

9.

A Pipelining Loop Optimization Method for Dataflow Architecture

下载免费PDF全文

Xu Tan Xiao-Chun Ye Xiao-Wei Shen Yuan-Chao Xu Da Wang Lunkai Zhang Wen-Ming Li Dong-Rui Fan Zhi-Min Tang 《计算机科学技术学报》2018,33(1):116-130

与 exascale 来超级计算的时代,电源效率成为了最重要的障碍造一个 exascale 系统。Dataflow 建筑学在为科学应用完成高电源效率有本国的优点。然而,最先进的 dataflow 体系结构没能为循环处理利用高并行。处理这个问题,我们建议一个 pipelining 环优化方法(PLO ) ,它在处理元素(PE ) 在环流动做重复 dataflow 的数组加速器。这个方法由二种技术,帮助建筑学的硬件重复和帮助说明的软件重复组成。在硬件重复执行模型,一个在薄片上循环控制器被设计产生循环索引,减少计算内核并且打为 pipelining 执行的一个好基础的复杂性。在软件重复实行模型,另外的环指令被论述解决重复相关性问题。经由这二种技术,准备好了每周期执行的指令的平均数字被增加使浮点联合起来忙。当这二种技术的硬件费用是可接受的时,模拟结果证明分别地,我们的建议方法平均由 2.45x 和 1.1x 在浮点效率超过静电干扰和动态循环执行模型。相似文献

10.

A data-driven VLSI array for arbitrary algorithms

Koren I. Mendelson B. Peled I. Silberman G.M. 《Computer》1988,21(10):30-43

The design of specialized processing array architectures, capable of executing any given arbitrary algorithm, is proposed. An approach is adopted in which the algorithm is first represented in the form of a dataflow graph and then mapped onto the specialized processor array. The processors in this array execute the operations included in the corresponding nodes (or subsets of nodes) of the dataflow graph, while regular interconnections of these elements serve as edges of the graph. To speed up the execution, the proposed array allows the generation of computation fronts and their cancellation at a later time, depending on the arriving data operands; thus it is called a data-driven array. The structure of the basic cell and its programming are examined. Some design details are presented for two selected blocks, the instruction memory and the flag array. A scheme for mapping a dataflow graph (program) onto a hexagonally connected array is described and analyzed. Two distinct performance measures-mapping efficiency and array utilization-and some performance results are discussed 相似文献

11.

服务流程中的数据流处理

陈姣娟曹健《计算机科学》2013,40(1):14-18

服务流程需要处理服务之间大量的异构数据的交互,不同的数据流处理方式直接影响了服务流程的执行效率。阐述了服务流程模型中的数据流表示模型、数据映射机制与数据流验证机制,论述了服务流程运行中的数据流调度、数据存储以及传输等数据管理问题,分析了数据流处理在服务流程中的应用情况。最后,结合现有的数据流研究进展,提出了数据流研究的展望。相似文献

12.

A Non-Stop Double Buffering Mechanism for Dataflow Architecture

下载免费PDF全文

Xu Tan Xiao-Wei Shen Xiao-Chun Ye Da Wang Dong-Rui Fan Lunkai Zhang Wen-Ming Li Zhi-Min Zhang Zhi-Min Tang 《计算机科学技术学报》2018,33(1):145-157

两倍缓冲是有效机制隐藏在在薄片上和离开薄片记忆之间的数据转移的潜伏。在 dataflow 建筑学,因为 dataflow 加速器的重复充满并且排干,然而,交换二在许多瓦减少的执行期间缓冲性能。在这个工作,我们为 dataflow 建筑学建议连续双的缓冲机制。没有停止通过在 dataflow 建筑学优化控制逻辑处理元素的执行,建议不停的机制把瓦分到处理元素数组。而且,我们建议一个工作流节目与连续双的缓冲机制合作。在控制逻辑上并且在工作流节目上的优化以后,充满并且排干数组需要越过属于一样的 dataflow 图的所有瓦的执行被做仅仅一次。试验性的结果证明没有优化,为 dataflow 建筑学的建议双缓冲机制在那上完成 16.2% 平均效率改进。相似文献

13.

Using dataflow in a system of multiple single board computers

Jukka Aspelund 《Microprocessors and Microsystems》1981,5(3):103-107

The concept of dataflow is usually limited to totally new computer structures. This paper discusses the simulation of dataflow in a symmetric system of standard microcomputer modules. The approach has advantages in that the parallelism of the computation can be expressed and handled in a natural way. The simulation of the dataflow semantics takes, however, hundreds of machine instructions for each actor execution and quite long actors are needed to keep the overhead low. The implementation of a dataflow computation using the adaptive trapezoid method for numerical integration is discussed. In this computation the total estimated overhead is less than 10 per cent. 相似文献

14.

Dataflow computer extension towards real-time processing

Masaru Takesue 《Real-Time Systems》1990,1(4):333-350

This paper presents an extended architecture and a scheduling algorithm for a dataflow computer aimed at real-time processing. From the real-time processing point of view, current dataflow computers have several problems which stem from their hardware mechanisms for scheduling instructions based on data synchronization. This mechanism extracts as many eligible instructions as possible for execution of a program, then executes them in parallel. Hence, the computation in a dataflow computer is generally difficult to interrupt and schedule using software. To realize a controllable dataflow computation, two basic mechanisms are introduced for serializing concurrent processes and interrupting the execution of a process. A parallel and distributed algorithm for the scheduler is presented, with these two mechanisms, which controls and decides state transitions and execution order of the processes based on priority and execution depth, while still maintaining the number of the running state processes at a preferred value. To gear the scheduler algorithm to meet one of the requirements for real-time processing, such as time-constrained computing, a data-parallel algorithm for selection of the user-process with the current highest priority in O (x log_x n) time is proposed, where n is the number of priority levels. 相似文献

15.

基于粗粒度数据流架构的稀疏卷积神经网络加速

吴欣欣欧焱李文明王达张浩范东睿《计算机研究与发展》2021,58(7):1504-1517

卷积神经网络(convolutional neural network, CNN)在图像处理、语音识别、自然语言处理等领域实现了很好的性能.大规模的神经网络模型通常遭遇计算、存储等资源限制,稀疏神经网络的出现有效地缓解了对计算和存储的需求.尽管现有的领域专用加速器能够有效处理稀疏网络,它们通过算法和结构的紧耦合实现高能效,却丧失了结构的灵活性.粗粒度数据流架构通过灵活的指令调度可以实现不同的神经网络应用.基于该架构,密集卷积规则的计算特性使不同通道共享相同的一套指令执行,然而稀疏网络中存在权值稀疏,使得这些指令中存在0值相关的无效指令,而现有的指令执行方式无法自动跳过它们从而产生无效计算.同时在执行不规则的稀疏网络时,现有的指令映射方法造成了计算阵列的负载不均衡.这些问题阻碍了稀疏网络性能的提升.基于不同通道共享一套指令的前提下,根据稀疏网络的数据和指令特征增加指令控制单元实现权值数据中0值相关指令的检测和跳过,同时使用负载均衡的指令映射算法解决稀疏网络中指令执行不均衡问题.实验表明：与密集网络相比稀疏网络实现了平均1.55倍的性能提升和63.77%的能耗减少.同时比GPU(cuSparse)和Cambricon-X实现的稀疏网络分别快2.39倍(Alexnet)、2.28倍(VGG16)和1.14倍(Alexnet)、1.23倍(VGG16). 相似文献

16.

An evaluation of medium-grain dataflow code

Walid A. Najjar Lucas Roh A. P. Wim Böhm 《International journal of parallel programming》1994,22(3):209-242

In this paper, we study several issues related to the medium grain dataflow model of execution. We present bottom-up compilation of medium grainclusters from a fine grain dataflow graph. We compare thebasic block and thedependence sets algorithms that partition dataflow graphs into clusters. For an extensive set of benchmarks we assess the average number of instructions in a cluster and the reduction in matching operations compared with fine grain dataflow execution. We study the performance of medium grain dataflow when several architectural parameters, such as the number of processors, matching cost, and network latency, are varied. The results indicate that medium grain execution offers a good speedup over the fine grain model, that it is scalable, and tolerates network latency and high matching costs well. Medium grain execution can benefit from a higher output bandwidth of a processor and fainally, a simple superscalar processor with an issue rate of two is sufficient to exploit the internal parallelism of a cluster. This work is supported in part by NSF Grants CCR-9010240 and MIP-9113268. 相似文献

17.

Exploiting Data Structure Locality in the Dataflow Model

《Journal of Parallel and Distributed Computing》1995,27(2):183-200

Although the dataflow model has been shown to allow the exploitation of parallelism at all levels, research of the past decade has revealed several fundamental problems. Synchronization at the instruction level, token matching, coloring, and re-labeling operations have a negative impact on performance by significantly increasing the number of non-compute "overhead" cycles. Recently, many novel hybrid von-Neumann data driven machines have been proposed to alleviate some of these problems. The major objective has been to reduce or eliminate unnecessary synchronization costs through simplified operand matching schemes and increased task granularity. Moreover, the results from recent studies quantifying locality suggest sufficient spatial and temporal locality is present in dataflow execution to merit its exploitation. In this paper we present a data structure for exploiting locality in a data driven environment: the vector cell. A vector cell consists of a number of fixed length chunks of data elements. Each chunk is tagged with a presence bit, providing intra-chunk strictness and inter-chunk non-strictness to data structure access. We describe the semantics of the model, processor architecture and instruction set as well as a Sisal to dataflow vectorizing compiler back-end. The vector cell model is evaluated by comparing its performance to those of both a classical fine-grain dataflow processor employing I-structures and a conventional pipelined vector processor. Results indicate that the model is surprisingly resilient to long memory and communication latencies and is able to dynamically exploit the underlying parallelism across multiple processing elements at run time. 相似文献

18.

SIGMA-1: A dataflow computer for scientific computations

《Computer Physics Communications》1985,37(1-3):141-148

This paper presents an overview of the SIGMA-1, a large-scale dataflow computer being developed at the Electrotechnical Laboratory, Japan. The SIGMA-1 is designed to accommodate about two hundred dataflow processing elements. Its estimated average speed is one hundred MFLOPS for certain numerical computations.Various aspects of the SIGMA-1, such as the organization of a processing element, the matching memory unit, the structure memory and the communication network, are described. The present status and development plans of the SIGMA-1 project are detailed. It is predicted that the SIGMA-1 will give higher speed over a wide range of applications than conventional von Neumann computers. 相似文献

19.

基于剖析信息和关键路径长度的软件扇出树生成算法

曾斌安虹王莉《计算机科学》2010,37(3):248-252

开发利用ILP(Inst ruction-level Parallelism)是现代高性能处理器取得高性能的关键要素之一。宽发射的超标量处理器、超长指令字处理器和数据流处理器只有在并行执行多条相邻的指令时才能获得较高的性能。数据流处理器的一个关键问题是如何把指令的计算结果高效地播送给目标指令而不用读写集中式寄存器文件。对于每条目标数大于指令所能编码的目标数的指令,编译程序都要插入一棵由MOV指令构成的软件扇出树来把计算结果播送给多条目标指令。为了暴露更多的ILP给硬件执行基底,提出了一种改进的软件扇出树生成算法,本算法根据目标指令的执行概率大小以及目标指令到该指令所在块的出口的关键路径长度来计算目标指令的权值,然后对各个叶子的优先权值进行排序,再根据优先权值的顺序来构造一棵软件扇出树,以便把指令的计算结果播送给多条目标指令。实验结果发现,本算法相对于传统的软件扇出树生成算法其性能有较大的提高。相似文献

20.

Advances in the dataflow computational model

Walid A. Najjar Edward A. Lee Guang R. Gao 《Parallel Computing》1999,25(13-14)

The dataflow program graph execution model, or dataflow for short, is an alternative to the stored-program (von Neumann) execution model. Because it relies on a graph representation of programs, the strengths of the dataflow model are very much the complements of those of the stored-program one. In the last thirty or so years since it was proposed, the dataflow model of computation has been used and developed in very many areas of computing research: from programming languages to processor design, and from signal processing to reconfigurable computing. This paper is a review of the current state-of-the-art in the applications of the dataflow model of computation. It focuses on three areas: multithreaded computing, signal processing and reconfigurable computing. 相似文献