1.

A pipelined interface for high floating-point performance with precise exceptions

Iacobovici S. 《Micro, IEEE》1988,8(3):77-87

Two options are presented that were considered for a pipelined interface between a central processing unit (CPU) and a floating-point coprocessor (FPU), along with the CPU recovery mechanisms that provide precise floating-point exceptions for each option. The first option supports parallel execution of both floating-point and integer instructions, while the second option pipelines only the execution of floating-point instructions. The second option is used in National Semiconductor's 32532/32580 processor cluster because it offers high performance with significantly lower complexity. The 32532 microprocessor features a pipelined slave protocol that hides the CPU-FPU communication overhead for most floating-point instructions by pipelining their execution. A simple recovery mechanism implemented within the CPU maintains the precision of floating-point exceptions. As a result, the 32532 microprocessor supports very high floating-point performance without sacrificing software compatibility with previous Series 32000 CPU-FPU clusters.

2.

Design and implementation of a loop instruction buffer for VLIW processors

李勇胡慧俐杨焕荣《计算机应用》2014,34(4):1005-1009

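Loops account for a large share of execution time in digital signal processing software, so holding loop code in an instruction buffer reduces the number of program memory accesses and improves processor performance. A buffer that supports loop instructions is added to the instruction pipeline of a VLIW processor; it caches the instructions of a loop and dispatches them to the functional units in software-pipelined form. Loop code is therefore fetched from memory only once but executed many times, which greatly reduces memory accesses. While loop instructions are running, the buffer signals the program memory to enter a sleep state, which lowers processor power consumption. Tests on typical applications show that with the loop buffer the idle rate of the fetch pipeline can exceed 90%, overall processor performance improves by about 10%, and the hardware area overhead of the loop buffer is about 9% of the fetch pipeline.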
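The saving in program memory accesses described in this abstract can be illustrated with a toy count. This is a minimal sketch, not the paper's design: it assumes the whole loop body fits in the buffer, that each instruction word costs one fetch, and the function names are hypothetical.

```python
def program_memory_fetches(loop_len: int, iterations: int, use_loop_buffer: bool) -> int:
    """Count instruction-word fetches from program memory for one loop.

    Assumption (not from the paper): the whole loop body fits in the buffer.
    """
    if loop_len < 0 or iterations < 0:
        raise ValueError("loop_len and iterations must be non-negative")
    if iterations == 0:
        return 0
    if use_loop_buffer:
        # The body is fetched once, then replayed from the buffer.
        return loop_len
    # Without a buffer every iteration re-fetches the whole body.
    return loop_len * iterations


def fetch_idle_fraction(loop_len: int, iterations: int) -> float:
    """Fraction of fetches avoided by the buffer, relative to no buffer."""
    baseline = program_memory_fetches(loop_len, iterations, False)
    if baseline == 0:
        return 0.0
    buffered = program_memory_fetches(loop_len, iterations, True)
    return 1.0 - buffered / baseline


if __name__ == "__main__":
    print(program_memory_fetches(8, 100, False))
    print(program_memory_fetches(8, 100, True))
    print(fetch_idle_fraction(8, 100))
```

The model only shows why the fetch path sits idle for most of a long-running loop; the idle rate and speedup reported in the abstract are measurements on real applications and are not reproduced by this count.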

3.

Dataflow computer extension towards real-time processing

Masaru Takesue 《Real-Time Systems》1990,1(4):333-350

This paper presents an extended architecture and a scheduling algorithm for a dataflow computer aimed at real-time processing. From the real-time processing point of view, current dataflow computers have several problems that stem from their hardware mechanism for scheduling instructions based on data synchronization. This mechanism extracts as many eligible instructions as possible from a program, then executes them in parallel. Hence, the computation in a dataflow computer is generally difficult to interrupt and schedule using software. To realize a controllable dataflow computation, two basic mechanisms are introduced for serializing concurrent processes and interrupting the execution of a process. Using these two mechanisms, a parallel and distributed scheduler algorithm is presented that controls and decides the state transitions and execution order of processes based on priority and execution depth, while keeping the number of running processes at a preferred value. To gear the scheduler algorithm to one of the requirements of real-time processing, time-constrained computing, a data-parallel algorithm is proposed that selects the user process with the current highest priority in O(x log_x n) time, where n is the number of priority levels.

4.

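A loop control mechanism supporting software pipelining of nested loops (total citations: 1, self-citations: 0, citations by others: 1)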

汤志忠于涛《计算机研究与发展》1998,35(6):511-515

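ILSP, a software pipelining algorithm for nested loops that executes inner and outer loops alternately, is an effective method for optimizing nested loops. For ILSP to achieve good time and space efficiency, it needs an effective software pipelining mechanism for nested loops that supports it. This paper describes such a control mechanism in some detail. Working together with a nested-loop optimizing compiler, it effectively supports software pipelining of nested loops and ensures that ILSP achieves a high speedup at a low space cost.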

5.

A software pipelining framework for BW104x

洪立涛郑启龙《计算机系统应用》2016,25(10):114-119

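Most modern high-performance digital signal processors adopt a very long instruction word (VLIW) architecture, issuing multiple instructions in the same clock cycle to exploit the instruction-level parallelism of the target machine and obtain higher computing performance. This paper introduces the features of the BW104x target architecture: BWDSP104X is a processor designed for high-performance computing that uses a 16-issue, single-instruction multiple-data (SIMD) architecture. To make full use of the hardware resources across clusters and within each cluster, a back-end software pipelining optimization is proposed on top of the Open64 compiler infrastructure. It comprises loop selection and the computation of resource and data dependences, uses the classical modulo scheduling method for software pipelining, and introduces a modulo variable expansion module to resolve conflicts between variables of different iterations. Experimental results show a substantial performance improvement after pipelining compared with before.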

6.

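Three algorithms for improving the effectiveness of software pipelining: comparison and combination (total citations: 1, self-citations: 0, citations by others: 1)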

李文龙陈彧林海波汤志忠《软件学报》2005,16(10):1822-1832

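Software pipelining is a technique for exploiting instruction-level parallelism in loops; it speeds up loop execution by executing several consecutive loop iterations in parallel. In software pipelining, overlapping loop iterations raises register requirements and increases register pressure, so software pipelining may fail when the target processor does not provide enough registers. The register requirements of software-pipelined loops in the NAS and SPEC2000 benchmarks were evaluated on the Itanium processor, and a shortage of static registers was found to be the main cause of software pipelining failure. Three algorithms are proposed that increase the number of software-pipelined loops and improve the effectiveness of software pipelining: register-sensitive unrolling (RSU), which limits the loop unrolling factor; stacked register allocation (SRA); and variable type conversion (VTC). RSU determines a reasonable unrolling factor from the static register requirement, which raises the success rate of software pipelining; SRA and VTC use idle stacked registers and rotating registers, respectively, in place of static registers, which improves register utilization. The three algorithms were implemented in ORC (Open Research Compiler), an open-source compiler for the Itanium processor. Their effectiveness was compared using the NAS programs, and their combined use was also studied experimentally.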

7.

Software pipelining of loops by the method of modulo scheduling

N. I. V’yukova V. A. Galatenko S. V. Samborskii 《Programming and Computer Software》2007,33(6):307-315

Software pipelining is an efficient method of loop optimization that allows for parallelism of operations related to different loop iterations. Currently, most commercial compilers use loop pipelining methods based on modulo scheduling algorithms. This paper reviews these algorithms and considers algorithmic solutions designed for overcoming the negative effects of software pipelining of loops (a significant growth in code size and increase in the register pressure) as well as methods making it possible to use some hardware features of a target architecture. The paper considers global-scheduling mechanisms allowing one to pipeline loops containing a few basic blocks and loops in which the number of iterations is not known before the execution.
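Modulo scheduling algorithms usually start from a lower bound on the initiation interval (the number of cycles between the starts of successive iterations) and then search upward for a feasible schedule. The sketch below computes only that bound, under simplifying assumptions that are not taken from the reviewed paper: each operation occupies one unit of one resource class for one cycle, and the recurrence cycles of the dependence graph are supplied by the caller as (total latency, total iteration distance) pairs. All names are illustrative.

```python
from math import ceil


def res_mii(op_counts: dict, unit_counts: dict) -> int:
    """Resource-constrained bound: the busiest resource class limits the interval."""
    return max(
        (ceil(op_counts[r] / unit_counts[r]) for r in op_counts),
        default=1,
    )


def rec_mii(recurrences: list) -> int:
    """Recurrence-constrained bound over (total_latency, total_distance) cycles."""
    for _latency, distance in recurrences:
        if distance <= 0:
            raise ValueError("a recurrence must span at least one iteration")
    return max(
        (ceil(latency / distance) for latency, distance in recurrences),
        default=1,
    )


def min_initiation_interval(op_counts: dict, unit_counts: dict, recurrences: list) -> int:
    """Lower bound on the initiation interval: the larger of the two bounds."""
    return max(res_mii(op_counts, unit_counts), rec_mii(recurrences))


if __name__ == "__main__":
    ops = {"alu": 4, "mem": 3, "mul": 2}    # operations per loop iteration
    units = {"alu": 2, "mem": 1, "mul": 1}  # functional units per class
    cycles = [(4, 1), (5, 2)]               # (latency, distance) per recurrence
    print(min_initiation_interval(ops, units, cycles))
```

A real scheduler then tries to place every operation in a reservation table that wraps around at this interval, and increases the interval when placement fails.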

8.

Optimization of the level-3 BLAS GEMM function on the ShenWei 1600 (申威1600)

刘昊刘芳芳张鹏杨超蒋丽娟《计算机系统应用》2016,25(12):234-239

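BLAS is one of the important low-level supporting math libraries in scientific computing today, and its level-3 functions are the most widely used. Based on the Chinese-made ShenWei 1600 platform, this paper proposes a high-performance implementation of general matrix multiplication (GEMM), a level-3 function of the Basic Linear Algebra Subprograms (BLAS) library. On a single core, hand optimization at the assembly level uses architecture-specific techniques such as multiply-add instructions, loop unrolling, software-pipelined instruction reordering, SIMD vectorization, and register blocking. On multiple cores, a multithreaded acceleration scheme suited to the platform is proposed. Experimental results show an average speedup of 4.72 times over the well-known open-source math library GotoBLAS in single-core serial tests; in multi-core parallel scaling tests, the 4-thread version reaches on average 3.02 times the performance of the single-thread version.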
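Register blocking, one of the single-core techniques named in this abstract, can be illustrated in a few lines. The sketch below is an assumption-laden illustration in plain Python, not the paper's assembly kernel: it uses a fixed 2x2 block, requires even row and column counts in the result, and the function name is hypothetical.

```python
def gemm_blocked_2x2(a: list, b: list) -> list:
    """Compute C = A * B with a 2x2 register block.

    Four accumulators stand in for registers: each loaded element of A and B
    is reused twice before the next load, instead of once in the naive loop.
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of A and B do not match")
    if m % 2 or n % 2:
        raise ValueError("this sketch assumes even row and column counts")
    c = [[0.0] * n for _ in range(m)]
    for i in range(0, m, 2):
        for j in range(0, n, 2):
            c00 = c01 = c10 = c11 = 0.0      # the 2x2 block of C kept in "registers"
            for p in range(k):
                a0, a1 = a[i][p], a[i + 1][p]   # two loads from A
                b0, b1 = b[p][j], b[p][j + 1]   # two loads from B
                c00 += a0 * b0                  # four multiply-adds per four loads
                c01 += a0 * b1
                c10 += a1 * b0
                c11 += a1 * b1
            c[i][j], c[i][j + 1] = c00, c01
            c[i + 1][j], c[i + 1][j + 1] = c10, c11
    return c


if __name__ == "__main__":
    print(gemm_blocked_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

In a hand-tuned kernel the block size is chosen to match the register file and the SIMD width, and the inner loop is additionally unrolled and reordered; none of that tuning is modeled here.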

9.

A cost model and decision framework for software pipelining (total citations: 1, self-citations: 0, citations by others: 1)

李文龙林海波汤志忠《软件学报》2004,15(7):1005-1011

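Software pipelining is an important instruction scheduling technique that raises instruction-level parallelism (ILP) by overlapping the execution of different loop iterations. Modulo scheduling is a widely used class of software pipelining algorithms. Software pipelining is not a cost-free optimization: it has overheads such as longer compilation time and higher register pressure. Moreover, because of limits imposed by the architecture, the scheduling algorithm, and program characteristics, software pipelining does not always reach the desired speedup and can sometimes even degrade performance. A software pipelining cost model oriented to program characteristics is proposed, the cost of software pipelining under this model is analyzed quantitatively, and a correlation-analysis-based ...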

10.

Accelerating Data Transfer in Dataflow Architectures Through a Look-Ahead Acknowledgment Mechanism

Feng Yu-Jing Li De-Jian Tan Xu Ye Xiao-Chun Fan Dong-Rui Li Wen-Ming Wang Da Zhang Hao Tang Zhi-Min 《计算机科学技术学报》2022,37(4):942-959

The dataflow architecture, which is characterized by a lack of a redundant unified control logic, has been shown to have an advantage over the control-flow architecture as it improves the computational performance and power efficiency, especially of applications used in high-performance computing (HPC). Importantly, the high computational efficiency of systems using the dataflow architecture is achieved by allowing program kernels to be activated in a simultaneous manner. Therefore, a proper acknowledgment mechanism is required to distinguish the data that logically belongs to different contexts. Possible solutions include the tagged-token matching mechanism in which the data is sent before acknowledgments are received but retried after rejection, or a handshake mechanism in which the data is only sent after acknowledgments are received. However, these mechanisms are characterized by both inefficient data transfer and increased area cost. Good performance of the dataflow architecture depends on the efficiency of data transfer. In order to optimize the efficiency of data transfer in existing dataflow architectures with a minimal increase in area and power cost, we propose a Look-Ahead Acknowledgment (LAA) mechanism. LAA accelerates the execution flow by speculatively acknowledging ahead without penalties. Our simulation analysis based on a handshake mechanism shows that our LAA increases the average utilization of computational units by 23.9%, with a reduction in the average execution time by 17.4% and an increase in the average power efficiency of dataflow processors by 22.4%. Crucially, our novel approach results in a relatively small increase in the area and power consumption of the on-chip logic of less than 0.9%. In conclusion, the evaluation results suggest that Look-Ahead Acknowledgment is an effective improvement for data transfer in existing dataflow architectures.

11.

A high-accuracy architecture-level power model for GPUs

王卓薇程良伦肖红《计算机科学》2016,43(11):30-35

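As hardware functionality grows richer and software development environments mature, GPUs have begun to be used for general-purpose computing, helping CPUs accelerate programs. In pursuit of high performance, a GPU often contains hundreds or thousands of core compute units; this density of compute resources makes its performance far higher than a CPU's, but its power consumption is also higher, and power has become one of the major constraints on GPU development. Based on an in-depth study of the Fermi GPU architecture, a high-accuracy architecture-level power model is proposed. The model first computes the power consumed by each kind of native instruction and by each memory access. It then analyzes and predicts an application's power from the instructions the application executes on the hardware and from results obtained with a sampling tool. Finally, measured results are compared with the model's predictions on 13 benchmark applications; the model's prediction accuracy reaches about 90%.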

12.

A study of the register requirements of software pipelining on IA-64 (total citations: 1, self-citations: 0, citations by others: 1)

林海波李文龙汤志忠《计算机研究与发展》2004,41(1):22-27

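Software pipelining is one of the important methods for exploiting instruction-level parallelism in loops, and IA-64 is an EPIC architecture that supports software pipelining. Through a quantitative analysis of the registers required by software-pipelinable loops in the NAS Benchmarks, a heuristic algorithm that limits the loop unrolling factor is proposed. It effectively solves the failure of software pipelining caused by a shortage of available registers and improves the execution speed of applications.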

13.

Efficient multimedia coprocessor with enhanced SIMD engines for exploiting ILP and DLP

《Parallel Computing》2013,39(10):586-602

Multimedia applications have become increasingly important in daily computing. These applications are composed of heterogeneous regions of code mixed with data-level parallelism (DLP) and instruction-level parallelism (ILP). A standard solution for a multimedia coprocessor integrates single-instruction multiple-data (SIMD) engines into architectures exploiting ILP at compile time, such as very long instruction word (VLIW) and transport triggered architecture (TTA). However, the ILP regions fail to scale with the increased vector length to achieve high performance in the DLP regions. Furthermore, the register-to-register nature of SIMD instructions causes current SIMD engines to have limitations in handling memory alignment, data reorganization, and control flow. Many supporting instructions, such as data permutations, address generations, and loop branches, are required to aid in the execution of the real SIMD computation instructions. To mitigate these problems, we propose optimized SIMD engines that combine VLIW or TTA processing with unified scalar and long vector computation as well as efficient SIMD hardware for real computation. Our new architecture is based on TTA and is called multimedia coprocessor (MCP). This architecture includes the following features: (1) a simple coprocessor structure with 8-way TTA, (2) cost-effective SIMD hardware capable of performing floating-point operations, (3) long vector capabilities built upon existing SIMD hardware and a single register file and processor data path for both scalar operands and vector elements, and (4) an optimized SIMD architecture that addresses the SIMD limitations. Our experimental evaluations show that MCP can outperform conventional SIMD techniques by an average of 39% and 12% in performance for multimedia kernels and applications, respectively.

14.

A register structure supporting software pipelining of nested loops (total citations: 1, self-citations: 0, citations by others: 1)

容红波汤志忠《软件学报》2000,11(3):401-409

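Register organization and allocation are among the keys to software pipelining algorithms. To support software pipelining of nested loops, this paper proposes a novel register structure: the semi-shared jumping pipelined register file. It effectively solves problems specific to software pipelining of nested loops, namely register renaming within and across loop levels and pipeline interruption; it effectively eliminates inter-iteration read/write dependences of outer loops and raises the instruction-level parallelism of programs. It offers three allocation modes that can be used flexibly: single register, pipelined register, and register group. The pipelined register mode gives simple and effective support for register renaming when lifetimes are known and confined to a single loop level. The register group mode handles the case in which variable lifetimes are not known under software pipelining of nested loops. The jump operation ...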

15.

Data flow processing in service processes

陈姣娟曹健《计算机科学》2013,40(1):14-18

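Service processes must handle a large volume of heterogeneous data exchanged between services, and the way data flow is handled directly affects the execution efficiency of a service process. This paper describes the data flow representation model, the data mapping mechanism, and the data flow validation mechanism in service process models; discusses data management issues at run time, such as data flow scheduling, data storage, and data transfer; and analyzes how data flow processing is applied in service processes. Finally, drawing on current progress in data flow research, it offers an outlook for future work.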

16.

An evaluation of medium-grain dataflow code

Walid A. Najjar Lucas Roh A. P. Wim Böhm 《International journal of parallel programming》1994,22(3):209-242

In this paper, we study several issues related to the medium grain dataflow model of execution. We present bottom-up compilation of medium grain clusters from a fine grain dataflow graph. We compare the basic block and the dependence sets algorithms that partition dataflow graphs into clusters. For an extensive set of benchmarks we assess the average number of instructions in a cluster and the reduction in matching operations compared with fine grain dataflow execution. We study the performance of medium grain dataflow when several architectural parameters, such as the number of processors, matching cost, and network latency, are varied. The results indicate that medium grain execution offers a good speedup over the fine grain model, that it is scalable, and that it tolerates network latency and high matching costs well. Medium grain execution can benefit from a higher output bandwidth of a processor and, finally, a simple superscalar processor with an issue rate of two is sufficient to exploit the internal parallelism of a cluster. This work is supported in part by NSF Grants CCR-9010240 and MIP-9113268.

17.

The TMS390C602A floating-point coprocessor for Sparc systems

Darley M. Kronlage B. Bural D. Churchill B. Pulling D. Wang P. Iwamoto R. Yang L. 《Micro, IEEE》1990,10(3):36-47

A recent Sparc (scalable processor architecture) processor consists of a two-chip configuration, containing the TMS390C601 integer unit (IU) and the TMS390C602A floating-point unit (FPU). The second device, an innovative coprocessor that lets the processor execute single- or double-precision floating-point instructions concurrently with IU operations, is described. Dedicated floating-point hardware in the FPU increases the performance of the system. Running at clock periods as small as 20 ns (50-MHz clock rate), the chip should deliver 5.5 million double-precision floating-point operations per second under the Linpack benchmark. The FPU provides single- and double-precision arithmetic functions: addition, subtraction, multiplication, division, square root, compare, and convert. To minimize its math unit's latency, the FPU uses a highly parallel architecture requiring separate math units to optimize additions and multiplications. Traps stop the execution of a program to jump to a software routine for handling data-dependent errors or to execute instructions not implemented in the hardware. Benchmark results are presented.

18.

Simulation and Improvement of the Processing Subsystem of the Manchester Dataflow Computer

Lai Zhiyong Zheng Shouqi 《计算机科学技术学报》1995,10(6):557-563

The Manchester dataflow computer is a famous dynamic dataflow computer. It is centralized in architecture and simple in organization. Its overhead for communication and scheduling is very small. Its efficiency drops when the number of processing elements in the processing subsystem increases. Several articles have evaluated its performance and presented improved methods. The authors studied its processing subsystem and carried out a simulation. The simulation results show that the efficiency of the processing subsystem drops dramatically when the average number of instruction execution microcycles becomes smaller and the maximum instruction execution rate is nearly attained. Two improved methods are presented to overcome this disadvantage. The improved processing subsystem, with a cheap distributor made up of a bus and a two-level fixed priority circuit, possesses almost full efficiency whether the average number of instruction execution microcycles is large or small, even if the maximum instruction execution rate is approached.

19.

A software pipelining implementation of the AES algorithm for a cryptographic stream processor

王寿成徐进辉严迎建李功丽贾永旺《计算机应用》2017,37(6):1620-1624

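To address the long time consumed by the round function in block cipher implementations, a software pipelining implementation of the Advanced Encryption Standard (AES) algorithm is proposed for the reconfigurable cryptographic stream processor (RCSP). The method divides the round function into several pipeline stages, each mapped to different parallel cryptographic resources; by executing different pipeline stages of multiple round functions in parallel, it exploits instruction-level parallelism to speed up the round function and thereby improve block cipher performance. The pipeline partitioning of AES and the software pipelining mapping are analyzed for RCSP configurations with one, two, and four clusters of computing resources. Experimental results show that the method lets operations on different data blocks of a single block or of multiple blocks execute in parallel, which not only improves single-block serial performance but also raises multi-block parallel performance by exploiting parallelism between blocks.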

20.

Performance and the i860 microprocessor

Atkins M. 《Micro, IEEE》1991,11(5)

The internal design of the i860 CPU, which exploits pipelining and parallelism more than previous microprocessors, is described. The i860 uses RISC concepts and memory-performance optimizations in several novel ways. Other innovations include simultaneous floating-point operations similar to digital signal processing, a two-instruction-per-clock mode, fast floating-point pipelines, graphics instructions, and high-bandwidth registers and caches on-chip. These features make it one of the fastest single-chip processors available.