首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
SRF Coloring: Stream Register File Allocation via Graph Coloring   总被引:2,自引:0,他引:2       下载免费PDF全文
Stream Register File (SRF) is a large on-chip memory of the stream processor and its efficient management is essential for good performance. Current stream programming languages expose the management of SRF to the programmer, incurring heavy burden on the programmer and bringing difficulties to inheriting the legacy codes. SF95 is the language developed for FT64 which is the first 64-bit stream processor designed for scientific applications. SF95 conceals SRF from the programmer and leaves the management...  相似文献   

2.
流体系结构是一种适应VLSI工艺发展的新型体系结构,它是否对科学计算程序有效是一个广泛关注的问题。本文选取NASA并行测试程序集中的一个数据密集型程序MG,研究了 它在一个64位的面向科学计算设计的流处理器FT64上的实现和优化问题。在FT64上的实测表明,经过面向片上存储层次的优化,FT64能够达到与Itanium2处理器相当的性能。
。  相似文献   

3.
提出了面向科学计算的64位流体系结构——MASA,它具有强局域性、并行性、解耦合访存操作和计算操作等特征,特别适合于计算密集型的并行应用.作者使用时钟精确的模拟器评测了流体力学中的典型应用在MASA上的运行性能,结果表明MASA在500MHz的情况下能够获得比1.6GHz的Iantium2近4倍的加速,证实了流体系结构在高性能计算领域的极大潜力.  相似文献   

4.
Multiple-Morphs Adaptive Stream Architecture   总被引:2,自引:0,他引:2       下载免费PDF全文
In modern VLSI technology, hundreds of thousands of arithmetic units fit on a 1cm^2 chip. The challenge is supplying them with instructions and data. Stream architecture is able to solve the problem well. However, the applications suited for typical stream architecture are limited. This paper presents the definition of regular stream and irregular stream, and then describes MASA (Multiple-morphs Adaptive Stream Architecture) prototype system which supports different execution models according to applications' stream characteristics. This paper first discusses MASA architecture and stream model, and then explores the features and advantages of MASA through mapping stream applications to hardware. Finally MASA is evaluated by ten benchmarks. The result is encouraging.  相似文献   

5.
Nowadays, high performance applications exploit multiple level architectures, due to the presence of hardware accelerators like GPUs inside each computing node. Data transfers occur at two different levels: inside the computing node between the CPU and the accelerators and between computing nodes. We consider the case where the intra-node parallelism is handled with HMPP compiler directives and message-passing programming with MPI is used to program the inter-node communications. This way of programming on such an heterogeneous architecture is costly and error-prone. In this paper, we specifically demonstrate the transformation of HMPP programs designed to exploit a single computing node equipped with a GPU into an heterogeneous HMPP + MPI exploiting multiple GPUs located on different computing nodes.  相似文献   

6.
现代高性能数字信号处理器大多数采用超长指令字体系结构,通过在同一时钟周期发射多条指令以便获得更高的运算性能来发掘目标机器指令级别并行性.介绍了BW104x目标体系特征,BWDSP104X是一款针对高性能计算领域设计的处理器,采用16发射、单指令流,多数据流架构.为了充分利用多簇及簇内硬件资源,基于open64编译基础设施提出了后端软流水优化,其中包括循环选择,资源依赖数据依赖计算,采用经典的模调度方法进行软流水调度,为解决不同迭代变量冲突引入模变量拓展模块.实验结果证明流水后性能相对流水前有了很好的提升.  相似文献   

7.
FT64是一款自主研发的面向科学计算的64位流处理器。本文介绍了该处理器的微体系结构及其编程模型,重点讨论了片内流寄存器文件实现的关键技术;该流寄存器文件具有硬件代价低、支持多流虚拟并发访问等特性。测试结果表明,流寄存器文件满足某些类科学计算与工程应用的带宽需求。  相似文献   

8.
In light of GPUs’ powerful floating-point operation capacity,heterogeneous parallel systems incorporating general purpose CPUs and GPUs have become a highlight in the research field of high performance computing(HPC).However,due to the complexity of programming on GPUs,porting a large number of existing scientific computing applications to the heterogeneous parallel systems remains a big challenge.The OpenMP programming interface is widely adopted on multi-core CPUs in the field of scientific computing.To effectively inherit existing OpenMP applications and reduce the transplant cost,we extend OpenMP with a group of compiler directives,which explicitly divide tasks among the CPU and the GPU,and map time-consuming computing fragments to run on the GPU,thus dramatically simplifying the transplantation.We have designed and implemented MPtoStream,a compiler of the extended OpenMP for AMD’s stream processing GPUs.Our experimental results show that programming with the extended directives deviates from programming with OpenMP by less than 11% modification and achieves significant speedup ranging from 3.1 to 17.3 on a heterogeneous system,incorporating an Intel Xeon E5405 CPU and an AMD FireStream 9250 GPU,over the execution on the Xeon CPU alone.  相似文献   

9.
FT64 is the first 64-bit stream processor designed for scientific computing. It is critical to exploit optimizing streamization approaches for scientific applications on FT64 due to the inefficiency of direct streamization approach. In this paper, we propose a novel matrix-based streamization approach for improving locality and parallelism of scientific applications on FT64. First, a Data&Computation Matrix is built to abstract the relationship between loops and arrays of the original programs, and it is helpful for formulating the streamization problem. Second, three key techniques for optimizing streamization approach are proposed based on the transformations of the matrix, i.e., coarse-grained program transformations, fine-grained program transformations, and stream organization optimizations. Finally, we apply our approach to ten typical scientific application kernels on FT64. The experimental results show that the matrix-based streamization approach achieves an average speedup of 2.76 over the direct streamization approach, and performs equally to or better than the corresponding Fortran programs on Itanium 2 except CG. It is certain that the matrix-based streamization approach is a promising and practical solution to efficiently exploit the tremendous potential of FT64.
Jing DuEmail:
  相似文献   

10.
异构众核架构具有超高的性能功耗比,已成为超级计算机体系结构的重要发展方向.但众核系统更为复杂的并行层次和存储层次,给编程和优化带来了极大的挑战,因此研究面向众核系统的并行编程技术,对于降低国产众核系统并行应用的编程难度、提升并行程序的性能都具有重要的意义.提出统一架构的多模式并行编程模型,包括异构融合的加速运算模型和按同构方式编程的自主运算模型,根据编程模型设计了Parallel C语言,能有效描述国产众核系统的异构并行性,与其它众核系统上MPI+X的使用模式相比,编程和系统优化都具有全局视角,在多级局部性描述、单边消息、兼容已有多核应用等方面具有特色;基于Open64构建了Parallel C编译系统,全面支持加速运算模型和自主运算模型,提出并实现了数据布局与自动DMA、编译指导的线程代理和拓扑位置感知的集合通信等优化.Micro Benchmark和实际应用在神威太湖之光计算机系统上的测试数据表明,Parallel C语言和编译系统具有良好的性能和可扩展性,能够有效支撑大型应用.  相似文献   

11.
存储系统是通用处理器在处理流应用时的瓶颈。该文基于FT64流处理器体系结构,提出一种面向流应用的流寄存器文件结构设计方法和数据传输机制,分析它在FT64中的作用。通过采用大容量、高带宽、虚拟多端口的存储器,将大部分流数据存取操作限制在寄存器文件这一层次,减少了主存压力。实验结果表明,该结构能很好地适应流应用需求。  相似文献   

12.
Tiled multi-core architectures have become an important kind of multi-core design for its good scalability and low power consumption. Stream programming has been productively applied to a number of important application domains. It provides an attractive way to exploit the parallelism. However, the architecture characteristics of large amounts of cores, memory hierarchy and exposed communication between tiles have presented a performance challenge for stream programs running on tiled multi-cores. In this paper, we present StreamTMC, an efficient stream compilation framework that optimizes the execution of stream applications for the tiled multi-core. This framework is composed of three optimization phases. First, a software pipelining schedule is constructed to exploit the parallelism. Second, an efficient hybrid of SPM and cache buffer allocation algorithm and data copy elimination mechanism is proposed to improve the efficiency of the data access. Last, a communication aware mapping is proposed to reduce the network communication and synchronization overhead. We implement the StreamTMC compiler on Godson-T, a 64-core tiled architecture and conduct an experimental study to verify the effectiveness. The experimental results indicate that StreamTMC can achieve an average of 58% improvement over the performance before optimization.  相似文献   

13.
流处理器体系结构是一种针对流应用中固有的计算和数据流动特点提出的一种新型的处理器体系结构,它结合了向量和超长指令字体系结构的特点,能有效地加速流应用的执行,而它的适用领域一直是当前国际上的热点讨论问题.本文从数据并行应用4个不同领域--数字信号处理、科学计算、网络和安全、以及多媒体处理选取了4个典型应用,详细剖析了这些应用在流体系结构上的流并行程序设计过程,归纳出数据并行类应用的流化步骤和方法,通过实验对这类应用在流体系结构上的适用性做出评估.  相似文献   

14.
Aspray emphasizes von Neumann's critical role in the formation of modern computing and celebrates von Neumann as the scientific legitimizer of computing. He provides a survey of von Neumann's many important contributions to computer architecture, hardware, design and construction, programming, numerical analysis, scientific computation, and the theory of computing. Aspray's essay stresses especially the importance of von Neumann's work to promote the development of logical design.  相似文献   

15.
梅森素数并行求解算法的流式实现   总被引:1,自引:0,他引:1       下载免费PDF全文
本文以数论中的Lucas-Lehmer检验法为基础,提出了梅森素数并行求解算法在FT64流处理器上的流式实现,并通过重设流记录的大小对程序进行了优化。评测数据表明,在FT64上运行该应用的时间平均比1.5GHz Itanium2快2.5倍。本文为梅森素数求解问题寻找了一条可行的加速方法,同时证实了流体系结构在高性能计算领域的极大潜力。本文提出的流式算法以及各种优化手段,对于其他科学计算领域中的计算密集型问题在流体系结构上的映射有极大的借鉴意义。  相似文献   

16.
计算机体系结构的统一模型   总被引:6,自引:1,他引:5  
提出了一种计算机体系结构的统一模型,将基于数据流计算与基于构令流计算的体系结构统一到基于指令流计算的体系结构上来,命名为Unified-ISA模型.使基于数据流计算的ASIC电路与基于构令流计算的RC Device电路的设计,统一为基于指令流计算的SIMD PE阵列上的程序设计.  相似文献   

17.
本文提出一种支持流数据传输的互连网络控制器的设计。该设计应用于FT64流处理器上,使得多个流处理器能够通过高性能网络进行数据传输,以便进行并行流数据运算。该设计采用二维环绕网,使用虚通道避免死锁,支持多个流的数据同时传输。投片后的测试结果表明,该设计功能正确,核心频率为500MHz,链路时钟频率为400MHz,满足设计要求。  相似文献   

18.
随着计算机应用领域不断拓展,流媒体应用及科学计算正成为微处理器的一种重要负载.流媒体应用的特征是大量的数据并行、少量的数据重用以及每次访存带来的大量计算.因为带宽的限制,传统的微处理器结构很难满足这些特点.X处理器是一款流处理器,针对流应用特点,X处理器采用了新型的三级流式存储层次:局部寄存器文件、流寄存器文件和片外存储器,有效解决了带宽问题.本文在模拟平台采用了两种方法(RS码和测试程序)测试,验证了流存储层次解决带宽瓶颈的有效性,也证明了设计的正确性.  相似文献   

19.
如何有效地解决I/O瓶颈问题,一直是高性能并行计算机有待研究解决的关键技术。我们提出了一种可伸缩分布共享并行I/O系统方案,并自行研制了结点控制器芯片和路由器芯片,研制了原型系统SDSP604。为实现系统的计算、通讯和I/O性能随着系统规模均衡扩展的目标,该系统基于CC-NUMA系统结构,采用了合理的分布共享并行I/O系统结构。  相似文献   

20.
基于Babel的构件程序设计   总被引:1,自引:1,他引:0  
为了解决高性能科学计算程序设计当中存在的开发难度大,开发周期长以及时开发人员要求高等问题,人们已经开始将软件构件技术引入该领域。由美国能源部、犹他州大学、印弟安那大学等联合提出的CCA便是研究高性能科学计算构件技术的项目之一。本文主要介绍了CCA以及CCA框架下的语言互操作工具-Babel的相关情况,并且通过NPB基准测试程序IS详细描述了Babel的使用,分析了基于Babel的程序设计对程序性能的影响。初步实验表明Babel能够有效解决语言的互操作问题,在面向科学计算的构件程序设计环境中能够发挥关键作用。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号