1.

SRF Coloring: Stream Register File Allocation via Graph Coloring (total citations: 2; self-citations: 0; citations by others: 2)

Xue-Jun Yang, Yu Deng, Li Wang, Xiao-Bo Yan, Jing Du, Ying Zhang, Gui-Bin Wang, and Tao Tang. Journal of Computer Science and Technology, 2009, 24(1): 152-164

The Stream Register File (SRF) is a large on-chip memory of the stream processor, and managing it efficiently is essential for good performance. Current stream programming languages expose SRF management to the programmer, which places a heavy burden on the programmer and makes legacy code hard to reuse. SF95 is the language developed for FT64, the first 64-bit stream processor designed for scientific applications. SF95 conceals the SRF from the programmer and leaves the management...
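As a rough illustration of the general idea of allocation by graph coloring, the sketch below colors a small interference graph greedily. It is a generic textbook heuristic and not the algorithm from the paper, whose details are not given in this abstract. The stream names, the interference edges, and the reading of a color as an SRF region are all hypothetical.

```python
def greedy_coloring(interference):
    """Assign each node the smallest color not used by an already-colored neighbor.

    `interference` maps a node to the set of nodes it conflicts with
    (here: streams assumed to be live at the same time).
    Nodes are visited in order of decreasing degree, a common heuristic.
    """
    order = sorted(interference, key=lambda n: len(interference[n]), reverse=True)
    colors = {}
    for node in order:
        used = {colors[n] for n in interference[node] if n in colors}
        color = 0
        while color in used:
            color += 1
        colors[node] = color
    return colors


# Hypothetical example: four streams; an edge means the two streams
# are live simultaneously and so must not share the same region.
example_graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
example_colors = greedy_coloring(example_graph)
```

In this toy example, streams `a`, `b`, and `c` interfere pairwise and therefore need three distinct colors, while `d` can reuse a color that `c` does not hold.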

2.

Implementation, Optimization, and Evaluation of Scientific Computing Programs on the FT64 Stream Processor

Deng Yu, Yan Xiaobo, Du Jing, Zhang Ying, Yang Xuejun. Computer Engineering & Science, 2008, 30(9): 107-110

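Stream architecture is a new type of architecture suited to the development of VLSI technology, and whether it is effective for scientific computing programs is a question of wide interest. This paper selects MG, a data-intensive program from the NASA parallel benchmark suite, and studies its implementation and optimization on FT64, a 64-bit stream processor designed for scientific computing. Measurements on FT64 show that, after optimizations targeting the on-chip memory hierarchy, FT64 achieves performance comparable to that of an Itanium 2 processor.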

3.

MASA: A Stream Processor Architecture and Its Evaluation in Fluid Dynamics Computation

Wu Nan, Wen Mei, He Yi, Xun Changqing, Ren Ju, Chai Jun, Zhang Chunyuan. Chinese Journal of Computers, 2008, 31(1): 133-141

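This paper proposes MASA, a 64-bit stream architecture for scientific computing. It features strong locality, parallelism, and decoupled memory-access and compute operations, and is particularly well suited to compute-intensive parallel applications. Using a cycle-accurate simulator, the authors evaluate the performance of typical fluid dynamics applications on MASA. The results show that MASA running at 500 MHz achieves a speedup of nearly 4x over a 1.6 GHz Itanium 2, confirming the great potential of stream architectures in high-performance computing.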

4.

Multiple-Morphs Adaptive Stream Architecture (total citations: 2; self-citations: 0; citations by others: 2)

Mei Wen, Nan Wu, Hai-Yan Li, and Chun-Yuan Zhang. Journal of Computer Science and Technology, 2005, 20(5): 635-646

In modern VLSI technology, hundreds of thousands of arithmetic units fit on a 1 cm^2 chip. The challenge is supplying them with instructions and data. Stream architecture addresses this problem well. However, the applications suited to a typical stream architecture are limited. This paper defines regular and irregular streams, and then describes the MASA (Multiple-morphs Adaptive Stream Architecture) prototype system, which supports different execution models according to applications' stream characteristics. The paper first discusses the MASA architecture and stream model, and then explores the features and advantages of MASA by mapping stream applications to hardware. Finally, MASA is evaluated with ten benchmarks. The result is encouraging.

5.

Generating data transfers for distributed GPU parallel programs

F. Silber-Chaussumier, A. Muller, R. Habel. Journal of Parallel and Distributed Computing, 2013

High-performance applications now exploit multi-level architectures, owing to the presence of hardware accelerators such as GPUs inside each computing node. Data transfers occur at two different levels: inside the computing node, between the CPU and the accelerators, and between computing nodes. We consider the case where intra-node parallelism is handled with HMPP compiler directives and message passing with MPI is used to program the inter-node communications. Programming such a heterogeneous architecture this way is costly and error-prone. In this paper, we specifically demonstrate the transformation of HMPP programs designed to exploit a single computing node equipped with a GPU into heterogeneous HMPP + MPI programs exploiting multiple GPUs located on different computing nodes.

6.

A Software Pipelining Framework for BW104x

Hong Litao, Zheng Qilong. Computer Systems & Applications, 2016, 25(10): 114-119

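Most modern high-performance digital signal processors adopt a very long instruction word (VLIW) architecture, issuing multiple instructions in the same clock cycle to exploit the instruction-level parallelism of the target machine and obtain higher computational performance. This paper introduces the characteristics of the BW104x target architecture. BWDSP104X is a processor designed for high-performance computing that uses a 16-issue, single-instruction, multiple-data architecture. To make full use of the multiple clusters and the hardware resources within each cluster, a back-end software pipelining optimization is proposed on top of the Open64 compiler infrastructure. It includes loop selection, computation of resource and data dependences, software pipelining scheduling with the classic modulo scheduling method, and a modulo variable expansion module introduced to resolve conflicts among variables of different iterations. Experimental results show that performance after pipelining improves substantially over performance before pipelining.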

7.

Implementation and Performance Evaluation of a Stream Register File

Chen Haiyan, Qi Shubo, Yi Xiaofei, Deng Rangyu, Li Chunjiang. Computer Engineering & Science, 2009, 31(1)

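FT64 is an independently developed 64-bit stream processor for scientific computing. This paper introduces the processor's microarchitecture and programming model, and focuses on the key techniques used to implement the on-chip stream register file, which has low hardware cost and supports virtual concurrent access by multiple streams. Test results show that the stream register file meets the bandwidth requirements of certain classes of scientific computing and engineering applications.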

8.

MPtoStream: an OpenMP compiler for CPU-GPU heterogeneous parallel systems

Science China Information Sciences, 2012, (9): 1961-1971

In light of GPUs' powerful floating-point operation capacity, heterogeneous parallel systems incorporating general-purpose CPUs and GPUs have become a highlight in the research field of high performance computing (HPC). However, due to the complexity of programming on GPUs, porting a large number of existing scientific computing applications to heterogeneous parallel systems remains a big challenge. The OpenMP programming interface is widely adopted on multi-core CPUs in the field of scientific computing. To effectively inherit existing OpenMP applications and reduce the porting cost, we extend OpenMP with a group of compiler directives, which explicitly divide tasks between the CPU and the GPU and map time-consuming computing fragments to run on the GPU, thus dramatically simplifying the porting. We have designed and implemented MPtoStream, a compiler of the extended OpenMP for AMD's stream processing GPUs. Our experimental results show that programming with the extended directives deviates from programming with OpenMP by less than 11% modification and achieves significant speedup ranging from 3.1 to 17.3 on a heterogeneous system, incorporating an Intel Xeon E5405 CPU and an AMD FireStream 9250 GPU, over the execution on the Xeon CPU alone.

9.

Matrix-based streamization approach for improving locality and parallelism on FT64 stream processor

Xuejun Yang, Jing Du, Xiaobo Yan, Yu Deng. The Journal of Supercomputing, 2009, 47(2): 171-197

FT64 is the first 64-bit stream processor designed for scientific computing. It is critical to exploit optimizing streamization approaches for scientific applications on FT64 due to the inefficiency of direct streamization approach. In this paper, we propose a novel matrix-based streamization approach for improving locality and parallelism of scientific applications on FT64. First, a Data&Computation Matrix is built to abstract the relationship between loops and arrays of the original programs, and it is helpful for formulating the streamization problem. Second, three key techniques for optimizing streamization approach are proposed based on the transformations of the matrix, i.e., coarse-grained program transformations, fine-grained program transformations, and stream organization optimizations. Finally, we apply our approach to ten typical scientific application kernels on FT64. The experimental results show that the matrix-based streamization approach achieves an average speedup of 2.76 over the direct streamization approach, and performs equally to or better than the corresponding Fortran programs on Itanium 2 except CG. It is certain that the matrix-based streamization approach is a promising and practical solution to efficiently exploit the tremendous potential of FT64.

10.

Design and Implementation of the Parallel C Language for Domestic Heterogeneous Many-Core Systems

He Wangquan, Liu Yong, Fang Yanfei, Wei Di, Qi Fengbin. Journal of Software, 2017, 28(4): 764-785

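Heterogeneous many-core architectures offer a very high performance-to-power ratio and have become an important direction for supercomputer architecture. However, the more complex parallelism and memory hierarchies of many-core systems pose great challenges for programming and optimization. Research on parallel programming techniques for many-core systems is therefore of great significance for reducing the programming difficulty of parallel applications on domestic many-core systems and for improving parallel program performance. This paper proposes a multi-mode parallel programming model with a unified architecture, comprising a heterogeneous-fusion accelerated computing model and an autonomous computing model programmed in a homogeneous manner. Based on this programming model, the Parallel C language is designed; it can effectively describe the heterogeneous parallelism of domestic many-core systems. Compared with the MPI+X usage pattern on other many-core systems, both programming and system optimization take a global view, and the language is distinctive in multi-level locality description, one-sided messaging, and compatibility with existing multi-core applications. A Parallel C compilation system is built on Open64; it fully supports the accelerated and autonomous computing models, and proposes and implements optimizations including data layout with automatic DMA, compiler-directed thread proxies, and topology-aware collective communication. Test data from micro-benchmarks and real applications on the Sunway TaihuLight computer system show that the Parallel C language and compilation system have good performance and scalability and can effectively support large applications.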

11.

A Stream Register File for Stream Applications

Ma Chiyuan, Chen Haiyan, Qi Shubo, Chen Shuming, Xiao Rong. Computer Engineering, 2008, 34(18): 263-265

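The memory system is the bottleneck when general-purpose processors handle stream applications. Based on the FT64 stream processor architecture, this paper proposes a design method and data transfer mechanism for a stream register file aimed at stream applications, and analyzes its role in FT64. By using a large-capacity, high-bandwidth, virtually multi-ported memory, most stream data accesses are confined to the register file level, which reduces pressure on main memory. Experimental results show that the structure adapts well to the needs of stream applications.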

12.

StreamTMC: Stream compilation for tiled multi-core architectures

Haitao Wei, Mingkang Qin, Weiwei Zhang, Junqing Yu, Dongrui Fan, Guang R. Gao. Journal of Parallel and Distributed Computing, 2013

Tiled multi-core architectures have become an important kind of multi-core design because of their good scalability and low power consumption. Stream programming has been productively applied to a number of important application domains and provides an attractive way to exploit parallelism. However, the architectural characteristics of tiled multi-cores (large numbers of cores, a memory hierarchy, and exposed communication between tiles) present a performance challenge for stream programs. In this paper, we present StreamTMC, an efficient stream compilation framework that optimizes the execution of stream applications for tiled multi-cores. The framework is composed of three optimization phases. First, a software pipelining schedule is constructed to exploit the parallelism. Second, an efficient hybrid SPM and cache buffer allocation algorithm and a data copy elimination mechanism are proposed to improve the efficiency of data access. Last, a communication-aware mapping is proposed to reduce network communication and synchronization overhead. We implement the StreamTMC compiler on Godson-T, a 64-core tiled architecture, and conduct an experimental study to verify its effectiveness. The experimental results indicate that StreamTMC achieves an average improvement of 58% over the performance before optimization.

13.

Development and Evaluation of Data-Parallel Applications on Stream Processor Architectures

Wang Qigang, An Hong, Xu Guang, Zhou Liping, Wang Fang. Journal of Chinese Computer Systems, 2008, 29(9)

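Stream processor architecture is a new type of processor architecture designed around the computation and data-flow characteristics inherent in stream applications. It combines features of vector and very long instruction word (VLIW) architectures and can effectively accelerate stream applications, while its range of applicability remains a hotly debated topic internationally. This paper selects four typical applications from four different domains of data-parallel applications (digital signal processing, scientific computing, networking and security, and multimedia processing), analyzes in detail the stream-parallel programming process for these applications on a stream architecture, summarizes the steps and methods for streamizing data-parallel applications, and evaluates experimentally how well such applications suit stream architectures.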

14.

John von Neumann's Contributions to Computing and Computer Science

Annals of the History of Computing, IEEE, 1989, 11(3): 189-195

Aspray emphasizes von Neumann's critical role in the formation of modern computing and celebrates von Neumann as the scientific legitimizer of computing. He provides a survey of von Neumann's many important contributions to computer architecture, hardware, design and construction, programming, numerical analysis, scientific computation, and the theory of computing. Aspray's essay stresses especially the importance of von Neumann's work to promote the development of logical design.

15.

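Stream Implementation of a Parallel Algorithm for Finding Mersenne Primes (total citations: 1; self-citations: 0; citations by others: 1)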

Wu Nan, Wu Wei, Wen Mei, Yang Qianming, Chai Jun, Zhang Chunyuan. Computer Engineering & Science, 2007, 29(11): 53-55

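Based on the Lucas-Lehmer test from number theory, this paper presents a stream implementation of a parallel algorithm for finding Mersenne primes on the FT64 stream processor, and optimizes the program by resizing the stream records. Evaluation data show that the application runs on FT64 on average 2.5 times faster than on a 1.5 GHz Itanium 2. This work finds a feasible way to accelerate the Mersenne prime search and also confirms the great potential of stream architectures in high-performance computing. The proposed stream algorithm and optimization techniques offer a useful reference for mapping compute-intensive problems from other scientific computing fields onto stream architectures.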
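For readers unfamiliar with the Lucas-Lehmer test named in this abstract, here is a minimal serial sketch in Python. It shows only the number-theoretic test itself, not the stream implementation or the FT64 optimizations described in the paper. The function names `is_prime`, `lucas_lehmer`, and `mersenne_exponents` are chosen here for illustration.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality check, adequate for small exponents."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def lucas_lehmer(p: int) -> bool:
    """Return True if the Mersenne number 2**p - 1 is prime.

    Assumes p is prime. For odd p the test iterates
    s -> s*s - 2 (mod 2**p - 1) starting from s = 4, p - 2 times,
    and 2**p - 1 is prime exactly when the final s is 0.
    """
    if p == 2:
        return True  # 2**2 - 1 = 3 is prime; the iteration applies to odd p
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0


def mersenne_exponents(limit: int) -> list:
    """Prime exponents p <= limit for which 2**p - 1 is prime."""
    return [p for p in range(2, limit + 1) if is_prime(p) and lucas_lehmer(p)]
```

Each candidate exponent is tested independently of the others, which is one natural source of parallelism for a search of this kind.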

16.

A Unified Model of Computer Architecture (total citations: 6; self-citations: 1; citations by others: 5)

Shen Xubang, Liu Zexiang, Wang Ru. Chinese Journal of Computers, 2007, 30(5): 729-736

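This paper proposes a unified model of computer architecture, named the Unified-ISA model, which unifies architectures based on data-flow computing and on configuration-flow computing into an architecture based on instruction-flow computing. Under this model, the design of ASIC circuits based on data-flow computing and of RC Device circuits based on configuration-flow computing is unified as program design on an instruction-flow SIMD PE array.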

17.

Research and Implementation of an Interconnection Network Controller Supporting Stream Data Transfer

Ma Chiyuan, Chen Shuming, Xing Zuocheng, Hao Yue. Computer Engineering & Science, 2008, 30(9): 103-106

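This paper presents the design of an interconnection network controller that supports stream data transfer. The design is used in the FT64 stream processor so that multiple stream processors can exchange data over a high-performance network for parallel stream computation. It adopts a 2D torus network, uses virtual channels to avoid deadlock, and supports simultaneous transfer of data from multiple streams. Post-tape-out tests show that the design functions correctly, with a core frequency of 500 MHz and a link clock frequency of 400 MHz, meeting the design requirements.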

18.

A Study of the Memory Hierarchy of the X Processor

Fu Guitao, Gao Jun, Xing Zuocheng. Computer and Modernization, 2007, (12): 22-24

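As computer applications continue to expand, streaming media applications and scientific computing are becoming an important workload for microprocessors. Streaming media applications are characterized by abundant data parallelism, little data reuse, and a large amount of computation per memory access. Because of bandwidth limits, conventional microprocessor architectures struggle to serve these characteristics. The X processor is a stream processor that targets them with a new three-level stream memory hierarchy (local register files, a stream register file, and off-chip memory), which effectively addresses the bandwidth problem. This paper tests it on a simulation platform using two methods (RS codes and test programs), verifying that the stream memory hierarchy relieves the bandwidth bottleneck and confirming the correctness of the design.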

19.

Design of a Scalable Distributed Shared Massively Parallel I/O System

Li Qiong, Guo Yufeng, Pang Zhengbin, Liu Guangming. Computer Engineering & Science, 2006, 28(1): 135-138

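How to resolve the I/O bottleneck effectively has long been a key open problem for high-performance parallel computers. We propose a scalable distributed shared parallel I/O system scheme, develop our own node controller and router chips, and build the prototype system SDSP604. To scale computation, communication, and I/O performance in balance with system size, the system is based on the CC-NUMA architecture and adopts a suitable distributed shared parallel I/O system structure.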

20.

Component-Based Programming with Babel (total citations: 1; self-citations: 1; citations by others: 0)

Tan Huizhi (谭袆炙), Huang Chun, Zhao Kejia. Computer Science, 2006, 33(12): 235-237

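To address the high development difficulty, long development cycles, and high demands on developers in high-performance scientific computing programming, software component technology has begun to be introduced into this field. CCA, proposed jointly by the U.S. Department of Energy, the University of Utah, Indiana University, and others, is one of the projects studying component technology for high-performance scientific computing. This paper introduces CCA and Babel, the language interoperability tool within the CCA framework, describes the use of Babel in detail through the NPB benchmark IS, and analyzes the impact of Babel-based programming on program performance. Preliminary experiments show that Babel effectively solves the language interoperability problem and can play a key role in component programming environments for scientific computing.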