
Modeling and Evaluating the Intel IMCI vgather Instruction Using Stencils

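Cite this article:	LIN Xinhua, WANG Yichao, QIN Qiang, LI Shuo, WEN Minhua, Satoshi Matsuoka. Modeling and evaluating the Intel IMCI vgather instruction using stencils[J]. Computer Engineering & Science, 2016, 38(9): 1741-1747.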

Authors:	LIN Xinhua, WANG Yichao, QIN Qiang, LI Shuo, WEN Minhua, Satoshi Matsuoka

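Affiliations:	1. Center for High Performance Computing, Shanghai Jiao Tong University; 2. Tokyo Institute of Technology; 3. Intel Corporation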

Funding:	National 863 Program of China (2014AA01A302); RONPAKU Fellowship of the Japan Society for the Promotion of Science

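Abstract:	IMCI, the instruction set of the Intel Xeon Phi coprocessor, introduces a hardware-implemented vgather instruction intended to help 512-bit SIMD registers access data at non-contiguous memory addresses. Experimental results show, however, that vgather is likely to become one of the key performance bottlenecks for applications on the Xeon Phi coprocessor. Modeling the performance of vgather can therefore help users understand the performance characteristics of the Xeon Phi coprocessor in depth. Unlike existing approaches, which gather statistics by embedding assembly code in program segments, our method uses performance analysis tools such as PAPI to collect hardware counter results directly as the experimental data for the model. The performance model is built from the number of AGI events and the average latency caused by vgather, which is estimated from the number of VPU_DATA_READ events. The model predicts the total latency caused by vgather in Xeon Phi application code. To verify the accuracy of the prediction, we apply the model to a 3D 7-point stencil code; the prediction shows that vgather accounts for about 40% of total computation time. We then compare this result against the computation time measured after removing vgather by means of intrinsics, and the comparison shows that the model's prediction is accurate. In summary, this work builds a performance model for vgather on the Xeon Phi coprocessor from hardware counter statistics. In addition, a comparison with vgather on other platforms suggests that the model can also be applied to Intel CPU platforms that provide vgather.
Keywords:	performance modeling; vgather; Xeon Phi; hardware counters
Received:	2015-12-11
Revised:	2016-09-25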
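
The abstract describes the model only in words: the number of AGI events stands in for the number of vgather instructions, and an average vgather latency is estimated from VPU_DATA_READ events. A minimal sketch of how those two quantities might combine is given here, assuming the total is simply their product. The function names, the unit (cycles), and the input values are hypothetical, and the abstract does not say how the average latency is derived from VPU_DATA_READ, so it is taken as an input.

```python
def vgather_total_cycles(agi_events, avg_latency_cycles):
    """Estimate cycles lost to vgather: event count times average latency.

    Assumption: the total is a plain product; the paper's exact formula
    is not given in the abstract.
    """
    if agi_events < 0 or avg_latency_cycles < 0:
        raise ValueError("event count and latency must be non-negative")
    return agi_events * avg_latency_cycles


def vgather_share(agi_events, avg_latency_cycles, kernel_cycles):
    """Fraction of total kernel cycles attributed to vgather."""
    if kernel_cycles <= 0:
        raise ValueError("kernel_cycles must be positive")
    return vgather_total_cycles(agi_events, avg_latency_cycles) / kernel_cycles


# Placeholder inputs for illustration only; these are NOT measurements
# from the paper.
share = vgather_share(
    agi_events=1_000_000,
    avg_latency_cycles=5.0,
    kernel_cycles=20_000_000,
)
print(f"estimated vgather share of kernel time: {share:.0%}")
```

In practice the two counter values would come from a profiling run (the abstract names PAPI as the collection tool) rather than from constants.
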
Modeling and evaluating Intel IMCI vgather instruction using stencils

James Lin, WANG Yi-chao, QIN Qiang, LI Shuo, WEN Min-hua, Satoshi Matsuoka. Modeling and evaluating Intel IMCI vgather instruction using stencils[J]. Computer Engineering & Science, 2016, 38(9): 1741-1747.

Authors:	James Lin, WANG Yi-chao, QIN Qiang, LI Shuo, WEN Min-hua, Satoshi Matsuoka

Affiliation:	(1. Center for High Performance Computing, Shanghai Jiao Tong University, Shanghai 200240, China; 2. Tokyo Institute of Technology, Tokyo 152-8550, Japan; 3. Intel Corporation, Portland, OR 97124, USA)

Abstract:	Vgather is a hardware-implemented vector instruction introduced by the Intel Initial Many Core Instructions (IMCI) for Xeon Phi. Its purpose is to help SIMD registers access data at non-contiguous memory locations. However, experimental results show that it can also be one of the key performance bottlenecks on Xeon Phi. We model the performance of vgather by using the profiling tool PAPI to collect hardware performance counter results directly. Address Generation Interlock (AGI) events are profiled as the number of vgather instructions, and the average latency of vgather is estimated from VPU_DATA_READ events; from these two quantities we model the total latency of vgather instructions. We use 3D 7-point (3D7P) stencils to evaluate the model, and the results show that vgather accounts for nearly 40% of total kernel time. To validate the model, we implement a vgather-free version using intrinsics. Our contribution is a model of the Intel IMCI vgather instruction that is built on hardware counters and validated with stencils. The model is also applicable to CPUs.

Keywords:	performance modeling; vgather; Xeon Phi; hardware performance counters
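
For readers unfamiliar with the 3D 7-point stencil used to evaluate the model, a minimal scalar sketch follows. It only shows the access pattern: each interior point is updated from itself and its six face neighbors. The coefficients `c0` and `c1`, the function name, and the choice to copy boundary points unchanged are assumptions for illustration; the paper's kernels are vectorized code on Xeon Phi, not Python.

```python
def stencil_3d7p(grid, c0=0.25, c1=0.125):
    """Apply one sweep of a 3D 7-point stencil to a nested-list grid.

    grid is indexed as grid[z][y][x]. c0 weights the center point and c1
    weights each of the six face neighbors (hypothetical coefficients).
    Boundary points are copied unchanged.
    """
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    # Start from a deep copy so boundary values carry over untouched.
    out = [[row[:] for row in plane] for plane in grid]
    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                neighbors = (
                    grid[z][y][x - 1] + grid[z][y][x + 1]
                    + grid[z][y - 1][x] + grid[z][y + 1][x]
                    + grid[z - 1][y][x] + grid[z + 1][y][x]
                )
                out[z][y][x] = c0 * grid[z][y][x] + c1 * neighbors
    return out


# Small demo: a 3x3x3 grid of ones has a single interior point.
ones = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
result = stencil_3d7p(ones)
print(result[1][1][1])
```

With the default coefficients the weights sum to one (0.25 + 6 × 0.125), so a uniform grid is left unchanged, which makes a convenient sanity check.
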
