Similar Documents
20 similar documents found (search time: 359 ms).
1.
苗旭鹏, 王驭捷, 沈佳, 邵蓥侠, 崔斌. 《软件学报》 (Journal of Software), 2023, 34(9): 4407-4420.
Graph neural networks (GNNs) have recently attracted wide attention for their strong representation power and flexibility. As graph data grow in scale while GPU memory capacity remains limited, training GNNs on traditional general-purpose deep learning systems no longer meets requirements and cannot fully exploit the performance of GPU devices. How to use GPU hardware efficiently for GNN training has therefore become an important research problem in this field. The conventional approach implements GNN computation as sparse matrix multiplication and, when GPU memory is insufficient, distributes the multiplication across devices. Its main shortcomings are: (1) sparse matrix multiplication ignores the sparsity distribution of the graph data itself, so computation is inefficient; (2) it ignores the compute and memory-access characteristics of the GPU, so the hardware is underutilized. To improve training efficiency, some existing work uses graph sampling to reduce the computation cost and memory footprint of each iteration, which also supports flexible distributed scaling, but sampling randomness and variance often hurt the accuracy of the trained model. This paper therefore proposes a high-performance multi-GPU GNN training framework. To preserve model accuracy, it trains on the full graph; it explores different multi-GPU GNN partitioning schemes, studies how different graph data layouts on the GPU affect GPU performance during GNN computation, and proposes a sparse-block-aware GPU memory-access optimization. A prototype was implemented in C++ and cuDNN. Experiments on four large-scale GNN datasets show that: (1) the graph-reordering optimization raises the GPU cache hit rate by about 40%, with a computation speedup of up to 2x; (2) compared with the existing system DGL, the framework achieves a 5.8x overall speedup.
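A note on the graph-reordering result: permuting a CSR adjacency matrix so that frequently touched rows sit close together is a standard way to raise cache hit rates in the sparse-dense multiplication (SpMM) that dominates full-graph GNN aggregation. The degree-sorting sketch below only illustrates this general idea; the paper's sparse-block-aware layout is more elaborate, and the function names here are our own.

```python
import numpy as np
from scipy.sparse import random as sparse_random

def reorder_by_degree(adj_csr):
    """Permute vertices by descending degree so high-degree rows,
    which are touched most often during aggregation, stay cache-hot."""
    degrees = np.diff(adj_csr.indptr)
    perm = np.argsort(-degrees)
    return adj_csr[perm][:, perm], perm

# Toy full-graph aggregation step: H' = A @ H (one message-passing hop).
adj = sparse_random(1000, 1000, density=0.01, format="csr")
feats = np.random.rand(1000, 64)
adj_r, perm = reorder_by_degree(adj)
out = adj_r @ feats[perm]          # aggregation on the reordered graph
assert np.allclose(out[np.argsort(perm)], adj @ feats)
```

Because the permutation is applied symmetrically to rows and columns, the aggregation result is unchanged up to row order, which the final assertion checks.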

2.
Exploiting two characteristics of encryption/decryption algorithms, namely their frequent memory accesses and the one-to-one correspondence between loop execution, loop bounds, and operand data length, this paper proposes an instruction set for fast implementation of multiple algorithms, including a five-stage pipelined hardware implementation of that instruction set. A complete general-purpose security coprocessor prototype system is designed and implemented at both the software and hardware levels. Experiments show that the coprocessor has a sound structure and functions correctly.

3.
The execution model of dataflow architectures matches neural network algorithms closely and can fully exploit the parallelism in the data. However, as neural networks move toward lower precision, dataflow architecture research has not targeted low-precision networks, and deploying a low-precision (INT8, INT4, or lower) neural network on a traditional dataflow architecture runs into three problems. (1) The datapath of the traditional architecture's compute units does not match low-precision data, so the performance and energy-efficiency advantages of low-precision networks cannot be realized. (2) Low-precision data for vectorized parallel computation must be laid out sequentially in on-chip storage, yet it is scattered across the off-chip memory hierarchy, which complicates load and write-back operations; the memory-access units of traditional dataflow architectures cannot support such complex access patterns efficiently. (3) Traditional dataflow architectures hide data transfer latency with double buffering, but when low-precision data is transferred, bandwidth utilization drops sharply, compute latency can no longer cover transfer latency, and the double-buffering mechanism risks breaking down, which hurts performance and energy efficiency. To solve these three problems, DPU_Q, a dataflow accelerator for low-precision neural networks, is designed. First, a flexible, reconfigurable compute unit dynamically reconfigures its datapath according to the precision flag of the instruction, which both supports a variety of low-precision operations efficiently and flexibly and further raises computational parallelism and throughput. In addition, to handle the complex memory access patterns of low-precision networks, a Scatter engine is designed...
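A software analogue of the precision-flag mechanism: one routine unpacks operands from 32-bit words at INT8 or INT4 granularity depending on a per-instruction flag, so the same lanes serve several precisions. This is a minimal sketch of the concept under our own encoding assumptions, not the DPU_Q microarchitecture.

```python
def unpack(word: int, bits: int):
    """Split a 32-bit word into 32//bits signed lanes (LSB lane first)."""
    lanes, mask, sign = 32 // bits, (1 << bits) - 1, 1 << (bits - 1)
    vals = []
    for i in range(lanes):
        v = (word >> (i * bits)) & mask
        vals.append(v - (mask + 1) if v & sign else v)  # sign-extend
    return vals

def dot(word_a: int, word_b: int, precision_flag: int) -> int:
    """Reconfigure the lane width from the instruction's precision flag:
    flag 0 -> INT8 (4 lanes/word), flag 1 -> INT4 (8 lanes/word)."""
    bits = 8 if precision_flag == 0 else 4
    return sum(a * b for a, b in zip(unpack(word_a, bits), unpack(word_b, bits)))

# The same packed words interpreted at two precisions:
a = 0x0102037F   # INT8 lanes (LSB first): [127, 3, 2, 1]
b = 0x01010101   # INT8 lanes: [1, 1, 1, 1]
print(dot(a, b, 0))  # INT8 dot product -> 133
print(dot(a, b, 1))  # INT4 reinterpretation of the same words -> 5
```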

4.
MODV is a general-purpose tool for dynamically verifying memory consistency models. It implements a time-order-based boundary graph algorithm with low time complexity. To further improve MODV's performance, we apply several optimization methods to the algorithm, enabling MODV to effectively verify larger-scale sets of concurrent memory access operations. Experimental results show that, compared with the baseline algorithm, our improved algorithm delivers a substantial performance gain.
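The core check in dynamic consistency verification can be pictured as building an ordering graph over the observed memory operations and rejecting the execution when the required orders form a cycle. The sketch below uses plain DFS cycle detection as a stand-in; MODV's time-order-based boundary graph algorithm is a more efficient specialization.

```python
from collections import defaultdict

def has_cycle(n_ops, edges):
    """Detect a cycle in the ordering graph with iterative DFS;
    a cycle means the observed execution violates the model."""
    graph = defaultdict(list)
    for u, v in edges:
        graph[u].append(v)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = [WHITE] * n_ops
    for start in range(n_ops):
        if color[start] != WHITE:
            continue
        stack = [(start, iter(graph[start]))]
        color[start] = GRAY
        while stack:
            node, it = stack[-1]
            nxt = next(it, None)
            if nxt is None:
                color[node] = BLACK
                stack.pop()
            elif color[nxt] == GRAY:
                return True          # back edge: ordering cycle
            elif color[nxt] == WHITE:
                color[nxt] = GRAY
                stack.append((nxt, iter(graph[nxt])))
    return False

# ops 0..2: program order 0->1, reads-from forces 1->2, and 2->0 closes a cycle.
print(has_cycle(3, [(0, 1), (1, 2), (2, 0)]))  # True -> violation
```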

5.
肖成龙, 林军, 王珊珊, 王宁. 《计算机应用》 (Journal of Computer Applications), 2018, 38(7): 2024-2031.
To address the difficulty of improving performance and reducing power consumption during high-level synthesis (HLS), an automatic custom-instruction identification method for HLS is proposed. Custom instructions are enumerated and selected before the HLS step, providing a generic custom-instruction identification flow for HLS. First, the high-level source code is converted into a control data flow graph (CDFG), preprocessing the source code. Second, on the data flow graphs (DFGs) inside the CDFG, a subgraph enumeration algorithm enumerates all connected convex subgraphs bottom-up, which effectively improves users' ability to modify the constraint conditions flexibly. Then, considering area, performance, and code size, a subgraph selection algorithm picks the best subgraphs as the final custom instructions. Finally, new code regenerated with the selected custom instructions is used as the input to the HLS tool. Compared with conventional HLS, frequency-based pattern selection reduces area by 19.1% on average, and critical-path-based subgraph selection reduces latency by 22.3% on average. Moreover, compared with the TD algorithm, the enumeration efficiency of the proposed algorithm improves by 70.8% on average. Experimental results show that automatic custom-instruction identification lets HLS markedly improve performance and reduce area and code size in circuit design.
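Convexity is the key constraint when a subgraph becomes one custom instruction: no path may leave the subgraph and re-enter it, or the "instruction" would need one of its own results before finishing. A brute-force sketch of enumeration plus the convexity test (illustrative only; the paper's bottom-up enumerator is far more efficient, and the connectivity filter is omitted here for brevity):

```python
from itertools import combinations

def reachable(adj, src):
    """All nodes reachable from src via directed edges."""
    seen, stack = set(), [src]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def is_convex(adj, sub):
    """Convex: no path u -> x -> v with u, v in sub and x outside it."""
    outside_succ = set()
    for u in sub:                      # nodes outside sub fed by sub
        outside_succ |= {v for v in adj.get(u, ()) if v not in sub}
    for x in outside_succ:
        if reachable(adj, x) & sub:    # path re-enters the subgraph
            return False
    return True

def convex_subgraphs(adj, nodes, max_size):
    """Enumerate convex candidate instructions up to max_size nodes."""
    for k in range(2, max_size + 1):
        for sub in map(set, combinations(nodes, k)):
            if is_convex(adj, sub):
                yield sub

# Toy DFG: 0->1->3, 0->2->3. {0,1,3} is NOT convex (path 0->2->3 re-enters).
dfg = {0: [1, 2], 1: [3], 2: [3]}
for s in convex_subgraphs(dfg, [0, 1, 2, 3], 3):
    print(sorted(s))
```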

6.
Research and design of a reconfigurable block cipher processing architecture model (cited 2 times: 0 self-citations, 2 by others)
With the development of information technology and the continuous growth of network scale, applications such as network communication place ever higher demands on data encryption and decryption. Reconfigurable computing combines reconfigurable hardware processing units with software-programmable processors. Applying reconfigurable computing to the design of cipher processing systems therefore allows the same hardware to support multiple algorithms in the cryptographic domain efficiently and flexibly, satisfying the performance and flexibility requirements of cipher processing while improving the security of the cryptographic system. Based on an analysis of the processing structure of block cipher algorithms, and drawing on the design ideas and methods of reconfigurable architectures, this paper proposes RCPA, a reconfigurable cipher processing architecture model, and implements a verification prototype based on it. The prototype was successfully verified and tested on an FPGA, and was logically synthesized, placed, and routed with a 0.18 μm CMOS standard cell library. Experimental results show that block cipher algorithms running on the RCPA prototype achieve high performance: 10-20x that of general-purpose high-performance microprocessors, and 1.1-5.1x that of some dedicated reconfigurable cipher processing architectures. The results indicate that the RCPA model preserves the flexibility needed by block cipher applications while delivering high performance.

7.
杨永滔, 王意洁. 《软件学报》 (Journal of Software), 2012, 23(3): 550-564.
This paper studies q-skyline computation over probabilistic data streams. In contrast to existing methods, which only support the sliding-window stream model, the proposed method supports the more general n-of-N stream model. It does so by transforming q-skyline queries into stabbing queries on an interval tree. The PnNM algorithm is proposed to maintain the data structures required by the n-of-N model, efficiently handling maintenance work such as updates to the candidate set of uncertain objects and interval updates; the PnNCont algorithm implements continuous query processing. Theoretical analysis and experimental results show that the algorithms effectively support q-skyline query processing under the n-of-N model over probabilistic data streams.
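The reduction rests on interval stabbing: each candidate object is stored with the interval of window sizes n for which it is an answer, and a query for a particular n reports the intervals containing ("stabbed" by) n. A naive linear-scan sketch stands in for the interval tree here; the names and payload format are our own.

```python
class StabbingIndex:
    """Toy stabbing-query index: store [lo, hi] intervals, report those
    containing a query point. An interval tree answers this in
    O(log n + k); the linear scan below just shows the semantics."""
    def __init__(self):
        self.intervals = []          # (lo, hi, payload)

    def insert(self, lo, hi, payload):
        self.intervals.append((lo, hi, payload))

    def stab(self, point):
        return [p for lo, hi, p in self.intervals if lo <= point <= hi]

# Each object is a q-skyline answer only for window sizes n in its interval.
idx = StabbingIndex()
idx.insert(1, 50, "obj_a")   # in the q-skyline for n in [1, 50]
idx.insert(30, 80, "obj_b")
print(idx.stab(40))          # query n = 40 -> ['obj_a', 'obj_b']
print(idx.stab(70))          # query n = 70 -> ['obj_b']
```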

8.
陈向, 沈立, 李家文. 《计算机科学》 (Computer Science), 2011, 38(5): 290-294.
SIMD instructions exploit data-level parallelism efficiently, so almost all current general-purpose microprocessors support them. However, inherent characteristics of applications and algorithms, such as unaligned memory addresses, non-contiguous memory accesses, and control flow, force the compiler or programmer to use permutation instructions to recombine the elements of vectors before the operands meet the layout required by SIMD instructions. These redundant permutation instructions have become the main performance bottleneck in exploiting data-level parallelism. This paper proposes an automatic algorithm for generating and optimizing data permutation instructions that effectively reduces the performance loss they cause. The algorithm is based on a newly proposed intermediate representation that carries enough operand address information, so the generation of permutation instructions can be cast as identifying conflict edges in a data flow graph, and their optimization as deleting all conflict edges with the fewest permutation instructions. Tests on a set of typical multimedia programs show that the proposed algorithm achieves an average speedup of 7%.
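A minimal model of a conflict edge: a DFG edge whose producer writes vector lanes in one order while its consumer expects another, so a shuffle is needed exactly there and nowhere else. The toy lowering pass below works on a hypothetical IR of our own design, not the paper's intermediate representation:

```python
def lower_edges(edges):
    """Each DFG edge carries the producer's lane order and the order the
    consumer needs. Emit a permutation only on conflict edges, where the
    two orders differ; matching orders need no shuffle at all."""
    code = []
    for src, dst, have, want in edges:
        if have == want:
            code.append(f"{dst} = {src}")                 # no conflict
        else:
            perm = [have.index(lane) for lane in want]    # shuffle selector
            code.append(f"{dst} = shuffle({src}, {perm})")
    return code

edges = [
    ("v0", "v1", (0, 1, 2, 3), (0, 1, 2, 3)),  # aligned: no permutation
    ("v0", "v2", (0, 1, 2, 3), (2, 3, 0, 1)),  # conflict edge: swap halves
]
print("\n".join(lower_edges(edges)))
```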

9.
Traditional RTL-level FPGA development involves high programming difficulty and long design times, which has limited the use of FPGAs in molecular dynamics simulation. Using HLS, the new-generation FPGA programming approach, this paper designs and implements an accelerator for short-range non-bonded forces in molecular dynamics on reconfigurable computing hardware based on the Alveo U50 board. Starting from the design and optimization of the particle pairing unit and the design of the compute pipeline, a reconfigurable computing method with high efficiency and low energy consumption is developed. For the dynamic data flow present in non-bonded force computation, an HLS + HDL design methodology is proposed, which greatly shortens design time while preserving the accelerator's performance.
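The particle-pairing step such an accelerator optimizes can be sketched in software: bin particles into cells of cutoff width so that each particle only tests the 27 neighboring cells, then evaluate the short-range pair force inside the cutoff. The Lennard-Jones potential below is our assumed example of a non-bonded force; the sketch assumes at least three cells per dimension.

```python
import numpy as np

def lj_force(r2, eps=1.0, sigma=1.0):
    """Lennard-Jones force magnitude divided by r, from squared distance r2."""
    sr6 = (sigma * sigma / r2) ** 3
    return 24.0 * eps * (2.0 * sr6 * sr6 - sr6) / r2

def pair_forces(pos, box, rc):
    """Cell-list pairing: only particles in adjacent cells are tested."""
    ncell = max(3, int(box // rc))        # >= 3 cells so offsets stay distinct
    size = box / ncell
    cells = {}
    for i, p in enumerate(pos):
        cells.setdefault(tuple((p // size).astype(int) % ncell), []).append(i)
    forces = np.zeros_like(pos)
    for (cx, cy, cz), members in cells.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    nb = ((cx+dx) % ncell, (cy+dy) % ncell, (cz+dz) % ncell)
                    for i in members:
                        for j in cells.get(nb, ()):
                            if j <= i:
                                continue              # count each pair once
                            d = pos[i] - pos[j]
                            d -= box * np.round(d / box)   # periodic image
                            r2 = float(d @ d)
                            if r2 < rc * rc:
                                f = lj_force(r2) * d
                                forces[i] += f
                                forces[j] -= f
    return forces

pos = np.random.rand(200, 3) * 10.0
print(pair_forces(pos, box=10.0, rc=2.5).shape)   # (200, 3)
```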

10.
The nature of math library function algorithms leads to heavy memory access inside the functions, while the slave cores of current heterogeneous many-core architectures access data through shared main memory, which severely limits the slave cores' memory access speed. As a result, the performance of math library functions on heterogeneous many-core architectures cannot meet the demands of high-performance computing. To solve this problem effectively, a scheduling strategy based on memory access instructions is proposed, which hides memory access latency inside computation latency to improve the performance of functions in an assembly-implemented math library; in addition, combining dynamic invocation with the slave core's local data memory (LDM), an ldm_call algorithm is proposed to raise memory access speed. Both optimizations are generally applicable under shared-memory architectures and effectively reduce the memory access overhead of the functions. Experiments show that the two techniques improve function performance by 16.08% and 37.32% on average, respectively.
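The scheduling idea, hiding memory latency under compute latency, is the classic double-buffer pattern: while block k is processed out of a fast local buffer (the LDM analogue), block k+1 is already being fetched. A threaded Python sketch of the pattern, not the library's assembly-level scheduler:

```python
from concurrent.futures import ThreadPoolExecutor
import math

def slow_load(data, lo, hi):
    """Stand-in for a DMA transfer from shared main memory into local storage."""
    return data[lo:hi]

def compute(block):
    return [math.sin(x) for x in block]    # stand-in for the math kernel

def pipelined(data, blk):
    out = []
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(slow_load, data, 0, blk)      # prime buffer 0
        for lo in range(0, len(data), blk):
            cur = pending.result()                          # wait for block k
            nxt = lo + blk
            if nxt < len(data):                             # prefetch block k+1
                pending = dma.submit(slow_load, data, nxt, nxt + blk)
            out.extend(compute(cur))                        # overlaps the fetch
    return out

data = [i * 0.001 for i in range(10_000)]
assert pipelined(data, 512) == [math.sin(x) for x in data]
```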

11.
Dynamic connection-graph changes are inherent in networks such as peer-to-peer networks, wireless ad hoc networks, and wireless sensor networks. Considering the influence of these frequent graph changes is thus essential for precisely assessing the performance of applications and algorithms on such networks. In this paper, using stochastic hybrid systems (SHSs), we model the dynamics and analyze the performance of an epidemic-like algorithm, distributed random grouping (DRG), for average aggregate computation on a wireless sensor network with dynamic graph changes. In particular, we derive the convergence criteria and the upper bounds on the running time of the DRG algorithm for a set of graphs that are individually disconnected but jointly connected in time. An effective technique for computing a key parameter in the derived bounds is also developed. Numerical results, together with an application of our analytical results to controlling the graph sequences, are presented to exemplify our analysis.
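A minimal simulation of the DRG idea: in each round, some nodes elect themselves group leaders, their one-hop neighbors join, and every group replaces its members' values with the group average, so repeated random grouping drives all values toward the global average. This sketch assumes a static graph and a simplified leader election, unlike the dynamically changing graphs analyzed in the paper.

```python
import random

def drg_round(values, adj, p_leader=0.3):
    """One distributed-random-grouping round: leaders form one-hop groups,
    and each group averages its members' values in place (sum-preserving)."""
    leaders = [u for u in values if random.random() < p_leader]
    grouped = set()
    for u in leaders:
        group = [u] + [v for v in adj[u] if v not in grouped and v not in leaders]
        grouped.update(group)
        avg = sum(values[v] for v in group) / len(group)
        for v in group:
            values[v] = avg

# Ring of 8 nodes; repeated rounds converge to the global average 3.5.
adj = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
vals = {i: float(i) for i in range(8)}
for _ in range(200):
    drg_round(vals, adj)
print({k: round(v, 3) for k, v in vals.items()})  # all ~3.5
```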

12.
We propose a novel solution schema called the Hierarchical Labeling Schema (HLS) to answer reachability queries in directed graphs. Different from many existing approaches that focus on static directed acyclic graphs (DAGs), our schema focuses on directed cyclic graphs (DCGs) where vertices or arcs could be added to a graph incrementally. Unlike many of the traditional approaches, HLS does not require the graph to be acyclic in constructing its index. Therefore, it can in fact be applied to both DAGs and DCGs. When vertices or arcs are added to a graph, HLS is capable of updating the index incrementally instead of re-computing the index from scratch each time, making it more efficient than many other approaches in practice. The basic idea of HLS is to create a tree for each vertex in a graph and link the trees together so that whenever two vertices are given, we can immediately know whether there is a path between them by referring to the appropriate trees. We conducted extensive experiments on both real-world datasets and synthesized datasets. We compared the performance of HLS, in terms of index construction time, query processing time, and space consumption, with two state-of-the-art methodologies, the path-tree method and the 3-hop method. We also conducted simulations to model the situation when a graph is updated incrementally. The performance comparison of different algorithms against HLS on static graphs has also been studied. Our results show that HLS is highly competitive in practice and is particularly useful in cases where the graphs are updated frequently.
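The incremental flavor of HLS can be shown with the simplest possible index, a per-vertex descendant set that is patched on every arc insertion instead of being rebuilt. This mirrors only the update-in-place idea (HLS's linked trees are far more space-efficient) but, like HLS, it is happy with cycles:

```python
class IncrementalReach:
    """Toy incremental reachability index: desc[u] is the set of vertices
    reachable from u. Arc insertions patch the index in place, which is
    the spirit (not the mechanics) of HLS."""
    def __init__(self, n):
        self.desc = [{u} for u in range(n)]

    def add_arc(self, u, v):
        new = self.desc[v]
        for w in range(len(self.desc)):
            if u in self.desc[w]:      # w reaches u, so w now reaches desc[v]
                self.desc[w] |= new

    def reachable(self, u, v):
        return v in self.desc[u]

g = IncrementalReach(4)
g.add_arc(0, 1); g.add_arc(1, 2)
print(g.reachable(0, 2))   # True
g.add_arc(2, 0)            # an arc closing a cycle is fine
print(g.reachable(2, 1))   # True
```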

13.
Real-time computer vision systems often make use of dedicated image processing hardware to perform the pixel-oriented operations typical of early vision. This type of hardware is notoriously difficult to program, limiting the types of experiments that can be performed and posing a serious obstacle to research progress. This paper describes a pair of programming tools that we have developed to simplify the task of building real-time early vision systems using special-purpose hardware. The system allows users to describe computations in terms of coarse-grained dataflow graphs constructed using an interactive graphical tool. At initialization time it compiles these graphs into efficient executable programs for the underlying hardware. The system has been implemented on a popular commercial pipelined image processor. We describe the computational model that the system supports, the facilities it provides for building real-time vision applications, and the algorithms used to generate effective execution schedules for the target machine.
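Compiling such a coarse-grained dataflow graph reduces to ordering operators so that every producer runs before its consumers; grouping operators by readiness also shows which of them can share a pipeline stage. A generic sketch using Kahn's algorithm, not the paper's scheduler:

```python
from collections import deque

def schedule(graph):
    """Kahn's topological sort, grouping dataflow operators into levels:
    all operators in one level have their inputs ready simultaneously,
    so they can occupy the same pipeline stage."""
    indeg = {u: 0 for u in graph}
    for u in graph:
        for v in graph[u]:
            indeg[v] += 1
    level = deque(u for u in graph if indeg[u] == 0)
    stages = []
    while level:
        stages.append(list(level))
        nxt = deque()
        for u in level:
            for v in graph[u]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    nxt.append(v)
        level = nxt
    return stages

# camera -> smooth -> edges -> merge, with a parallel threshold branch
dfg = {"grab": ["smooth", "thresh"], "smooth": ["edges"],
       "thresh": ["merge"], "edges": ["merge"], "merge": []}
print(schedule(dfg))  # [['grab'], ['smooth', 'thresh'], ['edges'], ['merge']]
```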

14.
Popular distributed graph processing frameworks, such as Pregel and GraphLab, are based on the vertex-centric computation model, where users write their customized Compute function for each vertex to process the data iteratively. Vertices are evenly partitioned among the compute nodes, and the granularity of parallelism of the graph algorithm is normally tuned by adjusting the number of compute nodes. The vertex-centric model splits the computation into phases. Inside one specific phase, the computation proceeds as an embarrassingly parallel process, because no communication between compute nodes occurs. By default, current graph engines handle only one iteration of the algorithm per phase. However, in this paper, we find that we can also tune the granularity of parallelism by aggregating the computation of multiple iterations into one phase, which has a significant impact on the performance of the graph algorithm. In the ideal case, if all computations are handled in one phase, the whole algorithm turns into an embarrassingly parallel algorithm and the benefit of parallelism is maximized. Based on this observation, we propose two approaches, a function-based approach and a parameter-based approach, to automatically transform a Pregel algorithm into a new one with tunable granularity of parallelism. We study the cost of such transformation and the trade-off between the granularity of parallelism and the performance. We provide a new direction to tune the performance of parallel algorithms. Finally, the approaches are implemented in our graph processing system, N2, and we illustrate their performance using popular graph algorithms.
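The knob the paper exposes can be approximated as follows: a vertex-centric engine that runs k local iterations per communication phase instead of one, refreshing cross-partition values only at phase boundaries. The toy averaging job below is our own stale-value approximation of the idea (the paper's transformations preserve exact semantics); raising k cuts synchronizations at some accuracy cost.

```python
def run(values, edges, parts, k, phases):
    """Vertex-centric lazy averaging where each phase runs k local iterations.
    Cross-partition neighbor values are refreshed only at phase boundaries
    (they stay stale inside a phase), so a larger k means fewer syncs."""
    syncs = 0
    for _ in range(phases):
        snapshot = dict(values)                 # communication: exchange state
        syncs += 1
        for _ in range(k):                      # k iterations, no communication
            new = {}
            for u in values:
                def val(v):
                    return values[v] if parts[v] == parts[u] else snapshot[v]
                nbrs = edges[u]
                new[u] = (values[u] + sum(val(v) for v in nbrs)) / (1 + len(nbrs))
            values = new
    return values, syncs

edges = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}   # 4-cycle
parts = {0: "A", 1: "A", 2: "B", 3: "B"}
vals = {0: 0.0, 1: 4.0, 2: 8.0, 3: 12.0}
out, syncs = run(vals, edges, parts, k=4, phases=5)
print(syncs, out)   # 5 synchronizations instead of 20; values settle near 6.0
```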

15.
The abundant hardware resources on current reconfigurable computing systems provide new opportunities for high-performance parallel implementations of scientific computations. In this paper, we study designs for floating-point matrix multiplication, a fundamental kernel in a number of scientific applications, on reconfigurable computing systems. We first analyze design trade-offs in implementing this kernel. These trade-offs are caused by the inherent parallelism of matrix multiplication and the resource constraints, including the number of configurable slices, the size of on-chip memory, and the available memory bandwidth. We propose three parameterized algorithms which can be tuned according to the problem size and the available hardware resources. Our algorithms employ linear array architecture with simple control logic. This architecture effectively utilizes the available resources and reduces routing complexity. The processing elements (PEs) used in our algorithms are modular so that it is easy to embed floating-point units into them. Experimental results on a Xilinx Virtex-II Pro XC2VP100 show that our algorithms achieve good scalability and high sustained GFLOPS performance. We also implement our algorithms on Cray XD1. XD1 is a high-end reconfigurable computing system that employs both general-purpose processors and reconfigurable devices. Our algorithms achieve a sustained performance of 2.06 GFLOPS on a single node of XD1.
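The central trade-off the three parameterized algorithms expose, on-chip storage versus memory bandwidth, already shows up in a software sketch of blocked matrix multiplication: the tile size b plays the role of the on-chip buffer, and every element of A and B is re-read n/b times rather than n. A generic illustration, not the paper's linear-array design:

```python
import numpy as np

def blocked_matmul(A, B, b):
    """Blocked matmul: with b*b tiles held 'on chip', each element of A/B
    is re-read n/b times, so larger tiles trade buffer space for bandwidth."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):     # accumulate tile products
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C

A, B = np.random.rand(64, 64), np.random.rand(64, 64)
assert np.allclose(blocked_matmul(A, B, 16), A @ B)
```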

16.
Graphical processing units (GPUs) have recently attracted attention for scientific applications such as particle simulations. This is partially driven by low commodity pricing of GPUs but also by recent toolkit and library developments that make them more accessible to scientific programmers. We discuss the application of GPU programming to two significantly different paradigms: regular mesh field equations with unusual boundary conditions and graph analysis algorithms. The differing optimization techniques required for these two paradigms cover many of the challenges faced when developing GPU applications. We discuss the relevance of these application paradigms to simulation engines and games. GPUs were aimed primarily at the accelerated graphics market but since this is often closely coupled to advanced game products it is interesting to speculate about the future of fully integrated accelerator hardware for both visualization and simulation combined. As well as reporting the speed-up performance on selected simulation paradigms, we discuss suitable data-parallel algorithms and present code examples for exploiting GPU features like large numbers of threads and localized texture memory. We find a surprising variation in the performance that can be achieved on GPUs for our applications and discuss how these findings relate to past known effects in parallel computing such as memory speed-related super-linear speed-up.

17.
18.
Public-key cryptography plays a vital role in the security of global digital information systems. However, with progress in quantum computer research and the advent of algorithms such as Shor's, the security of public-key cryptography faces a potentially severe threat. Cryptographic algorithms that can resist quantum computer attacks have therefore drawn the attention of the cryptographic community, and the US National Institute of Standards and Technology (NIST) launched a global competition to standardize post-quantum cryptography (PQC) algorithms. Among the candidates, lattice-based algorithms strike a good balance among security, public/private key sizes, and computation speed, making them the most promising family of post-quantum encryption schemes. CRYSTALS-KYBER, a lattice-based key encapsulation mechanism (KEM), passed the competition's three rounds of selection. For post-quantum algorithms, hardware implementation efficiency is an important evaluation criterion. This paper therefore uses high-level synthesis (HLS) to explore the implementation and optimization space of hardware designs for the three main modules of CRYSTALS-KYBER (key generation, key encapsulation, and key decapsulation) under different parameter sets. As a fast and convenient circuit design method, HLS supports efficient exploration of the hardware implementations of different algorithms. Using this tool, we analyze the software code of CRYSTALS-KYBER, try different combinations of strategies to optimize the HLS implementation results, and finally obtain an optimized circuit structure. We also wrote a tcl-perl co-script to search for the best optimization strategy automatically. Experimental results show that moderate loop optimization and timing constraints can greatly improve the performance of the KYBER circuits produced by HLS. Compared with existing software implementations, our design shows a clear performance advantage. Compared with prior HLS implementations, our optimization of Kyber-512 improves encapsulation performance by 75% and decapsulation performance by 55.1%; compared with the baseline data, key generation performance improves by 44.2%. Similar optimization results are obtained for the other two parameter sets of CRYSTALS-KYBER (Kyber-768 and Kyber-1024).
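The arithmetic kernel that all three Kyber modules spend their time in is multiplication in the ring R_q = Z_q[X]/(X^256 + 1) with q = 3329. Real implementations, in hardware or software, use the number-theoretic transform (NTT); the schoolbook sketch below only shows the ring semantics the synthesized circuit must realize:

```python
KYBER_Q, KYBER_N = 3329, 256

def ring_mul(a, b):
    """Schoolbook product in Z_q[X]/(X^n + 1): X^n wraps around as -1,
    which is why the high-degree terms are subtracted."""
    c = [0] * (2 * KYBER_N)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return [(c[k] - c[k + KYBER_N]) % KYBER_Q for k in range(KYBER_N)]

# (X^255) * X = X^256 = -1 mod (X^256 + 1)
x255 = [0] * KYBER_N; x255[255] = 1
x1 = [0] * KYBER_N; x1[1] = 1
out = ring_mul(x255, x1)
print(out[0] == KYBER_Q - 1)   # True: constant term is -1 mod q
```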

19.
Fault tolerance and scalability are important considerations in the design of sensor network applications. Data aggregation is an essential operation in sensor networks. Multiple techniques have been proposed recently to tackle the issues of scalability and fault tolerance of aggregation in sensor networks. In this article, we analyze the impact of using a few of the more reliable, though expensive, nodes, such as the Intel XScale, called microservers, in addition to the standard motes, on the fault tolerance and scalability of the aggregation algorithms in sensor networks. In particular, we propose a simple model that captures the essence of tree aggregation in such heterogeneous sensor networks. We validate this theoretical model with simulation results. We also study the effective impact on the sustainable probability of failure and perform a cost-benefit analysis. We also show how hybrid aggregation can be utilized instead of tree aggregation to improve the performance of aggregation in heterogeneous sensor networks. We show that our work can be applied to effectively optimize the use of expensive hardware while designing fault-tolerant, distributed sensor networks.

20.
Session-based recommendation starts from the target user's current session and learns the dependencies among items from auxiliary information such as item categories, cross-session context, and multiple kinds of user behavior, so as to capture the user's long- and short-term preferences for personalized recommendation. In recent years, the popular family of deep learning methods has become the frontier of this research hotspot, and the introduction of graph neural networks in particular has further improved the performance of session-based recommender systems. In view of this, the survey starts from the problem definition and the factors involved in session-based recommendation and analyzes how session graphs are constructed; it groups related work into session-based recommender systems built on graph convolutional networks, gated graph neural networks, graph attention networks, and other GNN architectures, then summarizes and compares them; and it analyzes in depth the loss functions, the chosen datasets, and the model evaluation metrics used in the experimental sections of each work. Focusing on algorithmic principles and performance analysis, the survey evaluates and organizes each model framework, aiming to review, summarize, and offer an outlook on roughly the last five years of work on GNN-based session recommender systems.
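The graph-construction step these systems share can be sketched directly: a click sequence becomes a small directed graph whose nodes are the session's distinct items and whose weighted edges count consecutive transitions; a GCN, GGNN, or GAT layer then consumes this adjacency. The version below is one common construction, not the only one found in the surveyed work:

```python
from collections import Counter

def session_graph(clicks):
    """Build a directed session graph: one node per distinct item,
    one weighted edge per consecutive click transition."""
    nodes = sorted(set(clicks))
    idx = {item: i for i, item in enumerate(nodes)}
    edges = Counter((idx[a], idx[b]) for a, b in zip(clicks, clicks[1:]))
    # Row-normalized adjacency, the usual input to a GNN layer.
    n = len(nodes)
    adj = [[0.0] * n for _ in range(n)]
    for (i, j), w in edges.items():
        adj[i][j] = float(w)
    for i in range(n):
        s = sum(adj[i])
        if s:
            adj[i] = [w / s for w in adj[i]]
    return nodes, adj

# A session that revisits item "a": the repeat becomes edge weight, not a new node.
nodes, adj = session_graph(["a", "b", "c", "a", "b"])
print(nodes)   # ['a', 'b', 'c']
print(adj)     # row 'a' puts weight 1.0 on 'b' (2 of its 2 transitions)
```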
