Similar Documents
20 similar documents found (search time: 218 ms)
1.
Unstructured-grid computation on the Chinese heterogeneous many-core platform Sunway TaihuLight is characterized by sparse storage, scattered memory accesses, and data dependences, which severely limit the performance of the many-core processor. To address sparse storage and scattered access, an N-order diagonal coloring algorithm is proposed that balances work between the management processing element (MPE) and the computing processing elements (CPEs) and uses the CPEs to turn global memory accesses into accesses to their local data memory (LDM). To handle the write contention caused by data dependences, an adaptive, dependence-free task partitioning method avoids data conflicts during parallel computation. To fit the processor architecture and unstructured-grid computation, the MPE and CPEs run asynchronously in parallel and are used in differentiated roles so as to fully exploit the hardware; the register communication mechanism provided by the processor is abandoned, which lowers synchronization overhead in the CPE array and eases porting to the next-generation Sunway platform. In addition, asynchronous overlap of computation and memory access is used to hide access latency. Experiments on the SpMV, Integration, and calcLudsFcc kernels show that, compared with the MPE-only implementation, the combined acceleration scheme achieves an average 10x speedup across problem sizes, up to 24x at best, and the N-order diagonal coloring algorithm outperforms the uncolored blocking algorithm by more than 5.8x, effectively improving data locality and parallelism. The algorithm also accelerates kernels with dependence-induced write contention, validating the adaptive, dependence-free task partitioning method.
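The N-order diagonal coloring itself is not spelled out in the abstract; the sketch below shows only the general coloring idea it builds on, assuming a face-based unstructured-mesh kernel (data layout and names are mine, and the Sunway-specific CPE offload is omitted):

```c
/* A minimal sketch of coloring-based conflict elimination for a face-based
 * unstructured-mesh kernel; the paper's N-order diagonal coloring and the
 * Sunway CPE offload are not reproduced here. */
typedef struct { int owner, neighbour; double flux; } Face;

/* Accumulate per-face fluxes into per-cell residuals, one color at a time.
 * color_off[c] .. color_off[c+1] delimits the faces of color c; within a
 * color no two faces share a cell, so the inner loop has no write
 * conflicts and could be distributed across slave cores in parallel. */
void accumulate_by_color(const Face *faces, const int *color_off,
                         int ncolors, double *residual)
{
    for (int c = 0; c < ncolors; ++c) {
        for (int f = color_off[c]; f < color_off[c + 1]; ++f) {
            residual[faces[f].owner]     += faces[f].flux;  /* disjoint     */
            residual[faces[f].neighbour] -= faces[f].flux;  /* within color */
        }
    }
}
```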

2.
As the data volumes to be processed in many fields keep growing, data-intensive applications are receiving increasing attention. This paper proposes PSRAM(h), a new parallel program execution model that incorporates information such as the memory access hierarchy and access conflicts. Since data-intensive applications are dominated by memory accesses, PSRAM(h) reduces program execution time to memory access time: it predicts the execution time of a serial program by analyzing the memory hierarchy level and number of accesses of each program segment, and predicts the execution time of a parallel program as the maximum of the per-thread execution times. Applying PSRAM(h) to matrix-vector multiplication, a typical data-intensive application, tests on a Loongson 3A processor and an Intel Xeon E5520 processor show that the model's predictions deviate from measurements by less than 20% in most cases. Thus, for data-intensive applications, PSRAM(h) not only gives a lower bound on execution time but also predicts it effectively.
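The abstract describes the model only in words; one plausible formalization (notation mine, not the paper's) is:

```latex
% Plausible formalization of the PSRAM(h) prediction (notation mine):
% a serial program is charged only for its memory accesses, and a
% parallel program is bounded by its slowest thread.
T_{\text{serial}} = \sum_{i=1}^{s} n_i \, t_{\ell(i)},
\qquad
T_{\text{parallel}} = \max_{1 \le j \le p} T_j
```

where segment $i$ issues $n_i$ accesses to memory-hierarchy level $\ell(i)$ with per-access latency $t_\ell$, and $T_j$ is the predicted execution time of thread $j$.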

3.
张桢  梁军  贾海鹏  张云泉  李青 《计算机工程》2023,(4):159-165+173
The wide deployment of RISC-V processors makes a high-performance implementation of the FFmpeg multimedia library on RISC-V platforms increasingly important. A series of optimization strategies for the RISC-V architecture is proposed: for algorithms of different characteristics and computational densities in the open-source FFmpeg audio/video library, the extensibility of the RISC-V instruction set is exploited to accelerate and parallelize several time-consuming algorithms. Building on an in-depth study of the open RISC-V architecture, a high-performance FFmpeg algorithm library for RISC-V is constructed. For algorithms with non-contiguous memory accesses, data dependences, and fast data conversion, performance on RISC-V processors is substantially improved from four angles: vector-unit configuration, vectorized memory access, assembly-level optimization, and instruction pipelining. Experimental results show that the optimized FFmpeg library performs markedly better on the RISC-V-based XT-910 chip, with speedups of 8.20, 3.67, and 3.62 for the non-contiguous-access, data-dependent, and fast-data-conversion algorithm classes, respectively.
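The abstract names vector-unit configuration and vectorized access as two of the four angles; below is a hedged sketch of vectorizing a non-contiguous (strided) access pattern with RISC-V vector intrinsics. Intrinsic names follow the upstream RVV intrinsics specification and vary across toolchain versions, and the XT-910 implements an early draft of the vector extension, so treat this as illustrative only:

```c
/* Strided-load sketch: extract one interleaved channel (e.g. the U samples
 * of a UYVY stream) into a contiguous buffer, strip-mined by vsetvl. */
#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

void gather_channel(const uint8_t *src, ptrdiff_t stride,
                    uint8_t *dst, size_t n)
{
    while (n > 0) {
        size_t vl = __riscv_vsetvl_e8m1(n);            /* configure vector unit */
        vuint8m1_t v = __riscv_vlse8_v_u8m1(src, stride, vl); /* strided load   */
        __riscv_vse8_v_u8m1(dst, v, vl);               /* contiguous store     */
        src += (ptrdiff_t)vl * stride;
        dst += vl;
        n   -= vl;
    }
}
```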

4.
倪鸿  刘鑫 《计算机工程》2019,45(6):45-51
To address scattered memory accesses of unstructured-grid computations in high-performance computing, a general many-core optimization algorithm based on sorting is proposed for the Sunway TaihuLight supercomputer, exploiting the architecture of its heterogeneous many-core SW26010 processor to reduce random memory accesses in unstructured-grid computation. Based on mesh-partitioning principles, the nonzero elements of the generated sparse matrix are reordered in parallel in O(n) time. An internal mapping is used to extend or transform the computation vector, turning fine-grained memory accesses into write-conflict-free coarse-grained accesses. Many-core optimization of flux computations in several real application cases shows that, compared with the serial algorithm on the management processing element, the algorithm achieves an average speedup of more than 10x.
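The abstract does not spell out the reordering; here is a generic O(n) counting-sort pass in its spirit, bucketing COO nonzeros by row block so each block can later be assigned to one core without write conflicts (block size and data layout are illustrative choices, not the paper's):

```c
#include <stdlib.h>

typedef struct { int row, col; double val; } NZ;

/* Stable counting sort of nonzeros into row blocks, O(n) time. */
void sort_by_row_block(const NZ *in, NZ *out, int n, int block_rows, int nblocks)
{
    int *count = calloc((size_t)nblocks + 1, sizeof *count);
    for (int i = 0; i < n; ++i)               /* histogram of block ids      */
        ++count[in[i].row / block_rows + 1];
    for (int b = 0; b < nblocks; ++b)         /* prefix sum -> bucket starts */
        count[b + 1] += count[b];
    for (int i = 0; i < n; ++i)               /* stable scatter into buckets */
        out[count[in[i].row / block_rows]++] = in[i];
    free(count);
}
```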

5.
Color-space conversion, image scaling, and image filtering are common image processing algorithms, widely used in digital media, data communication, biomedicine, aerospace, and other fields. Although the open-source OpenCV library provides these algorithms on ARM processors, a high-performance image processing library matching the precision of the Intel IPP library has been lacking. Based on their computation and memory access characteristics, these algorithms are classified into three categories: data-independent, data-sharing, and irregular-access algorithms, and an optimization methodology for each category on the ARMv8 platform is proposed. A high-performance image processing library for ARMv8 is then built, matching the precision of Intel IPP, and a series of optimizations, including algorithmic optimization, memory access optimization, SIMD optimization, and assembly-level optimization, substantially improves performance. Experimental results on the Huawei Kunpeng 920 platform show that the heavily optimized CvtColor, Filter, and Resize modules all significantly outperform their OpenCV counterparts.
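The abstract classifies CvtColor as data-independent; a minimal NEON sketch of that category is shown below, converting RGB888 to grayscale eight pixels at a time. The fixed-point weights 77/150/29 (roughly 0.299/0.587/0.114 scaled by 256) and the tail handling are illustrative choices, not the library's actual kernel:

```c
#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

void rgb_to_gray(const uint8_t *rgb, uint8_t *gray, size_t npixels)
{
    size_t i = 0;
    for (; i + 8 <= npixels; i += 8) {
        uint8x8x3_t p = vld3_u8(rgb + 3 * i);          /* deinterleave R,G,B */
        uint16x8_t acc = vmull_u8(p.val[0], vdup_n_u8(77));
        acc = vmlal_u8(acc, p.val[1], vdup_n_u8(150));
        acc = vmlal_u8(acc, p.val[2], vdup_n_u8(29));
        vst1_u8(gray + i, vshrn_n_u16(acc, 8));        /* >>8, narrow to u8 */
    }
    for (; i < npixels; ++i)                           /* scalar tail */
        gray[i] = (uint8_t)((77 * rgb[3*i] + 150 * rgb[3*i+1]
                             + 29 * rgb[3*i+2]) >> 8);
}
```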

6.
Stream processors are widely used in scientific research, defense, and everyday applications, but as the demand for computing speed keeps growing, existing stream processors can no longer keep up. Analysis shows that memory access time is what limits the rate at which a stream processor can process data, so reducing the number of memory accesses is the key to improving its computing speed. By introducing the CORDIC algorithm and building a transcendental-function unit, the number of memory accesses made while processing stream applications is reduced, thereby increasing the stream processor's computing speed.
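The abstract only names CORDIC; the standard rotation-mode variant is sketched below to show why it suits such a unit: sine and cosine come out of shifts, adds, and a 16-entry constant table, with no large in-memory lookup tables. This is the textbook algorithm in Q16.16 fixed point, not the paper's hardware design:

```c
#include <stdint.h>

#define CORDIC_ITERS 16
static const int32_t atan_tab[CORDIC_ITERS] = {  /* atan(2^-i) in Q16.16 */
    51472, 30386, 16055, 8150, 4091, 2047, 1024, 512,
    256, 128, 64, 32, 16, 8, 4, 2
};
#define CORDIC_K 39797  /* 0.60725... in Q16.16: prod of 1/sqrt(1+2^-2i) */

/* Rotation-mode CORDIC; angle in Q16.16 radians, |angle| <= pi/2. */
void cordic_sincos(int32_t angle, int32_t *s, int32_t *c)
{
    int32_t x = CORDIC_K, y = 0, z = angle;
    for (int i = 0; i < CORDIC_ITERS; ++i) {
        int32_t xs = x >> i, ys = y >> i;
        if (z >= 0) { x -= ys; y += xs; z -= atan_tab[i]; }
        else        { x += ys; y -= xs; z += atan_tab[i]; }
    }
    *c = x;   /* cos(angle) in Q16.16 */
    *s = y;   /* sin(angle) in Q16.16 */
}
```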

7.
印乐  黄磊 《软件学报》2013,24(10):2289-2299
May-happen-in-parallel (MHP) analysis computes which statements of a parallel program may execute concurrently and is an important component of parallel program analysis. A novel MHP analysis algorithm for Java programs is proposed. Compared with existing algorithms, the new algorithm drops the assumption that a child thread is only joined by its parent thread, and handles start synchronization and join synchronization separately in a decoupled way; its logic is simpler yet more complete. When computing control information, the new algorithm does not need to build a global control-flow graph via inlining as existing algorithms do, which significantly improves its scalability. The new MHP algorithm is used to filter spurious data races in static race detection. Experimental results on 14 Java benchmark programs show that the cost of computing control information with the new algorithm is far lower than that of existing algorithms.
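For readers unfamiliar with MHP facts, the shape of what the analysis must compute is shown below, transposed from the paper's Java setting (Thread.start/join) to C/pthreads for consistency with the other sketches in this list:

```c
#include <pthread.h>
#include <stdio.h>

void *child(void *arg) {
    puts("C1");            /* MHP with P1: runs after start, possibly  */
    return arg;            /* before the parent reaches its join       */
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, child, NULL);  /* "start" synchronization */
    puts("P1");            /* may happen in parallel with C1           */
    pthread_join(t, NULL); /* "join" synchronization                   */
    puts("P2");            /* ordered after C1: NOT MHP with it        */
    return 0;
}
```

The cited algorithm's point is precisely that the join need not be performed by the creating thread, so the ordering that makes P2 and C1 non-MHP must be established per join site rather than assumed by construction.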

8.
A high-order-accuracy structured-grid CFD parallel program was optimized from the single-node performance perspective. Key variables were identified and constant-parameterized, enabling the compiler to apply more aggressive targeted optimizations. Based on the program's data structures and access patterns, a hierarchical data-buffering technique was designed so that the main computational code accesses the major data structures in a more favorable way, improving spatial locality of memory accesses. Various loop transformations were applied to optimize memory access performance. Tests on the Tianhe-1A machine at the National Supercomputing Center in Changsha show that, relative to the version built with the Intel compiler at its highest optimization level, serial performance improves by about 22.2%-28.9% on a 2D airfoil case with about 100,000 grid points, and parallel performance improves by about 13.9%-20.2% on a 112-million-grid-point delta wing case.
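The abstract mentions loop transformations without listing them; one representative transformation of that kind is tiling a 2-D sweep so both arrays are walked in cache-sized blocks. The kernel (a transposition) and the tile size below are illustrative, not taken from the paper:

```c
#include <stddef.h>

#define TILE 64

/* Tiled transposition: each TILE x TILE block of a and b stays cache-
 * resident while it is being walked, instead of one array being strided
 * through memory on every iteration. */
void transpose_tiled(int n, const double *restrict a, double *restrict b)
{
    for (int ii = 0; ii < n; ii += TILE)
        for (int jj = 0; jj < n; jj += TILE)
            for (int i = ii; i < n && i < ii + TILE; ++i)
                for (int j = jj; j < n && j < jj + TILE; ++j)
                    b[(size_t)j * n + i] = a[(size_t)i * n + j];
}
```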

9.
The nature of math library function algorithms leads to a large number of memory accesses, while the slave cores (CPEs) of current heterogeneous many-core processors access data through shared main memory, which severely limits their memory access speed; as a result, math library performance on heterogeneous many-core architectures cannot meet the requirements of high-performance computing. To solve this problem effectively, a scheduling strategy for memory access instructions is proposed that hides memory access latency behind computation latency, improving the performance of assembly-implemented math library functions. Combined with dynamic invocation, an ldm_call algorithm is proposed that exploits the CPE's local data memory (LDM) to raise memory access speed. Both optimizations are generally applicable under shared-memory architectures and effectively reduce memory access overhead. Experiments show that the two techniques improve function performance by 16.08% and 37.32% on average, respectively.
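The ldm_call internals are not given in the abstract; below is the general shape of hiding shared-memory latency behind computation with an LDM: double buffering. dma_get()/dma_wait() are hypothetical stand-ins for the platform's asynchronous copy primitives, and TILE and process_tile() are placeholders:

```c
#include <stddef.h>

#define TILE 256

extern void dma_get(double *ldm_dst, const double *mem_src, int n); /* async */
extern void dma_wait(void);              /* block until pending copies done */
extern void process_tile(const double *t, int n);

void consume(const double *mem, int ntiles)
{
    static double buf[2][TILE];          /* lives in fast local LDM */
    dma_get(buf[0], mem, TILE);          /* prefetch first tile */
    for (int i = 0; i < ntiles; ++i) {
        dma_wait();                      /* tile i now resident */
        if (i + 1 < ntiles)              /* overlap: fetch tile i+1 ...   */
            dma_get(buf[(i + 1) & 1], mem + (size_t)(i + 1) * TILE, TILE);
        process_tile(buf[i & 1], TILE);  /* ... while computing on tile i */
    }
}
```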

10.
As video codec standards evolve, the volume of data the algorithms must process grows sharply. While multi-core parallelization speeds up computation, it makes the memory system the bottleneck of the whole codec system. Targeting the locality of codec memory accesses, the frequent data exchange between algorithms, and the fact that the large amounts of temporary data inside each algorithm are never exchanged, a multi-level distributed memory architecture consisting of a private storage layer and a shared storage layer was designed and implemented. Tests on a Xilinx Virtex-6 xc6vlx550T development board show that the design, while remaining simple and scalable, delivers up to 9.73 GB/s of memory bandwidth, meeting the data access needs of video codec algorithms.

11.
This paper presents a novel parallel memory architecture for multimedia computers. Applying a configurable or programmable addressing circuitry capable of parallel memory accesses, the memory management of multimedia applications can be enhanced. Necessary computer architecture changes to virtual address representation, paging, virtual memory, address computation circuitry and data permutation are discussed. These changes allow the memory to be partitioned for different access functions. In addition, the same memory area can be accessed by multiple access patterns. Therefore, a general-purpose computing system that is capable of exploiting the repeating memory access patterns in its applications can be built. Performance of the configurable parallel memory architecture (CPMA) is analyzed in the case of a selection of algorithms from a video encoder. These motion estimation algorithms and zigzag scanning benefit from the multiple memory access functions, which is apparent from the comparisons to the traditional sequential memory accesses.

12.
The problem of interest is to verify data consistency of a concurrent Java program. In particular, we present a new decision procedure for verifying that a class of data races caused by inconsistent accesses on multiple fields of an object cannot occur (so-called atomic-set serializability). Atomic-set serializability generalizes the ordinary notion of a data race (i.e., inconsistent coordination of accesses on a single memory location) to a broader class of races that involve accesses on multiple memory locations. Previous work by some of the authors presented a technique to abstract a concurrent Java program into an EML program, a modeling language based on pushdown systems and a finite set of reentrant locks. Our previous work used only a semi-decision procedure, and hence provides a definite answer only some of the time. In this paper, we rectify this shortcoming by developing a decision procedure for verifying data consistency, i.e., atomic-set serializability, of an EML program. When coupled with the previous work, it provides a decision procedure for verifying data consistency of a concurrent Java program. We implemented the decision procedure, and applied it to detect both single-location and multi-location data races in models of concurrent Java programs. Compared with the prior method based on a semi-decision procedure, not only was the decision procedure 34 times faster overall, but the semi-decision procedure timed out on about 50% of the queries, whereas the decision procedure timed out on none of the queries.
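A minimal transposition of the multi-location races the paper targets, in C/pthreads rather than the paper's Java/EML models: each individual access is lock-protected, so no single-location race exists, yet the intended atomic set {x, y} can still be observed in an inconsistent state:

```c
#include <pthread.h>
#include <stdio.h>

static int x = 0, y = 0;                 /* intended atomic set: {x, y} */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

static void *writer(void *arg) {
    pthread_mutex_lock(&m);  x = 1;  pthread_mutex_unlock(&m);
    /* window: another thread can observe x == 1, y == 0 here */
    pthread_mutex_lock(&m);  y = 1;  pthread_mutex_unlock(&m);
    return arg;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, writer, NULL);
    pthread_mutex_lock(&m);
    if (x != y) puts("atomic-set violation observed");
    pthread_mutex_unlock(&m);
    pthread_join(t, NULL);
    return 0;
}
```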

13.
Both hardware and software prefetching have been shown to be effective in tolerating the large memory latencies inherent in shared-memory multiprocessors; however, both types of prefetching have their shortcomings. While software schemes require less hardware support than hardware schemes, they must generate address calculation instructions and a prefetch instruction for each datum that needs to be prefetched. Hardware schemes, however, must become progressively more complex to be able to compute data access strides and to increase the prefetching lookahead. In this paper, we propose an integrated hardware/software prefetching method that uses simple hardware that can handle most data accesses and software prefetching for the few remaining accesses. A compile time algorithm analyzes the access streams formed by array references and determines sequences of consecutive memory accesses to an access stream that can be prefetched by the hardware mechanism. This analysis is based on the relative memory locations of consecutive accesses to an access stream and the number of intervening data references between consecutive accesses to an access stream. In addition, the prefetching lookahead can be set separately for each access stream. Our approach yields an effective scheme that minimizes both CPU overhead and hardware costs. Execution-driven simulations show our method to be very effective.
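The software half of such a hybrid scheme looks like the GCC/Clang sketch below: an explicit prefetch issued LOOKAHEAD iterations early for an access stream whose irregular stride hardware is assumed not to cover. Per-stream tunable lookahead matches the abstract; the value 16 and the kernel are illustrative:

```c
#define LOOKAHEAD 16

double dot_indexed(const double *a, const double *x, const int *idx, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        if (i + LOOKAHEAD < n)                       /* stay in bounds */
            __builtin_prefetch(&x[idx[i + LOOKAHEAD]], 0, 1);
        sum += a[i] * x[idx[i]];                     /* irregular access */
    }
    return sum;
}
```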

14.
Automatic Performance Optimization Techniques for SpMV and Their Applications
In scientific computing, sparse matrix-vector multiplication (SpMV) is an important and frequently invoked kernel. Because the ratio of floating-point operations to memory accesses in typical SpMV implementations is very low and its access pattern is highly irregular, its achieved performance is usually poor. By applying register blocking and a heuristic block-size selection algorithm, the sparse matrix is divided into small dense blocks and the elements of the vector x held in registers are reused, improving the kernel's performance. Several key optimization techniques used in the OSKI package are analyzed and summarized, and their performance in real applications is measured. The tests show that an application must invoke SpMV on the order of hundreds of times before the extra overhead of applying these optimizations is amortized and a net speedup is obtained. On Pentium 4 and AMD Athlon platforms, tests on 10 matrices achieved average speedups of 1.69 and 1.48, respectively.
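A 2x2 register-blocked SpMV kernel of the kind OSKI generates is sketched below: each stored block multiplies two x elements that are loaded once and reused across both rows of the block. The BCSR layout (brow_ptr, bcol_idx, four values per block, row-major) is the standard one; variable names are mine:

```c
void spmv_bcsr_2x2(int nbrows, const int *brow_ptr, const int *bcol_idx,
                   const double *val, const double *x, double *y)
{
    for (int bi = 0; bi < nbrows; ++bi) {
        double y0 = 0.0, y1 = 0.0;            /* accumulators in registers */
        for (int b = brow_ptr[bi]; b < brow_ptr[bi + 1]; ++b) {
            const double *v = val + 4 * b;
            double x0 = x[2 * bcol_idx[b]];   /* loaded once, reused twice */
            double x1 = x[2 * bcol_idx[b] + 1];
            y0 += v[0] * x0 + v[1] * x1;
            y1 += v[2] * x0 + v[3] * x1;
        }
        y[2 * bi]     = y0;
        y[2 * bi + 1] = y1;
    }
}
```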

15.
Wide-issue and high-frequency processors require not only a low-latency but also high-bandwidth memory system to achieve high performance. Previous studies have shown that using multiple small single-ported caches instead of a monolithic large multi-ported one for L1 data cache can be a scalable and inexpensive way to provide higher bandwidth. Various schemes on how to direct the memory references have been proposed in order to achieve a close match to the performance of an ideal multi-ported cache. However, most existing designs seldom take dynamic data access patterns into consideration, thus suffer from access conflicts within one cache and unbalanced loads between the caches. It is observed in this paper that if one can group data references defined in a program into several regions (access regions) to allow parallel accesses, providing separate small caches – access region cache for these regions may prove to have better performance. A register-guided memory reference partitioning approach is proposed and it effectively identifies these semantic regions and organizes them into multiple caches adaptively to maximize concurrent accesses. The base register name, not its content, in the memory reference instruction is used as a basic guide for instruction steering. With the initial assignment to a specific access region cache per the base register name, a reassignment mechanism is applied to capture the access pattern when program is moving across its access regions. In addition, a distribution mechanism is introduced to adaptively enable access regions to extend or shrink among the physical caches to reduce potential conflicts further. The simulations of SPEC CPU2000 benchmarks have shown that the semantic-based scheme can reduce the conflicts effectively, and obtain considerable performance improvement in terms of IPC; with 8 access region caches, 25–33% higher IPC is achieved for integer benchmark programs than a comparable 8-banked cache, while the benefit is less for floating-point benchmark programs, 19% at most.

16.
The growing gap between microprocessor speed and DRAM speed is a major problem that computer designers are facing. In order to narrow the gap, it is necessary to improve DRAM’s speed and throughput. To achieve this goal, this paper proposes techniques to take advantage of the characteristics of the 3-stage access of contemporary DRAM chips by grouping the accesses of the same row together and interleaving the execution of memory accesses from different banks. A family of Bubble Filling Scheduling (BFS) algorithms are proposed in this paper to minimize memory access schedule length and improve memory access time for embedded systems. When the memory access trace is known in some application-specific embedded systems, this information can be fully utilized to generate efficient memory access schedules. The offline BFS algorithm can generate schedules which are 47.49% shorter than in-order scheduling and 8.51% shorter than existing burst scheduling on average. When memory accesses are received by the single memory controller in real time, the memory accesses have to be scheduled as they come. The online BFS algorithm in this paper serves this purpose and generates schedules which are 58.47% shorter than in-order scheduling and 4.73% shorter than burst scheduling on average. To improve the memory throughput and further reduce the memory access schedule, an architecture with dual memory controllers is proposed. According to the experimental results, the dual controller algorithm can generate schedules which are 62.89% shorter than in-order scheduling, 14.23% shorter than burst scheduling, and 10.07% shorter than single controller BFS algorithms on average.
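The BFS algorithms themselves are not specified in the abstract; the sketch below realizes only the two stated rules, keeping same-row accesses together and interleaving banks, as an offline grouping pass. The data layout and the per-request round-robin policy are my simplifications, not the paper's bubble-filling schedule:

```c
#include <stdlib.h>

typedef struct { unsigned bank, row, col; } Req;

static int by_bank_row(const void *pa, const void *pb) {
    const Req *a = pa, *b = pb;
    if (a->bank != b->bank) return a->bank < b->bank ? -1 : 1;
    if (a->row  != b->row)  return a->row  < b->row  ? -1 : 1;
    return 0;
}

/* Sort into per-bank, per-row runs, then emit round-robin across banks so
 * one bank's precharge/activate overlaps another bank's data transfer.
 * Assumes bank ids are 0 .. nbanks-1. */
void schedule(Req *reqs, int n, int nbanks, Req *out)
{
    qsort(reqs, (size_t)n, sizeof *reqs, by_bank_row);
    int *next = calloc((size_t)nbanks, sizeof *next);
    int *end  = malloc((size_t)nbanks * sizeof *end);
    for (int b = 0, i = 0; b < nbanks; ++b) {   /* find each bank's run */
        next[b] = i;
        while (i < n && reqs[i].bank == (unsigned)b) ++i;
        end[b] = i;
    }
    for (int k = 0, b = 0; k < n; b = (b + 1) % nbanks)
        if (next[b] < end[b]) out[k++] = reqs[next[b]++];
    free(next); free(end);
}
```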

17.
The success of large-scale, hierarchical and distributed shared memory systems hinges on our ability to reduce delays resulting from remote accesses to shared data. To facilitate this, we present a compile-time algorithm for analyzing programs with doall-style parallelism to determine when read and write accesses to shared data are redundant (unnecessary). Once identified, redundant remote accesses can be replaced by local accesses or eliminated entirely. This optimization improves program performance in two ways. First, slow memory accesses are replaced by faster ones. Second, the time to perform other remote memory accesses may be reduced as a result of the decreased traffic level. We also show how the information obtained through redundancy analysis can be used for other compiler optimizations such as prefetching and cache management.
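The flavor of transformation this analysis enables, as a before/after pair (example mine, not from the paper): every iteration of a doall-style loop re-reads the same shared scalar, so all but the first remote read are redundant and can be replaced by a local copy:

```c
typedef struct { double scale; } Shared;   /* lives in remote shared memory */

void kernel_naive(const Shared *s, double *a, int n) {
    for (int i = 0; i < n; ++i)
        a[i] *= s->scale;                  /* n remote reads */
}

void kernel_optimized(const Shared *s, double *a, int n) {
    double scale = s->scale;               /* one remote read, then local */
    for (int i = 0; i < n; ++i)
        a[i] *= scale;
}
```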

18.
Unaligned accesses are forbidden in many high-performance architectures. In most of these architectures, the least significant address bits of a multibyte memory access must be zero. Otherwise, the program generating the access is considered erroneous and an exception is flagged. The objective of this paper is to propose an alternative behaviour using the least significant address bits to encode the byte order of the accessed data. Modifying a traditional architecture to support the proposed behaviour presents several advantages, including backward compatibility at binary-code level, and the possibility of carrying out an endianness conversion during multibyte memory accesses without increasing the execution time nor using additional opcodes. The technique is demonstrated by modifying an OpenRISC 1000 implementation without introducing any penalty in hardware resources or performance. Subroutines written and compiled for the traditional architecture and originally designed for only the native byte order can, in the modified architecture, read and write data in a non-native byte order without any need to recompile. The execution of a sample algorithm operating on non-native byte order shows a reduction of 60% in the user execution time in the modified implementation when compared to the original implementation.
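A software model of the proposed load behaviour, for a 4-byte access: the two address bits that a conventional architecture would require to be zero instead select the byte order, so an unaligned-looking address performs a byte-swapped load of the aligned word. The exact encoding is a plausible reading of the scheme, not the paper's precise OpenRISC 1000 modification:

```c
#include <stdint.h>

uint32_t load32(const uint8_t *mem, uintptr_t addr)
{
    const uint8_t *p = mem + (addr & ~(uintptr_t)3);   /* aligned word */
    uint32_t native = (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
                      (uint32_t)p[2] << 8  | (uint32_t)p[3];
    if ((addr & 3) == 0)                  /* low bits zero: native order   */
        return native;
    return __builtin_bswap32(native);     /* nonzero low bits: swapped     */
}
```

With this encoding an aligned pointer reads natively, while the same pointer with a low bit set reads byte-swapped, which is what lets unmodified subroutines consume non-native data without recompilation.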

19.
On-chip distributed memory system has become an attractive solution for massive parallel memory accesses found in future many-core processors. However, increasing number of on-chip cores and memory controllers inevitably introduce many remote memory accesses, which generate a large amount of on-chip traffic and put great pressure on the interconnection. This paper tries to optimize on-chip memory access traffic via runtime thread migration. We first analyze memory access behaviors in multi-threaded applications and find that the memory access targets and volumes are similar during short periods, which makes runtime prediction feasible. But the memory access targets exhibit great mobility during long periods, motivating us to dynamically move threads towards the data. Based on these observations, we propose a novel low-cost and distributed thread migration algorithm which adjusts thread placement in chains based on benefit estimation. We present details of the workflow, including the trigger and arbitration of migration requests and the procedures to determine the migration chains. Simulation results show that our algorithm achieves system performance speedup of 11.5 % and reduces average memory access latency by 11.0 %. It can find a few but effective thread migrations to optimize on-chip memory access traffic with acceptable hardware and runtime overheads.
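A hedged sketch of the benefit test that drives such a migration policy: move a thread toward the tile it accesses most when the predicted saving outweighs the migration cost. The cost model and the paper's chain-building step are not reproduced; all weights are illustrative:

```c
typedef struct { long acc[64]; int home; } ThreadStats; /* per-tile access
                                                           counters; assumes
                                                           ntiles <= 64 */

int pick_migration(const ThreadStats *t, int ntiles,
                   long per_access_saving, long migration_cost)
{
    int best = t->home;
    for (int i = 0; i < ntiles; ++i)
        if (t->acc[i] > t->acc[best]) best = i;   /* hottest target tile */
    long benefit = (t->acc[best] - t->acc[t->home]) * per_access_saving;
    return (benefit > migration_cost) ? best : t->home;
}
```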

20.
Sparse matrix-vector multiplication is at the core of many scientific computations, and its large number of indirect and random memory accesses is the main bottleneck. By analyzing the data structures and computation of SpMV, the memory access characteristics of its different data arrays are derived, and a cache partitioning method oriented to these access characteristics is proposed. Tests on 12 SpMV cases show that the proposed cache partitioning effectively raises the cache hit rate of the reusable vector while reducing the computation's demand for cache capacity.
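The per-array access characteristics that motivate such a partitioning can be read off a plain CSR kernel, annotated below (standard CSR layout; the annotations summarize the access pattern, not the paper's partitioning mechanism):

```c
/* row_ptr, col_idx and val are each streamed exactly once (no reuse, so
 * caching them buys little), while x is read repeatedly through an
 * irregular index and is the array worth keeping cache-resident. */
void spmv_csr(int nrows, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < nrows; ++i) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)  /* streamed once */
            sum += val[k] * x[col_idx[k]];  /* x: random access, reusable  */
        y[i] = sum;                         /* y: written once, sequential */
    }
}
```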
