Similar Documents
20 similar documents found.
1.
Peng  Lizhi  Zhang  Haibo  Hassan  Houcine  Chen  Yuehui  Yang  Bo 《The Journal of supercomputing》2019,75(6):2930-2949

The data gravitation-based classification (DGC) model, a classification model inspired by the physical law of gravitation, has been demonstrated to be effective for both standard and imbalanced tasks. However, because of the large amount of gravitational computation required during the feature weighting process, DGC suffers from high computational complexity, especially on large data sets. In this paper, we address the problem of speeding up gravitational computation using the graphics processing unit (GPU). We design a GPU parallel algorithm, GPU–DGC, to accelerate the feature weighting process of the DGC model. GPU–DGC distributes the gravitational computation across parallel GPU threads so that gravitation is computed simultaneously. We use 25 open classification data sets to evaluate the parallel performance of the algorithm. The relationship between the speedup ratio and the number of GPU threads is discovered and discussed through empirical studies. The experimental results show the effectiveness of GPU–DGC, with a maximum speedup of 87× over the serial DGC; its sensitivity to the number of GPU threads is also examined.
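The abstract does not include code, but the thread-per-particle structure it describes can be pictured with a small CUDA sketch. Everything below (the kernel name, the unit-mass gravitation g = 1/d², the atomic per-class accumulation) is an illustrative assumption, not the authors' implementation:

// Minimal sketch: one GPU thread per training particle computes its
// gravitation on a query point; per-class sums are accumulated atomically.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void gravitation(const float* train, const int* label,
                            const float* query, const float* w,
                            float* classGrav, int n, int dim) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float d2 = 1e-12f;                         // guard against r == 0
    for (int k = 0; k < dim; ++k) {            // feature-weighted distance
        float diff = w[k] * (train[i * dim + k] - query[k]);
        d2 += diff * diff;
    }
    // gravitation of a unit-mass particle: g = 1 / r^2
    atomicAdd(&classGrav[label[i]], 1.0f / d2);
}

int main() {
    const int n = 4, dim = 2, classes = 2;
    float hTrain[n * dim] = {0, 0, 0, 1, 5, 5, 6, 5};
    int   hLabel[n]       = {0, 0, 1, 1};
    float hQuery[dim]     = {0.5f, 0.5f};
    float hW[dim]         = {1.0f, 1.0f};      // feature weights being tuned
    float *dT, *dQ, *dW, *dG; int *dL;
    cudaMalloc(&dT, sizeof(hTrain)); cudaMalloc(&dL, sizeof(hLabel));
    cudaMalloc(&dQ, sizeof(hQuery)); cudaMalloc(&dW, sizeof(hW));
    cudaMalloc(&dG, classes * sizeof(float));
    cudaMemcpy(dT, hTrain, sizeof(hTrain), cudaMemcpyHostToDevice);
    cudaMemcpy(dL, hLabel, sizeof(hLabel), cudaMemcpyHostToDevice);
    cudaMemcpy(dQ, hQuery, sizeof(hQuery), cudaMemcpyHostToDevice);
    cudaMemcpy(dW, hW, sizeof(hW), cudaMemcpyHostToDevice);
    cudaMemset(dG, 0, classes * sizeof(float));
    gravitation<<<1, 128>>>(dT, dL, dQ, dW, dG, n, dim);
    float hG[classes];
    cudaMemcpy(hG, dG, sizeof(hG), cudaMemcpyDeviceToHost);
    printf("class gravitation: %f %f\n", hG[0], hG[1]);  // predict the argmax
    return 0;
}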


2.
In nonlinear systems, particle filtering needs a large number of particles to guarantee state-estimation accuracy, which reduces the real-time performance of the algorithm and degrades both the accuracy and the timeliness of fault diagnosis. To address this problem, a GPU-based parallel particle swarm optimization particle filter (PSOPF) is proposed. After analyzing the parallelism of the PSOPF algorithm, a parallel PSOPF based on the CUDA parallel computing architecture is designed and implemented, using a large number of GPU threads to accelerate the algorithm. To solve the low execution efficiency caused by non-coalesced accesses to GPU global memory in rejection resampling, the parallel rejection resampling algorithm is improved so that the threads of a warp resample particles within the same memory segment, raising execution efficiency. The effectiveness of the algorithm is verified through fault diagnosis of a wind turbine pitch system. Experimental results show that the method satisfies the accuracy and real-time requirements of fault diagnosis.
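For intuition, here is a minimal CUDA sketch of standard GPU rejection resampling; the randomly drawn candidate indices produce exactly the scattered, non-coalesced reads the paper targets, and its improvement (not reproduced here) constrains each warp's candidate draws to one memory segment. All names and the host setup are assumptions:

// Murray-style rejection resampling: thread i keeps its own particle with
// probability w[i]/wMax, otherwise redraws a candidate uniformly at random.
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

__global__ void rejection_resample(const float* w, float wMax, int* ancestor,
                                   int n, unsigned long long seed) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    curandState rng;
    curand_init(seed, i, 0, &rng);
    int j = i;                                   // start with own particle
    while (curand_uniform(&rng) * wMax > w[j]) { // reject with prob 1 - w[j]/wMax
        j = (int)(curand_uniform(&rng) * n) % n; // scattered read of w[j]:
    }                                            // <- the coalescing bottleneck
    ancestor[i] = j;                             // index of surviving particle
}

int main() {
    const int n = 8;
    float hw[n] = {.05f, .05f, .4f, .05f, .05f, .3f, .05f, .05f};
    float *dw; int *da;
    cudaMalloc(&dw, sizeof(hw)); cudaMalloc(&da, n * sizeof(int));
    cudaMemcpy(dw, hw, sizeof(hw), cudaMemcpyHostToDevice);
    rejection_resample<<<1, 32>>>(dw, 0.4f, da, n, 42ULL);
    int ha[n];
    cudaMemcpy(ha, da, sizeof(ha), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("%d ", ha[i]);  // ancestor indices
    printf("\n");
    return 0;
}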

3.
Histogram generation is a sequential loop computation with irregular data dependences that is widely used in many fields. Because of its irregular memory accesses, concurrent threads accessing shared memory produce many bank conflicts, which hinder parallel efficiency. Implementing efficient histogram generation on parallel processors, and especially on today's most advanced graphics processing units (GPUs), is therefore of real research value. To reduce bank conflicts during histogram generation, a memory padding technique spreads the threads' shared-memory accesses evenly across the banks, which greatly reduces the memory access latency of histogram generation on the GPU. In addition, an effective and reliable near-optimal configuration search model is proposed to guide users in setting GPU execution parameters for higher performance. Experiments confirm that, in practical applications, the improved algorithm outperforms the original by 42%–88%.
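A hedged sketch of the padding idea follows: each warp keeps a private sub-histogram column, and padding the row stride from 4 to 5 (an odd stride) spreads the lanes' shared-memory updates across all 32 banks. This illustrates the general technique only; the paper's exact layout and its configuration search model are not reproduced:

// Padded shared-memory histogram: without the +1 pad, a stride of 4 words
// maps rows onto only 8 of the 32 banks; the odd stride of 5 cycles through
// all banks and removes most multi-way conflicts.
#include <cstdio>
#include <cuda_runtime.h>

#define BINS 64
#define WARPS_PER_BLOCK 4
#define THREADS (WARPS_PER_BLOCK * 32)

__global__ void histogram(const unsigned char* data, int n, unsigned int* hist) {
    __shared__ unsigned int sub[BINS][WARPS_PER_BLOCK + 1];  // +1 = padding
    int tid  = threadIdx.x;
    int warp = tid / 32;
    for (int b = tid; b < BINS * (WARPS_PER_BLOCK + 1); b += THREADS)
        ((unsigned int*)sub)[b] = 0;                         // zero sub-hists
    __syncthreads();
    for (int i = blockIdx.x * THREADS + tid; i < n; i += gridDim.x * THREADS)
        atomicAdd(&sub[data[i] % BINS][warp], 1u);           // warp-private column
    __syncthreads();
    for (int b = tid; b < BINS; b += THREADS) {              // reduce columns
        unsigned int s = 0;
        for (int w = 0; w < WARPS_PER_BLOCK; ++w) s += sub[b][w];
        atomicAdd(&hist[b], s);
    }
}

int main() {
    const int n = 1 << 20;
    unsigned char* h = new unsigned char[n];
    for (int i = 0; i < n; ++i) h[i] = (unsigned char)(i * 31);
    unsigned char* d; unsigned int* dh;
    cudaMalloc(&d, n); cudaMalloc(&dh, BINS * sizeof(unsigned int));
    cudaMemcpy(d, h, n, cudaMemcpyHostToDevice);
    cudaMemset(dh, 0, BINS * sizeof(unsigned int));
    histogram<<<64, THREADS>>>(d, n, dh);
    unsigned int hh[BINS];
    cudaMemcpy(hh, dh, sizeof(hh), cudaMemcpyDeviceToHost);
    printf("bin0=%u bin1=%u\n", hh[0], hh[1]);
    delete[] h;
    return 0;
}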

4.
This paper presents a Graphics Processing Unit (GPU)-based implementation of a Bellman–Ford (BF) routing algorithm using NVIDIA's Compute Unified Device Architecture (CUDA). In the proposed GPU-based approach, multiple threads run concurrently over numerous streaming processors in the GPU to dynamically update routing information. Instead of computing vertex distances one by one, a large number of threads update vertex distances concurrently, with each vertex distance handled by a single thread. This paper compares the performance of the GPU-based approach to an equivalent CPU implementation while varying the number of vertices. Experimental results show that the proposed GPU-based approach outperforms the equivalent sequential CPU implementation in execution time by exploiting the massive parallelism inherent in the BF routing algorithm. In addition, the reduction in energy consumption (about 99%) achieved by using the GPU reflects the overall merits of deploying GPUs across the entire landscape of IP routing for emerging multimedia communications.
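The one-thread-per-vertex scheme can be sketched as follows; the CSR graph layout, integer distances (so atomicMin applies), and the fixed |V|−1 relaxation rounds are conventional choices assumed here, not the paper's exact design:

// Thread-per-vertex Bellman-Ford: each thread relaxes the outgoing edges of
// its vertex; the host launches |V|-1 relaxation rounds.
#include <cstdio>
#include <cuda_runtime.h>

#define INF (1 << 29)

__global__ void relax(const int* rowPtr, const int* col, const int* wgt,
                      int* dist, int nv) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per vertex
    if (v >= nv || dist[v] >= INF) return;          // skip unreached vertices
    for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e)
        atomicMin(&dist[col[e]], dist[v] + wgt[e]); // relax edge v -> col[e]
}

int main() {
    // tiny graph in CSR form: 0->1 (4), 0->2 (1), 2->1 (2), 1->3 (1)
    int rowPtr[] = {0, 2, 3, 4, 4};
    int col[]    = {1, 2, 1, 3};
    int wgt[]    = {4, 1, 2, 1};
    int nv = 4;
    int *dRow, *dCol, *dW, *dDist;
    cudaMalloc(&dRow, sizeof(rowPtr)); cudaMalloc(&dCol, sizeof(col));
    cudaMalloc(&dW, sizeof(wgt));      cudaMalloc(&dDist, nv * sizeof(int));
    cudaMemcpy(dRow, rowPtr, sizeof(rowPtr), cudaMemcpyHostToDevice);
    cudaMemcpy(dCol, col, sizeof(col), cudaMemcpyHostToDevice);
    cudaMemcpy(dW, wgt, sizeof(wgt), cudaMemcpyHostToDevice);
    int hDist[] = {0, INF, INF, INF};                // source is vertex 0
    cudaMemcpy(dDist, hDist, sizeof(hDist), cudaMemcpyHostToDevice);
    for (int it = 0; it < nv - 1; ++it)              // |V|-1 relaxation rounds
        relax<<<(nv + 127) / 128, 128>>>(dRow, dCol, dW, dDist, nv);
    cudaMemcpy(hDist, dDist, sizeof(hDist), cudaMemcpyDeviceToHost);
    for (int v = 0; v < nv; ++v) printf("dist[%d]=%d\n", v, hDist[v]);
    return 0;
}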

5.
Community detection is a much-demanded technique for analyzing complex and massive graph-based networks. The quality of the communities detected within an acceptable time is an important aspect of any algorithm that must traverse an ultra-large-scale graph, for instance a social network graph. In this paper, an efficient method is proposed to tackle the Louvain community detection problem on multicore systems through thread-level parallelization. The main contribution of this article is an adaptive parallel thread assignment for the calculation of adding qualified neighbor nodes to a community, which yields better load balancing across the executing threads. The proposed method is evaluated on an AMD system with 64 cores and reduces execution time by 50% in comparison with the previously fastest parallel algorithms. Moreover, the experiments show that our method finds communities of comparable quality.

6.
Advances in nanometer-scale and integrated circuit technology allow a graphics card to carry dedicated memory and one or more processing units, the GPU, on which most graphics instructions can be processed in parallel. This computational resource can be used to improve the execution efficiency not only of graphics applications but also of other time-consuming applications such as data mining. The Clustering Affinity Search Technique (CAST) is a well-known clustering algorithm widely used on biological data. In this paper, we propose an algorithm that utilizes the GPU and the dedicated memory of the graphics card to accelerate CAST execution. The experimental results show that our proposed algorithm delivers excellent performance in terms of execution time and scales to very large databases.

7.
Online dynamic graph drawing
This paper presents an algorithm for drawing a sequence of graphs online. The algorithm strives to maintain the global structure of the graph and thus the user's mental map, while allowing arbitrary modifications between consecutive layouts. The algorithm works online and uses various execution culling methods in order to reduce the layout time and handle large dynamic graphs. Techniques for representing graphs on the GPU allow a speedup by a factor of up to 17 compared to the CPU implementation. The scalability of the algorithm across GPU generations is demonstrated. Applications of the algorithm to the visualization of discussion threads in Internet sites and to the visualization of social networks are provided.

8.

Community detection (or clustering) in large-scale graphs is an important problem in graph mining. Communities reveal interesting organizational and functional characteristics of a network. The Louvain algorithm is an efficient sequential algorithm for community detection, but sequential algorithms fail to scale to emerging large-scale data; scalable parallel algorithms are necessary to process large graph datasets. In this work, we present a comparative analysis of our different parallel implementations of the Louvain algorithm, designing parallel versions for both shared-memory and distributed-memory settings. Developing distributed-memory parallel algorithms is challenging because of inter-process communication and load-balancing issues. We incorporate dynamic load balancing in our final algorithm, DPLAL (Distributed Parallel Louvain Algorithm with Load-balancing). DPLAL overcomes the performance bottlenecks of the previous algorithms and shows around 12-fold speedup when scaling to a larger number of processors. We also compare the performance of our algorithm with other prominent algorithms in the literature and obtain better or comparable performance. We identify the challenges in developing distributed-memory algorithms and provide an optimized solution, DPLAL, with a performance analysis on large-scale real-world networks from different domains.


9.
In object recognition, the deformable part model (DPM) is currently the most accurate algorithm for human detection. To address the heavy computational load of DPM, a GPU-based parallel solution is proposed. Using the OpenCL GPU programming model, the implementation of the entire DPM pipeline is redesigned with parallelization in mind, optimizing the memory model and the thread allocation. A comparison between the OpenCV library implementation and the GPU reimplementation shows that, while preserving detection quality, the execution efficiency of the algorithm improves nearly 8-fold.

10.
This paper presents an enhanced auto-optimization method to run the 3D Fast Wavelet Transform on the different computing units in a system (GPU, MIC, CPU). The proposed method automatically selects a set of parameter values (block size, number of streams, and number of threads) in order to reduce the total execution time, obtaining performance close to optimal while decreasing the number of evaluations needed.
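The measurement core of such an auto-tuner can be sketched with CUDA events: the code below times candidate block sizes for a stand-in kernel and keeps the fastest. The paper's method also tunes stream and thread counts and prunes the number of evaluations, which this sketch omits; all names here are illustrative:

// Time each candidate launch configuration and select the fastest block size.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];     // stand-in for one transform pass
}

int main() {
    const int n = 1 << 22;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float)); cudaMalloc(&y, n * sizeof(float));
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    int candidates[] = {64, 128, 256, 512, 1024};
    int best = 0; float bestMs = 1e30f;
    for (int b : candidates) {
        cudaEventRecord(t0);
        saxpy<<<(n + b - 1) / b, b>>>(2.0f, x, y, n);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        float ms;
        cudaEventElapsedTime(&ms, t0, t1);   // elapsed GPU time in ms
        printf("block=%4d  %.3f ms\n", b, ms);
        if (ms < bestMs) { bestMs = ms; best = b; }
    }
    printf("selected block size: %d\n", best);
    return 0;
}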

11.
Fang  Juan  Zhang  Xibei  Liu  Shijian  Chang  Zeqing 《The Journal of supercomputing》2019,75(8):4519-4528

When multiple processor (CPU) cores and an integrated GPU on the same chip share the last-level cache (LLC), competition for the LLC becomes severe. CPUs and GPUs have different memory access characteristics and therefore differ in their sensitivity to LLC capacity. For many CPU applications, a reduced share of the LLC can lead to significant performance degradation. GPU applications, in contrast, run a high number of concurrent threads and can tolerate access latency. Exploiting this latency tolerance of GPU programs, we propose an LLC buffer management strategy (buffer-for-GPU, BFG) for heterogeneous multi-cores. A buffer is added beside the LLC to filter the GPU's streaming requests: cache-insensitive GPU requests go directly to the buffer instead of the LLC, filtering GPU traffic and freeing LLC space for CPU applications. Then, to match the different characteristics of CPU and GPU applications, an improved LRU replacement policy that takes into account both the recency and the access frequency of each cache block is adopted. A cache-miss-aware algorithm dynamically selects between the improved LRU and standard LRU to fit the current operating state by comparing the miss rates observed in the buffer, so that system performance is improved significantly.
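As a rough host-side illustration (the exact scoring is not specified in this abstract), an "improved LRU" victim choice might combine a recency term with an access-frequency term under a tunable weight; the function and constants below are assumptions for illustration only:

// Victim selection mixing recency and frequency; the block with the lowest
// score is evicted. A miss-rate check could switch back to pure LRU.
#include <cstdio>
#include <vector>

struct Block { long lastAccess; long freq; };

// lower score = better eviction candidate; alpha trades recency vs frequency
int pickVictim(const std::vector<Block>& set, long now, double alpha) {
    int victim = 0; double best = 1e300;
    for (int i = 0; i < (int)set.size(); ++i) {
        double recency = 1.0 / (double)(now - set[i].lastAccess + 1);
        double score = alpha * recency + (1.0 - alpha) * (double)set[i].freq;
        if (score < best) { best = score; victim = i; }
    }
    return victim;
}

int main() {
    std::vector<Block> set = {{100, 9}, {98, 1}, {50, 40}, {99, 2}};
    // block 1: recently touched but rarely used -> victim under mixed scoring
    printf("victim way: %d\n", pickVictim(set, 101, 0.5));
    return 0;
}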


12.
To meet the real-time requirements of large-scale compressed sensing reconstruction, GPU-based acceleration of the orthogonal matching pursuit (OMP) algorithm is explored and implemented. To reduce the high latency of transfers between the CPU and the GPU, the entire OMP iteration is moved onto the GPU for parallel execution. On the GPU side, the CUDA code is restructured according to the access characteristics of global memory so that memory accesses satisfy the coalescing conditions, reducing access latency. Meanwhile, within the resource limits of each streaming multiprocessor (SM), the shared-memory allocation per SM is increased and the thread access pattern is revised to reduce bank conflicts and raise memory throughput. Tests on an NVIDIA Tesla K20Xm GPU and an Intel(R) E5-2650 CPU show that the time-consuming projection and weight-update modules achieve speedups of 32× and 46×, respectively, and the algorithm as a whole achieves a 34× speedup.
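The projection step dominates OMP's cost, and its coalescing-friendly form can be sketched as a transposed matrix–vector product: with the sensing matrix stored row-major, the consecutive threads of a warp read consecutive addresses. The kernel below is a minimal illustration under these assumptions, not the paper's optimized code:

// OMP projection c = Phi^T r: for a fixed row i, lanes j, j+1, ... read
// consecutive elements phi[i*n + j], so global loads coalesce.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void projection(const float* phi, const float* r, float* c,
                           int m, int n) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per column
    if (j >= n) return;
    float acc = 0.0f;
    for (int i = 0; i < m; ++i)
        acc += phi[i * n + j] * r[i];  // coalesced read across the warp
    c[j] = acc;                        // correlation of atom j with residual
}

int main() {
    const int m = 4, n = 8;
    float hPhi[m * n], hR[m] = {1, 2, 3, 4};
    for (int i = 0; i < m * n; ++i) hPhi[i] = (i % n == 2) ? 1.0f : 0.1f;
    float *dPhi, *dR, *dC;
    cudaMalloc(&dPhi, sizeof(hPhi)); cudaMalloc(&dR, sizeof(hR));
    cudaMalloc(&dC, n * sizeof(float));
    cudaMemcpy(dPhi, hPhi, sizeof(hPhi), cudaMemcpyHostToDevice);
    cudaMemcpy(dR, hR, sizeof(hR), cudaMemcpyHostToDevice);
    projection<<<1, 32>>>(dPhi, dR, dC, m, n);
    float hC[n];
    cudaMemcpy(hC, dC, sizeof(hC), cudaMemcpyDeviceToHost);
    for (int j = 0; j < n; ++j) printf("c[%d]=%.1f\n", j, hC[j]);
    return 0;
}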

13.
The restarted PGMRES algorithm is one of the most efficient iterative methods for solving sparse linear systems, and its computation is relatively stable. To accelerate the solution of large-scale sparse linear systems, a GPU parallel implementation of the restarted PGMRES algorithm is presented. A new access scheme for the ELL compressed storage format is proposed, together with a dynamic thread allocation strategy based on the problem size and the number of SMs. Experimental results show that the algorithm effectively improves SM utilization and achieves speedups of 3–10×.
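For reference, a standard ELL-format sparse matrix–vector (SpMV) kernel, the building block inside GMRES iterations, looks like the sketch below; ELL stores a fixed number of entries per row in column-major order so that consecutive rows (threads) read consecutive addresses. The paper's new access scheme and dynamic thread allocation are not reproduced:

// ELL SpMV: one thread per row; column-major slabs make the loads coalesced.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void spmv_ell(const float* val, const int* col, int rows,
                         int maxPerRow, const float* x, float* y) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows) return;
    float acc = 0.0f;
    for (int k = 0; k < maxPerRow; ++k) {
        int c = col[k * rows + r];                    // column-major access
        if (c >= 0) acc += val[k * rows + r] * x[c];  // -1 marks padding
    }
    y[r] = acc;
}

int main() {
    // 3x3 matrix [[4,1,0],[0,3,0],[2,0,5]] in ELL with maxPerRow = 2
    const int rows = 3, maxPerRow = 2;
    float val[] = {4, 3, 2,   1, 0, 5};   // column-major slabs
    int   col[] = {0, 1, 0,   1, -1, 2};
    float x[]   = {1, 1, 1};
    float *dv, *dx, *dy; int *dc;
    cudaMalloc(&dv, sizeof(val)); cudaMalloc(&dc, sizeof(col));
    cudaMalloc(&dx, sizeof(x));   cudaMalloc(&dy, rows * sizeof(float));
    cudaMemcpy(dv, val, sizeof(val), cudaMemcpyHostToDevice);
    cudaMemcpy(dc, col, sizeof(col), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, x, sizeof(x), cudaMemcpyHostToDevice);
    spmv_ell<<<1, 32>>>(dv, dc, rows, maxPerRow, dx, dy);
    float y[rows];
    cudaMemcpy(y, dy, sizeof(y), cudaMemcpyDeviceToHost);
    printf("y = %.0f %.0f %.0f\n", y[0], y[1], y[2]);  // expect 5 3 7
    return 0;
}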

14.
平凡  汤小春  潘彦宇  李战怀 《计算机应用》2021,41(11):3295-3301
For large collections of irregular tasks with small resource demands and high parallelism, GPU acceleration is currently the mainstream approach. However, existing scheduling strategies for irregular tasks either occupy a GPU exclusively or map tasks onto GPU devices with traditional optimization methods; the former leaves GPU resources idle, while the latter cannot fully exploit GPU computing resources. Building on an analysis of these problems, a multi-knapsack optimization approach is adopted so that more irregular tasks share GPU devices in a near-optimal way. First, a distributed GPU job scheduling framework consisting of a scheduler and executors is presented for GPU clusters. Then, taking GPU memory as the cost, an extended greedy scheduling (EGS) algorithm based on GPU computing resources is designed; it schedules as many irregular tasks as possible onto the available GPUs to maximize the use of GPU computing resources and eliminate idle GPUs. Finally, target task sets randomly generated from real benchmark programs are used to validate the proposed strategy. Experimental results show that, compared with the traditional greedy algorithm, the minimum completion time (MCT) algorithm, and the Min-min algorithm, with 1,000 tasks the execution time of EGS drops on average to 58%, 64%, and 80% of theirs, respectively, while GPU resource utilization is effectively improved.
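A toy host-side sketch of a multi-knapsack-style greedy placement in the spirit of EGS follows; the ordering rule, the capacities, and the "most free memory first" choice are all assumptions for illustration, not the published algorithm:

// Greedy multi-knapsack placement: order tasks by GPU-memory demand and put
// each on the device with the most remaining memory, so many small tasks
// share a GPU instead of occupying it exclusively.
#include <algorithm>
#include <cstdio>
#include <functional>
#include <vector>

int main() {
    std::vector<int> taskMem = {300, 120, 500, 80, 220, 60, 400};  // MB each
    std::vector<int> gpuFree = {1024, 1024};                       // two GPUs
    std::sort(taskMem.begin(), taskMem.end(), std::greater<int>());
    for (int mem : taskMem) {
        int g = (int)(std::max_element(gpuFree.begin(), gpuFree.end())
                      - gpuFree.begin());          // GPU with most free memory
        if (gpuFree[g] < mem) { printf("task(%dMB) deferred\n", mem); continue; }
        gpuFree[g] -= mem;
        printf("task(%dMB) -> GPU %d (free %dMB)\n", mem, g, gpuFree[g]);
    }
    return 0;
}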

15.

In the context of historical document analysis, image binarization is an important first step that separates foreground from background despite common image degradations such as faded ink, stains, or bleed-through. Fast binarization matters greatly when analyzing vast archives of document images, since even small inefficiencies quickly accumulate into years of wasted execution time. Efficient binarization is therefore especially relevant to companies and government institutions that want to analyze their large collections of document images. The main challenge is to speed up the execution performance without affecting the binarization quality. We modify a state-of-the-art binarization algorithm and achieve, on average, a 3.5× faster execution by correctly mapping this algorithm to a heterogeneous platform consisting of a CPU and a GPU. Our proposed parameter tuning algorithm additionally improves the execution time of parameter tuning by a factor of 1.7 compared to previous parameter tuning algorithms. We find that, for the chosen algorithm, machine-learning-based parameter tuning improves the execution performance more than heterogeneous computing when comparing absolute execution times.


16.
Implementations of relational operators on GPU processors have resulted in order-of-magnitude speedups compared to their multicore CPU counterparts. Here we focus on the efficient implementation of string matching operators common in SQL queries. Due to different architectural features, the optimal algorithm for CPUs might be suboptimal for GPUs. GPUs achieve high memory bandwidth by running thousands of threads, so it is not feasible to keep the working set of all threads in the cache in a naive implementation. In GPUs the unit of execution is a group of threads, and in the presence of loops and branches, threads in a group have to follow the same execution path; if some threads diverge, the different paths are serialized. We study the cache memory efficiency of single- and multi-pattern string matching algorithms for conventional and pivoted string layouts in GPU memory. We evaluate memory efficiency in terms of memory access pattern and achieved memory bandwidth for different parallelization methods. To reduce thread divergence, we split string matching into multiple steps. We evaluate the different matching algorithms in terms of average- and worst-case performance and compare them against state-of-the-art CPU and GPU libraries. Our experimental evaluation shows that thread and memory efficiency affect performance significantly and that our proposed methods outperform previous CPU and GPU algorithms in terms of raw performance and power efficiency. The Knuth–Morris–Pratt algorithm is a good choice for GPUs because its regular memory access pattern makes it amenable to several GPU optimizations.
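As a reference point, single-pattern Knuth–Morris–Pratt with one thread per subject string can be sketched as below; the precomputed failure table lives in constant memory shared by all threads. The paper's pivoted string layout and divergence-reducing multi-step matching are not reproduced:

// Thread-per-string KMP: the failure table lets each thread scan its string
// in linear time with a regular, forward-only access pattern.
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

#define PLEN 3
__constant__ char dPat[PLEN + 1] = "aba";
__constant__ int  dFail[PLEN]    = {0, 0, 1};  // KMP failure function of "aba"

__global__ void kmp(const char* strs, int stride, int nStrs, int* hits) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per string
    if (s >= nStrs) return;
    const char* str = strs + s * stride;
    int k = 0, count = 0;
    for (int i = 0; str[i] != '\0'; ++i) {
        while (k > 0 && str[i] != dPat[k]) k = dFail[k - 1];
        if (str[i] == dPat[k]) ++k;
        if (k == PLEN) { ++count; k = dFail[k - 1]; } // match ending at i
    }
    hits[s] = count;
}

int main() {
    const int nStrs = 2, stride = 16;
    char h[nStrs * stride] = {0};
    strcpy(h + 0 * stride, "abacabab");
    strcpy(h + 1 * stride, "bbbb");
    char* d; int* dHits;
    cudaMalloc(&d, sizeof(h)); cudaMalloc(&dHits, nStrs * sizeof(int));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    kmp<<<1, 32>>>(d, stride, nStrs, dHits);
    int hits[nStrs];
    cudaMemcpy(hits, dHits, sizeof(hits), cudaMemcpyDeviceToHost);
    printf("matches: %d %d\n", hits[0], hits[1]);    // expect 2 0
    return 0;
}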

17.
Design and implementation of a CUDA-based parallel particle swarm optimization algorithm
To address the excessive computation time of particle swarm optimization (PSO) when processing large amounts of data or solving large-scale complex problems, a fine-grained parallel PSO on the graphics card (GPU) is investigated. Based on an analysis of the traditional PSO algorithm and the widely used GPU parallel computing techniques, a parallel PSO method is designed and implemented. It executes on the Compute Unified Device Architecture (CUDA) and uses a large number of GPU threads to process the search of the individual particles in parallel, accelerating the convergence of the whole swarm. The program makes full use of CUDA's built-in mathematical libraries, which ensures its stability and ease of implementation. Solving several benchmark optimization functions shows that, with equivalent convergence, the CUDA-based parallel PSO achieves speedups of up to 90× over the serial CPU-based method.
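One fine-grained CUDA PSO step, one thread per particle, might look like the following sketch; the inertia and acceleration constants, the sphere objective, and the host-side global-best selection are conventional choices assumed here, not taken from the paper:

// One PSO iteration per kernel launch: each thread updates its particle's
// velocity and position and refreshes its personal best.
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

#define DIM 2

__device__ float sphere(const float* x) {          // objective: sum of squares
    float s = 0; for (int d = 0; d < DIM; ++d) s += x[d] * x[d]; return s;
}

__global__ void psoStep(float* x, float* v, float* pbest, float* pbestVal,
                        const float* gbest, int n, unsigned long long seed,
                        int iter) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    curandState rng;
    curand_init(seed, i, (unsigned long long)iter * 2 * DIM, &rng);
    const float w = 0.72f, c1 = 1.49f, c2 = 1.49f; // standard PSO constants
    for (int d = 0; d < DIM; ++d) {
        float r1 = curand_uniform(&rng), r2 = curand_uniform(&rng);
        v[i * DIM + d] = w * v[i * DIM + d]
                       + c1 * r1 * (pbest[i * DIM + d] - x[i * DIM + d])
                       + c2 * r2 * (gbest[d] - x[i * DIM + d]);
        x[i * DIM + d] += v[i * DIM + d];
    }
    float f = sphere(&x[i * DIM]);
    if (f < pbestVal[i]) {                         // refresh personal best
        pbestVal[i] = f;
        for (int d = 0; d < DIM; ++d) pbest[i * DIM + d] = x[i * DIM + d];
    }
}

int main() {
    const int n = 64;
    float hx[n * DIM], hv[n * DIM] = {0}, hpv[n];
    for (int i = 0; i < n * DIM; ++i) hx[i] = (i * 37 % 100) / 10.0f - 5.0f;
    for (int i = 0; i < n; ++i)
        hpv[i] = hx[i * DIM] * hx[i * DIM] + hx[i * DIM + 1] * hx[i * DIM + 1];
    float *dx, *dv, *dp, *dpv, *dg;
    cudaMalloc(&dx, sizeof(hx)); cudaMalloc(&dv, sizeof(hv));
    cudaMalloc(&dp, sizeof(hx)); cudaMalloc(&dpv, sizeof(hpv));
    cudaMalloc(&dg, DIM * sizeof(float));
    cudaMemcpy(dx, hx, sizeof(hx), cudaMemcpyHostToDevice);
    cudaMemcpy(dv, hv, sizeof(hv), cudaMemcpyHostToDevice);
    cudaMemcpy(dp, hx, sizeof(hx), cudaMemcpyHostToDevice);
    cudaMemcpy(dpv, hpv, sizeof(hpv), cudaMemcpyHostToDevice);
    for (int it = 0; it < 100; ++it) {
        // host-side global-best selection (a reduction kernel in real code)
        cudaMemcpy(hpv, dpv, sizeof(hpv), cudaMemcpyDeviceToHost);
        int b = 0; for (int i = 1; i < n; ++i) if (hpv[i] < hpv[b]) b = i;
        cudaMemcpy(dg, dp + b * DIM, DIM * sizeof(float),
                   cudaMemcpyDeviceToDevice);
        psoStep<<<1, 64>>>(dx, dv, dp, dpv, dg, n, 7ULL, it);
    }
    cudaMemcpy(hpv, dpv, sizeof(hpv), cudaMemcpyDeviceToHost);
    int b = 0; for (int i = 1; i < n; ++i) if (hpv[i] < hpv[b]) b = i;
    printf("best value after 100 iterations: %g\n", hpv[b]);
    return 0;
}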

18.
Hierarchical Distributed Latent Dirichlet Allocation (HD-LDA) is a text classification algorithm based on a probabilistic generative model that improves on Latent Dirichlet Allocation (LDA): unlike LDA, which runs only on a single machine, it can run in a distributed framework with distributed parallel processing. Mahout implements HD-LDA on the Hadoop framework, but because the computation on each node is heavy, classifying large data sets still takes too long. A large text collection is split across nodes for iterative inference, yet the inference over the documents on a single node remains sequential, so processing a large collection still requires a long time. To address this, Hadoop is combined with graphics processing units (GPUs): the inference for a node's text collection is moved onto the GPU so that multiple documents on one node are inferred in parallel, and multiple GPUs working in parallel accelerate the HD-LDA algorithm. Application results show that this approach lets HD-LDA in the distributed framework achieve a 7× speedup on large text collections.

19.
贺怀清  孙希栋 《计算机应用》2012,32(7):1939-1942
To address the slow speed of the serial photon mapping algorithm, the feasibility of parallelizing photon mapping is analyzed, and the algorithm is parallelized by fully exploiting the parallel computing power of the GPU under the Compute Unified Device Architecture (CUDA). To overcome the shortcomings of creating exactly as many GPU threads as photons in the photon emission and tracing stage, and the resource waste caused by dividing the work evenly among threads, a method in which threads cooperate with one another, combined with dynamic load balancing, is proposed; it nearly doubles the photon rendering speed. Experimental results demonstrate the effectiveness of combining inter-thread cooperation with dynamic balancing.
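The cooperative, dynamically balanced alternative to one-thread-per-photon can be pictured with a persistent-threads work queue: threads repeatedly grab the next photon index from a global atomic counter, so threads whose photons are cheap keep working while expensive ones finish. This is a sketch of the general technique under assumed names, not the paper's implementation:

// Persistent threads pull photon indices from a shared atomic counter, so
// the per-photon cost variation is balanced dynamically at run time.
#include <cstdio>
#include <cuda_runtime.h>

__device__ int nextPhoton;                 // shared work counter

__global__ void tracePhotons(float* result, int nPhotons) {
    for (;;) {
        int p = atomicAdd(&nextPhoton, 1); // grab the next unprocessed photon
        if (p >= nPhotons) return;         // queue drained: thread exits
        // stand-in for tracing photon p; the real work varies per photon,
        // which is exactly why static (even) division wastes resources
        float acc = 0.0f;
        for (int b = 0; b < (p % 64) + 1; ++b) acc += sinf(p * 0.1f + b);
        result[p] = acc;
    }
}

int main() {
    const int nPhotons = 1 << 16;
    float* dRes;
    cudaMalloc(&dRes, nPhotons * sizeof(float));
    int zero = 0;
    cudaMemcpyToSymbol(nextPhoton, &zero, sizeof(int));
    tracePhotons<<<32, 128>>>(dRes, nPhotons); // far fewer threads than photons
    cudaDeviceSynchronize();
    float h0;
    cudaMemcpy(&h0, dRes, sizeof(float), cudaMemcpyDeviceToHost);
    printf("result[0] = %f\n", h0);
    return 0;
}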

20.
We study an automated verification method for the functional correctness of parallel programs running on graphics processing units (GPUs). Our method is based on Kojima and Igarashi's Hoare logic for GPU programs. Our algorithm generates verification conditions (VCs) from a program annotated with specifications and loop invariants and passes them to off-the-shelf SMT solvers. It is often impossible, however, to solve naively generated VCs in reasonable time. A main difficulty stems from quantifiers over threads, due to the parallel nature of GPU programs. To overcome this difficulty, we additionally apply several transformations to simplify VCs before calling the SMT solvers. Our implementation successfully verifies the correctness of several GPU programs, including matrix multiplication optimized by using shared memory. In contrast to many existing verification tools for GPU programs, our verifier succeeds in verifying fully parameterized programs: parameters such as the number of threads and the sizes of matrices are all symbolic. We empirically confirm that our simplification heuristics are highly effective for improving the efficiency of the verification procedure.
