1.

王庆林李东升梅松竹赖志权窦勇《计算机研究与发展》2020,57(6):1140-1151

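With the rapid development of deep learning, convolutional neural networks have been widely applied in artificial intelligence fields such as computer vision and natural language processing. The Winograd fast convolution algorithm has attracted wide attention because it effectively reduces the computational complexity of the convolution operations in convolutional neural networks. As the Phytium (飞腾) multi-core processors independently developed by the National University of Defense Technology are adopted more widely in intelligent applications, there is strong demand for high-performance convolution implementations on these processors. Based on the architectural characteristics of Phytium multi-core processors and the computational characteristics of the Winograd fast convolution algorithm, a high-performance parallel Winograd fast convolution algorithm is proposed. The algorithm does not depend on general matrix multiplication library routines and consists of four parts: kernel transformation, input feature map transformation, element-wise multiplication, and inverse transformation of the output feature map. It fuses the data operations of the four parts and introduces a matching data layout, multi-level parallel data transformation algorithms, and a multi-level parallel matrix multiplication algorithm, improving both memory access performance and overall performance. Tests on two Phytium multi-core processors show that, compared with the Winograd fast convolution implementations in the open-source libraries ACL and NNPACK, the algorithm achieves speedups of 1.05~16.11 times and 1.66~16.90 times, respectively. After integration into the open-source framework Mxnet, the algorithm speeds up the forward computation of the VGG16 network by 3.01~6.79 times.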
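The four stages named in this abstract can be illustrated in one dimension with the textbook Winograd F(2,3) transform, which computes two outputs of a 3-tap filter with four multiplications instead of six. This is only a minimal sketch of the general technique, assuming the standard F(2,3) transform matrices; it is not the paper's fused, parallel implementation, and all function names are hypothetical.

```python
def transform_kernel(g):
    """Kernel transform (G * g) for F(2,3): 3 filter taps -> 4 values."""
    g0, g1, g2 = g
    return [g0, (g0 + g1 + g2) / 2, (g0 - g1 + g2) / 2, g2]


def transform_input(d):
    """Input transform (B^T * d): 4 input samples -> 4 values."""
    d0, d1, d2, d3 = d
    return [d0 - d2, d1 + d2, d2 - d1, d1 - d3]


def inverse_transform(m):
    """Output inverse transform (A^T * m): 4 products -> 2 outputs."""
    m0, m1, m2, m3 = m
    return [m0 + m1 + m2, m1 - m2 - m3]


def winograd_f23(d, g):
    """Two outputs of a 3-tap correlation over 4 inputs, using 4 multiplies."""
    u = transform_kernel(g)
    v = transform_input(d)
    m = [ui * vi for ui, vi in zip(u, v)]  # element-wise multiplication
    return inverse_transform(m)


def direct_conv(d, g):
    """Reference: the same two outputs computed directly with 6 multiplies."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]
```

In a convolutional layer the kernel transform can be computed once and reused for every input tile, so its cost is amortized; the 2-D variants used in practice nest the same transforms along rows and columns.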

2.

Parallel cube computation on modern CPUs and GPUs

Guoliang Zhou Hong Chen 《The Journal of supercomputing》2012,61(3):394-417

With the popularity of column-store databases, modern multi-core CPUs, and general-purpose computing on graphics processing units (GPGPUs), there will be radical changes in how processing is done in the online analytical processing (OLAP) and data warehousing fields. Cube computation is a core and time-consuming problem which has been researched extensively. However, most of the algorithms have been proposed without considering the prevalent multi-core architectures and column storage. This paper presents a new parallel cube algorithm that takes advantage of multi-core architectures. We first propose a cache-conscious bottom-up computation (BUC) algorithm called CC-BUC that adopts an integrated bottom-up and breadth-first partitioning order. Each dimension is separately stored and processed. In processing each dimension, breadth-first data scanning and result output reduce memory I/O and enhance cache locality. Cache misses are confined to the scope of a single dimension, and translation lookaside buffer (TLB) misses are reduced. Based on CC-BUC, we give a multi-core architecture-based cube algorithm called MC-Cubing. Multiple partitions are processed simultaneously and multiple threads execute in parallel inside each partition. MC-Cubing fits multi-core architectures and exposes a high degree of parallelism. The layout and associated algorithms take advantage of single instruction, multiple data (SIMD) instructions and thread-level parallelism (TLP). We implement and demonstrate the effectiveness of MC-Cubing on two multi-core architectures: multi-core CPUs and GPUs. Experimental results show that MC-Cubing runs nearly six times faster than BUC on real datasets.
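For readers unfamiliar with the baseline, plain bottom-up cube computation (BUC) can be sketched as a recursive partitioning of the rows, one dimension at a time. The sketch below is a serial illustration under simplifying assumptions (COUNT as the only aggregate, no minimum-support pruning, row-oriented tuples); it is not CC-BUC or MC-Cubing, and the function name `buc` and its parameters are hypothetical.

```python
from collections import defaultdict


def buc(rows, num_dims, start=0, prefix=None, out=None):
    """Bottom-up cube: count rows for every combination of dimension values.

    `prefix` holds one entry per dimension; None means that dimension is
    aggregated over (the ALL value). Returns a dict mapping prefix -> count.
    """
    if prefix is None:
        prefix = [None] * num_dims
    if out is None:
        out = {}
    out[tuple(prefix)] = len(rows)  # aggregate for the current group-by cell
    for dim in range(start, num_dims):
        partitions = defaultdict(list)
        for row in rows:  # partition the current rows on one dimension
            partitions[row[dim]].append(row)
        for value, part in partitions.items():
            prefix[dim] = value
            # refine only on later dimensions so each cell is produced once
            buc(part, num_dims, dim + 1, prefix, out)
        prefix[dim] = None  # restore ALL before moving to the next dimension
    return out
```

For example, `buc([("a", "x"), ("a", "y"), ("b", "x")], 2)` yields eight cells, including `(None, None)` with count 3 and `("a", None)` with count 2. The partitions produced at each level are independent of one another, which is the property a multi-threaded variant can exploit.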

3.

Performance Evaluation of GPU-Accelerated Spatial Interpolation Using Radial Basis Functions for Building Explicit Surfaces

Zengyu Ding Gang Mei Salvatore Cuomo Nengxiong Xu Hong Tian 《International journal of parallel programming》2018,46(5):963-991

This paper focuses on evaluating the computational performance of parallel spatial interpolation with Radial Basis Functions (RBFs) that is developed by utilizing modern GPUs. The RBFs can be used in spatial interpolation to build explicit surfaces such as Discrete Elevation Models. When interpolating with large numbers of data points and interpolated points to build explicit surfaces, the computational cost can be very high. To improve the computational efficiency, we specifically develop a parallel RBF spatial interpolation algorithm on many-core GPUs, and compare it with the parallel version implemented on multi-core CPUs. Five groups of experimental tests are conducted on two machines to evaluate the computational efficiency of the presented GPU-accelerated RBF spatial interpolation algorithm. Experimental results indicate that, in most cases, the parallel RBF interpolation algorithm on many-core GPUs does not have any significant advantage over the parallel version on multi-core CPUs in terms of computational efficiency. This unsatisfactory performance of the GPU-accelerated RBF interpolation algorithm is due to: (1) the limited size of global memory residing on the GPU, and (2) the need to solve a system of linear equations in each GPU thread to calculate the weights and prediction value of each interpolated point.

4.

Research on parallel online analytical processing algorithms on multi-core processors

周国亮王桂兰朱永利《计算机科学与探索》2013,(2):180-190

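In recent years computer hardware has advanced considerably, particularly in large memories and multi-core processors, yet algorithm efficiency has not improved at the same pace. The root causes are insufficient use of the CPU cache and the limitations of single-threaded programming. In online analytical processing, data cube computation is an important and time-consuming operation, so improving its efficiency is a difficult research problem in the field. This paper studies parallel cube algorithms based on the characteristics of multi-core CPUs and proposes the MT-Multi-Way (multi-threading multi-way) and MT-BUC (multi-threading bottom-up computation) algorithms. Through effective data partitioning and multi-thread cooperation, these algorithms avoid cache contention, ensure load balance, and achieve near-linear speedup. Building on these algorithms, a multi-core framework for cube algorithms is proposed, covering data partitioning strategies and the multi-core handling of recursive algorithms, to guide the parallelization of cube algorithms.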

5.

Optimized OpenCL implementation of the Elastodynamic Finite Integration Technique for viscoelastic media

M. Molero-Armenta Ursula Iturrarán-Viveros S. Aparicio M.G. Hernández 《Computer Physics Communications》2014

Development of parallel codes that are both scalable and portable for different processor architectures is a challenging task. To address this challenge, we investigate the acceleration of the Elastodynamic Finite Integration Technique (EFIT) to model 2-D wave propagation in viscoelastic media by using modern parallel computing devices (PCDs), such as multi-core CPUs (central processing units) and GPUs (graphics processing units). For that purpose we choose the industry open standard Open Computing Language (OpenCL) and an open-source toolkit called PyOpenCL. The implementation is platform independent and can be used on AMD or NVIDIA GPUs as well as classical multi-core CPUs. The code is based on the Kelvin–Voigt mechanical model, which has the advantage of not requiring additional field variables. OpenCL performance can, in principle, be improved by using local memory to eliminate global memory access latency. Our main contribution is the implementation of local memory and an analysis of the performance of local versus global memory using eight different computing devices (including Kepler, one of the fastest and most efficient high-performance computing technologies) with various operating systems. The full implementation of the code is included.

6.

Accelerating incompressible flow computations with a Pthreads-CUDA implementation on small-footprint multi-GPU platforms

Julien C. Thibault Inanc Senocak 《The Journal of supercomputing》2012,59(2):693-719

Graphics processing units (GPUs), originally designed for graphics rendering, have emerged as massively-parallel “co-processors” to the central processing unit (CPU). Small-footprint multi-GPU workstations with hundreds of processing elements can accelerate compute-intensive simulation science applications substantially. In this study, we describe the implementation of an incompressible flow Navier–Stokes solver for multi-GPU workstation platforms. A shared-memory parallel code with identical numerical methods is also developed for multi-core CPUs to provide a fair comparison between CPUs and GPUs. Specifically, we adopt NVIDIA’s Compute Unified Device Architecture (CUDA) programming model to implement the discretized form of the governing equations on a single GPU. Pthreads are then used to enable communication across multiple GPUs on a workstation. We use separate CUDA kernels to implement the projection algorithm to solve the incompressible fluid flow equations. Kernels are implemented on different memory spaces on the GPU depending on their arithmetic intensity. The memory-hierarchy-specific implementation is significantly faster. We present a systematic analysis of speedup and scaling using two generations of NVIDIA GPU architectures and provide a comparison of single and double precision computational performance on the GPU. Using a quad-GPU platform for single precision computations, we observe two orders of magnitude speedup relative to a serial CPU implementation. Our results demonstrate that multi-GPU workstations can serve as a cost-effective small-footprint parallel computing platform to accelerate computational fluid dynamics (CFD) simulations substantially.

7.

Parallel MNF algorithms and performance optimization on multi-core and many-core platforms

方民权张卫民高畅方建滨《软件学报》2015,26(S2):247-256

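The maximum noise fraction rotation (MNF rotation) method for dimensionality reduction of hyperspectral remote sensing images is computationally intensive and time-consuming. Parallel schemes and performance optimizations for the MNF algorithm are studied on multi-core CPU and many-core MIC (many integrated cores) platforms. Through hotspot analysis, corresponding parallel schemes and several optimization strategies are proposed for hotspots such as filtering, covariance matrix computation, and the MNF transform, and their effects are analyzed quantitatively. An implementation based on MKL (math kernel library) routines is designed and its performance evaluated. The C-MNF parallel algorithm for multi-core CPUs and the M-MNF parallel algorithm for CPU/MIC systems are designed and implemented. Experimental results show that C-MNF achieves speedups of 58.9~106.4 on multi-core CPUs, while M-MNF on the CPU/MIC heterogeneous system performs best, with a speedup of up to 137 times.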

8.

Fast anomaly detection in hyperspectral images with RX method on heterogeneous clusters

J. M. Molero A. Paz E. M. Garzón J. A. Martínez A. Plaza I. García 《The Journal of supercomputing》2011,58(3):411-419

Remotely sensed hyperspectral sensors provide image data containing rich information in both the spatial and the spectral domain, and this information can be used to address detection tasks in many applications. One of the most widely used and successful algorithms for anomaly detection in hyperspectral images is the RX algorithm. Despite its wide acceptance and high computational complexity when applied to real hyperspectral scenes, few approaches have been developed for parallel implementation of this algorithm. In this paper, we evaluate the suitability of using a hybrid parallel implementation with a high-dimensional hyperspectral scene. A general strategy to automatically map parallel hybrid anomaly detection algorithms for hyperspectral image analysis has been developed. Parallel RX has been tested on a heterogeneous cluster using this routine. The considered approach is quantitatively evaluated using hyperspectral data collected by NASA’s Airborne Visible Infra-Red Imaging Spectrometer system over the World Trade Center in New York, 5 days after the terrorist attacks. The numerical effectiveness of the algorithms is evaluated by means of their capacity to automatically detect the thermal hot spot of fires (anomalies). The speedups achieved show that a cluster of multi-core nodes can greatly accelerate the RX algorithm.

9.

Acceleration and optimization of grouped convolution on the Sunway (申威) many-core architecture

王鑫张铭《计算机应用研究》2023,40(6):1745-1749

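To address the high computational complexity and the large computation and parameter counts of standard convolution, a parallel grouped convolution algorithm is proposed for the domestically developed SW26010P many-core processor. The core idea is to use a distinctive data layout and map the computation onto multiple cores for parallel execution. Experimental results show that, compared with a single-core serial algorithm, the parallel grouped convolution algorithm achieves a maximum speedup of 79.5 and a maximum effective performance of 186.7 MFLOPS. After data-parallel optimization of the algorithm with SIMD instructions, a maximum speedup of 10.2 is obtained relative to the unoptimized parallel grouped convolution algorithm.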
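The motivation for grouped convolution can be made concrete by counting weights and multiply-accumulate operations: with `groups` groups, each filter sees only `c_in / groups` input channels, so both counts shrink by that factor. The helper below is a back-of-the-envelope sketch assuming a square kernel and no bias term; the function name `conv_cost` and the example layer shape are hypothetical and not taken from the paper.

```python
def conv_cost(c_in, c_out, k, h_out, w_out, groups=1):
    """Return (parameter count, multiply-accumulate count) for a 2-D convolution.

    Assumes a k x k kernel and ignores bias terms.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # each of the c_out filters covers c_in / groups input channels
    params = (c_in // groups) * c_out * k * k
    # every weight is applied once per output pixel
    macs = params * h_out * w_out
    return params, macs
```

For a hypothetical layer with 64 input channels, 128 output channels, a 3x3 kernel, and a 56x56 output, `conv_cost(64, 128, 3, 56, 56)` gives 73728 parameters, while `groups=4` gives 18432, a fourfold reduction in both parameters and multiply-accumulates.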

10.

Implementing molecular dynamics on hybrid high performance computers – short range forces

W. Michael Brown Peng Wang Steven J. Plimpton Arnold N. Tharrington 《Computer Physics Communications》2011,182(4):898-911

The use of accelerators such as graphics processing units (GPUs) has become popular in scientific computing applications due to their low cost, impressive floating-point capabilities, high memory bandwidth, and low electrical power requirements. Hybrid high-performance computers, machines with more than one type of floating-point processor, are now becoming more prevalent due to these advantages. In this work, we discuss several important issues in porting a large molecular dynamics code for use on parallel hybrid machines – (1) choosing a hybrid parallel decomposition that works on central processing units (CPUs) with distributed memory and accelerator cores with shared memory, (2) minimizing the amount of code that must be ported for efficient acceleration, (3) utilizing the available processing power from both multi-core CPUs and accelerators, and (4) choosing a programming model for acceleration. We present our solution to each of these issues for short-range force calculation in the molecular dynamics package LAMMPS; however, the methods can be applied in many molecular dynamics codes. Specifically, we describe algorithms for efficient short-range force calculation on hybrid high-performance machines. We describe an approach for dynamic load balancing of work between CPU and accelerator cores. We describe the Geryon library that allows a single code to compile with both CUDA and OpenCL for use on a variety of accelerators. Finally, we present results on a parallel test cluster containing 32 Fermi GPUs and 180 CPU cores.

11.

A multi-threaded algorithm for seismic coherence cube attribute extraction

杨尚琴许自龙洪承煜《计算机系统应用》2012,21(11):72-75

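To take full advantage of multi-core computers and speed up the computation of coherence cubes from seismic data, multi-threaded parallel techniques on multi-core processors are studied, a parallel coherence cube algorithm is designed and implemented, and the serial and parallel algorithms are compared in performance tests. The results show that Pthread multi-threading can fully exploit multi-core resources, achieves a reasonably good linear speedup, improves the computational efficiency of the system, and is well suited to processing large volumes of seismic data.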

12.

A multi-core computation model based on horizontal locality

袁良张云泉《计算机科学》2012,39(7):1-6

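On-chip multi-core has become the way to extend Moore's law, and parallel algorithm design, programming models, compilers, and runtime systems all need computation models for analysis. Existing multi-core models describe contention among threads for shared resources such as caches fairly accurately, but give little consideration to data sharing among threads. The concepts of horizontal locality of a cache shared among threads and of task sharing rate are proposed. On this basis the serial memory hierarchy model RAM(h) is extended, and a multi-core parallel computation model MRAM(h) that accounts for the task sharing rate is proposed.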

13.

Optimization of minimum volume constrained hyperspectral image unmixing on CPU–GPU heterogeneous platform

Zebin Wu Jianjun Liu Shun Ye Le Sun Zhihui Wei 《Journal of Real-Time Image Processing》2018,15(2):265-277

Hyperspectral unmixing is essential for efficient hyperspectral image processing. Nonnegative matrix factorization based on minimum volume constraint (MVC-NMF) is one of the most widely used methods for unsupervised unmixing of hyperspectral images without the pure-pixel assumption. But the model of MVC-NMF is unstable, and the traditional solution based on the projected gradient algorithm (PG-MVC-NMF) converges slowly with low accuracy. In this paper, a novel parallel method is proposed for minimum volume constrained hyperspectral image unmixing on a CPU–GPU heterogeneous platform. First, an optimized unmixing model of minimum logarithmic volume regularized NMF is introduced and solved based on the second-order approximation of function and alternating direction method of multipliers (SO-MVC-NMF). Then, the parallel algorithm for optimized MVC-NMF (PO-MVC-NMF) is proposed based on the CPU–GPU heterogeneous platform, taking advantage of the parallel processing capabilities of GPUs and logic control abilities of CPUs. Experimental results based on both simulated and real hyperspectral images indicate that the proposed algorithm is more accurate and robust than the traditional PG-MVC-NMF, and the total speedup of PO-MVC-NMF compared to PG-MVC-NMF is over 50 times.

14.

Spectral–spatial classification of hyperspectral images by algebraic multigrid based multiscale information fusion

Yi Wang Hexiang Duan 《International journal of remote sensing》2019,40(4):1301-1330

In this work, we present a novel spectral-spatial classification framework for hyperspectral images (HSIs) by integrating the techniques of algebraic multigrid (AMG), hierarchical segmentation (HSEG) and Markov random field (MRF). The proposed framework makes two main contributions. First, an effective HSI segmentation method is developed by combining the AMG-based marker selection approach and the conventional HSEG algorithm to construct a set of unsupervised segmentation maps in multiple scales. To improve the computational efficiency, the fast Fisher Markov selector (FMS) algorithm is exploited for feature selection before image segmentation. Second, an improved MRF energy function is proposed for multiscale information fusion (MIF) by considering both spatial and inter-scale contextual information. Experiments were performed using two airborne HSIs to evaluate the performance of the proposed framework in comparison with several popular classification methods. The experimental results demonstrated that the proposed framework can provide superior performance in terms of both qualitative and quantitative analysis.

15.

A dynamic task scheduling algorithm for CPU-GPU heterogeneous multi-core systems

裴颂文宁静张俊格《计算机应用研究》2016,33(11)

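CPU-GPU heterogeneous multi-core systems are widely used because they accelerate compute-intensive applications significantly, but they are prone to load imbalance. To address this problem, this paper proposes a dynamic task scheduling algorithm for CPU-GPU heterogeneous multi-core systems. The algorithm makes full use of the CPU's thread resources and the GPU's computing resources and accurately measures the computing capability of the CPU and the GPU, so that the sizes of the data blocks assigned to each can be adjusted dynamically, reducing the total execution time of the workload and improving system speedup. Experimental results show that the algorithm improves system speedup by 34%~103%.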
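The idea of sizing data blocks from measured computing capability can be sketched as a proportional split: each device receives work in proportion to its observed throughput, so both are expected to finish at about the same time. This is an illustrative sketch only, not the algorithm from the paper; the function names, the items-per-second rate model, and the exponential smoothing of measurements are all assumptions.

```python
def split_by_throughput(total_items, cpu_rate, gpu_rate):
    """Split work so both devices are expected to finish at the same time.

    Rates are measured throughputs in items per second.
    Returns (cpu_items, gpu_items).
    """
    if cpu_rate <= 0 or gpu_rate <= 0:
        raise ValueError("rates must be positive")
    gpu_items = round(total_items * gpu_rate / (cpu_rate + gpu_rate))
    return total_items - gpu_items, gpu_items


def update_rate(old_rate, items_done, seconds, smoothing=0.5):
    """Blend the previous rate estimate with the newly measured items/second."""
    measured = items_done / seconds
    return (1 - smoothing) * old_rate + smoothing * measured
```

A scheduler built on these helpers would time each device after every batch, call `update_rate` for both, and re-split the next batch, so that block sizes track changes in the relative speed of the two devices.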

16.

Multi-core-CPU and GPU-accelerated radiative transfer models based on the discrete ordinate method

Dmitry S. Efremenko Diego G. Loyola Adrian Doicu Robert J.D. Spurr 《Computer Physics Communications》2014

The operational processing of remote sensing data usually requires high-performance radiative transfer model (RTM) simulations. To date, multi-core CPUs and also Graphics Processing Units (GPUs) have been used for highly intensive parallel computations. In this paper, we have compared multi-core and GPU implementations of an RTM based on the discrete ordinate solution method. For the GPU implementation, the original CPU code has been redesigned using the C-oriented Compute Unified Device Architecture (CUDA) developed by NVIDIA.

17.

Parallel online spatial and temporal aggregations on multi-core CPUs and many-core GPUs

《Information Systems》2014

With the increasing availability of locating and navigation technologies on portable wireless devices, huge amounts of location data are being captured at ever growing rates. Spatial and temporal aggregations in an Online Analytical Processing (OLAP) setting for the large-scale ubiquitous urban sensing data play an important role in understanding urban dynamics and facilitating decision making. Unfortunately, existing spatial, temporal and spatiotemporal OLAP techniques are mostly based on traditional computing frameworks, i.e., disk-resident systems on uniprocessors based on serial algorithms, which makes them incapable of handling large-scale data on the parallel hardware architectures that commodity computers are already equipped with. In this study, we report our designs, implementations and experiments on developing a data management platform and a set of parallel techniques to support high-performance online spatial and temporal aggregations on multi-core CPUs and many-core Graphics Processing Units (GPUs). Our experimental results show that we are able to spatially associate nearly 170 million taxi pickup location points with their nearest street segments among 147,011 candidates in about 5–25 s on both an Nvidia Quadro 6000 GPU device and dual Intel Xeon E5405 quad-core CPUs when their Vector Processing Units (VPUs) are utilized for computing intensive tasks. After spatially associating points with road segments, spatial, temporal and spatiotemporal aggregations are reduced to relational aggregations and can be processed on the order of a fraction of a second on both GPUs and multi-core CPUs.
In addition to demonstrating the feasibility of building a high-performance OLAP system for processing large-scale taxi trip data for real-time, interactive data explorations, our work also opens paths to achieving even higher OLAP query efficiency for large-scale applications through integrating domain-specific data management platforms, novel parallel data structures and algorithm designs, and hardware architecture friendly implementations.

18.

An image division approach for volume ray casting in multi-threading environment

Sukhyun Lim Daesung Lee Byeong-Seok Shin 《Multimedia Tools and Applications》2014,68(2):211-223

For efficient parallel volume ray casting suitable for recent multi-core CPUs, we propose an image-ordered approach that uses a cost function to distribute the workload evenly across processing nodes. At the first frame, we divide the image space evenly and compute a cost function. From the next frame onward, we exploit frame coherence and divide the image space unevenly using the cost function computed for the previous frame. Conventional image-ordered parallel approaches have focused on dividing and compositing volume datasets. However, the divisions and accumulations are negligible for recent multi-core CPUs because they are performed inside one physical CPU. As a result, we can reduce the rendering time without deteriorating the image quality by applying a cost function that reflects all time-consuming steps of the volume ray casting.
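The cost-driven division described in this abstract can be sketched as cutting the image into contiguous row bands whose accumulated cost from the previous frame is roughly equal. This is a minimal sketch under assumptions not stated in the abstract (per-row costs, horizontal bands, one band per worker); the function name `divide_rows` is hypothetical.

```python
def divide_rows(row_costs, num_workers):
    """Cut rows into contiguous bands of roughly equal total cost.

    `row_costs[i]` is the cost measured for row i in the previous frame.
    Returns a list of (start, end) row ranges with `end` exclusive.
    """
    total = sum(row_costs)
    bands, start, acc = [], 0, 0.0
    for row, cost in enumerate(row_costs):
        acc += cost
        # cumulative cost at which the band currently being built should end
        target = total * (len(bands) + 1) / num_workers
        # close the band once its share is reached; the final band stays
        # open so that it always extends to the last row
        if acc >= target and len(bands) < num_workers - 1:
            bands.append((start, row + 1))
            start = row + 1
    bands.append((start, len(row_costs)))
    return bands
```

With uniform costs, as for a first frame, the split is even: `divide_rows([1] * 8, 2)` gives `[(0, 4), (4, 8)]`. With costs concentrated in the lower rows, `divide_rows([1, 1, 1, 1, 4, 4, 4, 4], 2)` gives `[(0, 6), (6, 8)]`. When the cost is concentrated in very few rows, some bands can come out empty or fewer than `num_workers` bands may be returned, which a real implementation would need to handle.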

19.

Providing Source Code Level Portability Between CPU and GPU with MapCG

洪春涛陈德颢陈羽北陈文光郑纬民林海波《计算机科学技术学报》2012,27(1):42-56

Graphics processing units (GPU) have taken an important role in the general purpose computing market in recent years. At present, the common approach to programming GPU units is to write GPU specific cod...

20.

Improved parallelism and scheduling in multi-core software routers

Norbert Egi Gianluca Iannaccone Maziar Manesh Laurent Mathy Sylvia Ratnasamy 《The Journal of supercomputing》2013,63(1):294-322

Recent technological advances in commodity server architectures, with multiple multi-core CPUs, integrated memory controllers, high-speed interconnects, and enhanced network interface cards, provide substantial computational capacity, and thus an attractive platform for packet forwarding. However, to exploit this available capacity, we need a suitable software platform that allows effective parallel packet processing and resource management. In this paper, we first introduce an improved forwarding architecture for software routers that enhances parallelism by exploiting hardware classification and multi-queue support, already available in recent commodity network interface cards. After evaluating the original scheduling algorithm of the widely-used Click modular router, we propose solutions for extending this scheduler for improved fairness, throughput, and more precise resource management. To illustrate the potential benefits of our proposal, we implement and evaluate a few key elements of our overall design. Finally, we discuss how our improved forwarding architecture and resource management might be applied in virtualized software routers.