期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Parallel prefix operations on GPU: tridiagonal system solvers and scan operators

Diéguez Adrián P. Amor Margarita Doallo Ramón 《The Journal of supercomputing》2019,75(3):1510-1523

The Journal of Supercomputing - Modern GPUs can achieve high computing power at low cost, but still requires much time and effort. Tridiagonal system and scan solvers are one example of widely used... 相似文献

2.

Performance evaluation of iterative geometric fitting algorithms

Kenichi Kanatani Yasuyuki Sugaya 《Computational statistics & data analysis》2007,52(2):1208-1222

The convergence performance of typical numerical schemes for geometric fitting for computer vision applications is compared. First, the problem and the associated KCR lower bound are stated. Then, three well-known fitting algorithms are described: FNS, HEIV, and renormalization. To these, we add a special variant of Gauss-Newton iterations. For initialization of iterations, random choice, least squares, and Taubin's method are tested. Simulation is conducted for fundamental matrix computation and ellipse fitting, which reveals different characteristics of each method. 相似文献

3.

Performance of iterative equation solvers for mass transfer problems in three-dimensional sphere packings in COMSOL

《Simulation Modelling Practice and Theory》2013

Packed bed chromatography is commonly applied for the separation of large molecules in biopharmaceutical industry. A technical chromatography system is typically composed of a cylindrical column that is filled with porous spheres. Particularly in small columns, which are increasingly applied for parallel experiments on lab robotic platforms, the impacts of inhomogeneous packing and wall effects on separation performance can be quite significant. We hence study mass transfer by convection, diffusion and adsorption in three-dimensional sphere packings. Random packings are externally generated and imported into COMSOL where the model equations are easy to implement. However, the COMSOL functions for automatic meshing and for iteratively solving the resulting equation systems fail to work with default settings. We have previously established a semi-automated and half-manual meshing procedure for rather small packings with less than 150 spheres that works with the direct PARDISO solver. The present contribution addresses the evaluation and optimization of the iterative equation solvers that are provided by COMSOL for the given spatial geometries with up to ten million degrees of freedom. The presented results illustrate that we can iteratively solve systems with up to 750 spheres using less memory and less computational time. 相似文献

4.

The Linux kernel as a case study in software evolution

Ayelet Israeli Author Vitae Author Vitae 《Journal of Systems and Software》2010,83(3):485-1097

We use 810 versions of the Linux kernel, released over a period of 14 years, to characterize the system’s evolution, using Lehman’s laws of software evolution as a basis. We investigate different possible interpretations of these laws, as reflected by different metrics that can be used to quantify them. For example, system growth has traditionally been quantified using lines of code or number of functions, but functional growth of an operating system like Linux can also be quantified using the number of system calls. In addition we use the availability of the source code to track metrics, such as McCabe’s cyclomatic complexity, that have not been tracked across so many versions previously. We find that the data supports several of Lehman’s laws, mainly those concerned with growth and with the stability of the process. We also make some novel observations, e.g. that the average complexity of functions is decreasing with time, but this is mainly due to the addition of many small functions. 相似文献

5.

A case study on mathematical expression recognition to GPU

Ying-Chin Lin Chun-Yao Wang Jing-Yun Zeng 《The Journal of supercomputing》2017,73(8):3333-3343

The technology of mathematical expression identification and recognition extracts mathematical expressions in document images, and it has been studied for over a decade. Based on previous works, we develop an automatic recognition tool, named EqnEye, which leverages the OpenCV library to perform image processing and Tesseract tool to recognize mathematical symbols. We also apply correction methods before the recognition stage to improve the recognition accuracy. To improve the efficiency for processing images of high resolution, the parallel implementation of thresholding method on GPU is integrated into this work. Experimental results exhibit the success of our correction methods to enhance the accuracy and the slight improvement to the performance. In addition, porting the recognition tool to handy devices can produce more value-added applications. 相似文献

6.

BLAS 库在多核处理器上的性能测试与分析

陈少虎张云泉张先轶程豪《软件学报》2010,21(Z1):214-223

BLAS 库是高性能计算中最基本的数学库,它的性能对超级计算机的性能有着极大的影响.而且随着CPU多核化的发展,BLAS 的多核并行性能已经变得比与体系结构相关的单核性能更加重要.实验以流行于高性能计算的Xeon、Opteron 系列多核X86 处理器为例,全面测试了GotoBLAS、Atlas、MKL 和ACML 四种主流的BLAS 库的所有1,2,3 级函数,并覆盖了不同计算规模和多核并行方面的测试.通过测试结果,分析源代码、BLAS 库资料和论文的方式,分析BLAS 有效的优化和并行方法,以及它们所适合的平台.为BLAS 的优化、使用,甚至高性能处理器的发展上提供有益的建议.实验结果表明,比起一个逻辑处理强大但是复杂的处理器,一个cache 更大、性能更好,内存带宽更宽、延迟更小,主频更高的处理器往往能在高性能计算中取得更好的性能.同时,X86 平台上的状况对其他体系结构也有巨大的借鉴意义. 相似文献

7.

Performance evaluation of parallel iterative deepening A on clusters of workstations

Abdel-Elah 《Performance Evaluation》2005,60(1-4):223-236

In this paper we investigate the performance of distributed heuristic search methods based on a well-known heuristic search algorithm, the iterative deepening A^* (IDA^*). The contribution of this paper includes proposing and assessing a distributed algorithm for IDA^*. The assessment is based on space, time and solution quality that are quantified in terms of several performance parameters such as generated search space and real execution time among others. The experiments are conducted on a cluster computer system consisting of 16 hosts built around a general-purpose network. The objective of this research is to investigate the feasibility of cluster computing as an alternative for hosting applications requiring intensive graph search. The results reveal that cluster computing improves on the performance of IDA^* at a reasonable cost. 相似文献

8.

Parallel iterative solvers involving fast wavelet transforms for the solution of BEM systems

《Advances in Engineering Software》2002,33(7-10):417-426

This paper describes the parallelization of a strategy to speed up the convergence of iterative methods applied to boundary element method (BEM) systems arising from problems with non-smooth boundaries and mixed boundary conditions. The aim of the work is the application of fast wavelet transforms as a black box transformation in existing boundary element codes. A new strategy was proposed, applying wavelet transforms on the interval, so it could be used in case of non-smooth coefficient matrices. Here, we describe the parallel iterative scheme and we present some of the results we have obtained. 相似文献

9.

Improving performance of iterative solvers with the AXC format using the Intel Xeon Phi

Coronado-Barrientos Edoardo Indalecio Guillermo García-Loureiro Antonio 《The Journal of supercomputing》2018,74(6):2823-2840

The Journal of Supercomputing - This work is focused on the application of the new AXC format in iterative algorithms on the Intel Xeon Phi coprocessor to solve linear systems by accelerating the... 相似文献

10.

Central force optimization on a GPU: a case study in high performance metaheuristics 总被引：1，自引：1，他引：0

Robert C. Green II Lingfeng Wang Mansoor Alam Richard A. Formato 《The Journal of supercomputing》2012,62(1):378-398

Central Force Optimization (CFO) is a new and deterministic population based metaheuristic algorithm that has been demonstrated to be competitive with other metaheuristic algorithms such as Genetic Algorithms (GA), Particle Swarm Optimization (PSO), and Group Search Optimization (GSO). While CFO often shows superiority in terms of functional evaluations and solution quality, the algorithm is complex and typically requires increased computational time. In order to decrease the computational time required for convergence when using CFO, this study presents the first parallel implementation of CFO on a Graphics Processing Unit (GPU) using the NVIDIA Compute Unified Device Architecture (CUDA). Two versions of the CFO algorithm, Parameter-Free CFO (PF-CFO) and Pseudo-Random CFO (PR-CFO), are implemented using CUDA on a NVIDIA Quadro 1000M and examined using four test problems ranging from 10 to 50 dimensions. Discussion is made concerning the implementation of the CFO algorithms in terms of problem decomposition, memory access, scalability, and divergent code. Results demonstrate substantial speedups ranging from roughly 1 to 28 depending on problem size and complexity. 相似文献

11.

通过GPU加速数据挖掘的研究进展和实践

戴春娥陈维斌傅顺开李志强《计算机工程与应用》2015,51(16):109-116

将计算密度高的部分迁移到GPU上是加速经典数据挖掘算法的有效途径。首先介绍GPU特性和主要的GPU编程模型,随后针对数据挖掘主要任务类型分别介绍基于GPU加速的工作,包括分类、聚类、关联分析、时序分析和深度学习。最后分别基于CPU和GPU实现协同过滤推荐的两类经典算法,并基于经典的MovieLens数据集的实验验证GPU对加速数据挖掘应用的显著效果,进一步了解GPU加速的工作原理和实际意义。相似文献

12.

Performance evaluation of industrial enterprises via fuzzy inference system approach: a case study

Gözde Ulutagay Fatih Ecer Efendi Nasibov 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2015,19(2):449-458

相似文献

13.

Performance and precision of histogram calculation on GPUs: Cosmological analysis as a case study

Miguel Cárdenas-Montes Juan José Rodríguez-Vázquez Miguel A. Vega-Rodríguez Ignacio Sevilla-Noarbe Eusebio Sánchez Alvaro 《Computer Physics Communications》2014

Histogram calculation is an essential part of many scientific analyses. In Cosmology, histograms are employed intensively in the computation of correlation functions of galaxies, as part of Large Scale Structure studies. Among the most commonly used ones are the two-point, three-point and the shear–shear correlation functions. In these computations, the precision of the calculation of the counts in each bin is a key element for achieving the highest accuracy. In order to accelerate the analysis of increasingly larger datasets, GPU computing is becoming widely employed in this field. However, the recommended histogram calculation procedure becomes less precise when bins become highly populated in these sort of algorithms. In this work, an alternative implementation to correct this problem is proposed and tested. This approach is based on distributing the creation of histograms between the CPU and GPU. The implementation is tested using three cosmological analyses with observational data. The results show an increased performance in terms of accuracy while keeping the same execution time. 相似文献

14.

Performance evaluation of hybrid programming patterns for large CPU/GPU heterogeneous clusters

Fengshun Lu Junqiang Song Fukang Yin Xiaoqian Zhu 《Computer Physics Communications》2012,183(6):1172-1181

相似文献

15.

Multigrid solvers for the non-aligned sonic flow: the constant coefficient case

《Computers & Fluids》1999,28(4-5):511-549

An approach for the construction of multigrid solvers for non-elliptic equations on a rectangular grid is presented. The results of both the analysis and the numerical experiments demonstrate that such an approach permits a full multigrid efficiency to be achieved, even in the case that the equation characteristics do not align with the grid. To serve as a model problem, the two- and three-dimensional linearized sonic-flow equations with a constant velocity field have been chosen. Efficient FMG solvers for the problems are demonstrated. 相似文献

16.

The performances of iterative type-2 fuzzy C-mean on GPU for image segmentation

Ali Noureddine Ait abbassi Ahmed El Cherradi Bouchaib 《The Journal of supercomputing》2022,78(2):1583-1601

The Journal of Supercomputing - Fuzzy C-mean (FCM) is an algorithm for data segmentation and classification, robust and very popular within the scientific community. It is used in several fields... 相似文献

17.

The Webkit tangible user interface: a case study of iterative prototyping 总被引：1，自引：0，他引：1

Stringer M. Rode J.A. Toye E.F. Blackwell A.F. Simpson A.R. 《Pervasive Computing, IEEE》2005,4(4):35-41

Here we describe our 18-month iterative-design program to create Webkit, a tangible user interface (TUI) to help school children learn argumentation skills - an English National Curriculum requirement. The goal was to have Webkit disappear into the environment such that the teachers and students remained focused on the learning activity. However, the school children wouldn't be allowed to evaluate our prototypes unless we showed that the classroom time allocated to our research simultaneously helped teach the curriculum. To achieve this, we developed a form of user-centered design called curriculum-focused design. When we used CFD, the curriculum and student needs motivated not just the final technology but also our design process. 相似文献

18.

Communication and computation optimization of concurrent kernels using kernel coalesce on a GPU

B. Neelima G. Ram Mohana Reddy Prakash S. Raghavendra 《Concurrency and Computation》2015,27(1):47-68

General purpose computation on graphics processing unit (GPU) is rapidly entering into various scientific and engineering fields. Many applications are being ported onto GPUs for better performance. Various optimizations, frameworks, and tools are being developed for effective programming of GPU. As part of communication and computation optimizations for GPUs, this paper proposes and implements an optimization method called as kernel coalesce that further enhances GPU performance and also optimizes CPU to GPU communication time. With kernel coalesce methods, proposed in this paper, the kernel launch overheads are reduced by coalescing the concurrent kernels and data transfers are reduced incase of intermediate data generated and used among kernels. Computation optimization on a device (GPU) is performed by optimizing the number of blocks and threads launched by tuning it to the architecture. Block level kernel coalesce method resulted in prominent performance improvement on a device without the support for concurrent kernels. Thread level kernel coalesce method is better than block level kernel coalesce method when the design of a grid structure (i.e., number of blocks and threads) is not optimal to the device architecture that leads to underutilization of the device resources. Both the methods perform similar when the number of threads per block is approximately the same in different kernels, and the total number of threads across blocks fills the streaming multiprocessor (SM) capacity of the device. Thread multi‐clock cycle coalesce method can be chosen if the programmer wants to coalesce more than two concurrent kernels that together or individually exceed the thread capacity of the device. If the kernels have light weight thread computations, multi clock cycle kernel coalesce method gives better performance than thread and block level kernel coalesce methods. If the kernels to be coalesced are a combination of compute intensive and memory intensive kernels, warp interleaving gives higher device occupancy and improves the performance. Multi clock cycle kernel coalesce method for micro‐benchmark1 considered in this paper resulted in 10–40% and 80–92% improvement compared with separate kernel launch, without and with shared input and intermediate data among the kernels, respectively, on a Fermi architecture device, that is, GTX 470. A nearest neighbor (NN) kernel from Rodinia benchmark is coalesced to itself using thread level kernel coalesce method and warp interleaving giving 131.9% and 152.3% improvement compared with separate kernel launch and 39.5% and 36.8% improvement compared with block level kernel coalesce method, respectively.Copyright © 2013 John Wiley & Sons, Ltd. 相似文献

19.

GPU上实现的向量点积的性能分析

郭雷刘进锋《计算机工程与应用》2012,48(2):201-202

CUDA是一种较为简便的利用GPU进行通用计算的技术。研究了GPU上基于CUDA的几种向量点积算法,比较、分析了每种算法的性能。实验表明,GPU上最快的算法比CPU上的算法快了约7倍。相似文献

20.

A tuning approach for iterative multiple 3d stencil pipeline on GPUs: Anisotropic Nonlinear Diffusion algorithm as case study

S. Tabik M. Peemen L. F. Romero 《The Journal of supercomputing》2018,74(4):1580-1608

This paper focuses on challenging applications that can be expressed as an iterative pipeline of multiple 3d stencil stages and explores their optimization space on GPUs. For this study, we selected a representative example from the field of digital signal processing, the Anisotropic Nonlinear Diffusion algorithm. An open issue to these applications is to determine the optimal fission/fusion level of the involved stages and whether that combination benefits from data tiling. This implies exploring a large space of all the possible fission/fusion combinations with and without tiling, thus making the process non-trivial. This study provides insights to reduce the optimization tuning space and programming effort of iterative multiple 3d stencils. Our results demonstrate that all combinations that fuse the bottleneck stencil with high halos update cost (\(>25\%\), this percentage can be measured or estimated experimentally for each single stencil) and high registers and shared memory accesses must not be considered in the exploration process. The optimal fission/fusion combination is up to 1.65\(\times \) faster than the case in which we fully decompose our stencil without tiling and 5.3\(\times \) faster with respect to the fully fused version on the NVIDIA GPUs. 相似文献