20 similar documents found
1.
Fabrizio Milo, Massimo Bernaschi 《Journal of Systems and Software》2011,84(12):2088-2096
We describe the implementation, based on the Compute Unified Device Architecture (CUDA) for Graphics Processing Units (GPUs), of a novel and very effective approach to quickly test passphrases used to protect private keyrings of OpenPGP cryptosystems. Our combination of algorithm and implementation reduces the time required to test a set of possible passphrases by three orders of magnitude with respect to an attack based on the procedure described in the OpenPGP standard and implemented by software packages like GnuPG, and gives a tenfold speedup compared to our highly tuned CPU implementation. Our solution can be considered a replacement for and an extension of pgpcrack, a utility used in the past for attacking PGP. The optimizations described can be applied to other cryptosystems and confirm that the GPU architecture is also very effective for running applications that make extensive (if not exclusive) use of integer operations.
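As a hedged illustration of the attack structure (not the authors' code), the sketch below assigns one candidate passphrase per GPU thread; derive_key is a hypothetical stand-in for the iterated, salted key derivation that OpenPGP's S2K actually performs, and the comparison against a precomputed target stands in for the keyring decryption check. Host-side setup is omitted.

// Minimal structural sketch: one thread tests one candidate passphrase.
// derive_key() is a placeholder for the real OpenPGP key derivation.
#include <cuda_runtime.h>

#define MAX_PASS_LEN 32

__device__ unsigned int derive_key(const char* pass, int len, int iterations)
{
    // Placeholder key derivation: iterated FNV-1a style mixing, NOT S2K.
    unsigned int h = 2166136261u;
    for (int it = 0; it < iterations; ++it)
        for (int i = 0; i < len; ++i) {
            h ^= (unsigned char)pass[i];
            h *= 16777619u;
        }
    return h;
}

__global__ void test_passphrases(const char* candidates, const int* lengths,
                                 int n, unsigned int target, int* found_idx)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    const char* pass = candidates + tid * MAX_PASS_LEN;   // one candidate per thread
    if (derive_key(pass, lengths[tid], 1024) == target)
        atomicExch(found_idx, tid);                        // report a match
}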
2.
The error-resilient entropy coding (EREC) algorithm is an effective method for combating error propagation at low cost in many compression methods that use variable-length coding (VLC). However, the main drawback of the EREC is its high complexity. To overcome this disadvantage, a parallel EREC is implemented on a graphics processing unit (GPU) using NVIDIA's CUDA technology. The original EREC offers only fine-grained parallelism at each stage, which brings additional communication overhead. To achieve an efficient parallel EREC, we propose the partitioned EREC (P-EREC) algorithm, which splits the variable-length blocks into groups and then codes every group separately using the EREC. Each GPU thread processes one group, making the EREC coarse-grained parallel. In addition, some optimization strategies are discussed in order to obtain higher performance on the GPU. When the variable-length data blocks are divided into 128 groups (256 groups, resp.), experimental results show that the parallel P-EREC achieves a 32× to 123× (54× to 350×, resp.) speedup over the original C code of the EREC compiled with the O2 optimization option. Even higher speedups can be obtained with more groups. Compared to the EREC, the P-EREC not only achieves good speedup but also slightly improves the resilience of the VLC bit-stream against burst or random errors.
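A minimal sketch of the one-thread-per-group idea described above, assuming a fixed number of blocks per group and a simple linear offset sequence; it redistributes block lengths only (no actual bit copying), and all identifiers are illustrative rather than taken from the paper.

// Each thread runs a length-only EREC redistribution over its own group of
// variable-length blocks: bits that do not fit in a block's own fixed-size
// slot are placed, stage by stage, into slots of other blocks in the same
// group that still have spare space.
#include <cuda_runtime.h>

#define BLOCKS_PER_GROUP 64

__global__ void p_erec_lengths(const int* block_len,   // [num_groups * BLOCKS_PER_GROUP]
                               int* slot_fill,         // output, same shape
                               int num_groups, int slot_size)
{
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    if (g >= num_groups) return;

    const int* len  = block_len + g * BLOCKS_PER_GROUP;
    int*       fill = slot_fill + g * BLOCKS_PER_GROUP;
    int residual[BLOCKS_PER_GROUP];

    // Stage 0: every block fills its own slot first.
    for (int i = 0; i < BLOCKS_PER_GROUP; ++i) {
        int put = min(len[i], slot_size);
        fill[i] = put;
        residual[i] = len[i] - put;
    }
    // Later stages: leftover bits search other slots at growing offsets.
    for (int k = 1; k < BLOCKS_PER_GROUP; ++k)
        for (int i = 0; i < BLOCKS_PER_GROUP; ++i) {
            if (residual[i] == 0) continue;
            int j = (i + k) % BLOCKS_PER_GROUP;
            int spare = slot_size - fill[j];
            if (spare > 0) {
                int moved = min(residual[i], spare);
                fill[j] += moved;
                residual[i] -= moved;
            }
        }
}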
3.
We report on our experience with integrating and using graphics processing units (GPUs) as fast parallel floating-point co-processors to accelerate two fundamental computational scientific kernels: sparse direct factorization and nonlinear interior-point optimization. Since a full re-implementation of these complex kernels is typically not feasible, we identify matrix-matrix multiplication as a natural first entry point for a minimally invasive integration of GPUs. We investigate the performance on the NVIDIA GeForce 8800 multicore chip, originally architected for intensive gaming applications, and exploit its architectural features to design an efficient GPU-parallel sparse matrix solver. A prototype approach to leveraging the bandwidth and computing power of GPUs for these matrix kernel operations is demonstrated, resulting in an overall performance of over 110 GFlops/s on the desktop for large matrices and over 38 GFlops/s for sparse matrices arising in real applications. We use our GPU algorithm for PDE-constrained optimization problems and demonstrate that the commodity GPU is a useful co-processor for scientific applications.
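Since the paper's entry point for GPU acceleration is matrix-matrix multiplication, a standard shared-memory tiled SGEMM kernel is sketched below for illustration; it is a textbook kernel, not the authors' sparse solver, and assumes square matrices whose dimension is a multiple of the tile size.

// C = A * B for n x n matrices, n a multiple of TILE.
#include <cuda_runtime.h>

#define TILE 16

__global__ void sgemm_tiled(const float* A, const float* B, float* C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Stage one tile of A and one tile of B in on-chip shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}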
4.
A parallel implementation, via CUDA, of the dynamic programming method for the knapsack problem on an NVIDIA GPU is presented. A GTX 260 card with 192 cores (1.4 GHz) is used for computational tests, and processing times obtained with the parallel code are compared to those of the sequential code on a CPU with a 3.0 GHz Intel Xeon. The results show a speedup factor of 26 for large problem instances. Furthermore, in order to limit communication between the CPU and the GPU, a compression technique is presented which significantly decreases memory occupancy.
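A minimal sketch of the data-parallel dynamic-programming recurrence behind such an implementation: for each item, one thread per capacity value updates the table, reading from the previous row and writing to a new one. The kernel below is illustrative only; the compression technique mentioned in the abstract is not shown.

// One DP pass for one item; the host loops over items and swaps buffers.
#include <cuda_runtime.h>

__global__ void knapsack_step(const int* prev, int* next,
                              int capacity, int weight, int value)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c > capacity) return;
    int best = prev[c];                           // do not take the item
    if (c >= weight)                              // take the item if it fits
        best = max(best, prev[c - weight] + value);
    next[c] = best;
}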
5.
A performance study of general-purpose applications on graphics processors using CUDA
Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Kevin Skadron 《Journal of Parallel and Distributed Computing》2008
Graphics processors (GPUs) provide a vast number of simple, data-parallel, deeply multithreaded cores and high memory bandwidths. GPU architectures are becoming increasingly programmable, offering the potential for dramatic speedups for a variety of general-purpose applications compared to contemporary general-purpose processors (CPUs). This paper uses NVIDIA’s C-like CUDA language and an engineering sample of their recently introduced GTX 260 GPU to explore the effectiveness of GPUs for a variety of application types, and describes some specific coding idioms that improve their performance on the GPU. GPU performance is compared to both single-core and multicore CPU performance, with multicore CPU implementations written using OpenMP. The paper also discusses advantages and inefficiencies of the CUDA programming model and some desirable features that might allow for greater ease of use and also more readily support a larger body of applications.
6.
NVIDIA's CUDA is a general-purpose, scalable parallel programming model for writing highly parallel applications. It provides several key abstractions: a hierarchy of thread blocks, shared memory, and barrier synchronization. This model has proven quite successful at programming multithreaded many-core GPUs and scales transparently to hundreds of cores: scientists throughout industry and academia are already using CUDA to achieve dramatic speedups on production and research codes. In this paper, we propose a parallel programming approach using hybrid CUDA, OpenMP, and MPI programming, which partitions loop iterations according to the number of C1060 GPU nodes in a GPU cluster consisting of one C1060 and one S1070. Loop iterations assigned to an MPI process are processed in parallel by CUDA, running on the cores of the same computational node.
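A hedged sketch of this kind of hybrid decomposition (not the authors' code): MPI splits the global iteration range across ranks, each rank launches a CUDA kernel on its local GPU for its chunk, and OpenMP parallelizes the host-side reduction. The kernel body and all identifiers are placeholders.

#include <mpi.h>
#include <omp.h>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void process_chunk(float* data, int count, int begin)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        data[i] = (float)(begin + i) * 0.5f;     // placeholder per-iteration work
}

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 1 << 20;                       // global iteration count
    int chunk = (N + size - 1) / size;           // iterations per MPI process
    int begin = rank * chunk;
    int count = (begin + chunk > N) ? (N - begin) : chunk;

    cudaSetDevice(0);                            // one GPU per node assumed
    float* d_data;
    cudaMalloc(&d_data, count * sizeof(float));
    process_chunk<<<(count + 255) / 256, 256>>>(d_data, count, begin);
    cudaDeviceSynchronize();

    float* h_data = (float*)malloc(count * sizeof(float));
    cudaMemcpy(h_data, d_data, count * sizeof(float), cudaMemcpyDeviceToHost);

    double local_sum = 0.0;                      // OpenMP reduction on the host
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < count; ++i)
        local_sum += h_data[i];

    double global_sum = 0.0;                     // combine partial results via MPI
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum = %f\n", global_sum);

    free(h_data);
    cudaFree(d_data);
    MPI_Finalize();
    return 0;
}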
7.
A GPU implementation of a binocular stereo vision algorithm
A non-parametric local-transform binocular stereo vision algorithm is implemented on programmable graphics hardware (GPU). The algorithm uses the results of local non-parametric statistics, rather than pixel intensity values, as the matching cost; compared with other area-based stereo matching algorithms, it handles object boundary regions more stably and is better suited to hardware implementation. Using the latest features of the GPU, all computations of the algorithm are executed on the GPU. Owing to the GPU's parallel, pipelined nature, the algorithm runs faster on the GPU than on the CPU.
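A minimal CUDA-style sketch of the non-parametric local (census) transform that such algorithms use as a matching cost; the original work ran in graphics shaders, so this is an illustration of the technique rather than the paper's implementation.

// Each thread encodes one pixel as a bit string of comparisons against its
// 5x5 neighbourhood; matching costs are then Hamming distances between such
// codes rather than intensity differences.
#include <cuda_runtime.h>

__global__ void census_transform(const unsigned char* img, unsigned int* census,
                                 int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 2 || y < 2 || x >= width - 2 || y >= height - 2) return;

    unsigned char center = img[y * width + x];
    unsigned int code = 0;
    for (int dy = -2; dy <= 2; ++dy)
        for (int dx = -2; dx <= 2; ++dx) {
            if (dx == 0 && dy == 0) continue;
            code <<= 1;
            if (img[(y + dy) * width + (x + dx)] < center)
                code |= 1u;
        }
    census[y * width + x] = code;                // 24-bit census signature
}

// Matching cost of a left/right pixel pair at disparity d:
// __popc(censusL[y * width + x] ^ censusR[y * width + x - d]).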
8.
A PVM-based method for developing network parallel programs in a microcomputer environment
The Parallel Virtual Machine (PVM) is a general-purpose environment for developing network parallel programs. It allows networked supercomputers, massively parallel machines, workstations, and microcomputers to be used as a single large parallel machine for developing parallel algorithms or running parallel systems. This paper introduces the basics and latest developments of PVM, discusses a PVM-based method for developing network parallel programs, and finally gives concrete examples.
9.
N. Ferrando, M.A. Gosálvez, J. Cerdá, R. Gadea, K. Sato 《Computer Physics Communications》2011,182(3):628-640
Presently, dynamic surface-based models are required to contain increasingly large numbers of points and to propagate them over longer time periods. For large numbers of surface points, the octree data structure can be used as a balance between low memory occupation and relatively rapid access to the stored data. For evolution rules that depend on neighborhood states, extended simulation periods can be obtained by using simplified atomistic propagation models, such as Cellular Automata (CA). This method, however, has an intrinsically parallel updating nature, and the corresponding simulations are highly inefficient when performed on classical Central Processing Units (CPUs), which are designed for the sequential execution of tasks. In this paper, a series of guidelines is presented for the efficient adaptation of octree-based CA simulations of complex, evolving surfaces to massively parallel computing hardware. A Graphics Processing Unit (GPU) is used as a cost-efficient example of such parallel architectures. For the actual simulations, we consider the surface propagation during anisotropic wet chemical etching of silicon as a computationally challenging process with widespread use in microengineering applications. A continuous CA model that is intrinsically parallel in nature is used for the time evolution. Our study strongly indicates that parallel computations of dynamically evolving surfaces simulated using CA methods benefit significantly from the incorporation of octrees as support data structures, substantially decreasing the overall computational time and memory usage.
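To illustrate the intrinsically parallel cellular-automaton update that motivates the GPU port, a minimal double-buffered CA step on a flat 2D grid is sketched below; the octree storage and the actual etch-rate model of the paper are omitted, and the local rule shown is a placeholder.

// Each thread advances one surface cell from the read buffer to the write
// buffer based on its four neighbours, so all cells update in lockstep.
#include <cuda_runtime.h>

__global__ void ca_step(const float* state_in, float* state_out,
                        int width, int height, float rate)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || y <= 0 || x >= width - 1 || y >= height - 1) return;

    int idx = y * width + x;
    // Neighbourhood-dependent removal: a cell erodes faster when its
    // neighbours are already more etched (placeholder local rule).
    float nbr = 0.25f * (state_in[idx - 1] + state_in[idx + 1] +
                         state_in[idx - width] + state_in[idx + width]);
    float eroded = state_in[idx] - rate * (1.0f + fmaxf(0.0f, state_in[idx] - nbr));
    state_out[idx] = fmaxf(0.0f, eroded);
}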
10.
Ramtin Shams, Parastoo Sadeghi 《Journal of Parallel and Distributed Computing》2011,71(4):584-593
A model for the computational cost of the finite-difference time-domain (FDTD) method, irrespective of implementation details or application domain, is given. The model is used to formalize the problem of optimally distributing the computational load across an arbitrary set of resources in a heterogeneous cluster. We show that the problem can be formulated as a minimax optimization problem and derive analytic lower bounds for the computational cost. The work provides insight into the optimal design of parallel FDTD software. Our formulation of the load distribution problem takes the computational and communication costs into account simultaneously. We demonstrate that significant performance gains, as much as 75%, can be achieved by proper load distribution.
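In generic form (a sketch of the problem class, not necessarily the authors' exact cost model), distributing $N$ FDTD cells over $P$ heterogeneous processors with speeds $s_p$ and per-processor communication costs $c_p$ can be written as the minimax problem

\[
\min_{n_1,\dots,n_P \ge 0} \; \max_{1 \le p \le P} \left( \frac{n_p}{s_p} + c_p(n_p) \right)
\qquad \text{subject to} \qquad \sum_{p=1}^{P} n_p = N ,
\]

so the per-time-step runtime is set by the slowest processor and the optimum tends to equalize the bracketed cost terms across processors.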
11.
Particle-system-based rain and snow simulation greatly improves the realism of 3D scenes, but the rendering efficiency of traditional CPU-based particle systems cannot meet the requirements of rendering rain and snow in large-scale scenes. To address this, a GPU-based particle-system algorithm for rendering rain and snow scenes is proposed. The algorithm generates and draws particles within a fixed region in front of the viewpoint, updates particle attributes in the vertex shader, expands particles from points into quads in the geometry shader, and caches the particle attributes of each frame to ensure continuity of the attribute updates. In addition, multiple snowflake textures are randomly combined with the particles so that the snow exhibits diversity and randomness. Experimental results show that the algorithm can render rain and snow effects in real time in large-scale scenes with a high degree of realism.
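A CUDA-style sketch of the per-particle attribute update described above (the paper performs this step in a vertex shader and expands each point into a textured quad in a geometry shader; the quad expansion, texture selection, and attribute caching are omitted). All names are illustrative.

// Particles that fall below the fixed region in front of the viewpoint are
// respawned at its top, which keeps the particle count constant.
#include <cuda_runtime.h>

struct Particle { float3 pos; float3 vel; };

__global__ void update_particles(Particle* p, int n, float dt,
                                 float3 region_min, float3 region_max)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Advance position; velocity already encodes gravity and wind drift.
    p[i].pos.x += p[i].vel.x * dt;
    p[i].pos.y += p[i].vel.y * dt;
    p[i].pos.z += p[i].vel.z * dt;

    // Respawn at the top of the fixed region once the particle leaves it.
    if (p[i].pos.y < region_min.y)
        p[i].pos.y = region_max.y;
    // (clamping in x/z to the region omitted for brevity)
}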
12.
Parallel generation of architecture on the GPU
Markus Steinberger, Michael Kenzel, Bernhard Kainz, Jörg Müller, Peter Wonka, Dieter Schmalstieg 《Computer Graphics Forum》2014,33(2):73-82
In this paper, we present a novel approach for the parallel evaluation of procedural shape grammars on the graphics processing unit (GPU). Unlike previous approaches, which are limited in the kind of shapes they allow, the amount of parallelism they can take advantage of, or both, our method supports state-of-the-art procedural modeling including stochasticity and context-sensitivity. To increase parallelism, we explicitly express independence in the grammar, reduce the inter-rule dependencies required for context-sensitive evaluation, and introduce intra-rule parallelism. Our rule scheduling scheme avoids unnecessary back and forth between CPU and GPU and reduces round trips to slow global memory by dynamically grouping rules in on-chip shared memory. Our GPU shape grammar implementation is multiple orders of magnitude faster than the standard in CPU-based rule evaluation, while offering equal expressive power. In comparison to the state of the art in GPU shape grammar derivation, our approach is nearly 50 times faster, while adding support for geometric context-sensitivity.
13.
Due to its high performance/cost ratio, a GPU cluster is an attractive platform for large-scale general-purpose computation and visualization applications. However, the programming model for high-performance general-purpose computation on GPU clusters remains a complex problem. In this paper, we introduce the Zippy framework, a general and scalable solution to this problem. It abstracts GPU cluster programming with a two-level parallelism hierarchy and a non-uniform memory access (NUMA) model. Zippy preserves the advantages of both message-passing and shared-memory models. It employs global arrays (GA) to simplify the communication, synchronization, and collaboration among multiple GPUs. Moreover, it exposes data locality to the programmer for optimal performance and scalability. We present three example applications developed with Zippy: sort-last volume rendering, Marching Cubes isosurface extraction and rendering, and lattice Boltzmann flow simulation with online visualization. They demonstrate that Zippy can ease the development and integration of parallel visualization, graphics, and computation modules on a GPU cluster.
14.
15.
Membrane systems are parallel distributed computing models that are used in a wide variety of areas. Using a sequential machine to simulate membrane systems loses the advantage of parallelism in Membrane Computing. In this paper, an innovative classification algorithm based on a weighted network is introduced, and two new algorithms are proposed for simulating membrane system models on a Graphics Processing Unit (GPU). Communication and synchronization between threads and thread blocks in a GPU are time-consuming processes. In previous studies, dependent objects were assigned to different threads; this increases the need for communication between threads, and as a result performance decreases. In previous studies, dependent membranes were also assigned to different thread blocks, requiring inter-block communication and again decreasing performance. For example, with 512 objects per membrane, the speedup on a GPU of the proposed algorithm, which classifies dependent objects and handles them sequentially within a thread, was 82×, while the previous approach (Algorithm 1) achieved 8.2×. For a membrane system with high dependency among membranes, the speedup of the second proposed algorithm (Algorithm 3) was 12×, while the previous approach (Algorithm 1) and the first proposed algorithm (Algorithm 2), which assign each membrane to one thread block, achieved 1.8×.
16.
Stuart D.C. Walsh, Martin O. Saar, Peter Bailey, David J. Lilja 《Computers & Geosciences》2009,35(12):2353-2364
Many complex natural systems studied in the geosciences are characterized by simple local-scale interactions that result in complex emergent behavior. Simulations of these systems, often implemented in parallel using standard central processing unit (CPU) clusters, may be better suited to parallel processing environments with large numbers of simple processors. Such an environment is found in graphics processing units (GPUs) on graphics cards. This paper discusses GPU implementations of three example applications from computational fluid dynamics, seismic wave propagation, and rock magnetism. These candidate applications involve important numerical modeling techniques, widely employed in physical system simulations, that are themselves examples of distinct computing classes identified as fundamental to scientific and engineering computing. The presented numerical methods (and the respective computing classes they belong to) are: (1) a lattice-Boltzmann code for geofluid dynamics (structured grid class); (2) a spectral-finite-element code for seismic wave propagation simulations (sparse linear algebra class); and (3) a least-squares minimization code for interpreting magnetic force microscopy data (dense linear algebra class). Significant performance increases (between 10× and 30× in most cases) are seen in all three applications, demonstrating the power of GPU implementations for these types of simulations and, more generally, for their associated computing classes.
17.
Traditional multi-objective evolutionary algorithms are mostly stochastic-search-style algorithms based on the concept of Pareto optimality and are slow to converge; this becomes even more pronounced when the problem dimensionality grows and a large population is required. The issue has attracted increasing attention from researchers and practitioners. Simulation experiments show that constructing the non-dominated set and maintaining population diversity account for more than 99% of the algorithm's execution time. An effective remedy is to parallelize these parts of the algorithm. This paper proposes a CUDA-based parallel solution that uses niching to implement fitness sharing and maintain the diversity of the candidate solution set, and places the entire multi-objective evolutionary algorithm on the GPU, unlike previous studies in which part of the non-dominated sorting and all of the diversity maintenance were still executed on the CPU. Simulation results on the ZDT benchmark functions show that the proposed algorithm far outperforms NSGA-II and NPGA. Finally, by solving the oil-blending process, a constrained multi-objective optimization problem, the algorithm is shown to retain excellent acceleration when solving constrained multi-objective optimization problems in chemical engineering applications.
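A minimal sketch of GPU-side fitness sharing (niching) as described above: one thread per individual computes its niche count over the whole population in objective space and divides its raw fitness by it. The kernel is illustrative only; sigma_share and the triangular sharing function are conventional choices, not necessarily the paper's.

#include <cuda_runtime.h>

__global__ void shared_fitness(const float* objectives,  // [n * num_obj]
                               const float* raw_fitness, // [n]
                               float* fitness_out,       // [n]
                               int n, int num_obj, float sigma_share)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float niche_count = 0.0f;
    for (int j = 0; j < n; ++j) {
        float d2 = 0.0f;
        for (int k = 0; k < num_obj; ++k) {
            float diff = objectives[i * num_obj + k] - objectives[j * num_obj + k];
            d2 += diff * diff;
        }
        float d = sqrtf(d2);
        if (d < sigma_share)
            niche_count += 1.0f - d / sigma_share;   // triangular sharing function
    }
    fitness_out[i] = raw_fitness[i] / niche_count;   // niche_count >= 1 (j == i term)
}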
18.
杨霞 《计算机工程与设计》2009,30(24)
To reduce the cost of vibration testing for ruggedized computers and improve design efficiency, a method for simulating ruggedized-computer vibration with a BP neural network is proposed. By analyzing a simplified model of the ruggedized computer's vibration system, the topology of the BP network is determined using the most common gradient algorithm; data obtained from ANSYS analysis are compared with existing vibration test data to refine the network model, enabling prediction of the amplitude-frequency characteristics of ruggedized-computer vibration. Finally, comparing the predicted simulation data against experimental data acquired from actual vibration tests of a prototype demonstrates the effectiveness of the method.
19.
Due to the rapid growth in the scale of the problem domains of the complex systems under investigation, the complexity of the multi-input component models used to construct logical processes (LPs) has increased significantly. High-performance computing technologies have therefore been used extensively to enable parallel simulation execution. However, the traditional multi-process parallel method (MPM) executes LPs in parallel on multi-core platforms and ignores the intrinsic parallel capabilities of multi-input component models. In this study, a vectorized component model (VCM) framework is proposed, designed to better exploit the parallelism of multi-input component models. A two-level composite parallel method (CPM) is then constructed within the framework, which can sustain complex system simulation applications consisting of multi-input component models. CPM first employs MPM to dispatch LPs onto a multi-core computing platform; it then maps VCMs to the multi-core platform for parallel execution. Experimental results indicate that (1) the proposed VCM framework better exploits the parallelism of multi-input component models, and (2) CPM significantly improves performance compared to the traditional MPM. The results also show that CPM can effectively cope with the size and complexity of complex simulation applications with multi-input component models.
20.
This paper presents a new approach to solving dynamic traffic assignment problems. The approach employs a mixed method of real-time simulation and off-line optimization. The fundamental approach to the simulation is systolic parallel processing based on autonomous agent modeling. Agents continuously act on their own initiative and access a database to obtain the status of the simulation world. In particular, existing models and algorithms were incorporated in designing the behavior of the relevant agents, such as car-following and headway distribution. The simulation is based on predetermined routes between centroids that are computed off-line by a conventional optimal path-finding algorithm such as the Frank-Wolfe algorithm. By iterating the cycles of optimization and simulation, the proposed system provides a practical and valuable traffic assignment. The Gangnam-Gu district in Seoul, Korea, is selected as the target area for the modeling. It is expected that real-time traffic assignment services can be provided on the Internet soon.