首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 46 毫秒
1.
Quantum Monte Carlo methods enable us to determine the ground-state properties of atomic or molecular clusters. Here, we present a reconfigurable computing architecture using Field Programmable Gate Arrays (FPGAs) to accelerate two computationally intensive kernels of a Quantum Monte Carlo (QMC) application applied to N-body systems. We focus on two key kernels of the QMC application: acceleration of potential energy and wave function calculations. We compare the performance of our application on two reconfigurable platforms. Firstly, we use a dual-processor 2.4 GHz Intel Xeon augmented with two reconfigurable development boards consisting of Xilinx Virtex-II Pro FPGAs. Using this platform, we achieve a speedup of 3× over a software-only implementation. Following this, the chemistry application is ported to the Cray XD1 supercomputer equipped with Xilinx Virtex-II Pro and Virtex-4 FPGAs. The hardware-accelerated application on one node of the high performance system equipped with a single Virtex-4 FPGA yields a speedup of approximately 25× over the serial reference code running on one node of the dual-processor dual-core 2.2 GHz AMD Opteron. This speedup is mainly attributed to the use of pipelining, the use of fixed-point arithmetic for all calculations and the fine-grained parallelism using FPGAs. We can further enhance the performance by operating multiple instances of our design in parallel.  相似文献   

2.
We propose two hardware mechanisms to decrease energy consumption on massively parallel graphics processors for ray tracing. First, we use a streaming data model and configure part of the L2 cache into a ray stream memory to enable efficient data processing through ray reordering. This increases L1 hit rates and reduces off‐chip memory energy substantially through better management of off‐chip memory access patterns. To evaluate this model, we augment our architectural simulator with a detailed memory system simulation that includes accurate control, timing and power models for memory controllers and off‐chip dynamic random‐access memory . These details change the results significantly over previous simulations that used a simpler model of off‐chip memory, indicating that this type of memory system simulation is important for realistic simulations that involve external memory. Secondly, we employ reconfigurable special‐purpose pipelines that are constructed dynamically under program control. These pipelines use shared execution units that can be configured to support the common compute kernels that are the foundation of the ray tracing algorithm. This reduces the overhead incurred by on‐chip memory and register accesses. These two synergistic features yield a ray tracing architecture that reduces energy by optimizing both on‐chip and off‐chip memory activity when compared to a more traditional approach.  相似文献   

3.
Open Computing Language (OpenCL) is an open, functionally portable programming model for a large range of highly parallel processors. To provide users with access to the underlying platforms, OpenCL has explicit support for features such as local memory and vector data types (VDTs). However, these are often low‐level, hardware‐specific features, which can be detrimental to performance on different platforms. In this paper, we focus on VDTs and investigate their usage in a systematic way. First, we propose two different approaches (inter‐vdt and intra‐vdt) to use VDTs in OpenCL kernels, and show how to translate scalar OpenCL kernels to vectorized ones. After obtaining vectorized code, we evaluate the performance effects of using VDTs with two types of benchmarks: micro‐benchmarks and macro‐benchmarks. With micro‐benchmarks, we study the execution model of VDTs and the role of the compiler‐aided vectorizer on five devices. With macro‐benchmarks, we explore the changes of memory access patterns before and after using VDTs, and the resulting performance impact. Not only our evaluation provides insights into how OpenCL's VDTs are mapped on different processors, but it also indicates that using such data types introduces changes in both computation and memory accesses. Based on the lessons learned, we discuss how to deal with performance portability in the presence of VDTs. Copyright © 2014 John Wiley & Sons, Ltd.  相似文献   

4.
A new massively parallel algorithm is presented for simulating large asymmetric circuit-switched networks that are controlled by a randomized-routing policy that includes trunk-reservation. A single instruction multiple data implementation is described and corresponding experiments on a 16384 processor MasPar parallel computer are reported. A multiple instruction multiple data implementation is also described and corresponding experiments on an Intel IPSC/860 parallel computer, using 16 processors, are reported. By exploiting parallelism, our algorithm increases the possible execution rate of such complex simulations by as much as an order of magnitude.  相似文献   

5.
As semiconductor manufacturing technology continues to improve, it is possible to integrate more and more transistors onto a single processor. Many-core processor design has resulted in part from the search to utilize this enormous transistor real estate. The Single-Chip Cloud Computer (SCC) is an experimental many-core processor created by Intel Labs. In this paper we present a study in which we analyze this innovative many-core system by running several workloads with distinctive parallelism characteristics. We investigate the effect on system performance by monitoring specific hardware performance counters. Then, we experiment on varying different hardware configuration parameters such as number of cores, clock frequency and voltage levels. We execute the chosen workloads and collect the timing, power consumption and energy consumption information on such a many-core research platform. Thus, we can comprehensively analyze the behavior and scalability of the Intel SCC system with the introduced workload in terms of performance and energy consumption. Our results show that the profiled parallel workload execution has a communication bottleneck on the Intel SCC system. Moreover, our results indicate that we should carefully choose the number of cores to execute different workloads in order to yield a balance between execution performance and energy efficiency for different applications.  相似文献   

6.
HPC industry demands more computing units on FPGAs, to enhance the performance by using task/data parallelism. FPGAs can provide its ultimate performance on certain kernels by customizing the hardware for the applications. However, applications are getting more complex, with multiple kernels and complex data arrangements, generating overhead while scheduling/managing system resources. Due to this reason all classes of multi threaded machines–minicomputer to supercomputer–require to have efficient hardware scheduler and memory manager that improves the effective bandwidth and latency of the DRAM main memory. This architecture could be a very competitive choice for supercomputing systems that meets the demand of parallelism for HPC benchmarks. In this article, we proposed a Programmable Memory System and Scheduler (PMSS), which provides high speed complex data access pattern to the multi threaded architecture. This proposed PMSS system is implemented and tested on a Xilinx ML505 evaluation FPGA board. The performance of the system is compared with a microprocessor based system that has been integrated with the Xilkernel operating system. Results show that the modified PMSS based multi-accelerator system consumes 50% less hardware resources, 32% less on-chip power and achieves approximately a 19x speedup compared to the MicroBlaze based system.  相似文献   

7.
Algorithmic skeletons simplify software development: they abstract typical patterns of parallelism and provide their efficient implementations, allowing the application developer to focus on the structure of algorithms, rather than on implementation details. This becomes especially important for modern parallel systems with multiple graphics processing units (GPUs) whose programming is complex and error-prone, because state-of-the-art programming approaches like CUDA and OpenCL lack high-level abstractions. We define a new algorithmic skeleton for allpairs computations which occur in real-world applications, ranging from bioinformatics to physics. We develop the skeleton’s generic parallel implementation for multi-GPU Systems in OpenCL. To enable the automatic use of the fast GPU memory, we identify and implement an optimized version of the allpairs skeleton with a customizing function that follows a certain memory access pattern. We use matrix multiplication as an application study for the allpairs skeleton and its two implementations and demonstrate that the skeleton greatly simplifies programming, saving up to 90 % of lines of code as compared to OpenCL. The performance of our optimized implementation is up to 6.8 times higher as compared with the generic implementation and is competitive to the performance of a manually written optimized OpenCL code.  相似文献   

8.
The manycore revolution can be characterized by increasing thread counts, decreasing memory per thread, and diversity of continually evolving manycore architectures. High performance computing (HPC) applications and libraries must exploit increasingly finer levels of parallelism within their codes to sustain scalability on these devices. A major obstacle to performance portability is the diverse and conflicting set of constraints on memory access patterns across devices. Contemporary portable programming models address manycore parallelism (e.g., OpenMP, OpenACC, OpenCL) but fail to address memory access patterns. The Kokkos C++ library enables applications and domain libraries to achieve performance portability on diverse manycore architectures by unifying abstractions for both fine-grain data parallelism and memory access patterns. In this paper we describe Kokkos’ abstractions, summarize its application programmer interface (API), present performance results for unit-test kernels and mini-applications, and outline an incremental strategy for migrating legacy C++ codes to Kokkos. The Kokkos library is under active research and development to incorporate capabilities from new generations of manycore architectures, and to address a growing list of applications and domain libraries.  相似文献   

9.
For pt.I. see ibid., p. 170-80. In pt.I, we presented a binding environment for the AND and OR parallel execution of logic programs. This environment was instrumental in rendering a compiler for the AND and OR parallel execution of logic programs machine independent. In this paper, we describe a compiler based on the Reduce-OR process model (ROPM) for the parallel execution of Prolog programs, and provide performance of the compiler on five parallel machines: the Encore Multimax, the Sequent Symmetry, the NCUBE 2, the Intel i860 hypercube and a network of Sun workstations. The compiler is part of a machine independent parallel Prolog development system built on top of a run time environment for parallel programming called the Chare kernel, and runs unchanged on these multiprocessors. In keeping with the objectives behind the ROPM, the compiler supports both on and independent AND parallelism in Prolog programs and is suitable for execution on both shared and nonshared memory machines. We discuss the performance of the Prolog compiler in some detail and describe how grain size can be used to deliver performance that is within 10% of the underlying sequential Prolog compiler on one processor, and scale linearly with increasing number of processors on problems exhibiting sufficient parallelism. The loose coupling between parallel and sequential components makes it possible to use the best available sequential compiler as the sequential component of our compiler  相似文献   

10.
Dynamic programming (DP) is a popular technique which is used to solve combinatorial search and optimization problems. This paper focuses on one type of DP, which is called nonserial polyadic dynamic programming (NPDP). Owing to the nonuniform data dependencies of NPDP, it is difficult to exploit either parallelism or locality. Worse still, the emerging multi/many-core architectures with small on-chip memory make these issues more challenging. In this paper, we address the challenges of exploiting the fine grain parallelism and locality of NPDP on multicore architectures. We describe a latency-tolerant model and a percolation technique for programming on multicore architectures. On an algorithmic level, both parallelism and locality do benefit from a specific data dependence transformation of NPDP. Next, we propose a parallel pipelining algorithm by decomposing computation operators and percolating data through a memory hierarchy to create just-in-time locality. In order to predict the execution time, we formulate an analytical performance model of the parallel algorithm. The parallel pipelining algorithm achieves not only high scalability on the 160-core IBM Cyclops64, but portable performance as well, across the 8-core Sun Niagara and quad-cores Intel Clovertown.  相似文献   

11.
基于雷达资料的外推是临近预报中重要的方法之一,随着全国气象雷达网络建设规模的不断提高以及观测资料精细化程度的提升,基于区域乃至全国雷达拼图的外推预报,每次计算都需花费大量时间,甚至滞后于每6分钟一次的资料观测频次。为解决传统外推算法运算复杂度高,实时性差的问题,运用OpenCL构建基于GPU的异构计算模型对外推算法进行并行化改进。然后逐步分析影响算法性能的瓶颈,并通过改进和测试数据比对,阐述算法优化的过程。其中,内存与线程的映射优化、合理利用局部存储器作为高速缓存以及隐藏CPU执行时间等方法不仅对本算法的执行效率带来显著提升,也可为其他基于OpenCL异构计算的优化提供参考。以AMD Graphic Core Next和Northern Islands二代GPU架构作为测试平台,并以Intel CPU并行计算作为测试参考,测试结果表明,改进后的算法在硬件同等功耗的情况下,计算性能提升15~22倍。  相似文献   

12.
Distributed Memory Multicomputers (DMMs), such as the IBM SP-2, the Intel Paragon, and the Thinking Machines CM-5, offer significant advantages over shared memory multiprocessors in terms of cost and scalability. Unfortunately, the utilization of all the available computational power in these machines involves a tremendous programming effort on the part of users, which creates a need for sophisticated compiler and run-time support for distributed memory machines. In this paper, we explore a new compiler optimization for regular scientific applications-the simultaneous exploitation of task and data parallelism. Our optimization is implemented as part of the PARADIGM HPF compiler framework we have developed. The intuitive idea behind the optimization is the use of task parallelism to control the degree of data parallelism of individual tasks. The reason this provides increased performance is that data parallelism provides diminishing returns as the number of processors used is increased. By controlling the number of processors used for each data parallel task in an application and by concurrently executing these tasks, we make program execution more efficient and, therefore, faster  相似文献   

13.
Graphics processing units (GPUs) pose an attractive choice for designing high-performance and energy-efficient software systems. This is because GPUs are capable of executing massively parallel applications. However, the performance of GPUs is limited by the contention in memory subsystems, often resulting in substantial delays and effectively reducing the parallelism. In this paper, we propose GRAB, an automated debugger to aid the development of efficient GPU kernels. GRAB systematically detects, classifies and discovers the root causes of memory-performance bottlenecks in GPUs. We have implemented GRAB and evaluated it with several open-source GPU kernels, including two real-life case studies. We show the usage of GRAB through improvement of GPU kernels on a real NVIDIA Tegra K1 hardware – a widely used GPU for mobile and handheld devices. The guidance obtained from GRAB leads to an overall improvement of up to 64%.  相似文献   

14.
Dynamic memory allocation is an important feature of modern programming systems. However, the cost of memory allocation in massively parallel execution environments such as CUDA has been too high for many types of kernels. This paper presents XMalloc, a high-throughput memory allocation mechanism that dramatically magnifies the allocation throughput of an underlying memory allocator. XMalloc embodies two key techniques: allocation coalescing and buffering using efficient queues. This paper describes these two techniques and presents our implementation of XMalloc as a memory allocator library. The library is designed to be called from kernels executed by massive numbers of threads. Our experimental results based on the NVIDIA G480 GPU show that XMalloc magnifies the allocation throughput of the underlying memory allocator by a factor of 48.  相似文献   

15.
为了提高并行程序中共享内存数据的读写访问性能,事务内存机制于1993年被提出。因为事务内存机制直接涉及内存数据的读写控制,所以也得到了系统安全研究人员的极大关注。2013年,Intel公司开始支持TSX(Transactional Synchronizatione Xtension)特性,第一次在广泛使用的计算机硬件中支持事务内存机制。利用事务内存机制的内存访问跟踪、内存访问信号触发和内存操作回滚,以及Intel TSX特性的用户态事务回滚处理、在Cache中执行所有操作和硬件实现高效率,研究人员完成了各种的系统安全研究成果,包括:授权策略实施、虚拟机自省、密钥安全、控制流完整性、错误恢复和侧信道攻防等。本文先介绍了各种基于事务内存机制的研究成果;然后分析了现有各种系统安全研究成果与事务内存机制特性之间的关系,主要涉及了3个角度:内存访问的控制、事务回滚处理、和在Cache中执行所有操作。我们将已有的研究成果的技术方案从3个角度进行分解,与原有的、不基于事务内存机制的解决方案比较,解释了引入事务内存机制带来的技术优势。最后,我们总结展望了将来的研究,包括:硬件事务内存机制的实现改进,事务内存机制(尤其是硬件事务内存机制)在系统安全研究中的应用潜力。  相似文献   

16.
Computer architectures supporting shared memory continue to increase in complexity as designers seek to improve memory performance. This is especially true of proposals for massively parallel systems with distributed, yet shared, memory. The need to maintain a reasonably simple memory model for programmers, in spite of enhancements like caches and access pipelining, is responsible for many of the complications. We develop a novel graph model, access graphs, for visualizing processor/memory interaction. Access graphs symbolically represent the causal relationships between load, store, and synchronization events. The focus is on two classes of access graphs: pseudo and real. A pseudo access graph describes an execution in terms of abstract events familiar to the programmer. If the pseudo access graph is acyclic, then memory consistency is preserved during the execution. A real access graph describes an execution in terms of physical events known to the hardware designer. A real access graph must be acyclic since hardware cannot violate causality. Memory consistency can be verified for a given computer system by proving that for any acyclic real access graph describing a program's execution on that computer, an acyclic pseudo access graph can be derived describing the same execution  相似文献   

17.
In this paper, we propose a model for the energy consumption of the concurrent execution of three key dense matrix factorizations, with task parallelism leveraged via the Symmetric Multi‐Processing Superscalar (SMPSs) runtime, on a multicore processor. Our model decomposes the power dissipation into the system, static and dynamic components, with the former two being estimated from basic, off‐line experiments. The dynamic power, on the other hand, requires significantly more care, and we introduce a contention‐aware model that accommodates for the variability of power consumption due to memory contention. Experimental results on an Intel Xeon E5504 processor with four cores, using an internal powermeter that samples the power drawn by the mainboard with a frequency of 1 KHz, show the reliability of the energy model for the Cholesky, LU, and QR factorizations on this platform. Copyright © 2013 John Wiley & Sons, Ltd.  相似文献   

18.
In this paper we present a framework for realizing arbitrary instruction set extensions (IE) that are identified post-silicon. The proposed framework has two components viz., an IE synthesis methodology and the architecture of a reconfigurable data-path for realization of the such IEs. The IE synthesis methodology ensures maximal utilization of resources on the reconfigurable data-path. In this context we present the techniques used to realize IEs for applications that demand high throughput or those that must process data streams. The reconfigurable hardware called HyperCell comprises a reconfigurable execution fabric. The fabric is a collection of interconnected compute units. A typical use case of HyperCell is where it acts as a co-processor with a host and accelerates execution of IEs that are defined post-silicon. We demonstrate the effectiveness of our approach by evaluating the performance of some well-known integer kernels that are realized as IEs on HyperCell. Our methodology for realizing IEs through HyperCells permits overlapping of potentially all memory transactions with computations. We show significant improvement in performance for streaming applications over general purpose processor based solutions, by fully pipelining the data-path.  相似文献   

19.
In some warning applications, such as aircraft taking-off and landing, ship sailing, and traffic guidance in foggy weather, the high definition (HD) and rapid dehazing of images and videos is increasingly necessary. Existing technologies for the dehazing of videos or images have not completely exploited the parallel computing capacity of modern multi-core CPU and GPU, and leads to the long dehazing time or the low frame rate of video dehazing which cannot meet the real-time requirement. In this paper, we propose a parallel implementation and optimization method for the real-time dehazing of the high definition videos based on a single image haze removal algorithm. Our optimization takes full advantage of the modern CPU+GPU architecture, which increases the parallelism of the algorithm, and greatly reduces the computational complexity and the execution time. The optimized OpenCL parallel implementation is integrate into FFmpeg as an independent module. The experimental results show that for a single image, the performance of the optimized OpenCL algorithm is improved approximately 500% compared with the existing algorithm, and approximately 153% over the basic OpenCL algorithm. The 1080p (1920?×?1080) high definition hazy video can also processed at a real-time rate (more than 41 frames per second).  相似文献   

20.
Implementation of GAMMA on a Massively Parallel Computer   总被引:1,自引:0,他引:1       下载免费PDF全文
The GAMMA paradigm is recently proposed by Banatre and Metayer to describe the systematic construction of parallel programs without introducing artificial sequentiality.This paper presents two synchronous execution models for GAMMA and discusses how to implement them on MasPar MP-1,a massively data parallel computer.The results show that GAMMA paradign can be implemented very naturally on data parallel machines,and very high level language,such as GAMMA in which parallelism is left implicit,is suitable for specifying massively parallel applications.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号