期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

LLVM-based automation of memory decoupling for OpenCL applications on FPGAs

《Microprocessors and Microsystems》2020

The availability of OpenCL High-Level Synthesis (OpenCL-HLS) has made FPGAs an attractive platform for power-efficient high-performance execution of massively parallel applications. At the same time, new design challenges emerge for massive thread-level parallelism on FPGAs. One major execution bottleneck is the high number of memory stalls exposed to data-path which overshadows the benefits of data-path customization.This article presents a novel LLVM-based tool for decoupling memory access from computation when synthesizing massively parallel OpenCL kernels on FPGAs. To enable systematic decoupling, we use the idea of kernel parallelism and implement a new parallelism granularity that breaks down kernels to separate data-path and memory-path (memory read/write) which work concurrently to overlap the computation of current threads^[1] with the memory access of future threads (memory pre-fetching at large scale). At the same time, this paper proposes an LLVM-based static analysis to detect the decouplable data for resolving the data dependency and maximize concurrency across the kernels.The experimental results on eight Rodinia benchmarks on Intel Stratix V FPGA demonstrate significant performance and energy improvement over the baseline implementation using Intel OpenCL SDK. The proposed sub-kernel parallelism achieves more than 2x speedup, with only 3% increase in resource utilization, and 7% increase in power consumption which reduces the overall energy consumption more than 40%. 相似文献

2.

Accelerating thread-intensive and explicit memory management programs with dynamic partial reconfiguration

Qianming Yang Mei Wen Nan Wu Chunyuan Zhang 《The Journal of supercomputing》2013,63(2):508-537

Recent research has shown that field programmable gate arrays (FPGAs) have a large potential for accelerating demanding applications, such as high performance digital signal process applications with low-volume market. The loss of generality in the architecture is one disadvantage of using FPGAs, however, the reconfigurability of FPGAs allow reprogramming for other applications. Therefore, a uniform FPGA-based architecture, an efficient programming model, and a simple mapping method are paramount for the wide acceptance of FPGA technology. This paper presents MASALA, a dynamically reconfigurable FPGA-based accelerator for parallel programs written in thread-intensive and explicit memory management (TEMM) programming models. Our system uses a TEMM programming model to parallelize demanding applications, including application decomposition into separate thread blocks and compute and data load/store decoupling. Hardware engines are included into MASALA using partial dynamic reconfiguration modules, each of which encapsulates a thread process engine that implements the hardware’s thread functionality. A data dispatching scheme is also included in MASALA to enable the explicit communication of multiple memory hierarchies such as interhardware engines, host processors, and hardware engines. Finally, this paper illustrates a multi-FPGA prototype system of the presented architecture: MASALA-SX. A large synthetic aperture radar image formatting experiment shows that MASALA’s architecture facilitates the construction of a TEMM program accelerator by providing greater performance and less power consumption than current CPU platforms, without sacrificing programmability, flexibility, and scalability. 相似文献

3.

Kokkos: Enabling manycore performance portability through polymorphic memory access patterns

《Journal of Parallel and Distributed Computing》2014,74(12):3202-3216

The manycore revolution can be characterized by increasing thread counts, decreasing memory per thread, and diversity of continually evolving manycore architectures. High performance computing (HPC) applications and libraries must exploit increasingly finer levels of parallelism within their codes to sustain scalability on these devices. A major obstacle to performance portability is the diverse and conflicting set of constraints on memory access patterns across devices. Contemporary portable programming models address manycore parallelism (e.g., OpenMP, OpenACC, OpenCL) but fail to address memory access patterns. The Kokkos C++ library enables applications and domain libraries to achieve performance portability on diverse manycore architectures by unifying abstractions for both fine-grain data parallelism and memory access patterns. In this paper we describe Kokkos’ abstractions, summarize its application programmer interface (API), present performance results for unit-test kernels and mini-applications, and outline an incremental strategy for migrating legacy C++ codes to Kokkos. The Kokkos library is under active research and development to incorporate capabilities from new generations of manycore architectures, and to address a growing list of applications and domain libraries. 相似文献

4.

General memory efficient packet matching FPGA architecture for future high-speed networks

《Microprocessors and Microsystems》2020

Packet classification (matching) is one of the critical operations in networking widely used in many different devices and tasks ranging from switching or routing to a variety of monitoring and security applications like firewall or IDS. To satisfy the ever-growing performance demands of current and future high-speed networks, specially designed hardware accelerated architectures implementing packet classification are necessary. These demands are now growing to such an extent, that in order to keep up with the rising throughputs of network links, the FPGA accelerated architectures are required to perform matching of multiple packets in every single clock cycle. To meet this requirement a simple replication approach can be utilized – instantiate multiple copies of a processing pipeline matching incoming packets in parallel. However, simple replication of pipelines inseparably brings a significant increase in utilization of FPGA resources of all types, which is especially costly for rather scarce on-chip memories used in matching tables.We propose and examine a unique parallel hardware architecture for hash-based exact match classification of multiple packets in each clock cycle that offers a reduction of memory replication requirements. The core idea of the proposed architecture is to exploit the basic memory organization structure present in all modern FPGAs, where hundreds of individual block or distributed memory tiles are available and can be accessed (addressed) independently. This way, we are able to maintain a rather high throughput of matching multiple packets per clock cycle even without fully replicated memory resources in matching tables. Our results show that the designed approach can use on-chip memory resources very efficiently and even scales exceptionally well with increased capacities of match tables. For example, the proposed architecture is able to achieve a throughput of more than 2 Tbps (over 3 000 Mpps) with an effective capacity of more than 40 000 IPv4 flow records at the cost of only a few hundred block memory tiles (366 BlockRAM for Xilinx or 672 M20K for Intel FPGAs) utilizing only a small fraction of available logic resources (around 68 000 LUTs for Xilinx or 95 000 ALMs for Intel). 相似文献

5.

Streaming‐Enabled Parallel Dataflow Architecture for Multicore Systems

Huy T. Vo Daniel K. Osmari Brian Summa João L. D. Comba Valerio Pascucci Cláudio T. Silva 《Computer Graphics Forum》2010,29(3):1073-1082

We propose a new framework design for exploiting multi‐core architectures in the context of visualization dataflow systems. Recent hardware advancements have greatly increased the levels of parallelism available with all indications showing this trend will continue in the future. Existing visualization dataflow systems have attempted to take advantage of these new resources, though they still have a number of limitations when deployed on shared memory multi‐core architectures. Ideally, visualization systems should be built on top of a parallel dataflow scheme that can optimally utilize CPUs and assign resources adaptively to pipeline elements. We propose the design of a flexible dataflow architecture aimed at addressing many of the shortcomings of existing systems including a unified execution model for both demand‐driven and event‐driven models; a resource scheduler that can automatically make decisions on how to allocate computing resources; and support for more general streaming data structures which include unstructured elements. We have implemented our system on top of VTK with backward compatibility. In this paper, we provide evidence of performance improvements on a number of applications. 相似文献

6.

Efficient multimedia coprocessor with enhanced SIMD engines for exploiting ILP and DLP

《Parallel Computing》2013,39(10):586-602

Multimedia applications have become increasingly important in daily computing. These applications are composed of heterogeneous regions of code mixed with data-level parallelism (DLP) and instruction-level parallelism (ILP). A standard solution for a multimedia coprocessor resembles of single-instruction multiple-data (SIMD) engines into architectures exploiting ILP at compile time, such as very long instruction word (VLIW) and transport triggered architecture (TTA). However, the ILP regions fail to scale with the increased vector length to achieve high performance in the DLP regions. Furthermore, the register-to-register nature of SIMD instructions causes current SIMD engines to have limitations in handling memory alignment, data reorganization, and control flow. Many supporting instructions such as data permutations, address generations, and loop branches, are required to aid in the execution of the real SIMD computation instructions. To mitigate these problems, we propose optimized SIMD engines that have the capabilities for combining VLIW or TTA processing with a unified scalar and long vector computations as well as efficient SIMD hardware for real computation. Our new architecture is based on TTA and is called multimedia coprocessor (MCP). This architecture includes following features: (1) a simple coprocessor structure with 8-way TTA, (2) cost-effective SIMD hardware capable of performing floating-point operations, (3) long vector capabilities built upon existing SIMD hardware and a single register file and processor data path for both scalar operands and vector elements, and (4) an optimized SIMD architecture that addresses the SIMD limitations. Our experimental evaluations show that MCP can outperform conventional SIMD techniques by an average of 39% and 12% in performance for multimedia kernels and applications, respectively. 相似文献

7.

FPGA architecture and implementation of sparse matrix-vector multiplication for the finite element method

Yousef Elkurdi Evgueni Souleimanov Warren J. Gross 《Computer Physics Communications》2008,178(8):558-570

The Finite Element Method (FEM) is a computationally intensive scientific and engineering analysis tool that has diverse applications ranging from structural engineering to electromagnetic simulation. The trends in floating-point performance are moving in favor of Field-Programmable Gate Arrays (FPGAs), hence increasing interest has grown in the scientific community to exploit this technology. We present an architecture and implementation of an FPGA-based sparse matrix-vector multiplier (SMVM) for use in the iterative solution of large, sparse systems of equations arising from FEM applications. FEM matrices display specific sparsity patterns that can be exploited to improve the efficiency of hardware designs. Our architecture exploits FEM matrix sparsity structure to achieve a balance between performance and hardware resource requirements by relying on external SDRAM for data storage while utilizing the FPGAs computational resources in a stream-through systolic approach. The architecture is based on a pipelined linear array of processing elements (PEs) coupled with a hardware-oriented matrix striping algorithm and a partitioning scheme which enables it to process arbitrarily big matrices without changing the number of PEs in the architecture. Therefore, this architecture is only limited by the amount of external RAM available to the FPGA. The implemented SMVM-pipeline prototype contains 8 PEs and is clocked at 110 MHz obtaining a peak performance of 1.76 GFLOPS. For 8 GB/s of memory bandwidth typical of recent FPGA systems, this architecture can achieve 1.5 GFLOPS sustained performance. Using multiple instances of the pipeline, linear scaling of the peak and sustained performance can be achieved. Our stream-through architecture provides the added advantage of enabling an iterative implementation of the SMVM computation required by iterative solution techniques such as the conjugate gradient method, avoiding initialization time due to data loading and setup inside the FPGA internal memory. 相似文献

8.

AdaBoost-based face detection for embedded systems

Ming Yang James Crenshaw Bruce Augustine Russell Mareachen Ying Wu 《Computer Vision and Image Understanding》2010,114(11):1116-1125

Face detection is a widely studied topic in computer vision, and recent advances in algorithms, low cost processing, and CMOS imagers make it practical for embedded consumer applications. As with graphics, the best cost-performance ratio is achieved with dedicated hardware. In this paper, we design an embedded face detection system for handheld digital cameras or camera phones. The challenges of face detection in embedded environments include an efficient pipeline design, bandwidth constraints set by low cost memory, a need to find parallelism, and how to utilize the available hardware resources efficiently. In addition, consumer applications require reliability which calls for a hard real-time approach to guarantee that processing deadlines are met. Specifically, the main contributions of the paper include: (1) incorporation of a Genetic Algorithm in the AdaBoost training to optimize the detection performance given the number of Haar features; (2) a complexity control scheme to meet hard real-time deadlines; (3) a hardware pipeline design for Haar-like feature calculation and a system design exploiting several levels of parallelism. The proposed architecture is verified by synthesis to Altera’s low cost Cyclone II FPGA. Simulation results show the system can achieve about 75–80% detection rate for group portraits. 相似文献

9.

A pipelined-loop-compatible architecture and algorithm to reduce variable-length sets of floating-point data on a reconfigurable computer

Gerald R. Morris Viktor K. Prasanna 《Journal of Parallel and Distributed Computing》2008

相似文献

10.

Concurrent warp execution: improving performance of GPU-likely SIMD architecture by increasing resource utilization

Hong Jun Choi Dong Oh Son Jong Myon Kim Cheol Hong Kim 《The Journal of supercomputing》2014,69(1):330-356

Hardware parallelism should be exploited to improve the performance of computing systems. Single instruction multiple data (SIMD) architecture has been widely used to maximize the throughput of computing systems by exploiting hardware parallelism. Unfortunately, branch divergence due to branch instructions causes underutilization of computational resources, resulting in performance degradation of SIMD architecture. Graphics processing unit (GPU) is a representative parallel architecture based on SIMD architecture. In recent computing systems, GPUs can process general-purpose applications as well as graphics applications with the help of convenient APIs. However, contrary to graphics applications, general-purpose applications include many branch instructions, resulting in serious performance degradation of GPU due to branch divergence. In this paper, we propose concurrent warp execution (CWE) technique to reduce the performance degradation of GPU in executing general-purpose applications by increasing resource utilization. The proposed CWE enables selecting co-warps to activate more threads in the warp, leading to concurrent execution of combined warps. According to our simulation results, the proposed architecture provides a significant performance improvement (5.85 % over PDOM, 91 % over DWF) with little hardware overhead. 相似文献

11.

Biometrics-based consumer applications driven by reconfigurable hardware architectures

M. FonsAuthor VitaeF. FonsAuthor Vitae E. CantóAuthor Vitae 《Future Generation Computer Systems》2012,28(1):268-286

Nowadays the development of automatic biometrics-based personal recognition systems is a reality in the current technological age. Not only those applications demanding stringent security levels but also many daily use consumer applications request the existence of high performance computational platforms in charge of recognizing the identity of an individual based on the analysis of his/her physiological or behavioural characteristics. The state of the art points out two main open problems in the implementation of such automatic applications: on the one hand, the needed improvement of the reliability level of the existing recognition systems in terms of accuracy, security and real-time performances; on the other hand, the cost reduction of those physical platforms in charge of the processing.This work addresses those limitations of current systems and aims at finding the proper system architecture to develop this kind of high-performance applications at low cost. Because of that, those existing solutions based on expensive multiprocessor systems like HPC (High Performance Computer), GPU (Graphics Processing Unit), or PC (Personal Computer) platforms need to be discarded, and instead of them embedded system solutions based on programmable logic devices are suggested in this work. The programmability performances of FPGA (Field Programmable Gate Array) devices together with the inherent parallelism of hardware design provide the needed flexibility to develop made-to-measure coprocessors in charge of accelerating those time-critical computational tasks. To address the cost of the system, dynamically reconfigurable FPGAs are suggested in this work. The scheduling of the recognition application into a series of mutually exclusive tasks, and the reutilization of those functional resources available in the FPGA by multiplexing different coprocessors in the same area along the application execution time allows reducing the size of the device and therefore its cost at the expense of the reconfiguration overhead.The hardware-software co-design of an AFAS (automatic fingerprint-based authentication system) under two different run-time reconfigurable platforms is presented as the proof of concept of the suggested architecture. The outstanding results achieved in this work pave the way for the implementation of biometric applications by means of run-time reconfigurable FPGAs. 相似文献

12.

Implementation of the EARTH programming model on SMP clusters: a multi‐threaded language and runtime system

G. Tremblay C. J. Morrone J. N. Amaral G. R. Gao 《Concurrency and Computation》2003,15(9):821-844

This paper describes the design and implementation of an Efficient Architecture for Running THreads (EARTH) runtime system for a multi‐processor/multi‐node cluster. The (EARTH) model was designed to support the efficient execution of parallel (multi‐threaded) programs with irregular fine‐grain parallelism using off‐the‐shelf computers. Implementing an EARTH runtime system requires an explicitly threaded runtime system. For portability, we built this runtime system on top of Pthreads under Linux and used sockets for inter‐node communication. Moreover, in order to make the best use of the resources available on a cluster of symmetric multi‐processors (SMP), this implementation enables the overlapping of communication and computation. We used Threaded‐C, a language designed to implement the programming model supported by the EARTH architecture. This language allows the expression of various levels of parallelism and provides the primitives needed to manage the required communication and synchronization. The Threaded‐C programming language supports irregular fine‐grain parallelism through a two‐level hierarchy of threads and fibers. It also provides various synchronization and communication constructs that reflect the nature of EARTH's fibers—non‐preemptive execution with data‐driven scheduling—as well as the extensive use of split‐phase transactions on EARTH to execute long‐latency operations. Copyright © 2003 John Wiley & Sons, Ltd. 相似文献

13.

Performance evaluation of computer architectures with main memory data compression

《Journal of Systems Architecture》1999,45(8):571-590

In this paper we investigate the application of data compression to the main memory of current workstation architectures. We propose an organisation where data and code are stored in compressed form while there is competition for memory resources. Main memory data compression allows programs with working sets much greater than physical memory to execute efficiently without resorting to costly disk paging activity. A performance model for this architecture clearly demonstrates that system performance may be improved by up to a factor of two when using software based memory compression instead of paging, while hardware assisted memory compression can improve performance by up to an order of magnitude. 相似文献

14.

实时微处理器体系结构综述 总被引：1，自引：0，他引：1

下载免费PDF全文

石伟张明郭御风龚锐《计算机工程与科学》2015,37(5):857-864

实时应用已经成为嵌入式应用中一类快速崛起的典型应用。作为实时系统的核心部件,实时微处理器体系结构是微处理器领域的一个重要研究方向。与通用处理器追求最大吞吐量不同,实时处理器要求具有紧凑且可计算的最坏执行时间。传统的实时处理器往往采用较为简单的处理器结构,避免复杂结构引入执行时间的不确定性。随着实时应用对处理器性能需求越来越高,实时处理器正逐渐向多线程与多核结构发展。在多线程与多核处理器中,共享资源竞争导致实时系统的确定性变差,对实时处理器体系结构带来了更大挑战。对实时微处理器体系结构进行综述,首先从指令集、微体系结构、存储、I/O、任务调度等多个方面对传统实时处理器进行分析;然后分别对采用多线程与多核结构的高性能实时处理器展开分析;最后对几种商用实时处理器结构进行比较,总结实时处理器发展现状与未来发展趋势。相似文献

15.

性能非对称多核处理器下异构感知调度技术

赵姗杨秋松李明树《软件学报》2019,30(4):1164-1190

为了满足应用程序的多样化需求,异构多核处理器出现并逐渐进入市场,其中的处理核心（core）具有不同的微架构或者指令集架构（ISA）,为应用提供多样化特性支持,比如指令级并行（ILP）、内存级并行（MLP）,这些核心协同工作满足整个计算系统的优化目标,比如高性能、低功耗或者良好的能效.然而,目前主流的调度技术主要是针对传统同构处理器架构设计,没有考虑异构硬件能力的差异性.在异构多核处理器环境下,调度技术如何感知硬件的异构特性,为不同类型的应用程序提供更加合适和匹配的硬件资源,这是值得探索的问题.对近年来在该研究领域的成果进行了综述研究,特别是在性能非对称多核处理器架构下,异构调度技术面临的优化目标、分析模型、调度决策和算法评估等主要问题进行了分析和描述,并依次对相关技术进行了系统的总结,最后从软硬件融合的角度对今后的研究工作进行了展望. 相似文献

16.

FPGA acceleration of a quantum Monte Carlo application

Akila Gothandaraman Gregory D. Peterson G.L. Warren Robert J. Hinde Robert J. Harrison 《Parallel Computing》2008,34(4-5):278-291

Quantum Monte Carlo methods enable us to determine the ground-state properties of atomic or molecular clusters. Here, we present a reconfigurable computing architecture using Field Programmable Gate Arrays (FPGAs) to accelerate two computationally intensive kernels of a Quantum Monte Carlo (QMC) application applied to N-body systems. We focus on two key kernels of the QMC application: acceleration of potential energy and wave function calculations. We compare the performance of our application on two reconfigurable platforms. Firstly, we use a dual-processor 2.4 GHz Intel Xeon augmented with two reconfigurable development boards consisting of Xilinx Virtex-II Pro FPGAs. Using this platform, we achieve a speedup of 3× over a software-only implementation. Following this, the chemistry application is ported to the Cray XD1 supercomputer equipped with Xilinx Virtex-II Pro and Virtex-4 FPGAs. The hardware-accelerated application on one node of the high performance system equipped with a single Virtex-4 FPGA yields a speedup of approximately 25× over the serial reference code running on one node of the dual-processor dual-core 2.2 GHz AMD Opteron. This speedup is mainly attributed to the use of pipelining, the use of fixed-point arithmetic for all calculations and the fine-grained parallelism using FPGAs. We can further enhance the performance by operating multiple instances of our design in parallel. 相似文献

17.

A template system for the efficient compilation of domain abstractions onto reconfigurable computers

《Journal of Systems Architecture》2013,59(2):91-102

Past research has addressed the issue of using FPGAs as accelerators for HPC systems. Such research has identified that writing low level code for the generation of an efficient, portable and scalable architecture is challenging. We propose to increase the level of abstraction in order to help developers of reconfigurable accelerators deal with these three key issues. Our approach implements domain specific abstractions for FPGA based accelerators using techniques from generic programming. In this paper we explain the main concepts behind our system to Design Accelerators by Template Expansions (DATE). The DATE system can be effectively used for expanding individual kernels of an application and also for the generation of interfaces between various kernels to implement a complete system architecture. We present evaluations for six kernels as examples of individual kernel generation using the proposed system. Our evaluations are mainly intended to provide a proof-of-concept. We also show the usage of the DATE system for integration of various kernels to build a complete system based on a Template Architecture for Reconfigurable Accelerator Designs (TARCAD). 相似文献

18.

Micro-Task Processing in Heterogeneous Reconfigurable Systems

下载免费PDF全文

Sebastian Wallner 《计算机科学技术学报》2005,20(5):624-634

New reconfigurable computing architectures are introduced to overcome some of the limitations of conventional microprocessors and fine-grained reconfigurable devices (e.g., FPGAs). One of the new promising architectures are Configurable System-on-Chip (CSoC) solutions. They were designed to offer high computational performance for real-time signal processing and for a wide range of applications exhibiting high degrees of parallelism. The programming of such systems is an inherently challenging problem due to the lack of an programming model. This paper describes a novel heterogeneous system architecture for signal processing and data streaming applications. It offers high computational performance and a high degree of flexibility and adaptability by employing a micro Task Controller (mTC) unit in conjunction with programmable and configurable hardware. The hierarchically organized architecture provides a programming model, allows an efficient mapping of applications and is shown to be easy scalable to future VLSI technologies. Several mappings of commonly used digital signal processing algorithms for future telecommunication and multimedia systems and implementation results are given for a standard-cell ASIC design realization in 0.18 micron 6-layer UMC CMOS technology. 相似文献

19.

Flexible VLIW processor based on FPGA for efficient embedded real-time image processing

Vincent Brost Fan Yang Charles Meunier 《Journal of Real-Time Image Processing》2014,9(1):47-59

Modern field programmable gate array (FPGA) chips, with their larger memory capacity and reconfigurability potential, are opening new frontiers in rapid prototyping of embedded systems. With the advent of high-density FPGAs, it is now possible to implement a high-performance VLIW (very long instruction word) processor core in an FPGA. With VLIW architecture, the processor effectiveness depends on the ability of compilers to provide sufficient ILP (instruction-level parallelism) from program code. This paper describes research result about enabling the VLIW processor model for real-time processing applications by exploiting FPGA technology. Our goals are to keep the flexibility of processors to shorten the development cycle, and to use the powerful FPGA resources to increase real-time performance. We present a flexible VLIW VHDL processor model with a variable instruction set and a customizable architecture which allows exploiting intrinsic parallelism of a target application using advanced compiler technology and implementing it in an optimal manner on FPGA. Some common algorithms of image processing were tested and validated using the proposed development cycle. We also realized the rapid prototyping of embedded contactless palmprint extraction on an FPGA Virtex-6 based board for a biometric application and obtained a processing time of 145.6 ms per image. Our approach applies some criteria for co-design tools: flexibility, modularity, performance, and reusability. 相似文献

20.

FPGA-based architecture for the real-time computation of 2-D convolution with large kernel size

F. Javier Toledo-Moreo J. Javier Martínez-Alvarez Javier Garrigós-Guerrero J. Manuel Ferrández-Vicente 《Journal of Systems Architecture》2012,58(8):277-285

Bidimensional convolution is a low-level processing algorithm of interest in many areas, but its high computational cost constrains the size of the kernels, especially in real-time embedded systems. This paper presents a hardware architecture for the FPGA-based implementation of 2-D convolution with medium–large kernels. It is a multiplierless solution based on Distributed Arithmetic implemented using general purpose resources in FPGAs. Our proposal is modular and coefficient independent, so it remains fully flexible and customizable for any application. The architecture design includes a control unit to manage efficiently the operations at the borders of the input array. Results in terms of occupied resources and timing are reported for different configurations. We compare these results with other approaches in the state of the art to validate our approach. 相似文献