1.

FPGA architecture and implementation of sparse matrix-vector multiplication for the finite element method

Yousef Elkurdi Evgueni Souleimanov Warren J. Gross 《Computer Physics Communications》2008,178(8):558-570

The Finite Element Method (FEM) is a computationally intensive scientific and engineering analysis tool that has diverse applications ranging from structural engineering to electromagnetic simulation. Trends in floating-point performance are moving in favor of Field-Programmable Gate Arrays (FPGAs), and interest in exploiting this technology has grown in the scientific community. We present an architecture and implementation of an FPGA-based sparse matrix-vector multiplier (SMVM) for use in the iterative solution of large, sparse systems of equations arising from FEM applications. FEM matrices display specific sparsity patterns that can be exploited to improve the efficiency of hardware designs. Our architecture exploits FEM matrix sparsity structure to achieve a balance between performance and hardware resource requirements by relying on external SDRAM for data storage while utilizing the FPGA's computational resources in a stream-through systolic approach. The architecture is based on a pipelined linear array of processing elements (PEs) coupled with a hardware-oriented matrix striping algorithm and a partitioning scheme which enables it to process arbitrarily large matrices without changing the number of PEs in the architecture. Therefore, this architecture is limited only by the amount of external RAM available to the FPGA. The implemented SMVM-pipeline prototype contains 8 PEs and is clocked at 110 MHz, obtaining a peak performance of 1.76 GFLOPS. For 8 GB/s of memory bandwidth, typical of recent FPGA systems, this architecture can achieve 1.5 GFLOPS sustained performance. Using multiple instances of the pipeline, linear scaling of the peak and sustained performance can be achieved. Our stream-through architecture provides the added advantage of enabling an iterative implementation of the SMVM computation required by iterative solution techniques such as the conjugate gradient method, avoiding initialization time due to data loading and setup inside the FPGA internal memory.
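For readers unfamiliar with the kernel, the following is a minimal software sketch of sparse matrix-vector multiplication. It uses the generic compressed sparse row (CSR) layout, which is an assumption made here for illustration and not the striping format of the paper. The helper `peak_gflops` is also hypothetical: it assumes each PE completes one multiply and one add per clock cycle, an assumption that happens to reproduce the quoted 1.76 GFLOPS for 8 PEs at 110 MHz.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Multiply a CSR-encoded sparse matrix by a dense vector x."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        # Non-zeros of this row occupy values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


def peak_gflops(num_pes, clock_mhz, flops_per_pe_per_cycle=2):
    """Peak throughput, ASSUMING one multiply plus one add per PE per cycle."""
    return num_pes * clock_mhz * flops_per_pe_per_cycle / 1000.0


# 3x3 example matrix: [[4, 0, 1], [0, 3, 0], [2, 0, 5]]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]

print(csr_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0]))
print(peak_gflops(num_pes=8, clock_mhz=110))
```

The inner loop is the multiply-accumulate chain that a hardware pipeline would stream through; the irregular `x[col_idx[k]]` access is what makes the kernel memory-bound in practice.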

2.

Hardware accelerator for solving 0–1 knapsack problems using binary harmony search

Mohammed El-Shafei Imtiaz Ahmad 《International Journal of Parallel, Emergent and Distributed Systems》2018,33(1):87-102

The 0–1 knapsack problem (KP) is a well-known intractable optimization problem with a wide range of applications. Harmony Search (HS) is one of the most popular metaheuristic algorithms for solving 0–1 KPs. Nevertheless, metaheuristic algorithms are generally compute-intensive and slow when implemented in software. In this paper, we present an FPGA-based pipelined hardware accelerator to reduce computation time for solving large-dimension 0–1 KPs using the Binary Harmony Search algorithm. The proposed architecture exploits the intrinsic parallelism of population-based metaheuristic algorithms and the flexibility and parallel processing capabilities of FPGAs to perform the computation concurrently, thus enhancing performance. To validate the efficiency of the proposed hardware accelerator, experiments were conducted using a large number of 0–1 KPs. Comparative analysis of the experimental results reveals that the proposed approach offers speedups of 51×–111× compared with a software implementation and 2×–5× compared with a hardware implementation of the Binary Particle Swarm Optimization algorithm.

3.

A pipelined-loop-compatible architecture and algorithm to reduce variable-length sets of floating-point data on a reconfigurable computer

Gerald R. Morris Viktor K. Prasanna 《Journal of Parallel and Distributed Computing》2008


4.

A hardware intelligent processing accelerator for domestic service robots

Yutaro Ishida Takashi Morie Hakaru Tamukoh 《Advanced Robotics》2020,34(14):947-957

We present a method for implementing a hardware intelligent processing accelerator on domestic service robots. These domestic service robots support human life; therefore, they are required to recognize environments using intelligent processing. Moreover, the intelligent processing requires large computational resources. Therefore, standard personal computers (PCs) with robot middleware on the robots do not have enough resources for this intelligent processing. We propose a ‘connective object for middleware to an accelerator (COMTA),’ which is a system that integrates hardware intelligent processing accelerators and robot middleware. Herein, by constructing dedicated-architecture digital circuits, field-programmable gate arrays (FPGAs) accelerate intelligent processing. In addition, the system can configure and access applications on hardware accelerators via a robot middleware space; consequently, robotic engineers do not require knowledge of FPGAs. We conducted an experiment on the proposed system using a human-following application with image processing, which is commonly applied in such robots. Experimental results demonstrated that the proposed system can be automatically constructed from a single configuration file on the robot middleware and can execute the application 5.2 times more efficiently than an ordinary PC.

5.

A comprehensive reconfigurable computing approach to memory wall problem of large graph computation

《Journal of Systems Architecture》2016

Graph computation problems that exhibit irregular memory access patterns are known to show poor performance on multiprocessor architectures. Although recent studies use FPGA technology to tackle the memory wall problem of graph computation by adopting a massively multi-threaded architecture, the performance is still far below the optimal memory performance due to the long memory access latency. In this paper, we propose a comprehensive reconfigurable computing approach to address the memory wall problem. First, we present an extended edge-streaming model with massive partitions to provide better load balance while taking advantage of the streaming bandwidth of external memory in processing large graphs. Second, we propose a two-level shuffle network architecture to significantly reduce the on-chip memory requirement while providing high processing throughput that matches the bandwidth of the external memory. Third, we introduce a compact storage design based on graph compression schemes and propose the corresponding encoding and decoding hardware to reduce the data volume transferred between the processing engines and external memory. We validate the effectiveness of the proposed architecture by implementing three frequently used graph algorithms on an ML605 board, showing up to a 3.85× improvement in performance-to-bandwidth ratio over previously published FPGA-based implementations.
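The edge-streaming idea mentioned above can be illustrated in software. The sketch below is a generic edge-centric breadth-first search, not the paper's extended model: the partitioning by source-vertex interval, the function names, and the choice of BFS are all assumptions made for illustration. The point it shows is that every pass reads the edge list sequentially, partition by partition, instead of chasing pointers through adjacency lists.

```python
def partition_edges(edges, num_vertices, num_partitions):
    """Group edges by source-vertex interval (hypothetical partitioning rule)."""
    interval = -(-num_vertices // num_partitions)  # ceiling division
    parts = [[] for _ in range(num_partitions)]
    for src, dst in edges:
        parts[src // interval].append((src, dst))
    return parts


def edge_streaming_bfs(edges, num_vertices, root, num_partitions=2):
    """BFS levels computed by repeatedly streaming all edges in order."""
    parts = partition_edges(edges, num_vertices, num_partitions)
    level = [-1] * num_vertices  # -1 marks an unreached vertex
    level[root] = 0
    current = 0
    changed = True
    while changed:
        changed = False
        for part in parts:             # partitions are visited one at a time
            for src, dst in part:      # edges inside a partition are read sequentially
                if level[src] == current and level[dst] == -1:
                    level[dst] = current + 1
                    changed = True
        current += 1
    return level


edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
print(edge_streaming_bfs(edges, num_vertices=6, root=0))
```

Sequential edge reads trade extra work (every edge is touched in every pass) for predictable, bandwidth-friendly memory traffic, which is the trade-off that streaming designs generally rely on.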

6.

A fast and scalable architecture to run convolutional neural networks in low density FPGAs

《Microprocessors and Microsystems》2020

Deep learning and, in particular, convolutional neural networks (CNNs) achieve very good results on several computer vision applications, such as security and surveillance, where image and video analysis are required. These networks are quite demanding in terms of computation and memory and therefore are usually implemented on high-performance computing platforms or devices. Running CNNs on embedded platforms or devices with low computational and memory resources requires careful optimization of system architectures and algorithms to obtain very efficient designs. In this context, Field Programmable Gate Arrays (FPGAs) can achieve this efficiency since the programmable hardware fabric can be tailored for each specific network. In this paper, a very efficient configurable architecture for CNN inference targeting FPGAs of any density is described. The architecture uses fixed-point arithmetic and image batching to reduce computational, memory and memory bandwidth requirements without compromising network accuracy. The developed architecture supports the execution of large CNNs on any FPGA device, including those with small on-chip memory and few logic resources. With the proposed architecture, AlexNet inference on one image takes 4.3 ms on a ZYNQ7020 and 1.2 ms on a ZYNQ7045.

7.

Development process for clusters on a reconfigurable chip

Eduard Fernandez-Alonso David Castells-Rufas 《Computers & Electrical Engineering》2012,38(3):756-771

Reconfigurable MPSoCs (Multiprocessor Systems-on-Chip) could be viable for certain application niches where the flexibility of FPGAs (Field-Programmable Gate Arrays) and software is needed, and where a small number of units rules out other silicon options. However, their design complexity is very high and raises additional problems, namely the definition of a suitable programming model, an efficient memory organization, and the need for ways to optimize application performance. In this paper, we propose a complete development process, which addresses these problems by complementing the current SoC (System-on-Chip) development process with additional steps to support parallel programming and software optimization. This work systematically explains the problems and solutions involved in building an FPGA-based MPSoC with our flow, and offers tools and techniques to develop parallel applications for such systems.

8.

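Automatic synthesis for data-driven processor arrays

邬贵明 窦勇 王淼 《计算机工程与科学》2009,31(Z1)

This paper proposes a data-driven processor array architecture that effectively balances storage and computation and is well suited to high-performance algorithm acceleration on FPGAs. It also proposes an automatic synthesis framework targeting this architecture, through which regular loops can be mapped efficiently onto the data-driven processor array. Experimental results demonstrate the effectiveness of the automatic synthesis framework and show that the generated designs outperform general-purpose processors.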

9.

PMSS: A programmable memory system and scheduler for complex memory patterns

Tassadaq Hussain Amna Haider Eduard Ayguadé 《Journal of Parallel and Distributed Computing》2014

The HPC industry demands more computing units on FPGAs to enhance performance through task and data parallelism. FPGAs can deliver their best performance on certain kernels by customizing the hardware for the application. However, applications are getting more complex, with multiple kernels and complex data arrangements, generating overhead while scheduling and managing system resources. For this reason, all classes of multi-threaded machines, from minicomputers to supercomputers, require an efficient hardware scheduler and memory manager that improve the effective bandwidth and latency of the DRAM main memory. Such an architecture could be a very competitive choice for supercomputing systems that must meet the demand for parallelism in HPC benchmarks. In this article, we propose a Programmable Memory System and Scheduler (PMSS), which provides high-speed complex data access patterns to a multi-threaded architecture. The proposed PMSS is implemented and tested on a Xilinx ML505 evaluation FPGA board. The performance of the system is compared with a microprocessor-based system that has been integrated with the Xilkernel operating system. Results show that the modified PMSS-based multi-accelerator system consumes 50% less hardware resources and 32% less on-chip power, and achieves approximately a 19x speedup compared to the MicroBlaze-based system.

10.

Compact modular exponentiation accelerator for modern FPGA devices

Timo Panu Marko Timo D. 《Computers & Electrical Engineering》2007,33(5-6):383-391

We present a compact FPGA implementation of a modular exponentiation accelerator suited for cryptographic applications. The implementation efficiently exploits the properties of modern FPGAs. The accelerator consumes 434 logic elements, four 9-bit DSP elements, and 13604 memory bits in Altera Stratix EP1S40. It performs modular exponentiations with up to 2250-bit integers and scales easily to larger exponentiations. Excluding pre- and post-processing time, 1024-bit and 2048-bit exponentiations are performed in 26.39 ms and 199.11 ms, respectively. Due to its compactness, standard interface, and support for different clock domains, the accelerator can effortlessly be integrated into a larger system in the same FPGA. The accelerator and its performance are demonstrated in practice with a fully functional prototype implementation consisting of software and hardware components.
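As background on the operation being accelerated, here is a minimal sketch of modular exponentiation using the textbook left-to-right square-and-multiply method. The abstract does not say which algorithm the accelerator uses, so this choice is an assumption for illustration only, and `mod_exp` is a hypothetical name.

```python
def mod_exp(base, exponent, modulus):
    """Compute (base ** exponent) % modulus by left-to-right square-and-multiply."""
    if modulus <= 0:
        raise ValueError("modulus must be positive")
    if exponent < 0:
        raise ValueError("exponent must be non-negative")
    result = 1 % modulus          # handles modulus == 1
    base %= modulus
    for bit in bin(exponent)[2:]:  # exponent bits, most significant first
        result = (result * result) % modulus    # square on every bit
        if bit == "1":
            result = (result * base) % modulus  # multiply only on set bits
    return result


print(mod_exp(4, 13, 497))
print(mod_exp(4, 13, 497) == pow(4, 13, 497))
```

A k-bit exponent costs k squarings plus one multiplication per set bit, so the work grows with both the exponent length and the cost of each k-bit modular multiplication. This is consistent with the 2048-bit case in the abstract taking far more than twice as long as the 1024-bit case.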

11.

FDGLib: A Communication Library for Efficient Large-Scale Graph Processing in FPGA-Accelerated Data Centers

Yu-Wei Wu Qing-Gang Wang Long Zheng Xiao-Fei Liao Hai Jin Wen-Bin Jiang Ran Zheng Kan Hu 《计算机科学技术学报》2021,36(5):1051-1070

With the rapid growth of real-world graphs, the size of which can easily exceed the on-chip (board) storage capacity of an accelerator, processing large-scale graphs on a single Field Programmable Gate Array (FPGA) becomes difficult. Multi-FPGA acceleration is of great necessity and importance. Many cloud providers (e.g., Amazon, Microsoft, and Baidu) now expose FPGAs to users in their data centers, providing opportunities to accelerate large-scale graph processing. In this paper, we present a communication library, called FDGLib, which can easily scale out any existing single FPGA-based graph accelerator to a distributed version in a data center, with minimal hardware engineering effort. FDGLib provides six APIs that can be easily used and integrated into any FPGA-based graph accelerator with only a few lines of code modifications. Considering the torus-based FPGA interconnection in data centers, FDGLib also improves communication efficiency using simple yet effective torus-friendly graph partition and placement schemes. We interface FDGLib into AccuGraph, a state-of-the-art graph accelerator. Our results on a 32-node Microsoft Catapult-like data center show that the distributed AccuGraph can be 2.32x and 4.77x faster than a state-of-the-art distributed FPGA-based graph accelerator ForeGraph and a distributed CPU-based graph system Gemini, with better scalability.

12.

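Hierarchical runtime support techniques for the Cell multicore processor

董小社 冯国富 王旭昊 冯景华 胡雷钧 《计算机研究与发展》2010,47(4)

The heterogeneous multicore architecture of the Cell processor and its software-managed multi-level memory hierarchy make it hard to program and make its performance difficult to exploit. Existing Cell/B.E. programming models mostly focus on applications with bulk data transfer, similar to stream processing, so traditional applications with irregular memory access perform poorly. By extending the Cell/B.E. memory access library to strengthen the autonomy of the coprocessor units, this work builds layered parallel programming runtime support on the Cell platform, centered on the coprocessor units and consisting of MPI and weakly consistent Pthreads. The layered runtime structure and the extended Cell/B.E. memory access library give the model better efficiency and scalability and improve the performance of irregular applications. MPI in the model eases the porting and development of many traditional parallel applications on the new architecture, while the weakly consistent Pthreads layer provides efficient task runtime management for MPI and gives system-level users a programming interface with full control over the architecture. Experimental results show that the proposed runtime support adapts to the requirements of different applications and, with the partitioning optimization mechanism in the memory access library, effectively exploits the performance of the Cell/B.E. architecture.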

13.

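A survey of thread synchronization for GPU parallel programming

高岚 赵雨晨 张伟功 王晶 钱德沛 《软件学报》2024,35(2):1028-1047

Parallel computing has become mainstream. In parallel computing systems, synchronization is a key design element and is crucial to making full use of hardware performance. In recent years, GPUs (graphics processing units), the most widely used accelerators, have developed rapidly, and many applications place higher demands on GPU thread synchronization. However, existing GPU systems struggle to efficiently support the complex thread synchronization found in real applications. Although researchers have proposed many methods to support GPU thread synchronization and have made considerable progress, the GPU's distinctive architecture and parallel model mean that research on GPU thread synchronization still faces many challenges. This survey classifies thread synchronization in GPU parallel programming by synchronization purpose and granularity. On this basis, focusing on the expression and execution of GPU thread synchronization, it first analyzes and summarizes the key problems and challenges: synchronization that is hard to express efficiently, prone to errors, and inefficient to execute. Then, organized by synchronization granularity, it reviews recent academic and industrial research on competitive and cooperative GPU thread synchronization from two angles, synchronization expression methods and performance optimization methods, and analyzes and summarizes the existing approaches. Finally, it identifies future research trends and prospects for GPU thread synchronization and suggests possible research directions, providing a reference for researchers in the field.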

14.

Design of a high performance architecture for real-time enhancement of video stream captured in extremely low lighting environment

Hau T. Ngo Ming Zhang Li Tao Vijayan K. Asari 《Microprocessors and Microsystems》2009,33(4):273-280

A high performance digital architecture for the implementation of a non-linear image enhancement technique is proposed in this paper. The image enhancement is based on a luminance dependent non-linear enhancement algorithm which achieves simultaneous dynamic range compression, colour consistency and lightness rendition. The algorithm provides better colour fidelity, enhances less noise, prevents the unwanted luminance drop at the uniform luminance areas, keeps the ‘bright’ background unaffected, and enhances the ‘dark’ objects in ‘bright’ background. The algorithm contains a large number of complex computations and thus it requires specialized hardware implementation for real-time applications. Systolic, pipelined and parallel design techniques are utilized effectively in the proposed FPGA-based architectural design to achieve real-time performance. Estimation techniques are also utilized in the hardware algorithmic design to achieve faster, simpler and more efficient architecture. The video enhancement system is implemented using Xilinx’s multimedia development board that contains a VirtexII-X2000 FPGA and it is capable of processing approximately 67 Mega-pixels (Mpixels) per second.

15.

Exploiting graphical processing units for data‐parallel scientific applications

A. Leist D. P. Playne K. A. Hawick 《Concurrency and Computation》2009,21(18):2400-2437

Graphical processing units (GPUs) have recently attracted attention for scientific applications such as particle simulations. This is partially driven by low commodity pricing of GPUs but also by recent toolkit and library developments that make them more accessible to scientific programmers. We discuss the application of GPU programming to two significantly different paradigms: regular mesh field equations with unusual boundary conditions, and graph analysis algorithms. The differing optimization techniques required for these two paradigms cover many of the challenges faced when developing GPU applications. We discuss the relevance of these application paradigms to simulation engines and games. GPUs were aimed primarily at the accelerated graphics market, but since this is often closely coupled to advanced game products, it is interesting to speculate about the future of fully integrated accelerator hardware for both visualization and simulation combined. As well as reporting the speed‐up performance on selected simulation paradigms, we discuss suitable data‐parallel algorithms and present code examples for exploiting GPU features like large numbers of threads and localized texture memory. We find a surprising variation in the performance that can be achieved on GPUs for our applications and discuss how these findings relate to past known effects in parallel computing such as memory speed‐related super‐linear speed up. Copyright © 2009 John Wiley & Sons, Ltd.

16.

Real-time compression architecture for efficient coding in autostereoscopic displays

D. P. Chaikalis N. P. Sgouros D. E. Maroulis M. S. Sangriotis 《Journal of Real-Time Image Processing》2010,5(1):45-56

Integral imaging is a promising technique for delivering high-quality three-dimensional content. However, the large amounts of data produced during acquisition prohibit direct transmission of Integral Image data. A number of highly efficient compression architectures have been proposed that outperform standard two-dimensional encoding schemes. However, real-time compression for quality-demanding applications remains a primary concern for existing Integral Image encoders. In this work we propose a real-time FPGA-based encoder for Integral Image and integral video content transmission. The proposed encoder is based on a highly efficient compression algorithm used in Integral Imaging applications. Real-time performance is achieved by realizing a pipelined architecture, taking into account the specific structure of an Integral Image. The required memory access operations are minimized by adopting a systolic concept of data flow through the core processing elements, further increasing performance. The encoder targets real-time, broadcast-type high-resolution Integral Image and video sequences and performs three orders of magnitude faster than the analogous software approach.

17.

Micro-Task Processing in Heterogeneous Reconfigurable Systems

Sebastian Wallner 《计算机科学技术学报》2005,20(5):624-634

New reconfigurable computing architectures are introduced to overcome some of the limitations of conventional microprocessors and fine-grained reconfigurable devices (e.g., FPGAs). Among the promising new architectures are Configurable System-on-Chip (CSoC) solutions. They were designed to offer high computational performance for real-time signal processing and for a wide range of applications exhibiting high degrees of parallelism. Programming such systems is an inherently challenging problem due to the lack of a programming model. This paper describes a novel heterogeneous system architecture for signal processing and data streaming applications. It offers high computational performance and a high degree of flexibility and adaptability by employing a micro Task Controller (mTC) unit in conjunction with programmable and configurable hardware. The hierarchically organized architecture provides a programming model, allows an efficient mapping of applications and is shown to be easily scalable to future VLSI technologies. Several mappings of commonly used digital signal processing algorithms for future telecommunication and multimedia systems are presented, along with implementation results for a standard-cell ASIC design realization in 0.18 micron 6-layer UMC CMOS technology.

18.

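A study of hardware acceleration schemes for Xilinx programmable systems-on-chip

张宇 冯丹 《小型微型计算机系统》2010,31(6)

Embedded computing applications keep growing, and embedded systems need considerable processing power to meet application demands. Coupling a dedicated hardware processing module into the system to accelerate a compute-intensive application is a widely adopted and effective approach. For programmable systems-on-chip based on Xilinx FPGAs, this paper studies three forms of hardware acceleration from an architectural perspective: (1) a coprocessor coupled to the CPU; (2) an accelerator attached to the PLB bus; (3) an accelerator attached to the MPMC Switch Fabric. The characteristics of each scheme are analyzed. For the experiments, the 128-bit AES encryption algorithm was selected and implemented in hardware on a Xilinx Virtex5 device. The results show that the MPMC-based accelerator scheme performs well and has the lowest CPU utilization.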

19.

Low power data processing system with self-reconfigurable architecture

《Journal of Systems Architecture》2007,53(9):568-576

In this paper, a low power data processing system with a self-reconfigurable architecture and USB interface is presented. A single FPGA performs all processing and controls the multiple configurations without any additional elements, such as a microprocessor, host computer or additional FPGAs. This architecture allows high performance with very low power consumption, making it a comprehensive alternative to microprocessor or DSP systems. In addition, a hierarchical reconfiguration system is used to support a large number of different processing tasks without the power consumption penalty of a big local configuration memory. Due to its simplicity and low power, this data processing system is especially suitable for portable applications, reducing the disadvantage of FPGAs against ASICs in low power consumption applications [A. Amara, F. Amiel, T. Ea, FPGA vs. ASIC for low power applications, Microelectronics Journal 37 (8) (2006) 669–677].

20.

FPGA-based architecture for the real-time computation of 2-D convolution with large kernel size

F. Javier Toledo-Moreo J. Javier Martínez-Alvarez Javier Garrigós-Guerrero J. Manuel Ferrández-Vicente 《Journal of Systems Architecture》2012,58(8):277-285

Bidimensional convolution is a low-level processing algorithm of interest in many areas, but its high computational cost constrains the size of the kernels, especially in real-time embedded systems. This paper presents a hardware architecture for the FPGA-based implementation of 2-D convolution with medium–large kernels. It is a multiplierless solution based on Distributed Arithmetic implemented using general purpose resources in FPGAs. Our proposal is modular and coefficient independent, so it remains fully flexible and customizable for any application. The architecture design includes a control unit to efficiently manage the operations at the borders of the input array. Results in terms of occupied resources and timing are reported for different configurations. We compare these results with other state-of-the-art approaches to validate our approach.
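To make the term concrete, the sketch below shows the classic lookup-table form of Distributed Arithmetic for one inner product, which is the operation behind each output pixel of a convolution. This is the textbook scheme with a precomputed table, assumed here for illustration. The paper describes its design as coefficient independent, so its actual structure may differ. The function names and the unsigned 8-bit sample width are likewise assumptions.

```python
def build_da_lut(coeffs):
    """LUT entry at address a = sum of the coefficients whose bit is set in a."""
    n = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << n)]


def da_dot(lut, samples, bits=8):
    """Inner product of coeffs and unsigned samples using only lookups, shifts, adds."""
    acc = 0
    for b in range(bits):                    # one bit-plane per step, LSB first
        addr = 0
        for k, s in enumerate(samples):
            addr |= ((s >> b) & 1) << k      # gather bit b of every sample
        acc += lut[addr] << b                # weight the partial sum by 2**b
    return acc


kernel = [1, 0, -1, 2, 0, -2, 1, 0, -1]         # 3x3 kernel, flattened row by row
window = [10, 20, 30, 40, 50, 60, 70, 80, 90]   # 3x3 pixel window, flattened
lut = build_da_lut(kernel)

print(da_dot(lut, window))
print(sum(c * p for c, p in zip(kernel, window)))  # direct multiply-accumulate
```

No multiplication appears in `da_dot`, but the table has 2**n entries for n taps. Practical designs for larger kernels therefore generally split the taps across several smaller tables and add their outputs.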