期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

SpExSim: assessing kernel suitability for C-based high-level hardware synthesis

Oppermann Julian Sommer Lukas Koch Andreas 《The Journal of supercomputing》2019,75(8):4062-4077

We present SpExSim, a software tool for quickly surveying legacy code bases for kernels that could be accelerated by FPGA-based compute units. We specifically aim for low development effort by considering the use of C-based high-level hardware synthesis, instead of complex manual hardware designs. SpExSim not only exploits the spatially distributed model of computation commonly used on FPGAs, but can also model the effect of two different microarchitectures commonly used in C-to-hardware compilers, including pipelined architectures with modulo scheduling. The estimations have been validated against actual hardware generated by two current HLS tools.

相似文献

2.

FPGA implementation of particle swarm optimization for Bayesian network learning

Matthew J. Hibbard Eric R. Peskin Ferat Sahin 《Computers & Electrical Engineering》2013

Using a Bayesian network (BN) learned from data can aid in diagnosing and predicting failures within a system while achieving other capabilities such as the monitoring of a system. However, learning a BN requires computationally intensive processes. This makes BN learning a candidate for acceleration using reconfigurable hardware such as field-programmable gate arrays (FPGAs). We present a FPGA-based implementation of BN learning using particle-swarm optimization (PSO). This design thus occupies the intersection of three areas: reconfigurable computing, BN learning, and PSO. There is significant prior work in each of these three areas. Indeed, there are examples of prior work in each pair among the three. However, the present work is the first to study the combination of all three. As a baseline, we use a prior software implementation of BN learning using PSO. We compare this to our FPGA-based implementation to study trade-offs in terms of performance and cost. Both designs use a master–slave topology and floating-point calculations for the fitness function. The performance of the FPGA-based version is limited not by the fitness function, but rather by the construction of conditional probability tables (CPTs). The CPT construction only requires integer calculations. We exploit this difference by separating these two functions into separate clock domains. The FPGA-based solution achieves about 2.6 times the number of fitness evaluations per second per slave compared to the software implementation. 相似文献

3.

Parameterized floating-point logarithm and exponential functions for FPGAs

Jrmie Detrey Florent de Dinechin 《Microprocessors and Microsystems》2007,31(8):537-545

As FPGAs are increasingly being used for floating-point computing, the feasibility of a library of floating-point elementary functions for FPGAs is discussed. An initial implementation of such a library contains parameterized operators for the logarithm and exponential functions. In single precision, those operators use a small fraction of the FPGA’s resources, have a smaller latency than their software equivalent on a high-end processor, and provide about ten times the throughput in pipelined version. Previous work had shown that FPGAs could use massive parallelism to balance the poor performance of their basic floating-point operators compared to the equivalent in processors. As this work shows, when evaluating an elementary function, the flexibility of FPGAs provides much better performance than the processor without even resorting to parallelism. The presented library is freely available from http://www.ens-lyon.fr/LIP/Arenaire/. 相似文献

4.

Accelerating thread-intensive and explicit memory management programs with dynamic partial reconfiguration

Qianming Yang Mei Wen Nan Wu Chunyuan Zhang 《The Journal of supercomputing》2013,63(2):508-537

Recent research has shown that field programmable gate arrays (FPGAs) have a large potential for accelerating demanding applications, such as high performance digital signal process applications with low-volume market. The loss of generality in the architecture is one disadvantage of using FPGAs, however, the reconfigurability of FPGAs allow reprogramming for other applications. Therefore, a uniform FPGA-based architecture, an efficient programming model, and a simple mapping method are paramount for the wide acceptance of FPGA technology. This paper presents MASALA, a dynamically reconfigurable FPGA-based accelerator for parallel programs written in thread-intensive and explicit memory management (TEMM) programming models. Our system uses a TEMM programming model to parallelize demanding applications, including application decomposition into separate thread blocks and compute and data load/store decoupling. Hardware engines are included into MASALA using partial dynamic reconfiguration modules, each of which encapsulates a thread process engine that implements the hardware’s thread functionality. A data dispatching scheme is also included in MASALA to enable the explicit communication of multiple memory hierarchies such as interhardware engines, host processors, and hardware engines. Finally, this paper illustrates a multi-FPGA prototype system of the presented architecture: MASALA-SX. A large synthetic aperture radar image formatting experiment shows that MASALA’s architecture facilitates the construction of a TEMM program accelerator by providing greater performance and less power consumption than current CPU platforms, without sacrificing programmability, flexibility, and scalability. 相似文献

5.

RPM: a rapid prototyping engine for multiprocessor systems

Barroso L.A. Iman S. Dubois M. Ramamurthy K. 《Computer》1995,28(2):26-34

RPM enables rapid prototyping of different multiprocessor architectures. It uses hardware emulation for reliable design verification and performance evaluation. The major objective of the RPM project is to develop a common, configurable hardware platform to accurately emulate different MIMD systems with up to eight execution processors. Because emulation is orders of magnitude faster than simulation, an emulator can run problems with large data sets more representative of the workloads for which the target machine is designed. Because an emulation is closer to the target implementation than an abstracted simulation, it can accomplish more reliable performance evaluation and design verification. Finally, an emulator is a real computer with its own I/O; the code running on the emulator is not instrumented. As a result, the emulator looks exactly like the target machine (to the programmer) and can run several different workloads, including code from production compilers, operating systems, databases, and software utilities 相似文献

6.

Compiling for power with ScalaPipe

《Journal of Systems Architecture》2013,59(8):615-625

相似文献

7.

Real time FPGA implementation of a high speed and area optimized Harris corner detection algorithm

《Microprocessors and Microsystems》2021

相似文献

8.

Fast description and synthesis of control-dominant circuits

Marc-André DaigneaultAuthor Vitae Jean Pierre David 《Computers & Electrical Engineering》2014

相似文献

9.

Hardware–software optimizations of reconfigurable multi-core processors for floating-point computations of large sparse matrices

Xiaofang Wang 《Journal of Real-Time Image Processing》2014,9(1):187-204

State-of-the-art field-programmable gate array (FPGA) technologies have provided exciting opportunities to develop more flexible, less expensive, and better performance floating-point computing platforms for embedded systems. To better harness the full power of FPGAs and to bring FPGAs to more system designers, we investigate unique advantages and optimization opportunities in both software and hardware offered by multi-core processors on a programmable chip (MPoPCs). In this paper, we present our hardware customization and software dynamic scheduling solutions for LU factorization of large sparse matrices on in-house developed MPoPCs. Theoretical analysis is provided to guide the design. Implementation results on an Altera Stratix III FPGA for five benchmark matrices of size up to 7,917 × 7,917 are presented. Our hardware customization alone can reduce the execution time by up to 17.22 %. The integrated hardware–software optimization improves the speedup by an average of 60.30 %. 相似文献

10.

LALP: A Language to Program Custom FPGA-Based Acceleration Engines

Ricardo Menotti João M. P. Cardoso Marcio M. Fernandes Eduardo Marques 《International journal of parallel programming》2012,40(3):262-289

相似文献

11.

A coarse-grain reconfigurable architecture for multimedia applications supporting subword and floating-point calculations

Claudio Brunelli Fabio Garzia Davide Rossi Jari Nurmi 《Journal of Systems Architecture》2010,56(1):38-47

Signal processors exploiting ASIC acceleration suffer from sky-rocketing manufacturing costs and long design cycles. FPGA-based systems provide a programmable alternative for exploiting computation parallelism, but the flexibility they provide is not as high as in processor-oriented architectures: HDL or C-to-HDL flows still require specific expertise and a hardware knowledge background. On the other hand, the large size of the configuration bitstream and the inherent complexity of FPGA devices make their dynamic reconfiguration not a very viable approach. Coarse-grained reconfigurable architectures (CGRAs) are an appealing solution but they pose implementation problems and tend to be application specific. This paper presents a scalable CGRA which eases the implementation of algorithms on field programmable gate array (FPGA) platforms. This design option is based on two levels of programmability: it takes advantage of performance and reliability provided by state-of-the-art FPGA technology, and at the same time it provides the user with flexibility, performance and ease of reconfiguration typical of standard CGRAs. The basic cell template provides advanced features such as sub-word SIMD integer and floating-point computation capabilities, as well as saturating arithmetic. Multiple reconfiguration contexts and partial run-time reconfiguration capabilities are provided, tackling this way the problem of high reconfiguration overhead typical of FPGAs. Selected instances of the proposed architecture have been implemented on an Altera Stratix II EP2S180 FPGA. On this system, we mapped some common DSP, image processing, 3D graphics and audio compression algorithms in order to validate our approach and to demonstrate its effectiveness by benchmarking the benefits achieved. 相似文献

12.

基于CORDIC算法的高精度浮点超越函数的FPGA实现 总被引：2，自引：1，他引：2

李全李晓欢陈石平《电子技术应用》2009,35(5)

提出了一种新的输入输出浮点处理单元硬件架构,将数据从CORDIC算法内部格式转换为处理器能够支持的IEEE754标准浮点数据格式。输入数据支持2种不同的角度单位浮点数据直接输入,同时,硬件模块还直接支持超过360°的大角度数据输入。在Altera公司NiosⅡ处理器系统中以用户自定义指令的形式实现了该浮点硬件计算模块,并通过C语言程序验证了该模块的正确性。相似文献

13.

FDGLib:A Communication Library for Efficient Large-Scale Graph Processing in FPGA-Accelerated Data Centers

下载免费PDF全文

Yu-Wei Wu Qing-Gang Wang Long Zheng Xiao-Fei Liao Hai Jin Wen-Bin Jiang Ran Zheng Kan Hu 《计算机科学技术学报》2021,36(5):1051-1070

With the rapid growth of real-world graphs,the size of which can easily exceed the on-chip (board) storage capacity of an accelerator,processing large-scale graphs on a single Field Programmable Gate Array (FPGA) becomes difficult.The multi-FPGA acceleration is of great necessity and importance.Many cloud providers (e.g.,Amazon,Microsoft,and Baidu) now expose FPGAs to users in their data centers,providing opportunities to accelerate large-scale graph processing.In this paper,we present a communication library,called FDGLib,which can easily scale out any existing single FPGA-based graph accelerator to a distributed version in a data center,with minimal hardware engineering efforts.FDGLib provides six APIs that can be easily used and integrated into any FPGA-based graph accelerator with only a few lines of code modifications.Considering the torus-based FPGA interconnection in data centers,FDGLib also improves communication efficiency using simple yet effective torus-friendly graph partition and placement schemes.We interface FDGLib into AccuGraph,a state-of-the-art graph accelerator.Our results on a 32-node Microsoft Catapult-like data center show that the distributed AccuGraph can be 2.32x and 4.77x faster than a state-of-the-art distributed FPGA-based graph accelerator ForeGraph and a distributed CPU-based graph system Gemini,with better scalability. 相似文献

14.

PMSS: A programmable memory system and scheduler for complex memory patterns

Tassadaq Hussain Amna Haider Eduard Ayguadé 《Journal of Parallel and Distributed Computing》2014

HPC industry demands more computing units on FPGAs, to enhance the performance by using task/data parallelism. FPGAs can provide its ultimate performance on certain kernels by customizing the hardware for the applications. However, applications are getting more complex, with multiple kernels and complex data arrangements, generating overhead while scheduling/managing system resources. Due to this reason all classes of multi threaded machines–minicomputer to supercomputer–require to have efficient hardware scheduler and memory manager that improves the effective bandwidth and latency of the DRAM main memory. This architecture could be a very competitive choice for supercomputing systems that meets the demand of parallelism for HPC benchmarks. In this article, we proposed a Programmable Memory System and Scheduler (PMSS), which provides high speed complex data access pattern to the multi threaded architecture. This proposed PMSS system is implemented and tested on a Xilinx ML505 evaluation FPGA board. The performance of the system is compared with a microprocessor based system that has been integrated with the Xilkernel operating system. Results show that the modified PMSS based multi-accelerator system consumes 50% less hardware resources, 32% less on-chip power and achieves approximately a 19x speedup compared to the MicroBlaze based system. 相似文献

15.

基于FPGA的RC4加密算法设计及实现

下载免费PDF全文

苗三立左金印宋宇飞《计算机测量与控制》2018,26(2):252-254

针对通信安全问题,采用自顶向下的设计方法,设计了一种RC4算法基于FPGA的实现方式,实现了通信数据的加密传输。根据RC4加密算法的原理和设计流程,使用Verilog HDL编程语言,采用有限状态机（FSM）的编程方式实现算法,通过Modelsim SE 10.1a仿真软件进行仿真,并在FPGA开发板上进行验证。采用本文提出的FPGA设计方法实现的RC4加密算法相比软件加密方式和已有的FPGA实现方式速度有明显提高。相似文献

16.

A single-chip multiprocessor

Nayfeh B.A. Olukotun K. 《Computer》1997,30(9):79-85

Presents the case for billion-transistor processor architectures that will consist of chip multiprocessors (CMPs): multiple (four to 16) simple, fast processors on one chip. In their proposal, each processor is tightly coupled to a small, fast, level-one cache, and all processors share a larger level-two cache. The processors may collaborate on a parallel job or run independent tasks (as in the SMT proposal). The CMP architecture lends itself to simpler design, faster validation, cleaner functional partitioning, and higher theoretical peak performance. However for this architecture to realize its performance potential, either programmers or compilers will have to make code explicitly parallel. Old ISAs will be incompatible with this architecture (although they could run slowly on one of the small processors) 相似文献

17.

Improving performance of codes with large/irregular stride memory access patterns via high performance reconfigurable computers

Khalid H. Abed Gerald R. Morris 《Journal of Parallel and Distributed Computing》2013

Codes that have large-stride/irregular-stride (L/I) memory access patterns, e.g., sparse matrix and linked list codes, often perform poorly on mainstream clusters because of the general purpose processor (GPP) memory hierarchy. High performance reconfigurable computers (HPRC) contain both GPPs and field programmable gate arrays (FPGAs) connected via a high-speed network. In this research, simple 64-bit floating-point codes are used to illustrate the runtime performance impact of L/I memory accesses in both software-only and FPGA-augmented codes and to assess the benefits of mapping L/I-type codes onto HPRCs. The experiments documented herein reveal that large-stride software-only codes experience severe performance degradation. In contrast, large-stride FPGA-augmented codes experience minimal performance degradation. For experiments with large data sizes, the unit-stride FPGA-augmented code ran about two times slower than software. On the other hand, the large-stride FPGA-augmented code ran faster than software for all the larger data sizes. The largest showed a 17-fold runtime speedup. 相似文献

18.

FPGA-based architecture for the real-time computation of 2-D convolution with large kernel size

F. Javier Toledo-Moreo J. Javier Martínez-Alvarez Javier Garrigós-Guerrero J. Manuel Ferrández-Vicente 《Journal of Systems Architecture》2012,58(8):277-285

Bidimensional convolution is a low-level processing algorithm of interest in many areas, but its high computational cost constrains the size of the kernels, especially in real-time embedded systems. This paper presents a hardware architecture for the FPGA-based implementation of 2-D convolution with medium–large kernels. It is a multiplierless solution based on Distributed Arithmetic implemented using general purpose resources in FPGAs. Our proposal is modular and coefficient independent, so it remains fully flexible and customizable for any application. The architecture design includes a control unit to manage efficiently the operations at the borders of the input array. Results in terms of occupied resources and timing are reported for different configurations. We compare these results with other approaches in the state of the art to validate our approach. 相似文献

19.

On parallel computation of centrality measures of graphs

García Juan F. Carriegos M. V. 《The Journal of supercomputing》2019,75(3):1410-1428

Centrality measures or indicators of centrality identify most relevant nodes of graphs. Although optimized algorithms exist for computing of most of them, they are still time consuming and are even infeasible to apply to big enough graphs like the ones representing social networks or extensive enough computer networks. In this paper, we present a parallel implementation in C language of some optimal algorithms for computing of some indicators of centrality. Our parallel version greatly reduces the execution time of their sequential (non-parallel) counterpart. The proposed solution relies on threading, allowing for a theoretical improvement in performance close to the number of logical processors (cores) of the single computer in which it is running. Our software has been tested in several platforms, including the supercomputer Calendula, in which we achieved execution times close to 18 times faster when running our parallel implementation instead of our sequential one. Our solution is multi-platform and portable, working on any machine with several logical processor which is capable of compiling and running C language code.

相似文献

20.

基于FPGA的步进电机控制芯片设计

刘征东吴学杰《工业控制计算机》2011,24(4):71-72

研究了一种以FPGA为基础的步进电机控制芯片的设计,本芯片采用Actel公司的ProASIC3系列FPGA:A3P030进行开发,采用Verilog HDL硬件描述语言进行硬件电路设计,使用Actel公司出品的Libero集成开发环境,通过C8051F060单片机,以C语言为开发语言对设计的芯片进行实际测试. 相似文献