Similar Documents
20 similar documents found.
1.

SRAM-based FPGAs feature high performance and flexibility, and have therefore found many applications in modern high-performance computing (HPC) systems. These systems, however, suffer from limited computing resources when running HPC applications, so multi-FPGA systems have emerged to alleviate this resource limitation. In this regard, efficient scheduling strategies are required to dynamically steer the execution of applications (represented as task graphs) on a set of connected FPGAs. In this paper, a heuristic-based dynamic critical-path-aware scheduling technique named CPA is presented for scheduling task graphs on multi-FPGA systems. By considering the computation and communication capabilities of the FPGAs, the proposed technique dynamically assigns priorities to tasks at different steps in order to achieve better makespans. The technique has been evaluated through several experiments on real-world task graphs and on three different shapes of random task graphs with varying numbers of tasks, and its efficiency has been compared with that of three other task graph scheduling approaches. The results show that the proposed CPA technique outperforms well-known heuristic scheduling strategies, improving their makespans by 13.47% on average. In addition, the experiments show that the technique generates schedules within milliseconds, and that its makespans are on average 12.05% longer than the optimum schedule.
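As a rough illustration of the kind of priority metric a critical-path-aware scheduler relies on, the sketch below computes each task's "bottom level" (the longest computation-plus-communication path from the task to the exit node) for a small task graph; the graph, the cost values, and the recursive formulation are illustrative assumptions, not details taken from CPA.

```c
/* Hypothetical sketch: ranking tasks of a DAG by the length of their
 * critical path to the exit node ("bottom level"), the quantity that
 * critical-path-aware schedulers typically use to prioritize tasks.
 * Task graph and cost values are illustrative, not taken from CPA. */
#include <stdio.h>

#define NTASKS 5

static double comp[NTASKS]         = {3, 2, 4, 3, 1};   /* computation cost  */
static double comm[NTASKS][NTASKS] = {                  /* edge (comm) costs */
    /* edges: 0->1, 0->2, 1->3, 2->3, 3->4 */
    {0, 1, 2, 0, 0},
    {0, 0, 0, 2, 0},
    {0, 0, 0, 1, 0},
    {0, 0, 0, 0, 1},
    {0, 0, 0, 0, 0},
};

/* bottom_level(t) = comp[t] + max over successors s of (comm[t][s] + bottom_level(s)) */
static double bottom_level(int t)
{
    double best = 0.0;
    for (int s = 0; s < NTASKS; s++) {
        if (comm[t][s] > 0) {
            double via = comm[t][s] + bottom_level(s);
            if (via > best) best = via;
        }
    }
    return comp[t] + best;
}

int main(void)
{
    for (int t = 0; t < NTASKS; t++)
        printf("task %d: priority (critical-path length) = %.1f\n",
               t, bottom_level(t));
    return 0;
}
```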


2.
3.
The Scalable Parallel Random Number Generators library (SPRNG) supports fast and scalable random number generation with good statistical properties for parallel computational science applications. To accelerate SPRNG in high-performance reconfigurable computing systems, we present the Hardware Accelerated SPRNG library (HASPRNG). Ported to the Xilinx University Program (XUP) and Cray XD1 reconfigurable computing platforms, HASPRNG includes reconfigurable logic for Field Programmable Gate Arrays (FPGAs) along with a programming interface, and it performs integer random number generation producing results identical to SPRNG. This paper describes how the reconfigurable logic of HASPRNG exploits the mathematical properties and data parallelism residing in the SPRNG algorithms to achieve high performance, and how the programming interface minimizes the communication overhead between FPGAs and microprocessors. The programming interface allows a user to use HASPRNG in the same way as SPRNG 2.0 on platforms such as the Cray XD1. We also describe how to install and use HASPRNG. As a sample High Performance Reconfigurable Computer (HPRC) application, we discuss an FPGA π-estimator and compare it to a software π-estimator. HASPRNG shows a 1.7x speedup over SPRNG on the Cray XD1 and is able to obtain substantial speedup for an HPRC application.

Program summary

Program title: HASPRNG
Catalogue identifier: AEER_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEER_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html
No. of lines in distributed program, including test data, etc.: 594 928
No. of bytes in distributed program, including test data, etc.: 6 509 724
Distribution format: tar.gz
Programming language: VHDL (XUP and Cray XD1), C++ (XUP), C (Cray XD1)
Computer: PowerPC 405 (XUP) / AMD 2.2 GHz Opteron processor (Cray XD1)
Operating system: Linux
File size: 15 MB (XUP) / 22 MB (Cray XD1)
Classification: 4.13
Nature of problem: Many computational science applications consume large numbers of random numbers. For example, Monte Carlo simulations such as π-estimation can consume limitless random numbers for the computation as long as hardware resources are available. Moreover, parallel computational science applications require independent streams of random numbers to attain statistically significant results. The SPRNG library provides this capability, but at a significant computational cost. The library presented here accelerates the generators of independent streams of random numbers.
Solution method: Multiple copies of random number generators in FPGAs allow a computational science application to consume large numbers of random numbers from independent, parallel streams. HASPRNG is a random number generator library that allows a computational science application to employ multiple copies of random number generators to boost performance. Users can interface HASPRNG with software code executing on microprocessors and/or with hardware applications executing on FPGAs.
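For reference, a minimal software Monte Carlo π-estimator of the kind HASPRNG is compared against can be written as below; drand48 merely stands in for a SPRNG/HASPRNG random-number stream, whose actual API is not reproduced here.

```c
/* Minimal software pi-estimator sketch: draw (x, y) pairs from a uniform
 * generator and count hits inside the unit quarter circle. drand48 is a
 * stand-in for a parallel RNG stream, not the SPRNG/HASPRNG interface. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long n = 10000000;
    long hits = 0;

    srand48(42);                       /* seed the stand-in generator */
    for (long i = 0; i < n; i++) {
        double x = drand48();          /* one uniform [0,1) sample per axis */
        double y = drand48();
        if (x * x + y * y <= 1.0)
            hits++;
    }
    printf("pi estimate: %f\n", 4.0 * (double)hits / (double)n);
    return 0;
}
```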

4.
The addition of reconfigurable hardware (FPGAs) to the nodes of Beowulf-style clusters has the potential to accelerate a variety of parallel applications through a combination of parallel programming and reconfigurable computing techniques. However, making efficient use of the computational resources available places a significant burden on the application developer due to the lack of support for reconfigurable computing and task heterogeneity in standard message-passing libraries. This paper describes Accessible Reconfigurable Computing (ARC), a metacomputing environment designed to address these issues. The architecture, implementation, and operation of the system are described in detail.

5.
Deep learning and, in particular, convolutional neural networks (CNN) achieve very good results in several computer vision applications, such as security and surveillance, where image and video analysis are required. These networks are quite demanding in terms of computation and memory and are therefore usually implemented on high-performance computing platforms or devices. Running CNNs on embedded platforms or devices with low computational and memory resources requires careful optimization of system architectures and algorithms to obtain very efficient designs. In this context, Field Programmable Gate Arrays (FPGA) can achieve this efficiency, since the programmable hardware fabric can be tailored to each specific network. In this paper, a very efficient configurable architecture for CNN inference targeting FPGAs of any density is described. The architecture uses fixed-point arithmetic and image batching to reduce computation, memory and memory bandwidth requirements without compromising network accuracy. The developed architecture supports the execution of large CNNs on any FPGA device, including those with small on-chip memory and few logic resources. With the proposed architecture, it is possible to infer an image in AlexNet in 4.3 ms on a ZYNQ7020 and 1.2 ms on a ZYNQ7045.
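The following small sketch illustrates the general idea of fixed-point arithmetic for CNN inference (not the paper's specific architecture): weights and activations are quantized to 8-bit values with an assumed power-of-two scale, and a dot product is accumulated in a wide integer, which is the arithmetic style FPGA CNN engines use to cut memory and bandwidth.

```c
/* Illustrative sketch: 8-bit fixed-point quantization with a power-of-two
 * scale, and a dot product accumulated in a 32-bit integer. The scale and
 * toy data are assumptions made for the example. */
#include <stdint.h>
#include <stdio.h>

#define FRAC_BITS 6                       /* assumed fractional bits per value */

static int8_t quantize(float v)
{
    int32_t q = (int32_t)(v * (1 << FRAC_BITS) + (v >= 0 ? 0.5f : -0.5f));
    if (q >  127) q =  127;               /* saturate to the int8 range */
    if (q < -128) q = -128;
    return (int8_t)q;
}

int main(void)
{
    float w[4] = {0.50f, -0.25f, 0.75f, 0.10f};   /* toy weights     */
    float a[4] = {0.30f,  0.60f, -0.20f, 0.90f};  /* toy activations */

    int32_t acc = 0;                      /* wide accumulator, as in DSP MACs */
    for (int i = 0; i < 4; i++)
        acc += (int32_t)quantize(w[i]) * (int32_t)quantize(a[i]);

    /* the product carries 2*FRAC_BITS fractional bits */
    printf("fixed-point dot product = %f\n", acc / (float)(1 << (2 * FRAC_BITS)));
    return 0;
}
```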

6.

Artificial intelligence (AI) and machine learning (ML) tools play a significant role in the recent evolution of smart systems. AI solutions are driving a significant shift in many fields such as healthcare, autonomous airplanes and vehicles, security, marketing customer profiling and other diverse areas. One of the main challenges hindering AI's potential is the demand for high-performance computation resources. Recently, hardware accelerators have been developed to provide the computational power needed by AI and ML tools. In the literature, hardware accelerators are built using FPGAs, GPUs and ASICs to accelerate computationally intensive tasks. These accelerators provide high-performance hardware while preserving the required accuracy. In this work, we present a systematic literature review that explores the available hardware accelerators for AI and ML tools. More than 169 different research papers published between 2009 and 2019 are studied and analysed.


7.
Field-programmable gate arrays (FPGAs) are being integrated with processors on the same motherboard or even the same chip in order to achieve flexible high-performance computing, and this may become mainstream in chip multi-core architectures. However, the expensive FPGA area is often used inefficiently, with much of the logic idle at any given time. This work, motivated by the Dynamic-Link Library (DLL) concept in software, explores the possibility of "hardware DLLs" by finding ways to perform fast, dynamic, incremental reconfiguration of FPGAs. Doing so would, among other things, enable same-function replication at any given time, with functions changing quickly over time, thereby enabling efficient exploitation of data parallelism at no additional hardware cost. We present two new multi-context FPGA architectures based on two different configuration storage architectures: local and centralized. Problems such as configuration storage and reconfiguration overhead (time, power and space) are considered. Well-known area and power models are used to evaluate the various approaches and to provide guidelines for matching architectures to target applications. Lastly, we provide insights into the resulting scheduling issues. Our findings provide the foundation and "rules of the game" for the subsequent development of reconfiguration schedulers and execution environments.

8.
The HPC industry demands more computing units on FPGAs to enhance performance through task and data parallelism. FPGAs can deliver their best performance on certain kernels by customizing the hardware for the application. However, applications are becoming more complex, with multiple kernels and complex data arrangements, which generates overhead in scheduling and managing system resources. For this reason, all classes of multi-threaded machines, from minicomputers to supercomputers, require an efficient hardware scheduler and memory manager that improves the effective bandwidth and latency of the DRAM main memory. Such an architecture can be a very competitive choice for supercomputing systems that must meet the parallelism demands of HPC benchmarks. In this article, we propose a Programmable Memory System and Scheduler (PMSS), which provides high-speed access to complex data patterns for a multi-threaded architecture. The proposed PMSS system is implemented and tested on a Xilinx ML505 evaluation FPGA board, and its performance is compared with a microprocessor-based system integrated with the Xilkernel operating system. Results show that the modified PMSS-based multi-accelerator system consumes 50% less hardware resources, 32% less on-chip power, and achieves approximately a 19x speedup compared to the MicroBlaze-based system.
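As an illustration of what a "complex data access pattern" handed to a programmable memory system might look like, the sketch below describes a 2D strided (tile) pattern with a small descriptor and expands it in software; the descriptor fields and the gather routine are invented for the example and are not PMSS's interface.

```c
/* Hypothetical illustration: a 2D strided access pattern described once by
 * a descriptor and expanded into addresses by a gather routine, instead of
 * address generation being done on the processor. Field names are invented. */
#include <stdio.h>
#include <stddef.h>

struct access_pattern {
    size_t base;      /* starting element index      */
    size_t block;     /* contiguous elements per row */
    size_t stride;    /* distance between row starts */
    size_t rows;      /* number of rows to fetch     */
};

/* Software stand-in for the hardware gather: copy the pattern into dst. */
static void gather(const double *src, double *dst, struct access_pattern p)
{
    size_t out = 0;
    for (size_t r = 0; r < p.rows; r++)
        for (size_t c = 0; c < p.block; c++)
            dst[out++] = src[p.base + r * p.stride + c];
}

int main(void)
{
    double mem[64], tile[8];
    for (int i = 0; i < 64; i++) mem[i] = i;

    /* fetch a 4x2 tile out of an 8-wide row-major array */
    struct access_pattern p = { .base = 2, .block = 2, .stride = 8, .rows = 4 };
    gather(mem, tile, p);

    for (int i = 0; i < 8; i++) printf("%g ", tile[i]);
    printf("\n");
    return 0;
}
```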

9.
10.
Real-time simulation allows rapid deployment and thorough testing of prototyped hardware in the automotive and aerospace industries. However, the simulation of power electronic circuits (PECs) in the context of PC-based simulation is challenging for several reasons and limits the achievable switching frequencies to the 1–5 kHz range. As FPGA devices gain computing power, conducting the real-time simulation of PECs on chip becomes an attractive alternative. This paper demonstrates the feasibility of high-performance floating-point calculation engines aimed at the real-time simulation of PECs on both high-end and low-cost FPGAs. The paper discusses emerging paradigms for reconfigurable floating-point computing that favor optimal performance and offer near double-precision arithmetic at minimal hardware cost. The effectiveness of the approach is demonstrated by considering three different circuit topologies and simulating their high-frequency (20 kHz) operation using the same automated calculation engine. The considered circuits are a boost converter, a two-level three-phase bridge, and a two-level three-phase bridge driven by a boost converter.
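To make the computational load concrete, the sketch below performs the kind of fixed-step state update a real-time simulation engine repeats every time step, here a forward-Euler integration of an averaged (non-switching) boost-converter model; the component values and step size are illustrative assumptions, and the paper's engines implement such updates in floating-point FPGA pipelines.

```c
/* Fixed-step sketch of one real-time PEC simulation loop: forward-Euler
 * update of an averaged boost-converter model. Component values, duty
 * cycle and time step are illustrative, not taken from the paper. */
#include <stdio.h>

int main(void)
{
    const double Vin = 12.0, L = 100e-6, C = 470e-6, R = 10.0;
    const double d = 0.5;          /* duty cycle                    */
    const double h = 1e-6;         /* 1 us time step (1 MHz update) */

    double iL = 0.0, vC = 0.0;     /* state: inductor current, capacitor voltage */

    for (long k = 0; k < 100000; k++) {          /* simulate 100 ms */
        double diL = (Vin - (1.0 - d) * vC) / L;
        double dvC = ((1.0 - d) * iL - vC / R) / C;
        iL += h * diL;
        vC += h * dvC;
    }
    /* ideal steady state: vC ~= Vin / (1 - d) = 24 V, iL ~= 4.8 A */
    printf("iL = %.3f A, vC = %.3f V\n", iL, vC);
    return 0;
}
```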

11.
Mangione-Smith, W.H. Computer, 1997, 30(10): 115-117
Configurable computing systems enhance traditional computing systems through the addition of programmable hardware. Configurable computing offers the opportunity to change the partition at run-time by re-programming the hardware. Recent research has shifted to CAD and application development tools. Almost all existing configurable computing systems are based on field-programmable gate arrays (FPGAs). These devices implement reasonably arbitrary digital circuits, and the flexibility allows us to think of configurable computing systems based on FPGAs as netlist computers. The configurable computing approach integrates FPGAs as an intimate and fundamental component of the computing system, rather than relegating them to their earlier role of supporting system prototyping and low-volume production. However, the author believes that automated approaches to the design of configurable computing systems are premature because they do not pay enough attention to performance.

12.
Swarm intelligence systems accomplish swarm-level application tasks through information exchange among neighboring individuals and exhibit good robustness and flexibility. At the same time, most developers find it difficult to describe the distributed, parallel interaction mechanisms among individuals. Some high-level languages allow users to program parallel swarm intelligence computing tasks in a sequential style from a global system perspective, without considering low-level interaction details such as communication protocols and data distribution. However, the large semantic gap between user-facing, globally declarative swarm intelligence applications and the parallel execution logic of individuals makes compilation complex and thus lowers application development efficiency. This paper proposes a compilation system and its supporting tools that transform high-level swarm intelligence applications into safe and efficient distributed implementations. Through parallel information identification, computation partitioning, and interaction information generation, the compiler translates globally oriented, sequentially programmed swarm intelligence applications into parallel object code executed independently by each individual, so that users do not need to understand the complex interaction mechanisms among individuals. A standardized intermediate representation is designed that converts complex swarm intelligence computing tasks into a sequence of standardized semantic modules composed of swarm intelligence operators and input/output variables; it represents source program information in a platform-independent form and hides the heterogeneity of target hardware platforms. The compilation system was deployed and tested on a case-study swarm intelligence platform. The results show that it can effectively compile swarm intelligence applications into platform-executable object code and improve application development efficiency, and the generated code outperforms existing compilers on a series of benchmarks.

13.
Recent research has shown that field programmable gate arrays (FPGAs) have a large potential for accelerating demanding applications, such as high-performance digital signal processing applications with low-volume markets. One disadvantage of using FPGAs is the loss of generality in the architecture; however, the reconfigurability of FPGAs allows reprogramming for other applications. Therefore, a uniform FPGA-based architecture, an efficient programming model, and a simple mapping method are paramount for the wide acceptance of FPGA technology. This paper presents MASALA, a dynamically reconfigurable FPGA-based accelerator for parallel programs written in the thread-intensive and explicit memory management (TEMM) programming model. Our system uses the TEMM programming model to parallelize demanding applications, including decomposing an application into separate thread blocks and decoupling compute from data loads/stores. Hardware engines are included in MASALA as partial dynamic reconfiguration modules, each of which encapsulates a thread process engine that implements a thread's functionality in hardware. A data dispatching scheme is also included in MASALA to enable explicit communication across multiple memory hierarchies, such as between hardware engines, and between host processors and hardware engines. Finally, this paper describes a multi-FPGA prototype of the presented architecture, MASALA-SX. A large synthetic aperture radar image formation experiment shows that MASALA's architecture facilitates the construction of a TEMM program accelerator, providing greater performance and lower power consumption than current CPU platforms without sacrificing programmability, flexibility, and scalability.

14.
Genetic Algorithm for Boolean minimization in an FPGA cluster
Evolutionary algorithms are an alternative approach to Boolean synthesis because they can create hardware structures that could not be obtained with other techniques. This paper presents a parallel genetic programming (PGP) Boolean synthesis implementation based on a cluster of FPGAs that takes full advantage of parallel programming and hardware/software co-design techniques. The performance of our FPGA-cluster implementation has been compared with an HPC implementation. The experimental results show excellent behavior in terms of speedup (up to 500x) and in solving the scalability problems of these algorithms reported in previous works.
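A toy version of the fitness evaluation at the core of genetic-programming Boolean synthesis is sketched below: a candidate circuit is scored by the number of truth-table rows it reproduces for a target function. The target function, the hard-coded candidate, and the encoding are invented purely for illustration.

```c
/* Toy sketch of GA/GP Boolean-synthesis fitness: count how many rows of
 * the target truth table a candidate circuit reproduces. */
#include <stdio.h>

/* target function: 3-input majority */
static int target(int a, int b, int c) { return (a & b) | (a & c) | (b & c); }

/* one hard-coded candidate "individual": (a AND b) OR c */
static int candidate(int a, int b, int c) { return (a & b) | c; }

/* fitness = number of matching truth-table rows (8 = perfect) */
static int fitness(void)
{
    int score = 0;
    for (int row = 0; row < 8; row++) {
        int a = (row >> 2) & 1, b = (row >> 1) & 1, c = row & 1;
        if (candidate(a, b, c) == target(a, b, c))
            score++;
    }
    return score;
}

int main(void)
{
    printf("fitness = %d / 8\n", fitness());
    return 0;
}
```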

15.
Partial evaluation is a symbolic manipulation technique used to produce efficient algorithms when part of the input to the algorithm is known. Other applications of partial evaluators, such as universal compilation and compiler generation, are also known to be possible. A partial evaluator receives as input a program and partially known input to that program, and outputs a residual program that should run at least as efficiently as the input program on the restricted input. In this paper we study the case where both the input and residual programs are logic programs, the partial evaluator itself being a logic program. Up to now, partial evaluators have failed to process large "non-toy" examples. Here we present extensions to partial evaluators which allow us to produce more efficient residual programs using fewer computing resources during partial evaluation. First, the introduced extensions allow the processing of large examples, which is not possible with previous techniques; this becomes possible because the extensions use less CPU time and memory during the partial evaluation process. Second, the extended partial evaluator produces smaller residual programs, yielding significant CPU-time optimizations. With the standard techniques, a partial evaluator will most probably act as a pessimizer, not as an optimizer. Examples are given.
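The classic illustration of partial evaluation, shown here in C rather than as a logic program purely for brevity, is specializing a generic power routine for a statically known exponent: the residual program contains no loop and no reference to the static input.

```c
/* Classic partial-evaluation illustration (in C for brevity): when the
 * exponent n is known at specialization time, the generic power() routine
 * can be unfolded into a residual function with no loop and no n. */
#include <stdio.h>

/* generic program: both inputs unknown */
static double power(double x, int n)
{
    double r = 1.0;
    while (n-- > 0)
        r *= x;
    return r;
}

/* residual program produced for the static input n = 5 */
static double power_5(double x)
{
    return x * x * x * x * x;
}

int main(void)
{
    printf("power(2, 5) = %g\n", power(2.0, 5));
    printf("power_5(2)  = %g\n", power_5(2.0));
    return 0;
}
```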

16.
Security in embedded systems has recently attracted attention because modern electronic devices must cautiously exchange and communicate sensitive data. Although security is a classical research topic in worldwide communication, researchers still face the problem of how to deal with resource-constrained devices and enhance the features of assurance and certification. Therefore, some cryptographic computations are implemented on hardware platforms such as field-programmable gate arrays (FPGAs). The cryptographic algorithms commonly used for digital signatures (DSA) are Rivest-Shamir-Adleman (RSA) and elliptic curve cryptosystems (ECC), which are based on the presumed difficulty of factoring large integers and on the algebraic structure of elliptic curves over finite fields, respectively. Usually, RSA is computed over GF(p), and ECC is computed over GF(p) or GF(2^p). Moreover, embedded applications need Advanced Encryption Standard (AES) algorithms for encryption and decryption. In order to reuse hardware resources and meet the trade-off between area and performance, we propose a new triple functional arithmetic unit for computing high-radix RSA and ECC operations over GF(p) and GF(2^p), which can also be extended to support AES operations. A new high-radix signed-digit (SD) adder is proposed to eliminate carry propagation over GF(p). The proposed unified design takes up 28.7% less hardware resources than implementing RSA, ECC, and AES individually, and the experimental results show that our proposed architecture can achieve 141.8 MHz using approximately 5.5k CLBs on a Virtex-5 FPGA.
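For context, the core RSA operation such an arithmetic unit accelerates is modular exponentiation; a minimal square-and-multiply version with toy 64-bit operands is sketched below, whereas the proposed hardware targets full-size, high-radix operands.

```c
/* Minimal square-and-multiply modular exponentiation, the core operation
 * behind RSA over GF(p), shown with toy 64-bit operands. */
#include <stdint.h>
#include <stdio.h>

/* computes (base^exp) mod m; __uint128_t keeps the 64x64 products exact */
static uint64_t modexp(uint64_t base, uint64_t exp, uint64_t m)
{
    uint64_t result = 1 % m;
    base %= m;
    while (exp > 0) {
        if (exp & 1)                                     /* multiply step */
            result = (uint64_t)((__uint128_t)result * base % m);
        base = (uint64_t)((__uint128_t)base * base % m); /* square step   */
        exp >>= 1;
    }
    return result;
}

int main(void)
{
    /* textbook RSA check: n = 61*53 = 3233, e = 17, d = 413, message 65 */
    uint64_t n = 3233, e = 17, d = 413, msg = 65;
    uint64_t c = modexp(msg, e, n);
    printf("cipher = %llu, decrypted = %llu\n",
           (unsigned long long)c, (unsigned long long)modexp(c, d, n));
    return 0;
}
```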

17.
Reconfigurable computing systems lack a unified mechanism for managing hardware and software resources, so resources cannot be used effectively. To address this, a hardware task model is designed and implemented that provides a unified hardware interface to upper-layer software, enabling the operating system to manage hardware and software tasks uniformly; the implementation structure and workflow of a hardware task loader are also presented. Experimental results show that the hardware task model runs efficiently and that the hardware task loader considerably increases the download rate of hardware tasks.

18.
The Euclidean Distance Transform (EDT) is an important tool in image analysis and machine vision. This paper provides an area-efficient hardware solution for computing the EDT of a binary image. An O(n) hardware algorithm for computing the EDT of an n×n image is presented, and a pipelined 2D array architecture for its hardware implementation is designed. The architecture has a regular structure with locally connected, identical processing elements, and pipelining further reduces hardware resources. Such an array architecture is easily scalable to images of different sizes and is suitable for implementation on reconfigurable devices like FPGAs. Results of an FPGA-based implementation show that the hardware can process about 6000 images of size 512×512 per second, which is much higher than the video rate of 30 frames per second.
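A small software reference for the exact squared EDT, computed separably (a per-column nearest-foreground scan followed by a per-row minimisation), is sketched below; it is only a correctness illustration, and the paper's O(n) pipelined array reorganises this work across locally connected processing elements.

```c
/* Software reference for the squared Euclidean Distance Transform of a
 * binary image, computed separably: phase 1 finds the vertical distance
 * to the nearest foreground pixel per column, phase 2 minimises over
 * columns per row. Image contents are illustrative. */
#include <stdio.h>
#include <limits.h>

#define N 6
#define INF (N * N + 1)    /* larger than any in-image vertical distance */

int main(void)
{
    /* 1 = foreground (distance 0), 0 = background */
    int img[N][N] = {
        {0,0,0,0,0,0},
        {0,1,0,0,0,0},
        {0,0,0,0,0,0},
        {0,0,0,0,1,0},
        {0,0,0,0,0,0},
        {0,0,0,0,0,0},
    };
    int g[N][N], dt[N][N];

    /* phase 1: vertical distance to nearest foreground, two scans per column */
    for (int x = 0; x < N; x++) {
        g[0][x] = img[0][x] ? 0 : INF;
        for (int y = 1; y < N; y++)
            g[y][x] = img[y][x] ? 0 : (g[y-1][x] >= INF ? INF : g[y-1][x] + 1);
        for (int y = N - 2; y >= 0; y--)
            if (g[y+1][x] + 1 < g[y][x]) g[y][x] = g[y+1][x] + 1;
    }

    /* phase 2: squared EDT = min over columns j of g[y][j]^2 + (j-x)^2 */
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++) {
            int best = INT_MAX;
            for (int j = 0; j < N; j++) {
                int d = g[y][j] * g[y][j] + (j - x) * (j - x);
                if (d < best) best = d;
            }
            dt[y][x] = best;
        }

    for (int y = 0; y < N; y++) {
        for (int x = 0; x < N; x++) printf("%4d", dt[y][x]);
        printf("\n");
    }
    return 0;
}
```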

19.
Dynamic Partial Reconfiguration and Its FPGA Implementation
李涛, 刘培峰, 杨愚鲁. 计算机工程, 2006, 32(14): 224-226
Dynamic partial reconfiguration makes full use of the reconfiguration capability provided by FPGA chips, improving FPGA utilization, reducing configuration time, and effectively improving overall system performance. This paper introduces two implementation methods for dynamic partial reconfiguration and verifies them on a Spartan-II FPGA.

20.
Networks of workstations and high-performance microcomputers have rarely been used for running parallel applications because, although they have significant aggregate computing power, they lack support for efficient message-passing and shared-memory communication. In this paper we present Telegraphos, a distributed system that provides efficient message-passing and shared-memory support on top of a workstation cluster. We focus on the network interface of Telegraphos, which provides a variety of shared-memory operations such as remote read, remote write, remote atomic operations, and DMA, all launched from user level without any intervention by the operating system. Telegraphos I, the Telegraphos prototype, has been implemented. Emphasis was placed on rapid prototyping, so the technology used was conservative: FPGAs, SRAMs, and TTL buffers.
