Similar Documents
20 similar documents found.
1.

SRAM-based FPGAs feature high performance and flexibility, and have therefore found many applications in modern high-performance computing (HPC) systems. These systems, however, suffer from limited computing resources when running HPC applications, so multi-FPGA systems have emerged to alleviate this resource limitation. In this regard, efficient scheduling strategies are required to dynamically steer the execution of applications (represented as task graphs) on a set of connected FPGAs. In this paper, a heuristic-based dynamic critical-path-aware scheduling technique named CPA is presented for scheduling task graphs on multi-FPGA systems. By considering the computation and communication capabilities of the FPGAs, the proposed technique dynamically assigns priorities to tasks at different steps in order to achieve better makespans. The technique has been evaluated through several experiments on real-world task graphs and on three different shapes of random task graphs with varying numbers of tasks, and its efficiency has been compared with that of three other task graph scheduling approaches. The results show that the proposed CPA technique outperforms well-known heuristic scheduling strategies, improving their makespans by 13.47% on average. In addition, the experiments show that the technique generates schedules within milliseconds, and that its makespans are on average 12.05% longer than the optimum schedule.
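As a rough illustration of the kind of priority metric a critical-path-aware scheduler relies on, the sketch below computes each task's "bottom level" (the longest computation-plus-communication path from the task to the exit node) for a small task graph; the graph, the cost values, and the recursive formulation are illustrative assumptions, not details taken from CPA.

```c
/* Hypothetical sketch: ranking tasks of a DAG by the length of their
 * critical path to the exit node ("bottom level"), the quantity that
 * critical-path-aware schedulers typically use to prioritize tasks.
 * Task graph and cost values are illustrative, not taken from CPA. */
#include <stdio.h>

#define NTASKS 5

static double comp[NTASKS]         = {3, 2, 4, 3, 1};   /* computation cost  */
static double comm[NTASKS][NTASKS] = {                  /* edge (comm) costs */
    /* edges: 0->1, 0->2, 1->3, 2->3, 3->4 */
    {0, 1, 2, 0, 0},
    {0, 0, 0, 2, 0},
    {0, 0, 0, 1, 0},
    {0, 0, 0, 0, 1},
    {0, 0, 0, 0, 0},
};

/* bottom_level(t) = comp[t] + max over successors s of (comm[t][s] + bottom_level(s)) */
static double bottom_level(int t)
{
    double best = 0.0;
    for (int s = 0; s < NTASKS; s++) {
        if (comm[t][s] > 0) {
            double via = comm[t][s] + bottom_level(s);
            if (via > best) best = via;
        }
    }
    return comp[t] + best;
}

int main(void)
{
    for (int t = 0; t < NTASKS; t++)
        printf("task %d: priority (critical-path length) = %.1f\n",
               t, bottom_level(t));
    return 0;
}
```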


2.
3.
The Scalable Parallel Random Number Generators library (SPRNG) supports fast and scalable random number generation with good statistical properties for parallel computational science applications. To accelerate SPRNG in high-performance reconfigurable computing systems, we present the Hardware Accelerated SPRNG library (HASPRNG). Ported to the Xilinx University Program (XUP) and Cray XD1 reconfigurable computing platforms, HASPRNG includes reconfigurable logic for Field Programmable Gate Arrays (FPGAs) along with a programming interface, and it performs integer random number generation producing results identical to SPRNG. This paper describes how the reconfigurable logic of HASPRNG exploits the mathematical properties and data parallelism residing in the SPRNG algorithms to achieve high performance, and how the programming interface minimizes the communication overhead between FPGAs and microprocessors. The programming interface allows a user to use HASPRNG in the same way as SPRNG 2.0 on platforms such as the Cray XD1. We also describe how to install and use HASPRNG. As a sample High Performance Reconfigurable Computer (HPRC) application, we discuss an FPGA π-estimator and compare it to a software π-estimator. HASPRNG shows a 1.7x speedup over SPRNG on the Cray XD1 and is able to obtain substantial speedup for an HPRC application.

Program summary

Program title: HASPRNG
Catalogue identifier: AEER_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEER_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html
No. of lines in distributed program, including test data, etc.: 594 928
No. of bytes in distributed program, including test data, etc.: 6 509 724
Distribution format: tar.gz
Programming language: VHDL (XUP and Cray XD1), C++ (XUP), C (Cray XD1)
Computer: PowerPC 405 (XUP) / AMD 2.2 GHz Opteron processor (Cray XD1)
Operating system: Linux
File size: 15 MB (XUP) / 22 MB (Cray XD1)
Classification: 4.13
Nature of problem: Many computational science applications consume large numbers of random numbers. For example, Monte Carlo simulations such as π-estimation can consume limitless random numbers for the computation as long as hardware resources are available. Moreover, parallel computational science applications require independent streams of random numbers to attain statistically significant results. The SPRNG library provides this capability, but at a significant computational cost. The library presented here accelerates the generators of independent streams of random numbers.
Solution method: Multiple copies of random number generators in FPGAs allow a computational science application to consume large numbers of random numbers from independent, parallel streams. HASPRNG is a random number generator library that allows a computational science application to employ multiple copies of random number generators to boost performance. Users can interface HASPRNG with software code executing on microprocessors and/or with hardware applications executing on FPGAs.
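For reference, a minimal software Monte Carlo π-estimator of the kind HASPRNG is compared against can be written as below; drand48 merely stands in for a SPRNG/HASPRNG random-number stream, whose actual API is not reproduced here.

```c
/* Minimal software pi-estimator sketch: draw (x, y) pairs from a uniform
 * generator and count hits inside the unit quarter circle. drand48 is a
 * stand-in for a parallel RNG stream, not the SPRNG/HASPRNG interface. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long n = 10000000;
    long hits = 0;

    srand48(42);                       /* seed the stand-in generator */
    for (long i = 0; i < n; i++) {
        double x = drand48();          /* one uniform [0,1) sample per axis */
        double y = drand48();
        if (x * x + y * y <= 1.0)
            hits++;
    }
    printf("pi estimate: %f\n", 4.0 * (double)hits / (double)n);
    return 0;
}
```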

4.
The addition of reconfigurable hardware (FPGAs) to the nodes of Beowulf-style clusters has the potential to accelerate a variety of parallel applications through a combination of parallel programming and reconfigurable computing techniques. However, making efficient use of the computational resources available places a significant burden on the application developer due to the lack of support for reconfigurable computing and task heterogeneity in standard message-passing libraries. This paper describes Accessible Reconfigurable Computing (ARC), a metacomputing environment designed to address these issues. The architecture, implementation, and operation of the system are described in detail.

5.
Deep learning and, in particular, convolutional neural networks (CNN) achieve very good results in several computer vision applications, such as security and surveillance, where image and video analysis are required. These networks are quite demanding in terms of computation and memory and are therefore usually implemented on high-performance computing platforms or devices. Running CNNs on embedded platforms or devices with low computational and memory resources requires careful optimization of system architectures and algorithms to obtain very efficient designs. In this context, Field Programmable Gate Arrays (FPGA) can achieve this efficiency, since the programmable hardware fabric can be tailored to each specific network. In this paper, a very efficient configurable architecture for CNN inference targeting FPGAs of any density is described. The architecture uses fixed-point arithmetic and image batching to reduce computation, memory and memory bandwidth requirements without compromising network accuracy. The developed architecture supports the execution of large CNNs on any FPGA device, including those with small on-chip memory and few logic resources. With the proposed architecture, it is possible to infer an image in AlexNet in 4.3 ms on a ZYNQ7020 and 1.2 ms on a ZYNQ7045.
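The following small sketch illustrates the general idea of fixed-point arithmetic for CNN inference (not the paper's specific architecture): weights and activations are quantized to 8-bit values with an assumed power-of-two scale, and a dot product is accumulated in a wide integer, which is the arithmetic style FPGA CNN engines use to cut memory and bandwidth.

```c
/* Illustrative sketch: 8-bit fixed-point quantization with a power-of-two
 * scale, and a dot product accumulated in a 32-bit integer. The scale and
 * toy data are assumptions made for the example. */
#include <stdint.h>
#include <stdio.h>

#define FRAC_BITS 6                       /* assumed fractional bits per value */

static int8_t quantize(float v)
{
    int32_t q = (int32_t)(v * (1 << FRAC_BITS) + (v >= 0 ? 0.5f : -0.5f));
    if (q >  127) q =  127;               /* saturate to the int8 range */
    if (q < -128) q = -128;
    return (int8_t)q;
}

int main(void)
{
    float w[4] = {0.50f, -0.25f, 0.75f, 0.10f};   /* toy weights     */
    float a[4] = {0.30f,  0.60f, -0.20f, 0.90f};  /* toy activations */

    int32_t acc = 0;                      /* wide accumulator, as in DSP MACs */
    for (int i = 0; i < 4; i++)
        acc += (int32_t)quantize(w[i]) * (int32_t)quantize(a[i]);

    /* the product carries 2*FRAC_BITS fractional bits */
    printf("fixed-point dot product = %f\n", acc / (float)(1 << (2 * FRAC_BITS)));
    return 0;
}
```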

6.

Artificial intelligence (AI) and machine learning (ML) tools play a significant role in the recent evolution of smart systems. AI solutions are driving a significant shift in many fields such as healthcare, autonomous airplanes and vehicles, security, marketing customer profiling and other diverse areas. One of the main challenges hindering AI's potential is the demand for high-performance computation resources. Recently, hardware accelerators have been developed to provide the computational power needed by AI and ML tools. In the literature, hardware accelerators are built using FPGAs, GPUs and ASICs to accelerate computationally intensive tasks. These accelerators provide high-performance hardware while preserving the required accuracy. In this work, we present a systematic literature review that explores the available hardware accelerators for AI and ML tools. More than 169 different research papers published between 2009 and 2019 are studied and analysed.


7.
Field-programmable gate arrays (FPGAs) are being integrated with processors on the same motherboard or even the same chip in order to achieve flexible high-performance computing, and this may become mainstream in chip multi-core architectures. However, the expensive FPGA area is often used inefficiently, with much of the logic idle at any given time. This work, motivated by the Dynamic-Link Library (DLL) concept in software, explores the possibility of "hardware DLLs" by finding ways to perform fast, dynamic, incremental reconfiguration of FPGAs. Doing so would, among other things, enable same-function replication at any given time, with functions changing quickly over time, thereby enabling efficient exploitation of data parallelism at no additional hardware cost. We present two new multi-context FPGA architectures based on two different configuration storage architectures: local and centralized. Problems such as configuration storage and reconfiguration overhead (time, power and space) are considered. Well-known area and power models are used to evaluate the various approaches and to provide guidelines for matching architectures to target applications. Lastly, we provide insights into the resulting scheduling issues. Our findings provide the foundation and "rules of the game" for the subsequent development of reconfiguration schedulers and execution environments.

8.
The HPC industry demands more computing units on FPGAs to enhance performance through task and data parallelism. FPGAs can deliver their best performance on certain kernels by customizing the hardware for the application. However, applications are becoming more complex, with multiple kernels and complex data arrangements, which generates overhead in scheduling and managing system resources. For this reason, all classes of multi-threaded machines, from minicomputers to supercomputers, require an efficient hardware scheduler and memory manager that improves the effective bandwidth and latency of the DRAM main memory. Such an architecture can be a very competitive choice for supercomputing systems that must meet the parallelism demands of HPC benchmarks. In this article, we propose a Programmable Memory System and Scheduler (PMSS), which provides high-speed access to complex data patterns for a multi-threaded architecture. The proposed PMSS system is implemented and tested on a Xilinx ML505 evaluation FPGA board, and its performance is compared with a microprocessor-based system integrated with the Xilkernel operating system. Results show that the modified PMSS-based multi-accelerator system consumes 50% less hardware resources, 32% less on-chip power, and achieves approximately a 19x speedup compared to the MicroBlaze-based system.
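As an illustration of what a "complex data access pattern" handed to a programmable memory system might look like, the sketch below describes a 2D strided (tile) pattern with a small descriptor and expands it in software; the descriptor fields and the gather routine are invented for the example and are not PMSS's interface.

```c
/* Hypothetical illustration: a 2D strided access pattern described once by
 * a descriptor and expanded into addresses by a gather routine, instead of
 * address generation being done on the processor. Field names are invented. */
#include <stdio.h>
#include <stddef.h>

struct access_pattern {
    size_t base;      /* starting element index      */
    size_t block;     /* contiguous elements per row */
    size_t stride;    /* distance between row starts */
    size_t rows;      /* number of rows to fetch     */
};

/* Software stand-in for the hardware gather: copy the pattern into dst. */
static void gather(const double *src, double *dst, struct access_pattern p)
{
    size_t out = 0;
    for (size_t r = 0; r < p.rows; r++)
        for (size_t c = 0; c < p.block; c++)
            dst[out++] = src[p.base + r * p.stride + c];
}

int main(void)
{
    double mem[64], tile[8];
    for (int i = 0; i < 64; i++) mem[i] = i;

    /* fetch a 4x2 tile out of an 8-wide row-major array */
    struct access_pattern p = { .base = 2, .block = 2, .stride = 8, .rows = 4 };
    gather(mem, tile, p);

    for (int i = 0; i < 8; i++) printf("%g ", tile[i]);
    printf("\n");
    return 0;
}
```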

9.
10.
Real-time simulation allows rapid deployment and thorough testing of prototyped hardware in the automotive and aerospace industries. However, the simulation of power electronic circuits (PECs) in the context of PC-based simulation is challenging for several reasons and limits the achievable switching frequencies to the 1–5 kHz range. As FPGA devices gain computing power, conducting the real-time simulation of PECs on chip becomes an attractive alternative. This paper demonstrates the feasibility of high-performance floating-point calculation engines aimed at the real-time simulation of PECs on both high-end and low-cost FPGAs. The paper discusses emerging paradigms for reconfigurable floating-point computing that favor optimal performance and offer near double-precision arithmetic at minimal hardware cost. The effectiveness of the approach is demonstrated by considering three different circuit topologies and simulating their high-frequency (20 kHz) operation using the same automated calculation engine. The considered circuits are a boost converter, a two-level three-phase bridge, and a two-level three-phase bridge driven by a boost converter.
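To make the computational load concrete, the sketch below performs the kind of fixed-step state update a real-time simulation engine repeats every time step, here a forward-Euler integration of an averaged (non-switching) boost-converter model; the component values and step size are illustrative assumptions, and the paper's engines implement such updates in floating-point FPGA pipelines.

```c
/* Fixed-step sketch of one real-time PEC simulation loop: forward-Euler
 * update of an averaged boost-converter model. Component values, duty
 * cycle and time step are illustrative, not taken from the paper. */
#include <stdio.h>

int main(void)
{
    const double Vin = 12.0, L = 100e-6, C = 470e-6, R = 10.0;
    const double d = 0.5;          /* duty cycle                    */
    const double h = 1e-6;         /* 1 us time step (1 MHz update) */

    double iL = 0.0, vC = 0.0;     /* state: inductor current, capacitor voltage */

    for (long k = 0; k < 100000; k++) {          /* simulate 100 ms */
        double diL = (Vin - (1.0 - d) * vC) / L;
        double dvC = ((1.0 - d) * iL - vC / R) / C;
        iL += h * diL;
        vC += h * dvC;
    }
    /* ideal steady state: vC ~= Vin / (1 - d) = 24 V, iL ~= 4.8 A */
    printf("iL = %.3f A, vC = %.3f V\n", iL, vC);
    return 0;
}
```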

11.
Mangione-Smith, W.H. Computer, 1997, 30(10): 115-117
Configurable computing systems enhance traditional computing systems through the addition of programmable hardware. Configurable computing offers the opportunity to change the partition at run-time by re-programming the hardware. Recent research has shifted to CAD and application development tools. Almost all existing configurable computing systems are based on field-programmable gate arrays (FPGAs). These devices implement reasonably arbitrary digital circuits, and the flexibility allows us to think of configurable computing systems based on FPGAs as netlist computers. The configurable computing approach integrates FPGAs as an intimate and fundamental component of the computing system, rather than relegating them to their earlier role of supporting system prototyping and low-volume production. However, the author believes that automated approaches to the design of configurable computing systems are premature because they do not pay enough attention to performance.

12.
Swarm intelligence systems accomplish swarm-level application tasks through information exchange among neighboring individuals and exhibit good robustness and flexibility. At the same time, most developers find it difficult to describe the distributed, parallel interaction mechanisms among individuals. Some high-level languages allow users to program parallel swarm intelligence computing tasks in a sequential style from a global system perspective, without considering low-level interaction details such as communication protocols and data distribution. However, the large semantic gap between user-facing, globally declarative swarm intelligence applications and the parallel execution logic of individuals makes compilation complex and thus lowers application development efficiency. This paper proposes a compilation system and its supporting tools that transform high-level swarm intelligence applications into safe and efficient distributed implementations. Through parallel information identification, computation partitioning, and interaction information generation, the compiler translates globally oriented, sequentially programmed swarm intelligence applications into parallel object code executed independently by each individual, so that users do not need to understand the complex interaction mechanisms among individuals. A standardized intermediate representation is designed that converts complex swarm intelligence computing tasks into a sequence of standardized semantic modules composed of swarm intelligence operators and input/output variables; it represents source program information in a platform-independent form and hides the heterogeneity of target hardware platforms. The compilation system was deployed and tested on a case-study swarm intelligence platform. The results show that it can effectively compile swarm intelligence applications into platform-executable object code and improve application development efficiency, and the generated code outperforms existing compilers on a series of benchmarks.

13.
Recent research has shown that field programmable gate arrays (FPGAs) have a large potential for accelerating demanding applications, such as high-performance digital signal processing applications with low-volume markets. One disadvantage of using FPGAs is the loss of generality in the architecture; however, the reconfigurability of FPGAs allows reprogramming for other applications. Therefore, a uniform FPGA-based architecture, an efficient programming model, and a simple mapping method are paramount for the wide acceptance of FPGA technology. This paper presents MASALA, a dynamically reconfigurable FPGA-based accelerator for parallel programs written in the thread-intensive and explicit memory management (TEMM) programming model. Our system uses the TEMM programming model to parallelize demanding applications, including decomposing an application into separate thread blocks and decoupling compute from data loads/stores. Hardware engines are included in MASALA as partial dynamic reconfiguration modules, each of which encapsulates a thread process engine that implements a thread's functionality in hardware. A data dispatching scheme is also included in MASALA to enable explicit communication across multiple memory hierarchies, such as between hardware engines, and between host processors and hardware engines. Finally, this paper describes a multi-FPGA prototype of the presented architecture, MASALA-SX. A large synthetic aperture radar image formation experiment shows that MASALA's architecture facilitates the construction of a TEMM program accelerator, providing greater performance and lower power consumption than current CPU platforms without sacrificing programmability, flexibility, and scalability.

14.
Genetic Algorithm for Boolean minimization in an FPGA cluster
Evolutionary algorithms are an alternative approach to Boolean synthesis because they can create hardware structures that could not be obtained with other techniques. This paper presents a parallel genetic programming (PGP) Boolean synthesis implementation based on a cluster of FPGAs that takes full advantage of parallel programming and hardware/software co-design techniques. The performance of our FPGA-cluster implementation has been compared with an HPC implementation. The experimental results show excellent behavior in terms of speedup (up to 500x) and in solving the scalability problems of these algorithms reported in previous works.
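A toy version of the fitness evaluation at the core of genetic-programming Boolean synthesis is sketched below: a candidate circuit is scored by the number of truth-table rows it reproduces for a target function. The target function, the hard-coded candidate, and the encoding are invented purely for illustration.

```c
/* Toy sketch of GA/GP Boolean-synthesis fitness: count how many rows of
 * the target truth table a candidate circuit reproduces. */
#include <stdio.h>

/* target function: 3-input majority */
static int target(int a, int b, int c) { return (a & b) | (a & c) | (b & c); }

/* one hard-coded candidate "individual": (a AND b) OR c */
static int candidate(int a, int b, int c) { return (a & b) | c; }

/* fitness = number of matching truth-table rows (8 = perfect) */
static int fitness(void)
{
    int score = 0;
    for (int row = 0; row < 8; row++) {
        int a = (row >> 2) & 1, b = (row >> 1) & 1, c = row & 1;
        if (candidate(a, b, c) == target(a, b, c))
            score++;
    }
    return score;
}

int main(void)
{
    printf("fitness = %d / 8\n", fitness());
    return 0;
}
```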

15.
Partial evaluation is a symbolic manipulation technique used to produce efficient algorithms when part of the input to the algorithm is known. Other applications of partial evaluators, such as universal compilation and compiler generation, are also known to be possible. A partial evaluator receives as input a program and partially known input to that program, and outputs a residual program that should run at least as efficiently as the input program on the restricted input. In this paper we study the case where both the input and residual programs are logic programs, the partial evaluator itself being a logic program. Up to now, partial evaluators have failed to process large "non-toy" examples. Here we present extensions to partial evaluators which allow us to produce more efficient residual programs using fewer computing resources during partial evaluation. First, the introduced extensions allow the processing of large examples, which is not possible with previous techniques; this becomes possible because the extensions use less CPU time and memory during the partial evaluation process. Second, the extended partial evaluator produces smaller residual programs, yielding significant CPU-time optimizations. With the standard techniques, a partial evaluator will most probably act as a pessimizer, not as an optimizer. Examples are given.
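The classic illustration of partial evaluation, shown here in C rather than as a logic program purely for brevity, is specializing a generic power routine for a statically known exponent: the residual program contains no loop and no reference to the static input.

```c
/* Classic partial-evaluation illustration (in C for brevity): when the
 * exponent n is known at specialization time, the generic power() routine
 * can be unfolded into a residual function with no loop and no n. */
#include <stdio.h>

/* generic program: both inputs unknown */
static double power(double x, int n)
{
    double r = 1.0;
    while (n-- > 0)
        r *= x;
    return r;
}

/* residual program produced for the static input n = 5 */
static double power_5(double x)
{
    return x * x * x * x * x;
}

int main(void)
{
    printf("power(2, 5) = %g\n", power(2.0, 5));
    printf("power_5(2)  = %g\n", power_5(2.0));
    return 0;
}
```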

16.
Security in embedded systems has recently attracted attention because modern electronic devices must cautiously exchange and communicate sensitive data. Although security is a classical research topic in worldwide communication, researchers still face the problem of how to deal with resource-constrained devices and enhance the features of assurance and certification. Therefore, some cryptographic computations are implemented on hardware platforms such as field-programmable gate arrays (FPGAs). The cryptographic algorithms commonly used for digital signatures (DSA) are Rivest-Shamir-Adleman (RSA) and elliptic curve cryptosystems (ECC), which are based on the presumed difficulty of factoring large integers and on the algebraic structure of elliptic curves over finite fields, respectively. Usually, RSA is computed over GF(p), and ECC is computed over GF(p) or GF(2^p). Moreover, embedded applications need Advanced Encryption Standard (AES) algorithms for encryption and decryption. In order to reuse hardware resources and meet the trade-off between area and performance, we propose a new triple functional arithmetic unit for computing high-radix RSA and ECC operations over GF(p) and GF(2^p), which can also be extended to support AES operations. A new high-radix signed-digit (SD) adder is proposed to eliminate carry propagation over GF(p). The proposed unified design takes up 28.7% less hardware resources than implementing RSA, ECC, and AES individually, and the experimental results show that our proposed architecture can achieve 141.8 MHz using approximately 5.5k CLBs on a Virtex-5 FPGA.
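For context, the core RSA operation such an arithmetic unit accelerates is modular exponentiation; a minimal square-and-multiply version with toy 64-bit operands is sketched below, whereas the proposed hardware targets full-size, high-radix operands.

```c
/* Minimal square-and-multiply modular exponentiation, the core operation
 * behind RSA over GF(p), shown with toy 64-bit operands. */
#include <stdint.h>
#include <stdio.h>

/* computes (base^exp) mod m; __uint128_t keeps the 64x64 products exact */
static uint64_t modexp(uint64_t base, uint64_t exp, uint64_t m)
{
    uint64_t result = 1 % m;
    base %= m;
    while (exp > 0) {
        if (exp & 1)                                     /* multiply step */
            result = (uint64_t)((__uint128_t)result * base % m);
        base = (uint64_t)((__uint128_t)base * base % m); /* square step   */
        exp >>= 1;
    }
    return result;
}

int main(void)
{
    /* textbook RSA check: n = 61*53 = 3233, e = 17, d = 413, message 65 */
    uint64_t n = 3233, e = 17, d = 413, msg = 65;
    uint64_t c = modexp(msg, e, n);
    printf("cipher = %llu, decrypted = %llu\n",
           (unsigned long long)c, (unsigned long long)modexp(c, d, n));
    return 0;
}
```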

17.
Reconfigurable computing systems lack a unified mechanism for managing hardware and software resources, so resources cannot be used effectively. To address this, a hardware task model is designed and implemented that provides a unified hardware interface to upper-layer software, enabling the operating system to manage hardware and software tasks uniformly; the implementation structure and workflow of a hardware task loader are also presented. Experimental results show that the hardware task model runs efficiently and that the hardware task loader considerably increases the download rate of hardware tasks.

18.
The Euclidean Distance Transform (EDT) is an important tool in image analysis and machine vision. This paper provides an area-efficient hardware solution for computing the EDT of a binary image. An O(n) hardware algorithm for computing the EDT of an n×n image is presented, and a pipelined 2D array architecture for its hardware implementation is designed. The architecture has a regular structure with locally connected, identical processing elements, and pipelining further reduces hardware resources. Such an array architecture is easily scalable to images of different sizes and is suitable for implementation on reconfigurable devices like FPGAs. Results of an FPGA-based implementation show that the hardware can process about 6000 images of size 512×512 per second, which is much higher than the video rate of 30 frames per second.
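A small software reference for the exact squared EDT, computed separably (a per-column nearest-foreground scan followed by a per-row minimisation), is sketched below; it is only a correctness illustration, and the paper's O(n) pipelined array reorganises this work across locally connected processing elements.

```c
/* Software reference for the squared Euclidean Distance Transform of a
 * binary image, computed separably: phase 1 finds the vertical distance
 * to the nearest foreground pixel per column, phase 2 minimises over
 * columns per row. Image contents are illustrative. */
#include <stdio.h>
#include <limits.h>

#define N 6
#define INF (N * N + 1)    /* larger than any in-image vertical distance */

int main(void)
{
    /* 1 = foreground (distance 0), 0 = background */
    int img[N][N] = {
        {0,0,0,0,0,0},
        {0,1,0,0,0,0},
        {0,0,0,0,0,0},
        {0,0,0,0,1,0},
        {0,0,0,0,0,0},
        {0,0,0,0,0,0},
    };
    int g[N][N], dt[N][N];

    /* phase 1: vertical distance to nearest foreground, two scans per column */
    for (int x = 0; x < N; x++) {
        g[0][x] = img[0][x] ? 0 : INF;
        for (int y = 1; y < N; y++)
            g[y][x] = img[y][x] ? 0 : (g[y-1][x] >= INF ? INF : g[y-1][x] + 1);
        for (int y = N - 2; y >= 0; y--)
            if (g[y+1][x] + 1 < g[y][x]) g[y][x] = g[y+1][x] + 1;
    }

    /* phase 2: squared EDT = min over columns j of g[y][j]^2 + (j-x)^2 */
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++) {
            int best = INT_MAX;
            for (int j = 0; j < N; j++) {
                int d = g[y][j] * g[y][j] + (j - x) * (j - x);
                if (d < best) best = d;
            }
            dt[y][x] = best;
        }

    for (int y = 0; y < N; y++) {
        for (int x = 0; x < N; x++) printf("%4d", dt[y][x]);
        printf("\n");
    }
    return 0;
}
```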

19.
Dynamic Partial Reconfiguration and Its FPGA Implementation
李涛, 刘培峰, 杨愚鲁. 计算机工程, 2006, 32(14): 224-226
Dynamic partial reconfiguration makes full use of the reconfiguration capability provided by FPGA chips, improving FPGA utilization, reducing configuration time, and effectively improving overall system performance. This paper introduces two implementation methods for dynamic partial reconfiguration and verifies them on a Spartan-II FPGA.

20.
Networks of workstations and high-performance microcomputers have rarely been used for running parallel applications because, although they have significant aggregate computing power, they lack support for efficient message-passing and shared-memory communication. In this paper we present Telegraphos, a distributed system that provides efficient message-passing and shared-memory support on top of a workstation cluster. We focus on the network interface of Telegraphos, which provides a variety of shared-memory operations such as remote read, remote write, remote atomic operations, and DMA, all launched from user level without any intervention by the operating system. Telegraphos I, the Telegraphos prototype, has been implemented. Emphasis was placed on rapid prototyping, so the technology used was conservative: FPGAs, SRAMs, and TTL buffers.
