1.

ERI sorting for emerging processor architectures

Tirath Ramdas Gregory K. Egan 《Computer Physics Communications》2009,180(8):1221-1229

Electron Repulsion Integrals (ERIs) are a common bottleneck in ab initio computational chemistry. It is known that sorted/reordered execution of ERIs results in efficient SIMD/vector processing. This paper shows that reconfigurable computing and heterogeneous processor architectures can also benefit from a deliberate ordering of ERI tasks. However, realizing these benefits as net speedup requires a very rapid sorting mechanism. This paper presents two such mechanisms. Included in this study are analytical, simulation-based, and experimental benchmarking approaches to consider five use cases for ERI sorting, i.e. SIMD processing, reconfigurable computing, limited address spaces, instruction cache exploitation, and data cache exploitation. Specific consideration is given to existing cache-based processors, FPGAs, and the Cell Broadband Engine processor. It is proposed that the analyses conducted in this work should be built upon to aid the development of software autotuners which will produce efficient ab initio computational chemistry codes for a variety of computer architectures.
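The ordering idea in this abstract can be illustrated with a small sketch. This is not either of the paper's two sorting mechanisms, which the abstract does not describe; it is a generic bucket-style grouping. The task representation (an id paired with a four-value class tuple) and the names `bucket_by_class` and `count_switches` are assumptions made for illustration only.

```python
from collections import defaultdict

def bucket_by_class(tasks, key):
    """Reorder tasks so that all tasks sharing a class key are contiguous."""
    buckets = defaultdict(list)
    for task in tasks:
        buckets[key(task)].append(task)   # one pass over the task list
    ordered = []
    for k in sorted(buckets):             # sort the few distinct classes, not the tasks
        ordered.extend(buckets[k])
    return ordered

def count_switches(tasks, key):
    """Count how often the class changes between consecutive tasks."""
    return sum(1 for a, b in zip(tasks, tasks[1:]) if key(a) != key(b))

# Hypothetical tasks: (task_id, class_tuple). The class tuple stands in for
# whatever property decides which kernel or configuration a task needs.
tasks = [
    (0, (1, 0, 0, 0)),
    (1, (0, 0, 0, 0)),
    (2, (1, 0, 0, 0)),
    (3, (0, 0, 0, 0)),
    (4, (1, 1, 0, 0)),
]
task_class = lambda t: t[1]
ordered = bucket_by_class(tasks, task_class)
```

In this toy input the unsorted order changes class between every pair of neighbours, while the grouped order changes class only at bucket boundaries. Fewer class changes is the kind of property that would matter when each change implies a reconfiguration or an instruction-cache refill.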

2.

Efficient algorithms for 2D area management and online task placement on runtime reconfigurable FPGAs

Zonghua Weichen Jiang Jin Xiuqiang Qingxu 《Microprocessors and Microsystems》2009,33(5-6):374-387

Partial Runtime Reconfigurable (PRTR) FPGAs allow HW tasks to be placed and removed dynamically at runtime. We make two contributions in this paper. First, we present an efficient algorithm for finding the complete set of Maximal Empty Rectangles on a 2D PRTR FPGA. We also present a HW implementation of the algorithm with negligible runtime overhead. Second, we present an efficient online deadline-constrained task placement algorithm that minimizes area fragmentation on the FPGA by using an area fragmentation metric that takes into account the probability distribution of the sizes of future task arrivals as well as the time axis. The techniques presented in this paper are useful in an operating system for runtime reconfigurable FPGAs to manage the HW resources on the FPGA when HW tasks arrive and finish dynamically at runtime.
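To make the term Maximal Empty Rectangle concrete, here is a brute-force sketch. It is a reference definition only, not the efficient algorithm the paper presents (which the abstract does not detail). The grid encoding (1 = occupied cell, 0 = free cell) and the function name are assumptions. An empty rectangle is treated as maximal when it cannot grow by one row or column in any direction and stay empty.

```python
def maximal_empty_rectangles(grid):
    """Return all maximal empty rectangles as (top, left, bottom, right), inclusive.

    grid[r][c] is 1 for an occupied cell and 0 for a free cell.
    Brute force: fine for tiny grids, far too slow for a real device.
    """
    rows, cols = len(grid), len(grid[0])

    def empty(t, l, b, r):
        # Out-of-bounds rectangles count as "not empty" so growth stops at the edge.
        if t < 0 or l < 0 or b >= rows or r >= cols:
            return False
        return all(grid[y][x] == 0 for y in range(t, b + 1) for x in range(l, r + 1))

    result = []
    for t in range(rows):
        for l in range(cols):
            for b in range(t, rows):
                for r in range(l, cols):
                    if not empty(t, l, b, r):
                        continue
                    # Skip rectangles that can still grow in some direction.
                    if (empty(t - 1, l, b, r) or empty(t, l - 1, b, r)
                            or empty(t, l, b + 1, r) or empty(t, l, b, r + 1)):
                        continue
                    result.append((t, l, b, r))
    return result

# A 3x3 area with one task occupying the centre cell.
fabric = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
mers = maximal_empty_rectangles(fabric)
```

For this grid the maximal empty rectangles are the top row, the bottom row, the left column and the right column. A placer would pick one of these for the next incoming task.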

3.

Compiling for power with ScalaPipe

《Journal of Systems Architecture》2013,59(8):615-625


4.

Efficient dynamic priority based soft error mitigation techniques for configuration memory of FPGA hardware

《Microprocessors and Microsystems》2017

Radiation-induced single-bit upsets (SBUs) and multi-bit upsets (MBUs) are prominent in Field Programmable Gate Arrays (FPGAs) because of the large number of latches in the configuration memory (CM) of FPGAs. SBUs and MBUs in the CM can permanently or temporarily affect the hardware circuit implemented on the FPGA. Hence, error mitigation and recovery techniques are necessary to protect the FPGA hardware from permanent faults arising from such SBUs and MBUs. Existing techniques for mitigating the effect of soft errors in FPGAs have high overhead, and their implementations are quite complex. In this paper, we propose efficient single-bit and multi-bit error correcting methods that correct errors in the CM of FPGAs using simple parity equations and erasure codes. These codes are easy to implement, and the required decoding circuits are also simple. Using Dynamic Partial Reconfiguration (DPR) together with a download manager based on a simple hardware scheduling algorithm makes it possible to perform error correction in the CM without suspending the operation of the other hardware blocks. We propose a first-of-its-kind methodology for transient fault correction that combines efficient error correcting codes with hardware scheduling for FPGAs. To validate the design, we have tested the proposed methodology on a Kintex FPGA. We have also measured parameters such as fault recovery time, power consumption, resource overhead and error correction efficiency to estimate the performance of the proposed methods.
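As a hedged illustration of how simple parity equations can locate and correct a single flipped bit, the sketch below uses generic two-dimensional (row and column) parity. The abstract does not give the paper's actual parity equations or erasure code, so this is a stand-in technique; the frame layout and the function names are assumptions.

```python
def parities(block):
    """Return (row_parities, column_parities) for a 2D block of 0/1 bits."""
    row_par = [sum(row) % 2 for row in block]
    col_par = [sum(col) % 2 for col in zip(*block)]
    return row_par, col_par

def correct_single_bit(block, row_par, col_par):
    """Correct one flipped bit in place using stored row/column parities.

    Returns the (row, col) of the corrected bit, or None if the block is clean.
    Raises ValueError when the mismatch pattern is not a single-bit error.
    """
    cur_rows, cur_cols = parities(block)
    bad_rows = [i for i, (a, b) in enumerate(zip(cur_rows, row_par)) if a != b]
    bad_cols = [j for j, (a, b) in enumerate(zip(cur_cols, col_par)) if a != b]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        r, c = bad_rows[0], bad_cols[0]
        block[r][c] ^= 1          # flip the bit at the intersection
        return (r, c)
    raise ValueError("not correctable with plain row/column parity")

# Hypothetical configuration frame, stored as rows of bits.
golden = [
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
]
stored_rows, stored_cols = parities(golden)

upset = [row[:] for row in golden]
upset[1][2] ^= 1                  # inject a single-bit upset
fixed_at = correct_single_bit(upset, stored_rows, stored_cols)
```

Plain row/column parity handles only one flipped bit per block; multi-bit upsets need a stronger code, which is where the erasure code mentioned in the abstract would come in.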

5.

FPGA-based accelerated processing of sensor data

孙淑荣 王永利 《计算机应用研究》2011,28(4):1365-1367

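With the rapid development of sensor network technology, real-time processing of massive, high-speed sensor data poses many challenges for existing data processing techniques. Using asymmetric multi-core systems to accelerate special-purpose operations is a current trend in computer architecture and an effective way to achieve real-time response in Internet of Things applications. This paper proposes an FPGA-based method for accelerating median computation over sensor data, including a computation model, an implementation that respects FPGA design constraints, and a suitable integration strategy for connecting the FPGA to the rest of the system. Experiments show that, compared with a general-purpose CPU, the proposed FPGA-based accelerated data processing strategy has clear advantages in power consumption and in the evaluation of parallel sensor data streams, and can be widely applied to the design and implementation of lightweight front-end data preprocessing nodes in the Internet of Things.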
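For reference, the operation being accelerated, a median over a stream of sensor readings, can be sketched in software as a sliding-window median. The abstract does not describe the paper's computation model, so the window-based formulation, the function name and the sample readings below are all assumptions.

```python
from bisect import bisect_left, insort
from collections import deque

def sliding_median(stream, window):
    """Yield the median of each full window of `window` consecutive readings."""
    recent = deque()      # readings in arrival order, used to evict the oldest
    ordered = []          # the same readings kept sorted
    for value in stream:
        recent.append(value)
        insort(ordered, value)
        if len(recent) > window:
            oldest = recent.popleft()
            del ordered[bisect_left(ordered, oldest)]
        if len(recent) == window:
            mid = window // 2
            if window % 2:
                yield ordered[mid]
            else:
                yield (ordered[mid - 1] + ordered[mid]) / 2

readings = [5, 1, 4, 2, 3]          # hypothetical sensor samples
medians = list(sliding_median(readings, 3))
```

A CPU version like this keeps a sorted buffer and updates it once per sample. A hardware version would more plausibly use a fixed compare-and-swap network, which is one reason median filtering is a natural candidate for FPGA offload.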

6.

Optimal utilization of available reconfigurable hardware resources

Kashif Latif Arshad Aziz Athar Mahboob 《Computers & Electrical Engineering》2011,37(6):1043-1057

Field programmable gate arrays (FPGAs) are continuously gaining momentum and becoming an essential part of today’s digital systems and applications. The growing use of these devices, coupled with increasingly complex and integrated designs, necessitates a search for techniques that use their internal resources efficiently. Standard HDL coding techniques and synthesis tools map logic onto a look-up table (LUT) based architecture. The resulting design uses more area on the chip, while some fast, dedicated areas and resources of the chip remain unused. This in turn results in slower clock rates and longer critical paths, so the design remains inefficient in terms of both speed and area. In this paper we present and discuss techniques for using the dedicated FPGA resources effectively in order to speed up achievable clock rates and reduce FPGA area utilization. Various useful HDL constructs are presented that use the dedicated hardware resources of modern Xilinx FPGAs. Optimization techniques are presented with implementation examples and a corresponding quantitative performance evaluation. In most cases we achieved a 50% reduction in chip area utilization and simultaneously improved timing results significantly.

7.

Performance evaluation and optimal design for FPGA-based digit-serial DSP functions

Hanho Lee Gerald E. Sobelman 《Computers & Electrical Engineering》2003,29(2):357-377

As field programmable gate array (FPGA) technology has steadily improved, FPGAs have become viable alternatives to other implementation technologies for high-speed classes of digital signal processing (DSP) applications. Digit-serial DSP architectures have been an effective implementation method for FPGAs. In this work, a method of implementing digit-serial DSP architectures on FPGAs is presented, and their performance is evaluated with the objective of finding and developing the most efficient digit-serial DSP architectures on FPGAs. This paper discusses the area costs and operational delays of various digit-serial DSP functions and presents area/delay models for Xilinx XC4000-series FPGAs. These area/delay models can predict performance and hardware resource utilization before a lengthy layout and synthesis process is undertaken. The results show that the area/delay models proposed here are valid and that digit-serial DSP designs are promising candidates for efficient FPGA implementations.

8.

Generation of logic designs for efficiently solving ordinary differential equations on field programmable gate arrays

Silas Bartel Matthias Korch 《Software》2023,53(1):27-52


9.

Reconfiguring one-time programmable FPGAs

《Micro, IEEE》1999,19(6):53-63

Field-programmable gate arrays can suffer from a variety of faults, ranging from wire anomalies and defects to inoperative programmable connections. The solution to these faults depends on whether we are dealing with a reprogrammable FPGA or a one-time programmable (OTP) FPGA. To correct faults, developers can reconfigure FPGAs such as those made by Xilinx and Altera by reprogramming. These devices can be programmed many times, for different designs and applications. Correcting faults in OTP FPGAs, such as those made by Actel, is more difficult. For one thing, OTP FPGAs are based on antifuses. With an antifuse, the FPGA's configuration information has an initial (default) value that can be changed, but once changed cannot be restored. Therefore, the procedures to bypass faulty cells or faulty routing in an OTP FPGA must meet more stringent requirements than for reprogrammable FPGAs. The “Reconfiguration Approaches” sidebar describes two methods other researchers have tried. This article describes our approach to reconfiguring OTP FPGAs. We explain how we determine if reconfiguration is feasible, the algorithms we used, and the results of our experiments on a generic OTP FPGA model and a generic detail router.

10.

Server Selection, Configuration and Reconfiguration Technology for IaaS Cloud with Multiple Server Types

Yoji Yamato 《Journal of Network and Systems Management》2018,26(2):339-360

We propose a server selection, configuration, reconfiguration and automatic performance verification technology to meet user functional and performance requirements on various types of cloud compute servers. Various servers means not only virtual machines on ordinary CPU servers but also container or baremetal servers on powerful graphics processing unit (GPU) servers or field programmable gate arrays (FPGAs) with a configuration that accelerates a specified computation. Early cloud systems were composed of many PC-like servers, and virtual machines on these servers used distributed processing technology to achieve high computational performance. However, recent cloud systems have changed to make the best use of advances in hardware power. It is well known that baremetal and container performance is better than virtual machine performance. Dedicated processing servers, such as powerful GPU servers for graphics processing and FPGA servers for specified computations, have also increased. Our objective for this study was to enable cloud providers to provision compute resources on appropriate hardware based on user requirements, so that users can easily benefit from high performance of their applications. Our proposed technology selects appropriate servers for user compute resources from various types of hardware, such as GPUs and FPGAs, or sets appropriate configurations or reconfigurations of FPGAs to use hardware power. Furthermore, our technology automatically verifies the performance of provisioned systems. We measured provisioning and automatic performance verification times to show the effectiveness of our technology.

11.

High-Level synthesis assisted design and verification framework for automotive radar processors

《Microprocessors and Microsystems》2020

In radar-based advanced driver assistance systems, baseband processing is necessary to detect the speed, distance, and angle of elevation of the target (e.g., vehicle, pedestrian, traffic sign, etc.). The target and the source often move at high speeds; therefore, the computation rate must be sufficiently high to perform actions (e.g., braking) in real time. Software-based implementations of such systems fall short of the required performance, which has led to an increase in the popularity of custom hardware implementations, e.g., on field-programmable gate arrays (FPGAs). FPGAs also serve as platforms to develop software concurrently with system-on-chip (SoC) development, thereby decreasing the time to market. High-level synthesis (HLS) tools are gaining considerable attention in the very-large-scale integration design community because of their flexibility. In this paper, we propose a novel design and verification framework for a radar processing SoC. The framework is assisted by an HLS-based design scheme for the processor and supports the application of a real-world stimulus to a register-transfer-level design implementation running on FPGAs. Customer use cases for the distance and velocity calculations are executed in a pre-silicon environment using range and Doppler processing on the Xilinx Kintex-7 (XC7K480T) FPGA. Our findings show that the proposed framework, based on MATLAB HDL Coder and HDL Verifier, is superior to similar implementations from prior research in terms of speed and FPGA resources. This is owing to the use of appropriate HLS directives and of a novel design method based on application-specific bit widths for intermediate data nodes.

12.

Extended overlay architectures for heterogeneous FPGA cluster management

《Journal of Systems Architecture》2017

This paper proposes a novel approach for the hardware virtualization of FPGA resources, based on overlay architectures. Overlays are reconfigurable architectures synthesized on top of commercial off-the-shelf (COTS) FPGAs. They have been shown to improve portability, speed up reconfiguration, and promote resource abstraction and hence durability. This work demonstrates how slightly extending the architecture overlaid on top of COTS FPGAs can bring novel features for the sake of improved management of hardware tasks, and ensure binary compatibility among heterogeneous FPGAs. This comes with a deployment platform and a software stack offering an operating system service. As a result, the platform is capable of node-to-node live migration of a hardware application while operating a cluster of heterogeneous FPGAs. In addition, the proposed software stack ensures backward compatibility when a new overlay architecture is introduced. This paper also introduces accurate cost models for the early estimation of the reconfiguration time overhead. This approach, which was demonstrated at the DASIP international conference, is evaluated in this paper on both the Xilinx Artix-7 and Altera Cyclone V C9 FPGAs.

13.

Embedding a high speed interval type-2 fuzzy controller for a real plant into an FPGA

Roberto Sepúlveda Oscar Castillo 《Applied Soft Computing》2012,12(3):988-998

The main goal of this paper is to show that interval type-2 fuzzy inference systems (IT2 FIS) can be used in applications that require high-speed processing. This is an important issue because the use of IT2 FIS is still controversial for several reasons; one of the most important is the sharp increase in computational complexity that type reducers, such as the Karnik-Mendel (KM) iterative method, can cause even for small systems. Comparing our results against a typical implementation of an IT2 FIS written in a high-level language and run on a computer, we show that in a hardware implementation the whole IT2 FIS (fuzzification, inference engine, type reducer and defuzzification) takes only four clock cycles; speedups of nearly 225,000 and 450,000 can be obtained for the Spartan 3 and Virtex 5 Field Programmable Gate Arrays (FPGAs), respectively. The proposal is suitable for a pipelined implementation, so the complete IT2 process can be obtained in just one clock cycle, with a consequent gain in speed of 900,000 and 2,400,000 for the aforementioned FPGAs. This paper also shows that the iterative KM method can be efficient if it is adequately implemented using the appropriate combination of hardware and software. Comparative experiments on control surfaces and on time response in the control of a real plant, using the IT2 FIS implemented on a computer against the IT2 FIS implemented on an FPGA, are shown.

14.

Mapping of option pricing algorithms onto heterogeneous many-core architectures

Shuai Zhang Zhao Wang Ying Peng Bertil Schmidt Weiguo Liu 《The Journal of supercomputing》2017,73(9):3715-3737

The rapid development of technologies and applications in recent years poses high demands and challenges for high-performance computing. Because of their competitive performance/price ratio, heterogeneous many-core architectures are widely used in high-performance computing areas. GPU and Xeon Phi are two popular general-purpose many-core accelerators. In this paper, we demonstrate how heterogeneous many-core architectures, powered by multi-core CPUs, CUDA-enabled GPUs and Xeon Phis can be used as an efficient computational platform to accelerate popular option pricing algorithms. In order to make full use of the compute power of this architecture, we have used a hybrid computing model which consists of two types of data parallelism: worker level and device level. The worker level data parallelism uses a distributed computing infrastructure for task distribution, while the device level data parallelism uses both the multi-core CPUs and many-core accelerators for fast option pricing calculation. Experiments show that our implementations achieve good performance and scalability on this architecture and also outperform other state-of-the-art GPU-based solutions for Monte Carlo European/American option pricing and BSDE European option pricing.

15.

Security in SRAM FPGAs

Trimberger S. 《Design & Test of Computers, IEEE》2007,24(6):581-581

As FPGAs have grown larger and more complex, the value of the IP implemented in them has grown commensurately. Since SRAM FPGAs reload their programming data every time they are powered up, an adversary can potentially copy the program as it is being loaded. FPGA manufacturers have added security features to protect designs from unauthorized copy, theft, and reverse-engineering as the bitstream is transmitted from permanent storage into the FPGA. These bitstream security features use well-known information security methods to protect design data. In this discussion, it is assumed that an adversary has physical access to the FPGA. In this environment, denial-of-service attacks on the configuration are irrelevant: A trivial denial-of-service method would be to physically damage the device - the so-called "whack-it-with-a-hammer" attack.

16.

Stacked Autoencoders Using Low-Power Accelerated Architectures for Object Recognition in Autonomous Systems

Joao Maria Joao Amaro Gabriel Falcao Luís A. Alexandre 《Neural Processing Letters》2016,43(2):445-458

This paper investigates low-energy, low-power hardware models and processor architectures for real-time object recognition in power-constrained autonomous systems and robots. Recent developments show that convolutional deep neural networks are currently the state of the art in terms of classification accuracy. In this article we propose the use of a different type of deep neural network, stacked autoencoders, and show that with a limited number of layers and nodes, chosen to accommodate low-power accelerators such as mobile GPUs and FPGAs, we are still able to achieve both classification levels not far from the state of the art and a high number of processed frames per second. We present experiments using the color CIFAR-10 dataset. This enables the adaptation of the architecture to a live camera feed. Another novelty, also proposed for the first time in this work, is that the training phase can be performed on these low-power devices as well, instead of the usual approach that uses a desktop CPU or a GPU for training and only runs the trained network later on the FPGA. This allows new functionality to be incorporated, for example a robot performing runtime learning.

17.

Deterministic bit-stream digital neurons

Braendler D. Hendtlass T. O'Donoghue P. 《Neural Networks, IEEE Transactions on》2002,13(6):1514-1525

In this paper, we present the design of a deterministic bit-stream neuron, which makes use of the memory rich architecture of fine-grained field-programmable gate arrays (FPGAs). It is shown that deterministic bit streams provide the same accuracy as much longer stochastic bit streams. As these bit streams are processed serially, this allows neurons to be implemented that are much faster than those that utilize stochastic logic. Furthermore, due to the memory rich architecture of fine-grained FPGAs, these neurons still require only a small amount of logic to implement. The design presented here has been implemented on a Virtex FPGA, which allows a very regular layout facilitating efficient usage of space. This allows for the construction of neural networks large enough to solve complex tasks at a speed comparable to that provided by commercially available neural-network hardware.

18.

ProNoC: A low latency network-on-chip based many-core system-on-chip prototyping platform

《Microprocessors and Microsystems》2017

Network-on-chip (NoC) is an emerging interconnect infrastructure that addresses the scalability limitation of the conventional shared-bus architecture for many-core system-on-chip (MCSoC). Current field-programmable gate arrays (FPGAs) have over a million lookup tables, making it possible to prototype a complete NoC-based MCSoC on a single FPGA device. FPGA prototyping allows rapid system verification and estimation of optimum design parameters. However, existing NoC-based MCSoC prototypes usually adopt only simple NoC architectural functionality. These NoC prototypes cannot give a realistic projection of state-of-the-art application-specific integrated circuit (ASIC) NoCs because they have limited overall system performance. This paper presents ProNoC, an integrated tool for rapid prototyping and validation of NoC-based MCSoC projects targeting FPGA devices. ProNoC adopts most advanced NoC features, such as support for virtual channels (VCs), virtual networks, low-latency routing and different routing algorithms. Results show that the NoC interconnect in ProNoC outperforms CONNECT, the most recent VC-based prototype NoC, with lower logic cell utilization, higher maximum operating frequency, higher average saturation throughput, and lower average communication latency. Moreover, ProNoC is equipped with a graphical user interface to facilitate the development of MCSoC prototypes on FPGA platforms.

19.

Dynamic, scalable and flexible resource discovery for large-dimension many-core systems

《Future Generation Computer Systems》2015


20.

The Garp architecture and C compiler

Callahan T.J. Hauser J.R. Wawrzynek J. 《Computer》2000,33(4):62-69

Various projects and products have been built using off-the-shelf field-programmable gate arrays (FPGAs) as computation accelerators for specific tasks. Such systems typically connect one or more FPGAs to the host computer via an I/O bus. Some have shown remarkable speedups, albeit limited to specific application domains. Many factors limit the general usefulness of such systems. Long reconfiguration times prevent the acceleration of applications that spread their time over many different tasks. Low-bandwidth paths for data transfer limit the usefulness of such systems to tasks that have a high computation-to-memory-bandwidth ratio. In addition, standard FPGA tools require hardware design expertise which is beyond the knowledge of most programmers. To help investigate the viability of connected FPGA systems, the authors designed their own architecture called Garp and experimented with running applications on it. They are also investigating whether Garp's design enables automatic, fast, effective compilation across a broad range of applications. They present their results in this article.