Similar Articles
20 similar articles found.
1.
Network-on-Chip (NoC) interconnect fabrics are categorized according to trade-offs among latency, throughput, speed, and silicon area, and the correctness and performance of these fabrics in Field-Programmable Gate Array (FPGA) applications are assessed through experimentation and simulation. In this paper, we propose a consistent parametric method for evaluating the FPGA performance of three common on-chip interconnect architectures, namely the Mesh, Torus, and Fat-tree architectures. We also investigate how NoC architectures are affected by interconnect and routing parameters, and demonstrate their flexibility and performance through FPGA synthesis and testing of 392 different NoC configurations. In this process, we found that the Flit Data Width (FDW) and Flit Buffer Depth (FBD) parameters have the heaviest impact on FPGA resources, and that these parameters, along with the number of Virtual Channels (VCs), significantly affect reassembly buffering as well as routing and logic requirements at NoC endpoints. Applying our evaluation technique to a detailed and flexible cycle-accurate simulation, we drive the three NoC architectures with benign (Nearest Neighbor and Uniform) and adversarial (Tornado and Random Permutation) traffic patterns under different numbers of VCs, producing a set of load–delay curves. The results show that, by strategically tuning the router and interconnect parameters, the Fat-tree network achieves the best utilization of FPGA resources in terms of silicon area, clock frequency, critical path delay, network cost, saturation throughput, and latency, whereas the Mesh and Torus networks show comparatively high resource costs and poor performance under adversarial traffic patterns. Our findings make clear that the Fat-tree network is the most efficient in terms of FPGA resource utilization and maps well onto current Xilinx FPGA devices. This approach will assist engineers and architects in making early decisions on the right interconnect and router parameters for large and complex NoCs. A wide variety of experiments and simulations confirms that the approach substantially improves performance and is suitable for real systems.

2.
Field programmable gate arrays (FPGAs) are continuously gaining momentum and becoming an essential part of today's digital systems and applications. The growing use of these devices, coupled with increasingly complex and integrated designs, necessitates techniques for the efficient utilization of their internal resources. Standard HDL coding techniques and synthesis tools map logic onto a look-up table (LUT) based architecture. The resulting design occupies more area on the chip, while fast, dedicated resources of the chip remain unused. This in turn results in slower clock rates and longer critical paths, so the design remains inefficient in terms of both speed and area. In this paper we present and discuss techniques to effectively utilize the dedicated FPGA resources in order to speed up achievable clock rates and reduce FPGA area utilization. Various useful HDL constructs are presented that exploit the dedicated hardware resources of modern Xilinx FPGAs. Optimization techniques are presented with implementation examples and corresponding quantitative performance evaluation. In most cases we achieved a 50% reduction in chip area utilization while simultaneously improving timing results significantly.

3.
Computer architectures are quickly changing toward heterogeneous many-core systems. Such a trend opens up interesting opportunities but also raises immense challenges since the efficient use of heterogeneous many-core systems is not a trivial problem. Software-configurable microprocessors and FPGAs add further diversity but also increase complexity. In this paper, we explore the use of sorting networks on field-programmable gate arrays (FPGAs). FPGAs are very versatile in terms of how they can be used and can also be added as additional processing units in standard CPU sockets. Our results indicate that efficient usage of FPGAs involves non-trivial aspects such as having the right computation model (a sorting network in this case); a careful implementation that balances all the design constraints in an FPGA; and the proper integration strategy to link the FPGA to the rest of the system. Once these issues are properly addressed, our experiments show that FPGAs exhibit performance figures competitive with those of modern general-purpose CPUs while offering significant advantages in terms of power consumption and parallel stream evaluation.
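As an illustration of the computation model this abstract refers to, the Python sketch below models a bitonic sorting network: a fixed schedule of compare-exchange operations grouped into data-independent stages, which is what lets an FPGA evaluate one stage per pipelined clock cycle. It is a generic software model under the assumption of power-of-two input sizes, not the authors' FPGA implementation.

```python
# Generic software model of a bitonic sorting network (an illustrative
# assumption, not the paper's FPGA design). Each stage is a set of
# independent compare-exchange pairs that hardware could evaluate in
# parallel within a single pipeline stage.

def bitonic_network(n):
    """Return the compare-exchange pairs of a bitonic sorter on n = 2**m
    inputs, grouped into stages of mutually independent operations."""
    stages = []
    k = 2
    while k <= n:
        j = k // 2
        while j >= 1:
            stage = []
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    stage.append((i, partner, ascending))
            stages.append(stage)
            j //= 2
        k *= 2
    return stages

def sort_with_network(values):
    """Run the fixed network over a list; swaps within one stage touch
    disjoint indices, mirroring the parallelism exploited in hardware."""
    data = list(values)
    for stage in bitonic_network(len(data)):
        for i, p, ascending in stage:
            if ascending == (data[i] > data[p]):
                data[i], data[p] = data[p], data[i]
    return data

print(sort_with_network([7, 3, 6, 1, 4, 0, 5, 2]))  # [0, 1, 2, 3, 4, 5, 6, 7]
```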

4.
Packet classification (matching) is one of the critical operations in networking, widely used in many different devices and tasks ranging from switching and routing to a variety of monitoring and security applications such as firewalls and IDS. To satisfy the ever-growing performance demands of current and future high-speed networks, specially designed hardware-accelerated architectures implementing packet classification are necessary. These demands are now growing to such an extent that, in order to keep up with the rising throughputs of network links, FPGA-accelerated architectures are required to match multiple packets in every single clock cycle. To meet this requirement a simple replication approach can be utilized: instantiate multiple copies of a processing pipeline matching incoming packets in parallel. However, simple replication of pipelines inevitably brings a significant increase in utilization of FPGA resources of all types, which is especially costly for the rather scarce on-chip memories used in matching tables. We propose and examine a unique parallel hardware architecture for hash-based exact-match classification of multiple packets per clock cycle that reduces memory replication requirements. The core idea of the proposed architecture is to exploit the basic memory organization present in all modern FPGAs, where hundreds of individual block or distributed memory tiles are available and can be accessed (addressed) independently. This way, we are able to maintain a high throughput of multiple packets matched per clock cycle even without fully replicated memory resources in matching tables. Our results show that the designed approach uses on-chip memory resources very efficiently and scales exceptionally well with increased match table capacities. For example, the proposed architecture is able to achieve a throughput of more than 2 Tbps (over 3,000 Mpps) with an effective capacity of more than 40,000 IPv4 flow records at the cost of only a few hundred block memory tiles (366 BlockRAMs for Xilinx or 672 M20Ks for Intel FPGAs), utilizing only a small fraction of the available logic resources (around 68,000 LUTs for Xilinx or 95,000 ALMs for Intel).
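To make the core idea more concrete, here is a small Python model, written as an illustrative assumption rather than a description of the authors' RTL, of spreading an exact-match table over independently addressable memory tiles so that several lookups can be served in the same cycle without fully replicating the table. The hash function, tile counts, key format, and conflict handling are placeholders.

```python
# Illustrative model (not the paper's architecture): an exact-match table
# spread over many independently addressable memory tiles, so lookups that
# hit distinct tiles can proceed in the same "cycle" without replication.
import zlib

class TiledExactMatchTable:
    def __init__(self, num_tiles=64, tile_depth=512):
        self.num_tiles = num_tiles
        self.tile_depth = tile_depth
        # Each tile models one block/distributed memory with its own port.
        self.tiles = [[None] * tile_depth for _ in range(num_tiles)]

    def _place(self, key):
        h = zlib.crc32(key)  # stand-in for a hardware hash function
        return h % self.num_tiles, (h // self.num_tiles) % self.tile_depth

    def insert(self, key, value):
        tile, addr = self._place(key)
        slot = self.tiles[tile][addr]
        if slot is not None and slot[0] != key:
            raise ValueError("collision; a real design would rehash or stash")
        self.tiles[tile][addr] = (key, value)

    def lookup_parallel(self, keys):
        """Model one clock cycle: each tile serves at most one request, so
        lookups mapping to distinct tiles proceed in parallel and the rest
        are reported as conflicts (they would stall in hardware)."""
        busy, results, conflicts = set(), {}, []
        for key in keys:
            tile, addr = self._place(key)
            if tile in busy:
                conflicts.append(key)
                continue
            busy.add(tile)
            entry = self.tiles[tile][addr]
            results[key] = entry[1] if entry and entry[0] == key else None
        return results, conflicts

table = TiledExactMatchTable()
table.insert(b"10.0.0.1->10.0.0.2:80/6", "flow-A")
hits, stalled = table.lookup_parallel([b"10.0.0.1->10.0.0.2:80/6",
                                       b"10.0.0.3->10.0.0.4:22/6"])
print(hits, stalled)
```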

5.
Network-on-chip (NoC) is an emerging interconnect infrastructure that addresses the scalability limitations of conventional shared-bus architectures for many-core system-on-chip (MCSoC) designs. Current field-programmable gate arrays (FPGAs) have over a million lookup tables, making it possible to prototype a complete NoC-based MCSoC on a single FPGA device. FPGA prototyping allows rapid system verification and estimation of optimum design parameters. However, existing NoC-based MCSoC prototypes usually adopt simple NoC architectural functionality. These prototypes cannot provide a realistic projection of state-of-the-art application-specific integrated circuit (ASIC) NoCs, as their overall system performance is limited. This paper presents ProNoC, an integrated tool for rapid prototyping and validation of NoC-based MCSoC projects targeting FPGA devices. ProNoC adopts advanced NoC features such as virtual channel (VC) support, virtual networks, low-latency routing, and different routing algorithms. Results show that the NoC interconnect in ProNoC outperforms CONNECT, the most recent VC-based prototype NoC, with lower logic cell utilization, higher maximum operating frequency, higher average saturation throughput, and lower average communication latency. Moreover, ProNoC is equipped with a graphical user interface to facilitate the development of MCSoC prototypes on FPGA platforms.

6.
As field programmable gate array (FPGA) technology has steadily improved, FPGAs are now viable alternatives to other technology implementations for high-speed classes of digital signal processing (DSP) applications. Digit-serial DSP architectures have been an effective implementation method for FPGAs. In this work, a method of implementing digit-serial DSP architectures on FPGAs is presented, and their performance is evaluated with the objective of finding and developing the most efficient digit-serial DSP architectures on FPGAs. This paper discusses the area costs and operational delays of various digit-serial DSP functions and presents area/delay models for Xilinx XC4000-series FPGAs. These area/delay models can predict performance and hardware resource utilization before a lengthy layout and synthesis process is undertaken. The results show that the proposed area/delay models are valid and that digit-serial DSP designs are promising candidates for efficient FPGA implementations.

7.
Using transferred circuits and metal interconnections placed between layers of active devices anywhere on the chip, Rothko aims at solving the utilization, routing, and delay problems of existing FPGA architectures. Experimental implementations have demonstrated important performance advantages.

8.
In radar-based advanced driver assistance systems, baseband processing is necessary to detect the speed, distance, and elevation angle of a target (e.g., vehicle, pedestrian, traffic sign). The target and the source often move at high speeds; therefore, the computation rate must be high enough to perform actions (e.g., braking) in real time. Software-based implementations of such systems fall short of the required performance, which has led to an increase in the popularity of custom hardware implementations, e.g., on field-programmable gate arrays (FPGAs). FPGAs also serve as platforms for developing software concurrently with system-on-chip (SoC) development, thereby decreasing the time to market. High-level synthesis (HLS) tools are gaining considerable attention in the very-large-scale integration design community because of their flexibility. In this paper, we propose a novel design and verification framework for a radar processing SoC. The framework is assisted by an HLS-based design scheme for the processor and supports applying real-world stimulus to the register-transfer-level design implementation running on FPGAs. Customer use cases for distance and velocity calculation are executed in a pre-silicon environment using range and Doppler processing on a Xilinx Kintex-7 (XC7K480T) FPGA. Our findings show that the proposed framework, based on MATLAB HDL Coder and HDL Verifier, is superior to similar implementations from prior research in terms of speed and FPGA resources. This is owing to the use of appropriate HLS directives and a novel design method based on application-specific bit widths for intermediate data nodes.
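For context on the range and Doppler processing mentioned above, the sketch below is a plain NumPy reference of a range-Doppler map, not the paper's MATLAB HDL Coder design; the cube dimensions, window choice, and synthetic target are illustrative assumptions.

```python
# Hedged NumPy sketch of range-Doppler processing: a range FFT along each
# chirp (fast time) followed by a Doppler FFT across chirps (slow time).
# Peaks in the resulting map indicate target distance and relative speed.
import numpy as np

def range_doppler_map(adc_cube, window=np.hanning):
    """adc_cube: complex samples shaped (num_chirps, samples_per_chirp)."""
    num_chirps, samples = adc_cube.shape
    range_fft = np.fft.fft(adc_cube * window(samples), axis=1)           # fast time -> range
    doppler_fft = np.fft.fftshift(
        np.fft.fft(range_fft * window(num_chirps)[:, None], axis=0),     # slow time -> velocity
        axes=0)
    return 20 * np.log10(np.abs(doppler_fft) + 1e-12)

# Example: a synthetic single target produces one dominant peak.
rng = np.random.default_rng(0)
cube = rng.standard_normal((128, 256)) * 0.01 + \
       np.exp(2j * np.pi * (0.2 * np.arange(256) + 0.1 * np.arange(128)[:, None]))
rd_map = range_doppler_map(cube)
print(rd_map.shape, np.unravel_index(rd_map.argmax(), rd_map.shape))
```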

9.
As integration density increases and operating voltages decrease, SRAM-based FPGA chips become increasingly susceptible to single-event upsets. This paper proposes a radiation-hardened routing algorithm based on the general-purpose place-and-route tool VPR: by modifying the cost function of the relevant routing resource nodes, the algorithm reduces bridging faults caused by single-event upsets, and its on-board test results are compared against those of VPR. Experimental results show that the proposed routing algorithm improves the fault tolerance of the chip by about 20%, without requiring additional hardware resources or introducing circuit redundancy.

10.
This paper proposes a novel approach to the hardware virtualization of FPGA resources based on overlay architectures. Overlays are reconfigurable architectures synthesized on top of commercial off-the-shelf (COTS) FPGAs. They have been shown to improve portability, speed up reconfiguration, and promote resource abstraction and hence durability. This work demonstrates how slightly extending the overlay architecture on top of COTS FPGAs can bring novel features for improved management of hardware tasks and can ensure binary compatibility among heterogeneous FPGAs. This comes along with a deployment platform and a software stack offering an operating-system service. As a result, the platform is capable of node-to-node live migration of a hardware application while operating a cluster of heterogeneous FPGAs. Besides, the proposed software stack ensures backward compatibility when a new overlay architecture is introduced. This paper also introduces accurate cost models for early estimation of the reconfiguration time overhead. The approach, which was demonstrated at the DASIP international conference, is evaluated in this paper on both Xilinx Artix-7 and Altera Cyclone V C9 FPGAs.

11.
SRAM-based FPGAs are attractive for critical applications due to their reconfiguration capability, which allows a design to be adapted in the field under different upset-rate environments. High-Level Synthesis (HLS) is a powerful method for exploring different design architectures in FPGAs. In this paper, the HLS tool from Xilinx is used to generate different design architectures, and the probability of errors in those architectures is then analyzed. Two case-study scenarios are investigated. First, the influence of control-flow and pipeline architectures, combined with the use of specialized DSP blocks in the FPGA, is evaluated. The number of errors classified as silent data corruption or timeout is analyzed according to the architecture and DSP block usage. Moreover, further HLS design options are explored, such as data organization, aggressive pipeline insertion, and implementation of the algorithm on a soft processor such as the Xilinx MicroBlaze. These architectures are strongly optimized for performance, and the design least susceptible to soft errors is identified. All case-study designs are evaluated in a 28 nm SRAM-based FPGA under fault injection. The dynamic cross section, soft error rate, and mean work between failures are calculated from the experimental results. The proposed characterization method can guide designers in selecting architectures that balance susceptibility to upsets and performance efficiency.
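As a rough illustration of how the reliability figures named in this abstract are commonly derived from a fault-injection campaign, the following Python sketch uses one conventional set of formulas with placeholder inputs; the paper's exact definitions and every number below are assumptions, not its results.

```python
# Hedged sketch (placeholder data, not the paper's): one common way to go
# from fault-injection counts and device data to the metrics the abstract
# names: dynamic cross section, soft error rate, mean work between failures.

def reliability_metrics(errors, injections, essential_bits,
                        sigma_bit_cm2, flux_cm2_h, work_per_hour):
    """errors/injections : outcome of the fault-injection campaign
       essential_bits    : configuration bits actually used by the design
       sigma_bit_cm2     : per-bit static cross section of the device
       flux_cm2_h        : particle flux of the target environment
       work_per_hour     : useful work executed per hour of operation"""
    error_rate = errors / injections                         # fraction of upsets that cause a failure
    sigma_dyn = error_rate * essential_bits * sigma_bit_cm2  # dynamic cross section (cm^2)
    ser = sigma_dyn * flux_cm2_h                             # failures per hour
    mtbf_h = 1.0 / ser                                       # mean time between failures (hours)
    mwbf = work_per_hour * mtbf_h                            # mean work between failures
    return sigma_dyn, ser, mwbf

# All inputs are hypothetical placeholders for illustration only.
print(reliability_metrics(errors=120, injections=100_000,
                          essential_bits=2_500_000,
                          sigma_bit_cm2=7.0e-15,
                          flux_cm2_h=13.0,
                          work_per_hour=3.6e6))
```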

12.
The interconnection structures in FPGA devices contribute an increasing share of delay, power consumption, and area overhead. The demand for ever higher clock frequencies makes this problem even more important. Three-dimensional (3-D) chip stacking is touted as the silver-bullet technology that can keep up the momentum of Moore's law and fuel the next wave of consumer electronics products. However, the benefits of such a new integration paradigm have not yet been sufficiently explored. In this paper, a novel 3-D architecture, as well as the supporting software tools for exploring and evaluating application implementations, is introduced. More specifically, by assigning logic and I/O resources to different layers, we achieve a notable wire-length reduction. Experimental results prove the effectiveness of this choice, since the target architectures outperform conventional 2-D FPGAs.

13.
Image correlation is widely used in image and picture processing. Typical applications of image correlation are object location, image registration, and sub-image similarity measurement. However, image correlation requires the comparison of a large number of sub-images, implying a large computational effort that may prevent its use in real-time applications. On the other hand, correlation computation is very well suited to FPGA implementation. In this work we present efficient architectures for implementing Zero-Mean Normalized Cross-Correlation on FPGAs, with application to image correlation. In particular, we compare implementations of correlation in the spatial and spectral domains. Experimental results demonstrate that FPGAs improve performance by at least two orders of magnitude with respect to software implementations on a modern personal computer. This speed-up makes correlation computation suitable for real-time image processing. The proposed architectures have been applied to a correlation-based fingerprint-matching algorithm, demonstrating that real-time processing requirements can be well satisfied with an FPGA-based implementation.
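The NumPy sketch below spells out the Zero-Mean Normalized Cross-Correlation this abstract maps to hardware, as a plain software reference under the assumption of an exhaustive spatial-domain search; it is not the paper's architecture, and the spectral-domain (FFT) variant is only noted in a comment.

```python
# Software reference for Zero-Mean Normalized Cross-Correlation (ZNCC),
# the operation the abstract accelerates on an FPGA; illustrative only.
import numpy as np

def zncc(image_patch, template):
    """ZNCC between two equally sized windows: subtract each mean, then
    normalize the dot product by both standard deviations."""
    a = image_patch - image_patch.mean()
    b = template - template.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom else 0.0

def zncc_search(image, template):
    """Exhaustive spatial-domain search: score every template-sized window
    of the image (the spectral-domain variant would use FFTs instead when
    templates are large)."""
    th, tw = template.shape
    scores = np.zeros((image.shape[0] - th + 1, image.shape[1] - tw + 1))
    for y in range(scores.shape[0]):
        for x in range(scores.shape[1]):
            scores[y, x] = zncc(image[y:y + th, x:x + tw], template)
    return scores

rng = np.random.default_rng(1)
img = rng.random((64, 64))
tmpl = img[20:28, 30:38].copy()
scores = zncc_search(img, tmpl)
print(np.unravel_index(scores.argmax(), scores.shape))  # expected: (20, 30)
```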

14.
The parallelism attained through the use of Field Programmable Gate Arrays (FPGAs) has shown remarkable potential for accelerating control systems applications. This comes at a time when well-established methods based on inherently serial Central Processing Units (CPUs) cannot guarantee solutions for the increasing execution-speed demands. However, the transition from serial to parallel architectures represents a tremendous challenge due to the overwhelming number of unexplored options and conflicting factors. The work presented achieves a parallelisation characterisation for generic MIMO systems using stand-alone FPGA implementations. The main contribution is that a very fine subset of possible serial/parallel implementations is obtained, which is used to achieve a flexible trade-off between cost and performance. Automatic optimisation of latency, occupied FPGA area and execution speed is attained and justified with respect to most of the feasible scenarios.

15.
An FPGA Interconnect Test Algorithm Based on the Routing Resource Graph
代莉, 梁绍池, 王伶俐. 《计算机工程》, 2009, 35(14): 258-260
The switch-box interconnect resources of SRAM-based FPGAs are analyzed, and an automatically generated, application-independent algorithm for producing test configuration sets is proposed. A routing resource graph is constructed, edge weights are set dynamically according to the direction of each net, and an improved Kruskal algorithm is used to generate the test configuration set automatically. For different FPGA interconnect structures, the algorithm achieves 100% coverage of open and short faults in the interconnect resources, and it offers a small number of test configurations, fast runtime, and independence from the specific hardware structure.
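Since the abstract does not detail the improved Kruskal algorithm, the sketch below only shows the baseline machinery such a test-configuration generator builds on: a standard Kruskal spanning-tree pass with union-find over a toy routing-resource graph. The graph, node names, and weights are hypothetical, and the paper's direction-dependent dynamic weighting is not reproduced.

```python
# Baseline Kruskal pass over a toy routing-resource graph (illustrative
# assumption; not the paper's improved algorithm or its weighting scheme).

class DisjointSet:
    def __init__(self, nodes):
        self.parent = {n: n for n in nodes}
    def find(self, n):
        while self.parent[n] != n:
            self.parent[n] = self.parent[self.parent[n]]  # path halving
            n = self.parent[n]
        return n
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True

def kruskal(nodes, weighted_edges):
    """weighted_edges: iterable of (weight, u, v); returns a spanning
    forest as a list of chosen wire-segment connections."""
    ds, tree = DisjointSet(nodes), []
    for _, u, v in sorted(weighted_edges):
        if ds.union(u, v):
            tree.append((u, v))
    return tree

# Toy switch box: nodes are wire segments, edges are programmable
# interconnect points; the paper would set weights from net direction.
nodes = ["N0", "E0", "S0", "W0"]
edges = [(1, "N0", "E0"), (2, "E0", "S0"), (1, "S0", "W0"), (3, "W0", "N0")]
print(kruskal(nodes, edges))
```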

16.
Significant advances in field-programmable gate arrays (FPGAs) have made it viable to explore innovative multiprocessor solutions on a single FPGA chip. For multiprocessors, an efficient communication network that matches the needs of the target application is always critical to overall performance. Wormhole packet-switching network-on-chip (NoC) solutions are replacing conventional shared buses to deal with the scalability and complexity challenges that come with the increasing number of processing elements (PEs). However, the quest for high-performance networks has led to very complex and resource-expensive NoC designs, leaving little room for the real computing force, i.e., the PEs. Moreover, many techniques offer very small performance gains, or none at all, when network traffic is light, while increasing the resource usage of routers. We argue that computation is still the primary task of multiprocessors and that sufficient resources should be reserved for PEs. This paper presents our novel design and implementation of a resource-efficient communication network for multiprocessors on FPGAs. We reduce not only the number of routers required for a given number of PEs, by introducing a new PE-router topology, but also the resource requirements of each router. Our communication network relies on NEWS channels to transfer packets in a pipelined fashion along the path determined by the routing network. Implementation results on various Xilinx FPGAs show good performance in the typical range of network load for multiprocessor applications.

17.
Modern field programmable gate array (FPGA) chips, with their larger memory capacity and reconfigurability, are opening new frontiers in the rapid prototyping of embedded systems. With the advent of high-density FPGAs, it is now possible to implement a high-performance VLIW (very long instruction word) processor core in an FPGA. With a VLIW architecture, processor effectiveness depends on the ability of the compiler to extract sufficient ILP (instruction-level parallelism) from the program code. This paper describes research results on enabling the VLIW processor model for real-time processing applications by exploiting FPGA technology. Our goals are to keep the flexibility of processors in order to shorten the development cycle, and to use the powerful FPGA resources to increase real-time performance. We present a flexible VLIW VHDL processor model with a variable instruction set and a customizable architecture, which allows the intrinsic parallelism of a target application to be exploited using advanced compiler technology and implemented in an optimal manner on an FPGA. Some common image processing algorithms were tested and validated using the proposed development cycle. We also realized rapid prototyping of embedded contactless palmprint extraction on a Virtex-6 FPGA based board for a biometric application and obtained a processing time of 145.6 ms per image. Our approach satisfies key criteria for co-design tools: flexibility, modularity, performance, and reusability.

18.
Mangione-Smith, W.H. Computer, 1997, 30(10): 115-117
Configurable computing systems enhance traditional computing systems through the addition of programmable hardware. Configurable computing offers the opportunity to change the hardware/software partition at run time by reprogramming the hardware. Recent research has shifted to CAD and application development tools. Almost all existing configurable computing systems are based on field-programmable gate arrays (FPGAs). These devices implement reasonably arbitrary digital circuits, and this flexibility allows us to think of FPGA-based configurable computing systems as netlist computers. The configurable computing approach integrates FPGAs as an intimate and fundamental component of the computing system, rather than relegating them to their earlier role of supporting system prototyping and low-volume production. However, the author believes that automated approaches to the design of configurable computing systems are premature because they do not pay enough attention to performance.

19.
Hardware Simulation and Verification in General-Purpose Processor Design
Dynamic RTL simulation remains the primary method for verifying very-large-scale integrated circuits. When dynamic simulation is used for the functional verification of a design as large as a general-purpose microprocessor, simulation speed becomes the bottleneck. The usual solution is FPGA-based physical prototyping, which can exercise a large number of test vectors in a short time, but the debuggability of FPGA physical prototypes is poor. To address this problem, a three-level hierarchical simulation and verification environment is proposed, with hardware-emulator acceleration as the middle-layer solution, which both raises simulation speed and provides a good debugging environment. In addition, an improved Kernighan-Lin (K-L) algorithm is proposed for partitioning large designs across multiple FPGAs, optimizing FPGA utilization and inter-chip interconnect.
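For background on the partitioning step, here is a simplified greedy variant of the Kernighan-Lin (K-L) swap heuristic that the improved K-L algorithm mentioned above builds on for multi-FPGA logic partitioning. The toy netlist is hypothetical, and the paper's specific improvements targeting FPGA utilization and inter-chip interconnect are not reproduced.

```python
# Simplified greedy variant of the Kernighan-Lin swap-gain idea for
# bipartitioning a netlist (illustrative; not the paper's improved K-L).

def kl_pass(adj, part_a, part_b):
    """adj: dict node -> set of neighbours; part_a/part_b: disjoint sets.
    Repeatedly swaps the node pair with the best positive gain (reduction
    in cut edges) until no improving swap remains."""
    a, b = set(part_a), set(part_b)

    def d(n, own, other):  # external minus internal connection cost
        return len(adj[n] & other) - len(adj[n] & own)

    improved = True
    while improved:
        improved = False
        best = (0, None, None)
        for x in a:
            for y in b:
                gain = d(x, a, b) + d(y, b, a) - 2 * (y in adj[x])
                if gain > best[0]:
                    best = (gain, x, y)
        if best[1] is not None:
            _, x, y = best
            a.remove(x); b.remove(y)
            a.add(y); b.add(x)
            improved = True
    return a, b

# Toy netlist: two natural clusters {0,1,2} and {3,4,5} with one cross edge.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(kl_pass(adj, {0, 3, 4}, {1, 2, 5}))  # converges to {3,4,5} vs {0,1,2}
```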

20.
We propose a technology for server selection, configuration, reconfiguration, and automatic performance verification that meets user functional and performance requirements on various types of cloud compute servers. "Various servers" means not only virtual machines on ordinary CPU servers but also container or bare-metal instances on powerful graphics processing unit (GPU) servers or on field programmable gate arrays (FPGAs) configured to accelerate specific computations. Early cloud systems were composed of many PC-like servers, and virtual machines on these servers used distributed processing technology to achieve high computational performance. However, recent cloud systems are changing to make the best use of advances in hardware. It is well known that bare-metal and container performance is better than that of virtual machines, and dedicated processing servers, such as powerful GPU servers for graphics processing and FPGA servers for specialized computation, have increased in number. Our objective in this study was to enable cloud providers to provision compute resources on appropriate hardware based on user requirements, so that users can easily benefit from high application performance. Our proposed technology selects appropriate servers for user compute resources from various types of hardware, such as GPUs and FPGAs, or sets appropriate configurations or reconfigurations of FPGAs to exploit the hardware. Furthermore, it automatically verifies the performance of provisioned systems. We measured provisioning and automatic performance verification times to show the effectiveness of our technology.
