首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 125 毫秒
1.
RNA二级结构预测是生物信息学领域重要的研究方向,基于最小自由能模型的Zuker算法是目前该领域最典型使用最广泛的算法之一.基于FPGA平台实现了一种细粒度的并行Zuker算法,采用按矩阵列循环划分的任务分配策略实现了处理单元间的负载平衡;采用数据预取、滑动窗口和数据传递流水线实现了处理单元间的数据重用;采用曲线拟合、离散点赋值和地址空间压缩编码等策略减少了约85%的自由能参数存储需求.在单片FPGA上集成了由20个PE构成的主从多PE线性阵列,实验结果表明与运行在AMD四核9650处理器上的ViennaRNA-1.6.5程序相比,可获得超过18倍的加速效果,并且FPGA加速器功耗仅为通用微处理器平均功耗的1/5.  相似文献   

2.
在讨论了逆QR分解(逆正交三角分解)SM(I采样矩阵求逆)自适应波束形成算法的基础上,研究了逆QR分解SMI算法的Systolic阵列(脉动阵列)并行实现结构,分析了组成Systolic阵列的各PE(处理单元)单元的基本运算模块的实现,并给出了逆QR分解SMI算法基于Systolic阵列结构的FPGA(现场可编程门阵列)并行实现方法,提出了系统整体的设计与构架。  相似文献   

3.
针对实时图像采集系统数据量大,实时性强的特点,提出了一种基于FPGA的解决方案;在单片FPGA芯片上完成整个系统的软硬件设计,集成度高,可靠性强;图像的采集、存储及显示都采用硬件逻辑实现,此外,用逻辑实现处理算法,经几个时钟周期的延时就完成了图像处理,充分体现了FPGA并行处理的优势;实验表明,该系统较好地满足了系统的实时处理要求。  相似文献   

4.
针对通信安全问题,采用自顶向下的设计方法,设计了一种RC4算法基于FPGA的实现方式,实现了通信数据的加密传输。根据RC4加密算法的原理和设计流程,使用Verilog HDL编程语言,采用有限状态机(FSM)的编程方式实现算法,通过Modelsim SE 10.1a仿真软件进行仿真,并在FPGA开发板上进行验证。采用本文提出的FPGA设计方法实现的RC4加密算法相比软件加密方式和已有的FPGA实现方式速度有明显提高。  相似文献   

5.
基于FPGA的神经网络自整定PID控制器设计   总被引:3,自引:0,他引:3  
本文基于FPGA(现场可编程门阵列)技术实现了改进的BP网络自整定PID控制器的设计。首先,采用MATLAB设计控制器,针对特定被控对象模型,在闭环控制系统中通过改进的BP网络算法训练神经网络,获得比较理想的系统输出;依据训练好的网络权值,在FPGA集成开发环境下,基于VHDL(甚高速集成电路硬件描述语言)设计BP网络自整定PID控制器,完成时序仿真测试,并在一种具体的FPGA器件上实现。实验表明,其设计过程合理,实现结果正确,适合于采用复杂智能控制策略并要求实时性、快速性的单片或小型控制系统。  相似文献   

6.
随着网络带宽和报文转发线速的快速增长,报文分类成为各种网络应用中的关键技术。早期的报文分类算法无法满足大规模规则集和高吞吐量的需求,因此提出一个启发式高效比特选择报文分类算法,采用局部最优策略动态选择比特来建立决策树,无需复制规则,能够节省存储资源。同时,基于该算法并结合硬件特性设计了一种基于现场可编程逻辑门阵列(Field Programmable Gate Array,FPGA)的多流水线架构分类器。实验结果表明,基于单片FPGA可以对64字节报文数据包实现超过400 Gb/s的吞吐量,并且支持128 000条五元组规则集。  相似文献   

7.
针对目前模式匹配算法多采用软件实现,而软件实现效率低下的弊端,提出了一种基于硬件实现模式匹配算法的设计方案.综合Aho-Corasick(AC)算法原理和FPGA硬件特点,在FPGA上实现AC算法;然后利用Quartus Ⅱ对设计进行了验证和性能分析.实验结果表明,基于硬件实现的Aho-Corasick(AC)算法的效...  相似文献   

8.
基于随机上下文无关文法(SCFG)理论模型进行RNA二级结构预测是目前采用计算方法研究RNA二级结构的一种重要途径.由于基于SCFG模型的标准结构预测算法(Coche-Younger-Kasami,CYK)巨大的时空复杂度,对CYK算法进行加速成为计算生物学领域一个极具挑战性的热点问题.CYK的并行性能受限于算法多维度、非一致性的数据依赖关系和较低的计算/通信比,现有的基于通用微处理器结构的大规模并行处理方案不能获得令人满意的加速效果,并且大规模并行计算机系统硬件设备的购置、使用、日常维护的成本高昂,其适用性受到诸多限制.文中在深入分析CYK算法计算特征的基础上,基于FPGA平台提出并实现了一种细粒度的并行CYK算法.设计采用了对三维动态规划矩阵按区域分割和逐层按列并行处理的计算策略实现了多个处理单元间的负载均衡;采用数据预取、滑动窗口和数据传递流水线实现处理单元间的数据重用,有效解决了计算和通信间的平衡问题;设计了一种类似脉动阵列(systolic-like array)结构的主从多PE并行计算阵列,并在目前最大规模的FPGA芯片(Xilinx XC5VLX330)上成功集成了16个处理单元(process...  相似文献   

9.
针对K-means聚类算法受初始类中心影响,聚类结果容易陷入局部最优导致聚类准确率较低的问题,提出了一种基于自适应布谷鸟搜索的K-means 聚类改进算法,并利用MapReduce编程模型实现了改进算法的并行化。通过搭建的Hadoop分布式计算平台对不同样本数据集分别进行10次准确性实验和效率实验,结果表明:(1)聚类的平均准确率在实验所采用的4种UCI标准数据集上,相比原始K-means聚类算法和基于粒子群优化算法改进的K-means聚类算法都有所提高;(2) 聚类的平均运行效率在实验所采用的5种大小递增的随机数据集上,当数据量较大时,显著优于原始K-means串行算法,稍好于粒子群优化算法改进的并行K-means聚类算法。可以得出结论,在大数据情景下,应用该算法的聚类效果较好。  相似文献   

10.
介绍了一种基于FPGA数字电路设计的PID直流电机控制器;实现了对直流电机的位置控制。相对比单片机设计的控制器,采用了单片的FPGA芯片实现磁栅反馈信号的采集处理、PID控制器设计以及I2C从机接口设计。整个设计采用了Verilog HDL语言编写,对改进的PID算法、反馈信号采集以及I2C从机接口电路进行设计和仿真。仿真和实验结果表明该控制器的有效性和可行性。  相似文献   

11.
Significant advances in field-programmable gate arrays (FPGAs) have made it viable to explore innovative multiprocessor solutions on a single FPGA chip.For multiprocessors,an efficient communication network that matches the needs of the target application is always critical to the overall performance.Wormhole packet-switching network-on-chip (NoC) solutions are replacing conventional shared buses to deal with scalability and complexity challenges coming along with the increasing number of processing elements (PEs).However,the quest for high performance networks has led to very complex and resource-expensive NoC designs,leaving little room for the real computing force,i.e.,PEs.Moreover,many techniques offer very small performance gains or none at all when network traffic is light while increasing the resource usage of routers.We argue that computation is still the primary task of multiprocessors and sufficient resources should be reserved for PEs.This paper presents our novel design and implementation of a resource-efficient communication network for multiprocessors on FPGAs.We reduce not only the required number of routers for a given number of PEs by introducing a new PE-router topology,but also the resource requirement of each router.Our communication network relies on the NEWS channels to transfer packets in a pipelined fashion following the path determined by the routing network.The implementation results on various Xilinx FPGAs show good performance in the typical range of network load for multiprocessor applications.  相似文献   

12.
大规模QR分解在信号处理、图像处理、计算结构力学等领域有着广泛的应用。大规模矩阵QR分解主要在高性能并行机上进行运算,目前还没有基于FPGA平台的加速实现。本文在分析快速Givens Rotation QR分解算法特征的基础上,提出并实现了一种细粒度并行QR分解算法,并在Altera StratixⅡ FPGA平台上实现可扩展QR分解线性阵列处理器。相对于单处理单元,该阵列处理器可取得近似线性加速比,显示了良好的可扩展性。在100 MHz频率下的性能测试结果表明,相对于2.0GHz的Pentium双核通用微处理器,该阵列处理器可取得19倍的加速比。  相似文献   

13.
The Finite Element Method (FEM) is a computationally intensive scientific and engineering analysis tool that has diverse applications ranging from structural engineering to electromagnetic simulation. The trends in floating-point performance are moving in favor of Field-Programmable Gate Arrays (FPGAs), hence increasing interest has grown in the scientific community to exploit this technology. We present an architecture and implementation of an FPGA-based sparse matrix-vector multiplier (SMVM) for use in the iterative solution of large, sparse systems of equations arising from FEM applications. FEM matrices display specific sparsity patterns that can be exploited to improve the efficiency of hardware designs. Our architecture exploits FEM matrix sparsity structure to achieve a balance between performance and hardware resource requirements by relying on external SDRAM for data storage while utilizing the FPGAs computational resources in a stream-through systolic approach. The architecture is based on a pipelined linear array of processing elements (PEs) coupled with a hardware-oriented matrix striping algorithm and a partitioning scheme which enables it to process arbitrarily big matrices without changing the number of PEs in the architecture. Therefore, this architecture is only limited by the amount of external RAM available to the FPGA. The implemented SMVM-pipeline prototype contains 8 PEs and is clocked at 110 MHz obtaining a peak performance of 1.76 GFLOPS. For 8 GB/s of memory bandwidth typical of recent FPGA systems, this architecture can achieve 1.5 GFLOPS sustained performance. Using multiple instances of the pipeline, linear scaling of the peak and sustained performance can be achieved. Our stream-through architecture provides the added advantage of enabling an iterative implementation of the SMVM computation required by iterative solution techniques such as the conjugate gradient method, avoiding initialization time due to data loading and setup inside the FPGA internal memory.  相似文献   

14.
We present two designs (I and II) for IEEE 754 double precision floating point matrix multiplication, optimized for implementation on high-end FPGAs. It forms the kernel in many important tile-based BLAS algorithms, making an excellent candidate for acceleration. The designs, both based on the rank-1 update scheme, can handle arbitrary matrix sizes, and are able to sustain their peak performance except during an initial latency period. Through these designs, the trade-offs involved in terms of local-memory and bandwidth for an FPGA implementation are demonstrated and an analysis is presented for the optimal choice of design parameters. The designs, implemented on a Virtex-5 SX240T FPGA, scale gracefully from 1 to 40 processing elements(PEs) with a less than 1% degradation in the design frequency of 373 MHz. With 40 PEs and a design speed of 373 MHz, a sustained performance of 29.8 GFLOPS is possible with a bandwidth requirement of 750 MB/s for design-II and 5.9 GB/s for design-I. This compares favourably with both related art and general purpose CPU implementations.  相似文献   

15.
针对可重构阵列处理器访存数据量大、数据并行性要求高且数据全局重用少、局部性明显的特点,提出了一种分布式Cache结构的簇内局部优先高效互连访问结构,该结构实现了簇内4×4个PE对4×4个Cache的并行访问,选用Xilinx公司的ZYNQ系列芯片XC7Z045 FFG900-2进行FPGA综合。在无冲突情况下,该互连结构支持簇内16个PE的同时读/写访问,最高频率可达221 MHz,访存峰值带宽为7.6 GB/s。在此结构上实现了灰度共生矩阵提取纹理图像特征算法,数据访存带宽达到478.125 MB/s,运行时间为0.24 ms。  相似文献   

16.
QR and LU decompositions are the most important matrix decomposition algorithms. Many studies work on accelerating these algorithms by FPGA or ASIC in a case by case style. In this paper, we propose a unified framework for the matrix decomposition algorithms, combining three QR decomposition algorithms and LU algorithm with pivoting into a unified linear array structure. The QR and LU decomposition algorithms exhibit the same two-level loop structure and the same data dependency. Utilizing the similarities in loop structure and data dependency of matrix decomposition, we unify a fine-grained algorithm for all four matrix decomposition algorithms. Furthermore, we present a unified co-processor structure with a scalable linear array of processing elements (PEs), in which four types of PEs are same in the structure of memory channels and PE connections, but the only difference exists in the internal structure of data path. Our unified co-processor, which is IEEE 32-bit floating-point precision, is implemented and mapped onto a Xilinx Virtex5 FPGA chip. Experimental results show that our co-processors can achieve speedup of 2.3 to 14.9 factors compared to a Pentium Dual CPU with double SSE threads.  相似文献   

17.
AVS标准是由2002年6月成立的"数字音视频编解码技术标准工作组"联合国内从事数字音视频编解码技术研发的科研机构和企业制定完成的,一套适应面十分广阔的技术标准。目前,视频解码器的实现的主要方法有:1)基于PC的软件实现;2)基于DSP的嵌入式系统实现;3)基于可编程逻辑器件的专用芯片实现。通用PC机非专用于视频处理,所以实现效率不高,而DSP虽然灵活性强,但是在性能以及性价比上不及FPGA。因此,FPGA平台是目前实现视频应用系统的理想平台。介绍AVS视频压缩标准,帧内预测部分的算法,帧内预测器系统的硬件实现;给出系统仿真和综合情况。  相似文献   

18.
Partial Runtime Reconfigurable (PRTR) FPGAs allow HW tasks to be placed and removed dynamically at runtime. We make two contributions in this paper. First, we present an efficient algorithm for finding the complete set of Maximal Empty Rectangles on a 2D PRTR FPGA. We also present a HW implementation of the algorithm with negligible runtime overhead. Second, we present an efficient online deadline-constrained task placement algorithm for minimizing area fragmentation on the FPGA by using an area fragmentation metric that takes into account probability distribution of sizes of future task arrivals as well as the time axis. The techniques presented in this paper are useful in an operating system for runtime reconfigurable FPGAs to manage the HW resources on the FPGA when HW tasks that arrive and finish dynamically at runtime.  相似文献   

19.
在许多应用领域中,大规模浮点矩阵乘法往往是最耗时的计算核心之一。在新兴的应用中经常存在至少有一个维度很小的大规模矩阵,我们把具备这种特性的矩阵称为非均匀矩阵。由于FPGA上用以存储中间结果的片上存储器容量十分有限,计算大规模矩阵乘法时往往需要将矩阵划分成细粒度的子块计算任务。当加速非均匀矩阵乘法时,由于只支持固定分块大小,大多数现有的线性阵列结构的硬件矩阵乘法器将遭受很大的性能下降。为了解决这个问题,提出了一种有效的优化分块策略。在此基础上,在Xilinx公司的Zynq XC7Z045FPGA芯片上实现了一个支持可变分块的矩阵乘法器。通过集成224个处理单元,该矩阵乘法器在150 MHz的时钟频率下对于实际应用中的非均匀矩乘达到了48GFLOPS的实测性能,而所需带宽仅为4.8GB/s。实验结果表明,我们提出的分块策略相比于传统的分块算法实现了高达12%的性能提升。  相似文献   

20.
The authors examine the design, implementation, and experimental analysis of parallel priority queues for device and network simulation. They consider: 1) distributed splay trees using MPI; 2) concurrent heaps using shared memory atomic locks; and 3) a new, more general concurrent data structure based on distributed sorted lists, designed to provide dynamically balanced work allocation and efficient use of shared memory resources. We evaluate performance for all three data structures on a Cray-TSESOO system at KFA-Julich. Our comparisons are based on simulations of single buffers and a 64×64 packet switch which supports multicasting. In all implementations, PEs monitor traffic at their preassigned input/output ports, while priority queue elements are distributed across the Cray-TBE virtual shared memory. Our experiments with up to 60000 packets and two to 64 PEs indicate that concurrent priority queues perform much better than distributed ones. Both concurrent implementations have comparable performance, while our new data structure uses less memory and has been further optimized. We also consider parallel simulation for symmetric networks by sorting integer conflict functions and implementing a packet indexing scheme. The optimized message passing network simulator can process ~500 K packet moves in one second, with an efficiency that exceeds ~50 percent for a few thousand packets on the Cray-T3E with 32 PEs. All developed data structures form a parallel library. Although our concurrent implementations use the Cray-TSE ShMem library, portability can be derived from Open-MP or MP1-2 standard libraries, which will provide support for one-way communication and shared memory lock mechanisms  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号