Similar literature
20 similar documents found (search time: 15 ms)
1.
Resampling algorithms and architectures for distributed particle filters   Cited by: 7 (self-citations: 0, citations by others: 7)
In this paper, we propose novel resampling algorithms with architectures for efficient distributed implementation of particle filters. The proposed algorithms improve the scalability of the filter architectures affected by the resampling process. Problems in the particle filter implementation due to resampling are described, and appropriate modifications of the resampling algorithms are proposed so that distributed implementations are developed and studied. Distributed resampling algorithms with proportional allocation (RPA) and nonproportional allocation (RNA) of particles are considered. The components of the filter architectures are the processing elements (PEs), a central unit (CU), and an interconnection network. One of the main advantages of the new resampling algorithms is that communication through the interconnection network is reduced and made deterministic, which results in a simpler network structure and increased sampling frequency. Particle filter performance is estimated for bearings-only tracking applications. In the architectural part of the analysis, the area and speed of the particle filter implementation are estimated for different numbers of particles and different levels of parallelism with a field programmable gate array (FPGA) implementation. In this paper, only sampling importance resampling (SIR) particle filters are considered, but the analysis can be extended to any particle filter with resampling.

2.
In this paper, we describe resource-efficient hardware architectures for software-defined radio (SDR) front-ends. These architectures are made efficient by using a polyphase channelizer that performs arbitrary sample rate changes, frequency selection, and bandwidth control. We discuss area, time, and power optimization for field programmable gate array (FPGA) based architectures in an M-path polyphase filter bank with a modified N-path polyphase filter. Such systems allow resampling by arbitrary ratios while simultaneously performing baseband aliasing from center frequencies at Nyquist zones that are not multiples of the output sample rate. A non-maximally decimated polyphase filter bank, where the number of data loads is not equal to the number of M subfilters, processes M subfilters in a time period that is either less than or greater than the M data-load’s time period. We present a load-process architecture (LPA) and a runtime architecture (RA) (based on a serial polyphase structure) which have different scheduling. In LPA, N subfilters are loaded, and then M subfilters are processed at a clock rate that is a multiple of the input data rate. This is necessary to meet the output time constraint of the down-sampled data. In RA, the M subfilter processes are efficiently scheduled within the N data-load time while simultaneously loading N subfilters. This requires reduced clock rates compared with LPA, and potentially less power is consumed. A polyphase filter bank that uses different resampling factors for maximally decimated, under-decimated, over-decimated, and combined up- and down-sampled scenarios is used as a case study, and an analysis of area, time, and power for their FPGA architectures is given. For resource-optimized SDR front-ends, RA is superior for reducing operating clock rates and dynamic power consumption. RA is also superior for reducing area resources, except when indices are pre-stored in LUTs.

3.
Packet classification is a fundamental process in provisioning security and quality of service for many intelligent network-embedded systems running in the Internet of Things (IoT). In recent years, researchers have tried to develop hardware-based solutions for the classification of Internet packets. Due to their higher throughput and shorter delays, these solutions are considered a major key to improving the quality of service. Most of these efforts have attempted to implement a software algorithm on an FPGA to reduce the processing time and enhance the throughput. The proposed architectures, however, cannot reach a compromise among power consumption, memory usage, and throughput rate. In view of this, the architecture proposed in this paper contains a pipeline-based micro-core that is used in network processors to classify packets. To this end, three architectures have been implemented using the proposed micro-core. The first architecture performs parallel classification based on header fields. The second one classifies packets in a serial manner. The last architecture is the pipeline-based classifier, which can increase performance by nine times. The proposed architectures have been implemented on an FPGA chip. The results indicate a reduction in memory usage as well as an increase in speedup and throughput. The architecture has a power consumption of 1.294 W, and its throughput at a frequency of 233 MHz exceeds 147 Gbps.

4.

The most challenging aspect of particle filtering hardware implementation is the resampling step. This is because of its high latency: it can be only partially executed in parallel with the other steps of particle filtering and has no inherent parallelism inside it. To reduce the latency, an improved resampling architecture is proposed which involves pre-fetching from the weight memory in parallel to the fetching of a value from a random function generator, along with architectures for realizing the pre-fetch technique. This enables a particle filter using M particles with otherwise streaming operation to accept new inputs more often than every 2M cycles, which is what the previously best approach gives. Results show that a pre-fetch buffer of five values achieves the best area-latency reduction trade-off while on average achieving an 85% reduction in latency for the resampling step, leading to a sample time reduction of more than 40%. We also propose a generic division-free architecture for the resampling steps. It also removes the need for explicitly ordering the random values for efficient multinomial resampling implementation. In addition, on-the-fly computation of the cumulative sum of weights is proposed, which helps reduce the word length of the particle weight memory. FPGA implementation results show that the memory size is reduced by up to 50%.


5.
The embedded block coding with optimized truncation (EBCOT) algorithm is the heart of the JPEG 2000 image compression system. The MQ coder used in this algorithm restricts the throughput of EBCOT because there is very high correlation among all the procedures performed in it. To overcome this obstacle, a high-throughput MQ coder architecture is presented in this paper. To accomplish this, we have studied the number of rotations performed and the rate of byte emission in an image. This study reveals that, on average, one shift occurs 75.03% of the time and two shifts occur 22.72% of the time. Similarly, two bytes are emitted concurrently about 5.5% of the time. Based on these facts, a new MQ coder architecture is proposed which is capable of consuming one symbol per clock cycle. The throughput of this coder is improved by operating the renormalization and byte-out stages concurrently. To reduce the hardware cost, synchronous shifters are used instead of hard shifters. The proposed architecture is implemented on a Stratix FPGA and is capable of operating at 145.9 MHz. The memory requirement of the proposed architecture is reduced by a minimum of 66% compared to those of the other existing architectures. A relative figure of merit is computed to compare the overall efficiency of all architectures, which shows that the proposed architecture provides a good balance between throughput and hardware cost.

6.
In this paper, we analyze algorithmic and architectural characteristics of a class of particle filters known as Gaussian Particle Filters (GPFs). GPFs approximate the posterior density of the unknowns with a Gaussian distribution, which limits the scope of their applications in comparison with the universally applied sample-importance resampling filters (SIRFs) but allows for their implementation without the classical resampling procedure. Since there is no need for resampling, we propose a modified GPF algorithm that is suitable for parallel hardware realization. Based on the new algorithm, we propose an efficient parallel and pipelined architecture for GPF that is superior to similar architectures for SIRF in the sense that it requires no memories for storing particles and exchanges very little data through the communication network. We analyze the GPF on the bearings-only tracking problem and the results are compared with results obtained by SIRF in terms of computational complexity, potential throughput, and hardware energy. We consider implementation on FPGAs and we perform a detailed comparison of the GPF and SIRF algorithms implemented in different ways on this platform. GPFs that are implemented in a parallel pipelined fashion on FPGAs can support higher sampling rates than SIRFs, and as such they might be a more suitable candidate for real-time applications.
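The idea of replacing resampling with a Gaussian approximation can be sketched as follows. This is a minimal illustration for a scalar state, not the modified GPF algorithm of the paper: the weighted particles are summarized by a mean and a variance, and fresh particles are then drawn from that Gaussian. The function names `gaussian_moments` and `redraw_particles` are hypothetical.

```python
import random


def gaussian_moments(particles, weights):
    """Weighted mean and variance of scalar particles (weights need not sum to 1)."""
    total = sum(weights)
    mean = sum(w * x for x, w in zip(particles, weights)) / total
    var = sum(w * (x - mean) ** 2 for x, w in zip(particles, weights)) / total
    return mean, var


def redraw_particles(mean, var, count, rng):
    """Draw a fresh particle set from the Gaussian approximation of the posterior."""
    std = var ** 0.5
    return [rng.gauss(mean, std) for _ in range(count)]


mean, var = gaussian_moments([0.0, 2.0], [1.0, 1.0])
new_particles = redraw_particles(mean, var, 5, random.Random(0))
```

Because only the two moments are carried from one time step to the next, the old particle set does not have to be stored, which is consistent with the abstract's remark that no particle memories are required.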

7.
In this paper, we introduce a hierarchical resampling (HR) algorithm and architecture for distributed particle filters (PFs). While maintaining the same statistical accuracy as centralized resampling, the proposed HR algorithm decomposes the resampling step into two hierarchies, intermediate resampling (IR) and unitary resampling (UR), which makes PFs suitable for distributed hardware implementation. Also presented is a residual cumulative resampling (RCR) method that pipelines and accelerates the UR step. The corresponding architecture, when compared with traditional distributed architectures, eliminates the particle redistribution step and has advantages such as short execution time and high memory efficiency. A prototype containing 8 PEs has been developed on a Xilinx Virtex IV FPGA (XC4VFX100-12FF1152) for the bearings-only tracking (BOT) problem, and the results show that the input observations can be processed at 37.21 kHz with 8 K particles and a clock speed of 80 MHz.
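The figures quoted in this abstract imply a per-particle cycle budget, which the short calculation below works out. It assumes that "8 K" means 8192 particles and that they are split evenly over the 8 PEs; neither assumption is stated in the abstract.

```python
CLOCK_HZ = 80e6            # clock speed from the abstract
OBS_RATE_HZ = 37.21e3      # observation processing rate from the abstract
NUM_PARTICLES = 8 * 1024   # assumption: "8 K" means 8192
NUM_PES = 8                # processing elements in the prototype

cycles_per_observation = CLOCK_HZ / OBS_RATE_HZ
particles_per_pe = NUM_PARTICLES // NUM_PES       # assumption: even split
cycles_per_particle = cycles_per_observation / particles_per_pe

print(round(cycles_per_observation), particles_per_pe, round(cycles_per_particle, 2))
# prints: 2150 1024 2.1
```

Under these assumptions each PE spends roughly 2.1 clock cycles per particle per observation.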

8.
徐礼胜 《电子器件》2012,35(4):406-411
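To meet the signal acquisition requirements of ultrasound imaging systems, this paper presents the design and implementation of an FPGA-based multi-channel data acquisition and transmission system. Using the ADS6122, a 12-bit A/D conversion circuit with a maximum single-channel sampling frequency of 65 MHz is implemented. The system uses an FPGA for logic control and supports single-channel acquisition of high-frequency signals and simultaneous multi-channel acquisition of low-frequency signals. System test results show that when the single-channel analog input frequency does not exceed 7 MHz, the achieved sampling speed and sampling accuracy meet the demanding requirements of ultrasound signal acquisition. The system can also serve as a reference for the design of related multi-channel signal acquisition systems.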

9.
We summarize our recent state-of-the-art programmable and reconfigurable detector and QR decomposition (QRD) implementations targeting 3G long term evolution (LTE) downlink and uplink requirements. The downlink transmission is based on orthogonal frequency division multiplexing, whereas the uplink transmission uses single-carrier frequency-division multiple access. The downlink implementations are based on the programmable transport triggered architecture (TTA), which provides a flexible and energy-efficient architecture template. In the TTA detector implementation, the LTE detection rate requirements for up to 20 MHz bandwidth and a 4 × 4 antenna system with 64-QAM are achieved by using 1–6 programmable cores in parallel. Each core runs at a 277 MHz clock frequency and consumes 55.5–64.0 mW depending on the detector configuration. The downlink detector is based on the selective spanning with fast enumeration algorithm. The uplink field-programmable gate array (FPGA) detector implementation targets a 4 × 4 antenna system with 64-QAM and achieves the detection rate requirement for 20 MHz bandwidth. The FPGA used for the uplink implementation is a Xilinx Virtex-6, and the implementation has been carried out using the Xilinx Vivado high-level synthesis tool. Two different detector architectures are implemented. The first one achieves the detection rate requirement with a single processing block running at 231 MHz and the second one with four blocks in parallel, each running at 247 MHz. The implemented detector is based on the K-best algorithm. A multiple-input multiple-output receiver requires QRD to produce valid inputs for the detector. In addition to the detector implementations, QRD is also implemented on both TTA and FPGA. The modified Gram–Schmidt algorithm is used in both QRD implementations.

10.
This paper describes a novel control system processor architecture based on ΔΣ modulation, known as the ΔΣ-CSP. The ΔΣ-CSP utilizes 1-bit processing, which is a new concept in digital control applications with the direct benefit of making multi-bit multiplication operations redundant. A simple conditional-negate-and-add (CNA) unit is instead used for operations in control law implementations. For this reason, the proposed processor has a very small silicon footprint and runs at very high frequencies, making it ideal for high-sampling-rate, real-time control applications. A number of ΔΣ-CSP configurations have been implemented as VLSI hard macros in a high-performance 0.13-μm CMOS process, and a particular configuration achieved a post-route operating frequency of 355 MHz, resulting in a 2.17 MHz sampling rate for a fourth-order control law implementation. Additional results show that the ΔΣ-CSP compares very favorably, in terms of silicon area and sampling rates, to two other specialized digital control processing systems, including direct, hardwired implementation of control laws; at the same time, it substantially outperforms software implementations of control laws running on very wide, general-purpose VLIW architectures.
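The reason 1-bit processing removes the multiplier can be shown in a few lines. In the sketch below, each 1-bit signal sample is assumed to encode +1 (bit 1) or -1 (bit 0), so multiplying it by a multi-bit coefficient reduces to conditionally negating the coefficient before adding it. The encoding and the function name `cna_dot` are assumptions for illustration; the abstract does not describe the CNA unit in this detail.

```python
def cna_dot(coefficients, bitstream):
    """Sum of coefficient * sample for 1-bit samples, using no multiplications.

    bitstream: iterable of 0/1 values; 1 is taken to mean +1 and 0 to mean -1.
    """
    acc = 0
    for coeff, bit in zip(coefficients, bitstream):
        # Conditional negate, then add: this replaces coeff * (+1 or -1).
        acc += coeff if bit else -coeff
    return acc


print(cna_dot([3, -2, 5], [1, 0, 1]))  # prints 10
```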

11.
In this paper, two new architectures for high-speed CMOS wave-pipelined current-mode A/D converters (WP-IADCs) are proposed and analyzed. In the new WP-IADC architectures, wave-pipelining theory is applied to both pipeline structures, called full WP-IADC (FWP-IADC) and indirect transfer WP-IADC (ITWP-IADC). In the FWP-IADC, each stage uses the full current-mode wave-pipelined structure without switched-current cell circuits. In the ITWP-IADC, the switched-current cells are incorporated into the wave-pipelined stages, which are divided into several sections with controlled clocks. Therefore, the proposed ITWP-IADC performs optimally in terms of speed and accuracy among the WP-IADCs. Generally, the proposed WP-IADCs have the advantages of high speed, high input frequency, high efficiency of timing usage, high clock-period flexibility in switched-current cells for precision enhancement, and a reduced number of switched-current cells in the overall data path for linearity improvement. According to the theoretical analysis of the proposed WP-IADC structures, the minimum sampling clock period is proportional to the intrinsic delay of the current mirror and the increased rise/fall time in each wave-pipelined stage. The HSPICE simulation results reveal that, under Nyquist rate sampling at 8-b resolution, sampling rates of 20 and 54 MHz can be achieved for the FWP-IADC and the two-section ITWP-IADC, respectively. If four wave-pipelined sections are used, the ITWP-IADC can be operated at 166 MHz at an input frequency of 8 MHz. To experimentally verify the correct function of the proposed WP-IADC structures, the proposed FWP-IADC architecture is implemented in 0.35-μm CMOS technology. The measurement results successfully demonstrate the feasibility of wave-pipelined IADC architectures in high-speed ADC applications.

12.
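A simplified particle filter algorithm for bearings-only tracking and its hardware implementation   Cited by: 2 (self-citations: 2, citations by others: 0)
To address the heavy computational load and high hardware complexity of particle filtering, this paper proposes a simplified particle filter algorithm for bearings-only tracking. The algorithm introduces a new threshold-based resampling method that reduces the complexity of hardware implementation. Building on the algorithm study, the paper investigates an FPGA-based hardware implementation, presents the overall hardware structure of the system and the implementation of the resampling/sampling module, and discusses resource optimization and timing optimization for particle filter hardware. Simulation results show that, for the bearings-only tracking problem, the proposed particle filter outperforms the extended Kalman filter (EKF). Hardware experiments show that the filter can perform bearings-only tracking of passive targets and processes data faster than a generic particle filter.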

13.
《Microelectronics Journal》2014,45(6):690-701
Recent studies have verified the efficiency of the stochastic state point process filter (SSPPF) for coefficient tracking in the modeling of the mammalian nervous system. In this study, a hardware architecture for the SSPPF is designed and implemented on a field-programmable gate array (FPGA). It provides a time-efficient method to investigate nonlinear neural dynamics through coefficient tracking of a generalized Laguerre–Volterra model describing the spike train transformations of different brain sub-regions. The proposed architecture is able to process matrices and vectors of arbitrary sizes. It is designed to be scalable in its degree of parallelism and to provide different customizable levels of parallelism by exploiting the intrinsic parallelism of the FPGA. Multiple architectures with different degrees of parallelism are explored. This design maintains numerical precision, and the proposed parallel architectures for coefficient estimation are also much more power efficient.

14.
朱彬  朱晓章  杨仕甫  许媛 《现代雷达》2012,34(10):28-31
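An optimized design method for a variable fractional-delay wideband digital filter is proposed. The method first raises the sampling rate by interpolation, which reduces the normalized bandwidth of the signal, then uses a Farrow structure to realize the fractional delay, and finally restores the original sampling rate by decimation. The implementation uses a cascaded structure based on polyphase filtering in which the interpolation and decimation cancel each other, which lowers the filter order and improves computational efficiency. Using FPGA-based parallel distributed arithmetic, two designs are developed, one that exploits the structural features of the device and one that is independent of the device characteristics, to implement high-speed, high-order wideband fractional-delay filters in the time domain. Both are verified by simulation on an Altera Stratix FPGA, with maximum operating frequencies of 184 MHz and 119 MHz, respectively.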
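For readers unfamiliar with the Farrow structure mentioned in this abstract, the following Python sketch shows its core idea: the sub-filter coefficients are fixed, and the fractional delay `mu` enters only through a polynomial evaluated with Horner's rule. The cubic Lagrange sub-filters used here are an assumption chosen for illustration; the abstract does not say which sub-filters the authors designed. The name `farrow_cubic` is hypothetical.

```python
def farrow_cubic(x0, x1, x2, x3, mu):
    """Interpolate between x1 and x2 at fractional position mu in [0, 1).

    x0..x3 are four consecutive samples. The four fixed sub-filters below are
    the cubic Lagrange interpolator written as a polynomial in mu.
    """
    c0 = x1
    c1 = -x0 / 3 - x1 / 2 + x2 - x3 / 6
    c2 = x0 / 2 - x1 + x2 / 2
    c3 = -x0 / 6 + x1 / 2 - x2 / 2 + x3 / 6
    # Horner evaluation: only mu changes when the delay changes.
    return ((c3 * mu + c2) * mu + c1) * mu + c0


# Samples of f(t) = t**3 - 2*t + 1 at t = -1, 0, 1, 2; a cubic is reproduced exactly.
value = farrow_cubic(2.0, 1.0, 0.0, 5.0, 0.25)   # close to f(0.25) = 0.515625
```

Changing the delay at run time therefore needs no coefficient reload, which is what makes the structure attractive for variable fractional-delay filters.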

15.
Efficient Implementations for AES Encryption and Decryption   Cited by: 1 (self-citations: 0, citations by others: 1)
This paper proposes two efficient architectures for hardware implementation of the Advanced Encryption Standard (AES) algorithm. The composite field arithmetic for implementing the SubBytes (S-box) and InvSubBytes (Inverse S-box) transformations investigated by several authors is used as the basis for deriving the proposed architectures. The first architecture is based on an optimized S-box followed by bit-wise implementation of MixColumns and AddRoundKey for encryption, and on an optimized Inverse S-box followed by bit-wise implementation of InvMixColumns and AddMixRoundKey for decryption. The proposed S-box and Inverse S-box used in this architecture are designed as a cascade of three blocks. In the second proposed architecture, block III of the proposed S-box is combined with the MixColumns and AddRoundKey transformations, forming an integrated unit for encryption. An integrated unit for decryption, combining block III of the proposed InvSubBytes with InvMixColumns and AddMixRoundKey, is formed along similar lines. The delays of the proposed architectures for VLSI implementation are found to be the shortest compared to the state-of-the-art implementations of AES operating in non-feedback mode. Iterative and fully unrolled sub-pipelined designs including the key schedule are implemented using FPGA and ASIC. The proposed designs are efficient in terms of the Kgates/Giga-bits per second ratio compared with a few recent state-of-the-art ASIC (0.18-μm CMOS standard cell) based designs, and in terms of throughput per area (TPA) for FPGA implementations.

16.
洪琪  曹伟  童家榕 《电子学报》2011,39(5):1059-1063
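A new dynamically reconfigurable architecture supporting the 4×4 integer transforms of the MPEG-4 AVC/H.264 standard is proposed. First, two new two-dimensional direct signal flow graphs are derived for the 4×4 forward and inverse transforms, respectively. A dynamically reconfigurable multi-transform architecture targeting HDTV applications is then designed. The architecture requires no transpose registers, and its computation unit needs only 16 adders (subtractors). The circuit is implemented in a 0.18 μm CMOS process. The results show that the maximum...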
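As background for the 4×4 integer transform named in this abstract, the sketch below computes the H.264 forward core transform in its plain matrix form, Y = C X Cᵀ. It is a software reference only: the scaling that H.264 folds into quantization is omitted, and the paper's direct two-dimensional signal flow graphs are not reproduced. The helper names are hypothetical.

```python
# Forward core transform matrix of the H.264 4x4 integer transform.
CF = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def forward_core_transform(block):
    """Return CF * block * CF^T for a 4x4 block of integer residuals."""
    return matmul(matmul(CF, block), transpose(CF))


flat = [[1] * 4 for _ in range(4)]
print(forward_core_transform(flat)[0][0])  # prints 16; all other coefficients are 0
```

All entries of the matrix are ±1 or ±2, so a hardware realization needs only additions, subtractions, and shifts, which is why the abstract can count the computation unit in adders.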

17.
沈仲弢  封常青  高山山  陈晓东  刘树彬 《红外与激光工程》2017,46(12):1217002-1217002(6)
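To meet the need for detecting weak echo signals from mode-locked lasers, a method combining high-speed correlated sampling with an online, real-time parallel accumulation algorithm is proposed. It improves on the traditional analog sampling-integration method and enables real-time digital accumulation detection without a reference signal. A prototype system for detecting weak mode-locked laser echo signals based on this method is designed and implemented. It uses the 12 bit, 900 MSPS analog-to-digital converter (ADC) ADS5409 to sample the waveform of the mode-locked laser echo signal after photoelectric conversion, and a Kintex-7 field programmable gate array (FPGA) to control the ADC and process the data online. System test results show that, for a mode-locked laser echo signal with a repetition rate of 8 MHz and an average power of 0.04 nW, the signal can be detected effectively by accurately accumulating 16 000 pulse waveforms inside the FPGA, and the delay from the end of waveform acquisition to the output of the detection result is less than 100 ns, which gives a high degree of real-time performance. In 900 repeated experiments, the detection efficiency reached 100% with no false alarms.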
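The benefit of accumulating many aligned pulse waveforms can be estimated with a short calculation. The sketch below sums waveforms sample by sample and computes the textbook signal-to-noise gain for N accumulations, assuming the noise is uncorrelated from pulse to pulse; that assumption and the function names are not taken from the abstract.

```python
import math


def accumulate_waveforms(waveforms):
    """Sum equally long, time-aligned waveforms sample by sample."""
    acc = [0] * len(waveforms[0])
    for waveform in waveforms:
        for i, sample in enumerate(waveform):
            acc[i] += sample
    return acc


def snr_gain_db(num_accumulations):
    """Power SNR gain in dB when the noise is uncorrelated between pulses."""
    return 10 * math.log10(num_accumulations)


print(accumulate_waveforms([[1, 2, 3], [0, -1, 4]]))  # prints [1, 1, 7]
print(round(snr_gain_db(16000), 2))                   # prints 42.04
```

Under that assumption, the 16 000 accumulations quoted in the abstract correspond to an SNR improvement of about 42 dB.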

18.
Channel estimation based on superimposed training (ST) has been an active research topic around the world in recent years, because it offers performance similar to that of methods based on pilot assisted transmissions (PAT), with the advantage of better bandwidth utilization. However, physical implementations of such estimators are still under research, and only a few approaches have been reported to date. This is due to the computational burden and complexity of the algorithms in conjunction with their relative novelty. To determine the suitability of ST-based channel estimation for commercial applications, a performance and complexity analysis of the ST approaches is mandatory. This work proposes two full-hardware channel estimator architectures for a data-dependent superimposed training (DDST) receiver with perfect synchronization and nonexistent DC-offset. These architectures were described using Verilog HDL and targeted to a Xilinx Virtex-5 XC5VLX110T FPGA. The synthesis results of these estimators showed a consumption of 3 % and 1 % of the total slices available in the FPGA and operating frequencies over 160 MHz. They have also been implemented in a generic 90 nm CMOS process, achieving clock frequencies of 187 MHz and 247 MHz while consuming 3.7 mW and 2.74 mW, respectively. In addition, for the first time, a novel architecture that includes channel estimation, training/block synchronization, and DC-offset estimation is also proposed. Its fixed-point analysis has been carried out, allowing the design to produce performance practically equal to that achieved with the floating-point models. Finally, the high throughput and reduced hardware consumption of the implemented channel estimators lead to the conclusion that ST/DDST can be utilized in practical communications systems.

19.
Efficient reconfigurable field programmable gate array (FPGA) architectures for the MD6-224/256/384/512 Hash algorithm are proposed in this article. The basic iterative compact design requires 923 ALMs and achieves a throughput ranging from 225 Mbit/s to 394 Mbit/s at a maximum frequency of 198 MHz. The 32-step-unrolled high-throughput design requires 7 090 ALMs and achieves a throughput ranging from 5 776 Mbit/s to 9 490 Mbit/s at a maximum frequency of 173 MHz. The simulation results show that a highly flexible and efficient FPGA implementation of the MD6 Hash function is achieved.

20.
This article introduces a novel lookup table (LUT) and its usage in the configurable logic block (CLB) architectures for SRAM-based field-programmable gate array (FPGA) architectures. The proposed CLB allows sharing of the SRAM tables of LUTs among NPN-equivalent functions to reduce the size of the memories used for storing the functions, and also reduces the number of configuration bits required. We measured many different characteristics of FPGAs using our new CLB architecture, including area, delay, routing, and power requirements. We experimentally found that for many different FPGA architectures, CLBs can share one-fourth of their SRAM tables between two basic logic elements (BLEs), which reduced both power consumption and area without negatively affecting routing or wirelength, and there was only a negligible increase in critical path delay of 0.27%. Specifically, we find that FPGAs consisting of CLBs with 16 BLEs and 34 inputs can be implemented with eight normal SRAMs and four SRAMs shared between two BLEs, for an overall reduction of four out of sixteen SRAM tables per CLB. With this new CLB architecture, we measured an approximate reduction in overall power consumption of 2% and an estimated reduction in area of 3%.
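NPN equivalence, on which the table sharing in this abstract relies, means that two Boolean functions become identical after some combination of input negation, input permutation, and output negation. The brute-force Python sketch below computes a canonical truth table for small functions so that equivalent functions can be recognized. It only illustrates the definition and is not the matching method used in the article; the name `npn_canonical` is hypothetical.

```python
from itertools import permutations


def npn_canonical(table, n):
    """Smallest truth table reachable by negating/permuting inputs and negating the output.

    table: tuple of 2**n bits; table[i] is the output when input k equals bit k of i.
    """
    best = None
    for perm in permutations(range(n)):
        for mask in range(1 << n):            # which inputs are negated
            for out_neg in (0, 1):            # whether the output is negated
                candidate = []
                for i in range(1 << n):
                    x = i ^ mask
                    j = 0
                    for k in range(n):        # route input k to position perm[k]
                        if (x >> k) & 1:
                            j |= 1 << perm[k]
                    candidate.append(table[j] ^ out_neg)
                candidate = tuple(candidate)
                if best is None or candidate < best:
                    best = candidate
    return best


AND2 = (0, 0, 0, 1)
OR2 = (0, 1, 1, 1)
XOR2 = (0, 1, 1, 0)
print(npn_canonical(AND2, 2) == npn_canonical(OR2, 2))   # prints True
print(npn_canonical(AND2, 2) == npn_canonical(XOR2, 2))  # prints False
```

AND and OR fall into the same class because OR(a, b) is the negation of AND(not a, not b), so in principle one stored table plus a few inverters can serve both, whereas XOR needs a different table.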
