1.
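This paper proposes a parallel-array VLSI architecture design method, based on the lifting scheme, for the two-dimensional discrete wavelet transform (DWT) in JPEG2000 coding systems. The resulting architecture consists of two row processors, one column processor, and a small number of row buffers. The row and column processors are built from parallel arrays of processing elements, so row and column filtering proceed simultaneously, and optimized shift-add operations replace multiplications. The whole architecture is pipelined. At the same precision it greatly reduces the amount of computation, raises hardware utilization to nearly 100%, speeds up the transform, and reduces circuit size. For an N×N image, the processing time is O(N²/2) clock cycles. The 2-D DWT filter architecture has been verified on an FPGA and can serve as a standalone IP core in a JPEG2000 image codec chip under development.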
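As an illustration of why lifting suits shift-add hardware, here is a minimal 1-D sketch of the reversible 5/3 lifting filter used in JPEG2000. The abstract does not say which JPEG2000 filter the architecture implements, so the 5/3 choice, the function names, and the even-length restriction are assumptions made for this example only.

```python
def lift53_forward(x):
    """One level of the reversible 5/3 lifting transform (even-length input)."""
    n = len(x)
    assert n >= 2 and n % 2 == 0
    even, odd = x[0::2], x[1::2]
    half = n // 2
    # Predict step: subtract the floored average of the two even neighbours.
    # The right edge mirrors the signal (whole-sample symmetric extension).
    d = [odd[i] - ((even[i] + even[min(i + 1, half - 1)]) >> 1)
         for i in range(half)]
    # Update step: add a rounded quarter of the two neighbouring details.
    # The left edge mirrors as well. Only adds and shifts are needed.
    s = [even[i] + ((d[max(i - 1, 0)] + d[i] + 2) >> 2)
         for i in range(half)]
    return s, d


def lift53_inverse(s, d):
    """Undo lift53_forward by running the lifting steps in reverse order."""
    half = len(s)
    even = [s[i] - ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(half)]
    odd = [d[i] + ((even[i] + even[min(i + 1, half - 1)]) >> 1)
           for i in range(half)]
    x = [0] * (2 * half)
    x[0::2] = even
    x[1::2] = odd
    return x
```

A 2-D transform applies the same 1-D step along rows and then along columns; the architecture in the abstract overlaps those two passes in hardware, which this sequential sketch does not model.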
2.
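Design and Implementation of a Pipelined FFT/IFFT Processor (total citations: 1; self-citations: 0; citations by others: 1)
To meet the demands of real-time high-speed signal processing, an efficient FFT processor is designed and implemented. Based on an analysis of FFT algorithm complexity and of hardware implementation structures, the processor adopts a radix-4 decimation-in-frequency algorithm, a multistage pipeline, and fixed-point arithmetic. It can be configured as required to compute a 4^P-point FFT or IFFT. The processor performs FFTs continuously on multiple input sequences, which removes the effect of data input and output on latency. On average, each N-point FFT takes only N clock cycles. The design is modular, written in Verilog HDL, and implemented on an Altera Cyclone II device.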
3.
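A new fast automatic frequency calibration (AFC) technique is proposed for band search and frequency locking in wideband frequency synthesizers. The AFC module tunes the oscillation frequency of the voltage-controlled oscillator (VCO) by directly controlling the switch states of the VCO's switched-capacitor array, so that the output frequency locks quickly. The self-calibration technique is implemented entirely in digital circuitry. Calibration completes in only 5 clock cycles, and the clock is taken directly from the external reference clock, so the algorithm is simple and needs few clock cycles. The circuit was designed and verified in SMIC 0.18 μm CMOS technology; compared with earlier calibration techniques, its calibration time is markedly shorter.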
4.
5.
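For an ARM parallel array machine architecture, a matching communication structure is proposed in which 4 routers handle communication among 16 processor cores, which saves area. The router uses packet-switched network-on-chip communication. Internally it combines buffering, the classic XY routing algorithm, and a dedicated arbitration policy, with data multicast added. The processors are low-power, high-performance ARM cores. Together these mechanisms reduce data propagation delay and power consumption. Experimental results show that a router designed this way reaches a maximum clock frequency of 406.009 MHz, which meets the communication-rate requirement of the ARM array machine.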
6.
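When image median filtering is implemented on a field-programmable gate array (FPGA), raising the FPGA clock frequency conflicts with optimizing hardware resources. To address this, a timing-optimized median filtering algorithm is proposed. The algorithm first replaces three-input comparator modules with cascaded two-input comparator modules, so that data is processed over multiple clock beats; this reduces the hardware timing delay and raises the clock frequency at which the algorithm runs on the FPGA. It then uses extreme-value comparator modules to optimize the parallel computation flow, which saves hardware resources and shortens run time. Simulation results show that, for a 3×3 filter, the algorithm outputs the first median after 8 clock cycles and one median per clock cycle thereafter; the theoretical maximum clock frequency for stable operation is 231.2 MHz.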
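The idea of building a 3×3 median from two-input comparators can be sketched in software. The version below uses the classic row-sort decomposition (maximum of the row minima, median of the row medians, minimum of the row maxima); it is an assumed stand-in and not necessarily the comparator arrangement or pipeline staging used in the paper. All function names are hypothetical.

```python
def cmp_swap(a, b):
    """Two-input comparator: return the pair in ascending order."""
    return (a, b) if a <= b else (b, a)


def sort3(a, b, c):
    """Sort three values with a cascade of three two-input comparators."""
    a, b = cmp_swap(a, b)
    b, c = cmp_swap(b, c)
    a, b = cmp_swap(a, b)
    return a, b, c


def median3x3(window):
    """Median of a 3x3 window given as three rows of three values."""
    rows = [sort3(*row) for row in window]   # sort each row independently
    lows, mids, highs = zip(*rows)           # group minima, medians, maxima
    max_low = sort3(*lows)[2]                # extreme value: largest minimum
    med_mid = sort3(*mids)[1]                # median of the row medians
    min_high = sort3(*highs)[0]              # extreme value: smallest maximum
    return sort3(max_low, med_mid, min_high)[1]
```

In hardware each `sort3` stage would map to one or more register-separated comparator levels, which is where the multi-cycle latency quoted in the abstract comes from; the sketch does not model that timing.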
7.
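Design and ASIC Implementation of a High-Speed Divider (total citations: 3; self-citations: 0; citations by others: 3)
To speed up division, a new radix-16 high-speed division algorithm is proposed and implemented using an application-specific integrated circuit (ASIC) design flow. Compared with the divider used in a MIPS processor, the maximum circuit delay is reduced by 27%, the number of clock cycles required is reduced by 68%, and speed improves by about 77%. Other performance figures of the circuit are also given. The circuit suits applications that demand very high computation speed.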
8.
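A heterogeneous asynchronous multicore simulator based on SimpleScalar and SystemC is proposed, in which cores running at different frequencies communicate and share data through a shared memory region. Experimental results show that the simulator correctly simulates the operation of a heterogeneous multicore processor at clock-cycle level and accurately evaluates its performance. The simulator has good application prospects in hardware/software co-design of heterogeneous multicore systems.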
9.
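The BCH code adopted in the DTMB standard is first proved to be a cyclic Hamming code with an error-correcting capability of 1. On this basis, a decoding algorithm suited to this BCH code is proposed, together with serial and parallel FPGA implementations. Because the code is a shortened code, modifying the error-detection circuit shortens the decoding latency by 34%. Implementation results show that the decoder decodes correctly and uses very few FPGA resources. The serial decoder has a total latency of 762 clock cycles and a maximum clock frequency of 357 MHz; the parallel decoder has a total latency of only 77 clock cycles and a maximum clock frequency of 276 MHz.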
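Single-error correction for a cyclic Hamming code reduces to matching the received word's syndrome against the syndromes of single-bit errors. The sketch below shows this on a toy (7,4) code with generator x³ + x + 1; the toy code, the function names, and the linear search are assumptions for illustration and are not the DTMB code parameters or the paper's circuit.

```python
def poly_mod(value, gen):
    """Remainder of GF(2) polynomial division; bit i holds the x^i coefficient."""
    gen_deg = gen.bit_length() - 1
    while value.bit_length() - 1 >= gen_deg:
        # Cancel the leading term by XOR-ing in a shifted copy of the generator.
        value ^= gen << (value.bit_length() - 1 - gen_deg)
    return value


def encode_systematic(msg, gen):
    """Append parity bits so that the codeword is divisible by the generator."""
    shifted = msg << (gen.bit_length() - 1)
    return shifted ^ poly_mod(shifted, gen)


def correct_single_error(word, gen, length):
    """Fix at most one flipped bit in a (possibly shortened) codeword."""
    syndrome = poly_mod(word, gen)
    if syndrome == 0:
        return word                      # no error detected
    for pos in range(length):            # only positions that are transmitted
        if poly_mod(1 << pos, gen) == syndrome:
            return word ^ (1 << pos)     # flip the bit whose syndrome matches
    return word                          # pattern outside the correctable set
```

For a shortened code the search only needs to cover the transmitted positions, which is one plausible reading of how the shortened-code property reduces latency; a serial decoder would walk through positions one per clock, while a parallel one would compare many at once.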
10.
11.
Takahashi J. Hamaguchi S. Tansho K. Kimura T. 《Solid-State Circuits, IEEE Journal of》1991,26(6):833-843
A speech recognition processor CMOS LSI was developed as the processing element (PE) of a ring array processor previously proposed by the authors as an architecture to carry out highly parallel recognition processing with array size flexibility. There are three key features for the LSI: (1) a highly parallel I/O structure of triple buffer with cyclical-mode transition control methods to solve the serious problem of inter-PE data transfer overhead versus the array processing; (2) a control structure with two direct memory access (DMA) controllers to realize inter-PE data I/O processing and intra-PE processing in parallel; and (3) a pipelined recognition processing at a high execution rate realized by a pipelined structure and a balanced clock distribution design technique. These effective designs for the PE LSI allow high-speed recognition processing without any inter-PE data transfer overhead in the ring array processor. Combining the PE-LSI architecture with the proposed array architecture for highly parallel dynamic time warping (DTW) processing, a real-time continuous speech recognition system based on continuous dynamic programming matching using the SPLIT method for a 1000-word vocabulary can be constructed using a ring array processor consisting of 30 PEs.
12.
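Based on FPGA hardware and the idea of trading space for time, a parallel full-comparison sorting algorithm is proposed. The algorithm compares all data items with one another in parallel, computes each item's position in the sorted order, and thereby sorts the data. It sorts a numeric sequence within 4 clock cycles; experiments show good real-time performance and generality.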
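The full-comparison idea can be sketched sequentially: each item's output position is the count of items that must precede it. The function name and the tie-breaking rule (earlier index first) are assumptions for this example; the abstract does not say how equal values are handled.

```python
def full_comparison_sort(values):
    """Sort by computing every item's rank from all pairwise comparisons."""
    n = len(values)
    out = [None] * n
    for i in range(n):
        # In hardware all n*n comparisons run at once; here they are looped.
        rank = sum(
            1
            for j in range(n)
            if values[j] < values[i] or (values[j] == values[i] and j < i)
        )
        out[rank] = values[i]            # the rank is the final position
    return out
```

The comparator count grows as n², which is the space cost paid for the constant number of clock cycles.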
13.
Bongjin Jung Burleson W.P. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》1998,6(3):475-483
We present a parallel algorithm, architecture, and implementation for efficient Lempel-Ziv (LZ)-based data compression. The parallel algorithm exhibits a scalable, parameterized, and regular structure and is well suited for VLSI array implementation. Based on our parallel algorithm and systematic design methodologies, two semisystolic array architectures have been developed which are low power and area efficient. The first architecture trades off the compression speed for the area and has a low run-time overhead for multichannel compression. The second architecture achieves a high compression rate (one data symbol per clock) at the expense of the area due to a large clock load and global wiring. Compared to a recent state-of-the-art parallel architecture, our first array structure requires significantly less chip area (≃330 k versus ≃36 k transistors) and more than an order of magnitude less power (≈1.0 W versus ≈70 mW) while still providing the compression speed required for most data communication applications. Hence, data compression can be adopted in portable data communication as well as wireless local area networks. The second architecture has at least three times less area and power while providing the same constant compression rate. To demonstrate the correctness of our design, a prototype module for the first architecture has been implemented using 1.2 μm complementary metal-oxide-semiconductor (CMOS) technology. The compression module contains 32 simple and identical processors, has an average compression rate of 12.5 million bytes/s, and consumes 18.34 mW without the dictionary (≈70 mW with a 4.1k SRAM for the dictionary) while operating at a 100 MHz clock rate (simulated).
14.
Chih-Yuang Su Shih-Am Hwang Po-Song Chen Cheng-Wen Wu 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》1999,7(2):280-284
We revise Montgomery's algorithm such that modular multiplication can be executed two times faster. Each iteration in our algorithm requires only one addition, while that in Montgomery's requires two additions. We then propose a cellular array to implement modular exponentiation for the Rivest-Shamir-Adleman cryptosystem. It has approximately 2n cells, where n is the word length. The cell contains one full-adder and some controlling logic. The time to calculate a modular exponentiation is about 2n² clock cycles. The proposed architecture has a data rate of 100 kb/s for 512-b words and a 100 MHz clock.
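For context, the baseline the abstract compares against can be sketched as bit-serial Montgomery multiplication, where each iteration performs up to two additions. This is the textbook form, not the revised one-addition algorithm of the paper, and the function name and argument order are assumptions for this example.

```python
def montgomery_multiply(a, b, m, n):
    """Return a * b * 2**(-n) mod m for odd m < 2**n and 0 <= a, b < m."""
    s = 0
    for i in range(n):
        if (a >> i) & 1:
            s += b        # first addition: accumulate the partial product
        if s & 1:
            s += m        # second addition: make s even without changing s mod m
        s >>= 1           # exact division by 2
    if s >= m:
        s -= m            # s stays below 2m, so one subtraction suffices
    return s
```

The result carries a factor of 2⁻ⁿ, so operands are normally converted into Montgomery form first; modular exponentiation then chains about 2n such multiplications of n iterations each, which is consistent with the roughly 2n² cycle count quoted above.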
15.
Manabu Yamada Tohru Nakagawa Hajime Kitagawa 《Analog Integrated Circuits and Signal Processing》1992,2(4):389-393
This paper presents an ultra-high-speed sorter based upon a simplified parallel sorting algorithm using a binary neural network, which consists of both binary neurons and AND-OR synaptic connections, to solve sorting problems in two and only two clock cycles. Our simplified algorithm is based on the super parallel sorting algorithm proposed by Takefuji and Lee. Nevertheless, our algorithm does not need any adders, while Takefuji's algorithm needs n×(n−1) analog adders, each of which has multiple input ports. As an example of the simplified parallel sorter, a hardware design and its implementation, which performs a sorting operation in two clock cycles, are introduced in this paper. Results of both a logic circuit simulation and an algorithm simulation show the correctness of our hardware implementation even for problems of practical size.
16.
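Based on the binary multi-word Montgomery modular multiplication algorithm, a regular systolic-array hardware architecture with flexibly configurable parameters is proposed, and Montgomery modular multiplication at several bit widths is implemented with it on an FPGA. Without adding extra circuitry or cycles, the architecture confines the critical path of the systolic array to the adder inside each processing element. Hardware implementation results show that the architecture achieves a higher clock frequency, smaller circuit area, and shorter computation time.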
17.
Gross W.J. Kschischang F.R. Gulak P.G. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2007,15(3):309-318
Reed-Solomon codes are powerful error-correcting codes that can be found in many digital communications standards. Recently, there has been an interest in soft-decision decoding of Reed-Solomon codes, incorporating reliability information from the channel into the decoding process. The Koetter-Vardy algorithm is a soft-decision decoding algorithm for Reed-Solomon codes which can provide several dB of gain over traditional hard-decision decoders. The algorithm consists of a soft-decision front end to the interpolation-based Guruswami-Sudan list decoder. The main computational task in the algorithm is a weighted interpolation of a bivariate polynomial. We propose a parallel architecture for the hardware implementation of bivariate interpolation for soft-decision decoding. The key feature is the embedding of both a binary tree and a linear array into a 2-D array processor, enabling fast polynomial evaluation operations. A field-programmable gate array interpolation processor was implemented and demonstrated at a clock frequency of 23 MHz, corresponding to decoding rates of 10-15 Mb/s.
18.
This paper presents the performance of a decision feedback adaptive array based on the interarray correlation-neglecting (ICN) algorithm that is suitable for high-speed mobile-communication systems. Although the ICN algorithm is regarded as a means of complexity reduction from the recursive least-squares (RLS) algorithm, we show theoretically that the ICN algorithm has a superior initial acquisition performance, close to that of the RLS algorithm. The requirements for stable convergence are revealed analytically, and the performance is confirmed by Monte Carlo simulation. Because the decision feedback adaptive array based on the ICN algorithm can be easily implemented with parallel processing, it can reduce the run-time cycles needed for one symbol and can therefore be applied to high-speed radio communication systems. In addition, the low operational complexity of this array makes it easy to apply to high-speed mobile-communication systems, where low complexity is a basic requirement.
19.
Moon Ho Lee Jong Oh Park Yasuhiko Yasuda 《Multidimensional Systems and Signal Processing》1990,1(4):389-398
In this paper, simple 1-D and 2-D systolic arrays for realizing the discrete cosine transform (DCT) based on the discrete Fourier transform (DFT) of an input sequence are presented. The proposed arrays are obtained by a simple modified DFT (MDFT) and an inverse DFT (IDFT) version of the Goertzel algorithm combined with Kung's approach. The 1-D array requires N cells and one multiplier, and takes N clock cycles to produce a complete N-point DCT. The 2-D array takes N clock cycles, faster than the 1-D array, but the area complexity is larger. A continuous flow of input data is allowed and no idle time is required between the input sequences.