1.

Accurate estimation of decay coefficients for dynamic range compressors in hearing aids and a hardware level comparison of different architectures

《Microprocessors and Microsystems》2020

The Dynamic Range Compression (DRC) algorithm helps to protect the residual hearing ability of hearing aid users by compressing signal levels that exceed a particular threshold. This paper addresses two different aspects of DRC for hearing aid applications. In the first part, methods are proposed to accurately estimate the decay coefficients corresponding to the required time constants for a feed-forward DRC architecture, so as to meet the hearing aid specifications. The effect of compression on the attack and release time parameters is compensated with the new formula. The hardware implementation of four different DRC architectures is explained in the second part of the paper. The estimated decay coefficients for a test signal were used in the corresponding hardware implementations, which verified the validity of the proposed algorithmic modifications. The architectures were implemented using UMC 65 nm standard cell libraries, and the power and error results were compared. The proposed methods to estimate the decay coefficients for both attack and release phases show close to 0 dB error from the expected output values, while conventional methods do not meet the specifications. Hardware implementation shows that there is not much difference in power performance between a lower resolution Look-Up Table (LUT) based logarithm implementation and a higher resolution one. From the results, we propose using the absolute level detector based DRC with a higher resolution logarithm and without a gain smoothing stage at the output, for the lowest power consumption and better approximation error performance.
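
For orientation, the conventional mapping from an attack or release time constant to a one-pole smoothing coefficient can be sketched as below. This is the textbook formula that the abstract contrasts its method with, not the compensated formula proposed in the paper (which the abstract does not state). The function names, sample rate, and time constants are illustrative assumptions.

```python
import math

def decay_coefficient(time_constant_s: float, sample_rate_hz: float) -> float:
    """Conventional one-pole coefficient: alpha = exp(-1 / (tau * fs))."""
    return math.exp(-1.0 / (time_constant_s * sample_rate_hz))

def smooth_level(levels_db, attack_s, release_s, sample_rate_hz):
    """Track a level sequence with separate attack and release smoothing."""
    a_att = decay_coefficient(attack_s, sample_rate_hz)
    a_rel = decay_coefficient(release_s, sample_rate_hz)
    state = levels_db[0]
    out = []
    for x in levels_db:
        # Rising input uses the attack coefficient, falling input the release one.
        alpha = a_att if x > state else a_rel
        state = alpha * state + (1.0 - alpha) * x
        out.append(state)
    return out

# Hypothetical settings: 5 ms attack, 50 ms release, 16 kHz sample rate.
step = [0.0] + [10.0] * 200
tracked = smooth_level(step, attack_s=0.005, release_s=0.05, sample_rate_hz=16000)
```

With these assumed settings the tracked level reaches about 63% of the step after 80 samples (one time constant), which is the uncompensated behaviour that a compressor's gain computation would then distort.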

2.

Real-time visual content description system based on MPEG-7 descriptors

Rafał Kapela Paweł Śniatała Andrzej Rybarczyk 《Multimedia Tools and Applications》2011,53(1):119-150

3.

VLSI architecture for low latency radix-4 CORDIC

B. Lakshmi A.S. Dhar 《Computers & Electrical Engineering》2011,37(6):1032-1042

The CORDIC algorithm, originally proposed using nonredundant radix-2 arithmetic, has been refined in terms of throughput and latency with the introduction of redundant arithmetic and higher radix techniques. In this paper, we propose a pipelined architecture using signed digit arithmetic for the VLSI-efficient implementation of the rotational radix-4 CORDIC algorithm, eliminating the z path completely. A detailed comparison of the proposed architecture with the available radix-2 architectures shows the latency and hardware improvement. The proposed architecture achieves a latency improvement over the previously proposed radix-4 architecture with a relatively small hardware overhead. The proposed architecture for 16-bit precision was implemented using VHDL, and extensive simulations have been performed to validate the results. The functionally simulated netlist has been synthesized for 16-bit precision with a 90 nm CMOS technology library, and the area-time measures are provided. This architecture was also implemented using Xilinx ISE 9.1 software and a Virtex device.
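
As background, the classic radix-2 rotation-mode CORDIC that the abstract takes as its starting point can be sketched in floating point as follows. This is not the paper's radix-4 signed-digit architecture, and it keeps the z (angle) path that the paper eliminates. The function name and iteration count are assumptions made for illustration.

```python
import math

def cordic_rotate(angle_rad: float, iterations: int = 16):
    """Return (cos, sin) of angle_rad for |angle_rad| <= pi/2 via radix-2 CORDIC."""
    # Elementary rotation angles atan(2^-i).
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Pre-compensate the constant gain accumulated by the micro-rotations.
    gain = 1.0
    for i in range(iterations):
        gain /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = gain, 0.0, angle_rad
    for i in range(iterations):
        # z path: the sign of the residual angle selects the rotation direction.
        d = 1.0 if z >= 0 else -1.0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return x, y

cos_half, sin_half = cordic_rotate(0.5)
```

In hardware the multiplications by 2^-i become shifts, so each iteration needs only adders; a radix-4 formulation roughly halves the number of iterations at the cost of a more involved digit selection.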

4.

Reconfigurable address generator for multi-standard interleaver

《Microprocessors and Microsystems》2019

This paper presents low-complexity, novel techniques for designing reconfigurable architectures for a multi-standard address generator and interleaver. The emphasis of this work is on hardware re-use, but it also focuses on optimizing the hardware to support multiple standards. A low-cost reconfigurable architecture for the address generator and interleaver is proposed which operates in the WLAN (802.11a/b/g and 802.11n), WiMAX (802.16e) and 3GPP LTE standards. A simple algorithm and a reconfigurable architecture that eliminates the computationally intricate mod function for LTE, and the floor as well as mod functions for WLAN/WiMAX, are proposed to reduce the hardware cost as well as the implementation complexity. Novel architectures are also proposed to select the increment values for the 16-QAM and 64-QAM schemes. A unique configurable subtracting sub-block for each modulation scheme is also presented. Software simulation is carried out to verify the functionality of the algorithm. The proposed reconfigurable architectures are realized on FPGA and tested on board. Synthesis results on a Spartan-3 FPGA show a 66% reduction in FPGA resource utilization and a 74% increase in operating frequency compared to the cited address generators. Implementation results on a Kintex UltraScale FPGA show a reduction of 34% in resource utilization and 20% in total on-chip power compared to the cited interleavers. This design is also implemented using 45 nm CMOS standard cell technology, and ASIC synthesis results of the reconfigurable address generator exhibit a 76.4% improvement in data rate and a 52.23% decrease in latency compared to the state-of-the-art address generators. The proposed multimode interleaver also exhibits a 60.28% reduction in hardware complexity.
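
To show where the floor and mod operations mentioned above come from, here is a minimal sketch of the direct two-step permutation used by the 802.11a/g OFDM block interleaver. It is the straightforward reference form, not the paper's floor-free and mod-free architecture. The function name is hypothetical; `n_cbps` is the number of coded bits per OFDM symbol and `n_bpsc` the number of coded bits per subcarrier.

```python
def wlan_interleaver_addresses(n_cbps: int, n_bpsc: int):
    """Map each input bit index k to its interleaved position j."""
    s = max(n_bpsc // 2, 1)
    addresses = []
    for k in range(n_cbps):
        # First permutation: adjacent coded bits land on non-adjacent subcarriers.
        i = (n_cbps // 16) * (k % 16) + k // 16
        # Second permutation: alternate bits between more and less
        # significant positions of the constellation.
        j = s * (i // s) + (i + n_cbps - (16 * i) // n_cbps) % s
        addresses.append(j)
    return addresses

# 16-QAM: 192 coded bits per symbol, 4 bits per subcarrier.
addresses_16qam = wlan_interleaver_addresses(192, 4)
```

Every `//` and `%` here would cost a divider or comparator chain if mapped naively to hardware, which is the cost an incremental address generator avoids.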

5.

An 8-bit systolic AES architecture for moderate data rate applications

Sheikh Muhammad Farhan Shoab A. Khan Habibullah Jamal 《Microprocessors and Microsystems》2009,33(3):221-231

The complexity involved in mapping an algorithm to hardware is a function of the controller logic and data path. Minimizing data path size can lead to significant savings in hardware area and power dissipation. This paper presents an implementation of a novel architectural transformation technique for mapping a word-wide algorithm to a byte-serial vector architecture. The technique divides the input word into several bytes and then traces each byte to extract the architectural transformation. The technique is applied to the Advanced Encryption Standard (AES) algorithm, which is non-linear in nature. Using this technique, the 32-bit AES algorithm is transformed into a byte-systolic architecture. The novelty of the technique is most pronounced in the mix column design, which is the most complex part of the AES algorithm. The complex matrix multiplication component and standard transformations of the 32-bit AES algorithm are transformed to support 8-bit operations. The resulting AES architectures reuse the same logic resources for key expansion and encryption/decryption. The proposed design offers moderate data rates in the range of 41 Mbps for encryption and 37 Mbps for decryption while utilizing 236 and 280 slices, respectively, on a Xilinx Virtex II xc2v1000-6 FPGA. Comparison results show a significant gain in throughput when compared with other 8-bit designs. This makes it a viable data/communication security solution for a variety of embedded and consumer electronics.
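
The mix column step singled out above can be written entirely with byte operations, as in the sketch below. It illustrates the standard AES MixColumns arithmetic in GF(2^8) on one column; it does not reproduce the paper's systolic scheduling, and the function names are assumptions.

```python
def xtime(b: int) -> int:
    """Multiply a byte by 2 in GF(2^8) modulo the AES polynomial 0x11B."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b & 0xFF

def mix_column(col):
    """Apply AES MixColumns to one 4-byte column using only byte operations."""
    a0, a1, a2, a3 = col
    # Multiplication by 3 is xtime(a) ^ a; multiplication by 1 is the byte itself.
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),
    ]

mixed = mix_column([0xDB, 0x13, 0x53, 0x45])
```

Because each output byte depends on all four input bytes, an 8-bit data path has to buffer or revisit the column, which is why this step dominates the design effort in byte-serial implementations.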

6.

Implementation and analysis of optimized architectures for rank order filter

S. M. Meena K. Linganagouda 《Journal of Real-Time Image Processing》2008,3(1-2):33-41

In this paper we present two architectures based on the replication sort algorithm (RSA) and the rank based network sorting algorithm (RBNS) for the implementation of a rank order filter (ROF). This paper focuses on optimization strategies for sorting in terms of operating speed (throughput) and area (number of comparators). The RSA algorithm achieves maximum throughput by finding the position of all the window elements in parallel using eight-bit comparators, a LUT to store the bit sum, and a decoder. The time cost for filtering the complete image remains constant irrespective of the size of the window, and the algorithm is generalized for all rank orders. The RBNS architecture is based on the sorting network architecture, optimized for each desired output rank with O(N) hardware complexity compared to the O(N²) complexity of the existing architectures reported so far, which are based on bubble sort and quick sort. The proposed architectures use the concepts of pipelining and grain-level parallelism and accomplish the task of sorting and filtering each sample appearing at the input window of the filter in one clock cycle, excluding the initial latency.
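
The idea of finding each window element's sorted position by comparison counting can be sketched in software as follows. This is a sequential illustration under assumed tie-breaking rules, not the RSA or RBNS hardware described in the paper; the function name is hypothetical.

```python
def rank_order_filter(window, rank):
    """Return the element of `window` with the given rank (0 = smallest)."""
    n = len(window)
    for i in range(n):
        # Sorted position of element i: count the elements that sort before it.
        # Ties are broken by index so that every element gets a unique position.
        position = sum(
            1
            for j in range(n)
            if window[j] < window[i] or (window[j] == window[i] and j < i)
        )
        if position == rank:
            return window[i]
    raise ValueError("rank must be in range(len(window))")

# Median (rank 4) of a 3x3 window flattened row by row.
median = rank_order_filter([5, 3, 8, 3, 1, 9, 7, 2, 6], 4)
```

All comparisons are independent of one another, so hardware can evaluate them in the same cycle and reduce the per-element counts with a small adder or lookup table.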

7.

Modular Neural Tile Architecture for Compact Embedded Hardware Spiking Neural Network

Sandeep Pande Fearghal Morgan Seamus Cawley Tom Bruintjes Gerard Smit Brian McGinley Snaider Carrillo Jim Harkin Liam McDaid 《Neural Processing Letters》2013,38(2):131-153

Biologically-inspired packet switched network on chip (NoC) based hardware spiking neural network (SNN) architectures have been proposed as an embedded computing platform for classification, estimation and control applications. Storage of large synaptic connectivity (SNN topology) information in SNNs requires large distributed on-chip memory, which poses serious challenges for compact hardware implementation of such architectures. Based on the structured neural organisation observed in the human brain, a modular neural networks (MNN) design strategy partitions complex application tasks into smaller subtasks executing on distinct neural network modules, and integrates intermediate outputs in higher level functions. This paper proposes a hardware modular neural tile (MNT) architecture that reduces the SNN topology memory requirement of NoC-based hardware SNNs by using a combination of fixed and configurable synaptic connections. The proposed MNT contains a 16:16 fully-connected feed-forward SNN structure and integrates in a mesh topology NoC communication infrastructure. The SNN topology memory requirement is 50% of that of the monolithic NoC-based hardware SNN implementation. The paper also presents a lookup table based SNN topology memory allocation technique, which further increases the memory utilisation efficiency. Overall, the area requirement of the architecture is reduced by an average of 66% for practical SNN application topologies. The paper presents micro-architecture details of the proposed MNT and digital neuron circuit. The proposed architecture has been validated on a Xilinx Virtex-6 FPGA and synthesised using 65 nm low-power CMOS technology. The evolvable capability of the proposed MNT and its suitability for executing subtasks within an MNN execution architecture is demonstrated by successfully evolving benchmark SNN application tasks representing classification and non-linear control functions.
The paper addresses hardware modular SNN design and implementation challenges and contributes to the development of a compact hardware modular SNN architecture suitable for embedded applications.

8.

Embedding of a real time image stabilization algorithm on a parameterizable SoPC architecture a chip multi-processor approach

Lionel Damez Loic Sieler Alexis Landrault Jean Pierre Dérutin 《Journal of Real-Time Image Processing》2011,6(1):47-58

Highly regular multi-processor architectures are suitable for inherently highly parallelizable applications such as most of the image processing domain. Systems embedded in a single programmable chip platform (SoPC) allow hardware designers to tailor every aspect of the architecture in order to match the specific application needs. These platforms are now large enough to embed an increasing number of cores, allowing implementation of a multi-processor architecture with an embedded communication network. In this paper we present the parallelization and the embedding of a real time image stabilization algorithm on a SoPC platform. Our overall hardware implementation method is based upon meeting algorithm processing power requirements and communication needs with refinement of a generic parallel architecture model. Actual implementation is done by the choice and parameterization of readily available reconfigurable hardware modules and customizable commercially available IPs (Intellectual Property). We present both software and hardware implementation with performance results on a Xilinx SoPC target.

9.

FPGA implementation of reversible watermarking in digital images using reversible contrast mapping

《Journal of Systems and Software》2014

Reversible contrast mapping (RCM) and its various modified versions are used extensively in reversible watermarking (RW) to embed secret information into digital content. RCM based RW applies a simple integer transform to pairs of pixels, and their least significant bits (LSBs) are used for data embedding. It is perfectly invertible even if the LSBs of the transformed pixels are lost during data embedding. RCM offers a high embedding rate at relatively low visual distortion (embedding distortion). Moreover, low computation cost and ease of hardware realization make it attractive for real-time implementation. To this end, this paper proposes a field programmable gate array (FPGA) based very large scale integration (VLSI) architecture of the RCM-RW algorithm for digital images that can serve the purpose of media authentication in a real-time environment. Two architectures are developed, one for a block size of (8 × 8) and the other for a (32 × 32) block. The proposed architecture uses a 6-stage pipelining technique to speed up the circuit operation. For a cover image of block size (32 × 32), the proposed architecture requires 9881 slices, 9347 slice flip-flops, 11291 4-input LUTs and 3 BRAMs, and achieves a data rate of 1.0395 Mbps at an operating frequency as high as 98.76 MHz.
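
The invertibility claim above can be checked with a small sketch of the RCM integer transform as it is commonly given in the reversible watermarking literature: forward x' = 2x − y, y' = 2y − x, with the inverse computed by ceiling division. The function names are assumptions, and the sketch omits the range checks and marker-bit handling of a full embedding scheme.

```python
def rcm_forward(x: int, y: int):
    """Forward reversible contrast mapping of a pixel pair."""
    return 2 * x - y, 2 * y - x

def rcm_inverse(xt: int, yt: int):
    """Inverse mapping; ceiling division tolerates cleared LSBs."""
    x = (2 * xt + yt + 2) // 3  # ceil((2*xt + yt) / 3)
    y = (xt + 2 * yt + 2) // 3  # ceil((xt + 2*yt) / 3)
    return x, y

def clear_lsb(v: int) -> int:
    """Set the least significant bit to 0, as if it had been overwritten."""
    return v & ~1

xt, yt = rcm_forward(100, 97)
recovered = rcm_inverse(clear_lsb(xt), clear_lsb(yt))
```

Recovery after clearing both LSBs holds as long as the original pair is not made of two odd values (which would make both transformed values odd); such pairs need separate handling in an embedding scheme.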

10.

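Design and implementation of system software for a multiprocessor signal processing platform

蔡炜 冀映辉 孔超 蔡惠智 《微计算机应用》2011,32(5):66-70

This paper studies the system architecture of a multiprocessor parallel processing machine. For three CPUs with different architectures, it proposes different system software architectures and implementation schemes based on the principles of modularity and reconfigurability. Finally, building on the existing hardware and earlier platform software, the proposed system software architecture is used to implement a demonstration of a radar system, which verifies the feasibility and practicality of the approach.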

11.

Recursive algorithm, architectures and FPGA implementation of the two-dimensional discrete cosine transform

An S. Wang C. 《Image Processing, IET》2008,2(6):286-294

A new recursive algorithm and two types of circuit architectures are presented for the computation of the two-dimensional discrete cosine transform (2D DCT). The new algorithm allows the 2D DCT to be computed by a simple procedure of 1D recursive calculations involving only cosine coefficients. The recursive kernel for the proposed algorithm contains a small number of operations. It also requires a smaller amount of pre-computed data than many existing algorithms in the same category. The kernel can be easily implemented in a simple circuit block with a short critical delay path. In order to evaluate the performance improvement resulting from the new algorithm, an architecture for the 2D DCT designed by direct mapping from the computation structure of the proposed algorithm has been implemented on an FPGA board. The results show that the reduction in hardware consumption can easily reach 25% and the clock frequency can increase by 17% compared with a system implementing a recently reported 2D DCT recursive algorithm. For a further reduction of the hardware, another architecture has been proposed for the same 2D DCT computation. Using one recursive computation block to perform different functions, this architecture needs only approximately one-half of the hardware that is required in the first architecture, which has been confirmed by an FPGA implementation.

12.

Reconfigurable hardware for neural networks: binary versus stochastic

Nadia Nedjah Luiza de Macedo Mourelle 《Neural computing & applications》2007,16(3):249-255

This paper focuses on the hardware implementation of neural networks. We propose a reconfigurable, low-cost and readily available hardware architecture for an artificial neuron. For this purpose, we use field-programmable gate arrays (FPGAs). As state-of-the-art FPGAs still lack the gate density necessary for the implementation of large neural networks of thousands of neurons, we use a stochastic process to implement efficiently the computation performed by a neuron. This paper describes and compares the characteristics of two architectures designed to implement feed-forward fully connected artificial neural networks: the first FPGA prototype is based on traditional adders and multipliers of binary inputs, while the second takes advantage of a stochastic representation of the inputs. The paper compares both prototypes using the classic time × area factor.

13.

FPGA-based detection of SIFT interest keypoints

Leonardo Chang José Hernández-Palancar L. Enrique Sucar Miguel Arias-Estrada 《Machine Vision and Applications》2013,24(2):371-392

The use of local features in images has become very popular due to its promising results. They have shown significant benefits in a variety of applications such as object recognition, image retrieval, robot navigation, panorama stitching, and others. SIFT is one of the local feature methods that have shown better results. Among its main disadvantages is its high computational cost. In order to speed up this algorithm, this work proposes the design and implementation of an efficient hardware architecture based on FPGAs for SIFT interest point detection. In order to take full advantage of the parallelism in this algorithm and to minimize the device area occupied by its implementation in hardware, part of the algorithm was reformulated. The main contribution of the hardware architecture proposed in this paper, and the main difference with the rest of the architectures reported in the literature, is that as the number of octaves to be processed is increased, the amount of occupied device area remains almost constant. The evaluations and experiments on the architecture support this contribution, as well as the accuracy, repeatability, and distinctiveness of the results. Experiments also showed the device area occupation and time constraints of the hardware implementation. The architecture presented in this paper is able to detect interest points in an image of 320 × 240 in 11 ms, which represents a speedup of 250× with respect to a software implementation.

14.

An FPGA based soft multiprocessor for DNS/DNSSEC authoritative server

Rabah Sadoun El Bey Bourennane 《Microprocessors and Microsystems》2011,35(5):473-483

15.

Flexible architectures for retinal blood vessel segmentation in high-resolution fundus images

Hamza Bendaoudi Farida Cheriet Ashley Manraj Houssem Ben Tahar J. M. Pierre Langlois 《Journal of Real-Time Image Processing》2018,15(1):31-42

16.

Performance, optimization, and fitness: Connecting applications to architectures

Mohammad A. Bhuiyan Melissa C. Smith Vivek K. Pallipuram 《Concurrency and Computation》2011,23(10):1066-1100

Recent trends involving multicore processors and graphical processing units (GPUs) focus on exploiting task- and thread-level parallelism. In this paper, we have analyzed various aspects of the performance of these architectures, including NVIDIA GPUs and multicore processors such as Intel Xeon, AMD Opteron, and IBM's Cell Broadband Engine. The case study used in this paper is a biological spiking neural network (SNN), implemented with the Izhikevich, Wilson, Morris–Lecar, and Hodgkin–Huxley neuron models. The four SNN models have varying requirements for communication and computation, making them useful for performance analysis of the hardware platforms. We report and analyze the variation of performance with network (problem size) scaling, available optimization techniques and execution configuration. A Fitness performance model, which predicts the suitability of the architecture for accelerating an application, is proposed and verified with the SNN implementation results. The Roofline model, another existing performance model, has also been utilized to determine the hardware bottleneck(s) and attainable peak performance of the architectures. Significant speedups for the four SNN neuron models utilizing these architectures are reported; the maximum speedup of 574x was observed in our GPU implementation. Our results and analysis show that a proper match of architecture with algorithm complexity provides the best performance. Copyright © 2010 John Wiley & Sons, Ltd.
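
Of the four neuron models named above, the Izhikevich model is the cheapest to compute. A minimal single-neuron sketch with forward Euler integration is shown below. The parameter values are the commonly quoted regular-spiking settings; the function name, time step, and input currents are assumptions rather than the configuration used in the paper.

```python
def izhikevich_spike_times(current, duration_ms=1000.0, dt=0.5,
                           a=0.02, b=0.2, c=-65.0, d=8.0):
    """Simulate one Izhikevich neuron; return the spike times in ms."""
    v, u = c, b * c  # membrane potential (mV) and recovery variable
    spikes = []
    steps = int(duration_ms / dt)
    for step in range(steps):
        # Euler update of v, then of u using the freshly updated v.
        v += dt * (0.04 * v * v + 5.0 * v + 140.0 - u + current)
        u += dt * a * (b * v - u)
        if v >= 30.0:  # spike: record it and apply the reset rule
            spikes.append(step * dt)
            v, u = c, u + d
    return spikes

driven = izhikevich_spike_times(current=10.0)
silent = izhikevich_spike_times(current=0.0)
```

Each neuron update is a handful of multiply-adds with no dependence on other neurons inside a time step, which is why networks of such neurons map well to thread-level parallelism, while the heavier Hodgkin–Huxley model shifts the balance towards computation.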

17.

An architectural co-synthesis algorithm for energy-aware Network-on-Chip design

Yi-Jung Chen Chia-Lin Yang Yen-Sheng Chang 《Journal of Systems Architecture》2009,55(5-6):299-309

Network-on-Chip (NoC) has been proposed to overcome the complex on-chip communication problem of System-on-Chip (SoC) design in deep sub-micron technologies. A complete NoC design involves exploration of both the hardware and software architectures. The hardware architecture includes the selection of Processing Elements (PEs) of multiple types and their topology. The software architecture covers allocating tasks to PEs and scheduling tasks and their communications. To find the best hardware design for the target tasks, both hardware and software architectures need to be considered simultaneously. Previous works on NoC design have concentrated on solving only one or two design parameters at a time. In this paper, we propose a hardware–software co-synthesis algorithm for a heterogeneous NoC architecture. The design goal is to minimize energy consumption while meeting the real-time requirements commonly seen in embedded applications. The proposed algorithm is based on Simulated Annealing (SA). To compare the solution quality and efficiency of the proposed algorithm, we also implement branch-and-bound and iterative algorithms to solve the hardware–software co-synthesis problem of a heterogeneous NoC. With the given synthetic task sets, the experimental results show that the proposed SA-based algorithm achieves a near-optimal solution in a reasonable time, while the branch-and-bound algorithm takes a very long time to find the optimal solution, and the iterative algorithm fails to achieve good solution quality. When applying the co-synthesis algorithms to a real-world application with a PE library that has little variation in PE performance and energy consumption, the iterative algorithm achieves solution quality comparable to that of the proposed SA-based algorithm.

18.

Parallel desynchronized block matching: A feasible scheduling algorithm for the input-buffered wavelength-routed switch

《Computer Networks》2007,51(15):4270-4283

The input-buffered wavelength-routed (IBWR) switch is a promising switching architecture for slotted optical packet switching (OPS) networks. The benefits of the IBWR fabric are better scalability and lower hardware cost when compared to output buffered OPS proposals. A previous work characterized the scheduling problem of this architecture as a type of matching problem in bipartite graphs. This characterization establishes an interesting relation between IBWR scheduling and the scheduling of electronic virtual output queuing switches. In this paper, this relation is further explored for the design of IBWR scheduling algorithms that are feasible in terms of hardware implementation and execution time. As a result, the parallel desynchronized block matching (PDBM) algorithm is proposed. The evaluation results presented reveal that IBWR switch performance using the PDBM algorithm is close to the performance bound given by OPS output buffered architectures. The performance gap is especially small for dense wavelength division multiplexing (DWDM) architectures.

19.

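Research on parallel processing methods for EBCOT in JPEG2000

陈美丽 黄士坦 《微机发展》2006,16(6):104-106

To improve the efficiency of JPEG2000 encoding and reduce the bottleneck effect of EBCOT in this process, this paper studies three optimization schemes for parallel structures, based on an analysis of the EBCOT algorithm. Taking both software and hardware into consideration, the parallel structures greatly increase the encoding speed and improve the real-time performance of encoding.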

20.

An instruction-level distributed processor for symmetric-key cryptography

Elbirt A.J. Paar C. 《Parallel and Distributed Systems, IEEE Transactions on》2005,16(5):468-480

Efficient implementation of block ciphers is critical to achieving both high security and high-speed processing. Numerous block ciphers have been proposed and implemented, using a wide and varied range of functional operations. Existing architectures such as microcontrollers do not provide this broad range of support. Therefore, we will present a hardware architecture that achieves efficient block cipher implementation while maintaining flexibility through reconfiguration. In an effort to achieve such a hardware architecture, a study of a wide range of block ciphers was undertaken to develop an understanding of the functional requirements of each algorithm. This study led to the development of COBRA, a reconfigurable architecture for the efficient implementation of block ciphers. A detailed discussion of the top-level architecture, interconnection scheme, and underlying elements of the architecture will be provided. System configuration and on-the-fly reconfiguration will be analyzed, and from this analysis, it will be demonstrated that the COBRA architecture satisfies the requirements for achieving efficient implementation of a wide range of block ciphers that meet the 622 Mbps ATM network encryption throughput requirement.