1.

Design of New DSP Instructions and Their Hardware Architecture for High-Speed FFT

Jae Sung Lee Myung H. Sunwoo 《The Journal of VLSI Signal Processing》2003,33(3):247-254

This paper presents new DSP (Digital Signal Processor) instructions and their hardware architecture for high-speed FFT. The instructions perform new operation flows that differ from the MAC (Multiply and Accumulate) operation on which existing DSP chips heavily depend. The paper proposes a DPU (Data Processing Unit) that supports the instructions and shows it to be two times faster than existing DSP chips for FFT. The architecture has been modeled in Verilog HDL, and logic synthesis has been performed using a 0.35 µm standard cell library. The maximum operating clock frequency is about 144.5 MHz, and the architecture will be employed on an application-specific DSP chip.
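The abstract does not spell out the new instructions, so the following is only a minimal Python sketch of the standard radix-2 butterfly that FFTs are built from. It illustrates why a MAC-centric datapath needs several operations per butterfly. The function names `butterfly`, `fft`, and `dft` are illustrative assumptions and are not taken from the paper.

```python
import cmath


def butterfly(a, b, w):
    """Radix-2 decimation-in-time butterfly.

    One complex multiply (4 real multiplies and 2 real adds on a
    MAC-style datapath) followed by one complex add and one subtract.
    """
    t = w * b
    return a + t, a - t


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor
        out[k], out[k + n // 2] = butterfly(even[k], odd[k], w)
    return out


def dft(x):
    """Direct O(n^2) DFT, used only as a reference for checking fft()."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]
```

An N-point transform evaluates (N/2)·log2(N) such butterflies, which is why the cost of a single butterfly dominates FFT throughput.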

2.

Architectural techniques for accelerating subword permutations with repetitions

McGregor J.P. Lee R.B. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2003,11(3):325-335

We propose two new instructions, swperm and sieve, that can be used to efficiently complete an arbitrary bit-level permutation of an n-bit word with or without repetitions. Permutations with repetitions are rearrangements of an ordered set in which elements may replace other elements in the set; such permutations are useful in cryptographic algorithms. On a four-way superscalar processor, we can complete an arbitrary 64-bit permutation with repetitions of 1-bit subwords in 11 instructions and only four cycles using the two proposed instructions. For subwords of size 4 bits or greater, we can perform an arbitrary permutation with repetitions of a 64-bit register in a single cycle using a single swperm instruction. This improves upon previous results by requiring fewer instructions to permute 4-bit or larger subwords packed in a 64-bit register and fewer execution cycles for 1-bit subwords on wide superscalar processors. We also demonstrate that we can accelerate the performance of the popular DES block cipher using the proposed instructions. We obtain a DES performance improvement of at least 55% in constrained embedded environments and an improvement of 71% on a four-way superscalar processor when applying DES as a cryptographic hash function.
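A permutation with repetitions of subwords can be modeled in a few lines of Python. This is a behavioral sketch only: the function name `permute_subwords`, the index-list form of the control operand, and the convention that subword 0 is the least significant are assumptions made for illustration, not the encoding of the swperm instruction.

```python
def permute_subwords(src, indices, width=4, word_bits=64):
    """Build a word whose subword i is a copy of subword indices[i] of src.

    Because an index may appear more than once, a source subword can be
    replicated into several destinations (a permutation with repetitions).
    Subword 0 is taken to be the least significant one (an assumption).
    """
    count = word_bits // width
    if len(indices) != count:
        raise ValueError("need one source index per destination subword")
    mask = (1 << width) - 1
    result = 0
    for dest, source in enumerate(indices):
        if not 0 <= source < count:
            raise ValueError("source index out of range")
        subword = (src >> (source * width)) & mask
        result |= subword << (dest * width)
    return result
```

With 4-bit subwords in a 64-bit word there are 16 destinations and each index needs 4 bits, so the whole selection fits in 64 bits of control data, which is consistent with the single-instruction case described in the abstract.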

3.

Design of a cycle-efficient 64-b/32-b integer divisor using a table-sharing algorithm

Chua-Chin Wang Po-Ming Lee Jun-Jie Wang Chenn-Jung Huang 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2003,11(4):737-740

In new generations of microprocessors, the superscalar architecture is widely adopted to increase the number of instructions executed in one cycle. Among all instructions, division needs more cycles than the others, e.g., addition and multiplication. This makes the division instruction an important contributor to the cycles-per-instruction figure of modern microprocessors. In this paper, a radix-16/8/4/2 divisor is proposed, which uses a variety of techniques, including operand scaling, table partitioning, and, particularly, table sharing, to increase performance without increasing complexity. A physical chip using the proposed method is implemented in 0.35-µm single-poly four-metal (1P4M) CMOS technology. Test measurements show that the chip can execute signed 64-b/32-b integer division in 3 to 13 cycles with an 80-MHz operating clock.
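To illustrate why a higher radix reduces the cycle count, here is a minimal Python sketch of unsigned long division in radix 2^k, which retires k quotient bits per iteration. It is not the table-sharing algorithm of the paper: the quotient digit is obtained with ordinary integer division where real hardware would use a lookup table, signed operands are not handled, and the name `divide_radix` and its parameters are hypothetical.

```python
def divide_radix(dividend, divisor, bits_per_step=4, width=64):
    """Unsigned long division that produces bits_per_step quotient bits per step.

    Assumes 0 <= dividend < 2**width and that width is a multiple of
    bits_per_step. Returns (quotient, remainder, steps).
    """
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    radix_mask = (1 << bits_per_step) - 1
    quotient = 0
    remainder = 0
    steps = 0
    for shift in range(width - bits_per_step, -1, -bits_per_step):
        # Bring down the next radix-2^k digit of the dividend.
        remainder = (remainder << bits_per_step) | ((dividend >> shift) & radix_mask)
        # Quotient digit selection; a table lookup in hardware (assumption).
        digit = remainder // divisor
        remainder -= digit * divisor
        quotient = (quotient << bits_per_step) | digit
        steps += 1
    return quotient, remainder, steps
```

For a 64-bit dividend this loop runs 16 times in radix 16 but 64 times in radix 2. The 3 to 13 cycles reported in the abstract rely on the additional techniques it lists, which this sketch does not model.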

4.

Fast Elliptic Curve Cryptography on FPGA

Chelton W.N. Benaissa M. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2008,16(2):198-205

This paper details the design of a new high-speed pipelined application-specific instruction set processor (ASIP) for elliptic curve cryptography (ECC) using field-programmable gate-array (FPGA) technology. Different levels of pipelining were applied to the data path to explore the resulting performance and find an optimal pipeline depth. Three complex instructions were used to reduce latency by reducing the overall number of instructions, and a new combined algorithm was developed to perform point doubling and point addition using the application-specific instructions. An implementation for the United States Government National Institute of Standards and Technology-recommended curve over GF(2¹⁶³) is shown, which achieves a point multiplication time of 33.05 µs at 91 MHz on a Xilinx Virtex-E FPGA, the fastest figure reported in the literature to date. Using the more modern Xilinx Virtex-4 technology, a point multiplication time of 19.55 µs was achieved, which translates to over 51120 point multiplications per second.

5.

Hardware/Software Co-reconfigurable Instruction Decoder for Adaptive Multi-core DSP Architectures

Yong-Kyu Jung 《Journal of Signal Processing Systems》2011,62(3):273-285

A programmable instruction decoder (PID) is introduced for designing adaptive multi-core DSP architectures by using a hardware/software co-reconfigurable approach without employing programmable devices. The PID lets DSP software developers modify their DSP instruction sets after manufacturing to add application-specific instructions whenever necessary. In addition, the PID offers software developers an enhanced means of utilizing the underlying DSP architectures by rescheduling implemented micro-operations for their tailored instructions in the DSP processors. Thus, emerging DSP applications can be swiftly and efficiently re-imported to PID-based DSP processors without fabricating new DSP chips. In addition to instruction-level modification, an innovative instruction-packing procedure for the PID is presented to further enhance PID-based DSP systems. The PID architecture was developed and implemented in VHDL. PID-based DSP systems were also developed and evaluated to demonstrate various post-manufacturing adaptabilities in DSP processor systems. Various multi-core DSP architectures based on Texas Instruments’ TMS320C55 DSP processor were used to evaluate the performance and adaptability of this new programmable instruction decoder.

6.

Evaluation of a Syntax-Element-Level Grouped Parallel Arithmetic Coder Architecture for Parallel H.264 Encoders

陈胜刚 陈书明 谷会涛 刘尧 《电子学报》2012,40(2):400-405

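This paper designs a fully pipelined VLSI architecture for CABAC (Context-based Adaptive Binary Arithmetic Coding) entropy coding, driven by a stream of syntax-element instructions, and presents the design and cost evaluation of the proposed syntax-element-level grouped parallel arithmetic coder architecture. The parallel method is orthogonal to existing symbol-level parallel algorithms, so the two can be used together, and it suits large-scale on-chip parallel video encoders. Compared with standard CABAC, about 55% more transistors yield a symbol-processing speedup of more than 2× and a throughput above 1 Gbin/s.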

7.

Research on software protection based on virtual machine

Jun-feng XU Wei ZHANG Bo SUN 《中国邮电高校学报(英文版)》2012

Software security protection has become an important topic in the field of computer security. Most traditional software protection methods no longer meet the requirements of modern software protection. This paper presents a new virtual machine (VM)-based software protection program, by which x86 assembly instructions are compiled into virtual instructions that the VM can interpret and execute. This method can greatly increase the difficulty of reverse analysis and protect software developers' rights and intellectual property. In addition, the method adopts a random instruction generation algorithm, so that the virtual instructions generated by the solution differ from one piece of software to another, which greatly improves software security. Experiments show that the method protects well under both static and dynamic conditions.

8.

An Approach to Programmable Signal Processor Assemblers and Simulators

Meng T. Messerschmitt D. 《Communications, IEEE Transactions on》1986,34(12):1275-1277

A method of programming an assembler and simulator for a programmable digital signal processor (PDSP) is proposed. This method is general and simple to execute, and makes use of the UNIX® utilities awk, sed, and grep to translate a source program, consisting of assembly instructions and C statements, to either machine code for the assembler or a C program for the simulator. The approach also has the advantage that little programming effort is required to generate a new assembler and simulator for another processor instruction set. It also allows a PDSP to be simulated in the context of a larger system containing other PDSPs, other hardware, and an external environment. A disadvantage is that the approach is applicable only to the UNIX operating system.

9.

Control-flow checking by software signatures (cited 5 times: 0 self-citations, 5 by others)

Oh N. Shirvani P.P. McCluskey E.J. 《Reliability, IEEE Transactions on》2002,51(1):111-122

This paper presents a new signature monitoring technique, CFCSS (control flow checking by software signatures); CFCSS is a pure software method that checks the control flow of a program using assigned signatures. An algorithm assigns a unique signature to each node in the program graph and adds instructions for error detection. Signatures are embedded in the program during compilation time using the constant field of the instructions and compared with run-time signatures when the program is executed. Another algorithm reduces the code size and execution time overhead caused by checking instructions in CFCSS. A "branching fault injection experiment" was performed with benchmark programs. Without CFCSS, an average of 33.7% of the injected branching faults produced undetected incorrect outputs; however, with CFCSS, only 3.1% of branching faults produced undetected incorrect outputs. Thus it is possible to increase error detection coverage for control flow errors by an order of magnitude using CFCSS. The distinctive advantage of CFCSS over previous signature monitoring techniques is that CFCSS is a pure software method, i.e., it needs no dedicated hardware such as a watchdog processor for control flow checking. A watchdog task in a multitasking environment also needs no extra hardware, but the advantage of CFCSS over a watchdog task is that CFCSS can be used even when the operating system does not support multitasking.
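The signature check can be sketched in Python for the simplest case, in which every node of the program graph has exactly one predecessor. The XOR-difference update used here is how the scheme is commonly described, but the abstract itself does not state it, so treat it as an assumption; handling of nodes with several predecessors is omitted, and the names `build_differences` and `first_violation` are hypothetical.

```python
def build_differences(signatures, predecessor):
    """Compile-time step: d[v] = s[pred(v)] XOR s[v] for each node v."""
    return {v: signatures[p] ^ signatures[v] for v, p in predecessor.items()}


def first_violation(path, signatures, differences):
    """Run-time step: follow `path` while updating the run-time signature G.

    On entering a node, G is XORed with that node's difference and then
    compared with the node's assigned signature. Returns the first node
    where the comparison fails, or None if the whole path is legal.
    """
    g = signatures[path[0]]
    for node in path[1:]:
        g ^= differences[node]
        if g != signatures[node]:
            return node
    return None


# Toy program graph: A -> B -> C -> D, with unique signatures per node.
SIGNATURES = {"A": 0b0001, "B": 0b0010, "C": 0b0100, "D": 0b1000}
PREDECESSOR = {"B": "A", "C": "B", "D": "C"}
DIFFERENCES = build_differences(SIGNATURES, PREDECESSOR)
```

On the legal path A, B, C, D the run-time signature always matches. An illegal jump from A straight to C leaves G at 0b0111 instead of 0b0100, so the check at C fails, which is where real instrumented code would branch to an error handler.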

10.

A DSP Architecture for High‐Speed FFT in OFDM Systems

Jaesung Lee Jeonghoo Lee Myung H. Sunwoo Sangman Moh Seongkeun Oh 《ETRI Journal》2002,24(5):391-397

This paper presents digital signal processor (DSP) instructions and their data processing unit (DPU) architecture for high‐speed fast Fourier transforms (FFTs) in orthogonal frequency division multiplexing (OFDM) systems. The proposed instructions jointly perform new operation flows that are more efficient than the operation flow of the multiply and accumulate (MAC) instruction on which existing DSP chips heavily depend. We further propose a DPU architecture that fully supports the instructions and show that the architecture is two times faster than existing DSP chips for FFTs. We simulated the proposed model with a Verilog HDL, performed a logic synthesis using the 0.35 µm standard cell library, and then verified the functions thoroughly.

11.

Automatic Instruction Set Extension for Embedded Applications (cited 2 times: 1 self-citation, 1 by others)

吕雅帅 沈立 黄立波 王志英 《电子学报》2008,36(5):985-988

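Extending the instruction set for a specific application and implementing the extended instructions in custom hardware can greatly improve the performance of embedded processors. This paper proposes a fully automatic application-specific instruction set extension flow that estimates the speedup and hardware cost of extended instructions fairly accurately and performs instruction template matching efficiently. Experimental results show that, under a given hardware cost constraint, the extended instructions generated by this method significantly improve the performance of embedded applications.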

12.

Merging VLIW and vector processing techniques for a simple, high-performance processor architecture

《Microelectronics Journal》2015,46(7):637-655

This paper proposes a new processor architecture called VVSHP for accelerating data-parallel applications, which are growing in importance and demanding increased performance from hardware. VVSHP merges VLIW and vector processing techniques for a simple, high-performance processor architecture. One key point of VVSHP is the execution of multiple scalar instructions within VLIW and vector instructions on unified parallel execution datapaths. Another key point is to reduce the complexity of VVSHP by designing a two-part register file: (1) a shared scalar–vector part of 64×32-bit registers (64 scalar or 16×4 vector registers) with eight read and four write ports for storing scalar/vector data, and (2) a vector part of 48 vector registers with two read ports and one write port, each storing 4×32-bit vector data. A further key point is the processing of vector data with lengths varying from 1 to 256, which reduces loop overheads. VVSHP can issue up to four scalar/vector operations in each cycle to process a set of operands in parallel and produce up to four results to be written back into the VVSHP register file. However, it cannot issue more than one memory operation at a time, which loads/stores 128-bit scalar/vector data from/to data memory. The proposed VVSHP processor is implemented in VHDL targeting the Xilinx Virtex-5 FPGA, and its performance is evaluated.

13.

A Novel Application-specific Instruction-set Processor Design Approach for Video Processing Acceleration

Mame Maria Mbaye Normand Bélanger Yvon Savaria Samuel Pierre 《The Journal of VLSI Signal Processing》2007,47(3):297-315

Application-specific instruction-set processors (ASIPs) provide a good alternative for video processing acceleration, but the productivity gap implied by such a new technology may prevent leveraging it fully. Video processing SoCs need flexibility that is not available in pure hardware architectures, while pure software solutions do not meet video processing performance constraints. Thus, ASIP design could offer a good tradeoff between performance and flexibility. Video processing algorithms are often characterized by intrinsic parallelism that can be accelerated by ASIP specialized instructions. In this paper, we propose a new approach for exploiting sequences of tightly coupled specialized instructions in ASIP design applicable to video processing. Our approach, which avoids costly data communications by applying data grouping and data reuse, consists of accelerating an algorithm’s critical loops by transforming them according to a new intermediate representation. This representation is optimized and loop parallelism possibilities are also explored. This approach has been applied to video processing algorithms such as the ELA deinterlacer and the 2D-DCT. Experimental results show speedups of up to 18 on the considered applications, while the hardware overhead in terms of additional logic gates was found to be between 18 and 59%.

14.

Analysis and Prevention of a Typical XSS Frame Injection Attack

查杭 冷雪辉 《通信技术》2020,(1):186-190

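Network technology is trending toward mobile and Web-based applications, and the most typical example is the WeChat mini program. Such applications, which combine mobile and Web applications, are currently flourishing. A Web application deploys its logic on a server: the client sends request instructions to the server, and the server processes them and returns the results, which involves data transfer between client and server. In a social networking space, the authors found that an external link on a website's open platform had been maliciously injected, forcing users who clicked the link to perform malicious reposts. This paper analyzes how the case occurred and its mechanism, proposes corresponding preventive measures, and offers a security case study for server security in future WeChat mini program development.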

15.

Design of a Java Embedded Microprocessor Core for Smart Cards (cited 2 times: 1 self-citation, 1 by others)

唐小勇 羊性滋 (Tsinghua University) 《微电子学》2000,30(6):382-386

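This paper introduces an embedded microprocessor architecture that directly executes Java bytecode. The processor core implements the Java Card Virtual Machine (JCVM) instruction set. A RISC-like pipeline significantly speeds up instruction execution. The paper also proposes a new solution to the data dependency problem between stack-type instructions.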

16.

Design Space Exploration for an ASIP/Co-Processor Architecture used in GNSS Receivers

G. Kappen L. Kurz O. Priebe T. G. Noll 《Journal of Signal Processing Systems》2010,58(1):41-51

17.

The Design and Implementation of FFTW3 (cited 27 times: 0 self-citations, 27 by others)

Frigo M. Johnson S.G. 《Proceedings of the IEEE. Institute of Electrical and Electronics Engineers》2005,93(2):216-231

FFTW is an implementation of the discrete Fourier transform (DFT) that adapts to the hardware in order to maximize performance. This paper shows that such an approach can yield an implementation that is competitive with hand-optimized libraries, and describes the software structure that makes our current FFTW3 version flexible and adaptive. We further discuss a new algorithm for real-data DFTs of prime size, a new way of implementing DFTs by means of machine-specific single-instruction, multiple-data (SIMD) instructions, and how a special-purpose compiler can derive optimized implementations of the discrete cosine and sine transforms automatically from a DFT algorithm.

18.

Including landmarks in routing instructions (cited 1 time: 0 self-citations, 1 by others)

《Journal of Location Based Services》2013,7(1):28-52

This article addresses the problem of incorporating cognitively salient landmarks in computer-generated navigation instructions. On the basis of a review of the existing literature in the domain of navigation with landmarks, the article develops algorithms for generating routing instructions that include references to landmarks. The most basic algorithm uses a new weighting model to annotate simple routes with references to landmarks. A key novel feature of this algorithm is that it depends only on commonly available data and generic capabilities of existing web mapping environments. A suite of extensions is also proposed for improving the cognitive ergonomics of the basic landmark instructions. A case study, implemented within a national online routing system, demonstrates the practicality of the approach. The article concludes by reviewing a range of further issues for future work.

19.

High Acceleration for Video Processing Applications Using Specialized Instruction Set Based on Parallelism and Data Reuse

Nicolas Beucher Normand Bélanger Yvon Savaria Guy Bois 《Journal of Signal Processing Systems》2009,56(2-3):155-165

This paper describes an application-specific instruction set for a configurable processor to accelerate motion-compensated frame rate conversion (MC-FRC) algorithms based on block motion estimation (BME). The paper shows that the key to achieving very high performance when creating new instructions is to leverage, at the same time, parallel computations, data reuse, and efficient cache use. This is supported by concrete examples that demonstrate how it can be done in the case of the two algorithms considered. The new instructions are used to implement two BME algorithms: one implements the full search (FS) block matching algorithm (BMA), while the other implements the One-Dimensional Full Search (ODFS) BMA. The obtained acceleration factors exceed one hundred for the MC-FRC algorithm embedding the FS algorithm and twenty for the ODFS algorithm. The results show that such global acceleration is the consequence of combining parallel computations, data reuse, and efficient cache use, not of any one of them alone.
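For orientation, full search block matching can be sketched in plain Python. This is a scalar reference model, not the specialized instruction set of the paper. The sum of absolute differences (SAD) is assumed as the matching cost because it is the usual choice, and the names `sad` and `full_search` and their parameters are hypothetical.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, search_range):
    """Try every displacement within +/- search_range and keep the best one.

    Frames are lists of rows of integer pixel values. Candidates that
    fall outside the reference frame are skipped.
    Returns (best_cost, dx, dy), or None if no candidate is valid.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            inside = (0 <= bx + dx and bx + dx + n <= width
                      and 0 <= by + dy and by + dy + n <= height)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

Neighboring candidates overlap in all but one row or column of reference pixels, and every absolute difference is independent of the others. Those two properties are what the data reuse and parallel computation mentioned in the abstract can exploit.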

20.

Potential analysis of a superscalar core employing a reconfigurable array for improving instruction-level parallelism

Marcelo Brandalero Antonio Carlos S. Beck 《Design Automation for Embedded Systems》2016,20(2):155-169

As technology scaling slows and energy efficiency becomes an important new design constraint, superscalar processor designs are reaching their performance limits due to area and power restrictions. As a result, new microarchitectural paradigms need to be developed. This work proposes a new organization for x86 processors, based on a traditional superscalar design coupled to a reconfigurable array. The system exploits the fact that a few basic blocks are responsible for most of the instructions that execute in the processor, and transforms these basic blocks into configurations for the reconfigurable array. Each configuration encodes the semantics and dependencies of all instructions in the block, so that the ones already mapped can execute while bypassing the fetch, decode, and dependency-check stages, improving instruction throughput. Our study of the potential of the architecture shows that performance gains of up to 2.5× with respect to a traditional superscalar can be achieved.