首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 281 毫秒
1.
This paper presents new DSP (Digital Signal Processor) instructions and their hardware architecture for high-speed FFT. The instructions perform new operation flows, which are different from the MAC (Multiply and Accumulate) operation on which existing DSP chips heavily depend. This paper proposes the DPU (Data Processing Unit) supporting the instructions and shows it to be two times faster than existing DSP chips for FFT. The architecture has been modeled by the Verilog HDL and logic synthesis has been performed using the 0.35 m standard cell library. The maximum operating clock frequency is about 144.5 MHz and the architecture will be employed on an application-specific DSP chip.  相似文献   

2.
We propose two new instructions, swperm and sieve, that can be used to efficiently complete an arbitrary bit-level permutation of an n-bit word with or without repetitions. Permutations with repetitions are rearrangements of an ordered set in which elements may replace other elements in the set; such permutations are useful in cryptographic algorithms. On a four-way superscalar processor, we can complete an arbitrary 64-bit permutation with repetitions of 1-bit subwords in 11 instructions and only four cycles using the two proposed instructions. For subwords of size 4 bits or greater, we can perform an arbitrary permutation with repetitions of a 64-bit register in a single cycle using a single swperm instruction. This improves upon previous results by requiring fewer instructions to permute 4-bit or larger subwords packed in a 64-bit register and fewer execution cycles for 1-bit subwords on wide superscalar processors. We also demonstrate that we can accelerate the performance of the popular DES block cipher using the proposed instructions. We obtain a DES performance improvement of at least 55% in constrained embedded environments and an improvement of 71% on a four-way superscalar processor when applying DES as a cryptographic hash function.  相似文献   

3.
In new generations of microprocessors, the superscalar architecture is widely adopted to increase the number of instructions executed in one cycle. The division instruction among all of the instructions needs more cycles than the rest, e.g., addition and multiplication. This makes the division instruction an important cycles-per-instruction figure for modern microprocessors. In this paper, a radix-16/8/4/2 divisor is proposed, which uses a variety of techniques, including operand scaling, table partitioning, and, particularly, table sharing, to increase performance without the cost of increasing complexity. A physical chip using the proposed method is implemented by 0.35-/spl mu/m single poly four metal (1P4M) CMOS technology. The testing measurement shows that the chip can execute signed 64-b/32-b integer division between 3-13 cycles with a 80-MHz operating clock.  相似文献   

4.
This paper details the design of a new high-speed pipelined application-specific instruction set processor (ASIP) for elliptic curve cryptography (ECC) using field-programmable gate-array (FPGA) technology. Different levels of pipelining were applied to the data path to explore the resulting performances and find an optimal pipeline depth. Three complex instructions were used to reduce the latency by reducing the overall number of instructions, and a new combined algorithm was developed to perform point doubling and point addition using the application specific instructions. An implementation for the United States Government National Institute of Standards and Technology-recommended curve over GF(2163) is shown, which achieves a point multiplication time of 33.05 s at 91 MHz on a Xilinx Virtex-E FPGA-the fastest figure reported in the literature to date. Using the more modern Xilinx Virtex-4 technology, a point multiplication time of 19.55 s was achieved, which translates to over 51120 point multiplications per second.  相似文献   

5.
A programmable instruction decoder (PID) is introduced for designing adaptive multi-core DSP architectures by using a hardware/software co-reconfigurable approach without employing programmable devices. This PID permits DSP software developers for post-manufacturing modification of their DSP instruction sets to add their application-specific instructions whenever necessary. In addition, PID offers software developers an enhanced means to utilize the underlying DSP architectures by rescheduling implemented micro-operations for their tailored instructions in the DSP processors. Thus, emerging DSP applications can be swiftly and efficiently re-imported to PID-based DSP processors without re-fabrication of new DSP chips. In addition to instruction-level modification, an innovative instruction-packing procedure for PID is presented for further enhancement of the PID-based DSP systems. PID architecture was developed and implemented in VHDL. The PID-based DSP systems were also developed and evaluated to demonstrate various post-manufacturing adaptabilities in DSP processor systems. Various multi-core DSP architectures based on Texas Instruments’ TMS320C55 DSP processor were used for evaluating performance and adaptability of this new programmable instruction decoder.  相似文献   

6.
 设计了一种语法元素指令流驱动的全流水CABAC(Context-based Adaptive Binary Arithmetic Coding)熵编码VLSI结构,并对提出的语法元素级分组并行算术编码器的体系结构进行了设计和开销评估.该并行方法可以与现有符号级并行算法正交,可同时使用,适合大规模片上并行视频编码器;相比标准CABAC,增加约55%的晶体管即可实现2倍以上的符号处理加速比和>1Gbin/s的吞吐率.  相似文献   

7.
Software security protection has become an important topic in the field of computer security. Most of the traditional software protection methods are no longer suitable for the requirements of modern software protection. This paper presents a new virtual machine (VM)-based software protection program, by which the X86 assembly instructions are compiled into virtual instructions that VM can interpret and implement. This method can greatly increase the difficulty of reverse analysis, and protect the rights of software developers and intellectual property. In addition, this method adopts a random instruction generation algorithm which makes different software instructions generated by our solution, so that the software security can be improved greatly. It is presented by experiments that the protective effect of the method above is good in either static or dynamic condition.  相似文献   

8.
A method of programming an assembler and simulator for a programmable digital signal processor (PDSP) is proposed. This method is general and simple to execute, and makes use of theUNIX®; utilities awk, sed, and grep to translate a source program, consisting of assembly instructions and C statements, to either machine code for the assembler or a C program for the simulator. The approach also provides the advantages that little programming effort is required to generate a new assembler and simulator for another processor instruction set. It also allows a PDSP to be simulated in the context of a larger system containing other PDSP's, other hardware, and an external environment. A disadvantage is that the approach is applicable only to the UNIX operating system.  相似文献   

9.
Control-flow checking by software signatures   总被引:5,自引:0,他引:5  
This paper presents a new signature monitoring technique, CFCSS (control flow checking by software signatures); CFCSS is a pure software method that checks the control flow of a program using assigned signatures. An algorithm assigns a unique signature to each node in the program graph and adds instructions for error detection. Signatures are embedded in the program during compilation time using the constant field of the instructions and compared with run-time signatures when the program is executed. Another algorithm reduces the code size and execution time overhead caused by checking instructions in CFCSS. A "branching fault injection experiment" was performed with benchmark programs. Without CFCSS, an average of 33.7 % of the injected branching faults produced undetected incorrect outputs; however, with CFCSS, only 3.1 % of branching faults produced undetected incorrect outputs. Thus it is possible to increase error detection coverage for control flow errors by an order of magnitude using CFCSS. The distinctive advantage of CFCSS over previous signature monitoring techniques is that CFCSS is a pure software method, i.e., it needs no dedicated hardware such as a watchdog processor for control flow checking. A watchdog task in multitasking environment also needs no extra hardware, but the advantage of CFCSS over a watchdog task is that CFCSS can be used even when the operating system does not support multitasking  相似文献   

10.
This paper presents digital signal processor (DSP) instructions and their data processing unit (DPU) architecture for high‐speed fast Fourier transforms (FFTs) in orthogonal frequency division multiplexing (OFDM) systems. The proposed instructions jointly perform new operation flows that are more efficient than the operation flow of the multiply and accumulate (MAC) instruction on which existing DSP chips heavily depend. We further propose a DPU architecture that fully supports the instructions and show that the architecture is two times faster than existing DSP chips for FFTs. We simulated the proposed model with a Verilog HDL, performed a logic synthesis using the 0.35 µm standard cell library, and then verified the functions thoroughly.  相似文献   

11.
面向嵌入式应用的指令集自动扩展   总被引:2,自引:1,他引:1       下载免费PDF全文
 面向特定应用扩展指令集,并通过定制的硬件实现这些扩展指令,能够大幅度提高嵌入式处理器的性能.本文提出了一种全自动的面向特定应用的指令集扩展流程,该流程能够较精确地估算扩展指令的性能加速比和硬件开销,并高效完成指令模板匹配.实验结果表明,在给定的硬件开销限制下,该方法产生的扩展指令能够显著提升嵌入式应用的性能.  相似文献   

12.
《Microelectronics Journal》2015,46(7):637-655
This paper proposes a new processor architecture called VVSHP for accelerating data-parallel applications, which are growing in importance and demanding increased performance from hardware. VVSHP merges VLIW and vector processing techniques for a simple, high-performance processor architecture. One key point of VVSHP is the execution of multiple scalar instructions within VLIW and vector instructions on unified parallel execution datapaths. Another key point is to reduce the complexity of VVSHP by designing a two-part register file: (1) shared scalar–vector part with eight-read/four-write ports 64×32-bit registers (64 scalar or 16×4 vector registers) for storing scalar/vector data and (2) vector part with two-read/one-write ports 48 vector-registers, each stores 4×32-bit vector data. Moreover, processing vector data with lengths varying from 1 to 256 represents a key point for reducing the loop overheads. VVSHP can issue up to four scalar/vector operations in each cycle for parallel processing a set of operands and producing up to four results to be written back into VVSHP register file. However, it cannot issue more than one memory operation at a time, which loads/stores 128-bit scalar/vector data from/to data memory. The design of our proposed VVSHP processor is implemented using VHDL targeting the Xilinx FPGA Virtex-5 and its performance is evaluated.  相似文献   

13.
Application-specific instruction-set processors (ASIPs) provide a good alternative for video processing acceleration, but the productivity gap implied by such a new technology may prevent leveraging it fully. Video processing SoCs need flexibility that is not available in pure hardware architectures, while pure software solutions do not meet video processing performance constraints. Thus, ASIP design could offer a good tradeoff between performance and flexibility. Video processing algorithms are often characterized by intrinsic parallelism that can be accelerated by ASIP specialized instructions. In this paper, we propose a new approach for exploiting sequences of tightly coupled specialized instructions in ASIP design applicable to video processing. Our approach, which avoids costly data communications by applying data grouping and data reuse, consists of accelerating an algorithm’s critical loops by transforming them according to a new intermediate representation. This representation is optimized and loop parallelism possibilities are also explored. This approach has been applied to video processing algorithms such as the ELA deinterlacer and the 2D-DCT. Experimental results show speedups up to 18 (on the considered applications, while the hardware overhead in terms of additional logic gates was found to be between 18 and 59%.
Samuel PierreEmail:
  相似文献   

14.
查杭  冷雪辉 《通信技术》2020,(1):186-190
现在网络技术的发展趋势朝着移动化、WEB化程度发展,其中最典型的代表是微信小程序。目前,这种集移动和WEB应用程序为一体的新式应用蓬勃发展。WEB应用程序将逻辑处理程序部署在服务器上,客户端对服务器发出请求指令,服务器处理这些指令并返回处理后的结果,其中涉及到客户端和服务端的数据传输,同时客户端发送处理指令传递给服务器,服务器接收指令并进行处理。笔者在某社交空间中发现了某网站的开放平台的外链被恶意注入,强制点击该链接的用户执行恶意转发。下面通过分析该案例的发生和机理,提出相应的预防方案,为之后的微信小程序开发中的服务器安全提供一个安全教例。  相似文献   

15.
应用于智能卡的Java嵌入式微处理器核的设计   总被引:2,自引:1,他引:1  
介绍了一种可直接执行Java字节码的嵌入式微处理器体系结构。该处理器核实现了Java卡虚拟机(JCVM)指令集。类RISC的流水线显著加快了指令的执行速度。文中对堆栈类型指令间的数据相关问题提出了一种新的解决办法。  相似文献   

16.
17.
The Design and Implementation of FFTW3   总被引:27,自引:0,他引:27  
FFTW is an implementation of the discrete Fourier transform (DFT) that adapts to the hardware in order to maximize performance. This paper shows that such an approach can yield an implementation that is competitive with hand-optimized libraries, and describes the software structure that makes our current FFTW3 version flexible and adaptive. We further discuss a new algorithm for real-data DFTs of prime size, a new way of implementing DFTs by means of machine-specific single-instruction, multiple-data (SIMD) instructions, and how a special-purpose compiler can derive optimized implementations of the discrete cosine and sine transforms automatically from a DFT algorithm.  相似文献   

18.
Including landmarks in routing instructions   总被引:1,自引:0,他引:1  
This article addresses the problem of incorporating cognitively salient landmarks in computer-generated navigation instructions. On the basis of a review of the existing literature in the domain of navigation with landmarks, the article develops algorithms for generating routing instructions that include references to landmarks. The most basic algorithm uses a new weighting model to annotate simple routes with references to landmarks. A key novel feature of this algorithm is that it depends only on commonly available data and generic capabilities of existing web mapping environments. A suite of extensions are also proposed for improving the cognitive ergonomics of the basic landmark instructions. A case study, implemented within a national online routing system, demonstrates practicality of the approach. The article then concludes by reviewing a range of further issues for future work.  相似文献   

19.
This paper describes an application-specific instruction set for a configurable processor to accelerate motion-compensated frame rate conversion (MC-FRC) algorithms based on block motion estimation (BME). The paper shows that the key to achieve very high performance when creating new instructions is to leverage, at the same time, parallel computations, data reuse, and efficient cache use. This is supported by concrete examples that demonstrate how it can be done in the case of the two algorithms considered. The new instructions are used to implement two BME algorithms: one implements the full search (FS) block matching algorithm (BMA), while the other implements the One-Dimensional Full Search (ODFS) BMA. The obtained acceleration factors exceed one hundred for the MC-FRC algorithm embedding the FS algorithm and twenty for the ODFS algorithm. The results show that getting such global acceleration is the consequence of combining parallel computations, data reuse, and efficient cache use, not of only one of them.  相似文献   

20.
As technology scaling reduces pace and energy efficiency becomes a new important design constraint, superscalar processor designs are reaching their performance limits due to area and power restrictions. As a result, new microarchitectural paradigms need to be developed. This work proposes a new organization for x86 processors, based on a traditional superscalar design coupled to a reconfigurable array. The system exploits the fact that few basic blocks are responsible for most of the instructions that execute in the processor, and transforms these basic blocks into configurations for the reconfigurable array. Each configuration encodes the semantics and dependencies for all instructions in the block, so that the ones already mapped can execute bypassing the fetch, decode and dependency checks stages and improving instruction throughput. Our study on the potential of the architecture shows that performance gains of up to 2.5\(\times \) with respect to a traditional superscalar can be achieved.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号