首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
《Microelectronics Journal》2015,46(7):637-655
This paper proposes a new processor architecture called VVSHP for accelerating data-parallel applications, which are growing in importance and demanding increased performance from hardware. VVSHP merges VLIW and vector processing techniques for a simple, high-performance processor architecture. One key point of VVSHP is the execution of multiple scalar instructions within VLIW and vector instructions on unified parallel execution datapaths. Another key point is to reduce the complexity of VVSHP by designing a two-part register file: (1) shared scalar–vector part with eight-read/four-write ports 64×32-bit registers (64 scalar or 16×4 vector registers) for storing scalar/vector data and (2) vector part with two-read/one-write ports 48 vector-registers, each stores 4×32-bit vector data. Moreover, processing vector data with lengths varying from 1 to 256 represents a key point for reducing the loop overheads. VVSHP can issue up to four scalar/vector operations in each cycle for parallel processing a set of operands and producing up to four results to be written back into VVSHP register file. However, it cannot issue more than one memory operation at a time, which loads/stores 128-bit scalar/vector data from/to data memory. The design of our proposed VVSHP processor is implemented using VHDL targeting the Xilinx FPGA Virtex-5 and its performance is evaluated.  相似文献   

2.
Multiprocessor System on Chips (MPSoCs) are quickly becoming the mainstay in embedded processing platforms due to their hardware and software design flexibility. This flexibility increases the design space for developers, introducing trade-offs between performance and resource/power consumption. This paper presents a comprehensive evaluation of memory customisations for MPSoCs. Custom arrangements of instruction and data cache are presented to optimise off-chip memory consumption and improve system performance. Off-chip memory management and threading are presented to balance the computational load on available processors and improve system performance. The proposed methods are applied to an object detection case study, where performance increases of up to 2.93x are achieved when compared to standard memory designs. Furthermore, the proposed techniques can increase the number of possible processors in an MPSoC by reducing the number of bus interconnects.  相似文献   

3.
简单介绍了多DSP硬件模块的应用背景。主要介绍了基于美国德州仪器(TI)公司生产的TMS320 VC5416 DSP芯片实现的8 DSP硬件模块实现方法。该模块的结构主要包括多片DSP、FLASH程序加载、JTAG硬件仿真和FPGA等子模块。详细论述了多DSP与FPGA的连接、FLASH存储器与DSP、FPGA的连接,以及硬件仿真所用的JTAG菊花链。并且通过验证该硬件模块运行正确。  相似文献   

4.
《电子学报:英文版》2017,(6):1198-1205
FPGA based soft vector processing accelerators are used frequently to perform highly parallel data processing tasks. Since they are not able to implement complex control manipulations using software, most FPGA systems now incorporate either a soft processor or hard processor. A FPGA based AXI bus compatible vector accelerator architecture is proposed which utilises fully pipelined and heterogeneous ALU for performance, and microcoding is employed for reusability. The design is tested with several design examples in four different lane configurations. Compared with Central processing unit (CPU), Digital signal processor (DSP), Altera C2H tool and OpenCL SDK implementations, the vector processor improves on execution time and energy consumption by factors of up to 6.6 and 6.4 respectively.  相似文献   

5.
Fast Fourier transform algorithms on large data sets achieve poor performance on various platforms because of the inefficient strided memory access patterns. These inefficient access patterns need to be reshaped to achieve high performance implementations. In this paper we formally restructure 1D, 2D and 3D FFTs targeting a generic machine model with a two-level memory hierarchy requiring block data transfers, and derive memory access pattern efficient algorithms using custom block data layouts. These algorithms need to be carefully mapped to the targeted platform’s architecture, particularly the memory subsystem, to fully utilize performance and energy efficiency potentials. Using the Kronecker product formalism, we integrate our optimizations into Spiral framework and evaluate a family of DRAM-optimized FFT algorithms and their hardware implementation design space via automated techniques. In our evaluations, we demonstrate DRAM-optimized accelerator designs over a large tradeoff space given various problem (single/double precision 1D, 2D and 3D FFTs) and hardware platform (off-chip DRAM, 3D-stacked DRAM, ASIC, FPGA, etc.) parameters. We show that Spiral generated pareto optimal designs can achieve close to theoretical peak performance of the targeted platform offering 6x and 6.5x system performance and power efficiency improvements respectively over conventional row-column FFT algorithms.  相似文献   

6.
A heterogeneous multicore system-on-chip (SoC) has been developed for high-definition (HD) multimedia applications that require secure DRM (digital rights management). The SoC integrates three types of processors: two specific-purpose accelerators for cipher and high-resolution video decoding; one general-purpose accelerator (MX); and three CPUs. This is how our SoC achieves high performance and low power consumption with hardware customized for video processing applications that process a large amount of data. To achieve secure data control, hardware memory management and software system virtualization are adopted. The security of the system is the result of the cooperation between the hardware and software on the system. Furthermore, a highly tamper-resistant system is provided on our SiP (System in a package), through DDR2 SDRAMs and a flash memory that contain confidential information in one package. This secure multimedia processor provides a solution to protect contents and to safely deliver secure sensitive information when processing billing transactions that involve digital content delivery. The SoC was implemented in a 90 nm generic CMOS technology.   相似文献   

7.
Memory latency has always been a major issue in embedded systems that execute memory-intensive applications. This is even more true as the gap between processor and memory speed continues to grow. Hardware and software prefetching have been shown to be effective in tolerating the large memory latencies inherit in large off-chip memories; however, both types of prefetching have their shortcomings. Hardware schemes are more complex and require extra circuitry to compute data access strides, while software schemes generate prefetch instructions, which if not computed carefully may hamper performance. On the other hand, some applications domains (such as multimedia) have a uniform and known a priori memory access pattern, that if exploited, could yield significant application performance improvement. With this characteristic in mind, we present our findings on hiding memory latency using the direct memory access (DMA) mode, which is present in all modern systems, combined with a software prefetch mechanism, and a customized on-chip memory hierarchy mapping. Compared to previous approaches, we are able to estimate the performance and power metrics, without actually implementing the embedded system. Experimental results on nine well known multimedia and imaging applications prove the efficiency of our technique. Finally, we verify the performance estimations by implementing and simulating the algorithms on the TI C6201 processor.  相似文献   

8.
多处理器系统已广泛应用于高速信号处理领域,为提高系统性能,更好地发挥多处理器优势,介绍采用基于FPGA的多DSP架构。利用FPGA作为数据调度核心,将处理器从繁杂的数据通信工作中解放出来,充分发挥了多处理器的并行工作能力,增强了系统的重构和拓展性。该系统已应用于工程实践中,以一块高密度电路板实现了从数据采集到图像校正、图像处理,以及图像显示的整个流程,能够满足对处理时间要求较高、较为复杂的图像处理算法的要求。  相似文献   

9.
A 250-MHz microprocessor intended for home computer entertainment consists of a CPU core with 128-b multimedia extensions, two single-instruction, multiple-data (SIMD) very long instruction word (VLIW) vector processors containing ten floating-point multiplier accelerators and four floating-point dividers, an MPEG-2 decoder, a ten-channel direct memory access (DMA) controller, and other peripherals with 128 b internal buses on one die. The core is a two-way superscalar MIPS-compatible microprocessor with 16-kB scratch-pad RAM. Each vector processor is a five-way SIMD-VLIW architecture, which is tightly dedicated for specific applications concerning three-dimensional geometry calculation and physical simulation. A DMA controller connects between main memory and each processor's local memory to conceal memory access penalty. It contains 10.5 M transistors in 17×14.1 mm and dissipates 15 W at 1.8 V  相似文献   

10.
Novel algorithmic features of multimedia applications and advances in VLSI technologies are driving forces behind the new multimedia signal processors. We propose an architecture platform which could provide high performance and flexibility, and would require less external I/O and memory access. It is comprised of array processors to be used as the hardware accelerator and RISC cores to be used as the basis of the programmable processor. It is a hierarchical and scalable architecture style which facilitates the hardware-software codesign of multimedia signal processing circuits and systems. While some control-intensive functions can be implemented using programmable CPUs, other computation-intensive functions can rely on hardware accelerators.To compile multimedia algorithms, we also present an operation placement and scheduling scheme suitable for the proposed architectural platform. Our scheme addresses data reusability and exploits local communication in order to avoid the memory/communication bandwidth bottleneck, which leads to faster program execution. Our method shows a promising performance: a linear speed-up of 16 times can be achieved for the block-matching motion estimation algorithm and the true motion tracking algorithm, which have formed many multimedia applications (e.g., MPEG-2 and MPEG-4).  相似文献   

11.
In order to achieve maximization of parallelism, effective distribution of rendering tasks, balance between performance and flexibility in graphics processing pipeline, this article presents design, performance analysis and optimization for multi-core interactive graphics processing unit (MIGPU). This processor integrates twelve processing cores with specific instruction set architecture and many sophisticated application-specific accelerators into a 3D graphics engine. It is implemented on XC6VLX550T field programmable gate array (FPGA). MIGPU supports OpenGL2.0 with programmable front-end processor, vertex shader, plane clipper, geometry transformer, three-D clippers and pixel shaders. For boosting the performance of MIGPU, the relationship model is established between primitive types, vertices, pixels, and the effect of culling, clipping, and memory access, and shows a way to improve the speed up of the graphics pipeline. It is capable of assigning graphics rendering tasks to different processors for efficiency and flexibility. The pixel filling rate can reach to 40 Mpixel/s at its peak performance.  相似文献   

12.
General-purpose multicore processors are being accepted in all segments of the industry, including signal processing and embedded space, as the need for more performance and general-purpose programmability has grown. Parallel processing increases performance by adding more parallel resources while maintaining manageable power characteristics. The implementations of multicore processors are numerous and diverse. Designs range from conventional multiprocessor machines to designs that consist of a "sea" of programmable arithmetic logic units (ALUs). In this article, we cover some of the attributes common to all multicore processor implementations and illustrate these attributes with current and future commercial multicore designs. The characteristics we focus on are application domain, power/performance, processing elements, memory system, and accelerators/integrated peripherals.  相似文献   

13.
Next-generation mobile devices will continue to demand high processing power for imaging applications. The expected performance is in the class of supercomputers, but delivered with limited energy and memory bandwidth for embedded systems. This article advocates a streaming computation model that leverages the deterministic access patterns in imaging applications to deliver the necessary processing throughput. A reconfigurable datapath connects a set of functional units, forming a computation pipeline to offer energy efficiency. The architecture and implementation of a stream processor are presented along with the memory subsystem to support stream data transfers. The results show speedup ranging from a factor of 2 to 28 for imaging applications, offering favorable comparison against scalar processors.  相似文献   

14.
朱玉飞  戴紫彬  徐进辉  李功丽 《电子学报》2017,45(12):2957-2964
以信息安全设备的密码应用需求为基础,融合流体系结构处理器基本架构,设计出流体系结构密码处理器.文章主要研究和设计影响该处理器性能的瓶颈--流存储系统.此系统针对专用密码处理器的存储特点,并采用可配置化设计,满足密码应用对处理器存储系统灵活高效的要求.同时,该设计将层次化-分布-分体式存储、多数据通道流水并行化访存、流访存调度策略相结合,优化存储系统的访存效率,以提高该处理器的整体性能.研究结果表明,相比于典型密码处理器的存储设计,该设计的访存效率最高可提升约6倍.  相似文献   

15.
In this paper we analyze the power consumption and energy efficiency of general matrix-matrix multiplication (GEMM) and Fast Fourier Transform (FFT) implemented as streaming applications for an FPGA-based coprocessor card. The power consumption is measured with internal voltage sensors and the power draw is broken down onto the systems components in order to classify the energy consumed by the processor cores, the memory, the I/O links and the FPGA card. We present an abstract model that allows for estimating the power consumption of FPGA accelerators on the system level and validate the model using the measured kernels. The performance and energy consumption is compared against optimized multi-threaded software running on the POWER7 host CPUs. Our experimental results show that the accelerator can improve the energy efficiency by an order of magnitude when the computations can be undertaken in a fixed point format. Using floating point data, the gain in energy-efficiency was measured as up to 30 % for the double precision GEMM accelerator and up to 5 × for a 1k complex FFT.  相似文献   

16.
We present an architecture of decoupled processors with a memory hierarchy consisting only of scratch-pad memories, and a main memory. This architecture exploits the more efficient pre-fetching of Decoupled processors, that make use of the parallelism between address computation and application data processing, which mainly exists in streaming applications. This benefit combined with the ability of scratch-pad memories to store data with no conflict misses and low energy per access contributes significantly for increasing the system’s performance. The application code is split in two parallel programs the first runs on the Access processor and computes the addresses of the data in the memory hierarchy. The second processes the application data and runs on the Execute processor, a processor with a limited address space—just the register file addresses. Each transfer of any block in the memory hierarchy up to the Execute processor’s register file is controlled by the Access processor and the DMA units. This strongly differentiates this architecture from traditional uniprocessors and existing decoupled processors with cache memory hierarchies. The architecture is compared in performance with uniprocessor architectures with (a) scratch-pad and (b) cache memory hierarchies and (c) the existing decoupled architectures, showing its higher normalized performance. The reason for this gain is the efficiency of data transferring that the scratch-pad memory hierarchy provides combined with the ability of the Decoupled processors to eliminate memory latency using memory management techniques for transferring data instead of fixed prefetching methods. Experimental results show that the performance is increased up to almost 2 times compared to uniprocessor architectures with scratch-pad and up to 3.7 times compared to the ones with cache. The proposed architecture achieves the above performance without having penalties in energy delay product costs.  相似文献   

17.
Limits to chip power dissipation and power density and limits on the benefits of hyperpipelining in microprocessors threaten to stop the exponential performance growth in microprocessor performance that we have grown accustomed to. Multicore processors can continue to provide historical performance growth on most modern consumer and business applications. However, power efficiency of these cores must also be improved to stay within reasonable power budgets. This can be achieved by simplifying the processor core architecture, and reversing the trend toward ever more-complex and less power-efficient cores. To maintain overall performance growth with stunted per-core and per-thread performance, growth rates will require an even more rapid increase in the number of cores per die. Growing performance by increasing the number of cores on a die at this rate, however, puts unprecedented requirements on the corresponding growth of off-chip bandwidth. We argue that contrary to the international roadmap for semiconductors (ITRS) predictions, off-chip signaling frequencies are likely to exceed the frequencies of processor cores in the not too distant future, consistent with the system-on-package (SOP) concept in the first paper of this issue. If this approach is followed, a 1-TFlops multiprocessor die with 1 TB/s of off-chip bandwidth is feasible at reasonable cost before the end of the decade.  相似文献   

18.
余乐  李任伟  王瑶  李洋洋  吴超  贾瑞 《电子学报》2018,46(4):992-1004
近年来,随着各种IP核的广泛应用,SoC-FPGA的应用领域也随之日益扩展.处理器作为SoC-FPGA的核心IP,其对系统性能的影响至关重要.使用开源处理器IP能大幅度提高SoC-FPGA系统级设计的效率,已成为现在项目开发中常用的手段.本文研究了现有的绝大多数开源处理器的关键技术指标,从可用性和稳定性上提出了一种选择开源处理器的方法.根据该方法,选择出一些具有高可用性和稳定性的开源处理器.最后,利用不同厂商提供的FPGA EDA工具将所述的开源处理器进行了综合与实现,并与现有FPGA厂商提供的商用软核Nios Ⅱ和Microblaze进行了比较和讨论.  相似文献   

19.
This paper shows how a bus topology performs as a System-on-Chip (SoC) interconnection. We measure and analyze Heterogeneous IP Block Interconnection (HIBI) bus for a multiple clock domain, Multiprocessor System-on-Chip (MPSoC) with an MPEG-4 video encoding application on FPGA. The studied MPSoC contains up to 22 IP blocks: 11 soft processors, 8 hardware accelerators and three other components. A novel approach of frequency scaling is used to isolate the impact of various architecture components. The system is benchmarked in various configurations. For example, HIBI is run at 100× speed with respect to processors to resemble ideal interconnection. Based on the measurements with up to 16.9frames/s CIF (352 × 288) encoding speed, estimation for HDTV resolution video encoder is presented. The required optimizations are discussed. Finally, it is shown that 25frames/s 1280 × 720 video encoder needs 55 MHz HIBI but 670 MHz general-purpose soft RISC processors. In practice, the processing performance has to be boosted by implementing hardware acceleration and improving memory hierarchy. Clearly, HIBI is not the limiting factor.  相似文献   

20.
The designs of application specific integrated circuits and/or multiprocessor systems are usually required in order to improve the performance of multidimensional applications such as digital-image processing and computer vision. Wavelet-based algorithms have been found promising among these applications due to the features of hierarchical signal analysis and multiresolution analysis. Because of the large size of multidimensional input data, off-chip random access memory (RAM) based systems have ever been necessary for calculating algorithms in these applications, where either memory address pointers or data preprocessing and rearrangements in off-chip memories are employed. This paper establishes and follows novel concepts in data dependence analysis for generalized and arbitrarily multidimensional wavelet-based algorithms, i.e., the wavelet-adjacent field and the super wavelet-dependence vector. Based on them, a series of novel nonlinear I/O data space transformations for variable localization and dependence graph regularization for wavelet algorithms is proposed. It leads to general designs of non-RAM-based architectures for wavelet-based algorithms where off-chip communications for intermediate calculation results are eliminated without preprocessing or rearranging input data.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号