1.

A 155-mW 50-Mvertices/s graphics processor with fixed-point programmable vertex shader for mobile applications

Ju-Ho Sohn Jeong-Ho Woo Min-Wuk Lee Hye-Jung Kim Woo R. Hoi-Jun Yoo 《Solid-State Circuits, IEEE Journal of》2006,41(5):1081-1091

A 36 mm² graphics processor with a fixed-point programmable vertex shader is designed and implemented for portable two-dimensional (2-D) and three-dimensional (3-D) graphics applications. The graphics processor contains an ARM-10 compatible 32-bit RISC processor, a 128-bit programmable fixed-point single-instruction-multiple-data (SIMD) vertex shader, a low-power rendering engine, and a programmable frequency synthesizer (PFS). Unlike conventional graphics hardware, the proposed graphics processor implements an ARM-10 co-processor architecture with dual operations so that user-programmable vertex shading is possible for advanced graphics algorithms and various streaming multimedia processing in mobile applications. The circuits and architecture of the graphics processor are optimized for fixed-point operations and achieve low power consumption with the help of instruction-level power management of the vertex shader and pixel-level clock gating of the rendering engine. The PFS with a fully balanced voltage-controlled oscillator (VCO) controls the clock frequency from 8 MHz to 271 MHz continuously and adaptively for low-power modes by software. The chip shows 50 Mvertices/s and 200 Mtexels/s peak graphics performance, dissipating 155 mW in a 0.18-μm 6-metal standard CMOS logic process.

2.

Fast maximum intensity projections of large medical data sets by exploiting hierarchical memory architectures.

Gundolf Kiefer Helko Lehmann Jürgen Weese 《IEEE transactions on information technology in biomedicine》2006,10(2):385-394

Maximum intensity projections (MIPs) are an important visualization technique for angiographic data sets. Efficient data inspection requires frame rates of at least five frames per second at preserved image quality. Despite the advances in computer technology, this task remains a challenge. On the one hand, the sizes of computed tomography and magnetic resonance images are increasing rapidly. On the other hand, rendering algorithms do not automatically benefit from the advances in processor technology, especially for large data sets. This is due to the gap between the faster evolving processing power and the slower evolving memory access speed, which is bridged by hierarchical cache memory architectures. In this paper, we investigate memory access optimization methods and use them for generating MIPs on general-purpose central processing units (CPUs) and graphics processing units (GPUs), respectively. These methods can work on any level of the memory hierarchy, and we show that properly combined methods can optimize memory access on multiple levels of the hierarchy at the same time. We present performance measurements to compare different algorithm variants and illustrate the influence of the respective techniques. On current hardware, the efficient handling of the memory hierarchy for CPUs improves the rendering performance by a factor of 3 to 4. On GPUs, we observed that the effect is even larger, especially for large data sets. The methods can easily be adjusted to different hardware specifics, although their impact can vary considerably. They can also be used for rendering techniques other than MIPs, and their use for more general image processing tasks could be investigated in the future.
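As a rough illustration of the idea in this abstract, the sketch below computes an axis-aligned MIP twice: once with a plain traversal and once tile by tile, so that each small image tile is finished before the next one is touched. Tiling is a generic memory-access optimization chosen here as an assumption; the abstract does not say which optimizations the authors use, and the function names and the `block` parameter are hypothetical. In pure Python the cache effect is not measurable; the point is only the traversal order.

```python
def mip_naive(volume):
    """Maximum intensity projection along z for volume[z][y][x]."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    image = [[volume[0][y][x] for x in range(width)] for y in range(height)]
    for z in range(1, depth):
        for y in range(height):
            for x in range(width):
                if volume[z][y][x] > image[y][x]:
                    image[y][x] = volume[z][y][x]
    return image


def mip_blocked(volume, block=2):
    """Same projection, computed one block x block image tile at a time.

    Assumption: tiling stands in for the memory-access optimizations the
    abstract mentions; it is not taken from the paper itself.
    """
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    image = [[volume[0][y][x] for x in range(width)] for y in range(height)]
    for y0 in range(0, height, block):
        for x0 in range(0, width, block):
            # Finish this tile through all slices before moving on.
            for z in range(1, depth):
                for y in range(y0, min(y0 + block, height)):
                    row, out = volume[z][y], image[y]
                    for x in range(x0, min(x0 + block, width)):
                        if row[x] > out[x]:
                            out[x] = row[x]
    return image
```

Both functions return the same image; only the order in which voxels are visited differs, which is what a cache-aware implementation in a compiled language would exploit.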

3.

Praveen Krishnamurthy Jeremy Buhler Roger Chamberlain Mark Franklin Kwame Gyang Arpith Jacob Joseph Lancaster 《Journal of Signal Processing Systems》2007,49(1):101-121

Biosequence similarity search is an important application in modern molecular biology. Search algorithms aim to identify sets of sequences whose extensional similarity suggests a common evolutionary origin or function. The most widely used similarity search tool for biosequences is BLAST, a program designed to compare query sequences to a database. Here, we present the design of BLASTN, the version of BLAST that searches DNA sequences, on the Mercury system, an architecture that supports high-volume, high-throughput data movement off a data store and into reconfigurable hardware. An important component of application deployment on the Mercury system is the functional decomposition of the application onto both the reconfigurable hardware and the traditional processor. Both the Mercury BLASTN application design and its performance analysis are described.

4.

Task Scheduling for Context Minimization in Dynamically Reconfigurable Platforms

Nei-Chiung Perng Shih-Hao Hung Chia-Heng Tu 《Journal of Signal Processing Systems》2010,59(1):3-12

Dynamically reconfigurable hardware provides a useful means to reduce the time-to-prototype and even the time-to-market in product designs. It also offers a good alternative in reconfiguring hardware logic to optimize the system performance. This paper targets an essential issue in reconfigurable computing, i.e., the minimization of configuration contexts. We explore different constraints on the CONTEXT MINIMIZATION problem. When the resulting subproblems are polynomial-time solvable, optimal algorithms are presented. We also present a greedy algorithm for the CONTEXT MINIMIZATION problem, which is proved to be NP-complete. The capability of the proposed algorithm is evaluated by a series of experiments.

5.

Advanced graphics behind medical virtual reality: evolution of algorithms, hardware, and software interfaces

Soferman Z. Blythe D. John N.W. 《Proceedings of the IEEE. Institute of Electrical and Electronics Engineers》1998,86(3):531-554

Applications of virtual reality (VR) and augmented reality (AR) in medicine require real-time visualization and modeling of large three-dimensional data sets. Consequently, these applications require powerful computation, extensive high-bandwidth memory, and fast communication links. In the past, the manufacturers of medical imaging equipment produced their own special-purpose proprietary hardware for image processing and solid graphics. Due to the developments in computer hardware in general and in graphics accelerators in particular, there is a trend toward replacing the proprietary hardware with off-the-shelf (OTS) equipment. Computer graphics itself has advanced in its quest for realism. Generic algorithms such as shading, texture mapping, and volume rendering have been developed to meet the resultant ever-increasing requirements. Advances in both the OTS CPU and graphics hardware have enabled real-time implementations of these algorithms, thereby facilitating many of the medical VR/AR applications used today. The development of graphics libraries such as OpenGL has also been an important factor. These libraries provide an underlying portable software platform that optimizes the utilization of the available graphics hardware. OpenGL has become a standard graphics application programming interface, particularly for graphics-intensive applications, and more and more OTS systems provide hardware implementations of OpenGL commands. This review paper follows the evolution of these technologies, examines their crucial role in enabling the appearance of the current VR/AR applications in medicine, and provides a look at current trends and future possibilities.

6.

Efficient modeling and analysis of energy consumption for 3D graphics rendering

《Integration, the VLSI Journal》2016

This paper proposes new models of GPU energy consumption from the perspectives of hardware architects and graphics programmers by performing an architecture-independent analysis of the classical graphics rendering pipeline, which is still in widespread use today. The detailed analysis includes graphics rendering workload, memory bandwidth, and energy consumption. Although the models are derived from the classical 3D pipeline, they are extensible to programmable pipelines. There are many factors that affect the performance and energy consumption of 3D graphics rendering, such as the number of textures, vertex sharing, level of detail, and rendering algorithms. The proposed models are validated by our simulation study and used to guide our 3D graphics hardware design and 3D graphics programming in order to optimize performance and energy consumption of our GPU prototypes, which have been successfully fabricated in SMIC 0.13 μm CMOS technology.

7.

Parallel implementation of a receiver-side detection algorithm for narrowband Internet of Things channels

张新王瑜山蕊王昱吴皓月《电讯技术》2020,(1):87-91

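The detection and time-of-arrival estimation algorithm for the narrowband Internet of Things (NB-IoT) physical random access channel processes a large volume of data and is computationally time-consuming. To address this, the parallelizability and data dependencies of the receiver-side detection algorithm are analyzed, and a parallel hardware implementation based on a reconfigurable array processor is proposed. When the preamble symbols generated from higher-layer configuration parameters and the received symbols after preliminary channel processing reach maximum correlation, the algorithm takes the corresponding time of arrival and residual subcarrier offset as its estimates; multiple lightweight processing elements (PEs) compute in parallel in a pipelined fashion to improve efficiency. Experimental results show that mapping the algorithm onto six simultaneously scheduled PEs takes 35,985 cycles, a 36.18% performance improvement over a single PE. The run time on the reconfigurable multi-core array processor is 173.09 times shorter than that of a Matlab implementation, effectively improving the computational efficiency of the receiver-side detection algorithm.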

8.

Accelerating Machine-Learning Algorithms on FPGAs using Pattern-Based Decomposition

Karthik Nagarajan Brian Holland Alan D. George K. Clint Slatton Herman Lam 《Journal of Signal Processing Systems》2011,62(1):43-63

Machine-learning algorithms are employed in a wide variety of applications to extract useful information from data sets, and many are known to suffer from super-linear increases in computational time with increasing data size and number of signals being processed (data dimension). Certain principal machine-learning algorithms are commonly found embedded in larger detection, estimation, or classification operations. Three such principal algorithms are the Parzen window-based, non-parametric estimation of Probability Density Functions (PDFs), K-means clustering and correlation. Because they form an integral part of numerous machine-learning applications, fast and efficient execution of these algorithms is extremely desirable. FPGA-based reconfigurable computing (RC) has been successfully used to accelerate computationally intensive problems in a wide variety of scientific domains to achieve speedup over traditional software implementations. However, this potential benefit is quite often not fully realized because creating efficient FPGA designs is generally carried out in a laborious, case-specific manner requiring a great amount of redundant time and effort. In this paper, an approach using pattern-based decomposition for algorithm acceleration on FPGAs is proposed that offers significant increases in productivity via design reusability. Using this approach, we design, analyze, and implement a multi-dimensional PDF estimation algorithm using Gaussian kernels on FPGAs. First, the algorithm’s amenability to a hardware paradigm and expected speedups are predicted. After implementation, actual speedup and performance metrics are compared to the predictions, showing speedup on the order of 20× over a 3.2 GHz processor. Multi-core architectures are developed to further improve performance by scaling the design. Portability of the hardware design across multiple FPGA platforms is also analyzed. After implementing the PDF algorithm, the value of pattern-based decomposition to support reuse is demonstrated by rapid development of the K-means and correlation algorithms.
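For readers unfamiliar with the Parzen window estimator named in this abstract, here is a minimal one-dimensional sketch with a Gaussian kernel. It evaluates p(x) = 1/(n h √(2π)) Σ exp(−(x − xᵢ)² / (2h²)) directly in software. The function name and the `bandwidth` parameter are illustrative assumptions; the paper's multi-dimensional FPGA design is not reproduced here.

```python
import math


def parzen_gaussian_pdf(x, samples, bandwidth):
    """Parzen-window density estimate at x using a Gaussian kernel.

    Illustrative 1-D software version only; names are hypothetical.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    total = 0.0
    for s in samples:
        u = (x - s) / bandwidth
        total += math.exp(-0.5 * u * u)  # Gaussian kernel contribution
    return norm * total
```

Every evaluation point loops over every sample, so the cost grows with the product of the number of evaluation points and the number of samples, which is the kind of workload the abstract describes as a candidate for hardware acceleration.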

9.

Multi-functional systolic array with reconfigurable micro-power processing elements

E.I. Milovanović I.Ž. Milovanović 《Microelectronics Reliability》2009,49(7):813-820

This paper presents the design and implementation of a high-performance bi-directional linear systolic array (BLSA) with low-power, reconfigurable processing elements (PEs). The BLSA acts as a hardware accelerator for a broad class of problems that arise in a variety of applications such as digital signal processing, computer graphics, graph algorithms, etc. We define a unique algorithm representation for solving problems such as matrix multiplication, transitive closure, finding the critical path in a graph, finding all-pairs shortest paths in a graph, etc. The algorithm is mapped onto a BLSA with reconfigurable PEs. A clock gating technique is used to minimize the power consumption of a multi-functional PE. The performance of the BLSA is considered from the aspects of power consumption and communication bandwidth. Using the clock gating technique, we achieve a PE power reduction of 85% on average. Communication bandwidth is considered for different numbers of PEs in the BLSA and different operand sizes. The obtained results are in the range of 442 up to 9460 MB/s, i.e., the bandwidth of our design is better for larger array and operand sizes. A lower-power, reconfigurable PE is realized using Xilinx FPGA chips.
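One well-known way to see why transitive closure and all-pairs shortest paths can share a single algorithm is to write the triple loop once and swap the two scalar operations. The sketch below does that in software. It is an assumption offered for orientation: the abstract does not spell out the authors' representation, and the names `closure`, `combine`, and `extend` are hypothetical.

```python
INF = float("inf")


def closure(matrix, combine, extend):
    """Generic triple-loop closure over a pair of scalar operations.

    combine merges two candidate values; extend chains two path segments.
    """
    n = len(matrix)
    d = [row[:] for row in matrix]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                d[i][j] = combine(d[i][j], extend(d[i][k], d[k][j]))
    return d


def all_pairs_shortest_paths(weights):
    # (min, +): keep the shorter path, add segment lengths.
    return closure(weights, min, lambda a, b: a + b)


def transitive_closure(adjacency):
    # (or, and): a path exists if either option exists; both segments needed.
    return closure(adjacency, lambda a, b: a or b, lambda a, b: a and b)
```

A processing element that can switch between a few such operation pairs can serve several problems with the same data movement, which is the kind of reuse a reconfigurable PE aims for.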

10.

System-level power-performance tradeoffs for reconfigurable computing

Noguera J. Badia R.M. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2006,14(7):730-739

In this paper, we propose a configuration-aware data-partitioning approach for reconfigurable computing. We show how the reconfiguration overhead impacts the data-partitioning process. Moreover, we explore the system-level power-performance tradeoffs available when implementing streaming embedded applications on fine-grained reconfigurable architectures. For a certain group of streaming applications, we show that an efficient hardware/software partitioning algorithm is required when targeting low power. However, if the application objective is performance, then we propose the use of dynamically reconfigurable architectures. We propose a design methodology that adapts the architecture and algorithms to the application requirements. The methodology has been proven to work on a real research platform based on Xilinx devices. Finally, we have applied our methodology and algorithms to the case study of image sharpening, which is required nowadays in digital cameras and mobile phones.

11.

Research on a network controller based on the CN56xx network processor

沈晶聂叶猛《电视技术》2012,36(9):103-107

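A user management and control system is developed on a network processor platform to monitor the content and behavior of users' Internet access. A network processor is a programmable, high-efficiency chip for processing network data, and the network controller is the component of the user management and control system that filters data. In the experiments, data-processing efficiency is raised on the hardware side with an optimized pipeline, an efficient way for the chip to process data, and performance is optimized on the software side with several algorithms, including a flow-filtering algorithm, a latent semantic indexing algorithm, and IP fragment handling. The experimental results show that the network-processor-based network controller filters and forwards data according to the filtering and forwarding rules with high accuracy and speed, and monitors the content and behavior of users' Internet access very effectively.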

12.

3-D Floorplanning: Simulated Annealing and Greedy Placement Methods for Reconfigurable Computing Systems

Kiarash Bazargan Ryan Kastner Majid Sarrafzadeh 《Design Automation for Embedded Systems》2000,5(3-4):329-338

Advances in programmable hardware have led to new architectures where the hardware can be dynamically adapted to the application to gain better performance. There are still many challenging problems to be solved before any practical general-purpose reconfigurable system is built. One fundamental problem is the placement of the modules on the reconfigurable functional unit (RFU). In reconfigurable systems, we are interested both in online placement, where the arrival time of tasks is determined at runtime and is not known a priori, and in offline placement, in which the schedule is known at compile time. In the case of offline placement, we are willing to spend more time during compile time to find a compact floorplan for the RFU modules and utilize the RFU area more efficiently. In this paper we present offline placement algorithms based on simulated annealing and greedy methods and show the superiority of their placements over the ones generated by an online algorithm.

13.

A resource-constrained reconfigurable parallel information processing method

陆智俊贲德毛博年《红外与激光工程》2016,45(11):1126003-1126003(6)

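To resolve the conflict between high computing performance and low power consumption in the guidance, navigation, and control (GNC) information processing system of CubeSat nanosatellites, a resource-constrained reconfigurable parallel information processing method is proposed. The method adopts a tightly coupled reconfigurable parallel information processing architecture: complex software algorithms in GNC information processing that require many iterations and are ill-suited to CPU processing are implemented as dynamic partial reconfiguration (DPR) hardware circuit units. A mutex-based multi-core parallel reconfigurable-resource scheduling algorithm lets the multi-core CPU manage and schedule the shared DPR units in parallel, thereby accelerating and optimizing the software algorithms in hardware. Experimental results show that the method achieves efficient, real-time processing in the CubeSat GNC information processing system and saves about 50% of the power compared with conventional information processing methods. It is applicable to on-board information processing where computing resources are extremely limited and has good prospects for engineering application.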
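The mutex-based sharing of a reconfigurable unit among several cores can be pictured with a small thread-based model. Everything in this sketch is a hypothetical stand-in: `DprUnit` models one shared DPR region, the sum of squares is a placeholder for an offloaded computation, and Python threads play the role of CPU cores. The abstract does not describe the actual scheduling algorithm in this detail.

```python
import threading


class DprUnit:
    """Hypothetical model of one shared, partially reconfigurable region."""

    def __init__(self):
        self._mutex = threading.Lock()
        self.loaded = None   # name of the circuit currently configured
        self.log = []        # ordered record of reconfigurations and runs

    def run(self, task_name, data):
        # The mutex makes reconfigure-then-run atomic per caller.
        with self._mutex:
            if self.loaded != task_name:
                self.loaded = task_name  # models a partial reconfiguration
                self.log.append(("reconfigure", task_name))
            result = sum(v * v for v in data)  # placeholder computation
            self.log.append(("run", task_name))
            return result


def run_all(tasks):
    """Run (name, data) tasks from separate threads on one shared unit."""
    unit = DprUnit()
    results = [None] * len(tasks)

    def worker(index, name, data):
        results[index] = unit.run(name, data)

    threads = [threading.Thread(target=worker, args=(i, name, data))
               for i, (name, data) in enumerate(tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, unit.log
```

Because reconfiguration and execution happen under the same lock, no task can run on a circuit that another thread has just swapped out; the cost is that callers serialize on the shared unit.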

14.

An 80/20-MHz 160-mW multimedia processor integrated with embedded DRAM, MPEG-4 accelerator and 3-D rendering engine for mobile applications

Chi-Weon Yoon Ramchan Woo Jeengheon Kook Se-Joong Lee Kangmin Lee Hoi-Jun Yoo 《Solid-State Circuits, IEEE Journal of》2001,36(11):1758-1767

A low-power multimedia processor for mobile applications is presented. An 80-MHz 32-b RISC with enhanced multiplier, two 20-MHz hardware accelerators with 7.125-Mb embedded DRAM for MPEG-4 visual SP@L1 decoding and 3-D graphics processing, 2-kB dual-port SRAM, and peripheral blocks are integrated together on a single chip. MPEG-4 SP@L1 video decoding and 3-D graphics rendering with 16-b depth-buffer, alpha-blending, double-buffering, and Gouraud-shading features at 2.2-Mpolygons/s speed are realized with the help of the dedicated hardware accelerators. The architecture of the processor is optimized in terms of power consumption and performance, and various low-power circuit techniques are adopted in each hardware block. The chip is implemented using 0.18-μm embedded memory logic (EML) technology. Its area is 84 mm², and power consumption is 160 mW when all of the functions are activated.

15.

A hierarchical design methodology for full-search block matching motion estimation

Mohamed Rehan M. Watheq El-Kharashi Fayez Gebali 《Multidimensional Systems and Signal Processing》2006,17(4):327-341

Many useful DSP algorithms have high dimensions and complex logic. Consequently, an efficient implementation of these algorithms on parallel processor arrays must involve a structured design methodology. Full-search block-matching motion estimation is one of those algorithms that can be developed using parallel processor arrays. In this paper, we present a hierarchical design methodology for full-search block-matching motion estimation. Our proposed methodology breaks the algorithm down into simpler steps and then explores the different possible design options at each step. Input data timing restrictions are taken into consideration as well as buffering requirements. A designer is able to modify system performance by selecting some of the algorithm variables for pipelining or broadcasting. Our proposed design strategy also allows the designer to study time and hardware complexities of computations at each level of the hierarchy. The resultant architecture allows easy modifications to the organization of data buffers and processing elements (their number, datapath pipelining, and complexity) to produce a system whose performance matches the video data sample rate requirements.
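For context, the algorithm this methodology targets can be written as a plain software reference: for one block of the current frame, try every displacement inside a search window and keep the one with the smallest sum of absolute differences (SAD). The sketch below is such a reference under stated assumptions. The SAD criterion, the function names, and the parameters `n` (block size) and `search` (window radius) are illustrative choices, and none of the paper's processor-array mapping is shown.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between an n x n block of cur at
    (bx, by) and the block of ref displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, search):
    """Return (cost, dx, dy) of the best match within +/- search pixels."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidates that fall outside the reference frame.
            if by + dy < 0 or by + dy + n > height:
                continue
            if bx + dx < 0 or bx + dx + n > width:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

The four nested loops (two over displacements, two over pixels) are independent across candidates, which is why the algorithm suits arrays of processing elements and why choices such as pipelining or broadcasting a loop variable change the hardware cost.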

16.

A 36 fps SXGA 3-D Display Processor Embedding a Programmable 3-D Graphics Rendering Engine

Seok-Hoon Kim Jae-Sung Yoon Chang-Hyo Yu Donghyun Kim Kyusik Chung Han Shin Lim Yun-Gu Lee HyunWook Park Jong Beom Ra Lee-Sup Kim 《Solid-State Circuits, IEEE Journal of》2008,43(5):1247-1259

In this paper, a 3D display processor embedding a programmable 3D graphics rendering engine is proposed. The proposed processor combines a 3D graphics rendering engine and a 3D image synthesis engine to support both true realism and interactivity for future multimedia applications. Using the high coherence between 3D graphics data and 3D display inputs, both pipelines are merged by sharing buffers such that a 3D display engine directly uses the output of a 3D graphics rendering engine. The merged architecture has synergetic coupling effects such as freely providing various rendering effects to 3D images and easily computing disparities without complex extraction processes. In the 3D image synthesis engine, we adopt a view interpolation algorithm and propose a real-time synthesis method, the pixel-by-pixel process. The view interpolation algorithm reduces the number of images to be rendered, resulting in the reduction of external memory size to 64.8% compared to the conventional synthesis process. The proposed pixel-by-pixel process synthesizes 3D images at 36 fps through bandwidth reduction of 26.7% and decreases internal memory size to 64.2% compared to the typical image-by-image process. The 3D graphics rendering engine is programmable and supports the instruction sets of the latest 3D graphics standard APIs, Pixel Shader 3.0 and OpenGL|ES 2.0. The die contains about 1.7 M transistors, occupies 5 mm × 5 mm in 0.18 μm CMOS and dissipates 379 mW at 1.85 V.

17.

A Case Study of Hardware/Software Partitioning of Traffic Simulation on the Cray XD1

Tripp J.L. Gokhale M.B. Hansson A. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2008,16(1):66-74

Scientific application kernels mapped to reconfigurable hardware have been reported to have 10× to 100× speedup over equivalent software. These promising results suggest that reconfigurable logic might offer significant speedup on applications in science and engineering. To accurately assess the benefit of hardware acceleration on scientific applications, however, it is necessary to consider the entire application including software components as well as the accelerated kernels. Aspects to be considered include alternative methods of hardware/software partitioning, communications costs, and opportunities for concurrent computation between software and hardware. Analysis of these factors is beyond the scope of current automatic parallelizing compilers. In this paper, a case study is presented in which a simulation of metropolitan road traffic networks is mapped onto a reconfigurable supercomputer, the Cray XD1. Five different methods are presented for mapping the application onto the combined hardware/software system. An approach for approximating the performance of each method is derived through analytic equations. Our results, both analytic and empirical, show that key predictors of performance (which are often not considered in reported speedup of kernel operations) are not necessarily maximum parallelism, but must account for the fraction of the problem that runs on the reconfigurable logic and the amount of data flow between software and hardware.
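The closing point of this abstract can be made concrete with a simple Amdahl-style estimate. The sketch below is an assumption for illustration and is not the paper's analytic model: it normalizes the software-only run time to 1, accelerates only the fraction that runs on the reconfigurable logic, and adds a data-transfer term. It also assumes no overlap between software, hardware, and communication, although the abstract lists such overlap as an aspect to consider. All names are hypothetical.

```python
def overall_speedup(hw_fraction, kernel_speedup, transfer_fraction=0.0):
    """Whole-application speedup with the software-only time set to 1.

    hw_fraction:       share of the original run time moved to hardware
    kernel_speedup:    speedup of that share on the reconfigurable logic
    transfer_fraction: added software/hardware data-movement time,
                       as a share of the original run time
    """
    if not 0.0 <= hw_fraction <= 1.0:
        raise ValueError("hw_fraction must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    if transfer_fraction < 0.0:
        raise ValueError("transfer_fraction must not be negative")
    accelerated_time = ((1.0 - hw_fraction)
                        + hw_fraction / kernel_speedup
                        + transfer_fraction)
    return 1.0 / accelerated_time
```

Under these assumptions, a 100× kernel that covers half of the run time yields just under 2× overall, and a kernel covering 90% of the run time with a 10% transfer cost yields under 5×, which matches the abstract's warning that kernel speedup alone is a poor predictor.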

18.

Design and optimization for multiprocessor interactive GPU

DENG Jun-yong LI Tao JIANG Lin HAN Jun-gang SHEN Xu-bang 《中国邮电高校学报(英文版)》2014,21(3):85-97

To maximize parallelism, distribute rendering tasks effectively, and balance performance and flexibility in the graphics processing pipeline, this article presents the design, performance analysis, and optimization of a multi-core interactive graphics processing unit (MIGPU). The processor integrates twelve processing cores with a specific instruction set architecture and many sophisticated application-specific accelerators into a 3D graphics engine. It is implemented on an XC6VLX550T field programmable gate array (FPGA). MIGPU supports OpenGL 2.0 with a programmable front-end processor, vertex shader, plane clipper, geometry transformer, 3D clippers, and pixel shaders. To boost the performance of MIGPU, a relationship model is established among primitive types, vertices, pixels, and the effects of culling, clipping, and memory access, which shows a way to improve the speedup of the graphics pipeline. MIGPU is capable of assigning graphics rendering tasks to different processors for efficiency and flexibility. The pixel fill rate reaches 40 Mpixel/s at peak performance.

19.

Research on a high-efficiency block cipher processor based on stream architecture

王寿成严迎建徐进辉《电子学报》2017,45(4):937-943

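To address the shortcomings of existing cipher processors, a high-efficiency reconfigurable block cipher stream processor architecture is proposed, drawing on stream processor architectures. The architecture follows a hierarchical design approach: a partitioned local register file organizes the data, and arithmetic units are shared and combined, so that software pipelining and hardware pipelining work together. It can exploit instruction-level parallelism both within and across blocks and raises the utilization of the functional units. The architecture was synthesized and simulated in a 65 nm CMOS process, and a large number of algorithms were mapped onto it. Experimental results show that the architecture delivers good encryption performance in both CBC and ECB modes. Compared with other cipher processors, it combines small area with high efficiency.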

20.

Optimizing the H.264/AVC Video Encoder Application Structure for Reconfigurable and Application-Specific Platforms

Muhammad Shafique Lars Bauer Jörg Henkel 《Journal of Signal Processing Systems》2010,60(2):183-210

The H.264/AVC video coding standard features diverse computational hot spots that need to be accelerated to cope with the significantly increased complexity compared to previous standards. In this paper, we propose an optimized application structure (i.e. the arrangement of functional components of an application determining the data flow properties) for the H.264 encoder which is suitable for application-specific and reconfigurable hardware platforms. Our proposed application structural optimization for the computational reduction of the Motion Compensated Interpolation is independent of the actual hardware platform that is used for execution. For a MIPS processor we achieve an average speedup of approximately 60× for Motion Compensated Interpolation. Our proposed application structure reduces the overhead for Reconfigurable Platforms by distributing the actual hardware requirements amongst the functional blocks. This increases the amount of available reconfigurable hardware per Special Instruction (within a functional block) which leads to a 2.84× performance improvement of the complete encoder when compared to a Benchmark Application with standard optimizations. We evaluate our application structure by means of four different hardware platforms.