
1.

Parallel Memory Architecture for Application-Specific Instruction-Set Processors

Teemu Pitkänen Jarno K. Tanskanen Risto Mäkinen Jarmo Takala 《Journal of Signal Processing Systems》2009,57(1):21-32

Many of the current applications used in battery-powered devices are from the digital signal processing, telecommunication, and multimedia domains. These applications typically set high requirements for computational performance, and parallelism is often the key to meeting them. In order to exploit the parallel processing units, memory should be able to feed the data path with data. This calls for a memory organization supporting parallel memory accesses. In this paper, a conflict-resolving parallel data memory system for application-specific instruction-set processors is described. The memory structure is generic and reusable to support various application-specific designs. The proposed memory system does not employ any predefined access format signals for memory addressing. The proposed parallel memory system is attached to an application-specific instruction-set processor core, and comparisons of area, power, and critical path are shown. The experiments show that significant power savings can be obtained by exploiting the parallel memory system instead of multi-port memory.

2.

Architecture and Compiler Optimizations for Data Bandwidth Improvement in Configurable Processors (total citations: 1; self-citations: 0; citations by others: 1)

《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2006,14(9):986-997

Many commercially available embedded processors are capable of extending their base instruction set for a specific domain of applications. While steady progress has been made in the tools and methodologies of automatic instruction set extension for configurable processors, the limited data bandwidth available in the core processor (e.g., the number of simultaneous accesses to the register file) becomes a potential performance bottleneck. In this paper, we first present a quantitative analysis of the data bandwidth limitation in configurable processors, and then propose a novel low-cost architectural extension and associated compilation techniques to address the problem. Specifically, we embed a single control bit in the instruction op-codes to selectively copy the execution results to a set of hash-mapped shadow registers in the write-back stage. This can efficiently reduce the communication overhead due to data transfers between the core processor and the custom logic. We also present a novel simultaneous global shadow register binding with a hash function generation algorithm to take full advantage of the extension. The application of our approach leads to a nearly optimal performance speedup

3.

Accelerating Mobile Video: A 64-Bit SIMD Architecture for Handheld Applications

N. C. Paver M. H. Khan B. C. Aldrich C. D. Emmons 《The Journal of VLSI Signal Processing》2005,41(1):21-34

Providing quality mobile video applications in hand-held mobile devices requires increased computational capability. Using Single Instruction Multiple Data (SIMD) techniques to expose and accelerate the data parallelism inherent in video processing increases performance in handheld and wireless systems. The paper introduces a new 64-bit SIMD coprocessor of the Intel® XScale® microarchitecture which is optimized for low-power handheld applications. The architecture blends the SIMD media processing style with the capabilities of the XScale microarchitecture. This paper provides an overview of the architecture, its instruction set, programming model, the pipeline organization and functional units. The paper also describes how key features of the architecture improve the performance of video applications as compared to a scalar implementation. The performance and power improvements based upon measured results are analyzed to show how the opportunities for power savings by reducing the frequency and voltage can be realized.

Nigel C. Paver has 13 years of experience with the ARM architecture, and in the Intel PCA Components group in Austin, Texas, he is responsible for the architecture and implementation of multimedia coprocessors for the Intel XScale micro-architecture. He is also involved in product architecture and definition of Intel PCA processors. Before Intel, Nigel was one of the lead designers of the early AMULET asynchronous ARM microprocessors at the University of Manchester. He was also vice president in a startup company which used asynchronous design techniques to produce a low-power asynchronous DSP core. Nigel holds a Master of Science degree and Ph.D. in computer science from the University of Manchester and a Bachelor of Science degree in electronics from UMIST.

Moinul Khan is a multimedia product architect at Intel Corporation PCA Components group. He is responsible for PCA graphics and security architecture. His research interests are virtual prototyping, signal processing algorithms and architecture, and communications networking. Before joining Intel he was a technology specialist and founding member of a startup at ATDC, Georgia. He worked on his doctoral research at Georgia Center for Advanced Telecommunications Technology at Georgia Institute of Technology. He received his B.Tech from Indian Institute of Technology and MSEE from Georgia Tech. He also worked as a research member for Canadian Institute for Telecommunications Research and Bell Communications Laboratories.

Bradley C. Aldrich joined Intel in 1997 where he is currently an architect within the PCA Components Group. His current work includes the development of coprocessor instruction support in addition to image capture and display technologies for XScale based application processors. He was previously a member of the Intel/Analog Devices joint development architecture team responsible for video enhancements for the Micro Signal Architecture. Prior to that he was a video system architect in Intel’s Digital Imaging and Video Division working on CMOS sensors, still cameras, and tethered PC based video peripherals. He has also worked as a device engineer for Motorola and as a test engineer for Tektronix. He received a BSEE in 1988 and MSEE in 1994 from the University of Texas at San Antonio.

Christopher D. Emmons received a Bachelor of Science degree in Computer Science from the University of Texas at Austin in 2003. He joined Intel in 2001 and is currently a multimedia architect responsible for algorithm development and performance optimization for handheld products within the PCA Components Group. Prior to this he worked as an applications engineer providing performance and power analysis in support of product marketing groups. His research interests include video compression, operating system design, and dynamic resource management.

4.

Architecture Considerations for Multi-Format Programmable Video Processors

Jonah Probell 《Journal of Signal Processing Systems》2008,50(1):33-39

Many different video processor architectures exist. Its architecture gives a processor strength for a particular application. Hardwired logic yields the best performance/cost, but a programmable processor is important for applications that support multiple coding standards, proprietary functions, or future changes to application requirements. Programmable video processor architectures achieve best performance through the use of parallelism at the data (SIMD), instruction (VLIW), and multiprocessor level, and optimally sized ALU, multiplier, and load/store datapaths. Because low-cost memory architectures are not optimized for the random access patterns of video processing, the performance of video processors is often limited by memory bandwidth rather than processing resources. Careful data organization alleviates memory bandwidth limitations. When choosing a video processor it is important to consider many factors, particularly performance, cost, power consumption, programmability, and peripheral support.

5.

A Computational Memory Architecture for MPEG-4 Applications with Mobile Devices

Mohammed Sayed Wael Badawy 《The Journal of VLSI Signal Processing》2006,42(1):35-42

This paper presents a Computational Memory architecture for MPEG-4 applications with mobile devices. The proposed architecture is used for real-time block-based motion estimation, which is the most computationally intensive task in the video encoder. It uses the exhaustive block-matching algorithm (EBMA) for motion estimation. The proposed architecture consists of embedded SRAMs and a number of block-matching units working in parallel to process video data while stored in the memory. The block-matching units access the embedded SRAMs simultaneously, which increases the speed of the architecture. The architecture processes CIF format video sequences (i.e., the frame size is 352 × 288 pixels) with block size of 16 × 16 pixels and ±15 pixels search range. The proposed architecture has been designed, prototyped, and simulated for 0.18 μm TSMC CMOS technology. The simulation shows that the proposed architecture processes up to 126 CIF frames per second with clock frequency 100 MHz. The synthesized prototype of the proposed architecture includes 200 KB memory and it has an area of 33.75 mm² and consumes 986.96 mW @100 MHz.

Mohammed Sayed received his B.Sc. degree from Zagazig University, Zagazig, Egypt, in 1997 and a postgraduate diploma in VLSI design from the Information Technology Institute (ITI), Cairo, Egypt, in 1998. In 2003 he received his M.Sc. degree from University of Calgary, Calgary, Canada. From 1998 to 2001 he was a research and teaching assistant at the Electronics & Communications Engineering Department, Zagazig University, Egypt. In 2001 he became a research assistant at the Department of Electrical and Computer Engineering, University of Calgary, Canada. His current research interests are System-on-Chip, Embedded Memories, and Digital Video Processing. Mr. Sayed received a number of scholarships and awards such as iCORE Scholarship from 2003 to 2005, SMC Industrial Collaboration Award in June 2003, and the Micronet Annual Workshop Best Paper Award in April 2002. He has a number of journal and conference publications and a number of contributions to the MPEG-4 standard (ISO/IEC JTC1/SC29/WG11 MPEG2002/ M8562 and M8563).

Wael Badawy is an associate professor in the Department of Electrical and Computer Engineering. He holds an adjunct professorship in the Department of Mechanical Engineering, University of Alberta. Dr. Badawy's research interests are in the areas of: Microelectronics, VLSI architectures for video applications with low-bit rate applications, digital video processing, low power design methodologies, and VLSI prototyping. His research involves designing new models, techniques, algorithms, architectures and low power prototypes for novel systems and consumer products. Dr. Badawy authored and co-authored more than 100 peer reviewed Journal and Conference papers and about 30 technical reports. He is the Guest Editor for the special issue on System on Chip for Real-Time Applications in the Canadian Journal on Electrical and Computer Engineering, the Technical Chair for the 2002 International Workshop on SoC for real-time applications, and a technical reviewer in several IEEE journals and conferences. He is currently a member of the IEEE-CAS Technical Committee on Communication. Dr. Badawy was honored with the “2002 Petro Canada Young Innovator Award”, “2001 Micralyne Microsystems Design Award” and the 1998 Upsilon Pi Epsilon Honor Society and IEEE Computer Society Award for Academic Excellence in Computer Disciplines. He is currently the Chairman of the Canadian Advisory Committee (CAC) and Head of the Canadian Delegation on ISO/IEC/JTC1/SC6 “Telecommunications and Information Exchange Between Systems”. He is also a member of the Canadian Advisory Committee for the Standards Council of Canada, Subcommittee 29 (Coding of Audio, Picture, Multimedia and Hypermedia Information), and a Canadian delegate to the ISO/IEC MPEG standard committee. He is a voting Member on the VSI Alliance. He is also the Chair of the IEEE-Southern Alberta Society-Computer Chapter.
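The exhaustive block-matching algorithm named in the abstract of entry 5 can be sketched in software as follows. This is only a minimal illustration of the search itself, assuming a sum-of-absolute-differences (SAD) cost and frames stored as lists of rows; the function names `sad` and `ebma` are hypothetical, and nothing here models the paper's embedded SRAMs or parallel block-matching units.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    return sum(
        abs(cur[by + r][bx + c] - ref[by + dy + r][bx + dx + c])
        for r in range(n)
        for c in range(n)
    )


def ebma(cur, ref, bx, by, n=16, search=15):
    """Try every displacement within +/-search pixels and return
    (best_cost, dx, dy). Candidates falling outside `ref` are skipped."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            inside = (0 <= by + dy and by + dy + n <= height
                      and 0 <= bx + dx and bx + dx + n <= width)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

With the parameters quoted in the abstract, a CIF frame holds 22 × 18 = 396 blocks of 16 × 16 pixels, and a ±15 pixel range gives up to 31 × 31 = 961 candidate displacements per block, which is why the search is the dominant cost in the encoder.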

6.

Data Bus Inversion in High-Speed Memory Applications

Hollis T.M. 《Circuits and Systems II: Express Briefs, IEEE Transactions on》2009,56(4):300-304

Efforts to reduce high-speed memory interface power have led to the adoption of data bus inversion or bus-invert coding. This study compares two popular algorithms, which seek to limit the number of simultaneously transitioning signals and bias the state of transmitted data toward a preferred binary level, respectively. A new algorithm, which provides a compromise between transition frequency and preferred signal level, is proposed, and the three algorithms are compared in terms of their impact on power consumption, power supply noise reduction, and general signal integrity enhancement when used in conjunction with a variety of link topologies.
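As a rough illustration of the two established schemes the abstract of entry 6 compares, the sketch below encodes a single byte lane. The function names, the 8-bit lane width, and the choice of logic 1 as the preferred level are assumptions made for this example, not details from the paper; the transitions of the inversion flag line itself are ignored, and the paper's proposed compromise algorithm is not reproduced.

```python
def popcount(value):
    """Number of set bits in a non-negative integer."""
    return bin(value).count("1")


def dbi_dc(data, width=8):
    """Level-biasing scheme: invert the word when more than half of its
    bits sit at the non-preferred level (assumed here to be 0)."""
    mask = (1 << width) - 1
    data &= mask
    zeros = width - popcount(data)
    if zeros > width // 2:
        return (~data) & mask, 1
    return data, 0


def dbi_ac(data, prev_bus, width=8):
    """Transition-limiting scheme: invert the word when more than half of
    the lines would toggle relative to the previous bus state."""
    mask = (1 << width) - 1
    data &= mask
    toggles = popcount((data ^ prev_bus) & mask)
    if toggles > width // 2:
        return (~data) & mask, 1
    return data, 0


def dbi_decode(word, flag, width=8):
    """Receiver side: undo the inversion when the flag is set."""
    mask = (1 << width) - 1
    return (~word) & mask if flag else word & mask
```

For example, `dbi_dc(0x01)` sends `0xFE` with the flag set, so only one line sits at the non-preferred level instead of seven.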

7.

A Chip-Stacked Memory for On-Chip SRAM-Rich SoCs and Processors

Saito H. Nakajima M. Okamoto T. Yamada Y. Ohuchi A. Iguchi N. Sakamoto T. Yamaguchi K. Mizuno M. 《Solid-State Circuits, IEEE Journal of》2010,45(1):15-22

A dynamic-reconfigurable memory chip is fabricated, by which on-chip memories of an SoC chip can be moved to the memory chip to increase the efficiency of memory usage, and stacked on a logic chip by using three dimensional packaging technology. In the memory chip, many RAM-macros are arrayed and they are connected through two dimensional mesh network interconnects. By using memory-specified network interconnects, area overhead of network interconnects for the memory chip is reduced by 63% and the latency overhead by 43%. Signal lines between the two chips are directly connected by 10-μm-pitch inter-chip electrodes, resulting in fast and low-energy inter-chip transmission.

8.

Reduced Memory and Low Power Architectures for CORDIC-based FFT Processors

Erdal Oruklu Xin Xiao Jafar Saniie 《Journal of Signal Processing Systems》2012,66(2):129-134

This paper presents a pipelined, reduced memory and low power CORDIC-based architecture for fast Fourier transform implementation. The proposed algorithm utilizes a new addressing scheme and the associated angle generator logic in order to remove any ROM usage for storing twiddle factors. As a case study, the radix-2 and radix-4 FFT algorithms have been implemented on FPGA hardware. The synthesis results match the theoretical analysis and it can be observed that more than 20% reduction can be achieved in total memory logic. In addition, the dynamic power consumption can be reduced by as much as 15% by reducing memory accesses.
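To show why a CORDIC unit can stand in for a twiddle-factor ROM, the sketch below applies a textbook rotation-mode CORDIC in floating point: multiplying a sample by the twiddle factor exp(-j2πk/N) is the same as rotating it by the angle -2πk/N. The function name `cordic_rotate` and the 16-iteration default are assumptions for illustration; the paper's addressing scheme and angle generator logic are not modeled.

```python
import math


def cordic_rotate(x, y, angle, iterations=16):
    """Rotate the vector (x, y) by `angle` radians using shift-and-add
    style micro-rotations. Converges for |angle| up to about 1.74 rad."""
    # Accumulated gain of the micro-rotations, compensated at the end.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    z = angle  # residual angle still to be rotated
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0
        shift = 2.0 ** -i
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * math.atan(shift)
    return x / gain, y / gain
```

For instance, `cordic_rotate(1.0, 0.0, -2 * math.pi * 1 / 8)` approximates the real and imaginary parts of the twiddle factor for N = 8, k = 1. In hardware the multiplications by powers of two become shifts and the gain becomes a fixed constant.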

9.

Grid-Based Storage Architecture for Accelerating Bioinformatics Computing

Frank Zhigang Wang Na Helian Sining Wu Yuhui Deng Vineet Khare Chris Thompson 《The Journal of VLSI Signal Processing》2007,48(3):311-324

This paper examines and investigates the relationship between bioinformatics data processing and its underlying computing architecture within the context of the International Nucleotide Sequence Database Collaboration (INSDC). INSDC exchanges sequence data on a daily basis across its three member organizations in USA, UK and Japan. We studied how this sequence database in MySQL can best take advantage of the increased transfer bandwidth of a grid-based storage architecture. Within the context of the UK Government Project “Grid-oriented Storage (GOS)” and the EC Project “EuroAsiaGrid,” GOS has been developed in our lab, which melds parallel streaming technique to meet the needs of WAN/Grid-based virtual organizations. A real-world test shows that the INSDC sequence database backup operation, mysqldump, over the pipelined GOS architecture beats those over the classic infrastructures by six times over the link between Cambridge and Tokyo. When performing genomic sequence search against one million records via the underlying GOS architecture, a performance improvement of 67.3% has been achieved.

10.

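Big Data Applications in the Radio and Television Industry and the System Architecture of the Enterprise Data Application Center

唐月 刘朝苹 张志斌 《广播与电视技术》2016,(9):103-111

This paper introduces typical big data applications in the radio and television industry, including viewing behavior analysis, customer profile insight, and marketing analysis, and describes the system architecture of an enterprise data application center in a big data setting, covering its technical, data, functional, and deployment architectures.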

11.

An Efficient Memory Organization for High-ILP Inner Modem Baseband SDR Processors

Bjorn De Sutter Osman Allam Praveen Raghavan Roeland Vandebriel Hans Cappelle Tom Vander Aa Bingfeng Mei 《Journal of Signal Processing Systems》2010,61(2):157-179

This paper presents a memory organization for SDR inner modem baseband processors that focus on exploiting ILP. This memory organization uses power-efficient, single-ported, interleaved scratch-pad memory banks to provide enough bandwidth to high-ILP processors. A system of queues in the memory interface is used to resolve bank conflicts among the single-ported banks, and to spread long bursts of conflicting accesses to the same bank over time. Bank address rotation is used to spread long bursts of conflicting accesses over multiple banks. All proposed techniques have been implemented in hardware, and are evaluated for a number of different wireless communication standards. For the 11a|n benchmarks, the overhead of stall cycles resulting from unresolved bank conflicts can be reduced to below 2% with the proposed organization. For 3GPP-LTE, the most demanding wireless standard we evaluated, the overhead is reduced to less than 0.13%. This is achieved with little energy and area overhead, and without any bank-aware compiler support.
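The effect of rotating bank addresses, as described in the abstract of entry 11, can be illustrated with a small model. The abstract does not give the rotation function, so the mapping in `bank_rotated` below (adding the row index before taking the modulus) is an assumed, commonly used skewing used purely for illustration; the function names are hypothetical and the queueing mechanism is not modeled.

```python
from collections import Counter


def bank_plain(addr, num_banks):
    """Plain low-order interleaving: consecutive words go to consecutive banks."""
    return addr % num_banks


def bank_rotated(addr, num_banks):
    """Assumed rotation: shift the bank index by one for every row of
    num_banks words, so equal column positions land in different banks."""
    return (addr + addr // num_banks) % num_banks


def worst_bank_load(addresses, bank_fn, num_banks):
    """Largest number of accesses that fall on any single bank."""
    return max(Counter(bank_fn(a, num_banks) for a in addresses).values())
```

With four banks, a burst that strides by four words (`range(0, 32, 4)`) puts all eight accesses on one bank under plain interleaving, while the rotated mapping spreads them so that no bank receives more than two.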

12.

Easily Adaptable On‐Chip Debug Architecture for Multicore Processors

Jing‐Zhe Xu Hyeongbae Park Seungpyo Jung Ju Sung Park 《ETRI Journal》2013,35(2):301-310

Nowadays, the multicore processor is watched with interest by people all over the world. As the design technology of system on chip has developed, observing and controlling the processor core's internal state has not been easy. Therefore, multicore processor debugging is very difficult and time‐consuming. Thus, we need a reliable and efficient debugger to find the bugs. In this paper, we propose an on‐chip debug architecture for multicore processors that is easily adaptable and flexible. It is based on the JTAG standard and supports monitoring mode debugging, which is different from run‐stop mode debugging. Compared with the debug architecture that supports the run‐stop mode debugging, the proposed architecture is easily applied to a debugger and has the advantage of having a desirable gate count and execution cycle. To verify the on‐chip debug architecture, it is applied to the debugger of the prototype multicore processor and is tested by interconnecting it with a software debugger based on GDB and configured for the target processor.

13.

Hardware Acceleration for Media/Transaction Applications in Network Processors

《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2009,17(12):1691-1697

As the network environment is rapidly changing, network interfaces demand highly intelligent traffic management (on the control plane) in addition to the basic requirement of wire-speed packet forwarding (on the data plane). Several vendors are releasing various network processors (NPs) in order to handle these demands, but they are optimized mostly for data plane throughput. As demands for control plane applications (e.g., quality of service) grow, efficient control plane processing will become increasingly important to good network interface performance. In this paper, we explore acceleration techniques to improve the performance of control plane network applications. Three applications, including media transcoding and transaction applications, are analyzed in detail. The workload characterization shows that a wide-issue configuration saturates early in performance, and sensitivity analysis shows no common bottleneck among the applications. Therefore, we study giving each application its own hardware acceleration module in order to achieve the required throughput on OC-768 or higher. Our approach includes an array-style accelerator for media transcoding applications and a partitioned lookup mechanism for lookup-table-related applications. Performance analysis of the proposed techniques shows significant improvement over the baseline configuration. Such hardware accelerators provide large packet-level parallelism proportional to the number of processing elements added. Our analyses of the proposed techniques suggest future directions for the design of high-performance NPs.

14.

A Design Flow for Architecture Exploration and Implementation of Partially Reconfigurable Processors (total citations: 1; self-citations: 0; citations by others: 1)

《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2008,16(10):1281-1294

15.

Loop Transforming for Reducing Data Alignment on Multi-Core SIMD Processors

Yi Wang Linfeng Pan Zili Shao Yong Guan Minyi Guo 《Journal of Signal Processing Systems》2014,74(2):137-150

Multimedia SIMD extensions are commonly employed today to speed up media processing. When performing vectorization for SIMD architectures, one of the major issues is handling memory alignment. Prior studies focused on either vectorizing loops with all memory references being properly aligned, or introducing extra operations to deal with the misaligned memory references. On the other hand, multi-core SIMD architectures require coarse-grain parallelism. Therefore, it is an important problem to study how to parallelize and vectorize loop nests with the awareness of data misalignments. This paper presents a loop transformation scheme that maximizes the parallelism of outermost loops, while the misaligned memory references in innermost loops are reduced. The basic idea of our technique is to align each level of loops in the nest, considering the constraint of dependence relations. To reduce the data misalignments, we establish a mathematical model with a concept of offset-collection and propose an effective heuristic algorithm. For coarser-grain parallelism, we propose some rules to analyze the outermost loop. When transformations are applied, the inner loops are involved to maximize the parallelism. To avoid introducing more data misalignments, the involved innermost loop is handled from other levels of loops. Experimental results show that 7% to 37% (18.4% on average) of misaligned memory references can be eliminated. The simulations on CELL show that a 1.1x speedup can be reached by reducing the misaligned data, while a 6.14x speedup can be achieved by enhancing the parallelism for multi-core.

16.

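Data Exchange and Storage Architecture of the Non-Linear Editing and Script Networks and the Application of Auxiliary Software

朱东 《中国有线电视》2014,(3):294-299

Based on the operating practice of the non-linear editing and script networks at the Yuhuan County Radio and Television Station, this paper describes experience and methods for keeping the station's non-linear editing production system running safely and efficiently. It covers the security architecture of the two networks, secure data exchange between them, secure storage and backup of script network data, scheduled secure backup of the non-linear editing network database, the use of data recovery software on the non-linear editing network, and disk defragmentation methods.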

17.

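On the Data Architecture Design of Data Warehouses

张曙明 《信息通信技术》2009,3(6):11-15

Data architecture design is one of the core tasks of data warehouse technology. Drawing on a case of building a data warehouse system for a telecommunications enterprise, this paper briefly discusses the basic principles and main design features of data architecture.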

18.

Reconfigurable Filter Coprocessor Architecture for DSP Applications (total citations: 1; self-citations: 0; citations by others: 1)

S. Ramanathan S.K. Nandy V. Visvanathan 《The Journal of VLSI Signal Processing》2000,26(3):333-359

Digital Signal Processing (DSP) is widely used in high-performance media processing and communication systems. In the majority of these applications, critical DSP functions are realized as embedded cores to meet the low-power budget and high computational complexity. Usually these cores are ASICs that cannot be easily retargeted for other similar applications that share certain commonalities. This stretches the design cycle, which affects time-to-market constraints. In this paper, we present a reconfigurable high-performance low-power filter coprocessor architecture for DSP applications. The coprocessor architecture, apart from having the performance and power advantage of its ASIC counterpart, can be reconfigured to support a wide variety of filtering computations. Since filtering computations abound in DSP applications, the implementation of this coprocessor architecture can serve as an important embedded hardware IP.

19.

Fully Systolic FFT Architecture for Giga-sample Applications

K. Babionitakis V. A. Chouliaras K. Manolopoulos K. Nakos D. Reisis N. Vlassopoulos 《Journal of Signal Processing Systems》2010,58(3):281-299

We present a novel 4096 complex-point, fully systolic VLSI FFT architecture based on the combination of three consecutive radix-4 stages resulting in a 64-point FFT engine. The outcome of cascading these 64-point FFT engines is an improved architecture that efficiently processes large input data sets in real time. Using 64-point FFT engines reduces the buffering and the latency to one third of a fully unfolded radix-4 architecture, while the radix-4 schema simplifies the calculations within each engine. The proposed 4096 complex point architecture has been implemented on an FPGA achieving a post-route clock frequency of 200 MHz resulting in a sustained throughput of 4096 point/20.48 μs. It has also been implemented on a high performance 0.13 μm, 1P8M CMOS process achieving a worst-case (0.9 V, 125 °C) post-route clock frequency of 604.5 MHz and a sustained throughput of 4096 point/3.89 μs while consuming 4.4 W. The architecture is extended to accomplish FFT computations of 16K, 64K and 256K complex points with 352, 256 and 188 MHz operating frequencies respectively.

20.

Configurable Data Memory for Multimedia Processing

Eero Aho Jarno Vanne Timo D. Hämäläinen 《Journal of Signal Processing Systems》2008,50(2):231-249

In modern multimedia applications, the memory bottleneck can be alleviated with special stride data accesses. Data elements in a stride access can be retrieved in parallel with parallel memories, in which the idea is to increase memory bandwidth with several memory modules working in parallel and feed the processor with only necessary data. Arbitrary stride access capability with interleaved memories is described in previous research where the skewing scheme is changed at run time according to the currently used stride. This paper presents improved schemes which are adapted to parallel memories. The proposed novel parallel memory implementation allows conflict-free accesses with all constant strides, which has not been possible in prior application-specific parallel memories. Moreover, the possible access locations are unrestricted and the accessed data element count equals the number of memory modules. Timing and area estimates are given for an Altera Stratix FPGA and a 0.18 micrometer CMOS process with memory module counts from 2 to 32. The FPGA results show a 129 MHz clock frequency for a system with 16 memory modules when read and write latencies are 3 and 2 clock cycles, respectively. The complexity of the proposed system is shown to be a trade-off between application-specific and highly configurable parallel memory systems.
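To see why a fixed module assignment cannot serve every stride, which is the limitation the run-time skewing schemes in entry 20 address, the sketch below checks a baseline mapping. It assumes plain low-order interleaving (module = address mod N), which is a reference point for illustration and not the paper's scheme; the function names are hypothetical. Under that assumption, N elements at a constant stride hit N distinct modules exactly when the stride and N share no common factor.

```python
from math import gcd


def modules_hit(base, stride, num_modules):
    """Module index of each of the num_modules elements in a strided
    access, under plain low-order interleaving."""
    return [(base + i * stride) % num_modules for i in range(num_modules)]


def is_conflict_free(base, stride, num_modules):
    """True when every element of the access lands in a different module."""
    return len(set(modules_hit(base, stride, num_modules))) == num_modules


def predicted_conflict_free(stride, num_modules):
    """Closed-form condition for the baseline mapping."""
    return gcd(stride, num_modules) == 1
```

With eight modules, a stride of 3 is conflict-free while a stride of 2 uses only four of the modules, so half the bandwidth is lost unless the mapping is changed for that stride.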