Found 20 similar articles (search time: 0 ms)
1.
Teemu Pitkänen Jarno K. Tanskanen Risto Mäkinen Jarmo Takala 《Journal of Signal Processing Systems》2009,57(1):21-32
Many of the current applications used in battery powered devices are from digital signal processing, telecommunication, and
multimedia domains. These applications typically set high requirements for computational performance and often parallelism
is the key solution to meet the performance requirements. In order to exploit the parallel processing units, memory should
be able to feed the data path with data. This calls for a memory organization supporting parallel memory accesses. In this
paper, a conflict resolving parallel data memory system for application-specific instruction-set processors is described.
The memory structure is generic and reusable to support various application-specific designs. The proposed memory system does
not employ any predefined access format signals for memory addressing. The proposed parallel memory system is attached to
an application-specific instruction-set processor core, and comparisons of area, power, and critical path are shown. The experiments
show that significant power savings can be obtained by exploiting the parallel memory system instead of multi-port memory.
2.
Architecture and Compiler Optimizations for Data Bandwidth Improvement in Configurable Processors (total citations: 1; self-citations: 0; citations by others: 1)
《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2006,14(9):986-997
Many commercially available embedded processors are capable of extending their base instruction set for a specific domain of applications. While steady progress has been made in the tools and methodologies of automatic instruction set extension for configurable processors, the limited data bandwidth available in the core processor (e.g., the number of simultaneous accesses to the register file) becomes a potential performance bottleneck. In this paper, we first present a quantitative analysis of the data bandwidth limitation in configurable processors, and then propose a novel low-cost architectural extension and associated compilation techniques to address the problem. Specifically, we embed a single control bit in the instruction op-codes to selectively copy the execution results to a set of hash-mapped shadow registers in the write-back stage. This can efficiently reduce the communication overhead due to data transfers between the core processor and the custom logic. We also present a novel simultaneous global shadow register binding with a hash function generation algorithm to take full advantage of the extension. The application of our approach leads to a nearly optimal performance speedup
3.
N. C. Paver M. H. Khan B. C. Aldrich C. D. Emmons 《The Journal of VLSI Signal Processing》2005,41(1):21-34
Providing quality mobile video applications in hand-held mobile devices requires increased computational capability. Using Single Instruction Multiple Data (SIMD) techniques to expose and accelerate the data parallelism inherent in video processing increases performance in handheld and wireless systems. The paper introduces a new 64-bit SIMD coprocessor of the Intel® XScale® microarchitecture which is optimized for low-power handheld applications. The architecture blends the SIMD media processing style with the capabilities of the XScale microarchitecture. This paper provides an overview of the architecture, its instruction set, programming model, the pipeline organization and functional units. The paper also describes how key features of the architecture improve the performance of video applications as compared to a scalar implementation. The performance and power improvements based upon measured results are analyzed to show how the opportunities of power savings by reducing the frequency and voltage can be realized.
Nigel C. Paver has 13 years experience with the ARM architecture, and in the Intel PCA Components group in Austin, Texas, he is responsible for the architecture and implementation of multimedia coprocessors for the Intel XScale micro-architecture. He is also involved in product architecture and definition of Intel PCA processors. Before Intel, Nigel was one of the lead designers of the early AMULET asynchronous ARM microprocessors at the University of Manchester. He was also vice president in a startup company which used asynchronous design techniques to produce a low-power asynchronous DSP core. Nigel holds a Master of Science degree and Ph.D. in computer science from the University of Manchester and a Bachelor of Science degree in electronics from UMIST.
Moinul Khan is a multimedia product architect at Intel Corporation PCA Components group. He is responsible for PCA graphics and security architecture. His research interests are virtual prototyping, signal processing algorithms and architecture and communications networking. Before joining Intel he was a technology specialist and founding member of a startup at ATDC, Georgia. He worked on his doctoral research at Georgia Center for Advanced Telecommunications Technology at Georgia Institute of Technology. He received his B.Tech from the Indian Institute of Technology and MSEE from Georgia Tech. He also worked as a research member for Canadian Institute for Telecommunications Research and Bell Communications Laboratories.
Bradley C. Aldrich joined Intel in 1997 where he is currently an architect within the PCA Components Group. His current work includes the development of coprocessor instruction support in addition to image capture and display technologies for XScale based application processors. He was previously a member of the Intel/Analog Devices joint development architecture team responsible for video enhancements for the Micro Signal Architecture. Prior to that he was a video system architect in Intel’s Digital Imaging and Video Division working on CMOS sensors, still cameras, and tethered PC based video peripherals. He has also worked as a device engineer for Motorola and as a test engineer for Tektronix. He received a BSEE in 1988 and MSEE in 1994 from the University of Texas at San Antonio.
Christopher D. Emmons received a Bachelor of Science degree in Computer Science from the University of Texas at Austin in 2003. He joined Intel in 2001 and is currently a multimedia architect responsible for algorithm development and performance optimization for handheld products within the PCA Components Group. Prior to this he worked as an applications engineer providing performance and power analysis in support of product marketing groups. His research interests include video compression, operating system design, and dynamic resource management.
4.
Jonah Probell 《Journal of Signal Processing Systems》2008,50(1):33-39
Many different video processor architectures exist. Its architecture gives a processor strength for a particular application.
Hardwired logic yields the best performance/cost, but a programmable processor is important for applications that support
multiple coding standards, proprietary functions, or future changes to application requirements. Programmable video processor
architectures achieve best performance through the use of parallelism at the data (SIMD), instruction (VLIW), and multiprocessor
level, and optimally sized ALU, multiplier, and load/store datapaths. Because low-cost memory architectures are not optimized
for the random access patterns of video processing, the performance of video processors is often limited by memory bandwidth
rather than processing resources. Careful data organization alleviates memory bandwidth limitations. When choosing a video
processor it is important to consider many factors, particularly performance, cost, power consumption, programmability, and
peripheral support.
5.
This paper presents a Computational Memory architecture for MPEG-4 applications with mobile devices. The proposed architecture
is used for real-time block-based motion estimation, which is the most computationally intensive task in the video encoder.
It uses the exhaustive block-matching algorithm (EBMA) for motion estimation. The proposed architecture consists of embedded
SRAMs and a number of block-matching units working in parallel to process video data while stored in the memory. The block-matching
units access the embedded SRAMs simultaneously, which increases the speed of the architecture.
The architecture processes CIF format video sequences (i.e., the frame size is 352 × 288 pixels) with block size of 16 × 16
pixels and ±15 pixels search range. The proposed architecture has been designed, prototyped, and simulated for 0.18 μm TSMC
CMOS technology. The simulation shows that the proposed architecture processes up to 126 CIF frames per second at a clock
frequency of 100 MHz. The synthesized prototype of the proposed architecture includes 200 KB memory and it has an area of 33.75
mm² and consumes 986.96 mW @100 MHz.
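To make the workload concrete, the exhaustive block-matching search described in this abstract can be sketched in software as follows. This is a minimal illustration, not the paper's hardware design: the function names `sad` and `ebma`, the use of the sum of absolute differences as the matching cost, and the choice to skip candidates that fall outside the reference frame are all assumptions.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block at (bx, by)
    in the current frame and the block displaced by (dx, dy) in the
    reference frame. Frames are lists of rows of pixel values."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(n)
        for x in range(n)
    )


def ebma(cur, ref, bx, by, n=16, search=15):
    """Exhaustive block matching: test every displacement in
    [-search, +search] in both directions and keep the lowest cost.
    Returns (min_cost, dx, dy), or None if no candidate is valid."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Assumption: candidates outside the reference frame are skipped.
            if by + dy < 0 or by + dy + n > h:
                continue
            if bx + dx < 0 or bx + dx + n > w:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

With the figures quoted above, a CIF frame holds 22 × 18 = 396 blocks of 16 × 16 pixels, each with up to 31 × 31 = 961 candidate displacements and 256 pixel differences per candidate, so roughly 97 million absolute-difference operations per frame at most. That scale is presumably why the search is spread over parallel block-matching units next to the memory.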
Mohammed Sayed received his B.Sc. degree from Zagazig University, Zagazig, Egypt, in 1997 and a postgraduate diploma in VLSI design from
the Information Technology Institute (ITI), Cairo, Egypt, in 1998. In 2003 he received his M.Sc. degree from University of
Calgary, Calgary, Canada. From 1998 to 2001 he was a research and teaching assistant at the Electronics & Communications Engineering
Department, Zagazig University, Egypt. In 2001 he became a research assistant at the Department of Electrical and Computer
Engineering, University of Calgary, Canada. His current research interests are System-on-Chip, Embedded Memories, and Digital
Video Processing.
Mr. Sayed received a number of scholarships and awards such as iCORE Scholarship from 2003 to 2005, SMC Industrial Collaboration
Award in June 2003, and the Micronet Annual Workshop Best Paper Award in April 2002. He has a number of journal and conference
publications and a number of contributions to the MPEG-4 standard (ISO/IEC JTC1/SC29/WG11 MPEG2002/ M8562 and M8563).
Wael Badawy is an associate professor in the Department of Electrical and Computer Engineering. He holds an adjunct professorship in the
Department of Mechanical Engineering, University of Alberta.
Dr. Badawy's research interests are in the areas of: Microelectronics, VLSI architectures for video applications with low-bit
rate applications, digital video processing, low power design methodologies, and VLSI prototyping. His research involves designing
new models, techniques, algorithms, architectures and low power prototypes for novel systems and consumer products. Dr. Badawy
authored and co-authored more than 100 peer reviewed Journal and Conference papers and about 30 technical reports. He is the
Guest Editor for the special issue on System on Chip for Real-Time Applications in the Canadian Journal on Electrical and
Computer Engineering, the Technical Chair for the 2002 International Workshop on SoC for real-time applications, and a technical
reviewer in several IEEE journals and conferences. He is currently a member of the IEEE-CAS Technical Committee on Communication.
Dr. Badawy was honored with the “2002 Petro Canada Young Innovator Award”, “2001 Micralyne Microsystems Design Award” and
the 1998 Upsilon Pi Epsilon Honor Society and IEEE Computer Society Award for Academic Excellence in Computer Disciplines.
He is currently the Chairman of the Canadian Advisory Committee (CAC) and Head of the Canadian Delegation on ISO/IEC/JTC1/SC6
“Telecommunications and Information Exchange Between Systems”. Member, The Canadian Advisory Committee for the Standards Council
of Canada—Subcommittee 29: Coding of Audio, Picture Multimedia and Hypermedia Information, and Canadian Delegate, The ISO/IEC
MPEG standard committee. He is a voting Member on the VSI Alliance. He is also the Chair of the IEEE-Southern Alberta Society-Computer
Chapter.
6.
Efforts to reduce high-speed memory interface power have led to the adoption of data bus inversion or bus-invert coding. This study compares two popular algorithms, which seek to limit the number of simultaneously transitioning signals and bias the state of transmitted data toward a preferred binary level, respectively. A new algorithm, which provides a compromise between transition frequency and preferred signal level, is proposed, and the three algorithms are compared in terms of their impact on power consumption, power supply noise reduction, and general signal integrity enhancement when used in conjunction with a variety of link topologies.
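The two established approaches mentioned in this abstract can be illustrated with a short sketch. The function names, the 8-bit lane width, the choice of logic 1 as the preferred level, and the "more than half" thresholds are assumptions made for illustration; the abstract does not specify them, and the proposed compromise algorithm is not reproduced here.

```python
def popcount(x: int) -> int:
    """Number of 1 bits in x."""
    return bin(x).count("1")


def invert_for_level(word: int, width: int = 8) -> tuple[int, int]:
    """Level-biased inversion: invert the word when more than half of its
    bits are 0, so the transmitted word is biased toward 1s.
    Returns (transmitted_word, invert_flag)."""
    mask = (1 << width) - 1
    word &= mask
    zeros = width - popcount(word)
    if zeros > width // 2:
        return (~word) & mask, 1
    return word, 0


def invert_for_transitions(word: int, prev_sent: int, width: int = 8) -> tuple[int, int]:
    """Transition-limited inversion: invert the word when more than half of
    the lines would toggle relative to the previously transmitted word.
    Returns (transmitted_word, invert_flag)."""
    mask = (1 << width) - 1
    word &= mask
    toggles = popcount((word ^ prev_sent) & mask)
    if toggles > width // 2:
        return (~word) & mask, 1
    return word, 0
```

In both cases the receiver restores the data by inverting the word again whenever the flag is set. The sketch ignores the cost of toggling the flag line itself, which a real link would have to account for.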
7.
Saito H. Nakajima M. Okamoto T. Yamada Y. Ohuchi A. Iguchi N. Sakamoto T. Yamaguchi K. Mizuno M. 《Solid-State Circuits, IEEE Journal of》2010,45(1):15-22
A dynamic-reconfigurable memory chip is fabricated, by which on-chip memories of an SoC chip can be moved to the memory chip to increase the efficiency of memory usage, and stacked on a logic chip by using three-dimensional packaging technology. In the memory chip, many RAM-macros are arrayed and they are connected through two-dimensional mesh network interconnects. By using memory-specified network interconnects, area overhead of network interconnects for the memory chip is reduced by 63% and the latency overhead by 43%. Signal lines between the two chips are directly connected by 10-μm-pitch inter-chip electrodes, resulting in fast and low-energy inter-chip transmission.
8.
This paper presents a pipelined, reduced memory and low power CORDIC-based architecture for fast Fourier transform implementation.
The proposed algorithm utilizes a new addressing scheme and the associated angle generator logic in order to remove any ROM
usage for storing twiddle factors. As a case study, the radix-2 and radix-4 FFT algorithms have been implemented on FPGA hardware.
The synthesis results match the theoretical analysis and it can be observed that more than 20% reduction can be achieved in
total memory logic. In addition, the dynamic power consumption can be reduced by as much as 15% by reducing memory accesses.
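As background for why a CORDIC datapath can stand in for a stored twiddle-factor table, the following sketch rotates a complex sample by a given angle using only micro-rotations by arctan(2^-i). It is a generic floating-point CORDIC illustration, not the paper's addressing scheme or angle generator; the function name, iteration count, and use of floats instead of fixed-point shifts are assumptions.

```python
import math


def cordic_rotate(x, y, angle, iterations=16):
    """Rotate the point (x, y) by `angle` radians (|angle| <= pi/2) using
    CORDIC micro-rotations. In hardware the multiplications by 2**-i
    become shifts; floats are used here only for readability."""
    # Elementary angles and the accumulated gain of the micro-rotations.
    atans = [math.atan(2.0 ** -i) for i in range(iterations)]
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    z = angle  # residual angle still to be rotated
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * atans[i]
    # Compensate for the constant gain introduced by the micro-rotations.
    return x / gain, y / gain
```

Multiplying a sample by the twiddle factor exp(-j·2πk/N) is the same as rotating it by the angle -2πk/N, so `cordic_rotate(re, im, -2 * math.pi * k / N)` approximates that product for angles within the stated range. Larger angles would need a quadrant correction first, which is omitted here.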
9.
Frank Zhigang Wang Na Helian Sining Wu Yuhui Deng Vineet Khare Chris Thompson 《The Journal of VLSI Signal Processing》2007,48(3):311-324
This paper examines and investigates the relationship between bioinformatics data processing and its underlying computing
architecture within the context of the International Nucleotide Sequence Database Collaboration (INSDC). INSDC exchanges sequence
data on a daily basis across its three member organizations in USA, UK and Japan. We studied how this sequence database in
MySQL can best take advantage of the increased transfer bandwidth of a grid-based storage architecture. Within the context
of the UK Government Project “Grid-oriented Storage (GOS)” and the EC Project “EuroAsiaGrid,” GOS has been developed in our
lab, which incorporates a parallel streaming technique to meet the needs of WAN/Grid-based virtual organizations. A real-world test
shows that the INSDC sequence database backup operation, mysqldump, over the pipelined GOS architecture outperforms the same operation over
the classic infrastructures by a factor of six on the link between Cambridge and Tokyo. When performing a genomic sequence search
against one million records via the underlying GOS architecture, a performance improvement of 67.3% was achieved.
10.
11.
Bjorn De Sutter Osman Allam Praveen Raghavan Roeland Vandebriel Hans Cappelle Tom Vander Aa Bingfeng Mei 《Journal of Signal Processing Systems》2010,61(2):157-179
This paper presents a memory organization for SDR inner modem baseband processors that focus on exploiting ILP. This memory
organization uses power-efficient, single-ported, interleaved scratch-pad memory banks to provide enough bandwidth to high-ILP
processors. A system of queues in the memory interface is used to resolve bank conflicts among the single-ported banks, and
to spread long bursts of conflicting accesses to the same bank over time. Bank address rotation is used to spread long bursts
of conflicting accesses over multiple banks. All proposed techniques have been implemented in hardware, and are evaluated
for a number of different wireless communication standards. For the 11a|n benchmarks, the overhead of stall cycles resulting
from unresolved bank conflicts can be reduced to below 2% with the proposed organization. For 3GPP-LTE, the most demanding
wireless standard we evaluated, the overhead is reduced to less than 0.13%. This is achieved with little energy and area overhead,
and without any bank-aware compiler support.
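The effect of bank conflicts on single-ported interleaved banks, and the way an address rotation can spread them, can be sketched with a toy model. The rotation function below is one hypothetical choice (shifting the bank index by the row number); the abstract does not define the actual rotation used, and the queueing in the memory interface is not modeled.

```python
from collections import Counter


def plain_bank(addr: int, n_banks: int) -> int:
    """Plain low-order interleaving: consecutive words go to consecutive banks."""
    return addr % n_banks


def rotated_bank(addr: int, n_banks: int) -> int:
    """Hypothetical rotation: offset the bank index by the row number so
    that words in the same column of successive rows land in different banks."""
    return (addr + addr // n_banks) % n_banks


def cycles_needed(addrs, bank_fn, n_banks: int) -> int:
    """Cycles to serve a group of simultaneous requests when every
    single-ported bank can serve one request per cycle."""
    load = Counter(bank_fn(a, n_banks) for a in addrs)
    return max(load.values())
```

For four banks, the addresses 0, 4, 8, 12 all map to bank 0 under plain interleaving and need four cycles, whereas the rotated mapping places them in four different banks and serves them in one. Unit-stride accesses remain conflict-free under both mappings in this model.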
12.
Multicore processors are attracting interest worldwide. As system-on-chip design technology has developed, observing and controlling a processor core's internal state has become harder. Multicore processor debugging is therefore very difficult and time-consuming, and a reliable and efficient debugger is needed to find bugs. In this paper, we propose an on-chip debug architecture for multicore processors that is easily adaptable and flexible. It is based on the JTAG standard and supports monitoring mode debugging, which is different from run-stop mode debugging. Compared with a debug architecture that supports run-stop mode debugging, the proposed architecture is easily applied to a debugger and has the advantage of a desirable gate count and execution cycle. To verify the on-chip debug architecture, it is applied to the debugger of a prototype multicore processor and is tested by interconnecting it with a software debugger based on GDB and configured for the target processor.
13.
《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2009,17(12):1691-1697
14.
15.
Yi Wang Linfeng Pan Zili Shao Yong Guan Minyi Guo 《Journal of Signal Processing Systems》2014,74(2):137-150
Multimedia SIMD extensions are commonly employed today to speed up media processing. When performing vectorization for SIMD architectures, one of the major issues is handling memory alignment. Prior studies focused on either vectorizing loops with all memory references being properly aligned, or introducing extra operations to deal with the misaligned memory references. On the other hand, multi-core SIMD architectures require coarse-grain parallelism. Therefore, it is an important problem to study how to parallelize and vectorize loop nests with awareness of data misalignments. This paper presents a loop transformation scheme that maximizes the parallelism of outermost loops, while the misaligned memory references in innermost loops are reduced. The basic idea of our technique is to align each level of loops in the nest, considering the constraint of dependence relations. To reduce the data misalignments, we establish a mathematical model with a concept of offset-collection and propose an effective heuristic algorithm. For coarser-grain parallelism, we propose some rules to analyze the outermost loop. When transformations are applied, the inner loops are involved to maximize the parallelism. To avoid introducing more data misalignments, the involved innermost loop is handled from other levels of loops. Experimental results show that 7% to 37% (18.4% on average) of misaligned memory references can be eliminated. The simulations on CELL show that a 1.1x speedup can be reached by reducing the misaligned data, while a 6.14x speedup can be achieved by enhancing the parallelism for multi-core.
16.
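Based on operating experience with the non-linear editing (NLE) and script networks at the Yuhuan County Radio and Television Station, this article describes experience and methods for keeping a television station's NLE production system running safely and efficiently. It covers the security architecture of the NLE and script networks, secure data exchange between the two networks, secure storage and backup of script-network data, scheduled secure backup of the NLE network database, the use of data recovery software on the NLE network, and disk defragmentation methods.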
17.
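Data architecture design is one of the core tasks in data warehousing. Drawing on a case study of building a data warehouse system for a telecommunications company, this article briefly discusses the basic principles and main design features of data architecture.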
18.
Digital Signal Processing (DSP) is widely used in high-performance media processing and communication systems. In the majority of these applications, critical DSP functions are realized as embedded cores to meet the low-power budget and high computational complexity. Usually these cores are ASICs that cannot be easily retargeted for other similar applications that share certain commonalities. This stretches the design cycle, which affects time-to-market constraints. In this paper, we present a reconfigurable high-performance low-power filter coprocessor architecture for DSP applications. The coprocessor architecture, apart from having the performance and power advantage of its ASIC counterpart, can be reconfigured to support a wide variety of filtering computations. Since filtering computations abound in DSP applications, the implementation of this coprocessor architecture can serve as an important embedded hardware IP.
19.
K. Babionitakis V. A. Chouliaras K. Manolopoulos K. Nakos D. Reisis N. Vlassopoulos 《Journal of Signal Processing Systems》2010,58(3):281-299
We present a novel 4096 complex-point, fully systolic VLSI FFT architecture based on the combination of three consecutive
radix-4 stages resulting in a 64-point FFT engine. The outcome of cascading these 64-point FFT engines is an improved architecture
that efficiently processes large input data sets in real time. Using 64-point FFT engines reduces the buffering and the latency
to one third of a fully unfolded radix-4 architecture, while the radix-4 schema simplifies the calculations within each engine.
The proposed 4096 complex-point architecture has been implemented on an FPGA achieving a post-route clock frequency of 200 MHz
resulting in a sustained throughput of 4096 point/20.48 μs. It has also been implemented on a high performance 0.13 μm, 1P8M
CMOS process achieving a worst-case (0.9 V, 125 °C) post-route clock frequency of 604.5 MHz and a sustained throughput of 4096
point/3.89 μs while consuming 4.4 W. The architecture is extended to accomplish FFT computations of 16K, 64K and 256K complex
points with 352, 256 and 188 MHz operating frequencies respectively.
20.
In modern multimedia applications, the memory bottleneck can be alleviated with special stride data accesses. Data elements in a stride access can be retrieved in parallel with parallel memories, in which the idea is to increase memory bandwidth with several memory modules working in parallel and feed the processor with only the necessary data. Arbitrary stride access capability with interleaved memories is described in previous research where the skewing scheme is changed at run time according to the currently used stride. This paper presents improved schemes which are adapted to parallel memories. The proposed novel parallel memory implementation allows conflict-free accesses with all constant strides, which has not been possible in prior application-specific parallel memories. Moreover, the possible access locations are unrestricted and the accessed data element count equals the number of memory modules. Timing and area estimates are given for an Altera Stratix FPGA and a 0.18 micrometer CMOS process with memory module counts from 2 to 32. The FPGA results show a 129 MHz clock frequency for a system with 16 memory modules when read and write latencies are 3 and 2 clock cycles, respectively. The complexity of the proposed system is shown to be a trade-off between application-specific and highly configurable parallel memory systems.
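The idea of choosing the module mapping according to the current stride can be illustrated with a simplified sketch. It is not the scheme proposed in this paper: the mapping below is a textbook-style illustration that assumes a power-of-two number of modules and a positive stride, and the names `plain_module`, `stride_aware_module`, and `is_conflict_free` are hypothetical.

```python
def plain_module(addr: int, n_modules: int) -> int:
    """Plain low-order interleaving, independent of the stride."""
    return addr % n_modules


def stride_aware_module(addr: int, n_modules: int, stride: int) -> int:
    """Stride-dependent mapping: write the stride as sigma * 2**q with
    sigma odd, drop the q low address bits, then interleave.
    Assumes stride > 0 and n_modules is a power of two."""
    q = (stride & -stride).bit_length() - 1  # trailing zero bits of stride
    return (addr >> q) % n_modules


def is_conflict_free(base: int, stride: int, n_modules: int) -> bool:
    """True if n_modules elements starting at `base` with the given stride
    fall into n_modules distinct modules under the stride-aware mapping."""
    modules = {
        stride_aware_module(base + i * stride, n_modules, stride)
        for i in range(n_modules)
    }
    return len(modules) == n_modules
```

Under plain interleaving with 8 modules, a stride of 8 sends every element to the same module. With the stride-aware mapping, the accessed module indices advance by the odd factor sigma, which visits all modules of a power-of-two bank set exactly once. The catch is that the data layout then depends on the stride, which is one reason such schemes are described as changing the skewing at run time.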