Found 20 similar articles (search time: 0 ms)
1.
Robert Schreiber Shail Aditya Scott Mahlke Vinod Kathail B. Ramakrishna Rau Darren Cronquist Mukund Sivaraman 《The Journal of VLSI Signal Processing》2002,31(2):127-142
The PICO-NPA system automatically synthesizes nonprogrammable accelerators (NPAs) to be used as co-processors for functions expressed as loop nests in C. The NPAs it generates consist of a synchronous array of one or more customized processor datapaths, their controller, local memory, and interfaces. The user, or a design space exploration tool that is a part of the full PICO system, identifies within the application a loop nest to be implemented as an NPA, and indicates the performance required of the NPA by specifying the number of processors and the number of machine cycles that each processor uses per iteration of the inner loop. PICO-NPA emits synthesizable HDL that defines the accelerator at the register transfer level (RTL). The system also modifies the user's application software to make use of the generated accelerator. The main objective of PICO-NPA is to reduce design cost and time, without significantly reducing design quality. Design of an NPA and its support software typically requires one or two weeks using PICO-NPA, which is a many-fold improvement over the industry norm. In addition, PICO-NPA can readily generate a wide range of implementations with scalable performance from a single specification. In an experimental comparison of NPAs of equivalent throughput, PICO-NPA designs are slightly more costly than hand-designed accelerators. Logic synthesis and place-and-route have been performed successfully on PICO-NPA designs, which have achieved high clock rates.
2.
Journal of Signal Processing Systems - The open-source hardware/software framework TaPaSCo aims to make reconfigurable computing on FPGAs more accessible to non-experts. To this end, it provides an...
3.
The explosive growth of the mobile multimedia industry has accentuated the need for efficient VLSI implementations of the associated computationally demanding signal processing algorithms. In particular, the short battery life caused by excessive power consumption of mobile devices has become the biggest obstacle facing truly mobile multimedia. We propose novel hardware accelerator architectures for two of the most computationally demanding algorithms of the MPEG-4 video compression standard: the forward and inverse shape adaptive discrete cosine transforms (SA-DCT/IDCT). These accelerators have been designed using general low-energy design philosophies at the algorithmic/architectural abstraction levels. The themes of these philosophies are avoiding waste and trading area/performance for power and energy gains. Each core has been synthesised targeting TSMC 0.09 μm TCBN90LP technology, and the experimental results presented in this paper show that the proposed cores improve upon the prior art.
Noel O’Connor
4.
Modular arithmetic is a building block for a variety of applications potentially supported on embedded systems. One approach to making modular arithmetic more efficient is to identify algorithmic modifications that enhance the parallelization of the target arithmetic in order to exploit the properties of parallel devices and platforms. The Residue Number System (RNS) introduces data-level parallelism, enabling parallelization even for algorithms based on modular arithmetic with several data dependencies. However, mapping generic algorithms to full RNS-based implementations can be complex and requires suitable hardware architectures that are scalable and adaptable to different demands. This paper proposes and discusses an architecture with scalability features for the parallel implementation of algorithms relying on modular arithmetic fully supported by the RNS. The systematic mapping of a generic modular arithmetic algorithm to the architecture is presented. It can be applied as a high-level synthesis step in an Application Specific Integrated Circuit (ASIC) or Field Programmable Gate Array (FPGA) design flow targeting modular arithmetic algorithms. An implementation of modular exponentiation and Elliptic Curve (EC) point multiplication, used in the Rivest-Shamir-Adleman (RSA) and EC cryptographic algorithms, on Xilinx Virtex 4 and Altera Stratix II FPGA technologies suggests latencies of the same order of magnitude as the fastest hardware implementations of these operations known to date.
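The data-level parallelism that the RNS provides can be illustrated with a small sketch. It is not the architecture proposed in the paper: the moduli and the function names (`to_rns`, `rns_add`, `rns_mul`, `from_rns`) are assumptions chosen for illustration. Each residue channel is computed independently of the others, and the Chinese Remainder Theorem recovers the ordinary integer at the end.

```python
from math import prod

# Hypothetical pairwise-coprime moduli; the dynamic range is their product (15015).
MODULI = (7, 11, 13, 15)


def to_rns(x, moduli=MODULI):
    """Represent an integer by its residues, one per modulus."""
    return tuple(x % m for m in moduli)


def rns_add(a, b, moduli=MODULI):
    """Channel-wise addition: no carries cross between channels."""
    return tuple((x + y) % m for x, y, m in zip(a, b, moduli))


def rns_mul(a, b, moduli=MODULI):
    """Channel-wise multiplication: each channel could run in parallel."""
    return tuple((x * y) % m for x, y, m in zip(a, b, moduli))


def from_rns(residues, moduli=MODULI):
    """Convert back to an integer with the Chinese Remainder Theorem."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        m_i = big_m // m
        # pow(m_i, -1, m) is the modular inverse (Python 3.8+).
        total += r * m_i * pow(m_i, -1, m)
    return total % big_m


if __name__ == "__main__":
    a, b = to_rns(123), to_rns(97)
    print(from_rns(rns_add(a, b)))  # 220
    print(from_rns(rns_mul(a, b)))  # 11931
```

Results are only meaningful while they stay below the product of the moduli, which is why a real design sizes the moduli set to the operand widths it must support.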
5.
《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2009,17(2):221-233
ARISE introduces a systematic approach for extending an embedded processor once so that it can thereafter support the coupling of an arbitrary number of custom computing units (CCUs). A CCU can be a hardwired or a reconfigurable unit, which can be utilized following a tight and/or loose model of computation. By selecting the appropriate model of computation for each part of the application, the complete application space is considered for acceleration, resulting in significant performance improvements. ARISE also offers modularity and scalability and is not restricted by the opcode-space and operand limitations that exist in this type of machine. To support these features, we introduce a machine organization that allows the cooperation of a processor and a set of CCUs. To control the CCUs, we extend the instruction set of the processor once, with eight instructions. To efficiently incorporate these features into an embedded processor, we propose a micro-architecture implementation that minimizes the control and communication overhead between the processor and the CCUs. To evaluate our proposal, we extended a MIPS processor with the ARISE infrastructure and implemented it on a Xilinx field-programmable gate array (FPGA). Implementation results demonstrate that the timing model of the processor is not affected. We also implemented a set of benchmarks on the ARISE evaluation machine. Performance results show significant improvements and reduced communication overhead compared to a typical coprocessor approach.
6.
A Cost-Efficient Scheduling Algorithm of On-Demand Broadcasts (Cited by: 3; self-citations: 0; citations by others: 3)
In mobile wireless systems, data on air can be accessed by a large number of mobile users. Many of these applications, including wireless internets and traffic information systems, are pull-based, that is, they respond to on-demand user requests. In this paper, we study the scheduling problems of on-demand broadcast environments. Traditionally, the response time of the requests has been used as a performance measure. In this paper, we measure performance as the average cost of a request, composed of three kinds of cost: access time cost, tuning time cost, and the cost of handling failed requests. Our main contribution is a self-adaptive scheduling algorithm named LDFC, which computes the delay cost of each data item and uses it as the broadcast priority. It incurs a lower cost than some previous algorithms in this context and also adapts well, even to pure push-based broadcasts.
7.
8.
9.
Louta M.D. Demestichas P.P. Loutas E.D. Kraounakis S.K. Theologou M.E. Anagnostou M.E. 《Wireless Personal Communications》2003,27(1):57-87
In future broadband fixed wireless access systems the overall design procedure is critical for their successful commercial deployment as well as their efficient operation and management. The problem addressed in this paper is twofold. Specifically, in the first phase the radio access network planning problem is addressed, which aims at finding the minimum-cost configuration of Access Point Transceivers (APTs) given the geographical layout of the area to be covered. In the second phase, the interconnecting planning problem is addressed, which aims at finding the minimum-cost configuration of the Access Point Controllers (APCs) and Inter-Working Units (IWUs) given the Access Point Transceivers layout. Both problems are formally defined, optimally formulated, and solved by computationally efficient heuristics. Finally, results are provided and subsequent conclusions are drawn.
10.
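Design and Speed Optimization of a SHA1 IP (Cited by: 1; self-citations: 0; citations by others: 1)
This paper briefly introduces the basic flow of the SHA1 algorithm and presents a hardware implementation. It focuses on three speed-optimization schemes adopted to raise the operating speed of the IP and, at the end, compares the results of these optimizations, showing that they significantly improve the operating speed of the IP.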
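The basic SHA1 flow the abstract refers to (padding, message schedule expansion, 80 rounds per 512-bit block) can be sketched in software as below. This is only a reference model of the standard algorithm, written for illustration; it says nothing about the paper's hardware design or its three speed optimizations, which the abstract does not detail. The function name `sha1_hex` is an assumption.

```python
import hashlib
import struct


def _rotl(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF


def sha1_hex(message: bytes) -> str:
    """Compute the SHA-1 digest of `message` as a hex string."""
    h = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0]

    # Padding: a 1 bit, zeros up to 56 mod 64 bytes, then the 64-bit length.
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded) % 64) % 64)
    padded += struct.pack(">Q", len(message) * 8)

    for off in range(0, len(padded), 64):
        # Message schedule: 16 input words expanded to 80.
        w = list(struct.unpack(">16I", padded[off:off + 64]))
        for t in range(16, 80):
            w.append(_rotl(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))

        a, b, c, d, e = h
        for t in range(80):
            if t < 20:
                f, k = (b & c) | (~b & d), 0x5A827999
            elif t < 40:
                f, k = b ^ c ^ d, 0x6ED9EBA1
            elif t < 60:
                f, k = (b & c) | (b & d) | (c & d), 0x8F1BBCDC
            else:
                f, k = b ^ c ^ d, 0xCA62C1D6
            temp = (_rotl(a, 5) + f + e + k + w[t]) & 0xFFFFFFFF
            a, b, c, d, e = temp, a, _rotl(b, 30), c, d

        # Fold the working variables back into the running hash state.
        h = [(x + y) & 0xFFFFFFFF for x, y in zip(h, (a, b, c, d, e))]

    return "".join(f"{x:08x}" for x in h)


if __name__ == "__main__":
    # Cross-check against the standard library implementation.
    print(sha1_hex(b"abc") == hashlib.sha1(b"abc").hexdigest())
```

The 80-round inner loop is the part a hardware IP iterates per block, so it is the natural place to look when thinking about speed.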
11.
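Deflecting cavities operate under ultra-high vacuum. The time-varying fields in the cavity deflect the direction of particle motion, and such cavities are widely used in the accelerator field. Depending on their operating conditions, deflecting cavities have either normal-conducting (room-temperature) or superconducting structures. This paper mainly introduces the main types of existing normal-conducting and superconducting deflecting cavities, the historical development of deflecting cavities, and their applications in various fields. Finally, a laboratory at the Institute of High Energy Physics has developed a deflecting cavity operating in the TM210 mode for bunch length measurement; the operating frequency of this deflecting cavity...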
12.
Ali Nermine Philippe Jean-Marc Tain Benoit Coussy Philippe 《Journal of Signal Processing Systems》2022,94(10):945-960
Journal of Signal Processing Systems - The wide landscape of memory-hungry and compute-intensive Convolutional Neural Networks (CNNs) is quickly changing. CNNs are continuously evolving by...
13.
14.
15.
《Spectrum, IEEE》2003,40(1):40-43
For corporations the world over, the tech bubble of the late 1990s was an orgy of excess, which, like all parties that go on too long and involve far too much consumption, ended in a brutal hangover. Information technology (IT) departments simply bought too many servers, storage devices, and PCs in preparation for Y2K, the introduction of the euro, and an e-commerce bonanza that, like an absinthe-induced hallucination, seemed very real at the time, but vanished following the dot-com crash. Overall, the IT market is maturing its way to sustainable, albeit unspectacular, growth. The paper considers how system complexity is driving customers and vendors to seek solace and solutions in software.
16.
The COMMIT Protocol for Truthful and Cost-Efficient Routing in Ad Hoc Networks with Selfish Nodes (Cited by: 1; self-citations: 0; citations by others: 1)
We consider the problem of establishing a route and sending packets between a source/destination pair in ad hoc networks composed of rational selfish nodes whose purpose is to maximize their own utility. In order to motivate nodes to follow the protocol specification, we use side payments that are made to the forwarding nodes. Our goal is to design a fully distributed algorithm such that (1) a node is always better off participating in the protocol execution (individual rationality), (2) a node is always better off behaving according to the protocol specification (truthfulness), (3) messages are routed along the most energy-efficient (least cost) path, and (4) the message complexity is reasonably low. We introduce the COMMIT protocol for individually rational, truthful, and energy-efficient routing in ad hoc networks. To the best of our knowledge, this is the first ad hoc routing protocol with these features. COMMIT is based on the VCG payment scheme in conjunction with a novel game-theoretic technique to achieve truthfulness for the sender node. By means of simulation, we show that the inevitable economic inefficiency is small. As an aside, our work demonstrates the advantage of using a cross-layer approach to solving problems: Leveraging the existence of an underlying topology control protocol, we are able to simplify the design and analysis of our routing protocol and reduce its message complexity. On the other hand, our investigation of the routing problem in the presence of selfish nodes disclosed a new metric under which topology control protocols can be evaluated: the cost of cooperation.
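The VCG payment idea that COMMIT builds on can be sketched for least-cost routing as follows. This is not the COMMIT protocol: it is a centralized toy model that assumes the whole topology is known and that every relay has a single declared forwarding cost. The names `adj`, `cost`, `least_cost_path` and `vcg_payments` are hypothetical. Each relay on the chosen path is paid the cost of the best route that avoids it, minus the cost contributed by the other relays on the chosen path.

```python
import heapq

INF = float("inf")


def least_cost_path(adj, cost, src, dst, excluded=None):
    """Dijkstra over per-node forwarding costs (source and destination are free).

    Returns (total_cost, path), or (INF, None) if dst is unreachable.
    """
    best = {src: 0}
    heap = [(0, src, [src])]
    while heap:
        c, u, path = heapq.heappop(heap)
        if u == dst:
            return c, path
        if c > best.get(u, INF):
            continue  # stale heap entry
        for v in adj[u]:
            if v == excluded:
                continue
            nc = c + (0 if v == dst else cost.get(v, 0))
            if nc < best.get(v, INF):
                best[v] = nc
                heapq.heappush(heap, (nc, v, path + [v]))
    return INF, None


def vcg_payments(adj, cost, src, dst):
    """Return the least-cost path and the VCG payment owed to each relay."""
    total, path = least_cost_path(adj, cost, src, dst)
    if path is None:
        return None, {}
    payments = {}
    for v in path[1:-1]:
        alt, _ = least_cost_path(adj, cost, src, dst, excluded=v)
        # Best cost without v, minus what the other relays on the path cost.
        payments[v] = alt - (total - cost[v])
    return path, payments


if __name__ == "__main__":
    adj = {"S": ["A", "B"], "A": ["S", "C"], "C": ["A", "D"],
           "B": ["S", "D"], "D": ["C", "B"]}
    cost = {"A": 1, "B": 5, "C": 1}
    print(vcg_payments(adj, cost, "S", "D"))
```

In this toy topology the route S, A, C, D costs 2, but each relay is paid 4 because the only alternative costs 5. A payment never falls below the relay's declared cost, and the gap between the total paid and the true cost is the kind of overpayment the abstract calls economic inefficiency.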
17.
18.
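Since the Network Processor (NP) Forum first published its technical goals and timetable in May 2001, engineers on its many committees have raced to draw up best-practice specifications for NP hardware, software, and test standards. Ultimately, the Forum's work will produce a set of interface specifications for NP elements (NPEs) that ease the design burden of next-generation networking and communications equipment and shorten time to market. An NPE is a programmable semiconductor device designed to perform wire-speed processing of protocol data units (PDUs) in data communication or telecommunication systems. A PDU can be a packet, a cell, or any basic unit of the protocol being processed. The NP Forum Hardware Working Group is responsible for defining and delivering the hardware interfaces between NPEs. Two classes of interface are under consideration: · Streaming Interface (SI), used to interconnect processing…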
19.
Designing Fast Fourier Transform Accelerators for Orthogonal Frequency-Division Multiplexing Systems
Waqar Hussain Fabio Garzia Tapani Ahonen Jari Nurmi 《Journal of Signal Processing Systems》2012,69(2):161-171
Designing accelerators for the real-time computation of Fast Fourier Transform (FFT) algorithms for state-of-the-art Orthogonal Frequency-Division Multiplexing (OFDM) demodulators has always been challenging. We have scaled up a template-based Coarse-Grain Reconfigurable Array device for faster FFT processing that generates special-purpose accelerators based on the user input. Using a basic and a scaled-up version, we have generated a radix-4 and a mixed-radix (2, 4) FFT accelerator to process different lengths and types of algorithms. Our implementation results show that these accelerators satisfy the execution time requirements of FFT processing not only for the Single Input Single Output (SISO) wireless standards IEEE-802.11 a/g and 3GPP-LTE but also for the Multiple Input Multiple Output (MIMO) IEEE-802.11n standard.
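To make the radix-4 and mixed-radix (2, 4) terminology concrete, the following is a plain software sketch of a recursive decimation-in-time FFT. It takes radix-4 steps while the length is divisible by 4 and falls back to one radix-2 step otherwise. It is an illustration of the arithmetic only, not a model of the accelerators described above, and the function name `fft_mixed_radix` is an assumption.

```python
import cmath


def fft_mixed_radix(x):
    """Recursive DIT FFT for lengths that are powers of two.

    Uses a radix-4 butterfly when len(x) is divisible by 4 and a
    radix-2 butterfly for a remaining factor of 2.
    """
    n = len(x)
    if n == 1:
        return [complex(x[0])]

    if n % 4 == 0:
        q = n // 4
        # Four sub-transforms over the samples x[r], x[r+4], x[r+8], ...
        f = [fft_mixed_radix(x[r::4]) for r in range(4)]
        out = [0j] * n
        for k in range(q):
            w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor W_n^k
            a = f[0][k]
            b = w * f[1][k]
            c = w * w * f[2][k]
            d = w ** 3 * f[3][k]
            # Radix-4 butterfly: multiplications by +/-1 and +/-j only.
            out[k] = a + b + c + d
            out[k + q] = a - 1j * b - c + 1j * d
            out[k + 2 * q] = a - b + c - d
            out[k + 3 * q] = a + 1j * b - c - 1j * d
        return out

    if n % 2 == 0:
        h = n // 2
        even = fft_mixed_radix(x[0::2])
        odd = fft_mixed_radix(x[1::2])
        out = [0j] * n
        for k in range(h):
            t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
            out[k] = even[k] + t
            out[k + h] = even[k] - t
        return out

    raise ValueError("length must be a power of two")


if __name__ == "__main__":
    # A length-8 input needs one radix-4 stage followed by radix-2 stages.
    print([round(abs(v), 6) for v in fft_mixed_radix([1, 0, 0, 0, 0, 0, 0, 0])])
```

A length such as 64 is handled entirely by radix-4 stages, while a length such as 8 or 128 also needs a radix-2 stage, which is why a mixed-radix design covers more transform sizes.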
20.
Guohui Wang Yingen Xiong Jay Yun Joseph R. Cavallaro 《Journal of Signal Processing Systems》2014,76(3):283-299
In this paper, we present an OpenCL-based heterogeneous implementation of a computer vision algorithm – an image inpainting-based object removal algorithm – on mobile devices. To take advantage of the computation power of the mobile processor, the algorithm workflow is partitioned between the CPU and the GPU based on the profiling results on mobile devices, so that the computationally intensive kernels are accelerated by the mobile GPGPU (general-purpose computing using graphics processing units). By exploring the implementation trade-offs and utilizing the proposed optimization strategies at different levels, including algorithm optimization, parallelism optimization, and memory access optimization, we significantly speed up the algorithm with the CPU-GPU heterogeneous implementation, while preserving the quality of the output images. Experimental results show that heterogeneous computing based on GPGPU co-processing can significantly speed up computer vision algorithms and make them practical on real-world mobile devices.