Found 20 similar documents (search time: 15 ms)
1.
Asano T., Silberman J., Dhong S.H., Takahashi O., White M., Cottier S., Nakazato T., Kawasumi A., Yoshihara H. 《Micro, IEEE》2005,25(5):30-38
The synergistic processor element is a new architecture oriented toward multimedia and streaming processing. In this architecture, the memory is not a cache but a private, or scratch-pad, memory. Such a memory is simple, but it must combine high frequency and large capacity with low power. This design uses an 11 fan-out-of-four (11FO4), six-cycle, fully pipelined, embedded 256-Kbyte SRAM for this purpose. The design's memory is not one hard macro, but a group of custom macros physically distributed to optimize the pipeline.
2.
Stasiak D., Chaudhry R., Cox D., Posluszny S., Warnock J., Weitzel S., Wendel D., Wang M. 《Micro, IEEE》2005,25(6):71-78
Power consumption is a major challenge in VLSI design. Power-constrained designs must attack power reduction with many techniques and require tools to accurately predict the power consumption. These tools give designers feedback on the efficiency of the power management logic. We present the basic methodology behind cycle-accurate power estimation. This forms a basis for explaining the techniques used to reduce power in the first-generation Cell processor, along with data that correlates our hardware measurements against power estimates.
3.
《Parallel Computing》2007,33(10-11):720-740
The Sony–Toshiba–IBM Cell Broadband Engine (Cell/B.E.) is a heterogeneous multicore architecture that consists of a traditional microprocessor (PPE) with eight SIMD co-processing units (SPEs) integrated on-chip. While the Cell/B.E. processor is architected for multimedia applications with regular processing requirements, we are interested in its performance on problems with non-uniform memory access patterns. In this article, we present two case studies to illustrate the design and implementation of parallel combinatorial algorithms on Cell/B.E.: we discuss list ranking, a fundamental kernel for graph problems, and zlib, a data compression and decompression library. List ranking is a particularly challenging problem to parallelize on current cache-based and distributed memory architectures due to its low computational intensity and irregular memory access patterns. To tolerate memory latency on the Cell/B.E. processor, we decompose work into several independent tasks and coordinate computation using the novel idea of Software-Managed threads (SM-Threads). We apply this generic SPE work-partitioning technique to efficiently implement list ranking, and demonstrate substantial speedup in comparison to traditional cache-based microprocessors. For instance, on a 3.2 GHz IBM QS20 Cell/B.E. blade, for a random linked list of 1 million nodes, we achieve an overall speedup of 8.34 over a PPE-only implementation. Our second case study, zlib, is a data compression/decompression library that is extensively used in both scientific as well as general purpose computing. The core kernels in the zlib library are the LZ77 longest subsequence matching algorithm and Huffman data encoding. We design efficient parallel algorithms for these combinatorial kernels, and exploit concurrency at multiple levels on the Cell/B.E. processor. We also present a Cell/B.E. optimized implementation of gzip, a popular file-compression application based on the zlib library. For our Cell/B.E. implementation of gzip, we achieve an average speedup of 2.9 in compression over current workstations.
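For readers unfamiliar with the list ranking kernel named in the abstract above, the following is a minimal sketch of the problem using pointer jumping (Wyllie's algorithm), a textbook parallel formulation. It is not the SM-Threads technique from the paper, whose details the abstract does not give. The function name `list_rank` and the convention that the tail node points to itself are assumptions made for this illustration.

```python
def list_rank(succ):
    """Return rank[i] = number of links from node i to the tail.

    succ[i] is the index of node i's successor; the tail points to itself
    (an assumed convention for this sketch).
    """
    n = len(succ)
    nxt = list(succ)
    # A node one hop from its successor starts with rank 1; the tail with 0.
    rank = [0 if nxt[i] == i else 1 for i in range(n)]
    # Each round doubles the distance covered, so about log2(n) rounds suffice.
    rounds = max(1, (n - 1).bit_length())
    for _ in range(rounds):
        # Every element of these two lists is independent of the others,
        # which is what makes each round parallelizable.
        new_rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        new_nxt = [nxt[nxt[i]] for i in range(n)]
        rank, nxt = new_rank, new_nxt
    return rank


# The list 3 -> 1 -> 0 -> 2 -> 4 stored in scattered array positions.
print(list_rank([2, 0, 4, 1, 4]))  # [2, 3, 1, 4, 0]
```

The indirect reads `rank[nxt[i]]` and `nxt[nxt[i]]` are the irregular memory accesses the abstract refers to: consecutive list nodes can sit anywhere in the array.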
4.
《Journal of Microcomputer Applications》1993,16(1):1-17
This paper deals with the RISC design and the first-stage evaluation of the UAL processor. UAL stands for Universal Assembly Language. The design and the implementation of the UAL processor will support and speed up the operation of the UAL-E automated environment. UAL-E is used as an environment for the translation and evaluation of assembly programs. The UAL processor is designed around a superscalar, pipelined configuration, and the UAL evaluation is based on a set of primitive functions such as procedure call, interrupt service, sorting, search and max, matrix multiplication, etc. Results from the first-stage evaluation of UAL in comparison with other RISC processors are also provided.
5.
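Design of an FPGA-based IFFT processor (total citations: 1; self-citations: 0; citations by others: 1)
By analyzing the Cooley-Tukey FFT algorithm, the corresponding IFFT algorithm is derived. The FFT function is customized using the FFT MegaCore IP core for Altera's Cyclone II family of FPGA devices, and the implementation is verified using the Quartus II and VHDL development tools.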
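The abstract above does not state how the IFFT is derived from the FFT. One common route, shown here as a hedged sketch and not as the authors' method, is the conjugation identity

IFFT(X) = conj(FFT(conj(X))) / N,

which lets a single forward-FFT block serve both directions. The function names `fft` and `ifft` and the recursive radix-2 structure are assumptions chosen for illustration; a hardware design would use an iterative, pipelined form.

```python
import cmath


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])  # FFT of even-indexed samples
    odd = fft(x[1::2])   # FFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor W_n^k applied to the odd half (butterfly).
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft(X):
    """Inverse FFT built from the forward FFT via conjugation."""
    n = len(X)
    y = fft([complex(v).conjugate() for v in X])
    return [v.conjugate() / n for v in y]


samples = [1, 2, 3, 4]
recovered = ifft(fft(samples))
print([round(v.real, 6) for v in recovered])
```

The round trip should reproduce the input samples up to floating-point error.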
6.
A real-time software platform for the Cell processor (total citations: 1; self-citations: 0; citations by others: 1)
Scalability, efficiency, and programmability are essential for using the Cell processor in consumer electronics. A real-time resource scheduler virtualizes the processor cores and ensures the application of real-time constraints at the system level. These features let the platform control resource usage and help exploit the power management features implemented in the Cell processor.
7.
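Pipeline design of a stack-based Java processor (total citations: 1; self-citations: 1; citations by others: 1)
Targeting the characteristics of current embedded systems, a stack-based Java microprocessor core with a four-stage pipeline is designed. Dual-port RAM serves as the Java stack, reducing the consumption of storage resources. Most simple arithmetic/logic instructions of the Java Virtual Machine (JVM) are executed directly in hardware within one clock cycle; moderately complex instructions are completed over several clock cycles through microcode emulation; and a hardware trap mechanism supports software emulation of the JVM's very complex and object-oriented instructions. The instruction implementation method can be chosen flexibly to balance hardware resources against execution efficiency, which eases porting the Java processor to FPGAs.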
8.
To provide a variety of new and advanced communications services, computer networks are required to perform increasingly complex packet processing. This processing typically takes place on network routers and their associated components. An increasingly central component in router design is a chip-multiprocessor (CMP) referred to as "network processor" or NP. In addition to multiple processors, NPs have multiple forms of on-chip memory, various network and off-chip memory interfaces, and other specialized logic components such as CAMs (content addressable memories). The design space for NPs (e.g., number of processors, caches, cache sizes, etc.) is large due to the diverse workload, application requirements, and system characteristics. System design constraints relate to the maximum chip area and the power consumption that are permissible while achieving defined line rates and executing required packet functions. In this paper, an analytic performance model that captures the processing performance, chip area, and power consumption for a prototypical NP is developed and used to provide quantitative insights into system design trade-offs. The model, parameterized with a networking application benchmark, provides the basis for the design of a scalable, high-performance network processor and presents insights into how best to configure the numerous design elements associated with NPs.
9.
The Monarch architecture team took advantage of custom VLSI in the design of a shared-memory parallel processor. The simple structure eases the task of programming a massively parallel machine.
10.
This paper quantifies and compares the performance impacts of memory latencies and finite bandwidth. We show that the implementation of aggressive latency tolerance techniques aggravates stalls due to finite memory bandwidth, which actually become more significant than stalls resulting from uncongested memory latency alone. We expect that memory bandwidth limitations across the processor pins will drive significant architectural change. An execution-driven simulation measures the time that several SPEC95 benchmarks spend stalled for memory latency, stalled for limited memory bandwidth, and computing.
11.
《Computer》1998,31(1):39-48
Chip architects from Sun, Cyrix, Motorola, Mips, Intel and Digital see challenges rather than walls in microprocessor design. They share their insights in this virtual roundtable. Tremblay discusses the conflicting goals of improving how much work a processor does per cycle and at the same time shortening the cycle time. Grohoski says we need to reduce the processor complexity to spend less time debugging that complexity. Burgess thinks tightly interwoven designs will better support focused applications. Killian is confident the industry will solve foreseeable problems. He sees “big data” problems as key design drivers. Colwell sees a convergence of factors that make validation a big concern. He foresees future computers as communication enhancement devices. Rubinfeld names five issues as important to processor design and discusses some challenges specific to high-speed processor design. Despite the competitiveness of their field, these six architects shared several insights of interest to those not intimately connected with processor design.
12.
We present the design and implementation of a parallel exact inference algorithm on the Cell Broadband Engine (Cell BE) processor, a heterogeneous multicore architecture. Exact inference is a key problem in exploring probabilistic graphical models, where the computation complexity increases dramatically with the network structure and clique size. In this paper, we exploit parallelism in exact inference at multiple levels. We propose a rerooting method to minimize the critical path for exact inference, and an efficient scheduler to dynamically allocate SPEs. In addition, we explore potential table representation and layout to optimize DMA transfer between local store and main memory. We implemented the proposed method and conducted experiments on the Cell BE processor in the IBM QS20 Blade. We achieved speedups of up to 10× on the Cell, compared to state-of-the-art processors. The methodology proposed in this paper can be used for online scheduling of directed acyclic graph (DAG) structured computations.
13.
This paper examines the design of a special-purpose digital processor targeted for linear-regulator control system implementation, and monolithic microelectronic fabrication. In order to improve the computational dynamic range, accuracy, and speed of this digital control processor (DCP), a logarithmic arithmetic is selected. This selection is based upon an examination of DCP data wordlength requirements which are in turn estimated from computational dynamic range and accuracy considerations. DCP arithmetic implementation and architecture are also examined, and the selection of a logarithmic arithmetic is seen to be advantageous from these viewpoints. Finally, contemporary microcomputer technology is used to estimate DCP computational speed. When compared to microcomputers, the proposed DCP appears to allow a computational speed increase in excess of 300-fold while maintaining an equivalent computational accuracy and dynamic range.
14.
Language processor generators are systems that produce various language processors (including compilers) on the basis of a high-level specification. The design of language processor generators is discussed on the basis of experiments with a traditional compiler writing system (HLP78) employing pure LALR parsing and general attribute grammars. It is argued that these methods are too primitive from the practical point of view. The design of a new language processor generator, HLP84, is based on this view. This system is an attempt to provide high-level tools for a restricted class of applications (one-pass analysis). The syntactic facilities include regular expressions on the right-hand sides of productions, a disambiguating mechanism that is integrated with regular expressions, and a mechanism for using semantic information to aid parsing. The semantic facilities include automatic support for semantic error handling and for symbol tables. Early experiences with the new system show that in spite of the general overhead caused by the higher automation level, the system allows the generation of reasonably efficient processors.
15.
How multimedia workloads will change processor design (total citations: 1; self-citations: 0; citations by others: 1)
Workloads drive architecture design and will change in the next two decades. For high-performance, general-purpose processors, there is a consensus that multimedia will continue to grow in importance. The authors predict these processors will incorporate more media processing capabilities, eventually bringing about the demise of specialized media processors, except, perhaps, in embedded applications. These enhanced general-purpose processor capabilities will arise from multimedia applications that require real-time response, continuous-media data types and significant fine-grained data parallelism.
16.
17.
《Computers & Structures》1986,24(4):625-635
Linear and nonlinear finite element software development considerations for vector processors are presented. Areas of discussion include performance measurement, data management, element level calculations and nonlinear problem solution. An example problem which demonstrates software performance is also presented. Incorporation of the methods presented in this paper can lead to finite element software which requires approximately one-tenth the CPU time and as little as one-hundredth the I/O effort of conventional software.
18.
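A low-power cache design scheme (LPD) is proposed for the shared level-2 cache (L2 cache) of multicore processors. LPD reduces cache power while maintaining good system performance through three algorithms: a low-power hybrid partitioning algorithm for the shared cache (LPHP), a cache reconfiguration algorithm (CRA), and a way-prediction algorithm based on cache partitioning (WPP-L2). In LPHP and CRA, idle cache columns are switched off dynamically at run time, saving the access power those columns would otherwise consume. In WPP-L2, way prediction supplies the predicted way before the cache access; on a correct prediction the access completes with the shortest latency and the least power, and on a misprediction the cache partitioning policy is used to reduce the extra power overhead caused by the misprediction. Evaluated with the SPEC2000 benchmarks against a conventional shared L2 cache with least-recently-used (LRU) replacement, the three algorithms slightly affect program execution time but save 20.5%, 17%, and 64.6% of average L2 cache access power, respectively, and even improve system throughput. The experiments show that the proposed methods can significantly reduce multicore processor power while preserving system performance.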
19.
王帅涛 《计算机测量与控制》2019,27(2):229-232
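To process multi-target echo information, a four-channel high-speed-sampling signal processor is designed. A digital processing platform built around a K7 chip is implemented together with multiple communication interfaces. The echo signal is digitized by a high-speed ADC, and intra-pulse Barker code modulation improves range resolution while preserving target velocity resolution. Simplified digital down-conversion improves resource utilization, and the multi-target identification process is simulated. Tests of the processor show a static noise of 1.22 mV and a dynamic range of 64 dB.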
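To illustrate why intra-pulse Barker code modulation sharpens range resolution, the sketch below correlates a coded pulse with its own code (a matched filter). The abstract does not say which Barker length is used; the 13-element code, the name `matched_filter`, and the pure-Python correlation loop are assumptions made for this example, not the processor's actual implementation.

```python
# Length-13 Barker code (an assumed choice; the abstract gives no length).
BARKER_13 = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]


def matched_filter(signal, code):
    """Full cross-correlation of signal with code, one output per lag."""
    n, m = len(signal), len(code)
    out = []
    for lag in range(-(m - 1), n):
        acc = 0
        for j in range(m):
            i = lag + j
            if 0 <= i < n:  # samples outside the signal count as zero
                acc += signal[i] * code[j]
        out.append(acc)
    return out


acf = matched_filter(BARKER_13, BARKER_13)
peak_index = len(BARKER_13) - 1  # zero-lag position in the full correlation
print(acf[peak_index])           # main-lobe height: 13
print(max(abs(v) for i, v in enumerate(acf) if i != peak_index))  # sidelobes: 1
```

The compressed pulse concentrates the energy of 13 chips into a single-chip-wide peak with sidelobes no larger than 1, so two echoes closer together than the full pulse length can still be separated after filtering.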
20.
The implementation of a proof-of-concept Lattice Quantum Chromodynamics kernel on the Cell processor is described in detail, illustrating issues encountered in the porting process. The resulting code performs up to 45 GFlop/s per socket (without inter-node parallel communications), indicating that the Cell processor is likely to be a good platform for future Lattice QCD calculations.