1.
《Design & Test of Computers, IEEE》1999,16(1):32-41
Computational RAM is a processor-in-memory architecture that makes highly effective use of internal memory bandwidth by pitch-matching simple processing elements to memory columns. Computational RAM can function either as a conventional memory chip or as a SIMD (single-instruction stream, multiple-data stream) computer. When used as a memory, computational RAM is competitive with conventional DRAM in terms of access time, packaging and cost. Adding logic to memory is not a simple question of bolting together two existing designs. The paper considers how computational RAM integrates processing power with memory by using an architecture that preserves and exploits the features of memory.
2.
As a result of the exploding bandwidth demand from the Internet, network router and switch designers are designing and fabricating a growing number of microchips specifically for networking devices rather than traditional computing applications. In particular, a new breed of microprocessors, called Internet processors, has emerged that is designed to efficiently execute network protocols on various types of internetworking devices, including switches, routers, and application-level gateways. We evaluate a series of three progressively more aggressive routing-table cache designs and demonstrate that the incorporation of hardware caches into Internet processors, combined with efficient caching algorithms, can significantly improve overall packet forwarding performance.
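The caching idea can be sketched in a few lines. The toy LRU route cache below (class name, addresses, and the two-entry capacity are all illustrative, not one of the paper's three designs) shows how temporal locality in destination addresses turns slow full-table lookups into fast hits.

```python
from collections import OrderedDict

class RouteCache:
    """Toy LRU cache in front of a full routing table."""
    def __init__(self, capacity, routing_table):
        self.capacity = capacity
        self.table = routing_table          # the full (slow) routing table
        self.cache = OrderedDict()
        self.hits = self.misses = 0

    def lookup(self, dest):
        if dest in self.cache:
            self.hits += 1
            self.cache.move_to_end(dest)    # refresh LRU position
            return self.cache[dest]
        self.misses += 1
        hop = self.table[dest]              # slow full-table lookup on a miss
        self.cache[dest] = hop
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used
        return hop

# A packet stream that revisits one destination benefits immediately.
table = {f"10.0.0.{i}": f"if{i % 4}" for i in range(8)}
cache = RouteCache(capacity=2, routing_table=table)
for dest in ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.1"]:
    cache.lookup(dest)
```

With only two entries, the repeated destination still hits twice in this five-packet stream, which is the locality effect the cache designs exploit.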
3.
David Hemmendinger 《International journal of parallel programming》1989,18(4):241-253
We examine the use of atomic instructions in implementing barriers that do not require previously initialized memory. We show how n identical processes can use uninitialized shared memory to elect a leader that then initializes the shared memory. The processes first use the uninitialized memory to obtain unique identifiers in the range 0 to n−1 and then meet at a barrier. After passing the barrier, the leader initializes the shared memory. When n is not a power of 2, this barrier implementation, a tournament algorithm, avoids extra work by taking advantage of information implicit in the algorithm for obtaining the unique identifiers. The only atomic instruction that we require is one that complements a bit.
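As an illustration of the atomic bit-complement primitive, here is a simplified pairwise election between two threads. It cheats relative to the paper by initializing the bit to 0 (coping with uninitialized memory is precisely the paper's contribution), but it shows why an atomic complement-and-read is enough to pick exactly one winner from a pair.

```python
import threading

class AtomicBit:
    """Emulates a hardware atomic complement-and-read on a single bit."""
    def __init__(self, value=0):
        self._v = value
        self._lock = threading.Lock()

    def complement(self):
        with self._lock:            # the lock stands in for hardware atomicity
            self._v ^= 1
            return self._v

def pairwise_elect(bit, results, my_slot):
    # With the bit starting at 0, exactly one of two complements
    # observes 1; that thread is the winner of this pairing.
    results[my_slot] = (bit.complement() == 1)

bit = AtomicBit(0)
results = [None, None]
threads = [threading.Thread(target=pairwise_elect, args=(bit, results, i))
           for i in range(2)]
for t in threads: t.start()
for t in threads: t.join()
```

Regardless of scheduling, exactly one thread wins; a tournament over log2(n) such rounds elects a single leader among n processes.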
4.
Embedded processors rely on the efficient use of instruction-level parallelism to answer the performance and energy needs of modern applications. Though improving performance is the primary goal for processors in general, it might lead to a negative impact on energy consumption, a particularly critical constraint for current systems. In this paper, we present SoMMA, a software-managed memory architecture for embedded multi-issue processors that can reduce energy consumption and energy-delay product (EDP), while still providing an increase in memory bandwidth. We combine the use of software-managed memories (SMM) with the data cache, and leverage the lower energy access cost of SMMs to provide a processor with reduced energy consumption and EDP. SoMMA also provides better overall performance, as memory accesses can be performed in parallel, with no cost in extra memory ports. Compiler-automated code transformations minimize the programmer's effort to benefit from the proposed architecture. The approach shows average speedups of 1.118x and 1.121x while consuming up to 11% and 12.8% less energy in full-system comparisons of two modified ρVEX processors against their baselines. SoMMA also shows a reduction of up to 41.5% in full-system EDP, while maintaining the same processor area as the baseline processors.
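The energy argument can be made concrete with a back-of-the-envelope model. The per-access energies below are invented for illustration, not SoMMA's measured numbers; the point is only that diverting a fraction of accesses to a cheaper software-managed memory lowers total memory energy.

```python
def memory_energy(accesses, smm_fraction, e_cache=0.50, e_smm=0.20):
    """Toy model: a fraction of accesses is served by a software-managed
    memory (SMM) at a lower per-access energy. The energies (in nJ)
    are illustrative placeholders, not measured values."""
    smm = int(accesses * smm_fraction)
    cache = accesses - smm
    return cache * e_cache + smm * e_smm

baseline = memory_energy(1_000_000, 0.0)   # all accesses hit the cache
with_smm = memory_energy(1_000_000, 0.4)   # 40% redirected to the SMM
saving = 1 - with_smm / baseline
```

Under these made-up numbers, redirecting 40% of accesses cuts memory energy by 24%; the compiler transformations in the paper are what make such a redirection automatic.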
5.
《Micro, IEEE》2004,24(6):118-127
Power is a major problem for scaling the hardware needed to support memory disambiguation in future out-of-order architectures. In current machines, the traditional detection of memory ordering violations requires frequent associative searches of state proportional to the instruction window size. A new class of solutions yields an order-of-magnitude reduction in the energy required to properly order loads and stores for windows of hundreds to thousands of in-flight instructions.
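The costly operation being attacked is easy to state in code. In this naive model (the tuple layout is invented for illustration), every load is compared associatively against every in-flight store, which is exactly the O(window-size) search per load whose energy the new solutions avoid.

```python
def find_violations(stores, loads):
    """Naive memory disambiguation. Each entry is
    (program_order, address, execute_time). A violation occurs when a
    load ran before an earlier store to the same address committed
    its value."""
    violations = []
    for l_ord, l_addr, l_t in loads:
        for s_ord, s_addr, s_t in stores:
            # Associative search: every load checks every in-flight store.
            if s_ord < l_ord and s_addr == l_addr and s_t > l_t:
                violations.append((s_ord, l_ord))
    return violations
```

Here the store at program order 1 executes at time 5, after the dependent load at order 2 already ran at time 3, so the pair is flagged for re-execution.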
6.
Nagm Mohamed Nazeih Botros Mohamad Alweh 《通讯和计算机》2009,6(12):70-73,84
This is a comparative study of cache energy dissipation in Very Long Instruction Word (VLIW) and classical superscalar microprocessors. While architecturally different, the two types are analyzed in this work under the assumption of similar underlying silicon fabrication platforms. The outcomes of the study reveal how energy is used in the cache system of the former, which makes it more appealing for low-power applications than the latter.
7.
An encoding technique exploits application information to reduce power consumption along the instruction memory communication path in embedded processors. Microarchitectural support enables reprogrammability of the encoding transformations to track specific code effectively, and the restriction to functional transformations delivers major power savings. Having reprogrammable hardware also allows flexible, inexpensive switches between transformations.
8.
《Journal of Systems Architecture》2007,53(5-6):272-284
Current rendering processors aim to process triangles as fast as possible, and they tend to be equipped with multiple rasterizers capable of handling a number of triangles in parallel to increase polygon rendering performance. However, such parallel architectures may have a consistency problem when more than one rasterizer tries to access data at the same address. This paper proposes a consistency-free memory architecture for sort-last parallel rendering processors, in which a consistency-free pixel cache architecture is devised and effectively associated with three different memory systems consisting of a single frame buffer, a memory interface unit, and consistency-test units. Furthermore, the proposed architecture can reduce the latency caused by pixel cache misses, because the rasterizer does not wait for miss handling to complete when a pixel cache miss occurs. The experimental results show that the proposed architecture can achieve almost linear speedup up to four rasterizers with a single frame buffer.
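To see why a conflict-free assignment removes the consistency problem: if each frame-buffer address is statically owned by exactly one rasterizer, two rasterizers can never write the same location. The interleaving below is an illustrative static scheme, not the paper's architecture, which instead detects and resolves conflicts dynamically with consistency-test units.

```python
def owning_rasterizer(x, y, n_rasterizers=4):
    """Static pixel interleaving: every screen address maps to exactly
    one rasterizer, so no two rasterizers ever touch the same
    frame-buffer word. (Illustrative scheme, not the paper's.)"""
    return (x + y) % n_rasterizers
```

Adjacent pixels along a scanline land on different rasterizers, which also balances load for large triangles.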
9.
Jim Crammond 《International journal of parallel programming》1988,17(6):497-522
This paper describes a technique for adapting the Morris sliding garbage collection algorithm to execute on parallel machines with shared memory. The algorithm is described within the framework of an implementation of the parallel logic language Parlog. However, the algorithm is a general one and can easily be adapted to parallel Prolog systems and to other languages. The performance of the algorithm executing a few simple Parlog benchmarks is analyzed. Finally, it is shown how the technique for parallelizing the sequential algorithm can be adapted for a semi-space copying algorithm.
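The effect of a sliding collection can be sketched compactly. Morris's algorithm works in place using pointer manipulation; this toy version (list-based, with invented names) shows only the observable outcome: live objects slide toward low addresses, keeping their relative order, and a forwarding table records old-to-new address mappings.

```python
def slide_compact(heap, live):
    """Sketch of a sliding (compacting) collection on a toy heap.
    Live objects keep their relative order and slide toward address 0;
    the forwarding table maps each old address to its new one."""
    forwarding = {}
    new_heap = []
    for addr, obj in enumerate(heap):
        if addr in live:
            forwarding[addr] = len(new_heap)  # new address = next free slot
            new_heap.append(obj)
    return new_heap, forwarding
```

Order preservation is what distinguishes sliding collectors from copying collectors, and it is also what makes the parallel adaptation delicate: concurrently sliding regions must agree on the forwarding addresses.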
10.
《Computer Physics Communications》2007,176(9-10):589-600
As the popularity of using SMP systems as the building blocks for high performance supercomputers increases, so too does the need for applications that can utilize the multiple levels of parallelism available in clusters of SMPs. This paper presents a dual-layer distributed algorithm, using both shared-memory and distributed-memory techniques, to parallelize a very important algorithm (often called the “gold standard”) of computational chemistry: the single and double excitation coupled cluster method with perturbative triples, i.e. CCSD(T). The algorithm is presented within the framework of the GAMESS (General Atomic and Molecular Electronic Structure System) program suite [M.W. Schmidt, K.K. Baldridge, J.A. Boatz, S.T. Elbert, M.S. Gordon, J.J. Jensen, S. Koseki, N. Matsunaga, K.A. Nguyen, S. Su, T.L. Windus, M. Dupuis, J.A. Montgomery, General atomic and molecular electronic structure system, J. Comput. Chem. 14 (1993) 1347–1363] and the Distributed Data Interface (DDI) [M.W. Schmidt, G.D. Fletcher, B.M. Bode, M.S. Gordon, The distributed data interface in GAMESS, Comput. Phys. Comm. 128 (2000) 190]; however, the essential features of the algorithm (data distribution, load balancing and communication overhead) can be applied to more general computational problems. Timing and performance data for our dual-level algorithm are presented on several large-scale clusters of SMPs.
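The outer layer of such a dual-level scheme is essentially a distribution of independent work units over nodes. The round-robin sketch below is a stand-in, not GAMESS/DDI code: it distributes the occupied-orbital pair blocks that dominate the triples work across nodes, after which threads within each node would share a block through shared memory.

```python
def distribute_pairs(n_orbitals, n_nodes):
    """Round-robin static distribution of (i, j) orbital-pair blocks
    across nodes -- a toy stand-in for the paper's distributed-memory
    layer. Only pairs with j <= i are kept (the symmetric half)."""
    pairs = [(i, j) for i in range(n_orbitals) for j in range(i + 1)]
    return {node: pairs[node::n_nodes] for node in range(n_nodes)}
```

Round-robin assignment keeps the per-node counts within one of each other, a crude form of the load balancing the paper treats in depth.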
11.
12.
Kozyrakis C.E. Perissakis S. Patterson D. Anderson T. Asanovic K. Cardwell N. Fromm R. Golbus J. Gribstad B. Keeton K. Thomas R. Treuhaft N. Yelick K. 《Computer》1997,30(9):75-78
Members of the University of California, Berkeley, argue that the memory system will be the greatest inhibitor of performance gains in future architectures. Thus, they propose the intelligent RAM or IRAM. This approach greatly increases the on-chip memory capacity by using DRAM technology instead of much less dense SRAM memory cells. The resultant on-chip memory capacity coupled with the high bandwidths available on chip should allow cost-effective vector processors to reach performance levels much higher than those of traditional architectures. Although vector processors require explicit compilation, the authors claim that vector compilation technology is mature (having been used for decades in supercomputers), and furthermore, that future workloads will contain more heavily vectorizable components.
13.
14.
These days, the once obscure engineering term “DSP” (digital signal processing) is working its way into common use. It has begun to crop up on the labels of an ever wider range of products, from home audio components to answering machines. This is not merely a reflection of a new marketing strategy, however; there truly is more digital signal processing inside today's products than ever before. But why is the market for DSP processors booming? The answer is somewhat circular: as microprocessor fabrication processes have become more sophisticated, the cost of a microprocessor capable of performing DSP tasks has dropped significantly to the point where such a processor can be used in consumer products and other cost sensitive systems. As a result, more and more products have begun using DSP processors, fueling demand for faster, smaller, cheaper, more energy-efficient chips. Although fundamentally related, DSP processors are significantly different from general purpose processors (GPPs) like the Intel Pentium or PowerPC. The authors explain what DSP processors are and what they do. They also offer a guide to evaluating DSP processors for use in a product or application.
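A concrete example of the workload these chips target is the multiply-accumulate inner loop of a finite impulse response (FIR) filter, which DSP processors are built to execute at one tap per cycle. A minimal pure-Python sketch of the same arithmetic:

```python
def fir(signal, taps):
    """FIR filter: each output sample is a weighted sum of the current
    and previous input samples -- the multiply-accumulate kernel that
    defines DSP workloads."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:               # skip samples before the signal starts
                acc += h * signal[n - k]
        out.append(acc)
    return out
```

Feeding in a unit impulse simply reads the tap values back out, which is a handy sanity check for any FIR implementation.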
15.
Trace processors: moving to fourth-generation microarchitectures
This article proposes a new architecture called “trace processors”, which consist of multiple, distributed on-chip processor cores, each of which simultaneously executes a different trace. All but one core executes the traces speculatively, having used branch prediction to select traces that follow the one executing. (Although this architectural concept is similar to multiscalar processors, it does not require explicit compiler support). The authors argue that future processors will rely heavily on replication and hierarchy, and they show how their architecture exploits these concepts.
16.
Speculative multithreaded processors
Speculation will overcome the limitations in dividing a single program into multiple threads that can execute on the multiple logical processing elements needed to enhance performance through parallelization.
17.
Realistic simulations of fluid flow in oil reservoirs have been proven to be computationally intensive. In this work, techniques for solving the large sparse systems of linear equations that arise in simulating these processes are developed for parallel computers such as the Intel iPSC/2 and iPSC/860 hypercubes. The solver is based on a combined multigrid and domain decomposition approach. The algorithm uses line corrections solved with a multigrid method, line Jacobi, and block incomplete domain decomposition as an overall preconditioner for a conjugate gradient-like acceleration method, ORTHOMIN(k). This is shown to be a factor of ten faster on a 32-processor hypercube than widely used sequential solvers. Three test problems, which include implicit wells and faults, are used to validate the results: the first is based on highly heterogeneous two-phase flow, the second on the SPE Third Comparative Solution, and the third on real production compositional data.
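The structure of such a preconditioned Krylov solver can be sketched briefly. The version below uses plain conjugate gradients with a simple Jacobi (diagonal) preconditioner standing in for the far stronger multigrid/domain-decomposition preconditioner and the ORTHOMIN(k) acceleration of the paper; the skeleton (precondition, project, update) is the same.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, x):
    return [dot(row, x) for row in A]

def pcg(A, b, tol=1e-10, max_iter=100):
    """Jacobi-preconditioned conjugate gradients for a small SPD
    system, stored as lists of lists. A sketch of the solver skeleton,
    not the paper's multigrid/ORTHOMIN(k) method."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                                  # residual for x = 0
    z = [r[i] / A[i][i] for i in range(n)]    # apply M^-1 = diag(A)^-1
    p = z[:]
    rz = dot(r, z)
    for _ in range(max_iter):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        if dot(r, r) ** 0.5 < tol:            # converged
            break
        z = [r[i] / A[i][i] for i in range(n)]
        rz_new = dot(r, z)
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x
```

For reservoir matrices, which are nonsymmetric, CG is replaced by a method such as ORTHOMIN(k), and the quality of the preconditioner dominates the iteration count.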
18.
The functional structure of a classical content-addressable memory (CAM) and its realization at the transistor level are described. Some unorthodox CAM approaches are briefly examined. Associative processor systems are discussed, and application-specific CAM architectures to support artificial intelligence features are surveyed. Limitations of associative processing and ways to circumvent them are addressed. The use of parallel cellular logic is considered.
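The defining operation, RAM access inverted (data in, matching addresses out), can be modeled in a few lines; class and method names here are invented for illustration.

```python
class ToyCAM:
    """Content-addressable memory: present a data word, get back every
    address whose stored word matches it."""
    def __init__(self, size):
        self.cells = [None] * size

    def write(self, addr, word):
        self.cells[addr] = word

    def match(self, word):
        # Hardware compares every cell in parallel in one cycle;
        # this loop models that comparison sequentially.
        return [a for a, w in enumerate(self.cells) if w == word]

cam = ToyCAM(4)
cam.write(0, "cat")
cam.write(2, "cat")
cam.write(3, "dog")
```

A search can return several addresses at once, which is why real CAMs need a priority encoder or multiple-response resolver on the match lines.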
19.
A method of formalized design of systolic processors is considered. The proposed method can be applied to synthesize an efficient processor structure for LU-decomposition of symmetrical matrices. Translated from Kibernetika, No. 3, pp. 41–48, May–June 1990.
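The computation such a systolic structure performs is ordinary LU factorization, with each array cell responsible for the updates of one matrix entry. A sequential Doolittle sketch of the same arithmetic (the sequential order, not the systolic schedule):

```python
def lu_decompose(A):
    """Doolittle LU factorization of a small square matrix (lists of
    lists, no pivoting): A = L * U with unit diagonal in L."""
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):        # row i of U
            U[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        for j in range(i + 1, n):    # column i of L
            L[j][i] = (A[j][i] - sum(L[j][k] * U[k][i] for k in range(i))) / U[i][i]
    return L, U
```

For the symmetric matrices targeted by the paper, roughly half of this work is redundant (U is determined by L and the diagonal), which is what a specialized systolic structure can exploit.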
20.
Digit serial data transmission can be used to advantage in the design of special purpose processors where communication issues dominate and where digit pipelining can be used to maintain high data rates. VLSI signal processing applications are one such problem domain. We have developed a family of VLSI components that use digit serial transmission and that can be pipelined at the digit level. These components can be used to construct VLSI processors that are especially suited to signal processing applications. One particularly attractive processor is a structure we call the arithmetic cube. The arithmetic cube can be programmed to solve linear transformations such as convolutions and DFTs, and has nearest-neighbor interconnects, regular layout, simple control, and a limited number of interconnections. Regular layout and simple control derive naturally from the algorithms on which the processor is based. Long wires are eliminated by the nearest-neighbor interconnect. High throughput can be achieved by pipelining the processor at the digit level. The arithmetic cube is programmable in the problem size: once implemented for a certain size N, smaller problems can be solved on the same implementation without a loss in performance. In addition, the architecture extends to larger N in a regular and automatic fashion. This work has been supported in part by the Army Research Office under Contract DAAG29-83-K-0126.
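The digit-level pipelining idea is easiest to see on addition: one digit of each operand enters per cycle, one result digit leaves, and only the carry is held between cycles. A least-significant-digit-first sketch (function name invented):

```python
def digit_serial_add(a_digits, b_digits, base=10):
    """Digit-serial addition, least significant digit first. Each loop
    iteration models one 'cycle': one digit of each operand in, one
    result digit out, with only the carry carried between cycles."""
    out, carry = [], 0
    for a, b in zip(a_digits, b_digits):
        s = a + b + carry
        out.append(s % base)
        carry = s // base
    if carry:
        out.append(carry)    # final carry-out digit
    return out
```

Because each cycle touches only one digit and a carry, the datapath stays narrow, which is what lets digit-serial components be chained and pipelined without long wires.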