
1.

Guest editors' introduction: application-specific microprocessors

Orailoglu A. Veidenbaum A. 《Design & Test of Computers, IEEE》2003,20(1):6-7


2.

Compiler-directed cache management in multiprocessors

Cheong H. Veidenbaum A.V. 《Computer》1990,23(6):39-47

The necessity of finding alternatives to hardware-based cache coherence strategies for large-scale multiprocessor systems is discussed. Three different software-based strategies sharing the same goals and general approach are presented. They consist of a simple invalidation approach, a fast selective invalidation scheme, and a version control scheme. The strategies are suitable for shared-memory multiprocessor systems with interconnection networks and a large number of processors. Results of trace-driven simulations conducted on numerical benchmark routines to compare the performance of the three schemes are presented.
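The abstract names a simple invalidation approach without giving its details. As a rough illustration only, the sketch below assumes a write-through private cache whose contents are discarded wholesale at a synchronization point; the class `SoftwareManagedCache` and its methods are hypothetical and are not taken from the paper.

```python
class SoftwareManagedCache:
    """Toy private cache kept coherent by software, not by hardware snooping."""

    def __init__(self, memory):
        self.memory = memory   # shared backing store (a dict here)
        self.lines = {}        # this processor's private cached copies

    def read(self, addr):
        # Miss: fetch from shared memory. Hit: return the (possibly stale) copy.
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        # Assumption: write-through, so shared memory is always up to date.
        self.lines[addr] = value
        self.memory[addr] = value

    def invalidate_all(self):
        # The call a compiler would insert at a synchronization point.
        self.lines.clear()


memory = {0: 1}
p0, p1 = SoftwareManagedCache(memory), SoftwareManagedCache(memory)
p0.read(0)                 # p0 caches the value 1
p1.write(0, 2)             # p1 updates shared memory
stale = p0.read(0)         # p0 still sees its old copy
p0.invalidate_all()        # compiler-inserted invalidation
fresh = p0.read(0)         # p0 refetches the new value
```

Discarding every line is the bluntest possible policy; the selective invalidation and version control schemes mentioned in the abstract presumably aim to keep more of the cache valid across such points.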

3.

Guest Editor's Introduction

Alex Veidenbaum 《International journal of parallel programming》2001,29(5):461-462


4.

On Interaction between Interconnection Network Design and Latency Hiding Techniques in Multiprocessors

Kim Sunil Veidenbaum Alexander V. 《The Journal of supercomputing》2000,16(3):197-216

Latency hiding techniques are increasingly used to minimize the effect of a long memory latency in multiprocessors. Their use requires additional network bandwidth. The network organization and its design parameters alone can significantly affect performance. With latency hiding, system performance depends on how well the interconnection network can support the use of such techniques and their interaction with network organization. This paper investigates these issues for prefetching and weak consistency in a 128-processor shared-memory system with either a 2-D torus, a multistage, or a single-stage network. The performance impact of network organization and the link bandwidth, with and without the use of latency hiding techniques is shown. The effect of caching and of limiting the number of outstanding memory requests is shown. Multistage is the most robust network and has the best performance under all conditions. Single-stage network is very close in performance when sufficient channel bandwidth is available. Torus network comes in last when channel bandwidth is high, but can exceed single stage performance when it is low. The relative performance of the three networks with prefetching remains similar, with torus gaining the most. Benchmark execution time can decrease by as much as 25% with prefetching. Further gains depend on reducing the effect of write traffic. Finally, the existence of an optimal number of outstanding requests is shown but the value is program-dependent.

5.

An Integrated Hardware/Software Data Prefetching Scheme for Shared-Memory Multiprocessors

Edward H. Gornish Alexander Veidenbaum 《International journal of parallel programming》1999,27(1):35-70

Both hardware and software prefetching have been shown to be effective in tolerating the large memory latencies inherent in shared-memory multiprocessors; however, both types of prefetching have their shortcomings. While software schemes require less hardware support than hardware schemes, they must generate address calculation instructions and a prefetch instruction for each datum that needs to be prefetched. Hardware schemes, however, must become progressively more complex to be able to compute data access strides and to increase the prefetching lookahead. In this paper, we propose an integrated hardware/software prefetching method that uses simple hardware that can handle most data accesses and software prefetching for the few remaining accesses. A compile time algorithm analyzes the access streams formed by array references and determines sequences of consecutive memory accesses to an access stream that can be prefetched by the hardware mechanism. This analysis is based on the relative memory locations of consecutive accesses to an access stream and the number of intervening data references between consecutive accesses to an access stream. In addition, the prefetching lookahead can be set separately for each access stream. Our approach yields an effective scheme that minimizes both CPU overhead and hardware costs. Execution-driven simulations show our method to be very effective.
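To make the hardware/software split concrete, here is a minimal sketch that assigns an access stream to a hardware or software prefetcher. It is not the paper's compile-time algorithm: it looks only at the stride between consecutive addresses and ignores the number of intervening references, and the names `constant_stride`, `choose_prefetcher`, and the `max_hw_stride` limit are assumptions made for illustration.

```python
from typing import List, Optional


def constant_stride(addresses: List[int]) -> Optional[int]:
    """Return the common stride of an access stream, or None if it is irregular."""
    if len(addresses) < 3:
        return None  # too short to establish a pattern
    strides = {b - a for a, b in zip(addresses, addresses[1:])}
    return strides.pop() if len(strides) == 1 else None


def choose_prefetcher(addresses: List[int], max_hw_stride: int = 64) -> str:
    """Pick "hardware" for small constant strides, "software" otherwise.

    max_hw_stride is a hypothetical limit on what simple hardware can follow.
    """
    stride = constant_stride(addresses)
    if stride is not None and 0 < abs(stride) <= max_hw_stride:
        return "hardware"
    return "software"


regular = choose_prefetcher([0, 8, 16, 24])      # unit-stride array walk
irregular = choose_prefetcher([0, 8, 40, 44])    # no single stride
too_wide = choose_prefetcher([0, 1024, 2048])    # constant but large stride
```

In this toy version only the regular walk is left to hardware; the other two streams fall back to explicit software prefetches.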

6.

Guest Editor's Introduction

Veidenbaum Alex 《International journal of parallel programming》2002,30(4):223-224


7.

Guest Editor's Introduction: Application-Specific Processors

Veidenbaum A. 《Micro, IEEE》2004,24(3):8-9


8.

Power-Aware Compilation for Register File Energy Reduction

Ayala José L. Veidenbaum Alexander López-Vallejo Marisa 《International journal of parallel programming》2003,31(6):451-467

Most power reduction techniques have focused on gating the clock to unused functional units to minimize static power consumption, while system level optimizations have been used to deal with dynamic power consumption. Once these techniques are applied, register file power consumption becomes a dominant factor in the processor. This paper proposes a power-aware reconfiguration mechanism in the register file driven by a compiler. Optimal usage of the register file in terms of size is achieved and unused registers are put into a low-power state. Total energy consumption in the register file is reduced by 65% with no appreciable performance penalty for MiBench benchmarks on an embedded processor. The effect of reconfiguration granularity on energy savings is also analyzed, and the compiler approach to optimize energy results is presented.

9.

Brain Derived Vision Algorithm on High Performance Architectures

Jayram Moorkanikara Nageswaran Andrew Felch Ashok Chandrasekhar Nikil Dutt Richard Granger Alex Nicolau Alex Veidenbaum 《International journal of parallel programming》2009,37(4):345-369

Even though computing systems have increased the number of transistors, the switching speed, and the number of processors, most programs exhibit limited speedup due to the serial dependencies of existing algorithms. Analysis of intrinsically parallel systems such as brain circuitry has led to the identification of novel architecture designs, and also new algorithms that can exploit the features of modern multiprocessor systems. In this article we describe the details of a brain derived vision (BDV) algorithm that is derived from the anatomical structure and physiological operating principles of thalamo-cortical brain circuits. We show that many characteristics of the BDV algorithm lend themselves to implementation on the IBM CELL architecture, and yield impressive speedups that equal or exceed the performance of specialized solutions such as FPGAs. Mapping this algorithm to the IBM CELL is non-trivial, and we suggest various approaches to deal with parallelism, task granularity, communication, and memory locality. We also show that a cluster of three PS3s (or more) containing IBM CELL processors provides a promising platform for brain derived algorithms, exhibiting speedup of more than 140× over a desktop PC implementation, and thus enabling real-time object recognition for robotic systems.

10.

Combining flow and dependence analyses to expose redundant array accesses

Elana D. Granston Alexander V. Veidenbaum 《International journal of parallel programming》1995,23(5):423-470

The success of large-scale, hierarchical and distributed shared memory systems hinges on our ability to reduce delays resulting from remote accesses to shared data. To facilitate this, we present a compile-time algorithm for analyzing programs with doall-style parallelism to determine when read and write accesses to shared data are redundant (unnecessary). Once identified, redundant remote accesses can be replaced by local accesses or eliminated entirely. This optimization improves program performance in two ways. First, slow memory accesses are replaced by faster ones. Second, the time to perform other remote memory accesses may be reduced as a result of the decreased traffic level. We also show how the information obtained through redundancy analysis can be used for other compiler optimizations such as prefetching and cache management.
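The idea of a redundant remote read can be illustrated with a much simpler analysis than the one the paper describes. The sketch below scans the straight-line access sequence of a single doall iteration and flags reads whose value is already held locally, assuming no other iteration writes the same location in the meantime. The function `find_redundant_reads` and the trace format are hypothetical, and the paper's combined flow and dependence analysis is not reproduced here.

```python
from typing import List, Tuple


def find_redundant_reads(accesses: List[Tuple[str, str]]) -> List[int]:
    """Return indices of reads whose value is already available locally.

    accesses is a list of (kind, location) pairs, kind being "r" or "w".
    Assumption: the sequence belongs to one doall iteration and no other
    iteration writes these locations while it runs.
    """
    available = set()   # locations whose current value is held locally
    redundant = []
    for index, (kind, location) in enumerate(accesses):
        if kind == "r":
            if location in available:
                redundant.append(index)   # could be served by a local copy
            available.add(location)
        elif kind == "w":
            available.add(location)       # the written value is held locally
        else:
            raise ValueError(f"unknown access kind: {kind!r}")
    return redundant


trace = [("r", "A[0]"), ("r", "A[0]"), ("w", "B[1]"), ("r", "B[1]"), ("r", "A[1]")]
redundant_reads = find_redundant_reads(trace)
```

Here the second read of `A[0]` and the read of `B[1]` after its write are the ones flagged, since both values are already held locally.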