Design and evaluation of a hierarchical decoupled architecture
Authors: Won W. Ro, Stephen P. Crago, Alvin M. Despain, Jean-Luc Gaudiot
Affiliations: (1) Department of Electrical and Computer Engineering, California State University, Northridge; (2) Information Sciences Institute-East, University of Southern California, California; (3) Department of Electrical Engineering, University of Southern California, California; (4) Department of Electrical Engineering and Computer Science, University of California, Irvine
Abstract: The speed gap between the processor and main memory is the major performance bottleneck of modern computer systems. As a result, today's microprocessors suffer frequent cache misses and lose many CPU cycles to pipeline stalls. Although traditional data prefetching methods considerably reduce the number of cache misses, most of them rely strongly on the predictability of future accesses and often fail when memory accesses exhibit little locality.
To address the long latency of current memory systems, this paper presents the design and evaluation of our high-performance decoupled architecture, the HiDISC (Hierarchical Decoupled Instruction Stream Computer). The design is motivated by the traditional decoupled architecture concept and its limitations. The HiDISC approach adds a prefetching processor on top of a traditional access/execute architecture. Our design aims to provide low memory access latency by separating and decoupling otherwise sequential code into three streams and executing each stream on its own dedicated processor. The three streams act in concert to mask long access latencies by delivering the necessary data to the upper levels of the memory hierarchy in time. This is achieved by separating the access-related instructions from the main computation and running them early enough on the two dedicated access processors.
A detailed hardware design and performance evaluation were carried out by developing an architectural simulator and compilation tools. Our performance results show that the proposed HiDISC model reduces cache misses by 19.7% and improves overall IPC (Instructions Per Cycle) by 15.8%. With a slower memory model that assumes a 200-cycle memory access latency, HiDISC improves performance by 17.2%.
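As a rough illustration of the access/execute decoupling idea summarized above (this sketch is ours, not from the paper; the function names and the single-threaded ring buffer are hypothetical stand-ins for the decoupled processors and their hardware queue), an access stream can issue the cache-miss-prone loads ahead of an execute stream that only computes on already-fetched operands:

```c
/* Hypothetical single-threaded sketch of access/execute decoupling.
   In a decoupled architecture such as HiDISC, the two loops below
   would run concurrently on separate dedicated processors connected
   by a hardware queue; here a plain ring buffer stands in for it. */

#define N     8
#define QSIZE 16

static double queue[QSIZE];
static int head = 0, tail = 0;

/* "Access stream": performs the (potentially cache-missing)
   indirect loads and pushes operands into the queue ahead of time. */
static void access_stream(const double *a, const int *idx, int n) {
    for (int i = 0; i < n; i++)
        queue[tail++ % QSIZE] = a[idx[i]];
}

/* "Execute stream": pure computation on operands the access stream
   has already fetched, so it does not stall on memory itself. */
static double execute_stream(int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += queue[head++ % QSIZE];
    return sum;
}
```

In the real architecture the queue is a hardware structure and the streams run concurrently, so by the time the execute processor consumes an operand, its load was issued many cycles earlier.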
Keywords: Decoupled architectures; Memory latency hiding; Multithreading; Parallel architecture; Instruction-level parallelism; Data prefetching; Speculative execution