1.

Alloyed Branch History: Combining Global and Local Branch History for Robust Performance

Zhijian Lu John Lach Mircea R. Stan Kevin Skadron 《International journal of parallel programming》2003,31(2):137-177

This paper introduces alloyed prediction, a new hardware-based two-level branch predictor organization that combines global and local history in the same structure, combining the advantages of current two-level predictors with those of hybrid predictors. The alloyed organization is motivated by measurements showing that wrong-history mispredictions are even more important than conflict-induced mispredictions. Wrong-history mispredictions arise because current two-level, history-based predictors provide only global or only local history. The contribution of wrong history to the overall misprediction rate is substantial because most programs have some branches that require global history and others that require local history. This paper explores several ways to implement alloyed prediction, including the previously proposed bi-mode organization. Simulations show that mshare is the best alloyed organization among those we examine, and that mshare gives reliably good prediction compared to bimodal (two-bit), two-level, and hybrid predictors. The robust performance of alloying across a range of predictor sizes stems from its ability to attack wrong-history mispredictions at even very small sizes without subdividing the branch prediction hardware into smaller and less effective components.
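As a rough illustration of what "global and local history in the same structure" can mean, the sketch below indexes a single table of two-bit counters with global history bits, local history bits, and branch-address bits concatenated together. It is a toy model written under assumptions, not the paper's mshare design: the class name `AlloyedPredictor`, the bit widths, the table sizes, and the plain concatenation of fields are all hypothetical choices made for this example.

```python
class AlloyedPredictor:
    """Toy two-level predictor whose table index mixes global and local history.

    All sizes below are illustrative assumptions, not values from the paper.
    """

    def __init__(self, global_bits=4, local_bits=4, pc_bits=4, bht_entries=64):
        self.global_bits = global_bits
        self.local_bits = local_bits
        self.pc_bits = pc_bits
        self.ghr = 0                      # global history register
        self.bht = [0] * bht_entries      # per-branch local history registers
        size = 1 << (global_bits + local_bits + pc_bits)
        self.pht = [1] * size             # two-bit counters, start weakly not-taken

    def _index(self, pc):
        # Concatenate global history, local history, and low address bits.
        local = self.bht[pc % len(self.bht)]
        pc_part = pc & ((1 << self.pc_bits) - 1)
        return (((self.ghr << self.local_bits) | local) << self.pc_bits) | pc_part

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken".
        return self.pht[self._index(pc)] >= 2

    def update(self, pc, taken):
        # Train the counter selected by the current histories.
        i = self._index(pc)
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        # Shift the outcome into both history registers.
        bit = 1 if taken else 0
        slot = pc % len(self.bht)
        self.bht[slot] = ((self.bht[slot] << 1) | bit) & ((1 << self.local_bits) - 1)
        self.ghr = ((self.ghr << 1) | bit) & ((1 << self.global_bits) - 1)
```

Because every lookup sees both kinds of history, a branch that needs local history and a branch that needs global history can share one table instead of being split across separate component predictors.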

2.

Trellis: Portability across architectures with a high-level framework

Lukasz G. Szafaryn Todd Gamblin Bronis R. de Supinski Kevin Skadron 《Journal of Parallel and Distributed Computing》2013

The increasing computational needs of parallel applications inevitably require portability across parallel architectures, which now include heterogeneous processing resources, such as CPUs and GPUs, and multiple SIMD/SIMT widths. However, the lack of a common parallel programming paradigm that provides predictable, near-optimal performance on each resource leads to the use of low-level frameworks with architecture-specific optimizations, which in turn cause the code base to diverge and make porting difficult. Our experiences with parallel applications and frameworks lead us to the conclusion that achieving performance portability requires a common set of high-level directives and efficient mapping onto each architecture.

3.

HotSpot: a compact thermal modeling methodology for early-stage VLSI design (total citations: 3; self-citations: 0; citations by others: 3)

Wei Huang Ghosh S. Velusamy S. Sankaranarayanan K. Skadron K. Stan M.R. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2006,14(5):501-513

This paper presents HotSpot, a modeling methodology for developing compact thermal models based on the popular stacked-layer packaging scheme in modern very large-scale integration systems. In addition to modeling silicon and packaging layers, HotSpot includes a high-level on-chip interconnect self-heating power and thermal model so that thermal impacts on interconnects can also be considered during early design stages. The HotSpot compact thermal modeling approach is especially well suited for pre-register-transfer-level (pre-RTL) and pre-synthesis thermal analysis, and it provides detailed static and transient temperature information across the die and the package while remaining computationally efficient.

4.

Challenges in computer architecture evaluation

Skadron K. Martonosi M. August D.I. Hill M.D. Lilja D.J. Pai V.S. 《Computer》2003,36(8):30-36

We focus on problems suited to the current evaluation infrastructure. The current limitations and trends in evaluation techniques are troublesome and could noticeably slow the rate of computer system innovation. New research has been recommended to help make quantitative evaluations of computer systems manageable. We support research in the areas of simulation frameworks, benchmarking methodologies, analytic methods, and validation techniques.

5.

Interconnect Lifetime Prediction for Reliability-Aware Systems

Zhijian Lu Wei Huang Stan M.R. Skadron K. Lach J. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2007,15(2):159-172

Thermal effects are becoming a limiting factor in high-performance circuit design due to the strong temperature dependence of leakage power, circuit performance, IC package cost, and reliability. While many interconnect reliability models assume a constant temperature, this paper analyzes the effects of temporal and spatial thermal gradients on interconnect lifetime in terms of electromigration, and presents a physics-based dynamic reliability model which returns reliability equivalent temperature and current density that can be used in traditional reliability analysis tools. The model is verified with numerical simulations and reveals that blindly using the maximum temperature leads to overly pessimistic lifetime estimates. Therefore, the proposed model not only increases the accuracy of reliability estimates, but also enables designers to reclaim design margin in reliability-aware design. In addition, the model is useful for improving the performance of temperature-aware runtime management by modeling system lifetime as a resource to be consumed at a stress-dependent rate.

6.

Temperature-aware computer systems: Opportunities and challenges (total citations: 1; self-citations: 0; citations by others: 1)

Skadron K. Stan M.R. Wei Huang Velusamy S. Sankaranarayanan K. Tarjan D. 《Micro, IEEE》2003,23(6):52-61

Temperature-aware design techniques have an important role to play in addition to traditional techniques like power-aware design and package- and board-level thermal engineering. The authors define the role of architecture techniques and describe HotSpot, an accurate yet fast thermal model suitable for computer architecture research.

7.

Parameterized physical compact thermal modeling

Wei Huang Stan M.R. Skadron K. 《Components and Packaging Technologies, IEEE Transactions on》2005,28(4):615-622

This paper presents a compact thermal modeling (CTM) approach, which is fully parameterized according to design geometries and material physical properties. While most compact modeling approaches facilitate thermal characterization of existing package designs, our method is better suited for preliminary exploration of the design space at both the silicon level and the package level. We show that our modeling method achieves reasonable boundary condition independence (BCI) by comparing a CTM example with a BCI model for a benchmark ball grid array single-chip package under the same standard set of boundary conditions. In essence, the presented CTM method can act as a convenient medium for enhanced interactions and collaborations among designers at the package, circuit and computer architecture levels, leading to efficient early evaluations of different thermally-related design trade-offs at all the above levels of abstraction before the actual detailed design is available. The presented modeling method can be easily extended to model emerging packaging schemes such as stacked chip-scale packaging and three-dimensional integration.

8.

Jiayuan Meng Kevin Skadron 《International journal of parallel programming》2011,39(1):115-142

Iterative stencil loops (ISLs) are used in many applications and tiling is a well-known technique to localize their computation. When ISLs are tiled across a parallel architecture, there are usually halo regions that need to be updated and exchanged among different processing elements (PEs). In addition, synchronization is often used to signal the completion of halo exchanges. Both communication and synchronization may incur significant overhead on parallel architectures with shared memory. This is especially true in the case of graphics processors (GPUs), which do not preserve the state of the per-core L1 storage across global synchronizations. To reduce these overheads, ghost zones can be created to replicate stencil operations, reducing communication and synchronization costs at the expense of redundantly computing some values on multiple PEs. However, the selection of the optimal ghost zone size depends on the characteristics of both the architecture and the application, and it has only been studied for message-passing systems in distributed environments. To automate this process on shared memory systems, we establish a performance model using NVIDIA’s Tesla architecture as a case study and propose a framework that uses the performance model to automatically select the ghost zone size that performs best and generate appropriate code. The modeling is validated by four diverse ISL applications, for which the predicted ghost zone configurations are able to achieve a speedup no less than 95% of the optimal speedup.
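The trade-off described above, fewer halo exchanges in return for redundant computation, can be sketched with a one-dimensional three-point averaging stencil. This is a sequential Python illustration only, not the paper's GPU framework or performance model; the function names, the stencil, and the fixed-boundary convention are assumptions made for the example. Each tile loads `ghost` extra cells on each side, runs several sweeps without talking to its neighbours, and then keeps only the cells that are still valid.

```python
def step(u):
    """One 3-point averaging sweep; the two end cells are held fixed."""
    new = list(u)
    for i in range(1, len(u) - 1):
        new[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return new


def reference(u, steps):
    """Untiled baseline: apply `steps` sweeps to the whole array."""
    for _ in range(steps):
        u = step(u)
    return u


def tiled_with_ghost_zones(u, steps, tile, ghost):
    """Apply `steps` sweeps, refreshing halos only once every `ghost` sweeps."""
    n = len(u)
    done = 0
    while done < steps:
        k = min(ghost, steps - done)                 # sweeps in this round
        out = list(u)
        for lo in range(0, n, tile):
            hi = min(lo + tile, n)
            a, b = max(0, lo - k), min(n, hi + k)    # tile plus ghost cells
            buf = u[a:b]
            for _ in range(k):
                buf = step(buf)                      # includes redundant ghost work
            # Stale data creeps inward one cell per sweep from each buffer edge,
            # so after k sweeps only the original tile [lo, hi) is trustworthy.
            out[lo:hi] = buf[lo - a:hi - a]
        u = out
        done += k
    return u
```

A wider ghost zone means fewer rounds (fewer exchanges and synchronizations) but more cells recomputed by neighbouring tiles, which is why the best width depends on both the architecture and the application.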

9.

Dual-Data Rate Transpose-Memory Architecture Improves the Performance, Power and Area of Signal-Processing Systems

Mohamed El-Hadedy Xinfei Guo Martin Margala Mircea R. Stan Kevin Skadron 《Journal of Signal Processing Systems》2017,88(2):167-184

This paper presents a novel type of high-speed and area-efficient register-based transpose memory architecture enabled by reporting on both edges of the clock. The proposed new architecture, by using double-edge triggered registers, doubles the throughput and increases the maximum frequency by avoiding some of the combinational circuitry used in prior work. The proposed design is evaluated with both FPGA and ASIC flows in 28/32 nm technology. The experimental results show that the proposed memory achieves almost 4X improvement in throughput while consuming 46% less area with the FPGA implementations compared to prior work. For ASIC implementations, it achieves more than 60% area reduction and at least 2X performance improvement while burning 60% less power compared to other register-based designs implemented with the same flow. As an example, a proposed 8X8 transpose memory with 12-bit input/output resolution is able to achieve a throughput of 107.83 Gbps at 647 MHz by taking only 140 slices on a Virtex-7 Xilinx FPGA platform, and achieve a throughput of 88.2 Gbps at 529 MHz by taking 0.024 mm² silicon area for ASIC. The proposed transpose memory is integrated in both 2D-DCT and 2D-IDCT blocks for signal processing applications on the same FPGA platform. The new architecture allows a 3.5X speed-up in performance for the 2D-DCT algorithm, compared to the previous work, while consuming 28% less area, and 2D-IDCT achieves a 3X speed-up while consuming 20% less area.

10.

Improved thermal management with reliability banking

Lu Z. Lach J. Stan M.R. Skadron K. 《Micro, IEEE》2005,25(6):40-49

Using a fixed temperature for thermal throttling is pessimistic. Reduced aging during periods of low temperature can compensate for accelerated aging during periods of high temperature. Runtime tracking of the temperature-dependent aging rate means that throttling is engaged only when necessary to maintain reliability. In this article, we show that the effect of cool (low-temperature) phases can compensate for that of hot (high-temperature) phases on reliability. Existing dynamic thermal management (DTM) techniques ignore the effects of temperature fluctuations on chip lifetime and can unnecessarily impose performance penalties for hot phases. Using electromigration as the targeted failure mechanism, we apply a dynamic reliability model and propose a dynamic reliability management (DRM) technique to dynamically track the consumption of chip lifetime during operation.
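The banking idea can be sketched as a running account of lifetime consumed versus lifetime budgeted. The example below assumes an Arrhenius-style temperature dependence of the kind that appears in Black's equation for electromigration; the article's actual reliability model may differ. The reference temperature, the activation energy of 0.9 eV, and the names `aging_rate` and `ReliabilityBank` are hypothetical values and identifiers chosen for illustration.

```python
import math

BOLTZMANN_EV = 8.617e-5   # Boltzmann constant in eV/K


def aging_rate(temp_k, ref_k=358.15, ea_ev=0.9):
    """Aging rate relative to the reference temperature (1.0 at ref_k).

    ref_k and ea_ev are assumed example values, not figures from the article.
    """
    return math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / ref_k - 1.0 / temp_k))


class ReliabilityBank:
    """Tracks lifetime consumption against a budget of one unit per unit time."""

    def __init__(self):
        self.elapsed = 0.0    # wall-clock time observed so far
        self.consumed = 0.0   # lifetime used, in reference-temperature time

    def record(self, temp_k, dt):
        # Hot intervals consume lifetime faster than real time, cool ones slower.
        self.elapsed += dt
        self.consumed += aging_rate(temp_k) * dt

    def balance(self):
        # Positive: banked lifetime is available. Negative: over budget.
        return self.elapsed - self.consumed

    def must_throttle(self):
        return self.balance() < 0.0
```

With this bookkeeping, a hot phase that follows a long cool phase can run unthrottled as long as the balance stays positive, whereas a fixed temperature threshold would throttle it immediately.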