Related Articles
20 related articles found.
1.
Simultaneous Multi-Threading (SMT) improves resource utilization by sharing key data-path components among multiple independent threads. When critical resources are shared by multiple threads, using them effectively is the most important factor in fully exploiting the system's potential. Transient behavior of the threads, in terms of their available execution parallelism, directly affects how efficiently these shared resources are used; committing more resources to the more active threads allows better resource utilization and thus higher throughput. In this paper, we propose a real-time dynamic scheduler for SMT that dispatches instructions based on thread-activeness information gathered in real time and dynamically adjusts dispatch priorities among threads accordingly. Extensive simulation shows a significant gain in system throughput from this technique. The performance of the proposed dispatching technique is evaluated on workload mixtures constructed from the instruction-level parallelism available in each thread; an average improvement of 6.5%, and a maximum of 15%, is observed.
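The abstract does not disclose the scheduler's internals; the following C fragment is only a minimal sketch of the general idea, ranking threads each cycle by a recent-commit counter and granting dispatch slots in that order. All names here (`thread_t`, `committed_recently`, `DISPATCH_WIDTH`) are illustrative assumptions, not the paper's design.

```c
#include <stdio.h>
#include <stdlib.h>

#define NUM_THREADS    4
#define DISPATCH_WIDTH 8   /* dispatch slots available per cycle */

typedef struct {
    int id;
    int committed_recently;  /* activeness proxy: commits in the last sample window */
    int ready_insns;         /* decoded instructions waiting to dispatch */
} thread_t;

/* Sort descending by activeness so livelier threads get slots first. */
static int by_activeness(const void *a, const void *b) {
    return ((const thread_t *)b)->committed_recently
         - ((const thread_t *)a)->committed_recently;
}

/* One dispatch cycle: hand out DISPATCH_WIDTH slots in priority order. */
static int dispatch_cycle(thread_t th[], int n) {
    qsort(th, n, sizeof th[0], by_activeness);
    int slots = DISPATCH_WIDTH, dispatched = 0;
    for (int i = 0; i < n && slots > 0; i++) {
        int take = th[i].ready_insns < slots ? th[i].ready_insns : slots;
        slots -= take;
        dispatched += take;
    }
    return dispatched;
}

int main(void) {
    thread_t th[NUM_THREADS] = {
        {0, 12, 4}, {1, 3, 6}, {2, 20, 2}, {3, 7, 5}
    };
    printf("dispatched %d instructions this cycle\n",
           dispatch_cycle(th, NUM_THREADS));
    return 0;
}
```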

2.
Simultaneous multithreading (SMT) processors can issue instructions from distinct processes or threads in the same cycle. This technique increases overall throughput by keeping the pipeline resources more fully occupied, at the potential expense of single-thread performance due to resource sharing. In the software domain, applications and operating systems use a growing number of dynamically linked libraries (DLLs), which provide flexibility and modularity and enable code sharing. A significant fraction of execution time in today's software is spent executing standard DLL instructions that are shared among multiple threads or processes. However, on an SMT processor with a virtually-indexed cache, existing instruction fetch mechanisms can induce unnecessary false I-TLB and I-Cache misses for the DLL instructions that are intended to be shared; the problem grows more prominent as more independent threads execute concurrently. In this work, we investigate this neglected form of contention between running threads in the I-TLB and I-Cache (both VIVT and VIPT) due to DLLs. To address it, we propose a system-level technique involving a lightweight modification to the microarchitecture and the OS. By exploiting the nature of the DLLs in our optimized system, we reinstate the intended sharing of DLLs on an SMT machine. Using Microsoft Windows based applications, our simulation results show that the optimized instruction fetch mechanism reduces the number of DLL misses by up to 5.5 times and improves instruction cache hit rates by up to 62%, yielding up to 30% DLL IPC improvement and up to 15% overall IPC improvement.

3.
The resource-sharing nature of Simultaneous Multithreading (SMT) processors and the presence of long-latency instructions from concurrent threads make the instruction scheduling window (IW), a primary shared component among the key pipeline structures of an SMT, a performance bottleneck. Because of tight constraints on its physical size, the IW faces severe pressure to accommodate instructions from the various threads while avoiding monopolization by low-ILP threads. Optimizing both the efficiency and the fairness of IW utilization is particularly challenging in the shadow of long-latency instructions. Most existing SMT optimization schemes rely on the fetch policy to control which instructions are allowed to enter the pipeline, with little effort devoted to the long-latency instructions already in the IW. In this paper, we propose streamline buffers to handle the long-latency instructions that have already entered the pipeline and clog the IW while the controlling fetch policy takes time to react. Each streamline buffer extracts from the IW, and holds, a chain of instructions from one thread that are stalled by a dependence on a long-latency load.
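As a hedged illustration of what "extracting a chain" could involve, the sketch below walks an issue window in program order and moves every instruction of one thread that transitively depends on a missing load's destination register into a side buffer. The data structures and the single-pass closure are simplifying assumptions, not the paper's hardware design.

```c
#include <stdbool.h>
#include <stdio.h>

#define NUM_REGS 32
#define BUF_SIZE 8

typedef struct {
    int  thread;
    int  dest;      /* destination register */
    int  src[2];    /* source registers, -1 = unused */
    bool valid;     /* still occupies an IW slot */
} insn_t;

/* True if the instruction reads any register already in the chain. */
static bool depends_on(const insn_t *in, const bool poisoned[]) {
    for (int i = 0; i < 2; i++)
        if (in->src[i] >= 0 && poisoned[in->src[i]])
            return true;
    return false;
}

/* Move the dependence chain rooted at the missing load's destination
 * register out of the IW and into the streamline buffer. A single pass
 * suffices because IW entries are scanned in program order, so a
 * producer is always poisoned before its consumers are examined. */
static int extract_chain(insn_t iw[], int n, int thread, int load_dest,
                         insn_t buf[], bool poisoned[]) {
    int moved = 0;
    poisoned[load_dest] = true;
    for (int i = 0; i < n; i++) {
        if (!iw[i].valid || iw[i].thread != thread) continue;
        if (depends_on(&iw[i], poisoned) && moved < BUF_SIZE) {
            poisoned[iw[i].dest] = true;  /* chain grows transitively */
            buf[moved++] = iw[i];
            iw[i].valid = false;          /* IW slot freed for other threads */
        }
    }
    return moved;
}

int main(void) {
    /* Thread 1's r1 comes from a load that missed; r2 and r3 depend on it. */
    insn_t iw[] = {
        {1, 2, {1, -1}, true},   /* r2 = f(r1)     -> extracted          */
        {0, 5, {4, -1}, true},   /* other thread   -> stays in the IW    */
        {1, 3, {2,  6}, true},   /* r3 = f(r2, r6) -> extracted          */
        {1, 7, {6, -1}, true},   /* independent    -> stays in the IW    */
    };
    insn_t buf[BUF_SIZE];
    bool poisoned[NUM_REGS] = {false};
    printf("moved %d instructions to the streamline buffer\n",
           extract_chain(iw, 4, 1, 1, buf, poisoned));
    return 0;
}
```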

4.
In a simultaneous multithreading processor, the concurrently executing threads share the processor's resources, and how these limited shared resources are allocated among threads determines both the performance of each thread and the overall performance of the processor. Allocating the different classes of shared resources sensibly and effectively, according to their characteristics, has thus become an important research topic for SMT processors. This paper studies and analyzes the characteristics of each class of shared resource in an SMT processor. The analysis shows that the allocation of queue-type shared resources has a decisive influence on per-thread performance and on the overall performance of the SMT processor. Consequently, the key to shared-resource allocation in an SMT processor lies in controlling the allocation of the queue-type shared resources.

5.
Simultaneous Multi-Threading (SMT) is a hardware technique that increases processor throughput by issuing instructions from multiple threads simultaneously. While SMT can be added to an existing microarchitecture with relatively low overhead, the additional chip area could instead be spent on other resources such as more functional units, larger caches, or better branch predictors. How large is the SMT overhead, and at what point does SMT stop paying off for maximum throughput compared with adding other architectural features? This paper evaluates the silicon overhead of SMT through a transistor/interconnect-level analysis of the layout. We discuss microarchitectural issues that affect SMT implementations and show that the Instruction Set Architecture (ISA) and microarchitecture can have a large effect on SMT overhead and performance. Results show that SMT yields large performance gains with small to moderate area overhead.

6.
Simultaneous multithreading (SMT) is an architectural technique that improves resource utilization by allowing instructions from multiple threads to coexist in a processor and share its resources. However, earlier studies have shown that SMT performance begins to saturate as the number of coexisting threads grows beyond four. We show that no single fetch policy is best for the entire execution, and that a significant performance improvement can be attained by switching fetch policies dynamically. We propose an implementation comprising an extremely lightweight thread that controls fetch policies (a detector thread) and a processor architecture that runs the detector thread without affecting the user application threads. We evaluate various heuristics by which the detector thread determines the best fetch policy. With eight threads running on our simulated SMT, the proposed approach outperforms fixed scheduling mechanisms by up to 30%.

7.
任建, 安虹, 路放, 梁博. 《计算机科学》, 2006, 33(3): 239-243
A simultaneous multithreading (SMT) processor can issue and execute instructions from multiple threads in each cycle, greatly increasing the instruction throughput of a superscalar microprocessor, but the concurrent execution of multiple threads also creates contention for many shared hardware resources. In particular, having multiple threads share the branch prediction hardware can noticeably degrade branch prediction accuracy. Studying how branch-handling schemes affect overall processor performance is therefore important for guiding SMT processor design. Using an SMT processor simulator, this paper experimentally evaluates several well-known branch prediction schemes on an SMT configuration in which each thread runs an independent application, and analyzes their impact on prediction accuracy and overall performance in both single-threaded and multithreaded runs. We conclude that, in such an SMT architecture, giving each thread its own private predictor is a good choice; since each private predictor can be small and simple, this does not incur much hardware overhead.

8.
Simultaneous multithreading (SMT) has been proposed to improve system throughput by overlapping instructions from multiple threads on a single wide-issue processor. Recent studies have demonstrated that diversity among simultaneously executed applications can bring significant performance gains from SMT. However, the speedup of a single application parallelized into multiple threads is often sensitive to its inherent instruction-level parallelism (ILP), as well as to the efficiency of the synchronization and communication mechanisms between its separate, but possibly dependent, threads. Moreover, because these threads tend to stress the same architectural resources, significant speedup may not materialize. In this paper, we evaluate and contrast thread-level parallelism (TLP) and speculative precomputation (SPR) techniques for a series of memory-intensive codes executed on a specific SMT processor implementation. We explore the performance limits by evaluating the trade-offs between ILP and TLP for various kinds of instruction streams. By learning how such streams interact when executed simultaneously on the processor, and by quantifying their presence within each application's threads, we interpret the observed performance of each application when parallelized with the aforementioned techniques. To strengthen this evaluation, we also present results gathered from the processor's performance monitoring hardware.

9.
Simultaneous Multithreading (SMT) is a processor architectural technique that promises to significantly improve the utilization and performance of modern wide-issue superscalar processors. An SMT processor can issue multiple instructions from multiple threads to the processor's functional units each cycle. Unlike shared-memory multiprocessors, SMT provides and benefits from fine-grained sharing of processor and memory system resources; unlike current uniprocessors, SMT exposes and benefits from inter-thread instruction-level parallelism when hiding long-latency operations. Compiler optimizations are often driven by specific assumptions about the underlying architecture and implementation of the target machine, particularly for parallel processors. For example, when targeting shared-memory multiprocessors, parallel programs are compiled to minimize sharing, in order to reduce costly inter-processor communication. Optimizations appropriate for those conventional machines may therefore be inappropriate for SMT, which benefits from fine-grained resource sharing within the processor. This paper re-examines three compiler optimizations in the context of simultaneous multithreading: loop-iteration scheduling, software speculative execution, and loop tiling. Our results show that all three should be applied differently on SMT architectures: threads should be parallelized with a cyclic rather than a blocked algorithm; non-loop programs should not be software-speculated; and compilers no longer need to size tiles precisely to match cache sizes. By following these new guidelines, compilers can generate code that improves the performance of programs executing on SMT machines.
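The first guideline is easy to make concrete. The sketch below contrasts blocked and cyclic distribution of loop iterations across threads; on an SMT core, consecutive iterations often touch adjacent data, so interleaving them cyclically across threads that share one cache encourages constructive sharing rather than the communication that blocked schedules try to avoid on multiprocessors. Function names are illustrative.

```c
#include <stdio.h>

#define N 16   /* loop trip count */

/* Blocked: thread t runs a contiguous chunk of iterations — the usual
 * choice on multiprocessors, where it minimizes inter-processor sharing. */
static void blocked(int tid, int nthreads) {
    int chunk = (N + nthreads - 1) / nthreads;
    for (int i = tid * chunk; i < (tid + 1) * chunk && i < N; i++)
        printf("blocked: thread %d runs iteration %d\n", tid, i);
}

/* Cyclic: thread t runs iterations t, t+T, t+2T, ... — neighboring
 * iterations land on different threads, which is harmless (and often
 * helpful) when those threads share one SMT core's caches. */
static void cyclic(int tid, int nthreads) {
    for (int i = tid; i < N; i += nthreads)
        printf("cyclic:  thread %d runs iteration %d\n", tid, i);
}

int main(void) {
    for (int t = 0; t < 4; t++) blocked(t, 4);
    for (int t = 0; t < 4; t++) cyclic(t, 4);
    return 0;
}
```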

10.
Simultaneous multithreading is a latency-tolerant architecture with a shared L2 cache that can execute multiple instructions from multiple threads in each cycle, which increases pressure on the memory hierarchy. This paper studies how the shared cache should be partitioned among the concurrently executing threads of an SMT processor, focusing on fairness in cache sharing and its relationship with throughput. The traditional LRU policy implicitly partitions the shared cache according to thread demand, giving more cache space to threads with higher demand; this makes cache management unfair and can cause problems such as thread starvation and priority inversion. We implement an adaptive, run-time partitioning mechanism (ARP) to manage the shared cache. ARP uses fairness as the partitioning metric and applies a dynamic partitioning algorithm to optimize it; the algorithm is easy to implement and requires little profiling. In hardware, classical monitors collect stack-distance information for each thread, with a storage overhead below 0.25%. Experimental results show that, compared with LRU-based cache partitioning, ARP improves the fairness of a 2-way SMT processor by a factor of 2.26 while increasing throughput by 14.75% on average.
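The abstract does not spell out ARP's repartitioning step; the following sketch shows one plausible shape of a fairness-driven policy, moving a cache way from the least-slowed thread to the most-slowed one using per-thread miss estimates such as those a stack-distance monitor can provide. Both the fairness proxy and the monitor interface are simplified stand-ins, not the paper's implementation.

```c
#include <stdio.h>

#define NUM_THREADS 2

/* Simplified stand-ins for the paper's stack-distance monitors:
 * misses_alone[t]  = estimated misses if thread t owned the whole cache,
 * misses_shared[t] = misses observed under the current partition. */
typedef struct {
    double misses_alone[NUM_THREADS];
    double misses_shared[NUM_THREADS];
    int    ways[NUM_THREADS];          /* current per-thread way allocation */
} cache_state_t;

/* Fairness proxy: each thread's slowdown relative to running alone.
 * One repartitioning step takes a way from the least-hurt thread and
 * gives it to the most-hurt one, keeping at least one way per thread. */
static void repartition(cache_state_t *c) {
    int worst = 0, best = 0;
    double slow[NUM_THREADS];
    for (int t = 0; t < NUM_THREADS; t++) {
        slow[t] = c->misses_shared[t] / (c->misses_alone[t] + 1e-9);
        if (slow[t] > slow[worst]) worst = t;
        if (slow[t] < slow[best])  best  = t;
    }
    if (worst != best && c->ways[best] > 1) {
        c->ways[best]--;
        c->ways[worst]++;
    }
}

int main(void) {
    cache_state_t c = {
        .misses_alone  = {100.0, 100.0},
        .misses_shared = {300.0, 120.0},   /* thread 0 is hurt far more */
        .ways          = {4, 4},
    };
    repartition(&c);
    printf("ways after one step: T0=%d T1=%d\n", c.ways[0], c.ways[1]);
    return 0;
}
```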

11.
Resizable caches can trade capacity for access speed to dynamically match the needs of the workload. In single-threaded cores, resizable caches have demonstrated their ability to improve processor performance by adapting to the phases of the running application. In Simultaneous Multi-Threaded (SMT) cores, caching needs can vary greatly with the number of threads and their characteristics, offering even more opportunities to adjust cache resources to the workload dynamically. In this paper, we demonstrate that the preferred control methodology for data cache reconfiguration in an SMT core changes as the number of running threads increases. In workloads with one or two threads, the resizable-cache control algorithm should optimize for cache miss behavior, because misses typically form the critical path. In contrast, with several independent threads running, optimizing for cache hit behavior has more impact, since large SMT workloads have other threads to run during a cache miss. Moreover, we demonstrate that these seemingly diametrically opposed policies are closely related mathematically: the former minimizes the arithmetic mean cache access time (which we call AMAT), while the latter minimizes its harmonic mean. We introduce an algorithm (HAMAT) that smoothly and naturally adjusts between the two strategies as the degree of multithreading changes. We extend a previously proposed Globally Asynchronous, Locally Synchronous (GALS) processor core with SMT support and dynamically resizable caches. The HAMAT algorithm significantly outperforms the AMAT algorithm on four-thread workloads while matching its performance on one- and two-thread workloads, and achieves overall performance improvements of 18.7%, 10.1%, and 14.2% on one-, two-, and four-thread workloads, respectively, over the best fixed-configuration cache design.
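The arithmetic/harmonic duality in the abstract can be stated directly. For $n$ threads with per-thread average cache access times $t_1, \dots, t_n$, the two objectives are the standard means (how HAMAT interpolates between them is in the paper and not reproduced here):

```latex
\min \; \bar{t}_{\mathrm{arith}} = \frac{1}{n}\sum_{i=1}^{n} t_i
\qquad \text{versus} \qquad
\min \; \bar{t}_{\mathrm{harm}} = \frac{n}{\sum_{i=1}^{n} 1/t_i}
```

The arithmetic mean is dominated by the largest $t_i$, i.e., by miss-heavy threads on the critical path, while the harmonic mean is dominated by the smallest $t_i$, rewarding policies that speed up hits when other threads can cover a miss.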

12.
Research on Thread Scheduling Policies for Multicore Multithreaded Architectures
Chip multicore multithreaded (CMT) architectures combine the advantages of chip multiprocessing (CMP) and simultaneous multithreading (SMT), allowing all on-chip threads in the running state to execute in parallel each cycle, which leads to sharing of, and contention for, hardware resources both within and across cores. After describing the resource-sharing characteristics of CMT architectures and briefly reviewing the development of SMT thread scheduling, this paper focuses on thread scheduling policies aimed at reducing resource contention and on resource partitioning mechanisms, analyzes the state of research, discusses the strengths and weaknesses of existing policies in handling these problems, and explores possible directions for future research.

13.
In this paper, we conduct a performance scaling analysis of multithreaded multicore processors (MMPs) for parallel computing. We propose a thread-level closed queuing network model covering a fairly large design space, accounting for hardware scaling models; coarse-grain, fine-grain, and simultaneous multithreading (SMT) cores; and shared resources, including cache, memory, and critical sections. We then derive a closed-form solution for this model in terms of a speedup performance measure. This solution makes it possible to analyze the scaling properties of MMPs along multiple dimensions. In particular, we show that for the parallelizable part of the workload, the speedup in the absence of resource contention is no longer just a linear function of the number of parallel processing units, as predicted by Amdahl's law, but also a strong function of workload characteristics, ranging from strongly memory-bound to strongly CPU-bound workloads. We also find that with core multithreading, superlinear speedup, higher than that predicted by Amdahl's law, may be achieved for the parallelizable part of the workload if core threads exhibit strong cache affinity and the workload is strongly memory-bound. We then derive a tight speedup upper bound in the presence of both memory resource contention and critical sections for multicore processors with single-threaded cores. This bound shows that with resource contention among threads, whether due to shared memory or critical sections, a sequential term is guaranteed to emerge from the parallelizable part of the workload, fundamentally limiting the scalability of multicore processors for parallel computing, in addition to the sequential part of the workload dictated by Amdahl's law. To improve speedup on MMPs, one should therefore strive to enhance memory parallelism and to confine critical sections as locally as possible, e.g., to the smallest possible number of threads within the same core.
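For reference, the classical bound the abstract generalizes is Amdahl's law: with parallel fraction $f$ and $p$ processing units,

```latex
S(p) = \frac{1}{(1 - f) + f/p}.
```

The abstract's key claim can be sketched, purely schematically, as contention adding an effective serial term $f_c > 0$ to the parallelizable part,

```latex
S(p) \le \frac{1}{(1 - f) + f/p + f_c},
```

so speedup saturates at $1/((1-f) + f_c)$ as $p \to \infty$; the exact form of $f_c$ comes from the paper's closed queuing-network model and is not reproduced here.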

14.
Current research on fetch policies for Simultaneous Multithreading (SMT) processors has focused mostly on optimizing overall performance. This paper proposes a novel SMT fetch policy, CPIT (Controlling Performance of Individual Thread), for controlling the execution of an individual thread. The results show that, for all simulated workloads, CPIT guarantees the controlled thread its expected performance in more than 94% of cases, and in the failing cases the average performance deviation of the controlled thread does not exceed 1.25%. Moreover, CPIT has little effect on overall processor performance: compared with ICOUNT, a fetch policy aimed at optimizing overall performance, the average drop in overall performance is under 3%, while the performance of the threads other than the controlled one drops by only 1.75% on average.

15.
In simultaneous multithreading (SMT) multiprocessors, using all available threads (logical processors) to run a parallel loop is not always beneficial, due to interference between threads and parallel execution overhead. To maximize the performance of a parallel loop on an SMT multiprocessor, it is important to find an appropriate number of threads for executing it. This article presents adaptive execution techniques that find a proper execution mode for each parallel loop in a conventional loop-level parallel program on SMT multiprocessors. A compiler preprocessor generates code that, based on dynamic feedback, automatically determines at run time the optimal number of threads for each parallel loop in the application. We evaluate the technique using a set of standard numerical applications running on a real SMT multiprocessor machine with 8 hardware contexts. The approach is general enough to work well with other SMT multiprocessor or multicore systems.
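A hedged sketch of the dynamic-feedback idea: time the loop's first few invocations under candidate thread counts, then lock in the fastest configuration. The preprocessor-generated code in the paper is certainly richer (per-loop state, possible re-adaptation); every name below, including the stand-in `run_loop_with`, is illustrative.

```c
#include <stdio.h>

#define NUM_CANDIDATES 4
static const int candidates[NUM_CANDIDATES] = {1, 2, 4, 8};

typedef struct {
    int    trials_done;    /* candidates timed so far */
    double best_time;
    int    best_threads;
} loop_state_t;

/* Stand-in for executing the loop body with a given thread count and
 * returning its wall-clock time. The toy model rewards parallelism
 * (2.0/n) but charges an interference overhead that grows with n. */
static double run_loop_with(int nthreads) {
    return 2.0 / nthreads + 0.1 * nthreads;
}

/* Explore until every candidate has been timed once, then exploit the
 * fastest configuration on all later invocations of this loop. */
static int choose_threads(loop_state_t *s) {
    if (s->trials_done < NUM_CANDIDATES) {
        int n = candidates[s->trials_done];
        double t = run_loop_with(n);
        if (s->trials_done == 0 || t < s->best_time) {
            s->best_time = t;
            s->best_threads = n;
        }
        s->trials_done++;
        return n;
    }
    return s->best_threads;   /* adapted: stick with the best mode */
}

int main(void) {
    loop_state_t s = {0, 0.0, 1};
    for (int inv = 0; inv < 6; inv++)
        printf("invocation %d uses %d threads\n", inv, choose_threads(&s));
    return 0;
}
```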

16.
This paper studies the locality of sparse algebra codes on simultaneous multithreading (SMT) architectures. In this kind of architecture, many hardware structures are dynamically shared among the running threads. This puts great stress on the memory hierarchy, and poor locality, both inter-thread and intra-thread, may become a major performance bottleneck. The behavior is even more pronounced when the code is irregular, as sparse matrix codes are. Techniques that increase the locality of irregular codes on SMT architectures are therefore important for achieving high performance. This paper proposes a data reordering technique specially tuned for such architectures and codes, based on a locality model developed by the authors in previous work. The technique has been tested first on a simulator of an SMT architecture and subsequently on a real architecture, Intel's Hyper-Threading. Important reductions in the number of cache misses are achieved, even as the number of running threads grows. The locality improvement technique also decreases total execution time and improves the scalability of the code.

17.
SMT processors improve performance by simultaneously executing instructions from multiple threads, with all threads competing for shared on-chip resources to maximize their utilization. However, the wire-delay constraints inherent in the centralized control structures of SMT processors, together with the imbalance in resource occupancy across threads, force designers to consider allocating resources among threads to reduce communication latency and avoid possible thread starvation. This paper introduces the basic principles of, and research topics in, on-chip resource allocation for SMT architectures; analyzes the impact of resource allocation on the SMT architecture; discusses the operating mechanisms and design methods of both explicit and implicit allocation policies; analyzes the dynamic resource-balancing policy of the POWER5 processor as an example; and finally surveys future trends in on-chip resource allocation for SMT processors.

18.
An Effective Fetch Control Mechanism for Simultaneous Multithreading Processors
Simultaneous multithreading processors greatly improve performance by fetching and executing instructions from multiple running threads each cycle. The accuracy of the branch predictor and the efficiency of the fetch policy are key factors in SMT processor performance. By combining a value-based branch predictor with a fetch policy based on thread progress rate, this paper proposes a new fetch control mechanism with low hardware overhead and implementation complexity. Experimental results show that the mechanism effectively improves processor performance, achieving a 28% speedup over a conventional fetch control mechanism, which is also higher than that of existing stream-buffer-based and branch-classifier-based fetch control mechanisms.

19.
IEEE Micro, 2004, 24(6): 74-82
Memory latency dominates the performance of many applications on modern processors, despite advances in caches and prefetching techniques. Numerous prefetching techniques, in both hardware and software, try to alleviate the memory bottleneck. One such technique, known as helper threading, improves single-thread performance on a simultaneous multithreaded architecture (SMT), which shares processor resources, including caches, among logical threads. It uses otherwise idle hardware thread contexts to execute speculative threads on behalf of the main thread, accelerating the program by exploiting the processor's multithreading capability to run assist threads. Based on the helper threading usage model, virtual multithreading (VMT), a form of switch-on-event user-level multithreading, can improve performance for real-world workloads with a wall-clock speedup of 5.0 to 38.5 percent.

20.
Hardware/Software Interface Design for the Godson-2 (龙芯2号) Simultaneous Multithreading Processor
As fabrication technology improves, more and more transistors can be integrated on a chip, and multithreading has gradually become a mainstream processor architecture technique, making the hardware/software interface of multithreaded processors an urgent problem. Based on an analysis of the software requirements of simultaneous multithreading, this paper presents a hardware/software co-design solution for the interface of the Godson-2 SMT processor, together with a corresponding operating system implementation. The operating system was implemented on top of Linux 2.4.20. Performance evaluation with benchmarks such as SPEC CPU2000 shows that the Godson-2 SMT processor with this hardware/software interface greatly improves the performance of multi-process workloads. The analysis and design apply not only to SMT processors but also serve as a reference for the design of chip multicore processors.
