期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

任建安虹路放梁博《计算机科学》2006,33(3):239-243

同时多线程处理器（SMT）每个周期能够从多个线程中发射指令执行,从而大大地提高了超标量微处理器的指令吞吐量,但多个线程的同时执行也带来了许多硬件资源的共享冲突问题.其中,多个线程共享分支预测硬件的方案会对分支预测精度产生较大的影响.研究SMT处理器中分支处理方案对于处理器整体性能的影响,对于指导SMT处理器的设计是十分重要的.本文利用SMT处理器模拟器,针对各线程运行独立应用的SMT结构实验评估了几种著名的分支预测方案;给出了在单线程和多线程情况下,分支预测方案对分支预测精度和处理器整体性能的影响的分析;总结出在这样的SMT结构中,各线程拥有独立的预测器是一种较好的选择,并且由于各独立预测器可以采用小而简单的结构,所以不会带来太多的硬件开销. 相似文献

2.

一种有效的同时多线程处理器取指控制机制 总被引：1，自引：0，他引：1

何立强刘志勇《计算机学报》2006,29(4):535-543

同时多线程处理器通过每时钟周期从多个运行的线程取指令执行,极大地提高了处理器的性能.分支预测器的预测精度和取指策略的效率是影响同时多线程处理器性能的关键.通过将一个基于值的分支预测器和一个基于线程推进速度的取指策略相结合,提出一种新的取指控制机制.该结构的硬件开销较小,实现复杂度较低.实验结果表明,该取指控制机制有效地提高了处理器的性能,其相对于传统取指控制机制的性能加速比为28%且该加速比也高于目前基于流缓冲区和基于分支分类器的取指控制机制. 相似文献

3.

一种基于综合历史信息的SMT结构分支预测算法

王晶樊晓桠叶曾《计算机科学》2008,35(2):259-262

在SMT结构中,可以同时从多个线程中取指.当可取指线程个数较少时,分支预测的重要性与在超标量处理器中的相比有增无减,因为SMT结构中转移误预测的代价更大了.影响分支预测准确率的关键因素是历史信息的组织方式和更新方式.本文仿真分析了这些因素对分支预测准确率的影响,提出了一种基于综合历史信息的分支预测算法--IHBP,把全局信息和局部信息结合在一起预测转移,解决了SMT结构中分支预测信息过时、混乱等问题,使得预测的准确率更具备鲁棒性.仿真结果表明:在8线程结构中,该算法与目前国际普遍采用的Gshare算法和Pag算法相比,分支预测准确率分别提高了8.5%和2.3%. 相似文献

4.

Optimizing Instruction Scheduling through Combined In-Order and O-O-O Execution in SMT Processors

Wang Hui Sangireddy Rama Baldawa Sandeep 《Parallel and Distributed Systems, IEEE Transactions on》2009,20(3):389-403

The resource sharing nature of Simultaneous Multithreading (SMT) processors and the presence of long latency instructions from concurrent threads make the instruction scheduling window (IW), which is a primary shared component among key pipeline structures in SMT, a performance bottleneck. Due to the tight constraints on its physical size, the IW faces more severe pressure to handle the instructions from various threads while attempting to avoid resource monopolization by some low-ILP threads. It is particularly challenging to optimize the efficiency and fairness in IW utilization to fulfill the affordable performance by SMT under the shadow of long latency instructions. Most of the existing optimization schemes in SMT processors rely on the fetch policy to control the instructions that are allowed to enter the pipeline, while little effort is put to control the long latency instructions that are already located in the IW. In this paper, we propose streamline buffers to handle the long latency instructions that have already entered the pipeline and clog the IW, while the controlling fetch policies take time to react. Each streamline buffer extracts from IW and holds a chain of instructions from a thread that are stalled by dependency on a long latency load. 相似文献

5.

一种基于IA-64的并行架构的研究

下载免费PDF全文

邓晴莺张民选蒋江《计算机工程与科学》2008,30(7):82-85

同时多线程（SMT）能在同一时钟周期执行不同线程的指令,同时开发了指令级并行（ILP）和线程级并行（TLP）。显式并行指令计算（EPIC）关注于编译器和硬件的相互协作。在本文中,我们设计和实现了一套并行环境,其中包括并行编译器OpenUH和基于IA-64的同时多线程体系结构EDSMT,并通过NAS并行测试程序作出了性能评测。相似文献

6.

DLL-conscious instruction fetch optimization for SMT processors

Fayez Mrinmoy Hsien-Hsin S. 《Journal of Systems Architecture》2008,54(12):1089-1100

Simultaneous multithreading (SMT) processors can issue multiple instructions from distinct processes or threads in the same cycle. This technique effectively increases the overall throughput by keeping the pipeline resources more occupied at the potential expense of reducing single thread performance due to resource sharing. In the software domain, an increasing number of dynamically linked libraries (DLL) are used by applications and operating systems, providing better flexibility and modularity, and enabling code sharing. It is observed that a significant amount of execution time in software today is spent in executing standard DLL instructions, that are shared among multiple threads or processes. However, for an SMT processor with a virtually-indexed cache implementation, existing instruction fetching mechanisms can induce unnecessary false I-TLB and I-Cache misses caused by the DLL-based instructions that are intended to be shared. This problem is more prominent when multiple independent threads are executing concurrently on an SMT processor.In this work, we investigate a neglected form of contention between running threads in the I-TLB and I-Cache (including both VIVT and VIPT) due to DLLs. To address these shortcomings, we propose a system level technique involving a light-weight modification in the microarchitecture and the OS. By exploiting the nature of the DLLs in our optimized system, we can reinstate the intended sharing of the DLLs in an SMT machine. Using Microsoft Windows based applications, our simulation results show that the optimized instruction fetching mechanism can reduce the number of DLL misses up to 5.5 times and improve the instruction cache hit rates by up to 62%, resulting in up to 30% DLL IPC improvements and up to 15% overall IPC improvements. 相似文献

7.

一种具有QoS特性的同时多线程处理器取指策略 总被引：4，自引：0，他引：4

何立强刘志勇《计算机研究与发展》2006,43(11):1980-1984

同时多线程处理器通过每时钟周期从多个运行的线程取指令执行,从而极大地提高了处理器的性能.建议了一种具有QoS特性的同时多线程处理器取指策略,并讨论了其在QoS管理方面的问题.该策略的核心思想是利用线程的优先级和流速来同时控制线程的取指过程,从而满足线程在执行速度上的QoS需求.与传统的基于纯优先级的取指策略相比,该策略不但具有QoS特性,同时还可以更加有效地分配取指带宽,从而能获得更高的处理器性能.该策略的物理实现非常简单.模拟实验的结果表明,该策略在提供QoS支持的基础上,可以在传统的基于优先级的取指策略ICOUNT的基础上提高15%的系统性能. 相似文献

8.

同时多线程处理器共享资源的特性分析

下载免费PDF全文

黄彩霞《计算机工程与科学》2009,31(8)

同时多线程处理器中同时执行的线程共享处理器中的资源,而这些有限的共享资源在线程之间的分配状况将决定每个线程执行的性能和处理器的总体性能。如何根据不同类别共享资源的特性对它们进行合理有效分配成为同时多线程处理器研究的重要课题之一。本文对同时多线程处理器中各类共享资源的特性进行深入研究与分析,分析结果表明,队列类共享资源的分配方式对每个线程执行的性能和SMT处理器的总体性能具有至关重要的影响。因此,同时多线程处理器中共享资源分配的关键在于控制队列类共享资源的分配。相似文献

9.

SMT layout overhead and scalability

Burns J. Gaudiot J.-L. 《Parallel and Distributed Systems, IEEE Transactions on》2002,13(2):142-155

Simultaneous Multi-Threading (SMT) is a hardware technique that increases processor throughput by issuing instructions simultaneously from multiple threads. However, while SMT can be added to an existing microarchitecture with relatively low overhead, this additional chip area could be used for other resources such as more functional units, larger caches, or better branch predictors. How large is the SMT overhead and at what point does SMT no longer pay off for maximum throughput compared to adding other architecture features? This paper evaluates the silicon overhead of SMT by performing a transistor/interconnect-level analysis of the layout. We discuss microarchitecture issues that impact SMT implementations and show how the Instruction Set Architecture (ISA) and microarchitecture can have a large effect on the SMT overhead and performance. Results show that SMT yields large performance gains with small to moderate area overhead 相似文献

10.

IA-64的并行架构及其寄存器文件

下载免费PDF全文

邓晴莺张民选蒋江《计算机工程》2008,34(12):13-15

同时多线程能在同一时钟周期执行不同线程的指令,并且指令级并行和线程级并行。显式并行指令计算关注于编译器和硬件的相互协作。寄存器文件的设计在高性能处理器设计中十分重要,寄存器栈和寄存器栈引擎是提高其性能的重要手段。该文设计和实现一套并行环境,其中包括并行编译器OpenUH和基于IA-64的同时多线程体系结构EDSMT,实验表明,该并行架构适用于大多数并行应用,针对NAS的并行测试程序,该架构相对于SMTSIM平均有12.48%的性能提升。相似文献

11.

Tuning Compiler Optimizations for Simultaneous Multithreading

Jack L. Lo Susan J. Eggers Henry M. Levy Sujay S. Parekh Dean M. Tullsen 《International journal of parallel programming》1999,27(6):477-503

Simultaneous Multithreading (SMT) is a processor architectural technique that promises to significantly improve the utilization and performance of modern wide-issue superscalar processors. An SM T processor is capable of issuing multiple instructions from multiple threads to a processor's functional units each cycle. Unlike shared-memory multiprocessors, SMT provides and benefits from fine-grained sharing of processor and memory system resources; unlike current uniprocessors, SMT exposes and benefits from inter-thread instruction-level parallelism when hiding long-latency operations. Compiler optimizations are often driven by specific assumptions about the underlying architecture and implementation of the target machine, particularly for parallel processors. For example, when targeting shared-memory multiprocessors, parallel programs are compiled to minimize sharing, in order to decrease high-cost inter-processor communication. Therefore, optimizations that are appropriate for these conventional machines may be inappropriate for SMT, which can benefit from finegrained resource sharing within the processor. This paper reexamines several compiler optimizations in the context of simultaneous multithreading. We revisit three optimizations in this light: loop-iteration scheduling, software speculative execution, and loop tiling. Our results show that all three optimizations should be applied differently in the context of SMT architectures: threads should be parallelized with a cyclic, rather than a blocked algorithm; non-loop programs should not be software speculated, and compilers no longer need to be concerned about precisely sizing tiles to match cache sizes. By following these new guidelines, compilers can generate code that improves the performance of programs executing on SMT machines. 相似文献

12.

基于EPIC的同时多线程处理器取指策略

下载免费PDF全文

贾小敏孙彩霞张民选《计算机工程》2007,33(4):256-258

EPIC硬件简单，同时多线程易于开发线程级并行，在EPIC上实现同时多线程可以结合二者的优点。取指策略对同时多线程处理器的性能有重要影响。该文介绍了几种有代表性的超标量同时多线程处理器取指策略，分析了这些策略在EPIC同时多线程处理器上的适用性，提出了一种新的适用于EPIC的取指策略SICOUNT。分析表明SICOUNT策略可以充分利用EPIC软硬件协同的优势，在选择取指线程时使用编译器所提供的停顿信息，能更精确地估计各个线程的流动速度，使取出指令的质量更高。相似文献

13.

Adaptive dynamic thread scheduling for simultaneous multithreaded architectures with a detector thread

《Journal of Parallel and Distributed Computing》2006,66(10):1304-1321

Simultaneous multithreading (SMT) is an architectural technique that improves resource utilization by allowing instructions from multiple threads to coexist in a processor and share resources. However, earlier studies have shown that the performance of an SMT architecture begins to saturate as the number of coexisting threads increases beyond four. We show that no single fetch policy can be the best solution during the entire execution time and that a significant performance improvement can be attained by dynamically switching the fetch policies. We propose an implementation method which includes an extremely lightweight thread to control fetch policies (a detector thread) and a processor architecture to run the detector thread without impact on the user application threads. We evaluate various heuristics for the detector thread to determine the best fetch policies. We show that, with eight threads running on our simulated SMT, the proposed approach can outperform fixed scheduling mechanisms by up to 30%. 相似文献

14.

Improving execution unit occupancy on SMT-based processors through hardware-aware thread scheduling

《Future Generation Computer Systems》2014

Modern processor architectures are increasingly complex and heterogeneous, often requiring software solutions tailored to the specific hardware characteristics of each processor model. In this article, we address this problem by targeting two processors featuring Simultaneous MultiThreading (SMT) to improve the occupancy of their internal execution units through a sustained stream of instructions coming from more than one thread. We target the AMD Bulldozer and IBM POWER7 processors as case studies for specific hardware-oriented performance optimizations that increase the variety of instructions sent to each core to maximize the occupancy of all its execution units. WorkOver, presented in this article, improves thread scheduling by increasing the performance of floating point-intensive workloads on Linux-based operating systems. WorkOver is a user-space monitoring tool that automatically identifies FPU-intensive threads and schedules them in a more efficient way without requiring any patches or modifications at the kernel level. Our measurements using standard benchmark suites show that speedups of up to 20% can be achieved by simply allowing WorkOver to monitor applications and schedule their threads, without any modification of the workload. 相似文献

15.

The Misprediction Recovery Cache

Ashwini K. Nanda James O. Bondi Simonjit Dutta 《International journal of parallel programming》1998,26(4):383-415

In modern processors, deep pipelines couple with superscalar techniques to allow each pipe stage to process multiple instructions. When such a pipe must be flushed and refilled, as when predicted program flow beyond a branch is subsequently recognized as wrong, the temporary performance loss is significant. While modern branch target buffer (BTB) technology makes this flush/refill penalty fairly rare, the penalty that accrues from the remaining branch mispredictions is a serious impediment to even higher processor performance. Advanced mechanisms that can reduce this residual misprediction penalty can be of enormous value in future microprocessor designs. In this paper we describe the design and performance of a promising new mechanism called the Misprediction Recovery Cache (MRC). The key results of our study are. (1) Small, finite sized MRCs (16 to 256 entry) can effectively reduce branch penalty in deeply pipelined processors. (2) Commercial Benchmarks such as the Winstone benchmarks make better use of larger M RCs due to large number of unique branch instructions unlike the predominantly technical SPECint benchmarks. (3) The MRC hit rates increase with increasing BTB prediction accuracy (5-200% depending on MRC size) due to fewer residual mispredictions associated with better prediction. (4) For the processor architecture we studied, the M RC resulted in up to 20% improvement in cpi(cycles per instruction). (5) The incremental performance gain achievable by adding an MRC to a modern CISC processor (which uses a BTB with a two-level predictor) is two to three times of what was achievable by going from a one-level predictor to a two-level predictor. 相似文献

16.

使用取指策略控制同时多线程处理器中个体线程的性能

孙彩霞张民选《计算机学报》2008,31(2):309-317

当前,对同时多线程(Si multaneous Multithreading,SMT)处理器取指策略的研究大都集中在总体性能的优化上.文中提出一种新颖的SMT处理器取指策略(Controlling Performance of Individual Thread,CPIT),用于控制个体线程的执行.结果表明,对于模拟的所有负载,CPIT在94%以上的情况下都能保证受控线程获得期望性能.而对于失败的情况,受控线程的平均性能偏差不超过1.25%.此外,CPIT策略对处理器总体性能的影响并不大.与ICOUNT这种以优化性能为目标的取指策略相比,总体性能的平均降低不超过3%,而除受控线程外的其他线程的性能平均只降低了1.75%. 相似文献

17.

Recalling instructions from idling threads to maximize resource utilization for simultaneous multi-threading processors

Yilin Zhang Caleb Douglas Wei-Ming Lin 《Computers & Electrical Engineering》2013

Simultaneous Multi-Threading (SMT) has been a very popular design in improving resource utilization by sharing key datapath components among multiple independent threads. However, allowing any of the threads to overwhelm these shared resources not only leads to unfair thread processing but may also result in severely degraded overall performance. How to prevent idling threads from clogging the critical resources in the pipeline becomes a must in sustaining desired system performance. In this paper, we show that, if one can manage to recall instructions of idling threads from the shared Issue Queue (IQ), the system performance is easily enhanced by a significant margin, with up to 20% for some benchmark mixes. An even more noteworthy feature about this technique is that the ensuing hardware overhead is very insignificant and it can also be coupled with other advanced techniques employed in other stages of the SMT pipeline for potentially additive benefits. 相似文献

18.

一种精确的分支预测微处理器模型 总被引：3，自引：0，他引：3

陈跃跃周兴铭《计算机研究与发展》2003,40(5):741-745

在当今深流水宽发射的微处理器中，为实现高性能，精确的分支预测是不可缺少的关键技术．分支预测失效将浪费大量的时钟周期，无法发挥乱序执行的效能．宽发射微处理器的有效性能同时还依赖指令窗口的大小和指令预取宽度．提出了一种新的更精确的支持分支预测和分支误预测周期损失的微处理器模型．根据指令的执行带宽为指令窗口中可用指令数的平方根统计规律，给出了一个更为精确的描述微处理器取指带宽、分支预测精度、分支误预测周期损失、指令窗口大小和IPC之间关系的算法，并讨论了这些参数的综合权衡以及这些参数对程序IPC的影响．由此可以确定依赖多个微处理器参数的取指带宽阈值和微处理器中几个关键参数的选取．相似文献

19.

Thread-Sensitive Instruction Issue for SMT Processors

《Computer Architecture Letters》2004,3(1):5-5

Simultaneous Multi Threading (SMT) is a processor design method in which concurrent hardware threads share processor resources like functional units and memory. The scheduling complexity and performance of an SMT processor depend on the topology used in the fetch and issue stages. In this paper, we propose a thread sensitive issue policy for a partitioned SMT processor which is based on a thread metric. We propose the number of ready-to-issue instructions of each thread as priority metric. To evaluate our method, we have developed a reconfigurable SMT-simulator on top of the SimpleScalar Toolset. We simulated our modeled processor under several workloads composed of SPEC benchmarks. Experimental results show around 30% improvement compared to the conventional OLDEST_FIRST mixed topology issue policy. Additionally, the hardware implementation of our architecture with this metric in issue stage is quite simple. 相似文献

20.

ARP:同时多线程处理器中共享Cache自适应运行时划分机制 总被引：1，自引：1，他引：0

隋秀峰吴俊敏陈国良《计算机研究与发展》2008,45(7)

同时多线程是一种延迟容忍的体系结构,采用共享的二级Cache,在每个周期内可以执行多个线程的多条指令,这就会增加对存储层次的压力,文中主要研究了SMT处理器中多个并发执行的线程之间共享Cache的划分问题,尤其是Cache共享中的公平性问题以及它和吞吐量之间的关系,传统的LRU策略会根据线程的需要隐式地划分共享Cache,给具有较高需求的线程分配较多的Cache空间,对Cache的管理具有不公平性,从而会引起线程饿死、优先级反转等问题,实现了一种自适应、运行时划分机制(ARP)来管理共享Cache.ARP采用公平性作为划分的度量,并且使用动态划分算法来优化公平性,该算法具有易于实现,所需剖析较少的特点,硬件上使用经典的监控器来收集每个线程的栈距离信息,其存储开销不到0.25%.实验结果显示,与基于LRU的Cache划分相比,ARP可以将一个2路SMT处理器的公平性提高2.26倍,而将吞吐量平均提高14.75%. 相似文献