期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

李钟赵银亮杜延宁《计算机科学》2011,38(2):296-301

推测多线程技术通过推测执行的方式开发应用程序的线程级并行性,以提高程序执行性能。该技术一般通过执行模型来检测运行时可能的线程推测错误情况,并采取合适的机制恢复程序正确运行。描述的Prophet是一种基于硬件实现的推测多线程执行模型。重点描述了Prophet执行模型针对执行模型设计的关键问题的解决方案,包括Prophet的线程状态控制和多版本的Cach。系统,Prophet的多版本Cache系统提供了推测数据缓存功能,并使用基于总线监听的Cache协议实现了数据依赖违规检测。还给出了使用Olden基准程序对Prophet执行模型进行功能和性能测试的结果,并分析说明了Prophet系统可以有效地开发应用程序的线程级并行性。相似文献

2.

支持推测并行化的事务存储硬件模拟系统

李颀安虹李功明邓博斌《小型微型计算机系统》2013,34(5)

多核处理器通过增加处理器核数提高计算能力,虽然可以通过同时运行多道程序的方式利用处理器资源,但是多核处理器真正的成功取决于解决并行应用开发中的难题.为此,处理器体系结构和编程模型的协同开发是必须的.而随着核数的增多,传统上使用的软件模拟器因为软件的串行性而性能越来越差,无法支持这种软硬件协同开发.FPGA天生的并行性使它在模拟多核处理器时具有较高的模拟性能和高度的可扩放性,成为处理器体系结构研究的理想工具.本文介绍了基于FPGA的多核模拟系统,RAMP-Pink.该系统基于HASim实现,同时支持事务存储和线程级推测,用于对事务存储和线程级推测的软硬件协同开发.该模拟系统可配置不同的FPGA开发平台,也可以以软件模拟方式运行. 相似文献

3.

同时多线程处理器上的Cache性能分析与优化

隋秀峰吴俊敏陈国良《小型微型计算机系统》2009,30(1)

同时多线程(SMT)是一种延迟容忍的体系结构,它在每个周期内可以执行多个线程的多条指令.在SMT处理器上,对于片上共享存储这个复杂的结构资源,至今还没有很好的共享和冲突解决方案.本文着重研究了在多个并发执行的线程间划分共享Cache所存在的问题,指出基于LRU策略的传统Cache会根据需要隐式地划分共享Cache,这在某些情况下会导致全局性能的下降.针对这一问题并且考虑到SMT处理器上对Cache访问带宽的需求,本文提出采用一种多模块多体的Cache结构设计方案.并且在一个修改过的SMT模拟器上对该设计方案进行了性能评价.实验结果显示,相比于基于LRU策略的传统Cache,这一结构可以将一个4路SMT处理器的IPC提高9%. 相似文献

4.

OpenCMP：一个支持事务存储模型的多核处理器模拟器

何裕南安虹郭锐梁博《计算机科学》2007,34(1):248-254

CPU设计正在由仅开发指令级并行性的单线程单核结构转向利用线程级并行性的多线程多核结构，但至今还没有一个可移植性好并被广泛使用的开源多核处理器模拟器，限制了在这样的结构上开展高质量的研究工作。我们开发了一个多核处理器体系结构模拟器OpenCMP，用于支持当前和未来对多线程多核处理器体系结构关键技术的研究。该模拟器适当地抽象了多核处理器结构，为主流的多核处理器结构研究提供一个可扩展、灵活的模拟工具框架，包括支持对乱序、顺序的处理器核和同时多线程处理器核的模拟，以便对更大的多核设计空间进行比较性研究。本文以支持事务存储模型的多核处理器结构模拟器为例，详细描述了如何通过抽象多核结构和事务存储模型的最基本特性和组成部分，扩展单核处理器模拟器SimpleScalar，设计与实现一个多核处理器模拟器。初步研究表明，与现有的多核处理器模拟器相比，该模拟器能够较好地支持对事务存储模型和基于事务存储模型的多核处理器体系结构的研究．相似文献

5.

面向多核NUCA共享数据竞争问题的Bank一致性技术

下载免费PDF全文

吴俊杰潘晓辉《计算机工程与科学》2009,31(11)

非一致Cache体系结构(NUCA)几乎已经成为未来片上大容量cache的发展方向。多核处理器的NUCA结构中,多个处理器核对共享数据的竞争访问,可能导致数据经常处于中部的cache Bank,增加NUCA的访问延迟。本文提出支持数据副本的Bank一致性技术,通过有选择地在NUCA中为访问的处理器核创建不同的数据副本,Bank一致性技术能够缓解多核处理器对共享数据的竞争问题。本文详细地介绍了Bank一致性协议的设计方法。最后,使用全系统模拟器对8个NPB基准测试程序进行了详细评测。实验结果表明,Bank一致性技术能够有效缓解多核处理器中共享数据的竞争访问问题。相比不支持Bank一致性技术的CMP-DNUCA结构,本文的方法能将系统IPC性能平均提升5.95%。相似文献

6.

ARP:同时多线程处理器中共享Cache自适应运行时划分机制 总被引：1，自引：1，他引：0

隋秀峰吴俊敏陈国良《计算机研究与发展》2008,45(7)

同时多线程是一种延迟容忍的体系结构,采用共享的二级Cache,在每个周期内可以执行多个线程的多条指令,这就会增加对存储层次的压力,文中主要研究了SMT处理器中多个并发执行的线程之间共享Cache的划分问题,尤其是Cache共享中的公平性问题以及它和吞吐量之间的关系,传统的LRU策略会根据线程的需要隐式地划分共享Cache,给具有较高需求的线程分配较多的Cache空间,对Cache的管理具有不公平性,从而会引起线程饿死、优先级反转等问题,实现了一种自适应、运行时划分机制(ARP)来管理共享Cache.ARP采用公平性作为划分的度量,并且使用动态划分算法来优化公平性,该算法具有易于实现,所需剖析较少的特点,硬件上使用经典的监控器来收集每个线程的栈距离信息,其存储开销不到0.25%.实验结果显示,与基于LRU的Cache划分相比,ARP可以将一个2路SMT处理器的公平性提高2.26倍,而将吞吐量平均提高14.75%. 相似文献

7.

片上非一致Cache体系结构研究

下载免费PDF全文

贾小敏黄彩霞张民选孙彩霞齐树波《计算机工程与科学》2009,31(8)

随着集成电路制造工艺的发展,片上集成大容量Cache成为微处理器的发展趋势。然而,互连线延迟所占比例越来越大,成为大容量Cache的性能瓶颈,因此需要新的Cache体系结构来克服这些问题。非一致Cache体系结构通过在Cache内部支持多级延迟和数据块迁移来减少Cache的命中时间,提高性能,从而克服互连线延迟对大容量Cache的限制,已经成为微处理器片上存储结构的研究热点。本文回顾了非一致Cache体系结构模型的研究进展,特别是对片上多核处理器中的非一致Cache体系结构模型进行了详细介绍,比较了不同模型的贡献和不足。最后,对非一致Cache体系结构的发展进行了展望。相似文献

8.

多核处理器可重构Cache功耗计算方法的研究

《计算机科学》2014,(Z1)

多核动态可重构Cache是解决Cache功耗困扰的一个重要方法。现有Cache功耗模拟器并不能很好地支持多核动态可重构Cache功耗研究,通过对多核动态可重构Cache的功耗模型进行研究,找到了计算可重构Cache的方法和思路,应用CACTI来分别构建各个组成结构的Cache功耗模型,以较为准确地测算可重构Cache的功耗。在Simics模拟器下构建动态可重构Cache,运行测试程序,对比传统的体系结构,可重构Cache的功耗能够得到10.4%的降低。同时,实验中发现功耗的降低不仅仅是动态可重构Cache贡献的,而是由系统综合产生的,因此在低功耗设计中,要综合考虑整体系统的功耗和性能,避免片面地考虑Cache结构而导致整体功耗的提高。相似文献

9.

DSCF:一种面向共享存储多核DSP的数据流分簇前向技术

汪东陈书明《计算机研究与发展》2008,45(8)

多核数字信号处理器(DSP)的性能常常受限于共享存储的长延迟Cache一致性访问.数据前向(forwarding)技术是隐藏长延迟访问的一种有效手段.根据多核DSP应用的两类重要特征,提出了一种面向共享存储多核DSP结构的数据流分簇前向技术DSCF(data stream clustered forwarding).DSCF方法的主要特点是:兼容基本的共享存储Cache一致性协议;不污染目标Cache;数据的传输速度能够与消费速度相匹配;系统结构的可扩展性好.典型测试程序的模拟评测表明,采用DSCF方法能够将Cache一致性失效率平均降低44%,将系统总体性能提升30%～70%. 相似文献

10.

OpenSMT：一个同时多线程处理器模拟器的设计和实现

路放安虹梁博任建《计算机科学》2006,33(1):158-163

同时多线程（SMT）技术是目前微处理器体系结构的研究热点之一。为了支持对SMT技术和基于SMT核的单芯片多处理器（CMP）体系结构技术的深入研究，我们在广泛使用的超标体系结构模拟器Simple Sealar的基础上，通过对SMT结构的关键特性进行适当的抽象，开发了一个SMT体系结构模拟器OpenSMT。本文介绍了谊模拟器主要的设计思想和实现方法，包括多个线程上下文结构的表示、超标量流水线各个阶段的模拟，以及模拟器设计和实现时需要解决的几个关键问题等。初步的应用研究表明，与现有可免费获得的研究用SMT模拟器相比，该模拟器能够较好地平衡模拟性能、灵活性和精度三个基本设计目标，实现了执行驱动、易于扩展指令集结构、良好的用户接口、灵活的软件结构、适宜评估更广泛的SMT体系结构设计空间等设计要求。相似文献

11.

CMP Support for Large and Dependent Speculative Threads 总被引：1，自引：0，他引：1

Colohan C.B. Ailamaki A.C. Steffan J.G. Mowry T.C. 《Parallel and Distributed Systems, IEEE Transactions on》2007,18(8):1041-1054

Thread-level speculation (TLS) has proven to be a promising method of extracting parallelism from both integer and scientific workloads, targeting speculative threads that range in size from hundreds to several thousand dynamic instructions and which have minimal dependences between them. However, recent work has shown that TLS can offer compelling performance improvements when targeting much larger speculative threads of more than 50,000 dynamic instructions per thread with many frequent data dependences between them. To support such large and dependent speculative threads, the hardware must be able to buffer the additional speculative state and must also address the more challenging problem of tolerating the resulting cross-thread data dependences. In this paper, we present a chip-multiprocessor (CMP) support for large speculative threads that integrates several previous proposals for the TLS hardware. We also present a support for subthreads: a mechanism for tolerating cross-thread data dependences by checkpointing speculative execution. Through an evaluation that exploits the proposed hardware support in the database domain, we find that the transaction response time for three of the five transactions from TPC-C (on a simulated four-processor chip-multiprocessor) speed up by a factor of 1.9 to 2.9. 相似文献

12.

Shared write buffer to boost applications on SpMT architecture

Ming Chen John Ye Tianzhou Chen Hongjun Dai 《The Journal of supercomputing》2017,73(8):3508-3525

With the trend of growing number of integrated processing cores on Chip Multiprocessors, researchers are working hard to increase the available parallelism of software programs so as to efficiently harness the growing computing power. One noticeable direction among these efforts is speculative multi-threading (SpMT), a.k.a thread level speculation, which aims to extract thread level parallelism by splitting a sequential execution thread into several finer ones and execute them on parallel. A SpMT thread is in speculative status before it “knows” all its input data are correct. A speculative thread needs to write to the L1 cache, but its output might be discarded if the speculation eventually fails. However, another speculative thread may have already read in such speculative output. Therefore, some mechanism is needed to support speculative read and write. And because the SpMT threads are extracted from a single thread, they usually share lots of data, thus there might be intense data coherence among the L1 caches. It would be very complicated to support data coherence and speculation together. This Paper proposes a shared write buffer among the SpMT cores. We are able to confine the speculative read and write in the SWB, thus the speculation will not interference with coherence, and the L1 cache design could be drastically simplified. Experiments show that the SWB can capture a big portion of inter-core data sharing, reduce cache coherence, and drastically improve data access performance of SpMT threads. 相似文献

13.

Memory-aware kernel mechanism and policies for improving internode load balancing on NUMA systems

Mei-Ling Chiang Wei-Lun Su Shu-Wei Tu Zhen-Wei Lin 《Software》2019,49(10):1485-1508

Although nonuniform memory access architecture provides better scalability for multicore systems, cores accessing memory on remote nodes take longer than those accessing on local nodes. Remote memory access accompanied by contention for internode interconnection degrades performance. Properly mapping threads to cores and data accessed to their nodes can substantially improve performance and energy efficiency. However, an operating system kernel's load-balancing activity may migrate threads across nodes, which thus messes up the thread mapping. Besides, subsequent data mapping behavior pays for the cost of page migration to reduce remote memory access. Once unsuitable threads are migrated, it is detrimental to system performance. This paper focuses on improving the kernel's internode load balancing on nonuniform memory access systems. We develop a memory-aware kernel mechanism and policies to reduce remote memory access incurred by internode thread migration. The Linux kernel's load balancing mechanism is modified to incorporate selection policies in the internode thread migration, and the kernel is modified to track the amount of memory used by each thread on each node. With this information, well-designed policies can then choose suitable threads for internode migration. The purpose is to avoid migrating a thread that might incur relatively more remote memory access and page migration. The experimental results show that with our mechanism and the proposed selection policies, the system performance is substantially increased when compared with the unmodified Linux kernel that does not consider memory usage and always migrates the first-fit thread in the runqueue that can be migrated to the target central processing unit. 相似文献

14.

Speculative Versioning Cache

Vijaykumar T.N. Gopal S. Smith J.E. Sohi G. 《Parallel and Distributed Systems, IEEE Transactions on》2001,12(12):1305-1317

Dependences among loads and stores whose addresses are unknown hinder the extraction of instruction level parallelism during the execution of a sequential program. Such ambiguous memory dependences can be overcome by memory dependence speculation which enables a load or store to be speculatively executed before the addresses of all preceding loads and stores are known. Furthermore, multiple speculative stores to a memory location create multiple speculative versions of the location. Program order among the speculative versions must be tracked to maintain sequential semantics. A previously proposed approach, the Address Resolution Buffer (ARB) uses a centralized buffer to support speculative versions. Our proposal, called the Speculative Versioning Cache (SVC), uses distributed caches to eliminate the latency and bandwidth problems of the ARB. The SVC conceptually unifies cache coherence and speculative versioning by using an organization similar to snooping bus-based coherent caches. Our evaluation for the Multiscalar architecture shows that hit latency is an important factor affecting performance and private cache solutions trade-off hit rate for hit latency 相似文献

15.

A moving threads processor architecture MTPA

M. Forsell V. Leppänen 《The Journal of supercomputing》2011,57(1):5-19

Moving threads is a new kind of approach for multicore processor architectures. Traditionally, each thread stays in the core where it is created, and data is moved from the main memory via caches to each core and thread. In the moving threads approach, each core can access only a certain portion of the main memory via its local memory block, and thus extremely lightweight threads are moved between the cores. As a consequence, all kinds of cache coherence problems and need for read reply messages are eliminated. Also Lamport’s sequential consistency of shared memory multiprocessor systems is achieved for free. In this paper, we propose a processor architecture (MTPA) for the moving threads paradigm. We describe the overall structure, operation, instruction set, and thread management mechanism as well as evaluate the proposed architecture with different functional unit settings with simulations and give early silicon area and power consumption estimates. 相似文献

16.

存储模型仿真器的设计与实现 总被引：2，自引：1，他引：1

吴俊敏杨超陈国良张淼辉门珂《计算机研究与发展》2005,42(3):394-403

存储一致性问题和高速缓存一致性问题是共享存储并行计算机中两个最关键的问题,通过仿真器对它们进行了量化研究,设计并实现了一个存储模型仿真器MMS．基于MMS仿真了不同并行机结构模型下多种存储一致性模型的行为;针对不同类型的计算问题比较了不同的存储一致性模型,并对实验结果进行了分析;实现了几个不同的高速缓存一致性协议,并比较了它们的性能．相似文献

17.

Low-level implementation of the SISC protocol for thread-level speculation on a multi-core architecture

《Parallel Computing》2017

Chip Multiprocessors (CMP) have emerged during last decades as a very attractive solution in using the ever-increasing on-chip transistor count. However, classical parallelization techniques failed to fully exploit parallelization from existing sequential applications due to false data dependencies. This paper focuses on the Thread-level Speculation (TLS) technique, an alternative way to exploit the transistor budget in a CMP. With TLS, even possibly data dependent threads can run in parallel as long as the semantics of the sequential execution is preserved. A special hardware support monitors the actual data dependencies between threads at run time and, if they are violated, misspeculation effects are undone usually through replay. This kind of system is known as speculative CMP. However, the TLS mechanism requires complex protocols that integrate cache coherence and speculation to maintain program order among multiple versions of data. Current TLS protocol evaluations are usually inadequate because they are not done low-level enough. A realistic evaluation of speculative CMPs requires either to be performed on a real hardware or very detailed cycle-accurate simulator models.In this paper we are particularly focused on a low-level evaluation of the write-invalidate TLS protocol Speculation Integrated with Snoopy Coherence (SISC) protocol proposed in [1]. This evaluation relies on cycle-level simulation environment with detailed cycle-level cache memories, cache controller and system bus. On top of this, a speculative four core architecture is simulated and three new modules (Scheduler, Squash Arbiter and Supplier Arbiter) are provided to support low-level implementation of the SISC protocol. The overall cost of the SISC protocol is evaluated by means of CACTI tool for the three different domains: the access latency cost, the area cost, and the power cost. The evaluation goal was to keep the cache access time to remain below cycle latency as well as the area and power overheads below an acceptable budget overhead. The SISC protocol has been compared against regular MESI-based architecture in both 32-bit and 64-bit versions. We kept the cache access time below the cycle latency, and we managed to keep both data cache area and static power overheads respectively below 32% and 35%. 相似文献

18.

TACLeBench中内核程序循环级推测并行性分析

孟慧玲王耀彬李凌杨洋王欣夷刘志勤《计算机应用》2021,41(9):2652-2657

线程级推测（TLS）技术可挖掘程序并行执行潜能,提高多核资源利用率,但目前TACLeBench的内核基准仍未在TLS并行化中得到有效分析。针对该问题设计了循环级推测执行的剖析方案和剖析工具。选取7个代表性的TACLeBench内核基准程序,首先对程序进行初始化分析,选取程序热点片段插入循环标识;其次对这些片段进行交叉编译,记录程序推测线程与内存地址相关数据,剖析其循环级最大潜在并行性;最后综合探讨程序运行时的特征（线程粒度、可并行化覆盖率、依赖特征）以及源码对加速比的影响。实验结果表明：1）该类程序适合采用TLS加速,与串行执行结果相比,循环结构的推测执行下的大部分程序的加速比在2以上,其中最高加速比达到20.79;2）利用TLS加速TACLeBench内核程序时,多数应用可有效利用4核到16核的计算资源。相似文献

19.

A general compiler framework for speculative multithreaded processors

Bhowmik A. Franklin M. 《Parallel and Distributed Systems, IEEE Transactions on》2004,15(8):713-724

Speculative multithreading (SpMT) promises to be an effective mechanism for parallelizing nonnumeric programs, which tend to have irregular and pointer-intensive data structures and complex flows of control. Proper thread formation is crucial for obtaining good speedup in an SpMT system. This paper presents a compiler framework for partitioning a sequential program into multiple threads for parallel execution in an SpMT system. This framework is very general and supports speculative threads, nonspeculative threads, loop-centric threads, and out-of-order thread spawning. It is therefore useful for compiling for a wide variety of SpMT architectures. For effective partitioning of programs, the compiler uses profiling, interprocedural pointer analysis, data dependence information, and control dependence information. The compiler is implemented on the SUIF-MachSUIF platform. A simulation-based evaluation of the generated threads shows that the use of nonspeculative threads and nonloop speculative threads provides a significant increase in speedup for nonnumeric programs. 相似文献

20.

Disintermediated Active Communication

《Computer Architecture Letters》2006,5(2):15-15

Disintermediated active communication (DAC) is a new paradigm of communication in which a sending thread actively engages a receiving thread when sending it a message via shared memory. DAC is different than existing approaches that use passive communication through shared-memory - based on intermittently checking for messages - or that use preemptive communication but must rely on intermediaries such as the operating system or dedicated interrupt channels. An implementation of DAC builds on existing cache coherency support and exploits light-weight user-level interrupts. Inter-thread communication occurs via monitored memory locations where the receiver thread responds to invalidations of monitored addresses with a light-weight user-level software-defined handler. Address monitoring is supported by cache line user-bits, or CLUbits. CLUbits reside in the cache next to the coherence state, are private per thread, and maintain user-defined per-cache-line state. A light weight software library can demultiplex asynchronous notifications and handle exceptional cases. In DAC-based programs threads coordinate with one another by explicit signaling and implicit resource monitoring. With the simple and direct communication primitives of DAC, multi-threaded workloads synchronize at a finer granularity and more efficiently utilize the hardware of upcoming multi-core designs. This paper introduces DAC, presents several signaling models for DAC-based programs, and describes a simple memory-based framework that supports DAC by leveraging existing cache-coherency models. Our framework is general enough to support uses beyond DAC 相似文献