1.

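Multicore Shared-Cache Interference Analysis Based on Instruction Fetch-and-Execution Timing Ranges (total citations: 2, self-citations: 0, citations by others: 2)

陈芳园 张冬松 刘聪 王志英 《计算机研究与发展》2013,50(1):206-217

In multicore architectures, one of the greatest challenges in obtaining a safe and precise worst-case execution time (WCET) for parallel application threads is detecting contention for shared resources. In a multicore processor with a shared cache, a thread's instructions in the shared cache may be evicted by the instructions of other parallel threads, which causes inter-thread interference on the shared cache; WCET analysis of a thread on a multicore architecture must therefore account for this interference. Building on existing interference analysis based on simple address mapping, this work considers how instruction fetch and execution timing affects interference, proposes a sufficient but not necessary condition for the non-interference state, and determines a thread's interference state on the shared cache from the fetch-and-execution timing ranges of its instructions. Excluding non-interference states further tightens the WCET estimate of threads on multicore architectures. Theoretical analysis proves the effectiveness of the method. Experimental results show that, compared with existing methods that consider execution cycles and methods based on logical access order, the timing-based method improves the precision of WCET estimates by 12% and 7%, respectively.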
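As a rough illustration of the idea summarized in this abstract, the sketch below first applies the simple address-mapping test (two instruction lines can only conflict if they map to the same shared-cache set) and then rules out pairs whose time windows cannot overlap. The `Access` type, the cache geometry constants, and the interval-disjointness rule are all assumptions made for illustration; the abstract does not state the paper's actual condition.

```python
from dataclasses import dataclass

LINE_SIZE = 64   # bytes per cache line (assumed for illustration)
NUM_SETS = 256   # number of sets in the shared cache (assumed)


@dataclass(frozen=True)
class Access:
    """A cached instruction line and a conservative time window, in cycles.

    For the analyzed thread the window is when its line must stay cached;
    for a competing thread it is when its fetch may occur (hypothetical model).
    """
    addr: int
    earliest: int
    latest: int


def cache_set(addr: int) -> int:
    """Index of the cache set that an address maps to."""
    return (addr // LINE_SIZE) % NUM_SETS


def may_interfere(a: Access, b: Access) -> bool:
    """Conservative test: True unless interference can be ruled out."""
    # Address-mapping test: lines in different sets never evict each other.
    if cache_set(a.addr) != cache_set(b.addr):
        return False
    # Timing test: disjoint windows are a sufficient (not necessary)
    # condition for non-interference.
    if a.latest < b.earliest or b.latest < a.earliest:
        return False
    return True
```

Pairs for which `may_interfere` returns `False` could be dropped from the interference set before WCET estimation; everything else is kept, so the estimate stays safe under the stated assumptions.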

2.

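Design and Implementation of a Communication Mechanism for a Clock-Shared Multithreaded Processor

《电子技术应用》2016,(3):42-46

Multicore multithreaded processors [1] are one direction in the development of parallel technology. Building on the multicore multithreaded processor, this paper proposes a clock-shared multithreaded processor. The processor has two communication mechanisms: nearest-neighbor communication, which passes information through FIFOs shared between neighbors, and inter-thread communication, which passes information through memory shared between threads. This improves the processor's resource utilization and parallel execution capability.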

3.

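Research on a Parallel Programming Model for Multicore Cluster Systems

胡晨骏 王晓蔚 《计算机技术与发展》2008,18(4):70-73

Parallel computing is one of the important directions in the development of computer technology. Current parallel programming models fall mainly into two kinds: the message-passing model and the shared-memory model. With the development of multicore technology, which integrates two or more complete computing engines (cores) in a single processor, making full use of the characteristics of multicore computers and exploiting their performance has become an important research direction. This paper introduces a new MPI implementation mechanism that combines the advantages of the shared-memory and message-passing models: it uses the shared-memory model within a node and the message-passing model between nodes, and obtains better performance by automatically generating thread-level tasks.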

4.

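Research on a Parallel Programming Model for Multicore Cluster Systems

胡晨骏 王晓蔚 《微机发展》2008,18(4):70-73

Parallel computing is one of the important directions in the development of computer technology. Current parallel programming models fall mainly into two kinds: the message-passing model and the shared-memory model. With the development of multicore technology, which integrates two or more complete computing engines (cores) in a single processor, making full use of the characteristics of multicore computers and exploiting their performance has become an important research direction. This paper introduces a new MPI implementation mechanism that combines the advantages of the shared-memory and message-passing models: it uses the shared-memory model within a node and the message-passing model between nodes, and obtains better performance by automatically generating thread-level tasks.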

5.

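Non-Recursive Parallel Computation of Matrix Products on Multicore Computers

鹿中龙 钟诚 黄华林 《小型微型计算机系统》2011,32(5)

A latency-hiding data prefetching model is proposed that overlaps computation with memory access in order to achieve zero misses in the shared L2 cache. The concept of a basic block is introduced to simplify the algorithm's data structures and reduce storage overhead. Matrix elements are stored contiguously by basic block, which optimizes the algorithm at the level of the memory hierarchy and significantly reduces translation lookaside buffer (TLB) misses. A non-recursive strategy for scheduling basic blocks makes full use of the multicore computer's shared L2 cache to reduce the number of main-memory accesses and is not tied to any particular memory structure, which makes the algorithm cache-oblivious. Experimental results on multicore computers show that the proposed non-recursive, thread-level parallel algorithm for matrix multiplication is efficient and scalable.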
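The block-contiguous storage and non-recursive block traversal described in this abstract can be sketched as follows. This is a minimal sequential illustration only: it omits the prefetching model and the thread-level parallelism, assumes square matrices whose size `n` is a multiple of the block size `b`, and the function names are hypothetical rather than taken from the paper.

```python
def to_blocks(m, n, b):
    """Copy an n x n row-major matrix into block-contiguous storage.

    blocks[bi][bj] is a flat list holding the b*b elements of one block.
    """
    nb = n // b
    return [[[m[bi * b + i][bj * b + j] for i in range(b) for j in range(b)]
             for bj in range(nb)]
            for bi in range(nb)]


def from_blocks(blocks, n, b):
    """Convert block-contiguous storage back to a row-major matrix."""
    return [[blocks[r // b][c // b][(r % b) * b + (c % b)] for c in range(n)]
            for r in range(n)]


def block_matmul(a, bm, n, b):
    """Multiply two block-stored matrices with plain loops over blocks."""
    nb = n // b
    c = [[[0] * (b * b) for _ in range(nb)] for _ in range(nb)]
    for bi in range(nb):
        for bj in range(nb):
            cb = c[bi][bj]
            for bk in range(nb):
                ab, bb = a[bi][bk], bm[bk][bj]
                # Multiply one pair of contiguous blocks into the output block.
                for i in range(b):
                    for k in range(b):
                        aik = ab[i * b + k]
                        for j in range(b):
                            cb[i * b + j] += aik * bb[k * b + j]
    return c
```

In a threaded version, each `(bi, bj)` output block could be assigned to a different thread, since no two output blocks share writes.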

6.

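EDFUSE: An Asynchronous Event-Driven Framework for FUSE User-Level File Systems

段翰聪 王勇涛 李林 《计算机科学》2012,39(106):389-391

The user-space module of the open-source FUSE file system uses a multithreaded concurrency model; under high concurrency, synchronization between threads lowers system throughput and increases response time. Drawing on pipelined, staged data communication and the asynchronous event-driven network model, this work eliminates inter-thread synchronization and, together with measures such as optimizing the file and metadata caches to raise the cache hit rate, implements a user-space framework for asynchronous event-driven FUSE user-level file systems. Experimental results show that system throughput improves under heavy request loads.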

7.

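Research on a Hybrid Parallel Genetic Algorithm for Multicore Cluster Systems

王竹荣 巨涛 马凡 《计算机科学》2011,38(7):194-199

Traditional genetic algorithms evolve slowly on large-scale combinatorial optimization problems and struggle to meet real-time requirements. To address this, a model is proposed for implementing a hybrid "coarse-grained/master-slave" parallel genetic algorithm on a multicore PC cluster. The algorithm is mapped onto the cluster by combining the message-passing and shared-memory parallel programming models: the message-passing model (MPI) is used between nodes, where the corresponding genetic algorithm is coarse-grained parallel, and the shared-memory model (OpenMP) is used within a node, where the corresponding genetic algorithm is master-slave parallel. Hybrid MPI and OpenMP programming thus realizes the hybrid parallel genetic algorithm on the multicore cluster with two levels of parallelism, processes and threads. Theoretical analysis and experimental results show that the proposed model performs well and substantially remedies the shortcomings of traditional genetic algorithms, offering an effective and feasible way to handle large-scale combinatorial optimization problems with parallel genetic algorithms on ordinary multicore PC clusters.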

8.

A CMP Thread Scheduling Method Based on Hybrid Particle Swarm Optimization

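李静梅 张博 《计算机工程》2012,38(20):113-115

To improve the efficiency of thread scheduling on chip multiprocessor (CMP) architectures and exploit the parallel performance of CMPs, a thread scheduling method based on a hybrid particle swarm optimization algorithm is proposed. Following the thread scheduling model designed in the paper, a directed acyclic graph represents the threads and the dependencies among them, and an improved hybrid particle swarm algorithm schedules them. Experimental results show that the method is more efficient than an existing genetic algorithm, effectively reduces task execution time, and makes full use of the multicore architecture.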

9.

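A Survey of Helper Thread Prefetching Techniques (total citations: 1, self-citations: 0, citations by others: 1)

张建勋 古志民 《计算机科学》2013,40(7):19-23,39

Helper thread prefetching is one of the key techniques for improving the performance of irregular data-intensive applications on current multicore platforms, and it has become a research focus in recent years. Targeting the non-contiguous locality of memory accesses in irregular data-intensive applications, helper thread prefetching uses the shared last-level cache (LLC) of the CMP platform to convert an application's non-contiguous locality into transient contiguous spatio-temporal locality (instant locality), and thereby improves program performance through thread-level data prefetching. This survey classifies helper thread prefetching techniques, summarizes and compares the strengths and limitations of different helper thread implementation techniques, and analyzes in depth the prefetch control strategies of several typical existing helper thread techniques. Finally, it identifies research directions for helper thread prefetching in real-time helper thread control and in dynamic parameter selection and optimization.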

10.

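A Communication-Based Multicore Parallel Programming Method for the Win32 Platform

李青 徐璐娜 《计算机应用与软件》2014,(4):11-14

As computer hardware develops, multicore parallel computing appears ever more frequently in software and applications. Current multicore programming uses thread-level parallel models, and existing multithreaded parallel programming models fall mainly into three kinds: thread libraries, directive-based models, and task-based models. This paper proposes a communication-based method, similar to the MPI parallel programming model, for parallel programming on the Win32 platform, and on this basis implements the MTI parallel programming model. Results from several typical tests of parallel programming with MTI show that MTI is effective and easy to use.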

11.

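A Parallel Computing Task Allocation Method for Multicore Systems (total citations: 2, self-citations: 0, citations by others: 2)

卢宇彤 杨学军 所光 《计算机研究与发展》2009,46(Z1)

With the spread of multicore processors, today's large-scale parallel processing systems generally adopt them, which places higher demands on resource management and scheduling. This paper proposes a method based on partitioning shared cache resources, builds a process scheduling model for multicore processors that supports cache resource allocation, and designs and implements an algorithm for mapping parallel tasks onto multicore processors. The approach better solves the multicore task allocation problem in large-scale resource management systems, reduces mutual interference among concurrently running processes that share a cache, and improves application performance.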

12.

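Design and Implementation of a Memory Model Simulator (total citations: 2, self-citations: 1, citations by others: 1)

吴俊敏 杨超 陈国良 张淼辉 门珂 《计算机研究与发展》2005,42(3):394-403

Memory consistency and cache coherence are the two most critical problems in shared-memory parallel computers. They are studied quantitatively through simulation: a memory model simulator, MMS, is designed and implemented. Using MMS, the behavior of several memory consistency models is simulated under different parallel machine architecture models; different memory consistency models are compared on different types of computational problems, and the experimental results are analyzed; and several cache coherence protocols are implemented and their performance compared.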

13.

Cache-Based Synchronization in Shared Memory Multiprocessors

《Journal of Parallel and Distributed Computing》1996,32(1):11-27

In shared memory multiprocessors, efficient synchronization is imperative to assure good performance. There are two aspects to the “cost” of a synchronization operation: the first is the waiting time at synchronization points, and the second is the intrinsic overhead in performing the operation. The overhead has two components. The first component deals with contention resolution for synchronization operation among competing processors. The second component deals with the shared data accesses that the processor has to perform once it enters a synchronization region. We present a mechanism to reduce the overhead of performing synchronization operations in a cache-based shared memory multiprocessor. The mechanism is based on the intuitive notion that parallel programs invariably use synchronization operations to govern the access to shared data. Traditional multiprocessor cache protocols treat synchronization accesses the same way as normal read/write memory accesses, leading to inefficiencies in performing synchronization operations which ultimately limit the scalability of such systems. The key idea in our approach is to combine synchronization with the coherence maintenance for the cached data. Each cache line maintains states for synchronization as well as for cache coherence, and the cache protocol ensures the correctness of the synchronization operations and the coherence of the data at these synchronization points. To assess the performance gain due to the proposed mechanism, simulation studies are performed using a workload model that represents a dynamic scheduling paradigm which forms the core of several parallel programs. Results from simulation studies show that the proposed cache-based synchronization performs significantly better than traditional cache coherence approaches.

14.

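SMPDCA: A Shared Multi-Ported Data Cache Architecture

黄光奇 李子木 周兴铭 窦勇 《计算机学报》2001,24(12):1318-1323

With the rapid development of semiconductor process technology, the single-chip multiprocessor (SCMP) architecture is becoming an effective way to improve processor performance. Based on an analysis of the characteristics of the SCMP architecture, this paper proposes one implementation of it: the Shared Multi-Ported Data Cache Architecture (SMPDCA). SMPDCA has three notable advantages: minimal communication latency, no cache coherence maintenance overhead, and a higher data cache hit rate. Simulation results show that, compared with an architecture with private data caches, these advantages clearly improve application performance, most markedly for applications with frequent communication and interaction between processors.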

15.

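Parallel Computation Time Models and Parallel System Performance (total citations: 4, self-citations: 0, citations by others: 4)

乔香珍 《计算机学报》1998,21(5):413-418

This paper analyzes the factors that affect the performance of parallel algorithms and applications, focusing on new techniques in the architecture of shared-memory parallel machines and new features of parallel software systems. It proposes an improved model of parallel computation time and gives principles and examples for improving the performance of parallel algorithms and software.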

16.

Design of an adaptive cache coherence protocol for large scale multiprocessors

Yang Q. Thangadurai G. Bhuyan L.N. 《Parallel and Distributed Systems, IEEE Transactions on》1992,3(3):281-293

A large scale, cache-based multiprocessor that is interconnected by a hierarchical network such as hierarchical buses or a multistage interconnection network (MIN) is considered. An adaptive cache coherence scheme for the system is proposed based on a hardware approach that handles multiple shared reads efficiently. The new protocol allows multiple copies of a shared data block in the hierarchical network, but minimizes the cache coherence overhead by dynamically partitioning the network into sharing and nonsharing regions based on program behavior. The new cache coherence scheme effectively utilizes the bandwidth of the hierarchical networks and exploits the locality properties of parallel algorithms. Simulation experiments have been carried out to analyze the performance of the new protocol. The simulation results show that the new protocol gives 15% to 30% performance improvement over some existing cache coherence schemes on similar systems for a wide range of workload parameters.

17.

An adaptive cache coherence protocol specification for parallel input/output systems

Garcia-Carballeira F. Carretero J. Calderon A. Perez J.M. Garcia J.D. 《Parallel and Distributed Systems, IEEE Transactions on》2004,15(6):533-545

Caching has been intensively used in memory and traditional file systems to improve system performance. However, the use of caching in parallel file systems and I/O libraries has been limited to I/O nodes to avoid cache coherence problems. We specify an adaptive cache coherence protocol that is very suitable for parallel file systems and parallel I/O libraries. This model exploits the use of caching, both at processing and I/O nodes, providing performance improvement mechanisms such as aggressive prefetching and delayed-write techniques. The cache coherence problem is solved by using a dynamic scheme of cache coherence protocols with different sizes and shapes of granularity. The proposed model is very appropriate for parallel I/O interfaces, such as MPI-IO. Performance results, obtained on an IBM SP2, are presented to demonstrate the advantages offered by the cache management methods proposed.

18.

A superlinear speedup region for matrix multiplication

Marjan Gusev Sasko Ristov 《Concurrency and Computation》2014,26(11):1847-1868

The realization of modern processors is based on a multicore architecture with increasing number of cores per processor. Multicore processors are often designed such that some level of the cache hierarchy is shared among cores. Usually, last level cache is shared among several or all cores (e.g., L3 cache) and each core possesses private low level caches (e.g., L1 and L2 caches). Superlinear speedup is possible for matrix multiplication algorithm executed in a shared memory multiprocessor due to the existence of a superlinear region. It is a region where cache requirements for matrix storage of the sequential execution incur more cache misses than in parallel execution. This paper shows theoretically and experimentally that there is a region, where the superlinear speedup can be achieved. We provide a theoretical proof of existence of a superlinear speedup and determine boundaries of the region where it can be achieved. The experiments confirm our theoretical results. Therefore, these results will have impact on future software development and exploitation of parallel hardware on the basis of a shared memory multiprocessor architecture. Copyright © 2013 John Wiley & Sons, Ltd.

19.

A Low-Power Shared Cache Partitioning Scheme for Multicore Processors (total citations: 1, self-citations: 0, citations by others: 1)

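熊伟 殷建平 所光 赵志恒 《计算机工程与科学》2010,32(10):26-29

As multicore processors develop, on-chip cache capacity grows, and the cache accounts for an ever larger share of total chip power. Reducing cache power consumption has become a focus of cache design. This paper studies a low-power-oriented cache partitioning technique (LP-CP) for the shared cache of multicore processors. It proposes a cache partitioning framework in which miss-rate monitors added to the processor dynamically collect each program's miss rate; a low-power shared cache partitioning algorithm then computes a partitioning strategy that stays within a performance-loss threshold. On a dual-core system with a shared L2 cache, the scheme was evaluated with multiprogrammed workloads: at performance-loss thresholds of 1% and 3%, the proportion of the cache that was turned off reached 20.8% and 36.9%, respectively.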
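To make the threshold idea in this abstract concrete, here is a small greedy sketch, not the paper's LP-CP algorithm. It assumes that each program's monitor yields a miss-rate curve indexed by the number of cache ways, and it uses the increase in miss rate over the full allocation as a stand-in for performance loss. The function names, the way-based granularity, and that loss proxy are all hypothetical.

```python
def min_ways(miss_curve, max_loss):
    """Smallest way count whose extra miss rate stays within max_loss.

    miss_curve[w - 1] is the monitored miss rate with w ways; the last
    entry (all ways) serves as the baseline. This is an assumed model.
    """
    baseline = miss_curve[-1]
    for ways, miss in enumerate(miss_curve, start=1):
        if miss - baseline <= max_loss:
            return ways
    return len(miss_curve)


def partition(curves, total_ways, max_loss):
    """Give each program its minimum ways; report how many ways can be off."""
    alloc = [min_ways(curve, max_loss) for curve in curves]
    if sum(alloc) > total_ways:
        raise ValueError("threshold cannot be met with the available ways")
    return alloc, total_ways - sum(alloc)


# Two hypothetical programs sharing an 8-way cache.
curves = [
    [0.40, 0.25, 0.15, 0.10, 0.08, 0.07, 0.07, 0.07],
    [0.20, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05],
]
alloc, ways_off = partition(curves, total_ways=8, max_loss=0.02)
```

Ways left unallocated are the candidates for being powered down; loosening `max_loss` generally frees more of them, which matches the trend the abstract reports between the 1% and 3% thresholds.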

20.

Locally parallel cache design based on KL1 memory access characteristics

Akira Matsumoto Takayuki Nakagawa Masatoshi Sato Yasunori Kimura Kenji Nishida Atsuhiro Goto 《New Generation Computing》1991,9(2):149-169

The parallel inference machine (PIM) is now being developed at ICOT. It consists of a dozen or more clusters, each of which is a tightly coupled multiprocessor (comprising about eight processing elements) with shared global memory and a common bus. Kernel language 1 (KL1), a parallel logic programming language based on Guarded Horn Clauses (GHC), is executed on each PIM cluster. This paper describes the memory access characteristics in KL1 parallel execution and a locally parallel cache mechanism with hardware lock. The most important issue of locally parallel cache design is how to reduce common bus traffic. A write-back cache protocol having five cache states specially optimized for KL1 execution on each PIM cluster is described. We introduced new software controlled memory access commands, named DW, ER, and RP. A hardware lock mechanism is attached to the cache on each processor. This lock mechanism enables efficient word-by-word locking, reducing common bus traffic by using the cache states.