首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 390 毫秒
1.
赖鑫  刘聪  王志英 《计算机工程》2012,38(24):228-234
在线程级猜测中进行数据依赖相关检测时,存在Cache一致性协议无法容忍线程切换引起的Cache块替换等问题。为此,通过分析推测线程数据管理模型,结合推测线程切概率低的特点,提出一种分布-共享式恢复缓冲区结构。该结构在进行Cache一致性检验时结合作废向量和版本优先级寄存器进行数据依赖检测,利用L2 Cache进行推测数据缓冲和恢复以支持推测线程切换。修改SESC模拟器以验证和评估该存储体系结构。实验结果表明,在保持模拟器理想加速比的情况下,该存储体系结构可以较好地支持推测线程切换。  相似文献   

2.
A new intermediate representation for software pipelined loops with conditions is proposed in the paper. The representation allows separation of operations from different paths and their conditional, as well as speculative scheduling, including speculative computation of conditions. An algorithm that transforms the representation into the executable code is presented. The algorithm uses the notion of finite automata to represent the execution of separate paths as threads of control that are canceled or approved by operations that actually compute the conditions. The approach may be used in conjunction with different scheduling techniques to reconstruct the control flow graph from the final schedule directly. It inherently solves the problems of overlapped predicate lifetimes and speculation. The approach provides also a novel formal model for loop execution.  相似文献   

3.
As chip multiprocessors with simultaneous multithreaded cores are becoming commonplace, there is a need for simple approaches to exploit thread-level parallelism. In this paper, we consider thread-level speculation as a means to reap thread-level parallelism out of application binaries. We first investigate the tradeoffs between scheduling speculative threads on the same core and on different cores. While threads contend for the same resources using the former approach, the latter approach is plagued by the overhead for inter-core communication. Despite the impact of resource contention, our detailed simulations show that the first approach provides the best performance due to lower inter-thread communication cost. The key contribution of the paper is the proposed design and evaluation of the dual-thread speculation system. This design point has very low complexity and reaps most of the gains of a system. The work was carried out while Fredrik Warg was a graduate student at Chalmers University of Technology.  相似文献   

4.
With the trend of growing number of integrated processing cores on Chip Multiprocessors, researchers are working hard to increase the available parallelism of software programs so as to efficiently harness the growing computing power. One noticeable direction among these efforts is speculative multi-threading (SpMT), a.k.a thread level speculation, which aims to extract thread level parallelism by splitting a sequential execution thread into several finer ones and execute them on parallel. A SpMT thread is in speculative status before it “knows” all its input data are correct. A speculative thread needs to write to the L1 cache, but its output might be discarded if the speculation eventually fails. However, another speculative thread may have already read in such speculative output. Therefore, some mechanism is needed to support speculative read and write. And because the SpMT threads are extracted from a single thread, they usually share lots of data, thus there might be intense data coherence among the L1 caches. It would be very complicated to support data coherence and speculation together. This Paper proposes a shared write buffer among the SpMT cores. We are able to confine the speculative read and write in the SWB, thus the speculation will not interference with coherence, and the L1 cache design could be drastically simplified. Experiments show that the SWB can capture a big portion of inter-core data sharing, reduce cache coherence, and drastically improve data access performance of SpMT threads.  相似文献   

5.
This paper presents a new parallelization model, called coarse-grained thread pipelining, for exploiting speculative coarse-grained parallelism from general-purpose application programs in shared-memory multiprocessor systems. This parallelization model, which is based on the fine-grained thread pipelining model proposed for the superthreaded architecture, allows concurrent execution of loop iterations in a pipelined fashion with runtime data-dependence checking and control speculation. The speculative execution combined with the runtime dependence checking allows the parallelization of a variety of program constructs that cannot be parallelized with existing runtime parallelization algorithms. The pipelined execution of loop iterations in this new technique results in lower parallelization overhead than in other existing techniques. We evaluated the performance of this new model using some real applications and a synthetic benchmark. These experiments show that programs with a sufficiently large grain size compared to the parallelization overhead obtain significant speedup using this model. The results from the synthetic benchmark provide a means for estimating the performance that can be obtained from application programs that will be parallelized with this model. The library routines developed for this thread pipelining model are also useful for evaluating the correctness of the codes generated by the superthreaded compiler and in debugging and verifying the simulator for the superthreaded processor  相似文献   

6.
推测多线程(speculative multithreading,简称SpMT)技术是一种实现非规则程序自动并行化的有效途径.然而,基于控制流图和分支预测技术的线程划分方法,不可避免地会受到划分路径上所存在的控制依赖和数据依赖的制约.目前,在传统的线程划分算法中存在的一个重要问题是,在对划分路径进行选取时只考虑了控制依赖影响却不能有效地综合考虑数据依赖的影响,进而导致不能选取最佳的划分路径.因此,针对传统方法中这种依赖评估方法效率低下的问题,设计并实现了一种基于路径优化的线程划分算法.该算法通过引入基于程序切片技术的预计算方法,建立一种路径评估方法来评估程序间的控制和数据依赖.同时,引入控制线程体大小的启发式规则,以便有效地解决负载不平衡的问题.基于Olden测试集的测试结果表明,所提出的算法可以有效地对非规则程序进行划分,其平均加速比可以达到1.83.  相似文献   

7.
Inter‐iteration dependences in loops can hinder loop‐level parallelism. For some loops, existing thread‐level speculation techniques fail to expose their inherent loop‐level parallelism, because some inter‐iteration dependences are too costly to synchronize, predict, pre‐compute and isolate. This paper presents a compiler technique called loop recreation to change the nature of some dependences (by turning some inter‐iteration dependences into intra‐iteration ones and vice versa) in a loop so that the inter‐iteration dependences in the transformed loop are less costly to enforce at runtime than those in the original loop. We present an algorithm for finding an optimal loop recreation transformation with respect to a simple misspeculation cost model and demonstrate the performance advantages of loop recreation over two recent techniques for multicore systems running nine representative irregular applications. Copyright © 2009 John Wiley & Sons, Ltd.  相似文献   

8.
CMP Support for Large and Dependent Speculative Threads   总被引:1,自引:0,他引:1  
Thread-level speculation (TLS) has proven to be a promising method of extracting parallelism from both integer and scientific workloads, targeting speculative threads that range in size from hundreds to several thousand dynamic instructions and which have minimal dependences between them. However, recent work has shown that TLS can offer compelling performance improvements when targeting much larger speculative threads of more than 50,000 dynamic instructions per thread with many frequent data dependences between them. To support such large and dependent speculative threads, the hardware must be able to buffer the additional speculative state and must also address the more challenging problem of tolerating the resulting cross-thread data dependences. In this paper, we present a chip-multiprocessor (CMP) support for large speculative threads that integrates several previous proposals for the TLS hardware. We also present a support for subthreads: a mechanism for tolerating cross-thread data dependences by checkpointing speculative execution. Through an evaluation that exploits the proposed hardware support in the database domain, we find that the transaction response time for three of the five transactions from TPC-C (on a simulated four-processor chip-multiprocessor) speed up by a factor of 1.9 to 2.9.  相似文献   

9.
多核处理器中,各个处理器核之间可以并发地进行外部存储访问,提供不同于单处理器的存储级并行(memory level parallelism)能力.不规则应用中的循环,传统的并行方法难以识别其并行性,不能充分利用多核处理器存储级并行能力和并行计算能力.对基于软件开发多核处理器存储级并行进行了讨论,提出一种前瞻并行多线程算法LLSM(loop level speculative mssultithreading).LLSM对不规则应用中的循环进行并行化,在多核处理器上的测试数据表明:该算法能够有效地挖掘多核处理器的存储级并行能力和计算能力,同时指出多核环境下存储级并行计算公式需要考虑线程同步开销.  相似文献   

10.
The advent of multicores presents a promising opportunity for speeding up the execution of sequential programs through their parallelization. In this paper we present a novel solution for efficiently supporting software-based speculative parallelization of sequential loops on multicore processors. The execution model we employ is based upon state separation, an approach for separately maintaining the speculative state of parallel threads and non-speculative state of the computation. If speculation is successful, the results produced by parallel threads in speculative state are committed by copying them into the computation’s non-speculative state. If misspeculation is detected, no costly state recovery mechanisms are needed as the speculative state can be simply discarded. Techniques are proposed to reduce the cost of data copying between non-speculative and speculative state and efficiently carrying out misspeculation detection. We apply the above approach to speculative parallelization of loops in several sequential programs which results in significant speedups on a Dell PowerEdge 1900 server with two Intel Xeon quad-core processors.  相似文献   

11.
12.
Speculative multithreading (SpMT) is a thread-level automatic parallelization technique, which partitions sequential programs into multithreads to be executed in parallel. This paper presents different thread partitioning strategies for nonloops and loops. For nonloops, we propose a cost estimation based on combined run-time effects of various speculation factors to predict the resulting performance of candidate threads to guide the thread partitioning. For loops, we parallelize all the profitable loops that can potentially offer additional performance benefits by multilevel spawning in loop bodies, loop iterations, and inner loops. Then we select a proper thread boundary located in the front of loop branch instruction to reduce invalid spawning threads that waste core resources. Experimental results show that the proposed approach can obtain a significant increase in speedup and Olden benchmarks reach a performance improvement of 6.62 % on average.  相似文献   

13.
Data decompression is one of the most important techniques in data processing and has been widely used in multimedia information transmission and processing. However, the existing decompression algorithms on multicore platforms are time-consuming and do not support large data well. In order to expand parallelism and enhance decompression efficiency on large-scale datasets, based on the software thread-level speculation technique, this paper raises a speculative parallel decompression algorithm on Apache Spark. By analyzing the data structure of the compressed data, the algorithm firstly hires a function to divide compressed data into blocks which can be decompressed independently and then spawns a number of threads to speculatively decompress data blocks in parallel. At last, the speculative results are merged to form the final outcome. Comparing with the conventional parallel approach on multicore platform, the proposed algorithm is very efficiency and obtains a high parallelism degree by making the best of the resources of the cluster. Experiments show that the proposed approach could achieve 2.6\(\times \) speedup when comparing with the traditional approach in average. In addition, with the growing number of working nodes, the execution time cost decreases gradually, and the speedup scales linearly. The results indicate that the decompression efficiency can be significantly enhanced by adopting this speculative parallel algorithm.  相似文献   

14.
控制与数据投机优化技术的研究   总被引:1,自引:0,他引:1  
控制投机和数据投机是提高程序指令级并行度的有效方法.为了保证投机指令的正确执行,须解决两个问题,即延迟触发控制投机指令导致的异常和数据投机中的别名歧义.这需要硬件的支持才能做到,所以以前在这方面的研究大多是在模拟器上进行的,侧重于描述对模拟器结构的扩展.而IA-64是第一个同时支持这两种优化的体系结构.基于此,作者用一个统一的框架在IA-64开放源码研究编译器(ORC)中首次实现了控制与投机优化.该文以编译器为侧重点,介绍了投机优化中的几个核心问题及其解决方法,其中包括一种新的用来维护投机代码正确性的算法.实验结果表明这种方法是有效的.  相似文献   

15.
结点间流水是解决数据分布和计算分割不一致时的一种重要的并行发掘技术.结点间流水通过计算与通信的重叠获得并行度.精确的流水粒度是获得良好的流水性能的关键.流水分块取决于很多因素,如程序规模、程序的访问模式、结点规模、结点的计算能力和存储体系、通信系统的性能、通信库开销等等.提出了动态profiling方式并实现在流水粒度的推导中,运行时信息收集部分典型分块,结合代价模型推导流水粒度,该模型考虑局部性优化;探索如何减少插桩执行的开销的同时保证代价模型的精度.实验证明,这种方式有更好的适应性,能获得较好的流水并行.  相似文献   

16.
王伟  李仁发  吴强 《计算机应用》2006,26(5):1237-1240
动态可重构技术允许根据计算的运行时情况对硬件处理单元进行重构,使其位宽适合计算的需要。而且,对代表计算密集型任务的循环计算进行位宽的动态优化可达到提高处理性能,减少所消耗的芯片资源和功耗的目的。本文构造了一个处理框架对循环计算的位宽进行动态的优化,包括对循环计算的位宽变化情况进行理论和运行时的分析,以及构造1个位宽管理算法选择重构的时机和对配置文件进行调度。通过对实验结果的分析,证明了我们的方案具有较好的性能。  相似文献   

17.
Chip-multiprocessor (CMP) architectures are a promising design alternative to exploit the ever-increasing number of transistors that can be put on a die. To deliver high performance on applications that cannot be easily parallelized, CMPs can use additional support for speculatively executing the possibly data-dependent threads of an application. For cross-thread dependences that must be handled dynamically, the threads can be made to synchronize and communicate either at the register level or at the memory level. In the past, it has been unclear whether the higher hardware cost of register-level communication is cost-effective. In this paper, we show that the wide-issue dynamic processors that will soon populate CMPs, make fast communication a requirement for high performance. Consequently, we propose an effective hardware mechanism to support communication and synchronization of registers between on-chip processors. Our scheme adds enough support to enable register-level communication without specializing the architecture toward speculation much. Finally, our scheme allows the system to achieve near ideal performance.  相似文献   

18.
GPUs and CPUs have fundamentally different architectures. It is conventional wisdom that GPUs can accelerate only those applications that exhibit very high parallelism, especially vector parallelism such as image processing. In this paper, we explore the possibility of using GPUs for value prediction and speculative execution: we implement software value prediction techniques to accelerate programs with limited parallelism, and software speculation techniques to accelerate programs that contain runtime parallelism, which are hard to parallelize statically. Our experiment results show that due to the relatively high overhead, mapping software value prediction techniques on existing GPUs may not bring any immediate performance gain. On the other hand, although software speculation techniques introduce some overhead as well, mapping these techniques to existing GPUs can already bring some performance gain over CPU. Based on these observations, we explore the hardware implementation of speculative execution operations on GPU architectures to reduce the software performance overheads. The results indicate that the hardware extensions result in almost tenfold reduction of the control divergent sequential operations with only moderate hardware (5–8%) and power consumption (1–5%) overheads.  相似文献   

19.
如何有效利用多核提供的丰富晶体管资源对串行程序的执行进行加速是当前研究中的热点问题。线程级推测(thread-level speculation,TLS)技术旨在充分利用多核资源,最大化地开发出串行代码中存在的潜在并行性。目前TLS技术已经在多种串行应用的并行化工作中得到有效利用,但嵌入式应用程序仍未在推测并行化方面进行有效的分析。因此,选取了八个具有代表性的嵌入式应用,对其在循环级推测并行化中的性能提升潜力和运行时特征(数据依赖、线程粒度和并行覆盖率)进行探讨。实验结果表明,利用线程级推测并行化嵌入式应用的加速效果优于指令级并行技术,实验中的最大加速比达到了13.29;在嵌入式应用领域,该技术可以有效地利用4到8核的计算资源。  相似文献   

20.
While Graphics Processing Units (GPUs) show high performance for problems with regular structures, they do not perform well for irregular tasks due to the mismatches between irregular problem structures and SIMD-like GPU architectures. In this paper, we introduce a new library, CUIRRE, for improving performance of irregular applications on GPUs. CUIRRE reduces the load imbalance of GPU threads resulting from irregular loop structures. In addition, CUIRRE can characterize irregular applications for their irregularity, thread granularity and GPU utilization. We employ this library to characterize and optimize both synthetic and real-world applications. The experimental results show that a 1.63× on average and up to 2.76× performance improvement can be achieved with the centralized task pool approach in the library at a 4.57% average overhead with static loading ratios. To avoid the cost of exhaustive searches of loading ratios, an adaptive loading ratio method is proposed to derive appropriate loading ratios for different inputs automatically at runtime. Our task pool approach outperforms other load balancing schemes such as the task stealing method and the persistent threads method. The CUIRRE library can easily be applied on many other irregular problems.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号