Similar Literature
20 similar documents retrieved.
1.
Hash Join Optimization on Multi-core Processors with a Shared Cache
邓亚丹  景宁  熊伟 《软件学报》2010,21(6):1220-1232
Targeting today's mainstream multi-core processors, this paper studies database hash join optimization on multi-core processors with a shared cache. It first proposes a multithreaded hash join execution framework based on the Radix-Join algorithm and analyzes, through examples, the factors that affect the performance of multithreaded Radix-Join. On this basis, it optimizes the various threads in the framework and their access to the shared cache, optimizes the memory accesses of the hash join algorithm for clustered joins, and analyzes the speedup of multithreaded cluster partitioning. The proposed multithreaded join execution framework is implemented on the open-source databases INGRES and EaseDB, and its performance is measured experimentally. The results show that the approach effectively resolves shared-cache access conflicts and processor load-balancing problems during multithreaded hash join execution and greatly improves hash join performance.
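For illustration only, a minimal C sketch of the cache-conscious radix-partitioning step that Radix-Join builds on; the tuple layout, radix width, and single-threaded form are assumptions here, not the paper's implementation.

    /* Illustrative sketch of radix partitioning: tuples are fanned out into
     * 2^RADIX_BITS partitions so that each per-partition hash table later
     * fits in the shared cache.  Layout and constants are assumed. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define RADIX_BITS 6                     /* 64 partitions per pass (assumed) */
    #define FANOUT     (1u << RADIX_BITS)
    #define MASK       (FANOUT - 1u)

    typedef struct { uint32_t key; uint32_t payload; } tuple_t;

    /* Partition src[0..n) into dst, returning per-partition start offsets. */
    void radix_partition(const tuple_t *src, tuple_t *dst, size_t n,
                         size_t offsets[FANOUT]) {
        size_t hist[FANOUT];
        memset(hist, 0, sizeof(hist));

        /* Pass 1: histogram of partition sizes. */
        for (size_t i = 0; i < n; i++)
            hist[src[i].key & MASK]++;

        /* Prefix sum turns the histogram into write cursors. */
        size_t sum = 0;
        for (size_t p = 0; p < FANOUT; p++) {
            offsets[p] = sum;
            sum += hist[p];
        }

        /* Pass 2: scatter tuples into their partitions. */
        size_t cursor[FANOUT];
        memcpy(cursor, offsets, sizeof(cursor));
        for (size_t i = 0; i < n; i++)
            dst[cursor[src[i].key & MASK]++] = src[i];
    }

    int main(void) {
        tuple_t src[] = {{7,0},{3,1},{66,2},{130,3},{5,4},{71,5}};
        size_t n = sizeof(src) / sizeof(src[0]);
        tuple_t dst[6];
        size_t offsets[FANOUT];
        radix_partition(src, dst, n, offsets);
        for (size_t i = 0; i < n; i++)
            printf("%u -> partition %u\n", dst[i].key, dst[i].key & MASK);
        return 0;
    }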

2.
In modern energy-saving replication storage systems, a primary group of disks is always powered up to serve incoming requests while other disks are often spun down to save energy during slack periods. However, since new writes cannot be immediately synchronized into all disks, system reliability is degraded. In this paper, we develop a high-reliability and energy-efficient replication storage system, named RERAID, based on RAID10. RERAID employs part of the free space in the primary disk group and uses erasure coding to construct a code cache at the front end to absorb new writes. Since the code cache supports failure recovery of two or more disks by using erasure coding, RERAID guarantees a reliability comparable with that of the RAID10 storage system. In addition, we develop an algorithm, called erasure coding write (ECW), to buffer many small random writes into a few large writes, which are then written to the code cache sequentially and in parallel to improve write performance. Experimental results show that RERAID significantly improves write performance and saves more energy than existing solutions.

3.
Array OLAP Query Processing Based on Vector Computation
Multi-core and many-core processors have become the mainstream configuration of new large-memory computing platforms with powerful parallel processing capability. Multi-core processors follow optimization techniques centered on the size of the LLC (Last Level Cache), whereas many-core processors such as the Phi and GPU coprocessors adopt designs with smaller caches and more hardware-level threads to hide memory access latency. As the number of processing cores grows, computing frameworks favor designs that target large numbers of cores, execute code efficiently, and scale well. This paper proposes Array OLAP, an in-memory analytical processing framework based on array storage and vector processing that simplifies the OLAP storage model and query processing model. In the Array OLAP framework, dimension tables are normalized into vector-based dimension filters, and the fact table is normalized into measure attributes with a multidimensional index. Through multidimensional index computation, a multidimensional query is reduced to a vector index scan over the fact table followed by aggregation according to the measure expressions. The normalized vector lookup and vector index scan execute efficiently, and the staged processing model adapts better to different computing platforms, assigning each computation stage to the most suitable platform. At the same time, Array OLAP is designed around the characteristics of data warehouse schemas; the vector processing model is simple and works well with the typically small and slowly growing dimension tables of a data warehouse. The paper describes the Array OLAP framework on different platforms and evaluates its performance with benchmark tests. Compared with current in-memory analytical databases, Array OLAP outperforms the mainstream systems and can migrate smoothly to new hardware platforms.
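For illustration only, a small C sketch of the core Array OLAP idea: dimension tables reduced to filter vectors indexed by surrogate key, and a single vector index scan over a columnar fact table with aggregation; all names and data below are assumptions.

    /* Illustrative sketch of a vector index scan.  Each dimension filter is a
     * bitmap-like vector indexed by the dimension's surrogate key; the fact
     * table keeps foreign-key columns plus a measure, in array layout. */
    #include <stdint.h>
    #include <stdio.h>

    #define N_FACTS 8

    int main(void) {
        /* Dimension filter vectors: 1 = row passes the query's predicate. */
        uint8_t date_filter[4]  = {0, 1, 1, 0};   /* e.g. year = 2015     */
        uint8_t store_filter[3] = {1, 0, 1};      /* e.g. region = 'East' */

        /* Fact table in columnar (array) layout: foreign keys + a measure. */
        uint32_t fk_date[N_FACTS]  = {0, 1, 2, 3, 1, 2, 0, 2};
        uint32_t fk_store[N_FACTS] = {0, 1, 2, 0, 2, 1, 2, 0};
        double   sales[N_FACTS]    = {5, 7, 3, 9, 2, 4, 8, 6};

        /* Vector index scan: one pass, aggregate where all filters pass. */
        double total = 0.0;
        for (int i = 0; i < N_FACTS; i++)
            if (date_filter[fk_date[i]] && store_filter[fk_store[i]])
                total += sales[i];

        printf("SUM(sales) = %.1f\n", total);   /* rows 2, 4 and 7 qualify */
        return 0;
    }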

4.
Resizable caches can trade off capacity for access speed to dynamically match the needs of the workload. In single-threaded cores, resizable caches have demonstrated their ability to improve processor performance by adapting to the phases of the running application. In Simultaneous Multi-Threaded (SMT) cores, the caching needs can vary greatly across the number of threads and their characteristics, thus offering even more opportunities to dynamically adjust cache resources to the workload. In this paper, we demonstrate that the preferred control methodology for data cache reconfiguring in an SMT core changes as the number of running threads increases. In workloads with one or two threads, the resizable cache control algorithm should optimize for cache miss behavior because misses typically form the critical path. In contrast, with several independent threads running, we show that optimizing for cache hit behavior has more impact, since large SMT workloads have other threads to run during a cache miss. Moreover, we demonstrate that these seemingly diametrically opposed policies are closely related mathematically; the former minimizes the arithmetic mean cache access time (which we will call AMAT), while the latter minimizes its harmonic mean. We introduce an algorithm (HAMAT) that smoothly and naturally adjusts between the two strategies with the degree of multi-threading. We extend a previously proposed Globally Asynchronous, Locally Synchronous (GALS) processor core with SMT support and dynamically resizable caches. We show that the HAMAT algorithm significantly outperforms the AMAT algorithm on four-thread workloads while matching its performance on one and two thread workloads. Moreover, HAMAT achieves overall performance improvements of 18.7%, 10.1%, and 14.2% on one, two, and four thread workloads, respectively, over the best fixed-configuration cache design.
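In notation chosen here (not taken from the paper), the two objectives the abstract contrasts are the arithmetic and harmonic means of the per-thread average data-cache access times $t_1, \dots, t_n$:
\[
\bar{t}_{\mathrm{arith}} = \frac{1}{n}\sum_{i=1}^{n} t_i,
\qquad
\bar{t}_{\mathrm{harm}} = \frac{n}{\sum_{i=1}^{n} 1/t_i}.
\]
Per the abstract, minimizing the arithmetic mean corresponds to the miss-oriented policy preferred at one or two threads, while minimizing the harmonic mean corresponds to the hit-oriented policy that wins with several independent threads; the HAMAT algorithm interpolates between the two as the thread count grows.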

5.
With the growing number of integrated processing cores on Chip Multiprocessors, researchers are working hard to increase the available parallelism of software programs so as to efficiently harness the growing computing power. One noticeable direction among these efforts is speculative multi-threading (SpMT), a.k.a. thread level speculation, which aims to extract thread level parallelism by splitting a sequential execution thread into several finer ones and executing them in parallel. An SpMT thread is in a speculative state before it “knows” that all its input data are correct. A speculative thread needs to write to the L1 cache, but its output might be discarded if the speculation eventually fails. However, another speculative thread may have already read in such speculative output. Therefore, some mechanism is needed to support speculative reads and writes. And because the SpMT threads are extracted from a single thread, they usually share lots of data, so there can be intense data-coherence traffic among the L1 caches. It would be very complicated to support data coherence and speculation together. This paper proposes a shared write buffer (SWB) among the SpMT cores. We are able to confine speculative reads and writes to the SWB, so speculation does not interfere with coherence, and the L1 cache design can be drastically simplified. Experiments show that the SWB can capture a big portion of inter-core data sharing, reduce cache coherence traffic, and drastically improve data access performance of SpMT threads.

6.
In current computer architectures, the communication performance between threads varies depending on the memory hierarchy. This performance difference must be considered when mapping parallel applications to processor cores. In parallel applications based on the shared memory paradigm, the communication is difficult to detect because it is implicit. Furthermore, dynamic mapping introduces several challenges, since it needs to find a suitable mapping and migrate the threads with a low overhead during the execution of the application. We propose a mechanism to detect the communication pattern of shared memory applications by monitoring cache coherence protocols. We also propose heuristics that, combined with our communication detection mechanism, allow the mapping to be performed dynamically by the operating system. Experiments with the NAS Parallel Benchmarks showed a reduction of up to 13.9% of the execution time, 30.5% of the cache misses and 39.4% of the number of invalidation messages.

7.
In this paper, we conduct performance scaling analysis of multithreaded multicore processors (MMPs) for parallel computing. We propose a thread-level closed-queuing network model covering a fairly large design space, accounting for hardware scaling models, coarse-grain, fine-grain, and simultaneous multithreading (SMT) cores, shared resources, including cache, memory, and critical sections. We then derive a closed-form solution for this model in terms of the speedup performance measure. This solution makes it possible to analyze performance scaling properties of MMPs along multiple dimensions. In particular, we show that for the parallelizable part of the workload, the speedup, in the absence of resource contention, is no longer just a linear function of parallel processing unit counts, as predicted by Amdahl’s law, but also a strong function of workload characteristics, ranging from strong memory-bound to strong CPU-bound workloads. We also find that with core multithreading, super linear speedup, higher than that predicted by Amdahl’s law, may be achieved for the parallelizable part of the workload, if core threads exhibit strong cache affinity and the workload is strongly memory-bound. Then, we derive a tight speedup upper bound in the presence of both memory resource contention and critical section for multicore processors with single-threaded cores. This speedup upper bound indicates that with resource contention among threads, whether it is due to shared memory or critical section, a sequential term is guaranteed to emerge from the parallelizable part of the workload, fundamentally limiting the scalability of multicore processors for parallel computing, in addition to the sequential part of the workload, as dictated by Amdahl’s law. As a result, to improve speedup performance for MMPs, one should strive to enhance memory parallelism and confine critical sections as locally as possible, e.g., to the smallest possible number of threads in the same core.
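As a reference point, the Amdahl baseline that the abstract extends is, in notation chosen here,
\[
S(N) = \frac{1}{(1-p) + p/N},
\]
with $p$ the parallelizable fraction of the workload and $N$ the number of processing units. The abstract's upper-bound result says that contention for shared memory or a critical section effectively adds a further sequential term arising from the parallelizable part itself, so the achievable speedup saturates below Amdahl's $1/(1-p)$ limit; the exact closed form is derived in the paper and is not reproduced here.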

8.
Weighted Shared-Cache Partitioning for Multithreaded Multiprogrammed Workloads
When parallel applications run on multi-core processors with a shared cache, conflicting accesses to the shared cache cause performance degradation and non-deterministic execution time. Shared-cache partitioning, which allocates the shared cache exclusively among multiple processes, is an effective way to solve this problem. Because of data sharing among threads, applications with different numbers of threads utilize the shared cache differently, but traditional shared-cache partitioning algorithms that minimize the miss rate (e.g., UCP) do not distinguish applications by thread count. This paper designs a weighted shared-cache partitioning framework for multithreaded multiprogrammed workloads (Weighted Cache Partitioning, WCP), consisting of application-oriented miss-rate monitors and a weighted cache partitioning algorithm. The miss-rate monitor dynamically tracks, per process, each application's miss rate under different cache capacities; the weighted cache partitioning algorithm extends the traditional miss-rate-optimal partitioning algorithm by assigning different weights to applications according to their thread counts, so that applications with more threads receive more of the shared cache and overall system performance improves. Experimental results show that although the weighted cache partitioning algorithm slightly increases the miss rate, it improves IPC throughput, weighted speedup, and fairness. On multiprogrammed workloads composed of scientific and engineering applications, the IPC throughput of WCP-1 exceeds that of the miss-rate-minimizing shared-cache partitioning algorithm by up to 10.8%, and by 5.5% on average.
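A greedy, utility-based sketch in the spirit of WCP is shown below; the weight function, the monitored miss-rate curves, and the marginal-utility loop are illustrative assumptions, not the paper's algorithm.

    /* Illustrative greedy sketch of weighted cache-way allocation.
     * miss[a][w] is the monitored miss count of application a when given
     * w ways; weight[a] grows with the application's thread count.  Each
     * step hands the next way to the application with the largest weighted
     * marginal gain.  All numbers are made up for illustration. */
    #include <stdio.h>

    #define APPS 2
    #define WAYS 8

    int main(void) {
        double miss[APPS][WAYS + 1] = {
            {900, 700, 560, 460, 390, 340, 305, 280, 265},  /* app 0: 1 thread  */
            {980, 820, 690, 580, 480, 395, 325, 270, 230},  /* app 1: 4 threads */
        };
        double weight[APPS] = {1.0, 4.0};   /* proportional to thread count */
        int alloc[APPS] = {0, 0};

        for (int w = 0; w < WAYS; w++) {
            int best = 0;
            double best_gain = -1.0;
            for (int a = 0; a < APPS; a++) {
                if (alloc[a] >= WAYS) continue;
                /* Weighted marginal utility of one more way for app a. */
                double gain = weight[a] *
                              (miss[a][alloc[a]] - miss[a][alloc[a] + 1]);
                if (gain > best_gain) { best_gain = gain; best = a; }
            }
            alloc[best]++;
        }

        for (int a = 0; a < APPS; a++)
            printf("app %d gets %d ways\n", a, alloc[a]);
        return 0;
    }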

9.
In the ongoing quest for greater computational power, efficiently exploiting parallelism is of paramount importance. Architectural trends have shifted from improving single-threaded application performance, often achieved through instruction level parallelism (ILP), to improving multithreaded application performance by supporting thread level parallelism (TLP). Thus, multi-core processors incorporating two or more cores on a single die have become ubiquitous. To achieve concurrent execution on multi-core processors, applications must be explicitly restructured to exploit parallelism, either by programmers or compilers. However, multithreaded parallel programming may introduce overhead due to communications among threads. Though some resources are shared among processor cores, current multi-core processors provide no explicit communications support for multithreaded applications that takes advantage of the proximity between cores. Currently, inter-core communications depend on cache coherence, resulting in demand-based cache line transfers with their inherent latency and overhead. In this paper, we explore two approaches to improve communications support for multithreaded applications. Prepushing is a software controlled data forwarding technique that sends data to the destination’s cache before it is needed, eliminating cache misses in the destination’s cache as well as reducing the coherence traffic on the bus. Software Controlled Eviction (SCE) improves thread communications by placing shared data in shared caches so that it can be found in a much closer location than remote caches or main memory. Simulation results show significant performance improvement with the addition of these architecture optimizations to multi-core processors.

10.
A Study of the Runtime Performance of Multithreaded OpenMP Applications on Multi-core Architectures
On multi-core platforms, dynamic migration of OpenMP threads between cores can degrade application performance to some extent. Binding threads to fixed cores so that they no longer migrate can potentially improve application performance and fully exploit the computing power of the multi-core platform. This paper describes how to bind OpenMP threads to cores using mainstream compilers' binding interfaces as well as the Linux kernel API, and uses the STREAM benchmark and NPB to measure and compare application performance before and after binding on blades of the "Magic Cube" (魔方) supercomputer at the Shanghai Supercomputer Center. The results show that the binding scheme can improve the performance of OpenMP applications.
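A minimal sketch of the Linux-kernel-API route the abstract mentions, pinning each OpenMP thread to one core with sched_setaffinity; the simple thread-i-to-core-i mapping is an assumption, and a similar effect is typically available through compiler interfaces such as the GOMP_CPU_AFFINITY or KMP_AFFINITY environment variables.

    /* Sketch: bind each OpenMP thread to one core via the Linux kernel API.
     * The "thread i -> core i" mapping is an assumption; a production run
     * would query the machine topology first.  Compile with -fopenmp. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();

            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(tid, &set);                 /* pin thread tid to core tid */
            if (sched_setaffinity(0, sizeof(set), &set) != 0)
                perror("sched_setaffinity");

            #pragma omp critical
            printf("thread %d bound to core %d\n", tid, sched_getcpu());
        }
        return 0;
    }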

11.
An overview is given of the architecture of, and overall design considerations for, the 11-unit, 32-b V80 microprocessor, which includes two 1-kB cache memories and a branch prediction mechanism that is a new feature for microprocessors. The V80's pipeline processing and system support functions for multiprocessor and high-reliability systems are discussed. Using V80 support functions, multiprocessor and high-reliability systems were realized without any performance drop. Cache memories and a branch prediction mechanism were used to improve pipeline processing. Various hardware facilities replaced the usual microprogram to ensure high performance.

12.
Shared memory parallelization of the flux kernel of PETSc-FUN3D, an unstructured tetrahedral mesh Euler flow code previously studied for distributed memory and multi-core shared memory, is evaluated on up to 61 cores per node and up to 4 threads per core. We explore several thread-level optimizations to improve flux kernel performance on the state-of-the-art many integrated core (MIC) Intel processor Xeon Phi “Knights Corner,” with a focus on strong thread scaling. While the linear algebraic kernel is bottlenecked by memory bandwidth for even modest numbers of cores sharing a common memory, the flux kernel, which arises in the control volume discretization of the conservation law residuals and in the formation of the preconditioner for the Jacobian by finite-differencing the conservation law residuals, is compute-intensive and is known to exploit effectively contemporary multi-core hardware. We extend the study of the performance of the flux kernel to the Xeon Phi in three thread affinity modes, namely scatter, compact, and balanced, in both offload and native mode, with and without various code optimizations to improve alignment and reduce cache coherency penalties. Relative to baseline “out-of-the-box” optimized compilation, code restructuring optimizations provide about 3.8x speedup using the offload mode and about 5x speedup using the native mode. Even with these gains for the flux kernel, with respect to execution time the MIC simply achieves par with optimized compilation on a contemporary multi-core Intel CPU, the 16-core Sandy Bridge E5 2670. Nevertheless, the optimizations employed to reduce the data motion and cache coherency protocol penalties of the MIC are expected to be of value for CFD and many other unstructured applications as many-core architecture evolves. We explore large-scale distributed-shared memory performance on the Cray XC40 supercomputer, to demonstrate that optimizations employed on Phi hybridize to this context, where each of thousands of nodes comprises two sockets of Intel Xeon Haswell CPUs with 32 cores per node.

13.
Efficient performance tuning of parallel programs is often hard. Optimization is often done after the program has been written, as a last effort to increase performance. With sequential programs, each executed code segment affects the completion time. In the case of a parallel program executed on a multiprocessor this is not always true, due to dependencies between the different threads. Thus, certain code segments of the execution may not affect the completion time of the program, and optimizing such code segments will not increase performance. In this paper we present an approach to optimize performance by finding the extended critical path of the multithreaded program. The extended critical path analysis is a generalization of critical path analysis in the sense that it also deals with more threads than processors. We have implemented the extended critical path analysis in a performance optimization tool. The tool allows the user to determine the extended critical path of a multithreaded application written for the Solaris operating system, for any number of processors, based on an execution on a single-processor workstation.

14.
This paper presents a helper thread prefetching scheme that is designed to work on loosely coupled processors, such as in a standard chip multiprocessor (CMP) system or an intelligent memory system. Loosely coupled processors have an advantage in that resources such as the processor and L1 cache are not contended for by the application and helper threads, hence preserving the speed of the application. However, interprocessor communication is expensive in such a system. We present techniques to alleviate this. Our approach exploits large loop-based code regions and is based on a new synchronization mechanism between the application and helper threads. This mechanism precisely controls how far ahead the execution of the helper thread can be with respect to the application thread. We found that this is important in ensuring prefetching timeliness and avoiding cache pollution. To demonstrate that prefetching in a loosely coupled system can be done effectively, we evaluate our prefetching by simulating a standard unmodified CMP system and an intelligent memory system where a simple processor in memory executes the helper thread. Evaluating our scheme with nine memory-intensive applications with the memory processor in DRAM achieves an average speedup of 1.25. Moreover, our scheme works well in combination with a conventional processor-side sequential L1 prefetcher, resulting in an average speedup of 1.31. In a standard CMP, the scheme achieves an average speedup of 1.33. Using a real CMP system with a shared L2 cache between two cores, our helper thread prefetching plus hardware L2 prefetching achieves an average speedup of 1.15 over the hardware L2 prefetching for the subset of applications with high L2 cache misses per cycle.

15.
The speculative execution of threads in a multithreaded architecture, plus the branch prediction used in each thread execution unit, allows many instructions to be executed speculatively, that is, before it is known whether they are actually needed by the program. In this study, we examine how the load instructions executed on what turn out to be incorrectly executed program paths impact the memory system performance. We find that incorrect speculation (wrong execution) at the instruction and thread level provides an indirect prefetching effect for the later correct execution paths and threads. By continuing to execute the mispredicted load instructions even after the instruction or thread-level control speculation is known to be incorrect, the cache misses observed on the correctly executed paths can be reduced by 16 to 73 percent, with an average reduction of 45 percent. However, we also find that these extra loads can increase the amount of memory traffic and can pollute the cache. We introduce the small, fully associative wrong execution cache (WEC) to eliminate the potential pollution that can be caused by the execution of the mispredicted load instructions. Our simulation results show that the WEC can improve the performance of a concurrent multithreaded architecture by up to 18.5 percent on the benchmark programs tested, with an average improvement of 9.7 percent, due to the reductions in the number of cache misses.

16.
This paper presents the Mitosis framework, which is a combined hardware-software approach to speculative multithreading, even in the presence of frequent dependences among threads. Speculative multithreading increases single-threaded application performance by exploiting thread-level parallelism speculatively - that is, executing code in parallel even when the compiler or runtime system cannot guarantee the parallelism exists. The proposed approach is based on predicting/computing thread input values via software, through a piece of code that is added at the beginning of each thread (the pre-computation slice). A pre-computation slice is expected to compute the correct thread input values most of the time, but not necessarily always. This allows aggressive optimization techniques to be applied to the slice to make it very short. This paper focuses on the microarchitecture that supports this execution model. The primary novelty of the microarchitecture is the hardware support for the execution and validation of pre-computation slices. Additionally, this paper presents new architectures for the register file and the cache memory in order to support multiple versions of each variable and allow for efficient roll-back in case of misspeculation. We show that the proposed microarchitecture, together with the compiler support, achieves an average speedup of 2.2 for applications that conventional non-speculative approaches are not able to parallelize at all.

17.
Multi-core x86_64 processors introduced an important change in architecture: a shared last-level cache. Historically, each processor has had access to a large private cache that seamlessly and transparently (to end users) interfaced with main memory. Previously, processes or threads only had to compete for memory bandwidth, but now they are competing for actual space. Competition for space and environmental resources is a problem studied in other scientific domains. This paper introduces methods from ecology to model multi-core cache usage with the competitive Lotka–Volterra equations. A model is presented and validated for characterizing the interaction of cores through shared caching, and for predicting the degree to which different workloads will interfere with each other’s execution from cache contention.
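The competitive Lotka–Volterra system the abstract refers to has the standard form
\[
\frac{dx_i}{dt} = r_i x_i\left(1 - \frac{x_i + \sum_{j\neq i}\alpha_{ij}x_j}{K_i}\right), \qquad i = 1,\dots,n,
\]
where, read in the cache setting (an interpretation assumed here, not quoted from the paper), $x_i$ is the cache occupancy of core or workload $i$, $K_i$ its stand-alone capacity demand, $r_i$ its rate of acquiring lines, and $\alpha_{ij}$ how strongly competitor $j$ displaces it; the fixed points of the coupled system then predict steady-state sharing and the degree of mutual interference.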

18.
Micro, IEEE, 2008, 28(3): 26-41
CPUs consume too much power. Modern complex cores sometimes waste power on functions that are not useful for the code they run. In particular, operating system kernels do not benefit from many power-consuming features intended to improve application performance. We advocate asymmetric single-ISA multicore systems, in which some cores are optimized to run OS code at greatly improved energy efficiency.

19.
Noise and radiation-induced soft errors (transient faults) in computer systems have increased significantly over the last few years and are expected to increase even more as we move toward smaller transistor sizes and lower supply voltages. Fault detection and recovery can be achieved through redundancy. The emergence of chip multiprocessors (CMPs) makes it possible to execute redundant threads on a chip and provide relatively low-cost reliability. State-of-the-art implementations execute two copies of the same program as two threads (redundant multithreading), either on the same or on separate processor cores in a CMP, and periodically check results. Although this solution has favorable performance and reliability properties, every redundant instruction flows through a high-frequency complex out-of-order pipeline, thereby incurring a high power consumption penalty. This paper proposes mechanisms that attempt to provide reliability at a modest power and complexity cost. When executing a redundant thread, the trailing thread benefits from the information produced by the leading thread. We take advantage of this property and comprehensively study different strategies to reduce the power overhead of the trailing core in a CMP. These strategies include dynamic frequency scaling, in-order execution, and parallelization of the trailing thread.

20.
An important aspect of understanding the behavior of applications with respect to their performance, overhead, and scalability characteristics is knowledge of their execution control flow. High level knowledge of which functions or constructs were executed after which other constructs allows reasoning about temporal application characteristics such as cache reuse. This paper describes an approach to capture and visualize the execution control flow of OpenMP applications in a compact way. Our approach does not require a full trace of program execution events but is instead based on a straightforward extension to the summary data already collected by an existing profiling tool. In multithreaded applications each thread may define its own independent flow of control, complicating both the recording as well as the visualization of the execution dynamics. Our approach allows for the full flexibility with respect to independent threads. However, the most common usage models of OpenMP have threads operate in a largely uniform way, synchronizing frequently at sequence points and diverging only to operate on different data items in worksharing constructs. Our approach accounts for this by offering a simplified representation of the execution control flow for threads with similar behavior.
