Similar Literature
Found 20 similar documents (search time: 763 ms)
1.
Although the nonuniform memory access architecture provides better scalability for multicore systems, cores take longer to access memory on remote nodes than on local nodes. Remote memory access, accompanied by contention for the internode interconnect, degrades performance. Properly mapping threads to cores and data to the nodes that access it can substantially improve performance and energy efficiency. However, an operating system kernel's load-balancing activity may migrate threads across nodes, which disrupts the thread mapping. Moreover, subsequent data-mapping behavior pays the cost of page migration to reduce remote memory access, so migrating an unsuitable thread is detrimental to system performance. This paper focuses on improving the kernel's internode load balancing on nonuniform memory access systems. We develop a memory-aware kernel mechanism and policies that reduce the remote memory access incurred by internode thread migration. The Linux kernel's load-balancing mechanism is modified to incorporate selection policies for internode thread migration, and the kernel is extended to track the amount of memory used by each thread on each node. With this information, well-designed policies can choose suitable threads for internode migration, avoiding the migration of threads that would incur relatively more remote memory access and page migration. The experimental results show that with our mechanism and the proposed selection policies, system performance increases substantially compared with the unmodified Linux kernel, which does not consider memory usage and always migrates the first thread in the runqueue that can be migrated to the target central processing unit.
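As a rough illustration of such a selection policy, the sketch below picks the runnable thread with the smallest memory footprint on the source node, so that migrating it incurs the least remote access and page migration afterwards. The structures and per-node page counters are hypothetical, standing in for the kernel bookkeeping the paper adds; this is a sketch of the idea, not the paper's implementation.

    #include <stddef.h>

    #define MAX_NODES 8   /* illustrative NUMA node count */

    /* Hypothetical per-thread bookkeeping, standing in for the per-node
     * memory-usage tracking the paper adds to the Linux kernel. */
    struct task_info {
        int pid;
        unsigned long pages_on_node[MAX_NODES]; /* resident pages per node */
    };

    /* Memory-aware selection policy (sketch): among the migratable threads
     * on src_node's runqueue, pick the one with the fewest pages resident
     * there, so its migration causes the least remote access later. */
    struct task_info *pick_migration_candidate(struct task_info *runq[],
                                               size_t n, int src_node)
    {
        struct task_info *best = NULL;
        for (size_t i = 0; i < n; i++) {
            if (best == NULL ||
                runq[i]->pages_on_node[src_node] <
                best->pages_on_node[src_node])
                best = runq[i];
        }
        return best; /* NULL when the runqueue is empty */
    }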

2.
To capitalize on multicore power, modern high-speed data transfer applications usually adopt a multi-threaded design and aggregate multiple network interfaces. However, NUMA introduces another dimension of complexity for these applications. In this paper, we undertake comprehensive experiments on real systems to illustrate the importance of NUMA-awareness for applications with intensive memory accesses and network I/O. Instead of simply attributing the NUMA effect to the physical layout, we provide an in-depth analysis of the underlying interactions inside hardware devices. We profile system performance by monitoring the relevant hardware counters and reveal how the NUMA penalty arises during prefetch and cache-synchronization processes. We then implement a thread-mapping module in a bulk data transfer tool, BBCP, as a practical example of enabling NUMA-awareness. The enhanced application is evaluated on our high-performance testbed with storage area networks (SAN). Our experimental results show that the proposed NUMA optimizations significantly improve BBCP's performance in memory-based tests at various contention levels as well as in realistic data transfers involving SAN-based storage.
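A minimal sketch of the kind of placement such a thread-mapping module performs, assuming libnuma is available: the transfer thread is bound to the NUMA node closest to its NIC, and its buffer is allocated on that node. The NIC-to-node mapping (nic_node) would in practice be read from sysfs and is taken as given here; this is not BBCP's actual code.

    #include <numa.h>      /* libnuma; link with -lnuma */
    #include <stdlib.h>

    /* Bind the calling transfer thread near its NIC and give it a
     * node-local buffer; falls back to plain malloc on non-NUMA hosts. */
    void *setup_transfer_thread(size_t buf_bytes, int nic_node)
    {
        if (numa_available() < 0)
            return malloc(buf_bytes);

        numa_run_on_node(nic_node);                    /* thread affinity */
        return numa_alloc_onnode(buf_bytes, nic_node); /* local buffer   */
    }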

3.
Persistent memory (PM) is non-volatile, byte-addressable, low-latency and high-capacity; it breaks down the traditional boundary between memory and storage and has a disruptive impact on existing software architectures. However, current PM hardware still suffers from problems such as uneven wear and read/write asymmetry, and in particular from severe I/O performance degradation when PM is accessed across NUMA (nonuniform memory access) nodes. This paper proposes a NUMA-aware optimized design for a PM storage engine, applied in ZTE's new-generation database system GoldenX, which significantly reduces the overhead of cross-NUMA-node persistent memory access. The main contributions are: a cross-NUMA-node data-space distribution strategy and a distributed access model for a hybrid DRAM+PM memory architecture, enabling efficient use of the PM data space; an I/O proxy routine access method that addresses the high cost of cross-NUMA PM access by converting it into one remote DRAM memory copy plus a local PM access, together with a Cache Line Area (CLA) cache-page mechanism that mitigates I/O write amplification and improves local PM access efficiency; and an extension of the traditional tablespace concept that gives each tablespace both independent table data storage and dedicated WAL (write-ahead logging) storage; for this distributed WA...
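The I/O proxy idea can be pictured as follows: a requester on any node performs one remote DRAM copy into a staging buffer on the PM's home node, and a proxy routine on that node then performs the PM write locally. This is only a hedged sketch of the concept as described in the abstract; the descriptor layout and function names are invented, and GoldenX's actual engine is far more involved.

    #include <string.h>
    #include <stddef.h>

    /* Illustrative request descriptor; field names are assumptions. */
    struct io_request {
        void  *dram_staging;  /* DRAM buffer on the PM's home NUMA node */
        void  *pm_dest;       /* destination in persistent memory       */
        size_t len;
    };

    /* Requester side (any node): the only cross-node traffic is one
     * remote DRAM copy into the staging buffer. */
    void submit_request(struct io_request *req, const void *src)
    {
        memcpy(req->dram_staging, src, req->len);
        /* ...enqueue req to the proxy routine on the PM's home node... */
    }

    /* Proxy side (PM home node): the PM write is now purely local. */
    void proxy_commit(struct io_request *req)
    {
        memcpy(req->pm_dest, req->dram_staging, req->len);
        /* a real engine would also flush and fence here, e.g. via
         * pmem_persist() from libpmem, to make the write durable */
    }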

4.
Connected components of time-varying graphs are widely used in scenarios such as road-network construction and information push in recommender systems. However, most existing connected-component algorithms ignore the impact of the NUMA architecture on computational efficiency: excessive remote-memory access latency leads to poor execution efficiency. For the weakly connected component problem on time-varying graphs, this paper proposes a solution based on NUMA-aware delayed sending: through a careful data layout in memory and by controlling the number of information exchanges between NUMA nodes, it minimizes the number of remote memory accesses and thus significantly improves execution efficiency. Experimental results show that the method clearly outperforms the corresponding methods provided by the popular graph-processing systems Ligra and Polymer.
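A hedged sketch of the delayed-send idea: label-propagation updates destined for a remote NUMA node are accumulated in a node-local buffer and shipped in one batch, so remote memory is touched a few times in bulk rather than once per update. Buffer sizes and structure names below are illustrative only, not the paper's implementation.

    #include <stddef.h>

    #define BATCH 4096   /* illustrative batch size */

    /* Node-local buffer of label updates pending for one remote node. */
    struct update_buf {
        long   vertex[BATCH];
        int    label[BATCH];
        size_t n;
    };

    /* Stub: a real implementation applies the batch to dst_node-local
     * label storage; here it only marks the buffer consumed. */
    void flush_to_node(struct update_buf *buf, int dst_node)
    {
        (void)buf; (void)dst_node;
    }

    /* Delayed send: instead of writing the remote vertex's label directly
     * (one remote access per update), buffer it and flush in bulk. */
    void send_update(struct update_buf *buf, int dst_node, long v, int lbl)
    {
        buf->vertex[buf->n] = v;
        buf->label[buf->n]  = lbl;
        if (++buf->n == BATCH) {
            flush_to_node(buf, dst_node);  /* one batched remote exchange */
            buf->n = 0;
        }
    }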

5.
Memory affinity has become a key element in achieving scalable performance on multi-core platforms. Mechanisms such as thread scheduling, page allocation and cache prefetching are commonly employed to enhance memory affinity, which keeps data close to the cores that access it. In particular, software transactional memory (STM) applications exhibit irregular memory access behavior that makes it harder to determine which data will be needed by each core, and when. Additionally, existing STM runtime systems are decoupled from issues such as thread and memory management. In this paper, we therefore propose a skeleton-driven mechanism to improve memory affinity in STM applications that fit the worklist pattern, employing a two-level approach. First, it addresses memory affinity at the DRAM level by automatically selecting page-allocation policies. Then it employs data-prefetching helper threads to improve affinity at the cache level. It relies on a skeleton framework to exploit the application pattern in order to provide automatic memory-page allocation and cache prefetching. Our experimental results on the STAMP benchmark suite show that the proposed mechanism achieves performance improvements of up to 46%, with an average of 11%, over a baseline version on two NUMA multi-core machines.
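Selecting a DRAM page-allocation policy for a region, as the skeleton framework does automatically, can be done on Linux with the real mbind(2) system call; a minimal sketch (interleaving a worklist region across all nodes) follows. Whether interleave or first-touch wins for a given workload is exactly the choice the paper's mechanism automates.

    #include <numa.h>      /* libnuma; link with -lnuma */
    #include <numaif.h>

    /* Interleave the pages of [addr, addr+len) across all NUMA nodes,
     * one possible page-allocation policy for a shared worklist region. */
    long interleave_region(void *addr, unsigned long len)
    {
        struct bitmask *nodes = numa_all_nodes_ptr;   /* libnuma global */
        return mbind(addr, len, MPOL_INTERLEAVE,
                     nodes->maskp, nodes->size + 1, 0);
    }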

6.
The CC-NUMA (cache-coherent non-uniform memory access) architecture is an attractive solution for scalable servers. The performance of a CC-NUMA system depends heavily on the number of accesses to remote memory through the interconnection network. To reduce the number of remote accesses, an operating system needs to exploit the potential locality of the architecture. This paper describes the design and implementation of a UNIX-based operating system supporting the CC-NUMA architecture. The operating system implements various enhancements by revising kernel algorithms and data structures. This paper also analyzes the performance of the enhanced operating system by running commercial benchmarks on a real CC-NUMA system. The performance analysis shows that the operating system can achieve improved performance and scalability on CC-NUMA by implementing kernel data striping, localization and load balancing. Copyright © 2003 John Wiley & Sons, Ltd.

7.
On-chip distributed memory systems have become an attractive solution for the massive parallel memory accesses found in future many-core processors. However, the increasing numbers of on-chip cores and memory controllers inevitably introduce many remote memory accesses, which generate a large amount of on-chip traffic and put great pressure on the interconnect. This paper optimizes on-chip memory access traffic via runtime thread migration. We first analyze memory access behavior in multi-threaded applications and find that memory access targets and volumes are similar over short periods, which makes runtime prediction feasible; over long periods, however, the access targets exhibit great mobility, motivating us to dynamically move threads towards their data. Based on these observations, we propose a novel low-cost, distributed thread-migration algorithm that adjusts thread placement in chains based on benefit estimation. We present details of the workflow, including the triggering and arbitration of migration requests and the procedure for determining the migration chains. Simulation results show that our algorithm achieves a system performance speedup of 11.5% and reduces average memory access latency by 11.0%. It finds a few but effective thread migrations that optimize on-chip memory access traffic with acceptable hardware and runtime overheads.
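The benefit estimate driving such a migration decision might, in its simplest form, weigh the remote accesses a move would save against its one-time cost. The sketch below is an invented illustration of that trade-off, with all quantities assumed to come from runtime counters; the paper's chain-based estimation is more elaborate.

    /* Estimated net benefit (in cycles) of migrating a thread to node d:
     * accesses that become local minus accesses that become remote,
     * weighted by the remote penalty, minus the one-time move cost.
     * Migrate only when the estimate is positive. */
    long migration_benefit(long accesses_to_d,    /* counted at runtime */
                           long accesses_to_home,
                           long remote_penalty,   /* cycles per access  */
                           long move_cost)        /* context-move cost  */
    {
        return (accesses_to_d - accesses_to_home) * remote_penalty
               - move_cost;
    }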

8.
Dahlgren, F., Torrellas, J. Computer, 1999, 32(6): 72-79
The shared-memory concept makes it easier to write parallel programs, but tuning the application to reduce the impact of frequent long-latency memory accesses still requires substantial programmer effort. Researchers have proposed using compilers, operating systems, or architectures to improve performance by allocating data close to the processors that use it. The Cache-Only Memory Architecture (COMA) increases the chances of data being available locally because the hardware transparently replicates the data and migrates it to the memory module of the node that is currently accessing it. Each memory module acts as a huge cache in which each block has a tag with its address and state. The authors explain the functionality, architecture, performance, and complexity of COMA systems. They also outline different COMA designs, compare COMA to traditional nonuniform memory access (NUMA) systems, and describe proposed improvements in NUMA systems that target the same performance obstacles as COMA.

9.
Due to the inherent non-uniformity of the memory system, programmers and users of non-uniform memory access (NUMA) machines have to take special care of the memory performance of their applications. This paper discusses a variety of potential improvements with respect to cache misses, cache invalidations, and inter-node communication. The study is based on the simulation tool SIMT, which models the memory hierarchy in detail and is capable of providing complete, accurate information about all dynamic memory references. This information can be used to analyze the memory access behavior of applications and thereby forms the basis for any optimization with respect to memory accesses.

10.
Scalable shared-memory multiprocessor systems are typically NUMA (nonuniform memory access) machines, where exploitation of the memory hierarchy is critical to achieving high performance. Iterative data-parallel loops with near-neighbor communication account for many important numerical applications; in such loops, the communication of partial results stresses memory system performance. In this paper, we develop data placement schemes that minimize communication time when the near-neighbor interaction is determined by a stencil. Under a given loop partition, our compile-time algorithm partitions global data into four classes for each processor, each class with its own consistency-maintenance requirements. The ADAPT (Automatic Data Allocation and Partitioning Tool) system was implemented to automatically partition parallel code segments for the BBN TC2000, a scalable shared-memory multiprocessor. ADAPT caches global arrays and maintains data consistency in software through instructions that flush data from private caches. Restructuring a fluid-flow code segment with ADAPT improved performance by a factor of more than 3 on the BBN TC2000. Features of current-generation pipelined processors with multiple functional units permit the overlap of memory accesses with computation; our experiments on the BBN TC2000 show that the degree of overlap is limited by architectural parameters such as the number of CPU registers.

11.
The efficiency of the basic operations of a NUMA (nonuniform memory access) multiprocessor determines its parallel processing performance. The authors present several analytical models for predicting and evaluating the overhead of interprocessor communication, process scheduling, process synchronization, and remote memory access, taking network contention and memory contention into account. Performance measurements supporting the models and analyses, through several numerical examples, were carried out on the BBN GP1000, a NUMA shared-memory multiprocessor. The analytical and experimental results give a comprehensive understanding of the various effects, which is important for the effective use of NUMA shared-memory multiprocessors. The results presented can be used to determine optimal strategies for developing an efficient programming environment for a NUMA system.

12.
This paper compares data distribution methodologies for scaling the performance of OpenMP on NUMA architectures. We investigate the performance of automatic page placement algorithms implemented in the operating system, runtime algorithms based on dynamic page migration, runtime algorithms based on loop scheduling transformations and manual data distribution. These techniques present the programmer with trade-offs between performance and programming effort. Automatic page placement algorithms are transparent to the programmer, but may compromise memory access locality. Dynamic page migration algorithms are also transparent, but require careful engineering and tuned implementations to be effective. Manual data distribution requires substantial programming effort and architecture-specific extensions to the API, but may localize memory accesses in a nearly optimal manner. Loop scheduling transformations may or may not require intervention from the programmer, but conform better to an architecture-agnostic programming paradigm like OpenMP. We identify the conditions under which runtime data distribution algorithms can optimize memory access locality in OpenMP. We also present two novel runtime data distribution techniques, one based on memory access traces and another based on affinity scheduling of parallel loops. These techniques can be used to effectively replace manual data distribution in regular applications. The results provide a proof of concept that it is possible to scale a portable shared-memory programming model up to more than 100 processors, without modifying the API and without exposing architectural details to the programmer.
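One of the building blocks behind such runtime placement, first-touch page placement combined with affinity (static) loop scheduling, can be shown in plain OpenMP: initializing data under the same static schedule that later computes on it leaves each page on the node of the thread that will reuse it. A minimal sketch, assuming a first-touch page-placement OS policy and requiring no API extension:

    #include <omp.h>

    void init_and_compute(double *a, const double *b_in, double *b, long n)
    {
        (void)b_in;
        /* first touch: each page of a and b lands on the NUMA node of
         * the thread that touches it under this static schedule */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; i++) { a[i] = 0.0; b[i] = 1.0; }

        /* same static schedule => each thread reuses its local pages */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; i++) a[i] += 2.0 * b[i];
    }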

13.
Embedded manycore architectures are often organized as fabrics of tightly-coupled shared-memory clusters. A hierarchical interconnection system is used, with a crossbar-like medium inside each cluster and a network-on-chip (NoC) at the global level, which makes memory operations nonuniform (NUMA). Due to NUMA, the regular applications typically employed in the embedded domain (e.g., image processing, computer vision) ultimately behave as irregular workloads if a flat memory system is assumed at the program level. Nested parallelism is a powerful programming abstraction for these architectures, provided that (i) streamlined middleware support is available whose overhead does not dominate the run-time of fine-grained applications, and (ii) a mechanism to control thread binding at the cluster level is supported. We present a lightweight runtime layer for nested parallelism on cluster-based embedded manycores, integrating our primitives into the OpenMP runtime system and implementing a new directive to control NUMA-aware nested-parallelism mapping. We explore, on a set of real application use cases, how NUMA makes regular parallel workloads behave as irregular ones, and how our approach controls such effects, achieving up to 28x speedup versus flat parallelism.
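Stock OpenMP already gives a rough handle on this kind of mapping through proc_bind clauses; the sketch below spreads an outer team across clusters and packs inner teams inside each one. It only approximates the paper's dedicated directive, and the team sizes are invented.

    #include <omp.h>

    void nested_numa_work(void)
    {
        omp_set_max_active_levels(2);        /* enable nested parallelism */

        /* outer team: one thread per cluster, spread across the machine */
        #pragma omp parallel num_threads(4) proc_bind(spread)
        {
            /* inner team: workers packed inside the local cluster */
            #pragma omp parallel num_threads(8) proc_bind(close)
            {
                /* ... fine-grained work on cluster-local data ... */
            }
        }
    }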

14.
We present a scalable parallelization scheme for high-order stencil computations that also optimizes memory behavior on multicore clusters. Our multilevel approach combines: (i) inter-node parallelization via spatial decomposition; (ii) inter-core parallelization via multithreading and explicit non-uniform memory access (NUMA) control; (iii) data locality optimizations through auto-tuned tiling for efficient use of the hierarchical memory; and (iv) register blocking and data parallelism via single-instruction multiple-data techniques to utilize registers and exploit data locality. The scheme is applied to a sixth-order stencil-based finite-difference time-domain code. Weak-scaling parallel efficiency is over 98% on 32,768 BlueGene/P processors. Multithreading with explicit NUMA control attains a 9.9-fold speedup on a dual 12-core AMD Opteron system. Data locality optimizations achieve a 7.7-fold reduction in the last-level cache miss rate on Intel Nehalem, whereas register blocking increases data parallelism and thereby achieves 5.9 Gflops on a single core. Register blocking + multithreading achieves a 5.8-fold speedup on a single quad-core Nehalem.
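Register blocking, item (iv) above, can be pictured on a simple one-dimensional three-point stencil: values are rotated through locals (which the compiler keeps in registers) instead of being reloaded each iteration. A minimal sketch with invented coefficients follows; the paper's sixth-order 3-D kernel applies the same idea at larger scale.

    /* 1-D three-point stencil with register blocking: each input element
     * is loaded once and reused twice via the l/c/r rotation (assumes
     * n >= 3). */
    void stencil_register_blocked(const double *in, double *out, long n)
    {
        double l = in[0], c = in[1], r;
        for (long i = 1; i < n - 1; i++) {
            r = in[i + 1];                        /* only load per step */
            out[i] = 0.25 * l + 0.5 * c + 0.25 * r;
            l = c;                                /* rotate, don't reload */
            c = r;
        }
    }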

15.
Much research has focused on reducing and/or tolerating remote memory access latencies on distributed-memory parallel computers. Caching remote data is intended to reduce average access latency by handling as many remote memory accesses as possible using local copies of the data in the cache. Data-flow and multithreaded approaches help programs tolerate the latency of remote memory accesses by allowing processors to do other work while remote operations take place. The thread migration technique described here is a multithreaded architecture where threads migrate to remote processors that contain the data they need. By exploiting access locality, the threads often use several data items from that processor before migrating to other processors for more data. Because the threads migrate in search of data, the approach is called Nomadic Threads. A prototype runtime system has been implemented on the CM5 and is portable to other distributed-memory parallel computers.

16.
Transactional Memory (TM) is a programmer-friendly alternative to traditional lock-based concurrency. Although it intends to simplify concurrent programming, application performance still depends on how frequently threads synchronize and how they access shared data. These aspects must be taken into consideration to exploit the full potential of modern multicore platforms. Since these platforms feature complex memory hierarchies composed of different levels of cache, applications may suffer from memory latency and bandwidth problems if threads are not properly placed on cores. An interesting approach to efficiently exploiting the memory hierarchy is thread mapping. However, no single fixed thread mapping delivers the best performance across a large range of transactional workloads, TM systems and platforms. In this article, we propose and implement in a TM system a set of adaptive thread-mapping strategies for TM applications to tackle this problem, ranging from simple strategies that require no prior knowledge to strategies based on machine-learning techniques. Taking the Linux default strategy as the baseline, we achieved performance improvements of up to 64.4% on a set of synthetic applications and an overall improvement of up to 16.5% on the standard STAMP benchmark suite.
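An adaptive strategy of the simple kind mentioned above might just switch mappings based on an observed abort ratio; the sketch below illustrates that shape. The strategy names and thresholds are invented for illustration, and the paper's machine-learning-based strategies are considerably more sophisticated.

    /* Candidate thread-mapping strategies (illustrative). */
    enum mapping { MAP_LINUX_DEFAULT, MAP_COMPACT, MAP_SCATTER };

    /* Pick a mapping from the transaction-abort ratio measured in the
     * last profiling window: highly contended workloads are scattered
     * across sockets to reduce conflicts, while mostly-read workloads
     * are packed to share caches. Thresholds are invented. */
    enum mapping choose_mapping(double abort_ratio)
    {
        if (abort_ratio > 0.50)
            return MAP_SCATTER;
        if (abort_ratio < 0.10)
            return MAP_COMPACT;
        return MAP_LINUX_DEFAULT;
    }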

17.
Learning to classify parallel input/output access patterns
Input/output performance on current parallel file systems is sensitive to a good match between application access patterns and file system capabilities. Automatic input/output access-pattern classification can determine application access patterns at execution time, guiding adaptive file-system policies. In this paper, we examine and compare two novel input/output access-pattern classification methods based on learning algorithms. The first approach uses a feedforward neural network, previously trained on access-pattern benchmarks, to generate qualitative classifications. The second approach uses hidden Markov models, trained on access patterns from previous executions, to create a probabilistic model of input/output accesses. In a parallel application, access patterns can be recognized at the level of each local thread or as the global interleaving of all application threads. Classification at both levels is important for parallel file system performance; we propose a method for forming global classifications from local ones. We present results from parallel and sequential benchmarks and applications that demonstrate the viability of this approach.

18.
Servet is a suite of benchmarks focused on detecting a set of parameters with high influence on the overall performance of multicore systems. These parameters can be used for autotuning codes to increase their performance on multicore clusters. Although Servet has been shown to accurately detect cache hierarchies, bandwidths and bottlenecks in memory accesses, as well as the communication overhead among cores, the impact of using this information for application performance optimization had not yet been assessed. This paper presents a novel algorithm that automatically uses Servet to map parallel applications onto multicore systems and analyzes its impact on three testbeds using three different parallel programming models: message passing, shared memory and partitioned global address space (PGAS). Our results show that a suitable mapping policy based on the data provided by this tool can significantly improve the performance of parallel applications without source-code modification.

19.
This paper presents a novel parallel memory architecture for multimedia computers. By applying configurable or programmable addressing circuitry capable of parallel memory accesses, the memory management of multimedia applications can be enhanced. The necessary changes to the computer architecture, including virtual address representation, paging, virtual memory, address computation circuitry and data permutation, are discussed. These changes allow the memory to be partitioned for different access functions; in addition, the same memory area can be accessed by multiple access patterns. Therefore, a general-purpose computing system capable of exploiting the repeating memory access patterns of its applications can be built. The performance of the configurable parallel memory architecture (CPMA) is analyzed for a selection of algorithms from a video encoder. These motion-estimation algorithms and zigzag scanning benefit from the multiple memory access functions, as is apparent from comparisons with traditional sequential memory accesses.
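At the heart of such addressing circuitry is a function mapping a linear address to a (bank, row) pair so that the accesses of a given pattern land in distinct banks. The toy model below shows low-order interleaving with a strided pattern whose stride is coprime with the bank count, a classic conflict-free case; the constants are illustrative, not the paper's design.

    #include <stdio.h>

    #define BANKS 8                  /* illustrative bank count */

    static unsigned bank_of(unsigned addr) { return addr % BANKS; }
    static unsigned row_of(unsigned addr)  { return addr / BANKS; }

    int main(void)
    {
        /* stride 3 is coprime with 8, so 8 consecutive accesses hit
         * 8 distinct banks and can be served in parallel */
        const unsigned stride = 3;
        for (unsigned i = 0; i < BANKS; i++) {
            unsigned a = i * stride;
            printf("access %u -> bank %u, row %u\n",
                   a, bank_of(a), row_of(a));
        }
        return 0;
    }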

20.
The unpredictable nature of irregular memory accesses in mixed-memory applications such as deep learning poses many challenges due to communication issues. Typically, a multi-GPU node with a large number of simultaneous memory requests consumes almost 80% of its processing time on memory mapping. This calls for a characterization of mixed regular and irregular memory accesses so that memory divergence can be simplified to improve performance. In this paper, using the large deviations principle, it is shown that mixed regular and irregular memory accesses can be viewed as a combination of continuous and discrete functions. This viewpoint is shown to yield better performance through a characterization of memory divergence in a multi-GPU node using the sub-additivity property. Furthermore, a detection test procedure based on a quenched large-deviations model is proposed, which generates threshold values for optimizing memory mapping in data-intensive applications and thereby improves performance.
