首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
本文根据曙光一号并处理机上移植UNIS操作系统(SNIX)的经验,对U-NIX操作系统从单机到多机的移植与改造技术进行了描述。包括多机硬体系结构的支持,UNIX源码选择,移植环境的建立,源码的改造等多方面所进行的工作。  相似文献   

2.
The allocation and disposal of memory is a ubiquitous operation in most programs. Rarely do programmers concern themselves with details of memory allocators; most assume that memory allocators provided by the system perform well. Yet, in some applications, programmers use domain-specific knowledge in an attempt to improve the speed or memory utilization of memory allocators. In this paper, we describe a program (CustoMalloc) that synthesizes a memory allocator customized for a specific application. Our experiments show that the synthesized allocators are uniformly faster and more space efficient than the Berkeley UNIX allocator. Constructing a custom allocator requires little programmer effort, usually taking only a few minutes. Experience has shown that the synthesized allocators are not overly sensitive to properties of input sets and the resulting allocators are superior even to domain-specific allocators designed by programmers. Measurements show that synthesized allocators are from two to ten times faster than widely-used allocators.  相似文献   

3.
We present a practical lock-free shared data structure that efficiently implements the operations of a concurrent deque as well as a general doubly linked list. The implementation supports parallelism for disjoint accesses and uses atomic primitives which are available in modern computer systems. Previously known lock-free algorithms of doubly linked lists are either based on non-available atomic synchronization primitives, only implement a subset of the functionality, or are not designed for disjoint accesses. Our algorithm only requires single-word compare-and-swap atomic primitives, supports fully dynamic list sizes, and allows traversal also through deleted nodes and thus avoids unnecessary operation retries. We have performed an empirical study of our new algorithm on two different multiprocessor platforms. Results of the experiments performed under high contention show that the performance of our implementation scales linearly with increasing number of processors. Considering deque implementations and systems with low concurrency, the algorithm by Michael shows the best performance. However, as our algorithm is designed for disjoint accesses, it performs significantly better on systems with high concurrency and non-uniform memory architecture.  相似文献   

4.
肖竟华  陈岚 《微机发展》2007,17(2):187-189
存储管理子系统作为操作系统中最重要的组成部分之一,对整个系统的运行起着举足轻重的作用。Linux继承了UNIX系统的优秀设计思想,并采用了许多先进算法来保持系统的高效性和稳定性。文中先概述了Linux2.4物理内存的管理,然后介绍了解决内存中碎片问题的伙伴系统算法和Slab分配器,并讨论了它们实现的要点,着重对Slab分配器中的几个数据结构进行了分析。  相似文献   

5.
Efficient, scalable memory allocation for multithreaded applications on multiprocessors is a significant goal of recent research. In the distributed computing literature it has been emphasized that lock-based synchronization and concurrency-control may limit the parallelism in multiprocessor systems. Thus, system services that employ such methods can hinder reaching the full potential of these systems. A natural research question is the pertinence and the impact of lock-free concurrency control in key services for multiprocessors, such as in the memory allocation service, which is the theme of this work. We show the design and implementation of NBmalloc, a lock-free memory allocator designed to enhance the parallelism in the system. The architecture of NBmalloc is inspired by Hoard, a well-known concurrent memory allocator, with modular design that preserves scalability and helps avoiding false-sharing and heap-blowup. Within our effort to design appropriate lock-free algorithms for NBmalloc, we propose and show a lock-free implementation of a new data structure, flat-set, supporting conventional “internal” set operations as well as “inter-object” operations, for moving items between flat-sets. The design of NBmalloc also involved a series of other algorithmic problems, which are discussed in the paper. Further, we present the implementation of NBmalloc and a study of its behaviour in a set of multiprocessor systems. The results show that the good properties of Hoard w.r.t. false-sharing and heap-blowup are preserved.  相似文献   

6.
7.
We present compiler analyses and optimizations for explicitly parallel programs that communicate through a shared address space. Any type of code motion on explicitly parallel programs requires a new kind of analysis to ensure that operations reordered on one processor cannot be observed by another. The analysis, calledcycle detection, is based on work by Shasha and Snir and checks for cycles among interfering accesses. We improve the accuracy of their analysis by using additional information fromsynchronization analysis, which handles post–wait synchronization, barriers, and locks. We also make the analysis efficient by exploiting the common code image property of SPMD programs. We make no assumptions on the use of synchronization constructs: our transformations preserve program meaning even in the presence of race conditions, user-defined spin locks, or other synchronization mechanisms built from shared memory. However, programs that use linguistic synchronization constructs rather than their user-defined shared memory counterparts will benefit from more accurate analysis and therefore better optimization. We demonstrate the use of this analysis for communication optimizations on distributed memory machines by automatically transforming programs written in a conventional shared memory style into a Split-C program, which has primitives for nonblocking memory operations and one-way communication. The optimizations includemessage pipelining, to allow multiple outstanding remote memory operations, conversion of two-way to one-way communication, and elimination of communication through data reuse. The performance improvements are as high as 20–35% for programs running on a CM-5 multiprocessor using the Split-C language as a global address layer. Even larger benefits can be expected on machines with higher communication latency relative to processor speed.  相似文献   

8.
Dynamic storage allocation is a vital component of programming systems intended for multiprocessor architectures that support globally shared memory. Highly parallel algorithms for access to system data structures lie at the core of effective memory allocation strategies as well as solutions to other parallel systems problems. In this paper, we investigate four algorithms, all based on the first fit approach, that provide different granularities of parallel access to the allocator's data structures. These solutions employ a variety of design techniques including specialized locking protocols, the use of atomic fetch-and- operations, and structural modifications. We describe experiments designed to compare the performance of these schemes. The results show that simple algorithms are appropriate when the expected number of concurrent requests per memory is low and the request pattern is not bursty. Algorithms that support finer granularity access while avoiding locking protocols are successful in a range of larger processor/memory ratios.This research was supported in part by the National Science Foundation under Grant Number DCR 8320136, DARPA/U.S. Army Engineer Topographic Laboratories under contract number DACA76-85-C-0001, and Unisys Corporation.A preliminary version appeared in International Conference on Parallel Processing, August 1987.  相似文献   

9.
In distributed shared memory multiprocessors, remote memory references generate processor-to-memory traffic, which may result in a bottleneck. It is therefore important to design algorithms that minimize the number of remote memory references. We establish a lower bound of three on remote reference time complexity for mutual exclusion algorithms in a model where processes communicate by means of a general read-modify-write primitive that accesses at most one shared variable in one instruction. Since the general read-modify-write primitive is a generalization of a variety of atomic primitives that have been implemented in multiprocessor systems, our lower bound holds for all mutual exclusion algorithms that use such primitives. Furthermore, this lower bound is shown to be tight by presenting an algorithm with the matching upper bound.  相似文献   

10.
动态内存分配器是现代应用程序重要组成部分, 它负责管理空闲内存并处理用户内存请求. 现代通用动态内存分配器能够提供较为平衡的性能与内存利用率, 但考虑到不同应用场景的内存使用情况和优化目标不同, 使用通用内存分配器并非最优解. 针对应用场景定制的专用内存分配器通常能够更好地满足系统需要, 然而编写专用内存分配器较为费时, 也容易出错. 开发者通常使用内存分配框架搭建专用动态内存分配器. 然而, 现有的内存分配框架存在抽象能力较差, 组合性与定制性不足的问题. 为此, 从函数式编程视角审视动态内存分配过程, 基于函数可组合性提出了一种可组合的定制化动态内存分配器框架榫卯. 榫卯框架将系统内存分配抽象为多个互不耦合的内存分配层级函数的组合, 这些层级函数能够扩展出策略槽, 以提供更高的定制性和组合性. 榫卯框架基于标准C实现, 依赖C预处理器的元编程特性实现层级函数组合的零性能开销. 开发者能够通过组合与定制分配器的层级函数, 快速构建出适合应用场景的内存分配器. 为了证明榫卯框架的有效性, 使用榫卯框架构建了3种不同的内存分配器实例: tlsfcc, hslab与wfslab, 其中tlsfcc针对多核嵌入式应用场景, 通过替换同步策略优化并发吞吐率; hslab是核心感知的slab式分配器, 通过定制线程缓存优化在异构硬件的性能; wfslab是低延迟的无等待/无锁分配器. 为了评估这3种内存分配器实例, 通过运行基准测试对比现有内存分配器. 实验分别在8核x86/64平台和8核异构aarch64嵌入式平台进行. 实验表明tlsfcc与原始tlsf分配器相比, 在上述两个平台上分别取得了平均1.76和1.59的加速比; 对比hslab与类似架构的tcmalloc, 它在两个平台的平均执行时间仅为tcmalloc的69.6%和85.0%; wfslab则取得了参与实验对比的内存分配器中最小的最差情况内存请求延迟, 其中包括目前最先进的无锁内存分配器mimalloc和snmalloc.  相似文献   

11.
Mehdi Badii 《Software》1998,28(5):463-480
This paper presents the implementation of multitasking functions of DYNIX Sequent computers on the UNIX operating system. The Sequent computers are shared memory multiprocessor computers running the DYNIX operating system. These functions support data and function partitioning. They let the user implement subprograms by the processors of a Sequent computer in parallel. The functions can synchronize, lock, and unlock data and program segments. As a result, the simulator allows the users to develop their multitasking programs on a uniprocessor computer such as a SUN workstation, and later port them to a Sequent computer. Further, the simulator adds a level of abstraction on top of UNIX for concurrent programming. The functions of the simulator allow the user to handle the communication and synchronization of the processes in a program at a higher level of abstraction, while concentrating on the design of multitasking algorithms. The simulator is applied to a parallel selection algorithm. © 1998 John Wiley & Sons, Ltd.  相似文献   

12.
The authors describe their multiprocessing extensions to Common Lisp. They have added a few simple, expressive features on which one can build high-level constructs. These consist of a multithreading mechanism, primitives for communication and synchronization (mailboxes and signals), and a feature called futures. A few examples clarify how the primitives work and demonstrate their expressiveness. When Spur Lisp is ported to and optimized on the Spur workstation (a shared memory multiprocessor), programmers can use it to make symbolic programs parallel  相似文献   

13.
We compare five existing dynamic memory allocators optimized for GPUs and show their strengths and weaknesses. In the measurements, we use three generic evaluation tests proposed in the past and we add one with a real workload, where dynamic memory allocation is used in building the k‐d tree data structure. Following the performance analysis we propose a new dynamic memory allocator and its variants that address the limitations of the existing dynamic memory allocators. The new dynamic memory allocator uses few resources and is targeted towards large and variably sized memory allocations on massively parallel hardware architectures.  相似文献   

14.
15.
In the standard kernel organization on a bus-based multiprocessor, all processors share the code and data of the operating system; explicit synchronization is used to control access to kernel data structures. Distributed-memory multicomputers use an alternative approach, in which each instance of the kernel performs local operations directly and uses remote invocation to perform remote operations. Either approach to interkernel communication can be used in a large-scale shared-memory multiprocessor. In the paper we discuss the issues and architectural features that must be considered when choosing between remote memory access and remote invocation. We focus in particular on experience with the Psyche multiprocessor operating system on the BBN Butterfly Plus. We find that the Butterfly architecture is biased towards the use of remote invocation for kernel operations that perform a significant number of memory references, and that current architectural trends are likely to increase this bias in future machines. This conclusion suggests that straightforward parallelization of existing kernels (e.g. by using semaphores to protect shared data) is unlikely in the future to yield acceptable performance. We note, however, that remote memory access is useful for small, frequently-executed operations, and is likely to remain so.  相似文献   

16.
The comparative performance is studied of different message passing system designs experimentally on a shared memory Encore Multimax multiprocessor. The systems are measured both by benchmarks and by running example parallel applications. To act as a control, the shared memory machine results are compared with the performance of the benchmarks and applications on the Intel iPSC/2 running the NX/2 operating system. The design alternatives considered are buffering, buffer organization, reference and value semantics, synchronization, coordination strategy and the location of the system in user or kernel space. The results include measurements of the effects of the design alternatives, memory caching, message sizes and copying  相似文献   

17.
In this paper, we propose a method about task scheduling and data assignment on heterogeneous hybrid memory multiprocessor systems for real‐time applications. In a heterogeneous hybrid memory multiprocessor system, an important problem is how to schedule real‐time application tasks to processors and assign data to hybrid memories. The hybrid memory consists of dynamic random access memory and solid state drives when considering the performance of solid state drives into the scheduling policy. To solve this problem, we propose two heuristic algorithms called improvement greedy algorithm and the data assignment according to the task scheduling algorithm, which generate a near‐optimal solution for real‐time applications in polynomial time. We evaluate the performance of our algorithms by comparing them with a greedy algorithm, which is commonly used to solve heterogeneous task scheduling problem. Based on our extensive simulation study, we observe that our algorithms exhibit excellent performance and demonstrate that considering data allocation in task scheduling is significant for saving energy. We conduct experiments on two heterogeneous multiprocessor systems. Copyright © 2016 John Wiley & Sons, Ltd.  相似文献   

18.
The authors present a fault injection and monitoring environment (FINE) as a tool to study fault propagation in the UNIX kernel. FINE injects hardware-induced software errors and software faults into the UNIX kernel and traces the execution flow and key variables of the kernel. FINE consists of a fault injector, a software monitor, a workload generator, a controller, and several analysis utilities. Experiments on SunOS 4.1.2 are conducted by applying FINE to investigate fault propagation and to evaluate the impact of various types of faults. Fault propagation models are built for both hardware and software faults. Transient Markov reward analysis is performed to evaluate the loss of performance due to an injected fault. Experimental results show that memory and software faults usually have a very long latency, while bus and CPU faults tend to crash the system immediately. About half of the detected errors are data faults, which are detected when the system is tries to access an unauthorized memory location. Only about 8% of faults propagate to other UNIX subsystems. Markov reward analysis shows that the performance loss incurred by bus faults and CPU faults is much higher than that incurred by software and memory faults. Among software faults, the impact of pointer faults is higher than that of nonpointer faults  相似文献   

19.
The butterfly barrier   总被引:3,自引:0,他引:3  
We describe and algorithm for barrier synchronization that requires only read and write to shared store. The algorithm is faster than the traditionallocked counter approach for two processors and has an attractive log2 N time scaling for largerN. The algorithm is free of hot spots and critical regions and requires a shared memory bandwidth which grows linearly withN, the number of participating processors. We verify the technique using both a real shared memory multiprocessor, for numbers of processors up to 30, and a shared memory multiprocessor simulator, for number of processors up to 256.Work performed under the auspices of the U.S. Department of Energy by the Lawrence Livermore National Laboratory under contract No. W-7405-ENG-48.  相似文献   

20.
Conventional multiprocessors mostly use centralized, memory-based barriers to synchronize concurrent processes created in multiple processors. These centralized barriers often become the bottleneck or hot spots in the shared memory. In this paper, we overcome the difficulty by presenting a distributed and hardwired barrier architecture, that is hierarchically constructed for fast synchronization in cluster-structured multiprocessors. The hierarchical architecture enables the scalability of cluster-structured multiprocessors. A special set of synchronization primitives is developed for explicit use of distributed barriers dynamically. To show the application of the hardwired barriers, we demonstrate how to synchronize Doall and Doacross loops using a limited number of hardwired barriers. Timing analysis shows an O(102) to O(105) reduction in synchronization overhead, compared with the use of software-controlled barriers implemented in a shared memory. The hardwired architecture is effective in implementing any partially ordered set of barriers or fuzzy barriers with extended synchronization regions. The versatility, scalability, programmability, and low overhead make the distributed barrier architecture attractive in constructing fine-grain, massively parallel MIMD systems using multiprocessor clusters with distributed shared memory  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号