Similar Literature
20 similar documents found (search time: 31 ms)
1.
Modern processors such as Tilera’s Tile64, Intel’s Nehalem, and AMD’s Opteron are migrating memory controllers (MCs) on-chip, while maintaining a large, flat memory address space. This trend to utilize multiple MCs will likely continue and a core or socket will consequently need to route memory requests to the appropriate MC via an inter- or intra-socket interconnect fabric similar to AMD’s HyperTransport™ or Intel’s QuickPath Interconnect™. Such systems are therefore subject to non-uniform memory access (NUMA) latencies because of the time spent traveling to remote MCs. Each MC will act as the gateway to a particular region of the physical memory. Data placement will therefore become increasingly critical in minimizing memory access latencies. Increased competition for memory resources will also increase the memory access latency variation in future systems. Proper allocation of workload data to the appropriate MC will be important in decreasing the variation and average latency when servicing memory requests. The allocation strategy will need to be aware of queuing delays, on-chip latencies, and row-buffer hit-rates for each MC. In this paper, we propose dynamic mechanisms that take these factors into account when placing data in appropriate slices of physical memory. We introduce adaptive first-touch page placement, and dynamic page-migration mechanisms to reduce DRAM access delays for multi-MC systems. We also introduce policies that can handle data placement in memory systems that have regions with heterogeneous properties. The proposed policies yield average performance improvements of 6.5% for adaptive first-touch page-placement, and 8.9% for a dynamic page-migration policy for a system with homogeneous DRAM DIMMs. We also show improvements in systems that contain DIMMs with different performance characteristics.
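For orientation, the kind of decision an adaptive first-touch policy makes can be sketched as follows. This is not the authors' implementation; the per-MC statistics tracked and the cost weights below are illustrative assumptions.

/* Hypothetical sketch of adaptive first-touch page placement.
 * The weights and the per-MC statistics are assumptions, not the paper's values. */
#include <float.h>

struct mc_stats {
    double queue_delay;      /* average queuing delay at this MC (ns)          */
    double hop_latency;      /* on-chip distance from the faulting core (ns)   */
    double rowbuf_hit_rate;  /* recent row-buffer hit rate, 0.0 .. 1.0         */
};

/* Lower cost is better: long queues and far hops hurt,
 * a high row-buffer hit rate helps. */
static double mc_cost(const struct mc_stats *s)
{
    return s->queue_delay + s->hop_latency + 100.0 * (1.0 - s->rowbuf_hit_rate);
}

/* On a first-touch page fault, choose which MC (memory slice) backs the page. */
int place_first_touch(const struct mc_stats stats[], int num_mcs)
{
    int best = 0;
    double best_cost = DBL_MAX;
    for (int mc = 0; mc < num_mcs; mc++) {
        double c = mc_cost(&stats[mc]);
        if (c < best_cost) {
            best_cost = c;
            best = mc;
        }
    }
    return best; /* caller maps the virtual page to a frame owned by this MC */
}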

2.
Pete Boysen  Pinaki Shah 《Software》1993,23(3):235-241
Many Smalltalk implementations store objects in a large file called a virtual image. Each user must have a copy of the virtual image to execute. Since the image can exceed a megabyte in size, considerable disk space is required to support such a system in a multi-user environment. In this paper, a method is described which can reduce storage requirements for systems which use generation scavenging as a memory reclamation technique. This method also improves the performance of the checkpoint operation and offline garbage-collection.

3.
Non-volatile memory (NVM) provides a scalable and power-efficient solution to replace dynamic random access memory (DRAM) as main memory. However, because of the relatively high latency and low bandwidth of NVM, NVM is often paired with DRAM to build a heterogeneous memory system (HMS). As a result, data objects of the application must be carefully placed in NVM and DRAM for the best performance. In this paper, we introduce a lightweight runtime solution that automatically and transparently manages data placement on an HMS without requiring hardware modifications or disruptive changes to applications. Leveraging online profiling and performance models, the runtime solution characterizes the memory access patterns associated with data objects and minimizes unnecessary data movement. Our runtime solution effectively bridges the performance gap between NVM and DRAM. We demonstrate that using NVM to replace the majority of DRAM can be a feasible solution for future HPC systems with the assistance of software-based data management.
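A placement runtime of this kind ultimately applies a profitability test per data object: move an object from NVM to DRAM only if the cycles saved by DRAM's lower latency are expected to exceed the one-time migration cost. The sketch below illustrates that test only; the structure fields, latency figures, and function names are assumptions, not the paper's model.

/* Hedged sketch of a migration decision in a DRAM/NVM hybrid memory system.
 * Latency figures and the expected-access estimate are illustrative only. */
#include <stdbool.h>
#include <stddef.h>

struct data_object {
    size_t bytes;                 /* size of the object                        */
    double expected_accesses;     /* predicted accesses before the next decision */
};

/* Returns true if moving the object from NVM to DRAM is predicted to pay off. */
bool should_migrate_to_dram(const struct data_object *obj,
                            double nvm_latency_ns,   /* e.g. ~300 ns */
                            double dram_latency_ns,  /* e.g. ~80 ns  */
                            double copy_ns_per_byte) /* cost of the migration */
{
    double benefit = obj->expected_accesses * (nvm_latency_ns - dram_latency_ns);
    double cost    = (double)obj->bytes * copy_ns_per_byte;
    return benefit > cost;
}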

4.
Efficient, scalable memory allocation for multithreaded applications on multiprocessors is a significant goal of recent research. In the distributed computing literature it has been emphasized that lock-based synchronization and concurrency-control may limit the parallelism in multiprocessor systems. Thus, system services that employ such methods can hinder reaching the full potential of these systems. A natural research question is the pertinence and the impact of lock-free concurrency control in key services for multiprocessors, such as in the memory allocation service, which is the theme of this work. We present the design and implementation of NBmalloc, a lock-free memory allocator designed to enhance the parallelism in the system. The architecture of NBmalloc is inspired by Hoard, a well-known concurrent memory allocator, with a modular design that preserves scalability and helps avoid false-sharing and heap-blowup. Within our effort to design appropriate lock-free algorithms for NBmalloc, we propose and show a lock-free implementation of a new data structure, flat-set, supporting conventional “internal” set operations as well as “inter-object” operations, for moving items between flat-sets. The design of NBmalloc also involved a series of other algorithmic problems, which are discussed in the paper. Further, we present the implementation of NBmalloc and a study of its behaviour in a set of multiprocessor systems. The results show that the good properties of Hoard w.r.t. false-sharing and heap-blowup are preserved.
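For readers unfamiliar with lock-free concurrency control, the compare-and-swap retry pattern that allocators such as NBmalloc build on looks roughly like the following. This is a plain lock-free free-list push in C11 atomics, shown only to illustrate the pattern; it is not the flat-set algorithm from the paper.

/* Minimal illustration of the CAS retry pattern underlying lock-free structures.
 * A lock-free push onto a per-size-class free list; no locks are ever taken. */
#include <stdatomic.h>
#include <stddef.h>

struct block {
    struct block *next;
    /* payload follows */
};

static _Atomic(struct block *) free_list = NULL;

void free_list_push(struct block *b)
{
    struct block *old_head = atomic_load(&free_list);
    do {
        b->next = old_head;
        /* Retry until no other thread changed the head between our read
         * and our compare-and-swap. On failure, old_head is refreshed. */
    } while (!atomic_compare_exchange_weak(&free_list, &old_head, b));
}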

5.
As the range of computer applications keeps expanding, streaming-media applications and scientific computing are becoming important workloads for microprocessors. Streaming applications are characterized by abundant data parallelism, little data reuse, and a large amount of computation per memory access. Because of bandwidth limitations, traditional microprocessor architectures have difficulty serving these characteristics. The X processor is a stream processor; targeting the characteristics of stream applications, it adopts a novel three-level streaming memory hierarchy consisting of local register files, a stream register file, and off-chip memory, which effectively addresses the bandwidth problem. On a simulation platform, two test methods (RS codes and test programs) were used, verifying both that the streaming memory hierarchy effectively resolves the bandwidth bottleneck and that the design is correct.

6.
The paper concerns the strong uniform consistency and the asymptotic distribution of the kernel density estimator of random objects on a Riemannian manifold, proposed by Pelletier (Stat. Probab. Lett., 73(3):297–304, 2005). The estimator is illustrated via one example based on real data. This research was partially supported by Grants X-094 from the Universidad de Buenos Aires, PID 5505 from CONICET, and PAV 120 and PICT 21407 from ANPCyT, Argentina.
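For reference, Pelletier's estimator on a d-dimensional Riemannian manifold (M, g) is usually written in the following form (restated from the cited literature for orientation only; the notation here may differ in minor ways from the paper). Here X_1, ..., X_n are the observed random objects, K a kernel, h_n a bandwidth, d_g the geodesic distance, and theta the volume density function.

% Pelletier-type kernel density estimator on a d-dimensional Riemannian manifold
\hat f_n(p) \;=\; \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h_n^{\,d}}\,
\frac{1}{\theta_{X_i}(p)}\,K\!\left(\frac{d_g(p,X_i)}{h_n}\right)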

7.
Demand for memory capacity and bandwidth keeps increasing rapidly in modern computer systems, and memory power consumption is becoming a considerable portion of the system power budget. However, the current DDR DIMM standard is not well suited to effectively serve CMP memory requests from both a power and performance perspective. We propose a new memory module called a multicore DIMM, where DRAM chips are grouped into multiple virtual memory devices, each of which has its own data path and receives separate commands. The multicore DIMM is designed to improve the energy efficiency of memory systems with a small impact on system performance. Dividing each memory module into 4 virtual memory devices brings a simultaneous 22%, 7.6%, and 18% improvement in memory power, IPC, and system energy-delay product, respectively, on a set of multithreaded applications and consolidated workloads.

8.
Phase-change memory (PCM) has become a research hotspot for main memory thanks to its non-volatility, high read speed, and low static power consumption. However, the current lack of available PCM devices means that research on PCM-based algorithms cannot be effectively validated. This paper therefore proposes using a main-memory simulator to simulate and validate PCM algorithms. We first survey the characteristics of existing main-memory simulators and point out that they cannot fully satisfy the practical needs of current main-memory research; on this basis we design and build a hybrid main-memory simulator based on DRAM and PCM. Experimental comparisons with existing simulators show that the proposed hybrid main-memory simulator can effectively model a hybrid DRAM/PCM memory architecture, supports simulating hybrid main-memory systems of different forms, and is highly configurable. Finally, a usage example demonstrates the ease of use of the simulator's programming interface.

9.
Big-data applications demand ever larger memory capacity, and in such applications the problems of traditional main memory built from dynamic random access memory (DRAM) are becoming increasingly severe. Computer designers have therefore begun to consider replacing traditional DRAM main memory with non-volatile memory (NVM). As a non-volatile storage medium, NVM needs no dynamic refresh and therefore avoids the corresponding large energy consumption; moreover, its read performance is close to that of DRAM, and the capacity of an individual NVM cell scales well. Integrating NVM into existing computer systems as main memory, however, requires solving its security problems. With traditional DRAM as the memory medium, data is lost automatically once power is removed, so data does not reside in the medium for long; with NVM as a non-volatile medium, data can be retained for a comparatively long time. If an attacker gains access to an NVM device and scans its contents, the data in memory can be obtained; this security problem is defined as the data "recovery vulnerability". In data-center environments built on NVM modules, how to use NVM fully and effectively while guaranteeing its security has therefore become a problem that urgently needs to be solved. Starting from the security aspects of NVM, this paper surveys recent research topics and progress. It first summarizes the main security problems facing NVM, such as data theft, integrity violations, and data consistency and crash recovery...

10.
In modern data centers, virtualization plays a major role in resource management, server consolidation, and improving resource utilization, and has become a key abstraction layer and an important supporting technology in cloud computing architectures. In a virtualized environment, guaranteeing high resource utilization and system performance requires an efficient memory-management method, so that the physical memory size of each virtual machine can satisfy the constantly changing memory demands of its applications. How to dynamically regulate memory resources within a single host and across the data center therefore becomes a key problem. This work implements a low-overhead, high-accuracy memory working-set tracking mechanism and uses it to perform corresponding local and global memory regulation. Several dynamic memory-adjustment techniques are employed: ballooning can effectively and dynamically adjust memory for individual virtual machines within a single host; remote caching can schedule memory across physical machines; and virtual-machine migration can balance VM load across multiple physical hosts. The strengths and weaknesses of these schemes are analyzed in depth, and corresponding regulation policies are designed for different memory-overload situations. Experimental data show that the proposed predictive memory-resource management method can monitor and dynamically allocate memory resources online, effectively improving memory utilization and reducing energy consumption in the data center.

11.
Matrix-Matrix Multiplication (MMM) is a highly important kernel in linear algebra algorithms and the performance of its implementations depends on memory utilization and data locality. There are MMM algorithms, such as the standard algorithm and the Strassen–Winograd variant, and many recursive array layouts, such as Z-Morton or U-Morton. However, their data locality is lower than that of the proposed methodology. Moreover, several SOA (state-of-the-art) self-tuning libraries exist, such as ATLAS for the MMM algorithm, which tests many MMM implementations. During the installation of ATLAS, on the one hand an extremely complex empirical tuning step is required, and on the other hand a large number of compiler options are used, neither of which is within the scope of this paper. In this paper, a new methodology using the standard MMM algorithm is presented, achieving improved performance by focusing on data locality (both temporal and spatial). This methodology finds the scheduling which conforms with the optimum memory management. Compared with (Chatterjee et al. in IEEE Trans. Parallel Distrib. Syst. 13:1105, 2002; Li and Garzaran in Proc. of Lang. Compil. Parallel Comput., 2005; Bilmes et al. in Proc. of the 11th ACM Int. Conf. Super-comput., 1997; Aberdeen and Baxter in Concurr. Comput. Pract. Exp. 13:103, 2001), the proposed methodology has two major advantages. Firstly, the scheduling used at the tile level is different from that used at the element level, giving better data locality suited to the sizes of the memory hierarchy. Secondly, its exploration time is short, because it searches only for the number of levels of tiling used, and between (1, 2) (Sect. 4) for finding the best tile size for each cache level. A software tool (C code) implementing the above methodology was developed, taking the hardware model and the matrix sizes as input. This methodology achieves better performance than others across a wide range of architectures. Compared with the best existing related work, which we implemented, performance gains of up to 55% over the standard MMM algorithm and up to 35% over Strassen's are observed, both under recursive data array layouts.
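The data-locality idea at the heart of such methodologies is easiest to see in code: one level of loop tiling reorders the standard triple loop so that small blocks of A, B and C are reused from cache before eviction. The sketch below shows only this generic blocking; it is not the paper's two-level (tile-level and element-level) schedule, and the tile size T is an illustrative placeholder that the methodology would instead derive per cache level.

/* Generic one-level tiled matrix-matrix multiplication C += A * B
 * (row-major, n x n). Illustrates cache blocking only. */
#define T 64  /* tile size: chosen so a few T x T tiles fit in the target cache */

void mmm_tiled(int n, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += T)
        for (int kk = 0; kk < n; kk += T)
            for (int jj = 0; jj < n; jj += T)
                /* work on one T x T tile of A, B and C */
                for (int i = ii; i < ii + T && i < n; i++)
                    for (int k = kk; k < kk + T && k < n; k++) {
                        double a = A[i * n + k];        /* reused across the j loop */
                        for (int j = jj; j < jj + T && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}

Choosing T so that the working tiles fit in the cache level being targeted mirrors the paper's point that the tile size must match the sizes of the memory hierarchy.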

12.
Memory affinity has become a key element to achieve scalable performance on multi-core platforms. Mechanisms such as thread scheduling, page allocation and cache prefetching are commonly employed to enhance memory affinity which keeps data close to the cores that access it. In particular, software transactional memory (STM) applications exhibit irregular memory access behavior that makes harder to determine which and when data will be needed by each core. Additionally, existing STM runtime systems are decoupled from issues such as thread and memory management. In this paper, we thus propose a skeleton-driven mechanism to improve memory affinity on STM applications that fit the worklist pattern employing a two-level approach. First, it addresses memory affinity in the DRAM level by automatic selecting page allocation policies. Then it employs data prefetching helper threads to improve affinity in the cache level. It relies on a skeleton framework to exploit the application pattern in order to provide automatic memory page allocation and cache prefetching. Our experimental results on the STAMP benchmark suite show that our proposed mechanism can achieve performance improvements of up to 46 %, with an average of 11 %, over a baseline version on two NUMA multi-core machines.  相似文献   

13.
BSPlib: The BSP programming library   (total citations: 1; self-citations: 0; by others: 1)
BSPlib is a small communications library for bulk synchronous parallel (BSP) programming which consists of only 20 basic operations. This paper presents the full definition of BSPlib in C, motivates the design of its basic operations, and gives examples of their use. The library enables programming in two distinct styles: direct remote memory access (DRMA) using put or get operations, and bulk synchronous message passing (BSMP). Currently, implementations of BSPlib exist for a variety of modern architectures, including massively parallel computers with distributed memory, shared memory multiprocessors, and networks of workstations. BSPlib has been used in several scientific and industrial applications; this paper briefly describes applications in benchmarking, Fast Fourier Transforms (FFTs), sorting, and molecular dynamics.
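As a flavour of the DRMA style, a minimal BSPlib program using the put operation might look as follows. This is a sketch written against the C interface described in the paper; the neighbour-rotation example itself is ours, not taken from the paper.

/* Minimal BSPlib DRMA example: every process puts its pid into a registered
 * variable on its right-hand neighbour, then ends the superstep. */
#include <stdio.h>
#include "bsp.h"

int main(void)
{
    bsp_begin(bsp_nprocs());               /* start SPMD execution on all processes */

    int incoming = -1;
    bsp_push_reg(&incoming, sizeof(int));  /* make 'incoming' remotely writable */
    bsp_sync();                            /* registration takes effect at the sync */

    int me = bsp_pid();
    int right = (me + 1) % bsp_nprocs();
    bsp_put(right, &me, &incoming, 0, sizeof(int));  /* DRMA write, no receive call */
    bsp_sync();                            /* end of superstep: puts are delivered */

    printf("process %d received %d\n", me, incoming);

    bsp_pop_reg(&incoming);
    bsp_end();
    return 0;
}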

14.
Memory complexity analysis of numerical computing programs   (total citations: 12; self-citations: 1; by others: 11)
As more and more techniques are used to narrow the ever-widening speed gap between processors and memory, computer memory systems are becoming increasingly complex. Today, any programmer, and especially the designer of numerical computing programs, can hardly obtain high performance without considering the characteristics of the memory system of the computing platform being used. It is therefore clearly insufficient to explain the large performance differences between different implementations of the same algorithm on the same platform using only the traditional algorithm-evaluation approach based on time complexity and space complexity. The characteristics of the platform's memory system must be taken into account when analyzing the complexity of an algorithm. Sun Jiachang 199

15.
A Unified Primal-Dual Algorithm Framework Based on Bregman Iteration   (total citations: 2; self-citations: 0; by others: 2)
In this paper, we propose a unified primal-dual algorithm framework for two classes of problems that arise from various signal and image processing applications. We also show the connections to existing methods, in particular Bregman iteration (Osher et al., Multiscale Model. Simul. 4(2):460–489, 2005) based methods, such as linearized Bregman (Osher et al., Commun. Math. Sci. 8(1):93–111, 2010; Cai et al., SIAM J. Imag. Sci. 2(1):226–252, 2009, CAM Report 09-28, UCLA, March 2009; Yin, CAAM Report, Rice University, 2009) and split Bregman (Goldstein and Osher, SIAM J. Imag. Sci., 2, 2009). The convergence of the general algorithm framework is proved under mild assumptions. The applications to ℓ1 basis pursuit, TV−L2 minimization and matrix completion are demonstrated. Finally, the numerical examples show the algorithms proposed are easy to implement, efficient, stable and flexible enough to cover a wide variety of applications.
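For concreteness, one of the Bregman-type methods the framework connects to, split Bregman, can be stated in its standard form (this restates the cited literature, not the unified framework itself). For a problem min_u |d|_1 + H(u) subject to d = Φ(u), the iteration reads:

% Split Bregman iteration (standard form) for  min_u |d|_1 + H(u)  s.t.  d = \Phi(u)
(u^{k+1}, d^{k+1}) = \arg\min_{u,d}\; |d|_1 + H(u)
      + \frac{\lambda}{2}\,\bigl\| d - \Phi(u) - b^{k} \bigr\|_2^2,
\qquad
b^{k+1} = b^{k} + \Phi(u^{k+1}) - d^{k+1}
% The joint minimization is solved by alternating over u and d;
% the d-update reduces to a soft-shrinkage step.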

16.
The emergence of new network interface technology is enabling new approaches to the development of communications software. This paper evaluates the SHRIMP virtual memory mapped network interface by using it to build two fast implementations of remote procedure call (RPC). Our first implementation, called vRPC, is fully compatible with the SunRPC standard. We change only the RPC runtime library; the operating system kernel is unchanged, and only a minimal change was needed in the stub generator to create a new protocol identifier. Despite these restrictions, our vRPC implementation is several times faster than existing SunRPC implementations. A round-trip null RPC with no arguments and results under vRPC takes about 33 μs. Our second implementation, called ShrimpRPC, is not compatible with SunRPC but offers much better performance. ShrimpRPC specializes the stub generator and runtime library to take full advantage of SHRIMP's features. The result is a round-trip null RPC latency of 9.5 μs, which is about 1 μs above the hardware minimum.

17.
The study of group dynamics highlights the activity in the group in terms of its performance and communication. The experience of facilitating virtual communities and teams (Eunice and Kimball, 1997) suggests that groups go through the same stages whether in face-to-face or online mode. The paper brings together a theoretical framework based on the literature on virtual communities, Gestalt systems and online facilitation in order to address the issue of electronic togetherness, in particular from a group dynamics perspective. The empirical work on which the paper is based is an observation of a group of students in a training set playing a decision-making game. The model of Tuckman (Tuckman in Psychol Bull 63:384–399, 1965; Tuckman and Jensen in Group Organ Stud 2:419–427, 1977) is used as a framework within which to discuss the findings of the case. The paper finishes with concrete recommendations for facilitators of online communities and designers of the electronic spaces where these communities operate.

18.
This paper proposes a model for using huge memory pages on the IA-64 architecture, in which the data segment of an ELF executable is backed by huge pages. Because the translation lookaside buffer (TLB) can then map a larger range of virtual memory, the miss rate is reduced; this improves system performance for high-performance computing (HPC) applications that use huge pages, and for any memory-access-intensive application that uses large amounts of virtual memory.
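The paper's model targets the ELF data segment on IA-64; as a hedged, present-day illustration of the same idea, the Linux sketch below asks the kernel to back a large buffer with huge pages so that each TLB entry maps far more virtual memory. MAP_HUGETLB requires huge pages reserved by the administrator, and madvise(MADV_HUGEPAGE) is only a hint to transparent huge pages; neither corresponds to the paper's IA-64 mechanism.

/* Illustrative Linux example (not the paper's IA-64 model): back a large
 * buffer with huge pages to cut TLB misses for memory-intensive code. */
#define _GNU_SOURCE
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 1UL << 30;   /* 1 GiB working set */

    /* Try explicitly reserved huge pages first. */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        /* Fall back to normal pages and hint transparent huge pages. */
        buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }
        madvise(buf, len, MADV_HUGEPAGE);
    }

    /* ... memory-access-intensive work on buf ... */

    munmap(buf, len);
    return 0;
}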

19.
Memory diagnostics are important for improving the resilience of DRAM main memory. As bit cell size reaches physical limits, DRAM memory will be more likely to suffer both transient and permanent errors. Memory diagnostics that operate online can be a component of a comprehensive strategy to mitigate errors. This paper presents a novel approach, Asteroid, to integrate online memory diagnostics during workload execution. The approach supports diagnostics that adapt at runtime to workload behavior and resource availability to maximize test quality while reducing performance overhead. We describe Asteroid’s design and how it can be efficiently integrated with a hierarchical memory allocator in modern operating systems. We also present how the framework enables control policies to dynamically configure a diagnostic. Using an adaptive policy, in a 16-core server, Asteroid has a modest overhead of 1–4% for workloads with low to high memory demand. For these workloads, Asteroid’s adaptive policy has good error coverage and can thoroughly test memory.
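At its core, an online memory diagnostic of this kind runs march-style passes over frames that the allocator has temporarily taken out of service. The routine below is a generic march-like element offered purely as an illustration of one such pass; Asteroid's actual tests, control policies, and allocator integration are those described in the paper.

/* Generic march-style memory test over one region: not Asteroid's algorithm,
 * just the shape of a single diagnostic pass over a frame taken offline. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

bool march_test(volatile uint64_t *mem, size_t words)
{
    const uint64_t a = 0xAAAAAAAAAAAAAAAAull;
    const uint64_t b = ~a;

    /* Ascending: write pattern a. */
    for (size_t i = 0; i < words; i++) mem[i] = a;
    /* Ascending: verify a, then write the complement b. */
    for (size_t i = 0; i < words; i++) {
        if (mem[i] != a) return false;
        mem[i] = b;
    }
    /* Descending: verify b, then restore a. */
    for (size_t i = words; i-- > 0; ) {
        if (mem[i] != b) return false;
        mem[i] = a;
    }
    return true;   /* region passed; caller returns the frame to the allocator */
}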

20.
The International Society of Presence Research defines “presence” (a shortened version of the term “telepresence”) as a “psychological state in which even though part or all of an individual’s current experience is generated by and/or filtered through human-made technology, part or all of the individual’s perception fails to accurately acknowledge the role of the technology in the experience” (ISPR 2000, The concept of presence: explication statement. Accessed 15 Jan 2009). In this article, we will draw on recent outcomes of the cognitive sciences to offer a broader definition of presence, not related to technology only. Specifically, presence is described here as a core neuropsychological phenomenon whose goal is to produce a sense of agency and control: subjects are “present” if they are able to enact their intentions in an external world. This framework suggests that any environment, virtual or real, does not provide undifferentiated information or ready-made objects that are the same for everyone. It offers different opportunities and produces presence according to its ability to support the users and their intentions. The possible consequences of this approach for the development of presence-inducing virtual environments are also discussed.
