Similar Literature
20 similar articles found (search time: 15 ms)
1.
Container-based virtualization is becoming increasingly popular in cloud computing due to its efficiency and flexibility. Resource isolation is a fundamental property of containers. Existing works have indicated that weak resource isolation can cause significant performance degradation for containerized applications, and have proposed enhancements to resource isolation. However, current studies have largely not addressed the isolation problems of the page cache, which is a key resource for containers. Containers leverage the memory cgroup to control page cache usage. Unfortunately, the existing policy introduces two major problems in a container-based environment. First, containers can use more memory than their cgroup limit allows, effectively breaking memory isolation. Second, the OS kernel has to evict page cache to make space for newly arrived memory requests, slowing down containerized applications. This paper performs an empirical study of these problems and demonstrates their performance impact on containerized applications. We then propose pCache (precise control of page cache), which addresses the problems by dividing the page cache into private and shared portions and controlling both kinds separately and precisely. To do so, pCache leverages two new techniques: fair account (f-account) and evict on demand (EoD). F-account splits the charging of shared page cache based on each container's share, preventing containers from using memory for free and enhancing memory isolation. EoD reduces unnecessary page cache evictions to avoid the performance impacts. The evaluation results demonstrate that our system can effectively enhance memory isolation for containers and achieve substantial performance improvements over the original page cache management policy.
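To make the f-account idea concrete, the following is a minimal user-space sketch of proportional shared-page charging. The `Container` class, the `charge_shared_page` helper, and the even split are hypothetical illustrations; the real pCache does this inside the kernel's memory-cgroup accounting.

```python
# Minimal sketch of proportional shared-page charging (the f-account idea).
# All names are hypothetical; real accounting happens in the kernel's
# memory cgroup code, not in user-space Python.

PAGE_SIZE = 4096

class Container:
    def __init__(self, name, limit_bytes):
        self.name = name
        self.limit = limit_bytes
        self.charged = 0  # bytes currently charged to this container

def charge_shared_page(sharers):
    """Charge one shared page to all sharing containers, split evenly.

    Default cgroup accounting bills only the first container that touches
    a page, so later sharers use it for free; splitting the charge keeps
    every sharer accountable for its fraction.
    """
    share = PAGE_SIZE / len(sharers)
    for c in sharers:
        c.charged += share

a = Container("web", 64 << 20)
b = Container("db", 64 << 20)
charge_shared_page([a, b])
print(a.charged, b.charged)  # 2048.0 2048.0 -- each pays half a page
```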

2.
Storing and retrieving strings in main memory is a fundamental problem in computer science. The efficiency of the string data structures used for this task is of paramount importance for applications such as in-memory databases, text-based search engines, and dictionaries. The burst trie is a leading choice for such tasks, as it provides fast sorted access to strings. The burst trie, however, uses linked lists as substructures, which can result in poor use of the CPU cache and main memory. Previous research addressed this issue by replacing the linked lists with dynamic arrays, forming a cache-conscious array burst trie. Though faster, this variant can incur high instruction costs that hinder its efficiency. Thus, engineering a fast, compact, and scalable trie for strings remains an open problem. In this paper, we introduce a novel and practical solution that carefully combines a trie with a hash table, creating a variant of the burst trie called the HAT-trie. We provide a thorough experimental analysis which demonstrates that, for large sets of strings and on alternative computing architectures, the HAT-trie—and two novel variants engineered for further space efficiency—is currently the leading in-memory trie-based data structure, offering rapid, compact, and scalable storage and retrieval of variable-length strings.
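To make the trie-plus-hash-table combination concrete, here is a minimal Python sketch in the spirit of a HAT-trie: strings live in hash buckets, and an over-full bucket bursts into a trie node whose children are fresh buckets. The `Node` class and the burst threshold are illustrative assumptions, not the paper's engineered implementation.

```python
# Minimal HAT-trie-style sketch: buckets of strings burst into trie nodes.
BURST_LIMIT = 4  # illustrative threshold; real engines tune this carefully

class Node:
    def __init__(self):
        self.bucket = set()    # strings stored here while node is a leaf
        self.children = None   # dict: first char -> Node, once burst
        self.eow = False       # marks that the empty suffix ends a word here

    def insert(self, s):
        if self.children is None:
            self.bucket.add(s)
            if len(self.bucket) > BURST_LIMIT:
                self._burst()
        elif s == "":
            self.eow = True
        else:
            self.children.setdefault(s[0], Node()).insert(s[1:])

    def _burst(self):
        self.children = {}
        for s in self.bucket:
            if s == "":
                self.eow = True
            else:
                self.children.setdefault(s[0], Node()).insert(s[1:])
        self.bucket = set()

    def contains(self, s):
        if self.children is None:
            return s in self.bucket
        if s == "":
            return self.eow
        child = self.children.get(s[0])
        return child is not None and child.contains(s[1:])

root = Node()
for w in ["cat", "car", "cart", "dog", "dot", "do"]:
    root.insert(w)
print(root.contains("cart"), root.contains("ca"))  # True False
```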

3.
Recently, to relieve the performance degradation caused by the bottleneck between the CPU and main memory, cache-conscious multi-dimensional index structures have been proposed. Their ultimate goal is to reduce the space required for entries so as to widen index trees and minimize the number of cache misses. Existing index structures can be classified into two approaches according to their entry-reduction method. One approach compresses MBR keys by quantizing coordinate values to a fixed number of bits. The other stores only those sides of minimum bounding regions (MBRs) that differ from their parents'. The second approach works well when the node size is small and the number of entries is small. In this paper, we investigate existing multi-dimensional index structures for main-memory database systems through experiments under various workloads. We then propose a new index structure that exploits the properties of both techniques. We implement the existing multi-dimensional index structures and the proposed index structure, and perform various experiments to show that our approach outperforms the others.
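The first approach, key compression by quantization, can be sketched in a few lines. The grid resolution and the conservative rounding rule below are illustrative assumptions, not any specific index's scheme.

```python
# Sketch of the first entry-reduction approach: compress MBR keys by
# quantizing each coordinate to a fixed number of bits.

def quantize(value, lo, hi, bits):
    """Map a coordinate in [lo, hi] to an integer grid cell of `bits` bits."""
    cells = 1 << bits
    return min(int((value - lo) / (hi - lo) * cells), cells - 1)

def quantize_mbr(mbr, space, bits=4):
    """Quantize conservatively: round the max corner one cell outward so
    the compressed MBR still encloses the original region."""
    xmin, ymin, xmax, ymax = mbr
    lox, loy, hix, hiy = space
    top = (1 << bits) - 1
    return (quantize(xmin, lox, hix, bits),
            quantize(ymin, loy, hiy, bits),
            min(quantize(xmax, lox, hix, bits) + 1, top),
            min(quantize(ymax, loy, hiy, bits) + 1, top))

# Four bits per coordinate instead of a 32-bit float shrinks each entry
# by ~8x, so more entries fit in one cache line and the tree gets wider.
print(quantize_mbr((0.12, 0.40, 0.55, 0.70), (0.0, 0.0, 1.0, 1.0)))
# -> (1, 6, 9, 12)
```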

4.
We address the problem of efficiently streaming a set of heterogeneous videos from a remote server through a proxy to multiple asynchronous clients so that they can experience playback with low startup delays. We determine the optimal proxy prefix-cache allocation to the videos that minimizes the aggregate network bandwidth cost. We integrate proxy caching with traditional server-based reactive transmission schemes such as batching, patching, and stream merging to develop a set of proxy-assisted delivery schemes. We quantitatively explore the impact of the choice of transmission scheme, cache allocation policy, proxy cache size, and availability of unicast versus multicast capability on the resulting transmission cost. Our evaluations show that even a relatively small prefix cache (10%-20% of the video repository) is sufficient to realize substantial savings in transmission cost. We find that carefully designed proxy-assisted reactive transmission schemes can produce significant cost savings even in a predominantly unicast environment such as the Internet.
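A back-of-the-envelope sketch of why a small prefix cache saves bandwidth, assuming a simple batching scheme in which requests arriving within a window share one stream and the server ships only the suffix the proxy lacks. All parameters (`window`, the 10% prefix) are illustrative, not the paper's optimization.

```python
# Toy cost model: batching plus a proxy prefix cache.

def server_seconds(arrivals, video_len, prefix_len, window):
    """Total seconds of video the server must transmit."""
    streams = 0
    stream_start = None
    for t in sorted(arrivals):
        if stream_start is None or t - stream_start > window:
            streams += 1               # open a new (multicast) stream
            stream_start = t
    return streams * (video_len - prefix_len)

arrivals = [0, 5, 12, 61, 64, 130]     # request times (s)
full = server_seconds(arrivals, video_len=3600, prefix_len=0, window=30)
cached = server_seconds(arrivals, 3600, prefix_len=360, window=30)  # 10% prefix
print(full, cached)  # 10800 9720 -- a 10% prefix already trims 10% here
```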

5.
The design of a high-performance fetch architecture can be challenging due to poor interconnect scaling and energy concerns. Way prediction has been presented as one means of scaling the fetch engine to shorter cycle times while providing energy-efficient instruction cache accesses. However, way prediction requires additional complexity to handle mispredictions. In this paper, we examine a high-bandwidth fetch architecture augmented with an instruction-cache way predictor. We compare the performance and energy efficiency of this architecture to both a serial-access cache and a parallel-access cache. Our results show that a serial fetch architecture achieves approximately the same energy reduction and performance as way-prediction architectures, without the added structures and recovery complexity needed for way prediction.
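The trade-off being measured can be sketched as follows: a correct way prediction costs a single way-probe, but a misprediction triggers probes of the remaining ways plus retraining. The structure below, with probes counted as a rough energy proxy, is an illustrative model rather than the evaluated hardware.

```python
# Sketch of an instruction-cache way predictor, assuming a 4-way cache.
WAYS = 4

class WayPredictedCache:
    def __init__(self, sets):
        self.tags = [[None] * WAYS for _ in range(sets)]
        self.predict = [0] * sets   # last-hit way per set
        self.ways_probed = 0        # rough proxy for dynamic access energy

    def access(self, set_idx, tag):
        pred = self.predict[set_idx]
        self.ways_probed += 1                       # probe predicted way
        if self.tags[set_idx][pred] == tag:
            return "hit-predicted"
        for w in range(WAYS):                       # mispredict: probe rest
            if w == pred:
                continue
            self.ways_probed += 1
            if self.tags[set_idx][w] == tag:
                self.predict[set_idx] = w           # retrain predictor
                return "hit-mispredicted"
        victim = next((w for w in range(WAYS)       # fill an empty way,
                       if self.tags[set_idx][w] is None), 0)  # else way 0
        self.tags[set_idx][victim] = tag
        self.predict[set_idx] = victim
        return "miss"

c = WayPredictedCache(sets=1)
for t in ["A", "A", "B", "A"]:
    c.access(0, t)
print(c.ways_probed)  # 11 probes vs. 16 for an always-parallel 4-way lookup
```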

6.
To improve the read speed of on-chip Flash in embedded applications, this paper proposes an on-chip Flash acceleration controller based on prefetching and caching. The controller offers two acceleration schemes: a prefetch buffer and a cache. The prefetch-buffer scheme uses bit-width expansion and prefetching to accelerate the reading of sequential instructions, and employs a branch buffer to store non-sequential instructions, reducing the prefetch-miss penalty they cause. The cache scheme uses set associativity and way prediction to increase instruction reuse, reduce the number of Flash accesses, and lower system power consumption. For different application scenarios, the two schemes can be switched statically via a register or adaptively and dynamically via the software flow, yielding the best read-speed improvement. Test results on multiple benchmark programs demonstrate the feasibility and efficiency of the proposed on-chip Flash acceleration controller in optimizing performance and power.
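A rough sketch of the prefetch-buffer scheme: reads are served from a wide buffered Flash line when execution is sequential, and a small branch buffer remembers previously seen non-sequential targets to cut the prefetch-miss penalty. Line width, buffer policy, and class names are illustrative assumptions.

```python
# Minimal sketch of the prefetch-buffer idea with a branch buffer.

LINE_WORDS = 4  # bit-width expansion: one Flash read returns 4 words

class FlashAccelerator:
    def __init__(self):
        self.line_base = None          # base address of the buffered line
        self.branch_buf = {}           # addr -> line base, for jump targets
        self.flash_reads = 0

    def fetch(self, addr):
        base = addr - (addr % LINE_WORDS)
        if base == self.line_base:
            return "buffer-hit"        # sequential instruction, no Flash read
        if addr in self.branch_buf:    # non-sequential target held in buffer
            self.line_base = self.branch_buf[addr]
            return "branch-buffer-hit"
        self.flash_reads += 1          # prefetch miss: read the whole line
        self.line_base = base
        self.branch_buf[addr] = base   # remember this target for future jumps
        return "flash-read"

f = FlashAccelerator()
for pc in [0, 1, 2, 3, 4, 100, 101, 0, 1]:   # loop back to 0 at the end
    f.fetch(pc)
print(f.flash_reads)  # 3 wide reads for 9 fetches instead of 9 narrow reads
```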

7.

Context

Pointer analysis is an important building block of optimizing compilers and program analyzers for the C language. Various methods with different precision/performance trade-offs have been proposed. Among them, cycle elimination has been successfully used to improve the scalability of context-insensitive pointer analyses without losing any precision.

Objective

In this article, we present a new method for context-sensitive pointer analysis that makes effective use of cycle elimination.

Method

To obtain similar benefits from cycle elimination in a context-sensitive analysis, we propose a novel constraint-based formulation that uses sets of contexts as annotations. Our method is not based on binary decision diagrams (BDDs). Instead, we directly use invocation graphs to represent context sets and apply a hash-consing technique to deal with the exponential blow-up of contexts.
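Cycle elimination itself rests on a standard observation: inclusion constraints that form a cycle force all variables in the cycle to have equal points-to sets, so each strongly connected component of the constraint graph can be collapsed before propagation. The sketch below applies textbook Tarjan SCC detection to a toy constraint graph; the graph is illustrative, not from the paper.

```python
# Cycle elimination on an inclusion-constraint graph: variables in a
# cycle (p >= q >= ... >= p) must have equal points-to sets, so each
# strongly connected component is collapsed to one node.

def tarjan_scc(graph):
    index, low, onstack, stack = {}, {}, set(), []
    sccs, counter = [], [0]

    def strongconnect(v):
        index[v] = low[v] = counter[0]; counter[0] += 1
        stack.append(v); onstack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in onstack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            scc = []
            while True:
                w = stack.pop(); onstack.discard(w); scc.append(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in graph:
        if v not in index:
            strongconnect(v)
    return sccs

# p >= q, q >= r, r >= p form a cycle; s only feeds into p.
constraints = {"p": ["q"], "q": ["r"], "r": ["p"], "s": ["p"]}
print(tarjan_scc(constraints))  # [['r', 'q', 'p'], ['s']]
```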

Result

Experimental results on C programs ranging from 20,000 to 290,000 lines show that applying cycle elimination to our new formulation yields a 4.5× speedup over the previous BDD-based approach.

Conclusion

We showed that cycle elimination is an effective method for improving the scalability of context-sensitive pointer analysis.

8.
In this paper, we present compiler algorithms for detecting references to stale data in shared-memory multiprocessors. The algorithm consists of two key analysis techniques, stale reference detection and locality preserving analysis. While the stale reference detection finds the memory reference patterns that may violate cache coherence, the locality preserving analysis minimizes the number of such stale references by analyzing both temporal and spatial reuse. By computing the regions referenced by arrays inside loops, we extend the previous scalar algorithms for more precise analysis. We develop a full interprocedural array data-flow algorithm, which performs both bottom-up side-effect analysis and top-down context analysis on the procedure call graph to further exploit locality across procedure boundaries. The interprocedural algorithm eliminates the cache invalidations at procedure boundaries that were assumed in previous compiler algorithms. We have fully implemented the algorithm in the Polaris parallelizing compiler. Using execution-driven simulations on the Perfect Club benchmarks, we demonstrate how unnecessary cache misses can be eliminated by automatic stale reference detection. The algorithm can be used to implement cache coherence in shared-memory multiprocessors that do not have hardware directories, such as the Cray T3D.

9.
GPUs provide megabytes of registers and shared memories to maintain the contexts of thousands of threads and to enable fast data sharing among the threads of a thread block, respectively. In addition, GPUs employ an L1 cache to provide high-bandwidth service for memory requests. However, the average L1 cache capacity per thread is very limited, resulting in cache thrashing, which in turn impairs performance. Meanwhile, many registers and shared memories are unassigned to any warp or thread block. Moreover, registers and shared memories that are assigned can sit idle once their warps or thread blocks finish. Exploiting these insights, in this paper we propose Virtual-Cache, which cost-effectively increases the effective size of the L1 cache by utilizing the unassigned and released registers and shared memories as cache lines. Specifically, we leverage the unassigned registers and shared memories to serve cache requests directly. Registers assigned to a warp can work as cache lines after the warp completes execution and before they are accessed again by a newly launched warp. Likewise, the shared memories of a thread block can serve cache requests from the time the thread block finishes until they are referenced by shared-memory instructions of a relaunched thread block. The register file, shared memory, and L1 cache remain physically independent but are logically unified into a large virtual cache with redesigned cache-line management. We develop the control and data path for the register file, making it accessible for cache requests by borrowing an operand collector to serve them. We also expand the control and data path of the shared memory to serve cache requests. Our evaluation results show that Virtual-Cache improves performance by 28% over a previously proposed cache management technique for cache-sensitive applications.

10.
Recent technology advances in mobile networking have ushered in a new era of personal communication. Users can ubiquitously access the Internet via many emerging mobile appliances, such as portable notebooks, personal digital assistants (PDAs), and WAP-enabled cellular phones. While the transcoding proxy is attracting increasing attention in this environment, new caching strategies are required for such proxies. We propose an efficient cache replacement algorithm for transcoding proxies. Specifically, we formulate a generalized profit function to evaluate the profit from caching each version of an object. This generalized profit function explicitly considers several newly emerging factors in the transcoding proxy and the aggregate effect of caching multiple versions of the same object. Notably, the aggregate effect is not simply the sum of the costs of caching individual versions of an object; rather, it depends on the transcoding relationships among these versions. The notion of a weighted transcoding graph is devised to evaluate the corresponding aggregate effect efficiently. Utilizing the generalized profit function and the weighted transcoding graph, we propose an innovative cache replacement algorithm for transcoding proxies. In addition, an effective data structure is designed to facilitate the management of the multiple versions of different objects cached in the transcoding proxy. Using event-driven simulation, we show that the proposed algorithm consistently outperforms companion schemes in terms of delay-saving ratios and cache hit ratios.
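The aggregate effect can be sketched numerically on a toy weighted transcoding graph: the expected service cost with several cached versions depends on which versions can be transcoded into which, so the joint profit differs from the sum of per-version profits. All costs and the request mix below are illustrative assumptions.

```python
# Sketch of the aggregate effect on a weighted transcoding graph.

# transcode[a][b] = cost of producing version b from cached version a
transcode = {
    "full": {"full": 0, "pda": 4, "wap": 6},
    "pda":  {"pda": 0, "wap": 3},
    "wap":  {"wap": 0},
}
FETCH = 20                                       # cost of fetching from origin
demand = {"full": 0.2, "pda": 0.5, "wap": 0.3}   # request probabilities

def service_cost(cached):
    """Expected cost of serving one request given the cached versions."""
    total = 0.0
    for version, p in demand.items():
        best = FETCH                 # worst case: fetch it from the server
        for c in cached:
            if version in transcode[c]:
                best = min(best, transcode[c][version])
        total += p * best
    return total

for cached in [set(), {"pda"}, {"full"}, {"full", "pda"}]:
    print(sorted(cached), round(service_cost(cached), 2))
# [] 20.0 | ['pda'] 4.9 | ['full'] 3.8 | ['full', 'pda'] 0.9
```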

11.
TTL caching models have recently regained significant research interest due to their connection to popular caching policies such as LRU. This paper advances the state-of-the-art analysis of TTL-based cache networks by developing two exact methods with orthogonal generality and computational complexity. The first method generalizes existing results for line networks under renewal requests to the broad class of caching policies in which evictions are driven by stopping times; in addition to classical policies used in DNS and web caching, our stopping-time model captures an emerging policy implemented in SDN switches and Amazon web services. The second method further generalizes these results to feedforward networks with Markov arrival process (MAP) requests. MAPs are particularly suitable for non-line networks because they are closed not only under superposition and splitting, as is known, but also under caching operations with phase-type (PH) TTL distributions. The crucial benefit of the two closure properties is that they jointly enable the first exact analysis of TTL feedforward cache networks in great generality. Moreover, numerical results highlight that existing Poisson approximations in binary-tree topologies are subject to relative errors as large as 30%, depending on the tree depth.
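For orientation, the simplest instance the paper generalizes — a single item under Poisson requests with a fixed TTL — has well-known closed forms, which the short simulation below reproduces: a hit ratio of 1 − e^(−λT) when the TTL resets on every hit, and λT/(1 + λT) when it does not. (The paper's stopping-time and MAP machinery covers far more general cases.)

```python
# Sanity-check sketch: Poisson requests at rate lam to one item, TTL T.

import math, random

def simulate(lam, T, reset, n=200_000, seed=1):
    rng = random.Random(seed)
    t, expiry, hits = 0.0, -1.0, 0
    for _ in range(n):
        t += rng.expovariate(lam)       # next Poisson request
        if t < expiry:
            hits += 1
            if reset:
                expiry = t + T          # reset-TTL: refresh on every hit
        else:
            expiry = t + T              # miss: (re)insert the item
    return hits / n

lam, T = 1.0, 2.0
print(round(simulate(lam, T, reset=True), 3),
      round(1 - math.exp(-lam * T), 3))          # ~0.865 vs 0.865
print(round(simulate(lam, T, reset=False), 3),
      round(lam * T / (1 + lam * T), 3))         # ~0.667 vs 0.667
```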

12.
In Java, C, or C++, attempts to dereference the null value result in an exception or a segmentation fault. Hence, it is important to identify those program points where this undesired behaviour might occur, or to prove the other program points (and possibly the entire program) safe. To that purpose, null-pointer analysis of computer programs checks or infers non-null annotations for variables and object fields. With few notable exceptions, null-pointer analyses currently use run-time checks, are incorrect, or only verify manually provided annotations. In this paper, we use abstract interpretation to build, and prove correct, the first flow- and context-sensitive static null-pointer analysis for Java bytecode (and hence Java) that infers non-null annotations. It is based on Boolean formulas, implemented with binary decision diagrams. For better precision, it identifies instance or static fields that remain non-null after being initialised. Our experiments show this analysis to be faster and more precise than the correct null-pointer analysis by Hubert, Jensen and Pichardie. Moreover, our analysis deals with exceptions, which most others do not; its formulation is theoretically clean and its implementation strong and scalable. We subsequently improve the analysis by using local reasoning about fields that are not always non-null, but happen to hold a non-null value when they are accessed. This is a frequent situation, since programmers typically check a field for non-nullness before accessing it. We conclude with an example in which our analyses infer null-pointer annotations that are more precise than those other inference tools can achieve.
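The flavor of the inference — though not its BDD machinery or context-sensitivity — can be sketched with a three-value nullness lattice over a toy straight-line program; the statement encoding below is an illustrative assumption.

```python
# Toy flow-sensitive nullness analysis over a three-value lattice.
# The real analysis works on Java bytecode with Boolean formulas/BDDs;
# this straight-line sketch shows only the core idea.

from enum import Enum

class Null(Enum):
    NONNULL = 1
    NULL = 2
    MAYBE = 3   # join of the other two

def join(a, b):
    return a if a == b else Null.MAYBE

def analyze(stmts):
    """stmts: list of (var, kind, arg), kind in {'new','null','copy','deref'}."""
    state, warnings = {}, []
    for var, kind, arg in stmts:
        if kind == "new":            # x = new T()  -> definitely non-null
            state[var] = Null.NONNULL
        elif kind == "null":         # x = null
            state[var] = Null.NULL
        elif kind == "copy":         # x = y
            state[var] = state.get(arg, Null.MAYBE)
        elif kind == "deref":        # x.f -- flag a possible NPE
            if state.get(var, Null.MAYBE) != Null.NONNULL:
                warnings.append(var)
    return state, warnings

prog = [("a", "new", None), ("b", "null", None),
        ("c", "copy", "b"), ("a", "deref", None), ("c", "deref", None)]
print(analyze(prog))  # 'a' is safe; dereferencing 'c' is flagged
```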

13.
Caches are essential to bridge the gap between the high-latency main memory and the fast processor pipeline. Standard processor architectures implement two first-level caches to avoid a structural hazard in the pipeline: an instruction cache and a data cache. For tight worst-case execution time bounds it is important to classify memory accesses as either cache hits or cache misses. The addresses of instruction fetches are known statically, so static cache hit/miss classification is possible for the instruction cache. Accesses to data cached in the data cache are harder to predict statically. Several different data areas, such as the stack, global data, and heap-allocated data, share the same cache. Some addresses are known statically; others are known only at runtime. With a standard cache organization, all of these data areas must be considered by worst-case execution time analysis. In this paper we propose to split the data cache among the different data areas, so that data cache analysis can be performed individually for each area. An access to an unknown address in the heap then does not destroy the abstract cache state for the other data areas. Furthermore, we propose to use a small, highly associative cache for the heap area. We designed and implemented a static analysis for this cache and integrated it into a worst-case execution time analysis tool.

14.
Advances in semiconductor technology have made it possible to integrate billions of transistors on a single chip. Industry and academia currently favor chip multiprocessor (CMP) architectures, whose performance is strongly affected by off-chip memory accesses, so how to organize the on-chip cache hierarchy is a key question. To address this problem, we propose organizing the on-chip last-level cache as a non-inclusive cache to reduce the number of off-chip accesses. Through detailed simulation of a subset of the Splash2 benchmarks, we compare different organizations of the CMP cache hierarchy. The data show that a non-inclusive cache can reduce the average memory access time by up to 8.3%. We also point out that non-inclusive caches help save on-chip resources, and we give design recommendations for the CMP cache hierarchy once a third-level cache is integrated on chip.
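The capacity argument for non-inclusion is simple arithmetic, sketched below with illustrative sizes: an inclusive last-level cache must duplicate every line held in the private caches, while a non-inclusive one adds the private capacity to the effective on-chip total.

```python
# Back-of-the-envelope capacity comparison; sizes are illustrative,
# not from the paper.

CORES = 4
L2_PER_CORE = 1 << 20          # 1 MiB private L2 per core
LLC = 8 << 20                  # 8 MiB shared last-level cache

inclusive_unique = LLC                             # private lines duplicated
non_inclusive_unique = LLC + CORES * L2_PER_CORE   # private lines add capacity
print(inclusive_unique >> 20, non_inclusive_unique >> 20)  # 8 vs 12 MiB
```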

15.
The long short-term memory (LSTM) is not the only neural network that learns a context-sensitive language. Second-order sequential cascaded networks (SCNs) are able to induce, from a finite fragment of a context-sensitive language, the means for processing strings outside the training set. The dynamical behavior of the SCN is qualitatively distinct from that observed in LSTM networks. Differences in performance and dynamics are discussed.

16.
Hit ratio, byte hit ratio, and latency are the most important performance metrics of Web cache systems, yet they struggle to measure the access latency of Web objects of different sizes accurately and fairly. We introduce the concept of byte latency to establish a more reasonable evaluation standard for the latency of objects of different sizes. We then propose LLC, a least-latency-cost Web cache replacement algorithm, which shortens user-perceived access latency as much as possible. Experimental results show that, compared with commonly used cache replacement algorithms, LLC performs well in effectively reducing user-perceived access latency.
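An illustrative reading of the byte-latency idea: rank cached objects by the fetch delay they save per byte of cache they occupy, and evict from the bottom. The scoring rule below is a sketch inferred from the abstract, not the paper's exact LLC formula.

```python
# Minimal sketch of latency-cost-driven replacement.

def evict_order(objects):
    """objects: dict name -> (size_bytes, fetch_delay_s).
    Returns names sorted from first-to-evict to last."""
    return sorted(objects, key=lambda o: objects[o][1] / objects[o][0])

cache = {
    "logo.png":  (10_000, 0.05),   # small and cheap to re-fetch
    "video.mp4": (5_000_000, 2.0), # huge, but slow to re-fetch
    "api.json":  (2_000, 0.40),    # tiny and slow: great latency per byte
}
print(evict_order(cache))  # ['video.mp4', 'logo.png', 'api.json']
```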

17.
The current paper presents an automatic and context-sensitive system for the dynamic recognition of the pain expression among the six basic facial expressions and the neutral one, on acted and spontaneous sequences. A machine learning approach based on the Transferable Belief Model, previously used successfully to categorize the six basic facial expressions in static images [2], [61], is extended in the current paper to the automatic and dynamic recognition of the pain expression from video sequences in a hospital application context. The originality of the proposed method lies in the use of dynamic information for the recognition of the pain expression and in the combination of different sensors — permanent facial-feature behavior, transient-feature behavior, and the context of the study — within the same fusion model. Experimental results, on 2-alternative forced choices and, for the first time, on 8-alternative forced choices (i.e. the pain expression is classified among seven other facial expressions), show good classification rates even for spontaneous pain sequences. The mean classification rates on acted and spontaneous data reach 81.2% and 84.5% for the 2-alternative and 8-alternative forced choices, respectively. Moreover, the system's performance compares favorably with the human observer rate (76%), and it produces the same doubt states in the case of blended expressions.

18.
Information flow control (IFC) checks whether a program can leak secret data to public ports, or whether critical computations can be influenced from outside. But many IFC analyses are imprecise because they are flow-insensitive, context-insensitive, or object-insensitive, resulting in false alarms. We argue that IFC must better exploit modern program analysis technology, and present an approach based on program dependence graphs (PDGs). PDGs have been developed over the last 20 years as a standard device for representing information flow in a program, and today they can handle realistic programs. In particular, our dependence graph generator for full Java bytecode is used as the basis for an IFC implementation that is more precise and needs fewer annotations than traditional approaches. We explain PDGs for sequential and multi-threaded programs, and explain the precision gains due to flow-, context-, and object-sensitivity. We then augment PDGs with a lattice of security levels and introduce the flow equations for IFC. We describe algorithms for flow computation in detail and prove their correctness. We then extend the flow equations to handle declassification, and prove that our algorithm respects monotonicity of release. Finally, examples demonstrate that our implementation can check realistic sequential programs in full Java bytecode.
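The core flow computation can be sketched as a fixpoint over the PDG: each node's security level is the join of its own annotation and the levels of everything it depends on, and a leak is a HIGH level reaching a public sink. The two-point lattice and toy graph below are illustrative; the paper's algorithms handle a full lattice, declassification, and Java-scale graphs.

```python
# Sketch of lattice-based flow propagation on a program dependence graph.

from functools import reduce

LOW, HIGH = 0, 1   # two-point security lattice; join is max

def propagate(deps, source_level):
    """deps: node -> set of nodes it depends on (data or control).
    source_level: node -> initial level. Returns the fixpoint levels."""
    level = dict(source_level)
    changed = True
    while changed:                      # standard fixpoint iteration
        changed = False
        for node, preds in deps.items():
            new = reduce(max, (level.get(p, LOW) for p in preds),
                         level.get(node, LOW))
            if new != level.get(node, LOW):
                level[node] = new
                changed = True
    return level

# password -> hash -> log_msg -> print_stmt (public sink)
deps = {"hash": {"password"}, "log_msg": {"hash"}, "print_stmt": {"log_msg"}}
levels = propagate(deps, {"password": HIGH})
print(levels["print_stmt"] == HIGH)  # True: illegal flow to a public sink
```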

19.
This paper introduces analyses of write-back caches integrated into response-time analysis for fixed-priority preemptive and non-preemptive scheduling. For each scheduling paradigm, we derive four different approaches to computing the additional costs incurred due to write-backs. We show the dominance relationships between these approaches and note how they can be combined to form a single state-of-the-art approach in each case. The evaluation explores the relative performance of the different methods using a set of benchmarks, as well as making comparisons with no cache and with a write-through cache. We also explore the effect of write buffers used to hide the latency of write-through caches. We show that, depending on the depth of the buffer used and the policies employed, such buffers can give rise to domino effects. Our evaluation shows that, even ignoring domino effects, a substantial write buffer is needed to match the guaranteed performance of write-back caches.

20.
Proxy cache algorithms: design, implementation, and performance
Caching at proxy servers is one of the ways to reduce the response time perceived by World Wide Web users. Cache replacement algorithms play a central role in reducing response time by selecting a subset of documents for caching so that a given performance metric is maximized. At the same time, the cache must take extra steps to guarantee some form of consistency of the cached documents. Cache consistency algorithms enforce appropriate guarantees about the staleness of the cached documents. We describe a unified cache maintenance algorithm, LNC-R-W3-U, which integrates both cache replacement and consistency algorithms. The LNC-R-W3-U algorithm evicts documents from the cache based on the delay to fetch each document into the cache. Consequently, documents that took a long time to fetch are preferentially kept in the cache. The LNC-R-W3-U algorithm also factors into its eviction decisions the validation rate of each document, as provided by its cache consistency component. Consequently, documents that are infrequently updated, and thus seldom require validation, are preferentially retained in the cache. We describe the implementation of LNC-R-W3-U and its integration with the Apache 1.2.6 code base. Finally, we present a trace-driven experimental study of LNC-R-W3-U performance and a comparison with other previously published algorithms for cache maintenance.
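An illustrative scoring rule in the spirit of LNC-R-W3-U: retain documents whose cached copy saves the most fetch delay per byte, discounted by how often they must be validated. The exact weighting below is an assumption for the sketch, not the published formula.

```python
# Sketch of delay- and validation-aware retention scoring.

def retention_value(ref_rate, fetch_delay, size, validation_rate,
                    validation_cost=0.05):
    """Expected delay saved per byte of cache, per unit time."""
    saved = ref_rate * fetch_delay                # delay avoided by caching
    overhead = validation_rate * validation_cost  # cost of staying fresh
    return (saved - overhead) / size

docs = {  # name: (requests/s, fetch delay s, bytes, validations/s)
    "news.html":  (2.0, 0.8, 40_000, 0.5),
    "logo.gif":   (5.0, 0.1, 8_000, 0.01),
    "report.pdf": (0.05, 3.0, 900_000, 0.0),
}
ranked = sorted(docs, key=lambda d: retention_value(*docs[d]), reverse=True)
print(ranked)  # evict from the end of this list first
```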
