Similar Documents
20 similar documents found (search time: 15 ms)
1.
In this paper, we introduce contention locality in Transactional Memory (TM), which describes the likelihood that a previously aborted transaction conflicts again in the future. We find that conflicts are highly predictable in TMs and propose two optimization techniques based on contention locality. The first is Speculative Contention Avoidance (SCA). SCA dynamically controls the number of concurrently executing transactions and serializes those transactions that are likely to conflict. As such, SCA reduces contention in TMs and improves performance. The second is Adaptive Validation (AV). We show that no single validation policy works well across all applications; AV adjusts validation based on application behavior and improves the performance of TMs. SCA and AV are evaluated using Transactional Locking II (TL2) and the STAMP v0.9.10 benchmark suite. The evaluation reveals that SCA and AV are effective and improve performance significantly.
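The contention predictor behind SCA can be pictured with a small sketch. The class name, the abort-set policy, and the single global serialization lock below are illustrative assumptions, not the TL2-based implementation evaluated in the paper:

```python
import threading

class ContentionPredictor:
    """Serialize transactions that recently aborted (SCA-style sketch)."""

    def __init__(self):
        self.aborted = set()              # transaction ids that conflicted recently
        self.serial_lock = threading.Lock()

    def run(self, tx_id, body):
        """Run `body` (returns True on commit, False on abort)."""
        if tx_id in self.aborted:
            # Contention locality: a past abort predicts a future conflict,
            # so run this attempt serialized instead of speculatively.
            with self.serial_lock:
                ok = body()
        else:
            ok = body()                   # run concurrently with other transactions
        if ok:
            self.aborted.discard(tx_id)   # a commit clears the prediction
        else:
            self.aborted.add(tx_id)       # an abort sets it
        return ok
```

One abort is enough to route the transaction's next attempt through the serialization lock; a successful commit restores concurrent execution.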

2.
This paper presents a comprehensive system modeling and analysis approach for both predicting the queuing delay of a single buffer and controlling its average queuing delay to a required value in a multiple-traffic-source network environment. This approach can effectively enhance the QoS performance of delay-sensitive applications. A discrete-time analytical model that approximates the multi-source arrival process with a binomial distribution is developed to analyze the relationship between the queuing threshold and the average queuing delay. A control strategy with dynamic queue thresholds based on the analytical result is then used to control the average queuing delay to a required value within the buffer. Packet dropping is treated as implicit congestion feedback to the arrival process for rate adjustment. The feasibility of the system is validated by comparing the theoretical analysis with a diverse set of simulation results, and a set of statistical analyses evaluates the efficiency and accuracy of the proposed scheme. In addition, a user-friendly graphical user interface allows users to configure the simulation process and display simulation results.
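As a first-order intuition for the threshold/delay relationship, Little's law gives average delay ≈ average occupancy / service rate, so a target delay D with service rate μ suggests a threshold near D·μ. The helper names and the simple step controller below are hypothetical illustrations, not the paper's binomial-arrival analysis:

```python
def delay_to_threshold(target_delay, service_rate):
    """Map a target average queuing delay to a buffer threshold (packets).

    First-order sketch via Little's law (N = lambda_eff * D, with the
    effective rate near the service rate under load).
    """
    return max(1, round(target_delay * service_rate))

def adjust_threshold(threshold, measured_delay, target_delay, step=1):
    """Nudge the dropping threshold toward the delay target."""
    if measured_delay > target_delay:
        return max(1, threshold - step)   # drop earlier -> shorter queue
    if measured_delay < target_delay:
        return threshold + step           # admit more -> longer queue
    return threshold
```

Dropping at the threshold doubles as the implicit congestion feedback mentioned above: sources observe loss and reduce their rates.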

3.
A parallel associative memory first proposed by Kanerva (1988) is discussed. The major appeal of this memory is its ability to be trained very rapidly. A discrepancy between Kanerva's theoretical calculation of capacity and the actual capacity is demonstrated experimentally and a corrected theory is offered. A modified method of reading from memory is suggested which results in a capacity nearly the same as that originally predicted by Kanerva. The capacity of the memory is then analyzed for a different method of writing to memory. This method increases the capacity of the memory by an order of magnitude. A further modification is suggested which increases the learning rate of this method.
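To make the write/read mechanics concrete, here is a toy Kanerva-style sparse distributed memory. The tiny parameters and the plain majority-rule read are illustrative simplifications, not the capacity-corrected variants analyzed in the paper:

```python
import random

class SparseDistributedMemory:
    """Toy Kanerva-style sparse distributed memory (illustrative sketch).

    Hard locations are random n-bit addresses; a write updates bit counters
    at every location within Hamming radius r of the target address; a read
    sums those counters and thresholds each bit at zero (majority rule).
    """

    def __init__(self, n_bits=32, n_locations=200, radius=12, seed=1):
        rng = random.Random(seed)
        self.n = n_bits
        self.radius = radius
        self.addresses = [rng.getrandbits(n_bits) for _ in range(n_locations)]
        self.counters = [[0] * n_bits for _ in range(n_locations)]

    def _active(self, addr):
        # Locations whose address is within the activation radius.
        return [i for i, a in enumerate(self.addresses)
                if bin(a ^ addr).count("1") <= self.radius]

    def write(self, addr, data_bits):
        for i in self._active(addr):
            for j, b in enumerate(data_bits):
                self.counters[i][j] += 1 if b else -1

    def read(self, addr):
        sums = [0] * self.n
        for i in self._active(addr):
            for j in range(self.n):
                sums[j] += self.counters[i][j]
        return [1 if s > 0 else 0 for s in sums]
```

Training is a single counter-update pass per stored pattern, which is why this memory can be trained so rapidly.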

4.
We study the performance benefits of speculation in a release consistent software distributed shared memory system. We propose a new protocol, speculative home-based release consistency (SHRC) that speculatively updates data at remote nodes to reduce the latency of remote memory accesses. Our protocol employs a predictor that uses patterns in past accesses to shared memory to predict future accesses. We have implemented our protocol in a release consistent software distributed shared memory system that runs on commodity hardware. We evaluate our protocol implementation using eight software distributed shared memory benchmarks and show that it can result in significant performance improvements.

5.
The standard memory allocators of shared memory multiprocessors (SMPs) often provide poor performance because they do not sufficiently reflect the access latencies of deep NUMA architectures, with their on-chip, off-chip, and off-blade communication. We analyze memory allocation strategies for data-intensive MapReduce applications on SMPs with up to 512 cores and 2 TB of memory. We compare the efficiency of the MapReduce frameworks MR-Search and Phoenix++ and provide performance results on two benchmark applications, k-means and shortest-path search.

6.
In this paper we propose two new enhancements to the SOCKS protocol in the areas of IP multicasting and UDP tunneling. Most network firewalls deployed at the entrance to a private network block multicast traffic because of the potential security threats inherent in IP multicast. Yet multicasting is the backbone of many Internet technologies, such as voice and video conferencing, real-time gaming, multimedia streaming, and online stock quotes. There is a need to safely and securely allow multicast streams to enter and leave a protected enterprise network; securing multicast streams is challenging and poses many architectural issues. The SOCKS protocol is typically implemented in a network firewall as an application-layer gateway. Our first enhancement enables the application of security and access-control policies so that multicast traffic can safely enter the boundaries of a protected enterprise network. The second enhancement allows the establishment of a tunnel between two protected networks with SOCKS-based firewalls to transport UDP datagrams.

7.
This paper investigates the effects of imposing bounds on the measurements used in weighted least-squares (WLS) state estimation. The active limits for such bounds are derived and algorithms based on linear and quadratic programming kernels are presented. Using the lower limit for the bounds, the constrained WLS scheme becomes an adaptive maximally constrained scheme: M-WLS. For some networks, poor prior knowledge of the global error characteristic gives some measurements less influence than would be expected from the local error characteristics of their transducers. By using M-WLS estimation, the influence of such measurements on state estimation may be improved. Analysis of the adaptive bounding of the scheme can also identify critical measurement discrepancies. For the purpose of illustration, results are presented using simulated measurements; the head measurements (pressures) are consistent with nominal demands (nodal flows), and the demand measurements are generated by superimposing random errors of 2.5 l/s rms on the nominal demands.
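For the linearized, unconstrained case underlying the above, the WLS estimate solves the normal equations x = (HᵀWH)⁻¹HᵀWz. A minimal plain-Python sketch for small dense problems (the adaptive bounding of M-WLS is not modeled here):

```python
def wls_estimate(H, W, z):
    """Weighted least-squares solve of x = (H^T W H)^{-1} H^T W z.

    H: m x n measurement matrix (list of rows); W: diagonal weights as a
    length-m list; z: length-m measurement vector. Small-problem sketch
    using Gaussian elimination with partial pivoting.
    """
    m, n = len(H), len(H[0])
    # Form the normal equations A x = b with A = H^T W H, b = H^T W z.
    A = [[sum(H[k][i] * W[k] * H[k][j] for k in range(m)) for j in range(n)]
         for i in range(n)]
    b = [sum(H[k][i] * W[k] * z[k] for k in range(m)) for i in range(n)]
    # Forward elimination.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))) / A[i][i]
    return x
```

Doubling a measurement's weight pulls the estimate toward that measurement, which is exactly the influence that bounding the weights (M-WLS) regulates.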

8.
A Discussion of Schemes for Monitoring Network Service Performance
Three feasible schemes for monitoring network service performance are proposed and their respective advantages and disadvantages compared; the monitor-based scheme is analyzed in detail in the context of RFC 2758. Finally, by combining the strengths of the three schemes, a scheme with good scalability that better satisfies the monitoring requirements is presented.

9.
Network coding can greatly increase network throughput and reduce latency; however, because of the complexity of network information flows in the coding sense, it remains difficult to apply in practice. Meanwhile, secure routing protocols are a research focus for Ad Hoc networks, which are self-organizing by nature. Targeting SAODV, one of the stronger secure routing protocols for Ad Hoc networks, an optimization scheme based on network coding is proposed that is easier to implement than traditional coding and is highly practical.

10.
To address the poor scalability, limited monitored parameters, and structural complexity of network performance monitoring systems, a hierarchical, modular monitoring system integrating performance monitoring and anomaly response was designed and implemented. Operational results show that the system supports non-SNMP data collection and is compatible with non-SNMP devices, responds to network anomalies promptly, implements network control functions, monitors a richer set of parameters, and better reflects actual network conditions.

11.
The JFFS2 file system for flash memory compresses files before actually writing them into flash memory. As a result, multimedia files, for instance, which are already compressed at the application level, go through an unnecessary and time-consuming compression stage that wastes energy. Also, when reading such multimedia files, the default use of the disk cache results in unnecessary main-memory accesses, and hence wasted energy, because of the low cache hit ratio. This paper presents two techniques to reduce the energy consumption of the JFFS2 flash file system for power-aware applications: one selectively avoids data compression when writing files, and the other bypasses page caching when reading sequential files. The modified file system is implemented on a PDA running Linux, and the experimental results show that the proposed mechanism effectively reduces overall energy consumption when accessing large, continuous files.
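The selective-compression idea can be sketched in user space: try compressing a small prefix, and store the block raw when the trial shows no gain (as with already-compressed multimedia data). The one-byte tag framing and function names are illustrative assumptions, not JFFS2's on-flash format:

```python
import zlib

def write_block(data: bytes, sample_size: int = 4096) -> bytes:
    """Compress a block only when a quick trial shows it will shrink."""
    sample = data[:sample_size]
    if len(zlib.compress(sample)) >= len(sample):
        return b"\x00" + data              # stored raw: compression won't help
    return b"\x01" + zlib.compress(data)   # compressible: store deflated

def read_block(block: bytes) -> bytes:
    """Invert write_block using the one-byte tag."""
    tag, payload = block[:1], block[1:]
    return payload if tag == b"\x00" else zlib.decompress(payload)
```

The trial on a prefix keeps the cost of the decision small: already-compressed files skip the expensive full-file deflate entirely.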

12.
Performance Evaluation (2005), 60(1–4): 51–72
In this paper, trace locality of reference (LoR) is identified as a mechanism for predicting the behavior of a variety of systems: if two objects were accessed near each other in the past and the first one is accessed again, trace LoR predicts that the second one will be accessed in the near future. To capture trace LoR, a trace graph is introduced. Although trace LoR can be observed in a variety of systems, the focus of this paper is to characterize it for data accesses in memory management systems. In this setting it is compared with recency-based prediction (the LRU stack), and it is shown that the model is not only much simpler but also outperforms recency-based prediction in all cases. The paper examines various parameters affecting trace LoR, such as object size, caching effects (address reference stream versus miss address stream), and access type (read, write, or both). It shows that object size does not have a meaningful effect on trace LoR, that on average the predictability of the miss address stream is 30% better than that of the address reference stream, and that identifying the access type can increase predictability. Finally, two enhancements are introduced to the model: history and multiple-LRU prediction. A main contribution of this paper is the introduction of n-stride prediction: for a prediction to be useful, there must be sufficient time to load the object, and n-stride prediction shows that trace LoR can predict an access far ahead of its occurrence.
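The successor prediction described above can be sketched as a trace graph with one edge recorded per consecutive access pair. The single-successor, most-frequent-edge policy below is a simplification (it includes neither the history nor the multiple-LRU enhancement):

```python
from collections import defaultdict

class TraceGraphPredictor:
    """Predict the next accessed object from past access traces (sketch)."""

    def __init__(self):
        # succ[a][b] counts how often b was accessed immediately after a.
        self.succ = defaultdict(lambda: defaultdict(int))
        self.prev = None

    def access(self, obj):
        """Record an access; return the predicted next object (or None)."""
        prediction = None
        if self.succ[obj]:
            # Predict the most frequently observed successor of obj.
            prediction = max(self.succ[obj], key=self.succ[obj].get)
        if self.prev is not None:
            self.succ[self.prev][obj] += 1   # add/strengthen the trace edge
        self.prev = obj
        return prediction
```

Unlike an LRU stack, the predictor keys on *which* object was touched, not on how recently, which is why it can fire correctly even for rarely revisited objects.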

13.
Simultaneous multithreading (SMT) has been proposed to improve system throughput by overlapping instructions from multiple threads on a single wide-issue processor. Recent studies have demonstrated that diversity among simultaneously executed applications can bring significant performance gains under SMT. However, the speedup of a single application parallelized into multiple threads is often sensitive to its inherent instruction-level parallelism (ILP), as well as to the efficiency of the synchronization and communication mechanisms between its separate, but possibly dependent, threads. Moreover, as these threads tend to put pressure on the same architectural resources, no significant speedup may be observed. In this paper, we evaluate and contrast thread-level parallelism (TLP) and speculative precomputation (SPR) techniques for a series of memory-intensive codes executed on a specific SMT processor implementation. We explore the performance limits by evaluating the tradeoffs between ILP and TLP for various kinds of instruction streams. By learning how such streams interact when executed simultaneously on the processor, and by quantifying their presence within each application's threads, we interpret the observed performance of each application when parallelized according to the aforementioned techniques. To amplify this evaluation, we also present results gathered from the performance-monitoring hardware of the processor.

14.
Genetic Programming (GP) (Koza, Genetic programming, MIT Press, Cambridge, 1992) is well-known as a computationally intensive technique. Subsequently, faster parallel versions have been implemented that harness the highly parallel hardware provided by graphics cards enabling significant gains in the performance of GP to be achieved. However, extracting the maximum performance from a graphics card for the purposes of GP is difficult. A key reason for this is that in addition to the processor resources, the fast on-chip memory of graphics cards needs to be fully exploited. Techniques will be presented that will improve the performance of a graphics card implementation of tree-based GP by better exploiting this faster memory. It will be demonstrated that both L1 cache and shared memory need to be considered for extracting the maximum performance. Better GP program representation and use of the register file is also explored to further boost performance. Using an NVidia Kepler 670GTX GPU, a maximum performance of 36 billion Genetic Programming Operations per Second is demonstrated.

15.
Codes that have large-stride/irregular-stride (L/I) memory access patterns, e.g., sparse matrix and linked list codes, often perform poorly on mainstream clusters because of the general purpose processor (GPP) memory hierarchy. High performance reconfigurable computers (HPRC) contain both GPPs and field programmable gate arrays (FPGAs) connected via a high-speed network. In this research, simple 64-bit floating-point codes are used to illustrate the runtime performance impact of L/I memory accesses in both software-only and FPGA-augmented codes and to assess the benefits of mapping L/I-type codes onto HPRCs. The experiments documented herein reveal that large-stride software-only codes experience severe performance degradation, whereas large-stride FPGA-augmented codes experience minimal degradation. For experiments with large data sizes, the unit-stride FPGA-augmented code ran about two times slower than software. On the other hand, the large-stride FPGA-augmented code ran faster than software for all the larger data sizes; the largest data size showed a 17-fold runtime speedup.

16.
The performance of memory and I/O systems has not kept up with that of COTS (commercial off-the-shelf) CPUs, and PC clusters built from COTS CPUs are widely employed for HPC. A cache-based processor is far less effective than a vector processor in applications with low spatial locality. Moreover, for HPC, Google-like server farms, and database processing, insufficient main-memory capacity poses a serious problem, and the power consumption of a Google-like server farm or a high-end HPC PC cluster is huge. To overcome these problems, we propose a memory and network enhancer equipped with scatter/gather vector access functions, high-performance network connectivity, and capacity extensibility. Communication mechanisms named LHS and LHC are also proposed; both are architectures for reducing latency for mixed messages with small control data and a large data body. Examples of killer applications for this new type of hardware are presented. This paper presents not only concepts and simulations but also real hardware prototypes named DIMMnet-2 and DIMMnet-3, and evaluates both memory and network issues. We evaluate the module with the NAS CG benchmark (class C) and the Wisconsin benchmarks as applications with memory issues. Although evaluating CG class C is difficult with conventional cycle-accurate simulation methods, we obtained results for class C with our own method. We find that the module can improve maximum performance by about 25 times on the Wisconsin benchmarks. However, results on a cache-based PC show that cache-line flushing degrades the acceleration ratio. This indicates the high potential of the proposed extended memory module when combined with processors that use DMA-based main-memory access and thus need no cache-line flushing, such as the SPU on Cell/B.E. The LHS and LHC communication mechanisms and their effects on latency are also evaluated.

17.
For event-driven sensor network application systems, a hybrid topology control strategy combining proactive and on-demand routing is proposed, based on a simplified AODV (ad hoc on-demand distance vector routing) algorithm (S-AODV). By having a randomly selected subset of nodes run S-AODV in advance, the strategy reduces the delay in establishing the initial topology for task nodes when an event occurs. Simulation experiments show that the strategy trades a small energy cost for a faster system response, satisfying the real-time requirements of event-monitoring applications.

18.
This paper describes an approach to carry out performance analysis of parallel embedded applications. The approach is based on measurement, but in addition, the idea of driving the measurement process (application instrumentation and monitoring) by a behavioral model is introduced. Using this model, highly comprehensible performance information can be collected. The whole approach is based on this behavioral model, one instrumentation method and two tools, one for monitoring and the other for visualization and analysis. Each of these is briefly described, and the steps to carry out performance analysis using them are clearly defined. They are explained by means of a case study. Finally, one method to evaluate the intrusiveness of the monitoring approach is proposed, and the intrusiveness results for the case study are presented.

19.
Computer Networks (2008), 52(17): 3218–3228
As computer networks increase in size, it is critical to provide efficient and scalable network management. Integrating mobile agents (MAs) with the simple network management protocol (SNMP) provides a decentralized network-management architecture that overcomes the limitations of the legacy SNMP client/server structure. However, as an MA travels through its itinerary, acquiring the network state at each managed node, its size increases linearly node by node and may become unexpectedly bloated; a bloated MA has difficulty migrating from one node to another. We show that the network response time grows exponentially as the MA size increases linearly. In this paper, we propose a new strategy called the itinerary partitioning approach (IPA), which exploits the cloning capability of MAs to effectively address this bloated-state phenomenon. An analytical model shows the effectiveness of the proposed IPA in terms of network response time. We have implemented the IPA in a practical test-bed network, and the results are very encouraging.
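A back-of-the-envelope model shows why partitioning helps. Assuming (hypothetically) that an agent's state grows by a fixed amount per visited node and that transfer time is proportional to state size, a single agent's total migration time is quadratic in itinerary length, while k parallel clones each pay only the cost of about n/k hops:

```python
def single_agent_time(n_nodes, base_size, growth, bandwidth):
    """Total migration time of one agent whose state grows at each hop.

    Hop i transfers base_size + i*growth bytes, so the total is quadratic
    in the itinerary length -- the 'bloated state' effect. Parameters are
    illustrative, not the paper's analytical model.
    """
    return sum((base_size + i * growth) / bandwidth for i in range(n_nodes))

def partitioned_time(n_nodes, k, base_size, growth, bandwidth):
    """k clones travel disjoint sub-itineraries in parallel (IPA-style);
    the response time is that of the slowest (here: a full) sub-itinerary."""
    per_clone = -(-n_nodes // k)   # ceil(n_nodes / k)
    return single_agent_time(per_clone, base_size, growth, bandwidth)
```

Because each clone's state is reset at the start of its sub-itinerary, no clone ever reaches the bloated sizes that dominate the single-agent time.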

20.
Clustering network sites is a vital issue in parallel and distributed database systems (DDBS). Grouping distributed-database network sites into clusters is an efficient way to minimize the communication time required for query processing. However, clustering network sites is still an open research problem, since finding its optimal solution is NP-complete; the main contribution in this field is to find a near-optimal solution that groups the sites into disjoint clusters so as to minimize the communication time required for data allocation. Grouping a large number of network sites into a small number of clusters effectively reduces the transaction response time, results in better data distribution, and improves distributed database system performance. We present a novel algorithm for clustering distributed-database network sites based on communication time, since database query processing is time dependent. Extensive experiments and simulations show that the algorithm achieves a better network distribution with significant load balance across network servers and reduced network delay, a lower communication time between network sites, and higher distributed database system performance.
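A simple threshold-based greedy heuristic conveys the flavor of communication-time clustering; it is an illustrative sketch under assumed inputs (a symmetric site-to-site time matrix), not the paper's algorithm:

```python
def cluster_sites(comm_time, threshold):
    """Greedily group sites whose pairwise communication time is small.

    comm_time: symmetric dict-of-dicts of site-to-site communication times.
    A site joins the first existing cluster in which its time to every
    member is <= threshold; otherwise it starts a new cluster. The result
    is a set of disjoint clusters covering all sites.
    """
    clusters = []
    for site in sorted(comm_time):
        for cluster in clusters:
            if all(comm_time[site][member] <= threshold for member in cluster):
                cluster.append(site)
                break
        else:
            clusters.append([site])    # no compatible cluster found
    return clusters
```

Allocating data per cluster rather than per site then bounds the intra-cluster communication time by the chosen threshold.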


Copyright © Beijing Qinyun Technology Development Co., Ltd. 京ICP备09084417号