34 results found (search time: 15 ms)
1.
Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the past 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint strategies to minimize state and reduce checkpoint time. One well-known optimization to traditional checkpoint/restart is incremental checkpointing, which has a number of known limitations. To address these limitations, we describe libhashckpt, a hybrid incremental checkpointing solution that uses both page protection and hashing on GPUs to determine changes in application data with very low overhead. Using real capability workloads and a model outlining the viability and application efficiency increase of this technique, we show that hash-based incremental checkpointing can have significantly lower overheads and higher efficiency than traditional coordinated checkpointing approaches at the scales expected for future extreme-class systems.
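The hashing half of the idea in this abstract can be illustrated with a minimal CPU-only sketch. It is not libhashckpt's API, which the abstract does not describe: the function names, the 4096-byte block size, and the use of SHA-256 are all assumptions, and the page-protection and GPU parts are left out. The state is hashed block by block, and only blocks whose hash differs from the previous checkpoint are saved.

```python
from __future__ import annotations

import hashlib

PAGE_SIZE = 4096  # assumed block size; the abstract does not state one


def block_hashes(data: bytes, page_size: int = PAGE_SIZE) -> list[bytes]:
    """Hash every fixed-size block of the application state."""
    return [
        hashlib.sha256(data[i:i + page_size]).digest()
        for i in range(0, len(data), page_size)
    ]


def incremental_checkpoint(
    data: bytes, prev_hashes: list[bytes], page_size: int = PAGE_SIZE
) -> tuple[dict[int, bytes], list[bytes]]:
    """Return (changed_blocks, new_hashes) for the current state.

    changed_blocks maps a block index to its contents for every block
    that is new or whose hash differs from the previous checkpoint.
    """
    new_hashes = block_hashes(data, page_size)
    changed = {}
    for idx, digest in enumerate(new_hashes):
        if idx >= len(prev_hashes) or prev_hashes[idx] != digest:
            changed[idx] = data[idx * page_size:(idx + 1) * page_size]
    return changed, new_hashes


def apply_increment(
    base: bytes, changed: dict[int, bytes], page_size: int = PAGE_SIZE
) -> bytes:
    """Rebuild a state from an older state plus the changed blocks.

    Hypothetical helper; assumes the state size did not change.
    """
    state = bytearray(base)
    for idx, block in changed.items():
        start = idx * page_size
        state[start:start + len(block)] = block
    return bytes(state)
```

With an empty `prev_hashes` the first call saves every block (a full checkpoint); later calls save only the blocks that were modified in between.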
2.
Efficient algorithms for optimistic crash recovery   (total citations: 1; self-citations: 0; cited by others: 1)
Summary. Recovery from transient processor failures can be achieved by using optimistic message logging and checkpointing. The faulty processors roll back, and some or all of the non-faulty processors may also have to roll back. This paper formulates the rollback problem as a closure problem. A centralized closure algorithm is presented together with two efficient distributed implementations. Several related problems are also considered and distributed algorithms are presented for solving them.
S. Venkatesan received the B. Tech. and M. Tech. degrees from the Indian Institute of Technology, Madras in 1981 and 1983, respectively, and the M.S. and Ph.D. degrees in Computer Science from the University of Pittsburgh in 1985 and 1988. He joined the University of Texas at Dallas in January 1989, where he is currently an Assistant Professor of Computer Science. His research interests are in fault-tolerant distributed systems, distributed algorithms, testing and debugging distributed programs, fault-tolerant telecommunication networks, and mobile computing.
Tony Tony-Ying Juang is an Associate Professor of Computer Science at the Chung-Hwa Polytechnic Institute. He received the B.S. degree in Naval Architecture from the National Taiwan University in 1983 and his M.S. and Ph.D. degrees in Computer Science from the University of Texas at Dallas in 1989 and 1992, respectively. His research interests include distributed algorithms, fault-tolerant distributed computing, distributed operating systems and computer communications.
This research was supported in part by NSF under Grant No. CCR-9110177 and by the Texas Advanced Technology Program under Grant No. 9741-036.
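Treating rollback as a closure problem can be illustrated with a small centralized fixed-point computation. This is a generic orphan-message elimination sketch, not the algorithm from the paper: the interval numbering, the message tuple layout, and the name `recovery_line` are assumptions made for illustration. A message becomes an orphan when its send is undone but its receipt survives, so the receiver must roll back to just before the receipt, and the rule is repeated until nothing changes.

```python
def recovery_line(current, failed, messages):
    """Compute how far each process must roll back.

    current:  dict process -> index of its latest state interval
    failed:   dict process -> latest interval the failed process can restore
    messages: iterable of (sender, send_interval, receiver, recv_interval)

    Returns a dict process -> latest interval that survives recovery.
    """
    line = dict(current)
    line.update(failed)  # failed processes start from what they can restore

    changed = True
    while changed:  # iterate to a fixed point (the closure)
        changed = False
        for sender, send_iv, receiver, recv_iv in messages:
            send_undone = send_iv > line[sender]
            receipt_survives = recv_iv <= line[receiver]
            if send_undone and receipt_survives:
                # Orphan message: roll the receiver back before the receipt.
                line[receiver] = recv_iv - 1
                changed = True
    return line
```

The loop terminates because every update strictly lowers one process's interval index, and each index is bounded below.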
3.
Mobile computing systems provide users with access to information regardless of their geographical location. In these systems, Mobile Support Stations (MSSs) play the role of providing reliable and uninterrupted communication and computing facilities to mobile hosts. The failure of an MSS can interrupt the services provided by the mobile system. Two basic schemes for tolerating the failure of MSSs exist in the literature. The first scheme is based on the principle of checkpointing used in distributed systems. The second scheme is based on replicating the state information of mobile hosts in a number of secondary support stations. Depending on the replication scheme used, the second approach is further classified as a pessimistic or an optimistic technique. In this paper, we propose a hybrid scheme which combines the pessimistic and the optimistic replication schemes. The proposed scheme attempts to strike a balance between the long delay caused by the pessimistic scheme and the high memory requirements of the optimistic scheme. We used fuzzy logic to find the best ratio of pessimistic to optimistic secondary stations in the proposed scheme. We also used simulation to compare the performance of the proposed scheme with that of the optimistic and the pessimistic schemes. Simulation results showed that the proposed scheme performs better than either scheme in terms of delay and memory requirements.
4.
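Checkpointing and rollback recovery are effective ways to improve system reliability and achieve fault-tolerant computing. Their performance is usually evaluated by the overhead ratio, and checkpoint overhead is the main factor affecting that ratio. Because parallel programs currently spend considerable time blocked on communication, this paper proposes a method, built on copy-on-write checkpoint buffering, that further reduces checkpoint overhead. By controlling the scheduling of the state-saving thread and choosing a suitable state-saving granularity, the method uses communication blocking time to hide the overhead of running the state-saving thread, and thereby further reduces the overhead ratio.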
5.
In this study, we describe the further development of Elastic Cloud Computing Cluster (EC3), a tool for creating self-managed, cost-efficient, virtual hybrid elastic clusters on top of Infrastructure as a Service (IaaS) clouds. By using spot instances and checkpointing techniques, EC3 can significantly reduce the total execution cost and facilitate automatic fault tolerance. Moreover, EC3 can deploy and manage hybrid clusters across on-premises and public cloud resources, thereby introducing cloud bursting capabilities. We present the results of a case study that we conducted to assess the effectiveness of the tool based on the structural dynamic analysis of buildings. In addition, we evaluated the checkpointing algorithms in a real cloud environment with existing workloads to study their effectiveness. The results demonstrate the feasibility and benefits of this type of cluster for computationally intensive applications.
6.
A survey of checkpointing algorithms for parallel and distributed computers   (total citations: 1; self-citations: 0; cited by others: 1)
A checkpoint is defined as a designated place in a program at which normal processing is interrupted specifically to preserve the status information necessary to allow resumption of processing at a later time. Checkpointing is the process of saving the status information. This paper surveys the algorithms which have been reported in the literature for checkpointing parallel/distributed systems. It has been observed that most of the algorithms published for checkpointing in message passing systems are based on the seminal article by Chandy and Lamport. A large number of articles have been published in this area by relaxing the assumptions made in that article and by extending it to minimise the overheads of coordination and context saving. Checkpointing schemes for shared memory systems primarily extend cache coherence protocols to maintain a consistent memory. All of them assume that the main memory is safe for storing the context. Recently, algorithms have been published for distributed shared memory systems, which extend the cache coherence protocols used in shared memory systems. They also, however, include methods for storing the status of distributed memory in stable storage. Most of the algorithms assume that there is no knowledge about the programs being executed. It is felt, however, that in developing parallel programs the user has to do a fair amount of work in distributing tasks, and that this information can be used effectively to simplify checkpointing and rollback recovery.
7.
Employing fault tolerance often introduces a time overhead, which may cause a deadline violation in real-time systems (RTS). Therefore, for RTS it is important to optimize the fault tolerance techniques such that the probability of meeting the deadlines, i.e. the Level of Confidence (LoC), is maximized. Previous studies have focused on evaluating the LoC for equidistant checkpointing. However, no studies have addressed the problem of evaluating the LoC for non-equidistant checkpointing. In this work, we provide an expression to evaluate the LoC for non-equidistant checkpointing. Further, we detail an exhaustive search approach to find the distribution of a given number of checkpoints that results in the maximal LoC. Since the exhaustive search approach is very time-consuming, we propose the Clustered Checkpointing method, a heuristic that distributes checkpoints in a number of clusters with the goal of maximizing the LoC. The results show that the LoC can be improved when non-equidistant checkpointing is used. Further, the results indicate that the proposed Clustered Checkpointing method is capable of finding the distribution that results in the maximal LoC in a much shorter time than the exhaustive search approach, while considering only a few clusters.
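The exhaustive search described in this abstract can be illustrated with a toy model. The model below is an assumption made for illustration and is not the expression derived in the paper: time is discrete, each time unit of execution is fault-free with independent probability `p_ok`, every segment ends with a checkpoint of cost `checkpoint_cost` at which faults are detected, and a faulty segment is re-executed in full. The function names are hypothetical.

```python
from itertools import combinations


def level_of_confidence(segments, deadline, checkpoint_cost, p_ok):
    """Probability that all segments finish within the deadline (toy model)."""
    # dist[t] = probability that the segments so far used exactly t time units
    dist = [0.0] * (deadline + 1)
    dist[0] = 1.0
    for length in segments:
        cost = length + checkpoint_cost   # one attempt: execute, then checkpoint
        q = p_ok ** length                # probability that an attempt is fault-free
        nxt = [0.0] * (deadline + 1)
        for used, prob in enumerate(dist):
            if prob == 0.0:
                continue
            t, all_failed = used + cost, 1.0
            while t <= deadline:          # success on attempt 1, 2, 3, ...
                nxt[t] += prob * all_failed * q
                all_failed *= 1.0 - q
                t += cost
        dist = nxt
    return sum(dist)


def best_distribution(work, n_checkpoints, deadline, checkpoint_cost, p_ok):
    """Try every placement of n_checkpoints; the last one sits at the end."""
    best_loc, best_segments = -1.0, None
    for cuts in combinations(range(1, work), n_checkpoints - 1):
        bounds = (0,) + cuts + (work,)
        segments = [b - a for a, b in zip(bounds, bounds[1:])]
        loc = level_of_confidence(segments, deadline, checkpoint_cost, p_ok)
        if loc > best_loc:
            best_loc, best_segments = loc, segments
    return best_loc, best_segments
```

The number of placements grows combinatorially with the amount of work and the number of checkpoints, which is consistent with the abstract's remark that exhaustive search is very time-consuming.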
8.
Computational power grids are computing environments with massive resources for processing and storage. While these resources may be pervasive, harnessing them is a major challenge for the average user. NetSolve is a software environment that addresses this concern. A fundamental feature of NetSolve is its integration of fault-tolerance and task migration in a way that is transparent to the end user. In this paper, we discuss how NetSolve’s structure allows for the seamless integration of fault-tolerance and migration in grid applications, and present the specific approaches that have been and are currently being implemented within NetSolve.
9.
R. S. Side, G. C. Shoja. Software, 1994, 24(5): 507-525
Developing a distributed debugger is much more complex than developing a sequential debugger. This added complexity is mainly due to the non-determinism of events that communication delays introduce into distributed systems. We explore the problems that one must address when designing a distributed program debugger and then describe our design and implementation of DPD (distributed program debugger). Problems addressed include non-determinism of events, finding consistent system states, setting breakpoints, recording events, and checkpointing. Important features of DPD include dynamic roll back and replay, as well as a graphical user interface. DPD has been tested successfully in debugging distributed programs within a distributed facility called REM (remote execution manager).
10.
Efficient checkpointing and resumption of multicomputer applications is essential if multicomputers are to support time-sharing and the automatic resumption of jobs after a system failure. We present a checkpointing scheme that is transparent, imposes overhead only during checkpoints, requires minimal message logging, and allows for quick resumption of execution from a checkpointed image. Furthermore, the checkpointing algorithm allows each processor p to continue running the application being checkpointed except during the time that p is actively taking a local snapshot, and requires no global stop or freeze of the multicomputer. Since checkpointing multicomputer applications poses requirements different from those posed by checkpointing general distributed systems, existing distributed checkpointing schemes are inadequate for multicomputer checkpointing. Our checkpointing scheme makes use of special properties of wormhole routing networks to satisfy this new set of requirements.