Similar literature
19 similar articles found.
1.
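Through job-log analysis and assessment experiments, this work analyzes the running stability of parallel jobs on supercomputers. The log analysis shows that the stability of parallel jobs decreases as job execution time grows and as the number of CPUs used increases; once the computation of a parallel job reaches the order of 10^5 CPU-hours, more than 20% of jobs are aborted by system failures. The assessment experiments show that parallel jobs using thousands of CPUs are easily disrupted by a variety of factors and can rarely run continuously for more than 24 hours. Finally, several suggestions are given on improving supercomputer stability, on system management and use, and on parallel program development.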

2.
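Liu Yang, He Huacan, Jiang Yun. 《计算机应用》 (Journal of Computer Applications), 2004, 24(8): 104-105, 109
By providing job-management servers in a multi-virtual-service mode within the cluster system, a job management system in cluster-server mode can address problems such as slow user response and low resource utilization, and improves job execution efficiency. A layered implementation model for the cluster job management system is proposed, and the implementation of each layer is analyzed in detail; the model significantly improves the availability and scalability of the job management system.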

3.
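FCFS with backfilling is the most widely used scheduling policy on supercomputers. To address its shortcomings in response time and system utilization, the DGA method is proposed to improve its performance. The method exploits the moldability of parallel jobs: at scheduling time it predicts the average job response time to choose a suitable request size, and it uses a genetic algorithm to search for the optimal job resource request. Simulations of real job streams on a simulator show that the method significantly improves the scheduling results of FCFS with backfilling and also outperforms existing moldable-job scheduling policies.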

4.
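Research and Application of a Load-Balancing Algorithm for Heterogeneous Clusters Based on TL_Scheduling   Total citations: 1, self-citations: 0, citations by others: 1
The key to load balancing for cluster systems in heterogeneous environments is cross-platform process migration, which brings great difficulty and overhead. Building on traditional process-migration algorithms and taking full account of how well each node suits a submitted job, a new TL-Scheduling load-balancing algorithm is proposed that steers each job to a node suited to executing it; the algorithm can effectively improve system load balance and job execution efficiency. On this basis, a satellite job scheduling system based on XML business-process templates is designed, making the job scheduling system more practical.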

5.
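Traditional job management systems are generally confined to a local area network, support a limited number of users, and do not allow job status and results to be monitored and queried remotely over the Internet. After examining Web technologies, this article designs and implements a Web-based job management system. It describes the system's application model and architecture, as well as the functions of and relationships among its subsystems. It introduces the concept of a job network, which describes complex logical dependencies between jobs and can also express hierarchical and nested relationships between job networks, further broadening the system's range of application.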

6.
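Job Scheduling in Service Grids Based on Ant Colony Optimization   Total citations: 9, self-citations: 0, citations by others: 9
A method that uses the ant colony algorithm to optimize the job scheduling system of a service grid is proposed, together with a two-layer job scheduling model. The model predicts job execution time in the dynamic, heterogeneous grid environment, then uses ant colony optimization on the predicted execution times to minimize the fitness function and thereby obtain an optimized job schedule. Experiments on a campus grid testbed developed for this work show that the method can optimize service-grid performance, reduce average job execution time, and increase system throughput.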

7.
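Job scheduling on high-performance clusters is usually carried out by a job scheduling system, and accurately specified job runtimes can greatly improve scheduling efficiency. Existing studies usually rely on machine-learning prediction, which leaves room for improvement in both accuracy and practicality. To further improve the accuracy of cluster job runtime prediction, the cluster job logs are first clustered and the job class is added to the job features; the attention-based NR-Transformer network is then used to model and predict the job log data. For data processing, seven features are selected from the historical log data set according to their correlation with the prediction target, feature completeness, and data validity; the jobs are divided into several job sets by runtime length, and each set is trained and predicted separately. Experimental results show that, compared with traditional machine learning and BP neural networks, sequential neural network architectures achieve better prediction performance, with NR-Transformer performing well on every job set.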

8.
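With the rapid development of supercomputers, their scale and complexity keep increasing, and reliability and resilience face greater challenges. There are many important fault-tolerance techniques, such as proactive failure avoidance based on fault prediction, reactive fault tolerance based on checkpointing, and scheduling techniques to improve reliability. Qualitative and quantitative descriptions of system fault characteristics are critical for these techniques. This study analyzes the sources of failures on two typical petascale supercomputers, Sunway BlueLight (based on multi-core CPUs) and Sunway TaihuLight (based on heterogeneous many-core CPUs). It reveals some interesting fault characteristics and finds previously unknown correlations among major component faults. Finally, the paper analyzes the failure times of the two supercomputers at various granularities of resources and over different time spans, and builds a uniform multi-dimensional failure-time model for petascale supercomputers.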

9.
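To improve quality of service, a data center must ensure that user jobs are completed before their specified deadlines, so jobs must be scheduled effectively according to real-time system resources. A job scheduling algorithm is proposed that schedules batch jobs according to predicted job execution times so as to minimize the completion time of the batch. The execution-time prediction model is based on a long short-term memory (LSTM) network and predicts the execution time of a user job from the job type, the job size, the number of CPU cores and amount of memory the job requires, and the share of total system resources that the job requires. The predictions are used to judge whether the cluster can complete user jobs on time and to inform the ordering of job execution. Experiments determine the values of the hyperparameters that affect the LSTM model's performance, such as the number of iterations, the learning rate, and the number of network layers. The experiments show that, compared with the SVR, ARIMA, and BP models, the LSTM-based model improves the coefficient of determination R² by 2.97%, 2.34%, and 5.66%, respectively, with an average prediction error of only 0.78%.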
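
For readers unfamiliar with the metric quoted in this abstract, the coefficient of determination is conventionally computed as R² = 1 - SS_res / SS_tot. The sketch below shows that standard formula only; the function name and the runtimes are invented for illustration and are not taken from the paper.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: error left after prediction.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: spread of the actual values around their mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all actual values are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical job runtimes in seconds (illustration only).
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
score = r_squared(actual, predicted)
```

A value close to 1 means the predictions explain almost all of the variance in the observed runtimes.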

10.
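This paper proposes a scheme for system-level energy saving in supercomputers. Using the execution probability of real-time tasks on shared computing resources and a mechanism for safely switching nodes in and out, it implements load monitoring, statistics, and prediction, as well as safe energy-saving decisions, in a supercomputer system. Preliminary experiments show that the proposed decision method can achieve considerable energy savings, with the savings limited by the specific system model, overhead model, and load-prediction results.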

11.
The growing complexity and size of High Performance Computing systems (HPCs) lead to frequent job failures, which may cause significant performance degradation. In order to provide high performance and reliable computing services, an in-depth understanding of the characteristics of HPC job failures is essential. In this paper, we present an empirical study on job failures of 10 public workload data sets collected from 8 large-scale HPCs all over the world. Multiple analysis methods are applied to provide a comprehensive and in-depth understanding of job failures. In order to facilitate design, testing and management of HPCs, we study properties of job failures from the following four aspects: proportion in workload and resource consumption, submission inter-arrival time, locality, and runtime. Our analysis results show that job failure rates are significant in most HPCs, and on average, a failed job often consumes more computational resources than a successful job. We also observe that the submission inter-arrival time of failed jobs is better fit by Generalized Pareto and Lognormal distributions, and the probability of failed job submission follows a "V" shape: decreasing during the first 100 seconds right after the submission of the last failed job and increasing afterward. The majority of job failures come from a small number of users and applications, and furthermore these users are the primary factor related to job failures compared with these applications. We find evidence that failed jobs' lifetime accuracy (runtime / request time) always follows the "bathtub curve". Moreover, job failures exhibit strong locality properties that can support the prediction of failed jobs' occurrence and runtime. Most of these findings are new contributions to the research community, and some findings also reveal important properties of job failures that were misunderstood or poorly understood before.
The wide range of studies in this paper can directly and thoroughly facilitate fault tolerance, scheduling, workload modeling, etc. in HPCs, and lead to better system utility while reducing costs.

12.
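A fault-aware grid resource allocation strategy oriented toward task runtime prediction is proposed and described. It uses proactive fault tolerance, trying to avoid errors or exceptions before a resource fails. The strategy combines prediction of task runtime with prediction of resource uptime and achieves higher resource utilization than ordinary scheduling strategies. The fault-aware scheduling is implemented in the CoBRA grid middleware, and the functions of the modules that implement it are described. Tests use the sleep-task technique in four different scenarios and compare the fault-aware resource allocation with the ordinary FCFS scheduling policy. The results show that, under varying resource availability, the system can shorten the overall execution time of applications, with very small deviation.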

13.
In most parallel supercomputers, submitting a job for execution involves specifying how many processors are to be allocated to the job. When the job is moldable (i.e., there is a choice on how many processors the job uses), an application scheduler called SA can significantly improve job performance by automatically selecting how many processors to use. Since most jobs are moldable, this result has great impact on the current state of practice in supercomputer scheduling. However, the widespread use of SA can change the nature of the workload processed by supercomputers. When many SAs are scheduling jobs on one supercomputer, the decision made by one SA affects the state of the system, therefore impacting other instances of SA. In this case, the global behavior of the system comes from the aggregate behavior caused by all SAs. In particular, it is reasonable to expect the competition for resources to become tougher with multiple SAs, and this tough competition to decrease the performance improvement attained by each SA individually. This paper investigates this very issue. We found that the increased competition indeed makes it harder for each individual instance of SA to improve job performance. Nevertheless, there are two other aggregate behaviors that override increased competition when the system load is moderate to heavy. First, as load goes up, SA chooses smaller requests, which increases efficiency, which effectively decreases the offered load, which mitigates long wait times. Second, better job packing and fewer jobs in the system make it easier for incoming jobs to fit in the supercomputer schedule, thus reducing wait times further. As a result, in moderate to heavy load conditions, a single instance of SA benefits from the fact that other jobs are also using SA.

14.
Grid systems are popular today due to their ability to solve large problems in business and science. Job failures, which are inherent in any computational environment, are more common in grids due to their dynamic and complex nature. Furthermore, traditional methods for job failure recovery have proven costly, so a shift toward proactive and predictive management strategies is necessary in such systems. In this paper, an effort has been made to predict the outcome of jobs in a production grid environment. First, we investigated the relationship between workload characteristics and job failures by analyzing workload traces of AuverGrid, which is part of the EGEE (Enabling Grids for E-science) project. After the recognition of failure patterns, the success or failure status of jobs during 6 months of AuverGrid activity was predicted with approximately 96% accuracy. The quality of services on the grid can be improved by integrating the result of this work into management services like scheduling and monitoring.

15.
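Execution times estimated by users are usually inaccurate. Combining classification with instance-based learning, and using both template similarity and numerical similarity, the method finds jobs in the scheduling history that are similar to the current job and uses their historical information to predict the current job's execution time. Training and prediction use attributes from the scheduling history: user name, group name, queue name, application name, number of processors requested, execution time requested (estimated) by the user, and amount of memory requested. The parameters of the algorithm are determined by a genetic algorithm. Numerical experiments show that, compared with the existing literature, the method uses fewer parameters while achieving a similar underestimation rate and a lower mean absolute error. On the HPC2N04 and HPC2N05 log data sets, the mean absolute error is reduced by 43% and 77%, respectively. The effect on job scheduling of replacing user estimates with online predictions is studied, the results are given a preliminary analysis, and directions for future improvement are identified.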
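
The general idea of template-based prediction and the two metrics named in this abstract can be sketched as follows. This is a deliberately simplified illustration, not the paper's algorithm: it matches only on exact template fields (no numerical similarity, no genetic-algorithm tuning), and every function name, field name, and value is hypothetical.

```python
from statistics import mean


def predict_runtime(job, history, keys=("user", "app", "queue")):
    """Average runtime of past jobs sharing the template fields in `keys`;
    fall back to the user's requested time when no past job matches."""
    similar = [h["runtime"] for h in history
               if all(h[k] == job[k] for k in keys)]
    return mean(similar) if similar else job["requested"]


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted runtimes."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def underestimate_rate(actual, predicted):
    """Fraction of jobs whose predicted runtime is below the actual one."""
    return sum(p < a for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical scheduling history and new jobs (runtimes in seconds).
history = [
    {"user": "u1", "app": "cfd", "queue": "q", "runtime": 100},
    {"user": "u1", "app": "cfd", "queue": "q", "runtime": 140},
    {"user": "u2", "app": "md", "queue": "q", "runtime": 50},
]
jobs = [
    {"user": "u1", "app": "cfd", "queue": "q", "requested": 600, "runtime": 130},
    {"user": "u3", "app": "md", "queue": "q", "requested": 300, "runtime": 90},
]
predicted = [predict_runtime(j, history) for j in jobs]
actual = [j["runtime"] for j in jobs]
mae = mean_absolute_error(actual, predicted)
under = underestimate_rate(actual, predicted)
```

The underestimation rate is tracked separately from the error because many schedulers terminate a job that exceeds its declared runtime, so a prediction that is too low can cost more than one that is too high.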

16.
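For workloads that mix compute-intensive and data-intensive jobs, in a dynamic environment where jobs have time limits, traditional grid job scheduling methods are extended and three heuristic grid job scheduling algorithms are proposed: Emin-min, Ebest, and Esufferage. The three algorithms are evaluated on a grid model made up of multiple clusters connected by a high-speed network. Compared with the Min-min algorithm, all three perform better. Compared with the ASJS algorithm, Emin-min reduces waiting time and job makespan; Esufferage reduces waiting time and makespan at the cost of completing fewer jobs; Ebest completes about the same number of jobs as ASJS but increases waiting time and makespan. Overall, Emin-min has the largest advantage.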
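
As background for the comparison above, the baseline Min-min heuristic can be sketched as below. This shows only the classic baseline, not the Emin-min, Ebest, or Esufferage extensions proposed in the paper; the function name and the execution-time matrix are made up for illustration.

```python
def min_min(exec_time):
    """Classic Min-min heuristic.

    exec_time[t][m] is the estimated run time of task t on machine m.
    Returns (assignment, makespan), where assignment maps task -> machine.
    """
    if not exec_time:
        return {}, 0.0
    n_machines = len(exec_time[0])
    ready = [0.0] * n_machines          # time at which each machine is free
    unscheduled = set(range(len(exec_time)))
    assignment = {}
    while unscheduled:
        best = None                     # (completion_time, task, machine)
        # Find the task/machine pair with the smallest completion time,
        # i.e. the task whose minimum completion time is smallest overall.
        for t in sorted(unscheduled):
            for m in range(n_machines):
                ct = ready[m] + exec_time[t][m]
                if best is None or ct < best[0]:
                    best = (ct, t, m)
        ct, t, m = best
        assignment[t] = m
        ready[m] = ct                   # machine m is busy until ct
        unscheduled.remove(t)
    return assignment, max(ready)


# Hypothetical run times: three tasks on two machines.
assignment, makespan = min_min([[4, 6], [3, 5], [8, 2]])
```

Min-min tends to place short tasks first, which keeps average waiting time low but can leave long tasks until the end.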

17.
Jiang, Miao; Fang, Yi; Xie, Huangming; Chong, Jike; Meng, Meng. World Wide Web, 2019, 22(1): 325-345

Major job search engines aggregate tens of millions of job postings online to enable job seekers to find valuable employment opportunities. Predicting the probability that a given user clicks on jobs is crucial to job search engines, as the prediction can be used to provide personalized job recommendations for job seekers. This paper presents a real-world job recommender system in which job seekers subscribe to email alerts to receive new job postings that match their specific interests. The architecture of the system is introduced with a focus on the recommendation and ranking component. Based on observations of the click behavior of a large number of users in a major job search engine, we develop a set of features that reflect the click behavior of individual job seekers. Furthermore, we observe that patterns of missing features may indicate various types of job seekers. We propose a probabilistic model to cluster users based on missing features and learn the corresponding prediction models for individual clusters. The parameters in this clustering-prediction process are jointly estimated by the EM algorithm. We conduct experiments on a real-world testbed by comparing various models and features. The results demonstrate the effectiveness of our proposed personalized approach to user click prediction.


18.
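Zhang Weizhe, Zhang Hongli, Zhang Yuanjing. 《软件学报》 (Journal of Software), 2010, 21(Z1): 238-250
For performance prediction of MPI-based parallel jobs, and in view of the limitations of history-based prediction and analytical modeling in heterogeneous network computing environments, a prediction method based on constructing case programs (判例) is proposed. Wrapper functions are inserted at the PMPI interface of the MPI library to capture communication logs, and algorithms for normalizing and merging the logs are designed. The core problem, shrinking loops in the log, is transformed into the problem of shrinking repeated substrings in a string, and a suffix-array-based algorithm is proposed that outperforms existing algorithms both in theory and in practice. In the automatic construction stage, the problem of scaling computation time and communication time proportionally is solved, and a method for automatically building executable case programs is designed. Experiments on homogeneous and heterogeneous clusters show that the method estimates the running time of computing jobs fairly accurately, with error of no more than 3% on homogeneous clusters and no more than 10% on heterogeneous clusters, and good overall performance compared with similar algorithms.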
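
To illustrate what shrinking repeated substrings in a communication log means, here is a naive greedy sketch. It is not the suffix-array algorithm from the paper and is far less efficient; it only shows the kind of output such a step produces. The function name and the event names are hypothetical.

```python
def collapse_repeats(events):
    """Greedily fold immediately repeated blocks into (block, count) pairs.

    Naive illustration only: it rescans the sequence for every block
    length, whereas the paper uses a suffix-array-based method.
    """
    out = []
    i = 0
    n = len(events)
    while i < n:
        best_len, best_count = 1, 1
        # Try every block length that could repeat at least twice from i.
        for length in range(1, (n - i) // 2 + 1):
            block = events[i:i + length]
            count = 1
            while events[i + count * length:i + (count + 1) * length] == block:
                count += 1
            # Prefer the repetition that covers the most events.
            if count > 1 and count * length > best_len * best_count:
                best_len, best_count = length, count
        out.append((tuple(events[i:i + best_len]), best_count))
        i += best_len * best_count
    return out


# Hypothetical MPI call trace: a send/recv loop followed by a barrier.
trace = ["send", "recv", "send", "recv", "send", "recv", "barrier"]
folded = collapse_repeats(trace)
```

Folding the trace this way recovers the loop structure of the original program, which is what allows a much shorter program with the same communication pattern to be generated.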

19.
The aim of this study is to predict automatic trading decisions in stock markets. Comprehensive features (CF) for predicting future trends are very difficult to generate in a complex environment, especially in stock markets. According to related work, relevant stock information can help investors formulate objectives that may result in better profits. With this in mind, we present a framework of an intelligent stock trading system using comprehensive features (ISTSCF) to predict future stock trading decisions. The ISTSCF consists of stock information extraction, prediction model learning and stock trading decision. We apply three different methods to generate comprehensive features: sentiment analysis (SA), which extracts sensitive market events from stock news articles to produce sentiment indices (SI); technical analysis (TA), which yields effective trading rules from trading information on the stock exchange to produce technical indices (TI); and the trend-based segmentation method (TBSM), which derives trading decisions from stock prices to produce trading signals (TS). Experiments on the Taiwan stock market show that the results of employing comprehensive features are significantly better than traditional methods using numeric features alone (without textual sentiment features).
