1.

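Yue Xiaofei, Shi Lan, Zhao Yuhai, Ji Hangxu, Wang Guoren 《Journal of Software (软件学报)》2022,33(3):985-1004

The emerging distributed computing framework Apache Flink supports running large-scale iterative programs on a cluster, but its default static resource allocation mechanism makes it impossible to configure resources sensibly so that iterative jobs finish on time. To address this problem, users should actively express their performance constraints rather than passively reserve resources; the paper therefore proposes a dynamic resource allocation strategy based on runtime prediction, RABORP (resource allocation...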

2.

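A user-behavior-based method for predicting job failures on supercomputers

Tang Yangkun, Xian Gang, Yang Wenxiang, Yu Jie, Zhang Xiaorong, Wang Yaobin 《Computer Engineering & Science (计算机工程与科学)》2022,44(10):1753-1761

As supercomputers keep growing in scale and scientific applications become more complex, many jobs on supercomputers fail. Job failures waste resources and lengthen the waiting time of queued jobs, seriously reducing the execution efficiency of the system. Predicting job failures in advance makes it possible to take measures that improve resource utilization and execution efficiency, which is critical for future exascale supercomputers. To this end, the paper studies how to predict job failures from known traditional features and from constructed features, and identifies features, together with ways of processing them, that reflect users' working behavior patterns and submission behavior patterns. By combining behavioral and traditional features, it proposes a comprehensive framework based on tree-structured models to predict job failures. Experimental results show that its predictions outperform other related methods.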

3.

Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling Total citations: 1, self-citations: 0, citations by others: 1

Mu'alem A.W. Feitelson D.G. 《Parallel and Distributed Systems, IEEE Transactions on》2001,12(6):529-543

Scheduling jobs on the IBM SP2 system and many other distributed-memory MPPs is usually done by giving each job a partition of the machine for its exclusive use. Allocating such partitions in the order in which the jobs arrive (FCFS scheduling) is fair and predictable, but suffers from severe fragmentation, leading to low utilization. This situation led to the development of the EASY scheduler which uses aggressive backfilling: Small jobs are moved ahead to fill in holes in the schedule, provided they do not delay the first job in the queue. We compare this approach with a more conservative approach in which small jobs move ahead only if they do not delay any job in the queue and show that the relative performance of the two schemes depends on the workload. For workloads typical on SP2 systems, the aggressive approach is indeed better, but, for other workloads, both algorithms are similar. In addition, we study the sensitivity of backfilling to the accuracy of the runtime estimates provided by the users and find a very surprising result. Backfilling actually works better when users overestimate the runtime by a substantial factor.
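
The aggressive (EASY) rule in the abstract above can be illustrated with a minimal Python sketch. It assumes a simplified machine model in which a job needs only a node count and a user runtime estimate; the names `Job`, `Running`, `shadow_time_and_extra`, and `can_backfill` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Job:
    nodes: int       # nodes requested
    estimate: float  # user-supplied runtime estimate


@dataclass
class Running:
    nodes: int  # nodes held by a running job
    end: float  # expected end time (start + estimate)


def shadow_time_and_extra(head, running, free_nodes, now):
    """Return the earliest start time of the head job (the "shadow time")
    and the number of nodes that remain unused once it starts."""
    avail = free_nodes
    if avail >= head.nodes:
        return now, avail - head.nodes
    # Release running jobs in order of expected completion.
    for r in sorted(running, key=lambda r: r.end):
        avail += r.nodes
        if avail >= head.nodes:
            return r.end, avail - head.nodes
    raise ValueError("head job can never fit on this machine")


def can_backfill(job, head, running, free_nodes, now):
    """Aggressive rule: a candidate may start now if it does not
    delay the first job in the queue."""
    if job.nodes > free_nodes:
        return False
    shadow, extra = shadow_time_and_extra(head, running, free_nodes, now)
    finishes_before_shadow = now + job.estimate <= shadow
    uses_only_extra_nodes = job.nodes <= extra
    return finishes_before_shadow or uses_only_extra_nodes
```

A conservative variant would have to repeat this check against the reservation of every queued job rather than only the first one, which is the distinction the abstract draws between the two schemes.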

4.

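Grid resource allocation oriented to runtime prediction and fault awareness

Zhao Sheng, Wang Yuanyuan 《Computer Engineering and Applications (计算机工程与应用)》2011,47(16):65-68

The paper proposes and describes a grid resource allocation strategy oriented to task runtime prediction and fault awareness (Fault-Aware). It takes a proactive approach to fault tolerance, trying to avoid resource errors or anomalies before they occur. The strategy combines prediction of task runtime with prediction of resource uptime, and achieves higher resource utilization than ordinary scheduling strategies. The fault-aware scheduling is implemented in the CoBRA grid middleware, and the functions of the modules that implement it are described. Testing uses the sleep-task technique, with experiments in four different scenarios that compare the fault-aware resource allocation with the ordinary FCFS scheduling policy. The results show that, when resource availability varies, the system shortens the overall execution time of applications, with very small deviation.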
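
The idea of combining a task runtime prediction with a resource uptime prediction can be sketched as follows. This is only an illustration under stated assumptions: the function name `pick_resource`, the tuple layout, and the best-fit tie-breaking rule are hypothetical and are not taken from the paper or from CoBRA.

```python
def pick_resource(predicted_runtime, resources):
    """Choose a resource whose predicted remaining uptime covers the task.

    resources: list of (name, predicted_remaining_uptime) tuples.
    Returns the name of the chosen resource, or None if none is safe.
    """
    # Keep only resources expected to stay up for the whole task.
    safe = [(uptime, name) for name, uptime in resources
            if uptime >= predicted_runtime]
    if not safe:
        return None
    # Assumption: prefer the tightest fit, leaving long-lived
    # resources free for longer tasks.
    return min(safe)[1]
```

For a task predicted to run 30 time units, a resource with 45 units of predicted uptime would be preferred over one with 100, and a resource with only 20 would be skipped.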

5.

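Decentralized grid scheduling using hill climbing Total citations: 1, self-citations: 0, citations by others: 1

Zhang Lin, Huang Xianjiao 《Computer Engineering and Design (计算机工程与设计)》2006,27(11):2073-2076

To make grid resources easy to extend, grid scheduling should be decentralized. Hill climbing is used to optimize resource selection for single-site jobs across as many computing resources as possible. When a grid scheduler receives a single-site job, hill climbing is activated and, following the neighbor relationships among grid schedulers, finds the most suitable computing system for the job; the fitness of each computing system is expressed as the predicted response time of the job. Experiments simulate the performance difference between decentralized grid scheduling and the individual computing systems. Local scheduling on each computing system uses conservative backfilling, the grid workload is generated from a model, and scheduling performance is measured by the average response time over a segment of the workload. The results show that hill climbing effectively improves the scheduling of single-site jobs even when job submission points are unevenly distributed and runtime estimates are inaccurate.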
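
The hill-climbing step described above can be sketched in a few lines of Python. The sketch assumes that each scheduler knows its neighbors and can query a predicted response time for the job at any site; the names `hill_climb`, `neighbors`, and `predicted_response` are hypothetical.

```python
def hill_climb(start, neighbors, predicted_response):
    """Follow neighbor links toward lower predicted response time.

    start: site whose scheduler received the job.
    neighbors: dict mapping each site to a list of adjacent sites.
    predicted_response: callable, site -> predicted job response time.
    Returns the site where no neighbor offers a strictly better prediction.
    """
    current = start
    while True:
        # Best neighbor of the current site (or the site itself if isolated).
        best = min(neighbors[current], key=predicted_response, default=current)
        if predicted_response(best) >= predicted_response(current):
            return current  # local optimum reached
        current = best
```

Because each move strictly lowers the predicted response time, the loop terminates, but it may stop at a local optimum: a job submitted at a site whose neighbors all look worse stays there even if a distant site would be better.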

6.

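Design and implementation of an optimized job scheduling model for high-performance computing environments Total citations: 1, self-citations: 0, citations by others: 1

Wang Xiaoning, Xiao Haili, Cao Rongqiang 《Computer Engineering & Science (计算机工程与科学)》2017,39(4):619-626

A high-performance computing environment aggregates HPC resources distributed across different regions and organizations, offers users a unified access point and usage model, and relies on system middleware to match user job requests to suitable HPC resources. As the environment's application programming interfaces have been opened up and the number of job requests has grown sharply, the instant scheduling model currently in use fails to process a certain number of requests under highly concurrent job submission, for network and other reasons, and it also lacks flexibility. To address this, the environment's job scheduling model is optimized: an environment-level job queue is introduced, job states at the system layer are refined, job scheduling policies are made more configurable, and a system prototype is implemented on the environment middleware SCE. In tests, under a workload in which a single core service processes nearly 200 job submission requests per minute, no job submission errors caused by system or network problems occurred; out of 1,000 jobs in total, nearly 500 job submission requests completed within 0.3 s and more than 800 completed within 0.5 s.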

7.

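Analysis of job execution support systems in grid environments

Wang Bin, Xu Zhuoqun 《Application Research of Computers (计算机应用研究)》2007,24(2):106-109

A job execution support system in a grid environment lets users remotely submit jobs to grid resources, run scientific computing applications, and manage running jobs. It addresses key issues such as preparing the execution environment, monitoring and reporting status, runtime control, and I/O support. The main existing grid middleware systems all provide job execution and management tools that handle several of the main issues well, but they do not fully meet user needs and require further improvement.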

8.

ATLAS grid workload on NDGF resources: Analysis, modeling, and workload generation

《Future Generation Computer Systems》2015

Evaluating new ideas for job scheduling or data transfer algorithms in large-scale grid systems is known to be notoriously challenging. Existing grid simulators expect to receive a realistic workload as an input. Such input is difficult to obtain in the absence of an in-depth study of representative grid workloads. In this work, we analyze the workload of the ATLAS experiment at CERN at the LHC, processed on the resources of Nordic Data Grid Facility. ATLAS is one of the biggest grid technology users, with extreme demands for CPU power, data volume and bandwidth. The analysis is based on the data sample with ∼1.6 million jobs, 3029 TB of data transfer, and 873 years of processor time. Our additional contributions are (a) scalable workload models that can be used to generate a synthetic workload for a given number of jobs, (b) an open-source workload generator software integrated with existing grid simulators, and (c) suggestions for grid system designers based on the insights of our analysis.

9.

Designing a MapReduce performance model in distributed heterogeneous platforms based on benchmarking approach

Gandomi Abolfazl Movaghar Ali Reshadi Midia Khademzadeh Ahmad 《The Journal of supercomputing》2020,76(9):7177-7203

MapReduce framework is an effective method for big data parallel processing. Enhancing the performance of MapReduce clusters, along with reducing their job execution time, is a fundamental challenge to this approach. In fact, one is faced with two challenges here: how to maximize the execution overlap between jobs and how to create an optimum job scheduling. Accordingly, one of the most critical challenges to achieving these goals is developing a precise model to estimate the job execution time due to the large number and high volume of the submitted jobs, limited consumable resources, and the need for proper Hadoop configuration. This paper presents a model based on MapReduce phases for predicting the execution time of jobs in a heterogeneous cluster. Moreover, a novel heuristic method is designed, which significantly reduces the makespan of the jobs. In this method, first by providing the job profiling tool, we obtain the execution details of the MapReduce phases through log analysis. Then, using machine learning methods and statistical analysis, we propose a relevant model to predict runtime. Finally, another tool called job submission and monitoring tool is used for calculating makespan. Different experiments were conducted on the benchmarks under identical conditions for all jobs. The results show that the average makespan speedup for the proposed method was higher than an unoptimized case.

10.

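The relationship between system load and parallel program runtime Total citations: 6, self-citations: 0, citations by others: 6

Lei Zhou, Xu Zhiwei, Zhu Mingfa 《Journal of Computer Research and Development (计算机研究与发展)》2000,37(7):813-818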

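Load sharing is crucial in parallel processing. A survey of a large body of load-sharing literature shows that earlier work designed its balancing schemes for a given system load and seldom considered the precise relationship between the chosen system load and program runtime. To determine how system load affects the execution of parallel programs, two important system factors that influence it are identified: CPU load and network load. Without loss of generality, and to simplify the measurement of network load, the 2NASPVM parallel Benchmark is chosen as the test subject; in order to obtain the program run...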

11.

Backfilling Using System-Generated Predictions Rather than User Runtime Estimates Total citations: 4, self-citations: 0, citations by others: 4

Tsafrir D. Etsion Y. Feitelson D.G. 《Parallel and Distributed Systems, IEEE Transactions on》2007,18(6):789-803

The most commonly used scheduling algorithm for parallel supercomputers is FCFS with backfilling, as originally introduced in the EASY scheduler. Backfilling means that short jobs are allowed to run ahead of their time provided they do not delay previously queued jobs (or at least the first queued job). However, predictions have not been incorporated into production schedulers, partially due to a misconception (that we resolve) claiming inaccuracy actually improves performance, but mainly because underprediction is technically unacceptable: users will not tolerate jobs being killed just because system predictions were too short. We solve this problem by divorcing kill-time from the runtime prediction and correcting predictions adaptively as needed if they are proved wrong. The end result is a surprisingly simple scheduler, which requires minimal deviations from current practices (e.g., using FCFS as the basis) and behaves exactly like EASY as far as users are concerned; nevertheless, it achieves significant improvements in performance, predictability, and accuracy. Notably, this is based on a very simple runtime predictor that just averages the runtimes of the last two jobs by the same user; counterintuitively, our results indicate that using recent data is more important than mining the history for similar jobs. All the techniques suggested in this paper can be used to enhance any backfilling algorithm and are not limited to EASY.
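
The predictor mentioned in the abstract, which averages the runtimes of the last two jobs by the same user, can be sketched as below. The class name `RecentAveragePredictor` is hypothetical, and two details are assumptions of this sketch rather than statements from the abstract: the fallback to the user estimate when no history exists, and capping the prediction at the user estimate (which continues to serve as the kill-time).

```python
from collections import defaultdict, deque


class RecentAveragePredictor:
    """Predict runtime as the mean of a user's most recent job runtimes."""

    def __init__(self, window=2):
        # Per-user history that keeps only the last `window` runtimes.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def predict(self, user, user_estimate):
        past = self.history[user]
        if not past:
            # Assumption: with no history, fall back to the user estimate.
            return user_estimate
        # Assumption: a job is killed at the user estimate, so the
        # prediction should never exceed it.
        return min(sum(past) / len(past), user_estimate)

    def record(self, user, runtime):
        """Store the actual runtime of a finished job."""
        self.history[user].append(runtime)
```

A scheduler built this way would backfill using `predict(...)` while still killing jobs only at the user estimate, and would raise a prediction whenever a job outlives it, which is the adaptive correction the abstract refers to.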

12.

User-level failure detection and auto-recovery of parallel programs in HPC systems

Guozhen ZHANG Yi LIU Hailong YANG Jun XU Depei QIAN 《Frontiers of Computer Science》2021,15(6):156107

As the mean-time-between-failures (MTBF) continues to decline with the increasing number of components on large-scale high performance computing (HPC) systems, program failures might occur during the execution period with high probability. Ensuring successful execution of the HPC programs has become an issue that the unprivileged users should be concerned with. From the user perspective, if the program failure cannot be detected and handled in time, it would waste resources and delay the progress of program execution. Unfortunately, the unprivileged users are unable to perform program state checking due to execution control by the job management system as well as the limited privilege. Currently, automated tools for supporting user-level failure detection and auto-recovery of parallel programs in HPC systems are missing. This paper proposes an innovative method for the unprivileged user to achieve failure detection of job execution and automatic resubmission of failed jobs. The state checker in our method is encapsulated as an independent job to reduce interference with the user jobs. In addition, we propose a dual-checker mechanism to improve the robustness of our approach. We implement the proposed method as a tool named automatic re-launcher (ARL) and evaluate it on the Tianhe-2 system. Experiment results show that ARL can detect the execution failures effectively on the Tianhe-2 system. In addition, the communication and performance overhead caused by ARL is negligible. The good scalability of ARL makes it applicable for large-scale HPC systems.

13.

Kronos: towards bus contention-aware job scheduling in warehouse scale computers

Shuai XUE Shang ZHAO Quan CHEN Zhuo SONG Shanpei CHEN Tao MA Yong YANG Wenli ZHENG Minyi GUO 《Frontiers of Computer Science》2023,17(1):171101

While researchers have proposed many techniques to mitigate the contention on the shared cache and memory bandwidth, none of them has considered the memory bus contention due to split lock. Our study shows that the split lock may cause 9X longer data access latency without saturating the memory bandwidth. To minimize the impact of split lock, we propose Kronos, a runtime system composed of an online bus contention tolerance meter and a bus contention-aware job scheduler. The meter characterizes the tolerance of jobs to the “pressure” of bus contention and builds a tolerance model with the polynomial regression technique. The job scheduler allocates user jobs to the physical nodes in a contention aware manner. We design three scheduling policies that minimize the number of required nodes while ensuring the Service Level Agreement (SLA) of all the user jobs, minimize the number of jobs that suffer from SLA violation without enough nodes, and maximize the overall performance without considering the SLA violation, respectively. Adopting the three policies, Kronos reduces the number of the required nodes by 42.1% while ensuring the SLA of all the jobs, reduces the number of the jobs that suffer from SLA violation without enough nodes by 72.8%, and improves the overall performance by 35.2% without considering SLA.

14.

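Modeling computational grid workloads Total citations: 1, self-citations: 0, citations by others: 1

Wang Qingjiang, Zhang Lin 《Computer Engineering (计算机工程)》2007,33(3):76-78

To evaluate job scheduling in computational grids, a grid workload model is built. A job's runtime differs from node to node, and its migration overhead differs between different pairs of nodes. Pure runtime and pure migration overhead are defined so that they do not depend on the performance of grid resources. Drawing on workload models for parallel computers, the distributions of degree of parallelism, pure runtime, and inter-arrival time are obtained. Distributions are then constructed for job submission location, pure migration overhead, the pure-runtime estimation factor, and deadline. An application example shows that the grid workload model can produce a variety of workloads and supports a comprehensive evaluation of job scheduling.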

15.

An empirical examination of the impact of computer information systems on users

《Information & Management》1995,29(4):207-214

This research investigates the effects of computer-based information systems on users and their jobs. Overall, based upon data collected from 101 users, the results show that information systems have a positive effect on four of the five core job dimensions (identity, significance, autonomy, and feedback). No significant impact is detected for the “skill variety” dimension. Results are also reported when users are broken down according to their position in the management hierarchy, and their user type (primary, secondary, end-users).

16.

Connecting Community-Grids by supporting job negotiation with coevolutionary Fuzzy-Systems

Alexander Fölling Christian Grimme Joachim Lepping Alexander Papaspyrou 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2011,15(12):2375-2387

We utilize a competitive coevolutionary algorithm (CA) in order to optimize the parameter set of a Fuzzy-System for job negotiation between Community-Grids. In a Community-Grid, users are submitting jobs to their local High Performance Computing (HPC) sites over time. Now, we assume that Community-Grids are interconnected such that the exchange of jobs becomes possible: Each Community strives for minimizing the response time for their own members by trying to distribute workload to other communities in the Grid environment. For negotiation purposes, a Fuzzy-System is used to steer each site’s decisions whether to distribute or accept workload in a beneficial, yet egoistic direction. In such a system, it is essential that communities can only benefit if the workload is equitably (not necessarily equally) portioned among all participants. That is, if one community egoistically refuses to execute foreign jobs regularly, other HPC sites suffer from overloading. This, in the long run, deteriorates the opportunity to utilize them for job delegation. Thus, the egoistic community will degrade its own average performance. This scenario is particularly suited for the application of a competitive CA: the Fuzzy-Systems of the participating communities are modeled as species, which evolve in different populations while having to compete within the commonly shared ecosystem. Using real workload traces and Grid setups, we show that the opportunistic cooperation leads to significant improvements for both each community and the overall system.

17.

Coalition theory based task scheduling algorithm using DLFC-NN model

Ashis Kumar Mishra Subasish Mohapatra Pradip Kumar Sahu 《Concurrency and Computation》2024,36(10):e8005

Resource management and job scheduling are essential in today's cloud computing world. Due to task scheduling and users' diverse submission of large-scale requests, co-located VM instances negatively impacted the performance of leased VM instances. This workload further led to resource rivalry across co-located VMs. In order to address the aforementioned problems, numerous strategies have been presented; however, they fail to take the asynchronous nature of the cloud environment into account. To address this issue, a novel “CTA using DLFC-NN model” is proposed. This proposed approach combines the coalition theory and DLFC-NN techniques by including IRT-OPTICS for task size clustering, digital metrology based on ionized information (DMBII) for defect detection in virtual machines (VMs), and the dynamic levy flight hamster optimization algorithm for processing time optimization of the clusters. However, the implementation of task scheduling in an online environment is limited by a number of presumptions or oversimplifications made by current scheduling systems. As a result, a unique coalition theory is applied to efficiently schedule activities. In addition, the DLFC-NN model is used to reduce resource consumption, span time, and be highly accurate and energy-efficient when working on both online and offline jobs. Nevertheless, while optimizing the clusters' overall execution time, earlier approaches only decreased the make-span time for task scheduling. However, the DLFC-NN model solves the computation problem by using a fully weighted bipartite graph and the pseudo method to determine the fitness of the least makespan time. The enhanced methodology used in this study reduces the scheduling cost and minimizes job completion times according to different task counts when compared to the existing techniques.

18.

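Learning-based performance monitoring and analysis of Spark in container environments

Pi Aidi, Yu Jian, Zhou Xiaobo 《Journal of Computer Applications (计算机应用)》2017,37(12):3586-3591

More and more enterprises use the Spark computing framework for big data analytics. Because Spark is usually deployed in distributed and cloud environments, the system becomes more complex, and monitoring the performance of the Spark framework and finding the jobs that cause performance degradation has always been very difficult. To address this problem, a real-time performance monitoring and analysis method for Spark in distributed container environments is proposed and implemented. First, runtime resource consumption information for jobs is collected and integrated by instrumenting Spark with code and monitoring the API files in Docker containers. Then, a Gaussian mixture model (GMM) is trained on historical information about Spark jobs. Finally, the trained model is used to classify the runtime resource consumption information of Spark jobs and find the jobs that cause performance degradation. Experimental results show that the proposed method detects 90.2% of abnormal jobs, while its impact on Spark job performance is only 4.7%. The method reduces the debugging workload and helps users find abnormal Spark jobs faster.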

19.

Scheduling and planning job execution of loosely coupled applications Total citations: 1, self-citations: 1, citations by others: 0

Enis Afgan Purushotham Bangalore Tibor Skala 《The Journal of supercomputing》2012,59(3):1431-1454

Growth in availability of data collection devices has allowed individual researchers to gain access to large quantities of data that needs to be analyzed. As a result, many labs and departments have acquired considerable compute resources. However, effective and efficient utilization of those resources remains a barrier for the individual researchers because the distributed computing environments are difficult to understand and control. We introduce a methodology and a tool that automatically manipulates and understands job submission parameters to realize a range of job execution alternatives across a distributed compute infrastructure. Generated alternatives are presented to a user at the time of job submission in the form of tradeoffs mapped onto two conflicting objectives, namely job cost and runtime. Such presentation of job execution alternatives allows a user to immediately and quantitatively observe viable options regarding their job execution, and thus allows the user to interact with the environment at a true service level. Generated job execution alternatives have been tested through simulation and on real-world resources and, in both cases, the average accuracy of the runtime of the generated and perceived job alternatives is within 5%.
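
One way to present tradeoffs between the two conflicting objectives named above, job cost and runtime, is to show only the alternatives that no other alternative beats on both. The sketch below illustrates that filtering step; it is an assumption of this example, not the tool's documented method, and the name `pareto_front` and the tuple layout are hypothetical.

```python
def pareto_front(alternatives):
    """Keep the job execution alternatives that are not dominated.

    alternatives: list of (label, cost, runtime) tuples.
    An alternative is dominated if another one is no worse on both
    objectives and strictly better on at least one.
    Returns the surviving alternatives sorted by increasing cost.
    """
    front = []
    for label, cost, runtime in alternatives:
        dominated = any(
            c <= cost and r <= runtime and (c < cost or r < runtime)
            for _, c, r in alternatives
        )
        if not dominated:
            front.append((label, cost, runtime))
    return sorted(front, key=lambda alt: alt[1])
```

With alternatives such as 1 node (cost 1, runtime 100), 4 nodes (cost 3, runtime 30), and 8 nodes (cost 8, runtime 20), all three survive, while an option costing 4 with runtime 40 is dropped because the 4-node option is both cheaper and faster.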

20.

Predicting Job Failures in AuverGrid Based on Workload Log Analysis

Hamid Saadatfar Hamid Fadishei Hossein Deldari 《New Generation Computing》2012,30(1):73-94

Grid systems are popular today due to their ability to solve large problems in business and science. Job failures which are inherent in any computational environment are more common in grids due to their dynamic and complex nature. Furthermore, traditional methods for job failure recovery have proven costly and thus a need to shift toward proactive and predictive management strategies is necessary in such systems. In this paper, an innovative effort has been made to predict the futurity of jobs in a production grid environment. First of all, we investigated the relationship between workload characteristics and job failures by analyzing workload traces of AuverGrid which is a part of EGEE (Enabling Grids for E-science) project. After the recognition of failure patterns, the success or failure status of jobs during 6 months of AuverGrid activity was predicted with approximately 96% accuracy. The quality of services on the grid can be improved by integrating the result of this work into management services like scheduling and monitoring.