期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Composing high-performance schedulers: a case study from real-time simulation

Kaushik Ghosh Richard M. Fujimoto Karsten Schwan 《Concurrency and Computation》1999,11(5):221-245

Dynamic, high-performance or real-time applications require scheduling latencies and throughput not typically offered by current kernel or user-level threads schedulers. Moreover, it is widely accepted that it is important to be able to specialize scheduling policies for specific target applications and their execution environments. This paper presents one solution to the construction of such high-performance, application-specific thread schedulers. Specifically, scheduler implementations are composed from modular components, where individual scheduler modules may be specialized to underlying hardware characteristics or implement precisely the mechanisms and policies desired by application programs. The resulting user-level schedulers' implementations can provide resource guarantees by interaction with kernel-level facilities which provide means of resource reservation. This paper demonstrates the concept of composable schedulers by construction of several compositions for highly dynamic target applications, where low scheduling latencies are critical to application performance. Claims about the importance and effectiveness of scheduler composition are validated experimentally on a shared-memory multiprocessor. Scheduler compositions are optimized to take advantage of different low-level hardware attributes and of knowledge about application requirements specific to certain applications, including a Time Warp-based real-time discrete event simulator. Experimental evaluations are based on synthetic workloads, on a real-time simulation blending simulated with implemented control system components, and on a dynamic robot control program. Measurements indicate that schedulers can be composed and specialized to offer performance similar to that of dedicated scheduling co-processors. Copyright © 1999 John Wiley & Sons, Ltd. 相似文献

2.

<Emphasis FontCategory="NonProportional">rt-muse</Emphasis>: measuring real-time characteristics of execution platforms

Martina Maggio Juri Lelli Enrico Bini 《Real-Time Systems》2017,53(6):857-885

Operating systems code is often developed according to principles like simplicity, low overhead, and low memory footprint. Schedulers are no exceptions. A scheduler is usually developed with flexibility in mind, and this restricts the ability to provide real-time guarantees. Moreover, even when schedulers can provide real-time guarantees, it is unlikely that these guarantees are properly quantified using theoretical analysis that carries on to the implementation. To be able to analyze the guarantees offered by operating systems’ schedulers, we developed a publicly available tool that analyzes timing properties extracted from the execution of a set of threads and computes the lower and upper bounds to the supply function offered by the execution platform, together with information about migrations and statistics on execution times. rt-muse evaluates the impact of many application and platform characteristics including the scheduling algorithm, the amount of available resources, the usage of shared resources, and the memory access overhead. Using rt-muse, we show the impact of Linux scheduling classes, shared data and application parallelism, on the delivered computing capacity. The tool provides useful insights on the runtime behavior of the applications and scheduler. In the reported experiments, rt-muse detected some issues arising with the real-time Linux scheduler: despite having available cores, Linux does not migrate SCHED_RR threads which are enqueued behind SCHED_FIFO threads with the same priority. 相似文献

3.

Thread- and Process-based Implementations of the pSystem Parallel Programming Environment

LUÍS M. B. LOPES FERNANDO M. A. SILVA 《Software》1997,27(3):329-351

相似文献

4.

Application‐specific thread schedulers for distributed applications

Matthew D. Roper Ronald A. Olsson 《Concurrency and Computation》2012,24(3):260-280

This paper describes our work to improve the performance of distributed applications. We aim at certain application characteristics such as balancing load, allowing separately written applications to work better together, allowing a distributed application to adapt its behavior in more flexible ways, and so on. Our approach is to write application‐specific schedulers, which can access the global state of the application in making scheduling decisions. To achieve this goal, we extended our earlier work on CATAPULTS ( C reating A nd T esting AP plication‐specific U ser L evel T hread S chedulers), a domain‐specific language for creating and testing application‐specific user‐level thread schedulers, to distributed applications by adding ‘master schedulers’ for dealing with the distributed parts of applications. This paper presents our design of, experimentation with, and implementation of distributed CATAPULTS. This paper presents several realistic examples to measure the feasibility of this approach, specifically: a website application, an embedded application, and load balancing. Each example has a scheduling goal for which we developed a customized scheduler. We measured the performance with and without the customized scheduler. The customized scheduler for each example was fairly straightforward to develop and each achieved its scheduling goal. Copyright © 2011 John Wiley & Sons, Ltd. 相似文献

5.

Thread prioritization: A thread scheduling mechanism for multiple-context parallel processors

Stuart Fiske William J. Dally 《Future Generation Computer Systems》1995,11(6):503-518

Multiple-context processors provide register resources that allow rapid context switching between several threads as a means of tolerating long communication and synchronization latencies. When scheduling threads on such a processor, we must first decide which threads should have their state loaded into the multiple contexts, and second, which loaded thread is to execute instructions at any given time. In this paper we show that both decisions are important, and that incorrect choices can lead to serious performance degradation. We propose thread prioritization as a means of guiding both levels of scheduling. Each thread has a priority that can change dynamically, and that the scheduler uses to allocate as many computation resources as possible to critical threads. We briefly describe its implementation, and we show simulation performance results for a number of simple benchmarks in which synchronization performance is critical. 相似文献

6.

Fair lateness scheduling: reducing maximum lateness in G-EDF-like scheduling

Jeremy P. Erickson James H. Anderson Bryan C. Ward 《Real-Time Systems》2014,50(1):5-47

In prior work on soft real-time (SRT) multiprocessor scheduling, tardiness bounds have been derived for a variety of scheduling algorithms, most notably, the global earliest-deadline-first (G-EDF) algorithm. In this paper, we devise G-EDF-like (GEL) schedulers, which have identical implementations to G-EDF and therefore the same overheads, but that provide better tardiness bounds. We discuss how to analyze these schedulers and propose methods to determine scheduler parameters to meet several different tardiness bound criteria. We employ linear programs to adjust such parameters to optimize arbitrary tardiness criteria, and to analyze lateness bounds (lateness is related to tardiness). We also propose a particular scheduling algorithm, namely the global fair lateness (G-FL) algorithm, to minimize maximum absolute lateness bounds. Unlike the other schedulers described in this paper, G-FL only requires linear programming for analysis. We argue that our proposed schedulers, such as G-FL, should replace G-EDF for SRT applications. 相似文献

7.

Lesser Bear—a light‐weight process library for SMP computers

Hisashi Oguma Yasuichi Nakayama 《Concurrency and Computation》2001,13(12):1107-1120

We have designed and implemented a light‐weight process (thread) library called ‘Lesser Bear’ for SMP computers. Lesser Bear has high portability and thread‐level parallelism. Creating UNIX processes as virtual processors and a memory‐mapped file as a huge shared‐memory space enables Lesser Bear to execute threads in parallel. Lesser Bear requires exclusive operation between peer virtual processors, and treats a shared‐memory space as a critical section for synchronization of threads. Therefore, thread functions of the previous Lesser Bear are serialized. In this paper, we present a scheduling mechanism to execute thread functions in parallel. In the design of the proposed mechanism, we divide the entire shared‐memory space into partial spaces for virtual processors, and prepare two queues (Protect Queue and Waiver Queue) for each partial space. We adopt an algorithm in which lock operations are not necessary for enqueueing. This algorithm allows us to propose a scheduling mechanism that can reduce the scheduling overhead. The mechanism is applied to Lesser Bear and evaluated by experimental results. Copyright © 2001 John Wiley & Sons, Ltd. 相似文献

8.

Scheduling of hard real-time multi-phase multi-thread (MPMT) periodic tasks

Pierre Courbin Irina Lupu Joël Goossens 《Real-Time Systems》2013,49(2):239-266

In this paper we study the scheduling of parallel and real-time recurrent tasks on multiprocessor platforms. Firstly, we propose a new parallel task model which allows recurrent tasks to be composed of several phases, each one composed of several threads. Each thread requires a single processor for execution and can be scheduled simultaneously. We then propose an algorithm to transpose popular Fork-Join task model to our MPMT task model. Secondly, we define several kinds of real-time schedulers that can be applied to our parallel task model. We distinguish between two scheduling classes: Hierarchical schedulers and Global Thread schedulers. We present and prove correct an exact schedulability test for each class. Lastly, we also evaluate the performance of our scheduling paradigm in comparison with Gang scheduling by means of simulations. In this work we extend the work of Lupu and Goossens in Scheduling of hard real-time multi-thread periodic tasks (Real-Time and Network Systems, 2011) which considers mono-phase multi-thread task model. We extend their previous results to a Multi-Phase Multi-Thread task model. 相似文献

9.

Task mapper and application-aware virtual machine scheduler oriented for parallel computing

Jing ZHANG Xiao-jun CHEN Jun-huai LI Xiang LI 《浙江大学学报:C卷英文版》2012,(3):155-177

We design a task mapper TPCM for assigning tasks to virtual machines, and an application-aware virtual machine scheduler TPCS oriented for parallel computing to achieve a high performance in virtual computing systems. To solve the problem of mapping tasks to virtual machines, a virtual machine mapping algorithm (VMMA) in TPCM is presented to achieve load balance in a cluster. Based on such mapping results, TPCS is constructed including three components: a middleware supporting an application-driven scheduling, a device driver in the guest OS kernel, and a virtual machine scheduling algorithm. These components are implemented in the user space, guest OS, and the CPU virtualization subsystem of the Xen hypervisor, respectively. In TPCS, the progress statuses of tasks are transmitted to the underlying kernel from the user space, thus enabling virtual machine scheduling policy to schedule based on the progress of tasks. This policy aims to exchange completion time of tasks for resource utilization. Experimental results show that TPCM can mine the parallelism among tasks to implement the mapping from tasks to virtual machines based on the relations among subtasks. The TPCS scheduler can complete the tasks in a shorter time than can Credit and other schedulers, because it uses task progress to ensure that the tasks in virtual machines complete simultaneously, thereby reducing the time spent in pending, synchronization, communication, and switching. Therefore, parallel tasks can collaborate with each other to achieve higher resource utilization and lower overheads. We conclude that the TPCS scheduler can overcome the shortcomings of present algorithms in perceiving the progress of tasks, making it better than schedulers currently used in parallel computing. 相似文献

10.

Distributed fair DRAM scheduling in network-on-chips architecture

《Journal of Systems Architecture》2013,59(7):543-550

Memory access scheduling is an effective manner to improve performance of Chip Multi-Processors (CMPs) by taking advantage of the timing characteristics of a DRAM. A memory access scheduler can subdivide resources utilization (banks and rows) to increase throughput by accessing different DRAM banks in parallel. However, different threads running on different cores may exhibit different performance. One thread may experience starvation while the others are serviced normally. Therefore, designing a scheduler which reduces the unfairness in the DRAM system, while also improving system throughput on a variety of workloads and systems, is necessary. In this paper, a distributed fair DRAM scheduling for two-dimensional mesh network-on-chips (NoCs), called DFDS, is presented. The key design points in DFDS are: (i) assessing the total waiting cycles of a memory request in NoC and considering it as a metric in arbitration. For this purpose waiting cycles of a memory request are put in an additional flit in a packet and are updated while traversing the NoC, and (ii) proposing a semi-dynamic virtual channel allocation to provide in-order memory requests to memory controllers (MCs). Consequently, we use a simple scheduling algorithm in MCs, instead of complex algorithms. To validate our approach, we apply synthetic and real workload from Parsec benchmark suite. The results show effectiveness of our approach, as we reduce the waiting time of memory requests by up to 15%. 相似文献

11.

Building a Basic Block Instruction Scheduler with Reinforcement Learning and Rollouts

McGovern Amy Moss Eliot Barto Andrew G. 《Machine Learning》2002,49(2-3):141-160

The execution order of a block of computer instructions on a pipelined machine can make a difference in running time by a factor of two or more. Compilers use heuristic schedulers appropriate to each specific architecture implementation to achieve the best possible program speed. However, these heuristic schedulers are time-consuming and expensive to build. We present empirical results using both rollouts and reinforcement learning to construct heuristics for scheduling basic blocks. In simulation, the rollout scheduler outperformed a commercial scheduler on all benchmarks tested, and the reinforcement learning scheduler outperformed the commercial scheduler on several benchmarks and performed well on the others. The combined reinforcement learning and rollout approach was also very successful. We present results of running the schedules on Compaq Alpha machines and show that the results from the simulator correspond well to the actual run-time results. 相似文献

12.

Multicast support in multi-chip centralized schedulers in Input Queued switches

Andrea Bianco Alessandra Scicchitano 《Computer Networks》2009,53(7):1040-1049

IQ switches store packets at input ports to avoid the memory speedup required by OQ switches. However, packet schedulers are needed to determine an I/O (input/output) interconnection pattern that avoids conflicts among packets at output ports. Today, centralized, single-chip, scheduler implementation are largely dominant. In the near future, the multi-chip scheduler implementation will be needed to reduce the hardware scheduler complexity in very large, high-speed, switches. However, the multi-chip implementation implies introducing a non-negligible delay among input and output selectors used to determine the I/O interconnection pattern at each time slot. This delay, mainly due to inter-chip latency, requires modifications to traditional scheduling algorithms, which normally rely on the hypothesis that information exchange among selectors can be performed with negligible delay. We propose a novel multicast scheduler, named IMRR, an extension of a previously proposed multicast scheduling algorithm named mRRM, making it suitable to a multi-chip implementation, and examine its performance by simulation. 相似文献

13.

Scheduling communication in multithreaded programs: experimental results

Juan Carlos Gomez Vernon Rego V. S. Sunderam 《Concurrency and Computation》2006,18(1):1-28

When the critical path of a communication session between end points includes the actions of operating system kernels, there are attendant overheads. Along with other factors, such as functionality and flexibility, such overheads motivate and favor the implementation of communication protocols in user space. When implemented with threads, such protocols may hold the key to optimal communication performance and functionality. Based on implementations of reliable user‐space protocols supported by a threads framework, we focus on our experiences with internal threads' scheduling techniques and their potential impact on performance. We present scheduling strategies that enable threads to do both application‐level and communication‐related processing. With experiments performed on a Sun SPARC‐5 LAN environment, we show how different scheduling strategies yield different levels of application‐processing efficiency, communication latency and packet‐loss. This work forms part of a larger study on the implementation of multiple thread‐based protocols in a single address space, and the benefits of coupling protocols with applications. Copyright © 2005 John Wiley & Sons, Ltd. 相似文献

14.

一种新型实时调度算法研究 总被引：2，自引：0，他引：2

何东之李伟张向文《小型微型计算机系统》2005,26(11):1965-1970

在许多片上特定应用系统中，任务多且切换频繁，任务切换开销大，有时甚至严重影响系统的可调度性．研究了动态可抢占门限调度算法，它通过初始门限值、动态门限值的计算和优化线程分配，实现了在处理器高利用率下，有效降低任务切换开销的目的，并相应地减少了对内存的需求．动态可抢占门限调度算法是将静态抢占门限算法与动态调度算法有机地结合在一起。完成了由静态到动态无缝转换．相似文献

15.

Thread Scheduling for Multiprogrammed Multiprocessors

N. S. Arora R. D. Blumofe C. G. Plaxton 《Theory of Computing Systems》2001,34(2):115-144

We present a user-level thread scheduler for shared-memory multiprocessors, and we analyze its performance under multiprogramming. We model multiprogramming with two scheduling levels: our scheduler runs at user-level and schedules threads onto a fixed collection of processes, while below this level, the operating system kernel schedules processes onto a fixed collection of processors. We consider the kernel to be an adversary, and our goal is to schedule threads onto processes such that we make efficient use of whatever processor resources are provided by the kernel. Our thread scheduler is a non-blocking implementation of the work-stealing algorithm. For any multithreaded computation with work T ₁ and critical-path length T _∈ fty , and for any number P of processes, our scheduler executes the computation in expected time O(T ₁ /P _A + T _∈ fty P/P _A ) , where P _A is the average number of processors allocated to the computation by the kernel. This time bound is optimal to within a constant factor, and achieves linear speedup whenever P is small relative to the parallelism T ₁ /T _∈ fty . Online publication February 26, 2001. 相似文献

16.

ForestGOMP: An Efficient OpenMP Environment for NUMA Architectures

François Broquedis Nathalie Furmento Brice Goglin Pierre-André Wacrenier Raymond Namyst 《International journal of parallel programming》2010,38(5-6):418-439

Exploiting the full computational power of current hierarchical multiprocessor machines requires a very careful distribution of threads and data among the underlying non-uniform architecture so as to avoid remote memory access penalties. Directive-based programming languages such as OpenMP, can greatly help to perform such a distribution by providing programmers with an easy way to structure the parallelism of their application and to transmit this information to the runtime system. Our runtime, which is based on a multi-level thread scheduler combined with a NUMA-aware memory manager, converts this information into scheduling hints related to thread-memory affinity issues. These hints enable dynamic load distribution guided by application structure and hardware topology, thus helping to achieve performance portability. Several experiments show that mixed solutions (migrating both threads and data) outperform work-stealing based balancing strategies and next-touch-based data distribution policies. These techniques provide insights about additional optimizations. 相似文献

17.

Achieving service rate objectives with decay usage scheduling

Hellerstein J.L. 《IEEE transactions on pattern analysis and machine intelligence》1993,19(8):813-825

Decay usage scheduling is a priority- and usage-based approach to CPU allocation in which preference is given to processes that have consumed little CPU in the recent past. The author develops an analytic model for decay usage schedulers running compute-bound workloads, such as those found in many engineering and scientific environments; the model is validated from measurements of a Unix system. This model is used in two ways. First, ways to parameterize decay usage schedulers are studied to achieve a wide range of service rates. Doing so requires a fine granularity of control and a large range of control. The results show that, for a fixed representation of process priorities a larger range of control makes the granularity of control coarser, and a finer granularity of control decreases the range of control. A second use of the analytic model is to construct a low overhead algorithms for achieving service rate objectives. Existing approaches require adding a feedback loop to the scheduler. This overhead is avoided by exploiting the feedback already present in decay usage schedulers. Using both empirical and analytical techniques, it is shown that the algorithm is effective and that it provides fairness when the system is over- or under-loaded 相似文献

18.

Provably Efficient Online Nonclairvoyant Adaptive Scheduling

Yuxiong He Wen-Jing Hsu Leiserson C.E. 《Parallel and Distributed Systems, IEEE Transactions on》2008,19(9):1263-1279

Multiprocessor scheduling in a shared multiprogramming environment can be structured in two levels, where a kernel-level job scheduler allots processors to jobs and a user-level thread scheduler maps the ready threads of a job onto the allotted processors. We present two provably-efficient two-level scheduling schemes called G-RAD and S-RAD respectively. Both schemes use the same job scheduler RAD for the processor allotments that ensures fair allocation under all levels of workload. In G-RAD, RAD is combined with a greedy thread scheduler suitable for centralized scheduling; in S-RAD, RAD is combined with a work-stealing thread scheduler more suitable for distributed settings. Both G-RAD and S-RAD are non-clairvoyant. Moreover, they provide effective control over the scheduling overhead and ensure efficient utilization of processors. We also analyze the competitiveness of both G-RAD and S-RAD with respect to an optimal clairvoyant scheduler. In terms of makespan, both schemes can achieve O(1)-competitiveness for any set of jobs with arbitrary release time. In terms of mean response time, both schemes are O(1)-competitive for arbitrary batched jobs. To the best of our knowledge, G-RAD and S-RAD are the first non-clairvoyant scheduling algorithms that guarantee provable efficiency, fairness and minimal overhead. 相似文献

19.

Garbage collection-aware utility accrual scheduling

Shahrooz Feizabadi Godmar Back 《Real-Time Systems》2007,36(1-2):3-22

The convenience and robustness of automatic memory management have long been exploited by modern systems that use type-safe programming languages such as Java. The timeliness requirements of real-time systems, however, impose specific demands on the operational parameters of the garbage collector. The memory requirements of real-time tasks must be accommodated with a predictable impact on the time-line and under the purview of the scheduler. Utility Accrual is a method of dynamic overload scheduling that is designed to respond to overload conditions by producing a schedule that heuristically maximizes a pre-defined metric of utility. Traditionally, UA schedulers have focused primarily on CPU overload. We explore memory overload conditions in which the memory demands exceed the system’s available memory bandwidth. This paper presents a utility accrual algorithm for uniprocessor CPU and garbage collection scheduling that addresses such memory overload conditions. By tightly linking CPU and memory allocation, the scheduler can appropriately respond to overload along both dimensions. This scheduler is the first of its kind to enable the use of automatic memory management in a utility accrual system. Experimental results based on actual Java application profiles indicate the benefits of our model when compared to memory-unaware scheduling. 相似文献

20.

Signals,timers, and continuations for multithreaded user‐level protocols

Juan Carlos Gomez Jorge R. Ramos Vernon Rego 《Software》2006,36(5):449-471

Precise timing and asynchronous I/O are appealing features for many applications. Unix kernels provide such features on a per‐process basis, using signals to communicate asynchronous events to applications. Per‐process signals and timers are grossly inadequate for complex multithreaded applications that require per‐thread signals and timers that operate at finer granularity. To respond to this need, we present a scheme that integrates asynchronous (Unix) signals with user‐level threads, using the ARIADNE system as a platform. This is done with a view towards support for portable, multithreaded, and multiprotocol distributed applications, namely the CLAM (connectionless, lightweight, and multiway) communications library. In the same context, we propose the use of continuations as an efficient mechanism for reducing thread context‐switching and busy‐wait overheads in multithreaded protocols. Our proposal for integrating timers and signal‐handling mechanisms not only solves problems related to race conditions, but also offers an efficient and flexible interface for timing and signalling threads. Copyright © 2006 John Wiley & Sons, Ltd. 相似文献