Scheduling for Routers and OSs:
How to Schedule when you're Blind
People
Adam Wierman
Misja Nuyens
Mor Harchol-Balter
Nikhil Bansal
Motivation
While many computer applications, e.g. web servers, have knowledge of the service demands of requests, routers and operating systems have no knowledge about the remaining lengths of the network flows they schedule. As a result, policies that use job size information, such as Shortest-Remaining-Processing-Time (SRPT), cannot be applied in these applications. However, even when the scheduler cannot use job size information to prioritize, it can use other statistics in order to prioritize jobs that are likely to have small remaining service demands. One such statistic is the age or attained service of a job, which is defined as the amount of service already given to a job. By giving priority to the jobs with the Least Attained Service (LAS), it is possible to achieve good performance under heavy-tailed service distributions (which are common across computer systems). This is because, under the heavy tailed distributions that are characteristic of IP flow durations, jobs that have received a large amount of service are likely to be very large, and thus have remaining sizes that are still large. LAS has been studied since the early 1960s, however there remain many unanswered questions about the policy. Our work is motivated by the following question: How close can the performance of LAS be to the performance of SRPT? In other words, how much does LAS suffer because it is blind?
Results and Impact
We have shown that, under some distributions, LAS loses very little. In particular, when the service distribution is Pareto (or more generally regularly varying), LAS is constant competitive, i.e. has mean response time within a constant of that of SRPT regardless of the system load. Further, in this setting, the tail of the distribution of response times under LAS is asymptotically equivalent to that under SRPT. Due to the prevalence of heavy-tailed distributions in computer systems, results such as these have led to LAS being applied in both routers, e.g. the work of Rai, Urvoy-Keller, and Biersack, and operating systems, e.g. the work of Feng, Misra, and Rubenstein. However, our results also illustrate that one must be careful when applying LAS: Under light-tailed distributions LAS is no longer constant competitive; and further, the tail of the response time distribution is asymptotically heavier than that of SRPT. In addition, one must be careful about fairness when using LAS. We have shown that LAS is unfair to medium-large jobs under all service distributions with finite variance, though the degree of unfairness is small as long as the load is not too high.
Publications
-
On the Gittins Index in an M/GI/1 queueUnder preparation.
-
Moment conditions for the foreground-background queueUnder preparation.
-
Many flows asymptotics for SMART scheduling policiesUnder submission.Scheduling policies that are biased toward small jobs have received growing attention due to their superior mean delay performance. Such policies include Shortest-Remaining-Processing-Time (SRPT), Preemptive-Shortest-Job-First (PSJF), and Least-Attained-Service (LAS). In this paper, we study the delay distribution of LAS and the class of scheduling policies called SMART-LD (SMAll-Response- Time for Large-Deviations) that included SRPT, PSJF, and their variants to understand policies that prioritized short jobs. We study the delay distribution (rate function) of the SMART-LD class and LAS in a discrete-time queueing system under the many sources large deviations regime. We prove that all SMART-LD policies have the same asymptotic delay distribution as SRPT and illustrate the improvements SMART-LD policies and LAS make over First-Come-First-Served (FCFS). Furthermore, we show that the delay distribution of SMART-LD policies stochastically improves upon the delay distribution of LAS across all job sizes.
-
Operations Research, 2008. 56(1):88-101.Recently, the class of SMART scheduling policies (disciplines) has been introduced in order to formalize the common heuristic of ``biasing toward small jobs.'' We study the tail of the sojourn-time (response-time) distribution under both SMART policies and the Foreground-Background policy (FB) in the GI/GI/1 queue. We prove that these policies behave very well under heavy-tailed service times, but behave poorly under light-tailed service times. Specifically, for heavy-tailed service times, we show that the sojourn-time tail under FB and all SMART policies are equal to that of the service time tail, up to a constant. In contrast, for light-tailed service times with no mass in the endpoint of the distribution, we show that, on a logarithmic scale, the sojourn-time tail of FB and all SMART policies is as large as possible.
-
Performance Evaluation, 2008. 65(3-4):286-307.Recently, computer systems researchers have begun to apply the Foreground-Background (FB) scheduling discipline to a variety of applications, and as a result, there has been a resurgence in theoretical research studying the discipline as well. In this paper, we will bring together results from both of these research streams and provide a survey of state-of-the-art theoretical results characterizing the performance of FB with emphasis on the impact of these results to computer systems.
-
Co-recipient of the CMU SCS Distinguished Dissertation Award
Finalist receiving Honorable Mention for the INFORMS OR in Telecommunications Dissertation Award Ph.D. Thesis, 2007. -
Proceedings of the Allerton conference, 2007.In recent years, the response times experienced by large job sizes have been the focus of a growing number of papers. Though results about many scheduling disciplines have appeared, to this point, results characterizing the response time of large job sizes have been limited to either mean value analysis or law of large numbers scalings. We will present a novel framework that unifies these results and provides new results characterizing the distributional behavior of large job sizes. Also, we will illustrate the impact these new results have (i) for the analysis of the response time tail, (ii) for the analysis of busy periods, and (iii) for predicting response times of customers upon arrival.
-
Proceedings of ACM Sigmetrics, 2006.Scheduling policies that prioritize short jobs have received growing attention in recent years. The class of SMART policies includes many such disciplines, e.g. Shortest Remaining Processing Time (SRPT) and Preemptive Shortest Job First (PSJF). In this work, we study the delay distribution of SMART policies and contrast this distribution with that of the Least-Attained-Service (LAS) policy, which indirectly favors short jobs by prioritizing jobs with the least attained service (age). We study the delay distribution (rate function) of LAS and the SMART class in a discrete-time queueing system under the many flows regime. Our analysis in this regime (large capacity and large number of flows) is enabled by the introduction of a two dimensional queue representation, which creates tie-break rules. These additional rules do not alter the policies, but greatly simplify their analysis. We demonstrate that the queue evolution of all the above policies can be described under this single two dimensional framework. We prove that all SMART policies have the same delay distribution as SRPT and illustrate the improvements SMART policies make over First-Come-First-Served (FCFS). Furthermore, we show that the delay distribution of SMART policies stochastically improves upon the delay distribution of LAS. However, the delay distribution under LAS is not too bad -- the distribution of delay under LAS for most jobs sizes still provides improvement over FCFS. Our results are complementary to prior work that studies delay-tail behavior in the large buffer regime under a single flow.
-
Proceedings of ACM Sigmetrics, 2005.In addition to providing small mean response times, modern applications seek to provide users predictable service and, in some cases, Quality of Service (QoS) guarantees. In order to understand the predictability of response times under a range of scheduling policies, we study the conditional variance in response times seen by jobs of different sizes. We define a metric and a criterion that distinguish between contrasting functional behaviors of conditional variance, and we then classify large groups of scheduling policies. In addition to studying the conditional variance of response times, we also derive metrics appropriate for comparing higher conditional moments of response time across job sizes. We illustrate that common statistics such as raw and central moments are not appropriate when comparing higher conditional moments of response time. Instead, we find that cumulant moments should be used.
-
Operations Research Letters, 2004. 32:73-76.We compare the overall mean response time (a.k.a. sojourn time) of the Processor Sharing (PS) and Feedback (FB) queues under an M/GI/1 system. We show that FB outperforms PS under service distributions having decreasing failure rates; whereas PS outperforms FB under service distributions having increasing failure rates.
-
Best Student Paper Award recipient
Proceedings of ACM Sigmetrics, 2003.It is common to evaluate scheduling policies based on their mean response times. Another important, but sometimes opposing, performance metric is a scheduling policy's fairness. For example, a policy that biases towards small job sizes so as to minimize mean response time may end up being unfair to large job sizes. In this paper we define three types of unfairness and demonstrate large classes of scheduling policies that fall into each type. We end with a discussion on which jobs are the ones being treated unfairly. -
Carnegie Mellon School of Computer Science Technical Report CMU-CS-02-201, 2002.We propose a framework for comparing the performance of two queueing policies. Our framework is motivated by the notion of competitive analysis, widely used by the computer science community to analyze the performance of online algorithms. We apply our framework to compare M/GI/1/FB and M/GI/1/SJF with M/GI/1/SRPT, and obtain new results about the performance of M/GI/1/FB and M/GI/1/SJF.
-
Performance Evaluation Review, 2002. 30(3):9-11.
-
Performance Evaluation, 2002. 49(1-4):241-256.We explore the performance of an M/GI/1 queue under various scheduling policies from the perspective of a new metric: the slowdown experienced by largest jobs. We consider scheduling policies that bias against large jobs, towards large jobs, and those that are fair, e.g., Processor-Sharing. We prove that as job size increases to infinity, all work conserving policies converge almost surely with respect to this metric to no more than $1/(1 - \rho)$, where $\rho$ denotes load. We also find that the expected slowdown under any work conserving policy can be made arbitrarily close to that under Processor-Sharing, for all job sizes that are sufficiently large.