Measuring the Fairness of Scheduling Policies for Web Servers and Beyond
People
Adam Wierman
Mor Harchol-Balter
Karl Sigman
Bianca Schroeder
Motivation
Within modern computer systems, the question of how to allocate resources to process incoming requests is paramount. Applications across domains are all faced with this task. Within traditional networking, web servers are constantly being barraged with incoming requests, and must decide how best to allocate resources to respond. Even within the network, routers are placed under the same demands. Outside of networks, the foundations of operating systems depend on the ability to allocate the processor in response to requests made by active processes. Even database queries can be viewed in the same light - memory resources must be allocated to fulfill user requests.
Given the fundamental nature of this problem, it is not surprising that there is a large base of research modeling this process. Queueing theory provides an abstract model for this situation and has been studied mathematically since the early 1900s. However, despite the huge field of research within this framework, even the fundamental question of how best to schedule jobs in this system is not fully answered. Traditionally, the measure of performance used to evaluate queueing networks has been mean response time; that is the mean time between the arrival of a request and the completion of the request. Under this metric it is well known that the Shortest-Remaining-Processing-Time (SRPT) policy minimizes mean response time. This has led to designs based on SRPT being suggested for web servers and other applications.
However, SRPT is not used in practice due to fears about unfairness. Unfairness is a difficult idea to understand, but intuitively it is easy to see why SRPT seems unfair. A large job is very strongly biased against in an SRPT system because many small jobs could arrive and be completed while a large job is stuck in the queue not receiving any service. These concerns plague many policies, besides just SRPT, that achieve low mean response times by biasing towards short job sizes, e.g. Preemptive-Shortest-Job-First (PSJF) and Foreground-Background (FB) scheduling. Unfortunately, there has been little theoretical work understanding the validity of these concerns about unfairness despite the importance of such work to real world applications.
| |
Results
Our work in this area began with a simple question: What is the experience of large job sizes under policies that prioritize small jobs, e.g. SRPT? We proved that, surprisingly, the largest jobs do not receive worse performance under SRPT than they do under traditional timesharing policies like Processor Sharing (PS), which is intuitively fair because it shares the server evenly among all jobs in the system at all times. In fact, in many cases using SRPT can provide smaller response times (in expectation) for all job sizes than PS. However, under heavy load SRPT can be unfair to some jobs, it is just not the largest jobs. Instead, it is the medium-large jobs that are treated unfairly.
From this initial focus on only the performance of large jobs, gradually we developed the first theoretical framework for studying the fairness of all scheduling policies, which culminated with a paper in Sigmetrics 2003 that won the Best Student Paper award. This paper defined a simple, intuitive notion of the fairness of scheduling policies (in expectation) and analyzed both individual policies and scheduling heuristics with respect to this new metric. Since 2003, we have also generalized the framework to describe not only fairness in expectation, but also the distribution of fairness (through use of higher moments).
| |
Impact
Our work on fairness has been cited over 50 times, has been used for many applications including web servers, routers, operating systems, and wireless networks, and has served to jumpstart a new focus on fairness in the Sigmetrics community. Many researchers, e.g. Rai, Biersack, et al. and Gong & Williamson, have analyzed other policies with respect to the fairness measure we introduced. Still others, including Friedman & Henderson, have invented new policies that perform well with respect to this fairness measure. In addition, many researchers, such as Levy & Raz and Sandmann, have developed new fairness measures for use in other applications.
Publications
-
Invited paper in Performance Evaluation Review, 2007. 34(4):4 - 12.[show/hide abstract]The growing trend in computer systems towards using scheduling policies that prioritize jobs with small service requirements has resulted in a new focus on the fairness of such policies. In particular, researchers have been interested in whether prioritizing small job sizes results in large jobs being treated
-
Co-recipient of the CMU SCS Distinguished Dissertation Award
Finalist receiving Honorable Mention for the INFORMS OR in Telecommunications Dissertation Award Ph.D. Thesis, 2007. -
Proceedings of Allerton, 2007.[show/hide abstract]In recent years, the response times experienced by large job sizes have been the focus of a growing number of papers. Though results about many scheduling disciplines have appeared, to this point, results characterizing the response time of large job sizes have been limited to either mean value analysis or law of large numbers scalings. We will present a novel framework that unifies these results and provides new results characterizing the distributional behavior of large job sizes. Also, we will illustrate the impact these new results have (i) for the analysis of the response time tail, (ii) for the analysis of busy periods, and (iii) for predicting response times of customers upon arrival.
-
Proceedings of ACM Sigmetrics, 2005.[show/hide abstract]In addition to providing small mean response times, modern applications seek to provide users predictable service and, in some cases, Quality of Service (QoS) guarantees. In order to understand the predictability of response times under a range of scheduling policies, we study the conditional variance in response times seen by jobs of different sizes. We define a metric and a criterion that distinguish between contrasting functional behaviors of conditional variance, and we then classify large groups of scheduling policies. In addition to studying the conditional variance of response times, we also derive metrics appropriate for comparing higher conditional moments of response time across job sizes. We illustrate that common statistics such as raw and central moments are not appropriate when comparing higher conditional moments of response time. Instead, we find that cumulant moments should be used.
-
Best Student Paper Award recipient[show/hide abstract]
Proceedings of ACM Sigmetrics, 2003.It is common to evaluate scheduling policies based on their mean response times. Another important, but sometimes opposing, performance metric is a scheduling policy's fairness. For example, a policy that biases towards small job sizes so as to minimize mean response time may end up being unfair to large job sizes. In this paper we define three types of unfairness and demonstrate large classes of scheduling policies that fall into each type. We end with a discussion on which jobs are the ones being treated unfairly. -
Performance Evaluation Review, 2002. 30(3):9-11.
-
Performance Evaluation, 2002. 49(1-4):241-256.[show/hide abstract]We explore the performance of an M/GI/1 queue under various scheduling policies from the perspective of a new metric: the slowdown experienced by largest jobs. We consider scheduling policies that bias against large jobs, towards large jobs, and those that are fair, e.g., Processor-Sharing. We prove that as job size increases to infinity, all work conserving policies converge almost surely with respect to this metric to no more than $1/(1 - \rho)$, where $\rho$ denotes load. We also find that the expected slowdown under any work conserving policy can be made arbitrarily close to that under Processor-Sharing, for all job sizes that are sufficiently large.
Figure 1. A summary of the classification of common policies and scheduling techniques based on fairness.
Figure 2. A comparison of the response time for a job of size x, T(x), under PS and SRPT. Notice that under low load SRPT outperforms PS for all job sizes, but that under high load SRPT is unfair to medium-large job sizes.