Internet Engineering Task Force                      I. Stoica, H. Zhang
INTERNET DRAFT                                                       CMU
Expires August, 1999                                          S. Shenker
                                                                    ICSI
                                                             R. Yavatkar
                                                                   Intel
                                                  D. Stephens, A. Malis
                                                                  Ascend
                                                               Y. Bernet
                                                               Microsoft
                                                                 Z. Wang
                                                                  Lucent
                                                                F. Baker
                                                                   Cisco
                                                           J. Wroclawski
                                                                     MIT
                                                      C. Song, R. Wilder
                                                                     MCI
                                                          February, 1999

             Per Hop Behaviors Based on Dynamic Packet States

Status of this Memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Abstract

This document proposes a family of Per Hop Behaviors (PHBs) based on
Dynamic Packet State (DPS) in the context of the differentiated services
architecture. With these PHBs, distributed algorithms can be devised to
implement services with flexibility, utilization, and assurance levels
similar to those that can be provided with per flow mechanisms.

With Dynamic Packet State, each packet carries in its header, in
addition to the PHB codepoint, some PHB-specific state. The state is
initialized by the ingress node. Interior nodes process each incoming
packet based on the state carried in the packet's header, updating both
their internal state and the state in the packet's header before
forwarding it to the next hop.
By using DPS to coordinate the actions of edge and interior nodes along
the path traversed by a flow, distributed algorithms can be designed to
approximate the behavior of a broad class of "stateful" networks using
networks in which interior nodes do not maintain per flow state.

We give several examples of services that can be implemented by PHBs
based on DPS. These include weighted fair share service, penalty box
service, guaranteed service (with mathematically proven worst case
bounds), and distributed bandwidth broker service. We also discuss
several possible solutions for encoding Dynamic Packet State that have
the minimum incompatibility with IPv4. For more details see
http://www.cs.cmu.edu/~istoica/DPS

1. Introduction

While the diffserv architecture [Blake98] is highly scalable, the
services it can provide have lower flexibility, utilization, or
assurance levels than services provided by architectures that employ per
flow management at every node. As an example, it can be shown that in
order for the premium service to achieve service assurance comparable to
the guaranteed service, even with a relatively large queueing delay
bound, the fraction of bandwidth that can be allocated to premium
service traffic has to be very low. In [StoZha99], an example was given
to show that even when the fraction of bandwidth allocated to premium
service is limited to less than 10%, the worst case queueing delay bound
can be as high as 240 ms. The example assumes a DS domain with a
diameter of 15 hops, in which each link has a speed of 1 Gbps and is
traversed by 1,500 flows with 1,500 byte packets. It is debatable
whether these numbers should be of significant concern.
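As a rough sanity check on these numbers, the per-hop worst case can be
approximated by the time to drain one maximum-size packet from every
flow sharing the link. The following back-of-envelope sketch (not the
exact analysis of [StoZha99]) gives the same order of magnitude as the
240 ms bound quoted above:

```python
# Order-of-magnitude check of the example above: in the worst case a
# packet may find one maximum-size packet of every other flow queued
# ahead of it at each hop.  A rough sketch, not the exact analysis.
flows = 1500            # flows traversing each link
pkt_bytes = 1500        # maximum packet size in bytes
link_bps = 1e9          # 1 Gbps link speed
hops = 15               # diameter of the DS domain

per_hop = flows * pkt_bytes * 8 / link_bps    # seconds per hop
end_to_end = hops * per_hop                   # accumulated worst case

print(per_hop, end_to_end)   # 18 ms per hop, 270 ms end to end
```

The 270 ms obtained here is of the same order as the 240 ms bound
derived by the exact analysis.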
For example, the low utilization by the premium traffic may be
acceptable if the majority of traffic will be best effort, either
because the best effort service is "good enough" for the majority of
applications, or because the price difference between premium traffic
and best effort traffic is too high to justify the performance
difference between them. Alternatively, if the guaranteed nature of
service assurance is not needed, i.e., statistical service assurance is
sufficient for premium service, we can then achieve higher network
utilization. Providing meaningful statistical service is still an open
research problem. A discussion of these topics is beyond the scope of
this document. Furthermore, as pointed out in [Bernet98], there is
usually a tradeoff between the complexity of the QoS mechanism and the
efficiency of the resource usage. While intserv-style guaranteed service
can achieve high resource utilization, premium service requires a much
simpler mechanism.

This document proposes a new set of PHBs based on Dynamic Packet State.
These PHBs have relatively low complexity (they do not require per flow
state), but can be used to implement distributed algorithms that provide
services with flexibility, utilization, and assurance levels similar to
those that can be achieved with per flow mechanisms. DS domains
implemented with this type of PHB should interoperate with DS domains
implemented with other PHBs.

With Dynamic Packet State, each packet carries in its header, in
addition to the PHB codepoint, some PHB-specific state. The state is
initialized by an ingress node when the packet enters the DS domain.
Interior nodes process each incoming packet based on the state carried
in the packet's header, updating both their internal state and the state
in the packet's header before forwarding it.
By using DPS to coordinate the actions of edge and interior nodes along
the path traversed by a flow, distributed algorithms can be designed to
approximate the behavior of a broad class of "stateful" networks.
Consequently, introducing PHBs based on DPS will significantly increase
the flexibility and capabilities of the services that can be built
within the diffserv architecture. In particular, we will show that a
variety of QoS services can be implemented by PHBs based on DPS. These
include weighted fair share service, penalty box service, guaranteed
service, and distributed admission control service.

In this document, we use "flow" to refer to a subset of packets that
traverse the same path inside a DS domain between two edge nodes. Thus,
with the highest level of traffic aggregation, a flow consists of all
packets between the same pair of ingress and egress nodes.

This document is organized as follows. Section 2 gives a general
description of PHBs based on DPS. Section 3 presents several services
that can be implemented with PHBs based on DPS. Section 4 discusses
alternative techniques for storing state in the packet's header.
Section 5 briefly discusses some issues related to the specification of
DPS PHBs, such as codepoints, tunneling behavior, and interaction with
other PHBs. Section 6 discusses security issues.

2. Description of Per-Hop Behaviors Based on Dynamic Packet State

Unlike common PHB codepoints [Blake98, Heinanen99, Jacobson98], a PHB
codepoint based on DPS has extra state associated with it. This state is
initialized by ingress nodes and carried by packets inside the DS
domain. The semantics of the state are PHB dependent. The values taken
by the state can be flow, path, or packet dependent.
The state carried by packets can be used by interior nodes for a variety
of purposes, such as packet scheduling, updating the local node's state,
or extending the codepoint space.

When a PHB based on DPS is defined, in addition to the guidelines given
in [Blake98], the following items MUST be specified:

o state semantic - the meaning of the information carried by the packets

o state placement - where the information is stored in the packet's
header

o encoding format - how the information is encoded in packets

For example, consider a PHB that implements the Core-Stateless Fair
Queueing algorithm, which is described in Section 3.1. In this case, the
state carried by a packet is an estimate of the current rate of the flow
to which the packet belongs. The state can be placed either (a) between
the layer two and layer three headers, (b) in an IP option, or (c) in
the IP header (see Section 4). Finally, the rate can be encoded by using
a floating point like format, as described in Section 4.1.1.

In addition, the following requirement, called the transparency
requirement, must be satisfied:

o All changes performed at ingress nodes or within the DS domain on
packets' headers (possibly for the purpose of storing the state) must be
undone by egress nodes

Any document defining a PHB based on DPS must specify a default
placement of the state in the packet header and a default bit encoding
format. However, to increase flexibility, it is acceptable for a
document to define alternate state placements and encoding formats. Any
router that claims to be compatible with a particular PHB based on DPS
must support at least the default placement and the default bit encoding
format.

3. Examples of Services that can be Implemented by PHBs Based on DPS

To illustrate the power and the flexibility of PHBs based on DPS, we
give three examples.
In the first, we address the congestion control problem by approximating
the functionality of a "reference" network in which each node performs
fair queueing. In the second example, we describe a scheduling algorithm
that provides per flow QoS guarantees. In the third example, we present
a distributed per-flow reservation mechanism. It is worth noting that
the last two techniques, when used together, can implement the intserv
guaranteed service semantics in a diffserv environment.

3.1. Support for Congestion Control

A central tenet of the Internet architecture is that congestion control
is achieved mainly through end-host algorithms. However, starting with
Nagle [Nagle87], many researchers have observed that such end-to-end
congestion control solutions are greatly improved when nodes have
mechanisms that allocate bandwidth in a fair manner. Fair bandwidth
allocation protects well-behaved flows from ill-behaved ones, and allows
a diverse set of end-to-end congestion control policies to co-exist in
the network [Demers89]. In the past, fair allocations were typically
achieved by using per-flow queueing mechanisms -- such as Fair Queueing
[Demers89, Parekh92] and its many variants [BenZha96, Golestani94,
ShrVar95] -- or per-flow dropping mechanisms such as Flow Random Early
Drop (FRED) [Lin97]. However, it has recently been shown in [Stoica98]
that by having packets carry additional state it is possible to
approximate the functionality of a "reference" network in which each
node implements fair queueing with a network in which interior nodes do
not maintain per flow state. The next section briefly discusses this
algorithm, called Core-Stateless Fair Queueing. The algorithm details
are given in [Stoica98]. Source files for the ns-2 simulator as well as
an interactive simulator are available on line at [CSFQ].

3.1.1. Core-Stateless Fair Queueing (CSFQ)

The architecture proposed in [Stoica98] can be easily implemented by
using a PHB based on DPS. In this case, the state associated with each
packet represents the current estimated rate of the microflow to which
the packet belongs. Ingress nodes compute this rate and insert it in the
packet header. Interior nodes use FIFO queueing and keep no per-flow
state. They employ a probabilistic dropping algorithm that uses the
packet state along with the node's own measurement of the aggregate
traffic. As an interior node forwards a packet, it updates the packet's
state if needed (e.g., if the flow rate changes due to dropping).

To give the intuition behind the CSFQ algorithm, consider a node with
output link speed C, where flows are modelled as continuous streams of
bits. We assume that each flow i's arrival rate r_i(t) is known
precisely. Max-min fair bandwidth allocations are characterized by the
fact that all flows that are bottlenecked (i.e., have bits dropped) by
this node have the same output rate. We call this rate the fair share
rate of the node; let alpha(t) be the fair share rate at time t. In
general, if max-min bandwidth allocations are achieved, each flow i
receives service at a rate given by min(r_i(t), alpha(t)). Let A(t)
denote the total arrival rate: A(t) = r_1(t) + r_2(t) + ... + r_n(t).
If A(t) > C then the fair share alpha(t) is the unique solution to

C = min(r_1(t), alpha(t)) + min(r_2(t), alpha(t)) + ...
    + min(r_n(t), alpha(t))                                        (1)

If A(t) <= C then no bits are dropped and, by convention, we set
alpha(t) = max(r_1(t), r_2(t), ..., r_n(t)). If r_i(t) <= alpha(t),
i.e., flow i sends no more than the node's fair share rate, all of its
traffic will be forwarded. If r_i(t) > alpha(t), then a fraction
(r_i(t) - alpha(t))/r_i(t) of its bits will be dropped, so it will have
an output rate of exactly alpha(t).
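To make the fluid-model computation concrete, the following sketch (an
illustrative rendering, not part of the draft; the helper name is ours)
solves Eq. (1) for alpha by bisection and derives the drop fraction of a
bottlenecked flow:

```python
def fair_share(rates, C):
    """Fair share rate alpha of a link with capacity C, per Eq. (1).
    If the link is not congested, return max(rates) by convention;
    otherwise bisect the monotone function sum(min(r_i, alpha))."""
    if sum(rates) <= C:
        return max(rates)
    lo, hi = 0.0, max(rates)
    for _ in range(200):
        mid = (lo + hi) / 2
        if sum(min(r, mid) for r in rates) < C:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Three flows at 1, 6 and 8 Mbps on a 10 Mbps link: the solution of
# Eq. (1) is alpha = 4.5 Mbps, since 1 + 4.5 + 4.5 = 10.
alpha = fair_share([1e6, 6e6, 8e6], 10e6)

# A flow sending at 8 Mbps loses the fraction (r - alpha)/r of its bits.
drop_fraction = (8e6 - alpha) / 8e6
```

The 1 Mbps flow is below the fair share and is fully served; the two
larger flows are both capped at alpha.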
This suggests a very simple probabilistic forwarding algorithm that
achieves fair allocation of bandwidth: each incoming bit of flow i is
dropped with the probability

P = max(0, 1 - alpha(t)/r_i(t))                                    (2)

When these dropping probabilities are used, the arrival rate of flow i
at the next hop is given by min(r_i(t), alpha(t)).

The above algorithm can be extended to a packet system. The packet
system still employs a drop-on-input scheme, except that now we drop
packets rather than bits. Because the rate estimation [Stoica98]
incorporates the packet size, the dropping probability is independent of
the packet size and depends only, as above, on the rate r_i(t) and the
fair share rate alpha(t). Pseudocode reflecting this algorithm is shown
in Figure 1.

Figure 1 - Algorithms performed by ingress and interior nodes in CSFQ.
---------------------------------------------------------------------
Ingress node
---------------------------------------------------------------------
on packet p arrival:
  i = get_flow(p);
  r = estimate_flow_rate(p, i);
  state(p) <- r;
---------------------------------------------------------------------
Interior node
---------------------------------------------------------------------
on packet p arrival:
  r <- state(p);
  alpha = estimate_fair_rate();
  /* compute dropping probability */
  P = max(1 - alpha/r, 0);
  if (unif_rand(0,1) < P)
    drop(p);
  else
    enqueue(p);
    /* update flow's rate if needed */
    state(p) <- min(alpha, r);
---------------------------------------------------------------------

Further, [Stoica98] proposes specific algorithms for estimating the
flow's rate and the link's fair rate. In addition, [Stoica98] also gives
a worst-case bound for the performance of this algorithm in an idealized
setting.

3.1.2. Weighted CSFQ

The CSFQ algorithm can be extended to support flows with different
weights.
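As an illustrative sketch of this weighted extension (the helper names
are ours, not the draft's): the packet label becomes the normalized rate
r_i/w_i, and the dropping rule of Eq. (2) is applied to the label
unchanged.

```python
def weighted_label(rate, weight):
    # In weighted CSFQ the packet label is the normalized rate r_i/w_i.
    return rate / weight

def drop_probability(label, alpha):
    # Same rule as Eq. (2), applied to the normalized label.
    return max(0.0, 1.0 - alpha / label)

# A weight-2 flow sending at 8 Mbps and a weight-1 flow sending at
# 4 Mbps carry the same label, so they see the same drop probability
# when the node's (normalized) fair rate is 2 Mbps.
p_heavy = drop_probability(weighted_label(8e6, 2), 2e6)
p_light = drop_probability(weighted_label(4e6, 1), 2e6)
```

The weight-2 flow thus ends up with twice the output rate of the
weight-1 flow, as intended.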
This makes it possible to provide per flow service differentiation
without maintaining per flow or per path state at interior nodes. Let
w_i denote the weight of flow i. Returning to our fluid model, the
meaning of these weights is that a fair allocation is one in which all
bottlenecked flows have the same value for r_i/w_i. The only other major
change to the algorithm from Figure 1 is that the label is now r_i/w_i,
instead of simply r_i.

3.1.3. CSFQ with Penalty Box

Fair queueing can also play a beneficial role in addressing the
unfriendly flow problem. In the literature, one approach to address this
problem is RED with a penalty box [FloFal97]. This approach relies on
the ability to accurately identify unfriendly flows with relatively
lightweight router mechanisms. A flow is classified as unfriendly if its
dynamic behavior is "more aggressive" than the expected behavior of a
TCP flow. Unfortunately, in practice it is very difficult to implement
such a test accurately, since a router does not know all the parameters
that influence TCP behavior (e.g., the round-trip time). In addition, it
is an open problem whether this approach can work satisfactorily in the
presence of different end-to-end adaptation schemes [Stoica98].

In contrast, with fair queueing it is relatively straightforward to
identify unfriendly flows. One way is to simply compare the arrival rate
of the flow to the link's fair rate. In CSFQ, this reduces to comparing
the estimated rate carried by the incoming packet to the output link's
fair rate estimated by the node. In addition, as noted in [Stoica98],
with CSFQ it is possible to devise algorithms to control these
unfriendly flows without maintaining any per flow state.

3.2. Per Flow QoS Scheduling

In this section, we show that by using a PHB based on DPS, it is
possible to provide per flow guarantees.
More precisely, our goal is to replicate the functionality of a network
in which each node implements the Delay-Jitter-Controlled Virtual Clock
(Jitter-VC) algorithm [ZhaFer94]. In the remainder of this section, we
first describe the implementation of Jitter-VC using per flow state, and
then present our algorithm, called Core-Jitter-VC (CJVC), which uses a
PHB based on DPS.

3.2.1. Jitter Virtual Clock (Jitter-VC)

Jitter-VC is a non-work-conserving version of the well-known Virtual
Clock algorithm [Zhang90]. It uses a combination of a delay-jitter
rate-controller [Verma91, ZhaFer94] and a Virtual Clock scheduler. The
algorithm works as follows: each packet is assigned an eligibility time
and a deadline upon its arrival. The packet is held in the
rate-controller until its eligibility time expires. The scheduler orders
the transmission of eligible packets according to their deadlines.
Jitter-VC guarantees that all packets will meet their deadlines within
the transmission time of a packet of the maximum length.

For the k-th packet of a flow, its eligibility time e(j; k) and deadline
d(j; k) at the j-th node on its path are computed as follows (for
simplicity of notation we eliminate the flow's index):

e(j; 1) = a(j; 1), j >= 1
e(j; k) = max(a(j; k) + g(j-1; k), d(j; k-1)), j >= 1; k > 1       (3)

d(j; k) = e(j; k) + l(k)/r, j, k >= 1                              (4)

where l(k) is the length of the packet, r is the reserved rate for the
flow, a(j; k) is the packet's arrival time at the j-th node traversed by
the packet, and g(j-1; k), written into the packet header by the
previous node, is the amount of time the packet was ahead of schedule at
the previous node, i.e., the difference between the packet's deadline
and its actual departure time at the (j-1)-th node. The purpose of
having g(j-1; k) is to compensate at node j for the variation of delay
due to load fluctuation at the previous node j-1.
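As an illustrative sketch (our Python rendering, not normative text),
the recurrences (3)-(4) at a single node can be computed as follows:

```python
def jitter_vc_node(arrivals, aheads, lengths, r):
    """Eligibility times and deadlines at one node, per Eq. (3)-(4).
    arrivals[k] is a(j; k); aheads[k] is g(j-1; k), the time packet k
    was ahead of schedule at the previous node (0 at the first hop);
    lengths[k] is l(k) in bits; r is the reserved rate in bits/s."""
    e, d = [], []
    for k in range(len(arrivals)):
        if k == 0:
            ek = arrivals[0]
        else:
            ek = max(arrivals[k] + aheads[k], d[k - 1])
        e.append(ek)
        d.append(ek + lengths[k] / r)
    return e, d

# Three back-to-back 8000-bit packets at a 1 Mbps reserved rate: each
# packet needs 8 ms of service, so deadlines are spaced 8 ms apart even
# though the packets arrive only 1 ms apart.
e, d = jitter_vc_node([0.0, 0.001, 0.002], [0.0, 0.0, 0.0],
                      [8000] * 3, 1e6)
```

The rate-controller holds each packet until its eligibility time, so the
flow is smoothed to its reserved rate regardless of its arrival burst.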
It has been shown that if a flow's long term arrival rate is no greater
than its reserved rate, a network of Virtual Clock servers can provide
the same delay guarantee to the flow as a network of WFQ servers
[FigPas95, Goyal95, StiVer96]. In addition, it has been shown that a
network of Jitter-VC servers can provide the same delay guarantees as a
network of Virtual Clock servers [Georgiadis95]. Therefore, a network of
Jitter-VC servers can provide the same guaranteed service as a network
of WFQ servers.

3.2.2. Core-Jitter-VC (CJVC)

In [StoZha99], a variant of the Jitter-VC algorithm, called
Core-Jitter-VC (CJVC), is proposed. CJVC can be implemented with a PHB
based on DPS. In addition, it can be shown that a network of CJVC
servers can provide the same guaranteed service as a network of
Jitter-VC servers.

Figure 2 - CJVC algorithm at ingress and interior nodes
---------------------------------------------------------------------
Ingress node
---------------------------------------------------------------------
on packet p arrival:
  i = get_flow(p);
  if (first_packet_of_flow(i))
    e[i] = crt_time;
    delta[i] = 0;
  else
    delta[i] = l(p)/r[i] + delta[i] - l[i]/r[i]
               + max(crt_time - d[i], 0) / (h - 1);
    e[i] = max(crt_time, d[i]);
  l[i] = length(p);
  d[i] = e[i] + l[i]/r[i];
on packet p transmission:
  label(p) <- (r[i], d[i] - crt_time, delta[i]);
---------------------------------------------------------------------
Interior node
---------------------------------------------------------------------
on packet p arrival:
  (r, g, delta) <- state(p);
  e = crt_time + g + delta;   /* see Eq. (3) */
  d = e + l(p)/r;
on packet p transmission:
  label(p) <- (r, d - crt_time, delta);
---------------------------------------------------------------------

As suggested by Eq.
(3) and (4), the Jitter-VC algorithm needs two state variables for each
flow: r, which is the reserved rate of the flow, and d(j; k), which is
the deadline of the flow's last packet served by node j. While it is
straightforward to eliminate r by putting it in the packet header, it is
not trivial to eliminate d(j; k). The difference between r and d(j; k)
is that while all nodes along the path keep the same r value for the
flow, d(j; k) is a dynamic value that is computed iteratively at each
node. In fact, the eligibility time and the deadline of the k-th packet
depend on the deadline of the previous packet of the same flow, i.e.,
d(j; k-1).

A naive implementation based on DPS would be to pre-compute the
eligibility times and the deadlines of the packet at all nodes along its
path and insert all of them in the header. This would eliminate the need
for the interior nodes to maintain d(j; k). The main disadvantage of
this approach is that the amount of information carried by the packet
increases with the number of hops along the path. The challenge then is
to design algorithms that compute d(j; k) for all nodes while requiring
a minimum amount of state in the packet header.

Notice that in Eq. (3), the reason for node j to maintain d(j; k) is
that it will be used to compute the deadline and the eligibility time of
the next packet. Since it is only used in a "max" operation, we can
eliminate the need for d(j; k) if we can ensure that the other term in
the "max" is never less than d(j; k). The key idea is then to use a
slack variable associated with each packet, denoted delta(k), such that
for every interior node j along the path, the following holds:

a(j; k) + g(j-1; k) + delta(k) >= d(j; k-1), j > 1                 (5)

With Ineq. (5), the computation of the eligibility time is reduced from
Eq.
(3) to

e(j; k) = a(j; k) + g(j-1; k) + delta(k)                           (6)

Therefore, by using one additional variable, delta(k), we eliminate the
need to maintain the deadline of the previous packet, d(j; k-1), for
each flow at interior nodes. The computation of delta(k) is given in
[StoZha99]. Figure 2 shows the computation of the scheduling parameters
e(j; k) and d(j; k) by a CJVC server. The number of hops h is computed
when admission control is performed.

3.3. Distributed Admission Control

The previous two examples focus on data path mechanisms and services. In
this section, we show that PHBs based on DPS can also implement control
plane services such as distributed admission control.

Admission control is a central component in providing quantitatively
defined QoS services. The main job of the admission control test is to
ensure that the network resources are not over-committed. In particular,
it has to ensure that the sum of the reservation rates of all flows that
traverse any link in the network is no larger than the link capacity C.
A new reservation request is granted if it passes the admission test at
each hop along its path.

There are two main approaches to implementing admission control.
Traditional reservation-based networks adopt a distributed architecture
in which each node is responsible for its local resources, and in which
nodes are assumed to maintain per flow state. To support the dynamic
creation and deletion of fine grained flows, a large quantity of dynamic
per flow state needs to be maintained in a distributed fashion. Complex
signaling protocols (e.g., RSVP and ATM UNI) are used to maintain the
consistency of this per flow state. A second approach is to use a
centralized bandwidth broker that maintains the topology as well as the
state of all nodes in the network.
In this case, admission control can be implemented by the broker,
eliminating the need to maintain distributed state. Such a centralized
approach is more appropriate for an environment where most flows are
long lived, and set-up and tear-down events are rare. However, to
support fine grained and dynamic flows, there may be a need for a
distributed broker architecture, in which the broker database is
replicated or partitioned. Such an architecture eliminates the need for
a signaling protocol, but requires another protocol to maintain the
consistency of the different broker databases. Unfortunately, since it
is impossible to achieve perfect consistency, this may create race
conditions and/or resource fragmentation.

A third approach is to use a simplified provisioning model that is not
aware of the details of the network topology, but instead admits a new
flow if there is sufficient bandwidth available for the flow's packets
to travel anywhere in the network with adequate QoS. This simplified
model may be based on static provisioning and service level agreements,
or on a simple dynamic bandwidth broker. In any case, the tradeoff made
in return for the simplicity is that the admission control decision must
be more conservative, and a much smaller level of QoS-controlled service
can be supported.

In the following, we show that by using a PHB based on DPS, it is
possible to implement distributed admission control for guaranteed
services in a DS domain. In our scheme, for each reservation, the
ingress node generates a request message that is forwarded along the
path. In turn, each interior node decides whether or not to accept the
request. When the message reaches the egress node, it is returned to the
ingress node, which makes the final decision. We do not make any
reliability assumptions about the request messages.
In addition, the algorithm does not require reservation termination
messages. In the following we describe the per-hop admission control. In
[StoZha99] we describe how this scheme can be integrated with RSVP to
achieve end-to-end delay guarantees.

3.3.1. Per-Hop Admission Control

The solution is based on two main ideas. First, we always maintain a
conservative upper bound on the aggregate reservation R, denoted
R_bound, which we use for making admission control decisions. R_bound is
updated with a simple rule: R_bound = R_bound + r whenever a request for
a rate r is received and R_bound + r <= C. It should be noted that in
order to maintain the invariant that R_bound is an upper bound on R,
this algorithm does not need to detect duplicate request messages,
generated either due to retransmission in case of packet loss or retry
in case of partial reservation failures.

Of course, the obvious problem with this algorithm is that R_bound will
diverge from R. In the limit, when R_bound reaches the link capacity C,
no new requests can be accepted even though there might be available
capacity. To address this problem, we introduce a separate algorithm,
based on DPS, that periodically estimates the aggregate reserved rate.
Based on this estimate, we compute a second upper bound for R, and then
use it to re-calibrate the upper bound R_bound. An important aspect of
the estimation algorithm is that it can actually be shown that the
discrepancy between this upper bound and the actual reserved rate R is
bounded. The re-calibration process then reduces to choosing the minimum
of the two upper bounds.

3.3.2. Estimation Algorithm for the Aggregate Reservation

To estimate the aggregate reservation, denoted R_est, we again use DPS.
In this case, the state of each packet consists of the amount of bits
the flow is entitled to send during the interval from the time when the
previous packet was transmitted up to the time when the current packet
is transmitted.
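The ingress stamping just described can be sketched as follows (an
illustrative rendering; the class name and the zero-valued first stamp
are our assumptions, not part of the draft):

```python
class IngressStamper:
    """Per-flow DPS stamping at the ingress: each packet carries the
    number of bits the flow was entitled to send since the previous
    packet, i.e. r * (now - prev_time)."""
    def __init__(self, reserved_rate):
        self.r = reserved_rate       # reserved rate in bits/s
        self.prev_time = None

    def stamp(self, now):
        if self.prev_time is None:
            b = 0.0                  # first packet: no previous gap
        else:
            b = self.r * (now - self.prev_time)
        self.prev_time = now
        return b

# A 1 Mbps reservation sending packets 10 ms apart is entitled to
# 10,000 bits per inter-packet gap, regardless of actual packet sizes.
s = IngressStamper(1e6)
s.stamp(0.0)
b = s.stamp(0.010)
```

Summed over a window, these stamps recover the flow's reserved rate,
which is what the interior-node estimation below relies on.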
Note that unlike the previous two examples, in this case the state
carried by the packet does not affect the packet's processing by
interior nodes. This state is used solely to compute each node's
aggregate reservation. Furthermore, this state is not changed as the
packet is forwarded.

The estimation performed by interior nodes is based on the following
simple observation: the sum of the state values of the packets of all
flows received during an interval (a, a + T_W] is a good approximation
of the total number of bits that all flows are entitled to send during
this interval. Dividing this sum by T_W, we obtain the aggregate
reservation rate. This is basically the rate estimation algorithm,
though we need to account for several estimation errors. In particular,
we need to account for the fact that not all reservations continue for
the entire duration of the interval (a, a + T_W].

We divide time into intervals of length T_W. Let (u1, u2] be such an
interval, where u2 = u1 + T_W. Let T_I be the maximum inter-departure
time between two consecutive packets in the same flow and T_J be the
maximum jitter of a flow, both of which are much smaller than T_W.
Further, each interior node maintains a global variable Ra which is
initialized to zero at the beginning of each interval (u1, u2], and is
updated to Ra + r every time a request for a reservation r is received
and the admission test is passed, i.e., R_bound + r <= C. Let Ra(t)
denote the value of this variable at time t. Since interior nodes do not
differentiate between original and duplicate requests, Ra(t) is an upper
bound on the sum of all reservations accepted during the interval
(u1, t]. (For simplicity, here we assume that a flow which is granted a
reservation during the interval (u1, u2] becomes active no later than
u2.)
Then, it can be shown that the aggregate reservation at time u2, R(u2),
is bounded by

R(u2) < R_est(u2)/(1 - f) + Ra(u2),                                (7)

where f = (T_I + T_J)/T_W. Finally, this bound is used to re-calibrate
the upper bound of the aggregate reservation, R_bound, as follows:

R_bound(u2) = min(R_bound(u2), R_est(u2)/(1 - f) + Ra(u2)).        (8)

Figure 3 shows the pseudocode of the admission decision and of the
aggregate reservation estimation algorithm at ingress and interior
nodes.

We make several observations. First, the estimation algorithm uses only
the information in the current interval. This makes the algorithm robust
with respect to the loss and duplication of signaling packets, since
their effects are "forgotten" after one time interval. As an example, if
a node processes both the original and a duplicate of the same
reservation request during the interval (u1, u2], R_bound will be
updated twice for the same flow. However, this erroneous update will not
be reflected in the computation of R_est(u3), since that computation is
based only on the state values received during (u2, u3]. As a
consequence, it can be shown that the admission control algorithm can
asymptotically reach a link utilization of C(1 - f)/(1 + f) [StoZha99].

A possible optimization of the admission control algorithm is to add
reservation termination messages. This would reduce the discrepancy
between the upper bound R_bound and the aggregate reservation R.
However, in order to guarantee that R_bound remains an upper bound for
R, we need to ensure that a termination message is sent at most once,
i.e., there are no retransmissions if the message is lost. In practice,
this property can be enforced by the edge nodes, which maintain per flow
state.
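One admission window of the interior-node algorithm can be sketched as
follows (our illustrative rendering of the rules above; the numbers are
made up):

```python
def admission_test(R_bound, r, C):
    # The conservative rule of the per-hop admission control: accept
    # and inflate the bound whenever the request still fits.  Duplicate
    # requests only make R_bound more conservative, never smaller
    # than the true aggregate reservation R.
    if R_bound + r <= C:
        return True, R_bound + r
    return False, R_bound

def recalibrate(R_bound, state_values, T_W, T_I, T_J, Ra):
    # Eq. (7)-(8): estimate the aggregate reservation from the DPS
    # state values seen in the last window and take the minimum of the
    # two upper bounds.
    R_est = sum(state_values) / T_W
    f = (T_I + T_J) / T_W
    return min(R_bound, R_est / (1 - f) + Ra)

# A duplicated 4 Mbps request inflates R_bound to 8 Mbps on a 10 Mbps
# link, even though the real aggregate reservation R is only 4 Mbps...
ok, R_bound = admission_test(0.0, 4e6, 10e6)
ok, R_bound = admission_test(R_bound, 4e6, 10e6)   # the duplicate

# ...but one estimation window (4 Mbit of entitlements over T_W = 1 s,
# f = 0.1, no new reservations) pulls the bound back toward R.
R_bound = recalibrate(R_bound, [4e6], 1.0, 0.05, 0.05, 0.0)
```

After re-calibration the bound is 4e6/0.9, roughly 4.44 Mbps: still a
safe upper bound on R, but no longer inflated by the duplicate.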
To ensure that the maximum inter-departure time is no larger than T_I, the ingress node may need to send a dummy packet when no data packet arrives for a flow during an interval T_I. This can be achieved by having the ingress node maintain a timer for each flow. An optimization would be to aggregate all "micro-flows" between each pair of ingress and egress nodes into one flow, compute the state value based on the aggregate reservation rate, and insert a dummy packet only if there is no data packet for the aggregate flow during an interval.

Figure 3 - Admission control and rate estimation algorithm.

---------------------------------------------------------------------
 Ingress node
---------------------------------------------------------------------
 on packet p departure:
   i = get_flow(p);
   state(p) <- r[i] * (crt_time - prev_time[i]);
   prev_time[i] = crt_time;
---------------------------------------------------------------------
 Interior node
---------------------------------------------------------------------
  Reservation Estimation          | Admission Control
---------------------------------------------------------------------
 on packet p arrival:             | on reservation request r:
   b <- state(p);                 |   /* admission ctrl. test */
   L = L + b;                     |   if (R_bound + r <= C)
                                  |     Ra = Ra + r;
 on time-out T_W:                 |     R_bound = R_bound + r;
   /* estimated reservation */    |     accept(r);
   R_est = L / T_W;               |   else
   R_bound = min(R_bound,         |     deny(r);
                 R_est/(1-f)+Ra); |
   Ra = 0;                        |
---------------------------------------------------------------------

4. Carrying State in Packets

There are at least three ways to encode state in the packet header: (1) introduce a new IP option and insert the option at the ingress router, (2) introduce a new header between layer 2 and layer 3, and (3) use the existing IP header.
Inserting an IP option, while having the potential to provide a large space for encoding state, would require all routers within a DS domain to process IP options, thus complicating packet processing. In addition, adding an IP option may cause packet fragmentation, which further reduces performance. Introducing a new header between layer 2 and layer 3 would require that solutions be devised for different layer 2 technologies. In the context of MPLS [Rosen98, Rosen99] networks, the state can be encoded in a special label. One way to do this is to use a particular encoding of the experimental use field to indicate that a nested label on the label stack carries the PHB-specific state information rather than an ordinary label. In this case, the label on the top of the stack would indicate the label-switched path, and the inner label the PHB-specific state. This would require a small addition to the MPLS architecture to allow two labels to be pushed or popped in unison and treated as a single entity on the label stack.

4.1. Encoding State within an IP Header

In this section, we discuss the third option: storing the additional state in the IP header. The biggest problem with using the IP header is finding enough space to insert the extra information. The main challenge is to remain fully compatible with current standards and protocols. In particular, we want the network domain to be transparent to end-to-end protocols, i.e., the egress node should restore the fields changed by ingress and interior nodes to their original values. In this respect, we observe that the ip_off field in the IPv4 header, which supports packet fragmentation/reassembly, is rarely used. For example, by analyzing traces of over 1.7 million packets on an OC-3 link [nlanr], we found that less than 0.22% of all packets were fragments. In addition, there are a relatively small number of distinct fragment sizes.
Therefore, it is possible to use a fraction of the ip_off field to encode the fragment sizes and the remaining bits to encode DPS information. The idea can be implemented as follows. When a packet arrives at an ingress node, the node checks whether the packet is a fragment or needs to be fragmented. If neither is true, all 13 bits of the ip_off field in the packet header are used to encode DPS values. If the packet is a fragment, the fragment size is recoded into a more efficient representation and the rest of the bits are used to encode the DPS information. The fragment size field is restored at the egress node. In the above, we make the implicit assumption that a packet can be fragmented only by ingress nodes, and not by interior nodes. This is consistent with the diffserv view that the forwarding behavior of a network's components is engineered to be compatible throughout a domain. In summary, this gives us up to 13 bits in the current IPv4 header to encode the PHB-specific state.

4.2. Examples of State Encoding

The state encoding is PHB dependent. In this section, we give examples of encoding the state for the services described in Section 3.

4.2.1. Encoding a Flow's Rate

Recall that in CSFQ, the PHB state represents an estimate of the current rate of the flow to which the packet belongs. One possible way to represent the rate estimate is to restrict it to only a small number of possible values. For example, if we limit it to 128 values, only seven bits are needed to represent the rate. While this can be a reasonable solution in practice, in the following we propose a more sophisticated representation that allows us to express a larger range of values. Let r denote the rate estimate. To represent r, we use an m bit mantissa and an n bit exponent. Since r >= 0, it is possible to gain an extra bit for the mantissa.
For this we consider two cases: (a) if r >= 2^m, we represent r as the closest value of the form u * 2^v, where 2^m <= u < 2^(m+1). Then, since the most significant bit of u's (m+1)-bit representation is always 1, we can omit it. As an example, assume m = 3, n = 4, and r = 19 = 10011. Then 19 is represented as 18 = u * 2^v, where u = 9 = 1001 and v = 1. By omitting the first bit in the representation of u, the mantissa will store 001, while the exponent will be 1. (b) On the other hand, if r < 2^m, the mantissa will contain r, while the exponent will be 2^n - 1. For example, for m = 3, n = 4, and r = 6 = 110, the mantissa is 110, while the exponent is 1111.

Converting from one format to the other can be efficiently implemented. Figure 4 shows the conversion code in C. For simplicity, here we assume that integers are truncated rather than rounded when converted to the floating point representation.

Figure 4 - The C code for converting between integer and floating point formats. m represents the number of bits used by the mantissa; n represents the number of bits in the exponent.

---------------------------------------------------------------------
 intToFP(int val, int *mantissa, int *exponent)
 {
   int nbits = get_num_bits(val);

   if (nbits <= m) {
     *mantissa = val;
     *exponent = (1 << n) - 1;
   } else {
     *exponent = nbits - m - 1;
     *mantissa = (val >> *exponent) - (1 << m);
   }
 }

 FPToInt(int mantissa, int exponent)
 {
   int tmp;

   if (exponent == ((1 << n) - 1))
     return mantissa;
   tmp = mantissa | (1 << m);
   return (tmp << exponent);
 }
---------------------------------------------------------------------

By using m bits for the mantissa and n bits for the exponent, we can represent any integer in the range [0..(2^(m+1)-1) * 2^(2^n - 1)] with a relative error bounded by (-1/2^(m+1), 1/2^(m+1)).
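The routines of Figure 4 can be made self-contained by fixing m = 3, n = 4 and supplying a possible get_num_bits; as in the figure, values are truncated rather than rounded.

```c
#include <assert.h>

enum { m = 3, n = 4 };    /* mantissa and exponent widths in bits */

/* Number of bits needed to represent val (0 for val == 0).
 * This is one possible implementation; Figure 4 leaves it open. */
static int get_num_bits(int val)
{
    int nbits = 0;
    while (val > 0) {
        nbits++;
        val >>= 1;
    }
    return nbits;
}

void intToFP(int val, int *mantissa, int *exponent)
{
    int nbits = get_num_bits(val);

    if (nbits <= m) {              /* case (b): val < 2^m */
        *mantissa = val;
        *exponent = (1 << n) - 1;
    } else {                       /* case (a): val >= 2^m */
        *exponent = nbits - m - 1;
        *mantissa = (val >> *exponent) - (1 << m);
    }
}

int FPToInt(int mantissa, int exponent)
{
    if (exponent == ((1 << n) - 1))
        return mantissa;
    return (mantissa | (1 << m)) << exponent;
}
```

With these definitions, 19 encodes as mantissa 001, exponent 1 and decodes back to 18, while 6 encodes exactly, matching the examples above.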
For example, with 7 bits, by allocating 3 for the mantissa and 4 for the exponent, we can represent any integer in the range [1..15*2^15] with a relative error within (-6.25%, 6.25%). The worst case relative error occurs when u = 8, i.e., the smallest possible mantissa. For example, the number r = 271 = 100001111 is encoded as u = 1000, v = 5, with a relative error of (8*2^5 - 271)/271 = -0.0554 = -5.54%. Similarly, r = 273 = 100010001 is encoded as u = 1001, v = 5, with a relative error of (9*2^5 - 273)/273 = 0.0549 = 5.49%.

4.2.2. Encoding Scheduling Parameters

As shown in Figure 2, implementing CJVC requires encoding three parameters: (1) the rate r, or equivalently l/r, (2) the slack delay delta, and (3) g. Since the deadline determines the delay guarantees, we use a representation that trades eligible time accuracy for deadline accuracy. To store the scheduling information we use three fields. The first field encodes l/r + delta using a floating point like representation. Since g is never larger than l/r, we encode it as a fraction of l/r + delta. This allows us to save space without compromising the accuracy. More precisely, in [StoZha99] we show that by using three bits for the mantissa and four bits for the exponent to represent l/r + delta, and only three bits to represent g, we can compute the packet's deadline, i.e., d = crt_time + g + l/r + delta, with a relative error bounded by (-13%, 13%). The third field stores l/r, also as a fraction of l/r + delta. The eligible time is then computed as e = d - l/r. By allocating another three bits for this field, we can store all scheduling parameters in 13 bits. In [StoZha99] we give a slightly different encoding that uses two extra bits in the DS field to store the reservation state.

4.2.3.
Encoding Reservation State

As shown in Figure 2, when estimating the aggregate reservation, the PHB state represents the number of bits that a flow is entitled to send during the interval from the time when the previous packet of the flow was transmitted until the time the current packet is transmitted. This number can simply be encoded as an integer b. To reduce the range, a possibility is to store b/l instead of b, where l is the length of the packet.

4.3. Encoding Multiple Values

Since the space in the packet's header is a scarce resource, encoding multiple values is particularly challenging. In this section we discuss two general methods that help to alleviate this difficulty. In the first method, exemplified in Section 4.2.2, the idea is to leverage additional knowledge about the state semantics to achieve efficient encoding. In particular, one value can be stored as a function of other values. For example, if a value is known to be always greater than the other values, the larger value can be represented in floating point format, while the other values may be represented as fractions of this value (see Section 4.2.2).

The idea of the second method is to have different packets within a flow carry different state formats. This method is appropriate for PHBs that do not require all packets of a flow to carry the same state. For example, in estimating the aggregate reservation (see Section 3.3.2), there is no need for every packet to carry the number of bits the flow is entitled to send between the current time and the time when the previous packet was transmitted. The only requirement is that the distance between any two consecutive packets that carry such values be no larger than T_I. Other packets in between can carry different information. Similarly, if we encode the IP fragment size in the packet's state, the packet has to carry this value only if it is actually a fragment. When the packet is not a fragment, it can carry other state instead.
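The second method can be illustrated with a small sketch. The layout below is purely hypothetical, not one of this document's encodings: of the 13 bits recovered from ip_off, one bit serves as a format flag selecting whether the remaining 12 bits carry a recoded fragment size or DPS state.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 13-bit layout illustrating format multiplexing:
 * bit 12 is a format flag; when set, bits 0-11 carry a recoded
 * fragment size, otherwise they carry DPS state.  Illustrative
 * only, not part of this specification. */

#define FMT_FLAG   (1u << 12)
#define VALUE_MASK 0x0FFFu

/* Ingress: pack either format into the 13 available bits. */
uint16_t pack(int is_fragment, uint16_t value)
{
    uint16_t f = is_fragment ? FMT_FLAG : 0;
    return (uint16_t)(f | (value & VALUE_MASK));
}

/* Interior/egress: inspect the format flag, then the value. */
int is_fragment_format(uint16_t field)
{
    return (field & FMT_FLAG) != 0;
}

uint16_t unpack_value(uint16_t field)
{
    return (uint16_t)(field & VALUE_MASK);
}
```

A node that needs a given state type (here, the fragment size) simply skips packets whose flag indicates the other format, relying on the spacing requirement discussed above.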
On the other hand, note that in CSFQ it is mandatory that every packet carry the rate estimate, as this is used to compute the dropping probability for that packet.

5. Specification Issues

This section briefly describes some issues related to drafting specifications for PHBs based on DPS.

5.1. Recommended Codepoints

At this time it is appropriate to use values drawn from the 16 codepoints [Nichols98] reserved for local and experimental use (xxxx11) to encode PHBs based on DPS.

5.2. Interaction with Other PHBs

The interaction of DPS PHBs with other PHBs obviously depends on the PHB semantics. It should be noted that the presence of other PHBs in a node may affect the computation and update of DPS state, as well as the actual forwarding behavior experienced by the packet.

5.3. Tunneling

When packets with PHBs based on DPS are tunneled, the end-nodes must make sure that (1) the tunnel is marked with a PHB that does not violate the original PHB semantics, and (2) the PHB-specific state is correctly updated at the end of the tunnel. This requirement might be met by using a tunnel PHB that records and updates packet state, and then copying the state from the encapsulating packet to the inner packet at the tunnel endpoint. Alternatively, the behavior of the tunnel might be measured or precomputed in a way that allows the encapsulated packet's DPS state to be updated at the decapsulation point without requiring the tunnel to support DPS behavior.

6. Security Considerations

The space allocated for the PHB state in the packet header must be compatible with IPsec. In this context, we note that using the fragment offset to carry PHB state does not affect IPsec's end-to-end security, since the fragment offset is not used for cryptographic calculations [Kent98].
Thus, as is the case with the DS field [Nichols98], IPsec does not provide any defense against malicious modification of the PHB state. This leaves the door open for denial of service attacks. For example, in CSFQ, overestimating a flow's rate may cause a disproportionate share of the bandwidth to be allocated to that flow inside the DS domain. In the example in Section 3.3, underestimation of the aggregate reservation can lead to resource over-provisioning, i.e., admitting more reservations than the capacity allows.

One way to expose denial of service attacks is by auditing. In this context, we note that associating state with PHBs makes it easier to perform efficient auditing at interior nodes. For example, in CSFQ, such an attack can be detected by simply measuring a flow's rate and then comparing it against the estimated rate carried in the flow's packets.

7. Conclusions

In this document we propose an extension of the diffserv architecture by defining a new set of PHBs that are based on Dynamic Packet State. By using DPS to coordinate actions of edge and interior nodes along the path traversed by a flow, distributed algorithms can be designed to approximate the behavior of a broad class of "stateful" networks within the diffserv architecture. Such an extension will significantly increase the flexibility and capabilities of the services that can be provided by diffserv.

8. References

[CSFQ] Core-Stateless Fair Queueing. URL: http://www.cs.cmu.edu/~istoica/csfq.

[nlanr] NLANR: Network Traffic Packet Header Traces. URL: http://moat.nlanr.net/Traces/.

[BenZha96] J.C.R. Bennett and H. Zhang. WF2Q: Worst-case fair weighted fair queueing. In Proceedings of IEEE INFOCOM'96, pages 120-128, San Francisco, CA, March 1996.

[Bernet98] Y. Bernet, R. Yavatkar, P. Ford, F. Baker, L. Zhang, K. Nichols and M. Speer. A Framework for Use of RSVP with Diffserv Networks, Internet Draft, draft-ietf-diffserv-rsvp-01.txt, November 1998.

[Blake98] S. Blake, D. Black, M.
Carlson, E. Davies, Z. Wang, and W. Weiss. An Architecture for Differentiated Services, Internet Draft, draft-ietf-diffserv-arch-02.txt, October 1998.

[Demers89] A. Demers, S. Keshav, and S. Shenker. Analysis and simulation of a fair queueing algorithm. In Journal of Internetworking Research and Experience, pages 3-26, October 1990. Also in Proceedings of ACM SIGCOMM'89, pages 3-12.

[FigPas95] N. Figueira and J. Pasquale. An upper bound on delay for the Virtual Clock service discipline. IEEE/ACM Transactions on Networking, 3(4), April 1995.

[FloFal97] S. Floyd and K. Fall. Router mechanisms to support end-to-end congestion control, LBL Technical Report, February 1997.

[Georgiadis95] L. Georgiadis, R. Guerin, and V. Peris. Efficient network QoS provisioning based on per node traffic shaping. In Proceedings of IEEE INFOCOM'96, San Francisco, CA, March 1996.

[Golestani94] S. Golestani. A self-clocked fair queueing scheme for broadband applications. In Proceedings of IEEE INFOCOM'94, pages 636-646, Toronto, Canada, June 1994.

[Goyal95] P. Goyal, S. Lam, and H. Vin. Determining end-to-end delay bounds in heterogeneous networks. In Proceedings of the 5th International Workshop on Network and Operating System Support for Digital Audio and Video, pages 287-298, Durham, New Hampshire, April 1995.

[Jacobson98] V. Jacobson, K. Poduri and K. Nichols. An Expedited Forwarding PHB, Internet Draft, draft-ietf-diffserv-phb-ef-01.txt, November 1998.

[Heinanen99] J. Heinanen, F. Baker, W. Weiss, and J. Wroclawski. Assured Forwarding PHB Group, Internet Draft, draft-ietf-diffserv-af-04.txt, January 1999.

[Kent98] S. Kent and R. Atkinson. IP Authentication Header, RFC 2402, November 1998.

[Lin97] D. Lin and R. Morris. Dynamics of random early detection. In Proceedings of ACM SIGCOMM'97, pages 127-137, Cannes, France, October 1997.
[Nagle87] J. Nagle. On packet switches with infinite storage. IEEE Transactions on Communications, 35(4):435-438, April 1987.

[Nichols98] K. Nichols, S. Blake, F. Baker, and D. L. Black. Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers, Internet Draft, draft-ietf-diffserv-header-04.txt, October 1998.

[ns2] UCB/LBNL/VINT Network Simulator - ns (version 2).

[Parekh92] A. Parekh and R. Gallager. A generalized processor sharing approach to flow control - the single node case. In Proceedings of IEEE INFOCOM'92, 1992.

[Rosen98] E. C. Rosen, Y. Rekhter, D. Tappan, D. Farinacci, G. Fedorkow, T. Li, and A. Conta. MPLS Label Stack Encoding, Internet Draft, draft-ietf-mpls-label-encaps-03.txt, September 1998.

[Rosen99] E. C. Rosen, A. Viswanathan, and R. Callon. Multiprotocol Label Switching Architecture, Internet Draft, draft-ietf-mpls-arch-04.txt, February 1999.

[ShrVar95] M. Shreedhar and G. Varghese. Efficient fair queueing using deficit round robin. In Proceedings of ACM SIGCOMM'95, pages 231-243, Boston, MA, September 1995.

[StiVer96] D. Stiliadis and A. Varma. Latency-rate servers: A general model for analysis of traffic scheduling algorithms. In Proceedings of IEEE INFOCOM'96, pages 111-119, San Francisco, CA, March 1996.

[Stoica98] I. Stoica, S. Shenker, and H. Zhang. Core-Stateless Fair Queueing: Achieving Approximately Fair Bandwidth Allocations in High Speed Networks. In Proceedings of ACM SIGCOMM'98, pages 118-130, Vancouver, September 1998.

[StoZha99] I. Stoica and H. Zhang. Providing Guaranteed Services Without Per Flow Management. URL: http://www.cs.cmu.edu/~istoica/DPS/, February 1999.

[Verma91] D. Verma, H. Zhang, and D. Ferrari. Guaranteeing delay jitter bounds in packet switching networks. In Proceedings of Tricomm'91, pages 35-46, Chapel Hill, North Carolina, April 1991.

[ZhaFer94] H. Zhang and D. Ferrari.
Rate-controlled service disciplines. Journal of High Speed Networks, 3(4):389-412, 1994.

[Zhang90] L. Zhang. Virtual Clock: A new traffic control algorithm for packet switching networks. In Proceedings of ACM SIGCOMM'90, pages 19-29, Philadelphia, PA, September 1990.

9. Authors' Addresses

Ion Stoica
Carnegie Mellon University
5000 Forbes Ave
Pittsburgh, PA 15213
Phone: (412) 268-2993
Email: istoica@cs.cmu.edu

Hui Zhang
Carnegie Mellon University
5000 Forbes Ave
Pittsburgh, PA 15213
Phone: (412) 268-8945
Email: hzhang@cs.cmu.edu

Scott Shenker
ICSI
1947 Center Street, Suite 600
Berkeley, CA 94704-1198
Phone: (510) 642-4274 x300
Email: shenker@icsi.berkeley.edu

Raj Yavatkar
Intel Corporation, JF3-206
2111 NE 25th Avenue
Hillsboro, OR 97124
Phone: (503) 264-9077
Email: raj.yavatkar@intel.com

Donpaul Stephens
Ascend Communications, Inc.
10250 Valley View Road, Suite 113
Minneapolis, MN 55344
Phone: (612) 996-6840
Email: donpaul@ascend.com

Andrew G. Malis
Ascend Communications, Inc.
1 Robbins Road
Westford, MA 01886
Phone: (978) 952-7414
Email: amalis@ascend.com

Yoram Bernet
Microsoft
One Microsoft Way
Redmond, WA 98052
Phone: (425) 936-9568
Email: yoramb@microsoft.com

Zheng Wang
Lucent Technologies
101 Crawfords Corner Road, Room 4C-516
Holmdel, NJ 07733, USA
Phone: (732) 949-1042
Email: zhwang@dnrc.bell-labs.com

Fred Baker
Cisco Systems
519 Lado Drive
Santa Barbara, CA 93111
Phone: (408) 526-4257
Email: fred@cisco.com

John Wroclawski
MIT Lab for Computer Science
545 Technology Sq.
Cambridge, MA 02139
Phone: (617) 253-7885
Email: jtw@lcs.mit.edu

Chuck Song
MCI Internet Engineering
2100 Reston Parkway
Phone: (703) 715-7082
Email: csong@mci.net

Rick Wilder
MCI Internet Engineering
2100 Reston Parkway
Phone: (703) 715-7114
Email: wilder@mci.net