Internet Engineering Task Force I. Stoica, H. Zhang CMU
INTERNET DRAFT S. Shenker ICSI
Expires August, 1999 R. Yavatkar Intel
D. Stephens, A. Malis Ascend
Y. Bernet Microsoft
Z. Wang Lucent
F. Baker Cisco
J. Wroclawski MIT
C. Song, R. Wilder MCI
February, 1999
Per Hop Behaviors Based on Dynamic Packet States
Status of this Memo
This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Abstract
This document proposes a family of Per Hop Behaviors (PHBs) based on
Dynamic Packet State (DPS) in the context of the differentiated
service architecture. With these PHBs, distributed algorithms can be
devised to implement services with flexibility, utilization, and
assurance levels similar to those that can be provided with per flow
mechanisms.
With Dynamic Packet State, each packet carries in its header, in
addition to the PHB codepoint, some PHB-specific state. The state is
initialized by the ingress node. Interior nodes process each incoming
packet based on the state carried in the packet's header, updating
Stoica et al Expires August, 1999 [Page 1]
INTERNET DRAFT PHBs Based on DPS February, 1999
both its internal state and the state in the packet's header before
forwarding it to the next hop. By using DPS to coordinate actions of
edge and interior nodes along the path traversed by a flow,
distributed algorithms can be designed to approximate the behavior of
a broad class of "stateful" networks using networks in which interior
nodes do not maintain per flow state.
We give several examples of services that can be implemented by PHBs
based on DPS. These include weighted fair share service, penalty box
service, guaranteed service (with mathematically proven worst case
bounds), and distributed bandwidth broker service. We also discuss
several possible solutions for encoding Dynamic Packet State that
minimize incompatibility with IPv4.
For more details see http://www.cs.cmu.edu/~istoica/DPS
1. Introduction
While the diffserv architecture [Blake98] is highly scalable, the
services it can provide have lower flexibility, utilization, or
assurance levels than services provided by architectures that employ
per flow management at every node.
As an example, it can be shown that in order for the premium service
to achieve service assurance comparable to the guaranteed service,
even with a relatively large queueing delay bound, the fraction of
bandwidth that can be allocated to premium service traffic has to be
very low. In [StoZha99], an example was given to show that even when
the fraction of bandwidth allocated to premium service is limited to
be less than 10%, the worst case queueing delay bound can be as high
as 240 ms. The example assumes a DS domain with diameter of 15 hops,
in which each link has speed of 1 Gbps and it is traversed by 1,500
flows with 1,500 byte packets. It is debatable whether these numbers
should be of significant concern. For example, the low utilization by
the premium traffic may be acceptable if the majority of traffic will
be best effort, either because the best effort service is "good
enough" for the majority of applications or the price difference
between premium traffic and best effort traffic is too high to justify the
performance difference between them. Alternatively, if the guaranteed
nature of service assurance is not needed, i.e., statistical service
assurance is sufficient for premium service, we can then achieve
higher network utilization. Providing meaningful statistical service
is still an open research problem. A discussion of these topics is
beyond the scope of this document. Furthermore, as pointed out in
[Bernet98], there is usually a tradeoff between the complexity of the
QoS mechanism and the efficiency of the resource usage. While
intserv-style guaranteed service can achieve high resource
utilization, premium service needs a much simpler mechanism.
This document proposes a new set of PHBs based on Dynamic Packet
State. These PHBs have relatively low complexity (they do not require per
flow state), but can be used to implement distributed algorithms to
provide services with flexibility, utilization, and assurance levels
similar to those that can be achieved with per flow mechanisms. DS
domains implemented with this type of PHB should interoperate with
DS domains implemented with other PHBs.
With Dynamic Packet State, each packet carries in its header, in
addition to the PHB codepoint, some PHB-specific state. The state is
initialized by an ingress node, when the packet enters the DS domain.
Interior nodes process each incoming packet based on the state
carried in the packet's header, updating both its internal state and
the state in the packet's header before forwarding it. By using DPS
to coordinate actions of edge and interior nodes along the path
traversed by a flow, distributed algorithms can be designed to
approximate the behavior of a broad class of "stateful" networks.
Consequently, introducing PHBs based on DPS will significantly
increase the flexibility and capabilities of the services that can be
built within the diffserv architecture. In particular, we will show
that a variety of QoS services can be implemented by PHBs based on
DPS. These include weighted fair share service, penalty box service,
guaranteed service, and distributed admission control service.
In this document, we use the term "flow" to refer to a subset of packets that
traverse the same path inside a DS domain between two edge nodes.
Thus, with the highest level of traffic aggregation, a flow consists
of all packets between the same pair of ingress and egress nodes.
This document is organized as follows. Section 2 gives a general
description of PHBs based on DPS. Section 3 presents several services
that can be implemented with PHBs based on DPS. Section 4 discusses
alternative techniques of storing state in the packet's header.
Section 5 briefly discusses some issues related to the specification
of DPS PHBs, such as codepoints, tunneling behavior, and interaction
with other PHBs. Section 6 discusses security issues.
2. Description of Per-Hop Behaviors Based on Dynamic Packet State
Unlike common PHB codepoints [Blake98, Heinanen99, Jacobson98], a PHB
codepoint based on DPS has extra state associated with it. This state
is initialized by ingress nodes and carried by packets inside the DS
domain. The state semantics are PHB dependent. The values taken by the
state can be either flow, path, or packet dependent. The state
carried by packets can be used by interior nodes for a variety of
purposes such as packet scheduling, updating the local node's state,
or extending the codepoint space.
When a PHB based on DPS is defined, in addition to the guidelines
given in [Blake98], the following items MUST be specified:
o state semantic - the meaning of the information carried by
the packets
o state placement - where the information is stored in the
packet's header
o encoding format - how the information is encoded in packets
For example, consider a PHB that implements the Core-Stateless Fair
Queueing algorithm, which is described in Section 3.1. In this case,
the state carried by a packet is an estimate of the current rate of
the flow to which the packet belongs. The state can be placed either
(a) between layer two and layer three headers, (b) as an IP option,
or (c) in the IP header (see Section 4). Finally, the rate can be
encoded by using a floating point like format as described in Section
4.1.1.
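As an illustration of such a floating point like encoding, the sketch below packs a rate into a single byte using a 4-bit exponent and a 4-bit mantissa. The nibble split and the function names are assumptions made for this illustration; the actual format is the one specified in Section 4.1.1.

```python
def encode_rate(rate):
    """Encode a non-negative rate into one byte:
    high nibble = exponent, low nibble = mantissa (illustrative format)."""
    if rate <= 0:
        return 0
    exp = 0
    while rate >= 16 and exp < 15:
        rate //= 2          # shift the mantissa down, remember the scale
        exp += 1
    return (exp << 4) | min(int(rate), 15)

def decode_rate(byte):
    """Recover an approximation of the encoded rate."""
    return (byte & 0xF) << (byte >> 4)
```

With 4 mantissa bits the relative quantization error stays below roughly 1/16, which illustrates the tradeoff between header space and rate precision.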
In addition, the following requirement, called the transparency
requirement, must be satisfied:
o All changes performed at ingress nodes or within the DS domain on
packets' headers (possibly for the purpose of storing the state)
must be undone by egress nodes.
Any document defining a PHB based on DPS must specify a default
placement of the state in the packet header and a default bit
encoding format. However, to increase flexibility, it is
acceptable for a document to define alternate state placements and
encoding formats. Any router that claims to be compatible with a
particular PHB based on DPS must support at least the default
placement and the default bit encoding format.
3. Examples of Services that can be Implemented by PHBs Based on DPS
To illustrate the power and the flexibility of the PHBs based on DPS,
we give three examples. In the first, we address the congestion
control problem by approximating the functionality of a "reference"
network in which each node performs fair queueing. In the second
example, we describe a scheduling algorithm that provides per flow
QoS guarantees. In the third example, we present a distributed per-
flow reservation mechanism. It is worth noting that the last two
techniques, when used together, can implement the intserv guaranteed
service semantic in a diffserv environment.
3.1. Support for Congestion Control
A central tenet of the Internet architecture is that congestion
control is achieved mainly through end-host algorithms. However,
starting with Nagle [Nagle87], many researchers observed that such
end-to-end congestion control solutions are greatly improved when
nodes have mechanisms that allocate bandwidth in a fair manner. Fair
bandwidth allocation protects well-behaved flows from ill-behaved
ones, and allows a diverse set of end-to-end congestion control
policies to co-exist in the network [Demers89].
In the past, fair allocations were typically achieved by using
per-flow queueing mechanisms -- such as Fair Queueing [Demers89,
Parekh92] and its many variants [BenZha96, Golestani94, ShrVar95] --
or per-flow dropping mechanisms such as Flow Random Early Drop (FRED)
[Lin97]. However, it has been recently shown in [Stoica98] that by
having packets carry additional state it is possible to approximate
the functionality of a "reference" network in which each node
implements fair queueing allocation with a network in which interior
nodes do not maintain per flow state. The next section briefly
discusses this algorithm, called Core-Stateless Fair Queueing. The
algorithm details are given in [Stoica98]. Source files for the ns-2
simulator as well as an interactive simulator are available on line at
[CSFQ].
3.1.1. Core-Stateless Fair Queueing (CSFQ)
The architecture proposed in [Stoica98] can be easily implemented by
using a PHB based on DPS. In this case, the state associated with each
packet represents the current estimated rate of the microflow to which
the packet belongs. Ingress nodes compute this rate and insert it in
the packet header. Interior nodes use FIFO queueing and keep no
per-flow state. They employ a probabilistic dropping algorithm that
uses the packet state along with the node's own measurement of the
aggregate traffic. As an interior node forwards a packet, it updates
the packet's state if needed (e.g., if the flow rate changes due to dropping).
To give the intuition behind the CSFQ algorithm, consider a node with
output link speed C, where flows are modelled as continuous streams of
bits. We assume that each flow i's arrival rate r_i(t) is known
precisely. Max-min fair bandwidth allocations are characterized by the
fact that all flows that are bottlenecked (i.e., have bits dropped) by
this node have the same output rate. We call this rate the fair share
rate of the node; let alpha(t) be the fair share rate at time t. In
general, if max-min bandwidth allocations are achieved, each flow i
receives service at a rate given by min(r_i(t), alpha(t)). Let A(t)
denote the total arrival rate: A(t)= r_1(t) + r_2(t) + ... + r_n(t).
If A(t) > C then the fair share alpha(t) is the unique solution to
C = min(r_1(t), alpha(t)) + min(r_2(t), alpha(t)) + ...    (1)
    + min(r_n(t), alpha(t)).
If A(t) <= C then no bits are dropped and we, by convention, set
alpha(t)=max(r_1(t), r_2(t), ..., r_n(t)).
If r_i(t) <= alpha(t), i.e., flow i sends no more than the node's fair
share rate, all of its traffic will be forwarded. If r_i(t) >
alpha(t), then a fraction (r_i(t) - alpha(t))/r_i(t) of its bits will
be dropped, so it will have an output rate of exactly alpha(t). This
suggests a very simple probabilistic forwarding algorithm that
achieves fair allocation of bandwidth: each incoming bit of flow i is
dropped with the probability
P = max(0, 1 - alpha(t)/r_i(t)) (2)
When these dropping probabilities are used, the arrival rate of flow
i at the next hop is given by min(r_i(t), alpha(t)).
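The fair share alpha(t) defined by Eq. (1) can be computed numerically. The following sketch (illustrative only, not part of any PHB specification) solves Eq. (1) by bisection, exploiting the fact that the sum of min(r_i, alpha) terms is non-decreasing in alpha.

```python
def fair_share(rates, C):
    """Max-min fair share alpha for link capacity C, per Eq. (1)."""
    if sum(rates) <= C:
        # No bits are dropped; by convention alpha is the largest rate.
        return max(rates)
    lo, hi = 0.0, max(rates)
    for _ in range(100):                     # bisection on alpha
        alpha = (lo + hi) / 2
        served = sum(min(r, alpha) for r in rates)
        if served < C:
            lo = alpha
        else:
            hi = alpha
    return alpha
```

For example, with flows of rates 2, 4, and 10 on a link of capacity 10, the fair share is 4: the first two flows are served in full and the third is limited to 4.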
The above algorithm can be extended to a packet system. The packet
system still employs a drop-on-input scheme, except that now we drop
packets rather than bits. Because the rate estimation [Stoica98]
incorporates the packet size, the dropping probability is independent
of the packet size and depends only, as above, on the rate r_i(t)
and fair share rate alpha(t). Pseudocode reflecting this algorithm is
described in Figure 1.
Figure 1 - Algorithms performed by ingress and interior nodes in CSFQ.
---------------------------------------------------------------------
Ingress node
---------------------------------------------------------------------
on packet p arrival:
i = get_flow(p);
r = estimate_flow_rate(p,i);
state(p) <- r;
---------------------------------------------------------------------
Interior node
---------------------------------------------------------------------
on packet p arrival:
r <- state(p);
alpha = estimate_fair_rate();
/* compute dropping probability */
P = max(1 - alpha/r, 0);
if (unif_rand(0,1) < P)
drop(p);
else
enqueue(p);
/* update flow's rate if needed */
state(p) <- min(alpha, r);
---------------------------------------------------------------------
Further, [Stoica98] proposes specific algorithms for estimating the
flow's rate and the link's fair rate. In addition, [Stoica98] also
gives a worst-case bound for the performance of this algorithm in an
idealized setting.
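The two halves of Figure 1 can be sketched in executable form as follows. The exponential averaging in estimate_flow_rate follows the rate estimation rule proposed in [Stoica98]; the averaging constant K and the dictionary-based flow record are assumptions made for this illustration.

```python
import math
import random

K = 0.1  # averaging constant in seconds (an assumed configuration value)

def estimate_flow_rate(flow, pkt_bits, now):
    """Exponentially averaged rate estimate at the ingress node,
    following the estimation rule proposed in [Stoica98]."""
    T = now - flow['last']           # inter-arrival time
    flow['last'] = now
    w = math.exp(-T / K)
    flow['rate'] = (1 - w) * pkt_bits / T + w * flow['rate']
    return flow['rate']

def interior_forward(label_rate, alpha):
    """Interior-node decision of Figure 1.
    Drop with probability max(0, 1 - alpha/label_rate); otherwise
    forward and relabel with min(alpha, label_rate)."""
    p = max(0.0, 1.0 - alpha / label_rate)
    if random.random() < p:
        return True, label_rate      # packet dropped
    return False, min(alpha, label_rate)
```

Because the estimate incorporates the packet size, the dropping decision depends only on the carried rate and the node's fair rate estimate, as noted above.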
3.1.2. Weighted CSFQ
The CSFQ algorithm can be extended to support flows with different
weights. This makes it possible to provide per flow service
differentiation without maintaining per flow or per path state at
interior nodes. Let
w_i denote the weight of flow i. Returning to our fluid model, the
meaning of these weights is that we say a fair allocation is one in
which all bottlenecked flows have the same value for r_i/w_i. The only
other major change in the algorithm from Figure 1 is that the label is
now r_i/w_i, instead of simply r_i.
3.1.3. CSFQ with Penalty Box
Fair queuing can also play a beneficial role in addressing the
unfriendly flow problem. In the literature, one approach to address
this problem is RED with penalty box [FloFal97]. This approach relies
on the ability to accurately identify unfriendly flows with relatively
lightweight router mechanisms. A flow is classified as unfriendly if
its dynamic behavior is "more aggressive" than the expected behavior
of a TCP flow. Unfortunately, in practice it is very difficult to
implement such a test accurately, since a router does not know all the
parameters that influence the TCP behavior (e.g., round-trip time). In
addition, it is an open problem whether this approach can work
satisfactorily in the presence of different end-to-end adaptation schemes
[Stoica98].
In contrast, with fair queueing it is relatively straightforward to
identify the unfriendly flows. One way is to simply compare the
arrival rate of the flow to the link's fair rate. In CSFQ, this
reduces to comparing the estimated rate carried by the incoming packet
to the output link's fair rate estimated by the node. In addition, as
noted in [Stoica98], with CSFQ it is possible to devise algorithms to
control these unfriendly flows without maintaining any per flow state.
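A minimal sketch of such a test, assuming a configurable tolerance factor (the threshold value below is an illustrative assumption, not a recommended setting):

```python
PENALTY_FACTOR = 1.2  # assumed tolerance before a flow is flagged

def is_unfriendly(packet_label_rate, fair_rate):
    """Flag a flow whose carried rate estimate exceeds the link's
    estimated fair rate by more than the tolerance factor.
    No per flow state is needed at the interior node: both inputs
    are available from the packet label and the node's own estimate."""
    return packet_label_rate > PENALTY_FACTOR * fair_rate
```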
3.2. Per Flow QoS Scheduling
In this section, we show that by using a PHB based on DPS, it is
possible to provide per flow guarantees. More precisely, our goal is
to replicate the functionality of a network with each node
implementing the Delay-Jitter-Controlled Virtual Clock (Jitter-VC)
algorithm [ZhaFer94]. In the remainder of this section, we will first
describe the implementation of Jitter-VC using per flow state, then
present our algorithm, called Core-Jitter-VC (CJVC), which uses a PHB
based on DPS.
3.2.1. Jitter Virtual Clock (Jitter-VC)
Jitter-VC is a non-work-conserving version of the well-known Virtual
Clock algorithm [Zhang90]. It uses a combination of a delay-jitter
rate-controller [Verma91, ZhaFer94] and a Virtual Clock scheduler.
The algorithm works as follows: each packet is assigned an
eligibility time and a deadline upon its arrival. The packet is held
in the rate-controller until it becomes eligible. The scheduler
orders the transmission of eligible packets according to their
deadlines. Jitter-VC guarantees that all packets will meet their
deadlines within the transmission time of a packet of the maximum
length.
For the k-th packet of a flow, its eligibility time e(j;k) and
deadline d(j;k) at the j-th node on its path are computed as follows
(for simplicity of notation we eliminate the flow's index):
e(j; 1) = a(j; 1),
e(j; k) = max(a(j; k) + g(j-1; k), d(j; k-1)), j >= 1; k > 1   (3)
d(j; k) = e(j; k) + l(k)/r,                    j, k >= 1       (4)
where l(k) is the length of the packet, r is the reserved rate for
the flow, a(j; k) is the packet's arrival time at the j-th node
traversed by the packet, and g(j; k), written into the packet header
by the previous node, is the amount of time the packet was ahead of
schedule at the previous node, i.e. the difference between the
packet's deadline and its actual departure time at the (j-1)-th node.
The purpose of having g(j-1; k) is to compensate at node j for the
variation of delay due to load fluctuation at the previous node j-1.
It has been shown that if a flow's long term arrival rate is no
greater than its reserved rate, a network of Virtual Clock servers
can provide the same delay guarantee to the flow as a network of WFQ
servers [FigPas95, Goyal95, StiVer96]. In addition, it has been shown
that a network of Jitter-VC servers can provide the same delay
guarantees as a network of Virtual Clock servers [Georgiadis95].
Therefore, a network of Jitter-VC servers can provide the same
guaranteed service as a network of WFQ servers.
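Eqs. (3) and (4) above can be written compactly as the following per packet update; this is a direct transcription of the equations, with variable names chosen for this sketch.

```python
def jitter_vc_update(a, g_prev, d_prev, pkt_len, r):
    """Eligibility time and deadline per Eqs. (3) and (4).
    a: arrival time at this node; g_prev: ahead-of-schedule time at
    the previous node (0 at the first node); d_prev: deadline of the
    flow's previous packet at this node, or None for the first packet;
    pkt_len: packet length; r: reserved rate of the flow."""
    if d_prev is None:
        e = a                        # e(j; 1) = a(j; 1)
    else:
        e = max(a + g_prev, d_prev)  # Eq. (3)
    d = e + pkt_len / r              # Eq. (4)
    return e, d
```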
3.2.2 Core-Jitter-VC (CJVC)
In [StoZha99], a variant of the Jitter-VC algorithm, called Core-
Jitter-VC (CJVC), is proposed. CJVC can be implemented with a PHB based on
DPS. In addition, it can be shown that a network of CJVC servers can
provide the same guaranteed service as a network of Jitter-VC
servers.
Figure 2 - CJVC algorithm at ingress and interior nodes
---------------------------------------------------------------------
Ingress node
---------------------------------------------------------------------
on packet p arrival:
i = get_flow(p);
if (first_packet_of_flow(i))
e[i] = crt_time;
delta[i] = 0;
else
delta[i] = l(p)/r[i] + delta[i] - l[i]/r[i] +
max(crt_time - d[i], 0) / (h - 1);
e[i] = max(crt_time, d[i]);
l[i] = length(p);
d[i] = e[i] + l[i]/r[i];
on packet p transmission:
label(p) <- (r[i], d[i] - crt_time, delta[i]);
---------------------------------------------------------------------
Interior node
---------------------------------------------------------------------
on packet p arrival:
(r, g, delta) <- state(p);
e = crt_time + g + delta; /* see Eq. (3) */
d = e + l(p)/r;
on packet p transmission:
label(p) <- (r, d - crt_time, delta);
---------------------------------------------------------------------
As suggested by Eq. (3) and (4), the Jitter-VC algorithm needs two
state variables for each flow: r, which is the reserved rate for the
flow and d(j; k), which is the deadline of the flow's last packet
that was served by node j. While it is straightforward to eliminate r
by putting it in the packet header, it is not trivial to eliminate
d(j; k). The difference between r and d(j; k) is that while all nodes
along the path keep the same r value for the flow, d(j; k) is a
dynamic value that is computed iteratively at each node. In fact, the
eligible time and the deadline of the k-th packet depend on the
deadline of the previous packet of the same flow, i.e., d(j; k-1).
A naive implementation based on DPS would be to pre-compute the
eligible times and the deadlines of the packet at all nodes along its
path and insert all of them in the header. This would eliminate the
need for the interior nodes to maintain d(j; k). The main
disadvantage of this approach is that the amount of information
carried by the packet increases with the number of hops along the
path. The challenge then is to design algorithms that compute d(j; k)
for all nodes while requiring a minimum amount of state in the packet
header.
Notice that in Eq. (3), the reason for node j to maintain d(j; k) is
that it will be used to compute the deadline and the eligible time of
the next packet. Since it is only used in a "max" operation, we can
eliminate the need for d(j; k) if we can ensure that the other term
in "max" is never less than d(j; k). The key idea is then to use a
slack variable associated with each packet, denoted delta(k), such
that for every interior node j along the path, the following holds
a(j; k) + g(j-1; k) + delta(k) >= d(j; k-1), j > 1 (5)
With Ineq. (5), the computation of eligibility time is reduced from
Eq. (3) to
e(j; k) = a(j; k) + g(j-1; k) + delta(k).  (6)
Therefore, by using one additional variable, delta(k), we eliminate
the need for maintaining the deadline of the previous packet d(j; k)
for each flow at interior nodes. The computation of delta(k) is given
in [StoZha99]. Figure 2 shows the computation of the scheduling
parameters e(j; k) and d(j; k) by a CJVC server. The number of hops h
is computed when the admission control is performed.
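The resulting interior-node computation is a direct transcription of Eq. (6) followed by Eq. (4); variable names are chosen for this sketch.

```python
def cjvc_interior(a, g_prev, delta, pkt_len, r):
    """Per Eq. (6), the slack delta carried in the packet header makes
    the stored per-flow deadline d(j; k-1) unnecessary at interior
    nodes, since Ineq. (5) guarantees the 'max' term is covered."""
    e = a + g_prev + delta           # Eq. (6)
    d = e + pkt_len / r              # Eq. (4)
    return e, d
```

Note that all three inputs besides the arrival time come from the packet header, so the interior node keeps no per flow state.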
3.3. Distributed Admission Control
The previous two examples focus on data path mechanisms and services.
In this section, we will show that PHBs based on DPS can also
implement control plane services such as distributed admission
control.
Admission control is a central component in providing quantitatively
defined QoS services. The main job of the admission control test is
to ensure that the network resources are not over-committed. In
particular it has to ensure that the sum of the reservation rates of
all flows that traverse any link in the network is no larger than the
link capacity C.
A new reservation request is granted if it passes the admission test
at each hop along its path. There are several approaches to
implementing admission control. Traditional reservation-based
networks adopt a distributed architecture in which each node is
responsible for its local resources, and where nodes are assumed to
maintain per flow state. To support the dynamic creation and deletion
of fine grained flows, a large quantity of dynamic per flow state
needs to be maintained in a distributed fashion. Complex signaling
protocols (e.g., RSVP and ATM UNI) are used to maintain the
consistency of this per flow state.
A second approach is to use a centralized bandwidth broker that
maintains the topology as well as the state of all nodes in the
network. In this case, the admission control can be implemented by
the broker, eliminating the need for maintaining distributed state.
Such a centralized approach is more appropriate for an environment
where most flows are long lived, and set-up and tear-down events are
rare. However, to support fine grained and dynamic flows, there may
be a need for a distributed broker architecture, in which the broker
database is replicated or partitioned. Such an architecture
eliminates the need for a signaling protocol, but requires another
protocol to maintain the consistency of the different broker
databases. Unfortunately, since it is impossible to achieve perfect
consistency, this may create race conditions and/or resource
fragmentation.
A third approach is to use a simplified provisioning model that is
not aware of the details of the network topology, but instead admits
a new flow if there is sufficient bandwidth available for the flow's
packets to travel anywhere in the network with adequate QoS. This
simplified model may be based on static provisioning and service
level agreements, or on a simple dynamic bandwidth broker. In any
case, the tradeoff made in return for the simplicity is that the
admission control decision must be more conservative, and a much
smaller level of QoS-controlled service can be supported.
In the following, we show that by using a PHB based on DPS, it is
possible to implement distributed admission control for guaranteed
services in a DS domain. In our scheme, for each reservation, the
ingress node generates a request message that is forwarded along the
path. In turn, each interior node decides whether or not to accept
the request. When the message reaches the egress node it is returned
to the ingress, which makes the final decision. We do not make any
reliability assumptions about the request messages. In addition, the
algorithm does not require reservation termination messages. In the
following we describe the per-hop admission control. In [StoZha99] we
describe how this scheme can be integrated with RSVP to achieve end-
to-end delay guarantees.
3.3.1. Per-Hop Admission Control
The solution is based on two main ideas. First, we always maintain a
conservative upper bound of the aggregate reservation R, denoted
R_bound, which we use for making admission control decisions.
R_bound is updated with a simple rule:
R_bound = R_bound + r
whenever a request for a rate r is received and R_bound + r <= C. It
should be noted that in order to maintain the invariant that R_bound
is an upper bound of R, this algorithm does not need to detect
duplicate request messages, generated either due to retransmission in
case of packet loss or retry in case of partial reservation failures.
Of course, the obvious problem with the algorithm is that R_bound
will diverge from R. At the limit, when R_bound reaches the link
capacity C, no new requests can be accepted even though there might
be available capacity.
To address this problem, we introduce a separate algorithm, based on
DPS, that periodically estimates the aggregate reserved rate. Based
on this estimate we compute a second upper bound for R, and then use
it to re-calibrate the upper bound R_bound. An important aspect of
the estimation algorithm is that it can actually be shown that the
discrepancy between the upper bound and the actual reserved rate R is
bounded. Then the re-calibration process reduces to choosing the
minimum of the two upper bounds.
3.3.2. Estimation Algorithm for the Aggregate Reservation
To estimate the aggregate reservation, denoted R_est, we again use
DPS. In this case, the state of each packet consists of the number of
bits the flow is entitled to send during the interval between the
transmission of the previous packet and the transmission of the
current packet. Note that unlike the previous two
examples, in this case the state carried by the packet does not
affect the packet's processing by interior nodes. This state is
solely used to compute each node's aggregate reservation.
Furthermore, this state is not changed as the packet is forwarded.
The estimation performed by interior nodes is based on the following
simple observation: the sum of state values of packets of all flows
received during an interval (a, a + T_W] is a good approximation for
the total number of bits that all flows are entitled to send during
this interval. Dividing this sum by T_W, we obtain the aggregate
reservation rate. This is basically the rate estimation algorithm,
though we need to account for several estimation errors. In
particular, we need to account for the fact that not all reservations
continue for the entire duration of interval (a, a + T_W].
We divide time into intervals of length T_W. Let (u1, u2] be such an
interval, where u2 = u1 + T_W. Let T_I be the maximum inter-departure
time between two consecutive packets in the same flow and T_J be the
maximum jitter of a flow, both of which are much smaller than T_W.
Further, each interior node maintains a global variable Ra which
is initialized to zero at the beginning of each interval (u1, u2],
and is updated to Ra + r every time a request for a reservation r is
received and the admission test is passed, i.e., R_bound + r <= C.
Let Ra(t) denote the value of this variable at time t. Since interior
nodes do not differentiate between the original and duplicate
requests, Ra(t) is an upper bound of the sum of all reservations
accepted during the interval (u1, t]. (For simplicity, here we assume
that a flow which is granted a reservation during the interval (u1,
u2] becomes active no later than u2.) Then, it can be shown that the
aggregate reservation at time u2, R(u2), is bounded by
R(u2) < R_est(u2)/(1-f) + Ra(u2), (7)
where f = (T_I + T_J)/T_W. Finally, this bound is used to re-
calibrate the upper bound of the aggregate reservation R_bound(u1) as
follows
R_bound(u2) = min(R_bound(u2), R_est(u2)/(1-f) + Ra(u2)). (8)
Figure 3 shows the pseudocode of the admission decision and of the
aggregate reservation estimation algorithm at ingress and interior
nodes. We make several observations. First, the estimation algorithm
uses only the information in the current interval. This makes the
algorithm robust with respect to loss and duplication of signaling
packets since their effects are "forgotten" after one time interval.
As an example, if a node processes both the original and a duplicate
of the same reservation request during the interval (u1, u2], R_bound
will be updated twice for the same flow. However, this erroneous
update will not be reflected in the computation of R_est(u3), since
its computation is based only on the state values received during
(u2, u3]. As a consequence, it can be shown that the admission control
algorithm can asymptotically reach a link utilization of C (1 - f)/(1
+ f) [StoZha99].
A possible optimization of the admission control algorithm is to add
reservation termination messages. These would reduce the discrepancy
between the upper bound R_bound and the aggregate reservation R.
However, in order to guarantee that R_bound remains an upper bound
for R, we need to ensure that a termination message is sent at most
once, i.e., that there are no retransmissions if the message is lost.
In practice, this property can be enforced by the edge nodes, which
maintain per flow state.
To ensure that the maximum inter-departure time is no larger than
T_I, the ingress node may need to send a dummy packet when no data
packet of a flow arrives during an interval T_I. This can be achieved
by having the ingress node maintain a timer for each flow. An
optimization would be to aggregate all "micro-flows" between each
pair of ingress and egress nodes into one flow, compute the state
value based on the aggregate reservation rate, and insert a dummy
packet only if there is no data packet for the aggregate flow during
an interval.
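The per-flow timer check above can be sketched in a few lines of C (a
minimal sketch; the T_I value and the representation of the flow table
are hypothetical):

```c
#include <assert.h>

#define T_I 100 /* maximum inter-departure time, in ms (hypothetical) */

/* Returns 1 if the ingress node must inject a dummy packet for a flow,
   i.e., no data packet of the flow departed during the last T_I. The
   caller passes the departure time of the flow's previous packet. */
int needs_dummy(long crt_time, long prev_time_i)
{
    return (crt_time - prev_time_i) >= T_I;
}
```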
Figure 3 - Admission control and rate estimation algorithm.
---------------------------------------------------------------------
 Ingress node
---------------------------------------------------------------------
 on packet p departure:
   i = get_flow(p);
   state(p) = r[i] * (crt_time - prev_time[i]);
   prev_time[i] = crt_time;
---------------------------------------------------------------------
 Interior node
---------------------------------------------------------------------
  Reservation Estimation        |  Admission Control
---------------------------------------------------------------------
 on packet p arrival:           | on reservation request r:
   b = state(p);                |   /* admission ctrl. test */
   L = L + b;                   |   if (R_bound + r <= C) {
                                |     Ra = Ra + r;
 on time-out T_W:               |     R_bound = R_bound + r;
   /* estimated reservation */  |     accept(r);
   R_est = L / T_W;             |   } else
   R_bound = min(R_bound,       |     deny(r);
       R_est/(1 - f) + Ra);     |
   Ra = 0;                      |
---------------------------------------------------------------------
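For concreteness, the interior-node logic of Figure 3 can be rendered
as plain C. This is a simplified single-link sketch: the capacity,
window length, and f below are hypothetical values, and the per-node
state is kept in globals for brevity.

```c
#include <assert.h>

/* Interior-node state from Figure 3 (hypothetical single link;
   rates and capacity in the same units, e.g. bits/sec). */
#define T_W 10.0            /* estimation window */
#define F   0.2             /* f = (T_I + T_J)/T_W */

static double C_link  = 1000.0; /* link capacity */
static double L       = 0.0;    /* bits accounted in current window */
static double Ra      = 0.0;    /* reservations accepted in window */
static double R_bound = 0.0;    /* upper bound on aggregate resv. */
static double R_est   = 0.0;

/* on packet p arrival: accumulate the state value b it carries */
void on_packet(double b) { L += b; }

/* on reservation request r: the admission test of Figure 3 */
int on_request(double r)
{
    if (R_bound + r <= C_link) {
        Ra += r;
        R_bound += r;
        return 1; /* accept */
    }
    return 0; /* deny */
}

/* on time-out T_W: recalibrate R_bound per eq. (8) */
void on_window_end(void)
{
    double recal;
    R_est = L / T_W;
    recal = R_est / (1.0 - F) + Ra;
    if (recal < R_bound)
        R_bound = recal;
    L = 0.0;
    Ra = 0.0;
}
```

Running the sketch: after accepting a 600-unit reservation, a 500-unit
request is denied (600 + 500 > 1000); a window in which the accepted
flow sends exactly its reservation leaves R_bound unchanged at 600, so
a subsequent 300-unit request is accepted.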
4. Carrying State in Packets
There are at least three ways to encode state in the packet
header: (1) introduce a new IP option and insert the option at the
ingress router, (2) introduce a new header between layer 2 and layer
3, and (3) use the existing IP header.
Inserting an IP option, while having the potential to provide a large
space for encoding state, would require all routers within a DS
domain to process IP options, thus complicating packet processing. In
addition, adding an IP option may cause packet fragmentation, which
would further reduce performance.
Introducing a new header between layer 2 and layer 3 would require
that solutions be devised for each layer 2 technology. In the context
of MPLS [Rosen98, Rosen99] networks, the state can be encoded in a
special label. One way to do this is to use a particular encoding of
the experimental use field to indicate that a nested label on the
label stack carries the PHB-specific state information rather than an
ordinary label. In this case, the label on the top of the stack would
indicate the label-switched path, and the inner label the
PHB-specific state. This would require a small addition to the MPLS
architecture to allow two labels to be pushed or popped in unison and
treated as a single entity on the label stack.
4.1. Encoding State within an IP header
In this section, we discuss the third option: storing the additional
state in the IP header. The biggest problem with using the IP header
is finding enough space to insert the extra information. The main
challenge is to remain fully compatible with current standards and
protocols. In particular, we want the network domain to be
transparent to end-to-end protocols, i.e., the egress node should
restore the fields changed by ingress and interior nodes to their
original values.
In this respect, we observe that the ip_off field in the IPv4 header,
which supports packet fragmentation/reassembly, is rarely used. For
example, by analyzing traces of over 1.7 million packets on an OC-3
link [nlanr], we found that less than 0.22% of all packets were
fragments. In addition, there are a relatively small number of
distinct fragment sizes. Therefore, it is possible to use a fraction
of the ip_off field to encode the fragment sizes, and the remaining
bits to encode DPS information. The idea can be implemented as
follows.
When a packet arrives at an ingress node, the node checks whether the
packet is a fragment or needs to be fragmented. If neither is true,
all 13 bits of the ip_off field in the packet header are used to
encode DPS values. If the packet is a fragment, the fragment size is
recoded into a more efficient representation and the rest of the bits
are used to encode the DPS information. The fragment size field
will be restored at the egress node.
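One possible realization of this scheme is sketched below. The split
of the 13 bits — one flag bit, a 4-bit index into a table of fragment
offsets observed in the domain, and the remaining bits for DPS state —
is hypothetical; the draft requires only that the egress node be able
to restore the original ip_off value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical table of distinct fragment offsets (in 8-byte units)
   observed on the link; maintained by the edge nodes. */
static const uint16_t frag_offsets[16] = {
    0, 185, 370, 555, /* remaining entries unused in this sketch */
};

/* Ingress: pack the 13-bit field. If the packet is not a fragment,
   12 bits carry DPS state; otherwise 4 bits index the offset table
   and 8 bits carry DPS state. */
uint16_t pack_ip_off(int is_frag, int off_index, uint16_t dps)
{
    if (!is_frag)
        return dps & 0x0FFF;                  /* flag = 0 */
    return (uint16_t)(0x1000 | ((off_index & 0xF) << 8) | (dps & 0xFF));
}

/* Egress: restore the original fragment offset. */
uint16_t unpack_frag_offset(uint16_t field)
{
    if (!(field & 0x1000))
        return 0;                             /* not a fragment */
    return frag_offsets[(field >> 8) & 0xF];
}

/* Interior: extract the DPS state under either layout. */
uint16_t unpack_dps(uint16_t field)
{
    return (field & 0x1000) ? (field & 0xFF) : (field & 0x0FFF);
}
```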
In the above, we make the implicit assumption that a packet can be
fragmented only by ingress nodes, and not by interior nodes. This is
consistent with the diffserv view that the forwarding behavior of a
network's components is engineered to be compatible throughout a
domain.
In summary, this gives us up to 13 bits in the current IPv4 header to
encode the PHB-specific state.
4.2. Example of State Encoding
The state encoding is PHB dependent. In this section, we give
examples of encoding the state for the services described in
Section 3.
4.2.1. Encoding a Flow's Rate
Recall that in CSFQ, the PHB state represents an estimate of the
current rate of the flow to which the packet belongs. One possible
way to represent the rate estimate is to restrict it to only a small
number of possible values. For example, if we limit it to 128 values,
only seven bits are needed to represent the rate. While this can be a
reasonable solution in practice, in the following we propose a more
sophisticated representation that allows us to express a larger range
of values.
Let r denote the rate estimate. To represent r we use an m bit
mantissa and an n bit exponent. Since r >= 0, it is possible to gain
an extra bit for the mantissa. For this we consider two cases: (a) if
r >= 2^m, we represent r as the closest value of the form u * 2^v,
where 2^m <= u < 2^(m+1). Then, since the most significant (i.e.,
(m+1)-th) bit in u's representation is always 1, we can omit it. As
an example, assume m = 3, n = 4, and r = 19 = 10011. Then 19 is
represented as 18 = u * 2^v, where u = 9 = 1001 and v = 1. By
omitting the first bit in the representation of u, the mantissa will
store 001, while the exponent will be 1. (b) On the other hand, if
r < 2^m, the mantissa will contain r, while the exponent will be
2^n - 1. For example, for m = 3, n = 4, and r = 6 = 110, the mantissa
is 110, while the exponent is 1111. Converting from one format to the
other can be efficiently implemented. Figure 4 shows the conversion
code in C. For simplicity, here we assume that integers are truncated
rather than rounded when converted to the floating point format.
Figure 4. The C code for converting between integer and floating
point formats. m represents the number of bits used by
the mantissa; n represents the number of bits in the
exponent.
---------------------------------------------------------------------
void intToFP(int val, int *mantissa, int *exponent) {
    /* get_num_bits(v) returns the position of the most significant
       1 bit in v, counting from 1 */
    int nbits = get_num_bits(val);
    if (nbits <= m) {
        /* small value: store it verbatim, flag with all-ones exponent */
        *mantissa = val;
        *exponent = (1 << n) - 1;
    } else {
        *exponent = nbits - m - 1;
        /* drop the implicit leading 1 bit of the mantissa */
        *mantissa = (val >> *exponent) - (1 << m);
    }
}

int FPToInt(int mantissa, int exponent) {
    int tmp;
    if (exponent == ((1 << n) - 1))
        return mantissa;
    tmp = mantissa | (1 << m); /* restore the implicit leading 1 bit */
    return (tmp << exponent);
}
---------------------------------------------------------------------
By using m bits for the mantissa and n for the exponent, we can
represent any integer in the range [0..(2^(m+1) - 1) * 2^(2^n - 1)]
with a relative error bounded by (-1/2^(m+1), 1/2^(m+1)). For
example, with 7 bits, by allocating 3 for the mantissa and 4 for the
exponent, we can represent any integer in the range [0..15 * 2^15]
with a relative error of (-6.25%, 6.25%). The worst relative errors
occur when the implied mantissa u takes its smallest value, u = 8.
For example, the number r = 271 = 100001111 is encoded as u = 1000,
v = 5, with a relative error of (8 * 2^5 - 271)/271 = -0.0554 =
-5.54%. Similarly, r = 273 = 100010001 is encoded as u = 1001, v = 5,
with a relative error of (9 * 2^5 - 273)/273 = 0.0549 = 5.49%.
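The accuracy claim can be checked mechanically. The self-contained
sketch below rounds r to the closest value of the form u * 2^v (as in
the examples above, rather than truncating as in Figure 4) and
verifies the relative error bound for m = 3; the helper names are
ours, not part of the draft's encoding.

```c
#include <assert.h>

#define M 3 /* number of mantissa bits, as in the example above */

/* Closest value of the form u * 2^v with u in [2^M, 2^(M+1)),
   or r itself if r < 2^M. */
long fp_round(long r)
{
    long v = 0, scaled;
    if (r < (1L << M))
        return r;
    /* find v such that r / 2^v falls in [2^M, 2^(M+1)) */
    while ((r >> v) >= (1L << (M + 1)))
        v++;
    scaled = (r + (1L << v) / 2) >> v;   /* round to nearest */
    if (scaled >= (1L << (M + 1))) {     /* rounding overflowed */
        v++;
        scaled = (r + (1L << v) / 2) >> v;
    }
    return scaled << v;
}

/* 1 if |fp_round(r) - r| / r <= 1/2^(M+1), i.e. 6.25% for M = 3 */
int within_bound(long r)
{
    long d = fp_round(r) - r;
    if (d < 0)
        d = -d;
    return 16 * d <= r; /* 2^(M+1) = 16 */
}
```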
4.2.2. Encoding Scheduling Parameters
As shown in Figure 2, implementing CJVC requires encoding three
parameters: (1) the rate r, or equivalently l/r, (2) the slack delay
delta, and (3) g. Since the deadline determines the delay guarantees,
we use a representation that trades eligible-time accuracy for
deadline accuracy.
To store the scheduling information we use three fields. The first
field encodes l/r + delta by using a floating point like
representation. Since g is never larger than l/r we encode it as a
fraction of l/r + delta. This allows us to save space without
compromising the accuracy. More precisely, in [StoZha99] we show that
by using three bits for the mantissa and four bits for the exponent
to represent l/r + delta, and only three bits to represent g, we can
compute the packet's deadline, i.e., d = crt_time + g + l/r + delta,
with a relative error bounded by (-13%, 13%). The third field stores
l/r, also as a fraction of l/r + delta. The eligible time is then
computed as e = d - l/r. By allocating another three bits for this
field we can store all scheduling parameters in 13 bits. In
[StoZha99] we give a slightly different encoding that uses two extra
bits in the DS field to store the reservation state.
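As an illustration of the deadline computation, the sketch below
decodes the two 3-bit fields as fractions k/8 of the base value
T = l/r + delta. This fractional interpretation is one plausible
reading; [StoZha99] gives the exact encoding.

```c
#include <assert.h>

/* Decode a 3-bit field as the fraction k/8 of a base value
   (hypothetical interpretation of the fraction fields). */
double frac3(int k, double base)
{
    return (k / 8.0) * base;
}

/* Given the decoded base T = l/r + delta and the two 3-bit
   fraction fields, compute d = crt_time + g + l/r + delta
   and e = d - l/r. */
void decode_schedule(double crt_time, double T, int g3, int lr3,
                     double *d, double *e)
{
    double g  = frac3(g3, T);
    double lr = frac3(lr3, T);
    *d = crt_time + g + T;
    *e = *d - lr;
}
```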
4.2.3. Encoding Reservation State
As shown in Figure 2, when estimating the aggregate reservation, the
PHB state represents the number of bits that a flow is entitled to
send during the interval between the transmission of the flow's
previous packet and the transmission of the current packet. This
number can be simply encoded as an integer b. To reduce the range, a
possibility is to store b/l instead of b, where l is the length of
the packet.
4.3. Encoding Multiple Values
Since the space in the packet's header is a scarce resource, encoding
multiple values is particularly challenging. In this section we
discuss two general methods that help alleviate this difficulty.
In the first method, exemplified in Section 4.2.2, the idea is to
leverage additional knowledge about the state semantics to achieve an
efficient encoding. In particular, one value can be stored as a
function of other values. For example, if a value is known to be
always greater than the other values, the larger value can be
represented in floating point format, while the other values may be
represented as fractions of this value (see Section 4.2.2).
The idea of the second method is to have different packets within a
flow carry different state formats. This method is appropriate for
PHBs that do not require all packets of a flow to carry the same
state. For example, in estimating the aggregate reservation (see
Section 3.3.2) there is no need for every packet to carry the number
of bits the flow is entitled to send between the current time and the
time when the previous packet was transmitted. The only requirement
is that the distance between any two consecutive packets that carry
such values be no larger than T_I. Other packets in between can carry
different information. Similarly, if we encode the IP fragment size
in the packet's state, the packet has to carry this value only if it
is actually a fragment; otherwise the packet can carry other state
instead. On the other hand,
note that in CSFQ every packet must carry the rate estimate, as this
is used to compute the dropping probability for that packet.
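The second method can be sketched with a (hypothetical) 1-bit tag in
the carried state that selects between two formats, so that different
packets of the same flow can carry different values; the tag values
and field widths below are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

#define TAG_RESERVATION 0 /* low 12 bits: reservation bits b */
#define TAG_FRAG_SIZE   1 /* low 12 bits: encoded fragment size */

/* Build a 13-bit state word: 1 tag bit plus a 12-bit value. */
uint16_t mk_state(int tag, uint16_t value)
{
    return (uint16_t)(((tag & 1) << 12) | (value & 0x0FFF));
}

int state_tag(uint16_t s)      { return s >> 12; }
uint16_t state_value(uint16_t s) { return s & 0x0FFF; }
```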
5. Specification Issues
This section briefly describes some issues related to drafting
specifications for PHBs based on DPS.
5.1. Recommended Codepoints
At this time it is appropriate to use values drawn from the 16
codepoints [Nichols98] reserved for local and experimental use
(xxxx11) to encode PHBs based on DPS.
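Membership in the experimental/local-use pool of [Nichols98] — the
6-bit codepoints of the form xxxx11 — is a simple bit test:

```c
#include <assert.h>

/* 1 if dscp is a valid 6-bit codepoint of the form xxxx11,
   i.e., drawn from the local/experimental-use pool [Nichols98]. */
int is_exp_local_codepoint(unsigned dscp)
{
    return (dscp & 0x3F) == dscp && (dscp & 0x3) == 0x3;
}
```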
5.2. Interaction with other PHBs
The interaction of DPS PHBs with other PHBs obviously depends on the
PHB semantics. It should be noted that the presence of other PHBs in
a node may affect the computation and update of DPS state, as well as
the actual forwarding behavior experienced by the packet.
5.3. Tunneling
When packets with PHBs based on DPS are tunneled, the end-nodes must
make sure that (1) the tunnel is marked with a PHB that does not
violate the original PHB semantic, and (2) the PHB specific state is
correctly updated at the end of the tunnel. This requirement might be
met by using a tunnel PHB that records and updates packet state, and
then copying the state from the encapsulating packet to the inner
packet at the tunnel endpoint. Alternatively, the behavior of the
tunnel might be measured or precomputed in a way that allows the
encapsulated packet's DPS state to be updated at the decapsulation
point without requiring the tunnel to support DPS behavior.
6. Security Considerations
The space allocated for the PHB state in the packet header must be
compatible with IPsec. In this context we note that using the
fragment offset to carry PHB state does not affect IPsec's end-to-end
security, since the fragment offset is not used for cryptographic
calculations [Kent98]. Thus, as is the case with the DS field
[Nichols98], IPsec does not provide any defense against malicious
modifications of the PHB state. This leaves the door open for denial
of service attacks. For example, in CSFQ, misrepresentation of a
flow's rate may cause a disproportionate share of bandwidth to be
allocated to the flow inside the DS domain. In the example in
Section 3.3, underestimation of the aggregate reservation can lead to
over-subscription of resources.
One way to expose denial of service attacks is auditing. In this
context, we note that associating state with PHBs makes it easier to
perform efficient auditing at interior nodes. For example, in CSFQ, a
potential attack can be detected by simply measuring a flow's rate
and comparing it against the estimated rate carried by the flow's
packets.
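A minimal auditing check along these lines is sketched below: measure
a flow's arrival rate over a window and flag the flow if it exceeds
the rate estimate carried in its packets by more than a tolerance.
The tolerance value is hypothetical; in practice it would account for
the estimation error of the carried rate.

```c
#include <assert.h>

#define TOLERANCE 1.10 /* allow 10% estimation slack (hypothetical) */

/* Returns 1 (flag the flow) if the measured rate over the window
   exceeds the carried rate estimate by more than the tolerance. */
int audit_flow(double measured_bits, double window,
               double carried_rate_estimate)
{
    double measured_rate = measured_bits / window;
    return measured_rate > TOLERANCE * carried_rate_estimate;
}
```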
7. Conclusions
In this document we propose an extension of the diffserv architecture
by defining a new set of PHBs that are based on Dynamic Packet State.
By using DPS to coordinate actions of edge and interior nodes along
the path traversed by a flow, distributed algorithms can be designed
to approximate the behavior of a broad class of "stateful" networks
within the diffserv architecture. Such an extension will significantly
increase the flexibility and capabilities of the services that can be
provided by diffserv.
8. References
[CSFQ] Core-Stateless Fair Queueing. URL:
http://www.cs.cmu.edu/~istoica/csfq.
[nlanr] NLANR: Network Traffic Packet Header Traces. URL:
http://moat.nlanr.net/Traces/.
[BenZha96] J.C.R. Bennett and H. Zhang. WF2Q: Worst-case fair
weighted fair queueing. In Proceedings of IEEE INFOCOM'96, pages
120-128, San Francisco, CA, March 1996.
[Bernet98] Y. Bernet, R. Yavatkar, P. Ford, F. Baker, L. Zhang,
K. Nichols and M. Speer. A Framework for Use of RSVP with Diff-
serv Networks, Internet Draft, draft-ietf-diffserv-rsvp-01.txt,
November, 1998.
[Blake98] S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and
W. Weiss. An Architecture for Differentiated Services, Internet
Draft, draft-ietf-diffserv-arch-02.txt, October 1998.
[Demers89] A. Demers, S. Keshav, and S. Shenker. Analysis and
simulation of a fair queueing algorithm. In Journal of
Internetworking Research and Experience, pages 3-26, October
1990. Also in Proceedings of ACM SIGCOMM'89, pages 3-12.
[FigPas95] N. Figueira and J. Pasquale. An upper bound on delay
for the Virtual Clock service discipline. IEEE/ACM Transactions
on Networking, 3(4), April 1995.
[FloFal97] S. Floyd and K. Fall. Router mechanisms to support
end-to-end congestion control, LBL Technical Report, February
1997.
[Georgiadis95] L. Georgiadis, R. Guerin, and V. Peris. Efficient
network QoS provisioning based on per node traffic shaping. In
IEEE INFOCOM'96, San Francisco, CA, March 1996.
[Golestani94] S. Golestani. A self-clocked fair queueing scheme
for broadband applications. In Proceedings of IEEE INFOCOM'94,
pages 636-646, Toronto, CA, June 1994.
[Goyal95] P. Goyal, S. Lam, and H. Vin. Determining end-to-end
delay bounds in heterogeneous networks. In Proceedings of the
5th International Workshop on Network and Operating System
Support For Digital Audio and Video, pages 287-298, Durham, New
Hampshire, April 1995.
[Jacobson98] V. Jacobson, K. Poduri and K. Nichols. An Expedited
Forwarding PHB, Internet Draft, draft-ietf-diffserv-phb-ef-01.txt,
November 1998.
[Heinanen99] J. Heinanen, F. Baker, W. Weiss, and J. Wroclawski.
Assured Forwarding PHB Group, Internet Draft,
draft-ietf-diffserv-af-04.txt, January 1999.
[Kent98] S. Kent and R. Atkinson. IP Authentication Header,
Request for Comments: 2402, November 1998.
[Lin97] D. Lin and R. Morris. Dynamics of random early detection.
In Proceedings of ACM SIGCOMM'97, pages 127-137, Cannes, France,
October 1997.
[Nagle87] J. Nagle. On packet switches with infinite storage. IEEE
Trans. on Communications, 35(4):435-438, April 1987.
[Nichols98] K. Nichols, S. Blake, F. Baker, and D. L. Black.
Definition of the Differentiated Services Field (DS Field) in the
IPv4 and IPv6 Headers, Internet Draft,
draft-ietf-diffserv-header-04.txt, October 1998.
[ns2] UCB/LBNL/VINT Network Simulator - ns (version 2).
[Parekh92] A. Parekh and R. Gallager. A generalized processor
sharing approach to flow control - the single node case. In
Proceedings of IEEE INFOCOM'92, 1992.
[Rosen98] E. C. Rosen, Y. Rekhter, D. Tappan, D. Farinacci, G.
Fedorkow, T. Li, and A. Conta. MPLS Label Stack Encoding,
Internet Draft, draft-ietf-mpls-label-encaps-03.txt, September
1998.
[Rosen99] E. C. Rosen, A. Viswanathan, and R. Callon.
Multiprotocol Label Switching Architecture, Internet Draft,
draft-ietf-mpls-arch-04.txt, February 1999.
[ShrVar95] M. Shreedhar and G. Varghese. Efficient fair queueing
using deficit round robin. In Proceedings of SIGCOMM'95, pages
231-243, Boston, MA, September 1995.
[StiVer96] D. Stilliadis and A. Verma. Latency-rate servers: A
general model for analysis of traffic scheduling algorithms. In
Proceedings of IEEE INFOCOM'96, pages 111-119, San Francisco, CA,
March 1996.
[Stoica98] I. Stoica, S. Shenker, and H. Zhang. Core-Stateless
Fair Queueing: Achieving Approximately Fair Bandwidth Allocations
in High Speed Networks. In Proceedings ACM SIGCOMM'98, pages
118-130, Vancouver, September 1998.
[StoZha99] I. Stoica and H. Zhang. Providing Guaranteed Services
Without Per Flow Management. URL:
http://www.cs.cmu.edu/~istoica/DPS/, February 1999.
[Verma91] D. Verma, H. Zhang, and D. Ferrari. Guaranteeing delay
jitter bounds in packet switching networks. In Proceedings of
Tricomm'91, pages 35-46, Chapel Hill, North Carolina, April 1991.
[ZhaFer94] H. Zhang and D. Ferrari. Rate-controlled service
disciplines. Journal of High Speed Networks, 3(4):389-412, 1994.
[Zhang90] L. Zhang. Virtual Clock: A new traffic control
algorithm for packet switching networks. In Proceedings of ACM
SIGCOMM'90, pages 19-29, Philadelphia, PA, September 1990.
9. Authors' Addresses

   Ion Stoica
   Carnegie Mellon University
   5000 Forbes Ave
   Pittsburgh, PA 15213
   Phone: (412) 268-2993
   Email: istoica@cs.cmu.edu

   Hui Zhang
   Carnegie Mellon University
   5000 Forbes Ave
   Pittsburgh, PA 15213
   Phone: (412) 268-8945
   Email: hzhang@cs.cmu.edu

   Scott Shenker
   ICSI
   1947 Center Street, Suite 600
   Berkeley, CA 94704-1198
   Phone: (510) 642-4274 x300
   Email: shenker@icsi.berkeley.edu

   Raj Yavatkar
   Intel Corporation, JF3-206
   2111 NE 25th Avenue
   Hillsboro, OR 97124
   Phone: (503) 264-9077
   Email: raj.yavatkar@intel.com

   Donpaul Stephens
   Ascend Communications, Inc.
   10250 Valley View Road, Suite 113
   Minneapolis, MN 55344
   Phone: (612) 996-6840
   Email: donpaul@ascend.com

   Andrew G. Malis
   Ascend Communications, Inc.
   1 Robbins Road
   Westford, MA 01886
   Phone: (978) 952-7414
   Email: amalis@ascend.com

   Yoram Bernet
   Microsoft
   One Microsoft Way
   Redmond, WA 98052
   Phone: (425) 936-9568
   Email: yoramb@microsoft.com

   Zheng Wang
   Lucent Technologies
   101 Crawfords Corner Road, Room 4C-516
   Holmdel, NJ 07733, USA
   Phone: (732) 949-1042
   Email: zhwang@dnrc.bell-labs.com

   Fred Baker
   Cisco Systems
   519 Lado Drive
   Santa Barbara, CA 93111
   Phone: (408) 526-4257
   Email: fred@cisco.com

   John Wroclawski
   MIT Lab for Computer Science
   545 Technology Sq.
   Cambridge, MA 02139
   Phone: (617) 253-7885
   Email: jtw@lcs.mit.edu

   Chuck Song
   MCI Internet Engineering
   2100 Reston Parkway
   Phone: (703) 715-7082
   Email: csong@mci.net

   Rick Wilder
   MCI Internet Engineering
   2100 Reston Parkway
   Phone: (703) 715-7114
   Email: wilder@mci.net