
## 2.2 Routing in the weak switch model

Routing is more complicated in the weak switch model. A problem arises when several packets arrive on the same input bundle at the same step. Since a switch can examine only one edge in the bundle, it can forward only one of them, while the others must wait. The routing algorithm must guarantee that the total delay experienced by each packet is at most O(log N) bit steps, with high probability.

Our routing algorithm is a variant of Ranade's algorithm for routing on the butterfly [18, 19, 23]. In Ranade's algorithm, each packet is initially assigned an O(log N)-bit random rank, which it carries in front of its destination header. The most important invariant maintained by the algorithm is that the stream of packets leaving each switch is arranged in order of non-decreasing rank. In order to keep the stream sorted, a switch never sends a packet until it is sure that no other packet with smaller rank will arrive later.

The problem with implementing Ranade's algorithm in the bit model is that it takes too long to compare the ranks of two packets at the same switch: comparing two O(log N)-bit ranks takes O(log N) bit steps. In our algorithm, each packet is also assigned a random rank initially, and that rank does not change throughout the course of the routing. However, a packet does not carry its rank along with it. Instead, we introduce a new type of packet called a token. A token carries no information other than its type, and as a consequence consists of only a constant number of bits. The algorithm uses these tokens to represent the ranks of the other packets. It maintains the invariant that a packet of rank r passes through a bundle only after exactly r tokens have passed through the bundle. This invariant guarantees that the packets pass through the bundle in order sorted by rank. A token can be considered to have a rank as well; a token of rank r passes through a bundle only after exactly r tokens have passed through the bundle. Note that the tokens are not the same as the ghost packets used in Ranade's algorithm, which carry a rank. Our algorithm does not use ghost packets.
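The token invariant can be checked mechanically. The following sketch (our own illustrative representation, not from the paper) verifies that a packet of rank r crosses a bundle only after exactly r tokens have; a stream satisfying this condition necessarily delivers its packets in non-decreasing rank order.

```python
# Sketch of the token-count invariant on one bundle. Items in the stream
# are ("token",) or ("packet", rank); the representation is assumed.

def invariant_holds(stream):
    """A packet of rank r may pass only after exactly r tokens have passed."""
    tokens_seen = 0
    for item in stream:
        if item[0] == "token":
            tokens_seen += 1
        else:
            _, rank = item
            if rank != tokens_seen:   # wrong number of tokens seen so far
                return False
    return True

# r tokens precede each packet of rank r, so this stream is valid, and the
# packets appear in sorted rank order as a consequence:
stream = [("packet", 0), ("token",), ("packet", 1), ("packet", 1), ("token",)]
```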

The mechanics of the algorithm are fairly simple. A switch maintains a pointer to an edge in each of its two input and two output bundles. We call these edges the current input and output edges, respectively. The behavior of the switch depends on whether the current input queues contain packets or tokens. There are four cases. If both of the current input edges have tokens in their queues, then the switch sends tokens on both of the current output edges and advances all four pointers. If one of the current input edges has a token in its queue and the other a packet, then the switch strips off the first bit of the packet's header, shunts the packet to the current edge of the output bundle specified by that bit, and advances the pointers for the input and output bundles used by the packet. If both of the current input edges have packets in their queues, then the switch forwards one of them, and advances the pointers for the bundles used by that one. If either of the current input edges has an empty queue, then the switch does nothing.
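The four cases can be sketched in code. The representation below is an assumption of ours, not the paper's: a current input edge holds None (empty queue), the string "token", or a packet given by its remaining header bits. Note that in the token-plus-packet case only the packet moves; the token waits, since only the packet's pointers advance.

```python
# Sketch of one step of a weak switch (names and representation assumed).

def switch_step(in0, in1):
    """Return (consumed, sent): indices of the inputs consumed, and a list
    of (output_index, item) pairs sent on the current output edges."""
    inputs = [in0, in1]
    if in0 is None or in1 is None:
        return [], []                        # an empty input queue: do nothing
    if in0 == "token" and in1 == "token":    # two tokens: send a token on
        return [0, 1], [(0, "token"), (1, "token")]  # both output bundles
    if "token" in inputs:                    # token + packet: the packet
        i = 1 if in0 == "token" else 0       # moves on, the token waits
    else:                                    # two packets: forward one of
        i = 0                                # them (arbitrarily, input 0)
    bit, rest = inputs[i][0], inputs[i][1:]  # strip the first header bit
    return [i], [(bit, rest)]                # shunt to the bundle it names

# Example: a token on input 0 and a packet with header bits [1, 0] on input 1;
# the packet is forwarded on output bundle 1 with its first header bit gone.
```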

Initially, each packet p is assigned a random rank r in the range [0, w-1], where the value of w will be specified later. The input switches on level 0 send out a stream of packets interspersed with tokens so that r tokens precede each packet with rank r. A total of w tokens are sent on each bundle. The tokens with rank w-1 follow all of the other packets and tokens in the network. A routing phase terminates when every output has received a rank w-1 token on both of its input bundles. Rank w-1 tokens can be specially marked so that the outputs do not have to count the number of tokens that they have received.
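The stream emitted by an input switch can be sketched as follows (helper names are ours): each packet of rank r is preceded by exactly r tokens, and the stream is padded so that w tokens are sent in total, the last of which is the rank-(w-1) token that trails everything else.

```python
# Sketch of the stream an input switch on level 0 emits (names assumed).

def input_stream(packets, w):
    """packets: list of (rank, payload) pairs; returns the emitted stream,
    with r tokens preceding each rank-r packet and w tokens in total."""
    stream, tokens_sent = [], 0
    for rank, payload in sorted(packets):    # packets go out in rank order
        while tokens_sent < rank:            # rank r follows exactly r tokens
            stream.append("token")
            tokens_sent += 1
        stream.append(("packet", rank, payload))
    while tokens_sent < w:                   # pad to w tokens; the last one
        stream.append("token")               # emitted has rank w-1
        tokens_sent += 1
    return stream

stream = input_stream([(2, "a"), (0, "b")], w=4)
# "b" (rank 0) goes first, then two tokens, then "a" (rank 2), then two more
```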

During the course of the algorithm, each edge is used by one packet, by one token, or not at all. Because w tokens pass through each bundle, we further dilate the butterfly by adding w edges to each bundle. By using the same edges to send both packets and tokens, it is possible to remove these extra w edges. However, the algorithm and its analysis then become slightly more complicated, because several tokens may be sent across the same edge, and each one must wait to be sent until a space frees up in the queue at the end of the edge.

The proof that the heads of the packets quickly reach their destinations uses a delay sequence argument like that in [18, 19, 23]. A k-delay sequence consists of four components: a path backwards from an output to an input, a sequence s_1, ..., s_m of (not necessarily distinct) switches appearing on the path in that order, a sequence p_1, ..., p_m of distinct packets, where m >= k - w, and a sequence r_1 >= r_2 >= ... >= r_m of non-increasing ranks. We say that a k-delay sequence has occurred if packet p_i passes through switch s_i for 1 <= i <= m, and rank(p_i) = r_i for 1 <= i <= m.

The following lemma is the crux of the delay sequence argument. It shows that if some packet is delayed, then a delay sequence must have occurred. For simplicity we analyze the first routing phase only.

Lemma: If some packet arrives at its destination at step log N + k, then a k-delay sequence has occurred.

Proof: We will construct a delay sequence by spending lag points. The lag of a packet on level l at time t is t - l. If some packet arrives at its destination on level log N at step log N + k, then it has lag k. We use these k points to build the sequence. For each point spent, we will either insert a packet into the sequence, or decrease the rank of the next packet to be inserted. Since there can be at most w decreases in the rank, the sequence must contain at least k - w packets.

The construction of the delay sequence begins with the last packet to arrive at its destination. Suppose a packet p arrives at its destination at time log N + k. First, we insert p, its destination switch, and its rank into the delay sequence. Then we follow p back in time until it was last delayed. The lag remains k at each step on the path, until the step at which p failed to move forward, where the lag drops to k - 1.

If p was delayed at a switch s because another packet q was sent instead, then rank(q) <= rank(p), and we insert q, s, and rank(q) into the sequence, and follow q back in time. In this case, we have inserted one packet into the sequence at a cost of one lag point. In the butterfly network, once the paths of two packets diverge, they do not meet again. Thus, if q delays p, then q remains ahead of p as long as their paths intersect. As a consequence, no packet is inserted into the sequence more than once.

If p was delayed because a token was sent instead, then the token's rank is strictly smaller than rank(p). In this case we do not insert anything into the sequence, but instead follow the token back in time until it was last delayed. Here we have lost one point in lag, but have also decreased the rank of the next packet in the sequence by at least one point.

The third possibility is that p was delayed because the current edge on the other input bundle, c, had nothing to send. In this case we do not insert anything into the sequence, but go back through c to a switch s at the previous level at the previous time step. At that time step, switch s must not have sent anything through bundle c. This can happen for one of two reasons: either s sent a packet on its other output bundle, or one of the input bundles to s had nothing to send. In the former case, we insert that packet, s, and the packet's rank into the sequence, and follow the packet back until it was last delayed. In the latter case, we go back through the empty input bundle to a switch at the previous level at the previous time step, and continue. Continuing in this fashion, we must eventually reach a switch that sent a packet.

Now that we have analyzed the ways in which a packet can be delayed, we must consider the ways in which a token can be delayed. Suppose we are following a token t back in time until it was last delayed at a switch s.

If t was delayed because s sent a packet q instead, then we insert q, s, and rank(q) into the sequence.

Otherwise, if t was delayed because s sent another token instead, then that token's rank is smaller than the rank of t, and we follow it back until it was last delayed.

The last possibility is that t was delayed because one of the input bundles c to s had nothing to send. In this case, we go back through c just as we would if a packet had been delayed for the same reason.

The preceding case analysis shows that for each point of lag spent, either a packet is inserted into the delay sequence, or the rank of the next packet in the sequence decreases. Thus, the sequence must contain at least k - w packets.


Proof: To bound the probability that some packet is delayed for k steps, we now need only to bound the probability that some k-delay sequence occurs. To do so, we enumerate all possible delay sequences, and sum the probabilities that each occurs. The argument is similar to the ones in [18, 19, 23]. There are N^2 ways to choose the path, since there are N choices for the output at which it starts, N choices for the input at which it ends, and only one path between them. Once the path is fixed, there are at most C(log N + m, m) ways of choosing the switches s_1, ..., s_m on the path, where C(a, b) denotes a binomial coefficient. Let l_i denote the level of switch s_i. Then there are at most 2^(l_i) ways to choose p_i, since only the packets originating at the 2^(l_i) inputs that reach s_i can pass through it. Finally, there are at most C(w + m, m) ways of choosing a sequence of non-increasing ranks in the range [0, w-1]. A delay sequence occurs only if p_i passes through s_i and rank(p_i) = r_i, for 1 <= i <= m. For the delay sequence constructed thus far, the probabilities that these two events occur are 2^(-l_i) and 1/w, respectively. Thus, the probability that some delay sequence with m packets occurs is at most

N^2 * C(log N + m, m) * C(w + m, m) * prod_{i=1}^{m} 2^(l_i) * 2^(-l_i) * (1/w) = N^2 * C(log N + m, m) * C(w + m, m) * (1/w)^m,

and summing this bound over all m >= k - w bounds the probability that any k-delay sequence occurs.

Using the inequality C(a, b) <= (ea/b)^b, it is straightforward to show that for any constant c, there are constants alpha and beta such that for w = alpha log N and k = beta log N, the product is at most 1/N^c.


The analysis for the second phase of routing is similar. If we require the first phase to be completed before the second phase begins, then each phase takes O(M + log N) bit steps, and the total time is O(M + log N) bit steps. In fact, it is possible to pipeline the two phases so that some packets begin the second phase before the first is finished. The resulting algorithm is simpler and still requires only O(M + log N) bit steps. The analysis is a little more complicated because a packet p can delay another packet q in the first phase, and q can delay p in the second.

Once the head of a packet reaches its true destination, the M bits of data following behind it trickle into the destination in another M bit steps. Since, with high probability, all of the heads reach their destinations in O(log N) bit steps, the total time is O(M + log N) bit steps, also with high probability.


Bruce Maggs
Mon Jul 22 17:37:10 EDT 1996