Gigabit Nectar Host Interfaces

The Gigabit Nectar network is built around a HIPPI interconnect. HIPPI networks are based on switches and have a link speed of 800 Mbit/second. Since this rate is close to the internal bus speed of many computer systems, inefficiencies in the data flow (e.g. unnecessary copying of the data) reduce the throughput observed by applications. The Nectar group developed a network interface architecture built around the "Communication Acceleration Block" (CAB). It is a single-copy architecture: the CAB uses DMA to transfer the data directly from application buffers to buffers on the host interface, calculating the IP checksum on the fly, thus minimizing the load on the system memory bus. The figure below compares the dataflow through a traditional interface (a) with a single-copy interface (b). Two network interfaces based on the CAB architecture were built by Network Systems Corporation: a Turbochannel interface for DEC workstations, and an interface for the iWarp distributed memory system.
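
As a point of reference, the following C fragment sketches in software what the CAB does in hardware: the Internet checksum is accumulated while the data is being moved, so each byte crosses the memory bus only once. The function name is illustrative; in the actual interface this work is performed by the CAB's DMA and checksum hardware, not by host code.

    #include <stddef.h>
    #include <stdint.h>

    /* Software sketch of "checksum on the fly": accumulate the 16-bit
     * one's-complement Internet checksum during the copy, so the data is
     * touched only once.  Assumes len is a multiple of 2 for brevity. */
    uint16_t copy_and_checksum(void *dst, const void *src, size_t len)
    {
        const uint16_t *s = src;
        uint16_t *d = dst;
        uint32_t sum = 0;

        for (size_t i = 0; i < len / 2; i++) {
            d[i] = s[i];           /* the single copy                  */
            sum += s[i];           /* checksum accumulated on the fly  */
        }
        while (sum >> 16)          /* fold carries back into 16 bits   */
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;     /* one's-complement result          */
    }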

Gigabit Nectar Workstation Interface

In the case of the workstation interface, the CAB transfers the data directly from the application's address space to the buffers on the network interface. This can be done both for applications that use the socket interface, which has copy semantics, and for applications that use more optimized interfaces based on shared buffers. The architecture of the workstation interface is shown below. The main components are outboard "network" memory, DMA engines, and support for IP checksumming.
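
To make this concrete, here is a hypothetical descriptor for one CAB DMA operation; the field names and layout are illustrative, not the actual hardware interface. The host points a DMA engine at an application buffer and at a region of the outboard network memory, and can ask for the IP checksum to be accumulated during the transfer.

    #include <stdint.h>

    /* Hypothetical descriptor for one CAB DMA transfer (illustrative
     * field names, not the real register layout). */
    struct cab_dma_desc {
        uint64_t host_addr;        /* application buffer on the host           */
        uint32_t netmem_offset;    /* destination in outboard network memory   */
        uint32_t length;           /* number of bytes to move                  */
        uint32_t checksum_enable;  /* accumulate the IP checksum on the fly    */
        uint32_t to_network;       /* direction: host memory -> network memory */
    };

In such a scheme the driver would queue one descriptor per contiguous piece of the application buffer.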

In order for the CAB features to pay off, it is necessary to modify the protocol stack on the host: the copy operations and checksum calculation that are performed in a traditional protocol stack have to be eliminated, since these functions are now performed by the CAB hardware. We modified the protocol stack in DEC OSF/1 for the Alpha 3K/400 workstation to support single-copy communication. The main idea is that instead of passing the data through the stack in kernel buffers, we pass descriptors of the data through the stack; the network device driver then implements the single copy using the DMA engines on the CAB. The following graphs show the impact of using a single-copy communication architecture.
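
The sketch below illustrates the descriptor-passing idea; the structure and function names (data_desc, cab_start_dma, cab_wait_for_dma) are made up for the example and do not correspond to the OSF/1 code. The stack hands the driver a list of records describing where the user data lives, and the driver programs one CAB DMA per record, with the checksum computed in hardware.

    #include <stddef.h>

    struct proc;   /* owning process; opaque in this sketch */

    /* Descriptor carried through the stack instead of the data itself. */
    struct data_desc {
        struct proc *owner;     /* process whose address space holds the data */
        void        *user_addr; /* application buffer, still in user space    */
        size_t       len;       /* number of bytes described                  */
    };

    /* Hypothetical CAB driver entry points. */
    void cab_start_dma(struct proc *p, void *uaddr, size_t len, int checksum);
    void cab_wait_for_dma(void);

    /* Driver-level send: one DMA per descriptor; the copy and the IP
     * checksum are both done by the CAB, not by the CPU. */
    void cab_driver_send(struct data_desc *d, int n_descs)
    {
        for (int i = 0; i < n_descs; i++)
            cab_start_dma(d[i].owner, d[i].user_addr, d[i].len, 1);
        cab_wait_for_dma();
    }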

The first graph (above) shows the application-level throughput as a function of the write size, measured with ttcp using both the original (unmodified) protocol stack and the single-copy (modified) stack. We see that the modified stack can send data at 170 Mbit/second, which is the maximum the device can sustain, as is shown by the "raw HIPPI" throughput numbers. Throughput with the unmodified stack is limited to about 110 Mbit/second. The next graph (below) explains the performance difference. It shows what fraction of the CPU is used to support the communication throughput shown in the first graph. With the unmodified stack, almost the entire CPU is used for communication (specifically data copying and checksumming), so performance is limited by the host. With the single-copy stack, only a quarter of the CPU is used, even though a higher throughput is achieved; i.e., by eliminating the data copy and checksumming, we improved the communication efficiency significantly.

The final graph shows the efficiency of the communication for different write sizes. We define the efficiency as the ratio of the throughput to the CPU utilization, i.e. it is the communication throughput the host could support given a sufficiently fast network and network interface. The results show that for large writes, the single-copy stack is about 5 times as efficient as the unmodified stack.
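
For example, using the approximate numbers quoted above (the exact values in the graphs differ slightly), the efficiency works out as

\[
  E = \frac{T}{U}, \qquad
  E_{\text{single-copy}} \approx \frac{170\ \text{Mbit/s}}{0.25} = 680\ \text{Mbit/s}, \qquad
  E_{\text{unmodified}} \approx \frac{110\ \text{Mbit/s}}{1.0} = 110\ \text{Mbit/s},
\]

which is in the same ballpark as the factor of roughly five measured for large writes.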

Gigabit Nectar iWarp Interface

Network I/O on a distributed-memory system such as iWarp is more complicated, since the data is distributed across the private memories of the compute nodes. Moreover, protocol processing cannot be parallelized easily. We use an approach in which the distributed-memory system creates large blocks of data that can be handled efficiently by the network interface. The network interface performs the protocol processing, using the CAB to perform the most expensive operations (data movement and checksumming) in hardware. The architecture of the network interface is shown below. The interface connects to the internal interconnect of iWarp through the links on the left, and the green blocks correspond to the CAB. We have observed rates of 450 Mbit/second from a simple iWarp application sending to a HIPPI framebuffer, and rates of 320 Mbit/second from iWarp to the Cray C90 in heterogeneous distributed computing applications (medical MRI image processing).
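
As a rough sketch of the block-building step (names and layout are hypothetical, and on iWarp the slices actually arrive over the internal interconnect rather than via a memory copy), the compute nodes' contributions are appended into one large contiguous block, which the interface then pushes through the protocol stack as a single transfer:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical staging buffer on the network interface: slices from
     * the compute nodes are appended until one large block is ready. */
    struct send_block {
        uint8_t *data;    /* contiguous block handed to the protocol code */
        size_t   filled;  /* bytes gathered so far                        */
        size_t   size;    /* capacity of the block                        */
    };

    /* Append one node's slice (assumed to fit); returns 1 when the block
     * is full and can be sent as one large transfer with CAB assistance. */
    int gather_slice(struct send_block *b, const void *slice, size_t len)
    {
        memcpy(b->data + b->filled, slice, len);
        b->filled += len;
        return b->filled == b->size;
    }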

The host interface effort and the networking results are described in more detail in our papers.