"Bitslice" DES key search

The DESCHALL clients with version numbers ending in "dk001", "dk002", and "dk003" use a bit-parallel key-search method some refer to as "bitslicing." I'll try to explain the general technique briefly, and talk a bit about my particular implementation of it.

1. What's bitslicing?

The basic idea is that if you have a CPU that operates on N-bit integers, you can treat it like N processors, each operating on single bits. So for a 32-bit CPU, you get 32 "virtual processors", and for a 64-bit CPU, you get 64. The parallel machine you're simulating is of the "SIMD" (single instruction, multiple data) model; that is, all the "processors" execute the same instruction at each step, but they operate on different data.
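For example, a single 64-bit AND instruction does the work of 64 independent one-bit AND gates, one per virtual processor. Here's a minimal sketch in C (using uint64_t for the 64-bit word; this is just an illustration, not client code):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Each bit position i of x and y belongs to "virtual processor" i. */
        uint64_t x = 0xF0F0F0F0F0F0F0F0ULL;
        uint64_t y = 0xFF00FF00FF00FF00ULL;

        /* One machine instruction's worth of work: 64 one-bit AND gates,
           all computed at once. */
        uint64_t z = x & y;

        /* Virtual processor 60, say, sees only bit 60 of each word. */
        int i = 60;
        printf("processor %d: %d AND %d = %d\n", i,
               (int)((x >> i) & 1), (int)((y >> i) & 1), (int)((z >> i) & 1));
        return 0;
    }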

Now, this certainly doesn't mean you can do everything 32 or 64 times as fast, because your virtual processors only work on a single bit at a time, so it takes more steps for them to accomplish most tasks. However, the calculations necessary to compute DES are well-suited to this kind of single-bit operation. In a bit-parallel (or "bitslice") DES keysearch approach, you test 32 or 64 different keys at once--one key per virtual processor. It turns out that you can check a block of 64 keys this way in substantially fewer steps than it would take to test the 64 keys one at a time, using a conventional approach.
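One way to picture the layout: keep one word per key bit, with bit j of each word belonging to candidate key j, so the 56-bit key (ignoring parity bits) becomes 56 words. Here's a rough sketch of loading a block of 64 keys that share their high 50 bits (my notation and names, not the client's actual data structures):

    #include <stdint.h>

    /* Bitslice key layout (a sketch, not the DESCHALL client's code):
       bit j of K[b] is bit b of candidate key j.  The 64 keys in a block
       share their high 50 bits and differ only in the low 6 bits.
       base_key is assumed to have its low 6 bits zero. */
    void load_key_block(uint64_t K[56], uint64_t base_key)
    {
        /* Shared high bits: each plane is all-zeros or all-ones. */
        for (int b = 6; b < 56; b++)
            K[b] = ((base_key >> b) & 1) ? ~(uint64_t)0 : 0;

        /* Bit b (b < 6) of key j is just bit b of j, so these six planes
           are constants that never change from block to block. */
        static const uint64_t low_plane[6] = {
            0xAAAAAAAAAAAAAAAAULL,  /* bit 0 of 0,1,2,...,63 */
            0xCCCCCCCCCCCCCCCCULL,  /* bit 1 */
            0xF0F0F0F0F0F0F0F0ULL,  /* bit 2 */
            0xFF00FF00FF00FF00ULL,  /* bit 3 */
            0xFFFF0000FFFF0000ULL,  /* bit 4 */
            0xFFFFFFFF00000000ULL   /* bit 5 */
        };
        for (int b = 0; b < 6; b++)
            K[b] = low_plane[b];
    }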

The main reason that DES is faster to run using the bit-parallel technique is that computing the DES function requires doing lots of different bit-level operations, such as permuting the individual bits of a 64-bit chunk of data, or applying an arbitrary function to six bits of one word and inserting the four-bit result into another word. These operations are difficult to do efficiently on general-purpose CPUs, but when you're operating on one bit at a time, they become simpler.
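For instance, a bit permutation that takes shifts, masks, and ORs in the conventional representation becomes a simple renaming of words in the bitslice representation, and the renaming can usually be absorbed at compile time. A sketch, using a made-up three-bit fragment of a permutation:

    #include <stdint.h>

    /* Conventional representation: moving each bit to its new position
       costs a shift, a mask, and an OR. */
    uint64_t permute_fragment_conventional(uint64_t in)
    {
        uint64_t out = 0;
        out |= ((in >>  5) & 1) << 12;
        out |= ((in >> 12) & 1) << 40;
        out |= ((in >> 40) & 1) <<  5;
        return out;
    }

    /* Bitslice representation: bit position b of the block lives in word b,
       so the same permutation is just a renaming of words.  If the renaming
       is done at compile time (by reading the right variable in the next
       computation), it costs nothing at all. */
    void permute_fragment_bitslice(uint64_t out[64], const uint64_t in[64])
    {
        out[12] = in[5];
        out[40] = in[12];
        out[5]  = in[40];
        /* ...and so on for the remaining positions. */
    }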

Eli Biham was the first person (to my knowledge) to describe a bit-parallel DES implementation. You can read his paper, "A Fast New DES Implementation in Software", in the proceedings of Fast Software Encryption 4 (1997). It's also available in PostScript from Biham's publications page.

2. My implementation

Here's a little information about the bitslice DESCHALL clients I've written.

2.1. Speed

My current bitslice DESCHALL clients get roughly the following keysearch rates:

  CPU                        word size   speed (Mkeys/sec)
  DEC Alpha 21164/333MHz     64 bits     5.32
  UltraSPARC I/167MHz        64 bits     2.43 (*)
  MIPS R10000/194MHz         64 bits     2.11
  DEC Alpha 21064A/233MHz    64 bits     1.81
  PowerPC 601/80MHz          32 bits     0.26

For comparison, a 200MHz Pentium running Rocke's carefully tuned conventional client checks about 0.94 Mkeys/sec.

2.2. Keys to performance

Aside from the bitslice technique itself, several important factors contribute to this performance:

Rocke Verser's DES key search method
All DESCHALL clients use a keysearch method Rocke devised, which combines clever caching and short-circuiting to reduce the number of DES "rounds" the client must compute to eliminate a key from consideration. The full DES algorithm includes 16 rounds. For conventional clients, Rocke's method requires about 9.8 rounds per key, on average. For bitslicing with 64-bit words, it requires about 10.5 rounds per key, since you can't skip ahead until all 64 keys in the current block have been eliminated.
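In bitslice form, the short-circuit amounts to maintaining a one-bit-per-key "still alive" mask. Here's a rough sketch of that idea (my own illustration with made-up names, not Rocke's actual method):

    #include <stdint.h>

    /* "alive" has one bit per candidate key; a key's bit is cleared as soon
       as one of its computed ciphertext bits disagrees with the target.
       Once alive == 0, the remaining rounds for this block can be skipped. */
    uint64_t eliminate(const uint64_t cipher_bits[64],  /* computed so far, bitsliced */
                       const uint64_t target_bits[64],  /* all-0 or all-1 planes of the target */
                       int nbits_known)                 /* how many bits are valid so far */
    {
        uint64_t alive = ~(uint64_t)0;        /* all 64 keys still possible */
        for (int b = 0; b < nbits_known; b++)
            alive &= ~(cipher_bits[b] ^ target_bits[b]);  /* keep keys that match bit b */
        return alive;                         /* zero => move on to the next 64 keys */
    }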

Fitting heavily-used code in the on-chip instruction cache
The size (number of instructions) of the "inner loop" in a bitslice DES keysearch program is significantly larger than it would be using the conventional method. If this code doesn't fit in the on-chip instruction cache (usually 8-16 Kbytes), the performance will suffer significantly. Because instruction caches are often "direct-mapped," it's best if the inner loop code is in a contiguous 8-16K region of memory. The inner loop in my implementation fits in 7.5K of contiguous memory on an Alpha, and the size is similar for other RISC CPUs.

Fitting heavily-used data in the on-chip data cache
Just as important as fitting the code in the i-cache is fitting the data in the d-cache. Bitslice DES requires 32-64 times as much storage as conventional DES (a 64-bit block occupies 8 bytes conventionally, but one full word per bit--64 words--in bitslice form). Furthermore, the caching Rocke's key search method uses requires over ten times as much storage as a naive key search. This means it takes a bit of care to get the data to fit nicely in a typical 8K-16K d-cache. In my implementation, the heavily-used data occupies about 11K of contiguous memory on 64-bit machines, and 5.5K on 32-bit machines. Fortunately, most 64-bit CPUs have at least 16K of on-chip data cache, and nearly all modern CPUs have at least 8K of on-chip data cache.

Optimized S-box circuits
The bulk of the work in bitslice DES code is in computing the eight "S-box" functions. These are functions that take six input bits and produce four output bits, and they have little regular structure. A conventional DES implementation typically computes the S-box functions by doing lookups in 64-entry tables. A bitslice implementation can't do this, because it has to compute the function 32 or 64 times in parallel, and the necessary lookup table would be enormous.

Instead, we design for each S-box a digital circuit that computes its output using simple gates like AND, OR, and XOR. The program will simulate the operation of this circuit, using the corresponding CPU instructions. The time it takes to simulate the circuit will be roughly proportional to the number of gates.
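For example, here's a made-up four-gate circuit (not a real DES S-box) simulated this way; each C operator below stands in for 64 copies of the corresponding gate:

    #include <stdint.h>

    /* Simulating a hypothetical 4-gate circuit on 64 inputs at once.
       a, b, c are bitsliced input bits; each instruction acts as 64 copies
       of the corresponding gate.  A real DES S-box circuit is the same idea
       with 6 inputs, 4 outputs, and on the order of 50-100 gates. */
    uint64_t toy_circuit(uint64_t a, uint64_t b, uint64_t c)
    {
        uint64_t t1 = a ^ b;        /* XOR gate  */
        uint64_t t2 = t1 & c;       /* AND gate  */
        uint64_t t3 = b | c;        /* OR gate   */
        return t2 ^ ~t3;            /* XOR + NOT */
    }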

In his paper, Biham describes a general method for constructing these circuits, which yields an average of approximately 100 gates per S-box after standard compiler optimizations are applied. This is essentially what I used in the "dk001" DESCHALL clients.

Matthew Kwan has developed smaller S-box circuits--about 51 gates per S-box. He's made source code available; since he developed it in Australia, it isn't subject to U.S. export restrictions, so anyone can look at it. I intend to incorporate his improved S-box circuits when I release my code.

I've produced S-box circuits with an average of 66.1 gates per S-box; these are used in the "dk002" DESCHALL clients. The methods I used to construct them were somewhat interesting in their own right; I may write about that later.

Rocke has built even smaller S-box circuits. These are used in the "dk003" clients (UltraSPARC only, so far).

Using RISC CPUs
Several features of RISC processors make them better suited to bitslice DES than CISC processors: they provide many general-purpose registers (typically 32, versus the x86's 8), many of them operate on 64-bit words, and their simple three-operand logical instructions map directly onto the gate operations the S-box circuits require.

Given these advantages, compilers for RISC machines can typically produce very good code from a bitslice program in C. With fewer registers available, or with conventional DES code, compilers have a much harder time competing with hand-generated assembly. I have tried building x86 bitslice code using a compiler, but it's at least 30% slower than Rocke's conventional client.

The MMX instruction set extensions offer some hope. They provide 64-bit operations, an extra ANDNOT instruction, and eight registers to work with, so an MMX bitslice client might work reasonably well. It will very likely require hand-tuned assembly, though, and we haven't tried it yet.
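For what it's worth, a single bitsliced gate under MMX might look something like the following sketch using the mmintrin.h intrinsics (again, this is just an illustration; we haven't built or measured such a client):

    #include <mmintrin.h>   /* MMX intrinsics; requires an MMX-capable x86 */

    /* One bitsliced gate: __m64 gives a 64-bit word, and PANDN supplies
       the AND-NOT operation that the ordinary x86 integer instructions lack. */
    static __m64 gate_example(__m64 a, __m64 b, __m64 c)
    {
        __m64 t = _mm_andnot_si64(a, b);   /* (~a) & b, one instruction */
        return _mm_xor_si64(t, c);         /* t ^ c */
    }
    /* Note: call _mm_empty() before doing floating-point work again,
       since MMX shares the FPU register file. */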

3. More information

To learn more about DES and cryptography in general, check out Bruce Schneier's Applied Cryptography.
Darrell Kindred
Last modified: Tue Aug 25 17:28:13 EDT 1998