15-212-ML Homework 5


The Magic of LZW

History Lesson

LZW is a simple, adaptive, dictionary-based compression algorithm named after its inventors, Lempel, Ziv, and Welch. Today it is used in several common applications, including GIF images and the Unix compress program.

Data Compression 15-999

The general concept behind most data compression programs is:

LZW focuses on the first compression goal. It maintains a dictionary which maps from integers (called "codes") to the strings they represent; initially this dictionary has 256 entries (numbered 0..255), where the integer n maps to the one-character string consisting of the character with code n.
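
As a rough illustration of that initial state, the 256 starting entries could be built as below. This is only a sketch: it assumes a plain association list in place of the dictionary type from dict.sml (whose interface is not shown here), and the name initialDict is hypothetical.

```sml
(* Hypothetical name: initialDict. An association list stands in for
   the dictionary type from dict.sml, whose interface is not shown here. *)
val initialDict : (int * string) list =
  List.tabulate (256, fn n => (n, str (chr n)))

(* Entry 97 is the pair (97, "a"), since chr 97 is #"a". *)
val a = #2 (List.nth (initialDict, 97))
```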

LZW's Adaptive Dictionary

However, instead of sending a dictionary ahead of time, as the zebra example does above, LZW infers the dictionary from the stream that it is compressing or decompressing.

How does it do this? The compressor translates a stream of characters to a stream of integers (codes), maintaining a dictionary D. We use D(a) to denote the numerical index for the string a in the dictionary, and write ab for the concatenation of a and b (where each is either a character or a string). In addition to the dictionary, the compressor maintains a string called w and a character K. Initially, the first character of the data stream is placed into w, and the second character into K.

The LZW algorithm

The compression loop proceeds as follows:

For this assignment we will limit the size of the dictionary to 2^16 entries; if the dictionary fills up, your code should simply not make any more entries into the dictionary.

This process will yield as its output a sequence of codes, representing the compressed data stream.
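
A minimal sketch of such a loop follows, assuming the textbook LZW formulation: keep extending w with K while wK is in the dictionary; otherwise output D(w), add wK under the next free code, and restart with w = K. It works over lists rather than the assignment's streams, uses an association list instead of dict.sml, and the names lookup and compressList are hypothetical. It also assumes every input character is already in the initial dictionary.

```sml
(* Hypothetical helper: look a string up in an association list. *)
fun lookup (_, []) = NONE
  | lookup (s : string, (k, v) :: rest) =
      if s = k then SOME v else lookup (s, rest)

(* maxEntries caps the dictionary; once it is full, nothing more is added.
   Assumes each input character is a key of the initial dictionary. *)
fun compressList (initial : (string * int) list, maxEntries : int)
                 (input : char list) : int list =
  let
    fun code (dict, w) = valOf (lookup (w, dict))
    (* End of input: emit the code for whatever is left in w. *)
    fun loop (dict, _, w, []) = [code (dict, w)]
      | loop (dict, next, w, k :: rest) =
          let val wk = w ^ str k
          in
            case lookup (wk, dict) of
                SOME _ => loop (dict, next, wk, rest)
              | NONE =>
                  let
                    val (dict', next') =
                      if next < maxEntries
                      then ((wk, next) :: dict, next + 1)
                      else (dict, next)
                  in
                    code (dict, w) :: loop (dict', next', str k, rest)
                  end
          end
  in
    case input of
        [] => []
      | c :: rest => loop (initial, length initial, str c, rest)
  end

(* Three-character alphabet with a=0, b=1, c=2. *)
val abc = [("a", 0), ("b", 1), ("c", 2)]
val codes = compressList (abc, 65536) (explode "aabababaaa")
(* codes = [0,0,1,4,6,3] *)
```

Because the dictionary is threaded through the recursion as an argument, nothing here is mutated; a stream-based version would follow the same shape.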

Practice

Try these examples by hand. It is important that you completely master the process of LZW compression by hand before you write your code. All examples are over a 3-character alphabet, with "a"=0, "b"=1, and "c"=2 initially in the dictionary.

Uncompressed       Compressed
aabababaaa         0,0,1,4,6,3
abcabbbbbcab       0,1,2,3,1,7,4,3
cabcabcababcabc    2,0,1,3,5,4,4,6,2

Decompression

Decompression is just a bit more difficult. At the beginning of the decompression, initialize the dictionary just as you did for compression.

If B is a valid index into the dictionary:

If B is not a valid index into the dictionary (the "KwKwK case"):

Note: the description of the algorithm is imperative in nature in order to make it easy to understand. However, the implementation of this section should not use any mutable data structures.
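
For comparison, here is a sketch of decompression under the same textbook assumptions, again over lists and with an immutable association list. When code B is in the dictionary, its string is emitted and w followed by that string's first character is added; when B is exactly the next unused code (the KwKwK case), the string must be w followed by the first character of w. The names find and decompressList are hypothetical, and the built-in Fail exception stands in for the assignment's Error.

```sml
(* Hypothetical helper: look a code up in an association list. *)
fun find (_, []) = NONE
  | find (i : int, (k, s) :: rest) = if i = k then SOME s else find (i, rest)

fun decompressList (initial : (int * string) list, maxEntries : int)
                   (codes : int list) : string =
  let
    fun loop (_, _, _, []) = []
      | loop (dict, next, w, b :: rest) =
          let
            val entry =
              case find (b, dict) of
                  SOME s => s
                | NONE =>
                    (* KwKwK case: b is the entry about to be created. *)
                    if b = next andalso next < maxEntries
                    then w ^ str (String.sub (w, 0))
                    else raise Fail "invalid code"
            val new = w ^ str (String.sub (entry, 0))
            val (dict', next') =
              if next < maxEntries
              then ((next, new) :: dict, next + 1)
              else (dict, next)
          in
            entry :: loop (dict', next', entry, rest)
          end
  in
    case codes of
        [] => ""
      | first :: rest =>
          (case find (first, initial) of
               SOME s => concat (s :: loop (initial, length initial, s, rest))
             | NONE => raise Fail "invalid code")
  end

(* Three-character alphabet with 0=a, 1=b, 2=c. *)
val abcCodes = [(0, "a"), (1, "b"), (2, "c")]
val text = decompressList (abcCodes, 65536) [0, 0, 1, 4, 6, 3]
(* text = "aabababaaa"; code 6 is reached through the KwKwK case *)
```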

Variable-length encoding

The final step is to use variable-length encoding to output the stream of integers we have produced. At any point in the stream, the compressor and decompressor both know the number of entries in the dictionary. The number of bits required to represent the largest index is then ceil(log2(maxindex + 1)). To conserve space, when we output an index, we only output that many bits. Note: it is important to realize that the number of bits output is determined by the largest index in the dictionary, not the index being output. This feature will be for extra credit.
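
The bit count can be computed with integer arithmetic alone. The sketch below assumes the width must be large enough to write every index from 0 up to the largest index currently in the dictionary; the name bitsFor is hypothetical.

```sml
(* bitsFor m: smallest number of bits (at least 1) that can represent
   every index in 0..m. Hypothetical name, integer arithmetic only. *)
fun bitsFor (maxIndex : int) : int =
  let
    (* limit is always 2^bits; stop once maxIndex fits below it. *)
    fun go (bits, limit) =
      if maxIndex < limit then bits else go (bits + 1, limit * 2)
  in
    go (1, 2)
  end

val w255 = bitsFor 255     (* 8: indices 0..255 fit in one byte *)
val w256 = bitsFor 256     (* 9: the first added entry needs a ninth bit *)
val w65535 = bitsFor 65535 (* 16: the full 2^16-entry dictionary *)
```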

Problem 0: Setup (0/85)

We provide libraries for streams (stream.sml), stream-based file I/O (stream-io.sml), dictionaries (dict.sml), and bit vectors (bit-vector.sml). You should include use statements for them at the beginning of your code:
    use "/afs/andrew/scs/cs/15-212-ML/assignments/ass5/stream.sml";
    use "/afs/andrew/scs/cs/15-212-ML/assignments/ass5/stream-io.sml";
    use "/afs/andrew/scs/cs/15-212-ML/assignments/ass5/bit-vector.sml"; (* for extra credit *)
    use "/afs/andrew/scs/cs/15-212-ML/assignments/ass5/dict.sml";
  

Problem 1: Stream Transducers (15/85)

Write a structure Transducer conforming to this signature (in file transducer.sml):
    signature TRANSDUCER =
    sig
	type byte = Word8.word
	type 'a stream                (* Use STREAM, not BASIC_STREAM *)
	exception OverFlow
	exception Error of string

	val byteStreamToCharStream    : byte stream -> char stream      

	val intStreamToByteStream     : int stream -> byte stream
	(* Raises exception OverFlow if an int is >= 2^16   *)
	(* Outputs each 16 bit int as two 8bit bytes, least *)
	(* significant byte first (little-endian order)     *)

	val byteStreamToIntStream     : byte stream -> int stream
	(* reverse of intStreamToByteStream                 *)
	(* Raises exception Error with a description if     *)
	(* there are an odd number of bytes in the stream   *)

	val charStreamToByteStream    : char stream -> byte stream      
    end;
   
Implement the byteStreamToCharStream, byteStreamToIntStream, intStreamToByteStream, and charStreamToByteStream functions which convert the elements of their input streams into a different format, yielding an output stream. Style warning! There is an extremely concise way to write two of these functions, and we will deduct style points if you fail to recognize it. Think: what do these functions have in common?
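
To make the 16-bit layout concrete, here is a sketch of the per-element arithmetic only, with no streams involved. It assumes the least-significant-byte-first order stated in the signature comments. The names toBytes and fromBytes are hypothetical, and the Basis Library's built-in Overflow exception stands in for the signature's OverFlow.

```sml
(* Split a code in 0..65535 into two bytes, least significant first.
   Uses the built-in Overflow, not the signature's OverFlow. *)
fun toBytes (n : int) : Word8.word * Word8.word =
  if n < 0 orelse n > 65535 then raise Overflow
  else (Word8.fromInt (n mod 256), Word8.fromInt (n div 256))

(* Inverse of toBytes: rebuild the code from (low byte, high byte). *)
fun fromBytes (lo : Word8.word, hi : Word8.word) : int =
  Word8.toInt lo + 256 * Word8.toInt hi

val (lo, hi) = toBytes 258        (* 258 = 1 * 256 + 2, so lo = 2, hi = 1 *)
val back = fromBytes (lo, hi)     (* 258 *)
```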

Problem 2: De/Compression (60/85)

For this section you will be implementing an LZW compressor structure Compression that meets this signature (which can be found in file compress.sml):
    signature COMPRESSION =
    sig
	type 'a stream                (* Use STREAM, not BASIC_STREAM *)
	structure intDict    : DICT where type key = int
	structure stringDict : DICT where type key = string

        exception Error of string

	val compress                  : char stream -> int stream
	val compressAndShowDict       : char stream -> (int * int stringDict.dict) stream

        (* may raise Error on invalid input *)
	val decompress                : int stream -> char stream
	val decompressAndShowDict     : int stream -> (char * string intDict.dict) stream
    end;
   
Note that you do not have to implement variable-length code words for this problem.

Problem 2.1: Compression (30/85)

Implement Compression.compress and Compression.compressAndShowDict. The latter function should yield a stream of not just the compression codes but also the state of the dictionary after each new output code is written. This function is required for full credit and is essential in order for us to grant you partial credit on section 2.1; it allows us to watch you perform the compression step-by-step so we can determine what you did right and give you partial credit for it.

Problem 2.2: Decompression (30/85)

Implement Compression.decompress and Compression.decompressAndShowDict. The latter function should yield a stream of not just the decompressed characters but also the state of the dictionary after each new output character is written. This function is required for full credit and is essential in order for us to grant you partial credit on section 2.2; it allows us to watch you perform the decompression step-by-step so we can determine what you did right and give you partial credit for it.

Problem 3: De/Compressing Files (10/85)

Write a structure Lzw conforming to the following signature (which can be found in file lzw.sml):
     signature LZW = 
     sig
       exception Error of string
       (* takes a file and writes the compressed version to file.mlZ *)
       val compress : string -> unit

       (* decompresses file.mlZ into file and raises Error("not an mlZ file")
	  if parameter does not end in ".mlZ" *)
       val decompress : string -> unit

       (* reads the file in and runs it through both the compressor and
	  decompressor, then writes it back to file.muZ. Hopefully the
	  original and munged files will be identical (if done right) *)
       val munge : string -> unit
     end;
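
The file name check described for decompress could look like the sketch below. The name originalName is hypothetical, and the built-in Fail exception stands in for the signature's Error; treating a bare ".mlZ" with nothing before it as invalid is also an assumption, since the signature does not say.

```sml
(* Strip a trailing ".mlZ" to recover the output file name.
   Hypothetical name; Fail stands in for the signature's Error. *)
fun originalName (name : string) : string =
  if String.isSuffix ".mlZ" name andalso size name > 4
  then String.substring (name, 0, size name - 4)
  else raise Fail "not an mlZ file"

val restored = originalName "notes.txt.mlZ"   (* "notes.txt" *)
```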
    

Problem 4: Variable length code words (+15 points ec)

One problem with the compression system described above is that 16 bits are used for each code word all the time, even if we don't need 16 bits to distinguish that code from all other valid codes that might be seen at a particular point.

For extra credit, write structures VarCompress and VarLzw conforming to COMPRESSION and LZW respectively, which implement compression and decompression using variable-length code words. You should start with code words of length 9 and go up to a maximum of 16 bits. You may use the BitVector implementation provided.