Skip to main content

Homework 5: Priority Queue and Huffman

Due by 11:59PM ET on Monday, October 21, 2024

2 Required Problems: Priority Queue and Huffman (100 points total)

Setup and Logistics

We have put together several stub files –– click here to download them.

4 files to submit: Huffman.java, HuffmanTest.java, BinaryMinHeapImpl.java, BinaryMinHeapImplTest.java
Please do not submit any additional files.

Be sure to follow these guidelines:

Motivation – Compression Algorithms

The goal of a compression algorithm is to take a sequence of bytes and transform it into a different sequence of fewer bytes, such that the original can be recovered. Because compression algorithms reduce the size of a file, they allow quicker transmission of files over the network, benefiting everyone on that link. Regarding bandwidth, please see this xkcd link.

Compression algorithms can be lossless (like ZIP) or lossy (like JPEG). Lossy compression is useful for images, music, and video because multimedia is an emergent property of its file formats; in other words, we don’t care if there are slight imperfections in a JPEG image. On the other hand, text-based files must be compressed without any data loss; what good is a source code file if it cannot be compiled because a few of its bits switched from 1 to 0?

In this assignment, you will implement a lossless compression algorithm, namely Huffman Coding, that is used as part of larger compression schemes, such as ZIP. In order to implement it, you’ll also be implementing a binary min heap which will be used as a priority queue.

NOTE: The BinaryMinHeap interface cannot be modified in any way, nor can the method headers in BinaryMinHeapImpl or Huffman.java. Failure to abide by this will prevent your code from compiling and will lose you credit on the assignment.

Part 1: Binary Min Heap

Files to submit: BinaryMinHeapImpl.java, BinaryMinHeapImplTest.java

In order to implement Huffman’s algorithm, you’ll need to implement a min-heap. We have provided an interface that you must implement, and we have also provided you with the corresponding class stub file. Please make sure you do not modify the BinaryMinHeap.java interface at all! Also, do not modify any of the method headers in the BinaryMinHeapImpl.java file. As always, you may add package private helper methods in the Impl file. Note, you do not need to worry about your BinaryMinHeapImpl.java working with other implementations.

Please make sure you read the interface thoroughly, as it has details about the required runtimes for all the methods. Not all of the implementation requirements are in this writeup.

Some notes regarding implementation:

Part 2: Huffman coding

Files to submit: Huffman.java, HuffmanTest.java

The compression algorithm you’ll implement is known as Huffman coding. The idea behind Huffman coding, and an idea common to a lot of concepts and techniques in computer science, is that common activities should have a lower cost than rarer activities. For example, we encode ASCII characters with eight bits each, but we see the character e much more often than the character x, which we see much more often than the BEL character (ASCII code 7). Therefore, why not represent those more frequent characters with fewer bits and those rarer characters with more bits?

First, assume the existence of an alphabet, whose composing characters abide by some positive-valued probability mass function. In other words, each character in the alphabet has a probability of showing up, these probabilities are each greater than zero, and these probabilities sum to one. Your Huffman implementation will permit alphabet specification by either passing in a map from chars to ints or by providing a String from which the alphabet and probability mass function will be determined. Here, each int represents its corresponding char’s count, not its probability.

Why is this the case? We use ints instead of doubles or floats because of the inherent imprecision of the floating-point standard; it is impossible to accurately represent a number as simple as $\frac{1}{3}$. Instead, we determine the alphabet’s probability mass function by taking each Character’s corresponding count and dividing it by the sum of all Characters’ counts.

Here is a (simple) example alphabet, according to our specification:

  a     b     c  
1 2 2

Here, the character a shows up $\frac{1}{5}$ of the time, and the characters b and c each show up $\frac{2}{5}$ of the time.

Huffman coding uses a binary tree to encode character information, where every leaf represents a character in the alphabet and where each edge represents the state of a bit in the compression. Starting from the tree’s root, traveling to a node’s left child indicates appending a 0-valued bit to the compressed representation of a character, and traveling to a node’s right child indicates appending a 1-valued bit to the compressed representation of a character. Note that because of this, Huffman coding requires the alphabet to have no fewer than two characters.

The generation of the Huffman tree is the crux of the algorithm. It uses a priority queue, which is an abstract data type that keeps track of the smallest (i.e. highest-priority, hence the name) element in a collection of elements.

Using BinaryMinHeap as a priority queue:

Using your BinaryMinHeap as a priority queue with frequency as the key and a node as the value, one can implement Huffman’s algorithm as follows:

Create a lone leaf node for each character in the alphabet (i.e. you should have $n$ discrete leaf nodes). Add each node into a priority queue, and while the priority queue has more than one node in it, follow the following process:

  1. Remove the two nodes of lowest frequency from the priority queue.

  2. Create a new node. Assign its left child to be one of the two nodes removed in 1 and the right child to be the other (left vs right does not matter)

  3. Assign the new node’s frequency to be the sum of its left child’s frequency and its right child’s frequency. This new node is an internal node and no longer corresponds to a character in the alphabet.

  4. Add the new node to the priority queue.

When there is only one node left in the priority queue, that node is the root of our final Huffman tree.

Note: You may NOT use java.util.PriorityQueue as the priority queue for this part of the assignment. Instead, you should use your implementation of BinaryMinHeap.java

We have provided five method and constructor stubs for you. Do not modify the headers of these methods and constructors, only the bodies. Failure to abide by this will prevent your code from compiling, and you will lose credit on the assignment.

Of note regarding the implementation:

Visualizer

For this homework assignment, we are providing a visualizer to help you visualize Huffman trees and test the correctness of your Huffman implementation.

Eclipse Setup

First, you need to add the GraphJar.jar file to your build path. To do so, right click on your project in Eclipse → Build Path → Configure Build Path… → Click on Add External JARs... on the right pane → select GraphJar.jarOpenApply and Close.

You can then run the Huffman visualizer by right clicking on HuffmanVisualizer.java → Run as → Java Application

IntelliJ Setup

You should be able to run the visualizer by right-clicking on HuffmanVisualizer.java and running the main() method. If not, make sureGraphJar.jar is added as a library to your project (File → Project Structure → Libraries → Click on the plus sign in the right pane → Select the jar file).

In the topmost field, enter the alphabet seed, or the same string that the Huffman constructor takes in. You can then press Construct Tree to build the Huffman tree. Note: this operation uses the implementation of your Huffman constructor. You then should see your Huffman tree:

insert1

Note: For long strings, the resulting tree might have overlapping nodes. You can click and drag the nodes to move them around.

You can clear a tree and all of the variables by clicking the Clear Tree button.

In addition, the visualizer will display the Expected Encoding Length in the top right corner.

The visualizer supports compressing and decompressing strings. The visualizer may not act as intended with numerical characters, so please stick to non-numerical characters. You can do so by entering a string to either compress or decompress in the second row and clicking the compress or decompress buttons. The output will be displayed. In addition, the compression ratio will be displayed in the top right corner.

insert1

Style & Tests

The above parts together are worth a total of 100 points. Style and code coverage are graded on a subtractive basis with deductions as specified in the Setup and Logistics section of the writeup, and you will be graded according to the CIS 121 style guide. Gradescope will give you your grade on this section immediately upon submission to either autograder.

IMPORTANT: Please DO NOT use any external files (.txt, etc.) to write your JUnit tests for this assignment. Our code coverage tool will fail, and you will not be able to submit your code due to failing test cases.

You will need to write comprehensive unit test cases for each of the classes, inner classes, and methods you implement (including helper methods!) in order to ensure correctness. Make sure you consider edge cases and exceptional cases in addition to typical use cases. Use multiple methods instead of cramming a bunch of asserts into a single test method. Your test cases will be auto-graded for code coverage. Be sure to read the testing guide for some pointers on what level of testing we expect from you. Finally, due to the Code Coverage tools we use, please be sure to use JUnit 4 as it is the only version supported.

Note: You will not be able to write JUnit test cases for any private methods. Instead, make them package-private by leaving off any privacy modifier (i.e., do not write public or private).