Homework 5: Priority Queue and Huffman

Due by 11:59PM ET on Monday, October 21, 2024

2 Required Problems: Priority Queue and Huffman (100 points total)

Setup and Logistics

We have put together several stub files –– click here to download them.

4 files to submit: Huffman.java, HuffmanTest.java, BinaryMinHeapImpl.java, BinaryMinHeapImplTest.java
Please do not submit any additional files.

Be sure to follow these guidelines:

Ensure source files are not located in any package and that none of the files have a package declaration (e.g. package src;).
When submitting, submit each file individually at the root level (not contained in a folder or zip files).
Ensure that your own test files pass and do not run into infinite loops.
You have unlimited submissions until the deadline on the autograder marked [compilation], and two free submissions on the autograder marked [full].
For this assignment:
- Number of Free Submissions: 2
- Deduction Per Additional Submission: 10 points
- Max Code Coverage Deduction: up to 10 points
- Max Style Score Deduction: up to 5 points
- Manual TA Efficiency/Style Grading: yes
Do not wait until the last minute to try submitting. You are responsible for ensuring your submission goes through by the deadline.
Before asking questions on Piazza, be sure to check and keep updated with the pinned clarifications post.
Gradescope issues? Be sure to check out our Gradescope help page before posting to Piazza.

Motivation – Compression Algorithms

The goal of a compression algorithm is to take a sequence of bytes and transform it into a different sequence of fewer bytes, such that the original can be recovered. Because compression algorithms reduce the size of a file, they allow quicker transmission of files over the network, benefiting everyone on that link. Regarding bandwidth, please see this xkcd link.

Compression algorithms can be lossless (like ZIP) or lossy (like JPEG). Lossy compression is useful for images, music, and video because multimedia is an emergent property of its file formats; in other words, we don’t care if there are slight imperfections in a JPEG image. On the other hand, text-based files must be compressed without any data loss; what good is a source code file if it cannot be compiled because a few of its bits switched from 1 to 0?

In this assignment, you will implement a lossless compression algorithm, namely Huffman Coding, that is used as part of larger compression schemes, such as ZIP. In order to implement it, you’ll also be implementing a binary min heap which will be used as a priority queue.

NOTE: The BinaryMinHeap interface cannot be modified in any way, nor can the method headers in BinaryMinHeapImpl or Huffman.java. Failure to abide by this will prevent your code from compiling and will lose you credit on the assignment.

Part 1: Binary Min Heap

Files to submit: BinaryMinHeapImpl.java, BinaryMinHeapImplTest.java

In order to implement Huffman’s algorithm, you’ll need to implement a min-heap. We have provided an interface that you must implement, and we have also provided you with the corresponding class stub file. Please make sure you do not modify the BinaryMinHeap.java interface at all! Also, do not modify any of the method headers in the BinaryMinHeapImpl.java file. As always, you may add package private helper methods in the Impl file. Note, you do not need to worry about your BinaryMinHeapImpl.java working with other implementations.

Please make sure you read the interface thoroughly, as it has details about the required runtimes for all the methods. Not all of the implementation requirements are in this writeup.

Some notes regarding implementation:

An Entry binds Key and V together (e.g. in the Huffman algorithm, we want frequencies and ‘letters’ to be bound together). Key implements the Comparable interface, so that we can get a priority ordering on them. V can be any type.
We strongly recommend using ArrayLists rather than arrays to avoid resizing the array and dealing with casting issues. In either case, you should make an array/ArrayList of type Entry<Key, V> to store your elements.

You should implement min-heapify, but do not need to support an $O(n)$ build-heap operation. That is, all insertions into the heap will be done via the add method.
You must make sure your implementation has expected running time $O(\lg n)$ add, decreaseKey, and extractMin. Also, containsValue should have expected running time $O(1)$. In order to meet these runtime requirements, we recommend using a HashMap to map the values in the min heap to their index. Because of this map, the values in the heap must be unique. You can assume (will be proven in lecture) that containsKey/put/get/remove run in expected $O(1)$ time.
You should NOT implement your heap as a sorted array/ArrayList or linked list.

Part 2: Huffman coding

Files to submit: Huffman.java, HuffmanTest.java

The compression algorithm you’ll implement is known as Huffman coding. The idea behind Huffman coding, and an idea common to a lot of concepts and techniques in computer science, is that common activities should have a lower cost than rarer activities. For example, we encode ASCII characters with eight bits each, but we see the character e much more often than the character x, which we see much more often than the BEL character (ASCII code 7). Therefore, why not represent those more frequent characters with fewer bits and those rarer characters with more bits?

First, assume the existence of an alphabet, whose composing characters abide by some positive-valued probability mass function. In other words, each character in the alphabet has a probability of showing up, these probabilities are each greater than zero, and these probabilities sum to one. Your Huffman implementation will permit alphabet specification by either passing in a map from chars to ints or by providing a String from which the alphabet and probability mass function will be determined. Here, each int represents its corresponding char’s count, not its probability.

Why is this the case? We use ints instead of doubles or floats because of the inherent imprecision of the floating-point standard; it is impossible to accurately represent a number as simple as $\frac{1}{3}$. Instead, we determine the alphabet’s probability mass function by taking each Character’s corresponding count and dividing it by the sum of all Characters’ counts.

Here is a (simple) example alphabet, according to our specification:

a	b	c
1	2	2

Here, the character a shows up $\frac{1}{5}$ of the time, and the characters b and c each show up $\frac{2}{5}$ of the time.

Huffman coding uses a binary tree to encode character information, where every leaf represents a character in the alphabet and where each edge represents the state of a bit in the compression. Starting from the tree’s root, traveling to a node’s left child indicates appending a 0-valued bit to the compressed representation of a character, and traveling to a node’s right child indicates appending a 1-valued bit to the compressed representation of a character. Note that because of this, Huffman coding requires the alphabet to have no fewer than two characters.

The generation of the Huffman tree is the crux of the algorithm. It uses a priority queue, which is an abstract data type that keeps track of the smallest (i.e. highest-priority, hence the name) element in a collection of elements.

Using `BinaryMinHeap` as a priority queue:

Using your BinaryMinHeap as a priority queue with frequency as the key and a node as the value, one can implement Huffman’s algorithm as follows:

Create a lone leaf node for each character in the alphabet (i.e. you should have $n$ discrete leaf nodes). Add each node into a priority queue, and while the priority queue has more than one node in it, follow the following process:

Remove the two nodes of lowest frequency from the priority queue.
Create a new node. Assign its left child to be one of the two nodes removed in 1 and the right child to be the other (left vs right does not matter)
Assign the new node’s frequency to be the sum of its left child’s frequency and its right child’s frequency. This new node is an internal node and no longer corresponds to a character in the alphabet.
Add the new node to the priority queue.

When there is only one node left in the priority queue, that node is the root of our final Huffman tree.

Note: You may NOT use java.util.PriorityQueue as the priority queue for this part of the assignment. Instead, you should use your implementation of BinaryMinHeap.java

We have provided five method and constructor stubs for you. Do not modify the headers of these methods and constructors, only the bodies. Failure to abide by this will prevent your code from compiling, and you will lose credit on the assignment.

Of note regarding the implementation:

Make sure your add method is $O(\log n)$, so you can meet the tree generation runtime of $O(n\log n)$.
Decompression is straightforward; given a sequence of 0s and 1s, start from the root and go left or right as determined. When you get to a leaf, you’ve translated a character; record that, and go back to the root.
Compressing is less straightforward because you cannot use the 0s and 1s to find a character; you want to use the character to find the 0s and 1s. Any characters not in your alphabet are not compressible.
Both compress and decompress methods should return the empty string "" on an empty string input. That is, compressing an empty string is the same as not compressing anything!
Your compress and decompress methods should be as asymptotically efficient as possible. Part of this includes using the StringBuilder class to create your compressed and decompressed output instead of using String concatenation; the former can concatenate two Strings in $\Theta(1)$, whereas the latter can only concatenate two Strings in $\Theta(n)$. We will stress test your code against massive inputs to make sure you used StringBuilder.
The expectedEncodingLength method is nothing more than the expectation of a discrete random variable. We are using this method not to test your knowledge of basic probability, but instead to test the optimality of the encoding that your algorithm generates. You will need to compute the weighted sum of the lengths of the encodings of each character, where the weights are the probabilities of the characters.
You will need to use an inner class for the Huffman tree to represent nodes. You can make it package private so that you can test it. You are strongly encouraged to override the toString method to print a tree representation for ease of debugging.
We will compute a compression ratio that quantifies how well our compression performs. It should return the average compression ratio for all strings you have compressed over the lifetime of the object instance. We define compression ratio as the length of all outputs from compression over the length of all inputs to the compression. This length should be calculated in terms of “bits”. We recognize that our output, although a binary string, isn’t actually a sequence of bits. However, you can pretend that it is just some binary representation of the original input. You do not have to explicitly turn the input into binary, but treat the “length” as the number of bits. You can treat the input length as the number of characters times the size of a Java char, which is 16 bits. The ratio is maintained and modified as you continue compressing strings using the same Huffman instance, since it is an aggregate ratio. You should NOT be storing all the strings that were inputted to calculate the compression ratio, as this is highly inefficient.
Our implementation uses Strings of 0s and 1s to represent a stream of bits. This is convenient for a program written in Java, but it is not what would happen in the real world. If we were to implement this algorithm in production software, we would post-process each compressed answer by converting each character into its corresponding bit.

Visualizer

For this homework assignment, we are providing a visualizer to help you visualize Huffman trees and test the correctness of your Huffman implementation.

Eclipse Setup

First, you need to add the GraphJar.jar file to your build path. To do so, right click on your project in Eclipse → Build Path → Configure Build Path… → Click on Add External JARs... on the right pane → select GraphJar.jar → Open → Apply and Close.

You can then run the Huffman visualizer by right clicking on HuffmanVisualizer.java → Run as → Java Application

IntelliJ Setup

You should be able to run the visualizer by right-clicking on HuffmanVisualizer.java and running the main() method. If not, make sureGraphJar.jar is added as a library to your project (File → Project Structure → Libraries → Click on the plus sign in the right pane → Select the jar file).

In the topmost field, enter the alphabet seed, or the same string that the Huffman constructor takes in. You can then press Construct Tree to build the Huffman tree. Note: this operation uses the implementation of your Huffman constructor. You then should see your Huffman tree:

Note: For long strings, the resulting tree might have overlapping nodes. You can click and drag the nodes to move them around.

You can clear a tree and all of the variables by clicking the Clear Tree button.

In addition, the visualizer will display the Expected Encoding Length in the top right corner.

The visualizer supports compressing and decompressing strings. The visualizer may not act as intended with numerical characters, so please stick to non-numerical characters. You can do so by entering a string to either compress or decompress in the second row and clicking the compress or decompress buttons. The output will be displayed. In addition, the compression ratio will be displayed in the top right corner.

Style & Tests

The above parts together are worth a total of 100 points. Style and code coverage are graded on a subtractive basis with deductions as specified in the Setup and Logistics section of the writeup, and you will be graded according to the CIS 121 style guide. Gradescope will give you your grade on this section immediately upon submission to either autograder.

IMPORTANT: Please DO NOT use any external files (.txt, etc.) to write your JUnit tests for this assignment. Our code coverage tool will fail, and you will not be able to submit your code due to failing test cases.

You will need to write comprehensive unit test cases for each of the classes, inner classes, and methods you implement (including helper methods!) in order to ensure correctness. Make sure you consider edge cases and exceptional cases in addition to typical use cases. Use multiple methods instead of cramming a bunch of asserts into a single test method. Your test cases will be auto-graded for code coverage. Be sure to read the testing guide for some pointers on what level of testing we expect from you. Finally, due to the Code Coverage tools we use, please be sure to use JUnit 4 as it is the only version supported.

Note: You will not be able to write JUnit test cases for any private methods. Instead, make them package-private by leaving off any privacy modifier (i.e., do not write public or private).