ARTS 第 1 周 - Huffman 编码

Algorithm

因为很久没做算法题了,刚开始选择了 腾讯精选练习 50 题 这个列表中较为简单的一题:反转链表,一个简单的解法是:在遍历的过程中记录当前节点的前后节点,然后进行指针(引用)的变更即可,我的题解代码如下:

/**
* Definition for singly-linked list.
* public class ListNode {
* int val;
* ListNode next;
* ListNode(int x) { val = x; }
* }
*/
class Solution {
public ListNode reverseList(ListNode head) {
ListNode prev = null;
while (head != null) {
ListNode next = head.next;
head.next = prev;
prev = head;
head = next;
}
return prev;
}
}

Review

这周读了 Joel Spolsky(1965-)的一篇博客文章《The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets》。Joel 的个人博客叫做 Joel on Software,在世界范围内都有很高的知名度和影响力,他也是 Stack Overflow 网站的联合创始人,目前在自己创办的 Fog Creek 软件公司任 CEO,曾经出版过一本关于软件技术、人才、创业和企业管理的随想文集《More Joel on Software》,被阮一峰翻译成了中文,中文书名是《软件随想录》。

回到这篇博文,他在这篇文章中回溯了字符集的历史,从 ASCII、OEM 到 DBCS 由于没有统一的标准一度导致混乱,直到 Unicode 一统天下。但是 Unicode 只是一个字符集,它只规定了符号的二进制代码,却没有规定这个二进制代码应该如何存储,又出现了多种不同的存储方式,导致 Unicode 在很长一段时间内无法推广,直到 UTF-8 出现,具体也可以看阮一峰的博客文章《字符编码笔记:ASCII、Unicode 和 UTF-8》。我通过 Joel 和阮一峰的文章了解了字符集和编码的历史,更重要得到一些启发:要想搞明白一个东西,可能需要了解它的历史,看它是基于什么样的历史背景出现的,为了解决什么问题 ?

值得一提的是 Java 中的 Char 类型只有两个字节,在内存中使用 UTF-16 编码,虽然 UTF-8 编码的「你」占三个字节,但是在 UTF-16 编码下用两个字节是可以装得下的:

String s = "你";
// 输出: 你
System.out.println(s.charAt(0));
// 输出: 1
System.out.println(s.toCharArray().length);
// 输出: 3
System.out.println(s.getBytes(StandardCharsets.UTF_8).length);

Tip

Maven 依赖冲突是工作中经常遇到的问题,我用过两种解决方案:一是通过 dependencyManagement 强制指定一个版本,另外是通过 Maven Shade Plugin 的 relocation 功能(注意使用这个插件后会产生多个 JAR 包,relocate 只会在被 shade 的包中生效)。

Share

最近工作中用到了一些压缩文件的知识,我大概查了下文件压缩的原理,看到很多压缩算法中都有 Huffman 编码的影子,于是想要自己实现一个可用的 Huffman 编解码程序。我参考的是《算法(第 4 版)》这本书中关于这个算法的讲解以及附带的 Java 代码,因为需要按位读写,还实现了 BitInputStream 和 BitOutputStream,调试了挺久才通过。然后我从网上下载了《The Great Gatsby》小说的英文文本文件用来测试,压缩前是 268644 字节,压缩后是 152374 字节,压缩比大概是 56.7%。

Huffman 算法是典型的贪心算法,其贪心选择的最优性可以通过数学归纳法证明,但是《算法(第 4 版)》中的证明省略了一些步骤,让人看起来不是很明白,建议结合《算法导论(原书第 3 版)》中的证明一起观看,后者在使用归纳法证明之前有两个引理的证明,有了这两个引理再看最后的证明就顺畅多了。

Huffman 编码是 David A. Huffman(1925-1999)在 MIT 读博士的时候提出的,1952 年他在论文《A Method for the Construction of Minimum-Redundancy Codes》中阐述了这个算法。这个算法的思想比较直观,证明也不复杂,实现起来也不难,总的来说十分优雅,我很喜欢。

// Huffman.java
public class Huffman {
private static final int ALPHABET_SIZE = 256;
private final BitInputStream in;
private final BitOutputStream out;

public Huffman(BitInputStream in, BitOutputStream out) {
this.in = in;
this.out = out;
}

private static class Node {
private final char ch;
private final int freq;
private final Node left, right;

public Node(char ch, int freq, Node left, Node right) {
this.ch = ch;
this.freq = freq;
this.left = left;
this.right = right;
}

private boolean isLeaf() {
return (left == null) && (right == null);
}
}

public void compress() throws IOException {
// read the input
String s = in.readString();
char[] input = s.toCharArray();

// tabulate frequency counts
int[] freq = new int[ALPHABET_SIZE];
for (char c : input) freq[c]++;

// build Huffman trie
Node root = buildTrie(freq);

// build code table
String[] table = new String[ALPHABET_SIZE];
buildCode(table, root, "");

// print trie for decoder
writeTrie(root);

// print number of bytes in original uncompressed message
out.writeInt(input.length);

// use Huffman code to encode input
for (char c : input) {
String code = table[c];
for (char cc : code.toCharArray()) {
out.writeBit(cc != '0');
}
}
}

private static Node buildTrie(int[] freq) {
PriorityQueue<Node> queue = new PriorityQueue<>(Comparator.comparingInt(n -> n.freq));
for (char c = 0; c < ALPHABET_SIZE; c++) {
if (freq[c] > 0) {
queue.offer(new Node(c, freq[c], null, null));
}
}
while (queue.size() > 1) {
Node left = queue.poll();
Node right = queue.poll();
queue.offer(new Node('\0', left.freq + right.freq, left, right));
}
return queue.poll();
}

private static void buildCode(String[] table, Node node, String code) {
if (!node.isLeaf()) {
buildCode(table, node.left, code + '0');
buildCode(table, node.right, code + '1');
} else {
table[node.ch] = code;
}
}

private void writeTrie(Node node) throws IOException {
if (node.isLeaf()) {
out.writeBit(true);
out.writeChar(node.ch);
return;
}
out.writeBit(false);
writeTrie(node.left);
writeTrie(node.right);
}

public void decompress() throws IOException {
// read in Huffman trie from input stream
Node root = readTrie();

// number of bytes to write
int length = in.readInt();

// decode using the Huffman trie
for (int i = 0; i < length; i++) {
Node node = root;
while (!node.isLeaf()) {
boolean bit = in.readBit();
if (bit) node = node.right;
else node = node.left;
}
out.writeChar(node.ch);
}
}

private Node readTrie() throws IOException {
boolean isLeaf = in.readBit();
if (isLeaf) {
return new Node(in.readChar(), -1, null, null);
} else {
return new Node('\0', -1, readTrie(), readTrie());
}
}

public static void main(String[] args) throws Exception {
File file = new File("/Users/wind/original.txt");
File compressed = new File("/Users/wind/compressed.bin");
File decompressed = new File("/Users/wind/decompressed.txt");

try (
BitInputStream in = new BitInputStream(new FileInputStream(file));
BitOutputStream out = new BitOutputStream(new FileOutputStream(compressed))
) {
Huffman huffman = new Huffman(in, out);
huffman.compress();
}

try (
BitInputStream in = new BitInputStream(new FileInputStream(compressed));
BitOutputStream out = new BitOutputStream(new FileOutputStream(decompressed))
) {
Huffman huffman = new Huffman(in, out);
huffman.decompress();
}
}
}
// BitInputStream.java
public class BitInputStream extends FilterInputStream {
private int buffer;
private int pos;

public BitInputStream(InputStream in) {
super(in);
}

public boolean readBit() throws IOException {
if (pos == 0) {
buffer = in.read();
pos = 8;
}
pos--;
return ((buffer >> pos) & 1) == 1;
}

public char readChar() throws IOException {
int c = 0;
for (int i = 0; i < 8; i++) {
boolean bit = readBit();
c <<= 1;
c |= (bit ? 1 : 0);
}
return (char) (c & 0xff);
}

public int readInt() throws IOException {
int x = 0;
for (int i = 0; i < 4; i++) {
x <<= 8;
x |= readChar();
}
return x;
}

public String readString() throws IOException {
StringBuilder sb = new StringBuilder();
char c = readChar();
while (buffer != -1) {
sb.append(c);
c = readChar();
}
return sb.toString();
}
}
// BitOutputStream.java
public class BitOutputStream extends FilterOutputStream {
private int buffer;
private int pos;

public BitOutputStream(OutputStream out) {
super(out);
}

@Override
public void flush() throws IOException {
clearBuffer();
super.flush();
}

public void writeBit(boolean bit) throws IOException {
buffer <<= 1;
buffer |= (bit ? 1 : 0);
pos++;
if (pos == 8) clearBuffer();
}

public void writeChar(int x) throws IOException {
for (int i = 0; i < 8; i++) {
boolean bit = ((x >>> (8 - i - 1)) & 1) == 1;
writeBit(bit);
}
}

public void writeInt(int x) throws IOException {
writeChar((x >>> 24) & 0xff);
writeChar((x >>> 16) & 0xff);
writeChar((x >>> 8) & 0xff);
writeChar(x & 0xff);
}

private void clearBuffer() throws IOException {
if (pos == 0) return;
if (pos > 0) buffer <<= (8 - pos);
out.write(buffer);
buffer = 0;
pos = 0;
}
}
0%