An Ode to bzip
Summary
A contrarian but well-argued case for why bzip is actually the best compression algorithm for code, despite conventional wisdom favoring zstd, xz, or brotli.
Compression Results (327 KB of Lua Source)
| Algorithm | Compressed Size |
|---|---|
| bzip3 | 61,067 bytes (best!) |
| bzip2 -9 | 63,727 bytes |
| lzip -9 | 67,651 bytes |
| brotli -Z | 67,859 bytes |
| xz -9 | 67,940 bytes |
| zstd -22 | 69,018 bytes |
| zopfli | 75,882 bytes |
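A comparison of this kind can be approximated with Python's standard library. The sketch below is illustrative only and is not the procedure used to produce the table: it covers just the three codecs that ship with Python (`bz2`, `lzma` for xz, `zlib` for DEFLATE), so bzip3, lzip, brotli, zstd, and zopfli are not represented, and the sample input is a hypothetical stand-in for a real source file.

```python
import bz2
import lzma
import zlib


def compare_sizes(data: bytes) -> dict[str, int]:
    """Return the compressed size in bytes for each stdlib codec at its top level."""
    return {
        "bz2": len(bz2.compress(data, compresslevel=9)),   # BWT-based
        "xz": len(lzma.compress(data, preset=9)),          # LZ77-family (LZMA)
        "zlib": len(zlib.compress(data, 9)),               # LZ77 + Huffman (DEFLATE)
    }


if __name__ == "__main__":
    # Hypothetical stand-in input; replace with the bytes of a real source file,
    # e.g. open("program.lua", "rb").read().
    sample = b"local function add(a, b)\n    return a + b\nend\n" * 200
    ranked = sorted(compare_sizes(sample).items(), key=lambda kv: kv[1])
    for name, size in ranked:
        print(f"{name:5} {size:>8,} bytes")
```

Results depend heavily on the input, so the ranking on a tiny or synthetic sample may differ from the table above.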
Why bzip Wins for Code
BWT vs LZ77
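Most compressors are built on LZ77, which finds repeated strings and replaces them with references to earlier occurrences. bzip is built on the Burrows-Wheeler Transform (BWT) instead:
- BWT reorders the characters of a block so that characters sharing the same context end up next to each other
- For code, this means similar patterns cluster together
- The reordered output contains long runs of identical characters, so simple run-length encoding works well
- There are no backreference offsets to store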
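The transform described in the list above can be sketched in a few lines. This is a naive, textbook version for illustration, not bzip2's implementation: it sorts full rotations (roughly quadratic memory), and the `SENTINEL` end marker and the helper names are assumptions made for this example. bzip2 itself also follows the transform with move-to-front and Huffman coding stages, which are omitted here.

```python
from itertools import groupby

SENTINEL = "\0"  # assumed end marker; must not occur in the input


def bwt(text: str) -> str:
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    if SENTINEL in text:
        raise ValueError("input must not contain the sentinel character")
    s = text + SENTINEL
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str) -> str:
    """Invert bwt() by repeatedly prepending the last column and re-sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(c + row for c, row in zip(last_column, table))
    # The row that ends with the sentinel is the original text plus sentinel.
    original = next(row for row in table if row.endswith(SENTINEL))
    return original[:-1]


def count_runs(s: str) -> int:
    """Number of maximal runs of identical characters (what RLE would emit)."""
    return sum(1 for _ in groupby(s))
```

For example, `bwt("banana")` returns `"annb\0aa"`: the two `n`s and two of the `a`s end up adjacent, so the output has 5 runs, and `inverse_bwt` recovers `"banana"` without any stored offsets.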
Key Insight
Code has consistent patterns (function keywords, brackets, indentation). BWT groups these by context across the whole block, whereas LZ77 can only point back at matches in its "recent history", so BWT exploits the regularity better.
Decoder Size Comparison
For self-extracting archives (for example, a decoder embedded in a Lua program), the decoder's size counts toward the total:
| Algorithm | Decoder Size |
|---|---|
| Custom bzip-style | 1.5 KB |
| xz / lzip | ~1 KB |
| gzip | ~1.5 KB |
| brotli | ~2.2 KB |
| zstd | ~3 KB |
Debunking "bzip is slow"
- Compression: zopfli is slower than bzip and still produces larger output (75,882 bytes vs. 63,727 bytes for bzip2 -9 in the results above)
- Decompression: bzip is slower than gzip, but comparable to zstd and brotli when the decoders are written in a high-level language
- Embedded use: "slow" is relative, since every operation is slow in Lua anyway
Conclusion
bzip3 is recommended for the best compression, but even bzip2 beat every LZ77-based compressor measured on this Lua source. The key is BWT's ability to exploit the consistent structure of source code.