Brotli w/o a custom dictionary is a weird choice to begin with.
That said, I personally prefer zstd as well, it's been a great general use lib.
zstd is Pareto better than brotli - compresses better and faster
brotli 1.0.7 args: -q 11 -w 24
zstd v1.5.0 args: --ultra -22 --long=31
File           | Original | zstd  | brotli
RandomBook.pdf | 15M      | 4.6M  | 4.5M
Invoice.pdf    | 19.3K    | 16.3K | 16.1K
I made a table because I wanted to test more files, but almost all PDFs I downloaded/had stored locally were already compressed and I couldn't quickly find a way to decompress them.
Brotli seemed to have a very slight edge over zstd, even on the larger PDF, which I did not expect.
I did my own testing where Brotli also ended up better than ZSTD: https://news.ycombinator.com/item?id=46722044
Results by compression type across 55 PDFs:
+------+------+-----+------+--------+
| none | zstd | xz | gzip | brotli |
+------|------|-----|------|--------|
| 47M | 45M | 39M | 38M | 37M |
+------+------+-----+------+--------+
Here's a table with the correct sizes, reported by 'du -A' (which shows the apparent size):
+---------+---------+--------+--------+--------+
| none | zstd | xz | gzip | brotli |
+---------|---------|--------|--------|--------|
| 47.81M | 37.92M | 37.96M | 38.80M | 37.06M |
+---------+---------+--------+--------+--------+
These numbers are much more impressive. Still, Brotli has a slight edge.
Also, worth testing zopfli, since its output can be decompressed by any gzip-compatible decoder.
| file            | raw         | zstd (%)           | brotli (%)         |
| gawk.pdf        | 8.068.092   | 1.437.529 (17.8%)  | 1.376.106 (17.1%)  |
| shannon.pdf     | 335.009     | 68.739 (20.5%)     | 65.978 (19.6%)     |
| attention.pdf   | 24.742.418  | 367.367 (1.4%)     | 362.578 (1.4%)     |
| learnopengl.pdf | 253.041.425 | 37.756.229 (14.9%) | 35.223.532 (13.9%) |
For learnopengl.pdf I also tested the decompression performance, since it is such a large file, and got the following (less surprising) results using 'perf stat -r 5':
zstd: 0.4532 +- 0.0216 seconds time elapsed ( +- 4.77% )
brotli: 0.7641 +- 0.0242 seconds time elapsed ( +- 3.17% )
The conclusion seems to be consistent with what brotli's authors have said: brotli achieves slightly better compression, but decompresses at a little over half of zstd's speed.
pdftk in.pdf output out.pdf decompress
that data in pdf files are noisy and zstd should perform better on noisy files?
The pdfs you have are already compressed with deflate (zip).
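To illustrate why already-deflated streams leave little for an outer compressor to gain, here is a minimal sketch. It uses Python's stdlib zlib (deflate) for both passes, and `sample_text` is a made-up stand-in for an uncompressed PDF content stream, not real PDF data.

```python
import random
import zlib
from typing import Tuple


def sample_text(n_words: int = 40_000, seed: int = 0) -> bytes:
    """Hypothetical stand-in for an uncompressed PDF content stream."""
    rng = random.Random(seed)
    words = ["BT", "ET", "Tf", "Td", "Tj", "/F1", "12", "72", "720", "(text)"]
    return " ".join(rng.choice(words) for _ in range(n_words)).encode()


def recompression_gain(data: bytes, level: int = 9) -> Tuple[int, int, int]:
    """Return (raw, once, twice): sizes after zero, one and two deflate passes."""
    once = zlib.compress(data, level)
    twice = zlib.compress(once, level)  # compressing already-compressed bytes
    return len(data), len(once), len(twice)


if __name__ == "__main__":
    raw, once, twice = recompression_gain(sample_text())
    print(f"raw={raw} once={once} twice={twice}")
```

The second pass should land very close to the size of the first, which is roughly the situation an outer zstd or brotli pass faces on a PDF whose streams are already deflated.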
If brotli has a different advantage on small source files, you have my curiosity.
If you're talking about max compression, zstd likely loses out there; the answer seems to vary based on the tests I look at. But it seems to be better across a very wide range.
I don’t think you’re using that correctly.
Nevertheless, I expect this to be JBIG2 all over again: almost nobody will use this because we've got decades of devices and software in the wild that can't, and 20% filesize savings is pointless if your destination can't read the damn thing.
I have not tried using a dictionary for zstd.
Imagine a sales meeting where someone pitched that to you. They have to be joking, right?
I have no objection to adding Brotli, but I hope they take compatibility more seriously. You may need readers to deploy it for a long time - ten years? - before you deploy it in PDF creation tools.
You're absolutely right! It's not just an inaccurate slogan—it's a patronizing use of artificial intelligence. What you're describing is not just true, it's precise.
[1]https://en.wikipedia.org/wiki/Wikipedia:Signs_of_AI_writing
brotli decompression is already plenty fast. For PDFs, zstd’s advantage in decompression speed is academic.
Here's discussion by brotli's and zstd's staff:
Something like this:
https://developer.chrome.com/blog/shared-dictionary-compress...
In my applications, in the area of 3D, I've been moving away from Brotli because it is just so slow for large files. I prefer zstd, because it is like 10x faster for both compression and decompression.
So it might land in the spec once it has proven it offers enough value
The standard Brotli dictionary bakes in a ton of assumptions about what the Web looked like in 2015, including not just which HTML tags were particularly common but also such things as which swear words were trendy.
It doesn't seem reasonable to think that PDFs have symbol probabilities remotely similar to the web corpus Google used to come up with that dictionary.
On top of that, it seems utterly daft to be baking that into a format which is expected to fit archival use cases and thus impose that 2015 dictionary on PDF readers for a century to come.
I too would strongly prefer that they use zstd.
The sole exception is if they are restarting the brotli stream for each page, and they are not sharing a dictionary, custom or inferred across the whole doc. Then the dictionary will have to be re-inferred on each page, and then a shared custom dictionary would make more sense.
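A rough sketch of that per-page case, assuming each page is an independent stream. Brotli isn't in the Python stdlib, so this uses zlib's preset-dictionary support (`zdict`) as a stand-in for the same idea; `SHARED_DICT` and the page bytes are invented for illustration.

```python
import zlib
from typing import Optional

# Hypothetical shared dictionary: byte strings expected to recur on every page.
SHARED_DICT = b"BT /F1 12 Tf 72 720 Td (Page ) Tj ET\nq 1 0 0 1 0 0 cm Q\n"


def compress_page(page: bytes, zdict: Optional[bytes] = None) -> bytes:
    """Compress one page as its own deflate stream, optionally primed with zdict."""
    if zdict is None:
        comp = zlib.compressobj(level=9)
    else:
        comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(page) + comp.flush()


def decompress_page(blob: bytes, zdict: Optional[bytes] = None) -> bytes:
    """Inverse of compress_page; the reader must hold the same dictionary."""
    if zdict is None:
        decomp = zlib.decompressobj()
    else:
        decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


if __name__ == "__main__":
    page = b"BT /F1 12 Tf 72 720 Td (Page 7 of 55) Tj ET\nq 1 0 0 1 0 0 cm Q\n"
    plain = compress_page(page)
    primed = compress_page(page, SHARED_DICT)
    print(len(page), len(plain), len(primed))
```

With streams this small, the stream without a dictionary has nothing to refer back to, while the primed one can copy most of the page out of the dictionary. The catch is the one raised above: every reader has to ship the exact same dictionary forever.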
Am I missing something? Adoption will take a long time if you can't be confident the receiver of a document or viewers of a publication will be able to open the file.
Because I'm doing the work to patch in support across different viewers to help adoption grow. And once the big open-source ones (pdfjs, poppler, pdfium) ship it, adoption can rise quickly.
>"Please wait... If this message is not eventually replaced by the proper contents of the document, your PDF viewer may not be able to display this type of document. You can upgrade to the latest version of Adobe Reader for Windows®, Mac, or Linux® by visiting http://www.adobe.com/go/reader_download. For more assistance with Adobe Reader visit http://www.adobe.com/go/acrreader. Windows is either a registered trademark or a trademark of Microsoft Corporation in the United States and/or other countries. Mac is a trademark of Apple Inc., registered in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other countries."
The (USA) Wisconsin Dept. of Natural Resources has nearly all their regulation PDFs as these XFA non-pdfs that I cannot read. So I cannot know the regulations. My emails about this topic (a dozen of them, to multiple addresses, over many years) have gone unanswered.
If Acrobat supports it it doesn't matter what the spec says. Until Adobe drops XFA from Acrobat and forces these extremely silly people to stop, PDF is no longer PDF.
- when jumping from page to page, you won’t have to decompress the entire file
Okay, so we make a compressed container format that can perform such shenanigans, for the same amount of back-compat issues as extending PDF in this way.
> when jumping from page to page, you won’t have to decompress the entire file
This is already a thing with any compression format that supports quasi-random access, which is most of them. The answers to https://stackoverflow.com/q/429987/5223757 discuss a wide variety of tools for producing (and seeking into) such files, which can be read normally by tools not familiar with the conventions in use.
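One of the conventions from that family, sketched under assumptions: compress fixed-size chunks as separate gzip members and keep a side index of where each member starts. The chunk size and the function names `pack` and `read_range` are made up for this example. Since concatenated gzip members still form a valid gzip stream, a reader that knows nothing about the index can decompress the whole thing as usual.

```python
import gzip
from typing import List, Tuple

CHUNK = 64 * 1024  # hypothetical chunk size; a real tool would tune this


def pack(data: bytes, chunk: int = CHUNK) -> Tuple[bytes, List[Tuple[int, int]]]:
    """Compress each chunk as its own gzip member; return (blob, index).

    index[i] is (offset, length) of the i-th member inside blob.
    """
    parts, index, offset = [], [], 0
    for start in range(0, len(data), chunk):
        member = gzip.compress(data[start:start + chunk])
        index.append((offset, len(member)))
        parts.append(member)
        offset += len(member)
    return b"".join(parts), index


def read_range(blob: bytes, index: List[Tuple[int, int]],
               start: int, length: int, chunk: int = CHUNK) -> bytes:
    """Return original bytes [start, start+length), touching only the needed members."""
    if length <= 0:
        return b""
    first = start // chunk
    last = (start + length - 1) // chunk
    out = b"".join(gzip.decompress(blob[off:off + n])
                   for off, n in index[first:last + 1])
    skip = start - first * chunk
    return out[skip:skip + length]
```

The trade-off is the usual one: smaller chunks mean cheaper seeks but a worse ratio, because each member starts with an empty window.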
Far from the same amount:
- existing tools that split PDFs into pages will remain working
- if defensively programmed, existing PDF readers will be able to render PDFs containing JPEG XL images, except for the images themselves.
Though we might still want to restrict the subset of PostScript that we allow. The full language might be a bit too general to take from untrusted third parties.
I suspect PDF was fairly sane in the initial incarnation, and it's the extra garbage that they've added since then that is a source of pain.
I'm not a big fan of this additional change (nor any of the javascript/etc), but I would be fine with people leaving content streams uncompressed and running the whole file through brotli or something.
PDF is also a binary format.
> Experts in the PDF Association’s PDF TWG undertook theoretical and experimental analysis of these schemes, reviewing decompression speed, compression speed, compression ratio achieved, memory usage, code size, standardisation, IP, interoperability, prototyping, sample file creation, and other due diligence tasks.
Maybe read things a bit more carefully before going all out on the snide comments?
"Brotli is a compression algorithm developed by Google."
They have no idea about Zstandard nor ANS/FSE comparing it with LZ77.
Sheer incompetence.
I just took all PDFs I had in my downloads folder (55, totaling 47M). These are invoices, data sheets, employment contracts, schematics, research reports, a bunch of random stuff really.
I compressed them all with 'zstd --ultra -22', 'brotli -9', 'xz -9' and 'gzip -9'. Here are the results:
+------+------+-----+------+--------+
| none | zstd | xz | gzip | brotli |
+------|------|-----|------|--------|
| 47M | 45M | 39M | 38M | 37M |
+------+------+-----+------+--------+
Here's a table with all the files:
+------+------+------+------+--------+
| raw | zstd | xz | gzip | brotli |
+------+------+------+------+--------+
| 12K | 12K | 12K | 12K | 12K |
| 20K | 20K | 20K | 20K | 20K | x5
| 24K | 20K | 20K | 20K | 20K | x5
| 28K | 24K | 24K | 24K | 24K |
| 28K | 24K | 24K | 24K | 24K |
| 32K | 20K | 20K | 20K | 20K | x3
| 32K | 24K | 24K | 24K | 24K |
| 40K | 32K | 32K | 32K | 32K |
| 44K | 40K | 40K | 40K | 40K |
| 44K | 40K | 40K | 40K | 40K |
| 48K | 36K | 36K | 36K | 36K |
| 48K | 48K | 48K | 48K | 48K |
| 76K | 128K | 72K | 72K | 72K |
| 84K | 140K | 84K | 80K | 80K | x7
| 88K | 136K | 76K | 76K | 76K |
| 124K | 152K | 88K | 92K | 92K |
| 124K | 152K | 92K | 96K | 92K |
| 140K | 160K | 100K | 100K | 100K |
| 152K | 188K | 128K | 128K | 132K |
| 188K | 192K | 184K | 184K | 184K |
| 264K | 256K | 240K | 244K | 240K |
| 320K | 256K | 228K | 232K | 228K |
| 440K | 448K | 408K | 408K | 408K |
| 448K | 448K | 432K | 432K | 432K |
| 516K | 384K | 376K | 384K | 376K |
| 992K | 320K | 260K | 296K | 280K |
| 1.0M | 2.0M | 1.0M | 1.0M | 1.0M |
| 1.1M | 192K | 192K | 228K | 200K |
| 1.1M | 2.0M | 1.1M | 1.1M | 1.1M |
| 1.2M | 1.1M | 1.0M | 1.0M | 1.0M |
| 1.3M | 2.0M | 1.1M | 1.1M | 1.1M |
| 1.7M | 2.0M | 1.7M | 1.7M | 1.7M |
| 1.9M | 960K | 896K | 952K | 916K |
| 2.9M | 2.0M | 1.3M | 1.4M | 1.4M |
| 3.2M | 4.0M | 3.1M | 3.1M | 3.0M |
| 3.7M | 4.0M | 3.5M | 3.5M | 3.5M |
| 6.4M | 4.0M | 4.1M | 3.7M | 3.5M |
| 6.4M | 6.0M | 6.1M | 5.8M | 5.7M |
| 9.7M | 10M | 10M | 9.5M | 9.4M |
+------+------+------+------+--------+
Zstd is surprisingly bad on this data set. I'm guessing it struggles with the already-compressed image data in some of these PDFs.
Going by only compression ratio, brotli is clearly better than the rest here and zstd is the worst. You'd have to find some other reason (maybe decompression speed, maybe spec complexity, or maybe you just trust Facebook more than Google) to choose zstd over brotli, going by my results.
I wish I could share the data set for reproducibility, but I obviously can't just share every PDF I happened to have lying around in my downloads folder :p
Here's a table with the correct sizes, reported by 'du -A' (which shows the apparent size):
+---------+---------+--------+--------+--------+
| none | zstd | xz | gzip | brotli |
+---------|---------|--------|--------|--------|
| 47.81M | 37.92M | 37.96M | 38.80M | 37.06M |
+---------+---------+--------+--------+--------+
These numbers are much more impressive. Still, Brotli has a slight edge.
Something is going terribly wrong with `zstd` here, where it is reported to compress a file of 1.1MB to 2MB. Zstd should never grow the file size by more than a very small percent, like any compressor. Am I interpreting it correctly that you're doing something like `zstd -22 --ultra $FILE && wc -c $FILE.zst`?
If you can reproduce this behavior, can you please file an issue with the zstd version you are using, the commands used, and if possible the file producing this result?
I can reproduce it just fine ... but only when compressing all PDFs simultaneously.
To utilize all cores, I ran:
$ for x in *.pdf; do zstd <"$x" >"$x.zst" --ultra -22 & done; wait
(and similar for the other formats).
I ran this again and it produced the same 2M file from the source 1.1M file. However, when I run without parallelization:
$ for x in *.pdf; do zstd <"$x" >"$x.zst" --ultra -22; done
That one file becomes 1.1M, and the total size of *.zst is 37M (competitive with Brotli, which is impressive given how much faster it is to decompress).
What's going on here? Surely '-22' disables any adaptive compression stuff based on system resource availability and just uses compression level 22?
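One way to separate "the compressor behaves differently under load" from "the size measurement is off" is to compare the compressed bytes directly. A minimal sketch, using stdlib zlib as a stand-in for the zstd CLI and a bounded thread pool instead of one background job per file; `compress_all` is a name invented here.

```python
import os
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional


def compress_all(blobs: List[bytes], workers: Optional[int] = None) -> List[bytes]:
    """Compress every input at level 9 using at most `workers` threads."""
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with blobs.
        return list(pool.map(lambda b: zlib.compress(b, 9), blobs))


if __name__ == "__main__":
    blobs = [bytes([i]) * 10_000 + b"x" * i for i in range(20)]
    parallel = compress_all(blobs, workers=4)
    sequential = [zlib.compress(b, 9) for b in blobs]
    print("identical:", parallel == sequential)
    print("total bytes:", sum(len(c) for c in parallel))
```

If the byte counts match between the parallel and sequential runs, any difference reported by a disk-usage tool has to come from how the filesystem allocated the files, not from the compressor.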
So it is very central to zstd that it will never emit a block that is larger than 128KB+3B.
I will try to reproduce, but I suspect that there is something unrelated to zstd going on.
What version of zstd are you using?
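As back-of-the-envelope arithmetic for why a 1.1MB input turning into 2MB is implausible: if incompressible data is stored as raw blocks of at most 128 KiB, each with a 3-byte block header, the overhead is tiny. The frame-level sizes below (4-byte magic, frame header of up to 14 bytes, optional 4-byte checksum) are my reading of the Zstandard frame format, and the constant and function names are mine. This is not the library's own `ZSTD_compressBound`, which is more conservative.

```python
import math

BLOCK_MAX = 128 * 1024   # maximum content bytes per block
BLOCK_HEADER = 3         # bytes of header per block
MAGIC = 4                # frame magic number
FRAME_HEADER_MAX = 14    # largest possible frame header
CHECKSUM = 4             # optional content checksum


def raw_block_bound(n: int) -> int:
    """Upper estimate of frame size if all n input bytes are stored as raw blocks."""
    blocks = max(1, math.ceil(n / BLOCK_MAX))
    return n + blocks * BLOCK_HEADER + MAGIC + FRAME_HEADER_MAX + CHECKSUM


if __name__ == "__main__":
    n = 1_100_000
    print(raw_block_bound(n) - n, "bytes of overhead on", n)
```

Under these assumptions a 1,100,000-byte input needs 9 blocks, so the overhead is 49 bytes, nowhere near a doubling.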
I was completely unable to reproduce it on my Linux desktop though: https://floss.social/@mort/115940627269799738
I can repro on my Mac with these steps with either `zstd` or `gzip`:
$ rm -f ksh.zst
$ zstd < /bin/ksh > ksh.zst
$ du -h ksh.zst
1.2M ksh.zst
$ wc -c ksh.zst
1240701 ksh.zst
$ zstd < /bin/ksh > ksh.zst
$ du -h ksh.zst
2.0M ksh.zst
$ wc -c ksh.zst
1240701 ksh.zst
$ rm -f ksh.gz
$ gzip < /bin/ksh > ksh.gz
$ du -h ksh.gz
1.2M ksh.gz
$ wc -c ksh.gz
1246815 ksh.gz
$ gzip < /bin/ksh > ksh.gz
$ du -h ksh.gz
2.1M ksh.gz
$ wc -c ksh.gz
1246815 ksh.gz
When a file is overwritten, the on-disk size is bigger. I don't know why. But you must have run zstd's benchmark twice, and every other compressor's benchmark once.
I'm a zstd developer, so I have a vested interest in accurate benchmarks, and finding & fixing issues :)
It doesn't seem to be only about overwriting, I can be in a directory without any .zst files and run the command to compress 55 files in parallel and it's still 45M according to 'du -h'. But you're right, 'wc -c' shows 38809999 bytes regardless of whether 'du -h' shows 45M after a parallel compression or 38M after a sequential compression.
My mental model of 'du' was basically that it gives a size accurate to the nearest 4k block, which is usually accurate enough. Seems I have to reconsider. Too bad there's no standard alternative which has the interface of 'du' but with byte-accurate file sizes...
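For what it's worth, the two numbers being confused here are both available from a plain `stat` call. A small sketch, assuming a POSIX system where `st_blocks` counts 512-byte units (true on Linux and macOS as far as I know); the function name `sizes` is made up.

```python
import os
import sys
from typing import Tuple


def sizes(path: str) -> Tuple[int, int]:
    """Return (apparent_bytes, allocated_bytes) for one file.

    apparent is what 'wc -c' reports; allocated is roughly what 'du' sums up.
    """
    st = os.stat(path)
    apparent = st.st_size
    allocated = st.st_blocks * 512  # assumption: 512-byte units (POSIX systems)
    return apparent, allocated


if __name__ == "__main__":
    total_apparent = total_allocated = 0
    for path in sys.argv[1:]:
        apparent, allocated = sizes(path)
        total_apparent += apparent
        total_allocated += allocated
        print(f"{apparent:>12} {allocated:>12} {path}")
    print(f"{total_apparent:>12} {total_allocated:>12} total")
```

Run it as `python sizes.py *.zst` (the script name is arbitrary). The first column is byte-accurate regardless of how the filesystem chose to allocate space.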
It does still seem odd that APFS is reporting a significantly larger disk-size for these files. I'm not sure why that would ever be the case, unless there is something like deferred cleanup work.
I'll chalk it up to "some APFS weirdness".
--ultra: unlocks high compression levels 20+ (maximum 22), using a lot more memory.
Regardless, this reproduces with random other files and with '-9' as the compression level. I made a mastodon post about it here: https://floss.social/@mort/115940378643840495
qpdf --stream-data=uncompress in.pdf out.pdf
The resulting file should compress better with zstd.
If you do wanna change PDF backwards-incompatibly, I don't think there's a significant advantage to choosing gzip, to be honest; both brotli and zstd are pretty widely available these days and should be fairly easy to vendor. But yeah, it's a slight advantage I guess. Though I would expect that there are other PDF data sets where brotli has a larger advantage compared to gzip.
But what I really don't get is all the calls to use zstd instead of brotli and treating the choice to use brotli instead of zstd as some form of Google conspiracy. (Is Facebook really better?)
I may dislike Google. But my support of JPEG XL and Zstd has nothing to do with competition tech being Google at all. I simply think JPEG XL and Zstd are better technology.
~/tmp/pdfbench $ hyperfine --warmup 2 \
'for x in zst/*; do zstd -d >/dev/null <"$x"; done' \
'for x in gz/*; do gzip -d >/dev/null <"$x"; done' \
'for x in xz/*; do xz -d >/dev/null <"$x"; done' \
'for x in br/*; do brotli -d >/dev/null <"$x"; done'
Benchmark 1: for x in zst/*; do zstd -d >/dev/null <"$x"; done
Time (mean ± σ): 164.6 ms ± 1.3 ms [User: 83.6 ms, System: 72.4 ms]
Range (min … max): 162.0 ms … 166.9 ms 17 runs
Benchmark 2: for x in gz/*; do gzip -d >/dev/null <"$x"; done
Time (mean ± σ): 143.0 ms ± 1.0 ms [User: 87.6 ms, System: 43.6 ms]
Range (min … max): 141.4 ms … 145.6 ms 20 runs
Benchmark 3: for x in xz/*; do xz -d >/dev/null <"$x"; done
Time (mean ± σ): 981.7 ms ± 1.6 ms [User: 891.5 ms, System: 93.0 ms]
Range (min … max): 978.7 ms … 984.3 ms 10 runs
Benchmark 4: for x in br/*; do brotli -d >/dev/null <"$x"; done
Time (mean ± σ): 254.5 ms ± 2.5 ms [User: 172.9 ms, System: 67.4 ms]
Range (min … max): 252.3 ms … 260.5 ms 11 runs
Summary
for x in gz/*; do gzip -d >/dev/null <"$x"; done ran
1.15 ± 0.01 times faster than for x in zst/*; do zstd -d >/dev/null <"$x"; done
1.78 ± 0.02 times faster than for x in br/*; do brotli -d >/dev/null <"$x"; done
6.87 ± 0.05 times faster than for x in xz/*; do xz -d >/dev/null <"$x"; done
As expected, xz is super slow. Gzip is fastest, zstd being somewhat slower, brotli slower again but still much faster than xz.
+-------+-------+--------+-------+
| gzip | zstd | brotli | xz |
+-------+-------+--------+-------+
| 143ms | 165ms | 255ms | 982ms |
+-------+-------+--------+-------+
I honestly expected zstd to win here.
https://news.ycombinator.com/item?id=46035817
I’m also really surprised that gzip performs better here. Is there some kind of hardware acceleration or the like?
Regardless, this does not make a significant difference. I ran hyperfine again against a 37M folder of .pdf.zst files, and the results are virtually identical for zstd and gzip:
+-------+-------+--------+-------+
| gzip | zstd | brotli | xz |
+-------+-------+--------+-------+
| 142ms | 165ms | 269ms | 994ms |
+-------+-------+--------+-------+
Raw hyperfine output:
~/tmp/pdfbench $ du -h zst2 gz xz br
37M zst2
38M gz
38M xz
37M br
~/tmp/pdfbench $ hyperfine ...
Benchmark 1: for x in zst2/*; do zstd -d >/dev/null <"$x"; done
Time (mean ± σ): 164.5 ms ± 2.3 ms [User: 83.5 ms, System: 72.3 ms]
Range (min … max): 162.3 ms … 172.3 ms 17 runs
Benchmark 2: for x in gz/*; do gzip -d >/dev/null <"$x"; done
Time (mean ± σ): 142.2 ms ± 0.9 ms [User: 87.4 ms, System: 43.1 ms]
Range (min … max): 140.8 ms … 143.9 ms 20 runs
Benchmark 3: for x in xz/*; do xz -d >/dev/null <"$x"; done
Time (mean ± σ): 993.9 ms ± 9.2 ms [User: 896.7 ms, System: 99.1 ms]
Range (min … max): 981.4 ms … 1007.2 ms 10 runs
Benchmark 4: for x in br/*; do brotli -d >/dev/null <"$x"; done
Time (mean ± σ): 269.1 ms ± 8.8 ms [User: 176.6 ms, System: 75.8 ms]
Range (min … max): 261.8 ms … 287.6 ms 10 runs
On the incompressible files, I'd expect decompression of any algorithm to approach the speed of `memcpy()`. And would generally expect zstd's decompression speed to be faster. For example, on an x86 core running at 2GHz, Zstd is decompressing a file at 660 MB/s, and on my M1 at 1276 MB/s.
You could measure locally either using a specialized tool like lzbench [0], or for zstd by just running `zstd -b22 --ultra /path/to/file`, which will print the compression ratio, compression speed, and decompression speed.
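If neither tool is at hand, the same kind of in-process measurement (no per-file process startup, unlike a shell loop) can be sketched in a few lines. This version only covers the codecs in Python's stdlib (zlib, lzma, bz2) as stand-ins; zstd and brotli would need third-party bindings, which I'm not assuming here. The `bench` function and its output format are made up for this example.

```python
import bz2
import lzma
import time
import zlib
from typing import Dict, Tuple

CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "lzma": (lzma.compress, lzma.decompress),
    "bz2": (bz2.compress, bz2.decompress),
}


def bench(data: bytes, repeats: int = 5) -> Dict[str, Tuple[float, float]]:
    """Return {codec: (compressed/raw ratio, best decompression speed in MB/s)}."""
    if not data:
        raise ValueError("need non-empty input")
    results = {}
    for name, (compress, decompress) in CODECS.items():
        blob = compress(data)
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            out = decompress(blob)
            best = min(best, time.perf_counter() - t0)
        assert out == data  # round-trip sanity check
        speed = len(data) / max(best, 1e-9) / 1e6  # guard against a zero timer delta
        results[name] = (len(blob) / len(data), speed)
    return results


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as fh:
        for name, (ratio, mbps) in bench(fh.read()).items():
            print(f"{name:5s} ratio={ratio:.3f} decompress={mbps:.0f} MB/s")
```

Taking the best of several runs reduces noise from caches and scheduling; the absolute numbers will still depend heavily on the machine and the file.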
I keep a bunch of comics in PDF but JPEG-XL is by far the best way to enjoy them in terms of disk space.
[1]: https://pdfa.org/wp-content/uploads/2025/10/PDFDays2025-Brea...
But reading the article I realized PDF has become ubiquitous because of its insistence on backwards compatibility. Maybe for some things it's good to move this slow.
The PDF format is versioned, and in the past new versions have introduced things like new types of encryption. It’s quite probable that a v1.7 compliant PDF won’t open on a reader app written when v1.3 was the latest standard.
If size was important to users then it wouldn't be so common that systems providers crap out huge PDF files consisting mainly of layout junk 'sophistication' with rounded borders and whatnot.
The PDF/A stuff I've built stays under 1 MB for hundreds of pages of information, because it's text placed in a typographically sensible manner.
ISO is pay to play so :shrug:
So your comment is a falsehood
https://pdfa.org/brotli-compression-coming-to-pdf/
> As of March 2025, the current development version of MuPDF now supports reading PDF files with Brotli compression. The source is available from github.com/ArtifexSoftware/mupdf, and will be included as an experimental feature in the upcoming 1.26.0 release.
> Similarly, the latest development version of Ghostscript can now read PDF files with Brotli compression. File creation functionality is underway. The next official Ghostscript release is scheduled for August this year, but the source is available now from github.com/ArtifexSoftware/Ghostpdl.
MuPDF is an excellent PDF reader, the fastest that I have ever tested. There are plenty of big PDF files where most other readers are annoyingly slow.
It is my default PDF and EPUB reader, except that in very rare cases I encounter PDF files which MuPDF cannot understand, when I use other PDF readers (e.g. Okular).
“The PDF Association operates under a strict principle—any new feature must work seamlessly with existing readers” followed by introducing compression as a breaking change in the same paragraph.
All this for brotli… on a read-many format like PDF, zstd’s decompression speed is a much better fit.