CLI should fail if the dictionary file provided in -D does not contain a valid dictionary #2873

Closed
JustAnotherArchivist opened this issue Nov 23, 2021 · 1 comment

@JustAnotherArchivist

Describe the bug
When the zstd CLI is called with -D and that file does not contain a valid dictionary, this is either ignored or results in potentially confusing 'Dictionary mismatch' errors.

To Reproduce
Steps to reproduce the behavior:

  1. echo test | zstd -o test.zst
  2. echo notadict >falsedict
  3. zstdcat -D falsedict test.zst
  4. echo test | zstd -D falsedict -o test2.zst
  5. wget https://github.com/facebook/zstd/raw/c2c6a4ab40fcc327e79d5364f9c2ab1e41e6a7f8/tests/dict-files/zero-weight-dict (or any other valid non-empty dictionary)
  6. echo test | zstd -D zero-weight-dict -o test3.zst
  7. zstdcat -D falsedict test3.zst

Expected behavior
On steps 3, 4, and 7, I would expect an error that the contents of falsedict are not a valid zstd dictionary. Personally, I would prefer this to be a fatal error, but at least there should be a warning on stderr. Ideally, the error would include the file size.

Step 3 instead decompresses the file without any indication of an issue. Step 4 compresses the input without a dictionary. Step 7 produces an error 'Dictionary mismatch', which is probably the least bad case here but may lead the user to think that something's wrong with the compressed file or its metadata (e.g. when identifying the used dictionary out-of-band or when it is included in a skippable frame like #2349).

It gets even more confusing when increasing verbosity to level 4 or beyond: zstdcat -vvvv -D falsedict test3.zst then even prints a line Loading falsedict as dictionary but still doesn't indicate that the loading failed.

Desktop (please complete the following information):

  • OS: Debian oldstable and sid
  • Version: 1.3.8+dfsg-3, 1.4.8+dfsg-3 on amd64 (binary packages from Debian, not compiled from source by myself)

Additional context
I came across this issue as I had written a wrapper script to handle dictionaries in a skippable frame on .warc.zst files (cf. #2349). The script extracts the dictionary, puts it in a temporary file, then passes that to zstd for decompression. I was getting the 'Dictionary mismatch' error on some particular files and couldn't figure out for a long time why. As it turned out, the temporary file didn't get flushed to disk in some cases (namely when the dict was very small, so it was buffered somewhere without an explicit flush), leading zstd to see a 0-byte file and silently not load the dict. If zstd had told me about that load error, ideally including the file size it saw, this would've saved me a lot of debugging time.
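The wrapper script itself is not shown, so the following is only a minimal sketch of the failure mode described above, assuming the script is written in Python. The function name `write_dictionary_tempfile` and the `zstd-dict-` prefix are hypothetical. It writes the extracted dictionary to a temporary file, flushes and closes it, and checks the on-disk size before the path is handed to another process.

```python
import os
import tempfile


def write_dictionary_tempfile(dict_bytes):
    """Write dict_bytes to a temporary file and return its path.

    Hypothetical helper: the data is flushed and the file closed before
    returning, so a separate process opening the path sees every byte.
    """
    tmp = tempfile.NamedTemporaryFile(prefix="zstd-dict-", delete=False)
    try:
        tmp.write(dict_bytes)
        tmp.flush()  # push Python's userspace buffer to the OS
    finally:
        tmp.close()

    # Sanity check: a 0-byte file here is exactly the situation in which
    # zstd silently proceeds without a dictionary.
    size = os.path.getsize(tmp.name)
    if size != len(dict_bytes):
        os.unlink(tmp.name)
        raise OSError(f"dictionary file has {size} bytes, expected {len(dict_bytes)}")
    return tmp.name
```

The returned path can then be passed to `zstd -D`; the caller is responsible for deleting the file afterwards.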

@terrelln
Contributor

I don't believe that this is a zstd bug, but rather confusion about how zstd uses dictionaries.

Zstd accepts two types of dictionaries:

  1. Zstd dictionaries. These are required to follow a specific format, and start with the little-endian magic number 0xEC30A437. These dictionaries have a dictionary ID. If you compress data with this dictionary, tell the compressor to write the dictionary ID into the frame (which is the default behavior), and then try to decompress the data without a dictionary, or with a dictionary that doesn't match the dictionary ID, then decompression will fail with 'Dictionary Mismatch'.
  2. Raw content dictionaries. There are no requirements on these dictionaries; they are just bytes. No dictionary ID gets written into the frame. Zstd cannot know ahead of time whether the correct dictionary is used, or whether a dictionary is used at all. Zstd will fail if it tries to reference something in the dictionary and the dictionary is too small; otherwise it will happily regenerate incorrect data. However, with checksumming enabled (the default for the CLI), zstd checksums the decompressed data and will error if it detects that the regenerated data is incorrect.

By default, zstd will load dictionaries in "auto" mode, which means it will load them as zstd dictionaries if they start with the magic number, and as raw content dictionaries otherwise. This is, however, configurable in the library.
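To see which way a given file would be treated, the magic-number check can be approximated outside of zstd. The following is a minimal sketch, not zstd's actual loading code: `describe_dictionary` is a hypothetical helper, and the 8-byte minimum (4 bytes of magic followed by a 4-byte little-endian dictionary ID) is an assumption of this sketch based on the zstd dictionary format.

```python
import struct
from pathlib import Path

ZSTD_DICT_MAGIC = 0xEC30A437  # magic number quoted above, stored little-endian


def describe_dictionary(path):
    """Return (kind, size, dict_id) for a candidate dictionary file.

    kind is "empty", "zstd" (starts with the magic number) or "raw".
    dict_id is only set for "zstd" files.
    """
    data = Path(path).read_bytes()
    size = len(data)
    if size == 0:
        return ("empty", 0, None)
    # Assumption: a formatted dictionary needs at least magic + dictionary ID.
    if size >= 8 and struct.unpack_from("<I", data, 0)[0] == ZSTD_DICT_MAGIC:
        dict_id = struct.unpack_from("<I", data, 4)[0]
        return ("zstd", size, dict_id)
    return ("raw", size, None)
```

Running this on the `falsedict` file from the reproduction steps would report it as a raw content dictionary, which is why no load error is printed for it.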

Importantly, the zstd CLI checksums the decompressed data. This means that, unless you have an adversarial or very unlucky blob, zstd will error if the wrong dictionary is used for decompression and it actually matters. In your example, the dictionary was empty, so compression can't possibly reference any positions in the dictionary, and it doesn't matter whether decompression has the dictionary or not.
