CLI should fail if the dictionary file provided in `-D` does not contain a valid dictionary #2873

JustAnotherArchivist · 2021-11-23T03:11:18Z

Describe the bug
When the zstd CLI is called with -D and that file does not contain a valid dictionary, this is either ignored or results in potentially confusing 'Dictionary mismatch' errors.

To Reproduce
Steps to reproduce the behavior:

1. echo test | zstd -o test.zst
2. echo notadict >falsedict
3. zstdcat -D falsedict test.zst
4. echo test | zstd -D falsedict -o test2.zst
5. wget https://github.com/facebook/zstd/raw/c2c6a4ab40fcc327e79d5364f9c2ab1e41e6a7f8/tests/dict-files/zero-weight-dict (or any other valid non-empty dictionary)
6. echo test | zstd -D zero-weight-dict -o test3.zst
7. zstdcat -D falsedict test3.zst

Expected behavior
On steps 3, 4, and 7, I would expect an error that the contents of falsedict are not a valid zstd dictionary. Personally, I would prefer this to be a fatal error, but at least there should be a warning on stderr. Ideally, the error would include the file size.

Step 3 instead decompresses the file without any indication of an issue. Step 4 compresses the input without a dictionary. Step 7 produces an error 'Dictionary mismatch', which is probably the least bad case here but may lead the user to think that something's wrong with the compressed file or its metadata (e.g. when identifying the used dictionary out-of-band or when it is included in a skippable frame like #2349).

It gets even more confusing when increasing verbosity to level 4 or beyond: `zstdcat -vvvv -D falsedict test3.zst` then even prints a line `Loading falsedict as dictionary` but still doesn't indicate that the loading failed.

Desktop (please complete the following information):

OS: Debian oldstable and sid
Version: 1.3.8+dfsg-3, 1.4.8+dfsg-3 on amd64 (binary packages from Debian, not compiled from source by myself)

Additional context
I came across this issue as I had written a wrapper script to handle dictionaries in a skippable frame on .warc.zst files (cf. #2349). The script extracts the dictionary, puts it in a temporary file, then passes that to zstd for decompression. I was getting the 'Dictionary mismatch' error on some particular files and couldn't figure out for a long time why. As it turned out, the temporary file didn't get flushed to disk in some cases (namely when the dict was very small, so it was buffered somewhere without an explicit flush), leading zstd to see a 0-byte file and silently not load the dict. If zstd had told me about that load error, ideally including the file size it saw, this would've saved me a lot of debugging time.
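The failure mode described above (zstd seeing a 0-byte dictionary file) could be caught in a wrapper before zstd is invoked. A minimal shell sketch, assuming a hypothetical helper named `check_dict_file`; it is not part of zstd and only illustrates the kind of size report requested here:

```shell
# Hypothetical wrapper guard, not part of zstd: refuse to continue when the
# dictionary file is missing or empty, and report the size that was seen.
check_dict_file() {
    if [ ! -f "$1" ]; then
        echo "error: dictionary '$1' is missing or not a regular file" >&2
        return 1
    fi
    # wc -c may pad its output with whitespace on some systems; strip it.
    size=$(wc -c < "$1" | tr -d '[:space:]')
    if [ "$size" -eq 0 ]; then
        echo "error: dictionary '$1' is empty (0 bytes)" >&2
        return 1
    fi
    echo "dictionary '$1': $size bytes" >&2
}
```

A wrapper might then run something like `check_dict_file "$dict" && zstdcat -D "$dict" test3.zst`, so that an unflushed temporary file is reported before decompression is attempted.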


terrelln · 2021-11-23T19:47:55Z

I don't believe that this is a zstd bug, rather confusion about how zstd uses dictionaries.

Zstd accepts two types of dictionaries:

Zstd dictionaries. These are required to follow a specific format, and start with the little-endian magic number 0xEC30A437. These dictionaries have a dictionary ID. If you compress data with this dictionary, tell the compressor to write the dictionary ID into the frame (which is the default behavior), and then try to decompress the data without a dictionary, or with a dictionary that doesn't match the dictionary ID, then decompression will fail with 'Dictionary Mismatch'.
Raw content dictionaries. There are no requirements on these dictionaries, they are just bytes. No dictionary ID gets written into the frame. Zstd cannot tell ahead of time whether the correct dictionary is used, or if a dictionary is used at all. Zstd will fail if it tries to reference something in the dictionary, and the dictionary is too small. Or it will happily re-generate incorrect data. However, with checksumming enabled (the default for the CLI), zstd checksums the decompressed data, and will error if it detects that the data regenerated is incorrect.

By default, zstd will load dictionaries in "auto" mode, which means we will load them as a zstd dictionary if they start with the magic number, and load them as raw content dictionaries otherwise. This is, however, configurable in the library.
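To see which of the two cases a given file falls into, the first four bytes can be inspected by hand. A rough shell sketch, assuming a hypothetical helper named `classify_dict`; it looks only at the magic number described above and does not validate the rest of the dictionary format, so the library may still apply further checks:

```shell
# Hypothetical helper, not part of zstd: report how a file would be treated
# based only on whether it starts with the dictionary magic number.
classify_dict() {
    # Read the first 4 bytes as lowercase hex with all whitespace removed.
    magic=$(head -c 4 "$1" | od -An -tx1 | tr -d '[:space:]')
    # 0xEC30A437 stored little-endian is the byte sequence 37 a4 30 ec.
    if [ "$magic" = "37a430ec" ]; then
        echo "zstd-dictionary"
    else
        echo "raw-content"
    fi
}
```

For the `falsedict` file from the reproduction steps, `classify_dict falsedict` reports `raw-content`, which is consistent with zstd accepting it without complaint.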

And importantly, in the CLI zstd will checksum the decompressed data, which means that unless you have an adversarial / very unlucky blob, zstd will error if the wrong dictionary is used for decompression and it actually matters. In your example, the dictionary was empty, which means that compression can't possibly reference any positions in the dictionary, so it doesn't matter if decompression has the dictionary or not.

terrelln added the feature request label Nov 23, 2021

15596858998 mentioned this issue Nov 26, 2021

CLI's -D fails when the argument is not a regular file #2874

Closed

felixhandte self-assigned this Mar 26, 2022

terrelln closed this as completed Dec 22, 2022
