UTF-8 error with testing set of torchtext.datasets.Multi30k(language_pair=("de", "en")). #2221

Open
raaaaaymond opened this issue Jan 14, 2024 · 5 comments

@raaaaaymond

🐛 Bug

Describe the bug

Iterating over the testing set (the third object returned by torchtext.datasets.Multi30k(language_pair=("de", "en"))) raises the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 37: invalid start byte
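
For context, 0x80 is a UTF-8 continuation byte and can never begin a character, so the decoder rejects it wherever a new character is expected. A minimal sketch that reproduces the same message with made-up bytes (the 37 ASCII padding bytes are an assumption chosen only to match the reported position; they are not the actual file contents):

```python
# Hypothetical payload: 37 ASCII bytes followed by a lone 0x80 byte.
data = b"a" * 37 + b"\x80"

try:
    data.decode("utf-8")
    message = None
except UnicodeDecodeError as err:
    # The exception text names the offending byte and its offset.
    message = str(err)

print(message)
```

This suggests the file being read is either not text at all or is stored in a different encoding.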

To Reproduce

  1. Install torch torchaudio torchdata torchtext torchvision.
  2. Run the following code:
import torchtext.datasets
training, validation, testing = torchtext.datasets.Multi30k(language_pair=("de", "en"))
for thing in testing:
    print(thing)
  3. Observe UnicodeDecodeError.

Expected behavior

There should be no error, only a printout of the data. For example, replacing for thing in testing with for thing in validation or for thing in training works as expected.

Screenshots

Not necessary; just observe the error.

Environment

PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 11 Pro
GCC version: (MinGW-W64 x86_64-msvcrt-posix-seh, built by Brecht Sanders) 13.1.0
Clang version: Could not collect
CMake version: version 3.28.0-rc1
Libc version: N/A

Python version: 3.11.6 (tags/v3.11.6:8b6ee5b, Oct  2 2023, 14:57:12) [MSC v.1935 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22621-SP0
Is CUDA available: True
CUDA runtime version: 12.2.128
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3080
Nvidia driver version: 546.17
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy==1.6.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.0
[pip3] torch==2.1.2+cu121
[pip3] torchaudio==2.1.2+cu121
[pip3] torchdata==0.7.1
[pip3] torchtext==0.16.2
[pip3] torchvision==0.16.2+cu121
[conda] Could not collect

@matekrk

matekrk commented Jan 18, 2024

Hi, I encountered the same issue (while working on https://github.com/harvardnlp/annotated-transformer/). Any updates on this?

@zh-jp

zh-jp commented Jan 28, 2024

Quoting @matekrk: "Hi, I encountered the same issue (while working on https://github.com/harvardnlp/annotated-transformer/). Any updates on this?"

I'm working through this repository too, and the issue occurred when running:

train, val, test = datasets.Multi30k(language_pair=("de", "en"))

Loading each split separately helps; this modified code runs successfully:

train = datasets.Multi30k(root='.data', split='train', language_pair=('de', 'en'))
val = datasets.Multi30k(root='.data', split='valid', language_pair=('de', 'en'))

But if you add the test split, the error occurs, for reasons I don't know.

The package versions are as follows:

pytorch                   2.1.2           py3.11_cuda12.1_cudnn8_0    pytorch
pytorch-cuda              12.1                 hde6ce7c_5    pytorch
pytorch-mutex             1.0                        cuda    pytorch

torchaudio                2.1.2                    pypi_0    pypi
torchdata                 0.7.1                     py311    pytorch
torchtext                 0.16.2                    py311    pytorch
torchvision               0.16.2                   pypi_0    pypi

@foxy6624

The error occurs because the original server went down, so the download link for the Multi30k dataset was temporarily changed. For example, the download link for the test set is: https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt16_task1_test.tar.gz. This archive contains both .test.* and test.* files (* stands for en, de, fr). However, the _filter_fn function selects files by whether the filename contains test.*, which also matches the .test.* files. The .test.* files contain bytes that are not valid UTF-8, which causes an error when the file is read.

Possible solutions are:

  1. Change the download link for the Multi30k dataset back to the original, see: Multi30K dataset link is broken #1756 (comment).
  2. Modify the _filter_fn function to check for /test.* (with a leading slash) instead of test.*, e.g. return f"/{_PREFIX[split]}.{language_pair[i]}" in x[0].
  3. Manually download and unzip the test.* files to the torch cache directory: ~/.cache/torch/text/datasets/Multi30k/.
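
The substring problem and the fix proposed in option 2 can be sketched in isolation. The two helper functions and the example paths below are hypothetical stand-ins; they are not torchtext's actual _filter_fn, and the exact path layout seen by the filter is an assumption:

```python
def loose_match(path: str, prefix: str, lang: str) -> bool:
    """Substring check without a leading slash (the behaviour described above)."""
    return f"{prefix}.{lang}" in path


def strict_match(path: str, prefix: str, lang: str) -> bool:
    """Substring check with a leading slash, as suggested in option 2."""
    return f"/{prefix}.{lang}" in path


# Assumed path layout: archive name, then member name.
paths = [
    "mmt16_task1_test.tar.gz/.test.de",  # hidden file, reported as not valid UTF-8
    "mmt16_task1_test.tar.gz/test.de",   # the real test data
]

loose = [p for p in paths if loose_match(p, "test", "de")]
strict = [p for p in paths if strict_match(p, "test", "de")]

print(loose)   # both paths match
print(strict)  # only the plain test.de path matches
```

The leading slash works here because the hidden file's name puts a dot, not a slash, directly before "test".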

@OnlyBelter
Copy link

The test set is still not working!

I'm using the following versions:

torch                     2.2.2                    pypi_0    pypi
torchdata                 0.7.1                    pypi_0    pypi
torchtext                 0.17.2                   pypi_0    pypi

It consistently throws the error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 37: invalid start byte.
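
One way to narrow this down is to check which files in the dataset directory fail to decode. This is a rough diagnostic sketch only; find_non_utf8_files is a hypothetical helper, not part of torchtext:

```python
from pathlib import Path


def find_non_utf8_files(directory: str) -> list[str]:
    """Return names of regular files in `directory` that fail to decode as UTF-8."""
    bad = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(path.name)
    return bad


# Example call, using the cache directory mentioned earlier in this thread:
# find_non_utf8_files(str(Path.home() / ".cache/torch/text/datasets/Multi30k"))
```

Compressed archives such as the .tar.gz downloads are binary and will show up in the result as well; the interesting entries are any test.* files or hidden siblings.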

@donglinkang2021

I hit the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 37: invalid start byte when running the script below:

from torchtext.datasets import Multi30k
train, val, test = Multi30k(language_pair=("de", "en"))

# for de, en in train:
#     print(de)
#     print(en)
#     break

# for de, en in val:
#     print(de)
#     print(en)
#     break

for de, en in test:
    print(de)
    print(en)
    break

I found that the test archive mmt16_task1_test.tar.gz had been downloaded to my dataset root but had not been extracted correctly, so I extracted it manually with the following command, which solved the problem:

tar -zxvf mmt16_task1_test.tar.gz # my solution to fix the error

My datasets/Multi30k directory:

# ls
mmt16_task1_test.tar.gz  test.de  test.en  test.fr  train.de  train.en  training.tar.gz  val.de  val.en  validation.tar.gz
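
For anyone without a tar command (for example on Windows), roughly the same manual extraction can be sketched with Python's standard tarfile module. The function name and its arguments are hypothetical, and this version deliberately keeps only members named exactly test.<lang>, which is stricter than the tar command above:

```python
import tarfile
from pathlib import Path, PurePosixPath


def extract_plain_test_files(archive_path, dest_dir, langs=("de", "en", "fr")):
    """Extract only members whose base name is exactly test.<lang>."""
    wanted = {f"test.{lang}" for lang in langs}
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            name = PurePosixPath(member.name).name
            if member.isfile() and name in wanted:
                # Write by base name so hidden siblings such as .test.de are
                # skipped and no directory components from the archive are used.
                source = tar.extractfile(member)
                (dest / name).write_bytes(source.read())
                extracted.append(name)
    return sorted(extracted)


# Example call (run from the datasets/Multi30k directory shown above):
# extract_plain_test_files("mmt16_task1_test.tar.gz", ".")
```

Whether torchtext then reads the extracted files instead of the archive depends on its caching logic; it did in my case, but treat that as an observation rather than a guarantee.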

Hope this helps :)
