UTF-8 error with testing set of torchtext.datasets.Multi30k(language_pair=("de", "en"))
#2221
Comments
Hi, I encountered the same issue (while working on https://github.com/harvardnlp/annotated-transformer/). Any updates on this?

I'm learning from this repository too. The issue happened when running:
train, val, test = datasets.Multi30k(language_pair=("de", "en"))
Inspired by the idea of processing the datasets separately, this modified code runs successfully:
train = datasets.Multi30k(root='.data', split='train', language_pair=('de', 'en'))
val = datasets.Multi30k(root='.data', split='valid', language_pair=('de', 'en'))
But if you add
The package versions are as follows:
pytorch        2.1.2    py3.11_cuda12.1_cudnn8_0    pytorch
pytorch-cuda   12.1     hde6ce7c_5                  pytorch
pytorch-mutex  1.0      cuda                        pytorch
torchaudio     2.1.2    pypi_0                      pypi
torchdata      0.7.1    py311                       pytorch
torchtext      0.16.2   py311                       pytorch
torchvision    0.16.2   pypi_0                      pypi

The error occurred because the original server went down, so the download link for the Multi30k dataset was temporarily modified. For example, the download link for the test set is:
https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt16_task1_test.tar.gz
This compressed file contains
Possible solutions are:

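A minimal sketch for inspecting a downloaded archive like the one linked above: it walks every regular file in a tarball and records whether its bytes decode as UTF-8. The helper name `report_utf8_members` is hypothetical, and the sketch makes no claim about what the Multi30k test archive actually contains; it only reports what it finds in whatever archive it is given.

```python
import tarfile


def report_utf8_members(archive_path):
    """Map each regular file in the archive to None (valid UTF-8) or the decode error text."""
    report = {}
    # "r:*" lets tarfile detect the compression (gzip, bz2, xz, or none).
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            data = archive.extractfile(member).read()
            try:
                data.decode("utf-8")
                report[member.name] = None
            except UnicodeDecodeError as err:
                report[member.name] = str(err)
    return report
```

Any member that maps to an error message is a file that a UTF-8 text reader would fail on.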
The test set is still not working! I'm using the following version:
It constantly throws an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 37: invalid start byte.

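For readers unfamiliar with this message, the sketch below reproduces the same kind of error outside torchtext. It assumes nothing about the dataset: the payload is synthetic, 37 ASCII bytes followed by a lone 0x80 byte, which is a continuation byte and can never begin a UTF-8 sequence.

```python
# Synthetic payload (an assumption for illustration, not data from Multi30k):
# 37 ASCII bytes, then a single 0x80 byte at offset 37.
payload = b"x" * 37 + b"\x80"

try:
    payload.decode("utf-8")
except UnicodeDecodeError as err:
    message = str(err)      # human-readable description of the failure
    bad_offset = err.start  # byte offset of the first undecodable byte

print(message)
print(bad_offset)
```

The error therefore says that the bytes being read are not UTF-8 text at that offset, which usually means the reader was pointed at binary or differently encoded data.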
I hit the same error:
from torchtext.datasets import Multi30k
train, val, test = Multi30k(language_pair=("de", "en"))
# for de, en in train:
#     print(de)
#     print(en)
#     break
# for de, en in val:
#     print(de)
#     print(en)
#     break
for de, en in test:
    print(de)
    print(en)
    break
I found that the test dataset had been downloaded to my root directory, so I extracted it manually:
tar -zxvf mmt16_task1_test.tar.gz  # my solution to fix the error
My directory:
# ls
mmt16_task1_test.tar.gz test.de test.en test.fr train.de train.en training.tar.gz val.de val.en validation.tar.gz
Hope it helps :)

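The manual `tar -zxvf` step above can also be done from Python with the standard-library `tarfile` module. This is a minimal sketch, not part of torchtext; the function name `extract_archive` and the paths in the usage line are hypothetical.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive into dest_dir and return its member names, sorted."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = sorted(archive.getnames())
        archive.extractall(dest)
    return names
```

For example, `extract_archive("mmt16_task1_test.tar.gz", ".")` would unpack the archive into the current directory, assuming the file has already been downloaded there. Only extract archives from sources you trust, since `extractall` writes whatever paths the archive specifies.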
🐛 Bug
Describe the bug A clear and concise description of what the bug is.
Getting the following error while attempting to iterate through the testing set (the 3rd thing returned by torchtext.datasets.Multi30k(language_pair=("de", "en"))):
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 37: invalid start byte
To Reproduce Steps to reproduce the behavior:
torch torchaudio torchdata torchtext torchvision
UnicodeDecodeError
Expected behavior A clear and concise description of what you expected to happen.
There should be no error, but a printout of the data. For example, if instead of for thing in testing we did for thing in validation or for thing in training, then everything works as expected.
Screenshots If applicable, add screenshots to help explain your problem.
Not necessary; just observe the error.
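The comparison described under Expected behavior (training and validation iterate fine, testing fails) can be checked one split at a time. The helper below is a hedged sketch: `probe_splits` is a hypothetical name, and it works on any named iterables rather than on torchtext objects specifically.

```python
def probe_splits(splits):
    """Read the first item of each named iterable; map name -> 'ok', 'empty', or the error text."""
    results = {}
    for name, split in splits.items():
        try:
            next(iter(split))
            results[name] = "ok"
        except StopIteration:
            results[name] = "empty"
        except UnicodeDecodeError as err:
            # Record the failure instead of aborting, so the remaining splits still get checked.
            results[name] = f"UnicodeDecodeError: {err}"
    return results
```

With torchtext installed, a call such as `probe_splits({"training": train, "validation": val, "testing": test})` on the three objects from the report should show which split raises, assuming they behave as ordinary iterables.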
Environment
Please copy and paste the output from our environment collection script (or fill out the checklist below manually).
Additional context Add any other context about the problem here.