Can not load commonvoice dataset on windows #3781

jacobjennings · 2024-04-27T01:28:53Z

🐛 Describe the bug

When loading the Common Voice dataset on Windows, the file train.tsv is read using the cp1252 encoding (the platform default), which fails with a UnicodeDecodeError.

training_speech_dataset = torchaudio.datasets.COMMONVOICE(root=base_dataset_cache_directory)

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[49], line 1
----> 1 training_speech_dataset = torchaudio.datasets.COMMONVOICE(root=base_dataset_cache_directory)

File ~\Documents\GitHub\clarification\venv-pc\Lib\site-packages\torchaudio\datasets\commonvoice.py:55, in COMMONVOICE.__init__(self, root, tsv)
     53 walker = csv.reader(tsv_, delimiter="\t")
     54 self._header = next(walker)
---> 55 self._walker = list(walker)

File ~\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py:23, in IncrementalDecoder.decode(self, input, final)
     22 def decode(self, input, final=False):
---> 23     return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3155: character maps to <undefined>
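
The error can be reproduced outside torchaudio. The traceback does not show which character caused the failure, so the snippet below assumes U+201D (right double quotation mark) as one plausible candidate: its UTF-8 encoding ends in byte 0x9d, which Python's cp1252 codec leaves unmapped.

```python
# Assumption: U+201D stands in for whatever character is really in train.tsv.
raw = "\u201d".encode("utf-8")
print(raw)  # b'\xe2\x80\x9d'

# Decoding with the encoding the text was written in works.
decoded_utf8 = raw.decode("utf-8")

# Decoding the same bytes as cp1252 fails on the 0x9d byte.
try:
    raw.decode("cp1252")
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = exc
    print(exc)
```

This suggests the file itself is fine and only the encoding used to read it is wrong.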

Versions

Python 3.11


mogwai · 2024-05-03T12:22:03Z

You can try downloading it from Hugging Face instead:

https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0
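
If the local files are otherwise usable, another possible workaround is to read the TSV with an explicit UTF-8 encoding. The sketch below mirrors the `csv.reader` call visible in the traceback; `read_tsv_utf8` is a hypothetical helper, not a torchaudio function, and the sample file and its two columns are made up for the self-check.

```python
import csv
import os
import tempfile


def read_tsv_utf8(tsv_path):
    # Hypothetical helper, not part of torchaudio: same csv.reader usage as
    # in the traceback, but with the encoding stated explicitly.
    with open(tsv_path, "r", encoding="utf-8", newline="") as tsv_:
        walker = csv.reader(tsv_, delimiter="\t")
        header = next(walker)
        rows = list(walker)
    return header, rows


# Self-check on a made-up TSV that contains curly quotes (U+201C, U+201D).
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "train.tsv")
    with open(sample, "w", encoding="utf-8", newline="") as f:
        f.write("path\tsentence\nclip_0.mp3\tShe said \u201chello\u201d\n")
    header, rows = read_tsv_utf8(sample)

print(header)
print(rows)
```

Python's UTF-8 Mode (setting `PYTHONUTF8=1` before starting the interpreter, or running `python -X utf8`) makes `open()` default to UTF-8 and may avoid the error without code changes; this has not been verified against torchaudio here.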
