
Quantity of datasets doesn't seem like the right metric. The library just needs the datasets you care about, and both libraries have the popular ones. What's more important is integration, and if you're training custom TF models, tfds will generally integrate more smoothly than Hugging Face. A minimal sketch of what I mean is below.
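Roughly, the integration point is that tfds.load hands back a tf.data.Dataset, so it composes directly with the usual input-pipeline ops and model.fit. A minimal sketch, assuming TF/TFDS are installed (MNIST as a stand-in dataset, toy model):

    import tensorflow as tf
    import tensorflow_datasets as tfds

    # tfds.load returns a tf.data.Dataset, so the usual pipeline ops
    # (map/shuffle/batch/prefetch) compose without any conversion step.
    ds = tfds.load("mnist", split="train", as_supervised=True)
    ds = ds.map(lambda x, y: (tf.cast(x, tf.float32) / 255.0, y))
    ds = ds.shuffle(10_000).batch(32).prefetch(tf.data.AUTOTUNE)

    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
    model.fit(ds, epochs=1)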



I tried Librispeech, a very common dataset for speech recognition, in both HF and TFDS.

TFDS performed extremely badly.

First it failed because the official hosting server only allows 5 simultaneous connections; TFDS ignores that and opens up to 50 simultaneous downloads, which breaks. I wonder if anyone actually tested this?

Then the preparation step needs a machine with some 30GB available, and it can still fail. This is where I stopped: https://github.com/tensorflow/datasets/issues/3887. It might be fixed by now, but it took them 8 months to respond to my issue.

On HF, it just worked. There was a smaller issue with how the dataset was split up, but that has been fixed, and their response was fast and helpful.
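For comparison, the HF side was roughly this (a sketch; the librispeech_asr name, the clean config, and the train.100 split are as listed on their hub, and streaming=True sidesteps the full local preparation):

    from datasets import load_dataset

    # streaming=True avoids downloading and preparing the whole corpus locally.
    ds = load_dataset("librispeech_asr", "clean", split="train.100", streaming=True)
    sample = next(iter(ds))
    print(sample["text"])  # transcript; "text" field name per the hub card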


And also discoverability/searchability: how easy it is to find what you're looking for.


HF datasets have a much better UX for this.


UX preferences vary. IMO, HF is too verbose, and their pages try to cram in too much information with poor information hierarchy. For example:

https://huggingface.co/datasets/glue

https://www.tensorflow.org/datasets/catalog/glue


In this case, I don't see UIs; I see that HF has a single curated 1GB+ dataset, while TF has 10 separate GLUE datasets ranging from a few KBs to hundreds of MBs in size.

For Glue, HF wins, hands-down, IMO.
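Concretely, the organizational difference shows up in how you address a task in each library; a sketch using MRPC (identifiers as they appear in the respective catalogs):

    from datasets import load_dataset
    import tensorflow_datasets as tfds

    hf_mrpc = load_dataset("glue", "mrpc")  # HF: one "glue" dataset, task is a config
    tf_mrpc = tfds.load("glue/mrpc")        # TFDS: each task is its own catalog entry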


To each their own. I like that TF separates them since they are separate tasks and combining them is only one use case. At the end of the day we should just use what works best. The ML landscape is far from settled.



