I wish they'd clarified whether the claimed 5.1% human error rate is for "listen to this sentence once and transcribe it" or "study this recording however you like and transcribe it."

edit: They cover this in the arXiv paper:

>The transcription protocol that was agreed upon was to have three independent transcribers provide transcripts which were quality checked by a fourth senior transcriber. All four transcribers are native US English speakers and were selected based on the quality of their work on past transcription projects.

>...The transcription time was estimated at 12-14 times realtime (xRT) for the first pass for Transcribers 1-3 and an additional 1.7-2xRT for the second quality checking pass (by Transcriber 4). Both passes involved listening to the audio multiple times: around 3-4 times for the first pass and 1-2 times for the second.
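
To put those real-time factors in perspective, here's a rough back-of-the-envelope sketch. The 5-minute recording length is made up, and the helper and its `n_transcribers` parameter are my own; the quote doesn't say whether 12-14xRT is per transcriber or combined, so the default counts just one first pass plus the quality check.

```python
def transcription_minutes(audio_minutes, first_pass_xrt, check_pass_xrt, n_transcribers=1):
    """Human effort in minutes, given real-time factors (xRT) from the quote.

    n_transcribers is a hypothetical knob: set it to 3 if you read the
    12-14xRT figure as applying to each of Transcribers 1-3 separately.
    """
    return audio_minutes * (n_transcribers * first_pass_xrt + check_pass_xrt)

audio = 5.0  # hypothetical recording length in minutes, not from the paper

# Low and high ends of the quoted ranges: 12-14xRT first pass, 1.7-2xRT check.
low = transcription_minutes(audio, 12, 1.7)
high = transcription_minutes(audio, 14, 2.0)
print(f"{low:.1f} to {high:.1f} minutes of work for {audio:.0f} minutes of audio")
```

So even on the most charitable reading, that's over an hour of human effort for five minutes of audio, which is firmly in "study this recording however you like" territory.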

For anyone wondering what the recordings in the HUB5 2000 eval data (the test data) sound like: https://catalog.ldc.upenn.edu/desc/addenda/LDC2002S09.wav

God. Transcribing that would be mind-numbing. Glad computers are getting better at this.

