
I actually prefer the conceptual model the author suggests:

> Originally I wanted to call this function ghostmax, as you can think of there being an extra zero-valued entry in x (as exp(0)=1), as well as a zero vector in the V matrix that attenuates the result.

Don't think of this as weighting the options so that some of the time none of them is chosen. ("Weights that add up to less than 1.") Instead, think of this as forcing the consideration of the option "do nothing" whenever any set of options is otherwise considered. It's the difference between "when all you have is a hammer, everything looks like a nail [and gets hammered]" and "when all you have is a hammer, nails get hammered and non-nails get ignored".
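
As a concrete sketch of that framing (toy NumPy code of my own, with a made-up name, not the author's implementation): giving the softmax an implicit extra logit pinned at zero lets the weights sum to less than 1, and the missing mass is exactly the weight on the "do nothing" option.

  import numpy as np

  def softmax(x):
      # Standard softmax: weights always sum to exactly 1, so *some*
      # option is always chosen, however bad they all are.
      e = np.exp(x - np.max(x))
      return e / e.sum()

  def ghost_softmax(x):
      # Softmax with an implicit extra logit fixed at 0 (exp(0) = 1).
      # Equivalent to appending 0 to x, taking an ordinary softmax, and
      # dropping the weight assigned to that phantom entry, so the
      # returned weights can sum to less than 1.
      z = max(np.max(x), 0.0)      # shift for numerical stability
      e = np.exp(x - z)
      return e / (np.exp(-z) + e.sum())

  scores = np.array([-4.0, -5.0, -6.0])  # nothing looks like a nail
  print(softmax(scores).sum())           # 1.0   -- something gets hammered anyway
  print(ghost_softmax(scores).sum())     # ~0.03 -- almost all weight on "do nothing"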

I like this framing because, as an example, it bothers me that our speech-to-text systems use this method:

1. A human predetermines what language the input will use.

2. Audio in that language is fed to transcribing software.

3. You get, with modern technology, a pretty decent transcription.

3(a). ...if the audio sample was really in the language chosen in step 1.

If you ignore the choice of language and feed French audio to an English transcriber, you get gibberish. This is wildly at odds with how humans do transcription: the very first thing a transcriber who only knows English will do, when given French audio, is object, "hey, this is definitely not English."




Most STT systems also tend to still train on normalized text, free of the punctuation and capitalization complexities and other content you find in text LLMs. I suspect we continue in this way partly due to a lack of large-scale training resources and partly due to quality issues (Whisper being an outlier here). Anecdotally, 8-bit quantization of larger pre-normalized STT models doesn't seem to suffer the same degradation you see with LLMs, but I can't speak to whether that's due to this issue.


This seems like a good way to look at it. Another way to put it: there is a certain "origin" or "default" confidence which is pinned to a fixed value pre-softmax, i.e., all outputs are necessarily compared against that fixed value (pretending zero is another input to the softmax) rather than merely against each other.
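
A quick numerical check of that equivalence (throwaway NumPy, made-up inputs): treating zero as an extra input to an ordinary softmax gives the same weights for the real outputs as simply adding 1 to the denominator.

  import numpy as np

  x = np.array([1.0, 2.0, 3.0])

  # Ordinary softmax with an explicit zero "origin" appended to the inputs...
  e = np.exp(np.append(x, 0.0))
  with_origin = e / e.sum()

  # ...matches dividing exp(x) by 1 + sum(exp(x)) for the real outputs.
  attenuated = np.exp(x) / (1.0 + np.exp(x).sum())

  print(np.allclose(with_origin[:-1], attenuated))  # True
  print(with_origin[-1])  # weight captured by the fixed zero baseline (~0.032)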


I like your description because it's relatively succinct and intuitively suggests why the modified softmax can help the model handle edge cases. It's nice to ask: How could the model realistically learn to correctly handle situation X?



