
I actually prefer the conceptual model the author suggests:

> Originally I wanted to call this function ghostmax, as you can think of there being an extra zero-valued entry in x (as exp(0)=1), as well as a zero vector in the V matrix that attenuates the result.

Don't think of this as weighting the options so that some of the time none of them is chosen. ("Weights that add up to less than 1.") Instead, think of this as forcing the consideration of the option "do nothing" whenever any set of options is otherwise considered. It's the difference between "when all you have is a hammer, everything looks like a nail [and gets hammered]" and "when all you have is a hammer, nails get hammered and non-nails get ignored".
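
As a concrete sketch of that framing (toy NumPy code of my own, with a made-up name, not the author's implementation): giving the softmax an implicit extra logit pinned at zero lets the weights sum to less than 1, and the missing mass is exactly the weight on the "do nothing" option.

  import numpy as np

  def softmax(x):
      # Standard softmax: weights always sum to exactly 1, so *some*
      # option is always chosen, however bad they all are.
      e = np.exp(x - np.max(x))
      return e / e.sum()

  def ghost_softmax(x):
      # Softmax with an implicit extra logit fixed at 0 (exp(0) = 1).
      # Equivalent to appending 0 to x, taking an ordinary softmax, and
      # dropping the weight assigned to that phantom entry, so the
      # returned weights can sum to less than 1.
      z = max(np.max(x), 0.0)      # shift for numerical stability
      e = np.exp(x - z)
      return e / (np.exp(-z) + e.sum())

  scores = np.array([-4.0, -5.0, -6.0])  # nothing looks like a nail
  print(softmax(scores).sum())           # 1.0   -- something gets hammered anyway
  print(ghost_softmax(scores).sum())     # ~0.03 -- almost all weight on "do nothing"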

I like this framing because, as an example, it bothers me that our speech-to-text systems use this method:

1. A human predetermines what language the input will use.

2. Audio in that language is fed to transcribing software.

3. You get, with modern technology, a pretty decent transcription.

3(a). ...if the audio sample was really in the language chosen in step 1.

If you ignore the choice of language and feed French audio to an English transcriber, you get gibberish. This is wildly at odds with how humans do transcription: the very first thing a transcriber who only knows English will do, when given French audio, is object, "hey, this is definitely not English."




Most STT systems also tend to still train on normalized text, free of the punctuation and capitalization complexities and other content you find in text LLMs. I suspect we continue in this way partly due to a lack of large-scale training resources and partly due to quality issues (Whisper being an outlier here). Anecdotally, 8-bit quantization of larger pre-normalized STT models doesn't seem to suffer the same degradation you see with LLMs, but I can't speak to whether that's due to this issue.


This seems like a good way to look at it. Another way to put it: there is a certain "origin" or "default" confidence which is pinned to a fixed value pre-softmax, i.e., all outputs are necessarily compared against that fixed value (pretending zero is another input to the softmax) rather than merely against each other.
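
A quick numerical check of that equivalence (throwaway NumPy, made-up inputs): treating zero as an extra input to an ordinary softmax gives the same weights for the real outputs as simply adding 1 to the denominator.

  import numpy as np

  x = np.array([1.0, 2.0, 3.0])

  # Ordinary softmax with an explicit zero "origin" appended to the inputs...
  e = np.exp(np.append(x, 0.0))
  with_origin = e / e.sum()

  # ...matches dividing exp(x) by 1 + sum(exp(x)) for the real outputs.
  attenuated = np.exp(x) / (1.0 + np.exp(x).sum())

  print(np.allclose(with_origin[:-1], attenuated))  # True
  print(with_origin[-1])  # weight captured by the fixed zero baseline (~0.032)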


I like your description because it's relatively succinct and intuitively suggests why the modified softmax can help the model handle edge cases. It's nice to ask: How could the model realistically learn to correctly handle situation X?



