I don't know if I'm too concerned about that, to be honest. Yeah, there's a huge cost in terms of training the LLMs, and then there can be a cost for downstream inference, but I think it depends on the use case. In some cases, performance is the absolute top priority; in other cases, you might be willing to trade off some performance for better inference time, or model size, etc. If you need to put the model on cell phones or offline low-power devices, that's a key constraint that might make you reach for a different tool.
The nice thing about more "classical" approaches -- a simple bag-of-words representation feeding a random forest or MLP, for example -- is that they're typically quick to train and experiment with, and they make for great baselines, if nothing else. So I doubt we're in danger of people forgetting about them entirely. If they do, they're leaving quick, easy solutions on the table.
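Just to be concrete about what I mean by a quick baseline, here's a minimal sketch with scikit-learn -- the toy data, labels, and hyperparameters are purely illustrative, not a recommendation:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Toy data just to show the shape of the thing; swap in your own corpus and labels.
texts  = ["refund my order", "card was charged twice", "where is my package",
          "package never arrived", "update my billing address", "track my shipment"]
labels = ["billing", "billing", "shipping", "shipping", "billing", "shipping"]

baseline = Pipeline([
    ("bow", TfidfVectorizer(ngram_range=(1, 2))),                      # bag-of-words / n-gram features
    ("rf",  RandomForestClassifier(n_estimators=200, random_state=0)), # classical classifier on top
])

baseline.fit(texts, labels)                         # trains in seconds on a laptop
print(baseline.predict(["my parcel is late"]))      # -> likely "shipping"
```

Something like this gives you a floor to beat (and an error analysis to learn from) before you spend anything on an LLM.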
I do like your idea about triaging inference between smaller CPU models and larger GPU models based on whatever signals are available. I haven't tried that before, but a project my colleagues worked on did some triaging between regex pattern-matching and model inference. Basically, the regex pulled out the data that matched very specific, known patterns first, and the rest was handled probabilistically. I suspect the effectiveness of that sort of triaging depends on how strong and clear the signals are for choosing one path over the other.
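To make the regex-vs.-model triage concrete, here's roughly the shape of the routing I'm describing -- the order-ID pattern and the `model.predict` interface are placeholders I'm assuming for illustration, not what my colleagues actually built:

```python
import re

# Very specific, known pattern handled deterministically on the cheap path.
ORDER_ID = re.compile(r"\border[ #]?(\d{6,})\b", re.IGNORECASE)

def extract_order_id(text, model):
    """Triage: try the cheap regex path first, fall back to model inference."""
    match = ORDER_ID.search(text)
    if match:
        return match.group(1), "regex"      # fast, deterministic, no GPU needed
    # Everything that doesn't match falls through to the probabilistic path.
    return model.predict(text), "model"     # `model` is a hypothetical wrapper around your classifier/LLM
```

The same skeleton works for your CPU-vs.-GPU idea: swap the regex test for whatever signal (input length, confidence score from the small model, etc.) decides which path to take.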