Can generalist foundation models beat special-purpose tuning? (arxiv.org)
126 points by wslh 9 months ago | 64 comments



Not strictly related to their study, but if we consider Solomonoff induction to be the most general model of all (albeit not computable), then I’d say yes to the question, if only because it will select the most specialized model(s) for the particular problem at hand automatically.

One could argue universal general intelligence is simply the ability to optimally specialize as necessary.

I think one aspect that is overlooked when people are involved is that we create specialized or fine-tuned models precisely for the reason that a general approach didn’t work well enough. If it had, we would have stopped there. So there’s a selection bias in that almost all fine-tuned models are initially better than general models, at least until the general models catch up.


Your argument has a very concrete example. We are a generalist intelligence. And when we encountered chess, we eventually developed specialist intelligence to be the best at chess.


General intelligence builds tools as required, maybe having a whole bunch of them built-in already. Maybe the ones that are most frequently useful. Maybe a more general intelligence can build more of these tools into itself as required.

It's an interesting philosophical question how to ensure objectives aren't affected too much. You can run this thought experiment even as a human: there are parts of our goal system we might want to adjust, and parts we'd be horrified to touch, being very reluctant to risk affecting them even accidentally.


This is another level though. GPT-4 won't do that. But maybe whatever is coming next can.


There are lots of ways to define “good enough” as well. What are the costs of running inference on several small experts contributing to a decomposed workflow versus using GPT-4, for example? If you want to run multiple instances for different teams or departments, how do the costs escalate? Do you want to include patient data, and do you have concerns about using a closed-source model to do so? Etc.

There’s little doubt that GPT-4 is going to be the most capable at most tasks, either out of the box or with prompt engineering as here. But that doesn’t mean it’s the right approach to use now.


I assume the next step is to find the right model size for semantic specialism, then create a model of models that filters queries down to specialized models.

The same will happen for temporal information that stratifies and changes, i.e., asking what a 1950s scientist understands about quantum physics.


I don't understand the reasoning. If you merged two of the most specialized models together and simply doubled the size/parameter count, there is no reason the smaller one would generalize better than the larger one.

It would simply have two different submodels inside its ubermodel.


The answer to most of these kinds of questions of generalist versus specialist when it comes to AI, according to Richard Sutton's "The Bitter Lesson"[1], is that specialized AI will win for a while, but generalized AI will always win in the long term because it's better at turning compute into results.

Whether or not that will pan out with "fine tuned" versus "generalized" versions of the same data-eating algorithms remains to be seen, but I suspect the bitter lesson might still apply.

1: http://www.incompleteideas.net/IncIdeas/BitterLesson.html


Well, the counterargument is that with an equal amount of compute, specialized models will win over generalized ones. The reason is that generalized ones have to spread out their weights to cover a ton of things irrelevant to the niche subject, which a fine-tuned model can handle with fewer and more focused weights.


That's probably true in terms of benchmark problems, but at the same time, losing knowledge of seemingly irrelevant things could make the specialised model less creative when it comes to looking for unexpected causes and links.

It would also be less robust in the face of exceptional situations that should make it doubt the reliability and relevance of its own training.

For instance, an autonomous driving system that doesn't know the first thing about the zombie apocalypse could put its passengers' lives at risk by refusing to run over "pedestrians".

A specialised diagnostic system might not notice signs of domestic violence that a GP would see. Being able to connect the dots beyond the confines of some specialised field is extremely useful in many situations.


> For instance, an autonomous driving system that doesn't know the first thing about the zombie apocalypse could put its passengers' lives at risk by refusing to run over "pedestrians".

I think that argument works better the other way. I'm outspoken in arguing that many of the edge cases in driving are general intelligence problems we're not about to solve by optimising software over a few billion more miles in a simulator, but I don't want that intelligence so generalised that my car starts running down elderly pedestrians because there's been a lot of zombie literature on the Internet lately. I'm pretty confident there's more risk of people dying from that than from a zombie apocalypse.


> For instance, an autonomous driving system that doesn't know the first thing about the zombie apocalypse could put its passengers' lives at risk by refusing to run over "pedestrians".

I'd prefer my self-driving car not to learn about the zombie apocalypse, lest it starts running over pedestrians.


I'd argue that yours is not a counterargument, but a rephrasing of what the grandparent said. It's obvious that with the same amount of computing power, specialized > generalized. But in 2-5 years, future generalized models may beat the current specialized ones.


The future of this is the same as what happened with PCs. The specificity will initially be thrown away, added back later as an accelerator, and eventually, brought back into the fold as an automatically used accelerator. It all comes full circle as optimizations, hinting, tiering, and heuristics.

AI will be used to select which net to load automatically. Nets will be cached, branch-predicted, etc.

The future of AI software and hardware doesn't yet support the scale we need for this type of generalized AI processor (think CPU but call it an AIPU.)

And no, GPUs aren't an AIPU; we can't even fit some of the largest models whole on these things without running them in pieces. They don't have a higher-level language yet, like C, that would compile down to more specific actions after optimizations are applied (not PTX/LLVM/CUDA/OpenCL).


No, the GP is stating that the short term advantage will always exist, for fundamental reasons. It's just short term because it has to be reacquired often, not because there stops being an advantage.

In 5 years, the specialized model would still beat the generalized ones. They would just be different from the ones today.


Couldn't you apply the same prompt engineering tricks to the specialized model that their system beats?


Well, I'm not sure how valid the entire thing is. It's not clear how much cross-pollination there is between those datasets and the ChatGPT training set (IMO, if I were training something like it, I'd take any specialized dataset I could find), or how useful answering those questions is to any real task.

Also, it's not only the prompt engineering. Training ChatGPT was a decades-long process, with billions of dollars invested, and it takes a small cluster to run the thing. How would the other models compare if given similar resources?

Besides that, in the context of this thread, the bitter lesson has absolutely no relation to the specialized vs. general-purpose model dichotomy. It's about the underlying algorithms, which for this article are all very similar. (And it's also not a known truth, and it looks less and less like an absolute truth as deep learning advances, so take it with a huge grain of salt.)


Wouldn't it become more and more difficult to improve the specialist model? And wouldn't the ultimate generalized models (AGI) eventually be able to improve themselves automatically? Or am I talking about something different here?


This is an empirical observation regarding implementation details of machine learning models specifically, not math models in general. Clearly, in principle, specialized models can beat general models simply because at least some applications have exact models. No learned functional approximator will beat an ALU or even an abacus at simple arithmetic.


Well said, and fully agree. If you horse race two approaches, you can of course arbitrarily arrive at a winner based solely on the version of each approach you choose. You need a deeper look if you want to generalize.


I think it might be stretching the Bitter Lesson a bit to extend it to cover fine-tuning, since the original idea was addressing something very different.

Anyway, even if generalist models are truly better at everything, they’ll never be faster or cheaper than their specialist counterparts, and that matters.


In Sutton's essay he's talking about more general techniques vs ones that build in expert knowledge. In the case of this paper it's the same technique in both the fine-tuned and "general" cases. So Sutton's argument doesn't apply here. This is entirely about training data.

The big flaw in this paper, to me, is that no one knows what GPT-4 has been trained on. It could include all the same training sources as the specialized models for all we know. If that's the case, it's a bit curious that it requires prompting to get the accuracy, but again, we'll never know.

I'd go so far as to say this paper isn't even giving any useful information here, aside from the fact that GPT-4 can be made to work well in these specific domains with prompting.


I mean the whole point of the “bitter” part is that if you push most people who aren’t convinced of the data hypothesis, they will talk of a belief that there’s “something special” that happens to create intelligence that isn’t just raw number crunching (so to speak). It’s a human bias in how we view social hierarchy and expertise, and it ignores the epistemic/biological perspective.

It’s a philosophical position as much as a technical one


No, because the range of responses in the generalized model will tend in some asymptotic way to "patterns the machinery recognizes". Real knowledge is not like a machine process in every instance. Any detective novel has a moment when a faint scent or a distant sighting connects very different inputs to aid in solving a mystery. Lastly, knowledge has resonance and harmonics; great truths are paradoxes. Your bitter lesson is convenient at this point in the evolution of computing, not The Answer.


Good read, thanks for sharing.


This was something we thought about a bit at Rad AI.

I think one of the questions this paper ignores isn't whether you can get a general-purpose model to beat a special-purpose one, but whether it's really worth it (outside of the academic sense).

Accuracy is obviously important, and anything that sacrifices accuracy in medical areas is potentially dangerous. So everything I'm about to say assumes that accuracy remains the same or improves for a special-purpose model (and in general that is the case - papers such as this one talk about shrinking models as a way to improve them without sacrificing accuracy: https://arxiv.org/abs/1803.03635).

All things being equal for accuracy, the model that performs either the fastest or the cheapest is going to win. For both of these cases one of the easiest ways to accomplish the goal is to use a smaller model. Specialized models are just about always smaller. Lower latency on requests and less energy usage per query are both big wins that affect the economics of the system.

There are other benefits as well. It's much easier to experiment and compare specialized models, and there's less area for errors to leak in.

So even if it's possible to get a model like GPT-4 to work as well as a specialized model, if you're actually putting something into production and you have the data it almost always makes sense to consider a specialized model.


That's interesting; I'm currently working on an idea that assumes the opposite. Having built specialized models for years, I know the cost of having a data science team clean the data and build a model is pretty high, and it can take quite a while (especially if part of the project is setting up the data collection).

For prototyping and for smaller use cases, it makes a lot of sense to use a much more general model. Obviously this doesn't apply to things like medicine, etc. But for much more general tasks, like checking if someone is on the train tracks, counting the people currently queuing in a certain area, or detecting a fight in a stadium, I think multi-modal models are going to take over. Not because they're efficient or particularly fast, but because they'll be quick to implement, test, and iterate on.

The cost of building a specialized model, and keeping it up to date, will far exceed the cost of an LVM in most niche use cases.


I think it depends on how much you expect your model to be used, and how quickly the model needs to react. The higher either of those becomes the more likely you'll want to specialize.

If you expect your model to be used a lot, and you don't have a way to distribute that pain (for instance, having a mobile app run the model locally on people's phones instead of remotely in your data centers), then it ends up being a cost-balancing exercise. A single DGX machine with 8 GPUs is going to cost you about the same as a single engineer would. If cutting the model size down means you can reduce your number of machines, that makes increasing headcount easier. The nice thing about data cleaning is that it's also an investment - you can keep using most data for a long time afterwards, and if you're smart then you're building automated cleaning techniques that can be applied to new data coming in.


I'm curious to know what Rad AI ended up doing. IIRC the initial problem was how to turn a set of radiology notes into summary radiology notes with a specific format. Is that right?

If you were approaching this problem anew today, you'd probably try with GPT-4 and Claude, and then see what you could achieve by finetuning GPT-3.5.

And, yes, for a given level of quality, the fine-tuned GPT-3.5 will likely be cheaper than the GPT-4 version. But for radiology notes, perhaps you'd be happy to pay 10x even if it were to give only a tiny improvement?


I guess a question to ask is "What is GPT-4". Is it the algorithm, the weights, the data, or a combination of them all?

To put it another way, the researchers at Rad AI consumed every paper that was out there, including very cutting-edge stuff. This included reimplementing GPT-2 in house, as well as many other systems. However, we didn't have the same data that was used by OpenAI. We also didn't have their hyperparameters (and since our data was different, there's no guarantee those would have been the best ones anyway).

So with that in mind it's possible that Rad AI could today be using their own in house GPT-4, but specialized with their radiology data. In other words them using a specialized model, and them using GPT-4, wouldn't be contradictory.

I do want to toss out a disclaimer that I left there in 2021, so I have no insights into their current setup other than what's publicly released. However I have no reason to believe they aren't still doing cutting edge work and building out custom stuff taking advantage of the latest papers and techniques.


Why did "Case study in Medicine" get stripped from the submitted title? It's way more clickbaity now.


I think you answered your own question.


This is a very interesting study, but the generalization implied in the abstract -- that specialized models may not have advantages in specialized domains -- is an overreach for two reasons.

They introduce Medprompt, which combines chain-of-thought reasoning, supervision, and retrieval-augmented generation to get better results on novel questions. This is a cool way to leverage supervision capacity outside of training/fine-tuning! But they compare this strategy, applied to GPT-4, with old prompting strategies applied to MedPalm. This is apples and oranges -- you could easily have taken the same supervision strategy (possibly adapted for a smaller attention window, etc.) to get a closer comparison.
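For reference, the recipe as I read it looks roughly like the sketch below: kNN-selected few-shot examples, chain-of-thought prompting, and a majority vote over shuffled answer choices. This is a minimal sketch for a multiple-choice setting; embed, call_model, and the example fields are placeholders of mine, not the paper's code.

    import random
    from collections import Counter

    def cosine(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return num / den if den else 0.0

    def knn_few_shot(question, train_set, embed, k=5):
        # Dynamic few-shot selection: pick the k training examples nearest in embedding space.
        q_vec = embed(question)
        ranked = sorted(train_set, key=lambda ex: cosine(embed(ex["question"]), q_vec), reverse=True)
        return ranked[:k]

    def medprompt_answer(question, choices, train_set, embed, call_model, runs=5):
        shots = knn_few_shot(question, train_set, embed)
        votes = Counter()
        for _ in range(runs):
            # Choice shuffling: permute the options so the ensemble can't exploit position bias.
            shuffled = random.sample(choices, len(choices))
            prompt = "\n\n".join(
                "Q: %s\nReasoning: %s\nAnswer: %s" % (ex["question"], ex["cot"], ex["answer"])
                for ex in shots
            )
            prompt += "\n\nQ: %s\nOptions: %s\nLet's think step by step." % (question, ", ".join(shuffled))
            votes[call_model(prompt)] += 1   # assume the model returns one of the option strings
        return votes.most_common(1)[0][0]    # majority vote across shuffled runs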

Second, MedPalm is a fine-tuned 500B-parameter model. GPT-4 is estimated to be 3-4x the size. So the comparison confounds emergent capabilities at scale, in addition to the impact of prompting, with anything you can say generally about the value of model specialization.


Any sufficiently large library has a lot of specialty books in it.


You have got to be joking. This is not true in any real way.

source: six years in the book industry in California


What I want to know is how these systems compare to simple keyword search algorithms. E.g., presumably the people using these systems are medical professionals, typing in some symptoms and hoping to glean answers.

Theoretically, all this data could be cataloged in a way where you can type in symptoms x y z, demographic information a b c, and medical history d e f, and get a confidence score for a potential diagnosis based on matching these factors to results from past association studies. A system like that, I imagine, would be easier to feed true information into, and importantly it would also offer a confidence estimate for the result. It would also be a lot simpler computationally to run, I'd imagine, given the compute required to train a machine learning model versus keyword search algorithms you can often run locally across massive datasets.
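Something like this toy matcher is what I have in mind (the condition table and scoring are made up; a real system would weight factors by strength of association rather than using a flat overlap):

    # Toy keyword-matching diagnostic lookup: score each condition by how many of the
    # entered terms it matches, and report that overlap as a crude confidence score.
    CONDITIONS = {
        "influenza": {"fever", "cough", "fatigue", "aches"},
        "migraine": {"headache", "nausea", "light sensitivity"},
    }

    def rank_conditions(terms, conditions=CONDITIONS):
        terms = {t.lower() for t in terms}
        results = []
        for name, factors in conditions.items():
            overlap = terms & factors
            confidence = len(overlap) / len(factors)   # fraction of known factors matched
            results.append((name, confidence, sorted(overlap)))
        return sorted(results, key=lambda r: r[1], reverse=True)

    print(rank_conditions(["fever", "cough", "headache"]))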


I’m looking to run experiments like this for specific domains and prompting strategies as a way to showcase the next version of https://evals.phasellm.com. If the paper above resonates with you or if you have questions about prompting techniques, feel free to reply below or email me at hello #at# phaseai #dot# com.

Happy to help!


Depends on your definition of winning - special-purpose tuning is vastly more cost-effective since it allows you to train a smaller model that can perform specific tasks as well as a bigger one.

A good analogy is building a webapp - would you prefer to hire a developer with 30+ years experience in various CS domains as well as a PhD or a specialized webdev with 5 years experience at a tenth of the rate?


GPT-4 is not a foundation model; it's RLHF-tuned. Also, I didn't find the actual GPT-4 model revision, so yeah. Sloppy all around.


Really? "GPT-4 beats our tiny model" - that's your paper?


If I had to guess at what the next leap for AI would be, it looks something like a collection of small and large special-purpose models, with a model on top that excels at understanding which sub-models (if any) should be applied to the current problem space.
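As a rough sketch of what that could look like (hypothetical; classify and load_model stand in for whatever cheap router and model-loading machinery you'd actually use), with an LRU cache so frequently used specialists stay loaded:

    from collections import OrderedDict

    class ModelRouter:
        def __init__(self, classify, load_model, cache_size=2):
            self.classify = classify        # query -> domain label (e.g. from a small, cheap classifier)
            self.load_model = load_model    # domain label -> callable specialist model
            self.cache = OrderedDict()      # domain -> loaded specialist, evicted LRU-style
            self.cache_size = cache_size

        def __call__(self, query):
            domain = self.classify(query)
            if domain in self.cache:
                self.cache.move_to_end(domain)      # keep hot specialists resident
            else:
                if len(self.cache) >= self.cache_size:
                    self.cache.popitem(last=False)  # evict the least recently used specialist
                self.cache[domain] = self.load_model(domain)
            return self.cache[domain](query)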


I would think you could apply an inference process like that to a smaller model and get results better than you get with zero-shot on a larger model... And be able to afford to run the smaller model as many times as it takes.


There is a reason we use forks and not sporks. Specialization can give you exactly what you want.


We found that GPT-4 performs at a very high level on dermatology specialist exam questions (which it would not have been trained on):

https://www.medrxiv.org/content/10.1101/2023.07.13.23292418v...


Extensive medical information is publicly available on websites like www.nhs.uk. So even if it wasn't literally trained on exam questions, it may have picked up substantially identical information elsewhere.

Which is still a significant feat. But I doubt it's doing anything more than the sort of semantic lookup that is already well-known to be within its capabilities.


Why wouldn’t it have been trained on such questions?


Not publicly available


The abstract says the sample questions were publicly available. The point they are making is that the model wasn't specifically trained for dermatology.


What makes you think that the entire body of private access journals and books hasn’t been used?


The questions were publicly available. The point is the model wasn't specifically trained for dermatology.


A real test would be fine-tuning GPT-4 and comparing that to GPT-4 with Medprompt.


I think the point of the study was precisely to see how well the original model without adaptations (my understanding is that they only use prompting, which does not affect weights) can perform. I think that question is arguably even more interesting than 'How well can we train a model to perform on med school questions?'. I'm not saying this is generalization, because the training set surely included a lot of medical literature, but if the base model without fine-tuning can perform well in one important domain, that's a very interesting data point (especially if we find the same to hold true for multiple domains).


Caveat this with the fact that they did RAG.


The "RAG" part (over the training set) is by far the smallest contribution to the performance gains reported (see the ablation study in section 5.2). I don't think the model is actually learning in-context from the selected samples, but rather is continuing to be better conditioned to sample from the right part of the pre-training distribution here, which does a slightly better job when the samples are on topic (vs. random)


Yeah, the Medprompt name is misleading.


You can always fine-tune a generalist model to get better results.


Technical foul. Doing science based on closed-source model. 10 yard penalty for humanity.

GPT-4 almost disappeared last week. Aside from the obvious AI bus factor and "continuity of science" concerns, I find it absurd that so many papers and open source ML/LLM frameworks and libraries exist that simply would not work at all if not for GPT-4. Have we simply given up?

I thought this was hacker news.


I'm all in favour of doing reproducible work, especially against open-source ML (I'm the maintainer of a library for LLM inference), but what is the alternative here?

GPT-4 is still the flagship model "of humanity"; it is still the best model that is publicly available. The intent of the paper is to determine whether the best general model can compete with specialized models - how do you do that without using the best general model?


This is not answering that question. It is answering whether a specific large model with unknown training set can compete with specific small models on known smaller training sets.


Half of science comes down to poking at black boxes to see how they react. The universe is a closed-source model!


The other half is publishing results to fix that.


The black hole here is political though! The algorithm in question is here on earth.


Almost all modern science is gated by needing incredibly massive amounts of money.

'Closed source' science happens in every industry. While you may not think of what DOW is doing as science, it is, and huge companies like that progress through their internal workings.

What you're stating is something else: that maybe we should fund sciences such as AI more. But we did that in the past, and it had its own fits and starts. Then transformers came out of Google, and private industry has been leading the way in AI. If you want to keep up with the cutting edge and the tens of millions needed to train these models, you'll need to use these private models.

Hackers hack stuff from private companies all the time. Hackers use proprietary software. There is no purity test we have to pass.


Llamafile is (rightfully) at the top of HN right now.

I have high hopes that models will end up a bit like Linux servers, where everyone is building on open foundations.



