Falcon 40B LLM (which beats Llama) now Apache 2.0 (twitter.com/thom_wolf)
435 points by convexstrictly on May 31, 2023 | 141 comments



4-bit quantized versions that run on an A100-40G or 2x 3090/4090 24GB: https://huggingface.co/TheBloke/falcon-40b-instruct-GPTQ

Inference is very slow right now but it works!
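
For anyone who wants to try it, here's a minimal sketch of loading that GPTQ checkpoint with the auto-gptq package. The exact argument names vary between auto-gptq versions, so treat it as illustrative rather than exact:

  from transformers import AutoTokenizer
  from auto_gptq import AutoGPTQForCausalLM

  repo = "TheBloke/falcon-40b-instruct-GPTQ"
  tokenizer = AutoTokenizer.from_pretrained(repo)
  model = AutoGPTQForCausalLM.from_quantized(
      repo,
      device="cuda:0",          # or shard across GPUs instead
      use_safetensors=True,
      trust_remote_code=True,   # Falcon ships custom modeling code
  )

  inputs = tokenizer("Write a haiku about falcons.", return_tensors="pt").to("cuda:0")
  print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))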


How is it possible to run this model on 2x 4090s?

I thought that 4090s were "nerfed" and nvlink support removed - https://www.windowscentral.com/hardware/computers-desktops/n...



Without NVLink, different layers are loaded onto individual cards, and one card often has to wait for the other, slowing down generation. It's still faster than CPU offloading. You can even mix and match GPUs, like a 4090 + 3060.
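
Under the hood this is just accelerate's device_map doing naive pipeline sharding. A rough sketch (the max_memory caps are illustrative; 4-bit loading is what keeps the 40B weights within 2x 24GB):

  from transformers import AutoModelForCausalLM

  model = AutoModelForCausalLM.from_pretrained(
      "tiiuae/falcon-40b-instruct",
      load_in_4bit=True,                    # bitsandbytes 4-bit, ~0.5 bytes per weight
      device_map="auto",                    # accelerate assigns blocks of layers per GPU
      max_memory={0: "22GiB", 1: "22GiB"},  # leave headroom on each 24GB card
      trust_remote_code=True,               # Falcon ships custom modeling code
  )
  # Layers 0..k end up on GPU 0 and the rest on GPU 1; only one card is busy
  # at a time during generation, which is why it's slower than one big GPU.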


The hardware requirements of these models are basically at a fixed floor, and the democratisation will come from cheaper, possibly specialised, hardware, not reduced requirements, right?


I think there is still lots of room to reduce hardware requirements, such as 3-bit quantization or pruning weights from sparse models.

But if you're willing to spend $1500 on two used RTX 3090s, that's the sweet spot for running large models right now. Everything beyond that is much more expensive.


Multi-Query Attention, used here, should make 40B inference viable on systems where even 33B LLaMA with Multi-Head Attention is basically unusable, so improvements sometimes still come from software optimization (there's no free lunch, though).
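
Back-of-the-envelope for why MQA matters at inference time: all query heads share one K/V projection, so the KV cache shrinks by roughly the head count. Illustrative numbers below, not Falcon's exact config:

  # KV cache = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes
  layers, heads, head_dim, seq_len, fp16 = 60, 64, 128, 2048, 2

  mha = 2 * layers * heads * head_dim * seq_len * fp16  # one K/V head per query head
  mqa = 2 * layers * 1 * head_dim * seq_len * fp16      # one shared K/V head

  print(f"MHA cache per sequence: {mha / 2**30:.2f} GiB")   # ~3.75 GiB
  print(f"MQA cache per sequence: {mqa / 2**30:.4f} GiB")   # ~0.06 GiB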


The minimum size of a model with equivalent performance is unknown (like so much else about LLMs), so requirements could be reduced.


Will be interesting to see if someone comes up with an ASIC or FPGA.


Why would you think there's an ASIC/FPGA design that significantly improves over GPUs, which are already specifically targeted at running large models? Where's the win?

The fundamental limit for hardware acceleration right now is the number of gates you can squeeze onto a die (or, alternatively, memory bandwidth).


Analog chips like what Mythic AI is developing seem like the next obvious leap forward for deploying inference broadly. An ASIC/FPGA wouldn't be much different from a GPU, and an ASIC seems like a brittle solution.

https://www.youtube.com/watch?v=GVsUOuSjvcg


This will require a C++ port first to run on e.g. Apple Silicon?


No, but to benefit from the ggml ecosystem, yes. Someone's taking a stab at it: https://github.com/nikisalli/falcon.cpp


This is using Hugging Face. It's Python. There's no reason it wouldn't run on ARM.

https://huggingface.co/tiiuae/falcon-40b

I'll also add that the fact something is in C++ doesn't mean it will run on ARM or that it can be compiled for it.


How can it be deployed on a Huggingface Space or Colab notebook?


Is there a model version that can be deployed using Hugging Face's tools?


For people who directly want to check the benchmark - https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...


Very cool but how does this compare to GPT-4 (before it was nerfed)?

I feel like the best benchmark atm is the orig gpt-4 version.


None of these models are even close to GPT 3.5, let alone GPT4.


Here's the discussion for that benchmark with some values.

https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...



*The API wasn't nerfed. ChatGPT-4 was most certainly changed.


Very very hard to believe.


It most definitely was. The chat output of GPT-4 is much faster now and much worse quality. If you go into the Playground and use the March 14 API (as opposed to the default GPT-4 API), it is high quality and slow. Yes, there are two GPT-4 API endpoints.
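
For anyone who wants to check this themselves, both snapshots were selectable by model name in the API at the time. A sketch with the then-current openai Python client (assumes OPENAI_API_KEY is set; "gpt-4-0314" is the dated March snapshot, "gpt-4" the rolling alias):

  import openai  # openai<1.0 style client, as it was at the time

  prompt = [{"role": "user", "content": "Prove there are infinitely many primes."}]

  for model in ("gpt-4-0314", "gpt-4"):
      resp = openai.ChatCompletion.create(model=model, messages=prompt)
      print(model, "->", resp.choices[0].message.content[:200])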


It most certainly was, despite what anyone from openai says. It’s not working the same.


The comments seem to conclude that it was actually nerfed, just not the API.


nerfed how? I didn't know this.


Maybe the comment was about safety-related training leading to performance loss. Sebastien used the "Unicorn benchmark" to visualize such nerfing; watch his talk at timestamp 26:22. Ref: https://youtu.be/qbIk7-JPB2c?t=1582



How does it compare to OpenAI's GPT4? It's not listed in the table.


Am I the only one who finds this very sketchy? They had the whole license thing, there's been some loud complaining by the HF CTO on social media that this model is not getting enough attention, and there are also press releases about how Falcon tops the "leaderboard":

https://www.morningstar.com/news/business-wire/2023052900504...

I've never seen this kind of "strategy" with an ML model before. Maybe I'm seeing something that isn't there...

It could be a question of not being used to seeing blatantly commercial advertising in places we're used to being about software. Feels like we're moving towards the bro-ification of AI.


Not sure I care to be honest. With an open license the community can now take this and roll with it.

What PR noise they make is secondary to that in my mind.


Yeah. When companies do genuinely good things for PR/selfish reasons, a society that rewards that with attention will thrive.

Cynics' logic is always a race to the moral bottom.


maybe attention IS all you need


For the segment of HN that isn't up to speed on ML this is a double entendre from a famous ML paper:

https://towardsdatascience.com/attention-is-all-you-need-dis...


I wish I'd thought of that.

Seriously though, I think it's mostly the opposite. How much advertising did Georgi Gerganov do for ggml / llama.cpp, and it's super popular. Maybe other people are just being more subtle, but I feel that generally merit stands out on its own and advertising is a poor substitute.


Probably PR to attract subsidies/investment. The UAE might be throwing more money at tech/innovation-branded projects than China.


I saw this, and I'm still highly suspicious about the future of these models and later attempts at monetization. Did they really just drop their 10% royalty thing and decide they'll just open source all their models now?

Another comment mentions llama may get an open license, and there are other emerging alternatives. In six months there will be lots of options. I would not spend my time building anything around a model that started in such a sketchy way.

It would be interesting to hear about why they decided to change their license and what their plans are for the future.

I admit Falcon really pissed me off, because they pretended they had an open source model when they really released a sleazy freemium thing.


Do you consider the Unity model "sleazy"? But I agree that it was poor form to call it open-source at the time. Sadly, it seems to be standard practice in the LLM space to release models with all kinds of restrictions on use and call them "open source".

Well now Falcon is open source. Given they are giving away something that was very expensive to train, I am grateful.


I don't know Unity's model. I call Falcon sleazy because the royalty stuff (plus a bunch of absurd related terms) was buried in the license while the headline said open source. I'm not against commercial products, nor products with free and premium tiers. This, otoh, just felt sleazy.


Here is a link to an explanation of the Unity model.

https://gamedevbeginner.com/is-unity-free/


The explanation is less interesting than the fact they chose not to be upfront in their PR about the limitations of their original license. Personally, I don't care what terms they choose - but I do care if they misrepresent them. Honesty is important, and they weren't.

But all that's void now that they've gone Apache.


Well, if it's Apache 2.0 now, can't you clone it as-is and have it be open "forever"?


That's part of why I'm interested in their motivation for changing licenses. They obviously had a profit motive and a business model centered on collecting rents on the model, and they decided to back off from it.

Will they keep maintaining their code and improving the models and continue releasing everything under Apache 2.0?

I'm just saying I feel like we got a glimpse of what they're about and it wasn't pretty, so why build around them when there are lots of options.


They don't need to do squat.

Look at Stable Diffusion 1.5 and LLaMA: they are thriving, but the original implementations are ancient history, and Meta/StabilityAI/RunwayML have done precisely nothing. And to be blunt, the legal status of those models is very murky, which already makes Falcon more attractive.


Agree. The Stable Diffusion Open RAIL M license: "You agree not to use the Model or Derivatives of the Model ... To defame, disparage or otherwise harass others"

Does "disparage" have a settled meaning in law? Is a parody disparaging?

https://github.com/CompVis/stable-diffusion/blob/main/LICENS...


> Look at Stable Diffusion 1.5 and LLaMA: they are thriving, but the original implementations are ancient history, and Meta/StabilityAI/RunwayML have done precisely nothing.

I mean, that's true of SD 1.5 in the sense that what the original creators have done since is new versions (SD 2.0, 2.1, currently SDXL, which is apparently another SD2-architecture model, and DeepFloyd). 2.1 has also seen some community uptake, and XL likely will once it is released, unless there's something inhibiting that. DF seems to be slowed by its different architecture and high resource cost, but I've seen posts about people integrating the DeepFloyd early-stage models with other models from the SD ecosystem for the last-stage upscaling and final rendering, so I wouldn't be surprised to see it integrated in some of the community UIs, both as an integrated workflow and with access to the individual models for mix-and-match workflows.


I dunno. There's some experimentation with 2.1, but the consensus seems to be that it produces inferior output to 1.5 outside of some niches, and that's before taking the 768x768 1.5 finetunes into account.

Deepfloyd is niche.

SDXL is indeed interesting, especially if it's happy with 4/8-bit quant... we will see about that.

Nevertheless, StabilityAI seems kind of disconnected from all the innovation going on in the community compared to, say, Hugging Face.


It's a faulty consensus that came from people comparing outputs from the base 2.1 model with fine-tunes of SD 1.5. SD 2.1 is actually a far superior model once fine-tuned.

I agree that it's a shame that StabilityAI seem to struggle so much to actually leverage their community (ideally with much more open development)... One could say they're a little too "full of themselves" and think they know better than everyone else.


Very good point, if a community builds up around it, that would make it much more attractive. So chicken / egg in a way. I can definitely see that happening.


Yeah. TBH the biggest hurdle is that Falcon is a different architecture, so the existing tooling won't "just work." A drop-in LLaMA or RWKV replacement has a better chance of catching on, even if it leaves some architecture improvements on the table.


This is such a bizarre comment to make about something that is Apache licensed. Who cares what happened before now that it’s truly open?


This Falcon-40B royalty-free license may force Meta's hand ... LLaMA-7B/13B may soon be fully open sourced, as Meta wants open source LLM advancements and contributions to happen on its own LLM architecture.


Yann LeCun: "No. But it's not because we don't want to. It's because of complicated legal issues."

https://twitter.com/ylecun/status/1651782621540524032


Complicated legal issues of having the cake and eating it too.


And also the potential, unprecedented legal issues that accompany releasing and defending a free/open model. The brownie points they'd receive aren't worth it, at least yet.


GPT2 was released open. It’s been years. Microsoft’s CELA has vetted it. If they are thumbs up, I don’t see why others would be averse. Unless it’s not the reason. Methinks it isn’t.


Why do people think that Meta released their model in order to get open source coders to improve their models? They will get absolutely no competitive advantage from this. Every other team developing a closed source LLM can easily copy the innovations that open source coders have applied to Llama on their own, closed source models.

There's no advantage here. Meta just spent $10 million on releasing fun chaos into the world and increasing their recruiting power.


Meta's most valuable asset is their users, not their technology, so giving away technology is incidental to them. It's not the UI or superior features that makes Meta, Instagram, etc such powerful platforms, it's the network effect.

ChatGPT was the fastest growing app in history, leaders at Meta (The ones who do M&A, strategy, etc) probably raised an eyebrow. They don't really give a crap about some stupid talking chatbot, but OpenAI getting smart and building a Social Network around millions of brand new users could be an existential problem for them. When Lecun wanted to OSS it they were probably like, sure, we can kill a few birds with one stone. If LLMs are a commodity that stops OpenAI and Google before they even get off the ground.


I like how you completely hallucinated a story about Meta, Lecun, and ChatGPT


Yeah, but some of the innovations being made at OpenAI will be replicated by the open source community, and Meta can use those for free.

They aren't looking to create a technological edge themselves, they want to remove the edge that OpenAI has so that they can win using their user count/brand recognition/etc.


I'm also not following this

I believe it's more to make sure that others also continue to share their research.

Or also a general genuine good mindset of people involved in those groups.


Research in the LLM space is moving so fast, I doubt any architecture will "stick".


Meta's eventual, internal LLM architecture will be totally different from the open source LLMs, right? They don't need to run on commodity hardware.

Maybe I am cynical, but I don't see the incentive for Meta to contribute an open model.


To be fair, they contributed Pytorch which has defined a whole industry and is responsible for creating hundreds of billions of dollars in value or more. Contributing a set of model weights is an extremely minor thing in the shadow of that, so wouldn't exactly be uncharacteristic.


That is a fair point.



Serious question - why doesn’t someone crowd source the funds to train a GPT scale model for open source? I assume it’s not just a matter of a ton of GPU instances?


Doesn't one need to have a bunch of "very good" data to train on? I'm under the impression that sourcing costs are large.


Yes, and the labor of Reinforcement Learning from Human Feedback (RLHF)... is anyone crowdsourcing this?


Anthropic and Databricks have made datasets available. OpenAssistant is crowdsourcing one.


These seem like things affordable with money and crowd sourcing of volunteer labor. In some ways it feels too valuable to leave in the hands of megacorps.


For fast inference, Hugging Face co-founder Thom Wolf recommends their text-generation-inference library: https://github.com/huggingface/text-generation-inference

https://twitter.com/Thom_Wolf/status/1664613366751526914
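
Once a text-generation-inference server is running, it's just an HTTP endpoint. A minimal sketch of hitting its /generate route (request shape as I recall it from the TGI docs, so double-check against the repo; assumes the server is serving Falcon on localhost:8080):

  import requests

  resp = requests.post(
      "http://localhost:8080/generate",
      json={
          "inputs": "What is the capital of the UAE?",
          "parameters": {"max_new_tokens": 100, "temperature": 0.7},
      },
  )
  print(resp.json()["generated_text"])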


In terms of building something that's usable (considering cost, speed, scale, etc.), if I compare an OpenAI API call to these models, it's difficult for me to see a current path where these have any viable application outside some niche scenarios.

From what I understand, even to run locally you/your team needs to be able to afford a machine with a 4090. These are super expensive in some countries.

I played around with the smaller Llama/Alpaca models and it wasn't really viable to build anything with.

Not really seeing a use-case for fine-tuning either compared to just few-shot prompting.

Can someone fill me in on what I'm missing? It feels like I'm out of the loop


I'm running Vicuna on a free 4core Oracle VPS, and it's perfectly usable for a Discord bot. Responses rarely take more than 15 seconds with <256 max token limit, and the responses are much more entertaining than GPT 3.5. I'm not using the streaming API my server software[0] offers, but if I did it would probably load somewhere between the speeds of GPT-3.5 and GPT-4. It's more or less the same time a human would take to compose the same message.

So... not exactly a serious use-case. But it's what I'm using, and now I'm saving 10s of dollars on inferencing costs per month!

[0] https://github.com/go-skynet/LocalAI

I'm also using this to improve acceleration - https://cloudmarketplace.oracle.com/marketplace/en_US/adf.ta...
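
For anyone wondering how the bot talks to LocalAI: it advertises an OpenAI-compatible API, so the standard openai client works if you point it at the VPS. A sketch, where "vicuna" stands in for whatever name the model is registered under in your LocalAI config:

  import openai

  openai.api_base = "http://localhost:8080/v1"  # LocalAI's OpenAI-compatible endpoint
  openai.api_key = "not-needed"                 # LocalAI doesn't check the key

  resp = openai.ChatCompletion.create(
      model="vicuna",  # hypothetical name; matches your LocalAI model config
      messages=[{"role": "user", "content": "Write a one-line Discord greeting."}],
      max_tokens=128,
  )
  print(resp.choices[0].message.content)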


TIL Oracle has VPS offerings with a free tier. Are they any good? Is the free-tier time limited?

This use-case is alright for a toy I guess - which is the extent that I was originally expecting these things to be useful for.


They're okay. This isn't the place for a full review of their offerings (especially considering everyone's mixed feelings on Oracle), but I'm confident that it's better than most 1core/$5 deals you'll find elsewhere.

> Are they any good?

Yep, free tier allows you to spec up to 24gb of RAM without paying, which is cool. The bottleneck is really the disk speed, but that's not an issue with mmaped models. There's enough cached memory that it loads instantly, so it's good-ish for this use case.

> Is the free-tier time limited?

No, but there are a lot of strings attached:

- The cores are vCPUs, not dedi (duh)

- You can't create new instances when demand is high (unless you add a credit card)

- Technically Oracle reserves the right to shut down the instance if demand gets really high (although I haven't heard any stories about this personally)

Proceed with caution. It's still a great place to start before you shell out $1/hr for dedi GPU rackspace.


They will shut down the VPS if there's no activity on it, not sure how they detect this though


CPU idle % over a rolling time window (say, 15 minutes).


Now I’m curious what your bot does!


It's one of those "say a keyword with a question, get a response" type bots. I added in a few other "prompt sources" though, where it grabs the first part of an RSS entry or HN comment and tries to autocomplete the rest. Mostly just a boring testbed for me to play with models, for free, with friends.


Fine-tuning is a much better proposition than you’re giving it credit for. Papers are coming out demonstrating that 7B parameter models can outperform GPT-4’s quality when trained on a limited set of tasks. Yet, a 7B model offers comparatively cheap and fast inference. Furthermore, for a lot of use cases, few-shot prompting is infeasible because you need to supply 2-3k tokens worth of few-shot examples with every prompt in order to fully specify the behavior you want. (As an example, think of long-form summarization where you want the summary to adhere to certain rules.)


What kind of tasks? Could you give some links to the papers you're referring to?


Sure, here are two:

1. Goat: Fine-tuned LLaMA Outperforms GPT-4 on Arithmetic Tasks

https://huggingface.co/papers/2305.14201

2. Gorilla: Large Language Model Connected with Massive APIs

https://arxiv.org/abs/2305.15334

Consider also these 2 papers supporting the feasibility of fine-tuning:

3. LIMA: Less Is More for Alignment [showing that a very small number of high quality examples is sufficient to align a base model]

https://arxiv.org/abs/2305.11206

4. QLoRA: Efficient Finetuning of Quantized LLMs [showing that LLMs can now be fine-tuned quickly on consumer-grade GPUs]

https://arxiv.org/abs/2305.14314

—-

Adding up these developments (all of which occurred during the span of one week), I don’t see how huge, slow, general-purpose models maintain their relevance in the long term, when a lean, domain-focused model is right there within reach of every application developer.


Sorry I should have been more specific - I was limiting my question to the bigger models. The smaller (~7B) models are feasible with these approaches.


One benefit of finetuning larger models, like 65B, is to free up limited context space vs few-shot prompting.

If you want a specific kind of interaction with the model, you could take up a third of the 2048-token context window with few-shot examples, or you could simply finetune it with QLoRA for a few hours on a consumer GPU and then get to use the full 2048 context with the finetuned model.
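
A rough sketch of what that QLoRA setup looks like with transformers + peft (the target_modules entry is my assumption for Falcon's fused attention projection; adjust for your model, and the 7B base is used here just to keep it small):

  import torch
  from transformers import AutoModelForCausalLM, BitsAndBytesConfig
  from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

  bnb = BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_quant_type="nf4",
      bnb_4bit_compute_dtype=torch.bfloat16,
  )
  model = AutoModelForCausalLM.from_pretrained(
      "tiiuae/falcon-7b", quantization_config=bnb,
      device_map="auto", trust_remote_code=True,
  )
  model = prepare_model_for_kbit_training(model)

  lora = LoraConfig(
      r=16, lora_alpha=32, lora_dropout=0.05,
      target_modules=["query_key_value"],  # assumed name of Falcon's attention proj
      task_type="CAUSAL_LM",
  )
  model = get_peft_model(model, lora)
  # Then train with a normal Trainer loop; only the small LoRA adapters are
  # updated, which is what lets this fit on a single consumer GPU.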


Ah. Perhaps the larger models will find use in in-house deployments where companies want their employees to have access to ChatGPT-like general purpose assistants, but want to prevent data from leaving their premises. LIMA shows the LLaMA 65B hitting a quality level somewhere between DaVinci-003 and GPT-4 with minimal alignment, so the base models are probably powerful enough already for this to work. Just speculating.


Wow, thanks a lot! :)


>From what I understand, even to run locally you/your team needs to be able to afford a machine with a 4090. These are super expensive in some countries.

That's because we're only half a year into LLMs becoming mainstream. Give it 3-4 years. The advancements in bringing down model size, optimizations, and newer GPUs, SoCs from Nvidia, AMD, Apple, Intel, Qualcomm, etc will make it so that top LLMs will run on a highend laptop/desktop.


This is bleeding edge stuff.

All advances in this direction do indicate that it will be easier and easier for more people to do things with it.

This doesn't need to work for everyone.

A 4090 costs $2k today; the 3090, which also has 24GB, costs $1k today and used to cost $2k.


Any particular reason not to run this model on a single Jetson AGX Orin 64GB?

GPU core count is much lower than the 4090's, but it still does 275 int8 TOPS for only $2k.


I'm the wrong person to ask, but performance-wise a 4090 has over a petaflop, and a Google search also showed a factor of 3-4 advantage for a 3090.


The relevant performance metric here is memory amount and bandwidth.



Not specific to this model, but beyond the large players (OpenAI, Cohere, etc) are there any free hosted versions of the open(ish) LLMs? Even the smaller 7B parameter ones? I'm prototyping out a project and using OpenAI for now, but it feels like there has to be a hosted alternative somewhere.

I spent some time today exploring Hugging Face's Inference API, but if the model is sufficiently large (>10GB), HF requires you to use their commercial offerings.
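
For smaller models the hosted Inference API is just a POST with your HF token. A minimal sketch (the 7B instruct model is an example; availability varies, and the token value is obviously a placeholder):

  import requests

  API_URL = "https://api-inference.huggingface.co/models/tiiuae/falcon-7b-instruct"
  headers = {"Authorization": "Bearer hf_..."}  # your HF access token

  resp = requests.post(API_URL, headers=headers, json={
      "inputs": "Explain quantization in one sentence.",
      "parameters": {"max_new_tokens": 60},
  })
  print(resp.json())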


> HF requires you to use their commercial offerings

Some of which are quite affordable ($80 per month). Larger ones can be like $2000 a month, which is still OK for the prototyping phase. You're basically paying for AWS/GCP infrastructure.

I quite liked the UX of it, very intuitive. My trouble was finding a model that executes out-of-the-box tho. All of the GPT ones crash on startup.



40B is pretty large, right? I expect it would take 70GB or so of RAM to run it. That's some expensive hardware ($10,000 or more).


Some people use second-hand P40 GPUs, which go for around $200-300. Combine 3 of them with SLI and you've got 72GB of VRAM for less than $1000.


I do use a P40 for my machine learning box, but I'm curious how you put three in the same system, given they each need a CPU power plug and a PCIe slot. Then, to cool them, you need to rig your own cooling system, requiring more specific power plugs to be available. What kind of chassis, motherboard, and power supply do you use to do that? It'll certainly cost more than $1000 anyway, especially since you also need a decent amount of RAM to preload the models before you move them to the GPUs.


http://nonint.com/ has some interesting posts about how he built a custom server to house 8 GPUs (3090s in this case). You're right that that will set you back more than $1000, though I was only referring to the GPUs themselves.


Woah, that's a cool direction. Thank you! I'll explore this.


P40s are kind of a meme. Running GGML models has roughly the same performance at a fraction of the wattage on a dual-channel DDR5 system.

I still use GPTQ for 30B, but even CPU generates quickly enough at q5_1 on modern hardware.


> https://github.com/ggerganov/llama.cpp/issues/1602#issuecomm...

Here somebody quantized it down to 29929.56 MB.


I briefly looked at prices a few days ago, I think you can rent an 80gb GPU for about $2.50 an hour. Then you just pay while you're using it.

Someone else might be better able to confirm the pricing, but in any case you don't need to purchase the hardware.


I'm less interested in using someone else's computer (not as much, but similar to how I'm disinterested in an API from someone like OpenAI); I would rather pay the upfront hardware cost than worry about how many tokens I generate (that kind of hinders creativity and excitement about it).


Stupid question, but for feed-forward models why do we not yet have some kind of CPU RAM swap mechanism? Why is PyTorch still trying to load the whole damn model into GPU RAM at once and then complaining when it can't, instead of swapping portions of the model to CPU RAM, or hell, even SSD?

Sure, it might be a lot slower, but that's a lot better than "I give up, go buy $20K worth of hardware"


That's what llama.cpp does, including offload to disk. It lets you run models as big as any combination of your VRAM, RAM, and disk. But in the end, if it doesn't fit into GPU VRAM, it will be slow.

For example, Guanaco-33B generates ~10 tokens per second running fully from the VRAM of my 3090, then ~1 token/second running from the DDR4 RAM of my Ryzen. I would imagine it would do about a token per minute from an NVMe SSD.
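
For what it's worth, the split is controllable: you choose how many layers sit in VRAM and the rest stream from RAM. A sketch via the llama-cpp-python bindings (the model file name and layer count are made up; tune n_gpu_layers to your VRAM):

  from llama_cpp import Llama

  llm = Llama(
      model_path="./guanaco-33B.ggmlv3.q5_1.bin",  # hypothetical local GGML file
      n_gpu_layers=40,  # as many layers as fit in VRAM; the rest run from RAM
      n_ctx=2048,
  )
  out = llm("Q: Why is partial offload still memory-bandwidth bound?\nA:",
            max_tokens=128)
  print(out["choices"][0]["text"])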


Is it memory/disk bandwidth bound or latency bound?


Memory bandwidth bound.


I haven't run the numbers, but I would expect that doing that would make it slower than just running on the CPU.


With 4-bit quantization it should take less than 40GB of VRAM/RAM.
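
Rough arithmetic behind that estimate (ignoring KV cache and activations; the per-group scale overhead figure is approximate):

  params = 40e9
  weight_bytes = params * 4 / 8           # 4 bits per weight -> ~20 GB
  overhead = params * 0.6 / 8             # ~0.5-0.7 extra bits for group scales/zeros
  print((weight_bytes + overhead) / 1e9)  # roughly 23 GB before KV cache/activations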


That much good RAM in itself isn't super expensive. So does the rest of the hardware have to be particularly powerful?


It has to be GPU RAM from my understanding, unless you're happy to wait several minutes/hours for each response.


From what I can find online LLAMA-65B 4-bit quantized can run 1 token/s on a Ryzen 7 3700X (using llama.cpp).


70GB of RAM would cost around $150 these days depending on how you get there. 64GB (2x32GB) of DDR4 is around $140 then another 8GB stick would be around $15.

Used DDR3 ECC would be roughly half that.


I think they are talking about VRAM


Anything that runs on GPU can be run on CPU, the only question is how much slower it's gonna be.


I've been running Vicuna-13B on a workstation that can be acquired from eBay for about $500. It's slow compared to online services, but probably slightly faster than text-to-speech would recite its output, so plenty.

Falcon 40B is probably too much for it, but apparently there's similar cheap hardware that could work.


I know, I have a weak GPU but 64 GB RAM. Using those models works, but it’s more "ask a question, then do something else for a while, while your fans spin up" ;)


oop you're right... I know once the model is trained they run on the CPU/regular RAM



Any implementation for this akin to llama.cpp?


Same question, so far I've found this thread https://github.com/ggerganov/llama.cpp/issues/1602 where people work on it.


Hypothetically speaking, if Falcon 40B could out-perform GPT-3.5, would that force OpenAI to open-source GPT-3.5?


Did Stable Diffusion "force" OpenAI or Google to open-source their imaging models? No. Don't see what the mechanism would be for that


No.

There no reason, particularly, to believe either that it does, or it would.

For openai to scramble and try to “catch up” with a competitor and make such a massive change in strategy would require someone to be offering an equivalent service (hosted inference) that was either orders of magnitude cheaper than their offering and just as good, or significantly better than it. Or legal compulsion.

This is none of those things. They won’t care.


It would, as open source improvements would start to exceed the performance of 3.5 for specific use-cases. At the very least they would have to make it fine-tunable.


Sam Altman said recently that they are already working on making GPT-3.5/GPT-4 fine-tunable; they are just limited by the availability of compute (partially since none of their SFT infrastructure uses LoRA).

I had previously assumed it was safety concerns, since I don't see what stops someone from finetuning away all guard rails.


I wish these releases had short but meaningful descriptions for their models.

E.g. 10 layers x 2048 tokens x 1024 embedding model using full attention.


I am an amateur when it comes to these models. What can I do with this model and how?


At the most basic level, you give it text and it guesses what comes next. This means you could type "10 types of ferns" and it will build a list of 10 ferns. Or you could type out what a transcript of a conversation would look like and it will fill in the other "side" of the conversation, making a chatbot (all the complicated chatbots are basically abstracting this). Think of it like a shared text box with a super-smart (arguably) person on the other end: you type one thing and it types whatever it thinks would come next.
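
Concretely, that "give it text, it continues it" loop is a few lines with the transformers pipeline. A sketch using the 7B instruct variant, since that's the easiest to actually run:

  from transformers import pipeline

  generate = pipeline(
      "text-generation",
      model="tiiuae/falcon-7b-instruct",
      trust_remote_code=True,
      device_map="auto",
  )
  print(generate("10 types of ferns:\n1.", max_new_tokens=80)[0]["generated_text"])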


It's interesting that the instruction-tuned model has lower performance on ARC and HellaSwag.


Oh wow. That’s great news.

Falcon seemed good till I read the license fine print about pre-approvals and whatnot. This seems to fix that.


I’m curious to know the real reason for the change. Is it related to the data they used for training?


If it turns out that there is a consistent architecture which works really well, how long before we see an ASIC?


As with everything, follow the money, pretty much.

There will be an ASIC as soon as serious money is being made from LLMs, most use cases atm seem to be in prototype/toy stage, but I imagine we'll start seeing that change.


Is this multilingual?


Chatting with Falcon 40B feels like chatting with GPT4, very capable model.


Can you give examples?


"As an AI model, I cannot give an example of me chatting with myself"



