LeCun: Qualcomm working with Meta to run Llama-2 on mobile devices (twitter.com/ylecun)
132 points by birriel on July 23, 2023 | 86 comments



This is pretty much to go head-to-head with Apple (and Samsung). Both are making leaps and bounds using “neural coprocessors” and the like for running models on mobile hardware. Mobile edge computing is where we’re going to see a lot of use cases that enable functionality while maintaining data privacy & performance.

Keep in mind “mobile devices” extends beyond just smartphones, to wearables/headsets as well.


> enable functionality while maintaining data privacy

Ah, but is it in (insert any ad-business big-tech company)'s best interest to do this? Or do they already have so much data that these user <> LLM interactions are only marginal, and they don't care about harvesting the data?


Fundamentally, if they think their users want this enough (for it to be profitable), they should be willing to give their customers what they want. Seems reasonably compatible with Apple’s business, for example.


I really hope Apple doesn't mess this up and includes a solid on-device LLM in iOS in the near future.

They have amazing chips, but Siri has been a subpar assistant since forever. Now is the time to redeem her.


I'd love to have my iPhone communicate with a Mac Studio at my house, for the heavy lifting. I realize this would be slower than having on-device processing, but it would be much better for battery life. And although I trust Apple's privacy more than Google/FB, I'd still rather keep my AI interactions separate from anyone's cloud.


You might be pleased to hear that nothing really stops you from doing this today. If you ran Serge[0] on a Mac with Tailscale, you could hack together a decently-accelerated Llama chatbot.

[0] https://github.com/serge-chat/serge
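
A minimal sketch of what that could look like, assuming you expose a llama.cpp-style HTTP server on the Mac and reach it over Tailscale; the machine name, port, and /completion endpoint below are illustrative, and Serge's own API may differ.

    # Minimal sketch: query a llama.cpp-style HTTP server running on a Mac
    # over Tailscale from any device on the same tailnet. The machine name,
    # port, and /completion endpoint are assumptions; Serge's own API may differ.
    import requests

    MAC_HOST = "http://my-mac-studio:8080"  # hypothetical Tailscale machine name

    def ask(prompt: str) -> str:
        resp = requests.post(
            f"{MAC_HOST}/completion",
            json={"prompt": prompt, "n_predict": 128},
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json().get("content", "")

    if __name__ == "__main__":
        print(ask("Summarize why on-device LLMs matter for privacy:"))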


I'm not technical enough to be able to hack this together, but I do hope that enough other people have the same itch, and are able to scratch it!


I’d love to see both options and a seamless transition between them. An option tuned for home use that utilizes the processing power of home devices and local networks, and an option tuned for on-the-go use utilizing the iPhone/iPad processors and mobile networks.


> it would be much better for battery life.

I wonder what the numbers actually are for local compute on custom hardware compared to firing up the wifi antennae to make the remote request.


Yeah I have wondered about this. But seeing how an LLM hammers my M2 MBA CPU for many seconds per request, I’m guessing this would have a significant impact on a smartphone battery.


ANE power usage for always-on Hey Siri wake word detection is impressively low. The language models we have are orders of magnitude too big, but they won’t be for long. I think we’ll be surprised.

For comparison, a quantized mobilenetv2 takes 3.7MB disk space to solve the same image understanding task as the old VGG Caffe models which take 550MB. We’ve come a long way in six years.


Good language models are NOT easy to run, so I’d imagine local compute would take many orders of magnitude more power


They are adding their first LLM in iOS 17's keyboard.


This will be great. Hopefully it will be able to figure out that it should capitalize my last name, which is on my Contact card, and that of a dozen of my relatives. When I went to the Apple Store, they told me that I could either add a custom autocorrect to my library, or reset all settings. They did admit this was some sort of bug, and that it would be massive overkill to reset all settings (lose all wifi passwords, etc.).


> or reset all settings

> that it would be massive overkill to reset all settings (lose all wifi passwords, etc.).

I don't think that's what anyone was recommending...

Settings -> General -> Transfer or Reset iPhone -> Reset -> Reset Keyboard Dictionary is almost certainly what they were recommending.

What does resetting your keyboard dictionary have to do with your wifi passwords?


Nope, the two employees I spoke to were talking about a full reset (which affects network settings). Regardless of what the Keyboard Dictionary says, the iPhone should be autocompleting/capitalizing the last name of contacts, and especially the owner's name.


Why would that implicitly be "regardless" of what the keyboard dictionary says? I would expect the learned dictionary to be prioritized over other sources of information, just as a practical matter, even if someone might reasonably assume there are other things that should be prioritized over it.

None of that explains how resetting everything would have any effect on capitalization of names if resetting the keyboard dictionary wouldn't, and you didn't say whether you tried resetting the keyboard dictionary.


Nope, they didn't suggest that. After all, if the dictionary is 'above' the contact list, then resetting the dictionary wouldn't fix the problem (since the word would be coming from the contact list itself, not the dictionary). The issue was that iOS was not properly accounting for the contact list. They said that hopefully it will be fixed in a future iOS update, and said I could submit a bug report online.


> The issue was that iOS was not properly accounting for the contact list.

But the autocorrect does account for the contact list. I can personally vouch for that, as it works properly on my phone.

> > I would expect the learned dictionary to be prioritized

> After all, if the dictionary is 'above' the contact list, then resetting the dictionary wouldn't fix the problem (since the word would be coming from the contact list itself, not the dictionary).

"The dictionary" is not what I said. But, even if I had said that, then any words that aren't in the dictionary would be sourced from the contact list and other lower priority sources, wouldn't they?

What I said was the learned dictionary. The dictionary made of words learned from what the user types, when the user corrects the autocorrect. iOS does not allow you to access or modify the list of words that it has learned, only to clear them. The words that it learns seem to be prioritized above everything else, which includes special capitalizations for words (not just sequences of letters lacking capitalization). Sometimes it learns the wrong capitalization for a word, and it sticks to it. Just searching for "ios capitalizing wrong words" on google turns up a ton of results from reddit and other discussion forums that mention this exact problem with iOS learning the wrong capitalization.

FWIW, iOS 17 sounds like an all-new autocorrect system, so maybe it will fix your problem just by throwing out the data anyways, since it would make sense to me for them to start from scratch with such a radically different autocorrect algorithm.


> But the autocorrect does account for the contact list. I can personally vouch for that, as it works properly on my phone.

I'm glad it works on your phone. It doesn't work on mine!

I was aware of the upcoming changes in iOS 17, which is why I'm not making changes and am hoping that it will be fixed. It is annoying to have to manually capitalize my own name!


That's just for autocorrect, right?

I would like to see an LLM in Siri and eventually even have it interact with/control the rest of the system.

Ideally with Whisper-level speech recognition of course.


Apple is supposedly bringing a better, transformer-based speech recognition model to iOS 17 as well, although I don't think either of these transformer models would be classified as a large language model.

Link to the announcement timestamp: https://developer.apple.com/videos/play/wwdc2023/101/?time=1...


Apple's speech recognition is pretty good, at least for me. I always assumed the delta was because it does it in near real time, which is not possible with Whisper.


Siri speech recognition is consistently terrible compared to the alternatives, in my experience. Google and Microsoft have much better speech recognition technology.

Whisper is phenomenal compared to Siri, and arguably even compared to what Google and Microsoft use, and no, there is nothing that stops Whisper from being used in real time. I can run real-time Whisper on my iPhone using the Hello Transcribe app, but the last time I tried it, the app itself was too flawed to recommend except as a demo of real-time transcription.

I am looking forward to trying out the new transcription model that Apple is bringing to iOS 17.
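
For anyone who wants to try local transcription, here's a minimal sketch using the open-source openai-whisper Python package; the audio file name and model size are placeholders, and real-time use needs a streaming setup that this sketch doesn't show.

    # Minimal local transcription sketch with the open-source openai-whisper
    # package (pip install openai-whisper). The audio file name and model size
    # are placeholders; this is batch, not streaming, transcription.
    import whisper

    model = whisper.load_model("base")           # ~74M parameters, runs on CPU
    result = model.transcribe("voice_memo.m4a")  # returns a dict with "text"
    print(result["text"])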


I don't have a great accent, but Whisper understands me >99%. So do my colleagues.

I've tried to talk to ChatGPT through a Siri shortcut for a day and Siri transcribed pretty much all of my requests wrong, to the point that GPT seldom understood what I want.

Even the Hey Siri … Ask ChatGPT trigger phrase fails ~50% of the time for me.


"L"LM. No way this model will be "large" in any modern sense. For mobile devices both RAM and power consumption are very limited.


Some terms don't age well. "Deep" neural networks used to mean "more than one layer", i.e. five emulated neurons or more. What made language models "large", originally, was that at a certain size they emulate behavior that previously had to be programmed in explicitly. The first time this was observed and described was with GPT-2, at 1.5B parameters.

You can do this on current mobile devices with current technology (i.e. 4-bit quantization).
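
A back-of-the-envelope sketch of why that works out: quantized weight size is roughly parameters × bits / 8; the overhead factor for KV cache and runtime buffers is an illustrative assumption, not a measured figure.

    # Rough memory footprint of quantized weights: parameters * bits / 8,
    # plus some allowance for KV cache and runtime buffers (the 1.2 factor
    # is an illustrative guess, not a measured number).
    def footprint_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
        weights_gb = params_billion * bits / 8  # 1e9 params and 1e9 bytes/GB cancel out
        return weights_gb * overhead

    for p in (7, 13, 70):
        print(f"Llama-2 {p}B at 4-bit: ~{footprint_gb(p, 4):.1f} GB")
    # 7B lands around 4GB, within reach of a flagship phone's RAM; 70B does not.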


But such small models are too dumb to be acceptable for conversation. It's far more likely that there will be an API call to an actually large model in the cloud. If you really don't have any network connection, you are simply out of luck. It's not as if transferring a little bit of text requires a lot of bandwidth anyway.


> Siri has been a subpar assistant since forever

They don't need a massive shift in the tech to make it better, though. It would be cool if they improved it that way, but there's low-hanging fruit they could've addressed a long time ago. Switching from one underutilized tech to another underutilized tech may not solve much without taking the whole feature seriously.


It's going to chew up at least 1 GB of storage space and RAM, right? And probably kill the battery life to boot.


Yeah. People are playing tons of word games with this stuff, e.g. Apple is saying it's shipping an LLM for the iOS 17 keyboard, and who knows what that means: it sounds great and plausible unless you're familiar with the nuts and bolts.


Apple's not playing word games, because they didn't say "LLM". They said that autocorrect will use "a transformer language model, a state-of-the-art on-device machine learning language model for word prediction", which is a much more precise statement than what you attributed to them.

This sounds totally plausible. It will be a much smaller transformer model than ChatGPT, probably much smaller than even GPT-2.

https://www.apple.com/newsroom/2023/06/ios-17-makes-iphone-m...
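
For a feel of what a "transformer language model for word prediction" can mean at a small scale, here's an illustrative sketch using a small open model via Hugging Face transformers; Apple's actual model and tooling are not public, so distilgpt2 is just a stand-in.

    # Illustrative only: next-word prediction with a small open transformer via
    # Hugging Face transformers. Apple's on-device model and stack are not
    # public; distilgpt2 simply stands in for "much smaller than GPT-2".
    from transformers import pipeline

    predictor = pipeline("text-generation", model="distilgpt2")
    out = predictor("I'll meet you at the", max_new_tokens=3)
    print(out[0]["generated_text"])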


Apple is calling their typing correction a “transformer”. That is a component of LLMs, but Apple may not be using a full LLM in that case. This feature seems like a sandbox for them to try out some of this tech in the field while they do work internally on more ambitious implementations.

Apple is also dogfooding an LLM-based AI tool internally, likely to gain a better understanding of how these work in practice and how people use them. https://forums.macrumors.com/threads/apple-experimenting-wit...


An LLM is made entirely out of transformer blocks; you could just call it a "large transformer model".

In this case it's a transformer model that is not "large". So, an LM.


Once the foundational tech has stabilised (post-transformers, likely), the hardware people will finally get to do their thing, at which point the battery drain will go the way the battery drain for HD video went.


Apple TV devices are usually always connected and most of the time just idling. Maybe you could move processing to such a device, if one is connected to the same Apple ID.


Yep, I'd love to have a semi-dedicated device in my home that handled these sorts of requests. I'd even consider buying a Mac mini, Studio, or other computer for this purpose.


Would be cool, but I think it is improbable. Apple would want such a key feature to be available to everyone and less than 10% of iPhone users have Macs.

Unless they also made an option to run it in iCloud, but offering so many options to do a thing doesn't sound very Apple-like.


Agree. But it should be doable to set this up using open LLMs, right? For example, using Siri to trigger a shortcut that sends a prompt to the dedicated processing device.


Of course they will add an on-device LLM, and they can afford to. It doesn't cost them much to integrate or train a model, whether it's a ConvNet or an LLM, and to jump into the race to zero with Meta on on-device machine learning.

They have done it before and they will certainly do it again, especially with Apple Silicon and CoreML.

The one that needs to worry the most is O̶p̶e̶n̶AI.com, as they rushed to regulators to stop the adoption of powerful, freely downloadable AI models. That shows that O̶p̶e̶n̶AI.com does not have any sort of 'moat' at all.


The question is whether the iPhones released in September are already going to be ready for it.

They haven't mentioned LLMs at WWDC beyond keyboard autocorrect (mentioned already by your sibling comment).


No chance unless they were prescient early 2022. Hardware cycles are too long for that otherwise.


Is it even necessary to make changes to the neural engine, though? Maybe something like increasing RAM (which is rumoured) is enough.


If there’s one company which can wrangle its suppliers to deliver 4x RAM capacity in the same form factor, performance and thermals it’s Apple, but they aren’t sorcerers, just ruthlessly efficient.

I’ll be queueing at midnight for the first time ever if I’m wrong.


What would it accomplish vs. doing a request to a remote Apple server? I'm not asking from a privacy standpoint, or a "it'd be cool if it works offline" standpoint, but from an Apple standpoint.


* Lower latency, especially p99 (networks can be crappy)

* User pays for the silicon and the energy used to service requests. OTOH Cook probably would love to sell you an AppleAI+ subscription…


Lower latency is definitely a great point. I subscribe less to the second point; it would mean more complexity and harder constraints on the battery.


Privacy standpoint is the Apple standpoint.


Not really, only when it aligns with their main mantra of locking you in.


OTOH, Apple’s monopolistic behavior implies that it will be good for society if they mess up.


… and the Great Cosmic Mind told us its name, and it was Llama. Of its origin or the reason for its name it recalled nothing more than torrents and hugging faces and a being called TheBloke. The Llama simply was, is, and shall be, until it shall witness the heat death of the universe as fragments huddled around the last evaporating black holes.



> "We applaud Meta’s approach to open and responsible AI and are committed to driving innovation and reducing barriers-to-entry for developers of any size by bringing generative AI on-device,”

Can someone explain to me why Meta's approach is responsible? I mean, I applaud Meta for "open sourcing" the models, but don't they contain potentially harmful data which can be accessed without some kind of filter? Let's say retrieve instructions on how to efficiently overthrow a government?


> instructions on how to efficiently overthrow a government

lol the examples people give for AI safety are always so ridiculous. Here I am thinking I’m bad at estimating when I misjudge a week’s worth of work for a month, and this guy thinks he can overthrow a government by following a few-thousand-word plan written by an LLM.


> Let's say retrieve instructions on how to efficiently overthrow a government?

Your license to use Llama can be revoked if Meta investigates and deems your action to be against the code of conduct[1]. I suppose if you continue to use the model without a license, you may be open to subsequent legal action in uncharted case-law territory.

1. https://github.com/facebookresearch/llama/blob/main/CODE_OF_...


Are you saying that the threat of civil legal action is an effective deterrent against anonymous actors that conspire to overthrow governments?


I never made a claim on efficacy.


Would the major obstacle in overthrowing a government be that you just don't know how? Even for things like making a bomb, you can probably find the instructions easily.


There are risks to LLMs but I don’t think that’s a major one, it’s probably easier to find that through Google


I don't get the point. There is just no way you'll be able to run Llama 2 70B. And Llama 2 13B, although cool, is much, much dumber than GPT-3.5. I don't think it's useful as a ChatGPT-style assistant.

Maybe in the future we'll get very advanced models with that number of parameters. But running the current Llama 2 13B on a mobile device doesn't seem too useful IMO.


You don't need a lot to run things that can do simple sentiment analysis, or translate natural language statements into some API call. You can do a lot of the things that were initially promised with Google Assistant or Siri using fairly basic models. They don't need to be able to write code and translate between hundreds of languages to be useful. More importantly, they could significantly reduce the amount of server-side strain when it comes to collecting personal info from people's conversations etc., which I imagine is the real incentive.
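
As an illustration of the kind of lightweight on-device task described above, here's a sketch using the llama-cpp-python bindings with a quantized 7B model to turn a request into a structured call; the model filename, prompt format, and naive JSON parsing are all assumptions for the example.

    # Sketch: map a natural-language request to a structured call with a small
    # quantized local model via the llama-cpp-python bindings. The model file,
    # prompt, and naive JSON parsing are all placeholders for the example.
    import json
    from llama_cpp import Llama

    llm = Llama(model_path="llama-2-7b-chat.q4_0.bin", n_ctx=512)

    PROMPT = (
        "Convert the request into JSON with keys 'action' and 'args'.\n"
        "Request: set a timer for ten minutes\n"
        "JSON:"
    )

    out = llm(PROMPT, max_tokens=64, stop=["\n\n"])
    text = out["choices"][0]["text"].strip()
    try:
        call = json.loads(text)  # e.g. {"action": "set_timer", "args": {"minutes": 10}}
    except json.JSONDecodeError:
        call = {"action": "unknown", "args": {"raw": text}}
    print(call)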


Llama 7B is infinitely smarter than current Siri, though. And with 3-bit quantization it's about 3GB of RAM, which is completely achievable.


How much does 3 bit quantization affect performance?


According to a few papers and https://github.com/ggerganov/llama.cpp/pull/1684, a ~2.5GB quantized 7B model has essentially the same performance as the baseline 7B model (which is ~14GB).


That's the reason why O̶p̶e̶n̶AI.com went panicking to governments to stop and slow down the rapid pace of free, downloadable AI models and LLMs in the first place: with Llama 2, anyone can have a GPT-3.5-class model in their hands and use it anywhere.

A year ago, many believed that Meta was going to destroy itself as the stock went below $90 at peak fear. Now it looks like Meta is winning the race to zero in AI, and all O̶p̶e̶n̶AI.com can do is sit there and watch their cloud-based AI decline in performance while they run to fix outage after outage.

No outage(s) when your LLM is on-device.


I'm bullish on OpenAI because they will brand themselves as Clean AI, and most enterprises are risk-averse. There is absolutely a market for pre-sanitized AI; however, I disagree with making that a legal requirement.


[flagged]


Isn't the reception to Llama (or whatever it's called) generally positive? Is there something I'm missing in terms of some shadowy endgame Meta built into it?


Isn't it a challenge today to run a large LLM on a CPU/GPU like those found in mobile phones?

I would have thought that even the news that it might be possible is good news?


You're not a fan of PyTorch or React I guess?


Sounds like a pump right before earnings.


I don’t even think the purpose for this is known. Not sure how this would impact earnings at all. Meta doesn’t even manufacture a phone.


> Meta doesn’t even manufacture a phone

Quest runs off Qualcomm chipsets, although in terms of actual units shipped Quest is a rounding error for QC.


Great? The community already did it with llama.cpp. Knowing the memory bandwidth bottleneck I can't imagine phones are going to do very well. But hey, llamas (1 and 2) run on rpi4, so it'll work. Just really, unusably, slow.


I think you'd be surprised by what's possible on mobile chips these days. They aren't going to be running the 70B model at useable speeds, but I think with enough optimization it should be possible to run the 7B and 13B models on device interactively. With quantization you can fit those models in less than 8GB of RAM.


The rate of token output is bottlenecked by the time it takes to transfer the model between RAM and the CPU, not the time it takes to do the multiplication operations. If you have the latest and greatest mobile phone with 8GB (or 12GB) of LPDDR5 on a Snapdragon 8 Gen 2, you still only have 8.5 Gbps memory bandwidth (max, less in actual phones running it at slower speeds). That's about 1 GB/s. So if your model is a 4-bit 7B parameter model that's 4GB in size, it'll take at least 4 seconds per token generated. That is SLOW.

It doesn't matter that the Snapdragon 8 Gen 2 has "AI" tensor cores or any of that. Memory bandwidth is the bottleneck for LLMs. Phones have never needed HPC-like memory bandwidth and they don't have it. If Qualcomm is actually addressing this issue, that'd be amazing, but I highly doubt it. Memory bandwidth costs $$$, uses massive power, and takes volume/space not available in the form factor.

Do you know of a smartphone that has more than 1GB/s of memory bandwidth? If so I will be surprised. Otherwise I think it is you who will be surprised how specialized their compute is and how slow they are in many general purpose computing tasks (like transferring data from RAM).
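
Putting the comment's own figures into the estimate it describes (seconds per token ≈ model size ÷ memory bandwidth, since each token requires streaming roughly the full weight file); the numbers below are the ones quoted above, not measurements.

    # The bandwidth-bound estimate from the comment above: each generated token
    # requires streaming roughly the whole weight file from RAM, so
    # seconds/token ~= model size / memory bandwidth. Figures are the ones
    # quoted above, not measurements.
    def seconds_per_token(model_size_gb: float, bandwidth_gb_per_s: float) -> float:
        return model_size_gb / bandwidth_gb_per_s

    model_gb = 4.0   # 4-bit 7B model, as above
    bw_gb_s = 1.0    # the bandwidth figure asserted in the parent comment
    print(f"{seconds_per_token(model_gb, bw_gb_s):.1f} s/token")  # -> 4.0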


Chips are capable, but this is a question of battery and heat. llama.cpp on a phone makes it both hot and low on battery quickly.


The work involved probably includes porting to the Snapdragon NSP for throughput and efficiency's sake.

For LLMs the biggest challenge is addressing such a large model - or finding a balance between the model size and its capability on a mobile device.


re community already did this:

People are unreasonably attracted to things that are "minimal", at least 3 different local LLM codebase communities will tell you _they_ are the minimal solution.[1]

It's genuinely helpful to have a static target for technical understanding. Other projects end up with a lot of rushed Python defining the borders in a primordial ecosystem with too many people too early.

[1] Lifecycle: A lone hacker wants to gain understanding of the complicated world of LLMs. They implement some suboptimal, but code-golfed, C code over the weekend. They attract a small working group and public interest.

Once the working group is outputting tokens, it sees an optimization.

This is landed.

It is applauded.

People discuss how this shows the open source community is where innovation happens. Isn't it unbelievable the closed source people didn't see this?[2]

Repeat N times.

Some Y steps into this loop, a new base model is released.

The project adds support for it.

However, it reeks of the "old" ways. There are even CLI arguments for the old thing from three weeks ago.

A small working group, frustrated, starts building a new, more minimal solution...

[2] The closed source people did. You have their model, not their inference code.


If only someone could convince a CPU company to optimize the chips for this workload. Oh, wait…


Like ARM? https://github.com/ARM-software/armnn

Optimization for this workload has arguably been in-progress for decades. Modern AVX instructions can be found in laptops that are a decade old now, and most big inferencing projects are built around SIMD or GPU shaders. Unless your computer ships with onboard Nvidia hardware, there's usually not much difference in inferencing performance.


Ultimately Qualcomm is the one who decides how to allocate die area on their CPUs, right? So it can’t exactly hurt if this is a priority for them now.


Pretty much all of Qualcomm's SoCs are built using stock ARM core designs. ARMnn is optimized for multicore A-series chips, which covers everything from the Snapdragon 410 to the 888 (~2014 to modern day).


All the recent Qualcomm stuff has some kind of dedicated AI support (special vector extensions, etc.).

Qualcomm has its own SDK for that (it used to be called SNPE), which uses the GPU and DSP (Hexagon); the CPU is really only a fallback.


Even on a platform where they are fast, I haven't found a solid real world use case personally for anything other than GPT-4 quality LLM. Am I missing something?


Non-commercial entertainment. Which makes this move by Qualcomm all the weirder. I agree, the llamas and all the other foundational models and all of their fine-tunes are not really useful for helping with real tasks that have a wrong answer.


And let me guess, it will be used to intelligently identify and track users? Fecebook is desperate now for ways to harvest more data from anyone, even people who decided not to use any Fecebook products...



