
If Apple were to wake up to what's happening with llama.cpp etc., then I don't see such a market for paying for remote access to big models via API, though it's currently the only game in town.

Currently a MacBook has a Neural Engine that sits idle 99% of the time and is only suitable for running limited models (poorly documented, opaque rules about which ops can be accelerated, a black-box compiler [1] and an apparent 3GB model size limit [2]).

OTOH you can buy a Macbook with 64GB 'unified' memory and a Neural Engine today

If you squint a bit and look into the near future it's not so hard to imagine a future Mx chip with a more capable Neural Engine and yet more RAM, and able to run the largest GPT3 class models locally. (Ideally with better developer tools so other compilers can target the NE)

And then imagine it does that while leaving the CPU+GPU mostly free to run apps/games ... the whole experience of using a computer could change radically in that case.

I find it hard not to think this is coming within 5 years (although equally, I can imagine this is not on Apple's roadmap at all currently)

[1] https://github.com/hollance/neural-engine

[2] https://github.com/smpanaro/more-ane-transformers/blob/main/...




If I were Apple I'd be thinking about the following issues with that strategy:

1. That RAM isn't empty, it's being used by apps and the OS. Fill up 64GB of RAM with an LLM and there's nothing left for anything else.

2. 64GB probably isn't enough for competitive LLMs anyway.

3. Inferencing is extremely energy intensive, but the MacBook / Apple Silicon brand is partly about long battery life.

4. Weights are expensive to produce and valuable IP, but hard to protect on the client unless you do a lot of work with encrypted memory.

5. Even if a high end MacBook can do local inferencing, the iPhone won't and it's the iPhone that matters.

6. You might want to fine tune models based on your personal data and history, but training is different to inference and best done in the cloud overnight (probably?).

7. Apple already has all that stuff worked out for Siri, which is a cloud service, not a local service, even though it'd be easier to run locally than an LLM.

And lots more issues with doing it all locally, fun though that is to play with for developers.

I hope I'm wrong, it'd be cool to have LLMs be fully local, but it's hard to see situations where the local approach beats out the cloud approach. One possibility is simply cost: if your device does it, you pay for the hardware, if a cloud does it, you have to pay for that hardware again via subscription.


> but it's hard to see situations where the local approach beats out the cloud approach.

I think the most glaring situation where this is true is simply one of trust and privacy.

Cloud solutions involve trusting 3rd parties with data. Sometimes that's fine, sometimes it's really not.

Personally - LLMs start to feel more like they're sitting in the confidant/peer space in many ways. I behave differently when I know I'm hitting a remote resource for LLMs in the same way that I behave differently when I know I'm on camera in person: Less genuinely.

And beyond merely trusting that a company won't abuse or leak my data, there are other trust issues as well. If I use an LLM as a digital assistant - I need to know that it's looking out for me (or at least acting neutrally) and not being influenced by a 3rd party to give me responses that are weighted to benefit that 3rd party.

I don't think it'll be too long before we see someone try to create an LLM that has advertising baked into it, and we have very little insight into how weights are generated and used. If I'm hitting a remote resource - the model I'm actually running can change out from underneath me at any time, jarring at best and utterly unacceptable at worst.

From my end - I'd rather pay and run it locally, even if it's slower or more expensive.


People have trusted search engines with their most intimate questions for nearly 30 years and there has been what ... one? ... leak of query data during this time, and that was from AOL back when people didn't realize that you could sometimes de-anonymize anonymized datasets. It hasn't happened since.

LLMs will require more than privacy to move locally. Latency, flexibility and cost seem more likely drivers.


You're still focused on trusting that my data is safe. And while I think that matters - I don't really think that's the trust I care most about.

I care more about the trust I have to place in the response from the model.

Hell - since you mentioned search... Just look at the backlash right now happening to google. They've sold out search (a while back, really) and people hate it. Ads used to be clearly delimited from search results, and the top results used to be organic instead of paid promos. At some point, that stopped being true.

At least with google search I could still tell that it was showing me ads. You won't have any fucking clue that OpenAI has entered into a partnering agreement with "company [whatever]" and has retrained the model that users on plans x/y/z interact with to make it more likely to push them towards their new partner [whatever]'s products when prompted with certain relevant contexts.


> Hell - since you mentioned search... Just look at the backlash right now happening to google. They've sold out search (a while back, really) and people hate it. Ads used to be clearly delimited from search results, and the top results used to be organic instead of paid promos. At some point, that stopped being true.

Only people in HN-like communities care about this stuff. Most people find the SEO spam in their results more annoying.

> At least with google search I could still tell that it was showing me ads. You won't have any fucking clue that OpenAI has entered into a partnering agreement with "company [whatever]" and has retrained the model that users on plans x/y/z interact with to make it more likely to push them towards their new partner [whatever]'s products when prompted with certain relevant contexts.

You won't know this for any local models either.


> You won't know this for any local models either.

But you will know the model hasn't changed, and you can always continue using the version you currently have.
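
One concrete way to hold a local model to that - a minimal sketch, assuming llama.cpp-style weights that are just a file on disk you can checksum (the file path here is hypothetical):

    import hashlib
    from pathlib import Path

    def weights_fingerprint(path: str, chunk_size: int = 1 << 20) -> str:
        """SHA-256 of a local weights file, read in 1MB chunks."""
        digest = hashlib.sha256()
        with Path(path).open("rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # Hypothetical path: pin this hash once, re-check it before each run
    print(weights_fingerprint("models/ggml-model-q4_0.bin"))

Pin the digest once and any silent weight swap shows up immediately - something you simply can't do against a hosted API.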

> Most people find the SEO spam in their results more annoying.

This is the same problem. These models will degrade from research quality to mass market quality as there's incentive to change what results they surface. Whether that's intentional (paid ads) versus adversarial (SEO) doesn't matter all that much - In either case the goals will become commercial and profit motivated.

People really don't like "commercial and profit motivated" in the spaces that some of these LLMs are stepping into. Just like you don't like SEO in your recipe results.


> But you will know the model hasn't changed, and you can always continue using the version you currently have.

Will you? What happens when an OS update silently changes the model? Again, this is one of those things only HN-types really care/rant about. I've never met a non-technical person who cares about regular updates beyond being slow or breaking an existing workflow. Most technical folks I know don't care either.

> This is the same problem. These models will degrade from research quality to mass market quality as there's incentive to change what results they surface. Whether that's intentional (paid ads) versus adversarial (SEO) doesn't matter all that much - In either case the goals will become commercial and profit motivated.

Not at all. Search providers have an incentive to fight adversarial actors. They don't have any incentive to fight intentional collaboration.

> People really don't like "commercial and profit motivated" in the spaces that some of these LLMs are stepping into. Just like you don't like SEO in your recipe results.

I disagree. When a new, local business pops up and pays for search ads, is this "commercial and profit motivated?" How about advertising a new community space opening? I work with a couple businesses like this (not for SEO, just because I like the space they're in and know the staff) and using ads for outreach is a pretty core part of their strategy. There's no neat and clean definition of "commercial and profit motivated" out there.


You wouldn't know that even if the model ran locally.


This happened with ChatGPT a few weeks ago.

https://news.ycombinator.com/item?id=35291112


Two issues though: leak of data from one party to another, and misuse of data by the party you gave it to. Most big companies don’t leak this type of data, but they sure as hell misuse it and have the fines to prove it.


Almost everyone is willing to trust 3rd parties with data, including enterprise and government customers. I find it hard to believe that there are enough people willing to pay a large premium to run these locally to make it worth the R&D cost.


Having done a lot of Bank/Gov related work... I can tell you this

> Almost everyone is willing to trust 3rd parties with data, including enterprise and government customers.

Is absolutely not true. In its most basic sense - sure... some data is trusted to some 3rd parties. Usually it's not the data that would be most useful for these models to work with.

We're already getting tons of "don't put our code into chatGPT/Copilot" warnings across tech companies - I can't imagine not getting fired if I throw private financial docs for my company in there, or ask it for summaries of our high level product strategy documents.


Yes, just like you might get fired for transacting sensitive company business on a personal gmail account, even if that company uses enterprise gmail.

Saying that cloud models will win over local models is not the same as saying it will be a free-for-all where workers can just use whatever cloud offering they want. It will take time to enterprisify cloud LLM offerings to satisfy business/government data security needs, but I'm sure it will happen.


But right now what incentive have I to buy a new laptop? I got this 16GB M1 MBA two years ago and it's literally everything I need, always feels fast, silent etc

1. the idea would be that now there is a reason to buy loads more RAM, whereas currently the market for 64GB is pretty niche

2. 64GB is a big laptop today; in a few years' time that will be small. And LLaMA 65B int4 quantized should fit comfortably (see the rough arithmetic below)

4. LLMs will be a commodity. There will be a free one

6. LLMs seem to avoid the need for finetuning by virtue of their size - what we see now with the largest models is you just do prompt engineering. Making use of personal data is a case of Langchain + vectorstores (or however the future of that approach pans out)
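
A rough back-of-the-envelope for point 2 above (the ~20% overhead factor is a loose assumption for activations, KV cache and runtime buffers, not a measured number):

    # Does LLaMA 65B in int4 fit in 64GB of unified memory? (rough estimate)
    params = 65e9
    bytes_per_param = 0.5          # 4-bit quantization ~= half a byte per weight
    overhead = 1.2                 # assumed fudge factor for activations/KV cache
    needed_gb = params * bytes_per_param * overhead / 1e9
    print(f"~{needed_gb:.0f} GB")  # ~39 GB -- comfortably under 64 GB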


1. You're working backwards from a desire to buy more RAM to try and find uses for it. You don't actually need more RAM to use LLMs, ChatGPT requires no local memory, is instant and is available for free today.

2. Why would anybody be satisfied with a 64GB model when GPT-4 or 5 or 6 might even be using 1TB of RAM?

3. That may not be the case. With every day that passes, it becomes more and more clear that large LLMs are not that easy to build. Even Google has failed to make something competitive with OpenAI. It's possible that OpenAI is in fact the new Google, that they have been able to establish permanent competitive advantage, and there will no more be free commodity LLMs than there are free commodity search engines.

Don't get me wrong, I would love there to be high quality local LLMs. I have at least two use cases where you can't do them or not really well with the OpenAI API and being able to run LLama locally would fix that problem. But I just don't see that being a common case and at any rate I would need server hardware to do it properly, not Mac laptop.


> 1. You're working backwards from a desire to buy more RAM to try and find uses for it.

I'm really not

I had no desire at all until a couple of weeks ago. Even now not so much since it wouldn't be very useful to me

But the current LLM business model where there are a small number of API providers, and anything built using this new tech is forced into a subscription model... I don't see that as sustainable, and I think the buzz around llama.cpp is a taste of that

I'm saying imagine a future where it is painless to run a ChatGPT-class LLM on your laptop (that sounded crazy a year ago; to me it now looks inevitable within a few years), then have a look at the kind of things that can be done today with Langchain... then extrapolate


It sounds like we are in a similar position. I had no desire to get a 64GB laptop from Apple until all the interesting things from running LLaMA locally came out. I wasn't even aware of the specific benefit of that unified memory model on the Mac. Now I'm deciding whether I want 64, 96 or 128GB. For an insane amount of money - 5k for that top-end one.


The unified memory ought to be great for running LLaMA on the GPU on these Macbooks (since it can't run on the Neural Engine currently)

The point of llama.cpp is most people don't have a GPU with enough RAM, Apple unified memory ought to solve that

Some people have it working apparently:

https://github.com/remixer-dec/llama-mps
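
Roughly what that looks like from the PyTorch side - a minimal sketch, assuming Hugging Face-format LLaMA-style weights on disk (the model path is a placeholder), not necessarily the exact setup the fork above uses:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_path = "path/to/local-llama-7b"  # placeholder for whatever weights you have

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)
    model.to("mps")  # weights sit in the same unified memory pool the CPU uses

    inputs = tokenizer("Unified memory on Apple Silicon", return_tensors="pt").to("mps")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(out[0], skip_special_tokens=True))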


Thank you, that's exactly what I was looking for, specific info on perf.


I think the GPU performance for inference is probably limited currently by immaturity of PyTorch MPS (Metal) backend

before I found the repo above I had a naive attempt to get llama running with mps and it didn't "just work" - bunch of ops not supported etc
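
For anyone trying the same, a minimal sketch of the workaround PyTorch offers - the fallback env var (which has to be set before importing torch) routes unsupported MPS ops to the CPU instead of erroring out, at a performance cost:

    import os

    # Must be set before torch is imported
    os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

    import torch

    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
    x = torch.randn(4, 4, device=device)
    print(x.device)  # 'mps:0' on Apple Silicon with a recent PyTorch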


I think llama.cpp will die soon because the only models you can run with it are derivatives of a model that Facebook never intended to be publicly released, which means all serious usage of it is in a legal limbo at best and just illegal at worst. Even if you get a model that's clean and donated to the world, the quality is still not going to be competitive with the hosted models.

And yes I've played with it. It was/is exciting. I can see use cases for it. However none are achievable because the models are (a) not good enough and (b) too legally risky to use.


(A) is very use-case dependent. Even with some of the bad smaller models now, I can see devs making use of them to enhance their apps (e.g. local search, summaries, sentiments, translations)

(B) llama.cpp supports gpt4all, which states that it's working on fixing your concern. This is from their README:

Roadmap Short Term

- Train a GPT4All model based on GPTJ to alleviate llama distribution issues.


> is instant and is available for free today.

It's free for the user up to a point, but it costs OpenAI a lot of money.

Apple is a hardware vendor, so commoditization of the software while finding more market segments is definitely something that'd benefit them.

OTOH, if they let OpenAI become the unrivaled leader of AI that ends up being the next Google, they end up losing on a topic they've wanted to lead for a long time (Apple has invested quite a lot in AI, and the existence of a Neural Engine in Apple CPUs isn't an accident)


"A lot of money" is a lot less money per user than to buy 64GB RAM to run an inferior model locally + energy and opportunity costs. The OpenAI APIs are super cheap for a single user needs. I expect them to be at least close to breaking even with their APIs pricing.


> "A lot of money" is a lot less money per user than to buy 64GB RAM

if OpenAI isn't able to get a couple hundred bucks over the typical lifetime of a computer, it means the added value they provide is very low (like several times less than Spotify or Netflix, for instance), meaning they'll never be “the next Google”.

And if they are, it means it makes sense to buy it once instead of paying several times the price through a subscription.

> The OpenAI APIs are super cheap for a single user's needs. I expect them to be at least close to breaking even with their API pricing.

“Close to breaking even” means the price you pay is VC-subsidized, the expected gross margin for such kind of tech company is more than 50%. Expect to pay a lot more if/when the market is captive. And this will scale linearly with your use of the technology.

> energy and opportunity costs

What opportunity cost?


> Expect to pay a lot more if/when the market is captive.

Yes, this is a possibility, but cloud computing did become a commodity.

But I see why people would pay to have their own private and unfiltered models/embeddings.

> if OpenAI isn't able to get couple hundred bucks over the typical lifetime of a computer it means the added value they provide is very low (like several times less than Spotify or Netflix for instance), meaning they'll never be “the next Google”.

They don't have to worry about this today.

> What opportunity cost?

You could utilize the money and the time spent to do other things.


I think it’s quite likely that the RAM onboard these devices expands pretty massively, pretty quickly as a direct result of LLMs.

Google had already done some very convincing demos in the last few years well before ChatGPT and GPT-4 captured the popular imagination. Microsoft’s OpenAI deal I would assume will lead to a “Cortana 2.0” (obviously rebranded, probably “Bing for Windows”, “Windows Copilot” or something similar). Google Assistant has been far ahead of Siri for many years longer than that, and they have extensive experience with LLMs. Apple surely realises the position their platforms are in and the risk of being left behind.

I’m also not sure the barrier on iPhone is as great as you suggest - it’s obviously constrained in terms of what it can support now but if the RAM on the device doubles a few times over the next few years I can see this being less of an issue. Multiple models (like the Alpaca sets) could be used for devices with different RAM/performance profiles and this could be sold as another metric to upgrade (i.e. iPhone 16 runs Siri-2.0-7b while iPhone 17 runs Siri-2.0-30b - “More than 3x smarter than iPhone 16. The smartest iPhone we’ve ever made.” etc).


How much does 64GB of RAM cost, anyway? Retail it's like $200, and I'm sure it's cheaper in terms of Apple cost. Yet we treat it as an absurd luxury because Apple makes you buy the top-end 16" Macbook and pay an extra $800 beyond that. Maybe in the future they'll treat RAM as a requirement and not a luxury good.


and we know that more will be cheaper in future


With the integrated RAM, CPU and GPU on Apple Silicon, however it's done, it yields perf results. I do think that probably has a higher cost than separately produced RAM. And even separate from that, because they have that unified memory model, unlike every other consumer device, they can charge for it. So 64, 96 or 128GB?


It's not done for perf results; the Xbox doesn't have RAM on package and somehow does 560 GB/s


The perf results I was referring to was the ability to run an LLM locally (like llama.cpp) that uses a giant amount of RAM in the GPU, like 40 gig. Without this unified memory model, you end up paging endlessly, so it's actually much faster for this application in this scenario. Unlike on a PC with a graphics card, you can use your entire RAM for the GPU. This isn't possible on the Xbox because it doesn't have unified memory as far as I know. So having incredible throughput still won't match not having to page.

Edit - I found an example from h.n. user anentropic, pointing at https://github.com/remixer-dec/llama-mps . "The goal of this fork is to use GPU acceleration on Apple M1/M2 devices.... After the model is loaded, inference for max_gen_len=20 takes about 3 seconds on a 24-core M1 Max vs 12+ minutes on a CPU (running on a single core). "


> 4. Weights are expensive to produce and valuable IP, but hard to protect on the client unless you do a lot of work with encrypted memory.

No, it'll be a commodity

Apple wouldn't care if the weights can be extracted if you have to have a Macbook to get the sweet, futuristic, LLM-enhanced OS experience


I've been looking into buying a Mac for LLM experimentation - 64, 96 or 128GB of RAM? I'm trying to decide if 64GB is enough, or should I go to 96GB or even 128GB. But it's really expensive - even for an overpaid software engineer. Then there's the 1 or 2TB storage question. Apple's list price is another $400 for that second TB of storage.

For 64GB of RAM, you can get an M2 Pro, or get 96GB, which requires the upgraded CPU on the Pro. The Studio does 64GB or 128GB. But the 128 requires you to spend 5k.

I can't decide between 64 or 96 on the M2 Pro, and 128 on the Studio. Probably go for 96GB. Also, what's the impact of the extra GPU cores on the various options? And there are still some "M1" 64GB Pros & Studios out there. What's the perf difference for M1 vs M2? This area needs serious perf benchmarking. If anyone wants to work with me, maybe I would try my hand. But I'm not spending 15k just to get 3 pieces of hardware.

List prices:

64GB/2TB M2 12CPU/30GPU 14" Pro: $3900

96GB/2TB M2 Max 12/38 14" Pro: $4500

128GB/2TB M2 Max 28/48 Studio: $5200


Check out the LLaMA memory requirements on Apple Silicon GPU here: https://github.com/remixer-dec/llama-mps


I’m pretty sure you can get a purpose-built pc tower in that range. Why would you favor a Mac over that? A lot of this stuff only has limited support for MacOS.


The unified GPU/CPU memory structure on ARM Macs is very, very helpful for running these LLMs locally.


Is there a big difference in principle between that and the "shared video memory" that has long existed on cheap x86 machines?

…or is it just that the latter had a way too weak iGPU and not enough RAM for AI purposes, whereas the bigger ARM MACs have more GPU power and enough RAM (more than most affordable discrete graphic cards) so that they are usable for some AI models?


You can't get that much VRAM on a PC for a comparable price.


Running models locally is the future for most inferencing cycles. There are a lot of inaccuracies in your numbered list trying to dissuade people.

> 64GB probably isn't enough for competitive LLMs anyway

I am trying to be charitable, but this is pretty much not true. And the hedging in your statement only telegraphs your experience.


> Even if a high end MacBook can do local inferencing, the iPhone won't and it's the iPhone that matters

Doesn't the iPhone use the local processor for stuff like the automatic image segmentation they currently do? (Press and hold on any person in a recent photo you have taken and iOS will segment it)


Yes but I'm not making a general argument about all AI, just LLMs. The L stands for Large after all. Smartphones are small.


>One possibility is simply cost: if your device does it, you pay for the hardware, if a cloud does it, you have to pay for that hardware again via subscription.

Yeah, but in the cloud that cost is amortized among everyone else using the service. If you as a consumer buy a GPU in order to run LLMs for personal use, then the vast majority of the time it will just be sitting there depreciating.


But then again, every Apple Silicon user has an unused Neural Engine sitting in the SoC and taking up a significant amount of die space, yet people don't seem to worry too much about its depreciation.


> 7. Apple already has all that stuff worked out for Siri, which is a cloud service, not a local service, even though it'd be easier to run locally than an LLM.

iOS actually does already have an offline speech-to-text api. Some part of Siri that translates the text into intents/actions is remote. Since iOS 15, Siri will also process a limited subset of commands while offline.


Chips have a 5-7 year lead time. Apple has been shipping neural chips for years while everyone else is still designing their v1.

Apple is ahead of the game for a change getting their chips in line as the software exits alpha and goes mainstream.


But they haven't exposed them for use. They are missing a tremendous opportunity. They have that unique unified memory model on the M1/M2 ARM chips, so they have something no other consumer devices have. If they exposed their neural chips they'd solidify their lead. They could sell a lot more hardware.


They are though. Apple released a library to use Apple Silicon for training via PyTorch recently, and has libraries to leverage the NE in CoreML.


> Apple Silicon for training via PyTorch recently

This is just allowing PyTorch to make use of the Apple GPU, assuming the models you want to train aren't written with hard-coded CUDA calls (I've seen many that are like that, since for a long time that was the only game in town)

PyTorch can't use the Neural Engine at all currently

AFAIK Neural Engine is only usable for inference, and only via CoreML (coremltools in Python)
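
To make that concrete, here's a minimal sketch of the CoreML-only route via coremltools - the toy model and input shape are placeholders, and whether the Neural Engine actually picks the work up is still decided by Apple's opaque compiler:

    import torch
    import coremltools as ct

    # Placeholder model: any traceable torch.nn.Module converts the same way
    model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU()).eval()
    example = torch.randn(1, 128)
    traced = torch.jit.trace(model, example)

    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(name="x", shape=example.shape)],
        convert_to="mlprogram",
        compute_units=ct.ComputeUnit.ALL,  # let CoreML schedule across CPU/GPU/NE
    )
    mlmodel.save("tiny.mlpackage")

    # Inference then goes through CoreML, not PyTorch
    print(mlmodel.predict({"x": example.numpy()}))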


Thank you! I wasn't aware of that. Let me research that. May 2022 announcement. Is this suitable for apps like llama.cpp, since it's a Python library? It appears to be a library, but they didn't document how to use the underlying hardware - but I welcome more info.


iPhones have similar Neural Engine capabilities, obviously far more limited but still quite powerful. You can run some pretty cool DNNs for image generation using e.g. Draw Things app quite quickly: https://apps.apple.com/us/app/draw-things-ai-generation/id64...


1. Quadruple it.

2. see above

Should be cheap, or why else are Samsung, Micron and Kioxia whining about losses?

Maybe go for something like Optane memory while doing so.


Optane is sadly no longer being manufactured.


I know. That's why I wrote something like ;-)


> "If you squint a bit and look into the near future it's not so hard to imagine a future Mx chip with a more capable Neural Engine and yet more RAM, and able to run the largest GPT3 class models locally. (Ideally with better developer tools so other compilers can target the NE)"

Very doubtful, unless the user wants to carry around another kilogram's worth of batteries to power it. The hefty processing required by these models doesn't come for free (energy-wise) and Moore's Law is dead as a doornail.


Most of the time I have my laptop plugged in and sit at a desk...

But anyway, there are two trends:

- processors do more with less power

- LLMs get larger, but also smaller and more efficient (via quantizing, pruning)

Once upon a time it was prohibitively expensive to decode compressed video on the fly; later, CPUs (both Intel [1] and Apple [2]) added dedicated decoding hardware. Now watching hours of YouTube or Netflix is part of standard battery life benchmarks

[1] https://www.intel.com/content/www/us/en/developer/articles/t...

[2] https://www.servethehome.com/apple-ignites-the-industry-with...


My latest mac seems to have about a kilogram of extra battery already compared to the previous model.


Apple’s move to make stable diffusion run well on the iPhone makes me think they’re watching this space, just waiting for the right open model for them to commit to.


I wonder how good the Neural Engine with the unified memory is compared to, say, an Intel CPU with 32GB RAM. Could anyone give some insight?


There seems to be a limit to the size of model you can load before CoreML decides it has to run on CPU instead (see the second link in my previous comment)

If it could use the full 'unified' memory that would be a big step towards getting these models running on it

I'm unsure how the performance compares to a beefy Intel CPU, but there's some numbers here [1] for running a variant of the small distilbert-base model on the Neural Engine... it's ~10x faster than running on the M1 CPU

[1] https://github.com/anentropic/experiments-coreml-ane-distilb...
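
If anyone wants to reproduce that kind of comparison on their own converted model, a rough timing harness looks something like this (the model file and input name are placeholders; CPU_ONLY vs CPU_AND_NE is what forces the CPU/Neural Engine split):

    import time
    import numpy as np
    import coremltools as ct

    sample = {"x": np.random.rand(1, 128).astype(np.float32)}  # placeholder input

    for units in (ct.ComputeUnit.CPU_ONLY, ct.ComputeUnit.CPU_AND_NE):
        model = ct.models.MLModel("tiny.mlpackage", compute_units=units)
        model.predict(sample)  # warm-up / first-load compilation
        start = time.perf_counter()
        for _ in range(100):
            model.predict(sample)
        print(units, f"{(time.perf_counter() - start) * 10:.2f} ms per prediction")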


Siri was launched with a server-based approach. It wouldn't be surprising if Apple's near-term LLM strategy would be to put a small LLM on local chips/macOS and a large model running in the cloud. The local model would only do basic fast operations while the cloud could provide the heavyweight intensive analysis/generation.


I can see how the Apple Silicon memory system can help with LLMs, but a couple of points of reality check:

- Such amounts of memory are locked behind very expensive SKUs which most of the Mac userbase will not use (<5% of new purchases, to be very conservative).

- Not too long ago Apple would restrict the amount of RAM in their systems for their own reasons (source: https://9to5mac.com/2016/10/28/apple-macbook-pro-16gb-ram-li...)

- Just like mid-2010s GPUs with 6-8GB of VRAM but little to benefit from it, I don't see the ML accelerators/GPU in current models being capable enough to make the most of the memory available to them.


That's today... think of the future

My first computer had 512KB RAM and 20MB was an expensive hard drive.

64GB Macbooks are currently an expensive 'Pro' novelty, they will be the vanilla of tomorrow

> i don't see the ml accelerators/gpu in current models being capable enough to make the most of the memory available to it

that's exactly my point (and apparently today's Neural Engine can't even take advantage of all the unified memory available)

until LLaMA there was no reason to have more than this, they probably imagined it would just run a bit of face-detection and speech-to-text on the side

but if they got serious and beefed it up it could be the next wave of computing IMHO


Update: https://www.bjnortier.com/2023/04/13/Hello-Transcribe-v2.2.h...

The iPhone 14 runs a Whisper model faster than an M1 Max, because it has a newer Neural Engine

I look forward to the M3 Macbook launch eagerly, while expecting mild disappointment


Can we re-invent SETI with such LLMs/new GPU folding/whatever hardware and re-pipe the seti data through a Big Ass Neural whatever you want to call it and see if we have any new datapoints to look into?

What about other older 'questions' we can point an AI lens at?



You need state of the art consumer tech to run a model comparable to GPT-3 locally at a glacial pace.

Or, you can use a superior GPT 3.5 for free.



