More

frozenport · 2024-06-24T04:40:24

Struggling with the use case.

It seems like this is when you have the source or the libs but choose to mix x86 and arm?

It would seem if you have the source etc you should just bite the bullet and port everything.

adamjs · 2024-06-24T05:21:35

Two use-cases jump to mind:

* Allows incremental porting of large codebases to ARM. (It's not always feasible to port everything at once-- I have a few projects with lots of hand-optimized SSE code, for example.)

* Allows usage of third-party x64 DLLs in ARM apps without recompilation. (Source isn't always available or might be too much of a headache to port on your own.)

vsl · 2024-06-24T07:25:18

3. Improve x64 emulation performance for everybody. Windows 11 on ARM ships system DLLs compiled as Arm64EC - makes the x64 binaries run native ARM code at least within system libraries.

ack_complete · 2024-06-24T19:26:13

It's not worth using ARM64EC for just for incremental porting -- it's an unusual mode with even less build/project support than Windows ARM64 and there are EC-specific issues like missing x64 intrinsic emulations and slower indirect calls. I wouldn't recommend it except for the second case with external x64 DLLs.

callalex · 2024-06-24T06:01:49

At that point why trust the emulator over the port? Either you have sufficient tests for your workload or you don’t, anything else is voodoo/tarot/tea leaves/SWAG.

wtallis · 2024-06-24T09:05:45

"Why trust the emulator?" sounds a lot like asking "why trust the compiler?". It's going to be much more widely-used and broadly-tested than your own code, and probably more thoroughly optimized.

szundi · 2024-06-24T06:45:32

We might be lucky and the emulator guys might have enough testing

amelius · 2024-06-24T09:18:11

> Allows incremental porting of large codebases to ARM. (It's not always feasible to port everything at once-- I have a few projects with lots of hand-optimized SSE code, for example.)

Wouldn't it make more sense to have a translator that translates the assembly, instead of an emulator that runs the machine code?

frozenport · 2024-06-24T06:59:49

Yeah but you need to port the SIMD before shipping anyways?

So if you're doing incremental stuff might as well stub out the calls with "not implemented", and start filling them in.

creshal · 2024-06-24T07:14:41

The SIMD part will be emulated as normal, as far as I understand. So you can ship a first version with all-emulated code, and then incrementally port hotspots to native code, while letting the emulator handle the non-critical parts.

At least in theory, we'll see how it actually pans out in practice.

selimnairb · 2024-06-24T11:13:52

I feel like binary translation is a better approach. It’s a temporary workaround that allows users to use non-native programs while they are ported properly. ARM64EC seems like it will incentivize “eh that’s good enough” partial porting efforts that will never result in a full port, while making the whole system more complicated, with a larger attack surface (binary translation also makes the system more complicated, but it seems more isolated/less integrated with the rest of OS).

PaulHoule · 2024-06-24T12:35:52

My understanding is that ARM64EC only makes sense in terms of binary translation. That is, the x64 bits get translated and the ARM bits don’t.

anaisbetts · 2024-06-24T17:35:52

The use-case is huge apps that have a native plugin ecosystem, think Photoshop and friends. Regular apps will typically just compile separate x64 and ARM64 versions

doctorpangloss · 2024-06-24T16:40:03

Yes, bite the bullet and port. Of course it makes no sense.

These sorts of things are only conceived in conversations between two huge corporations.

Like Microsoft needs game developers to build for ARM. There’s no market there. So their “people” author GPT-like content at each other, with a ratio of like 10 middlemen hours per 1 engineer hour, to agree to something that narratively fulfills a desire to build games for ARM. I can speculate endlessly how a conversation between MS and EA led to this exact standard but it’s meaningless, I mean both MS and EA do a ton of things that make no sense, and I can’t come up with nonsense answers.

Anyway, so this thing gets published many, many months after it got on some MS PM’s boss’s partner’s radar. Like the fucking devices are out! It’s too late for any of this to matter.

You can’t play Overwatch on a Snapdragon whatever (https://www.pcgamer.com/hardware/gaming-laptops/emulation-pr... ) End of story. Who cares what the ABI details are.

Microsoft OWNS Blizzard and couldn’t figure this out. Whom is this for?

comex · 2024-06-24T18:59:27

> Anyway, so this thing gets published many, many months after it got on some MS PM’s boss’s partner’s radar.

Arm64EC is not new. It was released back in 2021.

frozenport · 2024-05-31T03:06:57

8 year old unicorn++ with a public demo sounds credible?

frozenport · 2024-05-31T01:14:37

Samba is on gen 4 silicon and still lagging, somebody over there is doing something wrong

snhbsqub · 2024-06-08T05:22:57

How are they lagging? They are running faster than anyone else at full precision and with many many fewer chips than Groq. Groq is not real.

frozenport · 2024-06-09T20:43:22

Well I've been using the groq public api, and its approx. the rates claimed.

Economics and costs are hard to predict. For example, Groq is not using HBM chips. So probably the cards are a lot easier to source.

Its not clear what the capacity of these systems are in terms of total users, or even tokens per second. Then you factor in cost. Then you realize all vendors will match a competitors pricing. Then you realize Groq doesn't sell chips.

¯\_(ツ)_/¯

The only thing you have is the public API to benchmark against: https://artificialanalysis.ai/

snhbsqub · 2024-06-12T15:35:28

- Groq has exactly 0 dollars in revenue - Groq requires 576 chips to run a single model - Groq can do low latency inference, but can't handle batches, and can't run a diversity of different models on each deployment - Groq quantizes the models, significantly affecting quality to get more speed (and don't communicate this to end users, which is very deceptive) - Groq can only run inference, cannot train on their systems

- SambaNova has real revenue from big customers - SambaNova can run any model on a single node at the speed Groq requires - SambaNova can do low latency inference just like Groq, but can also run large batches and host hundreds of models on a single deployment - SambaNova does not quantize models unless explicitly stated - SambaNova can run training at perf competitive with Nvidia, as well as fastest inference in the world at full precision

It really isn't a competition. Groq has done great as garnering hype in recent months, but it is a house of cards.

frozenport · 2024-06-14T02:28:52

I think semi analysis commented that they have pipelines instead of batches[1].

So every clock cycle you're doing useful work rather than loading up people into batches. And thats why the arch will probably win for inference, for training you're basically competing with software eco system and silicon density. AKA NVIDIA can give TSMC more money to get more ALUs on the die.

I think other places have attempted dataflow (FPGA etc) but they all basically had buffers (due to non-determinism in networks stack and even ram). SambaNova seems indistinguishable from an FPGA with a few clock cycles difference. I think they blew their shot with a Series D ($600 million???) where they made more of the same old. Maybe Intel will buy them to augment Altera? Looks like chasing parity with existing strategies.

I buy the Groq hype because its something different, certainly the public demo helped. HN is about the future.

[1] https://www.semianalysis.com/p/groq-inference-tokenomics-spe...

frozenport · 2024-05-26T02:42:00

This misses the context.

A war between the two countries caused the death

Also the bug was known, and more of a limitation that required the system to be power cycled (and this unit wasn't).

frozenport · 2024-05-14T03:30:19

easier to color graph if you have like 4 registers :-)

frozenport · 2024-05-12T17:52:56

Appears it takes an LLM running on the fastest computers in the world to read lisp

frozenport · 2024-04-30T16:06:28

Transpiler is a kind of compiler

frozenport · 2024-04-30T08:47:02

Trademark issue, they are saying something is Nintendo, by way of their command name, when it isnt?

prmoustache · 2024-04-30T09:10:29

The emulator is not named nintendo, this is the common manpage for the gb, gba, nes and snes commands. Having said that, some if not all of them are also trademarks but since the 9front system is not sold, I am not sure trademark protection really apply. I believe the point of trademark is to prevent a product to have a same name as a competing one which is not really the case here.

glimshe · 2024-04-30T09:51:39

It does. Nintendo is able to make free products too, and trademarks are certainly about provenance. Nintendo could claim that the use of their trademark could confuse users on the origin of this free product.

frozenport · 2024-04-20T05:42:31

M2 macs can do it: https://twitter.com/junrushao/status/1681828325923389440

in practice 10 tokens per second is kinda annoyingly slow

most local people would opt for a smaller 7b model

zarzavat · 2024-04-20T08:12:13

Have been playing around with Llama3 7b today, it’s not very good. I’m sure that Facebook put everything they could into making it good, but 7B is apparently just not enough parameters.

d-z-m · 2024-04-20T16:14:54

not very good compared to what? Hard to reconcile your comment with its outsize performance on arena/glowing praise from others comparing it to much larger models.

zarzavat · 2024-04-21T10:10:20

Trying the 8B on translation gives some hilariously garbled results. It’s a small model so perhaps not unexpected but it’s definitely nowhere close to GPT-3.

I’d be cautious of synthetic benchmarks, you never know how much of the scores are due to contamination or survivorship bias.

anon373839 · 2024-04-21T10:22:55

Languages other than English are “out of scope” according to the model card, so I wouldn’t expect strong translation performance. In English, though, it’s incredibly capable for its size.

d-z-m · 2024-04-21T11:09:22

I didn't think arena was a synthetic benchmark?

pennomi · 2024-04-20T13:28:23

I assume you mean 8B? There is no Llama 3 7B.

sieszpak · 2024-04-20T14:53:38

Llama 3 8B seems sad to answer... this is the first model in a long time that has had trouble telling me how much is 3! - (factorial)

mistrial9 · 2024-04-20T15:59:36

llava-v1.5-7b-q4.llamafile yes agree that the impression is poor overall

frozenport · 2024-04-20T19:20:50

Yeah thats why OpenAI, etc don't run 7B

But for scrapping, and tool calling it represents a substantial gain.

frozenport · 2024-04-20T01:48:35

LLama3 looks particularly good at tool calling

Groq's low latency is particularly good for tool calling

Seems like two techs that will make coding obsolete :-)