variadix's comments

From the creator of Forth https://youtu.be/0PclgBd6_Zs

144 small computers in a grid that can communicate with each other


It is my understanding that SMT should be beneficial regardless of core count: it lets two threads that would otherwise stall waiting on memory fetches share a single core's ALUs, i.e. SMT improves ALU utilization in memory-bound applications with multiple threads by interleaving ALU usage while each thread waits on memory. Maybe larger caches are reducing the benefits of SMT, but it should be beneficial as long as there are many threads that are generally bound by memory latency.

In a CPU with many cores, when some cores stall waiting for memory loads, other cores can proceed using data from their own caches, and this is even more likely to happen than for SMT threads, which share the same cache.

When there are enough cores, they will keep the shared memory interface busy all the time, so adding SMT is unlikely to increase performance in a memory-throughput-limited application once there are already enough cores.

Keeping all the ALUs busy in a compute-limited application can usually be done well enough by out-of-order execution, because modern CPUs have very large execution windows from which to choose instructions to execute.

So when there already are many cores, SMT may provide negligible advantages in many cases. On server computers there are many more opportunities for SMT to improve efficiency, but on non-server computers I have encountered only one widespread application for which SMT is clearly beneficial: the compilation of big software projects (i.e. those with thousands of source files).

The big cores of Intel are optimized for single-thread performance. This optimization criterion results in bad multi-threaded performance, because MT performance is limited by the maximum permissible chip area and by the maximum possible power consumption, and a big core has very poor performance-per-area and performance-per-power ratios.

Adding SMT to such a big core improves multi-threaded performance, but it is not the best way to improve it: in the area and power budget of one big core you can implement 3 to 5 efficient cores, so replacing a big core with multiple efficient cores increases multi-threaded performance much more than adding SMT does. So unlike in a CPU that uses only big cores, SMT does not make sense in hybrid CPUs: better MT performance is obtained by keeping only a few big cores, to provide high single-thread performance, and replacing the other big cores with smaller, more efficient ones.


> Maybe larger caches are reducing the benefits of SMT, but it should be beneficial as long as there are many threads that are generally bound by memory latency.

I thought the reason SMT sometimes resulted in lower performance was that it halved the available cache per thread though - shouldn't larger caches make SMT more effective?


My understanding is that a larger cache can make SMT more effective, but like usual, only in certain cases.

Let’s imagine we have 8 cores with SMT, and we’re running a task that (in theory) scales roughly linearly up to 16 threads. If each thread’s working set is around half the cache available to that thread, but each working set is only used briefly, then SMT is going to be hugely beneficial: while one hyperthread is committing and fetching memory, the other’s cache is already filled with a new working set and it can begin computing. Increasing the cache will increase the allowable working set size without causing cache contention between hyperthreads.

Alternatively, if the working set is sufficiently large per thread (probably >2/3 the amount of cache available), SMT becomes substantially less useful. When the first hyperthread finishes its work, the second hyperthread still has to wait for some (or all) of its working set to be fetched from main memory (or higher cache levels if lucky). This may take just as long as simply keeping hyperthread #1 fed with new working sets. Increasing the cache in this scenario will increase SMT performance almost linearly, until each hyperthread’s working set can be prefetched into the lowest cache levels while the other hyperthread is busy working.

Also consider the situation where the working set is much, much smaller than the available cache, but lots of computation must be done on it. In this case, a single hyperthread can continually be fed with new data, since the old set can be evicted to main memory and the next set loaded into cache long before the current set is processed. SMT provides no benefit here no matter how large you grow the cache (unless the tasks use wildly different components of the core and can run with instruction-level parallelism - but that’s tricky to get right, and you may run into thermal or power throttling before you can actually get enough performance to make it worthwhile).

Of course the real world is way more complicated than that. Many tasks do not scale linearly with more threads. Sometimes running on 6 “real” cores vs 12 SMT threads can result in no performance gain, but running on 8 “real” cores is 1/3 faster. And sometimes SMT will give you a non-linear speedup but a few more (non-SMT) cores will give you a better (but still non-linear) speedup. So short answer: yes, sometimes more cache makes SMT more viable, if your tasks can be 2x parallelized, have working sets around the size of the cache, and work on the same set for a notable chunk of the time required to store the old set and fetch the next one.

And of course all of this requires the processor and/or compiler to be smart enough to ensure the cache is properly fed new data from main memory. This is frequently the case these days, but not always.


Let's say your workload consists solely of traversing a single linked list. This list fits entirely in L1.

As an L1 load takes 4 cycles and you can't start the next load until you've completed the previous one, the CPU will stall, doing nothing, for 3/4 of the cycles. A 4-way SMT core could in principle make use of all the wasted cycles.

Of course no workload comes even close to purely traversing a linked list, but a lot of non-HPC real-world workloads do spend a lot of time in latency-limited sections that can benefit from SMT, so it is not just cache misses.
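
A minimal sketch of the kind of latency-limited loop being described (the struct and function names are made up for illustration):

    #include <stddef.h>

    struct node {
        int          value;
        struct node *next;
    };

    /* Each iteration's load address depends on the result of the previous
       load, so even with every node resident in L1 the core waits out the
       ~4-cycle load-to-use latency per hop. Those idle issue slots are
       exactly what another SMT thread could use. */
    long sum_list(const struct node *n)
    {
        long sum = 0;
        while (n) {
            sum += n->value;
            n = n->next;        /* dependent load: serializes the whole loop */
        }
        return sum;
    }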


> so it is not just cache misses.

Agreed 100%. SMT is waaaay more complex than just cache. I was just trying to illustrate in simple scenarios where increasing cache would and would not be beneficial to SMT.


Depends greatly on the workload.

Depending on how SEO’d the thing you’re looking for is, finding quality information can range from easy (looking up docs, specs, etc.) to impossible (product recommendations), at least without knowing beforehand which sources are reliable. I’m not sure that LLMs will fix the problem; curation seems to be the issue, and none of the major players are interested in that.

More or less. Binary parsers are the easiest place to find exploits because of how hard they are to get right: bounds checks, overflow checks, pointer checks, etc., especially when the data format is complicated.
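
A hypothetical sketch of the kind of checks meant here, for a length-prefixed field in an untrusted buffer (the format and names are invented for illustration, and the length is read in host byte order for brevity):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Parse a made-up record: a 4-byte length followed by that many payload
       bytes. Every read is bounds-checked against the buffer, and the length
       arithmetic is arranged so it cannot overflow before being trusted. */
    int parse_record(const uint8_t *buf, size_t buf_len,
                     const uint8_t **payload, uint32_t *payload_len)
    {
        uint32_t len;

        if (buf_len < sizeof len)          /* enough bytes for the header? */
            return -1;
        memcpy(&len, buf, sizeof len);     /* avoids unaligned-access UB */

        if (len > buf_len - sizeof len)    /* subtract, don't add: no overflow */
            return -1;

        *payload = buf + sizeof len;
        *payload_len = len;
        return 0;
    }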


Is there any reading on this topic? By binary parsing, I guess you mean code that parses, say, PNG or WAD files?



This just seems like a fundamental misunderstanding of what an LLM is, where people anthropomorphize it as an agent of whatever organization produced it. If Google provides search results with instructions for getting away with murder, building explosives, etc., it’s ridiculous to interpret that as Google itself supporting the individual’s goals/actions rather than as misuse of the tool by the user. Consequently, banning Google search from the App Store in response would be a ridiculous move. This may just be a result of LLMs being new to humanity, or maybe it’s because it feels more like talking to an individual than to a search engine, but it’s a flawed view of what an LLM is.


Being able to use macro expansion, stringification, concatenation, etc. in GCC inline asm is a major advantage over how, e.g. MSVC handled inline asm. Getting the constraints right is the hard part, the syntax isn’t much of an issue beyond the initial learning curve imo.
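
For instance, a small hypothetical sketch (GCC/Clang on x86-64, AT&T syntax) of splicing a C-level constant into the asm template via macro expansion, stringification, and string-literal concatenation:

    #define SHIFT_AMOUNT 3
    #define STR_(x) #x
    #define STR(x)  STR_(x)

    /* "shl $" STR(SHIFT_AMOUNT) ", %0" expands to "shl $3, %0" at compile
       time, so the immediate in the asm tracks the C-level definition. */
    static inline unsigned long shl_by_const(unsigned long x)
    {
        __asm__ ("shl $" STR(SHIFT_AMOUNT) ", %0"
                 : "+r"(x)
                 :
                 : "cc");
        return x;
    }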


Yes it’s a bad thing, both scientifically and societally. For science, having taboo subjects limits our ability to understand the world. For society, it obfuscates what an effective solution to societal problems looks like.


Worth noting that the metadata structure can be done more efficiently with a flexible array member, but it requires alignment to alignof(max_align_t) to obey the ABI. The structure will consume the same amount of memory (16 bytes on x86-64), but you avoid a dereference and you have an extra pointer’s worth of space to use for whatever.
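
A rough sketch of that layout, assuming x86-64, where two size_t fields already put the flexible array member at a 16-byte (alignof(max_align_t)) offset; the names are hypothetical:

    #include <stddef.h>
    #include <stdlib.h>
    #include <string.h>

    /* Length/capacity stored inline ahead of the characters via a flexible
       array member. With two size_t fields the header is 16 bytes on x86-64,
       so the data pointer handed out is already suitably aligned. */
    struct str_hdr {
        size_t len;
        size_t cap;
        char   data[];              /* flexible array member */
    };

    static char *str_new(const char *src)
    {
        size_t len = strlen(src);
        struct str_hdr *h = malloc(sizeof *h + len + 1);
        if (!h)
            return NULL;
        h->len = len;
        h->cap = len + 1;
        memcpy(h->data, src, len + 1);
        return h->data;             /* callers see a plain char pointer */
    }

    static size_t str_len(const char *s)
    {
        /* step back from the character data to the header: no second
           allocation to chase */
        const struct str_hdr *h =
            (const struct str_hdr *)(s - offsetof(struct str_hdr, data));
        return h->len;
    }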


Having to do a load to get the metadata is probably why. Say you have a big array of strings, and you care about the compactness of that array for cache reasons, so keeping the string structs small matters. You probably want to do some operation on all of them, e.g. accumulate their lengths so you can determine how much space is required to write them out. Now this either requires a byte scan for the null terminator to determine the length, or a (probably cold) load from the heap for the metadata, plus a second (also probably cold) load from the heap.

Generally when you do memory compactification of non-serialized structures, it’s for cache-related performance improvements, so unless the common case is overwhelmingly small strings there’s probably not a benefit. It’s worth testing to see where this is an improvement, but my guess is that it helps only in extremely limited regimes. Maybe some bit manipulation instead of a traditional byte scan could yield improvements; likewise, it may be worth going to larger SSO sizes, using SIMD, so that a larger share of cases stay on the SSO path.
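
As a rough illustration of the trade-off (hypothetical layouts, not any particular real implementation):

    #include <stddef.h>
    #include <string.h>

    struct small_str { char *p; };              /* 8 bytes; length found by
                                                   scanning or heap metadata */
    struct fat_str   { char *p; size_t len; };  /* 16 bytes; length is inline */

    /* Compact layout: accumulating lengths does a byte scan (or a probably
       cold heap load for out-of-line metadata) per element. */
    size_t total_small(const struct small_str *v, size_t n)
    {
        size_t total = 0;
        for (size_t i = 0; i < n; i++)
            total += strlen(v[i].p);
        return total;
    }

    /* Fat layout: the whole loop stays inside the hot, contiguous array. */
    size_t total_fat(const struct fat_str *v, size_t n)
    {
        size_t total = 0;
        for (size_t i = 0; i < n; i++)
            total += v[i].len;
        return total;
    }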

If you’re interested in data structure optimizations like this, there’s a CppCon talk about Facebook’s string implementation that has some clever tricks.

