Flow Computing aims to boost CPUs with ‘parallel processing units’ (ieee.org)
128 points by rbanffy 51 days ago | 81 comments



Does anyone know what they mean by "wave synchronization"? That's supposedly their trick to prevent all those parallel CPUs from blocking waiting for data. Found a reference to something called that for transputers, from 1994.[1] May be something else.

Historically, this has been a dead end. Most problems are hard to cut up into pieces for such machines. But now that there's much interest in neural nets, there's more potential for highly parallel computers. Neural net operations are very regular. The inner loop for backpropagation is about a page of code. This is a niche, but it seems to be a trillion dollar niche.

Neural net operations are so regular they belong on purpose-built hardware. Something even more specialized than a GPU. We're starting to see "AI chips" in that space. It's not clear that something highly parallel and more general purpose than a GPU has a market niche. What problem is it good for?

[1] https://www.sciencedirect.com/science/article/abs/pii/014193...


GPUs have wavefronts so I assume it is similar? Here is a page that explains it:

https://gpuopen.com/learn/occupancy-explained/


Nope.

AMD's "wavefront" is an obfuscated word for what NVIDIA calls "warp".

NVIDIA's "warp" is an obfuscated word for what the computer literature has, for many decades, called a "thread". (NVIDIA's "thread" is an obfuscated word that means something different from what it means in the non-NVIDIA literature.)

NVIDIA thought it was a good idea to create its own terminology in which many traditional terms were renamed for no reason. AMD then thought it was a good idea to take the entire NVIDIA terminology and replace all the terms yet again with other words.


I'd assume that 'warp' is taken from textiles: https://en.m.wikipedia.org/wiki/Warp_and_weft

A warp is a thread, but a thread within a matrix of other threads.

I'm not into GPU programming, but doesn't nvidia have some notion of arranging threads in a matrix sort of like this?


Nope.

In NVIDIA parlance, a thread is the body of a "parallel for" structure, i.e. the sequence of operations performed for one array element, which is executed by one SIMD lane of a GPU.

A "warp" is a set of "threads", normally 32 of them on NVIDIA GPUs; the number of "threads" in a "warp" is the number of SIMD lanes of the execution units.

CUDA uses what Hoare (1978) named an "array of processes", which in many programming languages is called "parallel for" or "parallel do".

This looks like a "for" loop, but its body is not executed sequentially in a loop; instead, the body is executed concurrently for all elements of the array.

A modern CPU or GPU consists of many cores; each core can execute multiple threads, and each thread can execute SIMD instructions that perform an operation on multiple array elements, one per SIMD lane.

When a parallel for is launched, the array elements are distributed over all the available cores, threads and SIMD lanes. In the case of NVIDIA, the distribution is handled by the CUDA driver, so it is transparent to the programmer, who does not have to know the structure of the GPU.

NVIDIA's use of the word "thread" would correspond to reality only if the GPU did not use SIMD execution units. Real GPUs use SIMD instructions that process a number of array elements typically between 16 and 64. NVIDIA's "warp" is the real thread executed by the GPU, which processes multiple array elements, while NVIDIA's "thread" is what a thread of a fictitious non-SIMD GPU would execute, processing only one array element per thread.
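
To make the mapping concrete, here is a minimal CUDA sketch (the saxpy kernel, the block size of 256 and the array size are just illustrative): each NVIDIA "thread" handles one array element, and the hardware groups 32 of them into a "warp" that executes in SIMD lockstep.

    #include <cstdio>
    #include <cuda_runtime.h>

    // The body of the "parallel for": one NVIDIA "thread" handles one array element.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // which array element am I?
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        // 256 "threads" per block; the hardware slices each block into "warps"
        // of 32 that execute in SIMD lockstep on the execution units.
        int block = 256;
        int grid = (n + block - 1) / block;
        saxpy<<<grid, block>>>(n, 3.0f, x, y);
        cudaDeviceSynchronize();

        printf("y[0] = %f\n", y[0]);  // expect 5.0
        cudaFree(x);
        cudaFree(y);
        return 0;
    }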


I dunno, it still sounds to me that nvidia is taking their (admittedly inaccurate) concept of a thread, putting a bunch in parallel, and calling that a warp to be cute.

I think the analogy still makes a kind of sense if you accept it at face value and not worry about the exact definitions. Which is really all it needs to do, IMO.

Again, I don't really know anything about GPUs, just speculating on the analogy.


Agreed that warp is a marketing term, but it is definitely not something that should be called "threads" except in the very loosest sense of the term.

A bunch of threads in parallel implies MIMD parallelism- multiple instructions multiple data.

A warp implies SIMD parallelism - single instruction multiple data (although technically SIMT, single instructions multiple threads https://en.wikipedia.org/wiki/Single_instruction,_multiple_t...).

From both a hardware and software perspective those are very different types of parallelism that Nvidia's architects and the architects of its predecessors at Sun/SGI/Cray/elsewhere were intimately familiar with. See: https://en.wikipedia.org/wiki/Flynn%27s_taxonomy
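
To illustrate (a hedged sketch; the kernel and the even/odd branch are made up for the example): under SIMT, when lanes of the same warp disagree on a branch, the hardware executes both paths with inactive lanes masked off, which is exactly what truly independent MIMD threads would not have to do.

    #include <cstdio>
    #include <cuda_runtime.h>

    // All 32 lanes of a warp share one instruction stream, so when they disagree
    // on the branch below, the hardware runs both paths with lanes masked off.
    // Independent (MIMD) threads would not pay this serialization cost.
    __global__ void divergent(const int *in, int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (in[i] % 2 == 0)
            out[i] = in[i] * 2;   // even lanes execute this path first...
        else
            out[i] = in[i] + 1;   // ...then the odd lanes execute this one.
    }

    int main() {
        const int n = 64;
        int *in, *out;
        cudaMallocManaged(&in, n * sizeof(int));
        cudaMallocManaged(&out, n * sizeof(int));
        for (int i = 0; i < n; ++i) in[i] = i;
        divergent<<<1, n>>>(in, out, n);  // 64 "threads" = 2 warps
        cudaDeviceSynchronize();
        printf("out[2] = %d, out[3] = %d\n", out[2], out[3]);  // 4 and 4
        cudaFree(in);
        cudaFree(out);
        return 0;
    }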


> We're starting to see "AI chips" in that space.

"Positronic" came to my mind.


My god the future sucks far more than we could have ever imagined. Imagine being sold a chatbot and being told it's an android!


FWIW, I already carry an android in my pocket.


The reason problems are hard to fit into most of what's tried is that everyone is trying to save precious silicon space and fit a specific problem, adding special purpose blocks, etc. It's my belief that this is an extremely premature optimization to make.

Why not break it apart into homogeneous bitwise operations? That way everything will always fit. It would also simplify compilation.


Seems like a nice idea — instead of the stark CPU/GPU divide we have today, this would fit somewhere in the middle.

Reminds me slightly of the Cell processor, with its dedicated SPUs for fast processing, orchestrated by a traditional CPU. But we all saw how successful that was (: And that had some pretty big backing.

Overcoming the inertia of the current computing hardware landscape is such a huge task. Maybe they can find some niche(s).


> Reminds me slightly of the Cell processor, with its dedicated SPUs for fast processing, orchestrated by a traditional CPU. But we all saw how successful that was (:

The success of the Cell is more subtle:

- (It seems that many) game developers hated it because it was so different to program from the CPUs of the other game consoles of its time (in particular the CPU of the Xbox 360). For game studios, time to market and portability of a game to other consoles are important.

- On the other hand, scientists who ported their high-performance numerical computations to the Cell seem to have loved it. Such software is often custom-built for the underlying hardware, and here hardware cost and achievable speed are the measures by which the hardware is evaluated. On that score, the Cell processors of a PS3 cluster were much more competitive than the other available solutions (GPGPU did not really exist at the time).


Gabe Newell of Valve famously hated the Cell architecture, and I think that's very illustrative. He is of the generation of game devs that was very willing to try wild algorithms and hand-massage assembly and use all tricks to get 3D fast, so the PS3 should have been a perfect fit. But he did not like to have to start back at square one.


Isn't that because game developers use conditional statements more, and scientists typically have a flow-graph that describes a computation and this computation doesn't have conditional parts? So it is a more natural fit?


I don't know, in particular concerning the game developer perspective.

But from my observation, scientists who develop high-performance computing algorithms often think much deeper about the mathematical structure of their problems than game developers do. I thus have a feeling that what you describe as "flow-graph that describes a computation" is rather a result of this deep analysis.

I can easily imagine that this would partly work for video games too, but I would hypothesize that either it is too much work that is not really rewarded in the game industry (which is known for having to churn out lots of new code fast, hence "crunch time"), or, if you are deeply into this kind of thinking, the game industry might not be the most rewarding place to work.


The flow graph approach is also relevant if you want to keep track of the rounding errors of your floating point operations. Every float is actually an interval (containing all the real numbers that round to it), and you might want to keep track of the size of the intervals along with the numbers you get from the floating point units.
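
A hedged sketch of that idea (the Interval struct and the kernel are illustrative, not a library; the directed-rounding intrinsics __fadd_rd/__fadd_ru are real CUDA device functions): every value carries a lower and an upper bound, and each operation rounds outward so the true result always stays inside.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Illustrative interval type: [lo, hi] is guaranteed to contain the exact
    // real-valued result of the computation so far.
    struct Interval {
        float lo, hi;
    };

    __device__ Interval iadd(Interval a, Interval b) {
        Interval r;
        r.lo = __fadd_rd(a.lo, b.lo);  // add, rounding toward -infinity
        r.hi = __fadd_ru(a.hi, b.hi);  // add, rounding toward +infinity
        return r;
    }

    // Single-threaded on purpose, for clarity: sum n floats while tracking bounds.
    __global__ void sum_with_bounds(const float *x, int n, Interval *result) {
        Interval acc = {0.0f, 0.0f};
        for (int i = 0; i < n; ++i) {
            Interval xi = {x[i], x[i]};
            acc = iadd(acc, xi);
        }
        *result = acc;
    }

    int main() {
        const int n = 1 << 20;
        float *x;
        Interval *result;
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&result, sizeof(Interval));
        for (int i = 0; i < n; ++i) x[i] = 0.1f;
        sum_with_bounds<<<1, 1>>>(x, n, result);
        cudaDeviceSynchronize();
        printf("the sum lies somewhere in [%.6f, %.6f]\n", result->lo, result->hi);
        cudaFree(x);
        cudaFree(result);
        return 0;
    }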


a game developer might have an algorithm that they have to implement.

a computational scientist might find an alternative algorithm that makes different, still-acceptable tradeoffs and yet fits more naturally in a new arch.


On the other hand, the game dev might just make the graphics slightly less realistic, whereas the scientist has to simulate the true physical equations.


> whereas the scientist has to simulate the true physical equations.

Every model of nature is an approximation. In high-performance computing, it is often a serious tradeoff where to "cut corners" to make the computation even feasible, e.g.:

- Which grid size or tesselation do we use?

- Do we use a more regular grid to make use of optimizations or do we use a more irregular grid/tesselation for more precision at the places where it matters?

- Do we use a simpler or more sophisticated model?

- For multi-scale problems: do we compute each scale individually and combine the results?

- etc.

EDIT: In my opinion the central difference from game development is that in science each such possible optimization, where one might cut corners, has to be deeply analyzed. Simply using an easy, pragmatic shortcut that "looks sufficiently good" is rarely possible in science.


> Reminds me slightly of the Cell processor

I was thinking the same.

Also the Tilera CPU with many cores and mesh network (back in mid 2000's, eventually bought by Nvidia and used in something, don't remember).

Tangent:

Back in early 2000's I had a hobby project (ALife with ANN brain) and I was looking for more computation. Multiple CPU's was not ideal, GPU wasn't ideal because the read/write/computation model only matched 1/2 of my ANN's flow and was a mismatch for the other half.

I read about a new cpu and I ended up talking to one of the key guys from Tilera, I was pretty impressed they would take the time to talk to some random guy working on a hobby project.

I asked about the performance of the individual computational units (assuming custom could beat the industry), and he surprised me when he responded, "nobody is going to beat Intel at integer, you won't get an increase from that perspective."



I'd believe more in a heterogenous chip (e.g. MI300X, Apple M series, or even APUs) than in completely new chip tech.


It’s still two separate ISAs.


I have a hard time believing that dealing with a singular ISA for two different compute cases would be an "obviously better" solution. I don't doubt that it's plausible/possible, especially if RISC-V + extensions are still simple, but then again, that's far outside my wheelhouse.


Even with a GPU sharing memory with your CPU, calling GPU code is never as simple as doing a jump to that code. You need to set up the code the GPU will run and then let it loose on the data.


What they say is far too vague, so it is impossible to know whether they have any new and original idea.

It is well known that CPU cores optimized for high single-threaded performance are bad for multithreaded tasks, because they have very poor performance per watt and per unit of area, so you cannot put many of them in a single package: there are limits on both die area and power dissipation.

There are 3 solutions for this problem, all of which are used in many currently existing computers.

1. A hybrid CPU can be used, which has a few cores optimized for single-threaded performance and many cores optimized for multithreaded performance, like the Intel E-cores or the AMD compact cores.

2. One can have one or more accelerators for array operations, which are shared by the CPU cores and whose instruction streams are extracted from the instruction streams of the CPU cores (much as, in CPUs from many decades ago, floating-point instructions were extracted from the CPU instruction stream and executed by floating-point coprocessors).

The instructions executed by such accelerators must be defined in the ISA of the corresponding CPUs. Examples are the Arm SME/SME2 (Scalable Matrix Extension) and the Arm SSVE (Streaming Scalable Vector Extension) instruction sets. These ISA extensions are optional starting from Armv9.2-A or Armv8.7-A. AFAIK, for now only the recent Apple CPUs support them, but in the future the support for them might become widespread.

3. The last solution is to have an accelerator for array operations that has a mechanism, independent of the CPU cores, for fetching and decoding its own instruction stream. The CPU cores have to launch programs on such accelerators and get the results when they are ready. Such completely independent accelerators are either parts of GPUs, or they may be completely dedicated to computing tasks, in which case they no longer include the special-function graphics hardware.
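
For concreteness, this third model is the familiar offload pattern. A minimal sketch, using CUDA as a stand-in for any accelerator that fetches its own instruction stream (the scale kernel and the sizes are only illustrative):

    #include <cstdio>
    #include <cuda_runtime.h>

    // The accelerator runs its own instruction stream; the CPU core only
    // prepares data, launches the program and collects the results.
    __global__ void scale(float *data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int n = 1 << 16;
        float *data;
        cudaMallocManaged(&data, n * sizeof(float));
        for (int i = 0; i < n; ++i) data[i] = 1.0f;

        // Launch a program on the accelerator; it fetches and decodes its own
        // instructions, independently of the CPU core that launched it.
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        scale<<<(n + 255) / 256, 256, 0, stream>>>(data, 2.0f, n);

        // The CPU core is free to do unrelated work here...

        // ...and only picks up the results when they are ready.
        cudaStreamSynchronize(stream);
        printf("data[0] = %f\n", data[0]);  // expect 2.0

        cudaStreamDestroy(stream);
        cudaFree(data);
        return 0;
    }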

Any up-to-date laptop CPU already includes inside its package at least 2, if not all 3 of these solutions, to provide a good multithreaded performance.

For servers, it is much less useful to have all these variants in a single package, because one can mix, for instance, one server with big cores for high single-threaded performance with many servers using far more compact cores per socket for good multithreaded performance, and each server can use multiple discrete GPUs.

It is not clear with whom this "Flow Computing" wants to compete. They certainly cannot make better compact cores than Intel, AMD or Arm. They cannot make something like a SME accelerator, because that must be tightly integrated with the cores for which it functions as a coprocessor.

So their "parallel processing units" can be only competitors for the existing GPUs or NPUs. Due to their origins in execution units for shader programs the current GPUs are not versatile enough. There still are programs that are easy to run on CPU cores but it is difficult to convert them to a form that can be executed by GPUs. So there would be a place for someone that could design an architecture more convenient than that of the current GPUs.

However there is no indication in that article that there exists any problem for which the "Flow Computing" PPUs are better than the current GPUs or NPUs. If the PPUs have some kind of dataflow structure, then their application domain would be even more restricted than for the current GPUs and NPUs.

EDIT:

Now I have read their whitepaper "Design goals, advantages and benefits of Flow Computing", from HotChips.

However, what that paper says about their patented architecture raises more questions than it answers.

Their description of the PPUs is very similar to the description of the Denelcor HEP from 1979. HEP (Heterogeneous Element Processor) was an experimental computer designed by Denelcor, Inc., intended to compete with supercomputers like the Cray-1 (1976).

While HEP was based on very good ideas, its practical implementation was very poor, using non-optimized and obsolete technology compared with Cray, so it never demonstrated competitive performance. The lead architect of HEP later founded the Tera Computer Company, in 1987, which designed computers based on the same ideas as HEP. Tera Computer had very modest results, but it somehow succeeded in 2000 in buying the Cray Research division of Silicon Graphics, after which it was renamed Cray, Inc. (now a subsidiary of HPE).

While the Cray-1 and its predecessors (TI ASC and CDC STAR) exploited the parallelism of hardware pipelines through array operations, which provide independent operations on distinct array elements that can be executed in parallel in different pipeline stages, HEP exploited the parallelism of hardware pipelines through fine-grained multithreading, where independent instructions from distinct threads are executed in parallel in different pipeline stages.

HEP had multiple CPU cores ("core" was not a term used at that time). Each CPU core was an FGMT core, which could switch on every clock cycle among an extremely large number of threads. (FGMT is a term that was introduced only much later, in 1996, with its abbreviation only in 1997; at the time of HEP, they used the term "fine-grained multiprogramming".)

The very large number of threads executed by each FGMT core (e.g. hundreds) can hide the latencies of data availability.

The description of the "Flow Computing" PPUs is about the same as for HEP (1979), i.e. they appear to depend on FGMT with a very large number of threads (called "fibers" by Flow Computing) to hide the latencies. Unlike GPUs and NPUs, but like HEP, it seems the "Flow Computing" PPUs rely mainly on multithreading (a.k.a. TLP) to provide parallelism, and not on array operations (a.k.a. DLP).
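
This is the same latency-hiding trick current GPUs use with resident warps, so a hedged CUDA sketch can illustrate the principle (the gather kernel and the block size of 256 are just for illustration): launch far more threads than there are execution units, and while some wait on memory the scheduler issues instructions from the others.

    #include <cstdio>
    #include <cuda_runtime.h>

    // A memory-bound gather: each thread spends most of its time waiting for a
    // hard-to-predict global load. The hardware hides that latency by keeping
    // many warps resident per SM and switching among them every cycle -- the
    // same idea HEP applied to threads in 1979.
    __global__ void gather(const float *src, const int *idx, float *dst, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) dst[i] = src[idx[i]];
    }

    int main() {
        const int blockSize = 256;
        int maxBlocksPerSM = 0;
        // How many blocks of this kernel one SM can keep resident at once:
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, gather,
                                                      blockSize, 0);
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("%s: up to %d resident threads per SM available to hide latency\n",
               prop.name, maxBlocksPerSM * blockSize);
        return 0;
    }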

The revival of this old idea could actually be good, but the whitepaper does not provide any detail that would indicate whether they have found a better way to implement this.


Isn't this where "NPUs" are going now?


SIMD units already fit somewhere in the middle.


Discussion (28 points, 3 months ago, 32 comments) https://news.ycombinator.com/item?id=40650662


This is based on legitimate (although second-tier) academic research that appears to combine aspects of GPU-style SIMD/SIMT with Tera-style massive multithreading. (main paper appears to be https://www.utupub.fi/bitstream/handle/10024/164790/MPP-TPA-... )

Historically, the chance of such research turning into a chip you can buy is zero.


Systolic processing, circa 1979. A concept that gets reinvented every decade:

https://en.wikipedia.org/wiki/Systolic_array


And it always gets prematurely optimized to fit a specific problem instead of being made a general-purpose compute engine.

FPGAs are essentially horrible systolic arrays. They're lumpy, and they have weird routing hardware that isn't easy to abstract away. That leads to multiple-day compile times in some cases.

They don't pipeline things by default. The programming languages used are nowhere near a good fit to the hardware.

It's just a mess. It happens every time.


Systolic arrays are only used for fixed function computing within the context of individual instructions, e.g. a matrix multiplication or convolution unit. In practice however, most of these arrays are wave front processors, because programmable processing elements can take variable amounts of time per stage and therefore become asynchronous.

AMD's XDNA NPU is based around a wave front array of 32 compute tiles, each of which can either perform a 4x8 X 8x4 matrix multiplication in float16 or a 1024 bit vector operation per cycle.


It's funny how everyone in this thread is wrong on the high-level concepts (the blind leading the blind), but here you're wrong on the specifics too.

> around a wave front array of 32 compute tiles, each of which can either perform a 4x8 X 8x4 matrix multiplication in float16 or a 1024 bit vector operation per cycle.

1. XDNA is exactly the opposite of a "wavefront" processor. Each compute core is single thread vector VLIW. So you have X number of independent fixed function operators.

2. There is no XDNA product with 32 cores and 4x8x4 matmul. Phoenix has 20 compute cores (16 usable) and performs 4x8x4. Strix has 32 and performs 16x32x8 matmul.


Does anyone have any knowledge/understanding on how this is (or isn't?) fundamentally different from Intel's Xeon Phi?

https://en.wikipedia.org/wiki/Xeon_Phi


How is it the same? The Xeon Phi was basically just smaller/weaker cores but more of them and with their SIMD units still there.


When will we get the “Mill” cpu?


I've been following that saga for a long time. Seems mostly like vapourware sadly.


Probably implementation hell. The big idea theoretically works, but when you get into the details the compromises for implementation steal away the gains.

I'm always reminded of the rotary internal combustion engine: in theory there's a whole suite of benefits, in practice they're "interesting" but not that great once practically built.


More time spent filing patents than implementation.


At this point, probably never it seems...


And considering their effort in patenting every idea related to it, we’ll only see it when the first implementer gets sued.


How is this different from an integrated GPU, other than that it presumably doesn't do graphics?


I'm probably missing something, but why not use GPUs for parallel processing?


GPUs work on massive amounts of data in parallel, but they execute basically the same operations every step, maybe skipping or slightly varying some steps depending on the data seen by a particular processing unit. But processing units cannot execute independent streams of instructions.

GPUs of course have several parts that can work in parallel, but they are few, and every part consists of a large number of units that execute the same instruction stream simultaneously over a large chunk of data.


This is not true. Take the Nvidia 4090: 128 SMs x 4 = 512 SMSPs. That is the number of warps which can execute independently of each other. In contrast, a warp is a 32-wide vector, i.e. 32 of the "same operations", with up to 512 different batches in parallel. So it's more like a 512-core 1024-bit vector processor.

That being said, I believe the typical number of warps to saturate an SM is normally around 6 rather than 4, so more like 768 concurrent 32-wide "different" operations to saturate compute. Of course, the issue with that is you get into overhead problems and memory bandwidth issues, both of which are highly difficult to navigate around -- the register file storing all the registers of each thread is extremely power-hungry (in fact, the most power-hungry part, I believe), for example.

A PPU with a smaller vector width (e.g. AVX512) would have proportionally more overhead (possibly more than linearly so in terms of the circuit design). And this is without talking about how most programs depend on latency-optimized RAM (rather than bandwidth-optimized GDDR/HBM).


I'm happy to stand corrected; apparently my idea about GPUs turned obsolete by now.


The Nvidia 4090 indeed has 128 SMs, but the formula you provided (128 SMs = 4x128=512 SMSPs) isn't quite accurate. Each SM contains 64 CUDA cores (not SMSPs), and these are the units responsible for executing the instructions from different warps. The term "SMSP" isn't typically used to describe CUDA cores or warps in Nvidia’s architecture.


winwang’s comment is correct, yours is wrong.

“cuda core” refers to one lane within the SIMT/SIMD ALUs. These lanes within a SMSP don’t execute independently.

The term SMSP is definitely used for nvidia’s architecture :

https://docs.nvidia.com/nsight-compute/ProfilingGuide/index....

> smsp

> Each SM is partitioned into four processing blocks, called SM sub partitions. The SM sub partitions are the primary processing elements on the SM.

(Note that this kernel profiling guide doesn’t use the term “cuda cores” at all)

Also there are 128 “cuda cores” per SM in 4090, not 64 : https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvid...

> Each SM in AD10x GPUs contain 128 CUDA cores
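
For anyone who wants to check these numbers on their own card, a short sketch (the x4 for sub-partitions is hard-coded here and only applies to recent architectures; the per-SM "CUDA core" count is not exposed by the runtime API and comes from the whitepapers):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int d = 0; d < count; ++d) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);
            // multiProcessorCount = SMs; recent NVIDIA SMs have 4 sub-partitions
            // (SMSPs), i.e. 4 warp schedulers, each issuing for its own warps.
            printf("%s: %d SMs, warp size %d, ~%d warp schedulers\n",
                   prop.name, prop.multiProcessorCount, prop.warpSize,
                   prop.multiProcessorCount * 4);
        }
        return 0;
    }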


> So, it's more like a 512-core 1024-bit vector processor.

I’m disappointed nobody tried to make a computer with only GPU-like cores.

That it’s possible we already know - the initial bring-up of a Raspberry Pi is done by the GPU, before the ARM cores are started, so, for that brief moment, there is a GPU doing some very CPU things.


I mean, you can definitely do CPU-like things on the GPU. Just... why would you make a bandwidth-optimized processor your primary processor, when most programs rely on low-latency single-thread perf? Where programs are parallel, they are almost never vector-parallel (i.e. SIMD). Also, most modern processors have vector support, with AMD typically supporting AVX512 (already half the width of a GPU core), though I believe AVX is not quite as advanced(?) in terms of hardware intrinsics. Vector instructions are also not free, typically requiring the core to downclock due to power/heat.

And then there's all the supportive smartness of the CPU, like the relatively large L1/L2 cache per core, huge branch predictors, etc. What I'm trying to get at is that GPUs aren't just dumb parallel CPUs, but rather represent a significant divergence from typical CPU architecture.

But if you're interested in CPU-like GPU stuff, the term is GPGPU -- and I'm very interested in it myself.


GPUs have evolved to match their workloads and have therefore become large collections of wide SIMD devices tailored to run relatively simple computations repeatedly across long vectors.

What if we turned the idea upside down and allowed them to be more complicated? Branches don't work well with SIMD, but when you run a SIMD processor over a vector of length one, you can start thinking of code with lots of branches, branch prediction, and of using the rest of the SIMD unit for speculative execution or superscalar operations. Now you have something with characteristics of both, and a small piece of the GPU could run scalar code efficiently. The memory system would need to be different: perhaps lots of scratchpad memory on very wide buses, with high throughput and high latency for the GPU-like workloads, and a lower-throughput, lower-latency system for the branchier workload. Explicit cache management seems to be a need here.

Not sure what it'd look like if someone implemented such a thing, but I'd love to see it.


Because GPUs are physically built to manage parallel tasks, but only a few kinds of them.

They are very specialized.

CPUs are generic; they have lots of transistors to handle a lot of different instructions.


I think this is a really unfortunate way to explain it. The issue is not that CPUs have a lot of different instructions - hardly anyone uses decimal math instructions, for instance, and no one cares about baroque complex addressing modes.

The difference is that GPU code is designed to tolerate latency by having lots of loop iterations treated as threads. A modern CPU tolerates latency by maintaining the readiness of hundreds of individual instructions ("in flight") - essentially focusing on minimizing the execution latency of each instruction. (which also explains how CPUs use caches and very high clocks, but wind up with somewhat fewer cores and threads.)

(note that I'm using cores and threads correctly here, not the nvidia way.)


Also moving data to and from the GPU takes MUCH more time than between CPU cores (though combined chips drastically lower this difference).


In the olden days of GPGPU this was certainly true.

Is it still true? Or do you just mean "the latency overhead for setting up a single PCIe transaction is much larger than flinging a cache line across QPI/etc"?


I’m still waiting for a clockless core… some day


www.greenarraychips.com


If I understood correctly, the GreenArrays chips don't have a global clock, but each individual core does, via a local ring oscillator. So one core might run at a different speed than its neighbors, and the communication system handles that.

A more famous clockless processor was Amulet, an asynchronous implementation of the ARM. https://en.wikipedia.org/wiki/AMULET_(processor)


I can't prove you wrong, but the documentation for the F18A core seems to imply it's asynchronous:

> The F18A is built from asynchronous logic, therefore its instruction times are naturally approximate. The time required for all activity varies directly with temperature, inversely with power supply voltage (VDD), and randomly within a statistical distribution due to variations in the process of fabricating the chips themselves. Additional variations may occur with minor revisions in the version of F18 technology used by a particular chip. Applications should not be designed to depend upon the speed of activities within F18 computers for timing purposes without taking all of these variables into consideration.

https://www.greenarraychips.com/home/documents/greg/DB001-22...


I am more familiar with the F18A predecessors, such as the F21, than with this design. So it is quite possible that it is a proper asynchronous processor even if the older ones were not.

I am biased by some confusion I observed about the F21. There are two ways to design an ALU: an efficient solution is to combine the inputs and control signals with as little logic as possible to generate the result; the other is to have a separate circuit for each operation and then have the control signals drive a multiplexer that selects the desired result.

A great example of the first style is the ALU in the ARM1:

http://daveshacks.blogspot.com/2015/12/inside-alu-of-armv1-f...

Many simple educational processors use the second style, as did the F21. In addition, the adder in the F21 was a simple ripple carry adder. In theory, you would have to use a clock slow enough that the + instruction would have enough time to generate the correct result even in the worst case. In the first ALU style that would certainly be true, but with how the F21 was actually designed just executing one or two NOP instructions before the + would work just as well. As soon as the values of the T and N registers were stable the adder would start to ripple and calculate no matter what instructions were being executed. Only when we get to the end of the + instruction does the result need to be ready.

This allowed the amazing demonstrations of some F21 chips, which used 800nm technology, running at 800MHz. This hand scheduling of code to allow frequencies beyond what some instructions could handle was described at the time by several people as being an asynchronous processor design, which it wasn't. That is why I am a bit wary when I see that term used for the descendants of the F21, even if it turns out to be correct.

Another interesting issue with running the F21 at 800MHz was that Chuck Moore's OKAD tools did not originally simulate heat transfer. As a result two transistors were a little smaller than they should have been and could burn out at such high frequencies. Inserting NOPs between instructions allowed these transistors to cool down enough not to be a problem. His next version of OKAD included a neat thermal model in order to avoid that mistake in future designs.


Thanks for the detailed information! The comment from the docs about the speed varying with temperature and VDD would certainly also be true of a CPU with a local ring oscillator, so if predecessors were also erroneously referred to as asynchronous, that does make me suspicious as well.


Considering that placing and routing even synchronous digital logic is at the edge of what we can do computationally, I really don't see this happening anytime soon.


It is far easier to join two asynchronous blocks into a larger system than two synchronous blocks. So place and route would be less critical.


Whenever I see the word "fintech", this is the article I am expecting. Instead I am disappointed by some drivel about banks.

I am not sure what is wrong with me; you would think my brain would have figured it out by now, but it always parses it wrong. Perhaps if it were "finctech" that would help.


I’ve not yet had this problem but I surely will now! Thanks I guess.


The word "fintech" is disappointing, mostly because every classical old-school finance company with a minor tech component calls itself a fintech. E.g. something like Solaris Bank is a fintech, but BaaS providers are a rarity and not the average fintech.


“Now, the team is working on a compiler for their PPU” good luck!


While the Itanium failed, the Ageia PPU did succeed with its compiler. It was acquired by NVIDIA and became CUDA.

https://en.wikipedia.org/wiki/Ageia


It did indeed get merged into the CUDA group but I think the internal CUDA project predated it, or at least, several of the engineers working on it did


I always thought CUDA grew out of Ian Buck's PhD thesis under Pat Hanrahan; why do you credit Ageia?


That's not the same PPU, is it?


Is this like the Itanium architecture with its compiler challenges?


Indeed, a very smart compiler would be necessary, perhaps too much for the current state of the compiler art, as with the Itanium.

But... how about specializing in problems with inherent parallelism? LLMs maybe?


The Itanium C and FORTRAN compilers eventually became very, very good. By then, the hardware was falling behind. Intel couldn't justify putting it on their latest process node or giving it the IPC advancements that were developed for x86.

If you wanted to do something similar right now, it's possible to succeed. Your approach has to be very different. Get a lot of advancements into LLVM ahead of time, perhaps. Change the default ideas around teaching programming ("use structured concurrency except where it is a bad idea" vs "use traditional programming except where structured concurrency makes sense", etc)

But no, throwing a new hardware paradigm out into the world with nothing but a bunch of hype is not going to work. That could only work in the software world.


> Now, the team is working on a compiler for their PPU

I think a language is also required here. Extracting parallelism from C++ is non-trivial.


Something similar to CUDA or OpenCL should do it, right?


Tell us something new, please.


This is a duplicate of https://news.ycombinator.com/item?id=40650662, and this article has nothing new in it.



