Not to draw attention to myself or anything, but if you're interested in learning to make cores for the Analogue Pocket or MiSTer (or similar) platforms, I highly recommend taking a look at the resources and wiki I'm slowly building - https://github.com/agg23/analogue-pocket-utils/
I started ~7 months ago with approximately no FPGA or hardware experience, have now ported ~6 cores from MiSTer to Pocket, and just released my first core of my own, the original Tamagotchi - https://github.com/agg23/fpga-tamagotchi/
If you want to join in, I and several other devs are very willing to help talk you through it. We primarily are on the FPGAming Discord server - https://discord.gg/Gmcmdhzs - which is probably the best place to get a hold of me as well.
This is awesome! Thank you for sharing and documenting this so ardently. I don't have a Pocket yet but look forward to trying this Tamagotchi core when I do, and maybe learning more to become a contributor when I have some spare time.
I've always wanted to get more involved in emulation, and this seems like a perfect scene to get involved through.
I spent an hour looking through the GitHub repo for the Tamagotchi core and disassembler. That you accomplished all this in just seven months is very impressive to someone like me, who is just starting out learning Verilog on a Xilinx FPGA in Vivado.
I’ve managed to make the 4 LEDs on my Digilent Arty board ping-pong back and forth like the power indicator on the Nintendo Switch Joy-Con controllers, debounce the buttons and switches, and set up clock division to produce different frequencies from the hardware clock. That’s taken me a handful of days in my spare time.
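For anyone curious what that kind of starter project looks like, here's a minimal Verilog sketch of the ping-pong LED idea, assuming the Arty's 100 MHz clock and 4 LEDs; the port names and divider values are illustrative guesses, not my actual project code.

    module led_pingpong #(
        parameter integer CLK_HZ  = 100_000_000, // assumed board clock
        parameter integer STEP_HZ = 4            // LED moves 4 times per second
    ) (
        input  wire       clk,
        input  wire       rst,   // synchronous, active-high reset (tie to a board button)
        output reg  [3:0] led
    );
        localparam integer DIV = CLK_HZ / STEP_HZ;

        reg [$clog2(DIV)-1:0] counter;
        reg                   dir;   // 0 = shift left, 1 = shift right

        always @(posedge clk) begin
            if (rst) begin
                counter <= 0;
                led     <= 4'b0001;
                dir     <= 1'b0;
            end else if (counter == DIV - 1) begin
                counter <= 0;
                if (!dir) begin
                    if (led == 4'b0100) dir <= 1'b1; // about to hit the top LED
                    led <= led << 1;
                end else begin
                    if (led == 4'b0010) dir <= 1'b0; // about to hit the bottom LED
                    led <= led >> 1;
                end
            end else begin
                counter <= counter + 1;
            end
        end
    endmodule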
The workflow is actually the exact same for the dev and retail Pockets. Both have an exposed JTAG port, but for the retail Pocket, you just need to remove the "hump" on the lower back. You want to make sure you get an official USB Blaster (https://www.mouser.com/ProductDetail/Terasic-Technologies/P0...), because they negotiate a lower voltage level with the Pocket (someone blew their JTAG port using a cheap one).
For a beginner, the Pocket is definitely easier. With MiSTer, you're building a bunch of other framework code that isn't your own, and that adds a lot of iteration time. I also think the Pocket APIs are more refined, but also more limited. I've already written some info about this - https://www.reddit.com/r/fpgagaming/comments/1318jsr/tamagot...
This is perfect, thanks. I guess I signed up for a dev kit at some point, as I'm an FPGA engineer by trade, and got sent a Pocket dev kit for free. It was a bit of a bother trying to find a singular resource for this stuff.
Hey thanks for your Pocket cores, man. I really appreciate how fully featured the SNES core feels, even if it doesn't have save state support. You're doing the lord's work!
Thank you. I want to point out that the SNES core is not my own, I just did the port to Pocket. The real praise goes to srg320 (https://github.com/srg320) who has been in Ukraine all this time and is still pumping out updates.
My mind is blown, but I'm also wondering if this isn't some kind of incredible over-engineering? Surely CPUs are fast enough to emulate these kinds of devices in software. If they aren't, they must be an order of magnitude simpler in complexity.
I wouldn't ordinarily care about emulators, but actual hardware emulators are the craziest thing I've heard in a while. All that for a small handheld console?
The benefit of FPGAs is you can get nearly gate-perfect emulation of an old games system. We've had emulators for years that get most things right, but some games and minor things in old games require specific software patches to ensure the odd way they used the chips available produces the same output. There's a great old article from 2011 about the power required at the time to get a nearly perfect emulation of a NES. [0] The goal with the Pocket and all of Analogue's consoles isn't to be just another emulation machine, but to run as close as possible to the original at a hardware level. That's their whole niche: hardware-level 'emulation' of old consoles.
Software is orders of magnitude simpler in complexity, yes. The difference between a software emulator and a logic-level emulator is immense.
But take the example of the difficulties with a software NES emulator:
In hardware, there is one clock that is fed into the 3 main disparate systems: the CPU, the APU (audio), and the PPU (picture). They all use different clock dividers, but they're still fed off of the same source clock. Each of these chips operates in parallel to produce the output expected, and there's some bidirectional communication going on there as well.
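In an FPGA core that arrangement is usually expressed as a single clock plus per-chip clock enables. Here's a hedged sketch using the NTSC NES ratios (÷12 for the CPU/APU, ÷4 for the PPU), purely to illustrate the structure rather than any actual core's code:

    module nes_clock_enables (
        input  wire clk_master,   // ~21.477 MHz on NTSC hardware
        input  wire rst,
        output reg  cpu_ce,       // pulses once every 12 master cycles
        output reg  ppu_ce        // pulses once every 4 master cycles
    );
        reg [3:0] div;

        always @(posedge clk_master) begin
            if (rst) begin
                div    <= 0;
                cpu_ce <= 1'b0;
                ppu_ce <= 1'b0;
            end else begin
                div    <= (div == 4'd11) ? 4'd0 : div + 4'd1;
                cpu_ce <= (div == 4'd11);       // every 12th cycle
                ppu_ce <= (div[1:0] == 2'd3);   // every 4th cycle
            end
        end
    endmodule

Every block downstream runs off clk_master gated by its enable, so the CPU, APU, and PPU stay in lockstep with no catch-up step.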
In a software emulator, the only parallelism you get is from multiple cores, though you can approximate it with threading (i.e. preemption). For simplicity, you stick with a single thread. You run 3 steps of the PPU at once, then one step of the CPU and APU. You've basically just sped through the first two PPU steps, because who will notice those two cycles? They took no "real" time; they were performed as fast as the software could perform them. Probably doesn't matter, as no one could tell that for 10ns, this happened.
You need to add input. You use USB. That has a minimum polling interval of 1ms (1000Hz), plus your emulator processing time (is it going to have to go in the "next frame" packet?), but controls on systems like the NES were practically instantly available the moment the CPU read them.
Now you need to produce output. You want to hook up your video, but wait, you need to feed it into a framebuffer. That's at least one frame of latency unless you're able to precompute everything for the next frame. Your input is delayed a frame, because it has to be fed into the next batch; the previous batch (for this frame) is already done. You use the basis of 60fps (which is actually slightly wrong) to time the ticking of your emulator.
Now you need to hook up audio. Audio must go into a buffer or it will under/overflow. This adds latency, and you need to stay on top of how close you are to falling outside of your bounds. But you were using FPS for pacing, so now how do you reconcile that?
----
Cycle-accurate and low-latency software solutions are certainly not easy, and true low latency is impossible on CPUs running an actual OS. Embedded-style systems with RTOSes might be able to get pretty close, but it's still not going to be the same as being able to guarantee the exact same (or as near as we can tell) timing for every cycle.
I want to be clear that none of these hardware implementations are actually that accurate, but they could be, and people are working hard to improve them constantly
Analogue could support a SNAC type adapter with the Dock (I highly doubt it, those USB ports are almost definitely wired directly to a real USB chip). Users could develop a cart slot adapter (someone has, he's just lazy and hasn't finished it), but this won't work with any OS features, just the cores themselves.
Sure, it would probably be cheaper to chuck a Cortex-A* or similar mid-range MCU in there. One advantage of FPGAs is that they can achieve "perfect" emulation of a Z80 (or other chip), since it's running at the logic gate level. No software task latency, no extra sound buffering, etc. It can re-create the original clock-for-clock.
Emulating “accurately” is so difficult that not even Nintendo’s Game Boy emulator on the Switch does it properly. I’ve been replaying old games and comparing some questionable moments with my original Game Boy, and the timings are not quite right in some cases.
For example in Link’s Awakening, there’s a wiggle screen effect done by writing to OAM during HBlank. On the Switch it lags very differently than my GB (try it by getting into the bed where you find the ocarina). Or with Metroid 2, the sound when you kill an Omega Metroid is different too. It pitch shifts along with the “win” jingle.
These have almost zero impact on playability. But for purists and emudevs it’s a popular pursuit.
> Emulating “accurately” is so difficult that not even Nintendo’s Game Boy emulator on the Switch does it properly.
The NSO emulators on Switch are particularly terrible, worse than past iterations on Wii etc. at least for some systems, for no obvious reason other than Nintendo being lazy and/or not preserving past emulation work on other systems and not wanting to reuse open source solutions.
There are a few good comparisons on YouTube, such as youtube.com/watch?v=ounQZv1MFNA, that go a bit more into the development history.
They have a huge impact if you want to play light gun games on a CRT, where timing being out by a few microseconds prevents your light gun from registering at all.
You cannot make 80s light guns work on an LCD. When you pulled the trigger, the system drew a few white boxes where the targets were, and how long it took the gun to "see" white told it which box you were aiming at, because of the raster scan of the electron gun. LCDs output an entire frame at once, so that cannot work.
MiSTer makes me kind of sad; the DE10-Nano board it's based on is 7 years old at this point, and the actual FPGA chip on the board is probably over twice as old as that. And this is still the peak of hobby FPGA chips. I wonder why Moore's Law is hitting the FPGA industry particularly hard all of a sudden.
There are better FPGA options, they're just more expensive. The DE10-Nano was strategically chosen as "powerful enough to meet most wants while still being within a reasonable budget".
No one's going to plunk down $10k for a 19 EV Zynq UltraScale+ with 1.1M LEs, but they will spend $200 on a Cyclone V with 210k LEs.
There are MPSoC UltraScale+ chips that are under $100 these days, so I think that's not true anymore. Even low-end MPSoC UltraScale fabric is way better than that old Altera stuff, with LUT4 CLBs, UltraRAM, and better DSP blocks. I've had the fabric run at 500 MHz in my designs.
What kills that board (and many like it) is the absolutely proprietary SYZYGY connector, which is surface-mount (so hobbyists won't develop a board ecosystem for it, and aliexpress circuit cloners won't do their thing unless there's a proven ecosystem), and probably $10 in high quantities (if it's even in stock). RIP.
I'm not even sure that board has enough overall I/O to connect a VGA output, controller inputs, and an SDRAM chip (all of which are essential to a viable MiSTer replacement - note that no one has figured out how to use DDR RAM for emulation yet, too much latency).
The issue with any other board is that it isn't going to have all the same ports and compatibility, so you'd potentially have to maintain two largely separate forks of the cores and accessories, which would be tough in a relatively small community like MiSTer's. For example, your board doesn't have the large pin headers used by basically every MiSTer accessory, so at a minimum you'd need an adapter to go from some other port on your board to the pinouts that the MiSTer ecosystem is built on.
That's honestly not true at all; it all just depends on your platform. On the Pocket, the FPGA _is_ the processor (there are actually two FPGAs, one for the actual emulation core and one for scaling video, and there's technically a PIC microcontroller for uploading bitstreams and managing UI). The FPGAs still don't draw much power compared to the display itself. With the built-in current sensor on the dev kits, the highest we've measured drawn by the main FPGA is ~300mA. Now this sensor isn't going to be the best measurement, but it's something to go off of.
Personally I think this is the biggest selling feature of FPGA based emulation.
The reality is both software and FPGA emulation can be done very well and with very low latency; however, to achieve this in software you generally require high-end, power-hungry hardware.
A Steam Deck can run a highly accurate Sega Genesis emulator with read-ahead rollback, screen scaling, shaders, and all the fixings no problem, but in theory the Pocket can provide the exact same experience with an order of magnitude less power.
It's not quite an apples-to-apples comparison of course, but the comfortable battery life does make the Pocket much more practical.
Being nitpicky about latency is where FPGAs truly shine. You lose a good bit of it by connecting to HDMI (I think the Pocket docked is 1/4 of a frame, and MiSTer has a similar mode) (EDIT: MiSTer can do 4 scanlines, but it's not compatible with some displays), but when we're talking about analog display methods or inputs, you can achieve accurate timings with much less effort than on a modern-day computer.
For a full computer like the Steam Deck, you have to deal with preemption, display buffers, and more, which _will_ add latency. Now if you went bare metal, you could definitely drive a display with super low latency, hardware accurate emulation, but obviously that's not what most people are doing.
Gate for gate, an FPGA consumes more power than a dedicated chip, but the power dissipation depends heavily on the programming. Careful programming can reduce power dissipation.
A potential advantage of an FPGA over a dedicated chip is that any unused functions can just be left out, saving power dissipation and logic resources. This is the (largely unrealised) promise of Reconfigurable Computing [1].
They are generally much more power hungry than the equivalent ASIC, built using the same process. However if your aim is to emulate a sufficiently old ASIC then you can probably beat it with a modern FPGA. And there are FPGAs aimed at low power consumption: they are just also slower.
You are entirely correct, but I would like to point out that there are Cyclone V cores running logic at ~140MHz, not just RAM clocks, and the power consumption is nowhere near that.
Getting a large design that passes timing at that frequency with the Cyclone V fabric is unlikely, however.
----
The distinction here being that more capable FPGAs can get up into the 600MHz+ range, and actually run a full design at that speed.
The MiSTer project[0] is a wonderful open-source introduction to a practical use case for FPGAs. It uses Verilog to describe how the DE10-Nano's chip should be set up to resemble various classic computers, arcade machines, and video game consoles. With a single device you can have an Apple II+, Super Street Fighter II Turbo, and a SNES. Currently it supports up to the PlayStation for console cores, which is probably the upper bound for the DE10-Nano. If you want more info, My Life in Gaming has a great overview video[1] of the MiSTer that covers it in depth.
The entire project feels perfectly in line with hacker mentality and is exciting to watch grow. There's nothing like playing Super Metroid with an original SNES controller on a CRT at the end of the day.
Sadly the article doesn't go into details about how the programmable RAM is wired to the actual logic gates, which seems to me the most interesting and challenging part of designing an FPGA.
In my mediocre understanding of digital circuits, RAM is usually addressable, so it has to be wired in a more direct manner to enable such a design.
I posted this article because someone mentioned some Ryzen chip having an FPGA in another post, and I am now left wondering:
1. why don't we have more user-programmable FPGAs in our fancy desktop mainboards
2. is there a SoC board, ARM or RISC-V based, with an FPGA on board? The slower the CPU, the more useful an FPGA would be to accelerate compute tasks
Both Intel and Xilinx sell FPGAs with hard ARM cores inside, so you can run real Linux while being able to interface with custom logic. Additionally, it's pretty common to create ARM, RISC-V, or PowerPC soft cores in the FPGA when there are no hard cores available. These mimic the real cores and will run software while allowing for things like custom instructions that can take advantage of the flexibility of FPGA fabric. The Xilinx Zynq and Intel Cyclone V have options for hard ARM cores. There are various designs of boards out there you can buy that implement Arduino or Raspberry Pi shield compatibility. The XUP PYNQ-Z2 supports both interfaces and runs a Zynq-7000 with a real ARM core.
You can do other things with soft cores that are not possible with an off-the-shelf CPU, like triple modular redundancy. This is when you run a lot of the logic in triplicate and vote on the results to prevent a bit flip from messing up the software. This is common for space-based CPUs that are running on FPGAs. It's expensive to design a new chip in a very small run, so it's much cheaper to just put the core on an off-the-shelf FPGA and use the rest of the FPGA fabric for custom logic functions.
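As a rough illustration of the voting half of that idea (my own sketch, not from any space-rated design): three copies of a register plus a bitwise 2-of-3 majority vote on the output.

    // Hedged sketch of triple modular redundancy: triplicated state plus a
    // majority vote, so a single upset in any one copy is masked.
    module tmr_reg #(
        parameter WIDTH = 8
    ) (
        input  wire             clk,
        input  wire [WIDTH-1:0] d,
        output wire [WIDTH-1:0] q
    );
        reg [WIDTH-1:0] copy_a, copy_b, copy_c;

        always @(posedge clk) begin
            copy_a <= d;
            copy_b <= d;
            copy_c <= d;
        end

        // A bit is 1 if at least two of the three copies agree on 1.
        assign q = (copy_a & copy_b) | (copy_a & copy_c) | (copy_b & copy_c);
    endmodule

(A real flow also needs synthesis attributes to stop the tool from merging the duplicate registers, plus scrubbing to repair a corrupted copy, but the voter is the core of it.)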
I had a ZCU-104 on my desk for a year circa 2021. It took a while to wrap my head around, given I have no background in embedded engineering, FPGA, or hardware eng. But that board was transformative compared to what I've seen over twenty years working around mobile, edge, and IoT products, particularly given the availability of the PYNQ stack. Someone like me could take the same old Linux/GCC skills and port, run, and profile code there from x86, while getting wormhole effects from a few compiler switches that opened pathways to things none of us had ever seen before, at least not optimized the way they are there.
> 1. why don't we have more user-programmable FPGAs in our fancy desktop mainboards
It has been tried, but GPUs are fast and efficient enough that it’s rarely worth it.
It’s very easy to attach an FPGA to the PCIe bus as an add-in card exactly like your GPU. In fact, many FPGA dev boards come in exactly this format. They’re available, they’re just not in demand.
> 2. is there a SoC board, ARM or RISC-V based, with an FPGA on board? The slower the CPU, the more useful an FPGA would be to accelerate compute tasks
Plenty of FPGA parts include ARM cores. It’s a fairly standard chip configuration.
You can also connect an FPGA and an SoC with PCIe or other interconnects. It’s really not an obstacle.
FPGAs just aren’t very efficient from a cost or dev time perspective for most applications. They’re indispensable when you need them, though.
> RAM is usually addressable, so it has to be wired in a more direct manner to enable such a design
DRAM is necessarily a grid.
With SRAM, e.g. in the standard 6-transistor cell form, you can kind of dump individual bits anywhere you need one.
> why don't we have more user-programmable FPGAs in our fancy desktop mainboards
They tend to be horrifyingly expensive and there are few use cases you can't outperform with a GPU or even just vector instructions. Most of the interesting use cases for FPGAs are when you have direct access to the pins and can wire them up to high-speed signalling, which really isn't home user friendly.
Also all the tooling is proprietary.
> is there a SoC board, ARM or RISC-V based, with an FPGA on board
Buy a medium sized FPGA and download a CPU of your choice.
(I have a downloadable-CPU-sized FPGA board on my desk for testing not yet shipped ASIC designs. It costs about six thousand dollars and has a 48-week lead time on Farnell)
> Buy a medium sized FPGA and download a CPU of your choice.
Damn, of course one would be able to download a CPU and "emulate it" in hardware.
I never imagined that would be possible. Now I'm thinking that if I had infinite free time, I would buy an FPGA and design a modern Lisp CPU. A RISC-V based design with native Lisp support. Who needs hardware when you can just emulate it in an FPGA.
>Who needs hardware when you can just emulate it in an FPGA.
I would go with something like "configure it into hardware" or "upload it into hardware".
It is important to be aware and spread awareness of FPGAs as configurable hardware. Once configured, it is the hardware you've configured into it, for all intents and purposes. No emulation involved.
>Now I'm thinking that if I had infinite free time, I would buy an FPGA and design a modern Lisp CPU. A RISC-V based design with native Lisp support.
There's even VRoom![0], a very high-performance open microarchitecture effort. They were already at ~10.3 DMIPS/MHz a few weeks ago.
But it sort of seems like a waste of perfectly good LUTs if your objective isn't to develop a CPU but to use the CPU to script whatever heavy-duty data crunching the FPGA is doing.
> Sadly the article doesn't go into details about how the programmable RAM is wired to the actual logic gates, which seems to me the most interesting and challenging part of designing an FPGA.
It does, that's the part under the 'Look-Up Tables' section. The key is there aren't any actual logic gates, just lots of little RAMs. You implement an arbitrary blob of logic by having the inputs form the address, then the RAM gives the result of the logical function.
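A toy Verilog sketch of that idea (illustrative only; on a real FPGA the table contents come from the bitstream rather than an initial block): a 4-input "gate" is just a 16-entry, 1-bit memory addressed by the inputs.

    module lut4_example (
        input  wire a, b, c, d,
        output wire y
    );
        reg [15:0] truth_table;
        integer i;

        // Fill the table with the truth table of y = (a & b) | (c ^ d);
        // any other 4-input function is just different table contents.
        initial begin
            for (i = 0; i < 16; i = i + 1)
                truth_table[i] = (i[3] & i[2]) | (i[1] ^ i[0]);
        end

        // The inputs form the address; the stored bit is the "gate" output.
        assign y = truth_table[{a, b, c, d}];
    endmodule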
> You implement an arbitrary blob of logic by having the inputs form the address
> then the RAM gives the result of the logical function.
This is incorrect. Modern FPGAs are composed of small, configurable blocks which contain all sorts of logic. The idea is that the configurable blocks can be (internally) wired-up to implement your logic of choice. The wiring configuration is "loaded" at power-on and retained in memories within each, configurable block.
The LUTs are still the core of these general-purpose blocks, however (as opposed to the more fixed-function blocks like DSP blocks). There's only a little extra logic, mostly focused on muxing and flip-flops (I would argue the flip-flops are the main thing to realise is dedicated hardware). The only thing that's really a different kind of logic is dedicated carry-chain circuitry.
> I would argue the flip-flops are the main thing to realise is dedicated hardware
Fun fact: The Microchip [1] ProASIC3 logic area is mostly composed of core cells, each of which can be configured to be either one 3-input LUT or one D-flip-flop, so not even the flip-flops are dedicated hardware.
Well, indeed modern FPGA fabric, along with the various fixed-function blocks, can be very complex, but this is a beginner's 'How Does an FPGA Work?', for which a bunch of LUTs connected by programmable interconnect is a useful approximation.
You may be right; I don't know. It may be the case that a modern LUT is implemented as a bunch of logic plus a switch matrix. If so, this is for real estate, power, or speed reasons but not for functional reasons.
The fact remains that you can implement any logic function with static RAM LUTs, and the early Xilinx chips (like from 30 years ago) were implemented with static RAM.
Although flip flops can be created from gates (and they are usually taught this way), in silicon they are usually implemented from raw transistors in feedback configurations, because implementing a flip flop from gates requires more transistors than it does when you bypass the gate-level abstraction.
> Sadly the article doesn't go into details about how the programmable RAM is wired to the actual logic gates
Not sure what you mean by that. Do you mean how a RAM is used as a lookup table to implement logic gates, how routing works, or how block RAM is integrated into the FPGA fabric?
> is there a SoC board, ARM or RISC-V based, with an FPGA on board?
Better yet, there are a number of FPGAs available with an ARM SoC on board. Xilinx Zynq, Intel Cyclone V SoC, various others.
People have mentioned that the RAM is the logic (or at least most of it), but you are correct that it's only half the challenge: the routing of signals between logic blocks is also a huge part of making an FPGA. There are a bunch of painful tradeoffs between flexibility, cost, and performance (and predictability of signal propagation). It's probably the area with the most secrecy from the vendors, as well. But broadly speaking, there's usually a hierarchy of potential connections: many local connections which can connect adjacent logic blocks together, a smaller number of connections which can jump longer distances (though you can always push a signal by hopping between logic blocks, this takes more resources and is slower), and a few connections which cover the whole of the chip, usually used for clock signals (and there's some extra stuff to care about here: since you usually want the clock signal to arrive at every flip-flop at the same time, you need to design the shape of these connections carefully).
As for question 1, they're far more common in server-grade stuff, where typically they are baked in. Consumer stuff just doesn't need/use as much IO throughput and muxing as the FPGA provides on, say, a large networking switch.
There are PCIe compatible FPGAs that you can plug into your desktop like a graphics card to accelerate certain tasks. In general though, our workstation hardware just isn't specialized enough to require them, but can be extended to do so. If something is a large enough business model, they'll just make an ASIC.
An FPGA doesn't have an instruction pipeline, as the command is encoded in the gates themselves. It means that at runtime the FPGA is not Turing complete[0], as opposed to the CPU[1].
There is a phrase "data is code and code is data" in security context.
If FPGAs ever replace CPUs as the main computation hardware (you don't need Turing completeness when you keep using the same apps [microservices]), the new saying is something like "code is execution and execution is code", as you imprint the code in the gates. It would get rid of a whole class/subclass of memory safety vulnerabilities.
This paradigm change is like what webassembly did to the web.
The slogan should be "make the bitstream go mainstream"
Someone made a demo running WASM on an FPGA[1]; I'm not sure if it uses a CPU or runs directly.
Of course you move complexity to compiling and increase loading time, all for order-of-magnitude faster execution.
Companies have developed high-level synthesis compilers, but it's difficult and challenging, as you need to synchronize parallel execution pipelines, which you don't have to do in a CPU since it has a steady clock rate for each step in the pipeline.
A company named LegUp Computing (acquired by Microchip) compiled memcached/redis applications to FPGA and improved performance & power efficiency by an order of magnitude (10x).
There is a lot of intellectual property locked up in hardware design, as opposed to software, so tools and knowledge are scarce.
If anyone works on / wants to work on this problem, hit me up in the comments.
[0] Unless you implement a CPU on top of the FPGA :)
[1] Assuming infinite memory, which is false, but good enough
I've had this idea for a while: make an FPGA capable of executing WASM bytecode, then offload WASM execution to the FPGA. Sounds like a fun project to learn FPGAs and how to make a CPU.
Sorta off topic, but I wonder if a CPU with WASM bytecode as its native instruction set could be more performant / power-efficient than JIT-ing WASM code to ARM/x86 assembly. My understanding is that modern processors come with a wide range of optimization tricks like register renaming, out-of-order execution, superscalar dispatch, ... such that it's probably just easier to JIT WASM bytecode to the native instruction set, so we'd get those optimizations for free, as opposed to designing your own WASM CPU with those same optimizations.
> An FPGA doesn't have an instruction pipeline, as the command is encoded in the gates themselves. It means that at runtime the FPGA is not Turing complete[0], as opposed to the CPU[1].
That obviously depends entirely on the circuit; many sufficiently advanced circuits probably end up being accidentally Turing complete.
There's a hint that you are misunderstanding how things work a bit here. By the time you get to the footnote "Unless you implement a CPU on top of the FPGA :)" you realize that of course FPGAs can be Turing complete, but it's not really correct to say "on top of"; more correct would be to say "implement a CPU with the FPGA". Secondly, you would never say a CPU isn't really Turing complete (footnote: unless you use JMP instructions). We typically classify machines using the maximum capability possible, because it's usually trivial, but uninteresting, to, for example, program or limit a Turing machine to not be Turing-complete.
Your other footnote is spot-on. I refuse to consider CPUs to be Turing complete (out of pedantry) because of physical memory constraints. Realizable physical computers are (like all digital logic circuits) actually Finite State Machines with just a really, really big state space (like 2^(~10^15) states). If you put every single flip-flop and DRAM bit back the way it was and execute the next clock, it will deterministically follow the same pattern. They aren't even PDAs (Push-Down Automata). EVERYTHING is an FSM (pedantically). Turing machines and PDAs are useful models for theoretical computer science and math, however. But not technically accurate.
FPGAs aren't CPUs that lack instruction pipelines. They are universal ASICs and/or digital circuits that are re-programmable. ANYTHING that can be accomplished with a synchronous (clocked) digital logic circuit can be implemented with an FPGA, including CPUs, but also potentially many, many other things.
> [0] Unless you implement a CPU on top of the FPGA :)
You should revisit what "Turing complete" means. The whole idea is based on one architecture being able to perform the computations another architecture can, by implementing a sort of emulator of the other architecture.
So the fact that the FPGA can implement a CPU shows that the FPGA is Turing complete, there is no "unless" about it.
I've read that AMD's 7040-series mobile CPUs will have an "FPGA-based AI engine developed by Xilinx" [1] - I'm wondering how programmable that will be.
I know there's been some performance difficulties emulating the PlayStation 3's various floating point modes. It's the kind of thing that I think an on-chip FPGA could theoretically help with, although I don't know if it'd be worth the trouble in this specific case. (Or if AMD's implementation will be flexible enough to help.)
Neat. If the author is around, might I suggest pushing some of the 'why use an FPGA' to the front? I think it would benefit from a more concrete example motivating the use of an FPGA - like a picture of some simple circuit using a seven-segment display on a breadboard next to a picture of an FPGA implementing the same circuit, in order to make it clear that it is a substitute for putting experiments together by hand. I think it will help newcomers better contextualize what is happening and why.
I think in the same vein your wrap up of why you might want to do something in hardware vs software is great and well placed.
Hmmm, I guess now is as good a time as any to bumblefuck around with small electronics projects for fun. Thanks for the reminder!
This is meant to be an introduction though, right? You can simply write “some people do X, and others claim Y is better” then move on.
I read several paragraphs of the article and I still don’t know why you’d use one, despite taking computer architecture and analog electronics courses in undergrad.
I don’t want to read about logic gates again and I don’t want to read about the nuances before I broadly understand what the point is.
For anyone else still wondering, here’s Wikipedia:
> FPGAs have a remarkable role in embedded system development due to their capability to start system software development simultaneously with hardware, enable system performance simulations at a very early phase of the development, and allow various system trials and design iterations before finalizing the system architecture.
Basically, rapid prototyping I guess. That makes sense.
If that was an ask for a specific example, one of the most common uses for FPGAs is DSP. Say you have a simple FIR filter with 63 taps. To do this on a CPU requires you to load two values and do a multiply/accumulate for each tap in sequence. Very (!!) optimistically, that’s about 192 instructions. With an FPGA, you can do all the multiplications in parallel and then just sum the outputs - probably done in 2 cycles, and with pipelining your throughput could be a sample every clock (rough sketch below).
If the FPGA is too slow, too power inefficient etc you can (if you have the money!) take the same core design and put it in an ASIC. The FPGA provides an excellent prototyping environment; in this example you can tune the filter parameters before committing to a full ASIC.
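To make that structure concrete, here's a hedged Verilog sketch shrunk to 8 taps with made-up coefficients; the 63-tap version is the same shape, just wider. One sample enters per clock, all multiplies happen in parallel, and a registered adder tree sums them, so throughput is one sample per clock with a few cycles of pipeline latency.

    module fir8 (
        input  wire               clk,
        input  wire signed [15:0] sample_in,
        output reg  signed [34:0] sample_out
    );
        // Placeholder coefficients; a real filter would be designed properly.
        reg signed [15:0] coeff [0:7];
        initial begin
            coeff[0] = 16'sd3;  coeff[1] = 16'sd17; coeff[2] = 16'sd52; coeff[3] = 16'sd90;
            coeff[4] = 16'sd90; coeff[5] = 16'sd52; coeff[6] = 16'sd17; coeff[7] = 16'sd3;
        end

        reg signed [15:0] taps [0:7];       // last 8 input samples
        reg signed [31:0] products [0:7];   // one multiplier per tap
        reg signed [32:0] sum01, sum23, sum45, sum67;
        reg signed [33:0] sum0123, sum4567;
        integer i;

        always @(posedge clk) begin
            // Shift register of recent samples.
            taps[0] <= sample_in;
            for (i = 1; i < 8; i = i + 1)
                taps[i] <= taps[i-1];

            // All 8 multiplies in parallel (these map onto DSP blocks).
            for (i = 0; i < 8; i = i + 1)
                products[i] <= taps[i] * coeff[i];

            // Pipelined adder tree.
            sum01      <= products[0] + products[1];
            sum23      <= products[2] + products[3];
            sum45      <= products[4] + products[5];
            sum67      <= products[6] + products[7];
            sum0123    <= sum01 + sum23;
            sum4567    <= sum45 + sum67;
            sample_out <= sum0123 + sum4567;
        end
    endmodule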
> multiply/accumulate for each tap in sequence. Very (!!) optimistically, that’s about 192 instructions
This is what all those vector instructions are for.
FPGA is kind of invaluable if you have lots of streams coming in at high megabit rates, though, and need to preprocess down to a rate the CPU and memory bus can handle.
Yes, indeed :) Didn’t want to muddy the waters with vector instructions, and it’s fair to say that the dedicated DSP chip market has been squeezed by FPGAs on one side and vectorised (even lightly, like the Cortex-M4/M7 DSP extension) CPUs on the other.
You can do the multiplications in parallel but summing 63 values in one clock is not going to work that well. You would almost certainly want more pipelining, though with an FIR you can do this without increasing latency.
What do you mean by "you"? Maybe "you" as in a general consumer doesn't need an FPGA, but I guess one could argue a general consumer doesn't need a general purpose computer either.
There are certainly many use cases where you absolutely do need an FPGA, i.e. anything where you need to process large amounts of IO in realtime. For example, the folks from SimulaVR talk about how they use an FPGA for display correction here: https://simulavr.com/blog/testing-ar-mode-image-processing/
Many modern devices would not function without FPGAs.
> anything where you need to process large amounts of IO in realtime.
I'm working on an FPGA-based system right now. We're using an FPGA precisely because this is what we're doing -- about a hundred I/O ports that have to be processed with as little latency as possible.
(SimulaVR dev) It's not wrong to say that in most cases, tasks are better solved without an FPGA.
But when you need one you need one (or an ASIC if you have the volume and don't need reconfigurability)
I suggest it purely for educational purposes. The first struggle isn't identifying the best use case - it's understanding wtf is going on. Putting it in terms of something more familiar is helpful for that.
Your thing would make for a wonderful followup topic though.
Here's a nice series that picks up where this one leaves off (shows how flip-flop/LUT units are organized into cells inside a PLB, programmable logic block). It also is the first step in a tutorial on using Verilog, building a hardware finite state machine, and eventually a RISC-V processor on a FPGA:
If anyone's interested in getting into FPGA stuff, the iCE40UP5K (the 5K indicates the number of LUTs; there are other sizes available) has an open-source toolchain that works very well. There are pretty cheap development boards using this FPGA, I use the UPduino, which I bought in 2019 or so. (Recently the same team released a new UPduino dev board that also includes an RP2040, which is probably pretty useful, available at https://lectronz.com/products/pico-ice-rp2040-plus-lattice-i...)
What can you do with an FPGA of this size? Not too much, but it's about enough to, for example, fit an implementation of a YM2151 FM audio chip (audio chip from the 80s that was used in synthesizers and arcade machines). I think it would probably also be big enough for old CPUs such as the Z80 or 6502. Unfortunately it's not 5V-tolerant so you can't wire it directly to parts of that era.
I second the UPduino, it's a great board. Recently I have been using the iCESugar-pro, which is also supported by the open source toolchain and comes in a handy SODIMM form factor that makes it easy to incorporate into real projects. It uses the Lattice LFE5U-25F-6BG256C with 24k LUTs and 32 MB SDRAM, 106 usable I/O connections, about $60.
I saw various comments about how FPGAs are not ready for consumer hardware; Apple is using them in the AirPods Max already (probably for filtering audio).
They really excel for high throughput & low latency - which noise canceling sounds like a good example of! In addition to this, they are already being used in communication systems & data centers to speed up latency sensitive computations. Edge AI seems like a big market that they will be used for soon, probably more likely b/c they can be flashed unlike ASICs and new NN architectures drop every couple of years.
They're also extremely common in most hardware used for any realtime media processing. With them being mutable, they can be updated to support new compression methods, or switched in the field to flip a device between acting as an encoder or a decoder.
So can a large FPGA be somehow used to brute force encryption?
I don't really understand electronics well enough to see whether a GPU could be faster than an FPGA, but my guess is yes?
It seems that anything that can be programmed is inherently slower than an FPGA equivalent doing the same task.
Does a large enough key size always defeat an FPGA?
I would guess that it becomes power- and cost-prohibitive for a private company to deliver such a possibility, but of course, a large government entity like the NSA might have enough resources to pay for enough FPGAs to decrypt most things.
> So can a large FPGA be somehow used to brute force encryption?
Yes and no. It is usually possible to use an FPGA to build a specific logic circuit that will run some particular algorithm faster than the algorithm would run on a general-purpose CPU. (By a factor of maybe 2x to 100x. It varies a lot depending on the algorithm.)
It is also the case that if you implement that algorithm on a dedicated custom chip it can always be designed to run faster than the equivalent FPGA.
So if you really want maximum speed, spend the megabucks for dedicated silicon.
That said, modern GPUs and DSP chips might be able to run some algorithms almost as fast as dedicated silicon and faster than FPGAs. So maybe you don't need megabucks. Again, it depends on the algorithm.
In any case you'd need many orders of magnitude of speedup to usefully brute-force a modern crypto algorithm, and the most you'd get even from dedicated silicon is about three. Useful speedup requires quantum computers -- which don't yet exist -- and even quantum computers will only help break asymmetric crypto, not symmetric crypto.
Even though the FPGA fabric might encode the solution more effectively, there are other important differentiators: clock speed and memory bandwidth. GPUs have higher clock speeds and typically better memory bandwidth (related of course).
With the higher clock speed, GPUs can well outperform FPGAs for many problems.
Although it was a little bit of a long way to get to the point, I did somewhat appreciate the article's going into the idea of the circuit structure and look-up tables.
It made me think (and I don't know if this is an apt analogy), it's like if you have 2 versions of a machine responsible for sorting different size balls in a factory.
One of them (#1) is built with pipes that exactly encode (in hardware, like a grid of holes) which balls are allowed to pass which size buckets on a production line. The other (#2) uses a robot that looks at each ball passing by, does some measurements, and decides whether it's the right size for a certain bucket.
(#1) is not very flexible, but is really great at the job and can do it just by shaking the tray and works for thousands of balls at once. (#2) needs to take time to study each one passing by. But you can reprogram it more easily.
Each has its purpose. And the FPGA tries to give you the advantages of #1 with a bit more flexibility to change the size of the holes, etc.
Then, replace balls with input video files that need an efficient decoder in hardware, etc. and I see the link.
Is there a good "starter" FPGA kit which has an open-source toolchain, and the ability to build/deploy everything from the command line?
I'm going through links in the comments here now in search of this too.
It would be cool to hook up an FPGA to an image sensor and do my own processing from the raw sensor data. Based on that description, I'm not sure which dev board or starter kit would be best to be looking at.
Here's one more thing I don't know - how do you evaluate whether or not an FPGA is sufficient for what you want to do? For example, let's say I want to talk to an image sensor which has a 3088x2076 resolution, and either capture still images or maybe video at some frame rate. Which specs do I need to hone in on and evaluate to determine which FPGAs would work, and which won't?
It seems that operations on FPGAs can run much more efficiently than their cpu equivalent. For an 'AND' operation, a cpu needs to load code and data from a memory into registers, run the logic and write the result register back to some memory. This while filling up the pipeline for subsequent operations.
The FPGA on the other hand has the output ready one clock cycle after the inputs stream in, and can have many such operations in parallel. One might ask, why are cpus not being replaced by FPGAs?
Another interesting question, can software (recipes for cpus) be transpiled to be efficiently run on FPGAs?
I could ask GPT those questions, but the HN community will provide more insight I guess.
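To illustrate the kind of single-cycle parallelism being asked about, here's a minimal Verilog sketch (purely illustrative): 64 independent AND operations computed at once, with no fetch/decode step because the "program" is the wiring itself.

    module parallel_and64 (
        input  wire        clk,
        input  wire [63:0] a,
        input  wire [63:0] b,
        output reg  [63:0] y
    );
        always @(posedge clk)
            y <= a & b;   // all 64 bits in parallel, result ready next cycle
    endmodule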
They both use the same kind of components; the FPGA does not have a speed advantage, you are simply comparing the speed of a very simple circuit element to the speed of a very complicated pipeline.
You would use an FPGA to simulate a special purpose circuit, which would be faster than a CPU for its specific purpose. We have CPUs because having a general purpose processing chip is incredibly handy when you want to be able to do more than one thing.
EDIT: I forgot to mention that the device outputs in one clock cycle by definition: if your clock is too fast, then your components' output signals don't have time to stabilize and you will get read errors, so you ensure your clock is slow enough for everything to stabilize.
> The FPGA on the other hand has the output ready one clock cycle after the inputs stream in, and can have many such operations in parallel. One might ask, why are cpus not being replaced by FPGAs?
FPGAs are more or less a flexible replacement for an application-specific (logic level) integrated circuit. A CPU can do a wide variety of tasks, with a small penalty for switching tasks. An ASIC can do one thing and that's it; an FPGA can do many things, but with a large penalty for task switching. (You can have a CPU as an ASIC or an FPGA, but...) ASICs require a lot of upfront design work and costs, so you can't use them for everything. ASICs and especially CPUs tend to be able to achieve a higher clock speed than FPGAs, but it kind of depends.
> Another interesting question, can software (recipes for cpus) be transpiled to be efficiently run on FPGAs?
Not really; the way problems are solved is drastically different, and I'd expect most things would need to be reconceptualized to fit. And a lot of software isn't really suited to living as a logic circuit. Exceptions would be encoding, compression, encryption, the inverses of all of those, signal processing, etc. Things where you have a data pipeline and 'the same thing' happens to all the data.
The very lowest-level operations on FPGAs can run more efficiently than emulating them on a CPU. The higher-level operations, which are usually what we want, are slower, as a 'gate' in an FPGA is approximately an order of magnitude slower, more power hungry, and more expensive than a gate in a CPU (or GPU). Modern CPUs excel at executing data-dependent operations of a wide variety while exploiting as much parallelism as available, while GPUs excel at churning through a huge amount of memory doing parallel arithmetic operations: for both of these tasks an FPGA is going to be worse. For an FPGA to be the better option, you need a task which is specialised enough that the speedup from designing a specialised circuit to do it overcomes the intrinsic slowdown from it being implemented in an FPGA as opposed to raw silicon, and these are not your typical software tasks, usually much more hardware-type tasks.
The two areas where FPGAs excel are having extremely low and predictable latency, and having lots of extremely flexible high-bandwidth IO (still less bandwidth than the memory or PCI-E bandwidth of a GPU, but it's close and it's much more flexible). For computation throughput, however, there's only a few niche applications, though one important one is simulating the logic which is going to be implemented in a CPU, GPU, or ASIC to validate it before the expensive process of actually etching some silicon.
> The FPGA on the other hand has the output ready one clock cycle after the inputs stream in, and can have many such operations in parallel. One might ask, why are cpus not being replaced by FPGAs?
You need to be careful with interpreting cycle counts as execution time. A naive operation on an FPGA may take a single cycle because the author has asked for that, rather than constructing and using a pipelined multi-cycle ALU [1]. Therefore the design tool will synthesize a layout that performs the operation in a single cycle, but the cycle time (and thus clock frequency for that portion of the layout) will be limited by however long that operation needs to complete, and will be much longer than the cycle time of the CPU you are comparing against. Note that a CPU almost certainly has "better" ALUs than those you'd get on an FPGA - optimized fixed-function circuits can always be faster than their generic programmable equivalents - the advantage of the FPGA is that you can have only (and more of) the operations you need, running in parallel as many times as the device will support.
You could make a CPU by which every instruction (even very complicated vector instructions like AVX or SVE) was completed entirely in a single cycle, but that cycle itself would be very long, and all of the incredible throughput benefits of pipelined execution would be lost.
[1] Pretty much all of the vendor design tools can instantiate an ALU for your device, given parameters like # of bits, operation types, # of cycles, etc.
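A small illustration of that trade-off (my own sketch, not the output of any vendor tool): the same multiply-accumulate written single-cycle and two-stage. The first finishes in one clock, but that clock must be long enough for the whole multiply plus add; the second has a shorter critical path (so a faster clock) at the cost of one extra cycle of latency.

    module mac_single_cycle (
        input  wire               clk,
        input  wire signed [15:0] a, b,
        input  wire signed [31:0] c,
        output reg  signed [32:0] y
    );
        always @(posedge clk)
            y <= a * b + c;            // entire multiply + add in one cycle
    endmodule

    module mac_two_stage (
        input  wire               clk,
        input  wire signed [15:0] a, b,
        input  wire signed [31:0] c,
        output reg  signed [32:0] y
    );
        reg signed [31:0] product;
        reg signed [31:0] c_q;         // delay c so it lines up with the product

        always @(posedge clk) begin
            product <= a * b;          // stage 1: multiply
            c_q     <= c;
            y       <= product + c_q;  // stage 2: add
        end
    endmodule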
FPGAs are the next big frontier for software development, and have been since the '90s; they just need the programming model worked out. This is the traditional story told about FPGAs, but GPGPU programming suddenly overtaking FPGA development around 2010, despite its awkward programming model, makes that story rather suspect. The thing is, a lot of the benefits of FPGAs are really best-case scenarios, and when you move to more typical scenarios, their competitiveness as an architecture dwindles dramatically.
Pipelining on an FPGA requires being able to find, and fill, spatial duplication of the operations being done. If you've got conditional operations in a pipeline, now your pipeline isn't so full anymore, and this hurts performance on an FPGA far more than on a CPU (which spends a lot of power trying to keep its pipelines full). But needing to keep the pipelines spatially connected also means you have to be able to find a physical connection between the two stages of a pipeline, and the physical length of that connection also imposes limitations on the frequency you can run the FPGA at.
If you care about FLOPS (or throughput in general), the problem with FPGAs is that they are running at a clock speed about a tenth of a CPU. This requires a 10x improvement in performance just to stand still; given that software development for FPGAs requires essentially a completely different mindset than for CPUs or even GPUs, it's not common to have use cases that work well on FPGAs.
(I should say that a lot of my information about programming FPGAs comes from ex-FPGA developers, and the "ex-" part will certainly have its own form of bias in these opinions).
Yeah I don't really see FPGAs ever making their way down to consumers the way GPUs and CPUs have (end users actually programming them).
For (semi) fixed pipeline operations FPGAs will basically always be worse than some slightly more specialized ASIC like a GPU/AI engine.
One area FPGAs can be exceptionally good at is real-time operations. You have much better control over timing in general on FPGAs vs MCUs/CPUs, but I don't think that's inherent (you could probably alter the MCU architecture a bit and close the gap).
I could be wrong, but I also think you get better power draw for things like mid-to-low-volume glue chips in embedded systems, because you're not powering big SRAM banks and DMAs just to pipe data between a couple of hardware interfaces. This is only because of market forces though, obviously, because if mid-to-low-volume ASICs become viable in terms of dev time they'll be much better.
One big problem is memory. Basic CPUs have a lot of facilities for high-speed synchronous interfacing with DRAM, and a truly vast amount of resources for cache.
Partially as a result, a good model for compiling code to FPGAs uses a dataflow paradigm, since we don't need to serialize all operations through a memory fetch, cache, or even register file.
If we hadn't decided to move all our computing to the cloud, I suspect FPGA accelerator boards for applications which map well to that model would have some traction in specialized areas. Signal processing is definitely one such.
> One might ask, why are cpus not being replaced by FPGAs?
Most of the time you want data-dependent execution. FPGA systems excel at "fixed pipeline" systems, where you have e.g. an audio filter chain .. but even that is usually done in efficient DSP CPUs.
> Another interesting question, can software (recipes for cpus) be transpiled to be efficiently run on FPGAs?
A subset can. Things like recursion are right out. Various companies have tools to do this, but you usually end up having to rework either the source you're feeding them, or the HDL output.
>One might ask, why are cpus not being replaced by FPGAs?
They do sometimes, for very specific applications! The problem is that an FPGA is programmed for one specific task and would have to be taken offline and reprogrammed if you wanted to do something else with it. It's not general purpose like a CPU, where you can load up any program and have it run.
Programming an FPGA is also comparatively much harder to reason about than a CPU because of the parallelism and timing you described.
Some of the more modern Xilinx stuff has features where you don't need to take down the whole FPGA to reload a bitstream onto part of the chip. It's really neat: you can do live reprogramming of one component and leave the others alone, or have an A/B setup where one updates while the other is unchanged.
Yes, I'm working on a Xilinx ARM processor with an FPGA. The FPGA and the CPU are independent units in the chip that can each operate with or without the other. We can indeed reprogram the FPGA without taking the system down.
These are really good questions to be asking, and to help with that let's consider 3 attributes of compute complexity: time, space, and memory
The traditional way of computing on a CPU is in essence a list of instructions to be computed. These instructions all go to the same place (the CPU core) to be computed. Since the space is constant, the instructions are computed sequentially in time. Most programmers aren't concerned with redesigning a CPU, so we typically only think about computing in time (and memory of course)
On an FPGA (and custom silicon) the speedup comes from being able to compute in both time and space. Instead of your instructions existing in memory, and computed in time, they can be represented in separate logic elements (in space) and they can each do separate things in time. So in a way, you're trading space for time. This is how the speed gains are achieved.
Where this all breaks down is the optimization and scheduling. A sequential task is relatively easy to optimize, since you're optimizing in time (and memory to an extent). Scheduling is easy too, since tasks can be prioritized and queued up. However, when you're computing in space, you have to optimize in 2 spatial dimensions and in time. When you have multiple tasks that need to be completed, you then need to place them together and not have them overlap.
Think of trying to fit a ton of different-shaped tiles on a table, where you need to be constantly adding and removing tiles in a way that doesn't disrupt the placement of other tiles (at least not too often). It's kind of a pain, but for some more constrained problem sets, it might make sense.
These aren't impossible problems, and for some tasks, the time or power usage savings is worth the additional complexity. But sequential optimization is way easier, and good enough for most tasks. However, if our desire for faster computing outpaces our ability to make faster CPUs, you may see more FPGAs doing this sort of thing. We already have FPGAs that are capable of partial reconfiguration, and some pretty good software tools to go along with it.
People should make their software faster before they turn to FPGAs. Unless you already did that! But FPGA is a lot of fun to learn and play with even if you don't have a ready application.
On the other hand, there are fpga boards priced comparably to microcontrollers now. It can be very refreshing and useful to escape the "programming" paradigm.
Hobbyists are already contributing the value of their time which is generally much larger than the cost difference between device types, so in a sense the fpga barrier to entry may actually be higher for companies than individuals at the moment.
I don't know if I'm alone on this one, but the only thing that has stopped me from finally learning FPGAs is the lack of a practical use for me or my professional life. I have not yet encountered a situation, besides the bucket list item of making a chip (which, btw, I don't even need to know FPGAs to complete at this point), that has made me seriously sit down and learn FPGAs.
It seems FPGAs are universal Turing machines (in contrast to ASICs, which are not generally universal), which would make them similar to CPUs. But CPUs are clearly "more universal" in some sense. Perhaps because CPUs can change their programs at run time, which doesn't seem possible for FPGAs. Unfortunately the article doesn't compare FPGAs to CPUs.
These are much cheaper to produce and run, not to mention more flexible than GPUs, for LLM training. As an added benefit you can write (if you were so inclined) hardware-level logic for whatever workload you want. Reprogrammable as well.
Background: 4-5 years ago I worked on embedded systems tech that used these for fintech. They seemed like the future back then, and LLM workloads were barely a thing. Glad more people are learning about them.
The current list of what you can do with the Pocket's FPGA is here - https://openfpga-cores-inventory.github.io/analogue-pocket/ - and the inevitable subreddit is a good resource: https://old.reddit.com/r/AnaloguePocket/