Awesome! A huge amount of work must have gone into this.
I did some playing around with VGA graphics from the Pico when it first came out (wrote a simple library to produce SNES like graphics, wrote it all up on my blog https://gregchadwick.co.uk/blog/playing-with-the-pico-pt6/). It felt like Doom should be doable but I figured you'd need an off chip RAM expansion interfaced via the PIO. Clearly not.
The Pico really is a very fun board to play around with. Could be a great target for a retro style mini console thing.
(Note that it's a dev board; if you just want to play AAA games, not the thing to buy. If you want to program a game and show it off, it's what you want.)
I built a retro-style game console for myself and am now working on building games for it. I never thought Doom would be possible without a lot of external hardware for RAM and storage.
Fun VGA experiment -- thanks for writing it up. I've done VGA with FPGAs but I like how Pico is way cheaper with an open tool chain and great accessories.
The article talks about this: it lets you access flash as if it were very slow RAM (fronted with a 16K cache of actual RAM). This allows the author to do many things, like directly accessing the levels and textures without loading them into RAM, and in fact storing them compressed and using the second core to decompress them on the fly.
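To get a feel for why "flash as slow RAM behind a small cache" works for some access patterns and not others, here is a toy model. The geometry (16 KB, 2-way set associative, 8-byte lines), the LRU eviction, and all the names are assumptions for illustration; this is a sketch, not a description of how the real hardware or the port behaves.

```python
# Toy model of a small cache in front of slow flash. The geometry below
# (16 KB, 2-way set associative, 8-byte lines) and the LRU policy are
# assumptions for illustration, not a description of the real hardware.
LINE_SIZE = 8
WAYS = 2
CACHE_BYTES = 16 * 1024
NUM_SETS = CACHE_BYTES // (LINE_SIZE * WAYS)


class ToyFlashCache:
    def __init__(self):
        # Each set holds up to WAYS line numbers, least recently used first.
        self.sets = [[] for _ in range(NUM_SETS)]
        self.hits = 0
        self.misses = 0

    def read(self, address):
        line = address // LINE_SIZE
        ways = self.sets[line % NUM_SETS]
        if line in ways:
            self.hits += 1
            ways.remove(line)      # refresh: re-appended below as most recent
        else:
            self.misses += 1       # this is where a slow flash fetch would go
            if len(ways) == WAYS:
                ways.pop(0)        # evict the least recently used line
        ways.append(line)


def hit_rate(addresses):
    """Fraction of reads served from the cache for a sequence of addresses."""
    cache = ToyFlashCache()
    for a in addresses:
        cache.read(a)
    return cache.hits / (cache.hits + cache.misses)


# Re-reading a 4 KB texture ten times, one 32-bit word at a time.
texture_loop = [a for _ in range(10) for a in range(0, 4096, 4)]
# A single pass over 1 MB of data, one 32-bit word at a time.
level_scan = list(range(0, 1024 * 1024, 4))
```

In this toy model the texture loop hits about 95% of the time, while the single pass over 1 MB only hits 50% (the second word of each line), which is the intuition behind keeping hot data small and promoting the hottest bits to real RAM.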
Not the same but not totally different. MCU abstraction is simple and more like vintage stuff. So it would be closer to an 80s system that executed many routines from memory-mapped ROMs -- in addition to system RAM -- with an instruction cache on the CPU.
MCUs can have on chip RAM/ROM and off chip [quad] serial RAM/ROM, and even parallel access RAM/ROM like FRAM. Several ways to skin the cat. Or cut the pie, as it were.
Much of that is for art assets. Do it with fewer or lower-resolution textures and sprites, and you could get away with quite a bit less. The executable code can fit in well under one megabyte. You could even procedurally-generate the art, if you've got way more CPU core available than storage.
The SNES ran Doom with two 64k RAM banks (albeit with textures and data such as level geometry running directly from ROM.)
SNES Doom - 128KB RAM, I think a 2MByte ROM, CPU is 65816 at 3MHz + SuperFX RISC CPU at 21MHz, which also had its own 64KB of RAM.
PSX Doom - 2MB RAM + 1KB fast scratchpad, able to load from a standard 650MB CD, CPU is a MIPS R3051 at 33MHz + the PSX accelerated graphics, not used except to draw strips
So doing this in a device that has not too much more RAM than the SNES and also has to livestream the VGA signal Atari 2600 style is exceedingly impressive. It's a dual CPU unit but basically having to spend a core manually bit banging the VGA signal like that is what fascinated me the most.
I haven’t checked but don’t think that the second core was bitbanging the VGA signal. The RP2040 has PIO (programmable I/O) mini cores that can read directly from RAM (DMA) and address the GPIO pins directly. They most likely used that to their advantage.
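To put rough numbers on why bit-banging video from a core would be painful, here is a back-of-the-envelope sketch. It assumes the common 640x480 @ 60 Hz VGA timing (which may not be the mode the port actually uses) and takes the system clock as a parameter; the function name is made up for illustration.

```python
# Standard 640x480 @ 60 Hz VGA timing; totals include the blanking intervals.
PIXEL_CLOCK_HZ = 25_175_000
H_TOTAL = 640 + 16 + 96 + 48   # visible + front porch + sync + back porch (pixels)
V_TOTAL = 480 + 10 + 2 + 33    # visible + front porch + sync + back porch (lines)


def vga_budget(sys_clock_hz):
    """Return (line rate, frame rate, CPU cycles per line, CPU cycles per pixel)."""
    line_rate_hz = PIXEL_CLOCK_HZ / H_TOTAL
    frame_rate_hz = line_rate_hz / V_TOTAL
    cycles_per_line = sys_clock_hz / line_rate_hz
    cycles_per_pixel = sys_clock_hz / PIXEL_CLOCK_HZ
    return line_rate_hz, frame_rate_hz, cycles_per_line, cycles_per_pixel


# Assumed here: the Pico's default 125 MHz system clock.
line_rate, frame_rate, per_line, per_pixel = vga_budget(125_000_000)
```

At 125 MHz that leaves just under 5 CPU cycles per pixel, so pushing pixels by hand from a core would leave essentially nothing for the game, which is why handing the job to PIO and DMA looks so attractive.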
That's cool, it's Doom in a completely self-contained cartridge size. Would it be possible to just hook up a cartridge like this to a monitor (through e.g. usb-c) directly?
Also if the RP2040 is just $1, does this mean we should be able to get e.g. Doom on cheap handheld single-game devices like the old Game & Watch and similar machines? I remember spending hours on these "racing games" or 12-in-1 Tetris LCD machines from the toy shop. How much does a small (2-4") color OLED or backlit LCD cost these days? Actually, what is the cheap handheld market looking like these days? I had a boggle at the local toy shop's website, VTech is still going for it but mainly with baby toys it seems, and those Tetris handhelds are still the same from 20-30 years ago, they cost just €3,99 these days. I'm also seeing some products from a company called Wonky Toys, and miniature Atari arcade cabinets.
What a coincidence, I was working on porting doom to the e-ink badger2040 last week. Getting doom to fit into memory was fairly straightforward, but they did a better job than me. I'm very impressed they got the original WADs and networking going as well. Great work!
I got things drawing with low-complexity WADs, but had issues/graphical snow after a few seconds in and needed more polish to fit the original WAD in memory. I figured the video would be best if it opened with the hangar level, so I haven't made one yet. Might be worth rebasing off this effort instead.
- "I decided to leave the XIP cache to do its thing, and select a few small areas of hot code or data to promote to RAM manually"[1]: I understood this as you leaving the XIP cache activated. But this seems at odds with "16K of flash XIP cache, that we’ve talked about, but decided not to use."[also 1], which I'm interpreting as "decided not to make use of the XIP cache (i.e. turn it off)" (maybe I'm misreading).
- I thought ARM32 has 12(-14) usable registers (compared to 14-15 in x86-64), so why these mentions of "scarce Cortex-M0+ registers"? (Does FIQ mode reduce the number of usable registers?)
- "not good on a Cortex M0+ where the overhead of a function call is generally 30-40 cycles, with the corresponding loss of most of your precious “in-register” state": are function calls disproportionally slower on Cortex M0+? (Certainly 30-40 cycles seems high.) Why is that? (Registers r4-r11 are callee-saved[2], thus not lost; mutable data might have to be re-read from memory, though--just like on other architectures, but maybe CPU caches are faster on those.)
- "These OR values can be stored in a lookup table indexed by higher bits in the sample position, and thus the 8x space savings can be realized without needing any branches in the code!"[3]: Cortex-M0+ has a 2-stage pipeline[4], I'd hence expect the cost of a jump to be just 1 additional cycle, for the re-processing of the 1st stage for the next instruction (maybe I'm wrong), which would be the same as a memory access. (Maybe multiple jumps can be saved this way, though.) Did measurements show the lookup table to be faster?
> - I thought ARM32 has 12(-14) usable registers (compared to 14-15 in x86-64), so why these mentions of "scarce Cortex-M0+ registers"?
Cortex-M0+ is thumb-only (compressed 16bit instruction encoding).
"In Thumb state, the high registers, r8-r15, are not part of the standard register set. The assembly language programmer has limited access to them, but can use them for fast temporary storage."
> (Does FIQ mode reduce the number of usable registers?)
There is no FIQ mode in Cortex-M. Instead you usually have the nifty Nested Vectored Interrupt Controller (NVIC), and it is designed so your interrupt handlers can be regular C functions, with no special handling (no special interrupt return instruction) needed.
In thumb2 virtually all arm32 Instructions are available, some as 32bit encodings. But even the 16bit encodings include some instructions that work on hi registers.
> - "I decided to leave the XIP cache to do its thing, and select a few small areas of hot code or data to promote to RAM manually"[1]: I understood this as you leaving the XIP cache activated. But this seems at odds with "16K of flash XIP cache, that we’ve talked about, but decided not to use."[also 1], which I'm interpreting as "decided not to make use of the XIP cache (i.e. turn it off)" (maybe I'm misreading).
The way I'm reading it, you can disable the XIP cache and use that 16KB of RAM for anything else you want. But the author "decided not to use" it for something else, that is, the 16KB are still being used as cache.
The Pi Pico reinvigorated my love of tinkering with electronics. I can hack my way through C on an Arduino (and would probably still use it for any serious deployment that I didn't expect to turn into a big community effort) but for standing up quick proof of concepts, embedded python is outstanding. Incredible for $4.
These newer compatible boards being released are awesome.
Fun stuff, it always amazes me that people are surprised. Not having lived through it is a part of that I'm sure.
The RP2040 is more powerful than an 80286. The PC/AT was hugely more powerful than the original IBM PC (on which DOOM also ran). Put a keyboard, mouse, and a frame buffer on an STM32F4 or F7 and you've got the computational and capability equivalent of the PCs that powered the world in 1985. People did accounting, CAD, spreadsheets, email, all sorts of things on them. Amazing I know, but here we are.
IIRC, Doom was very playable, but not exactly smooth on my cheap 386SX which was... 20MHz? But it ran like butter on the 66MHz 486s in the school's computer lab.
In terms of instructions per clock and I/O bandwidth it compares more favorably to 16 bit architectures than 32 bit ones even though the Cortex-M family is nominally 32 bits.
RP2040 is designed to provide full bandwidth to both cores at once without bandwidth contention. Combined with the PIO's you can do some really impressive bitbanging.
And it required 4MB of RAM... since I had a machine with 2MB, I wasn't able to enjoy it.
But I found a DOS4GW command-line option to "emulate" RAM with a swap file on DOS. I set up about 4MB of virtual memory and ran the game. It took 15 minutes to start the game and show the main menu, another 5 minutes to navigate the menu, and 15 to get into the game.
The frame rate was something like a frame PER minute.
I thought about building a toy PC with the RP2040 but I wasn't able to solve the GPU problem. Driving a display seems like a very hard task without some dedicated hardware. And using serial output is not fun.
So, FWIW, I've been playing around with this. I've got an FPGA board that has an HDMI output[1]. I have a simple 1280 x 720 frame buffer running on it (read the DRAM, display it on the monitor). I'm building a carrier board to connect it to an STM32F429 Nucleo-144 board using ST Micro's flexible memory controller (FMC) peripheral. This will present the frame buffer contents to the STM32 as memory.
Additionally, some "control registers" are being implemented in the FPGA that can do certain actions. At a minimum they are "clear to one color", "copy region", "scroll region", and "copy glyph". The STM32 has the DMA2 peripheral that does a lot of cool bitblt type functions but these can be nominally slowed down by not synchronizing with the FPGA's schedule for displaying things.
The STM32 is running micropython. The "plan", such as it is, is to let the REPL run using the display as its terminal, and a "graphics mode" to reserve parts of the screen for graphics. The small goal is to re-create sort of the VIC-20/C64/ZX Spectrum kind of "vibe" (interpreted language easy access to the graphics) and then build from there. Clearly the basic frame buffer is like 20% of the FPGA so there is lots of room to do other stuff in there.
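For anyone trying to picture what those "control register" actions do, here is a pure-software model of three of them (clear, copy region, scroll). The class and method names are invented for illustration, there are no bounds checks, and it says nothing about how the FPGA side would actually implement them.

```python
class ToyFrameBuffer:
    """Software stand-in for a frame buffer with a few block operations."""

    def __init__(self, width, height, fill=0):
        self.width, self.height = width, height
        self.rows = [[fill] * width for _ in range(height)]

    def clear(self, color):
        # "Clear to one color": overwrite every pixel in place.
        for row in self.rows:
            row[:] = [color] * self.width

    def copy_region(self, sx, sy, w, h, dx, dy):
        # "Copy region": snapshot the source first so that overlapping
        # source and destination rectangles still copy the original pixels.
        block = [self.rows[sy + y][sx:sx + w] for y in range(h)]
        for y, line in enumerate(block):
            self.rows[dy + y][dx:dx + w] = line

    def scroll_up(self, lines, fill=0):
        # "Scroll region", simplified to the whole screen: drop the top rows
        # and append blank rows at the bottom, as a text terminal would.
        blank = [[fill] * self.width for _ in range(lines)]
        self.rows = self.rows[lines:] + blank
```

A REPL-style terminal mostly needs the last one: print a line at the bottom, and call `scroll_up(glyph_height)` when the cursor runs off the screen.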
There are a lot of options on this front. As far as I can tell, most displays for embedded platforms are sold as modules that you interface with via a serial or parallel bus. There are libraries out there to handle the grunt work, if you don't want to dig through data sheets yourself.
If you want something that doesn't use any dedicated hardware, interfacing with analog displays (e.g. NTSC/PAL/VGA) can be done with a handful of resistors on GPIO pins. Conceptually, it is easier but actually dealing with timing is a pain. Again, libraries that deal with the grunt work are available.
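The "handful of resistors" part can be sanity-checked with a bit of arithmetic. This sketch assumes push-pull GPIO pins that drive either 3.3 V or ground, one resistor per pin into a shared node, and a 75 ohm terminated VGA input expecting roughly 0 to 0.7 V. The resistor values are hypothetical, picked only to make the numbers work out.

```python
VCC = 3.3          # GPIO high level in volts (assumed)
LOAD_OHMS = 75.0   # termination inside the monitor's VGA input


def dac_voltage(bits, resistors):
    """Voltage at the monitor input for the given pin states.

    `bits` holds 1 (pin high) or 0 (pin low) for each pin, and `resistors`
    holds the matching series resistor in ohms. Low pins still load the node,
    because a push-pull output driven low connects its resistor to ground.
    """
    g_high = sum(1.0 / r for b, r in zip(bits, resistors) if b)
    g_total = sum(1.0 / r for r in resistors) + 1.0 / LOAD_OHMS
    return VCC * g_high / g_total


# Hypothetical 3-bit, binary-weighted channel: MSB first.
RESISTORS = [500.0, 1000.0, 2000.0]
levels = [
    dac_voltage([(code >> 2) & 1, (code >> 1) & 1, code & 1], RESISTORS)
    for code in range(8)
]
```

With these values the eight levels step evenly from 0 V up to a little under 0.7 V. The timing, as the comment says, is the hard part; the resistor network itself is forgiving.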
Yes, they are solving the storage and RAM challenges partially by throwing CPU at it: not using native pointers, switching between multiple struct sizes where the original had one, compressed integer values, etc., also they restructured drawing to happen in slices as the beam travels, which must have its costs. Also I wonder whether original Doom relied on some GPU hardware, whereas here everything happens in software. RP2040 doom also has to emulate the sound hardware, and handle lots of interrupts to initialize the DMA for each individual video scanline.
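The "not using native pointers" trick is easy to illustrate. This is a minimal sketch of the general idea, assuming all objects live in one pool starting at a known base address and are 4-byte aligned, so a 16-bit index can stand in for a 32-bit pointer. The names and the node layout are made up, and the port's real scheme may well differ.

```python
import struct

POOL_BASE = 0x2000_0000   # assumed start of the RAM pool
ALIGN_SHIFT = 2           # objects assumed to be 4-byte aligned


def to_short(addr):
    """Compress a pool address into a 16-bit index (costs a subtract and a shift)."""
    offset = addr - POOL_BASE
    if offset < 0 or offset % (1 << ALIGN_SHIFT):
        raise ValueError("address outside pool or misaligned")
    index = offset >> ALIGN_SHIFT
    if index > 0xFFFF:
        raise ValueError("address beyond 16-bit index range")
    return index


def to_addr(index):
    """Expand a 16-bit index back into a full address (a shift and an add)."""
    return POOL_BASE + (index << ALIGN_SHIFT)


# A list node with two links: native 32-bit pointers vs. 16-bit indices.
NATIVE_NODE = struct.Struct("<II")
SHORT_NODE = struct.Struct("<HH")
```

Here the node shrinks from 8 bytes to 4, and a 16-bit index with a shift of 2 reaches 256 KB of pool. The price is the extra shift and add on every dereference, which is the "throwing CPU at it" part.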
OTOH they are actually overclocking the RP2040 to 270MHz.
Sure, but it's not about the speed of the hardware but what was done to port it to the hardware. RAM and ROM would be larger back then. The RP2040 has 264KB of RAM; I remember 286s having well over 1MB, 386s even more!
> For RP2040 Doom, whilst I thought I might need to build my own single pin PIO networking with some sort of token passing, it turned out I had 2 GPIO pins free that could be configured for I2C, so I decided to just use that instead.
This is really nice. Trying to port Doom to RP2040 was on my todo list, but all along I feared that the SRAM would simply not be sufficient for a port with authentic feel and original assets. I'm glad to be proven wrong. I wonder if DEH support is out of the question.
I can't wait to see what kind of a chip they make after the RP2040.
So I just got my hands on a couple of Picos. I was so excited to find a RPi board in stock, that I forgot to check if it has WiFi or not. I would like to run some form of OpenSprinkler on that (even if not OpenSprinkler, I can use cron jobs to control the sprinkler relays by hand).
I don't think you understood what you were buying. The Pico has no relation to the rest of the RPi product line.
It is basically an Arduino on steroids. It cannot run cron jobs. It has no network stack, kernel or operating system, and cannot run any software not specifically written for it.
I guess the RPi Zero W would suit your needs a lot better, and it seems to be in stock in some shops. The Pico doesn't run Linux so cron jobs aren't possible.