There was no framebuffer in those consoles [1]. So you pretty much only had to store game state and some auxiliary data in those 128 bytes, which starts to sound a lot easier.
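To make that concrete, here is a rough sketch of drawing without a framebuffer: each scanline is computed from the game state at the moment it is needed and then thrown away. Everything in it (the `scanline` and `draw_frame` names, the state fields, the 40x8 text "screen") is made up for illustration and is not how the real TIA registers work.

```python
# Hypothetical sketch: names and sizes are illustrative, not the real hardware.
WIDTH, HEIGHT = 40, 8

def scanline(state, y):
    """Compute one row of pixels directly from game state."""
    row = ["."] * WIDTH
    # The ball occupies a single pixel on its own row.
    if y == state["ball_y"]:
        row[state["ball_x"]] = "o"
    # The paddle is a vertical bar in column 0.
    if state["paddle_y"] <= y < state["paddle_y"] + state["paddle_h"]:
        row[0] = "#"
    return "".join(row)

def draw_frame(state):
    """Yield rows one at a time; no full-screen buffer is ever kept."""
    for y in range(HEIGHT):
        yield scanline(state, y)

# The only persistent data is the game state itself: a handful of small numbers.
state = {"ball_x": 10, "ball_y": 3, "paddle_y": 2, "paddle_h": 3}

for line in draw_frame(state):
    print(line)
```

The point is that the only thing that persists between frames is `state`, which is a few small integers, not a screen's worth of pixels.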
Modern games have programmers deal with drawing a frame a pixel at a time when writing shaders. Many GPUs themselves render a tile at a time rather than the whole buffer at once.
[1] https://en.wikipedia.org/wiki/Television_Interface_Adaptor