Improvements to the Xerox Alto Mandelbrot drop runtime from 1 hour to 9 minutes (righto.com)
238 points by darwhy on June 26, 2017 | 63 comments



Shutting off redisplay was not an uncommon technique on the Alto (the first time I saw it I was somewhat disconcerted, and thought the machine had crashed, as that wasn't all that uncommon in those days).

The key, if you read Thacker's original design paper, is that screen refresh took 2/3 of the bus bandwidth. This was considered absurd: why devote that much of the machine to I/O? (Many machines in those days still had channel controllers, a design which has implicitly returned.)

These old tidbits are fun because they make you think about different design tradeoffs. I love the line in K&R (IIRC -- is it still in there?) where they explain that the standard library is required (and say, quotation marks included, "you mean I have to call a function to do I/O?"). And the implicit defense in the 801 paper (the original RISC paper) that yes, the compiler really could do a good job.



> One way to speed up the Alto is to shut off the display. I tried this and improved the time from 24 minutes to 9 minutes, a remarkable improvement [...]: to display pixels on the screen, the CPU must move the pixels for each scan line from RAM to the display board, 30 times a second.

Ah, same as the Sinclair ZX81 fast mode for the same reasons. The ZX80 was always running in fast mode and displayed the screen only if no program was running. Old times.


Yes, I had the TS1000, a slightly modified version for the US market (with NTSC video output). I remember its built-in BASIC actually had "FAST" and "SLOW" commands. I love that they felt "SLOW" was an appropriate name for a command.

I guess I had always assumed this was about memory bandwidth, but this Alto article raises the possibility that the CPU may have been doing double duty (probably for cost reasons and not flexibility).

And this documentation says it was "four times as fast", which is probably too much of a difference for just memory bandwidth: http://www.worldofspectrum.org/ZX81BasicProgramming/chap13.h...

I wonder if the implementation is as simple as just swapping the priorities of user code and display code.


The way this actually worked is a giant hack, explained here: http://zx81.us/zx81vid.txt But basically yes it really was about 4 times faster in FAST mode.


The C64 would also speed up for similar reasons (the VIC (graphics) chip could control the bus, so number of cycles "stolen" depended on your graphics - e.g. every scanline with sprites present it would steal more cycles).


C64, too. It was common to blank the screen during intensive processing--many games with audio samples would blank the screen while playing their tiny collection of dodgy voice lines, though a few didn't.


An even larger speedup could be realized by performing "solid guessing" like Fractint: Render at 1/2 resolution (that's 1/4 the number of pixels), then take a second pass only in areas where there's a color boundary. This cuts render times down to less than half.
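
For anyone who hasn't seen it, a rough sketch of the idea in Python (not Fractint's actual algorithm; the 2x2 cell size, the escape-time colouring, and the "refine wherever coarse neighbours disagree" rule are simplifications of my own):

  # Pass 1 computes every other pixel; pass 2 recomputes a 2x2 cell only
  # where the coarse samples disagree (i.e. near a colour boundary).
  def mandel(cx, cy, max_iter=64):
      x = y = 0.0
      for i in range(max_iter):
          if x * x + y * y > 4.0:
              return i                     # escape iteration = colour index
          x, y = x * x - y * y + cx, 2 * x * y + cy
      return max_iter

  def render(width=256, height=256, x0=-2.0, y0=-1.25, scale=2.5):
      # width and height are assumed even here, to keep the sketch short
      px = lambda i, j: mandel(x0 + scale * i / width, y0 + scale * j / height)
      img = [[0] * width for _ in range(height)]

      # Pass 1: compute the top-left pixel of every 2x2 cell (1/4 of the
      # work) and copy it into the whole cell.
      for j in range(0, height, 2):
          for i in range(0, width, 2):
              c = px(i, j)
              img[j][i] = img[j][i + 1] = img[j + 1][i] = img[j + 1][i + 1] = c

      # Pass 2: wherever a coarse sample disagrees with its right or lower
      # neighbour, recompute that 2x2 cell at full resolution.
      for j in range(0, height, 2):
          for i in range(0, width, 2):
              right = img[j][min(i + 2, width - 2)]
              below = img[min(j + 2, height - 2)][i]
              if img[j][i] != right or img[j][i] != below:
                  for dj in (0, 1):
                      for di in (0, 1):
                          img[j + dj][i + di] = px(i + di, j + dj)
      return img

In smooth areas the second pass never fires, which is where the "less than half" savings comes from.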


When I was in grad school I (briefly) worked on a rendering engine based on this concept. It was trying to do ray tracing for an entire scene in real-time (original work would've been circa 2003, mine around 2006). With dynamic content.

The algorithm was, roughly: render 16x16 randomly chosen pixels in the scene. If the variance in their colors was low, you're done. If it was high, you cut the region in half (kd-tree style). Then pick a region at random, pick another random 16x16 set of pixels in it, and render them. If the variance is low you're good; otherwise split again. Eventually something like a wall might be described by only 4 regions, even though it takes up a large portion of the final image. You repeat this random selection and rendering continuously. There was also a way to merge regions should the scene change and they end up with a similar color, though I've forgotten the details.

It was quite fast on commodity hardware from 2004/2005.
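
Something roughly along those lines can be sketched in Python, with a cheap 2D Mandelbrot shader standing in for the ray caster (the sample count, the variance threshold, and the split-the-longer-axis rule below are placeholders I made up, not the original engine's):

  import random

  # Repeatedly pick a region, sample random pixels in it, and either accept
  # the region (low variance) or split it in two (high variance).
  def shade(u, v, max_iter=64):            # u, v in [0, 1): stand-in for a ray cast
      cx, cy = -2.0 + 2.5 * u, -1.25 + 2.5 * v
      x = y = 0.0
      for i in range(max_iter):
          if x * x + y * y > 4.0:
              return i
          x, y = x * x - y * y + cx, 2 * x * y + cy
      return max_iter

  def refine(width=256, height=256, samples=256, threshold=4.0, rounds=2000):
      img = [[0] * width for _ in range(height)]
      regions = [(0, 0, width, height)]    # (x, y, w, h) leaves of the kd-tree
      for _ in range(rounds):
          x, y, w, h = random.choice(regions)
          pts = [(x + random.randrange(w), y + random.randrange(h))
                 for _ in range(min(samples, w * h))]
          vals = [shade(i / width, j / height) for i, j in pts]
          mean = sum(vals) / len(vals)
          var = sum((v - mean) ** 2 for v in vals) / len(vals)
          if var < threshold or (w <= 1 and h <= 1):
              for j in range(y, y + h):    # low variance: call the region done
                  img[j][x:x + w] = [int(mean)] * w
          else:                            # high variance: split the longer axis
              regions.remove((x, y, w, h))
              if w >= h:
                  regions += [(x, y, w // 2, h), (x + w // 2, y, w - w // 2, h)]
              else:
                  regions += [(x, y, w, h // 2), (x, y + h // 2, w, h - h // 2)]
      return img

Merging regions when the scene changes, as mentioned above, is left out here.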


Coolness. My buddy Roy created AccuRender, a plugin for AutoCAD and others. He did something similar, back in the day; I'd have to ask him about the details. The strategy went from scanline to progressive rendering. While slower overall, the quick results enabled quick sanity checks, a huge UI productivity boon. Now I have to ask him and relive the glory years. :)


IIRC, there's an even more powerful trick available: render the edges of a rectangle. If there are no color boundaries on the edges, you can usually assume it'll be solid-colored and skip rendering any of the contents. Some exceptions apply.


An enhancement to this method which I've seen is to check the edges of the rectangle for colour boundaries, and also check a single random point inside. Presumably you can extend to n x n rectangles and k points inside as a function of n.
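
A toy version of this for the Mandelbrot case, in Python (the size cutoff and the k interior sample points are arbitrary choices for illustration, not anything canonical):

  import random

  # If every pixel on a rectangle's border (plus k random interior samples)
  # has the same escape count, fill the interior without computing it;
  # otherwise split the rectangle into quarters and recurse.
  def mandel(cx, cy, max_iter=64):
      x = y = 0.0
      for i in range(max_iter):
          if x * x + y * y > 4.0:
              return i
          x, y = x * x - y * y + cx, 2 * x * y + cy
      return max_iter

  def render(width=256, height=256, x0=-2.0, y0=-1.25, scale=2.5, k=3):
      img = [[None] * width for _ in range(height)]

      def px(i, j):
          if img[j][i] is None:
              img[j][i] = mandel(x0 + scale * i / width, y0 + scale * j / height)
          return img[j][i]

      def fill(x, y, w, h):
          if w <= 2 or h <= 2:                  # too small: just compute it
              for j in range(y, y + h):
                  for i in range(x, x + w):
                      px(i, j)
              return
          border = ([px(i, y) for i in range(x, x + w)] +
                    [px(i, y + h - 1) for i in range(x, x + w)] +
                    [px(x, j) for j in range(y, y + h)] +
                    [px(x + w - 1, j) for j in range(y, y + h)])
          inside = [px(x + random.randrange(1, w - 1),
                       y + random.randrange(1, h - 1)) for _ in range(k)]
          if len(set(border + inside)) == 1:    # uniform: assume a solid block
              for j in range(y + 1, y + h - 1):
                  img[j][x + 1:x + w - 1] = [border[0]] * (w - 2)
          else:                                 # mixed: quarter it and recurse
              hw, hh = w // 2, h // 2
              fill(x, y, hw, hh)
              fill(x + hw, y, w - hw, hh)
              fill(x, y + hh, hw, h - hh)
              fill(x + hw, y + hh, w - hw, h - hh)

      fill(0, 0, width, height)
      return img

As noted elsewhere in the thread, this can elide interior detail if you colour by escape iterations, and a rectangle that surrounds the whole set is the classic failure case.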


what are the exceptions? I thought that Mandelbrot was always guaranteed to be ok for this optimization (unlike, say, Julia)


The exception I'm aware of is when the rectangle is so large that it surrounds the entire set.


If you are doing colouring based on escape iterations this technique can elide detail.


Well, I already ported Fractint to X Windows so I'm not looking to port it to the Alto too :-)


Could this be improved even further by starting with 1/512 resolution (or whatever divides well) and making more passes only where there's a boundary - basically as a quadtree?


A long time ago someone from Google Maps told me they did something like this with their tiles when everything was a pre-rendered image.


Raster pyramids!


Very nice. I think the whole art of dealing with limited resources is lost on new programmers who have gigabytes of RAM and gigacycles of compute. I love the old 1K demos for that reason; it really puts the puzzle back into programming.

In my experience, if you are familiar with RTL level synthesis of hardware then microprogramming isn't a huge leap (to me they feel very similar).


> I think the whole art of dealing with limited resources is lost on new programmers who have gigabytes of RAM and gigacycles of compute.

Chuck, people have been saying that since punch-card days.

And here I am, cutting out strings from my program to make it fit under the 1MB barrier for my bootloader-to-SRAM to be happy :)


No doubt true. I still think it makes programming more interesting if it is constrained.


You just need to solve more computationally intensive problems. There are plenty of simulations and mathematical graphics you can make which take hours of rendering time on a fast modern machine, where optimizing the inner loops can have a big impact.


Constraints make a lot of creative endeavors interesting. I have no idea why Lego modeling is so compelling when you can just print miniatures, but for me looking at people's work is like crack.


You're just solving the wrong problems then. Try rendering the whole Earth at sub-cm resolution or something. There are so many things we can't do right now because we're too constrained...


I think he is railing more against text editors made using JS and a chrome-less web browser...


Programmers should have machines worse than the users. Flame away.


Always a controversial position :-) Back in the day, hosting the full development environment would often level out the differences in performance; not so much today. With mobile development you don't have much of a choice: your stuff has to run on the phone.


It still holds, though: when developers only get to test on a Google Pixel and a Samsung S8 while customers are on a ZTE Blade, the stuff might even install, but run it won't.


At least for testing. My favorite example was the Microsoft Office for Mac team testing every build on a 6100/60.


I like to have a decent machine for development, especially when running Visual Studio (which I do not use often, but Visual Studio is not exactly lightweight).

But for testing I absolutely prefer to use a low-end machine. I used to have a SparcStation 20 at home (it died a few years ago, though), and I loved that machine for that precise reason. (Also: Big-endian vs. Little-endian issues) Damn, that beast was SLOW. Not just slow: It only had 64 MB of RAM, and I had removed the hard drive because it was way too loud, so the entire thing ran over the network. Swapping over a half-duplex 10MBit line makes you really think about memory usage. ;-)


For testing/debugging, yes. For compiling, no.


I've made Java services respond in 500us on average for in-memory queries. Yeah, average, including GC stops. This is where things get really fun, because every lock, every array access, every method call becomes a problem.

It's an interesting world. Closed and not entirely relevant to most developers.


What are good, up-to-date resources for performance tuning to that degree in Java? I've looked into this a few times and keep hitting walls where the resources I've found say "let the JVM do its thing".


Mechanical Sympathy is a good source, probably the source for Java-specific topics. In general, you'll want to really, really learn the internals of the JIT so you can structure your application to cooperate with the JIT and the GC.

Beyond that, I've found the biggest throughput gains and/or latency decreases by implementing the right threading or lack of threading, locking or lock-free implementations, and then tuning, profiling and optimizing things just right for the target hardware.

That's just a little sentence, but it's an amazingly hard topic. There are not many resources about this out there beyond just looking at high performance systems, regardless of language. The innodb storage engine is a fun project with some interesting locking, and I like reading about the things the kernel does.


This mailing list is amazing:

https://groups.google.com/forum/#!forum/mechanical-sympathy

I understand maybe a third of it, but what I do get is high quality.


Besides what the others pointed out, check out profilers like VisualVM, ControlCenter, or the one integrated into NetBeans.

There is also the option of looking at the generated machine code; here is a recent example with the Oracle JDK about SIMD programming in Java.

http://prestodb.rocks/code/simd/

Finally, note that just like many programming languages, Java enjoys multiple implementations, each with its own JIT.


"However, Alto microcode is pretty crazy, so I'm not going to try a microcode Mandelbrot."

I don't understand that logic, coming from somebody who has a restored Xerox Alto and wrote a Mandelbrot generator in BCPL for it "to learn how to use the Alto's bitmapped display, not make the fastest Mandelbrot set".

I also expect this guy will get source code for the 'simple' things such as an OR instruction or a fixed-width multiplication instruction within a few weeks.

Edit: I’ve been browsing http://bitsavers.trailing-edge.com/pdf/xerox/alto/Alto_Hardw....

There apparently are only 16 basic micro-instructions ("The ALU function field controls the SN74181 ALU. This device can do a total of 48 arithmetic and logical operations, most of which are relatively useless. The 4-bit field is mapped by a PROM into the 16 most useful functions"), and OR is one of them:

       ALUF  FIELD FUNCTION
         0   BUS            (A)
         1   T              (B)
         2   BUS OR T*      (A+B)
         3   BUS AND T      (AB)
         4   BUS XOR T      (A XOR B)
         5   BUS + 1*       (A PLUS 1)
         6   BUS - 1*       (A MINUS 1)
         7   BUS + T        (A PLUS B)
        10B  BUS - T        (A MINUS B)
        11B  BUS - T - 1    (A MINUS B MINUS 1)
        12B  BUS + T + 1*   (A PLUS B PLUS 1)
        13B  BUS + SKIP*    (A PLUS 1)
        14B  BUS.T* (AND)   (AB)
        15B  BUS AND NOT T  (A & NOT B)
    16B-17B  UNDEFINED
So, following the best source I have (that PDF), which states:

"For the most part, since the Alto is such a simple machine, writing Alto microcode is a straightforward exercise in rule-following"

that shouldn’t be too hard, provided that there is a free spot for writing the microcode instruction, or that you can find an instruction to give up in exchange for a simple OR.

Edit 2: that PDF also describes the way you do an OR, so that apparently was the way to go:

”To "or" together the contents of ACO and ACI; results ACO:

    COM 1.1
    AND 1,0
    ADC 1,0


The microcode isn't quite as simple as you make it sound. Trust me, I've needed to understand a lot of the Alto's microcode to debug disk and other issues, and it's brain-twisting.

There are the 16 basic ALU micro-instructions, but these are combined with 8 different bus sources and two sets of "special functions" to make fairly complex micro-instructions. The special functions are a bit crazy and also totally change meaning depending on which hardware task is running.

Control flow in microcode is pretty wacky. Each micro-instruction contains the address of the next micro-instruction, so it's like a goto in every instruction with instructions randomly scattered through memory. Additionally, branches are done by the hardware setting bits in the address based on conditions, so you end up with basically a computed goto. Finally, branches happen the instruction after the condition. So even just following the control flow of microcode is hard.

Another fun thing with microcode is that to read memory you write the address to the memory address register. Then you need to wait two microinstructions before you can read or write the memory value.

If you want to take a look at the Alto's microcode: http://bitsavers.trailing-edge.com/pdf/xerox/alto/microcode/...


Yes, I saw a bit of that, and did realize what you meant by pretty crazy. Rereading my last edit, I also notice that the smiley I had there got lost in editing. Apologies for that.

If you do have the microcode for AND, and a RAM board to store custom microcode, creating an OR instruction shouldn't be insanely hard, even without an assembler (finding a spot to put it without losing another instruction may be impossible, though, and that could mean a fairly large rewrite of your OS and applications).

Also, branch delay slots (https://en.wikipedia.org/wiki/Delay_slot#Branch_delay_slots) apparently predate RISC.


Well, the microcode for AND is simple; it's this line

G17: L← ACDEST AND T, TASK, :SHIFT;

...the problem is that that's in the middle of the decode (there's a computed jump into this lot from the line labelled DIS1:). So you could change the AND insn to do OR instead, trivially (use "OR" here where the line says "AND"), but you probably wanted to keep AND. So you'd need to at least add enough of your own decode to be able to put the insn somewhere else. And as you say that looks like the real problem: http://users.rcn.com/crfriend/museum/doco/DG/Nova/base-instr... suggests that given 16 bit instructions and the format they've used there isn't much space left. The best you could do would be to steal some of the 'io instruction' space and have it operate on fixed accumulator registers the way MUL and DIV do.

(The decode is remarkably confusing even knowing what it's supposed to be doing; there seems to be a lot of implicitly jumping to entries in decode tables based on bits of the instruction...)


Looks like you've figured a lot of it out, but if you want more details... The way instruction decoding works is the hardware slams some of the instruction bits onto the microaddress bus, causing a 16-way dispatch based on the instruction, selecting which instruction to execute. In detail, looking at the microcode:

  !17,20,GETAD,G1,G2,G3,G4,G5,G6,G7,G10,G11,G12,G13,G14,G15,G16,G17;
  DIS0:	L←T←IR←MD;
  DIS1:	T←ACSOURCE, :GETAD;
  GETAD: T←0, :DOINS;			PAGE 0
  G1:	T←PC -1, :DOINS;		RELATIVE
  G2:	T←AC2, :DOINS;			AC2 RELATIVE
  ...
  G17:	L←ACDEST AND T, TASK, :SHIFT;
"IR" is the special instruction register. Loading it from memory (MD=Memory Data) causes a dispatch after the next microinstruction. ":GETAD" is a microcode branch, so the next microinstruction after DIS1 would normally be at label GETAD. However, IR causes bits 0,5,6,7 of the instruction to get OR'd into the microaddress, resulting in the dispatch to G1, G2 etc. But how does the microassembler know that G1, G2, etc need to be aligned properly for this to work? That's what the !17,20 directive does: 17 and 20 are the necessary alignment in octal, followed by the list of labels that need to be aligned.

Further instruction decoding is done with the IDISP and ACSOURCE special microcode functions, which do more complex decoding (based on multiple bit combinations) resulting in similar dispatches. So the branch to :DOINS or :SHIFT will do a second dispatch, based on the earlier ACSOURCE. (See page 30 of the hardware manual for details.)


Actually, the PARC guys stole all of the io instruction space for other things on the Alto. Only the math and load/store instructions are compatible with the Nova.


Even if I were crazy enough to write microcode, there's a problem that we don't have the microcode assembler MU.RUN. Maybe it will turn up on one of the PARC disks we're archiving...


There's a mu.run on (at least) one of the disks archived at the Computer History Museum, listed at

http://xeroxalto.computerhistory.org/_cd8_/alto/.index.html

http://xeroxalto.computerhistory.org/_cd8_/alto/.mu.run!2.ht...

but I don't know if there's any way to transfer it from a disk image to an actual Alto disk?


Thanks - I searched that archive for MU.RUN but somehow missed it. We got a gateway from the Living Computer Museum that lets us transfer files to the real Alto. So I guess I need another excuse for why I'm not writing microcode :-)


Interesting, I hope that the archives end up available to the public. It'd be really neat to build some kind of functional emulator like the visual 6502 to play around in. Do it on a block level for the CPU so you can examine the state as things happen.


Funny you should mention that, because the Living Computer Museum wrote a high-quality emulator (ContrAlto) for the Alto. It's super-accurate, even emulating the microcode and hardware, and runs at the speed of the live machine. It includes a debugging mode that lets you examine machine state and step through machine instructions and microcode. Source: https://github.com/livingcomputermuseum/ContrAlto

ContrAlto was very helpful while getting the real Alto running, and I also use it for developing software such as the Mandelbrot. I wrote a guide to getting started with ContrAlto: http://www.righto.com/2016/10/simulating-xerox-alto-with-con...


Awesome, I'll check that out. I think I remember looking back when this started and finding SALTO, but it appeared to be incomplete and buggy. ContrAlto looks interesting.


Pardon my ignorance, but does the Alto multi-task between user programs? If the answer to that question is affirmative, how does the machine cope with simultaneously running two programs that load different custom microcode? Does it load and unload and reload at every context switch? Or does it simply not multi-task?


The Alto is a "personal computer" and only runs one program at a time. The operating system provides primitive task management (sort of like threads) so the operating system can for instance process network packets while a program is running, or a program can contain multiple tasks. Thus, a program can load the microcode it needs when it starts up and doesn't need to worry about conflicts.

This task management is entirely orthogonal to the 16 tasks that are running at the microcode level to refresh the display and handle I/O devices.


Fascinating article. One question, you write that: "Using what I learned with the Mandelbrot, I wrote a program to display images; an example is below." - I couldn't find source code for that on github - is it published anywhere?

It would be interesting to see some code that exercises more parts of the system, beyond the Mandelbrot generator.

About coding style - I guess this is tabs-vs-spaces territory - but is there a particular reason why you only indent blocks in the loops, not in the procedures/functions (like under main)?

I also wonder a bit about some of the magic constants - like 30705 - is there overhead to using variables in BCPL (e.g. an indirect dereference, no automatic inlining by the compiler)?

Finally, how about procedure call overhead? Granted, the Mandelbrot generator is rather simple (shifts and other tricks notwithstanding) - so I can see why it makes sense to keep it all in a single procedure - but what does a call/ret look like on the Alto? (eg: in simplified assembler, in terms of stack-push, registers etc)?


I've put the image code on github: https://github.com/shirriff/alto-display-image

Coding style: I tried to match the style of the existing Alto files, which starts functions at the left margin and then indents by 3 (!) spaces from there. e.g. http://xeroxalto.computerhistory.org/Pixel/IFS/Sources/.IfsM...

Magic constants: I'm lazy so it's easier to leave them inline when I'm hacking on code. I don't know if there's runtime overhead.

BCPL procedure call overhead: kind of nasty. There's no stack support in the instruction set, just jump-and-link. So a called procedure first calls a subroutine "getframe" that sets up the stack frame (very similar to a C stack frame). Then a second subroutine "moveargs" moves the call arguments into the stack frame. At the end of the procedure call, a third subroutine call cleans up and does the return.


> I've put the image code on github (...)

Thank you! It's interesting to see how similar this code is to assembly code calling OS procedures after declaring them external - I've recently been playing (again) with assembly for the win64 arch, and one sample program (a mish-mash of example code available for nasm, fasm and the "go" assembler) is rather similar IMNHO - note this just displays a window, no loading of (image) files:

  ; Assemble and link with:
  ;   nasm -f win64 .\hello-nasm.asm
  ;   golink .\hello-nasm.obj \
  ;     Kernel32.dll User32.dll /entry:WinMain

  global WinMain
  extern ExitProcess
  extern MessageBoxA

  section .text

  WinMain:
    sub rsp,8*5 ; reserve stack for API use and make
                ; stack dqword aligned

    xor rcx, rcx         ; hWnd = HWND_DESKTOP
    lea rdx, [szCaption] ; LPCSTR lpText
    lea r8, [szTitle]    ; LPCSTR lpCaption
    xor r9d, r9d         ; uType = MB_OK
    call MessageBoxA

    mov ecx,eax
    call ExitProcess

  section .rdata
  szTitle:   db 'Hello, Title!', 0
  szCaption: db 'Hello, World!', 0
The more things change...

nasm: http://www.nasm.us/ go linker (and assembler): http://godevtool.com/ (not to be confused with Google's "golang" Go programming language and tools).


A pretty in-depth hardware manual exists if your interest was piqued by this article.

http://bitsavers.trailing-edge.com/pdf/xerox/alto/Alto_Hardw...


Just like FAST on a Commodore 128: it would run at 2 MHz instead of 1, and the screen went black. To go back to 1 MHz, you would use the SLOW command, IIRC.


That was a bit different in that the CPU clock frequency was actually changed. The screen blanking in that instance was because the VIC-II could only run at 1MHz, so the screen would get messed up in 40 column mode if the CPU (and by extension the bus speed) was switched to 2MHz.

You do get a speedup just by disabling the screen on both the C64 and C128, though, since with the display on, the VIC hogs a lot of memory bus cycles. That's a bit more similar to the situation on the Alto, except that instead of different "hardware threads" they are separate chips, and the VIC just has the ability to take control of the bus whenever it needs memory access.


My Computer Org&Architecture course in college had us programming on a PDP-8, which has an even more ridiculously restricted instruction set. I always intended to build a Mandelbrot renderer for it, but never got around to it.

The actual physical PDP-8 we had never had a printer hooked up, so actually drawing the thing would have been impossible, other than using the blinkenlights to render a line at a time.


Great!

It would be good if it were updated every so often; it probably wouldn't kill the speed too much.


It is time to coin a new phrase: the Hacker News inverse comment-interest ratio. The fewer comments on a story, the more likely it is to be extremely interesting and relevant to a hardcore geek.

Anytime I see a slightly-curious headline with only 0-3 comments, I know it's going to be good.


That's probably the wrong explanation.

There's some kind of velocity factor in the position ranking, so a story that quickly gets a bunch of votes will have a high rank and few comments. But the comments are likely to come as the link sits there for a while.


I think this is fairly accurate. It's really easy to upvote something you find interesting, but takes much more time to come up with good comments. Usually these types of articles will generate quite a few comments even though they start out on the top with basically none.


The simple time it takes to read the article also applies. I've been following this series, so I upvoted it, then read it, then came back to read the comments.



