
One simple hardware mental model is two processors running in parallel, with an XOR across every pair of matching pins and the XOR outputs wired-OR'd onto the error-shutdown line. In practice that doesn't really work: small, otherwise irrelevant differences in PCB trace length and clock distribution mean the two processors run just slightly out of sync, so the XORs fire for a fraction of nearly every clock cycle. You could play games with gating and only sampling once the signals are well settled...
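To make the jitter problem concrete, here's a toy sketch in Python (not real hardware, and every name and number in it is made up for illustration). It models one pin on each CPU as a sampled waveform, with CPU B delayed by a couple of samples, then compares them continuously versus once per cycle after the signals have settled.

```python
def waveform(bits, samples_per_cycle, skew):
    """Expand one bit per clock cycle into samples, delayed by `skew` samples."""
    wave = [b for b in bits for _ in range(samples_per_cycle)]
    # Model the delay by padding the front with an idle-low level.
    return ([0] * skew + wave)[:len(wave)]

def continuous_mismatches(a, b):
    """Count samples where a free-running XOR comparator would fire."""
    return sum(x ^ y for x, y in zip(a, b))

def sampled_mismatches(a, b, samples_per_cycle, sample_at):
    """Count mismatches when the XOR is only sampled once per cycle."""
    return sum(a[i] ^ b[i] for i in range(sample_at, len(a), samples_per_cycle))

# Hypothetical pin activity and timing, chosen only for illustration.
bits = [1, 0, 1, 1, 0, 0, 1, 0]
SPC = 10  # samples per clock cycle
cpu_a = waveform(bits, SPC, skew=0)
cpu_b = waveform(bits, SPC, skew=2)  # same data, arriving 2 samples late

print(continuous_mismatches(cpu_a, cpu_b))       # fires on every transition
print(sampled_mismatches(cpu_a, cpu_b, SPC, 8))  # quiet once settled
```

With these numbers the free-running XOR fires on 12 samples even though both CPUs are healthy, while sampling late in the cycle sees no mismatch at all. A genuine fault that holds for a whole cycle still shows up at the sample point.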

A somewhat more realistic way to build it out of real hardware is to have each processor run on only 1 out of every X clock cycles. Reading between the lines of the story WRT "down clocked processors", I suspect this is what it's doing. This also detects weird power spike problems, where lightning miles away means every processor running at that instant executes 0xFFFF (whatever opcode that happens to be), but a couple of ns later the spike is gone and the other processors run normally. Same goes for RFI/EMC, etc. This is all very nice apart from a significant hit to performance if you run a large number of processors. Then again, if your primary figure of merit for your system is extreme reliability and not raw MIPS...
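A rough Python sketch of the staggered idea, purely as a thought experiment (the toy "CPU" is just an accumulator, and `run_staggered` and its parameters are hypothetical names, not anything from the story). Each CPU only executes on its own tick, so a one-tick spike corrupts at most one of them, and the comparison at the end of the round catches it.

```python
from collections import Counter

def run_staggered(num_cpus, steps, spike_tick=None):
    """Return (step, [bad cpu indices]) at the first disagreement, else None."""
    accs = [0] * num_cpus
    for tick in range(steps * num_cpus):
        cpu = tick % num_cpus  # only this CPU executes on this tick
        accs[cpu] = (accs[cpu] * 5 + 1) & 0xFFFF  # toy deterministic "instruction"
        if tick == spike_tick:
            accs[cpu] = 0xFFFF  # the spike turns this CPU's result into garbage
        if cpu == num_cpus - 1:  # every CPU has finished this step: compare
            majority, _ = Counter(accs).most_common(1)[0]
            bad = [i for i, v in enumerate(accs) if v != majority]
            if bad:
                return tick // num_cpus, bad
    return None

print(run_staggered(3, 10))                # None: no spike, no disagreement
print(run_staggered(3, 10, spike_tick=7))  # (2, [1]): CPU 1 was hit during step 2
```

With only two CPUs the same loop still detects the mismatch, but the majority vote can't tell you which one is wrong. The performance cost is visible in the loop bounds: `steps` instructions take `steps * num_cpus` ticks.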

It would be interesting to hear whether they use dual-port RAM to get around the jitter XOR problem and have some alternative way to detect synchronous behavior. Dual-port RAM is weird but COTS. Triple-port RAM is not as COTS.




Comparing the bus states of two CPUs running off the same clock is perfectly feasible, and there are no problems with glitches as long as the whole mechanism is synchronous (and the CPU is completely deterministic, which it should be, but then there are things like RNG-based cache eviction policies). In fact, both the original Pentium and some m68k processors include all the required hardware for this, and building an error-checking pair of them involves literally connecting all the bus pins of the two CPUs in parallel.
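As a minimal sketch of what "synchronous" buys you, here's a toy lockstep pair in Python. The `ToyCPU` class, the master/checker naming, and the injected fault are all assumptions for illustration, not a model of any real part. Both CPUs advance on the same clock edge and the bus values are compared exactly once per cycle, so there's no window in which skew can be observed.

```python
class ToyCPU:
    """Deterministic stand-in for a CPU: same inputs, same bus output."""
    def __init__(self):
        self.acc = 0

    def clock(self):
        self.acc = (self.acc * 5 + 1) & 0xFFFF
        return self.acc  # value driven onto the bus this cycle

def run_lockstep(cycles, fault_cycle=None, fault_mask=0x0004):
    """Return the cycle of the first bus mismatch, or None if the pair agrees."""
    master, checker = ToyCPU(), ToyCPU()
    for cycle in range(cycles):
        bus_m = master.clock()
        bus_c = checker.clock()
        if cycle == fault_cycle:
            checker.acc ^= fault_mask  # hypothetical transient bit flip
            bus_c = checker.acc
        if bus_m != bus_c:  # compared once per clock edge, after both settle
            return cycle
    return None

print(run_lockstep(100))                  # None
print(run_lockstep(100, fault_cycle=37))  # 37
```

The determinism caveat is the whole game here: anything like a random cache eviction policy would make the two `clock()` calls diverge without any fault at all.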


I read the architectural details of NonStop and Stratus. Both were designed to compare I/O outputs instead of just catching everything at the CPU level. The CPU-level redundancy is more for catching transient faults with fail-fast logic. The idea is that a problem shows up in an obvious way by the time it hits the I/O buses.

Past that I don't recall much.



