Stratus: Servers that won’t quit – The 24 year running computer (cpushack.com)
219 points by protomyth on Jan 29, 2017 | 123 comments



Does anyone know why these systems - Stratus, Tandem/HP NonStop, etc. - appear to be relics of the past?

I can see how Google is better off spending money on software engineers than on hardware (thus, fault-tolerant distributed systems); but there are lots of systems that don't need Google-scale computing power, nor Google-style ultra-low-cost-per-operation hardware, but that really need to stay up - utilities, banks, every system at the center of a big enterprise.

Is the issue that modern software just isn't reliable enough to make hardware failures an important part of the downtime? Or is it just that systems like https://kb.vmware.com/selfservice/microsites/search.do?langu... are cheaper today? (I guess those actually work, right? The level of complexity involved is frightening...)


In the old days 100% of computers were doing civilization-critical work: air traffic control, Social Security checks, nuclear reactor monitoring. Computers were expensive, and automation provided the most gain in those applications. There just aren't that many civilization-critical systems.

In the modern era we continue to have the same small number of civilization-critical systems; we still have exactly one air traffic control system. However, we also have billions of, well, filler: ringtones and social media and spam. Therefore, rounded down, 0% of modern computers are doing civilization-critical work, and safe, secure systems superficially appear to be a thing of the past even if the number of such systems and their employees remains constant.

In 1960 the average ALU was calculating your paycheck, and it was rather important to get that done on time and correctly. In 2010 the average ALU is a countdown timer in your microwave oven, it's your digital thermostat, it's my car's transmission processor; maybe it's a video game system or other human interface device (desktop, laptop, phone), but that's numerically unlikely. Things are sloppier now because they're sloppy-tolerant applications.


Great explanation. In other words, the share of computers performing some kind of critical task has probably gone from 90% to somewhere closer to 0.


In absolute numbers they're still there. Air traffic control and baggage sorting applications still run APL, and your bank still runs COBOL. It has not gone away.


While I hear about COBOL in banks quite often, all my friends actually working there say Java and Oracle, and all the related job adverts I have seen so far were for Java or C++. Could someone please provide some relevant facts?


OK, so I'm probably revealing something about myself... I appreciate that HN values what you say rather than who you are. But as you asked...

Flexcube, now owned by Oracle (core banking, basically a general ledger if used to that extent), and VisionPlus (credit cards) are done in multiple languages, but the core is COBOL. If you're in banking you can probably guess who I worked for.

These systems go deep into an organisation's nervous system, and in the case of Flexcube, for example, come from a prominent bank's nervous system.

Most in-house mainframe programming today, however, is on the middle layer providing APIs with JSON feeds, with the heavy lifting left to third parties, largely with workforces in India and sometimes China.

This is a world away from trading or investment management. While for a layman 'finance' is often a catch-all, banking IT and trading IT are different worlds. Java dominates investment/trading, with C# doing a bit but not much. C++ is mainly for trader plugins that need to do things fast and can't be done in self-taught VBA, but Java has largely filled that void, supplemented by C#.

Edit: Around 5-10 years ago, a lot of IT work in banking was about unifying each country's code-base around a common core; the larger the bank, the more tedious the task. Now it is all about AML and KYC, if you're interested in banking's non-front-office space. ML is a joke; quants do what quants do and win 50% of the time.


It must be quite common to be in a big financial corp, work in IT and not see a green screen (mainframe terminal) in your entire working life there. I did a stint in one of the mainframe teams in my old workplace (Big Card Network™) but the vast majority of the people in my age bracket who were in IT worked with java and javascript, on the web applications that make up the corp's front-end.

As zhte415 lets on, most of the people working on the mainframes were third-party contractors. I don't think the company even advertises COBOL and mainframe roles. When I was hired (as part of a graduate programme) the job description didn't say a word about mainframes. I had to raise a little hell to get into one of the mainframe teams, and it wasn't even particularly easy.

I mean, you'd think with all the stuff people say about how the old guard is retiring and those ancient systems will need people who know how to maintain them, they'd have been waiting for me with open arms. Not quite.

My hunch is that there's a supply/demand thing going on. It's not that all those big banks etc. don't need modern-educated software engineers that can tend to the ancient tech. They do. Except, those new generation soft. eng's don't really care about the ancient tech, and the corps can buy cheap foreign labour to tend to the mainframes. There's no supply and all demand is covered. So they advertise for the roles that they do need filled, which is to say, everything on mobile platforms, the web etc.


Going into mainframes while also keeping your eyes open for multiple techs after college is a great way to get into architecture and an international career at a comparatively young age, if you're into that.


So, I work for one of the larger US banks, so I'm qualified to answer this question!

I'll narrow this down to one particular section of our consumer bank, because that serves as a fairly "clean" example, not much affected by outsourcing or other complications.

We have a "core banking system" -- the computer that keeps track of the account balances for customers and moves around the bits that represent dollars. This is implemented on 1960s technology: ours doesn't happen to be in Cobol, but it is written in a different language of similar vintage. The system is pretty rock-solid: as an example, it has only a couple hundred milliseconds at most for all of the processing needed to approve or decline a credit-card transaction in real-time, and it has no difficulty checking balances, restrictions, and business rules to enforce that -- that isn't so impressive until you consider that even the slowest transactions need to meet that limit and we still have to allow for network latencies. We struggle to find qualified programmers for this system: the few who are actually skilled in it can command pretty decent salaries and benefits, and we'll search around the world to find them. The system is not abandonware: we'd love if it were written in something more modern, but a complete rewrite would be an incredible undertaking (this system developed over several decades, and that is difficult to recreate). On the other hand, we ARE looking at questions like how to get it to run in the cloud.

So that's all true, but if you look through our technical job listings, you'll mostly find us looking for Java developers, PostgreSQL DBAs, Angular web-developers, and other such positions. That is because the core banking system is a tiny portion of what we do. In my specific area, we have roughly 20 development sprint teams (and a few other support folks not on those teams). Of those, ONE team has core banking system developers on it (along with some developers with other skill sets). By that estimate, it makes up 2% to 5% of the work we do.

The fact is, keeping track of the balance in your account is only ONE TINY PART of what your bank does. We have to keep track of your personal information (email address, mailing address, login id, etc). We have to serve you up a website. We have to process scanned checks, send out marketing emails, analyze traffic to detect fraud, and hundreds of other things. The core banking system (and other similar systems in similar companies) may be written in 1960s languages because the existing systems are robust and well-tested, but for that exact reason, they don't require a huge amount of development work.
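
To make the budget described above concrete, here is a hypothetical sketch in Python (made-up budget numbers, field names, and rules -- not the bank's actual system): the network allowance comes out of the couple-hundred-millisecond limit, and every check has to fit in what remains.

  import time

  TOTAL_BUDGET_MS = 200        # assumed end-to-end limit for an authorization decision
  NETWORK_ALLOWANCE_MS = 80    # assumed allowance for network latency in both directions

  def authorize(transaction, checks):
      # checks: list of functions returning True (pass) or False (decline).
      deadline = time.monotonic() + (TOTAL_BUDGET_MS - NETWORK_ALLOWANCE_MS) / 1000.0
      for check in checks:
          if time.monotonic() > deadline:
              return "DECLINE (out of time)"   # fail fast rather than blow the budget
          if not check(transaction):
              return "DECLINE"
      return "APPROVE"

  # Hypothetical rules standing in for balance, restriction, and business-rule checks.
  rules = [lambda t: t["amount"] <= t["available_balance"],
           lambda t: not t["card_frozen"]]
  print(authorize({"amount": 40, "available_balance": 100, "card_frozen": False}, rules))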


> the few who are actually skilled in it can command pretty decent salaries and benefits, and we'll search around the world to find them.

Just out of curiosity, what kind of numbers are you talking about?

> On the other hand, we ARE looking at questions like how to get it to run in the cloud.

Call me a masochist but that sounds really fun.


> Just out of curiosity, what kind of numbers are you talking about?

Unfortunately, that's exactly the kind of thing my employer does not want me to talk about.

> Call me a masochist but that sounds really fun.

Oh, it is. It's actually a really interesting project, and from what I've seen so far I think it may well be fully successful and go live with our actual customer records by mid-year 2017.


The core banking system (the one that is actually processing transactions) may still be in COBOL. The other add-on services are probably in other languages. I know a few banks here (Thailand) that have successfully transitioned their core banking to Java, so I would guess many large banks have done the transition too. However, I doubt smaller banks would have done the transition.


Intuitively I'd feel like the larger the bank, the more work such a transition would be and thus the smaller the likelihood of it having happened.


You're correct. It is a massive transition, middle-layer upon middle layer.

Oracle markets a completely different code-base (which is fully Java), even single-currency code-bases, from those for larger banks, with small third-party 'partners' providing the implementation.


The larger the bank (or any company, really) the more labor they've been devoting to merger cleanup.


And nuclear power plants are still managed by systems from the '70s and '80s, simply because they still operate and still do what they have to.


Nobody wants to be that guy who has to rewrite the nuclear power plant software.

I do wonder what they use. I assume off-site computers aren't really an option for networking reasons, so Stratus, NonStop, z mainframes, etc probably. I wonder if they have a backup mainframe, or if the redundancy in one of those things is enough.


>Nobody wants to be that guy who has to rewrite the nuclear power plant software.

Nobody is allowed to rewrite the nuclear power plant software. My first job was at Westinghouse Nuclear Division. Those Fortran libraries were off-limits in terms of changes.


No surprise, same story in energy balance circle management to calculate the optimal schedule.


> energy balance circle management

WTF? What is energy balance circle management?


It's when you are creating a schedule for all the power plants, taking into consideration predictions for home and industrial users as well as trading. There can be domestic and foreign trading involved. Not sure what it is called in English.


Is there a good move here? Update or not?


I used to work at a computer museum, and we had an old computer from the '60s that had been used in a plant. They came back and took it because they were upgrading their main computer and wanted to keep redundancy, i.e., a backup for their backup.

In short, yes. They always have two machines running ready to swap in. I have heard some also run multiple live mainframes to check that the results agree, and discard and restart calculations if they don't.


For example systems like RC 4000.

http://brinch-hansen.net/papers/1967b.pdf


Also I heard part of the reason is that those systems are trivially maintainable on-site; if something breaks, you take out a soldering iron and a box of spare capacitors, and go fix it.


Agreed - but also consider 'multiple points of redundancy'.

If you have your service spread across 10K servers, well, it doesn't matter if some go down.

Instead of making them all 100% super fault-tolerant for a crazy expensive unit price, you can make them relatively cheap and replace them when they fail.


Worked at a big co switching off Tandem. Went from a giant Tandem machine that cost 2 million dollars per year to rent to about 6 off-the-shelf Dell servers at a $6k one-time fee each. Performance at the end was about 3x the old system. It was a glorified billing system. If one node went down, who cares, the other 5 did the job.

Most people don't value 99.999 vs 99.9999 reliability as worth millions of dollars per system per year. Space shuttles may disagree, but not billing platforms.
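
To put rough numbers on that trade-off, here is a quick back-of-the-envelope sketch in Python showing how much downtime per year each availability target allows:

  # Allowed downtime per year implied by an availability target.
  MINUTES_PER_YEAR = 365.25 * 24 * 60

  def downtime_minutes_per_year(availability):
      return MINUTES_PER_YEAR * (1 - availability)

  for label, a in [("three nines ", 0.999),
                   ("five nines  ", 0.99999),
                   ("six nines   ", 0.999999)]:
      print(f"{label}: {downtime_minutes_per_year(a):8.2f} minutes/year")

  # three nines :   525.96 minutes/year (~8.8 hours)
  # five nines  :     5.26 minutes/year
  # six nines   :     0.53 minutes/year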


Note that the space shuttle didn't use processors running in lockstep, like the Tandem machines do (from what I understand). It used 5 single-core computers. Four of them would run the primary software, with each of the four controlling a single control channel.

For something like the elevons, control was accomplished by connecting 3 actuators to 3 of the channels, with voting being accomplished by physical force - if the 3 computers disagreed, the 2 would overpower the one. (Things like thrusters used electronic voting, close to the thruster itself.)

This seems closer to modern architectures with multiple computers, than the mainframe idea of redundant hardware.
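
As a toy illustration of the electronic-voting side (not shuttle code, just the general idea, with made-up channel names): a voter over four redundant channels can both pick an output and flag the dissenting unit, and a 2-1-1 split leaves it with no majority at all.

  from collections import Counter

  def vote(outputs):
      # outputs: dict of channel name -> computed value.
      # Returns (majority value or None, channels that disagreed with it).
      value, count = Counter(outputs.values()).most_common(1)[0]
      if count <= len(outputs) // 2:            # e.g. a 2-1-1 or 1-1-1-1 split: no majority
          return None, sorted(outputs)
      return value, [ch for ch, v in outputs.items() if v != value]

  print(vote({"GPC1": 42, "GPC2": 42, "GPC3": 42, "GPC4": 41}))  # (42, ['GPC4'])
  print(vote({"GPC1": 42, "GPC2": 41, "GPC3": 42, "GPC4": 40}))  # (None, [...]) -- 2-1-1 split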


Note that all those computers shared the same bus, so the complete system wasn't as redundant as hoped, as this story [1] retells:

"At 12:12 GMT 13 May 2008, a space shuttle was loading its hypergolic fuel for mission STS-124 when a 3-1 disagreement occurred among its General Purpose Computers (GPC 4 disagreed with the other GPCs). Three seconds later, the split became 2-1-1 (GPC 2 now disagreed with GPC 4 and the other two GPCs). This required that the launch countdown be stopped.

During the subsequent troubleshooting, the remaining two GPCs disagreed (1-1-1-1 split). This was a complete system disagreement. However, none of the GPCs were faulty. The fault was in the FA 2 Multiplexer Demultiplexer. This fault was a crack in a diode. This crack was perpendicular to the normal current flow and completely through the current path. As a crack opened up, it changed the diode into another type of component ... a capacitor.

Because some of the bits in the signal are smaller than they should have been, some of the GPC receivers could not see these bits. The ability to see these bits depends on the sensitivity of the receiver, which is a function of manufacturing variances, temperature, and its power supply voltage.

From the symptoms, it is apparent that the receiver in GPC 4 was the least sensitive and saw the errors before the other three GPC. This caused GPC 4 to disagree with the other three. Then, as the crack in the diode widened, the bits became shorter to the point where GPC 2 could no longer see these bits; which caused it to disagree with the other GPC. At this point, the set of messages that was received correctly by GPC 4 was different from the set of messages that was correctly received by GPC 2 which was different again from the set of messages that was correctly received by GPC 1 and GPC 3. This process continued until GPC 1 and GPC 3 also disagreed with all the other GPC."

[1] Adapted from https://c3.nasa.gov/dashlink/projects/79/wiki/test_stories_s...


Is there a website which collects/links to such anecdotes and surfaces interesting ones? I would love to read more such stories.


Tandem computers did not run software in lockstep to achieve fault tolerance. They were shared-nothing parallel clusters before shared-nothing parallel clusters were cool, with message-based communication between processes independent of the node each was running on. This gave near-linear scalability as the number of nodes grew. I worked on parallel sorting and parallel data loading back in the 1980s, scaling up to 256 nodes, and Tandem's NonStop SQL developed and commercialized similar parallel database query evaluation in the late 1980s, ahead of its time.
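
To make the shared-nothing idea concrete, here is a toy sketch in plain single-process Python (lists stand in for the messages that would flow between nodes): each "node" range-partitions the data, sorts only its own partition, and the globally sorted result is just the partitions concatenated in range order. This is only an illustration of the technique, not Tandem's code.

  import random

  def range_partition(records, n_nodes):
      # "Send" each record to the node that owns its key range.
      lo, hi = min(records), max(records)
      width = (hi - lo) / n_nodes or 1
      partitions = [[] for _ in range(n_nodes)]
      for r in records:
          idx = min(int((r - lo) / width), n_nodes - 1)
          partitions[idx].append(r)
      return partitions

  def parallel_sort(records, n_nodes=4):
      partitions = range_partition(records, n_nodes)
      sorted_parts = [sorted(p) for p in partitions]     # each node sorts locally, no shared memory
      return [x for part in sorted_parts for x in part]  # range partitions concatenate into global order

  data = [random.randint(0, 1000) for _ in range(100)]
  assert parallel_sort(data) == sorted(data)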


Also I think the processors ran different, independently developed software, the idea being that it would be unlikely for independent codebases to have the same bugs.


The four main CPUs ran the same Primary Flight Software. There was a separate backup computer running the Backup Flight Software, which was developed by an independent team, to take over if the PFS failed.

There's a bit of debate as to what would have happened if it ever took over.


I expect they did know, but it'd be an interesting exercise not to tell either development team whether they were writing the "real" or the "backup" software, and just swap them around to keep people on their toes.


So....

Years ago, I had a customer that had a Tandem, which was quite exciting to me as I had yet to encounter one up to that point in my career.

Imagine my letdown when I eventually discovered that the Tandem was used for FTP. ¯\_(ツ)_/¯


Five-way redundancy looks good on paper, but if you use the same vendor and hardware purchased at the same time, you have a single point of failure: when one server fails, there's a high probability that the others will fail simultaneously.


It depends on what you are trying to protect and what options you have if one component fails (replacement).

For highly critical systems that is an issue, but it is being addressed. Take as an example the space shuttle or the Airbus (3/2) primary/secondary flight computers [1], where the backup is built by different companies with different processors ...

[1] https://ifs.host.cs.st-andrews.ac.uk/Resources/CaseStudies/A...


By running statistics on a large enough infrastructure you can see how frequently 2 hard drives in the same RAID fail within days, or even hours!

Same brand, same batch, same operating time, same running temperature; even the same vibration does the trick. :(


We had 3 hard drives (the legendary IBM DeathStars) from the same production batch fail within hours of each other. Part of me was like "Yeah! Awesome quality control!" and a larger part was like "Oh shit." We ended up losing the array and had to restore from backup.


Why?


If the hardware failure is due to a manufacturing problem, then it affects every machine. If it's a problem with the production batch, then again, unless you bought from different batches, you haven't reduced risk much.

And if it's a software problem, then again the redundancy isn't helping you.

Basically, you haven't succeeded in getting 5 independent tries at the problem -- the failures are highly correlated.

It's the same reasoning that led to the financial crisis -- you have a CDO or an MBS that has a bunch of different obligations so you think you've diversified away risk, but in actuality there is one big factor that causes everything to fail at once.
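
A rough way to see that numerically (a sketch with made-up failure rates, not data from anyone's fleet): with truly independent servers, the chance that all five are down together is vanishingly small, but a single shared defect that hits the whole batch dominates everything.

  # Probability that all 5 replicas are down at once, with and without a shared failure mode.
  p_individual = 0.01      # assumed chance a given server is down at any moment (made up)
  p_common = 0.001         # assumed chance of a batch/firmware defect taking out all of them (made up)

  p_all_down_independent = p_individual ** 5
  p_all_down_correlated = p_common + (1 - p_common) * p_all_down_independent

  print(f"independent failures only: {p_all_down_independent:.1e}")  # 1.0e-10
  print(f"with common failure mode : {p_all_down_correlated:.1e}")   # ~1.0e-03, dominated by the common cause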


Also - same shipping route. If I buy four disks from the same shop, then those four disks came in the same delivery to the shop (so they were all dropped at the same time), were handled by the same picker, thrown into the same parcel, and dropped a couple of times in the same way.


Given the existence of things like Erlang/OTP, I doubt software is the big issue here.

Rather, the issue is that individual machines are cheap enough that you can have machine-level redundancy at a much lower cost than trying to engineer a single machine with redundant components.

The issue with machine-level redundancy is that they tend to run their own isolated operating systems (which is great from a fault-tolerance perspective, but is rather wasteful from a computational perspective). Operating systems like Plan 9 were billed to help bridge this gap and make clusters of machines communicating over 9P feel like a single unified whole, but they never seemed to catch on (besides maybe the concept of a Beowulf cluster).


They aren't; as VLM points out, they just aren't something you see in large quantities. For some applications you can achieve 100% uptime with networks and clusters; for others you still use doubly or triply redundant processor networks, and various manufacturers have specialty chips for those markets (generally Health, Life, Safety (HLS) type systems).

Sometimes people developed redundant but not non-stop systems. I talked with a VP at Citibank when working at NetApp, and they had a number of systems which ran on schedules of alternate days, so one would process transaction records for a while, then another would take over and repeat. They had three identical systems where one was essentially a hot standby for the other two, and new versions of code would be deployed on one, which would run the same transaction records, and they would check for the same output, so they could do a 'walking' upgrade of software. Back when Tandems and big Sun iron ruled the roost, those machines were too expensive to have an extra one that was essentially a spare. These days, however, it's much more economical to do that.


Modern software hasn't had the capability until recently. These Stratus VOS systems get updates by direct assembly patching over the running kernel. Linux ksplice and kpatch are relatively new, and AFAIK only Oracle supports their use in production, on their own UEK Linux kernel.

Source: I work with several ex-Stratus employees who have told me some of their war stories.


Red Hat too, since 7.2 on x64 (with some support requirements). No experience with it though.


For my part (sample size of one means nothing of course)...

The Stratus server I worked with in around 2002 was, I believe, one of the first to move towards standard x86 configurations and a Windows OS.

They did so because, no matter how reliable your server, executives were jumping on the Windows-as-an-application-server bandwagon and developers were targeting it.

There were several driver and firmware bugs that would BSOD the whole server, and bring the entire fault tolerant OS down.

Moreover, storage was a hack job. They couldn't make hardware RAID cards work in that configuration, and Windows software RAID was a joke, so you were stuck with Veritas storage software. Twice we applied Windows Service Packs only to find it would BSOD due to a Veritas software incompatibility. If it was a more average server I'd apply said update to a test server first, but no one could afford a test Stratus so you were always testing in production, which doesn't help reliability.

Here's a picture of it. It became my home server for a few years, but frankly it was less reliable than a desktop.

http://imgur.com/a/7010z


One concern is that there is a limit to how reliable a single box can really be, no matter what sort of engineering you throw at it. Consider http://thedailywtf.com/articles/Designed-For-Reliability - because the box was designed to never ever go down, when it did go down, not only was there no failover, but it took twenty-four hours to reboot.


I would think it's largely because the cost difference between a normal server and something like a Stratus grew significantly larger over time.

You could more easily justify the cost when the upcharge wasn't so huge.

And, of course, we all got better at reliable distributed systems with commodity servers. So aside from the cost gap, the uptime gap was closing as well.


I'd tune that a bit and guess that the price of non-Stratus hardware fell through the floor.

In the mid-90s, data centers full of commodity hardware would have been cost prohibitive.


This issue has been fixed for good. If you need uptime, then virtualization is good enough. You can migrate VMs to other hosts at a moment's notice. As long as the failure isn't sudden, you probably have enough time to migrate the VM elsewhere. Good servers, workstations, and UPSes have enough monitoring to warn you.


"every system at the center of a big enterprise"...that's the thing. A lot of enterprises actually don't need this kind of uptime or can get by with other types of redundancy.

Your comment about software, though, is a good point. The last relatively common OS that has that kind of uptime is OpenVMS.


They're not cost effective.

It's possible to get close enough with off the shelf hardware and good system design, so nobody wants to pay 5-10x more for special hardware.


Oh yeah, Mainframes are still built this way.


> Is the issue that modern software just isn't reliable enough to make hardware failures an important part of the downtime?

I think the issue is more about networks never being reliable enough for such reliable computers to ever make much sense. A data center simply cannot give you five nines availability over the internet, so it doesn't matter how reliable everything inside is, you would still need geo redundancy and all that fault-tolerance in a distributed system.
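
A quick sketch of why the weakest link caps the whole path (made-up availability figures): the availabilities of components in series multiply, so a very reliable box behind a less reliable network path buys you little end to end.

  # Availability of components in series is (roughly) the product of their availabilities,
  # so the least-available link caps the whole path.
  def end_to_end(*availabilities):
      result = 1.0
      for a in availabilities:
          result *= a
      return result

  box = 0.999999       # assumed six-nines fault-tolerant server
  network = 0.999      # assumed three-nines network path to the user
  print(f"{end_to_end(box, network):.6f}")  # ~0.998999 -- the network dominates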


Reminds me of a day decades ago when I walked into a client site to do a server upgrade, and I noticed on their Novell Server, the uptime counter said it was close to 3 years of continuous uptime. I was flabbergasted. It was just an ordinary little small business with about 10 employees, and it was an old IBM server on a $100 UPS.

I almost couldn't bring myself to shut it down to do the upgrade. I mean, I saw servers with 100 to 300 days uptime routinely, but not one that had never been interrupted for over 1000 days....


Why not? I just looked at a dedicated server I have at a Codero colo site, and it says "System status: Database up for 516.23 days."


I guess some context might be in order - this server was back in the early '90s, in an outback town that is known for massive thunderstorms and power cuts that usually play havoc with computer systems. Being a machine literally in a cupboard in the back of an office, and not in a controlled data center, it was an achievement to have run for 3 years uninterrupted.


I'd be more interested in the computer system with the longest uptime. I've had Debian systems that have achieved over 1 year uptimes. Could have gone longer but kernel upgrades took precedence over uptime.


OpenVMS or AS/400. It's very common for admins to report systems running for 5+ years without crashes while doing real work. People occasionally forget where they're at and have to go looking for them. My company has an AS/400 that employees never saw go down or get maintenance in over 10 years. It's in use 24/7. They might have some fail-over setup, or upgrade when nobody is using it due to traffic patterns. We've just never seen it fail. Their other benefit is they're very hands-off, being nearly self-administering.


Or that AS/400 calls the tech to fix it and s/he just does it while no one is around. Had a call once from IBM asking for directions since the AS/400 RAID cache battery needed replacing. No danger of losing data, the AS/400 was just running slower because it was bypassing the cache.


So... a tech would just show up at your office, ready to fix the machine? On the one hand, that is some fantastic service when you're running mission critical software. On the other hand, that must cost a whole lot of money and is quite devious of IBM.


Yeah, that type of service and phone-home is also common with $$$ enterprise storage (SANs). Funny story: in one instance I know of first-hand, the people running a NetApp filer setup with a redundant pair of systems had chosen to disable AutoSupport (NetApp's version of the phone-home support system where a tech would just show up with replacement disks when some failed) and take on support themselves. They did not test the alerting well enough after some network changes, which left the poor SAN unable to get distress signals through.

One day, the shared mount on a ton of systems around the company just stopped working. The master node (these have internal redundancy at the network, controller, shelf, and disk level) had failed over to the redundant node, which after time had so many disks fail without being replaced that it went down hard. They had to work with NetApp support (at great expense) to get it back up, which took a few days. The system had been attempting to warn the admins as best it could of the impending doom for the better part of a year, and no one noticed.

The moral of this story is that you can have big-dollar redundant systems, but if you don't also have some dollars going to the care and feeding of them, it's wasted. The other lesson here is that such systems don't exist in isolation and are only as good as their weakest link, which in this case was a now-unreachable SMTP server. Lastly, although the LCD displays were trying their best to also signal this fault condition, people who would walk by these physical machines were used to this, so they didn't bother to actually read the errors. Do that at the peril of anything dependent on that system.


"The Fail-Safe Theorem: When a Fail-Safe system fails, it fails by failing to fail safe."

--https://en.wikipedia.org/wiki/Systemantics#System_failure


I love that book. Have you read https://www.amazon.com/Introduction-General-Systems-Thinking...? It's similar, but a bit more rigorous.


> The system had been attempting to warn the admins as best it could of the impending doom for the better part of a year, and no one noticed.

This is the area where the inverted Unix philosophy applies: give me good news as well. A weekly mail saying that the backups (or the filer or RAID or whatever) are A-OK. If the mail stops coming, something broke.
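
A minimal sketch of that "good news" pattern (hypothetical file path and threshold, nothing from the thread): the backup job records a timestamp when it succeeds, and an independent check alarms when the good news stops arriving.

  import os, time

  HEARTBEAT_FILE = "/var/tmp/backup.ok"   # hypothetical path touched by the backup job on success
  MAX_AGE_SECONDS = 8 * 24 * 3600         # alarm if no good news for more than about a week

  def report_success():
      # Called by the backup (or filer/RAID check) after it verifies a good run.
      with open(HEARTBEAT_FILE, "w") as f:
          f.write(str(time.time()))

  def check():
      # Run this from somewhere independent; silence from the job is itself the alarm.
      if not os.path.exists(HEARTBEAT_FILE):
          return "ALARM: no successful run ever reported"
      age = time.time() - os.path.getmtime(HEARTBEAT_FILE)
      return "OK" if age < MAX_AGE_SECONDS else f"ALARM: last good news {age / 86400:.1f} days ago"

  if __name__ == "__main__":
      print(check())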


> The moral of this story is that you can have big-dollar redundant systems, but if you don't also have some dollars going to the care and feeding of them, it's wasted.

Agreed. Another moral: Don't believe anything works until you see it with your own eyes.


Had a friend unplug the UPS to test it while I was still configuring the new machine. "Can't really be sure unless you test it"


Especially true for backups. Make sure those aren't write-only!


Note to self: if I need to socially engineer my way into the equipment room at a large company, pretend to be an IBM tech answering a phone-home...


Good luck with that, as we do check these things before we allow folks to get to the servers. IBM has been doing this for a fair bit and we aren't exactly blind idiots.

It would be interesting where I am, since you'd skip local crime and go straight to a federal offense. That kept me away from booze in high school, since it's not minor-in-possession, it's federal bootlegging. I can only imagine what trying to enter a server room nets a person.


We just decommissioned a pair of EMC CXes. These things would phone home to EMC (and then eventually a third-party maintainer) if something was about to go bad or had failed. We'd usually have an engineer onsite at the DC within four hours with all the right parts to carry out a fix.

By the time it was being looked after by the third party maintainer they'd just leave us a few disks and a couple of fibre channel controllers onsite and we'd get the DC remote hands to swap them out. The faulty parts were then boxed up and couriered back to the maintenance company all on their dime.

There was not an awful lot of involvement on our end except for authorising access to the DC when an engineer needed to attend.

None of this was massively expensive (and this was 24x7x365 cover) in the great scheme of things given their mission critical status.


> that must cost a whole lot of money and is quite devious of IBM.

You have essentially described IBM's business model. You pay more, but if something goes wrong someone shows up.


It actually isn't that bad on the budget. We don't have the same-day service (I think it's a 24 - 48 hour response time). It sure beats the heck out of waiting for the disaster, as the thing is pretty proactive. Much, much cheaper than service and support from Oracle, and Oracle doesn't visit or monitor the machine.


> So... a tech would just show up at your office, ready to fix the machine?

Yes. They are expensive and deviously good at it. Or, at least they used to be many years ago.

Many years back I worked in a tiny town (1000 people?) four hours from the closest depot and we even got same day service. This cost a king's ransom. But the AS/400 ran most of the business...


It's possible. I do want to try to get the uptime info from the people that run it if I ever see them. We regularly get techs come in to improve or fix IT systems. Just not that one.


I have had multiple years of uptime until I needed to do an OS upgrade for some new part of the financial system.


That's what the vast majority of my feedback has said: it stays up until an OS upgrade. Whereas my research on the VMS side indicated they do clusters that support rolling upgrades of the individual machines so as not to affect the application. There might be a way around that remaining limitation of AS/400 uptime if they have software for that. Do they? Or are customers forced to upgrade to a mainframe with Sysplex?


OpenVMS has an amazing cluster system, and yes they did upgrade one machine at a time in the cluster and lose no overall uptime.

I think the bigger IBM z machines do allow zero-time upgrades, but the iSeries (new name for AS/400) is basically a reboot the machine affair. I've never heard of Sysplex in regards to the iSeries.

They sent me two DVDs with 12 pages of instructions. The unnumbered first page is release notes and a notice that physical media deliveries would be an average of 5 to 9 business days on Jan 31, 2007 and beyond. (odd notice for a 2016 document)

Page 1 has my customer information with all sorts of tracking numbers. It also includes the number of selected files (1,360), Kilo-bytes (sp) of data (3,963,529), fixes (3,394 fixes with 815 Kilo-bytes of data were superceded), etc.

Pre-install instructions finally show up on page 3 and the big warning about f'n it up shows on page 4. Page 6 brings us to the actual install which is written to be typed exactly with exact detailed results of each step.

To give an example:

  2. Load the image catalog into the virtual optical 
     device using the following command:

  LODIMGCLG IMGCLG(ptfcatalog) DEV(OPTVRTxx) OPTION(*LOAD)

  3. Type GO PTF and press the Enter key

  4. Take menu option 8 and press Enter key
Took about two hours counting the 45-minute reboot. We have a very old and slow machine (we need to buy a new one in the next two years; we are at a decade of use now), so reboots are a bit slow.


Do AS/400 and OpenVMS systems support live kernel patching? Can you patch the entire OS (even the lowest layers) without any reboot? If they don't, then you can't have zero downtime for years unless patches aren't being applied, which is rather bad for security.


With VMS you can hot add a machine to the cluster, migrate all running processes to it, then shut down the first node without any interruption. People kept systems running while moving buildings like this. And this was 20+ years ago.


Also supports cluster nodes with different CPU architectures. It was/is really slick.


That's something people keep forgetting that should be in any "Required Features for Good Clustering Software" document. That they could move the software from VAX to Alpha to Itanium within the same cluster was pretty amazing. It's also a requirement for high availability if a processor goes EOL. The other strategy is to bet on the market leader. Something they missed. ;)


Yeah, the new owner is finally porting OpenVMS to x64. Their roadmap shows that this is going to be a long road, though. I can't remember their name...the company that bought VMS from HP. HP is licensing it from them for their own products.


It's this company:

https://www.vmssoftware.com/about_faq.html#faq1

They didn't buy it. HP is so greedy about the lucrative revenue (and maybe patents) they get off VMS that it made me wonder why they killed it in the first place. Instead, it appears (not certain) that VMS Software Inc gets some OEM-like license that puts them in control of it while HP still owns it, can license it themselves (probably to existing customers), and likely gets a portion of VMS Software Inc's licensing revenue. So, they've pushed off almost all responsibility for it onto another company without actually selling it or losing all the revenue.

Still a good development for VMS customers. Both legacy and people who will suffer through archaic stuff to reap benefits of bulletproof clusters. I'd be happy to. :) Only problem is it's basically got no security attention over the years with plenty of zero days or configuration issues lurking in there. If I deployed in a company, I plan to put network guards in front of it to absorb attacks while converting the traffic into something easy to process. Ideally, a PCI-based guard that forces it to only talk to the application's memory instead of rest of system. Plus safe-coded apps (Rust, Ada) and API wrappers to try to spot BS. That VMS supports cross-language development will probably help.

EDIT: The weird thing is they finished a port to Alpha ISA before they got to x86 ISA for migrations. I thought Alpha's already migrated to Itanium. Didn't see that coming lol.


Here's a guide on the main techniques it used for high-availability:

http://h41379.www4.hpe.com/openvms/whitepapers/high_avail.ht...


There remains the subtle distinction between downtime and unplanned downtime...


Most people recommend an IPL of an iSeries (AS/400) at least every 1-3 months. Did they just do that after hours?



I had a Win2K AS file server that would go years at a time after MS stopped pushing security updates. I switched to a Linux micro server a while ago and currently use a Mac mini. The Mac mini honestly has had the worst uptimes of the three.


I was asked to look at something on a Debian box the other day and logged in to discover over 2 years had passed since a reboot. A bit dangerous, but also wonderful to think about all that data flowing through without issue.


I love the idea of that, but it also means no security patches for that same period of time. I'd be wary of placing that host in any network with publicly-routable hosts.


No security patches to the kernel. You can certainly upgrade user-space items (daemons and the like...)


I heard you could hot patch the kernel as well somehow.


You can use stuff like https://github.com/dynup/kpatch or http://www.ksplice.com/ to apply security patches to a running Linux kernel, but that's still pretty new stuff.

Stratus has been doing that since the 90s, though :-)


I installed a dedicated webserver back in 2006, when VMs and VPSes were less common. In 2007 it was upgraded from Debian Woody to Debian Etch, and it kept doing its job until it was decommissioned with nearly 8 years of uptime. I find it quite remarkable that there were no datacenter power outages or any hardware problems (the machine had a single HDD) during that period. And apparently the software (a Drupal website) didn't have any obvious vulnerabilities either.


At my first job, we had a Sun box that was booted once and never rebooted until being decommissioned around 5 years later.


I had 275 days of up time on a Windows 7 box when I had updates disabled. Definitely not breaking any records, but it was a stable box.

It was very common to have many months of up time, despite a lot going on (virtual machines, games, etc.).


The first stock exchange in Russia, called RTS, was originally built on Stratus computers. The first launch of trading was done on an Intel i960-based Stratus XA/R in 1995; then, in late 1996, we migrated to two PA-RISC-based Stratus Continuums, if memory serves me well. Both models provided quadruple redundancy, and for the whole trading history we had no interruption related to hardware problems. Only bugs in our software. ))

The Continuums were working till 2003, when they were replaced with x86-based Stratus servers, which were double or triple redundant, depending on the model. They are still running at the RTS Exchange, which is now part of Moscow Exchange (formerly MICEX).

The only problem with the x86 models, at least the models produced back in the mid-'00s, was that the backplane with the synchronization clock was not redundant, and if the clock failed the whole server went down.


Note: clocked-down (from 33MHz) Intel RISC CPUs. It's reliable, sure, but at what cost per MIPS? Power, maintenance, finding spares, opportunity cost from adding new functionality, etc.

Also, how does this compare to the oldest running satellites?


Sounds like a modern day Ship of Theseus (https://en.wikipedia.org/wiki/Ship_of_Theseus). I wonder how many parts have been untouched since they were installed on Day 1.



The key difference being that this is more as if the Ship of Theseus went through the replacements all the while remaining out at sea.


Trigger's Broom was in active use.



I interned at a company that had given continuous service with a Tandem/NonStop system since the '70s (the '80s at the latest), IIRC.

It's very cool to see such reliable systems in practice.


Stratus are still going (I'm assuming it's the same company, same product, same name) and making fault tolerant hardware like this.

They also offer a lower end product based on two standard servers where all the redundancy is software based with virtual machines.

It's clever kit and looks after a lot of critical infrastructure where downtime isn't an option.


It is, although I believe all the FT is now done in software in some fashion; i.e., the CPU/memory/chipset/IO, etc. are standard components with the lockstep done through a software layer. The same is true of HP Integrity NonStop, AFAIK. I used to follow these when I was an industry analyst, but I've been away from this space for a few years.


They still do both, here's their hardware offering which is hardware and software based: http://www.stratus.com/solutions/platforms/ftserver/

And this is their software-based solution, which runs on commodity hardware: http://www.stratus.com/solutions/software/everrun-enterprise...

I only work with the latter but have heard the hardware offering is bulletproof.


"The use of the cloud for server farms made of hundreds, thousands, and often more computers that are transparent to the user has achieved much the same goal, providing one’s connection to the cloud is also redundant. "

This isn't true at all. The cloud offerings are more like traditional high availability, such as fail-over clusters. They might compete with VMS clusters by now with the right components. Maybe they even exceed them, except for longevity. Systems like NonStop and Stratus are fault-tolerant systems meant to get five 9s of availability with [hopefully] imperceptible moments of failure that [hopefully] recover immediately. I wouldn't trust a cloud platform for that. The latency alone would probably prevent the solution from matching a local NonStop cluster.


I saw a demo of these systems back in the mid-'90s when I was working in the reinsurance industry. It was impressive: fault-tolerance stories of customers first knowing something needed to be replaced when a spare part turned up, which the customer would fit themselves. All apart from the hard drives, which needed an engineer due to the health and safety aspect of their weight. There were full diagnostics: with any issue, a modem dialed up to Stratus to report the fault and get a replacement part shipped out. It was as if they had thought of everything, until I asked how they handled heat if the aircon failed. It turned out that back then the early systems did not have any temperature monitoring, which somewhat stymied the flow of good stories.

But it was impressive kit: not cheap, but at the time one of the best choices outside the AS/400/IBM route, and indeed Lloyds did purchase a few of these systems.

More so given they lasted this long in use. Even today I've had best-in-breed systems have faults that were not identified as soon as possible; I recall some fancy IBM kit having a dying PSU where you could faintly smell the electrical burn without any diagnostics flagging up any problem, only for that server to fail a few days later.

But today we tend not to have one large, all-singing-and-dancing stable system; it is often cheaper to have redundancy through load balancing, with the ability to pull a server out of a workload cluster. With this, along with assured-messaging systems (MQ etc.), you negate so many of the hardware faults that could, and in the past did, put a halt to work.


This reminds me of a switch I heard about at UMaine. It was a 48-port Cisco something-or-other, installed and configured to basically be a dumb switch in a telecom closet in some building around 2004. That building happened to have some things in it that needed to be on generator backup, so it never lost power. 11 years later my friend went to go upgrade it.


mods, may want to merge this with [1] from yesterday.

[1] https://news.ycombinator.com/item?id=13507972


Curious that they have 2 pairs rather than a triplet.


One simple hardware mental model is running two processors in parallel with XORs on all pins and wired-OR outputs hooked up to the error-shutdown line. In practice that doesn't really work because irrelevant differences in PCB trace length and clock distribution mean both processors run just a little bit outta sync (jitter), so the XOR will be firing fairly constantly for a fraction of each clock cycle. You could play games with gating and sampling only when well settled....

A somewhat more realistic way to build it outta real hardware involves each processor running every 1 outta X clock cycles. Reading between the lines of the story WRT "down clocked processors" I suspect this is what it's doing. This also detects weird power spike problems where lightning miles away means whatever processor is running at that instant is going to see 0xFFFF, whatever opcode that happens to be, but a couple ns later the spike is gone and the other processors are back to normal. Or RFI/EMC, etc. This is all very nice other than a significant hit to performance if you run a large number of processors. Then again, if your primary figure of merit for your system is extreme reliability and not raw MIPS...

It would be interesting to hear if they run dual-port RAM to get around the jitter XOR problem and have some alternative way to detect synchronous behavior. Dual-port RAM is weird but COTS. Triple-port RAM is not as COTS.


Comparing the bus states of two CPUs running off the same clock is perfectly feasible, and there are no problems with glitches as long as the whole mechanism is synchronous (and the CPU is completely deterministic, which it should be, but then there are things like RNG-based cache eviction policies). In fact, both the original Pentium and some m68k processors have all the required hardware for this included, and building an error-checking pair of them involves literally connecting all the bus pins of two CPUs in parallel.
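
As a toy illustration of that synchronous-compare idea (pure software simulation, no real bus signals): sample both "CPUs" once per clock and assert the error line on the first mismatch.

  def lockstep_run(cpu_a, cpu_b, cycles):
      # cpu_a, cpu_b: deterministic functions of the cycle number returning a bus state (an int).
      # Sample both once per clock and trip the error line on the first mismatch.
      for cycle in range(cycles):
          a, b = cpu_a(cycle), cpu_b(cycle)
          if a != b:                            # i.e. the XOR of the bus states is non-zero
              return f"error line asserted at cycle {cycle}: {a:#06x} != {b:#06x}"
      return "ran in lockstep, no divergence"

  good = lambda c: (c * 7 + 3) & 0xFFFF                 # stand-in for a deterministic program
  flaky = lambda c: good(c) ^ (0x10 if c == 5 else 0)   # single bit flip at cycle 5

  print(lockstep_run(good, good, 10))    # ran in lockstep, no divergence
  print(lockstep_run(good, flaky, 10))   # error line asserted at cycle 5: ...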


I read the architectural details of NonStop and Stratus. Both were designed to compare I/O outputs instead of just catching everything at the CPU level. The CPU-level redundancy is more for catching transient faults with fail-fast logic. The idea is that a problem shows up in an obvious way by the time it hits the I/O buses.

Past that I don't recall much.


Triples can have their own fun failure modes. Suppose a surge or the like fries two chips at once. It's rather rare, but it happens from time to time. The instructions come out and now you have three chips with different results. Which ones are bad and which is good? More importantly, how do you recover without taking the entire system down? Redundancy this way means you have the best chance of catching problems, and also of recovering from those problems when things do go cactus.


As the old saying goes, "when you go to sea, take one clock or three".


Quantum Link, the online service for the Commodore 64, ran on this system. The first GUI online multiplayer world was on Q-Link. Q-Link became AOL.

https://en.m.wikipedia.org/wiki/Quantum_Link


> CPU check logic checks the results from each, and if there is a discrepancy, if one CPU comes up with a different result than the other, the system immediately disables that pair and uses the remaining pair.

How does this work? How does the system know which pair is the faulty pair?


More info than you probably wanted to know: https://www.google.co.in/patents/US5263034
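
The linked patent has the real details; as a hedged sketch of the general pair-and-spare idea (not Stratus's actual logic), each pair checks itself by comparing its two CPUs, so the faulty pair is simply the one whose members disagree, and the surviving pair carries on:

  def pair_and_spare(pair1, pair2):
      # Each pair is a tuple of two results that should be identical.
      # A pair whose members disagree declares itself faulty and drops out.
      p1_ok = pair1[0] == pair1[1]
      p2_ok = pair2[0] == pair2[1]
      if p1_ok and p2_ok:
          return pair1[0], "both pairs healthy"
      if p1_ok:
          return pair1[0], "pair 2 disabled (internal mismatch)"
      if p2_ok:
          return pair2[0], "pair 1 disabled (internal mismatch)"
      return None, "both pairs faulted -- stop rather than return a possibly wrong answer"

  print(pair_and_spare((42, 42), (42, 41)))   # (42, 'pair 2 disabled (internal mismatch)')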


I wonder how it compares to a recent ARM-based Arduino in terms of computational power.



