Computer Crash Wipes Out Years of Air Force Investigation Records (govexec.com)
169 points by joering2 on June 14, 2016 | 93 comments



There have been too many cases recently of government agencies losing data due to "crashes" or "inadvertent" deletions: for example the IRS, the State Department, and now the Air Force.

A really easy way to fix this is to make a law so that if government data such as this gets lost, then the cabinet level person responsible for overseeing the agency immediately loses their cabinet level position and is barred from further work with the federal government.

This would do wonders to align incentives all through the federal agencies when it comes to data safeguards.


This idea is too simplistic, although I used to support it. I've changed my position from yours to a 'retrain and demote' position. Since people make mistakes, I think being 'barred from further work' is counterproductive, and the government would lose many able and competent workers. Heck, this idea could also be exploited by lower-level workers who don't like their higher-up bosses; i.e., a lower-level worker intentionally screwing up to cost the higher-up boss their job, because the 'new law' states the higher-up is now 'barred' from further government work. An easy way to take out the competition on the way up the ladder.

P.S. I think legislating accountability is an issue the public needs to debate, in all countries. The US has seen numerous cases of gross negligence and error (heck, one is too many) at various levels of government (federal, state, local). I hope we can come up with a better solution than 'fire them and bar them', or even mine of 'demote and retrain'. There has to be a better way to legislate government accountability.


Indeed, such ideas would create an atmosphere of fear and increase cover-ups, rather than solve problems. There needs to be a solution that acknowledges the inevitability of mistakes and still deals with the issue.


Exactly this. How many of us would be up in arms if developers were fired for making mistakes? You just need to look at all the stories around rm -rf...


Maybe that's exactly what should happen? I mean, look at the whole "security" antics where companies lose tens of millions of logins/credit card numbers and it all ends with a "South Park BP apology". People fuck up epically and get a slap on the wrist plus a promotion. I mean, come on.


So what if tomorrow your bank says it has lost all records due to a programming/tech error? Who is responsible there? Why should such sensitive data (Air Force etc.) be treated differently than cash/finance data?


If you have backups, a crash won't lose months of data. I'm OK with losing a so-called "competent worker" if he/she hasn't been doing something as simple as off-site backups.

It's local data corruption here. At worst you lose the data since the last backup.

Now, in the army, they have redundancy procedures for everything, and you want to make us believe that the one server used to keep them in check not only crashed beyond repair, but also has no backup?


You'd be surprised how tenuous backup situations are at most companies and organizations. Even if your company is doing full disaster recovery checks 24/7, one after the other, a "well-placed" failure or mistake, or a small series of them, can lead to data loss.

Many of the things that cause data loss are just simple mistakes caused by a failure to review changes carefully, even when you have multiple levels of review; I would dare to say especially when you have high confidence that someone else is reviewing your changes.

"Manual" data changes, e.g. executing SQL statements or scripts that execute SQL that aren't a part of your application, in my experience are the most common cause of data loss.

After manual data changes, the second most common cause in my experience is not understanding what you are doing. For example, you might take a chance on an upgrade because you have to meet a deadline, and it fails.

Changes to application code are next. Typically a little more thought is put into an application change than into a one-off data migration, but if you are under time pressure, don't know what you are doing, or are assuming someone else will catch your mistakes, you can easily screw everything up.

Following this: mistakes that cause hardware or software failure. I worked at one large organization where storage arrays with various power backups were just "turned off" by a contractor who didn't understand the impact of what he or she was doing.

Finally, you might have configuration issues or the hardware might just fail.

Really, there is no substitute for backing up your data frequently, in a way you know how to restore easily and have tested, by building another machine from the ground up to replace the original, and documenting and practicing that well. Very few do this frequently. And even if you do, what if all of your hardware were destroyed? Could you go out, buy something off the shelf, and rebuild everything from instructions in your head or stored safely around the world?
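To make that concrete, here's a minimal sketch of the kind of backup-and-verify loop I mean. The paths are invented and a real setup would use proper backup tooling rather than a one-off Python script; the point is the restore check that almost everyone skips.

  # Archive a directory to a second location, then prove the archive
  # actually restores and matches the original. Paths are hypothetical.
  import filecmp, hashlib, tarfile, tempfile
  from pathlib import Path

  SOURCE = Path("/srv/records")     # hypothetical data directory
  OFFSITE = Path("/mnt/offsite")    # hypothetical second location

  def sha256(path: Path) -> str:
      h = hashlib.sha256()
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(1 << 20), b""):
              h.update(chunk)
      return h.hexdigest()

  def backup_and_verify() -> None:
      OFFSITE.mkdir(parents=True, exist_ok=True)
      archive = OFFSITE / "records.tar.gz"
      with tarfile.open(archive, "w:gz") as tar:
          tar.add(str(SOURCE), arcname=SOURCE.name)

      # Restore to scratch space and compare (top-level only, for brevity).
      with tempfile.TemporaryDirectory() as scratch:
          with tarfile.open(archive, "r:gz") as tar:
              tar.extractall(scratch)
          restored = Path(scratch) / SOURCE.name
          diff = filecmp.dircmp(SOURCE, restored)
          if diff.left_only or diff.right_only or diff.diff_files:
              raise RuntimeError("restore check failed: backup is not usable")
      print("backup written and verified:", sha256(archive))

  if __name__ == "__main__":
      backup_and_verify()

The "rebuild a machine from the ground up" part can't be scripted away, but even this much catches the "backup job has been silently failing for a year" case.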


Please note the severity of this. It's not "We had a crash and lost a week's worth of data" - yeah, that happens. And it's totally acceptable if you're unable to restore it, for any of the reasons you mention.

This is "We had a crash and lost over a decade's worth of data" - yeah, that shouldn't happen. Ever. If you don't have a working backup from sometime in the past 13 years, -you are doing something wrong-. That's not a small series of failures or mistakes. That's either staggering levels of incompetence, or maliciousness.


This is exactly right. A misconfigured backup script, or a pessimal chain of crashes, or a lot of other things could lose recent data. I wouldn't be shocked by data loss.

This is a decade of data. It should have been in cold storage, on tested media, in multiple locations. No single error, or small chain of errors, should have enabled this to happen.


We are not talking about "most companies". We are talking about the Air Force, an organisation that is built to be resilient and handle crises. Backups, fallbacks, redundancy, checks, reviews, access management and risk evaluation are BAKED INTO its culture and mission.

And we are not talking about "any data". This was clearly very sensitive data they knew they needed to protect.

Either the Air Force is failing at being the very thing it was created to be (which I doubt) or something is fishy (Occam's razor).


Yeah, forgive me if I'm too cynical, but Hanlon's razor fails when this sort of thing involves criminal investigations of government employees.

Edit: it also inspired a variation on an old joke about how the same word means different things to different people. For example, suppose the order goes out to "secure the data center."

Marines report back, "We have destroyed the data center."

Army reports, "We have killed everyone in the data center and are holding the position."

Navy: "We locked the doors when we left for the day."

Air Force: "We signed a three-year contract with an outsourcing company, with an option to extend at the same price for ten more years."

Which of course means that just because I don't think Hanlon's razor applies doesn't mean I think they're competent.


> Yeah, forgive me if I'm too cynical, but Hanlon's razor fails when this sort of thing involves criminal investigations of government employees.

I've had this thought before. Can you think of a general, defensible clause to add to Hanlon's razor to account for this?


"There is no point in trying to distinguish incompetence from malice when it directly benefits the party committing the error."

You need to treat it all as malice.


This seems like a very reasonable standard when the power of the state is being wielded.


It applies at all times. Given that it's very easy to disguise your behavior as incompetence, it doesn't make sense to give the benefit of the doubt to people who do something that gives them a huge payout. Trying to forgive the incompetent people while punishing the malicious ones just brings you a huge epidemic of mysteriously incompetent people.


But what if it's not about punishment, but prevention?


The usual formulation I hear is:

"Never attribute to malice that which can be adequately explained by stupidity. But don't rule out malice."


"Never attribute to malice that which is adequately explained by carelessness."

Unless that malice is aimed at keeping an individual out of trouble rather than harming others.


It only applies to physical persons, not to groups.


Imagine what would happen in the industry if every time someone screwed up like that they got fired. Just like that, without any option to explain what happened, how they tried everything they could to stop it, and so on.

Ah, I know it's the way it is in some organisations, sure. But is it a good way to handle people screwing up? People always screw up. If you start firing them for screwing up you end up with a pool of inexperienced workers who keep screwing up in the same way. If you keep them around, they learn and never screw up in the same way again.


There is screwing up and screwing up. One: you do all you could, act proactively, educate yourself, but forces beyond your control prevent you from reaching the optimal state and a massive clusterfk happens. I would keep this one.

Then you have the ignorant, who in 2016 will not even think that a mission-critical system should have SOME backup, off-site; people who will put an amazing amount of energy into little political games and backstabbing; or the just plain old incompetent, lazy ones. Those should become exemplary cases.

An investigation should sort out which is which, but most cases will fall clearly on one side of the spectrum or the other. Usually it's really not that hard.


"Neither the Air Force nor Lockheed Martin, the defense firm that runs the database, could say why it became corrupted or whether they’ll be able to recover the information."

The incumbent trusts the local IT manager, and the local IT manager has a contract with an external company.

The question becomes: "Did the contract with Lockheed Martin cover any form of regular backup and additionally some form of long term open format archiving?"

Perhaps some checklists or standards that all long term record keeping systems need to comply with would be good.


It's super convenient, considering Congress is investigating their discipline processes after some incidents they've had.


It's always an attractive idea to eliminate incompetence through threats. It's been tried lots of times throughout history, and has never succeeded (in fact, it usually makes things worse).


I agree. These days I don't believe it when I hear something so opportune... oh, surprise surprise, we lost all our data on criminal investigations. How convenient for some people!


> A really easy way to fix this is to make a law so that if government data such as this gets lost, then the cabinet level person responsible for overseeing the agency immediately loses their cabinet level position and is barred from further work with the federal government.

I see a fantastic way to completely abuse this. Need to get someone who disagrees with you kicked out? Hire some corporate espionage people to take care of that work for you!


Losing their position is meaningless - they'll just go private and make even more money while being recklessly incompetent. They'd just hire a clean agent as their intermediary for government contracts.

No, I'd slap prison sentences on everyone involved, not just loss of government employment.


Making mistakes is an important part of learning and becoming experienced. I'm sure we've all done a bad "rm -f" from time to time, and each mistake makes you more cautious in the future.


All that would lead to is making any process change in the government impossible, the domination of defensive document workflows (a.k.a. CYA), and a thorough lack of leadership.


This seems like a great idea for incentivizing success. I'd support such legislation 100%.


Cannot agree more. Losing data must be considered the same level of crime as losing money or lives.


Sigh. When I was in university, I worked at a computer store one summer that provided IT services to local offices. One of the clients was the Mexican consulate. We were hired to do various things, including installing a tape backup system.

I was installing some service packs at about 5pm. Windows NT4. 5pm, everyone's gone home, safe, right? Server asks if I want to restart, I say yes. Then a guy walks into the server room, asks if I had restarted the server. I say yes. Whoops, OK, I guess someone was working.

Next day, the database was dead. Seemingly because I had restarted the server while someone was using it. It was a custom database job, and the contractor who had made it was on vacation in some other country. Here's the kicker. Usually tape backups have rolling tapes, right? But we hadn't started rolling tapes yet. So the good backup from 2 days prior was overwritten the night before with the crashed database. Ugh. And I was due to go back to school in 2 weeks in another city. Ouch. I still feel bad about that to this day. No idea how my boss ended up.

Never ever let someone who doesn't know what they're doing mess around with your mission-critical systems where the consequences are not recoverable. Always guide them. Although in my case, it seems I may have known more than my boss at the time, so maybe that wasn't possible for me....


The records were not wiped out by a computer crash. They were wiped out by gross incompetence.


The joys of IT. Back in the day, to wipe years of records you had to burn or shred tons of paper in multiple locations. Now you just crash one little system... Despite 20 years of Microsoft-induced crashes, clear best practices and yadda yadda, most organizations will still have a single backup system for any given data repository -- if they have one at all.

And the answer when shit happens is "outsource! Go to the cloud!" -- so that local managers won't be responsible when the snafu happens and data is lost (if it happens to Google, it can and will happen to anyone).

Progress, eh.


I would consider it progress.

The only data loss to Google that I was able to find was [0], where a single datacenter was hit by lightning 4 times in short succession, and 0.000001% of disk space in said datacenter was lost. And only because the lost data consisted of recent writes that had yet to be backed up. And only because the lightning affected the AUX power systems in a way that they have since committed to fixing (or perhaps have fixed).

IMO, cloud storage is the most reliable way to store data, and Google the most reliable provider. I never know if I'm going to drop my laptop and lose all my work, or if the hard drive in my server is going to give up and burst into flames. But the great thing about cloud storage is that it abstracts away hardware components and failures, so they're somebody else's problem. And they worry a lot more about data loss than I do. Cloud data storage is designed with the assumption that hardware is faulty. Spread redundant copies of data across geographically isolated datacenters, and you're much safer than some admin of averagejoe.com keeping his data on a single hard drive in his closet.

[0] - https://status.cloud.google.com/incident/compute/15056#57195...


The problem is not just DR, but anything that can result in data loss. There was the one where they'd delete stuff you didn't mean to [1], the one where they got stuff back by the scruff of the neck [2]... I can't be arsed to look further back, but I'm pretty sure shit happened earlier as well (and I'm just looking at Gmail). And of course there are tons of horror stories about accounts disabled or deleted for commercial or other unspecified reasons, because once you put your trust in another entity, that's it, they can do as they please. Their track record might be as good as anyone's, but it still has a few holes, and people should prepare for that rather than considering "the cloud" a magic bucket.

(disclaimer: my life would probably be ruined by losing the gmail account I've held since private-beta days, because life is short and I'm shit at sysadmin; but there's a big difference between a lone old geek and large organizations...)

[1] http://www.dailymail.co.uk/sciencetech/article-2548010/Has-G...

[2] http://www.datacenterknowledge.com/archives/2011/03/01/googl...


I don't get the problem with [2]? The tape worked exactly as intended.


The only thing between your data and /dev/null in a cloud data center is a vacation and an expired credit card.


Thank goodness storage is cheap. I've had a revolving door of credit/debit card numbers over the last few years thanks to identity theft, and the SaaS providers I use for personal stuff are really good about keeping my data around even after invalid credit-card auto-pay cycles fail.

For more serious data, I have to imagine that SLAs generally define data retention periods even after exiting the agreement, right? (I legitimately don't know)


This is also a bit "above my pay grade", but I've heard bits and pieces about how our company has some kind of agreement set up so that if there is a billing dispute, a minimum amount of time must pass before they can shut us down.

Obviously it doesn't help if the company goes under, but it's not as simple as the parent commenter's statement.


It depends on the vendor. I negotiated a pretty big agreement a few years ago and was able to get the vendor to agree to hold data for a year after termination. We also back up to another provider.

That said, over the years I've witnessed all manner of billing and payment fuckups. I'm not trying to say that cloud is bad -- but billing snafus are a risk factor that you need to understand and account for.


> IMO, cloud storage is the most reliable way to store data, and Google the most reliable

Does Google publish any durability numbers? Amazon claims 99.999999999% (11 9's) for S3 and Glacier (and considerably worse for EBS volumes). How does Google compare?


Don't have numbers here, but it's pretty good, and we regularly move the data around and check it. (I.e., it's not just rotting on a hard disk somewhere.)


One-point anecdote: Google Keep lost my data after it decided to upgrade itself (gone from phone and cloud). Just some notes, but still.


Check if they are in your archive. I thought it had deleted mine too; however, I just found them in there.


> The joys of IT. Back in the day, to wipe years of records you had to burn or shred tons of paper in multiple locations. Now you just crash one little system. [...] most organizations will still have a single backup system for any given data repository -- if they have one at all.

No, back in the day, you could still wipe years of records because most people only used a single filing cabinet with no duplication or off-site copies. "Destroyed in a fire" was the canonical way to destroy all records, easily. This story has nothing to do with the introduction of IT. Incompetence is still incompetence.


16-18 million records destroyed in a fire in 1973:

http://www.archives.gov/st-louis/military-personnel/fire-197...


Fire? Why be so bombastic? All it takes is one misapplied label to permanently destroy records in a large enough filing system.


  Bernard: Shall I file it?
  Hacker:  Shall you file it? Shred it!
  Bernard: Shred it?
  Hacker:  No one must ever be able to find it again!
  Bernard: In that case, Minister, I think it's best I file it.
Yes, Minister - The Death List


Back in the day, in 2300 BC, records caught in fires were preserved forever, while records not so lucky eventually returned to the earth. (Barring the odd formal peace treaty inscribed in metal.)

We know a lot more about conquered Mesopotamian kingdoms than we do about successful ones, because step one after taking over a city was apparently burning it down, or at least burning down the palace.


It never ceases to amaze me how many teams have the idea that DR is simply buying some new hardware. Sadly, it's not just a government failing.


> He also criticized the Air Force for notifying Congress on Friday afternoon, five days after senior service leaders were told about the problem.

Prime time for burying bad news.


I recall an episode of The X-Files where they talked about a 'basement fire'; this reminds me of that. When you hear about a high-level government department "losing" files, it's usually because they've had an office party around the paper shredder.

Or something along those lines.


>office party around the paper shredder

This phrase made me laugh, but I don't really understand what it means. Can you explain, please?


"frantically destroying documents" I suppose would be another way to frame it... But I seem to remember that was the wording on the show, or pretty close to it.

It made it sound like the staff were running around and being very energetic in their operation of the paper shredder.


It's the 'big government agency' version of that movie scene where the hacker tosses all their drives in the microwave as the feds kick down the door.


Really? Like there are no backups? Is this amateur hour?


This is a Lockheed snafu, so yes, it is amateur hour.

Much of the data is backed up, and more of it will be retrievable from original sources. Maybe not all of it; I'll be very interested to see what gaps remain when they declare the issue resolved.


The cynic in me says the gaps will be for whoever was being investigated and triggered this corruption.


I would withhold judgement on Lockheed. It wouldn't surprise me if someone there raised the lack of a backup strategy as a problem with the government... only to be ignored or told there wasn't enough budget for it.


Lockheed is one of those companies that fires senior people to save money. Imagine how this plays out with turbine blades.


Why do companies do this, instead of "pay cut, take it or leave it?"


It's never phrased that boldly. People aren't "fired", their contracts are not renewed, or they are declared redundant because of the new kid with an advanced degree... usually just after they've finished training the new kid.

The most common move is to cut the highest paid section of the workforce after they acquire the service contract from a competitor. All major defense firms do this, which leads to problems maintaining institutional continuity.


From the article:

>> It’s possible that some data is backed up at local bases where investigations originated.


Edit: Agreed.


That's a pretty stupid reason not to have backups. If anything, a lack of backups makes me feel like any safeguards for protecting PII would be sub par.


This is a very sound leap of logic you're making.


Ah, the great flood of '76. Thousands of records were lost.

Very convenient, that flood.


Genuine question - can you post some more information?


It was a quote from the TV show 'Yes, Minister':

James Hacker: How am I going to explain the missing documents to "The Mail"?

Sir Humphrey Appleby: Well, this is what we normally do in circumstances like these.

James Hacker: [reads memo] This file contains the complete set of papers, except for a number of secret documents, a few others which are part of still active files, some correspondence lost in the floods of 1967...

James Hacker: Was 1967 a particularly bad winter?

Sir Humphrey Appleby: No, a marvellous winter. We lost no end of embarrassing files.

James Hacker: [reads] Some records which went astray in the move to London and others when the War Office was incorporated in the Ministry of Defence, and the normal withdrawal of papers whose publication could give grounds for an action for libel or breach of confidence or cause embarrassment to friendly governments.

James Hacker: That's pretty comprehensive. How many does that normally leave for them to look at?

James Hacker: How many does it actually leave? About a hundred?... Fifty?... Ten?... Five?... Four?... Three?... Two?... One?... Zero?

Sir Humphrey Appleby: Yes, Minister.

(Source: http://www.imdb.com/character/ch0030014/quotes)

Edit: Although apparently a similar real-life incident occurred when Hurricane Sandy wiped out a significant portion of an FBI record archive: https://nsarchive.wordpress.com/2014/09/16/archival-neglect-...


Reminds me of this: The Front Fell Off - http://m.youtube.com/watch?v=3m5qxZm_JqM


This is Australian, and if you like it, look for _The Games_ on YouTube, a mock drama series about the Sydney 2000 Olympic Games organizing committee.


Maybe the other poster meant this one? Oct 1976, Frederick, MD: http://www.fredericknewspost.com/archive/flood-flashback/art...


If you're going to say something about paper, see also the 1973 Personnel Records Archive Fire: https://en.wikipedia.org/wiki/National_Personnel_Records_Cen...


Dear 18F [1]: Help, please.

[1] https://cloud.gov/


18F is severely outgunned. They need your help. I can't say that enough. The entire Western US is on fire, there's a drought, they have one fire hose and their next boss may turn it off.


Nice to see an encouraging response for 18F here. They are fighting the good fight, and most people don't even know what they are up against.


>Data about current investigations has also been lost, which is delaying them.

[...]

>"We've opened an investigation to try to find out what’s going on, but right now, we just don’t know," Stefanek said.

I wonder where they're putting the files for that investigation.

I wonder if their backups were corrupted too, or if they just overwrite the old ones with newer ones as a cost-saving measure. Or whether no one ever tried restoring from them and, for whatever reason, they could never have been restored.

Since people rarely prioritize it until it's too late, maybe 'restore from backup' drills should be a common thing.
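Even a crude scheduled drill would go a long way. Something along these lines (the paths and the seven-day freshness threshold are made up for illustration, not anyone's actual policy):

  # Pick the newest backup, check that it isn't stale, and actually
  # restore it somewhere disposable. Run this on a schedule and page
  # someone on failure, and "we never tried restoring" stops happening.
  import tarfile, tempfile, time
  from pathlib import Path

  BACKUP_DIR = Path("/mnt/offsite")   # hypothetical backup location
  MAX_AGE = 7 * 24 * 3600             # fail if newest backup is > 7 days old

  def run_drill() -> None:
      backups = sorted(BACKUP_DIR.glob("*.tar.gz"), key=lambda p: p.stat().st_mtime)
      if not backups:
          raise RuntimeError("drill failed: no backups found at all")
      newest = backups[-1]

      age = time.time() - newest.stat().st_mtime
      if age > MAX_AGE:
          raise RuntimeError(f"drill failed: {newest.name} is {age / 86400:.1f} days old")

      # The actual point: prove the archive restores to something non-empty.
      with tempfile.TemporaryDirectory() as scratch:
          with tarfile.open(newest, "r:gz") as tar:
              tar.extractall(scratch)
          restored = [p for p in Path(scratch).rglob("*") if p.is_file()]
          if not restored:
              raise RuntimeError(f"drill failed: {newest.name} restored zero files")
      print(f"drill passed: {newest.name} restored {len(restored)} files")

  if __name__ == "__main__":
      run_drill()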


Might additional security requirements deter sensible backup procedures? Like you can't have copies of the backup somewhere? Or the backups have to be encrypted, and the keys are on a USB drive somewhere that has gone missing or bad, so the encrypted data can't be recovered?


Good grief. For a lot less than an F-35 program, they could install a TSM rig with off-site backup.

Ours (at a state university) deals with petabytes of data. We've had to escalate to IBM's Tier 3 support a couple of times over the past decade, but we have never lost a file.


Any information on the contract with Lockheed or the consequences of catastrophic failure?


Do please keep in mind that on September 10th, 2001, the secretary of defense reported to Congress that the department could not account for 2.3 trillion dollars of spending.


It's becoming clear that all government data should be part of blockchain(s).


Just ask the Russians for a backup. And if that fails, the next-higher authority: the NSA.


Always hard to tell the difference between gross incompetence and malice.


Oh, come on. Nobody ever made any backups?


Any sufficiently advanced incompetence is indistinguishable from malice.


[flagged]


Please stop posting unsubstantive comments to Hacker News.

We detached this subthread from https://news.ycombinator.com/item?id=11906732 and marked it off-topic.


Can you elaborate?


M$ LOL


Yeah this is believable. Where I worked, in the secured network there were maybe 15 workstations with a maximum of maybe 5 people working at any one time, unless there was a "TOUR" going on. We had 2 contractor and 1 government IT person supporting this network.

I asked one of the contractor IT people when the last backup was. Response: "Do you want me to do a backup of your workstation tonight?"

WTF!!! I am no IT slob, but the minimum is a Mon-Thu backup of each day's new and changed files, plus a full network backup on Friday night.
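(That schedule isn't even hard to automate. A rough sketch with invented paths, just to show how little was being asked for:)

  # Incremental copies of the day's changed files Mon-Thu, full archive
  # on Friday night. Paths are hypothetical; a real shop would use actual
  # backup software, but the logic is this simple.
  import datetime, shutil, tarfile
  from pathlib import Path

  SOURCE = Path("/data/designs")    # hypothetical workstation share
  DEST = Path("/backup")            # hypothetical backup volume
  STAMP = DEST / ".last_backup_time"

  def run() -> None:
      DEST.mkdir(parents=True, exist_ok=True)
      today = datetime.date.today()
      if today.weekday() == 4:      # Friday: full backup of everything
          with tarfile.open(DEST / f"full-{today}.tar.gz", "w:gz") as tar:
              tar.add(str(SOURCE), arcname=SOURCE.name)
      else:                         # Mon-Thu: only files changed since last run
          last = float(STAMP.read_text()) if STAMP.exists() else 0.0
          incr = DEST / f"incr-{today}"
          for f in SOURCE.rglob("*"):
              if f.is_file() and f.stat().st_mtime > last:
                  target = incr / f.relative_to(SOURCE)
                  target.parent.mkdir(parents=True, exist_ok=True)
                  shutil.copy2(f, target)
      STAMP.write_text(str(datetime.datetime.now().timestamp()))

  if __name__ == "__main__":
      run()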

Why wasn't this done?!!!! The tape backup makes too much noise. I guess the bitch couldn't read her endless Romance Novels because of the noise.

I found an intern, not a full employee, working on a design file in the account of another engineer. WTF!!! We believe that since the intern was leaving soon, this was so the design would appear to be the work of the engineer.

Security violation!!!!! The contractor IT people swore they did not know this was going on...... BS!!! With so few on the network at any one time, and with only the intern on the network in the mornings, these contractor IT people were unaware!!!!

The government IT person knew, but she did not want to report it because she was well aware of the contractor's history of retaliation.

BTW..... during the investigation, the government engineer asked where it was written that they could not share passwords!!!!! Can you hear Snowden laughing at this?

All swept under the lumpy rug....... the violations would make the organization look bad.

As for me..... I got the IR treatment.... that's Isolation and Retaliation... then after 2 and a half days, my supervisor sent me an email asking about my whereabouts.... my reply: Retired as of 4:00 PM PST, 3 days previous.




