Data Loss at GitLab (2ndquadrant.com)
476 points by umairshahid on Feb 1, 2017 | 228 comments



With this incident, they once again showed that they are dedicated to transparency, even in the worst days. This raised their standing with me, and I believe with other developers too. However, this may not be the case with the business people. I hope they can survive that and also publish a guide for getting better at the "ops" side of things.


HN can be funny sometimes. GitHub got a lot of hate about a year ago just for not releasing new features. GitLab cost everyone a day because their backup/ops practices were silly, and everyone loves them more.

I've screwed up before, and I sympathize/empathize with their ops folks, but this should make us think about plan B in case something like this happens again.


> GitLab cost everyone a day

GitLab isn't popular because of the stability of its cloud platform. It's popular because you can install your own instance for free practically anywhere with minimal effort.

I run GitLab CE on a box in my server closet for projects that involve livelihoods.


I know (because I use it and also host it at home, and like it). The point of my comment was HN's reaction.


Remember that HN is not a single-minded mob. During each incident a particular group was vocal on a given subject, but it may not have been the same individuals who were vocal each time.


Fair point, thank you.


You can also look at RhodeCode CE - supports git/mercurial/svn and has a really nice installer you can use to keep things up to date.


>It's popular because you can install your own instance for free practically anywhere with minimal effort.

If that is the reason why you use Gitlab, then why not try Gitea or Gogs? Gogs is written in Go and provides a Docker image or a drop-in binary.


I have been running Gitlab CE for around three years now and I am very happy with it. The interface is good. The access management works very well. And the best part? Keeping it updated is very easy on an Ubuntu box. Just a couple of very easy commands.
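For the Omnibus package it's roughly the following (a sketch, assuming the GitLab apt repository is already configured on the box):

  sudo apt-get update
  sudo apt-get install gitlab-ce

The package handles reconfiguring and migrating the bundled services on upgrade.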

I am so happy with Gitlab that I haven't even installed Gogs once to give it a try, though I have known about it since its initial days. For me Gitlab CE just works. Unless Gitea shows a 10x feature I am very unlikely to shift away from Gitlab CE.


GitLab is a lot more feature packed than Gitea/Gogs. Gogs is lightweight, good for personal projects, but if you're looking for something to deploy company wide with integrated everything, Gitlab is the way to go.


I think this is changing at a pretty fast pace ... well for Gitea anyways. It also looks like Gitea is lighting a fire under Gogs, as they appear to be iterating at a faster pace as well. Here's a very quick breakdown of what's going on.

Activity for the last 160 days. There were 175 commits to gogs and 720 commits to gitea.

https://gitsense.com/gogs-gitea/commits-160days.png

Activity for the last 60 days. There were 109 commits to gogs and 262 to gitea.

https://gitsense.com/gogs-gitea/commits-60days.png

https://gitsense.com/gogs-gitea/changes-60days.png

https://gitsense.com/gogs-gitea/changes-files-60days.png

The options and vendors directories are unique to Gitea, and they account for a lot of the changes within the last 60 days. I was told the vendors directory is used to store dependencies, but I don't know what the options directory is used for. And as the following shows, they account for a lot of the files touched in the last 60 days.

https://gitsense.com/gogs-gitea/changes-options-vendor-60day...

Based on what I've read on Hacker News, the developer behind Gogs tends to merge in changes in spurts, so it's hard to tell if this recent flurry of activity is a spurt or not. In the 365 days of activity below, you can see the 3 spurts for Gogs so far.

https://gitsense.com/gogs-gitea/commits-365days.png

Regardless of whether or not Gogs will continue to develop at an increased rate, it looks like Gitea will.


Gogs is pretty sweet. After hearing about it here I installed it on my old home server with constrained memory because I couldn't meet Gitlab's (IMO) obscene memory requirements. It's perfect for my needs as a hobbyist and I even use it as a secondary Git server from work. It's also way, way faster than a self-hosted Gitlab or any public Git server, so even though my current server has more than enough memory to run Gitlab, I still use Gogs.

The only (minor) issue I've had with it was when I tried to push the OpenCV repository to it, just for the hell of it, on a heavily constrained VM (Debian with 256 MiB of RAM). The poor thing just couldn't handle it without crashing until I upped the memory to a couple gigabytes.


It might be related to git and not Gogs/Gitea itself.

I've found Gitea to handle larger repos fine after running the following:

  git config --global core.packedGitWindowSize 16m
  git config --global core.packedGitLimit 64m
  git config --global pack.windowMemory 64m
  git config --global pack.packSizeLimit 64m
  git config --global pack.threads 1
  git config --global pack.deltaCacheSize 1m
This reduces the memory used by git during certain operations.


The one who erred (and recovered) once is more valuable than the one who never made mistakes.


There is no basis for that assertion.


This is basic risk management. What matters is error survival, not absence of errors. Moreover, total absence of errors means accumulating dangerous risk. See NN Taleb, for example.


Sounds defeatist to me. Especially when we are not talking about literal moonshots, but backup and restore procedures. The IT equivalent of seatbelts.


I think at this point, recognising that software and software deployments invariably have bugs/issues is just realistic. The only setups that are free of them are setups where the problems haven't been discovered yet.


A well-known library like curl has had CVEs.

A lesser-known library that addresses the same use cases as curl might have none listed.

Which would you prefer to use?


That's arguing at cross purposes.


The one with the CVEs because it has had eyes on the source enough to generate them. The devil you know and all that.


Exactly. They've made mistakes, and can recover from them.

The other hasn't had that chance.

Thus, try the first, as it is better tested in the world.


That's very different, though. I trust the company/organization that had to fire an employee for a horrible mistake. They are going to be very careful with who they hire or let near the code in the future, and had much more to lose than a single person.

But I don't trust the employee who was shown forgiveness for a horrible mistake. Some might learn not to repeat what they had done, while others learn that they can get away with things through the magical power of phrasing the situation in a positive light. And some might be mistake-makers for life. Not all people come out the other side stronger.


I see this play out with regularity in the board game hobby.

Manufacturer A: perfect product. Manufacturer B: omits some pieces, expedites replacements.

People will love and extol the amazing service virtues of B, not A.


There's another reasonable takeaway: GitLab will probably now prepare for this sort of thing better than the average competitor who hasn't been bitten by this.


Borges: "In Alexandria, it has been said that the only persons incapable of a sin are those who have already committed it and repented."


We'll certainly learn from this; the post-mortem write-up issue is at https://gitlab.com/gitlab-com/www-gitlab-com/issues/1108


Every corporation ever has screwed up like GitLab, if not worse.

They are just better at hiding it, smoothing it over and lying.


That's Internet culture for you: being lovable trumps pretty much anything else.


Nah, no need to think about backup now.

Stuff like this only happens once :) haha.

To be fair, I suspect gitlab will adopt a 'never-again' policy towards making sure their backups work.


>GitLab cost everyone a day

That's just nonsense. In most cases, the local repo should be more than enough to continue work.


Only for development right? If you were someone that handled issues for the team, or depended on webhooks triggering builds, you were still blocked.

I'm not trying to skewer Gitlab, I host an instance at home. I've also nuked 30% of the ports on our openstack cluster and screwed up everyone's day. I admire the transparency. I just wanted to call out HN's reaction, vs way smaller (technical) issues involving GitHub. But someone did point out downthread that HN's not a hivemind, so there is that.


The actual problem was with their DB, not the repo storage, so I'm not sure what the actual fallout was, but people lost things like issues/merge requests, so I can see how workflow would be interrupted.


Yeah, kind of defeats the whole D part of DVCS when you centrally-host everything. GitHub is great for creating an open-source watering hole and becoming a de-facto standard for open source project hosting, but bad for mission-critical stuff if you're relying on something like that to be 100% available, especially when using a technology which is designed to be distributed.


It would have been really cool if GitHub had fully bought into the distributed model and built the wiki/issue tracker/etc such that they were included in the repository, so that they were also decentralized.


The "Plan B" is to have a local Gitlab instance and use the mirror feature to mirror it to Gitlab.com and/or vice versa.

You should always have 2+ production nodes in case one goes down.
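For the repositories themselves, even a plain git mirror gets you that second copy (remote names and paths below are made up; GitLab's mirror feature automates the same idea):

  git clone --mirror git@gitlab.local:group/project.git
  cd project.git
  git remote add gitlab-com git@gitlab.com:group/project.git
  git push --mirror gitlab-com    # run periodically, e.g. from cron

It only covers the git data, of course, not issues or merge requests.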


Just to stress one thing that is mentioned in their report but that most people don't seem to comprehend:

Gitlab was _one_ (last!) final step from complete data loss of everything. One. That night, for quite a long stretch, there was only one copy left (and it was 6 hours old). Every other backup was missing, not working, or deleted.

This is scary.


This is the fact that deserves more attention. The operator did a manual backup before starting work, and it was the only one available.

I find this to be an eye opener. The real problem was dodged by doing the right thing at the proverbial last minute (6 hours).


In other words, there is an alternate universe in which Gitlab is now categorically dead, all because that employee didn't make that single backup.


Didn't they have two copies? (the staging server and an LVM snapshot)


I am sick of the "they are transparent so it's ok" argument. The fact is this type of thing should not happen anymore. I get it, stuff fails, but it's not acceptable to lose production data at this scale and save face by being transparent. That was ok back in 2003. Not anymore.


Compare with Github and their 98.935% uptime over the past month: https://status.github.com/graphs/past_month

Neither should be exclusively relied upon for business-critical services. Ask Github about their production backups and DR plans some time.


I still remember the outage where GitHub ran their tests against production - which started by TRUNCATEing everything before loading fixtures.

I think "everybody learns the hard way" is just one of those things with operations.


Probably they cut costs by hiring a really incompetent set of developers, because they are a "startup" and "transparent" and "it's ok". Nothing could make the state of the company more "transparent" than that. xD


I don't think this affects too many of the larger shops, since they tend to host their own GitLab instances (typically for security, compliance, legal, etc.). My current shop runs GitLab Community Edition on an internal cluster.


Transparency? Gitlab is the company that interviews people and rejects them on the basis of salary. They've been doing this for a while: every few months they call a group of people, waste their time, and then deny them based on salary. In my case, I confronted them with the fact that USD 100-120k is the market average, but they had this stupid startup argument that doesn't make any sense to me.


> USD 100-120k is the market average

In the US maybe, but given that they're a remote company with people all over the world, they don't necessarily have to live by US standards (especially when it comes to spending!).

40k USD is a VERY good developer salary in Argentina, and I'd bet that in other countries a lower figure might make the cut too. I can definitely understand why they'd hesitate to pay thrice that amount.

I believe it's over 5 times the average salary.


Yes, but if you look at the average here it's not really outrageous or "5 times the average salary" as you said: https://www.indeed.com/salaries/Gitlab-Salaries

And its not just me, they've done this in the past as well: https://news.ycombinator.com/item?id=10924957

Also, when you work for a remote company, they are cutting costs on office space and the like, which should be reflected back in the salaries. The whole idea of working remotely was to mutually benefit both the employer and the employee, not just to let gitlab use their whole startup argument to cop out whenever they want while calling themselves a "truly remote", so-called transparent company.


Like I said, they're a remote company, and therefore can hire people in cheaper regions, and don't have to adapt to more expensive markets (like your own).

I fail to see how this disproves it; it merely proves that you're not a good choice for them (because you live in a, relatively, expensive region).


> because you live in a, relatively, expensive region

If what OP is saying is true, then the company might as well be upfront about it.


> If what OP is saying is true, then the company might as well be upfront about it.

Regardless of that, the average salaries are posted online and they seem to suggest quite the opposite of what they offer people outside the US. It's just an indication of how much of a bully culture they have in negotiations, or how discriminatory they are; they might as well just outsource the site and not have a team of their own at all.


Most of their staff is from US/UK, but they just do not want to pay fair wages, which is why I posted the other thread where it was highlighted more.


Is it more fair to pay you $120k or more fair to hire three people in a lower cost area?

This argument sounds more selfish than fair to me.


It's unfair to pay some people low and others high (as I pointed out earlier, most of their employees are in the US and are getting paid the average US salary).

A company should follow some rules to neutralise the discrimination between employees; it shouldn't be all pick-and-choose and take advantage where they can. Getting cheap labor is not really the reason one should have a remote, distributed team.


What's wrong with that? Isn't it obvious that salary is an important factor in hiring people?


Transparency is not a substitute for competency. Both are important, but I'll take the latter every time.


Yeah it made me not consider using them for anything other than toy projects.


Why?

Everyone has their own copy of the gitlab remote.


Not everyone has their own copy of the project's issues/pull requests/settings/webhooks/CI config/etc., which the incident affected (and 6 hours worth were lost). Git repositories themselves do have the happy side effect that at least someone on your team has a full latest clone on their machine, but that's only because the source code is what you're there to use every day. Even if GitLab had a great "checkout your issues/etc. as a git repo" story, not many users would have an up-to-date copy backed up 100% of the time.


Right, there was data loss, and it was not trivial.

Does that mean you should only use gitlab for toy projects? I don't think so.

I think they'll quickly learn from this.


Maybe a dumb question but couldn't they be included in a hidden folder in the root of your repo or something? That way they'd get replicated automatically every time someone does a pull.


It'd probably have to be on a separate branch, but yeah they technically _could_ do that. Maybe something like this? https://github.com/duplys/git-issues
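A rough sketch of the idea (not anything either site actually ships; branch and file names are made up): keep the metadata on an orphan branch so it clones and pushes along with the code.

  git checkout --orphan meta      # branch that shares no history with master
  git rm -rf .                    # start from an empty tree
  echo '[]' > issues.json         # whatever serialized issue data you like
  git add issues.json
  git commit -m "Snapshot of issues"
  git push origin meta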


We use gitlab.com for code reviews and CI, and we basically lost a day of integration productivity because of it.

Yes, our process is too tightly tied to a single service, but it happens because we don't have the resources to self-host our own solution. We love GitLab, but this has absolutely got us looking elsewhere.


> We love GitLab, but this has absolutely got us looking elsewhere.

If you love GitLab and don't want to self host you can pay them for GitLab Hosted (paid customers had no troubles today).


How many days of productivity are you going to lose migrating off GitLab?


And how do you ensure you're not going to have the same problem at the next place?


And what will you do if the people at the next place don't tell you the root cause of a similar event?


The gitlab service is obviously important to your organisation, but not important enough to pay for their service, and incidentally, support a more robust environment?


Do you self-host any services currently & have they incurred any similar unexpected downtime in the past?

If you self-hosted code reviews/CI, would you expect having a similar downtime causing problem in the future?

I expect the answer to be "Yes" for most companies.


> I believe among other developers

I don't know. I'm very hesitant to try out GitLab now, whereas I was interested before.


Getting better at ops? Checking backups regularly to make sure they work is not a new thing.


All press is good press.


For me it underscores one simple rule: your backups are useless if you don't regularly practice restoring from them.


Or, I don't know, at least checking that they exist.


Any publicity is good publicity, right? :D


I think the "business people" are people who hide their technological inadequacy behind supposed business expertise. A business's success is not determined by whether or not they use tools that adhere to business micro-cultural "values".

If "too much transparency" is a turn off for you, you're probably just an authoritarian trying to scheme and scam your way into profit, and you probably lack the confidence required to put whatever skill you think you have on display.


At my company we always heir on the side of transparency and liberally post to our status. subdomain whenever an issue is identified.

Recently, our Redis cluster failed and both master & slave host machines rebooted.

This caused a latency spike in our app, from roughly 70ms to 400ms response times, for less than 10 minutes.

We posted to status within 60 seconds and posted 3 updates within those 10 minutes.

The next day, a new customer (who hadn't gone live with the app yet) cancelled their subscription because the app was "not reliable".

I guess my point is that there is a balance to strike. Our customers are not tech-savvy in any way and treat any small issue as the end of the world. Maybe there's no need to freak people out for a minutes-long latency spike.


FYI

https://en.oxforddictionaries.com/definition/err_on_the_side...

Just in case you ever use that phrase in critical correspondence. Better to err on the side of caution.


Ah, thanks! It didn't feel right but was too lazy to check. Will definitely remember this for next time.


This demonstrates a remarkable lack of empathy. Try understanding why people on the business side do what they do—it's almost never out of malice or desired authoritarianism.

Working with people is hard, specifically because you have to get outside of your head and think about how they see the world.


I suspect the comment you're replying to meant that the business people might feel negatively about trusting the company that lost data, regardless of transparency.


It really wouldn't have been that hard to write your opinion in a way that doesn't disparage the character of the person (that you don't really know) that you're responding to.

There's no reason to hide behind a new account other than to be an asshole.


Honestly, I'm completely flabbergasted by this. Five backups, and NONE worked properly? Who made this? The S3 bucket was EMPTY? Has no one ever tested any of these backups?

It's not just the impact, which is fairly sizeable in its own right, but the HUGE oversight on their part and the fact that they tried to pin part of this on PostgreSQL.

Credit where it's due: their report/transparency were good, if a little unprofessional, and something I'd like to see more of from other companies.

Putting on my BOFH hat, this is what happens when you let Devs do operational stuff.


I agree. Their catastrophe is not the kind of thing that represents mild hiccups in operations that will be quickly resolved by the same employees who allowed this scenario to occur the first time around. This isn't the result of a single oversight. It screams of a systemic problem with the way the business operates - period. Maybe every project is rushed out with the deadline being the only metric that counts, quality be damned. Or maybe they're missing talent in areas that matter - like experienced postgres DBAs managing the database rather than having the developers and/or someone who has only ever used MySQL trying to wrangle it.

What they went through is what you'd expect from a first-time pet project, not a professional business out to make money. I suspect they prioritized trying to scale with enough features to catch up with the competition above all else, to the point where sustainability is an afterthought. When desperately trying to gain market share is more important than the quality of the product, this is what you can expect.


I totally agree with needing an experienced DBA; we've had a vacancy open for this for a while: https://about.gitlab.com/jobs/specialist/database/


To be a bit more specific, it is extremely unlikely (nearly impossible) that you could ever find a single individual who is both a "postgres database specialist" and a "ruby developer". Do you even know what the word "specialist" means? It means specializing in one area - like postgres - without trying to be a Jack of all trades, i.e. with Ruby.

"DBA" is a full-time position, not an addon to a developer's duties. They are separate skillsets; you don't get a "2-for-1" special by hiring an expert developer + DBA in one person for one lowly salary.

Do you realize this event would have never happened if you had hired a pure DBA? Or do you really believe you can pin the blame on your ruby developers for not being able to wrangle a production postgres database?

The oblivious or intentionally cheap "the DBA must be an amazing ruby developer" expectation shows your hiring staff - or the management guiding them - has absolutely no clue what they are doing. I can just imagine the internal discussion right now; pointing the finger at the developers with no postgres experience, or downplaying the significance of this event and pretending like it was simply bad luck, and lying to yourselves about how "it will never happen again".

This job posting is completely outside the realm of reason. If that job posting has been up for months or years, I can see its description being exactly why you didn't have the right talent on board to avoid this incident.


>> As a database specialist not only will you work on the database, but also the application's usage of the database. As a result a significant amount of experience with both Ruby and PostgreSQL is a strict requirement.

This is a mistake. The number of DBAs who are good with postgres is very small compared to something like mysql. The truly talented pool for such a position is too small to expect them to also be a developer. The very mention of terms like "ruby" and "programming" should be removed from that job post. It's not a realistic expectation.


They want a developer who can do database stuff, not a DBA (based on that job posting.)


They are who you're replying to

> I'm the CEO of GitLab https://about.gitlab.com/ More information about me is on http://sytse.com


Fair enough, but I don't see him denying it.


Just as an aside, you can hire a remote DBA to help out, until you get a full time person. We have a contracted 3rd party that does secondary monitoring around the clock, and a very, very good pro that we work with on projects. In our case, we have a 'retainer' for 15 hours a month. (and a fixed cost per hour after)


The entire GitLab team is remote, as far as I understand their company model.

https://about.gitlab.com/2015/04/08/the-remote-manifesto/

But maybe you meant part-time?


Wow, that explains quite literally everything. Never, ever, trust a company whose cost-cutting measures involve opting not to pay for office space to run a legitimate business. The attempt to tout the lack of a full office as being "modern, hip, and cool" is a blatant lie from management - it's the same concept as open-office floor plans, reimagined and magnified by a thousand. They get to save hundreds of thousands (possibly even a couple of million) dollars a year; meanwhile, you get exactly the kind of product you'd expect from a bunch of people pretending to be productive from home and working in such a disjointed fashion.

It is always a HUGE red flag when a company opts to have all or a majority of their workforce working remotely. It's a cost-cutting measure, nothing more. Cutting costs equates to cutting corners, and the business - and its customers - suffer the deserved consequences.

This really explains the flippant "it's 11pm and I want to go to bed" reaction in their report. The guy doesn't have an office to go to when shit hits the fan. He's sitting at home, with a bloody ssh terminal open, trying to remotely debug critical engineering problems over a slow vpn connection. Alarmingly huge red warning flags.


Cost-cutting measures don't just end at the office-space level; they hire cheap developers too. There's quite a big divide between the salaries they pay their US team and their remote team, and they get to do it just because they are "remote".


I think remote is better than open office floor plans in many cases. It just has different problems.


The word "backup" doesn't appear anywhere in the job description.

Might wanna revise that today.


They'd probably be better off hiring a "backup engineer" to go through all of their systems (end to end) ensuring full backups exist and work for everything.

Instead of backing up every system in isolation, have well-thought-out backup/restore processes for all parts of their operation.

Saying that as someone who's done exactly this before (professionally, for mission critical places). ;)

eg:

· Inventory the systems (boxes, services, etc)

· Determine what each needs (package dependencies, etc)

· Create scripting (etc) for consistent backups

· Work out the restore processes

· Make it work (can take several test/dev iterations)

· Document it

And also (importantly):

· Have the ops staff perform the documented processes, to reveal holes in the docs, and show up parts which need simplifying


Untested backup === No backup

People like to say they have backups or a "backup procedure", but in my experience almost none of them have ever tested the backup... Not even once. 95% of the time "having a backup procedure" just means "we have a replica of some data sitting somewhere with no idea how/if we can restore it, or how long it takes".
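Even a crude periodic drill catches the "the bucket is empty" class of failure. A minimal sketch for Postgres, assuming a pg_dump custom-format dump (the path and table name are made up):

  createdb restore_test
  pg_restore --dbname=restore_test /backups/db-$(date +%F).dump
  psql -d restore_test -c "SELECT count(*) FROM projects;"    # spot-check the data
  dropdb restore_test

If that doesn't run cleanly end to end on a schedule, you don't have a backup.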


it's worse. you might be trying to restore from an untested backup and fail, wasting precious time.

five times.


I remember a local company a decade ago where they were obligated to delete part of their data on Fridays just to test all the backups and procedures. At the time I thought it was completely nuts. I probably still think it's at least a little bit nuts, but I can see the point.


Putting on my BOFH hat, this is what happens when you let Devs do operational stuff.

:thumbsup:


I got thrown on to the team managing netbackup at a massive company a few years ago, and it's actually fairly common that people don't test backups. We sent out notifications constantly warning people that if they aren't testing their backups, they aren't really backing up, but at least once every other month, someone would come back to us wondering why their backups weren't there, when they were trying to recover something.


While this is disastrous, I still think Gitlab is the best thing that has happened to OSS. This could be taken as rhetoric, but on a more actionable note, we should all learn from Gitlab's experience. Almost everybody experiences this kind of issue, but very few come out clean.


I think you are confused.

The best thing to happen to OSS is GitHub, not GitLab. GitLab is just a fast follower and likely wouldn't even exist without the former.

I for one am happy to throw money GitHub's way for their role in so dramatically changing how we code.


I think you might be confused. One of the nice things to happen to OSS was the availability of the kernel (the Linux kernel, that is) and git, and both set very open models which allowed OSS to thrive. Github allowed people to use git without knowing it.


I think you might be confused. One of the nice things to happen to OSS was the availability of transistors and electricity. Both made it possible to make millions of logic calculations per second without knowing it. :)


I think you might be confused. One of the nice things to happen to OSS was the availability of human beings to actually develop the concept of OSS and any underlying technology.


Please leave these threads on Reddit.


Github itself is not open source. The irony.


There are plenty of useful closed source services and tools that can be used to support open source projects, or just software development in general. Indeed, they facilitate open source projects with an easy to use service. Is there anything they're doing that prevents or blocks open source development? They do have open source projects that they've made available.


My pizza joint does not prevent or block open source development either. Apple has open source projects that they've made available. Many people use Visual Studio and Azure for open source projects, or software development in general. I find it very easy to use. I hope Github makes a lot of money from enabling open source. Who knows, maybe Microsoft will buy it.

More seriously: Github is a business focused on turning open source community ethos into hard $$$. There is nothing wrong with that, and their model depends on people buying into the carefully constructed marketing of Github -hearts- open source. It's inconvenient for Github for people to point out that it is a closed source for-profit business with shareholders to please.


Meanwhile, GitLab is also a business focused on turning open source community ethos into hard $$$-- part of their core product may be open source, but they're not a non-profit.

(One might even say that Gitlab is even more focused on turning open-source community ethos into hard $$$, because GitLab takes open-sourced community contributions to GitLab CE and rolls them directly into their paid, closed-source GitLab EE. Not that I'm suggesting that's necessarily a bad thing; I just want to make it clear that this isn't a Mozilla-style "take contributions from the community for the community's sake"-type thing.)


> GitLab takes open-sourced community contributions to GitLab CE and rolls them directly into their paid, closed-source GitLab EE

Just FYI, the GitLab EE source code is available at https://gitlab.com/gitlab-org/gitlab-ee/. You're right about the "paid" part, though.


Many companies take advantage of open source software. Some support open source through support contracts. Some which develop software release some of it, and employ devs who also contribute to projects. I'm sure there are many other ways companies use and support OSS directly and indirectly; some, don't support OSS at all.

SaaS companies are in a position where it can be difficult for them to release their core software, because it severely undercuts their viability if it's easy for people to set up their own installations.

What follows is naïve speculation as to counter arguments:

One could argue that they could differentiate themselves on better quality of service and support. It's possible, of course, but there may be enough people willing to put up with lesser quality that it may not be feasible for them to survive.

One could counter that taking advantage of community software contributions can lower their own internal development costs. Maybe, but there are very real costs when managing an open source project, including not being able to move as fast as they may want to. For example, if GitHub wanted to make a change for expediency's sake that wouldn't be appropriate for the community code, they're saddled with maintaining their own fork, which adds considerable overhead.

Of course GitHub loves open source. Their business is to provide a service that benefits open source (and closed source, for that matter) projects by providing infrastructure to manage their code. Some of their current and former employees are well-known in the open source community. I refuse to see anything nefarious or untoward (or even the slightest bit bad) about this until I see them doing something that impedes or actively works against open source. I see nothing "inconvenient" for GitHub about this. Some people are far too quick to go out of their way to look for hypocrisy.

(And as for "turning open source community ethos into hard $$$": my goodness. Yes, they make money. I hope plenty to continue to support the business, improve the service, and pay their employees well. How many community projects pay nothing for hosting on GitHub? How many devs pay to host their own projects, or fork others? The only reason I've considered paying was to host private repos: not open source. It'd be interesting to see their revenue numbers, but I suspect most of their revenue is for enterprise installs for software that isn't open source. Any corrections as to this most welcome.)

And there are alternatives for those who want open source. GitLab appears to be a great one.


Yeah, even their enterprise edition code is obfuscated like crazy.

For a company that praises open-source so much they really do try to be as closed as possible themselves.


Have you tried to reverse engineer GitHub EE?


I pay for GitHub and can't deny their positive role in OSS, but I also use GitLab and will happily continue (and love that they have an OSS edition). I don't think it's a binary choice. Personally I will often adopt multiple pieces of software with redundant features because nuance often dictates the right tool for the job, and I enjoy being prepared for that.


Absolutely, but the parent was saying GitLab changed OSS, and though that may be true to some extent, it is a pittance compared to the impact GitHub has made.


git itself is the real force behind the OSS impact. The Github model, while successful, was merely replacing SourceForge and Google Code. People always had ways to host and share open source projects.

Github enables centralization of code, which is not really in line with the philosophy of git being fully distributed. Gitlab, with its OSS model and the ability to host your own git service, is a bit closer to that vision than Github is.


You code differently using Git vs SVN? That's an interesting concept. It doesn't change the way you code; it just changes the way we share code.


Git makes multi-tasking on multiple branches far easier than SVN. In the SVN days I always had multiple checkouts and a pile of scripts for changing/updating to branches for handling a few regular tasks in that context. Once git won I was able to delete them and have a far simpler workflow.


It doesn't change how I code. It changes how we code.


Git absolutely changes the way I code. Cheap feature branching and easy rapid collaboration are huge in how they influence my coding practice and output. I wouldn't enjoy coding anywhere near as much if it weren't for Git.


A few months ago I was developing for a breaking change and realized that I wasn't committing anything. That's very much an SVN mindset where merging is so painful that you culturally just avoid it altogether. Git's emphasis on being distributed and branching and merging constantly meant that, once I pulled my head out of my ass and started committing locally, I was able to enjoy the benefits of source control while developing my breaking changes without affecting anyone. And later, when I merged my branch into the master, it turned out to be completely painless.
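That workflow, spelled out:

  git checkout -b breaking-change    # isolate the work
  git commit -am "wip"               # commit early and often, locally only
  # ...days of this later...
  git checkout master
  git merge breaking-change          # often painless, as described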


Github was just a follower with a new SCM. The original would be SourceForge.


SourceForge pretty much failed where GitHub thrived, though. For all practical purposes they were just a hosted CVS/SVN and nothing more. GitHub really nailed the idea of community and that made all the difference. Git is incidental, though lightweight branching was probably an important contributor to their approach.


While I can't disagree that sourceforge failed, it was definitely more than a mere code-host.

Sourceforge had integrated mailing-list support, bug trackers, page-hosting, and more.

There was a community there, it was just that each community was based around a particular project. There was little chance of a user of project A interacting with project B. But I guess the same could be said of github.

Sourceforge failed in part because of feature-creep, and availability issues. But I think it would be unfair to say that it wasn't "social".


I know the execution was a lot better with github, but what did it really add that source forge, google code, etc didn't do? They all had bug trackers, mailing lists, etc.


The killer feature is stupid easy forking and sending pull requests with the one interface.

Found some tiny project that does almost everything you want, but is missing a tiny feature you can implement easily? No worries, fork, add it.

Even if it's never pushed upstream, it's easy for someone to find and perhaps use themselves.

I do this with things like zabbix templates/scripts and docker containers that are 99% what I want.

Previously there was no easy way to do this - I'd have to set up an entirely new sourceforge/google code/etc account, which is a lot more effort.


> The killer feature is stupid easy forking and sending pull requests with the one interface.

That's the part that wasn't really possible before git, though. But you could always do an anonymous checkout and make your changes there; you could even do a pull request (via a patch). There was no need to sign up for that, even though most of us had a SourceForge account (I still do) anyway.
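Roughly what that looked like in practice back then (the URL and file names are made up):

  svn checkout http://example.org/svn/project/trunk project    # anonymous checkout
  cd project && $EDITOR src/feature.c
  svn diff > feature.patch          # mail this, or attach it to the tracker...
  # ...and the maintainer applies it with:
  patch -p0 < feature.patch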


Well, even if their transparency in failure is admirable, I'm not completely surprised.

Just running an apt-get install gitlab gives me around ~350 dependencies. In those dependencies, I see python, ruby, nodejs, redis, postgres.

With a little Java, and a little Go, plus some admin scripts written in Perl, the picture would be mostly complete...

I may be a little harsh, but when I see a piece of software with so much complexity in it that it requires 3 stacks, I'm not completely convinced it's a well conceived piece of software.

Having 5 backup systems, with none of them working properly kind of falls in the same category.


Err, isn't Git the reason for that Perl dependency? IIRC, Git depends on Perl for --interactive.


There are not many projects these days that would not require similar stacks. At least one core language: be it Ruby, or Java, or whatever. Then you will have a DB. Then you will have some fancy frontend which will surely depend on Node.js. Then you might have some KV store/caching, so you get Redis.


I completely disagree. Normally you choose ONE core language and you use only that in a monolith application. I can accept different languages for the unit tests, like Groovy on the Java side or F# on .NET. Or using two languages on the same platform, like Scala and Java on the JVM and C# + F# on .NET. Or one language server side with the web side written in JavaScript or in something else. If you are not breaking your application into microservices, using several languages is just a very, very bad smell. And in this case, apparently, they are using Python, Ruby, JavaScript, Java, Go and Perl. I would say that from my point of view this is unacceptable and it is just a symptom of scarce planning, as we have seen in this incident where apparently nobody thought to plan for a backup test.


Even using 4 languages is too many, e.g. Java and Scala for system building, Apache Groovy for build scripts and unit testing, and Javascript for web side as in your example. Go with something like Kotlin, which compiles to JS, Android, and JVM bytecode, can be used for Gradle build scripts, and has builder-style syntax useful for test scripts. A single language is better than 3 or 4.


I would be surprised if most startups aren't doing something similar behind the scenes.


In my experience of working closely with about 100 startups, most do not have that complex a tech stack. What I see nowadays is basically:

[Python|Ruby|JavaScript] backend + [JavaScript] frontend + [Postgres|Mongo] database + redis + [AWS|GCP] hosting.

Rarely, I see Java or Go instead for the backend. I cannot recall the last time I saw more than three languages in production for anything nontrivial. I've seen companies significantly larger subsisting on one language for backend (even in microservice architecture!) and one language for frontend. That's not to say there is no sophistication, just that the number of actual technologies in play is slimmer.

This isn't a comment on GitLab's utility or stability, of course. I haven't worked with them in this context, and I'm not a user. I'm just pointing out that, assuming those dependencies are all for GitLab and not e.g. git itself, that is quite a stack to maintain. I don't know if we can extrapolate that to a systemic issue with GitLab that caused a data loss incident, though. That seems uncharitable.


to be fair, nodejs is useful for npm even if you're only doing client-side js.


That's what I thought when I read the report earlier today. A system with that much complexity is doomed to failure eventually.

I can understand "right tool for the job" but at some point it should all come together. An MVP can be hacked together from bits and bobs, but when it becomes a business it should be refactored to reduce complexity where possible.

Trying to maintain five separate backup solutions, let alone trying to restore from all five of them, sounds like my worst nightmare. Trying to restore from one backup is often hard enough by itself.


Exactly. Today we're able to see a bad disaster recovery happening, but it happens all the time. I revised my backups and added some redundancy to them.

Everyone knows how important backups are and talks about them all the time, but I bet a lot of companies don't do it right. It is expensive, doesn't add real value (except when it does), etc.

Today I added automated replication of backups from AWS to another cloud provider. Just in case...

I did this because recently a local Brazilian cloud provider (ServerLoft) didn't pay its provider (Equinix) and went offline forever. 16k companies went offline without time to recover anything.


As terrible as it is for GitLab, it's doing some good here as well.

I'm using this as a great example of a reason to bump some of the fixes for our backup and replication issues up the priority list. And it's much easier to sell to some of the "higher ups" when you can point at a concrete example of how badly a misstep here can hurt.

I'm floored by their honesty and openness; I can honestly say I wouldn't be able to put this out there like they have. But I'm really glad they are doing it, and I'm really happy about the outpouring of support they are getting for it from people like 2ndQuadrant.


As someone who hasn't had the opportunity to use Gitlab, could you expand on why you "think Gitlab is the best thing happened to OSS"?


It provides a lot of the same features as GitHub, only you can self-host for free, contribute to development, and all the other niceties of a free project. A lot of people (myself included, and OP I suspect) believe it's also a particularly good project because GitHub has too much market share and influence in the free software ecosystem for a for-profit company.


Note that GitLab is a for-profit company as well. It's just that it aligns better with the open source values.


..and they have a commercial offering too. Unlike some other things I've seen, at least their commercial offering isn't crippled and mostly has things that specifically cater to enterprise customers (like Active Directory integration. The community edition still has basic LDAP).


> like Active Directory integration. The community edition still has basic LDAP

I manage our company's GitLab instance. It's connected to our massive AD and it works just fine - the only thing I really miss is LDAP group creation and assignment. GitLab uses only the name and email attributes from LDAP/AD either way, and I think if I have some spare time I'll just write an hourly cronjob that manages groups and assignments using the GitLab API.
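Something along these lines, presumably (assuming the v4 API, a made-up group ID of 42, user ID 123, and access_level 30 for Developer):

  curl --request POST --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
       --data "user_id=123&access_level=30" \
       "https://gitlab.example.com/api/v4/groups/42/members"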


Plus github has an agenda that has nothing to do with code, and excludes large segments of the public.


Your comment was killed (I just vouched for it) because of your unsubstantiated claims. Can you give some concrete examples of why you think this way?


Seriously? Almost everybody has FIVE different backup systems and not one of them works? I don't know where you work, but in my job monitoring everything is essential. And a couple of times per year we do a disaster recovery test. Something like this is completely and totally unacceptable. It means that they never tested any backup.


Couldn't agree more. Also transparency is key here. Everyone makes mistakes, but being open about it does not only let customers (and employees!) know what's going on, but also gives other professionals the ability to learn from it.


There are two types of web application providers: those who have already had database mishaps causing data loss, and those that WILL. No matter how good, or how large and 'professional' the company is, mistakes can happen.

Being open about it inspires confidence that they will improve in the future, instead of being quiet and having repeat incidents.


I don't get to work on databases this size and today has been an incredible lesson and a journey. I've been reading all the comments and blogs, watching the stream and Googling what I didn't know or understand.

I feel like the next step for me is scaling my business so that we have an actual usage for my newly found interests :)


Then you end up having deep expertise in a topic that's only important for larger companies. Sometimes this works out well, sometimes it leaves you a little stuck. :)


Meh. Just learn, learn, learn. If it isn't completely applicable to your life that's okay. Sometimes there are pearls of wisdom in best practices in field completely unrelated to your own. Learning something new is always a good thing.


THIS ^^^ So much this. I can't tell you how many times I picked up a book or paper thinking "There's no way I'm going to get anything new out of this". As I start reading, I start to find little tidbits of information that make me think in ways I didn't before.


Oh, there's plenty of demand for skilled DBAs with deep expertise with handling large volumes of data, even at small-to-medium-sized companies. GitLab being one of those companies! In fact, demand for this sort of thing is likely to increase. If it interests you, dig in. You'll never stop learning.


I'm not sure if it's really company size, though there's probably a correlation. My company is only marginally bigger than GitLab in terms of employees (just recently 200+), and we have several dozen databases bigger than this.


This. It depends more on the product and the requirements. I've worked at companies that have just a few employees and are running databases larger than this. I've also worked for a company with ~100 employees whose only 'database' is an in-memory cache and a text file a few kilobytes big.


Running databases at scale is great - it's what I do for a living, and I wouldn't want to do anything else.

But please realize that this is a horrible example. Almost everything done was wrong, technical choices, processes, everything - please don't use it as a positive example.


For those wondering about the live Gitlab stream, you can watch them work here:

https://www.youtube.com/watch?v=nc0hPGerSd4


That is remarkably transparent. Commendable.


It sounds like a terrible working environment for the sysadmins. When shit is broke, you focus on fixing it. Being on camera is distracting, and setting up the livestream probably takes a bit of time that could be spent on actually fixing the problem.


We had the call live with all engineers and asked if everyone was okay with streaming this.

Someone not involved with helping fix the problem set up the stream from their home, while we continued work as normal.

I think the overall spirit was that people were comfortable doing it like this. Note that no one was required to work like this, and we'd have happily stopped streaming if anyone had had any problems with it.


It's both fun and incredibly frustrating at times.


The gitlab situation and Uber's article speak to the immaturity of PostgreSQL's native replication feature and, more importantly, to how poorly google-able, documented, and adopted the replication strategies are.


I didn't think replication was one of the issues Uber noted, but then I looked back at the blog posts:

https://eng.uber.com/mysql-migration/

..and yep, it's listed.

In my own experience, I'd still take PostgreSQL over MySQL. MySQL doesn't allow for DDL modifications within a transaction, which makes database migration with tools like Flyway a little less resilient. On the other hand, you can use one connection for multiple databases with MySQL, MSSQL and others, which you can't with Postgres.

I mean really, they all have trade-offs. It really just depends on your specific use case.


I believe gitlab used slony, not the native replication. I'm not well versed in postgres, but that's what I gleaned from reading their event log.


As mentioned below we only used Slony to upgrade from 9.2.something to 9.6.1. For regular replication we use PostgreSQL's streaming replication.


The google doc says at the bottom that slony only was used for a migration once, otherwise they use the replication features built into postgres.


The decision not to use native replication might be attributed to its purported immaturity?


I chose postgres over mysql at my current job because there was an easy-to-use backup script (wal-e) available at the time and I didn't see anything comparable for mysql. So we have a hot slave, streaming backups constantly to S3, and a full base backup every week, all with very minimal work. [https://github.com/wal-e/wal-e]

I actually regret not using mysql, because mysql at the time supported logical replication out of the box, which would make database upgrades easier. We haven't upgraded our postgres DB boxes because it would involve a lot of pain, whereas if we were using mysql we probably would just upgrade the slaves and wait for a failure.
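For reference, the wal-e setup is pretty much just this (paths and the schedule are examples):

  # in postgresql.conf:
  #   wal_level = archive
  #   archive_mode = on
  #   archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'

  # weekly base backup, e.g. from cron:
  envdir /etc/wal-e.d/env wal-e backup-push /var/lib/postgresql/9.x/main   # your data directory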


i wish everything was discussed/handled as publicly and transparently as this whole scenario.

i really hope this becomes a thing.


I totally agree, the live stream [1] is amazing, discussing steps in the open like that and if there is some time left answer questions from the YouTube chat.

[1] https://www.youtube.com/c/Gitlab/live


Glad to hear this was appreciated. It was an experiment - one we only hope to repeat in other scenarios.


As a sysadmin, I'd find it incredibly distracting to be on a livestream while trying to fix a critical issue. For your employees' sake I hope you don't do this again.

Have a single point of contact that provides information about the recovery process. Being transparent and providing technical info is good, but that task should not be handled directly by the admins at the same time they are focusing on the drop-everything-shit-is-broke emergency.


Is there an archive of the stream?


Part of the stream is here: https://www.youtube.com/watch?v=nc0hPGerSd4

Not sure if we'll be able to get the full 8+ hours up.


...and keep it up (sorry, couldn't resist :P)


From their blog post,

> So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place. We ended up restoring a 6 hours old backup.

That must be _terrifying_ to realize. I mean, thank goodness they had a 6-hour-old backup or they'd be in such an awful spot.


I would counter that they're still in an awful spot because this announcement reeks of incompetence and isn't something you want to hear from the guys you're entrusting with keeping your code safe.

It would be like Boeing or Airbus announcing that all the safety features on their airliners were non-functioning.


> the guys you're entrusting with keeping your code safe.

To add to the other replies: I'm not trusting them with keeping my code safe (everybody has copies on their own computers). I'm trusting them with facilitating my workflow, helping me collaborate, and also to keep my issues etc. safe.

So yes, this can have a relatively large impact in terms of not being able to work as efficiently for a day and potentially losing some issues, but nowhere near as disastrous as losing all my code.


Exactly, that's the whole point of git. Your entire code base is distributed (or at least everything you push out to a remote). Unlike CVS/SVN, you have that whole copy and collective history of everything that was pushed in that repo.

If you want to move it, in most cases you can simply create a new bare repo on another remote and push yours to it. It's probably the easiest system out there to simply pick up your data and go, unlike the walled gardens of social networking, video (YouTube isn't video, it's video + annotations + another ecosystem of tools that's not easy to export), and other services.

Even blog software isn't as resilient. You still have to export your Wordpress or Ghost blog when you want to move it. With git, when you work on it, you already have a full copy (with a few exceptions of course, like remote branches people prune without merging or local branches people never push).


> I would counter that they're still in an awful spot because this announcement reeks of incompetence and isn't something you want to hear from the guys you're entrusting with keeping your code safe.

This might be a good time to note that GitHub dropped their database a while back (twice!). Bad Things always happen; you recover, learn, and hopefully fewer Bad Things happen in the future. Nature of the beast, unfortunately.


I don't think those two scenarios are comparable. The safety features are in use on an airplane constantly. Backups are only needed when a disaster happens. It'd be more akin to B/A announcing that some emergency system like the air masks isn't functioning. Still incredibly troubling, especially if there was a cabin depressurization scenario, but not to the level of all safety features being broken.


Not all the safety features are in use all the time. Like, you don't use the inflatable slide unless you're evacuating.


Sure, and I never said all of them are used all the time, but parent comment specifically did say "all the safety features". I wanted to make the distinction clear that safety features in planes can and are used actively during flights, whereas backups are only used when there's a catastrophe.


They are in an awful spot.

It can be argued that once a company messes up this bad it will make sure nothing similar happens ever again. However, it can also be argued that if a tech company has all five of its back up procedures fail it's borderline criminally negligent.


BOAC Comet.


The fact that they were upfront and honest about it and that they even live streamed themselves fixing the problems makes me want to use gitlab even more. If anything I have even more confidence in them. You didn't hear a peep from Microsoft when the forced windows 10 upgrade bricked thousands of laptops. Perhaps that's why such a huge portion of developers prefer OSX/Linux to Windows? I've run six businesses over the past decade and when windows 10 started rolling out I was in a Houston office that lost nearly a hundred terabytes of client data. We did everything by the books, paid for the business and enterprise editions of Windows, used their servers, used their proprietary software stack, used their support service, and they still fucked us and didn't really seem to care or think they did anything wrong.

I have another company that runs on a completely open stack, where pretty much nothing is integrated by a specific vendor. We have hiccups, but we've never had the OS get hijacked and upgraded.

I've noticed that most start-ups run by devs run on a more open stack and hack their way through problems on the cheap, while the ones run by corporate executives try to keep things as closed as possible and end up spending millions to solve problems they could have had people on the internet solve for fun.

I prefer to use the right tool for the right job, but I wish companies like Microsoft would be more open when they cause huge issues that end up causing monetary loss. I make sure all my critical infrastructure is open source these days.


"You didn't hear a peep from Microsoft when the forced windows 10 upgrade bricked thousands of laptops. Perhaps that's why such a huge portion of developers prefer OSX/Linux to Windows? "

Weird that you group OSX in there. The only company in mainstream tech more secretive than Microsoft when it comes to problems is... Apple.


Being able to own up to a problem doesn't imply an ability to fix the problem going forward.

"Test-recover backups" is ops 101. "Monitor your backup process to be sure your backup store isn't empty" is ops 101. "Script your rollouts so you don't have an ops person doing SSH on boxes" is... ok, that one might be ops 102.

This points to a company with almost no understanding of how to operationalize software. There are certain to be far more landmines, possibly even bigger ones. Hiring an ops person to fix these problems is definitely possible - and I sincerely wish GitLab luck getting a competent ops team in place before the next crisis.
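
To make the "monitor your backup store" point concrete, here's a minimal sketch of the kind of check that would count as ops 101. Everything in it (paths, naming pattern, thresholds, the alerting hook) is a placeholder you'd swap for your own:

    import glob
    import os
    import time

    BACKUP_DIR = "/var/backups/postgres"     # placeholder path
    NAME_PATTERN = "*.dump*"                  # placeholder naming scheme
    MAX_AGE_HOURS = 26                        # daily backups plus some slack
    MIN_SIZE_BYTES = 100 * 1024 * 1024        # anything smaller is suspicious here

    def newest_backup():
        """Return the most recently modified backup file, or None if there are none."""
        files = glob.glob(os.path.join(BACKUP_DIR, NAME_PATTERN))
        return max(files, key=os.path.getmtime) if files else None

    def check_backups():
        """Return a problem description, or None if the latest backup looks sane."""
        newest = newest_backup()
        if newest is None:
            return "no backup files found at all"
        age_hours = (time.time() - os.path.getmtime(newest)) / 3600.0
        if age_hours > MAX_AGE_HOURS:
            return "newest backup is %.1f hours old" % age_hours
        if os.path.getsize(newest) < MIN_SIZE_BYTES:
            return "newest backup is implausibly small"
        return None

    if __name__ == "__main__":
        problem = check_backups()
        if problem:
            # wire this up to whatever pages a human: email, Slack, PagerDuty, ...
            print("BACKUP CHECK FAILED: " + problem)
            raise SystemExit(1)

Something this simple would flag an empty or stale backup store long before it matters, and it still isn't a substitute for periodically restoring a backup for real.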


I'm surprised by the statement that 4 gigs of replication lag is normal. However, I don't manage backups for anything larger than personal pet projects, so I don't have a sense of scale.


> don't have a sense of scale

From [1], the complete db is ~300GB, and from some iffy pixel measurement of the graph at the very bottom of that page, the copying speed between otherwise idle db hosts was about 22.8 GB/hour (in-production replication is probably slower than that).

From that, 4GB of replication lag would represent 1.3% of db by size, or 10+ minutes of lag (as measured by time required to catch up under ideal circumstances).

[1] https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-...
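
Spelling out that back-of-the-envelope math (the inputs are the rough estimates above, so treat the outputs as rough too):

    # rough inputs, taken from the estimates above
    db_size_gb = 300.0            # total database size
    copy_rate_gb_per_hr = 22.8    # eyeballed copy speed between idle hosts
    lag_gb = 4.0                  # reported replication lag

    print("lag as a share of the db: {:.1f}%".format(lag_gb / db_size_gb * 100))              # ~1.3%
    print("best-case catch-up time:  {:.1f} minutes".format(lag_gb / copy_rate_gb_per_hr * 60))  # ~10.5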


I didn't think to eyeball the graph to guesstimate how long the 4GB translated to, so thanks.

However, scale was the wrong word for what I was wondering about. My question should've been whether 1% of your total DB/10 minutes of replication lag seems reasonable/nothing to worry about, like the article suggested.


It was an issue; it's part of the reason a tired person was working late trying to reduce it.


Thanks for the context. In my personal experience at jobs dealing with large databases, we always used "minutes behind" to determine how well our replication was keeping up. This is the first time I've heard of someone using data size as that same metric.


Both have different uses. Seconds behind is useful for seeing how critical the lag is to the user (they see outdated data), while data behind tells you about load and lets you estimate how much time is required to catch up.
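
For anyone curious, both numbers are easy to pull out of Postgres itself. A minimal sketch, assuming a 9.x-era server (the function names changed in 10), psycopg2, and placeholder connection strings:

    import psycopg2  # placeholder DSNs below; fill in real hosts/credentials

    PRIMARY_DSN = "host=primary.example.com dbname=postgres user=monitor"
    STANDBY_DSN = "host=standby.example.com dbname=postgres user=monitor"

    def replay_lag_bytes():
        """Bytes of WAL each standby still has to replay (run against the primary)."""
        with psycopg2.connect(PRIMARY_DSN) as conn, conn.cursor() as cur:
            # 9.x names; PostgreSQL 10+ renamed these to pg_wal_lsn_diff,
            # pg_current_wal_lsn and replay_lsn.
            cur.execute("""
                SELECT client_addr,
                       pg_xlog_location_diff(pg_current_xlog_location(), replay_location)
                FROM pg_stat_replication
            """)
            return cur.fetchall()

    def replay_lag_seconds():
        """How far behind in time the replayed data is (run against the standby)."""
        with psycopg2.connect(STANDBY_DSN) as conn, conn.cursor() as cur:
            cur.execute(
                "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())"
            )
            return cur.fetchone()[0]

    if __name__ == "__main__":
        print("bytes behind, per standby:", replay_lag_bytes())
        print("seconds behind on the standby:", replay_lag_seconds())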



^ Related HN discussion on the live report @ Google Docs


Does nobody else find the report cringeworthy? Apparently, there are some junior engineers fumbling around and committing serious errors, but where are the senior ones and the process/failsafes to prevent all this?


From what I know of the dev who caused this, he is certainly no junior engineer. One of the smartest devs I know.


Yes. It makes me angry. It's not just that backups failed; it's more like broad incompetence in all the wrong places.


Many years ago, in '96 or '97, my colleague tried to upgrade an Oracle DB by running "sudo /path/install.sh" from the root folder. Little did he know that the script did "rm -rf *" in one of its first lines :)

The day was saved by the fact that Oracle stored its data on a block device directly. There was no data loss and we just had to restore the machine itself.

Since that day, I never run any scripts in /, /etc/, ...


I've always despised the practice of vendor-provided installation scripts for much this reason.

Use local installers. Or tarballs.


Testing backups, or at least monitoring them for correctness, is a huge deal, and it's a problem I myself fudged up on one occasion, right around the holiday season, which was terrible.

I've since set up wal-e with daily base backups (deleting anything older than a week), along with nightly pg_dumps and a hot standby. Maybe that's overkill, but after having lost data once: never again!

The nice part about doing WAL archiving the way barman or wal-e do is that you can do more than just backup/restore; you can also restore with a specific time target in mind.

Someone somehow does a massive update or delete, or inserts millions of rows of garbage? No worries: stop, destroy, restore to a previous point in time, continue onwards.

A bug in Postgres, the kernel, the filesystem, or any other multi-million-line codebase in that stack screws up? Most likely the WAL segments are still good up to a point.

Hot standby gets you potentially sub-minute failovers if you automate them, or short enough to be ok even with manual failovers. WAL archiving gets you another whole level of safety net that is hard to beat.
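
To make that concrete, here's roughly what the wal-e flavour of this looks like. The paths, schedule, and recovery target are placeholders, and the details vary by Postgres version and by where you ship the archives (this sketch assumes the envdir-based credential setup from wal-e's docs):

    # postgresql.conf on the primary (9.6-era settings; details differ on older 9.x)
    wal_level = replica
    archive_mode = on
    archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'

    # nightly cron job: push a fresh base backup, prune old ones
    envdir /etc/wal-e.d/env wal-e backup-push /var/lib/postgresql/9.6/main
    envdir /etc/wal-e.d/env wal-e delete --confirm retain 7

    # to recover: fetch a base backup into an empty data directory...
    envdir /etc/wal-e.d/env wal-e backup-fetch /var/lib/postgresql/9.6/main LATEST

    # ...then let recovery.conf replay WAL up to a chosen (made-up) point in time
    restore_command = 'envdir /etc/wal-e.d/env wal-e wal-fetch %f %p'
    recovery_target_time = '2017-01-31 17:00:00'

The recovery_target_time is what buys you the "restore to just before the bad delete" behaviour described above.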


One thing I could never figure out, probably because I didn't study the docs enough, is when I could safely delete the archived files. What is the best reference on this? Or maybe I was trying to reinvent the wheel: I didn't really look at third-party tools, only the built-in PostgreSQL options (plus my glue on top). Is the only right way to use the archive (file-shipping) solution to combine it with a third-party tool?


On the other hand, even if there was some data loss, shouldn't most people have their entire repositories on their drives (at least for the ones they're actively working on)? In theory, much can be recovered from end users who are active; the only real worry is inactive users. Not sure if this was discussed much.


AFAIK the problem was with the database and not with the repos. So yeah, I have all my files on my machines, but I don't have any copy of the issues, merge requests, wiki pages, etc.


Aside from the incident, this is a great opportunity to learn something new (at least for me). That said, does anyone know how they plot all the charts for Postgres and co.? For example: http://monitor.gitlab.net/dashboard/db/postgres-queries

I know the chart is made in Grafana, but how do they collect the data?


We use Prometheus (prometheus.io). We have a team of Prometheus engineers and have a vacancy for more.
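
They don't say here how the Postgres metrics get into Prometheus, so take this as a sketch of a common setup rather than a description of GitLab's: a third-party postgres_exporter exposes metrics on its default port, Prometheus scrapes it, and Grafana uses Prometheus as a data source. The prometheus.yml fragment would look roughly like:

    # prometheus.yml (fragment) -- hostnames are made up
    scrape_configs:
      - job_name: 'postgresql'
        static_configs:
          - targets: ['db1.example.com:9187', 'db2.example.com:9187']  # postgres_exporter's default port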


We have a saying at my work:

"If it's not tested it doesn't work"

While this was originally about software, it's amazing how many other places it applies. Do you require code reviews before commits? Periodically sample a few random commits and check that a review was done for each.

Heck, if you don't test your system for running automated tests, it may be that you aren't even testing what you think you're testing.


Only pull requests and issues are gone. And even if the repos were gone, wasn't git supposed to be a distributed VCS anyway? ;)


TL;DR: I'm sure it's been said already. Doing backups is great, but testing RECOVERY is critical and should be the top priority. It's scary when you must back up your own data because your data company can't be trusted to back it up...


Git really should have issue and merge/pull request data shipped with the repository. Does anyone know if this has been planned or not?



In related news, gitlab.com seems to be back up now.


How can a "company" with more than 150 employees (that is, essentially, a data storage company) let this happen?

GitLab is surely losing subscribers faster than land in Crimea after this...


Incident upon incident followed by a mistake under pressure late at night.

It happens. I suspect the guy responsible for the final straw is feeling pretty bad. I know I've come close to doing similar things on production environments I really didn't want to be touching while they were falling apart.

But they've been honest about it. If they learn from it, and six hours of database data is the worst loss they ever experience, I think it will be to their credit that they were promptly transparent.


It's a good thing our data is always backed up by your friends at the NSA.


They'll back it up for you, but they won't give you access to your own data: https://memegenerator.net/instance/68192872


So the Death Star is operational once again?


it is a fully operational battle station once again :)


It feels like PostgreSQL is at the center of every terrible story about data loss[1] or poor performance[2].

I think companies prefer other databases like MySQL because they "just work."

[1]: https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-...

[2]: https://eng.uber.com/mysql-migration/


Seems like confirmation bias. I always think of NoSQL databases when I'm thinking about terrible data-loss stories; I especially remember the CouchDB one. Postgres has been nothing but amazing for my uses.


If you delete your data directory, you're going to have a bad time no matter what your stack is.


Yes, it's probably an outage event, but then you flip over to your slave MySQL instance and keep going on your merry way.


You gave one example of each. That's a pretty painful definition of the word 'every'.

On performance: Every database will have some areas where it performs better than others - Uber just happened to hit a case that MySQL is well optimised for.

On data loss: MySQL's binlog replication is hardly a 'just works' solution. It's definitely got a lot better over time (particularly in 5.7), but it's not like a replica falling behind is some incredibly unusual event. Go back a fair while and it was really broken - much more so than Postgres' solution has ever been. It's fair to say it's more mature right now, and one reason to prefer MySQL for replication is that there's certainly more help out there for it.


I used MySQL replication with WordPress around a year ago to keep a standby DB in another data center. The replication broke almost every 4-5 days.

The data loss in this incident was human error, not PostgreSQL's fault.



