
data loss due to a broken replica-repair strategy.

That's the thing: Redis has no read repair strategy. Redis has a "be an exact copy of your master" strategy. It's not a secret. It's the exact design.

The complaints are like yelling at Linus when you rm -rf / your entire machine. Sure, it sucks, but it's a repercussion of your own actions, not a fault in the system. If you don't want to rm -rf / your machine, go use an OS designed for babies (cough ubuntu cough).

(Plus, there are already Redis improvements (designed within a day or two of the original problem being reported) to provide workarounds to users who _do_ want to run that exact use case. The answer to problems is solutions—not complaining and blaming endlessly.)

"the 99th percentile is bad" in a more negative way than he intended?

Text. It's only text. You can't read the intonation and people want to read absolutes. People want to read anger. Always anger. Always confrontation. It's possible the author of the text didn't mean to insult your mother. Breathe. It'll be okay.

The entire goal of the Internet is to get ALL THE ATTENTION FOR YOURSELF. If anybody hates you online, it's because you got attention and they didn't. Nobody hates insignificant people. So, oftentimes, people with lower profiles/attention in conversations will try to increase their attention profile by arguing with/hating the high-attention people.

If people hate you, you've already won.




When you say "be an exact copy of your master" is the strategy, you're kind of missing the problem here - who should be the master? In this case, Redis made a really dumb choice of which node should be master, and brought everything in sync by replicating emptiness instead of replicating data. There are very few use cases where that would be preferable. People have pointed out well known and fairly simple solutions to the selection problem, which are leading to the fixes and workarounds you've mentioned (though without the courtesy of acknowledging where the ideas came from).

Your analogy to "rm -rf /" is invalid, because that's well defined, well documented, and well known behavior. That's not true of Redis's autonomous and non-deterministic response to a failure (not to a user action). No spec or doc precluded choosing a different master and preserving data instead of discarding it. In the absence of such explicit guidance, preserving data should always be the default. How can it be user error when the user did nothing? Redis did the wrong thing because something was missed in its implementation, not because of any rational or deliberate choice.


you're kind of missing the problem here - who should be the master?

The choice of master is a static configuration set by the user.

Redis itself has no failover or promotion ability. There's an additional tool called Sentinel that can fail over and promote individual Redis instances, but it is designed to recover from complete instance failures (ones where the instance doesn't immediately restart), so a quick restart means no failover happens [an improvement to the "quick restart" scenario is showing up soon].
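For context, that failover behavior lives entirely in Sentinel's own configuration, not in Redis. A minimal sentinel.conf might look like this (hostname, port, and timeout values are illustrative, not from the thread):

```
# sentinel.conf (illustrative values)
# Watch the master named "mymaster"; require 2 Sentinels to
# agree before declaring it objectively down.
sentinel monitor mymaster 192.168.1.10 6379 2

# Consider the master down only after 30s of unreachability.
# A master that hard-crashes and restarts within this window
# never looks "down", so no failover is triggered.
sentinel down-after-milliseconds mymaster 30000

# How long a failover attempt may take before it's abandoned.
sentinel failover-timeout mymaster 180000
```

The down-after-milliseconds window is exactly why a quick restart slips past Sentinel: the master is only briefly unreachable, so from Sentinel's point of view nothing failed.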

Redis made a really dumb choice of which node should be master,

(see previously; master is static, defined by the user)

pointed out well known and fairly simple solutions to the selection problem

(redis doesn't select things)

Also, this issue showed up last week. Last week. People are making it sound like this issue has been ignored for years. Nobody ran into this (and reported it) until recently. This use case is already being adapted into standard Redis capabilities.

Try running into a big problem with any other DB and getting both attention and a concrete fix within two weeks. For free. The entire progress of the project has paused to address these immediate user issues.

because that's well defined, well documented, and well known behavior.

The Redis behavior is: always be a copy of a statically configured master. When the master has an empty dataset, all the replicas replicate an empty dataset. Pretty simple. :)

No spec or doc precluded choosing a different master and preserving data instead of discarding it.

Yup, specs and documentation did exactly that. Redis has no failover capability on its own.

preserving data should always be the default.

Oops, you just re-invented the Mac trashcan.

How can it be user error when the user did nothing?

The user disabled persistence, enabled replication, restarted the process with zero data, then the replication recovered and stayed in sync with the newly zero-data master.
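That sequence can be sketched as a toy model (plain Python, an illustration of the "be an exact copy of your master" semantics, not Redis internals): a replica whose only rule is "mirror the master exactly" will just as faithfully mirror an empty dataset after the master restarts without persistence.

```python
# Toy model of "be an exact copy of your master" replication.
# Not Redis code; just the semantics described in the thread.

class Node:
    def __init__(self, persistent=False):
        self.data = {}
        self.persistent = persistent
        self._disk = {}

    def write(self, key, value):
        self.data[key] = value
        if self.persistent:
            self._disk[key] = value

    def hard_restart(self):
        # Without persistence, a restart comes back with zero data.
        self.data = dict(self._disk) if self.persistent else {}

    def sync_from(self, master):
        # The replica's entire job: become an exact copy of the master.
        self.data = dict(master.data)


master = Node(persistent=False)   # user disabled persistence
replica = Node()

master.write("session:42", "alice")
replica.sync_from(master)         # replica now holds the data

master.hard_restart()             # master comes back with zero data
replica.sync_from(master)         # replica faithfully copies... nothing

print(replica.data)               # {}
```

The replica isn't malfunctioning at any step; it's doing precisely what it was configured to do, which is the point being argued above.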

something was missed in its implementation, not because of any rational or deliberate choice.

Nopers. More a lack of thinking it through from the user's point of view: an exact copy of nothing ends up being nothing.


"People are making it sound like this issue has been ignored for years."

Hasn't it been? How does leaving that failure mode latent in the system for years make things better? I rather think it reflects an inability to reason about failure modes (including user failure modes) and deal with them proactively instead of after data was lost.


Expecting developers to be omnipotent isn't a very stable argument.

The users intentionally configured their options and the system responded exactly as it should have, given what it was asked to do.


This isn't about omniscience (not omnipotence, BTW). This is about a far lower standard of basic diligence, expected and met by most people who work on data-storage systems. If you're given some data to store, and there's an obvious way to retain/recover that data despite an intervening failure, then failing to do so is a betrayal of the most basic trust people put in data-storage systems. Congratulations, you've implemented the distributed-system equivalent of linking fsck to mkfs. Well done. Go pat yourself on the back for conforming to your specification.


I don't think you're understanding redis or this problem correctly.

Redis lets you have slaves which mirror the master. Hundreds of thousands of redis installations use this pattern to provide read scaling and offline master-loss persistence, and in the normal case, this works great. I myself have implemented systems with hundreds of redis instances which have gracefully survived the loss of the primary.

In this particular instance, the user turned off persistence, didn't understand the ramifications, and then brought the master back up with an empty database after a hard kill without thinking things through.
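For reference, "turned off persistence" in this scenario means something like the following redis.conf settings (a minimal illustration; the thread doesn't show the user's actual config):

```
# redis.conf -- persistence fully disabled (illustrative)
save ""          # disable RDB snapshotting entirely
appendonly no    # disable the AOF log
```

With both mechanisms off, a hard kill and restart necessarily brings the master back with an empty dataset.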

Fortunately, the user was savvy enough to have kept backups off the slaves, as is the usual pattern, and so was able to continue service.

This is not a normal pattern and goes against the general practice.

Does that help?


I understand what you're saying, but I don't think it's a sufficient reason to throw away data. I've seen hundreds of cases where a GlusterFS user went against our advice and did something that ended up making things worse. Sometimes they even lost data. Of course, they always blame us. I'm pretty sure people who have worked on every single data-storage system ever have had similar experiences. Sometimes the user is just wrong and it's their own fault. Sometimes they're right because we made it far easier for them to make things worse than to make things better. In those cases we have to stop making excuses like "user error" or "RTFM" or "against general practice" or whatever. We need to help the user by not handing them bags of explosives. Which do you think is a better choice here?

* Default to preserving already-replicated data, provide "clean start" as an option.

* Default to throwing away data, maybe-someday implement an option to use data that's already present in the system.

Blaming the user won't prevent another user from making the same mistake with the same result. Saner defaults, and an implementation to support them, will. Who's going to complain that you saved too much of their data?


The defaults are sane, and in fact the user here had to explicitly turn them off in order to do the thing they wanted to do. Once you reach into a configuration file and change a setting, I can't think of a software system in the world that protects you from your choice. Could you maybe name a few?


The user turned off persistence. There's no reason for a normal person to suppose that also means ignoring data that's in the system when the master comes up. The fact that the two are inextricably tied to one another in the Redis implementation is not the user's mistake.


What do you think a slave should do if it is told to replace its state with empty state? How about half-empty state? There's really no answer that's satisfying for every possible use case (certainly I don't want my slaves to refuse if I tell them to clear the database completely on purpose). And indeed you haven't given any examples of databases that try to do 'better'. I think that's because there aren't any.



