Call me maybe: Aerospike

Ankhers · on May 5, 2015

I currently work for a company that uses Aerospike quite heavily. In the past couple weeks, we have begun to notice data inconsistencies in our counters. We are seeing fluctuations in the data, despite having no decrement operations.

We have the enterprise edition of Aerospike, allowing us to be in constant contact with their support team and developers. A couple weeks later, and we still have no idea why this is happening. When dealing with monetary values, these fluctuations are very bad for us. Needless to say, we have begun migrating away from Aerospike.

jboggan · on May 6, 2015

What is the rationale for storing monetary values in this sort of system? Not being snarky, just legitimately curious what scale of service could possibly necessitate that and what solutions didn't work beforehand.

CyrusL · on May 6, 2015

Transactions in AdTech are different than normal payments.

For example, imagine an ad campaign spending $30k/month at a rate of $5 per 1,000 impressions. The customer may want their budget spread evenly throughout the month, so the software sets a daily budget of $1000. But this really represents 200,000 daily impressions, each of which is a transaction that subtracts from the available balance in real-time. The buyers software is talking to an ad exchange and keeping track of the budget every time an individual impression is won.

To add some more complexity, the impressions are probably billed as second-price auctions, so they aren't all exactly $0.005 each. Some are $0.00493, some are $0.00471, ect. Each one of these numbers is reported back from the exchange to the buyer's software in real time and the buyer is responsible for managing their budget.

This is just an example, but hopefully it illustrates how it can become impractical to account for this kind of thing using something more traditional like PostgreSQL. It would be reasonable to log all the impressions to something like Hadoop for the analytical piece of the software, but there needs to be something more real-time for budgeting to prevent overspending. The big ad exchanges can host hundreds of thousands or even millions of auctions per second, so not turning off bidding can be very costly.

This process of auctioning ad impressions across many buyers through an API is called real-time bidding.

hurin · on May 6, 2015

Why does this need to be in real-time, if their daily budget is $1000, you can still wait quite a bit and then apply increments in aggregate (e.g. hourly)? More so it sounds the customers aren't inter-connected - it hardly seems like the complex distributed problem.

cldellow · on May 6, 2015

The impressions may be sparse, e.g. say you're retargeting CEOs (demographic information you're getting from a DSP) who have visited your website in the last month (via a pixel you drop) who are in New York City (via a geoIP DB).

So, fine, a probabilistic model might work well. And you might decide to bid on 100% of impressions. And you might decide that you have to bid $200 CPM to win -- which you're OK doing, because they're sparse.

And then say that FooConf happens in NYC and your aggressive $200 CPM bid 100% of the time blows out your budget.

Often you can charge the customer you're acting on behalf of your actual spend + X% up to their campaign threshold. So you really want to ensure that you spend as much as possible, without spending too much. Pacing is hard. Google AdWords, for example, only promise to hit your budget +/- 20% over a 1 month period.

hurin · on May 6, 2015

I'm really not seeing what you gain out of running fine-grained control all-the-time here. Even if it were vital for a customer that you hit a budget target exactly you could dynamically change the granularity of control as you got closer. If anything predictive modeling would give you better budget use when you do have the flexibility than granular adjustment would (I don't know much about the area though and just going with your description of the problem here.)

phamilton · on May 6, 2015

A better example is frequency capping. Ever watch something on Hulu and see the same ad 4 times in a twenty-minute commercial? Or even, worse, back to back?

With a real-time data stack you can avoid the duplicated ad a good percent of the time. Better experience for buyers, for publishers, and for users.

jmalicki · on May 6, 2015

Or you could just store that in a cookie.

phamilton · on May 6, 2015

Store what in the cookie? Every ad the user has seen across the entire web, along with how many times they've seen it in the last N minutes/hours/days/weeks?

A cookie won't fit all that data and a more traditional database generally won't work. In memory k/v stores like Redis won't work due to data size (TBs of data). Hbase/Cassandra/etc sort of work with latency in the 5ms range. That's fairly expensive in a 90ms SLA, but you can make it work. It does limit the amount of work you are able to do.

vtuulos · on May 6, 2015

We (Adroll) have been very happy with DynamoDB for use cases like this. Works fine with ~500B keys, while maintaining very low, and consistent, latencies.

nkohari · on May 6, 2015

+1 -- we also used DynamoDB at Adzerk for the same use case.

easytiger · on May 6, 2015

The alst 4 ads

phamilton · on May 6, 2015

Even then we need a fast lookup. 4 ads means 4 advertisers, 4 creative, 4 potentisl clicks, imps, mids/ends (video), conversions/ etc. There are also multiple cookies depending on which exchange the ad came from. Having to map them on the fly requires a fast and large datastore.

hurin · on May 6, 2015

> A better example is frequency capping. Ever watch something on Hulu and see the same ad 4 times in a twenty-minute commercial? Or even, worse, back to back?

Yeah, but when that happens I usually don't think, oh hey they are lacking an optimal in memory distributed database solution.

I think, well... their engineers suck. Or they don't care. Pick one.

edit: His point is vague, so there is nothing technical to respond to. I am very much interested in a good technical example - but the things mentioned so far are by all appearances relatively straight-forward and linear, hence lack of effort or bad engineering are the only reasonable assumptions left.

cwyers · on May 6, 2015

I don't get your point here. He's explaining why it works the way it does. You're saying you don't think about it as a user. That doesn't invalidate or even respond to his point.

phamilton · on May 6, 2015

Volume and latency requirements make it more difficult to track individuals on the web. It's an easier problem to solve in 50ms. It's also much easier to solve when it's only a million individuals rather than a couple hundred million individuals.

Like most problems, scale makes it hard.

hurin · on May 6, 2015

It's simply not a difficult problem: there are no consensus requirements between individuals, so scaling can't be made any harder by increasing N.

cldellow · on May 6, 2015

phamilton touched on another good example - the budget can be implicitly set per-user via a frequency cap. If you see the user and have a chance to bid on an impression for them, the odds are good you'll see them again in two seconds -- winning two auctions means you've overspent by 100%. Oops.

minaguib · on May 6, 2015

Perhaps also worth pointing out that in RTB, quite often you won't know if you've actually won the auction or not for a few seconds later (when the exchange calls you back with a win notification, or you get it directly via a pingback from the user's device)

During that delay you might have actually already processed new bid requests (auctions) for the same user.

Depending on the order's characteristics and how much you're willing to deviate from target - especially when observed within a small time window - the above poses additional challenges w.r.t. overspending.

jitix · on May 6, 2015

IN such scenario, 'near' realtime would work just fine. Just process the impressions through storm or spark and put the results in HBase (CP type store) or even PostgreSQL.

jboggan · on May 6, 2015

Fascinating, thank you.

DrJosiah · on May 7, 2015

Having stored money values in Redis several times in the past (sometimes without replicas at all!), the answer is knowing how much you can trust the system.

I trust enough to get the job done, but not enough to get bitten when these systems drop data. Because here's the truth: they all drop data.

obituary_latte · on May 5, 2015

Can you add any detail to this anecdote? It's interesting and important, but detail might help steer others appropriately. What kind of inconsistency? What kind of fluctuation?

Ankhers · on May 5, 2015

I'm current working in AdTech. We are using counters to keep track of a lot of things that are important to us (e.g., reasons for not bidding on a given campaign, money spent within a given transaction, number of bid requests we get by exchange, etc). I personally have found two different, yet very similar, data fluctuations.

The first I found when debugging an issue. I noticed the counters going up, dip down, then continuing up. Rinse repeat. (e.g., 158 -> 160 -> 158 -> 170 -> 175 -> 173 -> 180)

The second I found was when trying to debugging the previous issue. I noticed the counters were essentially cycling. (e.g., 158 -> 160 -> 170 -> 158 -> 160 -> 170). This just repeated for the duration we watched the counters (approximately five minutes).

Please note that I used small numbers here. The counters I was monitoring were in the hundred millions, and I saw decrements in average between 2-3k.

justinsb · on May 6, 2015

Isn't this exactly what the article found? (search for split-brain)

linc01n · on May 6, 2015

I work in AdTech too. I'm still looking for a perfect counter solution. The counter we are using always overrun (which is better than up/down/up/down). We still manage to hack it by patching the number periodically.

p.s. I am working in an Ad Network but not plugging into exchange. Our system is not capable for that.

annnnd · on May 6, 2015

Genuinly curious: did you try implementing realtime stream processing solution, using Storm or similar?

jhugg · on May 6, 2015

VoltDB is pretty fantastic at counters. (VoltDB Engineer)

Rapzid · on May 6, 2015

Some of this stuff looks to me, as somebody not familiar with the domain of course, as a really good use case for event sourcing and in particular something Kafka/Samza could tackle well.

For future consideration and delayed evaluation of course. I guess if you absolutely must have the most up-to-date information so you can make decisions on it RIGHT NOW that wouldn't work very well :|

Or would it? If your bids and stuff are also going through the event stream..

SystemOut · on May 6, 2015

Bids generally have to be responded to within 100ms on these AdTech stacks and if you start to timeout they slow down bids being sent to you....which makes your bid manager less desirable by advertisers. You can probably use stream processing to build and update your models but I'd be surprised if you could handle the bid responses through the same mechanism.

23david · on May 6, 2015

Very strange. I've never encountered something like that with any database or cache. Wonder if it's somehow related to the way that your cluster is setup?

How big is this cluster? Are you writing and reading to the entire cluster, or do you have certain nodes that you write to and others that you read from?

bbq · on May 6, 2015

The analysis in the article shows that Aerospike is designed, intentionally or not, as a loosely accurate data store. It doesn't matter how you set it up or use it.

jamshid · on May 6, 2015

You use distributed databases and have never encountered an inconsistency? What scale? What are you using, I guess not any of these: https://aphyr.com/tags/jepsen.

Ankhers · on May 6, 2015

I have to admit, I'm not 100% on the entire configuration. However...

We have two clusters of 8 nodes each. Each cluster is setup with 2 factor replication. The clusters are setup with cross datacenter replication.

Your read / write question is a little hard to answer. In Aerospike, a given key will always reside on the same node, something to do with how they optimize their storage. Which means that anytime you write to, or read from, a given key your query will always be routed to the same node.

ngx1234 · on May 6, 2015

When Aerospike ships XDR batches it does not replay events, it just re-syncs the data. This is true even for increments. So if cluster A has 10 increments of n to n+10, and cluster B has 20 increments of n to n+20, it's possible XDR will ship A to B and cluster B gets set to n+10. XDR only guarantees data consistency if writes are 15 mins apart and your cross datacenter network doesn't go down.

The suggested method of solving this is to have two keys, one for each cluster, and XDR both keys. Then add them together in the app. You can maybe do it through a lua script, though I haven't tried.

cmpxchg16b · on May 6, 2015

This is interesting, and of course I have a couple of questions, but only two of them really matter: what client are you using, and are you using the cross-datacenter (XDR) replication functionality?

We* tested the increment functionality heavily (300K-1M aggregate ops/sec) before we turned it on in revenue service. We use it for a couple of different things, event counting is absolutely the major use case.

In a single-cluster world, it works phenomenally well. In a XDR world, things get a little tricky, and we had to change the way our application logic worked to compensate for it.

Any more information you can share about your use case?

*a big ad tech company that uses Aerospike heavily

digitalzombie · on May 6, 2015

What are you migrating to ?

As is away from Aerospike to... ?

Thanks, very interesting anecdote/case.

Ankhers · on May 6, 2015

We are first evaluating MongoDB. I believe the main reason behind this is we are already using Mongo in other parts of our application, so there is no additional setup when converting.

Note that nothing is set in stone. The decision to begin migrations only happened today. It is possible that we will end up using some other technology altogether, or even we find out the issues we are having with Aerospike and continue using that service.

ploxiln · on May 6, 2015

oh... dude... exact same mistake twice...

OK, let me try to be more constructive. Since accounts are independent, shard based on account (in the application, not in some magic shard-distributing layer). Treat each shard as its own cluster.

If you want super fast requests but can accept being down for an hour or two a couple times a year, a shard can be a single beefy host with a replicating slave. I'd consider either Redis or Mysql/Postgresql. Really, these old-style sql databases can be the fastest things that have the kind of consistency you need.

I've maintained a mongodb cluster configured a couple of different ways. Performance at high load and reasonable consistency is not as great as some older alternatives.

preetamjinka · on May 6, 2015

You should take a look at "Call me maybe: MongoDB" [0]

[0] https://aphyr.com/posts/284-call-me-maybe-mongodb

sans-culottes · on May 6, 2015

Not to mention "Call me maybe: Redis" [0]. Rinse and repeat for every NoSQL database out there.

Point is, @aphyr skewers everybody. Aerospike is just the flavor of the month.

[0] https://aphyr.com/posts/283-call-me-maybe-redis

jbergens · on May 6, 2015

I think FoundationDb got great scores from Aphyr, sadly it is now owned by Apple. For the others the big problem is when they promise some kind of ACID and cannot achieve it, if they were explicit with what the supported all customers could make informed decisions. Relations systems are generally bad ad horizontal scaling and can get very slow with full ACID over many servers.

masklinn · on May 6, 2015

> I think FoundationDb got great scores from Aphyr,

FoundationDB ran Jepsen internally and reported stuff[0], Kyle never worked with it.

[0] http://blog.foundationdb.com/call-me-maybe-foundationdb-vs-j... half broken now, none of the images load for me. @aphyr seems to have taken them at their word wrt testing though: https://twitter.com/aphyr/status/405017101804396546

annnnd · on May 6, 2015

Curious: how difficult is it to replicate aphyr's tests? Can we beleive independent entities posting jepsen reports on some random DB?

lmm · on May 6, 2015

I think that's a false equivalence. Aphyr is pretty positive about Riak (with the correct configuration) and Cassandra (when used for appropriate scenarios). If I was choosing a new system those are the two I'd be looking at.

woozy · on May 6, 2015

Antirez was very receptive http://antirez.com/news/55

masklinn · on May 6, 2015

Aphyr wasn't overly impressed by that response though https://aphyr.com/posts/287-asynchronous-replication-with-fa...

bluef00t · on May 6, 2015

Try HBase. At another AdTech company, we have been very happy with it. https://eng.yammer.com/call-me-maybe-hbase

annnnd · on May 6, 2015

Nooooooooooooooooooooo! Seriously, no! Use mongoDB, PostgreSQL, even flat files if you must, but HBase?

We used it in production about 3-4 years ago and it was a nightmare from both usage and especially maintenance point. Fortunately we had a flat-files based backup system so we were able to rescue data every! Single! Time! the damn thing crashed and took (part of) data with it.

Of course, this is anecdotal evidence, and things might have changed from then, but I wouldn't touch it. Life is too short.

EDIT: Also, I am curious how the results in the above link would compare to aphyr's if he performed the test on HBase?

bluef00t · on May 6, 2015

I see where you are coming from. HBase was unstable 3-4 years ago, but after a great amount of dev effort and battle hardening from Cloudera, Salesforce, etc., it is very stable now. We have ~ 400 nodes running in production for a very critical use case and have seen 0 data loss edge cases in the last 2 years, along with some of our servers running > 6 months without any reboots.

We use is in a very real time use case with latency requirements of single digit milliseconds, and if you tweak it the right way, you can the required performance from it, along with easy horizontal scaling.

Also, I am curious too for aphyr to take on HBase, but I don't think the result would be different since running Jepsen is straightfoward and not much to a person's interpretation. The results and further experiments are what aphyr does nicely.

annnnd · on May 7, 2015

Thanks for the info on HBase stability. I probably won't use it again (once burnt...), but if they really managed to pull their act together - good for them!

perlgeek · on May 6, 2015

When storing financial data, I'd certainly go with some kind of event sourcing: store deltas / financial transactions, not counters. The counters are just a sum over all deltas.

If performance is an issue, you can make the counters available in a second database that's only for reading, and updated from the original deltas.

obstinate · on May 6, 2015

It seems from this article that Aerospike can lose acknowledged writes, so that would not be enough to save them.

drvsrinivasan · on May 7, 2015

Today, the inconsistencies here were diagnosed by company and Aerospike to be caused by two clusters connected with XDR concurrently writing data to the same counter and shipping to each other and intentionally overwriting some data (bad design that somehow slipped through the cracks). So, this issue is unrelated to the Jepsen network partitioning tests that is the subject of the original article. The work that @Aphyr is doing is very valuable and much appreciated. (Aerospike Founder)

smt88 · on May 5, 2015

Yikes. What made you choose Aerospike in the first place?

Ankhers · on May 5, 2015

Sadly, a requirement of the application is that it needs to be backward compatible with the previous system.

I'm sure many of you know, this leads to quite a few issues.

yellowapple · on May 5, 2015

When a product claims to have 100% uptime, I immediately cringe, knowing full well that they're probably full of bovine manure.

Good read.

saryant · on May 5, 2015

Same with "exactly once delivery".

grogers · on May 6, 2015

My analog to this is when someone claims that partitions are rare so you can ignore them, any product they make is very likely to lose data during partitions.

0xbadcafebee · on May 6, 2015

https://www.youtube.com/watch?v=pjvQFtlNQ-M

bketelsen · on May 6, 2015

Many kudos to Stripe for funding this. Truly a great gift to the community.

mdellabitta · on May 6, 2015

I'm imagining the job interview.

"We want to sponsor your distributed database research, but not your Barbie animated GIF production."

stephen_mcd · on May 5, 2015

The rotated A in the Aerospike logo reminds me of a system that's fallen over, and now you can't unsee it:

http://www.aerospike.com/

drivers99 · on May 5, 2015

Trying to read aerospike's front page to figure out what it is reminds me of this: http://shouldiuseacarousel.com/

oska · on May 5, 2015

"A system"? Not sure what you mean by that.

To me, it was immediately apparent that they are making the whole word Aerospike look like a rocket, a motif they repeat through their home-page.

simoncion · on May 6, 2015

Worker 1: "The server has fallen over."

Worker 2: "I'll go restart it."

See also: The frequent interchangeability of the words "system" and "computer".

AnDowNS · on May 5, 2015

My co-worker glanced at my screen and thought I was reading about erospike, which has a decidedly different feel, especially given the phallic imagery on the page.

curun1r · on May 5, 2015

Really? When I look at it I only see an eye (viewed in profile) with the legs of the A being eye lashes.

SchizoDuckie · on May 5, 2015

Oh wow, so this is not ACID, not "Eventual Consistency", but "Eventual Inconsistency"

cschneid · on May 6, 2015

What is the 3 color chart that I've seen posted in several of these articles? I get that it breaks down the different CAP combinations with the things that it allows/implies. But is there a good breakdown of all the terms and what they mean?

dwetterau · on May 6, 2015

Here is a paper [0] out of Berkeley that explains the chart you're asking about (the chart appears on page 8). For more information on each isolation level you might have to refer to the cited works or other sources.

[0] http://db.cs.berkeley.edu/papers/vldb14-hats.pdf

cschneid · on May 7, 2015

Thank you! That's perfect.

gmagnusson · on May 6, 2015

I've used Aerospike at scale (approx 1MM tx per second) in private network, and smaller loads in cloud. I have always found it to be fast, reliable and extremely easy to operate (upgrade, modify cluster members, etc) w/o any downtime or interruption. It is a critical tool in my toolbox. I also have found their support and engineering team to be excellent.

I admire the work that Aphyr does - though at the end of the day, I need to build systems that work for the problem I'm trying to solve (and I have to choose from real things that are available).

Aerospike isn't the solution to every storage problem, and if you are choosing technology based on marketing material, you're probably going to be disappointed.

These technologies in general are trying to address really hard problems and design and architecture is the art of balancing tradeoffs. Nothing is going to be perfect. Yet.

ploxiln · on May 6, 2015

"though at the end of the day" "nothing is perfect" ... Aerospike makes blatantly ridiculous promises in their high-level descriptions of their database. That makes our jobs harder (before Aphyr makes them easier) because we don't know what Aerospike is actually good at or exactly what kinds of data loss potential we need to architect our systems around.

Isn't it kind of annoying that some technical projects bolster their popularity/ecosystem with very fancy websites and impressive/competitive claims, but to really do your job right you have to throw all that away? The best you can do is try to get a sense from the reports of others who have tried something (and may or may not have been rigorous in their evaluation) so you can pick good candidates to even put through trials. (so again, thanks Aphyr)

gmagnusson · on May 6, 2015

I can't and won't defend Aerospike's descriptions on website or white paper. And yes, "Thanks Aphyr".

I came across Aerospike technology via a pre-existing system at a previous employer, and watched that system scale up and perform in a serious way. It wasn't all unicorns and roses all the time as real life never is, but in the context of the real world, it was great. The software is rock solid in a way I've rarely come across, and support was spectacular. (I forget my current production clusters are even running sometimes they are so stable, reliable and self-operating)

And at the end of the day, there was no other solution out there remotely competitive that we could find. And I looked - not because we were dissatisfied, but because that was our fiduciary responsibility to the company, to ensure that we were deploying the most cost-effective systems that met our feature and performance requirements.

Ultimately yes - I think that as an engineer, you need to understand what your tools are really capable of and avoid doing what I call "BDD" (Blog-Driven Design). That isn't the ideal answer - it would be nice to have a reliable understanding of the capabilities of the materials we use to build systems (like civil engineers can reason about materials like steel and concrete in repeatable ways) but what we call "software engineering and architecture" is still a very young discipline, very often with unrealistic expectations about our ability to deliver in given budgetary and temporal constraints, so we do what we can.

nemothekid · on May 6, 2015

While people tend to use Aphyr's posts as anti-nosql-cannon-fodder, I don't think he has ever advocated for anyone to straight up not use a database or used any of these results to show that no one should use NoSQL ever.

These seem like highly detailed Github Issues (infact the recent elasticsearch was a GH issue-turned-blog post), and these issue are brought to attention so that they could be fixed - not to slander the name of the company (and when they are, everyone benefits). IIRC, even after finding these bugs were published he continued to use elasticsearch.

Given how hard these problems are and how difficult they can be to reproduce, these writeups seem to be the most appropriate way to highlight these issues.

That said, if I was an aerospike user I'd be happier knowing this issue exists, someone has debugged it, and supported a detailed report about rather than being called in at 3am and discovering our data is funky.

draven · on May 6, 2015

In addition the testing methodology (the blog post) and the code (on github) are available so the tests can be reproduced to validate any enhancement or bugfix.

Those blog posts are also a great at debunking marketing claims.

pkaye · on May 6, 2015

Fast, functional or reliable... pick two.

me_again · on May 6, 2015

ITYM consistent, available or partition-tolerant...

Jweb_Guru · on May 6, 2015

...but you can't pick CA.

Something Aerospike didn't realize.

(And nope, sorry, I'm completely uninterested in your anecdotes about how you haven't personally lost data when [1] there's a clear data loss scenario highlighted in the post, [2] Aerospike actively recommend services like EC2 and GCE that routinely partition, and [3] there are people in this thread who have experienced the same problems).

acqq · on May 6, 2015

I was wondering what is going on, with the titles of this form appearing on HN and now browsing the blog, it appears that the author is the fan of the "Call me maybe" titled posts (a lot of them there, and then also here!). From what I understand, this phrase, as used in the titles by him, seem to mean to him something like the "review of" something or the "comment on" something. For what it's worth.

beberlei · on May 6, 2015

It is a pun on this song https://www.youtube.com/watch?v=fWNaR-rxAic

acqq · on May 6, 2015

Apart from author being obviously inspired by the song I admit I don't see any connection or anything worth naming that "the pun." Maybe it's just me.

chipsy · on May 6, 2015

Making a phone call is an asynchronous event, and as the song suggests, sometimes you give someone your number but they never call back. With any distributed system in real world conditions, a similar situation arises where a request doesn't get handled or is lost along the way.

</explainer>

acqq · on May 7, 2015

So he uses the phrase instead of "distributed systems?" (shrug)

lmm · on May 6, 2015

It's explained at the top of the original post: https://aphyr.com/posts/281-call-me-maybe-carly-rae-jepsen-a...

jack9 · on May 5, 2015

The minimum useful licensing is in the tens of thousands of dollars. For the SLA they offer, it makes sense. Many well known Ad Serving companies utilize Aerospike (at a fraction of the cost of their previous solutions). It's very impressive result, per machine, from an operational standpoint.

theVirginian · on May 6, 2015

"schemaless" nope

tootie · on May 6, 2015

Is there any reason he's never tried to analyze a "classic" RDBMS like Oracle or SQL Server? I have to imagine they'd clobber a lot of this hipster technology.

electrum · on May 6, 2015

He tested PostgreSQL: https://aphyr.com/posts/282-call-me-maybe-postgres

threeseed · on May 6, 2015

I don't understand how this is comparable though.

All of other databases were tested in clustered mode. Why not PostgreSQL as well ?

Jweb_Guru · on May 6, 2015

Postgres doesn't have a builtin clustered mode, or claim to be totally available (and also, incidentally, doesn't guarantee serializability for hot standby replication targets, which is prominently stated in a bright red box that says "warning" in the documentation on serializable isolation, i.e. the first place you would look). It claims to be CP in a single-node configuration (which it is) and aphyr tested it on those claims. Remember, Jepsen isn't about proving that a database can't violate CAP, nor is it trying to say "all databases are crap." It's about verifying the marketing claims and then determining whether there are any mitigating strategies. The fact that many databases make unreasonable marketing claims is unfortunate, but certainly not a requirement.

It's also worth posing this question in reverse: what would happen if these distributed databases were tested in a single-cluster configuration? As noted in the most recent article on Elasticsearch, many of them (e.g. Elasticsearch, Cassandra, and Riak) acknowledge writes before fsync and can therefore lose data due to issues like `kill -9`, power loss, and other exceptional conditions, while Postgres doesn't. For a single-node database this robustness is very important, while he argues that it isn't as important for a distributed one. Because these databases aren't designed to be used as single nodes, aphyr didn't substantially ding them for that. Again, what's important is whether the database does what its documentation says it does when used as its documentation says it should be used.

rsynnott · on May 6, 2015

I don't think postgres has a first-party clustering system.

samkone · on May 6, 2015

Lol He can't because Oracle licensing forbids to do this kind of work and make the results public.

tootie · on May 7, 2015

Thank for a real answer. Do you have a source for that assertion. Is Microsoft the same?

jhugg · on May 6, 2015

Well, those systems are largely not distributed, barring SQL Parallel Data Warehouse and, very arguably, RAC.

Jepson tests network partitions... so less useful.

tootie · on May 7, 2015

The whole draw of Oracle for the last 20+ years is how well it can replicate across a cluster over a network. Maybe it's not quite the same as how modern KV stores work, but it's still guaranteeing consistency across a network and is therefore a candidate for jepsen testing.

wglb · on May 6, 2015

Here is a bit of an index page: https://aphyr.com/tags/jepsen. It doesn't appear that he has. And yes, I would guess that they would indeed clobber those written there.

preetamjinka · on May 6, 2015

Postgres is there.

wglb · on May 7, 2015

Yes, it does. My oversight. Thanks.

elchief · on May 5, 2015

If you're big enough to require something like Aerospike, you're rich enough to build something like F1.

vosper · on May 5, 2015

I dunno, there are ad-tech companies with less than 20 engineers who're processing millions of events per minute on their endpoints, and trying to do various things with that data. This from personal experience.

Of course, I would love if someone gave me the mandate to go out and build something like F1...

rbetts · on May 6, 2015

You couldn't even hire a couple of experienced engineers let alone come close to affording to build F1. There are several zeros in the cost difference.

nemothekid · on May 6, 2015

I can think of quite a few companies that are big enough to require something like Aerospike, but wouldn't be willing to outfit all their machines with a GPS receiver.

0xbadcafebee · on May 6, 2015

Even if your company has 20 billion in cash, if your department isn't called "sales" or "marketing", good luck getting the budget for that.

And honestly, if you have money, it's a lot simpler and less risky to just hire 20 DBAs and programmers to build a database application that can handle that kind of operation. Low latency, network-partition-resistant, high-performance database applications are not a new thing.

DAddYE · on May 5, 2015

Awesome read and still quite impressed by Aerospike.

preetamjinka · on May 5, 2015

What is impressive about being somewhat misleading and cutting corners to gain performance?

DAddYE · on May 5, 2015

As said by Aphyr this product is ideal in the ad-tech. Of course is misleading but good engs will test products to see if it fit their scenario without basing decision on marketing slides. Yes, ideally they should be more clear and I bet after this article they will.

rgbrenner · on May 5, 2015

As said by Aphyr this product is ideal in the ad-tech

No, Aphyr said the data loss is ok for ad tracking and analytics because it doesn't matter. That's very different.

And if that's the case, then why make those claims.. they could just as easily give accurate info to their customers, and the customer could decide if that fits their case. Instead they claim something very difficult (if not impossible), and let their customers find out it's not true (possibly after it's too late, and they've already lost valuable data).

Dylan16807 · on May 5, 2015

It's easy to make something that's fast and doesn't always work. Give a coder a day and they can make a high speed, replicated, key-value store that loses a little bit of data during partitions. Heck, that could be a single homework assignment in an algorithm or distributed system class.

manigandham · on May 6, 2015

We use Aerospike heavily. It works just fine.

I'm constantly surprised by the general tone of comments on posts like these as if it's some crazy revelation that this software still obeys the fundamental laws of distributed systems.

There is no perfect database out there, all of them will fail with network partitions. Aerospike was designed to work in clusters that are very close together, often the same rack. It has much tighter timings and tolerances in exchange for providing much higher performance in certain situations and definitely has one of the best SSD focused storage systems I've come across.

If you don't have a high performance network interconnect between nodes, then there will be more issues with Aerospike since it relies on that more than some other system that use Paxos for all writes (like aphyr mentions). We run several TB's of data accessed at 100k+ TPS including very fine grained counters and everything works. And yes, we run on the cloud in AWS and SoftLayer and have yet to have major problems with the proper network setup.

Btw, there is a comment below from the current CTO of AppNexus, one of the companies that pioneered real-time bidding for digital ads and runs several million auctions per second on one of the biggest ad exchanges available. They were the first customer for Aerospike and from everything I've learned from their team, it works really well for them, and they definitely are not happy to just "lose" data however insignificant it might seem. Volume changes everything and even a fraction of a percent will add up. We trust Aerospike because it's been hardened by lots of much much larger companies with very high production usage, the key is being aware of all the technical requirements and the environment you're deploying in.

I think the real major issue here that people seem upset with are the general claims and marketing information. I can't speak to all that and there are definitely some things like 100% uptime which do seem overly confident, but this is true of every single technology vendor out there unfortunately. I'm not saying Aerospike is any better or worse as a company but marketing material only goes so far and it would surprise me if further research wasn't done for any mission critical system.

teraflop · on May 6, 2015

> There is no perfect database out there, all of them will fail with network partitions.

Some of them will fail in a way that keeps your data safe, others will fail in a way that preserves uptime but gives you temporarily inconsistent data. Aerospike apparently does neither. Why is it unreasonable to expect them not to falsely claim otherwise?

The "crazy revelation" for me was not that Aerospike's software is, like everything else, subject to the CAP theorem. It's that they apparently think it's awesome to claim that it isn't, and charge tens of thousands of dollars for their product on that justification.

chisleu · on May 6, 2015

Most high end DB support is very expensive. Cassandra, Oracle, etc, etc.

What bothers me more than anything was Aerospike felt the need to compare a benchmark done on an in-memory dataset directly with a Cassandra benchmark on dataset that was many fold larger into disk. They made this comparison and said "look! Aerospike is x times faster than Cassandra!"

That was the end of giving a shit for me. When people feel they need to lie to convince you of something, that is when I know that I don't want what they want.

Incidentally, that is also how I stopped caring about politics.

manigandham · on May 6, 2015

1) Aerospike is open-source and has a free community edition if you need it.

2) Yes, marketing claims are BS. If this was a reason to not use something, we'd have to stop using pretty much every other commercial piece of software we have. That's why we test and run software in our environment, and there... aerospike works. Really well. Even with network partitions. So I can understand kyle's tests in this post and the reasoning and results but there's still a big gap between this testing and the reality our company has experienced.

yellowapple · on May 6, 2015

> Yes, marketing claims are BS. If this was a reason to not use something, we'd have to stop using pretty much every other commercial piece of software we have.

Which is what I at least have indeed opted to do; I avoid commercial software like the plague for this very reason, using it only when there isn't an alternative (like when it's a legacy system that has to be interfaced with). There are plenty of free software projects that don't make outrageous marketing claims and - therefore - aren't nearly as susceptible to disappointment and wasted money.

Aerospike's claims border on the realm of false advertising (if they don't actually classify as false advertising, which is a big "if"; the claim of 100% uptime is dubious at best and more likely to be an outright-malicious lie). Why should they get my money?

obstinate · on May 6, 2015

> We trust Aerospike because it's been hardened by lots of much much larger companies with very high production usage

Why would you trust something after its untrustworthiness is demonstrated before your eyes? Just because some other companies use it and haven't yet publicized dissatisfaction? That is not how you make sound engineering decisions.

manigandham · on May 6, 2015

Demonstrated? It's just a single post, we would have to replicate these results ourselves and our specific environment.

We use this 24/7 in a production system and have not encountered any issues and it matches actual data and experience from real conversations and meetings with other companies. We don't make decisions from blog posts.

vruiz · on May 6, 2015

aphyr is a well respected expert in the matter. You can reproduce the test yourself https://github.com/aphyr/jepsen

Consider that if you have not experienced any issues yet it might say more about your network stability than aerospike. Of course you can chose to ignore it, but network partitions eventually do happen, and when it does I hope your data is not mission critical.

manigandham · on May 6, 2015

Are all the other companies and engineers using it useless then? One guy is not enough to make any opinion either way. And yes, we put a lot of effort into making sure the hardware and network are good, because when that works right everything else works well too. Trying to solve hardware issues with software is bound to lead to misery.

We have network partitions all the time, that's how we upgrade. On average each node is replaced every 2 weeks and we just terminate it through the API (both softlayer and aws). No big deal and we haven't lost any data yet, confirmed by other records in other datastores that have to match up.

If aphyr's post is the ultimate rating, why would anyone use anything thing else he's written about?

vruiz · on May 6, 2015

> Are all the other companies and engineers using it useless then?

Of course not, software is about trade-offs, and every company has different use cases. Is it their primary datastore? if they lose data do they lose some data samples for a recommendation engine? or someone's money? I wouldn't assume "if it works for them it works for me".

> Trying to solve hardware issues with software is bound to lead to misery.

I strongly disagree with that, I believe exactly the oposite.

> If aphyr's post is the ultimate rating, why would anyone use anything thing else he's written about?

He does not judge the system's usefulness, throughput, etc. But he's a good benchmark for distributed system's reliability. While he might not test every possible scenario, if he says software X loses data on conditions Y, I do believe him. It's still up to me to decide if that matters for my use case or not.

themartorana · on May 6, 2015

Sad that this was downvoted. I'll add an upvote because I think it's relevant.

pjc50 · on May 6, 2015

Genuine question: how do you know you've never lost a single write?

manigandham · on May 6, 2015

Different sets of data goes to different pipelines.

eg: aerospike will have counters to cap a certain transaction and when we do offline aggregations from logs written through a completely different system, the numbers have to match.

If something is capped to spend $100 and the aggregations don't match up to exactly $100, then there's something wrong, especially with very fine grain numbers.

woozy · on May 6, 2015

I adore its SSD read/write optimization. The rest were understood as marketing materials.