Achieving 100M database inserts per second using Apache Accumulo and D4M [pdf]

dijit · on Jan 23, 2017

I hate to be a hater.

But the big issue with databases I've worked with is not how many inserts you do per second, even spinning rust, if properly reasoned can do -serious- inserts per second in append only data structures like myisam, redis even lucene. However the issue comes when you want to read that data or, more horribly, update that data. Updates, by definition are a read and a write to commuted data, this can cause fragmentation and other huge headaches.

I'd love to see someone do updates 1,000,000/s

jamesblonde · on Jan 23, 2017

MySQL Cluster (NDB - not Innodb, not MyISM. NDB is the network database engine - built by Ericsson, taken on by MySQL. It is a distributed, in-memory, no shared state DB). >50% of mobile calls use NDB as a Home Location Registry, and they can handle >1m transactional writes per second. We have verified the benchmarks, that you can get millions of transactions/sec (updates are about 50% slower than reads) on commodity hardware with just gigabit ethernet. Oracle claim up to 200m transactions/sec on Infiniband and good hardware: http://www.slideshare.net/frazerClement/200-million-qps-on-c...

Two amazing facts about MySQL Cluster: (1) It's open source (2) It's never penetrated the Silicon Valley Echo Chamber, but is still the world's best DB for write-intensive transactional workloads.

viraptor · on Jan 24, 2017

Another reason may be that it seems you need to be an expert in internal workings of the ndb to deal with it. I tried to use it. Over a month ended up with a few cases of "can't start the database", "can't update", etc. with no real solutions. There's not that much information about ndb on the internet, so googling doesn't help. The best help I got was from mysql-ndb irc channel, but in practice when things went bad, people said something like "if you send me the database, I'll help you figure it out". This does not work in practice.

I feel like it would be more popular if people actually wrote about how they're using it. But how do you start when even basic configuration bugs are still open without progress: https://bugs.mysql.com/bug.php?id=28292

(this was a few years ago or so, maybe things changed)

jamesblonde · on Jan 24, 2017

Yes, that's a good few years ago. Stability issues were mostly fixed around 8 years ago. It's now very stable. There are products like www.severalnines.com and MySQL Manager to setup and manage instances for you with a UI. If you want to roll your own, there are chef cookbooks for installing ndb - https://github.com/hopshadoop/ndb-chef .

But if you really want to squeeze performance out of it, read Mikael's blog: http://mikaelronstrom.blogspot.se/ and Frazer's blog: http://messagepassing.blogspot.se/

doikor · on Jan 23, 2017

Some systems (like Cassandra) has upserts and can do writes without reading. Though you loose any kind of "transaction" safety in that you are not sure what you are writing on top of. But in my experience for the vast majority of cases that is ok.

gopalv · on Jan 23, 2017

> Some systems (like Cassandra) has upserts and can do writes without reading.

That forces the latest-version resolution into a read-side problem - systems like that cannot handle any sort of complete table-scans but can only really work for a pre-known key.

With a known unique key, this is somewhat possible to use systems like this, but extremely expensive when what you need is BETWEEN '2017-01-01' and '2017-01-02'.

doikor · on Jan 24, 2017

We have a lot of data like that and just use a compound partition key of (year, month, day) tuple and then we can ask stuff on day resolution. We also add exact timestamp into clustering key so you can only ask for more precise data.

With Cassandra (well most nosql dbs that I've run into) you model your data in the db by the queries. So if you want to scan by date then you create a new table (or a materialized view) with the date as the key you query for. Though even for then you have to keep in mind the amount of data to make sure you don't have too much data per partition and hotspots.

jjirsa · on Jan 24, 2017

Cassandra has 2 components to its primary key - 1 that sets the partition (which includes which nodes get the data), and 1 that sets the clustering/sorting/ordering within that partition.

If you use a clustering key, scans become trivial within a given partition.

gegtik · on Jan 24, 2017

depends on your structure. you could scan though dates as a partition's cluster key for example

X86BSD · on Jan 23, 2017

You are so right. Without giving to much detail the wife is working on a project with a branch of government. They use Oracle. Her teams biggest problems are and have been the DB falling over during updates or joins. Were talking billions of rows of government data. Running out of table space, loads taking literally days to complete. Thing's that I have never seen happen. I'm not sure if it's the DB architecture itself or clueless people doing really dumb things. I only get to see these failures second hand.

throwawayish · on Jan 23, 2017

If they upgrade to all in-memory databases their full-table scans will run faster

/s

espeed · on Jan 23, 2017

See LMDB (embedded) and ScyllaDB (distributed):

The Lightning Memory-Mapped Database (LMDB) https://www.youtube.com/watch?v=Rx1-in-a1Xc

ScyllaDB http://www.scylladb.com/

espeed · on Jan 24, 2017

Also Apache Geode (distributed, in-memory, consistent, durable, crazy fast) -- seriously solid tech.

http://geode.apache.org/

ddorian43 · on Feb 2, 2017

Also SnappyData which uses code from geode.

itchyouch · on Jan 24, 2017

Not really a database per se, but trading system matching engines can maintain updates in excess of 1M/s. Of course these are small 60 byte messages updating 10-15 bytes of a given order book in something like a RB-tree, so it's not as impressive.

What is impressive I guess is that they can sustain 1M updates/sec at a 99.9th percentile update round-trip of ~10 microseconds or so, with medians at around ~4 microseconds.

Xorlev · on Jan 23, 2017

Since each ingest process talked directly to the Accumulo tablet local to it, it really measured loopback+RPC+DFS performance. Knowing how these things usually go, it might have been 100M rows/s but only 100k-1M RPCs/s. It's still quite impressive, but it's important to keep it in perspective. For example, I believe Google's C* 1M writes/s demo also included real network overhead from driver processes. Additionally, that was with the WAL on, vs. this Accumulo run which disabled the WAL.

Our graph store (HBase, SSD) on 10 nodes can easily support 3M edges/s read/stored, but thats ~40k RPCs/s given our column sizes and average batch size.

eru · on Jan 23, 2017

> However the issue comes when you want to read that data or, more horribly, update that data.

A log structure for your database would make the update case more similar to the append case, wouldn't it?

(There are definite limits to that technique, but it does work for eg some file systems---which are also a type of database.)

throwawayish · on Jan 23, 2017

A transaction log needs to be cleaned up, which takes time. You might amortize cleanup over many small objects.

> but it does work for eg some file systems---which are also a type of database

No mainstream FS uses the txn log as a primary store. The log(s) are only around for active writes to metadata, as soon as the second write is out the spot in the log can/will be reused. Similar to the clean up case FSes try to batch as many ops as they fit into their log(s) before flushing the transaction.

Apart from the CoW herd (ZFS, btrfs, HAMMER) most FSes either can't or won't journal data by default anyway. -- Which is a fairly often overlooked point, many people seem to assume that a journaling FS means that everything, include their application data, is journaled, which isn't just wrong, but can also be a rather dangerous assumption; depending on application.

eru · on Jan 24, 2017

Thanks for giving more background to my vague hunches!

CyberDildonics · on Jan 24, 2017

In a key value store? That can be done at 1,000,000/s per core.

espeed · on Jan 23, 2017

This is work is related to the MIT D4M Course, GraphBLAS, and Graphulo:

Standards for Graph Algorithm Primitives http://www.netlib.org/utk/people/JackDongarra/PAPERS/GraphPr...

GraphBLAS: A Programming Specification for Graph Analysis [video] https://www.youtube.com/watch?v=6tnzSiq8QBo

http://graphblas.org

Graphulo: Graph Analytics in Apache Accumulo [video] https://www.youtube.com/watch?v=nsmFjZNl60s

https://github.com/Accla/graphulo

MIT D4M: Signal Processing on Databases http://www.mit.edu/~kepner/D4M/

https://ocw.mit.edu/resources/res-ll-005-d4m-signal-processi...

Video Lectures: https://www.youtube.com/watch?v=zNGKX-4PRsk&list=PLUl4u3cNGP...

Book: Graph Algorithms in the Language of Linear Algebra http://epubs.siam.org/doi/book/10.1137/1.9780898719918

D4M: Bringing Associative Arrays to Database Engines (2015) [pdf] https://arxiv.org/pdf/1508.07371.pdf

privacyfornow · on Jan 23, 2017

I think this is a piece of the pie. The thing to recognize that I think is just as important is that it is possible to state several common use cases (even synchronous microservices) as collections of append only immutable logs for system of record and an in-memory read/view state for readers and mutating functions.

I am using this pattern in risk, fraud and commerce and once new members in my teams get over the mental barrier of decoupling the append only log from state, it all just clicks for them.

freeman478 · on Jan 23, 2017

Looks interesing ! Isn't this some kind of event sourcing or am I missing something ?

nrjdhsbsid · on Jan 24, 2017

So... When do you reach the point Where it's better to just use a hashtable in ram? Super high speed "in memory" databases are still beat by manipulating the data structures yourself.

I feel like there's very limited applications where all out speed is important and it's better to use a database than just do the operation yourself in ram and save the network overhead.

You can insert billions of items per second into a hashtable, and when you're working in your own app memory transactions aren't needed.

jandrewrogers · on Jan 23, 2017

The key point is that they managed to do it with Accumulo; the insertion rate through storage is otherwise unremarkable. For 10 GbE clusters, line-rate insertion has been relatively easy to achieve for several years now.

An important point is that they disabled all of the durability, replication, and safety features. Graph500 records are quite small so that insertion rate given the size of their cluster implies average throughput that is significantly less than line-rate.

hmottestad · on Jan 23, 2017

I was wondering. Does anyone here use Accumulo internally or for a client?

I had not heard of it before and the line "widely used for government applications" made me wonder why I hadn't. I'm a consultant working with graphs in Norway and this database is completely new to me.

Scaevolus · on Jan 23, 2017

It was developed by the NSA based on Google's Bigtable paper, so it has a lot of US government users. I think Apache HBase or Cassandra are far more popular NoSQL solutions for most users.

mikecb · on Jan 24, 2017

If you don't need cell level access control, you probably wouldn't be looking for it. I guess hbase now offers this too.

threeseed · on Jan 23, 2017

No. It's not popular at all since you can't get enterprise support for it.

With Cassandra you can goto DataStax and if you are using HBase often you are using Hadoop therefore you can get support from Hortonworks or Cloudera.

mikecb · on Jan 24, 2017

Really? Sqrrl and Cloudera offer support.

arp2600 · on Jan 24, 2017

Hortonworks also offers support for Apache Accumulo.

espeed · on Jan 23, 2017

The principles applied in the paper aren't limited to Accumulo.

lngnmn · on Jan 23, 2017

The word database means transactions committed to a persistent, durable storage (such that the data could survive a reboot).

castis · on Jan 23, 2017

iirc "database" means an organized collection of data. persistence is just a nice feature.

threeseed · on Jan 23, 2017

It literally means: "a structured set of data held in a computer, especially one that is accessible in various ways."

You are right that restart survivable persistence has absolutely nothing to do with it.

sbov · on Jan 23, 2017

This comment branch is a weird topic, but the implication of these definitions is that modern languages come packaged with multiple databases, some of which can scale to billions of writes per second (arrays, lists, sets, etc.)

flukus · on Jan 24, 2017

So I can write a linked list and call it a database?

eternalban · on Jan 24, 2017

No. OPs have datastores confused with databases.

People used to get filthy rich (hi there Larry) building databases (which has all sorts of data access & management goodies) until 21st century came along and in the name of "progress" a hashtable with an HTTP interface was called a "database". Of course as nod to the said filthy rich guys from the 20th century we called them "noSQL" "databases" so as to let them continue to charge money from all those silly people in the 20th century that built their information systems on (the now "defunct" :) "databases".

lngnmn · on Jan 24, 2017

> OPs have datastores confused with databases.

The other way around. All the in-memory stuff is nothing but a cache, distributed or not.

Datastore is a layer, a set of routines and interfaces, which maintains persistent storage, such as Informix C-ISAM.

I am old-school Informix DBA, so we knew that Larry has been a cheater.

eternalban · on Jan 24, 2017

> All the in-memory stuff is nothing but a cache, distributed or not.

That's flat out wrong.

A cache is a limited store that maintains a sub-set of the data that has been placed in it. Typically the eviction policy is temporal.

A database or datastore on the other hand requires explicit removal of data that has been placed in it.

p.s. Which is why we can meaningfully talk about "cache misses" in context of a perfectly functional cache, but "missing" data in a database is generally discussed in context of a buggy datastore or database.

p.s.s. Most certainly we can have pure in-memory datastores/databases, as semantics of a database are orthogonal to its datastore component. And you don't have to take some random hn user's word for it. Go ahead and ask Michael Stonebraker. :)

lngnmn · on Jan 24, 2017

> That's flat out wrong.

Oh, come on. Old time RDBMS have been designed to survive a sudden power outage - the system would rollback all the partially committed transactions to recover to the consistent state of the database.

The durability and data consistency is what defines a database in the first place.

For cache vs. database metaphor - think of the difference between a RAM-disk and HDD. This is fundamental difference. Protocol details are irrelevant.

This is why, say, sqlite is a database, while redis is not.

eternalban · on Jan 24, 2017

You are confusing the distinction between a cache & a non-durable datastore. Of course Redis is not a database, did I say it was?

Data is implicitly removed from a "cache" in course of normal operation.

lngnmn · on Jan 24, 2017

Lets try it one last time.

In-memory database is as much a database as a RAM-disk is a disk.

eternalban · on Jan 24, 2017

Reducing technology assessment to puns is ultimately not very informative. I recommend you read up on Stonebraker's work the past few years. Durability issues of in-mem DBs are addressed.

It is certainly true that many pure mem vendors out there are playing fast and loose with terminology, but if one were to accept your assertions, then when the day arrives when spinning disks are antiques then I guess we no longer would have databases.

lngnmn · on Jan 24, 2017

Nope. Old definitions still holds. If some hipsters are re-using the word to bullshit people it does not mean that correct connotations are deprecated.

throwawayish · on Jan 23, 2017

I correct you. You are wrong.