A couple of points I would make with respect to the article:
- In-memory databases offer few advantages over a disk-backed database with a properly designed I/O scheduler. In-memory databases are generally only faster if the disk-backed database uses mmap() for cache replacement or similarly terrible I/O scheduling. The big advantage of in-memory databases is that you avoid the enormously complicated implementation task of writing a good I/O scheduler and disk cache. For the user, there is little performance difference for a given workload on a given piece of server hardware.
- Data structures and algorithms have long existed for supercomputing applications that are very effective at exploiting cache and RAM locality. Most supercomputing applications are actually bottlenecked by memory bandwidth (not compute). Few databases do things this way -- it is a bit outside the evolutionary history of database internals -- because few database designers have experience optimizing for memory bandwidth. This is one of the reasons that some disk-backed databases like SpaceCurve have much higher throughput than in-memory databases: excellent I/O scheduling (no I/O bottlenecks) and memory-bandwidth-optimized internals (higher throughput of what is in cache).
The trend in database engines is highly pipelined execution paths within a single thread with almost no coordination or interactions between threads. If you look at codes that are designed to optimize memory bandwidth, this is the way they are designed. No context switching and virtually no shared data structures. Properly implemented, you can easily saturate both sides of a 10GbE NIC on a modest server simultaneously for many database workloads.
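The shared-nothing, pipelined design described above can be sketched in a few lines. This is a toy illustration, not any real engine's code: each worker owns its partition of the keyspace outright, so the hot path never needs locks or shared data structures. All names here are invented for the example.

```python
class Worker:
    """Owns one partition of the keyspace; no other thread touches it."""
    def __init__(self):
        self.store = {}          # private to this worker: no locking required

    def put(self, key, value):
        self.store[key] = value

    def get(self, key):
        return self.store.get(key)

class ShardedEngine:
    def __init__(self, n_workers=4):
        self.workers = [Worker() for _ in range(n_workers)]

    def _route(self, key):
        # Deterministic routing: a given key always lands on the same
        # worker, so cross-worker coordination never happens on the hot path.
        return self.workers[hash(key) % len(self.workers)]

    def put(self, key, value):
        self._route(key).put(key, value)

    def get(self, key):
        return self._route(key).get(key)

engine = ShardedEngine()
engine.put("user:1", {"name": "Ada"})
print(engine.get("user:1"))
```

In a real engine each worker would be pinned to a core and fed by its own request queue; the point is only that partitioning by key removes the need for shared structures entirely.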
> In-memory databases offer few advantages over a disk-backed database with a properly designed I/O scheduler.
Hmm... Michael Stonebraker would probably disagree with you on that. For his various newer projects, he claims, and somewhat convincingly shows, 10x or more performance improvements. The analysis is that only ~10% of the work a typical disk-based system does is useful work; by radically simplifying as a result of doing away with the disk store, you remove those overheads.
A similar argument (also with benchmarks) is made by the RAMCloud people, they claim up to 1000x perf. increase over disk-based storage for a data-center.
Since I am mostly a dabbler (but have also managed to outperform RDBMSs by a factor of 1000 or more using RAM-based techniques), I would be curious as to what these papers get wrong.
No, Stonebraker is making many architectural assumptions that do not necessarily hold. Some observations about the design of good disk-based systems:
- Typical inexpensive server and disk systems today are quite a bit different than even five years ago. A database internals design heuristic that would be valid several years ago no longer applies in many cases. An "optimal" design can vary widely due to relatively small changes to the assumed constraints.
- The disk bandwidth typically exceeds network bandwidth by a significant degree. Even if you are saturating the network with inserts and similar, in theory you should be able to drive that through storage if you use the available IOPS efficiently and proactively.
- RAM is RAM. Whether backed by disk or not, the amount of data it can hold is approximately the same. Any significant differences in what you can put in RAM between in-memory and on-disk architectures has more to do with internal implementation and design choices, it is not intrinsic.
- In modern database kernels, no thread is waiting for I/O operations to complete. There are also few or no tree structures, locks, secondary indexes, context switches, and the like, which are frequently the source of poor throughput. It tends to be much more linear, pipelined, and parallel.
All of the above aside, regardless of design, a database engine only needs enough throughput to saturate the network interface under load. Once the network is saturated, nothing you do in the database engine will improve the effective throughput of the database.
It turns out that it is pretty straightforward to design a disk-backed database kernel that can saturate 10GbE full-duplex for a wide range of workloads. Consequently, the supposed performance benefits of in-memory are largely moot. It simplifies implementation but does not address a real performance problem on modern hardware relative to a modern database kernel design.
"...single-threaded, lock-free, doesn’t require disk I/O in the critical path, ..."
IIRC, the single-threadedness is made possible in large part by not having I/O on the critical path. (When you wait on I/O, a single-threaded design stalls completely, so either you multi-thread somehow or your throughput dies.)
Can you give some refs to make these statements more real?
I'm trying to understand whether an in-memory db would be faster than postgres, for instance. What's the postgres I/O scheduler, and is it good? Are there benchmarks somewhere showing the difference?
Why does the I/O scheduler make a difference for in-memory vs. disk databases? Is this a subsystem that caches the database in memory? Are you saying that, with proper caching, a disk-based database is as efficient as an in-memory db?
PostgreSQL is a bit of a hybrid in terms of scheduling. It has its own disk cache but still goes through the kernel mechanisms. Classic database I/O schedulers are not portable so it is a bit of a challenge to implement one in big open source projects for reasons outside of technical ability.
With a proper cache and I/O scheduler, the same working set fits in memory either way, so the only way the disk gets in the way is if the I/O scheduler does something suboptimal with respect to writes (which happens a lot with the kernel's caching behavior).
Modern database servers typically have more disk bandwidth than network bandwidth and therefore most write workloads should be able to go through storage at the network's wire speed, at least in theory. In practice, I/O scheduling behavior from memory to storage tends to be bursty or poorly timed. Consequently, the instantaneous I/O bandwidth requirements can exceed the effective disk bandwidth for brief periods and performance degrades.
A really good database I/O scheduler, cache, and execution engine work together to make sure that peak bandwidth demands on the disk subsystem are never much worse than the network wire speed. Achieving this is not trivial and requires a lot of clever dynamic resource optimization, but many sophisticated database engines implement this to some degree or another.
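The smoothing idea can be sketched crudely: instead of letting the cache dump dirty pages in bursts, a pacing scheduler drains them at a steady per-tick budget capped near the network wire speed. All the numbers and names here are made up for illustration.

```python
from collections import deque

def paced_flush(dirty_pages, wire_speed_pages_per_tick, ticks):
    """Drain dirty pages at a fixed per-tick budget and record the
    instantaneous disk demand observed each tick."""
    queue = deque(dirty_pages)
    demand = []
    for _ in range(ticks):
        flushed = 0
        while queue and flushed < wire_speed_pages_per_tick:
            queue.popleft()              # "write" one page to disk
            flushed += 1
        demand.append(flushed)
    return demand

# A burst of 100 dirty pages, drained at 10 pages/tick: peak disk demand
# stays pinned at 10 instead of spiking to 100 in one tick.
print(paced_flush(range(100), 10, 12))
```

The real problem is much harder (budgets must adapt to the workload, and reads compete with flushes), but this is the basic shape of trading latency of individual flushes for a flat bandwidth profile.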
I think it also depends on how you structure your database and how you use it. For some things a NoSQL design is great: most of them don't have to lock across multiple tables, so they can be very fast at fetching structured records. They might be terrible if you try to join different records (especially if they are located on different shards), and some don't have ACID, if that is a requirement. Some databases like Datomic store information in a completely different way that may be very fast for certain use cases.
If you have an application or a specific use case, you can try a few different databases and measure how they work. I tried a small in-memory database and got much better performance than SQL Server gave me, while the queries were easier to write; for other things this database would not have worked at all, or it might have been about as fast as a relational DB. As a general rule the in-memory DB providers seem to make the case that they are faster than normal relational databases. Take a look at VoltDB or Starcounter, for example.
How is that in any way relevant? All ARIES write-ahead-logging systems are single-writer to their logs too. No matter what kind of BS they sell you about high concurrency with fine-grained locking, it all funnels through that single bottleneck.
The fact is that LMDB's single-writer design is not a concurrency liability in real workloads.
- Easy sharding, a la Elasticsearch. I want virtual shards that can be moved node to node and an easy to understand primary/replica shard system for write/reads. I want my DB nodes to find each other with an easy discovery system with plugins for AWS/Azure/Digital Ocean etc.
- Fucking SQL. I don't want to learn your stupid DSL. I want to give coworkers a SQL client and say "go! You already know how to use this!". If I want a new feature, then dammit, build on top of SQL the way PostgreSQL has. Odds are, regardless of whether it's some JSON API or SQL, my language will have a client for it that is superior to writing raw queries anyway.
- Easily pluggable data management systems. For example, if I do a lot of SUMs and I know I'm not doing writes very often, I want to use CStore. If I'm storing a bunch of strings, I want to be able to index them any way I please - maybe one index with Analyzer/Tokenizer X and another with Analyzer/Tokenizer Y - all in a nice inverted index. Good, I can make an autocomplete now. Oh, and sometimes I want a good ol' RDBMS.
- Reactive programming! It works well on the front end and it'd be amazing on the backend. For example, I want to make a materialized view that's the result of a query, but that gets updated as new rows get inserted or as the rows it uses get updated. Let's call it a continuous view or something. Eventual consistency is fine. Clever continuous views can solve a lot of performance issues.
- I want to be able to choose if a table/db is always in memory or not. I don't care about individual rows - that sounds like someone else's problem.
- Easy pipelining - these continuous views mean that an insert can span a lot of jobs because one continuous view can be dependent on another. I want my database to manage all of this for me and I want to forget that Hadoop ever existed. I want to be able to give my database a bunch of nodes that are just for working jobs if need be. Maybe allow custom throttling for the updates of these "continuous views" so the queries don't get re-run every update if they're too frequent.
- While I'm at it, I want a pony, too. But I'd settle for this being open source instead.
There are a lot of possible directions for the DB world in the next decade. Me, I think the line between DBs and MapReduce/ETL/Pipelining is going to be blurred.
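The "continuous view" idea from the wishlist above has a simple core: a materialized aggregate that is updated incrementally as rows arrive, rather than recomputed from scratch. A minimal sketch, with invented names and nothing resembling a real engine's API:

```python
class ContinuousSumView:
    """Maintains SUM(amount) GROUP BY key, updated per insert/update."""
    def __init__(self):
        self.totals = {}

    def on_insert(self, key, amount):
        self.totals[key] = self.totals.get(key, 0) + amount

    def on_update(self, key, old_amount, new_amount):
        # Apply only the delta instead of rescanning the base table.
        self.totals[key] += new_amount - old_amount

view = ContinuousSumView()
view.on_insert("widgets", 5)
view.on_insert("widgets", 7)
view.on_update("widgets", 5, 6)    # a row changed from 5 to 6
print(view.totals["widgets"])      # 13
```

Throttling, as the wishlist suggests, would just mean batching these deltas and folding them in on a timer rather than per write.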
I have a LOT fewer requests than you, but this one is just so irritating:
>- Fucking SQL. I don't want to learn your stupid DSL. I want to give coworkers a SQL client and say "go! You already know how to use this!". If I want a new feature, then dammit, build on top of SQL the way PostgreSQL has. Odds are, regardless of whether it's some JSON API or SQL, my language will have a client for it that is superior to writing raw queries anyway.
It pisses me off to no end that these developers have to out-think SQL and re-invent the whole damn wheel for "efficiency's" sake. It seems that no one takes friction costs into account; instead they just waggle their dick around saying "Look how smart this new language is, or how I co-opted [erlang,perl,JS,etc] into being the query language!" Just stop it. Some people making less than $150k/year are going to have to use this, and they are good at DB analysis but not at bullshit esoteric language writing.
This! Yes, this is why I like Redis and Cassandra if I'm not going with a standard SQL DB. The first acts like a KV store on steroids and Cassandra has CQL which is almost exactly like SQL except without joins (ok, slightly more than that.) If Cassandra could get it right, why can't everyone else?
I'm sick and tired of this elitist new feature for new feature sake as well. Some of us have to work for a living and would like to go home after the workday instead of catching up on the new flavor of the week.
> Fucking SQL. I don't want to learn your stupid DSL.
I really like one change in LINQ syntax (and probably various other SQL-inspired DSLs), which is to invert the basic query structure, enabling better IDE support:
from thing in things ... (yay, now my editor knows what a "thing" is and can provide better code completion and type checking for the remainder of this deeply-nested behemoth of a query that probably should be expressed as hundreds of simple statements instead of cramming half of the day's work into a single "statement" with no obvious rules for indentation because, after all, run-on sentences are notoriously annoying to read and generally should be avoided).
That alone is worth investing a few minutes to learn something new.
Thank you. Doing a join on an SSD isn't as big a deal as on a spinning platter. Find the N regions of storage and pull 'em in. (replace tricky disk scheduling algos for simple FIFO or priority queue of requests)
Many of these denormalized "document" storage systems are likely to look like real legacy cluster-bombs in a few years.
Well, so long as I'm spitballing my dreamDB's "continuous views", then why not have a concept of "foreign documents"? Let's define "foreign documents" as a type of field that is an exact copy of a document in a separate table. This field gets updated when the original document is updated.
This could apply to denormalization really well! For example: let's say you have a table called "items", a table called "customers" and a table called "sales". A sales document is just an item foreign document + a customer foreign document + a date.
The item foreign document is literally an exact copy of a document in your items table that gets updated whenever the original gets updated. So if you change the customer information, the customer information in the corresponding "sales" documents gets updated.
This is a terrible example because it's not a useful/practical use case, but maybe it could be useful if you have enough data where this join is unreasonable.
You can use the same architecture to update your continuous views as you do for your foreign documents. And add the same syntax with throttling and whatnot.
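A toy sketch of the hypothetical "foreign document" field described above: the sales table keeps an embedded copy of the customer document, and a change feed on the customers table pushes updates into every copy. All table and field names here are invented for illustration.

```python
import copy

customers = {"c1": {"name": "Acme", "city": "Oslo"}}
sales = []

def record_sale(customer_id, date):
    # Denormalize: embed an exact copy of the customer document.
    sales.append({"customer": copy.deepcopy(customers[customer_id]),
                  "customer_id": customer_id,
                  "date": date})

def update_customer(customer_id, **changes):
    customers[customer_id].update(changes)
    # Propagate to every embedded ("foreign") copy of this document.
    for sale in sales:
        if sale["customer_id"] == customer_id:
            sale["customer"] = copy.deepcopy(customers[customer_id])

record_sale("c1", "2015-03-01")
update_customer("c1", city="Bergen")
print(sales[0]["customer"]["city"])   # Bergen
```

A real engine would drive the propagation from a change log asynchronously (and could throttle it), but the read-side payoff is the same: the join has already been done at write time.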
Something like this is available in postgres, mssql, etc and is called a "materialized view" [0]. In the postgres case, you have to update the materialized view manually, but in the mssql case, the updates are done automatically when the origin data is updated.
>> "I want to make a materialized view that's the result of a query, but that gets updated as new rows get inserted or as the rows it uses gets updated."
RavenDB[0] gets a lot of this right for the .NET stack.
RavenDB creates something very much like your continuous view -- called indexes[1] in RavenDB -- for every query you've run against it. Any successive queries aren't really queries at all, but are results from a pre-computed set.
When new data is inserted, the index gets updated asynchronously, leaving you with eventual consistency for your queries.
Raven creates these indexes automatically when you query it. It then maintains them: long periods without querying an index will relegate it to idle (updated at low priority), and eventually to abandoned (no longer updated). This way, your app's hot path -- the queries used most often -- remains the fastest, alleviating a lot of performance issues.
GigaSpaces XAP (http://www.gigaspaces.com) gives you a big fraction of that. You can create an in-memory distributed grid that acts as both an object store and processing platform, that is both resilient (self-healing) and scalable. The CTO of Elasticsearch used to work for GigaSpaces.
The continuous view concept would be amazing. It's possible to get most of the way there with triggers, but it always feels like there's a possibility that I might break something, so I would prefer to defer to database experts.
These would be great. One thing I had need for at a previous job was a combined search / compute engine.
We basically had a search cluster that would return ids that would get joined into the actual data in another system, then we'd do heavy post-processing and push it off to another system for slicing and dicing.
Not sure if it's any use to anyone else, but a system that allowed people to push search queries into a cluster, do some user defined post-processing and then have the engine pipe that to an endpoint would have been interesting for that use case.
I believe there are some Hadoop ecosystem bits that do a bit of this now (Impala comes to mind).
>Fucking SQL. I don't want to learn your stupid DSL... If I want a new feature, then dammit, build on top of SQL
What I hear is "I don't know anything better than the incredibly cut-rate solution I've been taught how to abuse, so I'm opposed to any improvements."
SQL is a human interface language. I absolutely don't understand the desire that people have to use it for IPC. The tremendous set of SQL injection vulnerabilities would never have existed in the first place if people used proper IPC systems for IPC.
If you make a system that's designed to replace SQL, doing everything SQL does and more, built by people with a lot of SQL experience, great!
But you get a lot of data stores making custom APIs without a comprehensive plan. That's what causes frustration. APIs that aren't generic. APIs that don't handle relations well or at all, sometimes with it bolted on too late.
It isn't about what I am willing or able to learn; I am a good software developer who can pick up your new DSL or whatever crap you want to invent.
But guess what? The guys making $70k/year doing DB analysis and reporting who have little to no software development background can't. And they are the ones that live in the hell you've created.
It's not even that. Every time I have to write an aggregation in Elasticsearch, I cringe. JSON is a terrible human format, and the annoying nested trees needed to do the simplest thing really suck to type out. I understand SQL doesn't cover everything a DB may expose, but then allow it for what it can do, or come up with nice syntax instead. Don't make me deal with your AST.
re: materialized views -- you can kind of fake them in PostgreSQL by creating a view, populating it, and then defining triggers on insert, delete, and updates to your source tables.
(I'm coming from the perspective of having to use Oracle at work - and I'm happy to be corrected if my experiences do not match your wishlist)
> Easy sharding, a la Elasticsearch
Oracle RAC lets you add nodes to the cluster. Helps with CPU - doesn't really help with disk contention
> Fucking SQL. If I want a new feature, then dammit, build on top of SQL the way PostgreSQL has.
There is no good story here for Oracle. Postgres is amazing (and has the potential for more amazing) here. I'm surprised the legions of NoSQL developers aren't contributing to psql.
> For example, I want to make a materialized view that's the result of a query, but that gets updated as new rows get inserted or as the rows it uses gets updated. Let's call it a continuous view or something. Eventual consistency is fine.
This is how (certain) [1] Oracle Materialised Views already work (.. with some effort..) unless I've misunderstood you.
But you have to:
1. Define "logs" on all the included tables (observables)
2. Only define certain types of queries, otherwise they won't dynamically update
3. If not 2, then schedule the updates on a timer/job (eventually consistent)
4. Write your joins with the old syntax (WHERE A.field = B.field(+))
> I want to be able to choose if a table/db is always in memory or not. I don't care about individual rows - that sounds like someone else's problem
Oracle's In memory feature[0] supposedly offers this though I haven't had a chance to get at it yet.
> While I'm at it, I want a pony, too.
If you've got the money to pay for all the Oracle gear above, then you have more than enough money for a few Ponies ;)
I come from a PSQL background where materialized views are manually refreshed :)
After skimming Oracle's docs, I'd settle for something as simple as:
CREATE MATERIALIZED VIEW my_mview
REFRESH FAST ON COMMIT
THROTTLE 10s
AS ( ... )
Does Oracle use deltas to refresh these mviews? It'd be super cool if it did! For example, if the mview is a sum of a bunch of rows, then it could use the logged changes on those rows and its existing results to compute the new mview.
Oracle supported both full refresh and incremental refresh as of at least 10 years ago (maybe longer).
Fast refresh requires creating "MATERIALIZED VIEW LOGS" on the source table(s) and covers most (but not all) aggregations/groupings in the mview SQL. Once it's set up, DML to the source tables gets logged to the mview logs, and a "fast" refresh of the mview uses them to incrementally update the mview's data.
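The log-driven "fast refresh" mechanism can be sketched roughly like this (a hand-rolled analogue for a SUM mview, not Oracle's actual implementation): DML appends deltas to a change log, and refresh folds only those deltas into the stored result instead of rescanning the base table.

```python
base_table = {"r1": 10, "r2": 20}
mview_sum = sum(base_table.values())      # initial full refresh: 30
mview_log = []                            # analogue of a materialized view log

def dml_update(row_id, new_value):
    old = base_table[row_id]
    base_table[row_id] = new_value
    mview_log.append(new_value - old)     # log only the delta

def fast_refresh():
    global mview_sum
    while mview_log:
        mview_sum += mview_log.pop(0)     # incremental, no full rescan
    return mview_sum

dml_update("r1", 15)                      # delta +5
dml_update("r2", 18)                      # delta -2
print(fast_refresh())                     # 33
```

This is why fast refresh only covers certain aggregations: SUM and COUNT compose from deltas, while something like MEDIAN can't be maintained this way without extra bookkeeping.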
I'm actually trying to plan out some experiments toward building something like this, and it seems I may soon be in the fortunate situation of being able to run large-scale production experiments toward this very vision, and perhaps open source pieces over time.
Point being, you're articulating a need that's definitely there.
Quick question about the pony. I mean, about CStore. What do you think is the best option for doing that nowadays? (By that I mean columnar stores that are hopefully optimized with the usual tricks like inline compression. I know Vertica works, but are there cheaper (free?) alternatives?)
It's impractical to write maintainable code in the forms of SQL I'm familiar with. But maybe someone here knows better.
You should be able to abstract logic out to blocks for the duration of the query. Along these lines:
cset means select id, venue_id from customer
select venue_name from venue where id = cset.venue_id
cset would be scoped to your connection.
Instead of pages of copy-and-paste queries where the levels are all mixed up with one another. You can get something similar by polluting the namespace (e.g. views, temporary tables), but that's hacky, and not the same thing as good.
Well that's why (in my opinion) you should _generate_ the queries, not _write_ them. Use an ORM or whatever flavor of basically a SQL generator you prefer. Or even easier, use any programming language to concatenate around some strings and build your SQL.
The power of using SQL comes from the fact that, no matter what, it's always there. You can always write/buy/find another client for your DB in another language, and as long as it generates SQL, it's all good.
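A trivial sketch of "generate the queries, don't write them": compose a parameterized SELECT from reusable pieces so the SQL fragments live in one place instead of being copy-pasted, and values are bound as parameters rather than concatenated in. The helper name and placeholder style (`?`, as in SQLite's DB-API driver) are just illustrative choices.

```python
def select(table, columns, **filters):
    """Build a parameterized SELECT; values never touch the SQL string."""
    where = " AND ".join(f"{col} = ?" for col in filters)
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if where:
        sql += f" WHERE {where}"
    return sql, list(filters.values())

sql, params = select("venue", ["venue_name"], id=42)
print(sql)      # SELECT venue_name FROM venue WHERE id = ?
print(params)   # [42]
```

Full-blown ORMs and query builders are elaborations of exactly this: the generated text is plain SQL, so any client that speaks SQL can consume it.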
I don't use Redis, but one thing I like about it is that it's a fairly generic storage layer that you can adapt into various data models; you can implement higher-level tools using its lower-level ones. I would love to be able to build a whole database machinery out of specific primitives.
For example, a "column" is basically a primitive you should be able to instantiate using some kind of storage strategy: keep it in RAM, keep it partitioned into chunks, append-only, compressed, sorted, remote etc.
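To make the "column as a primitive with a pluggable storage strategy" idea concrete, here is a toy sketch: the same append/get interface backed either by a plain in-RAM list or by a run-length-encoded store. The class names and the RLE scheme are invented for illustration.

```python
class RamColumn:
    """Simplest strategy: values kept verbatim in RAM."""
    def __init__(self):
        self.values = []
    def append(self, v):
        self.values.append(v)
    def get(self, i):
        return self.values[i]

class RleColumn:
    """Run-length encoded: compact for sorted or low-cardinality columns."""
    def __init__(self):
        self.runs = []                     # list of (value, count)
    def append(self, v):
        if self.runs and self.runs[-1][0] == v:
            value, count = self.runs[-1]
            self.runs[-1] = (value, count + 1)
        else:
            self.runs.append((v, 1))
    def get(self, i):
        # Walk the runs to translate a logical index to a value.
        for value, count in self.runs:
            if i < count:
                return value
            i -= count
        raise IndexError(i)

# Both strategies satisfy the same interface, so the engine above them
# doesn't care which one backs a given column.
for col in (RamColumn(), RleColumn()):
    for v in ["a", "a", "b"]:
        col.append(v)
    print(col.get(0), col.get(2))          # a b
```

Chunked, append-only, compressed, or remote strategies would slot in the same way, which is exactly the composability the Redis comparison is getting at.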
The main thing that I want is to separate data from queries. It's always baffled me that traditional RDBMSes choose to commingle tables and indexes. Why does my commit have to wait for the indexes to catch up? Why do indexes have to be involved in the slow, complicated system of updating tuples that involve transactions, locking, paging and so on?
A better approach is to separate the two entirely:
1. The careful, slower ACID core that journals, fsyncs, keeps my data safe and provides a rigorous data model.
2. The super fast read-only, distributed indexes that organize the data in the most efficient way for querying (mostly through RAM and vectorization).
Now, I don't really need my write node to be a super-complicated distributed, replicated, gossip-based hash-ring-sharded, quorum-coordinated eventual consistency monstrosity. Single master is fine as long as you can keep standby replicas that are easy to fail over to.
What I want is for there to be a whole gaggle of read-only slave indexes that get my data reasonably quickly and are super fast to query. Indexes can be eventually consistent; but they don't need to be safe. Indexes can be reconstructed from the master data, after all. All we need to do is for indexes to ingest a continuous stream of data changes from the master.
Again, the master can be a bit slow: As slow as Postgres, at least. But the indexes, since they don't need to deal with transactions or locking or anything of the sort, can be really fast. Splitting the two up means they can each worry about different things, and apply different tolerances and constraints to what they do.
Today, we accomplish something similar by using ElasticSearch with Postgres. It's not good enough in the long run; they have completely different query mechanisms and data models, for one. It's also difficult to keep ES in perfect sync, and ES is generally heavy-weight; ES indexes are beasts, not very mobile. For example, the schema is mostly static, and changes require cloning the index (which I find weird). Still too fulltext-oriented; it's just not quite as good at non-text stuff. GIS support is lacking. And so on. A uniform query/database system is definitely needed.
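The split described above reduces to a simple contract: the slow, safe master appends every write to a change stream, and eventually-consistent read-only indexes replay that stream at their own pace. A minimal sketch with invented names (the real thing would involve journaling, fsync, and a network transport):

```python
master_log = []          # the ACID core's durable change stream
master_data = {}

def master_write(key, value):
    master_data[key] = value
    master_log.append((key, value))   # in reality: after journal + fsync

class ReadOnlyIndex:
    """Rebuildable at any time from the master log; never authoritative."""
    def __init__(self):
        self.data = {}
        self.cursor = 0              # position in the change stream

    def catch_up(self):
        # Eventual consistency: ingest whatever changes have arrived,
        # with no transactions or locking on the read side.
        while self.cursor < len(master_log):
            key, value = master_log[self.cursor]
            self.data[key] = value
            self.cursor += 1

index = ReadOnlyIndex()
master_write("a", 1)
master_write("a", 2)
index.catch_up()
print(index.data["a"])   # 2
```

Because an index is just a deterministic fold over the log, losing one costs nothing but a rebuild, which is what lets it drop all the safety machinery the master needs.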
> It pisses me off to no end these developers have to out-think SQL and re-invent the whole damn new wheel for "efficiency's" sake.
Worked with someone who did this a couple of years ago. I kept trying to get him to explain why he was reinventing SQL, and never did get a clear answer.
> just waggles their dick around saying
Sorry to be the one to say it, but I don't think that part was necessary or appropriate to this community. Also, using the collective gender-neutral "their" in conjunction with dick waggling is just funny, now that I think about it.
On one side there is California, where "dick wagging" is an unnecessary instance of gendered language, certain to drive women out of the tech industry.
On the other side there's the rest of the world, where being offended by "dick wagging" seems dazzlingly childish. It's just a penis, most grown men and women know what they are and observe their motions on a daily basis.
Avoiding gendered words as a recipe for more women in IT seems very funny to me.
My language forces you to add gender-dependent postfixes to every verb and noun. Nobody has a problem with that, and I think there are more women in IT here than in the US.
It's obviously not the source of problem. The source is people being jerks. You don't fix that by changing the non-jerks language (jerks won't care).
In fact, the non-jerks weren't jerks precisely because they treated women the same as everybody else. When you embarrass them into treating women differently (more carefully), it becomes more awkward and reminds women more often that they are different here. Counterproductive, IMHO.
Regarding unnecessary vulgarity - it seems weird to me, no matter whether it's gendered or not. But people use different styles of communication; that's the diversity people are so proud of.
Regarding SQL - yes please. It's a perfectly good wheel, don't reinvent it without a really good reason.
>My language forces you to add gender-dependent postfixes to every verb and noun.
If there's truly no gender neutral way of saying something, then I'd imagine that people always use one gender for generic cases, and as such it's not exclusionary. Which is different from having the option and not using it.
>The source is people being jerks.
Non-jerks being exclusionary is arguably far more of a problem than isolated jerks. If one person says "lol women suck at computers", they're obviously a moron, and can be dismissed. Whereas if everyone around you subtly assumes that all programmers are male, it's far more hurtful, because they're people that you respect.
>When you embarass them into treating women differently
Asking for gender neutral language isn't asking people to treat women differently.
> If there's truly no gender neutral way of saying something, then I'd imagine that people always use one gender for generic cases, and as such it's not exclusionary. Which is different from having the option and not using it.
Well, the option is always there; you can use "The person that is using this application clicks the button X" instead of "The user clicks the button X". "Person" is female, so you then use female forms for every verb. It's just longer, inconvenient, and sounds like legal text, so nobody does it. People just write everything in male versions. Even "Are you sure?" differs depending on the gender of "you", and I have yet to see an application that uses anything other than the male version for dialogs (except for applications that somehow know your gender).
I think the distinction between using defaults without assuming gender, and assuming gender, is important, but some people argue for using the contrived language so as not to be exclusionary, and that's what I'm against and what I think is counterproductive.
Imagine if everybody talked in legalese when you approached them, and switched to normal language otherwise.
Humans are pretty flexible when it comes to language, what sounds awkward or verbose can often become completely natural over time. That said I agree that it's easier to make smaller language changes stick.
There might be an alternative way to phrase things neutrally that's easier to say though. (passive?)
Yes, passive works as well, as long as you don't need to have 2 or more agents doing different things. Still it's longer, contrived, and sounds like legal text.
Regarding small changes - not possible :) In English I would like "hen" to exist, because it's often useful. In Polish it wouldn't change a thing; you'd need to change the underlying system if you want gender-neutral language, and at that point you may as well invent a new language. Every word ending would change anyway, which means words would work differently with cases and numbers, and plurality would work differently. People would need to relearn the whole thing.
But most importantly, I really don't think the language is the problem. In Poland there are more female doctors than male. Same with teachers. Yet the default word for a doctor or teacher of unknown gender is male, and the plural is usually formed from the male version. That hasn't stopped women from dominating these fields.
Yet if neither men nor women wish to change their own language, why must it be changed?
Sometimes HN reminds me of Christian missionaries: interposing themselves into foreign cultures that they do not understand and saying "No! No! It's all wrong! You do it our way."
Meh, nobody is using guns, and the discussion is interesting. I just don't like the assumptions some people make (I've heard that my culture is inherently sexist for example).
> If there's truly no gender neutral way of saying something, then I'd imagine that people always use one gender for generic cases, and as such it's not exclusionary.
But it is almost always male, even in languages that have the concept of neuter. Defaults matter.
Myself, I view it not so much as a cultural clash, but rather an academic one. People sometimes just do not know how to describe the phenomena which perplex, enrage, disturb and perturb them so, and the frustration is such that they result to the vulgar form, which .. after all .. is as you say, a fairly universally understood level of concourse and therefore has its appropriate applicability to the conversation, but naturally - since all language is a system of power and control - with the aforementioned risks of offense. Those who choose to react in offense, and those who choose to see the language being used in light of the subject, may well be the end of the conversation. But, this is a language discussion, no? I myself find SQL elegant, whereas compared to the NoSQL DSL's that shizzle the newschool nizzle, shits fucked yo.
So now we have to learn what to say and what not to say from feminists?
If you take offense at every single fucking thing that's your problem.
Want women in tech? Say this and that. I won't. No one with a decent background is explicitly trying to keep women from entering tech. In fact, if you are deep in tech, no one even has time to make deliberate efforts at that.
STOP thought policing.
Look, this is a free country. While I appreciate your sentiment, please don't come here for thought policing.
Everyone has the freedom to say what they want. If you start taking offense at every single thing written, I think you are in the wrong place. Go back to Reddit.
Look at what you have done. The discussion was going well, with everyone expressing what can and cannot be good about SQL and databases, until you brought up this stupid gender-bias thing. Now there are at least 10 replies explaining why the GP's comment is okay or not okay.
Please get off this site and don't waste our time.
Dang - you seem to be moderating HN. Can we have policies in place for removing such comments? It is an unnecessary distraction.
This is one of those situations where it's impossible to point out the problem in the thread without adding to it. In the future, please email us at hn@ycombinator.com instead; it will get to us sooner and more reliably anyhow.
I've detached this subthread and marked it off-topic.
"I keep trying to get him to explain why he was reinventing sql, and never did get a clear answer to that."
SQL embeds in itself certain fundamental assumptions about how it operates, so deeply that you can't even see them. For instance, despite frequently being referred to as a "relational" query language, it requires some violations of the original relational logic. One example: the original relational logic looks a lot more like a tuple store, where rows can freely contain arbitrary columns, than the rigid row schema format that SQL so thoroughly assumes. For another, it has no concept of sharding, or indeed of any other particular data-oriented structure of the database, which is simultaneously one of its strengths (it has survived precisely by not baking in such assumptions) but is also one of its weaknesses. If your database is sharded, for instance, there's nothing in SQL itself that will guide you towards understanding whether you're writing a query that will or will not cross shards. You just have to "know". In fact there's a great deal about SQL performance that isn't reflected in the language and that you just have to "know".
In theory, a new query language could potentially address these issues. In practice, what I've seen from "new" query languages has more to do with being "easy to implement" for the new store, and if that does or does not happen to meet any of these criteria, well wasn't that a happy coincidence. Someday I hope to see a well-considered sequel to SQL (no pun intended), but the way of thought that SQL affords has so thoroughly taken over the world that it is hard to see past it, much as it is hard to see past the von Neumann computer architecture into anything else. (I am not one of those people who think either von Neumann architecture or SQL are terrible and it all went wrong because we got stuck on them, but that's not to say they are the only way of doing business, either.)
A variant of SQL that can adapt to variable-column rows should certainly be possible, much as SQLite implements an SQL variant with variably typed columns.
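Since the comment above mentions SQLite's variably typed columns, here is a minimal sketch (using Python's built-in sqlite3 module) of that behavior: SQLite treats declared column types only as "affinities", so a single undeclared column can hold integer, text, and real values side by side.

```python
import sqlite3

# SQLite's dynamic typing: one column, three different storage classes.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (v)")           # no declared type at all
con.execute("INSERT INTO t VALUES (42)")    # integer
con.execute("INSERT INTO t VALUES ('hi')")  # text
con.execute("INSERT INTO t VALUES (3.14)")  # real

rows = list(con.execute("SELECT v, typeof(v) FROM t ORDER BY rowid"))
print(rows)  # [(42, 'integer'), ('hi', 'text'), (3.14, 'real')]
```

A rigid-schema SQL engine would have rejected the second and third inserts at type-check time; SQLite simply stores each value with its own type tag.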
Cassandra's CQL is an SQL variant/subset (albeit without joins, which don't scale out horizontally anyway) that is adapted to variable partitions with rows grouped into partitions. It addresses the problem of locality by embedding partition awareness into the query language.
Yes, that is the best example I've seen. But, by the same token (pun intended this time), it is only SQL-ish. It isn't SQL, which of course they are up-front about by calling it something else. Careful study and some practice usage of CQL can provide a practical object lesson in why "give me SQL!" is not always a reasonable demand to make of a datastore unconstrained by preconceived notions about semantics. SQL presupposes/affords more than meets the eye.
> in conjunction with dick waggling is just funny, now that I think about it.
You understand that 'dick waggling' in this sense does not refer to the mechanical motions of a penis, right?
That it's what we call in English a "euphemism"? The gender-neutrality of the subject makes perfect sense when you understand that this euphemism, unlike human genitalia, is not gender specific.
In this case, "dick waggling" is a euphemism to mean "unwarranted bravado" or "unnecessary behavior performed only to prove that it can be accomplished".
You understand that it isn't a "euphemism" as euphemisms are substitutions for vulgarity, not substitutions of vulgarity. Notice how the replacement you suggest, "unwarranted bravado," isn't blunt or vulgar? The term, as it was used, is a metaphor. As a metaphor, its imagery is fair game for criticism.
As for the criticism itself, I think it stands as the phrase is unnecessarily gendered.
This article is full of so many logical fallacies that I'm surprised it made it here. And it's an advertisement, no less.
Creates a red herring by stating he's been doing this a long time and has seen it all.
Creates straw man after straw man in the trashing of memory caches (avoids their use cases), Dynamo (there's a good reason tons of people use various NoSQL Databases) and Hadoop (C'mon, now).
He also creates more logical fallacies in calling various concepts silver bullets that ended up having problems. I don't think anyone serious about technology thinks replication, sharding, or load balancing "solves everything". Nothing is a silver bullet, and anyone who says something is, is selling you something...
And then he fails to really address that MemSQL itself uses replication and sharding (the latter in a limited sense, since the core SQL concept of a JOIN is wrecked here; they have a big warning on their troubleshooting page about an error users must see often).
SQL is great but I have plenty of great reasons to use other data stores. SQL isn't a silver bullet for data.
Point is, he is calling MemSQL a silver bullet and is obviously trying to sell something, while ripping on plenty of great ideas and concepts by picking their worst implementations and biggest misunderstandings.
Yes. Or as I've said: memory is the new disk. This is why PMCs (performance monitoring counters) are more important than ever, to provide observability for cache and memory analysis. (I'd like some PMCs made available in EC2. :)
> It’s been 65 years since the invention of the integrated circuit, but we still have billions of these guys around, whirring and clicking and breaking. It’s only now that we are on the cusp of the switch to fully solid-state computing.
Am I missing something, or should it read "hard disk" rather than "integrated circuit" here?
He's referring to the picture of the hard drive above that line when he says "these guys". It took me a couple of reads through that sentence to grasp the meaning: "Why are we still using spinning metal contraptions to store data 65 years after the invention of the integrated circuit?"
"These guys" refers to the hard drive in the picture just above that paragraph. It means: even though integrated circuits are old, we're just getting around to solid-state storage.
Amazon doesn't expose many of these statistics (how fast is the RAM I get with an M3.large or a c3.med, etc.). Does this mean real performance is reserved for those who own their servers?
RAM performance is the same for VMs and bare metal. And the ~1.5x performance difference between different grades of DRAM (e.g. 1333 vs. 1600 vs. 2133 MHz) is negligible compared to the massive cache-RAM and RAM-flash gaps.
And speaking of cache, lstopo (from the hwloc package) does work correctly under EC2.
It may be the same in the sense that hypervisors don't explicitly limit it, but on a multicore host you're sharing memory bandwidth with the other guests, in the common(?) case when the host has more cores than a guest. You can also experience increased latency when there is access contention.
It's easy enough to find out. Create an array of linked-list nodes, and have them point to each other randomly (hint: std::random_shuffle w/an array of indices). Write a routine that traverses the list N times. Time the routine for larger lists. You should see a jump as your list gets larger than each stage of cache.
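The experiment described above can be sketched as follows; the std::random_shuffle hint translates to random.shuffle in Python. This is a sketch of the structure only: Python's interpreter overhead largely masks the cache-level jumps, so you would port the traversal loop to C/C++ for clean timings. The sizes in the loop are arbitrary choices.

```python
import random
import time

def build_cycle(n, seed=0):
    """Link n nodes into one random Hamiltonian cycle.

    Shuffle the indices, then make each node point at the next one in
    shuffled order; next_[i] is the node that follows node i.
    """
    order = list(range(n))
    random.Random(seed).shuffle(order)  # the "shuffled array of indices" hint
    next_ = [0] * n
    for i in range(n):
        next_[order[i]] = order[(i + 1) % n]
    return next_

def traverse(next_, steps):
    """Chase pointers for `steps` hops; return the final node so the
    work can't be optimized away."""
    node = 0
    for _ in range(steps):
        node = next_[node]
    return node

# Time a fixed number of hops for growing list sizes. In a compiled
# language, the time per hop jumps as the list outgrows each cache level.
for n in (1 << 10, 1 << 15, 1 << 20):
    next_ = build_cycle(n)
    t0 = time.perf_counter()
    traverse(next_, 1_000_000)
    print(n, time.perf_counter() - t0)
```

The random order matters: a sequentially linked list would let the hardware prefetcher hide the latency, and you would see no jump at all.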
That's how it has always been. We take different storage/memory technologies, sort them by their speed and price, put the fastest but most expensive closest to the CPU and the slowest but cheapest as far as possible. Minimizing memory footprint allows us to do more work on the faster end while minimizing storage cost allows us to store terabytes of data on your bookshelf.
There might have been just two or three levels initially: cpu register(s), system ram, and external storage. Now the spread has several more steps: registers, L1 cache, L2 cache, maybe L3 cache or part of memory as disk cache, SSD (either as a standalone drive or as an on-disk cache inside a traditional hard drive), and the good old spinning platter. We've mostly let go of tape storage by now but those are still sold for their capacity.
However, from the programmer's point of view, nothing has necessarily changed.
We have several levels of storage, more than before, ranging from the fastest on-chip cache ram to the mechanical storage and we still optimize our programs to run mostly in the fastest tip of this memory pyramid. What has changed is the size of the spread itself: the gap between the fastest and the slowest is huge in numbers. But relatively, not so much.
A quick guesstimate of the ratio of microseconds needed for a zero-page read on a C64 versus reading a byte from the 1541 floppy drive, compared against a read from CPU cache versus a read from a spinning platter, suggests that the relative difference is still roughly on the same order of magnitude. From various sources, I get a figure of 50-100 million times between the fastest and slowest read.
That is also what makes programming so much fun: everything gets redone all the time and the pace of advancements is crazy yet some things don't change. We just do more complex things but still bump into essentially the same tradeoffs.
Database vendor frames history of computing in database evolution, makes snide remarks about competing technologies, admits it has no idea where the world is going while invoking the 'history repeats itself' notion. Well, duh.
OTOH, databases are only one component of modern architectures, which the article correctly asserts are largely limited in terms of scalability by throughput and latency. However, scalability is often secondary to functionality. And in terms of functionality, the long list of database types trawled out through the article only serve to highlight the real chokepoint: cognitive overhead.
Perhaps what we really need are tools that enable us to more easily stop and think about the problem. Ideally, tools to test, profile, compare, and switch between storage or other subsystem architectures without having to delve into the infinitesimal intricacies of each.
Success really depends on the conception of the problem, the design of the system, not in the details of how it's coded. - Leslie Lamport
I once learned, in the good old mainframe times, that there are 3 sizes of databases: small ones that fit into RAM, medium ones that fit on one computer, and big ones that require a cluster of computers.
The relational model and SQL databases play their strong roles in medium-size databases, but are too much overhead for a fast small database, and do not scale well for big databases.
It was hoped at that time that Moore's law would beat Wirth's law (which was only formulated much later): big databases would soon be medium-sized, nobody would care that much about the performance of small databases, and we could happily use SQL for all problems. This was true for a surprisingly long time, and still is, if your problem fits into a medium-size database.
Unfortunately, computer history turns in cycles and tends to forget lessons from the past. Coding access to a bunch of different databases was at least standardized under COBOL. Coding for half a dozen NoSQL databases now is a complete mess.
That is a nice scale to think about. Looking back, most systems I've been involved with are small by the standards of today's servers. Rarely more than 50 GB, and that actually fits into memory. How many systems actually need more than about 1000 GB of database data? Things like images and videos can be stored separately anyway; I'm talking about other data together with metadata about images/videos/files.
Okay, if the title is correct, then to heck with traditional RAM and, instead, have very long addresses, say, a(i).b(j).c(k) ... stored in, say, a key-value store. Then, as usual for caching, just hash that long address.
Why do that? Mostly no one really wants the sequential addresses, and a lot of work in software and the processor is calculating those sequential addresses nearly no one really wants anyway. So, e.g., for software collection classes, just let the keys be the long addresses and f'get about AVL trees, red-black trees, etc. And for sparse matrices, just use the row and column indices as the addresses and f'get about all the tricky addressing for sparse matrices. Etc.
I've always found it very unfortunate that MemSQL is not open source. It looks very interesting. VoltDB seems to fill a similar niche. Has anyone tried both?