Umbra: an ACID-compliant database built for in-memory analytics speed (umbra-db.com)
227 points by pbowyer on Jan 26, 2020 | 100 comments



There's been an explosion of new DBs, but I haven't found anything that really beats Postgres or MariaDB for most workloads. The main advantages of these battle-tested DBs are that they're easy to operate, well understood, full featured, and can handle most workloads.

It does make me wonder what will be the next big leap in DB technology. Most of the NoSQL or distributed DB implementations have a bunch of limitations which make them impractical (or not worth the trade offs) for most applications, IMO. Distributed DBs are great until things go wrong, and then you have a nightmare on your hands. It's a lot easier to optimize simple relational DBs with caching layers, and adding read replicas scales quite effectively too.

The only somewhat recent new DB that comes to mind which had a really interesting model was RethinkDB, although it suffered from a variety of issues, including scale problems.

Anyway, these days I stick with Postgres for 99% of things, and mix in Redis where needed for key/value stuff.


The issue that I frequently run into is not that I'm looking for a fancy distributed/sharded database because of reasons of performance, but because I need to store large amounts of data in a way that allows me to grow this datastore by "just adding boxes" while still retaining a few useful database features. I'd love to use Postgres but eventually my single server will run out of disk space.

Now, one approach is to just dismiss this use-case by pointing at DynamoDB and similar offerings. But if for some reason you can't use these hosted platforms, what do you use instead?

For search, ElasticSearch fortunately fits the bill, the "just keep adding boxes" concept works flawlessly, operating it is a breeze. But you probably don't want to use ElasticSearch as your primary datastore, so what do you use there? I had terrible experiences operating a sharded MongoDB cluster and my next attempt will be using something like ScyllaDB/Cassandra instead since operations seem to require much less work and planning. What other databases would offer that no-advance-planning scaling capability?

Somewhat unrelated, but I often wonder what one would use for a sharded/distributed blob store that offers basic operations like "grep across all blobs" with different query-performance requirements than a real-time search index like ElasticSearch. Would one have to use Hadoop or are there any alternatives which require little operational effort?


Check out CockroachDB if you want the 'add a node for additional storage' option like MongoDB has. It's Postgres-compatible, for the most part, and has a license that most of us can live with for the projects we build.


No one ever mentions Couchbase here, I guess it's a bit under the radar despite being used by so many big companies. I used it back when I was at Blizzard and thought it was pretty amazing. It's sort of like, what MongoDB could be if it was actually good. Ridiculous speed and scalability, and the new versions have included analytics and full-text search. Give it a look.


The question with many databases using rotational-storage-based engines is not how much you want to store, but how much you want to query. Couchbase requires 99% cache hit rates for queries; they tell you to add memory if you are lower, which really makes it an in-memory database with a disk-based backing store, suitable for modest data sizes. I have also never seen a positive Jepsen test, so I believe it is known to lose data in split-brain modes.


Not open source?

Not even transparent pricing?

https://www.couchbase.com/pricing


The Community Edition is open source, though the Enterprise Edition has some proprietary extensions.

It’s admittedly not at all obvious from the repo name, but the following is the top level repo for Couchbase Server (which uses Google’s `repo` tool): https://github.com/couchbase/manifest


> proprietary extensions

Geez, with all these limitations you may as well use another, proper open source database.

https://www.couchbase.com/products/editions


> I need to store large amounts of data

What kind of data and amounts do you frequently encounter that is a good fit for relational storage but difficult to fit in a single server? Honestly curious.


It's not relational but key-value based. I do need things like "update" operations, though, hence I can't use a simple blob store.


Thank you for taking the time to answer! IMNHO key-value can be a great fit for relational storage. But I'm still curious about the what (and why) and how much... Servers can grow pretty big these days.

And you say update... across keys, or do some keys depend on others? Otherwise it sounds like a perfect fit for sharding, with reduced constraints/requirements.


Try FoundationDB. It's a distributed ACID key-value store: databases can scale up to 100 TB, it scales horizontally, and it can handle an enormous number of concurrent writes and reads while maintaining relatively low latency per read/write.

You can even use the FoundationDB Document Layer, which is API-compatible with Mongo.

It definitely takes some getting used to, but I think it's pretty fucking great, once you do.
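To give a flavour of the programming model, here's a minimal sketch using the official Python binding (assuming a locally running cluster, a 6.3-era client, and a made-up ('score', user) key layout):

    import fdb

    fdb.api_version(630)           # pin the client API version (assumes a 6.3+ client)
    db = fdb.open()                # connects via the default cluster file

    @fdb.transactional
    def add_points(tr, user, points):
        # Read-modify-write inside one serializable transaction; the decorator
        # retries the whole function automatically on conflicts.
        key = fdb.tuple.pack(('score', user))
        current = tr[key]
        total = (fdb.tuple.unpack(current)[0] if current.present() else 0) + points
        tr[key] = fdb.tuple.pack((total,))
        return total

    print(add_points(db, 'alice', 10))   # pass the Database; a transaction is created and retried for you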


Yeah, I was super excited when FoundationDB was made Open Source, but since then I haven't really heard anything about it or anyone using it in production, even less so with the Document-DB layer (Mongo). Any experiences you'd like to share?


Lots of people using it in production. Snowflake, IBM Cloud, Goldman, VMWare.

CouchDB is re-architecting onto it, https://youtu.be/SjXyVZZFkBg


I love it personally. I think right now those who are using it are using it as their secret sauce. It's essentially what people want from BigTable or one of the other proprietary offerings from the cloud providers, except actually open and free.

I think the biggest reason for a lack of noise about it, is the overhead of learning it is pretty high. You're not going to find folks writing their first "Nodejs + Express" applications using it. Additionally, you really have to know why most distributed databases suck ass to know why FoundationDB is so good.

Example: I have a cluster of three VMs running FDB on my home server, and over the past week I've accidentally hit the power switch three or four times. At no point did I have data loss in the cluster, the cluster "immediately" comes back up, and is ready to go. Adding machines to the cluster is unbelievably easy, especially if you have ever even tried grokking how to horizontally scale PSQL.

I'm close to releasing an Elixir-based Entity Layer which is pretty uhhh, ~~shitty~~ lacking, at this point feature wise, but it does make storing structured data a bit easier. I'm hoping it'll be more useful for helping people learn how to use FoundationDB than something folks are putting into production (though I'm dog-fooding it).


Have you considered disaggregating your storage from your compute and using something like NVMe-oF or a traditional iSCSI or Fibre Channel SAN to mount storage to your database server? With a block-level storage fabric you can fail over instantly, run read-only replicas and even support distributed writes in some cases.


> With a block level storage fabric you can fail over instantly, run read only replicas and even support distributed writes in some cases.

Yes, but it's also a wonderful way to corrupt a DB as soon as anything goes wrong in your storage fabric system.

And this type of corruption is generally not the recoverable one.


Combine it with a data-safety-oriented system like ZFS and your only issue will be downscaling storage without downtime (though that can also be worked around, if in a slightly less safe manner, by putting ZVOLs on thin allocations on the SAN).


> Yes, but it's also a wonderful way to corrupt a DB as soon as anything goes wrong in your storage fabric system.

Why is this more likely to corrupt a DB than having it on local disks, when something goes wrong?


> Why is this more likely to corrupt a DB than having it on local disks, when something goes wrong?

Local filesystems (ext4, XFS) have been designed to run over a reliable internal bus and spinning disks. They can (almost) withstand most of the outages that happen in this context (power cut, corrupted block, missing flush).

Put them over an unreliable "normal" network, where you can get cables savagely unplugged, faulty controllers, packet loss, out-of-order delivery, and buggy middleboxes, and you explode the number of scenarios that can go wrong and will go wrong.

Block level I/O virtualization is amazingly useful, but (in my mind) it should be used with care...

I've already heard of a case in production where the master DB, the slave DB and the backups all ended up on the same virtualized block-level storage... try to guess what happened next.


Thanks, that's a great explanation.


Clickhouse seems to be another great option for what you've described.


He needs updates, and Clickhouse isn't really made for that. But otherwise I agree.


For the TSDB part you might use Scylla standalone, or Scylla+KairosDB. We also work well in tandem with Elastic if you need it for ad hoc queries. We have a two part blog series on the why and how: https://www.scylladb.com/2018/11/28/scylla-and-elasticsearch...


See YugabyteDB. It reuses PostgreSQL code and adds sharding with transactions & synchronous replication. Source: I work there.


What is the difference between one server with many TB vs multiple servers with less space?


With multiple servers you can add space to each of them. With a single one there is a much lower limit to what you can do - that's the idea behind vertical/horizontal scalability. That, and the systems with multiple nodes can be made more reliable than single node servers.


One server is still going to run out eventually. To give you a very concrete example: The dedicated boxes I use at Hetzner have 2x1TB NVMe SSD by default. I can order additional disks for sure, but even so you'll struggle to get above a few TB of NVMe per box. But, adding another box is cheap and easy.

Plus, a single server with many TB is a big single-point-of-failure, and if you want to scale it (vertically) you still have to take it down.


If the data is active, it's not enough to just throw more storage on the server. More storage means more of other things: memory, I/O bandwidth, processing power. You could keep adding those as well, but eventually it's faster and much cheaper to add additional servers.


Backups are easier/faster? A machine with a 5TB table will take forever to dump with a single thread, but 5 servers with 1TB shards will dump it more quickly.


Vertical scaling (getting a beefier machine) is less preferable than horizontal scaling in the case of cloud providers, because a) it usually costs more to upgrade instance types than to run multiple smaller instance types, and b) eventually you hit a ceiling of available instance types as you grow.

But you can still make this work. In our case, we ended up going the first route, adding AWS EBS as the filesystem block store for our Postgres database, which is easy to resize dynamically without incurring downtime or other issues.

The downside of a vertical scaling approach is, well, you don't get HA "for free". You have to manually configure followers and standby nodes for the sole machine. You have to worry about your own replication. If failover and management is abstracted out for you, as in RDS, then it's easy to live with - otherwise very painful.

tl;dr go with managed DBs if you just need a DB and don't particularly have to optimise query performance outside of DB config.


Availability is one; most distributed systems also replicate.


Yup, if you have infinite time Hadoop will fit the bill nicely. But looking at the cost/skills to operate, maybe Elasticsearch is still the better offering.


Agreed, now that JSON/JSONB support is so good in postgres and MySQL, I see less and less of a reason for the NoSQL databases of yesteryear.

There was a really good post from Martin Fowler a while back arguing that the popularity of "NoSQL" was really because it was "NoDBA" - app devs could sidestep the bottleneck of needing to get DBAs involved whenever you needed to persist an extra object field. While it's easy to abuse JSON storage in Postgres, for things that are really just "opaque objects" rather than relational properties, appropriate use of JSON columns can save a ton of unnecessary schema-update overhead.
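As a rough illustration of the pattern with psycopg2 (the events table and its fields are hypothetical; only user_id stays relational, while the payload is the opaque object):

    import psycopg2
    from psycopg2.extras import Json

    conn = psycopg2.connect("dbname=app")   # hypothetical connection string
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events (
            id      bigserial PRIMARY KEY,
            user_id bigint NOT NULL,  -- relational: filtered and joined on
            payload jsonb  NOT NULL   -- opaque app object: no migration when fields change
        )""")
    cur.execute("INSERT INTO events (user_id, payload) VALUES (%s, %s)",
                (42, Json({"action": "login", "device": "ios", "beta": True})))
    cur.execute("SELECT payload->>'device' FROM events WHERE user_id = %s", (42,))
    print(cur.fetchone()[0])   # 'ios'; a GIN index on payload would keep such lookups cheap
    conn.commit()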


True. However, the top 1% of users have 1000+ TB data warehouses where Postgres or MariaDB is not an option. These use cases do not require ACID/OLTP, though. This is why projects like Presto thrive. I think the next obvious leap for data management is bridging the OLTP/OLAP gap and having the same database provide both, using the same query engine and different storage engines. Moving data from OLTP systems to OLAP has always had its challenges; many companies, OSS projects, etc. wanted to solve it, with mixed results.


I manage about a petabyte of data using Citus as the distributed query engine on top of Postgres nodes. It's nice because you can query with Postgres, but there are a decent number of sharp edges at scale.


If you want distributed DBs built on top of boring old SQL databases, there is always https://vitess.io/. I've not played with it too much myself, but it's been tried and tested by big companies (it was originally built at YouTube), so it's worth a try.


Since about 2013 I've been on and off keeping an eye on Michael Stonebraker and his work on VoltDB (based on lessons learned from H-Store) and SciDB, and in general his criticisms of NoSQL and NewSQL variants.

I think SciDB, being a column-oriented DBMS geared for multi-dimensional arrays (datacubes), is very interesting given current trends, and there are only a handful of similar DBs around; the other two that interest me are rasdaman and MonetDB. I don't know if OmniSciDB counts as a datacube DB, but it is also really interesting, especially due to its GPU and caching model.

As a sysadmin/ops type having to deal with monitoring, time-series DBs are also something I like to keep an eye on. It used to be mostly rrdtool in this space, but now I am comparing Prometheus and InfluxDB. Like OmniSci, another case of something that's not quite in the same DB model but might be an even better solution for the space (metrics) is Apache Druid (Elasticsearch being another that can be massaged to fit as a quasi-TSDB as well). I think there is some room to unify the monitoring/metrics and log storage arenas into one space (usually they are separated, which adds admin overhead), and right now I really like Druid as a potential for this.

Another interesting application of time-series DBs that I have been keeping an eye on is in the quant/algo trading area. Most people have been using kdb+ there, but many are looking for replacements and there are some really good conversations to be found about the kind of limitations they are hitting.

I'm just a sysadmin who likes to keep up with what's going on, and my DB knowledge is limited, but I do have a process for narrowing my focus to the DBs I reference. It must be open source, bonus points for GPL or Apache licenses. The language it is written in is important but not a deal breaker (very tired of so many Java-based DBs). I don't like it when they are tacked on top of an "older" tech (such as KairosDB on Cassandra, TimescaleDB on top of Postgres, OpenTSDB on top of HBase, Kudu on Hadoop, etc.). Being either filesystem-aware or agnostic can be a nice feature (playing well with Ceph, Lustre, etc.). Not saying this is the sort of selection criteria others should use, just giving some info on mine.

A few more interesting mentions: clickhouse, gnocchi, marketstore, Atlas (Netflix), opentick (on top of foundationdb).


One of the main pain points I keep hearing about kdb is the complexity and its consequences in terms of time spent trying to do anything, and money spent on consultants. The language is efficient and you can write pages of Java with 2 lines of code (they keep bashing Java for some reason). But it takes very specialized talent to get there. I heard that Bitmex was paying up to 6 figures to find consultants in London and have them move to HK which illustrates the degree of key-man risk inherent with kdb.

Performance is good but comes at a massive usability cost. kdb+ brags about very succinct commands and even error messages, but the small gain in efficiency from one less character in the error message comes at the detriment of the user trying to work with it.

QuestDB (http://questdb.io) is an alternative to kdb as a time-series data store (disclosure: I am one of the authors). We left trading to build it and show you can get the same speed without sacrificing usability. Users get better performance than kdb+, but can use SQL and get ACID transactions. It is already in production with a few exchanges and hedge funds, and it is open source!

Hopefully, kdb users looking for a replacement find questdb helpful for their projects :-)
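For a rough idea of what that looks like from Python: QuestDB speaks the Postgres wire protocol, so a stock driver works (the connection details below are the documented defaults as far as I recall, and the trades table with its designated timestamp column is hypothetical):

    import psycopg2

    # QuestDB exposes the Postgres wire protocol on port 8812 by default, so a stock
    # Postgres driver works; credentials are the documented defaults, and the trades
    # table (with a designated timestamp column ts) is hypothetical.
    conn = psycopg2.connect(host="localhost", port=8812,
                            user="admin", password="quest", dbname="qdb")
    cur = conn.cursor()
    # SAMPLE BY is QuestDB's time-series SQL extension, e.g. 15-minute average buckets:
    cur.execute("SELECT ts, avg(price) FROM trades SAMPLE BY 15m")
    for ts, avg_price in cur.fetchall():
        print(ts, avg_price)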


That's pretty cool! Both QuestDB itself and that you decided to open source it. Thanks for the response.


Just a small note: Kudu shares no code or provenance with Hadoop. The only commonality is the ecosystem (e.g. you can use Spark or Impala to query it).


Nice, I didn't realize, guess it needs to go into the "reevaluate" list then. Sometimes Apache products tend to blend together in my mind and I get their capabilities confused/conflated.


Agree. Boring Tech is Good. Postgre + Redis / Memcached is probably good enough for 99% of use cases. (I still wish Postgre made shading easier.)

RethinkDB, CockroachDB and FoundationDB are worth keeping an eye on.


> I still wish Postgre made shading easier

Yeah, MySQL is ahead of the curve there...

https://demozoo.org/productions/268459/


ROFL, I was wondering what was going on there for a moment. Turns out "sharding"... is not a word and gets auto-corrected to "shading".


RethinkDB is pretty much dead at this point I think?


They released their new version a few months ago I think. Along with a blog post explaining why it took so long.

Still, I can only be carefully optimistic.


The thing is, most applications don't really have related data across regions. So there really is no need for distributed databases for most of the use cases, which can actually be solved by sharding. Also, most applications already gracefully handle DB failures by failing over to stand-by replicas, which PgSQL and MariaDB already provide.

However, I do think the key innovations are in building control planes around existing relational and NoSQL databases for scaling/sharding them across a set of resources to minimize cost while meeting performance and availability constraints.


> The only somewhat recent new DB that comes to mind which had a really interesting model was RethinkDB

In what way was it interesting? It was a document DBMS that supported MVCC.


I’m excited for the time when databases are built assuming that I/O is no longer the primary bound for distributed deploys, and multi-node by default deploys are a thing :)


Yeah, with the advent of byte-addressable NVM... I think we have to rethink a lot of stuff, and I'm sure we can get rid of a lot of things which aren't needed anymore or should be replaced with lightweight components. I'm trying to achieve some of this with https://sirix.io. However, I hope more and more people will get involved over time as it's of course completely Open Source.


This is a database for specialized workloads and not a general purpose DB like Postgres or MariaDB. This offtopic thread could derail discussion on an interesting topic.


A truly novel competitor to PostgreSQL is Datomic. Datomic takes the ideas of the Clojure programming language (immutability, functional programming, namespaced data) and extends them through to an ACID database.


Datomic is great but their focus on the paid AWS version limits its potential.


> can handle most workloads.

They really shine for read-heavy workloads that can tolerate a stale read every once in a while. If on top of that you have reasonable shard-ability, you get near-infinite scalability.

While that might cover a large portion of the database usage landscape, I'd hesitate to call it most. There's a reason OLTP was coined as an acronym--it's a pattern that comes up a fair bit.


You must not be doing a lot of OLAP or very highly concurrent workloads.

Netezza, Informix, KDB and others will all handily outperform open-source DBs.

All open-source DBs are absolute trash for time-series and other OLAP-type queries.


ClickHouse is open source, and it's definitely not trash for OLAP and time series.


ClickHouse could barely complete a 15-minute moving average; last time I checked it required a very slow correlated subquery. That's pretty much where I stopped evaluating it.

Edit: after looking it up again, it looks like that is still the case, and you have to be fairly limited with cumulative aggregates if you want to keep performance. Maybe someday, but as of now, still not very good.


We used ScyllaDB at my last employer to store time-series data and it worked quite well, though you do have to add a lot of supporting code; it doesn't come out of the box for free.


ScyllaDB isn't really OLAP - it's mostly just a glorified logger that throws stuff on disk. It is a terrible TSDB when you really need a TSDB - like basically all open-source DBs. We're looking at Timescale right now, but basically as the best of the worst. If I had the budget there's no way I'd be using it.


We recommend running KairosDB on top of Scylla for TSDB use cases.


Great work and very interesting ideas. I'm working on a versioned database system[1] which offers similar features and benefits:

    - storage engine written from scratch
    - completely isolated read-only transactions and one read/write transaction concurrently with a single lock to guard the writer. Readers will never be blocked by the single read/write transaction and execute without any latches/locks.
    - variable sized pages
    - lightweight buffer management with a "kind of" pointer swizzling
    - dropping the need for a write ahead log due to atomic switching of an UberPage
    - rolling merkle hash tree of all nodes built during updates optionally
    - ID-based diff-algorithm to determine differences between revisions taking the (secure) hashes optionally into account
    - non-blocking REST-API, which also takes the hashes into account to throw an error if a subtree has been modified in the meantime concurrently during updates
    - versioning through a huge persistent and durable, variable sized page tree using copy-on-write
    - storing delta page-fragments using a patented sliding snapshot algorithm
    - using a special trie, which is especially good for storing records with numerically dense, monotonically increasing 64-bit integer IDs. We make heavy use of bit shifting to calculate the path to fetch a record (see the sketch after this list)
    - time or modification counter based auto commit
    - versioned, user-defined secondary index structures
    - a versioned path summary
    - indexing every revision, such that a timestamp is only stored once in a RevisionRootPage. The resources stored in SirixDB are based on a huge, persistent (functional) and durable tree 
    - sophisticated time travel queries
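To make the trie point concrete, here's an illustrative sketch (not the actual SirixDB code; the fan-out constants are just assumptions) of how bit shifting turns a dense record ID into the path through the page tree:

    # Illustrative sketch (not SirixDB's actual code): locate the leaf page holding a
    # record with a dense 64-bit ID by peeling fixed-width chunks off the key, most
    # significant first, so each chunk selects one child reference per trie level.

    RECORDS_PER_LEAF_EXP = 10   # assumption: 2^10 = 1024 records per leaf page
    FANOUT_EXP = 10             # assumption: 2^10 child references per inner page
    LEVELS = 4                  # assumption: fixed height of the inner trie

    def page_path(record_id: int):
        """Return (child slot per inner level, offset of the record in its leaf page)."""
        offset = record_id & ((1 << RECORDS_PER_LEAF_EXP) - 1)
        leaf_key = record_id >> RECORDS_PER_LEAF_EXP
        slots = [(leaf_key >> (level * FANOUT_EXP)) & ((1 << FANOUT_EXP) - 1)
                 for level in range(LEVELS - 1, -1, -1)]
        return slots, offset

    print(page_path(123_456_789))   # ([0, 0, 117, 755], 277) - pure bit math, no search
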
As I'm spending a lot of my spare time on the project and would love to spend even more time, give it a try :-)

Any help is more than welcome.

Kind regards Johannes

[1] https://sirix.io and https://github.com/sirixdb/sirix


> - completely isolated read-only transactions and one read/write transaction concurrently with a single lock to guard the writer. Readers will never be blocked by the single read/write transaction and execute without any latches/locks.

> - variable sized pages

> - lightweight buffer management with a "kind of" pointer swizzling

> - dropping the need for a write ahead log due to atomic switching of an UberPage

LMDB made those same design choices and is extremely fast/robust.


In my particular case it was also a design decision, made back in 2006 or 2007 already. It's designed for fast random reads from the ground up due to the versioning focus (reading page-fragments from different revisions, as it just stores fragments of record-pages). I'll change the algorithm slightly to fetch the fragments in parallel (this should be fast on modern hardware, that is even on SSDs and, in the future, for instance with byte-addressable non-volatile memory).


For some context, this project is from one of the leading research groups in high-performance main-memory OLAP databases. Neumann’s 2011 paper, in particular, basically invented the modern push-driven operator-collapsing approach to query compilation.


This is a tidy and thoughtful database architecture. The capabilities and design are broadly within the spectrum of the mainstream. At this point in database evolution, it is well established that sufficiently modern storage architecture and hardware eliminates most performance advantages of in-memory architectures. However, many details of the design in the papers indicate that this database will not be breaking any records for absolute performance on a given quantum of hardware.

The most interesting bit is the use of variable size buffers (VSBs). The value of using VSBs is well known -- it improves cache and storage bandwidth efficiency -- but there are also reasons it is rarely seen in real-world architectures, and those issues are not really addressed here that I could find. Database companies have been researching this concept for decades. If one is unwilling to sacrifice absolute performance, and most database companies are not, the use of VSBs creates myriad devilish details and edge cases.

There are techniques that achieve high cache and storage bandwidth efficiency without VSBs (or their issues) but they are mostly incompatible with B+Tree style architectures like the above.


I think that with modern hardware, for instance the first byte-addressable NVM now available, variable-sized pages and buffers should in theory see more widespread use, and the read/write granularity will get more fine-grained in the next few years. As of now, however, I think Intel Optane memory still has to fetch 256 bytes at minimum.

However, variable sized pages also allow page compression.

Can you give us some links to the mentioned issues, and to the techniques that achieve high cache and storage bandwidth efficiency without VSBs?


I can explain it, the methods are straightforward. As with most things in database engine design, much of what is done in industry isn't in the literature.

The alternative to VSBs is for each logical index node to comprise a dynamic set of independent fixed buffers, with each buffer having an independent I/O schedule. This enables excellent cache efficiency because 1) space is incrementally allocated and 2) the cache only contains the parts of a logical node that you actually use. References to the underlying buffers remain valid even if the index node is resized. Designs vary, but 8 to 64 buffers per index node seems to be the anecdotal range. The obvious caveat is that storage structures that presume an index node is completely in buffer, such as ordered trees, don't work well. Since some newer database designs have no ordered trees at all under the hood, this is not necessarily a problem. There are fast access methods that work well in this model.
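A rough sketch of that shape, purely to illustrate the idea (the buffer size and structure are assumptions, not any particular vendor's design):

    BUF_SIZE = 8 * 1024   # assumption: 8 KiB fixed physical buffers

    class LogicalNode:
        """A logical index node made of independently cacheable fixed-size buffers."""

        def __init__(self):
            self.buffers = []   # grows incrementally, e.g. 8-64 buffers in practice

        def append(self, record: bytes):
            # Allocate a new fixed buffer only when the current one is full, so space
            # grows incrementally and the cache can hold just the buffers actually used.
            if not self.buffers or len(self.buffers[-1]) + len(record) > BUF_SIZE:
                self.buffers.append(bytearray())
            self.buffers[-1] += record
            # Each bytearray stands in for a buffer with its own I/O schedule; growing
            # the node adds buffers but never invalidates references to existing ones.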

The main issue with VSBs is that it is difficult to keep multiple references to the page consistent, some of which may not even be in memory, since critical metadata is typically in the reference itself. A workaround is to only allow a single reference to a page, but this restriction has an adverse impact on some types of important architectural optimization. The abstract objective makes sense, but no one that has looked into it has come up with a VSB scheme that does not have these tradeoffs for typical design cases. That said, VSBs are sometimes used in specialized databases where storage utilization efficiency (but not necessarily cache efficiency or performance) is paramount, though designed a bit differently than Umbra.

The reason to use larger page sizes, in addition to being more computationally efficient, is that it gives better performance with cheaper solid-state storage -- storage costs matter a lot. The sweet spot for price-performance is inexpensive read-optimized flash, which works far better for mixed workloads than you might expect if your storage engine is optimized for it. Excellent database kernels won't see much boost from byte-addressable NVM and people using poor database architectures don't care enough about performance to pay for expensive storage hardware, so it is a bit of a No Man's Land.


Could you elaborate on the problems with VSBs or perhaps point me to a paper that discusses them in detail?


I love seeing this: there are massive opportunities to build fundamentally differently architected databases based on evolving computer architectures (RAM, persistent RAM, GPUs, heck - even custom hardware) as well as an improved understanding of ACID in distributed environments. SQL remains an important API :)


ACID GPU RDBMS exist.

https://sqream.com


PostgreSQL has some GPU (& NVMe) support already through an extension: https://github.com/heterodb/pg-strom


From the paper.

> and subsequently allow the kernel to immediately reuse the associated physical memory. On Linux, this can be achieved by passing the MADV_DONTNEED flag to the madvise system call.

Shouldn't this be MADV_FREE? This instantly reminded me of this classic Bryan Cantrill talk https://youtu.be/bg6-LVCHmGM?t=3529

Edit: It seems that the Linux behavior is relied upon? From later in the paper.

> Note that it is even legal for a page to be unloaded while the page content is being read optimistically. This is possible since the virtual memory region reserved for a buffer frame always remains valid (see above), and read accesses to a memory region that was marked with the MADV_DONTNEED flag simply result in zero bytes. No additional physical memory is allocated in this case, as all such accesses are mapped to the same zero page
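The zero-read behaviour is easy to poke at from user space. A small Linux-only sketch (Python 3.8+ for mmap.madvise, using a private anonymous mapping, which is the zero-fill case):

    import mmap

    PAGE = mmap.PAGESIZE

    # Private anonymous mapping: the case where MADV_DONTNEED zero-fills on next touch.
    buf = mmap.mmap(-1, 4 * PAGE, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    buf[:4] = b'data'
    print(buf[:4])                     # b'data'

    # Tell the kernel it may reclaim the physical pages right away.
    buf.madvise(mmap.MADV_DONTNEED)

    # The virtual range stays valid and reads now observe zero bytes - the property
    # optimistic readers racing with page eviction depend on. MADV_FREE (Linux 4.5+)
    # is the lazier variant from the talk: reads may still return the old contents
    # until the kernel actually reclaims the pages.
    print(buf[:4])                     # b'\x00\x00\x00\x00'
    buf.close()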


> It is a drop-in replacement for PostgreSQL.

Well, that's a bold claim, as pg speaks one of the richest SQL dialects out there. And does it also mean it supports the pg WAL protocol?

The product is backed by solid research, so I suppose that there must be some powerful algorithms built-in, with a good coupling with hardware [1].

So the last question is how the code is made and tested, because good algorithms are not enough for having a solid codebase. pg+(redis/memcached) is battle-tested.

It seems to share some ideas with pg, such as query JIT compilation, but mixes them with another approach.

> Umbra provides an efficient approach to user-defined functions.

possible in many languages using pg.

> Umbra features fully ACID-compliant transaction execution.

A Jepsen test, maybe?

Didn't find anything about the clustering part either.

[1] http://cidrdb.org/cidr2020/papers/p29-neumann-cidr20.pdf


Perhaps I am a bit slow, but could someone else with better understanding ELI5 what benefits this provides over Postgres?

I would really appreciate it.

The only bit I really understood was:

    The system automatically parallelizes user functions
Now granted, I only understand how DBs work from a user-facing side so that might be a barrier here.



Many of the innovations in DB research are for people with very large and very analytic workloads. That is not most users on HN, who often rely heavily on point updates and queries.


I'm still surprised that the industry barely knows about ClickHouse. Very few times have I had the impression of adopting a game-changing technology, and that's the case with ClickHouse. We currently only use it for analytical purposes, but it has proven to be a very valid solution for log storage or as a time-series DB. I already have it in my roadmap to migrate ElasticSearch clusters (for logs) and InfluxDB to ClickHouse.


Does ClickHouse already have inverted index capabilities, or how are you going to search for logs containing "error"? Is LIKE's performance going to be enough? Or is that not the case for you?


Amazing! I wonder if this is going to be acquired in a similar way to HyPer. Commercialization of HyPer took a lot of resources; I wonder what state Umbra is in.


> Commercialization of HyPer took a lot of resources

Do you have any more information on this? I saw HyPer had been acquired by Tableau, and assumed it was a finished product they bought.


Thomas Neumann told me in person that they will not sell Umbra.


Isn't that exactly what the Instagram founders said?

I'm perfectly willing to believe that they have no intention of selling. But that's really not a promise one can easily make. Even if you're capable of withstanding the allure of whatever large sum someone is offering, it's always possible to be faced with a choice of selling or shutting down, or selling or not being able to afford your spouse's/child's/own sudden healthcare needs.


Hi, just a quick note that your comment about the internet in the thread about Turkey is dead (shadowbanned) despite being relevant. You should contact the HN team at hn@ycombinator.com


You can also resurrect people's auto-mod-hidden comments (frequently new users, especially with links) by clicking the time to get to the comment's page, and clicking "vouch". (Needs >30 karma.)


Cool, I never knew that. Thanks!


Can it be open sourced?


I’m surprised it isn’t. AFAICS a lot of the initial development was done by students and thus paid for by German taxpayers. Why isn’t it open source? I am confused.


Anyone know of any benchmarks or specific features this has over other DBs? "Built for in-memory speed" might as well say "web scale."

That browser based query analyzer is cool.


The innovation has happened in cloud database projects: Dynamo, Redshift, Cosmos, BigQuery, etc. They don't publish their code, but there is plenty happening under the covers. At this point, I think anyone who acquires machines and installs software has a desire for pain and isn't making a sensible business trade-off - unless you are an infrastructure company or are operating at high scale.


> Drop in replacement for PostgreSQL

Well that's impressive. Can I just drop this into my test suite and get a mega speed improvement? Could be worth it.


Where’s the source code? It’s open source, I guess?



That group has been doing interesting and industry-relevant work for a long time. Not surprised they're trying to commercialize it as existing databases didn't really pick it up.


According to the link, it's by the same people.


Same authors


Hyper, which was created by the same group, can now be used for free with the Tableau Hyper API https://help.tableau.com/current/api/hyper_api/en-us/index.h...

I especially like the super fast CSV scanning!


This thread makes it pretty clear to me that managed DB services from the cloud providers are a Very Good Idea on their part.


One area where RDBMS development has forgotten to work is becoming a real contender for the Access/dBase family.

You will see a lot of people chasing the "Facebook wanna-be" kind of workloads.

I work with small/medium companies (or ones that are big, but with < 1 TB of data). I bet 90% can't get past the first stages of data manipulation:

- Most (all?) RDBMSs have the same datatypes, meaning: use of nulls (bad), no algebraic types (sad), very unfriendly means to model data/business logic.

- They offer only SQL, which is impractical for anything bigger than basic queries. I worked with FoxPro before: I could do ALL THE APP with it, including GUIs, reports, etc. So to say SQL is disappointing is to say the least.

- All the engine stuff is a black box. That is great, until you wanna do your own index, store columnar data, save arrays or text or whatever you want, plug into the query executor and do your own stuff, etc.

You know, if you have JS and I tell you you can't code your own linked list, you will ditch it quickly. Sometimes, if the DB engine allowed plugging into the storage, I could store my graphs for the one time I need it, instead of hacking around putting them in tables or, worse, bringing in ANOTHER DB engine to make my life hard.

Wait! Why?

All that stuff that some put in the middleware, models or controllers? With any other product you would reject the tool if you couldn't do this, but RDBMSs FORCE you to use something else to finish it, despite the fact that most of it runs in the same box.

- Everyone adds their own auth tables/logic, because the one implemented in the RDBMS is for a use case that doesn't exist anymore. Then they do it wrong, of course.

- We are in the HTTP world, but RDBMSs need something else for that.

- Import/Export data is still SAD.

- Importing/exporting data outside the few half-baked options (like CSV, which a lot of the time you'd better do with Python) is impossible in most. Using foreign adapters could work, yet you need to step out of the black box, get the adapter, compile it, install it, then use it. You need to become a C++ developer, despite intending to be a SQL one.

That is made worse because:

- There is no "RDBMS" package manager.

Look how great it would be if you could just:

    db install auth-jwt
    db install csv-import
and be done. Like everyone else.

- Making forms and reports. You can bet any company (even users!) would kill for their DBs letting them create reports and forms. Yep, like Access. Yep, that is why Access is still a thing despite having the weakest DB engine in town.

- You want to be able to send emails, connect to external APIs, call system services, etc.

But why? Isn't that problematic? Well, if JS (a language FOR THE BROWSER) allows it, why not your DB? I lived in that world before (FoxPro) and it worked GREAT.

---

A lot of this is because RDBMSs are looked at from too narrow a point of view. It's crazy: people use half-finished NoSQL engines and are happy building their own query engines, yet talking about doing anything other than SQL (or plugging into the query parser to enhance it) will sound crazy to some.

RDBMSs got leapfrogged by NoSQL because, until very recently, they were stuck in a mindset and the use cases of the '80s.

NOTHING says your RDBMS must be like all the others.

Broadening the view is what, I think, RDBMSs need to do to get reinvigorated, and considering they are also performant, and very good, they would be set to conquer the world!


Isn't SQLite doing the same???



