Modern Data Lakes Overview (developer.sh)
116 points by developersh on Feb 23, 2020 | 62 comments



Having spent the last ~8 months at my work grappling with the consequences and downsides of a Data lake, all I want to do is never deal with one again.

Nothing about it was superior, or even on par with, simply fixing the shortcomings of our current OLAP database setup.

The data lake is not faster to write to; it’s definitely not faster to read from. Querying using Athena/etc was slow, painful to use, broke exceedingly often and would have resulted in us doing so much work stapling in schemas/etc that we would have been net better off to just do things properly from the start and use a database. The data lake also does not have better access semantics and our implementation has resulted in some of my teammates practically reinventing consistency from first principles. By hand. Except worse.

Save yourself from this pain: find the right database and figure out how to use it, don’t reinvent one from first principles.


Completely agree with you. Data lakes were marketed well because, well... data warehousing is hard, and a lot of work. Data lakes don't make that hard work disappear; they just change how and where it happens.

I've found data lakes complement DWs (in databases) well. Keep the raw data in the lake and query it as needed for discovery, and load it into structured tables as business needs arise.

Data lakes alone are doomed to be failures.


I don't think anyone ever suggested that. The use case for a data lake is precisely the one you describe: it allows you to start collecting data without having to do a lot of work ahead of time, before you know how you actually want to structure things. It allows for schema evolution too. It's not a panacea, it's just a way to avoid the inertia most large data projects have.


Nobody here suggested it, just something I see organizations doing quite often.

(edit: the rationale behind this tends to be that you can avoid the heavy lifting of ETL/transformation logic by just using a data lake - obviously not the case, as most of us know)


I've worked on nearly a dozen Data Lakes. I have never seen nor heard of anyone who said that Data Lakes meant you could avoid ETL. If anything it has necessitated more of it as users expect to join these disparate data sets.

There is, after all, a reason that the Data Engineer role became popular just as Data Lakes became popular.


Just means we have different anecdotal experience, then. Very little of mine has been in the tech industry.


No. Data lakes were marketed well because they are significantly cheaper and solve long standing problems.

S3 is basically free and has unlimited scalability. Oracle, DB2, HANA, SQL Server etc are ridiculously expensive and struggle under high concurrent load even with QoS in place.


S3 != a data lake.

If you're able to solve the problems that you were previously using Oracle or SQL Server for with S3, more power to you, but the truth is that to replicate the functionality of that old Oracle server you'll start with S3, but you'll also want some querying (Aurora? RDS? HBase?), probably some analytics and ingestion (Redshift? Kinesis? Elastic? Hive? Oozie? Airflow?), along with some security now that you've got multiple tools interacting (Ranger? Knox?), probably some load balancing (Zookeeper?), maybe some lineage and data cataloging (Atlas?), etc.

In my experience what starts with "Just throw some data in S3, forget that old crusty expensive server!" ends with 22 technologies trying to cohesively exist because each one provides a small but necessary slice of your platform. Your organization will never be able to find one person who is an expert in all of these (on the contrary, you can find an Oracle, or DB2, or SQL Server expert for half the money) so you end up with seven folks who are each an expert in three of the 22 pieces you've cobbled together, but they all have slightly different ideas on how things should work together, so you end up with a barely functioning platform after a year's worth of work because you didn't want to just start with a $400k license from Oracle.


Not sure what you are talking about.

If you have S3 you can use Athena, Redshift Spectrum or Spark as query layer. It's not 22 technologies.

You don't need ElasticSearch, Ranger, Knox, Zookeeper etc as they have nothing to do with querying.
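
For a concrete feel of the Spark option, a minimal PySpark sketch (the bucket path and column names are made up, and it assumes the hadoop-aws/s3a connector is on the cluster): Parquet in S3 queried with plain SQL, no database in the loop.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("s3-query-layer").getOrCreate()

  # Read Parquet straight out of S3 and register it as a temporary SQL view.
  events = spark.read.parquet("s3a://my-data-lake/events/")
  events.createOrReplaceTempView("events")

  spark.sql("""
      SELECT event_date, COUNT(*) AS n
      FROM events
      GROUP BY event_date
      ORDER BY event_date
  """).show()

Athena and Redshift Spectrum give you roughly the same thing without running a cluster, by pointing a Glue/Hive table definition at the same files.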


But then it's far from basically free. Even overpriced Oracle databases can end up cheaper than locking into AWS in these cases (my experience).


I think the presumption that's differing here is query workload.

An OLAP database is, in the default case, an always-online instance or cluster, costing fixed monthly OpEx.

Whereas, if your goal in having that database is to do one query once a month based on a huge amount of data, then it will certainly be cheaper to have an analytical pipeline that is "offline" except when that query is running, with only the OLTP stage (something ingesting into S3; maybe even customers writing directly to your S3 bucket at their own Requester-Pays expense) online.


My biggest problem with Oracle is not the database itself. There is no doubt that Oracle is a fine piece of software: it is bulletproof and has decades of experience built into it.

My problem is the scalability and elasticity of its licensing model. It doesn't meet the needs of today's analytics without spending enormous amounts of money up front.


Nope. One can start easily with Airflow+Spark(EMR)+Presto+S3 and get about 80% of what you'd get from your run-of-the-mill Oracle database. At a fraction of the price, without half the headache in procurement, licensing or performance tweaking. And with better scalability.

You'd be looking at $M in licenses for anything half-serious based on Oracle tech. Becoming good at replacing Oracle stuff has probably been one of the best-paying jobs for a while.
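
For a rough idea of what the glue looks like, a minimal Airflow sketch (assuming Airflow 1.10-era import paths; the job path and schedule are invented) that kicks off a daily Spark job whose Parquet output Presto can then query:

  from datetime import datetime

  from airflow import DAG
  from airflow.operators.bash_operator import BashOperator  # Airflow 1.10-style import

  with DAG("daily_etl",
           start_date=datetime(2020, 1, 1),
           schedule_interval="@daily",
           catchup=False) as dag:
      # Submit a Spark job (e.g. on EMR) that turns raw S3 data into partitioned Parquet.
      transform = BashOperator(
          task_id="spark_transform",
          bash_command="spark-submit s3://my-bucket/jobs/transform.py --ds {{ ds }}",
      )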


They _appear_ to solve a bunch of problems by simply punting them down the road into downstream applications.

None of the databases you listed there are OLAP databases.

Clickhouse, TiDB, Redshift, Snowflake, etc are significantly more suitable and should be the target of comparison here.


S3 is just storage. It doesn't provide any querying, crawling, metadata, provenance, or other details required for data at scale.

That's why AWS has entire product suites from Athena, Redshift Spectrum, Data Lake Formation, Glue, etc to help companies actually do something with the files stored in S3. And it's often a mess compared to just fixing their processes and ingesting it properly into a SQL data warehouse first.


For smaller use cases data lakes probably don't make sense.

But data lakes have arisen from the enterprise where the centralised data warehouse was the standard for the last few decades. They know how to use a database. They know how to model and schema the data. And they know about all of the problems it has. They didn't buy into the data lake concept because it's trendy.

Fact is that for large enterprises, and for those with problematic data sets (e.g. telemetry), databases simply don't scale. You will always have priority workloads, e.g. reporting, during which time users and non-priority ETL jobs come second. And often Data Science use cases are banned altogether.

The reason data lakes make sense is that they offer effectively unlimited scalability. You can have as many crazy ETL jobs, inexperienced users, and Data Scientists as you like, all reading/writing at the same time with no impact.

Generally you want a hybrid model. Databases for SQL users and data lake for everything else.


> Generally you want a hybrid model. Databases for SQL users and data lake for everything else.

I do a mix of data science and software engineering; dealing with the datalake is a nightmare and I avoid it at almost all costs.

You know what the first thing everyone I worked with wanted to do after pointlessly pouring everything into the black hole that was the datalake? Re-implement some kind of SQL (and database semantics) back on top of it again; except now it's worse.


This doesn’t make any sense in the context of tools like Snowflake and BigQuery, where the allocation of compute is separated from the data itself. You can scale each of these use cases independently without cross-domain impact.

The data lake model seems to be more about not wanting to commit to a warehouse (for example: future proofing, looking at non-relational data, etc.).


> The reason data lakes make sense is that they offer effectively unlimited scalability. You can have as many crazy ETL jobs, inexperienced users, and Data Scientists as you like, all reading/writing at the same time with no impact.

Eh, almost all Data Lakes cannot handle small files well. All it takes is for someone to write 100 million tiny files into the Data Lake to make life miserable for everyone else.


So don't write small files?

Every time I've seen someone do this, it was a mistake and quickly resolved. Either you have way too many partitions in a Spark job or you are treating S3 like it's a queue. And if you really do need lots of delta records, then simply have a compaction job.
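
A compaction job can be as small as this PySpark sketch (paths and the target file count are assumptions):

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("compaction").getOrCreate()

  # Read yesterday's thousands of tiny delta files and rewrite them as a handful of big ones.
  deltas = spark.read.parquet("s3a://my-data-lake/events/date=2020-02-23/")
  (deltas
      .coalesce(16)
      .write
      .mode("overwrite")
      .parquet("s3a://my-data-lake/events_compacted/date=2020-02-23/"))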


Well, inexperienced users/data scientists tend not to care about what they write out :)

Nevertheless, my point is that a Data Lake does not offer free unlimited scalability. It takes a lot of effort and good engineering practice to make a Data Lake run smoothly at scale.


Data scientists shouldn't be writing anything to the Data Lake. Data Lakes store raw datasets (sort of like Event Streaming databases store raw events.) In academic terms, they store primary-source data.

Once data has been through some transformations at the hands of a Data Scientist, it's now a secondary source—a report, usually—and exists in a form better suited to living in a Data Warehouse.

Data Lakes need a priesthood to guard their interface, like DBAs are for DBMSes. The difference being that DBAs need to guard against misarchitected read workloads, while the manager of a Data Lake doesn't need to worry about that. They only need to worry about people putting the wrong things (= secondary-source data) into the Data Lake in the first place.

In most Data Lakes I've seen, usually there are specific teams with write privilege to it, where "putting $foo in the Data Lake" is their whole job: researchers who write scrapers, data teams that buy datasets from partners and dump them in, etc. Nobody else in the company needs to write to the Data Lake, because nobody else has raw data; if your data already lives in a company RDBMS, you don't move it from there into the Data Lake to process it; you write your query to pull data from both.

An analogy: there is a city by a lake. The city has water treatment plants which turn lakewater into drinking water and pump it into the city water system. Let's say you want to do an analysis of the lake water, but you need the water more dilute (i.e. with fewer impurities) than the lake water itself is. What would you do: pump the city water supply into the lake until the whole lake is properly dilute? Or just take some lake water in a cup and pour some water from your tap into the cup, and repeat?


In my experience, that's easy to solve from an operations point of view, and it just takes a couple of easy-to-teach tricks to avoid it.

However, the scaling limitations of traditional RDBMS are insurmountable when trying to do things like data science, for instance.


Isn't it just a paradox to store infinite data, to use it later for very specific things without having to define it first?

It sounds like common sense not to "limit the potential of intelligence by enforcing schema on write", while in reality the same problem just shifts (or gets hidden) in the next steps.

For example: there are 10 data sources, each with 100TB of data. I aggregate these into my shiny new data lake with a fast adapter. Just suck it all in without any worries about schema. So now I have 1PB of semi-unstructured data.

How do I find the fields X and Y when these are all named differently in 10 sources? Can I even find it without having business domain experts for each data source? How do I keep things in sync when the structure of my data sources change (frequently)?

It seems like there is an underlying social/political problem that technology can't really fix.

Reminds me of the quote: "There are only two hard things in Computer Science: cache invalidation and naming things."


> Reminds me of the quote: "There are only two hard things in Computer Science: cache invalidation and naming things."

and off by one errors!


You're not necessarily ingesting semi-unstructured data. Common Data Lake file formats (Avro, Parquet, ORC) are in fact highly structured, and even self-describing in their schema, with format-features like schema evolution allowing sibling datasets produced at different times to have "different" schemas which nevertheless have a single defined schema as the output when the datasets are unioned together.

The idea, though, is that, if your Data Warehouse wants the data in the form of e.g. a daily-aggregate accounting ledger, then your data sources might be of various time granularities and might be denormalized in different ways (one source with separate Invoices with Transactions foreign-keyed to an Invoice; another with just Transactions with root-level metadata like timestamp directly on them; etc.)

All of the transformations between the source formats and the destination format here are, in some sense, "transparent"—a sufficiently-advanced DBMS query planner could generate an OLAP expression to turn one into the other without understanding the problem domain. It's precisely because of this that, in many cases, it's cheaper to not worry about these kinds of transformations until you need to compute on the data. It's just a bunch of trivial stuff, that you can easily normalize in the computation step, but where fixing it on ingest would have been a whole expensive cluster operation to rewrite terabytes of data, and would require the OpEx of a whole additional set of always-online Hadoop cluster-nodes to fix marginal data as it comes in. Even though you're just going to be touching it all again anyway when you run it through the compute step.
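
As a small illustration of that schema-evolution point, a PySpark sketch (paths and the extra column are assumptions): two Parquet datasets written at different times with slightly different columns still read back as one unioned schema.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  # month=2020-01 was written before the "channel" column existed, month=2020-02 after;
  # mergeSchema reconciles them, filling nulls where a column is missing.
  ledger = spark.read.option("mergeSchema", "true").parquet("s3a://lake/ledger/")
  ledger.printSchema()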


Not a bad read, but it's written from the perspective of large, mature operations. If your company is just starting out, the advice is actually there but not quite spelled out: use S3/GCS to store data (ideally in Parquet format) and query it using Athena/BigQuery.
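
A minimal sketch of that starting point (bucket and column names are made up; needs pyarrow plus s3fs or gcsfs installed):

  import pandas as pd

  orders = pd.DataFrame({"user_id": [1, 2, 3], "amount": [9.99, 4.50, 12.00]})

  # Write Parquet straight into object storage; Athena or BigQuery can then query it in place.
  orders.to_parquet("s3://my-startup-data/orders/dt=2020-02-23/orders.parquet", index=False)
  # For GCS, the same call works with a gs:// URL.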


Importantly, there are also open source tools out there. Especially if you're starting out, locking into AWS or GCP can quickly become extremely expensive and limiting. Setting up a vendor independent data lake isn't that much more work and can pay off quickly.


This depends on the skill set available and the goal of the company. My previous employer tried the open source route, but then the normal things happened - people left, documentation was lacking, new people preferred other tools, then those new people left eventually. After a few years, it was a tangle of half-done implementations and no one there fully understood how these worked. Committing to rolling your own really does mean committing. Maintenance is not cheap, so paying for part of it with “vendor lock-in” could be practical for some.

My comment was intended for those just starting out. If you don't really know what you are doing with data yet, it's best to focus on your core company objectives and not burn valuable engineering time on infra you can buy for now. Unless that data stack is your core business.


Fairly new to this topic and coming from a traditional RDBMS background. How do you go about deciding how many rows/records to store per object? And how does Athena/Bigquery know which objects to query? Do people use partitioning methods (e.g. by time or customer ID etc) to reduce the need to scan the entire corpus every time you run a query?


From the Google side: in traditional BigQuery, the answers to all three questions are related. You shard the files by partition key and put the key into the file name. You can filter the file name in the WHERE clause, and the query will skip filtered objects, but otherwise fully scan every object it touches.

There is apparently now experimental support for using Hive partitions natively. Never used it, literally found out two minutes ago.

The number of records per object is usually "all of them" (restricted by partition keys). The main exception is live queries of compressed JSON or CSV data, because BigQuery can't parallelize them. But generally you trust the tool to handle workload distribution for you.

This works a little differently if you load the data into BigQuery instead of doing queries against data that lives in Cloud Storage. You can use partitioning and clustering columns to cut down on full-table scans.
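
For the filename-sharding case, a sketch with the BigQuery Python client (project/dataset/table names are assumptions): external tables over Cloud Storage expose a _FILE_NAME pseudo-column you can filter on, so the query skips whole objects.

  from google.cloud import bigquery

  client = bigquery.Client()
  sql = """
      SELECT COUNT(*) AS n
      FROM `my_project.my_dataset.events_external`
      WHERE _FILE_NAME LIKE '%/dt=2020-02-23/%'
  """
  for row in client.query(sql).result():
      print(row.n)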


That’s basically how GA export worked from my previous work - everything in a session is nested. Upshot is basically what’s above - easy to filter and you don’t get partial data.

The catch is if you need to filter by a property of the session, you are opening every session in range to check if it’s the one you want. That gets expensive quickly and is a bit slow.

For data lakes, Parquet and Spark support fairly sane date partitioning. Partitioning by anything else is a question of whether you need it, such as a customer ID, etc., but remember this is a data lake, not a source table for your CEO's daily report. The purpose of the lake is to capture everything that you sanely can.
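
Concretely, that date partitioning is about one line in PySpark (paths and column names are assumptions):

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  events = spark.read.json("s3a://my-data-lake/raw/events/")   # raw landing zone
  (events
      .write
      .partitionBy("event_date")   # yields .../event_date=2020-02-23/part-*.parquet
      .mode("append")
      .parquet("s3a://my-data-lake/events/"))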

When you can’t store everything, usually due to cost, you then have to aggregate and only keep the most valuable data. For example in AdTech, real-time bidding usually involves a single ad request, hundreds of bid requests, a few bid responses and the winning bid. Value here is inversely related to size - bid requests without responses are useful for predicting whether you should even ask next time, but the winning bid + the runner up tell you a lot about the value of the ad request.

For structuring warehousing for reporting/ad hoc querying, to me the flatter the better - this uses the native capabilities of columnar stores and makes analysis a lot faster. Downside: good luck keeping everything consistent and up to date. Usually you end up just reprocessing everything each day/hour/whatever the need is, and at a certain point say no new updates to rows older than X.

The cool thing about modern data warehouses is that they include interfaces to talk to data lakes, so your analysts don't have to jump to different toolchains - such as Redshift Spectrum (which is basically Athena) and the aforementioned BigQuery ability to use tables, streams and files from GCP.

It’s an incredibly productive time to be working with all this! Even 10 years ago, you’d need a lot of budget and a team to just keep the lights on, today it’s all compressed into these services and software.


To summarize the answers below - it all depends on what you are trying to do. Data lakes are generally less structured than other things. They can also contain non-text things, like images and videos that can also be mined.

Sounds like you are thinking more of a data warehouse, which is structured data on an engine that's designed for querying large volumes of data. I'd recommend first starting with your objectives and then going for what solves them with the least amount of "stuff".

I don’t work on data warehousing or pipelines now, but when I did a year ago, AWS and GCP both offered great tools with slight differences, where AWS was a bit pricier to start, but focused on more predictable pricing and GCP was much cheaper with pay as you go, but you could get yourself in trouble with cost by not following their best practices.


If you're using AWS Athena for querying, you're also using the AWS Glue catalog (a managed, Hive-metastore-ish service) to know where partitions are, but yeah, you'll need to partition and sort your data to make sure you're not doing full table scans.
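
A sketch of that flow with boto3 (bucket, database and table names are invented): after new partition directories land in S3, register them in the Glue catalog so Athena only scans what the query needs.

  import boto3

  athena = boto3.client("athena", region_name="us-east-1")
  athena.start_query_execution(
      # Or ALTER TABLE events ADD PARTITION (...) to register a single new partition.
      QueryString="MSCK REPAIR TABLE events",
      QueryExecutionContext={"Database": "my_lake"},
      ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
  )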


Glue worked well for my previous gig, but honestly it felt like a bit of overkill. If you have a large org and a lot of tribal knowledge + new fields showing up out of the blue, yes, you need to organize and keep track.

If you are a relatively small operation, I'd recommend weighing the additional complexity against the benefits. Sometimes a few well-written pages can suffice; other times you need to make the investment.


First step is to figure out whether you actually need a datalake.

I'd recommend starting off with an OLAP database and going from there, reaching for a data lake once (and only once) you've reached the limits of the OLAP db.


Can you query parquet from bigquery without loading it into a table from gcs?

I've gotten pretty far with jsonl on gcs and bigquery - even some bigquery streaming for more real-time stuff.


If the data is in Cloud Storage, BigQuery can query it in-place without loading it. BigQuery calls this an External Data Source.

https://cloud.google.com/bigquery/external-data-sources

My biggest papercut with using this was having to make sure that all of the locations matched exactly.
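
A sketch of that in Python (URIs and names are made up), defining a temporary external table over Parquet in GCS and querying it in place:

  from google.cloud import bigquery

  client = bigquery.Client()

  ext = bigquery.ExternalConfig("PARQUET")
  ext.source_uris = ["gs://my-bucket/orders/*.parquet"]

  job_config = bigquery.QueryJobConfig(table_definitions={"orders_ext": ext})
  rows = client.query("SELECT COUNT(*) AS n FROM orders_ext", job_config=job_config).result()
  print(list(rows))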


Modern data warehouses (Snowflake, BigQuery, and maybe Redshift RA3) have incorporated all the key features of data lakes:

- The cost of storage is the same as S3.

- Storage and compute can be scaled independently.

- You can store multiple levels of curation in the same system: a normalized schema that reflects the source, alongside a dimensional schema that has been thoroughly ETL’d.

- Compute can be scaled horizontally to basically any level of parallelism you desire.

Given these facts, it is unclear what rationale still exists for data lakes. The only remaining major advantage of a data lake is that you aren’t subject to as much vendor lock-in.


Not being subject to vendor lock-in is huge in itself.

You can save plenty of money if you have the scale to move out of S3. That’s important because you can usually trade CPU for storage by storing data in multiple formats, optimized for different access patterns.

But mostly, the Hadoop ecosystem is very open. The tools are still maturing, and it's easier to debug open source tools than to deal with the generally poor support in most managed solutions.


Can you clarify what you mean by "if you have the scale to move out of S3"?

Why does it take scale to move out of S3? And I thought S3 was cheap, so how would moving out save money?


S3 is flexible and scalable, but it is not cheap. I'd be hard pressed to run the numbers now, but at some point it's cheaper to just do storage yourself.

But to be fair, you'll go on-premise due to the computing or bandwidth costs first. And you'll likely move data to the same datacenter to avoid expensive transfer costs.

I've also had to work in places where you simply could not put your data in the cloud due to regulatory reasons.


Amazon has to make a profit on selling services. You don’t have to make a profit providing services internally. There are certain inelasticities that both of you have to pay for: power, real estate, internet, etc. If you’re big enough, you can do it cheaper than Amazon.


S3 isn't cheap if used with other services. Either you use AWS for everything or you pay with bandwidth. It's cheap to get your data in, using it or getting it out isn't cheap at all.


Recently I saw the term "lakehouse" for applying data lake design on data warehouse technology. With a big house, you can have a lake inside.


Yes!!!! I wrote some words just on this topic recently!!

https://medium.com/@vtereshko/data-warehouse-storage-or-a-da...

(Pm on BigQuery)


So, let's say I have a DB of a million rows, anticipate having 100M rows of archived data, then adding 5M rows per year; each of my rows has some metadata and points to an image on the order of 10 gigapixels, in a bucket.

There is presently strong interest in associating this data with other DBs, of which I am aware of about 80, with a total of probably 500-1000 tables, along with some very old "nosql" b-tree datastores in MUMPS. There are new $10M+ projects coming online around the enterprise roughly every day.

Where would you start?


That's a hilariously small amount of relational data that your phone could probably handle with decent performance. I made larger databases than that back in 2005 on a single commodity server. I wouldn't be surprised to see PowerBI manipulating that in-memory on a desktop.

Microsoft SQL Server with Clustered ColumnStore tables would make practically all queries fast on that, especially if most queries are only for subsets of the data. PostgreSQL could probably handle that too, no sweat.

Also see "Your data fits in RAM": https://news.ycombinator.com/item?id=9581862 which would mean that you could do in-memory analytics of your relational data with SAP HANA or SQL Server if you really needed that kind of performance: https://docs.microsoft.com/en-us/archive/blogs/sqlserverstor...

You can spin up either SQL or HANA in the cloud or on Linux, so you don't even need Windows. Both can be connected to just about any other database you can name, often directly for cross-database queries. SQL 2019 is particularly good at virtualizing external data: https://docs.microsoft.com/en-us/sql/relational-databases/po...

10 gigapixel images are a completely separate problem. If you need individual images to be fast to view, you want some sort of hierarchical tiling like Google Maps does. If you're processing them with machine vision or something, then you want whatever makes the ML guys happy.

PS: I hope you're not working on DARPA's spy drone, because then please disregard everything I said and delete your data for the good of humanity: https://www.extremetech.com/extreme/146909-darpa-shows-off-1...


Parent responding: to be clear, I was not impressed by my own row count; if anything, I was trying to make it clear this would not be a burden for a traditional Postgres instance. I recall a Postgres user group I attended where a guy had been working on handling a billion writes per second (consulting for Cymer, if I recall). My whole dataset is less than 1 second worth of that guy's data. And since Cymer is in the photons business, I'm willing to bet they were downsampling heavily.

My question is more the specific mix of problems: a DB, a ton of image data, and other adjacent DBs that people want us to play with. How would you set that up?

I work on cancer, so, definitely not spy drones.


For this amount of data, I would use good old Postgres, partition the data by ingestion time, then just detach the old partitions when you need to archive it.

For joining data from multiple databases, if the data is large, I would use something like Presto (https://prestosql.io/) to join and process the data. But that's partly because we already had Presto clusters running.
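
A minimal sketch of that Postgres layout (table and column names are assumptions; needs PostgreSQL 10+ for declarative partitioning), with the images staying in the bucket and only their URIs in the rows:

  import psycopg2

  conn = psycopg2.connect("dbname=research")
  with conn, conn.cursor() as cur:
      cur.execute("""
          CREATE TABLE IF NOT EXISTS samples (
              id          bigserial,
              ingested_at timestamptz NOT NULL,
              image_uri   text,
              metadata    jsonb
          ) PARTITION BY RANGE (ingested_at)
      """)
      cur.execute("""
          CREATE TABLE IF NOT EXISTS samples_2020q1 PARTITION OF samples
          FOR VALUES FROM ('2020-01-01') TO ('2020-04-01')
      """)
      # Archiving later is just: ALTER TABLE samples DETACH PARTITION samples_2020q1;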


Keep in mind that commercial databases are still substantially better for bulk data performance than most open source offerings, and tend to have better compatibility with other commercial databases. E.g.: Microsoft and Oracle are competitors, but it's always going to be a certainty that you can connect them directly to each other.

Similarly, it's actually hard to beat MS SQL Server for OLTP workloads, especially at moderate (~1TB) scale or for ad-hoc queries that require parallelisation but not distribution to a cluster. In other words, it's great for "Medium Data".

It does actually scale to large clusters with the new SQL Parallel Data Warehouse: https://docs.microsoft.com/en-us/sql/analytics-platform-syst...

That's also available as an Azure service if you want to have a play: https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sq...

But realistically, distributed clusters are almost certainly not what you need. They're complex and slower for simple queries that could be answered by one box with a good indexing scheme. Just to reiterate: for large tables with hundreds of millions of rows, you want a modern, column-oriented database. I can't stress this enough: if you haven't yet played with SQL's ColumnStore, go spin up an instance in Azure or AWS and give it a go on one of your larger tables. It's crazy good. I've seen compression ratios of 50:1 and query performance improvements of 300:1 with basically zero hand-tuning of indexes or any such thing.

There's a reason people pay $10k+ per core for Enterprise SQL Server licensing. But hey, if you're penny-pinching on a $10M project, then as I said, MySQL and PostgreSQL will work. They're better at replication and clustering, and MySQL (only) is better at low latency for trivial queries. But they tend to be poor at connecting to commercial or otherwise quirky data sources. So then you'd probably have to layer something like Apache Drill on top: https://en.wikipedia.org/wiki/Apache_Drill


I had a lot of success with ClickHouse recently for tables that are 200+ million rows, and that was with the Log table engine, not the MergeTree one, so I would expect it to get even faster when we change.

It's very easy to set up, so you should be able to test it quickly to see if it fits your needs.
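
A quick sketch with the clickhouse-driver Python client (host and table are made up): the MergeTree engine sorts the data on disk, which is what makes date-range scans over hundreds of millions of rows fast.

  from clickhouse_driver import Client

  ch = Client("clickhouse-host")
  ch.execute("""
      CREATE TABLE IF NOT EXISTS events (
          event_date Date,
          user_id    UInt64,
          amount     Float64
      ) ENGINE = MergeTree()
      ORDER BY (event_date, user_id)
  """)
  print(ch.execute("SELECT count() FROM events WHERE event_date >= '2020-01-01'"))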


For a use case where we ingest hundreds of millions of data points into Hadoop, then run Spark ETL jobs to partition the data on HDFS itself, and then the next day have several million updates to data points from the last day(s): what would be recommended on a Hadoop setup? HBase? Parquet with Hoodie to deal with deltas? Or Iceberg? Or Hive 3?


First mention of hoodie here. I'm surprised.


User experience with any open-source software, especially in the distributed computing and storage domain, depends on the quality of your system administrators and data engineering teams. If you can't connect software produced by different companies properly, it is a pain to work with that software zoo. Most people who have used, for example, the AWS stack don't want to return to open source, because the Amazon team tests the interactions of their systems and writes proper config files, which inexperienced system administrators can't. Additionally, you shouldn't use distributed storage systems if you can get by with horizontal sharding on top of SQL storage systems.


Data lakes are a nightmare in terms of security. The Capital One breach happened partly because they just poured all their data into the lake, as does every other monkey in the data lake business. Role-based access control, zero trust, principle of least privilege, service account management in a data lake? Hahaha, nope, we don't do that here.

I will never trust a company that stores everything in one data lake; that's a major data breach just waiting to happen.


So where are we on Data Lakes vs NewSQL [1]?

[1]: https://en.wikipedia.org/wiki/NewSQL


Most “NewSQL” databases are designed for OLTP use cases (i.e. many small queries that do little aggregation). Data Lakes are optimized for OLAP (i.e. a smaller number of queries, but aggregating over large amounts of data).

As an example, Athena would do a terrible job at finding a specific user by its ID, while Spanner would behave just as poorly at calculating the cumulative sales of all products for a given category, grouped by store location (assuming many millions of rows representing sales).

Hope this analogy makes sense.


I think you're selling some of these "NewSQL" DBs short. TiDB/TiKV, for example, appears (I haven't personally used it yet) capable of supporting both OLTP and OLAP workloads due to some clever engineering and data structures behind the scenes.


TiDB relies on Spark to do analysis, using their TiSpark integration package. It's not built into the database but offers a smoother install than operating a Spark cluster separately.

The only "NewSQL" database that truly does OLAP+OLTP (now called HTAP) well is MemSQL, with its in-memory rowstores and disk-based columnstores.


(I'm a dev of TiDB so I might be biased.) Yes and no. The yes part is that TiDB still relies on TiSpark for large join queries, as well as for bridging to the big-data world; TiDB itself cannot shuffle data like an MPP database yet. On the other hand, TiDB without TiSpark is still comfortable with those dimensional aggregation queries (which are typical analytical queries as well).

The no part is that TiDB now has a columnar engine (TiFlash) for analytics, which also provides workload isolation. TiFlash can keep up to date (with the latest and consistent data, to be more specific) with the row store in real time, on separate nodes, via Raft. IMO, HTAP should be TP and AP at the same time, instead of "TP or AP, you choose one". In such cases, workload interference is a real concern, especially when you are talking about transactions for banking instead of streaming in logs. In that sense, very few, if any, "NewSQL" systems have achieved what I consider true HTAP. For more details: https://pingcap.com/blog/delivering-real-time-analytics-and-...

Welcome to try it in March with TiDB 3.1.


Anyone use Apache Iceberg with success?



