Hacker News new | past | comments | ask | show | jobs | submit login
Understanding Parquet, Iceberg and Data Lakehouses (davidgomes.com)
312 points by davidgomes on Dec 30, 2023 | hide | past | favorite | 101 comments



I often hear references to Apache Iceberg and Delta Lake as if they’re two peas in the Open Table Formats pod. Yet…

Here’s the Apache Iceberg table format specification:

https://iceberg.apache.org/spec/

As they like to say in patent law, anyone “skilled in the art” of database systems could use this to build and query Iceberg tables without too much difficulty.

This is nominally the Delta Lake equivalent:

https://github.com/delta-io/delta/blob/master/PROTOCOL.md

I defy anyone to even scope out what level of effort would be required to fully implement the current spec, let alone what would be involved in keeping up to date as this beast evolves.

Frankly, the Delta Lake spec reads like a reverse engineering of whatever implementation tradeoffs Databricks is making as they race to build out a lakehouse for every Fortune 1000 company burned by Hadoop (which is to say, most of them).

My point is that I’ve yet to be convinced that buying into Delta Lake is actually buying into an open ecosystem. Would appreciate any reassurance on this front!

Editing to append this GitHub history, which is unfortunately not reassuring:

https://github.com/delta-io/delta/commits/master/PROTOCOL.md

Random features and tweaks just popping up, PR’d by Databricks engineers and promptly approved by Databricks senior engineers…


I agree with all of this. Databricks are also holding back features from open source Delta (like bloom filters), which is their right. But then you can't claim it is a community-driven open format, unless it is an animal farm version of that, where one of the versions is the Pig (some are more equal than others).


Databricks has a lot of nice closed-sourced components, e.g., Unity Catalog, Delta Live Tables and Photon (a C++ implementation of Spark).

Delta itself seems fairly open-source: https://github.com/orgs/delta-io/projects/10/views/1 and hopefully someone will implement Liquid Clustering!


I've implemented Delta support from scratch for a component of Microsoft Fabric, and my feeling is that the "spec" is fairly inadequate without additional experimentation on the Spark implementation. It also requires you to be able to support Spark SQL expressions if you want to make use of features like computed columns and check constraints, and those are even more-poorly documented.


> I’ve yet to be convinced that buying into Delta Lake is actually buying into an open ecosystem. Would appreciate any reassurance on this front!

Sentiment echoed.

I’m ultra cautious of stuff offered by databricks in general. I think they’re only nominally open source, and shouldn’t be trusted.

I’ve also used Delta lake before, there were some really frustrating shortcomings and a lot of “sharp edges” in its usage. We ended up dropping that project entirely, but did investigate iceberg at the time as well. Iceberg and hudi had more coherently designed feature sets, but were less supported. Really hoping this changes more in future.


Thanks for this. I've been following this space for about a year or two and was wondering why Iceberg was more popular in open source.

Over the past six months I got the impression that Delta is pulling ahead in the race as Iceberg is struggling to provide tools for people not in the JVM ecosystem. Delta is a lot more accessible in that way.


DuckDB (lightweight, non-JVM, many language bindings) supports querying from Iceberg now.

https://duckdb.org/docs/extensions/iceberg.html

You still need Spark to generate the Iceberg metadata though.


Snowflake is rolling out Iceberg support and not Delta support, I think that says a lot.


Bigquery too.


I guess you are referring to delta-rs (for Python in particular). An interesting factoid here is that Databricks started delta-rs, and other companies are now driving it forward - not Databricks. I guess it is not in Databricks interest to push the non JVM ecosystem. PyIceberg is catching up. Write support is almost there - https://github.com/apache/iceberg-python/pull/41


As I remember, delta-rs was started by Scribd, not by Databricks: https://youtu.be/2jgfpJD5D6U, https://youtu.be/scYz12UK-OY


I stand corrected, then.


The fact that they use JSON for delta changes is... just stupid. For contrast, in SQL Server, it's implemented way better. Columnar storage tables (columnstore indexes, an equivalent of Parquet or ORC inside the engine) are immutable, and deltas are stored in B-Trees for compactness, ease of access and speed). At some point in time the columnstores get defragmented/merged/rebuilt in part or in whole, and the B-Tree is deleted and starts over when new changes accumulate. Doing it in JSON is, let me put it softly, a sign of bad times.

I suppose anything is better than Delta Lake. Especially Iceberg.


Microsoft is using Delta for their Fabric Lakehouse architecture and its also what OneLake is built around so now you have another massive player choosing Delta.


That’s…not exactly a winning point for Delta lake IMO.

Massive corp, with their own opaque interests and endless bodies to throw at problems has picked a favourite. That favourite being an “open” format controlled by another opaque enterprise company. I’d half expect M$ to just take it wholesale, and start modifying it to suit their own ends, until eventually the “open source” component is some skin-deep façade that completely and utterly dependent on M$ infra.


Yes, another massive player who has the resources and independent market pull to ride and steer a complex and ever-shifting “standard”.

Feels a bit like, “If Delta Lake did not exist, Microsoft would have to invent it.”


If you are a Spark shop then choosing Delta over Iceberg is a no-brainer. It's simpler and perfectly integrated. Not to mention that the Spark's Delta connector can now generate Iceberg-compatible metadata too.

The choice between the two resembles the choice between Parquet and ORC circa 2016. Two formats of broadly the same power, initially biased by a particular query engine, eventually at feature parity and universally supported.

We have got a decade of experience with OSS from Databricks so doubting their "open ecosystem" status seems a little theoretical.


I feel like this is frankly uninformed. Many iceberg shops seem to rely heavily on Spark as a primary engine. And databricks has a history of being a hostile oss force with the culture of the spark project being toxic from the start and delta’s questionable commitment to being a community project.


Great article. I've worked with Parquet files on S3 for years, but I didn't quite understand what Iceberg was, but the article explained it well. It's a database metadata format for an underlying set of data which describes its schema, partitioning etc.

Most people use Hive partitioning convention (i.e. directory names like /key3=000/key2=002/) but Iceberg goes farther than this by exposing even more structure to the query engine.

In a traditional DBMS like Postgres, the schema, the query engine and the storage format come as a single package.

But with big data, we're building database components from scratch, and we can mix and match. We can use Iceberg as a metadata format, DuckDB as the query engine, Parquet as the storage format, and S3 as the storage medium.


Very grateful of your recap, I skimmed through the article fast, but got a better understanding reading your comment!


This is a big deal in the database world as delta, iceberg and hudi mean that data is being stored in an open source format, often on S3.

It means that the storage and much of the processing is being standrdised so that you can move between databases easily and almost all tools will eventually be able to work with the same set of files in a transactionally sound way.

For instance, Snowflake could be writing to a file, a data scientist could be querying the data live from a Jupyter notebook, and ClickHouse could be serving user facing analytics against the same data with consistency guarantees.

If the business then decide to switch Snowflake to Databricks then it isn’t such a big deal.

Right now it isn’t quite as fast to query these formats on S3 as a native ingestion would be, but every database vendor will be forced by the market to optimise for performance such that they tend towards the performance of natively ingested data.

It’s a great win for openness and open source and for businesses to have their data in open and portable formats.

Lakehouse has the same implications. Lots of companies have data lakes and data warehouses and end up copying data between the two. To query the same set of data and have just one system to manage is equally impactful.

It’s a very interesting time to be in the data engineering world.


Apache Arrow and Substrait have been working towards making this a reality. I see a future where executing a query can/will send plans to many different engines distributed across the cloud, but also locally on your on machine.


Real-time Bidding on query execution? The more I think about it, I believe you actually have a viable business model here.


That’s a wildly interesting idea.

It open up another market too: compatible, scalable storage. Sell shovels in a gold-rush, and what better shovel than the substrate infrastructure that those bidding query engines would probably depend on.


If the queries can be executed by any provider, you are talking about a commodity product.

The business model of selling a commodity is wildly unlike the business model tech is in today.


The query execution might be commodity, but the purchasers will still need to store their data somewhere, and this somewhere will need to be able to service the bandwidth and requirements of the query execution providers.


It feels like you could just as well pack the runtime/engine into the job you are requesting? Am I wrong?


The point is more so in aim of creating interoperability between systems and making them in turn composable.

When there’s a common intermediate representation you can pass around those compute instructions and execute. And when there’s shared memory formats data can pass from storage to engine without serialization/deserialization.

So it wouldn’t matter if data is here or there, in this or that format, because the instructions are the same the specific interface (snowflake, MySQL, a local parquet file, etc) is irrelevant mitigating the need for glue code.


> " every database vendor will be forced by the market to optimise for performance such that they tend towards the performance of natively ingested data."

This assumes that their internal storage format has nothing to do with decades of engineering infrastructure that they built their business model around and that they would simply give all that up and compete based on just their compute layer. snowflake might as well shutup shop and return billions to the investors. Locking in data into their ecosystem is their whole business model.

Is there as good example of open standard forcing companies to give up their proprietary tech ?


That's the natural evolution of most tech markets. When the tech is young, proprietary companies dominate because they can control the customer experience better and deliver functionality that is simply too complex for open solutions. As the technology matures, customers start demanding interoperability, reliability, better prices, and eventually some employees "defect" from one of the big companies and start the open standards that replace their ex-employer, or an outsider reads a paper and re-implements the technology from scratch.

> Is there as good example of open standard forcing companies to give up their proprietary tech ?

UNIX -> Linux, BSD

Oracle/Sybase -> MySQL/PostgresQL

Symbolics/Lucid -> Common Lisp

Altair/Apple/Commodore/Atari -> IBM PC & clones

VMWare -> QEMU

Basically every tech that Google pioneered and then missed out on commercializing. Protobufs -> Avro/Parquet, MapReduce -> Hadoop, Flume -> Spark, Chubby -> Zookeeper, Borg -> Kubernetes, etc.


I’ll just point out on the Snowflake side, we’ve been very public saying we want Iceberg/Parquet to be at or as close to parity as possible with our native format. The value add is the platform, not lock in. That also forces us to be the best on open formats, which IMO is also a good thing for everyone.

Disclaimer: I work at Snowflake literally on this with my team. :)


> we’ve been very public saying we want Iceberg/Parquet to be at or as close to parity as possible with our native format

Thats great to hear. Would this mean that external iceberg tables would have the same performance as native table ? My impression of parent comment was that, eventually there would be no such thing as 'native format'. Really interested to see public statements by snowflake to that effect, would love to share that with my team.


> snowflake might as well shutup shop and return billions to the investors.

I mean, we can dream right?

There’s a bunch of companies that I don’t believe deserve their status or valuation and Snowflake is one of them.


[flagged]


...gpt? Or ... why would you summarize an adjacent thread? Am I missing a joke?


I disagree with this strongly - "The best way to store Apache Arrow dataframes in files on disk is with Feather. However, it’s also possible to convert to Apache Parquet format and others."

The best way to build your own non-JVM lakehouse is to use Iceberg for metadata, Parquet for the Data, Query with DuckDB using Arrow tables (read Parquet directly into Arrow is very low cost), and then use Arrow->Pandas or Polars (either directly or via a service with Arrow Flight).

If you put Feather in the mix, the whole Python lakehouse stack doesn't currently work.


At one point, I thought Feather did not carry any long-term format guarantees. Presumably that has now changed, but I still feel like Parquet is the best future proof option on the table.


I've heard of data lakes, but "data lakehouse" sounds like where upper class data goes in the summer to take their data-boats data-fishing.


The name is easy to poke fun at, but I think it’s a real problem. A lot of companies use data lakes to store data and warehouses to serve BI to tools like Tableau or PowerBI. They then up copying data between the two.

Querying a lake directly and having transactions, governance etc against one set of data (a data Lakehouse) can really simplify the stack and take out cost.


Ah, so the house part comes from warehouse. Not obvious to say the least.


I never understood what is meant by “data lake” in the first place, other than “heterogenous collection of large-ish data files”.


During the big data hype, I did a feasibility study for this for my previous company (typical bigcorp) and learned that most companies should not have a data lake.

A data lake is a collection of different types of structured/unstructured data like CSV, Parquet, text, images, etc. stored in an object store or some such that in principle you're able to query. The theory is that you can just dump stuff into a kitchen drawer (ELT instead of ETL) and be able to do analytics on it later.

But most enterprises already have huge investments in relational databases (SQL Server, Oracle etc.) which are decades-old optimized, typed with schema, structured engines for storing data. If you have a SQL database, chances are you already have data in the right format for analytics and building a data lake is the wrong way to go.

People in tech companies have this wrong impression that enterprises have a lot of big data, but the fact is, most of the valuable data in most companies are less than a few terabytes total. They're mostly ERP data, Excel files, and operational data from various sensors (if that).

To unstructure the (already structured) data just so it can fit into the data lake seemed like the wrong strategy, but I was surprised how much companies like Cloudera and others hyped it up so much so they could sell technologies like Hive, Spark, Presto, etc. (and streaming tech like Kafka). These are overkill for most enterprises.


Yes, a typical DWH spends a lot of cycles trying to create a single consistent interpretation of the raw data, a data lake is just this raw data, plus whatever ad hoc interpretations of it your data analysts create.

A lakehouse is basically an attempt to get most of the DWH benefit by just making these ad hoc intepretations incremental.


That's pretty much it.


Naming is hard, I hope the industry can come up with something better eventually.

It is definitely jarring in my head every time I hear it or read it.


I prefer the name, as I find there's a straight line correlation between stupid names like "Data Lakehouse" and bad engineering practices. Another sign is the dumber the buzzword, the more consultants and middle men exist to skim money off a problem that should never exist in the first place.


I am very excited about Iceberg specifically (because open-source), but the last time I looked into it the only implementation was a Spark library, and Trino's (formerly Presto, an SQL engine) Iceberg connector had a hard dependency on Hive! It is like the entire industry had a hard time divorcing its MapReduce, Hive, and dare I to say Spark, legacy.

I didn't look into Iceberg since, but plan to, and I am really looking forward for this to develop. We have the tools and the compute power today to deal with data without legacy tech, and not all data is big data either. Consequently "data engineering", thankfully, resembles the regular back-end development more and more, with its regular development practices being put in place.

So, here is to the hope of having a pure Python Iceberg lib some day very soon!


Trino no longer depends on Hadoop/Hive for any of its data lake connectors. Removing that dependency was a huge effort.


Same. I wasted a month or so of off time trying to get that old stack to work well enough to let me just insert data, left unhappy. Had Databend up and running in an hour, figured it will get easier to do it right in the future once there's a Rust impl (for portability vs Java/Hive)


Why is no one able to describe all this with more concrete ideas, like this is how you store data, this is how you connect and query - and how fast those queries will be (ie transactional speed vs "analytics" speed).


I am currently working with about 100TB data on GCP with BigQuery as a query engine and simple hive partitioning like /key3=000/key2=002/. We are happy because we can run all the queries you want and it is insanely cheap. But latency is reaching quite high levels (it doesn't matter so much for us) but I was wondering, if implementing Iceberg would improve this? Has anyone experience with this?

Overall this kind of architecture is just awesome.


There is nice summary on the topic: https://aws.amazon.com/blogs/big-data/choosing-an-open-table... ("Optimizing read performance"). Those technologies primary "Data Management at Scale" but they also extend capabilities provided by raw storage formats such as parquet. So they may help you, but the question if you are really need it. I haven't worked with BigQuery, it may include [similar features](https://cloud.google.com/bigquery/docs/search-index).

You need to define what "latency" means in your case and what is "quite high levels". We are talking about analytical data storage, it is designed for efficient batch processing. To find a single record is not a primary goal of the architecture - you will need some kind of caching/indexing for fast search. Sometimes adding "limit 1" for your single record search may solve the problem.

Be sure you are using efficent data storage format as parquet, check size of the files to be sure you don't have the ["small file problem"](https://www.royalcyber.com/blog/data-services/managing-small...), then check if you are using relevant BigQuery features. And before and after those checks run "explain" on your query, if you don't use partition keys or indexed columns your search results won't be instant in any big data system.


We're at ~1/10th your scale but query speed is a major concern due to (rightfully) impatient clients when it comes to data viz. Unless you're using BQ tables as input for high throughput compute, I'd focus on optimizing your BI tool or creating analysis tables that would prevent end-user slowdown.

Ex: Recently created a big table (by materializing fact/dimension table joins and COALESCE operations) solely for analysis purposes. It sits "outside" our normal data warehouse setup mentally, but we can still maintain data quality/lineage as it exists inside dbt. Allowed us to do away with Tableau fixed calculations and cut load/group by times for end users ~95%.


Iceberg won’t speed up your queries if you are using BQ native storage. It might speed up federated queries to GCS/S3


This looks pretty neat if you're ok moving to AWS https://www.boilingdata.com/


> However, this blog post won’t be 100% comprehensive, or even the best starting point for most people. That’s because I’m writing this for myself. I find that the best way to learn new things is by "forcing myself" to re-explain them to others.

I really like this attitude and have started embracing it myself both on paper and in notes on my website.


We have been excited to dig into the Iceberg era of more managed parquet storage... But they are still years behind on supporting fast GPU IO (GPUDirect/cuFile). So every time we look at bringing them to a customer for powering AI workloads... We hit that wall.

It seems inevitable, more of a when vs if. Being able to have our cake & eat it too will be very cool :)


For what use case? Image data storage? For text storage, Parquet is good enough today. PyTorch Data Loader and TF Data provide multi-threaded clients that read ahead in parallel and fill up an in-memory buffer that is then transferred in/out from GPUs. I agree that S3 can be a bottleneck here. That's why we have HopsFS as a global distributed coherent NVMe cache over S3. Anyscale have been doing something similar with a local NVMe cache for S3. Another interesting file format is Lance - it's like Parquet, but for image data. It has an additional index for fast random I/O within a file (to find images).


We are trying to saturate storage->pcie->gpu cards for tasks like gpu-accelerated log analytics, and this is increasingly the bottleneck


Parquet has been the lakehouse file format of choice for nearly half a decade. But we are starting to see other contenders that are optimized more for lower latency like lance https://github.com/lancedb/lance


5 years is not a super long time. It just can feel that way sometimes.


No mention of Hudi? I really liked using Hudi in a recent project. It feels so close to hitting that maturity level where it’s viable for a small team to maintain without introducing too many living parts.

Overall, I like the whole concept of the Lakehouse because it can be done cheaply.

Most datalakes turn into swamps pretty quickly, so cheaper is better.

Let it sit unused for a while in S3 and then quietly nuke it without burning money on a big compute environment.


Sorry genuine question -- what does the phrase "at Broad" at the end of the blog post's title mean or refer to? Maybe a phrase that I am unfamiliar with? I first wondered if it is the name of an organization or team -- and this post is describing what they did in that team, but that doesn't seem to be the case?

>> Understanding Parquet, Iceberg and Data Lakehouses at Broad


“At broad” isn’t customary English, and the author doesn’t seem to be a native speaker. It’s probably intended to mean something like “in general” or “a big-picture view”.


thanks for the clarification.

I am not a native speaker myself; but suspected the intention of the author was more or less the same -- thanks for confirming


Not in depth, general understanding


yes that seems to be the intended meaning; though I haven't seen such usage elsewhere (I am not a native speaker of English so my exposure is limited)

thanks


Unity Catalog isn’t comparable with Iceberg Catalogs. It’s not required for Delta to function…

There was a paper at VLDB about Delta Lake: https://www.vldb.org/pvldb/vol13/p3411-armbrust.pdf - it describes why it was created, plus details of implementation.


Good in-depth insights into each format. This complements nicely with a site I created called tech-diff (https://tech-diff.com/file/) where it provides a summary of the file formats.


"Comma-Separated Values (CSV) is a text file format that uses commas to separate values in plain text."

Except when it uses semi-colons. Or pipes. Or something else.


I was using the following RFC for the formal definition of CSV https://www.rfc-editor.org/rfc/rfc4180.html.

When the delimiter is different, you have a different file format. TSV (tab-separated values) is an example of this.


The most amusing thing is what there are symbols designated as separators

https://en.m.wikipedia.org/wiki/C0_and_C1_control_codes#Fiel...



Oh! Great article and, if the evil twist of fate would force me to write an export, I hope I would remember and use .usv

But I can't agree on 'default tools' - anything bigger than 3 columns is PITA to look at and I just Ctrl+H to replace the commas with tabs or \r\n (or whatever), depending on the format. At some point 'standart tools' doesn't matter anymore, because it doesn't worth to even use them on such files; think of 'jq'.


>I just Ctrl+H to replace the commas with tabs or \r\n (or whatever)

That may not work if the CSV is escaped/quoted (e.g. because the data contains commas).

Or you could use our Easy Data Transform software. It pretty much handles all that horrible CSV stuff for you. ;0)


Of course, but most of the time I just need to have a quick glance at the data, or find a single value which I know where it should be.

Slightly on-topic: I found what LibreOffice Calc has a decent input wizard which most of the time works and good enough.


I know it's a newcomer still under heavy development, but I'm surprised to not see Lance (and Lancedb atop it) mentioned. It crushes ORC and Parquet for most real-world data scenarios and has cheap data versioning.


In every benchmark I've looked at online, Delta Lake format seems to have drastically better performance than Iceberg. Is this fundamental to the spec, or is it possible that Iceberg can close the gap?


One thing I'm confused about is why does Iceberg need a spark deployment to function? Or am I wrong about that? I would rather avoid that ecosystem if I can.


You don't need a Spark deployment. The first reference implementations for reading and writing were in Spark.

Now, with PyIceberg, there is read support in Python. Write support should be merged very soon - https://github.com/apache/iceberg-python/pull/41 So, very soon, you will be able to read/write Iceberg tables in Python. I look forward to doing data transformations in Polars for data of reasonable scale (up to 100GB or so) and writing to Iceberg tables with PyIceberg. No Spark.


Well, what about other languages? Every language needs bindings or a re-implementation? (i.e., iceberg tables are written/queried in-process as opposed to via a network API?)


It tends to be more library dependencies than live clusters.

A lot of data lakes are managed using Hadoop and Spark so I think it’s just an artefact of that.

In the end I can’t see why you wouldn’t just be able to create and manage Iceberg files directly from a standard Python/JS/Java without that legacy.


I appreciate the clarity of this article. I know it was written by the author for themselves, but it feels like it could have been written for me!


I really liked your article.

Is this a typo: “Hive, Delta Lake and Iceberg all support support of schema registry or metastore.”?


Thanks for the typo, fixed!


How do dependencies work in this type of data lakehouse? Does the orchestration layer handle that or is there metadata within the data lake that provides completeness information?


It doesn't, maybe you mean something like Dagster or Airflow managing insertion time dependcies? Although iceberg does provide transactions across tables I believe.


Are these formats appropriate for multi dimensional gridded data or are hdf/netcdf still what people use for those?


TL;DR: In climatology, I know people are using zarr. However, I think columnar storage as in parquet also merits consideration.

My thinking goes as follows: I'm trying to read chunks from n-dimensional data with a minimum of skips/random reads. For user-facing analytics and drilling down into the data, these chunks tend to be relatively few, and I'd like to have them close to one another. For high-level statistics however, I only care that the data for each chunk of work be contiguous, since I'm going to read all chunks eventually anyways.

You can reach these goals with a partitioning strategy either in HDF or zarr or parquet, but you could also reach it with blob fields in a more traditional DB, be it relational or document based or whatever. Since any storage and memory is linear, I don't care whether a row-major or column-major array is populated from a 1d vector from columnar storage with dimensionality metadata or an explicitly array based storage format; I just trust that a table with good columnar compression doesn't waste too much storage on what is implicit in (dense) array storage.

Often, I've found that even climatological data _as it pertains to a specific analytic scenario_ is actually a sparse subset of an originally dense nd-array, e.g. only looking at data over land. This has led me to advocate for more tabular approaches, but this is very domain specific.


It's really easy to get lost in the technical jargon that the vendors who are selling products throw around, but this article has missed the important part, and spent all the time talking about the relatively unimportant part (data formats).

You need to step back and look from a broader perspective to understand this domain.

Talking about arrow/parquet/iceberg is like talking about InnoDB vs MyISAM when you're talking about databases; yes, those are technically storage engines for mysql/mariadb, but no, you probably do not care about them until you need them, and you most certainly do not care about them when you want to understand what a relational DB vs. an no-SQL db are.

They are technical details.

...

So, if you step back, what you need to read about is STAR SCHEMAS. Here are some links (1), (2).

This is what people used to be before data lakes.

So the tldr: you have a big database which contains condensed and annotated versions of your data, which is easy to query, and structured in a way that is suitable for visualization tools such as PowerBI, Tableau, MicroStrategy (ugh, but people do use it), etc. to use.

This means you can generate reports and insights from your data.

Great.

...the problem is that generating this structured data from absolutely massive amounts of unstructured data involves a truly colossal amount of engineering work; and it's never realtime.

That's because the process of turning raw data into a star schema was traditionally done via ETL tools that were slow and terrible. 'Were'. These tools are still slow and terrible.

Basically, the output you get is very valuable, but getting it is very difficult, very expensive and both of those problems scale as the data size scales.

So...

Datalakes.

Datalakes are the solution to this problem; you don't transform the data. You just injest it and store it, basically raw, and on the fly when you need the data for something, you can process it.

The idea was something like a dependency graph; what if, instead of processing all your data every day/hour/whatever, you defined what data you needed, and then when you need it, you rebuild just that part of the database.

Certainly you don't get the nice star schema, but... you can handle a lot of data, and what you need to do process it 'adhoc' is pretty trivial mostly, so you don't need a huge engineering effort to support it; you just need some smart table formats, a lot of storage and on-demand compute.

...Great?

No. Totally rubbish.

Turn out this is a stupid idea, and what you get is a lot of data you can't get any insights from.

So, along come the 'nextgen' batch of BI companies like databricks so they invent this idea of a 'lake house' (3), (4).

What is it? Take a wild guess. I'll give you a hint: having no tables was a stupid idea.

Yes! Correct, they've invented a layer that sits on top of a data lake that presents a 'virtual database' with ACID transactions that you then build a star schema in/on.

Since the underlying implementation is (magic here, etc. etc. technical details) this approach supports output in the form we originally had (structured data suitable for analytics tools), but it has some nice features like streaming, etc. that make it capable of handling very large volumes of data; but it's not a 'real' database, so it does have some limitations which are difficult to resolve (like security and RBAC).

...

Of course, the promise, that you just pour all your data in and 'magic!' you have insights, is still just as much nonsense as it ever was.

If you use any of these tools now, you'll see that they require you to transform your data; usually as some kind of batch process.

If you closed your eyes and said "ETL?", you'd win a cookie.

All a 'lake house' is, is a traditional BI data warehouse built on a different type of database.

Almost without exception, everything else is marketing fluff.

* exception: kafka and streaming is actually fundamentally different for real time aggregated metrics, but its also fabulously difficult to do well, so most people still don't, as far as I'm aware.

...and I'll go out on a limb here and say really, you probably do not care if your implementation uses delta tables or iceberg; that's an implementation detail.

I guarantee that correctly understanding your domain data and modelling a form of it suitable for reporting and insights is more important and more valuable than what storage engine you use.

[1] - https://learn.microsoft.com/en-us/power-bi/guidance/star-sch... [2] - https://www.kimballgroup.com/data-warehouse-business-intelli...

[3] - https://www.snowflake.com/guides/what-data-lakehouse [4] - https://www.databricks.com/glossary/data-lakehouse


All very good and useful points. One additional thing to mention is that as you are querying across the raw data with a data lake(house), performance is fundamentally worse, even if a lot of the marketing material will tell you otherwise. Usually significantly worse than if your data was in a columnar database in practice.

Depending on your use case this may or may not be a problem. For most companies I'd wager that it is a bigger problem than it first appears.


> The problem is that generating this structured data from absolutely massive amounts of unstructured data involves a truly colossal amount of engineering work

Isn't this kind of obsolete in 2023 with LLMs?

Sure, AI is currently slow, and very expensive, but eventually the idea of needing to query a massive unstructured data source is something that will go the way of the dodo bird when you have a technology that can magically turn unstructured data into structured data quite efficiently. And in that case, when your data is properly structured, there are database technologies that are 1000x more efficient than all of these MapReduce-esque solutions for BI.


As someone that came away confused after reading the article, this was hugely helpful - thanks.


I think the bigger problem is that replication/ingestion (ie: what fivetran does) has come to represent 'ELT'. Likely by design.

And you don't need that pesky transformation part.

Except you really do, when you get beyond having a source system or two.


I am familiar traditional JSON to relational table ETL pipelines, but how does ETL pipelines differ in data lake/house when processing data dumps, such as PDF/images?


I think I agree with this to some extent in that it's hard for me to imagine a use case where I have a bunch of clean Parquet files, nicely partitioned, in some kind of cloud storage system.

If I'm already going through the trouble of doing ELT/ETL and making a clean copy of the raw data, why would I do that in cloud storage and not in an actual database?

I don't echo your dismissal of the idea because a whole lot of people seem to be excited about it. But I personally feel like I'm missing the use case compared to the lake + warehouse setup.

Is it about distributing responsibility across teams? Reducing storage cost? Open source good vibes?

Maybe a legitimate use case is being able to use the same data source for multiple query engine frontends? That is, you can use both Spark and Snowflake on the same physical data files.

I'd be interested to hear about this from someone who's using or planning to use a lakehouse.


In my experience,

* Storing large amounts like petabytes in any database is phenomenally expensive, just for the storage alone.

* For some kinds of data, like image data, databases are generally the wrong tool.

* The consumers of these kinds of systems may have really dynamic workloads. Imagine ML jobs that kick off 1K machines simultaneously to hammer your DB and read from it as fast as possible. Cloud-managed object stores have solved this scaling issue already. If you can get infrastructure you manage out of the way, you get to leverage that work. If your DB is in the middle, you're on call for it.


> If I'm already going through the trouble of doing ELT/ETL and making a clean copy of the raw data, why would I do that in cloud storage and not in an actual database?

Well, depends on your requirements. You can definitely go point-to-point straight into another DB.

One reason to keep data in object storage, is it gives you a sort of “db independent” storage layer. At a previous $work, we had a tiered system: data would come in from source systems (primary application db’s, marketing systems, etc), and would be serialised verbatim in structured format in S3 (layer 1). Data eng systems would then process that data-refining it, enriching it, ensuring types and schemas, etc, which would be serialised into the next tier (layer 2). At this level they’d be nice to use, so the data analysts would operate against this data in their spark notebooks.

BI and reporting, and other applications could either use data from this layer directly, or if they had special requirements, or performed computationally difficult enough tasks, we would add another layer (layer 3) for specialised workloads and presentation layers. Layer 2 and 3 data may also be synced into data warehouses like ClickHouse.

This gave us complete lineage of data (no more mystery tables, no more “where did you get this data from”, etc), and the storage itself is reasonably cheap. Many services can query these storage layers directly, so setting up views, or projections into different layouts - even for huge quantities of data- becomes feasible and achievable with no more engineering effort than a query.

Was it a lot? Yep. Would I recommend or do it everywhere? Absolutely, 100% no I would not. Was it a good fit for that org? Yeah, arguably better than they could utilise, but for them, other approaches were anaemic and fragile at best.

Could it be done simpler now? Yep, but it got the job done then haha.

Edit to add: it was also language agnostic, which was a huge win and is an understated part of these new parquet-based solutions: you’re no longer limited to “fragile python app” or “spark cluster” to interact with your data. Rust, C#/F#, various FE tools for JS/TS (cube, etc). This is a huge win because you’re not longer tied to keeping around an aging spark/hadoop cluster that has gradually encrusted more garbage into it until it’s this massive, ultra-fragile time bomb nobody dares touch that powers mass amounts of back-office-business needs.


All of this is correct, but it misses the main point of the new table formats - they are open-source and the data can be stored on very low cost storgae - S3. So, having a data warehouse that stores TBs or even PBs of data is not as expensive as it used to be (by an order of magnitude or more). And the formats for storing the data (Parquet), its metadata (Iceberg, Hudi, delta lake), its query engine (DuckDB, Polars, Ibis) - they are all open-source.


> it misses the main point of the new table formats

I didn't miss it; it's irrelevant.

It makes, almost no difference in practice, between a competent implementation in one and a competent implementation in the other.

It makes absolutely no difference that they are open source.

Understanding the details of each of the individual components will give you no meaningful insight into how to build a lakehouse.

...because, when you slap all those parts together, in whatever configuration you've picked what you end up with is a database.

A big, powerful cloud database.

Well, you have a database now and you still have zero insights and zero idea how to get any of them; that because you didn't understand that you need to build some kind of data warehouse on top of that database. You need to load the data. You need to transform the data. You need to visualize the data and build reports on it. If you're good, you probably need to preprocess the data to use as training inputs.

I'll say it more clearly and explicitly one. more. time:

- Having a database != having a data warehouse.

- Having a big cloud database build out of cloud storage, table formats, metadata engines and query engines != a lakehouse.

Having an empty database is of no value to anyone, no matter how good it is.

All of those parts, all of those things are only the first step. It's like installing postgres. Right, good job. We're done here? Reports? Oh, you can probably import something or something or I know, powerBI is good, let's install that. It'll tell you you have no data... but... we've got the infra now right? Basically done.

It's just step 1.


can confirm that it is a nice thing to work with parquet files. Before this, we've worked for ~1 year with CSVs (I know the horror) and we made an effort to port all the 'legacy' code to Parquet files

We interface with BigQuery (via Airflow) mostly, and except one very annoying situation it's a big improvement in terms of speed (parsing floats after querying the DB is NEVER a good option).

---

In case anyone's wondering, it's basically storing and loading native numpy arrays in BigQuery via the python client(s).

You have a bunch of options (assume you have one or more cols with float32 numpy arrays):

- dataframe -> to_parquet -> upload to GCS -> GCSToBigQueryOperator (https://airflow.apache.org/docs/apache-airflow-providers-goo...)

  -> instead of storing as a `FLOAT, REPEATED` it will be stored as a STRUCT with a structure of `list>item` OR `list>element` (pyarrow==11 OR pyarrow==13).This requires a manual parsing from this 'json structure' that you get when querying the DB back to np.array -> slow and basically you are using CSVs again.

  -> Read more: https://stackoverflow.com/questions/68303327/unnecessary-list-item-nesting-in-bigquery-schemas-from-pyarrow-upload-dataframe

  -> set the schema before uploading? Nope, all values will uploaded as null in BQ.
- dataframe -> bigquery.Client -> upload the dataframe from python

  - very slow, you need to batch your data (imagine 24h vs 5 minutes kind of slow as dataframe sizes increase + necessity to keep all data in memory or batch it so extra save/load of each batch before uploading)

  - arrays are stored properly
- solution: you must do 2 things, one on the pyarrow side and one on the BigQuery side

  - `df.to_parquet(..., use_compliant_nested_type=True)` (in pyarrow==14 it's True by default, but airflow needs pyarrow==11, where it's False by default)

  - use `enable_list_inference=True` (link: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#list_logical_type)

  - when both of this are true (i.e. save parquet files [to GCS] using that flag and load parquet files [from GCS to BQ] using the other flag arrays can be stored as (FLOAT, REPEATED) and queried as numpy arrays out of the box without any manual management.
This took me like 1 week of debugging and reading source code, obscure SO comments and GH issues etc.


>Before this, we've worked for ~1 year with CSVs (I know the horror)

The CSV format (or lack of) is such a mess. You don't appreciate how much until you have to write a CSV parser and do real world support for it. Ugh.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: