> Data lake > Data warehouse These aren't something I would like to hear if I'm ...

bastawhiz · 2024-07-14T15:46:53 1720972013

Those are just different words for "database". What do you care what kind of database your Notion data is sitting in?

TeMPOraL · 2024-07-14T16:45:56 1720975556

A "data lake" strongly suggests there's a lot of information the company needs to aggregate and process globally, which should very much not be the case with a semi-private rich notebook product.

bastawhiz · 2024-07-14T19:10:15 1720984215

They literally explained in the article why they have a data lake instead of just a data warehouse: their data model means it's slow and expensive to ingest that data into the warehouse from Postgres. The data lake is serving the same functions that the data warehouse did, but now that the volume of data has exceeded what the warehouse can handle, the data lake fills that gap.

I wrote another comment about why you'd need this in the first place:

https://news.ycombinator.com/item?id=40961622

Frankly the argument "they shouldn't need to query the data in their system" is kind of silly. If you don't want your data processed for the features and services the company offers, don't use them.

anoncareer0212 · 2024-07-14T20:52:45 1720990365

> Frankly the argument "they shouldn't need to query the data in their system" is kind of silly.

Neutral party here: that's not what they said.

A) Quotes shouldn't be there.

B) Heuristic I've started applying to my comments: if I'm tempted to "quote" something that isn't a quote, it means I don't fully understand what they mean and should ask a question. This dovetails nicely with the spirit of HN's "come with curiosity".

It is disquieting because:

A) These are very much ill-defined terms (what, exactly, is a data lake, vs. a data warehouse, vs. a database?), and as far as I've had to understand this stuff, and as a quick spot check of Google shows, it's about accumulating more data in one place.

B) This is antithetical to a consumer's desired approach to data, which I'll describe parodically as: stored individually, on one computer, behind 3 locked doors and 20 layers of encryption.

nojvek · 2024-07-14T19:56:21 1720986981

At the scale of Notion, with millions of users, they’d have that much data.

I’ve seen 100TB+ workloads at smaller companies. Not unusual.

iLoveOncall · 2024-07-14T21:13:59 1720991639

The concern isn't the scale, it's the use. What is there to _process_ when they're supposed to only store and retrieve to show to users?

ctippett · 2024-07-15T01:32:13 1721007133

The data doesn't have to be the content of users' notes. Think of all the metadata they're likely collecting per user/notebook/interaction – the data's likely useful for things like flagging security events, calculating the graph of interconnected notes, indexing hashed content for search (or AI embeddings?) ... these are just a few use-cases that come to mind off the top of my head.

TeMPOraL · 2024-07-15T06:30:10 1721025010

Of which security and stability seem like the only reasonable use cases. Indexing content for search globally? Embeddings? They just can't help themselves, can they? All that juicy data, can't possibly leave it alone.

bastawhiz · 2024-07-15T02:46:58 1721011618

Great, you build only store and retrieve functionality. How:

1. Do you identify which types of content your users use the most?

2. Do you find users who are abusing your system?

3. Do you load and process data (even on a customer-by-customer basis) to fine-tune models for the QA service that you offer as an optional upgrade? Especially when there could be gigabytes of data for a single customer.

4. Do you identify corrupt data caused by a bug in your code that saves data to the db? You're not doing a full table scan over hundreds of billions of records across almost 500 logical shards in your production fleet.

These are just the examples I came up with off the dome. The job of the business is to operate on the data. If you can't even query it, you can't operate on it. Running a business is far more than just being a dumb CRUD API.

fragmede · 2024-07-15T03:17:09 1721013429

Fwiw, you should be able to answer #1 and #2 without hitting the main db if you've got good observability into your system.

bastawhiz · 2024-07-15T20:13:47 1721074427

Observability data comes from a *drumroll* database! Most analytics products that can answer these questions are just time series data warehouses.

fragmede · 2024-07-15T20:22:01 1721074921

A database, obviously, but are you really storing metrics and logs next to customer data in the same database, or did you skip over the part where I used the word “main”?

bnj · 2024-07-14T14:34:59 1720967699

Could you expand on this?

lopkeny12ko · 2024-07-14T16:47:35 1720975655

What's there to expand on? Do you not realize how bad of a look it is for a company to publicly admit, on their own blog, the amount of time and engineering effort they spent to package up, move, analyze, and sell all their customers' private data?

This is why laws like CCPA "do not sell my personal information" exist, which I certainly hope Notion is abiding by, otherwise they'll have lawyers knocking on their door soon.

Cthulhu_ · 2024-07-14T17:07:47 1720976867

Where do they say they sell it? Citation needed; that's a legal and reputational minefield that I don't think they would admit to, like you said.

lopkeny12ko · 2024-07-14T17:41:56 1720978916

I would challenge you to find any broker who sells data (like the T-Mobile location data scandal) who says plainly and clearly they sell user data.

quest88 · 2024-07-14T17:47:35 1720979255

This is not answering the question.

bnj · 2024-07-14T17:58:41 1720979921

Right, yes, tone aside, that’s very helpful. At first I didn’t understand the implication of the blog post for implementing customer-hostile solutions, but you’ve helped me understand it now.

wodenokoto · 2024-07-15T04:35:11 1721018111

That’s definitely something you want to do. A data lake can be home for raw and lightly refined data, either in an “analytics” database such as BigQuery or just as raw Parquet files. This is fast for large queries but slow for small queries. So you want refined data in a “regular” database like Postgres or MSSQL to serve all the dashboards.

zarmin · 2024-07-14T18:49:19 1720982959

Given how infuriating their implementation of an in-app database is, perhaps it's not that surprising.