Hi all—I'm the EM for the Search team at Notion, and I want to chime in to clear up one unfortunate misconception I've seen a few times in this thread.
Notion does not sell its users' data.
Instead, I want to expand on one of the first use cases for the Notion data lake, which came from my team. This is an elaboration of the description in TFA under the heading "Use case support".
As is described there, Notion's block permissions are highly normalized at the source of truth. This is usually quite efficient and generally brings along all the benefits of normalization in application databases. However, we need to _denormalize_ all the permissions that relate to a specific document when we index it into our search index.
When we transactionally reindex a document "online", this is no problem. However, when we need to reindex an entire search cluster from scratch, loading every ancestor of each page in order to collect all of its permissions is far too expensive.
Thus, one of the primary needs my team had for the new data lake was "tree traversal and permission data construction for each block". We rewrote our "offline" reindexer to read from the data lake instead of from RDS instances serving database snapshots. This allowed us to dramatically reduce the impact of iterating through every page when spinning up a new cluster (not to mention saving a boatload on spinning up those ad-hoc RDS instances).
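To make "tree traversal and permission data construction" a bit more concrete, here is a minimal sketch, not our actual reindexer code; the block IDs and the two lookup mappings are made up for illustration:

    # Minimal sketch (not our actual reindexer code; IDs and mappings are made up).
    # Assumes two lookups loaded from a lake snapshot:
    #   parent_of[block_id] -> parent block_id (None at the workspace root)
    #   grants_on[block_id] -> set of principals granted directly on that block

    def effective_permissions(block_id, parent_of, grants_on):
        """Union of the grants on a block and on every ancestor up to the root."""
        perms = set()
        node = block_id
        while node is not None:
            perms |= grants_on.get(node, set())
            node = parent_of.get(node)
        return perms

    # Toy tree: page -> section -> workspace
    parent_of = {"page-1": "section-1", "section-1": "workspace-1", "workspace-1": None}
    grants_on = {"workspace-1": {"team:everyone"}, "page-1": {"user:alice"}}
    print(sorted(effective_permissions("page-1", parent_of, grants_on)))
    # ['team:everyone', 'user:alice']

Doing this walk per block is cheap when the ancestor data is already laid out for bulk reads, which is exactly what the lake gives the offline reindexer.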
I hope this miniature deep dive gives a little bit more color on the uses of this data store—as it is emphatically _not_ to sell our users' data!
This is a fantastic post that explains a lot of the end product, but I'd love to hear more about the journey specifically on denormalizing permissions at Notion. Scaling out authorization logic like this is actually very under-documented in industry. Mind if I email you to chat?
Full disclosure: I'm a founder of authzed (W21), the company building SpiceDB, an open source project inspired by Google's internal scalable authorization system. We offer a product that streams changes to fully denormalized permissions for search engines to consume, but I'm not trying to pitch; you just don't often hear about other solutions built in this space!
Curious - what do you guys use for the T step of your ELT? With nested blocks 12 layers deep, I can imagine it gets complicated to try to de-normalize using regular SQL.
(I’m not on the search team, but I did write some search stuff back in 2019, explanation may be outdated)
The blocks (pages are a block) in Notion are a big tree, with your workspace at the root. Some attributes of blocks affect the search index of their recursive children, like permissions: granting access to a page grants access to its recursive child blocks.
When you change permissions, we kick off an online recursive reindex job for that page and its recursive subpages. While the job is running, the index has stale entries with outdated permissions.
When you search, we query the index for pages matching your query that you have access to. Because the index permissions can be stale, we also reload the result set from Postgres and apply our normal online server-side permission checks to filter out pages you lost access to but that still have stale permissions in the index.
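A rough sketch of that read path, purely illustrative; `search_index` and `can_access` are hypothetical stand-ins, not real Notion APIs:

    # Rough sketch of the read path described above; `search_index` and
    # `can_access` are hypothetical stand-ins, not real Notion APIs.

    def search(user_id, query, search_index, can_access):
        candidates = search_index(query, user_id)      # may contain stale permission data
        return [page for page in candidates
                if can_access(user_id, page)]          # authoritative check (e.g. Postgres)

    # Toy usage with fake stand-ins:
    index = lambda q, uid: ["page-1", "page-2"]
    allowed = lambda uid, page: page != "page-2"       # pretend page-2 access was revoked
    print(search("alice", "roadmap", index, allowed))  # ['page-1']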
They didn’t say the quiet part out loud, which is almost certainly that the Fivetran and Snowflake bills for what they were doing were probably enormous and those were undoubtedly what got management’s attention about fixing this.
Snowflake as a destination is very, very easy to work with on Fivetran. Fivetran didn't have S3 as a destination until late 2022, so it effectively forced you to use one of BigQuery, Snowflake, or Redshift as the destination.
So the Fivetran CEO's defence is pretty stupid.
> Moving several large, crucial Postgres datasets (some of them tens of TB large) to data lake gave us a net savings of over a million dollars for 2022 and proportionally higher savings in 2023 and 2024.
I thought the quiet part was that they are data mining their customer data (and disclosing it to multiple third parties) because it’s not E2EE and they can read everyone’s private and proprietary notes.
Otherwise, this is the perfect app for sharding/horizontal scalability. Your notes don’t need to be queried or joined with anyone else’s notes.
Also, I wonder whether this data lake is worth the cost and effort. How does this data lake add value to the user experience? What is this “AI” stuff that this data lake enables?
For example, they mention search. But I imagine it is just searching within your own docs, which I presume should be fast and efficient if everything is sharded by user in Postgres.
The tech stuff is all fine and good, but if it adds no value, it's just playing with technology for technology's sake.
I too was surprised to read that they were syncing what reads, at a glance, to be their entire database into the data lake. IIUC, Snowflake prioritizes inserts over updates because you're supposed to stream events derived from your data, not the data itself.
The whole point of a data warehouse is that you can rapidly query a huge amount of data with ad hoc queries.
When your data is in Postgres, running an arbitrary query might take hours or days (or longer). Postgres does very poorly for queries that read huge amounts of data when there's no preexisting index (and you're not going to be building one-off indexes for ad hoc queries—that defeats the point). A data warehouse is slower for basic queries but substantially faster for queries that run against terabytes or petabytes of data.
I can imagine some use cases at Notion:
- You want to know the most popular syntax highlighting languages (a sketch of this one follows the list)
- You're searching for data corruption, where blocks form a cycle
- You're looking for users who are committing fraud or abuse (like using bots in violation of your ToS)
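Here's that sketch for the first one: a hedged, hypothetical ad-hoc query against Parquet files exported from the lake. The path and the `block_type`/`language` columns are invented, and DuckDB is just one possible engine for this kind of one-off question, not necessarily what Notion uses:

    # Hypothetical ad-hoc query over Parquet files exported from the lake;
    # the path and the block_type/language columns are invented.
    import duckdb

    duckdb.sql("""
        SELECT language, count(*) AS n
        FROM read_parquet('lake/blocks/*.parquet')
        WHERE block_type = 'code'
        GROUP BY language
        ORDER BY n DESC
        LIMIT 20
    """).show()

The point is that this kind of full scan over columnar files finishes in minutes, whereas the same question against an un-indexed Postgres fleet would be painful.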
1st paragraph: "Managing this rapid growth while meeting the ever-increasing data demands of critical product and analytics use cases, especially our recent Notion AI features, meant building and scaling Notion’s data lake."
Beyond the features that the sibling comment mentioned, this kind of data isn’t really for end users. It’s a way that you can package it up, “anonymize” it, and sell the data to interested parties.
For someone like Notion, they probably aren't selling this data. The primary use case is internal analysis (e.g. product usage, business analysis, etc.).
It can also be used to train AI models, of course.
That "probably" is doing a lot of heavy lifting. That said, whether they sell it or not, all that data is their primary store of value at the moment. They will either go public or sell, eventually. If they go public, it'll likely be similar to Dropbox; a single fairly successful product, but failing attempts to diversify.
"Selling" is a load-bearing word, too. They're probably not literally selling SQL dumps for hard cash. But there are many ways of indirectly selling data, that are almost equivalent to trading database dumps, but indirect enough that the company can say they're not selling data, and be technically correct.
Notion employee here. We don't put images themselves in Postgres; we use S3 to store them. The article is referring to image blocks, which are effectively pointers to the image.
A "data lake" strongly suggests there's lot of information the company needs to aggregate and process globally, which should very much not be the case with a semi-private rich notebook product.
They literally explained in the article why they have a data lake instead of just a data warehouse: their data model means it's slow and expensive to ingest that data into the warehouse from Postgres. The data lake is serving the same functions that the data warehouse did, but now that the volume of data has exceeded what the warehouse can handle, the data lake fills that gap.
I wrote another comment about why you'd need this in the first place:
Frankly the argument "they shouldn't need to query the data in their system" is kind of silly. If you don't want your data processed for the features and services the company offers, don't use them.
> Frankly the argument "they shouldn't need to query the data in their system" is kind of silly.
Neutral party here: that's not what they said.
A) Quotes shouldn't be there.
B) Heuristic I've started applying to my comments: if I'm tempted to "quote" something that isn't a quote, it means I don't fully understand what they mean and should ask a question. This dovetails nicely with the spirit of HN's "come with curiosity"
It is disquieting because:
A) These are very much ill-defined terms (what, exactly, is a data lake, vs. a data warehouse, vs. a database?), and as far as I've had to understand this stuff, and as a quick spot check of Google shows, it's about accumulating more data in one place.
B) This is antithetical to a consumer's desired approach to data, which I'd describe, parodically, as: stored individually, on one computer, behind 3 locked doors and 20 layers of encryption.
The data doesn't have to be the content of users' notes. Think of all the metadata they're likely collecting per user/notebook/interaction – the data's likely useful for things like flagging security events, calculating the graph of interconnected notes, indexing hashed content for search (or AI embeddings?) ... these are just a few use-cases off the top of my head.
Of which security and stability seems like the only reasonable use cases. Indexing content for search globally? Embeddings? They just can't help themselves, can they? All that juicy data, can't possibly leave it alone.
Great, you build only store-and-retrieve functionality. How do you:
1. Identify which types of content your users use the most?
2. Find users who are abusing your system?
3. Load and process data (even on a customer-by-customer basis) to fine-tune models for the Q&A service you offer as an optional upgrade? Especially when there could be gigabytes of data for a single customer.
4. Identify corrupt data caused by a bug in the code that saves data to the DB? You're not doing a full table scan over hundreds of billions of records across almost 500 logical shards in your production fleet.
These are just the examples I came up with off the dome. The job of the business is to operate on the data. If you can't even query it, you can't operate on it. Running a business is far more than just being a dumb CRUD API.
a database, obviously, but are you really storing metrics and logs next to customer data in the same database, or did you skip over the part where I used the word “main”?
What's there to expand on? Do you not realize how bad of a look it is for a company to publicly admit, on their own blog, the amount of time and engineering effort they spent to package up, move, analyze, and sell all their customer's private data?
This is why laws like CCPA "do not sell my personal information" exist, which I certainly hope Notion is abiding by, otherwise they'll have lawyers knocking on their door soon.
Right, yes, tone aside that’s very helpful- at first I didn’t understand the implication of the blog post for implementing customer hostile solutions, but you’ve helped me understand it now.
That’s definitely something you want to do. A data lake can be home for raw and lightly refined data, whether in an “analytics” database such as BigQuery or just as raw Parquet files. This is fast for large queries but slow for small queries. So you want the refined data in a “regular” database like Postgres or MSSQL to serve all the dashboards.
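A hedged sketch of that split, with made-up paths, columns, and connection details: aggregate the raw Parquet with DuckDB, then push the small refined result into Postgres for the dashboards to read.

    # Hedged sketch of the "raw in the lake, refined in Postgres" split.
    # Paths, columns, and the connection string are all made up.
    import duckdb
    from sqlalchemy import create_engine

    rollup = duckdb.sql("""
        SELECT date_trunc('day', created_at) AS day, count(*) AS blocks_created
        FROM read_parquet('lake/blocks/*.parquet')
        GROUP BY 1
        ORDER BY 1
    """).df()

    engine = create_engine("postgresql://dash:secret@localhost/analytics")
    rollup.to_sql("blocks_created_daily", engine, if_exists="replace", index=False)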
This was a nice read, interesting to see how far Postgres (largely alone) can get you.
Also, we see how self-hosting within a startup can make perfect sense. :)
DevOps setups that abstract things away to the cloud might, in some cases, just add architectural and technical debt later, without the learning that comes from working through the challenges yourself.
Still, it might have been a great opportunity to figure out offline-first use of Notion.
I have been forced to use Anytype instead of Notion for that offline-first reason. Time to check out their source code to learn how they handle storage.
> Managing this rapid growth while meeting the ever-increasing data demands of critical product and analytics use cases, especially our recent Notion AI features, meant building and scaling Notion’s data lake.
Are they using this new data lake to train new AI models on?
Or has Notion signed a deal with another LLM provider to provide customer data as a source for training data?
> Notion does not use your Customer Data or permit others to use your Customer Data to train the machine learning models used to provide Notion AI Writing Suite or Notion AI Q&A [added: our AI features]. Your use of Notion AI Writing Suite or Notion AI Q&A does not grant Notion any right or license to your Customer Data to train our machine learning models.
We do use various data infrastructure, including Postgres and the data lake, to index customer content both with traditional search infrastructure like Elasticsearch, as well as AI-based embedding search like Pinecone. We do this so you can search your own content when you're using Notion.
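To illustrate the general pattern (this is not Notion's actual indexing code; the index name, fields, and principals are invented), indexing a block together with its denormalized permissions and filtering on them at query time might look roughly like this with Elasticsearch:

    # Illustrative only (not Notion's actual pipeline): index a block together
    # with its denormalized permissions, then filter on them at query time.
    # The index name, fields, and principals are invented.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    es.index(index="blocks", id="page-1", document={
        "text": "Quarterly planning notes",
        "allowed_principals": ["user:alice", "team:everyone"],
    })

    hits = es.search(index="blocks", query={
        "bool": {
            "must": {"match": {"text": "planning"}},
            "filter": {"terms": {"allowed_principals.keyword": ["user:alice"]}},
        }
    })
    print(hits["hits"]["hits"])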
It’s not a direct answer but from what Notion tell us about their own business:
* The team are based in the US, specifically California, and Notion Labs, Inc is a Delaware corporation.
* Their investment comes from Venture Capital and individual wealth. The investors are listed on Notion’s about page and are open about how they themselves became rich through VC funded tech companies.
There is a very open sense of panic in tech right now to climb to the top of the AI pile and not get crushed underneath. I would be amazed if there were any companies not enthralled by — and either already embracing or planning to embrace — the data-mining AI gold rush.
Notion is a great product but one would be naive to use it while also harboring concerns about data privacy.
This is one of the best blog posts I've seen that showcase the UPDATE-heavy, "surface data lake data to users" type of workload.
At ParadeDB, we're seeing more and more users want to maintain the Postgres interface while offloading data to S3 for cost and scalability reasons, which was the main reason behind the creation of pg_lakehouse.
Great article, thank you for sharing! I have a question I’d like to discuss with the author. Spark SQL is a great product and works perfectly for batch processing tasks. However, for handling ad hoc query tasks or more interactive data analysis tasks, Spark SQL might have some performance issues. If you have such workloads, I suggest trying data lake query engines like Trino or StarRocks, which offer faster speeds and a better query experience.
Side-ish note, I really enjoyed a submission on Bufstream recently, a Kafka MQ replacement. One of the things they mentioned is that they are working on building in Iceberg materialization, so Bufstream can automatically handle building a big analytics data lake out of incoming data. It feels like that could potentially tackle a bunch of the stack here. https://buf.build/blog/bufstream-kafka-lower-cost https://news.ycombinator.com/item?id=40919279
Versus what Notion is doing:
> We ingest incrementally updated data from Postgres to Kafka using Debezium CDC connectors, then use Apache Hudi, an open-source data processing and storage framework, to write these updates from Kafka to S3.
Feels like it would work about the same with Bufstream, replacing both Kafka & Hudi. I've heard great things about Hudi but it does seem to have significantly less adoption so far.
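For concreteness, the Debezium leg of a pipeline like the one quoted above is usually set up by registering a Postgres connector with Kafka Connect. This is a hedged sketch, not Notion's config: hostnames, credentials, and the table list are invented, and exact property names vary by Debezium version (topic.prefix is the 2.x name for what 1.x called database.server.name).

    # Hedged sketch of the Debezium CDC leg: register a Postgres connector with
    # Kafka Connect so row changes stream into Kafka topics. Hostnames,
    # credentials, and the table list are invented.
    import json
    import urllib.request

    connector = {
        "name": "blocks-cdc",
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",
            "database.hostname": "pg.internal",
            "database.port": "5432",
            "database.user": "cdc",
            "database.password": "secret",
            "database.dbname": "app",
            "table.include.list": "public.block",
            "topic.prefix": "appdb",
        },
    }

    req = urllib.request.Request(
        "http://connect.internal:8083/connectors",
        data=json.dumps(connector).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

From there, a consumer (Hudi in Notion's case, or something like Bufstream's Iceberg materialization) takes over writing those change events to S3.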
Is there any advantage to having both a data lake setup as well as Snowflake? Why would one still want Snowflake after doing such an extensive data lake setup?
Many BI / analytics tools don't have great support for Data Lakes, so part of the reason could be supporting those tools (e.g. they still load some of their data to snowflake to power BI / dashboards)
We've solved that issue with Trino. Superset and a lot of other BI tools support connection to it and it's a very cost efficient engine (compared to DWH solutions). Another way to go even cheaper is using Athena, if you're on AWS.
They are several versions behind; support for Delta was added only recently. Also consider that with Trino you can build a cache layer on Alluxio, making it really fast (especially on NVMe disks).
Saving money, 100%, but also lower latency on distributed access. Accessing file-partitioned S3 doesn't require spinning up a warehouse and waiting for your query to sit in a queue, so if every job runs in something like k8s you don't have to manage resources; and auto-scaling in Snowflake is a “paid feature”.
I believe just not having to handle a query queue system is already worth it.
For one, Snowflake is expensive (you pay for the convenience and simplicity), and the data in there is usually stored in S3 buckets that Snowflake owns (and they don't pass along any discounts they get from AWS for the cost of that storage).
> Iceberg and Delta Lake, on the other hand, weren’t optimized for our update-heavy workload when we considered them in 2022
"when we considered them in 2022" is significant here because both Iceberg and Delta Lake have made rapid progress since then. I talk to a lot of companies making this decision and the consensus is swinging towards Iceberg. If they're already heavy Databricks users, then Delta is the obvious choice.
For anyone that missed it, Databricks acquired Tabular[0] (which was founded by the creators of Iceberg). The public facing story is that both projects will continue independently and I really hope that's true.
Shameless plug: this is the same infrastructure we're using at Definite[1], and we're betting a lot of companies want a setup like this but can't afford to build it themselves. It's radically cheaper than the standard Snowflake + Fivetran + Looker stack and works from day one. A lot of companies just want dashboards, and it's pretty ridiculous the hoops you need to jump through to get them running.
We use iceberg for storage, duckdb as a query engine, a few open source projects for ETL and built a frontend to manage it all and create dashboards.
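For anyone curious what the Iceberg-plus-DuckDB combination looks like, here's a minimal hedged sketch; the table path is invented, and DuckDB's iceberg extension is still evolving, so the exact API may differ by version:

    # Hedged sketch: query an Iceberg table with DuckDB's iceberg extension.
    # The table path is invented.
    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL iceberg")
    con.execute("LOAD iceberg")
    con.sql("""
        SELECT count(*) AS events
        FROM iceberg_scan('lake/events_iceberg')
    """).show()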
They advertise markdown support, the ability to export markdown, and the ability to import markdown...
However, what they don't say is that the export and import formats aren't compatible; they are different subsets of markdown with different features.
If I export a notion page as markdown, then re-import that same markdown document back into notion, I get something wildly different.
All I want is to not use the notion editor (which lags and sometimes crashes my browser), and to instead use my local text editor which has served me well for everything else.
Failing at that, I want to edit plain text, like I can in github comments or wikipedia pages.
Like, the fact that if I write 'foo`' and then go back and insert a backtick before the word 'foo', it doesn't code-format it, but if I type it in order as '`foo`', it does. That makes it very clear I'm not editing text and that there is weird hidden state, which is annoying to reason about.
Just let me edit something like markdown directly, with an optional preview window somewhere, and that would _also_ be vastly better than the mess they have.
That's probably the problem. It's over engineered and just does things I didn't want or need it to do. I just want to type words and paste things in without having a bunch of bullshit happen.
Thank you for the clarification! It's great to hear more about the efficient data management practices at Notion. Your team's innovative use of the data lake to streamline the reindexing process while ensuring user data privacy is impressive. Keep up the excellent work!