I've been in the end stage of this (I worked on data validation for a good chunk of my career), and these are my thoughts on the article:
Determining blocking vs non blocking is a big issue - deciding which checks should be stoppers and which shouldn’t is often a matter of extensive debate. In my experience, only a few data checks are absolute show stoppers under any circumstance and a lot of things need to spawn tickets that should be routed to the correct team and followed up on. Some type of tracking system is necessary for this.
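Purely as a sketch of what I mean (names are illustrative, and `file_ticket` is a stand-in for whatever tracking system you have):

```python
# Illustrative sketch: route check results by severity instead of blocking on everything.
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    BLOCKING = "blocking"            # absolute show stopper: fail the pipeline run
    NON_BLOCKING = "non_blocking"    # let data land, but track the issue


@dataclass
class CheckResult:
    name: str
    passed: bool
    severity: Severity
    owner_team: str                  # where the follow-up ticket should be routed


def file_ticket(team: str, summary: str) -> None:
    """Stand-in for whatever ticketing/tracking system you use (Jira, etc.)."""
    print(f"[ticket -> {team}] {summary}")


def handle_results(results: list[CheckResult]) -> None:
    blocking_failures = []
    for r in results:
        if r.passed:
            continue
        if r.severity is Severity.BLOCKING:
            blocking_failures.append(r)
        else:
            # Non-blocking failures don't stop the run, but they must not be silently dropped.
            file_ticket(team=r.owner_team, summary=f"Data check failed: {r.name}")
    if blocking_failures:
        raise RuntimeError(f"{len(blocking_failures)} blocking check(s) failed")
```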
Defining the logic of the checks themselves in YAML is a trap. We went down this DSL route first, and it basically falls apart once you want to add moderately complex logic to your check. AirBnB will almost certainly discover this eventually. YAML does work well for specifying how the check should behave, though (e.g. the metadata of the data check). The solution we were eventually able to scale was coupling a specification in a human-readable but parseable file with code in a single unit known as the check. These could then be grouped according to various pipeline use cases.
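Roughly the shape we ended up with, as a sketch (the spec fields and table/column names are invented, not our actual framework):

```python
# Illustrative only: one "check" unit = a parseable spec (behaviour/metadata) + code (logic).
import yaml  # PyYAML

# The YAML carries metadata about how the check behaves, never the check logic itself.
SPEC = yaml.safe_load("""
name: bookings_booking_id_not_null
owner: data-quality-team
severity: blocking
target_table: fact_bookings
""")


def run(conn) -> bool:
    """The logic lives in ordinary code, so arbitrary complexity stays manageable."""
    (null_count,) = conn.execute(
        f"SELECT COUNT(*) FROM {SPEC['target_table']} WHERE booking_id IS NULL"
    ).fetchone()
    return null_count == 0
```

Groups of these spec-plus-code units can then be registered per pipeline use case.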
A model that plugs into an Airflow DAG, as AirBnB has designed, seems like a good approach. Often when it was time to incorporate checks into a pipeline we had heterogeneous strategies for invoking our check engines. Having a standardized approach helps drive adoption across the organization - oftentimes I've found that people are reluctant to run non-critical checks if there's a significant time and effort cost, and will only run critical ones to try to push data quality accountability either upstream or downstream. If it's really easy to turn on and incorporate, that's one less excuse for not running the checks.
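Wall itself isn't public, so this is just a generic sketch of the "check as one more task in the DAG" idea, assuming a recent Airflow 2.x (task names and the check callable are made up):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_quality_checks(**_):
    # Call whatever check engine you've standardized on; raise to block downstream tasks.
    ...


with DAG(dag_id="bookings_pipeline", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    transform = PythonOperator(task_id="transform", python_callable=lambda: None)
    quality_checks = PythonOperator(task_id="quality_checks", python_callable=run_quality_checks)
    publish = PythonOperator(task_id="publish", python_callable=lambda: None)

    # The checks sit between transform and publish, so enabling them is one line of wiring.
    transform >> quality_checks >> publish
```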
You seem to know what you're talking about. Ignorant question: do you think Dagster would work better as an orchestration/validation tool than AirBnB's Wall?
I don’t know much about Dagster but it does not look like they have a validation tool equivalent to Wall, which requires Airflow. So you would not get validation with Dagster unless you brought it yourself.
For blocking checks - I personally use the notion of errors and warnings, with errors going to quarantine only (never propagated to good data), and warnings going to both good data and quarantine. It's a trade-off between not blocking all data and having visibility into what is potentially bad. Another approach is to send everything to quarantine, but then give users an instrument for rescuing their data, and keep tuning the checks so this happens less often.
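As a sketch of that split (assuming each check returns "ok", "warning" or "error" for a record):

```python
def route(record, checks):
    """Return (goes_to_good, goes_to_quarantine) for a single record."""
    outcomes = [check(record) for check in checks]
    if any(o == "error" for o in outcomes):
        return False, True     # errors: quarantine only, never propagated downstream
    if any(o == "warning" for o in outcomes):
        return True, True      # warnings: propagate, but keep a copy in quarantine for visibility
    return True, False         # clean: good data only
```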
I'm a little bit annoyed at reading about details that seem closely connected to internal code (e.g. CheckConfigModel classes) without being able to see the source.
I am not sure what others find so compelling about this blog post. Granted it's from Airbnb which probably has one of the more interesting data sets, but honestly it looks to me like an internal blog post that's been reposted to Medium without considering the viewpoint of an external user. I understand if they don't want to open source the framework; but then most of the blog post should be about design principles, maybe a bit about the process itself — not implementation details that seem directed towards an internal audience.
Thanks for this post! Naive question: why not "just use Great Expectations"? At first blush GE seems like it has a lot of what you need out of the box: checks definable in YAML, extensibility, and connectors to many major data sources.
Was there something you all found lacking there which made "roll your own" the right approach here?
As a software engineer new to the data space, I am baffled by why people recommended great_expectations. It has a lot of questionable dependencies that inflate image sizes and lead to conflicts at scale. It is also a very ambitious project that fails to deliver on many fronts, including documentation and basic data quality checks. The complexity of writing your own checks is way too high: there are a lot of very abstract concepts you have to understand before you can write a single line of code. If you think I'm wrong, stop now and go look at some of their code examples. You're better off using Python's built-in unittest to run a query and then make assertions on the result as a task in your DAG.
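Something like this is all I mean, with sqlite3 standing in for whatever warehouse connection your task would actually use (table and column names are placeholders):

```python
import sqlite3
import unittest


class BookingChecks(unittest.TestCase):
    def setUp(self):
        self.conn = sqlite3.connect("warehouse.db")  # placeholder connection

    def test_no_null_booking_ids(self):
        (nulls,) = self.conn.execute(
            "SELECT COUNT(*) FROM bookings WHERE booking_id IS NULL"
        ).fetchone()
        self.assertEqual(nulls, 0)

    def test_table_is_not_empty(self):
        (rows,) = self.conn.execute("SELECT COUNT(*) FROM bookings").fetchone()
        self.assertGreater(rows, 0)


if __name__ == "__main__":
    unittest.main()  # the DAG task just runs this module and fails on a non-zero exit code
```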
> The new role requires Data Engineers to be strong across several domains, including data modeling, pipeline development, and software engineering.
> comprehensive guidelines for data modeling, operations, and technical standards for pipeline implementation
> Tables must be normalized (within reason) and rely on as few dependencies as possible. Minerva does the heavy lifting to join across data models.
> When we began the Data Quality initiative, most critical data at Airbnb was composed via SQL and executed via Hive. This approach was unpopular among engineers, as SQL lacked the benefits of functional programming languages (e.g. code reuse, modularity, type safety, etc)
> made the shift to Spark, and aligned on the Scala API as our primary interface. Meanwhile, we ramped investment into a common Spark wrapper to simplify reads/write patterns and integration testing.
> needed to improve was our data pipeline testing. This slowed iteration speed and made it difficult for outsiders to safely modify code. We required that pipelines be built with thorough integration tests
> tooling for executing data quality checks and anomaly detection, and required their use in new pipelines. Anomaly detection in particular has been highly successful in preventing quality issues in our new pipelines.
> important datasets are required to have an SLA for landing times, and pipelines are required to be configured with Pager Duty
> a Spec document that provides layman’s descriptions for metrics and dimensions, table schemas, pipeline diagrams, and describes non-obvious business logic and other assumptions
> a data engineer then builds the datasets and pipelines based on the agreed upon specification
Is this available for others to use or internal only? I think the answer is the latter as a google search didn't turn anything up and I didn't see anything in the article. But if I'm wrong I'd love to kick the tires a bit.
I'm the first person to criticize the dumpster fire that is Airbnb. I've hosted with them for 5 years and they've made awful decisions time and time again. Scams aren't as bad on Airbnb compared to VRBO and TripAdvisor, though.
> Hive SQL, Spark SQL, Scala Spark, PySpark and Presto are widely used as different execution engines
This makes me think they're doing something very very wrong. AirBNB does not have data on the scale that would require these tools. They have 5.6 million listings, 150 million users, and 1 billion total person-stays. These numbers can easily be processed with Postgres or SQLite on single machines. Spark and Hive are for companies like Google and Facebook.
Have you ever worked in data engineering? They're using these systems for event data, data generated through transformations (multiplicative effect on base size), data used for ML, etc.
These events aren't just being generated per stay. A company like Airbnb will have events about logins, searches, site interactions, etc. You'll also be transforming the raw data and storing it again as higher level, materialized tables.
Disclaimer: Worked at Airbnb (not on a data engineering or data infra team)
So all unimportant data? I mean sure you can squeeze insights out of that but if a third of it disappeared overnight it wouldn't be a big deal.
And even then anything short of obsessive mouse tracking won't be that much data.
This isn't doing much to prove that the stuff in the article matters. Maybe it does but it's not self-evident and the criticism upthread makes sense.
(Please note that I am not ignorantly saying the job is easy. I'm mostly wondering if it affects revenue and satisfaction by more than a tiny sliver to do the hard job with all these different big data engines as opposed to doing a much simpler job.)
Search interaction data is some of the most valuable data in a marketplace. I never worked at Airbnb, but I have worked at companies smaller than Airbnb where improving ranking had an impact of many millions of dollars per year on revenue.
> And even then anything short of obsessive mouse tracking won't be that much data.
Consider tracking clicks and query / results. That's already 2 orders of magnitude more data than suggested by the OP, even under very conservative assumptions.
> That's already 2 orders of magnitude more data than suggested by the OP, even under very conservative assumptions.
If we estimate a search input as 50 bytes and the results as 25 4-byte ID numbers, then multiply by 100 billion, that's 15TB, one hard drive or a couple SSDs.
And a hard drive full of clicks can fit a trillion.
So even at 2-3 orders of magnitude over a billion, we're not looking at huge sums of data.
And it's quite questionable whether you need special systems dedicated to keeping that click data pristine at all times.
Even using your numbers, if you want to keep say only 3 months of data, we're talking about 1-2 PB already. Being able to query those data across different segments and aggregate into different dimensions is already quite far beyond what you can do with off-the-shelf PostgreSQL or SQLite.
And in general, in companies the size of Airbnb, you don't control all the data sources tightly enough to make them super efficient, because that's organizationally impossible. So instead the data will be denormalized, etc.
There is a reason most companies with those problems use systems like BigQuery, Snowflake, and co. If it were possible to do with SQLite, a lot of them would do it.
> Even using your numbers, if you want to keep say only 3 months of data, we're talking about 1-2 PB already.
Am I doing the math wrong? "1 billion" was supposed to be lifetime stays, but let's say it's per year. Here's the math using 'my' numbers:
1 billion stays per year * 100 searches per stay * 150 bytes per search = 15TB per year
1 billion stays per year * 1000 page loads per stay * 15 bytes per page load = 15TB per year
How are you getting petabytes? If 3 months is 1-2 hundred million stays, you'd need to store ten million bytes per stay to reach 1-2PB. (And images don't count, they wouldn't be in the database.)
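Spelled out in code, using only the assumptions above:

```python
# Back-of-envelope check; all figures are this thread's assumptions, not real Airbnb numbers.
stays_per_year = 1_000_000_000

search_bytes = stays_per_year * 100 * 150       # 100 searches/stay, 150 bytes/search
page_load_bytes = stays_per_year * 1000 * 15    # 1000 page loads/stay, 15 bytes/page load

print(search_bytes / 1e12)      # 15.0 TB/year of search events
print(page_load_bytes / 1e12)   # 15.0 TB/year of page-load events
```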
you're right about TB vs PB ofc :) But then keep in mind the assumptions were super conservative:
* 1k qps is likely off by at least half an order of magnitude
* 1 byte per event is obviously off by several orders of magnitude; let's say just 1.5 orders of magnitude. You need to know if an event is a click/buy/view/..., you need the doc id, which will likely be a 16-byte uuid, etc.
* etc.
So you will reach PB scale, if not within a few months then at least per year. SQLite or "simple" Postgres really is not gonna cut it.
I work in search for a C2C marketplace that is smaller than airbnb, and w/o going into details, we reach those orders of magnitude in big query.
Okay, if you're going to try to inflate your estimate by 200x so you can get back to petabyte range then I'll do a more detailed comparison.
> * 1k qps is likely off by at least half an order of magnitude
Wasn't your math based on a thousand queries per second? I don't think that's unreasonably small.
And my math, in the "let's say it's 1 billion stays per year" version, assumes three thousand queries per second.
And then you're assuming 100 clicks per query, a hundred thousand page loads per second. And I'm assuming 10 clicks per query, thirty thousand page loads per second. Are those numbers way too small?
> * 1 byte per event is obviously off by several orders of magnitude; let's say just 1.5 orders of magnitude. You need to know if an event is a click/buy/view/..., you need the doc id, which will likely be a 16-byte uuid, etc.
Sure, I think 30 bytes is reasonable for a click. When I said 15 I was squeezing a bit much. But timestamp, session ID, page template ID, property/comment/whatever ID, page number, a click/buy/view byte if that isn't implied by the page template... no need for that to be more than 30 bytes total.
30 bytes per event * 30000 events per second * 1 year = only 2 hard drives worth of click data. And historically their stays were significantly less than they are today.
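To sanity-check my own numbers (the field layout here is invented, just to show fixed-width fields fit the budget):

```python
import struct

# uint32 timestamp, uint64 session id, uint16 page template id,
# uint32 property id, uint8 page number, uint8 event type
EVENT_FORMAT = "<IQHIBB"
print(struct.calcsize(EVENT_FORMAT))           # 20 bytes, comfortably under the 30-byte budget

seconds_per_year = 365 * 24 * 3600
print(30 * 30_000 * seconds_per_year / 1e12)   # ~28 TB/year at 30 bytes/event, 30k events/s
```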
> I work in search for a C2C marketplace that is smaller than airbnb, and w/o going into details, we reach those orders of magnitude in big query.
Well there's a lot of factors here. Maybe your search results are a lot more complicated than "25-50 properties at a time". Maybe you're tracking more data than just clicks. Maybe you have very highly used pages that need hundreds of bytes of data to render. Maybe you're using UUIDs when much smaller IDs would work. Maybe you're storing large URLs when you don't need to. Maybe you're storing a copy of browser headers a trillion times in your database.
Add a bunch of those together and I could see a company storing massively more data per page. But I'm not convinced AirBnB needs it to track queries and clicks specifically. Or that they really need sub-click-resolution data.
I agree certain orders of magnitude are harder to guess accurately. My claim for 1k qps being conservative is mostly based on
* my current company, but I can't share more precise numbers. So I guess not very convincing :)
* however, at ~300 million bookings in 2021 for Airbnb, that means ~10 bookings per second. 1k qps implies ~100 queries per booking, which would be an extremely good ratio. Every time you change a parameter (price range, map range, type of place), that's a new query.
> Sure, I think 30 bytes is reasonable for a click. When I said 15 I was squeezing a bit much. But timestamp, session ID, page template ID, property/comment/whatever ID, page number, a click/buy/view byte if that isn't implied by the page template... no need for that to be more than 30 bytes total.
I agree that once you start thinking about optimization, such as interning UUIDs, using Parquet or another columnar format, adding compression, etc., you could maybe get down to double-digit or even single-digit TB of data per month.
But keep in mind the teams working on the event generation (mobile, web, etc.) and the teams consuming those events work completely separately. And those events have multiple uses, not just search, so they tend to be "denormalized" when stored at the source. A bit old but still relevant reference from twitter that explained how many companies do that part: https://arxiv.org/pdf/1208.4171.pdf.
A search would contain a lot more than that. Structured events about search result pages will often contain a denormalized version of the results and how they were placed on the screen, experiment IDs, information about the user's session, and standard tracking data for anti-abuse.
You might use these data to make statements in legal documents, financial filings, etc. and therefore you’d want a good story about why those data are trustworthy.
Think about e.g. ranking in their search/recommendation engine. To be able to train ranking ML models, you would need to at least track the views, clicks, purchases, etc. done through their platform. For each search, you want to keep the query string and the result item IDs.
Let's say, very conservatively, they have 1000 qps on average. We're talking about hundreds of millions of events a day.
And this video from May 2017 mentions 1B daily events (for whatever they define as an event):
https://youtu.be/70luTZU-D3E?t=102
It wouldn't surprise me if they're storing calls between microservices as "events" and they're likely logging a lot of both user data and internal services data, but that's purely a guess.
It seems like more and more companies (AWS and Netflix in particular) deploy ~1k microservices.
I work on a team where we manage ~14 microservices per environment (dev, staging, and production - ~42 max in total) and find them complicated to manage and monitor...
It happens with technology sometimes, but all the time with finance. Having worked in hedge funds most of my career (as an engineer, but I see enough of the business side), it's hilarious how clueless but confident people on HN are about anything that touches finance, trading, stocks, crypto, etc. Nothing wrong with being clueless, but the hilarious part is the confidence of the posters about what they are writing, which they probably got from Medium. If you didn't know better, you'd think they know what they're talking about.