Data diffs: Algorithms for explaining what changed in a dataset (2022) (marcua.net)
204 points by winkywooster on July 27, 2023 | 20 comments



We're doing an env migration and we've been using the Spark diff extension to reconcile data. It's amazing; we've discovered bugs in the data logic so quickly.

Here is the extension if anyone is interested: https://github.com/G-Research/spark-extension/blob/master/DI...


Dolt is working in this space, and I think there is a lot of potential:

https://www.dolthub.com/

One use case I've not heard anyone talk about is modeling/optimization. In my previous role we were managing a bunch of data that all fit nicely in databases (e.g. market prices, timeseries data from facilities, optimization model structure) but had no good tools to summarize "what changed from last week" when the optimization models would spit out something unusual.


TerminusDB is also in that area.

Dolt is git for MySQL.

TerminusDB is git for RDF/graphs.

https://terminusdb.com


Does anyone here who uses it in production care to comment?


> Make diff work on more than just SQLite.

Another way of doing this, which I've been wanting to try for a while, is to implement the DIFF operator in Apache Calcite[0]. Using Calcite, DIFF could be expressed as rewrite rules that generate the appropriate SQL to be executed directly against the database, or the operator could be evaluated outside of the database (which the original paper shows is more efficient). A rough sketch of what such a rewrite might generate is below.

[0] https://calcite.apache.org/
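
To make that concrete, here is a rough sketch (my own SQL, not Calcite's output or the paper's implementation) of the kind of query such a rewrite rule might emit for a max_order = 1 DIFF over a single column, using the paper's support and risk-ratio definitions. The tables (crashes, non_crashes), the column (app_version), and the thresholds are all invented for illustration:

    -- Sketch only: one-column, order-1 DIFF expanded to plain SQL.
    WITH o AS (SELECT app_version, COUNT(*) AS cnt FROM crashes GROUP BY app_version),
         i AS (SELECT app_version, COUNT(*) AS cnt FROM non_crashes GROUP BY app_version),
         t AS (SELECT (SELECT SUM(cnt) FROM o) AS o_total,
                      (SELECT SUM(cnt) FROM i) AS i_total)
    SELECT o.app_version,
           o.cnt * 1.0 / t.o_total AS support,
           -- risk_ratio: crash rate among rows with this value divided by
           -- the crash rate among rows without it
           (o.cnt * 1.0 / (o.cnt + COALESCE(i.cnt, 0)))
             / NULLIF((t.o_total - o.cnt) * 1.0
                      / NULLIF((t.o_total - o.cnt) + (t.i_total - COALESCE(i.cnt, 0)), 0), 0)
             AS risk_ratio
    FROM o
    LEFT JOIN i ON i.app_version = o.app_version
    CROSS JOIN t
    WHERE o.cnt * 1.0 / t.o_total >= 0.05          -- support threshold
    ORDER BY risk_ratio DESC;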


We use a giant pile of Clojure (almost everything, with specific measured exceptions, is Clojure). As a side effect, we have a lot of data. Not necessarily a lot in the sense of "big S3 bill", but definitely a lot in the sense of "you might not expect this to be in a machine-readable format".

Things like: "what Lambdas existed in a customer AWS account 6 months ago in us-east-2 that had access to a specific SQS queue?" (because we learned later that one of the consumers of that queue would actually consume Python pickles if you asked nicely, and hence get you RCE).

As a side effect, we do a lot of data diffing: mostly just on more vanilla Clojure structures, rather than datasets in the Datasette/CSV/... sense.

For example, we have tooling like [recidiffist], which we also have wired up to Terraform + S3, so if you write some files to S3, you get the structured diffs right next to them for free. It's one of those things that's simple and works ridiculously well. Well, if you do it consistently, anyway. "Let's look at what resources aren't managed by IaC, and how that has changed historically" is an interesting question to me, but I couldn't answer it without having that data available, encoded in a way that didn't presuppose how it would be used.

Even though we mean "explanation" differently, this really is critical for enabling explanation and exploration. While many of these pieces of data (e.g. "historical complete snapshots of AWS environments") aren't big compared to a lot of corporate data lakes or log management systems, they're way bigger than what fits in my tiny human brain at once. They're still amenable to fast local analysis. We use [clerk] for amazing notebooks to aid in that.

Where this all comes together is in the encoded, and machine-evaluatable, expressions. The way you do a Cloud audit is you write a Clojure program: you develop invariants about the environment that are or at least should be true, and then you talk about the places that didn't work out.

[recidiffist]: https://github.com/latacora/recidiffist

[clerk]: https://github.com/nextjournal/clerk


The "diff" described in this article is very different from "diff" in Unix, it's a "fancy data science data comparison tool", not a "pedestrian data engineering tool" comparing two tables.

If you want a SQL diff that works more like Unix diff (to validate that two huge tables are the same, that data didn't get corrupted in some ETL process, and that no rows got duplicated or column values got mangled), you can use the following technique, which basically just uses ordinary GROUP BY/UNION ALL/HAVING SQL operators in a set-based way:

https://github.com/gregw2hn/handy_sql_queries/blob/main/sql_...
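
The basic shape (my own paraphrase of the idea, not the linked query; table and column names are placeholders) is something like:

    -- Rows that don't appear the same number of times in both tables
    -- (catches missing rows, duplicated rows, and mangled column values).
    -- old_table/new_table and col1..col3 are placeholders.
    SELECT col1, col2, col3,
           SUM(CASE WHEN src = 'old' THEN 1 ELSE 0 END) AS old_count,
           SUM(CASE WHEN src = 'new' THEN 1 ELSE 0 END) AS new_count
    FROM (
        SELECT 'old' AS src, col1, col2, col3 FROM old_table
        UNION ALL
        SELECT 'new' AS src, col1, col2, col3 FROM new_table
    ) u
    GROUP BY col1, col2, col3
    HAVING SUM(CASE WHEN src = 'old' THEN 1 ELSE 0 END)
        <> SUM(CASE WHEN src = 'new' THEN 1 ELSE 0 END);

An empty result means the two tables match row for row, including duplicate counts; any differences come back with their per-source counts.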

I have compared a pair of billion-row tables in under a minute on a columnar database with this technique.

This can be handy for regression testing algorithmic results that produce large datasets or for testing certain types of data migrations.


The author of this blog post didn't handle minimality or max_order > 1, which basically sidesteps all the benefits of DIFF: instead you can just brute force across all column values with support over the threshold. That turns an exponential or polynomial problem (in the cardinality of the columns and the number of columns) into one that is linear in the number of column values across columns. But even DIFF has a problem:

From the example: because I updated my application from v2 to v3, the initial set has 90% of (Galaxy, 11.0) records on v2 and 4% on v3. After the update, v2 is at 10% and v3 is at 80% of records with (Galaxy, 11.0). And in both time slices, (Galaxy, 11.0) represents 80% of records.

Let's say the total crash rate increased 25% and I call DIFF on (version, client, os). DIFF highlights the (v3, Galaxy, 11.0) slice first, even if its crash rate was the same as v2's, because there was an organic shift in the proportions of v2 and v3 records: the contribution is 20x and the support is 75%. But actually, some combo in the slices outside of (Galaxy, 11.0) contributed most of the diff - it gets ranked second, with less contribution and less support. The "explanation" that v3 introduced a bug is wrong: we simply caught that v3 replaced most usage of v2.

DIFF does roll-ups, but it applies a minimality and risk-ratio threshold to the results, so the fact that the (Galaxy, 11.0) crash rate across v2 and v3 didn't actually change may even hide the true story from the result. But the minimality threshold is what makes their implementation performant.

If you disable minimality and directly compute all roll ups, you have to group by as many permutations as the power set of the column cardinalities, instead of the product of their cardinalities, which on k columns with cardinality ~n takes you from O(n^k) to O(2^(n^k)) (I think, please check the math). Note that in the experimental run with minimality disabled they performed worse than their comparison software, and that most datasets don’t have any mention of cardinalities. But, with minimality enabled, IIUC you don’t catch “shift within X^Y from X^Y^Z to X^Y^~Z that doesn’t explain overall shift because XY risk_ratio is 0” which is a problem.

DIFF also mitigates the issue by limiting the total number of columns combined to produce an answer. It's a nice way to prevent long runtimes, but it imposes additional constraints on real-world utility - how often is it just version+client+os that's the problem and not version+client+os+uses_feature+migrated_from_legacy_table? My post is getting too long, but many state-of-the-art experimentation tools work around this by controlling for distribution shifts between sets - that doesn't work well with a table of just crashes, but works fine once you have a denominator like "usage time".


Some years ago, when I was digging into DIFF and MacroBase (the one from Bailis's lab), I made a simple reproduction of the DIFF algo: https://github.com/PiotrZakrzewski/macrobase-diff


This looks very interesting. Not having read the paper or the author's library, I am curious how well it scales to more features. I would naively assume it is doing an exponential number of column comparisons.

Keeping in mind the strong correlation between divorces and margarine (https://tylervigen.com/spurious-correlations) I am still tempted to wire this up to automatically generate reports when data falls outside of a certain threshold.


I read the paper. It's fundamentally, I think, an exponential problem, but they apply some constraints to reduce it: only considering tuples of size n <= 3, plus pruning.

Personally I think the case where you're doing this on one column is not that interesting: for each column, get all values with >= the support frequency, group by it on old and new, and include in the results if risk_ratio is over the threshold. List results in order of risk_ratio (a rough sketch is below). Going from that to just two columns is much, much more computationally demanding, and that's where pruning and such really matters.
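
Here's roughly what I mean for the single-column pass (my own SQL, not the author's code, and using a plain frequency ratio as a stand-in for the paper's risk_ratio; old_rows/new_rows, the version column, and the thresholds are made up):

    -- For one column: frequency of each value in new vs. old, filtered by a
    -- support threshold and ranked by the ratio; repeat per column.
    WITH n AS (SELECT version, COUNT(*) AS cnt FROM new_rows GROUP BY version),
         o AS (SELECT version, COUNT(*) AS cnt FROM old_rows GROUP BY version)
    SELECT n.version,
           n.cnt * 1.0 / (SELECT SUM(cnt) FROM n) AS support,
           (n.cnt * 1.0 / (SELECT SUM(cnt) FROM n))
             / NULLIF(COALESCE(o.cnt, 0) * 1.0 / (SELECT SUM(cnt) FROM o), 0) AS ratio
    FROM n
    LEFT JOIN o ON o.version = n.version
    WHERE n.cnt * 1.0 / (SELECT SUM(cnt) FROM n) >= 0.05   -- support threshold
    ORDER BY ratio DESC;

Repeating that per column stays linear in the number of distinct values; the pain starts once you look at column pairs.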


The article was giving me a market basket analysis vibe (maybe just the shared terminology), so possibly a similar underlying algorithm could do the pruning.


Of interest might be Explain-Da-V [0], which will be presented at VLDB this year.

[0] https://github.com/shraga89/ExplainDaV


I am working on this for a different definition of the term "dataset". I started learning deep learning, which led me to start building datasets.

Wanting to store versions of the datasets efficiently I started building a version control system for them. It tracks objects and annotations and can roll back to any point in time. It helps answer questions like what has changed since the last release and which user made which changes.

Still working on the core library but I'm excited for it.


Have you looked into existing version control systems for data, such as DVC?


Thanks for the suggestion. I have glanced through the docs in the past but haven't tried it. I am trying to do a bit more than what git can offer.

First the good. Git LFS solves the issue of checking out a massive repository in whole.

Git can work pretty well if your annotations are in a text based format and stored one annotation per file. That makes it easy to track and attribute annotation changes.

What I'm building can serve as a backend for labeling. There is a built-in workflow for reviewing changes, objects have different statuses (in annotation, included in a release, etc.), releases are reproducible, things like that.

It is really designed for collaboration with untrusted third parties. Imagine someone making a pull request for a binary annotation format. To review it you would have to clone it, load it in an annotation tool, then go and tie what you saw back to what is in the pull request. What do you do if, say, 90% of the annotations are correct? Reject everything? Very tough, and it also assumes your annotator can make a pull request.

Mine will still require you to bring your own annotation tool, but makes it much easier to integrate the review process.


Might want to check out lakeFS: https://github.com/treeverse/lakeFS

(full disclosure: I'm one of the creators)


I've bookmarked this for when I add a Diff transform into Easy Data Transform. ;0)


I wonder if someone has extended Postgres to add this yet. Googling "diff" is hard, as there are so many other uses of that term.


The open data world needs this for easy and consistent checks between OpenStreetMap and Overture Maps.



