Really interesting. Curious about one particular case -- obfuscating the values of columns which are part of a `unique` constraint.
Let's take, say, a set of a hundred thousand nine digit social security numbers, stored in a DB column that has a uniqueness constraint on it. This is a space small enough that hashing doesn't really mask anything -- there are few enough possible values that an adversary can compute hashes of all of them, and unmask hashed data. But the birthday paradox says that the RandomString transformer is highly unlikely to preserve uniqueness -- ask a genuinely random string generator to generate that many nine-digit strings, and you're extremely likely to get the same string out more than once, violating the uniqueness constraint.
One approach that I've seen to this is to assign replacement strings sequentially, in the order that the underlying data is seen -- that is, in effect, building up a dictionary of replacements over time. But that requires a transformer with mutable state, which looks kind of awkward in this framework. The best way I can see to arrange this is a `Cmd` transformer that holds the dictionary in memory. Is there a neater approach?
Not sure what the approach of this library is, but can't you generate a nonce from a larger alphabet, hash the column values with the nonce `hash(nonce || column)`, and crypto-shred the nonce in the end.
Then, during hashing you just need a constant immutable state, which effectively expands the hash space, without incurring the mutable state overhead of replacement strings strategy.
We ran it weekly to dump production down to a “scrubbed” copy which we used for our test/qa environments and then further trimmed some large tables to create a much smaller development version.
It really helps to have the “shape” of production data and we saw surprises/bugs on prod deploys due to unexpected data drop to almost nothing. It was much easier to maintain than seeds imo, though it is a cat and mouse game, if a new column with sensitive data is added, you need to ensure it’s covered with a masking query too, but we solved that by auto tagging the security ops folks whenever our schema file changed on GitHub.
Looks interesting. I am struggling mightily with a problem at work where we have a very large production DB and need a subset of it for local dev environments. You can't just randomly sample the data as that will break foreign keys and other relationships.
Anyone have a solution to this problem that works well for them?
At each employer I now start by writing a data extractor that takes one client's data and loads it into my local DB. Usually includes some 'skip' or filter options to make the process faster. And some transformations like anonymizing PII.
With PG you can bypass the FK checks locally by temporarily changing the replication mode from origin to replica. Though if you don't eventually copy the referenced records you may run into FK errors during normal operations.
I‘m using https://github.com/ankane/pgsync for that, it has a very easy config yaml and you can just define queries that get pulled from the db.
I have a simple one like „base_tables“ that just pulls me all the fixtures into my local db, then entity specific ones that pull an entity with a specific id + all related entries in other tables for debugging but as long as you can query it you can set up everything very easily.
I would distinguish between "local dev environment" and "debugging specific data-dependent customer issue".
My experience is you can usually write the application with no dependencies on customer data, and do local development entirely with synthetic data.
For debugging specific customer escalations or understanding data distributions for your synthetic data modeling, you can use a replica of production. No need to extract just a subset or have devs host it themselves. All access can be controlled and logged this way.
Better than a replica would be something like [1] but that isn't available everywhere.
I agree the problem of "dump a DB sample that respects FK constraints" comes up sometimes and is interesting but I'm not sure I'd use it for the problem you describe.
We solve this by managing referential integrity across relational databases when we subset. Effectively you have to build a graph of edges and nodes where edges are constraints and nodes are tables, then you have to follow these edges from node to node to maintain the constraints. It's certainly not trivial and gets very complicated when people use sensitive data as a primary key (for ex. email)/foreign key, then you need a way to anonymize those emails and shelf those primary keys until you can re-insert them later so you don't break the constraints.
(disclaimer - co-founder of Neosync: github.com/nucleuscloud/neosync, open source tool to generate synthetic data and orchestrate and anonymize data across database)
And I'm updating the documentation for v2.0.0 in real-time https://docs.dbsnapper.com. The new configuration is there, but I have some work to do adding docs for the new features.
Works with Mac and Linux (Windows soon)
If you're interested in giving the alpha a try, I'd love feedback and can walk you through any issues you might encounter.
Built something like this a few years ago for work, but it has become a monstrosity that is difficult to enhance and maintain. We've been debating whether to build a more extensible v2 since I haven't seen anything that fits our needs on the market. I'm excited to check this out!
It's closed-source / proprietary at the moment, but I'm highly considering opening parts / all of it up as a golang library at a minimum, or do a free/pro version to support development. Any thoughts, suggestions?
Postgres can do a pg_dump with RLS (Row Level Security) enabled to only get 1 tenants data. Very convenient if you use RLS already, if you don't, wouldn't take too long if you have tenant_id on every table
Let's take, say, a set of a hundred thousand nine digit social security numbers, stored in a DB column that has a uniqueness constraint on it. This is a space small enough that hashing doesn't really mask anything -- there are few enough possible values that an adversary can compute hashes of all of them, and unmask hashed data. But the birthday paradox says that the RandomString transformer is highly unlikely to preserve uniqueness -- ask a genuinely random string generator to generate that many nine-digit strings, and you're extremely likely to get the same string out more than once, violating the uniqueness constraint.
One approach that I've seen to this is to assign replacement strings sequentially, in the order that the underlying data is seen -- that is, in effect, building up a dictionary of replacements over time. But that requires a transformer with mutable state, which looks kind of awkward in this framework. The best way I can see to arrange this is a `Cmd` transformer that holds the dictionary in memory. Is there a neater approach?