> Aww I’m feeling bad for the poor engineer who is saying “crap” a thousand time...

onion2k · on June 18, 2021

I hope not. It sounds like the test database was not being anonymized.

Taking a copy of a production database and using it for tests is a bad idea, even if you believe you're expunging any private user data.

Development, staging, and test environments just shouldn't ever have access to production data. If you're at a company that's ISO27001 certified for data security it even goes as far as most employees not having any access to data. I've never seen any production data for the app I work on.

https://en.m.wikipedia.org/wiki/ISO/IEC_27001

gls2ro · on June 18, 2021

I agree about the part of not accessing information from production.

But I am wondering how we could debug or test something which happens only on production. I ask this because there are some bugs that only appear at the intersection of code and data.

So far my strategy is to do the following:

1. Only one person can access the production DB. This person makes a backup copy and stores it, encrypted, on internal storage.

2. Another person gets the backup and runs an anonymizer script on the data. What the anonymizer should do beyond the obvious cleaning of personal data from user accounts is still up for debate. One important (and hard) step is regenerating the UUIDs while keeping foreign key integrity.

At the end, this person creates a new internal DB from the anonymized data.

3. Someone reviews the new DB and marks it as ready to be used.

Then a dev can request access to this fresh copy.
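The UUID step in point 2 can be sketched roughly as below. This is only an illustration, assuming the rows have already been loaded into memory as dicts; `remap_uuids`, `tables` and `key_columns` are hypothetical names, not part of any real tool. The idea is to keep one shared old-to-new mapping, so a foreign key always receives the same replacement as the primary key it points to.

```python
import uuid


def remap_uuids(tables, key_columns):
    """Replace every UUID in the listed columns with a fresh random one.

    tables: dict of table name -> list of row dicts (hypothetical in-memory form).
    key_columns: dict of table name -> column names holding UUIDs
                 (primary keys and foreign keys alike).
    """
    mapping = {}  # old UUID -> new UUID, shared across all tables

    def new_id(old):
        if old is None:
            return None  # a NULL foreign key stays NULL
        if old not in mapping:
            mapping[old] = str(uuid.uuid4())
        return mapping[old]

    remapped = {}
    for name, rows in tables.items():
        cols = key_columns.get(name, [])
        remapped[name] = [
            {col: (new_id(val) if col in cols else val) for col, val in row.items()}
            for row in rows
        ]
    return remapped, mapping
```

Note that the returned mapping links new IDs back to real ones, so it is as sensitive as the original data and should be discarded once the copy is built.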

In some teams I experimented with fully automating this process up to the review step. But then, if the automation has bugs, we suddenly have a live internal DB containing customer data, which is not wanted.

As an alternative, but only for small projects, I once wrote a script which analyzes the DB data and tries to create, from scratch, a similar data structure filled with fake data.
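A very rough sketch of that kind of script, not the one I actually wrote: it assumes rows are available as dicts, and the names `profile_column`, `fake_value` and `synthesize` are made up for illustration. It keeps only a coarse profile per column (type, range, share of NULLs) and generates new rows from that profile.

```python
import random
import string


def profile_column(values):
    """Summarize a column without keeping any individual value."""
    present = [v for v in values if v is not None]
    if not present:
        return {"kind": "null", "null_rate": 1.0}
    null_rate = 1 - len(present) / len(values)
    if all(isinstance(v, int) for v in present):
        return {"kind": "int", "min": min(present), "max": max(present), "null_rate": null_rate}
    if all(isinstance(v, (int, float)) for v in present):
        return {"kind": "float", "min": min(present), "max": max(present), "null_rate": null_rate}
    lengths = [len(str(v)) for v in present]
    return {"kind": "str", "min": min(lengths), "max": max(lengths), "null_rate": null_rate}


def fake_value(profile, rng):
    """Draw one fake value that fits the profile."""
    if profile["kind"] == "null" or rng.random() < profile["null_rate"]:
        return None
    if profile["kind"] == "int":
        return rng.randint(profile["min"], profile["max"])
    if profile["kind"] == "float":
        return rng.uniform(profile["min"], profile["max"])
    length = rng.randint(profile["min"], profile["max"])
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def synthesize(rows, count, seed=0):
    """Generate `count` fake rows with the same columns as `rows`."""
    rng = random.Random(seed)
    columns = list(rows[0].keys()) if rows else []
    profiles = {c: profile_column([r[c] for r in rows]) for c in columns}
    return [{c: fake_value(profiles[c], rng) for c in columns} for _ in range(count)]
```

Even a profile like this can leak something (a minimum or maximum is still a real value), and it ignores correlations between columns, which is often exactly where the bug lives.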

Kalium · on June 18, 2021

> But I am wondering how could we debug or test something which happens only on production? I ask this because there are some bugs that can appear at the intersection of code and data.

I've found that the right strategy depends greatly on the kind of bug and the kind of service:

* If you're implementing a DNS server, you can copy live queries and compare good-to-bad. Then you can notify when something bad crops up. But odds are you aren't implementing a DNS server.

* If you're working on something whose behavior potentially changes under load, you need to find a way to replicate load. Some companies have entire production environments where release candidates are sent without being less secure. Cloudflare has some of these - I implemented one of the early versions.

* If you're dealing with weird logic tied to edge cases in the database, you need to work to identify those. Having live data often makes it only marginally easier.
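For the first case, the comparison can be as simple as the sketch below. It assumes you can call both implementations as plain functions and that their answers can be compared with `==`; `compare_responses`, `known_good` and `candidate` are hypothetical names. Real DNS answers would likely need normalizing first (TTLs, record order).

```python
def compare_responses(queries, known_good, candidate):
    """Run each query through both implementations and collect disagreements."""
    mismatches = []
    for query in queries:
        expected = known_good(query)
        try:
            actual = candidate(query)
        except Exception as exc:
            # A crash in the candidate counts as a mismatch, not a test failure.
            actual = f"error: {exc!r}"
        if actual != expected:
            mismatches.append((query, expected, actual))
    return mismatches
```

Each entry in the result is a `(query, expected, actual)` tuple that you can alert on.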

There are products out there that will synthesize large amounts of production-like data based on the patterns in your database. I've used tonic.ai, and I know there are others. As you say, this is a touchy process with nasty error cases. Having someone else implement it might be desirable.

eru · on June 18, 2021

Use a copy of production (perhaps anonymized) for debugging, and delete the copy afterwards.

> But then if there are bugs suddenly we have a live internal DB with customer data which is not wanted.

Don't let the production-copy touch your normal development environment. Make sure it's deleted in time.
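One way to make the deletion hard to forget, as a minimal sketch: it assumes the copy is a single file (for example a dump) and `temporary_copy` is a made-up helper name. It only removes the file it created itself, not caches or volumes elsewhere, and it does not help if the process is killed outright.

```python
import contextlib
import shutil
import tempfile
from pathlib import Path


@contextlib.contextmanager
def temporary_copy(source):
    """Copy `source` into a private temp directory and delete it on exit."""
    workdir = Path(tempfile.mkdtemp(prefix="debug-copy-"))
    try:
        yield Path(shutil.copy2(source, workdir))
    finally:
        # Runs on normal exit and when the debugging code raises.
        shutil.rmtree(workdir, ignore_errors=True)
```

All debugging then happens inside a `with temporary_copy(path) as copy:` block.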

onion2k · on June 18, 2021

> Use a copy of production (perhaps anonymized) for debugging, and delete the copy afterwards.

This way of debugging assumes a lot of things:

- You're assuming that your anonymization script works. What if some data isn't removed?

- What if the system you're using for debugging sends an email or connects to a webhook or attaches to a remote volume or pushes to a cloud service etc etc? Did your anonymization step really work?

- What if someone has connected the system you're debugging on to a production service by mistake? That would mean you're not even using the anonymized database. You're really on production.

- What if you forget to delete the database afterwards? Or forget to purge a cache? Or you fail to delete a container? Or you do delete the container, but not the container volumes? That production data is still there. Oops.
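To illustrate the first point: you can scan the anonymized copy for leftovers, but only for patterns you thought to look for. A minimal sketch, assuming a SQLite copy with ordinary rowid tables and treating email addresses as the only leak of interest; `find_email_like_values` is a hypothetical helper.

```python
import re
import sqlite3

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_email_like_values(conn):
    """Return (table, column, rowid) for every text value that looks like an email."""
    hits = []
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        )
    ]
    for table in tables:
        cursor = conn.execute(f'SELECT rowid, * FROM "{table}"')
        columns = [d[0] for d in cursor.description]
        for row in cursor:
            # Skip the leading rowid; check every remaining column.
            for column, value in zip(columns[1:], row[1:]):
                if isinstance(value, str) and EMAIL_RE.search(value):
                    hits.append((table, column, row[0]))
    return hits
```

An empty result says nothing about names, addresses or free-text notes, so a clean scan narrows the risk rather than removing it.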

It's much simpler to just not use production data for debugging. It makes debugging harder, which is annoying, but you can't go wrong and accidentally leak your users' data. I'd prefer to just spend more time on debugging than have my users' data be put at risk.

eru · on June 18, 2021

Yes, obviously you'd try to debug as much as possible without touching production data.

Of course, different businesses also have different requirements on how sensitive production data is.

flukus · on June 18, 2021

> I've never seen any production data for the app I work on.

I agree with you on the rest, at least in a perfect world, but not being allowed to look at production data? In the jobs I've had recently I wouldn't even be able to hypothesize what the problem is without looking at production data and production logs. Some of the issues wouldn't even have been reported if I wasn't checking the logs.

How do you bridge the gap from problem to replication and/or something actionable? Do you have someone knowledgeable enough in a role where they can feed you this information?

somebodythere · on June 18, 2021

I guess the poster meant production customer data. Production logs and metrics should be easy to access, but customer data should be highly privileged and definitely not present in logs. At my old employer, viewing the production customer database required a customer support escalation.

onion2k · on June 18, 2021

> The rest I agree with you, at least in a perfect world, but not allowed to look at production data?

For some context, the app is all about visualising corporate and legal structures at global law firms, so it's all very private and very secure. Never having access to production data to replicate issues certainly makes debugging a bit harder, but it's never been so complex that we've not been able to figure out what's happened. I've learned a lot about understanding how an application works, how data flows through it, and intuitively zeroing in on a likely problem area while I've worked on it.

eru · on June 18, 2021

It's a lot of effort, but you can make a system like this work.

Google does, for example.

(SREs can still look at some metadata of the running system, like load etc.

Logs themselves have to be carefully anonymized.

The data itself is almost completely off-limits.)

KronisLV · on June 18, 2021

I'm also in a similar situation, but in my case, I cannot even get access to the application logs unless I explicitly ask for them (typically as a part of solving a certain problem, given a time of occurrence), same for APM data.

While there's certainly something good to be said about the data security in such instances, it makes catching errors and fixing them absolute hell, especially if the clients are unaware that there are occasional exceptions appearing in the logs, or they send the wrong logs (in the case of old-fashioned file-based logging with unclear logging strategies).

Daily ETL with data anonymization/pseudonymization from prod into the test environments would be really good to have, yet I haven't really seen any companies adopt that. The closest I've seen were situations where the production data would be manually exported, scripts run against it, and the result given to the developers quarterly at best.

That concludes my tiny rant that's vaguely related to the topic (DB data vs log data), though that could also encourage discussion about which data is available to other developers and how they approach it (e.g. trying to never log things like monetary amounts or even personal data to make the logs harmless, and the tradeoffs of that, like the logs becoming less useful). Heck, maybe someone out there has automated the things I mentioned above.
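For the logging side, a minimal sketch of what "never log monetary amounts or personal data" could look like with Python's standard `logging` module. The `RedactingFilter` class and both regexes are illustrative assumptions; a real system would need patterns tuned to its own data.

```python
import logging
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
AMOUNT_RE = re.compile(r"[$€£]\s?\d[\d,]*(?:\.\d+)?")


class RedactingFilter(logging.Filter):
    """Mask email addresses and currency amounts before a record is emitted."""

    def filter(self, record):
        # getMessage() merges the arguments into the message first,
        # so values passed as log arguments are redacted too.
        message = record.getMessage()
        message = EMAIL_RE.sub("[email]", message)
        message = AMOUNT_RE.sub("[amount]", message)
        record.msg = message
        record.args = ()
        return True  # keep the record, only its text changes
```

Attaching the filter to a handler with `handler.addFilter(RedactingFilter())`, rather than to a single logger, covers every record that handler writes. The tradeoff is the one mentioned above: the redacted logs are safer to hand out but tell you less.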