I do machine learning work in healthcare, and work for a HIPAA covered entity. Permissions and data-access rules often get applied in an unnecessarily strict fashion to data scientists in these environments, frequently because of a lack of understanding from engineering managers who have been brought in from a non-regulated environment (e.g., hiring a Salesforce engineering manager into a healthcare system so they can "disrupt" or "solve" healthcare -- they hear "regulation" and immediately clamp down on everything).
HIPAA allows use of clinical data for treatment, payment, and operations. You can also get around consent issues if the data is properly deidentified. If you have a data scientist who is working to further treatment, payment, or operations (i.e., isn't working on purely marketing uses, selling the data, or doing "real" research), then they are allowed to use the data, assuming it's the minimum necessary for their job. For training machine learning models that support operations, "minimum necessary" is probably a lot of data. And, obviously, the production pipelines and training/experimentation/development would need access to the same amount of data if you want to train and deploy models.
Data scientists are also likely to be the first to notice problems with how your product is working, often before the data engineering team. At my company, I've found numerous bugs in our data engineering pipelines and production code because I saw anomalies in the data and went digging through the data warehouse, replicas of the production databases, and our actual product. You probably want to support and encourage that kind of sleuthing - but each organization is different, so maybe you have better QA that's more attuned to data issues.
My opinion, from having done this for over a decade, is that the question shouldn't be about how much access you give your data scientists. They should have access to nearly all of the data that's within their domain, assuming they're legally entitled to it. The question you should be solving for is what restrictions should be placed on how they access and process that data: e.g., have EC2 instances and centralized Jupyter notebooks available for them to download and process data, and prohibit storing data on a laptop.
I'd just like to point out that "the minimum necessary for their job" is the reason many engineering managers apply unnecessarily strict rules.
It's very difficult to build rules and policies that allow broad access while maintaining minimum necessary. Some project may be completely justified in accessing "all" (waves hands) data at its conception but slowly morph to focus on only a few key identifiers while still processing "all" data.
Totally, and to extend that further, one employee may work on multiple projects for which the "minimum necessary" is different. Part of my job involves patient matching and reporting patient-level data to partners we have a BAA with. That means I need access to patient names and addresses. However, if I'm working on training some ML model to predict diabetic progression, it's not necessary for me to pull the names of patients.
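To make the per-project idea concrete, here is a minimal sketch of a column allowlist keyed by project. Everything in it is an assumption for illustration: the project names, the column names, the `project_view` helper, and the made-up record are hypothetical, and a real setup would more likely enforce this in the warehouse than in application code.

```python
# Hypothetical per-project column allowlists (illustrative names only).
PROJECT_COLUMNS = {
    "patient_matching": {"patient_id", "name", "address"},
    "diabetes_progression": {"patient_id", "age", "hba1c", "bmi"},
}


def project_view(rows, project):
    """Return copies of the rows restricted to the columns allowed for a project."""
    allowed = PROJECT_COLUMNS[project]  # raises KeyError for an unknown project
    return [{k: v for k, v in row.items() if k in allowed} for row in rows]


# A made-up record, not real patient data.
rows = [
    {
        "patient_id": 1,
        "name": "Jane Doe",
        "address": "1 Main St",
        "age": 54,
        "hba1c": 7.2,
        "bmi": 31.0,
    }
]

matching_rows = project_view(rows, "patient_matching")
modeling_rows = project_view(rows, "diabetes_progression")
```

The same person gets names and addresses under the matching project and only clinical features under the modeling project, which is the kind of split described above.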
I think there's an incorrect assumption in here that there exists a technical solution which entirely solves this problem; that we just need to figure out what the right set of rules are, or get the right column-level and row-level security policies in place and we're all set. It's necessary but not sufficient to have those kinds of safeguards in place. You also need to trust somebody in the organization, and you need to give those somebodies training and support to do the right thing.
In my case, I need access to all of the (clinical) data within the organization. I don't really care how that end is achieved: with one account that has every permission, with multiple accounts that are used for different purposes, or whatever. Ultimately, it's in the interest of the organization to make sure that I have the access I need to successfully do my job.
Not the OP, but I'd love to chat more about your experiences in this space. I don't see an email in your profile; feel free to drop me a note using the one in my profile.