A GitHub repository was public-viewable (adafruit.com)
179 points by zdw on March 4, 2022 | 128 comments



Office Depot Salesforce source code still up. Complete with some integration private keys. Bug bounty says it’s not a bug…

A Salesforce employee recently published the source of one of their products. I reported it via email, since I’ve been removed from their private HackerOne programme, presumably due to inactivity. The sec team just said it was “test data”. Wish I’d made a copy, since it’s gone now, and bullshit responses like this just make me want to publish everything everywhere.


HackerOne is a joke, anyway. Organizations will just respond with "it's a feature, not a bug" to get out of any bounty. I once reported that you could log on to certain PP accounts with just a username and CC number, bypassing configured 2FA and allowing you to wipe the 2FA. Guess the response. Lo and behold, it's fixed now.


Hackerone should be an escrow that can arbitrate and overrule insipid behavior like that.

But they won’t, will they?

Maybe someone else will fill that void


Ok, I'm giving some credit to the company above. Someone contacted me hours after this post and took the repo down minutes after I responded (though they say they found it in parallel). Root cause: a third party was doing some POC and, I'm guessing, misconfigured CI.


Why are you committing PII to git in the first place?


Lots of data scientists, ex-statisticians, analysts, etc. are pressured with "you should use git". It's generally a noble thing, and the right one -- these folks often otherwise don't use git enough for their code.

It does, however, come with this risk: they let data, Jupyter notebook output, etc. get committed.

I'm not saying what happened here is excusable -- it's not, it's ridiculously bad. But I can see a few ways how it'd happen.

To me the equal or better question is why does this person have PII in the first place? It's not necessary to do analysis. Someone should have masked or removed it before this person ever got their hands on it. There's no way a PII dataset is needed for training.


> these folks often otherwise don't use git enough for their code.

Or at all. This is actually a fairly common problem among certain types of research scientists who think software engineering is "peasant work".

I remember a friend of mine working at a research organization where several researchers lost months of work due to an unscheduled server reboot. Turns out when you log into ephemeral containers and pretend they are VMs, things go poof.


I read a post on Reddit's legal advice subreddit from a data scientist whose apartment had been robbed, and part of it said they'd lost all their code and the projects they were currently working on.

As horrible as the robbery was, I was screaming inside "WHAT ABOUT VERSION CONTROL?!??!"


The worst story I read was of someone working on his PhD thesis for 2 years, who then left his laptop on a bus and everything was gone, because in 2 years he hadn't stored the files anywhere else. I personally met a teacher who stored the only copy of her students' final exam submissions on her everyday thumb drive, without backup.

For many people technology really is magic.


That’s a really sad story.

If he doesn’t come from a programming background, version control can seem foreign.

But like ya know… Dropbox. I’d be more concerned with my hard drive corrupting than anything else


I worked at a company which had a bunch of VMs for their data science teams. They replaced them with containers, then suddenly panicked when they realized that the containers would lose state when the admins of the container hosts applied security patches and rebooted or removed the outdated nodes. (The old VMs were rarely, if ever, patched after the initial boot, which became a compliance problem.) It turned into a monstrous issue where the data scientists wanted 6 months of continuity, the data platform SREs wanted 30 days and the container host admins wanted 15 minutes.

I think they might still be deadlocked to this day.


I love hearing these kinds of stories just to remind myself how not dysfunctional my own company actually is.


I know plenty of companies that rarely apply any patches or do OS upgrades. I'm talking servers with 4 to 5 year uptimes, still running Ubuntu 16.04 or worse. They don't want to reboot because some dude who left 4 years ago set everything up, often in a non-standard way, and nobody is quite sure how it works. They certainly don't want to be blamed when production goes down. I did a contract job for a very large company that was running a 6 year old distro on a "critical" server and was afraid to do anything beyond changing a password. It's easier to have an outside person do it, so they can blame them when it gets screwed up.


Wouldn’t it be possible to automatically back up the data before a reboot?

Docker offers persistent volumes, so my 30 second solution would be that
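A minimal sketch of that 30-second solution, assuming a hypothetical training image called my-ds-image:

  # Create a named volume; it survives container restarts and host reboots.
  docker volume create ds-workspace
  # Mount it into the container; anything written under /work persists.
  docker run -it -v ds-workspace:/work my-ds-image

Checkpoints written under /work then outlive any reboot of the container itself.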


That assumes a willingness to learn enough about Docker to understand how you configure your container to do what you want. That can, in some organizations, be too much to ask of researchers. That's the core of the problem.


I believe that was asked for by the SREs -- e.g. Tensorflow supports checkpointing to disk and restoring progress -- but the ML training software used by the data scientists did not have this feature.


Correct. This is a common failure mode... The easiest way to build a training set for a system is to take some live data, salami slice it, and store it in a static location.

... that's also a short path to dropping a user's PII in the clear.


Use git but don't use github.


git alone doesn't protect you against accidentally deleting your code directory and whatnot, and GitHub is the most accessible remote git offering right now.


Isn't this what backups are for?


Your operating system might. But it doesn't protect against your machine dying. That's a different problem with different solutions. You could self host on a different machine...

Is GitHub the most accessible? It is the most popular and best for discovery, but you could argue that GitLab or Bitbucket is more accessible.


I was thinking: what about just some other PC where you have ssh access? That might not be accessible enough, though. So the next option could be GitLab; they offer free private repos
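Something like this, for example (hostname and paths are hypothetical):

  # On the other machine: create a bare repo to push to.
  ssh you@backup-host 'git init --bare repos/project.git'
  # Locally: add it as a second remote and mirror everything there.
  git remote add backup you@backup-host:repos/project.git
  git push backup --all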


> It does, however, come with this risk: they let data, Jupyter notebook output, etc. get committed.

You make that sound like it's generically a bad thing. Having the data you used and tested with is good in many ways. Having it in the same git branch and repo as everything else is suboptimal but usually a lot better than not safekeeping it at all.


I would say not just why are you committing it to git, but why are you using it for "employee training"?

You should not be using real data with PII for training exercises.

Hopefully that's what they mean by:

> We are additionally putting in place more protocols and access controls to avoid any possible future data exposure and limiting access for employee training use.


They're not, they're 'just' letting employees use real customer PII when training.

They screwed up by allowing that. The employee screwed up by committing it to git and then pushing to a public repo.

The employee wouldn't have been able to do that if they'd enforced using fake customer data for testing/training.


The unfortunate thing about startups is that a lot of them are this fast and loose with PII. Incentives are low to do better, and the tools that make it easy to do better cost money.

This isn't to excuse Adafruit; it's to remind everyone that the hot young startup you just signed up for is probably keeping your signup information in a mysql database that everyone in the company has access to right now with a plaintext password thumb-tacked to the one office wall they have.


Yep. And there’s a good chance the company’s production database is on employee laptops so they can test features with real data sets.

When a stranger has their laptop stolen on a bus, who knows what data was on it. Fingers crossed most people have FileVault turned on these days.


> The unfortunate thing about startups is that a lot of them are this fast and loose with PII

Unless it is a HIPAA-compliant startup. Then scrubbing PII is priority number 1.


I know of business units IN government who explicitly ignore compliance. They sign off claiming they "accept the risk". You risk your job if you push compliance too hard with them.


I thankfully have never had the displeasure of working with a business unit that explicitly ignored compliance with regulations.

I know they exist. I just haven't personally had that experience, thankfully.


When you try to tell a business unit they can't use live/prod data, they whine to their director, who complains to their deputy minister, at which point the hailstorm of shit turns around and starts falling on those of us who are "blockers". Don't get me started on "the business signs off on accepting the risk".


There's nothing particularly wrong with using git for PII, as long as you use it internally on private machines or a private LAN.

But it's a bad idea to put PII on github which is apparently what happened.


I think that even just on private machines, this would make some types of legal compliance needlessly difficult. If you ever need to delete that data, for example to comply with a corporate retention policy or in response to a request from an individual in a jurisdiction that requires you to allow it, you would need not just to rewrite history but also to ensure that history is rewritten in every clone that any employee has ever made of that repository; there might not even be a record of which clones exist.


The clone issue is real, but it's no more complicated than it would be without git: Keep track of who has a copy of this data and where. Whatever system you have in place to solve that problem can also be used to manage the cloning process.

As for the individual repos, here's the procedure:

  cd <repo>
  rm -rf .git/  # DON'T TRY THIS AT HOME unless you know what you're doing.
  git init      # start over with no history at all
  git add .
  git commit -m "Brand new repo"
Do not do the above unless you completely understand the ramifications. But those ramifications might be precisely the ones you want for legal compliance.


Please never do this -- not least of which because each person who does it will end up with now completely divergent histories and repos from each other.

There are various tools to properly filter history, e.g. `git-filter-repo`, but the short answer is as the grandparent commenter said, things get hairy, you need to rewrite history and coordinate. It's not a situation you should hope to get yourself into...
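For the curious, a minimal git-filter-repo sketch (the filename is hypothetical):

  # Purge every version of a PII file from all history.
  git filter-repo --invert-paths --path customers.csv
  # Note: git-filter-repo removes the 'origin' remote as a safety
  # measure; re-add it, force-push, and have everyone re-clone.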


Might as well not use git then


Using git is no big deal, but storing the repo on Github means sharing the information with a third party (Microsoft) even if the repo is kept "private".


What mechanisms are in place to prevent an employee from doing so? We have so many people in the industry now that it's hard to keep up with this stuff. To be fair, Github has all the ingredients necessary to be a giant, gaping security liability. It's a "secure" platform built around software designed for sharing. That sentence alone has a ton of baggage in it. I'm not even trying to diss Github either.


Education/training is really the only way.


Oof, committing PII to Git. So much for that "right to deletion" unless you want to rewrite your entire git history.


My understanding is the right to deletion doesn't so much affect things such as logs, internal training data, etc.

Also, since that's a web store and the data comes from customer orders, they would have no right to deletion, as that information has to be retained for other legal requirements.


From a GDPR perspective, it also covers logs, internal training data etc. If a user requires to be deleted, you have to delete everything, there can be no trace of their existence.


> If a user requires to be deleted, you have to delete everything, there can be no trace of their existence.

False. Transaction data must be kept for legal reasons and deletion requests do not apply to it.


Yep, that might be true for certain industries.

But logs don't count as the transactional data that needs to be kept for legal reasons.


That depends entirely on what they're logging and why.


There are always time limits that apply, which means you need to have a process to delete the relevant log entries (or the whole log) eventually.



From what I recall of the legal trainings I've had, under GDPR logs aren't covered if you account for the technical requirements, costs, etc.


This also depends on if the data was sensitive and your log usage/retention policy, i.e. you can't just say "it's logs" to be able to keep things - you need to show you're only using them as logs.

Addresses are sensitive information and whatever was happening sounds like multi-purpose-consent-necessary data processing and it was years old.


There's a reason many of us across the ocean look at that part of the GDPR like someone decided you could put the feathers back in the pillowcase if you just made it illegal for the feathers to be outside the pillowcase.


Are they based in a jurisdiction that has right to deletion laws?


> Why aren’t we sending an email to every user?

> We evaluated the risk and consulted with our privacy lawyers and legal experts, and took the approach that we thought appropriately mitigated any issues while being open and transparent and did not believe emailing directly was helpful in this case.

Seems like a pretty weak justification. "Our lawyers said we don't have to notify users."


Yep. Lawyers will always take the road of least liability. I have no doubt the lawyer said to not even write about it publicly.

I stopped trusting Adafruit when they sent me broken boards and never replaced them. Their software side is very startupy; they love to break things, especially if it was working perfectly fine before. I have a USB to UART/SPI/I2C Adafruit device that won't operate in any of those modes without changing Windows drivers between them, and on top of that it requires CircuitPython on the local side, when it used to just be C.

CircuitPython was a huge mistake on their part. They didn't have the people to do C correctly (look at their nRF libraries), much less C and CircuitPython versions. Which ends up being a compatibility nightmare.

So I just buy a 5 pack of ESP32s or other dev boards off Amazon and am done with it. If they're broken, I get them for free. Sad to see that Amazon breaks fewer consumer laws than Adafruit in this case.


> I stopped trusting Adafruit when they sent me broken boards and never replaced them.

They refunded me a $100 purchase, no return required, when I was sent defective product. I've always found their customer service to be friendly.

> Their software side is very startupy; they love to break things, especially if it was working perfectly fine before.

This is a very odd sentence to read in the context of embedded development. Adafruit is a hardware company with makers as a target market. I find Adafruit's software to be a good bit better than most vendor software I've worked with. And when there are bugs, most of the time the bug ends up being someone else's fault. cough Espressif cough.


> Yep. Lawyers will always take the road of least liability. I have no doubt the lawyer said to not even write about it publicly.

That's what I think a lot of business owners and management don't quite understand about lawyers. If you listened to everything your lawyers told you, you would never do anything interesting (i.e. anything risky). You don't have to follow everything a lawyer says, because their incentive is to limit your risk, not necessarily to ensure your long-term success, and those are different things. Lawyers may tell you to do wrong by your customers if it means a micron less liability.


The USB to uart/i2c/spi issue is due to them using counterfeit chips. https://hackaday.com/2016/02/01/ftdi-drivers-break-fake-chip...


> No surprise really since most adafruit stuff is rebranded Ali express merchandise.

I don't know that I've ever seen rebranded Aliexpress merch on Adafruit. Boards and such are primarily their own design. Proof being ... it's all open source and the design files are on Github. There's occasionally other vendor's products there, like Pimoroni. Again, not rebranded Aliexpress. Besides boards, there's tools and such, but those have all been on brand from what I've seen. Adafruit is how I found out about the fantastic Japanese ENGINEER brand. They generally put a lot of effort into combing through suppliers and finding good quality stuff.


I edited my comment. Those are fair comments... I meant more the non-board stuff such as wires/interconnects, etc. I actually saw a thread once where people were able to correlate bulk purchases on Alibaba/AliExpress (since they publish them with redacted usernames) with Adafruit's stock of certain supplies.


It's entirely possible the passive components they sell are repackaged AliExpress stuff. But I'd guess it's because the passives are the same quality a hobbyist would get from buying on Amazon. Their "secret sauce" is the boards and software they design.


Actually, Aliexpress vendors are rebranding their merchandise. Adafruit publishes the board files, then someone removes their logo and sells it for $1 less on Amazon. If you have any doubt that Adafruit is doing their own designs, Lady Ada often streams herself designing new boards.

In general, they have reliable supply chains and manufacture everything in New York City. I have never heard of them selling counterfeit FTDI chips. I have two of their FT232H modules purchased directly through their store and they're legit.


Amazon doesn't need to turn a profit, this is both good and bad.

It's bad because smaller companies can't compete. And many of these smaller companies do more for the community than Amazon does. Good luck calling Amazon for help with picking a sensor.

It's good because if you return stuff you'll save money with Amazon. The times I buy things outside of Amazon, I'm amazed to find I can't return them, or the penalty for doing so is like 40% of the item cost.


I hope that they don't have any EU customers, since leaking a bunch of names, email addresses and shipping addresses sounds pretty serious, to the degree where they might need to alert a bunch of EU member state authorities, and customers.


Did you ask your lawyers, "despite the advice that all users don't have to be notified, what if we did anyway?"


Lawyers do not (necessarily) care about that: it just increases risk, they advise against it, and that’s that.

Taking this advice, and weighing it against other factors (goodwill with customers, for example), is what a good CEO is able to navigate.


They've just ensured they're never going to see a dollar from me.

I strongly considered purchasing my Raspberry Pis from them, but I was put off by the high shipping cost.


What's a good alternative in the US? I've been wanting to get into electronics and microcontrollers as a hobby. I have an old Arduino Uno that I got for free from some conference, but I have none of the other stuff I would need to actually make something. I thought about getting a Raspberry Pi since I found the Arduino to have very little RAM.


I'm in a pretty similar place a few months along (programmer getting into hobby electronics).

I mostly buy from DigiKey. They have a very wide selection of professional stuff, and they also resell Adafruit and other similar hobby stuff (I think Adafruit is a great company that makes great products, personally).

You're not going to find a raspberry pi anywhere except from a scalper (~2x MSRP) for at least a few months, though. See https://rpilocator.com/?country=US .

I started off with rpi, but quickly realized I preferred not having a whole OS to manage. All my projects are now using Adafruit boards, mostly the Feather nRF52840 ($20, Bluetooth Low Energy, great Rust or CircuitPython support).


I like Sparkfun, fast shipping and all.


High shipping cost? They pass along the cost of your choice of USPS, FedEx, UPS. You're not going to find anything better unless you buy on Amazon, which has a much bigger counterfeit problem.


Not everybody is in the US (in fact the overwhelming majority of people are not). Shipping costs and options vary around the world.


That's a good point. I didn't imagine anyone would buy from Adafruit internationally because that'd be crazy. They have distributors around the world: https://www.adafruit.com/distributors


Normally you would email users notifying them of a hack so they can change passwords. Since there was no password breach, it doesn't make a lot of sense to email people; there's nothing for them to do.


If you are a customer who might start receiving death threats in the mail because someone leaked your email address and home address at the same time, there might be things that they need to do.


Classic corporate use of passive voice to deflect responsibility.


Honestly, just the title of this post has caused me to lose some trust in Adafruit as a company. It's so completely passive and doesn't explain what actually happened. It feels like it was intentionally written to avoid attracting attention, but then attracts attention because of how vague and weird it is.



What are they supposed to do, say "Jeff the intern fucked up badly, and Christine let them do it"?

What is this criticism supposed to mean?


It's not about singling out individuals, it's about Adafruit as a company taking responsibility for their mistakes. The phrasing in the post makes it sound like the breach was something that happened to them, vs being the direct result of the company's actions.


The issue is that the information was made public; I don't know why people are upset about it being stored in git. If I store PII in a database and then that database gets leaked, am I dumb for storing my PII in a database, or for letting it get leaked? Seems kinda obvious to me.


The idea is that git is for source code/configs. Not data. Data gets stored in databases. Secrets/Tokens are stored in secret stores/vaults. Blobs like PDFs/etc should be stored in blob storage.

If a diff is not useful, don't store it in git. Doing a code review on PII is not useful.

You also shouldn't vendor packages into a git repo if they can be downloaded immutably in CI/locally. It takes up a lot of space, since every commit is a snapshot plus the new code; commits are not diffs.
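As a small guard-rail for the data case (the patterns here are just illustrative), fence off data directories before anything lands in history:

  # Keep data out of commits entirely.
  printf '%s\n' 'data/' '*.csv' '.ipynb_checkpoints/' >> .gitignore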


I disagree. They were using it as sample data, when they should have used synthetic sample data. It's totally sane to check-in a JSON (or similar) file with a few hundred [edit: non-confidential] samples so that you can run integration tests without needing to set anything else up.

Similarly, I keep small pdf manuals in git. I add them in their own commit, which doesn't have a useful diff. In exchange, they're always there and I don't need to spin up some special one-off system.


Fixture data (for testing) is fine. Sometimes a few lines is the minimum viable to test, sometimes a hundred or so lines of JSON. As long as the data is being optimized for testing and you aren't just storing a GB because it's easier.

I was replying to a comment asking why PII data shouldn't be stored in git vs a database. Just tried to give a general rundown of how to think about the tools.

Totally agreed that the data should have been synthetic/scrubbed and that a small sample set (large enough to validate) would be ok.

Sometimes a PDF or a few pictures are necessary (or just much simpler). I get that. I have seen some repos with an unreasonable amount of PDFs/Docs/pictures/etc.. and that's when a script that copies them into a gitignored directory from someplace (Dropbox/S3/etc..) is a better fit.
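A sketch of that pattern, with hypothetical bucket and directory names:

  # fetch_assets.sh: pull heavy docs into a directory git ignores.
  mkdir -p docs            # docs/ is listed in .gitignore
  aws s3 sync s3://team-assets/manuals docs/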


People are upset that it is being stored in git because it shows how moronically negligent this company is about the security of its customer data. Storing any secrets in version control is one of the stupidest things you can do, and the fact that this was allowed to happen reflects poorly on the entire company.


If you really, really want to manage PII in a private git repo, you can probably do it -- in approximately the same way that you can probably make API-first web services in FORTRAN if you really, really want to. The fact remains that your choice of tooling hurts you.

Putting PII into Github (even if the repo is private) is catastrophic. You just made Github into a third-party data processor by accident. Good luck explaining that from a CCPA/GDPR perspective.


Adafruit's response to this is appalling. Storing secrets in version control is one of the stupidest things you can do, and reflects poorly on the entire company. But the fact that they are choosing to not inform their customers that their PII was leaked due to their negligence is criminal (well, in certain jurisdictions).

Shame on Adafruit. Hope they get sued out of existence.


> Shame on Adafruit. Hope they get sued out of existence.

They made a mistake. The response to the mistake is NOT GOOD. However, destroying a company that also does actually good things is not the answer here, either.

How should they fix this? I think they should start by writing another blog post apologizing for how bad the first one was, for starters. Emails to affected people next... But, let's be honest. How much better is _that_? Can't unleak the data. You can only be offered subscriptions to data protection services _sooo many times_. There's not really a monetary value as damages here...


Storing PII on GitHub is not wise. It is stored with the history and can be restored even if deleted. As for the repo being opened up publicly, a quick win is to trigger alerts on such changes.
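One way to get that quick win, sketched against the public GitHub API (the org name is hypothetical): poll on a schedule and alert if anything unexpected shows up.

  # List the org's public repos; pipe into your alerting of choice.
  curl -s 'https://api.github.com/orgs/example-org/repos?type=public' \
    | jq -r '.[].full_name'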


PII should never touch GitHub, ever. Good job you're not in the EU with GDPR, as this process alone implies a serious lack of security due diligence.

"Adafruit team began the forensic process" - Checking audit logs is hardly a forensic process and the rest of the speel about "privacy lawyers and legal experts" feels like a poor effort to regain some trust.


I'm the OP on this comment. The fact that I've been 50/50 voted for what is quite frankly common sense is disappointing. Just ask yourself a simple question: would you want your personal information uploaded to GitHub so someone could learn data analysis, and would you be happy about that?


I got downvoted for saying you should never store secrets in version control. Insane. I think some sophomores from /r/programmerhumor may be visiting.


I hope it's just people trying to cover for Adafruit... which is fair enough, and I get that. They have done way more good than this little hiccup, but it's so important not to play around with real PII data or expose it or secrets to version control.


PD and secrets are not the same thing. You can be following all technical best practices regarding secret management and still fuck this one up.

You got downvoted because storing encrypted secrets is fine.


> storing encrypted secrets is fine

Only until/unless the encryption is broken. I wouldn’t store long-lived secrets even encrypted in a public place.


"You got downvoted because storing encrypted secrets is fine."

Storing where?


... in version control. The response is intended to be read in context of the sentences above it.


The security and integrity of data shouldn't rely on the data itself as proof... imagine trying to store a checksum within the data it represents, for instance.


How is checking audit logs not a forensic process? Surely that's what audit logs are for?


It's not a "process", it's looking at logs to see if someone accessed X in a certain time frame. The words "forensic process" are used here to make it appear there is something more involved going on. If, say, there was one access to the data, how will this "forensic process" play out in tracing who accessed the data and whom they passed it on to? They don't know, and they have no "forensic process" to find out.


Exactly, without logs what else exists? Git itself IS an auditing tool.


Why shouldn't it touch GitHub? If you think it is not secure, then your company's source code shouldn't touch it either, should it? Or is there a different reason?


The laws account for limited retention time, and can be retroactively changed to make it so that companies have to get rid of historical data. Versioned history is supposed to be immutable and changing it is a pain.


Trade secret and PII are two different things. Trade secret is something you can risk. PII is not something most governments will allow you to risk.


You never know where that repo is going to go. It could be private today and public tomorrow with a totally different license. Git should be thought of as sticky in the sense that it's intentionally designed to preserve a history of everything for all time in the most efficient way possible. Anyone who's ever had to scrub an API key they accidentally pushed or optimize a monorepo can attest to this.


If it’s your own repo presumably you know exactly where that repo is going to go. The vast majority of private repos are never going to go public. I have many repos for projects that there is no chance I would ever make public, they’re not that kind of project. And even if I really did want them to be public, I would make a new repo for them instead.


Yes, but they're your repos, not an organization's. I'm certainly not going to question your self-knowledge. You may know that, and if you work with one or a couple other people closely they might know that too. The challenge is when you grow beyond that or don't always have the chance to communicate about these types of things, especially when they seem very obvious but might not be.


It's personal data. Would you want your personal data uploaded to GitHub -- in this instance, so someone can learn data analysis?


Even in a private repo, putting it on GH means sharing it with the third-party Microsoft, which is illegal in many countries.


Practically speaking, I don't think that's true. Many EU companies use Microsoft's cloud offerings.


If PII transfer to MS servers is part of that, there are certain situations and necessary steps to take to make that not a violation. Moving from one provider to another cannot be done legally without properly informing the individuals involved and (depending on the nature of the data and its purpose and use) getting the right explicit consent.

Taking an extract including PII (names, usernames, IP addresses, or email addresses for example) from your customer prod db and dumping it in a csv in a private repo on GH is most likely a violation unless you have prior explicit consent for that very purpose and use.

This is true even if it's "pseudo-anonymized" in a way that the original PII can still be deduced by combination with other datasets.

Finally, I wouldn't be surprised if many of those companies are operating illegally. Drinking and driving doesn't become legal just because a threshold of people start doing it. There is a lot of ignorance (willful or not) among EU businesses even today.


I thought you needed consent for usage, not for technical details? As in "we use your purchase history to generate recommendations", not "we store your purchase history on these systems and run these algorithms on it". Are you arguing that if my colo provider burned down I'd need to get explicit informed consent from every user before restoring a db backup somewhere else?


> Are you arguing that if my colo provider burned down I'd need to get explicit informed consent from every user before restore a db backup somewhere else?

AIUI, it could swing either way depending on several factors, such as: the format and usage of the data; how and where the data is transferred, processed, stored and exposed; what access and role the colo provider has (are you purely renting a dedicated server in a DC with FDE that you unlock remotely with an HSM or is the data processed by one of their managed services?); how the consent you already acquired was formulated.

If the colo provider has an outsourced support engineer in Asia looking at logs/coredump or temporarily transferring a backup where the PII appears, that would constitute a transfer, for example, and full compliance needs to be guaranteed throughout.

It's years since I considered myself to have a clear and deep understanding of it and it's gotten a bit fuzzy since, so someone else might chime in with a more clear answer.


It would be ok to store it on GitHub if it were encrypted, but storing encrypted data in git is not very efficient.


I wonder how many users. They don’t say, just use the word ‘some’, which makes me wonder if it’s quite a few.


Does Github delete blobs when a repository is deleted? I doubt it.


Your doubt is justified. Here is how repos can be restored: https://docs.github.com/en/repositories/creating-and-managin...


For posterity, they updated the post on Mar 7 and will be emailing anyone affected.

"Update March 7, 2022: We appreciate the feedback from the community and our customers, and will be emailing users as part of this disclosure. We apologize for not doing that at the same time as the post/disclosure on Friday, March 4, 2022."


I've been in data analysis for a few years -- grew from scientific research in SAS with zero VCS, into programming and data analysis in R and STATA, still no VCS for a while... Eventually I was allowed to use GitHub after I left my restrictive hospital research space. No one ever sat me down and trained me on Git, though. I just heard repeatedly: if you're not using Git, you're missing out on a huge set of important software-creation tools; you shouldn't be coding without VCS.

Why did this happen? Imagine it's this data analyst's first year out of school, something like a crappy statistics program that only teaches SAS or base-R with no VCS. This young padawan needs to practice analysis in preparation for a work project. They grab some data and stash it in a folder where Git is tracking. They do their analysis practice, it's Friday afternoon, they get lazy, they don't look at what they're committing, and they click buttons in the GUI without much thought.

It's incompetence and laziness, not malice. These tools that allow us to share widely, not just GH but also social media broadly: these tools have great power that should be used with greater training, responsibility, and care.
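One cheap piece of that care, as a sketch: a pre-commit hook that refuses likely data files (the extension list is purely illustrative).

  #!/bin/sh
  # .git/hooks/pre-commit -- must be executable (chmod +x).
  if git diff --cached --name-only | grep -qE '\.(csv|xlsx|parquet)$'; then
    echo 'Refusing to commit what looks like a data file.' >&2
    exit 1
  fi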


I don't think it's incompetence so much as professional ignorance (which might be incompetence, though I don't think it is). Source control is for your source code, not for your data. Your data belongs in a database. Git is not a database, or at least shouldn't be treated like one.

Sure, it's easy to call it lazy for a data set to be in some local directory and accidentally get committed. Happens to all of us. The bigger problem is: why is that data sitting on your file system in a directory when it should be in some database, preferably not locally?

> these tools have great power that should be used with greater training, responsibility, and care.

This screams more and more that the tools are bad. Git is famously hard to use and even harder for non-plaintext data. Databases are annoying to initialize and get access to without a developer who's done it before. The tools suck, they can be better, and require less training. It's not wrong to be lazy - it's wrong to make the lazy path dangerous.


> Git is not a database

Um, yeah, it is - by most reasonable definitions of the word "database".

No doubt there are a few unusually-narrow definitions of "database" out there that would exclude Git, but I'm pretty certain they're in the minority.


I'm not talking about pedantry, but pragmatism. Git is not designed to be used as a conventional database, and should not be.


You should at least consider encrypting the PII in git, https://github.com/AGWA/git-crypt -- it is not rocket science.
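A rough git-crypt flow, assuming a data/ directory and your own GPG key:

  git-crypt init
  # Mark which paths get transparently encrypted on commit.
  echo 'data/** filter=git-crypt diff=git-crypt' >> .gitattributes
  # Grant yourself (and teammates) the ability to unlock.
  git-crypt add-gpg-user YOUR_GPG_KEY_ID
  git add .gitattributes && git commit -m 'Encrypt data/ with git-crypt'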


You should consider never ever ever ever ever ever storing secrets in version control. Period.


Storing encrypted secrets in SCM is pretty standard practice. Usually I recommend separate repos but it's not required especially if the secret material is small and rarely changes.

This data wasn't "secrets" though. PD treated this loosely would still be an issue even if it was encrypted.


Standard doesn't mean good, and the sophistication to do this well is beyond most orgs and individuals. You can come up with specific scenarios, but they'll get garbled through the telephone game over time.

Much better to treat data, including secrets, as radioactive, and thus follow least privilege + defense in depth. If it's never there, and better, they never got it, there's nothing to worry about.

Ex: Before PII hits logs, encrypt it and don't give data scientists the key (see the sketch below).

Ex: Orgs we work with who version notebooks won't allow notebook output to be saved, and people (+ software) know to reject baked creds. Likewise, auth generally isn't baked in; SSO is used instead.

If it's never there, for every stage, so much easier.
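A crude sketch of that logging example, using openssl (the key path is hypothetical): the application logs only ciphertext, and the decryption key lives somewhere data scientists can't reach.

  # Encrypt a PII field before it ever reaches a log line.
  printf '%s' 'user@example.com' \
    | openssl enc -aes-256-cbc -pbkdf2 -pass file:/secure/log.key -a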


No, storing encrypted secrets in repositories is an incredibly common and safe practice in DevOps/GitOps environments. Virtually every orchestration tool has some way to support it.

E.g. k8s sealed secrets https://github.com/bitnami-labs/sealed-secrets

E.g. Salt's GPG filters https://docs.saltproject.io/en/latest/ref/renderers/all/salt...

E.g. Ansible Vault https://docs.ansible.com/ansible/latest/user_guide/vault.htm...

E.g. Puppet hiera-yaml https://puppet.com/docs/puppet/6/securing-sensitive-data.htm...
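For instance, the sealed-secrets flow is a one-liner (filenames hypothetical), and only the sealed output is committed:

  # Encrypt a plain Secret manifest with the cluster's public key.
  kubeseal --format yaml < secret.yaml > sealed-secret.yaml
  # sealed-secret.yaml is safe to commit; only the in-cluster
  # controller holds the private key needed to decrypt it.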

PD/PII is a completely separate issue. First because even encrypting doesn't remove legal obligations concerning processing, and second because your DS/BI teams probably need access to the unencrypted data to, like, do their actual jobs. You need completely orthogonal forms of access control for that (like the SSO you allude to).


Wrong thread? I use tools like that, but they're not the environments, credentials, & threat models (I think) we're talking about.

I'd be curious if/how any data science teams (Google/Facebook/Netflix/...) bake user credentials & API secrets into data science notebooks. I've never seen it, but I don't get to see everyone's notebook environments. I have seen 1-2 projects attempting to do DB auth plugins/libs for Jupyter notebooks, but not high-grade production ones. Instead, it gets baked into the deeper env (think Tableau, Databricks, ...), rather than being part of the ipynb.

Notebook security feels like 1990s/2000s browser security and the literal decades of unsafe web apis. DLP and all that is at the forefront in risks, yet most tools just do system auth and maybe a few special connector auth. The real threat model is outside of their system, so it's no surprise analysts & their data orgs fail in practice :(


Arguably kahrl is in the wrong thread repeatedly posting about "never ever ever ever ever ever storing secrets in version control". As I said, I agree PII/PD is different.

Sealed secrets can be used in various k8s-aware notebooks.


Yeah, and when quantum supremacy arrives and/or there is a mathematical breakthrough, the data is suddenly exposed.



