> The lawsuit said GitHub had an obligation under California law and industry standards to keep off or remove the Social Security numbers and personal information from its site. The plaintiffs believe that because Social Security numbers had a fixed format, GitHub should have been able to identify and remove this data
> The lawsuit alleges that by allowing the hacker to store information on its servers, GitHub violated the federal Wiretap Act.
> The lawsuit also makes a bold claim that "GitHub actively encourages (at least) friendly hacking." It then links to a GitHub repository named "Awesome Hacking." ... not associated with GitHub staff or management, but owned by a user who registered on the platform
This lawsuit is a natural extension of the calls for internet platforms to better police and accept liability for the content they are hosting. These complaints are usually directed towards the big corporations like Google, Facebook and Amazon. But if the rationale is accepted, it will need to be applied universally to startups and SMBs as well. As someone who thinks the world will be a far better place if we had decentralized dumb platforms, as opposed to very centralized platforms with heavy-handed top-down censorship and moderation, I sure hope this movement is turned back.
> calls for internet platforms to better police and accept liability
Hollywood has for years been quite transparent that this is a high-priority objective and long-term campaign for them: establishing "rules of the road" for the "wild west" internet, a place of "little good" and free-loading "entitled" users, where most traffic is video downloading. That's one advantage of group-think ("no one honestly disagrees with our position", so why not describe it as the obviously fair and uncorrupted right thing to do). It's only the tactics that are less discussed: the lobbying to establish precedents, the press story placement, the taking of far-too-respected tech "down a notch". So the superficiality of press coverage of the competing visions of what the internet is and should become has long struck me as odd.
I’ve got to say, I think the “common carrier” argument is suspect if you’re running targeted advertising and recommendation on the page, but GitHub isn’t doing any of those things.
> As someone who thinks the world will be a far better place if we had decentralized dumb platforms
Then you should be very much in favour of assigning expensive liability to companies running these centralized platforms. If it becomes extremely expensive or legally risky to maintain a big centralized database, that opens a window for free, open source federated protocols to fill that gap.
Consider: You can sue Megaupload Ltd, but you can't sue BitTorrent-the-protocol.
Yeah, good luck with that. If you're lucky, you'll win a judgement for both the defendant's Playstation and his Xbox. I'm sure white shoe law firms will be beating down your door to represent you in that suit.
Legally, what's the difference between all the distributed, load balanced servers running a "centralized" database and the computers which host the same data in a "distributed" platform?
This doesn't solve the problem. If anything, it amplifies it. We have to either protect platforms or hold platforms liable, but making everyone a platform doesn't help much.
Are you being facetious? Are you genuinely asking me to explain the difference between SMTP and Facebook.com?
If you make everyone an individual actor, then there is no "platform". Visa can ban a merchant; you can't ban a merchant from the concept of accepting a cash payment.
> Are you genuinely asking me to explain the difference between SMTP and Facebook.com?
Of course not. I'm asking you what the legal difference is between the GitHub we have today and your desktop running a hypothetical distributed federated GitHub.
If your computer takes part in the hosting of some illegal content, wouldn't you be held liable like GitHub is here?
God this is depressing to read. Git (not "GitHub") is already a distributed VCS. This isn't hypothetical!
If you're a random individual hosting a git repo containing illegal content, you might get sued, but much much more likely, you'd just get an angry email demanding you take it down.
> God this is depressing to read. Git (not "GitHub") is already a distributed VCS. This isn't hypothetical!
I feel like you're missing the context in this thread. My parent was offering federated protocols as an alternative to centralized platforms, and my question is what the legal implications are for me if I end up hosting illegal content because of a network sync of some federated git.
Maybe social media is a less distracting example for you. Currently, Facebook is liable if they host, e.g., child porn on their site. If some federated social media platform takes off where every user's computer also takes part in hosting, what happens when a child porn photo ends up on the platform?
My point is that federated vs centralized is not a magic bullet for this issue. In either case, there is some platform which can be held liable for content. People seem to think "use open source distributed protocols" solves this legal issue, but it does not.
> If you're a random individual hosting a git repo containing illegal content, you might get sued, but much much more likely, you'd just get an angry email demanding you take it down.
Is this true? I'm genuinely asking. Can I commit the same "crime" as GitHub or Facebook but be treated so differently by the legal system? What are the actual rules here that differentiate us?
I feel like you're wilfully missing the point. Let's leave aside child porn because that's an extreme case and I'm assuming you'd report that to the police if it showed up on your social media feed.
In terms of less extreme cases, there's a fairly good chance that in the course of your normal browsing, you've illegally downloaded copyrighted content. Maybe deliberately, but also by accident if your browser cached some photos or you right-clicked and saved some images you didn't have a valid license for.
Have you ever been sued for this? Do you know anyone who has?
The problem is that all these big surveillance-as-a-service companies are a boon for the state. Centralized data silos are setting the precedent that there is indeed a magic button that politicians/lawyers can push to control "The Internet".
It seems impossible and foolish to sue a protocol, until the precedent is gradually set that "reasonable" services implement many forms of state and corporate censorship. Then when the next thing happens that they don't like, their question becomes why are "lawless" decentralized protocols allowed to exist.
And obviously a protocol itself still can't be legislated out of existence. But its use can certainly be filtered, criminalized, etc. Even its development can be hampered if it's cast as a "circumvention device" against status-quo censorship.
I don't think we're terribly far from having to fight this battle regardless - when consumer net neutrality fails, incumbent ISPs will discover most of their market is content with a default whitelist to save a few dollars. But hastening its arrival is not good.
There's a giant wave of political support for this. Right now people are talking on TV about 8chan and mass shooters. GitHub is being sued for content.
The end result will be a (horrifying) market solution. Site owners will have no choice but to pay outside companies to analyze content for them and auto-delete. This will probably end up throwing the baby out with the bathwater.
Agreed. I will add that both sides of the political spectrum in the US seem to push for it, though for different reasons. I am annoyed at how many conversations I have had with people who think it is not only good, but necessary.
They already do this. I see people on Twitter getting banned daily for mundane, reasonable things like "fuck transphobes" while white supremacists spout hate daily and Twitter does nothing.
Can you expand on that? There may be selective enforcement, but a hundred million SSNs on any platform cause the same amount of damage. If you're going to hold the platform accountable for preventing it, what's the rationale for excluding a small player? You'd just drive this activity their way.
> The plaintiffs believe that because Social Security numbers had a fixed format, GitHub should have been able to identify and remove this data
I don't see how they can expect to enforce this with 100% accuracy.
SSNs do have a fixed format but other things could potentially follow the same format.
For example, what if you had a library that lets you configure randomly generated codes in an XXX-XX-XXXX format, and it just so happens one of the random codes matches a valid SSN pattern?
Is GitHub going to mess with your code? What if you have tests that match on a hardcoded random number in an SSN-like format? If GitHub scrubs that, then suddenly your CI tests might not pass. Also, how would it deal with modifying your git history so others couldn't clone it with the potentially sensitive data?
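To make the false-positive problem concrete, here's a rough sketch (the pattern and the sample strings are just made up for illustration):

    import re

    # A naive SSN-shaped pattern: three digits, two digits, four digits,
    # optionally separated by hyphens or spaces.
    SSN_LIKE = re.compile(r"\b\d{3}[- ]?\d{2}[- ]?\d{4}\b")

    samples = [
        "user_ssn = 123-45-6789",      # could be a real SSN...
        "test fixture: 987 65 4320",   # ...or a hardcoded test value
        "order id 483920175",          # or nine digits of something else entirely
        "ZIP+4: 902100123",            # ZIP+4 codes are also nine digits
    ]

    for s in samples:
        if SSN_LIKE.search(s):
            print("flagged:", s)
    # All four lines get flagged; only the first could plausibly be real.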
> SSNs do have a fixed format but other things could potentially follow the same format.
I even have trouble with this premise, in addition to generally agreeing with a lot of the other comments in this tree arguing that SSN detection is a red herring.
I agree that displaying SSNs for human consumption has an agreed upon standard format. This doesn't imply that code (and therefore any tests or distributions of data) working with SSNs is handling them as XXX-XX-XXXX.
I've seen plenty of clients storing them without dashes in database tables. I saw one storing them as INTs and handling padding in display logic. I don't know if an SSN can start with a leading 0, but they guarded against that.
Especially given the storage and memory implications of a 32-bit integer vs a 9-11 character string, I see lots of reasons to work with SSNs in code as ints. Should we flag all ints as SSNs? Or maybe we can be "reasonable" and flag any integer in the range 100,000,000-999,999,999 (aka 100-00-0000 to 999-99-9999) as an SSN?
I can trivially generate a list containing all valid SSNs with a simple loop. Should I hesitate to publish code with loops over integers in the range above? That code could easily be used by a malicious hacker to generate the SSN list!
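For the curious, a minimal sketch of that point (the formatting helper is hypothetical, and the numbers are obviously example values):

    # Every nine-digit integer, displayed the way SSNs are formatted.
    # The full range is a billion lines; this prints just a tiny slice.
    def as_ssn(n: int) -> str:
        s = f"{n:09d}"                      # left-pad to 9 digits
        return f"{s[:3]}-{s[3:5]}-{s[5:]}"  # 123456789 -> 123-45-6789

    for n in range(123_456_789, 123_456_792):
        print(as_ssn(n))                    # 123-45-6789, 123-45-6790, 123-45-6791

    # A plain int (or a 32-bit column in a database) carries the same
    # information; the dashes only exist in display code.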
SSNs don't have a checksum like credit card numbers do (or like national identity card numbers do), because the SSN is not meant to be used the way it is used.
But do many countries treat them as secrets? Where I live, my number is a unique identifier for me but it's not secret. Because, you know, sharing secrets isn't smart and leads to the recurring issues we see in the US.
I'm from the US and I remember about 20 years ago I registered for a Blockbuster card (a way to rent DVDs from Blockbuster) and the form required putting my full Social Security number on it. In the US it's supposed to be secret but lots of places want access to it. Blockbuster never got my SSN and they did let me sign up without providing it. It's crazy they would even ask.
I’m reminded of a webpage that claimed to know your pin for your credit card; just do a find in page to see! In reality, it just had all 10,000 possible numbers listed in numerical order
Is this true? I wasn't aware that websites could capture your find-in-page searches. I'd be interested to know if they can capture your key events when the find-in-page box has focus. Intuition tells me they can't, and that it would be outside of the website's "sandbox". But I can't say for sure.
In most modern browsers you can't capture this when find-in-page box has focus. Only if you manually select text after searching can you capture it (or, like the other reply stated: capture the scroll distance).
But you could easily disable ctrl+f and throw up your own search box with keypress capture. Not all browsers show the search box outside of the browser viewport, and even for those that do (such as Chrome), you could display a hovering modal inside the viewport, as most users won't remember the exact location of the search box.
Detecting plaintext SSNs wouldn't be difficult with a combination of regex, machine learning and human verification.
Even if hackers could just encode the SSNs, it would at least mitigate the spread of PII.
Edit: I don't care about the downvotes, I care about privacy. Enough of the argument "but wait, can't you imagine the cost?"; if you can't afford to protect people's privacy, don't do business at all.
Edit 2: people are totally missing my points. The goal is to not display any plaintext SSN that could be scraped by bots. As I said, the hackers could just encode the SSNs, but then the numbers wouldn't be readable by scrapers.
Edit 3: once a project is reviewed and verified, it would stop triggering alerts. This is trivial, but the HN mentality is just disrespectful regarding people's data, until it's their own personal data that gets leaked.
The people who think this is a smoking gun proving GitHub as a company is supporting hackers are the ones putting forth this idea; no one with any technical knowledge should come close to supporting it.
Per EU's GDPR, and maybe the future California Consumer Privacy Act, yes: GitHub should do everything they can to preserve people's privacy. Including maybe flagging and reviewing projects which process PII.
GitHub is popular enough that less technical people also browse it, in particular young people looking to learn.
If I accept that technical difficulty and infeasibility is no defense, then I want that standard applied to lawyers as well.
Overly litigious firms causing rising legal costs across industries? I don't care if it's hard to solve, the onus for fixing it is on the firms; they figure it out or face penalties. Perhaps in the interests of helping the disenfranchised we could institute something like what real estate has, where banks are required to sell a quota of mortgages in certain areas regardless of the financial viability. Lawyers could be forced to seek out clients they would ordinarily never entertain due to the risk of loss. It would stink for them, but what do I care? That's their problem.
I can't tell if you actually mean what you're saying, or if this is just the time-honored HN tradition of being a contrarian for the sake of being a contrarian.
In regards to your edit 1, cost benefit analysis isn't just about a company's bottom line, it's also about the types of architecture and services that are allowed to exist online. Github is not the primary way that PII is leaked online, you're thinking of Pastebin.
Should Pastebin be allowed to exist? Should the Open Source developers behind Wordpress and Ghostery be liable for not scrubbing PII off of websites? The "cost" here isn't really money, it's a social cost. It's regular people's access to services that, on net, make their lives much better.
I'm not worried about Microsoft making less money, I'm worried about damaging one of the best software repository services online and making it less useful to ordinary developers.
In regards to your edit 2, Github is a collection of software repositories. Say you replace all social security numbers with <number-redacted>. If I'm a screen scraper, that doesn't block me -- I can just clone the repository. There is no way that Github can block this unless they delete or replace the numbers in the actual uploaded code, which would obviously be a bad idea.
In regards to your edit 3, are you planning on linking Github repositories to real-life identities? Probably not, since that would be a huge privacy problem, and you're trying to improve privacy.
So what happens when ownership transfers? Or when someone makes an innocuous repo and then later on pushes PII? Note that this is not an abstract problem, we've seen multiple malware attacks on packages, browser extensions, and phone apps that boiled down to, "it looked safe, and then somebody stole the credentials or just decided to push malicious code."
I don't understand how a review process would help here unless it was a review process on literally every commit.
Yes, because as far as I know, Pastebin doesn't process the sent data to provide its service. They just store it; they could simply store it encrypted and say they can't access the data. On the other hand, GitHub processes repositories to provide further services. They already read the sent data.
> The "cost" here isn't really money, it's a social cost. It's regular people's access to services that, on net, make their lives much better.
What about the cost for people whose data leaked? Are you saying we should treat them as collateral damage for others to have an "accessible" service? That sounds irresponsible.
> Should the Open Source developers behind Wordpress and Ghostery be liable for not scrubbing PII off of websites? The "cost" here isn't really money, it's a social cost.
No, as long as they don't operate public instances of that software, they don't need to include all the tooling for detecting and handling PII. However, anyone who wishes to operate that software with a way for the public to send data to it should implement privacy safeguards.
> ordinary developers.
In the 2000s, we were the "ordinary developers". All the leaks happening today are because we didn't care enough about that aspect of software engineering. The new wave of developers should always have privacy in mind before writing software for businesses, and to change the mentality, changing a platform like GitHub would help spread a new culture of security and a privacy-oriented mindset.
> In regards to your edit 2, Github is a collection of software repositories. Say you replace all social security numbers with <number-redacted>. If I'm a screen scraper, that doesn't block me -- I can just clone the repository. There is no way that Github can block this unless they delete or replace the numbers in the actual uploaded code, which would obviously be a bad idea.
Github is an opinionated, centralized collection of software repositories. It should be simpler: if your repository is private, then there should be no filter or review at all. If your repository is public or becomes public, then it should be draconian about what is posted and shared. No hate speech, no PII, etc. If there is a positive detection, then the repository should automatically turn private or be suspended until it's resolved.
> So what happens when ownership transfers?
Unsolved issue so far; I don't have a proposal for it
> I don't understand how a review process would help here unless it was a review process on literally every commit.
Regarding publicly sharing information on famous platforms like GitHub, it should be mandatory IMHO. I would happily trade a few false positives for a better peace of mind.
> What about the cost for people whose data leaked? Are you saying we should treat them as collateral damage for others to have an "accessible" service?
Yes. That's what being in a society means, you have to put some thought into the collective good of allowing a service to exist. Living in a society means that sometimes you accept a greater personal risk in order to allow a large number of people to access services that better their lives. It's how cars work, for example.
You're drawing a distinction between private and public that seems completely arbitrary to me. If we really ought to be doing everything in our power to limit PII leaks, I don't see why operating a private service is any excuse.
Why do I care whether a message was meant to be public or private if it leaks my PII? I could see an argument for something like end-to-end encryption being exempt, because in that case the service provider literally can't scan the messages. But where my texts, or a private software repository, or self-hosted software are concerned, there's nothing technical or legal that prevents companies and software providers from running the exact same tests as they would on publicly facing content.
If we're going to hold public hosts accountable, why are we letting private hosts off the hook? And if it seems obvious that private hosts shouldn't be subject to those restrictions, then what's fundamentally different about a public host that means they should? You can't think about PII as a black-and-white issue, these are a set of tradeoffs that have to be run through a cost-benefit analysis.
> Github is an opinionated, centralized collection of software repositories.
A side-effect of making moderation into an indicator of responsibility is that platforms will stop moderating. Platforms like Facebook and Twitter are bad at moderation, but we don't want them to turn off all of their moderation and become 8Chan. The "they already moderate some things" argument can have some really negative side-effects, because it punishes companies and increases their liability just for trying to be better.
Quick side-note, Pastebin does do some text processing for spam detection, particularly around links. But let's assume a service which did no moderation at all. It sounds good to say this service should be treated differently than Github. But what you end up with in that scenario is that everybody stops doing moderation.
If your goal is to make it harder for people to publicly post PII, then this is counterproductive. You should be trying to make it easier and less risky for companies like Github to moderate content.
> Unsolved issue so far; I don't have a proposal for it.
Well... don't you think you should solve that before you make a policy? If I propose a car with square wheels, I can't just say, "don't worry, I'll figure out how to make them roll after I build it."
You missed the whole point of the comment you replied to. You can detect the format of an SSN, but you can't reliably tell that it actually is an SSN, and not some other type of identifier.
Can you yourself reliably detect if something is a list of SSNs or some other type of identifier?
I think you can. Now that we have neural nets capable of distinguishing hundreds of different dog breeds, are we really still tripping over the most basic and structured type of entity extraction? No. A simple regex, combined with heuristics and a linear model on top, can reliably detect SSNs.
The fact that there will be a few false positives (whether you automate this or do it manually) does not mean it is a Herculean technical challenge.
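A rough sketch of the kind of scoring I mean (the context-word list and the threshold are invented for illustration; a real system would tune these on labeled data):

    import re

    SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
    CONTEXT_HINTS = ("ssn", "social security", "dob", "date of birth", "last_name")

    def suspicion_score(text: str) -> float:
        """Crude score: count of SSN-shaped tokens, boosted by PII-ish vocabulary."""
        hits = SSN_LIKE.findall(text)
        if not hits:
            return 0.0
        lowered = text.lower()
        context = sum(1 for hint in CONTEXT_HINTS if hint in lowered)
        # Many SSN-shaped tokens plus words like "ssn" or "dob" nearby is far
        # more suspicious than a single match inside some test fixture.
        return len(hits) * (1 + context)

    doc = "name,ssn,dob\nJane Doe,123-45-6789,1980-01-01\nJohn Roe,987-65-4320,1975-07-04\n"
    if suspicion_score(doc) > 5:
        print("queue for human review")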
What I think happened: Someone contacted Capital One by email to responsibly disclose to them that there were SSN's and other data on a Gist. That person found them with a simple crawler or search.
Then Cap 1 thought: If some rando can find this after a lot of damage has been done, why can't Github find these seconds after upload?
And, really, there is no technical excuse. It is perfectly possible to do this, and lots of big companies do this (or hire security companies to do this for them). Mention their name on some deep web hacking forum, a pastebin, or inside Github code, and somewhere an alarm goes off.
Github could (and should) warn if a user uploads loads of PII-like data. For the cost of running a search server and a few moderators. "Are you sure you want to upload your AWS credentials in a public repository?".
Github is somewhere halfway between moderated and a content platform. They already have a history of taking down repositories if they link to PII data (or infringe copyright, or damage U.S. national security): http://web.archive.org/web/20180619172528/https://github.com... so not acting on this specific repo with SSN numbers could be seen as a poor/shoddy job on their part. Github is certainly in the dominant position to mitigate spread of PII data, so they should have their stuff in order.
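A warning pass like that could be little more than a handful of patterns run over each uploaded blob. A hedged sketch (the upload-time hook itself is hypothetical; the AKIA prefix is the commonly documented shape of AWS access key IDs, and the SSN pattern will obviously have false positives, which is why this only warns):

    import re

    # Two of the patterns secret scanners commonly look for.
    PATTERNS = {
        "possible AWS access key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
        "possible SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    }

    def warnings_for(blob: str) -> list:
        return [label for label, pattern in PATTERNS.items() if pattern.search(blob)]

    # Hypothetical check at push/upload time for public repositories.
    blob = "aws_access_key_id = AKIAIOSFODNN7EXAMPLE"
    for label in warnings_for(blob):
        print(f"Warning: this public upload appears to contain a {label}. Continue?")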
Don't you need at least one other piece of identifying info to make that useful? Like, I could churn out mostly valid gmail passwords all day, but I have no idea which users have those passwords.
Worse. You can make the ML model spit out the SSNs it was trained on. That's a problem when you can't manually curate billions of documents. If you didn't look, you wouldn't even know they were there.
‘The lawsuit also makes a bold claim that "GitHub actively encourages (at least) friendly hacking." It then links to a GitHub repository named "Awesome Hacking.”[0]’
While I do think that "hacking" is less demonized than it was circa "Hackers", it is not surprising that the attitude around a lawsuit is a little outdated. I'm just glad the target is an entity with sufficient resources to defend such a spurious accusation - a similar suit could probably destroy a smaller company.
So? Trying to break into a system can be the only way to know it's reasonably secure. This is like saying locksmiths are bad. Preventing this makes systems _less_ secure, defeating the point of trying to ensure privacy.
It doesn't even appear to be an official GitHub page (Hack with GitHub - location: Bangalore, India, email: hackwithgithub@gmail.com). Just because someone creates an "X-with-Github" repository, it doesn't mean GitHub is actively encouraging X.
> The lawsuit said GitHub had an obligation under California law and industry standards to keep off or remove the Social Security numbers and personal information from its site. The plaintiffs believe that because Social Security numbers had a fixed format, GitHub should have been able to identify and remove this data
I can't wait until I get to debug our first build that won't run because we uploaded some data, however broadly that ends up being defined, with 9 digits in a row...
Barriers to posting SSN-like data would make it difficult for a lot of people to do their job. Software that handles SSN info should have fake data for tests.
Also SSN isn’t that distinctive of a format. nnn nn nnnn. Check bits and reserved prefixes were all removed decades ago when it became clear we’d run out unless we use the whole name space (and even then that buys us to 2100). \d{3}\s?\d{2}\s?\d{4} will match a surprising amount.
Detecting SSNs is hard without accepting a high false positive rate. Much harder than phone numbers, credit card numbers, or cloud credentials.
I'd assume many systems would store SS numbers without spaces or dashes in the backend so that rendering is up to the client. Which means you're looking for 9 digit strings. For example, full zip codes (xxxxx-xxxx) are also 9 digit strings.
I've posted elsewhere in this thread about this. There's really no reason to expect SSNs as strings for internal use. 32bit integers readily represent the same, as the max SSN is just a 9-digit number. I've seen at least one client store SSNs as INTs in a database and handle left-padding to 9 characters and interposing hyphens in display code.
Any 9-digit integers are immediately suspect under this reasonable storage choice.
You're strawmanning. I did not speak about making things secret. I also suggest you check out this article https://en.wikipedia.org/wiki/Free_Speech_Flag so you understand the difference between censorship and secrecy.
Honestly, it scares me that this was even filed. Even though we know how ridiculous it is to include Github in this suit, I'm afraid we're going to be left with some weird middle ground that shouldn't even exist to begin with, made by people who have no idea how things work, trying to fix something that isn't broken.
This might be the first good, non-entitled argument I’ve run across for having some form of software engineering licensure: having qualified people whose technical testimony in court would carry more weight than Larry McSues-a-lot.
Except that professional licensure tends to attract the lower end of the spectrum, because obtaining that credential represents a better path to success. So it would be easy to get a certified technical person to say the exact opposite for their paycheck, despite the technical consensus being "wtf".
The general problem you're referencing is one of stature, specifically that someone whose core activity is forwarding emails without trimming the replies is viewed as more-equal by the court because they've obtained a thick piece of paper. Alternatively we could just remove professional licensure from the field of Larry McSues-a-lot and diminish his stature.
"Coffee spills, Pokemon class actions, tobacco settlements. American courts have
made a name for themselves as a wild lottery and a money machine for a lucky few
lawyers. At least in part, however, the reputation is unfounded. American courts seem to
handle routine contract and tort disputes as well as their peers in other wealthy
democracies.
"More generally, Americans do not file an unusually high number of law suits.
They do not employ large numbers of judges or lawyers. They do not pay more than
people in comparable countries to enforce contracts. And they do not pay unusually high
prices for insurance against routine torts.
"Instead, American courts have made the bad name for themselves by mishandling
a few peculiar categories of law suits. In this article, we use securities class actions and
mass torts to illustrate the phenomenon, but anyone who reads a newspaper could suggest
alternatives.
"The implications for reform are straightforward: focus not on the litigation as a
whole; focus on the specifically mishandled types of suits."
I don't know where I first heard this, but I have in my head the impression that America has the reputation of being overly litigious because mis-behaving companies think they benefit from creating that misconception.
> The plaintiffs believe that because Social Security numbers had a fixed format, GitHub should have been able to identify and remove this data
No, it's not that simple, especially once you add binary files to the mix.
I once worked at a company that required everyone to run some sort of local scanner to see if there's sensitive data on their laptops. My laptop with no sensitive data had something like 10k+ matching files. I promptly ignored the thing.
I swear lawyers are a bunch of geniuses. How long until your personal computer becomes a liability because some cached content in the browser counts as knowingly hosting content? Had it on your phone and took a trip? Now you're transporting across state lines. Absolutely ridiculous. And as per my last comment, this is 100% about increasing settlement size, because the lawyers get a percentage. Never been through a class action before? Let me tell you how it works: the opposing lawyer has a set payout amount they want before they even begin the process. They work to achieve that amount. Once it's agreed on, they are happy. It has absolutely nothing at all to do with enforcing laws or protecting rights. It is all about buying a lawyer a new house.
It would be trivial to post every SSN online. There are at most one billion numbers fitting that format. Store each number in 30 bits and that's about 3600 MiB of data.
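Back-of-the-envelope, in case anyone wants to check the arithmetic:

    # 10^9 possible nine-digit numbers, 30 bits each (2^30 > 10^9).
    count = 10 ** 9
    bits_per_number = 30
    total_mib = count * bits_per_number / 8 / 2 ** 20
    print(f"{total_mib:.0f} MiB")   # ~3576 MiB, i.e. roughly 3.5 GiB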
Obviously they are going to argue that section 230 doesn't apply.
Section 230 subsection d:
> (4) No effect on communications privacy law
Nothing in this section shall be construed to limit the application of the Electronic Communications Privacy Act of 1986 or any of the amendments made by such Act, or any similar State law.