GitHub: a case study in link maintenance and 404 pages (chrismorgan.info)
36 points by chrismorgan on Oct 4, 2013 | 38 comments



Surprised nobody mentioned the extremely annoying and misleading 404 response when you are denied permission to view a private repo. I can't tell you how many times people at work tell me my link doesn't work because they aren't signed in.


We do that to avoid leaking the presence of the repository, if it's private. It's a bit of a pain, but privacy's important to us.


Thank you for this, some of us take notice of 404-for-security responses like these and appreciate it.


Perhaps, when the link could be a private repository, the error page should state it's either missing or inaccessible.

All we need is an error code 402.5 "plausible deniability between unauthorized and not found"...


The spec actually uses a 404 specifically for this purpose: http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html


Sure, and everybody who has read the spec knows that. Alas, that's 1% of your user base.

For the rest of your users it wouldn't hurt to say, "You might see something different if you're logged in." Or if they are logged in, saying, "Were you expecting something different? Maybe you just don't have access yet."


Wait, we can't hold GitHub to account for creating link rot and not applying the spec, and then, when they do implement the spec for good reason, say "but only 1% of your user base reads the spec". That's a bit of a double standard.


Are you talking to me? I don't think I said anything like the first half of that.


Reasonable compromise, although they'd have to use that version of the 404 page for any URL that could possibly be a repository.


I think that is one of the biggest mistakes GitHub made. I don't care what the spec says, you should always default to 401, so that requests can be retried with auth. Then after getting through the 401, you should fire the 404 or 200 (as the case may be).


Yes, one of the unfortunate trade‐offs between usability and information disclosure. That's why they put the login form there if you're not logged in—though having it add the line of text “there may be a private repository here, log in if you know there is” (or similar) would be a good thing.


This is understandable, however. If someone gets a 401 response, they know that you have a private repository with that name; they just don't have access to it. A small information leak.
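
The trade-off discussed in this subthread can be made concrete with a small sketch. It is not GitHub's implementation: the `REPOS` table, the user names, and both function names are hypothetical. The first function answers 404 for missing and inaccessible repositories alike; the second follows the 401 suggestion and so reveals that a private repository exists.

```python
from http import HTTPStatus

# Hypothetical in-memory data: repository name -> (is_private, allowed users).
REPOS = {
    "alice/public-tool": (False, set()),
    "alice/secret-plan": (True, {"alice", "bob"}),
}


def status_hiding_existence(repo, user):
    """Return 404 for both missing repos and private repos the user can't see."""
    entry = REPOS.get(repo)
    if entry is None:
        return HTTPStatus.NOT_FOUND
    is_private, allowed = entry
    if is_private and user not in allowed:
        # Same answer as "missing", so the response reveals nothing.
        return HTTPStatus.NOT_FOUND
    return HTTPStatus.OK


def status_revealing_existence(repo, user):
    """Alternative policy: 401 for private repos, which leaks that they exist."""
    entry = REPOS.get(repo)
    if entry is None:
        return HTTPStatus.NOT_FOUND
    is_private, allowed = entry
    if is_private and user not in allowed:
        return HTTPStatus.UNAUTHORIZED
    return HTTPStatus.OK
```

With this data, an anonymous request (`user=None`) for `alice/secret-plan` and for a name that does not exist gets the same 404 under the first policy, but different answers under the second.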


> GitHub uses the branch name; you can replace it with a changeset ID and it’ll work, but you’ll need to find a changeset ID.

You can press 'y' to expand the URL to its canonical form.


Oh, cool. I didn’t know that. It’s a pity they don’t make it more obvious. Post updated to include this tidbit of information. Thanks!


I'm really interested in a GitHub person's take on this (holman?). Is my notion of the balance reasonable? (You've got the data, I'm just guessing with it.) Do you think you're likely to assess improving GitHub's 404 page at least?


You bring up some good points, and we're always improving pages like these (the 404 itself has gone through a few versions this year already).

I don't think the answer is forcing the canonical URL on every page request, though. I think it's important to retain branch names instead of the full sha. I find this, for example:

    https://github.com/holman/dotfiles/blob/master/osx/set-defaults.sh
…far more usable and meaningful than this:

    https://github.com/holman/dotfiles/blob/0fe9e9963b2389eae4c9de49a4873bd819e19067/osx/set-defaults.sh
It'd be great to be able to support file renames and deleted branches better in the product, but that takes some time to build out. Hopefully we can do better with that in the future (and we have been working on things like this recently; supporting repository redirects was a huge one, really).
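
The two URL forms above differ only in the ref segment, so turning one into the other is mechanical once the commit SHA is known. A minimal sketch, assuming a `/<owner>/<repo>/blob/<ref>/<path>` layout and a branch name without slashes; the function name `pin_blob_url` is made up for illustration, and the SHA must be looked up by the caller because this code never contacts GitHub.

```python
from urllib.parse import urlsplit, urlunsplit


def pin_blob_url(url, ref, sha):
    """Replace the ref segment of a blob URL with a commit SHA.

    Assumes `ref` is a simple name with no slashes and that `sha`
    was obtained elsewhere.
    """
    parts = urlsplit(url)
    segments = parts.path.split("/")
    # Expected layout: ['', owner, repo, 'blob', ref, ...path]
    if len(segments) < 6 or segments[3] != "blob" or segments[4] != ref:
        raise ValueError("not a blob URL for ref %r: %s" % (ref, url))
    segments[4] = sha
    return urlunsplit(parts._replace(path="/".join(segments)))
```

Called with the first URL above, `"master"`, and the SHA from the second URL, it produces the second URL. Branch names containing slashes would need a lookup against the repository's actual refs, which this sketch does not attempt.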


GitHub can definitely improve on this. For instance, if you try to edit a file like so:

https://github.com/takezoe/gitbucket/edit/master/README.md

you will see a 404 with no explanation if you're not logged in.


By the way, I apologise to anyone looking for a decent 404 page from my site. I initially configured that site (with WebFaction) as a static site, and they don't expose a way of providing a custom 404 page for that. I should switch it to Apache so that I can get a 404 page, but I haven't got round to doing that. (More importantly, I haven't designed the 404 page I want, so the other part of the effort would be wasted.)

At least I gave http://relink.chrismorgan.info a proper 404 page. The sentiments displayed on it are the same as what I would put for http://chrismorgan.info, though: you won't get a 404 page on my site unless you broke the link yourself.


[deleted]


Google Webmaster Tools will tell you of some broken links, but not all; it's not proactive about informing you, whereas Relink will be scanning regularly for this specific purpose and telling you. I'm also interested in making sure that all links that ever worked continue to work, something GWT doesn't do. Those are a few of the things; I've got much more in line for it where GWT is not competition at all. You may find more useful info in my post from yesterday, https://news.ycombinator.com/item?id=6489084 (the actual submission of which became [dead] for some reason entirely unknown and unclear to me).
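
The idea of making sure that links which once worked keep working can be sketched in a few lines. This is only an illustration, not how Relink works: `find_regressions` and `check_status` are hypothetical names, and `check_status` is assumed to follow redirects and return the final HTTP status code for a URL.

```python
def find_regressions(known_good, check_status):
    """Return (url, status) pairs for URLs that once worked but no longer do.

    `known_good` is an iterable of URLs previously seen returning 2xx.
    `check_status` is a caller-supplied function mapping a URL to its
    current final status code.
    """
    broken = []
    for url in sorted(known_good):
        status = check_status(url)
        if not 200 <= status < 300:
            broken.append((url, status))
    return broken
```

Passing the status lookup in as a parameter keeps the sketch free of network code, so it can be exercised with a plain dictionary of fake results.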


> Link maintenance is hard; the web doesn’t just automatically stay intact; it requires effort on your part.

This sounds all well and good in theory, but - commercially speaking - becomes very impractical for many businesses. I'd be really curious what the ROI would be on a more rigorous link maintenance practice. My general impression is that it's simply not worth it.

> I should switch it to Apache so that I can get a 404 page, but I haven't got round to doing that.[1]

Interesting comment from the OP, who I suspect also recognizes (quite possibly like GitHub) that the benefit hardly outweighs the time it takes to do this type of stuff.

[1]https://news.ycombinator.com/item?id=6495675


This may happen to be true for some businesses, but it's not necessarily true. Some toolsets may make it hard, which raises the I. Others make it easy or free, so people naturally do it. The smaller the I gets, the more likely a given R is worth it.

I also think it's easy to underestimate the R here. It's very hard to detect things like brand damage and lost leads, especially when somebody turns up on your site once, thinks you guys are chumps because of a bad first impression, and never comes back. Whereas the I is obvious, because the cost is all internal to the company.

The two factors, alas, reinforce one another. Link rot isn't an obvious problem when setting up a system, because nobody is linking in. By the time the problem becomes noticeable, the system is hard to change, making the cost of a fix high. And that trains people to treat link rot as unimportant, deepening the cycle.


My favorite part about the GitHub 404 page is that it consistently locks up Firefox for me.


I don't see how that's possible. Go report the bug. This is a bizarre bug.


I don't necessarily agree with the author. For example, if someone wants their privacy and deletes their Facebook account, then when you search for the account it should say the account does not exist. It doesn't say it's gone. It just says it is not found. A 404 has good security and privacy implications. For example, when you make a repo private, anyone who tries to access it without proper permission should see a 404 instead of a 401.


The OP makes valid points, but it's hard to hate on GitHub too much. Their users are the ones breaking the links. But it's also hard to hate on the users, because the whole point is to develop software, and that development is never done.


It depends; users break some of the links, but GitHub's design for non-permanence is, to my mind, the biggest problem.


fair point


Isn't this Security 101: limit what external users can gather about your systems? Especially when an error occurs, you don't want an exception getting thrown and the stack trace being displayed to the whole internet.
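
A minimal sketch of that principle, assuming a handler that returns a response body: the full traceback goes to the server log, and the client sees only a generic message. The names `safe_call` and the incident ID scheme are hypothetical, not taken from any particular framework.

```python
import logging
import uuid

logger = logging.getLogger("app")


def safe_call(handler, *args):
    """Run handler; on failure, log the traceback and return a generic 500 body."""
    try:
        return 200, handler(*args)
    except Exception:
        # Hypothetical correlation ID so support can find the log entry.
        incident = uuid.uuid4().hex
        # logger.exception records the traceback server-side only.
        logger.exception("unhandled error, incident %s", incident)
        return 500, "Internal error (incident %s)" % incident
```

The response body carries nothing from the exception itself, so internal details such as file paths or query text stay out of what external users can see.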


Totally tangential, but does Mercurial really not allow you to delete things ever? So, for instance, when you accidentally commit a 100MB PSD file and then need to remove it, there's no way to do that?


Mercurial does let you do it in much the same way Git would, but it requires you to do it very deliberately, enabling (bundled) extensions in the config. For example, the "strip" command, part of mq. As with Git, doing things like that if you've pushed publicly will be difficult and require coordination.



I love the spirit of this, but I would never put it on a serious site.

I put things on the web so people can see/use them. If they come to my site, I want them to get what they're looking for. If I'm going to serve a 404, it's either because a) I screwed up, or b) somebody mistyped something. In either case, distracting them with something else can only take them farther away from their original goal.


I wonder whether that works at all for web sites that have essentially global reach. Children from the US are unlikely to appear in Germany all of a sudden, I guess. (Incidentally, the example page for »Other countries« was in Greek which I cannot even read).


For something like GitHub, the more international it is, the better.


I wonder—can they demonstrate whether or not it’s achieved anything? It’s an interesting idea, but I’m dubious about its efficacy. Now if Microsoft and GitHub both used it—


I use it in production. You will always have 404s, and this is what I feel to be a meaningful 404.


If links should be permanent, what happens after an HTTP DELETE?


410 GONE

EDIT: To be clear, this is more in regard to the original article's qualification on the permanence of links: If for some reason it can’t exist any more, don’t just let it go: it should show useful error. Using 410 GONE instead of 404 NOT FOUND after a DELETE is a fairly minimal but direct application of this principle.
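
One way to picture the 410-after-DELETE idea is a store that keeps a tombstone for every deleted path. This is a toy sketch under stated assumptions: the `Resources` class and its method names are hypothetical, and a real server would persist the tombstones rather than hold them in memory.

```python
from http import HTTPStatus


class Resources:
    """Toy in-memory store that remembers which paths were deleted."""

    def __init__(self):
        self._live = {}
        self._gone = set()

    def put(self, path, body):
        self._live[path] = body
        self._gone.discard(path)  # re-creating a path clears its tombstone

    def delete(self, path):
        if path in self._live:
            del self._live[path]
            self._gone.add(path)  # keep a tombstone for later requests
            return HTTPStatus.NO_CONTENT
        return self.get_status(path)

    def get_status(self, path):
        if path in self._live:
            return HTTPStatus.OK
        if path in self._gone:
            return HTTPStatus.GONE
        return HTTPStatus.NOT_FOUND
```

A path that was created and then deleted answers 410, while a path that never existed still answers 404, so a visitor can tell "this used to be here" from "this link was never valid".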



