Testing 3 million hyperlinks, lessons learned (samsaffron.com)
206 points by sathyabhat on June 7, 2012 | 55 comments



(I've had this rant before, but I'll repeat it.)

He points to Stack Overflow's 404 as a good example and claims "We do our best to explain it was removed, why it was removed and where you could possibly find it."

Yet there is still no permanent archive of deleted Stack Overflow content; you have to rely on third party archives like archive.org and even then, you have to be lucky.

SO moderators have a habit of retrospectively deleting old content that is off-topic under current rules, even if it was perfectly on-topic at some time in the past. I feel this is bad internet citizenship -- it's removing internet history for no good reason.

Fair enough, delete newly created off-topic questions under the current moderation rules. But these types of questions were on topic at the time they were originally asked. Deleting them retrospectively (completely -- no redirect either) is still poor form.

(Top read otherwise though!)


And thus we lost dozens of episodes of Doctor Who, and almost lost all of Monty Python (if Terry Jones hadn't stepped in), because someone thought the material was no longer relevant and archival was not a chartered function (though the economics of maintaining an archive are a valid concern).


Actually no, we lost them because of bizarre copyright rules which meant they had to be deleted.


Can you explain?


akent, on a personal level I am torn on the deleted content issue; I understand both sides of it. On one hand, this is not content the community wants to associate with the site; on the other, we have no facility for redirecting it to an external home. We are experimenting with special locked questions for many of the historical cases. This is a very tough issue we are trying to solve.


Set up a new read-only Stack Exchange site on a new domain. On request and at their discretion, moderators could undelete questions and migrate them to the new site. (Users are redirected to the new site if they visit a question that was migrated there from another site.)

Exclude the site and its posts/activity from most of the listings and indexes on the rest of the network, to emphasize that this content has been removed from the Stack Exchange network. Include a disclaimer in the header of every page.


> It would be trivial to do some rudimentary parsing on the url string to determine where you really wanted to go

Specific to this point, a new project I'm building supports "pretty" URLs and I've found my (now) favourite solution is to build an aliases system.

It works like so: when a user creates an item, an "alias" is registered for it, set to "current", and all future requests to that alias are logged. If the user later causes the URL to change (a name change, etc.), a new alias is registered; the old one is retained and 301s to the new alias. All aliases are visible to the user and can be invalidated manually (if they want to re-use an alias, for example). However, if a retired alias has received a large number of hits from a single source (say 50 referrals from website.com to mysite.com/previous-alias), the system assumes the user posted the link on another website, so invalidating that alias would create a dead link (and lose my site traffic), and it refuses to allow it.
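Roughly, in Python (a sketch of the idea with made-up names, not my actual implementation):

    # Sketch of the alias scheme described above. Each alias maps to an item;
    # old aliases are kept and 301 to the current one, and retiring an alias
    # is refused if it still receives traffic.

    aliases = {}  # alias -> {"item_id": ..., "current": bool, "hits": int}

    RETIRE_HIT_THRESHOLD = 50  # assumed cut-off for "this link lives elsewhere"

    def register_alias(alias, item_id):
        """Register a new current alias, demoting any previous one for the item."""
        for rec in aliases.values():
            if rec["item_id"] == item_id:
                rec["current"] = False          # old alias kept, now a redirect
        aliases[alias] = {"item_id": item_id, "current": True, "hits": 0}

    def resolve(alias):
        """Return (status, location) for an incoming request to /<alias>."""
        rec = aliases.get(alias)
        if rec is None:
            return 404, None
        rec["hits"] += 1                        # log hits to retired aliases too
        if rec["current"]:
            return 200, alias
        current = next(a for a, r in aliases.items()
                       if r["item_id"] == rec["item_id"] and r["current"])
        return 301, current                     # permanent redirect to current alias

    def invalidate(alias):
        """Let the owner retire an alias, unless that would break live links."""
        if aliases[alias]["hits"] >= RETIRE_HIT_THRESHOLD:
            raise ValueError("alias still receives traffic; refusing to invalidate")
        del aliases[alias]

    # Example: rename an item and watch the old URL 301 to the new one.
    register_alias("my-first-post", item_id=1)
    register_alias("my-renamed-post", item_id=1)
    print(resolve("my-first-post"))    # (301, 'my-renamed-post')
    print(resolve("my-renamed-post"))  # (200, 'my-renamed-post')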

I guess it's convoluted and adds extra overhead, but if you have pretty URLs (which, in my opinion, are something a website should aim for) you need to make sure they can't end up breaking the rest of the internet when they change. The easy solutions are pseudo-pretty URLs (e.g. website.com/123-pretty-url, where 123 is the ID and pretty-url is just an ignored string) or never allowing URLs to change at all, but I don't like either.

I wonder if any other websites have a good approach to this.


At Stack Overflow we use slugs for this approach,

http://stackoverflow.com/questions/427102

http://stackoverflow.com/questions/427102/what-is-a-slug

etc, will all redirect to the canonical:

http://stackoverflow.com/questions/427102/what-is-a-slug-in-...

If the title changes, we update the slug and redirect with a 301 to the new canonical.
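The check itself is simple; a sketch with made-up data (not our actual code):

    # Sketch of slug canonicalisation: look the question up by id and 301 any
    # request whose slug is missing or stale (illustrative data only).

    CANONICAL_SLUGS = {123: "current-question-title"}   # hypothetical lookup table

    def handle_question(question_id, requested_slug=None):
        canonical = CANONICAL_SLUGS.get(question_id)
        if canonical is None:
            return 404, None                             # unknown question
        path = f"/questions/{question_id}/{canonical}"
        if requested_slug != canonical:
            return 301, path                             # redirect to canonical URL
        return 200, path

    print(handle_question(123))                           # (301, '/questions/123/current-question-title')
    print(handle_question(123, "old-question-title"))     # (301, ...) after a title change
    print(handle_question(123, "current-question-title")) # (200, ...)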


I wouldn't recommend having the "pretty" part not validated. It can cause some serious issues with google & duplicate content, and if someone wants to be malicious they can create a bunch of fake urls that essentially point to the same page; or even worse, if they receive enough links they can be indexed at the new "fake" url. A similar thing happened to a newspaper's website but I can't recall which off the top of my head.

Another potential solution, and my preferred method: whenever a change is made that would affect the URL of a page, update a "legacy" table with the old URL and the location of the new one. The next time a 404 is about to be thrown, do a search against the database and redirect accordingly if a new URL is found. I rolled this approach into https://github.com/leonsmith/django-legacy-url and while it's not polished it's by far the easiest and probably the most automatic/maintainable solution I have found.
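In outline (a sketch of the idea, not django-legacy-url's actual API):

    # Sketch of the "legacy table" approach: record old->new URLs on change, and
    # consult the table before serving a 404.

    legacy_urls = {}  # old_path -> new_path

    def record_url_change(old_path, new_path):
        legacy_urls[old_path] = new_path
        # Collapse chains so earlier old paths point straight at the latest URL.
        for old, new in legacy_urls.items():
            if new == old_path:
                legacy_urls[old] = new_path

    def handle_404(path):
        """Called instead of immediately returning 404."""
        new_path = legacy_urls.get(path)
        if new_path is not None:
            return 301, new_path
        return 404, None

    record_url_change("/blog/old-title/", "/blog/new-title/")
    print(handle_404("/blog/old-title/"))   # (301, '/blog/new-title/')
    print(handle_404("/blog/missing/"))     # (404, None)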


"I wouldn't recommend having the "pretty" part not validated. It can cause some serious issues with google & duplicate content"

Not if you properly generate and apply canonical links :)


canonical links are only hints to google (all be it very strong) -- they always reserve the right to ignore them if they think a webmaster is shooting themselves in the foot, and that in itself is where the problem is. If I built up a few hundred links to example.com/1234-this-site-sucks I'm sure google would think that is the correct url rather than the canonical link version, example.com/1234-the-real-slug


FYI: the word you wanted is "albeit".


Your approach is exactly how Drupal does it, and it's one of the things I really felt that Drupal got right (and have mimicked in websites I have built since).


Julian Assange on Self Destructing Paper(http://web.archive.org/web/20071020051936/http://iq.org/):

The internet is self destructing paper. A place where anything written is soon destroyed by rapacious competition and the only preservation is to forever copy writing from sheet to sheet faster than they can burn.

If it's worth writing, it's worth keeping. If it can be kept, it might be worth writing. Would your store your brain in a startup company's vat? If you store your writing on a 3rd party site like blogger, livejournal or even on your own site, but in the complex format used by blog/wiki software de jour you will lose it forever as soon as hypersonic wings of internet labor flows direct people's energies elsewhere. For most information published on the internet, perhaps that is not a moment to soon, but how can the muse of originality soar when immolating transience brushes every feather?


"but in the complex format used by blog/wiki software de jour you will lose it forever"

Exactly. I don't understand why almost all blogging and CMS platforms store data only in a database. I think the proper solution for most sites would be to keep the DB for maintenance and indexing purposes, and serve everything to visitors from static files.
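As a sketch of what I mean (hypothetical table name and paths, in Python):

    # Sketch: keep the database as the source of truth, but render every post
    # to a plain HTML file so the content survives the CMS software itself.
    import pathlib
    import sqlite3

    OUT_DIR = pathlib.Path("public")        # hypothetical output directory

    def publish_all(db_path="site.db"):
        conn = sqlite3.connect(db_path)
        rows = conn.execute("SELECT slug, title, body FROM posts")  # assumed schema
        OUT_DIR.mkdir(exist_ok=True)
        for slug, title, body in rows:
            html = f"<html><head><title>{title}</title></head><body>{body}</body></html>"
            (OUT_DIR / f"{slug}.html").write_text(html, encoding="utf-8")
        conn.close()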


is there anything being done to create some persistence? i would imagine a service like this would be useful. ex. small fee (10c) to publish 10kb of unicode, available indefinitely

edit: found this: http://www.chronicleoflife.com/ ... but i was thinking something to publish instead of simple backup

edit2: probably the only company i could trust to pull this off (one-time fee for publishing static content) would be amazon. it fits really well with their core business (infrastructure), and amazon is very good with long term stuff


Any single service or provider is unreliable, and what is needed is not a breadth of services (like the sort of service one or two techies seized by a HN-related enthusiasm might start running) but a depth in time of services - long-term services.

Currently, I back up URLs I care about or link on my site to ~3 places: the Internet Archive, WebCite, and my hard drive ( http://www.gwern.net/Archiving%20URLs )
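The archiving step is easy to script; a sketch assuming the Wayback Machine's public save endpoint and a hypothetical local mirror directory:

    # Sketch: back a URL up to the Internet Archive and to local disk
    # (error handling and rate limiting omitted).
    import hashlib
    import pathlib
    import urllib.request

    ARCHIVE_DIR = pathlib.Path("url-archive")   # hypothetical local mirror

    def back_up(url):
        # Ask the Wayback Machine to take a snapshot.
        urllib.request.urlopen("https://web.archive.org/save/" + url)
        # Keep a local copy as well, named by a hash of the URL.
        body = urllib.request.urlopen(url).read()
        ARCHIVE_DIR.mkdir(exist_ok=True)
        name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
        (ARCHIVE_DIR / name).write_bytes(body)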


The proposed goal of The OpenPhoto Project is exactly that.

http://theopenphotoproject.org/

Others are Unhosted (http://unhosted.org/) and OwnCloud (http://owncloud.org/).


archive.org strives to help with this problem.


Specifically, for webpages there is the Wayback Machine: http://archive.org/web/web.php


I think part of the point he's making is to take matters into your own hands, rather than trusting it with a service that may very well disappear (including those that you pay for today).


What if you try to solve for the case where you yourself are not around anymore?

(Might take payments off the table as well, or at least have 10-year plans or something :-))


> Some sites like giving you no information in the URL

For me, one of the worst offenders in this category is YouTube. I can't understand why they don't put a slug with the video name in the canonical URL (especially since they have youtu.be for shortening URLs). It's really a pain to track down an old video in, say, an IRC log with only the opaque video ID.

Vimeo does the same thing. Dailymotion however does put a meaningful slug.


For me one of the worst offenders, given the audience and content, is HN.

This story for example - http://news.ycombinator.com/item?id=4077891.


Obligatory W3C link: Cool URIs don't change: http://www.w3.org/Provider/Style/URI.html

(note: this page's URL didn't change since at least 1999)


I find it funny that it is now shown as www.w3.org/Provider/Style/URI.html in one of my browsers, and http://www.w3.org/Provider/Style/URI.html in another browser. On my mobile it is shown as http://www.w3.org/Provider/Sty...


A variable called stuff? Seriously?

A shame, as the rest of the article is quite good, but that really flags to me that this is a little bit of cowboy code.

Also interesting to read some sites are taking a 'white-list' approach to robots.txt, as he says this is resulting in people starting to ignore it.


There you go, it is called items now. Personally I don't see much difference; this is a tiny demo class and the var is private, and you'd better understand it before copying and pasting anyway. Let's not argue about semicolons. The class is correct for multi-threaded access and the API is pretty much perfect for my needs.

Glad you liked the rest of the article, I hope this helps others


Starting to ignore robots.txt? Unfortunately, there's no way to prove that anyone respects it. A well-identified bot can back off, then return from a different IP address with a different User-Agent and attempt to mimic a human user. Webmasters really have no defense against policy violations. If you run a bot of any kind, including a link checker or SSL tester, please respect robots.txt. If not, be prepared to be identified as malicious and blocked by an IDS.
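The check is cheap to add; a sketch using Python's standard library (the user-agent string is made up):

    # Sketch: consult robots.txt before fetching, using the stdlib parser.
    from urllib import robotparser
    from urllib.parse import urlparse

    def allowed(url, user_agent="MyLinkChecker/1.0"):   # hypothetical UA string
        parts = urlparse(url)
        rp = robotparser.RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
        rp.read()
        return rp.can_fetch(user_agent, url)

    url = "https://example.com/some/page"   # hypothetical link to check
    if allowed(url):
        print("ok to fetch", url)
    else:
        print("robots.txt disallows", url)  # skip it; maybe drop the link, as I suggest below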


Let me add: Since the purpose of your bot is to verify links and protect/serve your users, consider removing the links from your site if robots.txt prohibits you from checking them. That's what I would prefer as a webmaster who explicitly set that policy on a site, since I have no control over who posts the links.


That doesn't make sense.

The point of blocking a link with robots.txt is to say "Hey, web crawlers, please don't load and index this page". It does not mean "Hey, users, please don't come and load and read this page".

So the script written, for all intents and purposes, is just the same as a regular old user clicking the link and reading the page then keeping a list of the links that work and those that don't. It's not a crawler, it's an automated user.

If you are a webmaster who wants to stop people from posting links to your page all around the web for others to come and read, make the page return a 403.


So stackoverflow should remove all links to github? Unwise.


I like the profuse amounts of comments, especially when documenting a function's purpose.


I work for stuff.co.nz (biggest news website in New Zealand). You wouldn't believe the amount of 'stuff' in our codebase. :)


I didn't even notice the stuff variable, but the lock(this) really irked me.


Why? I'm not familiar with C# idioms, but it looks like a sane way to ensure that "items" and "expireOverride" are only modified by a single thread at a time.


For mismanaged sites where the site owner changed URLs and could have added proper redirects but instead chose to just show 404s for all of them (the article mentions github and java as examples, but there are countless more), there should be a wikipedia-style community-driven reference project with better redirects. Is anybody working in this direction?


Semi-OT:

On the subject of GitHub's robots.txt[0], would anyone have a guess at why this particular repo[1] is singled out?

[0] https://github.com/robots.txt

[1] https://github.com/ekansa/Open-Context-Data


It could be a honeypot. Any robot that crawls that URL gets auto-banned.


What are the common causes of broken links?

Seems unavoidable on large sites.


I guess some common ones are,

1. sys-admin reorganisation, moving content from one spot to another without redirects in place.

2. developer reorganisation, for example moving from "confusing" urls to "slug" based urls without adding redirects

3. fragile content, content that moves depending on external changes (beta to release for example)

4. products being retired or companies getting acquired

5. hackers messing stuff up in a way that cannot be fully repaired (or a non-recoverable data loss)


1. people get tired of hosting blogs on dreamhost

2. list mirrors like nabble do wholesale migrations without redirects (Google Groups is going through this now with its new format, but with redirects, I hope):

    groups.google.com/forum/#!msg
3. wikis get pages duped and branched to the point where their marginal utility is 0, so the sponsor decides to start over.


Just a guess, but I'd think the most common cause of broken links is the original administrator retiring that whole branch of the site (or the site itself). Often the domain still works but that old stuff has been thrown away.

Often it's clear that what's there now is a lot better or more professional, so you can see why the person didn't feel like messing up the site with the old stuff - but that old stuff is still gone.

Other times the domain is gone, since the person or people have moved on to doing a lot better stuff and stopped maintaining that old site - "why bother." People change - a web site isn't something you publish once, it's something you publish every time your server answers an http request. Would you keep publishing everything you wrote 10 years ago?


"just ignore robots.txt?"

how about "fuck you"? I guess it's high time to make honeypots, tarpits and bans common practice.


How about reading http://meta.stackoverflow.com/questions/132675/validating-th... before swearing at me.

No, we are not going to ban all the links from GitHub on our site cause shitty WEB CRAWLERS forced GitHub to use a white-list based approach. This is not WEB CRAWLING. It is link validating. We are not crawling in the sense of building a huge tree of links. We are testing that the external links on our sites work. If we are not allowed to test them, why are our users allowed to click them? Are we not committing an even greater crime by allowing these links on our web site?

"The Robot Exclusion Standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention to prevent cooperating web crawlers and other web robots from accessing all or part of a website which is otherwise publicly viewable."

The convention is a best-effort thing. We tried to respect it, but doing so was AGAINST what the authors of the robots.txt file at GitHub intended, AND the spec is advisory, not an IETF RFC. If it were an RFC then some smart people would review it and turn it into something sane and usable that deals with this exact use case.

And you know what, an RFC would NEVER pass for robots.txt as it is now cause the white-listing potential is anti Internet. Why should Google and Bing be the only parties who are allowed to discover content on the Internet? User agent restrictions are completely evil, wrong and backwards.

Sorry to shatter your imaginary delusion of what you think the Internet is.


"before swearing at me."

Followed by "shitty" and "web crawlers" in all caps ^^ Someone ate a clown for breakfast, I see.

"No, we are not going to ban all the links from GitHub"

What? Slow down there -- why would you care about invalid links? Did you just say that you can't possibly allow users to post links, as long as you don't know they work for automatic crawlers, not just for human visitors? And someone else chimed in saying you give arguments? Heh.

Well, you give an attempt at one, with "we are not crawling in the sense of", and then refute it yourself with the bit you quoted: "web crawlers and other web robots". It's not called webscraper.txt, it's robots.txt, period.

So how then would a website determine a rogue user agent? You dress up like the slimy guys, you get the banhammer -- what do you expect? If you care so much about Facebook and Twitter "content" that it is worth it for you to be indistinguishable from attackers, then just cope with it. But don't pout at me, just eat up what you ordered.

And what delusions about the internet? You just beat around the bush and then finish with that strawman? And what is an "imaginary delusion", by the way? The one you imagine I have? Now that's a Freudian slip if I ever saw one ^^

"Completely evil, wrong and backwards"... so... You're entitled to know the validity of links posted on your site, but website owners aren't allowed to care about their resources and who they offer them to? Who's deluded?


"Slow down there -- why would you care about invalid links?"

Because Stack Overflow is a site whose purpose is to answer questions. People may provide links when asking or answering a question, and those links may be important in understanding either the question or the answer. So invalid links degrade the value of the site.

What they're doing is fundamentally different than web crawling. Web crawlers are about discovering content. That means starting at a root and crawling out to see what you can find. One URL can spawn many more URLs to look at. They are starting with a known URL, and seeing if they can visit that URL. They have one URL, and only visit one URL.
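In code, the difference is roughly this (a sketch, not their actual validator): a validator takes a fixed list of URLs, records a status for each, and never enqueues anything new.

    # Sketch: a link validator fetches exactly the URLs it was given and records
    # their status; it never follows discovered links the way a crawler does.
    import urllib.request
    import urllib.error

    def check_links(urls, timeout=10):
        results = {}
        for url in urls:
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    results[url] = resp.status        # e.g. 200
            except urllib.error.HTTPError as err:
                results[url] = err.code               # e.g. 404, 410
            except (urllib.error.URLError, OSError):
                results[url] = None                   # DNS failure, timeout, etc.
        return results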


He explains how and why in the article, and gives arguments. You do none.

The problem with whitelist-only robots.txt is that they favor monopolies and startups are the ones getting the "fuck you". But maybe you don't care about that.


As a webmaster, why would I want bots that don't bring any (or much) traffic to come to my site?


Why would your users post the addresses of your honeypots and tarpits to a 3rd party website?


What? I simply meant that if some brainiacs think robots.txt can just be disregarded, it's time to make it a minimum requirement of every self-respecting webmaster to make a tarpit (disallowed in robots.txt) and ban any and all bots going there. You would exactly NOT want a human visitor to post, or ever see, such a link. So yeah, it wouldn't even apply to this github thing, but don't tell that other guy about it.

These are supposedly good guys. So my reaction was "You gotta be fucking kidding?! You didn't just say that it's inconvenient how some sites use robots.txt, so you just throw it out altogether for your precious little bot and epically important link-checking quest. No wait, you did. Oh well then, BYE."

Oh well. I guess this is hack news, not hacker news, my bad :P


The sort of tarpit you're talking about wouldn't even affect this link validator. You really think Stack Exchange should have given up on validating links because Github's robots.txt has:

User-agent: *

Disallow: /

in it?


They could ask for Github's permission.



Would link validation be OK if I manually went through and clicked every single link by hand, and used a pen-and-paper tally of which ones worked and which ones didn't?

What's the difference?



