Hash URIs

0xbadcafebee · on March 7, 2011

Does nobody care that this makes the world-wide-web completely pointless from any perspective that isn't a web browser with JavaScript? I'm going to be royally pissed if I have to write from scratch an http library in C with JavaScript support just to make a friggin' scraper script or debug a website/webserver.

If you want to be annoyingly fancy with the way you deliver content, just do it to user agents that support JavaScript. For any other user agent just provide the actual content we wanted! To do this, all you have to do is not use hash tags. The URI can remain the same and the application will work just fine (JS can still check the friggin query uri and do any Ajax trickery it wants). Sometimes I think webapp designers are just trying to piss hackers off.

We'll ignore the fact that having to escape the bang is really annoying, considering the page is useless from the console anyway due to the aforementioned requirement of JavaScript.

tptacek · on March 7, 2011

You don't need to embed Javascript to solve this problem, which has confronted web security products since the invention of "Ajax". Yes, the clientside is evaluating JS to figure out what to load, but once you know how those loads work, nothing stops you from simply making the "backend" requests directly.

We can nerd out till the sun goes down on how feasible it is to automatically figure out what links to load, but we spend the better part of many people's time every week dealing with this, and it rarely creates a major problem. Actually scraping tag soup is by far the more annoying issue.

chc · on March 7, 2011

I'm not sure reading and fully comprehending the flow of the complete JavaScript source for every site you might wish to scrape is actually easier than making a WebKit-based scraper that runs load and click handlers.

mdaniel · on March 7, 2011

Relevant: http://code.google.com/p/phantomjs/ is a headless browser based on WebKit

I don't have any hands-on experience with it, but if one were to go down the path you just described, that project would likely be a great start on that journey.

aristus · on March 7, 2011

I think hashbang URLs are a temporary phase, one abstraction poking out into another, like a sharp cliff being pushed up by the forces below. Eventually it will sort itself out. No one is breaking the internet forever and ever.

I agree with you that everyone should have a 1:1 mapping between "real" urls and hashbang state fragments, eg /foo/bar/1234 == /foo#!/bar/1234

The purpose behind all this is that you want to separate delivery of the "application" from state transitions of the application. Changing the hashbang avoids a roundtrip to the server, while providing deep links into application state.

Hashbangs are not ideal, but they work for now (see 1:1 mapping comment above). There are new History rewriting features (and libraries to take advantage of them) that can make all this ugly jiggery-pokery invisible to the user, not to mention headless clients.

colinsidoti · on March 7, 2011

With Hash URIs, I have an opportunity to create an application with a better experience for my users. In taking advantage of this opportunity, I also get to put less load on my servers. That's a win-win...bi-winning if you will.

But wait, now you tell me that by using these Hash URIs, I also annoy hackers who are attempting to use my application in an unintended manner... That's more winning than Charlie Sheen could shake a stick at.

No, I don't care that you cannot use wget to browse my application. I built it to work in a browser, and it works darn well in a browser. If I really want you to have easy access to the data within the application, I'll give you an API.

0xbadcafebee · on March 7, 2011

A web developer walks into a bar.

Bartender says, "why so glum?"

Web developer says "my app's got some bug, the site's acting weird and I can't tell what's going wrong."

Bartender says "so? that's nothing to be sad about! just view the source and-"

"I can't", said the web developer, "my application's such a mess of random includes and calls it's hard to tell if it's a bug in the browser or my code or something else."

Bartender says "so just ask someone else like a sysadmin or sysengineer to help debug it..."

"Nope", says the web developer. "They don't know the code, and when they try to query the app it just returns a few lines of JavaScript. They can't see any error."

"Hey, i'm just a bartender, but maybe if you optimized your application in a way that made it easier to troubleshoot you wouldn't be in this pickle."

--

SEO consultant walks into a bar.

Bartender asks, "what'll you have?"

"I'll have a double of JB, neat."

"Whoa! Woman troubles got you down?"

"Nope. This damn application i'm trying to get more visibility for has no content for crawlers. It's all a mess of funky quirky browser optimization, but no content for people to search for. It's like the page doesn't exist!"

"Whoa. I'll get you the bottle."

--

A blind man walks into a bar.

thud

Bartender says "hey buddy, watch where you're goin'. What'sa matter, you drunk or something?"

"Yep", the blind man says. "I can't use the internet anymore because everything's based on JavaScript and my screen readers and accessibility tools don't work, so I just drink instead of getting work done on the internet."

"Join the club", says the bartender, and points to all the other people whose lives are now more annoying because somebody thought it was neat to use a hashbang and hide content behind JavaScript parsers instead of making the web easier to work with.

(ok, i suck at writing. but you get my point)

steadicat · on March 7, 2011

Don't forget that the content that JavaScript displays comes from somewhere, typically an HTTP API, which you can access with any HTTP library the same way JavaScript does.

If a JS web site also provides you with a clean API to the actual raw content in a format that is even easier to consume than HTML (e.g. JSON), would that make you happy?

0xbadcafebee · on March 7, 2011

You mean, if every website in the world provided a standardized API that any HTTP-querying application would instantly deduce, understand and begin to process, rebuilding the page as is output to a user on a browser after constructing its markup after JavaScript parsing, would I be happy?

Sure. But i'd rather not discuss ridiculous things which will never happen.

I don't want an API interface to a web page. I want the web page. More specifically, I want the content; I could give a crap about markup if my application is not a web browser of some kind. There's probably a thousand different uses of applications that just get content from web pages and none of them try to deduce an API to call to get an idea of what a webpage's content was supposed to be. They just expect that, you know, when querying a webpage they're going to get the webpage, and not some cryptic scripting.

I should not have to reverse engineer a webpage or read an API spec in order to wget it. This is ridiculous.

btipling · on March 7, 2011

At Cloudkick we use the hash bang on our new overview. The hash bang URI represents the application state or view. It is a UI tool that helps provides the user with an additional, browser friendly, means to navigate the application. The information in the hash bang is also private to the account and irrelevant to the public which can't access the same information. The URI portion without the hash bang is simply the application's address. The server doesn't need to know about application's UI, all it does is deliver the application. Many sites make the server part of the user facing application with server side templates and scripts but we are moving away from that. The UI will be in the client and server doesn't need to know about UI state. If your server is part of the UI then don't use the hash bang convention as the server can't make sense of it.

lzm · on March 7, 2011

I don't understand why Cloudkick uses a hashbang instead of just a hash. Is my overview being indexed by Google?

btipling · on March 10, 2011

Your overview isn't, but we've built this into our library of tools that we use for history management, and maybe we will build something like our provider portal that would be.

MatthewPhillips · on March 7, 2011

It's part of 2 general trends that are taking place recently.

1) A move away from server-side programming. Server is now used to serve static files and json data. Not to dynamically create pages. Application logic is not done in javascript.

2) The emergence of high-level javascript frameworks that do most of the blunt work. These frameworks are server-language agnostic and hence do all of the application scripting in browser. JQuery Mobile and Mobl come to mind here.

davnola · on March 7, 2011

Great article. Recommends building on conventional hash-less URIs and using progressive enhancement e.g onClick handlers to implement hash-bang URIs.

Three particularly interesting ideas:

* sites should support the _escaped_fragment_ query parameter, the result of which should be a redirection to the appropriate hashless URI

* if you’re serving large amounts of document-oriented content through hash-bang URIs, consider swapping things around and having hashless URIs for the content that then transclude in the large headers, footers and side bars that form the static part of your site [can't envisage how this would work]

* "I suspect we might find a third class of service arise: trusted third-party proxies using headless browsers to construct static versions of pages without storing either data or application logic"

moe · on March 7, 2011

If you're starting on a new app today then I'd suggest to embrace the history-API instead of entering the world of pain that is hashtags.

http://caniuse.com/#search=history

Adoption will sky-rocket when Firefox 4.0 rolls out, as that is the last major browser to not support it yet.

Oh yes, and then there's IE... This is where my suggestion becomes opinionated: Screw IE and display a polite message to the remaining IE-visitors.

Skalman · on March 7, 2011

Well... I don't see how ignoring IE would change that much for those visitors, if the pushState is used only as progressive enhancement. I probably wouldn't even display a message to them.

On the other hand, if dynamic changes are essential, we'll have to support hashbangs too.

chris_j · on March 7, 2011

I knew nothing of the HTML 5 history API until I read this. I found this page in the Mozilla docs to be fairly informative:

https://developer.mozilla.org/en/DOM/Manipulating_the_browse...

I look forward to the day when this API is supported by all the major browsers. In the meantime, the article suggests a sensible-looking set of steps for gracefully degrading.(and using hash-bang URLs as little as you can).

The frustrating thing about all this is that, until HTML 5, there hasn't been any way of causing the contents of the URL bar to change to an arbitrary URL within your site without reloading the page.

keefe · on March 7, 2011

I don't think there's any particular reason to use a hash-bang style to get the same functionality - /content/uuid could be parsed as easily.

maybe I missed something by skimming the article... I was initially hoping to read something about using one way functions to derive UUIDs from document content :/

diamondsea · on March 7, 2011

I think the whole method that these sites use to deliver content is completely backwards, at least from the user perspective.

The goal of going to a web site is ultimately to get the content you are looking for. If you're going to Twitter, it's to read someone's twitter stream, etc.

The way these hashbang sites function is to display the least-relevant (to the user) information first, all the chrome and ads of the page. Then, when all the stuff you don't actually care about is done rendering, only then does it go and get the content you came to see.

This drives me nuts as a user. It is particularly annoying on slower devices, such as Edge or 3G smartphones, where you see the lag even more strikingly.

From a user-experience, content-oriented framework, this entire architecture should be reversed. The URI should point to the content, not the presentation as it does with Hashbang. The content should be served first, in a readable format, and then the javascript magic should kick in, download all the chrome in the background and then update the page once with all the other elements.

If you're going to have your page do a secondary load anyway (as hashbang does now) you might as well make the content the front-and-center part.

cabalamat · on March 7, 2011

> If you're going to Twitter, it's to read someone's twitter stream, etc.

You may be interested in the twitter-like website I'm currently developing. It currently uses no JS at all (it will do so, for e.g. writing new messages, but the entire functionality will work without JS).

mckoss · on March 7, 2011

I am hearing a lot of vitriol against the use of the hash uri (or hash-bang uri) pattern. While I agree that it does make a slightly different set of assumptions about the browser model, I don't think that application developers are employing it just because it is a " cool new thing".

In many cases, the hash-uri enables a new class of applications and system architectures that were not previously possible on the web. Naked (hash less) uri's require that all state transitions round-trip to the server. This isn't at all desirable if you want to support offline, or high latency (mobile) clients.

I am in agreement that we want to retain the link structure of the web. But we also do want to not freeze the application architecture of the web to 1997. I think this post had some great recommendations about implementing a hash-bang based site, and still "playing nice" with a diversity of client assumptions.

wvenable · on March 7, 2011

> I don't think that application developers are employing it just because it is a " cool new thing"

Have you seen the Gawker media redesign of lifehacker, Gizmodo, etc? They appear to be using the hash-bang for all links for no good reason at all. So yes, some big sites are employing it because it's a cool new thing.

mckoss · on March 7, 2011

That may be. But it's pretty hard to ascribe intention simply by looking at their site. While what they are achieving could be implemented w/o the use of hash-bang URIs, it may be that they have reasons that are not readily apparent.

Whether you agree with them or not, there was a lot of thought put into the Gawker redesign:

http://lifehacker.com/#!5701749

I do note that they (sometimes) avoid redrawing the right-hand column, when I switch between Gizmodo articles, since they don't refresh the whole page. Even though they still have dozens of server round trips for this kind of transition, they seem to flow in very quickly and the page transitions are actually quite smooth. This would not have been possible with a standard page refresh (which would re-anchor the browser view to the top of page, regardless of the user's current scroll state).

I'm not saying they are taking optimal advantage of the hash-bang pattern, but it does allow them some user experience optimizations that they could not get without it.

wvenable · on March 7, 2011

The links could still be rendered as regular urls and onclick handlers used to dynamically update the page and set the fragment for bookmarks.

mckoss · on March 7, 2011

Isn't that what they are doing?

wvenable · on March 7, 2011

No, all links are hashbanged. Probably simplifies the implementation a bit because they only need one method of loading pages (Ajax) but I don't think it's particularly friendly.

mckoss · on March 8, 2011

I think the two methods result in the exact same thing. Especially since the crawler won't see any raw HTML pages anyway - all links are hashbanged, and have to be crawled via the Google parameter rewriting scheme.

wvenable · on March 8, 2011

Except non-JavaScript (or limited JS) browsers see nothing at all. Graceful degradation in this situation isn't all that difficult, they just chose not to do it.

svv · on March 7, 2011

Fragment identifiers are currently the only part of a URI that can be changed without causing a browser to refresh the page

This is actually the primary reason for most of the hash and hash-bang uris in today's javascript-heavy sites. Once all the popular browsers allow changing URI without reloading the page, the whole issue will become irrelevant.

I am actually a big fan of HTTP and REST (the real one, with HATEOAS), but browsers have a long history of making it hard to write RESTful web sites. HTTP methods had long been restricted to GET/POST; code-on-demand support before XHR had been very rudimentary (javascript was almost useless, and I don't even want to mention java applets); browser addressing conventions led to horrible (from the REST point of view) practices like redirect-after-POST etc.

rwmj · on March 7, 2011

Good article. Although instead of writing "A browser that doesn’t support Javascript ..." (implying some failure or deficiency in the browser) he might have written "For users who selectively disable Javascript ..."

snorkel · on March 7, 2011

There's also millions of low-end mobile devices that can reach the web but don't support javascript.

simonw · on March 7, 2011

She might have written

Sukotto · on March 7, 2011

A lot of visually impaired people disable javascript too since it usually doesn't play nice with the screen reader.

jonespen · on March 7, 2011

According to this survey (http://webaim.org/projects/screenreadersurvey3/), 98.4% had enabled js.

jdavid · on March 8, 2011

There are a number of data points these days behind the social wall that will never have access to a crawler.

amitraman1 · on March 7, 2011

Keep it simple.

Hopefully they did not use AJAX to serve static content.