Does nobody care that this makes the world-wide-web completely pointless from an...

tptacek · on March 7, 2011

You don't need to embed Javascript to solve this problem, which has confronted web security products since the invention of "Ajax". Yes, the clientside is evaluating JS to figure out what to load, but once you know how those loads work, nothing stops you from simply making the "backend" requests directly.

We can nerd out till the sun goes down on how feasible it is to automatically figure out what links to load, but we spend the better part of many people's time every week dealing with this, and it rarely creates a major problem. Actually scraping tag soup is by far the more annoying issue.

chc · on March 7, 2011

I'm not sure reading and fully comprehending the flow of the complete JavaScript source for every site you might wish to scrape is actually easier than making a WebKit-based scraper that runs load and click handlers.

mdaniel · on March 7, 2011

Relevant: http://code.google.com/p/phantomjs/ is a headless browser based on WebKit

I don't have any hands-on experience with it, but if one were to go down the path you just described, that project would likely be a great start on that journey.

aristus · on March 7, 2011

I think hashbang URLs are a temporary phase, one abstraction poking out into another, like a sharp cliff being pushed up by the forces below. Eventually it will sort itself out. No one is breaking the internet forever and ever.

I agree with you that everyone should have a 1:1 mapping between "real" urls and hashbang state fragments, eg /foo/bar/1234 == /foo#!/bar/1234

The purpose behind all this is that you want to separate delivery of the "application" from state transitions of the application. Changing the hashbang avoids a roundtrip to the server, while providing deep links into application state.

Hashbangs are not ideal, but they work for now (see 1:1 mapping comment above). There are new History rewriting features (and libraries to take advantage of them) that can make all this ugly jiggery-pokery invisible to the user, not to mention headless clients.

colinsidoti · on March 7, 2011

With Hash URIs, I have an opportunity to create an application with a better experience for my users. In taking advantage of this opportunity, I also get to put less load on my servers. That's a win-win...bi-winning if you will.

But wait, now you tell me that by using these Hash URIs, I also annoy hackers who are attempting to use my application in an unintended manner... That's more winning than Charlie Sheen could shake a stick at.

No, I don't care that you cannot use wget to browse my application. I built it to work in a browser, and it works darn well in a browser. If I really want you to have easy access to the data within the application, I'll give you an API.

0xbadcafebee · on March 7, 2011

A web developer walks into a bar.

Bartender says, "why so glum?"

Web developer says "my app's got some bug, the site's acting weird and I can't tell what's going wrong."

Bartender says "so? that's nothing to be sad about! just view the source and-"

"I can't", said the web developer, "my application's such a mess of random includes and calls it's hard to tell if it's a bug in the browser or my code or something else."

Bartender says "so just ask someone else like a sysadmin or sysengineer to help debug it..."

"Nope", says the web developer. "They don't know the code, and when they try to query the app it just returns a few lines of JavaScript. They can't see any error."

"Hey, i'm just a bartender, but maybe if you optimized your application in a way that made it easier to troubleshoot you wouldn't be in this pickle."

--

SEO consultant walks into a bar.

Bartender asks, "what'll you have?"

"I'll have a double of JB, neat."

"Whoa! Woman troubles got you down?"

"Nope. This damn application i'm trying to get more visibility for has no content for crawlers. It's all a mess of funky quirky browser optimization, but no content for people to search for. It's like the page doesn't exist!"

"Whoa. I'll get you the bottle."

--

A blind man walks into a bar.

thud

Bartender says "hey buddy, watch where you're goin'. What'sa matter, you drunk or something?"

"Yep", the blind man says. "I can't use the internet anymore because everything's based on JavaScript and my screen readers and accessibility tools don't work, so I just drink instead of getting work done on the internet."

"Join the club", says the bartender, and points to all the other people whose lives are now more annoying because somebody thought it was neat to use a hashbang and hide content behind JavaScript parsers instead of making the web easier to work with.

(ok, i suck at writing. but you get my point)

steadicat · on March 7, 2011

Don't forget that the content that JavaScript displays comes from somewhere, typically an HTTP API, which you can access with any HTTP library the same way JavaScript does.

If a JS web site also provides you with a clean API to the actual raw content in a format that is even easier to consume than HTML (e.g. JSON), would that make you happy?

0xbadcafebee · on March 7, 2011

You mean, if every website in the world provided a standardized API that any HTTP-querying application would instantly deduce, understand and begin to process, rebuilding the page as is output to a user on a browser after constructing its markup after JavaScript parsing, would I be happy?

Sure. But i'd rather not discuss ridiculous things which will never happen.

I don't want an API interface to a web page. I want the web page. More specifically, I want the content; I could give a crap about markup if my application is not a web browser of some kind. There's probably a thousand different uses of applications that just get content from web pages and none of them try to deduce an API to call to get an idea of what a webpage's content was supposed to be. They just expect that, you know, when querying a webpage they're going to get the webpage, and not some cryptic scripting.

I should not have to reverse engineer a webpage or read an API spec in order to wget it. This is ridiculous.