Probably the worst URL scheme ever (bvb.de)
113 points by jakub_g on May 2, 2013 | 100 comments



LinkedIn URLs are by far the worst. For example, the first profile that came up when I searched for Paul Graham:

http://www.linkedin.com/profile/view?id=23081590&authTyp...


Some poor PI is going to freak out when he checks his 'recently viewed your profile' stats.


Yes, all those *2s could easily have been optimised into a left shift operator.


Oh so the *2s have some sort of meaning beyond random URL garbage? Please explain! I've always been curious.


No, I think it is just a joke: if you parse "*2" as "times two" you could optimize it in code by shifting the bits left since bit shifting operations are cheaper than multiplication.

    #b0001 = 1
    #b0010 = 2
    #b0100 = 4
    #b1000 = 8


Haha thanks for the explanation.


And to continue on this tangent, if your code is multiplying by the constant 2 (e.g. "x = z * 2") the compiler's probably going to optimize that into a shift anyway, so just keep the "* 2" to keep future human readers of the code happy.
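
A quick sanity check of that equivalence, as a Python sketch (the value is arbitrary):

    x = 21
    assert x * 2 == x << 1   # multiplying by 2 is the same as shifting left by 1
    assert x * 8 == x << 3   # in general, x * 2**n == x << n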


I was just going to say that! Before I saw this post I was on LinkedIn looking for something, and you simply cannot miss when something like this shows in your address bar:

...gid=3396514&goback=%2Enpv_152562310_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_*1&trk=NUS_JGRP-grp-nm


You can have a public LinkedIn URL in the format /in/CustomName

E.g.: www.linkedin.com/in/barackobama


Too bad that seems to be only an external reference method; the site never uses those URLs internally.


Yeah, I want to know what they're doing internally with these insane URLs. It's just ugly -- poor craftsmanship IMHO.


Yeah, nothing screams good craftsmanship like armchair-quarterbacking someone else's work that clearly works well.

FWIW, it seems to be some sort of encoded state, essentially HATEOAS.


Works well?

"Oh you want the URL of your LinkedIn Profile? Don't use the URL in the address bar, silly! Use this other URL we are providing randomly down the page."


I was referring to the URL-scheme of the LinkedIn app in general, not just the profile page - I somehow missed that this subthread was only about the profile page. Yes, I agree they dropped the ball completely on that. Especially tragic as they could fix the 80% case with a front-end one-liner (window.history.replaceState).


Apparently he is a 3rd connection. o.O


The elephant in the room here of course is this: https://news.ycombinator.com/x?fnid=b7VO4wED8MRumCeiX5fCnF


I never understood why HN has such a peculiar URL for accessing pages. It times out after a while too, is that to stop crawlers?


They are ids to look up closures in a database. They time out to stop the database overflowing ;) It's called continuation-based web development [1], popular with Lisp and Smalltalk-based web servers (because who else has continuations?)

[1] http://en.wikipedia.org/wiki/Continuation#In_Web_development


There's no database.

EDIT: to all the people arguing with me: read the source code to Hacker News.

    (= fns* (table) fnids* nil timed-fnids* nil)

    ; count on huge (expt 64 10) size of fnid space to avoid clashes

    (def new-fnid ()
      (check (sym (rand-string 10)) ~fns* (new-fnid)))

    (def fnid (f)
      (atlet key (new-fnid)
        (= (fns* key) f)
        (push key fnids*)
        key))

    (mac afnid (f)
      `(atlet it (new-fnid)
         (= (fns* it) ,f)
         (push it fnids*)
         it))
They are in memory. Which is why they expire randomly when the HN process is restarted.
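
For illustration only (this is not HN's code; the Arc above is), here is a minimal Python sketch of the same idea, with made-up names:

    import random, string

    fns = {}  # fnid -> closure, kept only in this process's memory

    def new_fnid():
        return ''.join(random.choices(string.ascii_letters + string.digits, k=10))

    def fnid(f):
        key = new_fnid()
        fns[key] = f
        return key

    # Generating a "More" link captures the current page state in a closure:
    link = '/x?fnid=' + fnid(lambda: 'next page of items, starting after item 42')

    # Following the link later just looks the closure up; a pruned table or a
    # restarted process means "Unknown or expired link".
    def handle(key):
        f = fns.get(key)
        return f() if f else 'Unknown or expired link'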


That's probably true, John. If I may misquote Greenspun:

"Any sufficiently complicated Lisp program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of an actual database."


Depends on what you consider a database.


The fnids do not expire randomly due to restarts; they expire when there are too many or they time out, so memory doesn't fill up with these continuations. Personally, I don't like this continuation-based approach, since "Unknown or expired link" is a really bad user experience.

Way back I wrote a bunch of documentation on the Arc web server, if you want details: http://www.arcfn.com/doc/srv.html (look at harvest-fnids, which expires the fnids).

The fnid is an id that references the appropriate continuation function in an Arc table. The basic idea is that when you click something, such as "more" or "add comment", the server ends up with the same state it had when it generated the page for you, because the state was stored in a continuation. (Note that these are not Scheme's ccc first-class continuations, but basically callbacks with closures.)

(The HN server is written in Arc, which runs on top of Racket (formerly known as PLT Scheme or mzscheme))

Edit: submitted in multiple parts to avoid expired fnids. Even so, I still hit the error during submission, which seems sort of ironic.


There's always a database.


Racket (Arc's host language) keeps continuations on the filesystem, or you can write your own "stuffer" to do what you want with them (store them in a database or whatever). But you have to keep them somewhere, or else (assuming the server uses continuations) you can't keep track of the user's path through your code as they click through links and such.

Racket does have an option to serialize the continuations, gzip them, sign them with HMAC, and then send all of that to the client so the server doesn't have to keep track of anything, but HN doesn't use it.

See http://docs.racket-lang.org/continue/#(part._.Advanced_.Cont... for a quick introduction.


HN doesn't use Racket. It's a custom Lisp based on Scheme.


Sure it does. HN is written in "Arc":

https://github.com/wting/hackernews

Arc runs on Racket:

http://en.wikipedia.org/wiki/Arc_(programming_language)

See the "OS" section where it says "runs on the Racket compiler"

See also the Arc source code at https://github.com/Pauan/ar/blob/arc/nu/arc (note the "#lang racket" at the top).


Then where is the data that is associated with "b7VO4wED8MRumCeiX5fCnF" stored? How is that data requested? There certainly is a database; it's just most likely not the traditional kind most people think of.


where is the data that is associated with "b7VO4wED8MRumCeiX5fCnF" stored

In the Racket process that's running the Arc code for news.yc


Oh, wow. I had assumed that people who visited around the same time got the same next page URL, maybe as part of a caching strategy or something.

This way seems impractical, TBH. Certainly for the user: the expiration is a bit of a nuisance, as I'll get it more often than not if I read a couple of stories and then click 'More'.


I believe it's a continuation-style server—hence fnid: function id—and the continuations are only kept around in memory for ~5 minutes.


I never understood why the continuation couldn't instead be addressed by a URL path. It could even get constructed from URL/query data if it moves out of memory, so keeping them in memory would only be a caching mechanism.


Every continuation framework I've looked at radiates an intense desire to treat the web as something other than what it actually is.


Yes, but the way HN works, the continuations map URIs to browser sessions so they can be expired on a more granular basis than all at once or LRU or what have you. I'm guessing here, though; I've not looked through the code.

[addendum] I reread your comment and realized that wasn't what you meant at all. What would be the benefit of path-based URIs over query-string params in HN's case? I only see how they'd be equal, not better.


I meant that the continuation could be identified by (session_id, URL data), and re-created based on this data if it goes missing.

URL data can come from the path or the query string; it doesn't matter.


True. This hash should be appended like so:

https://news.ycombinator.com/page3/x?fnid=b7VO4wED8MRumCeiX5...


I think THOMAS (http://thomas.loc.gov/), the search engine provided by the US Library of Congress for searching federal legislation, has the worst URLs I've seen. Here's a random one:

  http://thomas.loc.gov/cgi-bin/bdquery/D?d113:1:./temp/~bdGqLa:@@@T|/home/LegislativeData.php|

And here's a link I got to the Patriot Act (HR 3162):

  http://thomas.loc.gov/cgi-bin/query/D?c107:44:./temp/~c107DgA33R::


While it's been reskinned, THOMAS dates back to 1995, and the core reflects a much earlier era of web development. It's slowly being replaced by congress.gov, which has far more palatable URLs:

http://beta.congress.gov/bill/107th-congress/house-bill/3162


Note that these are not even canonical URLs. Both of them now fail with a "Search Timed Out" error.

The US Trademark Office has the same issue: there is no way to link to a particular trademark, because the only access is via the search engine and queries time out, e.g.: http://tess2.uspto.gov/bin/showfield?f=doc&state=4802:xt...


It's clearly a set of arguments (separated by ':') to a function, i.e. it's an RPC invocation. The first seems to be the session of Congress, 113th and 107th. The second, perhaps the nth item in that collection? The third, a temporary filename, probably where the search result is stored so you don't have to re-run the search when you hit "back". The last bit, probably a later addition that manages breadcrumb navigation.
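
If that guess is right, pulling the arguments apart is trivial; a Python sketch against the Patriot Act URL above:

    from urllib.parse import urlparse

    url = 'http://thomas.loc.gov/cgi-bin/query/D?c107:44:./temp/~c107DgA33R::'
    print(urlparse(url).query.split(':'))
    # ['c107', '44', './temp/~c107DgA33R', '', '']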


Making sense of that in linear time would be a great interview question.


The ./temp/~c107DgA33R bit looks like a reference to a cached internal state of the system, so you can probably make about as much sense of it as you can of https://news.ycombinator.com/x?fnid=H1QJE8EOaO2OkA28owXZ4H.


Why? Surely you'd have to go out of your way to make it that unnecessarily complicated.


I have often thought that the THOMAS system was an overt attempt at obfuscating government activities. It's such a pain to use.


I have seen worse.

E.g. sites that show one page with the URL http://www.example.com, another page with the URL http://www.example.com, and, even after another click, the URL http://www.example.com with completely new content.

And sites that use logic like this: http://www.example.com/357893857435/sfjsfsfsfd/this-should-b... where http://www.example.com/357893857435/sfjsfsfsfd/this-is-shoul... and http://www.example.com/357893857435/sfjsfsfsfd/tHIS-is-the-s... show the same page, and of course http://www.example.com/357893857435/sfjsfsfsfd also shows the same page.

Oh, and cases where http://www.example.com/click1/click2/click3/item-id/123 shows the same page as http://www.example.com/click1/item-id/123, which shows the same page as http://www.example.com/click1/click2/click3/click4/item-id/1...

All of the examples above are far worse than bvb.de.


Quite frankly, I despise the use of "SEO URLs" altogether. It's basically a waste of bytes.


Sometimes, sometimes not. You can often compress a long &parameter=value string into a simple /value/ rewrite. This, by most accounts, makes it more SEO-friendly. Granted, putting full sentences that match an article title is never necessary, but there are a lot of SEO tricks that can make a URL not only "nicer" looking but also shorter.


Some of the best SEO URLs also tend to be human guessable URLs, which is definitely not a waste of bytes.


Not if you view SEO as genuinely helping search engines to find content in a way that improves the discoverability of web pages.

If it's purely to boost rankings artificially, then yes, I agree.


I still think the one used by the Spanish Congress is worse. URL for legal document 162/000609:

http://www.congreso.es/portal/page/portal/Congreso/Congreso/...


Legal portals are generally garbage. Here's the French one for Article L511-1 of the environmental code: http://www.legifrance.gouv.fr/affichCodeArticle.do?idArticle...


Legal portals are also often vulnerable to a form of directory traversal, where you descend the URL scheme by cropping out the last segment, e.g. /documents/17683/ would become /documents/. Doing the same thing with parameters can do wonders.
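
A minimal, purely illustrative Python sketch of that cropping:

    from urllib.parse import urlsplit, urlunsplit

    def ancestors(url):
        """Yield parent URLs by cropping one path segment at a time."""
        parts = urlsplit(url)
        segments = parts.path.rstrip('/').split('/')
        for i in range(len(segments) - 1, 0, -1):
            yield urlunsplit((parts.scheme, parts.netloc,
                              '/'.join(segments[:i]) + '/', '', ''))

    print(list(ancestors('http://www.example.com/documents/17683/')))
    # ['http://www.example.com/documents/', 'http://www.example.com/']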

So far I've found login portals to a few banks, teleoperators and to the parliament and military systems of my country. In addition, I've hit several FTP directories of organizations such as my state's public welfare system, which included software and documents.

I sometimes report these incidents as I find them, anonymously and without contact information, since nobody ever responds to these reports.

tl;dr: Long URLs can also be dangerous.


§ 1353 of the BGB (German Civil Code) can be found at http://www.gesetze-im-internet.de/bgb/__1353.html (literally ‘laws on the internet’).


So close, with minimal effort they could map that to '/bgb/1353'. It seems that dejure.org actually works that way -> http://dejure.org/gesetze/BGB/1353 seems to map to http://dejure.org/gesetze/BGB/1353.html, but they graciously ignore any kind of file extension...


Well they already managed to get an overview/full-text at /bgb/, so I am quite happy for now…


So, this page is named ?_(null)çô(null)? How does that even begin to make sense?


It's an underscore followed by 4 bytes, possibly the integer 2650072859 or 468186269. If they're intentionally trying to obfuscate their URLs to prevent crawling, it might be further encrypted somehow.
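
Assuming the four bytes are %1B %E7 %F4 %9D (they're decoded in a comment further down), both readings check out in Python:

    import struct

    raw = b'\x1b\xe7\xf4\x9d'            # the four bytes after the underscore
    print(struct.unpack('>I', raw)[0])   # big-endian    -> 468186269
    print(struct.unpack('<I', raw)[0])   # little-endian -> 2650072859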


How about http://www.tsa.gov/tsa-pre%E2%9C%93%E2%84%A2 ? It's "TSA-Pre✓™", and you have to type in the checkmark and the trademark symbols, otherwise it 404s.

Edit: ho, they fixed that and it's redirected. Too bad, that was funny :)


"I can't imagine the skill required to do this without the experience to know it's a bad idea" (can't find the source for this quote, but you should get the sentiment).

...ahhh... found it: "How do you attain the skills required to do this while not also learning not to?" http://news.ycombinator.com/item?id=4711355


The way you end up with URLs like http://www.tsa.gov/TSA-Pre✓™ is a CMS that replaces spaces with dashes in the title to make the URL. No skill required.


Glad to see they finally fixed that with a redirect to a sane URL.


It's Latin-1 encoded and then URL-encoded:

    >>> import urllib
    >>> print urllib.unquote('%1B%E7%F4%9D').decode('latin-1')
    çô


Up until last week, http://www.tsa.gov/tsa-pre✓™ was the canonical version. Now it redirects to the version that doesn't require any Alt-key jockeying to type.


Can anyone explain the advantages of having unreadable URLs like this? The only thing I can think of is reducing bandwidth usage through short URLs :)


And reducing bandwidth via a brilliant anti-SEO strategy.

* Useless URLs? Check.

* Eye-gougingly ugly design? Check.

* Densely packed content with tiny font? Check.

And have lots of fun reading the source.


You've got to go green by conserving bits. You can use a radix of up to 36 in JS, so why use those pesky base-10 values when you can go base 36 with no additional overhead? Heck, use base 62 for the easily URL-passable values too. Just 4 chars to encode your 14M WP articles.

Think of the bits!
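
A throwaway Python sketch of the base-62 idea (the alphabet ordering and function name are made up):

    import string

    ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase  # 62 chars

    def base62(n):
        out = ''
        while True:
            n, r = divmod(n, 62)
            out = ALPHABET[r] + out
            if n == 0:
                return out

    print(base62(14000000))   # 'wk2S': four chars cover ~14M ids, since 62**4 is about 14.7M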


This is nitpicking, but the worst URL scheme is actually smb:


And the worst mime type is an unfunny one.


My first thought when I see this:

This is a band-aid to try and prevent an XSS or SQLi flaw somewhere on the site.


I counter with http://www.ctshirts.co.uk (just go there and even the front page transmutes itself into a horrifying URL of doom).

Depending on which subsection of the site you're in, you sometimes also get |||||||| on the end of the URL.

I actually know someone who works in the e-commerce dept of the company and I think we should all berate him for how terrible a person he is ;)


I just noticed that %7C is the URL-encoded form of "|", so it's always just full of pipes!


Not sure what it is, but the URLs on the Japanese BVB site look better: http://www.bvb.jp/


What the heck, this site is reasonable, attractive, and legible. Why is this not the MAIN SITE?!?


Here's a NOT SAFE FOR WORK url.

    http://www.adultwork.com/ViewProfile.asp?UserID=1945109&Keywords=&KeySearch=1&TargetURL=http%3A%2F%2Fwww%2Eadultwork%2Ecom%2FSearch%2Easp%3FRefreshVar%3D02%252F05%252F2013%2B17%253A11%253A24%26cboCountryID%3D158%26cboCountyID%3D146%26cboAPID%3D0%26rdoRatings%3D0%26cboLastUpdated%3D01%252F01%252F2003%26intAgeFrom%3D25%26intAgeTo%3D33%26DF%3D1%26cboLastLoginSince%3DX%26strSelPostCode%3D%26HotListSearch%3D0%26rdoKeySearch%3D1%26strPostCodeArea%3D%26SearchTab%3DProfile%26cboRegionID%3D11%26question_69%3D%26question_70%3D%26question_2%3D%26question_3%3D%26question_57%3D%26question_27%3D%26question_42%3D%26strKeywords%3D%26intHalfHourRateFrom%3D%26intHalfHourRateTo%3D%26dteAvailableAnotherDay%3D%26hdteToday%3D02%252F05%252F2013%26cbxSelIsEscort%3DON%26strTown%3D%26dteMeetDate%3D%26intMiles%3D%26intMilesUSA%3D%26rdoOrderBy%3D7%26intMeetDuration%3D%26cbxGenderID%3D2%26cboSCID%3D0%26cbxPreferenceID%3D55%26intHourlyRateFrom%3D%26intHourlyRateTo%3D%26intHotListID%3D0%26PageNo%3D1%26SS%3D0%26strSelUsername%3D%26dteMeetTime%3DX%26intMeetPrice%3D%26cboBookingCurrencyID%3D28%26intOvernightRateFrom%3D%26intOvernightRateTo%3D%26strSelZipCode%3D%26CommandID%3D1&NavUserIDs=1768061x1983764x1816348x1873822x1896482x1964251x548903x1052569x1188136x1431008x1780228x1788349x1801647x1475155x1635012x1964725x1995120x1985169x1657721x1678563x1620768x591995x1551539x1579011x1996472x1694586x1198128x1916266x1945109x883257x1097958x273262x1891436x1578047x1797390x1415157x1825574x1935666x1119043x929033x1935510x1957223x1468772x1873269x1494092x1120357x1282956x1284275x1107421x639826


Take THAT, evil web crawlers! Maybe they are trying to get poorly-written spiders to crash when they hit the site?


Well, I do remember years ago when I helped a colleague debug a problem with a web app. Seems that IE was the only browser we could find that crashed with %00 in the URL. I'm pretty certain there was a NUL byte exploit we could have dug into.


Actually, when you google site:www.bvb.de there are some readable URLs (not sure how they got there); navigating to them triggers 302 redirects.


It's just like tinyurl.com, but in reverse :)


Unsurprisingly, that actually exists: http://hugeurl.com/


hahaha :D

Let's put the tech sensation aside. I'm glad to know that the HN folk have a good sense of humor :)

BTW: you can add multiple routes pointing to the same URL but allow only the SEO URLs to be indexed. This keeps the cryptic URLs for the entertainment of the users/crawlers.


How would you do this? (Leaving aside the question of 'why?')

I suppose you could try blocking crawlers from the raw URLs with an aggressive robots.txt and then put a sitemap (with friendly/SEO URLs in) somewhere for them to discover instead. Would that work?

Paranoid web spiders could flag the site as suspicious, though. Such schemes might make it seem like the website is presenting one view to the spider, and another to real visitors. Almost like it was trying to hide malware from a scanner.


You could simply add "index" to the page when it's accessed via /seo/url and "noindex" when it's accessed via the cryptic URL. Additionally, you can enforce that using .htaccess or nginx rules. Your framework and HTTP router class just have to support multiple URLs per page.

Basically, your CMS or framework must allow multiple routes like site.com/best/watch/casio and site.com/→@ðŋ]æ~@¢“«¢“¹²³»«@€^ linking to the same page (a minimal sketch follows below).

I've used that in the past to switch languages depending on URL path + browser language. /en/my-article would show that English article to a German visitor, but everything else on the site, like nav, terms, etc., would be German. To access the English site, the German visitor would have to click the appropriate flag. I could have easily added a feature to read that same article in German via a click on a flag in the breadcrumb's mini drop-down. Example: blog»my-article[v]; a click on [v] would open blog»mein-artikel, etc.
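
Here's the rough sketch mentioned above, assuming Flask; the paths and the cryptic id are invented:

    from flask import Flask, request

    app = Flask(__name__)

    SEO_PATH = '/best/watch/casio'

    @app.route(SEO_PATH)         # the human/SEO-friendly route
    @app.route('/x/3f9a2c')      # a hypothetical cryptic route to the same page
    def casio_watch():
        # only the SEO route should be indexed; the cryptic one gets noindex
        robots = 'index' if request.path == SEO_PATH else 'noindex'
        return '<meta name="robots" content="%s"><h1>Casio watch</h1>' % robots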


My understanding is that you can use e.g. "rel=canonical" links to tell bots what the indexable URL of the current page is. Other tools in the box include UA sniffing and sitemap.xml.


A former coworker of mine created django-unfriendly[1], which seems like it's probably worse. On the other hand, django-unfriendly is meant to obfuscate on purpose.

[1] - https://github.com/tomatohater/django-unfriendly



Look! Another fixed-width, left-aligned German website. Brimming with nostalgia here.


Weird characters aside, having URLs of the form example.com?stuff has many advantages. For one, you don't need any weird magic to get relative URLs working properly.



Nice try. See you crying in 3 weeks.


What makes this bad? Or rather, what makes this objectively bad? I feel like the URL-scheme opinions espoused by people who judge them are basically BS. Success of the website seems like the only objective measure, IMO, and by that measure pretty much any URL scheme is fine, given that some of the most popular sites use schemes that people like jakub_g complain about.

I can see how maybe an API URL might have objectively better and worse schemes, but a content URL? Show me the research results, not just fashion opinions.


It's a legit point and a legit question. Instead of voting it down, answer the question with some objective facts.


SANITIZE ALL THE SERVER CONTENTS!


Ehy, I see nothing wrong here.


Best football club in the world with the worst URL scheme in the world?


http is a perfectly normal url scheme

ducks


Haha, that is epic.

EDIT: I apologise for the really bad comment. I have read the rules now and will only make high-quality posts from now on. Thank you.



Actually, thanks for that. I didn't realise so many people were against it, but I shall refrain from using the word "epic" except in contexts where it should be used.


Thank you. I often wanted to write a rant about the epidemic usage of that word, but now I see I don't have to (I can just link to it, like a cow ^^)


ouch


That page is epic.



