
Minor point: I see copy-paste from Wikipedia about ISO-8859-5. It's unfortunate, since nobody ever used ISO-8859-5. They should probably change it to ISO-8859-something-else in the Wikipedia article.



It also entirely glosses over the fact that before the ISO-8859 standards, there were the horrendous code pages in DOS and numerous other encodings on other platforms, which made things hard even for Europeans, let alone languages with a non-latin-derived alphabet.


And after ISO-8859, there were Windows (aka ANSI) code pages, also not matching perfectly (or sometimes at all) with either DOS or ISO-8859.
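To make the mismatch concrete, here is a minimal sketch in Python that decodes the same byte under a DOS code page, an ISO-8859 set, and a Windows code page. The helper name `decode_everywhere` is made up for illustration; the codec names are the ones Python's standard library ships.

```python
def decode_everywhere(byte_value):
    """Decode one byte under a DOS, an ISO-8859, and a Windows code page."""
    raw = bytes([byte_value])
    # cp437 = original IBM PC/DOS, latin-1 = ISO-8859-1, cp1252 = Windows "ANSI"
    return {name: raw.decode(name) for name in ("cp437", "latin-1", "cp1252")}


# 0x80 means something different in all three encodings.
print(decode_everywhere(0x80))
# 0xE9 agrees between ISO-8859-1 and Windows-1252, but not with DOS.
print(decode_everywhere(0xE9))
```

Byte 0x80 comes out as "Ç" in CP437, an invisible C1 control character in ISO-8859-1, and "€" in Windows-1252, while 0xE9 is "é" in the latter two but "Θ" in CP437.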


Most SBCS code pages other than 437, as used today, were introduced in 1987 with DOS 3.3, which was after ISO 8859.


And before that, you get into encodings like EBCDIC, RAD50, SIXBIT, FIELDATA, and even more failed schemes now largely forgotten.

Why should a brief overview go back even to the pre-ISO-8859 days except to mention ASCII? None of them are directly relevant: The world we're dealing with now on the Web begins with ASCII, moves through a Pre-Unicode Period, and finishes up in the Land of Unicode, where it's at least possible to do things Right. All history tells a narrative; when it comes to character encodings, that's a good default unless you really think your audience cares about why FORTRAN was spelled that way back in the Before Time.

Tom Jennings has an interesting history:

http://www.wps.com/projects/codes/


"The world we're dealing with now on the Web begins with ASCII"

I know nothing about the implementation of early web browsers/gopher/etc, but I doubt there ever was anything on the web that used ASCII. 7-bit email may have been around at the time, but I would guess Tim Berners-Lee just used whatever character set his system used by default (corrections welcome; being snarky isn't the only reason I write this).


It was a hotly debated topic whether the WWW should use 7-bit/MIME or not.


> I know nothing about the implementation of early web browsers/gopher/etc, but I doubt there ever was anything on the web that used ASCII.

All headers, HTTP, email, or otherwise, are 99% or more ASCII. HTML markup is over 99% ASCII for most documents, especially the complex ones.

ASCII is the only text encoding you can guarantee everything on the Web (and the Internet in general, really) knows how to speak. Finally, guess what the UTF-8 encoding of every code point from U+0000 to U+007F inclusive is byte-for-byte identical to: ASCII.
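The UTF-8 point is easy to check yourself. A quick sketch in Python (the function names are made up for illustration) that encodes all 128 ASCII code points as UTF-8 and compares the bytes, then shows what happens just outside that range:

```python
def utf8_matches_ascii():
    """True if UTF-8 encodes U+0000..U+007F as the same single bytes as ASCII."""
    ascii_bytes = bytes(range(0x80))      # every 7-bit value, 0x00 through 0x7F
    text = ascii_bytes.decode("ascii")    # the 128 ASCII characters as a str
    return text.encode("utf-8") == ascii_bytes


def utf8_bytes(ch):
    """Return the UTF-8 encoding of a string as a list of byte values."""
    return list(ch.encode("utf-8"))


print(utf8_matches_ascii())   # True
print(utf8_bytes("A"))        # [65]
print(utf8_bytes("é"))        # [195, 169]
```

Anything above U+007F becomes a multi-byte sequence whose bytes all have the high bit set, so an ASCII-only parser never mistakes them for markup or header syntax.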


ASCII in fact is the completely safe text encoding for HTML - and thanks to HTML entities, you do not lose any international character support. You can have a Unicode-using HTML document encoded in ASCII - it's just quite big.
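A minimal sketch of that idea in Python, assuming numeric character references are acceptable in place of named entities. The function name `to_ascii_html` is hypothetical; the `xmlcharrefreplace` error handler and `html.unescape` are part of the standard library.

```python
import html


def to_ascii_html(text):
    """Return text as pure ASCII, with non-ASCII characters as &#NNNN; references.

    This does not escape '<', '>' or '&'; use html.escape() first
    if the input is untrusted.
    """
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


encoded = to_ascii_html("naïve café")
print(encoded)                  # na&#239;ve caf&#233;
print(html.unescape(encoded))   # naïve café

# The size cost for non-Latin text: 6 Cyrillic letters.
word = "Привет"
print(len(to_ascii_html(word)), len(word.encode("utf-8")))   # 42 12
```

That last line is the "quite big" part: each Cyrillic letter takes 7 bytes as a character reference versus 2 bytes in UTF-8.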


I know that, but "over 99% ASCII" = "not ASCII". For many users, UTF-8 is over 99% ASCII, but it is not ASCII.


> I know that, but "over 99% ASCII" = "not ASCII"

No, that's not what I meant. I meant that all of the essential bits are ASCII, all of the software that generates those important pieces has to know ASCII, and it's entirely possible for software that speaks only ASCII to handle it as long as the filenames (the main source of non-ASCII characters) being served are also ASCII.

Read the HTTP specification sometime.



