I must quibble on points of technical precision (because otherwise the article is both amusing and interesting): the title is pretty much entirely incorrect.
I’ll pick on the FQDN part first, because it’s the (only) part that is unequivocally wrong. FQDN is a very specific technical term in the domain name system. A fully qualified domain name includes a trailing dot, so Ⅷ.fi. would be five characters, even if Ⅷ and fi were the actual labels. But they’re not: DNS is strictly ASCII-only, so this normalisation is happening at a higher level (as the OP notes in another response here, tools are applying IDNA2008, per RFC5895). The FQDN is viii.fi., which is eight characters long.
Next I deny the claim that it’s a single-character domain. Perhaps I’m getting petty here, but even if people do colloquially speak of example.com as a six-letter domain (counting only the label at the level you register), so that I would grudgingly allow Ⅷ.fi to be considered a single-character domain in the proletariat vernacular, the domain name that was “bought” was not that, but viii.fi, which is a four-letter domain. Hair splitting is fun.
But my pettiness knows no bounds. Domain names aren’t bought, they’re registered for such-and-such an amount per annum. And I bet it wasn’t exactly £15.00 that was paid.
:-)
———
I got to thinking about other TLDs that would work, and ℡ (TEL → .tel) and № (No → .no) occurred to me off the top of my head. Haven’t seen a .tel domain name in yonks. I never did quite see the point of .tel.
The explanation in your blog post isn't quite correct. The conversion of "Ⅷ" to "VIII" happens when Unicode text is normalized using a "compatibility" mapping, that is normalization forms NFKC or NFKD, not when text is lowercased. While IDNA2003 required a custom mapping based on NFKC, IDNA2008 doesn't specify a mapping phase. It does allow custom mappings, though, and it seems that browsers apply NFKC at some point in the process. See https://unicode.org/reports/tr46/#Mapping
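A minimal sketch of the distinction using Python's standard unicodedata module (note that what browsers actually apply is the UTS #46 / IDNA mapping, which is close to, but not exactly, NFKC plus lowercasing):

import unicodedata

s = "Ⅷ"                                          # U+2167 ROMAN NUMERAL EIGHT
print(s.lower())                                  # 'ⅷ' (U+2177) - lowercasing keeps it a roman numeral
print(unicodedata.normalize("NFC", s))            # 'Ⅷ' - canonical normalization leaves it alone
print(unicodedata.normalize("NFKC", s))           # 'VIII' - the compatibility mapping decomposes it
print(unicodedata.normalize("NFKC", s).lower())   # 'viii'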
Sigh. I predicted as I wrote my comment that there would be some point in my comment that someone would contest; I just didn’t expect it to be such an elementary error! Cue sad OBOE (more apt than sad trombone).
• Hmm, I wasn’t aware of that admission of ambiguity. Allowing that fuzziness in terminology puzzles me, because if you don’t have the trailing dot, then what you have fundamentally isn’t fully qualified. Sure, some resolvers may treat it as though it were, but some won’t, and are perfectly correct not to. (I haven’t the foggiest idea what the balance of implementation behaviours might be. I know enough DNS to be dangerous, but I don’t live and breathe the stuff.)
• I should have said DNS hostnames are ASCII-only, which I believe to be true. (And yeah, this still depends on conventional rather than rigorously defined terminology.)
The trailing-dot-to-anchor thing is hidden away from the user by probably a majority of tools now for managing DNS records. If anything, this has probably made the situation a bit worse by allowing a lot of people to spend years managing DNS records without even knowing about it. Then they edit a BIND text file or something that doesn't assume it and it all goes wrong.
As far as client-side tools... no one ever uses the trailing dot and that can lead to some interesting situations if you have a search path set and use a resolver that resolves wildcards. Use of the search path at all is fairly unusual for "client-ish" setups though. You could also view the common non-use of the trailing dot as one of the causes of the ICANN recommendation against top-level domains resolving, as the search-vs-hostname-vs-fqdn ambiguity would be less common (but still present) if people commonly used the trailing dot.
Plus, if some site were using length limits as a security check, would they be counting Unicode glyphs, Unicode codepoints, code units, or bytes? Cause the one thing TFA's domain doesn't do is take up 4 bytes even without the trailing period (which isn't needed for the XSS anyways). No, it takes 6 bytes using UTF-8, since the precomposed Ⅷ alone is 3 bytes. So TFA's 15-character XSS is actually a 17-byte XSS (in UTF-8).
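For illustration, the same string counted three different ways in Python (which of these a given site's length check uses is anyone's guess):

s = "Ⅷ.fi"
print(len(s))                       # 4 - code points (what Python strings count)
print(len(s.encode("utf-8")))       # 6 - bytes in UTF-8 (U+2167 takes 3 bytes)
print(len(s.encode("utf-16-le")))   # 8 - bytes in UTF-16 (4 code units of 2 bytes each)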
The more I learn about text representation and Unicode, the more it looks like a complete clusterfuck, and it boggles my mind that somehow all this works almost perfectly while hiding all the complexities from the end user.
I suppose this is inevitable when you're tasked with representing literally every symbol in existence. You couldn't pay me enough to touch this problem with a ten foot pole (this and text rendering).
It’s not a clusterfuck and IMHO it’s an unfair characterization. It is insanely complicated and shouldn’t be touched except when wearing appropriate hazmat gear.
Writing seems simple — children do it routinely — but like a biological system it evolved over millennia in a ton of different directions. It’s coupled with emotional, practical, and even, yes, moral issues that operate on both deeply personal and societal levels. This is hard to capture in software.
Unicode made a couple of hard decisions right up front. I hate them but they were smart and Unicode would not have survived had they not made them. One was round-tripping with legacy character sets, which meant encoding a lot of redundant characters (English and German “A” have the same code point, but Greek “A” and Russian “A” do not, nor does an “A” that appears in a Japanese code table). Second was Han unification, which had its own linguistic, emotional and political issues.
People are complicated and so are their languages so wrestling the whole thing into a tractable system has been worth the effort.
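The round-trip decision is easy to see in the code charts; a small Python sketch using the standard unicodedata module (the three letters below usually render identically, but they came from different legacy character sets and so keep distinct code points):

import unicodedata

for ch in ("A", "\u0391", "\u0410"):    # Latin A, Greek Alpha, Cyrillic A
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+0041 LATIN CAPITAL LETTER A
# U+0391 GREEK CAPITAL LETTER ALPHA
# U+0410 CYRILLIC CAPITAL LETTER A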
While the goal and work of Unicode are admirable, I can't help but fear that they're setting themselves up for future problems. Take for example flag emojis [1]. At first it seems "just" complicated. But then it starts to become problematic: what happens when a country changes flags? What happens when a country ceases to exist? Or splits? Or merges into another? What about when there are flag disputes?
Imagine if Unicode had to start dealing with the kind of temporal change that, for example, the Olson TZ database [2] has to!
This is already not an issue. Unicode doesn't assign a separate codepoint to any flag. Each flag is represented by a two-character ISO country code spelled out in regional indicator symbols (such as IN for the Indian flag).
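You can inspect the pair directly; a small Python sketch (whether the two code points actually render as a flag is entirely up to the font and platform):

import unicodedata

flag = "\U0001F1EE\U0001F1F3"    # renders as the Indian flag on most platforms
for ch in flag:
    print(f"U+{ord(ch):05X} {unicodedata.name(ch)}")
# U+1F1EE REGIONAL INDICATOR SYMBOL LETTER I
# U+1F1F3 REGIONAL INDICATOR SYMBOL LETTER N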
That's part of the point. Are we prepared to track those changes across time? What if there's an article written today with (Unicode Hong Kong flag) or (Unicode Crimean flag)? Those articles might mean to express something in a context where HK is a certain independent entity, or Crimea is Ukrainian. What if that article is displayed with a Chinese and Russian flag 10 years from now?
Technically they already have a flag dispute, over Taiwan, as I recall. Thankfully for the Unicode consortium they’ve managed to leave the implementation problems that causes to the vendors.
As someone who used to work in country list related things, the existence of Taiwan as a country flag codepoint at all would be an issue for China. China will complain if you include Taiwan, and Taiwan will complain if omitted, so it's not fun appeasing both sides.
Agreed, just pointing it out. In our system, we had to display 'Taiwan, Province of China' for the Chinese users, and 'Taiwan' to everyone else, though that was just UI and the backend treated it identically.
Usually it works perfectly. Sometimes it doesn’t. I’m occasionally stunned by how such a fundamental thing as text representation can be ruined by obscure encoding issues. For example, there is absolutely no way to be certain of the character encoding scheme of binary string data unless it is stored as metadata somewhere. Unicode attempts to solve this with the Byte Order Mark (BOM). If present, we can tell that a string is in one of the Unicode encodings and, for UTF-16/32, whether it’s big-endian or little-endian. However, the BOM is optional, and so it’s not known for sure if a string is Unicode.
One example of how this is a huge clusterfuck is that until recently, Windows Notepad opened and saved everything with the Win-1252 encoding scheme (labeled as ANSI in the app). The web, and the other popular OSes, on the other hand, are standardized around UTF-8. So if you get a txt file without a BOM from the web or another OS and open it in Notepad, you can get characters that looked right in your browser, but not in Notepad.
There are smart algorithms out there that can detect character encoding pretty well, but none of them are perfect (as far as I know).
The Win-1252 default and the fact that most computer users have no idea about character encoding have caused all sorts of headaches for me with the reporting software I work on.
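A small Python sketch of why detection can only ever be a guess: the same bytes decode without error under several encodings, and only a BOM (when someone bothered to write one) gives a real hint:

data = "Jäger".encode("utf-8")          # b'J\xc3\xa4ger'
print(data.decode("utf-8"))             # Jäger
print(data.decode("cp1252"))            # JÃ¤ger - also decodes without error, just wrongly
print(data.decode("latin-1"))           # JÃ¤ger - every byte sequence is "valid" latin-1
print("Jäger".encode("utf-8-sig")[:3])  # b'\xef\xbb\xbf' - the UTF-8 BOM, if present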
I wouldn't call that an obscure encoding issue, but an absolutely fundamental one. Absent meta information, you can never be sure that some text (or actually any data) is in a specific encoding (at best, you can be sure that it is not in some specific encoding). As an (indirect) illustration, see polyglots (programs that are valid programs in multiple programming languages simultaneously):
It’s often the simple, fundamental things (text, time, images) that we think are easy to implement, but in fact are incredibly complex under the hood. It’s often the human “decentralization” that causes all the quirks and oddities that make things difficult to get right. Text encoding is a good example, date and time another one. Both actually have much more in common than you would think.
Perhaps it looks like it works "almost perfectly" because you're only using English (and similar western languages)? The problems that arise in Asian text are numerous -- and they do frequently hit end users.
Historically computing has been the backbone of bureaucracies for a very long time and as bureaucracies do, they make people bend to their rules, and thus the rules of computing. I'm German, the German alphabet is exactly the same as the English one, except we have four extra letters: äöü, the friendly Umlauts, and ß. Since a lot of older computing systems did not handle this (7-bit ASCII or mainframe character sets), computing bent the language instead. Jägerstraße => Jaegerstrasse. A lot of unixish software doesn't handle spaces in names and such. People bowed to that as well.
The idea that computers should support cultures, and not the other way around, is pretty recent.
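To make that constraint concrete, a tiny Python sketch: 7-bit ASCII simply has no room for the extra letters, so text was transliterated by convention (the replacement table below is hand-written for illustration):

s = "Jägerstraße"
try:
    s.encode("ascii")
except UnicodeEncodeError as e:
    print(e)     # 'ascii' codec can't encode character '\xe4' in position 1 ...
# the conventional fallback: ä->ae, ö->oe, ü->ue, ß->ss
print(s.translate(str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})))
# Jaegerstrasse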
I'm not German, but afaik the spelling reform of 1996 that introduced ss as an always-alternative for ß was mainly aimed at simplification and unification. Do you have any support for your statement that it was because of insufficient support by IT systems?
Such an always-alternative doesn't exist. ß was changed to ss after short vowels, that is all. I think there is a rule to always use ss instead of ß (and ae instead of ä, etc.) when ß is not available. But that wasn't introduced in 1996; that is way older and less relevant today than it used to be.
> computing bent the language instead. Jägerstraße => Jaegerstrasse.
Um, no. The words were originally written that way. Ä, ö, ü and ß actually developed from ligatures for ae, oe, ue and ss, long before computers were a thing.
What's your point exactly? Umlauts as we know them have been used for a few hundred years (hard to pin-point, because öäü evolved in "casual" hand-writing, not printing or books) before computing came along, so that was clearly how the language worked before. The motive force for using AE and SS in computing was clearly that computers commonly didn't support the real characters, not that people suddenly thought writing the old way again, a couple hundred years later, would be fun.
Originally ß was just a ligature for ss (ſs) but it since developed its own meaning. ß indicates that the preceding vowel is long. Busse and Buße are pronounced differently and mean different things. The conversion ß -> ss destroys information that was present in the original orthography.
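A small Python sketch of that information loss, using the standard case mappings (the default uppercase mapping still folds ß to SS, so the two words collide):

print("Buße".upper())      # BUSSE
print("Busse".upper())     # BUSSE - indistinguishable once uppercased
print("buße".casefold() == "busse".casefold())   # True
print("BUSSE".lower())     # busse - there is no way to get Buße back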
> The more I learn about X and Y, the more it looks like a complete clusterfuck, and it boggles my mind that somehow all this works almost perfectly while hiding all the complexities from the end user.
I modified your first sentence to make it more generic and applicable to many other things in software.
As a frequent user of Unicode, for Chinese and Japanese, I sorta go along with it, but there's no arguing that we'd be closer to flying cars and jetpack commutes if computers just used ASCII.
Barring that, we could have all used UTF-8, but Windows really screwed that up, and none of the arguments for 16-bit alignment vs 8-bit alignment for processing really hold water.
Technically speaking no, they didn't have UTF-16 but rather UCS-2. UTF-8 was fully specified in 1992, and called UTF-8 in 1993, while it took until 1996 before UTF-16 was specified; whereas Windows NT 3.1 was released with UCS-2 in 1993.
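For what it's worth, a small Python sketch of the practical difference: anything outside the Basic Multilingual Plane needs a surrogate pair in UTF-16, which fixed-width UCS-2 simply cannot express:

s = "\U0001D11E"                      # MUSICAL SYMBOL G CLEF, outside the BMP
print(len(s))                         # 1 code point
print(s.encode("utf-16-be").hex())    # d834dd1e - a surrogate pair, two 16-bit code units
print(s.encode("utf-8").hex())        # f09d849e - four bytes in UTF-8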
For brevity and wit, the most impressive email address I ever saw was up@3.am - and the most impressive website was http://ws./
Sadly, ws. is not serving right now, it seems - I had no idea a ccTLD apex could vend an A record but there you go; I guess technically that means the root servers themselves could vend A records, which would let you have the ultimate website at http://./
Two letter TLDs are reserved for country codes (ccTLD), so barring policy change and ignoring minor subtleties in the actual rules, .ux will only happen if a new country comes around and gets assigned the ISO 3166-1 alpha-2 country code “ux”.
A more probable course of events would be “tux” being registered as a new gTLD.
There are, of course, exceptions---the policy was put in place after the creation of ccTLDs and does not apply to them, so a few ccTLDs get to break the rule.
This is a cool looking website, but I don't think I'd ever use their service without knowing the people behind it or even where their office is located. Also I believe without an imprint this site is not GDPR compliant.
The domain resolved for me in Chrome on macOS though!
If you search regularly, there are some cheap two-dot-two domains (two characters, a dot, and a two-character TLD) to be found that are ASCII. A bit longer in string length, but shorter on the wire.
I own 0e.vc, which is on GitHub as a general purpose xss domain if you need it. Iirc it does eval(window.name), or location.hash. Whatever works for you. It’s also on the public suffix list which makes it almost like a top level domain for security purposes. So I can have subdomains that can’t ever share cookies :-))
How: File an issue at https://github.com/publicsuffix/
Why: It's fun. Also, web security testing gets easier when you can make pages for all likely and unlikely scenarios (cross-site/domain/origin).
Technically, you should be able to have your local host resolve first. To avoid it being resolved and use the TLD instead, use the trailing dot: http://dk./ or https://news.ycombinator.com./ This makes it an FQDN, just like you enter it in your DNS records. Practically, people tend to forget about it, and accessing websites this way sometimes breaks in unexpected ways when the server doesn't expect the trailing dot in the "host" header.
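A quick sketch of the trailing-dot behaviour with Python's stock resolver calls (whether a bare name gets a search suffix appended depends on your local resolver configuration, and this assumes dk. still serves an address record at its apex, as described above):

import socket

# the trailing dot anchors the name at the root, so no search path is applied
print(socket.getaddrinfo("news.ycombinator.com.", 80, proto=socket.IPPROTO_TCP)[0][4])
print(socket.getaddrinfo("dk.", 80, proto=socket.IPPROTO_TCP)[0][4])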
Technically it could be shorter still, if only either . had an A RRset, or if there were 1-character TLDs. Just cause those aren't true today, doesn't mean they can't ever be true.
Apparently it is possible to do it with local DNS resolution, using your hosts file. Not sure how possible it would be on a remote DNS server, or whose authority you'd need to actually do it.
Yes, if the root zone (".") published a CNAME or A record. You can simulate this yourself by setting your resolver as the master for "." and giving it an A record. Note that most webservers will barf if you try to send "." in the Host header unless you explicitly configured it.
# cat named.conf.local
zone "." {
type master;
file "/etc/bind/db.root";
};
# cat db.root
$TTL 60
@ IN SOA localhost. root.localhost. (
1 ; Serial
604800 ; Refresh
86400 ; Retry
2419200 ; Expire
604800 ) ; Negative Cache TTL
;
@ IN NS .
@ IN A 209.216.230.240
# dig a .
...
;; QUESTION SECTION:
;. IN A
;; ANSWER SECTION:
. 60 IN A 209.216.230.240
# wget http://./
--2020-08-17 09:38:26-- http://./
Resolving . (.)... 209.216.230.240
Connecting to . (.)|209.216.230.240|:80... connected.
HTTP request sent, awaiting response... 400 Bad Request
2020-08-17 09:38:26 ERROR 400: Bad Request.
Domains are split by the dot, from right to left. What is not visible is that the top-level domain is not really the first one. The first one is the root domain (or root zone), which is represented by an empty string. You can try it on any domain: add a dot at the end. It will resolve, but the web server may not be configured to respond to it with the right website.
If you want to query a DNS server for a top-level domain, you have to query the server that handles it, which in our case is one of the 13 root DNS servers.
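If you want to try that query yourself, here is a small sketch assuming the third-party dnspython package is installed; 198.41.0.4 is a.root-servers.net, and since the root servers aren't authoritative for fi. they answer with a referral listing the fi. name servers:

import dns.message
import dns.query

query = dns.message.make_query("fi.", "NS")
response = dns.query.udp(query, "198.41.0.4", timeout=5)
for rrset in response.authority or response.answer:
    print(rrset)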
I suppose other TLDs could have different rules. But in general you'd want to have a canonicalization step so you don't have two domains that are just different ways of composing the same thing.
I tried to go one character smaller, by using the combined unicode character "⒕", in an attempt to eliminate the ".". I combined it with the "ms" TLD, which can also be represented as a single character. Unfortunately all the browsers I tested refuse to treat it as a domain name without a full stop, so I'm stuck using the three character http://xn--1rh.xn--ryk/ instead of the two character http://xn--6sh056c/ I was hoping for.
On the bright side, even the three character version is unlinkable on facebook -- it just redirects me to http://invalid.invalid/. I'll take that as a win. I still managed to get a pretty cool domain name for around $40, and it was definitely fun to mess around with this idea.
EDIT:
Interestingly, HN has automatically punycoded the URLs. That should be ⑭.㎳ and ⒕㎳
Hi Andy. I'm OP. As you spotted, I registered viii.fi with Gandi.
When your browser sees the Ⅷ character, it performs the IDNA2008 process described in RFC5895 to normalise it. The same thing happens in OS tools like dig and ping.
Yeah, I read the article and just now went back and read the Minimal Viable XSS article linked as well. I am also rather puzzled.
This seems to be only useful if you manage to find a website with an XSS flaw and one that also limits the input to 20 characters? Are these situations really common enough to warrant this attack? It all seems rather arbitrary to me.
Note that you are responsible for checking some company and trademark registers for clashes with existing names before registering a .fi domain. With a quick check there is at least one trademark close enough that it could theoretically cause you a headache, though I doubt they care.
My brother's old .fi domain was taken by a company with the same name (because they have priority) and they didn't even do anything with it...
Yep! I chose it as the domain’s total character length is 4 (when visible in a browser obviously, not in punycode form) and the recycling symbol felt appropriate for url shortening. I hosted it on Bitly (hence you see a Bitly landing page if the short url doesn’t exist).
I don’t use it as much anymore because of the previously mentioned Unicode filtering some sites use, like HN.
It's not as hard or expensive as you say to find two-character domains on two-character TLDs. You may want to do some better research. Hint: search on hn.algolia for "short domain names" :)
4 character names were readily available when Norway eased regulations. I bought a couple just last year when (.no) opened for 2-letter domains. Some are still available, I believe.
If you don't mind mixing one letter with one number, and then you choose a 2-letter TLD, you obviously can have a very short and still inexpensive domain.
Cause `viii.fi` is indeed what OP registered. They are counting on browsers to turn `Ⅷ.fi` into `viii.fi` before resolving, by running `Ⅷ.fi` through a Unicode compatibility normalization routine first.
That may be a standard thing to do with Unicode in domain names, run it through the standard normalization first? Understanding what browsers are "supposed" to do with Unicode in domain names (and URLs generally) is very confusing for me.
I would be curious to learn more about what standards govern how browsers handle Unicode in domain names, the history of it, how compliant browsers are, etc. I also don't entirely understand the goal here -- the original `Ⅷ.fi` isn't actually only two bytes in any encoding... what is the value of having something that shows up as two "glyphs" even though it's more bytes and normalizes to something else with a yet different number of bytes?
Some sites limit the number of bytes of input. Some sites limit the number of Unicode characters of input. For instance, Twitter's 280-character (formerly 140-character) limit is Unicode characters, not bytes.
Since the browser already turns it into ASCII form before resolving, how would it work as XSS against a server-side max length limitation, as he mentioned in his other article, "Minimum Viable XSS" [1]?
Speaking more maliciously about Unicode, I know that there are ways to attack domain names, for example replacing the o in Google with a Cyrillic o or some other character that looks like o, for the purposes of phishing.
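The swap is invisible on screen but easy to demonstrate; a small Python sketch (the escape below is the Cyrillic look-alike of the Latin o):

import unicodedata

latin_o, cyr_o = "o", "\u043e"
print(unicodedata.name(latin_o))                 # LATIN SMALL LETTER O
print(unicodedata.name(cyr_o))                   # CYRILLIC SMALL LETTER O
print("g" + cyr_o + cyr_o + "gle" == "google")   # False - looks identical, isn't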