Buying a single character domain – and 3 character FQDN – for £15 (shkspr.mobi)
260 points by edent on Aug 15, 2020 | 118 comments



I must quibble on points of technical precision (because otherwise the article is both amusing and interesting): the title is pretty much entirely incorrect.

I’ll pick on the FQDN part first, because it’s the (only) part that is unequivocally wrong. FQDN is a very specific technical term in the domain name system. A fully qualified domain name includes a trailing dot, so Ⅷ.fi. would be four characters, even if Ⅷ and fi were the actual labels. But they’re not: DNS is strictly ASCII-only, so this normalisation is happening at a higher level (as the OP notes in another response here, tools are applying IDNA2008, per RFC5895). The FQDN is viii.fi., which is eight characters long.

Next I deny the claim that it’s a single-character domain. Perhaps I’m getting petty here, but even if people do colloquially speak of example.com as a six-letter domain, counting only the label at the level you register the domain, so that I would grudgingly allow Ⅷ.fi to be considered a single character domain (per proletariat vernacular), the domain name that was “bought” was not that, but viii.fi, which is a four-letter domain. Hair splitting is fun.

But my pettiness knows no bounds. Domain names aren’t bought, they’re registered for such-and-such an amount per annum. And I bet it wasn’t exactly £15.00 that was paid.

:-)

———

I got to thinking about other TLDs that would work, and ℡ (TEL → .tel) and № (No → .no) occurred to me off the top of my head. Haven't seen a .tel domain name in yonks. I never did quite see the point of .tel.


I appreciate your pettiness! I have edent.tel

I also have a list of TLDs which can be shortened by this process. https://shkspr.mobi/blog/2018/11/domain-hacks-with-unusual-u...

And, yes, £15.30. But let's not quibble :-)


The explanation in your blog post isn't quite correct. The conversion of "Ⅷ" to "VIII" happens when Unicode text is normalized using a "compatibility" mapping, that is normalization forms NFKC or NFKD, not when text is lowercased. While IDNA2003 required a custom mapping based on NFKC, IDNA2008 doesn't specify a mapping phase. It does allow custom mappings, though, and it seems that browsers apply NFKC at some point in the process. See https://unicode.org/reports/tr46/#Mapping
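
You can watch this happen from Python (a quick sketch using only the standard library; note that lowercasing alone does not produce "VIII"):

    import unicodedata

    s = "Ⅷ"                                  # U+2167 ROMAN NUMERAL EIGHT
    print(unicodedata.normalize("NFC", s))   # 'Ⅷ'    - canonical normalization leaves it alone
    print(unicodedata.normalize("NFKC", s))  # 'VIII'  - compatibility normalization expands it
    print(s.lower())                         # 'ⅷ'    - lowercasing maps to U+2177, not 'viii'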


Huh! That’s probably the most serious use of .tel that I’ve ever seen. Well done. :-)


> example.com as a six-letter domain

If we're being pedantic, "example" has seven letters!

:-)


Sigh. I predicted as I wrote my comment that there would be some point in my comment that someone would contest; I just didn’t expect it to be such an elementary error! Cue sad OBOE (more apt than sad trombone).

:-) indeed.


Looks like another instance of Muphry's Law: https://en.wikipedia.org/wiki/Muphry%27s_law


Lol I thought you misspelled Murphy’s Law until I clicked the link. Good to know. :)


> A fully qualified domain name includes a trailing dot

The term is ambiguous, and many uses don't include the trailing dot. See IETF RFC 8499 section 2.

> DNS is strictly ASCII-only

DNS is not strictly ASCII. At the protocol level, labels are octet strings. You may be thinking of the LDH convention.


Thanks for the corrections.

• Hmm, I wasn’t aware of that admission of ambiguity. Allowing that fuzziness in terminology puzzles me, because if you don’t have the trailing dot, then what you have fundamentally isn’t fully qualified. Sure, some resolvers may treat it as though it were, but some won’t, and are perfectly correct not to. (I haven’t the foggiest idea what the balance of implementation behaviours might be. I know enough DNS to be dangerous, but I don’t live and breathe the stuff.)

• I should have said DNS hostnames are ASCII-only, which I believe to be true. (And yeah, this still depends on conventional rather than rigorously defined terminology.)


The trailing-dot-to-anchor thing is now hidden from the user by probably a majority of tools for managing DNS records. If anything, this has probably made the situation a bit worse by allowing a lot of people to spend years managing DNS records without even knowing about it. Then they edit a BIND zone file or something that doesn't assume it, and it all goes wrong.

As for client-side tools... no one ever uses the trailing dot, and that can lead to some interesting situations if you have a search path set and use a resolver that resolves wildcards. Use of the search path at all is fairly unusual for "client-ish" setups, though. You could also view the common non-use of the trailing dot as one of the causes of the ICANN recommendation against top-level domains resolving, as the search-vs-hostname-vs-FQDN ambiguity would be less common (but still present) if people commonly used the trailing dot.


Correct. The ASCII constraint is for hostnames and domainnames, but not for all labels.


Plus, if some site were using length limits as a security check, would they be counting Unicode glyphs, Unicode code points, code units, or bytes? Because the one thing TFA's domain doesn't do is take up 4 bytes, even without the trailing period (which isn't needed for the XSS anyway). No, it takes 7 bytes in UTF-8 and, naturally, precomposition. So TFA's 15-character XSS is actually an 18-byte XSS (in UTF-8).
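
The arithmetic, as a Python sketch (assuming the precomposed ﬁ ligature, per the above):

    s = "Ⅷ.ﬁ"                                # U+2167, U+002E, U+FB01
    print(len(s))                             # 3 code points - what a naive length check sees
    print(len(s.encode("utf-8")))             # 7 bytes: 3 + 1 + 3
    print(len("viii.fi".encode("utf-8")))     # 7 bytes for the normalized ASCII form as well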

:)


http://jonpos.tel would agree with you.


>I’ll pick on the FQDN part first, because it’s the (only) part that is unequivocally wrong.

Technically correct is best correct.

>Hair splitting is fun.

Surely it just can't get any better than this. We're done here.

>But my pettiness knows no bounds.

I'm dead. Wrap me up.


The more I learn about text representation and Unicode, the more it looks like a complete clusterfuck, and it boggles my mind that somehow all this works almost perfectly while hiding all the complexities from the end user.

I suppose this is inevitable when you're tasked with representing literally every symbol in existence. You couldn't pay me enough to touch this problem with a ten-foot pole (this and text rendering).


It’s not a clusterfuck and IMHO it’s an unfair characterization. It is insanely complicated and shouldn’t be touched except when wearing appropriate hazmat gear.

Writing seems simple — children do it routinely — but like a biological system it evolved over millennia in a ton of different directions. It's coupled with emotional, practical, and even, yes, moral concerns that operate on both deeply personal and social levels. This is hard to capture in software.

Unicode made a couple of hard decisions right up front. I hate them but they were smart, and Unicode would not have survived had they not made them. One was round-tripping with legacy character sets, which meant encoding a lot of redundant characters (English and German “A” have the same code point, but Greek “A” and Russian “A” do not, nor does an “A” that appears in a Japanese code table). Second was abandoning attempts at Han unification, which had its own linguistic, emotional and political issues.
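
That first decision is easy to see from code (a Python sketch): visually identical letters from different legacy character sets kept distinct code points.

    import unicodedata

    for ch in "AΑА":                          # Latin, Greek, and Cyrillic capitals
        print(hex(ord(ch)), unicodedata.name(ch))
    # 0x41  LATIN CAPITAL LETTER A
    # 0x391 GREEK CAPITAL LETTER ALPHA
    # 0x410 CYRILLIC CAPITAL LETTER A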

People are complicated and so are their languages so wrestling the whole thing into a tractable system has been worth the effort.


> abandoning attempts at Han unification

Huh? Han unification happened.


It is quite different from what was originally proposed, but you are right, I should not have phrased it that way.


While the goal and work of Unicode are admirable, I can't help but fear that they're setting themselves up for future problems. Take for example flag emojis [1]. At first it seems "just" complicated. But then it starts to become problematic: what happens when a country changes flags? What happens when a country ceases to exist? Or splits? Or merges into another? What about when there are flag disputes?

Imagine if Unicode had to start dealing with the kind of temporal change that, for example, the Olson TZ database [2] has to!

[1] https://shkspr.mobi/blog/2019/06/quirks-and-limitations-of-e...

[2] https://en.wikipedia.org/wiki/Tz_database


This is already a non-issue. Unicode doesn't assign a separate code point to any flag. Each flag is represented by a two-character ISO code using regional indicator symbols (such as IN for the Indian flag).

https://en.wikipedia.org/wiki/Regional_Indicator_Symbol


That's part of the point. Are we prepared to track those changes across time? What if there's an article written today with (Unicode Hong Kong flag) or (Unicode Crimean flag)? Those articles might mean to express something in a context where HK is a certain independent entity, or Crimea is Ukrainian. What if that article is displayed with a Chinese and Russian flag 10 years from now?


Technically they already have a flag dispute, over Taiwan, as I recall. Thankfully for the Unicode consortium they’ve managed to leave the implementation problems that causes to the vendors.


As someone who used to work in country list related things, the existence of Taiwan as a country flag codepoint at all would be an issue for China. China will complain if you include Taiwan, and Taiwan will complain if omitted, so it's not fun appeasing both sides.


I imagine so, but it exists.


Agreed, just pointing it out. In our system, we had to display 'Taiwan, Province of China' for the Chinese users, and 'Taiwan' to everyone else, though that was just UI and the backend treated it identically.


Usually it works perfectly. Sometimes it doesn't. I'm occasionally stunned by how such a fundamental thing as text representation can be ruined by obscure encoding issues. For example, there is absolutely no way to be certain of the character encoding of binary string data unless it is stored as metadata somewhere. Unicode attempts to solve this with the byte order mark (BOM). If present, we can know that a string is Unicode-encoded, and whether it's big-endian or little-endian. However, the BOM is optional, so we can never be sure a string is Unicode.
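
A small Python illustration of what the BOM buys you, and what its absence costs:

    import codecs

    data = codecs.BOM_UTF16_LE + "Ⅷ".encode("utf-16-le")
    print(data)                                  # b'\xff\xfeg!' - the FF FE prefix marks UTF-16 LE
    print(data.startswith(codecs.BOM_UTF16_LE))  # True: the encoding can be inferred
    print("Ⅷ".encode("utf-16-le"))              # b'g!' - without the BOM, just opaque bytes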

One example of how this is a huge clusterfuck is that, until recently, Windows Notepad opened and saved everything with the Win-1252 encoding (labeled as ANSI in the app). The web and the other popular OSes, on the other hand, have standardized on UTF-8. So if you download a .txt file from the web (or copy one from another OS) without a BOM and open it in Notepad, you can get characters that looked right in your browser but not in Notepad.
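
The mismatch is trivial to reproduce (a Python sketch):

    data = "Jägerstraße".encode("utf-8")   # the bytes a web server would send
    print(data.decode("cp1252"))           # 'JÃ¤gerstraÃŸe' - the classic mojibake old Notepad showed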

There are smart algorithms out there that can detect character encoding pretty well, but none of them are perfect (as far as I know).

The Win-1252 default and the fact that most computer users have no idea about character encoding have caused all sorts of headaches for me with the reporting software I work on.


I wouldn't call that an obscure encoding issue, but an absolutely fundamental one. Absent meta information, you can never be sure that some text (or actually any data) is in a specific encoding (at best, you can be sure that it is not in some specific encoding). As an (indirect) illustration, see polyglots (programs that are valid programs in multiple programming languages simultaneously):

https://en.wikipedia.org/wiki/Polyglot_(computing)


You’re right. I just meant obscure from the end-user perspective. It’s not clear what’s wrong to uneducated users, only that their text looks weird.


It’s often the simple, fundamental things (text, time, images) that we think are easy to implement, but in fact are incredibly complex under the hood. It’s often the human “decentralization” that causes all the quirks and oddities that make things difficult to get right. Text encoding is a good example, date and time another one. Both actually have much more in common than you would think.


Perhaps it looks like it works "almost perfectly" because you're only using English (and similar Western languages)? The problems that arise in Asian text are numerous -- and they do frequently hit end users.


Historically, computing has been the backbone of bureaucracies for a very long time, and as bureaucracies do, they make people bend to their rules, and thus to the rules of computing. I'm German; the German alphabet is exactly the same as the English one, except we have four extra letters: äöü, the friendly umlauts, and ß. Since a lot of older computing systems did not handle these (7-bit ASCII or mainframe character sets), computing bent the language instead. Jägerstraße => Jaegerstrasse. A lot of unixish software doesn't handle spaces in names and such. People bowed to that as well.

The idea that computers should support cultures, and not the other way around, is pretty recent.


I'm not German, but afaik the spelling reform of 1996 that introduced ss as an always-alternative for ß was mainly aimed at simplification and unification. Do you have any support for your statement that it was because of insufficient support by IT systems?


Such an always-alternative doesn't exist. ß was changed to ss at the end of short syllables, that is all. I think there is a rule to always use ss instead of ß (and ae instead of ä, etc.), when ß is not available. But that wasn't introduced in 1996, that is way older and less relevant today than it used to be.

Gruß, stkdump


> computing bent the language instead. Jägerstraße => Jaegerstrasse.

Um, no. The words were originally written that way. Ä, ö, ü and ß actually developed from ligatures for ae, oe, ue and ss, long before computers were a thing.


What's your point exactly? Umlauts as we know them have been used for a few hundred years (hard to pin-point, because öäü evolved in "casual" hand-writing, not printing or books) before computing came along, so were clearly how the language worked before. The motive force for using AE and SS in computing was clearly that computers commonly didn't support it, not because people thought suddenly writing like this again a couple hundred years later would be fun.


Originally ß was just a ligature for ss (ſs) but it since developed its own meaning. ß indicates that the preceding vowel is long. Busse and Buße are pronounced differently and mean different things. The conversion ß -> ss destroys information that was present in the original orthography.
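
Indeed, Unicode itself preserves the distinction; it is particular mappings that discard it (a Python sketch):

    import unicodedata

    print(unicodedata.normalize("NFKC", "Buße"))  # 'Buße'  - NFKC keeps ß
    print("Buße".lower())                         # 'buße'  - so does lowercasing
    print("Buße".casefold())                      # 'busse' - full case folding applies ß -> ss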


> The more I learn about X and Y, the more it looks like a complete clusterfuck, and it boggles my mind that somehow all this works almost perfectly while hiding all the complexities from the end user.

I modified your first sentence to make it more generic and applicable to many other things in software.


As a frequent user of Unicode, for Chinese and Japanese, I sorta go along with it, but there's no arguing that we'd be closer to flying cars and jetpack commutes if computers just used ASCII.

Barring that, we could have all used UTF-8, but Windows really screwed that up, and none of the arguments for 16-bit alignment vs 8-bit alignment for processing really hold water.


Microsoft implemented UTF-16 before UTF-8 was even invented, and certainly before it came into widespread use.


Technically speaking no, they didn't have UTF-16 but rather UCS-2. UTF-8 was fully specified in 1992, and called UTF-8 in 1993, while it took until 1996 before UTF-16 was specified; whereas Windows NT 3.1 was released with UCS-2 in 1993.
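
The difference matters as soon as you leave the Basic Multilingual Plane, which UCS-2 cannot address at all (a Python sketch):

    ch = "\U0001F600"                         # U+1F600, beyond UCS-2's 16-bit range
    print(len(ch))                            # 1 code point
    print(len(ch.encode("utf-16-le")) // 2)   # 2 UTF-16 code units - a surrogate pair
    print(len(ch.encode("utf-8")))            # 4 bytes in UTF-8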


For brevity and wit, the most impressive email address I ever saw was up@3.am - and the most impressive website was http://ws./

Sadly, ws. is not serving right now, it seems. I had no idea bare ccTLDs could vend an A record, but there you go; I guess technically that means the root servers themselves could vend A records, which would let you have the ultimate website at http://./


A friend of mine, Garrett Smith, has g@rre.tt, which always impressed me.

I’m still waiting for someone to launch a .ux domain so I can grab macint.ux


Two letter TLDs are reserved for country codes (ccTLD), so barring policy change and ignoring minor subtleties in the actual rules, .ux will only happen if a new country comes around and gets assigned the ISO 3166-1 alpha-2 country code “ux”.

A more probable course of events would be “tux” being registered as a new gTLD.


Well, for the low low price of $185k and a bunch of paperwork you could be the proud new owner of .tux:

https://newgtlds.icann.org/en/applicants/agb/guidebook-full-...

And it looks like that TLD is not currently issued:

https://data.iana.org/TLD/tlds-alpha-by-domain.txt


I oft rue my failure to be vastly wealthy.


http://ai./ as well


Top-level domains or "dotless domains" are generally prohibited from resolving. SAC-053 explains why: https://www.icann.org/en/system/files/files/sac-053-en.pdf

There are, of course, exceptions: the policy was put in place after the creation of ccTLDs and does not apply to them, so a few ccTLDs get to break the rule.


Congratulations -- that's a really clever hack. Exactly the kind of thing I love to read about on this site.


You seem to use "character" and "codepoint" interchangeably, but it's important to note that you are not saving on bytes.

You mention this can be used to avoid filters, so I guess this is specifically to trick "string".length?


You are quite right. Ⅷ in UTF-8 is 0xE2 0x85 0xA7

That's still shorter than V I I I though.
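
Easily verified from Python:

    print("Ⅷ".encode("utf-8"))     # b'\xe2\x85\xa7' - 3 bytes
    print("VIII".encode("utf-8"))   # b'VIII' - 4 bytes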


Is the other character you hint at ㎉?


Yes!


How about $12, like https://xn--bj8a.com or ꑮ.com (which currently resolves in Safari on macOS and iOS only).


I can highlight the domain and open it fine in Firefox on Android. It doesn't render as the Unicode glyph in the address bar after resolution, though.


Also resolves in Firefox on macOS


Resolves and loads fine on Firefox and Chrome on Windows 10


This is a cool looking website, but I don't think I'd ever use their service without knowing the people behind it or even where their office is located. Also I believe without an imprint this site is not GDPR compliant.

The domain resolved for me in Chrome on macOS though!


Clicking this link crashes my HN client on iOS


If you search regularly, there are some cheap two-dot-two domains you'll find that are ASCII. A bit longer in string length but shorter on the wire.

I own 0e.vc, which is on GitHub as a general purpose xss domain if you need it. Iirc it does eval(window.name), or location.hash. Whatever works for you. It’s also on the public suffix list which makes it almost like a top level domain for security purposes. So I can have subdomains that can’t ever share cookies :-))


> It’s also on the public suffix list

A bit of a tangent, but: how and why?


How: file an issue at https://github.com/publicsuffix/

Why: it's fun. Also, web security testing gets easier when you can make pages for all likely and unlikely scenarios (cross-site/domain/origin).


Search manually, or is there a tool?


Found one myself:

https://ahreflink.com/domains/two-letter

Seems there are plenty available.


Technically, the shortest domain name is 2 letters:

http://dk


Also http://ai./ which doesn't redirect you anywhere.


Does this mean you (theoretically) can’t name a host "dk" on your local network? That seems bad (even if naming hosts after TLDs is a bad idea).


Technically, you should be able to have your local host resolve first. To avoid it being resolved locally and use the TLD instead, use the trailing dot: http://dk./ https://news.ycombinator.com./ This makes it an FQDN, just as you would enter it in your DNS records. In practice, people tend to forget about it, and accessing websites this way sometimes breaks in unexpected ways when the server doesn't expect the trailing dot in the Host header.

http://www.dns-sd.org./TrailingDotsInDomainNames.html


You shouldn't be doing this anyways. There are reserved TLDs for this like .corp

https://tools.ietf.org/id/draft-chapin-rfc2606bis-00.html#ne...


.va used to have an SMTP server.


Technically it could be shorter still, if only either . had an A RRset, or there were 1-character TLDs. Just because those aren't true today doesn't mean they can't ever be true.


Chrome won't load that without the trailing dot: http://dk.


Loads fine for me without dot in Chrome (tested on Linux and Android).


Is http://. possible?


Apparently it is possible to do it with local DNS resolution, using your hosts file. Not sure how possible it would be on a remote DNS server, or whose authority you'd need to actually do it.

    [root@host ~]$ curl -v http://.
    * About to connect() to . port 80 (#0)
    *   Trying 127.0.0.1...
    * Connected to . (127.0.0.1) port 80 (#0)
    > GET / HTTP/1.1
    > User-Agent: curl/7.29.0
    > Host: .
    > Accept: */*
    >
    < HTTP/1.1 400 Bad Request
    < Date: Sat, 15 Aug 2020 15:38:54 GMT
    < Server: Apache
    < Content-Length: 347
    < Connection: close
    < Content-Type: text/html; charset=iso-8859-1
    <
    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <html><head>
    <title>400 Bad Request</title>
    </head><body>
    <h1>Bad Request</h1>
    <p>Your browser sent a request that this server could not understand.<br />
    </p>
    <p>Additionally, a 400 Bad Request
    error was encountered while trying to use an ErrorDocument to handle the request.</p>
    </body></html>
    * Closing connection 0


Yes, if the root zone (".") published a CNAME or A record. You can simulate this yourself by setting your resolver as the master for "." and giving it an A record. Note that most webservers will barf if you try to send "." in the Host header unless you explicitly configured it.

    # cat named.conf.local
    zone "." {
     type master;
     file "/etc/bind/db.root";
    };

    # cat db.root 
    $TTL 60
    @ IN SOA localhost. root.localhost. (
             1  ; Serial
        604800  ; Refresh
         86400  ; Retry
       2419200  ; Expire
        604800 ) ; Negative Cache TTL
    ;
    @ IN NS .
    @ IN A 209.216.230.240

    # dig a .
    ...
    ;; QUESTION SECTION:
    ;.    IN A
    ;; ANSWER SECTION:
    .   60 IN A 209.216.230.240

    # wget http://./ 
    --2020-08-17 09:38:26--  http://./
    Resolving . (.)... 209.216.230.240
    Connecting to . (.)|209.216.230.240|:80... connected.
    HTTP request sent, awaiting response... 400 Bad Request
    2020-08-17 09:38:26 ERROR 400: Bad Request.


My thoughts also.

Doing a dig, there are no A records for ".", but they do provide NS records, obviously, for all the other TLDs to resolve.

The devilish detail is probably in the RFC, and likely implementation-specific too, as to whether resolvers accept it as a valid request or fail before trying.

    dig A .

    ; <<>> DiG 9.11.5-P4-5.1-Debian <<>> A .
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 26280
    ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

    ;; QUESTION SECTION:
    ;.    IN A

    ;; Query time: 3 msec
    ;; SERVER: 192.168.1.254#53(192.168.1.254)
    ;; WHEN: Sat Aug 15 18:19:41 BST 2020
    ;; MSG SIZE  rcvd: 17


How does this work? Also, why can't nslookup resolve it?


Domains are split by the dot, from right to left. What is not visible is that the top-level domain is not really the first one: the first one is the root domain (or root zone), which is represented by an empty string. You can try it on any domain: add a dot at the end. It will resolve, but the web server may not be configured to respond to it with the right website.

If you want to query a DNS server for a top-level domain, you have to query the DNS server that handles it, which in our case is one of the 13 root DNS servers.

read more here: https://en.m.wikipedia.org/wiki/DNS_root_zone

https://en.m.wikipedia.org/wiki/Root_name_server


Oh! Now that's nifty. How long have they had that?


DONKEY KONG


What about ﷺ? Arabic letters in domains should be OK, right? It decomposes into a lot(!) of characters.


It does - but DNS is still limited to ASCII. So those characters would go through Punycode. I'm not sure if the browser would decompose it though.
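
The expansion is dramatic (a Python sketch):

    import unicodedata

    s = "\uFDFA"                               # ﷺ ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
    expanded = unicodedata.normalize("NFKC", s)
    print(expanded)                            # 'صلى الله عليه وسلم'
    print(len(expanded))                       # 18 code points from a single character
    # The expansion contains spaces, so it could never survive as a DNS label anyway.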


Here are the rules for the .sa TLD https://www.iana.org/domains/idn-tables/tables/man_ar_1.0.tx...

I suppose other TLDs could have different rules. But in general you'd want to have a canonicalization step so you don't have two domains that are just different ways of composing the same thing.


It looks like .man


Oh true... I must have misread the comment I copied it from.


I tried to go one character smaller by using the composite Unicode character "⒕", in an attempt to eliminate the ".". I combined it with the "ms" TLD, which can also be represented as a single character. Unfortunately, all the browsers I tested refuse to treat it as a domain name without a full stop, so I'm stuck using the three-character http://xn--1rh.xn--ryk/ instead of the two-character http://xn--6sh056c/ I was hoping for.

On the bright side, even the three-character version is unlinkable on Facebook -- it just redirects me to http://invalid.invalid/. I'll take that as a win. I still managed to get a pretty cool domain name for around $40, and it was definitely fun to mess around with this idea.

EDIT:

Interestingly, HN has automatically punycoded the URLs. That should be ⑭.㎳ and ⒕㎳
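
The intended trick is visible from plain Python; NFKC is happy to do it, browsers just refuse the mapping (presumably because it would smuggle in a label separator):

    import unicodedata

    print(unicodedata.normalize("NFKC", "⒕㎳"))  # '14.ms' - the dot comes along for free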


It's not clear what domain he actually registered.

xn--jxb.fi would appear to be unregistered.

(U+2167) (U+002E) (U+FB01) is not a valid domain name.

viii.fi (all ASCII) is registered, and the registrar is Gandi. But so what?

I don't get it.


Hi Andy. I'm OP. As you spotted, I registered viii.fi with Gandi.

When your browser sees the Ⅷ character, it performs the IDNA2008 process described in RFC 5895 to normalise it. The same thing happens in OS tools like dig and ping.

Hope that makes a bit more sense.
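
For anyone who wants to reproduce this: the third-party Python idna package implements IDNA2008, and its uts46 flag applies the compatibility mapping step that browsers perform (a sketch):

    import idna  # pip install idna

    print(idna.encode("Ⅷ.fi", uts46=True))   # b'viii.fi' - mapped, normalised, ASCII
    print(idna.encode("viii.fi"))             # b'viii.fi' - already plain ASCII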


Yeah, I read the article and just now went back and read the Minimal Viable XSS article linked as well. I am also rather puzzled.

This seems to be only useful if you manage to find a website with an XSS flaw and one that also limits the input to 20 characters? Are these situations really common enough to warrant this attack? It all seems rather arbitrary to me.


I'm a rather amateur bug hunter but, yes, some sites do use string length limitations as a way of filtering out dodgy code.

They shouldn't - but they do.

(See https://www.openbugbounty.org/researchers/edent/)


Note that you are responsible for checking some company and trademark registers for clashes with existing names before registering a .fi domain. With a quick check there is at least one trademark close enough that it could theoretically cause you a headache, though I doubt they care.

My brother's old .fi domain was taken by a company with the same name (because they have priority) and they didn't even do anything with it...


Where do you check for such things?


I own n.nu - is that the shortest domain possible?


Isn't i.nu a little narrower, at least?


We only use monospace fonts around these parts


https://dk

I think there might be one or two other domains that have a host as well. But beating two letters is difficult.


I own a four-character domain too, but the domain part is a single Unicode character, and HN seems to strip it from text here.

The punycode version is xn—x6h.ws


http://xn--x6h.ws

Recycling Symbol for Type-4 Plastics


Yep! I chose it as the domain’s total character length is 4 (when visible in a browser obviously, not in punycode form) and the recycling symbol felt appropriate for url shortening. I hosted it on Bitly (hence you see a Bitly landing page if the short url doesn’t exist).

I don’t use it as much anymore because of the previously mentioned Unicode filtering some sites use, like HN.


Ironically, something has autosubstituted an em dash "—" for your original double hyphens...


You can use emoji with a .to domain.

Does that count?


It's not as hard or expensive as you say to find two-character domains on two-character TLDs. You may want to do some better research. Hint: search on hn.algolia for "short domain names" :)


4-character names were readily available when Norway eased regulations. I bought a couple just last year when .no opened for 2-letter domains. Some are still available, I believe.


Do you do anything with them or are you just keeping them for fun or future profits?


If you don't mind mixing one letter with one number, and you choose a 2-letter TLD, you obviously can have a very short and still inexpensive domain.


In French, Ⅷ.fi is pronounced "wifi".


Is there a list somewhere of which symbols decompose into other characters (other than that 4-character symbol example)?


I tried opening viii.fi (no special characters), and that also directs to this website. How did that happen?


Because `viii.fi` is indeed what OP registered. They are counting on browsers to turn `Ⅷ.fi` into `viii.fi` before resolving, by running `Ⅷ.fi` through a Unicode normalization routine first.

That may be a standard thing to do with Unicode in domain names, running it through the standard Unicode normalization first? Understanding what browsers are "supposed" to do with Unicode in domain names (and URLs generally) is very confusing for me.

I would be curious to learn more about what standards govern how browsers handle Unicode in domain names, the history of it, how compliant browsers are, etc. I also don't entirely understand the goal here -- the original `Ⅷ.fi` isn't actually only two bytes in any encoding... what is the value of having something that shows up as two "glyphs" even though it's more bytes and normalizes to something else with a yet different number of bytes?


For your weekend reading pleasure:

Internationalized Domain Names (IDN) FAQ (https://unicode.org/faq/idn.html)

Unicode® Technical Standard #46: UNICODE IDNA COMPATIBILITY PROCESSING (https://www.unicode.org/reports/tr46/)

Internationalized Domain Names for Applications (IDNA): Background, Explanation, and Rationale (http://tools.ietf.org/html/rfc5894)

Internationalized Domain Names for Applications (IDNA): Definitions and Document Framework (http://tools.ietf.org/html/rfc5890)

Internationalized Domain Names in Applications (IDNA) Protocol (http://tools.ietf.org/html/rfc5891)

The Unicode Code Points and Internationalized Domain Names for Applications (IDNA) (http://tools.ietf.org/html/rfc5892)

Right-to-Left Scripts for Internationalized Domain Names for Applications (IDNA) (http://tools.ietf.org/html/rfc5893)


Some sites limit the number of bytes of input. Some sites limit the number of Unicode characters of input. For instance, Twitter's 280-character (formerly 140-character) limit is Unicode characters, not bytes.


I don't get it then.

Since the browser already turns it into the ASCII form before resolving, how would it help with XSS against a server-side max-length limitation, as he mentioned in his other article, "Minimum Viable XSS" [1]?

[1] https://shkspr.mobi/blog/2016/03/minimum-viable-xss/


Yeah, I don't really get that either; I don't understand the value of these normalizing domain names for XSS.


That browser behavior is literally the entire point of the article!


Please keep the guidelines[0] in mind when posting and commenting. Thanks.

[0]: https://news.ycombinator.com/newsguidelines.html


Is there any site where I could get a list of single-character or double-character domain names?


Speaking more maliciously about Unicode: I know that there are ways to attack domain names, for example replacing the o in Google with a Cyrillic о, or some other character that looks like o, for the purposes of phishing.
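
A Python sketch of the trick:

    spoof = "g\u043e\u043egle.com"    # both 'o's are Cyrillic о (U+043E)
    print(spoof)                      # renders as gооgle.com
    print(spoof == "google.com")      # False - visually identical, different code points
    print(spoof.encode("idna"))       # an xn-- Punycode form, visibly not google.com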



