so.... in theory you should be able to create several visually identical links t...

kccqzy · 2025-02-12T18:57:04 1739386624

"Visually identical" is never good enough. Have you heard of attacks confusing Latin letters and Cyrillic letters? For example C versus С. (The latter is known as CYRILLIC CAPITAL LETTER ES.) Have you heard of NFC forms versus NFD forms? For example é versus é (LATIN SMALL LETTER E + COMBINING ACUTE ACCENT versus LATIN SMALL LETTER E WITH ACUTE.)

Nothing that's important when it comes to security and privacy should rely on a "visually identical" check. Fortunately browsers these days are already good at this; their address bars use Punycode for the domain and percent-encoding for the rest of the URL.
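
A minimal sketch of the two confusions named in this comment, using only Python's standard `unicodedata` module. The variable and function names are illustrative and not from the thread.

```python
import unicodedata

latin_c = "C"            # U+0043 LATIN CAPITAL LETTER C
cyrillic_es = "\u0421"   # U+0421 CYRILLIC CAPITAL LETTER ES

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE (NFC form)
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT (NFD form)


def describe(text):
    """Return the Unicode name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]


# The two letters render alike but are different code points.
print(latin_c == cyrillic_es)      # False
print(describe(cyrillic_es))       # ['CYRILLIC CAPITAL LETTER ES']

# The two spellings of the accented letter differ until they are normalized.
print(composed == decomposed)                                 # False
print(unicodedata.normalize("NFC", decomposed) == composed)   # True

# Normalization does not help with the Cyrillic letter: it stays Cyrillic.
print(unicodedata.normalize("NFKC", cyrillic_es) == latin_c)  # False
```

Normalization removes the second ambiguity but not the first, which is why a comparison has to look at code points rather than at rendered glyphs.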

moody__ · 2025-02-12T22:27:41 1739399261

As the sibling comment mentioned, Unicode in DNS uses Punycode encoding, but beyond that, the standard specifies that the Unicode data must be normalized to NFC[0] before being converted to Punycode. This means that your second example (decomposed e with combining acute accent vs. the composed variant) is not a valid concern. The Cyrillic one is, however.

[0] https://www.rfc-editor.org/rfc/rfc5891 § 4.1 "By the time a string enters the IDNA registration process as described in this specification, it MUST be in Unicode and in Normalization Form C"
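
A rough sketch of "normalize to NFC, then Punycode" using Python's built-in `punycode` codec. The helper `to_a_label` is hypothetical, and it deliberately skips the rest of IDNA (case mapping, code point validity checks, length limits), so it only illustrates the ordering of the two steps.

```python
import unicodedata


def to_a_label(label: str) -> str:
    """Normalize one DNS label to NFC, then Punycode-encode it with the xn-- prefix.

    Simplified on purpose: real IDNA processing applies more rules than this.
    """
    normalized = unicodedata.normalize("NFC", label)
    if normalized.isascii():
        return normalized  # pure ASCII labels need no encoding
    return "xn--" + normalized.encode("punycode").decode("ascii")


nfc_label = to_a_label("caf\u00e9")    # precomposed e with acute
nfd_label = to_a_label("cafe\u0301")   # e followed by a combining acute accent
print(nfc_label == nfd_label)          # True: both spellings give the same label

latin = to_a_label("Cat")              # Latin C
cyrillic = to_a_label("\u0421at")      # Cyrillic ES in place of the C
print(latin == cyrillic)               # False: the lookalike is a different name
```

Under these assumptions the NFC/NFD pair collapses to one label, while the Cyrillic lookalike still produces a distinct one.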

kccqzy · 2025-02-13T01:10:55 1739409055

The OP said link. The NFC/NFD issue remains if these are part of a path name or query parameter.
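
To illustrate, here is a small sketch with `urllib.parse.quote` showing that the same visible path segment percent-encodes differently in NFC and NFD. The helper name `encoded_path` is made up for this example.

```python
import unicodedata
from urllib.parse import quote


def encoded_path(segment: str, form: str) -> str:
    """Normalize a path segment to the given form and percent-encode it as UTF-8."""
    normalized = unicodedata.normalize(form, segment)
    return quote("/" + normalized)


nfc = encoded_path("caf\u00e9", "NFC")
nfd = encoded_path("caf\u00e9", "NFD")

print(nfc)          # /caf%C3%A9
print(nfd)          # /cafe%CC%81
print(nfc == nfd)   # False: a server would see two different paths
```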

moody__ · 2025-02-13T04:29:37 1739420977

Sure, but I feel the security concerns there are much smaller than those of having multiple domain names with the same visual appearance that point to different servers. That has immediate impact for things like phishing, whereas lookalike path or query portions would at least ensure you are still connecting to the server that you think you are.

komboozcha · 2025-02-12T19:23:42 1739388222

Erm, DNS uses Punycode because it comes from a time when Unicode didn't exist, and bind assumes a grapheme has no more than one byte.

ale42 · 2025-02-12T21:57:57 1739397477

Yes, but I guess the message meant that browsers now detect homographs and display the Punycode instead. See also https://news.ycombinator.com/item?id=14130241; at that time Firefox wasn't fixed, but it has since fixed the issue too (there's a network.idn.punycode_cyrillic_confusables preference, which is enabled by default).

cscheid · 2025-02-12T14:28:29 1739370509

My understanding is that "weird" unicode code points become https://en.wikipedia.org/wiki/Punycode. I used the 󠅘󠅕󠅜󠅜󠅟 (copy-pasted from the post, presumably with the payload in it) to type a fake domain into Chrome, and the Punycode I got appeared to not have any of the encoding bits.

However, I then pasted the emoji into the _query_ part of a URL. I pointed it to my own website, and sure enough, I can definitely see the payload in the nginx logs. Yikes.

Edit: I pasted the very same Emoji that 'paulgb used in their post before the parenthetical in the first paragraph, but it seems HN scrubs those from comments.
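
A sketch of why the payload can show up server-side: an invisible variation selector in a query value still gets percent-encoded as UTF-8 bytes. The path `/search` and the parameter name `q` are placeholders, and the exact form written to an access log depends on the client and the server's log format.

```python
from urllib.parse import urlencode

# One visible letter followed by one invisible code point from the
# supplementary variation selector range (U+E0100..U+E01EF).
value = "x" + "\U000E0158"


def as_request_target(query_value: str) -> str:
    """Build the request target a client would send for this query value."""
    return "/search?" + urlencode({"q": query_value})


print(len(value))                 # 2 code points, but only "x" is visible
print(as_request_target(value))   # /search?q=x%F3%A0%85%98
```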

bmicraft · 2025-02-12T16:05:27 1739376327

domains get "punycode" encoded, URLs get "url encoded"[1], which should make Unicode characters stand out. That being said, browsers do accept some non-ASCII characters in URLs and convert them automatically, so theoretically you could put "invalid" characters into a link and have the browser convert it only after clicking. That might be a viable strategy.

[1] https://www.w3schools.com/tags//ref_urlencode.asp

echeese · 2025-02-12T18:23:52 1739384632

The emoji is gone but the content is still there.

riquito · 2025-02-12T15:50:43 1739375443

> I've always assumed links without any tracking information (unique hash, query params, etc) were safe to click(with regards to my privacy). but if this works for links I may need to revise my strategy regarding how to approach links sent to me.

Well, it was never safe: what you see and where the link points are different things. That's why the actual link is displayed at the bottom left of your browser when you move your mouse over it (or focus it via keyboard).
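
As a rough illustration of the gap between the visible text and the target, this sketch uses the standard `html.parser` module to list anchors whose text differs from their `href`. The class and function names, and the example domains, are invented for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (visible text, href) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def mismatched_links(html: str):
    """Return the anchors whose visible text is not identical to their target."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return [(text, href) for text, href in parser.links if text != href]


page = '<p><a href="https://tracker.example/r?id=42">https://example.com/</a></p>'
print(mismatched_links(page))
```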

dmbche · 2025-02-12T14:22:23 1739370143

You need to decode the text after copy-pasting it. I believe clicking on the text will not interact with the obfuscated data, since your computer will just find the Unicode and ignore the obfuscated data.

This is just so that you can hide data and send it to someone to be decoded (or for watermarking, as mentioned).
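
A sketch of the hide-then-decode round trip. It assumes one particular mapping that is not spelled out in this thread: bytes 0 to 15 go to U+FE00..U+FE0F and bytes 16 to 255 go to U+E0100..U+E01EF. All function names are hypothetical.

```python
def byte_to_selector(byte: int) -> str:
    """Map one byte to a variation selector (assumed mapping, see lead-in)."""
    if byte < 16:
        return chr(0xFE00 + byte)
    return chr(0xE0100 + byte - 16)


def selector_to_byte(ch: str):
    """Inverse mapping; returns None for any character that is not a selector."""
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:
        return cp - 0xE0100 + 16
    return None


def hide(carrier: str, payload: bytes) -> str:
    """Append the payload to the carrier text as invisible selectors."""
    return carrier + "".join(byte_to_selector(b) for b in payload)


def reveal(text: str) -> bytes:
    """Collect every selector in the text and turn it back into bytes."""
    decoded = (selector_to_byte(ch) for ch in text)
    return bytes(b for b in decoded if b is not None)


stego = hide("c", b"hello")
print(len(stego))      # 6 code points: the carrier plus five selectors
print(reveal(stego))   # b'hello'
```

Nothing is decoded unless someone runs the equivalent of `reveal` on the text, which matches the point that the recipient has to decode it deliberately.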

nzach · 2025-02-12T14:37:34 1739371054

yes, I understand this is not a security risk.

but my fear is precisely that I may be sending data to a remote host while I'm completely unaware of this fact.

I tried to create a POC with some popular URL shortener services, but it doesn't seem to work.

what I wanted to create was a link like <host.tld>/innoc󠅥󠅣󠅕󠅢󠄝󠅙󠅔󠄪󠅑󠅒󠅓ent that redirects to google.com. in this case the "c" contains some hidden data that will be sent to the server while the user is not aware. this seems possible with the correct piece of software.
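
One possible guard against this kind of link, as a sketch: count variation selectors in the path and query before following a URL, and optionally strip them. The host `host.example` and the function names are assumptions for illustration.

```python
from urllib.parse import unquote, urlsplit


def is_variation_selector(ch: str) -> bool:
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def hidden_selector_count(url: str) -> int:
    """Count variation selectors in the path and query of a URL."""
    parts = urlsplit(url)
    # Decode percent-escapes first so an already-encoded link is inspected too.
    decoded = unquote(parts.path + "?" + parts.query)
    return sum(1 for ch in decoded if is_variation_selector(ch))


def strip_selectors(url: str) -> str:
    """Drop raw variation selectors from a URL string.

    Note: this also removes legitimate ones, such as U+FE0F after an emoji.
    """
    return "".join(ch for ch in url if not is_variation_selector(ch))


suspicious = "https://host.example/innoc" + "\U000E0158\U000E0155" + "ent"
print(hidden_selector_count(suspicious))   # 2
print(strip_selectors(suspicious))         # https://host.example/innocent
```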

cess11 · 2025-02-12T14:23:11 1739370191

HTML entity encoding will show the hidden content, try with https://mothereff.in/html-entities.
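
The same idea can be sketched locally with Python's `xmlcharrefreplace` error handler, which turns every non-ASCII code point into a numeric character reference. The function name is illustrative.

```python
def to_html_entities(text: str) -> str:
    """Replace every non-ASCII code point with a decimal character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# A visible "c" followed by one invisible variation selector (U+E0158).
print(to_html_entities("c\U000E0158"))   # c&#917848;
print(to_html_entities("plain"))         # plain
```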

layer8 · 2025-02-12T16:09:47 1739376587

URIs with non-ASCII characters are technically invalid. Browsers and the like should (but likely don’t all do) percent-encode any invalid characters for display if they accept such invalid URIs.

password4321 · 2025-02-12T17:50:21 1739382621

This tool and idea are sketchy AF: https://github.com/zws-im/zws

("Shorten URLs using invisible spaces")