
> This is often abused by hackers to disguise file extensions: when using it in the file name my-text.'U+202E'cod.exe, the file name is actually displayed as my-text.exe.doc

So every programmer has to know about and support U+202E, but not filesystem programmers?




More like UI programmers? It seems that almost everyone has agreed that text-processing smarts inside a filesystem are a bad idea (see: the NTFS collation table, the APFS transition away from ancient-version-NFD-but-not-quite), although there is that island of (admittedly very smart) -insensitive but -preserving holdouts (casing on Windows, normalization on ZFS). Linus rants on the topic[1] passionately, if not very informatively.
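For readers unfamiliar with the "-insensitive but -preserving" idea, here is a rough Python sketch. It is not how NTFS or ZFS actually implement it (they use their own tables and rules); the `Directory` class and `lookup_key` helper are hypothetical names, and NFC plus `casefold()` is a simplification of real caseless matching.

```python
import unicodedata


def lookup_key(name: str) -> str:
    # Hypothetical comparison key: normalize, then fold case.
    # Real filesystems use their own (often versioned) tables instead.
    return unicodedata.normalize("NFC", name).casefold()


class Directory:
    """Toy directory: lookups ignore case and normalization form,
    but the name is stored exactly as the creator spelled it."""

    def __init__(self):
        self._entries = {}  # comparison key -> name as originally given

    def create(self, name: str) -> None:
        key = lookup_key(name)
        if key in self._entries:
            raise FileExistsError(name)
        self._entries[key] = name

    def resolve(self, name: str) -> str:
        # Raises KeyError if nothing matches.
        return self._entries[lookup_key(name)]


d = Directory()
original = unicodedata.normalize("NFC", "R\u00e9sum\u00e9.TXT")
d.create(original)

# A differently cased, NFD-decomposed spelling still finds the entry,
# and the stored spelling comes back unchanged.
query = unicodedata.normalize("NFD", "r\u00e9sum\u00e9.txt")
found = d.resolve(query)
```

The cost is that the filesystem now depends on a particular version of the Unicode data, which is the kind of "smarts" being argued against above.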

Note that U+202E is a control code that has effect on display, not the logical order of the text (much like, say, a bare CR), so I can’t say what the filesystem is doing wrong here (except maybe for not rejecting this outright, but see re smarts above, this probably needs to be done on a higher level). You don’t blame the filesystem for believing the filename "A\rB.txt" starts with A and not B, do you? Even though ls will say otherwise.
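To make the display-versus-logical-order point concrete, a minimal Python sketch (Python is my choice here, nothing in the thread depends on it). It builds the name from the quoted example and looks at it the way a byte- or code-point-level consumer would.

```python
# U+202E RIGHT-TO-LEFT OVERRIDE affects how the following text is rendered,
# not the order in which the code points are stored.
RLO = "\u202e"

name = "my-text." + RLO + "cod.exe"

# Logical (stored) order: the name really does end in ".exe".
logical_suffix = name[-4:]
ends_with_exe = name.endswith(".exe")

# The override is just another code point in the string (3 bytes in UTF-8).
rlo_index = name.index(RLO)
encoded = name.encode("utf-8")

# Same idea with a bare carriage return: a terminal may show "B.txt",
# but the stored name starts with "A".
cr_name = "A\rB.txt"
starts_with_a = cr_name.startswith("A")

# ascii() escapes the control character instead of letting it be rendered.
escaped = ascii(name)
print(escaped)
```

Whether the rendered name looks like it ends in "doc" depends entirely on the renderer; the stored string never changes.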

Bidi IRIs (which are at that higher level) are kind of horrendous, though.

[1] https://yarchive.net/comp/linux/utf8.html


That's pretty much correct. Most of the filesystems I'm aware of just treat filenames as a "string of bytes" with some list of characters that aren't allowed, and perhaps a few other rules. Other than that, it's a free-for-all on names.
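As a rough illustration of the "string of bytes plus a short deny list" model, here is a Python sketch. The specific rules (no `/`, no NUL, `.` and `..` reserved, at most 255 bytes) are assumed POSIX-style conventions, not the rules of any particular filesystem, and `is_valid_component` is a made-up name.

```python
def is_valid_component(name: bytes, max_len: int = 255) -> bool:
    """Check one path component under assumed POSIX-style rules.

    Note what is NOT checked: encoding, normalization, bidi controls.
    The name is treated as opaque bytes.
    """
    if not name or len(name) > max_len:
        return False
    if name in (b".", b".."):
        return False
    return b"/" not in name and b"\x00" not in name


# The disguised name from the article passes, since U+202E is just bytes here.
disguised_ok = is_valid_component("my-text.\u202ecod.exe".encode("utf-8"))

# Bytes that are not even valid UTF-8 pass too.
garbage_ok = is_valid_component(b"\xff\xfe.txt")

# Only the structural separators are rejected.
slash_ok = is_valid_component(b"a/b")
```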


What do you want the filesystem programmer to do?


> What do you want the filesystem programmer to do?

Replace:

    if(bytestring_ends_with(filename, ".exe")) execute_file(...);

By:

    if(last_displayed_glyphs_equal(filename, ".exe")) execute_file(...);


The filesystem isn't executing anything, so if anything you'd want the file manager or shell programmer to handle it. But yours is a terrible solution: it would require everyone else interacting with the filesystem to handle it too. Better to adjust the display code to treat extensions specially (if it doesn't already) and make sure that it is clear to the user what the real extension is.
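One way the display-side fix could look, sketched in Python under my own assumptions: format and control characters (Unicode categories Cf and Cc, which include U+202E) are shown as visible escapes, and the extension is taken from the logical order and shown separately. The names `display_name` and `_escape` are hypothetical, not from any real file manager.

```python
import os
import unicodedata


def _escape(text: str) -> str:
    # Replace invisible format/control characters with a visible marker.
    return "".join(
        f"<U+{ord(ch):04X}>" if unicodedata.category(ch) in ("Cf", "Cc") else ch
        for ch in text
    )


def display_name(filename: str):
    """Return (safe_name, real_extension) for showing to a user."""
    # splitext works on the stored (logical) order, so it sees the
    # extension the OS will act on, regardless of rendering.
    real_ext = os.path.splitext(filename)[1]
    return _escape(filename), _escape(real_ext)


shown, ext = display_name("my-text.\u202ecod.exe")
print(f"{shown}  [type: {ext}]")
```

Escaping every Cf character is deliberately blunt: it would also mangle legitimate names that rely on joiners such as U+200D, so a real UI would probably want a narrower list or a different presentation.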


    if (!isascii(c)) panic("stupid user");


  если (!кои(с)) авост(«тупой оператор»);

You wouldn’t want to live in that world, would you? I know I wouldn’t, and I have that as my native script and most of my filesystem in Latin. I spent my childhood with a computer that ran a VGA-chargen-reprogramming hack at startup, and later had to maintain a website stored in an encoding designed to preserve legibility after Latinization through amputation of the 8th bit (in case you’ve ever wondered where the illogical order of KOI-8 comes from). I do not want that world back, however fondly I remember my 286.


I probably wouldn't mind it if it were the lingua franca in computing.


> I probably wouldn't mind it if it were the lingua franca in computing.

And in programming, I don’t! It’s more like a weird pidgin lingua celto-germano-franca with funky morphology, but I love it nevertheless. I’ve read the Unicode identifiers spec, and frankly, however much I like my Agda with that special Unicode maths sauce, I’m not sure I’d be better off with that in my compiler.

An old and grizzled plant worker who needs a new computer-operated lathe, though, will rightfully tell me to take a hike if I try to sell him a machine that only speaks and accepts a foreign language, and his boss will support him. (It depends on the country: a French person will look down on you if you don’t try to speak their native language to them, and a Norse one will think you’re looking down on them if you do.) I might be able to hold out for a couple of decades, but ultimately, my computer will speak the lingua franca to computing professionals and the native language to users, or somebody else will build one that does.

This means user-facing, user-specified identifiers such as file names will need to support at least these two languages and, given a requirement for exchanging data in a global network, essentially every other one as well. You might try to tell users they’re supposed to use some other kind of identifier, but given these are still going to need to be human-readable, integrity-critical, equality-supporting, globally-exchangeable identifiers, I don’t see how that does anything except rename the problem.


The same works for URLs.



