The hell that is filename encoding (2016) (beets.io)
155 points by kristjansson on May 4, 2018 | 118 comments



Funny(???) warstory:

1. Back in the days, we were using a Linux NFS server, with NFSv3, and out-of-the-box locale was iso-8859-1 (latin1). Life was good, except for occasional problems with people with strange non-latin1 names, or documents with non-latin1 names etc.

2. At some point, we switch to using UTF-8 by default, telling users to use convmv to rename their files when they are ready for the new defaults. Most people ignored this, of course, but files with now-invalid UTF-8 names were mostly fine, just with the occasional "?" in the names.

3. Switch to NFSv4. Invisible to end users. NFSv4 per se requires that paths are UTF-8 encoded, but in practice the Linux NFS server and client just pass along a bag of bytes, so invalid UTF-8 just worked as fine as it did previously.

4. Switch from a Linux NFS server to a netapp.

5. User complains that files are missing. Initial comparison with the old Linux NFS server, which was still online, shows no problems. Problem occurs only on the user's workstation, not on the admin box, which has both the old Linux NFS and netapp directory trees mounted. Investigation on the user's workstation shows that in some cases lots of files appear to be missing, including ones with plain ASCII names.

- Turns out that the admin box had the netapp mounted with NFSv3, and thus everything appeared Ok there, including the rsync from Linux NFS -> netapp in the first place.

- However, when mounted using NFSv4, netapp follows the spec and does not like non-utf8 paths. Does it report an error then? Hell no, the NFS READDIR (READDIRPLUS?) message reply just stops returning directory entries when it hits the first one with invalid UTF-8. And thus you get a partial directory listing. GAAAH!

- So the solution was to run convmv centrally (from the admin box which had the netapp mounted with NFSv3) for the entire directory tree which had been moved.


Ah yes, had the same fun problem at a customer's facility last week. Moving 350 TB of data from an old DDP storage server to a Linux one. Mounting with CIFS (no other option available), and copying using "cp -a".

The file names look OK after the copy on the Linux machine. However, when exporting the directory through Samba, the Mac's Finder doesn't display files with accents in the names (though they appear correctly with "ls", weird...).

So the user copies the files again, using the Finder. Now I have files with exactly the same name (uhhhhh???):

  # ls -l Mémo-1.*
  -rw-rw-rw- 1 root root 8417218  6 sept. 2013 Mémo-1.aif
  -rwxr--r-- 1 test test 8417218  6 sept. 2013 Mémo-1.aif
  -rw-rw-rw- 1 root root  363175  6 sept. 2013 Mémo-1.m4a
  -rwxr--r-- 1 test test  363175  6 sept. 2013 Mémo-1.m4a

Yes, it looks like two files have exactly the same name, but actually they're different: one has the "é" encoded as "e" plus the combining accent 0xCC 0x81, and the other one (the "good one") as the precomposed 0xC3 0xA9. Why is that? Why does one work with the Finder and the other doesn't? Who knows.


Most likely it's different normalization. I've seen this before with Mac systems.

Renaming the files to use NFKC normalization fixed it. In python, you could loop through the files and do something like:

  os.rename(originalfilename,
            unicodedata.normalize('NFKC', originalfilename.decode('utf8')).encode('utf8'))
EDIT: You'll probably need to do this on a non-Mac system, linux for example should work.
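On Python 3, where os.listdir() already gives you str, a rough equivalent might look like this (a sketch only; it assumes the existing names decode cleanly and, like the snippet above, uses NFKC):

  import os
  import unicodedata

  def normalize_tree(root, form='NFKC'):
      # Walk bottom-up so renaming a directory doesn't invalidate
      # paths we still have to visit inside it.
      for dirpath, dirnames, filenames in os.walk(root, topdown=False):
          for name in dirnames + filenames:
              fixed = unicodedata.normalize(form, name)
              if fixed != name:
                  os.rename(os.path.join(dirpath, name),
                            os.path.join(dirpath, fixed))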


A similar thing happens with Java's file and directory APIs on Linux. IIRC in Java filenames are Strings. If your VM is configured with UTF-8 as "file.encoding" and you have non-UTF-8 filenames on your filesystem, those files are completely inaccessible to Java!


This is kind-of why I think there should be a heavy push towards not supporting any other encoding than UTF-8.


Useless nonsense like this is the main reason why I desperately want to move away from software engineering.


Careful, though. Plain human bureaucracy can be just as bad - and often harder to debug :)


> on Windows, paths are fundamentally text

They were back when there were fewer than 2^16 characters in the Unicode standard. Back then each two-byte word in a filename corresponded exactly to a Unicode code point.

Now that there are more than 2^16 (but well under 2^32), Windows uses UTF-16 in filenames. That is, Unicode code points above 2^16 are encoded as a pair of special code units in the range 2^15-2^16 called surrogates; surrogate pairs need to be collapsed into a single code point when decoding the file name. Surrogates are exactly what Python uses on Linux to hide bytes that are not valid UTF-8. Here's the problem: it is possible to have unmatched surrogates in a file name (or in other places where Windows accepts UTF-16).

In summary, on Windows, you end up with effectively the same situation as Linux: file names that are supposed to be in one encoding (UTF16) but contain invalid data for that encoding.
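For the curious, that surrogate trick is exposed in Python as the surrogateescape error handler; a quick illustration with a made-up latin-1 name:

  raw = b'caf\xe9.txt'                           # latin-1 bytes, not valid UTF-8
  name = raw.decode('utf-8', 'surrogateescape')  # 0xE9 becomes the lone surrogate U+DCE9
  print(ascii(name))                             # 'caf\udce9.txt'
  assert name.encode('utf-8', 'surrogateescape') == raw   # lossless round trip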


Funnily, because of the 2/3 transition, Python became a language obsessed with encoding correctness. Hence the team spent a great deal of effort improving the situation, from version to version, over the last 10 years.

For a fantastic read (from Victor Stinner himself) on all the work done, and to see how twisted this gets when you want to be cross-platform and abstract it away:

https://vstinner.github.io/python37-new-utf8-mode.html

Windows and Linux FS encoding, of course, are at the center of the challenge.

It's also the reason why we now have a __fspath__ protocol that allows any object to be converted to a file system path, instead of having pathlib.Path inherit from str.
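The protocol itself is tiny; a sketch (the class name here is made up):

  import os

  class Artifact:
      """Any object can opt in to being usable as a path via __fspath__."""
      def __init__(self, name):
          self.name = name

      def __fspath__(self):
          # May return str or bytes; open(), the os module and pathlib
          # accept the object directly.
          return os.path.join('/tmp', self.name)

  print(os.fspath(Artifact('demo.txt')))   # /tmp/demo.txt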


> Here's the problem: it is possible to have unmatched surrogates in a file name (or in other places that Windows accepts UTF-16).

NTFS (and Windows as a whole) does not use UTF-16, it uses UCS-2. It is a subtle difference, but surrogate pairs didn't exist in UCS-2.


Old versions use UCS-2. New versions use UTF-16 (correctness not enforced). This is also how Java and OS X were updated.

The kernel generally uses the 16-bit equivalent of the old Pascal string, that being a 16-bit count of 16-bit pieces of UTF-16 data. This allows a 16-bit NUL to get into various places that make the Win32 API choke.


> Old versions use UCS-2. New versions use UTF-16. (correctness not enforced) This is also how Java and OS X were updated.

I usually call that ucs2-plus-surrogates, to make it clear that you may encounter unpaired surrogates, and thus invalid paths if you assume proper UTF-16.


This is actually kinda painful with NTFS: other than the directory separators, NTFS doesn't really care what's in a path; it's all binary. This means that different applications using different Unicode normalization will result in odd things happening. To me the right answer is that NTFS should normalize all paths the same way internally, but they have yet to implement it because it would break legacy systems that have un-normalized paths (I assume).


Precisely.

Programs that internally use ShiftJIS, for instance, would stop functioning if UTF-16 compatibility or normalization were enforced. They're currently "broken" (as in, operating incorrectly) but in a way that works.


Can you explain why this would be the case? In theory this shouldn't be an issue because any ShiftJIS conversion to Unicode should be reversible.


I think it's not a perfect round-trip due to differences in how the two standards encode certain characters. You will get a correct conversion either way, but the result of a round-trip might not be bit-identical.


It should be reversible, sure. Just let me know when you find a standardized, correct, one-to-one mapping between JIS codepoints and Unicode codepoints. Also make sure it hasn't changed in the lifetime of the oldest software that anyone is using.


The apps I'm thinking of write bytes, not strings. If you enforced UTF-16 compatibility, they'd have to say what encoding they're using (they don't) or convert it themselves (they don't) - and changing either of these would require at least an application recompile.

The reason they currently work is because bytes out == bytes in, so they can read the files they create, despite what mojibake the user sees.


Indeed. I should have written something more like "in places Windows appears to accept UTF-16". But there is certainly a little truth to it: if you pass a surrogate pair to e.g. the text of a Windows label using an appropriate Unicode font, I believe it will show its UTF-16 interpretation.


> file names that are supposed to be in one encoding (UTF16) but contain invalid data for that encoding

You just described my music collection, aggregated over two decades and haphazard successive migrations. I've given up salvaging the corrupted names in an automated manner...


Rust's `std::path` [1] has two representations under the hood, for Windows (UTF-16 plus lone surrogates) and non-Windows (bytes), for exactly the same reason. Paths are neither strings nor text.

[1] https://doc.rust-lang.org/stable/std/path/


Emphasis on the "plus lone surrogates" part. Like on Unix, Windows does not require a path to be valid Unicode.

That is, on Windows, paths are fundamentally sequences of 16-bit words, just like on Unix paths are fundamentally sequences of 8-bit bytes. On neither system are paths fundamentally text.


The story then goes on further: every NTFS volume contains a special file named `$UpCase` that has an uppercase mapping for all possible 16-bit words, resulting in a 128 KiB table. This approach has an upside for backward and forward compatibility... unless you eventually need a case mapping for non-BMP characters or a complex mapping that expands to multiple characters.


I should briefly explain why this is here:

NTFS is (usually) case preserving but not case-sensitive. So the OS needs to be able to tell whether EXAMPLE.TXT and example.txt are the "same" name, which means it needs case conversion.

Not everybody agrees about how this conversion should work. The most famous example is Turkish, but there are others. So there's an actual choice to make here.

If Windows baked this into the core OS, they might get pushback in countries where their (presumably American) defaults were culturally unacceptable.

If they made it configurable at the OS level, everything would seem fine until, say, a German tries to access a USB drive with files from a Turk on them and some files don't work correctly, or the disk just can't be mounted at all.

So, they have to bake it into each NTFS filesystem.
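The Turkish case is easy to demonstrate with the locale-independent default mappings (a quick illustration; Python's str methods use those defaults):

  print('i'.upper())   # 'I'  -- default mapping; Turkish expects 'İ' (dotted capital I)
  print('ı'.upper())   # 'I'  -- dotless ı also folds to plain 'I'
  print('I'.lower())   # 'i'  -- Turkish expects 'ı' (dotless) here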


HPFS had a similar system some years before.

* http://www.edm2.com/index.php/Inside_the_High_Performance_Fi...


They're neither strings nor text right up until the point you need to display them to users.


Same applies to many things, but that doesn’t make those things strings. Numbers might be another example.


My point is that you _always_ have to convert path names to strings to display them to users, but you don't always know how because you don't always know the (sometimes implied) encoding.


Filesystems seem so fiddly and broken.. is it very difficult to make a small layer over filesystems that provides sane semantics? Like one that handles paths sanely, handles fclose()/fsync() properly, lets you control when things are buffered/flushed etc? Even a broken API with clear, modern documentation enumerating all the fiddly cases would be a huge step forward from digging through random mailing lists on sites with UX from the 90s.

Has anyone tried this? Is it possible with FUSE? I would love to hear from people who know about this stuff - what are the obstacles? Or do you think FSs are fine the way they are?


As far as I can tell, file systems are inherently broken by design; this isn't an implementation issue. For example, the notion of finding a file by its path is just riddled with race conditions. If you create files /a/b/c and /a/b/d, are c and d necessarily in the same directory? Not really, because someone could have moved around the parent directories in between. But we conveniently assume paths stay the same... except, of course, when they change. Now try actually formalizing what exactly that means from a global (multi-program) perspective!


Isn't that why openat() exists? Of course, that isn't used nearly as much because it's annoying to have to do things that way, but it seems like the sort of thing if you need it.

The "open things by paths" thing, IIRC, is part of the reason Windows doesn't like to let you delete open files by default.


Yeah, and Windows has NtOpenFile(OBJECT_ATTRIBUTES*) to let you specify a parent directory, but these are only at the syscall level, not at the standard application API level (Win32 or C APIs don't allow it). The entire model exposed to normal applications has this problem.


I thought Windows won't let you move or rename a directory if it has open files.


Almost, but not quite. I just tried it to confirm: Windows lets you delete parent directories if they're reparse points. (I don't believe the deletion prevention was attempting to solve this problem in the first place, so this isn't really a bug or anything.)


ZFS handles internationalization about as well as could be hoped for. You can forbid non-UTF-8 strings (note that ZFS doesn't know if some string that is valid UTF-8 is actually encoded in UTF-8 -- it might not be), and ZFS does form-insensitive directory lookups, so if you copy some normalized-to-NFD files from OS X, it will work out fine.

Making filesystems codeset-aware is not worth the trouble. It's best instead to just use UTF-8 locales everywhere. If you need to deal with other codesets, convert as needed, but don't use non-UTF-8 locales.

On ZFS with a fast ZIL, fsync()/sync() function a lot like write barriers, which is what we really need. Actually, what we really need is for all filesystem operations to be available as async system calls, write barriers included.


What is sane path handling?

On Windows alone, the maximum path length is 260 characters, except when you use extended-length paths, which have a 4-character prefix and a maximum length of 32,767 characters. A sane API for reading files probably converts your paths to extended-length paths. But if you do the same thing for writing files, your users start calling you insane again, because most of Windows (including Windows Explorer) can't open extended-length paths. So you would be creating files that only select software can even open, and which the user can't browse without third party software.

(An easier-to-ignore fun fact is that NTFS and Windows also support case-sensitive names if you set the right flags in the file APIs. But nobody uses that, so it's probably safe to ignore (until somebody mounts EXT3 partitions in Windows...))
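For concreteness, the extended-length prefix looks like this (a small sketch; the path is made up):

  # Ordinary Win32 path, subject to the ~260-character MAX_PATH limit:
  short = r'C:\temp\example\file.txt'

  # Extended-length form: prefix with \\?\ and the limit becomes ~32,767
  # characters, but much software (including parts of Explorer) can't open it.
  long_form = '\\\\?\\' + short
  print(long_form)   # \\?\C:\temp\example\file.txt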


> So you would be creating files that only select software can even open, and which the user can't browse without third party software.

iOS suggests that people are ok with this. /s


There's support for >260 char path lengths now without the \\?\ prefix, but I think you need to modify the registry, plus the application needs to opt into the new behavior via a setting in its manifest.


You can use subst to open these paths. The better option is to not create them in the first place though.


ZFS has a `utf8only` property that constrains filenames to be UTF-8 only... :)


It does more than that! It also does form-insensitive directory lookups.


Only if you set the normalization parameter. Otherwise it'll guarantee only that a file name is UTF8-strict and not that there aren't two files named "é" with different normalization.

Honestly, normalization should probably always be set. It gets way more confusing than case-sensitivity (which can also be changed on ZFS!) already is.


The simple way to go about this is to make assumptions that do not always hold technically, and to be OK with it when these assumptions break - in that case you simply can't deliver on your promises anymore.

Another way to put this is "shit in, shit out".

So: When receiving a filename that is not UTF-8, one could just emit a warning and ignore the file. That's what I would do if I wrote a music tagger, at least.

When someone else modifies a directory tree at the same time, we're bound to run into problems. That's just how it is. Actually there are synchronization facilities (e.g. flock()) for some of these types of problems. But these are seldom used because they lead to other problems.

I think they knew all that in the 70s, and just chose to be pragmatic about it.
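For the warn-and-ignore approach, detection is straightforward; a rough sketch (listing with the raw byte view so nothing gets escaped behind your back):

  import os

  def scan(root):
      # os.listdir() on a bytes argument returns the raw on-disk names.
      for raw in os.listdir(os.fsencode(root)):
          try:
              raw.decode('utf-8')
          except UnicodeDecodeError:
              # Print a repr, since the raw bytes may not be displayable;
              # whether to skip, fail, or offer a rename is the tool's call.
              print('not valid UTF-8:', raw)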


Actually, these Unix-alike conventions of filenames just being NUL-terminated sequences of bytes/16-bit words come from the same thinking that gave us the concept of files just being length-counted sequences of bytes that it was up to application software to interpret and impose structure upon.

They were reactions to the more structured access methods of the day.


> So: When receiving a filename that is not UTF-8, one could just emit a warning and ignore the file.

The problem with invalid filenames is that doing anything at all with them might be impossible (without escaping or fixing). Your output (GUI or terminal) most likely uses UTF-8; you can't pass an invalid filename to it, so you can't even display it. Also, in most situations you can't just ignore something. Imagine a text editor where the user tries to open an invalid file - how do you "ignore" that?

Many (if not most) applications that process filenames do rely on those being valid text at least in some codepaths and they simply cannot work otherwise. The proper solution is to fail and let the user fix the problem.


> Many (if not most) applications that process filenames do rely on those being valid text at least in some codepaths and they simply cannot work otherwise. The proper solution is to fail and let the user fix the problem.

Yep, that's what I meant. You need to do what is appropriate to the situation. Ignoring / warning are only two possible handling strategies. In many situations, straight-out failing is another valid one.

A file copy program should just not care and copy the darn thing. A text editor, basically the same, but maybe issue a warning.

Different tasks have different requirements, like the filename being text, the filename not containing spaces, etc. Given that these requirements are somewhat arbitrary, I think it's a fine choice to just not put any non-technical constraints in the guts.

There's an expression for it: Mechanism, not policy. You can always add policy on top, be it in the VFS layer, or as additional programs / classes of programs.

So many potential bugs would be easily fixed if the shell glob ('*') was more configurable.


I think Java does that. Thanks to its built-in libraries, you can write code that works on multiple platforms with different file systems. However, you still need to know yourself which characters are legal on which system (for example, ':' is not a legal character in file names on Windows, but it is on OS X).


":" is or is not a legal character in filenames on macOS depending on which API you use!

Try saving a file or renaming a file to contain ":" using the GUI. It will not work. However, "/" is fine.

Try saving a file or renaming a file to contain "/" using the CLI. It will not work. However, ":" is fine.

A ":" in the CLI is translated to "/" in the GUI and vice versa. ":" was the directory separator in Mac OS 9 and earlier, which explains the behavior. The actual on-disk filenames will contain "/", which is translated to ":" for the POSIX API. This is for HFS and HFS+, not sure how APFS changes things.


At least since Java 7, the API has a concept of a filesystem which can create filesystem-specific Path objects and, if it encounters illegal characters, throws an InvalidPathException that tells you which character was illegal.


Sounds tricky to get right, considering the filesystem can change in the middle of a path.


I haven't worked extensively with the API, but I believe that it represents filesystem instances accessible to the JVM, not abstract technical filesystems.

So in the case of a Unix filesystem tree with various mount points, the FileSystem object would throw the exception iff the path you're trying to construct is illegal for the actual filesystem configuration in the system.


What are you trying to achieve? If you want to store data for your own application, you'd use something higher-level than a filesystem (e.g. a database); those exist and offer sane semantics. The only reason to interact with the OS-level filesystem is to use it to communicate with other programs, in which case the insanity is necessary: if you want to e.g. read files created by other programs, you have to be prepared to deal with malformed names, because other programs will create files with malformed names.


> you'd use something higher-level than a filesystem (e.g. a database); those exist and offer sane semantics

In many cases they offer the exact same semantics (damn you, mysql). Also they store data in files, and those files have to be named, so... back to square one.


> In many cases they offer the exact same semantics (damn you, mysql).

I mean bad databases exist, sure, but the solution to that is to not use those.

> Also they store data in files, and those files have to be named, so... back to square one.

Not really - the database developers have handled the fsync, buffering and what-have-you for you, so you don't have to deal with them. (And FWIW serious databases generally offer the option of storing data on raw partitions).


sqlite?


The uncomfortable generic answer to this: https://xkcd.com/927/


It even lists "character encodings" in the title as an example.


I work on another music file management system; my personal special hell is playlist files. An m3u playlist file is just a newline-separated list of file paths, which can be relative or absolute, and potentially encoded in whatever locale is set on the user's computer. Some fun issues:

* Windows and Mac filesystems are generally case-insensitive, so some users will have the file names in the playlist file in one case and the actual file names on disk in another

* Sometimes file paths cross between two different filesystems, because one is mounted in the other with a USB drive or over CIFS or similar. Sometimes these two different filesystems have different case sensitivities

* There's no way to know how the playlist file was encoded

* HFS+ normalizes file paths to Unicode NFD, but there's no guarantee that the paths in a playlist file will be normalized. Also, sometimes users generate an m3u file on a Windows system and expect it to just work on a Mac. Also, the filesystem nesting problem with network or USB mounts can happen this way too.
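In practice, resolving an entry degenerates into fuzzy matching; a rough sketch of the kind of fallback I mean (it assumes the playlist line has already been decoded to text somehow, and it only folds case and Unicode form):

  import os
  import unicodedata

  def fold(name):
      # Ignore differences in Unicode normalization form and in case.
      return unicodedata.normalize('NFC', name).casefold()

  def resolve(entry, directory):
      wanted = fold(os.path.basename(entry))
      for candidate in os.listdir(directory):
          if fold(candidate) == wanted:
              return os.path.join(directory, candidate)
      return None   # genuinely missing, or mangled beyond this heuristic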


Sounds a lot like my life a couple of years ago (and intermittently since). I don't get the bug reports any more because I think customer service has learned that file name problems can be fixed by renaming the files. Not fun for the user, but a sure fix.

Ya know what kind of file names work virtually everywhere? ASCII ones.


This is a mis-use of ASCII. After all, the colon, asterisk, forward slash, question mark, backward slash, and NUL characters are all in ASCII, yet they are far from things that "work virtually everywhere". And that isn't even considering the open and close square bracket and semi-colon characters which are also not anywhere near portable to the extent of "working virtually everywhere".

The kind of file names that do work "virtually everywhere" are not ASCII, but rather those that only use characters from the POSIX Portable Filename Character Set, which at 65 characters is just over half the size of ASCII (which has 128 characters).

* http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_...


It misses an even more complex, I'd say insane, encoding problem: on HFS+ (or even APFS now?) filenames are Unicode-normalized.


HFS+'s use of NFD is made even more insane by the fact that OS X's input methods prefer to produce NFC anyway. And besides, so do other OSes' input methods, so in any heterogeneous system this is a nightmare.

This is why ZFS does form-insensitive directory lookups (and hashing)[0] rather than normalize-on-CREATE! I'm so glad ZFS got it right, and can stand as a model for all. (I implemented none of that functionality, though I code-reviewed some of it, specifically the u8_* functions in Solaris/Illumos, but I remember it took some doing to convince others that this was the correct approach.)

[0] https://cryptonector.com/2006/12/filesystem-i18n/ [1] https://cryptonector.com/2010/04/on-unicode-normalization-or...


Many people rail against case-insensitive lookups like on Windows. How is this different?


People can see case. They cannot see form.


It isn't. Case-insensitive lookups are the right thing; unix people oppose them out of tribalism rather than anything else.


Not only are they normalized Unicode, they're normalized decomposed, and not only that, but slightly non-standard (it does not conform to the standard Unicode "NFD" form). (Or at least, this was the case with HFS. I haven't followed APFS closely enough to say for sure.)


NFD hadn't been standardized at the time.

IIUC the reason they did this is that they wanted directories to be canonically ordered on disk, and they thought decomposition would naturally yield better results than pre-composition. I'm not sure that's right, and frankly I don't care either, because the most important thing to note is that input methods (especially for European languages) by and large produce NFC, and most application software does no normalization at all, so disagreements as to form cause problems[0][1].

[0] https://cryptonector.com/2010/04/on-unicode-normalization-or... [1] https://cryptonector.com/2006/12/filesystem-i18n/
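To see the form disagreement concretely, the same visible name in the two forms (a quick Python check):

  import unicodedata

  nfc = 'Mémo-1.aif'                       # precomposed é: U+00E9
  nfd = unicodedata.normalize('NFD', nfc)  # decomposed: 'e' + U+0301 combining accent
  print(nfc == nfd)                        # False: different code point sequences
  print(nfc.encode('utf-8'))               # b'M\xc3\xa9mo-1.aif'
  print(nfd.encode('utf-8'))               # b'Me\xcc\x81mo-1.aif'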


I should add that because different locales have different collations, it's not that important that directories be ordered by name. It's good enough that directories be somewhat ordered, or even not ordered at all. GUIs will almost always let you sort by name and/or date, and the same goes for ls(1), so, really, it doesn't matter at all.

IMO it was a terrible mistake to normalize to NFD on create. Normalizing to NFC on create would still have been a mistake, but a lesser one.


HFS intended to store name entries in “US display order”, but it had a bug in sorting. https://developer.apple.com/legacy/library/technotes/tn/tn11...:

”HFS uses 31-byte strings to store file names. HFS does not store any kind of script information with the file name to indicate how it should be interpreted. File names are compared and sorted using a routine that assumes a Roman script, wreaking havoc for names that use some other script (such as Japanese). Worse, this algorithm is buggy, even for Roman scripts. The Finder and other applications interpret the file name based on the script system in use at runtime.”

The bug (or part of it) was that some punctuation sorted before everything else.


For non-Macheads who are confused, this refers to the original HFS from 1985, which was replaced with HFS Plus in 1998:

> HFS Plus uses up to 255 Unicode characters to store file names. Allowing up to 255 characters makes it easier to have very descriptive names. Long names are especially useful when the name is computer-generated (such as Java class names).

I have to admit I let out a laugh at Apple's reference to Java class names...


"NTFS allows any sequence of 16-bit values for name encoding (file names, stream names, index names, etc.) except 0x0000. This means UTF-16 code units are supported, but the file system does not check whether a sequence is valid UTF-16 (it allows any sequence of short values, not restricted to those in the Unicode standard). "

- from wikipedia NTFS page [1]

So if you assume that an NTFS filename is valid UTF-16 and convert it to UTF-8, there might be a problem. Basically, filenames can be any sequence of 16-bit values.

  [1] https://en.wikipedia.org/wiki/NTFS
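So a name that is legal on NTFS can be impossible to encode as strict UTF-8; in Python terms (a contrived name with one unpaired surrogate):

  name = 'bad\ud800name'                   # lone high surrogate: fine as NTFS 16-bit units
  try:
      name.encode('utf-8')
  except UnicodeEncodeError as exc:
      print(exc)                           # strict UTF-8 refuses unpaired surrogates
  print(name.encode('utf-8', 'surrogatepass'))   # escape hatch: WTF-8-style bytes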


There was a time when some of our customers had lots of problems with gigantic files on their drives that were impossible to delete with Windows Explorer. I would come over and help them delete the files from the command line, using filename*.ext to catch them. My guess was that the filenames had some protected characters that Windows Explorer didn't allow. Don't remember how they ended up with the files, but most likely some download program and someone having a laugh :-)


Doesn't it (or Windows) also disallow the path component separator character(s) ('/' and '\')?

Unix and alike disallow NULs and /, for obvious reasons.


There are a number of characters like path separators that cannot be part of a file name on windows. However I am not sure if this is enforced by the OS APIs or by NTFS itself. It is entirely possible that NTFS could allow something that higher layers don’t.


If the kernel (and SMB, and...) imposes these constraints, it's fine for the filesystem to not also impose the same constraints on file naming.


You could break NTFS into accepting this. Fun things happen, for a specific definition of "fun".


The built-in file system libraries for many languages are total footguns. Using strings as paths is a great example. I've had a few recent bugs around case-sensitive vs. case-insensitive file systems, because a lot of code assumes that when pathA != pathB it must be dealing with two different resources. Not to mention the classic "doesn't work on Windows" problem: newPath = pathA + "/" + pathB
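The usual fix is to let the standard library build the path (a trivial sketch; the names are placeholders):

  import os.path
  from pathlib import PurePath

  path_a, path_b = 'music', 'Mémo-1.m4a'
  print(os.path.join(path_a, path_b))   # uses the platform's separator
  print(PurePath(path_a) / path_b)      # pathlib equivalent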


AFAIK, Windows understands / as a directory separator.


Only in some special cases, not in general.


Really? I use forward slashes all the time in windows 10. I don't think I've run into a problem yet.


Yes. One huge problem is slashes are also used for command-line switches. That can entirely change the meaning of your commands.

For example, compare these two in the command prompt:

  start /Windows/Notepad.exe
  start \Windows\Notepad.exe
I wouldn't blame this on poor parsing or other silly things, though. I actually think it makes sense to use slashes for switches, because they are invalid filename characters, whereas dashes are valid and hence ambiguous (thus the need for "--" in *nix). I think it would've made more sense to disallow slashes as directory separators entirely, to avoid this for good.


They work everywhere, except on the command line or file dialogue windows.


On the contrary. It accepts '/' as a path separator for filenames in every API call. The special case that doesn't is command line parsing (cmd.exe and a few others)


> On the contrary. It accepts '/' as a path separator for filenames in every API call. The special case that doesn't is command line parsing (cmd.exe and a few others)

No...

  assert( PathIsRoot(TEXT("C:\\")));
  assert(!PathIsRoot(TEXT("C:/" )));
Also, I believe you meant directory separator, not path separator.

Please don't be so tempted to take an antagonistic position and confidently declare other people wrong when you cannot possibly support your position in full... I see this very commonly on HN and I cannot tell you how extremely frustrating it is for those trying to help. It sucks away all the energy and enthusiasm we have for trying to help people get accurate information (meaning we might not even have the energy to bother to respond), and on top of that, you risk disseminating incorrect information. In this case, you simply could not have tried all the wide variety of Windows APIs, so at the very least, maybe say "in my experience" if something is only based on your experience.


That's why it's good to have a cross-platform path datatype/literal baked into the language.

For example, in Rebol/Red - http://www.rebol.com/r3/docs/datatypes/file.html


Most decent stdlibs have a path.join or a path.separator you can use instead. It's just that people don't bother and hardcode...


IIRC java's stdlib converts `/` to the mentioned separator


Tangentially: a tool I use to test my stuff when I expect it to handle all valid filenames:

https://github.com/jakeogh/angryfiles


I found a similar problem with my backups (Ugh. :-)

A year ago I went on a trip and some combination of the humidity, the travel, and the 6 year old Thinkpad resulted in my laptop not booting.

I had been experimenting with Borg to back up the system, and so I tried using Borg to restore the latest copy onto the new laptop. Turns out that I have a bunch of files on my laptop that have names with weird characters in them: rips of my CD collection. I couldn't find any combination of settings and environment and locale that would allow Borg to recover or skip these files and recover everything else.

Now, I had 2-3 other copies of the data (my pre-borg backups, the original SSD which was still readable, a few other rsync copies), so it wasn't a big deal.

But, as always, test your recoveries!


I think you could have mounted your borg backup as a FUSE filesystem, and then used rsync to restore your files.


now sing along with me children: /none of this matters to me/ because I live in eight dot three/


I wrote an essay a while ago about fixing Unix/Linux filenames here: https://www.dwheeler.com/essays/fixing-unix-linux-filenames....

This is a big disconnect between "what most users expect" and "what systems actually do". Users generally expect that filenames are sequences of characters - and today almost everyone expects that they must be in UTF-8 on a Unix-like system. That is not, of course, what most systems actually do.


I wrote about this eons ago: https://cryptonector.com/2006/12/filesystem-i18n/ and https://cryptonector.com/2010/04/on-unicode-normalization-or... -- these might still be available on https://blogs.oracle.com/, though these are from my days at Sun.

TL;DR, basically, the lack of ability to tag strings in the system call API with codesets means that UTF-8 is the only plausible answer, and the ends (C library system call stubs, filesystems) have to apply whatever codeset conversions are needed. But there's practically zero chance of C library system call stubs (and related functions) performing codeset conversions (can you imagine readdir(3) doing it?), which means that the only reasonable answer is to use UTF-8 locales and be done.

Even shorter: just use UTF-8 locales and be done.


You cannot use UTF-8 locales on Windows though.


That's OK. On Unix use UTF-8. On Windows use Unicode, and let apps use UTF-8 or UTF-16 as appropriate -- the kernel/NTFS make it right.


That's alright. On Windows, cryptonector's assertion about a lack of tagging API calls with code sets is wrong to begin with and the reasoning thus does not apply. All of the ...A() API calls are implicitly tagged with the current code page, after all.


    chcp 65001 
and bob's your uncle.


In which the author takes a long and winding path to what most of us already know, "paths are fundamentally bytes".


… but also, sort-of, but not really, text: they get displayed to the user, they get input from the user, and they get emitted in logs, messages, etc. All as text. And that rub, between where they're bytes and where they should have been text, is the problem and the complexity.


... except on the operating systems, discussed at length in this very same discussion, where paths are fundamentally 16-bit words.


Just dealing with file extensions is enough of a head spin. We stopped trying to differentiate between .xls, .xlsx, .xlst, etc. to show an Excel icon for a file uploaded to our SaaS, and just went with a generic file icon in the end.


That sounds like the classic example of a decorative feature that someone thought, "oh, that sounds easy, add it to someone's sprint," but that of course turns out to be mind-numbingly complex.


Honest question: why is that feature complex? What is the problem with looking at the last part after the dot?


Because Microsoft made their new Office extension .xml. If that doesn't make your head spin, I don't know what else will.


Is this in very recent Office versions? Or are you talking about the XML format used by Office 2007 upwards? Because that does still use different file extensions for different programs, obtained by adding an 'x' on their old binary format extensions (.docx rather than .doc, .xlsx rather than .xls, ...). They're zip archives of XML files rather than individual XML files so .xml wouldn't make any sense.

I got the impression the problem in the comment above was too many extensions rather than too few. For example, you have to use .docm etc. rather than .docx if your document contains macros, otherwise Word will refuse to open it (this is a security feature to prevent document viruses). But it sounds like there are many others.


I mean, this is only adding another entry to your array of extensions for the specific icon. It's also not a deal-breaker if it doesn't work. It really is easy; I don't see what the problem is.


Maybe because it is used for all kinds of Office documents. You'd have to look at the contents to see if it's a spreadsheet, text document, presentation, or not even an Office document at all but a plain XML file.


If this is a head spin, then go take a stroll on a Debian package mirror for what kinds of awesome ^W bizarre file extension combos they've come up with. This one is my favorite: https://twitter.com/stefanmajewsky/status/928983817825665025


Only on Windows.


On the contrary: on Windows this stuff is actually much easier to handle, as the operating system supplies out of the box a large, extensible table of extensions, maintained by installed applications, with an indirect mapping (via an intermediate) to display icons supplied in that very same table.

The ASSOC .DOCX command shows an example of the first leg of the indirect mapping.


IBM's backup software TSM/Spectrum Protect messes this up as well.

If the machine has a UTF-8 encoding (like, say, every modern system), it will try to treat filenames as valid UTF-8 strings and fail to back up files which don't fulfill that assumption. The "solution" is to run the TSM software with a single-byte locale like en_US.

I've seen a number of shops that were silently missing files from backup from old systems because of this problem.


I don't think any backup software actually can do the right thing(tm). Some might preserve (or attempt to, anyway) the binary representation, others attempt to preserve a Unicode codepoint-space representation...

... most do neither, but rather do ${complex thing emerging from combination of implementation details of runtime and backup tool, impossible to reproduce in any other runtime, likely platform- and environment dependent; the same backup likely restores in different ways on different machines, and the same source files create different backups on different machines; creating a backup on one machine and restoring it on another does not generally result in the same files; and I have not yet mentioned what might happen if you mount the same source file system from different platforms, because results might vary a lot; also, we are only talking about paths here, not any of the other plethora of things that can and will be different between any element in OSxFSxEnv}.


> I don't think any backup software actually can do the right thing(tm).

Sure it can. In this case, I'd say treating the filename as a bag of bytes is the correct way to go, as that's the way the OS treats them. Translating filenames between character sets should not be part of a backup system's job.

There are valid setups where different software on the same machine might be running with different character sets for legacy reasons. In that case there is no correct way to handle the filenames as text. But treating it as a bag-of-bytes will always work consistently.

Also, the one purpose of a backup system is to back up the files on the filesystem. If it can't back up some files that the OS considers valid, it's the backup software that failed.


And here I was thinking that I would see mainframe and other non-POSIX systems, only to find the usual Linux/Windows dichotomy.


And a slow black snake slithers out of your computer's USB ports, made of billowed smoke and unrealized dreams. It sticks its tongue out; "typesssss," it whispers, "typessssss."


A way to stay sane seems to be to use Python 3's pathlib and drop Python 2 development.


[dead]


Because people use Windows and might ask for you to support their platform?


Ummm... problem is, Windows is used by a ton of people.

Break their workflow, and said people will hate you.

So... UTF-16 in Windows Hell, and UTF-8 everywhere else that is actually sane.



