
True; there's no reason that the filesystem should be storing anything other than char[]. The filesystem is a serialized domain, and char[] buffers are for storage and retrieval of serialized data. But that also means that each filesystem should explicitly specify a serialization format for what's stored in that char[] -- hopefully UTF-8.
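
As a sketch of what "explicitly specify a serialization format" could mean in practice (the function name and the strictness rules here are mine, not anything an actual filesystem driver mandates), the boundary check for well-formed UTF-8 is only a few dozen lines of C:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical boundary check: reject any name whose bytes are
     * not well-formed UTF-8 before they reach the on-disk char[]. */
    static bool name_is_valid_utf8(const uint8_t *s, size_t len)
    {
        size_t i = 0;
        while (i < len) {
            uint8_t b = s[i];
            size_t n;                  /* bytes in this sequence */
            uint32_t cp;               /* decoded codepoint */
            if (b < 0x80)                { n = 1; cp = b; }
            else if ((b & 0xE0) == 0xC0) { n = 2; cp = b & 0x1F; }
            else if ((b & 0xF0) == 0xE0) { n = 3; cp = b & 0x0F; }
            else if ((b & 0xF8) == 0xF0) { n = 4; cp = b & 0x07; }
            else return false;         /* stray continuation byte */
            if (i + n > len) return false;
            for (size_t j = 1; j < n; j++) {
                if ((s[i + j] & 0xC0) != 0x80) return false;
                cp = (cp << 6) | (s[i + j] & 0x3F);
            }
            /* reject overlong encodings, surrogates, out-of-range */
            if ((n == 2 && cp < 0x80) || (n == 3 && cp < 0x800) ||
                (n == 4 && cp < 0x10000) || cp > 0x10FFFF ||
                (cp >= 0xD800 && cp <= 0xDFFF))
                return false;
            i += n;
        }
        return true;
    }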

However, the filesystem should really be where that serialized representation begins and ends. The filesystem should be interacting with the VFS layer using runes (Unicode codepoints), not octets.

And then, given that all filesystems route through the VFS, it can (and should) be enforcing preconditions on those runes in its API, expecting users to pass it something like a printable_rune_t[]. (Or even, horror of Pascalian horrors, a struct containing a length-prefixed printable_rune_t[].)
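
Something like this, as a minimal sketch -- printable_rune_t, vfs_name, and the "printable" rule are all hypothetical names invented for illustration, not an existing kernel API:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef uint32_t printable_rune_t;   /* one Unicode codepoint */

    /* The "Pascalian horror": a length-prefixed rune string, so the
     * VFS never scans for a terminator or guesses an encoding. */
    struct vfs_name {
        size_t len;
        printable_rune_t runes[];        /* flexible array member */
    };

    /* Precondition the VFS could enforce on every name. "Printable"
     * here is a deliberately narrow placeholder rule: no C0/C1
     * controls, no surrogates, nothing past U+10FFFF. */
    static bool vfs_name_ok(const struct vfs_name *name)
    {
        for (size_t i = 0; i < name->len; i++) {
            printable_rune_t r = name->runes[i];
            if (r < 0x20 || (r >= 0x7F && r <= 0x9F)) return false;
            if (r >= 0xD800 && r <= 0xDFFF)           return false;
            if (r > 0x10FFFF)                         return false;
        }
        return true;
    }

The nice property of pushing the check down into the VFS is that every filesystem gets it for free, and no individual driver can quietly opt out.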

And for the situation where there are now files floating around without a printable_rune_t[] name -- this is why NTFS has been conceptually based around GUIDs (really, NT object IDs) for a decade now, with all names for a file just being indexed aliases. I wonder when Linux will get on that train...




Well, history sadly dictates that the interface to the upper layers is based around code units, because those have always been fixed-length. Unicode came too late to most operating systems to really be ingrained in their design, and where it was (Windows springs to mind), it all took a turn for the worse with the 16-to-21-bit shift in Unicode 2.0: Unicode-by-default systems ended up no better off than 8-bit-by-default systems had been a decade earlier.

That NTFS uses GUIDs internally to reference streams is news to me, though. On Unix-like systems the equivalent would be inodes, I guess, right?
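
(For the curious: the inode is directly visible from userspace via plain POSIX stat(2) -- nothing hypothetical here, st_ino is standard -- and the path really is just one alias for it, which is why hard links work.)

    #include <stdio.h>
    #include <sys/stat.h>

    /* Print the inode behind a path: the stable ID, of which the
     * path itself is just one (hard-linkable) alias. */
    int main(int argc, char **argv)
    {
        struct stat st;
        if (argc < 2) {
            fprintf(stderr, "usage: %s <path>\n", argv[0]);
            return 1;
        }
        if (stat(argv[1], &st) != 0) {
            perror("stat");
            return 1;
        }
        printf("%s -> inode %llu\n", argv[1],
               (unsigned long long)st.st_ino);
        return 0;
    }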



