Why? Windows is also not case-sensitive, so it's not like there's a near-universal convention that S3 is ignoring.
Case sensitivity in file names is surprising even to non-technical people. If someone says they sent you "Book Draft 1.docx" and you check your email to find "Book draft 1.docx," you don't say, "Hey! I think you sent me the wrong file!"
Casing is usually not meaningful even in written language. "Hi, how are you?" means the same thing as "hi, how are you?" Uppercase changes meaning only when distinguishing between proper and common nouns, which is rarely a concern we have with file names anyway.
> If someone says they sent you "Book Draft 1.docx" and you check your email to find "Book draft 1.docx," you don't say, "Hey! I think you sent me the wrong file!"
But you also wouldn't say that if they sent "Book - Draft 1.docx", "Book Draft I.docx", "BookDraft1.docx", "Book_Draft_1.docx", or "Book Draft 1.doc", and surely you wouldn't want a filesystem to treat all of them as the same.
This is a personal reason, but the reason I prefer case sensitive directory names is I can make "logical groupings" for things. So, my python git directory might have "Projects/" and "Packages/," and the capitalization not only makes them stand out as a sort of "root path" for whatever's underneath, but the capitalization makes me conscious of the commands I'm typing with that path. I can't just autopilot a path name, I have to consciously hit shift when tab completion stops working.
That might sound like a dumb reason, but it's kept me from moving things into the wrong directory, or accidentally removing a directory multiple times in the past.
I also use Windows regularly and it really isn't a hindrance, so maybe I wouldn't actually be bothered if everything was case sensitive.
To me, this sounds like a great practice for terminal environments but may be less intuitive when using file system apps. I could easily overlook a single letter capitalization in a GUI view of many directories. Maybe it's because at a terminal the "view" into the file system is narrow?
Now I'm wondering how I can use this in my docker images. I mean that might irritate devops. Well, maybe they'll like it too. Man, thanks for posting this.
You have to draw the line somewhere, but I do appreciate when the UI sorts "Book draft 2" before "Book draft 11". That requires nontrivial tokenization logic and inference, but simple heuristics can be right often enough to be useful.
On that note, ASCIIbetical sort is never the right answer. There is a special place in hell for any human-facing UI that sorts "Zook draft 1" between "Book draft 1" and "book draft 1".
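For what it's worth, the "simple heuristics" version of this is only a few lines. A minimal sketch in Python, assuming a made-up helper called `natural_key` (not a standard library function) and ignoring locale-specific collation entirely:

```python
import re

def natural_key(name: str):
    """Sort key that compares digit runs as integers and text case-insensitively."""
    # Splitting on a capturing group alternates text, digits, text, digits, ...
    # so every odd index holds a run of digits.
    tokens = re.split(r"(\d+)", name)
    return [int(t) if i % 2 else t.casefold() for i, t in enumerate(tokens)]

names = ["Book draft 11", "book draft 1", "Zook draft 1", "Book draft 2", "Book draft 1"]

# Plain code-point ("ASCIIbetical") order: "Zook" lands between "Book" and "book".
print(sorted(names))

# Heuristic order: draft 2 sorts before draft 11, and case no longer splits the list.
print(sorted(names, key=natural_key))
```

Names that differ only by case get equal keys here, so they keep their input order. A real UI would probably want a tie-breaker and proper locale collation on top of this.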
And that line, at least for sorting, belongs firmly outside the filesystem.
Sorting is locale-dependent. Whether a letter-with-dots sorts next to letter-without-dots or somewhere completely different has no correct global answer.
I think there's a pretty big difference between how the UI orders things and how the filesystem treats things as equivalent. A filesystem treating names case sensitively doesn't prevent the UI from tokenizing the names in any other arbitrary way
They are just not the same characters. A filesystem should not have an opinion on which strings of characters _mean_ the same thing. It is the wrong level of abstraction.
Filenames might not even be words at all, and they are surely not limited to English. We shouldn't implement rules and conventions from spoken English at the filesystem level, certainly not in S3.
I think what they mean is if you somehow had two files with the same name but different cases (as NTFS supports this) it would be impossible to fix with win32 calls
No, NTFS has always been at least optionally case sensitive; current Windows versions even allow case-sensitivity to be controlled on a per-directory basis[1], which even works for (some) Win32 programs:
Microsoft Windows [Version 10.0.22631.3593]
(c) Microsoft Corporation. All rights reserved.
C:\Users\jtm>mkdir foo
C:\Users\jtm>fsutil file setCaseSensitiveInfo foo
Case sensitive attribute on directory C:\Users\jtm\foo is enabled.
C:\Users\jtm>echo bar > foo\bar.txt
C:\Users\jtm>echo Bar > foo\Bar.txt
C:\Users\jtm>dir foo
Volume in drive C is Aristotle-Win
Volume Serial Number is E4AE-428B
Directory of C:\Users\jtm\foo
2024-05-31 17:55 <DIR> .
2024-05-31 17:55 <DIR> ..
2024-05-31 17:55 6 Bar.txt
2024-05-31 17:55 6 bar.txt
2 File(s) 12 bytes
2 Dir(s) 41,524,133,888 bytes free
C:\Users\jtm>type foo\bar.txt
bar
C:\Users\jtm>type foo\Bar.txt
Bar
And so should we be able to have “é.txt” and “é.txt” in the same directory (with a different UTF-8 normalization?)
What encoding should we use BTW?
I’m not advocating for case-insensitive fs (literally the first thing I do when I get a Mac is reformat it to be on a case-sensitive fs), but things are not that simple either.
> And so should we be able to have “é.txt” and “é.txt” in the same directory
That's what Linux does.
It does create some problems that seem to never happen in practice, while it avoids some problems that seem to happen once in a while. So yeah, I'd say it's a good idea.
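To make the "é.txt" vs "é.txt" case concrete, here is a small Python sketch of the two spellings as a byte-oriented filesystem would see them. The `same_name` helper is hypothetical, just one way a layer above the filesystem could compare names if it wanted to:

```python
import unicodedata

composed = "\u00e9.txt"     # "é" as a single code point (NFC form)
decomposed = "e\u0301.txt"  # "e" followed by a combining acute accent (NFD form)

print(composed == decomposed)       # False: different code point sequences
print(composed.encode("utf-8"))     # b'\xc3\xa9.txt'
print(decomposed.encode("utf-8"))   # b'e\xcc\x81.txt'

def same_name(a: str, b: str) -> bool:
    """Compare two names after normalizing both to NFC (illustrative only)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(same_name(composed, decomposed))  # True
```

A filesystem that compares raw bytes will happily store both; whether that counts as a bug or a feature is the whole argument.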
You are looking at this from a technical perspective. From the average person's perspective, even files are too much technicality to deal with.
As a user I want my work to be preserved, I want to view my photos, and I want the system to know where the funny photo of my dog I took last Christmas is.
As a developer I need an identifier for a resource, and I am not going to let the user decide on the ID of the resource. I put files in the system under a GUID and keep whatever name the user likes as metadata.
Exposing average people to the filesystem is the wrong level of abstraction. That is why iOS and Android apps are going that way. As I myself am used to dealing with files, it annoys me that I cannot have that level of control, but I accept that I am quite technical.
Dealing with files used to be something everyone interacting with computers had to do. It is something average people can do.
I think too much abstraction is a mistake and adds a lot of unneeded complexity.
People should learn something about the technology they use. If you want to drive, you need to understand how steering wheels work; if you want to drive a manual car (usual where I live and have lived), then you need to know how to work a gear stick and the effect of changing gear.
I'm not even sure 'everyone with an office job' had a computer. It certainly wasn't true 35 years ago. An office might have a computer or two, but not everyone had one, nor was everyone expected to use it.
Case insensitive matching is a surprisingly complicated, locale-dependent affair.
Should I.txt and i.txt match? (Note that the first file is not named I.txt).
Case insensitive filesystems make about as much sense as ASCII-only filenames.
You don't need to decide how to upper or lower case a character to be insensitive to case, though. Treating them all as matching isn't a terrible option.
And yet case insensitive file name matching / string matching is one of my favourite windows features. It’s enormously convenient. An order of magnitude more convenient than the edge cases it causes me.
People aren’t ASCII or UTF-8 machines; “e” and “E” are the same character, that they are different ASCII codes is a behind the scenes implementation detail.
(That said, S3 isn’t a filesystem, it’s more like a web hashtable key-to-blob storage)
> People aren’t ASCII or UTF-8 machines; “e” and “E” are the same character
They are the same character to you, a native speaker of a Western language written in a latin script. They are the same to you because you are, in fact, an ASCII machine. Many many people in the world are not.
They are the same to me, they are different in ASCII, therefore I am not an ASCII machine. To me, the person using the computer to do work. Not the person wanting to do extra work to support the computer's internal leaky abstractions of data storage.
Your position, the position of too many people, is that I, a native speaker of English etc., should not be allowed to have a computer working how English works because somewhere, someone else is different. This is like saying I shouldn't be allowed an English spell checker because there are other people who speak other languages.
Are the words hello and HELLO spelled differently? I am pretty squarely in the camp that filesystems should be case sensitive (perhaps with an insensitive shell on top), but I would not consider those two words as having a different spelling. To me that means they are the same sequence of characters.
And you seem to be conflating characters and letters. There are fewer letters in the standard alphabet than we have characters for the same, largely because we do distinguish between some letter forms.
I suppose you could imagine a world where we don't, in fact, do this with just the character code. Seems fairly different from where we are, though?
When you press the "E" key on a US keyboard and "e" comes out, do you return the keyboard because it's broken? If not, then you know what definition I'm using even if I misnamed it.
Every single time I type a path or filename (or server name) in the shell, or in Windows explorer, or in a file -> open or save dialog, I don't trip over capitalization. If I want to glob files with an 'ecks' in the name I can write *x* and not have to do it twice for *x* and *X*.
When I look at a directory listing and it has "XF86Config", I read it in my head as "ecks eff eight six config" not "caps X caps F num eight num six initial cap Config" and I can type what I read and don't have to double-check if it's config or Config.
Tab completion works if I type x<tab> instead of blanking on me and making me double check and type X<tab>.
Case sensitivity is like walking down a corridor and someone hitting you to a stop every few steps and saying "you're walking Left Right Left Right but you should be walking Right Left Right Left".
Case insensitivity is like walking down a corridor.
In PowerShell, some cmdlets are named like Add-VpnConnection where the initialism drops to lowercase after the first letter, others like Get-VMCheckpoint where the initialism stays capitalised, others mixed like Add-NetIPHttpsCertBinding where IP is caps but HTTPS isn't - any capitalisation works for running them or searching them with get-command or tab-completing them. I don't have to care. I don't have to memorise it, type it, pay attention to it, trip over it, I don't have to care!
"A programming language is low level when its programs require attention to the irrelevant." - Alan Perlis.
DNS names - ping GOOGLE.COM works, HTTPS://NEWS.YCOMBINATOR.COM works in a browser, MAC addresses are rendered with caps or lowercase hex on different devices, so are IPv6 addresses in hex format, email addresses - firstname.lastname or Firstname.Lastname is likely to work. File and directory access behaving the same means it's less bother. In Vim I :set ignorecase.
In PowerShell even the string equality check is case insensitive by default, and so are string match and split. When I'm doing something like searching a log I want to see the English word 'error' whether it's 'error' or 'ERROR' or 'Error', because I don't know which it is.
If I say the name of a document to a person I don't spell out the capitalisation. I don't want to have to do that to the computer, especially because there is almost no reason to have "Internal site 2 Network Diagram" and "INTERNAL site 2 network diagram" and "internal site 2 NETWORK DIAGRAM" in the same folder (and if there were, I couldn't easily keep them apart in my head).
All the time in command prompt shell, I press shift less often, type less, change directories and work with files more smoothly with less tripping over hurdles and being forced to stop and doublecheck what I'm tripping over when I read "word" and typed "word" and it didn't work.
On the other hand, the edge cases it causes me are ... well, I can't think of any because I don't want to put many files differing only by case in one directory. Maybe uncompressing an archive which has two files which clash? I can't remember that happening. Maybe moving a script to a case sensitive system? I don't do that often. In PowerShell, method calls are case insensitive. C# has "string".StartsWith() and JavaScript has .startsWith() and PowerShell will take .startswith() or .StartsWith or .Startswith or anything else. That occasionally clashes if there's a class with the same name in different case but that's rare, even.
In short, the computer pays attention to trivia so I don't have to. That's the right way round. It's about the best/simplest implementation of Do What I Mean (DWIM) that's almost always correct and almost never wrong.
> Both options are independent of file system case-sensitivity.
In Windows world it works everywhere, in any win32 program - file open dialogs, et al. Here you have to have it built in to every tool. (and windows doesn't do it at the filesystem layer)
None of these are the filesystem though, they are all abstractions over the file system that could easily implement case insensitivity, and as a sibling comment pointed out, actually do in many cases. I'm perfectly fine with the idea of interacting with files using a case insensitive interface. I just don't feel like it should be the job of the filesystem to enforce case insensitivity.
Case preserving and case sensitive are two subtly different things. Most case-insensitive file systems are case preserving (and the same goes for whatever the UTF-8 equivalent is called; I forget the name).
heh, I especially enjoy that in a huge thread about how capitalization does and doesn't matter, "gps point" was not, in fact, concerning some coordinates of the global positioning system but rather "GP's point". I first chalked it up to some autocomplete artifact but then realized what was actually happening
> Casing is usually not meaningful even in written language. "Hi, how are you?" means the same thing as "hi, how are you?" Uppercase changes meaning only when distinguishing between proper and common nouns, which is rarely a concern we have with file names anyway.
The number of spaces is usually not meaningful in written language. "Hi, how are you?" means the same thing as "Hi, how are you ?". I don't think that's a good reason to make file systems ignore space characters.
No offense, but I think that's a very western-centric view. Your example only makes sense when the user is familiar with English (or other western languages, I guess). To me personally, I find it strange that "D.txt" and "d.txt" mean the same file, since they are two very different characters. Likewise, I think you would also go crazy if I told you "ア.txt" and "あ.txt" mean the same file (they are katakana and hiragana for A respectively, which in a sense is equivalent to uppercase and lowercase in Japanese), or that "一.txt" and "壹.txt" mean the same file (both mean the number 1 in Chinese; we literally call the latter one "uppercase number").
Agreed, and you could even take this into "1.txt" being the same as "One.txt". Which, I mean, fair that I would expect a speech system to find either if I speak "One dot t x t". But, it would also find "Won.txt" and trying to bridge the phonetic to the symbolic is going to be obviously fraught with trouble.
What if Unicode updates some capitalization rules in the next version, and after an OS update some filenames now collide and one of them is inaccessible?
If someone says they sent you "Book Draft 1.docx" and you check your email to find "Ⓑⓞⓞⓚ Ⓓⓡⓐⓕⓣ ①.ⓓⓞⓒⓧ", "฿ØØ₭ ĐⱤ₳₣₮ 1.ĐØ₵Ӿ" - these are different files.
Ages ago on Flowdock at work (a chat webapp kind of like Slack that no longer exists), I used the circle ones for a short time as my nickname, and no one could @ me.
File systems are not user interfaces. They are interfaces between programs and storage. Case sensitive is much better for programs.
The user shell can choose however it wants to handle file names, a case sensitive file system does not prevent the shell from handling file names case insensitively.
> Why? Windows is also not case-sensitive, so it's not like there's a near-universal convention that S3 is ignoring.
Not sure why what Windows does is relevant to this, honestly. Personally, I strongly prefer case sensitivity with filenames, but the lack of it isn't a dealbreaker or anything.
What are some of the advantages of case sensitivity? Are you saying you actually want to save "Book draft 1.docx" and "Book Draft 1.docx" as two separate files? That just sounds like asking for trouble.
The advantages that I value are that case sensitivity means I can use shorter filenames, it makes it easier to generate programmatic filenames, and I can use case to help in organizing my files.
> Are you saying you actually want to save "Book draft 1.docx" and "Book Draft 1.docx" as two separate files?
That's a situation where sensitivity can cause difficulty, yes, but for me personally, that's a minor confusion that is easy to avoid or correct. Everything is a tradeoff, and for me, putting up with that annoyance is well worth the benefits of case sensitivity.
I do totally understand that others will have different tradeoffs that fit them better. I'm not taking away from that at all. But saying "case sensitivity is undesirable" in a broad sense is no more accurate than saying "case sensitivity is desirable" in a broad sense.
Personally, I think the ideal tradeoff is for the filesystem to be case sensitive, but have the user interfaces to that file system be able to make everything behave as case-insensitive if that's what the user prefers.
Unicode case folding is a complicated algorithm, and its definition is subject to change with updated Unicode versions. It's nice not to have to worry about that.
Okay, but I don't think this has anything to do with the use case JohnFen mentioned or my questions about it.
If your goal is super easy filename generation then you're probably not going to leave ASCII.
And if you do go beyond ASCII for filename packing/generating, then you should instead use many thousands of CJK characters that don't have any concept of case at all. Bypass the question of case sensitivity entirely.
Enough that I prefer it. If that were the only advantage, I'd only slightly prefer it. But being able to use case as a differentiator in filenames intended for me to read is something I find even more valuable.
A filesystem not being case sensitive isn't a dealbreaker or anything. I just prefer case sensitivity because it increases flexibility and readability for me, and has no downsides that I consider significant.
Also note that 'are these 2 words case insensitively equal' is impossible to answer without knowing what locale rules to apply. And given that people's personal names tend to have the property that any locale rules that must be applied are _the locale that their name originates from_, and that no repository of names I am aware of stores locale along with the name, that means what you want is impossible.
In line with case insensitivity, do you think `müller` and `muller` should boil down to for example the same username for login purposes?
That's... tricky. In German, the standard way to transliterate names to strict ASCII would be to turn `müller` into `mueller`. In Swiss German that is in fact mandatory. Nobody in Switzerland is named `müller` but you'll find loads of `mueller`s. Except.. there _are_ `müller` in Switzerland - probably German citizens living there.
So, just normalize `ü` to `ue`, easy, right? Except that one doesn't reverse all that well, but that's probably all right. But - no. In other locales, the asciification of `ü` is not `ue`. For example, `Sjögren` is Swedish and that transliterates to `sjogren`, not `sjoegren`.
Bringing it back to casing: Given the string `IJSSELMEER`, if I want to title case that, the correct output is presumably `IJsselmeer`. Yes, that's an intentional capital I capital J. Because it's a Dutch word and that's how it goes. In an optimal world, there is a separate Unicode glyph for the Dutch IJ as a single letter so we can stick with the simple rule of 'to title case a string, upper case the first glyph and lowercase all others, until you see a space glyph, in which case, uppercase the next'. But the Dutch were using computers fairly early on and went with using the I and the J (plain ASCII) for this stuff.
And then we get into well trodden ground: In Turkish, there is both a dotted and a dotless i. For... reasons they use plain jane ASCII `i` for lowercase dotted i and plain jane ASCII `I` for uppercase dotless I. But they have fancy non-ASCII Unicode glyphs for 'dotted capital I' and 'dotless lowercase i'.
So, __in Turkish__, `IZMIR` is not case-insensitive equal to `izmir`. Instead, `İZMİR` and `izmir` are equal.
I don't know how to solve this without either bringing in hard AI (as in, a system that recognizes 'müller' as a common German surname and treats it as equal to 'mueller', but would not treat `xyzmü` as equal to `xyzmue` - and treats IZMIR as not equal to izmir, because it recognizes it as the name of a major Turkish city and thus applies Turkish locale rules), or decreeing to the internet: "get lost with your fancypants non-US/UKian weird word stuff. Fix your language or something" - which, well, most cultures aren't going to like.
'files are case sensitive' sidesteps alllllll of this.
Yeah, but that little bit of user friendliness ruins the file system for file system things. Now you need “registries” and other, secondary file systems to do file system things because you can’t even use base64 in file names. Make your file browsing app case insensitive, if that’s what you want. Don’t build inferiority down to the core.
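The base64 point is easy to demonstrate. A quick Python sketch, with two payloads I picked by hand so that their encodings differ only in the case of the last letter (the payloads themselves are arbitrary):

```python
import base64

# Two different payloads whose base64 encodings differ only by letter case.
a = base64.b64encode(b"ABC").decode("ascii")  # "QUJD"
b = base64.b64encode(b"AB]").decode("ascii")  # "QUJd"

print(a != b)                  # True: two distinct names on a case-sensitive filesystem
print(a.lower() == b.lower())  # True: the same name once case is ignored

# Base32 uses a single-case alphabet, so it survives case-insensitive storage,
# at the cost of longer names.
print(base64.b32encode(b"ABC").decode("ascii"))
print(base64.b32encode(b"AB]").decode("ascii"))
```

So on a case-insensitive filesystem you either switch encodings or accept that some generated names will silently overwrite each other.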
From a technical implementation pov 'A' & 'a' are well established as different characters (ASCII, Unicode, etc). Regardless of personal preference, I don't understand how a developer/sysadmin can be surprised and even frustrated that a file system is case sensitive.
The developer is still free to abstract this away for the end user when it makes sense such as search results
Author here. There's no complaint. It's an observation rather than an absolute good or bad. It's something you have to consider in designing your application.
Why exactly? I'm not aware of any benefits of filenames being case-sensitive; it just opens up room for tons of very common mistakes that literally can't happen otherwise. It's not like in coding where it helps enforce the code style and thus aids readability - and even in programming it was a PITA source of bugs before IDEs became smart enough to catch typos in var names. One thing I loved most in Pascal is that it didn't care about case, unlike C.
A case-insensitive comparison needs a locale as input in order to apply the correct case conversion rules.
The most common example is probably that i (U+0069 LATIN SMALL LETTER I) and I (U+0049 LATIN CAPITAL LETTER I) transform into each other in most locales, but not all. In locales az and tr (the Turkic languages), i uppercases to İ (U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE), and I lowercases to ı (U+0131 LATIN SMALL LETTER DOTLESS I).
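You can see the locale dependence from Python, whose built-in `str.lower()` and `str.upper()` take no locale and always apply the default Unicode mappings. The `lower_tr` and `upper_tr` helpers below are hypothetical, a rough sketch of the tr/az rules for just these letters, not a real locale-aware implementation:

```python
DOTTED_CAPITAL_I = "\u0130"  # İ
DOTLESS_SMALL_I = "\u0131"   # ı

def lower_tr(s: str) -> str:
    """Lowercase with the Turkish/Azerbaijani rules for I and İ (illustrative helper)."""
    return s.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I).lower()

def upper_tr(s: str) -> str:
    """Uppercase with the Turkish/Azerbaijani rules for i and ı (illustrative helper)."""
    return s.replace("i", DOTTED_CAPITAL_I).replace(DOTLESS_SMALL_I, "I").upper()

print("IZMIR".lower())   # izmir  (default mapping)
print(lower_tr("IZMIR")) # ızmır  (Turkish mapping: dotless throughout)
print(upper_tr("izmir")) # İZMİR  (Turkish mapping: dotted throughout)

# The default lowercase of İ is "i" plus a combining dot above: two code points.
print(len(DOTTED_CAPITAL_I.lower()))  # 2
```

Same input, two different "correct" answers depending on who is asking, which is exactly what a filesystem cannot know.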
Case-insensitivity is all fine if you only handle text that consists of A-Za-z, but as soon as you want to write software that works for all languages it becomes a mess.
This is the main point, and almost all the other chatter is not particularly relevant. A dumb computer and a human can agree with "files are case sensitive and sometimes that's a bit weird but computers are weird sometimes". If there was indeed exactly one universal way to have case insensitivity it would be OK. Case insensitive file systems date from when there was. Everything was English and case folding in English is easy. Problem solved. But that doesn't work today. And having multiple case folding rules is essentially as unsolvable a problem as the problems that arise from case sensitivity, except they're harder for humans to understand, including programmers.
Simple and wrong is better than complicated and wrong and also the wrong is shoved under the carpet until it isn't.
Though you still ought to declare a Unicode normalization on the file system. Which would be perfectly fine if it weren't for backwards compatibility.
Except at the UI layer (where you can easily offer suggestions and do fuzzy search), the opposite is true. There are so many different ways to do case-insensitive string comparisons, and it's so easy to forget to do that in one place, that case-insensitivity just leads to a ton of bugs (some of which will be security critical).
For example, did you know that Microsoft SQL Server treats the columns IS_ADMIN and is_admin as either the same or two different columns depending on the database locale (because e.g. Turkish distinguishes between i and I)? That's at least a potential security bug right there.
macOS is case preserving, though. To me, it’s the best of both worlds. You can stylize your file names and they will be respected, but you don’t have to remember how they are stylized when you are searching for or processing them because search is case insensitive.
IMO this is the worst possible solution, as what you are seeing is not what you are getting. You do not actually know what is being stored on the file system, and your searches are fuzzy rather than precise.
Maybe macOS is case-preserving, but it's not encoding-preserving. If you create a file using a composed UTF-8 "A", the filesystem layer will decompose the string to another form, and create the filename using that decomposed form "B". Of course, "A" and "B" when compared can be completely different (even when compared using case insensitivity enabled), yet will point to the same file.
macOS (Darwin) has always written filenames as NFD via the macOS APIs. The underlying POSIX-ish APIs may not do NFD, but Finder and every native macOS GUI program gets files in NFD format.
macOS has case sensitivity. It's just off by default and is a major pain to turn on. You have to either reinstall from scratch onto a case-sensitive partition, or change the "com.apple.backupd.VolumeIsCaseSensitive" xattr from 0 to 1 in a Time Machine backup of your whole system and then restore everything from it.
You shouldn't do this if you value things working, though-- this is a pretty rare configuration (you have to go way out of your way to get it), so many developers won't test with it and it's not unheard of for applications to break on case-sensitive filesystems.
If you absolutely need case-sensitivity for a specific application or a specific project, it's worth seeing if you can do what you need to do within a case-sensitive disk image. It may not work for every use-case where you might need a case-sensitive FS, but if it does work for you, it avoids the need to reinstall to make the switch to a case-sensitive FS, and should keep most applications from misbehaving because the root FS is case-sensitive.
Most things work fine, but it will break (or at least did break at one point) Steam, Unreal Engine, Microsoft OneDrive, and Adobe Creative Cloud. I'm rather surprised about the first two, since they both support Linux with case-sensitive filesystems. I took the opposite approach as you, though: making my root filesystem case-sensitive and creating a case-insensitive disk image if I ever needed those broken programs.
I keep a case sensitive volume around to checkout code repositories into.
For everything else I prefer it insensitive, but my code is being deployed to a case sensitive fs.
It's not "the thing I like", it's the better tradeoff. It's less complex and thus more secure (due to reduced API surface and fewer opportunities to make lookup mistakes or to mistakenly choose the wrong out of dozens of kinds of case-insensitive comparison in a security decision). It's also potentially faster, and more compatible with other Unixes.
TIL. I've switched to Mac about a year ago and it's sufficiently Linux-like to take such things for granted. I wonder what other surprises are going to bite me in the future.
Case insensitive is how humans think about names. “John” and “New York” are the same identifiers as “john” and “new york”. It would be pretty weird if someone insisted that their passport is invalid because the name is printed in all caps and that’s not their preferred spelling.
IMO the best thing would be to call Unix-style case-sensitive file names something else. But it’s obviously too late for that.
The word “Turkey” is not the same as “turkey”, “August” is not the same as “august”, and “Muse” is not the same as “muse”. https://en.m.wikipedia.org/wiki/Capitonym
Humans will also treat "Jyväskylä", "Jyvaskyla" and "Jyvaeskylae" as the same identifiers but I don't think that's a good basis for file storage to have those be the same filenames.
In the era of Unicode, this battle is pretty much lost. Several different code point sequences can produce the glyph 'ä', and user input can contain any of these. You need to normalize anyway.
Agreed. I think case sensitivity in Unix filesystems is actually a pretty poor design decision. It prioritizes what is convenient for the computer (easy to compare file paths) over what makes sense for the user (treating file paths the same way human intuition does).
In Germany there is a lowercase letter ß. It actually is a ligature of the letters s and z. It does not have an uppercase variant, because there is no word that begins with it. One word would be Straße. If you write that all in uppercase, it technically becomes STRASZE, although you almost always see STRASSE. But if you write that all in lowercase without substituting SS with ß, you are making a mistake. And although Switzerland is a German-speaking country, they have different spelling and rarely, if ever, use ß.
This is just one of many cases where case-insensitivity would give more trouble than it's worth. And others pointed out similar cases with the Turkish language in this post.
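Python's built-in string methods, which apply the default Unicode mappings without any locale, show the round-trip problem directly. A small sketch; the Maße/Masse pair is my own example of two different German words that fold to the same string:

```python
word = "Straße"

upper = word.upper()
print(upper)          # STRASSE
print(upper.lower())  # strasse, not straße: the round trip loses the ß

# Case folding maps ß to "ss", so the two spellings compare equal...
print(word.casefold() == upper.casefold())  # True

# ...and so do genuinely different words.
print("Maße".casefold() == "Masse".casefold())  # True
```

A case-insensitive filesystem built on these rules would have to treat "Maße.txt" and "Masse.txt" as the same file, or pick some other rule and surprise someone else.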
But the thing is that the file system doesn't need to be case-insensitive for your system to support human intuition! As others have said, people don't look at and use filesystems, they use programs that interface with the filesystem. You can absolutely have a case-sensitive system that nonetheless lets you search files in a case-insensitive manner, for example. After all, to make searches efficient, you might want to index your file structure, and while doing that, you might as well also have a normalised file name within the index you search against.
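A rough Python sketch of that kind of index, assuming a made-up `search_key` normalization (NFC plus case folding, which is a simple heuristic and not full Unicode caseless matching). The stored names stay exactly as written; only the lookup key is folded:

```python
import unicodedata
from collections import defaultdict

def search_key(name: str) -> str:
    """Normalized form used only for lookups (hypothetical helper)."""
    return unicodedata.normalize("NFC", name).casefold()

def build_index(names):
    """Map each normalized key to every real file name that produces it."""
    index = defaultdict(list)
    for name in names:
        index[search_key(name)].append(name)
    return index

def find(index, query: str):
    """Case-insensitive lookup that returns the exact, case-sensitive names."""
    return index.get(search_key(query), [])

# In practice the names could come from something like os.listdir(directory).
index = build_index(["Book Draft 1.docx", "book draft 1.docx", "Notes.txt"])

print(find(index, "BOOK DRAFT 1.DOCX"))  # ['Book Draft 1.docx', 'book draft 1.docx']
print(find(index, "notes.TXT"))          # ['Notes.txt']
```

The search layer can return both matches and let the user pick, while programs that need exact names keep getting exact names.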
Now, as you said, UNIX made the choice that's easier for computers. And for computers, case-insensitive filesystems would be worse. There are things that are definitely strange about UNIX filesystems (who doesn't love linefeeds in file names!?), but case-sensitivity is not one of them.
I don't know if that's right. The most obvious way two characters can be the same is if they actually look exactly the same i.e. are homoglyphs https://en.wikipedia.org/wiki/Homoglyph
But no filesystem I am aware of is actually homoglyph insensitive.
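A tiny Python illustration of the homoglyph case, using Latin "a" and Cyrillic "а" as my own example pair:

```python
import unicodedata

latin = "a.txt"
cyrillic = "\u0430.txt"  # starts with CYRILLIC SMALL LETTER A

print(latin, cyrillic)    # render the same in most fonts
print(latin == cyrillic)  # False

print(unicodedata.name(latin[0]))     # LATIN SMALL LETTER A
print(unicodedata.name(cyrillic[0]))  # CYRILLIC SMALL LETTER A

# Unicode normalization does not unify them either.
print(unicodedata.normalize("NFKC", cyrillic) == latin)  # False
```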
Case insensitive filesystems picked one arbitrary form of intuition (and not even the most obvious one) in one language (English) and baked that into the OS at a somewhat deep level.
You say "human intuition" - are those using different writing systems nonhuman then?
Except that is not true, it is sometimes convenient, and sometimes very inconvenient and not wanted. My reasoning for file systems that are case sensitive is the following:
1. Some people want file systems to be case sensitive.
2. Case sensitive is easier to implement. This is very much not a trivial thing. Case insensitivity only really makes sense for ASCII.
In the camp of wanting case insensitivity:
1. Some people want file systems to be case insensitive.
The case sensitivity one is easy, here's a thing that's more likely to be entirely unintuitive:
S3 paths are fake. Yes, it accepts uploads to "/builds/1/installer.exe", and yes, you can list what's in /builds, but all of that is a simulation. What you actually did was to upload a file literally named '/builds/1/installer.exe' with the '/' as part of the name.
So, "/builds/1//installer.exe" and "/builds//1/installer.exe" are also possible to upload and entirely different files. Because it's the name of a key, there's no actual directories.
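A toy model of this, assuming nothing more than a dictionary of keys: `list_keys` below is a hypothetical helper that imitates a prefix-plus-delimiter listing (the idea behind S3's `Prefix`, `Delimiter` and common prefixes). It is not the real API.

```python
def list_keys(keys, prefix="", delimiter="/"):
    """Toy delimiter-based listing over a flat key space.

    Returns (objects, common_prefixes): keys directly "under" the prefix,
    and the "sub-directories" inferred from longer keys.
    """
    objects, common_prefixes = [], set()
    for key in sorted(keys):  # note: every key is scanned
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            common_prefixes.add(prefix + head + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common_prefixes)


# Three different objects; the slashes are just characters in the name.
bucket = {
    "/builds/1/installer.exe": b"a",
    "/builds/1//installer.exe": b"b",
    "/builds//1/installer.exe": b"c",
}

print(list_keys(bucket, "/builds/"))
print(list_keys(bucket, "/builds/1/"))
```

The "directories" only exist as a by-product of the listing; nothing stops `/builds//` and `/builds/1/` from both showing up.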
and which also according to the command docs i read recently add random limitations to maybe half of the S3 API
I'm not sure reusing (parts of) the protocol was a good idea given unrelated half of it was already legacy and discouraged (e.g. the insane permission model with policies and acls) or even when not, let's say... weird and messy.
Aside from resolution of paths to some canonical version (e.g. collapsing redundant /'s in your example), what actually is an "actual directory" other than a prefix?
An actual directory A with 2 child directories A/B and A/C, each of which with 1 million files, doesn't need to go through millions of entries in an index to figure out the names of B and C.
It's also not possible for A and A/ to exist independently, and for A to be a PDF, A/ an MP4 and A/B a JPG under the A/ "directory". All of which are possible, simultaneously, with S3.
An actual directory is a node in a tree. It may be the child of a parent and it may have siblings - its children may in turn be parents with children of their own.
Yeah, the prefix thing is a source of so many bugs. I get why AWS took that approach and it’s actually a really smart approach but it still catches so many developers out.
Just this year our production system was hit by a weird bug that took 5 people to find. Turned out the issue was an object literally just named “/” and the software was trying to treat it as a path rather than a file.
I can't trust myself using S3 (or any other AWS service). Nothing is straightforward, there are too many things going on, too much documentation that I should read, and even then (as OP shows) I may accidentally and unknowingly expose everything to the world.
I think I'll stick to actually simple services, such as Hetzner Storage Boxes or DigitalOcean Spaces.
I like digital ocean spaces, but it has its own annoying quirks.
Like I recently found out if you pipe a video file larger than a few MB, it’ll drop the https:// from the returned Location. So on every file upload I have to check if the location starts with https, and add it on if it’s not there.
Of course the S3 node client GitHub issue says “sounds like a digital ocean bug”, and the digital ocean forums say “sounds like an S3 node client bug” lol
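The check described above might look roughly like this. It is sketched in Python rather than Node, and both the helper name `ensure_https` and the example host are hypothetical.

```python
def ensure_https(location):
    """Prefix a scheme-less upload location with https://; leave full URLs alone."""
    if location.startswith(("https://", "http://")):
        return location
    return "https://" + location.lstrip("/")


# Hypothetical values of the Location field returned after an upload.
print(ensure_https("example-space.example.com/videos/clip.mp4"))
print(ensure_https("https://example-space.example.com/videos/clip.mp4"))
```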
The way that DO handles secrets should scare anyone. Did you know that if you use their Container Registry and set it up so that your K8S has automatically access to it, their service will create a secret that has full access to your Spaces?
> Nothing is straightforward, there are too many things going on, too much documentation that I should read, and even then (as OP shows) I may accidentally and unknowingly expose everything to the world.
I took a break from cloud development for a couple of years (working mostly on client stuff) and just recently got back. I am shocked at the amount of complexity built over the years along with the cognitive load required for someone to build an ironclad solution in the public cloud. So many features and quirks which were originally designed to help some fringe scenario are now part of the regular protocol, so that the business makes sure nobody is turned away.
Here is a good one: deleting billions of objects can be expensive if you call delete APIs.
However you can set a wildcard or bucket wide object expiry of time=now for free. You’ll immediately stop being charged for storage, and AWS will manage making sure everything is deleted.
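As a rough sketch, a bucket-wide expiry rule could be expressed as the dictionary below. The rule ID and bucket name are placeholders, and the one-day value is an assumption: as far as I know, day-based expiry has to be a positive number of days, so "now" in practice means "as soon as the rule allows". Versioned buckets would need additional rules for noncurrent versions.

```python
def expire_everything_config(days=1):
    """Build a lifecycle configuration that expires every object in a bucket."""
    return {
        "Rules": [
            {
                "ID": "expire-all-objects",   # placeholder rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},     # empty prefix matches every key
                "Expiration": {"Days": days},
            }
        ]
    }


config = expire_everything_config()
print(config)

# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket", LifecycleConfiguration=config)
```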
Nit: the delete call is free, it’s the list call to get the objects that costs money. In theory if you know what objects you have from another source it’s free.
That's true but OP clearly knows that already. You stop getting charged for storage as soon as the object is marked for expiration, not when the object is finally removed at AWS's leisure. You can check the expiration status of objects in the metadata panel of the S3 web console.
> There may be a delay between the expiration date and the date at which Amazon S3 removes an object. You are not charged for expiration or the storage time associated with an object that has expired.
Because AWS gets to choose when the actual deletes happen. The metadata is marked as object deleted but AWS can process the delete off peak. It also avoids the s3 api server being hammered with qps
That bit on failed multipart uploads invisibly sticking around (and incurring storage cost, unless you explicitly specify some lifecycle magic)... just ugh.
My proposal was that parts of incomplete uploads would stick around for only 24 hours after the most recent activity on the upload, and you wouldn't be charged for storage during that time. ahenry@ vetoed that.
we had cron script on a very old server running for almost a decade, starting a multipart upload every night, pushing what was supposed to be backups to a bucket that also stored user-uploaded content so it was normal that the bucket grows in size by a bit every day. the script was 'not working' so we never relied on the backup data it was supposed to be pushing, never saw the files in s3, the bucket grew at a steady and not unreasonable pace. and then this spring i discovered that we were storing almost 3TB of incomplete multipart uploads.
and yes, i know that anecdote is just chock full of bad practices.
Yeah, I've definitely treaded on the storage cost landmine. Thankfully it was just some cents in my case, but it's really infuriating how badly the console exposes the information.
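For reference, the "lifecycle magic" for incomplete multipart uploads is a rule with an `AbortIncompleteMultipartUpload` action. A minimal sketch of such a configuration follows; the rule ID and the seven-day window are arbitrary choices for illustration.

```python
def abort_incomplete_uploads_config(days_after_initiation=7):
    """Build a lifecycle configuration that cleans up unfinished multipart uploads."""
    return {
        "Rules": [
            {
                "ID": "abort-incomplete-multipart",  # placeholder rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},            # apply to the whole bucket
                "AbortIncompleteMultipartUpload": {
                    "DaysAfterInitiation": days_after_initiation
                },
            }
        ]
    }


mpu_config = abort_incomplete_uploads_config()
print(mpu_config)

# Applied the same way as any lifecycle configuration, e.g. with boto3:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket", LifecycleConfiguration=mpu_config)
```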
Pretty glad about it considering how much more simpler ASCII was to work with compared to Unicode.
I say it as a non native english speaker, programming has so many concepts and stuff already, its best not to make it more complex by adding a 101 different languages to account for.
Unicode and Timezone, the two things that try to bring more languages and cultures to be accounted for while programming and look what happens, it creates the most amount of pain for everyone including non native english programmers.
I dont want to write computer programs in my non-english native tongue, if that means i’ll have to start accounting for every major language while im programming.
Its fine that IT discussions are so English-centric. Diversity is more complexity, and no one owns the english language, its just a tool used by people to communicate, thanks to that universal language, I can express my thoughts to most people in India, China, Japan, South America, etc due to having 1 common language.
They all own the english language too, the moment they decided to speak in it.
Politics is just "how people think things should be". Therefore politics are everywhere not because people _bring_ them everywhere but because they arise from everything.
Your comment is in fact full of politics, down to your opinion that politics shouldn't be included in this discussion.
> thanks to that universal language, I can express my thoughts to most people in India, China, Japan, South America, etc due to having 1 common language
Personally my impression is that native speakers just run circles around everyone else during meetings and such. Being _truly_ comfortable in the language, mastering social cues, being able to confidently and fluently express complex ideas, mean that they effectively take over the room. In turn that means they will hold more power in the company, rise in rank more quickly, get paid more, etc.. There's an actual, significant consequence here.
Plus, anglos usually can't really speak another language, so since they don't realize how hard it is they tend to think their coworkers are idiots and will stick to do things with other anglos rather than include everyone.
> Diversity is more complexity
In a vacuum I agree, but within the context of your comment this is kinda saying "your existence makes my life too complex, please stop being different and join the fold"; and I can't agree with that sentiment.
You raise an interesting point about the nature of politics. I’ve been thinking about this a bit, but it seems to me that radical/revolutionary politics are talking about how people want things to be while quotidian political ideas are more about how people ought to do a few things. The distinction here being people’s timelines and depth of thought. If a policy has some seriously bad consequences, people may not notice because they weren’t really thinking of how things should be, just the narrower thought of how a thing ought to be done (think minimum wage driving automation rather than getting people a better standard of living, or immigration control driving police militarization). Of course, for most politicians, I am not sure either of these are correct. I think for politicians, politics is just the study of their own path to power; they likely don’t care much about whether it’s how things are done or how things ought to be so long as they are the ones with the power.
I don’t know that this comment really adds anything to the conversation, but I do find it all interesting.
Edit: also, on topic, languages are fun. The world is boring when everything is in one language. Languages also hold information in how they structure things, how speakers of that language view the world, and so on, and in those ways they are important contributors to diversity of thought.
> considering how much more simpler ASCII was to work with compared to Unicode.
And elemental algebra is more simple than differential calculus.
ASCII being simpler just means it is not adequate to represent innate complexity that human languages have. Unicode is not complex because of "diversity politics", whatever that means. It is because languages are complex.
The same story with time zones: they are as complex as time is.
> thanks to that universal language, I can express my thoughts to most people in India, China, Japan, South America, etc due to having 1 common language.
My lazy ass wishes that English was enough to access those communities too. There are many cool and interesting developers, projects and communities only or mostly working in their native languages. One of the major motivations for me to learn Chinese now is to access those communities.
I'm not sure why you characterise this as political.
I wish the "case" was a modifier like Italic, bold. It would have been easier to _not_ have separate ASCII codes for upper and lower-case letters in the first place. What are your thoughts on MS Word using different characters for opening and closing quotes?
* Multipart uploads cannot be performed from multiple machines having instance credentials (as the principal will be different and they don't have access to each other's multipart uploads). You need an actual IAM user if you want to assemble a multipart upload from multiple machines.
* LIST requests are not only slow, but also very expensive if done in large numbers. There are workarounds ("bucket inventory") but they are neither convenient nor cheap
* Bucket creation is not read-after-write consistent, because it uses DNS under the hood. So it is possible that you can't access a bucket right after creating it, or that you can't delete a bucket you just created until you waited enough for the changes to propagate. See https://github.com/julik/talks/blob/master/euruko-2019-no-su...
* You can create an object called "foo" and an object called "foo/bar". This will make the data in your bucket unportable into a filesystem structure (it will be a file clobbering a directory)
* S3 is case-sensitive, meaning that you can create objects which will be unportable into a filesystem structure (Rails file storage assumed a case-sensitive storage system, which made it break badly on macOS - this was fixed by always using lowercase identifiers)
* Most S3 configurations will allow GETs, but will not allow HEADs. Apparently this is their way to prevent probing for object existence, I am not sure. Either way - cache-honoring flows involving, say, a HEAD request to determine how large an object is will not work (with presigned URLs for sure!). You have to work around this doing a GET with a Range: of "very small" (say, the first byte only)
* If you do a lot of operations using pre-signed URLs, it is likely you can speed up the generation of these URLs by a factor of 10x-40x (see https://github.com/WeTransfer/wt_s3_signer)
* You still pay for storage of unfinished multipart uploads. If you are not careful and, say, these uploads can be initiated by users, you will be paying for storing them - there is a setting for deleting unfinished MP uploads automatically after some time. Do enable it if you don't want to have a bad time.
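The ranged-GET workaround from the list above can be sketched as follows. Only the header handling is shown; the request itself is left out, and `total_size_from_content_range` is a hypothetical helper name. It relies on the standard HTTP `Content-Range` response header, which carries the full size after the slash.

```python
import re

# Request header asking for the first byte only.
RANGE_HEADER = {"Range": "bytes=0-0"}


def total_size_from_content_range(value):
    """Return the full object size from e.g. 'bytes 0-0/12345', or None if unknown."""
    match = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+|\*)", value.strip())
    if match is None or match.group(3) == "*":
        return None
    return int(match.group(3))


print(total_size_from_content_range("bytes 0-0/12345"))
```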
These are just off the top of my head :-) Paradoxically, S3 used to be revolutionary and still is, on multiple levels, a great product. But: plenty of features, plenty of caveats.
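A quick way to spot the two portability problems mentioned in the list (a key that is both a "file" and a "directory", and keys that differ only by case) before copying a bucket to disk. Both helper names and the sample keys are made up for this sketch.

```python
from collections import defaultdict


def file_directory_clashes(keys):
    """Keys that would need to be both a file and a directory ('foo' next to 'foo/bar')."""
    keys = set(keys)
    return sorted(k for k in keys if any(o.startswith(k + "/") for o in keys))


def case_clashes(keys):
    """Groups of keys that collide on a case-insensitive file system."""
    groups = defaultdict(list)
    for key in sorted(keys):
        groups[key.casefold()].append(key)
    return [group for group in groups.values() if len(group) > 1]


sample_keys = ["foo", "foo/bar", "Readme.md", "README.md", "docs/a.txt"]
print(file_directory_clashes(sample_keys))
print(case_clashes(sample_keys))
```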
The one that caught me a couple of weeks ago is that multipart uploads have a minimum initial chunk size of 5 MiB (https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts...). I built a streaming CSV post-processing pipeline in Elixir that uses Stream.transform (https://hexdocs.pm/elixir/Stream.html#transform/3) to modify and inject columns. The Elixir AWS and CSV modules handle streaming data in, but the AWS module throws an error (from S3) if you stream "out" data that totals less than 5 MiB, as it uses multipart uploads, which made me sad.
The last part can be any size, so with a few tweaks to the streaming code you should be fine. Ready-made AWS SDKs handle this (chunking) for you. Truth be told, the multipart upload on GCP is even worse :/
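The tweak amounts to buffering small chunks until a part reaches the minimum size and letting only the final part be smaller. A sketch in Python rather than Elixir, with `rechunk` as a hypothetical helper:

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # 5 MiB


def rechunk(chunks, min_size=MIN_PART_SIZE):
    """Regroup an iterable of byte chunks into parts of at least min_size bytes.

    Only the final part may be smaller than min_size.
    """
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        if len(buffer) >= min_size:
            yield bytes(buffer)
            buffer.clear()
    if buffer:
        yield bytes(buffer)


# Tiny min_size so the behaviour is easy to see.
parts = list(rechunk([b"ab", b"cd", b"e", b"fgh", b"i"], min_size=4))
print(parts)
```

Each yielded part would then be sent as one part of the multipart upload.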
Here's another fun one that took a colleague and I several days of analysis to diagnose. S3 will silently drop all requests once a single TCP connection has sent 100 HTTP requests.
It doesn't silently drop, it sends a header indicating it's closed the TCP connection.
This is a pretty common pattern whereby you want keep-alive for performance reasons, but you don't want clients running _too long_ creating hot spots on your load balancers.
How about the fact that S3 is not suitable for web serving due to high latencies (in standard storage classes)?
Many people think you can just host the resources for your websites, such as images or fonts, straight on S3. But that can make for a shitty experience:
> applications can achieve consistent small object latencies (and first-byte-out latencies for larger objects) of roughly 100–200 milliseconds.
Those uploader decides rules are wild. Does that mean someone with a website poorly configured enough can have user content uploaded to, and later served from amazon glacier? (assuming a sufficiently motivated user)
Basically every new service that expects to deal with a lot of traffic should work this way (we did this also for AWS IoT). It's a hell of a lot easier to deal with load balancing and request routing if you can segment resources starting at the DNS level...
Regarding the deletion point, note that you cannot delete S3 buckets that are non-empty, so in order to actually delete data you have to first manually delete them. Of course, if any action is allowed then that is as well. But still, it's not a single request away for any non-trivial bucket.
My life experience has been that this line of thinking results in two separate outcomes: "the great thing about standards is that there are so many to choose from" <https://xkcd.com/927/>, and that publishing a standard does absolutely zero for getting buy-in, with that last one made worse by any short-sighted fools who publish a bunch of words without a reference impl or ideally a TCK. As this whole thread has shown, for 10 people there will be 15 understandings of any given sentence, which is very very bad when trying to get two computers to agree on something
A workaround for some of the limits of presigned urls like not being able to specify a max file size is to front your uploads using CloudFront OAC and CloudFront Functions. It costs more (.02/GB) but you can run a little JavaScript code to validate/augment headers between your user and S3 and you don't need to expose your bucket name. https://speedrun.nobackspacecrew.com/blog/2024/05/22/using-c...
> … a relatively small part of the API requires HTTP requests to be sent to generic S3 endpoints (such as s3.us-east-2.amazonaws.com), while the vast majority of requests must be sent to the URL of a target bucket.
I believe this is talking about virtual-hosted style and path-style methods for accessing the S3 API.
From what I can see [0], at least for the REST API, the entire API works with either virtual-hosted style (where the bucket name is in the host part of the URL) or path-style (where the bucket name is in the path part of the URL). Amazon has been wanting folks to move over to the virtual-hosted style for a long time, but (as of 3+ years ago) the deprecation of path-style has been delayed[1].
This deprecation of path-style requests has been extremely important for products implementing the S3 API. For example…
* MinIO uses path-style requests by default, requiring you set a configuration variable[2] (and set up DNS appropriately) to handle the virtual-hosted style.
* Wasabi supports both path-style and virtual-hosted style, but "Wasabi recommends using path-style requests as shown in all examples in this guide (for example, http://s3.wasabisys.com/my-bucket/my-object) because the path-style offers the greatest flexibility in bucket names, avoiding domain name issues."[3].
Now here's the really annoying part: The REST API examples show virtual-hosting style, but path style works too!
For example, take the GetBucketTagging example. Let's say you have bucket "karl123456" in region US-West-2. The example would have you do this:
GET /?tagging HTTP/1.1
Host: karl123456.s3.amazonaws.com
But instead, you can do this:
GET /karl123456?tagging HTTP/1.1
Host: s3.us-west-2.amazonaws.com
!!
How do I know this? I tried it! I constructed a `curl` command to do the path-style request, and it worked! (I didn't use "karl123456", though.)
So hopefully that helps resolve at least one of your S3 annoyances :-)
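The two addressing styles differ only in where the bucket name goes. A small sketch that builds both URL forms using regional endpoints (the helper names are hypothetical; real requests would still need signing, and keys would need URL-encoding):

```python
def virtual_hosted_url(bucket, region, key="", query=""):
    """Bucket name in the host part of the URL."""
    url = f"https://{bucket}.s3.{region}.amazonaws.com/{key}"
    return f"{url}?{query}" if query else url


def path_style_url(bucket, region, key="", query=""):
    """Bucket name as the first path segment."""
    url = f"https://s3.{region}.amazonaws.com/{bucket}"
    if key:
        url += f"/{key}"
    return f"{url}?{query}" if query else url


print(virtual_hosted_url("karl123456", "us-west-2", query="tagging"))
print(path_style_url("karl123456", "us-west-2", query="tagging"))
```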
i just built a live streaming platform[1]. chat, m3u8, ts, all objects. list operations used to concat chat objects. works perfectly, object storage is such a great api.
it uses r2 as the only data store, which is a delightfully simple s3-alike. the only thing i miss are some of the advanced listv2 features.
My wife - she's a future developer, I'm a lawyer - has had enough of me explaining some of these details in the form of concerns. I am somewhat satisfied to know that my perplexities are not imaginary.
Wants to be. She reads a lot of code and tries to understand it and write her own. She will probably never be one of you, like an expert. But she's interested, she works laterally in the field - she's done a lot of hours on the intranet at her job. So I think she can be a developer in the future, a good developer. Now she has a degree in the field and... Wait. But then... she is already a developer. I don't know, man. It's a philosophical question.
It is simply not a phrase I commonly read. When I google it I find a visa consultancy under the name, and not much else. Curiously your comment is also among the results.
The problem is that the phrase "<something> developer" is used to describe what the person develops. A "real-estate developer" invests in real estate as a business. A "web developer" develops web applications, a "game developer" develops games, and so on and so on. So reading the word I immediately thought they mean someone who is developing the future? Like idk Douglas Engelbart or Tim Berners-Lee or someone like that.
If you want to write that someone is learning to become a developer I would recommend the much less confusing "developer in training" phrase, or even better if you just write "they are learning to become a developer".
Bad English case. It just is. I pay a lot of attention when I write English, but you can always tell when someone isn't a native English speaker in about two words. Je suis désolé.
Not one thing about different API limits for operations according to key prefixes, and how you need to contact support if you need partitioning and provide them the prefixes, huh?
I’ve become old. If you look at these things in disgust (ACL vs Policies, Delete bucket with s3:*, etc), you’re missing the point of (deterministic ?) software. It does what it is written to do, faithfully, and error out when not. When it doesn’t do as written or as documented, then yes… go full bore.
No mention of how AWS/S3 approximates the size of a file to save CPU cycles. It used to drive me up a wall seeing S3 show a file size as slightly different than what it was.
If I recall correctly, S3 uses 1000 rather than 1024 to convert bytes to KB and MB and GB. This saves CPU cycles but results in “rounding errors” with the file’s reported size.
It’s discussed here, although they’re talking about it being “CLI vs console” which may be true?
This almost certainly has nothing to do with “saving CPU cycles” and is most likely just that whoever created the Cloudwatch console used the same rounding that is used for all other metrics in CW, rather than the proper calculation for disk size, and it was a small enough issue that it was never caught until it was too late to change it because changing it would disrupt the customers that have gotten used to it.
Yeah we have long standing confusion that for historical reasons KB and MB often means 2^10 bytes units so now when you see KB you really don't know what it means. Therefore I am a staunch supporter of unambiguous KiB and MiB.
I think I explained this poorly, and I appear to be mistaken about it being to save CPU cycles (though an entity such as AWS would absolutely be about saving minuscule CPU cycles that add up at scale).
I spent a lot of time researching this issue when I came across it years ago- I just remember that local file size and S3 file size was not matching up with anything and the takeaway was that S3 was calculating file size differently from Windows/Linux/macOS.
AWS uses base 2 (binary) for calculating file size, meaning 1MB is treated as 1,048,576 (2^20) bytes rather than 1,000,000 bytes (10^6).
And as you've said, one is MB while the other is technically MiB.
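The size of the discrepancy is easy to reproduce: the same byte count reads differently depending on whether the divisor is 10^6 or 2^20. This is a plain arithmetic illustration, independent of which convention any particular AWS tool actually uses.

```python
def in_mb(num_bytes):
    """Decimal megabytes: 1 MB = 10**6 bytes."""
    return num_bytes / 10**6


def in_mib(num_bytes):
    """Binary mebibytes: 1 MiB = 2**20 bytes."""
    return num_bytes / 2**20


size = 5_000_000
decimal_label = f"{in_mb(size):.2f} MB"
binary_label = f"{in_mib(size):.2f} MiB"
print(decimal_label, "vs", binary_label)
```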
That's how it should be and I am annoyed at macos for not having it.