Things you wish you didn't need to know about S3 (plerion.com)
336 points by miles 5 months ago | 213 comments



A lot of them are interesting points, but I am not sure I agree with the complaint that the file system is case sensitive.

That's how it should be and I am annoyed at macos for not having it.


> That's how it should be

Why? Windows is also not case-sensitive, so it's not like there's a near-universal convention that S3 is ignoring.

Case sensitivity in file names is surprising even to non-technical people. If someone says they sent you "Book Draft 1.docx" and you check your email to find "Book draft 1.docx," you don't say, "Hey! I think you sent me the wrong file!"

Casing is usually not meaningful even in written language. "Hi, how are you?" means the same thing as "hi, how are you?" Uppercase changes meaning only when distinguishing between proper and common nouns, which is rarely a concern we have with file names anyway.


> If someone says they sent you "Book Draft 1.docx" and you check your email to find "Book draft 1.docx," you don't say, "Hey! I think you sent me the wrong file!"

But you also wouldn't say that if they sent "Book - Draft 1.docx", "Book Draft I.docx", "BookDraft1.docx", "Book_Draft_1.docx", or "Book Draft 1.doc", and surely you wouldn't want a filesystem to treat all of them as the same.


This is a personal reason, but the reason I prefer case sensitive directory names is I can make "logical groupings" for things. So, my python git directory might have "Projects/" and "Packages/," and the capitalization not only makes them stand out as a sort of "root path" for whatever's underneath, but the capitalization makes me conscious of the commands I'm typing with that path. I can't just autopilot a path name, I have to consciously hit shift when tab completion stops working.

That might sound like a dumb reason, but it's kept me from moving things into the wrong directory, or accidentally removing a directory multiple times in the past.

I also use Windows regularly and it really isn't a hindrance, so maybe I wouldn't actually be bothered if everything was case sensitive.


TBF, you don't need case sensitive FS for that, just case retaining is enough. And then have the option on how to sort it.


Don't you need case sensitivity for this part?

> I can't just autopilot a path name, I have to consciously hit shift when tab completion stops working.

On a system that's case retaining but not case sensitive, wouldn't "pr" autocomplete to "Projects"?


No, MacOS doesn’t do that. `cat Foo` and `cat foo` will both work, but only the first one will tab complete if the file is called `Foo`.


zsh tab-completed both just fine, preserving the case in both. I’d have preferred it corrected the case, but meh.


I like it! That's a great idea.

To me, this sounds like a great practice for terminal environments but may be less intuitive when using file system apps. I could easily overlook a single letter capitalization in a GUI view of many directories. Maybe it's because at a terminal the "view" into the file system is narrow?

Now I'm wondering how I can use this in my docker images. I mean that might irritate devops. Well, maybe they'll like it too. Man, thanks for posting this.


You have to draw the line somewhere, but I do appreciate when the UI sorts "Book draft 2" before "Book draft 11". That requires nontrivial tokenization logic and inference, but simple heuristics can be right often enough to be useful.

On that note, ASCIIbetical sort is never the right answer. There is a special place in hell for any human-facing UI that sorts "Zook draft 1" between "Book draft 1" and "book draft 1".
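A rough sketch of the kind of simple heuristic described above, in Python. The `natural_key` helper and the sample names are made up for illustration; it only treats runs of decimal digits as numbers and ignores locale-specific collation entirely.

```python
import re

def natural_key(name: str) -> list:
    """Sort key: digit runs compare as numbers, other runs compare case-insensitively."""
    # re.split with a capturing group alternates text, digits, text, ...
    # so odd positions are always digit runs.
    parts = re.split(r"(\d+)", name)
    return [int(p) if i % 2 else p.casefold() for i, p in enumerate(parts)]

names = ["Book draft 11", "Zook draft 1", "book draft 1", "Book draft 2"]

ascii_order = sorted(names)                     # plain code point ("ASCIIbetical") order
natural_order = sorted(names, key=natural_key)  # numeric-aware, case-insensitive order

print(ascii_order)
print(natural_order)
```

In the plain sort, "Zook draft 1" lands between the capitalized and lowercase "book" entries and "11" sorts before "2"; with the key function, neither happens. None of this needs any support from the filesystem.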


And that line, at least for sorting, belongs firmly outside the filesystem.

Sorting is locale-dependent. Whether a letter-with-dots sorts next to letter-without-dots or somewhere completely different has no correct global answer.


I think there's a pretty big difference between how the UI orders things and how the filesystem treats things as equivalent. A filesystem treating names case sensitively doesn't prevent the UI from tokenizing the names in any other arbitrary way


Capitalization isn't part of grammar. Those examples are different strings of characters altogether.


The classic, if crude, counterexample: "I helped my uncle Jack off a horse."

(The uncapitalized version doesn't just have different semantics; it has a completely different parse-tree!)


I'll augment your statement by noting that punctuation is also not part of grammar.


Another classic counterexample: "This book is dedicated to my parents, Ayn Rand, and God." "This book is dedicated to my parents, Ayn Rand and God."


you called it - those are different situations all right


These are just not the same characters. A filesystem should not have an opinion on what strings of characters _mean_ the same. It is the wrong level of abstraction.

filenames might even not be words at all, and surely not limited to English. We shouldn't implement rules and conventions from spoken English at a filesystem level, certainly not S3.

MacOS and Windows are just wrong about this.


Windows doesn’t have it at the file system layer, NTFS is case sensitive. Windows has it at the Win32 subsystem layer, see replies and comments here:

https://superuser.com/questions/364057/


That's way worse than just putting it on the file system.

Now you have hidden information, that you can't ever change, and may or may not impact whatever you are doing.


What hidden information that you can't ever change?


I think what they mean is if you somehow had two files with the same name but different cases (as NTFS supports this) it would be impossible to fix with win32 calls


> Windows doesn’t have it at the file system layer, NTFS is case sensitive.

I think the common phrasing is "case-aware, not case-sensitive".


No, NTFS has always been at least optionally case sensitive; current Windows versions even allow case-sensitivity to be controlled on a per-directory basis[1], which even works for (some) Win32 programs:

  Microsoft Windows [Version 10.0.22631.3593]
  (c) Microsoft Corporation. All rights reserved.
  
  C:\Users\jtm>mkdir foo
  
  C:\Users\jtm>fsutil file setCaseSensitiveInfo foo
  Case sensitive attribute on directory C:\Users\jtm\foo is enabled.
  
  C:\Users\jtm>echo bar > foo\bar.txt
  
  C:\Users\jtm>echo Bar > foo\Bar.txt
  
  C:\Users\jtm>dir foo
   Volume in drive C is Aristotle-Win
   Volume Serial Number is E4AE-428B
  
   Directory of C:\Users\jtm\foo
  
  2024-05-31  17:55    <DIR>          .
  2024-05-31  17:55    <DIR>          ..
  2024-05-31  17:55                 6 Bar.txt
  2024-05-31  17:55                 6 bar.txt
                 2 File(s)             12 bytes
                 2 Dir(s)  41,524,133,888 bytes free
  
  C:\Users\jtm>type foo\bar.txt
  bar
  
  C:\Users\jtm>type foo\Bar.txt
  Bar
[1] https://learn.microsoft.com/en-us/windows/wsl/case-sensitivi...


And so should we be able to have “é.txt” and “é.txt” in the same directory (with a different UTF-8 normalization?) What encoding should we use BTW?

I’m not advocating for case-insensitive fs (literally the first thing I do when I get a Mac is reformat it to be on a case-sensitive fs), but things are not that simple either.


> And so should we be able to have “é.txt” and “é.txt” in the same directory

That's what Linux does.

It does create some problems that seem to never happen in practice, while it avoids some problems that seem to happen once in a while. So yeah, I'd say it's a good idea.
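For anyone who hasn't run into this, here is a small Python sketch of the two spellings of "é.txt". The variable names are just for illustration; the point is that the two names look identical, are different code point sequences (and different bytes), and only compare equal after normalization.

```python
import unicodedata

composed = "\u00e9.txt"     # "é" as a single code point (NFC form)
decomposed = "e\u0301.txt"  # "e" followed by a combining acute accent (NFD form)

# Different code point sequences, so a byte-oriented filesystem sees two names.
print(composed == decomposed)
print(composed.encode("utf-8"), decomposed.encode("utf-8"))

# Normalizing both to one form makes them compare equal.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
same_after_nfd = unicodedata.normalize("NFD", composed) == decomposed
print(same_after_nfc, same_after_nfd)
```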


You are looking at this from a technical perspective. From the average person's perspective, even files are too much of a technicality to deal with.

As a user I want my work to be preserved, I want to view my photos, and I want the system to know where the funny photo of my dog I took last Christmas is.

As a developer I need an identifier for a resource, and I am not going to let the user decide on the ID of the resource. I put files in the system under a GUID and keep whatever name the user prefers as metadata.

Exposing average people to the filesystem is the wrong level of abstraction. That is why iOS and Android apps are going that way. As I myself am used to dealing with files, it annoys me that I cannot have that level of control, but I accept that I am quite technical.


Dealing with files used to be something everyone interacting with computers had to do. It is something average people can do.

I think too much abstraction is a mistake and adds a lot of unneeded complexity.

People should learn something about the technology they use. If you want to drive, you need to understand how steering wheels work; if you want to drive a manual car (usual where I live and have lived), then you need to know how to work a gear stick and the effect of changing gear.


> used to be something everyone interacting with computers had to do

There were far fewer people 'interacting with computers' at that level years ago.


Everyone with an office job was still a lot of people though.


I'm not even sure 'everyone with an office job' had a computer. It certainly wasn't true 35 years ago. An office might have a computer or two, but not everyone had one, nor was everyone expected to use it.


Case insensitive matching is a surprisingly complicated, locale-dependent affair. Should I.txt and i.txt match? (Note that the first file is not named I.txt).

Case insensitive filesystems make about as much sense as ASCII-only filenames.


How would locale matter?


Off the top of my head, in turkish, `i` doesn't become `I`, it becomes `İ`. And `ı` is the lower case version of `I`


You don't need to decide how to upper or lower case a character to be insensitive to case, though. Treating them all as matching isn't a terrible option.


For example, it depends on the locale if the capitalized form of ß is ß or SS.
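A quick illustration in Python, whose string methods apply Unicode's default (locale-independent) mappings. The sample words are my own; the sketch shows that simply lowercasing both sides is not the same thing as a case-insensitive comparison once ß is involved.

```python
a = "straße"
b = "STRASSE"

# Lowercasing leaves "ß" alone, so the two spellings do not match.
lower_match = a.lower() == b.lower()

# Case folding maps "ß" to "ss", so they do match.
fold_match = a.casefold() == b.casefold()

# The default uppercase of "ß" is the two-letter "SS" ...
upper_of_eszett = "ß".upper()
# ... while the capital letter U+1E9E lowercases back to a single "ß".
lower_of_capital_eszett = "\u1e9e".lower()

print(lower_match, fold_match, upper_of_eszett, lower_of_capital_eszett)
```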


And yet case insensitive file name matching / string matching is one of my favourite windows features. It’s enormously convenient. An order of magnitude more convenient than the edge cases it causes me.

People aren’t ASCII or UTF-8 machines; “e” and “E” are the same character, that they are different ASCII codes is a behind the scenes implementation detail.

(That said, S3 isn’t a filesystem, it’s more like a web hashtable key-to-blob storage)


> People aren’t ASCII or UTF-8 machines; “e” and “E” are the same character

They are the same character to you, a native speaker of a Western language written in a latin script. They are the same to you because you are, in fact, an ASCII machine. Many many people in the world are not.


They are the same to me, they are different in ASCII, therefore I am not an ASCII machine. To me, the person using the computer to do work. Not the person wanting to do extra work to support the computer's internal leaky abstractions of data storage.

Your position, the position of too many people, is that I a native speaker of English etc. should not be allowed to have a computer working how English works because somewhere, someone else is different. This is like saying I shouldn't be allowed an English spell checker because there are other people who speak other languages.


> “e” and “E” are the same character

They don't look like the same character to me. A character is a written symbol. These are different symbols.

What definition of "character" are you using where they're the same character?

I haven't ruled out that I am wrong, this is a naive comment.


Are the words hello and HELLO spelled differently? I am pretty squarely in the camp that filesystems should be case sensitive (perhaps with an insensitive shell on top), but I would not consider those two words as having a different spelling. To me that means they are the same sequence of characters.


You are confusing characters with glyphs. A glyph is a written symbol.


And you seem to be conflating characters and letters. There are fewer letters in the standard alphabet than we have characters for the same, largely because we do distinguish between some letter forms.

I suppose you could imagine a world where we don't, in fact, do this with just the character code. Seems fairly different from where we are, though?


I thought that if they're different glyphs they're different characters.

Surely the fact that they're represented differently in ASCII means ASCII regards them as different characters?

Whether they're different glyphs or not depends on the font.


When you press the "E" key on a US keyboard and "e" comes out, do you return the keyboard because it's broken? If not, then you know what definition I'm using even if I misnamed it.


> It’s enormously convenient. An order of magnitude more convenient than the edge cases it causes me.

Can you elaborate on this?


Every single time I type a path or filename (or server name) in the shell, or in Windows explorer, or in a file -> open or save dialog, I don't trip over capitalization. If I want to glob files with an 'ecks' in the name I can write *x* and not have to do it twice for *x* and *X*.

When I look at a directory listing and it has "XF86Config", I read it in my head as "ecks eff eight six config" not "caps X caps F num eight num six initial cap Config" and I can type what I read and don't have to double-check if it's config or Config.

Tab completion works if I type x<tab> instead of blanking on me and making me double check and type X<tab>.

Case sensitivity is like walking down a corridor and someone hitting you to a stop every few steps and saying "you're walking Left Right Left Right but you should be walking Right Left Right Left".

Case insensitivity is like walking down a corridor.

In PowerShell, some cmdlets are named like Add-VpnConnection where the initialism drops to lowercase after the first letter, others like Get-VMCheckpoint where the initialism stays capitalised, others mixed like Add-NetIPHttpsCertBinding where IP is caps but HTTPS isn't - any capitalisation works for running them or searching them with get-command or tab-completing them. I don't have to care. I don't have to memorise it, type it, pay attention to it, trip over it, I don't have to care!.

"A programming language is low level when its programs require attention to the irrelevant." - Alan Perlis.

DNS names - ping GOOGLE.COM works, HTTPS://NEWS.YCOMBINATOR.COM works in a browser, MAC addresses are rendered with caps or lowercase hex on different devices, so are IPv6 addresses in hex format, email addresses - firstname.lastname or Firstname.Lastname is likely to work. File and directory access behaving the same means it's less bother. In Vim I :set ignorecase.

In PowerShell even string equality check is case insensitive by default, string match and split too. When I'm doing something like searching a log I want to see the english word 'error' if it's 'error' or 'ERROR' or 'Error' and I don't know what it is.

If I say the name of a document to a person I don't spell out the capitalisation. I don't want to have to do that to the computer, especially because there is almost no reason to have "Internal site 2 Network Diagram" and "INTERNAL site 2 network diagram" and "internal site 2 NETWORK DIAGRAM" in the same folder (and if there were, I couldn't easily keep them apart in my head).

All the time in command prompt shell, I press shift less often, type less, change directories and work with files more smoothly with less tripping over hurdles and being forced to stop and doublecheck what I'm tripping over when I read "word" and typed "word" and it didn't work.

On the other hand, the edge cases it causes me are ... well, I can't think of any because I don't want to put many files differing only by case in one directory. Maybe uncompressing an archive which has two files which clash? I can't remember that happening. Maybe moving a script to a case sensitive system? I don't do that often. In PowerShell, method calls are case insensitive. C# has "string".StartsWith() and JavaScript has .startsWith() and PowerShell will take .startswith() or .StartsWith or .Startswith or anything else. That occasionally clashes if there's a class with the same name in different case but that's rare, even.

In short, the computer pays attention to trivia so I don't have to. That's the right way round. It's about the best/simplest implementation of Do What I Mean (DWIM) that's almost always correct and almost never wrong.


> If I want to glob files with an 'ecks' in the name I can write *x* and not have to do it twice for *x* and *X*.

Adding

  shopt -s nocaseglob
to ~/.bashrc makes globbing case-insensitive in bash[1].

> Tab completion works if I type x<tab> instead of blanking on me and making me double check and type X<tab>.

Adding

  set completion-ignore-case on
to ~/.inputrc makes completion case-insensitive in bash (and other programs that use libreadline)[2].

Both options are independent of file system case-sensitivity.

[1] https://www.gnu.org/software/bash/manual/html_node/The-Shopt...

[2] https://tiswww.cwru.edu/php/chet/readline/readline.html#inde...
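The same idea carries over to any program that does its own matching. As a sketch (the `glob_ignore_case` helper and the sample listing are hypothetical), a case-insensitive glob can be layered over a plain list of names in Python without the filesystem being involved at all:

```python
import fnmatch
import re

def glob_ignore_case(names, pattern):
    """Return the names matching a shell-style pattern, ignoring case."""
    # fnmatch.translate turns a glob into a regular expression string;
    # compiling it with re.IGNORECASE makes the match case-insensitive
    # regardless of platform.
    regex = re.compile(fnmatch.translate(pattern), re.IGNORECASE)
    return [n for n in names if regex.match(n)]

listing = ["XF86Config", "xorg.conf", "README"]
matches = glob_ignore_case(listing, "*x*")
print(matches)
```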


> Both options are independent of file system case-sensitivity.

In Windows world it works everywhere, in any win32 program - file open dialogs, et al. Here you have to have it built in to every tool. (and windows doesn't do it at the filesystem layer)


None of these are the filesystem though, they are all abstractions over the file system that could easily implement case insensitivity, and as a sibling comment pointed out, actually do in many cases. I'm perfectly fine with the idea of interacting with files using a case insensitive interface. I just don't feel like it should be the job of the filesystem to enforce case insensitivity.


Complicated for who? I've little pity for developers and kernels ease of life as a user.


> Casing is usually not meaningful even in written language. "Hi, how are you?"

How about: “pay bill” vs “pay Bill”?

“Usually” in the context of automated systems design is a recipe for disaster.

Computers store bytes, not characters that may just happen to mean similar things. Shall we merge ümlauts? How to handle ß?


Case Preserving and Case Sensitive are subtly two different things. Most case insensitive file systems are case preserving and whatever the UTF8 equivalent is I forget the name.


But the gps point is that assuming you know the semantic meaning of the case and if retention is enough is silly.

Assuming case insensitivity is bizarre.


heh, I especially enjoy that in a huge thread about how capitalization does and doesn't matter, "gps point" was not, in fact, concerning some coordinates of the global positioning system but rather "GP's point". I first chalked it up to some autocomplete artifact but then realized what was actually happening


Perfect is the enemy of good. It is quite acceptable to streamline the easy cases now and the hard cases later or never.


> Casing is usually not meaningful even in written language. "Hi, how are you?" means the same thing as "hi, how are you?" Uppercase changes meaning only when distinguishing between proper and common nouns, which is rarely a concern we have with file names anyway.

The number of spaces is usually not meaningful in written language. "Hi, how are you?" means the same thing as "Hi, how are you ?". I don't think it's a good reason to make file system ignore space characters.


No offense, but I think that's a very western-centric view. Your example only makes sense when the user is familiar with English (or other western languages, I guess). To me personally, I find it strange that "D.txt" and "d.txt" means the same file, since they are two very different characters. Likewise, I think you would also go crazy if I told you "ア.txt" and "あ.txt" mean the same file (these are katakana and hiragana for A respectively, which in a sense is equivalent to uppercase and lowercase in Japanese), or that "一.txt" and "壹.txt" mean the same file (both mean the number 1 in Chinese; we literally call the latter an "uppercase number").


Agreed, and you could even take this into "1.txt" being the same as "One.txt". Which, I mean, fair that I would expect a speech system to find either if I speak "One dot t x t". But, it would also find "Won.txt" and trying to bridge the phonetic to the symbolic is going to be obviously fraught with trouble.


> To me personally, I find it strange that "D.txt" and "d.txt" means the same file, since they are two very different characters.

As a native English speaker, I agree with this.


Those are all the same, I don’t see an issue


What if Unicode updates some capitalization rules in the next version, and after an OS update some filenames now collide and one of them is inaccessible?


If someone says they sent you "Book Draft 1.docx" and you check your email to find "Ⓑⓞⓞⓚ Ⓓⓡⓐⓕⓣ ①.ⓓⓞⓒⓧ", "฿ØØ₭ ĐⱤ₳₣₮ 1.ĐØ₵Ӿ" - these are different files.


I have a feeling you enjoyed that character set lookup. I know I did seeing it.


Ages ago on Flowdock at work (a chat webapp kind of like Slack that no longer exists), I used the circle ones for a short time as my nickname, and no one could @ me.


File systems are not user interfaces. They are interfaces between programs and storage. Case insensitive is much better for programs.

The user shell can choose however it wants to handle file names, a case sensitive file system does not prevent the shell from handling file names case insensitively.
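As a minimal sketch of that split (the function names and sample entries are invented for illustration): the store keeps exact names, and a thin lookup layer above it folds case and reports when a typed name is ambiguous.

```python
from collections import defaultdict

def build_index(names):
    """Map each case-folded name to the exact stored names that fold to it."""
    index = defaultdict(list)
    for name in names:
        index[name.casefold()].append(name)
    return index

def resolve(index, typed):
    """Return every stored name matching what the user typed, ignoring case."""
    return index.get(typed.casefold(), [])

stored = ["Projects", "Packages", "cache", "Cache"]  # exact, case-sensitive names
index = build_index(stored)

print(resolve(index, "projects"))  # one match: safe to open directly
print(resolve(index, "CACHE"))     # two matches: the shell has to ask which one
print(resolve(index, "missing"))   # no match
```

The ambiguity case is the part a case-insensitive filesystem cannot represent at all; here it is just a list with more than one entry, and the UI decides what to do with it.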


> case insensitive is much better for programs

Can’t edit my comment. I mean case sensitive is better for programs, of course.


> Why? Windows is also not case-sensitive, so it's not like there's a near-universal convention that S3 is ignoring.

Not sure why what Windows does is relevant to this, honestly. Personally, I strongly prefer case sensitivity with filenames, but the lack of it isn't a dealbreaker or anything.


What are some of the advantages of case sensitivity? Are you saying you actually want to save "Book draft 1.docx" and "Book Draft 1.docx" as two separate files? That just sounds like asking for trouble.


The advantages that I value are that case sensitivity means I can use shorter filenames, it makes it easier to generate programmatic filenames, and I can use case to help in organizing my files.

> Are you saying you actually want to save "Book draft 1.docx" and "Book Draft 1.docx" as two separate files?

That's a situation where sensitivity can cause difficulty, yes, but for me personally, that's a minor confusion that is easy to avoid or correct. Everything is a tradeoff, and for me, putting up with that annoyance is well worth the benefits of case sensitivity.

I do totally understand that others will have different tradeoffs that fit them better. I'm not taking away from that at all. But saying "case sensitivity is undesirable" in a broad sense is no more accurate than saying "case sensitivity is desirable" in a broad sense.

Personally, I think the ideal tradeoff is for the filesystem to be case sensitive, but have the user interfaces to that file system be able to make everything behave as case-insensitive if that's what the user prefers.


Even with only one case, just four characters is enough for a million files. How much benefit are you really getting from case sensitivity?
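The arithmetic, assuming the alphabet is letters plus digits (my assumption; the comment doesn't say which characters it has in mind):

```python
# Number of distinct 4-character names for a few alphabets.
letters_one_case = 26 ** 4   # a-z only
alnum_one_case = 36 ** 4     # a-z plus 0-9
alnum_two_cases = 62 ** 4    # a-z, A-Z and 0-9

print(letters_one_case)   # 456976
print(alnum_one_case)     # 1679616
print(alnum_two_cases)    # 14776336
```

So the "million" holds with digits included, and case sensitivity buys roughly a factor of nine at that length.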


Unicode case folding is a complicated algorithm, and its definition is subject to change with updated Unicode versions. It's nice not to have to worry about that.


Okay, but I don't think this has anything to do with the use case JohnFen mentioned or my questions about it.

If your goal is super easy filename generation then you're probably not going to leave ASCII.

And if you do go beyond ASCII for filename packing/generating, then you should instead use many thousands of CJK characters that don't have any concept of case at all. Bypass the question of case sensitivity entirely.


Enough that I prefer it. If that were the only advantage, I'd only slightly prefer it. But being able to use case as a differentiator in filenames intended for me to read is something I find even more valuable.

A filesystem not being case sensitive isn't a dealbreaker or anything. I just prefer case sensitivity because it increases flexibility and readability for me, and has no downsides that I consider significant.


Also note that 'are these 2 words case insensitively equal' is impossible without knowing what locale rules to apply. And given that people's personal names tend to have the property that any locale rules that must be applied are _the locale that their name originates from_, and that no repository of names I am aware of stores locale along with the name, that means what you want, is impossible.

In line with case insensitivity, do you think `müller` and `muller` should boil down to for example the same username for login purposes?

That's... tricky. In German, the standard way to transliterate names to strict ASCII would be to turn `müller` into `mueller`. In Swiss German that is in fact mandatory. Nobody in Switzerland is named `müller` but you'll find loads of `mueller`s. Except.. there _are_ `müller` in Switzerland - probably German citizens living there.

So, just normalize `ü` to `ue`, easy, right? Except that one doesn't reverse all that well, but that's probably all right. But - no. In other locales, the asciification of `ü` is not `ue`. For example, `Sjögren` is Swedish and that transliterates to `sjogren`, not `sjoegren`.

Bringing it back to casing: Given the string `IJSSELMEER`, if I want to title case that, the correct output is presumably `IJsselmeer`. Yes, that's an intentional capital I capital J. Because it's a dutch word and that's how it goes. In an optimal world, there is a separate unicode glyph for the dutch IJ as a single letter so we can stick with the simple rule of 'to title case a string, upper case the first glyph and lowercase all others, until you see a space glyph, in which case, uppercase the next'. But the dutch were using computers fairly early on and went with using the I and the J (plain ascii) for this stuff.

And then we get into well trodden ground: In turkish, there is both a dotted and a dotless i. For... reasons they use plain jane ascii `i` for lowercase dotted i and plain jane ascii `I` for uppercase dotless I. But they have fancy non-ascii unicode glyphs for 'dotted capital I' and 'dotless lowercase i'.

So, __in turkish__, `IZMIR` is not case-insensitive equal to `izmir`. Instead, `İZMİR` and `izmir` are equal.

I don't know how to solve this without either bringing in hard AI (as in, a system that recognizes 'müller' as a common german surname and treats it as equal to 'mueller', but it would not treat `xyzmü` equal to `xyzmue` - and treats IZMIR as not equal to izmir, because it recognizes it as the name of a major turkish city and thus applies turkish locale rules), or decreeing to the internet: "get lost with your fancypants non-US/UKian weird word stuff. Fix your language or something" - which, well, most cultures aren't going to like.

'files are case sensitive' sidesteps alllllll of this.
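To make the Turkish case concrete, here is a sketch in Python. The `lower_tr` and `upper_tr` helpers are hypothetical and only patch the two dotted/dotless pairs on top of Python's locale-independent defaults; real locale-aware casing would normally come from a library such as ICU.

```python
def lower_tr(s: str) -> str:
    """Lowercase using Turkish rules for I/ı and İ/i (illustrative helper)."""
    # I (U+0049) -> ı (U+0131), İ (U+0130) -> i (U+0069), then default rules.
    return s.replace("I", "\u0131").replace("\u0130", "i").lower()

def upper_tr(s: str) -> str:
    """Uppercase using Turkish rules for i/İ and ı/I (illustrative helper)."""
    # i (U+0069) -> İ (U+0130), ı (U+0131) -> I (U+0049), then default rules.
    return s.replace("i", "\u0130").replace("\u0131", "I").upper()

default_lower = "IZMIR".lower()    # "izmir" under the default rules
turkish_lower = lower_tr("IZMIR")  # "ızmır": dotless, so not equal to "izmir"
turkish_upper = upper_tr("izmir")  # "İZMİR": dotted capitals

print(default_lower, turkish_lower, turkish_upper)
print(lower_tr(turkish_upper) == "izmir")
```

The same two strings are equal or unequal depending on which function you call, and nothing in the strings themselves tells you which one is right.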


> you don't say, "Hey! I think you sent me the wrong file!"

You do! Why not?

It's a big trap. A lot of counterfeit, spam, phishing etc go by this method. You end up buying a fake brand or getting tricked.


> Why?

Because it introduces extra complexity.

Now, "Cache" and "cache" are the same, but also...different because you'd care if Cache suddenly became cache.


> Why? Windows is also not case-sensitive, so it's not like there's a near-universal convention that S3 is ignoring.

You can enable case sensitivity for directories or disks, but this is usually done for special cases, like git repos


Yeah, but that little bit of user friendliness ruins the file system for file system things. Now you need “registries” and other, secondary file systems to do file system things because you can’t even use base64 in file names. Make your file browsing app case insensitive, if that’s what you want. Don’t build inferiority down to the core.
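A small Python sketch of the base64 problem. The two byte strings are picked by me so that their encodings differ only in case; base32 is shown as the usual single-case alternative, at the cost of longer names.

```python
import base64

# Two different payloads whose base64 forms differ only by letter case.
a = base64.b64decode("AAAA")   # three zero bytes
b = base64.b64decode("aaaa")   # three unrelated bytes

name_a = base64.urlsafe_b64encode(a).decode("ascii")
name_b = base64.urlsafe_b64encode(b).decode("ascii")

# On a case-insensitive filesystem these two names refer to the same file.
collide_b64 = name_a.casefold() == name_b.casefold()

# Base32 uses a single-case alphabet, so folding case cannot merge two names.
safe_a = base64.b32encode(a).decode("ascii")
safe_b = base64.b32encode(b).decode("ascii")
collide_b32 = safe_a.casefold() == safe_b.casefold()

print(name_a, name_b, collide_b64)
print(safe_a, safe_b, collide_b32)
```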


Then why don’t you just always write in lower case?


I agree 100%.

From a technical implementation pov 'A' & 'a' are well established as different characters (ASCII, Unicode, etc). Regardless of personal preference, I don't understand how a developer or sysadmin can be surprised and even frustrated that a file system is case sensitive.

The developer is still free to abstract this away for the end user when it makes sense such as search results


Author here. There's no complaint. It's an observation rather than an absolute good or bad. It's something you have to consider in designing your application.


> That's how it should be

Why exactly? I'm not aware of any benefits of filenames being case-sensitive; it just opens the door to tons of very common mistakes that literally can't happen otherwise. It's not like in coding, where it helps enforce the code style and thus aids readability - and even in programming it was a PITA source of bugs before IDEs became smart enough to catch typos in var names. One thing I loved most in Pascal is that it didn't care about case, unlike C.


A case-insensitive comparison needs a locale as input in order to apply the correct case conversion rules.

The most common example is probably that i (U+0069 LATIN SMALL LETTER I) and I (U+0049 LATIN CAPITAL LETTER I) transform into each other in most locales, but not all. In locales az and tr (the Turkic languages), i uppercases to İ (U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE), and I lowercases to ı (U+0131 LATIN SMALL LETTER DOTLESS I).

Case insensitivity is all fine if you only handle text that consists of A-Za-z, but as soon as you want to write software that works for all languages it becomes a mess.
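For reference, this is what the default, locale-independent mappings do with those four code points in Python (whose `str.upper` and `str.lower` take no locale). The `describe` helper is only there to print code point names.

```python
import unicodedata

def describe(s: str) -> list:
    """List each code point in s as 'U+XXXX NAME'."""
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in s]

upper_i = describe("i".upper())                # default: plain capital I
lower_cap_i = describe("I".lower())            # default: plain small i
upper_dotless = describe("\u0131".upper())     # dotless ı also uppercases to plain I
lower_dotted_cap = describe("\u0130".lower())  # İ lowercases to i plus a combining dot

for row in (upper_i, lower_cap_i, upper_dotless, lower_dotted_cap):
    print(row)

# Under the default rules, dotless ı does not survive an upper/lower round trip.
round_trip_ok = "\u0131".upper().lower() == "\u0131"
print(round_trip_ok)
```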


This is the main point, and almost all the other chatter is not particularly relevant. A dumb computer and a human can agree with "files are case sensitive and sometimes that's a bit weird but computers are weird sometimes". If there was indeed exactly one universal way to have case insensitivity it would be OK. Case insensitive file systems date from when there was. Everything was English and case folding in English is easy. Problem solved. But that doesn't work today. And having multiple case folding rules is essentially as unsolvable a problem as the problems that arise from case sensitivity, except they're harder for humans to understand, including programmers.

Simple and wrong is better than complicated and wrong and also the wrong is shoved under the carpet until it isn't.

Though you still ought to declare a Unicode normalization on the file system. Which would be perfectly fine if it weren't for backwards compatibility.


Minor nitpick: case-insensitive comparison is a separate problem from case conversion, and IIRC a little simpler. Still locale-specific.


Except at the UI layer (where you can easily offer suggestions and do fuzzy search), the opposite is true. There are so many different ways to do case-insensitive string comparisons, and it's so easy to forget to do that in one place, that case-insensitivity just leads to a ton of bugs (some of which will be security critical).

For example, did you know that Microsoft SQL Server treats the columns IS_ADMIN and is_admin as either the same or two different columns depending on the database locale (because e.g. Turkish distinguishes between i and I)? That's at least a potential security bug right there.


macOS is case preserving, though. To me, it’s the best of both worlds. You can stylize your file names and they will be respected, but you don’t have to remember how they are stylized when you are searching for or processing them because search is case insensitive.


Windows is also case-insensitive but case-preserving


IMO this is the worst possible solution, as what you are seeing is not what you are getting. You do not actually know what is being stored on the file system, and your searches are fuzzy rather than precise.


> You do not actually know what is being stored on the file system

This makes no sense to me. Did the user's file explorer (whether GUI or via commands like `ls`) suddenly disappear?


Maybe macOS is case-preserving, but it's not encoding-preserving. If you create a file using a composed UTF-8 "A", the filesystem layer will decompose the string to another form, and create the filename using that decomposed form "B". Of course, "A" and "B" when compared can be completely different (even when compared using case insensitivity enabled), yet will point to the same file.

More info here: https://eclecticlight.co/2021/05/08/explainer-unicode-normal...
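To make the composed versus decomposed distinction concrete, here is a small sketch using Python's standard `unicodedata` module. It uses plain NFD as a stand-in; the exact normalization variant a given macOS filesystem applies is not something this example asserts. `same_name` is a hypothetical helper.

```python
import unicodedata

composed = "\u00e4"      # "ä" as a single code point (NFC form)
decomposed = "a\u0308"   # "a" followed by a combining diaeresis (NFD form)

def same_name(a: str, b: str) -> bool:
    """Compare two names after normalizing both to NFD."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# The raw strings differ, both as code points and as UTF-8 bytes.
raw_equal = composed == decomposed
byte_lengths = (len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))
```

A byte-for-byte comparison treats these as two different names, while a normalizing layer treats them as one.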


macOS (Darwin) has always written filenames as NFD via the macOS APIs. The underlying POSIX-ish APIs may not do NFD, but Finder and every native macOS GUI program gets files in NFD format.


And this is different from what I wrote how exactly?

Btw, this has nothing to do with POSIX vs Finder, it's a filesystem driver trait, at least for HFS+, but probably for APFS as well.


macOS has case sensitivity. It's just off by default and is a major pain to turn on. You have to either reinstall from scratch onto a case-sensitive partition, or change the "com.apple.backupd.VolumeIsCaseSensitive" xattr from 0 to 1 in a Time Machine backup of your whole system and then restore everything from it.


You shouldn't do this if you value things working, though-- this is a pretty rare configuration (you have to go way out of your way to get it), so many developers won't test with it and it's not unheard of for applications to break on case-sensitive filesystems.

If you absolutely need case-sensitivity for a specific application or a specific project, it's worth seeing if you can do what you need to do within a case-sensitive disk image. It may not work for every use-case where you might need a case-sensitive FS, but if it does work for you, it avoids the need to reinstall to make the switch to a case-sensitive FS, and should keep most applications from misbehaving because the root FS is case-sensitive.


Most things work fine, but it will break (or at least did break at one point) Steam, Unreal Engine, Microsoft OneDrive, and Adobe Creative Cloud. I'm rather surprised about the first two, since they both support Linux with case-sensitive filesystems. I took the opposite approach as you, though: making my root filesystem case-sensitive and creating a case-insensitive disk image if I ever needed those broken programs.


I keep a case sensitive volume around to checkout code repositories into. For everything else I prefer it insensitive, but my code is being deployed to a case sensitive fs.


I just mount a case sensitive Apple File System disk image at ~/code, works well


UNIX is one of the few OSes that went down that path.

Others do offer the option if one is so inclined, and also prepared to deal with legacy software that expects otherwise.

Which is also the case with macOS, because although it is a UNIX, OS X had to cater to the Mac OS developer community used to HFS and HFS+.


Curiously, iOS and iPadOS file systems are case-sensitive. There's less legacy there, so they opted to do the correct thing.


Please don't call this "the correct thing". Please recognize that there are multiple, valid, points of view. What you meant is "the thing I like".


It's not "the thing I like", it's the better tradeoff. It's less complex and thus more secure (due to reduced API surface and fewer opportunities to make lookup mistakes or to mistakenly choose the wrong out of dozens of kinds of case-insensitive comparison in a security decision). It's also potentially faster, and more compatible with other Unixes.


It would be incredibly weird not to be able to store urlsafe-base64 encoded paths on S3.
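A quick illustration of why base64 keys need a case-sensitive store, using only Python's standard `base64` module. The two key strings are picked for the example.

```python
import base64

upper_key = "AAAA"
lower_key = "aaaa"

# Both strings are valid urlsafe base64, but they decode to different bytes.
decoded_upper = base64.urlsafe_b64decode(upper_key)
decoded_lower = base64.urlsafe_b64decode(lower_key)

# A case-insensitive store would map both keys to the same object.
collide_when_folded = upper_key.lower() == lower_key.lower()
```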


TIL. I've switched to Mac about a year ago and it's sufficiently Linux-like to take such things for granted. I wonder what other surprises are going to bite me in the future.


Sufficiently BSD-like :)


You can format disks in MacOS to be case sensitive.


Case sensitive filesystems are a mistake.


Case insensitive is how humans think about names. “John” and “New York” are the same identifiers as “john” and “new york”. It would be pretty weird if someone insisted that their passport is invalid because the name is printed in all caps and that’s not their preferred spelling.

IMO the best thing would be to call Unix-style case-sensitive file names something else. But it’s obviously too late for that.


The word “Turkey” is not the same as “turkey”, “August” is not the same as “august”, and “Muse” is not the same as “muse”. https://en.m.wikipedia.org/wiki/Capitonym


And “polish” and “Polish” are not even pronounced the same.


They might be at the beginning of a sentence (depends on the reason for capitalization).

It’s more like identifier reuse, on a case insensitive “system”.

“John” isn’t the same as “John” if I’m talking about two separate Johns.


Yet "TURKEY" is not a separate word from "Turkey" and "turkey". Ultimately context disambiguates these words, not capitalization.


Humans will also treat "Jyväskylä", "Jyvaskyla" and "Jyvaeskylae" as the same identifiers but I don't think that's a good basis for file storage to have those be the same filenames.


In the era of Unicode, this battle is pretty much lost. Several different code point sequences can produce the glyph 'ä', and user input can contain any of these. You need to normalize anyway.


And macOS does that normalization at the filesystem level.


Passport offices care and may object.


Agreed. I think case sensitivity in Unix filesystems is actually a pretty poor design decision. It prioritizes what is convenient for the computer (easy to compare file paths) over what makes sense for the user (treating file paths the same way human intuition does).


In German there is a lowercase letter ß. It is actually a ligature of the letters s and z. It traditionally has no uppercase variant, because no word begins with it. One example word is Straße. If you write that all in uppercase, it technically becomes STRASZE, although you almost always see STRASSE. But if you write it all in lowercase without substituting SS with ß, you are making a mistake. And although Switzerland is a German-speaking country, it has different spelling rules and rarely, if ever, uses ß.

This is just one of many cases where case-insensitivity would give more trouble than it's worth. And others pointed out similar cases with the Turkish language in this post.
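The ß round trip can be checked with Python's built-in string methods. This is a sketch of the default Unicode mappings only; `loosely_equal` is a hypothetical helper name.

```python
word = "Straße"

upper = word.upper()         # ß expands to "SS" under the default mapping
round_trip = upper.lower()   # lowercasing cannot bring the ß back

def loosely_equal(a: str, b: str) -> bool:
    """Caseless comparison via casefold(), which maps ß to 'ss'."""
    return a.casefold() == b.casefold()
```

So uppercasing and then lowercasing is lossy, and a caseless match has to decide that "Straße" and "STRASSE" are the same name even though they differ in length.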


But the thing is that the file system doesn't need to be case-insensitive for your system to support human intuition! As others have said, people don't look at and use filesystems, they use programs that interface with the filesystem. You can absolutely have a case-sensitive system that nonetheless lets you search files in a case-insensitive manner, for example. After all, to make searches efficient, you might want to index your file structure, and while doing that, you might as well also have a normalised file name within the index you search against.
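One way to picture that split is a case-sensitive store with a separate search index keyed by a normalized name. This is a sketch only: `search_key` and `NameIndex` are hypothetical names, and NFC plus `casefold()` is just one possible choice of normalization.

```python
import unicodedata
from collections import defaultdict

def search_key(name: str) -> str:
    """Normalized form used only for searching, never for storage."""
    return unicodedata.normalize("NFC", name).casefold()

class NameIndex:
    """Maps a normalized key to every exact name stored under it."""

    def __init__(self):
        self._by_key = defaultdict(set)

    def add(self, name: str) -> None:
        self._by_key[search_key(name)].add(name)

    def find(self, query: str) -> list:
        # .get avoids creating empty entries for misses
        return sorted(self._by_key.get(search_key(query), ()))

index = NameIndex()
for stored in ["Book Draft 1.docx", "book draft 1.docx", "Notes.txt"]:
    index.add(stored)
```

Here `index.find("BOOK DRAFT 1.DOCX")` returns both exact names, so the UI can offer them as suggestions while the storage layer keeps them distinct.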

Now, as you said, UNIX made the choice that's easier for computers. And for computers, case-insensitive filesystems would be worse. There are things that are definitely strange about UNIX filesystems (who doesn't love linefeeds in file names!?), but case-sensitivity is not one of them.


I don't know if that's right. The most obvious way two characters can be the same is if they actually look exactly the same i.e. are homoglyphs https://en.wikipedia.org/wiki/Homoglyph

But no filesystem I am aware of is actually homoglyph insensitive.

Case insensitive filesystems picked one arbitrary form of intuition (and not even the most obvious one) in one language (English) and baked that into the OS at a somewhat deep level.

You say "human intuition" - are those using different writing systems nonhuman then?


Except that is not true, it is sometimes convenient, and sometimes very inconvenient and not wanted. My reasoning for file systems that are case sensitive is the following:

1. Some people want file systems to be case sensitive.

2. Case sensitivity is easier to implement. This is very much not a trivial thing. Case insensitivity only really makes sense for ASCII.

In the camp of wanting case insensitivity:

1. Some people want file systems to be case insensitive.

There is more in favor of case sensitivity.


But end users do not speak to filesystems.

Programs speak to filesystems.


The case sensitivity one is easy, here's a thing that's more likely to be entirely unintuitive:

S3 paths are fake. Yes, it accepts uploads to "/builds/1/installer.exe", and yes, you can list what's in /builds, but all of that is a simulation. What you actually did was to upload a file literally named '/builds/1/installer.exe' with the '/' as part of the name.

So, "/builds/1//installer.exe" and "/builds//1/installer.exe" are also possible to upload and entirely different files. Because it's the name of a key, there's no actual directories.
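The "simulation" can be sketched in a few lines of plain Python. This is not the S3 API, just an in-memory model of listing a flat set of keys with a prefix and a delimiter; `list_keys` is a hypothetical function name.

```python
def list_keys(keys, prefix="", delimiter="/"):
    """Simulate a prefix/delimiter listing over a flat set of keys.

    Returns (contents, common_prefixes): keys directly "under" the prefix,
    and the rolled-up pseudo-directories one level below it.
    """
    contents, common = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        cut = rest.find(delimiter) if delimiter else -1
        if cut == -1:
            contents.append(key)
        else:
            # Everything up to and including the next delimiter is grouped.
            common.add(prefix + rest[:cut + len(delimiter)])
    return contents, sorted(common)

keys = {
    "/builds/1/installer.exe",
    "/builds/1//installer.exe",
    "/builds//1/installer.exe",
}
top_level = list_keys(keys, prefix="/builds/")
inside_one = list_keys(keys, prefix="/builds/1/")
```

Listing `/builds/` shows two "folders", `/builds//` and `/builds/1/`, and listing `/builds/1/` shows one object plus a "folder" named `/builds/1//`. Nothing is collapsed, because the store only ever compares strings.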


You're right, unless you use the new S3 "Directory buckets" [1], which make the entire thing even more confusing!

[1] https://docs.aws.amazon.com/AmazonS3/latest/userguide/direct...


and which also, according to the command docs I read recently, add random limitations to maybe half of the S3 API

I'm not sure reusing (parts of) the protocol was a good idea, given that an unrelated half of it was already legacy and discouraged (e.g. the insane permission model with policies and ACLs) or, even when not, let's say... weird and messy.


also, don't overlook that "/" is only the default path delimiting character; one is free to use your favorite other character if you need a filename with a "/" in it: https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObje...


Aside from resolution of paths to some canonical version (e.g. collapsing redundant /'s in your example), what actually is an "actual directory" other than a prefix?


An actual directory knows its direct children.

An actual directory A with 2 child directories A/B and A/C, each of which with 1 million files, doesn't need to go through millions of entries in an index to figure out the names of B and C.

It's also not possible for A and A/ to exist independently, and for A to be a PDF, A/ an MP4 and A/B a JPG under the A/ "directory". All of which are possible, simultaneously, with S3.
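A sketch of what goes wrong when such keys are exported to a real directory tree. `export_conflicts` is a hypothetical helper, and it only checks the two clashes described above (a key that is also a prefix of another key, and a key ending in the separator).

```python
def export_conflicts(keys):
    """Return keys that cannot coexist as plain files in a directory tree."""
    keyset = set(keys)
    conflicts = set()
    for key in keyset:
        if key.endswith("/"):
            # A "file" whose name ends with the path separator.
            conflicts.add(key)
        parts = key.split("/")
        for i in range(1, len(parts)):
            ancestor = "/".join(parts[:i])
            if ancestor in keyset:
                # The ancestor would have to be a file and a directory at once.
                conflicts.add(ancestor)
    return sorted(conflicts)

clashing = export_conflicts(["A", "A/", "A/B"])
clean = export_conflicts(["a/b", "a/c"])
```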


Just to add, "directory" is something you use to look-up information.

The thing is called "directory" exactly because you go there to look what its children are.


An actual directory is a node in a tree. It may be the child of a parent and it may have siblings - its children may in turn be parents with children of their own.


Yeah, the prefix thing is a source of so many bugs. I get why AWS took that approach and it’s actually a really smart approach but it still catches so many developers out.

Just this year our production system was hit by a weird bug that took 5 people to find. Turned out the issue was an object literally just named “/” and the software was trying to treat it as a path rather than a file.


Happened at my job yesterday.


I can't trust myself using S3 (or any other AWS service). Nothing is straightforward, there are too many things going on, too much documentation that I should read, and even then (as OP shows) I may accidentally and unknowingly expose everything to the world.

I think I'll stick to actually simple services, such as Hetzner Storage Boxes or DigitalOcean Spaces.


I like digital ocean spaces, but it has its own annoying quirks.

Like I recently found out if you pipe a video file larger than a few MB, it’ll drop the https:// from the returned Location. So on every file upload I have to check if the location starts with https, and add it on if it’s not there.

Of course the S3 node client GitHub issue says “sounds like a digital ocean bug”, and the digital ocean forums say “sounds like an S3 node client bug” lol


The way that DO handles secrets should scare anyone. Did you know that if you use their Container Registry and set it up so that your K8S automatically has access to it, their service will create a secret that has full access to your Spaces?


Hum... Kubernetes is not on the GP's list...


Fair enough, but not having scoped secrets is a red flag.


> Nothing is straightforward, there are too many things going on, too much documentation that I should read, and even then (as OP shows) I may be accidentally and unknowingly expose everything to the world.

I took a break from cloud development for a couple of years (working mostly on client stuff) and just recently got back. I am shocked at the amount of complexity built over the years along with the cognitive load required for someone to build an ironclad solution in the public cloud. So many features and quirks which were originally designed to help some fringe scenario are now part of the regular protocol, so that the business makes sure nobody is turned away.


Here is a good one: deleting billions of objects can be expensive if you call delete APIs.

However you can set a wildcard or bucket wide object expiry of time=now for free. You’ll immediately stop being charged for storage, and AWS will manage making sure everything is deleted.
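A sketch of what such a bucket-wide expiry rule could look like as a lifecycle configuration. The function below only builds the dictionary; the rule ID and bucket name are made up for the example, and one day is used because, as far as I know, that is the smallest value the `Days` field accepts.

```python
def expire_everything_config(days: int = 1) -> dict:
    """Build a lifecycle configuration that expires every object in a bucket."""
    if days < 1:
        raise ValueError("Days must be a positive integer")
    return {
        "Rules": [
            {
                "ID": "expire-all-objects",   # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},     # empty prefix matches every key
                "Expiration": {"Days": days},
            }
        ]
    }

config = expire_everything_config()

# Applying it with boto3 would look roughly like this (not executed here):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket", LifecycleConfiguration=config)
```

On a versioned bucket this rule alone would not remove noncurrent versions, so treat it as a starting point rather than a complete cleanup policy.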


Nit: the delete call is free, it’s the list call to get the objects that costs money. In theory if you know what objects you have from another source it’s free.


> You’ll immediately stop being charged for storage

The effect of lifecycle rules is not immediate: they get applied in a once-per-day batch job, so the removal is not immediate.


That's true but OP clearly knows that already. You stop getting charged for storage as soon as the object is marked for expiration, not when the object is finally removed at AWS's leisure. You can check the expiration status of objects in the metadata panel of the S3 web console.

https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecy...

> There may be a delay between the expiration date and the date at which Amazon S3 removes an object. You are not charged for expiration or the storage time associated with an object that has expired.


Because AWS gets to choose when the actual deletes happen. The metadata is marked as object deleted but AWS can process the delete off peak. It also avoids the s3 api server being hammered with qps


An explicit delete operation can mark metadata the same way.

The real difference is likely more along the lines of LSM compaction, with expiry they likely use a moral equivalent of https://github.com/facebook/rocksdb/wiki/Compaction-Filter to actually do the deletion.


That’s called a delete marker in s3. Less IO than a true delete but also still can lead to high qps to s3 api server. https://docs.aws.amazon.com/AmazonS3/latest/userguide/Delete...


That bit on failed multipart uploads invisibly sticking around (and incurring storage cost, unless you explicitly specify some lifecycle magic)... just ugh.

And I thought one of the S-es was for 'simple'.


Yes, that sucks. Blame ahenry@, then GM for S3.

My proposal was that parts of incomplete uploads would stick around for only 24 hours after the most recent activity on the upload, and you wouldn't be charged for storage during that time. ahenry@ vetoed that.


Why would you propose something that makes company earn less money? I'm sure that at Amazon scale, this misfeature earned millions of dollars.


Customer relationships. I recall a Bezos quote along the lines of "It's better to lose a refund than to lose a customer".


this one has cost us many thousands of dollars.

we had a cron script on a very old server running for almost a decade, starting a multipart upload every night, pushing what was supposed to be backups to a bucket that also stored user-uploaded content so it was normal that the bucket grew in size by a bit every day. the script was 'not working' so we never relied on the backup data it was supposed to be pushing, never saw the files in s3, the bucket grew at a steady and not unreasonable pace. and then this spring i discovered that we were storing almost 3TB of incomplete multipart uploads.
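For anyone wanting to check their own buckets, here is a rough sketch that adds up the bytes held by unfinished multipart uploads. It assumes a client object with boto3-style `list_multipart_uploads` and `list_parts` methods and their usual pagination fields; the function name `incomplete_upload_bytes` is hypothetical, and the sketch has not been run against a real bucket.

```python
def incomplete_upload_bytes(client, bucket: str) -> int:
    """Sum the sizes of all parts belonging to unfinished multipart uploads."""
    total = 0
    upload_kwargs = {"Bucket": bucket}
    while True:
        page = client.list_multipart_uploads(**upload_kwargs)
        for upload in page.get("Uploads", []):
            part_kwargs = {
                "Bucket": bucket,
                "Key": upload["Key"],
                "UploadId": upload["UploadId"],
            }
            while True:
                parts = client.list_parts(**part_kwargs)
                total += sum(part["Size"] for part in parts.get("Parts", []))
                if not parts.get("IsTruncated"):
                    break
                part_kwargs["PartNumberMarker"] = parts["NextPartNumberMarker"]
        if not page.get("IsTruncated"):
            break
        # Continue the upload listing from where the last page stopped.
        upload_kwargs["KeyMarker"] = page["NextKeyMarker"]
        upload_kwargs["UploadIdMarker"] = page["NextUploadIdMarker"]
    return total
```

With boto3 installed, `incomplete_upload_bytes(boto3.client("s3"), "example-bucket")` would be the intended call. The lifecycle action that cleans these up automatically is `AbortIncompleteMultipartUpload`.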

and yes, i know that anecdote is just chock full of bad practices.


That S for simple stands for simple ways to skyrocket your expenses


“Simple” was coined when the alternative was managing a fleet of servers with disks. Time changes everything.


Yeah, I've definitely treaded on the storage cost landmine. Thankfully it was just some cents in my case, but it's really infuriating how badly the console exposes the information.


I have the feeling that the entire case (in)-sensitive discussions are usually too much English-centric.

Allow me to iterate: I have the feeling way too many language discussions, especially in IT, are too much English-centric.


> too much English-centric.

Pretty glad about it considering how much simpler ASCII was to work with compared to Unicode.

I say it as a non-native English speaker: programming has so many concepts and stuff already, it's best not to make it more complex by adding 101 different languages to account for.

Unicode and time zones are the two things that try to account for more languages and cultures while programming, and look what happens: they create the most pain for everyone, including non-native English programmers.

I don't want to write computer programs in my non-English native tongue if that means I'll have to start accounting for every major language while I'm programming.

It's fine that IT discussions are so English-centric. Diversity is more complexity, and no one owns the English language. It's just a tool used by people to communicate; thanks to that universal language, I can express my thoughts to most people in India, China, Japan, South America, etc due to having 1 common language.

They all own the English language too, the moment they decided to speak in it.

No need to bring diversity politics in IT.

Best to keep it technical.


> No need to bring diversity politics in IT.

Politics is just "how people think things should be". Therefore politics are everywhere not because people _bring_ them everywhere but because they arise from everything.

Your comment is in fact full of politics, down to your opinion that politics shouldn't be included in this discussion.

**

> thanks to that universal language, I can express my thoughts to most people in India, China, Japan, South America, etc due to having 1 common language

Personally my impression is that native speakers just run circles around everyone else during meetings and such. Being _truly_ comfortable in the language, mastering social cues, being able to confidently and fluently express complex ideas, mean that they effectively take over the room. In turn that means they will hold more power in the company, rise in rank more quickly, get paid more, etc.. There's an actual, significant consequence here.

Plus, anglos usually can't really speak another language, so since they don't realize how hard it is they tend to think their coworkers are idiots and will stick to do things with other anglos rather than include everyone.

> Diversity is more complexity

In a vacuum I agree, but within the context of your comment this is kinda saying "your existence makes my life too complex, please stop being different and join the fold"; and I can't agree with that sentiment.


You raise an interesting point about the nature of politics. I’ve been thinking about this a bit, but it seems to me that radical/revolutionary politics are talking about how people want things to be while quotidian political ideas are more about how people ought to do a few things. The distinction here being people’s timelines and depth of thought. If a policy has some seriously bad consequences, people may not notice because they weren’t really thinking of how things should be, just the narrower thought of how a thing ought to be done (think minimum wage driving automation rather than getting people a better standard of living, or immigration control driving police militarization). Of course, for most politicians, I am not sure either of these are correct. I think for politicians, politics is just the study of their own path to power; they likely don’t care much about whether it’s how things are done or how things ought to be so long as they are the ones with the power.

I don’t know that this comment really adds anything to the conversation, but I do find it all interesting.

Edit: also, on topic, languages are fun. The world is boring when everything is in one language. Languages also hold information in how they structure things, how speakers of that language view the world, and so on, and in those ways they are important contributors to diversity of thought.


  > considering how much more simpler ASCII was to work with compared to Unicode.
And elementary algebra is simpler than differential calculus.

ASCII being simpler just means it is not adequate to represent innate complexity that human languages have. Unicode is not complex because of "diversity politics", whatever that means. It is because languages are complex.

The same story with time zones: they are as complex as time is.


> thanks to that universal language, I can express my thoughts to most people in India, China, Japan, South America, etc due to having 1 common language.

My lazy ass wishes that English were enough to access those communities too. There are many cool and interesting developers, projects and communities only or mostly working in their native languages. One of the major motivations for me to learn Chinese now is to access those communities.


I'm not sure why you characterise this as political.

I wish the "case" was a modifier like Italic, bold. It would have been easier to _not_ have separate ASCII codes for upper and lower-case letters in the first place. What are your thoughts on MS Word using different characters for opening and closing quotes?


Speaking of non-English cultures, do Japanese case insensitive systems differentiate between hiragana and katakana?

Because, in some ways, the two syllabaries remind me of uppercase and lowercase alphabets.


They're more like distinguishing between "o" and “ℴ”.

Which is where the European idea of "capital letters" originates, but not how we think about them today.


A lot of the time forms will specifically ask for hiragana or katakana, or specify full width or half width characters.

But basically it’s a mess there too


A few more:

* Multipart uploads cannot be performed from multiple machines having instance credentials (as the principal will be different and they don't have access to each other's multipart uploads). You need an actual IAM user if you want to assemble a multipart upload from multiple machines.

* LIST requests are not only slow, but also very expensive if done in large numbers. There are workarounds ("bucket inventory") but they are neither convenient nor cheap

* Bucket creation is not read-after-write consistent, because it uses DNS under the hood. So it is possible that you can't access a bucket right after creating it, or that you can't delete a bucket you just created until you waited enough for the changes to propagate. See https://github.com/julik/talks/blob/master/euruko-2019-no-su...

* You can create an object called "foo" and an object called "foo/bar". This will make the data in your bucket unportable into a filesystem structure (it will be a file clobbering a directory)

* S3 is case-sensitive, meaning that you can create objects which will be unportable into a filesystem structure (Rails file storage assumed a case-sensitive storage system, which made it break badly on macOS - this was fixed by always using lowercase identifiers)

* Most S3 configurations will allow GETs, but will not allow HEADs. Apparently this is their way to prevent probing for object existence, I am not sure. Either way - cache-honoring flows involving, say, a HEAD request to determine how large an object is will not work (with presigned URLs for sure!). You have to work around this doing a GET with a Range: of "very small" (say, the first byte only)

* If you do a lot of operations using pre-signed URLs, it is likely you can speed up the generation of these URLs by a factor of 10x-40x (see https://github.com/WeTransfer/wt_s3_signer)

* You still pay for storage of unfinished multipart uploads. If you are not careful and, say, these uploads can be initiated by users, you will be paying for storing them - there is a setting for deleting unfinished MP uploads automatically after some time. Do enable it if you don't want to have a bad time.

These are just off the top of my head :-) Paradoxically, S3 used to be revolutionary and still is, on multiple levels, a great product. But: plenty of features, plenty of caveats.
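The HEAD workaround from the list above can be sketched with the standard library: request only the first byte, then read the total size from the `Content-Range` response header. Both function names are hypothetical, and no network request is made here.

```python
import re
import urllib.request
from typing import Optional

def first_byte_request(url: str) -> urllib.request.Request:
    """Build a GET that asks only for byte 0 of the object."""
    return urllib.request.Request(url, headers={"Range": "bytes=0-0"}, method="GET")

def total_size(content_range: str) -> Optional[int]:
    """Parse a header like 'bytes 0-0/12345'; return None if the total is '*'."""
    match = re.fullmatch(r"bytes \d+-\d+/(\d+|\*)", content_range.strip())
    if match is None:
        raise ValueError(f"unexpected Content-Range: {content_range!r}")
    return None if match.group(1) == "*" else int(match.group(1))
```

In use, you would send `first_byte_request(presigned_url)` with `urllib.request.urlopen` and pass the response's `Content-Range` header to `total_size`.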


The one that caught me a couple of weeks ago is multipart uploads have a minimum initial chunk size of 5 MiB (https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts...). I built a streaming CSV post-processing pipeline in Elixir that uses Stream.transform (https://hexdocs.pm/elixir/Stream.html#transform/3) to modify and inject columns. The Elixir AWS and CSV modules handle streaming data in, but the AWS module throws an error (from S3) if what you stream "out" totals less than 5 MiB, as it uses multipart uploads, which made me sad.


The last part can be any size, so with a few tweaks to the streaming code you should be fine. Ready-made AWS SDKs handle this (chunking) for you. Truth be told, the multipart upload on GCP is even worse :/
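The tweak amounts to buffering the stream until each part reaches the minimum size and letting the final part be whatever is left. A sketch in Python rather than Elixir, with `parts_from_stream` as a hypothetical name:

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # 5 MiB, the documented minimum for non-final parts

def parts_from_stream(chunks, min_size=MIN_PART_SIZE):
    """Regroup an iterable of byte chunks into parts of at least min_size.

    Only the final part may be smaller than min_size.
    """
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        if len(buffer) >= min_size:
            yield bytes(buffer)
            buffer.clear()
    if buffer:
        yield bytes(buffer)

# Tiny min_size so the behaviour is easy to see.
small_parts = list(parts_from_stream([b"ab", b"cd", b"e"], min_size=3))
```

A stream that never reaches the minimum comes out as a single part, which is then also the last part.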


Here's another fun one that took a colleague and me several days of analysis to diagnose. S3 will silently drop all requests once a single TCP connection has sent 100 HTTP requests.

https://github.com/aws/aws-sdk-go/issues/2825


It doesn't silently drop, it sends a header indicating it's closed the TCP connection.

This is a pretty common pattern whereby you want keep-alive for performance reasons, but you don't want clients running _too long_ creating hot spots on your load balancers.
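On the client side, honoring that header comes down to checking for a `close` token before reusing the connection. A minimal sketch, with `server_wants_close` as a hypothetical helper operating on a plain dict of response headers:

```python
def server_wants_close(headers: dict) -> bool:
    """True if the response headers carry a 'close' token in Connection."""
    for name, value in headers.items():
        if name.lower() == "connection":
            # The header is a comma-separated, case-insensitive token list.
            tokens = [token.strip().lower() for token in value.split(",")]
            if "close" in tokens:
                return True
    return False
```

A connection pool would call this after each response and open a fresh connection instead of sending the next request on the old one.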


What header? Our client was envoy and it's pretty standards compliant, and it just kept trying to use the connection.

Edit: I see that it is `connection: close`. I wonder if that is new behaviour or if envoy did not honour it at the time we encountered the issue.

Thanks for the info!


Instead of closing the connection it sends a message stating that it is closed? Wow.


How about the fact that S3 is not suitable for web serving due to high latencies (in standard storage classes)?

Many people think you can just host the resources for your websites, such as images or fonts, straight on S3. But that can make for a shitty experience:

> applications can achieve consistent small object latencies (and first-byte-out latencies for larger objects) of roughly 100–200 milliseconds.

From: https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimi...


Most folks use S3 as a source for AWS cloudfront for serving content

You can even use cloudfront signed cookies to give specific users cdn access to only specific content owned by them on S3. How cool is that.


Typically you would use cloudfront with S3 if you want to use it for serving web assets.

It will cache frequently accessed assets, and in addition to reducing latency may reduce cost quite a bit.


Pretty much everyone knows this …


This has been well known for over a decade


s3 is not optimized to directly serve websites, but to durably store and retrieve ~unlimited data.


Put a memcached instance in front and you're good.


Those uploader decides rules are wild. Does that mean someone with a website poorly configured enough can have user content uploaded to, and later served from amazon glacier? (assuming a sufficiently motivated user)


It does. But if you're concerned about this (and many of the other items mentioned), you can control access to those features using IAM.

https://docs.aws.amazon.com/service-authorization/latest/ref...

The condition keys specifically are here and you can see keys to control access to storage class, tagging, etc.

https://docs.aws.amazon.com/service-authorization/latest/ref...


> S3 isn’t the only service that works this way, Hosted Cognito UI endpoints do something similar (https://[your-user-pool-domain]/login).

Basically every new service that expects to deal with a lot of traffic should work this way (we did this also for AWS IoT). It's a hell of a lot easier to deal with load balancing and request routing if you can segment resources starting at the DNS level...


Regarding the deletion point, note that you cannot delete S3 buckets that are non-empty, so in order to actually delete a bucket you first have to delete its contents. Of course, if any action is allowed, then that is allowed as well. But still, it's not a single request away for any non-trivial bucket.




The S3 API, and most of AWS, is a kludgy legacy mess.

Is there any chance of getting a new, industry-standard not just industry-adopted, simpler but more usable[1] common API?

We already have some client libraries that try to paper over the differences (https://gocloud.dev/, https://crates.io/crates/object_store), but wouldn't it be nice to have just one wire protocol?

[1]: E.g. standardize create-if-not-exist, even if S3 doesn't implement that.


My life experience has been that this line of thinking results in two separate outcomes: "the great thing about standards is that there are so many to choose from" <https://xkcd.com/927/>, and that publishing a standard does absolutely zero for getting buy-in, with that last one made worse by any short-sighted fools who publish a bunch of words without a reference impl or ideally a TCK. As this whole thread has shown, for 10 people there will be 15 understandings of any given sentence, which is very very bad when trying to get two computers to agree on something


A workaround for some of the limits of presigned urls like not being able to specify a max file size is to front your uploads using CloudFront OAC and CloudFront Functions. It costs more (.02/GB) but you can run a little JavaScript code to validate/augment headers between your user and S3 and you don't need to expose your bucket name. https://speedrun.nobackspacecrew.com/blog/2024/05/22/using-c...


>Schrodiner’s cat is the one that’s both alive and de-lifed at the same time, right?

It was alive and dead at the same time. Don't use censor-approved language outside TikTok!


>Things you wish you didn't need to know about S3

>A time travel paradox in the title is a good place to start a blog post, don’t you think?

Where is the paradox?


> S3 buckets are the S3 API

> … a relatively small part of the API requires HTTP requests to be sent to generic S3 endpoints (such as s3.us-east-2.amazonaws.com), while the vast majority of requests must be sent to the URL of a target bucket.

I believe this is talking about virtual-hosted style and path-style methods for accessing the S3 API.

From what I can see [0], at least for the REST API, the entire API works with either virtual-hosted style (where the bucket name is in the host part of the URL) or path-style (where the bucket name is in the path part of the URL). Amazon has been wanting folks to move over to the virtual-hosted style for a long time, but (as of 3+ years ago) the deprecation of path-style has been delayed[1].

This deprecation of path-style requests has been extremely important for products implementing the S3 API. For example…

* MinIO uses path-style requests by default, requiring you set a configuration variable[2] (and set up DNS appropriately) to handle the virtual-hosted style.

* Wasabi supports both path-style and virtual-hosted style, but "Wasabi recommends using path-style requests as shown in all examples in this guide (for example, http://s3.wasabisys.com/my-bucket/my-object) because the path-style offers the greatest flexibility in bucket names, avoiding domain name issues."[3].

Now here's the really annoying part: The REST API examples show virtual-hosting style, but path style works too!

For example, take the GetBucketTagging example. Let's say you have bucket "karl123456" in region US-West-2. The example would have you do this:

GET /?tagging HTTP/1.1

Host: karl123456.s3.amazonaws.com

But instead, you can do this:

GET /karl123456?tagging HTTP/1.1

Host: s3.us-west-2.amazonaws.com

!!

How do I know this? I tried it! I constructed a `curl` command to do the path-style request, and it worked! (I didn't use "karl123456", though.)
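
To make the two addressing styles concrete, here is a minimal Python sketch that builds both URLs for the same bucket. The function names are made up for illustration, and the host patterns assume the regional endpoint form (s3.<region>.amazonaws.com). Nothing is signed or sent, so a real request would still need SigV4 auth on top of this.

```python
from urllib.parse import quote


def virtual_hosted_url(bucket: str, region: str, key: str = "", query: str = "") -> str:
    """Virtual-hosted style: the bucket name is part of the host."""
    url = f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"
    return f"{url}?{query}" if query else url


def path_style_url(bucket: str, region: str, key: str = "", query: str = "") -> str:
    """Path style: the bucket name is the first segment of the path."""
    path = f"/{bucket}/{quote(key)}" if key else f"/{bucket}"
    url = f"https://s3.{region}.amazonaws.com{path}"
    return f"{url}?{query}" if query else url


# The GetBucketTagging example from above, in both styles.
print(virtual_hosted_url("karl123456", "us-west-2", query="tagging"))
print(path_style_url("karl123456", "us-west-2", query="tagging"))
```

The second line printed corresponds to the `GET /karl123456?tagging` request with `Host: s3.us-west-2.amazonaws.com`.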

So hopefully that helps resolve at least one of your S3 annoyances :-)

[0]: https://docs.aws.amazon.com/AmazonS3/latest/userguide/RESTAP...

[1]: https://aws.amazon.com/blogs/aws/amazon-s3-path-deprecation-...

[2]: https://min.io/docs/minio/linux/reference/minio-server/setti...

[3]: https://docs.wasabi.com/docs/rest-api-introduction


a fun read!

i just built a live streaming platform[1]. chat, m3u8, ts, all objects. list operations used to concat chat objects. works perfectly, object storage is such a great api.

it uses r2 as the only data store, which is a delightfully simple s3-alike. the only thing i miss are some of the advanced listv2 features.

to anyone not enjoying s3 complexity, go try r2.

1. https://nathants.com/live
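
for the curious, the "list then concat" pattern mentioned above could look roughly like this in Python. this is only a sketch, not the code behind the linked site: the function name, the bucket name and the "chat/" prefix are hypothetical, and `client` is assumed to be a boto3-style S3 client (boto3's `get_paginator("list_objects_v2")` and `get_object` are the only calls used). it also assumes the listing comes back in key order, so keys would need to sort chronologically.

```python
def concat_chat(client, bucket: str, prefix: str) -> bytes:
    """Concatenate the bodies of every object under `prefix`, in listing order.

    `client` is assumed to behave like a boto3 S3 client.
    """
    paginator = client.get_paginator("list_objects_v2")
    chunks = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages with no matching keys have no "Contents" entry at all.
        for obj in page.get("Contents", []):
            body = client.get_object(Bucket=bucket, Key=obj["Key"])["Body"]
            chunks.append(body.read())
    return b"".join(chunks)
```

usage would be something like `concat_chat(client, "my-stream-bucket", "chat/")`, with both names being placeholders.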


Object lock until 2099 that can only be cancelled by deleting the AWS account is *nasty*.


My wife - she's a future developer, I'm a lawyer - has had enough of me explaining some of these details in the form of concerns. I am somewhat satisfied to know that my perplexities are not imaginary.


What's a future developer?


A developer that is learning programming or a specific technology? Isn't it self-explanatory or what problem did you have understanding it?


Wants to be. She reads a lot of code and tries to understand it and write her own. She will probably never be one of you, like an expert. But she's interested, she works laterally in the field - she's done a lot of hours on the intranet at her job. So I think she can be a developer in the future, a good developer. Now she has a degree in the field and... Wait. But then... she is already a developer. I don't know, man. It's a philosophical question.


> Isn't it self-explanatory

No.

> what problem did you have understanding it?

It is simply not a phrase I commonly read. When I google it I find a visa consultancy under the name, and not much else. Curiously your comment is also among the results.

The problem is that the phrase "<something> developer" is used to describe what the person develops. A "real-estate developer" invests in real estate as a business. A "web developer" develops web applications, a "game developer" develops games, and so on and so on. So reading the word I immediately thought they mean someone who is developing the future? Like idk Douglas Engelbart or Tim Berners-Lee or someone like that.

If you want to write that someone is learning to become a developer I would recommend the much less confusing "developer in training" phrase, or even better if you just write "they are learning to become a developer".


Bad English case. It just is. I pay a lot of attention when I write English, but you can always tell when someone isn't a native English speaker in about two words. Je suis désolé.


No worries! Glad that trallnag asked so we could clear it up.

Wishing your wife the best of luck with her career! (and to you too!)


> There are only losers in this game, but at least we’ve all got a participation ribbon to comfort us in moments of angst.

I sense the frustration...


Not one thing about different API limits for operations according to key prefixes, and how you need to contact support if you need partitioning and provide them the prefixes, huh?


I’ve become old. If you look at these things in disgust (ACL vs Policies, Delete bucket with s3:*, etc), you’re missing the point of (deterministic ?) software. It does what it is written to do, faithfully, and errors out when it can't. When it doesn’t do as written or as documented, then yes… go full bore.


The doc is huge, and the principle of least astonishment is often not respected.

Also, third-party providers support a random subset of it, given that nothing about the protocol is simple anymore (or maybe ever was).


[flagged]


A career change, or a job change?


Audi S3?


No mention of how AWS/S3 approximates the size of a file to save CPU cycles. It used to drive me up a wall seeing S3 show a file size as slightly different than what it was.

If I recall correctly, S3 uses 1000 rather than 1024 to convert bytes to KB and MB and GB. This saves CPU cycles but results in “rounding errors” with the file’s reported size.

It’s discussed here, although they’re talking about it being “CLI vs console” which may be true?

https://stackoverflow.com/questions/57201659/s3-bucket-size-...


This almost certainly has nothing to do with “saving CPU cycles” and is most likely just that whoever created the Cloudwatch console used the same rounding that is used for all other metrics in CW, rather than the proper calculation for disk size, and it was a small enough issue that it was never caught until it was too late to change it because changing it would disrupt the customers that have gotten used to it.


If anything a bit shift could do the 1024 division faster
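
A quick sanity check of that point in Python, as a sketch only (the function names are made up, and nothing here says how S3 or CloudWatch actually compute sizes): for non-negative integers, shifting right by 10 bits gives the same result as integer division by 1024.

```python
def to_kib_shift(num_bytes: int) -> int:
    """Whole KiB via a right shift by 10 bits."""
    return num_bytes >> 10


def to_kib_div(num_bytes: int) -> int:
    """Whole KiB via integer division by 1024."""
    return num_bytes // 1024


# Both approaches agree for every byte count in this range.
assert all(to_kib_shift(n) == to_kib_div(n) for n in range(0, 10_000_000, 997))
```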


S3 doesn't format the size for human display, the object's Size is returned in bytes:

https://docs.aws.amazon.com/AmazonS3/latest/API/API_Object.h...


What I meant to say was S3 uses 2^20 to convert from bytes to megabytes rather than 10^6 (e.g. what Windows uses).

https://repost.aws/knowledge-center/s3-console-metric-discre...

> S3 console and Storage Lens use base 2 conversion (/1024) to report storage metrics, and CloudWatch by default uses base 10 conversion (/1000)


But KB is indeed 1000 bytes, and MB is indeed 1000 KB.

In case of 2^10 units the correct names are kibibyte (KiB) and mebibyte (MiB). Check https://en.wikipedia.org/wiki/Mebibyte#Multiple-byte_units

Yeah we have long standing confusion that for historical reasons KB and MB often means 2^10 bytes units so now when you see KB you really don't know what it means. Therefore I am a staunch supporter of unambiguous KiB and MiB.


I think I explained this poorly, and I appear to be mistaken about it being to save CPU cycles (though an entity such as AWS would absolutely be about saving minuscule CPU cycles that add up at scale).

I spent a lot of time researching this issue when I came across it years ago. I just remember that local file size and S3 file size were not matching up with anything, and the takeaway was that S3 was calculating file size differently from Windows/Linux/macOS.

AWS uses base 2 (binary) for calculating file size, meaning 1MB is treated as 1,048,576 (2^20) bytes rather than 1,000,000 bytes (10^6).

And as you've said, one is MB while the other is technically MiB.
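
To show how big the gap between the two conventions is, here is a small Python sketch. The helper names are made up for illustration; the only facts used are the ones above, 1 MB = 10^6 bytes and 1 MiB = 2^20 bytes.

```python
def to_mb(num_bytes: int) -> float:
    """Decimal megabytes (MB): 1 MB = 10**6 bytes."""
    return num_bytes / 10**6


def to_mib(num_bytes: int) -> float:
    """Binary mebibytes (MiB): 1 MiB = 2**20 bytes."""
    return num_bytes / 2**20


size = 5 * 2**20  # 5,242,880 bytes
print(f"{to_mb(size):.2f} MB")    # 5.24 MB
print(f"{to_mib(size):.2f} MiB")  # 5.00 MiB
```

Same object, same byte count, and a difference of about 4.9% in the displayed number, which grows to about 7.4% at the GB/GiB level.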




