Building and operating a pretty big storage system called S3 (allthingsdistributed.com)
804 points by werner on July 27, 2023 | 160 comments



> That’s a bit error rate of 1 in 10^15 requests. In the real world, we see that blade of grass get missed pretty frequently – and it’s actually something we need to account for in S3.

One of the things I remember from my time at AWS was conversations about how 1 in a billion events end up being a daily occurrence when you're operating at S3 scale. Things that you'd normally write off as so wildly improbable they're not worth worrying about have to be considered and handled.

Glad to read about ShardStore, and especially the formal verification, property-based testing, etc. The previous generation of services was notoriously buggy, a very good example of the usual perils of organic growth (but at least really well designed such that they'd fail "safe", ensuring no data loss, something S3 engineers obsessed about).


> daily occurrence when you're operating at S3 scale

Yeah! With S3 averaging over 100M requests per second, 1 in a billion happens every ten seconds. And it's not just S3. For example, for Prime Day 2022, DynamoDB peaked at over 105M requests per second (just for the Amazon workload): https://aws.amazon.com/blogs/aws/amazon-prime-day-2022-aws-f...
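A quick back-of-the-envelope check on that arithmetic (just a sketch; the 100M requests/second figure is the one quoted above):

    # Quick sanity check: how often does a "1 in a billion" event occur
    # at the request rates quoted above? (Back-of-the-envelope only.)
    P_EVENT = 1e-9               # probability per request
    REQ_PER_SEC = 100e6          # S3: "over 100M requests per second"

    seconds_per_event = 1 / (P_EVENT * REQ_PER_SEC)
    events_per_day = P_EVENT * REQ_PER_SEC * 86_400

    print(f"about one event every {seconds_per_event:.0f} seconds")  # ~10
    print(f"about {events_per_day:,.0f} events per day")             # ~8,640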

In the post, Andy also talks about Lightweight Formal Methods and the team's adoption of Rust. When even extremely low probability events are common, we need to invest in multiple layers of tooling and process around correctness.


James Hamilton, AWS' chief architect, wrote about this phenomenon in 2017: At scale, rare events aren't rare; https://news.ycombinator.com/item?id=14038044


James' posts are always a treat. It's so rare to encounter such plain, straightforward content from someone with a title and responsibilities like his. Without layers of marketing sugar over everything. Dude just wants to post about the cool shit he did on his GeoCities-tier website and I love it.


This phenomenon is just the sample size (scale) multiplied by a probability (rare).


It shows that, however improbable, people do win the lottery.

It's good to be reminded of that if you've been trained for years not to play the lottery because you personally won't ever win.

In this case, the Cloud vendor is the lottery organizer and they indeed need to plan for people winning.


I agree with what I think is your sentiment -- that people seem to be treating this as if it's some sort of profound insight that you only get if you work at a very senior level in engineering for major US cloud providers, when it's in fact blindingly obvious!


> that people seem to be treating this as if it's some sort of profound insight that you only get if you work at a very senior level in engineering for major US cloud providers, when it's in fact blindingly obvious!

I don't mean to imply it's a profound insight, and the discussions I had in AWS were never in those terms. It's just that when you're designing and building things that are going to operate at that scale, you have to very seriously consider the improbable.

What's more difficult is actually knowing what needs to be considered. e.g. prior to working at AWS, I don't think I'd have even considered "NIC corrupts packet, in such a way it gets to the OS mangled" as something that would be worth handling. Yet S3 and similar scale services see that and other improbable events so regularly that they actually have to consciously design for it, everywhere.

It's also one reason why larger services end up being incredibly conservative about the use of technology. You know what the failure modes are, however improbable, and can account for them. New technology tends to be kept on the fringes, and only adopted in more significant places once proven and improbable failures become understood.


Thanks! That was interesting and helpful.


Well it is - nobody maintains the level of detail required to actually know about these sorts of events.

I worked on a safety critical system where we’d find all sorts of unusual bugs… because we were looking for them. It really narrowed the scope for product selection, many vendors were just disqualified.


Was an SDM of a team of brand new SDEs standing up a new service. In a code review, pointed to an issue that could cause a Sev2, and the SDE pushed back: "that's like a one in a million chance, at most". Pointed out that once we were dialled up to 500k TPS (which is where we needed to be), that was 30 times a minute... "You want to be on call that week?" Insist on Highest Standards takes on a different meaning in that stack compared to most orgs.


Daily? A component I worked on that supported S3’s Index could hit a 1 in a billion issue multiple times a minute. Thankfully we had good algorithms and hardware that is a lot more reliable these days!


This was 7-8 years ago now. Lot of scaling up since those days :)


I’m sure my numbers are out of date now too


Personally I'd love working in that kind of environment. That one in a billion hole still itches at me. There's also a slightly-perverse little voice in my head ready with popcorn in case I'm lucky enough to watch the ensuing fallout from the first major crypto hash collision :-).


That probability is significantly lower than one in a billion.

One in a billion would be if keys were ~30 bits. Luckily they aren't.


The one in a billion was in reference to storage related stats described in the article. Not private crypto keys.


I love conversations like this that remind me how unintuitive big numbers are.


Also worked at Amazon; saw some issues with major, well-known open source libraries that broke in places nobody would ever expect.


Any examples you can share?


Redis Node failover


Apache Tomcat starts to break down


Could you elaborate?


We get this on a much lower scale. We have to maintain many forks because no one is responsive on taking patches.


I think Ceph hit similar problems and had to add more robust checksumming to the system, as relying on just TCP checksums for integrity, for example, was no longer enough.


Not that surprising, given this was already extensively documented in the 2000's (so already widely known by then) with iSCSI and such, see https://www.rfc-editor.org/rfc/rfc3385 for example.


Yes, I remember TCP checksumming coming up as not sufficient at one stage. Even saw S3 deal with a real head-scratcher of a non-impacting event that came down to a single NIC in a single machine corrupting the TCP checksum under very specific circumstances.


HDFS never relied on only network checksums. Blocks should be checksummed and validated at clients - a reliable end-to-end guarantee.


Well... yeah. S3 has checksums and all sorts of fixity checks right throughout. At no stage do they ever rely on a single mechanism. If there's one thing they're insanely paranoid about, it's data correctness and durability.

It has been several years, so I really don't remember much about the TCP checksum / corrupting NIC thing. Typically TCP checksum failures are handled entirely by the NIC; you wouldn't even notice them. My vague recollection was it coming up between two services not in the customer synchronous path (so e.g. not involved in getting data to or from the customer), and it caused something on the OS side.

I do remember that there was a contingent of engineers that were convinced it was a cosmic ray bit flip, which seems to be the thing certain types of engineers end up doing when presented with improbable-seeming circumstances. It wasn't until it had happened a second or third time (weeks later) that they realised the origin machine was the same each time, and were able to dig in deeper to the point of reproduction.


To think that when Andy’s Coho Data built their first prototype on top of my abandoned Lithium [1] code base from VMware, the first thing they did was remove “all the crazy checksumming code” to not slow things down…

[1] https://dl.acm.org/doi/10.1145/1807128.1807134


Ever see a UUID collision?


[deleted]


How did you know it was a double bit flip and not just a BGP bug or an in-memory bit flip before being sent to the socket?


> two bit flips in the same tcp packet cancel each other out and cause the checksum to pass

checksum != parity check

not sure if there even exists a chance for this to happen


Wow, this is at the level of Homer Simpson's "cereal with milk catching fire".

But yeah, mathematically possible (at AWS scale, but still), so of course it will happen once in a lifetime.


Eh, UUIDs are usually not truly global anyway, so you'd need a collision in the context of a single region, cell, user, resource, etc. for it to matter.


Even at a billion requests per second, 128 bit UUIDs shouldn't collide for something like a billion years.

And that's if you're going completely random and not taking care to try to reduce collisions.


Are you sure about that math?

A billion seconds at a billion requests per second is already 2^60 items. You'd only need a few billion seconds to have a 50:50 collision chance with 128 random bits, and even less with a real UUID that only has 122 random bits.

You'd hit 1% odds of collision after less than a decade.

If you actually want to go for a billion years, you need to expand that UUID by 50%.


You know I think I converted powers of two and powers of ten interchangeably in my calculations. You're very likely correct.


This seems off. A few billion seconds to have a 50:50 chance? Why wouldn't a billion seconds at a billion per second (2^60 total requests) give a 1 in 2^68 chance (or 1 in 2^62 if it's really only 122 bits)?


Birthday paradox. The number of opportunities to collide is the number of items squared. (Divided by two and a smidge)


Lol. I must be brain dead. Yes.


Because we're talking about collisions, as opposed to comparing 2^64 independent pairs. With 2^128 possible values, if you've picked 2^63 distinct ones, the chance that a randomly selected value collides with one of those is 1 in 2^65. If none of your second batch of 2^63 collide with each other, that gives a 2^63/2^65 = 1/4 chance of one of them colliding with the first batch. Considering the possibility of collisions within each batch of 2^63 brings it closer to 1 in 2.
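If anyone wants to check the numbers in this subthread, here's a rough sketch using the standard birthday approximation (the one-billion-UUIDs-per-second rate is the assumption from upthread, not a real workload):

    import math

    def collision_probability(n_items, bits):
        """Birthday approximation: P ~= 1 - exp(-n^2 / (2 * 2^bits))."""
        return 1.0 - math.exp(-(n_items ** 2) / (2.0 * 2.0 ** bits))

    RATE = 1e9                     # UUIDs per second, as assumed upthread
    SECONDS_PER_YEAR = 365.25 * 86_400

    for years in (1, 10, 100):
        n = RATE * years * SECONDS_PER_YEAR
        p = collision_probability(n, 122)   # UUIDv4 has 122 random bits
        print(f"{years:>4} years: {n:.2e} UUIDs, P(collision) ~ {p:.3f}")

At that rate the 122-bit collision probability is around 1% after a decade and passes 50% within a century, which is consistent with the estimates above.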


There have been many cases of UUIDv4 collisions because an RNG wasn’t as random as expected, due to broken RNG or developer error. It is one of those cases where practice is not as reliable as theory, and it is banned in some places as a consequence.

It depends on how paranoid you need to be.


NIST standards on RNG are not as random as expected?

Or do you mean certain folks intentionally chose substandard implementations for some reason?


A significant number of implementers roll their own UUIDv4. It seems so easy so why not? Most UUIDs are used in contexts where the devs are not that sophisticated so it isn’t that surprising that naive mistakes happen. If you are using it for distributed UUID generation, it just takes one person making a mistake to create havoc.

UUIDv4 is banned in many high security environments primarily because it is easy for people to screw up in practice and it is difficult to detect when those mistakes are made. 128 bits doesn't leave much room for mistakes when relying on probabilistic uniqueness.


Facts.


Shouldn’t != never happens. All sorts of weird implementation issues can cause problems.


Working in genomics, I've dealt with lots of petabyte data stores over the past decade. Having used AWS S3, GCP GCS, and a raft of storage systems for co-located hardware (Ceph, Gluster, and an HP system whose name I have blocked from my memory), I have no small amount of appreciation for the effort that goes into operating these sorts of systems.

And the benefits of sharing disk IOPS with untold numbers of other customers are hard to overstate. I hadn't heard the term "heat" as it's used in the article, but it's incredibly hard to mitigate on a single system. For our co-located hardware clusters, we would have to customize the batch systems to treat IO as an allocatable resource, the same as RAM or CPU, in order to manage it correctly across large jobs. S3 and GCP are super expensive, but the performance can be worth it.

This sort of article is some of the best of HN, IMHO.


It also explains some of the cost model for cloud storage. The best possible customer, from a cloud storage perspective, stores a whole lot of data but reads almost none of it. That's kind of like renting hard drives, except if you only fill some of each hard drive with the "cold" data, you can still use the hard drive's full I/O capacity to handle the hot work. So, if you very carefully balance what sort of data is on which drive, you can keep all of the drives in use despite most of your data not being used. That's part of why storage is comparatively cheap but reads are comparatively expensive.


You get similar properties/challenges in lots of multi-consumer storage scenarios. I learned lots of similar lessons working on CDNs when it comes to object distribution and access rates.

If you're interested, go search for some of the published work from "Coho Data"; they had some great USENIX presentations IIRC. This was the previous company Andy Warfield was at, and they had an emphasis on effective tracking & prediction of IO workloads across very large datasets.


Unfortunately many tools in genomics (and biotech in general) still depend on local filesystems - and even if they do support S3, performance is far slower than it could be.


Most of these tools treat the "local file" as a stream which can be a pipe to a network stream from the object store.

The files that are not streamed and need random access are often better on local ephemeral SSDs or in RAM after a fetch of the, say, 50GB hash table, or whatever it is.

At least, that's my experience: streams and in-RAM pre-processed DBs are >99% of file IO.
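To make the streaming pattern concrete, here's a minimal boto3 sketch that consumes an object in chunks without ever writing it to local disk; the bucket and key names are made up:

    import boto3

    # Hypothetical bucket/key; the point is just "treat the object as a stream".
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket="genomics-data", Key="samples/sample1.fastq.gz")["Body"]

    # StreamingBody can be read in chunks, so a tool that consumes a pipe-style
    # stream never needs the whole (possibly huge) file on local disk.
    total = 0
    for chunk in body.iter_chunks(chunk_size=1024 * 1024):
        total += len(chunk)    # stand-in for feeding a downstream parser/aligner
    print(f"streamed {total} bytes")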


I didn't make my statement out of ignorance.

Most of these applications depend on OS optimizations that have been made over the decades; multithreaded readers, readahead, and caching are critically important to read performance. In principle, a remote storage system could be as fast as a local disk. This includes random access; after all, the storage system is just a bunch of drives attached to machines connected by networks.

When I worked at Google I wrote a mapreduce that converted BAM files to sstables which are sorted, sharded by key, and sit in an object store like S3. Once the files were in sstables (or columnio) we could do realtime analytics using modern tools.


Right, most people that try to really optimize these things do not have access to the parallelism tools that Google has built, and end up doing their own ad-hoc sharding schemes. Things that can be built by 1-3 people over the course of a few weeks to solve an immediate scaling problem. And of course BAM itself dates back to before standardized serialization formats were brought out of Google.

Even with potential optimizations, initiating a seek on GCS or S3 is far far slower than on a local SSD, so even if Google exposes fast cross-network seeks on objects inside an internal object store system, it is not readily accessible to the plebes like me and 99.9% of genomicists that use cloud systems or their own hardware.


You might be interested in our paper that just got published in Bioinformatics today as chance would have it: https://academic.oup.com/bioinformatics/advance-article-abst...


Thanks for sharing, I'll check it out. It would be interesting to see if it would help with these pretty astonishing Open Omics results that recently came out:

https://community.intel.com/t5/Blogs/Tech-Innovation/Artific...


Of course it's slower. You're using https to do something that's meant to be raw binary. The overhead is killing you. Something like iSCSI is pretty quick compared to https as a storage protocol.


Which is exactly why using standard Unix/POSIX files makes sense as a universal interface for genomics programs that are run in highly heterogeneous environments across the world, even if it leads to software engineers wishing that their internal custom data storage systems were used instead.

If random access is needed in a cloud environment, use either that local ephemeral SSD, or a cloud block device which is probably just an iSCSI implementation underneath, or at least a close equivalent.

Operating fleets of compute and IO in cloud environments means that POSIX semantics generally work really well for genomics.

Folks that have reimplemented basic genomics algorithms on top of protocol buffers standardized serialization still store them as BLOBs, and have not delivered benefits that can be realized in publicly available compute environments.


I'm still waiting for people to realize NVMe doesn't have to be local in newer kernels. That's when the real fun will begin.


I used to work on datacenter NVMe products. I wrote the tests which validated them (mostly functional not performance). I left that company before things got hot with fabrics, but I really want to see that stuff succeed. It looked really cool.


I've worked in web tech long enough to recognize that the overhead of HTTP does not explain the difference in performance between "raw binary" protocols and ones that have textual headers.

Put another way I've seen extremely low-latency https servers. The latency in S3 doesn't come from using https.


Raw binary. As in I send the CPU instructions to seek to a memory location (whether local or remote). iSCSI is usually on the same network (maybe even the same machine if using k8s to do this via Longhorn) and handled in kernel space.

I highly doubt HTTP has less latency than that.


no, but since that part of the data transfer wasn't the bottleneck, it doesn't matter.


The latency is higher so the key is parallelism... Which means you need more cores/hardware/VMs/pick your poison. New but same problem...


Is single-job performance the only criterion? Or can you just run a bunch of different jobs at the same time (genomics has many embarrassingly parallel problems, often per-sample) and use the higher aggregate storage bandwidth of your object store to get "more work done in unit time"?


I did the latter, across usually about 4000+ CPUs. That got me a peak of 15 or so GB/sec read from one GCS bucket, writing to another.

But yeah, if it's not something that can be parallelized then it sucks.


As someone in this area: we very much want to make your EiB of data feel local. It's hard, and I'm sorry we only have 3.5 9's of read availability.


People working on storage systems are doing amazing things. When I first heard about Ceph more than a decade ago, I immediately emailed one of the founders asking for an exabyte data store, because I knew just how amazingly difficult it would be and that it was very much needed.

3.5 9s is incredible on large stores. S3 and GCS are just amazing machines. I have nothing but admiration for the people that make this happen.


Some of the best HN indeed. Would love to see any links to HN posts that you think are similarly good!


The things we could build if S3 specified a simple OAuth2-based protocol for delegating read/write access. The world needs an HTTP-based protocol for apps to access data on the user's behalf. Google Drive is the closest to this but it only has a single provider and other issues[0]. I'm sad remoteStorage never caught on. I really hope Solid does well but it feels too complex to me. My own take on the problem is https://gemdrive.io/, but it's mostly on hold while I'm focused on other parts of the self-hosting stack.

[0]: https://gdrivemusic.com/help


Absolutely this. I would LOVE to be able to build apps that store people's data in their own S3 bucket, billed to their own account.

Doing that right now is monumentally difficult. I built an entire CLI app just for solving the "issue AWS credentials that can only access this specific bucket" problem, but I really don't want to have to talk my users through installing and running something like that: https://s3-credentials.readthedocs.io/en/stable/
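For what it's worth, the core of the "credentials that can only access this specific bucket" problem is a scoped IAM policy. A minimal boto3 sketch (the bucket and user names are placeholders, and a real setup needs more care around key management and least-privilege actions):

    import json
    import boto3

    BUCKET = "customers-own-bucket"   # placeholder
    USER = "app-storage-user"         # placeholder

    # Policy allowing object reads/writes and listing for this one bucket only.
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{BUCKET}/*",
            },
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{BUCKET}",
            },
        ],
    }

    iam = boto3.client("iam")
    iam.put_user_policy(
        UserName=USER,
        PolicyName="single-bucket-access",
        PolicyDocument=json.dumps(policy),
    )
    # Access keys issued for this user will then only work against that bucket.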


Most apps, however, assume POSIX-like data access. I would love to see a client-side minimally dependent library that mounts a local directory that is actually the user's S3 bucket.


Linux has FUSE, which is a framework to develop user-level filesystems. Mounting S3 buckets is a very good use case. Sshfs and httpfs are more or less similar in this regard.


Yep, and WinFSP and Dokany are two options for FUSE on Windows. I'd recommend using rclone, or maybe check this list: https://winfsp.dev/doc/Known-File-Systems/


Such a system would be amazing. It would really force companies whose products are UIs on top of S3 to compete hard because adversarial interoperability would be an ever present threat from your competitors.

It really is such a shame that all the projects that tried/are trying to create data sovereignty for users became weird crypto.


I agree with both halves of your comment, but I realized I can't identify the connection between S3 oauth and data sovereignty. Could you elaborate?


So the idea would be that you have an account with AWS (or realistically a more consumer friendly service that's Amazon branded) where all your data lives. Then when you use say Dropbox you can pick "Use my own storage" and grant Dropbox via OAuth the ability to write to /dropbox in your bucket and all your files would live there instead of Dropbox's servers. Lots of the data sovereignty solutions also include a database like interface you can grant apps the ability to use but I can't imagine that catching on initially.

Apple actually already does this with iCloud storage but hides it really well so it feels seamless.


Isn't this essentially how the Dropbox API already works (for apps that support using it)? I've used many apps over the years that offer this option alongside some alternatives.


You can get close with a Cognito Identity Pool that exchanges your user's keys for AWS credentials associated with an IAM role that has access to the resources you want to read/write on their behalf. Pretty standard pattern.

https://docs.aws.amazon.com/cognito/latest/developerguide/co...

edit: I think I misread your comment. I understood it as your app wanting to delegate access to a user's data to the client, but it seems like you want the user to delegate access to their own data to your app? Different use-cases.


We're building this at https://puter.com


You mean you're implementing something like this to be used by puter.com?


Apache Iceberg is kind of this, but more oriented around large data lake datasets.


> Now, let’s go back to that first hard drive, the IBM RAMAC from 1956. Here are some specs on that thing:

> Storage Capacity: 3.75 MB

> Cost: ~$9,200/terabyte

Those specs can't possibly be correct. If you multiply the cost by the storage, the cost of the drive works out to 3¢.

This site[1] states,

> It stored about 2,000 bits of data per square inch and had a purchase price of about $10,000 per megabyte

So perhaps the specs should read $9,200 / megabyte? (Which would put the drive's cost at $34,500, which seems more plausible.)

[1]: https://www.historyofinformation.com/detail.php?entryid=952
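The arithmetic, for anyone who wants to see it spelled out (plain Python, just multiplying the numbers from the post):

    capacity_mb = 3.75
    capacity_tb = capacity_mb / 1_000_000

    as_printed = 9_200 * capacity_tb   # $9,200 per *terabyte*
    print(f"${as_printed:.4f}")        # ~$0.0345, i.e. about 3 cents -- clearly wrong

    corrected = 9_200 * capacity_mb    # $9,200 per *megabyte*
    print(f"${corrected:,.0f}")        # $34,500, in line with the ~$10,000/MB figure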


Must've put a decimal point in the wrong place or something. I always do that. I always mess up some mundane detail.


Did you get the memo? Yeah I will go ahead and get you another copy of that memo.


https://en.m.wikipedia.org/wiki/IBM_305_RAMAC has the likely source of the error: 30M bits (using the 6 data bits but not parity). But it rented for $3k per month, so you didn't have a fixed cost the way you would buying a physical drive outright - very close to S3's model, though.


I think this is still IBM's licensing model (at least as of a few years ago). It was explained to me that you basically license a certain amount of compute even though the hardware is in your data center, and you pay overages if you exceed your licensed throughput.

Since you license a fixed amount, there were projects at the company looking at running batch/non-time-sensitive jobs on the mainframe, since it was effectively free off-peak (I guess power cost was trivial compared to licensing).


You had online jobs during the day and batch at night back then. That's why you always had to have a night in between. Obviously that doesn't work when load is 24/7.


oh shoot. good catch, thanks!


What most people don't realize is that the magic isn't in handling the system itself; the magic is making authorization appear to be zero-cost.

In distributed systems authorization is incredibly difficult. At the scale of AWS it might as well be magic. AWS has a rich permissions model with changes to authorization bubbling through the infrastructure at sub-millisecond speed - while handling probably trillions of requests.

This and logging/accounting for billing are the two magic pieces of AWS that I'd love to see an article about.

Note that S3 does AA differently than other services, because the permissions are on the resource. I suspect that's for speed?


Keep in mind that S3 predates IAM by several years. So part of the reason that access to buckets/keys is special is because it was already in place by the time IAM came around.

It's likely persisted since then largely because removing the old model would be a difficult task without potentially breaking a lot of customers' setups.


Exactly. This difference makes it easier to (1) understand how IAM works, and (2) understand how S3 works... because IAM and S3 work together, but in a different way than the other services.

I heard that AA is done via ASICs, but resource-level permissions imply that authorization is done at the local level for S3. To me that implies that the system extracts S3 permissions from IAM and sends them downstream to S3, where they get merged with stuff that S3 manages.

I guess that occurs when permissions are saved up in IAM world. At some point those need to be joined against a principal somewhere, as roles can exist without assignment.

Again, it'd be so interesting to see how this is done IRL.


AWS re:Invent 2022 - A day in the life of a billion requests (SEC404) https://www.youtube.com/watch?v=tPr1AgGkvc4


"As a really senior engineer in the company, of course I have strong opinions and I absolutely have a technical agenda. But If I interact with engineers by just trying to dispense ideas, it’s really hard for any of us to be successful. It’s a lot harder to get invested in an idea that you don’t own. So, when I work with teams, I’ve kind of taken the strategy that my best ideas are the ones that other people have instead of me. I consciously spend a lot more time trying to develop problems, and to do a really good job of articulating them, rather than trying to pitch solutions. There are often multiple ways to solve a problem, and picking the right one is letting someone own the solution."

"I learned that to really be successful in my own role, I needed to focus on articulating the problems and not the solutions, and to find ways to support strong engineering teams in really owning those solutions."

I love this. Reminds me of the Ikea effect to an extent. Based on this, to get someone to be enthusiastic about what they do, you have to encourage ownership. And a great way is to have it be 'their idea'.


I don't mean this to be cynical, but I do think that it's worth acknowledging that describing the problem is also, in itself, a tool to guide people towards a solution they want. After all, people often disagree about what "the problem" even is!

Fortunately not every problem is like this. But if you look at, say, discussions around Python's "packaging problem" (and find people in fact describing like 6 different problems in very different ways), you can see this play out pretty nastily.


At a toy scale, using ChatGPT's Code Interpreter to do some programming for fun can be an exercise in getting what you want from an inconsistent worker by changing the problem definition (prompt engineering).

This is sort of like:

* writing an exam question so the person taking the exam is likely to get the answer you want

* guiding someone in a code interview that isn't going so well, without giving away the answer

* being in the back seat while pair programming, except you're not allowed to take a turn at the keyboard


I don't think it's cynical, I think it's the point. Describing the problem is not easy, and to your point, is sometimes controversial.

One advantage of focusing on describing the problem is that it naturally lets you have an impact on what you believe to be the important parts of the solution.


I just want to acknowledge that describing the problem is part of picking the solution, and it's not really _that much_ of a "I'm making the most neutral action and letting other people actually choose the solution".

Honestly the "real" hands off thing is letting somebody else also describe the problem and then probing it. But that might lead to a bit too much of an existential crisis for some people. And hey, if something works it works


For sure, it’s only partly hands off. But he is an engineer after all, he should be doing something outside of just managing.


That section really stood out to me as well.

If Andy Warfield is reading, and I bet he is, I have a question. When developing a problem, how valuable is it to sketch possible solutions? Articulating the problem probably brings a few possible solutions to mind. Is it worth sharing those possible solutions to help kickstart the gears for potential owners? Or is it better to focus only on the problem and leave the solution space fully green?

Additionally, anyone have further reading for this type of “very senior IC” operation?


Here's a really quick story on how I accidentally worked out this strategy by getting it wrong first. When I started at Amazon and was trying to convince the team that we should do certain things, I did what I'd always been trained to do: I wrote down the problem and then sketched a solution to it. Then I'd start floating the doc around to try to get folks excited about it. And invariably, they'd do what they were trained to do, which was to have a critical response to the proposed solution. They'd argue that I was solving it the wrong way, and I'd be in a spot where we'd have a conversation where I was defending a position. But this was the last thing I wanted — I was trying to get everyone excited about fixing a problem, but I slowly realized that when I approached it this way, I was just getting feedback on my proposed solution.

So I started doing an experiment where I'd write that same doc, including the ideas I had on the shape of the work we should do, but then I'd delete my solution before sharing it. To your question: I'd still totally write my solution ideas down. Partially because I can't help myself and honestly it was a helpful way to think things through. But when I deleted it and shared a doc with just a problem statement, I'd get feedback on the problem statement. It's pretty obvious, but it was also a pretty surprising result: all of a sudden I was in conversations where we were all on the same side of the table. Feedback was either refining the problem (which was awesome) or proposing solutions. And when the person reading your problem statement starts trying to solve it, it's really cool... because they totally start getting invested and the conversations are great.

Like everything, none of this is actually either/or. There are points in between, like including a sketch of the shape of a solution, or properties that a solution would have to have. But the overall thing of separating the problem and the end state of where you want to get to, from the solution and the plan on how to get there is a pretty effective tool from a sharing ownership perspective.


That’s helpful. Thank you!


For the "very senior IC", I'd recommend https://staffeng.com


There's a saying that I'm often told, and I'm sure we've all heard it at some point "don't bring me problems, bring me solutions". It's such a shit comment to make.

I interpret it as if they are saying "You plebe! I don't have time for your issues. I can't get promoted from your work if you only bring problems."

Being able to solve the problem is being able to understand the problem and admit it exists first. <smacksMyDamnHead>


Depends how it’s used. If it’s used in an org where major, high impact problems are ignored, as a way to just say “ignore all problems”, then yeah, it’s a shit comment.

However, if it’s used to legitimately say “don’t just complain, fix”, then I think it’s a positive. An organization where everyone is constantly negative and complaining about every little issue, but not working to implement improvements/fixes, is essentially a failed company. Successful companies are full of people who actively fix the high impact problems, while also being realists, who can accept that the low impact problems aren’t worth the effort to fix, and aren’t worth endlessly complaining about.


I strongly agree with this perspective but I wish it could be generalized into techniques that work in everyday life, where there isn't already this established ranking of expertise that focuses attention on what is being said and not whether you have the clout or the authority to say it.

Because absent pre-established perceived authority or expertise, which is the context that most day-to-day problems surface within, holding forth and hogging the entire two-way discussion channel with your long, detailed, and carefully articulated description of the problem is going to make you sound like someone who wants to do all the talking and none of the work, or the kind of person who doesn't want to share in finding a solution together with others.


this only works if your team is made up of smart, competent people.


Great to see Amazon employees being allowed to talk openly about how S3 works behind the scenes. I would love to hear more about how Glacier works. As far as I know, they have never revealed what the underlying storage medium is, leading to a lot of wild speculation (tape? offline HDDs? custom HDDs?).


Amazon engineer here - can confirm that Glacier transcodes all data on to the backs of the shells of the turtles that hold up the universe. Infinite storage medium, if a bit slow.


Shh....


Blu-ray discs are thought to be the key: https://storagemojo.com/2014/04/25/amazons-glacier-secret-bd...

Some people disagree though. It’s still an unknown.


Glacier is a big "keep your lips sealed" one. I'd love AWS to talk about everything there, and the entire journey it was on because it is truly fascinating.


My impression is that the ambiguity gives them freedom to implement in different ways across different regions and over time.

The original Glacier was very clearly tape, but given the instant retrieval capabilities the newer S3-Glacier tiers are most likely just low-margin HDDs, maybe with some dynamic powering on and off of drives/servers.


I'm sure it's a mix. Back when it launched there were a number of rumours about it being Blu-ray based. They had similar capacity for the space used compared to tapes and were considered very physically stable storage media, but had long access times as they would need to be physically moved, like tape, explaining the retrieval times.


I don't buy the Blu-ray thing largely because of price, but also because Amazon is quite a conservative company and tape is the more obvious choice.


Glacier is just run on S3 with some sleep statements added.


The perceived value of results is higher if they take longer to load; users feel the computer is hard at work. If it's true for flight searches, it's true for backup systems.


Reminds me of the automated phone systems that play random keystrokes while telling you they’re looking up your info - people don’t trust it if they come back instantly, I guess.


I am going to choose to believe this


I recall at launch just about the only implementation detail that _was_ publicly given was that it did not involve tape. That's going to be difficult to dig up a cite on years later.

No idea how it's evolved over the years, so for all I know it's tape based these days.


Never officially stated, but frequent leaks from insiders confirm that Glacier is based on Very Large Arrays of Wax Phonograph Records (VLAWPR) technology.


We came up with that idea in Glacier during the run-up to April one year (2014, I think?) and half-jokingly suggested it as an April Fools' Day joke, but Amazon quite reasonably decided against doing such jokes.

One of the tag line ideas we had was "8 out of 10 customers say they prefer the feel of their data after it is restored"


This would have been incredible. But I guess I get the angle of not wanting to risk pissing off the audiophile CTO paying you 10 figures per month. Cause he can TOTALLY hear the difference listening to Dark Side of the Moon on vinyl via Monster Cables.


The real problem is the lack of Star Wars references.


It's honestly super impressive that it's never leaked. All it takes is one engineer getting drunk and spouting off. In much higher stakes, a soldier in Massachusetts is about to go to jail for a long time for leaking national security intel on Discord to look cool to his gamer buddies. I would have expected details on Glacier to come out by now.


I don't expect highly paid engineers to leak it, but a random contractor at a datacenter or supplier would eventually leak it if they used a special storage device other than HDD/SSD. Since we don't see any leaks, I suspect that it's based on HDD, with a very long IO waitlist.


HSM is a neat technology, and there are lots of ways it has been implemented over the years. But it starts with a shim to insert some other technology into the middle of a typical POSIX filesystem. It has to tolerate the time penalty for data recovery of your favored HSM'd medium, but that's kind of the point. You can do it with a lower tier of disk, tape, wax cylinder, etc. There's no reason it wouldn't be tape though; tape capacity has kept up and HPSS continues to be developed. The traditional tape library vendors still pump out robotic tape libraries.

I remember installing 20+ fully configured IBM 3494 tape libraries for AT&T in the mid-2000's. These things were 20+ frames long with dual accessors (robots) in each. The robots were able to push a dead accessor out of the way into a "garage" and continue working in the event one of them died (and this actually worked). Someone will have to invent a cheaper medium of storage than tape before tape will ever die.


Glacier was originally using actual glaciers as a storage medium, since they have been around forever. But then climate change happened, so they quickly shifted to tiered storage of tape and hard drives.


It's just low powered hard drives that aren't turned on all the time. Nothing special.


Are there any public details on how Azure or GCP do archival storage?


Just look at other clouds. I doubt Amazon is doing anything special. At least they don't reflect any special pricing.


> Imagine a hard drive head as a 747 flying over a grassy field at 75 miles per hour. The air gap between the bottom of the plane and the top of the grass is two sheets of paper. Now, if we measure bits on the disk as blades of grass, the track width would be 4.6 blades of grass wide and the bit length would be one blade of grass. As the plane flew over the grass it would count blades of grass and only miss one blade for every 25 thousand times the plane circled the Earth.


The standing joke is that Americans love strange units of measure, but this one is so outré that it deserves an award.


The part about distributing loads takes me back to S3 KeyMap days and trying to migrate to it from the initial implementation. What I learned is that even after you identify the hottest objects/partitions/buckets, you cannot simply move them and be done. Everything had to be sorted. The actual solution was to sort and then divide the host's partition load into quartiles and move the second-quartile partitions onto the least loaded hosts. If one tried to move the hottest buckets (1st quartile), it would put even more load on the remaining members, which would fail, over and over again.
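A toy sketch of that quartile idea (the names and data structures here are invented; the real KeyMap machinery was obviously far more involved):

    def pick_partitions_to_move(partition_load):
        """Sort one host's partitions by load and return the second quartile.

        Moving the hottest (1st quartile) partitions dumps too much load onto
        the receivers at once; moving the 2nd quartile relieves the donor host
        without overwhelming anyone else.
        """
        ranked = sorted(partition_load, key=partition_load.get, reverse=True)
        q = len(ranked) // 4
        return ranked[q:2 * q]

    def rebalance(hosts):
        """hosts: host -> {partition: load}. Returns (partition, src, dst) moves."""
        total = {h: sum(parts.values()) for h, parts in hosts.items()}
        donor = max(total, key=total.get)
        moves = []
        for part in pick_partitions_to_move(hosts[donor]):
            receiver = min(total, key=total.get)
            moves.append((part, donor, receiver))
            total[receiver] += hosts[donor][part]
            total[donor] -= hosts[donor][part]
        return moves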

Another side effect was that the error rate went from steady ~1% to days without any errors. Consequently we updated the alerts to be much stricter. This was around 2009 or so.

Also came from an academic background, UM, but instead of getting my PhD I joined S3. It even rhymes :).


S3 is more than storage. It is a standard. I like how you can get S3-compatible (usually with some small caveats) storage from a few places. I am not sure how open the standard is, or whether you have to pay Amazon to say you are "S3 compatible", but it is pretty cool.

Examples:

iDrive has E2, Digital Ocean has Object Storage, Cloudflare has R2, Vultr has Object Storage, Backblaze has B2


Google's GCS as well, and I haven't used Microsoft, but it'd be weird if they didn't also have an "S3 compatible" option.

Edit: I looked it up and apparently no, Azure does not have one :-/


Apologies if this comes off as blunt, but this is the type of content I come to Hacker News to read, rather than just a series of obituaries.

The author has made a lot of great points, but one that stuck with me was:

> I consciously spend a lot more time trying to develop problems, and to do a really good job of articulating them, rather than trying to pitch solutions.

I haven’t thought of it in this way, but this is an excellent way of motivating someone to “own” a problem.


> What’s interesting here, when you look at the highest-level block diagram of S3’s technical design, is the fact that AWS tends to ship its org chart. This is a phrase that’s often used in a pretty disparaging way, but in this case it’s absolutely fascinating.

I'd go even further: at this scale, it is essential and required in order to develop these kinds of projects with any sort of velocity.

Large organizations ship their communication structure by design. The alternative is engineering anarchy.


I'll take the metaphor one step further. The architecture will, over time, inevitably change to resemble its org chart, therefore it is the job of a sufficiently senior technical lead to organize the teams in such a way that the correct architecture emerges.


Also known as "Conway's law"

> Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure

https://en.wikipedia.org/wiki/Conway%27s_law


Right. Conway's Law describes the property that the architecture will grow to resemble the org chart. I'm suggesting that you can productively apply that principle to produce good software by shaping the org chart.

If Conway's Law is phrenology, the "science" of determining someone's personality by measuring their skull's dimensions, I'm suggesting Terry Pratchett's retrophrenology: the process of hitting someone with a hammer very precisely to make them a better person.


This is also why reorgs tend to be pretty common at large tech orgs.

They know they'll almost inevitably ship their org chart. And they'll encounter tons of process-based friction if they don't.

The solution: Change your org chart to match what you want to ship


A more cynical take is that it makes it look like the new management is doing something.

An even more cynical take is that it makes it difficult to compare performance with past performance.


Straight from The Mythical Man Month: Organizations which design systems are constrained to produce systems which are copies of the communication structures of these organizations.


something something Conway's law


Over 100 million requests per second authenticated, billed, versioned, logged, checksummed, encrypted against 200+ trillion objects.


The talk that this article is based on is available on YouTube: https://www.youtube.com/watch?v=sc3J4McebHE


> we’d read and generally have pretty lively discussions about a collection of “classic” systems research papers

Does anyone have the list of papers?

> we managed to kind of “industrialize” verification, taking really cool, but kind of research-y techniques for program correctness, and get them into code where normal engineers who don’t have PhDs in formal verification can contribute to maintaining the specification, and that we could continue to apply our tools with every single commit to the software

Is any of this open source?


S3 is a truly amazing piece of technology. It offers peace of mind (well, almost), zero operations, and practically unlimited bandwidth, at least for analytics workloads. Indeed, it's so good that there has not been much progress in building an open-source alternative to S3. There seems to be not much activity in the Hadoop community. I have yet to hear of any company that uses RADOS on Ceph to handle PBs of data for analytics workloads. MinIO made its name recently, but its license is restrictive and its community is quite small compared to that of Hadoop in its heyday.


There was a time when S3 was still getting resilient. Today it is excellent. Pepperidge Farm remembers.


> There seems not much activity in the Hadoop community

There is apache ozone https://ozone.apache.org/


Yeah, Ozone looks interesting. I was just not sure who used it at scale other than a Japanese startup. The community engagement seems much lower than other communities, though.


This is a fantastic point on ownership that those “placing” it on others can often miss.

“Ownership carries a lot of responsibility, but it also carries a lot of trust – because to let an individual or a team own a service, you have to give them the leeway to make their own decisions about how they are going to deliver it.”


> It's all one thing, and you can't really think about it just as software. It's software, hardware, and people, and it's always growing and constantly evolving.

This is a lesson a lot of software people haven't yet learned. Bad UI, bad operational experiences, insufficient logging to resolve issues, un-fixable code because it's too complicated, and so on. But they use git.

The other term of art for this concept is "system engineering", in the aerospace sense. There are a lot of good texts and courses.

One example: Wasson, System Analysis, Design, and Development, Wiley, 2005. ISBN-10 0-471-39333-9


Not trying to be an arse, but the guy spent a lot more time talking about himself and other unrelated stuff than about how S3 works. And I don't mind a good article on RAMAC, but that seems... out of place in a discussion about peta-scale storage. I got the strong impression he doesn't really know the finer details of how S3 really works. And that's probably fine for what he's doing, there is plenty of room for application coding, firefighting, and problem management without having to get into the finer details of how it all works.


From 2009, a talk I gave about S3 internals [0], when I was Technology Evangelist for AWS. Still relevant today, I believe.

[0]: https://vimeo.com/7330740


I think there's a good call-out about ownership here. Ownership and autonomy go hand in hand (you can't force someone to own something)


How does S3 handle particularly hot objects? Is there some form of rebalancing to account for access rates?


I was disappointed too, this article was very light on details about the subject matter. I wasn't expecting a blue-print, but what was presented was all very hand-wavy.

In large systems (albeit smaller than S3) the way this works is that you slurp out some performance metrics from the storage system to identify your hot spots and then feed that into a service that actively moves stuff around (below the namespace of the filesystem though, so it will be fs-dependent). You have some higher-performance disk pools at your disposal, and obviously that would be NVMe storage today.

So in practice, it's likely proprietary vendor code chewing through performance data out of a proprietary storage controller and telling a worker job on a mounted filesystem client to move the hot data to the high-performance disk pool, constantly rebalancing and moving data back out of the fast pool once it cools off. Obviously for S3 this is happening at an object level, using their own in-house code.
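Roughly, that feedback loop looks like this (a sketch only; the pool names, thresholds, and move_to callback are stand-ins for whatever the vendor tooling actually exposes):

    # Invented thresholds: promote objects averaging >500 reads/s, demote below 50.
    PROMOTE_THRESHOLD = 500.0
    DEMOTE_THRESHOLD = 50.0

    def rebalance_once(heat_metrics, placement, move_to):
        """One pass of the promote/demote loop.

        heat_metrics: object id -> recent access rate (reads/sec), pulled from
                      the storage controller's performance counters
        placement:    object id -> current pool ("hdd" or "nvme")
        move_to:      callback that actually relocates an object between pools
        """
        for obj, rate in heat_metrics.items():
            if placement[obj] == "hdd" and rate > PROMOTE_THRESHOLD:
                move_to(obj, "nvme")    # hot: promote to the fast pool
            elif placement[obj] == "nvme" and rate < DEMOTE_THRESHOLD:
                move_to(obj, "hdd")     # cooled off: make room in the fast pool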


> All in, S3 today is composed of hundreds of microservices

wow



