Ask HN: Why does S3 still not support append?
14 points by whatnotests on Dec 17, 2015 | 27 comments
Is there some technical reason why it's simply infeasible? Is there some fundamental architectural decision laid down years ago that prevents "append" operations?

If the Google equivalent of S3 supports append, why can't S3?




Google Cloud Storage does not support append.

One reason these things don't support append is that at some point they need to choose the "version" of the object. Usually that happens when an object upload has completed.

If they allowed arbitrary appends to objects, they would have a hard time assigning any kind of ordering to them, since the concept of an object being "complete" would be thrown out the window.

(EDIT: and what does it mean to GET an object if you don't know which version is the latest one to return?)

I think something like this could be implemented, but it would probably be an entirely different product that supported some specific traditional file operations (rename, ftruncate, link, etc) but had different scaling properties.



The ability to append to a blob already existed with Azure block blobs, which are a bag of data blocks plus an ordered list of block identifiers: you could just upload a new block, then commit a new list with its identifier at the end.

The real benefit of the new append blob is that you have a one-request append (instead of read list, upload block, commit list).

Also, append blobs (like block blobs) are limited to 50000 append operations.
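
For reference, this is roughly what both approaches look like with the Python azure-storage-blob SDK (v12); the API names are from memory and the connection string / container / blob names are made up, so treat it as a sketch rather than working code:

    # Old-style append on a block blob: stage a new block, then re-commit the
    # block list with the new block's id appended to the end.
    import uuid
    from azure.storage.blob import BlobServiceClient, BlobBlock

    service = BlobServiceClient.from_connection_string("<connection-string>")
    blob = service.get_blob_client(container="logs", blob="events.log")

    block_id = uuid.uuid4().hex
    blob.stage_block(block_id=block_id, data=b"one more record\n")
    committed, _ = blob.get_block_list(block_list_type="committed")
    new_list = [BlobBlock(block_id=b.id) for b in committed] + [BlobBlock(block_id=block_id)]
    blob.commit_block_list(new_list)

    # Append blob: one request per append, up to the 50,000-block limit.
    append_blob = service.get_blob_client(container="logs", blob="events-append.log")
    append_blob.create_append_blob()   # creates (or resets) the blob; do this once
    append_blob.append_block(b"one more record\n")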


S3 is a key/value store. Appends don't make sense in that context. If you think of it as a key/value store, a lot of its constraints start to make more sense.


Google Cloud Storage does not support append. Their docs: "... you cannot make incremental changes to objects, such as append operations or truncate operations."


Oops, my mistake. I could've sworn it did.

Still wishing.


Azure recently added it for logging scenarios. https://azure.microsoft.com/en-us/blog/azure-storage-release...


Azure already had append support for block blobs (i.e. you could upload new data to an existing blob without having to upload the entire blob). But unlike S3, Azure Storage tends to favor features (and consistency) over performance.


Well answered by other comments here: the whole "eventually consistent blob" vs "directed graph of mutation operations" problem. FWIW, this is a good distributed-systems interview question :-)


You can really do either with S3: it's just an eventually consistent, immutable key/value store.

A) Keep a consistent manifest mapping chunk ranges to keys. B) Keep an ordered list of keys that represent the DAG.

In case A, you can even assemble your blob in parallel.
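
Roughly what (A) could look like with boto3 (untested sketch; the bucket, keys, and JSON manifest layout are all made up, and concurrent appenders would still race on the manifest):

    # Approach A: each append writes a new chunk object plus a new manifest
    # listing the chunk keys in order.
    import json
    import uuid
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-bucket"
    MANIFEST_KEY = "logs/events.manifest.json"

    def append(data: bytes):
        chunk_key = "logs/chunks/" + uuid.uuid4().hex
        s3.put_object(Bucket=BUCKET, Key=chunk_key, Body=data)
        try:
            body = s3.get_object(Bucket=BUCKET, Key=MANIFEST_KEY)["Body"].read()
            manifest = json.loads(body)
        except s3.exceptions.NoSuchKey:
            manifest = {"chunks": []}
        manifest["chunks"].append(chunk_key)
        s3.put_object(Bucket=BUCKET, Key=MANIFEST_KEY,
                      Body=json.dumps(manifest).encode())

    def read_all() -> bytes:
        body = s3.get_object(Bucket=BUCKET, Key=MANIFEST_KEY)["Body"].read()
        manifest = json.loads(body)
        # Chunks are independent objects, so they can be fetched in parallel.
        return b"".join(s3.get_object(Bucket=BUCKET, Key=k)["Body"].read()
                        for k in manifest["chunks"])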


S3 is eventually consistent, and appending to an eventually consistent file is going to get very messy, very fast: what happens when an append reaches a replica node before an earlier one does?

If you're happy with out-of-order appends, just use a container file format like Parquet, where appends are actually additional file creations.
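
e.g. a rough boto3 sketch of the append-is-a-new-file pattern (placeholder names; listing order is only roughly append order, so a real version would need a proper sequence number):

    # Each append is its own object under a common prefix, keyed so that
    # lexicographic listing order roughly matches append order.
    import time
    import uuid
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-bucket"
    PREFIX = "events/part-"

    def append(data: bytes):
        key = "%s%015d-%s" % (PREFIX, int(time.time() * 1000), uuid.uuid4().hex)
        s3.put_object(Bucket=BUCKET, Key=key, Body=data)

    def read_all() -> bytes:
        parts = []
        for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX):
            for obj in page.get("Contents", []):
                parts.append(s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read())
        return b"".join(parts)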


After a decently large RAID failure, I needed to gzip as many large files as I could and get them over to S3 as quickly as possible, before another failure hit. The script would gzip each file and then sync it up to S3, all in backgrounded processes. If two large files got sent at the same time, both would die and /leave incomplete files/.

After leaving that running overnight, all of the files appeared to be uploaded... until the owner of the company needed to use them.

I'm still not sure if that's an exceptional use case, but it left a pretty bad taste in my mouth about S3 ever since.


It sounds like you were missing the "Content-MD5" header on your PUT requests. As I recall, S3 will return an HTTP error response if the completed object does not match the Content-MD5 the client sent. The other issue with the HTTP protocol is that the request body doesn't have a mandatory delimiter: the client and server can't really distinguish between a terminated TCP connection and a complete HTTP body without the optional Content-Length/Content-MD5 headers. It really sounds like one or more of your large files were timing out somewhere and the checksum wasn't sent.
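
Something like this is all it takes with boto3 (untested sketch, placeholder names); if the bytes S3 receives don't hash to the supplied value, the PUT fails with a BadDigest error instead of silently storing a partial object:

    import base64
    import hashlib
    import boto3

    s3 = boto3.client("s3")

    def put_with_md5(bucket, key, path):
        with open(path, "rb") as f:
            body = f.read()
        # Content-MD5 is the base64-encoded 128-bit MD5 of the request body.
        md5_b64 = base64.b64encode(hashlib.md5(body).digest()).decode()
        s3.put_object(Bucket=bucket, Key=key, Body=body, ContentMD5=md5_b64)

    put_with_md5("my-bucket", "backups/bigfile.gz", "/tmp/bigfile.gz")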


Because reconciling 2 separate appends to 2 separate nodes which have different copies of the data would be a huge mess.


S3 is more of a simple key-value store than a full filesystem (and for good reason). I suspect the reason their docs push the filesystem metaphor so much is because filesystems are more familiar to many people, and most filesystem semantics can be implemented using a key-value store. In that sense, there is no update() or append() in S3, just a simple set().


Also, because AWS doesn't provide a decent networked file system where multiple instances can simultaneously mount the same volume at the same time in read/write mode. S3 is as close as one can get, in many cases.


You can use our ObjectiveFS[1] if you want a networked file system where multiple instances can simultaneously mount it read/write. It is backed by S3 and gives you a standard POSIX interface.

[1] https://objectivefs.com


Thanks, interesting. Can you comment on how it compares to s3fs? I've used s3fs and it can sometimes be buggy (as in, to the point where files get clobbered and corrupted) and slow (especially in listing directories with a large number of files). Does ObjectiveFS solve these issues, and do you have any reliability statistics?


What about EFS? (still in beta, essentially NFS) https://aws.amazon.com/efs/


I mostly miss a MoveObject operation to rename files myself, but I guess they are keeping things simple and scalable on their end and requiring us to work around it with the existing lower-level operations.


You can use a server-side copy followed by a delete to rename an object reasonably efficiently. I guess that's what you mean by using the existing lower-level operations, but if you didn't know, you might find it helpful!
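
A rough boto3 sketch (placeholder names; note the single-request copy tops out at 5 GB, so bigger objects need a multipart copy instead):

    import boto3

    s3 = boto3.client("s3")

    def move_object(bucket, src_key, dst_key):
        # The copy happens entirely server-side; the data never transits the client.
        s3.copy_object(Bucket=bucket, Key=dst_key,
                       CopySource={"Bucket": bucket, "Key": src_key})
        s3.delete_object(Bucket=bucket, Key=src_key)

    move_object("my-bucket", "reports/draft.csv", "reports/final.csv")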


I don't think it has a traditional filesystem underneath. It probably just writes all puts sequentially as fast as possible, stores the location, and then replicates. The easiest way to append would be to read the object, append, and then write to a new object. If they did that internally there would be no transfer out and no revenue, although they could probably charge for the internal expense. Another reason is that people would probably assume appends are no big deal and try to append continuously to multi-gigabyte files. If that's the case, it's best to let the client handle appends, where the costs are out in the open.
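
i.e. roughly this, with all the transfer costs sitting on the client (untested boto3 sketch, placeholder names):

    import boto3

    s3 = boto3.client("s3")

    def append_to_new_object(bucket, src_key, dst_key, extra):
        # Download, concatenate, and re-upload under a new key.
        existing = s3.get_object(Bucket=bucket, Key=src_key)["Body"].read()
        s3.put_object(Bucket=bucket, Key=dst_key, Body=existing + extra)

    append_to_new_object("my-bucket", "log.txt", "log-v2.txt", b"another line\n")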


All excellent points, especially about cost.

I've considered "faking" the append functionality by making a new file per append action, then performing a periodic compaction.

Even compaction-via-combine-and-delete-old is clunky.

    aws s3 combine --target s3://bucket-name/output-file.txt \
      s3://bucket-name/input-file-1.txt \
      ... \
      s3://bucket-name/input-file-n.txt
I, for one, would pay extra for that.


The lack of read-after-delete consistency makes this tricky.

https://aws.amazon.com/s3/faqs#What_data_consistency_model_d...

I've seen "eventually" consistent mean up to 24hrs in the face of problems. Several minutes seems common when versioning/bucket replication is enabled.


I can second that. Personally I'd like to see them support symbolic links, so version controlling and rolling deployments of static websites become a little easier.


This is actually pretty kewl. You could potentially fork the current CLI from the GitHub repo and add that functionality in. Ideally the flow would work something like the following (rough sketch after this comment):

  1. Start a multipart object upload
  2. Issue "Upload Part - Copy" requests for each part of an object ( http://docs.aws.amazon.com/AmazonS3/latest/API/mpUploadUploadPartCopy.html )
  3. Complete the multipart object upload
Alternative flow:

  1. Enable bucket versioning
  2. Download the parsed S3 objects
  3. Start a multipart object upload to S3 with the specified target object as the object name
  4. Reupload the parsed S3 objects as parts of a single multipart object upload
  5. Delete the previous parsed objects once the multipart object upload is complete (a delete marker should be added to the top of the version stack, but the previously stored version should still be available if you specify its handle/version id).
Edit: changed formatting
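
A rough boto3 sketch of the primary flow (untested; bucket/key names are placeholders, and every part except the last has to be at least 5 MB for the multipart upload to complete):

    import boto3

    s3 = boto3.client("s3")

    def combine(bucket, target_key, source_keys):
        # 1. Start a multipart upload for the target object.
        upload = s3.create_multipart_upload(Bucket=bucket, Key=target_key)
        parts = []
        # 2. Copy each source object server-side as one part.
        for i, src in enumerate(source_keys, start=1):
            resp = s3.upload_part_copy(Bucket=bucket, Key=target_key,
                                       UploadId=upload["UploadId"], PartNumber=i,
                                       CopySource={"Bucket": bucket, "Key": src})
            parts.append({"PartNumber": i, "ETag": resp["CopyPartResult"]["ETag"]})
        # 3. Complete the upload, which stitches the parts into one object.
        s3.complete_multipart_upload(Bucket=bucket, Key=target_key,
                                     UploadId=upload["UploadId"],
                                     MultipartUpload={"Parts": parts})

    combine("bucket-name", "output-file.txt",
            ["input-file-1.txt", "input-file-2.txt"])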


You mean appending things onto the end of files, or what? If so, probably because it's trivial to work around by storing the data in new files, and large data where this would be valuable should be broken into pieces for n different reasons anyway.

Why do people who ask questions fall slightly short of providing enough information to meaningfully answer them?



