> The only downside is that it’s conceptually more complicated, and requires some understanding of underlying components (zip files, http responses, streams).
There's at least one more downside: the user loses all indication of progress as the Content-Length is unknown when the headers are sent
dumply knows the exact size of each image, since it's saved in the DB on upload, and all the zip headers are a fixed number of bytes, so the zip file size is deterministic and calculable even before the first byte is sent. Remember, we don't compress the already-compressed images.
If you didn't know the file sizes, for example if you had raw input streams of unknown length, or compressible data, you could still guesstimate the Content-Length so the user gets some progress bar, even if it isn't 100% accurate.
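When the sizes are known, the exact figure is simple arithmetic. A rough sketch, assuming plain stored (uncompressed) entries with no zip64, extra fields, data descriptors or comments:

// Rough sketch: exact size of a zip whose entries are stored, not deflated.
function zipContentLength(files) {        // files: [{ name: 'photo1.jpg', size: 1482391 }, ...]
  var total = 22;                         // end-of-central-directory record
  files.forEach(function (file) {
    var nameLength = Buffer.byteLength(file.name);
    total += 30 + nameLength + file.size; // local file header + stored data
    total += 46 + nameLength;             // central directory entry
  });
  return total;
}

// response.setHeader('Content-Length', zipContentLength(images));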
Looks like, using browser sniffing, you can deliver an exaggerated Content-Length to everyone but Opera, and browsers will deal with it gracefully. Pretty neat. (Obviously not desirable, since it violates the HTTP spec, but the UX gains might be worth it.)
You're not always dealing with the client that is specified in the UA string. Clients can use proxies, including transparent proxies.
For example, major mobile operators pipe HTTP connections through a proxy that recompresses images. In that case you see e.g. Safari's or Opera's UA string, but you're actually dealing with the proxy's HTTP behavior.
In dump.ly's use case, they would probably want client-side detection in JavaScript, not the UA string. You definitely have to be conservative implementing such an unexpected feature.
Client-side detection in JavaScript and the UA string aren't mutually exclusive, since the UA string is exposed in the DOM (navigator.userAgent).
Are you suggesting they would rather guess the browser by inferring it from the presence of DOM properties and methods?
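For the Opera special case above, the two approaches can even be combined. A rough sketch, assuming Presto-era Opera (which exposes a window.opera object):

// Client-side check before requesting the download with a padded Content-Length.
function isOpera() {
  return typeof window.opera !== 'undefined' ||  // object/feature sniffing
         /\bOpera\b/.test(navigator.userAgent);  // UA string as a fallback
}

if (!isOpera()) {
  // request the variant that sends an estimated Content-Length
}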
In the end it comes down to: would you rather a) break the functionality for some users in exchange for giving the best possible solution to others, or b) give an OK experience to everyone?
I usually go with "b". I think that frustration is much more powerful than awe.
Wouldn't guesstimation cause problems on the user's end? I don't know, but how do browsers/curl/wget/crawlers/... react if you tell them the Content-Length is 1000 bytes, then send them 900 bytes and close the connection? Or overshoot and send 1100? I have a feeling they wouldn't like it, or at least that it varies between fetcher library implementations.
RogerE states above that under-estimating does lead to the resource being truncated, and over-estimating leads to the browser waiting for a timeout or connection break in case there's more data to be fetched.
You can guess, but browsers are quite picky about this header --- guess any amount too low and the download is truncated. Guess too high, and the browser will wait for an amount of time just to be sure the server isn't going to send more data.
Note that it is possible to "always guess long" and pad out with null bytes... this usually works (many file format parsers aren't too picky). However, this is more of a practical work-around than a recommended solution!
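For what it's worth, a rough sketch of that work-around in node, purely illustrative (zip stands for the readable zip stream, and estimateSize is a made-up helper):

// Deliberately over-estimate, then zero-pad to match the promised length.
var guessed = Math.ceil(estimateSize(files) * 1.05);
response.setHeader('Content-Length', guessed);

var sent = 0;
zip.on('data', function (chunk) {
  sent += chunk.length;
  response.write(chunk);
});
zip.on('end', function () {
  if (sent < guessed) response.write(Buffer.alloc(guessed - sent)); // null-byte padding
  response.end();
});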
I was playing around and did something similar with video encoding. The server code starts a running ffmpeg process, and then the handler code just looks like this:
var http = require('http');

var server = http.createServer(function(request, response) {
  // ffmpeg was spawned earlier; pipe the upload in and the encoded output straight back out
  request.pipe(ffmpeg.process.stdin);
  ffmpeg.process.stdout.pipe(response);
});
What a nice interface! The end result is that you can do weird stuff like:
The module is quite mature at this point and is used in production on many websites (including Box.net, which commissioned the initial work). It supports the Content-Length header, Range and If-Range requests, Zip64 for large archives, and filename transcoding with iconv. Being written in C, it will probably use much less RAM than an equivalent Node.js module.
I have found that the hardest part of generating ZIP files on the fly has nothing to do with network programming; it's producing files that open correctly on all platforms, including Mac OS X's busted-ass BOMArchiveHelper.app.
The point wasn't that creating on-the-fly zips is new; it was that using pipeable stream abstractions is a composable way to build network servers, and node.js is just what we found easiest to express this with.
Having a large number of stream primitives means you can easily wire up endpoints. For example, say you wanted to output a large DB query as XML, consume and edit gigabytes of JSON, or consume, transcode and output a video.
You can by all means write an nginx module in C for each use case, and this is probably the right solution for very HEAVY, specific loads.
But writing a C module is probably a barrier too high for many, whereas implementing a nodejs stream isn't. Respond to a few events, emit a few events and you have a module that can work with the hundreds of other stream abstractions available. (npm search stream)
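For instance, a minimal custom stream is little more than this. A sketch using the later stream.Transform class (early node did the same thing with raw 'data'/'end' events):

var stream = require('stream');

// A trivial transform: upper-cases whatever flows through it.
var upper = new stream.Transform();
upper._transform = function (chunk, encoding, done) {
  this.push(chunk.toString().toUpperCase()); // emit downstream
  done();                                    // ready for the next chunk
};

// It now composes with any other stream:
// process.stdin.pipe(upper).pipe(process.stdout);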
You still need the specific domain knowledge (e.g. how zip headers work), and this is usually the complicated bit. mod_zip looks excellent, and I wonder if some of the domain knowledge of handling zips can be reused in zipstream.
> User decides that he wants to download all the images in this lightbox, so presses "Download Folder". The user is then presented with a list of possible dimensions that they can request.
> The user selects "Large" and "Small" and hits "Download"
> This request gets added to our Gearman job queue.
> The job gets handled and all the files are downloaded from Amazon S3 to a temporary location on the local file server.
> A Zip object is then created and each file is added to the Zip file.
> Once complete, the file is then uploaded back to Amazon S3 in a custom "archives" bucket.
> Before this batch job finishes, I fire off a message to Socket.io / Pusher which sends the URL back to the client who has been waiting patiently for X minutes while his job has been processing.
This works okay for us because when users create "Archives" of their lightboxes, they generally do this because they want to share the files with other people. This means that they attach the URL to emails to send to those people.
So for us, it's actually necessary to save the file back to S3... however, I'm sure that not everyone needs to share the file... it would definitely be worth investigating whether the user plans to return to the archive, in which case implementing streams could potentially save us storage and complexity.
I think you have pretty much described our original ('ghetto') solution with caching ('lipstick').
With streams, there is no need to cache, as recreating the download is dirt cheap. Essentially it's just a few extra header bytes to pad the zip container, on top of the image content bytes that you always have to send anyway.
The use case you mentioned, of sharing the download link, works exactly the same. You send the link, and whoever clicks on it gets an instant download.
True, you are buffering data through your app instead of letting S3 take care of it. But if you're on AWS, S3 to EC2 is free and fast (200mb/s+), and bandwidth out of EC2 costs the same as out of S3. If it goes over an Elastic IP, then a cent more per GB. Your app servers also handle some load, but node.js (or any other evented framework) lives to multiplex IO, with only a few objects' worth of overhead per connection.
In return, you can delete a whole load of cache and job control code. Less code to write, test and maintain.
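The whole handler then boils down to something like this. A sketch assuming a zipstream-style createZip/addFile/finalize API (check the module's README for the exact signatures), with local files standing in for the S3 response streams:

var http = require('http');
var fs = require('fs');
var zipstream = require('zipstream'); // API assumed, see the module's docs

http.createServer(function (request, response) {
  response.writeHead(200, {
    'Content-Type': 'application/zip',
    'Content-Disposition': 'attachment; filename="images.zip"'
  });

  var zip = zipstream.createZip({ level: 1 }); // images are already compressed, so keep this low
  zip.pipe(response);                          // zip bytes flow straight out to the client

  // In dump.ly's case these would be S3 response streams, not local files.
  zip.addFile(fs.createReadStream('one.jpg'), { name: 'one.jpg' }, function () {
    zip.addFile(fs.createReadStream('two.jpg'), { name: 'two.jpg' }, function () {
      zip.finalize(function (written) {
        console.log(written + ' zip bytes sent');
      });
    });
  });
}).listen(8080);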
> With streams, there is no need to cache, as recreating the download is dirt cheap.
The cost when streaming and not streaming should be pretty much the same, unless your non-streaming case is working on-disk (in which case you're comparing extremely different things and the comparison is anything but fair)
Even non-evented ones; the interesting part really is not node itself (despite what the blog says) but the ability to pipeline streams without having to touch every byte yourself.
It should be possible to do something similar using e.g. generators (in Python) or lazy enumerators (in Ruby)
In fact, Python's WSGI handlers return an arbitrary iterable which will be consumed, so that pattern is natively supported (string iterators and generators together, then return the complete pipe, which performs the actual processing as WSGI serializes and sends the response). Ruby would require an adapter to a Rack response of some sort, as I don't think you can reply with an enumerable OOTB.
Yeah, 100% true. I've spent many man-years writing entire systems like this in C and Java.
It's just that in node, doing it the evented way was actually simpler and quicker to implement than the 'ghetto' way. This isn't usually the case, and I always recommend doing the simplest thing that works first. It's just nice that here the simplest thing is also a tight solution.
We use it to mean the quick-and-dirty simple solution that we write first, which is usually good enough but which you're a little embarrassed to admit to. It's not elegant or crafted, but it works.
Our original solution was literally a shell exec, and it was perfectly fine (..for a while)
What is the connection between evented and streaming? It seems like a thread-per-request server would have to do exactly the same thing (except that it would not have to worry about giving back its event loop thread).
Evented streaming doesn't tie up a process waiting for specific data to come in. While the process would otherwise be waiting, other things can get done (other streams processed, etc.).
I used to have this same exact issue while working for a large porn company. We needed to make zips of hundreds of megs of images. We were creating them on the fly to start with, which sucked for all the same reasons mentioned in the blog post. After doing a ton of analysis and not finding a good streaming library that didn't require either C or Java (this is long before Node came along), we realized that as part of the publishing process, we could just create the zip and upload it to the CDN. Problem solved with the minimal amount of complexity.
This is really cool. How are errors handled though? What if you have a transient error to 1 of 50 images -- does that bork the whole download? The user could get a corrupted file.
I'm curious about this as well. While it's all very neat and improves the user experience when everything is working, what happens if things break? If you can't connect to S3 or something, but you've already sent HTTP headers for the ZIP download, what do you do? Throw an error message in a text file inside the ZIP? Send the user an empty ZIP? A corrupted ZIP, as chubot mentions, seems like it would be the worst-case scenario in terms of UX.
I feel like from a UX perspective, it'd be ideal to be able to give some friendly error message to at least acknowledge that the failure is on the server end. A page that says, "Sorry, we're having trouble accessing your files right now. Please try again in a minute," seems more user-friendly to me than a download that suddenly fails with no explanation. Nevertheless, this is very cool.
The important bit is that this is THE core abstraction used in node and the node community. If for no other reason, you should do it (if you're using node) because it's how you hook into the existing libraries.
The main benefit here isn't that it's possible to do this thing, as many people pointed out the myriad ways this is accomplished elsewhere. The key point is that everything that manipulates data, node core as well as the userland libraries, implement the same interface.
That's not really true. Most libraries expose a callback mechanism, where the result of some IO is passed as a Javascript primitive to a callback function that you provide. The Dumply guys used to use an API like that.
The notion of piping the output from some I/O (say, a request to S3) into the input of some other I/O (say, an HTTP response that's currently being written) without ever referencing it is blessed by node, which has a stream type as part of its standard library. But it's far from the most common abstraction of asynchronous work.
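To make the contrast concrete (getImage and getImageStream are hypothetical stand-ins for an S3 client):

// Callback style: the whole body becomes a value before you can touch it.
getImage(key, function (err, body) {
  if (err) { response.statusCode = 500; return response.end(); }
  response.end(body);               // the entire image sits in memory here
});

// Stream style: chunks flow through without ever being referenced.
getImageStream(key).pipe(response);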
I don't know the Play! framework, but the main difference probably is the use of non-blocking IO in node.js, in contrast to blocking IO in the example you just gave. (I'm not saying either is better.)
Will that definitely stream from one to the other without buffering the full file in memory? That's the main benefit of the Node.js streaming approach - it doesn't need to hold the whole thing in memory at any time, it just has to use a few KBs of RAM as a buffer.
I like how this is done, but I do see one problem with this approach for users connected through certain wireless ISPs, such as Verizon, who have all their http image requests automatically degraded to a lower bitrate to save bandwidth. They might think they're getting a usable local copy of their project, when they've actually got ugly, butchered versions of all the assets. That would not have been an issue with the server-side implementation.
The image requests are client side, no? The client requests the images, then streams the responses to a zip file. Wouldn't Verizon Wireless' network management software replace the images requested with low quality versions in this case? If it does, then it may be advisable to keep the old method around as an option when this method is impractical for whatever reason. Maybe there could be client-side code to test whether images are being re-encoded by the ISP (calculate checksum for a known image), and request a zip via the old method if they are.
"The image requests are client side, no? The client requests the images, then streams the responses to a zip file."
No - the image requests happen on the server, which then concatenates them together into a zip file which is served to the client. The client never sees the actual image files, just the resulting zip file.
I'd imagine (well, hope anyway) that the ISP proxies that downscale image files do so based on the HTTP Content-Type header - since the images contained within a zip file would be part of a file with a different Content-Type they should be left alone.
OK, if the server is the one creating the zip file and sending it to the client, then it probably avoids the issue. I was under the impression that the point of all this was to have the client create the zip, not the server.
that's all dandy until you run out of RAM, as everything is done in RAM and nothing to disk. you honestly don't see a scalability issue here? it may be ok for a few thousand concurrent downloads but anything above that will kill it. heck, you might not even get to 1k concurrents, depending on the file size..
He's streaming: Node will only buffer a few KB per connection and push it right out to the downloader. There is absolutely no need to download complete files. That's the beauty of streams and pipes!
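Roughly what source.pipe(destination) is doing for you under the hood (a simplified sketch; the real implementation also handles errors and cleanup):

source.on('data', function (chunk) {
  if (!destination.write(chunk)) { // write() returns false when its buffer is full
    source.pause();                // stop reading, so memory stays at a few KB
  }
});
destination.on('drain', function () {
  source.resume();                 // the downloader has caught up, carry on
});
source.on('end', function () {
  destination.end();
});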
Thanks, I thought the client may have already downloaded some of the full-size photos, so zipping on the client would reduce download sizes. In your UserVoice feedback, there are a few votes for full-size zooming, so this client-side zipping may be useful if you implement zooming. This method also simplifies the server architecture, as the frontend server now only needs to reverse-proxy image requests to S3.
Sure you could use Erlang for that, but what Dharmesh is saying is that building this kind of solution in node.js would definitely be easier and possibly faster to code than writing it with other languages/frameworks.
I'm not sure about that, though: the gain here mostly seems to be the standard "stream" abstraction, and it being implemented (via adapters if needed) by many data-processing utilities, leading to high pipeability (the developer defines the chain, and the runtime handles all the data flow within it).
Many other languages have similar abstractions (Python and generators, for instance: http://www.dabeaz.com/generators/), although using them would likely require more work, as they probably aren't as standardized as far as usage goes.
I mean, in this case it's "faster" to write it because somebody else had already gone through the motions of creating a zipping stream (which they still needed to fork); it's not like node magically did it.
tldr: the node community is re-discovering dataflow, and a few are trying to pass it off as some sort of magical property of node.
Or, to rephrase your point, the node community has built some nice pipeable abstractions in a way that's easier to use than Python (e.g.), and people are making good use of those abstractions. ;-)
Pipelining data isn't new (Unix pipes!), but it's a very elegant architecture for IO-heavy servers, and highly underused because of (perceived) implementation complexity. Many devs don't realise they can create streaming pipelines, and end up with complex systems with multi-level caches and work queues because that's 'how it's done'.
There's nothing specific about node here (architectures scale, not frameworks); I'm just saying that node makes this kind of architecture relatively simple. Use the correct tool for the job.
Also check out lazy Haskell lists, which provide a very powerful (and related) abstraction at the language level.
Node is possibly overhyped, and certainly not a panacea, but that article is silly in its absolute dismissal of the framework. Node is very well-suited for I/O bound TCP/IP applications.