> The only downside is that it’s conceptually more complicated, and requires some understanding of underlying components (zip files, http responses, streams).
There's at least one more downside: the user loses all indication of progress as the Content-Length is unknown when the headers are sent
dumply knows the exact size of each image, since it's saved in the DB on upload, and all the zip headers are a fixed number of bytes, so the zip file size is deterministic and calculable even before the first byte is sent. Remember, we don't compress the already-compressed images.
If you didn't know the file sizes, for example if you had raw input streams of unknown length, or compressible data, you could still guesstimate the Content-Length so the user gets some progress bar, even if it isn't 100% accurate.
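When the sizes are known, the exact figure is simple arithmetic. A rough sketch, assuming plain stored (uncompressed) entries with no zip64, extra fields, data descriptors or comments:

// Rough sketch: exact size of a zip whose entries are stored, not deflated.
function zipContentLength(files) {        // files: [{ name: 'photo1.jpg', size: 1482391 }, ...]
  var total = 22;                         // end-of-central-directory record
  files.forEach(function (file) {
    var nameLength = Buffer.byteLength(file.name);
    total += 30 + nameLength + file.size; // local file header + stored data
    total += 46 + nameLength;             // central directory entry
  });
  return total;
}

// response.setHeader('Content-Length', zipContentLength(images));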
Looks like, using browser sniffing, you can deliver an exaggerated Content-Length to everyone but Opera, and browsers will deal with it gracefully. Pretty neat. (Obviously not desirable, since it violates the HTTP spec, but the UX gains might be worth it.)
You're not always dealing with the client that is specified in the UA string. Clients can use proxies, including transparent proxies.
For example, major mobile operators pipe HTTP connections through a proxy that recompresses images. In that case you see e.g. Safari's or Opera's UA string, but you're actually dealing with the proxy's HTTP behavior.
In dump.ly's use case, they would probably want client-side detection in JavaScript, not the UA string. You definitely have to be conservative implementing such an unexpected feature.
Client-side detection in JavaScript and the UA string aren't mutually exclusive, since the UA string is exposed in the DOM (navigator.userAgent).
Are you suggesting they would rather guess the browser by inferring it from the presence of DOM properties and methods?
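For the Opera special case above, the two approaches can even be combined. A rough sketch, assuming Presto-era Opera (which exposes a window.opera object):

// Client-side check before requesting the download with a padded Content-Length.
function isOpera() {
  return typeof window.opera !== 'undefined' ||  // object/feature sniffing
         /\bOpera\b/.test(navigator.userAgent);  // UA string as a fallback
}

if (!isOpera()) {
  // request the variant that sends an estimated Content-Length
}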
In the end it comes down to: would you rather a) break the functionality for some users in exchange for giving the best possible solution to others, or b) give an OK experience to everyone?
I usually go with "b". I think that frustration is much more powerful than awe.
Wouldn't guesstimation cause problems on the user's end? I don't know, but how do browsers/curl/wget/crawlers/... react if you tell them the Content-Length is 1000 bytes, then send them 900 bytes and close the connection? Or overshoot and send 1100? I have a feeling they wouldn't like it, or at least that it varies between fetcher library implementations.
RogerE states above that under-estimating does lead to the resource being truncated, and over-estimating leads to the browser waiting for a timeout or connection break in case there's more data to be fetched.
You can guess, but browsers are quite picky about this header --- guess any amount too low and the download is truncated. Guess too high, and the browser will wait for an amount of time just to be sure the server isn't going to send more data.
Note that it is possible to "always guess long" and pad out with null bytes... this usually works (many file format parsers aren't too picky). However, this is more of a practical work-around than a recommended solution!
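For what it's worth, a rough sketch of that work-around in node, purely illustrative (zip stands for the readable zip stream, and estimateSize is a made-up helper):

// Deliberately over-estimate, then zero-pad to match the promised length.
var guessed = Math.ceil(estimateSize(files) * 1.05);
response.setHeader('Content-Length', guessed);

var sent = 0;
zip.on('data', function (chunk) {
  sent += chunk.length;
  response.write(chunk);
});
zip.on('end', function () {
  if (sent < guessed) response.write(Buffer.alloc(guessed - sent)); // null-byte padding
  response.end();
});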
I was playing around and did something similar with video encoding. The server code starts a running ffmpeg process, and then the handler code just looks like this:
var http = require('http');

var server = http.createServer(function(request, response) {
  // ffmpeg was spawned earlier; pipe the upload in and the encoded output straight back out
  request.pipe(ffmpeg.process.stdin);
  ffmpeg.process.stdout.pipe(response);
});
What a nice interface! The end result is that you can do weird stuff like:
The module is quite mature at this point and is used in production on many websites (including Box.net, which commissioned the initial work). It supports the Content-Length header, Range and If-Range requests, Zip64 for large archives, and filename transcoding with iconv. Being written in C, it will probably use much less RAM than an equivalent Node.js module.
I have found that the hardest part of generating ZIP files on the fly has nothing to do with network programming; it's producing files that open correctly on all platforms, including Mac OS X's busted-ass BOMArchiveHelper.app.
The point wasn't that creating on-the-fly zips is new; it was that using pipeable stream abstractions is a composable way to build network servers, and node.js is just what we found easiest to express this with.
Having a large number of stream primitives means you can easily wire up endpoints. For example, say you wanted to output a large DB query as XML, consume and edit gigabytes of JSON, or consume, transcode and output a video.
You can by all means write an nginx module in C for each use case, and this is probably the right solution for very HEAVY, specific loads.
But writing a C module is probably a barrier too high for many, whereas implementing a nodejs stream isn't. Respond to a few events, emit a few events and you have a module that can work with the hundreds of other stream abstractions available. (npm search stream)
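For instance, a minimal custom stream is little more than this. A sketch using the later stream.Transform class (early node did the same thing with raw 'data'/'end' events):

var stream = require('stream');

// A trivial transform: upper-cases whatever flows through it.
var upper = new stream.Transform();
upper._transform = function (chunk, encoding, done) {
  this.push(chunk.toString().toUpperCase()); // emit downstream
  done();                                    // ready for the next chunk
};

// It now composes with any other stream:
// process.stdin.pipe(upper).pipe(process.stdout);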
You still need the specific domain knowledge (e.g. how zip headers work), and this is usually the complicated bit. mod_zip looks excellent, and I wonder if some of the domain knowledge of handling zips can be reused in zipstream.
> User decides that he wants to download all the images in this lightbox, so presses "Download Folder". The user is then presented with a list of possible dimensions that they can request.
> The user selects "Large" and "Small" and hits "Download"
> This request gets added to our Gearman job queue.
> The job gets handled and all the files are downloaded from Amazon S3 to a temporary location on the local file server.
> A Zip object is then created and each file is added to the Zip file.
> Once complete, the file is then uploaded back to Amazon S3 in a custom "archives" bucket.
> Before this batch job finishes, I fire off a message to Socket.io / Pusher which sends the URL back to the client who has been waiting patiently for X minutes while his job has been processing.
This works okay for us because when users create "Archives" of their lightboxes, they generally do this because they want to share the files with other people. This means that they attach the URL to emails to send to those people.
So for us, it's actually necessary to save the file back to S3... however, I'm sure that not everyone needs to share the file... it would definitely be worth investigating whether the user plans to return to the archive, in which case implementing streams could potentially save us storage and complexity.
I think you have pretty much described our original ('ghetto') solution with caching ('lipstick').
With streams, there is no need to cache, as recreating the download is dirt cheap. Essentially it's just a few extra header bytes to pad the zip container, on top of the image content bytes that you always have to send anyway.
The use case you mentioned, of sharing the download link, works exactly the same. You send the link, and whoever clicks on it gets an instant download.
True, you are buffering data through your app instead of letting S3 take care of it. But if you're on AWS, S3 to EC2 is free and fast (200mb/s+), and bandwidth out of EC2 costs the same as out of S3. If it goes over an Elastic IP, then a cent more per GB. Your app servers also handle some load, but node.js (or any other evented framework) lives to multiplex IO, with only a few objects' worth of overhead per connection.
In return, you can delete a whole load of cache and job control code. Less code to write, test and maintain.
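The whole handler then boils down to something like this. A sketch assuming a zipstream-style createZip/addFile/finalize API (check the module's README for the exact signatures), with local files standing in for the S3 response streams:

var http = require('http');
var fs = require('fs');
var zipstream = require('zipstream'); // API assumed, see the module's docs

http.createServer(function (request, response) {
  response.writeHead(200, {
    'Content-Type': 'application/zip',
    'Content-Disposition': 'attachment; filename="images.zip"'
  });

  var zip = zipstream.createZip({ level: 1 }); // images are already compressed, so keep this low
  zip.pipe(response);                          // zip bytes flow straight out to the client

  // In dump.ly's case these would be S3 response streams, not local files.
  zip.addFile(fs.createReadStream('one.jpg'), { name: 'one.jpg' }, function () {
    zip.addFile(fs.createReadStream('two.jpg'), { name: 'two.jpg' }, function () {
      zip.finalize(function (written) {
        console.log(written + ' zip bytes sent');
      });
    });
  });
}).listen(8080);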
> With streams, there is no need to cache, as recreating the download is dirt cheap.
The cost when streaming and not streaming should be pretty much the same, unless your non-streaming case is working on-disk (in which case you're comparing extremely different things and the comparison is anything but fair)
Even non-evented ones; the interesting part really is not node itself (despite what the blog says) but the ability to pipeline streams without having to touch every byte yourself.
It should be possible to do something similar using e.g. generators (in Python) or lazy enumerators (in Ruby)
In fact, Python's WSGI handlers return an arbitrary iterable which will be consumed, so that pattern is natively supported (string iterators and generators together, then return the complete pipe, which performs the actual processing as WSGI serializes and sends the response). Ruby would require an adapter to a Rack response of some sort, as I don't think you can reply with an enumerable OOTB.
Yeah, 100% true. I've spent many man-years writing entire systems like this in C and Java.
It's just that in node, doing it the evented way was actually simpler and quicker to implement than the 'ghetto' way. This isn't usually the case, and I always recommend doing the simplest thing that works first. It's just nice that here the simplest thing is also a tight solution.
We use it to mean the quick-and-dirty simple solution that we write first, which is usually good enough but which you're a little embarrassed to admit to. It's not elegant or crafted, but it works.
Our original solution was literally a shell exec, and it was perfectly fine (..for a while)
What is the connection between evented and streaming? It seems like a thread-per-request server would have to do exactly the same thing (except that it would not have to worry about giving back its event loop thread).
Evented streaming doesn't tie up a process waiting for specific data to come in. While the process would otherwise be waiting, other things can get done (other streams processed, etc.).
I used to have this same exact issue while working for a large porn company. We needed to make zips of hundreds of megs of images. We were creating them on the fly to start with, which sucked for all the same reasons mentioned in the blog post. After doing a ton of analysis and not finding a good streaming library that didn't require either C or Java (this is long before Node came along), we realized that as part of the publishing process, we could just create the zip and upload it to the CDN. Problem solved with the minimal amount of complexity.
This is really cool. How are errors handled though? What if you have a transient error to 1 of 50 images -- does that bork the whole download? The user could get a corrupted file.
I'm curious about this as well. While it's all very neat and improves the user experience when everything is working, what happens if things break? If you can't connect to S3 or something, but you've already sent HTTP headers for the ZIP download, what do you do? Throw an error message in a text file inside the ZIP? Send the user an empty ZIP? A corrupted ZIP, as chubot mentions, seems like it would be the worst-case scenario in terms of UX.
I feel like from a UX perspective, it'd be ideal to be able to give some friendly error message to at least acknowledge that the failure is on the server end. A page that says, "Sorry, we're having trouble accessing your files right now. Please try again in a minute," seems more user-friendly to me than a download that suddenly fails with no explanation. Nevertheless, this is very cool.
The important bit is that this is THE core abstraction used in node and the node community. If for no other reason, you should do it (if you're using node) because it's how you hook into the existing libraries.
The main benefit here isn't that it's possible to do this thing, as many people pointed out the myriad ways this is accomplished elsewhere. The key point is that everything that manipulates data, node core as well as the userland libraries, implement the same interface.
That's not really true. Most libraries expose a callback mechanism, where the result of some IO is passed as a Javascript primitive to a callback function that you provide. The Dumply guys used to use an API like that.
The notion of piping the output from some I/O (say, a request to S3) into the input of some other I/O (say, an HTTP response that's currently being written) without ever referencing it is blessed by node, which has a stream type as part of its standard library. But it's far from the most common abstraction of asynchronous work.
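To make the contrast concrete (getImage and getImageStream are hypothetical stand-ins for an S3 client):

// Callback style: the whole body becomes a value before you can touch it.
getImage(key, function (err, body) {
  if (err) { response.statusCode = 500; return response.end(); }
  response.end(body);               // the entire image sits in memory here
});

// Stream style: chunks flow through without ever being referenced.
getImageStream(key).pipe(response);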
I don't know the Play! framework, but the main difference probably is the use of non-blocking IO in node.js, in contrast to blocking IO in the example you just gave. (I'm not saying either is better.)
Will that definitely stream from one to the other without buffering the full file in memory? That's the main benefit of the Node.js streaming approach - it doesn't need to hold the whole thing in memory at any time, it just has to use a few KBs of RAM as a buffer.
I like how this is done, but I do see one problem with this approach for users connected through certain wireless ISPs, such as Verizon, who have all their http image requests automatically degraded to a lower bitrate to save bandwidth. They might think they're getting a usable local copy of their project, when they've actually got ugly, butchered versions of all the assets. That would not have been an issue with the server-side implementation.
The image requests are client side, no? The client requests the images, then streams the responses to a zip file. Wouldn't Verizon Wireless' network management software replace the images requested with low quality versions in this case? If it does, then it may be advisable to keep the old method around as an option when this method is impractical for whatever reason. Maybe there could be client-side code to test whether images are being re-encoded by the ISP (calculate checksum for a known image), and request a zip via the old method if they are.
"The image requests are client side, no? The client requests the images, then streams the responses to a zip file."
No - the image requests happen on the server, which then concatenates them together into a zip file which is served to the client. The client never sees the actual image files, just the resulting zip file.
I'd imagine (well, hope anyway) that the ISP proxies that downscale image files do so based on the HTTP Content-Type header - since the images contained within a zip file would be part of a file with a different Content-Type they should be left alone.
OK, if the server is the one creating the zip file and sending it to the client, then it probably avoids the issue. I was under the impression that the point of all this was to have the client create the zip, not the server.
that's all dandy until you run out of RAM, as everything is done in RAM and nothing to disk. you honestly don't see a scalability issue here? it may be ok for a few thousand concurrent downloads but anything above that will kill it. heck, you might not even get to 1k concurrents, depending on the file size..
He's streaming: Node will only buffer a few KB per connection and push it right out to the downloader. There is absolutely no need to download complete files. That's the beauty of streams and pipes!
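Roughly what source.pipe(destination) is doing for you under the hood (a simplified sketch; the real implementation also handles errors and cleanup):

source.on('data', function (chunk) {
  if (!destination.write(chunk)) { // write() returns false when its buffer is full
    source.pause();                // stop reading, so memory stays at a few KB
  }
});
destination.on('drain', function () {
  source.resume();                 // the downloader has caught up, carry on
});
source.on('end', function () {
  destination.end();
});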
Thanks, I thought the client may have already downloaded some of the full-size photos, so zipping on the client would reduce download sizes. In your UserVoice feedback, there are a few votes for full-size zooming, so this client-side zipping may be useful if you implement zooming. This method also simplifies the server architecture, as the frontend server now only needs to reverse-proxy image requests to S3.
Sure you could use Erlang for that, but what Dharmesh is saying is that building this kind of solution in node.js would definitely be easier and possibly faster to code than writing it with other languages/frameworks.
I'm not sure about that, though: the gain here mostly seems to be the standard "stream" abstraction, and it being implemented (via adapters if needed) by many data-processing utilities, leading to high pipeability (the developer defines the chain, and the runtime handles all the data flow within it).
Many other languages have similar abstractions (Python and generators, for instance: http://www.dabeaz.com/generators/), although using them would likely require more work, as they probably aren't as standardized as far as usage goes.
I mean, in this case it's "faster" to write it because somebody else had already gone through the motions of creating a zipping stream (which they still needed to fork); it's not like node magically did it.
tldr: the node community is re-discovering dataflow, and a few are trying to pass it off as some sort of magical property of node.
Or, to rephrase your point, the node community has built some nice pipeable abstractions in a way that's easier to use than Python (e.g.), and people are making good use of those abstractions. ;-)
Pipelining data isn't new (Unix pipes!), but it's a very elegant architecture for IO-heavy servers, and highly underused because of (perceived) implementation complexity. Many devs don't realise they can create streaming pipelines, and end up with complex systems with multi-level caches and work queues because that's 'how it's done'.
There's nothing specific about node here (architectures scale, not frameworks); I'm just saying that node makes this kind of architecture relatively simple. Use the correct tool for the job.
Also check out lazy Haskell lists, which provide a very powerful (and related) abstraction at the language level.
Node is possibly overhyped, and certainly not a panacea, but that article is silly in its absolute dismissal of the framework. Node is very well-suited for I/O bound TCP/IP applications.