OpenRoss – fast, scalable, on-demand image resizer (lyst.com)
85 points by Peroni on June 23, 2014 | 33 comments



There's also thumbor (https://github.com/thumbor/thumbor). It's a very mature implementation of this type of server and has been very battle-tested (https://github.com/thumbor/thumbor/wiki/Who%27s-using-it).

At globo.com we have nearly a billion images (we are a big portal). Can you imagine pre-generating that many images every time a new format gets added?

We serve everything with thumbor with a Varnish cache in front of it and we're very happy with it. It has enabled our designers to work with any image size they can think of.

If you guys need more info, please check thumbor's docs: https://github.com/thumbor/thumbor/wiki/
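
For illustration, designers don't hand-write those URLs; a helper library builds and signs them. A minimal sketch using libthumbor (the security key and image URL below are placeholders):

    from libthumbor import CryptoURL

    crypto = CryptoURL(key='my-security-key')

    # Any size a designer dreams up is just a different pair of
    # arguments; the server renders it on first request.
    url = crypto.generate(
        width=300,
        height=200,
        smart=True,  # smart cropping: thumbor picks the focal point
        image_url='example.com/media/some-image.jpg',
    )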


We use Thumbor at Yipit, and we are very happy with it. One thing that was not mentioned is extensibility... With thumbor it's easy to create new plugins and filters to extend your installation's feature set.

We detailed how we scaled thumbor at Yipit last year: http://tech.yipit.com/2013/01/03/how-yipit-scales-thumbnaili... The blog post doesn't mention S3, but we have a storage plugin that reads from and writes to S3.
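
For a flavor of that extensibility, a custom filter is just a small Python class. A rough sketch (names are illustrative, and the direct image access shown assumes thumbor's PIL-based engine; see the wiki for the real filter contract):

    from thumbor.filters import BaseFilter, filter_method

    class Filter(BaseFilter):
        @filter_method()
        def grayscale(self):
            # With the PIL engine, self.engine.image is the loaded
            # PIL image, so a filter can transform it in place.
            self.engine.image = self.engine.image.convert('L')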


Does the site get enough traffic to justify storing billions of images?


Definitely... We get around 50M page views/day on our website. And since we are a media company, we need to store images of celebrities, sports, and news in general.


I'm a fan of pilbox [1]. It's "fast" (built on OpenCV + Tornado), easy to understand, and does much more than just image resizing. Also, it should be simpler to deploy than OpenRoss - since it doesn't bake in nginx, you can deploy it to Heroku (which is what we do), or behind Amazon's ELB.

We use it on a fairly large production site, and it works fine. If you throw it behind CloudFront or some other CDN that supports cache busting, it's plenty fast.

1: https://github.com/agschwender/pilbox
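
For a sense of the API, resizing through pilbox is a single HTTP GET. A sketch, assuming a pilbox instance on localhost:8888 (parameter names follow the pilbox README):

    import requests

    # Ask pilbox to fetch the source image, crop-resize it to
    # 300x300, and stream the result back.
    resp = requests.get("http://localhost:8888/", params={
        "url": "http://example.com/photo.jpg",  # source image
        "w": 300,          # target width
        "h": 300,          # target height
        "mode": "crop",    # one of: clip, crop, fill, scale
    })
    with open("photo_300x300.jpg", "wb") as f:
        f.write(resp.content)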


I just built a very simple version of this in about 100 lines of Go, using http://github.com/disintegration/imaging for the image processing.

I was surprised how high the latency of downloading from S3 is, and that it didn't seem to matter whether I was running on Digital Ocean or EC2: downloading the image from S3 was always the slowest part. (Digital Ocean was actually faster!)

It takes about a second to process an image, but the results are cached by CloudFront sitting in front of it, similar to the article.

Any tips for improving the latency from S3?


I've worked with systems that pre-size images on creation, and I'd much rather work with a system like this that resizes and caches on demand.

You never know when you're going to need a new size (the introduction of retina images a few years ago doubled the dimensions of images you need to serve for example, and your design team is likely to come up with new size requirements occasionally as well) and backfilling to resize millions of images is a big pain and costs a lot in terms of storage.


This architecture can definitely save a bundle on storage costs, but it usually runs into slowness when resizing, or into image-quality problems when resizing repeatedly.

22ms for a GraphicsMagick resize is quite fast. I'm curious: what average input and output sizes were used when computing that number?


http://imageshack.us/pages/resize/ -- Our Imagizer cluster does 6 Gbps at the moment; that's 750 MBytes/sec all day long. It's fast and scalable ;)

https://imageshack.com/discover All of the images here are rendered with Imagizer; we store all originals in our HBase/Hadoop cluster, while Imagizer does on-demand transformations.

It works with non-imageshack links too:

http://imagizer.imageshack.us/v2/500x500q90/http://actionfor...


At populr.me, we use a variation on the architecture described here. One difference is that we cache resized images to S3, rather than to an on-disk cache. This lets all servers share the cache. Otherwise, when a new server is brought online, it doesn't benefit from the cache, so for a time every request it receives incurs the most costly path: source image retrieval and resizing.

An added benefit of caching to S3 is that since S3 won't run out of space, we can cache rendered images for longer (we use an S3 lifecycle rule to keep cache expiration simple). The scaled images tend to be smaller than the source images, so retrieval from S3 is pretty fast. Over the past week, retrieving scaled images from S3 has taken ~46ms versus ~84ms for the larger source images.
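
The lifecycle bit is a one-time bucket configuration. A sketch of what such a rule looks like (bucket name, prefix, and the 30-day window are made up; shown with boto3 for brevity):

    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="rendered-image-cache",
        LifecycleConfiguration={
            "Rules": [{
                "ID": "expire-renditions",
                "Filter": {"Prefix": "renditions/"},
                "Status": "Enabled",
                # Cached renditions are deleted automatically; a miss
                # just triggers a re-render from the source image.
                "Expiration": {"Days": 30},
            }]
        },
    )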


Why not skip the backend completely and just use nginx + cdn?

(Have nginx proxy_pass to S3 and image_filter the response)

http://nginx.org/en/docs/http/ngx_http_image_filter_module.h...
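
A sketch of that setup (the bucket name is a placeholder; the directives behave as documented at the link above):

    location ~ ^/img/(?<width>\d+)x(?<height>\d+)/(?<path>.+)$ {
        proxy_pass http://my-bucket.s3.amazonaws.com/$path;
        image_filter resize $width $height;
        image_filter_jpeg_quality 85;
        image_filter_buffer 10M;  # largest source image the filter will read
    }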


The CDN does not have all sizes of images. We use the on-demand resizer+cache because we may modify our website design in the future and need a new image size. Serving the exact image size makes our pages faster to render and saves bandwidth. Plus, that image filter is quite limited and doesn't handle compositing.


I'm curious to know how people feel about offline (pre-transformed) vs. on-demand transformations. Are there any HN'ers out there who have worked on a site with a large set of images and have an opinion on this? Adobe's Scene7 product works in an offline mode as far as I can tell, and seems to have captured a large segment of retail companies with product catalogs.


I worked for a social network, and our system let users upload photos and transformed them into a set of fixed sizes. We had both pre-transformed and on-demand. Pre-transformed was for the images most viewed by users, like the news-feed photo (720x720), the large photo (1024x768), and the original (if the user's screen was detected as a big screen); those we have to resize ASAP. Other sizes, like thumbnails, we transform on demand using nginx resize filter plugins, with caching in Varnish and/or Traffic Server. That system has been working well to this day. I would say on-demand transformation is a good idea, since you don't have to store resized images that are never viewed by any user, so you save storage. But the idea must be implemented well, very well, if you're going to serve millions of users.


Based on all the links to similar apps in this thread, writing an image-resizing service to sit behind a CDN seems to be a rite of passage. My creation from a couple years ago: https://bitbucket.org/btubbs/thumpy/src

It's still running happily in production.


I would love to see a golang version of this.


I wrote something similar during the GopherCon hack day, maybe it's useful: https://github.com/ericflo/slimgfast (disclaimer: haven't run it in production)



As a piece of healthy feedback: the correct architecture sits right under your nose, since you already have all of the components.

You should pre-process images in your scraping cycles, and not when a client comes to request it.

In this way, your "scale" is always predefined, bounded, expected, and much smaller - defined by your scraping scale and not user scale.

Good luck!


Isn't that what they started with, and decided against? "In our infancy, we saved all product images with 10 preset sizes, and then rendered the image which was nearest in size to what we required. As we grew, this solution became unwieldy for the levels of traffic we were experiencing and nor was it appropriate for our mobile app."


They say that, but there's one giant ", because..." missing from that paragraph.

They never explain why they render in 10 sizes (why don't they know which sizes they need), or why loading one of those sizes in a mobile app isn't feasible.

It's stated like that and left hanging in the air.

Either their use case is very weird, or they're deliberately vague for some reason only they know about.

Also, to clarify why 10 sizes is weird: normally you'd double the width/height of your images with each size step (so quadrupling the pixels), the same as one does with mip-maps. Or with icons.

Let's say the smallest sensible size is 128x128 (minimum needed to discern a product, and easy to downsize from there, won't get much smaller at good quality).

So we have 128, 256, 512, 1024, 2048, that's 5 sizes. And I'm pretty sure the images they need aren't over 2048x.
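
And as with mip-maps, the whole ladder is cheap relative to the largest size. A quick check, assuming square images for simplicity:

    # Doubling each edge quadruples the pixel count, so all the smaller
    # sizes together cost about 1/3 extra on top of the largest one.
    sizes = [128 * 2 ** i for i in range(5)]   # [128, 256, 512, 1024, 2048]
    pixels = [s * s for s in sizes]
    print(sum(pixels[:-1]) / pixels[-1])       # ~0.332, i.e. about 1/3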

So 10 sizes is just pointless.

And with OpenRoss, doing cropping and adding whitespace at the server, producing pointless image duplicates, is senseless. Even the most crippled client-side technique can do the cropping and whitespace for you (yeah, even html).

This screams "we didn't think this through much".


When the site was originally designed, there were only 4 or 5 sizes, one for the feed, one for the product page, one for related products, one for thumbnails, etc.

As the design changed, extra sizes got added to the image processing, and as with most early-stage start-ups, it worked, so there was no need to fix it.

We eventually ended up in the position of having 2.5M products all with preprocessed images of certain sizes from our design history. As we wanted more flexibility with design, but also knew that a lot of our images would have a very low likelihood of being accessed (fashion items have single runs and are never remade in future seasons), a big batch process didn't seem appropriate. Additionally, it would mean storing several different copies of images in our S3 bucket, even if we knew the product would not likely be seen again.

A more attractive solution (at least to us) was to do the hybrid approach, where we would resize on demand, and then cache for a long time. This way, we only do the processing for images that need it, in almost a functionally identical way to large scale batch processing, but the process is demand-led.
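
The hot path is roughly this shape (a sketch only: bucket and path names are made up, and Pillow here stands in for the GraphicsMagick step OpenRoss actually uses behind nginx):

    import os
    from io import BytesIO

    import boto3
    from PIL import Image

    CACHE_DIR = "/var/cache/images"
    s3 = boto3.client("s3")

    def resized_image(key, width, height):
        cache_path = os.path.join(CACHE_DIR, f"{width}x{height}", key)
        if os.path.exists(cache_path):
            return cache_path  # cache hit: no S3 fetch, no resize
        # Miss: do the expensive work once, then cache for a long time.
        obj = s3.get_object(Bucket="product-images", Key=key)
        img = Image.open(BytesIO(obj["Body"].read()))
        img.thumbnail((width, height))  # shrink in place, keeping aspect ratio
        os.makedirs(os.path.dirname(cache_path), exist_ok=True)
        img.save(cache_path)
        return cache_path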


But the nice thing is using a storage solution where you don't pay for what you don't use.


It's a chicken/egg scenario. How does the resource cost of resizing on-demand compare to storing all sizes? Also, like @mantraxC said, are all those image sizes really needed?

I have a system I'm trying to sunset that has both a 160 and a 130 size. Totally redundant, and it doesn't really save that much space/bandwidth.

Still, OpenRoss is a cool project to learn from. It might not fit many use-cases, but apparently it works for Lyst.


I have had this exact problem/need over time on several projects, and it doesn't mean "we didn't think this through much". Products change and grow, and eventually there's a need for tens of different images. First it starts with a thumbnail or two, then composite images, and so on. Not to mention you need 2x images for retina devices.

Bottom line: over time you tend to understand that there is great flexibility in this type of architecture.


"This screams "we didn't think this through much"."

That's actually quite rude.


"Even the most crippled client-side technique can do the cropping and whitespace for you (yeah, even html)."

Facebook, Twitter and Google Plus all have a feature where a URL gets expanded out to a "rich preview" (Twitter Product Cards for example are particularly relevant to a site like Lyst: https://dev.twitter.com/docs/cards/types/product-card )

These all work best with an image that has been pre-cropped and resized.

As an added bonus, a new one of these emerges every now and then - with a different size requirement.


You should definitely limit the sizes that can be generated; otherwise, with URLs like http://host/WIDTH/HEIGHT/MODE/path/to/image, it's pretty easy to mildly DDoS you.
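
A sketch of that mitigation: whitelist the size pairs you actually use and reject everything else before doing any work (the sizes here are made up):

    # Only sizes our templates actually request; anything else is
    # refused before any fetching or resizing happens.
    ALLOWED_SIZES = {(128, 128), (256, 256), (512, 512), (1024, 1024)}

    def validate_size(width, height):
        if (width, height) not in ALLOWED_SIZES:
            raise ValueError(f"size {width}x{height} is not served")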


At first I thought OpenRoss was a new competitor to GraphicsMagick and ImageMagick, but it's just a Twisted plugin. The title should be changed.


The title is fine. It just says image resizing. Image/GraphicsMagick are a whole different ballgame.


Looks like you built your own http://cloudinary.com but with fewer features.


At scale, processing thousands of highly granular items (images) in any way is an embarrassingly parallel task even in its most crude and naive implementation.

So claiming "fast", ok, but claiming "scalable" feels like a redundant buzzword, a bit hand-wavey.

When you make such generic claims, be quick to explain what you mean, or people will not treat you seriously.


Yeah, that's what I was thinking. There is no state to be managed. It's just a pipeline that doesn't need anything more than more machines to handle more load. It's relatively easy to implement.



