At globo.com we have nearly a billion images (we are a big portal). Can you imagine pre-generating that many images every time a new format gets added?
We serve everything with thumbor with a Varnish cache in front of it and we're very happy with it. It has enabled our designers to work with any image size they can think of.
We use Thumbor at Yipit, and we are very happy with it. One thing that was not mentioned was extensibility... With thumbor it's easy to create new plugins and filters to extend your installation's feature set.
Definitely... We get around 50M page views/day on our website. And since we are a media company, we need to store images of celebrities, sports and news in general.
I'm a fan of pilbox [1]. It's "fast" (builds off of OpenCV + Tornado), easy to understand and does much more than just image resizing. Also, it should be simpler to deploy than OpenRoss - since it doesn't bake in nginx, you can deploy it to Heroku (which is what we do), or behind Amazon's ELB.
We use it on a fairly large production site, and it works fine. If you throw it behind CloudFront or some other CDN that supports cache busting, it's plenty fast.
I was surprised at how much latency there is when downloading from S3, and how it didn't seem to matter whether I was running on Digital Ocean or EC2; downloading the image from S3 was always the slowest part. (Digital Ocean was actually faster!)
It takes about a second to process an image, but the result is cached by CloudFront sitting in front of it, similar to the article.
I've worked with systems that pre-size images on creation, and I'd much rather work with a system like this that resizes and caches on demand.
You never know when you're going to need a new size (the introduction of retina images a few years ago doubled the dimensions of the images you need to serve, for example, and your design team is likely to come up with new size requirements occasionally as well), and backfilling to resize millions of images is a big pain and costs a lot in terms of storage.
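A minimal sketch of what I mean by resize-and-cache on demand (the paths and cache layout are made up, and Pillow stands in for whatever image library you actually use; in practice a CDN or Varnish sits in front of this):

    # Resize on the first request for a given size, then serve from the cache.
    from io import BytesIO
    from pathlib import Path

    from PIL import Image  # pip install Pillow

    SOURCE_DIR = Path("/data/originals")     # hypothetical source images
    CACHE_DIR = Path("/data/resized-cache")  # hypothetical on-disk cache

    def get_resized(name: str, width: int, height: int) -> bytes:
        """Return the resized image, generating and caching it on first request."""
        cached = CACHE_DIR / f"{width}x{height}" / name
        if cached.exists():
            return cached.read_bytes()

        # Cache miss: load the original, resize, and store the result for next time.
        with Image.open(SOURCE_DIR / name) as img:
            img = img.convert("RGB")        # JPEG output can't carry an alpha channel
            img.thumbnail((width, height))  # in-place, preserves aspect ratio
            buf = BytesIO()
            img.save(buf, format="JPEG", quality=85)

        cached.parent.mkdir(parents=True, exist_ok=True)
        cached.write_bytes(buf.getvalue())
        return buf.getvalue()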
This architecture can definitely save a bundle on storage costs, but it usually runs into slowness when resizing, or into image-quality problems when resizing repeatedly.
22ms for a GraphicsMagick resize is quite fast. I'm curious what average input and output sizes were used when computing that number?
http://imageshack.us/pages/resize/ -- Our Imagizer cluster does 6 Gbps at the moment; that's 750 MBytes/sec all day long. It's fast and scalable ;)
https://imageshack.com/discover All of the images here are rendered with Imagizer; we store all originals in our HBase/Hadoop cluster, while Imagizer does on-demand transformations.
At populr.me, we use a variation on the architecture described here. One difference is that we cache resized images to S3, rather than to on-disk cache. This enables all servers to share the cache. Otherwise, when a new server is brought online, it doesn't benefit from the cache, so for a time, every request it receives incurs the most costly path of source image retrieval and resizing.
An added benefit to caching to S3 is that since S3 won't run out of space, we can cache rendered images for longer (we use S3 lifecycle rules to keep cache expiration simple). The scaled images tend to be smaller than the source images, so retrieval from S3 is pretty fast. Over the past week, retrieving scaled images from S3 has taken ~46ms, versus ~84ms for the larger source images.
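For what it's worth, a rough sketch of that kind of S3-backed cache (the bucket names and key scheme here are invented, and expiry is left to an S3 lifecycle rule on the cache bucket rather than handled in code):

    # Get-or-render path with the rendered sizes cached in S3, so every
    # app server shares the same cache.
    from io import BytesIO

    import boto3           # pip install boto3
    from PIL import Image  # pip install Pillow

    s3 = boto3.client("s3")
    SOURCE_BUCKET = "my-source-images"   # made-up bucket names
    CACHE_BUCKET = "my-resized-images"   # lifecycle rule expires old renditions

    def get_resized(key: str, width: int, height: int) -> bytes:
        cache_key = f"{width}x{height}/{key}"
        try:
            # Fast path: another server may already have rendered this size.
            return s3.get_object(Bucket=CACHE_BUCKET, Key=cache_key)["Body"].read()
        except s3.exceptions.NoSuchKey:
            pass

        # Miss: fetch the larger, slower original, resize, write back to the cache.
        original = s3.get_object(Bucket=SOURCE_BUCKET, Key=key)["Body"].read()
        with Image.open(BytesIO(original)) as img:
            img = img.convert("RGB")
            img.thumbnail((width, height))
            buf = BytesIO()
            img.save(buf, format="JPEG", quality=85)

        s3.put_object(Bucket=CACHE_BUCKET, Key=cache_key,
                      Body=buf.getvalue(), ContentType="image/jpeg")
        return buf.getvalue()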
The CDN does not have all sizes of images. We use the ondemand resizer+cache because we may modify our website design in the future and need a new image size. Serving the exact image size makes our pages faster to render and saves bandwidth. Plus that image filter is quite limited and doesn't handle compositing.
I'm curious to know how people feel about offline (pre-transformed) vs. on-demand transformations. Are there any HN'ers out there that have worked on a site with a large set of images, and have an opinion on this? Adobe's Scene7 product works in an offline mode as far as I can tell, and seems to have captured a large segment of retail companies with product catalogs.
I worked for a social network where our system let users upload photos and then transformed them into several fixed sizes. We had both pre-transformed and on-demand resizing.
We pre-transform the sizes users view most, like the news feed photo (720x720), the large photo (1024x768), and the original (if the user's screen is detected as a big screen); those we have to resize ASAP. Other sizes, like thumbnails, we transform on demand using nginx's resize filter module, with caching in Varnish and/or Traffic Server.
That system has been working well up to now.
I would say on-demand transformation is a good idea, since you don't have to store resized images that are never viewed by any user, so you save storage. But the idea must be implemented well, very well, if you're going to serve millions of users.
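A toy sketch of the split described above (the upload hook and file layout are made up; the sizes are the ones I mentioned):

    # Pre-generate only the most-viewed sizes at upload time; everything else
    # (thumbnails etc.) is resized on demand and cached by Varnish/Traffic Server.
    from PIL import Image  # pip install Pillow

    HOT_SIZES = [(720, 720), (1024, 768)]  # the sizes almost every user sees

    def on_upload(path: str) -> None:
        """Hypothetical hook: render the hot sizes as soon as a photo is uploaded."""
        with Image.open(path) as img:
            img = img.convert("RGB")
            for w, h in HOT_SIZES:
                copy = img.copy()
                copy.thumbnail((w, h))  # in-place, keeps aspect ratio
                copy.save(f"{path}.{w}x{h}.jpg", format="JPEG", quality=85)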
Based on all the links to similar apps in this thread, writing an image-resizing service to sit behind a CDN seems to be a rite of passage. My creation from a couple of years ago: https://bitbucket.org/btubbs/thumpy/src
I wrote something similar during the GopherCon hack day, maybe it's useful: https://github.com/ericflo/slimgfast (disclaimer: haven't run it in production)
Isn't that what they started with, and decided against? "In our infancy, we saved all product images with 10 preset sizes, and then rendered the image which was nearest in size to what we required. As we grew, this solution became unwieldy for the levels of traffic we were experiencing and nor was it appropriate for our mobile app."
They say that, but there's one giant missing ", because..." in that paragraph.
They never explain why they render in 10 sizes (why don't they know which sizes they need), or why loading one of those sizes in a mobile app isn't feasible.
It's stated like that and left hanging in the air.
Either their use case is very weird, or they're deliberately vague for some reason only they know about.
Also to clarify why 10 sizes is weird, normally what you'd do is double the width/height of your images with every size (so quadruple the pixels), same as one does with mip-maps. Or with icons.
Let's say the smallest sensible size is 128x128 (minimum needed to discern a product, and easy to downsize from there, won't get much smaller at good quality).
So we have 128, 256, 512, 1024, 2048, that's 5 sizes. And I'm pretty sure the images they need aren't over 2048x.
So 10 sizes is just pointless.
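For what it's worth, a toy illustration of that ladder (the sizes are the ones above; the idea is that any requested size gets served from the nearest rung at or above it and downscaled from there):

    # Power-of-two size ladder: pre-generate only these rungs and downscale
    # from the smallest rung that is at least as large as the requested size.
    LADDER = [128, 256, 512, 1024, 2048]

    def pick_size(requested: int) -> int:
        for size in LADDER:
            if size >= requested:
                return size
        return LADDER[-1]  # cap at the largest rung we keep

    assert pick_size(150) == 256
    assert pick_size(128) == 128
    assert pick_size(3000) == 2048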
And with OpenRoss, doing cropping and adding whitespace at the server, producing pointless image duplicates, is senseless. Even the most crippled client-side technique can do the cropping and whitespace for you (yeah, even html).
When the site was originally designed, there were only 4 or 5 sizes, one for the feed, one for the product page, one for related products, one for thumbnails, etc.
As the design changed, extra sizes got added to the image processing, and as with most early-stage start-ups, it worked, so there was no need to fix it.
We eventually ended up in the position of having 2.5M products all with preprocessed images of certain sizes from our design history. As we wanted more flexibility with design, but also knew that a lot of our images would have a very low likelihood of being accessed (fashion items have single runs and are never remade in future seasons), a big batch process didn't seem appropriate. Additionally, it would mean storing several different copies of images in our S3 bucket, even if we knew the product would not likely be seen again.
A more attractive solution (at least to us) was to do the hybrid approach, where we would resize on demand, and then cache for a long time. This way, we only do the processing for images that need it, in almost a functionally identical way to large scale batch processing, but the process is demand-led.
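To make the "resize on demand, then cache for a long time" part concrete, here is a rough, hypothetical sketch (the framework, URL scheme and cache lifetime are assumptions, not our actual stack); the long Cache-Control header is what lets the CDN in front absorb repeat requests:

    # Demand-led resize endpoint; a CDN (CloudFront, Varnish, ...) in front
    # caches the response, so each size is only ever rendered once.
    from io import BytesIO

    from flask import Flask, Response  # pip install Flask
    from PIL import Image              # pip install Pillow

    app = Flask(__name__)
    SOURCE_DIR = "/data/originals"     # hypothetical location of the originals

    @app.route("/resize/<int:width>/<int:height>/<path:name>")
    def resized(width: int, height: int, name: str) -> Response:
        with Image.open(f"{SOURCE_DIR}/{name}") as img:
            img = img.convert("RGB")
            img.thumbnail((width, height))
            buf = BytesIO()
            img.save(buf, format="JPEG", quality=85)
        resp = Response(buf.getvalue(), mimetype="image/jpeg")
        # Cache "for a long time": a year here, so repeats never hit this code.
        resp.headers["Cache-Control"] = "public, max-age=31536000"
        return resp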
It's a chicken/egg scenario. How does the resource cost of resizing on-demand compare to storing all sizes? Also, like @mantraxC said, are all those image sizes really needed?
I have a system I'm trying to sunset that has a 160 and 130 size. Totally redundant, and doesn't really save that much space/bandwidth.
Still, OpenRoss is a cool project to learn from. It might not fit many use-cases, but apparently it works for Lyst.
I have had this exact problem/need over time on several projects, and it doesn't mean "we didn't think this through much". Products change and grow, and there becomes a need for tens of different images. First it starts with a thumbnail or two, then composite images, and so on. Not to mention you need 2x images for retina devices.
Bottom line: over time you tend to understand that there is great flexibility in this type of architecture.
"Even the most crippled client-side technique can do the cropping and whitespace for you (yeah, even html)."
Facebook, Twitter and Google Plus all have a feature where a URL gets expanded out to a "rich preview" (Twitter Product Cards for example are particularly relevant to a site like Lyst: https://dev.twitter.com/docs/cards/types/product-card )
These all work best with an image that has been pre-cropped and resized.
As an added bonus, a new one of these emerges every now and then - with a different size requirement.
At scale, processing thousands of highly granular items (images) in any way is an embarrassingly parallel task even in its most crude and naive implementation.
So claiming "fast", ok, but claiming "scalable" feels like a redundant buzzword, a bit hand-wavey.
When you make such generic claims, be quick to explain what you mean, or people will not treat you seriously.
Yeah, that's what I was thinking. There is no state to be managed. It's just a pipeline that doesn't need anything more than additional machines to handle more load. It's relatively easy to implement.
If you guys need more info, please check thumbor's docs: https://github.com/thumbor/thumbor/wiki/