Interesting that they upload to S3 first, only to pull it off soon after and process it.
I imagine this reduces complexity and lets them increase their upload throughput, but I wonder if 500px toyed with the idea of first uploading to local (temp) storage, having the workers process it, and only then persisting it. But I guess the cons are a) at some point it needs to go to S3 in its original form anyway, and b) it increases the risk of data loss.
I'm the CTO of 500px. Decoupling uploaders and converters simplifies the overall design and makes it more robust. Converters are a part of the legacy design that we are planning to retire. In the past, we would generate multiple sizes and crops for every image so it could be efficiently displayed on the site and/or mobile apps.
Today, we rely on the new Resizer service that can resize and crop images on the fly. It's an interesting piece of technology that we will be writing about on our blog: developers.500px.com
Thanks for sharing this and being on HN to answer - very interesting architecture! How do you deal with customers from remote locations (Asia, Oceania) complaining about slow upload speeds?
I will keep an eye on your blog. Thanks for sharing the link. Always looking to learn about memory-heavy operations. I'm currently the lead dev on software that revolves around user-generated content (images and audio). It's a never-ending road of (fun) challenges.
Performance is the main reason. Additionally, we needed a solution to modify images on the fly (watermarking and attribution). That's why we decided to build our own service using Go.
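For illustration, the general shape of an on-the-fly resize service looks something like this Go sketch. To be clear, this is not 500px's actual Resizer: the endpoint, query parameters, and fetchOriginal origin are all hypothetical, and it uses only the standard library plus golang.org/x/image/draw.

    package main

    import (
    	"image"
    	"image/jpeg" // importing also registers the JPEG decoder
    	_ "image/png"
    	"log"
    	"net/http"
    	"strconv"

    	"golang.org/x/image/draw"
    )

    // fetchOriginal is a hypothetical stand-in for pulling the source
    // image out of S3 (or any origin store).
    func fetchOriginal(id string) (image.Image, error) {
    	resp, err := http.Get("https://originals.example.com/" + id)
    	if err != nil {
    		return nil, err
    	}
    	defer resp.Body.Close()
    	img, _, err := image.Decode(resp.Body)
    	return img, err
    }

    func resizeHandler(w http.ResponseWriter, r *http.Request) {
    	width, err := strconv.Atoi(r.URL.Query().Get("w"))
    	if err != nil || width <= 0 || width > 4096 {
    		http.Error(w, "bad width", http.StatusBadRequest)
    		return
    	}
    	src, err := fetchOriginal(r.URL.Query().Get("id"))
    	if err != nil {
    		http.Error(w, "not found", http.StatusNotFound)
    		return
    	}
    	// Preserve the aspect ratio; CatmullRom is slower but high quality.
    	b := src.Bounds()
    	height := b.Dy() * width / b.Dx()
    	dst := image.NewRGBA(image.Rect(0, 0, width, height))
    	draw.CatmullRom.Scale(dst, dst.Bounds(), src, b, draw.Over, nil)

    	// A long TTL means each size is computed roughly once per CDN edge.
    	w.Header().Set("Content-Type", "image/jpeg")
    	w.Header().Set("Cache-Control", "public, max-age=31536000")
    	jpeg.Encode(w, dst, &jpeg.Options{Quality: 85})
    }

    func main() {
    	http.HandleFunc("/resize", resizeHandler)
    	log.Fatal(http.ListenAndServe(":8080", nil))
    }

With a CDN in front of /resize and a long max-age, each (photo, width) pair is only ever computed once per edge, which is the trade-off discussed below.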
It could be that there are a lot of images/versions that simply never get seen, or seen rarely enough that it's inefficient to store them. And they're being cached by the CDN anyway, so popular images will still only be generated once per edge.
Bingo! Extra storage for all the resized images that never get seen, lack of flexibility to introduce new sizes, plus the requirements for watermarking and attribution on images are the main reasons why we are moving away from pre-conversion to dynamic image resizing.
Generating many conversions for all photos incurs both storage and computational overhead that may never pay off. By only generating the required images on the fly, we avoid this overhead while allowing more flexibility. If watermarks or attribution change in the future, it's easier to have the resize service handle the change as images expire from the CDN than to run a batch job re-processing every existing photo against the updated conversion requirements.
Also note that photos are stored indefinitely, so any additional permanent storage (new conversions) would add additional storage costs, minuscule as they might be, forever.
To me this is a very common practice. At least in my experience, a lot of AWS processing is done with S3 throughout the pipeline, whether you run EMR or Simple Workflow. You can also create temporary credentials per user in your app, which lets the user upload directly to "storage" (in this case S3) so you can process it from there.
And with Lambda triggered by S3 events, you can run processing automatically as well. However, if you want your uploaders to get feedback on their uploads immediately and/or synchronously, you need to upload to app servers.
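For what it's worth, handing the client a presigned PUT URL is pretty compact with the AWS SDK. Here's a rough Go sketch using aws-sdk-go (v1); the bucket, key, and region are made up for illustration, and the S3-event-to-Lambda wiring is configured on the bucket, not in this code.

    package main

    import (
    	"fmt"
    	"log"
    	"time"

    	"github.com/aws/aws-sdk-go/aws"
    	"github.com/aws/aws-sdk-go/aws/session"
    	"github.com/aws/aws-sdk-go/service/s3"
    )

    func main() {
    	sess := session.Must(session.NewSession(&aws.Config{
    		Region: aws.String("us-east-1"), // hypothetical region
    	}))
    	svc := s3.New(sess)

    	// Build the PUT request but don't send it; Presign signs it with
    	// short-lived credentials so the client can upload straight to S3.
    	req, _ := svc.PutObjectRequest(&s3.PutObjectInput{
    		Bucket: aws.String("example-uploads"),    // hypothetical bucket
    		Key:    aws.String("user-123/photo.jpg"), // hypothetical key
    	})
    	url, err := req.Presign(15 * time.Minute)
    	if err != nil {
    		log.Fatal(err)
    	}
    	fmt.Println(url) // hand this to the uploader; S3 scales the ingest
    }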
I'd like to be able to have users upload directly to S3 (obviously less infrastructure and code to maintain), but without being able to provide immediate feedback on the upload, I've found it preferable to have our own application servers in the upload path. This lets us immediately detect unsupported formats and act on the completion of the upload (whether success or failure) without delay.
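That immediate feedback can be as simple as sniffing the image header before anything is persisted. A Go sketch using the standard library's image.DecodeConfig (the handler path and size limits are hypothetical, and the actual copy to S3 is elided):

    package main

    import (
    	"bytes"
    	"image"
    	_ "image/gif" // register decoders so DecodeConfig can sniff them
    	_ "image/jpeg"
    	_ "image/png"
    	"io"
    	"log"
    	"net/http"
    )

    func uploadHandler(w http.ResponseWriter, r *http.Request) {
    	// DecodeConfig reads only the image header, and TeeReader keeps a
    	// copy of those bytes so the full stream can be replayed later.
    	var head bytes.Buffer
    	cfg, format, err := image.DecodeConfig(io.TeeReader(r.Body, &head))
    	if err != nil {
    		http.Error(w, "unsupported or corrupt image", http.StatusUnsupportedMediaType)
    		return
    	}
    	if cfg.Width > 30000 || cfg.Height > 30000 {
    		http.Error(w, "image too large", http.StatusRequestEntityTooLarge)
    		return
    	}
    	_ = format // e.g. route JPEG vs PNG to different pipelines
    	// Replay the sniffed header plus the rest of the body when
    	// persisting; the actual upload to S3 is elided in this sketch.
    	_ = io.MultiReader(&head, r.Body)
    	w.WriteHeader(http.StatusCreated)
    }

    func main() {
    	http.HandleFunc("/upload", uploadHandler)
    	log.Fatal(http.ListenAndServe(":8080", nil))
    }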
It also means that you need to scale your upload servers (IO & long-lived requests), which is more difficult than trusting S3 to scale for you. However, leaving your S3 bucket open to uploads has its downsides. We at Cloudinary upload to our own (autoscaled) instances and only then persist.
My initial thought was: wow, this looks like a great validation of the idea behind Joyent Manta[1] storage+process. I wonder if it would've been a good fit for 500px, and how it would compare in terms of price/performance? Granted, they now have a working system on S3, so maybe for the next up-and-coming competitor? ;-)
Great write-up, btw. And thanks for the heads-up about vips/nip2 -- I wasn't aware of those.
Thumbs up for VIPS; I used it 4-5 years ago to build a system that processed high-resolution images. It's fast, lean, and an amazing piece of software.
Totally agree, I've used it for some projects as well. The only thing I wasn't able to grok (which still kills me to this day) was that I couldn't get it to handle EPS files correctly. If anybody has done that with VIPS/IM, I would love to hear about it.
I'm replying here to let Melraidin (who posted a sibling comment) know that their account is dead, for reasons I can't understand since their comments seem helpful.
VIPS has been great to work with. The docs are generally very good and the performance has been great. The author (https://github.com/jcupitt/) has always been responsive to issues and questions and seems to be constantly working on the project.
My only, relatively minor, quibble is dev work happening on master, but when it's largely a one-man project, who am I to judge? So far we haven't come across any problems that weren't either self-inflicted or already fixed in master.
Are they somehow intercepting middle click on their site so that links do not open in a new tab? Control-click works fine for me, but middle click would just replace the article, which was inconvenient. (latest Chrome on Ubuntu)
I'm getting the same thing and it's incredibly frustrating on an otherwise interesting article. I had to Right Click > "Open In New Tab" for links I wanted to read later.
I was getting excited about using 500px for the specific reason that it seems to load much faster than Flickr. Then I loaded it up in a public space to get some inspiration by browsing other users' work - interspersed were images of women in various states of undress. I quickly exited the site and never went back, for fear of someone glancing over my shoulder and assuming I was doing something other than getting artistic inspiration.
Not arguing that the images aren't art but I personally don't want them popping up on my screen when I'm just trying to get inspiration for landscape photography.