
That’s about 1,700 images per second (150M ÷ 86,400 s). Doable on one (beefy) box, or three to account for the diurnal cycle. Am I supposed to be impressed?



Can you link to which resize library you're using? We'd love to see a 90% further reduction in instances


Sorry to be confusing; I am not resizing images. I'm just working with data sets as large as what I imagine 150M images would be. The software I am working on takes point-in-time backups of computers and uploads them to "the cloud", I mean servers in a data center. There they can be virtualized with a click of a button, en masse or one at a time, and near instantly.

This involves transferring, encrypting, compressing, and checksumming terabytes of data an hour (per node). While not exactly resizing images, I would imagine the computational load is on par with the service described. The entire system has about 4 to 8 PB in it right now, as backups are pruned (based on what people will pay for storage).
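For the curious, here's a minimal sketch (my own illustration, not the actual product code) of the kind of single-pass pipeline I mean: read in chunks, compress, and checksum as you stream, so the CPU work overlaps with I/O. The chunk size and function names are arbitrary.

    import hashlib
    import zlib

    CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB chunks; tune for your I/O pattern

    def compress_and_checksum(in_path, out_path):
        """Single pass over the input: hash and compress while streaming."""
        sha = hashlib.sha256()
        compressor = zlib.compressobj(level=6)
        with open(in_path, "rb") as src, open(out_path, "wb") as dst:
            while chunk := src.read(CHUNK_SIZE):
                sha.update(chunk)                  # checksum of the original data
                dst.write(compressor.compress(chunk))
            dst.write(compressor.flush())
        return sha.hexdigest()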

My software has a ton of space to grow and become better, but I think a better story would have been how Discord handles 150M images an hour. If anything, the bandwidth spent acquiring the source image would be the largest problem, not the CPU time to resize. In fact, as long as your resize code is slightly faster than the download, streaming it in and out would put your bottleneck entirely on bandwidth.
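To illustrate the streaming point, a rough sketch (hypothetical URLs and sizes, not Discord's actual pipeline): keep the image in memory end to end, so the only real cost on either side is the network transfer.

    from io import BytesIO

    import requests
    from PIL import Image

    def fetch_resize_upload(src_url, dest_url, max_size=(512, 512)):
        """Download an image, resize it in memory, and upload the result.

        Nothing touches disk, so if resizing is faster than the network,
        bandwidth is the bottleneck.
        """
        resp = requests.get(src_url, timeout=30)
        resp.raise_for_status()

        img = Image.open(BytesIO(resp.content))
        img.thumbnail(max_size)  # in place, preserves aspect ratio

        out = BytesIO()
        img.save(out, format=img.format or "PNG")

        requests.put(dest_url, data=out.getvalue(), timeout=30).raise_for_status()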

I will also note I am not a fan of libraries :p but that is not what this is about.

EDIT:

Also, kudos to you: somebody criticized your post and you had the best response one could have. Inquiring minds are awesome.


Assuming the average image size is 3 MB, which seems conservative, especially if they're handling GIFs as well, this is 450 TB per day. If you're handling that much data on one beefy machine, then kudos.
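A quick back-of-the-envelope check of those numbers (the 3 MB average is the assumption above, not a measured figure):

    images_per_day = 150_000_000
    avg_image_mb = 3  # assumed average size

    total_tb_per_day = images_per_day * avg_image_mb / 1_000_000
    avg_throughput_gb_s = images_per_day * avg_image_mb / 1000 / 86_400

    print(f"{total_tb_per_day:.0f} TB/day")           # ~450 TB/day
    print(f"{avg_throughput_gb_s:.1f} GB/s average")  # ~5.2 GB/s sustained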


I don't get why you are being downvoted. This is almost exactly what I thought. It's just not that much data given the state of computer hardware.

Where I work, we have single nodes processing nearly that much data an hour -- these are beefy systems, though.


People have drunk so much “cheap commodity hardware” Kool-Aid by now that they don’t realize there are cheaper and easier ways of doing things, assuming you have devs who can code and tune for performance. Same with “big data”: most people have sub-1 TB datasets. You simply don’t need Spark or anything custom for that.



