Hacker News

The big question is: why does it need to run in 10 s? The main reason I can see is to be able to run this analysis very frequently, but then your workload approaches a constant load anyway.

The total amount of data is 150 GB; that would easily fit into memory on a single powerful 2-socket server with 20 cores and would then run in less than 15 minutes. The hardware required to do that will cost you ~$6,000 from Dell; assuming a system lifetime of five years and assuming (like you do) that you can amortize across multiple jobs, the cost is roughly the same as in the cloud: about $0.036 per analysis.
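Back-of-envelope, the ~$0.036 figure checks out under one assumption that isn't stated above (and is mine): the server runs analyses back to back, 24/7, for its whole lifetime.

```python
# Sanity-check the ~$0.036-per-analysis amortization.
# Assumption (mine): continuous utilization over the full lifetime.
server_cost = 6000        # dollars, 2-socket 20-core Dell from the comment
lifetime_years = 5
runtime_min = 15          # one in-memory analysis run

analyses_per_hour = 60 / runtime_min                      # 4
total_analyses = lifetime_years * 365 * 24 * analyses_per_hour
cost_per_analysis = server_cost / total_analyses
print(f"${cost_per_analysis:.4f} per analysis")           # ~$0.0342
```

At anything less than full utilization the per-analysis cost rises proportionally, which is exactly the amortization caveat the comment makes.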

I'm fairly certain that, in the end, it's not more expensive for the customer to just buy a server to run the analysis on.

Edit: I see OP says 80% of the time is spent reading data into memory, at about 100 MB/s. Add $500 worth of SSD to the example server I outlined, and we can cut the application runtime by >70%, making the dedicated hardware significantly cheaper.
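The ">70%" claim follows from simple Amdahl-style arithmetic; the SSD throughput number below is my assumption (the thread only gives the 100 MB/s baseline):

```python
# 80% of runtime is reading data at 100 MB/s.
# Assumption (mine): a modest SSD sustains ~1000 MB/s sequential reads.
io_fraction = 0.80
baseline_mb_s = 100
ssd_mb_s = 1000

new_io = io_fraction * (baseline_mb_s / ssd_mb_s)   # 0.08 of original runtime
new_total = (1 - io_fraction) + new_io              # 0.28
print(f"runtime cut by {1 - new_total:.0%}")        # runtime cut by 72%
```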




A vCore is a hyperthread of an unknown CPU, so in reality 1000 vCores is 500 real cores; with all the overheads and the low utilization while the dataset loads, it's more like 450. To keep it at 10 sec you would need 90 real cores, or 4 × 3-node dual-socket boxes (eBay, ~$1.5K each) and 2 × InfiniBand switches (eBay, ~$300 each). For ~$6,600 you have a dedicated solution with no latency bubbles and a fixed low cost.
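The numbers in the comment above, laid out explicitly (the prices and counts are the commenter's eBay estimates, not mine):

```python
vcores = 1000
physical_cores = vcores / 2     # a vCore is one hyperthread -> 500 real cores
effective_cores = 450           # after overheads and load-time idling

boxes = 4 * 1500                # four used multi-node dual-socket boxes
switches = 2 * 300              # two used InfiniBand switches
total = boxes + switches
print(total)                    # 6600
```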


Briefly... We have many data sets, and the <10 sec calculations happen every few seconds for every data set in active use. Caching results is rarely helpful in our case because the number of possible results is immense. The back end drives an interactive/real-time experience for the user, so we need the speed. Our loads are somewhat spiky; overnight in US time zones we're very quiet, and during the daytime we can use more than 1k vCPUs.

We've considered a few kinds of platforms (AWS spot fleet/GCE autoscaled preemptible VMs, AWS Lambda, bare metal hosting, even Beowulf clusters), and while bare metal has its benefits as you've pointed out, at our current stage it doesn't make sense for us financially.

I omitted from the blog post that we don't rely exclusively on object storage services, because their performance is relatively low. We cache files on compute nodes, so much of the time we avoid that "80% of time is spent reading data" penalty.

(Re: Netflix, in qaq's other comment, I don't have a hard number for this, but I thought a typical AWS data center is only under 20-30% load at any given time.)



