dianamp's comments

We considered HDFS, but we really liked the idea of having compute-only clusters and keeping our data completely separate. Cluster failures happen, and having the data on S3 makes us worry less if a cluster goes down. Just spin up a new one and you're good to go.

There is a bit more latency when using S3 compared to HDFS, but it's not bad and the benefits outweigh it. We do have a couple of jobs that store some intermediate results in HDFS, but in the end everything lands in S3.

We encountered a few issues with S3 at the beginning, mostly around eventual consistency, but nothing that could not be fixed.
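
For illustration, one common way to cope with that kind of lag is to poll until the objects a job just wrote are actually visible before the next step reads them. This is a minimal sketch and not necessarily what was done here: `wait_until_visible` and the `exists` callback are hypothetical names, and `exists` stands in for whatever existence check your storage client offers.

```python
import time
from typing import Callable, Iterable


def wait_until_visible(
    keys: Iterable[str],
    exists: Callable[[str], bool],
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until every key is reported visible, backing off exponentially.

    `exists` is a hypothetical callback supplied by the caller; it should
    return True once the object store reports the key as present.
    """
    pending = set(keys)
    for attempt in range(attempts):
        # Keep only the keys that are still not visible.
        pending = {key for key in pending if not exists(key)}
        if not pending:
            return True
        # Do not sleep after the final attempt.
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return False
```

A downstream job would call this with the keys it expects and fail loudly if it returns False, rather than silently reading a partial listing.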


Netflix, I think, said they see about a 10% perf hit using S3 instead of HDFS. They use EMR, where they launch temporary clusters that run a job and shut down, and that performance cost was well worth the flexibility of being able to launch independent clusters whenever they need.


We're also using S3, but we take a hybrid approach to the problem. The event data is immutable, so you can use EC2 instance stores, cache the data on local SSDs, and use S3 as the backup. The throughput of HDFS is better than S3 or EFS, but I would prefer EFS in this case since it also uses caching under the hood and is a cheaper alternative.


Oh great, thanks for the reply. I think that's about where we'll land... keep S3 as the primary source, but use HDFS for intermediate jobs.


Good luck and have fun! :D


So true! Been there myself! And I have friends who are being told by big tech companies that they don't do H1Bs anymore and prefer L1s and having people work from their European offices.


Nice!


Hi, the backend developer position sounds very interesting. By any chance, do you accept H1B?

