Building Analytics at 500px (medium.com/samson_hu)
94 points by titanas on Oct 1, 2015 | 16 comments



Thanks a lot for the detailed post. 20 GB of logs doesn't seem like much, though? I'm not really sure of 500px's scale. Fully conceding that my company is probably logging too much, I'm wondering what takes the ETL pipeline 4 hours to run over that. Is there just no need to optimize it, since there's no benefit to having these metrics in real time? Or am I missing some really data-heavy part? Again, I'm not trying to be negative, just wondering.

Luigi seems cool. Can anyone comment on how it compares to Airflow or Spring XD, or are those just different products?

Periscope looks like Kibana for relational stores, also cool to see.

Thanks again for the great post!


It's not a lot of data, but the biggest constraints for me were cost and time to implementation. You could actually get an Amazon memory-intensive server and do it all in memory, but I didn't have those resources. ETL + a Redshift server in the end cost me around $5000 a year, which is TINY compared to the value we got out of it and the cost that most companies pay.


Makes sense. I'm used to physical servers with tons of RAM at work. If you don't mind, and I assume you're entirely on Amazon then, what size instances, and how many, are you using for the ETL (Luigi) nodes? And is there any infrastructure involved besides s3 (mysql dump, logs) => amazon instances (luigi) => redshift?


That is all there is: S3, one Amazon instance for Luigi which holds the MySQL read replica, and Amazon Redshift. There wasn't any heavy ETL in Luigi. Luigi mostly just extracted/dumped data. All the heavy lifting was in EMR.
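
For readers unfamiliar with this shape of pipeline, here is a rough sketch of the extract-then-load flow described above. It uses only the Python standard library and loosely imitates the Luigi idea of tasks that declare dependencies and count as done once their output exists. It is not Luigi's API and not the author's code: the class names, file names, and sample rows are all hypothetical, and local files stand in for S3 and Redshift.

```python
import tempfile
from pathlib import Path


class Task:
    """Tiny stand-in for a Luigi-style task: done when its output file exists."""

    def __init__(self, workdir: Path):
        self.workdir = workdir

    def requires(self):
        return []

    def output(self) -> Path:
        raise NotImplementedError

    def run(self) -> None:
        raise NotImplementedError

    def complete(self) -> bool:
        return self.output().exists()


class DumpMySQL(Task):
    # Hypothetical extract step: stands in for dumping tables from the read replica.
    def output(self):
        return self.workdir / "mysql_dump.tsv"

    def run(self):
        self.output().write_text("user_id\tphotos\n1\t12\n2\t7\n")


class LoadWarehouse(Task):
    # Hypothetical load step: stands in for copying the dump into the warehouse.
    def requires(self):
        return [DumpMySQL(self.workdir)]

    def output(self):
        return self.workdir / "load.done"

    def run(self):
        rows = (self.workdir / "mysql_dump.tsv").read_text().splitlines()[1:]
        self.output().write_text(f"loaded {len(rows)} rows\n")


def build(task: Task, log: list) -> None:
    """Run dependencies first, skipping any task whose output already exists."""
    if task.complete():
        return
    for dep in task.requires():
        build(dep, log)
    task.run()
    log.append(type(task).__name__)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        first, second = [], []
        build(LoadWarehouse(workdir), first)
        build(LoadWarehouse(workdir), second)  # outputs exist, so nothing reruns
        return first, second
```

Calling `demo()` runs the pipeline twice against the same working directory. The second pass does no work because both outputs already exist, which is what makes a nightly job of this kind safe to rerun after a failure.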


Also, to add: logging too much should not be a problem if you are using it. Wish's Hadoop log store is over 40 TB compressed right now, and it's worth every penny.


Luigi creator Erik Bernhardsson has posted some notes on Luigi vs Airflow & more:

http://erikbern.com/2015/07/02/more-luigi-alternatives/

... and Pinball:

http://erikbern.com/2015/03/14/pinterest-open-sources-pinbal...


This is such an awesome post. Samson shared the details of his data engineering work at 500px and Wish during a Keen IO event last night. His slides are here -> https://keen.io/blog/130230045601/analytics-startups-and-lau...


Great post, and a nicely detailed read. This also came up on HN a few months ago (https://news.ycombinator.com/item?id=9760606), with some interesting alternatives to some of the presented tech in the comments.


Wow, this post is incredible--I can't thank the author enough for writing it! It's so thorough and has so many pieces of wisdom to offer in building a complete analytics solution from top to bottom. I'll be returning to this often.


Author here

I'll follow up and say that my first three months in the SF/Bay Area have been amazing. This is truly the capital of technology here. The level of talent and the one-ness of the community goes beyond anything that exists in Canada. I'm excited to take what I learn back one day.

Email me at shanzhen.hu at gmail.com if you have any questions about the process. I'm happy to help.


Just wanted to say thanks for this writeup. I read it the last time on here and it was definitely inspirational.


Previous discussion 3 months ago: https://news.ycombinator.com/item?id=9760606


I really appreciate the amount of time - and honesty - that went into this post. The author really sounds like they have their head firmly on their shoulders. Thanks, Samson!


Very good post. So great to get this kind of detailed insight into the whole process: requirements, trade-offs, architecture and implementation decisions, impact across the whole business, and the end result and added value.


It is quite an interesting post. I have been working in a scenario where the company is moving away from a traditional Oracle-based BI stack, and this post highlights how opex for BI can be reduced drastically.


What a terrific article. Well organized, clear, just enough detail to support the important points and on top of all that, a fun read. Thanks!



