Hacker News new | past | comments | ask | show | jobs | submit login

A question on Twitter (https://twitter.com/sayhar/status/671867539480305664) asked about how Simple decided to do ETL in Redshift.

I think the main benefit is flexibility. Since transformations are happening in the same system where the data already lives, there's no I/O overhead and the performance is pretty fast. Our full suite of nightly ETL jobs runs in about 2 hours, and that's without any effort to run jobs in parallel. There's also minimal schema maintenance involved; the nightly jobs completely drop the existing tables, and in most cases the table creation is handled automatically as part of a CREATE TABLE AS statement. That means we're pretty free to add, drop, or rename columns without needing to worry about migrations or keeping the schema in sync with what some external tool is doing.

That said, no one on our team has deep familiarity with ETL tools, so there may be other good options that we simply haven't investigated.




I do have experience with ETL tools. The dirty secret of ETL tools is that they are slow in the best case. In the worst case extremely, head meltingly, how-could-it-possibly-take-that-long slow. Before MPP and columnar DBs (Vertica, Greenplum, etc.) were widely available ETL tools were a good trade off. You were offloading work that (in many cases) literally could not be completed on either the source DB or the target data warehouse.

When we (my company at the time) got Netezza in 2004 we immediately realized that we needed to move onto ELT. I.e. the ETL tool (SQL Server DTS at the time) became just a scheduler running plain SQL against Netezza. It was massively faster to do it that way and we could put the ETL server budget into Netezza.

ETL vendors, e.g., Informatica, now have "pushdown" options where that you design your workflow in their tool but it executes as SQL inside your database. However, when you use a tool like this you're very constrained in the types of processing you can do, for example Informatica doesn't have the ability to parse timezones (yes, really).

We evaluated basically everything that claimed to support Redshift and couldn't find anything that was viable for what we need. Some tools were close though so maybe someone can close the gap.




Consider applying for YC's first-ever Fall batch! Applications are open till Aug 27.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: