Building Analytics at Simple (simple.com)
9 points by jklukas on Nov 23, 2015 | 6 comments



I wanted to mention some things that are still struggles with our current platform at Simple.

Maintaining our Redshift schema can be time-consuming, especially since there can be upstream changes that affect the schema of the messages we consume, and we often aren't notified. We'll just be silently dropping the new columns, or loads into Redshift will start to fail. Postgres 9.5 is supposed to include DDL messages in logical decoding, which will help for our pipeline from Postgres databases, but we'd love to develop a more robust system for automatically updating the Redshift schema to accommodate changes in the incoming data format. Naive solutions would likely be susceptible to unwanted behavior coming from spurious messages. We could end up with a large number of mostly empty columns in some cases.
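To be concrete about why naive solutions scare us: the obvious approach is for the loader, when it sees an unknown field, to check the catalog and issue an ALTER TABLE before retrying the load. Something like this, with made-up table and column names:

    -- What columns does the target table currently have?
    SELECT column_name, data_type
    FROM information_schema.columns
    WHERE table_schema = 'raw' AND table_name = 'transactions';

    -- If a message carries a field we've never seen, the loader would
    -- add it and retry the load. Redshift only allows one ADD COLUMN
    -- per ALTER TABLE statement.
    ALTER TABLE raw.transactions ADD COLUMN merchant_category VARCHAR(256);

As noted above, with this approach a single spurious message is enough to permanently add a mostly empty column.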

We've also had trouble maintaining knowledge about the schema. We'd like to have some sort of "data dictionary" that would be available for folks to understand where the data in Redshift is coming from and what each of the columns mean. This might end up being a standalone document, but there's concern about it staying up to date with changes in the schema. We've toyed with the idea of maintaining this sort of documentation as Javadoc-style comments directly in the repository where we define our Redshift schema and migrations, but we haven't found any existing tooling to render documentation from comments in SQL source files.
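As a sketch of what we have in mind (the table here is hypothetical, not from our actual schema), the documentation would live next to the DDL in the migrations repo, and could optionally be pushed into the catalog with COMMENT ON so SQL clients can surface it:

    /**
     * raw.transactions -- one row per message from the transactions service.
     * Loaded continuously from the Postgres logical-decoding pipeline.
     */
    CREATE TABLE raw.transactions (
        transaction_id VARCHAR(64) NOT NULL, -- upstream UUID
        amount_cents   BIGINT,               -- signed; negative = debit
        created_at     TIMESTAMP             -- UTC, from the source row
    );

    -- Redshift also supports COMMENT ON, which keeps the description
    -- attached to the object in the catalog.
    COMMENT ON COLUMN raw.transactions.amount_cents IS
        'Signed amount in cents; negative values are debits.';

The missing piece is still tooling that renders comments like these into a browsable document.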


A question on Twitter (https://twitter.com/sayhar/status/671867539480305664) asked about how Simple decided to do ETL in Redshift.

I think the main benefit is flexibility. Since transformations are happening in the same system where the data already lives, there's no I/O overhead and performance is quite good. Our full suite of nightly ETL jobs runs in about 2 hours, and that's without any effort to run jobs in parallel. There's also minimal schema maintenance involved; the nightly jobs completely drop the existing tables, and in most cases the table creation is handled automatically as part of a CREATE TABLE AS statement. That means we're pretty free to add, drop, or rename columns without needing to worry about migrations or keeping the schema in sync with what some external tool is doing.
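For anyone curious what that looks like in practice, the nightly pattern is roughly the following (table and column names invented for the example):

    -- Nightly rebuild of a derived table: drop it and let CREATE TABLE AS
    -- define the new schema from the SELECT.
    BEGIN;
    DROP TABLE IF EXISTS analytics.daily_signups;
    CREATE TABLE analytics.daily_signups AS
    SELECT
        DATE_TRUNC('day', created_at) AS signup_date,
        COUNT(*)                      AS signups
    FROM raw.users
    GROUP BY 1;
    COMMIT;

Adding or renaming a column is just a change to the SELECT; there's no separate migration to keep in sync.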

That said, no one on our team has deep familiarity with ETL tools, so there may be other good options that we simply haven't investigated.


I do have experience with ETL tools. The dirty secret of ETL tools is that they are slow in the best case. In the worst case they're extremely, head-meltingly, how-could-it-possibly-take-that-long slow. Before MPP and columnar DBs (Vertica, Greenplum, etc.) were widely available, ETL tools were a good trade-off. You were offloading work that (in many cases) literally could not be completed on either the source DB or the target data warehouse.

When we (my company at the time) got Netezza in 2004, we immediately realized that we needed to move to ELT, i.e. the ETL tool (SQL Server DTS at the time) became just a scheduler running plain SQL against Netezza. It was massively faster to do it that way and we could put the ETL server budget into Netezza.
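In other words, the tool only submits set-based SQL and the MPP database does the work, rather than pulling rows through the ETL server a batch at a time. Roughly this kind of thing (names are illustrative only):

    -- The whole transformation runs inside the warehouse as one
    -- set-based statement; the "ETL" tool just schedules it.
    INSERT INTO dw.fact_orders (order_id, date_key, amount)
    SELECT
        o.order_id,
        d.date_key,
        o.amount
    FROM staging.orders o
    JOIN dw.dim_date   d ON d.calendar_date = o.order_date;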

ETL vendors, e.g. Informatica, now have "pushdown" options where you design your workflow in their tool but it executes as SQL inside your database. However, when you use a tool like this you're very constrained in the types of processing you can do; for example, Informatica can't parse timezones (yes, really).
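For contrast, that particular gap disappears once the SQL runs directly in the warehouse; timezone conversion is a built-in in Redshift, for example (table and column names here are just illustrative):

    -- Convert a UTC timestamp to a local timezone inside Redshift.
    SELECT CONVERT_TIMEZONE('UTC', 'America/Los_Angeles', created_at)
               AS created_at_local
    FROM raw.events;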

We evaluated basically everything that claimed to support Redshift and couldn't find anything that was viable for what we need. Some tools were close, though, so maybe someone can close the gap.


We've struggled with the same issues. We also keep a "base" schema in Redshift as a close mirror of the source systems (tweaked a little for distribution keys). In our case we control the extracts, but schema changes have to be manually added. If a new column is important and we miss it, then we have to run time-consuming backfills. The only way we've been able to ameliorate it somewhat is to be more involved in the upstream release process, but that is time-consuming in itself.
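By "tweaked a little for distribution keys" I mean the base tables are essentially column-for-column copies of the source, just with Redshift distribution and sort keys chosen for how we query them. Roughly (hypothetical table):

    -- Base-layer mirror of a source table, with DISTKEY/SORTKEY added.
    CREATE TABLE base.accounts (
        account_id  VARCHAR(64) NOT NULL,
        customer_id VARCHAR(64),
        opened_at   TIMESTAMP
    )
    DISTKEY (customer_id)
    SORTKEY (opened_at);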

The solution I'd like to implement is having the extracts and loads be built dynamically whenever a change is detected, e.g. a process that runs separately checking the schema of our sources for changes, dynamically updating the ETL processes, and then publishing them to Azkaban. Looking forward, though, our biggest data sources are going non-relational, so that approach will entail a great deal more complexity.
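The change-detection piece itself can be plain SQL against the source's catalog: snapshot the column list each night and diff it against the previous snapshot. A rough sketch, assuming a Postgres-style source and made-up names:

    -- Nightly snapshot of the source's columns.
    CREATE TABLE IF NOT EXISTS ops.schema_snapshot (
        table_schema VARCHAR(128),
        table_name   VARCHAR(128),
        column_name  VARCHAR(128),
        data_type    VARCHAR(128),
        captured_on  DATE
    );

    INSERT INTO ops.schema_snapshot
    SELECT table_schema, table_name, column_name, data_type, CURRENT_DATE
    FROM information_schema.columns
    WHERE table_schema = 'public';

    -- Columns that appeared since the last snapshot would trigger a
    -- rebuild of the affected extract/load job.
    SELECT c.table_name, c.column_name
    FROM information_schema.columns c
    LEFT JOIN ops.schema_snapshot s
           ON s.table_name  = c.table_name
          AND s.column_name = c.column_name
          AND s.captured_on = CURRENT_DATE - 1
    WHERE c.table_schema = 'public'
      AND s.column_name IS NULL;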

Regarding schema knowledge / metadata, this is a perennial problem. I once worked on a huge Teradata DW system at a global bank that had a large internal website for just this purpose, and it was hugely out of date at all times.

When I started building out the analytic capability at my current company I made a fork of "Rocco" (https://github.com/joeharris76/rocco) for doing inline documentation as Markdown inside SQL block comments. http://pow.cx/docs/ is a good example of what I was hoping to achieve. However, even our smallest components have many, many files, and the docs were too difficult to navigate using the "Jump To…" dropdown menu.

So, while this produces some very nice docs, sadly they turned out to be largely unhelpful to anyone who didn't already understand the code. Our current approach is to try and keep individual components small, include a comprehensive README in each component and use Slack comms to fill in the gaps.


Being able to adapt dynamically to schema changes would be great. If you do develop a solution for that, please do publicize it. Simple uses segment.com for some communication of analytics info to third parties, and Segment folks have built some capabilities to infer schemas from incoming messages which they then apply to history tables in Redshift. We're hoping to learn more from Segment about how they make that happen.

What you described with your documentation efforts is the kind of scenario I fear. We could spend significant effort putting together a documentation system that simply doesn't meet a real need. It's good to know other folks are also struggling with that, though.


We're very interested in trying AWS' new Database Migration Service to see if it provides a nice, automated way to get a mirror of the source into Redshift. https://aws.amazon.com/dms/



