
We've struggled with the same issues. We also keep a "base" schema in Redshift as a close mirror of the source systems (tweaked a little for distribution keys). In our case we control the extracts, but schema changes have to be added manually. If a new column is important and we miss it, we have to run time-consuming backfills. The only way we've been able to ameliorate this somewhat is to be more involved in the upstream release process, but that is time-consuming in itself.

The solution I'd like to implement is having the extracts and loads built dynamically whenever a change is detected. E.g., a separate process that checks the schemas of our sources for changes, dynamically updates the ETL processes, and then publishes them to Azkaban; a rough sketch of the detection step follows below. Looking forward, though, our biggest data sources are going non-relational, so that approach will entail a great deal more complexity.
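Roughly what I have in mind for the detection side (a minimal sketch only, assuming a Postgres-compatible source reachable via psycopg2; the DSN, snapshot file, and the "regenerate the ETL jobs" step are placeholders):

    # Detect schema drift on a source by diffing information_schema
    # against the last snapshot. Updating ETL jobs / publishing to
    # Azkaban is out of scope here.
    import json
    import psycopg2

    SOURCE_DSN = "host=source-db dbname=app user=etl"   # hypothetical
    SNAPSHOT_FILE = "schema_snapshot.json"               # last known schema

    def fetch_schema(dsn, schema="public"):
        """Return {table: {column: data_type}} from information_schema."""
        query = """
            SELECT table_name, column_name, data_type
            FROM information_schema.columns
            WHERE table_schema = %s
            ORDER BY table_name, ordinal_position
        """
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(query, (schema,))
            tables = {}
            for table, column, dtype in cur.fetchall():
                tables.setdefault(table, {})[column] = dtype
            return tables

    def diff_schemas(old, new):
        """Yield human-readable differences between two snapshots."""
        for table, cols in new.items():
            old_cols = old.get(table, {})
            for col, dtype in cols.items():
                if col not in old_cols:
                    yield f"NEW COLUMN  {table}.{col} ({dtype})"
                elif old_cols[col] != dtype:
                    yield f"TYPE CHANGE {table}.{col}: {old_cols[col]} -> {dtype}"
            for col in set(old_cols) - set(cols):
                yield f"DROPPED     {table}.{col}"
        for table in set(old) - set(new):
            yield f"DROPPED TABLE {table}"

    if __name__ == "__main__":
        current = fetch_schema(SOURCE_DSN)
        try:
            with open(SNAPSHOT_FILE) as f:
                previous = json.load(f)
        except FileNotFoundError:
            previous = {}
        for change in diff_schemas(previous, current):
            print(change)   # in practice: regenerate extracts/loads, notify, etc.
        with open(SNAPSHOT_FILE, "w") as f:
            json.dump(current, f, indent=2)

Run it on a schedule and you at least find out about new columns before the backfill gets expensive; the harder part is turning the diff into updated jobs automatically.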

Regarding schema knowledge / metadata, this is a perennial problem. I once worked on a huge Teradata DW system at a global bank that had a large internal website for just this purpose, and it was hugely out of date at all times.

When I started building out the analytic capability at my current company, I made a fork of "Rocco" (https://github.com/joeharris76/rocco) for doing inline documentation as Markdown inside SQL block comments. http://pow.cx/docs/ is a good example of what I was hoping to achieve. However, even our smallest components have many, many files, and the docs were too difficult to navigate using the "Jump To…" dropdown menu.

So, while this produces some very nice docs, sadly they turned out to be largely unhelpful to anyone who didn't already understand the code. Our current approach is to try to keep individual components small, include a comprehensive README in each component, and use Slack to fill in the gaps.




Being able to adapt dynamically to schema changes would be great. If you do develop a solution for that, please do publicize it. Simple uses segment.com to pass some analytics info to third parties, and the Segment folks have built some capabilities to infer schemas from incoming messages, which they then apply to history tables in Redshift (my rough mental model of the idea is sketched below). We're hoping to learn more from Segment about how they make that happen.
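The gist, as I understand it (purely illustrative, not Segment's actual implementation; the table and column names are made up), is inferring crude column types from a batch of JSON events and adding any new columns to the history table:

    # Infer rough Redshift column types from JSON events and print the
    # ALTER TABLE statements a loader might apply to a history table.
    import json

    def infer_type(value):
        """Map a Python value to a rough Redshift column type."""
        if isinstance(value, bool):
            return "BOOLEAN"
        if isinstance(value, int):
            return "BIGINT"
        if isinstance(value, float):
            return "DOUBLE PRECISION"
        return "VARCHAR(65535)"   # fall back to text for strings/objects

    def infer_columns(events):
        """Merge the keys seen across events into {column: type}."""
        columns = {}
        for event in events:
            for key, value in event.items():
                columns.setdefault(key, infer_type(value))
        return columns

    def alter_statements(table, existing_columns, inferred):
        """Emit ALTER TABLE ... ADD COLUMN for any new keys."""
        for col, dtype in inferred.items():
            if col not in existing_columns:
                yield f'ALTER TABLE {table} ADD COLUMN "{col}" {dtype};'

    if __name__ == "__main__":
        sample = [json.loads(line) for line in open("events.ndjson")]
        existing = {"event_id", "received_at"}   # columns already in Redshift
        for stmt in alter_statements("analytics.page_views", existing,
                                     infer_columns(sample)):
            print(stmt)

How they handle type conflicts, nested objects, and renames is exactly what we want to learn more about.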

What you described with the documentation efforts is the kind of scenario I fear. We could spend significant effort putting together a documentation system that simply doesn't meet a real need. It's good to know other folks are also struggling with that, though.


We're very interested in trying AWS' new Database Migration Service to see if it provides a nice, automated way to get a mirror of the source into Redshift. https://aws.amazon.com/dms/
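If it works out, kicking off a replication task programmatically would look roughly like this (a sketch with boto3, assuming the source and Redshift target endpoints and the replication instance already exist; all ARNs below are placeholders):

    # Start a DMS task that mirrors every table in the source's "public"
    # schema into Redshift, with an initial full load plus ongoing CDC.
    import json
    import boto3

    dms = boto3.client("dms", region_name="us-east-1")

    table_mappings = {
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-public",
            "object-locator": {"schema-name": "public", "table-name": "%"},
            "rule-action": "include",
        }]
    }

    response = dms.create_replication_task(
        ReplicationTaskIdentifier="mirror-source-to-redshift",
        SourceEndpointArn="arn:aws:dms:...:endpoint:SOURCE",      # placeholder
        TargetEndpointArn="arn:aws:dms:...:endpoint:REDSHIFT",    # placeholder
        ReplicationInstanceArn="arn:aws:dms:...:rep:INSTANCE",    # placeholder
        MigrationType="full-load-and-cdc",   # full copy, then ongoing changes
        TableMappings=json.dumps(table_mappings),
    )
    print(response["ReplicationTask"]["Status"])

Whether DMS keeps up with upstream schema changes automatically is the part we still need to verify.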



