Luigi, AWS S3, DBT, Snowflake and Re:dash (currently evaluating Metabase and Looker to allow queries without SQL)
Luigi runs our scrapers and other workflow management tasks (e.g. DB backups).
All raw data lives in S3. We make an effort to be able to recreate the whole data warehouse from the raw data, so if any cleaning/normalization process fails, we have this safety net. I'm curious to hear if others use a similar pattern, or if there are better options.
DBT handles both loading the data from S3 into Snowflake (by creating the Snowflake Stages) and transforming it within Snowflake. This isn't how DBT is usually used, but it felt wasteful to add Stitch or another tool just to load the data into Snowflake, since Snowflake supports this out of the box. I also created a `setup_snowflake` macro operation that creates our users, warehouses, databases, etc. in Snowflake (a kind of "poor man's Terraform").
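Roughly speaking, DBT ends up issuing statements like the sketch below (the bucket, table, and format names are made up, and the stage credentials / storage integration are omitted):

```sql
-- Illustrative sketch of the load path, not our actual code.
CREATE STAGE IF NOT EXISTS raw_events_stage
  URL = 's3://example-raw-bucket/events/'
  FILE_FORMAT = (TYPE = JSON);

CREATE TABLE IF NOT EXISTS raw.events (payload VARIANT);

-- Re-running this load is what makes the "recreate the warehouse
-- from the raw data in S3" safety net work.
COPY INTO raw.events
  FROM @raw_events_stage;

-- The `setup_snowflake` macro issues the same kind of DDL for
-- users, warehouses and databases (CREATE USER / CREATE WAREHOUSE /
-- CREATE DATABASE).
```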
I don't think Snowflake needs an introduction. It's an amazing tool. We used Postgres before, but Snowflake is much, much better, even though our DB is pretty small (~200 GB).
Finally, we use Re:dash as a BI tool, but I'm evaluating other options that allow usage without SQL (currently Metabase and Looker).
> All raw data lives in S3... I'm curious to hear if others use a similar pattern, or if there are better options.
You may still need to maintain information about what is in S3. If you're already using Snowflake, you can probably achieve this via external tables. Otherwise you could run your own Hive metastore or use external tables on the AWS side (e.g. via Athena/Glue).
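A minimal sketch of what that could look like in Snowflake (the bucket, path layout, and columns below are purely illustrative):

```sql
-- Hypothetical external table over an S3 stage; Snowflake tracks the
-- file-level metadata, so the table doubles as a catalog of what's in S3.
CREATE STAGE IF NOT EXISTS raw_stage
  URL = 's3://example-raw-bucket/'
  FILE_FORMAT = (TYPE = JSON);

CREATE EXTERNAL TABLE IF NOT EXISTS ext_events (
  -- assumes paths like events/2019-01-01/part-0.json under the stage
  event_date DATE AS TO_DATE(SPLIT_PART(METADATA$FILENAME, '/', 2), 'YYYY-MM-DD'),
  event_type STRING AS (VALUE:type::STRING)
)
PARTITION BY (event_date)
LOCATION = @raw_stage/events/
FILE_FORMAT = (TYPE = JSON)
AUTO_REFRESH = FALSE;  -- AUTO_REFRESH = TRUE needs S3 event notifications wired up

-- "What's in the bucket?" then becomes a plain query:
SELECT METADATA$FILENAME, COUNT(*) FROM ext_events GROUP BY 1;
```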
We're also trying to decide whether it's cheaper/easier to store things in S3 or just keep everything internal in Snowflake. In some cases the compression on our data is so good that it's cheaper to keep it in a table than in an internal or external stage. Obviously this is bad if we ever have to move away from Snowflake, but we haven't committed to either approach, and lots of data is still backed up in S3. Our total data warehouse is about 200 TB at the moment, and we're projecting significant growth over the next couple of years.