
It was the logging part which puzzled me.

The reason I asked about resources is that I have data generated by a personal project. The initial data model was sloppy, so now I'm finding myself having to backfill to clean the data, and it's rather painful. I haven't come across anything that deals with the subject, so I'm just winging it on my own.




> It was the logging part which puzzled me.

OP used "Log stacks", but "Log Stacks" are just a specific flavor of event-based timeseries/analytics stacks. If you were to build a log-ingest and log-aggregation system, you'd just be building an ETL but with a specific emphasis on logging.

> The reason I asked about resources is that I have data generated by a personal project. The initial data model was sloppy, so now I'm finding myself having to backfill to clean the data, and it's rather painful. I haven't come across anything that deals with the subject, so I'm just winging it on my own.

Snowflake's stage system works similarly to what I'm describing. You can use S3 as an external stage and then load data from the stage into a table. If something bad happens to your data _in Snowflake_, you can just reload from the stage (with an updated INSERT).
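
Roughly, that reload can look like the sketch below. This is a minimal, hypothetical example using snowflake-connector-python; the credentials, stage name (my_s3_stage), file format (my_json_format), and table/columns are all made up, and the INSERT ... SELECT from the stage stands in for whatever your corrected transformation is.

    # Minimal sketch: reload a table from an S3-backed external stage after a bad load.
    # Stage, file format, table, and credentials below are hypothetical placeholders.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account",
        user="my_user",
        password="my_password",
        warehouse="my_wh",
        database="my_db",
        schema="public",
    )
    cur = conn.cursor()

    # Drop the bad rows, then re-select straight from the staged raw files
    # with the corrected logic (the "updated INSERT").
    cur.execute("TRUNCATE TABLE events")
    cur.execute("""
        INSERT INTO events (event_id, payload, ingested_at)
        SELECT $1:id::string, $1, CURRENT_TIMESTAMP()
        FROM @my_s3_stage (FILE_FORMAT => 'my_json_format')
    """)
    conn.close()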

For more ad-hoc ELT/ETL systems (i.e. not going all-in on Snowflake), you'll have to assess your own tooling and build it yourself. In general, when building ingest systems I try to document whatever I can per record. Meaning, each record in a store includes which version of which software ingested it, and a reference to the rawest form of that data possible (e.g. a JSON blob of the original event, or an S3 URL to that event's backing source). This lets you say "we identified a bug in the ingest layer at version 0.1.1, we need to reingest all that data with 0.2.0", then easily identify and remove the exact data that hit that bug (because you recorded 0.1.1 as part of the record itself), and then build a list of exactly which S3 files need to be reingested by 0.2.0.
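
As a concrete (and purely illustrative) shape for that kind of record, something like the following works; the field names (ingest_version, source_s3_url) and helpers are hypothetical, not a standard schema.

    # Sketch of per-record ingest metadata. Field names and helpers are illustrative.
    from dataclasses import dataclass
    from datetime import datetime, timezone

    INGEST_VERSION = "0.2.0"  # version of the software doing the ingest

    @dataclass
    class IngestedRecord:
        event_id: str
        payload: dict          # the parsed/cleaned data
        ingest_version: str    # which ingester version wrote this row
        source_s3_url: str     # pointer to the rawest form of the data
        ingested_at: datetime

    def build_record(event_id: str, payload: dict, source_s3_url: str) -> IngestedRecord:
        return IngestedRecord(
            event_id=event_id,
            payload=payload,
            ingest_version=INGEST_VERSION,
            source_s3_url=source_s3_url,
            ingested_at=datetime.now(timezone.utc),
        )

    # When a bug is found in 0.1.1: find the affected rows by version, collect
    # their source files, delete the rows, and reingest those files with 0.2.0.
    def files_to_reingest(records, buggy_version="0.1.1"):
        return sorted({r.source_s3_url for r in records if r.ingest_version == buggy_version})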

If you're comfortable expanding your dataset a bit to include that type of metadata, you save yourself a lot of time when bad things happen (which they will). It's always a game of metadata/bloat/compute time vs. savings, though.

edit: I'll add that none of this matters if your dataset is small enough to be reimported from zero in almost no time at all. If you could write a small script that just parses every file in S3 and inserts it into a database, and the time it would take to finish doesn't upset you, you're totally fine doing just that. What I described above is for when your data becomes so large that reimporting from zero is basically impossible.
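
For scale, that "small script" really can be this simple; a sketch assuming one JSON object per S3 file, a hypothetical bucket/prefix, and a local SQLite table as the destination:

    # Tiny "reimport from zero" sketch: walk every object under an S3 prefix and
    # insert its contents into SQLite. The bucket, prefix, and one-JSON-object-
    # per-file layout are assumptions for illustration.
    import json
    import sqlite3

    import boto3

    BUCKET = "my-project-bucket"  # hypothetical
    PREFIX = "raw/events/"

    s3 = boto3.client("s3")
    db = sqlite3.connect("events.db")
    db.execute(
        "CREATE TABLE IF NOT EXISTS events "
        "(event_id TEXT PRIMARY KEY, payload TEXT, source_key TEXT)"
    )

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
            event = json.loads(body)
            db.execute(
                "INSERT OR REPLACE INTO events (event_id, payload, source_key) VALUES (?, ?, ?)",
                (event["id"], json.dumps(event), obj["Key"]),
            )
    db.commit()
    db.close()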



