
I've been in data analysis for a few years -- grew from scientific research in SAS with zero VCS, into programming and data analysis in R and Stata, still no VCS for a while...eventually I was allowed to use GitHub after I left my restrictive hospital research space. No one ever sat me down and trained me on Git, though. I just heard repeatedly: if you're not using Git, you're missing out on a huge set of important software-creation tools; you shouldn't be coding without VCS.

Why did this happen? Imagine it's this data analyst's first year out of school, coming from something like a crappy statistics program that only teaches SAS or base R with no VCS. This young padawan needs to practice analysis in preparation for a work project. They grab some data and stash it in a folder that Git is tracking. They do their analysis practice, it's Friday afternoon, they get lazy, they don't look at what they're committing, and they click buttons in the GUI without much thought.

It's incompetence and laziness, not malice. These tools that allow us to share widely, not just GH but also social media broadly: these tools have great power that should be used with greater training, responsibility, and care.




I don't think it's incompetence so much as professional ignorance (which some might call incompetence, but I don't think it is). Source control is for your source code, not for your data. Your data belongs in a database. Git is not a database, or at least shouldn't be treated like one.

Sure, it's easy to call it lazy when a data set sitting in some local directory accidentally gets committed. Happens to all of us. The bigger problem is: why is that data sitting in a directory on your file system when it should be in some database, preferably not a local one?

> these tools have great power that should be used with greater training, responsibility, and care.

This screams more and more that the tools are bad. Git is famously hard to use and even harder for non-plaintext data. Databases are annoying to initialize and get access to without a developer who's done it before. The tools suck; they could be better and require less training. It's not wrong to be lazy - it's wrong to make the lazy path dangerous.
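
One way to make the lazy path less dangerous is a Git pre-commit hook that refuses to commit files that look like data. What follows is only a rough sketch under stated assumptions: the extension list and the function names (`find_data_files`, `staged_paths`, `main`) are made up for illustration, and checking file extensions is a crude heuristic that will miss data saved under other names.

```python
import subprocess
from pathlib import PurePosixPath

# Hypothetical list of extensions treated as "data"; adjust for your project.
DATA_EXTENSIONS = {
    ".csv", ".tsv", ".xlsx", ".xls", ".sas7bdat",
    ".dta", ".rds", ".rdata", ".sav", ".parquet",
}


def find_data_files(paths):
    """Return the paths whose extension (case-insensitive) looks like data."""
    return [
        p for p in paths
        if PurePosixPath(p).suffix.lower() in DATA_EXTENSIONS
    ]


def staged_paths():
    """List files staged for commit (added, copied, or modified)."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line for line in result.stdout.splitlines() if line]


def main():
    """Return 1 if any staged file looks like data, otherwise 0."""
    offenders = find_data_files(staged_paths())
    if offenders:
        print("Refusing to commit files that look like data:")
        for path in offenders:
            print(f"  {path}")
        return 1
    return 0
```

To try it, the script would go in `.git/hooks/pre-commit`, be made executable, and end with a call such as `sys.exit(main())`; Git aborts the commit when the hook exits with a non-zero status. Whether a commit made by clicking buttons in a GUI runs the hook depends on the client, so treat this as a safety net, not a guarantee.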


> Git is not a database

Um, yeah, it is - by most reasonable definitions of the word "database".

No doubt there are a few unusually-narrow definitions of "database" out there that would exclude Git, but I'm pretty certain they're in the minority.


I'm not talking about pedantry, but pragmatism. Git is not designed to be used as a conventional database, and should not be.



