
For normal VCSs, you are absolutely right. And you're actually right for mine too, but I decided to redo the math to make sure.

My VCS will track things at a finer level than documents. In a C file, it will track individual functions and structs. In a Java file, it will track individual fields and classes. In a Microsoft Word document, it might track individual paragraphs. And in a Blender file, it will track each object, material, texture, etc. individually.

Yes, it will handle binary files.

Anyway, it will also be designed for non-technical users. To that end, it will hook into the source software and do a "commit" every time the user saves.

It will also track individual directories to make renames work.

I am a ctrl+s freak, so I save once a minute or more. However, other people are not, so let's assume 10 minutes (for autosave, perhaps).

Now let's assume a monorepo for a company of 100,000 people. And let's assume that when they save every 10 minutes, they save one object in one file (also tracked) two directories down. That means they create 5 hashes every 10 minutes (the fifth is the top level).
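To make the count of 5 concrete, here is a minimal sketch, assuming a Merkle-style layout where every tracked level gets a new hash whenever something beneath it changes. The function name `hashes_for_save`, the use of SHA-256, and the simplification that each parent hashes only its one changed child are all assumptions for illustration, not the actual design.

```python
import hashlib


def h(data: bytes) -> str:
    # SHA-256 is an assumption; the thread only talks about hash widths.
    return hashlib.sha256(data).hexdigest()


def hashes_for_save(object_bytes: bytes, depth_dirs: int = 2) -> list[str]:
    """Return the new hashes one save creates: object, file, directories, top level.

    Simplification: each parent hashes only the changed child. A real tree
    would hash all of its children together.
    """
    chain = [h(b"object:" + object_bytes)]           # the edited object
    chain.append(h(b"file:" + chain[-1].encode()))   # the file containing it
    for _ in range(depth_dirs):                      # each enclosing directory
        chain.append(h(b"dir:" + chain[-1].encode()))
    chain.append(h(b"root:" + chain[-1].encode()))   # the top level
    return chain


new_hashes = hashes_for_save(b"int main() { return 0; }")
print(len(new_hashes))  # 5
```

One object, one file, two directories, and the top level each get rehashed, which is where the 5 comes from.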

Let's assume an effective 6-hour work day.

That's 5 objects times 6 times per hour times 6 hours. That's 180 objects a day per person.

That's 18,000,000 total objects per day. Times 5 for days in a week, times 50 for work weeks in a year.

That's 4.5 billion.

Let's multiply that by 40 for 40 years that the repo exists, which includes some of the oldest software.

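That's 1.8e11 objects. According to the table in [1], at that count the chance of a collision with a 128-bit hash would already be in the same range as the uncorrectable bit error rate of a typical hard disk (10^-18 to 10^-15), so 128 bits would not be enough.

However, a 256-bit hash would give us about 10^31 objects before the collision probability reaches 10^-15, which is roughly 10^20 times the 40-year total.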
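The arithmetic above can be checked with a short sketch. It uses the standard birthday approximation p ≈ n² / 2^(bits+1), which only holds while p is small. The variable and function names are made up for illustration.

```python
import math

# Assumptions taken from the estimate above.
people = 100_000
hashes_per_save = 5
saves_per_hour = 6
hours_per_day = 6
days_per_week = 5
weeks_per_year = 50
years = 40

per_person_per_day = hashes_per_save * saves_per_hour * hours_per_day  # 180
per_day = per_person_per_day * people                                  # 18,000,000
per_year = per_day * days_per_week * weeks_per_year                    # 4.5 billion
total = per_year * years                                               # 1.8e11


def collision_probability(n: int, bits: int) -> float:
    # Birthday approximation; valid only when the result is much less than 1.
    return n * n / 2 ** (bits + 1)


def max_objects(bits: int, p: float) -> float:
    # Inverse of the approximation: how many objects until probability p.
    return math.sqrt(2 ** (bits + 1) * p)


p128 = collision_probability(total, 128)
p256 = collision_probability(total, 256)
n256 = max_objects(256, 1e-15)

print(f"total objects: {total:.1e}")
print(f"128-bit collision probability: {p128:.1e}")
print(f"256-bit collision probability: {p256:.1e}")
print(f"256-bit objects at p=1e-15: {n256:.1e}, headroom: {n256 / total:.1e}x")
```

With these inputs the 128-bit probability lands between 10^-18 and 10^-15, while the 256-bit headroom comes out on the order of 10^20 times the 40-year total.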

Yep, you're absolutely right that 512 bits is overkill. I stand corrected.

[1]: https://en.m.wikipedia.org/wiki/Birthday_attack




You're tracking things at the content level? How will you deal with files that are purposely broken, or which cause the parser to take impractical (but finite) times to complete? Also, tracking the history of a class makes sense to some extent, but you say you want to commit every time there's a save. How will you maintain a history when most commits are likely to contain unparseable code and so break the continuity of objects?


Good questions.

> How will you deal with files that are purposely broken, or which cause the parser to take impractical (but finite) times to complete?

I've never seen a language parser do that, but if I run into a language that does that, I'll probably have my VCS track it at the file level, based on tokens or lines.

Dumb languages don't get nice things. :)

> How will you maintain a history when most commits are likely to contain unparseable code and so break the continuity of objects?

This is less of a problem with binary files (assuming the source software does not have bugs in output), but with source files, you're right that that problem does exist.

As of right now, I would do a token-based approach. This approach removes the need for whitespace-only commits, and if I track the tokens right, I should be able to identify which right brace ended the function before the broken code was saved. Then I would just save the function as broken, still ending at that same right brace.

For example, say you have this:

    int main() {
        return 0;
    }

My VCS would know that the right brace corresponds to the end of the function.

Then you write this:

    int main() {
        if (global_bool) {
        return 0;
    }

Yes, a dumb system might think that the right brace is for the `if`.

However, if you break it down by tokens, the VCS will see that the tokens `if (global_bool) {` were added before the `return`, so it should be able to tell that the right brace still ends the function.

I hope that makes sense.

Another plausible way to do it (at least in C) would be to look for things that look like declarations. The series of tokens `<type> <name> <left_paren>` is probably a function declaration. Java would be easier; its declarations are more wordy.
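A minimal sketch of that heuristic, assuming the same kind of naive tokenizer and a small hand-picked keyword list (both are assumptions, and `find_function_like` is a hypothetical name). It does not handle pointer return types such as `char *name(`, comments, string literals, or macros.

```python
import re

TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
IDENT_RE = re.compile(r"[A-Za-z_]\w*")

# Keywords that can precede an identifier or '(' without starting a declaration.
KEYWORDS = {"if", "else", "while", "for", "do", "switch", "return", "sizeof"}


def is_ident(tok: str) -> bool:
    return IDENT_RE.fullmatch(tok) is not None and tok not in KEYWORDS


def find_function_like(src: str) -> list[str]:
    """Return names that appear in a `<type> <name> (` token sequence."""
    tokens = TOKEN_RE.findall(src)
    found = []
    for i in range(len(tokens) - 2):
        type_tok, name_tok, paren = tokens[i:i + 3]
        if paren == "(" and is_ident(type_tok) and is_ident(name_tok):
            found.append(name_tok)
    return found


# Deliberately broken input: the `if` block is never closed.
src = (
    "static void helper(int x) {\n"
    "    if (x) {\n"
    "}\n"
    "\n"
    "int main() {\n"
    "    return helper(1);\n"
    "}\n"
)

print(find_function_like(src))  # ['helper', 'main']
```

Because it only looks at three-token windows, it still finds both declarations even though the file would not parse. The call `return helper(1)` is skipped only because `return` is in the keyword list.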

I still have to prove this is possible, but I think it is.


> As of right now, I would do a token-based approach

C++ is gonna get really funky there, with e.g. templates


Agreed. I'm starting with C.


In those cases you can just do error recovery in the parser (truncating an erroring function, for example) and then store the out-of-band information necessary to reconstruct the original file.

This is also necessary to deal with whitespace, for example (if you just reformat the code, you didn't change the AST but you changed the file).
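For the whitespace half of this, a minimal sketch: split the file into tokens (the content) and a parallel list of the whitespace before each token (the out-of-band layout), so the original bytes can be rebuilt exactly. The tokenizer and the names `split_layout` and `rebuild` are assumptions for illustration, and this says nothing about parser error recovery.

```python
import re

# Every character is either whitespace or part of some token, so nothing is lost.
PIECE_RE = re.compile(r"\s+|[A-Za-z_]\w*|\d+|\S")


def split_layout(src: str) -> tuple[list[str], list[str]]:
    """Split source into tokens and the whitespace that precedes each token.

    layout has one more entry than tokens: the last one is trailing whitespace.
    """
    tokens, layout = [], []
    pending = ""
    for match in PIECE_RE.finditer(src):
        piece = match.group()
        if piece.isspace():
            pending += piece
        else:
            layout.append(pending)
            tokens.append(piece)
            pending = ""
    layout.append(pending)
    return tokens, layout


def rebuild(tokens: list[str], layout: list[str]) -> str:
    parts = []
    for ws, tok in zip(layout, tokens):
        parts.append(ws)
        parts.append(tok)
    parts.append(layout[-1])
    return "".join(parts)


a = "int main() {\n    return 0;\n}\n"
b = "int main()\n{\n\treturn 0;\n}\n"

tokens_a, layout_a = split_layout(a)
tokens_b, layout_b = split_layout(b)

print(tokens_a == tokens_b)              # True
print(layout_a == layout_b)              # False
print(rebuild(tokens_a, layout_a) == a)  # True
```

A reformat then changes only the layout list, so it can be stored without touching the history of the tokens themselves.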



