Hacker News

Smart idea doing the hash choice per-commit. Just make sure that somebody putting in an obscure hash doesn't mess up everybody's usage of the repo if they don't have a library to evaluate that hash installed.



I agree.

There will be a set of presets, each a hash function plus its settings; if BLAKE3 is ever broken, I'll add SHA-3 or another function, with its own settings, as a new preset.

The per-commit storage will then be an enum identifying the hash and its settings.

This will let me do other things, like letting companies use a 512-bit hash if they expect the repo to be large.


> letting companies use a 512-bit hash if they expect the repo to be large.

A repo would have to have more than 1e32 documents for a one in a trillion chance of a collision with a 256 bit hash. (Total annual world data production is estimated at less than 1e24 bytes.)

A 512 bit hash thus seems overkill for almost all purposes.

https://en.wikipedia.org/wiki/Birthday_problem

https://www.weforum.org/agenda/2021/05/world-data-produced-s...
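The claim can be sanity-checked with the standard birthday-bound approximation p ≈ n²/2^(bits+1), solved for n (the function name here is mine):

```python
import math

def items_for_collision_prob(bits, p):
    # Birthday-bound approximation: p ≈ n^2 / 2^(bits + 1),
    # solved for the number of hashed items n.
    return math.sqrt(p * 2 ** (bits + 1))

# Items needed for a one-in-a-trillion collision chance with a 256-bit hash.
print(f"{items_for_collision_prob(256, 1e-12):.1e}")  # ~4.8e32
```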


For normal VCSes, you are absolutely right. And you're actually right for mine too, but I decided to redo the math to make sure.

My VCS will track things at a finer level than documents. In a C file, it will track individual functions and structs. In a Java file, it will track individual fields and classes. In a Microsoft Word document, it might track individual paragraphs. And in a Blender file, it will track each object, material, texture, etc. individually.

Yes, it will handle binary files.

Anyway, it will also be designed for non-technical users. To that end, it will hook into the source software and do a "commit" every time the user saves.

It will also track individual directories to make renames work.

I am a ctrl+s freak, so I save once a minute or more. However, other people are not, so let's assume 10 minutes (for autosave, perhaps).

Now let's assume a monorepo for a company of 100,000 people. And let's assume that when they save every 10 minutes, they save one object in one file (also tracked) two directories down. That means they create 5 hashes every 10 minutes (the fifth is the top level).

Let's assume an effective 6-hour work day.

That's 5 objects, times 6 saves per hour, times 6 hours. That's 180 objects a day per person.

That's 18,000,000 total objects per day. Times 5 for days in a week, times 50 for work weeks in a year.

That's 4.5 billion.

Let's multiply that by 40 for 40 years that the repo exists, which includes some of the oldest software.

That's 1.8e11 objects. According to [1], at that scale a 128-bit hash's collision probability would already exceed the uncorrectable bit-error rate of a typical disk.

However, a 256-bit hash would allow about 10^31 objects before reaching that point, which gives us roughly 10^20 times 40 years of headroom.
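The whole chain of arithmetic, plus the resulting birthday bounds, in a quick script (the numbers are the assumptions from this thread):

```python
hashes_per_save = 5      # object + file + two directories + repo root
saves_per_hour = 6       # one save every 10 minutes
hours_per_day = 6        # effective work day
people = 100_000
days_per_year = 5 * 50   # 5-day weeks, 50 work weeks
years = 40

per_person_per_day = hashes_per_save * saves_per_hour * hours_per_day
total = per_person_per_day * people * days_per_year * years
print(per_person_per_day, f"{total:.1e}")  # 180 1.8e+11

def collision_prob(n, bits):
    # Birthday approximation: p ≈ n^2 / 2^(bits + 1)
    return n * n / 2 ** (bits + 1)

print(f"{collision_prob(total, 128):.1e}")  # ~4.8e-17
print(f"{collision_prob(total, 256):.1e}")  # ~1.4e-55
```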

Yep, you're absolutely right that 512 bits is overkill. I stand corrected.

[1]: https://en.m.wikipedia.org/wiki/Birthday_attack


You're tracking things at the content level? How will you deal with files that are purposely broken, or which cause the parser to take impractical (but finite) times to complete? Also, tracking the history of a class makes sense to some extent, but you say you want to commit every time there's a save. How will you maintain a history when most commits are likely to contain unparseable code and so break the continuity of objects?


Good questions.

> How will you deal with files that are purposely broken, or which cause the parser to take impractical (but finite) times to complete?

I've never seen a language parser do that, but if I run into a language that does that, I'll probably have my VCS track it at the file level, based on tokens or lines.

Dumb languages don't get nice things. :)

> How will you maintain a history when most commits are likely to contain unparseable code and so break the continuity of objects?

This is less of a problem with binary files (assuming the source software does not have bugs in output), but with source files, you're right that that problem does exist.

As of right now, I would take a token-based approach. That removes the need for whitespace-only commits, and if I track the tokens correctly, I should be able to identify which right brace used to end the function before the broken code was saved. Then I would save the function as broken, still ending at that same right brace.

For example, say you have this:

    int main() {
        return 0;
    }
My VCS would know that the right brace corresponds to the end of the function.

Then you write this:

    int main() {
        if (global_bool) {
        return 0;
    }
Yes, a dumb system might think that the right brace is for the `if`.

However, if you break it down by tokens, the VCS will see that `if (global_bool) {` were added before the return, so it should be able to tell that the right brace still ends the function.

I hope that makes sense.

Another plausible way to do it (at least in C) would be to look for things that look like declarations. The series of tokens `<type> <name> <left_paren>` is probably a function declaration. Java would be easier; its declarations are more wordy.

I still have to prove this is possible, but I think it is.


> As of right now, I would do a token-based approach

C++ is gonna get really funky there, with e.g. templates


Agreed. I'm starting with C.


In those cases you can just do error recovery in the parser (truncating an erroring function, for example) and then store the out-of-band information necessary to reconstruct the original file.

This is also necessary to deal with whitespace, for example: if you just reformat the code, you didn't change the AST, but you did change the file.
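One way to sketch the whitespace point: identify content by a hash of its token stream rather than its raw bytes, so a pure reformat keeps the same object identity (the crude regex tokenizer here is only for illustration):

```python
import hashlib
import re

def token_hash(source):
    # Hash the token stream, not the raw bytes, so
    # whitespace-only edits don't create a new object.
    tokens = re.findall(r"\w+|[^\w\s]", source)
    return hashlib.sha256("\x00".join(tokens).encode()).hexdigest()

compact = "int main() { return 0; }"
reformatted = "int main()\n{\n    return 0;\n}\n"
assert token_hash(compact) == token_hash(reformatted)
```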


Maybe you’re already aware, but you glossed over something: since you’re using the hash to locate/identify the content (you mentioned Merkle and git), if you support multiple hash functions you need some assurance that the chance of collisions is low across all supported hash functions. For example, two identical functions that differ only in the value of their padding bytes (when the input size doesn’t match the block size) can’t coexist.


You are absolutely right. And yes, I am aware.

Location will actually be done by prefixing the hash with the value of the enum for the hash function/settings pair that made the hash.
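A hedged sketch of that scheme, with a 16-bit preset enum packed in front of the digest (the preset table and names are hypothetical, and BLAKE2b stands in for BLAKE3, which isn't in Python's hashlib):

```python
import hashlib
import struct

# Hypothetical preset table: 16-bit enum value -> hash constructor.
PRESETS = {
    1: lambda data: hashlib.blake2b(data, digest_size=32),
    2: hashlib.sha3_256,
}

def object_id(preset, data):
    # A big-endian 16-bit enum prefix keeps IDs from different
    # hash/settings pairs disjoint by construction.
    return struct.pack(">H", preset) + PRESETS[preset](data).digest()

def needs_newer_software(preset):
    # Enum values are only ever added, so an unknown value means
    # the repo was written by a later version of the tool.
    return preset not in PRESETS

a = object_id(1, b"int main() { return 0; }")
b = object_id(2, b"int main() { return 0; }")
assert a != b and a[:2] != b[:2]  # different presets, distinct prefixes
```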


Since you seem to have done a fair bit of research in this area, do you have any opinions or thoughts about the Multihash format?

https://multiformats.io/multihash/

It fills in some of the blanks in your "prefixing the hash with the value of the enum for the hash" step.


The multihash format is an excellent format that I am tempted to use.

However, there are two general problems:

* The spec is not finished, and it doesn't look like much progress is being made.

* While I agree with the FAQ that agreeing on a set of hash functions is possible, the format only allows 256 possible hash functions, so it can run out of space, especially since there can be multiple variants of the same function (BLAKE2b and BLAKE2s, for example), and especially since they want to allow non-cryptographic hash functions.

Then specifically for my use case:

* There is the problem brought up by AdamN [1]: if multihash is supported, an obscure hash may be supported, and that may cause problems.

* As an extension of that, once a hash function and set of settings is supported, it's supported forever, so I want to be more picky, and an enum allows me to restrict what is usable.

* By using a 16-bit enum, I have more space to grow.

* And finally, by using an enum, if my software encounters a repo with an enum value greater than all the values it knows, it knows that the repo needs a later version of the software, since I will only ever add enum values.

[1]: https://news.ycombinator.com/item?id=38250444


The zpaq archiver solves that by including decompression bytecode inside its archives. So check whether the repository supports your algorithm; if not, include the algorithm inside your commit.


Remote code execution by design. What could possibly go wrong?



