
> empirically measured that completing a task... is twice as fast in [Rust] than in [C++]

I have not read up on which empirically measured tasks you're referring to, apologies. The reason I'm curious about what the tasks are is that, depending on the task, navigability may not matter.

For example, if the task is "build a tool that does X", then navigability of the code does not matter. Once built, the tool does X, and there's no reason to revisit the code, and thus no reason to navigate the code.

But if the task is "Given a tool that already does W, X, Y, make the tool also do X', Y', and Z", then navigability of the code matters. This is because the coder must understand what the tool already does, and where the changes need to be made.

For most of my professional life (and, I'm willing to bet, for most other coders here as well), I have found myself working on the second kind of task far more often than the first.

But I'm not interested in Rust vs C++. I'd be more interested in the results of "given a version that makes heavy use of type inference vs one that doesn't, how quickly can someone new to the project add X', Y', and Z?" That would be a more appropriate test for what the author describes here. And I'd imagine that those using sufficiently advanced IDEs would beat out those without, regardless of whether type inference is used, and would probably be slightly faster when given the heavily type-inferenced version.


> This post can definitely be considered a “religious” opinion piece

The author certainly has that right, because the post steps on two programming religion landmines, from how I read it:

1. Strict static typing (without type inference) is good.

2. Code should be written to allow IDEs to enhance navigability, rather than written on the assumption that IDEs will be the sole provider of navigability.

I believe there is a point to be made in the "when we don't know what we're getting back, that harms navigability" camp. But as another commenter posted, there's a point to be made in the "when we overspecify what we're getting back every time, that can harm readability, too" camp.

I can't express exactly where this balance is. It's somewhere between poetry and a legal document: the kind of prose where you can really get into a good book and enjoy the world that the author presents. Some people really like the beauty of a short poem. Other people may require precise wording that leaves no room for enjoyment or interpretation. The rest of us have most of the fun somewhere in between.

Where that "in between" equivalent would be in my day-to-day programming, I'm not entirely sure, because what I'm writing could be a short script where brevity is vital (poetry-ish), or some section of unfortunately highly complex code with lots of tests for edge cases (legalese), or all the other code where I'm still world-building and conveying ideas (prose). And I believe that complexity should be spelt out as precisely as it can be in the code itself, rather than relying on the hope that somebody else is using the same IDE and features as me. I've tried using type inference where it seems fine to use, and spelling out the exact type that a variable wants where it isn't clear what might get returned, all in the same app, but it comes across as sloppily inconsistent in my mind. Ah well.


We have a similar saying: "You can lead a horse to water but you can't make 'em drink." At least, it seems similar to me.


Oh yeah, that's exactly what I mean. Thanks for reminding me of the English / American idiom.


Answer: because they don't want to be promoted to manager; they'd rather be coding.


That's just another myth spread by unskilled managers.


Writing a program like this is one of the first exercises I give myself when learning a new programming language, because it touches a little bit of everything (reading files, output, CLI, using libraries, hashmaps, functions, loops, conditionals, etc) and isn't too onerous to implement.

My latest (it's a few years old at this point) is lsdup (rust version) using blake3 for hashing the content: https://github.com/redsaz/lsdup/

All it does is list the groups of duplicate files, grouped by hash, groups ordered by size. I'll usually pipe the output to a file, then do whatever I want to the list, and run a different script to process the resulting list. It works fine enough.


Checking the file size first is good, but reading every byte of every file that has at least one file of matching size and doing a ton of XOR steps (or whatever BLAKE3 is) for all those bytes can't be optimal, when different files probably differ within the first few bytes.

If you only have two files, read them X bytes at a time in parallel and XOR the bytes directly, stopping at the first differing byte. For more than two files, to a first approximation: you have your list of file pointers; you read the first X bytes of all of them and then sort your list in place based on those first X bytes (sorting is a linear operation if all the X bytes are the same); then you iterate over that sorted list, processing each run of identical X bytes in a depth-first fashion. If you have just one element that starts with a given set of X bytes, it's unique and you don't have to process it anymore; otherwise repeat the process, reading the next X bytes but only for the files that started with the same bytes. X is probably 8, so that you can efficiently XOR 64 bits together for your comparisons.
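A rough Rust sketch of that sort-and-recurse idea, assuming the candidate files were already grouped by identical size; it reopens and seeks each file at every level instead of keeping file handles around, purely to keep the sketch short:

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};
use std::path::PathBuf;

const X: usize = 8; // compare 64 bits at a time, as suggested above

/// `files` are assumed to all have the same size and identical contents up to
/// `offset`. Runs of files that stay identical through EOF are pushed to `dups`.
fn refine(files: Vec<PathBuf>, offset: u64, dups: &mut Vec<Vec<PathBuf>>) {
    // Read the next X bytes of every file, then sort so that potential
    // duplicates end up adjacent to each other.
    let mut keyed: Vec<([u8; X], usize, PathBuf)> = Vec::new();
    for path in files {
        let mut key = [0u8; X];
        let Ok(mut f) = File::open(&path) else { continue };
        if f.seek(SeekFrom::Start(offset)).is_err() {
            continue;
        }
        let n = f.read(&mut key).unwrap_or(0);
        keyed.push((key, n, path));
    }
    keyed.sort_unstable_by_key(|entry| (entry.0, entry.1));

    // Walk the sorted list; each run of identical keys is handled depth-first.
    let mut i = 0;
    while i < keyed.len() {
        let mut j = i + 1;
        while j < keyed.len() && keyed[j].0 == keyed[i].0 && keyed[j].1 == keyed[i].1 {
            j += 1;
        }
        if j - i >= 2 {
            let run: Vec<PathBuf> = keyed[i..j].iter().map(|(_, _, p)| p.clone()).collect();
            if keyed[i].1 < X {
                dups.push(run); // hit EOF while still identical: a duplicate group
            } else {
                refine(run, offset + X as u64, dups); // recurse on the next X bytes
            }
        }
        // Runs of length 1 are unique files and are simply dropped.
        i = j;
    }
}
```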


I've considered adding a "just check the first {user-configged-bytes}" mode, which would offer most of the speed-up you describe, and which I think jdupes does (maybe? It's been a while since I looked at it). I think the cost of opening the file and reading the first buffered page of bytes would dominate the time of the operation, especially if one were to do async reads (which my program does NOT do; I should look into it).

Worth some performance measurements, anyway!


(Warning: rambling ahead, since in the past I've spent a decent amount of time on the same problem)

> using blake3 for hashing the content

Using any hash algorithm isn't a good design, at least for SSD storage.

First, it invites some degree of cryptographic risk: if a collision is ever discovered in the hash, then your program can be tricked into reporting a false duplicate. Whether that is a problem depends on the use case, but it isn't ideal.

Worse, though, is that it just doesn't make any sense algorithmically. Consider the simple example of having two 1TB files and you want to discover if they're identical. You could do a cryptographic hash of each of them and (barring any malicious collisions) tell whether they're the same.

However, now imagine that those two files differed in the very first byte. Now it seems that you could have figured out they're different a lot faster, right?

So what you really want to do is read both files chunk-by-chunk (probably some number of disk blocks at a time) so you can detect the files-are-different case early. (After all, the common case of files that differ is that they'll differ early!). You could still compare the chunks using a cryptographic hash, but now there is no benefit: you can just compare the two blocks of memory directly faster than you can take a crypto hash of them. C's memcmp() works fine but since you are probably working on fixed-sized aligned blocks you can do slightly better with a hand-rolled SIMD loop.
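For the two-file case, a minimal Rust version of that early-exit comparison might look like this (the 1 MiB chunk size is arbitrary; slice equality on `[u8]` effectively compiles down to a memcmp-style comparison):

```rust
use std::fs::File;
use std::io::Read;

/// Compare two files chunk by chunk, bailing out at the first difference.
/// Assumes the caller has already checked that the file sizes match.
fn same_contents(path_a: &str, path_b: &str) -> std::io::Result<bool> {
    let mut a = File::open(path_a)?;
    let mut b = File::open(path_b)?;
    let mut buf_a = vec![0u8; 1 << 20]; // 1 MiB at a time
    let mut buf_b = vec![0u8; 1 << 20];
    loop {
        let n = a.read(&mut buf_a)?;
        if n == 0 {
            return Ok(true); // both files exhausted without finding a mismatch
        }
        b.read_exact(&mut buf_b[..n])?; // same size, so this read should succeed
        if buf_a[..n] != buf_b[..n] {
            return Ok(false); // files differ somewhere in this chunk: stop early
        }
    }
}
```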

The one advantage that a cryptographic hash gives you is that it provides a memory-efficient way of reading all of file A and then all of file B. Therefore if disk seeks are expensive, it can be a benefit (again, if you can accept the risk of a malicious false-positive). However if the files are SSD backed and you have enough RAM to read a decent sized chunk of each file into memory simultaneously this ceases to be a problem.

To extend this from 2 files to many, first stat() all of the files and group them by file size. After all, two files of different sizes aren't going to ever be equal. You can think of the size as a "hash" of the contents that you get for free. Any files that are 0 bytes are (of course) duplicates of each other so you can just return those as "hits". Any file that has a unique size (which thankfully is often the common case) is not a duplicate of anything and you don't even have to open it. If you care about hardlinks, you also want to track the inode numbers at this step so you can avoid comparing two files that are actually the same inode.
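The stat-and-group step might look roughly like this sketch (Unix-only because of the `MetadataExt` inode check; the function name is my own, not from any existing tool):

```rust
use std::collections::{HashMap, HashSet};
use std::fs;
use std::os::unix::fs::MetadataExt; // for dev()/ino(); Unix-only in this sketch
use std::path::PathBuf;

/// Returns (groups of same-sized files worth comparing, zero-byte files).
fn group_by_size(paths: Vec<PathBuf>) -> (Vec<Vec<PathBuf>>, Vec<PathBuf>) {
    let mut by_size: HashMap<u64, Vec<PathBuf>> = HashMap::new();
    let mut seen_inodes: HashSet<(u64, u64)> = HashSet::new(); // (device, inode)
    let mut empties = Vec::new();

    for path in paths {
        let meta = match fs::metadata(&path) {
            Ok(m) if m.is_file() => m,
            _ => continue, // skip directories and anything we can't stat
        };
        // Hardlinks: two paths to the same inode are trivially "identical",
        // so only keep the first path seen for each (device, inode) pair.
        if !seen_inodes.insert((meta.dev(), meta.ino())) {
            continue;
        }
        if meta.len() == 0 {
            empties.push(path); // all zero-byte files are duplicates of each other
        } else {
            by_size.entry(meta.len()).or_default().push(path);
        }
    }

    // A unique size means a unique file: it never even needs to be opened.
    let candidates = by_size.into_values().filter(|g| g.len() > 1).collect();
    (candidates, empties)
}
```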

Then for any group of files with the same size, read each block in turn. The tricky part is that you want to subdivide the group by the contents of the file. I.e., if you have 4 same-sized files and two have contents "AAAA..." and the other two have contents "BBBB..." then you didn't find any unique files yet, but you need to split the set of 4 files into two new subsets of 2 files each. Data-structure-wise, keep a worklist of items consisting of (1) a set of (at least two) files that could be duplicates and (2) how many bytes you've already verified are the same among them. Then when you encounter this "split" scenario you can just push a new worklist item and continue working on one of the groups.

The bit you need to be careful of is not introducing bad worst-case performance here in a hard case (e.g. you have a million potentially-duplicate files, and reading one block separates them into 5000 groups each with 200 members). Just a decent hash-table is enough with the key being the whole disk block.

Because maintaining this hash table adds a bit of complication, it can seem worthwhile to build a special-case for groups of exactly two files, where you can just do the simple read-and-compare. But then you can re-combine this by observing that the case where you have N files that are all the same is also worth optimizing for. So instead, just do the read-and-compare on all of the files until you find the first two that are different -- only then start building the hashtable when you have two different blocks and more files still to read. That way the common-ish case where you have many files all the same can be handled as fast as possible.
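A sketch of that worklist-plus-hashtable shape, leaving out the two-file fast path described in the last paragraph (the block size and names are arbitrary, and it assumes each incoming group already contains files of a single size):

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};
use std::path::PathBuf;

const BLOCK: usize = 64 * 1024;

/// One unit of pending work: a set of files whose first `verified` bytes are
/// already known to be identical.
struct Work {
    files: Vec<PathBuf>,
    verified: u64,
}

fn find_dups(size_groups: Vec<Vec<PathBuf>>) -> Vec<Vec<PathBuf>> {
    let mut dups = Vec::new();
    let mut worklist: Vec<Work> = size_groups
        .into_iter()
        .map(|files| Work { files, verified: 0 })
        .collect();

    while let Some(item) = worklist.pop() {
        // Read the next block of every file in the group and bucket the files
        // by the block's exact contents (the whole block is the hash key).
        let mut buckets: HashMap<Vec<u8>, Vec<PathBuf>> = HashMap::new();
        let mut hit_eof = false;
        for path in item.files {
            let mut block = vec![0u8; BLOCK];
            let Ok(mut f) = File::open(&path) else { continue };
            if f.seek(SeekFrom::Start(item.verified)).is_err() {
                continue;
            }
            let n = f.read(&mut block).unwrap_or(0);
            block.truncate(n);
            hit_eof = n == 0; // same-sized files reach EOF together
            buckets.entry(block).or_default().push(path);
        }
        for (_block, files) in buckets {
            if files.len() < 2 {
                continue; // split off on its own: proven unique
            }
            if hit_eof {
                dups.push(files); // identical through the final block
            } else {
                // The "split" case just becomes more worklist items.
                worklist.push(Work {
                    files,
                    verified: item.verified + BLOCK as u64,
                });
            }
        }
    }
    dups
}
```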

There are things that operating systems could provide that would make this even better:

1. It would be nice if it were easy to estimate the likely seek cost before picking an algorithm. If the file system would simply indicate whether it thought it was backed by spinning rust that would be great.

2. Also if you could ask for the filesystem to read a non-fixed number of bytes (without resorting to async-I/O) that would be helpful. ("Give me up to 1MB of the file, but stop early if you have to seek to a new extent"). Having the ability to basically read the file extent-by-extent instead of block-by-block would mean we could be seek-efficient while reading multiple files in parallel.

3. Finally, it would be great if there was some portable way to access any block-hash metadata the filesystem itself keeps. A filesystem that does its own deduplication work might already know that two blocks must have different contents without reading them because it already scanned them. On the flip-side, if the filesystem supports copy-on-write file snapshots then it could tell us in advance that two inodes are really the same underlying file before we open them.


So, my deduplication is about merging various archives I have of various things I've pulled off of 4chan since 2005. My algorithm is a bit backward because it starts with the proposition that all files are the same. This will turn out to save on comparisons.

Differentiators start with file size difference (obv), then an MD5 of the first one percent of the file, then a SHA-1 of the first one percent of the file, then an MD5 of the first ten percent, and so on. The byte-by-byte comparison is the last-ditch effort. A differentiator is only triggered by having two or more files together in a subgroup, and the results are stored in a database.

So we start with a massive group of all of the files. Then subgroups are made by file size. Some subgroups might only have one member and so we stop there. If not, we start with the MD5 of the first one percent ...
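The subgrouping driver for that chain of differentiators could be sketched in Rust like this; the `Differentiator` type and the fingerprint helpers it would point at ("size", "MD5 of first 1%", "SHA-1 of first 1%", ...) are placeholders for illustration, and the database caching is only hinted at in a comment:

```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// A differentiator maps a file to some fingerprint; files with different
/// fingerprints cannot be duplicates. The real chain would be size, MD5 of
/// the first 1%, SHA-1 of the first 1%, MD5 of the first 10%, and so on.
type Differentiator = fn(&PathBuf) -> Vec<u8>;

fn subgroup(files: Vec<PathBuf>, chain: &[Differentiator]) -> Vec<Vec<PathBuf>> {
    // No differentiators left: whatever is still grouped together goes on to
    // the final byte-by-byte comparison (not shown here).
    let Some((diff, rest)) = chain.split_first() else {
        return vec![files];
    };
    let mut buckets: HashMap<Vec<u8>, Vec<PathBuf>> = HashMap::new();
    for path in files {
        let key = diff(&path); // in the real tool this result is cached in a database
        buckets.entry(key).or_default().push(path);
    }
    buckets
        .into_values()
        .filter(|g| g.len() > 1) // subgroups with one member are unique: stop there
        .flat_map(|g| subgroup(g, rest))
        .collect()
}
```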

I will probably work on image matching, eventually.

The other reason I made it was that, after the dupes were detected, I wanted custom rules for what to do with them.

I know Microsoft had a metadata dream about files, and while I don't disagree with it, most people just ... don't do it, or they do it inconsistently. I've worked with librarians, people who would agree on where to put a given book in a vast series of shelves, but when it comes to digital works, they get all sloppy. I think one of the better possible frontiers for AI is tagging documents and images. But it's still quite a ways away. Just as an example, one would think that Microsoft would have a spellchecker for filenames by now.


I like the strategy of only hashing the first part of a file as a multiphase approach to deduping, in order to quickly eliminate unique files. I wish I had done it that way with my util. Maybe for v2!


I think my next pass at it will be to merge the MD5 and SHA-1 steps so that we get two outputs at the end; that way I would save on the file reading.

But after that, I think one percent of the end of the file. Then ten percent of the start of the file, then ten percent of the end of the file ...

Given that metadata is typically at the end or beginning of a file, those seem like the best place to look for differences.
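Merging the two hashes into one read pass could look something like this sketch, assuming the RustCrypto `md-5` and `sha1` crates (with compatible `digest` versions); for the "end of the file" step you'd pass `offset = file_size.saturating_sub(len)`:

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

use md5::{Digest, Md5}; // assumed: the RustCrypto `md-5` crate
use sha1::Sha1;         // assumed: the RustCrypto `sha1` crate

/// Hash `len` bytes starting at `offset` with MD5 and SHA-1 in a single pass,
/// so the bytes only come off the disk once but produce two fingerprints.
fn fingerprint(path: &str, offset: u64, len: u64) -> std::io::Result<(Vec<u8>, Vec<u8>)> {
    let mut f = File::open(path)?;
    f.seek(SeekFrom::Start(offset))?;
    let mut md5 = Md5::new();
    let mut sha1 = Sha1::new();
    let mut buf = vec![0u8; 64 * 1024];
    let mut remaining = len;
    while remaining > 0 {
        let want = remaining.min(buf.len() as u64) as usize;
        let n = f.read(&mut buf[..want])?;
        if n == 0 {
            break; // reached EOF before `len` bytes
        }
        md5.update(&buf[..n]);  // the same buffer feeds both hashers
        sha1.update(&buf[..n]);
        remaining -= n as u64;
    }
    Ok((md5.finalize().to_vec(), sha1.finalize().to_vec()))
}
```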

I would be open to other hashes so long as they were drop-in easy. I'm not concerned about malicious, forced collisions because they would have to overcome two different kinds of hashing, and the most it would earn is a delay, since there's always a byte-by-byte comparison at the very end.

I suspect I would similarly want to use multiple fingerprinting methods for the visual characteristics of an image file.


> I'm not concerned about malicious, forced collisions

Consider taking a look at xxh3, possibly. It seems a pretty decent contender, hashing-speed-wise, if you're willing to trade away secure hashing: https://github.com/Cyan4973/xxHash
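For reference, a tiny sketch of how that might be called from Rust, assuming the `xxhash-rust` crate with its "xxh3" feature enabled:

```rust
use std::fs::File;
use std::io::Read;

use xxhash_rust::xxh3::xxh3_64; // assumed: xxhash-rust crate, "xxh3" feature

/// Cheap, non-cryptographic fingerprint of the first `len` bytes of a file.
fn quick_fingerprint(path: &str, len: usize) -> std::io::Result<u64> {
    let mut buf = vec![0u8; len];
    let mut f = File::open(path)?;
    let n = f.read(&mut buf)?;
    Ok(xxh3_64(&buf[..n]))
}
```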


This isn't the first time I've heard concerns about using hashes for checking file equality. I've considered adding a "paranoid mode" to do the direct byte-for-byte checks, for folks who don't want to introduce even a so-remote-it's-virtually-impossible theoretical chance of a collision occurring.

I'd go into the math about how remote a chance it would be (barring any discovered hash collisions), but others have explained it better than I could elsewhere.


"The math" only matters for random collisions, which are effectively impossible (less likely than the CPU malfunctioning). However that tells you nothing about maliciously constructed files. Even if a hash function has no known collisions today, doesn't mean that they won't be found someday.

But as I tried to describe (probably in way too much detail) the real problem with "hash everything, compare hashes afterwards" is that it implies that you'll be doing I/O to read all of the file's contents even when it isn't needed to prove uniqueness. For a lot of common use cases (big files, few dupes) this can mean doing 1000x more I/O than you need.

Once you design the solution around avoiding unneeded I/O, you find that hashing also stops being useful.


> that tells you nothing about maliciously constructed files. Even if a hash function has no known collisions today, that doesn't mean they won't be found someday.

This is what I meant by "barring any discovered hash collisions" but in retrospect I didn't make that clear enough.

Though, if you're crafting your own malicious different-content-same-size files and storing them in your NAS to cause a hash collision to make them appear the same, then I bet several governments are willing to pay top dollar for your abilities :D

Or, different scenario: say you're hosting a Dropbox-like service and storing files for hundreds of thousands of users. Then you shouldn't be using a duplicate-file-finding util at all; deduplication would be better implemented at a different layer anyway.

In the scenario you describe (lots of big files of the same size, few dupes), I agree hashing the entire file would be wasteful. From my experience on my file server, when I had two or more files of the same size, and the size was larger than a few MB, they likely had the same content.

Put another way, if multiple files of the same sufficiently-large size are encountered, expect to read the entirety of those files anyway, whether hashing or checking byte-for-byte, because they are likely dupes. So, there's still potential for perf gains by avoiding hashing, but I'm willing to bet it isn't as much as one would hope/expect.

(You do have me curious as to how much difference it could make, though)

Edit: I'm also willing to admit that I have so many dupes because my backup strategy is TRASH and I have dupes everywhere, so my scenario could be more unusual than other people's.


> when I had two or more files of the same size, and the size was larger than a few MB, they likely had the same content.

Yes, if the number of files is small enough, then "notice unique file sizes" is really the only optimization that ends up mattering much. If you have a few thousand files and they're each multiple gigabytes then hopefully you'll get lucky and no two will have the same size.

But the ideal tool should also try to handle the opposite case well too.

First, imagine you have a huge collection of unique ~100KiB files. Now the "birthday paradox" means that size collisions are inevitable, so optimizing the total I/O needed to prove two files are different starts to help.

But the pathological case is even worse -- what if nearly all of your files are about the same size? For instance, suppose you have a program that is recording time-series data from a sensor and rotating the file every time it grows to 10MB. This sort of thing happens all the time dealing with scientific data sets -- you might have a directory with thousands of large files, all exactly the same size. If you want to quickly verify none of the files are dups, reading one block from each is far more efficient than hashing them all.


> But the ideal tool should also try to handle the opposite case well too... a huge collection of unique ~100KiB files.

and

> But the pathological case is even worse -- what if nearly all of your files are about the same size?... If you want to quickly verify none of the files are dups, reading one block from each is far more efficient than hashing them all.

I agree this is a worthy consideration. No sense in reading the entirety of each of those files when only reading the first block would do, in order to remove the uniques early. If I were to redo the util, it'd probably be something like the following (with a rough sketch after the list):

1. Group all files into lists of same file sizes.

2. After all files are grouped, eliminate any groupings with just one file; these are unique files.

3. Read the first N bytes to pare down the files in those lists (so now the key is filesize and hash-of-first-N-bytes, or even filesize and first-N-bytes if N is small enough, either way).

4. After each filesize-group is subgrouped by the first-N-bytes evaluation, eliminate any subgroupings with just one file; these are unique.

5. What remains are files fairly likely to be duplicates.

5a. For users that consider this "good enough", allow them to stop here. (Some deduper tools do this)

5b. For everybody else, in order to make sure the files are dupes or not, the files can next be subgrouped by fully byte-for-byte comparison and/or hashed, whatever the user is good with.

6. The remaining groupings of two or more files are dupes.
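A rough, compressed sketch of steps 1 through 5a (stopping at "likely duplicates"; the step 5b confirmation is left out, and the names are made up, not taken from lsdup):

```rust
use std::collections::HashMap;
use std::fs::{self, File};
use std::io::Read;
use std::path::PathBuf;

fn likely_duplicates(paths: Vec<PathBuf>, n: usize) -> Vec<Vec<PathBuf>> {
    // Steps 1-2: group by file size, later dropping groups of one (unique files).
    let mut by_size: HashMap<u64, Vec<PathBuf>> = HashMap::new();
    for p in paths {
        if let Ok(meta) = fs::metadata(&p) {
            if meta.is_file() {
                by_size.entry(meta.len()).or_default().push(p);
            }
        }
    }
    // Steps 3-4: within each same-size group, subgroup by the first N bytes
    // (the raw bytes are the key here; a hash of them would work the same way).
    let mut likely = Vec::new();
    for group in by_size.into_values().filter(|g| g.len() > 1) {
        let mut by_prefix: HashMap<Vec<u8>, Vec<PathBuf>> = HashMap::new();
        for p in group {
            let mut buf = vec![0u8; n];
            let Ok(mut f) = File::open(&p) else { continue };
            let got = f.read(&mut buf).unwrap_or(0);
            buf.truncate(got);
            by_prefix.entry(buf).or_default().push(p);
        }
        // Step 5: whatever is still grouped is fairly likely to be duplicates.
        likely.extend(by_prefix.into_values().filter(|g| g.len() > 1));
    }
    likely
}
```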

In the end I opted not to go for this rewrite, at least not yet, because I got sidetracked thinking about how the whole reason I'm doing this in the first place is that the way I've backed up data across all my machines for years is pretty horrible, all things considered, and now I've got my wife's data to consider too, and I still want my data to be locally available on my laptop, and I don't want to entirely rely on cloud services for syncing, and, and, and... so now I'm making a tool for all that. And then when it is finished, somebody can come along and say "you could've just used owncloud and syncthing and rclone and a pair of external drives, good grief man."

Still though, I might rework the deduper logic anyway.


I've not heard that version before, I like it. The way I usually hear it end:

So, he yells out "102!" and... Crickets.

"What'd I do wrong?"

"Ehh, you must not have told it right."

(...Or, in this case, "you're probably not using the right model GPU")


> Genesis 4:25 And Adam knew his wife again; and she bare a son, and called his name Seth: For God, said she, hath appointed me another seed instead of Abel, whom Cain slew.

They had other children.

For the sun part, this probably refers to Genesis 1:3 "let there be light" and dividing day from night, and Genesis 1:16 "And God made two great lights; the greater light to rule the day, and the lesser light to rule the night" and often in between there's "morning and evening". The way this happens after the fourth day is with the sun and moon lights.

I wasn't there at the time, but I assume reality pre-day four was running prior to OpenGL 2.4 and was using global illumination, and then afterwards switched to RealEngine 4, probably. Just a guess.

Humans being created before and after animals likely refers to Genesis 1:26 and 2:19. That is a bit curious, agreed. The way I take it, since Genesis 2:4-29 happened after the 7th day, is something like "hold up, let's go back to those days when plants, animals, and humans were created, and get slightly more specific about it."

The origin story in the Bible is short on specifics, so it leads to a lot of interesting fan theories, and divides many of the staunchest fans. "These are literal days here" "no, these are phases, a day could be any length of time, the sun wasn't even there yet" and so on, as if that was the main point of the passages, when it isn't.


And on the fourth day God refactored the rendering pipeline…


Randall Hyde now has an Art of Assembly edition for 64-bit.


Cool, good to know.


The idea of this gave me a chuckle. Instead of worrying about Skynet destroying all of humanity, we should be worried about them constantly asking about our car's extended warranty.

Seriously though, a real-life robot is more expensive to make and run than an online one. Plus, there's nowhere for them to put "permanent" content; they'd have to talk to us one by one, or start graffiti-tagging ads everywhere. Any robots they did make would be obviously robots, and we could avoid them. It doesn't scale the same way.


It's important to note that the author uses denormalized tables for data analysis only. It was never outright stated "don't do this for source-of-truth, authoritative data," but yeah, don't do this for source-of-truth, authoritative data (in general).



