Show HN: Dut – a fast Linux disk usage calculator (codeberg.org)
396 points by 201984 6 months ago | 148 comments
"dut" is a disk usage calculator that I wrote a couple months ago in C. It is multi-threaded, making it one of the fastest such programs. It beats normal "du" in all cases, and beats all other similar programs when Linux's caches are warm (so, not on the first run). I wrote "dut" as a challenge to beat similar programs that I used a lot, namely pdu[1] and dust[2].

"dut" displays a tree of the biggest things under your current directory, and it also shows the size of hard-links under each directory as well. The hard-link tallying was inspired by ncdu[3], but I don't like how unintuitive the readout is. Anyone have ideas for a better format?

There are installation instructions in the README. dut is a single source file, so you only need to download it, copy-paste the compiler command, and then copy the binary somewhere on your path like /usr/local/bin.

I went through a few different approaches writing it, and you can see most of them in the git history. At the core of the program is a data structure that holds the directories that still need to be traversed, and binary heaps to hold statted files and directories. I had started off using C++ std::queues with mutexes, but the performance was awful, so I took it as a learning opportunity and wrote all the data structures from scratch. That was the hardest part of the program to get right.

These are the other techniques I used to improve performance:

* Using fstatat(2) with the parent directory's fd instead of lstat(2) with an absolute path. (10-15% performance increase)

* Using statx(2) instead of fstatat (perf showed fstatat running statx code in the kernel). (10% performance increase)

* Using getdents(2) to get directory contents instead of opendir/readdir/closedir. (also around 10%)

* Limiting inter-thread communication. I originally had fs-traversal results accumulated in a shared binary heap, but giving each thread a binary-heap and then merging them all at the end was faster.

I couldn't find any information online about fstatat and statx being significantly faster than plain old stat, so maybe this info will help someone in the future.
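
For reference, here's a minimal single-directory sketch of the getdents64/statx combination described above (illustrative only, not dut's actual code; assumes glibc 2.28+ for the statx(3) wrapper):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* glibc doesn't expose the getdents64 record layout, so define it here. */
    struct linux_dirent64 {
        uint64_t       d_ino;
        int64_t        d_off;
        unsigned short d_reclen;
        unsigned char  d_type;
        char           d_name[];
    };

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : ".";
        int dirfd = open(path, O_RDONLY | O_DIRECTORY);
        if (dirfd < 0) { perror("open"); return 1; }

        char buf[64 * 1024];
        long long blocks = 0;
        for (;;) {
            long n = syscall(SYS_getdents64, dirfd, buf, sizeof(buf));
            if (n <= 0) break;
            for (long off = 0; off < n;) {
                struct linux_dirent64 *d = (struct linux_dirent64 *)(buf + off);
                off += d->d_reclen;
                if (!strcmp(d->d_name, ".") || !strcmp(d->d_name, ".."))
                    continue;
                struct statx stx;
                /* fd-relative lookup, like fstatat; don't follow symlinks */
                if (statx(dirfd, d->d_name, AT_SYMLINK_NOFOLLOW,
                          STATX_BLOCKS, &stx) == 0)
                    blocks += stx.stx_blocks;   /* 512-byte blocks */
            }
        }
        close(dirfd);
        printf("%s: %lld bytes on disk (one level only)\n", path, blocks * 512LL);
        return 0;
    }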

[1]: https://github.com/KSXGitHub/parallel-disk-usage

[2]: https://github.com/bootandy/dust

[3]: https://dev.yorhel.nl/doc/ncdu2, see "Shared Links"




Nice work. Sometimes I wonder if there's any way to trade away accuracy for speed? Like, often I don't care _exactly_ how many bytes the biggest user of space is, I just want to see some orders of magnitude.

Maybe there could be an iterative breadth-first approach, where first you quickly identify and discard the small unimportant items, passing over anything that can't be counted quickly. Then with what's left you identify the smallest of those and discard, and then with what's left the smallest of those, and repeat and repeat. Each pass through, you get a higher resolution picture of which directories and files are using the most space, and you just wait until you have the level of detail you need, but you get to see the tally as it happens across the board. Does this exist?


Something like that exists for btrfs; it's called btdu. It has the accuracy/time trade-off you're interested in, but the implementation is quite different. It samples random points on the disk and finds out what file path they belong to. The longer it runs the more accurate it gets. The readme is good at explaining why this approach makes sense for btrfs and what its limitations are.

https://github.com/CyberShadow/btdu


Damn, `ext4` is organized entirely differently. You can't get anything useful from:

    sudo debugfs -R "icheck $RANDOM" /dev/nvme1
    sudo debugfs -R "ncheck $res" /dev/nvme1
and recursing. That's a clever technique given btrfs structs.


That's so cool.


That is so cool!!! I have always wanted something like this! Arg I wish other filesystems supported a strategy like this!


Thanks!

What you described is a neat idea, but it's not possible with any degree of accuracy AFAIK. To give you a picture of the problem, calculating the disk usage of a directory requires calling statx(2) on every file in that directory, summing up the reported sizes, and then recursing into every subdirectory and starting over. The problem with doing a partial search is that all the data is at the leaves of the tree, so you'll miss some potentially very large files.

Picture if your program only traversed the first, say, three levels of subdirectories to get a rough estimate. If there was a 1TB file down another level, your program would miss it completely and get a very inaccurate estimate of the disk usage, so it wouldn't be useful at all for finding the biggest culprits. You have the same problem if you decide to stop counting after seeing N files, since file N+1 could be gigantic and you'd never know.


Yeah, maybe approximation is not really possible. But it still seems like if you could do, say, up to 1000 stats per directory per pass, then running totals could be accumulated incrementally and reported along the way.

So after just a second or two, you might be able to know with certainty that a bunch of small directories are small, and then that a handful of others are at least however big has been counted so far. And that could be all you need, or else you could wait longer to see how the bigger directories play out.


You would still have to getdents() everything, but this way you might indeed save on stat() operations, which access information that is stored separately on disk; eliminating those would likely help uncached runs.

You could sample files in a directory or across directories to get an average file size and use the total number of files from getdents to estimate a total size. This does require you to know if a directory entry is a file or directory, which the d_type field gives you depending on the OS, file system and other factors. An average file size could also be obtained from statvfs().

Another trick is based on the fact that the link count of a directory is 2 + the number of subdirectories. Once you have seen the corresponding number of subdirectories, you know that there are no more subdirectories you need to descend into. This could allow you to abort a getdents for a very large directory, using eg the directory size to estimate the total entries.


> Another trick is based on the fact that the link count of a directory is 2 + the number of subdirectories.

For anyone who doesn't know why this is, it's because when you create a directory it has 2 hard links to it which are

    dirname
    dirname/.
When you add a new subdirectory it adds one more link which is

    dirname/subdir/..
So each subdirectory adds one more to the original 2.
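
To make the early-exit idea concrete, here's a rough sketch (my own illustration, assuming a filesystem that follows this convention; btrfs, for one, reports a link count of 1 for directories, and d_type can be DT_UNKNOWN on some filesystems):

    #include <dirent.h>
    #include <string.h>
    #include <sys/stat.h>

    /* Stop reading directory entries once all st_nlink - 2 subdirectories
     * have been seen (useful when you only care about subdirectories). */
    int count_subdirs(const char *path)
    {
        struct stat st;
        if (stat(path, &st) != 0 || !S_ISDIR(st.st_mode))
            return -1;

        long remaining = (long)st.st_nlink - 2;   /* expected subdirectories */
        DIR *dir = opendir(path);
        if (!dir)
            return -1;

        int found = 0;
        struct dirent *d;
        while (remaining > 0 && (d = readdir(dir)) != NULL) {
            if (d->d_type == DT_DIR &&
                strcmp(d->d_name, ".") != 0 && strcmp(d->d_name, "..") != 0) {
                found++;
                remaining--;   /* when this hits 0, no more subdirs exist */
            }
        }
        closedir(dir);
        return found;
    }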


This seems difficult since I'm not aware of any way to get approximate file sizes, at least with the usual FS-agnostic system calls: to get any size info you are pretty much calling something in the `stat` family and at that point you have the exact size.


I thought files can be sparse and have holes in the middle where nothing is allocated, so the file size is not what is used to calculate usage; it's the sum of the extents or some such.


Yes, files can be sparse but the actual disk usage information is also returned by these stat-family calls, so there is no special cost to handling sparse files.


I wish modern filesystems maintained per-directory usage as a directory attribute instead of forcing tools to do this basic job.


CephFS does that.

You can use getfattr to ask it for the recursive number of entries or bytes in a given directory.

Querying it is constant time; updates propagate with a few seconds' delay.

Extremely useful when you have billions of files on spinning disks, where running du/ncdu would take a month just for the stat()s.
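
For anyone curious, the same numbers are reachable from C through the xattr interface. A small sketch (assuming CephFS's ceph.dir.rbytes virtual xattr, which as far as I know is returned as a decimal string):

    #include <stdio.h>
    #include <sys/xattr.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) { fprintf(stderr, "usage: %s <dir>\n", argv[0]); return 1; }
        char buf[64];
        /* CephFS answers this from precomputed recursive statistics */
        ssize_t n = getxattr(argv[1], "ceph.dir.rbytes", buf, sizeof(buf) - 1);
        if (n < 0) { perror("getxattr"); return 1; }
        buf[n] = '\0';
        printf("recursive bytes under %s: %s\n", argv[1], buf);
        return 0;
    }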


This is an excellent point and I wholeheartedly agree!


Is it? That would require any update to any file to cascade into a bunch of directory updates amplifying the write and for what? Do you “du” in your shell prompt?

Not to mention it would likely be unable to handle the hardlink problem so it would consistently be wrong.


> That would require any update to any file to cascade into a bunch of directory updates amplifying the write and for what?

You can be a little lazy about updating parents and have O(1) updates and O(1) amortized reads with an O(n) worst case (same as now anyway).


This is probably the right solution, but you need to rebuild on an unclean unmount if you do it lazily.


Disks have improved substantially in I/O and write speed, to the point where Windows will literally index your file system so you can search faster, and antivirus will scan files in the background before you open them. I don’t think maintaining size state on directories would be all that much of a challenge.


I expect performance would suffer quite a lot. In a system with high I/O, there would be a lot of contention on updating the size of such directories as /home or /tmp, let alone /.

Also, are you going to update a file’s size for every write (could easily be a thousand times if you’re copying over a 10MB file) or are you going to coalesce updates to file sizes? If the latter, how do you recover after a crash?

Virtual directories such as /dev and /proc would require special-casing.

Mounting and unmounting disks probably would require special-casing.


Haven’t many similar issues been solved in journaled file systems and/or things like database transaction logs and indexes? Real-time, high-precision accuracy is not required; knowing how big a directory is is a frequent use case of directories. Hell, ‘df’ tracks this at the partition level, including edge cases, as does ‘du’.


As far as I am aware, neither of those cascade sizes up.

Also, doing that in databases isn’t a solved problem. count(*) can be slow in databases. See for example

- PostgreSQL: https://dba.stackexchange.com/questions/314371/count-queries..., https://wiki.postgresql.org/wiki/Count_estimate

- Oracle: https://forums.oracle.com/ords/apexds/post/select-count-very...

(Both databases use MVCC (https://en.wikipedia.org/wiki/Multiversion_concurrency_contr...) to ensure that concurrent queries all see the database in a consistent state. That makes it necessary to visit each row and check its timestamp when counting rows.)


I have a "du" command currently running that has been running for ~50 hours. I'd much rather have it update a half-dozen directory entries on each write.


> but I don't like how unintuitive the readout is

The best disk usage UI I ever saw was this one: https://www.trishtech.com/2013/10/scanner-display-hard-disk-... The inner circle is the top-level directories, and each ring outwards is one level deeper in the directory hierarchy. You would mouse over large subdirectories to see what they were, or double-click to drill down into a subdirectory. Download it and try it - it is quite spectacularly useful on Windows (although I'm not sure how well it handles terabyte-size drives - I haven't used Windows for a long time).

Hard to do a circular graph in a terminal...

It is very similar to a flame graph, though; perhaps look at how flame graphs are drawn by other terminal performance tools.


I've used graphical tools very similar to this, and I always come back to this:

   du -h | sort -rh | less
(You might have to sudo that du depending on the current folder. On macOS, use gsort.)

You just immediately see exactly what you need to delete, and you can so quickly scan the list. I'm not a terminal die-hard "use it for everything" kinda guy, I like GUIs for lots of stuff. But when it comes to "what's taking up all my space?" this is honestly the best solution I've found.


I like to use:

    du -shx *
I used to pipe that to:

    | grep G
to find anything gig sized, but I like your:

    | sort -rh
Thanks!


Good tip! Yeah, the `sort -rh` is what makes it sing, it's such a cool feature of coreutils that `sort` knows how to sort human-readable output from `du` or `df`


"Disk Usage Analyser" / "Baobab" on Linux is awesome with the same UI: https://apps.gnome.org/en-GB/Baobab/


And Filelight from KDE - https://apps.kde.org/filelight/


Also Filelight (KDE)


I don't like radial charts because the outer rings have a larger area, which makes it look like files deep in the hierarchy take more space than in reality. And also it leaves the majority of screen space unused.

I prefer the more classic treemap view, my personal favorite being the classic version of SpaceMonger but it's Windows only and very slow.


WizTree also uses a treemap view and is very fast. It's also Windows-only though.


Linux and MacOS have QDirStat: https://github.com/shundhammer/qdirstat


Thank you for this, I've always looked for a WinDirStat alternative for Linux.


DaisyDisk on mac does that. Also it's blazing fast, it seems to even beat "du" so I don't know what tricks they're pulling.


I think they're reading some of the info from Spotlight metadata already collected by the OS for indexing, but I could be wrong.


That’s probably it. It’s likely powered by whatever thing gives you quick directory sizes in Finder after you do View Options (cmd+j), and select “Show All Sizes”. I have that setting always on for all directories and pretty sure it’s cached as it’s fast.


That’s called a ring chart or a sunburst chart.


duc http://duc.zevv.nl/ does this


On Windows I always used to use Windirstat but it was slow, then I found Wiztree which is many orders of magnitude faster. I understand it works by directly reading the NTFS tables rather than spidering through the directories laboriously. I wonder if this approach would work for ext4 or whatever.


NTFS is pointlessly slow, so bypassing the VFS provides a decent speedup in exchange for the ridiculous fragility.

Linux doesn’t have the same issue, and I’d be quite concerned if an application like this needed root access to function.


I think you underestimate how much of a speedup we're talking about: it can pull in the entire filesystem in a couple of seconds on a multi-TB volume with billions of files. I have yet to see anything in the Linux world (including the OP) that comes anywhere near this performance level via tree walking.


I want to take this opportunity to recommend the talk "NTFS isn't that bad" (https://www.youtube.com/watch?v=qbKGw8MQ0i8). NTFS prefers a different access pattern than most usual file systems. I remember that a part of the talk was about speed-ups on Linux as well. So even if it doesn't sway your opinion it should enhance your perspective on how file systems work.


The issue usually isn't NTFS, but the other layers in the I/O stack.

NTFS-the-on-disk-structure by itself can easily provide setup comparable to XFS realtime extensions.


If you do like WinDirStat, there's a good Linux equivalent called QDirStat: https://github.com/shundhammer/qdirstat


There is a fork of WinDirStat that reads the NTFS MFT as well: https://github.com/ariccio/altWinDirStat


> it works by directly reading the NTFS tables rather than spidering through the directories

Maybe I'm just ignorant of linux filesystems, but this seems like the obvious thing to do. Do ext and friends not have a file table like this?


> I don't know why one ordering is better than the other, but the difference is pretty drastic.

I have the suspicion that some file systems store stat info next to the getdents entries.

Thus cache locality would kick in if you stat a file after receiving it via getdents (and counterintuitively, smaller getdents buffers make it faster then). Also in such cases it would be important to not sort combined getdents outputs before starting (which would destroy the locality again).

I found such a situation with CephFS but don't know what the layout is for common local file systems.


It's also interesting that the perf report for running dut on my homedir shows that it spends virtually all of the time looking for, not finding, and inserting entries in dentry cache slabs, where the entries are never found again, only inserted :-/ Great cache management by the kernel there.

ETA: Apparently the value in /proc/sys/vm/vfs_cache_pressure makes a huge difference. With the default of 100, my dentry and inode caches never grow large enough to contain the ~15M entries in my homedir. Dentry slabs get reclaimed to stay < 1% of system RAM, while the xfs_inode slab cache grows to the correct size. The threads in dut are pointless in this case because the access to the xfs inodes serializes.

If I set this kernel param to 15, then the caches grow to accommodate the tens of millions of inodes in my homedir. Ultimately the slab caches occupy 20GB of RAM! When the caches are working the threading in dut is moderately effective, job finishes in 5s with 200% CPU time.


Are you referring to the kmem_cache_alloc calls in the profile? If so, that's all in kernel space and there's nothing I can do about it.

https://share.firefox.dev/3XT9L7P


No, see how your profiles have `lookup_fast` at the leaves? Mine has `__lookup_slow` and it is slow indeed.


I just saw your edit. You have WAY more stuff under your home directory than I do. I only have ~2.5M inodes on both my laptop drives combined. The difference in the buff/cache output of `free` before and after running `dut` is only 1 GB for me.

Also, TIL about that kernel parameter, thanks!


Yeah I have a TB of bazel outputs in my cache directory. Unfortunately automatically deleting old bazel outputs is beyond the frontier of computer science and has been pushed out to future releases for 6 years and still going: https://github.com/bazelbuild/bazel/issues/5139


Reminds me of someone's script I have been using for over a decade.

    #!/bin/sh
    du -k --max-depth=1 "$@" | sort -nr | awk '
         BEGIN {
            split("KB,MB,GB,TB", Units, ",");
         }
         {
            u = 1;
            while ($1 >= 1024) {
               $1 = $1 / 1024;
               u += 1
            }
            $1 = sprintf("%.1f %s", $1, Units[u]);
            print $0;
         }
        '


I don't understand the point of the script, it's nothing more than:

  du -h --max-depth=1 "$@" | sort -hr


`-h` is not available in all `sort` implementations


Even the busybox port has it. The only sort implementation I know of that doesn't have -h is toybox (I guess older busybox versions are missing it as well), but I've been using -h for well over a decade and have seldom had it missing.


I was actually curious when busybox's sort added it, but didn't search too hard. It was certainly easy to see that GNU got it in 2009, I think (but even then, if the dude set up their bashrc long ago and that func/alias works, there's likely no reason to change it immediately).

I can say that `BusyBox v1.35.0 (2022-08-01 15:14:44 UTC)` did not have -h, so it having it now is kind of a shock to me (it looks like BusyBox v1.36.1 has it, at least as of 2023-06-22). Good, too! It's always frustrating when a dev tries using GNU args and it blows up and I have to explain the difference between macOS shell commands, GNU, and BusyBox.


I found this online a long time ago, and it's been with me across BSD, Macintosh and Linux. So I can't say why it is that way, and I didn't know about sort -h before today.


The point is that it is faster.


A bash script for postprocessing the sorting is certainly slower than just having sort do it correctly in the first place.


Any particular reason for doing the human readable units "manually"? `du -h | sort -h` works just fine.


Nice!


I will definitely try this one and compare with my daily stuff

`du -s -k * | sort -r -n -k1,1 -t" "`


I'm surprised statx was that much faster than fstatat. fstatat looks like a very thin wrapper around statx, it just calls vfs_statx and copies out the result to user space.


Out of curiosity, I switched it back to fstatat and compared, and found no significant difference. Must've been some other change I made at the time, although I could've sworn this was true. Could be a system update changed something in the three months since I did that. I can't edit my post now though, so that wrong info is stuck there.


I have this in my bashrc:

    alias duwim='du --apparent-size -c -s -B1048576 * | sort -g'
It produces a similar output, showing a list of directories and their sizes under the current dir.

The name "duwim" stands for "du what I mean". It came naturally after I dabbled for quite a while to figure out how to make du do what I mean.


> Anyone have ideas for a better format?

Hi, how about a flamegraph? I've always wanted to display the file hierarchy in a flamegraph-like format.

- previous discussion: https://x.com/laixintao/status/1744012609983295816

- my work, displaying flamegraphs in the terminal: https://github.com/laixintao/flameshow


I'm away from my Linux machine now, but I'm curious whether/how you handle reflinks. On a supported file system such as Btrfs, which I use, how does `cp --reflink` get counted? Similar to hard links? I'm curious because I use this feature extensively.


I've actually never heard of --reflink, so I had to look it up. `cp` from coreutils uses the FICLONE ioctl to clone the file on btrfs instead of a regular syscall.

I don't handle them specifically in dut, so it will total up whatever statx(2) reports for any reflink files.
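
For anyone unfamiliar with FICLONE, a rough sketch of what `cp --reflink=always` boils down to (assumes a filesystem with reflink support such as btrfs or XFS). Note that statx on both files afterwards still reports the full size and blocks, which is why the copies total up as if they were independent:

    #include <fcntl.h>
    #include <linux/fs.h>      /* FICLONE */
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 3) { fprintf(stderr, "usage: %s src dst\n", argv[0]); return 1; }
        int src = open(argv[1], O_RDONLY);
        int dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (src < 0 || dst < 0) { perror("open"); return 1; }
        /* dst now shares src's extents instead of duplicating the data */
        if (ioctl(dst, FICLONE, src) != 0) { perror("ioctl(FICLONE)"); return 1; }
        close(src);
        close(dst);
        return 0;
    }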


You’ll probably end up with dupes (and removing these files won’t have the effect you intend) but I don’t know that there’s a good way to handle and report such soft links anyway.


Btdu will be your friend.


I often want to know why there was sudden growth in disk usage over the last month/week/etc., and what suddenly took space. In those cases I find myself wishing that du and friends would cache their last few runs and offer a diff against them, thus easily listing the new disk-eating files or directories. Could dut evolve to do something like that?


  du[t] > .disk-usage-"`date +"%d-%m-%Y"`"
And then use diff later?


Almost all of them will have some difference. What is needed is to parse the previous state, calculate the difference in size, and show only the "significant" difference.


btdu's extra (or expert?) mode with snapshots kinda does that: you can see what's only in the new version and not in a snapshot, and vice versa. It also offers attributing size to folders only for extents that aren't shared with a different folder (snapshots are essentially just special folders), to kinda get a diff between the two (stuff only present in the old snapshot is shown there; stuff only present in the new version is shown there).



That looks perfect, thanks!


Looks nice, although a feature I like in ncdu is the 'd' key to delete the currently highlighted file or directory.


This isn't an interactive program, so ncdu would be better for interactively going around and freeing up space. If you just want an overview, though, then dut runs much quicker than ncdu and will show large files deep down in subdirectories without having to go down manually.


Nice job. I've been using dua[0] and have found it to be quite fast on my MacBook Pro. I'm interested to see how this compares.

[0] https://github.com/Byron/dua-cli


I benchmarked against dua while developing, and the results are in the README. Note that dut uses Linux-specific syscalls, so it won't run on MacOS.

TL;DR: dut is 3x faster with warm caches, slightly faster on SSD, slightly slower on HDD.


What I need is a du that caches the results somewhere and then does not rescan the 90% of dirs that have not changed when I run it again a month later...


And it would know they did not change without scanning them because how?


Maybe it could run in the background and use inotify to just update the database all the time, or at least keep track of what needs rescanning?


Thinking about this some more, does this system not already exist for the disk quota calculation in the kernel? How does that work? Would it be possible for a tool to scan the disk once, and then get information about file modifications from the system that's used to update quota info?


It could hash the contents of a dir. Along the lines of git


Except hashing requires... reading.

There is not much to be done here. Directory entries are just names; there are no guarantees that the files were not modified or replaced.

The best you could do is something similar to the strategies of rsync: rely on metadata (modified date, etc.) and cross your fingers that nobody did `cp -a`.


I would be fine with the latter, the program could display a warning like "Results may be inaccurate, full scan required" or something.

I guess I'm just annoyed that for Windows/NTFS really fast programs are available but not for Linux filesystems.


And hashing something requires reading all of its data. I think deducing the file size would actually be faster in some file systems and never slower in any.


Faster in all file systems I'd guess, stat is fast, opening the file and reading its contents and updating a checksum is slow, and gets slower the larger the file is.


I've been using my own function with `du` for ages now, similar to others here, but I appreciate new tools in this space.

I gave `dut` a try, but I'm confused by its output. For example:

  3.2G    0B |- .pyenv
  3.4G    0B | /- toolchains
  3.4G    0B |- .rustup
  4.0G    0B | |- <censored>
  4.4G    0B | /- <censored>
  9.2G    0B |- Work
  3.7G    0B |   /- flash
  3.8G    0B | /- <censored>
   16G  4.0K |- Downloads
  5.1G    0B | |- <censored>
  5.2G    0B | /- <censored>
   16G    0B |- Projects
  3.2G   42M | /- <censored>
   17G  183M |- src
   17G    0B | /- <censored>
   17G    0B |- Videos
  3.7G    0B | /- Videos
   28G    0B |- Music
  6.9G    0B | |- tmp
  3.4G    0B | | /- tmp
  8.8G    0B | |- go
  3.6G    0B | |   /- .versions
  3.9G    0B | | |- go
  8.5G    0B | | |     /- dir
  8.5G    0B | | |   /- vfs
  8.5G    0B | | | /- storage
  8.5G    0B | | /- containers
   15G  140M | /- share
   34G  183M /- .local
  161G    0B .
- I expected the output to be sorted by the first column, yet some items are clearly out of order. I don't use hard links much, so I wouldn't expect this to be because of shared data.

- The tree rendering is very confusing. Some directories are several levels deep, but in this output they're all jumbled, so it's not clear where they exist on disk. Showing the full path with the `-p` option, and removing indentation with `-i 0` somewhat helps, but I would almost remove tree rendering entirely.


It is being sorted by the first column, but it also keeps subdirectories with each other. Look at the order of your top-level directories.

  3.2G    0B |- .pyenv
  3.4G    0B |- .rustup
  9.2G    0B |- Work
   16G  4.0K |- Downloads
   17G  183M |- src
   28G    0B |- Music
   34G  183M /- .local
If you don't want the tree output and only want the top directories, you can use `-d 1` to limit to depth=1.


Ah, I see, that makes sense.

But still, the second `Videos` directory of 3.7G is a subdirectory of `Music`, so it should appear below it, no? Same for the two `tmp` directories, they're subdirectories of `.local`, so I would expect them to be listed under it. Right now there doesn't seem to be a clear order in either case.


Subdirectories cannot be larger than the directory that they are within, so they cannot be sorted _below_ their parent. Thus, the tree branches upwards, not downwards. The root is at the bottom, where a tree’s root should be!

Incidentally, dust sorts things the same way but presents it with a nicer tree:

     db48x  ~  1  dust -Db
    2.6G     ┌── saves
    2.6G   ┌─┴ .factorio
    3.4G   │   ┌── Steam
    3.4G   │ ┌─┴ share
    3.4G   ├─┴ .local
    1.8G   │ ┌── EgoSoft
    3.5G   ├─┴ .config
    7.2G   │ ┌── Amadeus (1984) DC (1080p BluRay x265 HEVC 10bit AAC 5.1 Tigole)
     16G   │ ├── Amadeus.1984.DC.INTERNAL.REPACK.1080p.BluRay.x264-CLASSiC[rarbg]
     23G   ├─┴ video
    1.9G   │   ┌── build
    2.0G   │ ┌─┴ notcurses
    2.2G   │ ├── wezterm
    2.1G   │ │ ┌── master
    2.6G   │ │ │   ┌── tiles
    2.6G   │ │ │ ┌─┴ obj
    4.1G   │ │ ├─┴ missions
    2.6G   │ │ │   ┌── tiles
    2.6G   │ │ │ ┌─┴ obj
    4.1G   │ │ ├─┴ iteminfo
    5.6G   │ │ ├── follower_rules
    8.5G   │ │ │     ┌── pack
    8.6G   │ │ │   ┌─┴ objects
    8.6G   │ │ │ ┌─┴ .git
     10G   │ │ ├─┴ uilist
     26G   │ ├─┴ cataclysm
     33G   ├─┴ src
     72G ┌─┴ .


Gotcha, thanks for explaining.

Yeah, I guess my confusion was with how the tree is rendered in dut. The pipe rendering of dust makes this clearer.


Neat, a new C program! I get a little frisson of good vibes whenever someone announces a new project in C, as opposed to Rust or Python or Go. Even though C is pretty much a lost cause at this point. It looks like it has some real sophisticated performance optimizations going on too.


I have been using diskonaut; it's fast enough given that it also produces a nice visual output.


Did you consider the fts[0] family of functions for traversal? I use that along with a work queue for filtered entries to get pretty good performance with dedup[1]. For my use case I could avoid any separate stat call altogether, the FTSENT already provided everything I needed.

0 - https://linux.die.net/man/3/fts_read

1 - https://github.com/ttkb-oss/dedup/blob/6a906db5a940df71deb4f...


Those are single threaded, so they would have kneecapped performance pretty badly. 'du' from coreutils uses them, and you can see the drastic speed difference between that and my program in the README.


fts is just wrapper functions.

You cannot get around getdents and stat-family syscalls on Linux if you need file sizes.


Nice work! There is also gdu[1], where the UI is heavily inspired by ncdu and somehow feels way faster...

1: https://github.com/dundee/gdu


> https://dev.yorhel.nl/doc/ncdu2

I wasn't aware that there was a rewrite of ncdu in Zig. That link is a nice read.


This looks handy. Do you have any tips for stuff like queued ‘mv’ or similar? If I’m moving data around on 3-4 drives, it’s common that I’ll stack commands where the 3rd command may free up space for the 4th to run successfully. I use && to ensure a halt on failure, but I need to mentally calculate the free space when I’m writing the commands, as the free space after the third mv will be different from the output of ‘df’ before any of the commands have run.


I haven't run into a situation like that, but if I did, I'd be doing mental math like you. `dut` would only be useful as a (quicker) replacement for `du` for telling you how large the source of a `cp -r` is.


This looks awesome!

One comment: I find the benchmark results really cumbersome to read. Why don't you make a graph (e.g. a barplot) that would make the results obvious at a quick glance? I'm a strong believer in presenting numerical data graphically whenever possible; it avoids many mistakes and misunderstandings.


I think that 'ls' should also be evaluating the size of the files contained within. The size and number of contained files/folders really does reveal a lot about the contents of a directory without peeking inside. The speed of this is what would be most concerning though.


GPLv3, you love to see it. Great work.


You should include the "How to build" instructions near the beginning of the main.c file.


Done


Not as featureful, but what I've been using. If you can't install this tool for some reason, it's still useful. I call it usage:

    #!/bin/bash

    du -hs * .??* 2> /dev/null | sort -h | tail -22


dut looks very nice.

One small surprise I found came when I have a symlink to a directory and refer to it with a trailing "/": dut doesn't follow the link in order to scan the real directory. I.e., I have this symlink:

    ln -s /big/disk/dev ~/dev
then

    ./dut ~/dev/
returns zero size while

    du -sh ~/dev/
returns the full size.

I'm not sure how widespread this convention is to resolve symlinks to their target directories if named with a trailing "/" but it's one my fingers have memorized.

In any case, this is another tool for my toolbox. Thank you for sharing it.


Does it depend on linux functionality or can I use it on macos?

Well I can just try :)


From the author:

"Note that dut uses Linux-specific syscalls, so it won't run on MacOS."


Great app, and very fast at scanning nested dirs. I often need recursive disk usage when I suddenly run out of space and scramble to clean up while everything is crashing.


I always want treemaps.

- console (Rust): cargo install diskonaut

- console (Python): pip install ohmu

- GUI: gdmap

- Windows: WinDirStat

- Mac: GrandPerspective (I seem to recall)


Would be great to have a TUI interface for browsing like ncdu.


Ncdu is easy to remember and use, clicking through, etc. It would be cool to find a faster replacement with the same usage, instead of a new tool with parameters to remember.


Nice work. I really miss the simplicity of C. One file, one Makefile, and that's it. Has anyone tested it with a node_modules folder?


Someone, please create a Gdut, a fork that will produce graphs for a quick and easy way to read; it’s almost impossible to read on small vertical screens.


If this accurately shows hidden stuff, such as docker build cache and old kernels, then it will become my go-to!


As long as it has permissions, it totals up everything under the directory you give it including names that start with a '.'. It won't follow symlinks though.


Uh, even basic du "shows hidden stuff" accurately doesn't it?

dot files are just a convention on unix.


I get boatloads of "undefined reference" errors. Where's the list of dependencies?


The only dependency is a recent Linux C standard library. What are the missing symbols? On older versions of glibc you do have to add -pthread.


The author should have included a Makefile. You need to add -lpthread to the command provided in README.md.


Did you consider using io_uring? If not, was there a reason other than portability?


io_uring doesn't support the getdents syscall, so there's no way to traverse the filesystem with it. I considered using it for statx(2) to get the disk usage of each file, but decided not to because (a) it would be complicated to mix normal syscalls and io_uring and (b) perf showed the kernel spending most of its time doing actual work and not syscall boilerplate.


Are you sure the perf data isn't misleading?

E.g. memory accesses might show up as slower due to CPU caches being flushed when switching between user and kernel space.

I would be extremely interested in a quick (standalone?) benchmark of e.g. 1M stats with vs without uring.

Also https://github.com/tdanecker/iouring-getdents reports big uring speedups for getdents, which makes it surprising to get no speedups for stat.

If uring turns out fast, you might ignore (a), just doing the getdents first and then all the stats afterwards, since getdents is a "batch" syscall covering many files anyway, but stat isn't.
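
If someone does try that, here's a rough standalone sketch of the io_uring side using liburing (hypothetical benchmark scaffolding, not from dut; assumes liburing is installed, linking with -luring, and at most 256 paths per batch):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        int n = argc - 1;                     /* paths to stat, capped at 256 */
        struct io_uring ring;
        if (n > 256 || io_uring_queue_init(256, &ring, 0) < 0)
            return 1;

        struct statx *stx = calloc(n, sizeof(*stx));
        for (int i = 0; i < n; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_statx(sqe, AT_FDCWD, argv[i + 1],
                                AT_SYMLINK_NOFOLLOW, STATX_BLOCKS, &stx[i]);
            io_uring_sqe_set_data(sqe, &stx[i]);
        }
        io_uring_submit(&ring);               /* one syscall for the whole batch */

        long long blocks = 0;
        for (int i = 0; i < n; i++) {
            struct io_uring_cqe *cqe;
            if (io_uring_wait_cqe(&ring, &cqe) < 0) break;
            if (cqe->res == 0) {
                struct statx *s = io_uring_cqe_get_data(cqe);
                blocks += s->stx_blocks;
            }
            io_uring_cqe_seen(&ring, cqe);
        }
        printf("total: %lld bytes on disk\n", blocks * 512LL);
        io_uring_queue_exit(&ring);
        return 0;
    }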


I appreciate the explanation!


Neat tool. Congrats on the release, and thank you for this and the analysis/comparison.



ncdu has been my go to for years. Pleased to have a modern alternative.


Ideas for a better format: do what xdiskusage does.


What specifically do you feel xdiskusage does well?


It graphically displays the relative sizes of things, and allows you to interactively zoom into any particular subdirectory to see the relative sizes of the things inside it


Why C and not Rust or even Zig?


Because (a) I felt like it, (b) it would be more difficult to make raw syscalls in Rust, and (c) I use[1] flexible array members to minimize allocations, whereas Rust's support for those is quite bad[2]. It doesn't even look possible to allocate one with a size known only at runtime.

[1]: https://codeberg.org/201984/dut/src/branch/master/main.c#L16...

[2]: https://doc.rust-lang.org/nomicon/exotic-sizes.html#dynamica...


> it would be more difficult to make raw syscalls in Rust

Would you like to expand on this? Is it because of type conversions that you'd have to do?

> I use[1] flexible array members to minimize allocations

I was under the impression that FAM was a non-standard extension, but alas it is part of C99.

From what I'm seeing you have an intrusive list where each `entry` points to the previous and next element, and the path itself is a bag of bytes as part of the entry. I'm assuming that what you'd want from Rust is something akin to the following when it comes to the path?

    struct Entry<const NAME_LEN: usize> {
        ..
        mode: Mode,
        name: InlineName<NAME_LEN>,
    }

    struct InlineName<const NAME_LEN: usize> {
        value: [u8; NAME_LEN],
    }


For syscalls, I would have needed to either pull in dependencies or write FFI bindings, and neither of those options are appealing when I could simply write the program in Linux's native language.

For the FAM, your example looks like it requires a compile-time constant size. That's the same as hardcoding an array size in the struct, defeating the whole point. Short names will waste space, and long ones will get truncated.
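
For readers unfamiliar with flexible array members, a tiny illustration of the runtime-sized allocation being described (not dut's actual struct, which also carries list links and size fields):

    #include <stdlib.h>
    #include <string.h>

    struct entry {
        unsigned mode;
        char name[];                 /* C99 flexible array member */
    };

    struct entry *entry_new(const char *name, unsigned mode)
    {
        size_t len = strlen(name) + 1;
        /* a single allocation holds both the header and the name,
         * sized exactly for this name at runtime */
        struct entry *e = malloc(sizeof(*e) + len);
        if (!e)
            return NULL;
        e->mode = mode;
        memcpy(e->name, name, len);
        return e;
    }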


> For the FAM, your example looks like it requires a compile-time constant size. That's the same as hardcoding an array size in the struct, defeating the whole point. Short names will waste space, and long ones will get truncated.

You made me realize that the const generics stabilization work hasn't advanced enough to do what I was proposing (at least not in as straightforward way): https://play.rust-lang.org/?version=nightly&mode=debug&editi...

Those are const arguments, not const values, which means that you can operate on values as if they didn't have a size, while the compiler does keep track of the size throughout.

I'd have to take a more detailed look to see if this level of dynamism is enough to work with runtime provided strings, which they might not be, unless you started encoding things like "take every CStr, create a StackString<1> for each char, and then add them together".


This is not an appropriate question to ask, though I see it sometimes in these threads. "Because the author wanted to" is good enough a reason for them to write a program in C. It being a new project written in C can also be a good enough reason for you not to use it: dust already exists and is written in Rust, which you can use instead.


Why Rust or Zig and not C?


Better for the author's resume, if they want to make it hype driven.

Also some nebulous "being more secure". Never mind that this tool does not have elevated privileges. You gotta watch out for those remote root exploits even for a local only app, man.


I mean you could extract an archive you've downloaded to your filesystem and said archive could have funky file names and then you use this tool..

But I suppose it's not a very likely bug to have in this kind of tool.


Or “they” could enter your house at 2 am, drug you and hit you with a $5 wrench until they get access to your files :)


Never understood this line of thought. Not everything needs to be super secure. Not everything is going to be an attack vector. No one is going to deploy this onto a production server where this program specifically is going to be the attack vector.

Memory safety is cool and all, but a program that effectively sums a bunch of numbers together isn't going to cause issues. Worst case the program segfaults


Because C is like a cult Toyota Supra with twin turbo and Rust or Zig is like another cool boring Corvette roadster.



