I am the original author of Hex Fiend. Like Tweak, Hex Fiend uses a BTree (actually a B+tree, which also cross-links nodes at the same level, for easy iteration). Each interior node is annotated with the amount of data contained in its subtree, so it is possible to quickly find data at an offset. I think this is the same as the author's idea.
Hex Fiend has an additional trick: it can save files in-place without needing additional disk space. It does this by computing a dependency graph, and then writing chunks in topological order. For example, say you have a 100 byte file, and the user inserts 100 bytes at the beginning. We get a graph:
[100, 200) depends on [0, 100)
[0, 100) from memory (no dependencies)
The topo sort means we'll write [100, 200) first, so its source data is not overwritten.
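The scheme above can be sketched like this (my reading of it, not Hex Fiend's actual code; Python's graphlib stands in for the topological sort). Each chunk has a destination range and, if it is sourced from the file being overwritten, a source range; chunk A must be written before chunk B whenever A's source overlaps B's destination:

```python
# Sketch of in-place save via topological sort (not Hex Fiend's real code).
from graphlib import TopologicalSorter

def overlaps(a, b):
    """Half-open ranges (start, end) overlap?"""
    return a[0] < b[1] and b[0] < a[1]

def save_order(chunks):
    """chunks: list of (dest_range, src_range_or_None).
    Returns destination ranges in a safe write order."""
    ts = TopologicalSorter()
    for dest, src in chunks:
        ts.add(dest)
        if src is None:
            continue  # data comes from memory, clobbers nothing
        for other_dest, _ in chunks:
            if other_dest != dest and overlaps(src, other_dest):
                # Writing other_dest would clobber our source, so we
                # must go first: make other_dest depend on dest.
                ts.add(other_dest, dest)
    return list(ts.static_order())

# 100-byte file, 100 bytes inserted at the start:
chunks = [((0, 100), None),        # new data, from memory
          ((100, 200), (0, 100))]  # old data, shifted down the file
print(save_order(chunks))          # [(100, 200), (0, 100)]
```

A real implementation also has to break dependency cycles (two chunks that each overwrite the other's source) by staging one of them in memory; the sketch ignores that.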
Thank you for making Hex Fiend; it's the best hex editor out there!

One thing that I really, really like about Hex Fiend is that scrolling is very fast, yet precise and smooth, no matter the size of the file. Is it based on custom classes? Scrolling in macOS used to be quite OK until several releases ago; then it became crappy, especially in some apps (AppleScript Editor comes to mind).
I understand why you favor a declarative template format for describing files with tags -- that probably scales really nicely. What I'm doing is pretty grotty, but works surprisingly well in practice. I procedurally generate a description of each file in a shell script, and then open the file in wxHexEditor. I'm generating huge XML files, which is slow, but there are simple workarounds to get acceptable performance.
wxHexEditor doesn't support declarative tags. But even if it did I might not want to use them; the on-disk format of PostgreSQL is much more complicated than most file formats, and isn't supposed to be consumed by third party tools. Writing a C program that uses the struct definitions from the server itself makes the complexity quite manageable -- the tool is basically feature complete, even though I haven't spent a huge amount of time on it.
That said, it would be great if I could adapt pg_hexedit to a hex editor that had some kind of "best of both worlds" support for tags - tags that can be generated lazily and on-demand, when a portion of the file needs to be drawn or redrawn. This would be easy to adapt to -- Postgres relation files always consist of a series of 8KiB blocks/pages. My tool can easily generate tags for any single block without knowing any special context or having any expensive-to-generate state -- I just need a block number (i.e. an 8KiB-aligned byte offset).
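A sketch of what that lazy interface could look like; all the names here are hypothetical, and the field offsets are placeholders rather than pg_hexedit's real output:

```python
# Hypothetical "lazy tags" provider: the editor asks for annotations only
# for the byte range it is about to draw. The function names are mine.

BLOCK_SIZE = 8192  # Postgres relation files are a series of 8 KiB blocks

def tags_for_block(block_number):
    """Generate annotations for one block on demand. A real provider
    would parse the page header here; these fields are placeholders."""
    base = block_number * BLOCK_SIZE
    return [
        (base, base + 24, "page header"),
        (base + 24, base + 28, "first line pointer"),
    ]

def tags_for_view(first_byte, last_byte):
    """Called when the byte range [first_byte, last_byte] is (re)drawn."""
    first_block = first_byte // BLOCK_SIZE
    last_block = last_byte // BLOCK_SIZE
    tags = []
    for block in range(first_block, last_block + 1):
        tags.extend(tags_for_block(block))
    return tags
```

The point is that tags_for_block needs nothing but a block number, so the editor never has to materialize tags for the whole multi-gigabyte relation up front.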
Really interesting stuff. I'm super curious how the b+tree data structure is used to optimize this use case. The idea of editing a file is straightforward, but the application of a b+tree to this problem isn't obvious to me.
Can you describe a bit more how a b+tree is used here? Do you create a b+tree over the file data with fixed-size blocks, then make changes to the tree as the user modifies the file, and then flush the tree back to disk?
A related question: it's common for filesystems to use b-trees... so why not just mmap the file?
Great questions! A Hex Fiend document is represented in-memory as a list of slices of files and memory buffers. The b+tree is just an optimization on top of the list, and is not written to disk.
For example, say the user opens file1, which is 100 bytes. Our document starts off with one slice:
(file1, 0, 100)
Now the user scrolls to offset 75 and types 30 bytes. This splits the original slice at offset 75, so our new list is:
(file1, 0, 75), (memory, 0, 30), (file1, 75, 100)
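That split can be sketched by treating the document as a plain list of (source, start, end) slices, ignoring the b+tree entirely (insertion at the very end of the document is omitted for brevity):

```python
# Sketch of the slice-list edit described above (not Hex Fiend's code).

def insert(slices, offset, new_slice):
    """slices: (source, start, end) half-open ranges in document order.
    Splits the slice containing `offset` and inserts new_slice there."""
    out, pos = [], 0
    for src, start, end in slices:
        length = end - start
        if pos <= offset < pos + length:
            cut = start + (offset - pos)   # split point inside the source
            if cut > start:
                out.append((src, start, cut))
            out.append(new_slice)
            out.append((src, cut, end))
        else:
            out.append((src, start, end))
        pos += length
    return out

doc = [("file1", 0, 100)]
doc = insert(doc, 75, ("memory", 0, 30))
# doc == [("file1", 0, 75), ("memory", 0, 30), ("file1", 75, 100)]
```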
The user may also open file2, copy part of it, and paste it at the end of the document, so that the document now references two files.
So there's no real dependence on the original file: a document may reference arbitrarily many files (until it's saved). By organizing the list as a b+tree, it's fast to find the slice at a given byte offset in the document.
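The annotation idea can be sketched with a plain binary tree (Hex Fiend's real structure is a B+tree, but the lookup logic is the same): each interior node caches its subtree's total length, so finding the slice at a given document offset is one root-to-leaf walk rather than a scan of the whole list.

```python
# Sketch of a length-annotated tree over document slices (simplified to
# a binary tree; not Hex Fiend's actual implementation).

class Leaf:
    def __init__(self, length, payload):
        self.length = length      # bytes contributed by this slice
        self.payload = payload    # e.g. a (file, file_offset) slice

class Node:
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.length = left.length + right.length  # subtree annotation

def find(node, offset):
    """Return (leaf, offset_within_leaf) for a document offset."""
    while isinstance(node, Node):
        if offset < node.left.length:
            node = node.left
        else:
            offset -= node.left.length
            node = node.right
    return node, offset

# A 100-byte file slice followed by a 30-byte memory slice:
tree = Node(Leaf(100, "file1[0:100]"), Leaf(30, "mem[0:30]"))
leaf, off = find(tree, 110)   # lands 10 bytes into the memory slice
```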
Hex Fiend does not use mmap because I wanted to support 4+ GB files in 32-bit mode. But error handling with mmap is miserable, so I think it is wise to avoid mmap even in 64-bit mode.
I'm surprised you don't mention insertion/deletion for why plain mmap (without chunk lists or trees) does not work... isn't that a deal breaker regardless, or am I missing something?
EDIT: Ah, did you just mean why it doesn't use mmap at all, e.g. backing the actual data in the tree?
Yep. A simple no-insertion editor could just mmap a file with in-place editing. But you could support insertions by weaving together the mmap'd buffer with other buffers. This is how the suckless vis editor works: https://lists.suckless.org/dev/1409/23497.html
The advantage is that your document can be a simple list of (ptr, length) pairs. The price you pay is error handling: mmap reports errors through raising SIGSEGV or SIGBUS, which are very painful to handle properly.
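A toy version of that weaving in Python. The real hazard is the one described above: in C, touching a mapped page of a file that was truncated underneath you raises SIGBUS; Python hides some of this behind exceptions, but the underlying problem is the same.

```python
# Sketch of a vis-style piece list: the document is a sequence of
# (buffer, start, end) pieces, where one buffer is an mmap of the
# original file and others are ordinary in-memory bytes.

import mmap, os, tempfile

# Create a small file to stand in for the document being edited.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello world")
os.close(fd)

with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # "hello " and "world" come from the mmap; "big " was inserted.
    pieces = [(mapped, 0, 6), (b"big ", 0, 4), (mapped, 6, 11)]
    text = b"".join(buf[a:b] for buf, a, b in pieces)
    print(text)  # b'hello big world'
    mapped.close()
os.unlink(path)
```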
Yes, a document can just be a list of (ptr, length) pairs. I guess I don't see the benefit a b-tree adds here. I'd probably just use a red-black, AVL, or even plain binary tree, with nodes either pointing to blocks in the file (ptr, length) or containing new in-memory data (ptr, length, data), with the invariant that no two nodes ever have overlapping (ptr, length) ranges.
My understanding is that b-trees are a multi-level index that reduces the number of blocks that need to be read from disk to manage an index. I don't see that being the use case here, and thought maybe I was missing something.
With that said, I wonder if the calculus changes when the file is so large that it won't fit in memory; maybe b-trees are somehow helpful in flushing data to disk. However, this then starts to look like a block-allocator problem. Suppose that as you run out of memory you start flushing blocks to disk. But the user is still editing the file, so some of the blocks that were flushed are no longer valid. You'd have to keep track of those "unallocated" blocks on disk to enable reuse and thus improve space efficiency. As far as I understand, block allocators are not solved with b-trees.
Ah, yes. That would indeed still be a chunk list, albeit one backed by mmap.
I can of course be totally wrong, but I think the OP was wondering why you couldn't simply mmap the entire file in place without managing any lists, trees, or any data structure except for a flat array representing the whole file.
If you only mmap the file, how do you handle the case of inserting or deleting bytes in the middle of a large file?
Answering that might also help in imagining why and how B+trees are used (the "+", i.e. the cross-linked leaves for iteration, being extremely useful, if not essential).
- man ./tweak.1 (shows the default hotkeys, but not easy to parse as a quick reference)
- ./tweak -D (success)
For comparison, in top, h works, ? also works, and pressing Ctrl-H shows a helpful message: "Unknown command - try 'h' for help"
> * hotkeys are reasonable if you use readline(1)
Readline is reasonable for a line editor, not for an ncurses app. Just as an example: in every other hex editor I've used, pressing Tab switches between the hex/ASCII panes.
> * C-s, C-r are search forward and backwards, as expected.
I missed that, but it can't search by hex, which is another piece of basic functionality.
Agreed entirely on your summation at the bottom. No good CLI hex editors exist. It's been like this for years, not sure why. dhex is indeed the "closest" to decent, but still lacks... a lot.
A service people can publish extremely specific ideas to about how they want certain things - apps, systems, hardware, policies, anything - to work, and the best/most insightful ideas float to the top. Then, people who a) want to Start Again™ or b) vent off some steam in the general direction of a project they're already familiar with, can just look at this site for community-ranked suggestions of where the most demand is.
HxD is pretty good on Windows, but I've had a few minor issues with it. I've been trying to get an image of a floppy disk to analyze, but unfortunately loading it from HxD seems to cut some of the data off. I verified this by transferring some files from the original device via XMODEM to a PC, and noticed that the end of the last file was cut off in the floppy disk image. I've tried a couple of other hex editors, but hit various permissions issues while dumping the disk, for some reason. Any ideas for something that can dump this disk so I can analyze it in a hex editor?
Non-trivial for sure. But much easier with a hex editor than with, say, a standard text editor. No need to guess how many lines some text takes up due to word wrapping; no need to even search for newlines. There's a fixed number of bytes per row, and a fixed row height, so mapping between byte offset, line number, and pixel y-position is just a matter of multiplying and dividing.
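With assumed values for the two constants, that mapping is just a multiply or divide:

```python
# Byte offset <-> pixel mapping for a hex view with fixed geometry.
# The constant values are illustrative, not from any particular editor.

BYTES_PER_ROW = 16
ROW_HEIGHT_PX = 14

def offset_to_y(offset):
    """Pixel y-position of the row containing a byte offset."""
    return (offset // BYTES_PER_ROW) * ROW_HEIGHT_PX

def y_to_offset(y):
    """First byte offset of the row at a pixel y-position."""
    return (y // ROW_HEIGHT_PX) * BYTES_PER_ROW

print(offset_to_y(0x120))  # byte 288 -> row 18 -> 252 px
print(y_to_offset(252))    # 252 px  -> row 18 -> byte 288
```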
I might be wrong, but I believe the fastest way is to seek to the end of the file; that way you get the total file size. Or use fstat or something similar.
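Both variants, for the record (the 1234-byte file here is just a throwaway temp file for illustration):

```python
# Two ways to get a file's size without reading its contents.

import os, tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"x" * 1234)
os.close(fd)

with open(path, "rb") as f:
    f.seek(0, os.SEEK_END)      # seek to the end...
    size_by_seek = f.tell()     # ...and ask where we are
    size_by_stat = os.fstat(f.fileno()).st_size  # or just ask the kernel

print(size_by_seek, size_by_stat)  # 1234 1234
os.unlink(path)
```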
Maybe scrolling is non-trivial, but there is definitely a lot of room for optimization. You could store large chunks in memory and populate them asynchronously with smaller chunks from the file, for example.
edit: I completely missed the point with the size, comex's answer is way better
Here is Hex Fiend's B+tree: https://github.com/HexFiend/HexFiend/blob/master/framework/s...