Hacker News new | past | comments | ask | show | jobs | submit login
Tweak: An Efficient Hex Editor (greenend.org.uk)
101 points by hprotagonist on Feb 26, 2021 | hide | past | favorite | 39 comments



I am the original author of Hex Fiend. Like Tweak, Hex Fiend uses a BTree (actually a B+tree, which also cross-links nodes at the same level, for easy iteration). Each interior node is annotated with the amount of data contained in its subtree, so it is possible to quickly find data at an offset. I think this is the same as the author's idea.

Hex Fiend has an additional trick: it can save files in-place without needing additional disk space. It does this by computing a dependency graph, and then writing chunks in topological order. For example, say you have a 100 byte file, and the user inserts 100 bytes at the beginning. We get a graph:

    [100, 200) depends on [0, 100)
    [0, 100) from memory (no dependencies)
The topo sort means we'll write [100, 200) first, so its source data is not overwritten.

Here is Hex Fiend's B+tree: https://github.com/HexFiend/HexFiend/blob/master/framework/s...


Thank you for making HexFiend, that’s the top hex editor!

One thing that I really really like about HexFiend is that scrolling is very fast, yet precise and smooth, no matter the size of the file. Is it based on custom classes? Scrolling in macOS used to be quite ok up to several releases ago. Then it became crappy, especially in some apps (AppleScript Editor comes to my mind).


Hex Fiend uses native Mac scrollbars, but manages its scroll position itself. In Cocoa speak it uses NSScroller but not NSScrollView.


I am the author of a tool that generates wxHexEditor tags and annotations for Postgres relation files -- pg_hexedit:

https://github.com/petergeoghegan/pg_hexedit

I've invested quite a lot of effort in it, and it would be nice to have support for multiple hex editors. That was anticipated to some degree:

https://github.com/petergeoghegan/pg_hexedit#supporting-othe...

I understand why you favor a declarative template format for describing files with tags -- that probably scales really nicely. What I'm doing is pretty grotty, but works surprisingly well in practice. I procedurally generate a description of each file in a shell script, and then open the file in wxHexEditor. I'm generating huge XML files, which is slow, but there are simple workarounds to get acceptable performance.

wxHexEditor doesn't support declarative tags. But even if it did I might not want to use them; the on-disk format of PostgreSQL is much more complicated than most file formats, and isn't supposed to be consumed by third party tools. Writing a C program that uses the struct definitions from the server itself makes the complexity quite manageable -- the tool is basically feature complete, even though I haven't spent a huge amount of time on it.

That said, it would be great if I could adapt pg_hexedit to a hex editor that had some kind of "best of both worlds" support for tags - tags that can be generated lazily and on-demand, when a portion of the file needs to be drawn or redrawn. This would be easy to adapt to -- Postgres relation files always consist of a series of 8KiB blocks/pages. My tool can easily generate tags for any single block without knowing any special context or having any expensive-to-generate state -- I just need a block number (i.e. an 8KiB-aligned byte offset).


Really interesting stuff. I'm super curious how the Btree data structure is used to optimize this use case. The idea of editing a file is straight forward but the application of a b+tree to this problem for me isn't.

Can you describe a bit more how a btree is used here? Do you create a btree of the file data with fixed sized blocks? And then make changes to the btree as the user modifies the file and then flush the btree back to disk?

A related question. It's common for filesystems to use btrees... why not just mmap the file?


Great questions! A Hex Fiend document is represented in-memory as a list of slices of files and memory buffers. The b+tree is just an optimization on top of the list, and is not written to disk.

For example, say the user opens file1, which is 100 bytes. Our document starts off with one slice:

    (file1, 0, 100)
Now the user scrolls to offset 75 and types 30 bytes. This splits the original slice at offset 75, so our new list is:

    (file1, 0, 75), (memory, 0, 30), (file1, 75, 100)
The user may also open file2, copy part of it, and paste it at the end of the document. Now the document references two files:

   (file1, 0, 75), (memory, 0, 30), (file1, 75, 100), (file2, 200, 300)
So there's no real dependence on the original file: a document may reference arbitrarily many files (until it's saved). By organizing the list as a b+tree, it's fast to find the slice at a given byte offset in the document.

Hex Fiend does not use mmap because I wanted to support 4+GB files in 32 bit mode. But error handling with mmap is miserable, so I think it is wise to avoid mmap even in 64 bit.


I'm surprised you don't mention insertion/deletion for why plain mmap (without chunk lists or trees) does not work... isn't that a deal breaker regardless, or am I missing something?

EDIT: Ah, did you just mean why it doesn't use mmap at all, e.g. backing the actual data in the tree?


Yep. A simple no-insertion editor could just mmap a file with in-place editing. But you could support insertions by weaving together the mmap'd buffer with other buffers. This is how the suckless vis editor works: https://lists.suckless.org/dev/1409/23497.html

The advantage is that your document can be a simple list of (ptr, length) pairs. The price you pay is error handling: mmap reports errors through raising SIGSEGV or SIGBUS, which are very painful to handle properly.


Yes, a document can just be a (ptr, length). I guess I don't see the benefit a b-tree adds here. I'd probably just use a red-black, AVL, even binary tree with nodes either pointing to blocks on the file (ptr, length) or nodes containing new data in memory (ptr, length, data) with the invariant that no two nodes can ever have overlapping (ptr, length).

My understand is that btrees are a multi-level index that reduces the number of blocks that need to be read from disk to manage an index. I don't see that being a use-case here and thought maybe I was missing something.

With that said, I wonder if the calculus changes if the file is so large that it won't fit in memory and thus, b-trees maybe somehow helpful in flushing data to disk. However, this then starts to look like a block allocator problem. Suppose as you run out of memory you start flushing blocks to disk. But the user is still editing the file and so some of these blocks that were flushed to disk are no longer valid. Youu'd hve to keep track of those "unallocated" blocks on disk to enable re-use thus improve space efficiency. As far as I understand, block allocators are not solved with btrees.


Ah, yes. That would indeed still be a chunk list, albeit one backed by mmap.

I can of course be totally wrong, but I think the OP was wondering why you couldn't simply mmap the entire file in place without managing any lists, trees, or any data structure except for a flat array representing the whole file.


If you only mmap the file, how do you handle the case of inserting or deleting bytes in the middle of a large file?

Answering that might also help imagining why and how B+Trees are used (the "+" for iteration actually being at least extremely useful, if not important).


Thanks for making Hex Fiend. It's my go to hex editor, and I use it a lot.


I just want to say THANK YOU for Hex Fiend!


Tweak misses the following things I expect in a hex editor:

* No help screen to show the hotkeys

* Changes aren't highlighted

* Hotkeys are unintuitive unless you use Emacs (less than 5% of users [1])

* No undo/redo

* No search

* No option for decimal offsets/size

I've tried dozens of hex editors:

* hexedit [2] is the best option for Linux, despite it being web based.

* dhex [3] is probably the best option for the terminal, but not great.

* HxD [4] is great if you're on Windows

1. https://insights.stackoverflow.com/survey/2019#development-e...

2. https://hexed.it/

3. https://www.dettus.net/dhex/

4. https://mh-nexus.de/en/hxd/


* the manpage is pretty comprehensive (https://linux.die.net/man/1/tweak)

* no highlighting is a bummer.

* hotkeys are reasonable if you use readline(1)

* no undo, true.

* C-s, C-r are search forward and backwards, as expected.

* width and offset are configurable in .tweakrc, or pass -w and -o at launch.


> * the manpage is pretty comprehensive (https://linux.die.net/man/1/tweak)

I can't think of any ncurses app that doesn't have in-app help.

These are the steps I went through to figure out the hotkeys:

- press h, Shift+H, ?, Ctrl-H, Ctrl-H+Ctrl-H, F1, :help (nope)

- open new tab

- change directory to where I compiled tweak

- ./tweak --help (nothing there either)

- ls (figure out what the man filename is)

- man ./tweak.1 (shows the default hotkeys, but not easy to parse as a quick reference)

- ./tweak -D (success)

For comparison in top, h works, ? also works, and pressing Ctrl-H shows a helpful message "Unknown command - try 'h' for help"

> * hotkeys are reasonable if you use readline(1)

Readline is reasonable for a line editor, not for an ncurses app. Just as an example, in every other hex editor I've used, pressing Tab would switch between the hex/ascii panes.

> * C-s, C-r are search forward and backwards, as expected.

I missed that, but can't search by hex, which is another basic functionality


> can't search by hex, which is another basic functionality

As the manpage mentions in the “Searching” section, you can search for byte values by preceding them with backslashes.


Agreed entirely on your summation at the bottom. No good CLI hex editors exist. It's been like this for years, not sure why. dhex is indeed the "closest" to decent, but still lacks... a lot.


High-Inertia Random Idea #218,554,763:

A service people can publish extremely specific ideas to about how they want certain things - apps, systems, hardware, policies, anything - to work, and the best/most insightful ideas float to the top. Then, people who a) want to Start Again™ or b) vent off some steam in the general direction of a project they're already familiar with, can just look at this site for community-ranked suggestions of where the most demand is.


You could implement a rough version of this as a subreddit.


No good CLI hex editors exist.

HIEW has my preferred hex editor for over two decades.

I would not use a hex editor with no search function, yet "decimal offsets/size" is one of the last things I care about if I'm using a hex editor.


I think Okteta has all of the features you mentioned except change highlighting.


Yes, Okteta does everything except change highlighting, and is probably the best GUI option on Linux.


HxD is pretty good in windows but I have had a few minor issues with it. I’ve been trying to get an image of a floppy disk to a analyze but unfortunately loading it from HxD seems to cut some of the data off. I verified this by transferring some files from the original device via xmodem to a PC and noticed the end of the last file was cut off in the floppy disk image. I’ve tried a couple other hex editors with various permissions issues dumping a disk for some reason. Any idea for something that can dump this disk so I can analyze it in a hex editor?


Try WinImage on Windows, or dd on Linux


For a Mac GUI app, try Hexfiend: https://hexfiend.com. (App store link: https://apps.apple.com/us/app/hex-fiend/id1342896380?mt=12)


That thing is the best. I don’t understand exactly how it works, but it can handle files up to any size without breaking a sweat.


You don't have to load the entire file to RAM, you can read just parts of it.


While chunking data, it's non-trivial to handle fast scrolling and populating the scrollbars with the correct size of what the eventual size would be.


Non-trivial for sure. But much easier with a hex editor than with, say, a standard text editor. No need to guess how many lines some text takes up due to word wrapping; no need to even search for newlines. There's a fixed number of bytes per row, and a fixed row height, so mapping between byte offset, line number, and pixel y-position is just a matter of multiplying and dividing.


Agree. Nothing to argue there.


I might be wrong, but I believe the fastest way is to seek to the end of file and that way you can get the total file size. Or use like fstat or something

Maybe scrolling is non trivial, but there is definitely a lot of room for optimizations. You could store large chunks in memory and populate them asynchronously with smaller chunks from file for example.

edit: I completely missed the point with the size, comex's answer is way better


Why is it non-trivial? Each line shows a fixed amount of data, and you know the total amount of data. Sounds pretty trivial to me.


Screenshot of it editing itself, in case someone wants to know what it looks like: https://imgur.com/a/lndAWbm


Thank you! The author should really add a screenshot to the webpage.


Took a bit of remembering what a pain autoconf can be :)


Well I have to mention JOE, in hex edit mode (Hit Ctrl-T ' after opening a file)

- Has help screen

- Has undo/redo

- Handles large files. Also you can specify a range of a file to edit, like this: "joe -hex /dev/sda,0,1024" to edit the partition table in hex..

- Has search/replace, including incremental search

- Shows offset: (Ctrl-K SPACE or put it in the status line)

- Can jump to byte: (Ctrl-K X byte). You can enter math at this prompt, like: "size-10" to go to 10 bytes from the end

- Supports insert or overtype mode

- Enter ASCII, decimal, hex, or octal bytes


Nice. I'm personally not a fan of EMACS style control codes everywhere, but I like the simple and straightforward interface.


Another project screaming for a screenshot in the homepage. So many projects that don't have one.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: