Text Editor: Data Structures (averylaird.com)
268 points by ibobev 9 months ago | 79 comments



VSCode has a blog article talking about their move to using a piece table as their main data structure. https://code.visualstudio.com/blogs/2018/03/23/text-buffer-r...

Has some good visualizations as well.


> The worst way to store and manipulate text is to use an array.

Claim made from theoretical considerations, without any actual reference to real-world editors. The popular Micro[1] text editor uses a simple line array[2], and performs fantastically well on real-world editing tasks.

Meanwhile, ropes are so complicated that even high-quality implementations have extremely subtle bugs[3] that can lead to state or content corruption.

Which data structure is "best" is not just a function of its asymptotic performance. Practical considerations are equally important (arguably more so).

[1] https://github.com/zyedidia/micro

[2] https://github.com/zyedidia/micro/blob/master/internal/buffe...

[3] https://github.com/cessen/ropey/pull/67


it's talking about an array of bytes, and you're talking about an array of lines, which performs acceptably in a much wider range of situations than what's being dismissed

ropes can be complicated but they don't have to be. ropey is ten thousand lines of code, and only six years old, so it's not surprising that it's buggy. https://github.com/ivanbgd/Rope-Data-Structure-C is 339 lines of code, but it's probably not a real-world rope

probably the most widely used implementation of ropes is the one from the sgi stl, which is in gcc as libstdc++/ext/rope https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USER... and libstdc++/ext/ropeimpl.h https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USER..., which total 1485 lines of code, about halfway in between the two. in a garbage-collected language, like the cedar language where ropes originated, most of that complexity goes away, and production-quality ropes are only a few hundred lines of code. also c++'s verbose syntax isn't doing brevity any favors here

http://www.bitsavers.org/pdf/xerox/parc/cedar/Cedar_7.0/09_C...


gap buffer performs surprisingly well vs rope https://coredumped.dev/2023/08/09/text-showdown-gap-buffers-...


this article is excellent, thank you

(but i didn't think it was surprising)


> > The worst way to store and manipulate text is to use an array.

> Claim made from theoretical considerations, without any actual reference to real-world editors.

I also found this weird, especially when the author then extols the piece table (an approach going back at least to the early 60s) which itself is nothing but an array!

And the author did cite emacs as an example (emacs inherited that approach from TECO).


s/piece table/buffer gap/ or else my comment makes absolutely no sense.

Note to self: do not comment from bed.


> The worst way to store and manipulate text is to use an array. Firstly, the entire file must be loaded into the array first, which raises issues with time and memory. Even worse still, every insertion and deletion requires each element in the array to be moved. There are more downsides, but already this method is clearly not practical. The array can be dismissed as an option rather quickly.

Someone has clearly not heard of Notepad nor used the prolific quantity of text editors for DOS that appeared in the 80s/early 90s, a time when memory bandwidths were in the low MB/s; or even before that, micros with CP/M where memory bandwidths were measured in KB/s (and the whole address space was only 64k). Modern systems have memory bandwidths of dozens of GB/s.

Or perhaps that old saying really applies: worse is better.

IMHO an array is all you need for a regular general-purpose text editor. I hate needless complexity in software, but unfortunately others seem to love it.


> an array is all you need for a regular general-purpose text editor.

until you try to open up a multi gigabyte log file in notepad, and find that it's consuming all your system resources.

But opening that same file in Sublime Text works perfectly: it scrolls fast with zero lag. Guess which one is my daily driver?


Reading log files is completely different from constantly manipulating source code? I wish editors had a read-only mode that uses a data structure backed by lazily loaded contiguous memory.


if you use an array in the naïve way on an 8080 with, as you correctly point out, memory bandwidths on the order of 16 kilobytes per second, as soon as your text file has grown to 48 kilobytes (about 17 printed pages and less than half a floppy disk), inserting a letter near the beginning of it will take multiple seconds; you can type several times as fast as the editor can respond, even with hunt and peck

consequently the editors we used on those computers did not use this approach

however, a gap buffer is perfectly workable, and preserves most of the array's advantages, such as rapid text search. and so emacs continues to use gap buffers to this day
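
for concreteness, here's a minimal gap buffer sketch in c++ (illustrative only, not emacs's actual implementation):

    #include <algorithm>
    #include <cstddef>
    #include <string>

    // a gap buffer: one contiguous array with a "gap" of unused space at
    // the cursor. insertions fill the gap; moving the cursor slides text
    // across the gap instead of shifting the whole tail of the file.
    class GapBuffer {
        std::string buf;       // text before gap ++ gap ++ text after gap
        size_t gap_start = 0;  // first unused index
        size_t gap_end = 0;    // one past the last unused index

    public:
        explicit GapBuffer(size_t initial_gap = 4096)
            : buf(initial_gap, '\0'), gap_end(initial_gap) {}

        size_t size() const { return buf.size() - (gap_end - gap_start); }

        // slide the gap so it starts at text position pos; O(distance moved)
        void move_gap(size_t pos) {
            while (gap_start > pos) buf[--gap_end] = buf[--gap_start];
            while (gap_start < pos) buf[gap_start++] = buf[gap_end++];
        }

        void insert(size_t pos, char c) {   // insert at text position pos
            if (gap_start == gap_end) grow();
            move_gap(pos);
            buf[gap_start++] = c;
        }

        void erase(size_t pos) {            // delete the char at text position pos
            move_gap(pos);
            ++gap_end;                      // absorb the character into the gap
        }

        char at(size_t pos) const {
            return pos < gap_start ? buf[pos] : buf[pos + (gap_end - gap_start)];
        }

    private:
        void grow() {                       // double the buffer, keep the gap's position
            size_t old_size = buf.size(), tail = old_size - gap_end;
            buf.resize(old_size * 2);
            std::copy_backward(buf.begin() + gap_end, buf.begin() + old_size, buf.end());
            gap_end = buf.size() - tail;
        }
    };
the nice property is that consecutive edits at the same spot cost O(1) each; the O(n) slide only happens when the cursor jumps far away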


Yep.

Q: How big is the 90th percentile text file?

64KB? That text file fits in some level of cache, not just RAM. An array is just fine most of the time.

Use a different purpose-built editor for those multi-GB text files.


If you _have_ an editor purpose-built for those files, why not just use it for everything?


Because those editors aren't your preferred editor? I don't really care about being able to edit files that big so that's not going to be the reason why I pick an editor.


A craftsman has different tools for different purposes.

E.g. text editor, hex editor, source code editor.


Turns out there's no need for that when all the good editors support all of those things out of the box.


Agreed. Our computers are so powerful, do we really need this additional complexity?

And of course, if performance becomes an issue, we can make minor improvements then.


That's why all these programs land these improvements - the performance optimizations were needed. It gets tricky of course, because a performance optimization you needed at one time may no longer be needed as hardware performance increases, or may even become a penalty as the hardware architecture diverges from the one the code was optimized for.


In my text editor (https://github.com/alefore/edge) I use a balanced binary tree containing small chunks (std::vector) of contiguous lines.

That works well enough for me: https://asciinema.org/a/314752

This requires loading the entire file into memory, but computers are so fast today that optimizing that away is pointless IMO (as the video shows, a 12MB file with 400k lines can be loaded/edited/saved reasonably).

The tree structure is implemented as a generic container here: https://github.com/alefore/edge/blob/master/src/language/con.... The tree is actually deeply immutable; each modification operation returns a new tree (where most contents are shared with the original tree). The leaves contain std::vectors that hold between 128 and 256 lines (save for a few cases). The actual specialization that holds a sequence of lines is LineSequence::Lines here: https://github.com/alefore/edge/blob/master/src/language/tex...
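
For the curious, a rough sketch of the general shape of such a structure (illustrative C++, not the actual edge code; the balancing and the 128-256-line chunk management are omitted):

    #include <memory>
    #include <string>
    #include <vector>

    // An immutable tree of line chunks. Each node knows how many lines its
    // subtree holds, so lookup by line number descends in O(log n). A
    // modification copies only the path from the root to the touched leaf;
    // all other nodes are shared with the previous version of the tree.
    struct Node {
        std::shared_ptr<const Node> left, right;
        std::vector<std::string> lines;  // non-empty only in leaf nodes
        size_t count = 0;                // total number of lines in this subtree
    };
    using Tree = std::shared_ptr<const Node>;

    size_t count(const Tree& t) { return t ? t->count : 0; }

    const std::string& line_at(const Tree& t, size_t i) {
        if (!t->lines.empty()) return t->lines[i];   // leaf chunk
        size_t in_left = count(t->left);
        return i < in_left ? line_at(t->left, i) : line_at(t->right, i - in_left);
    }

    // Returns a new tree with line i replaced; shares every untouched node.
    Tree set_line(const Tree& t, size_t i, std::string s) {
        auto n = std::make_shared<Node>(*t);         // copy just this node
        if (!n->lines.empty()) {
            n->lines[i] = std::move(s);
        } else {
            size_t in_left = count(n->left);
            if (i < in_left) n->left = set_line(n->left, i, std::move(s));
            else             n->right = set_line(n->right, i - in_left, std::move(s));
        }
        return n;
    }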


12 MB isn’t necessarily that much. Here is a 96 MB SVG file: https://commons.m.wikimedia.org/wiki/File:Koppen-Geiger_Map_...

Here are genome sequence files that are several hundred MB compressed (the decompressed format is text): https://ftp.ncbi.nlm.nih.gov/genomes/refseq/

Finally here is an almost 2 TB XML file: https://wiki.openstreetmap.org/wiki/Planet.osm

These aren’t the most typical use cases, but it’s nice if an editor can at least handle > 32-bit files (larger than 2 or 4 GB).


I just use a (Ruby) array of strings for mine, on the basis that I have never on purpose wanted to actually edit a file large enough for this to be a problem. Computers keep getting faster, but the files I want to edit really don't.

If I open a huge one it's usually either an accident or because I want to view it, not edit it. It's easy enough to support abnormally large files, but it felt to me like a distraction that adds complexity for the sake of a fringe case I was perfectly happy to just define as out of scope.

Incidentally, the server process for my editor holds in memory every buffer that I've ever opened (over several years) and not explicitly killed, and it's consuming too little for me to bother adding any garbage collection for buffers or similar (it's synced to disk, so on reboot I have the full same set of buffers available). We're talking many hundreds of buffers. RAM is cheap and plentiful.


You've worked primarily in your own home built editor for years? That's pretty cool.


Yeah. 6 years or so now. Putting the buffers in a server that checkpoints regularly was key to making that easy early on - it made it really hard to lose any data, even when the editor crashed regularly.


I'm not the person you directly replied to, but I've worked on my text editor since 2014 and used it exclusively since ~2015. It's been a lot of fun!


How does that perform when the entire 12MB file is a single line?

Before you say it's unrealistic; sure, it's not a normal use case for code, but I frequently find myself wanting to open a huge JSON file, for example, just to perform a search/replace, or a minor edit, or copy some fragment, or just look at its contents. Pretty-printing first, or doing the change with a specialized tool like jq, is also something I do, but being able to just load the file into a text editor is often more convenient.

In my experience, editors tend to break down on edge cases like these. Editors (e.g. JetBrains) often turn off syntax highlighting and/or edit capabilities on large files, not to mention put an upper limit on the number of cursors.


Thank you for your response. You ask a good question.

I expect very long lines should work just fine, but I'll give it a try when I have a chance.

I represent lines using a virtual class that has a few implementations. IIRC, the line will start as a tree containing 64kb chunks. As edits are applied, these chunks will gradually get replaced with other implementations (of the virtual class) that delegate to the original instances (and the representation is occasionally optimized (which can incur copying) to avoid too much nesting). So simplifying, I think this should just work due to the optimized representation that I use for lines, which doesn't require the characters to be contiguous in memory.

I fully agree with your observation about how things often break in the corner cases. I actually also put a configurable cap on the number of cursors (e.g., a reg exp search stops after 100 matches).

For syntax highlighting I don't currently have a cap, but this is ~fine: it just ~wastes a background thread (all syntax parsing is offloaded to a background thread, not blocking the main thread). But yeah, I should probably make this thread give up after a configurable time out.


Looks like CodeMirror does something similar with long/huge files https://codemirror.net/examples/million/


You would have to load the entire file either way, even if you'd mmap() it, to be able to break the lines to find their extents and start offsets.

mmap() isn't entirely safe under Unix/Linux anyway because there is no Mandatory Locking. And it would still be useful only if you operate on the original text encoding.


Yeah, good point.

You want to at the very least be able to show the user on which line his cursor is, and that already requires reading all the contents from the start up to the cursor. You might as well read the entire file and also show the user the total line count.


It's all fun and games until you're loading a multi gigabyte CSV file


There are lots of comments here about how this isn't needed for "typical" editing of smallish files and how you can get away with something very simple, but that isn't very interesting.

What is interesting is the opposite question: how do you enable editing that is bounded only by storage limits?

A piece table is a good solution for this since the original file and the added data are immutable, which lends itself to mmap-ing. Then you can implement the piece table and various indexes (line breaks, etc) in a database (such as SQLite) or similar. Overall this gives you consistent performance at any size or number of edits, limited only by storage.
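
A minimal in-memory piece-table sketch (illustrative C++, insert only; a real implementation would mmap the original file, keep the piece list and indexes in the database, and implement delete as more piece splitting):

    #include <cstddef>
    #include <string>
    #include <vector>

    // A piece table: the original text and an append-only "add" buffer are
    // both immutable; the document is an ordered list of pieces, each
    // referencing a span of one of the two buffers. Edits only touch the
    // piece list, never the buffers themselves.
    class PieceTable {
        std::string original;   // never modified after loading (could be mmap'd)
        std::string added;      // only ever appended to
        struct Piece { bool in_added; size_t start, length; };
        std::vector<Piece> pieces;

    public:
        explicit PieceTable(std::string file_contents)
            : original(std::move(file_contents)),
              pieces{{false, 0, original.size()}} {}

        // Insert text at document offset pos by splitting one piece into
        // (at most) three: prefix, the newly added text, suffix.
        void insert(size_t pos, const std::string& text) {
            Piece new_piece{true, added.size(), text.size()};
            added += text;
            size_t offset = 0;
            for (size_t i = 0; i < pieces.size(); ++i) {
                Piece p = pieces[i];
                if (pos <= offset + p.length) {
                    size_t split = pos - offset;
                    std::vector<Piece> repl;
                    if (split > 0) repl.push_back({p.in_added, p.start, split});
                    repl.push_back(new_piece);
                    if (split < p.length)
                        repl.push_back({p.in_added, p.start + split, p.length - split});
                    pieces.erase(pieces.begin() + i);
                    pieces.insert(pieces.begin() + i, repl.begin(), repl.end());
                    return;
                }
                offset += p.length;
            }
            pieces.push_back(new_piece);   // pos was past the last piece
        }

        // Reassemble the current document (in practice you'd iterate pieces lazily).
        std::string contents() const {
            std::string out;
            for (const Piece& p : pieces)
                out += (p.in_added ? added : original).substr(p.start, p.length);
            return out;
        }
    };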


the original file isn't immutable unless you have reflinks or equivalent


From a linked article https://web.archive.org/web/20160308183811/http://1017.songt...:

>It was also possible that characters that were still stored in the document file, had been deleted since those characters were first copied into the file. That meant if a user of the program had typed embarassing or incriminating text in a Word document, saved it and then thought better of publishing such thoughts, and erased them, that once saved text still was recorded in a Word file. If you used an operating system utility to dump the raw content of a fast saved Word file, this stuff became visible to the technically minded nosy person.

This fast-save feature is hilarious and reminds me of https://en.wikipedia.org/wiki/ACropalypse. I wonder how many awkward stories are out there caused by this "optimization".



But what is a good structure for modern text editors in 2024? Nowadays text editors need to keep track of git commits (by whom, when, which line(s)), references and footnotes (the author mentions Markdown's manual [^...] notation, but ideally you'd want something automatic; even LaTeX isn't automatic by default), etc.


RRB-trees will give you an immutable data structure with pretty darn ok random access and insertion, and free undo/redo.

Not only that, multiple cursors become simple.
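
A minimal sketch of the "free undo/redo" part (illustrative C++; this works with any persistent document representation, not just RRB-trees): history is just a list of old roots.

    #include <memory>
    #include <vector>

    // With an immutable document, every edit produces a new root that shares
    // most of its nodes with the previous one, so undo/redo is nothing more
    // than a vector of old roots plus an index into it.
    template <typename Doc>
    class History {
        std::vector<std::shared_ptr<const Doc>> versions;
        size_t current = 0;

    public:
        explicit History(std::shared_ptr<const Doc> initial)
            : versions{std::move(initial)} {}

        const Doc& now() const { return *versions[current]; }

        void edit(std::shared_ptr<const Doc> next) {
            versions.resize(current + 1);       // drop any redo branch
            versions.push_back(std::move(next));
            ++current;
        }

        bool undo() { if (current == 0) return false; --current; return true; }
        bool redo() {
            if (current + 1 >= versions.size()) return false;
            ++current;
            return true;
        }
    };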


the best writing i've found on that topic is https://xi-editor.io/docs/rope_science_00.html


Ropes have all kinds of bonus benefits. The Xi editor author has some good notes about text editor rope coding: https://abishov.com/xi-editor/docs/rope_science_01.html


Project site linked from the GitHub[0] is https://xi-editor.io. Linked doc is a mirror of this[1], which was originally written by Raph Levien.

[0]: https://github.com/xi-editor/xi-editor

[1]: https://xi-editor.io/docs/rope_science_01.html


A nice companion piece, focused on a related problem:

"An Efficient Data Structure For A Hex Editor" https://www.chiark.greenend.org.uk/~sgtatham/tweak/btree.htm...


Previous discussion from 2017: https://news.ycombinator.com/item?id=15381886 (121 comments)


You can get the best of piece tables and ropes using a B-tree-based rope: at small sizes, you get a single node that works like a sorted piece table, but you can also scale up to larger documents without the high constant factors of a binary tree.


You can also get the best of btrees and gap buffers by putting a gap buffer in each leaf node in the tree. That allows the leaves to be much bigger (400 bytes seems to work well in practice) - and that helps keep the tree itself small and thus in the cpu cache.

As the article says, it’s quite a complex data structure though.


I've been really curious about the algorithms and data structures behind text editors. Does anyone know of a book or YouTube series that does a good job of slowly building a working text editor?


I don't know if it answers your question per se but let me mention: The Craft of Text Editing

https://www.finseth.com/craft/

https://news.ycombinator.com/item?id=13518170


That looks great indeed, thank you. I assume the concepts are generalizable even though it uses Emacs to concretize them? I ask since I'm a Vim guy (not trying to start a flame war haha)


Sure. After all, you can use evil-mode to give emacs vi/vim keybindings.


What about giving it all away to SQLite or similar, treating your document as a table with an index, and text as rows with parent references?

Then you don't need to keep it in memory, you can have unlimited undo, and you hand the complexity over to a solution (SQL) that you know will be there for the next 20 years.


it sounds like a reasonable approach, though some algorithms like string search may be trickier to implement than they would be with a simpler data structure, because a string may cross piece boundaries, maybe even more than once

you don't even need parent references; you can just assign synthetic sequence identifiers to the pieces and rely on sql lexicographical ordering to put them in order. (sql is not really very good with parent references.) this would be handy when you want to annotate the text with things like line numbers and syntax highlighting information. you wouldn't get undo for free, though


What happens when someone opens up a 500 MB one-line json blob or something?


sql rows don't have to be text lines


Sadly, no one will use your extremely slow text editor.


Ropes have been the hype for years, and now the author claims it was the piece table all along.


Honestly an array isn’t really the worst… for smallish files on modern processors. Sequential memcpy is pretty dang fast and pointer indirection can be slow.

Just sayin’


Worked on a homegrown Mac WYSIWYG editor back in the 90s. Arrays worked perfectly. If you could assume that files fit in memory, using BlockMove() was very, very fast indeed.

I can see if you need to edit multi-gig log files and things will not fit in memory, but for small files, array is totally fine.

There were other tricks that were done back then to keep the number of single char inserts down to a minimum while typing. Like reading chars into a small buffer during fast typing and then inserting all the keystrokes at once as soon as you had the time.


Yes. When typing normally there is no slowness because you’re swapping one string for another.

But when pressing enter, pasting a bigger snippet, or deleting lines, the slowdown is less noticeable. When you type a word, the deltas are tiny and easy for your brain to process, while changes to multiple lines happen less often, so waiting a couple of frames doesn't bother you much.

This is at millions of lines. With small files it is indeed too fast.


I was amused that this approach was dismissed without analysis as impractical. Sure it was impractical in living memory when CPUs and memory were many many orders of magnitude slower.


An array and nothing else? It very quickly stops making sense to copy the entire tail section of the file every time the user inserts text in the middle.


A conservative estimate for memmove() speed on a computer from even a decade ago would be 1GB/s. That's 1ms to move 1MB. No human will notice a latency below 10ms.
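
If you want to sanity-check the estimate on your own machine, a rough micro-benchmark like this will do (numbers vary a lot with CPU and cache behaviour):

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    // Rough measurement of the cost of a single-character insertion near the
    // start of a 1 MiB array-backed buffer: shift the whole tail right by one.
    int main() {
        constexpr size_t kSize = 1 << 20;             // 1 MiB
        std::vector<char> buf(kSize + 1, 'x');
        constexpr int kIters = 1000;
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < kIters; ++i)
            std::memmove(buf.data() + 1, buf.data(), kSize);
        auto stop = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(stop - start).count();
        std::printf("%.3f ms per 1 MiB memmove\n", ms / kIters);
    }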


this is true, and a good point; the machine i'm typing this on is closer to 20 gigabytes a second. but, for example, if i refill a longish paragraph in emacs, it might easily do dozens of insertions; if i indent or outdent a block of code, it might do hundreds. it's easy to imagine a situation where i have a 13-megabyte mail.log open in emacs and want to do a search-and-replace to reduce the noise level in it, maybe making 12895 edits. this currently takes less than a second (emacs uses a gap buffer), but if each of those edits took 6 milliseconds it would have been a minute or two


If it was an issue, and a simple gap buffer isn't fast enough, you could do a linked list of gap buffers. Would still be pretty simple, and probably very fast.


and if an array doesn't work, a gap buffer isn't much additional complexity.

all of my C homebrew editors have been gap buffer based, and I told myself I'd reimplement if I ever wanted to edit something large enough for that to be a problem — but it never was.


This sort of data structure evaluation mostly starts to make sense when you approach interactive large-file editing as a soft-realtime problem.


if you don't have reflinks (or some non-linux equivalent) you're going to have to load the entire file into an array when you open it anyway, even if that array isn't where you edit it

the performance cost of a dumb array is being oversold here. it's true that insertion and deletion are slow in arrays, but let's put this in perspective. i'm displaying my editor on a 2-megapixel display, which is 8 megabytes of pixels because they're 32 bits per pixel. drawing a single frame of video involves the same amount of memory traffic as memmove()ing 4 megabytes of text. so until you have a few megabytes, even with a dumb array, ordinary text editing operations will easily support interactive responsivity. even things like refilling paragraphs and indenting or outdenting regions, which might involve hundreds of individual edits, are fine on everyday-sized files. this laptop can do about 20 gigabytes a second to main memory, 10 gigabytes of memmove(), so in 16.7 milliseconds (a 60-hertz frame) it can memmove() 167 megabytes. it isn't until you get into things like search-and-replace jobs on big logfiles that you start to feel the pain

the complexity of ropes is also being oversold here. as i pointed out in a comment in a subthread, you can make ropes quite complex, but https://github.com/ivanbgd/Rope-Data-Structure-C is 339 lines of code. it's probably not a real-world rope, though; it has far too much memory overhead

probably the most widely used implementation of ropes is the one from the sgi stl, which is in gcc as libstdc++/ext/rope https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USER... and libstdc++/ext/ropeimpl.h https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USER..., which total 1485 lines of code, about five times larger. c++'s verbose syntax isn't doing brevity any favors here

in a garbage-collected language, like the cedar language where ropes originated, most of that complexity goes away, and production-quality ropes can be only a few hundred lines of code

http://www.bitsavers.org/pdf/xerox/parc/cedar/Cedar_7.0/09_C...

another somewhat popular rope implementation is librope, in c https://github.com/josephg/librope which is 684 lines of code and looks to be production-quality

there's an unrelated librope-ocaml in debian (and opam); it's 1029 lines of code but nothing in debian depends on it. it looks quite full-featured; the documentation says it 'has all non-deprecated operations of String' http://chris00.github.io/ocaml-rope/doc/rope/Rope/

in ur-scheme i used the simplest, dumbest kind of rope for building up output text (input was parsed into s-expressions character by character): an output text was either a string, a pair (representing the concatenation of its arguments), or nil (representing the empty string). this is very similar to erlang's io lists. the only operations supported were (constant-time) concatenation (implemented as cons) and (linear-time) flattening into a string; this took 27 lines of scheme because the subset of scheme i implemented it in was quite limited. http://canonical.org/~kragen/sw/urscheme/
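
the same idea is only a couple dozen lines in most languages; e.g. a sketch of it in c++ (not the ur-scheme code, obviously):

    #include <memory>
    #include <string>

    // the dumbest possible rope: a text is either empty (nullptr), a string
    // leaf, or the concatenation of two texts. concatenation is O(1) like
    // cons; flattening into a contiguous string is O(n). no balancing, no
    // random access, which is fine for building up output.
    struct Text {
        std::string leaf;                          // used when left/right are null
        std::shared_ptr<const Text> left, right;   // used for concatenation nodes
    };
    using TextPtr = std::shared_ptr<const Text>;

    TextPtr leaf(std::string s) {
        return std::make_shared<Text>(Text{std::move(s), nullptr, nullptr});
    }

    TextPtr concat(TextPtr a, TextPtr b) {         // O(1)
        if (!a) return b;
        if (!b) return a;
        return std::make_shared<Text>(Text{"", std::move(a), std::move(b)});
    }

    void flatten(const TextPtr& t, std::string& out) {  // O(n), like an erlang io list
        if (!t) return;
        if (!t->left && !t->right) { out += t->leaf; return; }
        flatten(t->left, out);
        flatten(t->right, out);
    }

    // usage: auto doc = concat(leaf("hello, "), concat(leaf("world"), leaf("\n")));
    //        std::string s; flatten(doc, s);      // "hello, world\n"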

finseth's book is worth reading https://www.finseth.com/craft/, tatham's notes on b-tree ropes https://www.chiark.greenend.org.uk/~sgtatham/tweak/btree.htm... are worth reading, and raph levien's notes on xi and ropes https://xi-editor.io/docs/rope_science_00.html are worth reading. chris wellons's post about gap buffers https://nullprogram.com/blog/2017/09/07/ is a little bit silly but has animations


Emacs is the fastest editor the author has worked with!!??


To be fair, on a modern setup it is fast. Sure, it's not nano or vi(m), but compared to Electron-based stuff it's plenty fast.

Five years ago:

    time emacs -Q -eval '(kill-emacs)'
would be slow at 320 ms.

But on a "recent" AMD 7700X that'd be 80ms.

That's starting emacs, evaluating elisp code, and exiting Emacs.

Surely if you can start Emacs, execute elisp code, and exit in 80ms, it shouldn't lag too much to insert a character.

Now, I know, some modes can be slow. But Emacs in recent versions has seen lots of changes enhancing its performance. And hardware has become really fast.

YMMV but my Emacs is really incredibly responsive.

P.S.: I was already using Emacs on a 486 back in the "Eight Megabytes And Constantly Swapping" days. The joke was fun and true. Nowadays Emacs is actually one of the lightest pieces of software I use.


Yeppers. The days of waiting a minute or more for emacs to initialize itself are long gone for most. Still marginally slower than nvim or vim for me (on a 2016 vintage i7-6700k on Linux under X11 inside of a suckless `st` terminal):

    $ tim 'vim --clean -c quit' 'nvim --clean -c quit' 'emacs -nw -Q -eval "(kill-emacs)"'

    (3.45 +- 0.27)e-04      (AlreadySubtracted)Overhead
    (7.929 +- 0.052)e-03    vim --clean -c quit
    (8.329 +- 0.031)e-03    nvim --clean -c quit
    0.05302 +- 0.00023      emacs -nw -Q -eval "(kill-emacs)"
There can still be a human-noticeable difference when you fire up emacs on some 2.5 MiB org-mode "active document" and it takes over 1000 milliseconds (a whole heartbeat! omg! /sarcasm) while vim still takes under 20 ms. And, similarly, vim/nvim with a lot of plugins loading can push up their start-up times as well.

EDIT: BTW, chances are good that the 4X reduction from 320ms to 80ms over 5 years you are seeing is in large part due to ahead-of-time native compilation of elisp (https://www.gnu.org/software/emacs/manual/html_node/elisp/Na... ) and libgccjit/etc. not just CPU/DRAM improvement.


Even under WSL emacs is fast:

    [9front@RIPPER ~]$ time emacs -Q -eval '(kill-emacs)'
    
    real    0m0.193s
    user    0m0.136s
    sys     0m0.010s


There are other editors?!


He never tried anything else. For speed, vi is number one, followed by Sublime Text.


Even the suggestion that vim is fast is funny to me.

Vim feels sluggish compared to NeoVim.

NeoVim feels sluggish compared to Helix.


That's why GP said 'vi', not 'vim' or 'neovim'. It's an important distinction. You can still get old-school vi on most distributions, and it is far smaller and lighter than vim. In my experience, it feels snappy even if you're in an ssh session to a server in a different timezone.


> NeoVim feels sluggish compared to Helix.

Of course, if you offer only a portion of NeoVim's features the speed improvement should not be surprising.


At the same time, NeoVim lacks many, many of Helix' features unless you use plugins. That's fine, of course, plugins and customization are part of what makes NeoVim so great.

Anyway, can you name a few NeoVim features you miss in Helix? I still use NeoVim now and then, so I'd be happy to learn more, even if it's something small.


How does Vim feel sluggish? Everything in Vim feels instantaneous to me.


Granted, it's been long enough I don't even quite remember the exact issues I had, so take it with a grain of salt. It was definitely primarily weird edge cases, and might've been related to plugin use.

This is one area where NeoVim afaik comes out on top: Vimscript isn't fast, and NeoVim allows Lua plugins via the embedded LuaJIT, and those should be significantly faster.


As an emacs user I find it offensive that he tried other editors


> In fact, the editor I’m writing this in – Emacs – uses a gap buffer, and it’s probably the fastest editor I’ve ever used. That fact alone is a pretty convincing argument to use a gap buffer.

I know for a fact that vim is faster than Emacs and I'm pretty sure it uses an array (although I don't know that for a fact).


i thought vim used an array of lines, not an array of bytes, but it turns out it's more like an array of pages; https://www.free-soft.org/FSM/english/issue01/vim.html describes it, https://github.com/vim/vim/blob/e723c42836d971180d1bf9f98916... is the implementation



