If you want to see a modern implementation in practice, TruffleRuby is an implementation of Ruby that uses ropes instead of character arrays for strings.
It works brilliantly for mostly string concatenation workloads, like rendering web page templates.
It gets tricky when deciding when to flatten the rope, for example when people repeatedly mutate single characters or index into the string.
The cool thing is that it can provide a char* interface to the rope data structure, in order to support C extensions that aren't expecting ropes.
https://rubybib.org/#menard2018
This was an interesting paper, and I read over it while thinking about implementing ropes for https://www.oilshell.org/ (which would be a huge change, affecting the entire interpreter, since shell scripts deal almost entirely with strings)
What surprised me is that Ruby template engines make heavy use of string concatenation? Like 'a' + 'b' ?
The generated Ruby fragment from ERB makes heavy use of string concatenation, which is generally favorable to ropes. The right-most graph of Figure 5 shows that TruffleRuby ropes are 9.4× faster than MRI, rendering around 870,000 templates per second.
I would have thought the idiom for writing a template engine is to use something like Java's StringBuffer, or the Python pattern of appending immutable strings to a list, and joining them at the end. So you avoid a ton of intermediate strings that way. (And I have written a template engine myself in both Python and JS, and that is the technique I used, and it's plenty fast)
Would that technique speed up ERB, or is there something I'm missing? Not to say it's not worthwhile to still do ropes, but it seems like low hanging fruit.
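For concreteness, a minimal sketch of the idiom I mean, in Python with made-up fragments:

```python
# Accumulate immutable fragments; nothing is concatenated along the way.
parts = []
for chunk in ["<ul>", "<li>a</li>", "<li>b</li>", "</ul>"]:
    parts.append(chunk)        # just a list append, no string copying yet
html = "".join(parts)          # every byte is copied exactly once, here
```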
-----
Though what turned me off of ropes is that for many workloads, most strings are small. And small strings are often smaller than records of pointers (and they have more locality). If I have a pointer to a string, a start offset, and an end offset, on a 64-bit machine that's 8 + 4 + 4 = 16 bytes, which is already approaching the size of an HTML tag or two. And a GC has to chase that pointer (in my situation), whereas a GC can stop if it's a blob of bytes.
Not to say it's not a useful technique of course ... I think it does make sense as an optimization when you have existing workloads to measure, but I was hesitant to do it starting from scratch.
I don't think ERB literally does +, I think it does <<, which is exactly the same as the string buffer thing you're suggesting.
But doesn't appending to a string buffer still mean copying a lot of bytes as you build up the final string?
In ropes you don't need to copy anything, you just create a little node of a couple of words to remember that two strings were joined. Yeah there's non-locality and things like that, but most people are building up strings not reading them. Then you write out the final string - and we have system calls like writev that work with ropes.
Finally, remember that if your string is created and used within a single compilation unit, the rope nodes will disappear with escape analysis and scalar replacement of aggregates. That helps operations on strings fold and become constant.
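To make the "little node" concrete, here's a toy sketch in Python (purely illustrative; TruffleRuby's real ropes are Java objects, and os.writev is POSIX-only):

```python
import os

class Leaf:
    """A flat run of bytes; never copied after creation."""
    def __init__(self, data: bytes):
        self.data = data
        self.length = len(data)

class Concat:
    """Concatenation just allocates a small node; no bytes move."""
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.length = left.length + right.length

def leaves(rope):
    """Yield the flat fragments in order, without flattening."""
    if isinstance(rope, Leaf):
        yield rope.data
    else:
        yield from leaves(rope.left)
        yield from leaves(rope.right)

# writev-style scatter-gather output: the rope is never flattened.
rope = Concat(Leaf(b"<h1>"), Concat(Leaf(b"hello"), Leaf(b"</h1>")))
os.writev(1, list(leaves(rope)))  # fd 1 = stdout
```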
> But doesn't appending to a string buffer still mean copying a lot of bytes as you build up the final string?
Look at the article and see for yourself. They do the string-builder measurement and show results. It's in Figure 5.
The "GEOMETRIC" approach to string building from small fragments is slightly slightly faster than the rope approach. That's where characters are copied to a contiguous buffer, and the buffer is reallocated by doubling in size whenever it is filled.
The slowest by far is the "UNBUFFERED" [cord] version, which creates a new cord node for each append, instead of copying the characters.
"BUFFERED" [cord] is quite good. It's still a bit slower than "GEOMETRIC", but good enough.
In a nutshell, the paper shows that copying bytes, so long as each individual copy is small, is essential to all the fast methods of string building.
That's because creating nodes and changing pointers is slower than copying contiguous bytes, up to some number of bytes. That number is probably larger now than it was when the paper was written due to modern memory architecture.
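To illustrate, a sketch of the GEOMETRIC technique (not the paper's code; note that Python's bytearray already grows geometrically on its own, so this just makes the mechanism explicit):

```python
class GeometricBuffer:
    """Contiguous buffer that doubles its capacity whenever it fills."""
    def __init__(self, initial=16):
        self.buf = bytearray(initial)
        self.used = 0

    def append(self, data: bytes):
        while self.used + len(data) > len(self.buf):
            grown = bytearray(len(self.buf) * 2)      # double the capacity
            grown[:self.used] = self.buf[:self.used]  # copy what's written so far
            self.buf = grown
        self.buf[self.used:self.used + len(data)] = data
        self.used += len(data)

    def result(self) -> bytes:
        return bytes(self.buf[:self.used])  # one final exact-size copy
```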
Hm that's interesting, actually the paper is sort of wishy-washy about the contribution? It doesn't claim that ropes are faster.
It claims that they make systems "more robust". I think they basically mean that certain apps that use long strings are less likely to run out of memory.
(Also, I would expect the penalties of pointer chasing, and thus of ropes, to be much more severe in 2020 vs. 1995, like maybe 10x more?)
We claim that the traditional implementations of strings, and often the supported functionality, are not well suited to such general-purpose use. They should be confined to applications with specific, and unusual, performance requirements. We present ‘ropes’ or ‘heavyweight’ strings as an alternative that, in our experience, leads to systems that are more robust, both in functionality and in performance.
Conclusion:
It is not clear whether the implementation we suggested above is in any sense optimal. However, it performs sufficiently well for our purposes, and sufficiently well that for most purposes it is preferred over traditional ‘flat’ string representations. Our experience has been that it results in software that is noticeably more robust, particularly in that it scales correctly to unexpectedly large inputs.
edit: I guess they are making a computational complexity claim then? But again I would say it's a common idiom in Python and JS to append to a list and join(). At least now it is, maybe in 1995 programmers weren't used to that idiom.
Right, << and StringBuffer is still less efficient, at least in terms of the asymptotic number of bytes copied.
I guess what I'm saying is that if you "properly" write a template engine, there should be no advantage to ropes at all.
Rather than a tree of non-copied fragments in the language runtime (ropes), you can simply create an array of non-copied fragments in the language itself.
Following the template engine I wrote in Python and JS:
- Fragments come from two places: holes and consts. Holes come from variables, and consts are figured out at template compile time (which happens, e.g., when the web server starts, and shouldn't be benchmarked).
- Fragments are never copied. They are simply appended to a big array of fragments. The template engine is a recursive tree walk interpreter that always appends to an array.
This is even true for a feature like #include. Some template engines may do the "right" thing for local fragments, but the copying can sneak back in for #include. If you care, though, it's easy to handle #include in a consistent way, as in the sketch below.
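Here's a hedged sketch of that architecture (the names are made up, not from my actual engine):

```python
# A recursive tree-walk interpreter that only ever appends fragments to
# one shared list; even #include recurses into the same list, so nothing
# is copied until the single join at the end.
def emit(node, context, out):
    kind = node[0]
    if kind == "const":               # literal text, fixed at compile time
        out.append(node[1])
    elif kind == "hole":              # variable substitution
        out.append(str(context[node[1]]))
    elif kind == "seq":               # a sequence of child nodes
        for child in node[1]:
            emit(child, context, out)
    elif kind == "include":           # included sub-template: same list,
        emit(node[1], context, out)   # no per-include flattening

def render(tree, context):
    out = []
    emit(tree, context, out)
    return "".join(out)               # the only copy of the output bytes

tree = ("seq", [("const", "Hello, "), ("hole", "name"),
                ("include", ("const", "!"))])
print(render(tree, {"name": "world"}))  # Hello, world!
```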
So if you use that "obvious" architecture, then I claim there should be no advantage to ropes. But I guess that's not what happens in practice!
(Well, seeing some sibling comments about copying being faster, at least the claim is that this architecture should be as fast or as slow as ropes! :) But yeah I think ERB is probably doing something different and even worse, which is not surprising.)
> I guess what I'm saying is that if you "properly" write a template engine, there should be no advantage to ropes at all.
If you're building up your string as an array of strings, and you want to pass that to some existing code that expects a full string (like the web framework you're using) then you're going to need to flatten it to create the final full string.
Ropes don't need to do that. They still have an advantage.
Also... the implementation doing things for you so you don't have to manually optimise your code is kind of the point of an optimisation. We can always say 'but I could do this manually' to any optimisation, couldn't we?
Yeah it's a moot point either way, because ERB is like that... but
- This wasn't an optimization, it was literally the first thing I thought of when I wrote my first and only template engine 10 years ago. And I have seen plenty of others do it; it is "obvious".
The code is the exact same length as the string version, so I don't consider it an optimization.
- Python's WSGI accepts an iterable of strings for the body. From a cursory reading, looks like Ruby's rack does too:
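For example, a minimal WSGI app can return the fragments directly (illustrative only):

```python
# The WSGI body is an iterable of byte strings; the server writes each
# fragment, so the application never has to join them into one string.
def app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<h1>", b"hello", b"</h1>"]
```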
Depends on what you mean by "a lot", doesn't it? Yes, you are going to copy all the bytes in the final result once (at least). But it's hard to see how to avoid that, and no, delegating that task to the kernel doesn't magically/consistently make it faster.
> you just create a little node of a couple of words
Hmm..."just" creating a "little" node can often be quite expensive on today's machines. And "words" are pretty big, typically 8 bytes per word.
> Yeah there's non-locality and things like that
Well, non-locality and "things like that" are kind of a biggie, these days. The biggie, actually, most of the time.
"...computation is essentially free, because it happens “in the cracks” between data fetch and data store; on a clocked processor, there is little difference in energy consumption between performing an arithmetic operation and performing a no-op." -- Andrew Black in Object-oriented programming: some history, and challenges for the next fifty years
Yes lots of caveats... but it works for Ruby in practice, often getting 10x on template rendering workloads.
Particularly, you 'actually...' me about how fetch and store dominates... but we're not storing anything into existing memory! To concatenate strings we only create new nodes and refer to the old nodes. You don't even need to read what's in the existing nodes! The essential fast-path of building up a template is write-only with bump-allocation into the extremely well-managed TLAB. No reading required. That's why it works so well.
Yes traversing the tree means writing it out is a little slower, but that's one small part at the end, with good pre-fetching opportunities.
Appending to a string buffer that grows exponentially will copy roughly twice as many bytes as building the result in a buffer that has the exact size. If you have to copy out of the resizable buffer into an exact-size area of memory, that becomes three times.
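A quick back-of-the-envelope check of the "roughly twice" claim, counting bytes moved while appending one byte at a time to a doubling buffer:

```python
def bytes_copied(n, capacity=1):
    """Count bytes moved while appending n bytes to a doubling buffer."""
    copied = 0
    for size in range(1, n + 1):
        if size > capacity:
            copied += size - 1   # resize: copy everything written so far
            capacity *= 2
        copied += 1              # write the new byte
    return copied

n = 1 << 20
print(bytes_copied(n) / n)  # ~2.0; add ~1.0 more for a final exact-size copy
```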
Hey Chris. Completely unrelated question. I've been following your work on truffleruby (it looks pretty awesome btw), and I was wondering what the ultimate goal was? Do you believe truffleruby can eventually become the goto Ruby for Rails developers? Is Shopify going to completely move to truffle and the graalvm?
I think ropes are less useful now than they were in 1995 because of the evolution of CPU and memory systems. In 1995, 100 MHz was considered fast, and memory access was not that much slower than the CPU.
Now CPUs are in the GHz range, and memory is a huge bottleneck. This is why a significant portion of the CPU is the integrated memory controller and various levels of cache.
With the current paradigm, cache friendliness will often trump something that has fewer instructions. With regular strings, scanning or copying a string is in order memory access which will take advantage of pre-fetching of the cache. With ropes, there will be pointer chasing which will cause all sorts of cache misses.
So, I think ropes are likely an even more niche use case than they were in 1995.
> memory access was not that much slower than the CPU.
Yes it was. 1995 was not 1965.
> This is why a significant portion of the CPU is the integrated memory controller and various levels of cache.
Microprocessor architectures available in 1995 had L1 and L2 caches. L2 was available for Pentiums, so it had basically trickled down to the consumer market.
These caches were small: misses were more likely, so optimizing for locality was super important.
I remember hearing that same talk in the 1990's: you don't want linked lists: think of the pointer chasing and cache misses, and the storage overhead for the pointers that wastes your precious little cache space (8K in an 80486).
> With regular strings, scanning or copying a string is in order memory access which will take advantage of pre-fetching of the cache.
The point of ropes is to handle very long strings. Scanning a long string, many kilobytes or even megabytes long, is not a simple memory access, though. If you have to copy the entire tail of the flat string to insert a character, but a rope does not, the rope can win in that case. See the motivating statements in the paper: "Common string operations should scale to long strings. There should be no practical bound on the length of strings. Performance should remain acceptable for long strings."
There are various trade-off levels possible in ropes depending on the chunk sizes: a rope can have a small number of large pieces or a large number of small ones.
I suspect ropes are not popular because people don't care that much about the utmost performance in string processing. A performance hiccup in some string processing isn't going to make video skip, a game framerate lag, or a CNC cutter spoil a piece. A lot of text processing is batch processing, or else non-real-time interactive (like in a text editor). Moreover, the cases where ropes are useful (huge strings) are a niche within text processing. People often don't have long strings, so they don't feel the sting of bad performance.
> I remember hearing that same talk in the 1990's: you don't want linked lists: think of the pointer chasing and cache misses, and the storage overhead for the pointers that wastes your precious little cache space (8K in an 80486).
On a 2020 machine with a normal workload, I have yet to see a linked list outperform a contiguous array outside of pathological cases.
> The point of ropes is to handle very long strings. Scanning a long string, many kilobytes or even megabytes long
I mean, my 2016 CPU has 20 megabytes of L3 cache; you can fit a fair amount of text in that, likely more than 99.9% of average text editor usage.
The exact same things about arrays and linked lists were said in 1995 about 1995 machines. They ring true, if you pin down exactly what you mean by your choice of pathological cases.
Even if we completely eliminate all caching effects, so that every memory access is uniform, regardless of past access history, then arrays are still going to be faster in many cases than linked lists. Obviously, in the random access case: fetching the n-th element. Pointers are metadata; a linked list has metadata for every node, and that requires memory bandwidth to suck through the processor. If every list node has a word of pointer for a word of data, then it means scanning through a million word list means we are loading two million words. If the data is aggregated into an array, then scanning it means we are loading only one million words. That's half the cycles. Likely fewer instructions executed in the loop and so on.
Note that arguments like, "but in year 20XX, such a case now fits into my Ln cache" work toward arguing that caching is less relevant. The more cache you have, the less relevant it is. As a cache gets bigger, more and more situations that previously exhibited "poor locality" now have "good locality" of reference.
Your 2016 machine can cache much more of a given linked list than your 1995 machine; it performs relatively better with linked lists than the 1995 machine, even if we account for basic differences of instruction issue rate (pipelining, branch prediction, etc).
The question is, how often does that happen compared to iterating over your data?
Like, every codebase I've worked on had some array sorting at some point in it... but the frequency of needing to sort something, versus having to loop over all the Xs, makes it completely irrelevant.
If the only time you add an element to an array is as a result of the user clicking somewhere, then I'd say it's definitely not worth taking into account if your array isn't 5,000,000 elements (which it likely won't be if you are giving the user the ability to just add something to it).
I would be very surprised if there is any code that could potentially allocate in the hot (lol) path of a rocket. There'll just be static arrays in there.
Exactly. And you additionally need GC or reference counting.
If you write a text editor you certainly should think about something like that, but claiming it should be the default string type is a bit off.
You should probably only use data structures with a lot of indirection if you really can measure it to be faster in the situation.
Maybe there's a hybrid approach. For instance, I can imagine a rope-like data structure where the nodes are sized to a cache line. It might not be optimal from a memory usage standpoint, but there are probably use cases where it could make sense.
The point isn't reading, the point is writing. Inserting a single character into a string is an O(n) operation. Inserting one in a rope is basically O(log(n)), same for deletion.
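A toy sketch of the rope insert (unbalanced and purely illustrative; a real rope rebalances to keep the depth logarithmic):

```python
class Leaf:
    def __init__(self, s):
        self.s, self.length = s, len(s)

class Node:
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.length = left.length + right.length

def insert(rope, i, s):
    """Insert s at index i, creating O(depth) new nodes and copying
    nothing except the contents of the one leaf that gets split."""
    if isinstance(rope, Leaf):
        return Node(Node(Leaf(rope.s[:i]), Leaf(s)), Leaf(rope.s[i:]))
    if i <= rope.left.length:
        return Node(insert(rope.left, i, s), rope.right)
    return Node(rope.left, insert(rope.right, i - rope.left.length, s))
```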
I recall that a couple of years ago most browsers started putting a lot of effort into optimizing their string implementations regarding concatenation, so that makes me wonder: do browsers use ropes? Or is there some other data structure that is better optimized for how people commonly use strings in the browser?
AvalonEdit, the text editor for SharpDevelop, implemented ropes. I think SharpDevelop was folded into MonoDevelop and the text editor there was changed. So I don't think MonoDevelop currently uses ropes?
I think a hybrid approach would work these days... because pointers are 8 bytes long, and you need a few of them just to make a node... perhaps all chunks less than N (200ish?) bytes long could be stored flat in the node?
Also, it would be good to store both the Byte Count, and the Character Count, now that we have UTF-8, etc... and ensure characters don't get split across nodes.
This would give us structures that work well in multi-processor environments, due to the immutability and lock-free nature, handle Unicode in a sane and safe manner that saves a lot of headaches, and does small common operations in a fast manner saving many pointer operations.
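Something like this, perhaps (a hedged sketch; SMALL_MAX and all the names are made up):

```python
SMALL_MAX = 200  # chunks up to this size are stored flat in the leaf

class Leaf:
    def __init__(self, text: str):
        self.data = text.encode("utf-8")  # short chunk stored inline, flat
        assert len(self.data) <= SMALL_MAX
        self.byte_count = len(self.data)
        self.char_count = len(text)       # whole code points; characters
                                          # are never split across leaves

class Concat:
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.byte_count = left.byte_count + right.byte_count
        self.char_count = left.char_count + right.char_count
```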
Premature optimization is the root of all evil. If we were to use these instead of strings by default, we would probably end up referring to them as nooses rather than ropes.
xi-editor does use xi-rope as its main text storage data structure. This part worked well; the problems with xi were much more around the CRDT and the fully async design. I did a RustLab talk on the use of immutable data structures (of which rope is a good example) for incremental UI, and this will be published soonish.
Yeah, I remember thinking the async design seemed a bit over-engineered. It seemed like they were approaching the UI constraints almost like they were developing a game engine, which doesn't seem needed for a text editor. For a great many actions in that setting, it's acceptable to be synchronous between key press and view update.
It depends hugely on the workload. Ropes are ideal for editing sessions, where you have a sequence of small edits on a large string. But if the string is small, just storing it contiguously is simpler and faster.
They're also good for forming larger strings by concatenating smaller ones. But if you can structure your code so that it just appends to a buffer, for example, by threading a mutable string buffer reference through function calls rather than returning strings, that's also faster.
So, as with most such tradeoffs, it depends on what you're trying to do.
CPUs are very fast at copying memory, and contiguous memory allows for efficient use of CPU caches. So for small string sizes, simple algorithms are better.
Another reason might be compacting GC. When you compact the heap, you need to rewrite pointers, so you want fewer objects and fewer pointers overall. When your string is just one array, that's much friendlier than a bunch of substrings with pointers.
If you're dealing with megabyte strings, you need to think about algorithms, and ropes might be the right choice.
I did like the quote "While strings are occasionally the right choice..."
Nearly three decades of utility since, quite the occasion!
Certainly, we could all have been wrong this entire time. So often in programming I find that my working, perfectly useful software, unbeknownst to me, was wrong all along! Imagine my face on the day!
(I am being facetious, it is of course perfectly valid to challenge the status quo)
Representing a short string as a contiguous vector of characters takes much less CPU time. The rope doesn't really hit its stride until you want to do something that a vector is bad at, such as splicing something into the middle of a long string.
The word "tree" already has a very general and overloaded meaning within the context of data structures. Presumably, this "rope" implementation is using a tree structure internally, but that's true of many thousands of things; it's a basic foundation of data structures and computer science. I assume "rope" was chosen as a tongue in cheek alternative to "string".