Depends on your use case. If you use a rope or tree representation (like most ed...

wruza · on Aug 13, 2017

Much this, use case. Statistically, my use case is 95% cat cat cat cat... (hundreds of them) and then seqwrite of some form, never accessing it randomly. I'm generating json, xml, sql, etc to send it over socket. Sometimes it is utility library, sometimes me (or me writing utility library). My string type is always an array, but this type-shift is not obvious to new coders.

"Why StringBuilder" was once very popular on SO, not sure for now. I oversimplified editors example, their structures are not for 'everyday' use, but I think that simple linked list in string would wipe the entire obvious class of cat-in-loop performance issues.

majewsky · on Aug 13, 2017

> Statistically, my use case is 95% cat cat cat cat... (hundreds of them)

If you use a linked-list representation like you proposed, your performance would die a death by 1000 allocations. A simple string class backed by an exponentially growing contiguous memory buffer (i.e., what 99% of string classes do) sounds much better from a performance standpoint. Especially if you are able to know (or estimate) the size of your response, and allocate once upfront.

wruza · on Aug 13, 2017

I'm surely able to write a C routine that uses custom 1.5x-growing buffer and does everything, including subexpr indentation shift etc, in one pass.

The point is that I cannot delegate this work to newbie/lowcost who will cat-cat-cat strings allocated by other newbie thousand times with no idea that it is slow and consuming. Even telling them how to do it properly I'll spend more machine cycles than simply sending it in production. Even then, very source strings are not under my control and I'm no world dictator to tell a bunch of maintainers who don't even know me what they should do for my performance case. It is real world, not my cool homegrown toolkit.

It is not how-to talk, it is why not make already inefficient method slightly more forgiving by appropriate means.

If you don't like a list, take an array, but I like to see tests against it before taking that "death" argument. Maybe tomorrow I'll test all three (multicat, list, array) of immutable strings here.

wruza · on Aug 13, 2017

https://pastebin.com/APHt8C5V (valid for 1 month)

In short, 1M iterations of 1) s=s..x, 2) insert(t,x)+concat(t) 3) l={l,x}+concat(flatten(l)). Time in integer seconds passed.

  > lua5.1.exe x.lua
  ..      1850000 bytes, 10000 iterations         -- n / 100, since it never ends with 1M
  time:   23
  array   185000000 bytes, 1000000 iterations
  time:   5
  list    185000000 bytes, 1000000 iterations
  time:   6                                       -- seems pretty alive

edit: table.concat() seems to eat most of the time here, so lists are effectively instant, as are arrays for 1M. (Full GC included in all test times.)

edit2: on n=10M without actual concat(): array takes 10s, list 19s. Conclusion is, malloc() is not too slow compared to 2x realloc step used in Lua tables.

fnord123 · on Aug 13, 2017

Not to disagree, but you can barely perform random access on a utf8 string. You need to explode it out to utf16 or utf32 which isn't what most languages have built in. Rust and go largely work with utf8 while c and c++ love them byte arrays (not sure I've even seen std::wstring in the wild)

burntsushi · on Aug 13, 2017

This is misleading. UTF-16 doesn't actually provide random access, because codepoints outside the basic multilingual plane are encoded with two UTF-16 code units (4 bytes). UTF-32 guarantees 4 bytes for every codepoint, which is quite wasteful, but even then, random access by codepoint is generally a bad idea because codepoints and graphemes aren't synonymous.

> but you can barely perform random access on a utf8 string.

This really isn't true, or at least, isn't a problem in practice. If you need indices into UTF-8 strings, then you can record them by decoding the string. This is sufficient for most string related algorithms except for the "give me the first N characters" variety, which actually turns out to be a relative good thing since "give me the first N characters" should require applying Unicode's grapheme algorithm, which is never amenable to random access in UTF-8, UTF-16 or UTF-32.

fnord123 · on Aug 13, 2017

> random access by codepoint is generally a bad idea because codepoints and graphemes aren't synonymous.

That's a good point.

>If you need indices into UTF-8 strings, then you can record them by decoding the string.

Sure but the context is that I was responding to "If you use a rope or tree representation (like most editors do these days), random access is O(log n) at best, in many implementations typically O(sqrt n) or even O(n), whereas concatenation string is O(1).".

Decoding the string is O(n); not O(1) (if GP meant "concatenation string" as in a string where the text is concatenated into a single buffer - if GP meant "concatenating a string in a rope or tree is O(1)" then I'm off on a wild tangent).

beagle3 · on Aug 13, 2017

But you only need to decode it once (or otherwise receive those indices without even deciding) whereas random access to a rope/tree is always o(log n) or o(n).

Use case is everything.

fnord123 · on Aug 15, 2017

>Use case is everything.

amen.

beagle3 · on Aug 13, 2017

Well, depends if you want the random access to be by bytes or code points - you might have the byte indices from a previous access.

Also, factor (and iirc python too) keeps it at 8 bits if all codepoints fit in 8 bits, 16 if it fits in 16, and 32 otherwise.