
> Strings as they are today almost everywhere are simply of bad design.

I very much disagree. Eschewing a contiguous memory representation in favor of making other operations cheaper is a monumental trade-off, and whether it's good or bad design depends on your use case.

Others have mentioned that you lose random access (which is incredibly important for some algorithms), but you also lose optimizations that rely on a contiguous memory representation. For example, SIMD-optimized routines like memchr live at the heart of any fast substring search algorithm. If you can't make effective use of SIMD-optimized routines in your substring search, then your implementation will likely be left in the dust. If you proclaim contiguous memory representations to be "bad design," then users of your strings won't care about design when you tell them that only "bad designs" have fast substring search implementations.
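As a rough illustration (in Python, where `bytes.find` is backed by an optimized C routine), a chunked representation has to pay for a flatten before it can hand the search to the fast path:

```python
# Illustrative sketch: a contiguous buffer can hand the whole search
# to one optimized call, while a chunked representation must either
# flatten first (an O(n) copy) or match across chunk boundaries.
haystack = b"the quick brown fox jumps over the lazy dog"
chunks = [b"the quick br", b"own fox jum", b"ps over the lazy dog"]

# Contiguous: one call to the optimized routine.
pos = haystack.find(b"fox")

# Chunked: flatten on demand, paying the copy before searching.
flattened = b"".join(chunks)
assert flattened == haystack
assert flattened.find(b"fox") == pos
print(pos)
```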




Personally I see no difference between the current statement of

  sep.join(list) is idiomatic

and a possible statement of

  s.flatten() before BM/SIMD search is idiomatic

except that join is ubiquitous, while BM/SIMD searching is unheard of among regular developers in modern boring languages. Not to mention that few of us who are aware of it can implement or use it properly.

edit: searching/iterating methods could simply flatten() the string on demand, because one doesn't search in a semi-composed string (that would trash performance again). If a string doesn't have to be searched but is destined for a socket/file, the writing methods could `s.fetch_next(BUFSIZ)` and skip the huge concatenation before BUFSIZ-chunking it again. What's the reason to concatenate prematurely AND by hand, if it could be done automatically and only if needed?
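The flatten-on-demand idea can be sketched as a tiny rope-like class; the names here (`LazyString`, and `fetch_next` following the comment's own notation) are hypothetical:

```python
# Minimal sketch of "flatten only if needed": appends are cheap,
# find() flattens (and caches) on first search, and a writer can
# drain fixed-size buffers without ever concatenating everything.
class LazyString:
    def __init__(self):
        self._chunks = []
        self._flat = None  # cache of the contiguous form

    def append(self, piece: bytes):
        self._chunks.append(piece)
        self._flat = None  # invalidate the cache

    def _flatten(self) -> bytes:
        if self._flat is None:
            self._flat = b"".join(self._chunks)
            self._chunks = [self._flat]
        return self._flat

    def find(self, needle: bytes) -> int:
        # Searching wants a contiguous scan, so flatten on demand.
        return self._flatten().find(needle)

    def fetch_next(self, bufsiz: int) -> bytes:
        # Drain up to bufsiz bytes for a socket/file write without
        # concatenating the whole string first.
        out, taken = [], 0
        while self._chunks and taken < bufsiz:
            head = self._chunks[0]
            room = bufsiz - taken
            if len(head) <= room:
                out.append(self._chunks.pop(0))
                taken += len(head)
            else:
                out.append(head[:room])
                self._chunks[0] = head[room:]
                taken = bufsiz
        self._flat = None
        return b"".join(out)

s = LazyString()
for piece in (b"<", b"b", b">", b"hi", b"</", b"b", b">"):
    s.append(piece)
assert s.find(b"</b>") == 5   # flattens once, then searches
first = s.fetch_next(4)       # drain 4 bytes for a write
print(first)
```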


> BM/SIMD searching is unheard of among regular developers in modern boring languages

Of course not, because it's usually implemented in the standard library of whatever language you're using. But if your strings aren't stored in contiguous memory, then having to flatten every string into contiguous memory before searching also presents a significant performance cost.

> What's the reason to concatenate prematurely AND by hand, if it could be done automatically and only if needed?

I feel like you're missing my point. My point isn't that "contiguous memory is better than non-contiguous memory." My point is that declaring "contiguous memory strings are bad design" is completely unwarranted outside the context of a specific use case. Absent such a use case, the only thing we should be doing is discussing trade-offs, not declaring a monopoly on what constitutes "bad design."


> What's the reason to concatenate prematurely AND by hand, if it could be done automatically and only if needed?

First, it's simplicity (of implementation, explanation, etc. by far).

Second, it depends very significantly on your use case.

If your strings have an average length of 4 bytes (not uncommon, and even much shorter in code that does ''.join(['<', tag, '>', body, '</', tag, '>']) when bodies are short), then you are very likely to use some 4x to 10x the memory you actually need (e.g. once you account for per-node allocations and 64-bit pointers).
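A back-of-the-envelope version of that overhead claim, with illustrative (assumed) sizes for the pointer, length field, and allocator header:

```python
# Rough check of the 4x-10x claim for a rope of tiny fragments.
# All sizes below are illustrative assumptions, not measurements.
payload = 4            # average string fragment, bytes
pointer = 8            # 64-bit pointer to the fragment
length = 8             # stored length field
alloc_header = 16      # typical per-allocation malloc bookkeeping
total = payload + pointer + length + alloc_header
print(total / payload)  # memory used per byte of actual payload
```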

With respect to your socket example, iovecs could be used instead of flattening if you have access to low-level OS primitives, but (a) it would be much slower if the sections are short, and (b) most programming languages and frameworks don't actually expose iovec (or equivalent) I/O calls for either sockets or files.
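For what it's worth, where the primitive is exposed (e.g. Python's `os.writev`, a thin wrapper over POSIX writev(2)), scatter-gather I/O looks like the following; this is a POSIX-only sketch:

```python
# Scatter-gather write: os.writev sends several non-contiguous
# buffers in one syscall, with no flattening in user space.
import os

r, w = os.pipe()
chunks = [b"<", b"b", b">", b"hi", b"</", b"b", b">"]
written = os.writev(w, chunks)  # one syscall for all chunks
os.close(w)
data = os.read(r, 64)
os.close(r)
print(written, data)
```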

Ropes have been available in C++ since forever, going as far back as the original SGI STL in 1995, RogueWave in 1991 IIRC, and countless other independent implementations. For your typical use case, the complexity (in time and space) that they add is apparently a net negative.

The "s.flatten() before XYZ is idiomatic" rule is exactly the kind of leaky abstraction that everyone hates, and it requires an understanding that most programmers lack (guaranteeing that half the flatten()s will be useless and half of those needed will be missing).


I agree completely. Back in the 8086 days, Boyer-Moore string search delivered a significant speedup over a linear memory scan, but nowadays practically nothing beats a SIMD linear scan except in very specialized situations.


Indeed. Many Boyer-Moore implementations build the skip loop on memchr. But in practice, it's often faster to forgo Boyer-Moore completely: heuristically pick the rarest byte in the string you're searching for, and run memchr on that byte. You can even pick the second-rarest byte and use it as a quick way to throw out false positives before doing a full string comparison.

(The aforementioned optimization is one of the tricks ripgrep uses that occasionally makes it faster than GNU grep. GNU grep uses a traditional Boyer-Moore substring search, with memchr always applied to the last byte of the pattern.)
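A sketch of the rarest-byte heuristic in Python — the frequency table here is made up for illustration (ripgrep derives its real table from a corpus), and `bytes.find` on a single byte is the closest Python analogue to memchr:

```python
# Rarest-byte substring search: scan for the needle byte presumed
# rarest in typical text, then verify the full match at the implied
# offset. FREQ is a toy table; unknown bytes default to "rare" (0).
FREQ = {ord(" "): 255, ord("e"): 200, ord("t"): 180}

def rarest_offset(needle: bytes) -> int:
    # Index of the needle byte with the lowest assumed frequency.
    return min(range(len(needle)), key=lambda i: FREQ.get(needle[i], 0))

def find(haystack: bytes, needle: bytes) -> int:
    off = rarest_offset(needle)
    pos = haystack.find(needle[off], off)  # skip loop on the rare byte
    while pos != -1:
        start = pos - off
        if haystack[start:start + len(needle)] == needle:
            return start
        pos = haystack.find(needle[off], pos + 1)
    return -1

print(find(b"the tee on the tree", b"tree"))
```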



