> it also keeps the original string alive, so if you slurp in megabytes, slice out a few bytes you keep around, and "throw away" the original string, that one is still kept alive by the substring you kept, and you basically have a hard-to-diagnose memory leak due to completely implicit behaviour.
You can get around that with a smarter garbage collector, though. On every mark-sweep pass (which you need for cycle detection even if you use refcounts for primary cleanup), add up the total bytes referenced by the distinct string objects sharing a buffer. If that total is less than the size of the buffer, you save memory by transparently mutating each substring to use its own buffer. If it's not, then sharing storage is actually the cheaper representation, so you should probably keep doing that.
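A minimal sketch of that heuristic in Python, assuming a hypothetical string representation where each string is a (buffer, start, length) view into a shared buffer; `StrView`, `detach`, and `compact_pass` are made-up names, and none of this reflects any real runtime's layout:

```python
from collections import defaultdict


class StrView:
    """Hypothetical string object: a (start, length) view into a shared buffer."""

    def __init__(self, buffer, start, length):
        self.buffer = buffer  # shared bytes object
        self.start = start
        self.length = length

    def detach(self):
        """Copy this view's bytes into a private buffer of exactly its size."""
        self.buffer = self.buffer[self.start:self.start + self.length]
        self.start = 0


def compact_pass(live_views):
    """The extra step bolted onto a mark-sweep pass: for each shared
    buffer, compare the bytes actually referenced against its size."""
    by_buffer = defaultdict(list)
    for view in live_views:
        by_buffer[id(view.buffer)].append(view)

    for views in by_buffer.values():
        buffer = views[0].buffer
        referenced = sum(v.length for v in views)
        if referenced < len(buffer):
            # The live views cover less than the buffer holds:
            # copying them out and dropping the buffer saves memory.
            for v in views:
                v.detach()
        # Otherwise sharing is already the cheaper representation: keep it.


big = b"x" * 1_000_000        # megabytes slurped in
small = StrView(big, 10, 16)  # a few bytes sliced out and kept around
compact_pass([small])
assert len(small.buffer) == 16  # `big` is no longer pinned by `small`
```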
The CPython mark-and-sweep garbage collector does very little on purpose. It basically only gets involved to break reference cycles; everything else is reclaimed by reference counting. This way you prevent long GC pauses.
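You can watch the split directly with the standard `gc` and `weakref` modules: an acyclic object dies the instant its refcount hits zero, while a cycle survives until the collector runs.

```python
import gc
import weakref

gc.disable()  # make the demo deterministic: no automatic collection runs


class Node:
    pass


# Acyclic object: reference counting alone reclaims it the moment
# the last reference disappears; no mark-sweep pass is involved.
a = Node()
probe = weakref.ref(a)
del a
assert probe() is None

# Reference cycle: the refcount never drops to zero, so only the
# mark-and-sweep collector (the cycle detector) can reclaim it.
b = Node()
b.self_ref = b
probe = weakref.ref(b)
del b
assert probe() is not None  # still alive, kept by the cycle
gc.collect()                # the cycle detector runs
assert probe() is None      # now it's gone

gc.enable()
```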
True, but that's by no means an inherent characteristic of garbage collectors, or even of garbage collectors for Python implementations in particular.
> You can get around that with a smarter garbage collector, though.
That complicates the GC (already a complex beast): it requires strong type-specific specialisation, more bookkeeping, and adds self-inflicted edge cases. Even the JVM folks didn't bother (though perhaps that's because they have more than one garbage collector).