
> If I have a string x = "this is a very long string..." and do y = x[:10], then it's a whole new string?

Yes. Doing otherwise is pretty risky, as the Java folks discovered: they ultimately reverted the optimisation that had substrings share storage with the original string, and went back to copying the data.

The issue is that while data-sharing substringing is essentially free, it also keeps the original string alive, so if you slurp in megabytes, slice out a few bytes you keep around and "throw away" the original string, that one is still kept alive by the substringing you perform, and you basically have a hard to diagnose memory leak due to completely implicit behaviour.

Languages which perform this sharing explicitly, and especially statically (e.g. Rust), don't have this issue, but it's a risky move when you only have one string type.

Incidentally, Python lets you opt into that behaviour for bytes-like objects using memoryview.
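
A minimal sketch of that opt-in, assuming CPython 3 and a made-up 1 MiB buffer. The helper names head_copy and head_view are hypothetical and exist only for illustration. Slicing bytes copies; slicing a memoryview shares storage and holds a reference to the whole underlying buffer, which is exactly the trade-off described above.

```python
def head_copy(data: bytes, n: int) -> bytes:
    # Slicing a bytes object allocates a new object holding n bytes.
    return data[:n]


def head_view(data: bytes, n: int) -> memoryview:
    # Slicing a memoryview copies nothing; the view points into `data`.
    return memoryview(data)[:n]


big = bytes(1024 * 1024)  # hypothetical 1 MiB buffer of zero bytes
copy = head_copy(big, 10)
view = head_view(big, 10)

print(len(copy), view.nbytes)  # both expose 10 bytes
print(view.obj is big)         # the view references the entire buffer
print(bytes(view) == copy)     # same contents either way
```

As long as view is reachable, big cannot be freed, even though only 10 bytes are in use. Calling bytes(view) makes an independent copy when that matters.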




> it also keeps the original string alive, so if you slurp in megabytes, slice out a few bytes you keep around and "throw away" the original string, that one is still kept alive by the substringing you perform, and you basically have a hard to diagnose memory leak due to completely implicit behaviour.

You can get around that with a smarter garbage collector, though. On every mark-sweep pass (which you need for cycle detection even if you use refcounts for primary cleanup), add up the number of bytes of distinct string objects using the buffer. If it's less than the size of the buffer, you can save memory by transparently mutating each substring to use its own buffer. If it's not, then you actually are saving memory by sharing storage, so you should probably keep doing that.
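
The decision rule in that proposal can be sketched as below. This is only an illustration of the arithmetic, not how any real collector works: should_unshare and its parameters are hypothetical names, and the sketch sums lengths as described without accounting for overlapping substrings.

```python
from typing import Iterable


def should_unshare(buffer_size: int, substring_sizes: Iterable[int]) -> bool:
    """Return True if copying every live substring would use less memory
    than keeping the shared buffer alive (hypothetical heuristic)."""
    return sum(substring_sizes) < buffer_size


# A few bytes kept out of a megabyte: copying them frees the big buffer.
print(should_unshare(1_000_000, [10, 20]))

# Substrings that together cover more than the buffer: sharing is cheaper.
print(should_unshare(100, [60, 60]))
```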


The CPython mark-and-sweep garbage collector does very little on purpose. It basically only gets involved for reference cycles. Anything else is dealt with by reference counting. This avoids long GC pauses.
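
A small sketch of that split, assuming CPython (other implementations may behave differently). Node and both function names are hypothetical. With the cycle collector disabled, an ordinary object is still freed the moment its last reference goes away, while a reference cycle survives until gc.collect() is called explicitly.

```python
import gc
import weakref


class Node:
    """Plain object; instances can be weakly referenced."""


def freed_by_refcount() -> bool:
    was_enabled = gc.isenabled()
    gc.disable()  # keep the cycle collector out of the picture
    try:
        n = Node()
        ref = weakref.ref(n)
        del n  # refcount hits zero, object is freed immediately
        return ref() is None
    finally:
        if was_enabled:
            gc.enable()


def cycle_needs_collector() -> bool:
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        a, b = Node(), Node()
        a.other, b.other = b, a  # a and b now reference each other
        ref = weakref.ref(a)
        del a, b  # refcounts never reach zero because of the cycle
        alive_before = ref() is not None
        gc.collect()  # an explicit pass finds and frees the cycle
        return alive_before and ref() is None
    finally:
        if was_enabled:
            gc.enable()


print(freed_by_refcount())
print(cycle_needs_collector())
```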


True, but that's by no means an inherent characteristic of garbage collectors, or even of garbage collectors operating on objects of a Python implementation in particular.


True, but that is the current philosophy behind the garbage collector in the most-used Python implementation, and it's unlikely to change.


> You can get around that with a smarter garbage collector, though.

That complicates the GC (already a complex beast), as it requires strong type-specific specialisation and more bookkeeping, and adds self-inflicted edge cases. Even the JVM folks didn't bother (though perhaps that is because they have more than one garbage collector).



