If I have a string x = "this is a very long string..." and do y = x[:10], then it's a whole new string? If x is near my memory limits and I do y = x[:-1], will it basically double my memory usage? Is that what you meant by every string being a new string?
> If I have a string x = "this is a very long string..." and do y = x[:10], then it's a whole new string?
Yes. And doing otherwise is pretty risky, as the Java folks discovered: they ultimately reverted the optimisation of having substrings share storage rather than copy their data.
The issue is that while data-sharing substringing is essentially free, it also keeps the original string alive: if you slurp in megabytes, slice out a few bytes you keep around, and "throw away" the original string, that string is still kept alive by the substringing you performed, and you basically have a hard-to-diagnose memory leak due to completely implicit behaviour.
Languages which perform this sharing explicitly — and especially statically (e.g. Rust) — don't have this issue, but it's a risky move when you only have one string type.
Incidentally, Python provides for opting into that behaviour for bytes using memory views.
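A minimal sketch of the difference (names and sizes made up for illustration):

import sys

data = bytes(10 ** 6)           # pretend this is a big blob you slurped in
view = memoryview(data)[:10]    # zero-copy slice: shares data's buffer
copy = data[:10]                # a bytes slice: a fresh, independent object

print(view.obj is data)         # True -> the view keeps the whole buffer alive
print(sys.getsizeof(copy))      # small: the copy is free of the original

Which is exactly the trade-off above: the view is free but pins the full megabyte; bytes(view) would copy the ten bytes out and let the original be collected.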
> it also keeps the original string alive: if you slurp in megabytes, slice out a few bytes you keep around, and "throw away" the original string, that string is still kept alive by the substringing you performed, and you basically have a hard-to-diagnose memory leak due to completely implicit behaviour.
You can get around that with a smarter garbage collector, though. On every mark-sweep pass (which you need for cycle detection even if you use refcounts for primary cleanup), add up the number of bytes of distinct string objects using the buffer. If that total is less than the size of the buffer, you can save memory by transparently mutating each substring to use its own buffer. If it's not, then sharing storage actually is saving memory, so you should probably keep doing it.
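Something like this decision rule, say (a hypothetical sketch of the heuristic, not actual interpreter code; all names invented):

def should_unshare(buffer_size, live_substrings):
    # Total bytes actually referenced by the substrings sharing this buffer.
    referenced = sum(len(s) for s in live_substrings)
    # Copying each substring out wins if together they cover less than
    # the buffer; otherwise the shared buffer is the cheaper representation.
    return referenced < buffer_size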
The CPython mark-and-sweep garbage collector does very little on purpose: it basically only gets involved for reference cycles. Everything else is dealt with by reference counting. This way you prevent long GC pauses.
True, but that's by no means an inherent characteristic of garbage collectors, or even of garbage collectors operating on the objects of a Python implementation in particular.
> You can get around that with a smarter garbage collector, though.
That complicates the GC (already a complex beast), as it requires strong type-specific specialisation, more bookkeeping, and adds self-inflicted edge cases. Even the JVM folks didn't bother (though perhaps that's because they have more than one garbage collector).
Scheme does the right thing here (by convention): mutating procedures end with a bang. (string-upcase str) returns a new string, whereas (string-upcase! str) mutates the string in place.
The details of mutating data in Scheme go beyond that, though. Some procedures are "allowed but not required to mutate their argument". Most (all?) implementations do mutate, but it is still considered bad form to do something like:
(define a (list 1 2 3))
(append! a (list 4))
(display a)
As append! returns a list that is supposed to supersede the binding to a, using a like that "is an error", since a valid implementation of append! may look like this:
(define append! append)
Which would make the earlier code snippet invalid.
IMO, this is a defect in the language: the lack of a "must_use" annotation or similar. If that annotation existed, and the .upper() method was annotated with it, the compiler could warn in that situation.
Notice in the first example, right after CALL_METHOD the return value on the stack is just immediately POP'd away. The parent is saying that when you run `python example.py` CPython should see that the return value is never used and emit a warning. This would only happen because `upper()` was manually marked using the suggested `must_use` annotation.
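You can check this with the dis module (opcode names vary across CPython versions; the CALL_METHOD/POP_TOP pair shown in the comments is from the 3.8-3.10 era, newer versions emit CALL instead, but the POP_TOP that throws the result away is there either way):

import dis

dis.dis("msg.upper()")
#  1        0 LOAD_NAME                0 (msg)
#           2 LOAD_METHOD              1 (upper)
#           4 CALL_METHOD              0
#           6 POP_TOP                    <- return value discarded here
#           8 LOAD_CONST               0 (None)
#          10 RETURN_VALUE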
Ternaries don't discard results that are generated; they are just special short-circuiting operators:
x if y else z
Is effectively syntax sugar for (modulo the well-known caveat when x is falsy):
y and x or z
Nothing is discarded after evaluation; one of the three arms is simply never evaluated, just as one of the two arms of a common short-circuiting Boolean operator often (but not always) is not. That's essentially the opposite of executing (with possible side effects) and then discarding the results.
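This is easy to verify with side-effecting arms (toy functions for illustration):

def left():
    print("left arm evaluated")
    return 1

def right():
    print("right arm evaluated")
    return 2

value = left() if True else right()
# Prints only "left arm evaluated": right() never runs at all, so there
# is no side effect to discard; it simply never happens.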
That bytecode is then interpreted at runtime, so the meaning of s.upper() could change: what something does is not fixed at the time it is parsed.
You can definitely catch most cases at runtime. I’ve done something like this, in a library, to catch a case where people were treating the copy of data as a mutable view.
> Python is interpreted, not compiled, and completely dynamic. You cannot check much statically.
The existence of mypy and other static type checkers for Python disproves that. Given their existence, it should be possible to warn when an expression whose type is neither “Any” nor strictly “None” is used in a position where it is neither passed to another function nor assigned to a variable that is used later. Heck, you could be stricter and only allow strictly “None” in that position.
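As a proof of concept, even a purely syntactic pass can flag the pattern; a toy sketch (not mypy, and without the type awareness described above):

import ast

source = "msg = 'hello'\nmsg.upper()\n"

# Flag bare expression statements that are calls: their result is
# discarded. A real checker would use inferred types to skip calls
# that return None.
for node in ast.walk(ast.parse(source)):
    if isinstance(node, ast.Expr) and isinstance(node.value, ast.Call):
        print(f"line {node.lineno}: call result is discarded")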
> And honestly, I would be rich if I got a dollar every time a student does this:
> msg.upper()
> Instead of:
> msg = msg.upper()
> And then call me to say it doesn't work.
On this, isn't the student's reasoning sensible? E.g. "If msg is a String object that represents my string, then calling .upper() on it will change (mutate) the value, because I'm calling it on itself"?
If the syntax was upper(msg) or to a lesser extent String.upper(msg) then the new-to-programming me would have understood more clearly that msg was not going to change. Have you any insights into what your students are thinking?
That was the original syntax [0], before the string functions became methods. I agree that a method more strongly implies mutation than a function does.
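For reference, Python 2's string module exposed exactly that function form (removed in Python 3):

# Python 2:
import string
msg = string.upper(msg)   # the explicit rebinding makes it obvious
                          # that nothing mutates in place

# Python 3 method form, where forgetting the rebinding is the classic bug:
msg = msg.upper()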
Also, for consistency with the list method `reverse` (which acts in place) and the builtin `reversed` (which returns a new iterator), shouldn’t the method be called `uppered`?!
My favorite example of something similar to this, since you brought it up:
>>> a = [254, 255, 256, 257, 258]
>>> b = [254, 255, 256, 257, 258]
>>> for i in range(5): print(a[i] is b[i])
...
True
True
True
False
False
In CPython, integers in the range [-5, 256] are preallocated by the interpreter, and every occurrence refers to the same fixed object. All other integers are created dynamically and refer to a new object each time they are created.
I mean, people should be using `==` for this. The fact that `is` happens to work for small numbers is an implementation detail that shouldn't be relied upon.
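There's a further wrinkle that makes `is` even less reliable: within a single script or function, CPython's compiler may fold equal constants into one object, so the same comparison can flip depending on where you run it (an illustration of observed behaviour, not a guarantee):

x = 257
y = 257
print(x is y)   # often True in a script or function body, because both
                # literals can be merged into one constant object, while
                # the same two lines typed at an older REPL print False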
Absolutely. But because it does work, people might start using it without knowing it's wrong, then be surprised when it doesn't work. Python has other areas where the everyday English meaning of a word leads to misunderstandings about how it's to be used.
What is the rationale behind this? '==' works all the time, and 'is' only works sometimes. Using 'is' wherever possible requires the user to know some rather arbitrary language details (which objects are singletons and which are not), whereas '==' will always give the correct answer regardless.
Correct. In NumPy the slices are views on the underlying memory. That’s why they’re so fast: there’s no copying involved. Incidentally, that’s also why freeing up the original variable doesn’t release the memory (the slices are still using it).
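For example (assuming NumPy is installed):

import numpy as np

a = np.arange(1_000_000)
b = a[:10]               # a view, not a copy

print(b.base is a)       # True: b shares a's buffer
b[0] = -1
print(a[0])              # -1: writing through the view mutates the original

del a                    # the million-element buffer is NOT freed;
                         # b still keeps it alive through b.base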
So there is no possible confusion.