If I have a string x = "this is a very long string..." and do y = x[:10], then it's a whole new string? If x is near my memory limits and I do y = x[:-1], will it basically double my memory usage? Is that what you meant by every string being a new string?
> If I have a string x = "this is a very long string..." and do y = x[:10], then it's a whole new string?
Yes. And doing otherwise is pretty risky, as the Java folks discovered: they ultimately reverted the optimisation of having substrings share storage rather than copy their data.
The issue is that while data-sharing substringing is essentially free, it also keeps the original string alive: if you slurp in megabytes, slice out a few bytes you keep around, and "throw away" the original string, that string is still kept alive by the substringing you performed, and you basically have a hard-to-diagnose memory leak due to completely implicit behaviour.
Languages which perform this sharing explicitly — and especially statically (e.g. Rust) — don't have this issue, but it's a risky move when you only have one string type.
Incidentally, Python provides for opting into that behaviour for bytes using memory views.
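A minimal sketch of the difference (names and sizes made up for illustration):

import sys

data = bytes(10 ** 6)           # pretend this is a big blob you slurped in
view = memoryview(data)[:10]    # zero-copy slice: shares data's buffer
copy = data[:10]                # a bytes slice: a fresh, independent object

print(view.obj is data)         # True -> the view keeps the whole buffer alive
print(sys.getsizeof(copy))      # small: the copy is free of the original

Which is exactly the trade-off above: the view is free but pins the full megabyte; bytes(view) would copy the ten bytes out and let the original be collected.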
> it also keeps the original string alive: if you slurp in megabytes, slice out a few bytes you keep around, and "throw away" the original string, that string is still kept alive by the substringing you performed, and you basically have a hard-to-diagnose memory leak due to completely implicit behaviour.
You can get around that with a smarter garbage collector, though. On every mark-sweep pass (which you need for cycle detection even if you use refcounts for primary cleanup), add up the number of bytes of distinct string objects using the buffer. If that total is less than the size of the buffer, you can save memory by transparently mutating each substring to use its own buffer. If it's not, then sharing storage actually is saving memory, so you should probably keep doing it.
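Something like this decision rule, say (a hypothetical sketch of the heuristic, not actual interpreter code; all names invented):

def should_unshare(buffer_size, live_substrings):
    # Total bytes actually referenced by the substrings sharing this buffer.
    referenced = sum(len(s) for s in live_substrings)
    # Copying each substring out wins if together they cover less than
    # the buffer; otherwise the shared buffer is the cheaper representation.
    return referenced < buffer_size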
The CPython mark-and-sweep garbage collector does very little on purpose: it basically only gets involved for reference cycles. Everything else is dealt with by reference counting. This way you prevent long GC pauses.
True, but that's by no means an inherent characteristic of garbage collectors, or even of garbage collectors operating on the objects of a Python implementation in particular.
> You can get around that with a smarter garbage collector, though.
That complicates the GC (already a complex beast), as it requires strong type-specific specialisation, more bookkeeping, and adds self-inflicted edge cases. Even the JVM folks didn't bother (though perhaps that's because they have more than one garbage collector).
Scheme does the right thing here (by convention): mutating procedures end with a bang. (string-upcase str) returns a new string, whereas (string-upcase! str) mutates the string in place.
The details of mutating data in Scheme go beyond that, though. Some procedures are "allowed but not required to mutate their argument". Most (all?) implementations do mutate, but it is still considered bad form to do something like:
(define a (list 1 2 3))
(append! a (list 4))
(display a)
As append! returns a list that is supposed to supersede the binding to a, using a like that "is an error", since a valid implementation of append! may look like this:
(define append! append)
Which would make the earlier code snippet invalid.
IMO, this is a defect in the language: the lack of a "must_use" annotation or similar. If that annotation existed, and the .upper() method was annotated with it, the compiler could warn in that situation.
Notice in the first example, right after CALL_METHOD the return value on the stack is just immediately POP'd away. The parent is saying that when you run `python example.py` CPython should see that the return value is never used and emit a warning. This would only happen because `upper()` was manually marked using the suggested `must_use` annotation.
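You can check this with the dis module (opcode names vary across CPython versions; the CALL_METHOD/POP_TOP pair shown in the comments is from the 3.8-3.10 era, newer versions emit CALL instead, but the POP_TOP that throws the result away is there either way):

import dis

dis.dis("msg.upper()")
#  1        0 LOAD_NAME                0 (msg)
#           2 LOAD_METHOD              1 (upper)
#           4 CALL_METHOD              0
#           6 POP_TOP                    <- return value discarded here
#           8 LOAD_CONST               0 (None)
#          10 RETURN_VALUE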
Ternaries don't discard results that are generated; they are just special short-circuiting operators:
x if y else z
Is effectively syntax sugar for (modulo the well-known caveat when x is falsy):
y and x or z
Nothing is discarded after evaluation; one of the three arms is simply never evaluated, just as one of the two arms of a common short-circuiting Boolean operator often (but not always) is not. That's essentially the opposite of executing (with possible side effects) and then discarding the results.
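This is easy to verify with side-effecting arms (toy functions for illustration):

def left():
    print("left arm evaluated")
    return 1

def right():
    print("right arm evaluated")
    return 2

value = left() if True else right()
# Prints only "left arm evaluated": right() never runs at all, so there
# is no side effect to discard; it simply never happens.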
That bytecode is then interpreted at runtime, so the meaning of s.upper() could change: what something does is not fixed at the time it is parsed.
You can definitely catch most cases at runtime. I’ve done something like this, in a library, to catch a case where people were treating the copy of data as a mutable view.
> Python is interpreted, not compiled, and completely dynamic. You cannot check much statically.
The existence of mypy and other static type checkers for Python disproves that. Given their existence, it should be possible to warn when an expression whose type is neither “Any” nor strictly “None” is used in a position where it is neither passed to another function nor assigned to a variable that is used later. Heck, you could be stricter and only allow strictly “None” in that position.
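As a proof of concept, even a purely syntactic pass can flag the pattern; a toy sketch (not mypy, and without the type awareness described above):

import ast

source = "msg = 'hello'\nmsg.upper()\n"

# Flag bare expression statements that are calls: their result is
# discarded. A real checker would use inferred types to skip calls
# that return None.
for node in ast.walk(ast.parse(source)):
    if isinstance(node, ast.Expr) and isinstance(node.value, ast.Call):
        print(f"line {node.lineno}: call result is discarded")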
> And honestly, I would be rich if I got a dollar every time a student does this:
> msg.upper()
> Instead of:
> msg = msg.upper()
> And then call me to say it doesn't work.
On this, isn't the student's reasoning sensible? E.g. "If msg is a String object that represents my string, then calling .upper() on it will change (mutate) the value, because I'm calling it on itself"?
If the syntax was upper(msg) or to a lesser extent String.upper(msg) then the new-to-programming me would have understood more clearly that msg was not going to change. Have you any insights into what your students are thinking?
That was the original syntax [0], before the string functions became methods. I agree that a method more strongly implies mutation than a function does.
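For reference, Python 2's string module exposed exactly that function form (removed in Python 3):

# Python 2:
import string
msg = string.upper(msg)   # the explicit rebinding makes it obvious
                          # that nothing mutates in place

# Python 3 method form, where forgetting the rebinding is the classic bug:
msg = msg.upper()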
Also, for consistency with the list method `reverse` (which acts in place) and the builtin `reversed` (which returns a new iterator), shouldn’t the method be called `uppered`?!
My favorite example of something similar to this, since you brought it up:
>>> a = [254, 255, 256, 257, 258]
>>> b = [254, 255, 256, 257, 258]
>>> for i in range(5): print(a[i] is b[i])
...
True
True
True
False
False
In CPython, integers in the range [-5, 256] are preallocated by the interpreter, and every occurrence refers to the same fixed object. All other integers are created dynamically and refer to a new object each time they are created.
I mean, people should be using `==` for this. The fact that `is` happens to work for small numbers is an implementation detail that shouldn't be relied upon.
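There's a further wrinkle that makes `is` even less reliable: within a single script or function, CPython's compiler may fold equal constants into one object, so the same comparison can flip depending on where you run it (an illustration of observed behaviour, not a guarantee):

x = 257
y = 257
print(x is y)   # often True in a script or function body, because both
                # literals can be merged into one constant object, while
                # the same two lines typed at an older REPL print False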
Absolutely. But because it does work, people might start using it without knowing it's wrong, then be surprised when it doesn't work. Python has other areas where the everyday English meaning of a word leads to misunderstandings about how it's to be used.
What is the rationale behind this? '==' works all the time, and 'is' only works sometimes. Using 'is' wherever possible requires the user to know some rather arbitrary language details (which objects are singletons and which are not), whereas '==' will always give the correct answer regardless.
Correct. In NumPy the slices are views on the underlying memory. That’s why they’re so fast: there’s no copying involved. Incidentally, that’s also why freeing up the original variable doesn’t release the memory (the slices are still using it).
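For example (assuming NumPy is installed):

import numpy as np

a = np.arange(1_000_000)
b = a[:10]               # a view, not a copy

print(b.base is a)       # True: b shares a's buffer
b[0] = -1
print(a[0])              # -1: writing through the view mutates the original

del a                    # the million-element buffer is NOT freed;
                         # b still keeps it alive through b.base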
So there is no possible confusion.