
Strings are immutable in Python, and all string operations return new strings, including all string methods.

So there is no possible confusion.




If I have a string x = "this is a very long string..." and do y = x[:10], then it's a whole new string? If x is near my memory limits, and I do y = x[:-1] will it basically double my memory usage? Is that what you meant by every string is a new string?


> If I have a string x = "this is a very long string..." and do y = x[:10], then it's a whole new string?

Yes. And doing otherwise is pretty risky as the Java folks discovered, ultimately deciding to revert the optimisation of substring sharing storage rather than copying its data.
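A quick way to see the copy for yourself. This is only a sketch: the sizes come from `sys.getsizeof`, which is CPython-specific, and the exact numbers vary by version and platform.

```python
import sys

x = "this is a very long string " * 1000
y = x[:10]    # a new str object with its own copy of the first 10 characters
z = x[:-1]    # another new str, nearly as large as x itself

print(y is x)                       # a partial slice is a distinct object
print(sys.getsizeof(x), sys.getsizeof(z))
```

While both `x` and `z` are alive you are paying for roughly two copies of the data.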

The issue is that while data-sharing substringing is essentially free, it also keeps the original string alive, so if you slurp in megabytes, slice out a few bytes you keep around and "throw away" the original string, that one is still kept alive by the substringing you perform, and you basically have a hard to diagnose memory leak due to completely implicit behaviour.

Languages which perform this sharing explicitly — and especially statically (e.g. Rust) — don't have this issue, but it's a risky move when you only have one string type.

Incidentally, Python provides for opting into that behaviour for bytes using memory views.
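A minimal sketch of that opt-in, using only the builtin `memoryview`. The variable names are made up for illustration.

```python
data = bytes(range(256)) * 4096    # about 1 MiB of bytes
view = memoryview(data)
head = view[:10]                   # a sub-view: no bytes are copied

# The sub-view references the original object, so `data` stays alive
# for as long as `head` does.
print(head.obj is data)

# An explicit copy breaks that link and only costs 10 bytes.
small = bytes(head)
```

So the sharing is there if you ask for it, and so is the "keeps the big buffer alive" behaviour, but at least it is visible in the code.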


> it also keeps the original string alive, so if you slurp in megabytes, slice out a few bytes you keep around and "throw away" the original string, that one is still kept alive by the substringing you perform, and you basically have a hard to diagnose memory leak due to completely implicit behaviour.

You can get around that with a smarter garbage collector, though. On every mark-sweep pass (which you need for cycle detection even if you use refcounts for primary cleanup), add up the number of bytes of distinct string objects using the buffer. If it's less than the size of the buffer, you can save memory by transparently mutating each substring to use its own buffer. If it's not, then you actually are saving memory by sharing storage, so you should probably keep doing that.
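As a toy model of that decision rule (not anything CPython or the JVM actually does; `should_unshare` and its arguments are hypothetical names):

```python
def should_unshare(buffer_size, substring_sizes):
    """Return True when giving every live substring its own buffer
    would use less memory than keeping the shared buffer alive."""
    return sum(substring_sizes) < buffer_size

# A few small slices of a huge buffer: copy them out and free the buffer.
print(should_unshare(1_000_000, [10, 20]))

# Overlapping slices that together exceed the buffer: keep sharing.
print(should_unshare(100, [60, 60]))
```

The hard part is not this comparison but the bookkeeping needed to know which strings point into which buffer.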


The cpython mark-and-sweep garbage collector does very little on purpose. It basically only gets involved for reference cycles. Anything else is dealt with by reference counting. This way you prevent long GC pauses.


True, but that's by no means an inherent characteristic of garbage collectors, or even garbage collectors operating on objects of a Python implementation in particular.


True, but that is the current philosophy behind the garbage collector in the most used Python implementation and it's unlikely to change.


> You can get around that with a smarter garbage collector, though.

That complexifies the GC (already a complex beast) as it requires strong type-specific specialisation, more bookkeeping, and adds self-inflicted edge cases. Even the JVM folks didn't bother (though mayhaps they did not because they have more than one garbage collector).


If x is near your memory limits, and you do y = x[:-1], you will get a MemoryError :)

For those situations, bytes() + memoryview() or bytearray() can be used, but then you are on your own.
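For example, a `bytearray` can drop its last byte in place, which is the mutable equivalent of `x[:-1]`. A rough sketch, assuming the data can live as bytes rather than `str`:

```python
buf = bytearray(b"this is a very long string...")
before = id(buf)

del buf[-1:]                 # shrink in place; no second full-size object

print(buf)
print(id(buf) == before)     # still the same object
```

The "on your own" part is real: turning it back into text with `buf.decode()` makes a full copy again.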


Huh, I've had a wrong understanding of that for over a decade! TIL, thanks.


Hey!

https://xkcd.com/1053/

And honestly, I would be rich if I got a dollar every time a student does this:

    msg.upper()
Instead of:

    msg = msg.upper()
And then call me to say it doesn't work.
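Spelled out as something you can paste into a REPL (the `unchanged` name is only there to capture the intermediate state):

```python
msg = "hello"

msg.upper()          # builds "HELLO", then the result is thrown away
unchanged = msg      # msg still refers to the original string

msg = msg.upper()    # rebind the name to the new string
print(unchanged, msg)
```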


Scheme does the right thing here (by convention), that mutating procedures end with a bang: (string-upcase str) returns a new string, whereas (string-upcase! str) mutates the string in place.

The details for mutation of data in scheme go beyond that, though. Sometimes procedures are "allowed but not required to mutate their argument". Most (all?) implementations do mutate, but it is still considered bad form to do something like:

    (define a (list 1 2 3))
    (append! a (list 4))
    (display a)
As append! returns a list that is supposed to supersede the binding to a. Using a like that "is an error", as a valid implementation of append! may look like this

    (define append! append)
Which would make the earlier code snippet invalid.


IMO, this is a defect in the language: the lack of a "must_use" annotation or similar. If that annotation existed, and the .upper() method was annotated with it, the compiler could warn in that situation.


But you are free to do

  if title == user_input.upper():
That is, you convert a string to upper without binding the result to a name. You just use it in-place and discard the result, which is fine.

By compiler, do you mean mypy or linters?


That's still "using" the resulting value for a comparison. CPython isn't an optimizing compiler, or it would completely remove the call to upper().

    >>> def up(v):
    ...     v.upper()
    ...
    >>> dis.dis(up)
    2           0 LOAD_FAST                0 (v)
                2 LOAD_METHOD              0 (upper)
                4 CALL_METHOD              0
                6 POP_TOP
                8 LOAD_CONST               0 (None)
                10 RETURN_VALUE

    >>> def up(v):
    ...     if v.upper() == "HelloWorld":
    ...        return True
    ...
    >>> dis.dis(up)
    2           0 LOAD_FAST                0 (v)
                2 LOAD_METHOD              0 (upper)
                4 CALL_METHOD              0
                6 LOAD_CONST               1 ('HelloWorld')
                8 COMPARE_OP               2 (==)
                10 POP_JUMP_IF_FALSE       16

    3          12 LOAD_CONST               2 (True)
                14 RETURN_VALUE
            >>   16 LOAD_CONST               0 (None)
                18 RETURN_VALUE
Notice in the first example, right after CALL_METHOD the return value on the stack is just immediately POP'd away. The parent is saying that when you run `python example.py` CPython should see that the return value is never used and emit a warning. This would only happen because `upper()` was manually marked using the suggested `must_use` annotation.


He meant that writing a line of code with only contents:

    msg.upper()
should trigger a warning as this clearly doesn't do anything.


Python is interpreted, not compiled, and completely dynamic. You cannot check much statically.

In fact, any program can replace anything on the fly, and swap your string for something similar but mutable.

It's the trade off you make when choosing it.


I agree, there’s no way to issue a warning about a bare `s.upper()` at compile time. I wonder if it would be possible at runtime?


Don't think so, Python doesn't really care if you dispose of the results of an expression. Think about the problems you'd have with ternaries.


Ternaries don't discard results that are generated, they are just special short-circuiting operators;

  x if y else z
Is effectively syntax sugar for the following (as long as x is truthy):

  y and x or z
Nothing is discarded after evaluation, one of three arms is never evaluated, just as one of two arms of a common short-circuiting Boolean operator often (but not always) is not. That's essentially the opposite of executing and producing possible side effects and then discarding the results.
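A small sketch that makes the "never evaluated" part visible. `arm` is a hypothetical helper that records which arms actually ran. It also shows the one place where the `and`/`or` spelling and the real conditional expression disagree, namely a falsy first arm.

```python
calls = []

def arm(label, value):
    """Record that this arm was evaluated, then return its value."""
    calls.append(label)
    return value

result = arm("x", 1) if True else arm("z", 2)
print(result, calls)               # only the "x" arm ran

# With a falsy first arm the two spellings differ:
a = "" if True else "fallback"     # conditional expression picks ""
b = True and "" or "fallback"      # and/or falls through to "fallback"
print(repr(a), repr(b))
```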


What's the problem with ternaries?


One of the two possible sub-expressions isn't used.


It's also not evaluated. There is no discarding, so there would be no problem.


What is this “compile time” you speak of?


When the Python source code is compiled into bytecode.


That byte code is then interpreted at runtime, so the meaning of s.upper() could change. What something does, when it’s parsed, is not fixed.

You can definitely catch most cases at runtime. I’ve done something like this, in a library, to catch a case where people were treating the copy of data as a mutable view.

    interface[address][slice] = new_values # fancy noop
Where a read, modify, write was required:

    byte_values = interface[address]
    byte_values[slice] = new_values
    interface[address] = byte_values
It would log/raise a useful error if there was no assignment/passing of the return value.


> Python is interpretted, not compiled, and completly dynamic. You cannot check much statically.

The existence of mypy and other static type checkers for Python disproves that. Given that they exist, it should be possible to warn when an expression producing a type other than “Any” or strictly “None” is used in a position where its value is neither passed to another function nor assigned to a variable that is used later. Heck, you could be stricter and only allow strictly “None” in that position.
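To give a feel for how little machinery a crude version needs, here is a sketch using the standard `ast` module. It is purely name-based and has no type information, so it is not how mypy works. `MUST_USE` and `find_discarded_calls` are hypothetical names standing in for a real `must_use` annotation.

```python
import ast

# Hypothetical stand-in for a must_use annotation: method names whose
# return value should not be silently dropped.
MUST_USE = {"upper", "lower", "strip", "replace"}

def find_discarded_calls(source):
    """Return (line, method) pairs for bare statements like `msg.upper()`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Expr)                       # expression statement
                and isinstance(node.value, ast.Call)         # that is a call
                and isinstance(node.value.func, ast.Attribute)
                and node.value.func.attr in MUST_USE):
            hits.append((node.lineno, node.value.func.attr))
    return sorted(hits)

sample = "msg = 'hi'\nmsg.upper()\nif msg.upper() == 'HI':\n    print(msg)\n"
print(find_discarded_calls(sample))
```

Only the bare call on line 2 is reported; the call inside the `if` is used in a comparison. Without types this will also flag any unrelated class that happens to have a mutating method called `upper`, which is exactly the dynamic-language objection raised above.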


so what are these annoying pyc files about?


> And honestly, I would be rich if I got a dollar every time a student does this:

> msg.upper()

> Instead of:

> msg = msg.upper()

> And then call me to say it doesn't work.

On this, isn't the student's reasoning sensible? E.g. "If msg is a String object that represents my string, then calling .upper() on it will change (mutate) the value, because I'm calling it on itself"?

If the syntax was upper(msg) or to a lesser extent String.upper(msg) then the new-to-programming me would have understood more clearly that msg was not going to change. Have you any insights into what your students are thinking?


> String.upper(msg)

That was the original syntax [0], before the string functions became methods. I agree that a method more strongly implies mutation than a function does.

Also, for consistency with list methods like `reverse` (which acts in place) and `reversed` (which makes a copy), shouldn’t the method be called `uppered`?!

[0] https://docs.python.org/2/library/string.html#deprecated-str...
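For comparison, here are the three spellings side by side. Note that `reversed` returns an iterator rather than a finished copy, so the sketch wraps it in `list`.

```python
items = [3, 1, 2]
returned = items.reverse()           # mutates in place, returns None
print(items, returned)

fresh = list(reversed(items))        # leaves items alone, builds a new list
print(items, fresh)

msg = "hello"
print(str.upper(msg), msg)           # function-style call; msg is unchanged
```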


'uppercased'


Ah, of course.

Also, it looks like that’s the name that Swift uses.


A student doesn't know anything about mutability, and since Python signatures are not explicit, there is no way to know they have to do that.

It's just something to be told. A design decision, like there are thousands to learn in IT, that you just can't guess.


Yes. Although you can use `islice` from itertools to get around this problem, when it is a problem.
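Roughly like this. `islice` hands back a lazy iterator over the characters instead of building a second string up front:

```python
from itertools import islice

x = "this is a very long string..."

first_ten = islice(x, 10)        # nothing is copied yet
joined = "".join(first_ten)      # joining does build a (small) new string
print(repr(joined))
```

It only helps if you can consume the characters one at a time, and skipping to an offset deep in the string still means iterating past everything before it.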


Slicing in Python always creates a new object. You can test it with a list of integers.
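A sketch of that test for lists. The copy is shallow, so the outer list is new but nested objects are shared:

```python
a = [1, 2, 3, 4, 5]
b = a[:]               # full slice of a list: a new list object
b[0] = 99
print(a, b, a is b)    # a is untouched

nested = [[1], [2]]
shallow = nested[:]
print(shallow is nested, shallow[0] is nested[0])   # new outer list, shared inner lists
```

As far as I know, CPython returns the very same object for a full slice of an immutable `str` or `tuple`, so the "always" is safest read as applying to mutable builtins.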


My favorite example of something similar to this, since you brought it up:

  >>> a = [254, 255, 256, 257, 258]
  >>> b = [254, 255, 256, 257, 258]
  >>> for i in range(5): print(a[i] is b[i])
  ...
  True
  True
  True
  False
  False
In Python, integers in the range [-5, 256] are statically constructed in the interpreter and refer to fixed instances of objects. All other integers are created dynamically and refer to a new object each time they are created.


Leaky abstractions at their finest.


I mean, people should be using `==` for this. The fact that `is` happens to work for small numbers is an implementation detail that shouldn't be relied upon.


Absolutely. But because it does work they might start using it without knowing it's wrong, then be surprised when it doesn't work. Python has other areas where the common English definition of a word leads to misunderstandings about how they're to be used.


Though "when" the object is created isn't always so straightforward:

  >>> x = 257
  >>> y = 257
  >>> x is y
  False
  >>> def f():
  ...     x = 257
  ...     y = 257
  ...     return x is y
  ...
  >>> f()
  True
The lesson being that `is` is essentially meaningless for immutable objects, and to always use `==`.


> `is` is essentially meaningless for immutable objects

OTOH it’s recommended to use `is` rather than `==` when comparing to “singletons like `None`”.

https://www.python.org/dev/peps/pep-0008/#programming-recomm...


What is the rationale behind this? '==' works all the time, and 'is' only works sometimes. Using 'is' wherever possible requires the user to know some rather arbitrary language details (which objects are singletons and which are not), whereas '==' will always give the correct answer regardless.


Classes can overload `==`:

    class C:
        def __eq__(self, other):
            return True

    print(C() == None)  # True
    print(C() is None)  # False


> Slicing in Python always create a new object.

It always creates a new object but it doesn't necessarily copy the contents (even shallowly).

For instance slicing a `memoryview` creates a subview which shares storage with its parent.
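A short sketch over a `bytearray`, where the sharing is observable because a write through the subview shows up in the parent:

```python
buf = bytearray(b"hello world")
sub = memoryview(buf)[6:]     # new memoryview object, same underlying storage

sub[0] = ord("W")             # write through the subview
print(buf)                    # the parent buffer sees the change
print(sub.obj is buf)
```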


It'll always create a new object but my understanding is that at least in numpy the new and old object will share memory. Am I wrong there too?


Correct. In Numpy the slices are views on the underlying memory. That’s why they’re so fast, there’s no copying involved. Incidentally that’s also why freeing up the original variable doesn’t release the memory (the slices are still using it).


CPython is pretty terrible. Numpy has the concept of views, CPython doesn’t do anything sophisticated.


> Slicing in Python always create a new object.

Since you said "always", I'll be "that guy"...

For the builtin types, yeah. But not in general; user code can do whatever it wants:

    class MyType:
      def __getitem__(self, *args):
        return self
    my_obj = MyType()
    my_other_obj = my_obj[3:7]
    assert my_other_obj is my_obj


Yes.



