This is a pretty good list of gotchas, but it's important when writing something targeted at beginners to be as precise and clear as possible. Nearly every section here either uses terminology poorly, is slightly incorrect, or has difficult examples.
Python supports optional function arguments and allows default values to be
specified for any optional argument.
No, specifying a default is what causes an argument to be optional.
it can lead to some confusion when specifying an expression as the default value
for an optional function argument.
Anything you specify as a default value is an expression. The problem is when the default is mutable.
the bar argument is initialized to its default (i.e., an empty list)
only the first time that foo() is called
No, rather it's when the function is defined.
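A quick demo of the def-time evaluation, using the article's own foo/bar names:

>>> def foo(bar=[]):        # the [] is evaluated once, when the def runs
...     bar.append("baz")
...     return bar
...
>>> foo()
['baz']
>>> foo()                   # the same list object is reused
['baz', 'baz']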
class variables are internally handled as dictionaries
As dictionary keys, and that's still only roughly correct.
In "Common Mistake #5", he uses both a lambda and array index based looping, neither of which are particularly Pythonic. A better example of where this is a problem in otherwise Pythonic code would be good.
In "Common Mistake #6" he uses a lambda in a list comprehension -- for an article of mistakes mostly made by Python beginners, this is going to make it tough to follow the example.
In "Common Mistake #7", he describes "recursive imports" where he means "circular imports".
In "Common Mistake #8" he refers repeatedly to "stdlib" where he means the Python Standard Library. Someone is going to read that and try to "import stdlib".
Hey, thanks for the great feedback! We agreed with (almost :-) ) all of your comments and have made corresponding mods/corrections to the post. Thanks again!
Good changes. One more issue (I believe recently introduced): LEGB ends with "Built-in", not with "Module". It's also good to note in the blog post that it's been updated.
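For reference, a minimal sketch of that lookup order (Local, Enclosing, Global, Built-in):

x = "global"

def outer():
    x = "enclosing"
    def inner():
        print x      # no local x, so the Enclosing x is found
        print len    # not in L, E, or G, so the Built-in is found
    inner()

outer()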
Did they add in "(Note: This article is intended for a more advanced audience than Common Mistakes of Python Programmers, which is geared more toward those who are newer to the language.)" later?
Because it states it right near the top of the article.
1, 2, 4, 5, and 8 all seem to me like mistakes only / primarily made by beginners. I can't see an article aimed at primarily intermediate Python users spending time on them.
3 is a really easy mistake to make for anyone (which is why the syntax was changed).
6, 7, 9 and 10 are more obscure, and where I really appreciate this article -- and they definitely can be issues for more experienced Python devs.
Slightly off topic, but does anyone know of a resource that has 'most common mistakes' for different languages all in one place? It's certainly possible to google for blog posts and stack overflow questions to assemble such a list, but it would be handy to have them all in one place.
My use case is when interviewing candidates I often ask them to rate themselves on a scale of 1-5 in the languages they know, and then ask them increasingly 'tricky' questions in each language to get a feel for how their "personal" scale aligns to their real knowledge. This works fine if we have an overlap of several languages, but in the case where I know nothing or very little of one of the languages they know I lose that data point.
I find it valuable to know what a "I am a 1 at X" vs "I am a 3 at X" vs "I am a 5 at X" means to them, since I've found little correlation between how harshly someone rates themselves and their true ability. Sometimes self-rated 5s are really 5s by my book, sometimes self-rated 3s are really 5s by my book, and sometimes self-rated 5s are really 2s by my book. So I want to know how "my scale" translates to "their scale". If it were more formalized I'd go as far as to get a "confidence quotient" for a person, as self-critical and self-confident people can be fantastic engineers or horrible engineers.
Does anyone else do this process when interviewing?
While such a resource would make your job easier, it would make the interviewee's job easier still. They'd just have to memorize all the points in the reference.
I have thought about this, but in the context of understanding the root causes of these problems:
1) is it a language design problem,
2) a misunderstanding or misconception on the part of the programmer,
3) due to or related to bad coding practices / code smells (e.g. method body too long),
4) high complexity code (could be related to (3), could reflect the domain),
5) reduced programmer cognitive capacity (distraction, stress, sleep deficit, lack of motivation, etc.).
These would be interesting research areas for instrumenting IDEs / other ecosystem tools to collect some of this data. (I'm sure there is already some work in some of these areas and would appreciate names or links to high-quality reviews.)
This list is an excellent summary. If tasked with a #11 I'd probably add the slightly more obscure, but still super painful (when you do run into it) implicit string concatenation:
>>> l = ["a",
... "b",
... "c"
... "d"]
>>> l
['a', 'b', 'cd']
You could easily use a + operator then.
I find the behavior surprising. I would expect a syntax error. You get a syntax error if you write two integers next to each other (separated by a space) or two of any other kind of literal, but somehow "a" "b" gets converted to "ab".
If I'd discovered it myself I would be tempted to file a bug report. It goes against the Python mantra that explicit is better than implicit.
Yeah, when I discovered this little "feature" I had a read through that. The folks that use this for blocks of multi-line text are very defensive about the practice. I do understand not wanting to break compatibility though, especially since finding instances of this is hard (which is another reason it shouldn't exist in the first place!). Oh well :)
With triple quotes you get a string with newlines and indentation in it. If you don't indent the following lines, the code looks ugly, and either way you can't do anything about the newlines.
There are other languages which support multiline strings (heredocs) with indents stripped by means of syntax, like YAML (with |), Racket (which doesn't do dedenting, but being the language it is, it's very easy to add), and many shells (with <<-). Python doesn't have this feature, and parse-time string literal concatenation serves this purpose.
Of course, you can do something like:
foo = """bar
indented at first
and after newline
"""
textwrap.dedent(foo)
(or use list literals with str.join, or use a regex, or many, many other thing), but you can do this in all languages. Languages with syntactic sugar for this make writing slightly-longer-but-not-too-long strings much easier and cheaper (only done once during parsing, no need for imports, etc.), and Python makes up for not having explicit way of doing this with implicit parse-time string literals concatenation.
You don't need the continuation, but that's where you're most likely to make the mistake. It'll catch you when you leave off the comma from the last element of the list, then go back later and add another element to the end.
#6 is really confusing. Whenever I encounter something like this, my first reaction is that such obscure corners of a language should be avoided wherever possible, and more verbose/clear code used instead.
Programming languages are meant to be read as well as written, and someone relatively new to Python (and many who have used the language for a long time) is certain to get confused about the difference between:
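Presumably the pair in question is the article's late-binding lambda versus the default-argument fix, something like:

multipliers = [lambda x: i * x for i in range(5)]        # all see i == 4
multipliers = [lambda x, i=i: i * x for i in range(5)]   # each captures its own i

The second form works precisely because of mistake #1: the i=i default is evaluated once, at definition time.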
Agreed 100%, these types of constructions should be avoided in the first place in favor of more "readable" ones, but this happens in a fair amount of code that I've seen (and keep seeing).
Some of it seems to come from people cargo-culting their knowledge of anonymous and first class functions, so they end up believing that the only way to pass a function around is to construct it anonymously.
> "Python is an interpreted, object-oriented, high-level programming language with dynamic semantics."
I have an issue with that statement. No languages are inherently "compiled" or "interpreted", that's a property of the implementation.
If we are talking about CPython here, Python code is compiled to bytecode which is then interpreted. Not unlike Java - with the difference that the main implementation has a JIT and afaik, Python's does not.
But that's CPython. What about PyPy? It has a JIT.
> No languages are inherently "compiled" or "interpreted", that's a property of the implementation.
A language and its implementation are usually designed at the same time. Compiled or interpreted will affect design choices that go into the language. While additional implementations may follow, it can be hard/impossible to design a compiler (to machine code, not bytecode) for a language that was designed to be interpreted without dropping features (e.g. eval).
It may be more correct to say 'Python was designed to be interpreted' than 'Python is interpreted'
Not really - Javascript was designed to be interpreted, and yet V8/SpiderMonkey/Nitro all JIT-compile it down to machine code, sometimes very effectively.
Then Java and .NET are considered "interpreted"? And Android under Dalvik is "interpreted", but under ART is "compiled" (using Java as the language, which I'd always thought of as compiled, yet apparently is interpreted under your definition)? What if you embed Clang & LLVM in your application to run C++?
I think this just illustrates the fuzziness of these definitions. A compiler is just a piece of code; you can embed it into another piece of code and run it whenever necessary. Maybe in the world of shrink-wrapped desktop software there was a sharp distinction between AOT compiled languages and interpreted ones, but we haven't lived in that world for a couple decades now.
I feel that what most people actually mean when they say "compiled" vs. "interpreted" is whether or not the language specification has additional static checking beyond what is required by parsing. Essentially it is whether the language defers all errors to run-time or attempts to detect classes of them at compile-time. A language like JavaScript accepts as a program any string that parses, while a language like Java rejects many strings based on additional checks such as type rules. You can add static type-checking to JavaScript or Python, but it isn't part of the language spec. You can run C++ or Java with run-time type-checking, but it doesn't conform to the spec. In this way you could say that the language is fundamentally compiled or interpreted.
Of course, this doesn't address issues of incremental evaluation which often requires additional semantics for compiled languages.
Let's just say that the lines got a lot blurrier over the last couple of decades. The difference is generally if there's an explicit compilation step that is not hidden from developers. If you "run the source file", then it's interpreted. If you "run something generated from the source file(s)", then it's compiled. Stronger type checking and compiled mode is likely since the cost of compilation is higher and more resource intensive, so it makes sense to push it into an offline step.
> You can run C++ or Java with run-time type-checking, but it doesn't conform to the spec.
I'm not sure what you mean by "conforming with the spec". You can opt out of type checking by using only `object` or `void*`. But golang has a mode where you run a go file directly and a mode where you generate a binary. There are no traditional interpreted languages anymore (or at least only very few). And what we call "compiled" languages today are not actually compiled ones. The way Java runs is closer to how JavaScript runs than to how C runs. It's only about the interface they offer to developers.
That's technically true, but I still think it's reasonable to casually refer to Python as "interpreted," since it conveys useful ideas, although those ideas are more accurately conveyed with different phrasing.
That said, for more precise discussions, what you pointed out is valid and important. One of the early questions I ask in a programming interview is to explain some high-level differences between two languages they're familiar with, which is often Java and Python. One of the common responses I get is that Java compiles to bytecode which is executed by a VM, while Python is interpreted. Of course, I point out that CPython is also compiled to bytecode and executed by a VM.
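If you want to see that for yourself, the standard dis module will show the bytecode:

import dis

def add(a, b):
    return a + b

# prints the instructions (LOAD_FAST, BINARY_ADD, ...) that
# CPython compiled the function body into
dis.dis(add)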
I think it's reasonable to refer to the reference implementation of Python (CPython) as just "Python". The compilation to bytecode is an intermediate step because Python specifies a virtual machine. The language is definitely interpreted though. It's completely accurate to call Python an interpreted language. There's no version of Python that I know of (including PyPy) that ever produces compiled machine-executable binaries prior to runtime.
The compiled vs. interpreted thing is kind of a historical relic that people still cling on to. It's not dissimilar to where one draws the line for "scripting" or "nth generation" languages; it's all somewhat nebulously defined.
From my experience, the only time I've heard it nowadays is from people who were taught how to code by people who haven't coded since at least the 80s and never went on to develop the skill either professionally or as a hobby. So, yes, very much a historical relic!
That's right, in most Python implementations there's a "compiled" part (generally to bytecode) and then an "interpreted" one to run that bytecode. PyPy is a good example of a Python interpreter written in RPython (and compiled with the RPython toolchain), adding a JIT to it.
High-level is also a relative term (with several semantic dimensions, which makes languages hard to compare directly), and the term "object-oriented" is getting less expressive by the day.
Anyway, all of those terms do communicate something, even if tomorrow they may wrongly describe the language.
I've always thought that #1 is a sign of an incorrect operation altogether. If you want to always modify the passed parameter, it doesn't make sense to have a default. If you want to return a modified version of the input, you should make a copy immediately and then you don't get this problem. Doing both an in-place modification and returning a modified object at the same time is just wrong.
Again, the problem is not what it appears. You're keeping a reference to an existing item rather than making a copy. The results would be just as bad if you passed in an initial list rather than taking the default.
I think the surprising thing to most people is that you don't automatically get a copy when you do the assignment. That's how it works in older languages like C and C++, and how it appears to behave when you use immutable objects.
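That is, assignment binds a second name to the same object rather than copying it:

>>> a = [1, 2, 3]
>>> b = a          # b is another name for the same list, not a copy
>>> b.append(4)
>>> a
[1, 2, 3, 4]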
"Thus, the bar argument is initialized to its default (i.e., an empty list) only the first time that foo() is called, but then subsequent calls to foo() (i.e., without a bar argument specified) will continue to use the same list to which bar was originally initialized."
This actually happens when the function is defined, not when it's called the first time.
Also crashes Safari on my OS 10.5 Mac, and is unusably laggy in Firefox on the same computer. All sorts of thrashy javascript nonsense seems to be going on.
I use JavaScript Blocker for Safari, which is sort of like a less paranoid (and more convenient) version of NoScript. Looks like this site attempts to load 45 JavaScript files over 12 iframes, 17 of which JS Blocker blocked.
I'll do that. In general Safari doesn't crash so I was letting the OP know for their own benefit as they are in a better position to isolate the cause and work around it.
I have been bitten by #6 in a similar situation in the past. My solution was the analogue of the rather convoluted
def create_multipliers():
    def multiplier(i):
        return lambda x: i * x
    return [multiplier(i) for i in range(5)]

for multiplier in create_multipliers():
    print multiplier(2)
In "Common Mistake #2", I'd say that the mistake is fairly obvious to anyone who understands even a little bit about OOP and inheritance. Since class C doesn't define its own variable x, it has to be that it inherits the x in class A, so there's no reason to be surprised that C.x changes when A.x does.
While I agree that the "problem" case can be seen as obvious when considered in isolation, really it's the behaviour of the two cases taken together that can seem inconsistent. Nothing about understanding OOP or inheritance will prepare a person for that.
This modified version of #2 might help clear things up. I've only added print statements. In general, when issues like this come up, printing the id()s of identifiers can help:
Edit: Added some extra blank lines because lines were getting joined together.
# class_variables.py

class A(object):
    x = 1

class B(A):
    pass

class C(A):
    pass

print "Initially, A.x, B.x, C.x and their ids:"
print A.x, B.x, C.x
print id(A.x), id(B.x), id(C.x)

B.x = 2
print "After B.x = 2, A.x, B.x, C.x and their ids:"
print A.x, B.x, C.x
print id(A.x), id(B.x), id(C.x)

A.x = 3
print "After A.x = 3, A.x, B.x, C.x and their ids:"
print A.x, B.x, C.x
print id(A.x), id(B.x), id(C.x)
>really it's the behaviour of the two cases taken together that can seem inconsistent.
Why do you think so? I think that both cases seem consistent, or rather, correct (and therefore this example should not be treated as a common Python mistake), because x is not assigned a value anywhere in class C, and C inherits from A, so it should be clear to anyone knowing OOP and inheritance, that C's x is the same as A's x. (And the same holds true for inherited methods.) Even the OP says that in the post:
>In other words, C doesn’t have its own x property, independent of A.
What's happening here is that a variable is inheriting its value from the superclass, except for when it doesn't. And when it doesn't, why is that? Well presumably it's because something's been overridden - OO tells us that's how we change the properties that are inherited from the superclass. No wait, that's not it; nothing's been overridden here. All that's happened is we've assigned a value to B.x, and doing so seems to have changed the inheritance of our class.
So this variable is neither completely shared across classes and their subclasses (per Smalltalk class variables), nor completely independent across classes and their subclasses (per Smalltalk class instance variable), but instead its [in]dependence alters based upon whether (and where) you assign values to it.
While I can understand that in terms of the dictionary mechanism used to implement it, from my point of view it's just weird behaviour.
This is actually the same thing as regular Python scoping rules; there's not even any fancy OOP logic behind it. Here's the same thing, but using global scope and functions instead of classes and inheritance.
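Something along these lines (a sketch of the kind of example meant here):

x = 1              # plays the role of A.x

def read_x():
    return x       # no local x, so the lookup falls through to the
                   # global scope, just as C.x falls through to A.x

def shadow_x():
    x = 2          # assignment creates a local x that shadows the
    return x       # global, just as B.x = 2 gives B its own attribute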
I think there is an argument to be made that classes are special and "reaching upwards" into the superclass scope should not occur - a unique copy should be made - but I also think that Python's way of doing it makes enough sense that it is not confusing. The Python devs are at least consistent about having their own way of doing things.
That's an interesting point, that the behaviour of an inherited class variable is consistent with a case you show where inheritance plays no part at all.
So from that point of view, it comes down to whether we expect that an inherited class variable really is just some variable in an outer scope that we can shadow with a local variable of the same name (per your example), or whether we expect that inheritance provides some stronger notion of ownership of the inherited variable.
I dislike the former case, largely because I dislike the idea that the location at which a variable is stored can appear to change merely by assigning to it. But then, I dislike Python's implicit declaration of local variables for exactly the same reason. So you're right, there IS some consistency there. ;-)
If instead of x being an integer, it were a function x(), then it makes more sense. Really, for Python, there is no difference between the two in this example. When you assign a new value to B.x, you're overriding the value that B inherited from A. When you override the value in A, any subclass that doesn't have its own overridden value will use the new A.x, but any subclass that is overridden will be unchanged.
I suppose I just prefer the idea that the meaning of assigning a value to a variable should be "assign this value to the variable", rather than "alter the inheritance behaviour of my class such that mutable state is stored in it where it wasn't stored before, and then assign this value to the variable."
x isn't a variable, it's a tag. Because Python is dynamically typed, x can be an int one minute and a function the next, so every attribute on a class is stored as a pointer to an object, including attributes that are integers (which is an object) and functions (which are also objects). Because you aren't declaring the type of x (Python doesn't allow that), Python has to treat x = 1 the same way it'd treat def x(self).
Where x is an identifier (presumably aka tag?) that refers to a variable, it is not beyond the bounds of possibility for "x = 1" to be interpreted as "store the value of 1 into the variable that is referred to by identifier x". Plenty of languages, including dynamically typed ones, manage to do this, as indeed Python does in many cases.
Not knowing the type of x is unrelated to question of where x's value is stored, or whether x's value will be stored somewhere else after we've assigned a new value to it.
I'm not really sure I understand your point. The value of x is still going to be in memory, but you may not have any references to it and it will be garbage collected. The disconnect is in thinking x is a variable and not an identifier. x points to the object in the parent until it is overridden. This allows me to dynamically alter the functionality of a class and all its subclasses that don't override that functionality during runtime.
I'm not disputing how or why it works, just saying that I think it is poor design, as it causes a statement that looks like variable assignment to actually produce overriding. I think this violates the principle of least astonishment, and I suspect there is no good reason (beyond implementation simplicity) why Python class variables do behave this way.
(edit: replaced "an expression" with "a statement")
No, the reason it exists is so you can override functionality of subclasses at run time; otherwise things like monkey patching would be impossible. I guess it COULD do a pass to see if the attr is an immutable or some form of primitive type when __new__ is called and force those to be instantiated as instance attributes, at the cost of internal consistency, but it seems pretty straightforward to me the way it works now. You don't need to be an expert, you just need a basic understanding of how Python evaluates code and the difference between tags and variables.
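For example, a sketch of that run-time patching:

class A(object):
    def greet(self):
        return "hello"

class B(A):
    pass

A.greet = lambda self: "patched"   # rebinding the attribute on A...
print B().greet()                  # ...changes B too: prints "patched"

B.greet = lambda self: "mine"      # now B overrides it
print A().greet()                  # A is unaffected: still "patched"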
Here’s a thought experiment for you—think of “common mistakes in language X” as “design flaws in language X” or “ways in which language X is surprising” and what could have been done to mitigate that.
Regarding circular imports and #7:
The main problem arises when using the `from mymodule import mysymbol` notation.
The example solved this by properly using `import mymodule`, although this might cause some more problems if your design is wrong, as seen in the example. Calling f() from the module ("library") code itself is a very bad idea. Instead one should do this:
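Presumably something like the following, keeping the article's f() but moving the call out of the import path:

# a.py
import b

def f():
    return b.x

# run library code only when executed as a script, not at import time
if __name__ == '__main__':
    print f()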
For the first gotcha, using None as a default argument solves the problem, but checking `if not bar` instead of `if bar is None` can produce different results if bar evaluates to None in a boolean context.
>>> def foo(bar=None):
...     if not bar:
...         bar = []
...     bar.append("baz")
...     return bar
...
>>> bar = []
>>> foo(bar)
['baz']
>>> bar
[]
And I would have thought that incorrect usage of bytestrings for text and then asking on Stack Overflow about the UnicodeDecodeErrors would be quite common as well ...
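e.g. the classic Python 2 case, where a UTF-8 bytestring hits an implicit ASCII decode:

# Python 2:
s = '\xc3\xa9'     # the UTF-8 bytes of u'é' in a plain bytestring
u = unicode(s)     # implicit ASCII decode raises UnicodeDecodeError:
                   # 'ascii' codec can't decode byte 0xc3 in position 0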
For #7, now you have a performance problem of importing every time you run that function. Rather, you can place the import at the bottom of b.py and be okay.
Python caches imported modules; you can check it out in your local shell with: import sys ; sys.modules. That's why whenever you make changes to a module which has already been loaded you won't see the changes until you load the module again, either by quitting the shell or by using reload(module) on Python 2.x.
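For instance:

>>> import sys
>>> import json
>>> 'json' in sys.modules    # the loaded module object is cached here
True
>>> import json              # subsequent imports are just a dict lookup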
Any reason why you're using a slice here?
>>> numbers[:] = [n for n in numbers if not odd(n)]
I'm thinking that doing
>>> numbers = [n for n in numbers if not odd(n)]
wouldn't be a problem since the assignment is executed after the computation of the list comprehension.
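The difference isn't evaluation order but aliasing: the slice assignment mutates the existing list object in place, while plain assignment rebinds the name to a brand-new list. That matters when something else holds a reference to the original:

>>> numbers = [1, 2, 3, 4]
>>> alias = numbers
>>> numbers = [n for n in numbers if n % 2 == 0]
>>> alias                # rebinding left the old list untouched
[1, 2, 3, 4]
>>> numbers = [1, 2, 3, 4]
>>> alias = numbers
>>> numbers[:] = [n for n in numbers if n % 2 == 0]
>>> alias                # slice assignment changed it in place
[2, 4]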
"when the default value for a function argument is an expression, the expression is evaluated only once"
I would explain the behavior he shows as due to the default value being mutable. I don't see an expression there, just an empty list used as a default.
The default value is an expression which is evaluated when the def statement is executed (for a top-level function, when the module is loaded), even if the expression results in an empty list. If I had:
import datetime

def f(now=datetime.datetime.now()):
    ...
now would be the time when the module was loaded, not when f is called the first time, or when f is called after that, despite datetime objects being immutable.
If you define several functions at different times, the default argument will be evaluated each time you define a new function. You'll have the same behaviour if you keep reassigning lambdas to the same function name, or if you keep editing the globals.
You are misunderstanding how dynamic Python is. And, yes, the part about "module load time" was a simplification.
Right, it's (re)defined every time you call 'foo2', but that's a different scenario than the one written in the article (had you had only 'foo' and not 'foo2', you'd have gotten the same time in all your calls to the function without supplying arguments).
In Python, variables are references (pointers). `[]` gets evaluated at definition time (import time, for a top-level function), and returns a pointer to an object (an empty list). This object is then further mutated in the body on subsequent calls, because if you don't pass the parameter, it still points to the same object.
The scoping rules in #4, combined with being able to reference variables before definite assignment, are what lead to the 'variable hoisting' in JavaScript.
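The Python side of that gotcha, for reference:

x = 10

def foo():
    print x    # UnboundLocalError: the assignment below makes x local
    x = 5      # to the entire function body, not just from here on

foo()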
Is #6 really called 'late binding'? That seems like the wrong term.
I've always heard it used to refer to method dispatch, which that wikipedia article also seems to. However, it seems like the Python spec does use it to refer to when variable values are resolved.
Even the first argument doesn't make sense to me. The optional argument is within the scope of the function; why is the temporary optional argument getting carried over?
Some of the mistakes mentioned in OP (#1, #3, and #4) can be automatically caught by tools like PyLint (and to lesser extent, Pyflakes), as well as good unittests.
I think the way he described it is pretty accurate. Default keyword arguments are only evaluated once at function definition, so supplying a mutable default keyword argument can cause issues.
Your example is pretty contrived and doesn't illustrate what he was pointing out, as you're creating a new function foo every time foo2 is called, and only calling it once.
In "Common Mistake #5", he uses both a lambda and array index based looping, neither of which are particularly Pythonic. A better example of where this is a problem in otherwise Pythonic code would be good.
In "Common Mistake #6" he uses a lambda in a list comprehension -- for an article of mistakes mostly made by Python beginners, this is going to make it tough to follow the example.
In "Common Mistake #7", he describes "recursive imports" where he means "circular imports".
In "Common Mistake #8" he refers repeatedly to "stdlib" where he means the Python Standard Library. Someone is going to read that and try to "import stdlib".