Let's Write an LLVM Specializer for Python

Erwin · on Jan 22, 2015

Interesting. I currently use Python's AST to convert some nested logical query expression (in a syntax unique to my application) into bytecode executed by a specialized VM (I originally tried using V8 and LuaJit for this but performance wise that was unsuccessful; the project replaced some old Boost::Python C++ code). This article should make it easy to get started attempting an LLVM replacement.
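For readers who have not tried this kind of pipeline, here is a minimal sketch of the idea. It assumes plain Python boolean syntax as a stand-in for the application-specific query syntax, and the names `compile_query` and `run` are made up for illustration. It uses the `ast` module to turn a nested logical expression into a flat list of stack-machine instructions, then interprets them without short-circuiting.

```python
import ast

def compile_query(text):
    """Compile a boolean expression into a list of stack-machine ops."""
    ops = []

    def emit(node):
        if isinstance(node, ast.BoolOp):
            opname = "AND" if isinstance(node.op, ast.And) else "OR"
            emit(node.values[0])
            for value in node.values[1:]:
                emit(value)
                ops.append((opname,))
        elif isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
            emit(node.operand)
            ops.append(("NOT",))
        elif isinstance(node, ast.Name):
            ops.append(("LOAD", node.id))
        else:
            raise ValueError("unsupported syntax: " + type(node).__name__)

    emit(ast.parse(text, mode="eval").body)
    return ops

def run(ops, env):
    """Evaluate the ops against a dict of variable values (no short-circuit)."""
    stack = []
    for op in ops:
        if op[0] == "LOAD":
            stack.append(bool(env[op[1]]))
        elif op[0] == "NOT":
            stack.append(not stack.pop())
        else:
            right, left = stack.pop(), stack.pop()
            stack.append((left and right) if op[0] == "AND" else (left or right))
    return stack.pop()

program = compile_query("a and (b or not c)")
result = run(program, {"a": True, "b": False, "c": False})
```

An LLVM backend would replace `run` with code generation over the same instruction list, which is roughly the step the article walks through for numeric code.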

travisoliphant · on Jan 23, 2015

Yes, LLVM is a great approach for doing code-gen from arbitrary specialized VMs. And the Python interface to it makes it easy to experiment. We no longer use llvmpy for Numba (we use a simpler interface llvmlite) and so llvmpy could use a maintainer.

walkamages · on Jan 22, 2015

An excellent article! I had wanted to get back into some Python recently after seeing the changes in 3.4, and I had also wanted to become more familiar with LLVM; this does both.

gtirloni · on Jan 22, 2015

there is this discussion (flame war?) about python3 not bringing too many benefits. i haven't made up my mind yet. could you elaborate on what you saw in 3.4 that was nice?

exDM69 · on Jan 22, 2015

Simply put: it's a better language. The whole "discussion" is whether it makes sense to migrate since many parts of the ecosystem (many important libraries and frameworks) have not made the transition. And many distros ship Python 2 by default, Python 3 is optional. Python 3 only is not feasible.

To me, the killer feature is better lazy evaluation (generators). In particular, important builtins like map, filter, zip, enumerate, etc are generators, instead of returning lists. This makes it feasible to write things like

    (process(line) for line in map(str.upper, open('giantfile.txt')) if line.lstrip()[0] != '#')

Some of the above can also be done with itertools package in Python 2, but not everything.

Python 3.4 changelog is here, it contains e.g. asynchronous io facilities (asyncio module): https://www.python.org/downloads/release/python-342/

edit: added enumerate() in the example above, for line in open(filename) returns a generator in Python 2.x too.

edit2: enumerate is lazy in python2, I replaced it with map(str.upper)

halflings · on Jan 22, 2015

The example you gave works perfectly in Python 2.7 (it would also be a generator, and you're not using map, filter, or anything else); but I agree: those should've been generators from day 1, especially zip and enumerate, since they make for more elegant code but often come with a performance overhead in Python 2.7

maxerickson · on Jan 22, 2015

Python 2 enumerate returns an 'enumerate' object that is more or less a light weight wrapper of the sequence that was passed in.

Generators provide a convenient syntax to implement that sort of object.
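As a rough illustration of that point (a sketch only, not how CPython implements the builtin; `my_enumerate` is a made-up name), an enumerate-like lazy object takes just a few lines as a generator function:

```python
def my_enumerate(iterable, start=0):
    """Lazily yield (index, item) pairs, similar to the builtin enumerate."""
    index = start
    for item in iterable:
        yield index, item
        index += 1

pairs = my_enumerate(["a", "b", "c"])
first = next(pairs)   # only the first item has been pulled from the list so far
rest = list(pairs)    # consume whatever is left
```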

exDM69 · on Jan 22, 2015

D'oh. I put in a map(), that returns a list.

danohuiginn · on Jan 22, 2015

Turning them into generators would have broken a lot of existing code, though, so it's reasonable to leave them until a major version change. Rather, these are the kind of small, obviously-useful changes that should have come immediately in 3.0, giving people some encouragement to switch.

exDM69 · on Jan 22, 2015

Most of the breaking changes (including making map, zip, filter, etc lazy) were done in Python 3.0.

exDM69 · on Jan 22, 2015

Ah, you're absolutely right. for line in open(filename) returns a generator. I added map(str.upper, xxx) there to make it make more sense.

But the point should be obvious, without generators that would potentially consume a lot of memory.

danohuiginn · on Jan 22, 2015

        (process(line) for line in open('giantfile.txt') if line.lstrip()[0] != '#')

Is that line really using any new features in Python 3? The lazy evaluation there is in the file object and the generator expression, both of which have long been present in python2.

exDM69 · on Jan 22, 2015

No, it wasn't. I added map(str.upper, xxx), now it does.

masklinn · on Jan 22, 2015

Then again a lazy map is just an import away in P2. It removes pitfalls from Python but the improvement is… very limited (as opposed to e.g. `yield from` which is a big convenience, or for much more specialised uses the ellipsis literal being universal)
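To make both halves of that concrete: in Python 2 the lazy variant is `itertools.imap`, and in Python 3 the builtin `map` is already lazy. A small sketch of `yield from` (Python 3.3+), using a hypothetical `flatten` helper, shows the kind of convenience meant here; without it, the recursive call needs an explicit inner `for` loop that re-yields every item.

```python
def flatten(nested):
    """Yield the leaves of arbitrarily nested lists, depth first."""
    for item in nested:
        if isinstance(item, list):
            # Delegate to the sub-generator instead of looping and re-yielding.
            yield from flatten(item)
        else:
            yield item

flat = list(flatten([1, [2, [3, 4]], 5]))
```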

dalke · on Jan 23, 2015

  (process(line.upper()) for line in open('giantfile.txt') if line.lstrip()[0] != '#')

exDM69 · on Jan 23, 2015

That's besides the point. It's trivial to write that as a for loop or in a million different ways to avoid the issue. It's a contrived example written to demonstrate a difference.

Here's another one you can't change that easily:

    tests_pass = all(process(input) == output for (input, output) in zip(open('inputs.txt'), open('outputs.txt')))

andreasvc · on Jan 23, 2015

You can change that easily, with izip from itertools.

The fact that a bunch of builtins and the values/items methods of dictionaries have become iterators is not very significant IMHO. Python 2 code could already be written to use iterators or generator expressions, so in the parts where it was crucial it was already done. In this regard Python 3 has not added new functionality but only changed defaults.

The unicode change is the big one.

dalke · on Jan 23, 2015

In this case (checking that process(input) == output), itertools.izip_longest is probably the right solution, unless there's an out-of-band way to know that inputs.txt is the same length as outputs.txt.
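A sketch of that check, written for Python 3 where the function is named `itertools.zip_longest` (`izip_longest` is the Python 2 name). The helper `all_match` and the sentinel `_MISSING` are made up for illustration, and lists stand in for the two files.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel that cannot collide with real data

def all_match(process, inputs, outputs):
    """True only if both iterables have equal length and every pair matches."""
    for inp, out in zip_longest(inputs, outputs, fillvalue=_MISSING):
        if inp is _MISSING or out is _MISSING:
            return False  # one side ran out early
        if process(inp) != out:
            return False
    return True

same_length = all_match(str.upper, ["a", "b"], ["A", "B"])
short_outputs = all_match(str.upper, ["a", "b"], ["A"])
# Plain zip stops at the shorter input, so the missing line goes unnoticed.
plain_zip = all(str.upper(i) == o for i, o in zip(["a", "b"], ["A"]))
```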

dalke · on Jan 23, 2015

You felt the need to correct yourself earlier, so I think your "besides the point" should be directed to yourself. I was pointing out that your correction wasn't persuasive.

smazga · on Jan 22, 2015

We went to Python 3 for the multiprocessing module. At the time it was 3.3, but now 3.4 has all that async magic. I wish I still worked on that project.

chrisheller · on Jan 22, 2015

You'll get an IndexError exception on that if there are any blank lines in the file.

Changing that to line.lstrip().startswith('#') would be an alternate approach.
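A small sketch of the difference, with made-up function names. Since the original condition keeps lines that are not comments, the `startswith` form needs a `not` in front of it. Note that it lets blank lines through rather than raising.

```python
lines = ["# comment\n", "\n", "  data\n"]

def keep_indexing(line):
    # Fails on whitespace-only lines: "\n".lstrip() is "", and ""[0] raises.
    return line.lstrip()[0] != '#'

def keep_startswith(line):
    # "".startswith('#') is simply False, so blank lines are kept.
    return not line.lstrip().startswith('#')

kept = [line for line in lines if keep_startswith(line)]
```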

exDM69 · on Jan 22, 2015

You're right but that is irrelevant, it's a somewhat contrived example anyway. It's not like I spent a lot of time trying it out.

ak217 · on Jan 22, 2015

Without listing any of the modules or improvements that are in the standard library in 3 but backported as PyPI modules to 2 (of which there are many), here are the features that I actually use in Python 3: unicode handling that isn't insane, function annotations, async improvements, exception chaining, enums, single-dispatch generics, better SSL support, generator delegation, better int-bytes conversion support, unittest module improvements.
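A brief sketch of three items from that list: single-dispatch generics (3.4), int-bytes conversion (3.2), and exception chaining (3.0). The function names `describe` and `parse_port` are illustrative only.

```python
from functools import singledispatch

@singledispatch
def describe(value):
    return "object"          # fallback for unregistered types

@describe.register(int)
def _(value):
    return "int"

@describe.register(list)
def _(value):
    return "list"

# int <-> bytes round trip; 1024 is 0x0400.
raw = (1024).to_bytes(2, "big")
back = int.from_bytes(raw, "big")

def parse_port(text):
    try:
        return int(text)
    except ValueError as exc:
        # "from exc" records the original error as __cause__.
        raise RuntimeError("bad port") from exc
```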

The key point is that 2.7 is a language frozen in time, while 3.4+ is continuing to develop and improve. And most of the hand-wringing was before the critical mass of third-party modules was ported to 3.x.

https://docs.python.org/3/whatsnew/3.4.html

https://docs.python.org/3/whatsnew/3.3.html

https://docs.python.org/3/whatsnew/3.2.html

https://docs.python.org/3/whatsnew/3.1.html

https://docs.python.org/3/whatsnew/3.0.html

ngoldbaum · on Jan 22, 2015

My favorite new feature is PEP-442 [0]. Basically, it's now safe to add a __del__ method to a class without worrying about memory leaks caused by reference cycles.

[0] https://www.python.org/dev/peps/pep-0442/
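A small sketch of what that means in practice, assuming CPython 3.4 or later; the `Node` class is made up. Two objects with `__del__` reference each other, and the cycle collector still finalizes and frees them instead of parking them in `gc.garbage` as older versions did.

```python
import gc

finalized = []

class Node:
    def __init__(self, name):
        self.name = name
        self.other = None

    def __del__(self):
        finalized.append(self.name)

a, b = Node("a"), Node("b")
a.other, b.other = b, a   # build a reference cycle
del a, b                  # refcounts never hit zero because of the cycle
gc.collect()              # the cycle collector runs both finalizers

leftover = [obj for obj in gc.garbage if isinstance(obj, Node)]
```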

pekk · on Jan 22, 2015

Most people should not be writing __del__ methods at all, especially if what they are trying to do is deterministic cleanup.
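For deterministic cleanup the usual tool is a context manager. A minimal sketch, where `managed_resource` and the `events` log are made up for illustration:

```python
from contextlib import contextmanager

events = []

@contextmanager
def managed_resource(name):
    events.append("open " + name)
    try:
        yield name
    finally:
        # Runs when the with-block exits, even if the body raised.
        events.append("close " + name)

with managed_resource("db") as resource:
    events.append("use " + resource)
```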

mkesper · on Jan 22, 2015

Saner handling of Unicode, for example.

chc · on Jan 22, 2015

This alone is pretty wonderful. I've only been working in Python a few months and the number of issues I've had to debug in 2.7 that came down to Unicode handling is kind of nuts.

gamesbrainiac · on Jan 22, 2015

I'd like to know too. Are there any LLVM specific enhancements that python 3.4 brings to the table?

exDM69 · on Jan 22, 2015

No. AFAIK, there's nothing related to LLVM in core Python. And not in 3.4 changes either.

andreasvc · on Jan 23, 2015

I can't say I disagree with such sentiments, but over time I've run into a few issues (features and performance improvements) which were only addressed in Python 3. This is reason enough to try working with Python 3.

_ondq · on Jan 22, 2015

Add to the list:

- function annotations (allows runtime type checking via third party modules)

- asyncio (not as easy to use as Go's goroutines but still vastly superior to the multiprocessing module)
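As a sketch of the first item, here is roughly how a third-party checker could use annotations at runtime. The decorator `check_types` is hypothetical and not taken from any particular library. It only handles ordinary parameters annotated with plain classes and ignores the return annotation.

```python
import functools
import inspect

def check_types(func):
    """Raise TypeError when an argument does not match its class annotation."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = func.__annotations__.get(name)
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError("%s must be %s, got %s" % (
                    name, expected.__name__, type(value).__name__))
        return func(*args, **kwargs)

    return wrapper

@check_types
def scale(x: float, factor: int) -> float:
    return x * factor

ok = scale(1.5, 2)
```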

baq · on Jan 22, 2015

much better exceptions and sane unicode are the biggest improvements.

travisoliphant · on Jan 23, 2015

This is a great tutorial about first-generation Numba. The author learned a lot about LLVM and llvmpy while working with several of our devs. If you are interested in the "Further work" in his article, come join the Numba project.

tadlan · on Jan 23, 2015

What is second generation numba? And any plans to branch numba out of pure numeric applications?

jonstewart · on Jan 22, 2015

I really appreciate the length and detail in this blog post. It's comprehensive, not just showing off.

ch0wn · on Jan 22, 2015

Stephen Diehl continues to blow my mind on a regular basis. His latest work in progress "Write You a Haskell"[0] is also worth keeping an eye on. I've worked through the first couple of chapters and it's fantastic.

[0] http://dev.stephendiehl.com/fun/

illumen · on Jan 22, 2015

Very nice article! :)

Storing types via traces could be another step for gathering types. As well as using the more advanced static type checking code that is around for python.

Now I have something to work through on the weekend. Looking forward to part 2!

travisoliphant · on Jan 23, 2015

The numba code-base implements quite a bit of this. We actually moved away from the AST approach and went back to the byte-code approach because the AST approach quickly becomes unwieldy as the number of Visitors that you apply grows. Compile times are also slower.
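To make the two starting points concrete, here is a small sketch using only the standard library. It is not Numba code, and the `BinOpCounter` visitor is made up. The same function is inspected once through an `ast.NodeVisitor` and once through its bytecode via `dis`.

```python
import ast
import dis

source = "def add(a, b):\n    return a + b\n"

class BinOpCounter(ast.NodeVisitor):
    """One visitor pass; a real compiler stacks many of these."""
    def __init__(self):
        self.count = 0

    def visit_BinOp(self, node):
        self.count += 1
        self.generic_visit(node)

tree = ast.parse(source)
counter = BinOpCounter()
counter.visit(tree)

# Compile the same tree and look at the bytecode the interpreter would run.
namespace = {}
exec(compile(tree, "<example>", "exec"), namespace)
opnames = [ins.opname for ins in dis.get_instructions(namespace["add"])]
```

The exact opcode names in `opnames` vary between CPython versions, which is one of the costs of the bytecode route.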

The author is definitely helping people learn about LLVM and how it can be used with Python --- which is great, because this is exactly what Numba is: http://numba.pydata.org. But, please don't start another "Numba". Just come help us improve the current one.

wedesoft · on Jan 23, 2015

I did something similar in Ruby but using GCC as a "JIT" compiler for image processing (software [1], thesis [2]). I can really recommend JIT compilation for doing array processing.

[1] http://www.wedesoft.de/hornetseye-api/ [2] http://www.wedesoft.de/downloads/thesis_wedekind.pdf

EDIT: In my approach I didn't go through the Ruby AST though. Rather I used the approach of injecting "GCCVariables" which emit C code instead of doing the actual computation.
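A rough Python analogy of that technique. The original is in Ruby, and the names `CExpr` and `emit_c_function` are invented here, not taken from the library. Placeholder objects overload the arithmetic operators so that running ordinary code on them builds a C expression string instead of computing a value.

```python
def _code(value):
    """C source text for an operand: a CExpr's code or a literal's repr."""
    return value.code if isinstance(value, CExpr) else repr(value)

class CExpr:
    def __init__(self, code):
        self.code = code

    def __add__(self, other):
        return CExpr("({} + {})".format(self.code, _code(other)))

    def __radd__(self, other):
        return CExpr("({} + {})".format(_code(other), self.code))

    def __mul__(self, other):
        return CExpr("({} * {})".format(self.code, _code(other)))

    def __rmul__(self, other):
        return CExpr("({} * {})".format(_code(other), self.code))

def emit_c_function(name, func, arg_names):
    """Call func on placeholder arguments and wrap the result as C source."""
    args = [CExpr(arg) for arg in arg_names]
    body = func(*args)
    params = ", ".join("double " + arg for arg in arg_names)
    return "double {}({}) {{ return {}; }}".format(name, params, body.code)

c_source = emit_c_function("axpy", lambda a, x, y: a * x + y, ["a", "x", "y"])
```

The resulting string would then be handed to a C compiler and loaded back in, which is the "JIT" part this sketch leaves out.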

chrisseaton · on Jan 23, 2015

You should submit your thesis to the Ruby Bibliography http://rubybib.org

wedesoft · on Jan 23, 2015

Ok, will do. Cheers :)

ericfrederich · on Jan 23, 2015

I love Python but when I use a statically typed language like C, Rust, Go, etc I really feel that it is missing from Python.

I'd love to see a new language exactly like Python but compiled and statically typed. Something similar to Cython, but rather than generating a bunch of C code it would target LLVM. Additionally it would be able to generate pure Python code simply by removing any typing syntax.

cyberneticcook · on Jan 23, 2015

Could this be used to ahead-of-time compile Python code? I'm more interested in getting to a native executable or library.