PyPy - We need Software Transactional Memory (morepypy.blogspot.com)
135 points by j4mie on Aug 23, 2011 | 63 comments



Clojure's STM has several advantages going for it:

* Clojure's data structures are purely functional, and built with the STM in mind. Updates and rollbacks to structures are cheap.

* Clojure user-level code has always had access to the STM. Because that's the accepted paradigm, there are an order of magnitude fewer assignments in typical Clojure code, which greatly simplifies the problem. In typical Clojure, one function that uses the STM will call 10 pure functions. In typical Python, most of those 10 calls would mutate local state.

* Clojure requires all STM activity to happen within a dosync block. In Python, that would be like requiring the magic to happen within a "with transaction(): ..." (see the sketch after this list). Again, this limits the amount of work that needs to happen. The compiler doesn't need to decide whether or not a random function needs to be a transaction; the user tells the compiler.

* Clojure's STM explicitly only works with functional code. This will likely cause fewer leaky abstractions going forward. PyPy will have to modify all existing data structure classes to work with the new STM, and they won't be able to fix all the library and user code out in the world.
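
To make the "with transaction():" analogy concrete, here is a minimal Python sketch of what such a user-marked atomic region could look like. The transaction() context manager is hypothetical, and the single lock is only a stand-in for real STM machinery:

    import threading
    from contextlib import contextmanager

    _lock = threading.Lock()  # stand-in for real STM bookkeeping

    @contextmanager
    def transaction():
        # A real STM would buffer writes and retry on conflict; this
        # toy version just serializes atomic regions with one lock.
        with _lock:
            yield

    counter = 0

    def increment():
        global counter
        with transaction():  # the user, not the compiler, marks the region
            counter += 1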

Good luck to them, but I suspect bolting STM onto an existing language is much harder than designing it in from the start, much as with building in GC.


How does Clojure enforce that there are no side-effects in the transactional code? Is there a macro that "compiles" the transaction somehow?

STM in Haskell is very natural, since Haskell already separates pure code from side-effecting code. STM and TVar work pretty similarly to IO and IORef, or ST and STRef.


    How does Clojure enforce that there are no side-effects in the transactional code?
Technically it doesn't. If you have a (println "foo") in a transaction it may be printed multiple times if the transaction gets retried.

Clojure does give you an io! macro that you can wrap all your side effects in to manually provide that kind of protection. If you try to use io! inside a transaction you'll get an error at runtime.

    user=> (defn foo [n]
             (io!
               (println n))
             n)
    #'user/foo
    user=> (defn bar [] (dosync (foo 10)))
    #'user/bar
    user=> (bar)
    java.lang.IllegalStateException: I/O in transaction (NO_SOURCE_FILE:0)


It doesn't. What I meant by my last statement is that the transaction API forces updates to be functions of the old state, a somewhat foreign concept in Python.
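
For illustration, here is a rough Python sketch of that style; the Ref class is hypothetical, a toy stand-in for Clojure's refs:

    import threading

    class Ref:
        # Toy stand-in for a Clojure ref (illustrative only): updates
        # are pure functions of the old value, not in-place mutation.
        # A real STM would add conflict detection and automatic retry.
        def __init__(self, value):
            self._value = value
            self._lock = threading.Lock()

        def alter(self, fn, *args):
            with self._lock:
                self._value = fn(self._value, *args)
                return self._value

        def deref(self):
            return self._value

    account = Ref(100)
    account.alter(lambda old, amount: old - amount, 30)  # new = f(old)
    assert account.deref() == 70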


I think you've misunderstood what they're proposing here. Especially this:

> Clojure requires all STM activity to happen within a dosync block. In Python, that would be like requiring the magic to happen within a "with transaction(): ...".

Under the proposal, all Python code would run under STM; only C-API and OS-level code would run outside it.

Edit: to clarify, the most common issue with implementing a viable STM (in just about every language apart from Haskell) is the interaction between STM-managed objects and non-STM-managed objects. Under PyPy's proposal, there would be no such thing, as all objects would be STM-managed.


I do understand that aspect of their proposal. My point is that it causes an unnecessary amount of work. In typical Clojure, there's a very high ratio of pure functional code to STM code; i.e., you'll probably make 10 or 100 pure functional calls between each STM interaction.

Under the current pypy proposal, they're going to STM-ize every function call, regardless of whether it needs to do STM work. If the language had been built with STM in mind, with an explicit STM call, more user and library level code would adapt to that, and use STM sparingly.


This may work, this may not, but some nod to the fact that they've read things like http://queue.acm.org/detail.cfm?id=1454466 (as linked by codedivine before me) and have some plan to address the problems would be nice. So far, attempts to bodge STM onto an imperative language that wasn't built with it in mind have not merely been failures, but radical failures: immense amounts of effort by very smart people, followed by the total unusability and unsalvageability of the resulting code.

The fact that the only places it has been shown to work are basically Clojure and Haskell needs to be addressed: how are you going to resolve the apparent fundamental incompatibility between imperative languages and STM? And while you may be able to answer that at the VM level (very tiny transactions implemented between opcodes, basically), I would expect it is absolutely hopeless to allow users access to this. How are you going to address the endless side-effect replay problem? (Hint: you aren't.)

This is a kill-PyPy level of dangerous decision, and I think everyone on that team ought to demand a very good answer to that question before buying into this.


Until we have a new conceptual breakthrough, multiversion concurrency control (MVCC) [1], which STM represents in this context, is the only solution they have.

Yes, it will impose a tax on all mutable data structures and push people towards a more functional style of programming. This is generally good. It may be a challenge for people who think solely in terms of OOP, but then OOP will evolve in response, and that is also good.

[1] http://en.wikipedia.org/wiki/Multiversion_concurrency_contro...


The way I read it, it's a PyPy-the-VM feature, used in RPython, but not a PyPy-the-interpreter feature callable from the Python code being interpreted; hence it's not available to users (i.e. Python devs).


From a PyPy-as-a-practical-tool point of view, I agree.

From a PyPy-as-a-research-project point of view, I'm excited. I'm skeptical of STM as a synchronization solution as well, but I still want us to explore it. If their approach truly is novel, then maybe there's something there.


That's kind of what I mean: some evidence that they've got a novel approach would be nice, because so far the answer is basically "STM doesn't work in code with unrestricted side effects", and personally, I'm pretty satisfied with the amount of due diligence that has been done on that result. Research should build on existing results, not retry the same things over and over.


I think the difference here is that there are not, in fact, unrestricted side effects. Because the STM transaction scope is decided by the interpreter, not by the programmer, the interpreter can ensure the current STM transaction is committed immediately before any operation with side effects, and a new transaction started immediately after. This loses some of the power of STM, certainly, but PyPy's only aiming to maintain internal consistency of interpreter data structures at this point, so this is good enough for them.
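
Something like this sketch is what I mean; the stm object and opcode API are hypothetical, and this is not PyPy's actual code:

    def run(bytecodes, stm):
        # The interpreter, not the user, draws transaction boundaries,
        # committing before any operation with external side effects.
        stm.begin()
        for op in bytecodes:
            if op.has_side_effects:          # e.g. I/O or a C-API call
                stm.commit()                 # make prior writes visible
                op.execute()                 # side effect runs outside STM
                stm.begin()                  # resume transactional execution
            else:
                op.execute_transactionally(stm)
        stm.commit()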


My thinking was that most research on STM for imperative code has used the library approach. Since they will have full control over the generated code and how it interacts with the STM, they may have more opportunities to optimize.


We've been mashing out Python for the last few months at my job, and we are consistently struggling with the Python GIL performance penalty for threads.

If PyPy can break the GIL, I would anticipate that we'd seriously consider pushing our users toward it.


This is not a dangerous decision at all and there is no reason to involve everyone working on PyPy either.

It would not be the first time they tried implementing something and failed either.


That's easy to say, but ignorant of the history of other open source projects. A lot of open source projects have killed themselves by overcommitting to a bad idea. PyPy is just getting to where it might be considered a practical answer for some problems, and if the core project wants to drop that property, committing to a path like this is a great way to do it.

Sure, if it's some guy's branch, hey, whatever, but I'm running on the assumption that this is being promoted not as SomeGuy'sBranch but as an argument about the way the official project should go, because otherwise, why proselytize when you can just do? An awful lot of projects have died this way.

The problem is, this won't die in week two. It'll only die after a lot of effort and either several stalled or buggy releases. It's not an "it just doesn't work, crashes the system" problem, it's an infinite regress problem.


This will be implemented as an interpreter transformation, the entire idea behind this is that you don't have to care about it, just as you usually don't have to care about the JIT or the GC.


Possibly!

As Armin mentions in the blog post, the STM can be explored at the transformation/translation level. It seems to me the level of commitment here is more "interesting branch that through trial and measurement may lead to an important language-level breakthrough" rather than "stone albatross that drowns the project."


Microsoft has been doing extensive research with STM, so their work might be worth looking at along the way. A working STM package (STM.NET) was actually shipped with VS2010, where you can type this:

    Atomic.Do(() => { /* atomic code */ });
Here are some Channel 9 videos about their progress, which are in and of themselves very interesting:

http://channel9.msdn.com/Shows/Going+Deep/Software-Transacti...

http://channel9.msdn.com/Blogs/Charles/STMNET-Who-What-Why

And here are some research papers published by Tim Harris and the old STM research blog:

http://research.microsoft.com/en-us/um/people/tharris/

http://blogs.msdn.com/b/stmteam/archive/2010/05/12/stm-net-d...

Fascinating stuff. I'm wary of having it be at the core of Python, but very interested to see where PyPy ends up.


They also have a file system that can enlist in transactions:

http://en.wikipedia.org/wiki/Transactional_NTFS


>if [a JVM thread] reads a field of an object, and another thread writes to the same field of the same object, then the program will see either the old value, or the new value, but not something else entirely

Nitpick: unless the field in question is a long or a double. For those, a non-volatile read or write may be performed as two separate 32-bit operations, so you can in fact see something else entirely if you are unlucky. At least it used to be like that.


On 32-bit platforms.

On 64-bit, you won't. =)


STM sounds all nice, yet few current STM implementations give better performance than a good serial implementation.


Then again, hardware transactional memory may be arriving: http://www.eetimes.com/electronics-news/4218914/IBM-plants-t...


Note that it's currently limited to transactions with fairly small read and write sets, but this is not an inherent limitation of hardware transactional memory, and it can be used as part of a hardware/software hybrid system to accelerate small transactions.

I'm pretty excited. The chip modifications here are really small and cheap, so I think we may start seeing this in more mainstream server chips in a few years.


I hope you have sources and/or benchmarks to support your claims. Bashing a promising future technology without any backing is not constructive.


codedivine has a point and it's even admitted in this posting: "we expect the initial performance to be abysmally bad (maybe 10x slower); however, with successive improvements to the locking mechanism, to the global program transformation inserting the locks, to the garbage collector (GC), and to the Just-in-Time (JIT) compiler, we believe that it should be possible to get a roughly reasonable performance (up to maybe 2x slower)."

It will likely be difficult to beat expertly written code using explicit locks. But most people aren't experts in concurrency and will either get it wrong or have slow implementations. And if transactional memory catches on, we may even see some hardware assistance in future CPUs.

(S)TM is definitely worth exploring more and even a 2x slower implementation (as envisioned by the PyPy team) could cover most concurrency needs, which will make it a success in most people's eyes.


Do they mean 2x slower than CPython or 2x slower than PyPy? If they wind up with something 2x slower than current PyPy (which is much faster than CPython in many cases), that'll still be a version of Python much faster than CPython that doesn't have the limitations of the GIL and can thread across cores, and that'll be a huge win.


One good read is: http://queue.acm.org/detail.cfm?id=1454466 "Software Transactional Memory: Why Is It Only a Research Toy?" by Cascaval et al.

I think transactional memory systems with at least some hardware support are potentially more interesting.


Some of the problems addressed in the paper can be solved with compiler-enforceable constraints, as is done in Haskell's type system.

As for performance, the benchmarks in the paper range from no speedup at all with 8 cores, to a 1.5x increase with 2 cores, to a 4x increase with 8 cores. To me, that definitely sounds like something worth researching.


The point of that paper is that they have been researching it, quite extensively. But the performance results have often been disappointing, and the semantics of nested transactions can be surprisingly complex. At what point do you start looking into other things?


There's the same issue with normal locks: any synchronization primitive is going to incur overhead. That is, in fact, one of the defenses raised by some in the Python community against the "GIL should go" mobthink.


I think you misunderstood codedivine. His point was not that applications which use STM from a single thread are slower than sequential implementations of the same application; we expect the parallel implementation's single-threaded performance to suffer by a percent or two (or ten...) because it has to do more hardware-level atomic operations.

codedivine's point was that applications implemented with STM and using multiple threads often perform worse than the sequential version. See the excellent ACM Queue article he links to in a sibling comment.


> I think you misunderstood codedivine. His point was not that applications which use STM from a single thread are slower than sequential implementations of the same application.

And the experience has been exactly the same in lock-based GIL-removal tests in Python: David Beazley recently "unearthed" and tested a GIL-removal patch from the 1.4 days[0]...

> To test threads, I wrote a small sample that subdivided the work across two worker threads in an embarrassingly parallel manner (note: this code is a little wonky due to the fact that Python-1.4 doesn't implement thread joining--meaning that you have to do it yourself with the included binary-semaphore lock).

> [...]

> If you run this code with the GIL, the execution time is about 2.5 seconds or approximately 1.3 times slower than the single-threaded version (1.9 seconds). Using the GIL-less Python, the execution time is 18.5 seconds or approximately 1.45 times slower than the single-threaded version (12.7 seconds). Just to emphasize, the GIL-less Python running with two-threads is running more than 7 times slower than the version with a GIL.

[0] http://dabeaz.blogspot.com/2011/08/inside-look-at-gil-remova...


The main difference being that we have plenty of examples of lock-based code scaling well. Even large, non-trivial systems, such as the Linux kernel. It used to rely solely on the BKL - big kernel lock - for synchronization, but they have moved to finer grained, data structure level locking.

That makes me think that an implementation of Python that uses fine-grained (or at least finer-grained) locks and scales is possible. It's just a question of how feasible it is to transform CPython into such an implementation. Beazley's experiment indicates that the changes may have to be fundamental, such as moving away from reference counting.


Would it be possible to run the STM code only when the program is multi-threaded? This would require detecting whether the program is single- or multi-threaded (would that be difficult?) and might provide a good trade-off: either be single-threaded and fast, or use a lot of threads and be 2-10 times slower than CPython, but without the GIL.
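
A crude sketch of what I'm imagining, with hypothetical opcode and STM objects (the hard part, switching over safely when a second thread starts mid-transaction, is not handled here):

    import threading

    def execute(op, stm):
        if threading.active_count() == 1:
            op.execute()                     # plain, GIL-style fast path
        else:
            op.execute_transactionally(stm)  # pay the STM tax only when
                                             # threads actually exist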


In my mind, Python will always fail at true shared-memory concurrency, so why not go the other route: have concurrency that enforces no shared state, allowing only message passing and selective receive.

That's all the rage these days anyway, as it's easier to reason about. Better still, it matches distributed computing models much better.


I'm a little disappointed. PyPy has just recently been becoming a fast, attractive alternative to CPython (or Cython compilation).

If they go the STM route and spend years hashing out the details only to be 2x as slow as CPython, then that cuts out a huge number of people who would like to practically use PyPy.


I am pretty sure that is 2x slower than PyPy, not CPython. So it would basically end up being over 2x faster than CPython.


I think he might have meant 2x slower than without STM and not 2x slower than CPython. Also from everything I have read you would still have the option to not use STM at all and just use the GIL which would not come with the extra overhead.


How would this ever be an issue? You could simply compile PyPy without STM, just as you can compile PyPy without the JIT or with a different GC.


STM and locks both involve some interesting clean-up, and (as is cited in the article) you really need them not to crash, not to deadlock, and not to go all priority-inverted on you.

Message-passing, on the other hand, doesn't need particular locking and crashes are (somewhat) simpler to deal with.

(And most any scheme I've encountered can deadlock, and will usually choose the most inopportune moment to demonstrate this capability.)

There's an older blog post here http://enfranchisedmind.com/blog/posts/thoughts-on-paralleli... that's worth a read.


    With this in mind, it is possible to imagine a whole-
    program transformation that would add locking on every 
    object manipulated in RPython by the interpreter. This 
    would end up in a situation similar to Jython. However, 
    it would not automatically solve the issue of deadlocks, 
    which is avoided in the case of Jython by careful manual 
    placement of the locks. (In fact, being deadlock-free is 
    a global program property that cannot be automatically 
    ensured or verified; any change to Jython can in theory 
    break this property, and thus introduce subtle deadlocks. 
    The same applies to non-atomicity.)
This is just not true. It is true in the general case, but not in this specific case. Remember the pithy solution to the dining philosophers problem (number all the chopsticks and pick up the lowest-numbered one first)?

I don't know the details of the PyPy interpreter, but it does seem that, given a bytecode that affects several objects, you could just acquire the locks for those objects in order of their addresses.

Now the PyPy guys are very smart and this is a simple solution, so I'm sure it yields horrible performance. But it is simply not true that "being deadlock-free is a global program property that cannot be automatically ensured" (it cannot be verified after the fact, but if you generate all the locking code, it can be ensured by strict ordering), and to say otherwise is wrong.
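
Here is a minimal sketch of that idea (hypothetical helper names; note that using id() as an address bakes in a non-moving-GC assumption, which PyPy's moving GC would violate):

    import threading

    _registry_lock = threading.Lock()
    _locks = {}  # id(obj) -> lock

    def lock_for(obj):
        with _registry_lock:
            return _locks.setdefault(id(obj), threading.Lock())

    def atomic(objs, operation):
        # Acquire every lock in one global order (sorted by id), the
        # dining-philosophers fix: no cycle of waiters can ever form.
        unique = {id(o): o for o in objs}
        locks = [lock_for(unique[i]) for i in sorted(unique)]
        for lock in locks:
            lock.acquire()
        try:
            operation(*objs)
        finally:
            for lock in reversed(locks):
                lock.release()

    # e.g. atomic([list_a, list_b], lambda a, b: a.extend(b))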


You missed this point: "and moreover if you need to have accessed the first object before you can decide that you will need to access the second object".

That is, he's talking about situations where they don't have full knowledge when entering a critical section which objects will be accessed. Hence, they can't order lock acquisition in such a way to avoid deadlock.


I would be extremely surprised if this happened at the byte-code level.


You should be able to determine the set of objects which might be touched by a section of bytecode such that they need locking. However, the majority of the time that set will vastly over-estimate the number of objects which actually get touched (the problem bytecodes here are, just from inspection, JUMP_IF_{TRUE,FALSE}, EXEC_STMT, IMPORT_STAR, BUILD_CLASS, {STORE,DELETE}_GLOBAL, probably a few more I can't be bothered to think through right now), and a lot of the time will simply be "all the objects in the system."
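
As a rough illustration of that kind of scan (opcode names follow today's dis module; the 2011 opcode set differs slightly, e.g. EXEC_STMT and BUILD_CLASS no longer exist):

    import dis

    # Opcodes whose touched-object set can't be bounded ahead of time.
    UNBOUNDED = {"STORE_GLOBAL", "DELETE_GLOBAL", "IMPORT_STAR"}

    def needs_global_lock(fn):
        return any(ins.opname in UNBOUNDED
                   for ins in dis.get_instructions(fn))

    def f():
        global x
        x = 1

    print(needs_global_lock(f))  # True: STORE_GLOBAL touches module state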

Fundamentally, it doesn't make a difference whether you analyse the python or the bytecode; further, I'm not even sure that RPython is ever converted to bytecode, so it's not entirely relevant here.


We are using Haskell + STM for scaling a major multicore system at Bump, and it's been exceedingly awesome thus far. However, STM happens to suit Haskell's type system-enforced purity particularly well; it's hard to see this being anything but a monumental undertaking in a language like Python.


Does this proposal entail writing Python code without mutation of non-local state? Isn't this harder in Python than in Haskell or Ocaml?


> Does this proposal entail writing Python code without mutation of non-local state?

PyPy using STM should not change the semantics of the Python language or require alterations to code which does not rely on implementation details (code which relies on the GIL for its thread-safety may have to be fixed, but it already has to be fixed for Jython compatibility).

> Isn't this harder in Python than in Haskell or Ocaml?

I'd expect STM to be much harder in imperative languages than in functional languages indeed. In fact, I don't think I've heard of a successful STM implementation in imperative languages yet.

On the other hand, I think most attempts so far have applied STM explicitly, as a library, and in imperative languages you run into issues of (mutable) objects "crossing" the transaction barrier. That issue "does not" happen in Haskell: there are ways to subvert the type system, but then you're on your own; in normal use, objects under STM live in a different monad and therefore can't be freely mixed with objects living outside of it. (I believe the issue would happen in OCaml, so on that front OCaml would be little better than imperative languages.) This issue should not exist if all of the code runs under STM (ignoring explicit hooks to do so built into the system), and this should also allow for an "easy" implementation of explicit memory transactions: one of STM's promises is composability, so you should be able to compose multiple bytecode-transactions into a single explicit transaction.
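
As a toy illustration of that composability claim (hypothetical names; a reentrant lock stands in for the read/write-set merging a real STM would do):

    import threading

    _txn_lock = threading.RLock()  # reentrant, so transactions nest

    class transaction:
        # Toy model: an inner transaction simply joins the enclosing
        # one instead of committing separately.
        def __enter__(self):
            _txn_lock.acquire()
        def __exit__(self, *exc):
            _txn_lock.release()
            return False

    def withdraw(acct, amt):
        with transaction():
            acct["balance"] -= amt

    def deposit(acct, amt):
        with transaction():
            acct["balance"] += amt

    def transfer(src, dst, amt):
        with transaction():  # two transactions compose into one atomic unit
            withdraw(src, amt)
            deposit(dst, amt)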


After reading this, I thought a bit about other possible solutions, especially something more like what Jython does, but done automatically. I came up with this: https://github.com/albertz/automatic_object_locking


There's a great, very detailed retrospective on Microsoft's STM implementation efforts here:

http://www.bluebytesoftware.com/blog/2010/01/03/ABriefRetros...


AFAIK, all builtin functions are considered atomic.

Wouldn't it be enough to have a mutex for each object, and to lock all objects passed as parameters (including, and especially, `self`) whenever such a builtin function is called?


No, that is not enough. If you want to implement atomicity for operations that touch multiple objects, a simple mutex-per-object approach just won't cut it: if you start holding multiple mutexes at once, there will be deadlocks unless they are locked in a consistent order.
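
A minimal sketch of that failure mode:

    import threading

    lock_a, lock_b = threading.Lock(), threading.Lock()

    def worker_1():
        with lock_a:      # acquires a, then b
            with lock_b:
                pass

    def worker_2():
        with lock_b:      # acquires b, then a: with this inconsistent
            with lock_a:  # order, each thread can end up waiting on
                pass      # the other forever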

Not to mention that if every object has a mutex associated with it, performance will be horrible. Even if the mutexes are uncontended, there will be so many atomic compare-and-swap operations that the interprocessor memory bus will simply die.

If your solution would work, every method in Java would be synchronized by default.


I agree with your points on performance, but it's worth noting that they apply to STM as well: if you make the atomic regions too numerous and coarse-grained, the compare-and-swap operations will kill performance.


That's essentially what Jython does. He discusses why he does not want to go in that direction in the article.


He says that it would be a lot of work and hard to do manually.

But I think you can add such a behavior in an automatic way.


It is also stated why doing that automatically is infeasible.


The only issue stated about that is the possibility of deadlocks when done in an automatic way.

However, I claim that you can also do it in an automatic way and avoid deadlocks at the same time.


In order to avoid deadlocks you need to be aware of the language semantics, which you are not if you automatically transform the interpreter.


What I mean is that even without knowing that, you can still do it in an automatic way.

E.g., like this: https://github.com/albertz/automatic_object_locking/blob/mas...


You are assuming that all the objects that will be accessed are known up front, before any one of them is accessed. Is that what happens in Python?


Couldn't that lead to performance issues?


[deleted]


Have you considered reading the fine article? That issue is covered.



