About concurrency and the GIL (merbist.com)
82 points by telemachos on Oct 4, 2011 | 50 comments



I'm really tired of these disingenuous justifications and arguments to 'just use processes!'

A GIL is an unnecessary limitation that precludes a huge swath of architecture optimizations. It's there because it's difficult to remove once you've made the incorrect choice to rely on it, not because a GIL is a good idea.


I have to completely disagree. I think the GIL is useful in Ruby and is a mitzvah, not a sin.

Reason: well, bla bla bla, Unicorn does lightweight forks. But the real fun is in the "synchronized" keyword of Java. You see, in Java multithreading is hard, but it is a hell of a lot easier than in many other languages, Ruby included. If anything, we should design "Truby", a new language for multi-threaded Ruby. Truby would need syntax for dealing with asynchronous operations. Truby could then remove the GIL and life would be good. The problem is that if you take Ruby as it is and remove the GIL, sure, you can get away with things, rewrite libs, etc., but in the end you still won't have a really good language for async programming.

So yes, just use processes. Once we have a good multi-threaded programming model without a GIL, then we can just use threads. Until then...

edit: JRuby can get away without a GIL because it relies on libraries implemented in Java, where writing async code is much less of an issue.


>But the real fun is in the "synchronized" keyword of Java.

>we should design "Truby", a new language for multi-threaded Ruby

But that's just syntactic sugar for managing a lock. I don't follow Ruby exceptionally closely; is the syntax of the language frozen forever? Can similar sugar not be added?


When I did multi-threaded Ruby stuff (shudders), I wrote a simple lock block using either an object or a symbol as the lock identifier. The resulting code looked like:

  lock(some_obj) do
    # synchronized code here
  end
Using a mechanism like that you can add synchronization to Ruby without any new syntax.
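
For the curious, here's a rough sketch of how such a helper could be built on top of Ruby's standard Mutex (the lock method and the registry below are illustrative, not the code I actually used):

  require 'thread'

  LOCKS = {}                # one Mutex per lock identifier (object or symbol)
  LOCKS_GUARD = Mutex.new   # protects the registry itself

  def lock(id)
    mutex = LOCKS_GUARD.synchronize { LOCKS[id] ||= Mutex.new }
    mutex.synchronize { yield }
  end

  lock(:accounts) do
    # synchronized code here
  end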


Yes, true, but the real fun is in making design decisions with respect to how threads work, etc., to ensure concurrency that is both true and easy, plus good concurrency frameworks. TBH we can do that now with the GIL by handing Ruby code to a C library that handles the multi-threaded calls. But at the end of the day, let's change the game, not play catch-up to Java.


Yes it can, but it's not needed directly. Maybe for Ruby 2.0 this could get shoehorned in.

JRuby and (sharpish) Rubinius can already do GIL-less Ruby execution. But that doesn't mean everything "just works"; it only exposes the problems both implementations have when two concurrent contexts try to update the same object, like adding hash elements at the same time. In JRuby it throws an exception, at least allowing you to retry, and Rubinius is copying that behavior as well. But technically, there is no standard for what happens.
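
To illustrate (my own sketch, not from the article): code like the following happens to be serialized by the GIL on MRI, but on a GIL-less implementation the unguarded version may raise or corrupt the Hash; the portable fix is to make the sharing explicit with a Mutex.

  require 'thread'

  h = {}
  threads = 10.times.map do |i|
    Thread.new { 1000.times { |j| h["#{i}-#{j}"] = j } }  # unguarded concurrent inserts
  end
  threads.each(&:join)

  # Portable fix: guard the shared Hash explicitly.
  guard = Mutex.new
  safe  = {}
  threads = 10.times.map do |i|
    Thread.new { 1000.times { |j| guard.synchronize { safe["#{i}-#{j}"] = j } } }
  end
  threads.each(&:join)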

My knowledge of non-CPython interpreters isn't enough to talk about their GIL (or lack thereof) implementations.


@nupark2 how was my explanation "disingenuous"? I merely laid out the reasons why the C Ruby team decided to use a GIL and is not planning on removing it. The fact that you disagree with the core team's conclusion doesn't make my post a "disingenuous justification".


This is very true. It reminds me of the BKL: yeah, we could technically live with it, but it would be lazy and bad practice for a platform.

Others in this thread are saying that GILs are OK because they're working on web applications. Not everyone shares your narrow requirements when using a general purpose language.


> Not everyone shares your narrow requirements when using a general purpose language.

IMHO, if the resource difference between processes and threads is significant for your workload, then moving away from an interpreted language is probably a step to take before going threaded, i.e. remove the overhead of the interpreter before you worry about the overhead of a GIL.

Also note that a GIL gives better performance on a single-threaded workload, since you avoid the overhead of fine-grained locking.

So, asking for GIL removal is asking all users of the language to pay a 'threading tax', when it isn't at all clear that there would be any real beneficiaries (since the above argument can be made that those workloads would be better in a different language).

I'm not rabidly anti-thread, but I think it is important to note that there are good reasons (not just laziness) why you might not want the GIL removed - i.e. I don't think it's as simple as being bad practice.


>remove the overhead of the interpreter before you worry about the overhead of a GIL.

The overhead of the GIL killing parallelism is nontrivial, even next to the interpreter overhead.

>(since the above argument can be made that those workloads would be better in a different language).

That's a pretty self-defeating argument. Every single possible Ruby workload would perform better in another language.

>So, asking for GIL removal is asking all users of the language to pay a 'threading tax',

First, the existence of multithreading support already carries a threading tax: the GIL. Second, sprinkling in the use of some more mutexes wouldn't make a significant performance impact, especially not in an interpreted language that we already agree is not good for performance.

Of course, if the runtime knew it was only running one thread, it could ignore much of the synchronization code altogether, but we don't want to pay the tax of a single branch decision, do we?

>IMHO, if the resource difference between processes and threads is significant [...]

What about all situations where you want to use threads, but not due to resource constraints? I would assume your answer would be "don't use threads" but you say you're not rabidly anti-thread.


"Second, sprinkling in the use of some more mutexes wouldn't make a significant performance impact"

The people who tried this in Python would beg to differ.

http://www.artima.com/weblogs/viewpost.jsp?thread=214235

"This has been tried before, with disappointing results, which is why I'm reluctant to put much effort into it myself. In 1999 Greg Stein (with Mark Hammond?) produced a fork of Python (1.5 I believe) that removed the GIL, replacing it with fine-grained locks on all mutable data structures. He also submitted patches that removed many of the reliances on global mutable data structures, which I accepted. However, after benchmarking, it was shown that even on the platform with the fastest locking primitive (Windows at the time) it slowed down single-threaded execution nearly two-fold, meaning that on two CPUs, you could get just a little more work done without the GIL than on a single CPU with the GIL."


And my response from the last time that was posted (http://news.ycombinator.com/item?id=2916886):

We have plenty of examples of lock-based code scaling well, even large, non-trivial systems such as the Linux kernel. It used to rely solely on the BKL (big kernel lock) for synchronization, but they have moved to finer-grained, data-structure-level locking. That makes me think that an implementation of Python that uses fine-grained (or at least finer-grained) locks and scales is possible. It's just a question of how feasible it is to transform CPython into such an implementation. Beazley's experiment indicates that the changes may have to be fundamental, such as moving away from reference counting.


I don't think we disagree about the fundamentals, but there are trade-offs there too. C extensions become harder to write, for example. This is fine for something like the JVM, where people stay in JVM-land for the vast majority of things.

I don't see why people are so insistent that other languages need to go down this route. Use one that does if you really want that stuff. Hell, use Ruby; then you can switch between C Ruby and JRuby depending on your needs.


> What about all situations where you want to use threads, but not due to resource constraints? I would assume your answer would be "don't use threads" but you say you're not rabidly anti-thread.

But you can use threads now, with the GIL? The discussion about removing the GIL is a performance issue?

So - my answer would be "sure, if you want to use threads for convenience, go ahead - but note that in this case the GIL isn't a problem".

> Of course, if the runtime knew it was only running one thread, it could ignore much of the synchronization code altogether, but we don't want to pay the tax of a single branch decision, do we?

That's an interesting one. I'd love to see the numbers. Well, if you remove the GIL it would be a branch per elided lock. You may well also find yourself increasing the size of all potentially-shared data structures (to hold the lock). It would be interesting to compare that to the case of uncontended mutexes.



I wasn't really clear: to me a GIL is like the BKL in the reasons it need(ed) to go away, not that it was still kicking around.


Is it really coincidence that two of the most popular scripting languages, Python and Ruby, made this language design choice that you think is incorrect?

It seems more like a case of "worse is better". The GIL lets programmers ignore a lot of parallelism bugs, so it lets people get their regular work done faster at the cost of long-term scalability. That certainly seems to be the "Rails way" and it is a philosophy that has led to popularity.


> it lets people get their regular work done faster at the cost of long-term scalability.

More like it lets them get their work done at the cost of having to use processes or co-routines calling non-GIL code instead of threads.


With most web architectures, your state is outside the process (RDBMS, NoSQL, filesystem, whatever persistent store you're using).

So the main benefit of using threads (easy sharing of in-process state) goes away. For web architectures, concurrent request handling seems to me to be best achieved by multi-process, rather than multi-thread.

There are some minor benefits to threading (e.g. an in-memory cache of global state can be shared amongst multiple threads, rather than being replicated among multiple processes) but:

- there are out-of-process cacheing technologies (memcached, redis etc)

- threading doesn't come "for free", in that you need to pay a performance cost in terms of locking in your interpreter and/or a code complexity cost in terms of access to your shared data structures.
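
To make the trade-off concrete, here's a rough sketch of the two styles (my own; the Dalli client, the server address, and expensive_lookup are assumptions for illustration):

  require 'thread'
  # require 'dalli'   # assumed memcached client gem

  def expensive_lookup(key)
    "value for #{key}"   # stand-in for a db query or computation
  end

  # In-process cache: shared between threads, so it needs explicit locking.
  CACHE = {}
  CACHE_LOCK = Mutex.new

  def local_fetch(key)
    CACHE_LOCK.synchronize { CACHE[key] ||= expensive_lookup(key) }
  end

  # Out-of-process cache: each process keeps its memory private and the
  # shared state lives in memcached (needs a running server).
  # MEMCACHE = Dalli::Client.new('localhost:11211')
  #
  # def external_fetch(key)
  #   MEMCACHE.get(key) || expensive_lookup(key).tap { |v| MEMCACHE.set(key, v) }
  # end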


"With most web architectures, your state is outside the process (RDBMS, NoSQL, filesystem, whatever persistent store you're using)."

I submit that that fact is nearly 100% effect, not cause. If web technologies weren't so bad at handling multiple cores, you'd be more likely to move your web apps up to wherever the "real" computation is taking place, so as to remove a link in the chain. But since that is almost entirely not an option, nobody does it. It doesn't prove it's the "right" way to do it if it's the only way to do it right now.

Using Erlang and Haskell tech, I've actually had some great fun writing applications that are heavily multithreaded and happen to include a web server too, built in. It isn't the solution to every problem, but damn it's fun and nice not to spend so much time marshalling things over a database gap for what is basically no reason. The database goes back to just being persistence and querying, instead of communication as well.


> I submit that that fact is nearly 100% effect, not cause.

Well, there's also the point that your shared state needs to be process-external as soon as you want to scale beyond a single process (e.g. when you need to run on more than one server).

Yes - most sites/apps don't need that, but given that a lot of web frameworks and tooling are set up to make this relatively easy to do, it's nice not to have to rework your app fairly fundamentally once you hit that scaling wall.


Depending on the application, there are some options for process-internal multi-machine shared state - Erlang's mnesia distributed database, for instance.


That's interesting, thanks.


threading doesn't come "for free", in that you need to pay a performance cost in terms of locking in your interpreter and/or a code complexity cost in terms of access to your shared data structures

vs

there are out-of-process cacheing technologies (memcached, redis etc)

If you subject "out-of-process cacheing technologies" to the same measurements (performance cost, code complexity) as threading solutions, what do you find?


> If you subject "out-of-process cacheing technologies" to the same measurements (performance cost, code complexity) as threading solutions, what do you find?

Yes, I think you find that shared state is difficult.

So it's the old processes (private by default, shared as the exception) versus threads (shared by default) split. IMHO, "less shared state" == "simpler". And IMHO, "private by default" == "less shared state".

And you have an easier time of it if the shared state helps you with concurrent access. This is one benefit of RDBMSs (transactions) and also of systems like redis. The fact that your state is external means it is more likely to have been designed to be concurrency-safe.


I don't get this argument. If I replace my memcache logic with an in-memory thread-safe hashmap, it will always be faster. I can't see how it can be slow.

As for 'simpler', as long as you aren't actually implementing the thread-safe map, it is simpler than using memcache too.

So as long as you use well-written shared-state implementations (e.g. Java's concurrent collections), shared state isn't as hard as you make it out to be.


> If I replace my memcache logic with an in-memory thread-safe hashmap, it will always be faster. I can't see how it can be slow.

Because in a threaded application context, you have to either:

- have locking/synchronization on all data structures (performance cost)

OR

- manage which of your data structures are shared and which are not (complexity cost/race bugs)

With external, shared state (RDBMSs, memcached, etc.) you get fast in-process access to your (private) data structures (no performance cost due to locking) and a (mostly) concurrency-safe, explicit datastore.

And as pointed out elsewhere, to scale beyond a single process you need the external state anyway.

There are other approaches to this: software transactional memory, functional programming, the Clojure paradigm, etc. But it's a hard problem and it's not clear these are the right solution at scale.


manage which of your data structures are shared and which are not

That is exactly what you are doing when you decide which of your data structures go in memcache and which don't.


> That is exactly what you are doing when you decide which of your data structures go in memcache and which don't.

Yes, apart from the fact that the ones you don't think about are private. i.e. processes are "private by default" and threads are "shared by default".

Processes give you a safe default, threads give you a dangerous one.

This is the main (only) difference between threads and processes - and it is the important one.


So you concede Antrix's point that multithreading is not slower?

I'd just like to confirm that you are even able to determine when your argument has been refuted, as you seem to think that the proper response is to just pretend it didn't happen and come up with a new reason instead. That is to say, are you a reason factory in support of an internally held belief despite evidence to the contrary, or are you a rational sentient who is in a discourse for the purpose of determining a common truth?


> So you concede Antrix's point that multithreading is not slower?

Slower than what?

I said (and you quoted): threading doesn't come "for free", in that you need to pay a performance cost in terms of locking in your interpreter and/or a code complexity cost in terms of access to your shared data structures

For your reference, I stand by that comment and don't think I've said anything to contradict it. For clarity, this is what I think:

1) a multithreaded (fine-grained locking) interpreter will run a single-threaded workload slower than an interpreter with a GIL (references from elsewhere in this thread): http://www.artima.com/weblogs/viewpost.jsp?thread=214235 http://mail.python.org/pipermail/python-dev/2001-August/0170...

2) the threading programming model imposes a complexity burden on the programmer, since all data structures are shared by default and so they must think about every data structure and whether it can become shared in practice (and so must be concurrency-safe or not)

Basically - either your interpreter locks everything for you (perf cost) or you have to worry about it (complexity cost) or a bit of both.

I don't think that threaded access to an in-process data structure is slower than multi-process access to memcached.

I do think that defaulting to private data and having explicitly shared data is wise and is an easier programming model.

I hope that's clear. Please let me know if you think I've been inconsistent, rude or done anything other than espouse these points in this thread.

ps. I found your last reply rude. Also I'm not trying to say "threads are bad and you are bad for using them". I'm also not attacking your (or anyone else's) integrity.


OK, it's the former then.

Slower than what?

The great thing about HN, and similar discussion systems you'll find on the internet, is that you can read the conversation. So when I say "slower", and reference "antrix", a sentient (or even a reasonable AI) could infer that I was referring to this statement, by antrix:

>If I replace my memcache logic with an in-memory thread-safe hashmap, it will always be faster. I can't see how it can be slow.

And that you replied to by quoting it.

Clearly, by reading the English therein, antrix is comparing the memcache logic with a thread-safe hashmap. So in case you still aren't getting it, the answer to your question "Slower than what?" is "than a thread-safe hashmap". That you are not aware that this was antrix's assertion would explain why your subsequent posts fail to refute it.

So your behavior is not merely that of a response factory, but a response factory with a one-deep context buffer.


The reason I asked "slower than what" is that at no point have I claimed that going to memcached would be faster than a local hash (with locking).

Whereas I have (in this thread) claimed that an interpreter without a GIL (and with fine grained locking) would be slower than one with a GIL (for single-threaded workloads).

I wanted to know which you meant.

You haven't shown (I believe because it's not there) where I claimed that going to memcached would be faster.

And you're being rude and trying to provoke a reaction.

And HN is hiding the reply link because its heuristics have determined that the signal/noise of these posts is likely to be low.

And I agree and so won't reply further in this thread.


That they're massively easier to work with?


Also, don't forget that threads don't scale beyond one machine; so when you want to expand beyond one machine, you need to take care of two levels of scaling: threads and processes. This can be avoided by using multiple processes that communicate or share state some other way in the first place.

Even with one CPU, with many cores the internal synchronization that happens to "simulate" a shared memory space can be expensive: if you're not careful, your cores get clogged ping-ponging memory pages between each other.

It might sound more convenient to use threads instead of processes, but in the end I'm not sure that all the work to remove the GIL (and introduce shitloads of finer-grained synchronization primitives) is worth it.


Threads are important not because you want to scale horizontally, but because you also want to scale vertically, as in having an optimum ratio of performance per watt, CPU core, or MB of RAM.

Threads are much, much more efficient than processes - threads consume less memory (even with COW, VMs with garbage collectors basically prevent COW from being effective), threads have faster context switches, and threads achieve better cache-locality.

And shared memory, which is considered the plague of the software industry along with pointers to memory, is actually a mirror of our current hardware architecture. The fact is that hard disks and SSDs are much slower than RAM, RAM is much slower than L2 cache, which is slower than L1 cache, which is slower than register access. You achieve good performance characteristics when you keep your data in one place, keeping your most-accessed items in the CPU's cache, or at least preventing those items from being pushed out to swap.

Food for thought: the NoSQL solutions so loved by Rails and Node.js developers are mostly written in C/C++ and rely on POSIX threads, with some Erlang here and there.

     might sound more convenient to use threads 
     instead of processes
That's because it IS more convenient to use threads instead of processes in most cases. You can mostly get away with processes only when they don't need to synchronize (i.e. your workload can be processed fully in parallel); when processes do need to synchronize, lots of bad shit can happen, as processes are more unreliable than threads.

Of course, I'm referring to real kernel-managed POSIX threads and processes, not what Erlang does. But then again, Erlang's light-weight processes are just an abstraction over POSIX threads, not processes, as that would have been dumb.


On Linux, there is very little distinction between processes and threads. (Assuming you're using the standard POSIX threads implementation, which you probably are.) They are represented by the same data structure in the kernel (the task_struct), hence they are treated the same by the system. Their context switch overhead is the same. The only real difference is that processes get their own address space, which will incur a higher cost at creation time.

If you're talking about threads implemented by a language runtime (which may execute its own concept of threads on top of kernel threads), then the above may not apply. And on Windows, I understand that there is a real difference between threads and processes, although I have no systems programming experience with it.


The Windows approach to threads is essentially equivalent: both kernels mainly deal with threads that only incidentally belong to processes. While in Linux a process is essentially a special case of a thread, the Windows kernel actually has processes as distinct entities, but that is mostly an implementation detail. To some extent, the NT kernel is even more thread-oriented, as there is more state associated with a thread (and surprisingly, some things that might look like in-kernel process state from a Unix standpoint are actually an illusion provided by Windows userspace components).


Is it really, performance-wise? Does a process switch between two equal processes avoid a TLB flush? (How could it?) I don't know, I'm genuinely curious.


No, you're correct, I forgot about the TLB.


Threads are much, much more efficient than processes - threads consume less memory (even with COW, VMs with garbage collectors basically prevent COW from being effective), threads have faster context switches, and threads achieve better cache-locality.

There is hardly any "context switch" if you run one process per core (compared to one thread per core, it makes no difference).

And yes, you'll save memory if you do everything in one process (even with COW), but don't forget garbage collection with multiple threads is a much more complex beast than single-threaded, resulting in extra overhead.

I'm not denying that threads have some advantages, for example, when parallelizing heavy number-crunching work, but the point I was trying to make is that "they are much more efficient" is not that clear-cut and universally true.

That's because it IS more convenient to use threads instead of processes in most cases

Not necessarily true either. When using threads there is a much larger chance of introducing race conditions and deadlocks and other synchronization fails. It might look more convenient when writing the code, but debugging and supporting it certainly isn't.


There is hardly any "context switch" if you run one process per core

There is hardly any context switch if I don't run any software at all on my server. Zero bugs either. Back to the real world, and the point of this discussion, we run multiple Ruby processes per CPU so that when one process is waiting for the db or memcache, another can be doing useful work on the same CPU.

When using threads there is a much larger chance of introducing race conditions and deadlocks and other synchronization fails

As opposed to dropping the problem into memcache and getting it wrong: your code "works", you get no crashes, but customers lose data, lose posts, or lose money. If you're advocating using memcache because multithreading is too hard for your programmer, then your customers are fucked.


We run multiple Ruby processes per CPU so that when one process is waiting for the db or memcache, another can be doing useful work on the same CPU

That's the trivial case. When waiting for the db or memcache (I/O in general), the GIL is lifted, so it doesn't get in the way. It only gets in the way if multiple threads are inside the Ruby VM doing work.
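
For what it's worth, a minimal sketch of that trivial case (the URLs are just placeholders): under MRI these requests overlap even with the GIL, because the lock is released while each thread blocks on network I/O.

  require 'net/http'
  require 'uri'

  urls = %w[http://example.com/ http://example.org/ http://example.net/]
  threads = urls.map do |u|
    Thread.new { Net::HTTP.get(URI.parse(u)) }  # GIL released during the blocking read
  end
  threads.each(&:join)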


Absolutely agree that you can make more efficient use of resources with threads.

I would say that a reasonable rule of thumb is that if your problem/workload is at the point where you want to scale vertically - or you are at all concerned about processor cache locality - your 1st step would be to avoid using an interpreted language.

i.e. I don't see this as an argument for having a threaded interpreted language.


threads achieve better cache-locality

Why do you say that? What exactly do you mean? Please assume I'm not a server hacker (I'm not, but I AM a computer scientist).

(BTW, not trying to pick - I am genuinely curious if there is something here I wasn't aware of, which I consider very likely.)


I furrowed my brow at that, too, but then I realized what I think he meant. If you use multiple threads that touch the same data in memory, that will yield better locality than if you had to marshal data across the process boundary.


Not sure whether this is what he meant, but context switching between threads does not require flushing the TLB.


Three quotes, hopefully not out of useful context, that seem incongruous:

"Rubinius is about the join JRuby and MacRuby in the realm of GIL-less Ruby implementations"

"I spend my free time working on an alternative Ruby implementation which doesn’t use a GIL (MacRuby)"

"I respect Matz’ decision to keep the GIL even though, I would personally prefer to push the data safety responsibility to the developers. However, I do know that many Ruby developers would end up shooting themselves in the foot"

So developers using GIL-less MacRuby, JRuby, and Rubinius are prone to foot-shooting? I wish they'd blog about this more; I've never once seen a MacRuby or JRuby developer blog saying "I went back to MRI because I needed my Ruby code to be run more safely".


I can't speak for JRuby or Rubinius, but in MacRuby one of the main reasons that the lack of a GIL is not a more serious problem is that MacRuby has the Dispatch library (based on libdispatch, a.k.a. Grand Central Dispatch), which makes working with multiple threads safe again.

If you were to invoke Ruby's Thread library directly in MacRuby, you would find that things get crashy rather quickly!


We run heavily threaded code in JRuby in production, but we also test on MRI because it detects and explicitly fails on deadlocks rather than just locking up.
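
For anyone who hasn't seen it, here's a minimal sketch (mine, not from our codebase) of the kind of failure MRI catches; the exact message varies by version, but it aborts instead of hanging silently:

  require 'thread'

  # If every thread ends up blocked, MRI aborts with a fatal deadlock error.
  queue = Queue.new
  t = Thread.new { queue.pop }  # blocks forever; nothing ever pushes
  t.join                        # main thread blocks too -> MRI reports a deadlock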




