Hacker News new | past | comments | ask | show | jobs | submit login
Dashmap: Fast concurrent HashMap for Rust (github.com/xacrimon)
132 points by indiv0 on March 26, 2020 | hide | past | favorite | 48 comments



If you’re interested in this, you might like this live-coding session implementing Java’s ConcurrentHashMap in Rust: https://youtu.be/yQFWmGaFBjk


Nice! But a bit disappointing that read speed doesn't increase with number of threads. Curious what the bottleneck is here (first thought is it can't be the memory because CHT shows higher performance for reads in the graph). A comparison with a read-only hashmap would be nice here.


Hey! I do know what the issue is I think and it is something that will be resolved with v4 releasing later this year.

It's an atomic architecture specific thing.


I know this is a bit out of date, but I remember there was a port of the Scala Ctrie data structure to Rust using Hazard pointers. I would be quite interested in seeing how the performance of a snapshotable lock-free structure compares to this

https://github.com/ballard26/concurrent-hamt


The Contrie library is in the benchmarks and is a port. That said referring the note in the repo the benchmarks arent 100% scientific atm and will be revamped later this year.


I am curious how this differs from chashmap:

https://gitlab.redox-os.org/redox-os/chashmap

They provide benchmarks, but I would be more interested to know how the implementation differs.


Hi! I am the author of dashmap. CHashMap is essentially a table behind an rwlock where each slot is also behind its own rwlock. This is great in theory since it allows decently concurrent operations providing they don't need to lock the whole table for resizing. In practice this falls short quickly because of the contention on the outer lock and the large amount of locks and unlocks when traversing the map.

dashmap works by splitting into an array of shards, each shard behind its own rwlock. The shard is decided from the keys hash. This will only lock and unlock once for any one shard and allows concurrent table locking operations provided they are on different shards. Further, there is no central rwlock each thread must go thru which improves performance significantly.


I can't speak to the implementation differences between the two, but I know the author of dashmap is relatively active in responding online, so they may show up shortly to explain. In terms of performance comparisons, we're actually working on building a shared benchmarking tool for all of Rust's concurrent maps that you may find interesting: https://github.com/jonhoo/bustle.


I see Jon G referenced in the readme. Is this work based on his livestream series where he ported Java's concurrent Hashmap to Rust?

I love his youtube streams. He's extremely patient and thoughtfully thinks through problems in a similar way to me, making it easy to follow.

Regardless, kudos and great work. I'm sure I'll find a use for this in some of my tokio projects.


Nope, Dashmap is all xacrimon, and came on the scene long before my port. We've been collaborating on writing a shared benchmarking suite over at https://github.com/jonhoo/bustle/ though. For the time being, it looks like Dashmap outperforms the port of ConcurrentHashMap (called "flurry"), often by a significant amount. It seems to be mainly due to the garbage collection scheme flurry uses, but we're still digging into it (maybe you want to come help?).

In any case, I'm glad you enjoy the videos!


Ha "straight from the horse's mouth."

I'd love to help out if I can.


Awesome! Some good places to read up on and join the discussion are https://github.com/jonhoo/bustle/issues/2, https://github.com/jonhoo/flurry/issues/50, and https://github.com/jonhoo/flurry/issues/80. Happy to guide you further there!


Some feedback:

1. Mind the difference between concurrency and parallelism. This is around safe parallel access. Concurrent access can happen on one thread without any synchronization.

2. It’s oftentimes an anti pattern to model thread safety around primitive data structures as opposed to higher level concerns. It forces all data that has to be consistent across thread boundaries to be in this one map. When circumstances change, you might want to have some of that data in different data structures and still provide consistent access to them. This change will be hard when you rely on data structure level thread safety.


“Concurrent data structure” is a well-established term (both in common use and in the academic literature) for a data structure which is safe to access from multiple threads of execution.


I’ve always seen parallelism described as a special case of concurrency where the concurrent processing happens to run physically at the same time, while general case concurrency is about logically running at the same time (whether its actually at the same time or just appears that way).

The use of the word concurrent here is consistent with that definition and the literature, where concurrent data structures are data structures designed to for safe concurrent access (whether they are accessed literally at the same time or not).


The 'concurrent' terminology is correct in this context - that's the term used in the literature.


> This is around safe parallel access

I'm not sold that this semantic game is usually worth playing, but here it is pointless. Concurrent access on the same threads or different threads isn't going to break.

> It forces all data that has to be consistent across thread boundaries to be in this one map

Why would it force that? There is no single technique for concurrency or parallelism and that silver bullet line of thinking is a dead end.

Concurrent data structures are an important part of the puzzle, especially maps and queues. Fork join, data flow, message passing, copying, swapping buffers, read only data, etc. the list goes on. If it was simple it wouldn't be a problem.


> Why would it force that?

Because you don’t get atomic writes across multiple data structures that are each thread safe unless you perform all writes while holding a mutex. If you do that, you don't need data structures to be thread safe on their own.


You don't even need to say "atomic across multiple data structures", "atomic across multiple keys" in is enough as it's likely next level of atomicity you'll require after atomics over single keys.

But if you require that you're in this higher level of transactional semantics that hashmaps don't try to solve (you can still use them as building blocks of course).

I think the furthest hashmap api could go is to do: 1. CaS - simple compare-and-swap on single key 2. CaSS - compare-and-swap on one key and set other key + read two keys tuple

Having 2. would be quite powerful already and you could solve a lot of tasks with it.

Fancy locking over key ranges/key expressions would also be interesting but that starts to be more like database, not hashmap anymore.


It depends how you structure your data. It's possible to have one "last pointer" log for example which is updated synchronously and points at new values in multiple structures which are updated independently/optimistically. This has different timing/properties than using a single mutex.


> unless you perform all writes while holding a mutex

No. Maybe you are not familiar with the concurrency data structures community. Many techniques are invented and used in concurrency data structures, e.g., intrinsic CPU atomic instructions, like CAS, that make the data structures "lock-free".

CAS: https://en.wikipedia.org/wiki/Compare-and-swap.


From the comment:

> you don’t get atomic writes across multiple data structures

They're saying that lock-free approaches don't help when you have to ensure consistency across multiple datastructures.


11 uses of "unsafe".

(this is not a critique of this specific library, it's more a look at the rust ecosystem as a whole)

i keep looking at Rust, but at the end it seems it is not a language for me. Rust developers just seem to use more "unsafe" than what i am comfortable with. generally, if there could be a choice between using "unsafe", and taking a 2% performance penalty,i personally would go with the performance-penalty. of course, i can understand others have different priorities. the question is, what are the priorities of the rust ecosystem? i mean, can i find libraries that go with as-safe-as-possible or are most libraries as-fast-as-possible?

also, the claim that the rust language is fast and safe becomes harder to accept when the fast libraries use unsafe :) (i do understand code using "unsafe" can be safe if the developer does not make mistakes. the problem is, developers do make mistakes.)


> Rust developers just seem to use more "unsafe" than what i am comfortable with.

And that's a good thing! In most other languages, what would require an "unsafe" in Rust can be done without any visible marker. The Rust language shines a spotlight in these places, which allowed you to notice them.

> the question is, what are the priorities of the rust ecosystem? i mean, can i find libraries that go with as-safe-as-possible or are most libraries as-fast-as-possible?

From what I've seen, the priorities of the Rust ecosystem tends to be "as-fast-as-possible wrapped into an as-safe-as-possible abstraction". That is, presenting a safe interface around a fast core, which can be audited separately; users of the safe interface do not have to worry about unsafety in the implementation.


I think... you miss the point.

Usage of unsafe is effectively flagging areas for peer review. You can't build everything in safe Rust - certain things _require_ the use of unsafe. Having it gated, reviewed, and so on is effectively a check on a class of bugs that can be hard to pin down.

IMHO, the community cares way, way too much about the mere sight of an unsafe in a codebase - it borders on religious zealotry. It's just a tool like anything else in the (wonderful) language.


>IMHO, the community cares way, way too much about the mere sight of an unsafe in a codebase - it borders on religious zealotry. It's just a tool like anything else in the (wonderful) language.

Strongly agree. I personally find a lot of the people involved in https://github.com/rust-secure-code/safety-dance to be mildly annoying to very unpleasant in their zealotry and snarkiness.


i don't think i missed the point here.

i wrote: "i do understand code using "unsafe" can be safe if the developer does not make mistakes. the problem is, developers do make mistakes."

you wrote: "Usage of unsafe is effectively flagging areas for peer review. You can't build everything in safe Rust - certain things _require_ the use of unsafe. Having it gated, reviewed, and so on is effectively a check on a class of bugs that can be hard to pin down."

it's the same thing. the difference is that you look at it from the glass-half-full point of view (it's good that must-be-verified-by-a-person blocks are limited here), and i do from the other end (it's bad that these blocks are necessary).


I think you missed parent's point. There are constructs that the current compiler can't prove is correct and to write such code you need unsafe. It is often not about a trade-off between speed/safety.


I actually gave a talk about exactly this a few weeks back that may be relevant: https://youtube.com/watch?v=QAz-maaH0KM


oh, i see what you mean. yes, i did somehow miss the "You can't build everything in safe Rust - certain things _require_ the use of unsafe." part of the argument.

i did talk about it in my other comment here: https://news.ycombinator.com/item?id=22701550


You obviously don't understand, unsafe isn't always optional. You literally can't do some things without it. If you tried to build a Vec from scratch without unsafe for example (or using code that uses unsafe) you would fail.

Data structures are a perfect place for judicious use of unsafe.


If you don't mind the performance penalty, then just use Mutex<HashMap>. You don't have to use this.

Rust isn't about completely eliminating all unsafe code. It never has been. It has always been about building safe abstractions with auditable unsafe internal parts.


I'll critique the language. Unsafe is a bad name. It doesn't mean "not safe" it means "cannot be verified by the compiler to be memory safe." Some things are inherently unsafe by that definition, including things necessary for software development. It may be better than other names, but frankly - it's too scary.

In particular, the implementation of data structures. Raw memory allocation, usage of pointers, low level concurrency primitives - none of that can be done without the programmer manually enforcing invariants.

But unsafe isn't wanton abandon. You still have to obey the type system and ownership rules.

As for the goal, it depends on the author. Some prioritize one over the other. In general the Rust community these days tries to optimize the balance of safety and speed with as few compromises as possible - which is fundamentally why the language exists.


unsafe doesn't mean dangerous or hazardous. If we cannot verify safety then it's unsafe. The word's choice seems correct to me. I agree with the rest of your comment though.


Another name that came to my mind was trustme as you as the programmer have to uphold certain garantuees that outside an unsafe block the compiler would enforce.



sorry, what does "to optimize the balance of safety and speed with as few compromises as possible" mean? it's possible to cover many-many situations with that description.


If you're looking for concurrent code that doesn't use `unsafe` you won't find any. To start, implementing `Send` and `Sync` are both unsafe.


Yes and no. That is, they're "auto traits"; they get implemented automatically for anything that can safely implement them. They only don't get implemented for things like raw pointers, or structs that contain raw pointers, and then they're unsafe to implement.


Is that by design, necessity or compiler/language deficiencies? I'm curious if this is something the language hopes to improve on over time in the name of fearless concurrency or it's simply not possible.


I think this statement from the grandparent post might have been misleading:

> If you're looking for concurrent code that doesn't use `unsafe` you won't find any.

The Rust standard library provides a bunch of concurrency primitives. You can write 100% safe Rust code use these primitives to do things concurrently, and Rust guarantees that your resulting program is free of a specific category of bugs: memory errors, data races, etc. No other mainstream language gives you this combination of concurrency safety and efficiency.

The implementation of the concurrency primitives, however, often includes assembly, or C, or unsafe Rust code, which the Rust compiler can't provide guarantees for. We have to rely on humans for that. This is the norm for almost all programming languages.

Ideally, yes, it would be nice if the safe subset of Rust were powerful enough to implement efficient concurrency primitives, but that would add a TON of complexity to the type system.

People are generally ok with the Rust standard library using assembly, C, or unsafe Rust, because even though it can be tricky to write this kind of code correctly, the standard library is maintained by Rust experts who take every change seriously.

People tend to worry more about third party libraries because, on average, the authors are not Rust language experts and may not understand the subtleties.


Yes, by "concurrent code" I meant relatively low level concurrency primitives/data structures not, for example, a web server which happens to use multiple threads. I probably should have been clearer, but I thought it would be clear from the context as the code GGGP was criticizing for using too much `unsafe` is a perfect example of the former.


exactly this. i understand that it's necessary to have raw/unsafe/whatever-is-the-name code when you interact with the outside world, or to implement building blocks of the language. that's not my problem. my problem is when you look at something innocent-looking third-party library like a http-header-parsing library, and it contains unsafe-blocks. again, i understand it will give it a boost of performance. it's just not the tradeoff i would take personally.

[EDIT]: clarified that i mean third-party libraries.


I don't like the "unsafe for me, but not for thee" attitude where "third-party" code is held to a different standard than the compiler/stdlib code. This attitude is especially strange to me when it doesn't distinguish between something relatively high level like your example of parsing http headers, and something quite low level like the concurrent data structure implementation we're discussing here.


Stepping back from concurrency for a second, you can't even implement `RefCell<T>` without `unsafe`. Any method that takes `&self` and performs "interior mutation" is unsafe to implement.

Not sure whether you would consider that to be by "design", "necesity", or a "compiler/language deficiency".


In principle it could perhaps be enforced by the compiler, but it would be very difficult. Whether something is memory-safe to Sync or Send across threads can depend on things like platform-specific atomic operations, thread-local storage, how panics and stack unwinds proceed, etc. Sounds like a nightmare for the compiler to make guarantees about. Maybe when compiling for a specific version of a specific OS, but to do it at the language level would be an incredible feat, and might make the language unusably complex.


Unless you require your programs to be formally verified or use dependent linear types, the best you can do is an effect system, STM and GC. There is no panacea.


What's your point of comparison? C and C++ are all entirely unsafe. At least with Rust you can pinpoint the parts that need scrutiny.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: