Using Rust to Scale Elixir for 11M Concurrent Users (2019)

hugodutka · on Nov 11, 2020

The perfect Hacker News title has finally been crafted.

teraflop · on Nov 11, 2020

As was pointed out last time it was posted: https://news.ycombinator.com/item?id=19947458

ignoramous · on Nov 11, 2020

A glitch in the Matrix?

sdwolfz · on Nov 11, 2020

A glitch in the Matrex: https://github.com/versilov/matrex (sorry I couldn't resist)

ibraheemdev · on Nov 11, 2020

Rust, Elixir, Million, Concurrent ... all that's missing is Decentralized Web :)

drsozesakamoto · on Nov 11, 2020

What about blockchain ?

unmole · on Nov 11, 2020

That would quickly get flagged by the flamewar detector.

js4ever · on Nov 11, 2020

AI? VR/AR?

swat535 · on Nov 11, 2020

Better alternative: Rust leverages the power of FP in Elixir to Scale 11M Users.

dwaltrip · on Nov 11, 2020

And “scale”!

iExploder · on Nov 11, 2020

scalable paradigm disrupting shift on blockchain in rust

callamdelaney · on Nov 11, 2020

Is it web scale?

kilotaras · on Nov 11, 2020

The article's SortedSet [0] is basically a pseudo-BTree with fixed depth = 2 and a fixed max_size for leaves. This gives O(log n) search and O(sqrt n) inserts [1].

That would make insertion at the beginning the worst-case scenario (split + need to move all buckets), but the insert time is actually dominated by summing the bucket sizes to compute the final index.

Spending a little bit of time on research, finding https://en.wikipedia.org/wiki/Order_statistic_tree and just using the G++ implementation would probably yield a better result and less code to support.

G++ has __gnu_pbds, which adds O(log n) "find_by_order" and "order_of_key" to trees, e.g. [2].

[0] https://github.com/discord/sorted_set_nif/blob/master/native...

[1] Technically O(n/max_size + max_size) but we can assume that max_size is selected to be ~sqrt n

[2] https://www.ideone.com/8mzxGR
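To make the cost argument above concrete, here is a minimal Rust sketch of a depth-2 bucketed sorted set. It is not Discord's implementation: the names `BucketedSet`, `insert`, and `at` are hypothetical, and it handles only `i64` keys with no removal. It shows where the O(max_size) shift inside a bucket and the O(n/max_size) walk over bucket sizes come from.

```rust
/// Hypothetical two-level sorted set: a Vec of sorted buckets,
/// each holding at most `max_size` elements.
struct BucketedSet {
    buckets: Vec<Vec<i64>>,
    max_size: usize,
}

impl BucketedSet {
    fn new(max_size: usize) -> Self {
        assert!(max_size >= 2, "buckets must be splittable");
        BucketedSet { buckets: vec![Vec::new()], max_size }
    }

    /// Inserts `value` and returns its index in the whole set,
    /// or None if the value is already present.
    fn insert(&mut self, value: i64) -> Option<usize> {
        // Binary search over buckets: first bucket whose last element is >= value.
        let mut b = self.buckets.partition_point(|bucket| match bucket.last() {
            Some(&last) => last < value,
            None => false,
        });
        if b == self.buckets.len() {
            b -= 1; // larger than everything: goes at the end of the last bucket
        }
        let pos = match self.buckets[b].binary_search(&value) {
            Ok(_) => return None,
            Err(pos) => pos,
        };
        self.buckets[b].insert(pos, value); // shifts up to max_size elements
        // Final index = sizes of all preceding buckets + position in this bucket.
        // This sum is the O(n / max_size) part.
        let index = self.buckets[..b].iter().map(Vec::len).sum::<usize>() + pos;
        if self.buckets[b].len() > self.max_size {
            // Split the overfull bucket; inserting the new bucket shifts later ones.
            let half = self.buckets[b].len() / 2;
            let tail = self.buckets[b].split_off(half);
            self.buckets.insert(b + 1, tail);
        }
        Some(index)
    }

    /// Returns the element at `index` by walking bucket sizes.
    fn at(&self, mut index: usize) -> Option<i64> {
        for bucket in &self.buckets {
            if index < bucket.len() {
                return Some(bucket[index]);
            }
            index -= bucket.len();
        }
        None
    }
}
```

With `max_size` chosen near sqrt(n), both the in-bucket shift and the bucket-size sum stay around sqrt(n), which matches footnote [1].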

elcritch · on Nov 11, 2020

You can also write Elixir/BEAM NIF extensions in Nim and Zig using Nimler or Zigler, respectively [1,2]. Also I’ve found writing native extensions for the BEAM is simpler than for many other dynamic languages since it relies on immutable data. It’s roughly the same as just a function to deserialize/serialize data with some extra references to GC’ed data. That is especially true with “dirty” NIFs (they allow the functions to run for as long as they want without messing up the BEAM scheduler).

1: https://github.com/wltsmrz/nimler 2: https://github.com/ityonemo/zigler

AlchemistCamp · on Nov 11, 2020

In the case of BEAM NIFs, I find Rust's compiler guarantees especially compelling.

That's because a crashing NIF can take down the whole Erlang VM. If fault-tolerance is what brought you to Erlang/Elixir and OTP to begin with, as it is for many, having the server crash could be a major issue.

elcritch · on Nov 11, 2020

True, Rust has the strongest guarantees. Nim offers similar guarantees overall (checked arrays etc) and Nimler has exception checking enabled so the compiler checks whether you have any possible checkable exceptions. Personally, I use Nim since it’s easier to use on embedded Linux devices by precompiling to C. Rust has some odd dependencies on the target’s C linker but doesn’t check the standard $CC or $LD variables. Zig is interesting but it offers the least memory protection.

secondcoming · on Nov 11, 2020

Stop raining on the rust parade!

jjice · on Nov 11, 2020

The most interesting thing about Discord is their ability to scale. Their product really works well in my experience, and their scale is no joke.

StreamBright · on Nov 11, 2020

They have some really disciplined engineers (some of whom I used to work with). I have learned a lot from these people. Erlang is everybody's secret weapon for high concurrency (like Discord) and Rust has predictable performance that is easy to tune. GC is not an option for many use cases they have. One interesting problem they ran into was the GC in Go and the lack of configuration options for it. I think they are the prime example of using the right tool for the job, whatever it might be, instead of trying to sell you a single programming language that solves everything.

ketzo · on Nov 11, 2020

I've been using Discord on a near-daily basis for 9 years, in servers from 10 to 25,000 people, and I think I can count on two hands the number of actual breakages I have had with the product. It's honestly pretty nuts.

ggregoire · on Nov 11, 2020

> I've been using Discord on a near-daily basis for 9 years

Wikipedia says Discord was released 5 years ago. Am I missing something?

ketzo · on Nov 11, 2020

Hm — maybe that’s in reference to it coming out of beta or something? I can tell you I made my first Discord server in 2015.

johnisgood · on Nov 11, 2020

I have been using it for 2-3 years, and it has had loads of breakages for me. I cannot count them on two hands, I need more. Europe.

Deukhoofd · on Nov 11, 2020

I've seen it go down quite a lot. https://discordstatus.com/ shows at least 5 breakages in the last week.

0xcoffee · on Nov 11, 2020

I agree, I always found it strange that an app aimed at gaming really has better performance than slack/teams with their massive budgets behind them. But I never saw a company that uses discord for communication.

masklinn · on Nov 11, 2020

> I always found it strange that an app aimed for gaming really has better performance than slack/teams

Makes sense to me, its users would be running games alongside the software, and would be quite cross with the software shitting itself in the middle of a co-op game. Somewhat good performance and behaviour is (or at least was originally) absolutely necessary to getting traction, a bunch of checkboxes would not suffice.

zimpenfish · on Nov 11, 2020

> But I never saw a company that uses discord for communication.

Currently working at a place where at least one, if not two, teams use Discord for daily chatting. Official meetings are mostly handled over Zoom but I believe their intra-team stuff tends to happen over Discord.

rkangel · on Nov 11, 2020

My view is that they solved voice communication first - that was their original use case. Reliable multi-party voice is a harder problem than text chat and means that you can't/don't just rely on Electron for everything. Their product is a way to 'sell' some good non-trivial (but not groundbreaking) engineering.

Whereas Slack is successful because it understood its use case well (including how to sell to enterprises).

haar · on Nov 11, 2020

The closest I've found is the Spotify open-source project "Backstage", which uses Discord for communication (https://github.com/backstage/backstage) - I have no idea about the rest of Spotify, mind.

jbirer · on Nov 11, 2020

Budgets are not a substitute for programming talent and willingness to experiment, it seems.

ibraheemdev · on Nov 11, 2020

A lot of opensource teams use Rust for communication

hultner · on Nov 12, 2020

I know that rust is the greatest latest thing but I’m still not quite convinced of its excellence as a communication platform.

ntonozzi · on Nov 11, 2020

It'd be interesting to see it compared to https://doc.rust-lang.org/std/collections/struct.BTreeSet.ht....

Their skiplist description is a little funny — typically a skiplist has one element per leaf node, and you should never need to move items between buckets. Also skiplists have logarithmic layers in the number of elements, so inserting and searching are always bounded by O(log n).

infradig · on Nov 11, 2020

I've been using a bucketized version of skiplist for years now, I thought I invented it! Most recently I used it in https://github.com/infradig/trealla (a Prolog interpreter).

jhgg · on Nov 11, 2020

We needed to support arbitrary index access into the sorted set - which is why stdlib btree map doesn't work.

c0deb0t · on Nov 11, 2020

For binary trees, indexing can be done by saving the subtree size of each node and doing a sort of binary search. Not sure if this is fast for B-trees that have more than 2 children nodes, though.

Lichtso · on Nov 11, 2020

Do you mean OSTs?

https://en.wikipedia.org/wiki/Order_statistic_tree
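The subtree-size idea can be sketched in a few lines of Rust. This is an illustration only, assuming an unbalanced binary search tree over `i64` keys (a real order statistic tree would sit on a balanced tree to keep O(log n) bounds); the names `Node`, `insert`, `select`, and `rank` are hypothetical. `select` and `rank` play roughly the roles of "find_by_order" and "order_of_key" mentioned earlier in the thread.

```rust
/// Hypothetical BST node that caches the size of its subtree.
struct Node {
    key: i64,
    size: usize, // number of nodes in this subtree, including this one
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
}

fn size(node: &Option<Box<Node>>) -> usize {
    node.as_ref().map_or(0, |n| n.size)
}

/// Inserts `key` (duplicates are ignored). Returns true if the key was added,
/// so that ancestors only bump their cached size on a real insertion.
fn insert(node: &mut Option<Box<Node>>, key: i64) -> bool {
    match node {
        None => {
            *node = Some(Box::new(Node { key, size: 1, left: None, right: None }));
            true
        }
        Some(n) => {
            let inserted = if key < n.key {
                insert(&mut n.left, key)
            } else if key > n.key {
                insert(&mut n.right, key)
            } else {
                false
            };
            if inserted {
                n.size += 1;
            }
            inserted
        }
    }
}

/// Returns the k-th smallest key (0-based), steering by subtree sizes.
fn select(node: &Option<Box<Node>>, k: usize) -> Option<i64> {
    let n = node.as_ref()?;
    let left = size(&n.left);
    if k < left {
        select(&n.left, k)
    } else if k == left {
        Some(n.key)
    } else {
        // Skip the left subtree and this node.
        select(&n.right, k - left - 1)
    }
}

/// Returns the number of keys strictly smaller than `key`.
fn rank(node: &Option<Box<Node>>, key: i64) -> usize {
    match node {
        None => 0,
        Some(n) if key < n.key => rank(&n.left, key),
        Some(n) if key > n.key => size(&n.left) + 1 + rank(&n.right, key),
        Some(n) => size(&n.left),
    }
}
```

Each step descends one level and does constant work, so both lookups cost time proportional to the height of the tree.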

ntonozzi · on Nov 11, 2020

Oh, that makes sense. I like the solution you arrived at!

didibus · on Nov 11, 2020

> There’s a fantastic Elixir project called Rustler. It provides nice support on the Elixir and Rust side for making a safe NIF that is well behaved and using the guarantees of Rust is guaranteed not to crash the VM or leak memory

Hum, can someone talk more about these NIFs and what special thing Rustler brings?

What's different from just any kind of native interop, like say Java JNI ?

lawik · on Nov 11, 2020

So there are a few ways to do interop with non-BEAM languages.

Ports are the typical one for simpler things. You start an external command and talk to it via stdin/stdout.

NIFs are potentially much faster and are traditionally in C. But they carry the risk of crashing the VM. Rustler+Rust removes most, likely not all, but most of the risks for screwing up and killing the VM.

So NIFs are usually not recommended simply because the resilience and reliability of the BEAM VM relies on a bunch of cool strategies inside the VM and NIFs side-step all of that. The Rust ones should be far less dangerous.

latch · on Nov 11, 2020

NIFs run without most (all?) of the safety features the erlang VM provides, such as memory safety and scheduling. If the NIF crashes, the entire app (aka, VM) crashes.

This is anathema to some...i dunno..philosophies? expectations? of Erlang/Elixir.

NIFs written in Rust (via Rustler) should never be able to crash the VM. Unlike NIFs written in C.

Rustler brings other quality of life improvements, like easier interop (decoding/encoding values to and from) and managing the cleanup of objects that live in Rust but are referenced by Elixir.

pjmlp · on Nov 11, 2020

So what happens to the VM when an index out of bounds happens in Rustler?

Is there a panic handler that prevents a crash and returns a magic value that makes sense to the caller?

filmor · on Nov 11, 2020

The called code is made unwind-safe at the FFI boundary using `std::panic::catch_unwind`. A panic will be converted to an Erlang exception.
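As a rough illustration of that pattern (not Rustler's actual code), the sketch below wraps a call in `std::panic::catch_unwind` and turns an unwinding panic into an `Err` value. The names `get` and `call_guarded` are hypothetical, and a real NIF wrapper would build an Erlang exception term rather than a `String`.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical stand-in for a NIF body: panics on an out-of-bounds index.
fn get(items: &[i64], index: usize) -> i64 {
    items[index]
}

/// Hypothetical boundary wrapper: runs `f` and converts an unwinding panic
/// into an Err carrying the panic message, instead of letting it cross
/// the FFI boundary.
fn call_guarded<T>(f: impl FnOnce() -> T) -> Result<T, String> {
    panic::catch_unwind(AssertUnwindSafe(f)).map_err(|payload| {
        // Panic payloads are usually a &str or a String.
        if let Some(s) = payload.downcast_ref::<&str>() {
            s.to_string()
        } else if let Some(s) = payload.downcast_ref::<String>() {
            s.clone()
        } else {
            "unknown panic".to_string()
        }
    })
}
```

Calling `call_guarded(|| get(&items, 10))` on a three-element slice returns an `Err` instead of tearing down the process, provided the build uses the default unwinding panic strategy. The default panic hook still prints the message to stderr.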

pjmlp · on Nov 11, 2020

Except it doesn't work always,

> Note that this function may not catch all panics in Rust. A panic in Rust is not always implemented via unwinding, but can be implemented by aborting the process as well. This function only catches unwinding panics, not those that abort the process.

filmor · on Nov 11, 2020

It works whenever it's configured to unwind, which is the default. Apparently one can force this by adding a few lines in the rustler Cargo.toml (https://doc.rust-lang.org/edition-guide/rust-2018/error-hand...), I'll create a PR.

c-cube · on Nov 11, 2020

If you compile a NIF you probably will not go out of your way to write "panic = abort" in your project. Unwinding is the default behavior.

pjmlp · on Nov 11, 2020

It depends on the dependencies.

bluejekyll · on Nov 11, 2020

From the docs filmor linked to, if a library sets panic = unwind and a user sets that to abort, “If any of your users choose to abort, they'll get a compile-time failure.”

pjmlp · on Nov 11, 2020

I was thinking more about libraries that choose to abort, and already compiled.

c-cube · on Nov 23, 2020

It's considered bad practice, as far as I know. The executable should decide that, not library.

latch · on Nov 11, 2020

We use rustler only as a thin wrapper to existing crates (like, for hdrhistograms). I'm far from a Rust expert, but I believe it uses std::panic::catch_unwind

pjmlp · on Nov 11, 2020

> Note that this function may not catch all panics in Rust. A panic in Rust is not always implemented via unwinding, but can be implemented by aborting the process as well. This function only catches unwinding panics, not those that abort the process.

bluejekyll · on Nov 11, 2020

I can’t speak to Rustler, but the JNI bindings for Rust are quite good. There are a lot of areas where you can still shoot yourself in the foot, lots of unsafe. But there are patterns they encourage that are very safe, like associating lifetimes with Java objects that are instantiated in the Rust code appropriately. It also has good interfaces for working with arrays and avoids some of the foot guns in traditional C JNI libraries.

It’s not quite possible to have no unsafe code in JNI with Rust (in my experience) but it is very easy to isolate to very small areas of your code.

toast0 · on Nov 11, 2020

I'm not sure if the other answers were good enough.

NIF is native interop. There's a bunch of provided functions for interfacing with Erlang terms and other things like that. But, it's native interop, so you can also do whatever crazy stuff you want, even if it's a bad idea.

The BEAM VM cannot and does not protect you from bad NIFs; but if you follow the guidelines and are careful not to crash, you can have a good time. Often, the scope of the native work ends up so small, it's easyish to keep it safe. For larger scope, it's more common to use a Port or a C-node; a Port is just what Erlang calls running your other code in a separate OS process communicating over stdin/stdout, and a C-node is your other code running separately, possibly on another machine, connected via network sockets, acting like a dist node. For this application though, it makes more sense to use a NIF to avoid copying the data to and from a separate OS process.

I haven't used Rustler, but my understanding is it's mostly just a framework of sorts for writing NIFs in Rust. Maybe making it easier to use Erlang terms, and setting up the compile settings etc.

vvanders · on Nov 11, 2020

My brief understanding of NIFs is that they're similar to unsafe in Rust, where a bunch of VM guarantees are only held so long as the NIF is well-behaved. I believe there were also some limits on how long a NIF could execute before starving the Erlang scheduler[1].

[1] http://erlang.org/doc/man/erl_nif.html

rkangel · on Nov 11, 2020

The other answers have covered the technical details well, but not so much the philosophical ones.

You may have heard about how Erlang with the BEAM is a fantastic implementation of the Actor programming model. This involves lots of processes (green threads) holding a little bit of state and communicating with messages. It's both a great concurrency model and a great programming model. The standard library (OTP) then also provides excellent tools for managing these processes in supervision trees.

Ports act like another actor in a supervision tree. This actor happens to be a C program and the messages are sent and received by stdout/stdin, but otherwise it behaves like a process. You can manage its runtime state with supervision trees and all the other excellent tools that BEAM/OTP give you.

NIFs on the other hand are best used to replace something that could be a pure function in Erlang, but is a lot faster in something compiled (or because you want to leverage some existing codebase that does that).
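For the Port side of that comparison, the external program only has to read and write framed messages on its standard streams. The Rust sketch below assumes the port is opened with Erlang's `{packet, 2}` option, so every message carries a 2-byte big-endian length prefix; the function names and the "reverse the bytes" reply are hypothetical placeholders for real work.

```rust
use std::io::{self, Read, Write};

/// Reads one frame: a 2-byte big-endian length, then that many payload bytes.
/// Returns Ok(None) if the input ends while reading the length header.
fn read_frame(input: &mut impl Read) -> io::Result<Option<Vec<u8>>> {
    let mut header = [0u8; 2];
    match input.read_exact(&mut header) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
        Err(e) => return Err(e),
    }
    let len = u16::from_be_bytes(header) as usize;
    let mut payload = vec![0u8; len];
    input.read_exact(&mut payload)?;
    Ok(Some(payload))
}

/// Writes one frame with the same 2-byte big-endian length prefix.
fn write_frame(output: &mut impl Write, payload: &[u8]) -> io::Result<()> {
    if payload.len() > u16::MAX as usize {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "payload too large"));
    }
    output.write_all(&(payload.len() as u16).to_be_bytes())?;
    output.write_all(payload)?;
    output.flush()
}

/// Placeholder message loop: answers each frame with its bytes reversed.
fn serve(input: &mut impl Read, output: &mut impl Write) -> io::Result<()> {
    while let Some(mut msg) = read_frame(input)? {
        msg.reverse();
        write_frame(output, &msg)?;
    }
    Ok(())
}
```

In a real port program, `serve` would be called with locked `std::io::stdin()` and `std::io::stdout()` handles. Because the loop runs in its own OS process, a crash here ends only that process and the VM keeps running.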

aloukissas · on Nov 11, 2020

The little that I know about NIFs is that if they crash, they bring down the entire BEAM, so it's pretty risky if not done properly. But I've seen that Rust complements Elixir (via NIFs) very well.

cultofmetatron · on Nov 11, 2020

OMG, my startup is trying to solve a similar problem. Albeit not at this scale (YET). Looks like I have a good excuse to delve back into rust

cultofmetatron · on Nov 11, 2020

damn, tried to pull this in as a dependency, doesn't look like it compiles on the most recent erlang versions. good reason to dig into the source this weekend.

bobnamob · on Nov 11, 2020

[2019]

transfire · on Nov 11, 2020

I was wondering if Crystal could be used this way too. Found this: https://github.com/splattael/erl_nif.cr

callamdelaney · on Nov 11, 2020

I believe you can use almost anything as an erlang / elixir nif

brightball · on Nov 11, 2020

The combination of Elixir and Rust has been really fascinating to watch. They compliment each other so well.

StreamBright · on Nov 11, 2020

You might be thinking of "complement". And yes, those do complement each other perfectly. The other scaling article, on how and why they switched from Go to Rust, is pretty mind-blowing too.

mkl · on Nov 11, 2020

This one? https://blog.discord.com/why-discord-is-switching-from-go-to...

Discussed extensively last year: https://news.ycombinator.com/item?id=22238335

StreamBright · on Nov 11, 2020

Yep that one.

StreamBright · on Nov 11, 2020

Old article: May 17, 2019. Title should reflect that.