
Splitting matrix.org into smaller homeservers wouldn't necessarily help (even if we had account migration), as you'd just end up switching client<->server traffic for server<->server traffic. As New Vector (the startup we set up to support Matrix) we're also working on providing homeservers-as-a-service, though, which should help.



That's of course true for the massive rooms, such as #matrix:matrix.org, but if there are enough 1:1 conversations to contribute significantly to the overload on matrix.org, splitting the servers up and 'randomly' assigning users to servers would decrease the load.

That's a number of 'ifs' there, though.


The problem is that the 1:1 conversations are a negligible weight relative to the massive 15,000-user rooms like #matrix:matrix.org - and spreading those 15,000 users over 15,000 servers rather than 1,000 is just going to swap a client<->server API overload problem for a server<->server API overload problem...
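
Back-of-the-envelope arithmetic (a toy sketch under the assumption that each event in a federated room has to be pushed from the originating server to every other participating homeserver, and every user still has to sync it):

    # Toy fan-out numbers for a 15,000-user room -- illustrative only, not
    # measured Synapse figures.
    users = 15_000
    for servers in (1, 1_000, 15_000):
        client_syncs = users            # every user still syncs each event
        federation_sends = servers - 1  # origin server pushes to each peer server
        print(f"{servers} servers: {client_syncs} client syncs, "
              f"{federation_sends} federation sends per event")

The total work per event doesn't shrink as you add servers; it just shifts from the client<->server API onto the server<->server API.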


Fair enough, that's also why I included: "but if there are enough 1:1 conversations to contribute significantly to the overload on matrix.org"

Matrix.org is still running, so no complaints there. You'll figure things out soon enough.


Do you know what exactly the bottleneck is on Synapse? Is it syncing everything to users, or the actual room processing? I wonder how much of an impact JSON (de)serialization has, assuming you don't cache serialized requests.


We do. The main bottleneck is merge resolution when unifying your copy of your room with everyone else’s. If the room starts to fragment due to netsplits or unreliable servers then this can get incredibly resource intensive. https://github.com/matrix-org/synapse/pull/3122 is the fix which switches the algorithm from roughly O(N) to O(1).

For context, a typical Synapse actually only uses around 300MB of RAM. It only spikes up to 1-2GB when trying to resolve state on big rooms like Matrix HQ, and then Python doesn't relinquish the RAM.
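
For a rough intuition of why fragmentation is so expensive (a toy sketch only - this is not Synapse's actual state resolution algorithm or data model), the costly part is re-resolving the full room state across every fork on every pass, which memoising on the set of forks avoids repeating:

    from functools import reduce

    def resolve_two(state_a, state_b):
        # Toy merge rule: for each (type, state_key) pair the deeper event wins.
        merged = dict(state_a)
        for key, event in state_b.items():
            if key not in merged or event["depth"] > merged[key]["depth"]:
                merged[key] = event
        return merged

    def resolve_naive(fork_states):
        # Re-resolves every fork's full state each time it is called: cost grows
        # with both the number of forks and the size of the room state.
        return reduce(resolve_two, fork_states, {})

    _resolution_cache = {}

    def resolve_cached(fork_ids, fork_states):
        # Memoise on the set of fork identifiers, so a fragmented room that keeps
        # presenting the same forks only pays for resolution once.
        key = frozenset(fork_ids)
        if key not in _resolution_cache:
            _resolution_cache[key] = resolve_naive(fork_states)
        return _resolution_cache[key]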

We do cache responses in JSON to avoid serialisation overheads.
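
The shape of that caching is roughly this (a minimal sketch with hypothetical names, not Synapse's actual code): keep the serialised bytes keyed on the response identity, so identical responses skip json.dumps() entirely.

    import json

    _serialised_cache = {}

    def get_response_bytes(cache_key, build_response):
        # Serialise once and reuse the encoded bytes for identical responses,
        # instead of re-running json.dumps() on every request.
        if cache_key not in _serialised_cache:
            _serialised_cache[cache_key] = json.dumps(build_response()).encode("utf-8")
        return _serialised_cache[cache_key]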


Am I reading this right that this is the algorithmic fix, but it still needs a concrete implementation - and then the main server should be back down to "vertically scalable"?

Without going and reading the doc (yet): does this relate to the following?

https://jneem.github.io/merging/

It would seem that a simpler, deterministic merge algo would be possible to parallelize - but I'm not sure if it's easy to match Matrix's idea of merges with what's discussed in that post/paper?


The algorithm is already being implemented in Synapse (but got delayed by dealing with some security bugs). There's also a Rust test jig for playing with merge resolution algorithms at https://github.com/erikjohnston/rust-matrix-state/.

It's not directly related to the Categorical Theory of Patches paper - the merge resolution here is much simpler than reasoning about VCS patches, although the idea of taking a formal mathematical approach is similar :)





