
Perf problems arise with Synapse if those 2000 rooms include massive ones with tens or hundreds of thousands of users (or other state events). The number of local users is fairly irrelevant.

If you grep the logs for state-res you will probably see that some room is consistently chewing resources (these days we explicitly log the worst offenders); the easiest bet is to ask your users to leave that room or use the shutdown API to remove it from the server.
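For reference, removing a room via the Admin API looks roughly like this (a sketch only - the exact endpoint and parameters depend on your Synapse version, so check the Admin API docs; the homeserver URL, admin token and room ID below are placeholders):

    # Rough sketch: block/delete a problem room via the Synapse Admin API.
    # Endpoint and body match recent Synapse versions - check the Admin API
    # docs for yours. The URL, token and room ID below are placeholders.
    import requests

    HOMESERVER = "https://matrix.example.com"
    ADMIN_TOKEN = "<access token of a server admin>"
    ROOM_ID = "!problemroom:example.com"

    resp = requests.post(
        f"{HOMESERVER}/_synapse/admin/v1/rooms/{ROOM_ID}/delete",
        headers={"Authorization": f"Bearer {ADMIN_TOKEN}"},
        json={"block": True, "purge": True},  # block rejoins and purge history from the DB
        timeout=300,
    )
    resp.raise_for_status()
    print(resp.json())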

Otherwise, it may be that there’s so much data flying around due to the busy rooms that the in-memory caches are blowing out. This makes everything slow down as the DB gets hammered, and unintuitively uses more RAM as slow requests stack up. The solution is to increase the cache factor, much as you would on an overloaded DB. We’re currently looking at autotuning caches so you don’t have to do this manually.
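In the meantime, bumping it is a small change in homeserver.yaml (or via the SYNAPSE_CACHE_FACTOR environment variable) - roughly like this, with the exact value being something to experiment with against your available RAM:

    # homeserver.yaml - sketch only; pick a value that fits your RAM
    caches:
      global_factor: 2.0   # multiplier applied to all cache sizes (default is 0.5)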

If it’s still slow, then there’s probably a bug or other weirdness (ancient Python?) - my personal server has about 5 users on a 4-core CPU, uses about 2 GB of RAM without a dedicated DB node, and is in thousands of rooms (including busy ones).

(Also, it hopefully goes without saying that all bets are off if you aren’t on the latest Synapse release - we are constantly landing improvements at the moment; e.g. the auth chain cover algorithm eliminates most of the known perf edge cases in state resolution.)




Thanks for the pointers - I’m on the latest release with Python 3.8, and roughly 2-5 of those rooms are on the larger end of the spectrum.

Sounds like I should tune some caches then - I have memory to spare if it turns out to make a difference.

BTW, I just noticed there is an option to add Redis - would that be a significant improvement compared to just using the in-process caching?


So you’ll want to try dialling up the overall cache factor a bit.

Redis is only useful if you split the server into multiple worker processes, which you shouldn’t need to do at that size (and even then, it doesn’t provide shared caching yet, although there’s a PR in flight for it - we currently just use Redis as a pubsub mechanism between the workers).

Highly recommend hooking up Prometheus and Grafana if you haven’t already, as it will likely show a smoking gun for whatever the failure mode is.
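If metrics aren’t already enabled, it’s just a metrics listener in homeserver.yaml plus a Prometheus scrape job pointed at it - roughly like this (the port and bind address here are arbitrary examples):

    # homeserver.yaml - sketch of exposing Synapse's Prometheus metrics
    enable_metrics: true
    listeners:
      # ...existing client/federation listeners...
      - port: 9000
        type: metrics
        bind_addresses: ['127.0.0.1']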

Are the logs stacking up with slow state-res warnings? Stuff like:

    2021-02-25 23:15:26,408 - synapse.state.metrics - 705 - DEBUG - None - 1 biggest rooms for state-res by CPU time: ['!YynUnYHpqlHuoTAjsp:matrix.org (34.6265s)']
    2021-02-25 23:15:26,411 - synapse.state.metrics - 705 - DEBUG - None - 1 biggest rooms for state-res by DB time: ['!YynUnYHpqlHuoTAjsp:matrix.org (148.6s)']



