Hacker News new | past | comments | ask | show | jobs | submit login
Show HN: KySync – A C++ rewrite of Zsync with performance boost (kyall.notion.site)
118 points by kyotov on Sept 19, 2021 | hide | past | favorite | 35 comments



[Intro](https://kyall.notion.site/KySync-v1-0-29eaad446308449a8c9a23...)

KySync is an efficient way to distribute new data which makes use of older (but similar) data that you may already have present locally. KySync supports HTTP v1.1, but can easily be extended to support any server protocol which supports range queries.

KySync is [released](https://github.com/kyotov/kysync/releases) under the MIT Open Source License (see [COPYING](https://github.com/kyotov/ksync/blob/master/COPYING) in root of repository).

KySync is a full rewrite of [Zsync](http://zsync.moria.org.uk/) in modern C++. While no code was reused from Zsync, the awesome [Zsync technical paper](http://zsync.moria.org.uk/paper200503/) was the major resource used for the implementation of KySync.

The value proposition of KySync over Zsync is that it takes advantage of modern architecture features (multi-core multi-CPU systems as well as exceedingly fast IO subsystems, e.g. NVMe SSDs). KySync is 3x-10x (or more) faster than Zsync on such commonly available modern hardware. We have not spent much time optimizing KySync single-thread performance, so there are cases where with sufficiently high similarity, Zsync is faster when less then 4 threads are used in KySync.


Great work and great summary!


Having been around the block optimizing things for throughput - the fact that a multi-threaded "A" is slower than a single-threaded "B" when A's thread count is low is a red flag.

I skimmed through kysync docs and I don't see any meaningful discussion of this aspect.

If you are to start with B and then parallelize its bottlenecks (where possible), you will normally see performance gradually increasing with the thread count, then plateauing and only then, possibly, starting to drop.

But if A with a single thread is outright worse than B, it means that A's per-thread overhead is higher. In case of kysync it appears to be 3-5 times as high as zsync's. That's immense!

Similarly, if A with a low thread count is worse than B, then it may be an issue with A's sync/threading model (prolonged contended waits, etc.).

In either case, it's a sign that there's something inherently inefficient with A's implementation... even if spawning 16 threads helps offsetting it and beating B's single thread.


I should have made my intentions clearer from the start :)

KySync started as a pet project to brush up on my C++ skills which I had not used for 20 years and was going to need for a job that I started last week! I had spent my fair share of time optimizing single threaded performance back in grad school, and was looking for different experiences.

That said, and as we mentioned in the write up, we had some improvements cooking that we did not put in v1.0. I spent some time tonight to merge them in and run some performance experiments.

Single threaded performance is now much improved and practically on par with Zsync. I released it as v1.1.

https://kyall.notion.site/KySync-v1-1-dd9931f330f241469d3e60...

We will of course keep looking for more performance, but if you have any suggestions for further improvements, please share -- or join us!

Best, Kamen

P.S. Both of these are contributed by Chaim Mintz, whom I worked with in a previous life.


The performance claim seems fishy to me.

Within a single physical system ("multi-core multi-CPU" or with NVMe) - you rarely use something like rsync, zysnc or keysync: Your files are already local, on your system. If you want a second copy of a file on the same system, you would symlink or hard-link, or use other filesystem mechanisms. It's not even clear a clean copy would be slower than doing a bunch of comparisons.

On the other hand, between remote systems, the "modern day architecture features" mostly don't apply. I suppose a more clever use of modern kernels could help performance somewhat. Maybe.


We should make a clarification... The intent of KySync (as well as Zsync) is to use across systems, not on a single system. KySync supports HTTP (like Zsync) as well as HTTPS (which Zsync does not).

The primary reason to do the performance comparison on a single system, is so that the results are easy to replicate with as little setup as possible. Because we do this for both KySync and Zsync it is apples to apples.

HTTP bandwidth and storage cost money, and this is self funded project, so I can't afford to put up test files of data publicly visible to the world.

One thing we can look into is leverage AWS/S3 to upload some data and use it for a performance experiment, but that will need some logistics for the developer to set their AWS account properly. Will look into it.

Of course, the more similar the files are, the closer the remote results will be to this first set we published.


So, are you claiming a 3x-10x performance improvement...

* Within the same system?

* Over close-by systems on a LAN segment?

* To typical far machines over the Internet?


I can see your point. We really need to run some experiments in S3 to satisfy this point rigorously.

Intuitively, the more similar the files are, the more 3x-10x+ will be representative in the real world. As the files become more and more different, one is of course at the mercy of the bandwidth between the computers. If you need to transfer 1 GiB differences over 1 MiB/s connection, it will take ~1000 seconds -- no magical way around it.

The comparison so far is practically on the computational cost of the sync, which can be significant as differences pile up.


Could be useful in the backend infrastructure stuff if you have beefy machines processing lots of IO. There could be myriads other use cases. Hell, some people have 10gbs machines at home. Even modern CPUs, definitely mobile, can struggle to hit line rate in certain applications like large file transferring unless you’ve got a beefier machine.


> in this case it takes 4-8 threads or so for KySync to match Zsync's performance.

Assuming zsynch is single threaded, in an apples/apples comparison (1 thread) zsynch wins by over 10x. I can't help but wonder how much faster kysynch would be if they focused more on single threaded performance. The way things stand, kysynch's implementation sounds rather wasteful I'm terms of CPU resources, in spite of what this announcement claims. Still, kudos on this impressive project!


Given that it runs client side pairing with plain rsync servers, meant to efficiently support mass file distribution, using multiple cores but less efficiently to gain in overall performance seems a good tradeoff.

One thing most client/home systems have in abundance most the time is spare cores.


Dozens of them?

The huge speed-ups for KySync are when it has 16+ threads on a 96 core system. But if my computer has two cores, it is not in fact going to efficiently run 16 CPU banging threads, and actually it might choke badly and I should rather run the dual thread KySync, which is much less impressive.

Some problems in this area (data synchronization) look embarrassingly parallel, which means that sure, if you can run more at once you win, but, there's no interesting problem, you could also run four copies of Zsync, "embarrassingly parallel" means the concurrency problem is trivial, in this case each of those four maybe handles a separate file or region of the file, it's embarrassing because all the clever techniques you learned in class aren't important, the problem is so easy.

I can see it wasn't the point of this work, but from an inquisitiveness point of view my question even after just one chart is, "What happens if we run this on a much more modest VM?". Do the extra threads trample on each other? Are they automatically suppressed?


> The huge speed-ups for KySync are when it has 16+ threads on a 96 core system.

Only if you cherry pick the results and look only at 128MB files. At 512MB files, there is parity at 4 threads. At 2GB files there is almost parity at 2 threads, and significant gains at 4 threads.

> But if my computer has two cores

Then your CPU wasn't sold to you in recent years, as entry level desktop CPUs ship with 4 cores now and have for a couple years.

That said, if you're actually in this situation, then you just run zsync if it's faster. The existence of kysync isn't going to make it disappear or prevent you from using it, any more than zsync prevented rsync from being used.

I'm not sure the point you're trying to make. A new implementation of an existing protocol came out that allows for speed gains if certain criteria are met, which includes throwing more compute at it. It feels like you think this is a bad thing for some reason?



Awesome improvements, congrats!


I wonder whether they have an explanation for this. Probably more overhead in the abstractions necessary for parallel processing?


I can't say anything specifically about this case, but generally, sometimes the whole architecture has to be changed in order to support concurrency (especially in the "millions of mutexes" kinds of problems). And such alternative architectures are often inherently inefficient in serial execution.


Instead of an explanation, we decided to address some of the reasons :)

https://kyall.notion.site/KySync-v1-1-dd9931f330f241469d3e60...

Enjoy!


To save others the hassle

Zsync improves on rsync by

supports handling of some compressed formats. target can be http. offload more work to the client


My reading is the server with zsync can be any http server, client is zsync. Rsync on the other hand needs to use rsync on the server, and it scales badly because it does most work on the server.


That's exactly it. Zsync works only once per file, creates a signature in a second file and that's it. They are both distributed with an HTTP server, the most scalable way of centrally distributing data. The client fetches the signature, checks what part is missing and downloads it with Range: headers.


While I found the concept interesting, attempting to click through and scroll using spacebar/pgup/pgdn just doesn't work, and arrow keys do something very different from normal, both in Firefox and Chrome, which makes me think it's a deliberate choice.

Please consider not breaking how many people browse web sites.


Are you talking about the Notion site? I don’t think this is under his control. In case you’re not Tried it Notion is a note taking app (it’s pretty good).


Indeed... but thanks for the feedback.

Since the content of the page is pretty stable now, I can probably host it in more normal way than Notion.


I was also somewhat frustrated by notion's UI choices and wonder if as a quick hack you could get it into markdown and shove it in a gist (trying to optimise for "annoy me less with as little effort as possible for you" here ;)


Yup! Will be looking at that one of the following nights.

It is so convenient to edit in Notion, but I share the frustration about paging and navigation for reading!


Very cool, thanks for sharing. I did a deep dive in the past into various syncing/binary diff protocols and really liked zsync. It was probably my top choice for the application I was designing but I ended up not using it. The library I did use is called bita: https://github.com/oll3/bita. It is inspired by the same family of projects as zsync. The main advantage I found with bita is that the core logic is encapsulated in a library so that you don’t only have to use the binaries but can integrate it directly into an application. I’d be curious to know if that’s in the plans for KySync.


Thanks for the pointer -- I will look into comparing against bita!


We have released a v1.1 with improved performance!

In summary, we see:

- 10%-100%+ performance improvement with v1.1 on 2 GiB data and 16 threads;

- A significantly improved single thread performance, now on par with Zsync for 2GiB data.

Checkout the whole write-up at https://kyall.notion.site/KySync-v1-1-dd9931f330f241469d3e60...


Any recommendations for best practices in "modern C++"? I have to resurrect an old codebase and would like to refactor it using more modern idioms, but its been a loooong time since I wrote any decent c++.



Would be nice to see comparison with plain rcync.


It's not really apples-to-apples as rsync does work on both the client and the server.

Zsync has a comparison with rsync in isolated cases where it makes sense and it is very close in performance!

http://zsync.moria.org.uk/paper/ch02s08.html


I hope the folks at nextcloud are paying attention!


If there are specific scenarios that need to be optimized for the use case, please get in touch!




Consider applying for YC's W25 batch! Applications are open till Nov 12.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: