
As a sysadmin, I'd rather use NVMe/TCP or Clonezilla and do a slow write than try to go 5% faster with more moving parts and a chance of corrupting my drive in the process.

Plus, it'd be a well-deserved coffee break.

Considering I'd be going at GigE speeds at best, I'd add "oflag=direct" to bypass caching on the target. A bog-standard NVMe drive can write at >300 MB/s unhindered, so trying to cache is moot.
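Concretely, I mean something like this on the receiving end (assuming the netcat-into-dd pipeline from the article; the device name is just an example):

  nc -l -p 1234 | gunzip | dd of=/dev/nvme0n1 bs=1M oflag=direct status=progress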

Lastly, parted can do partition resizing, but given the user is not a power user to begin with, it's just me nitpicking. Nice post otherwise.




NVMe/TCP or Clonezilla are vastly more moving parts and more chances to mess up the options, compared to dd. In fact, the author's solution exposes his NVMe to unauthenticated remote write access by any number of clients(!) By comparison, the dd on the source is read-only, and the dd on the destination only accepts the first connection (yours), so no one else on the network can write to the disk.
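For reference, the whole dd approach is roughly this (port, device names and the gzip step are illustrative):

  # destination: listen for a single connection and write it to the local disk
  nc -l -p 1234 | gunzip | dd of=/dev/nvme0n1 bs=1M status=progress
  # source: the disk is only ever opened read-only
  dd if=/dev/nvme0n1 bs=1M | gzip | nc <destination-ip> 1234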

I strongly recommend against oflag=direct as in this specific use case it will always degrade performance. Read the O_DIRECT section in open(2). Or try it. Basically, using oflag=direct locks the buffer, so dd has to wait for the block to be written by the kernel to disk before it can start reading data again to fill the buffer with the next block, thereby reducing performance.


> the author's solution exposes his NVMe to unauthenticated remote write access by any number of clients(!)

I won't be bothered in a home network.

> Clonezilla are vastly more moving parts

...and one of those moving parts is image and write integrity verification, allowing byte-by-byte verification during imaging and after the write.

> I strongly recommend against oflag=direct as in this... [snipped for brevity]

Unless you're getting a bottom-of-the-barrel NVMe, they all have DRAM caches and do their own write caching independently of O_DIRECT, which only bypasses the OS caches. Unless your pipe has higher throughput than your drive, caching in the storage device's controller ensures optimal write speeds.

I can hit the theoretical maximum write speeds of all my SSDs (internal or external) with O_DIRECT. When the pipe is fatter or the device can't sustain those speeds, things go south, but that's why we have knobs.

When you don't use O_DIRECT in these cases, you might see an initial speed surge, but the total time doesn't go down.

TL;DR: When you're getting your data at 100 MB/s at most, using O_DIRECT on an SSD with 1 GB/s write speeds doesn't affect anything. You're not saturating anything in the pipe.

Just did a small test:

    dd if=/dev/zero of=test.file bs=1024kB count=3072 oflag=direct status=progress 
    2821120000 bytes (2.8 GB, 2.6 GiB) copied, 7 s, 403 MB/s
    3072+0 records in
    3072+0 records out
    3145728000 bytes (3.1 GB, 2.9 GiB) copied, 7.79274 s, 404 MB/s
Target is a Samsung T7 Shield 2TB, with 1050 MB/s sustained write speed. The bus is USB 3.0 with a 500 MB/s top speed (so I can go at 50% of the drive's speed). The result is 404 MB/s, which is fair for the bus.

If the drive didn't have its own cache, caching on the OS side would have a more profound effect, since I could queue more writes to the device and pool them in RAM.


Your example proves me right. Your drive should be capable of 1000 MB/s but O_DIRECT reduces performance to 400 MB/s.

This matters in the specific use case of "netcat | gunzip | dd" as the compressed data rate on GigE will indeed be around 120 MB/s but when gunzip is decompressing unused parts of the filesystem (which compress very well), it will attempt to write 1+ GB/s or more to the pipe to dd and it would not be able to keep up with O_DIRECT.

Another thing you are doing wrong: benchmarking with /dev/zero. Many NVMe drives do transparent compression, so writing zeroes is faster than writing random data and thus not a realistic benchmark.
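Something like this would be more representative (the file names and size are just an example):

  # generate ~3 GB of incompressible test data once (slow, but only done once)
  dd if=/dev/urandom of=random.bin bs=1M count=3072
  # re-run the same write test with data that can't be compressed away
  dd if=random.bin of=test.file bs=1M oflag=direct status=progress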

PS: to clarify, I am very well aware that not using O_DIRECT gives the impression initial writes are faster, as they just fill the buffer cache. I am talking about sustained I/O performance over minutes, as measured with, for example, iostat. You are talking to someone who has been doing Linux sysadmin and perf optimizations for 25 years :)
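For example, leaving this running on the destination during the copy shows the actual sustained write rate:

  # extended per-device stats in MB/s, refreshed every 2 seconds
  iostat -xm 2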

PPS: verifying data integrity is easy with the dd solution. I usually run "sha1sum /dev/nvme0nX" on both source and destination.

PPPS: I don't think Clonezilla is even capable of doing something similar (copying a remote disk to local disk without storing an intermediate disk image).


> Your example proves me right. Your drive should be capable of 1000 MB/s but O_DIRECT reduces performance to 400 MB/s.

I noted that the bus I connected the device to has a theoretical bandwidth of 500 MB/s, no?

To cite myself:

> Target is a Samsung T7 Shield 2TB, with 1050 MB/s sustained write speed. The bus is USB 3.0 with a 500 MB/s top speed (so I can go at 50% of the drive's speed). The result is 404 MB/s, which is fair for the bus.


Yes, USB 3.0 is 500 MB/s, but are you sure your bus is 3.0? That would imply your machine is 10+ years old. Most likely it's 3.1 or newer, which is 1000 MB/s. And again, benchmarking with /dev/zero is invalid anyway, as I explained (transparent compression).


No, it wouldn't imply the machine is 10+ years old. Even a state-of-the-art motherboard like the Gigabyte Z790 D AX (which became available in my country today) has more USB 3 gen1 (5Gbps) ports than gen2 (10Gbps).

The 5Gbps ports are just marketed as "USB 3.1" instead of "USB 3.0" these days, because USB naming is confusing and the important part is the "gen x".


To be clear for everyone:

USB 3.0, USB 3.1 gen 1, and USB 3.2 gen 1x1 are all names for the same thing, the 5Gbps speed.

USB 3.1 gen 2 and USB 3.2 gen 2x1 are both names for the same thing, the 10Gbps speed.

USB 3.2 gen 2x2 is the 20Gbps speed.

The 3.0 / 3.1 / 3.2 are the version number of the USB specification. The 3.0 version only defined the 5Gbps speed. The 3.1 version added a 10Gbps speed, called it gen 2, and renamed the previous 5Gbps speed to gen 1. The 3.2 version added a new 20Gbps speed, called it gen 2x2, and renamed the previous 5Gbps speed to gen 1x1 and the previous 10Gbps speed to gen 2x1.

There's also a 3.2 gen 1x2 10Gbps speed but I've never seen it used. The 3.2 gen 1x1 is so ubiquitous that it's also referred to as just "3.2 gen 1".

And none of this is to be confused with type A vs type C ports. 3.2 gen 1x1 and 3.2 gen 2x1 can be carried by type A ports, but not 3.2 gen 2x2. 3.2 gen 1x1 and 3.2 gen 2x1 and 3.2 gen 2x2 can all be carried by type C ports.

Lastly, because the 3.0 and 3.1 spec versions only introduced one new speed each, and because 3.2 gen 2x2 is type C-only, it's possible that a port labeled "3.1" is 3.2 gen 1x1, a type A port labeled "3.2" is 3.2 gen 2x1, and a type C port labeled "3.2" is 3.2 gen 2x2. But you will have to check the manual / the actual negotiation at runtime to be sure.
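On Linux, the negotiated speed is easy to check at runtime:

  # the number at the end of each line is the negotiated link speed
  # (480M = USB 2.0, 5000M = gen 1, 10000M = gen 2, 20000M = gen 2x2)
  lsusb -t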


> There's also a 3.2 gen 1x2 10Gbps speed but I've never seen it used.

It's not intended to be used, by design. Basically, it's a fallback for when a gen 2x2 link fails to operate at 20Gbps speeds.


I didn't mean 5 Gbps USB ports have disappeared, but rather: most machines from the last ~10 years (~8-9 years?) have some 10 Gbps ports. Therefore, if he is plugging a fast SSD into a slow 5 Gbps port, my assumption was that he has no 10 Gbps port.


TIL they have been sneaking versions of USB in while I haven't been paying attention. Even on hardware I own. Thanks for that.


I wonder how using tee to compute the hash in parallel would affect the overall performance.


On GigE or even 2.5G it shouldn't slow things down, as "sha1sum" on my 4-year-old CPU can process at ~400 MB/s (~3.2 Gbit/s). But I don't bother to use tee to compute the hash in parallel because after the disk image has been written to the destination machine, I like to re-read from the destination disk to verify the data was written with integrity. So after the copy I will run sha1sum /dev/XXX on the destination machine. And while I wait for this command to complete I might as well run the same command on the source machine, in parallel. Both commands complete in about the same time so you would not be saving wall clock time.

Fun fact: "openssl sha1" on a typical x86-64 machine is actually about twice as fast as "sha1sum" because its code is more optimized.
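Easy to check on any big file (the file name is just an example); run each command twice so the file is in the page cache and you're measuring hashing, not disk reads:

  time sha1sum disk.img
  time openssl sha1 disk.img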

Another reason I don't bother to use tee to compute the hash in parallel is that tee writes with a pretty small block size by default (8 kB), so for best performance you don't want to pass /dev/nvme0nX as the argument to tee. Instead, you would use the fancy >(...) shell syntax to pass tee a file descriptor that is sha1sum's stdin, and then pipe the data to dd to give it the opportunity to buffer writes in 1 MB blocks to the NVMe disk:

  $ nc -l -p 1234 | tee >(sha1sum >s.txt) | dd bs=1M of=/dev/XXX
But rescue disks sometimes have a basic shell that doesn't support fancy >(...) syntax. So in the spirit of keeping things simple I don't use tee.


It's been over 10 years since I had to do such operations regularly, over rather unreliable networks to Southeast Asia and/or onto SD cards, so calculating the checksum on the fly every time was important.

Instead of the "fancy" syntax I used

   mkfifo /tmp/cksum
   sha1sum /tmp/cksum &
   some_reader | tee /tmp/cksum | some_writer
Of course, under the conditions mentioned, throughputs were moderate compared to what was discussed above, so I don't know how it would perform with a more performant source and target. But the important thing is that you only need to pass the data through the slow endpoint once.

Disclaimer: from memory and currently untested. Not at the keyboard.


> ...and one of those moving parts is image and write integrity verification, allowing byte-by-byte verification during imaging and after the write.

dd followed by sha1sum on each end is still very few moving parts and should still be quite fast.


Yes, in the laptop and one-off case, that's true.

In a data center it's not (that's when I use Clonezilla 99.9% of the time, tbf).


I don't see how you can consider the NVMe over TCP version to have fewer moving parts.

dd is installed on every system, and if you don't have nc you can still use ssh and sacrifice a bit of performance.

  dd if=/dev/foo | ssh dest@bar "cat > /dev/moo"


NVMe over TCP encapsulates the remote device and shows it to me as is. Just a block device.

I just copy that block device with "dd", that's all. It's a dumb pipe encapsulated in TCP, which is already battle-tested enough.

Moreover, if I have a fatter pipe, I can tune dd for better performance with a single command.
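Roughly the flow I'm describing, assuming the remote machine already exports its namespace over NVMe/TCP (IP, port, NQN and device names are illustrative):

  # discover and attach the remote namespace; it appears as a local block device
  nvme discover -t tcp -a 192.168.1.10 -s 4420
  nvme connect -t tcp -a 192.168.1.10 -s 4420 -n nqn.2024-01.example:disk0
  # from here it's a plain dd, tuned with bs/oflag as the pipe allows
  dd if=/dev/nvme1n1 of=/dev/nvme0n1 bs=4M status=progress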


netcat encapsulates data just the same (although in a different manner), and it's even more battle-tested. The NVMe over TCP use case is actually using the remote disk over the network as if it were local. If you just need to dump a whole disk, like in the article, dd+netcat (or even just netcat, as someone pointed out) will work just as well.


NVMe over TCP encapsulates the entire NVMe protocol in TCP, which is way more complex than just sending the raw data. It's the opposite of "a dumb pipe encapsulated in TCP"; that's what the netcat approach would be. Heck, if you insist on representing the drive as a block device on the remote side, you could just as well use NBD, which has about as many moving parts as NVMe over TCP but is still a simpler protocol.
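A rough sketch of that NBD route, assuming qemu-nbd on the source and the nbd kernel module on the destination (export name, port and device paths are illustrative):

  # source: export the disk read-only over NBD
  qemu-nbd --read-only -p 10809 -x src -f raw /dev/nvme0n1
  # destination: attach the export as /dev/nbd0, then copy it
  modprobe nbd
  nbd-client <source-ip> -N src /dev/nbd0
  dd if=/dev/nbd0 of=/dev/nvme0n1 bs=1M status=progress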



