In the case of copying files to a mounted file system, I’ve sometimes found it faster to use a tar pipeline than cp when copying data to a USB stick or SD/microSD card.
Instead of:
cp -r ~/wherever/somedir/ /media/SOMETHING/
I would do
cd ~/wherever/
tar cf - somedir/ | ( cd /media/SOMETHING/ && tar xvf - )
And it would be noticeably faster.
Not the same use case as linked article, but wanted to bring this up since it’s somewhat related.
There's also rsync, which works perfectly with local paths and can resume from interruptions (by default), with or without checksumming the content ("-c"). That can be useful for removable storage, which can sometimes be a bit unreliable.
Just take care that cp, tar and rsync each have slightly different handling of extended attributes and sparse files.
(By the way, I believe "tar -C /path" is the canonical way of doing "cd /path ; tar" without resorting to subshells.)
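For what it's worth, here is a minimal sketch of the -C form of that pipeline. The temporary directories are stand-ins I made up for ~/wherever and /media/SOMETHING so the snippet can run anywhere; the real command would be something like tar -C ~/wherever -cf - somedir | tar -C /media/SOMETHING -xvf -.

```shell
# Stand-in directories (assumptions, not the paths from the comment above).
tar_src=$(mktemp -d)
tar_dst=$(mktemp -d)
mkdir -p "$tar_src/somedir"
echo hello > "$tar_src/somedir/file.txt"

# -C makes each tar change directory itself, so no cd and no subshell is needed.
tar -C "$tar_src" -cf - somedir | tar -C "$tar_dst" -xf -

ls -R "$tar_dst"
```

The writer packs somedir relative to the source directory and the reader unpacks it relative to the destination, which is the same effect as the cd/subshell version.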
The first thing that makes it so weird is how it assigns different meanings to paths with and without a trailing slash. That is completely different from how most command line tools I am used to behave in Linux and FreeBSD.
That alone is enough to remind me every time I try to use rsync why I don’t like and generally don’t use rsync.
While rsync is different than cp and mv, I dislike the cp and mv destination-state-dependent behaviour.
With rsync, destination paths are defined by the command itself and the command is idempotent. I don't usually need to know what the destination is like to form a proper command (though oopsies happen, so verify before using --delete).
With cp/mv - the result depends on the presence and type of the destination.
E.g. try running cp or mv, canceling then restarting. Do you need to change the arguments? Why?
mkdir s1 s2 d1
touch s1/s1.txt s2/s2.txt
# this seems inconsistent
cp -r s1 d1 # generates d1/s1/s1.txt
cp -r s2 d2 # generates d2/s2.txt
mkdir s1 s2 d1
touch s1/s1.txt s2/s2.txt
# I don't use this form, but it is consistent
rsync -r s1 d1 # generates d1/s1/s1.txt
rsync -r s2 d2 # generates d2/s2/s2.txt
# Same as above but more explicit
rsync -r s1 d1/ # generates d1/s1/s1.txt
rsync -r s2 d2/ # generates d2/s2/s2.txt
# I prefer this form most of the time:
rsync -r s1/ d1/ # generates d1/s1.txt
rsync -r s2/ d2/ # generates d2/s2.txt
I simply try to use trailing slashes wherever permitted and the result is amply clear.
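To make the cancel-and-restart question concrete, here is a small sketch of the cp case only. It runs in a scratch directory (my assumption, to keep it harmless) and simply issues the same cp command twice.

```shell
cp_demo=$(mktemp -d)
cd "$cp_demo" || exit 1
mkdir s2
touch s2/s2.txt

cp -r s2 d2   # d2 does not exist yet: creates d2/s2.txt
cp -r s2 d2   # d2 exists now: the same command creates d2/s2/s2.txt

find d2 -type f
```

The second run nests a copy inside the first one instead of updating it, which is why the arguments would have to change after an interrupted run.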
If it didn't do that, it would have to add a switch that you'd still have to look up. It's a tool that's most often doing a merge and update, but looks like a copy command. I think that made it friendlier.
How would you separate "merge this directory into that one and update files with the same name" from "copy this directory into that directory"?
Halt with an error message explaining the ambiguity and the proper switch to use.
Same principle as rm requiring --no-preserve-root if you actually want to nuke / - right now it's way too easy to accidentally and destructively do the wrong thing.
> How would you separate "merge this directory into that one and update files with the same name" from "copy this directory into that directory"?
I would make "copy this directory into that directory" out of scope for the tool.
Let’s imagine a tool similar to rsync, but less confusing, and more in tune with what I want to do, personally. There are for sure a bunch of things that rsync can do, that this imagined tool can’t. That’s fine by me.
Let’s call the tool nsync.
It would work like this:
nsync /some/src/dir /some/dest/dir
And running that would behave exactly the same with or without trailing slashes.
I.e., the above and the following would all be equivalent:
nsync /some/src/dir/ /some/dest/dir/
nsync /some/src/dir/ /some/dest/dir
nsync /some/src/dir /some/dest/dir/
And what would this do? It would inspect source and dest dirs. Then it would copy files from source to dest for which last modified was greater in source dir than in dest, or where files in source dir did not exist in dest dir.
In other words, it would overwrite files in dest that had an older last-modified timestamp than their counterpart in source, and it would copy files that did not exist in dest.
Like rsync it would also work with ssh (scp/sftp). Maybe some other protocols too, but only if those other protocols supported the comparisons we need to make. Prefer fewer protocols, and this way of working over trying to be the subset of what works across a gazillion protocols.
If a file exists in dest but not in source dir, it is kept untouched in dest. Not deleted. Not copied back to source dir.
Then there would be one other mode; destructive mode. The flag for it would be -d.
nsync -d /whatever/a/b/c/ /wherever/x/y/z/
This would work similarly to the normal mode, but it would remove any files in dest dir that are not present in source dir. Before actually deleting anything it would list all of the files that will be deleted and ask for keyboard confirmation. [y/N] so that you have to explicitly hit y and then enter. Enter alone will be interpreted as no.
You would be able to override the confirmation with the -y argument.
nsync -dy /whatever/a/b/c/ /wherever/x/y/z/
And that’s it. That’s what I would want rsync to be for me.
There probably are some programs that behave exactly like this. I’ll eventually write one too. It’ll have a user base of 1. Me.
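A rough sketch of what the default (non-destructive) mode could look like as a shell function. Everything here is hypothetical: nsync is the imagined tool from above, this version only handles local directories, it does not implement -d or -y, and it assumes file names contain no newlines. The -nt test is not strictly POSIX, though bash, dash and ksh all support it.

```shell
# nsync SRC DEST (hypothetical): copy files that are missing in DEST or
# newer in SRC. Trailing slashes on either argument make no difference.
nsync() {
    src=${1%/}     # strip one trailing slash, if present
    dest=${2%/}
    mkdir -p "$dest" || return 1
    ( cd "$src" && find . -type f ) | while IFS= read -r rel; do
        rel=${rel#./}
        if [ ! -e "$dest/$rel" ] || [ "$src/$rel" -nt "$dest/$rel" ]; then
            mkdir -p "$(dirname "$dest/$rel")"
            # -p keeps the modification time, so a second run copies nothing
            cp -p "$src/$rel" "$dest/$rel"
        fi
    done
}
```

Calling nsync /some/src/dir /some/dest/dir and nsync /some/src/dir/ /some/dest/dir/ should behave the same. Files that exist only in dest are never touched, because the loop only walks the source tree.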
Lost too much data that way, trailing slash with delete. I still feel bitter; the UX is terrible considering the effects are so different from copy and delete.
I've found that when there are many files but the overall size is small, i.e. many, many small files, archiving, copying and unpacking works faster. Perhaps that is due to enumerating, analysing overall size and then copying vs. archiving directly.
cp is single-threaded and only blocks on one IOP at a time.
With a tarpipe, you can block on two IOPs at a time, and they're decoupled by the pipe buffer.
This primarily makes a difference because the kernel cannot issue the I/O for small files ahead of time, like it does when you sequentially read a large file, so you do actually end up blocking and waiting.
It's because reading and writing happen in separate processes, i.e. simultaneously instead of interleaved.
In general writing to disk is handled asynchronously by the kernel (`write` just copies to a buffer and returns), but metadata operations like creating files are not, so this should help the most for many small files.
There's probably more than one reason it's faster. For example, tar ignores extended file attributes by default. cp -r would have to check them for every file.
cp queries the preferred block size of the destination file in 'struct stat', and has specific tweaks for certain filesystems. As far as I can tell, dd does not do this as it calls through to 'write' directly.
In any case, the tests in ioblksize.h indicate that bs=4M is far too large and may perform worse than the default for cp/cat (128KiB). There is a script there that should clear things up for more modern systems.
The point about fdatasync is superfluous as you can run 'sync' yourself, or unmount the filesystem.
I've been re-implementing a bunch of coreutils as an exercise, and got stuck on dd input/output block sizes, AND disk/partition block sizes for a while. (As far as I understand it, for dd I need a ring buffer the size of max(ibs, obs), and then some moderately clever book-keeping to know when to trigger the next read/write, perhaps with code specific to ibs>obs, ibs<obs, etc; partitioning on the other hand is plainly stupid, there's decades of hardware and software just lying to each other and nothing makes sense.)
Thank you and everyone else in this thread for the know-how and references! I would like to eventually write an article (or at least heavily commented source) to hopefully explain all this nonsense for other people like me.
TFA mentions that you can use sync after cp to do the same thing as fdatasync. You can't "unmount the file system" because you're writing directly to the block device, the thumb drive isn't mounted.
Syncing as you go (ideally asynchronously) when you have to sync anyway (like when you're writing an image to a thumb drive, or write to NFS) has the big advantage that you don't end up saying "I'm done writing all data" and then blocking for five minutes waiting for the kernel to flush a couple gigs to disk.
TFA doesn't mention filesystems at all, sort of jumps in where we find the block device. Things could become messy if the device were mounted while the copy is attempted.
Using shell redirects sucks though, you need to run the shell as root.
Using cp seems overkill? It's really really designed to copy files between file systems and has a lot of logic to handle different cases and different optimisations which don't matter when writing to block devices.
As you say, there's no magic. I want a program which just calls 'open' on a path I give it and then uses 'write' to write my data to it. cp does so much more that is related to file systems, and shell redirects aren't a separate program I can run with sudo, but I can trust dd to do the job. And it has a (kinda bad) progress monitor to boot.
The default 512 byte block size is unfortunate though.
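For illustration, a sketch of dd with an explicit block size and a final flush. This assumes GNU dd (conv=fdatasync is a GNU option), uses bs=128K only because that is the cp/cat default mentioned above, and writes to a regular file as a stand-in for the block device so it can be run without root.

```shell
dd_tmp=$(mktemp -d)

# Stand-in "image": 1 MiB of random data instead of a real ISO.
head -c 1048576 /dev/urandom > "$dd_tmp/image.img"

# Stand-in target: a regular file where a real run would name the block device.
# conv=fdatasync makes dd flush the written data before it exits.
dd if="$dd_tmp/image.img" of="$dd_tmp/target.img" bs=128K conv=fdatasync 2>/dev/null

cmp "$dd_tmp/image.img" "$dd_tmp/target.img" && echo "copy matches"
```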
As far as I know, you can't copy using shell redirects. At least not with a simple POSIX shell. Redirections only open file descriptors, but you need a command to actually make the copy from one to the other.
It can be "cat", which is often cargo culted in its own right. Or even "dd", which can work with stdin/out.
I usually prefer not to use shell redirects if there is a command that takes filenames. That's because it gives more control to the app. The app knows how the file will be used, the shell doesn't, so it can open it the most appropriate way, avoid overwriting the output file if something goes wrong, output better error messages, etc... Now, if you don't trust the app (for example if you fear it will modify your input file), then shell redirects may be the better option.
You're right. You'd have to use the `read` built-in and `echo` or `printf` in a while loop combined with redirection, but the POSIX-specified `read` built-in is intended for text and isn't going to work correctly with binary files.
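A quick sketch of that distinction, using scratch files I made up for the purpose: the redirections on their own open (and truncate) the files, but nothing moves any bytes until a command like cat or dd sits between them.

```shell
rd_tmp=$(mktemp -d)
printf 'some data\n' > "$rd_tmp/a"

# Redirections with the no-op command ':' : b is created but stays empty.
: < "$rd_tmp/a" > "$rd_tmp/b"

# cat and dd both copy stdin to stdout, so these actually copy the data.
cat < "$rd_tmp/a" > "$rd_tmp/c"
dd < "$rd_tmp/a" > "$rd_tmp/d" 2>/dev/null

wc -c "$rd_tmp/a" "$rd_tmp/b" "$rd_tmp/c" "$rd_tmp/d"
```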
With bash you could maybe use the non-standard `read -N` option but I'm not sure about null bytes.
Bash not so much, but ksh93 will do this; playing with it has been a recent toy project. Use typeset -b to specify a binary variable, read -n/-N and print -v.
It can be really quite surprisingly fast too, 4GB/sec reading /dev/zero writing /dev/null. ( yes 40gbit/sec, binary copies, in a shell script!!! ). 2-3GB copying real data.
Oh boy but the edge cases and quirks.. there are so many reasons why this is a bad way to do things..
Curiously, I double the speed by unsetting and recreating the buffer variable for each op. There also seem to be cases where ksh will buffer reads from a pipe to allow limited seeks, but also if you do a read -N from a pipe with too large a size (>8k I think, I'd have to check) and that read can't be completed because the source finished writing less than that, then that data is gone. Less than 8k, you can still read it with a subsequent read -n.
Probably the most actually useful thing I learned was that ksh93 creates its pipes with unix sockets, not pipes, which means the buffer size is set by /proc/sys/net/core/[w|r]mem_default, rather than a fcntl call. That makes it easier to tweak from a script, and also makes ksh pipes faster by default for many streams, compared to most other shells ( depending on block sizes, and the difference goes away if you tweak up the pipe buffer size )
Don't get me wrong, not something I'd use in real life, but it was fun anyway.
I confess it's not as simple as $ <a >b
Sorry about that, because it means the answer, while technically true in some weird case where you really need it, isn't exactly convenient in the way you'd actually use it. I made it sound trivial and obvious and direct and it's not.
Use read in a loop, with special care with LANG and IFS to make all bytes meaningless. Except there is no way to avoid null being special, but you can handle null by making null the delimiter for read, and only reading one byte at a time. So even though you can't store an actual null in a variable, you can still detect that there was a null and print a new one back out, and since you only read one byte at a time, you do that for each individual input byte and strings of nulls are not collapsed.
It looks like a lot, but, read is a builtin, and at least in bash and ksh and zsh so is printf, and although this is a loop, it's actually not even a sub-shell. If you edit variables inside the loop, they are still there after the loop, ie, you never forked a child.
while LANG=C IFS= read -d '' -r -n 1 x ;do printf '%c' "$x" ;done <junk1.rnd >junk2.rnd