For the uninitiated, sendfile() is a system call that sends data between two file descriptors. The intent is to have the kernel do the read/write cycle instead of the application (user-level) code, cutting down the number of times the data has to be copied between kernel and user address spaces.
> sendfile() is a system call that sends data between two file descriptors
Not quite. As Pike complains, the first file descriptor must be mmap-able and the second must be a socket. I think his objection is that such narrowly applicable system calls do not belong in what should be a general purpose API.
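For anyone who hasn't used it, here is a minimal sketch of the call in C, assuming Linux and a blocking out_fd. The helper name send_path is made up for illustration. It loops because sendfile() may transfer fewer bytes than requested.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper (not from the thread): send the regular file at
 * `path` to out_fd with sendfile(2). Assumes out_fd is blocking.
 * Returns the number of bytes sent, or -1 on error. */
ssize_t send_path(int out_fd, const char *path)
{
    assert(path != NULL);

    int in_fd = open(path, O_RDONLY);
    if (in_fd < 0)
        return -1;

    struct stat st;
    if (fstat(in_fd, &st) < 0) {
        close(in_fd);
        return -1;
    }

    off_t offset = 0;                 /* the kernel advances this for us */
    while (offset < st.st_size) {
        ssize_t n = sendfile(out_fd, in_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n <= 0)                   /* error, or the file shrank under us */
            break;
    }

    close(in_fd);
    return offset == st.st_size ? (ssize_t)offset : -1;
}
```

Note that the file's bytes never pass through a user-space buffer here; the process only hands the kernel two descriptors and a length.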
Yeah, it's all about context switching, which is just unnecessary work. Doesn't matter if you are serving 10 dynamic HTTP requests a day on a 16-core super machine. Does matter if you are serving the same file 100,000 times a second from your phone :)
(I do wish it worked for any fd to any other fd, because I have to write that code myself rather frequently. Example: copying data from a pty to the real terminal.)
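The code in question usually looks something like the sketch below: a plain read()/write() loop through a user-space buffer. The name copy_fd and the 4096-byte buffer are my own choices, and it assumes blocking descriptors.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: copy from in_fd to out_fd until EOF through a
 * user-space buffer. Works for any pair of descriptors (pty, pipe,
 * socket, file). Returns total bytes copied, or -1 on error. */
ssize_t copy_fd(int in_fd, int out_fd)
{
    assert(in_fd >= 0 && out_fd >= 0);

    char buf[4096];
    ssize_t total = 0;

    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof buf);
        if (n < 0) {
            if (errno == EINTR)
                continue;             /* interrupted by a signal, retry */
            return -1;
        }
        if (n == 0)
            return total;             /* EOF */

        /* write() may accept fewer bytes than asked, so loop until done */
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        total += n;
    }
}
```

Every chunk costs at least two system calls and two copies across the user/kernel boundary, which is exactly the overhead a generalized sendfile() would remove.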
It's not about context switching, it's about having to copy data from kernel space to user space and back. If there's no processing done between the calls to read() and write(), then there's no point copying it to user space.
read() then write() is expensive even using zero-copy techniques because there are at minimum 4 context switches for every chunk of data.
Further, every time the process stops running (say, because another one is scheduled), the copy stops. With sendfile() as a single system call this is not a problem, because the kernel keeps working on the transfer whenever it runs (i.e. on every context switch and every interrupt; with DMA, the send could actually be happening even when the kernel is not running).
No, this is not correct. The idea of sendfile() is that you can send a physical address (probably something in the buffer cache) directly to a network adapter in a zero-copy manner. That's currently impossible with a send()/write() to a file descriptor in Linux because you can't construct an SKB from a userspace address without forcefully pinning memory. Pinning memory from userspace is a privileged operation. OTOH, since the buffer cache is not part of the memory of a process, you can obtain a physical address for it.
There's no magic in the kernel that avoids context switches. If a kernel thread has to switch to another kernel thread, that's still a context switch. And syscalls are ridiculously cheap on modern hardware. sysenter is like < 100 cycles.
That said, there's no theoretical advantage to doing this via sendfile() versus doing a send() from an mmap()'d file. If Linux were smarter, it could realize instantly that the address given to send() is file-backed and could construct an SKB from the physical memory without pinning.
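For comparison, the mmap()-then-send() alternative could be sketched like this. The helper name send_mapped is hypothetical, and it assumes a non-empty regular file and a blocking socket. As argued above, a plain send() on Linux still copies out of the mapping into socket buffers, so this shows the interface, not a zero-copy path.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map the file at `path` read-only and send() the
 * mapping over `sock`. Returns bytes sent, or -1 on error. */
ssize_t send_mapped(int sock, const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);                    /* mmap() rejects zero-length maps */
        return -1;
    }

    size_t len = (size_t)st.st_size;
    char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                        /* the mapping outlives the descriptor */
    if (p == MAP_FAILED)
        return -1;

    size_t sent = 0;
    while (sent < len) {
        ssize_t n = send(sock, p + sent, len - sent, 0);
        if (n < 0)
            break;
        sent += (size_t)n;
    }

    munmap(p, len);
    return sent == len ? (ssize_t)sent : -1;
}
```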
This is what the reference to a "5 minute hack" was vs. Rob Pike's claim that the interface exists to work around a problem in Linux. He's completely correct.
I'm not disagreeing with this. However, the specified algorithm which I was posting about was read then write. It is what the parent said, not mmap then send. The two sets of calls have very different semantics as you said. For the case of
n = read(fd, dataptr, blocksize);
while (n > 0) {
    write(sock, dataptr, n);    /* short writes ignored for brevity */
    n = read(fd, dataptr, blocksize);
}
My statement holds true. Even if the context switching is minimal overhead, there are more context switches caused by my code, and further blocking and other processes add more delays.
This goes halfway toward solving the problem. What most people want is a pure socket->socket transfer. As evidenced by haproxy, you must still use a pipe as an intermediary, which requires multiple data copies (on the plus side, it's still fast).
ret = splice(fd, NULL, b->pipe->prod, NULL, max,
SPLICE_F_MOVE|SPLICE_F_NONBLOCK);
edited to add (because I couldn't reply): It almost certainly is faster than read/write, because it's straight memcpy's (which are pretty fast on modern hardware), instead of memcpy's, context switching, special read()/write() logic, and other stuff.
The pipe intermediary doesn't necessarily add a copy. You SPLICE_F_MOVE the data from the first fd to the pipe, then SPLICE_F_MOVE it from the pipe to the second fd. If either or both of those can be done zero-copy, they will be. The pipe intermediary is just a way of holding onto the reference-counted pages.
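Spelled out, the two-step move might look like the sketch below. The helper name splice_copy is made up. It assumes Linux with blocking descriptors, and whether any given step is actually zero-copy is up to the kernel and the descriptor types involved.

```c
#define _GNU_SOURCE                   /* for splice(2) */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: move up to `len` bytes from in_fd to out_fd
 * through a private pipe using splice(2).
 * Returns the number of bytes moved, or -1 on error. */
ssize_t splice_copy(int in_fd, int out_fd, size_t len)
{
    assert(in_fd != out_fd);

    int p[2];
    if (pipe(p) < 0)
        return -1;

    size_t total = 0;
    int failed = 0;

    while (total < len && !failed) {
        /* Step 1: in_fd -> pipe. The pipe holds page references. */
        ssize_t n = splice(in_fd, NULL, p[1], NULL, len - total,
                           SPLICE_F_MOVE);
        if (n < 0) {
            failed = 1;
            break;
        }
        if (n == 0)
            break;                    /* EOF on in_fd */

        /* Step 2: drain the pipe -> out_fd. */
        while (n > 0) {
            ssize_t w = splice(p[0], NULL, out_fd, NULL, (size_t)n,
                               SPLICE_F_MOVE);
            if (w <= 0) {
                failed = 1;
                break;
            }
            n -= w;
            total += (size_t)w;
        }
    }

    close(p[0]);
    close(p[1]);
    return failed ? -1 : (ssize_t)total;
}
```

A long-lived proxy would presumably keep the pipe around between calls rather than creating one per transfer, as is done here for simplicity.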
"
Presently (Linux 2.6.9): in_fd, must correspond to a file which supports mmap(2)-like operations (i.e., it cannot be a socket); and out_fd must refer to a socket.
"
so apparently you cannot use sendfile for your scenario.
It’s really annoying that such bugs in news.arc’s renderer never get fixed. Surrounding a URL in <> is the official RFC-recommended way to do things. From RFC 2396 (1998):
> In practice, URI are delimited in a variety of ways, but usually within double-quotes "http://test.com/", angle brackets <http://test.com/>, or just using whitespace [...] Using <> angle brackets around each URI is especially recommended as a delimiting style for URI that contain whitespace.
The manual page: <http://linux.die.net/man/2/sendfile>. A related Linux Journal article: <http://www.linuxjournal.com/article/6345>.