For the un-initiated, sendfile() is a system call that sends data between two fi...

jbarham · on Sept 23, 2010

> sendfile() is a system call that sends data between two file descriptors

Not quite. As Pike complains, the first file descriptor must be mmap-able and the second must be a socket. I think his objection is that such narrowly applicable system calls do not belong in what should be a general purpose API.

jbjohns · on Sept 23, 2010

On Linux. I don't believe all Unix OSes have that requirement.

jrockway · on Sept 23, 2010

Yeah, it's all about context switching, which is just unnecessary work. Doesn't matter if you are serving 10 dynamic HTTP requests a day on a 16-core super machine. Does matter if you are serving the same file 100,000 times a second from your phone :)

(I do wish it worked for any fd to any other fd, because I have to write that code myself rather frequently. Example: copying data from a pty to the real terminal.)
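For reference, the hand-written version usually looks something like the sketch below. It assumes blocking descriptors, and `copy_fd` is a hypothetical helper name rather than anything from libc. Every byte crosses into user space and back, which is exactly the cost sendfile() avoids.

```c
#include <errno.h>
#include <unistd.h>

/* Copy everything from in_fd to out_fd through a user-space buffer.
 * Returns the number of bytes copied, or -1 on error (errno is set).
 * copy_fd is a hypothetical helper name, not a libc function. */
ssize_t copy_fd(int in_fd, int out_fd)
{
    char buf[4096];
    ssize_t total = 0;

    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof buf);
        if (n == 0)
            return total;           /* EOF on the input side */
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted, retry the read */
            return -1;
        }

        /* write() may accept fewer bytes than asked for, so loop. */
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;       /* interrupted, retry the write */
                return -1;
            }
            off += w;
        }
        total += n;
    }
}
```

Calling `copy_fd(pty_fd, STDOUT_FILENO)` would cover the pty-to-terminal case, where `pty_fd` stands for whatever descriptor the pty master was opened on.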

leif · on Sept 23, 2010

It's not about context switching, it's about having to copy data from kernel space to user space and back. If there's no processing done between the calls to read() and write(), then there's no point copying it to user space.

sophacles · on Sept 23, 2010

read() then write() is expensive even using zero-copy techniques because there are at minimum 4 context switches for every chunk of data.

Further, every time the process stops running (another is scheduled say) the copy stops. With sendfile as a system call, this is not a problem as the kernel is working at this transfer every time it is running. (i.e. every context switch and every interrupt -- actually with dma, the send could be happening even when the kernel is not running...)

aliguori · on Sept 24, 2010

No, this is not correct. The idea of sendfile() is that you can send a physical address (probably something in the buffer cache) directly to a network adapter in a zero copy manner. That's currently impossible with a send()/write() to a file descriptor in Linux because you can't construct an SKB from a userspace address without forcefully pinning memory. Pinning memory from userspace is a privileged operation. OTOH, since buffer cache is not part of the memory of a process, you can obtain a physical address of it.

There's no magic in the kernel that avoids context switches. If a kernel thread has to switch to another kernel thread, that's still a context switch. And syscalls are ridiculously cheap on modern hardware. sysenter is like < 100 cycles.

That said, there's no theoretical advantage to doing this via sendfile versus doing a send() from an mmap()'d file. If Linux was smarter, it could realize instantly that the address given to send is file backed and could construct an SKB from the physical memory without pinning.

This is what the reference to a "5 minute hack" was vs. Rob Pike's claim that the interface exists to work around a problem in Linux. He's completely correct.

sophacles · on Sept 24, 2010

I'm not disagreeing with this. However, the specified algorithm which I was posting about was read then write. It is what the parent said, not mmap then send. The two sets of calls have very different semantics as you said. For the case of

  /* read() takes (fd, buf, count); write() sends the bytes just read */
  data = read(fd, dataptr, blocksize);
  while (data > 0) {
    write(sock, dataptr, data);
    data = read(fd, dataptr, blocksize);
  }

My statement holds true. Even if the context switching is minimal overhead, there are more context switches caused by my code, and further blocking and other processes add more delays.

tonfa · on Sept 23, 2010

Would splice help? http://en.wikipedia.org/wiki/Splice_(system_call)

gruseom · on Sept 23, 2010

I'm glad you posted that. It links to the following superb email thread about splice(), tee(), and other zero-copy magic:

http://kerneltrap.org/node/6505

FooBarWidget · on Sept 23, 2010

Unfortunately not, splice requires that one of the fds is a file. I want to forward data from one socket to another.

fragmede · on Sept 23, 2010

In 2.6.31, support was added to

> Allow splice(2) to work when both the input and the output is a pipe.

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6...

alexgartrell · on Sept 23, 2010

This goes halfway toward solving the problem. What most people want is a pure socket->socket. As evidenced by haproxy, you must still use a pipe intermediary, requiring multiple data copies (on the plus side, it's still fast).

  ret = splice(fd, NULL, b->pipe->prod, NULL, max,
               SPLICE_F_MOVE|SPLICE_F_NONBLOCK);

edited to add (because I couldn't reply): It almost certainly is faster than read/write, because it's straight memcpy's (which are pretty fast on modern hardware), instead of memcpy's, context switching, special read()/write() logic, and other stuff.

caf · on Sept 24, 2010

The pipe intermediary doesn't necessarily add a copy. You SPLICE_F_MOVE the data from the first fd to the pipe, then SPLICE_F_MOVE it from the pipe to the second fd. If either or both of those can be done zero-copy, they will be. The pipe intermediary is just a way of holding onto the reference-counted pages.

FooBarWidget · on Sept 23, 2010

Is that still faster than userspace forwarding with read() and write()?

ithkuil · on Sept 23, 2010

The sendfile manpage says:

" Presently (Linux 2.6.9): in_fd, must correspond to a file which supports mmap(2)-like operations (i.e., it cannot be a socket); and out_fd must refer to a socket. "

so apparently you cannot use sendfile for your scenario.

wmf · on Sept 23, 2010

That's not what the man page says. AFAIK HAProxy splices from one socket to another.

berntb · on Sept 23, 2010

I hadn't seen splice, it was cool. But what really blew my mind was the discussion of a system call on Wikipedia. :-)

To put it into the wikipedia is obvious I guess, but I just fell in love with the net again. Thank you.

Edit: I really should do some low level stuff again.

jacquesm · on Sept 23, 2010

Remove extra > from url to make it work.

Think 'netcat'.

jacobolus · on Sept 24, 2010

It’s really annoying that such bugs in news.arc’s renderer never get fixed. Surrounding a URL in <> is the official RFC-recommended way to do things. From RFC 2396 (1998):

> In practice, URI are delimited in a variety of ways, but usually within double-quotes "http://test.com/", angle brackets <http://test.com/>, or just using whitespace [...] Using <> angle brackets around each URI is especially recommended as a delimiting style for URI that contain whitespace.

http://www.ietf.org/rfc/rfc2396.txt