Hi, I'm the guy who came up with the splice idea. It's based on what I learned doing this:
www.connectathon.org/talks96/bds.pdf
which was for the EIS (Earth Imaging System) project, a government effort to image the earth about 15 years ago. That project eventually had 200MHz MIPS SMP boxes moving data through NFS at close to 1Gbyte/sec 24x7.
So far as I know, nobody else has ever come close to that even with 10x faster CPUs.
Most of the people in this thread pretty clearly don't understand the issues involved, Rob included (sorry, Rob, go talk to Greg). Moving lots and lots of data very quickly precludes looking at each byte by the CPU. The only thing that should look at each byte is a DMA engine.
Sendfile(2) is a hack, that's true. It is a subset of what I imagined splice(2) could be (actually splice(3), the syscalls are pull(2) and push(2)). But it's a necessary hack.
Jens' splice() implementation was a start but wasn't really what I imagined for splice(), to really go there you need to rethink how the OS thinks about moving data. Unless the buyin is pervasive splice() is sort of a wart.
I'm assuming that the real objection to an interface like sendfile() is that you shouldn't need to copy_from_user() just to construct an SKB if the user address being referred to is pointing at a file backed VMA. If we could do a zero-copy send() from userspace, there would be no need for sendfile() in the first place.
IOW, { data = mmap(..., fd, ...); ...; send(sock, data + offset, length, 0); } could be just as fast as sendfile() if we were smarter.
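For reference, here is roughly what the mmap-then-send pattern above looks like spelled out in C. This is only a sketch: `mmap_send` and its `path` argument are made-up names, not anything from the kernel or from this thread. On a stock kernel the send() still copies out of the mapping into socket buffers (no MSG_ZEROCOPY here), so this shows the interface being argued for, not the zero-copy behaviour itself.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: map the file at `path` read-only and push its
 * contents down `sock` with plain send(). Returns bytes sent, or -1. */
ssize_t mmap_send(int sock, const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    size_t len = (size_t)st.st_size;
    if (len == 0) {                 /* mmap() rejects zero-length mappings */
        close(fd);
        return 0;
    }

    char *data = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping outlives the descriptor */
    if (data == MAP_FAILED)
        return -1;

    size_t sent = 0;
    ssize_t rc = 0;
    while (sent < len) {
        /* The user address handed to send() is file backed; a smarter
         * kernel could in principle build the SKB from these pages. */
        rc = send(sock, data + sent, len - sent, 0);
        if (rc < 0) {
            if (errno == EINTR)
                continue;
            break;
        }
        sent += (size_t)rc;
    }

    munmap(data, len);
    return rc < 0 ? -1 : (ssize_t)sent;
}
```

Calling `mmap_send(sock, "some/file")` on a connected stream socket is the whole usage; the loop is only there because send() may accept fewer bytes than asked.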
We're starting to build infrastructure to do zero-copy networking in userspace specifically for KVM. So far, it's not at the socket level but instead at the macvtap level but that's arguably another problem with the networking stack--macvtap/tap should just be an address family :-)
I'm not so sure. The problem is that when send() returns, the application is free to modify the buffer and expect that the modifications won't be visible to the other side of the link - the data has already been "sent" as far as the app is concerned.
So this implies that if you wished to send by DMAing directly from the page cache to the NIC, send() would have to block - not just until the data (and all previous buffered written data on the connection) has been passed to the NIC, but until the TCP ACK from the other side has been received (or RST). (Non-blocking sockets could obviously never use this method - they would have to perform the copy).
sendfile() obviously has this limitation already, but if you were to do it for send() you'd at least have to hide it behind a setsockopt(). Otherwise the weird behaviour is likely to upset existing applications (eg. select() says that write() will not block - but it does anyway, because the buffer you sent from was mmap()ed?)
> So far as I know, nobody else has ever come close to that even with 10x faster CPUs.
With 10-gigabit becoming commonplace in high-end server rooms, dual 10-gigabit cards hitting the market, and 100-gigabit on its way, I've got to ask: what do you mean by no one has come close?
(Certainly, 1 GByte/sec is achievable; splice is a large part of that.)
The big print giveth, the fine print taketh away :)
Just because you have a 10 gigabit pipe (which is 1.2Gbyte/sec) that doesn't mean you can fill it. And filling it with a benchmark is different than filling it with NFS traffic.
So far as I know, nobody has come close to what SGI could do years ago, i.e., stream 100s of MB/sec of data off the disk and out the network and keep doing it.
I've tried with Linux to build a disk array that would do gigabit, not 10-gigabit, and couldn't come anywhere close. I think about 60MB/sec was where things started to fail.
I'd love to hear about a pair of boxes that could fill multiple gigabit pipes with the following benchmark:
$ tar cf - big_data | rsh server "tar xf -"
You can set any block size you want in tar, I don't care, the files can be reasonably large, etc.
If you manage to fill one pipe, then try it with 10-gigabit and let's see what that does.
I'd love to be wrong but so far as I can tell, we're nowhere near what SGI could do.
Benchmarks, especially those from card manufacturers, are suspect. Just because you've got a network ping-pong test running at essentially line speed, doesn't mean I'll see any of that. I fully agree.
Filling any pipe with your 'rsh' benchmark is harder - on all the Linuxes I have on hand, rsh is aliased to ssh, and I've never seen a system deal with the overhead of ssh adequately. (There's high perf ssh if you really want; I've never tried it. http://www.psc.edu/networking/projects/hpn-ssh/) (Would you accept numbers over netcat?)
If I'm not getting 90MBytes/sec over Gigabit, then something, somewhere, is slowing it down.
We sell servers that sustain 300+ MBytes/s per client, via Samba for two clients. Those clients are very fussy about latency (video editing/DI). That's what we sell, so it's empirically that fast; ie not a benchmark. The server itself will actually do 900+ MBytes a second (to three clients, the same as page 14 on your linked pdf), but my small print is the market doesn't want that configuration.
These numbers are what I know, which doesn't include NFS. NFS-RDMA and pNFS are supposedly good for performance, but I don't have any first hand knowledge.
None of what I just stated contradicts you. But if I'm at 900 MBytes/sec with my level of hardware, I find it hard to believe someone with more budget isn't capable of a full Gigabyte/sec over Infiniband, arguably HIPPI's successor.
It's a product for a niche B2B market, but if you're buying, email's in my profile ;)
The network is 10-Gigabit ethernet, from server to a switch, then switch to client, all via CX-4 or SFP+. The data is actually coming off disk at that rate (in ram would be cheating :p ). It's backed by a 16-drive hardware raid using 2TB spinning disks (SATA) in a raid5. The performance we sell is actually 2 streams of read while simultaneously writing a stream to disk, so 600 out, 300 in.
The files are actually the hard part. This supports the dpx 'video' format. The format originated from film scanners, and so each frame is (basically) a 12.5 MByte jpeg. Multiply that by 24-30 fps, add in overhead for reading that many files/sec, and you get 300+ MBytes/sec. (This is also why I'm so sure there's more performance to be had if you don't have the same overhead.)
In my case, it's because it's a pretty good emulator of what we do.
We're the oh-so-loved BitKeeper guys and we're working on tech for very large repositories. A bk clone is essentially the described benchmark. Or should be, if we're slower than that we're doing something wrong.
The unnecessary data copying problem, as Robert Pike suggests, can be also solved by a more generic Zero-Copy approach, instead of adding a specific single purpose system call.
It has to be noted however that often the term Zero-Copy is used to describe a technique which avoids memory copying by employing virtual memory remapping.
VM tricks are also expensive because, depending on the architecture, it might require flushing the TLBs and impact subsequent memory accesses. The advantage of this way of zero copy approach thus depends on several factors such as the amount of data being transferred to the kernel's buffers.
I don't have any recent data regarding real-world performance; any references are welcome. However it's far from being self-evident that VM tricks can rival the performance of a dedicated 'sendfile' like system call.
For the uninitiated, the TLB is the thing that keeps your MMU hardware from having to trawl through the page directory in memory every time it accesses a virtual address; it's a cache, and you generally want to avoid flushing it.
For the un-initiated, sendfile() is a system call that sends data between two file descriptors. The intent is to make the kernel do the read/write cycle instead of the application (user-level) code, thereby cutting down the number of times the data needs to be mapped between kernel and userspace memory spaces.
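To make that concrete, a minimal sketch of the call on Linux follows. `send_file_by_path` is a made-up helper name, and the loop assumes a blocking `out_fd`. The thread quotes the 2.6.9-era rule that `out_fd` must be a socket; the man page says that since Linux 2.6.33 it can be any file, but a socket is still the usual case.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: send the whole file at `path` to out_fd with
 * sendfile(2), so the data never enters a userspace buffer.
 * Returns bytes sent, or -1 on error. */
ssize_t send_file_by_path(int out_fd, const char *path)
{
    int in_fd = open(path, O_RDONLY);
    if (in_fd < 0)
        return -1;

    struct stat st;
    if (fstat(in_fd, &st) < 0) {
        close(in_fd);
        return -1;
    }

    off_t offset = 0;               /* sendfile() advances this for us */
    while (offset < st.st_size) {
        ssize_t n = sendfile(out_fd, in_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n < 0) {
            if (errno == EINTR)
                continue;
            close(in_fd);
            return -1;
        }
        if (n == 0)                 /* file shrank underneath us */
            break;
    }

    close(in_fd);
    return (ssize_t)offset;
}
```

Compared with a read()/write() loop there is no user buffer at all, which is the point of the call; whether the kernel then avoids every copy internally depends on the file system and the NIC.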
> sendfile() is a system call that sends data between two file descriptors
Not quite. As Pike complains, the first file descriptor must be mmap-able and the second must be a socket. I think his objection is that such narrowly applicable system calls do not belong in what should be a general purpose API.
Yeah, it's all about context switching, which is just unnecessary work. Doesn't matter if you are serving 10 dynamic HTTP requests a day on a 16-core super machine. Does matter if you are serving the same file 100,000 times a second from your phone :)
(I do wish it worked for any fd to any other fd, because I have to write that code myself rather frequently. Example: copying data from a pty to the real terminal.)
It's not about context switching, it's about having to copy data from kernel space to user space and back. If there's no processing done between the calls to read() and write(), then there's no point copying it to user space.
read() then write() is expensive even using zero-copy techniques because there are at minimum 4 context switches for every chunk of data.
Further, every time the process stops running (another is scheduled, say) the copy stops. With sendfile as a system call, this is not a problem as the kernel is working at this transfer every time it is running. (i.e. every context switch and every interrupt -- actually with DMA, the send could be happening even when the kernel is not running...)
No, this is not correct. The idea of sendfile() is that you can send a physical address (probably something in the buffer cache) directly to a network adapter in a zero copy manner. That's currently impossible with a send()/write() to a file descriptor in Linux because you can't construct an SKB from a userspace address without forcefully pinning memory. Pinning memory from userspace is a privileged operation. OTOH, since buffer cache is not part of the memory of a process, you can obtain a physical address of it.
There's no magic in the kernel that avoids context switches. If a kernel thread has to switch to another kernel thread, that's still a context switch. And syscalls are ridiculously cheap on modern hardware. sysenter is like < 100 cycles.
That said, there's no theoretical advantage to doing this via sendfile versus doing a send() from an mmap()'d file. If Linux was smarter, it could realize instantly that the address given to send is file backed and could construct an SKB from the physical memory without pinning.
This is what the reference to a "5 minute hack" was vs. Rob Pike's claim that the interface exists to work around a problem in Linux. He's completely correct.
I'm not disagreeing with this. However, the specified algorithm which I was posting about was read then write. It is what the parent said, not mmap then send. The two sets of calls have very different semantics as you said. For the case of
data = read(fd, dataptr, blocksize);      /* data = number of bytes read */
while (data > 0) {
    write(sock, dataptr, data);           /* copy the chunk back into the kernel */
    data = read(fd, dataptr, blocksize);
}
My statement holds true. Even if the context switching is minimal overhead, there are more context switches caused by my code, and further blocking and other processes add more delays.
This goes half way toward solving the problem. What most people want is a pure socket->socket. As evidenced by haproxy, you must still use a pipe intermediary, requiring multiple data copies (on the plus side, it's still fast).
ret = splice(fd, NULL, b->pipe->prod, NULL, max,
SPLICE_F_MOVE|SPLICE_F_NONBLOCK);
edited to add (because I couldn't reply): It almost certainly is faster than read/write, because it's straight memcpy's (which are pretty fast on modern hardware), instead of memcpy's, context switching, special read()/write() logic, and other stuff.
The pipe intermediary doesn't necessarily add a copy. You SPLICE_F_MOVE the data from the first fd to the pipe, then SPLICE_F_MOVE it from the pipe to the second fd. If either or both of those can be done zero-copy, they will be. The pipe intermediary is just a way of holding onto the reference-counted pages.
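A rough sketch of that two-stage move, assuming Linux and blocking descriptors. `splice_copy` is a made-up name, and this is not haproxy's code. SPLICE_F_MOVE is only a hint, so whether any given stage ends up zero-copy is the kernel's decision.

```c
#define _GNU_SOURCE                 /* for splice() and SPLICE_F_MOVE */
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: move up to `len` bytes from in_fd to out_fd through
 * a private pipe. The pipe only holds references to pages; no userspace
 * buffer is involved. Returns bytes moved (less than len at EOF), or -1. */
ssize_t splice_copy(int in_fd, int out_fd, size_t len)
{
    int p[2];
    if (pipe(p) < 0)
        return -1;

    ssize_t total = 0;
    while ((size_t)total < len) {
        /* Stage 1: in_fd -> pipe. Returns at most what the pipe can hold. */
        ssize_t n = splice(in_fd, NULL, p[1], NULL,
                           len - (size_t)total, SPLICE_F_MOVE);
        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0) {
            total = -1;
            break;
        }
        if (n == 0)                 /* end of input */
            break;

        /* Stage 2: drain exactly those n bytes, pipe -> out_fd. */
        ssize_t left = n;
        while (left > 0) {
            ssize_t m = splice(p[0], NULL, out_fd, NULL,
                               (size_t)left, SPLICE_F_MOVE);
            if (m < 0 && errno == EINTR)
                continue;
            if (m <= 0) {
                total = -1;
                break;
            }
            left -= m;
        }
        if (total < 0)
            break;
        total += n;
    }

    close(p[0]);
    close(p[1]);
    return total;
}
```

Draining the pipe fully after each fill keeps the blocking case from deadlocking on a full pipe; a proxy doing this per connection would more likely keep the pipe around instead of creating one per call.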
> Presently (Linux 2.6.9): in_fd, must correspond to a file which supports mmap(2)-like operations (i.e., it cannot be a socket); and out_fd must refer to a socket.
so apparently you cannot use sendfile for your scenario.
It’s really annoying that such bugs in news.arc’s renderer never get fixed. Surrounding a URL in <> is the official RFC-recommended way to do things. From RFC 2396 (1998):
> In practice, URI are delimited in a variety of ways, but usually within double-quotes "http://test.com/", angle brackets <http://test.com/>, or just using whitespace [...] Using <> angle brackets around each URI is especially recommended as a delimiting style for URI that contain whitespace.
I disagree. I think it's nothing but a good idea compared to its current state, for the following reasons:
- IPC in userspace is insecure, and currently easily subverted by rootkits.
- When a system is under heavy load, the dbus daemon becomes a choke point for scheduling. "Spammy" processes can prevent messages from reaching higher priority processes.
- It is a lifesaver for embedded devices with poor multitasking capabilities. Think about ucLinux.
- It's not for everyone. It's like PF_RING for packet capturing.
- It makes sense not only for performance, but also reliability.
- Even though DBus can be improved in several other aspects, I think it's nice that someone sat down and wrote this instead of bikeshedding on some website.
>It can be written in a few lines of efficient user code.
I'm not sure he's understanding what this is. There is no copying needed here at all. The kernel could make the hard drive write to a place in memory, have the NIC read from that place and just manage interrupts between the two. The kernel wouldn't have to touch the data at all. This may not be how Linux does it (given the requirement for an mmap'able file descriptor) but that would be possible at least. I don't think you could do anything near this in user code.
The read(2)/write(2) solution from userspace would involve no copying as well, assuming a competent kernel implementation. The only "overheads" to speak of relative to what's possible with sendfile(2) would be those associated with writing one contiguous page table entry for every 4KB of data to set up the mappings; since sendfile(2) requires that you use mmap'able input, it probably incurs the same overheads.
Fine, let's say doing the right thing with read(2)/write(2) is hard, even really hard, and sendfile(2) is faster today. Making expedient shortcuts to performance in the system call API has historically not turned out well in UNIX. People write software which depends on this interface, and that software may well outlive any existing hardware and its quirks.
It's unwise to go against Rob Pike and Keith Adams, but I'm going to give it a try...
> The read(2)/write(2) solution from userspace would involve no copying as well, assuming a competent kernel implementation.
AFAIK that was tried and it's slower than sendfile; sometimes it's even slower than copying data.
> software may well outlive any existing hardware and its quirks.
Virtual memory operations like modifying page tables appear to be consistently getting slower relative to other operations, so I would expect that sendfile/splice will always be faster.
Modifying page tables is just a store on x86; since you would be reducing protections (going from unmapped, to mapped read-only), there is no need for TLB flush or shootdown (both of which are, admittedly, getting relatively more costly).
Apps tend to reuse buffers, so if you zero-copy read() into a buffer that's already populated you'd have to drop the existing pages and replace them with new ones. Then a zero-copy write() has to mark the pages as COW. Would this require a TLB flush? It seems like it would.
I initially thought that was a killer point, but on second thought, if the memory is already mapped writably the kernel can avoid remappings entirely by directing the I/O into the already-mapped pages.
Problem is you don't know which recv() a packet is destined for until you actually receive it and can parse the header and apply all of the filtering rules.
But what you describe is exactly how O_DIRECT achieves zero-copy for read/write.
Of course, if you assume multiple receive queues that are programmable, you can short cut the netfilter code. That was roughly the idea behind VJ channels.
A classic read or write pointer can have any address; it doesn't have to be a multiple of the kernel page size or whatever. Is the OS supposed to implicitly do a new "hidden" mmap whenever it sees an address that's a multiple of the page size? And do lookups for a matching chunk that's still mmapped on every write? How long are you going to keep such relations around in a system with more reads and writes in progress? I don't see how that concept can be made efficient, and Pike doesn't actually explain in the post how he'd do it. sendfile looks ugly and inelegant but solves a real problem.
Realistically, you'd also have to track the allocation, and deallocation, of the range the file is being mmap(2)d into.
sendfile(2) (it appears, I don't really spend time in the Linux kernel) piggybacks on the kernel's pipe code; it allocates some non-contiguous buffer pages in kernel space, but it doesn't map any new pages, and it doesn't allocate enough pages for the entire file.
It's also been around (and presumably making things faster) for over ten years. That's a long time to let a hypothetical future "best" triumph over an achievable "okay".
No. If the act of write() is to construct and transmit an SKB, an SKB can be created from a scatter/gather list of pages plus an offset into the first page.
But this implies that you can track the SKBs lifecycle in the kernel such that you can know when the SKB is done being used and therefore can allow write() to return.
The trouble is, a write() to a socket doesn't actually generate an SKB. Instead it copies data to a socket buffer and then the socket buffers are used to generate SKBs. Some of this is meant as an optimization to create large packets even from small writes.
But of course, a copy is a copy and sometimes this buffering is not actually helpful to userspace.
I don't see why the buffers have to be specifically page aligned. But there certainly exists some hardware that requires some specific (unrelated to page size) alignment of its DMA buffers. But more importantly, such buffers have to be placed in physical memory such that they are accessible to said hardware, which often means in the lowest 4GB of physical address space, and both kernel (does not know what a just-allocated page is going to be used for) and user space (cannot influence page allocation) have no way to ensure this without copying pages. An IOMMU should solve this problem, but then you have additional costs coming from reprogramming the IOMMU.
Also, if I read(2), it's supposed to be putting stuff in my buffer (obtained with malloc or whatever). It would seem very complicated to change the mapping for addresses in this buffer. And if I write to the buffer, it must be copy-on-write (in fact, it has to copy the page if I dereference any pointer into it, because most hardware doesn't distinguish between read-only and read-write at the instruction level).
It just seems like if the kernel really did this level of mapping tricks, that MPI implementations could do something better without this patch. http://lkml.org/lkml/2010/9/14/468
The "copy-on-write borrowing" on write() is where the problem lies. To turn the writable pages into COWs you have to update the process's page tables to remove write permission, which requires a TLB flush. These kinds of games are a net loss, at least on modern hardware.
(And then the app likely reuses the buffer for the next `read()` anyway, requiring either replacing the page or faulting in a fresh one and doing the C in COW).
Huh. Like to see some numbers on that. Not arguing, just a little surprised. So you have data that shows bcopy() to be faster than the copy-on-write stuff?
As for the buffers, known problem, all the apps that used this sort of thing cycled through N buffers for exactly that reason. I think N=4 was way more than enough.
> The read(2)/write(2) solution from userspace would involve no copying as well, assuming a competent kernel implementation.
Even though you've got many others chiming in as to why this might not be so, I'll add my own: A userspace program can modify a buffer after data is read in, then write the modified version out -- this implies a local copy or, at the very least, the semantics of a local copy.
I know who he is and gave him the benefit of the doubt. That's why I said "I'm not sure he's understanding what this is." instead of just assuming the person had no idea what they were talking about.
>The read(2)/write(2) solution from userspace would involve no copying as well
I don't believe this is the case at all. First you have the context switch problem, and second, the reason sendfile can be fast is that it's possible that the kernel never touches the data. It just assigns the disk DMA to a kernel buffer, and then gives the kernel buffer location to the NIC driver to read from. All data reading and writing can be done completely with the hardware with no CPU involvement.
read/write can't do it this way because you have the option of doing something with the data you get back from read. Well, it could optimistically hope that nothing will in fact be done with the data but that assumption will be wrong in nearly every single usage.
With mmap you avoid copying from kernel memory to user memory: mmap makes both share the same pages, hence the restriction that the file be mmap-able. If you don't allow mmap, the system will certainly have to copy between kernel and user memory (in both directions, with separate read and write calls).
No, sendfile takes a file descriptor and forwards the contents to a socket. It wouldn't have to mmap the file because the user code never touches the file data. This is probably why other OSes that support sendfile don't have this restriction.
>If you don't allow mmap the system will certainly have to copy between kernel and user memory
Why? All the kernel has to do is set the DMA location for the drive to be a kernel buffer and follow any inodes when the file is split up on the disk. The NIC driver can just be told to load the data from those locations directly so no kernel copying at all, much less into user space.
I was just doing research on using sendfile's modern zero-copy replacement(s), splice & tee: http://kerneltrap.org/node/6505
They were linked elsewhere here, but worth repeating in a top-level comment. :) Possibly a handy tool for the rare times you need to squeeze blood out of a stone.