TCP in 30 instructions (1993)

tptacek · on Dec 27, 2013

This is a very famous post, and Van Jacobsen is the archetypical networking systems programming bad-ass. Important context here though: this is just the per-segment receive processing piece of TCP; it's describing a very simple fast path that segments can take, but not all the TCP logic!

stplsd · on Dec 27, 2013

>Van Jacobsen

Van Jacobson

He is probably most famous for his work on early congestion control algorithms in 1988 (http://ee.lbl.gov/papers/congavoid.pdf)

caf · on Dec 28, 2013

Right - the famous 30 instructions are in the hardware interrupt path, and responsible for putting the packet on the correct process's queue and waking that process up. As the original post says:

  The TCP protocol processing is done as we remove packets
  from the queue & copy their data to user space (and since
  we're in process context, it's possible to do a 
  checksum-and-copy).

The TCP protocol processing mentioned here done in process context is certainly not covered by the 30 instructions.

FuckFrankie · on Dec 27, 2013

Wouldn't "all the TCP logic" be the entire internet?

randyrand · on Dec 27, 2013

The internet is a very large and general term. It comprises millions of servers, routers, generally includes things like web browsers, http, etc.

TCP is just one protocol of the thousands of protocols used for the internet, typically to carry other protocols. UDP is also used on the internet.

"not all the logic" means that this implementation is not a full implementation of the TCP protocol.

Check out: http://en.wikipedia.org/wiki/Internet_protocol_suite http://en.wikipedia.org/wiki/OSI_model

ZoF · on Dec 27, 2013

Dewees?

tptacek · on Dec 27, 2013

No, it would not be anything close.

FuckFrankie · on Dec 27, 2013

Well, TCP is a very simple protocol so I can't imagine what the OP means by "all of TCP" unless he's talking about IP and it's associated services.

slashdev · on Dec 28, 2013

TCP is actually extremely complex, see http://en.m.wikipedia.org/wiki/Transmission_Control_Protocol

I don't think I could implement it in under 4000 sloc, and I'd guess it'd be over two months worth of work for just a minimal TCP stack that could interoperate with the internet at large.

userbinator · on Dec 28, 2013

TCP is only complex because of all the optional bits for performance tuning, but if you are only interested in size, look on the embedded side. The actual core functionality is much smaller, and I would even go as far as to say that a lot of the widely used open-source TCP/IP stacks obfuscate this with their added complexities.

E.g. http://en.wikipedia.org/wiki/UIP_(micro_IP)

tptacek · on Dec 28, 2013

Uh. No. TCP is complex because it works in the scenario where you, me, and 20 coworkers share a fast LAN hooked up to a slow pipe to the Internet and we all have to share that pipe constantly but nobody actually knows exactly who's trying to do what with it. TCP congestion control is a minor miracle even before you realize that this plays out writ large across the whole Internet, large-pipe-huge-pipe-small-pipe-big-pipe, without the Internet collapsing, which is what it used to do. One of the big challenges in designing fast scalable transports that gain speed by allowing drops or out of order delivery is in making them compatible with the congestion control regime that TCP implements.

TCP is complex because it solves a mindbogglingly complex problem. That it is as small as it actually is makes it one of the more elegant things ever to come out of computer networking.

derefr · on Dec 28, 2013

> One of the big challenges in designing fast scalable transports that gain speed by allowing drops or out of order delivery is in making them compatible with the congestion control regime that TCP implements.

I would argue that this very thing makes TCP inelegant; congestion-control should really have been its own layer between IP and TCP, instead of being something that every protocol that's not UDP has to carefully reimplement.

tptacek · on Dec 28, 2013

I said TCP was elegant, not that it was perfect. It also has the "urgent data pointer". :)

dlitz · on Dec 28, 2013

You're not necessarily wrong, but uIP may not be the best example of a good minimal TCP implementation. I've seen it routinely put the wrong value into the TCP window-size field. This creates an illusion of packet loss, causing senders to retransmit and performance to drop exponentially to zero.

As a workaround, I wrote a custom tool that avoids sending more than one TCP segment at a time:

https://github.com/dlitz/dlink-firmware-uploader/blob/581b64...

mikeash · on Dec 27, 2013

Because TCP is very simple, you can't imagine that "all of TCP" means what it actually says? Huh?

FuckFrankie · on Dec 27, 2013

You guys are just to smart for me, I guess I'll have to withdraw from having any conversation at all since I've been downvoted into hellban for asking a question.

RiderOfGiraffes · on Dec 27, 2013

Here's some very personal advice. This is my opinion, and others won't agree. However, for what it's worth ...

I would suggest that you would benefit from learning how to ask questions. Here's a start:

http://www.catb.org/~esr/faqs/smart-questions.html

However, there's more advice than that, and that, although a good start, is not to be taken as gospel. You probably also should read the site-specific commentary, and suggested alternatives:

https://mikeash.com/getting_answers.html

https://news.ycombinator.com/item?id=2911381

The point is this - you said:

    Well, TCP is a very simple protocol so I can't imagine
    what the OP means by "all of TCP" unless he's talking
    about IP and it's associated services.

Here's an alternative:

    I'm not sure I understand what's been left out of this.
    Could someone provide a synopsis, and a pointer to more
    details?

Your comment comes across as "Well I know everything, so the only thing that's left out must be all the internet." The second admits that you don't know what people are talking about, and asks for more information.

And just as a final question - why do you think you've been hell-banned? Are you sure? In particular, I've turned on "Show dead", and you don't seem to have made any comments that have been auto-killed. The evidence suggests you are not hell-banned.

upwardbound · on Dec 27, 2013

It also probably doesn't help that the username "FuckFrankie" sounds aggressive / combative

cpayne · on Dec 27, 2013

If you honestly think you've been downvoted for "asking a question", take a moment to think about the type of question you asked and the way you asked it.

TallGuyShort · on Dec 27, 2013

There is no need to be sarcastic. tptacek is pointing out that this is only one portion of that protocol. "The entire Internet" and the rest of the protocol are not the same thing. If you've been downvoted or hellbanned, it's not simply because you asked a question.

stonemetal · on Dec 27, 2013

It shows receiving data, it doesn't show anything about sending data including congestion control. It doesn't even show the full TCP receive a packet process since acking a packet is ... in the code. I also don't see anything about the state machine that occurs during connection creation or tear down. etc. Calling what is in the link TCP is about like claiming to describe the entire postal system in three steps(pick up envelope that has been delivered to your mailbox, open envelope, read letter).

lussier · on Dec 27, 2013

wslh · on Dec 27, 2013

If you liked this post you might like the "Appendix E: Extended Example: A Tiny TCP/IP Done as a Parser" by Ian Piumarta. The paper is available at: http://www.vpri.org/pdf/tr2007008_steps.pdf

grannyg00se · on Dec 27, 2013

I love the opinionated writing style. I wish I knew more about this so I could relate to the comments about the "compiler braindamage", "mbuf chain stupidity", and "netipl software interrupt bs".

minimax · on Dec 27, 2013

The "compiler braindamage" is that it is generating instructions to load individual registers 4 bytes at a time where it could instead generate instructions that load eight (properly aligned) bytes at a time into two four-byte registers at once. The bytes in question are some fields in the tcp header and some fields in the pcb (which is like all the state related to the TCP connection).

These instructions here are loading the header into registers:

        ld [%i0+4],%l3     ! load packet tcp header fields
        ld [%i0+8],%l4
        ld [%i0+12],%l2
        ld [%i0+16],%l0

That's the assembler for this:

        u_long seq = ((u_long*)ti)[1];
        u_long ack = ((u_long*)ti)[2];
        u_long flg = ((u_long*)ti)[3];
        u_long sum = ((u_long*)ti)[4];

ld loads 4 bytes at a time into a register but there is a sparc instruction ldd that will load 8 bytes into two registers at a time. If the compiler used ldd these four instructions turn into two. That gets us from 33 to 31. I'm not 100% clear which two fields from the pcb can be loaded simultaneously but there is this line

        ld [%i1+72],%o0                 ! compute header checksum

and then further down

       ld [%i1+68],%o0

which I think are:

       u_long cksum = tp->ph_sum;

and

        if ((flg & FMASK) == tp->pred_flags && seq == tp->rcv_nxt) {

Obviously that line is multiple instructions but what I meant is the part where tp->rcv_nxt is loaded into a register.

So probably tp->ph_sum and tp->rcv_nxt are adjacent in his version of struct tcpcb and he thinks the compiler should do use the same parallel load instruction (loading two registers from 8 bytes at once) for those fields.

darbelo · on Dec 28, 2013

The "mbuf chain stupidity" refers to the way memory buffers are handled in the kernel. His implementation uses 'pbufs' to store a single packet contiguously in kernel memory. This is in contrast to BSD[1], where a packet could span multiple 'mbufs' which were 'chained' into a linked list.

You can "man 9 mbuf" on your nearest BSD derivative to glean a bit more information about what he is opposing. Today's mbufs are not quite the same as they were back then, but the parts that he hated are still there :)

[1] "my kernel looks nothing at all like any version of BSD"

laichzeit0 · on Dec 28, 2013

For a walk through of the BSD TCP/IP source code there's also Steven's TCP/IP illustrated volume 2. The first few chapters painstakingly go through mbufs and exactly how the data structure is implemented and used. Much of this won't apply to current kernels but it's probably close enough for what Van Jacobson is talking about.

http://www.amazon.com/TCP-IP-Illustrated-Implementation-Vol/...

chetanahuja · on Dec 28, 2013

mbuf chains [1] (or something like them) are the canonical way of receiving data from the network. You basically "tag" every incoming chunk of data as an mbuf, and keep the data as linked lists of said mbufs (hence the mubf "chain"). Some key properties of incoming network data are that it's:

1) Arriving asynchronously, at arbitrary time points (actually the OS code gets to handle the incoming data in interrupt processing routines)

2) Arriving in arbitrary quantities in each new chunk(as opposed to, say, reading nice, aligned, blocks from the disk)

3) Possibly arriving out of order.

Now if you consider the requirements that you have this uncertain sized chunks of data that you (the network stack code) need to parse into (and possibly reorder into) one of many possible protocols then deliver it in a nicely packaged form to a user process (which may or may not be ready to receive this data at this moment) --- and, additionally, there's a lot of pressure to avoid unnecessary memory copies -- you'll be inevitably led towards an architecture that looks somewhat like mbuf chains.

The cited writeup and accompanying code has some crucial details hiding in the following snippet

"the Packets go in 'pbufs' which are, in general, the property of a particular device."

The point I think is that it's special purpose code written for a narrow use/demo case. Not meant to be taken seriously for an actual, general purpose OS.

tl;dr mbuf chains are inevitable for general purpose networking. Resistance is futile.

[1] https://developer.apple.com/library/mac/documentation/darwin...

delinka · on Dec 27, 2013

I read those things as sarcasm (e.g. "none of that buffer-overrun-checking garbage") but perhaps there's more to it.

anonu · on Dec 27, 2013

I am curious to know if today's modern TCP stacks (windows, popular linux distros,etc..) are coded with the same approach. Does anyone know?

bodyfour · on Dec 27, 2013

More or less, yes.

In the original TCP implementations from the early 80s, performance was (understandably) not the main priority.. they just wanted to get it working first. Also, nobody was sure what networking protocols would become popular (IP? OSI? XNS?) so a lot of work went into making everything as flexible as possible. This reached an apogee with AT&T's "STREAMS" subsystem (a competitor to the sockets API for writing networking code on UNIX) which was very flexible but also extremely complicated.

What Van Jacobson's work was saying is "look guys: by paying close attention to the fastpath you can saturate your 10Mbps network with TCP traffic" I'm sure everybody who has written a TCP stack has seen this email and has taken the tricks to heart.

These days not all of the tweaks might be still relevant. For instance, the hardware is probably doing the hardware checksum for you. However, the spirit of "do as little as possible in the fastpath" is certainly still followed in modern stacks.

oofabz · on Dec 27, 2013

Just as a historical note, classic Mac OS (versions 7-9) used STREAMS for TCP, in the Open Transport networking API. Sockets were provided as a wrapper around STREAMS, which worked about as well as you might expect. It had its fans but most of us were pleased to get real sockets in Mac OS X.

bodyfour · on Dec 28, 2013

I wasn't aware classic MacOS worked that way. The "sockets-emulated-over-STREAMS" was the standard way of doing things in SVR4-based UNIXes. In the early-to-mid 90s this gave them a poor reputation among early web admins since they just didn't handle high rates of connections as well as BSD or even the early linux stacks.

In the case of Solaris, this was largely fixed in the 2.6 release where they went back to a sockets-based stack and ran the STREAMS stack in parallel. (Actually this was originally supplied as a semi-supported patch to 2.5.1 since the scalability of the network stack was becoming a critical issue at large customers) Many of the OS-supplied services (like rpcbind I think) still used the STREAMS API but the 99.9% of external software that used the sockets API now had a fast native path to the network.

As far as I'm aware this is still the state there today, with STREAMS probably still existing for compatibility but basically ignored by everyone.

There were some interesting things Solaris did with STREAMS.. for instance telnetd and rlogind would just push a kernel-land module that would copy data between the pty and socket. This way you didn't have to make the kernel->user->kernel->user transition on every keystroke. In the heyday of shell accounts this was terrific. Of course these days CPUs are so much faster and everyone uses ssh anyway so it wouldn't be a useful optimization.

STREAMS was an interesting experiment, but I don't mourn its passing at all.

mietek · on Dec 29, 2013

In the age of System 6 and early System 7, before Open Transport, there was MacTCP, an add-on TCP stack, originally sold by Apple for $2500.

http://en.wikipedia.org/wiki/MacTCP

http://tidbits.com/iskm/iskm3html/pt4/ch17/ch17.html

nashashmi · on Dec 27, 2013

Might be off topic, but I have always wondered where do these public domain university emails come from? Was E-mail a public forum back then?

michaelhoffman · on Dec 27, 2013

I'm not sure what you mean by "these public domain university emails." This is an email that has been saved and re-forwarded many times, including to mailing lists such as the one archived in the main link. Originally it would have been forwarded by either the sender or one of the recipients.

Buge · on Dec 28, 2013

In the header at the top it says Cc: ips@ece.cmu.edu, so it was carbon copied to the mailing list.

radikalus · on Dec 28, 2013