BSD socket API revamp (githubusercontent.com)
119 points by rumcajz on Feb 18, 2017 | 89 comments



Am I missing something? It seems to propose ditching all non-blocking APIs under the assumption that "runtimes" will provide lightweight concurrency. But AFAIK most (all?) runtimes that provide lightweight concurrency make extensive use of non-blocking network APIs to work - including Golang, which is even cited as an example:

   During the decades since BSD sockets were first introduced the way
   they are used have changed significantly.  While in the beginning the
   user was supposed to fork a new process for each connection and do
   all the work using simple blocking calls nowadays they are expected
   to keep a pool of connections, check them via functions like poll()
   or kqueue() and dispatch any work to be done to one of the worker
   threads in a thread pool.  In other words, user is supposed to do
   both network and CPU scheduling. 
   [...]
   To address this problem, this memo assumes that there already exists
   an efficient concurrency implementation where forking a new
   lightweight process takes at most hundreds of nanoseconds and context
   switch takes tens of nanoseconds.  Note that there are already such
   concurrency systems deployed in the wild.  One well-known example are
   Golang's goroutines but there are others available as well.


Agreed; if anything, the blocking API seems like the more artificial one, as the lowest-level implementation needs to handle the asynchronous receipt of data. A network card receives data when it receives data, sends asynchronous notifications to the OS in the form of interrupts, and keeps receiving data while you pull data out of its buffers. You can build a synchronous interface on top of that for convenience, but it fundamentally works asynchronously.


Yeah, one would expect a new networking API proposal to embrace our asynchronous world and look into what people are already doing and why, rather than blocking APIs and multithreading. Personally, I was hoping for something in the spirit of netmap, with the whole stack living in userspace for performance and security reasons and APIs being completely non-blocking, with zero-copying and batching in mind.


Apart from the layering, which really has nothing to do with the sockets interface and can be (and has been, repeatedly, as this author knows) built today on top of the sockets interface, I don't really see what's interesting at all about this interface; it seems like a simple restatement of the existing blocking socket API.


It tries to describe in detail the minimal semantic preconditions for protocol implementations to be composable (both vertically and horizontally) while at the same time being efficient (so that a stack of 20 microprotocols won't grind to a halt). The most interesting parts are handshakes, how they are conflated across multiple protocols, as well as the semantics of error handling.

But the real proposition here is to have an ecosystem of microprotocols (e.g. a CRLF protocol that splits a bytestream into messages using CR+LF as a delimiter) which, by virtue of having common, well understood semantics, could be composed into ad-hoc networking stacks according to users' needs.
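
For illustration, a rough sketch of what such a microprotocol could look like in C, assuming a generic bytestream receive primitive underneath (bytestream_recv and crlf_recv are names made up here, not taken from the draft):

    #include <stddef.h>
    #include <sys/types.h>

    /* Assumed lower-layer primitive: read bytes from a bytestream socket. */
    ssize_t bytestream_recv(int s, void *buf, size_t len);

    /* Read one CRLF-delimited message into buf; return its length (without
       the CR+LF), or -1 on error or if the message doesn't fit. */
    ssize_t crlf_recv(int s, char *buf, size_t len) {
        size_t n = 0;
        char c, prev = 0;
        while (n < len) {
            if (bytestream_recv(s, &c, 1) != 1) return -1;
            if (prev == '\r' && c == '\n') return (ssize_t)(n - 1);
            buf[n++] = c;
            prev = c;
        }
        return -1;   /* message longer than the supplied buffer */
    }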


I'm not sure we have a sufficiently efficient concurrency mechanism that doesn't simply give the user control over scheduling decisions.

Depending on how many tens we're talking about, those tens of nanoseconds for a context switch might themselves blow your time budget even on ordinary hardware. 10 gigabits per second means one byte is about 800ps long, so a 64-byte ethernet frame (plus the gunk before and after) is about 7 tens of nanoseconds. To keep up with this traffic pattern, we need to be able to process packets at least this quickly.
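
For reference, here is that arithmetic spelled out (assuming standard Ethernet framing: 7+1 bytes of preamble/SFD and a 12-byte inter-frame gap on top of the 64-byte minimum frame):

    #include <stdio.h>

    int main(void) {
        double ns_per_byte = 8.0 / 10e9 * 1e9;    /* 10 Gbit/s -> 0.8 ns per byte */
        double wire_bytes  = 64 + 8 + 12;         /* min frame + framing overhead */
        printf("%.2f ns/byte, %.1f ns per minimum-size frame\n",
               ns_per_byte, wire_bytes * ns_per_byte);   /* ~0.80, ~67.2 */
        return 0;
    }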

A goroutine context switch plus one unbuffered channel read and one unbuffered channel write seems to take about 18 tens of nanoseconds on my machine when Go's runtime is made to use the same thread for everything. (I'm not sure how to better isolate context-switch overhead in Go, but a context switch, channel read, and channel write per packet does seem to be the anticipated use case.)


The question to ask is: Can you do better?

Say you use a state machine instead of using coroutines. Will one state transition of the machine be faster than one context switch?

180ns can definitely be improved on (with the C implementation in the project it's more likely to be 60ns or so). There's no fundamental reason why one approach should be faster than the other, given that both are doing basically the same thing, albeit at different layers of the stack.


I'm not clear on why coroutines or an explicit state machine are necessary to implement network protocols. We have been doing it for decades by writing straightforward imperative C code and for years by writing C++ template classes that are templated on the downstream type (CRTP). Both of these approaches result in near-zero overhead (computers are good at fetching and decoding consecutive instructions) or even negative apparent overhead (if the compiler sees caller and callee, inlining and dead code elimination can potentially eliminate lower-level protocol handling irrelevant to the higher-level protocol).
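
To illustrate the plain-C flavour of that, here is a rough sketch (all names invented for the example; handle_message stands in for the upper-layer protocol): the lower layer is a static inline function, so the compiler is free to flatten the whole stack into straight-line code and drop anything the upper layer doesn't need.

    #include <stddef.h>
    #include <stdint.h>

    /* Lower layer: strip a 2-byte big-endian length prefix from a packet.
       Returns the payload length, or 0 if the packet is malformed. */
    static inline size_t framing_decode(const uint8_t *pkt, size_t len,
                                        const uint8_t **payload) {
        if (len < 2) return 0;
        size_t body = ((size_t)pkt[0] << 8) | pkt[1];
        if (body > len - 2) return 0;
        *payload = pkt + 2;
        return body;
    }

    /* Upper layer: assumed application-level handler. */
    void handle_message(const uint8_t *msg, size_t len);

    void handle_packet(const uint8_t *pkt, size_t len) {
        const uint8_t *payload;
        size_t body = framing_decode(pkt, len, &payload);
        if (body > 0)
            handle_message(payload, body);   /* plain call: no context switch,
                                                no explicit state machine */
    }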

I would guess that transforming coroutine-based code into state machines hampers performance over the long term by scaring the programmer away from wanting to touch it. However, I would also guess that the direct cost of a coroutine context switch (save and restore all callee-save registers and stack pointer then indirect-jump somewhere unpredictable) outweighs the direct cost of a state machine transition (update some variable).

I do not want to pass judgment on the approach beyond saying that its attendant overhead is too high for some workloads.


Ok, but doesn't "straight imperative code" rely on a blocking API underneath it? If it does not, the implementation has to store the current state in case it cannot proceed immediately, then restore it later -- which is a context switch.

As for state machine vs. coroutine performance, agreed. But the cost is somewhere around 10-15ns per context switch, which may be worth paying to get maintainable code.


You can use the usual BSD socket API (in which case the context switches to/from the kernel for calling read() put the contemplated workload out of reach) or any number of vendor-specific userland network stacks. There is no need to context switch. (Function calls and function returns are not context switches.)


Is 180ns that bad? Assuming that at each turn you process a packet of 180 bytes...


180ns overhead per packet means the time budget per packet for your business logic goes down by 180ns. That said, I suspect you can live a long life without ever encountering a situation where this matters.


I think you've read that wrong. It's the other way around.

Runtimes use non-blocking kernel interfaces to provide the convenient blocking interfaces to lightweight threads. So, no blocking kernel interfaces are needed anymore.


Ditching nonblocking mode would be ok if threads worked (by some definition of "worked": for example, 100K threads waiting on reads on open sockets performing well).

But I think you are right that in present-day reality all the runtimes that provide "workingness" for concurrency (Go, perhaps the JVM, Erlang, homebrew libraries) do so by working around system threading in one way or another, and to do so they use non-blocking interfaces.


The implementation in the project uses its own lightweight threads (ctx switch ~20ns, possibility of arbitrarily small stacks, even stacks located on parent thread's stack). Also, it's written in C, so other languages can wrap it.


Didn't notice this on my first read, but this is by Martin Sustrik. Sustrik is a former ZeroMQ contributor and, more recently, author of the abandoned zmq alternative nanomsg [1], as well as the libmill [2] and libdill [3] lightweight concurrency libraries.

  [1] http://nanomsg.org/
  [2] http://libmill.org/
  [3] http://libdill.org/


Is nanomsg truly abandoned? I was hoping it was simply converging toward stability.


Hmm, true, my language was a bit ambiguous. I meant it was abandoned by Sustrik [1], even if some of the community are doing their best to keep it alive.

  [1] http://www.freelists.org/post/nanomsg/New-fork-was-Re-Moving-forward


That seems a bit outdated; it looks like the main nanomsg repo has regular commits from gdamore now [1], so maybe the fork was resolved?

1. https://github.com/nanomsg/nanomsg/commits/master


Yep, it was eventually resolved midway through 2016; D'Amore stepped up, then stepped down, then stepped up again. The link I cited is not a commentary on the current status of nanomsg; it is more to illustrate Sustrik's role in the situation. If, ultimately, this discussion is about Sustrik's IETF proposal, it is important to consider his relationship with previous similar (zmq/nanomsg) or related (libmill/libdill) projects which have borne his name.



Why is this being proposed to the IETF? It's not an issue visible at the wire protocol level. It doesn't affect interoperability. Sockets belong to the POSIX spec, and to some extent to language specs.


There are already other RFCs regarding the socket API, such as RFC3493 (IPv6 socket API).


There even are RFCs (RFC 2783 comes to mind) that specify OS-level APIs that are only tangentially related to networking.


Very interesting idea overall. I've long thought the current segregation of network protocols between userland and the kernel is rather arbitrary and inflexible - why should reliability (TCP) be in the kernel but security (TLS) be in userland?

I don't think the idea of requiring Go-style lightweight threading is viable. Lightweight threading has many nice properties, but also inherently requires more expensive operations to deal with split stacks, and tends to use more memory than a manual approach. It's also rather poorly supported in general by existing languages and OSes (other than Go), while any new OS-level socket API should be as universally accessible as the existing one. In particular, many scripting-ish languages either have no concept of threading at all or only support isolated threads whose communication is limited to relatively slow message passing. In theory the language implementations could write a C shim to expose a non-blocking interface around blocking underlying operations - this is what libuv does today for file I/O, for instance - but the result would be a lot of unnecessary overhead.


I've come to think that the whole distinction "kernel vs userland" is inflexible.

There was a thread here on HN a few days ago, and it showed some clear deficits of conventional OSes, for example in processes: lack of composability of clients and linked-in libraries, IPC and uncontrolled implicit environment inheritance (Docker... :<), and the ad-hoc security model (Unix permissions, setuid hacks, and more recently extended attributes).

There's just a single trust boundary in Unix. What if the system was instead just a graph of interacting entities, each managing control over some resource? The "userland process" would just be another entity that is in control of another (more ephemeral) resource. We wouldn't even separate address spaces! Security by careful propagation through fine-grained trust boundaries (instead of globally accessible resource managers that have a hard time deciding whether a requesting entity should be authorized). Think svn vs git.

And my understanding is that that's much of what Lisp Machines were about...


Define 'separate address spaces': you have linear (virtual) address spaces and kernel/user 'address spaces', which are address spaces but not in the way linear address spaces are.

It would be easier to simply move almost everything not needed by the kernel into 'userspace' and have some sort of capability system to enforce access rights.


There is a missing word; I meant to say "we wouldn't even need separate address spaces" ("separate" was meant as an adjective, not as a verb).

And by that I meant that we would need no hardware support for separate address spaces. The kernel and all processes would share a single address space. We could look at memory as just another resource, and readable or writeable chunks of memory ("buffers") would only be accessible through authorized "memory arbitrator" entities. Of course, some mechanism is required to prevent programs from making up buffers on their own; this can be done in hardware, of course, but alternatively also by a software mechanism, e.g. enforcing a higher-level programming language like Lisp.

So one could even say that each such buffer is its own address space; however, I'm convinced the approach is much simpler and less arbitrary (no coupling of process<->address space).

"Virtual address spaces" are a required abstraction, just as filesystems need to provide the illusion of contiguous available storage for files. This could of course be done in software, just as filesystems do it in software. However it would be a lot slower since we don't batch memory accesses like we batch file accesses.


The key thing here is the separation of memory protection from memory addressing. Virtual memory addressing is a required abstraction to deal with stuff like paging. Memory protection does not need to be tied to addressing, and once you do that you can basically get this buffer abstraction that you are describing.

The buffers are just chunks of the global address space that this process has been allowed to access. Processes already deal with a similar abstraction, what does mmap return but a buffer of memory? The only difference from the point of view of writing software is that pointers in shared memory start to work better.


There have been a bunch of single address space (64-bit!) OS projects: http://wiki.c2.com/?SingleAddressSpaceOperatingSystem


IIRC, Microsoft tried to build something along those lines with the Singularity research operating system - it did not use virtual memory to isolate processes from each other, but it strongly restricted how processes could exchange data, thus getting pretty much the same isolation level conventional operating systems get by using virtual memory.

I even think the source code to Singularity has been released under an open source-ish license, if you want to look into it.


Thanks, I will look it up!


Could you please link to the mentioned thread?


https://news.ycombinator.com/item?id=13621623

If you search for my name there, you will find at least discussions about file modes and also why I claim that processes are a failed abstraction (it was really another user who pointed that out).


The big reason security is in userspace is because it requires policy decisions. How do we determine which remote users to trust? WoT, cert chains, trust on first use? There is kernel-mode security (IPsec) but it turns out that userspace is better equipped to make those decisions.


Actually, policy decisions are a big part of why I think TLS should be "in the kernel". It doesn't have to literally be in the kernel - it could be a userspace daemon instead - but that's an implementation detail. What matters is that just like with TCP, there should be one high-level API, provided by the system, which all applications are expected to use.

Scenario: You want to temporarily ignore an invalid certificate. (Yes, I know this is usually a bad idea, but sometimes it’s necessary.)

Today: Every application needs its own flag, environment variable, configuration option, and/or UI element to disable validation. If you’re unlucky, there won’t be one. If you’re really unlucky, there won’t be any way to enable validation; even in 2017 there are still lots of poorly written applications like that.

Centralized: There should be a standard environment variable that works with everything, and/or maybe a “security center” applet where you can add a temporary system-wide exemption per-host, or something like that.

Scenario: You want to sniff a secure connection for debugging purposes.

Today: You have to use a MITM proxy, which is gross and needs setup and breaks things, and requires trusting invalid certificates, which doesn’t always work; see above and below.

Centralized: The kernel could just hand decrypted packets to packet sniffing programs, allowing for purely passive observation.

Scenario: You want to ensure that you’re using high-quality ciphersuites.

Today: Up to the application, usually not customizable, so browsers will do the right thing and other applications will have some list the developer copy+pasted that seemed to work. Welp.

Centralized: The system should be able to specify policy for all applications, with the option for the user to override it. After all, it’s almost never actually desirable for a cipher suite to be supported by one application and not another - like, it’s okay if my mail client connects to Google insecurely, but not if my instant messaging client does.

Scenario: You want to add a custom trusted CA, or generally you want different applications to be consistent in their policy decisions.

Today: Different TLS libraries have their own trust store formats and trust rules. If you’re lucky, your TLS library of choice will have some way to use the system trust store, on Windows and macOS where such a thing exists, although who knows if it’ll properly replicate the various rules browsers like to impose, e.g. requiring certificate transparency. If you’re unlucky, you end up with things like Java having its own trust store, or wget on macOS refusing to download any https URLs until you go install some .crt file and point wget to the right path. In both cases, you end up with a separate list of trusted certificates, that requires a separate process if you want to add a custom one, or upgrade to remove newly distrusted roots, etc.

Centralized: Well, there’s already some centralization with system trust stores, but ideally not only would those be supported universally, the entire process of “should this TLS connection be established” could be defined and maintained in one place.

Scenario: On a server, you want to run a program that accepts TLS connections.

Today: Each program needs its own copy of the private key and its own TLS configuration, which often doesn’t expose the full range of settings it’s desirable to customize.

Centralized: The system holds the keys (reducing both configuration burden and the danger of leaks); applications can just say “listen on port X using TLS”, and send/receive plaintext on the resulting sockets, and encryption will be handled automatically.

...

I admit there are also some downsides of a centralized design. One of them is updates: these days, on most systems, users update their browsers a lot more than they update the OS, so the browser is likely to be up to date while the OS is not. But "at least the browser is secure, even if other apps are not" is not exactly an ideal outcome, and this too seems like an artifact of history... there's no reason that OS components can't auto-update like browsers do.


>"Actually, policy decisions are a big part of why I think TLS should be "in the kernel". It doesn't have to literally be in the kernel - it could be a userspace daemon instead - but that's an implementation detail."

Can you explain this? How is something "in the kernel" but implemented as a userspace daemon? Are you suggesting something like a vDSO, the way gettimeofday works? That's the only thing I can think of that would come close to that.


Yeah, my wording wasn't the best.

My dream would be for "TLS sockets" to be integrated into the system socket API as a first-class citizen next to (plain) TCP and UDP sockets, ideally as part of a POSIX standard. If that were done under the current BSD socket API, it would require at least some kernel support, because that API revolves around file descriptors and system calls like setsockopt, read/write, poll, etc. On the other hand, if it were done as part of a broader revamp of the API, like the one proposed above, well, ideally that revamp would support custom user-level protocols, so TLS could be implemented as one of those rather than involving the kernel.
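
Purely as a thought experiment, the first variant ("TLS sockets" grafted onto the current descriptor-based API) could look something like this; IPPROTO_TLS, SOL_TLS_X and TLS_HOSTNAME are invented names for illustration and exist in no real header:

    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define IPPROTO_TLS   253   /* invented protocol number */
    #define SOL_TLS_X     299   /* invented socket option level */
    #define TLS_HOSTNAME  1     /* invented option: name to validate against */

    int tls_dial(const struct sockaddr *addr, socklen_t addrlen,
                 const char *hostname) {
        int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TLS);
        if (fd < 0) return -1;
        /* The system TLS stack would validate the peer certificate against
           this name, using the system trust store and system-wide policy. */
        setsockopt(fd, SOL_TLS_X, TLS_HOSTNAME, hostname, strlen(hostname));
        if (connect(fd, addr, addrlen) < 0) { close(fd); return -1; }
        /* From here on, read()/write() carry plaintext; encryption is
           handled by the system, just as TCP's reliability is today. */
        return fd;
    }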


That is actually the case. If you look at the GitHub project, there's a "running code" implementation of the proposal and, yes, it supports TLS.


I see, thanks for the clarification.


"at least the browser is secure, even if other apps are not" is not exactly an ideal outcome,

That sounds like an ideal (if strictly unattainable) outcome. Your design seems like a sort of software-defined violation of the end-to-end principle. If it existed, most apps that have strong feelings about security simply wouldn't use it.


It is ideal from the browser's perspective, but not for the user who is also using other applications in addition to the browser :)

It only takes one confused app autoupdater (typically secured at least in part with TLS) to get pwned.

If browsers want to be resilient to the possibility that the system stack is not secure, I don't mind terribly. After all, browser vendors have dedicated TLS security teams and certificate trust programs, so they can do their own thing and ensure it's secure. Custom TLS stacks get in the way of some of the goals I mentioned, such as user-customized trust settings and packet sniffing, but that's not the end of the world. What I do mind is if the system stack is actually not secure. If nothing else, that means that applications bundled with the system are likely to be at risk.

Non-browser third party applications have to outsource their security policy to someone. They can choose to trust the OS, or to get a TLS stack from somewhere and ship that. In today's world, both approaches have ups and downs from a security perspective. The basic problem is that for maximum security, TLS stacks and their configurations should be updated regularly, mainly to update the trusted certificate list, but also to address vulnerabilities in protocols, ciphersuites, implementation, etc. But many users rarely update their OS. On the other hand, even if you ensure that users update your app (via an autoupdater), you're still responsible for ensuring you ship updates promptly. Even if you cease development because you lost interest or your startup got bought for a trillion dollars or went bankrupt or whatever. And this applies to every single program that wants to establish a secure connection, which these days is basically everything that connects to the Internet.

Of course, on Linux distros, application updates tend to come with system updates, so you probably can't do any better than trusting the system, at least when it comes to updates...

Today, I think many applications which would be happy to trust the system ship their own TLS stacks anyway, just because it's easier - because of fragmentation and the lack of good wrapper APIs in many environments. For example, the Rust ecosystem has thus far mostly depended on OpenSSL bindings for TLS, but there's a desire to move to the new "rust-native-tls" crate, which wraps SChannel on Windows, SecureTransport on Darwin, and OpenSSL elsewhere. Works for Rust; but if there's a C equivalent, I haven't heard of it...


Most of the scenarios you described didn't seem all that user-centric to me. What users want is apps as secure as possible. 'Features' like something-instead-of-the-app renegotiating ciphersuites for it or unencrypted packets flowing through the kernel or forcing a cert-pinned app to cough up its private credentials to some TLS 'service' don't make apps more secure. I'd go as far as to invoke the immortal words of Chinese-subtitled Darth Vader: 'DO NOT WANT'. I think the notion that encryption should be done at the endpoints is neither new nor controversial. I'm not sure what you're proposing offers enough to abandon it.


When applications use a custom or nondefault configuration, most of the time the result is indeed less secure. The classic example is lazy developers doing things like turning off CURLOPT_SSL_VERIFYPEER, but it goes beyond that. Most application developers have no opinion on what ciphersuites are good, but OpenSSL forces you to choose, and chances are that choice will be hardcoded somewhere and never updated. (Other TLS libraries do this better.)

Browsers lately have been imposing conditions on individual root CAs that are more complex than a binary trust/distrust. For example, they have added artificial TLD limits, certificate transparency requirements, or more complex stuff designed to address specific instances of bad behavior by CAs (take a look at mozilla.dev.security.policy). There's no good way to implement that systemwide if there are N different SSL libraries in use, often statically linked, with their own configurations. So you end up with the restrictions only being applied to the browser. Which is less secure.

That's the situation today. Of course, any given issue can be solved many different ways, and there are less disruptive approaches to improving them than what I'm suggesting. My suggested approach is far from necessary, and indeed is unlikely to happen in reality. But you seem to have a gut distaste for it which doesn't resonate with me.

As for server programs "coughing up" private credentials, it's quite the reverse: the administrator doesn't have to stick the credentials in some configuration file for a server that they may not trust particularly well. They don't have to ensure the permissions are correct in N places, but in one. Privilege separation is generally recognized as a good approach to security. (In fact, it could even be possible for unprivileged Unix users to start TLS-enabled servers on high ports without giving them the certs.)


No, not servers. Like, say, the dropbox client that comes with its own cert. It would have to give it up to your 'tls service' thing. The server case already has pretty standard solutions in any but trivial deployments - something specialized terminates SSL (which is a bit like your idea but it already exists).

But again, your solution to 'some app makers don't do TLS right' is 'hand over plaintext to the kernel'. The cure seems far, far worse than the disease. I don't think this is a 'gut distaste'; it's just that most similar ideas of a generalized way to move encryption away from the endpoints have failed, and standard practice leans very much the other way for reasons that I think are pretty good. My argument isn't so much about anyone going off to implement or not implement this, it's that if you're going to think about this stuff (a perfectly fine thing), you have to meet the bar of the current accepted practice and then offer some, however hypothetical and thought-experimenty, improvement. And I'm just not convinced that this idea does.

Edit: You also mention (and I managed to skip over, sorry) the fact that browsers have complex certificate trust handling policies which other apps don't get to leverage. This is true but your example of an app is 'an app that pretends to be a browser'. I think this is a special, narrow case. Consider the dropbox client - it doesn't need or care about CAs, chains of trust, PKI or your system trusted cert store. That's browser stuff. Install mitmproxy and watch dropbox simply fail to work. So your solution is a sort of 'key escrow' demand for any app that just wants to talk TLS, not emulate the complex trust dance of a browser.


I'm mainly concerned with applications that follow standard protocols and allow the user to ask them to connect to arbitrary hosts. There are still a lot of those, even if these days standard protocols are dying out in favor of centralized proprietary services ;)

That includes both non-HTTP TLS-wrapped protocols like email and IRC, and applications that access user-supplied HTTP URLs but aren't browsers, like media players, RSS clients, git, Matrix (chat), wget, apt-get, etc.

Proprietary services are trickier because there's a question of "security for whom". As someone who mostly uses computers he owns, I would definitely like to be able to sniff Dropbox and see what it's sending. (Actually, Dropbox is a bad example because most Dropbox users already trust them with a lot of personal data, but there are many stories of, e.g., random mobile apps sending user data to servers with no justification.) That's not a security compromise from my perspective because it's my data. But Dropbox or whoever may not want people to get at their app's internal communications, and may see it as a compromise from their perspective. And if I'm using a system where someone else has root, I might agree with them. Of course, root can always attach with a debugger and try to extract the data anyway, but then there's the cat and mouse game of obfuscation...


> ask them to connect to arbitrary hosts. There are still a lot of those

Yep, I get that and I understand the problem you're describing is an actual problem. I just don't think 'endpoint hands over its plaintext' is the right solution.

> I would definitely like to be able to sniff Dropbox and see what it's sending

Well, I'm sure you can, it's just not as easy as firing up tcpdump. But if you wanted to you could, it probably won't even take you long. 'It's not as convenient as firing up tcpdump' is still not a strong argument for 'just hand over the plaintext', in my mind.

I'll wrap this up with a story you might have also seen play out over Twitter the last couple of months. Tavis Ormandy, a security engineer currently working on Project Zero at Google, has been dissecting various AV products and finding them riddled with pervasive and gross security holes, up to and including intercepting TLS traffic and then screwing it up so badly as to make it highly vulnerable.

The response from one vocal AV expert has been 'well, we wouldn't have to do that [and screw it up] if you gave us a tap/API into the plaintext so we can keep you all safe'. The predictable retort of security engineers has been 'LOL, NO'. I think ultimately, that 'LOL, NO' is what you're up against rather than some rando ranting at you on a message board. Enjoyable as it has been!


Most OSes provide a mechanism to manage SSL certificates and an OS-provided SSL library. Sure, not every application uses that, but then how is a new standard going to fix that?

The only thing providing it in the kernel would do is decrease performance and require kernel updates to fix SSL vulnerabilities.


So, IPSec.


How would a design with TCP done entirely in userland work? A given socket address might in one instant belong to uid 10, and in the next, after uid=10's connection closes, to uid 20.

(I'm not challenging you, just asking what you think the semantics should look like).

Worth adding for everyone else, though I'm sure you know: high speed network implementations already pull a lot of this stuff into userland.


The thing is, if addresses are scarce, you do need something (typically a kernel) to allocate them. Port numbers on a single IP are scarce. IPv6 addresses, though, if each machine has a /64, are not, and it becomes much easier: each (64-bit!) pid can have its own IP with no reuse, so routing can be dumb.

(And/or encrypt everything, keys are not a scarce resource either).


There are a few different ways to answer that question.

First, I believe that for most applications, it's not that TCP should be in userland; rather, TLS should be in the kernel, or otherwise system managed. See the wall of text I just replied to someone else with.

Second, if you do want to customize the transport layer, which is of course fairly common today - multiplayer games often go for some custom "reliable UDP" protocol that drops ordering guarantees to improve latency, and then there's Chromium with QUIC, and uTorrent's µTP, and WebRTC... well, the reality of networking today is that it has to be done on top of UDP rather than IP directly. I doubt there's much point writing a new API for TCP itself that exposes all the bells and whistles, because the design of TCP is dated. Rather, there should be more standardized APIs for applications to more easily use replacements for TCP that run over UDP. Ideally I should be able to switch from TCP to QUIC just by changing a few lines of code.

(Edit: By the way, if we could redo networking protocols from scratch, I think ports ought to exist at the IP layer rather than being replicated in TCP and UDP, and then UDP could be abolished entirely. But we can't.)

Actually, that ties into high performance networking too. As you know, those alternate stacks require userland to talk directly or near-directly to hardware, meaning you just have to give up on different ports belonging to different applications. But beyond that, one thing holding those stacks back is that they require a lot of custom code and don't work with existing applications. With a standardized networking API that supported userspace plugins, it could be possible to add "DPDK+rumptcpip" as an option next to "kernel TCP" and "QUIC", and configure any random process to use it.


> I doubt there's much point writing a new API for TCP itself that exposes all the bells and whistles, because the design of TCP is dated. Rather, there should be more standardized APIs for applications to more easily use replacements for TCP that run over UDP.

The advantage TCP has over UDP is that middleboxes know what TCP FIN is and so are willing to use much longer timeouts for TCP sessions. For example the default Linux connection tracking timeout for established TCP connections is five days but for UDP streams it's three minutes.

So if you need a long-lived session to receive event-based messages your choices are to use UDP with a mapped port using NAT-PMP or PCP (ideal but not always available), use UDP with frequent keepalives (expensive), or use TCP.

Being able to do TCP in userspace would be very useful for any VPN-like thing because you could get the long timeouts but still deliver packets immediately even if an earlier one was lost, and avoid the TCP-over-TCP performance degradation by deferring congestion control to the inner packets.


This can be done, for example, by having an address allocation daemon running in user space. The socket library would create an IPC connection to the daemon and ask for an address/port. The daemon would reserve the address/port and pass it back. The daemon would have to health-check the applications so that it can return the addresses to the pool in case of failure.
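
A sketch of the client side of such a scheme, assuming a hypothetical daemon listening on a Unix socket (the socket path, the "ALLOC" request and the 2-byte reply are all made up for illustration):

    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    /* Ask the (hypothetical) port allocation daemon for an exclusive port.
       Returns the daemon connection fd, or -1 on error. */
    int request_port(uint16_t *port_out) {
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0) return -1;
        struct sockaddr_un sa;
        memset(&sa, 0, sizeof(sa));
        sa.sun_family = AF_UNIX;
        strncpy(sa.sun_path, "/run/portd.sock", sizeof(sa.sun_path) - 1);
        if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0 ||
            write(fd, "ALLOC", 5) != 5 ||
            read(fd, port_out, sizeof(*port_out)) != sizeof(*port_out)) {
            close(fd);
            return -1;
        }
        /* Keep the connection open: the daemon can treat its closure as the
           application dying and return the port to the pool, which covers
           the health-checking mentioned above. */
        return fd;
    }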


Basically the same way SOCK_DGRAM works, except that send() and recv() expect/provide the TCP header before the transport payload.

If a user process calls bind() on some address/port then it gets all TCP packets for that local address/port. If it calls connect() then it gets only the ones that match both the local and remote address/port.


With packet retransmissions and out-of-order packet delivery, wouldn't there be a risk that the client software at uid=10 sees a clean shutdown and disconnects, but uid=20 instantly binds with the same ports and steals a bunch of delayed packets?


That happens with UDP already and is another reason everything ought to be encrypted with session keys.

But the normal TCP solution of the OS holding released ports in TIME_WAIT for a couple of minutes still works as ever.


OT: How are text files like this (or any RFC) written? Is there a standard format, and is it done in a specific way?

EDIT: Found it! There was actually an RFC for that. https://tools.ietf.org/html/rfc7322


So RFCs are self-hosting? ;)


Nroff traditionally, but now there are any number of foo2rfc tools.


I'd like to see some comparison with the STREAMS layered/stackable API from SVR4 which let a developer do just the sort of layering this API proposes. What makes it better than STREAMS, does it share the same pitfalls, etc.?


Also Plan 9.


There doesn't seem to be much of anything interesting here. I do think there's a lot to consider in a new system design, but the fact (brought up at the start) that you can't send any new protocol over the internet, and can't send anything but HTTP through half of it, really limits you.

If I were designing an API, the RX side would be asynchronous and batch multiple connections, and the TX side would let you assemble your own TCP packets. This rules out ideas like sendfile() which IMO became nonstarters when we moved to HTTPS.

And obviously the accidental complexity like fcntl vs. setsockopt, bind/connect, shutdown/close, multithreading+signals, have to go somehow. It doesn't really matter though. I can't imagine any major world crises are being caused by BSD sockets.


A very long time ago, coming from VMS, a big reason I preferred BSD over Sys V was non-blocking sockets. That was over 20 years ago, but now it's going away?


Something this API gets right is having a unified interface for both IPv4 and IPv6. With the sockets API, you have to decide for one of them. Changing isn't easy as the constants and structures are all named differently.

While it's possible to use IPv6 sockets for IPv4 connections, this doesn't cover all use-cases. For example, you can't do IPv4 broadcast with an IPv6 socket. Additionally, as most examples are written for the classic IPv4 API, that's what everyone uses by default. Later on, when people complain about missing IPv6 support, they are turned down because it's a ton of work to change.


For the majority of applications, supporting IPv6 boils down to using getaddrinfo()/getnameinfo() instead of gethostbyname()/gethostbyaddr(), which results in code that supports both IPv4 and IPv6 and is simpler than the IPv4-only original.
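
For reference, the usual protocol-agnostic client connect loop looks roughly like this (error reporting trimmed):

    #include <netdb.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    int dial(const char *host, const char *service) {
        struct addrinfo hints, *res, *ai;
        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_UNSPEC;      /* IPv4 or IPv6, whichever resolves */
        hints.ai_socktype = SOCK_STREAM;
        if (getaddrinfo(host, service, &hints, &res) != 0)
            return -1;
        int fd = -1;
        for (ai = res; ai != NULL; ai = ai->ai_next) {
            fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
            if (fd < 0) continue;
            if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0) break;
            close(fd);
            fd = -1;
        }
        freeaddrinfo(res);
        return fd;
    }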


The catch is you also have to be prepared to listen on multiple sockets. There are lots of servers out there which get this wrong, including qemu-nbd.

For reference here's my implementation of this which (I think) gets it right: https://github.com/libguestfs/nbdkit/blob/master/src/sockets...


A good dual-stack implementation should also support Happy Eyeballs which does not result in simpler code.


> Changing isn't easy as the constants and structures are all named differently.

They are all named "struct sockaddr_storage", no?

You won't even need that most of the time, as it's all dealt with by getaddrinfo and friends (see sibling comment). Sure, there are some cases where it breaks down but the vast majority of software doesn't go there.


The problem with proposals like this is that they paper over the fundamental problem:

Time is an input variable and really should be part of the API.

The problem is that sometimes you want the end user to have control over time and sometimes you might not want the end user to have control over time.

If I'm on Linux, I certainly should not be able to manipulate the "time" of the communication stack in order to send faster/more aggressively than I should be allowed.

If I'm on embedded, I really want to be able to manipulate the "time" of the communication stack depending upon what I am doing.


Could you clarify? I don't get where you are heading with that.

At least the concept of a "stream" isn't really linked with time. It's just a sequence of bytes.


The issue is that abstractions "leak".

Let's take your "stream", for instance.

It's a sequence of bytes.

But is it a "maximum bandwidth" stream of bytes (gigantic file transfer) or is it a "minimum latency" (audio packet) stream of bytes?

If I know or can control what the "time" the stack is working toward, the stack doesn't have to know the difference between those two.

In addition, when on embedded, I quite often want a stack API which is "foo_init(), foo_deinit(), foo_queue_send(), foo_queue_action(), etc.", which all update internal state,

and then "foo_make_incremental_progress(foo_state, foo_now, ...)", where ONLY that incremental progress function actually carries out actions like reading hardware, writing hardware, timing out, etc. Now, I can use whatever concurrency construct I like without the stack getting in the way.

Now, that isn't necessarily the fastest performance as the stack needs to be structured such that it only carries out one "action" at a time on each incremental call. However, it's very flexible. It also has the wonderful property of being repeatable and testable. Something which TCP stacks are notoriously resistant to.
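
A header-level sketch of that shape, following the naming in the comment above (types and signatures are illustrative only):

    #include <stddef.h>
    #include <stdint.h>

    typedef struct foo_state foo_state;      /* opaque protocol state */

    void foo_init(foo_state *st);
    void foo_deinit(foo_state *st);

    /* These only update internal state: no I/O, no blocking, no timers. */
    int  foo_queue_send(foo_state *st, const void *buf, size_t len);
    int  foo_queue_action(foo_state *st, int action);

    /* The single place where hardware is read/written and timeouts fire.
       The caller supplies "now", so time is an input and runs are
       repeatable and testable. */
    int  foo_make_incremental_progress(foo_state *st, uint64_t foo_now);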


> To make the API less tedious to use, short protocol name, e.g. "ws", SHOULD be preferred to the long name, e.g. "websockets".

Well that's a classic dumb mistake.

Programming isn't about typing.

Why not save everyone the effort of having to learn pointless things or look things up? There is no ambiguity with "websockets", whereas "ws" describes nothing without context.


Another example: use "tcp_connect" instead of "transmission_control_protocol_connect".


I'd make an exception for that, but it has nothing to do with the length of the function name.

TCP is much better known as an abbreviation than in long form (most people who use it do not actually know what it stands for).

Communication is about getting the intent across and not obscuring it, after all...


Yet another example: "dccp_send" or "datagram_congestion_control_protocol_send"?


Specifying a deadline timestamp instead of a countdown could run into problems if a given machine has an improperly configured clock.

I think the risk of any machine having an incorrect timestamp is greater than that of a library implementer not recalculating the countdown correctly. Whenever possible, I prefer to stay away from timestamps. They're a fickle thing.


Why would it matter? The deadline would be specified in terms of a monotonic clock that's unaffected by wall time adjustments. So even if some crude NTP implementation steps your clock by seconds (or a leap second is inserted), it shouldn't matter for the deadline calculation.
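
For example, turning a relative timeout into an absolute monotonic deadline is a couple of lines on POSIX (deadline_after() is a hypothetical helper, not something from the draft):

    #include <stdint.h>
    #include <time.h>

    /* Deadline in milliseconds on the CLOCK_MONOTONIC timeline, unaffected
       by NTP steps or wall-clock changes. */
    int64_t deadline_after(int64_t timeout_ms) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000 + timeout_ms;
    }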


I think most of my timeouts/deadlines are relative to the monotonic time. Hence specifying an absolute monotonic time would just involve yet another syscall (to query the time). Is there a really good argument for specifying an absolute monotonic time?


Then you need to send information to agree on a monotonic clock time.


To clarify, none of the BSD projects have anything to do with this "revamp" not-a-real-RFC.


"BSD sockets" is the name of the existing socket API.


It's not some "not-a-real-RFC", it's https://en.wikipedia.org/wiki/Internet_Draft


Sounds good. Please add a flushHint() call as well: http://www.forwardscattering.org/post/3


The interesting question here is: once we have a flush() function, does that mean that protocols are allowed to delay data indefinitely if flush() is not called?


Yeah, that is an interesting question. I think if the provider of the socket interface could be sure that the client code using the interface knew about flush(), then it wouldn't need to set timers or anything like that to flush after X ms etc. However, there wouldn't be any point, I think, to buffering up more than e.g. an MTU worth of bytes. So it wouldn't be delaying indefinitely.


Already exists in popular implementations: https://t37.net/nginx-optimization-understanding-sendfile-tc...


There's a typo in section 3.4, paragraph 4: s/bystream/bytestream -- unless it's an attempt to be clever.


Fixed.



