Unit testing a TCP stack (2015) (snellman.net)
135 points by PuercoPop on Aug 28, 2021 | 35 comments



There's a lot of pushback from engineers - especially people at lower levels of the stack - against testing infrastructure. One particularly famous example is Linux. Rather than testing code before merging it, they merge it and then test the release candidate as a whole. It also seems game developers are extremely against automated testing frameworks in general. I've heard many times that it would be impossible to develop an enemy AI in a test-driven way (I did this for a senior project in college - finished the AI before the game was able to even start testing it [0]).

I wonder what would need to happen to convince people that:

1. Even if you do something extremely low level, you can draw a distinction between your hardware and the interface that 99% of your software runs at.

2. You can develop complex behaviors iteratively with automated testing just like you can develop complex programs iteratively (tests are just programs).

[0] - https://github.com/gravypod/it491-disabler-ai


Automated testing is hard to justify for games because you are not simply finishing features and moving on. You are constantly experimenting. Throwing out features and ideas and reworking them is part of the creative process. Automated testing adds overhead to iteration and isn't free, so you have to be very selective. At the end of the day it is more important for the game to be fun than for it to be stable.


Though there are games where the fun has been proven, and it's important to be able to iterate fast without breaking what's already there. There's a great 2016 blog post from Riot on how they test LoL:

https://technology.riotgames.com/news/automated-testing-leag...


> I wonder what would need to happen to convince people that ..

3. It's worth it.

I work at (relatively) low levels, and I would absolutely love to have extensive tests (plus more, e.g. TLA+ models to prove critical properties of the systems I work on).

The pushback comes from stakeholders. They don't want to invest time and money into automated testing.

And when no automated testing has been done yet, you can guess that the system hasn't been architected to be easily testable. Figuring out how to add useful tests without massive (time-consuming and expensive, potentially error-prone) re-architecting is also something that requires quite a bit of investment.

Of course, part of it is just lack of experience. If someone who knows how it's done could lead by example and show the ropes, that'd probably help. Getting the framework off the ground could be the key to sneaking in some tests in the future, even when nobody asks for them.


I did tests for something like this once. Not low level, more of a set of 30+ microservices but the concept is there. The black box testing. This was for a smart home solution based on RabbitMQ. The client wanted to replace RabbitMQ with Kafka but they were anxious because there was no way to verify that the replacement would behave the same way.

So we spent 2 months writing black box tests against the RabbitMQ version, swapped it out for Kafka, and fixed all the issues within a couple of weeks.

Since then, I have believed that integration tests are much more valuable than unit tests.
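
A minimal sketch of that kind of black box suite, assuming a tiny publish/consume interface. `QueueBroker` and `LogBroker` are hypothetical in-memory stand-ins, not RabbitMQ or Kafka clients; the point is only that the same checks run unchanged against both implementations.

```python
from collections import defaultdict, deque


class QueueBroker:
    """Hypothetical stand-in for the original broker: per-topic FIFO queues."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def publish(self, topic, message):
        self._queues[topic].append(message)

    def consume(self, topic):
        queue = self._queues[topic]
        return queue.popleft() if queue else None


class LogBroker:
    """Hypothetical stand-in for the replacement: append-only log plus a read offset."""

    def __init__(self):
        self._logs = defaultdict(list)
        self._offsets = defaultdict(int)

    def publish(self, topic, message):
        self._logs[topic].append(message)

    def consume(self, topic):
        offset = self._offsets[topic]
        if offset >= len(self._logs[topic]):
            return None
        self._offsets[topic] = offset + 1
        return self._logs[topic][offset]


def check_broker_contract(make_broker):
    """Black-box checks: only publish/consume are used, never internals."""
    broker = make_broker()
    assert broker.consume("lights") is None      # empty topic yields nothing
    broker.publish("lights", "on")
    broker.publish("lights", "off")
    broker.publish("doors", "lock")
    assert broker.consume("lights") == "on"      # per-topic FIFO order
    assert broker.consume("doors") == "lock"     # topics are independent
    assert broker.consume("lights") == "off"
    assert broker.consume("lights") is None      # each message is delivered once
    return True


# Run the identical suite against both implementations.
results = {cls.__name__: check_broker_contract(cls) for cls in (QueueBroker, LogBroker)}
```

Because the checks never look inside the broker, swapping the implementation only requires a new factory to pass in.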


Or maybe those people know what they are talking about and the truth is that many people who are big fans of automated testing tend to overrate its value when it comes to many areas of software development. Testing takes effort and makes it harder to change things. In an ideal world where testing came for free then sure, more testing would be better. In the real world there are tradeoffs. If I am writing code that controls a spaceship then it makes sense to spend a huge amount of effort on testing. On the other hand, if I am adding a feature to a web application then in my personal experience, most of the time adding automated testing is a waste of effort.


> Or maybe those people know what they are talking about and the truth is that many people who are big fans of automated testing tend to overrate its value when it comes to many areas of software development.

I would normally agree with you, but a TCP Stack is one of those things where I vehemently disagree.

Communication stacks, in general, are giant piles of implicit state unless you go out of your way to manage the state explicitly. As such, they have obscure bugs which are difficult to find when some of the cases are hit rarely.

A communication stack really needs to be written such that the inputs (including TIME), the outputs, the current state, and the next state are all quite explicit. This enables you to test the stack because it is now deterministic with respect to inputs and outputs.
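
A rough sketch of what "explicit" can look like, assuming a toy stop-and-wait sender rather than any real protocol. The names `State`, `step`, and `RETRANSMIT_AFTER` are made up for illustration. The transition is a pure function of the current state, the current time, and one input event, so a scripted test replays identically every run.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

RETRANSMIT_AFTER = 1.0  # seconds; illustrative value, not from any spec


@dataclass(frozen=True)
class State:
    unacked: Optional[bytes] = None   # payload waiting for an ACK, if any
    sent_at: float = 0.0              # time of the last (re)transmission
    retries: int = 0


def step(state: State, now: float, event: Tuple[str, bytes]) -> Tuple[State, list]:
    """Pure transition: (state, time, input) -> (next state, outputs)."""
    kind, payload = event
    if kind == "send" and state.unacked is None:
        return State(unacked=payload, sent_at=now), [("tx", payload)]
    if kind == "ack" and state.unacked is not None:
        return State(), []
    if kind == "tick" and state.unacked is not None:
        if now - state.sent_at >= RETRANSMIT_AFTER:
            nxt = replace(state, sent_at=now, retries=state.retries + 1)
            return nxt, [("tx", state.unacked)]
    return state, []  # everything else is ignored in this sketch


def run(events):
    """Replay a scripted list of (now, event) pairs and collect the outputs."""
    state, log = State(), []
    for now, event in events:
        state, out = step(state, now, event)
        log.extend(out)
    return state, log


script = [
    (0.0, ("send", b"hello")),
    (0.5, ("tick", b"")),   # too early: nothing happens
    (1.0, ("tick", b"")),   # timer expired: retransmit
    (1.2, ("ack", b"")),
]
final_state, outputs = run(script)
```

Since the clock is just another argument, the retransmission path is tested without sleeping or mocking anything.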

Yes, it's not easy. And it requires that you really mean it and architect it that way. You may not be able to evolve your current stack and may have to throw it away--that's never going to be popular.

However, every single time I have done this for a communication stack (USB, CANopen, BLE, etc.), the result was that the new stack quickly overtook the old stack on basically every metric worth monitoring (throughput, latency, reliability, bug rate, etc.).

Now, to be fair, I was obviously replacing a communication stack that was some level of "pile of crap" or I wouldn't have done it. However, I'm just one person, and those stacks generally came from a company who had a vested interest in it not sucking. I'm not some amazing programmer, and I certainly didn't spend more time on it than the original stack, so it really comes down to the fact that the "underlying architecture" was simply a better idea.


> A communication stack really needs to be written such that the inputs (including TIME), the outputs, the current state, and the next state are all quite explicit. This enables you to test the stack because it is now deterministic with respect to inputs and outputs.

Do you have something open source to look at that uses this approach? The idea is great, but I wonder how complicated the code becomes when using it.


Sadly, no. I don't rewrite communication stacks for fun. And someone who pays me to do so generally sees it as a competitive advantage.

The complication is that the "event loop" is atomized. This means that initialization, teardown, and iteration are now in your hands. You're responsible for making sure that the "iteration" function is called often enough (ie. both because of timers and incoming packets), not the stack. Of course, because time is an input, you can clearly identify "Oops. I didn't call this often enough."

The advantage is that your "event loops" can now compose. Quite often, it's very hard to get different communication stacks to cooperate because they all want to be the "primary event loop". (A good example of this failure is "async" frameworks--"How do I wait on a screen refresh, a tcp socket, and a character device descriptor simultaneously?"--if your async executor doesn't take all of those into account you're gonna have a bad time) Since the event loops are atomized, you can interleave the "iteration" calls and wait/sleep however you like.
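
To make the composition point concrete, here is a small sketch under assumed names: `Heartbeat` is a hypothetical stack whose `iterate(now)` does whatever is due and reports its next deadline, and `run_until` is the caller-owned loop. Simulated time is used so the example is deterministic.

```python
class Heartbeat:
    """Hypothetical stack: emits a keepalive every `period` seconds."""

    def __init__(self, name, period):
        self.name = name
        self.period = period
        self.next_due = period

    def iterate(self, now):
        """Do any work that is due at `now`; return (outputs, next deadline)."""
        out = []
        while self.next_due <= now:
            out.append((self.next_due, self.name))
            self.next_due += self.period
        return out, self.next_due


def run_until(stacks, end_time):
    """Caller-owned loop: interleave the stacks and jump to the earliest deadline."""
    now, log = 0.0, []
    while True:
        deadlines = []
        for stack in stacks:
            out, deadline = stack.iterate(now)
            log.extend(out)
            deadlines.append(deadline)
        now = min(deadlines)
        if now > end_time:
            return log
        # A real program would sleep or wait on file descriptors until `now` here.


log = run_until([Heartbeat("usb", 2.0), Heartbeat("ble", 3.0)], end_time=6.0)
```

Neither stack owns the loop, so adding a third source of events is just one more entry in the list.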

As for open source, to be honest, I probably wouldn't release anything open source these days unless I got a very significant personal benefit somehow. It's not enough to simply create something and put it out there--everybody expects you to be a maintainer and have some infinite well of tactfulness in the face of idiots who deserve to be slapped silly.

For me, the grief from the whiny, entitled contingent is far too high to make producing open source worthwhile nowadays. I respect the people who do it but have no desire to join them.


> Or maybe those people know what they are talking about and the truth is that many people who are big fans of automated testing tend to overrate its value when it comes to many areas of software development.

There's been a lot of research, and internal studies, done at many companies that show pretty impressive benefits.

When really questioned most engineers just say "I know my code works" or "I test my code, I don't need automated tests". That's the mentality I just don't understand.

> Testing takes effort and makes it harder to change things.

If it "makes things hard to change" just delete the test? You'll still get the benefit of knowing XYZ are broken/altered. You can also automate end-to-end and black box tests which should absolutely not require any modification if you're just refactoring.

> If I am writing code that controls a spaceship then it makes sense to spend a huge amount of effort on testing. On the other hand, if I am adding a feature to a web application then in my personal experience, most of the time adding automated testing is a waste of effort.

If you are working on something that is allowed to fail, then sure, you don't really need to care about what practices you follow. It's a very end-all-be-all argument to say "it's ok for my things to break". That argument goes just the same for all of these things:

"Why do I need a version control system? It's fine if I manually merge my code incorrectly"

"Why do I need a build system? It's fine if I forget to recompile one of my files"

etc.

In addition: the "argument" for automated testing isn't that it will just prevent you from breaking something. It's that it lets you know when things change and makes it easy to update your code without manually checking if things are broken. Recently, when adding features to our frontend, I just run our tests and update a png file in our repo. I then play around until my styling is how I like it. It's completely automated and saves me a lot of time. It also lets others know immediately whether their CSS change will or will not affect my components.
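
A minimal sketch of that golden-file workflow, assuming a made-up `render_button` that returns bytes in place of a real screenshot and a hypothetical `check_golden` helper. The actual setup described above is not shown in the thread, so this is only the general shape.

```python
import os
import tempfile


def render_button(label, color):
    """Hypothetical renderer; a real one would return PNG bytes from a browser."""
    return f"button[{color}]:{label}".encode()


def check_golden(path, actual, update=False):
    """Compare `actual` with the stored golden file; rewrite it when update=True."""
    if update or not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(actual)
        return "updated"
    with open(path, "rb") as f:
        expected = f.read()
    return "match" if expected == actual else "changed"


with tempfile.TemporaryDirectory() as tmp:
    golden = os.path.join(tmp, "button.golden")
    first = check_golden(golden, render_button("Save", "blue"))    # creates the baseline
    same = check_golden(golden, render_button("Save", "blue"))     # unchanged output
    drift = check_golden(golden, render_button("Save", "red"))     # a styling change shows up
    accepted = check_golden(golden, render_button("Save", "red"), update=True)
    after = check_golden(golden, render_button("Save", "red"))     # new baseline holds
```

The "changed" result is the signal to either fix the regression or deliberately accept the new baseline with `update=True`.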


>There's been a lot of research, and internal studies, done at many companies that show pretty impressive benefits.

I would need to look at the research and studies to see whether I actually believe them. I have seen how politicized and tribal technical decisions can become at a company and can easily imagine that there might be confounding variables.

>When really questioned most engineers just say "I know my code works" or "I test my code, I don't need automated tests". That's the mentality I just don't understand.

If I do not write automated tests and then my code works fine in production 95% of the time, and out of the 5% of the time that it breaks, 99% of the time it is a problem that is easily fixed and causes no major problems, and even the few major problems are of the "lose a manageable amount of money" kind and not the "people get injured or killed" kind - meanwhile if, on the other hand, writing automated tests would add 50% more effort to my work - then the cost/benefit analysis might suggest that I should not write automated tests. Keep in mind that I could spend that 50% more effort instead doing things that will make my code less likely to break but that are not automated testing. "not adding automated testing" is not the same thing as "not doing anything that will make the code less likely to break".
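
Plugging those illustrative numbers in gives a feel for the scale. This is only back-of-the-envelope arithmetic, assuming every change costs one unit of effort; the variable names are made up here.

```python
p_break = 0.05               # from the comment: code breaks 5% of the time
p_major_given_break = 0.01   # from the comment: 1% of breakages are not easily fixed
test_overhead = 0.50         # from the comment: tests add 50% effort per change

p_major = p_break * p_major_given_break             # chance a change causes a major problem
changes_per_major = round(1 / p_major)              # changes shipped per major problem
extra_effort = changes_per_major * test_overhead    # effort spent on tests over that span
minor_breaks = round(changes_per_major * p_break * (1 - p_major_given_break))
```

Under these assumptions, the tests would need to save more than 1000 units of effort per 2000 changes (one major problem plus 99 easy ones) to pay for themselves.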

>If it "makes things hard to change" just delete the test? You'll still get the benefit of knowing XYZ are broken/altered.

If I delete the test then I will have wasted part of the effort that went into writing it. Usually when I change code I already know that things will work differently afterward and that things might break, so this would not tell me anything I did not already know.

>You can also automate end-to-end and black box tests which should absolutely not require any modification if you're just refactoring.

Agreed - I am much more friendly towards end-to-end tests than towards unit tests. I still would not advocate them dogmatically, but I find them to be more useful than unit tests.

>If you are working on something that is allowed to fail, then sure, you don't really need to care about what practices you follow. It's a very end-all-be-all argument to say "it's ok for my things to break". That argument goes just the same for all of these things:

I think that you might be seeing things in too binary a way. The vast majority of software products are allowed to fail, but not all of them are allowed to fail to the same degree. The practices do matter and there is no one-size-fits-all approach to testing. The context matters. What rate of failure is acceptable such that to try to prevent failure beyond that rate would actually be counterproductive? What specific impacts would more vs. less testing have on the development process? How do the relevant pros and cons fit into the overall goal of the organization? Etc.

I am not saying "it's ok for my things to break", nor are probably most people who question automated testing dogma saying "it's ok for my things to break". We are saying that there are tradeoffs. Sometimes adding more automated testing does not actually add value. Again, it depends on the exact context.

Regarding the rest of what you wrote: I am not disputing that automated testing can bring lots of benefits. I am just saying that I think some people are too dogmatic about it, see it too much as a magic pill, push for its use too broadly, and do not take the relevant tradeoffs sufficiently into account.


I agree - it seems it's only excuses that keep certain devs away from testing. Even Factorio manages testing, though it's more of an integration test. I'm sure it could be done with unit tests too.


I think even integration tests would be a huge improvement for most teams and software. As long as some automated checks are run before code is merged, you're going to save yourself a lot of heartache.


I couldn't make it past the serialization/deserialization logic in my own hobbyist TCP/IP stack. Even that part was super buggy. Next time around I'm definitely going to unit test more parts; otherwise it's too hard for a beginner to get the easy parts right, let alone the harder parts.

Also, take a look at gvisor's network stack. It's definitely unit tested.

https://github.com/google/gvisor/tree/master/pkg/tcpip/link/... (an example)


This is perhaps a better example: https://github.com/google/gvisor/blob/master/pkg/tcpip/trans...

Also, some networking tests use separate frameworks (which look more like the setup the original post is describing, since those are needed also), e.g.: https://github.com/google/gvisor/tree/master/test/packetimpa...


Yes, a TCP stack certainly is complex enough to warrant serious automated testing and/or TDD.

The idea of putting the TCP stack in user space is interesting. If one actually could map the memory of the whole device into user space one could maybe have fewer system calls and therefore have better performance.

Also, what I find somewhat irritating about using a linux system is how often one needs to run commands as root (sudo) for common administrative tasks like mounting a disk or stuff like that. Having a user space TCP stack could also decrease the need for that as far as setting up the network is concerned. If the linux machine is single user, as most of them are nowadays, it makes more sense that way, I think.


> one needs to run commands as root (sudo) for common administrative tasks like mounting a disk

I would think that if you don't do this, an attacker who is able to execute code but is not yet root could easily elevate permissions by shadowing legitimate paths and tricking root into executing untrusted code.

I'm not a security engineer and just find it interesting, so if my thinking is off, please correct me.


The whole "map the PCIE device into userspace process memory" thing is called DPDK (https://www.dpdk.org/)


And you can combine the two:

https://fd.io/docs/vpp/master/whatisvpp/hoststack.html

And there is a sister project using this tech to get noticeable speed-ups:

https://wiki.fd.io/view/VSAP

Disclaimer: I am involved with the VPP project.


> Having a user space TCP stack could also decrease the need for [root privileges] as far as setting up the network is concerned.

I think it’s important to distinguish between the protocol (TCP) and the hardware device. You would still absolutely need to talk to the device, it’s just that moving a lot of the logic to user space means much less context switching for system calls for the application.

I can imagine on Linux you can talk directly to /dev/eth0 if you would want to (in the same way that you can talk to /dev/sda), and then you would be back at square one regarding root privileges.


> I can imagine on Linux you can talk directly to /dev/eth0 if you would want to (in the same way that you can talk to /dev/sda), and then you would be back at square one regarding root privileges.

It's an AF_PACKET, SOCK_RAW socket rather than a device file, but yes.
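
For the curious, a hedged sketch of opening such a socket from Python on Linux. `open_raw_link_socket` is a made-up helper name; without root or CAP_NET_RAW the attempt is expected to fail, which is exactly the "back at square one" point.

```python
import socket

ETH_P_ALL = 0x0003  # from linux/if_ether.h: receive frames of every protocol


def open_raw_link_socket():
    """Try to open a link-layer socket; return (socket or None, reason)."""
    family = getattr(socket, "AF_PACKET", None)
    if family is None:
        return None, "AF_PACKET is not available on this platform"
    try:
        sock = socket.socket(family, socket.SOCK_RAW, socket.htons(ETH_P_ALL))
    except PermissionError:
        return None, "permission denied (needs root or CAP_NET_RAW)"
    except OSError as exc:
        return None, f"failed: {exc}"
    return sock, "ok"


sock, reason = open_raw_link_socket()
if sock is not None:
    sock.close()
```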


> The idea of putting the TCP stack in user space is interesting.

Indeed! Julia Evans wrote a really nice post explaining the usecases and benefits - https://jvns.ca/blog/2016/06/30/why-do-we-use-the-linux-kern...


You don't need to be root to mount disks, when you have udisks installed (which would be almost all distros by default). See udisksctl(1): <https://manpages.debian.org/buster/udisks2/udisksctl.1.en.ht...>


There's nothing inherent about Linux which prevents you from running everything as uid 0. If you're fine with every process you run having the same full privileges and shared ownership of everything, you should.

Most machines, at least outside embedded devices, are not like this. They are multi-user systems even when there's only ever one breathing thing at the desk because it offers a degree of separation between the privileges of your daemons, your pid 1, your web browser etc.


I think the point was that root shouldn’t be required for “common administrative tasks”. The nuclear option of running everything as root doesn’t address this.


What the single user is called is a technicality. The logical conclusion is the same: your login account has administrative privileges and processes run by that account have administrative privileges as a consequence.

The point I'm getting at isn't to promote the nuclear option, but suggest that maybe there's a good reason for e.g. a web browser or your word processor to not have the same privileges as a user who can execute "simple administrative tasks" like changing the TCP/IP stack through which all your network traffic passes.


What are the common networking-related administrative tasks that require root? All I can think of is stuff like route tables and dhcp, both of which live at the IP/Ethernet level rather than TCP.


Starting any network server process that uses ports under 1024, like most standard protocols (https, http, ssh, smtp, dns, ntp, dhcp etc.), by default requires root rights on most UNIX-like operating systems.

Most personal computers do not need server processes (unless you want to connect remotely to them), but your question was not restricted to them.


From a practical point of view, regardless of the scope of the original question, this is the kind of scenario where you'd really want the restriction. More than a simple administrative task it's a dangerous attack vector to allow any user to launch your httpd or DNS.

That being said, check out capabilities(7) in Linux. You can grant an executable the privilege of binding to a low port when run by non-0 uid through setcap. This is a good compromise.
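
A quick way to see the restriction from an unprivileged process, as a sketch. `try_bind` is a made-up helper; the capability that lifts the restriction on Linux is CAP_NET_BIND_SERVICE. The result for port 80 depends on the system's configuration and on who runs the script, so no particular outcome is assumed.

```python
import socket


def try_bind(port, host="127.0.0.1"):
    """Return 'ok', 'denied', or 'other' for an attempt to bind a TCP socket."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.bind((host, port))
        return "ok"
    except PermissionError:
        return "denied"
    except OSError:
        return "other"   # e.g. address already in use


low = try_bind(80)    # typically 'denied' without root or the capability
high = try_bind(0)    # port 0 asks the kernel for any free ephemeral port
```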


this whole 'privileged ports' nonsense is left over from a time when some process on another machine running on a low port was somehow to be trusted - because the person running that process was another administrator, and you can generally trust those guys (as opposed to unwashed users).

that world didn't last very long, and I wish we could vent some of these designs that didn't pass the test of time.


You can adjust the highest privileged port (at least on FreeBSD). It's convenient to set that to 79 and let regular users listen on http without needing root.

ssh and smtp generally need root to do their job, although maybe you could find a way to deliver mail to users without it. If you want to run user-based dns or other services, you could set the highest privileged port even lower.


One benefit we discovered with this test framework after the blog post was written was that it made it much more convenient to do fuzzing and differential testing of the TCP stack. The core problem with fuzzing TCP is that there's a lot of incrementally built up state, and everything is extremely timing-dependent.

You basically need the fuzzer to have a model of TCP state so that it can effectively explore the state space, which is quite complicated and not something you can do with off-the-shelf tools.

But once you have a bunch of unit tests designed to put the TCP stack into a specific state + a way of saving and restoring that state, it's really easy to just have snapshots of interesting situations where you can run a fuzzer on the next packet to be transmitted and see what happens.
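
A toy illustration of the snapshot-then-fuzz idea, not the framework from the blog post. `ToyReceiver` is a hypothetical in-order receiver, the snapshot is just a deep copy, and the "fuzzer" is a seeded random generator that throws one segment at a fresh copy of the state each round.

```python
import copy
import random


class ToyReceiver:
    """Hypothetical receiver that accepts in-order segments only."""

    def __init__(self):
        self.next_seq = 0
        self.delivered = bytearray()

    def on_segment(self, seq, payload):
        if seq == self.next_seq:
            self.delivered += payload
            self.next_seq += len(payload)
        return self.next_seq  # cumulative ACK


def build_interesting_state():
    """What a unit test would do: drive the stack into a known mid-stream state."""
    rx = ToyReceiver()
    rx.on_segment(0, b"GET / HT")
    rx.on_segment(8, b"TP/1.1\r\n")
    return rx


def fuzz_next_packet(snapshot, iterations, seed):
    """Restore the snapshot each round and throw one random segment at it."""
    rng = random.Random(seed)
    for _ in range(iterations):
        rx = copy.deepcopy(snapshot)
        before = rx.next_seq
        seq = rng.choice([before, before - 1, before + 1, rng.randrange(0, 1 << 16)])
        payload = bytes(rng.randrange(256) for _ in range(rng.randrange(0, 32)))
        ack = rx.on_segment(seq, payload)
        # Invariants that must hold no matter what arrives.
        assert ack >= before, "ACK went backwards"
        assert len(rx.delivered) == rx.next_seq, "delivered bytes out of sync"
    return iterations


snapshot = build_interesting_state()
rounds = fuzz_next_packet(snapshot, iterations=500, seed=1)
```

The fuzzer never has to learn how to reach the mid-stream state; the unit test already did that work.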


It would be nice to have a bring-your-own-I/O TCP stack library that *doesn’t* rely on custom callbacks - something like BearSSL but for TCP, where the stack is just a pure state machine object and the user is responsible for explicitly shunting packets to and from the state machine, retaining control over when and how the I/O is done. Instead of having to define callbacks for retrieving time and consuming packets, why not explicitly pass the timestamp and packet data to a state machine object via a direct function call?


Is Cloudflare's Quiche QUIC library https://github.com/cloudflare/quiche similar to what you're looking for? All I/O must be done by the caller.


Yeah, that’s the general idea. Essentially a state machine with a send queue and receive queue, and four operations:

- Input received raw data

- Output received application data

- Input application data to send

- Output raw data to send

Obviously, since TCP connection state is time sensitive, the “raw data” wouldn’t just be the IP packet and headers, but also a time stamp telling the state machine when that packet was received/sent. If you want the state machine to keep track of time even when no packets are being received or sent, there could be an additional operation just to input a timestamp without additional packets. In effect, time is just another input that the user is responsible for feeding to the state machine at sufficiently fine intervals.
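
Here is roughly what those operations could look like, as a sketch only. `Endpoint` is a toy stop-and-wait protocol, not TCP; the method names, the one-byte packet type, and `RTO` are all invented for illustration, and there are no sequence numbers, so a duplicate would be delivered twice.

```python
from collections import deque

DATA, ACK = 0, 1
RTO = 1.0  # illustrative retransmission timeout in seconds


class Endpoint:
    """Toy stop-and-wait endpoint: no sockets, no clock, no callbacks."""

    def __init__(self):
        self._to_send = deque()      # application data not yet transmitted
        self._in_flight = None       # (payload, time sent) awaiting an ACK
        self._outbox = deque()       # raw packets ready for the caller to transmit
        self._received = bytearray()

    def write_app(self, data):               # input application data to send
        self._to_send.append(bytes(data))

    def read_app(self):                      # output received application data
        data, self._received = bytes(self._received), bytearray()
        return data

    def receive_raw(self, now, packet):      # input received raw data (`now` unused here)
        kind, payload = packet[0], packet[1:]
        if kind == DATA:
            self._received += payload
            self._outbox.append(bytes([ACK]))
        elif kind == ACK:
            self._in_flight = None

    def poll_raw(self, now):                 # output raw data to send; also feeds time in
        if self._in_flight is None and self._to_send:
            self._in_flight = (self._to_send.popleft(), now)
            self._outbox.append(bytes([DATA]) + self._in_flight[0])
        elif self._in_flight and now - self._in_flight[1] >= RTO:
            self._in_flight = (self._in_flight[0], now)
            self._outbox.append(bytes([DATA]) + self._in_flight[0])
        packets = list(self._outbox)
        self._outbox.clear()
        return packets


a, b = Endpoint(), Endpoint()
a.write_app(b"hi")
lost = a.poll_raw(0.0)         # first transmission, "lost" by the caller
quiet = a.poll_raw(0.5)        # timer not yet expired
retry = a.poll_raw(1.0)        # time alone triggers the retransmission
b.receive_raw(1.0, retry[0])
app_data = b.read_app()
acks = b.poll_raw(1.0)
a.receive_raw(1.1, acks[0])
idle = a.poll_raw(5.0)         # nothing left in flight
```

The caller decides when packets move and what time it is, which is what makes the loss and the timeout trivial to script.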

In practice, you could emulate this pattern with a callback-oriented protocol stack by populating an in-memory send/receive queue in your callback function, but that design can be somewhat inflexible because it forces potentially undesirable constraints, e.g. an extra memory copy that could otherwise be elided.



