Nice article and analysis! I'm actually considering replacing the trie used in my project with something based on this one, with a few modifications:
For instance, find_node(c, Node48) could avoid the branch if a non-existent index points to an additional entry in child_ptrs that is always NULL. Lookup would then be comparable to the Node256 version.
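Something like this minimal sketch, where every unused index entry points at a permanently-NULL 49th slot (names other than child_ptrs are just illustrative, not the article's):

    #include <cstdint>
    #include <cstring>

    enum { EMPTY_SLOT = 48 };

    struct node48 {
        uint8_t child_index[256]; // maps key byte -> slot in child_ptrs[]
        void   *child_ptrs[49];   // slot 48 stays NULL: the "miss" sentinel
    };

    static void node48_init(node48 *n) {
        memset(n->child_index, EMPTY_SLOT, sizeof(n->child_index));
        memset(n->child_ptrs, 0, sizeof(n->child_ptrs));
    }

    // Branchless lookup: a missing byte indexes the sentinel slot and yields
    // NULL, so the only branch left is the caller's NULL check, just like
    // with Node256.
    static void *node48_find(const node48 *n, uint8_t c) {
        return n->child_ptrs[n->child_index[c]];
    }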
Another thing that could be done is to scrap the Node48 entirely and implement two new structs to replace it, Node32 and Node64, using AVX2 and AVX-512 respectively. These can be based on the Node16 version. It remains to be seen whether they would yield better performance than the branchless Node48 above, especially if power management (frequency throttling) kicks in when mixing AVX-512 with older SIMD generations.
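For the AVX2 case, a Node32 lookup could look roughly like this, following the same pattern as the SSE-based Node16 (field names are illustrative; assumes GCC/Clang with -mavx2 for the intrinsics and __builtin_ctz):

    #include <immintrin.h>
    #include <cstdint>

    struct node32 {
        uint8_t  num_children;
        uint8_t  keys[32];      // one key byte per child
        void    *children[32];
    };

    static void *node32_find(const node32 *n, uint8_t c) {
        // Broadcast the key byte and compare against all 32 stored bytes at once.
        __m256i key  = _mm256_set1_epi8((char)c);
        __m256i data = _mm256_loadu_si256((const __m256i *)n->keys);
        uint32_t mask = (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(key, data));

        // Ignore lanes beyond num_children (they may hold stale bytes).
        if (n->num_children < 32)
            mask &= (1u << n->num_children) - 1;
        if (!mask)
            return nullptr;
        return n->children[__builtin_ctz(mask)];
    }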
The trie in Lwan (https://lwan.ws) does an interesting trick to reduce the amount of memory used in the equivalent of a Node256: instead of 256 pointers to a node, it has only 8 pointers. Characters are hashed (mod 8). The leaf node contains a linked list of key/value pairs, and an actual string comparison is performed at the end. (Lwan cheats here by skipping the string comparison if the linked list contains only 1 element.) It works pretty well, as it's part of the URL routing mechanism.
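A loose sketch of that layout (not the actual Lwan code; names here are made up):

    #include <cstring>

    struct kv_entry {
        const char *key;
        void       *value;
        kv_entry   *next;
    };

    struct hash_trie_node {
        hash_trie_node *children[8];   // indexed by character % 8
        kv_entry       *leaf;          // non-NULL once a key terminates here
    };

    static void *hash_trie_lookup(const hash_trie_node *node, const char *key) {
        for (const char *p = key; node && *p; p++)
            node = node->children[(unsigned char)*p % 8];
        if (!node || !node->leaf)
            return nullptr;

        // The "cheat": with a single entry, the final comparison is skipped.
        if (!node->leaf->next)
            return node->leaf->value;
        for (const kv_entry *e = node->leaf; e; e = e->next)
            if (strcmp(e->key, key) == 0)
                return e->value;
        return nullptr;
    }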
One other experiment I've been doing with tries is to take the idea of key compression and use it in a different way: slice the key every 4 or 8 bytes, treat each slice as an arbitrary integer, and add every chunk to a hashmap<int, some_struct>, where some_struct chains to the hashmap for the next lookup. The prototype I wrote works pretty well.
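In sketch form, the idea is something like this, with 8-byte chunks and std::unordered_map standing in for the hashmap (details here are illustrative, not the actual prototype):

    #include <algorithm>
    #include <cstdint>
    #include <cstring>
    #include <memory>
    #include <string>
    #include <unordered_map>

    struct chunk_node {
        std::unordered_map<uint64_t, std::unique_ptr<chunk_node>> next;
        void *value = nullptr;   // set on the node where the key ends
    };

    // Reinterpret up to 8 bytes of the key at `off` as an integer, zero-padded.
    static uint64_t chunk_of(const std::string &key, size_t off) {
        uint64_t c = 0;
        memcpy(&c, key.data() + off, std::min<size_t>(8, key.size() - off));
        return c;
    }

    static void chunk_insert(chunk_node &root, const std::string &key, void *value) {
        chunk_node *n = &root;
        for (size_t off = 0; off < key.size(); off += 8) {
            auto &slot = n->next[chunk_of(key, off)];
            if (!slot)
                slot = std::make_unique<chunk_node>();
            n = slot.get();
        }
        n->value = value;
    }

    static void *chunk_lookup(const chunk_node &root, const std::string &key) {
        const chunk_node *n = &root;
        for (size_t off = 0; off < key.size(); off += 8) {
            auto it = n->next.find(chunk_of(key, off));
            if (it == n->next.end())
                return nullptr;
            n = it->second.get();
        }
        return n->value;
    }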
Another standard trick to reduce memory is to store a bitfield telling you which pointers are non-NULL, and then store an array of only those pointers. For example, for a Node64 you store a 64-bit bitfield plus only the non-NULL pointers, so if that node has only 3 children you store 64 bits plus 3 pointers instead of 64 pointers. You can index that structure with a shift + popcount: to find the pointer at index n, first check the n-th bit in the bitfield to see whether the pointer is NULL; if it isn't, count the number of set bits below n (bitfield[0..n)) to find the index into the packed pointer array.
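In code, roughly (assuming GCC/Clang's __builtin_popcountll; names are illustrative):

    #include <cstdint>

    struct sparse_node64 {
        uint64_t present;      // bit i set => child i exists
        void   **children;     // popcount(present) tightly packed pointers
    };

    // Find the child at logical index n (0..63), or NULL if absent.
    static void *sparse_find(const sparse_node64 *node, unsigned n) {
        uint64_t bit = 1ULL << n;
        if (!(node->present & bit))
            return nullptr;
        // Rank: the number of set bits strictly below n gives the packed position.
        unsigned rank = (unsigned)__builtin_popcountll(node->present & (bit - 1));
        return node->children[rank];
    }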
Something we’ve used before is preprocessor string hashing similar to [1]. Basically, we hash strings to ints at compile time and use those in our case statements. A word of caution, though: there’s the possibility of collisions. It works great on a limited set of strings, like XML tags and such.
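For illustration, here's a sketch of the same idea using a constexpr function rather than preprocessor macros, with FNV-1a as a stand-in hash (not necessarily what [1] uses):

    #include <cstdint>
    #include <cstdio>

    // FNV-1a, evaluable at compile time so it can appear in case labels.
    constexpr uint32_t fnv1a(const char *s, uint32_t h = 2166136261u) {
        return *s ? fnv1a(s + 1, (h ^ (uint8_t)*s) * 16777619u) : h;
    }

    void handle_tag(const char *tag) {
        switch (fnv1a(tag)) {           // hashed at run time for the input...
        case fnv1a("title"):            // ...but at compile time for each label
            puts("got <title>");
            break;
        case fnv1a("body"):
            puts("got <body>");
            break;
        default:
            puts("unknown tag");
            break;
        }
    }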
You know, I don't quite understand the point the author is making there. They concede that the compiler is smart enough to elide memcpys, but not smart enough to vectorize their strlen function? What?
There is SSE (Server-Sent Events), which, although unidirectional, can be used in lieu of WebSockets for a chat application (receive messages/events via SSE, send via another endpoint).
However, as I said before, this is an open source project; if you have an itch, you can scratch it. PRs are open. :)
The authors of Seastar used to use Lwan to benchmark their HTTP server implementation. When Seastar isn't using its DPDK/user-mode TCP stack and falls back to the POSIX API, Lwan is still faster.
Curious that you compare Seastar to a Ferrari: those cars are known for an interior finish that doesn't match their price tag... and Seastar is nothing like that. It's a seriously well-engineered piece of software, and it shows: its readability and code quality are superb.
It's a known issue with TWFB. I don't know if they're planning on fixing it, but most frameworks that aren't in newer rounds no longer appear in older rounds either.
Because I failed to move the benchmark harness from the Lwan repository to the TWFB repository. Also, I'm not really spending a lot of time on Lwan any longer; there are so many things I'd like to do, and so little time. :)