It sounds like, with the no-length-field part, they're saying a protocol should define all metadata and data formats? I get this from the packet portion; if skipping over the data part of a packet is bad, but churning through it to get to the close-delimiter is good, then you must be validating the data somehow.
Well sure, validate your data. But why is that part of the packet protocol? I would think it should only verify that the data is properly escaped (if it is indeed escaped - with a length field it doesn't need to be). Shouldn't the contents be validated by what's using it, and not the TCP/IP packet specification?
Adding a length field certainly doesn't make the protocol Turing-complete; a pushdown automaton handles it just fine - push the length onto the stack, then iterate and decrement until it's exhausted. That even keeps it context-free (if the protocol was regular and you just added a length field), which they claim is a safe zone.
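To make that concrete, here's a minimal sketch of the push-then-decrement idea, assuming a hypothetical wire format of one length byte followed by that many payload bytes (the format is mine, not from the talk):

```python
def parse_frame(data: bytes):
    """Parse a hypothetical [length][payload] frame the way a
    pushdown machine would: 'push' the length, then decrement
    once per payload byte until it hits zero."""
    if not data:
        raise ValueError("empty input")
    length = data[0]          # the 'push'
    payload = bytearray()
    i = 1
    while length > 0:         # decrement until exhausted
        if i >= len(data):
            raise ValueError("truncated frame")
        payload.append(data[i])
        i += 1
        length -= 1
    return bytes(payload), data[i:]
```

No lookahead, no backtracking, one counter of state - nothing approaching Turing-completeness.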
--
They also claim that escaping is a flawed art, citing SQL injection. Is it? It seems easy to define a trivial escape scheme that isn't flawed (not claiming CSV here - just backslash-escaped double quotes or something), and some such escaping must be performed if you frame data with open and close delimiters... unless your protocol also specifies and validates the contents it's transferring, down to the very last bit, and doesn't allow those delimiters in the data.
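For the sake of argument, the kind of trivial scheme I mean (a sketch, not any real protocol's escaping: `\` escapes, `"` would be the delimiter) is small enough to verify by eye:

```python
def escape(data: str) -> str:
    # Escape the escape character first, then the delimiter,
    # so no unescaped '"' can survive in the output.
    return data.replace("\\", "\\\\").replace('"', '\\"')

def unescape(data: str) -> str:
    out = []
    it = iter(data)
    for ch in it:
        if ch == "\\":
            try:
                out.append(next(it))  # take the escaped char literally
            except StopIteration:
                raise ValueError("dangling escape at end of input")
        else:
            out.append(ch)
    return "".join(out)
```

Whether every implementer gets even this right every time is, I suppose, the talk's actual point.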
Doesn't this mean they've just claimed that arbitrary data transport protocols are all inherently flawed? And that any nested specs (ASCII in a TCP/IP packet, for instance) are inherently flawed, but single-depth ones (which could very well specify ASCII as the only data format for TCP/IP packets) aren't?
--
All in all an interesting talk, and the core message is a good one - simple, non-Turing-complete protocols are best, and perhaps should be the only protocols, because they can be validated. But I don't think I buy some of the reasoning, though I'd love to be convinced otherwise if it is indeed valid.
The point about encapsulating arbitrary data formats is a good one - especially if you invent some protocol that may have to carry programming-language code, or worse, images/MP3s and so on, especially compressed or encrypted ones.
Unless your protocol will have very limited use cases, it's almost impossible to know what data will be sent over it, so you can't really guarantee any single delimiter character is safe. Even if you do make the use case limited, someone will want to use it in a way you didn't intend (look at HTTP, for example).
I don't think they were talking about validating everything at the TCP/IP level, especially since it's unlikely you or I could effect meaningful change to the TCP/IP spec anyway. I think they mean application-layer protocols that get invented for specific programs, where what you're doing is in fact creating another type of packet at a higher abstraction level.
I think the length-field example was probably a bad one, but I think the tl;dr is to KISS when defining a protocol. If somebody else re-implements your protocol and you can't easily run their implementation through a series of tests to know whether it is correct, then you run the risk of creating a so-called "weird machine".
KISS seems to demand a length field on the receiving side of a protocol, though - you don't need to restrict, validate, (un)escape, or even look at the data you're transferring, so you can easily transfer anything. Testing length is far easier than escaping - do the bits after skipping match the spec for the next chunk of metadata? You can therefore create tests which only test the metadata (your actual spec), and completely ignore the data.
The only thing left is ensuring your tests fail when you specify the wrong length, which should be assured by your spec (either a master-length value or a protocol-terminating character like \0 (where you touch on the escaping problem again)). This is the exact same thing as ensuring start/end delimited tests pass/fail when you are adding/missing bits of your packet, or if the bits/bytes are misaligned, which you should be doing regardless.
Given this, using lengths means you can avoid creating or testing any (un)escaping code, and the bit-alignment tests are identical between the two - lengths are simpler and more easily testable. KISS favors lengths.
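To illustrate "test only the metadata", a sketch under an assumed toy format of repeated [2-byte big-endian length][payload] records - the validator skips every payload and inspects only the length headers:

```python
import struct

def validate_records(buf: bytes) -> bool:
    """Validate a toy stream of [u16 length][payload] records by
    skipping each payload unread: every length header must be
    present in full, and the stream must end exactly on a
    record boundary."""
    i = 0
    while i < len(buf):
        if len(buf) - i < 2:
            return False            # truncated length header
        (length,) = struct.unpack_from(">H", buf, i)
        i += 2 + length             # skip payload without reading it
        if i > len(buf):
            return False            # length points past the buffer
    return True
```

The payloads could be anything at all - the tests never look at them.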
On the sending side, length is still easier than escaping, unless you're streaming data of unknown length. In that case, you're probably still breaking it up into packets that can be individually sent and validated with a known length, or you have to resort to escaping and delimiters (not that it's a bad option, just that it's the only remaining one AFAIK).
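The break-it-into-packets option might look like this sketch (my own framing, though HTTP's chunked transfer coding works on the same principle, down to the zero-length terminator chunk):

```python
def chunk_stream(data_iter, max_chunk=4):
    """Frame an unknown-length byte stream as [u16 length][payload]
    chunks, terminated by a zero-length chunk. Each chunk can be
    sent and validated independently, with a known length."""
    for block in data_iter:
        for i in range(0, len(block), max_chunk):
            piece = block[i:i + max_chunk]
            yield len(piece).to_bytes(2, "big") + piece
    yield (0).to_bytes(2, "big")    # zero-length terminator
```

The zero-length chunk is itself a kind of delimiter, but since it lives in the metadata rather than the data, nothing ever needs escaping.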
> Testing length is far easier than escaping - do the bits after skipping match the spec for the next chunk of metadata?
What happens if the wrong length is supplied but something inside the actual data portion happens to match the spec for the next part of the metadata (either by accident or design)? This could result in too little of the data being read as input.
You will then probably end up with some data or metadata further along the line being read by a piece of code it wasn't meant for, since everything is out of whack.
This shouldn't have security implications as far as I can tell, but it could lead to data that should be valid causing errors in some cases.
It could maybe be fixed by having a hash right after the data portion and checking it after reading. That would fix "by accident" but not "by design" - although if you're purposely crafting packets to confuse the parser at the other end, you deserve to have them dropped on the floor.
This is really pascal strings vs C strings all over again.
Assuming only lengths, the packet length won't match - you'll reach the end of the structure before the end of the data. The point of all this is to validate before interpreting, so it fails validation, and no harm is done.
Assuming a \0 termination, which requires that it does not exist in the data (an escaping problem): there won't be a \0 after the should-be-last piece of metadata, so it fails validation, and no harm is done.
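The lengths-only case can be made concrete with a toy layout (my assumption: [u16 total length][u16 data length][data], where outer and inner lengths must agree):

```python
def validate_packet(buf: bytes) -> bool:
    """Reject a packet whose inner data length disagrees with the
    outer packet length: a lying inner length makes the structure
    end before (or after) the data does, so validation fails
    before anything interprets the payload."""
    if len(buf) < 4:
        return False
    total = int.from_bytes(buf[0:2], "big")
    data_len = int.from_bytes(buf[2:4], "big")
    if total != len(buf):
        return False                 # outer length vs actual bytes
    return 4 + data_len == total     # inner length vs outer length
```

Either lie - inner or outer - is caught by pure arithmetic on the metadata, before a single payload byte is interpreted.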
And all of this assumes you have a correctly-transmitted packet to begin with. What if you don't? If your spec + validator can't detect it, then you are introducing possible attack vectors - your tests must include wholly-invalid structures, which your validator must reject, to be 'safe'. I don't see how delimiters protect against this any more than lengths do, and they add complexity due to escaping (which I do think can be done safely and correctly, but we're assuming it cannot be, since that was essentially a claim in the video) - and you must escape if you use delimiters and allow arbitrary data.
One of my favorite bits from Vernor Vinge's _A Fire Upon the Deep_ -
>>> WARNING! The site identifying itself as Arbitration Arts is now controlled by the Straumli Perversion. The Arts' recent advertisement of communications services is a deadly trick. In fact we have good evidence that the Perversion used sapient Net packets to invade and disable the Arts' defenses. Large portions of the Arts now appear to be under direct control of the Straumli Power...
"Sapient net packets" seem like a really, really bad idea . . .
What I'm interested in is the performance implications of context-free parsing of some arbitrary data without a length field.
Someone asked this in the talk and a paper was referenced; she thought context-free could be faster, but I don't really have time to read it.
Let's say you get delimited data over a network socket and no length field. How do you know how big to make your receive buffer without losing efficiency (i.e. scanning the data twice, or allocating more than you need), or re-allocating and copying the buffer at some interval?
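The usual compromise is geometric (doubling) reallocation, which keeps the amortized copying linear at the cost of possibly allocating up to twice what's needed - exactly the overhead a length prefix avoids. A sketch, assuming a hypothetical `recv(n) -> bytes` socket-like callable:

```python
def read_delimited(recv, delim=b"\0", initial=64):
    """Read from a recv(n)->bytes callable until `delim` appears,
    growing the buffer geometrically. Each byte is copied O(1)
    times amortized, but we may allocate up to 2x what we need,
    and the find() below rescans earlier bytes on each pass
    (tracking a start offset would avoid that second scan)."""
    buf = bytearray(initial)
    used = 0
    while True:
        if used == len(buf):
            buf.extend(bytes(len(buf)))   # double capacity
        chunk = recv(len(buf) - used)
        if not chunk:
            raise ValueError("EOF before delimiter")
        buf[used:used + len(chunk)] = chunk
        used += len(chunk)
        end = buf.find(delim, 0, used)
        if end != -1:
            return bytes(buf[:end])
```

With a length field you'd instead allocate exactly once and read until full - no scanning at all.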