Systematic Parsing of X.509: Eradicating Security Issues with a Parse Tree (arxiv.org)
146 points by snaky on Dec 16, 2018 | 25 comments



There's one thing that gives me pause here:

The single most common error is listed as DNS/URI/email format violations. There is absolutely no discussion as to what kinds of violations these break into, nor is there even a discussion as to what the paper thinks the correct formats ought to be. This is unfortunate because the format of these parameters is one thing where specifications often have a view of the world which is completely incongruent with reality. As a simple case, you will sometimes come across documentation that thinks that DNS names cannot start with a digit, which does not match reality at all.
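
To make the example concrete, here is a minimal Python sketch; the strict pattern is hypothetical (the kind such documentation implies), and the RFC 1123 pattern omits the 63-octet label length limit for brevity:

    import re

    # Hypothetical overly strict label pattern: first character must be a letter.
    too_strict = re.compile(r"^[A-Za-z][A-Za-z0-9-]*$")
    # RFC 1123 relaxation: labels may begin with a digit (length limit omitted).
    rfc1123_label = re.compile(r"^[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?$")

    label = "1password"                      # a perfectly legal DNS label
    print(bool(too_strict.match(label)))     # False: wrongly rejected
    print(bool(rfc1123_label.match(label)))  # True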


If they don't discuss it, it seems reasonable to assume they mean as specified. PKIX and the Baseline Requirements are pretty clear on how this works, despite ignorance from the CAs and end users.

Note in particular that SAN dnsNames in PKIX are host names, and not all DNS names are acceptable names for hosts per the specifications.

In my experience "I wasn't sure of the format definition" is code for "I knew this was strictly forbidden but it's more convenient for me this way, now I can claim to be outraged when it's pointed out and demand extra time to correct it".

This works up to a point, but these things pile up. One small mistake is very forgivable, except in the context of having made lots of other (perhaps convenient for you) "mistakes", at which point it looks like incompetence regardless of your actual motivation.


*grin* I have seen this happen when working with X.400 mail - Sprint flat out ignored stuff.

ICL decided to start an index at 0 when the spec said it MUST start at 1 - and oops, a divide-by-zero error.


> As a simple case, you will sometimes come across documentation that thinks that DNS names cannot start with a digit, which does not match reality at all.

I wish that were true. It’s seriously annoying that DNS names are nearly indistinguishable from IPv4 addresses.
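
A minimal Python sketch of the usual disambiguation (anything that parses as an address is treated as an address, everything else as a DNS name):

    import ipaddress

    def classify(name: str) -> str:
        # ip_address() accepts valid IPv4/IPv6 literals and raises otherwise.
        try:
            ipaddress.ip_address(name)
            return "ip-address"
        except ValueError:
            return "dns-name"

    print(classify("1.1.1.1"))        # ip-address
    print(classify("1password.com"))  # dns-name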


To complicate this further, there are certificates issued with IP addresses as names. One of the early bugs in CT Advisor[0] involved not knowing what to do with such a thing. I'd be interested in whether those writing this report considered these a valid URI.

An obvious example: https://1.1.1.1/

[0] https://ctadvisor.lolware.net/
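
For reference, a minimal sketch of telling the two SAN entry types apart with the third-party Python "cryptography" package (assuming a recent version; "cert.pem" is a hypothetical path):

    from cryptography import x509

    # Load a PEM-encoded certificate from disk (hypothetical file).
    with open("cert.pem", "rb") as f:
        cert = x509.load_pem_x509_certificate(f.read())

    san = cert.extensions.get_extension_for_class(
        x509.SubjectAlternativeName).value
    # SAN entries carry an explicit type, so iPAddress and dNSName
    # values can be extracted separately.
    print(san.get_values_for_type(x509.IPAddress))  # e.g. [IPv4Address('1.1.1.1')]
    print(san.get_values_for_type(x509.DNSName))    # e.g. ['example.com']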


I am the main author of the paper; I am glad to see such an interesting discussion arise around our work.

In general, to address the concerns about validating the information found in the certificate: we targeted only syntactic checking. That is, we verify only that the format of an IP address or DNS name complies with the X.509 specifications; semantic validation of the retrieved information can be done by an application built on top of our parser. A URI/DNS name/IP address is therefore valid if it conforms to the syntactic format described in the standard.

Regarding the specific example, I ran our parser on that certificate: since the IP addresses in the SAN have the correct format, they are correctly recognized. The certificate does provide an example of a bad DNS name, though: it contains a DNS name starting with *. Although this might sound reasonable, such a domain name is not allowed by the X.509 specification.
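
For illustration, a minimal Python sketch of such a purely syntactic dNSName check (an approximation of the RFC 1123 preferred name syntax, not our actual parser):

    import re

    # LDH labels only: letters, digits, hyphens; no leading/trailing
    # hyphen; at most 63 octets per label. A "*" label fails this.
    LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
    HOSTNAME = re.compile(rf"^(?:{LABEL}\.)*{LABEL}\.?$")

    def is_syntactic_dns_name(name: str) -> bool:
        return len(name) <= 253 and HOSTNAME.match(name) is not None

    print(is_syntactic_dns_name("one.one.one.one"))  # True
    print(is_syntactic_dns_name("*.example.com"))    # False: "*" label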


Older versions of Microsoft Windows don't understand the ipAddress SAN at all. Instead they parse the dnsName SANs as a generic text field into which you can write host names, IP addresses, or whatever you like. This was fixed in modern Windows (maybe Windows 7 onwards?)

As a result, and with enforcement never total, CAs would often issue non-compliant garbage for IP addresses because "it works on Windows". In the last few years this has improved as older Windows systems rust out and enforcement becomes more proactive.

1.1.1.1 is an example of how this should look.


I'm seeing the SAN for the cert served by 1.1.1.1 contain the IP 1.1.1.1, not the domain 1.1.1.1. Or am I misunderstanding?


Yes, the name is listed in the SAN on that cert. That said, that's one of the fields that's parsed, and potentially a source of issues in this paper.


Subject Alternative Name has explicit differentiation between IP address and DNS names [1].

And fortunately the fallback to the common name (CN) attribute, which does not have this explicit differentiation, has been deprecated since 2000 in RFC 2818, and more recently also by both Chrome and Firefox [2,3].

[1] https://tools.ietf.org/html/rfc5280#section-4.2.1.6
[2] https://www.chromestatus.com/features/4981025180483584
[3] https://bugzilla.mozilla.org/show_bug.cgi?id=1245280


see also langsec.org

Non-Turing-complete languages, formal grammars, and context-free parsing are fascinating, and the current state of tooling is really sophisticated but sparse. So much boilerplate code and ad-hoc parsing exists in my code, and I never really appreciated how much until I asked myself whether I really needed a Turing-complete language to tell me if an input is an integer or a string of 4 chars, etc.

I'm terrified what would happen if a fuzzer ever went to town on my python code.
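
A minimal Python sketch of the point: both checks are regular languages, so a finite automaton (here via the re module) suffices.

    import re

    is_integer = re.compile(r"^-?[0-9]+$")   # optional sign, digits only
    is_four_chars = re.compile(r"^.{4}$")    # exactly four characters

    print(bool(is_integer.match("-42")))      # True
    print(bool(is_integer.match("42abc")))    # False
    print(bool(is_four_chars.match("abcd")))  # True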


The last couple of times I had to parse/generate strings according to something that can be described with a grammar, I resisted the temptation to implement an ad-hoc parser just because the format was "simple", and took the time to start using Boost Qi/Karma (C++) instead. Precisely BECAUSE the formats are simple, they are the perfect opportunity to start learning more powerful tools.


That such basic errors are still regularly being made underscores how early we are in the development of reliable tooling. I am glad this sort of research is being done to uncover and identify our societal technical debt.


There's a lot of debt from just ASN1 and X509 parsing itself. The formats are only popular because of their popularity: their payloads are what matters.


The formats are popular because they were popular in closed-source software. And they were popular in closed-source software because there were (and still are) good commercial parser generators for ASN.1.

ASN.1 has been a failure in open source because there weren't any good parser generators. The only open source ASN.1 generator for C code I'm familiar with is asn1c[1], which was published long after OpenSSL and other projects added their ad hoc certificate parsing code. I think there may be one or two for Java, but that's about it.

Moreover, open source projects have historically disfavored using parser generators. They don't like the dependency, and there's still the sense that good protocols shouldn't need parser generators--contrast commercial protocols like X.whatever with SMTP, HTTP, etc.

ASN.1-based formats were never intended to be parsed using hand-written code. Abstract Syntax Notation refers to the grammar for specifying the wire-line formats.

ASN.1 is solid technology. The technical debt exists because open source tooling never developed around it. At first the community thought it was too complicated and unnecessary. Then when the need arose the community simply reinvented the wheel (Protocol Buffers, etc).

[1] asn1c is amazing, BTW. Not only will it generate encoders and decoders given the ASN.1 specification, but it can generate streaming encoders and decoders, something that most open source alternatives (e.g. Protocol Buffers) can't do. (And by streaming I mean streaming a single message, which is important for low-memory environments, either because of minimal hardware resources, as a performance optimization, or as a security constraint.)


I agree with your claim that ASN.1 was not intended to be parsed with hand-written code. ASN.1 is indeed quite close to a grammar specification, as also shown in the paper. However, I believe a major source of its parsing complexity is the binary encoding generally used, either BER or DER, both of which employ length fields. Length fields are nearly ubiquitous in communication protocol formats, but they are quite annoying to handle from a grammar design perspective: a length field requires counting the bytes of the payload, which is tedious to express in a grammar yet extremely easy in hand-written code. This in turn makes grammar-based automatic parser generators a less common choice for these formats.
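
To illustrate why hand-written code wins here, a minimal Python sketch of reading one BER/DER tag-length-value triple (just an illustration, not our actual parser):

    def read_tlv(buf: bytes, pos: int = 0):
        tag = buf[pos]
        length = buf[pos + 1]
        pos += 2
        if length & 0x80:                 # long form: low 7 bits give the
            n = length & 0x7F             # number of following length octets
            length = int.from_bytes(buf[pos:pos + n], "big")
            pos += n
        value = buf[pos:pos + length]     # counting bytes: trivial here,
        return tag, value, pos + length   # tedious to express in a grammar

    # Example: 02 01 05 is the DER encoding of INTEGER 5.
    tag, value, end = read_tlv(bytes.fromhex("020105"))
    print(hex(tag), value.hex())          # 0x2 05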

From a grammar design standpoint, a delimiter-based structure would be preferable. For instance, in the context of X.509, we proposed a new format[1] which replaces the DER encoding: there are no length fields; instead, each payload is terminated by a fixed delimiter. The grammar for this format was far simpler than for the length-field-based encoding, and it required no hand-written code.

[1] A novel regular format for X.509 digital certificates, https://link.springer.com/chapter/10.1007/978-3-319-54978-1_...


I believe I read that paper recently :) Ultimately I ended up using PEGs. LPeg in particular, using LPeg's match time captures to recursively invoke PEGs for length-encoded objects. (In addition to the match time capture extension, what's especially nice about LPeg--missing from every other PEG library I've seen--is that you can build and transform the AST in one shot.)

I've also tentatively rejected translation to a format like that proposed in the paper. In a secure enclave-like environment I'd rather be dealing with statically defined C-like structs with stronger invariants--i.e. no optional or sum types, no variable-length fields; basically, no need for any kind of parsing whatsoever. If I have to transform, I'd like to transform the message both syntactically and semantically into the simplest possible form. Parsing complexity is only part of the equation. The other part is semantic complexity, which is a different kind of problem that better formats and parsers can't fix.

In another timeline things could have been different, but we don't live in that timeline :( We can't let perfect be the enemy of good. Even if we could move away from DER or even ASN.1 in the open source world, the entire telecommunications industry (and specifically the cell industry) is built around ASN.1 and DER/PER/XER. AFAICT the biggest users of asn1c are people working with 3GPP and similar standards. No matter how sane and secure we can make our open source ecosystems, ASN.1 and similar older tech will still lurk in the background, remaining the weakest link in the chain. If we want real security we have no choice but to develop better tooling in that regard. I appreciate your proposal is very much of that mindset, I'm just not sold on the practical utility.


My default position on parsing anything more complex than a couple of comma-separated non-string values these days is to write a grammar and pick up a tool. I wish more programmers felt this way.
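
For example, a minimal sketch with the third-party pyparsing library (an arbitrary choice; any grammar tool would do):

    from pyparsing import Word, nums, delimitedList

    # A grammar for a comma-separated list of unsigned integers,
    # instead of an ad-hoc split-and-hope parser.
    integer = Word(nums)
    csv_line = delimitedList(integer, delim=",")

    print(csv_line.parseString("1,22,333", parseAll=True).asList())
    # -> ['1', '22', '333']
    csv_line.parseString("1,,3", parseAll=True)  # raises ParseException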


Even that is always highly annoying. How do you deal with (for example) Dutch decimal numbers, which use the comma as the decimal separator, or numbers with a comma as a thousands separator?


'Dutch' notation can usually be distinguished with enough context (e.g. if the number has decimals), or by comparing it to other numbers found in the file.

But don't get me started on people mixing American date notation and ISO notation. Especially mixing the separators.

If you ever have to work with date notations, use a hyphen as a separator for ISO, and use a forward slash (/) when using American notation. It's the only way to distinguish dates before the 13th day of the month.
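
A minimal Python sketch of the ambiguity:

    from datetime import datetime

    # The same slash-separated string is a valid date under both
    # conventions -- just not the same date.
    s = "04/05/2018"
    print(datetime.strptime(s, "%m/%d/%Y").date())  # 2018-04-05 (American)
    print(datetime.strptime(s, "%d/%m/%Y").date())  # 2018-05-04 (day-first)

    # ISO 8601 with hyphens is unambiguous:
    print(datetime.strptime("2018-05-04", "%Y-%m-%d").date())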


I am sorry to tell you that the rest of the world also uses slashes with dates; it's not just a US thing. But with the same field order.


Please note that in The Netherlands, the most commonly used date format using '/' is day/month/year, not month/day/year as seen in the US.

See also https://en.wikipedia.org/wiki/Date_format_by_country


Interesting how Apple's SecureTransport seems to be the most permissive of them all: it rejected 0 of the certificates in the dataset that all the others flagged for syntactic errors.


[comment about x509 being bad]


Sucks that you're being downvoted: at the time you posted, "X.509" wasn't in the title, and this was reasonable context for why "20% of HTTPS server cert are incorrect, half considered valid by libs".



