Puffs: Parsing Untrusted File Formats Safely (github.com/google)
200 points by ingve on Nov 16, 2017 | 102 comments



Some curious things about the language:

• All operators have equal precedence, so parentheses are required when mixing operators. I guess they’ve seen one too many bugs from precedence confusion.

• It doesn’t have dependent types; rather, it uses a proof checker with some built-in knowledge about the language, and a way to specify preconditions, postconditions, and invariants that the checker can reason about.

• There is limited effect typing: functions can be marked pure, impure (!), or impure coroutine (?).

• There doesn’t appear to be any kind of polymorphism—over types, effects, refinements, or proofs.


> I guess they’ve seen one too many bugs from precedence confusion.

Yep. I just filed a precedence bug at https://github.com/jbangert/nail/issues/7 for Nail, which is a security-conscious project. (It was presented at LangSec 2014.)

> It doesn’t have dependent types.

Yep. Programming language theorists love type systems, and dependent types are one way to prove bounds safety, but they're not the only way. Puffs is an imperative language, not functional (for the reasons described in https://github.com/google/puffs/blob/master/doc/related-work...). With mutable state, a Puffs variable's within-bounds-ness can change over that variable's lifetime, but its type does not.
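To illustrate (a Rust sketch, not Puffs, and not taken from the Puffs docs): the variable's type never changes, but what a checker knows about its bounds does.

  fn get_clamped(buf: &[u8], mut i: usize) -> u8 {
      assert!(!buf.is_empty());
      if i >= buf.len() {
          i = buf.len() - 1; // from here on, i is provably in bounds
      }
      buf[i] // same variable, same type, better bounds knowledge
  }

  fn main() {
      assert_eq!(get_clamped(&[10, 20, 30], 99), 30);
  }
(Rust will still emit a runtime bounds check on that last index unless LLVM proves it away; Puffs' checker is supposed to discharge it at compile time.)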

> There is limited effect typing: functions can be marked pure, impure (!), or impure coroutine (?).

Yep. That's the current design, although I'll probably revise that based on recent experience. The question mark denotes a coroutine, as you noted, but also whether that function can (very roughly speaking) throw an exception. I think we will need separate syntaxes for those two concepts, since I've wanted the latter without the former.

> There doesn’t appear to be any kind of polymorphism—over types, effects, refinements, or proofs.

Yep. Haven't needed it yet, and I've erred on the side of simplicity and leaving things out.


> All operators have equal precedence, so parentheses are required when mixing operators.

Doesn't that mean it rejects precedence? In a language with equal precedence (e.g. Smalltalk), operators would just be evaluated LTR or RTL.


I agree, although I suppose different operators could still have different associativity in a language without precedence. I took that phrasing from the “Puffs the Language” documentation.


> I suppose different operators could still have different associativity in a language without precedence

That… is true, and I have a hard time working out how it would even work.


One way to do it would be to have a default associativity between different operators, but chains of the same operator can override it, e.g. suppose “⧺” means concatenation (rightward), “×” means repetition (leftward), and the default is leftward:

      [1, 2] ⧺  [3, 4] ⧺ [5, 6]   × 2  × 3
    (([1, 2] ⧺ ([3, 4] ⧺ [5, 6])) × 2) × 3

    [ [ [1…6], [1…6] ],
      [ [1…6], [1…6] ],
      [ [1…6], [1…6] ] ]
I dunno how good an idea that would be in practice, though.


Think exponentiation, where 2^3^4 might be 2^(3^4), but where subtraction 2-3-4 is (2-3)-4.
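For concreteness, the two readings differ wildly (a quick Rust check; u128 so that 2^81 fits):

  fn main() {
      let right = 2u128.pow(3u32.pow(4)); // right-assoc: 2^(3^4) = 2^81
      let left = 2u128.pow(3).pow(4);     // left-assoc: (2^3)^4 = 4096
      assert_ne!(right, left);
  }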


Consider:

  2-3^4


Since there is no precedence in this (hypothetical) language, that would be an invalid expression.


What I mean is that you can't write "2 + 3 * 4", not even with implicit LTR, you have to write either "(2 + 3) * 4" or "2 + (3 * 4)".

See e.g. the "precedence bug" comment I wrote elsewhere on this page.


My point is that's not "equal precedence". Smalltalk has equal precedence: "2 + 3 * 4" parses as "(2 + 3) * 4". If you reject the expression entirely, you don't have equal precedences; you have unmixable operators, or unmixable precedences (the latter if "2 + 3 - 4" is still allowed).


OK, I'll change the wording. Thanks.


The best way to solve precedence confusion (IMHO) is to have your IDE show you how the expression is going to get parsed. Possible ways to achieve this include highlighting subexpressions, underlines, light-colored parentheses, showing a tree, etc. This could be done only on mouse-over, so as not to burden the overall UI.


An aside: I'm writing a toy language without operator-specific precedence and it's surprisingly readable. One trick is to treat expressions without spaces (a+b) as higher precedence than expressions with spaces (a + b). This is enough to write many formulas in an intuitive way, with few parentheses.
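Something like this toy sketch (hypothetical, not the parent's actual language; it assumes only + and *, all left-associative):

  // Spaced operators bind loosest, unspaced bind tightest. Splitting
  // at the rightmost occurrence gives left-to-right evaluation.
  fn eval(expr: &str) -> i64 {
      let expr = expr.trim();
      for ops in [[" + ", " * "], ["+", "*"]] {
          let split = ops
              .iter()
              .filter_map(|op| expr.rfind(op).map(|i| (i, *op)))
              .max_by_key(|&(i, _)| i);
          if let Some((i, op)) = split {
              let (lhs, rhs) = (&expr[..i], &expr[i + op.len()..]);
              return match op.trim() {
                  "+" => eval(lhs) + eval(rhs),
                  _ => eval(lhs) * eval(rhs),
              };
          }
      }
      expr.parse().unwrap()
  }

  fn main() {
      assert_eq!(eval("1+2 * 3+4"), 21); // (1+2) * (3+4)
      assert_eq!(eval("1 + 2*3"), 7);    // 1 + (2*3)
  }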


> "All operators have equal precedence, so parentheses are required..."

Just to play devil's advocate (aka be annoying); parentheses are operators.

Ref: https://softwareengineering.stackexchange.com/a/208354/69247


A devil's advocate is not an annoying person, but someone who argues against something they truly believe in (or hope to be true) in order to provoke debate, help prove the point in stronger ways, and show onlookers that contrary opinions were considered.

It comes from the canonization process in the Vatican, where someone (not an actual Satan worshipper or opponent of the church) argued in favor of the devil (against the candidate) to force the pro-sainthood side to provide good, solid proof of the candidate's sainthood, miracles performed, etc.

What you're doing is just being annoying by nitpicking the casual wording. There is no debate to be had from "parentheses are kinda operators" and the meaning of the original line was understandable, even if you consider parentheses operators.


I don't find that link convincing. For one thing, the premise is assumed in the question. Secondly, when it says "they also include ()[] as operators" it implies it is talking about language-specific terminology rather than a general term.

WP describes: https://en.wikipedia.org/wiki/Operator_(computer_programming...

> Occasionally parts of a language may be described as "matchfix" or "circumfix" operators, either to simplify the language's description or implementation.

But then, the term "operator" seems to also have mathematical roots: https://en.wikipedia.org/wiki/Operator_(mathematics)

The relevant part seems to be the definition referring to syntax, or denotation:

> Operator is also used for denoting the symbol of a mathematical operation. This is related with the meaning of "operator" in computer programming

Now, whether "the symbol" implies a singular symbol and excludes delimiter symbols might be a matter of semantics ;-)


Hi, Puffs author here.

ingve linked to the github page instead of the announcement e-mail: https://groups.google.com/forum/#!topic/puffslang/2z61mNTAMn...

That announcement has more to say about the comparison to Rust, which is probably the most frequently asked question.

There are also a few more words, on Rust and on other related work like Dafny, at https://github.com/google/puffs/blob/master/doc/related-work...

Edit: It's also not Rust per se, but the numbers at https://github.com/google/puffs/blob/master/doc/benchmarks.m... show that, on Puffs' benchmarks, gcc 7.2 noticeably outperforms clang/llvm 5.0. I'm sure this is a solvable problem, and not a fundamental flaw with llvm, but fixing that's beyond my llvm knowledge.


> Some of the safe languages (Go, Java, JavaScript, Python, Rust, etc.) are faster than others, but generally speaking, they're not as fast as C/C++. Consider an array indexing expression like "a[i]".

I don't think you'll win any friends in the Rust camp by making broad statements like that. Performance of "unsafe" vs "safe" languages largely boils down to memory access patterns, cache usage, and how many levels of indirection you tend to hit. Rust easily matches C/C++ while keeping high-level abstractions (bounds checks are done at the slice level and not per-element access).

I don't think Mozilla would have used Rust for Servo/Quantum if it was slower than C++. It's certainly held true in all the cases I've used Rust in place of C++.


First, as I said in another comment, Rust is great tech, written by great engineers. Puffs is still different (with different trade-offs).

And, yes, memory access patterns, cache usage, etc. affect performance.

And, yes, in general, Rust performs comparably to C/C++. As I noted elsewhere, Rust with runtime arithmetic overflow checks currently performs worse than Rust without such checks. So, yes, Rust without those checks is as fast as C, and in general, arithmetic overflow isn't the biggest concern.

steveklabnik, a Rust expert, commented on this page that, in the future, "if the runtime checks get cheap enough, we can do them in release mode as well". If so, that's great, I'm happy to be proven wrong. Cheap still isn't zero, though, and see "nanoseconds become milliseconds" at https://groups.google.com/forum/#!topic/puffslang/2z61mNTAMn...

In contrast, Puffs today performs as fast as C, with arithmetic overflow checks. They just happen to be compile time checks. And sometimes overflow is indeed a concern (search for "underflow" in https://blog.chromium.org/2012/05/tale-of-two-pwnies-part-1....).

I'm sorry, but I don't understand what you mean by bounds checks being done at the slice level and not per-element access. A statement like "pixels[y * stride + x] = etc" is per-element, right?


> I'm sorry, but I don't understand what you mean by bounds checks being done at the slice level and not per-element access. A statement like "pixels[y * stride + x] = etc" is per-element, right?

Yes, if you index a slice you have to check each access. However, the idiomatic way to work with strides of data in Rust is to use iterators.

Bounds are checked at the entry of the iteration, and the inner loop is nice and fast. So your example would be:

  for pixel in &mut pixels[0..y*stride+x] {
    *pixel = etc
  }
I tried to do something similar on the playground[1] but it turns out Rust/LLVM is too smart and folded the whole loop down to a constant.

[1] https://play.rust-lang.org/?gist=f3699d6456a561c3874395bff36...


https://rust.godbolt.org/ is good for playing around with this stuff: https://godbolt.org/g/9TgVXg


Sure, the inner loop is nice and fast if you're consuming exactly one byte per iteration, but I'm not doing that. In any case, I asked https://users.rust-lang.org/t/iterators-and-eliminating-all-...


> In contrast, Puffs today performs as fast as C, with arithmetic overflow checks. They just happen to be compile time checks.

I'll grant that's largely true, but as far as I can see it's not entirely true. For example, to quote a comment from one of the example .puffs files:

    // Set history_index (modulo 0x8000) to the length of this
    // remainder. The &0x7FFF is redundant, but proves to the
    // compiler that the conversion to u32 will not overflow.
In this case, the redundant operation is equivalent to a disabled arithmetic overflow check, although it's (commendably) made more explicit.
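For what it's worth, the same redundant-mask idiom shows up in ordinary Rust or C too (a sketch; the names are made up):

  // Redundant at run time, but it makes the in-range property locally
  // obvious: (i & 0x7FFF) < 0x8000 always holds, so the bounds check
  // can be discharged (by LLVM here, by the proof checker in Puffs).
  fn lookup(table: &[u8; 0x8000], i: u32) -> u8 {
      table[(i & 0x7FFF) as usize]
  }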

Similarly, in decode_lzw.puffs, I think (could be wrong) that some of the conditions in 'if' statements, e.g. "(width < 12)", should always be true unless the image is malformed. In a way, this is substituting for a runtime array bounds check and/or overflow check that might be hit if the condition weren't present. Here too, Puffs arguably wins by making the check explicit.

If I'm right that the author stuffed in extra conditions where demanded by the compiler, rather than properly considering the consequences of their being false (e.g. they should probably result in returning an error), then there's some argument that it would be better to crash than silently misbehave. But even then, at least Puffs makes the problem relatively obvious, whereas in e.g. Rust you just have to trust that your code won't perform any overflows.

> And sometimes overflow is indeed a concern (search for "underflow" in https://blog.chromium.org/2012/05/tale-of-two-pwnies-part-1....).

Integer overflow is a huge concern in C, but much less of one in memory-safe languages, where a buffer size being miscalculated should result in a controlled panic rather than memory corruption (…at least if you're not interfacing with unsafe code). Still, an overflow is effectively a miscalculation, incorrect behavior, and with any kind of incorrect behavior there's always a chance it will compromise security in some way.
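For example (a sketch, with made-up names), checked arithmetic turns the size miscalculation itself into an explicit error path:

  fn alloc_pixels(width: usize, height: usize) -> Option<Vec<u8>> {
      // Overflow becomes None rather than a too-small allocation
      // (which, in a safe language, would still only panic later).
      let n = width.checked_mul(height)?;
      Some(vec![0u8; n])
  }

  fn main() {
      assert!(alloc_pixels(usize::MAX, 2).is_none());
  }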


> Similarly on performance, the idiomatic and fast C/C++ way to read a byte of input from a buffered source is "x = *src++". But pointer arithmetic is explicitly unsafe in e.g. Go and Rust. Sure, you can hide the unsafe parts as private implementation and provide an unsafe-free public API. Still, as future maintainers edit the code, proof of safety (and the assumptions that they rely upon) are merely comments, not compiler enforced.

Sure, but that's not the idiomatic way to do that in Rust. The idiomatic way to do that in Rust is to have an iterator over the bytes and call `next()` on it to get the next byte. This means no unsafe code in your codebase, meaning no need to worry about proving that it's safe or about safety not being enforced by the compiler.
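A minimal sketch of that idiom (the helper is hypothetical): the bounds check is the Option returned by next(), with no pointer arithmetic anywhere.

  fn read_u16_le(bytes: &mut std::slice::Iter<'_, u8>) -> Option<u16> {
      let lo = *bytes.next()? as u16; // None at end-of-input
      let hi = *bytes.next()? as u16;
      Some(lo | (hi << 8))
  }

  fn main() {
      let data = [0x34, 0x12];
      assert_eq!(read_u16_le(&mut data.iter()), Some(0x1234));
  }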


Sure, that's not unsafe, but it does require a bounds check, i.e. seeing if you've hit the end of your slice of byte data.

One of the goals of Puffs is to eliminate that runtime bounds check, so it'd be exactly as fast as C/C++'s "x = *src++", even if you're not consuming exactly one byte per iteration, so you can't use Rust's "for val in bytes_iter".

In another comment on this page, I posted to where I asked https://users.rust-lang.org/t/iterators-and-eliminating-all-... for more feedback from the Rust community about this.


Coincidentally, the word "Puffs" in German is a colloquial expression for "brothels". Not that it matters much in your context, just a piece of trivia...


> The aim is to produce software libraries that are as safe as Go or Rust, roughly speaking, but as fast as C, and that can be used anywhere C libraries are used

One question about this to people more versed in language compilation - wouldn't it be possible for SAFE Rust code to be faster than C code, considering that the Rust compiler has much more syntax to play around with and optimize? Because many of those safe constructs are part of the language and can be taken into account.


There are two more interesting ways to think about this: what's the average case, and what's maintainable?

What's the comparison in speed between an average C programmer and an average Rust programmer? Rust usually defaults to the fast thing, which can lead to performance surprises when it doesn't. But I think this question is more interesting than "what can the best C coder do vs the best Rust coder"; the average is what is going to be the case for a much bigger segment of programmers.

Secondly, what's maintainable? For example, take the Stylo and Firefox devs: you can absolutely do what Stylo does, but in C. But can you maintain it? Rust's safety guarantees let you do very aggressive things that are too tricky in C. Yes, you could write the code. When you come back three months later, and modify the code, will you make a mistake? The Rust compiler will catch you here, but the C compiler probably won't...

A related version of this story: one of the core devs talks about working on some code that needed reference counting deep in the core. He used Rc, which is non-atomic, because the original case was not threaded. He came back a few months later, trying to parallelize it, and got a compile-time error pointing to the guts that used the un-threadsafe code. He was able to immediately fix it, whereas if the compiler couldn't have checked this kind of thing, it would have been a subtle, hard-to-reproduce bug.
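Roughly the kind of error involved (a sketch, not the actual code; this intentionally does not compile):

  use std::rc::Rc;
  use std::thread;

  fn main() {
      let data = Rc::new(vec![1, 2, 3]);
      // error[E0277]: `Rc<Vec<i32>>` cannot be sent between threads
      // safely: Rc is !Send, so the compiler points straight at the
      // un-threadsafe guts. Swapping Rc for Arc fixes it.
      thread::spawn(move || println!("{:?}", data));
  }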

TL;DR: these kinds of comparisons are very difficult and often anecdotal. We'll see empirically as time marches on :)


That's also how I see it: e.g. Rust can really increase the average quality of software and reduce the cost of maintaining/extending it compared to C and C++.

I wrote quite a bit of C and I like the language, as it really fits its purpose (the thinnest usable layer above Assembly), but I would never want to create/maintain a big application (especially a threaded one) in it.


C is manageable if you use memory and concurrency checking tools like Valgrind, but Valgrind doesn't guarantee that your tests cover all possible behaviors of your code.


Which most developers don't use unless obliged by upper management.


> wouldn't it be possible for SAFE Rust code to be faster than C code, considering that the Rust compiler has much more syntax to play around with and optimize?

Yes. Rust has TONS of aliasing info that optimizers essentially spend a lot of time trying to recover in other programs. However, LLVM doesn't support being told about all of it, and doesn't make use of it that much directly (we can at best tell LLVM that a whole bunch of things are `restrict`). In fact, we had to stop doing this for `&mut` because LLVM had too many bugs around that. C/C++ codebases don't benefit from this as much, so LLVM doesn't prioritize it.

Rust can do optimizations at the MIR level and I'm hoping we'll do more in that space in the future.


That's pretty depressing; restrict is killer when you really want to eek out performance, and I've had an intuition that Rust had all the relevant info to use it automatically.


Rust does have the information to, but it has so much information that LLVM doesn't yet know what to do with it all. :P


I dunno, it's actually exciting imo. Means we've got more speed to see yet from Rust! :)


eke out


In theory, yes, some Rust code can be faster than the C or C++ code people would write, especially as pertains to the guarantees offered by its aliasing control.

In practice, LLVM hasn’t had any incentive to implement those sorts of optimisations at this stage (because before Rust came along nothing would benefit from them), so the benefits are generally theoretical only.

It remains to be seen what effect this will have.


LLVM does implement these sorts of optimizations:

C:

  int foo(int *x, int *y) {
    *x = 0;
    *y = 1;
    return *x;
  }
gives:

  foo(int*, int*): # @foo(int*, int*)
    mov dword ptr [rdi], 0
    mov dword ptr [rsi], 1
    mov eax, dword ptr [rdi]
    ret
whereas in Rust:

  fn foo(x: &mut i32, y: &mut i32) -> i32 {
    *x = 0;
    *y = 1;
    *x
  }
which compiles to:

  example::foo:
    mov dword ptr [rdi], 0
    mov dword ptr [rsi], 1
    xor eax, eax
    ret


That optimization is also available to C and C++ code; they use "restrict": https://en.wikipedia.org/wiki/Restrict

Does LLVM implement optimization that only Rust can use, or at least those that cannot be used from C/C++?


&mut T pointers are basically restrict by default.

LLVM is adding some semantics specific to non-C/C++ languages. I'm on my phone so I can't link you, but they're adding an intrinsic related to infinite loops, because languages like Rust have different semantics here.


Not anymore, LLVM was buggy about this and we had to remove this.


That’s why I said “basically”, it’s been fixed upstream and will be turned on again soon.


For the record, gcc and clang do this as well when using the restrict keyword. (I'm still a big fan of Rust though.)

  int foo(int *restrict x, int *restrict y) {
    *x = 0;
    *y = 1;
    return *x;
  }
Compiles to:

  foo:
    movl	$0, (%rdi)
    movl	$1, (%rsi)
    xorl	%eax, %eax
    ret


Pony does use most LLVM optimizations and already IS faster than C++ with OpenMP, so there's still a lot of catching up to do.


I think it goes a bit further: it aims to produce software libraries that are safer than Go or Rust. For example, Go will cheerfully allow construction of an arithmetic operation that overflows the integer size; the behavior is well-defined and not an exception, but if the developer is not anticipating it, it can result in some surprising code (as Go specifies it, the compiler is not allowed to assume x < x + 1, but that's a very easy trap for a programmer to fall into).
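The trap, sketched in Rust with explicit wrapping arithmetic (Go's arithmetic wraps like this by default):

  fn main() {
      let x = i32::MAX;
      // x + 1 wraps to i32::MIN, so "x < x + 1" is false here:
      assert!(!(x < x.wrapping_add(1)));
  }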

Puffs lets you specify in the code that you do not intend to allow arithmetic overflow at all. It also has types to distinguish pointers that may be null and pointers that must not be null.


Exactly. In the announcement (https://groups.google.com/forum/#!topic/puffslang/2z61mNTAMn...), I said "Rust can check for arithmetic overflow (at run time), but after some debate, the Rust team decided to disable such checks when building in release mode, for performance reasons. Obviously, though, we don't ship our users debug builds."

Puffs enforces arithmetic overflow checks (it's not optional, and you don't have to specify that you intend to check overflow). It differs from Rust in that it's done entirely at compile time instead of at run time.


The expressiveness of LLVM IR is very close to C. Given a high enough tolerance for pain, one could implement, in C, the same rewrite that the Rust compiler would implement (without even going into inline assembly).

In practice… maybe. However, I find we're more forgiving of unpredictable performance (either wrt analysis runtime or quality of analysis) in safety checking tools than in code generators. The ergonomics of bitblasting code generation questions to SAT aren't great.


I wouldn't be too sure about that. Rust gives more compile-time guarantees, allowing the compiler not to emit code for otherwise plausible conditions (see eg. pointer aliasing); on the other hand, C compilers typically rely on undefined behaviour to apply optimisations, whereas Rust may not have as much undefined behaviour due to its safety.


Safe Rust has no UB, outside of some bugs, to be clear.

That said, optimization is all about guarantees: a claim this broad (UB vs. aliasing info) is hard to evaluate, because both are about what's guaranteed and what's not. That is, they play similar roles, so you'd have to compare how much of each is present, in what amount, in what code.

We generally expect Rust to be roughly as fast as C, sometimes faster, sometimes slower. Modulo optimizer bugs in each case, of course.


This DSL seems largely motivated by the desire for provable compile-time bounds checking. One general-purpose modern language that can check bounds at compile time and generate code that is performance-competitive with C is ATS:

http://www.ats-lang.org/

Dependent types are meant to solve exactly this class of problem. Rust has an RFC for adding these:

https://github.com/rust-lang/rfcs/issues/1930


Re ATS, http://ats-lang.sourceforge.net/DOCUMENT/INT2PROGINATS/HTML/... says that "By default, the constraint-solver implemented for ATS/Postiats makes use of the standard arithmetic of infinite precision. For the sake of efficiency, one may also choose to use machine-level arithmetic for solving integer constraints. Due to potential arithmetic overflow, results returned by the constraint-solver that uses machine-level arithmetic can be incorrect".

Puffs does not use infinite precision integers (aka "big ints"), for performance. But Puffs still checks for arithmetic overflow.

Re dependent types, I made a comment elsewhere on this page that "dependent types are one way to prove bounds safety, but they're not the only way".


ATS uses infinite precision ints only for the proof/dependent type checking. The ints used at runtime are machine ints. It's to avoid overflow during type system checking.


Aren't dependent types undecidable?


No, not if your language has totality checking like Idris, Agda, Lean, etc. Granted, it can sometimes be finicky to prove totality - the type checker naturally has to be a little conservative due to the impossibility of solving the halting problem, but for most things with array bounds checking it shouldn't be an issue.


Integers with addition (Presburger arithmetic) are decidable - that's pretty much what you need to verify array bounds (multiplication by a constant is included as well, so matrices should work too).


Do you mean "in practice" that's what you need? I could buy that, but I'm thinking about it in terms of the language spec. What would it say? Are we moving toward a language spec that says the meaning is "whatever the reference compiler can handle"? Or do you mean one that says "array indices may only be (constant) linear combinations of variables in order for the code to be compilable" or something like that?


Good point. The way I imagined is, the language is defined as "safe", and has both "static" and "dynamic" guarantees. Whatever the compiler is able to prove statically, amazing, or else it adds dynamic checks. The programmer can optionally tag the function/expression so that the compiler will warn if it can't prove it. Long-term, I think it's very reasonable to assume the solvers will improve, so best to design a language with this in mind.


Maybe.


What do you mean by "maybe"?

[1] https://cs.stackexchange.com/questions/12691


I was trying to be punny by implying indecision...

edit: also thanks for the link :)


Ohh wow that went over my head haha.


>The aim is to produce software libraries that are as safe as Go or Rust, roughly speaking, but as fast as C, and that can be used anywhere C libraries are used.

I'm all for having different implementations of software, but does Rust not fulfill these requirements?


There are C compilers for many architectures that the Rust compiler (today) does not support. They may also be referring to the fact that you can't just drop it into existing projects, with their complicated build systems, in the same way that you can drop in a few generated ".c" and ".h" files.


Great points, plus from the Puffs readme: "In Rust, integer overflow is checked at run time in debug mode and silently ignored in release mode by default, as the run time performance penalty was deemed too great." With Puffs, it's statically checked. I haven't delved into Puffs' details, but to support as many environments as possible in an easy way today, a DSL generating C seems a pragmatic decision.


Small note here, we left the spec broad enough that if the runtime checks get cheap enough, we can do them in release mode as well.


By default, but it is possible to enable integer overflow checks for release builds with "-C overflow-checks".
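Or, equivalently, per profile in Cargo.toml:

  [profile.release]
  overflow-checks = true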


And that comes with a performance penalty. Maybe it won't, in a future version of Rust, but it does for now.


But the point is checking at compile time rather than at run time, because of the overhead.


I think it does. It can be freely mixed with C code (and it works seamlessly even with debuggers/profilers/code coverage).

And there are several nice parsers to choose from, e.g.:

https://github.com/Geal/gif.rs/blob/master/src/parser.rs


If we had a modern Cfront [0] we could compile C++ to C, and use Rust's existing interop to blend the two.

[0] https://en.wikipedia.org/wiki/Cfront


That's not really necessary, as rust-bindgen already has some support for C++ bindings:

https://github.com/rust-lang-nursery/rust-bindgen/blob/maste...

Its support is somewhat limited, but… using a Cfront-like approach wouldn't help the situation much, aside from increasing portability.

After all, a given platform's C++ ABI usually looks like the C ABI, with some things added on top that could be represented explicitly at the C source level (e.g. inheritance becomes composition; name mangling; hidden fields for vtables; hidden parameters for 'this'; etc.). In other words, C++ compilers already act like Cfront followed by a C compiler. By the same token, C++ functions and structs/classes can be accessed explicitly through a C FFI by doing the extra stuff manually (or having a tool like rust-bindgen do it for you). There are a whole bunch of exceptions, e.g. Windows using a special register to pass 'this', but in theory those can mostly be handled through some ad-hoc additions to the FFI; many of those additions haven't yet been implemented in Rust [1], but they will be.

There is a more fundamental limitation, though. If you have, say, a C++ template function in a header file, it needs a C++ compiler to translate it to machine code for every set of template parameters it gets used with. If you want to call it with arbitrary parameters from Rust, you'd have to have a C++ compiler built into your Rust compiler. But a Cfront-like tool wouldn't help with that. It couldn't, say, translate a template function to a single C function that doesn't have to be monomorphized; C++ gives you too much power at compile time for that to work.

[1] https://github.com/rust-lang-nursery/rust-bindgen/issues/849


Statically eliminating array bounds violations is definitely something Rust doesn't do.


Not Rust itself, though it is something that LLVM will do to Rust code, quite often.


Sure, but that's not something you can rely on, is it?


It's just an optimization, so obviously it's not guaranteed, but that doesn't mean it's random. And if it's in your hot loop and not being optimized away, benchmarking will find it easily. At the same time, array indexing in Rust is vanishingly rare IME... turns out I only ever used it for `for` loops, which are better served by iterators. :P


See my other comment (written after your comment). Search this page for "That announcement has more to say about the comparison to Rust, which is probably the most frequently asked question."

I should also add that Rust is great tech, written by great engineers. Puffs is still different (with different trade-offs), as per the links in my other comment.


This is very interesting. After kicking the tires on a few parsers, I've come to the conclusion that most parsers are absolute garbage, and it's nobody's fault in particular. This should make it close to feasible to write fast native parsers (especially validators) which aren't a liability.


Does anyone know what analysis or prover they use to discharge proof obligations? I poked around the .md files in the repository, but I haven't seen this addressed anywhere. I'm not (yet) motivated to dig into the actual source...


Looks like it is not a general system; they prove facts about the hardcoded binary operations: https://github.com/google/puffs/blob/master/lang/check/asser...


Thanks. From what I've seen in that file and another one in the same directory, it seems to be a standard abstract interpreter on a domain of intervals. I would have expected a more refined relational domain that can express certain inequalities, but this seems to be enough for a first prototype.
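A sketch of the flavor of such an interval domain (illustrative only, not Puffs' actual implementation):

  // Track a [lo, hi] range per expression; an index is provably
  // safe when hi < the array length.
  #[derive(Clone, Copy)]
  struct Interval { lo: u64, hi: u64 }

  impl Interval {
      // Failure to bound the sum (overflow) means the proof fails.
      fn add(self, rhs: Interval) -> Option<Interval> {
          Some(Interval {
              lo: self.lo.checked_add(rhs.lo)?,
              hi: self.hi.checked_add(rhs.hi)?,
          })
      }
      fn in_bounds(self, len: u64) -> bool {
          self.hi < len
      }
  }

  fn main() {
      let i = Interval { lo: 0, hi: 99 };
      let off = Interval { lo: 0, hi: 28 };
      assert!(i.add(off).unwrap().in_bounds(128));
  }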


I'm not entirely sure what you had in mind, but you can express certain inequalities. For example, https://github.com/google/puffs/blob/master/std/gif/decode_l... has this line:

  assert n_bits < (width + 8) via etc

The 'via' syntax is discussed at https://github.com/google/puffs/blob/master/doc/puffs-the-la...


Thanks for your answer. Yes, this kind of symbolic inequality is what I had in mind. The solution via "via" is quite clever.

In the beginning I had thought that maybe you were using something like Pentagons: https://www.microsoft.com/en-us/research/publication/pentago... which were developed explicitly in the context of proving array bounds checks.

You might want to give them a look. They would cost you in implementation effort, of course, but they might make Puffs (almost as) fast and pretty smart!


An interesting observation: this was written by the brother of Terence Tao.


If I am not wrong, he is a member of the Go core team?


Yep, a member of the Go core team, although not as active in the Go community these days as I was in past years.

And yep, I'm also Terry's brother.


> The aim is to produce software libraries that are as safe as Go or Rust, roughly speaking, but as fast as C, and that can be used anywhere C libraries are used.

This is kind of a weird thing to say, because Rust is as fast as C, and can be used to create C-compatible libraries.


See my other comment (written after your comment). Search this page for "That announcement has more to say about the comparison to Rust, which is probably the most frequently asked question."


What makes this language specific to parsing files? It mentions that it is a DSL for parsing files a few times... why not network packets, or any other task.


I interpret that to be more a statement of creator intent than a restriction on the end-user. The general vibe I get is "This language is verbose as hell and puts a lot of constraints on what your functions are allowed to do; writing a full-sized application in Puffs alone is going to be a super-tedious exercise."


Network packets are files.


Can this DSL be extended to cover other classes of bugs and undefined behaviour in C without spilling over into runtime checks?


Possibly. I'd like Puffs (and its generated C code) to be free of UB (and I've skimmed the blog.regehr.org posts), but I'm not 100% certain if it is.


Looks like a mix of ATS and Pony, but without the rich proof system of ATS, without the safeties provided by both and without the performance advantages of both. Wonder why they still try to reinvent the wheel, when there are already established and better langs out there. Well, it's Google, so they are still academics.


We're google, we can pick any name we want, even if a very similar project is already using it. https://www.netbsd.org/docs/puffs/


Creating a filesystem in user-space is “very similar”? I mean, they both involve files but…

http://netbsd.gw.com/cgi-bin/man-cgi?puffs+4.i386+NetBSD-7.1


What will I get if I type "sudo apt install puffs"?

What do I have if I see libpuffs.so somewhere?


That's a valid question but not the same as saying the projects are very similar. In that case in particular, the answer would come down to “Are you running on NetBSD?” and so the most likely answer is that the NetBSD package maintainer would package it as "google-puffs" or something like that.


Yeah, it'd be unfortunate, but not unprecedented. For example, Debian has both epiphany and epiphany-browser, which are two separate things.


I'm considering a name change: https://groups.google.com/d/topic/puffslang/cSrH-s7UqwA/disc...

As for the "We're google, we can pick any name we want" assumption that somebody else wrote earlier, https://github.com/google/puffs says "Disclaimer: This is not an official Google product, it is just code that happens to be owned by Google". Also, before the launch, I didn't find many hits for "puffs" when searching github.com projects, or running "apt search puffs", or searching the Web in general for queries like [puffs programming], or for uses of the ".puffs" file extension. It's not like I cackled maniacally as I deliberately screwed over the NetBSD puffs project, whether for myself or on behalf of Google; I just didn't find it. I am bad at searching. Sorry.


There are only so many combinations of letters that are short and pronounceable, so people will have to reuse names eventually.


That's a great acronym



