The life and times of an Abstract Syntax Tree (trailofbits.com)
98 points by jnord 44 days ago | 29 comments



Starting a paragraph about how to "safely" remove safeguards, and then joking tongue-in-cheek about Rust, looks kind of bad when your very first example isn't safe and is liable to segfault even before the "proverbial gun" variant :)

It needs to be

> auto &lhs = *node_storage.emplace_back(make_unique(...));

instead of dereferencing lhs on the third line: emplace_back returns a reference to the inserted item, so here it's a reference to the unique_ptr, and that reference is invalidated by the rhs emplace_back reallocating the vector.

Rust also has a number of easy-to-use ways to do the same thing: slotmap or typed_arena are the standard choices, or you can do the vector-indices scheme manually like at the bottom of the post. This post is talking in circles around the fact that the vector is really being used as an arena allocator, without ever saying so, and that has the obvious problem of being extremely fragile for anyone who reads the type and expects it to behave like a vector. It would probably do better to say that explicitly and link to the Wikipedia page or whatever, so the reader can look into the proper way of doing the code pattern.
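
For illustration, a minimal sketch of the slotmap route (the toy Node type here is made up): nodes live in the map and refer to each other by stable keys, so later insertions never invalidate earlier handles.

    use slotmap::{new_key_type, SlotMap};

    new_key_type! { struct NodeKey; }

    enum Node {
        Lit(i64),
        Add(NodeKey, NodeKey), // children are keys, not references
    }

    fn main() {
        let mut nodes: SlotMap<NodeKey, Node> = SlotMap::with_key();
        let lhs = nodes.insert(Node::Lit(1));
        let rhs = nodes.insert(Node::Lit(2)); // may reallocate; keys stay valid
        let sum = nodes.insert(Node::Add(lhs, rhs));
        if let Node::Add(l, r) = &nodes[sum] {
            let _ = (&nodes[*l], &nodes[*r]); // traversal is just more lookups
        }
    }

Keys are Copy, so they can be stored freely inside other nodes without any ownership ceremony.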


> instead of dereferencing lhs on the third line: emplace_back returns a reference to the inserted item, so it's a &unique_ptr...and that reference is invalidated by the rhs emplace_back reallocating the vector.

Correct, but only under the assumption that growing the backing storage of std::vector is implemented by a completely new malloc call, which indeed seems to be the case in the libstdc++ implementation, AFAICS.

Otherwise, if realloc had been used, that condition wouldn't be guaranteed, since realloc can grow the existing allocation in place and thus not invalidate other references into it.

That said, I wonder if this behavior is mandated by the standard. https://en.cppreference.com/w/cpp/container/vector/emplace_b... implies that it is, but I didn't check:

    If after the operation the new size() is greater than old capacity() a reallocation takes place, in which case all iterators (including the end() iterator) and all references to the elements are invalidated. Otherwise only the end() iterator is invalidated.


This comment is pedantic but not relevant; the grandparent comment was right. If the vector is reallocated to a new location in memory (either because it uses malloc, or because realloc can't grow the existing allocation and must allocate somewhere new), the references will be invalidated and the behavior of this program is undefined. Using realloc would not make this code correct.

The references also won't be invalidated if there's no need to reallocate, because the vector has enough space. Enumerating the cases in which this wouldn't cause UB does not change the fact that there are cases in which it would cause UB.

I'm sure it feels good to make comments like this, but it's a disservice to newer developers who may be misled by your comment about whether or not this code contains UB.


My comment acknowledges the issue mentioned by the parent comment and extends it with extra information out of my own curiosity, which others may find useful or valuable too. Or not. I don't care.

There could surely exist a std::vector implementation that uses std::realloc under the hood, in which case the described issue wouldn't always be reproducible? I think MSVC actually uses this technique, AFAIR.


Unfortunately, C++ containers generally cannot use realloc, because they are required to invoke their elements' move/copy constructors when their addresses change.


Yes, you're correct that generally you would not be able to use realloc for the reasons you mentioned (and also because of destructors), but for simple POD (trivially copyable) types I think it should be possible, no?

In that case, realloc will either

(1) grow the existing memory in place, leaving any new bytes uninitialized, and call it a day,

(2) Or if growing the memory is not possible, (a) allocate a new region of memory, (b) memcpy the contents from the old block and (c) free the old block.

Perhaps I'm missing something, but I think those are all the guarantees that realloc gives you.
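
A small sketch of those two cases, using Rust's raw std::alloc API (keeping one language across the examples in this thread), which mirrors the C malloc/realloc contract:

    use std::alloc::{alloc, dealloc, realloc, Layout};

    fn main() {
        unsafe {
            let old = Layout::array::<u32>(4).unwrap();
            let p = alloc(old) as *mut u32;
            assert!(!p.is_null());
            p.write(42);
            // realloc either grows the block in place (same pointer comes
            // back) or, if it can't, allocates a new block, memcpys the old
            // contents, and frees the old block. Either way the new bytes
            // are uninitialized and old pointers may be invalidated.
            let new = Layout::array::<u32>(8).unwrap();
            let q = realloc(p as *mut u8, old, new.size()) as *mut u32;
            assert!(!q.is_null());
            assert_eq!(q.read(), 42); // old contents preserved in both cases
            dealloc(q as *mut u8, new);
        }
    }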


In Rust, one of the rather annoying limitations of the borrow checker is constructing self-referential data structures, e.g. ASTs.

You either have to explicitly mark every node in the tree with a Box<T>, or all nodes in the tree have to share lifetimes.

But since you're building the tree gradually by recursively scanning over some concrete syntax, that won't work.

Box<T> generally ruins your ability to do pattern matching, which is a deal-breaker when working with ASTs.

You can overcome some of the inconvenience by using a compiler extension called box patterns:

https://doc.rust-lang.org/beta/unstable-book/language-featur...

But the nicest solution I've found is to use a bump allocator (specifically, bumpalo).

You allocate a big chunk of memory for your AST, for which the lifetime is that of the bump allocator's handle.

And you tie the lifetime of all nodes of your AST with that lifetime.

The "drawback" is that you must free your entire AST at once. But you probably were going to anyways.

https://manishearth.github.io/blog/2021/03/15/arenas-in-rust...
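
To make that concrete, a minimal sketch with bumpalo (the toy Expr type is made up): every node borrows from the arena, so the whole tree shares the arena's lifetime, and plain references keep pattern matching ergonomic.

    use bumpalo::Bump;

    // toy AST: every node borrows from the arena for lifetime 'a
    enum Expr<'a> {
        Num(f64),
        Add(&'a Expr<'a>, &'a Expr<'a>),
    }

    fn main() {
        let arena = Bump::new();
        let lhs: &Expr = arena.alloc(Expr::Num(1.0));
        let rhs: &Expr = arena.alloc(Expr::Num(2.0));
        let sum: &Expr = arena.alloc(Expr::Add(lhs, rhs));
        // matching on references works directly; no box_patterns needed
        if let Expr::Add(Expr::Num(a), Expr::Num(b)) = sum {
            println!("{}", a + b);
        }
        // dropping `arena` frees the whole tree at once (note that Drop
        // impls of allocated values are not run, fine for a plain AST)
    }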


Halfway through your comment I was going to say you were wrong, that ASTs aren't self-referential and that you can easily allocate them in a single block using a bump allocator, but then you say so yourself in the second half of the post. This is exactly what this post proposes to do when it removes the safeguards and says it can't show a Rust version (the "AstContext" is just a sort of bad bump allocator), except that in Rust it's completely memory safe.

> Box<T> generally ruins your ability to do pattern matching, which is a deal-breaker when working with ASTs.

For what it's worth I find that I never want to pass ownership when traversing ASTs with pattern matching, so I just bind everything by reference and don't experience this problem. YMMV.
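
For instance, a sketch with a hypothetical boxed Expr: matching on &Expr binds the fields by reference, and &Box<Expr> deref-coerces to &Expr at the recursive calls, so box patterns never come up.

    enum Expr {
        Num(f64),
        Add(Box<Expr>, Box<Expr>),
    }

    fn eval(e: &Expr) -> f64 {
        match e {
            Expr::Num(n) => *n,
            // l and r bind as &Box<Expr>; deref coercion turns them into
            // &Expr when passed to eval, so nothing is ever moved out
            Expr::Add(l, r) => eval(l) + eval(r),
        }
    }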


OK, I didn't actually read the "I give up on doing this in Rust" section very carefully.

The article capitulates:

> last time I asked how to do that in my company’s Slack channel, the responses I received were something like “don’t” and “why would you do that?” and “someone please call security.” It should not have been a surprise, as an AST is basically a linked list with extra steps, and Rust hates linked lists.

If I hadn't told myself "go find a good solution", I might not have found someone suggesting "Just use a bump allocator and make your nodes live until one exact, shared lifetime."

The article does agree in principle with this solution:

> In my experience, it’s quite easy to do away with the ceremony of making each node hold a reference count altogether, and instead decide on a more disciplined approach to ownership.

I think this is one of those examples where you get a principled, elegant and safe solution.

You just gotta find it first.

But here it is.

(Also, I read about it on a forum when searching for a better solution than box_patterns.)


There's another, much nicer, solution to the contiguous allocation problem: use a proper arena allocator. This is essentially the article's `std::deque` option, but without the downsides of "every node is the same size" or Microsoft's deque implementation.

You can get 90% of the way there by expanding the `vector`+`reserve` approach to keep a list of vectors, and allocate a new one whenever the previous one fills up. Replace the vectors with untyped byte buffers, and you can fill them with objects of different sizes.

This is quite reasonable to do even in Rust, e.g. with bumpalo: https://docs.rs/bumpalo/latest/bumpalo/.
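
A rough sketch of the typed "90%" version described above (essentially what the typed-arena crate does internally): references handed out stay stable because a chunk's buffer is never reallocated, and the unsafety is confined to one spot.

    use std::cell::UnsafeCell;

    struct ChunkArena<T> {
        chunks: UnsafeCell<Vec<Vec<T>>>,
    }

    impl<T> ChunkArena<T> {
        const CHUNK: usize = 256;

        fn new() -> Self {
            ChunkArena {
                chunks: UnsafeCell::new(vec![Vec::with_capacity(Self::CHUNK)]),
            }
        }

        fn alloc(&self, value: T) -> &mut T {
            // SAFETY: a chunk is only pushed to while len < capacity, so its
            // heap buffer never reallocates and previously handed-out
            // references stay valid; full chunks are never touched again.
            let chunks = unsafe { &mut *self.chunks.get() };
            if chunks.last().unwrap().len() == Self::CHUNK {
                chunks.push(Vec::with_capacity(Self::CHUNK));
            }
            let chunk = chunks.last_mut().unwrap();
            chunk.push(value); // cannot reallocate: len < capacity here
            chunk.last_mut().unwrap()
        }
    }

Swapping the typed chunks for raw byte buffers (plus alignment handling) removes the fixed-node-size limitation; that is essentially what bumpalo implements.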


I admit I was also thinking arena allocator the whole time I was reading that. I think it would work rather well since you are building up a tree piece by piece.


The raku devs are working away on RakuAST, which is now in v6.PREVIEW, and early adoption is eased by the Slangify module https://raku.land/zef:lizmat/Slangify

raku’s killer app is its built-in Grammars … and now you can write Actions that build an AST right in your code. What does this mean? You can define your own Domain-Specific Language (DSL) in your code and then just tag the blocks that use it. (Well, raku is anyway a collaborative set of DSLs, e.g. for regexes and quoting constructs, but RakuAST now makes this extensible by user code.)

The first examples of this are Brainf*k and BASIC interpreters.

Check out this introduction… https://dev.to/lizmat/rakuast-for-early-adopters-576n


This is why I really like compacting garbage collection:

- Don't have to think about this stuff, no matter how the AST is manipulated.

- Get locality for free from live objects being compacted together

A pretty good deal!


I recently implemented a compacting garbage-collected arena in C and was surprised at how simple it was to make (though I still need to improve some things). I wonder why it's not used more? Just keep the really huge non-temp data buffers in a different data structure so they don't need to be compacted; otherwise you can go wild with allocations and never have a use-after-free.


Some related references (on a somewhat messy wiki page) - https://github.com/oilshell/oil/wiki/Compact-AST-Representat...

Feel free to add others


There is one common pattern required in many use cases when working with and representing ASTs: having a `parent` node, or more generally access to both parents and children. This complicates the ownership implementation.

Additionally, it is essential to provide an API to transform trees, or even construct new ones from immutable ASTs, as implemented in many compilers, e.g. .NET's Roslyn or TypeScript's tsc.


Roslyn specifically does not include parent pointers in its immutable AST nodes. It must do this in order to share those nodes between different versions of the tree.

It only introduces parent pointers in a convenience wrapper layer, built on-demand as you traverse the tree - you could equivalently just pass the parent down as an argument to your tree walking functions.
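
For example, a sketch in Rust (to keep one language across the examples here; Roslyn itself is C#) with a made-up Node type, where no parent pointer is stored and the walker threads the parent through each call:

    struct Node {
        kind: &'static str,
        children: Vec<Node>,
    }

    fn walk(node: &Node, parent: Option<&Node>) {
        if let Some(p) = parent {
            println!("{} is a child of {}", node.kind, p.kind);
        }
        for child in &node.children {
            walk(child, Some(node)); // each child sees `node` as its parent
        }
    }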

(The ownership problem of parent pointers also goes away when you use the arena allocation approach that the post arrives at.)


It seems to me that compilers are a perfect place for data-oriented architecture/design, but I only know of Zig using it: https://vimeo.com/649009599


I'm curious whether the article was inspired by the recent SQLite article about bytecode vs. ASTs (<https://sqlite.org/whybytecode.html>).


Author here - that's not the case, this post was in the works for months haha

Maybe it's a bit of Baader–Meinhof phenomenon at play?


Discussed this in a comment on lobste.rs https://lobste.rs/s/ljlmid/life_times_abstract_syntax_tree#c...

tl;dr: use indices in Rust; they work pretty well, and proper use of the type system helps reduce overhead without sacrificing safety.
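
A small sketch of that approach with a typed index (the toy Node type and names are made up): a NodeId newtype keeps raw usizes out of the API, and indices stay valid no matter how often the backing Vec reallocates.

    #[derive(Clone, Copy, PartialEq, Eq)]
    struct NodeId(u32);

    enum Node {
        Lit(i64),
        Add(NodeId, NodeId), // children by id, stable across Vec growth
    }

    struct Ast {
        nodes: Vec<Node>,
    }

    impl Ast {
        fn push(&mut self, node: Node) -> NodeId {
            self.nodes.push(node);
            NodeId(self.nodes.len() as u32 - 1)
        }

        fn get(&self, id: NodeId) -> &Node {
            &self.nodes[id.0 as usize]
        }
    }

    fn main() {
        let mut ast = Ast { nodes: Vec::new() };
        let a = ast.push(Node::Lit(1));
        let b = ast.push(Node::Lit(2));
        let sum = ast.push(Node::Add(a, b)); // pushes never invalidate ids
        if let Node::Add(l, r) = ast.get(sum) {
            let _ = (ast.get(*l), ast.get(*r));
        }
    }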


Is learning how compilers work or how to implement them still worth it with AI replacing programming?


AI isn't, yet at least, replacing developers any more than it is replacing lawyers or any other job. It can help a developer code a function, or at least act as smart autocomplete (GitHub Copilot), which is about the equivalent of helping a lawyer write an e-mail: a productivity tool, but not about to replace anyone.

There's a ton of work to be done before we get to the stage of AI replacing actual jobs. In the shorter term, things like planning/reasoning, working memory, and factuality need to be addressed, and to learn to do an actual job you need online (incremental, real-time) learning too. These all need architectural innovation; they are not a matter of scaling up.

Bear in mind that in the 7 years(!) since the Transformer was invented, all we've done is basically a bunch of engineering work to scale them up... The pace of AI improvement may seem fast, but here we are after 7 years with SOTA $100M AIs still struggling to do very basic logical tasks like the recent Twitter A::B 4-rule letter replacement challenge! It seems pretty clear that AI isn't going to be replacing developers any time soon!


Yes. Knowing how the language is transformed into running code helps your mental model for simulating the process as thoughts.

You still have to read and validate the code that an LLM emits, and some code will be faster for you to type than to prompt, wait, correct, test, adapt, etc. It may even help in describing the problem to the LLM, as knowledge of the appropriate language terms will guide the context along the right lines.

And it's also fulfilling to design programming languages, whether general purpose or DSLs, at least in my opinion.


Show me several convincing examples of AI replacing programming.

Don't fall for the hype; this is not happening. We've only got some okay snippet generators and some semi-adequate translators of simple pieces of code from a limited set of languages to others.

There's nothing else. AI is not replacing programming yet. Or even soon.


The "Devin AI developer" demo was totally misleading. It's a bit like having an AI look at a lawyer's e-mail inbox and reply to a couple of them with legit-appearing legalese, then claim you've developed an AI lawyer.

You'd have thought they'd at least have cherry-picked an example where "Devin" did something useful, but in fact the Upwork job they set it loose on was just asking for a repository to be updated to build correctly (with the latest tool versions). Devin ignored this, went and introduced a bunch of bugs by making coding changes... and then came back and fixed its own bugs! (See the Yannic Kilcher YouTube channel.)


Which AI specifically? Because every LLM I've tried has usually failed miserably at producing code with non-trivial requirements after they were patiently repeated multiple times and came with clear examples.

This tells me that anyone asserting what you said is either 1) not a developer, 2) taking buggy code at face value, or 3) taking 10x the time it would take an average developer to patiently guide the LLM to a working solution.


> Which AI specifically? Because every LLM I've tried has usually [...]

I think pipeline_peak's remark is unwarranted, but we probably should be trying to look ahead a number of decades (depending on how long you have left in your career) rather than only considering what's possible with today's LLMs. Two years ago, before ChatGPT and DALL-E 2, many would've considered what's being done now with generative AI infeasible.

But even with current LLMs, I've been surprised at GPT-4 Turbo's ability to produce working scripts for problems that, though maybe not the most challenging, weren't entirely trivial. I'd speculate there are a number of useful tasks within the LLM's capacity that just need a good integration. Maybe Microsoft releases a designer tool that allows creating static company websites, iterating based on client prompts and the model's vision capability, for example.


There are two pinnacles in a traditional CS program - building a compiler and a DBMS. They are umbrella topics that prepare you for building pretty much any kind of software.

If you came to programming because you liked computers, you'll find building an interpreter a fascinating topic in its own right. Building a compiler frontend is comprehensible, and reading a modern/practical textbook (e.g. one based on ANTLR or whatever is the latest parser generator in the wild) will illuminate so much for you.

You're highly unlikely to find a job with this skillset, though. There are very few companies that do it, the core teams are very small, and there's competition with PhDs from top-tier schools. The current leetcode fashion will ignore your niche knowledge right off the bat. So from this perspective, the utility of knowing this stuff has always been a grey area, regardless of AI.

If you like money more than you like computers surely aim for an MLE position instead of SE/DE. But that's kind of a selective club too I imagine.



