Do you have a link to a copy of the video with captions? YouTube autogen doesn't cut it unfortunately. Or perhaps a written-form version (slide deck + transcript)?
The remaining 8% mostly falls into the following categories:
- Code that uses OS functionality that is cumbersome to mock in tests
- Code paths that are triggered relatively rarely, which I was simply too lazy to add tests for
Nothing is impossible to cover, but for whatever reason it was too much work for me when I wrote the code.
However, it's worth mentioning that I only settled on the transcript test pattern fairly recently, and if I were to rewrite or refactor some of the untested code today I would add tests for it, because the cost of adding tests has been lowered considerably. So Elvish's test coverage is still increasing slowly as the cost of testing decreases.
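For anyone unfamiliar with the pattern: a transcript test is basically a recorded session of inputs and expected outputs that the test harness replays. This is not the actual Elvish code, just a rough sketch of the idea in Go (evalLine is a stand-in for the real evaluator):

    package transcript

    import (
        "strings"
        "testing"
    )

    // One recorded session: lines starting with "~> " are inputs, everything
    // else is the expected output of the preceding input.
    var session = []string{
        "~> echo foo",
        "foo",
        "~> echo a b",
        "a b",
    }

    func TestTranscript(t *testing.T) {
        var input string
        var want []string
        check := func() {
            if input == "" {
                return
            }
            if got := evalLine(input); got != strings.Join(want, "\n") {
                t.Errorf("input %q: got %q, want %q", input, got, strings.Join(want, "\n"))
            }
        }
        for _, line := range session {
            if rest, ok := strings.CutPrefix(line, "~> "); ok {
                check() // finish the previous input/output pair
                input, want = rest, nil
            } else {
                want = append(want, line)
            }
        }
        check()
    }

    // evalLine stands in for the real evaluator; here it just handles echo so
    // that the sketch compiles and the example session passes.
    func evalLine(src string) string {
        return strings.TrimPrefix(src, "echo ")
    }

The nice part is that adding a test is mostly just pasting a session you already ran, which is what keeps the marginal cost of a new test low.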
There were a lot of aspects of this talk that I thought were really great. The willingness to try something unscripted, diving into the code repo live (e.g. to show where fuzzing is used), and the discussions of the reasoning behind the design choices. Great job @xiaq. This really makes me want to try elvish out, and I usually am quite skeptical of new shells.
haha, I can't present nearly as well as you, but maybe one day.
It's not easy to present though. I know on HN we see a lot of very clever people give some well executed presentations and it's sometimes easy to forget how much preparation and courage it takes to perform like that. And it's great to see how engaged people were with the content too.
Sorry, this is less of a question and more just a comment of appreciation.
Did you set your login shell to Elvish? Vim unfortunately relies on your shell being a POSIX shell, but you can fix that with "set shell=/bin/sh" in your vimrc.
This is meant as additional information, not criticism. I skimmed the transcript really fast, so if this is in there and I missed it, please correct me. Two things I think are helpful for people creating projects like this to be aware of:
- This video seems to combine the concepts of lexing and parsing. It is usually beneficial to separate these two steps and lex the input into tokens before passing to the parser.
- Go actually has a pure Go implementation of Yacc in the toolset, and I've used it in several projects to make parsers. Dealing with the Yacc file is often much easier than dealing with the code directly, since it takes care of writing the actual parser. There is a lot of boilerplate that goes into parsers, and when you use Yacc it "just works".
Edit: there are also some tools for writing lexers in a Lex/Flex-like syntax (re2c comes to mind), but I've found hand-writing lexers to be effective in Go if your language doesn't have many different types of tokens.
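To give an idea of what hand-writing a lexer looks like in Go, here's a rough sketch with an invented token set; real ones mostly just have more cases in the switch:

    package lexer

    import "unicode"

    type Kind int

    const (
        EOF Kind = iota
        IDENT
        NUMBER
        LPAREN
        RPAREN
    )

    type Token struct {
        Kind Kind
        Text string
    }

    // Lex turns the input into a flat list of tokens, discarding whitespace.
    func Lex(src string) []Token {
        var toks []Token
        rs := []rune(src)
        for i := 0; i < len(rs); {
            switch r := rs[i]; {
            case unicode.IsSpace(r):
                i++
            case r == '(':
                toks = append(toks, Token{LPAREN, "("})
                i++
            case r == ')':
                toks = append(toks, Token{RPAREN, ")"})
                i++
            case unicode.IsDigit(r):
                j := i
                for j < len(rs) && unicode.IsDigit(rs[j]) {
                    j++
                }
                toks = append(toks, Token{NUMBER, string(rs[i:j])})
                i = j
            default:
                // Anything else is an identifier-ish run of characters.
                j := i
                for j < len(rs) && !unicode.IsSpace(rs[j]) && rs[j] != '(' && rs[j] != ')' {
                    j++
                }
                toks = append(toks, Token{IDENT, string(rs[i:j])})
                i = j
            }
        }
        return append(toks, Token{EOF, ""})
    }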
Right, I may have forgotten to mention that lexerless parsers are somewhat unusual.
I didn't have much time in the talk to go into the reason, so here it is:
- You'll need a more complex lexer to parse a shell-like syntax. For example, one common thing you do with lexers is throw away whitespace, but shell syntax is whitespace-sensitive: "a$x" and "a $x" (double quotes not part of the code) are different things: the first is a single word containing a string concatenation, the second is two separate words (see the sketch below).
- If your parser backtracks a lot, lexing can improve performance: you're not going back characters, only tokens (and there are fewer tokens than characters). Elvish's parser doesn't backtrack. (It does use lookahead fairly liberally.)
Having a lexerless parser does mean that you have to deal with whitespace explicitly everywhere, though, and it can get a bit annoying. But personally I like the conceptual simplicity and not having to deal with silly tokens like LBRACE, LPAREN, PIPE.
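To make the whitespace point concrete, here's a very rough sketch (not the real Elvish parser) of parsing words directly from characters; whether $x glues onto the previous segment or starts a new word depends only on whether whitespace was consumed in between:

    package parse

    type Segment struct {
        IsVar bool   // true for $x, false for a bareword
        Text  string
    }

    // A Word is a concatenation of segments: "a$x" is one Word with two Segments.
    type Word []Segment

    // ParseWords parses a whitespace-separated list of words with no lexing step.
    func ParseWords(src string) []Word {
        var words []Word
        i := 0
        for i < len(src) {
            if src[i] == ' ' {
                // Consuming whitespace is what terminates the previous word.
                i++
                continue
            }
            var w Word
            for i < len(src) && src[i] != ' ' {
                isVar := src[i] == '$'
                if isVar {
                    i++ // skip the '$'
                }
                j := i
                for j < len(src) && src[j] != ' ' && src[j] != '$' {
                    j++
                }
                w = append(w, Segment{IsVar: isVar, Text: src[i:j]})
                i = j
            }
            words = append(words, w)
        }
        return words
    }

    // ParseWords("a$x")  => one word:  [{a} {$x}]
    // ParseWords("a $x") => two words: [{a}] and [{$x}]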
I have not used parser generators enough to comment about the benefits of using them compared to writing a parser by hand. The handwritten one works well so far :)
That example you gave could certainly be handled in Lex/Flex, and I assume in other lexers/tokenizers as well; for instance, you would probably use states and have "$x" in the initial state evaluate to a different token type than "$x" in the string state.
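The hand-rolled equivalent of those states would be something like this in Go (invented token names, just to show the state mechanism):

    package lexer

    type Kind int

    const (
        VAR     Kind = iota // $x at the top level
        STR_VAR             // $x inside "..."
        TEXT                // everything else, one character at a time (simplified)
    )

    type Token struct {
        Kind Kind
        Text string
    }

    func Lex(src string) []Token {
        const (
            initial = iota
            inString
        )
        state := initial
        var toks []Token
        for i := 0; i < len(src); {
            switch {
            case src[i] == '"':
                // Entering or leaving the string state.
                if state == initial {
                    state = inString
                } else {
                    state = initial
                }
                i++
            case src[i] == '$':
                j := i + 1
                for j < len(src) && src[j] != ' ' && src[j] != '"' && src[j] != '$' {
                    j++
                }
                kind := VAR
                if state == inString {
                    kind = STR_VAR
                }
                toks = append(toks, Token{kind, src[i+1 : j]})
                i = j
            default:
                toks = append(toks, Token{TEXT, string(src[i])})
                i++
            }
        }
        return toks
    }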
But I do get your meaning; I've written a lot of tokenizers by hand as well, and sometimes the hand-written code is easier to follow. Config files for grammars can get convoluted fast.
Again, I didn't mean it as criticism. But your talk title does start with "How to write a programming language and shell in Go", so given the title I think lexers/tokenizers are worth noting.
Yeah, ultimately there's an element of personal taste at play.
The authoritative tone of "how to write ..." is meant in jest, but obviously by doing that I risk being misunderstood. A more accurate title would be "how I wrote ...", but it's slightly boring and I was trying hard to get my talk proposal accepted you see :)
That's no problem in many modern lexers, as they usually have a "state": when you encounter "echo" you can switch to a new state, and that state may have different tokenization rules. So "if" in the "echo" state could be a string literal, whereas it may be a keyword in the initial state.
Lex/Flex takes care of that mostly for you, which is one of the benefits of using a well-worn lexer generator rather than rolling your own.
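If you were rolling it by hand in Go instead, the equivalent bookkeeping is roughly this (invented names, very simplified):

    package lexer

    import "strings"

    type Kind int

    const (
        KEYWORD Kind = iota
        WORD
    )

    type Token struct {
        Kind Kind
        Text string
    }

    var keywords = map[string]bool{"if": true, "while": true, "for": true}

    // Lex recognizes keywords only at the start of a command; once the first
    // word of a line has been read, everything else is a plain WORD.
    func Lex(src string) []Token {
        var toks []Token
        for _, line := range strings.Split(src, "\n") {
            atCommandStart := true // the "initial" state
            for _, w := range strings.Fields(line) {
                kind := WORD
                if atCommandStart && keywords[w] {
                    kind = KEYWORD
                }
                toks = append(toks, Token{kind, w})
                atCommandStart = false
            }
        }
        return toks
    }

    // Lex("if true") => KEYWORD(if) WORD(true)
    // Lex("echo if") => WORD(echo)  WORD(if)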
Unless I'm missing something, this should not be an issue: the lexer could emit "if" as an IF token, and the parser could treat args as STRING || IF (|| other keywords…)
That seems like it'd get really awkward pretty quickly. "if" isn't unique in this regard; there are about a hundred shell builtins, and all of them can be used as an argument to a command. (For example, "echo then complete command while true history" is a valid shell command consisting entirely of names of builtins, and the only keyword in it is the leading "echo".)
The problem lies with shells' extensive use of barewords. If you could eliminate the requirement that any bareword be treated as a string, parsing shell code would become much simpler... but also few people would want to use it, because nobody wants to quote every single word (e.g. "echo" "hello" instead of echo hello) in their interactive shell.
> Dealing with the Yacc file is often much easier than dealing with the code directly, since it takes care of writing the actual parser. There is a lot of boilerplate that goes into parsers, and when you use Yacc it "just works".
Honestly, I think this is overstating the amount of boilerplate in a parser and overstating how well a parser generator "just works". I haven't used Yacc, so maybe it's better than ANTLR, but having tried ANTLR and written a few recursive descent parsers I've been pretty well cured of wanting to ever use a parser generator. ANTLR's generated code is verbose, the data structures are hard to work with, and error handling leaves a lot to be desired.
Parser boilerplate can be reduced to a large extent with a good set of helper methods (I often find myself referring back to the set used in Crafting Interpreters [0]), and what you get in exchange is full control over the data structure generated by the parser and over the error handling. For a language that you're serious about, that tradeoff is totally worth it.
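For reference, that helper set is roughly the following, transliterated from the book's Java to Go (Token and TokenType here are placeholders for whatever your lexer produces, and the token slice is assumed to end with an EOF token):

    package parser

    import "fmt"

    type TokenType int

    const EOF TokenType = iota // plus whatever other token types you have

    type Token struct {
        Type   TokenType
        Lexeme string
    }

    type Parser struct {
        tokens  []Token // must end with an EOF token
        current int
    }

    // match consumes the next token if it is one of the given types.
    func (p *Parser) match(types ...TokenType) bool {
        for _, t := range types {
            if p.check(t) {
                p.advance()
                return true
            }
        }
        return false
    }

    // consume is like match, but reports an error if the expected token is missing.
    func (p *Parser) consume(t TokenType, msg string) (Token, error) {
        if p.check(t) {
            return p.advance(), nil
        }
        return Token{}, fmt.Errorf("%s (got %v)", msg, p.peek())
    }

    func (p *Parser) check(t TokenType) bool {
        return !p.isAtEnd() && p.peek().Type == t
    }

    func (p *Parser) advance() Token {
        if !p.isAtEnd() {
            p.current++
        }
        return p.previous()
    }

    func (p *Parser) isAtEnd() bool   { return p.peek().Type == EOF }
    func (p *Parser) peek() Token     { return p.tokens[p.current] }
    func (p *Parser) previous() Token { return p.tokens[p.current-1] }

Each production rule then becomes a method built from match/check/consume, and both the AST types and the error messages stay entirely under your control.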
Maybe it's just my skill level, but I've used both hand-rolled recursive-descent and ANTLR for the same project (Thrift parser), and hoo boy I would never go back to recursive-descent for that. ANTLR shrank my code by an order of magnitude, and cleaned up some bugs too.
I'd be willing to believe that beyond a certain level of input complexity, ANTLR no longer pays for itself. In my experience, there exists a class of languages for which there's no better tool.
I would love to see the diff between the hand-rolled recursive-descent parser and the ANTLR syntax!
I certainly feel the amount of boilerplate in my hand-rolled recursive-descent parser is manageable. Of course it's not as succinct as an EBNF grammar:
- For example, you have to write an actual loop (with "for" and looping conditions) instead of just * for repetition
- The Go formatter demands newlines in most control-flow constructs
- Go is also not the most succinct language in general
So you do end up with many more lines of code. But at the end of the day, the structure of each parsing function is remarkably similar to a production rule, and for simpler ones I can mentally map between them pretty easily, with the added benefit of being able to insert code anywhere if I need something beyond old-school context-free parsing.
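For example, a rule like list = item ("," item)* ends up as a function of roughly this shape (a sketch, not the actual Elvish code):

    package parse

    type parser struct {
        src string
        pos int
    }

    type ListNode struct {
        Items []string
    }

    // list = item ("," item)*
    func (p *parser) parseList() *ListNode {
        n := &ListNode{Items: []string{p.parseItem()}}
        // The * in the grammar becomes an explicit loop, and the formatter puts
        // the braces and body on their own lines; that's where the extra lines go.
        for p.pos < len(p.src) && p.src[p.pos] == ',' {
            p.pos++ // consume ","
            n.Items = append(n.Items, p.parseItem())
        }
        return n
    }

    // item = a run of characters other than ","
    func (p *parser) parseItem() string {
        start := p.pos
        for p.pos < len(p.src) && p.src[p.pos] != ',' {
            p.pos++
        }
        return p.src[start:p.pos]
    }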
> This video seems to combine the concepts of lexing and parsing. It is usually beneficial to separate these two steps and lex the input into tokens before passing to the parser.
Historically, yes. In recent years combined lexer-parsers have outperformed dedicated lexer + dedicated parser combinations, and with modern tooling this isn't the janky mess it used to be. Some of the best tools out there are combined lexer-parsers.
> - This video seems to combine the concepts of lexing and parsing. It is usually beneficial to separate these two steps and lex the input into tokens before passing to the parser.
With traditional techniques, yes. But if you e.g. use parser combinators (which would admittedly be a bit unusual in Go), combining both steps is pretty common.
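E.g. here's a minimal scannerless combinator sketch in Go (not from any particular library); the "lexer" is just more parsers built from the same primitives:

    package combinator

    import "unicode"

    // A Parser consumes some prefix of the input and either fails or returns
    // the parsed value plus the remaining input.
    type Parser[T any] func(input []rune) (value T, rest []rune, ok bool)

    // Rune matches a single rune satisfying pred.
    func Rune(pred func(rune) bool) Parser[rune] {
        return func(in []rune) (rune, []rune, bool) {
            if len(in) == 0 || !pred(in[0]) {
                return 0, in, false
            }
            return in[0], in[1:], true
        }
    }

    // Many1 applies p one or more times.
    func Many1[T any](p Parser[T]) Parser[[]T] {
        return func(in []rune) ([]T, []rune, bool) {
            var out []T
            for {
                v, rest, ok := p(in)
                if !ok {
                    break
                }
                out, in = append(out, v), rest
            }
            return out, in, len(out) > 0
        }
    }

    // Ident recognizes what a separate lexer would call an IDENT token, but it
    // is built directly from character-level parsers; there is no token stream.
    var Ident = Many1(Rune(unicode.IsLetter))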
> - Go actually has a pure Go implementation of Yacc in the toolset, and I've used it in several projects to make parsers. Dealing with the Yacc file is often much easier than dealing with the code directly, since it takes care of writing the actual parser. There is a lot of boilerplate that goes into parsers, and when you use Yacc it "just works".
You are right that it's best to avoid Go when you can. Just like Java folks (stereotypically) seemed to avoid writing Java at all costs and rather wrote XML config files to drive their logic.
Yacc (and Lex) are otherwise not a good choice for specifying languages these days.
As the sibling comment mentioned, you can find documentation on Elvish itself on the website https://elv.sh. There are tutorials and (not 100% but fairly complete) reference documents.
If you're interested in Elvish, you may also be interested in the talk on its design - https://www.youtube.com/watch?v=wrl9foNXdgM