> makes it difficult to share parser logic between the two It’s difficult to reu...

rhdunn · on April 11, 2022

My perspective comes from writing an XPath and XQuery lexer and parser for IntelliJ, which has its own lexer, parser, and AST APIs.

The XPath lexer and parser are designed to be overridden where needed to implement the XQuery lexer and parser.

The lexer is stateful: it keeps a stack of states so that it can tokenize the different structures (string literals, comments, embedded XML) correctly. A compiler could instead use the parser's state as the context that drives the tokenizer, without needing a state/stack-based lexer.
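
As a rough illustration of the stack-based approach (not my actual IntelliJ code), here is a minimal Python sketch that handles only nested `(: ... :)` comments and double-quoted string literals. The state names, the `tokenize` function, and the catch-all `"other"` token kind are all made up for this example.

```python
from typing import List, Tuple

# Hypothetical lexer states for this sketch.
DEFAULT, COMMENT, STRING = "default", "comment", "string"


def tokenize(text: str) -> List[Tuple[str, str]]:
    """Split text into (kind, lexeme) pairs using a stack of lexer states."""
    tokens: List[Tuple[str, str]] = []
    stack = [DEFAULT]
    start = 0  # start offset of the outermost open comment or string
    i = 0
    while i < len(text):
        state = stack[-1]
        two = text[i:i + 2]
        if state == DEFAULT:
            if two == "(:":
                stack.append(COMMENT)
                start = i
                i += 2
            elif text[i] == '"':
                stack.append(STRING)
                start = i
                i += 1
            elif text[i].isspace():
                i += 1
            else:
                # Everything else is lumped into one "other" token here.
                j = i
                while (j < len(text) and not text[j].isspace()
                       and text[j] != '"' and text[j:j + 2] != "(:"):
                    j += 1
                tokens.append(("other", text[i:j]))
                i = j
        elif state == COMMENT:
            if two == "(:":
                stack.append(COMMENT)  # comments nest, so push again
                i += 2
            elif two == ":)":
                stack.pop()
                i += 2
                if stack[-1] == DEFAULT:
                    tokens.append(("comment", text[start:i]))
            else:
                i += 1
        else:  # STRING
            if two == '""':
                i += 2  # a doubled quote is an escaped quote
            elif text[i] == '"':
                stack.pop()
                i += 1
                tokens.append(("string", text[start:i]))
            else:
                i += 1
    if stack[-1] != DEFAULT:
        tokens.append(("unterminated", text[start:]))
    return tokens
```

The stack is what lets `(: a (: b :) c :)` come out as one comment, and lets a `(:` inside a string stay part of the string.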

The lexer also classifies keyword tokens as a kind of identifier, because keywords can be used as identifiers. A compiler does not need this, as its parser knows when it expects a keyword.
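
To show why context matters, here is a toy Python sketch of the compiler-style approach, where the parser's expectation decides whether a word is a name or an operator. It assumes whitespace-separated input and a small, hypothetical `OPERATOR_KEYWORDS` set; `classify` is an invented helper, not part of any real parser.

```python
from typing import List, Tuple

# Hypothetical subset of operator keywords, for illustration only.
OPERATOR_KEYWORDS = {"div", "mod", "and", "or"}


def classify(text: str) -> List[Tuple[str, str]]:
    """Label each whitespace-separated word by the role its position implies."""
    roles: List[Tuple[str, str]] = []
    expect_operand = True
    for word in text.split():
        if expect_operand:
            # In operand position even a keyword is read as a plain name.
            roles.append((word, "integer" if word.isdigit() else "name"))
            expect_operand = False
        elif word in OPERATOR_KEYWORDS:
            roles.append((word, "operator"))
            expect_operand = True
        else:
            roles.append((word, "error"))
    return roles
```

With this, `classify("div div div")` labels the first and last `div` as names and the middle one as the operator. An IDE lexer runs without that parser context, which is why mine reports keywords as an identifier type and leaves the decision to the parser.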

My parser handles the different versions of XPath/XQuery, the different extensions, and vendor-specific extensions all in a unified lexer/parser. A compiler could ignore the bits it does not support and simplify some of the logic.

My QName parser is very complex because it provides error recovery and reporting for problems such as spaces in the name. Other parsers (e.g. Saxon) treat the QName as a single token.
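
A much-simplified Python sketch of that kind of recovery is below. It is not my implementation: `parse_qname`, the error messages, and the ASCII-only name pattern are assumptions made for the example (real NCNames allow far more characters).

```python
import re
from typing import List, Optional, Tuple

# Simplified tokens: whitespace runs, ':' and ASCII-only names.
_TOKEN = re.compile(r"\s+|:|[A-Za-z_][A-Za-z0-9_.-]*")


def parse_qname(text: str) -> Tuple[Optional[str], Optional[str], List[str]]:
    """Return (prefix, local_name, errors), recovering from stray whitespace."""
    tokens = _TOKEN.findall(text)
    errors: List[str] = []
    pos = 0

    def peek() -> Optional[str]:
        return tokens[pos] if pos < len(tokens) else None

    def is_name(tok: Optional[str]) -> bool:
        return tok is not None and tok != ":" and not tok.isspace()

    if not is_name(peek()):
        return None, None, ["expected a name"]
    first = tokens[pos]
    pos += 1

    # Look past optional whitespace to see whether a ':' follows.
    look = pos
    if peek() is not None and peek().isspace():
        look += 1
    if look >= len(tokens) or tokens[look] != ":":
        return None, first, errors  # unprefixed name
    if look != pos:
        errors.append("whitespace before ':'")
    pos = look + 1

    if peek() is not None and peek().isspace():
        errors.append("whitespace after ':'")
        pos += 1
    if not is_name(peek()):
        errors.append("missing local name after ':'")
        return first, None, errors
    return first, tokens[pos], errors
```

Here `parse_qname("xs : string")` still yields the prefix `xs` and local name `string`, plus two whitespace errors to report. A parser that lexes the QName as a single token would simply fail to match.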

I'm also generating a full AST with single nodes removed, e.g.:

    XPath
       InstanceofExpr
          IntegerLiteral       "5"
          XmlNCName            "instance"
          XmlNCName            "of"
          SequenceType
             AtomicOrUnionType
                QName
                   XmlNCName   "xs"
                   Token       ":"
                   XmlNCName   "string"
             Token             "?"
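
One way to get a tree like this is to drop pass-through expression nodes that wrap a single child. The Python sketch below shows that idea only; the `Node` class, the `collapse` function, and the `PASS_THROUGH` set are hypothetical, and the real rule for which nodes are removed may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical subset of grammar productions that can wrap a single child.
PASS_THROUGH = {"OrExpr", "AndExpr", "AdditiveExpr", "InstanceofExpr"}


@dataclass
class Node:
    kind: str
    text: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def collapse(node: Node) -> Node:
    """Return a copy of the tree without single-child pass-through nodes."""
    children = [collapse(child) for child in node.children]
    if node.kind in PASS_THROUGH and len(children) == 1:
        return children[0]
    return Node(node.kind, node.text, children)


# OrExpr > AndExpr > IntegerLiteral reduces to just the literal.
chain = Node("OrExpr", children=[
    Node("AndExpr", children=[Node("IntegerLiteral", "5")]),
])
literal = collapse(chain)
```

An `InstanceofExpr` with more than one child is kept, as are nodes outside the set such as `SequenceType`.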

I'm traversing this AST to do things like variable and namespace resolution. For modules, I'm using the IDE's mechanisms to search the project files. In a compiler, these would be collated and built as the file is parsed, which does not work with incremental/partial parsing.

I'm getting to the stage where I can evaluate several static programs, because implementing IDE features and providing static analysis requires it.