Major mode for editing SEML (S-Expression Markup Language) files

tgbugs · on June 3, 2019

I've been using SXML [0] for all my sgml needs in Racket and the quality of life improvement from having a sane and regular syntax for everything is hard to overstate. seml looks like it might have the same kind of quality of life improvements for some of my elisp-only code. I'm not sold on the way that missing attributes are handled using nil, that seems like a design decision that was made to simplify parsing at the expense of making the representation more cluttered.

[0] https://docs.racket-lang.org/sxml/

neilv · on June 3, 2019

My old Scheme/Racket permissive HTML parser[0] initially used what might've been exactly this SEML format. Because it's perhaps the most natural choice for a Lisp person -- HTML element is a list, item 0 of that list is a symbol for the HTML element name, item 1 is an alist of HTML attributes, tail items are HTML element content.
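As a rough illustration of that layout (not SEML's actual implementation), here is a minimal Python sketch. The tuple shape `(name, attrs, *children)` and the `render` helper are assumptions made up for this example; `attrs` is a list of key/value pairs or `None`, mirroring the alist-or-nil slot.

```python
from html import escape

def render(node):
    """Serialize a hypothetical (name, attrs, *children) tuple to HTML text."""
    if isinstance(node, str):
        # Plain strings are element content and get HTML-escaped.
        return escape(node)
    name, attrs, *children = node
    # attrs is None (the "nil" slot) or a list of (key, value) pairs.
    attr_text = "".join(f' {k}="{escape(v)}"' for k, v in (attrs or []))
    inner = "".join(render(child) for child in children)
    return f"<{name}{attr_text}>{inner}</{name}>"

page = ("p", [("class", "note")], "Tom & ", ("b", None, "Jerry"))
print(render(page))  # <p class="note">Tom &amp; <b>Jerry</b></p>
```

The sketch always emits an end tag, so void elements such as `meta` or `link` would need special handling in anything real.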

However, I changed the format when I saw Oleg Kiselyov's SXML work, to make my Web-scraping and other tools able to use his XML tools. I later made a few other tools that used SXML, such as a simple HTML-writing template embedded in Racket that does some of the checking and work at compilation time.[1]

At a library level, SXML's arbitrary nested lists make some things computationally harder to do (e.g., find all the attributes, depending on the "normal form" of SXML), but some other things easier to do (e.g., some kinds of functional editing, due to arbitrary nested lists). Aside from those considerations, SXML is the closest to a de facto standard for XML and HTML tools in Scheme.
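To make the "find all the attributes" point concrete, here is a small Python sketch. It assumes a loosely normalized tree in which the `@` list may be absent and is not trusted to sit at a fixed position; the real SXML normal forms are defined in Kiselyov's specification and are not reproduced here. Both helper names are hypothetical.

```python
def sxml_attributes(node):
    """Scan the children of an SXML-style list for attribute lists headed by '@'."""
    found = []
    for child in node[1:]:
        if isinstance(child, list) and child and child[0] == "@":
            # Each attribute is assumed to be a [name, value] pair.
            found.extend((attr[0], attr[1]) for attr in child[1:])
    return found

def fixed_slot_attributes(node):
    """With a fixed attribute slot (item 1), lookup is a single index."""
    return list(node[1] or [])

with_attrs = ["a", ["@", ["href", "x.html"], ["id", "top"]], "link"]
without_attrs = ["b", "bold"]
print(sxml_attributes(with_attrs))     # [('href', 'x.html'), ('id', 'top')]
print(sxml_attributes(without_attrs))  # []
print(fixed_slot_attributes(["a", [("href", "x.html")], "link"]))
```

The fixed-slot version never has to inspect children, which is the kind of cost difference the paragraph above alludes to.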

If I ever happen to have funding to do so, I'd like to revisit the exact representations, to try to come up with an end-all-be-all for all purposes, and redesign/reimplement all the tools from scratch. Until then, there's SXML.

[0] https://www.neilvandyke.org/racket/html-parsing/

[1] https://www.neilvandyke.org/racket/html-template/

agumonkey · on June 3, 2019

It's funny how lisp/fp often removes irregularity at the cost of abstraction [0], which has people pulling their hair out to the point of going back to simpler but irregular, separate logical tools. Even if they complain about it very .. regularly.

[0] some people can't bear lisp uniformity for instance.

chriswarbo · on June 5, 2019

Keep in mind that Lisp (including SXML) can be written in many ways (parenthesised, indentation based, braces, infixed, prefixed, etc.), which can be mixed and matched within the same expression, and can be trivially converted between automatically.

I bring this up so often that I have a go-to blog post for it: http://chriswarbo.net/blog/2017-08-29-s_expressions.html

noir_lord · on June 3, 2019

Although different, I got a similar quality-of-life improvement from using pug for HTML. Once you get used to it, it's faster to write and, crucially, much easier to read; it makes the intent so much clearer.

zeveb · on June 3, 2019

I used to spend a lot of time manually typing up examples of why S-expressions are cleaner and more readable than XML, HTML, JSON, YAML &c. They are, they really are. And yet for some reason there's a population of people who prefer:

    <!DOCTYPE html>
    <html lang="en">
        <head>
            <meta charset="utf-8"/>
            <title>sample page</title>
            <link rel="stylesheet" href="sample1.css"/>
        </head>
        <body>
            <h1>sample</h1>
            <p>
                text sample
            </p>
        </body>
    </html>

to:

    (html ((lang . "en"))
          (head nil
                (meta ((charset . "utf-8")))
                (title nil "sample page")
                (link ((rel . "stylesheet") (href . "sample1.css"))))
          (body nil
                (h1 nil "sample")
                (p nil "text sample")))

I don't understand it, but it seems to be true. The egotistical part of me feels that they just haven't experienced the enlightenment of understanding the benefits of having all data & code be manipulable, structured data rather than dead text which must be painstakingly parsed, combined with the benefits of a single, general, universal, cheaply-parseable representation.

But the professional, open-minded part of me wonders if maybe I am missing the point. Maybe all that painful-to-parse, irregular syntax is buying something. Maybe there's a reason every generation for the last 50 years has been approximating some but not all of the features of Lisp. Maybe those other languages and formats have worthwhile benefits. Maybe they're even superior.

Or maybe most folks really are stuck in a local maximum, like kids who like being read to and don't see the advantage of learning to read. I honestly don't know.

Regardless, SEML looks great.
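As a sketch of what "cheaply-parseable" can mean in practice, here is a tiny reader written in Python. It is an illustration under simplifying assumptions, not any Lisp's real reader: it handles only lists, double-quoted strings, and bare atoms, with no dotted pairs, comments, or quote syntax. The `Symbol` class and `read` function are names invented for this example.

```python
import re

class Symbol(str):
    """Marks a bare atom so it can be told apart from a quoted string."""

# One token: a paren, a double-quoted string (with backslash escapes), or an atom.
TOKEN = re.compile(r'''\s*(?:([()])|"((?:[^"\\]|\\.)*)"|([^\s()"]+))''')

def read(text):
    """Parse a single s-expression into nested Python lists."""
    text = text.strip()
    stack = [[]]  # stack[0] collects the top-level forms
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"bad token at offset {pos}")
        pos = match.end()
        paren, string, atom = match.groups()
        if paren == "(":
            stack.append([])
        elif paren == ")":
            if len(stack) == 1:
                raise SyntaxError("unbalanced ')'")
            finished = stack.pop()
            stack[-1].append(finished)
        elif string is not None:
            # Undo backslash escapes inside the string literal.
            stack[-1].append(re.sub(r"\\(.)", r"\1", string))
        else:
            stack[-1].append(Symbol(atom))
    if len(stack) != 1:
        raise SyntaxError("unbalanced '('")
    return stack[0][0]

tree = read('(body (h1 "sample") (p "text sample"))')
print(tree)  # ['body', ['h1', 'sample'], ['p', 'text sample']]
```

Roughly thirty lines cover the whole grammar used by the examples in this thread; a conforming HTML parser is a much larger undertaking.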

nfoz · on June 3, 2019

Your example (and all the SEML documentation it seems) is missing marked-up text. For example, how would you represent:

    <p>This is a <b>really cool</b> sentence.</p>

Your solution will probably have to splice the text segments around the embedded markup, something like:

    (p nil (text "This is a " (b nil "really cool") " sentence."))

In particular, notice the careful whitespace at the edges of the strings.

IMO this Sexpr is now more obtuse than the XML, and the more markup you have within text-spans (e.g. nested markup), the worse it gets. This is also a major difference between XML (markup language) vs JSON (data-structure notation).

Maybe you don't need the (text ...) thing, but either way you're changing the grammar. How do SEML and SXML handle this?

undersuit · on June 3, 2019

The issue is the spaces, so let's get rid of them!

    (p nil (text (string-join '("This" "is" "a" (b nil "really") (b nil "cool") "sentence.") " ")))

No more wondering if the space between 'really' and 'cool' should be bold and no need to have awkward preceding and trailing padding.

zeveb · on June 4, 2019

I think that's mostly an artifact of SEML. In SXML that would be:

    (p "This is a " (b "really cool") " sentence.")

Which seems a-okay to me.
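The whitespace concern raised earlier still applies to the SXML form, since adjacent text segments are simply concatenated. A minimal Python sketch, using nested lists as a stand-in for the s-expressions (the `text_of` helper is hypothetical):

```python
def text_of(node):
    """Concatenate the text content of an SXML-style nested list."""
    if isinstance(node, str):
        return node
    # node[0] is the element name; skip any attribute list headed by "@".
    return "".join(
        text_of(child)
        for child in node[1:]
        if not (isinstance(child, list) and child and child[0] == "@")
    )

careful = ["p", "This is a ", ["b", "really cool"], " sentence."]
sloppy = ["p", "This is a", ["b", "really cool"], "sentence."]
print(text_of(careful))  # This is a really cool sentence.
print(text_of(sloppy))   # This is areally coolsentence.
```

The spaces at the edges of the strings are part of the data, exactly as they are in the HTML source.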

TiredOfLife · on June 3, 2019

Quoting the page: "SEML is short and easy to understand for Lisp hacker."

For someone that has edited a couple of html pages but is not a programmer SEML looks like gibberish (who is nil?).

zeveb · on June 3, 2019

Honestly, SXML is probably cleaner and easier than SEML. The example in SXML would be:

    (html (@ (lang "en"))
          (head
           (meta (@ (charset "utf-8")))
           (title "sample page")
           (link (@ (rel "stylesheet") (href "sample1.css"))))
          (body
           (h1 "sample")
           (p  "text sample")))

The great thing about standards is that there are so many to choose from.
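Converting between the two shapes is mechanical. Here is a hedged Python sketch, using nested lists as stand-ins for the s-expressions; `seml_to_sxml` is a hypothetical helper, not part of either project.

```python
def seml_to_sxml(node):
    """Turn [name, attrs_or_None, *children] into [name, ["@", ...]?, *children]."""
    if isinstance(node, str):
        return node
    name, attrs, *children = node
    out = [name]
    if attrs:
        # Emit an "@" list only when there are attributes to carry over.
        out.append(["@"] + [[key, value] for key, value in attrs])
    out.extend(seml_to_sxml(child) for child in children)
    return out

seml = ["html", [("lang", "en")],
        ["body", None,
         ["h1", None, "sample"],
         ["p", None, "text sample"]]]
print(seml_to_sxml(seml))
# ['html', ['@', ['lang', 'en']], ['body', ['h1', 'sample'], ['p', 'text sample']]]
```

The `nil` placeholders disappear in the output because the `@` marker makes the attribute list self-identifying.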

kazinator · on June 3, 2019

> who is nil

A noun in the English language, of Latin origin.

https://www.merriam-webster.com/dictionary/nil

tlavoie · on June 3, 2019

Makes me think of Edi Weitz's CL-WHO, which works very nicely if creating web pages from Common Lisp. https://edicl.github.io/cl-who/

notduncansmith · on June 3, 2019

This reminds me very much of Hiccup[1]. Both nested s-expressions and HTML describe trees.

[1] https://github.com/weavejester/hiccup

StreamBright · on June 3, 2019

I really like Hiccup, it is my favorite part of the Clojure web kit.

agumonkey · on June 3, 2019

lisp and trees, you know

Lowkeyloki · on June 3, 2019

This is interesting. It reminds me of the API behind JSX. But I'm not sure what problem this is seeking to solve exactly. Is it showing that HTML and s-expressions are technically interchangeable?

txru · on June 3, 2019

If you have time, this[0] is the canonical article usually shared around this concept. The thrust of it is that yes, XML (or x-expressions) and s-expressions are very similar, and that s-expressions are a less verbose and simpler way to represent data.

[0] https://www.defmacro.org/ramblings/lisp.html

js8 · on June 3, 2019

Actually, there is a fundamental difference between XML and sexps. In XML, text is unescaped, while the metadata are escaped. In sexps, the metadata are unescaped, while the text is escaped.

Most text formats fall into one of these two categories. Formats primarily for storing text (like XML or SGML or TeX) are in the former, formats primarily for storing (unstructured) data (like sexps or JSON or YAML) are in the latter.
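A small Python sketch of that distinction, with two made-up helpers: in the XML-style output the text is written bare and only markup characters are escaped, while in the sexp-style output the structure is bare and the text is wrapped in quotes.

```python
from html import escape

def as_xml(tag, text):
    """Text is written as-is except for characters that look like markup."""
    return f"<{tag}>{escape(text, quote=False)}</{tag}>"

def as_sexp(tag, text):
    """Structure is bare; the text is quoted, with embedded quotes escaped."""
    quoted = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'({tag} "{quoted}")'

sample = 'He said "a < b" & left.'
print(as_xml("p", sample))   # <p>He said "a &lt; b" &amp; left.</p>
print(as_sexp("p", sample))  # (p "He said \"a < b\" & left.")
```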

jakear · on June 3, 2019

Skimmed the article, left me a bit confused. Am I missing something big, or is this not particularly novel? The similarities between s-expressions and XML seem fairly obvious to me.

neilv · on June 3, 2019

It's obvious to any Lisp person, and very convenient, in some ways.

A few things usually going on with S-expression representation of XML or HTML encoding in a Lisp:

1. It uses some of the native basic types of Lisp -- the list, the symbol, and the string.

2. The HTML element values that you type in your source and that are displayed to you are generally in the same syntax, since that's how Lisps tend to work with the basic types. (For contrast, you don't type your HTML like `<html><body><p>Hi</p></body></html>` in your source, and then see it in the debugger like `HtmlElement#abcd1234("html", {HtmlElement#c948f447("body", {HtmlElement#e7e7e7e7("p", {HtmlCdata#c8c8c8c8c8("Hi")})})})`.)

3. The S-expression printed representation you see in your source can be less verbose than HTML or XML, such as by not needing HTML element end tags. Though you will have to put your HTML CDATA text as quoted string literals.

4. A Lisp person's typical code indenting (supported by the editor), tends to expose the tree/forest structure of HTML conveniently:

    (html (head (title "My Page"))
          (body (p "First paragraph.")
                (blockquote "Don't quote me on that.")
                (div (p "Another paragraph.")
                     (p "Yet another paragraph."))
                (p "Hey, it's a paragraph.")))

Note that I probably wouldn't type a huge book this way. I might instead use Markdown or a DSL or alternate reader, such as Scribble or its at-reader, mainly to get TeX-like paragraphs: https://docs.racket-lang.org/scribble/ https://docs.racket-lang.org/scribble/reader-internals.html
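Point 2 has a rough analogue outside Lisp. In this Python sketch (the `HtmlElement` class is a made-up stand-in for a typical DOM-style object), a tree built from basic types prints in the syntax you would type, while the object prints as an opaque handle:

```python
import ast

class HtmlElement:
    """A made-up DOM-style node with no custom printed representation."""
    def __init__(self, name, children):
        self.name = name
        self.children = children

as_object = HtmlElement("p", ["Hi"])
as_list = ["p", "Hi"]

print(repr(as_list))    # ['p', 'Hi']
print(repr(as_object))  # something like <...HtmlElement object at 0x...>

# The printed list can be read back as source; the object's repr cannot.
round_tripped = ast.literal_eval(repr(as_list))
print(round_tripped == as_list)  # True
```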

tannhaeuser · on June 3, 2019

> The S-expression printed representation you see in your source can be less verbose than HTML, such as by not needing HTML element end tags

Tag inference/omission in SGML (and by extension HTML when seen as an application of SGML) is way more powerful. A minimal, valid HTML document is this:

    <title>Whatever title</title>
    <p>Text goes here

SGML's tag inference, when coupled with a DTD for HTML5 such as mine [1], will treat that as equivalent to this:

    <html>
      <head>
        <title>Whatever title</title>
      </head>
      <body>
        <p>Text goes here</p>
      </body>
    </html>

See details in slides or paper linked from [2].

[1]: http://sgmljs.net/docs/w3c-html51-dtd.html

[2]: http://sgmljs.net/blog/blog1701.html

neilv · on June 3, 2019

Neat that you've done this for HTML5. Early Web browsers tended to do some of that (with less-rigorous semantics, some of which I approximated in my early HTML parser). Consequently, Web pages in practice often did, too. Maybe that could make manual writing of documents in HTML5 more practical. For bits of HTML embedded in code lately, I've preferred a simpler model and syntax using S-expressions, but I could see the implied tags as very useful for handwriting documents SGML-style.

txru · on June 3, 2019

Well, it is and isn't novel. S-expressions have been around since McCarthy, XML since the mid-90's. The point of the article is that XML is a more verbose re-interpretation of s-expressions-- it's hard to find things that XML brings to the table that sexps don't have. What's more, inside editors, there are really clever things that manipulate sexprs, move them around, redefine their semantic meaning. XML usually doesn't work quite that way, not as a first intent.

tannhaeuser · on June 3, 2019

SGML has been around since the 1960s, and XML is specified as a proper subset of it. SGML/XML isn't so much about (trivial) nesting as it is about content models, e.g. the language defined by a regular expression admitted/recognized as the content of a particular element. Markup is also first and foremost a plain text format, optionally tagged by start-/end-element tags, unlike sexprs, which need quotes around individual spans of text. Try telling an author to use verbose quoting (and escaping for quotes) for what makes up the majority of his text format, or try editing a large text with verbose sexprs yourself, and you'll see why nobody uses sexprs for semistructured text.
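A toy Python sketch of the content-model idea: each element's allowed children are described by a regular expression over the sequence of child element names. The models below are invented for illustration and are not taken from any real HTML DTD; `children_valid` is a hypothetical helper.

```python
import re

# Hypothetical content models: regexes over space-terminated child names.
CONTENT_MODELS = {
    "html": r"head body ",
    "head": r"title (meta |link )*",
    "ul": r"(li )+",
}

def children_valid(name, child_names):
    """Check a sequence of child element names against the element's model."""
    sequence = "".join(child + " " for child in child_names)
    return re.fullmatch(CONTENT_MODELS[name], sequence) is not None

print(children_valid("html", ["head", "body"]))           # True
print(children_valid("html", ["body", "head"]))           # False
print(children_valid("head", ["title", "meta", "link"]))  # True
print(children_valid("ul", []))                           # False
```

Nothing stops a checker like this from running over an s-expression tree as well; the sketch only shows what a content model adds beyond nesting.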

txru · on June 3, 2019

I take your point, and I really don't want to be the spark of a syntactic flamewar. I suspected I was missing something, and that the $angle_bracketed_format was older, but I was searching the wrong things.

It seems to me, though, that escaping is just something that's going to be tricky everywhere, and a decent first line solution, wherever you are, is to have a really rare set of characters represent your begin/end string marks. In Python, """text""", Postgres has $$text$$, non-ASCII characters in other formats. XML and sexps are both susceptible to that issue-- both of their escapes are, themselves, escapable. To either one, if you have a subregion that's likely to be unintentionally escaped, then you create a boundary where you either explicitly escape every one, or you refuse to acknowledge previously accepted delimiters. As an example, lisps have (quote term) rather than 'term when you're writing macros and concerned with macro-expansion.

To your regex point, there are lisps that definitely did awful deeds with that, particularly emacs lisp, but the more recent ones have solutions just like other modern programming languages and markups do.

To me the unending escape just kind of seems like a universal bug. While lisp is just as susceptible, lisps have perfectly reasonable ways of treating these problems-- separate, make distinct, and as last resort, escape.

goto11 · on June 3, 2019

SGML came out of document authoring and publishing. SGML is more suited for this domain because (among other things) you don't have to quote every single string. It is not like the SGML community didn't know about s-expressions: DSSSL, the style language for SGML, was based on Scheme, so they recognized s-expressions were appropriate for some domains.