Hacker News new | past | comments | ask | show | jobs | submit login
The Hardest Program I've Ever Written – How a code formatter works (2015) (stuffwithstuff.com)
500 points by goranmoomin on March 27, 2020 | hide | past | favorite | 125 comments



Philip Wadler has a seminal paper [0] on implementing pretty printers by modeling them using an algebra. There's implementations of the algorithm in most programming languages nowadays, and some state of the art advances that help with performance [1]

It's a very elegant algorithm, and is very pleasant to work. Elixir's formatter uses a variant of it [2], and it appears in the standard library [3]. I've personally made use of it to write a code formatter for GraphQL queries, and it was a very small amount of code for some powerful results.

Definitely take a look if you ever need to do something like this.

[0] https://homepages.inf.ed.ac.uk/wadler/papers/prettier/pretti...

[1] https://jyp.github.io/pdf/Prettiest.pdf

[2] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.34.2...

[3] https://hexdocs.pm/elixir/Inspect.Algebra.html


I'm gonna take a look at Wadler's paper—but does anyone want to take a shot at a high-level description of how an algebra would be applicable to the problem of pretty printing?

Any more general comments on applying algebras to solving programming problems would also be appreciated.

My rough understanding of the benefits/rationale atm:

1) the algebra is a nice/minimal factoring of the problem

2) being able to formally classify it means we know a bunch of its properties, which tells us certain approaches are definitely (im)possible (and these properties are drawn from a fairly small set: e.g. associativity, distributivity, existence of inverses)

3) If the algebra is explicitly represented in code (maybe in a library), we can use that general structure to verify (and generate?) certain aspects of our particular problem solution.

Edit: The paper mentions "Over the years, Richard Bird and others have developed the algebra of programming to a fine art" —and I found Bird's book "Algebra of Programming". Seems like what I'm looking for maybe, but would be grateful to hear of any alternative resources as well.


You'd likely be more interested in the original paper [0] by John Hughes that Wadler built off of. It gets into a ton more detail about the algebraic structure of the documents they're defining:

> The question addressed in this paper is: how should libraries of combinators be designed? Our goal is to show how we can use formal specification of the combinators, and a study of their algebraic properties, to guide both the design and implementation of a combinator library. Our case study is a library for pretty-printing. But the methods we present are of wider applicability, and we also show how they can be used to develop a number of different monads.

Wadler's paper touches upon some of the same ideas, but the primary goal of his paper is to build a prettier printer, whereas Hughes aimed to demonstrate the power of algebraic properties.

[0] http://www.cse.chalmers.se/~rjmh/Papers/pretty.html


For anyone curious about the difference between the two: Hughes' pretty printer works really well for pretty-printing Haskell code, but Wadler's is more flexible for pretty-printing C-style languages where there is a dedented closing brace at the end of scopes.

It is 100% true that Hughes' paper is a better introduction to the idea of having algebraic document combinators. One can skim Hughes' and then feel at ease using a library that implements Wadler's printer - the bigger differences between the two libraries are tucked away in the layout algorithms and the APIs are similar.


> Hughes' pretty printer works really well for pretty-printing Haskell code, but Wadler's is more flexible for pretty-printing C-style languages where there is a dedented closing brace at the end of scopes.

That's very interesting! So far I've only used Wadler-Leijen-style prettyprinters.

Are Hughes-style pretty printers "better" in some way for Haskell-like documents? If so, why?


This looks great—thanks!


It's a little bit different from the other resources I've linked, but "Elements of Programming" [0] is another excellent (though incredibly challenging) read. It's written by Alexander Stepanov, and basically walks through the way he built STL by decomposing algorithms and data structures into abstract mathematical structures.

> Elements of Programming provides a different understanding of programming than is presented elsewhere. Its major premise is that practical programming, like other areas of science and engineering,must be based on a solid mathematical foundation. The book shows that algorithms implemented in a real programming language, such as C++, can operate in the most general mathematical setting. For example, the fast exponentiation algorithm is defined to work with any associative operation. Using abstract algorithms leads to efficient, reliable, secure, and economical software.

I revisit it every few years and get a little bit further every time before I start to struggle with the exercises.

[0] http://elementsofprogramming.com/


Here is an interesting use https://www.lexifi.com/instrumentbox/. Algebraic description of financial contracts that then gives all sorts of common features you want to analyze them ‘for free’.


The actual first true generic pretty printing algorithm comes from Derek Oppen and is referenced in that paper, btw.


> Note that “best” is a property of the entire statement being formatted. A line break changes the indentation of the remainder of the statement, which in turn affects which other line breaks are needed. Sorry, Knuth. No dynamic programming this time.

For what it's worth, the same is true of TeX's line-breaking algorithm too: “best” is a property of the entire paragraph being formatted; a line break can change the indentation of the remainder of the paragraph (in many ways, but most obviously: discretionary hyphens with nontrivial pre- and post- texts), which in turn affects which other line breaks are needed. And yet TeX uses dynamic programming.

So it would be interesting to see how this specific problem differs. (Why dynamic programming works for TeX but not for dartfmt.) Perhaps Knuth chose a definition of "best" that is amenable to dynamic programming (after all he had quite a bit of freedom; it doesn't take much to beat the naive first-fit/best-fit line-breaking algorithms as used in Word etc), or perhaps TeX's algorithm too can be expensive but in practice paragraphs just tend to have a different distribution of feasible breakpoints than code statements.

Knuth and Plass's (very readable!) paper on Breaking Paragraphs into Lines is here: http://www.eprg.org/G53DOC/pdfs/knuth-plass-breaking.pdf

Edit: Ah, I think one relevant difference is probably that when formatting code, one has to "match up" indentation with the previous corresponding sibling of the same nesting level, while with paragraphs the amount of required state is smaller: given a particular line break and indentation, the remainder of a paragraph can always be formatted in the same way, while a code formatter also has to "remember" all previous parent expressions' indentation. (This reason is different from what is directly implied by the quoted statement though…)


The answer is in between the extremes. There are multiple choices of linebreaks that lead to the same indent level at char k, so dynamic programmins is still potentially useful.


Exactly right. The formatter does memoization so that if multiple different solutions end up putting the same subtree at the same level of indentation, that subtree itself is only calculated once.


Thanks, calculating subtrees only once makes sense. But can't you go further (as in TeX and in the comment you're replying to) — that is, if there are two "paths" of breaking that eventually end with the same break at the same indentations (plural because of the matching-siblings issue), then you should have to recalculate the rest only once?

Another question: I must confess not understanding everything clearly, but it sounds like you pop off the lowest-cost breaks until finding a solution that fits the column limit. Couldn't there be cases where avoiding the "best" split in the first step could ultimately result in a lower overall cost? Optimizing over such solutions is the main thing that makes TeX's paragraph-breaking better than the usual greedy algorithms…


> that is, if there are two "paths" of breaking that eventually end with the same break at the same indentations (plural because of the matching-siblings issue), then you should have to recalculate the rest only once?

In theory, yes. But there are other constraints across a statement that the formatter tries to preserve, so the memoization needs to take those into account too. Basically, the key used in the memo table is more complex than just "where in the code" and "indentation level". It has to roll in some extra contextual information about other line breaking choices that were made if those choices force this subtree to be formatted in certain ways.

> Couldn't there be cases where avoiding the "best" split in the first step could ultimately result in a lower overall cost?

Yes, there exactly are, which is why it does indeed find the overall lowest cost solution.


One thing that Knuth forgot is that it is (imho) ugly if two successive lines in a paragraph start or end with the same word. Or more generally if two instances of the same word appear close to eachother.

This shows that writing a formatter is a task that is probably never finished :)


Definitely, the definition of "good" is subjective and one can never get it fully right automatically — but as far as TeX's line-breaking algorithm goes, avoiding successive lines starting or ending with the same word wouldn't require any major change, it could just be added similar to the existing method for consecutive lines being hyphenated (\doublehyphendemerits), without giving up the dynamic-programming algorithm.


I wrote a super short description of the overall architecture of prettier if people are interested: https://blog.vjeux.com/2017/javascript/anatomy-of-a-javascri...


Prettier is amazing. Thanks for building this tool. I’m not sure where you love but if you’re in Seattle or in the valley I would love to buy you a beverage of your choice sometime. I love the tool.


AFAIK no formatting can understand intent. Example. I want my canvas.drawImage calls that take 9 arguments to be formatted with arguments 2-5 on one line, and 6-9 on another line because 2-5 are sourceX, sourceY, sourceWidth, sourceHeight, and 6-9 are destX, destY, destWidth, destHeight.

That's just one example of 100s where an auto formatter can not figure out intent.

Another example, I often align numbers and width and height

    // convert from screen coords to clip space
    clipX = mouseX / screenWidth  *  2 - 1
    clipY = mouseY / screenHeight * -2 + 1  
Another example would be how to format an if block. Depending on your style guide

    if (cond) single-statement
might be ok. Or maybe you want them on different lines. Or maybe you require brackets. Or maybe you don't. But there will almost ways be some place in the code where an exception to your style guide makes the code more readable. Auto formatters don't understand those exceptions.

This is why AFAIK many major projects do NOT use an auto-formatter. Firefox, Safari, Chrome for example. Because there are places where an auto-formatter will make the code far less readable. They'll give the advice that you are free to run an auto-formatter on your own changes but adding an automated one is a no-go. Even the Flutter team does not use an auto-formatter

https://github.com/flutter/flutter/wiki/Style-guide-for-Flut...


Language-wise, the existence of "sourceX, sourceY, sourceWidth, sourceHeight" speaks to a lack of tuples or simple records more than anything.

Obviously a formatter can't change the language, but just want to point out that the issue there is really that the arguments are linked logically but not in the code.

I think a language designed around using an auto-formatter will work better in that scenario. See, for example, Elm—where the use of auto-formatting is essentially universal.


It's a trade off. We've started using a code formatter recently: the overwhelming majority of the code in our team is more readable because it's so standardised. There are a few sub-optimal lines here and there but I wouldn't trade that for going back to the wild old days.


The projects I mentioned above all have style guides. There's no "wild old days". They just don't use a auto formatter because they know there are cases where the auto formatter makes the code less readable and that there are exceptions to every style guide. It's called a style "guide" not a style "law" because it offers guidance which can be ignored when appropriate.


From the Flutter link:

> We do not yet use dartfmt. Flutter code tends to use patterns that the standard Dart formatter does not handle well. We are working with Dart team to make dartfmt aware of these patterns.

So it looks like they want to use an auto-formatter


They definitely do, and most people on the team use dartfmt on at least their own code, usually through the VSCode extension. But it can get ugly in Flutter with deeply better widget trees.

Source: former team member


I don't know about Firefox & Safari, but I think you're missing an important caveat here for Chrome:

> In directories where most contributors have already adopted clang-format, and code is already consistent with what clang-format would produce, some teams intend to experiment with standardizing on clang-format. When these local standards apply, it will be enforced by a PRESUBMIT.py check.

From the looks of it it's a lot of the tree [1].

Where manual formatting gives you some freedom in meeting the style guide, it also brings along with it errors (style guides are massive and it's not possible to remember everything) & sometimes you just make a typo. This also means that your reviewers are spending time validating against the style guide which is not only the most wasteful use of their time, it's still an imperfect process because reviewers are just like the author - they don't have the entire style guide in their head at all times and they may miss violations.

[1] https://github.com/search?q=CheckPatchFormatted+repo%3Achrom...


I think the solution to this would be using an auto-formatter that supports “leave this formatting alone” directives and adding them to your code where needed:

    // convert from screen coords to clip space
    // formatting-preserve-start
    clipX = mouseX / screenWidth  *  2 - 1
    clipY = mouseY / screenHeight * -2 + 1  
    // formatting-preserve-end
That way you get the benefit of easy, convenient consistent formatting in the rest of your file at the expense of a few extra comments here and there.

Sadly, ignore comments of the Prettier formatter (https://prettier.io/docs/en/ignore.html) are not this powerful – they only support ignoring the next line of code.


Well you've just documented examples of your personal style, which can be made into custom rules for the formatter.

Maybe using some natural language processing, machine learning, and crowdsourcing, a formatter could understand intent sooner or later?


> ... which can be made into custom rules for the formatter.

Yep, well, dartfmt is almost non-configurable, without custom rules.

Because consistency over everything.

One of the greatest frustration trying to setup flutter in production for me. And no other formatter for dart I know of.

The configuration will change, all formatters I know added configuration eventually, but man!.


As far as I know, Firefox/Gecko uses clang-format for C++, rustfmt for Rust, and most recently Prettier for JavaScript.


> Or maybe you require brackets.

Your style should require brackets, and probably not be a single line


I just rewrote our Caddy v2 Caddyfile formatter this week [1]. It's not as "smart" as a true code formatter, but what I thought would be just a 1 hour task ended up taking all day. And I'm lucky it wasn't longer!

Lessons learned:

- State is hard. There's something to be said for functional languages, or at least functional patterns.

- The Caddyfile is a very theoretically-impure syntax.

- User input is always messier than you think it will be. (Spend some time helping people on the forums, you'll see what I mean. I'm a neat freak though, so I notice it...)

[1]: https://github.com/caddyserver/caddy/commit/7ee3ab7baa216599...


> There's something to be said for functional languages, or at least functional patterns.

Here’s my obligatory (and excellent) blog post by John Carmack where he adopts the functional style for a month: https://www.gamasutra.com/view/news/169296/Indepth_Functiona...


The author of that post has it _so easy_! Formatting well-formed programs, in a young language with a short spec. He just starts with an AST!

No no my friend. The challenge is when the program is _not_ well-formed, i.e. has errors, or worse yet - refers to to code elsewhere, so you can't even decide if it's well-formed or not. So no AST for you. You have to go into the dark and dangerous word of speculative and partial parsing of language grammars, representing ambiguity, replacing undefined, ambiguous or erroneously-defined semantics with deductions from the author's existing formatting and so on. That's where the real challenges lie.

And - it's not gonna be 3,853 lines, I can tell you this much.


> with a short spec.

Oh, how I wish. Dart is no C/C++ with preprocessor insanity, but it has quite a rich grammar.

> The challenge is when the program is _not_ well-formed, i.e. has errors

Being able to format erroneous code has come up as a feature request a number of times. It would be useful because then you can format even in the middle entering code into your IDE. We do have a very robust parser that can handle incorrect, incomplete code and give you a partial AST.

One of the reasons I've hesitated so far is that testing that in any sort of rigorous seems really difficult to me. Not to mention even defining what "correct" formatting is with code like that.

> refers to to code elsewhere, so you can't even decide if it's well-formed or not.

Thankfully most languages are context free, so that's rarely a problem outside of C/C++. :)

Dart does have a couple of semi-contextual syntax corners that could potentially affect formatting. An expression like:

    foo.bar();
Could potentially be:

* Call the method "bar" on the local variable "foo".

* Call the top-level function "bar" imported with prefix "foo".

* Call the named constructor "bar" on the class "foo".

Because dartfmt does not do symbol or import resolution (for many reasons), it can't distinguish those cases. In practice, all that means is that we don't have th eability to have different formatting rules for those different cases and they are all formatted uniformly. So far, that seems to be an acceptable limitation (though it would be nice if we could treat import prefixes as stickier).


I actually really like a "format on save; don't format without AST" combination.

It's really nice knowing as soon as you hit ctrl+s whether you've just written in a syntax error or not.


To see this concept taken to its logical conclusion, check out Casey Muratori write C code with the editor configured to reformat on every keystroke: https://www.youtube.com/watch?v=S3JutszP9fg&t=11m37s

I'm not sure I'd ever go this far, but that might just be years of conditioning by editors that format code much more slowly than the fork of 4coder that he's using in the video.


The idea of autoformatted code editing is intriguing.

I've been exploring that (plus instant preview/eval, but that's another story) in a home-made language with a basic editor. For smaller files/modules it's totally feasible to do on every key press, fast enough to be fairly seamless.

It took a bit of getting used to, but there's something catchy about the experience, like molding clay - if the clay was made of a "smart material" that re-formed itself to an optimal shape.

The immediacy of working with self-reshaping code (and live reload too) reduces mental friction, so I can forget about formatting or compilation altogether. I hope more languages make it a part of the developer experience, IDE, etc.


I feel like it's a subset of a much larger, more general topic of how latency impacts computer interactions. People have long tried to emulate real-world activities in computer programs, but not much attention has been paid to making those programs instantaneous.

When you mold clay, there is no lag - the tactile and visual feedback is immediate. Same with handwriting, painting, or petting a dog. On the other hand, even in this current world of multi-core 4Ghz+ computers most software fails to provide immediate feedback.

100ms, I believe, is the magic number. The maximum delay between cause and effect that still registers as "instantaneous" by the human brain. Yet, surprisingly, it's still a rarely reached target. FPS video game studios might be the only major industry that consistently cares about and delivers on this metric. (Edit: "major" was an intentionally vague word to use here. Obviously, all kinds of embedded software projects have to deal with real-time or soft-real-time requirements.)

If reformatting the whole current code file took 1ms, why wouldn't you enable it? If compilation took 50ms, why not recompile on each new line? There's a magic latency barrier below which actions feel "free", so why not try and move as many of them as possible below that threshold?


> When you mold clay, there is no lag

Wonderful phrasing.

On the importance of immediate feedback in the creative process, it reminds me of audio/music production - in particular, MIDI instruments. The latency between keypress and sound reaching the ear should be as close to zero as possible.

Recently I read a discussion about remote collaborative music performance. From what I understood, network latency is not nearly low enough to achieve it. There's also the physical limit of the speed of light.

10ms was mentioned as acceptable for musicians - but then someone said, even when musicians are in the same room, there can already be too much latency if it's a big room and they're distant from each other, making it difficult to play together.

> why not recompile on each new line?

This is becoming the standard in web development, with incremental compilation (and "hot reload" on the client) of only the changed code/module. I saw that it's getting applied in building mobile apps as well, to get closer to the ideal of instant feedback.

In thinking of the ideal developer experience, I often come back to Bret Victor's work, Learnable Programming. http://worrydream.com/#!/LearnableProgramming


> 10ms was mentioned as acceptable for musicians - but then someone said, even when musicians are in the same room, there can already be too much latency if it's a big room and they're distant from each other, making it difficult to play together.

Sound only travels ~3.4 meters in 10 ms, so it matches up that a large venue easily exceeds that quite a bit.


This is how I have Emacs set up, but when I use VS Code I really appreciate that syntax errors get the red squiggly underlines.


I know that I’ve missed a delimiter when clang-format arbitrarily adds a level of indentation to the latter half of the file.


Yes, it's like people try to make more science of it than it really is...


When does one need to format syntactically invalid code? What would that even mean?


All the time:

1. While you are in the middle of writing your file the first time.

2. When your IDE/editor doesn't see all the includes/imports

3. In languages where syntactic validity depends on semantics, beyond merely known and unknown identifiers.

4. When you make a syntax errors. Which is all the time, basically, until you've gotten your program or library to compile.

5. When you choose to format just a small segment of code, not the whole file.


The idea is that your code could be formatted for you as you type instead of having to wait until you reach a point where your code has no errors and hitting save.


IDEs like IntelliJ do this all the time. For instance, formatting a block of code when it's pasted into a file with syntax errors elsewhere.


If your language can't even be formatted properly if it refers to code elsewhere, I have to wonder what the fuck compilers do: Can't parse the file until we've loaded some unknown number of other files, the names of which we can't determine until we parse the file, which we can't do until we've loaded an unknown number of... C++ by way of MC Escher, sounds like.

Common Lisp has big, meaty, possibly-undecidable macros which compilers need to expand, but Common Lisp formatters are piss-easy because the basic grammar doesn't change unless you go so wild you invent Dylan or something.

So, yes: In the general case of Really Advanced Programming Languages, the amount of information you need to format is the same as the amount of information you need to compile, but that's fine, because it means you can flip it around and see that if you can't format cleanly you're never going to compile anyway, so what's the point of formatting code you can't even use?


> If your language can't even be formatted properly if it refers to code elsewhere, I have to wonder what the fuck compilers do

They are split into passes. With C macros it's totally fine to have a file that says

    #define LBRACE {
And then in a different file the compiler necessarily have to access the first file in order to find the opening brace. It's not even invalid for an include file to contain unbalanced braces with the expectation that whatever code that includes it fixes that.


> If your language can't even be formatted properly if it refers to code elsewhere,

Let me introduce you to a certain obscure language named C++ ...



The dart formatter is awesome - kudos to Bob.

All languages should come with a highly opinionated formatter. It puts an end to useless style arguments.


It doesn't really put an end to them, it just turns them from N×N arguments between each pair of programmers to N×1 where everyone argues with me, the formatter maintainer. But, yes, that's a significant performance win (for everyone except me...). :)


I hear the author of Black (an opinionated Python formatter) was at a conference and gave a presentation that when something like this:

> I'm the author of Black. Questions?

Questions, of course, filled up the entire rest of the time slot.


From my perspective at least, you've always handled it with grace. Thanks for your work Bob.


You're welcome! :)


Every time I use one of these it ends up doing ridiculous things to the formatting, like splitting what should really be 1 line over 4 or 5. Which completely destroys whole-file readability/scanability. Does nobody else find it really hard to read codebases subject to these formatters?


The formatter that set the expectation of standard formatting was that for Go (gofmt), which applies a very light touch: it never removes newlines and rarely inserts them, so you still have latitude to arrange the code in a way that respects its meaning.

By contrast, the Java formatter we use at Google is an example of the totalitarian instincts unleashed by this new technology: it insists on reflowing everything, including the contents of comments so that they always use the full 100 columns. Madness.


There are always corner cases. For example, clang-format at $day_job doesn't understand semantically that I want my 4x4 matrix constant to be formatted as 4 rows with 4 numbers each; putting all 16 numbers in a single long array will get a lower score.

But as a whole, I found it easier to navigate the monolithic code base when most of it has been auto-formatted, exceptions aside. When applying a formatter to existing code, I usually do that in a separate commit and then follow it with any hand-rewritten lines afterwards (in case either the formatter or my changes afterwards accidentally regress something).

And once the formatter is enabled by default, it's usually easy enough to come up with a set of smaller statements that auto-format well to replace a longer statement that auto-formats poorly.


> For example, clang-format at $day_job doesn't understand semantically that I want my 4x4 matrix constant to be formatted as 4 rows with 4 numbers each

Comments can be a low-key forced line break:

    int matrix[16] = {
        0x0, 0x1, 0x2, 0x3, //
        0x4, 0x5, 0x6, 0x7, //
        0x8, 0x9, 0xA, 0xB, //
        0xC, 0xD, 0xE, 0xF, //
    };


Usually in cases like these I prefer to add the magic “formatter off” comment around the block: it’s self-documenting.


Don't just blindly run a formatter.

I've developed a habit of refactoring long lines early (e.g. breaking large expressions and assigning to intermediary variables) to prevent the formatter from adding hard-to-read mid-expression line breaks


Interesting.. not having to worry about where to break long lines and letting the formatter (in my case Prettier) figure it all out for me has been one of the most wonderful things in recent memory that has happened to my programming experience. I can just write awful looking code at the speed of thought and know it will turn out fine when I hit save.

Having to switch back to a non-opinionated non-line-breaking formatter like golang's makes me very sad.


Prettier outputs fairly decent formatting for code that is already reasonable to begin with. It's not so great when you have things like template strings with large expressions in the interpolation slots.


This is basically the approach I take, write towards the standards of the auto formatter as much as possible and it'll usually handle the rest.


It depends for me on the formatter. For me, blank lines are often part of how I express meaning. If the formatter is good at respecting my chosen line breaks, fantastic. If not, it's pistols at dawn.


I didn't really understand this until I started using golang. Now I don't want to go back to other languages and have to deal with varying syntaxes and manually running a formatter.


What's a language that lacks such a facility? ClangFormat covers C/C++/Java/JavaScript/Objective-C/Protobuf/C#. gofmt obviously can do Go. buildifier will rewrite your Bazel BUILD files.


Those aren't highly opionated default formatters though. The JavaScript community seems to have settled on Prettier as the go-to formatter, but not everyone uses it, and even Prettier had to add a bunch of options for certain things that will always be highly personal. Stuff like trailing commas, tabs vs spaces (and how many spaces‽), line lengths, and semicolons.


Well, one part of trailing commas in Prettier is that different versions of JavaScript support them in different places. With go fmt, the formatter is updated in lockstep with the language.


ruby


ruby has rubocop, although it is more basic.


> It puts an end to useless style arguments.

I don't like auto code formatting in my workflow and I don't like "easy"-syntax languages like Go. I see it is of value for big treadmill-code-machines like google. They want every codemonkey to be easily replaceable.

But for me code is also art and fun. I align my assignments and statements and I use different code-formattings according to the situation.

If I do all right I only write the code once but read it again and again for years. So I like to spend some time to format my code nicely, whatever that means :-)

Writing a code formatter is a fun challenge and sometimes its useful to have a good formatter but please not in my day-to-day job.


> They want every codemonkey to be easily replaceable.

Code monkeys at Google are very expensive to replace. I'm sure Google would change that if they could, but nobody knows how. Code formatters are definitely not part of any stratagem to accomplish that goal. On the contrary, they want those monkeys spending their time producing value, not arguing over where to put the closing brace.

As long as you never work with anyone who disagrees with your style, or whose style you disagree with, an opinionated auto-formatter is not that useful. For the rest of us, it helps a lot. I remember early in my career before Google settled on clang-format, gofmt, etc., how every fourth person had a different opinion on how to indent this or that thing, and how occasionally it came up in code reviews. Or teams had their own team-specific style, so if you needed to fix something in someone else's code you had to break all your habits . . . god, it was horrible. How did we live like that.

Now I just run the code fix program before sending for review, and the only points to discuss are naming and software design, not the placement of spaces and newlines. I'm still so thankful for it even half a decade after these concepts became widespread at the company. Fates bless the people who work on code formatters.


> As long as you never work with anyone who disagrees with your style

> [...] and the only points to discuss are naming and software design,not the placement of spaces and newlines

lol. In my 15 years working with different people in different companies I think I never had a serious argument about code formatting or a serious discussion about placement of spaces and newlines (at most small talk about those issues on coffee & cigarette breaks).

Once you are used to jump between different projects and languages you just accept to use the style which is given.


oh, I have. Full on per-project style guides, determined in multiple core team meetings at kick-off.

And these people were excellent engineers...or at least they were when they weren't focused on minutiae.

Even with automatic code formatters, I've gotten into style discussions or been witness to them. At Google! In code review!


> They want every codemonkey to be easily replaceable.

Most folk/bosses who are risk averse actually want their codemonkeys to be easily replaceable (although admittedly, they might not realise it!). Nobody likes/wants a system with a single-point of failure.

If I'm working for a client, and get hit by a bus tomorrow, it is rather reassuring for the client if they know the work they've paid for so far can be fairly easily picked up and completed/maintained by another hire.

I deliberately make it part of my strategy to bring a client's attention to this kinda stuff, in an ongoing fashion. It means I can happily spend time on e.g. documentation and/or other technical debt, because my client understands the values of these things.

My actions to make myself (more) replaceable lowers my client's risks. (A good client understands this, and might acknowledge and support such efforts with fatter pay checks: someone who helps lower their risks can be a particularly valuable asset)

I guess it depends on one's attitude and understanding towards 'being replaceable', and how one pitches that understanding.


> Most folk/bosses who are risk averse actually want their codemonkeys to be easily replaceable

Absolutely, everything else wouldn't be very smart. Maybe my comment came off a bit more negative than I intended to.

I don't blame google if this mindset is reflected in their tools but I wouldn't like to have to work with that.


You can dislike auto code formatting without disparaging people.


??

You mean the word "codemonkey"? That was meant from the employers point of view. I'm a codemonkey myself...


I’m not a 1980s network censor chasing after “bad words” here. You painted a nasty picture of a company and its relationship with its employees. You can express yourself without attacks.


I don't see it that way. I think I just painted a picture of reality. I didn't even mean to say this behavior is bad.


Except for when they make the code harder to read, as black does with a decent chunk of pandas idioms.


I use prettier religiously. Works on js, typescript, jsx, css, json, html, sass, yaml and a whole bunch of others.

The vscode prettier plugin brings format on save and it makes coding a joy.

I think what prettier does is much simpler than dart. It tries to put things on a single line if it’s within line length, if it can’t then it uses multi line where each item is in its own line with ending bracket at same indentation level. For arrays, functions, object definitions, jsx etc. I’ve written some automated code generators that output pretty code and that’s how I tend to approach it too. You either get compact or expanded version and that happens recursively. There is some backtracking but there’s early exits when compact strategy exceeds line length so it’s pretty fast.

After reading the blog article it put me off that starting and closing brackets were not at same indent level. I know lispers do this but I’m not used to it so it hurts my brain.

One of my interview questions is to pretty print json files to a certain line length.


Shameless plug that if you're bored and want a really tough, fun challenge, the prettier project has many open tickets: https://github.com/prettier/prettier/issues?q=is%3Aopen+is%3...

Only two are tagged "difficulty:easy", likely for good reason – when I contributed a few fixes a while back, I found it much more challenging than implementing a language syntax, an unrelated project which I was also doing at the same time.

It's really fun, though, and watching colleagues code and use the features of prettier that you built is a nice feeling – "I helped write that code for you!"


If you enjoyed reading this, you might want to check out CodeBuff:

https://github.com/antlr/codebuff

CodeBuff is based on a paper by Terrence Parr and Juergen Vinju, “Towards a Universal Code Formatter through Machine Learning”:

https://arxiv.org/abs/1606.08866


I can't help myself, the required semicolon at the end of each line in Dart programs keep hurting after writing for many years in Python, JS, Kotlin and Swift. I grew up semicolon-less with QBasic and learned semicolon since Delphi 1.0 and Java 1.0 but today, wanting as few noise on the screen as possible, they are hurting in my eyes.


I'm the author of the post and work on the Dart language. I spent a bunch of time last year trying to figure out a good way to make semicolons optional in Dart without breaking a lot of stuff.

Because of a handful of syntax quirks particular to Dart, it's really difficult. Most other languages that have optional semicolons were designed from the start to make them optional. Dart was not. On top of that it:

* Uses C-style variable declarations which means there's no keyword to indicate the start of a variable declaration.

* Likewise uses C syntax for function and even local function declarations. Again no keyword marking the beginning of a function.

* Has complex syntax for things like function types which means a type annotation can be arbitarily long and may split over multiple lines.

* Has lots of "contextual keywords" that are not true reserved words but behave like them only in certain contexts.

* Has a rich syntax that overloads a lot of tokens to mean different things in different contexts. "{}" can be an empty map, empty set, empty block, empty function body, or empty lambda body.

All of that makes optional semicolons just really difficult. I haven't given up entirely. We have a thing now called "language versioning" that lets different Dart libraries target different specific versions of the language. That lets us evolve the language in nominally "breaking" ways without actually breaking existing code. (For example, this is how we will migrate the world to non-nullable types.)

That may give us a way to make other grammar simplifications that would then also let us make semicolons optional. But syntax changes like this are always difficult. People have very strong opinions about you changing their existing syntax under them, and the value proposition in return isn't super compelling.

There's an old saying, the best time to plant a tree is twenty years ago. The second best time is today.

Language syntax isn't always like that because the migration cost once you have a lot of extant code (and users) can just be too painful. The best time to make semicolons optional was twenty years ago. The next best time may be today, but it may just be never.


Could Dart use this solution, which Python uses and Oil shell now uses?

1. Newlines are significant tokens and they behave like semicolons (this is how shell is)

2. Within () [] {}, newlines are suppressed. The lexer has to recognize the matching pairs. (Oil added an expression language to shell which does this)

This means that

    var x = 1 +
            2
is not valid because there's a statement terminator after +, but

    var x = f(1 +
              2)
is valid. I feel like this captures the fast majority of cases where you want to break a statement over lines.

I don't know Dart, but since it seems to have JavaScript-like syntax, I don't see why that wouldn't work? JavaScript has some weird rules, the common gotcha seems to be is:

    return 
    {
    }  // oops this does not return a dictionary like I thought

With those rules you would write:

    return {
    }
and hopefully the former is a syntax error, not a mis-parsed program.

----

edit: OK my guess is that Dart probably doesn't want to enforce a brace style like Go (and Oil) do?

That is in Go only

    if (x) {
    }
is valid, rather than

    if (x)
    {
    }
which is pretty much like the 'return' example. I have been using that brace style for so long that I forgot other people don't :)


> Within () [] {}, newlines are suppressed.

Unfortunately, no. That doesn't work. Consider:

    var map = {
      key: long
        - value
    }

    // block
    {
      long
      - value
    }
The newlines should be ignored inside the map literal, but not inside the block. A correct semicolon insertion for this should be:

    var map = {
      key: long // <-- no semicolon here
        - value
    };

    // block
    {
      long; // <-- but there is one here
      - value;
    }
But the lexer doesn't know when curlies are maps and when they are blocks. The parser does, which means you could theoretically define the semicolon insertion rules in the grammar, which is what we'd likely have to do, but that makes it much more complex.


Ah OK, that makes sense. So Python doesn't have that issue because it doesn't overload braces for hashes and blocks.

Oil does overload them, but it has a separate lexing mode for statements and expressions. It switches when it sees say the 'var' keyword, so the right of var x = {a: 3} is lexed as an expression rather than a statement. This sort of fell out of compatibility constraints with shell, but ended up being pretty convenient and powerful.

The lexing mode relies pretty strongly on having the "first word", e.g. 'var' or 'func', Pascal-style. So yeah I can see how it would be more complex with Java-style syntax.


> So Python doesn't have that issue because it doesn't overload braces for hashes and blocks.

Yes, and also because lambdas in Python can only have expression bodies, not statements. That means you can never have a statement nested inside an expression. This is important because Python's rule to ignore all newlines between parentheses would fall apart if you could stuff a statement-bodied function inside a parenthesized expression.


Yes the lambda issue is something I ran into for Oil. Although this part of the language is deferred, the parser is implemented:

    # Ruby/Rust style lambda, with the Python constraint that the body can only be an expression
    var inc = |x| x + 1

    # NOT possible in Oil because | isn't a distinct enough token to change the lexer mode
    var inc = |x| {
      return x + 1
    }

    # This is possible but I didn't implement it.
    # "func" can change the lexer mode so { is known to start a block.
    # In other words, { and } each have two distinct token IDs.

    var inc = func(x) {
      return x + 1 
    }
I think this is a decent tradeoff, but it's indeed tricky and something I probably spent too much time thinking about ... the perils of "familiar" syntax :-/


> Could Dart use this solution, which Python uses and Oil shell now uses?

This works well for Python, because the following causes IndentationError:

    x = 1
      + 2
In a language without significant indentation, would this be two statements, an assignment and a separate unary `+` expression? Is that OK?

If not, one possible solution would be to disallow arbitrary expression statements. AFAIK this is what Go does.

> JavaScript has some weird rules, the common gotcha seems to be `return \n {...}`

Disallowing expression statements also fixes this particular example, but it doesn't solve the `return` problem in general. It's not practical to disallow standalone function calls, so `return \n f()` still has the same issue.

Lua solves this by enforcing that a `return` must be the last statement in a block.


> It's not practical to disallow standalone function calls

If your language structurally differentiated procedures (i.e., void-returning functions) from (other) functions, it would be practical to disallow standalone functions (not procedures), though you might have a modifier to explicitly discard the result of a function that allows it to be standalone.


> you might have a modifier to explicitly discard the result of a function that allows it to be standalone.

ocaml does this and i find it very annoying, even though i know it's theoretically saving me from accidentally not consuming a return value.


There are benefits to unambiguously marking the end of each statement with a semicolon rather than just using a newline. Is there a good algorithm for determining whether or not a newline actually represents the end of a statement?

JavaScript gets this wrong in both directions, sometimes unexpectedly ending a statement (e.g. in `return \n 0`) and sometimes unexpectedly not ending one (e.g. when a new line begins with an open-parenthesis).

Python's method (newline always ends a statement unless it's inside (), [] or {}) is straightforward, but makes the language syntax strictly line-based. This matches Python's significant indentation, but can it work in a language without it?

Another option I've seen is for all newlines to end statements, unless they follow a token that cannot end a statement. Unfortunately that means that the following two assignments have different behavior:

    foo = bar +
          baz

    foo = bar
        + baz
The first is a sum, the second only assigns `bar`, followed by a standalone unary `+` expression. Go works this way, but considers the second form to be an error ("+baz evaluated but not used"). Python considers both of these to be errors.


> There are benefits to...

True, and these kinds of bikeshedding discussions about tiny details are infuriating because they're so irrelevant. I wish we could all rise above them to discuss the next level of expressibility.

We let the little stuff suck up so much of or time.


Semicolons are just extra signal to humans and compilers, they shouldn’t hurt eyes like the use of fixed width fonts do.


It’s not that they hurt the eyes for me as much as they introduce tons of micro delays and extra keystrokes when editing. For instance, any “X to end of line” (cut, copy, etc), you now need to think about what the destination is and whether you should include the semi, change the selection, go back and delete the semi later, etc.

It sounds slight, but it really builds up over time. Once you get used to “powerediting” without them it’s hard to go back.


End of line inference algorithms are incredibly fragile, both by the compiler and by the human reader, having a semi-colon can make things clearer in the absence of another secondary signal (like indentation). There are so many type errors in scala or typescript that can be solved by adding a semicolon somewhere to eliminate some idiotic ambiguity.


Only-EOL like python is fine and always-semicolon like C is fine. Sometimes-optional but sometimes-required semicolon like Javascript is bad in my experience. Javascripts C-like syntax doesn't cope well with missing semicolons


> End of line inference algorithms are incredibly fragile

I am working on a language right now with optional semicolons (that is, newlines terminate statements unless there is a semicolon, which does the same early) and it’s not really that hard. The only change is that instead of looking for a newline character for statement termination I often also need to look for a semicolon.


> newlines terminate statements unless there is a semicolon, which does the same early

Since I'm writing a toy language with optional semicolons, I'm curious - how are you solving ambiguities in multi-line expressions? For example:

  1
  -1
Does your language not allow multi-line expressions, treating the above as two statements? Or, perhaps they must be explicitly an expression, like:

  (1
  -1)
I agree with you on another comment, that JavaScript's automatic semicolon insertion is complicated - so I'd like know if there's a sensible solution for languages without semicolons.


In this language, there's very few multi-line expressions. However, for your specific case: yes, I would consider that two statements.

  1 -
  1
would be one: putting a binary operator by itself in one line would cause the parser to keep looking.


Ah, I see, so new lines are like invisible operators (with low precedence) that ends a statement unless preceded by another operator. Right, that makes sense.


Python behaves "sensibly" with regard to this. Semicolons are statement breaks. Endlines are statement breaks, unless theres some reason for them not to be, such as being in a open brace state.


I think the issue is more when the new line doesn’t terminate the statement. Like in JS:

const a = doThing()

(() => throw Error)


JavaScript for some reason chose a very strange way to have its newlines work; a number of other languages have it figured out: Swift, for example.


Why not infer the end of a line when finding the end of a line? ;)

This is what most BASICs and assemblers did.

(I think QBasic had a line continuation character? - or maybe that came with Visual BASIC. I don't remember. But, anyway, if you wanted, it was possible to continue a source line past a line break, it just wasn't the default.)


I’m all for that option also, and we can always add an anti semi colon to write a multi line statement.


I love that the author's final "gotcha" isn't even domain specific. It's a hard limit on recursion, just like all modern debuggers and most other formatting tools. It's crazy to me that they even tried to support formatting auto-generated code snippets, I know I personally would've created a much more naive/conservative bail out.

> If the line splitter tries, like, 5,000 solutions and still hasn’t found a winner yet, it just picks the best it found so far and bails.


I wrote a simple formatter for outputting json values. It respects maximum line length.

https://gist.github.com/nojvek/875d8cad1005da0b95da2840bbc89...

It implements some of the ideas from https://blog.vjeux.com/2017/javascript/anatomy-of-a-javascri...

Could be re-used for other languages very easily. The core is you have nested group chunks with (indent breakpoint, dedent breakpoint, separator breakpoint, more text chunks and child group chunks).

when formatting a single line, indent and dedunt breakpoints do nothing, but in multi line format, they specify how text will be laid out.

The algorithm is simple, try to format in single line if it fits within maxLineLength, if not break into multiple lines, and do that recursively.


Footnote 4 [0] discusses the struggles of formatting comments. This was interesting to me because back in December I wrote most of a code formatter for Lox, the toy language from Crafting Interpreters [1], but I never finished it because I got caught up on how exactly to handle and represent comments, and then I got distracted by other projects. Does anyone else have thoughts on how to handle comments in a code formatter?

Edit: Well, apparently this blog post (from 2015) is by the author of Crafting Interpreters. That makes this more interesting to me.

[0]: http://journal.stuffwithstuff.com/2015/09/08/the-hardest-pro...

[1]: https://craftinginterpreters.com


I'm surprised it was necessary for the author to design and write Dart's formatter essentially from scratch. Is there not prior art - formatters for other languages that enforce a maximum line length?

Also, I'm surprised that enforcing a maximum line length was considered so important that it was worth all this work. Even though I personally like to limit line lengths (and I know plenty of people who don't), I'm not sure enforcing such limits are worth this amount of complexity in the formatter, nor the sheer amount of labor required to implement and maintain that complexity. I think the sweet spot is probably Go's `gofmt` - automatic formatting that allows unlimited line lengths (possibly emitting a warning if formatted code exceeds a desired length) seems to provide almost all of the benefits of a formatter while being a much simpler solution.


To some degree he’s just rediscovered branch and bound, the classic way to solve NP hard problems like this. It’s a neat little trick.

The other way of doing this would just be to formulate it as some kind of ILP and plug it into somebody else’s solver, quicker but there’s no way it’d work as well.


I never grasped why this is so hard, you don’t need an exponential blowup with some common sense heuristics. Just look at the AST and put the breaks as high up in the tree as possible.

For example: If I have a function application with several arguments one of which is a slightly longer lambda expression, don’t even try to break the lambda before putting the arguments (nicely aligned) on their own line.

I’m a big fan of the ergonomics of prettier, but some of its output is objectively ugly and seems the result of over-complicating their search space.


10 years ago I though that writing S-expression pretty printer should be significantly easier than what the complexity of typical CL/Scheme pretty printer (ie. Waters' XP as described in AIM-1102) implies. Well that is not true ;)

And since trying to do that I almost involuntarily format any code I write by rigorously applying the pretty printing algorithm of XP by combination of manual line breaks and Emacs' newline-and-indent.


How do you indent block comments? Do they stay untouched? Do you indent them the same as the previous line? If so, what to do if there are already indents in the comment? Add the additional indent?

I wrote a simple but IMHO really nice XML prettifier a while ago and remember that I finally gave up on block comments, because I could not find a way to handle all cases in a satisfying manner. So I decided to leave the indenting untouched.


> How do you indent block comments? Do they stay untouched?

Block comments are a little tricky and there are a handful of heuristics to handle them. Generally they stick to the preceding token. Block comments are rarely used in Dart and when they are, it's often just a small marker to describe the previous element like:

    foo(null /* missing */, blah);
So adhering it to the previous token usually does the right thing.

> If so, what to do if there are already indents in the comment?

It doesn't touch the body of the comment itself. There's too many ways for that to go wrong given all of the various things a human might choose to stuff in there: ASCII art, tables, prose, etc.


The first program I ever wrote as a professional in the early 80's was a Jovial (before Ada what the Air Force used) prettyprinter written in Fortran. Hard to imagine now how little I knew yet still managed to get it to work (required for Air Force documentation) in about 6 months. I wish I still had the source code as that combo is nuts.


Somewhere on my todo list is to make a systemverilog autoformatter using the parser from slang [1].

[1]: https://github.com/MikePopoloski/slang



Why make a language and make a linter? Make the linter part of the language.

In my opinion, there's no point in encoding certain concepts into a linter as best practice, if it's best practice encode it as law into the syntax of the language.


I really don't like this guys writing style.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: