Why Compilers Don’t Autocorrect “Obvious” Parse Errors (chelseatroy.com)
64 points by skilled on April 10, 2022 | 101 comments



The risk analysis argument makes sense from a language-usage perspective. I think there's a language design argument we can make as well:

1) if your language papers over a syntax error, then that error is effectively just an alternative syntax

2) alternative syntaxes make a language more complex

3) complex languages take more work to implement and more work to learn


This is what happened with HTML. At first the syntax was pretty simple, but the error-correcting parsing meant that invalid syntax became widespread. To ensure compatibility between implementations, the spec ended up having to specify the exact parsing of every form of incorrect HTML as well. Now the parsing algorithm of HTML is incredibly complex: https://html.spec.whatwg.org/multipage/parsing.html


PHP has entered the chat.


> Because, as smart as we compiler designers think we are, you, dear programmer, know your program better than we do.

When you compile a file successfully, make an edit, then compilation fails, are there any compilers/IDEs that compare the before-and-after of the file to create better error messages? The compiler would have a lot of extra information this way because files usually change gradually and not all at once.

I'm thinking of cases where you're refactoring some code but miss out a bracket: the compiler says the missing bracket could be anywhere down at the bottom of the whole file, but anyone watching intuitively knows which block of code the missing bracket likely falls into, usually localised around what you just edited.
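
Roughly the kind of thing I have in mind, as a sketch (Python's difflib here; the 50-line proximity window and the message format are made-up placeholders, not any real compiler's behaviour):

    import difflib

    def changed_line_ranges(old_source, new_source):
        """Line ranges in new_source (1-based) that differ from old_source."""
        matcher = difflib.SequenceMatcher(None, old_source.splitlines(), new_source.splitlines())
        return [(j1 + 1, j2) for tag, _, _, j1, j2 in matcher.get_opcodes() if tag != "equal"]

    def localized_hint(error_line, old_source, new_source):
        """Point the error message at the recently edited region when one is nearby."""
        nearby = [r for r in changed_line_ranges(old_source, new_source)
                  if abs(error_line - r[0]) <= 50]  # hypothetical proximity window
        if nearby:
            return (f"parse error reported at line {error_line}; "
                    f"recently edited lines {nearby} are the most likely culprits")
        return f"parse error reported at line {error_line}"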

> elsif say_goodbye nd we_like_this_person:

> If the compiler tried to automatically add a colon, I’d have two colons and the code is even wronger.

Couldn't a smarter compiler guess that because "we_like_this_person" and "say_goodbye" are defined variables and there are no variables that look similar to "nd", that "nd" should probably be the "and" keyword?
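
A cheap version of that heuristic is easy to sketch (difflib as a stand-in for a proper edit-distance check; the keyword set and cutoff are assumptions, not the article's language):

    import difflib

    KEYWORDS = {"and", "or", "not", "if", "elsif", "else"}  # assumed keyword set

    def suggest_replacement(token, variables_in_scope):
        """Suggest the closest keyword or in-scope variable for an unknown token."""
        candidates = KEYWORDS | variables_in_scope
        matches = difflib.get_close_matches(token, candidates, n=1, cutoff=0.6)
        return matches[0] if matches else None

    # suggest_replacement("nd", {"say_goodbye", "we_like_this_person"}) -> "and"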

I'm surprised by how unhelpful error messages still are for most tools. I'm curious how much this is because it's a very hard problem rather than it's a neglected area that developers accept as normal. I heard Elm is meant to be good here (where strong static typing allows for certain kinds of hints): https://elm-lang.org/news/compiler-errors-for-humans


> because files usually change gradually and not all at once.

As a counterexample, suppose you are checking out a new version of a file, or a new branch with many changes across many files. Identifying this usage would require the compiler to be aware of the version control system, and still wouldn't correctly identify that the version sent from $COWORKER via email for some weird reason isn't a gradual change.

For me personally, debugging is difficult enough without needing to worry that the compiler is going to maintain state across multiple runs. If I see an error message that is different at all, I assume that means I'm triggering a different failure mode, and debug accordingly.

Edit: That said, the Rust compiler is tremendous with error messages, without relying on time-dependent state. If a variable is misspelled, it will look for similarly named variables that are in scope, and ask if you meant one of them. But this behavior is still consistent for a given file and compiler version.


> For me personally, debugging is difficult enough without needing to worry that the compiler is going to maintain state across multiple runs.

Ooh, the idea makes me shudder.

I remember looking at a project's makefile which called a custom build script where the README said to run the buildscript twice: once to generate some state, and a second time to compile stuff using that state.

Without any comments provided, the makefile called the custom buildscript three times in a row.

I can't even imagine the superstition and cargo culting that would arise from an IDE "helping out" by analyzing who has changed what, when, and in what order they changed it.

Please paste a new empty function named "momo" here before doing a release. Also make sure your blinds are closed before compiling.


Couldn't you just gate the comparison with compiler flags and define corresponding make targets? Flags could also identify the VCS and how the old file should be obtained.


> unhelpful error messages still are for most tools.

One thing that winds me up is error messages that tell you that some string can't be parsed or a file can't be read but they don't show you the string or the file path.

This is especially irritating for the actual end user because they generally do not have the opportunity, or knowledge, to run the program in a debugger or to examine the source code.


Seems relatively easy to implement by having your "make" step commit to git, then on a compilation error show the diff along with the error.

Then simply rebase out the intermediate steps before pushing.


> compare the before-and-after of the file to create better error messages?

There was a PhD thesis I read in the 90s that included a version of this idea. I forget the specifics.


I think you are referring to Tim A. Wagner's History-Sensitive Error Recovery. http://harmonia.cs.berkeley.edu/papers/twagner-er.pdf


That was it! I thought I could remember 'Tim' but my mental completions were all wrong. Thanks.


This is the difference between building a parser for a compiler and for an IDE.

For a compiler, you want to stop at the first error. The parser can also emit an intermediate representation as it goes, so that what it is processing is not necessarily serializable to the original code. This makes it difficult to use as a data model for tools like IDEs.

For an IDE, you want to process the entire file, recovering from errors as you go. This is so that the IDE can keep things like function resolution working without turning the entire file red as the user is typing code, while ideally only updating the references that haven't changed. It also allows the IDE to offer different fixes and auto-complete functionality.

This makes it difficult to share parser logic between the two.


It's only difficult to reuse logic from a batch compiler in a responsive compiler. It's trivial to derive a batch compiler from a responsive one.

You do not actually want to stop at the first error in either case. You want to accumulate all the errors at a given phase of the compilation and halt. Sometimes that allows other phases to progress (for example, you do not want an error in one compilation unit to halt the compilation of any other units until linking in an incremental compiler).

You do actually want to reuse IRs in the IDE, otherwise it can be extremely difficult to get certain things correct (and some are next to impossible, like macro expansion/syntax extensions, decompilation of libraries, etc).

Unification of the reference compiler and IDE backend is extremely desirable, in my opinion. Very few languages take that tack (C#/.NET being a major exception) but not because it's a bad design - it's because it's hard. Writing a lexer and parser is easy if you don't care about edits and updates. And there are very few parser generators that do make that possible (tree sitter being the major exception). And once you have a working lexer/parser it is difficult to replace it in your compiler, so few language devs ever take that approach.

It's essentially a massive engineering effort in something that is rather boring for low payoff in the early life of a language implementation, the RoI is only obvious much later when usage scales up. So it's unsurprising many languages do not do it early on, and like objects, most languages die young.


> makes it difficult to share parser logic between the two

It’s difficult to reuse traditional compiler logic in an IDE, but there are good examples showing that the reverse isn’t true. IDE validation of language semantics is a strictly more complex problem, but if you start by solving that, it’s not as hard to add a compiler backend. A compiler’s job is to either take correct code and translate it to its compiled form or take incorrect code and report errors. There’s no reason an IDE-focused parser/compiler can’t do both.

IIRC, Microsoft talked publicly about how they built the C# compiler as IDE-first and found that it simplified things greatly. And I think there has been substantive discussions within the Rust community about bringing parts of rust-analyzer into the official compiler whereas the RLS approach of reusing compiler APIs wasn’t able to provide a reasonable IDE experience.


My perspective comes from writing an XPath and XQuery lexer and parser for IntelliJ, which has its own lexer, parser, and AST APIs.

The XPath lexer and parser are designed to be overridden where needed to implement the XQuery lexer and parser.

The lexer itself has state as a stack-based lexer in order to tokenize the different structures (string literals, comments, embedded XML) correctly. A compiler could use the parse state as the context to drive the tokenizer without needing a state/stack-based lexer.

The lexer also treats keyword tokens as an identifier type as keywords can be used as identifiers. This is not necessary in a compiler as it knows when it is reading/expects a keyword.

My parser handles the different versions of XPath/XQuery, the different extensions, and vendor-specific extensions all in a unified lexer/parser. A compiler could ignore the bits it does not support and simplify some of the logic.

My QName parser is very complex due to providing error recovery and reporting for things like spaces, etc. -- Other parsers (e.g. Saxon) treat the QName as a single token.

I'm also generating a full AST with single nodes removed, e.g.:

    XPath
       InstanceofExpr
          IntegerLiteral       "5"
          XmlNCName            "instance"
          XmlNCName            "of"
          SequenceType
             AtomicOrUnionType
                QName
                   XmlNCName   "xs"
                   Token       ":"
                   XmlNCName   "string"
             Token             "?"
I'm traversing this AST to do things like variable and namespace resolution. For the modules, I'm using the IDE's mechanisms to search the project files. -- In a compiler, these would be collated and built as the file is parsed, which does not work with incremental/partial parsing.

I'm getting to the stage where I can evaluate several static programs, due to needing to implement IDE features and provide static analysis.


This isn't necessarily correct. Many modern compilers (e.g., C# and F# compilers) will do error recovery and keep processing as far as they can go, accumulating errors in the process. These same compilers are not different from that used in the IDE - they are one and the same. And finally, modern compilers can also be tuned based on usage, such as enabling batch mode to optimize for speed in a single thread or using as many threads as are available to optimize for IDE scenarios.


Why don't you want to do the IDE-style parsing in a compiler?


From time to time, I see errors in IDE parsing. It's not a big deal there, but it would be in a compiler or interpreter.


What case would introduce a parsing error in an IDE that isn't the case in a compiler?


I figure it's because the desirable situation you describe in your other post doesn't hold: in order to satisfy the goals of the IDE, it attempts to go beyond where the compiler's parser would stop, as the compiler is more of a batch one than a responsive one, and sometimes the IDE gets it wrong. As you say, batch to responsive is the difficult way to go.

In addition, I suppose that there are people hard at work applying ML in tools to help understand incomplete code and mitigate the false positive problem of traditional static analysis. I can imagine probabilistic parsing being useful in this case, but not so much in compiling.


Bad language plugins in an IDE can show you this. Sometimes I'll be using a niche language with someone's side-project plugin that has some issues even when the code is correct, like when its file formatter can't parse the code and fails with an error even though it's valid code for the compiler.


If the plugins used the same parser as the compiler this wouldn't be an issue?


I was a little curious about this too. It's contrary to what I see in the Go and Rust compilers. My understanding was that it's good to have a go at parsing all input if possible so the end-user can batch fix mistakes, but it's unreasonable to expect error checking in post-parsing steps to occur if there are parse errors because the AST is almost certainly incomplete.


That can be desirable but there are a few challenges:

- The compiler code becomes more complicated, making correctness harder

- The compiler might become slower to run

- Introducing new languages features may become harder, again due to code complexity


A compiler will get plenty complicated without IDE scenarios, trust me on that one. Slowness is also never really a thing to worry about here, especially because usage patterns in an IDE vs. a batch process are so different. It's almost always the other way around: someone writes something that's completely fine for a batch process but tanks IDE performance.


> The compiler code becomes more complicated, making correctness harder

> Introducing new languages features may become harder, again due to code complexity

It'll be written for IDEs anyway. Might as well reuse if possible, right?


This was tried in the 80s with teaching Pascal compilers. The compiler would fix a problem, issue a warning, and continue. However, it would then continue on with what was left and issue strange and bewildering messages to the poor student.

Turbo Pascal at the time just stopped at the first problem, and the student could focus on addressing that one and only one issue at a time. Yes, it was a game of whack-a-mole with syntax errors, but at least it was a straightforward process for getting something to compile.


I think this is the main issue. Compilers have always been trying to do some kind of syntax error recovery, to be able to spot more than one syntax error at a time. However, these heuristics are unfortunately fragile. It often ends up with cascading syntax errors where you're better off ignoring the errors after the first one. Not to mention that many error recovery heuristics tend to skip over some statements, which means they can't be used to "autocorrect" the program.

A while back I looked at how several languages implement this and Pascal was actually one of the better ones. It is a very hard problem...
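
To make the fragility concrete, here is a toy sketch of panic-mode recovery (a made-up grammar of `name = number ;` statements, not any real compiler): on an error, skip ahead to the next `;` and resume. You get multiple real errors in one run, but a single missing semicolon silently swallows a following, perfectly valid statement, exactly the kind of "recovered" parse you would not want auto-applied.

    def parse_program(tokens):
        """Parse statements of the form  IDENT '=' NUMBER ';'  with panic-mode recovery."""
        errors, i = [], 0
        while i < len(tokens):
            stmt = tokens[i:i + 4]
            if (len(stmt) == 4 and stmt[0].isidentifier()
                    and stmt[1] == "=" and stmt[2].isdigit() and stmt[3] == ";"):
                i += 4
                continue
            errors.append(f"syntax error near token {i}: {tokens[i]!r}")
            # Panic mode: skip ahead to the next ';' and resume after it.
            while i < len(tokens) and tokens[i] != ";":
                i += 1
            i += 1
        return errors

    # Two genuinely separate mistakes are both reported in one run:
    print(parse_program(["x", "1", ";", "y", "=", "oops", ";"]))
    # A single missing ';' after "x = 1" swallows the valid "y = 2;" entirely:
    print(parse_program(["x", "=", "1", "y", "=", "2", ";", "z", "=", "3", ";"]))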


> Yes it was a game of whack-a-mole with syntax errors but at least it was a straight forward process to getting something to compile

It also helps that Turbo Pascal was an extremely fast compiler for its time. So you could fix one error, re-run the compiler, and get another error quickly.


> game of whack-a-mole with syntax errors

The first time I ever "wrote" a program was hand-copying one from PC Magazine. Knowing nothing about Pascal syntax or semantics, what you said describes that whole week of mine.


This is somewhat related to the "robustness principle" that was guiding internet development in the early days. It wasn't about programming languages, but about protocol data, but the issue is similar.

Yet it turned out that doing that introduces a lot of subtle security issues. Today many people came to the conclusion that the robustness principle was a mistake: https://www.ietf.org/archive/id/draft-iab-protocol-maintenan...


> [Javascript and to some degree ruby] will try with all their might to divine something runnable from what you wrote. How kind of them, right?

Maybe this is semantics, but a loose syntax is different than the language trying to automatically correct mistakes.

JavaScript has optional semi-colons and braces. The semicolons seem to fall into the autocorrecting category because you are supposed to use them. Optional braces are a language feature shared with C.

Ruby has optional parentheses on method calls, which is usually fine until you attempt to do `a(b(4))` as `a b 4`. It’s easy to hit a syntax error when omitting parentheses in code like that. But the fact that it will give you a syntax error when it hits an unclear structure means this is a (mis-)feature, rather than an attempt at guessing what you meant.


I really dislike the automatic semicolon insertion feature of JavaScript. (It is, in my opinion, one of the worst features of JavaScript.)

(A preprocessor could be used to fix it if wanted, I suppose, but then it must be preprocessed and converted)


I feel the same, the exclusions to ASI (such as on "return") turn it into a pain if you try to use it and the feature itself makes it a pain if you don't try to use it. The worst of both worlds in that regard.


I've always thought of ASI as more "statements are separated for you but if you need them to be separated in a special way you can add semicolons to manually control separation behavior" than a "you forgot semicolons, let me fix your code for you". Pretty much the same as the argument for optional braces being a language feature not an auto-correction.

That said when you look at either in terms of how they are implemented it'll seem like a correction feature. I think the real difference between auto-correction and optional syntax is simply whether or not the language spec designed it to be optional.


Semicolon insertion in Javascript is nothing like braces.

The grammar for e.g. an if statement is simple: *if (* expression *)* statement *else* statement. One particular value of statement is a block statement, which is where the braces come from. Nothing more, nothing less.

Inversely, the grammar specifically says that most statements (of types empty, expression, do-while, continue, break, return, throw, and debugger) must end with a semicolon, and ASI is explicitly described as a few cases where you're allowed to add an extra token to the token stream when the grammar refuses to accept the stream as-is.


That is a sensible way to think about it, and it would be great if the language worked that way, but unfortunately it does not. Statements in JavaScript are not separated by line breaks.

Here is an example to illustrate:

    console.log('a')
    (1 < 2) ? console.log('b') : console.log('c')
You might expect this to output 'a', then 'b'. However, it instead outputs 'a' and then throws an error like this:

    Uncaught TypeError: console.log(...) is not a function
...because a semicolon was not inserted at the end of the first line.


What about "statements are separated for you" implies line breaks split valid statements? The issue in your example is your first statement "console.log('a')(1 < 2)" tries to pass "true" to the return value of console.log triggering a runtime typeError. If you define an appropriate function for console.log (remember it's not part of JS) e.g.

    console.log = function(value){console.error(value);return function(value2){console.error(value2)}}
Your example runs just fine because there was never actually a SyntaxError anywhere in it to begin with, let alone a SyntaxError that could be fixed with a ; by ASI. Similarly, if I define console.log = 2, all of the above will throw a TypeError, but that also has nothing to do with ASI.

This is precisely what "but if you need them to be separated in a special way you can add semicolons to manually control separation behavior" was referring to.


Lua has very similar rules to JS regarding semicolons from the user's perspective: semicolons are optional, and only change a program's meaning in some very unusual edge cases. From a PL design perspective, they're fairly different, though. Lua's grammar doesn't have a statement terminator, and the language just lets you insert semicolons anywhere that isn't inside a statement if you so wish. JS's grammar does have a statement terminator, but has rules for inferring it in some places when it's not present.

Does this distinction matter in practice? Probably not. The more important difference is probably just that JS has more unfortunate edge cases related to semicolons than Lua.


These two things are related in that autocorrection tends to create de-facto loose syntax. Once programmers become aware of it and it becomes part of the language as it is used, the language specification becomes more complicated - and possibly extremely so, as every edge case of what the current parser can and cannot correct becomes part of that specification.


If you correct errors without failing the compilation due to a nonzero error count, then you've essentially forked the language. What is an error in the standard language is a de facto nonconforming extension in your implementation of it, and the users will be in for a nasty surprise when they try to port their code to another implementation.

Compilers used to correct obvious parse errors a lot more than they do now. The goal wasn't to make the program pass compilation so that the user can ignore the error messages. That would be harmful, as noted above. The goal is to be able to continue processing the program and uncover more errors in it in a useful way.

There is a gamble there:

- if you make a good correction to the token stream, all is well: you can diagnose more errors later in a pertinent way.

- if the correction is wrong, then the compiler may emit a flurry of nonsense errors caused by the correction, so that only the first diagnostic makes any sense.

There is a third risk:

- the correction may lead to looping. This risk exists in any correction that lengthens the token sequence. The compiler may have to quit when the error count reaches some defined maximum. The looping may otherwise be infinite, or possibly unpredictable in length (think Hailstone Sequence).
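
The shape of that gamble, as a rough sketch (the `parse` and `propose_fix` callbacks are hypothetical, not any real compiler's API): each failed parse triggers one correction of the token stream and a re-parse, and because a correction can lengthen the stream and re-trigger itself, the loop needs a hard cap.

    MAX_ERRORS = 100  # hypothetical defined maximum

    def parse_with_correction(tokens, parse, propose_fix):
        """parse(tokens) raises SyntaxError on failure; propose_fix returns a corrected token list."""
        diagnostics = []
        for _ in range(MAX_ERRORS):
            try:
                return parse(tokens), diagnostics
            except SyntaxError as err:
                diagnostics.append(str(err))
                # May insert tokens, lengthening the stream; hence the cap above.
                tokens = propose_fix(tokens, err)
        raise SyntaxError(f"too many errors ({MAX_ERRORS}); first was: {diagnostics[0]}")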

In the 1970's, Creative Computing magazine conducted a contest to see who could produce the most error messages using the least amount of code.

The reason old-time compilers tried to correct as many errors as possible in a single run is that programmers didn't always have use of the computer; they had to produce the program on punched cards using keypunch equipment, and then line up at a job submission window, where an operator would submit their card deck for execution. You wouldn't want to line up to fix one semicolon at a time.


The compiler "guessed" your errors. This does not mean the compiler "knows" your errors. A rephrase of your question is -- Why compilers don't always assume the "guessed" correction of your code? This is because compiler doesn't know and you won't know when it may guess wrong. And in the case of guessing wrong, you will have very very mysterious errors to the best, and have very very mysterious wrong application behavior (that doesn't even fail) to the worst.


Actually, when I was teaching, I used to see a student strategy I called “obey the compiler”, which was to fix whatever the compiler complained about, without thinking. If the compiler said “semicolon expected at col 42”, the student would put a semicolon at column 42. If the compiler complained “undeclared identifier prnit_results”, the student would declare a name prnit_results. The strategy, when followed to extremes, could convert an almost-valid program into a string of nonsense (post-conversion was when the student came to me for help), more or less the opposite of PL/C's strategy, which I mentioned in an earlier comment. To be fair, this strategy was mostly found in first-year, and a few weaker second-year students.


I think the JavaScript compiler will assume its guesses are correct in the semicolon cases. Not everyone is happy about it.


Language syntax is redundant so that compilers can detect errors. If there was no redundancy in a language, every random stream of bytes would be a valid program.

It's like having dual-path redundancy in an airplane avionics system. If they disagree, then it is clear there's a fault in one of them - but it doesn't mean you can tell which one is erroneous. Without redundancy, there's no way to detect faulty operation.

Guessing which parse is correct, or which avionics subsystem is correct, is as bad as no redundancy at all.


> Language syntax is redundant so that compilers can detect errors.

Unfortunately, programming language syntax is not as redundant as we would wish. When the code is in valid syntax, the compiler can parse it. When the code isn't in valid syntax, the compiler doesn't even have a valid base to parse any information out of the code. The compiler writer may assume a common cause for a certain parsing error and insert some "meaningful" error messages, but that is very different from the compiler knowing anything. The "redundant" information is carried in the out-of-band channel (human vocabulary and common patterns) rather than in the syntax.


If the code does not match the grammar for the language, that is redundancy in the grammar.

The compiler can (and does, for error messages and error recovery) guess at what was meant, but it cannot know what was meant.


I think you are thinking of "redundant" in terms of information. A syntax does not carry any information on its own. It is a frame for information to reside in.


> Have you ever heard that phrase about how “Every happy family is the same, but every unhappy family is unhappy in their own way?”

This really deserves proper attribution: it's the opening sentence of Anna Karenina by Leo Tolstoy.


Tangent ahead. There's the sort of folksy saying "Don't argue with an idiot, they'll bring you down to their level and beat you with experience" (often attributed to Mark Twain, so I'm leaving it anonymous), and I think that these sayings are trying to get at the same thing. When someone believes something wrong, it could literally be because of any other wrong belief that they have. Trying to untangle that structure is intractable.


> "Don't argue with an idiot, they'll bring you down to their level and beat you with experience"

In a further tangent, I've long been fond of a mildly-related idiom (whose source I do not know) which instructs the listener

"Never wrestle with a pig. You both get dirty and the pig likes it."


My 2nd CS course at college was (I'm showing my age here) PL1 programming. The Watfiv compiler would correct obvious parse errors. Often, this would lead to much more insidious and not-so-obvious bugs down the line.


I think you're thinking of Cornell's PL/C compiler (Watfiv was for Fortran, and didn't have a lot of error-correction), circa 1970ish. PL/C would famously convert

PTU LIST('Hello, world!

into a valid program (in fact, the claim was that it would never fail to convert any string of text into a valid program).

PL/C made a lot of sense when short student programs were entered on punched cards (and hence trivial typos were tedious to correct) and batch turnaround times were measured in hours. This makes much less sense when (a) editors can give us clues about typos right away, e.g., by indenting in a surprising way, and (b) compile times for short modules are very short.


Yes, you’re right - it’s been a long time. So long, in fact that I was using punch cards at the time. I remember getting my printouts and wondering at the results only to realize that it had converted bad input text into a “valid” program. Good times.


The Pascal compiler on our CDC Cyber (used by undergraduate courses when I went to NYU) would do this, too. Yeah, was a long time ago :)


I also remember when compilers made more of an effort to fix trivial errors. It was worth it when using a batch system and you could only run a few compilations each day.


An old-timer professor at my university used to tell the story of the university's use of the PL/C compiler from Cornell[1], which promised to automatically correct syntax errors in students' code. This was back in the days of punched cards and next-day compilation times, and it was hoped that the PL/C compiler would reduce the amount of compute time spent compiling bad code. Instead, it would end up turning poorly thought-out code into code which would crash the system or cause endless loops. Its use was discontinued after a short time.

[1] https://en.wikipedia.org/wiki/PL/C


Because no software is good at removing ambiguity. Only humans, or maybe AI, are good at detecting ambiguity and removing it. It requires previous experience, something computers cannot do accurately enough. Programming languages are a bit like real languages, they require human intelligence.

And there are still edge cases that could be ambiguous to humans, so you definitely want any compiler to refuse ambiguous programs. Computers are mathematical machines, they do everything without asking for your permission, so you better pray their behavior is well defined.

Look at what happens when languages are ambiguous, like JavaScript or HTML: they become hard to use, and JS engines are monsters whose inner workings you don't want to understand. I'm not a fan of C++ and its difficulty, but it's my favorite language because it's well defined.

Maybe compiler engineers could demonstrate how inserting semicolons in some places could create undesirable situations. Writing parsers is one of the toughest programming tasks, in my view.

Rules in languages don't exist for nothing. Even duck typing has a cost. It's like deciding that people can drive anywhere on the road and letting them decide how to avoid each other. Sure they can, and it would work 99% of the time, but 99% of the time is not good enough.


Well that's how you end up with https://github.com/mattdiamond/fuckitjs

    Javascript Error Steamroller

    FuckItJS uses state-of-the-art technology to make sure your javascript code runs whether your compiler likes it or not.
Technology

    Through a process known as Eval-Rinse-Reload-And-Repeat, FuckItJS repeatedly compiles your code, detecting errors and slicing those lines out of the script. To survive such a violent process, FuckItJS reloads itself after each iteration, allowing the onerror handler to catch every single error in your terribly written code.


My understanding is that, if the compiler can with 100% confidence correct a mistake or a missing label, then the language has a redundant/useless syntactical appendage and we can just get rid of it. That has happened with ";": some newer languages no longer require it.


I recall a classmate (who'd gotten the unobtainable new Apple IIgs at the time) mentioning the compiler he was running giving an error message, which he described as "I see you forgot a semicolon; should I add it for you?" The UCSD Pascal we were using on shared school IIe and II+ didn't.

Years later, when I was meeting with Tim Berners-Lee, he wanted to see the doc for a Web-related Scheme library I had with me, and he started speed-reading it in front of me. The doc had an irreverent criticism I'd thrown in, about the practice of overly-permissive parsers in Web browsers. In the days of the dotcom gold rush, when anyone who could spell "HTML programmer" was getting truckloads of investment money dumped on them, I'd proposed a very prominent angry red browser error indicator for Web pages with invalid HTML. I thought that having that could be a source of shame, like the creator of it didn't know Web, and all the people tossing around money blindly and not knowing who to invest in might take that as one indicator. :) (Sir Tim later gave a big talk endorsing Python for the Web, but he did reference one of my arguments for why I was adopting Scheme at the time.)

"Conservative in what you send, liberal in what you accept" seemed a good default model for protocol interoperation, especially in an environment of legacy systems and imperfectly-specified protocols. But Web was new, and HTML was often being handwritten, and having the Web browser silently accept invalid and often ambiguous HTML without giving any indication it was wrong even during development seemed to create an unnecessary mess.

I actually had to spend a chunk of last weekend dusting off some code to handle that mess, because another open source developer was still running into the mess: https://www.neilvandyke.org/racket/html-parsing/#%28part._.H...


I liked that idea in the early days. I wanted to drop back to default mode, with default fonts and layout, after displaying the first error. But this was when HTML by itself mostly defined the page layout.

The HTML5 spec has a long, detailed set of rules for consistently parsing bad HTML. They're very funny to read. That was the best anyone could do at that late date.


In the early 70s, Warren Teitelman (also inventor of Undo; his '67 PhD thesis for Minsky was on what we would call an IDE today) developed a feature for Interlisp called DWIM, for Do What I Mean. It would figure out that you’d forgotten a paren or mistyped a function name and would rewrite your code for you.

It was good the way autocorrect is good today, and I hated it, but you couldn’t switch it off because it was also used for macro expansion!


Given that Interlisp is available, one might even try it out today.

https://interlisp.org/

The manual entry for DWIM:

https://interlisp.org/IRM_files/content.htm#bookmark20


For Python specifically, the colon is always used to introduce a block, which makes parsing somewhat easier as well as being consistent throughout the language.

However, one could easily imagine a design where certain keywords automatically introduce a block after the current line, which would eliminate the need for the colon. It would prevent one-liners (e.g. “if x: y”) but that’s no big loss. The colon would continue to be used for e.g. lambda, dict and annotation syntax.


"If the computer knows I’m missing a semicolon here, why won’t it add it itself?"

The computer does not know that. The computer is being too smart. And probably wrong.


The compiler is asking the developer to state his intent.

The fact that some programming languages are overly pedantic is part of their design.


A place where autocorrect might be considered is in REPLs. Out of habit I still regularly write "print 'a'" in the Python REPL although I've been using Python 3 for a while. You get:

SyntaxError: Missing parentheses in call to 'print'. Did you mean print("a")?

Well yes... obviously... so please just print it.


That would mean adding a rule to the language. The rule of "you can use 'print' as a statement". A rule that, time after time, has been shot down.

Which leads us to the real issue at hand: if the compiler is going to do anything by itself, that means it is following well-defined rules. Therefore, whatever automatic thing the compiler does is part of the language. And, sometimes, the design rules of said language plainly and simply do not allow for that.


Be careful what you wish for. Parsers guessing at the authors intention is what gave us HTML.


That is a great point. Eclipse for Java has, in the past 5+ years, started auto-completing the beginnings of expressions. As someone who started programming in vi, it's really enjoyable. It doesn't do everything, but strings, statement completions, loops and scopes, even variable naming and selection, etc. all autocomplete. I realize that's not quite the same as autocorrect, but it works well. Second, with the OpenAI GPT-3 stuff, it does generate syntactically correct code, though I haven't learned to trust it enough to generate semantic instantiations of what I'm thinking.


These days compilers provide fix-it hints that are very useful. I find them especially convenient for misspellings and printf format string errors, especially when coupled with on-the-fly error checking.


All of the points this article makes are good and true, but something I've never, ever seen and am wondering why is a compiler feedback loop that corrects the code with human input.

The compiler is smart enough to guess what variable I meant when I misspell a variable. How come nobody's ever given me a tool to close the loop, so that when the error is reported I can confirm that I want my source code edited and have it corrected in place?


Because the compiler probably doesn’t have a stdin it can use at that point so you’d need to thread a feedback mechanism through the entire build process. You’re better off building that as a protocol that can be used by IDEs or LSP clients.


Excellent point. I can't think of a protocol that supports it either... Likely one exists and I haven't encountered it.


Okay so autocorrect is a bad idea. But what I'm tearing my hair out trying to figure out is why can't compilers even detect an "obvious" error? Why does GCC often output cryptic general-purpose errors about the WRONG line of code, when I have a simple typo? Why can't it have a simple rules engine that detects common typos in syntax and suggests a fix to me, right in the error message?


It can be hard to say, since things can get wildly complicated depending on language semantics (especially when type inference is involved). But yeah, there is often a tendency in compilers to react very poorly to a typo and freak out over what comes after it instead of going, "hmmm, this doesn't look complete". In some languages, this can be because the typo is actually ambiguous with something that's fine, just not in the context of the code that follows. But every compiler and language is different, so it's hard to say.


> Because, as smart as we compiler designers think we are, you, dear programmer, know your program better than we do.

Copilot, however, begs to differ.


There's a lot of hubris in ML!


I want to disagree with him: sometimes there is an unambiguous way for the compiler to resolve such errors. But the road to hell is paved with good intentions, as anyone working with MATLAB will have experienced firsthand. Instead of trying to guess what the programmer meant, a compiler should just be a dumb machine that does what we tell it to do.


Some languages are more redundant than others. I remember at least one Ada compiler that would make an assumption about what you meant, correct the internal representation, and keep going. It would fail with an error, but still allow one to push further through the compilation process. This was a big help given how slow the compilers were.


I don't think gcc or CPython should do this, but separate programs that flag and correct errors could be useful. They would not correct silently but would give messages such as

"line 5 needs a terminating semicolon -- add it?"

Has this been attempted?


On the final example: I don't think a double colon would ever be valid, so why would the "compiler" suggest making that change?


I'd like to be able to run the compiler in interactive mode where it asks to apply the fix along with the diff.


Your IDE might already do this for you and it's obvious until it's not.


I don't think Python will tell you you're missing a semicolon.


I prefer an explicit pass.


The LaTeX compiler tries to correct many parse errors.


As a longtime LaTeX user, I find this behavior irritating; not once has it saved me effort. I would much rather have LaTeX exit to the shell immediately and report a good error. Instead, the dynamic "fix" is applied in an opaque manner, and it's totally unclear what the compiler did. It optimizes for "patch the input so compilation runs to completion" instead of "do what the user wanted", and once LaTeX has finished, the user still must manually edit the original file to get a proper solution.


This is the most annoying thing about LaTeX, IMO. The "wait, how many edits back was the error" game is no fun.


I strongly disagree. Compilers should autocorrect “obvious” parse errors. I also am not aware of any that doesn’t.

What they shouldn’t do is produce a binary based on their (smart or stupid) guesses about the programmer’s intention.

That allows you to compile, fix multiple typos, compile, instead of compile, fix one typo, compile, fix the next typo, compile, etc, _and_ prevents you from running a program that you didn’t write.

I am not aware of any compiler that doesn’t do this, as it would be extremely annoying to have a compiler give up at the first error.

The search term to use is parser error recovery. It doesn’t give obviously great hits, though. Sample hits:

- https://www.geeksforgeeks.org/what-is-error-recovery/

- https://cs.adelaide.edu.au/~charles/lt/Lectures/07-ErrorReco...

- https://en.wikipedia.org/wiki/Burke–Fisher_error_repair

- https://en.wikipedia.org/wiki/LR_parser#Syntax_error_recover...


Not sure why you are being downvoted but you are completely right.

A few years ago GCC wasn't as good at error recovery, so the "too many errors, bailing out" message was a common occurrence (code for Internal Compiler Error, but managed to print at least one diagnostic). Today it is much much rarer to encounter it and using the compiler is a much more pleasant experience.


Unfortunately this isn't so simple. Autocorrecting to something could result in picking something that's actually wrong and impacting whole program semantics, either causing no errors when there should be errors, or causing errors when there should be none. That could be extremely confusing for new developers.

Instead, compiler authors need to understand and prioritize good ergonomics. Diagnostics should be accurate, come with suggestions, have unique error codes you can look up, and follow patterns you can predict over time.


I wasn’t making any claim it’s simple. Writing a compiler is simple (1), writing a compiler that produces good code is hard, writing a compiler that produces good error messages is very hard. Once you can produce good error messages, error recovery isn’t that hard anymore.

I think languages that use different ways to delineate different loops (do…od, if…fi, while…wend, repeat…until) make it easier to do error recovery of “about compilable” source than C-style ones that use {…} everywhere. In general, redundancy will improve the ability to do error recovery.

(1) The trick is to not think about code quality at all. Emit assembly for every individual statement, never inline functions, feel free to write a load from/store to memory for every variable read/write, etc. It will get you slow code, but also code that’s faster than an interpreter for the same language (your version 2 could post-process to eliminate superfluous loads and stores. Even only removing loads gives a speed up and a code size decrease).
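
As a sketch of what (1) means in practice (a made-up three-address pseudo-assembly, not any real backend): every operand read becomes a load, every result a store, one fixed template per statement.

    def compile_assignment(dest, lhs, op, rhs):
        """Naively compile 'dest = lhs op rhs'; no register allocation, no inlining."""
        mnemonics = {"+": "add", "-": "sub", "*": "mul"}
        return [
            f"load  r1, [{lhs}]",      # always reload operands from memory
            f"load  r2, [{rhs}]",
            f"{mnemonics[op]}   r1, r2",
            f"store [{dest}], r1",     # always spill the result back to memory
        ]

    # compile_assignment("x", "a", "+", "b") emits four instructions; a later
    # pass could delete the superfluous loads and stores mentioned above.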


As long as compilation is instant, or nearly so, recompiling after each error fixed is likely preferable. The reason being that the error correction is frequently flawed, so that anything beyond the first error is usually suspect, and often will disappear after the first error is fixed. The novice programmer may not realize this, and thus become quite confused when they try to diagnose the later errors, only to realize they don't exist.


> I also am not aware of any that doesn’t.

Can you give an example of CPython, or GCC/clang autocorrecting a parse error?


As a clang++ demonstration see https://godbolt.org/z/bTMT6qd4f. The failing `static_assert` uses the variable `i` which comes from the line missing a semicolon. It only reaches this assert because the compiler internally fixed the first error.


They do auto-correction while parsing. If you have C code and forget a semicolon or something else, it will try to guess a way to fix it in order to parse the rest of the program and give you other possible errors it found. There is even the -Wfatal-errors flag to disable this functionality.


Have you ever seen GCC or Clang give more than one error message for a program? They're able to provide more than one error message because they correct the first error, and then continue parsing.


For some value of “correct”. Actually, it's often more like skipping a few tokens until some kind of synchronization point (e.g., a semicolon) is reached. It's good manners for a compiler that does this to refuse to produce a binary.



