Parse, don’t type-check (neilmadden.blog)
178 points by todsacerdoti on Nov 26, 2020 | 139 comments



The problem is that fundamentally we always have types. We must define different types of bytes and the different things we can do with those bytes. If you can generate a binary from your programming language, it must use a form of static or inferred typing to know what assembly code to generate. A dynamic language just checks at runtime, which adds a bunch of overhead that isn't always necessary, and is more likely to result in weird "defined" behavior like we get from JavaScript and PHP. TypeScript is very popular because you can make certain guarantees at compile time, instead of a vague hand-wave of "the code looks good".

As an example of the benefits of type-checking, in the Vulkan api, there are a lot of handles. VkRenderPass, VkPipeline, VkSwapchain, VkDevice, etc. All of these are just pointers. If we take a function like:

    void vkDestroyCommandPool(VkDevice, VkCommandPool, const VkAllocationCallbacks*);
And remove the type checking:

    void vkDestroyCommandPool(void*, void*, const void*);
It is the same API, but without type-checking. Even if the parameters are labelled, with this new API I could mess up and pass the wrong thing.
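
In a language with cheap wrapper types, the same protection costs almost nothing; a minimal Haskell sketch of the idea (hypothetical names, not a real Vulkan binding):

    import Foreign.Ptr (Ptr, nullPtr)

    -- Distinct types with the same underlying representation.
    newtype VkDevice      = VkDevice      (Ptr ())
    newtype VkCommandPool = VkCommandPool (Ptr ())

    -- Swapping the two handle arguments is now a compile error,
    -- even though both are "just pointers" underneath.
    destroyCommandPool :: VkDevice -> VkCommandPool -> IO ()
    destroyCommandPool (VkDevice _) (VkCommandPool _) = pure ()

    main :: IO ()
    main = destroyCommandPool (VkDevice nullPtr) (VkCommandPool nullPtr)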


I don't see this as challenging anything that the article is ultimately proposing. See, for example, the last paragraph:

> I thank the flying spaghetti monster almost every day for the type system in my day job (and that is only Java’s tepid concoction)...

Isn't exactly a statement you'd expect to find in the concluding remarks of an essay whose goal is to question the value of type systems or static types. The goal is presumably something more nuanced than that. Maybe something more like

> we are in danger of thinking that type systems are the only way of achieving correctness in software construction.

Which seems reasonable to me. Type systems are excellent for eliminating certain classes of errors. But we shouldn't let them become a golden hammer. Even if they can be used to eliminate other classes of errors, that doesn't necessarily mean they're the best tool for the job. Lately, for example, I've been seeing some disillusionment with dependent typing, and people saying, in effect, "Yeah, it's a really impressive technique, but try to find a simpler solution first."

And some of those simpler solutions do come from the dynamic programming world. For example, "You can then go further and remove the more primitive operations, only allowing access to the higher level abstractions," is another excellent way to make illegal states unrepresentable. And you don't need shiny new programming languages or rocket powered type systems to do it. I was really rather disappointed that that section of the article gave a nod to Dijkstra, but completely failed to mention Alan Kay's The Early History of Smalltalk, which was, to an approximation, several thousand words' worth of grinding away on that point.


The article is not even remotely suggesting removing type information. On the contrary, it's talking about some ways to augment a type checker with different grammar or ways of constructing APIs.


No, but you can always encode this augmented information in types to make it explicit, and possibly help the compiler make optimizations. Using a weaker type when you can use a stronger type is just as bad as removing type information.


> No, but you can always encode this augmented information in types to make it explicit

Technically yes, but not all type systems make it practical, or really relevant; e.g. you can encode nullability in a type in Java… but the reference to your Optional is still itself nullable.


I think it’s important to be more specific here. You can always mess up and pass the wrong thing, the type system can never know what is semantically correct, no matter how strict your types are. They’re about ergonomically reducing the space of valid programs.

There are a lot of interesting tradeoffs to be made there. Over-strict specification (whether parsing or type checking), specifically checking for invariants that aren’t actually useful, can make code and systems brittle and hard to change, just like under-specifying them. The specifications can also be different depending on what you’re doing with the data.

Also as the expressiveness of types increases, the space of possible ways to constrain your program space goes up, along with the complexity of the transformations of that space, making it difficult to understand what the valid program space actually is, hard to understand when you’ve accidentally excluded perfectly valid programs, etc.

So yes, fundamentally there are always types and underlying structure to useful data, but saying that it has “a” type is dramatically over-simplifying things.


>I think it’s important to be more specific here. You can always mess up and pass the wrong thing, the type system can never know what is semantically correct

This is not true. Consider the map function of a typical functional language with parametric polymorphism:

map : (a -> b) -> [a] -> [b]

That signature is complete: it covers everything there is to know about the function. And, except for trivial implementations, there is no way to implement it wrongly.

In some cases, one can even rule out trivial implementations and infer the correct implementations from the type signature alone.

In any case, there is no "semantically wrong" input to such a function. It is just a function and totally agnostic regarding your use case.


Of course that type doesn't encode everything there is to know about that function, even assuming purity.

Here are some functions that match that type signature:

    map1 foo (x:xs) = [foo x]

    mapR foo (x:xs) = mapR foo xs ++ [foo x]

    mapI foo (x:xs) = foo x : mapI foo (x:xs)
Of course, we can create many more functions that arrange the values in other creative ways, add more or fewer values, etc. If the function isn't pure, we can also imagine many more dimensions of functions that do arbitrarily different things from what you would expect.

If you want complete signatures without dependent types, you need functions which take only single values of different unknown types, like apply: (a -> b) -> a -> b. If you have more than one parameter of the same unknown type, you'll have some ambiguity already.
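
Even a tiny Haskell example shows that ambiguity: two parameters of the same unknown type already admit observably different implementations of one signature:

    -- Both inhabit a -> a -> a, but behave differently.
    pick1, pick2 :: a -> a -> a
    pick1 x _ = x
    pick2 _ y = y

    main :: IO ()
    main = print (pick1 1 2, pick2 1 2)  -- prints (1,2)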

Non-dependent types can only be used to encode extremely simple properties of a program. They are certainly useful, but they are not in any way the be-all, end-all of static verification.


> Of course that type doesn't encode everything there is to know about that function, even assuming purity.

Right, we additionally need to require that the candidate map function `m` satisfies `m id = id`.


That only works because Haskell can't really express it. That is, if you could write Haskell code that checked that equality, that equality wouldn't be enough, because you could write a function `mapStrange foo [x]` that is equal to `map` if `foo = id` and is something else otherwise.

But since you can't compare two functions for equality, and you can't compare two arbitrary values for equality, then yes - your informal requirement, together with the type, is enough to force the function to be `map`.


Sorry for the confusion: I think we are using different definitions of "semantics" here; I'm not talking about formal semantics. I'm considering these to be semantically different, even though their structural types are the same:

    fn min(x: int, y: int) -> int
    fn max(x: int, y: int) -> int
The meaning of a computation changes depending on which one we use.

I probably should have used a different word than "semantics" since that has a technical definition in the PL field...


> I think it’s important to be more specific here. You can always mess up and pass the wrong thing, the type system can never know what is semantically correct, no matter how strict your types are.

Can you give an example?


  transferAmount(from: Account, to: Account, amount: Money)
How can a type system know that you are passing the accounts in the correct order?

You might start by trying to make more strict types such as SourceAccount and DestinationAccount.

  transferAmount(from: SourceAccount, to: DestinationAccount, amount: Money)
But we know that any account is likely to be both of these at different times. We only persist Accounts and have to convert at runtime. Now you face the challenge of converting a regular Account to a SourceAccount.

  src := fromAccount.toSourceAccount()
  dest := toAccount.toDestinationAccount()
  transferAmount(src, dest, amt)
How can the type system know that you are converting the correct Account? The type system cannot determine what is semantically correct. It's a recursive problem.
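
To make the recursion concrete, a Haskell-flavoured sketch (illustrative types): the wrappers narrow the window for mistakes, but the wrapping step itself remains a judgement the type system cannot check:

  data Account = Account           -- stand-ins for illustration
  newtype Money = Money Int

  newtype SourceAccount      = SourceAccount Account
  newtype DestinationAccount = DestinationAccount Account

  transferAmount :: SourceAccount -> DestinationAccount -> Money -> IO ()
  transferAmount _ _ _ = pure ()

  prepare :: Account -> Account -> (SourceAccount, DestinationAccount)
  prepare fromAcct toAcct =
    -- Nothing stops us wrapping the accounts the wrong way round here;
    -- the checked world begins only after this line.
    (SourceAccount fromAcct, DestinationAccount toAcct)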


What you are saying is all correct, but I think we disagree about the terminology. You said "the type system can never know what is semantically correct" but I think the right way of saying it is "the type system can sometimes not know what is semantically correct".

Because in practice, code like "src := fromAccount.toSourceAccount()" is very rare and happens once, at system or logic boundaries - after that, all semantic errors can be caught by the type system, and this accounts for the vast majority of the code.

So yes, the type system can and will prevent semantic errors. But it cannot protect you from making them in all cases; that's impossible.


A programmer will still need to write correct code, but in your first example the error happens at the boundary between two modules. In the second example, the error is fully contained within one module.

One of those two is easier to spot, either by the original programmer, or by a code reviewer, or by someone trying to track down a fault.

If all the type system does is make semantic errors easier to spot and fix, it's still offering a lot of value.


And indeed, in opengl, where all handles have the same type (GLuint), it's very easy to mix up handles of different types.

It's quite easy to fix this without breaking ABI compatibility; but it hasn't been done in any implementation that I know of.
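
Such a fix would look something like zero-cost wrapper types over GLuint; sketched in Haskell (hypothetical names, not any particular binding):

    import Data.Word (Word32)

    type GLuint = Word32

    -- Same representation as a raw GLuint, so nothing changes at
    -- the ABI level, but the handles are no longer interchangeable.
    newtype TextureId = TextureId GLuint
    newtype BufferId  = BufferId  GLuint

    bindTexture :: TextureId -> IO ()
    bindTexture (TextureId _) = pure ()
    -- bindTexture (BufferId 3) would now be a compile error.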


In C perhaps that's unfortunate. I'm sure glad we have more modern C++ that would enforce pointer types.

... as long as the functions aren't declared in `extern "C" { ... }`

which many libraries are. That's for many reasons, some of which include the very type system itself being an inhibitor of easy integration across languages.


I don't quite follow. C will also prevent you from mixing up disparate pointer types, and C++ has no protection against void pointers.


> C will also prevent you from mixing up disparate pointer types

No it won't. In C++ it's a compile error [0]. In C it's just a warning [1].

> c++ has no protection against void pointers

You're somewhat correct but the answer is a lot more nuanced (because void* is very nuanced). Again, C++ will complain unless you do it "right" [2] while C is ... willing to let you blow your leg off [3]. However the definition of "right" is still just as dangerous as C's because all type information was lost in the void and it's up to the programmer to get it right -- and programmers are notoriously buggy.

[0] https://gcc.godbolt.org/z/166ns8

[1] https://gcc.godbolt.org/z/vEnajj

[2] https://gcc.godbolt.org/z/cGe39z

[3] https://gcc.godbolt.org/z/orb1s8


Using extern "C" doesn't change the C++ rules for implicit pointer conversions, though.


> The essential message is to make illegal states unrepresentable.

That's why, in C we have:

    if (a = b)
and:

    if (a < b < c)
that cause (in some compilers) warnings. I wanted to make this an error in D. One option is to check for and disallow them after the parse, but that left:

    if ((a < b) < c)
as still an error. I wanted to allow these constructs to remain if they were parenthesized, as that would imply they were deliberate. (Yes, parenthesized expressions could be a separate grammar, but I thought that was an awful idea.)

The solution is to fix the grammar, so they wouldn't even parse. The beauty of this is one doesn't have to enable warnings, or have a separate check for them, it just won't compile. Even better, D's expression grammar is the same as C's, and it did not upset anything else in the grammar as a side effect.
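
One way to picture the fix (a Haskell sketch of the resulting AST shape, not D's actual grammar): relational expressions become non-associative, so a chained comparison has nothing to parse into:

    -- At most one comparison per relational expression, so
    -- `a < b < c` simply has no tree to build; a parenthesized
    -- `(a < b) < c` would re-enter via the (elided) primary rule.
    data RelExpr
      = Leaf ShiftExpr
      | Rel ShiftExpr RelOp ShiftExpr

    data RelOp = Lt | Le | Gt | Ge
    newtype ShiftExpr = Var String   -- stands in for the rest of the grammar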

Based on my experience at Boeing designing hardware, defining the problem out of existence is a great way to design robust systems.


What's your take on desugaring it to

    (a < b) && (b < c)
ala Python?

Speaking hypothetically of course, since it would be a breaking change.


That was considered, but the problem is what does the user think when he sees `a < b < c` ?

1. is it intended to be the C way?

2. is it the Python way?

3. did the user assume it was (1) or did he assume it was (2) and guess wrong?

This means that such expressions are a never-ending red flag for a code review. Hence it is made an error. The user can still get the C behavior by using parentheses.

Note that if D was a Python-derived language, then selecting (2) semantics would be a no-brainer. But as a C-derived language, it has to be an error.


Note that Python applies same logic to other operators, which can be pretty unintuitive:

    1 in [1,2,3] is True
evaluates to False.


"pretty unintuitive" is an understatement...

  In [1]: import dis
  
  In [2]: def a():
     ...:     return (1 in [1,2,3] is True)
  
  In [3]: def b():
     ...:     return ((1 in [1,2,3]) is True)
  
  In [4]: a()
  Out[4]: False
  
  In [5]: b()
  Out[5]: True
  
  In [6]: dis.dis(a)
    2           0 LOAD_CONST               1 (1)
                2 LOAD_CONST               1 (1)
                4 LOAD_CONST               2 (2)
                6 LOAD_CONST               3 (3)
                8 BUILD_LIST               3
               10 DUP_TOP
               12 ROT_THREE
               14 COMPARE_OP               6 (in)
               16 JUMP_IF_FALSE_OR_POP    24
               18 LOAD_CONST               4 (True)
               20 COMPARE_OP               8 (is)
               22 RETURN_VALUE
          >>   24 ROT_TWO
               26 POP_TOP
               28 RETURN_VALUE
  
  In [7]: dis.dis(b)
    2           0 LOAD_CONST               1 (1)
                2 LOAD_CONST               5 ((1, 2, 3))
                4 COMPARE_OP               6 (in)
                6 LOAD_CONST               4 (True)
                8 COMPARE_OP               8 (is)
               10 RETURN_VALUE
Why?!


The disassembly in the parent comment was made with CPython 3.6.9, but nothing substantial changes on CPython 3.8.x/3.9.x or PyPy.

Maybe it is related to the grammar, and not to the compiler...

AST dump in CPython 3.6.9 (manually formatted):

  In [1]: import ast
  
  In [2]: print(ast.dump(ast.parse("""\
     ...: def a():
     ...:     return (1 in [1,2,3] is True)""")))
  
  Module(
      body=[
          FunctionDef(
              name='a',
              args=arguments(args=[], vararg=None, kwonlyargs=[], kw_defaults=[], kwarg=None, defaults=[]),
              body=[
                  Return(
                      value=Compare(
                          left=Num(n=1),
                          ops=[In(), Is()],
                          comparators=[
                              List(elts=[Num(n=1), Num(n=2), Num(n=3)], ctx=Load()),
                              NameConstant(value=True)
                          ]
                      )
                  )
              ],
              decorator_list=[],
              returns=None
          )
      ]
  )
  
  In [3]: print(ast.dump(ast.parse("""\
     ...: def b():
     ...:     return ((1 in [1,2,3]) is True)""")))
  
  Module(
      body=[
          FunctionDef(
              name='b',
              args=arguments(args=[], vararg=None, kwonlyargs=[], kw_defaults=[], kwarg=None, defaults=[]),
              body=[
                  Return(
                      value=Compare(
                          left=Compare(
                              left=Num(n=1),
                              ops=[In()],
                              comparators=[List(elts=[Num(n=1), Num(n=2), Num(n=3)], ctx=Load())]
                          ),
                          ops=[Is()],
                          comparators=[NameConstant(value=True)]
                      )
                  )
              ],
              decorator_list=[],
              returns=None
          )
      ]
  )


OK, mystery solved. In Python:

  1 in [1,2,3] is True
is evaluated as

  (1 in [1, 2, 3]) and ([1, 2, 3] is True)
similarly to

  1 < 2 < 3
:facepalm:


I don't see anything surprising. "is" has higher precedence than "in" and that's it. Or am I misreading something?


It's not a matter of operator precedence for two reasons:

1) "in" and "is" have same precedence, and group left to right (see https://docs.python.org/3/reference/expressions.html#operato...)

2) you'll have a runtime error:

  In [1]: def c():
     ...:     return (1 in ([1,2,3] is True))
  
  In [2]: c()
  ...  
  TypeError: argument of type 'bool' is not iterable
It seems related to the CPython bytecode compiler implementation: the two functions are parsed in different ways, and parentheses make the compiler go down a different path... but I'd like to understand why, without diving into the CPython source code :) Anyone?


If that were the reason you'd get a TypeError, since there's no `bool.__contains__`.


this is one place where prefix notation really shines:

    (< a b c)


The message you should really take away from meta-programming in LISP is that code is a message that can be read different ways.

That's limited though by practical considerations such as the Turing and Gödel results, how your algorithms scale with problem size, etc.

There is room for a revolution in parsing tools, but it seems almost every attempt falls short. (PEG in Python is the latest, maybe they will expose a general purpose interface to it someday, but by then certain performance concerns will be so baked in that features that expand use-cases for parsing such as automatic unparsing and round-tripping won't materialize)

We are still coding to 1970's style 'callback hell' interfaces from yacc when we should be able to write one grammar and get not just the parser but also a set of semantic structures, for instance if you were parsing Java in Java you'd have objects to represent expressions, operators, etc.


> PEG in Python is the latest, [...]

The popularity of PEG baffles me. I guess it ties in nicely with the wider and older pattern of rejecting good formal language and parsing theory, but PEG seems like an all around bad technology; like, is there even a single benefit to using PEG?

The difference between PEG and the usual grammars is that PEG introduces special meaning to the "choice" grammar operation so as to define away ambiguity. But defining away ambiguity actually just "hides" language constructions that will be ambiguous to users. Although PEG notation may be indistinguishable from those used to define CFGs, PEG sucks as a language specification tool as it is unclear what language a PEG defines (unlike with CFGs). From the perspective of implementation, PEG parsers backtrack, meaning either exponential run-time or additional linear memory requirements.
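
To make the "unclear what language a PEG defines" point concrete, here is PEG's ordered choice mimicked in Haskell with Parsec (a sketch; `try` plus `<|>` reproduces PEG's commit-on-success behaviour):

    import Text.Parsec (eof, parse, string, try, (<|>))
    import Text.Parsec.String (Parser)

    -- The PEG rule  A <- "a" / "ab"  as ordered choice: once the
    -- first alternative succeeds, the second is never considered.
    rule :: Parser String
    rule = (try (string "a") <|> string "ab") <* eof

    main :: IO ()
    main = do
      print (parse rule "" "a")   -- Right "a"
      print (parse rule "" "ab")  -- Left ...: "ab" looks covered by
                                  -- the grammar, but is not accepted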

Also some Jeffrey Kegler quotes:

> With PEG, what you see in the extended BNF is not what you get. PEG parsing has been called "precise", apparently based on the idea that PEG parsing is in a certain sense unambiguous. In this case "precise" is taken as synonymous with "unique". That is, PEG parsing is precise in exactly the same sense that Jimmy Hoffa's body is at a precise location. There is (presumably) exactly one such place, but we are hard put to be any more specific about the matter.

and

> When you do not know the language your parser is parsing, you of course have the problem that your parser might not parse all the strings in your language. That can be dealt with by fixing the parser to accept the correct input, as you encounter problems.

> A second, more serious, problem is often forgotten. Your PEG parser might accept strings that are not in your language. At worst, this creates a security loophole. At best, it leaves you with a choice: break compatibility, or leave the problem unfixed.

and

> PEG is not unambiguous in any helpful sense of that word. BNF allows you to specify ambiguous grammars, and that feature is tied to its power and flexibility and often useful in itself. PEG will only deliver one of those parses. But without an easy way of knowing which parse, the underlying ambiguity is not addressed -- it is just ignored.


> Is there even a single benefit to using PEG?

In a formal sense, there are some non-context-free languages that PEG can recognize (A^n B^n C^n is the classic example). It’s an open question whether there are any context-free languages PEG isn’t capable of recognizing.

Fundamentally, PEG is a formalization of the common practice of writing a recursive-descent parser: It defines languages based on an exemplar algorithm that parses them instead of the generative approach taken by BNF, which defines how you can enumerate the legal strings of the language. They’re both as mathematically rigorous as each other, but approach the problem space from opposite sides.

For BNF, ambiguity is a question of whether the generation process is reversible: An ambiguous grammar has multiple paths that produce the same string which means that there’s no way to recover the path used to generate that string.

PEG is “unambiguous” because it’s not based on a generative model at all, so this definition is nonsensical. With PEG, we know, by definition, how any given string will parse but it can be hard to enumerate the strings that will parse in a given way. If PEG has an interesting notion of ambiguity (which it probably does), that’s the place to look for it.


PEG for one thing offers the possibility of composability.

For a language like Python or Java it shouldn't be much harder to add an "unless(X) {Y}" equivalent of "if(!X) {Y}" than it is in LISP, if the compiler were structured in such a way that you could take "Java Compiler A" and specialize it to "Java Compiler B", such that one term is added to the grammar along with one AST transformation that rewrites the term.

Exclusive of POM files, import statements, and other packaging material, that should be less than 50 lines of code if we built compilers to be extensible.

The ArangoDB database has a kind of SELECT statement that returns a list of unordered dictionaries, which is frustrating if you want to plot the result in a pandas DataFrame. I wrote something in Python using PEG that would parse an ArangoDB query, statically analyze it, and determine the intended field order when definite.

I am up for any compiler technology that makes it easy to do that even if it means I don't get compilation as fast as Go.

Another issue that matters is error handling. Drools is a great rules engine on paper, but as it is composed with an external Java compiler you can often get into cases where the error messages don't make sense at all; it isn't like the Jena rules engine, which is simple enough that I could get quick results looking at it in the debugger. That's a complicated example, but I think language users deserve better error handling and compiler authors need better tools to provide it.


Lisp is certainly the most powerful/elegant meta-programming language out there.

But there is a nuance here:

> code is a message that can be read different ways

Think about it: when you read code directly, you are now dependent on that code. And code can change for many reasons, including language changes and mere syntax changes.

For that reason, I like the approach of describing the required actions in a certain data format and then reading this data format in different ways, as opposed to reading code directly. Both in the same language, of course. This also relieves you from having to represent generic expressions, operators, etc.; you only need the subset that is relevant for you, which is most likely much more coarse-grained.

This is imho the cleaner approach, but it comes with the overhead of having to design a "sublanguage" in your language and then interpreting it in your language.


And pray tell how do you verify that the parser is actually correct? How strict can you get?

Lax parsers are the bane of every web browser's existence.

Ultimately parsing does not get you to correctness, ensuring properties does, and a type system is the lamest way to do it. There are actual logic solvers/theorem provers that can handle properties and are as powerful as programming languages while also ensuring correctness and most importantly completeness.

Some of them were even available as extensions for Java, though constraint programming is not currently in the mainstream. (Besides a small variant called borrow checking, used in Rust for instance.)


You should read Alexis King's articles on how this is achieved. One of the most important aspects of modern parser combinator libraries is that a very small composable core can be exhaustively tested (importantly: with both hand-written and random cases) to provide a reliable kernel for parser combinators.

Given that those cited parsers were built from the ground up to mesh with an effective and expressive type system like Haskell's, this problem was significantly reduced. The downside, of course, is that it became quite opinionated.

There is no magic wand to wave, but it's certainly easier to work on a common minimal core of components than on custom ad hoc validators. Lax parsers are not an inevitable outcome; they're a decision propagated by decades of bad opinions being taken as gospel.


An expressive language should let programmers reason about code, including parsers, for their correctness.

You know it is correct when you see it, and you get more confidence in not making silly mistakes with unit tests. This is a different approach to formal verification using logic solvers or theorem provers.

For an example of parsing with Haskell, see this recent discussion: https://news.ycombinator.com/item?id=25212372

This approach, as hinted in Parse, Don’t Type-Check, is to raise the level of abstraction. In jargon: you create a domain-specific language (DSL) to model the problem, as in language-oriented programming.

With this approach, you understand what the code is trying to do and can change it easily because it reads like pseudo-code. (Pseudo-code is usually a synonym for code at the right level of abstraction.)


> And pray tell how do you verify that the parser is actually correct? How strict can you get?

The idea presented in the article is that the data type of the parser output ensures that your invariants are satisfied. It’s not a panacea for solving every possible problem with input validation, it’s more of an alternative technique that gives you much better assurances that some of the invariants you think should be satisfied, are satisfied.


The point of input validation is to fail early, not to ensure the function executes. Returning an empty list when the input is an invalid list sounds like a really bad idea.

What if the function is deleteWhereNotIn( list param )? That empty list just deleted everything.


> The point of input validation is to fail early, not to ensure the function executes.

> What if the function is deleteWhereNotIn( list param )?

  error: could not assign `param` = `mungeList foo`
  note: failed to match `List a` <= `Either SomeError (List b)`
  note: failed to match `List` <= `Either SomeError`


The original article was in a Haskell context, where you'd use a Maybe or Either to prevent that sort of mistake.


Lax parsers made the modern web possible. New web pages with image tags, forms, and scripts were possible without breaking that page's parsing in older browsers.


I think that at this point, folks have realized this is not in fact true. If anything, lax parsers made the evolution of the web more difficult and more subject to corporate co-option.


What's the argument for that? If HTML was strictly parsed the first page with an image tag would have been broken in all other browsers.


Ignoring for one second the specifics of <img> in relation to SGML's `O` option (which was rectified in XML), this doesn't really need to have been the case. HTML could easily have said "if you encounter an unknown tag, render its contents as PCDATA" and sites would have degraded at least as gracefully as they do today. (If less gracefully than they did in 2000.)

(Heck, it could've been a generic SGML feature! "Unknown elements' contents are CDATA, unless they have this attribute in which case they're PCDATA, or this attribute in which case they're ignored" as a rule in the DTD...)


> HTML could easily have said...

But it didn't! The problem with being strict here is that every possible usage has to be pre-imagined and perfectly implemented. You're suggesting the original developers should have just made affordances for everything that would be added over the next 30 years. That's easy to say now. The first web browser was essentially just a hugely successful prototype.

And can you imagine having to type all your tags in upper-case? Yuck. :)


We never needed to parse tag soup. We only needed to say which DTD defines the new elements a document uses, and what a browser should do with valid but unknown elements. The latter could use #FIXED or default attribute values in the DTD, because some new elements have human-readable content and others don’t.


The looseness of html is usually in regards to automatically closing tags or unquoted attributes. You can keep strict syntax enforcement and still recognize and skip unknown tags or attributes.


> Lax parsers made the modern web possible.

that's the whole problem indeed


https://mcshowblog.files.wordpress.com/2019/08/dumpsterfire....

Yeah bad example if you wanted to convince me it’s any good :-D


This is not due to lax parsing though. It is due to the convention of how to process undefined tags and attributes:

User agents must treat elements and attributes that they do not understand as semantically neutral; leaving them in the DOM (for DOM processors), and styling them according to CSS (for CSS processors), but not inferring any meaning from them. https://html.spec.whatwg.org/multipage/infrastructure.html#e...

This is unrelated to parsing - for example, an XHTML parser will be strict and draconian in its parsing, but will still process undefined tags and attributes as above.


That's like saying sloppy programming makes software development possible. At best, it speeds things up (a temporary benefit) at the cost of quality (causing potentially permanent damage).


That’s not like what was said above. They said that a strict parser would choke on unrecognized tags, thus making experimentation non-viable.

Sloppy programming is not about enabling new syntax at all. That simile is not useful.


Using parsing and abstractions for safety is the basis of why Clojure, even though dynamically typed, is a relatively safe language[1]. The abstractions are safer in Clojure: no for loops, no while loops, only higher-level iteration and transformation constructs; data interfaces are all immutable and functional; shared state can only be used through higher-level managed APIs; numbers auto-promote themselves to higher precision; there are no linear-search APIs, forcing the use of indexed data structures instead; etc.

And for parsing: Spec is a way to define contracts, but it's actually a parser DSL. User-defined macros which specify a Spec will have their input parsed by the Spec at read-time, and if the parser fails, a read error will show up. Unfortunately, this only happens for macros, but the article made me curious about the possible value of having this done for functions as well.

An example of this is say you have some macro which would let you do:

(:from coll :take 5)

Your Spec can parse this, and error if the first element isn't :from, the second element isn't a reference or a collection literal, the third element isn't :take, the fourth element isn't a reference or an int, or if it isn't called with exactly 4 elements. And all this happens at read time, when the code is parsed. You could go further: if a collection literal was used, you could assert that the take int isn't bigger than the number of elements in the collection literal.

Now, what I've observed is that reference types kind of limit what the parser can do. I'm not sure the article addresses that. Like in my last example, I can validate the range of take over the size of the collection, but not if the collection is a variable. How do I track those references? Maybe somewhere else in the program you can find the collection literal, but not here.

P.S.: It's hard to know why Clojure demonstrates relative safety, and I assume here it has a little bit to do with the abstractions it uses and this parsing Spec mechanism. But it's also likely it's all due to the REPL or something else.

[1] https://www.i-programmer.info/news/98-languages/11184-which-... - shows Clojure as having one of the lowest numbers of defects per commit


Clojure, for those unfamiliar with the language, unsurprisingly does in fact allow for both 'for' and 'while' loops, which are part of the core language library.

https://clojuredocs.org/clojure.core/while

https://clojuredocs.org/clojure.core/for

And here are examples picked from a random repository of these constructs being used in the wild:

https://github.com/ring-clojure/ring/search?q=while

What clojure does do is also provide a number of higher level control constructs, and, by the nature of the syntax of the language, puts them on the same level as 'while' and 'for', which themselves are part of a library, rather than baked into the language itself.

This can, of course, be done in other languages, using fluent interfaces.

https://en.wikipedia.org/wiki/Fluent_interface

But is typically much more awkward and harder to compose. Also, you typically have to wrap expressions in lambdas.


Not to be too nit-picky, but it doesn't actually have a for-loop. The for you linked to is not a for-loop but a for-each loop (simply called for). Well, it is actually even more restrictive than a for-each loop, as it maps over elements and can only return a collection of the same length as the one it iterates over.

That said, you're correct that it does have a pseudo while-loop. Though it will still force you to use a managed mutable container along with it, which is still safer than without. Also, it's a very awkward while-loop, as it doesn't support continue or break; that's because in reality it isn't a real while-loop, it's a fake while-loop built on top of a recursive loop.

And for all other readers, it's true, Clojure doesn't disallow the use of less abstract, less safe constructs, it simply makes them more awkward to use and not the default and obvious choice. The reason for this is also something the article didn't bring up, but higher-level, safer abstractions often come with a performance cost. This is why most languages have escape hatches for whatever safety mechanisms they provide, and why Rust loves to talk about "zero-cost" abstractions.


Clojure's "for" is really a list comprehension function, but Clojure's "doseq" is more or less what you'd expect from a for loop (like Python's anyway or any other range-based for).


Python's for isn't a for-loop either, it's a for-each loop also sometimes called a for-in loop.

In that sense, neither Python nor Clojure have real for-loops. But Python gets much closer, because Clojure's doseq is similar to its while, it is actually syntax sugar for a recursive loop, and not an imperative loop. While Python's for is an imperative for-each loop, which supports the statements break, continue and return inside it.


doseq isn’t imperative because it doesn’t support break or continue? That doesn’t make sense.


Oh sorry, I meant to say iterative, not imperative.

Basically all loops in Clojure are implemented in terms of recursion (loop/recur) and not jumps. Which is why there's no break, continue or return statements.

Whatever categories you assign them in, I was more trying to show the differences between Python's for and Clojure's doseq. And also show that neither are your classic for-loop.


It is iterative too, though: it iterates through an input sequence.

> Whatever categories you assign them in, I was more trying to show the differences

Fair enough. I take your overall point, I just don't think that not being a classic for-loop matters so much in practice, certainly not "most of the time" anyway.


> It is iterative too, though: it iterates through an input sequence.

I don't think these words have super formal definitions to be honest. I'm using iterative to contrast it against recursive. So in that respect, Clojure's doseq is recursive. You can also see it as iterating over a sequence and thus say it is iterative, but the way it loops over the sequence is through recursion, which is where doseq differs from Python's for. Maybe it's clearer if I say that doseq is recursive while Python's for isn't, and just not mention iterative at all haha.

> Fair enough. I take your overall point, I just don't think that not being a classic for-loop matters so much in practice, certainly not "most of the time" anyway.

I don't know what you mean by "matter", but with regards to the article I'm discussing, I'd say it does. A classic for-loop is prone to well-known, easy-to-make bugs, such as off-by-one errors and overflows. So the fact that the language has you use this higher-level abstraction for looping over collections can help prevent a certain amount of such errors and thus provide additional program safety.

Now I don't know if the further differences between Clojure doseq and Python for would also result in safer code. I guess the question is: is the use of break and continue a common source of defect? And is forcing branching instead a safer alternative? Personally I don't know, I'd say maybe not.

That said, Clojure also favours the use of its immutable for or reduce over doseq, and that is arguably much safer, because mutable state inside a for-each loop is also a known source of common defects, like changing the sequence as you loop over it, and especially if concurrency is involved. So that's another relevant practical difference in my mind.


> So in that respect, Clojure's doseq is recursive.

The implementation is recursive, but the user facing interface isn't really.

> I don't know what you mean by "matter"

I mean that in any real world code, at least any that I've encountered in ten years of Clojure and ~20 of Python, the difference has not impacted any actual code that I've written. Sure, in Python I have seen and used "break" but in Clojure it has never been necessary.

> A classic for-loop is prone to...

Ok, so you're actually arguing in favour of non-classic for loops? I certainly agree. Even in C++, I try to avoid traditional loops in favour of range-based loops or the standard library abstractions like std::for_each, std::transform etc. I think maybe we're on the same page, I thought you were arguing that the lack of classic for-loops was a deficiency and I was just saying that in Clojure it doesn't actually matter that its loops aren't the classic ones.

> Now I don't know if the further differences between Clojure doseq and Python for would also result in safer code.

Probably not. I think Clojure does result in safer code, but for other reasons than the differences between doseq/for, as you say yourself. I don't think break is a common source of defect, personally, but I also don't find the lack of break a problem in Clojure: you only use doseq when you need side-effects, why would you need to break/continue? Typically you'll already have filtered the list before calling doseq anyway.

> Personally I don't know, I'd say maybe not.

I agree, I don't think its a big deal either way.

> Clojure also favours the use of its immutable

Yes, I think that (at least in my own experience), Clojure tends to be safer and less error prone, but the reasons are not its loops, they're mainly due to its immutable-by-default nature.

I think that over all, we are more or less in agreement. But I did overlook that doseq doesn't support break/continue in my original comment, so thanks for pointing that out!


Yup, we're in agreement. My point is actually that using higher-level abstractions, such as how Clojure does it, leads to safer code, i.e., code less prone to bugs and security vulnerabilities. I also personally think it often yields better readability as well, once you get used to the abstractions. And, when needed, such as for raw performance, there are still escape hatches.

> Yes, I think that (at least in my own experience), Clojure tends to be safer and less error prone, but the reasons are not its loops, they're mainly due to its immutable-by-default nature

Definitely immutability is a big one. Though I think depending what you compare it against, the loops help as well. Like languages where the only iteration is through while or for-loop (the kind that takes a condition). Or those that lack higher level declarative constructs like filter, any?, every?, partition, etc.

Also, related to immutability, it kind of implies loops must be recursive, because if you think about what a standard for-loop is:

for(i = 0; i < 10; i++) {}

If your language doesn't have (or discourages) mutable variables, how does i work?
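
(To illustrate the "loops must be recursive" point above, the same loop as a Haskell sketch:)

    -- for (i = 0; i < 10; i++) { body } as recursion: each call
    -- rebinds i to a new value instead of mutating it.
    loop :: Int -> IO ()
    loop i
      | i < 10    = do
          print i          -- the loop body
          loop (i + 1)     -- "i++" is a fresh binding, not mutation
      | otherwise = pure ()

    main :: IO ()
    main = loop 0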


I suppose the loops do help in the sense that looping in Clojure tends to favour using higher level abstractions. I don't think I've ever used while, I use doseq only for side-effects, for is for list comprehensions (or for when its more readable than map), but most of my looping is actually through map, reduce, filter/remove, etc. So I suppose you're right because that is definitely less error prone than C-style for loops. So yeah, as you say, the higher level sequence abstractions definitely do help reduce errors in comparison to traditional loops.

> how does i work?

The language can handle the mutability internally, perhaps? That is, the user only sees immutability, but under the hood, the language maintains the state mutably. But I guess that's just an optimisation over recursion since the end-users code looks the same as if it were recursive. Kinda like 'reduce'.


Sibling thread went into the nature of things in Clojure that are not "true" low level loop constructs, so just to point out, Clojure does also have loop/recur where you have more control on looping.


Types are just labels indicating that a parse has already been done and doesn't need to be done again. Use them when you don't need to parse again; don't when you do want to parse again.


The parts of this that are fully-baked aren't about parsing or type-checking.

  - the trivial language given doesn't solve anything
  - dependent-typing is more type-checking, not less
  - preferring internal iteration over indexing is several degrees removed from the type-checking issue
  - on E: dynamic typing requires type-checking at runtime
  - DSLs are not at odds with static typing
I like this essay. It's a walk through a space of ideas guided by a what-looks-interesting greedy algorithm, along the lines of an "essai" [1]. But I think it's stretching awfully hard to fit the theme it started out with.

1: http://www.paulgraham.com/essay.html


Thanks. For what it's worth, the flow of thought was the following:

1. The original "parse, don't validate" essay clearly describes why parsers are preferable to validators.

2. It uses a type system to make illegal states unrepresentable in a programming language - but a type system is a validator under its own terminology.

3. What would a parser-like approach to solving the problem look like?

I believe that actually adjusting the programming language grammar (and thus the actual parser) is clearly a parser approach to solving the problem. Abstraction is also, to me, a parser-like approach to solving the problem - it transforms code that allows illegal states into code that enforces particular invariants. I probably didn't make those connections explicit enough in the blog.

Perhaps the diversion into object-capabilities is a bit off the track, but I found it an interesting connection at least. (And frankly, I'm usually close to the sole audience of my blog so I better make it interesting for me!)

Edit: formatting


I see, it's an interesting line of thought but I think trying to move type checking into the grammar is fundamentally a bad idea. You don't want to mix grammar and semantics because that's not how people think about code.

That said I agree that it is interesting to see what the 'parse, don't validate' approach to type checking would be, but I think it should still take place on the type level.

And I think this leads you towards interfaces, though perhaps we can improve interfaces by embracing the idea. Taking C# as an example: interfaces basically ensure that if the type check succeeds, then the function will be given an object that supports the methods of the interface, and only the methods of that interface. Further, the method and field names of the interface can conflict with the field and method names of the object, but the type checker 'resolves' those conflicts and essentially returns a 'parsed' object with the right methods.

Unfortunately this doesn't yet allow for arbitrary conditions (explicitly). Provability is always going to be an issue so let's say we fix that by dynamic type casts. Then I think the behaviour we want is essentially an interface defined by a function

IMyInterface parse(MyObject: MyType)

where the constructor of IMyInterface enforces some checks and an object implements the interface if it defines an implementation of 'parse'.

Now we only need a language to support this. Maybe we can persuade the Julia guys? They seem to like this kind of type-polymorphism based approach.
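
In Haskell terms, something close can already be sketched as a class of checked conversions (illustrative names, with the possibility of failure made explicit):

    -- "An object implements the interface if it defines parse."
    class ParseInto a where
      parseInto :: MyType -> Either String a

    newtype MyType = MyType Int

    -- The "interface" whose constructor enforces a check.
    newtype Positive = Positive Int

    instance ParseInto Positive where
      parseInto (MyType n)
        | n > 0     = Right (Positive n)
        | otherwise = Left "not positive"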

Edit: Just checked, apparently they have implemented it, it's just called 'convert'.


> I see, it's an interesting line of thought but I think trying to move type checking into the grammar is fundamentally a bad idea. You don't want to mix grammar and semantics because that's not how people think about code.

I know that it’s common to refer to type checking as “semantic analysis”, but the logician in me is not happy. There’s nothing inherently more “semantic” about types compared to grammars. As Benjamin Pierce says [1]:

> A type system is a syntactic method for enforcing levels of abstraction in programs.

(Emphasis mine).

> That said I agree that it is interesting to see what the 'parse, don't validate' approach to type checking would be, but I think it should still take place on the type level.

You can certainly write parsers at the type level, but that doesn’t change the fundamental fact that a type checker is a validator. To change a type checker into a parser it would have to output a representation which makes invalid states impossible to represent. From a programmer’s point of view (not the compiler writer), this means that the actual compiled object code would have to have this property. I can see a couple of possible ways you could do this:

- the compiled object code enforces invariants using dynamic checks provided by the runtime/processor

- the type-checker outputs something like proof-carrying code [2] that can be rigorously checked prior to execution to ensure the illegal states removed by the type system cannot be recreated.

[1]: https://www.cis.upenn.edu/~bcpierce/tapl/

[2]: https://en.wikipedia.org/wiki/Proof-carrying_code


>I know that it’s common to refer to type checking as “semantic analysis”, but the logician in me is not happy. There’s nothing inherently more “semantic” about types compared to grammars.

They can be semantic and in practice are used semantically, especially in nominal type systems. It depends on the programming language how much they enforce, though. But even if you don't enforce anything, names carry meaning, and to some extent meaning isn't even about what should be impossible but about what you should expect (though some states should be impossible). For instance, there's no restriction you can build in that would make 'DegreesFahrenheit' and 'DegreesCentigrade' any different; in fact, you could argue that any place one of them can be used, the other is also valid. The meaning, however, is very different.

>To change a type checker into a parser it would have to output a representation which makes invalid states impossible to represent.

Agreed.

That doesn't require you to do anything at the level of the grammar of the language though; parsing the text a programmer wrote and parsing an object to a certain type are two very different operations. It doesn't matter how a programmer produced the AST, what matters is whether it makes sense (which is a semantic matter, as the parser already forces the AST to be grammatically correct).


> They can be semantic and in practice are used semantically, especially in nominal type systems. It depends on the programming language how much they enforce though, but even if you don't enforce anything names carry meaning [...]

But grammar productions are also named and grammars are also designed to capture semantic notions. The syntax is after all intended to convey meaning.


> I believe that actually adjusting the programming language grammar (and thus the actual parser) is clearly a parser approach to solving the problem.

How do you get the properties of type safety (progress and preservation) in this case?


The title suggests that type checking should be replaced with parsing, but that's not the article's message. Rather the article recognizes that powerful type systems (more powerful than in many mainstream languages, perhaps) are a great tool to enforce certain invariants. However, those type systems (dependently typed ones, in particular) are also complex and often fickle beasts. The article suggests that it may therefore be worth replacing them with different, less powerful, but easier to work with approaches.


In re: E and capabilities, see also "macaroons":

> This paper introduces macaroons: flexible authorization credentials for Cloud services that support decentralized delegation between principals. Macaroons are based on a construction that uses nested, chained MACs (e.g., HMACs) in a manner that is highly efficient, easy to deploy, and widely applicable.

> Although macaroons are bearer credentials, like Web cookies, macaroons embed caveats that attenuate and contextually confine when, where, by who, and for what purpose a target service should authorize requests.

https://research.google/pubs/pub41892/


I don’t understand the idea here. Isn’t typechecking a means of making impossible states impossible? The article references ideas like non-empty arrays or limiting the input from all numbers to just the integers between 1 and 5 inclusive.

These are both ideas you can express in a type system (quite trivially in most as well). But instead we should use a function at runtime to parse the input and return an object of this type?

Is this article against static type systems, or against runtime type checks? Or both...

Personally I love having a type system, and TypeScript (what I work with professionally) allows for making impossible states impossible, just like many other type systems.
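
For what it's worth, both examples are short in most modern typed languages, and the runtime parse is just a smart constructor at the boundary; a Haskell sketch (illustrative names):

    import Data.List.NonEmpty (NonEmpty, nonEmpty, toList)

    -- "Integers between 1 and 5 inclusive", checked once, up front.
    newtype Rating = Rating Int

    mkRating :: Int -> Maybe Rating
    mkRating n
      | n >= 1 && n <= 5 = Just (Rating n)
      | otherwise        = Nothing

    -- Downstream code only ever sees already-checked values;
    -- nonEmpty :: [a] -> Maybe (NonEmpty a) plays the same role
    -- for the non-empty-array case.
    average :: NonEmpty Rating -> Double
    average rs = fromIntegral (sum ns) / fromIntegral (length ns)
      where ns = [n | Rating n <- toList rs]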


> Is this article against static type systems, or against runtime type checks? Or both...

It's not "against" anything. It's pointing out that "make impossible states unrepresentable" is poorly satisfied by most typed languages themselves. The more of the language you lift into the parser (decidable! easily-analyzable memory bounds!), the less you need to worry about types.

As a trivial case consider a language where `5 + ""` would fail typecheck. But why even get so far? Why does your parser accept `lit-term := lit '+' lit` and not `lit-term := [str '+' str] | [ num + num ]`, etc.


Because the parser can't typecheck "ident '+' ident", so why make it vastly more complicated only to overlap with a later pass that is necessary anyway?


What constitutes "vastly more complicated"?

"The type checker can't find all errors, why make the compiler vastly more complicated only to overlap with my unit tests later anyway?" is a pretty common argument. What differentiates your case aside from your personal aesthetics?


> The more of the language you lift into the parser (decidable! easily-analyzable memory bounds!), the less you need to worry about types.

Doing the easy cases of type-checking at a different time doesn't allow you to worry about types less, because you are still doing that work; just at a different time. Making it necessary to pass type information between phases increases the amount of worrying about types.

> As a trivial case consider a language where `5 + ""` would fail typecheck. But why even get so far? Why does your parser accept `lit-term := lit '+' lit` and not `lit-term := [str '+' str] | [ num + num ]`, etc.

This approach increases the complexity of parsing so much that your trivial example is wrong. What good is `lit-term` in a language with different expression-trees for different value types? You need something like `maybe-str-expr := [maybe-str-expr '+' maybe-str-expr] | [function-call] | ...` -- where everything is `maybe-`, because enough type information is available to determine that certain things are wrong, but not enough is available to ensure everything is correctly-typed. We could strictly improve on this situation by increasing the scope of the parser, to include looking up the types of items so that it can fully type-check the tree and accept only validly-typed programs. We can further improve the situation by splitting the parser into two phases: one that just identifies the lexical structure of the input, and one that does the type-checking. Let's just not call that part a type-checker -- it's part of the "parser" now.


Congratulations, you have demolished my trivial example made to clarify for someone who was having trouble understanding the point of the article. I would much rather you engage the ideas in the article directly, and not the deliberately trivial pedagogical case.


> What constitutes "vastly more complicated"?

You're duplicating special cases of the typechecker into the parser, completely unnecessarily (since the typechecker will handle it) and for no gain (since this is an approximately non-existent occurrence).

Worse, by rejecting the program so early you're making it much harder to provide any sort of valuable feedback.


>parser (decidable! easily-analyzable memory bounds!),

Hahaha, you wish. Sure there are languages that are simple to parse, but they are that way by construction. Parsing arbitrary grammars is hard to impossible depending on how general your class of grammars is.


I'm well aware of the limitations of some classes of parsers.

> but they are that way by construction

If I could summarize both this and the original "parse, don't validate" article, it would be "let's construct as much as possible this way."


Fair enough but it wouldn't surprise me if you just get the exact same problems you get with type systems except phrased differently. You already hinted at this somewhat with your "num = num + num" example, surely that's just a type system in disguise?

And well, personally I view types more like a way to denote pre- and post-conditions (with varying degrees of strictness). And a grammar just tells you what you are allowed to write, it doesn't dictate semantics. This becomes a problem when you get stuff like "daily_revenue + temperature", which grammatically makes perfect sense, but semantically does not. With types this can be fixed (admittedly in this case it might not be worth fixing), in a grammar this seem very hard to forbid.


> As a trivial case consider a language where `5 + ""` would fail typecheck. But why even get so far? Why does your parser accept `lit-term := lit '+' lit` and not `lit-term := [str '+' str] | [ num + num ]`, etc.

Because it's an unnecessary complication to duplicate a few corners of the typesystem as special cases of the parser?


I think there were two main concepts:

1) When implementing a type system, you could have a separate representation for a typechecked AST and for a not-yet-typechecked AST, in the same way that you normally have a separate representation for a parsed AST and for a not-yet-parsed token stream. This would guard against invalid states within the compiler by making the type-checking pass another parsing pass instead of a validation pass (see the sketch below).

2) The author then seemed to veer into applying this philosophy to application code. However, the author also seems like they've had a lot of experience with languages where the line between parser and application is blurred, so in that context this isn't such a big leap.
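
To make concept 1 concrete, here's a rough Haskell sketch of that two-representation idea; the names and the GADT encoding are my own illustration, not anything from the article:

    {-# LANGUAGE GADTs #-}

    -- A raw AST straight out of the parser: it can still represent
    -- ill-typed expressions such as 5 + "".
    data RawExpr
      = RawLitInt Int
      | RawLitStr String
      | RawAdd RawExpr RawExpr

    -- A second representation that only admits well-typed expressions.
    data TypedExpr a where
      LitInt :: Int    -> TypedExpr Int
      LitStr :: String -> TypedExpr String
      Add    :: TypedExpr Int -> TypedExpr Int -> TypedExpr Int

    -- The type-checking pass "parses" RawExpr into TypedExpr (shown here
    -- only for integer expressions) rather than merely validating the
    -- RawExpr in place.
    checkInt :: RawExpr -> Maybe (TypedExpr Int)
    checkInt (RawLitInt n) = Just (LitInt n)
    checkInt (RawAdd l r)  = Add <$> checkInt l <*> checkInt r
    checkInt _             = Nothing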


Making impossible things impossible seems off. Type checking, by name, is just checking types. Akin to units checking.

That said, some things may be impossible piecemeal, but correct as a whole. The easiest example is really any composite system: the parts usually do not stand on their own. A gear without the bike is effectively impossible to use, but still correct, after all.


> Making impossible things impossible seems off.

It's the entire purpose of a static type system.

All will fail to provide the necessary tools at some point, and many will fail almost immediately, but types are how you encode what is possible, and typechecking is thus what excludes the impossible.

> Akin to units checking.

And dimensional analysis is how you know you fucked up and your formula is nonsensical before you even apply it.


While it’s true that a gear without a bike is not very useful, one could also argue that it’s effectively impossible to misuse from the perspective of cyclist safety.

Anyway, the real power of the type system is that it gives you, the designer, the ability to define what is correct and what is not by composition. For example, you might define a type, CompleteBicycle, which is used as an argument to a ride() function. In your definition of CompleteBicycle, you could require a non-null gear property of type Gear. Sure, Gear exists independently in the type system, but you haven’t defined any uses for it, so it (by definition) can’t be used (or misused).
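
A minimal sketch of that in Haskell (Frame and the field names are assumptions of mine, just to flesh out the example):

    -- By construction, ride can only receive a bicycle that has a gear.
    newtype Gear  = Gear  { teeth :: Int }
    newtype Frame = Frame { size  :: Int }

    data CompleteBicycle = CompleteBicycle
      { frame :: Frame
      , gear  :: Gear  -- required field: no CompleteBicycle without a Gear
      }

    ride :: CompleteBicycle -> IO ()
    ride bike = putStrLn ("Riding on a " ++ show (teeth (gear bike)) ++ "-tooth gear")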

Edit: Grammar


Making incorrect states impossible to represent is exactly what type systems are about. If you're still thinking in terms of something that's basically checking units, I'd recommend experiencing a more advanced type system like Haskell's, at least for a few hundred or a few thousand lines of code. C or Java are terrible examples of what a type system is capable of.


It's extremely formalized, but at least for me the foundation I always come back to when trying to explain type systems is the taxonomy in chapter 1 of Stepanov and McJones's Elements of Programming [0]. Type systems help ensure your binary data are well-formed values (and your bits are numbers, your numbers are temperatures, etc., up various hierarchies). I find this is more convincing initially for mechanically sympathetic programmers (including myself) than any taxonomy of endofunctors.

[0] http://elementsofprogramming.com/


This is awesome, thanks for the resource!


> Making impossible things impossible seems off. Type checking, by name, is just checking types.

That's the purpose of (static) type checking: to prevent incorrect programs from running. Ideally your types encode your valid programs (with some limitations).


Pascal has subrange types -- you can declare a type that is 1..10 for example.

Statically typed languages could be stricter and offer more of these basic typing features. In most languages it's not even as good as units checking, because you can't say numeric type X is incompatible with numeric type Y (e.g. Celsius and Fahrenheit).
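
In a language with cheap type wrappers this is easy to get; e.g. a rough Haskell sketch:

    -- Distinct numeric types that the compiler refuses to mix.
    newtype Celsius    = Celsius    Double deriving Show
    newtype Fahrenheit = Fahrenheit Double deriving Show

    toFahrenheit :: Celsius -> Fahrenheit
    toFahrenheit (Celsius c) = Fahrenheit (c * 9 / 5 + 32)

    addCelsius :: Celsius -> Celsius -> Celsius
    addCelsius (Celsius a) (Celsius b) = Celsius (a + b)

    -- addCelsius (Celsius 20) (Fahrenheit 68) -- rejected at compile time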


I'm not sure I buy the premise. The point of the original article was that you shouldn't dig a hole for yourself to get out of later. Compiler writers usually go out of their way, digging extra-deep holes, to accept a broad range of invalid input. Not because they like complexity---though, as a group, they... I digress---but because it lets them give more valuable user feedback.


Yeah, I am currently writing a compiler; it turns out that for useful error messages my parser should accept a superset of my language, then validate it afterwards.



The way the D compiler does it is by parsing until it finds an error, marking the node as "poisoned", then walking back up until it can keep going.

It seems to work, although I would include a "fail early" flag just to help with enormous error messages (if you apply the principle to sema as well).


Yeah, the main coherent point I take out of this is that it makes sense to provide capabilities rather than to check them, i.e. if you have access to a certain API then you're by construction allowed to use it. It seems this is something the author has been working on.

This is then somehow shoehorned into "Parse / Don't typecheck" but it doesn't make much sense honestly.


Is it just me misreading it, or is the author trying to reinvent dependent types?


I would certainly hope that a type "checker" returns a more detailed representation of a program than a parse tree - one in which all expressions are typed and ill-typed expressions cannot be represented. So the premise is fundamentally wrong.

Are there non-type ways to improve program safety? Maybe. But since we all understand types and types can cover all the cases we've managed to find so far, I'd stick to types unless and until we can demonstrate a need for something else. Yes, handing out fundamentally unsafe APIs like array-access-by-index is a bad idea - but trying to put types on those APIs is the easiest way to see that they're the wrong abstraction. If you insist that your APIs be type-safe, you'll naturally be guided away from APIs like get-at-index and towards APIs like map / fold / find.
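
A rough Haskell sketch of that guidance (safeIndex is my own name): the honest type of indexed access is forced to admit failure, while a find-style API avoids constructing a bad index altogether:

    import Data.List (find)

    -- The honest type for indexing has to admit failure...
    safeIndex :: [a] -> Int -> Maybe a
    safeIndex xs i
      | i < 0     = Nothing
      | otherwise = case drop i xs of
          (x:_) -> Just x
          []    -> Nothing

    -- ...whereas a find-style API never constructs an index at all.
    firstOver100 :: [Int] -> Maybe Int
    firstOver100 = find (> 100)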


> I would certainly hope that a type "checker" returns a more detailed representation of a program than a parse tree - one in which all expressions are typed and ill-typed expressions cannot be represented.

Do you have an example of this? It is not my experience of compiler intermediate representations or ASTs that they have this property.

> But since we all understand types and types can cover all the cases we've managed to find so far, I'd stick to types unless and until we can demonstrate a need for something else.

This doesn’t really argue with anything I wrote in the article. (Although I’d argue that nobody strictly “sticks to types” as their sole way to enforce correctness).

> trying to put types on those APIs is the easiest way to see that they're the wrong abstraction

I’m not sure I’d say it’s the easiest, but it certainly helps. I love Haskell as a language, primarily because it has a lovely set of well-designed abstractions. It’s clear the design of those abstractions was guided by types and benefited from that. Many of them remain good abstractions when translated to untyped languages.


> Typed intermediate language. GHC has always used a statically typed intermediate language, one of very few production compilers to do so. (Maybe even the only one.) Let's keep that property. Richard's thesis makes it clear that doing so is not straightforward.

-- Simon PJ (https://github.com/ghc-proposals/ghc-proposals/pull/378#issu...)

So it looks like it's rare to even have a typed intermediate language, let alone one whose structure forces it to be well typed. The latter would require some sort of dependent types.


I like the principle of "make unauthorized states unrepresentable", but it makes me nervous, or at least evokes a nuance that is well addressed by https://github.com/tree-sitter/tree-sitter

> Robust enough to provide useful results even in the presence of syntax errors

Especially when it comes to human-generated input (but even outside of that) it is beneficial to have a representation for unauthorized states, so long as you can easily detect that that's what you're representing. That allows you to better report errors, or try to mitigate them.

In other words, don't forget to have an https://wiki.c2.com/?EscapeHatch


The dictum is usually phrased as, "make illegal states unrepresentable". Perhaps this is just playing with semantics, but, to me, that difference in wording is incredibly important.

An unauthorized state (which I'm interpreting here to mean an error state) may not be on the happy path, but that doesn't necessarily mean it's invalid. For example, "the user is trying to open a file, but they don't have permission to read it" is a valid state. The resulting error condition is also valid, even if it's not somewhere they want to find themself.

An illegal state is, in the context of this phrase, meant to be one that's nonsensical. Perhaps "invalid" would be a better word. Think negative array indexes in languages that only let you index from the front of the array, or objects that are only halfway initialized. Situations where, having entered such a state, the behavior of the program is now undefined. We should strive to make it physically impossible to get the program into a state where its behavior is undefined.
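
A tiny Haskell sketch of the distinction (my own illustration): the unauthorized-but-valid state gets a representation you can report, while the illegal state gets none:

    import Numeric.Natural (Natural)

    -- Unauthorized but valid: representable, so it can be reported.
    data OpenError = PermissionDenied | FileNotFound

    -- Illegal and unrepresentable: Natural has no negative values,
    -- so a negative index cannot even be written down.
    newtype Index = Index Natural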


The author clearly does not understand what type checking is about.

In a good strongly-typed language, types don't just prevent errors. They are put to work to perform computation. They choose overloads, and therefore control flow. Different overloads produce different results, therefore data.

In the extreme, the type organization of a library is itself a program, one that runs at compile time, and generates the program that will then be compiled and run.
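
A tiny Haskell sketch of what I mean by types choosing overloads (an illustrative class of my own, not from any particular library): the instance, and therefore the control flow and the result, is selected by the type at compile time:

    class Describe a where
      describe :: a -> String

    instance Describe Int where
      describe n = "an Int: " ++ show n

    instance Describe Bool where
      describe b = if b then "yes" else "no"

    main :: IO ()
    main = do
      putStrLn (describe (42 :: Int))  -- the Int overload is chosen
      putStrLn (describe True)         -- the Bool overload is chosen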


I'd say that the type-checker in e.g. Haskell does already do a bit of the "parsing" here, in that it takes code with some type annotations, and returns (on success) that code with fully resolved types. I.e., type variables will be resolved, missing annotations inferred, etc.

That said, the idea of moving parts of the type-checking to the parsing stage is novel and interesting.


> That said, the idea of moving parts of the type-checking to the parsing stage is novel and interesting.

It's exactly the opposite of novel: from the very beginnings of parsing theory, its goal was to separate lexical, syntactic, and semantic analysis. The benefits of the separation are both better run-time performance and smaller, simpler code.


Isn't this just saying the exact same thing as the original article, but with pedantic word games? A type checker includes a parser, a lexer, and a syntax-tree builder, and the type checking itself is the specific grammar the author is looking for. Am I missing something?


Reading articles like this makes me want to avoid Haskell.

It seems like the language is so over-complicated that it requires developers to invent an entire religion to be able to effectively apply it to real world use cases.

From the perspective of a non-Haskell coder who has worked with many different mainstream languages, I find that many articles written about Haskell, Erlang and Elixir seem to have no meaning outside of their own narrow domain; they try to present some solutions as universal insights applicable to all languages. In fact, they seem to revolve around solving problems which the language itself created and which only exist in that language.

I like languages which stay out of my way and don't create any new problems or require inventing unnecessary abstractions and ideology (my OOP philosophy is so simple, I could explain it to a 10-year-old). I can't believe that you would need several articles to explain to developers how to do something as trivial as validation. It's not rocket science (...well, it shouldn't be). In the communities I participate in, most developers seem to intuitively know how to do validation... There are multiple ways to do it. You just have to think a bit to figure out which way is best for your project.

A lot of these trendy languages seem to be on a futile mission to try to automate away some of the complexity of building software. Building software is irreducibly complex because there are always going to be a huge number of possibilities at every step; this is because the real world (which software tries to model) is complex, always more complex than it seems. That's why I like simple expressive languages which stay out of my way. Any tool which constrains my thinking into specific narrow patterns is going to make my work harder and more error-prone, not easier.


Reading articles like this makes me want to avoid OOP.

It seems like the concept is so over-complicated that it requires developers to invent an entire religion to be able to effectively apply it to real world use cases... (my procedural philosophy is so simple, I could explain it to a 10-year-old).

There are many ways to build software, but only a few ways to build it reliably. One major aspect of reliable software is composability, so that we may ensure correct large systems are built out of correct smaller systems. Correctness in the small trivial bits and their glue is what makes correctness in the large. You say you just have to think about what's best for your project. Why not think about and understand the problem in general, and build tools to tackle the specific complexities you have?

Feel free to not use Haskell. Just keep in mind that you too live in a bubble with a certain mixture of doctrine and trade-offs.


Haskell has been around for a really long time, almost 4 decades. It's used quite a bit by programming language theorists and category theorists to prove their mathematical theorems in a real language. It's not trendy or experimental, although some parts are one or both. There are about 5 ways to do everything in Haskell, but that's not bad once you understand it. It's not a language you can just pick up in a weekend and easily Google questions about; it's generally used for applications that require some higher level of data and quality assurance, from a fully proven program using Coq to simply a program that's easier to use equational reasoning on, like industrial processes or stream-processing pipelines.

Erlang has been around about as long. It's also not trendy, and it's designed for telecommunications applications, where the extremely distributed actor model is great. Others might try to shoehorn it everywhere, mostly via Elixir, but it's a very reasonable language for many uses, like chat programs or distributed in-memory stores.

You might want to understand something before you crap all over it, everything exists for a reason, try to figure out the reason.


Your statement about Erlang in telecom is mostly a myth. There was one somewhat successful niche piece of equipment from Ericsson, and even that was ~20 years ago. Erlang is basically unheard of in the telecom field.


Interesting, do you have any details on that? I've always seen its origin story start in telecommunications.



Haskell simply makes you do a thing you should do: check that the thing you expect to be there is actually there. You can think of it as the compiler acting as a testing framework.

On the other side there are things like TypeScript, where selecting the nth index of an array does not give you a type representing the fact that the item may not exist, and thus does not force you to check whether it does. You may call this "getting out of your way", but I don't see how this makes software less complex - it makes bugs more difficult to track down and necessitates more tests, which could simply be removed by an adjustment in types.

TypeScript/JavaScript is easier to learn, but dear God does it offer so many ways for you to shoot yourself in the foot.


> I find that many articles written about Haskell, Erlang and Elixir seem to have no meaning outside of their own narrow domain; they try to present some solutions as some universal insight which is applicable to all languages.

As a developer who does not use Haskell (but understands most of it), let me put it into perspective:

If you write an article about the visitor pattern or any other kind of OOP pattern, how do you think it is perceived by a developer who has only ever written assembly?

Yes, they will think in the same way you think right now. This phenomenon has been described e.g. here: https://wiki.c2.com/?BlubParadox

The fact is that, from the perspective of a Haskell developer, you are writing unorganized spaghetti code when not using the techniques they write about. In the same way, you would probably see an assembly developer as writing spaghetti code instead of using OOP techniques.

> I like languages which stay out of my way

Languages that stay out of _your_ way are languages that encourage the style that _you_ are used to. There are no languages that generally stay _out of the way_ because that's not possible and not even desired. So saying this just means "I want a language where I don't have to learn another way of doing things". That's fair, but it should be emphasized.

> my OOP philosophy is so simple, I could explain it to a 10 year old

Many people learn Haskell first and then have trouble learning classical OOP in e.g. Java. And the other way around. What does that tell you?


> I find that many articles written about Haskell, Erlang and Elixir seem to have no meaning outside of their own narrow domain; they try to present some solutions as some universal insight which is applicable to all languages. In fact, they seem to revolve around solving problems which the language itself created and only exist in that language.

I don't find that to be the case often; Haskell articles, specifically, often have solutions to problems that static typing brings to the fore (they usually exist without it, but are less obvious without static typechecking and the solution approaches are sometimes different) and may present solutions that require a type system more powerful than is typically available outside of Haskell (though they are usually translatable to a weaker type system with some loss of safety.)

In fact, I've learned a lot applicable to programming in languages like Python, Ruby, JavaScript, and C# from learning a bit of (and reading articles focussing on) each of Haskell, Clojure, Erlang, and Elixir. Most paradigms can be applied productively in most languages if you are familiar with the paradigm and target language (though there are definitely things that depend on being on one side or the other of the static/dynamic divide). And even workhorse languages these days tend to be multiparadigm to start with, so learning techniques from the languages where a paradigm has been most fully explored and is most naturally expressed is often a good source of understanding to apply in other languages when that paradigm is the right choice.

> A lot of these trendy languages seem to be on a futile mission to try to automate away some of the complexity of building software

All languages higher level than raw machine code exist for primarily that purpose (and, for that matter, the design of modern machine instruction sets itself does some of that, too.) That isn't limited to “trendy” languages, which probably just means popular ones that are outside of your particular zone of preference established through familiarity.


> Reading articles like this makes me want to avoid Haskell.

Huh? The article doesn't even mention Haskell!



I use a parser to check for a non-empty list return. It’s called a unit test.

All joking aside, what is the practical difference or benefit here compared to testing?


Very simple.

Say you have a user that inputs a list. And then you have another function somewhere in the code that takes a list, sorts it, and uses the max value to send a notification email.

If it's done your way, then you'll have to add an assertion in that function, such as

    if(input_list.length < 1) explode("Bug! Bug! You can't call me with an empty list")
And you will have to write multiple unit tests _each time_ you call this function from somewhere, to make sure it's not called with an empty list.

If you use a compiler... all of this goes away. The function just expects you to give it a non-empty list and if you don't the code just doesn't compile.

Less code to maintain = win.

The price you pay for this is learning, understanding, and using a more advanced type system. That's not a small price to pay. Pick your poison.
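
Concretely, in Haskell that might look like this rough sketch (sendEmail is a stand-in of mine):

    import Data.List.NonEmpty (NonEmpty)

    -- This function cannot even be called with an empty list,
    -- so no assertion and no per-call-site unit test is needed.
    notifyMax :: NonEmpty Int -> IO ()
    notifyMax xs = sendEmail ("max value: " ++ show (maximum xs))

    sendEmail :: String -> IO ()  -- stand-in for the real thing
    sendEmail = putStrLn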


Your example isn’t even what the articles attempt to suggest, and indeed, it’s not even possible, since there’s no way to ensure that the program passes non-empty lists at compile time.

For example, the grandparent article compares these:

  validateNonEmpty :: [a] -> IO ()
  validateNonEmpty (_:_) = pure ()
  validateNonEmpty [] = throwIO $ userError "list cannot be empty"
 
  parseNonEmpty :: [a] -> IO (NonEmpty a)
  parseNonEmpty (x:xs) = pure (x:|xs)
  parseNonEmpty [] = throwIO $ userError "list cannot be empty"
He then states:

  “Both of these functions check the same thing, but parseNonEmpty gives the caller access to the information it learned, while validateNonEmpty just throws it away.”
But the caller still needs to deal with the empty case. It’s just moved up higher in the code, which makes checking easier. He’s referring to this as “parsing”:

  “parseNonEmpty is a perfectly cromulent parser: it parses lists into non-empty lists, signaling failure by terminating the program with an error message”
The parent article, however, seems to want to take this further; but again, to make use of this as suggested, by actually parsing, would still require unit tests.

A compiler can’t solve the problem of preventing the class of bug you’re talking about, which is why there always needs to be some code that handles cases like empty files or lists, etc.

To understand why, check out the halting problem:

https://cs.stackexchange.com/questions/72014/what-cannot-be-...

https://en.m.wikipedia.org/wiki/Halting_problem


I think your misunderstanding lies here:

> But the caller still needs to deal with the empty type.

Yes, the caller has to deal with it. The caller in this case is, in my example, the function that takes the user input. But every subsequent part of the application (in my example, the sorting/email function) does _not_ have to deal with it anymore.

So essentially: you only have to deal with errors at the boundaries of your application - and after that, the rest of your code (usually the vast majority) just works without checking for errors again and again.

So to answer your summary:

> A compiler can’t solve the problem of preventing the class of bug you’re talking about, hence why there always needs to be some code that handles cases like empty files or lists

Yes. The difference is: do you want to handle these cases over and over again in each function of your code base... or just once at the boundary?
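
A rough Haskell sketch of what "once at the boundary" looks like (the names are mine):

    import Data.List.NonEmpty (NonEmpty)
    import qualified Data.List.NonEmpty as NE

    -- The boundary: the one place the empty case is handled.
    handleInput :: [Int] -> IO ()
    handleInput raw = case NE.nonEmpty raw of
      Nothing -> putStrLn "error: list cannot be empty"
      Just ne -> process ne

    -- The interior: no re-checking anywhere past this point.
    process :: NonEmpty Int -> IO ()
    process ne = print (maximum ne)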


Ok, well I agree with that assessment. I just don’t think the article has made the case that the methods he talks about simplify handling these things at the boundary in some special way.

There are numerous ways, even in dynamically typed languages, to use abstraction to raise the level of such errors and handle them in a single place.

For instance, let’s take Ruby. Rails has a “.blank?” function that wraps up handling empty strings, lists, nils, etc. So if I use that abstraction, I’ve raised up having to check all those issues in every function; instead, now I need to ensure the “blank?” function does what it’s supposed to.

So there’s definitely value in abstraction, and in moving certain classes of errors up the layers in the application. And yes, I can see how some fancy uses of type systems (in this case Haskell’s) can make this work. The grandparent article seems to leave it there.

This parent article, though, seems to muddle things with the idea that parsing can do even better, and that’s where I think the idea is half-baked.


Just to put a nail in it, a previous commenter above spells it out:

  “And some of those simpler solutions do come from the dynamic programming world. For example, ‘You can then go further and remove the more primitive operations, only allowing access to the higher level abstractions,’ is another excellent way to make illegal states unrepresentable. And you don't need shiny new programming languages or rocket powered type systems to do it. I was really rather disappointed that that section of the article gave a nod to Dijkstra, but completely failed to mention Alan Kay's The Early History of Smalltalk[1], which was, to an approximation, several thousand words' worth of grinding away on that point.”
So I feel like maybe we all were confused by what the article was suggesting, if indeed it suggested anything concrete at all.

1. http://worrydream.com/EarlyHistoryOfSmalltalk/


> The grandparent article seems to leave it there.

I think the article unfortunately just assumes that you are already familiar with using an advanced statically typed language and will draw the conclusions about the advantages automatically.


I am actually very familiar with advanced statically typed languages; however, the article isn’t trying to make the case that such languages have any advantages (another debate entirely). It’s instead making the case that adding “parsing” to such languages helps deal with a certain set of bugs, but it still seemed pretty nonspecific and, frankly, doesn’t seem to cover any new ground in any interesting way.


No one genuinely has an answer to this?

I get that we’re talking about Haskell here, which is compiled. But why wouldn’t we talk about running the code through unit tests?

For example, the article talks about a given function that requires a number that must be between one and five... and then goes on to specify specific grammar rules.

How is that better than simply typing that up in a test?

And by simply running test suites you eliminate issues with bugs in your parser system.

What am I missing?


When you have a non-emptiness property on a type, the compiler can guarantee that all current and future occurrences of values of that type in the system will respect the specified non-emptiness property. Or in other words, the emphasis moves from the limited set of units that you are able (and willing) to test to invariants that hold for all occurrences of the type at all times, regardless of the unlimited number of units that are willing to use them.


How does one guarantee that for all values? If we’re talking about a parser, wouldn’t it need to test all the values? And if you’re not testing all the values, aside from simple cases, it’s not something that could just be computed (NP-hardness, the halting problem, etc.)... so wouldn’t the parser still just need to test some preset number of values?

I guess maybe we’re talking about something like a pre-baked set of unit tests for certain circumstances?


The author contradicts the headline in the body. This seems like a poorly written bit of highbrow content marketing for his book.


Rhetorical Flourishes Considered Harmful?


Isn't the first step of writing a type checker to write a parser?


It really depends on the language.



