Hacker News new | past | comments | ask | show | jobs | submit login
Mangle, a programming language for deductive database programming (github.com/google)
187 points by aarroyoc on Nov 26, 2022 | hide | past | favorite | 46 comments



This seems already almost valid Prolog syntax, which is also a syntactic superset of Datalog.

The main question I have for implementors of Datalog and Prolog variants like this: If you are that close to using Prolog syntax, why not go all the way and rely fully on Prolog, a language for which a well-defined ISO standard and several interesting implementations already exist. One of the key benefits you get in this way is Prolog's strength for meta-programming and reasoning about programs with the same formalism you use to state the specifications and queries. Abstract interpretation, query optimization etc. can be easily implemented in this way, instead of having to parse an additional formalism.

It may be possible to implement such Prolog-"variants" entirely within Prolog by defining suitable infix or prefix operators, or adding conforming extensions in implementations. A conforming extension is one that does not conflict with existing ISO syntax. For example, something that would be a syntax error in conforming Prolog implementations could be used as an implementation-specific extension.


Prolog is not purely logical in that you can write imperative programs by taking advantage of the fixed order of search and also alter the executing with cuts. One of the great disasters of symbolic AI in the 1980s was the discovery that you can’t parallelize Prolog.


"...you can write imperative programs...". It's worse than that: you have to look carefully at the order of the alternatives to understand the computation. I re-read Clocksin and Mellish during a recent programming language binge and realized that an imperative core lurks under the logical veneer.


Because syntactic doesn’t mean semantic subset. Datalog uses bottom-up evaluation by default while Prolog uses top-down. As such they are computationally very different.


Yes well, as you mention, that is only the default execution strategy, and nothing prevents us from using other execution strategies for either language. In fact, the main advantage of using a declarative language such as Prolog or Datalog is precisely that it can be readily interpreted with different execution strategies, and indeed many Prolog implementations already provide alternative execution strategies, the most common of which is currently SLG resolution, also known as tabling. Alternative execution strategies are applicable as long as you keep to the pure monotonic core of the language.

The fact that the default execution strategies of Prolog and Datalog differ is not a valid argument for or against using a slightly different syntactic formalism. If, as seems plausible, an alternative execution strategy of Prolog can provide the same semantics as Mangle and other Datalog "variants", then it may be worth considering doing that instead of devising a different syntactic formalism, especially if the alternative syntactic formalism is already so close to valid Prolog syntax as it is in this concrete case.


Yes, I’m well aware of tabling. But high performance Datalog implementations will fundamentally be very different than the usual Prolog implementations. But sure, it isn’t impossible. Just long a ways from ISO Prolog.

Consider Soufflé, designed for large scale program analysis. Or LogiQL, designed for efficient incremental evaluation.

You also failed to disclose that you are not exactly unbiased individual here.


> Consider Soufflé, designed for large scale program analysis. Or LogiQL, designed for efficient incremental evaluation.

I have now looked up these systems: They are far removed from standard Prolog syntax, much farther than the system outlined in the posted article, which uses almost Prolog syntax to such an extreme extent that one of the given examples is already valid Prolog syntax and even a valid Prolog program without any changes. Therefore, the question I posted above ("If you are that close to using Prolog syntax, ...") does not apply to the same extent to these systems.

> You also failed to disclose that you are not exactly unbiased individual here.

Personally, I believe I have a solid grasp on the benefits of using standard Prolog syntax and the disadvantages of so slightly deviating from it, and I also believe that the question I asked is justified and what I stated in this thread is true.

If you see a need for any disclosure beyond what I stated above, please add it and what you believe to be the nature of my bias as your reply to this post for everyone to see. Thank you a lot!


Datalog programs guarantee termination while Prolog programs do not.


Yes, the subset of Prolog that is called Datalog can be interpreted in such a way that termination is guaranteed, and that is a nice and valuable property of this specific subset, i.e., the functor-free subset, of Prolog.

This is a good example to illustrate why it can be useful to interpret specific Prolog programs and fragments in specific ways, which may also differ from the default execution strategy.

As a specific instance of such an execution strategy, we can again consider SLG resolution: If we interpret any pure Prolog program that does not use compound terms with SLG resolution, then every query we post terminates universally.


I don’t understand what you mean. Okay, you can create anything you want on top of Prolog. Is your point that authors of Mangle should have created a Prolog instead of their own Datalog?

Or is more “why don’t they just use Prolog”.


The code snippets shown on the page are almost valid Prolog syntax. The only issues that prevent them from being actual Prolog syntax are, as far as I can tell, very minor syntactic issues. For example, != cannot be defined as an infix operator in Prolog because ! is a so-called solo char and therefore != is not admissible as a single token in Prolog. The obvious fix for this concrete issue is to replace every occurrence of != in these snippets with the well-known Prolog predicate dif/2 which holds if and only if its two arguments are different.

For instance, the first example from the page:

    projects_with_vulnerable_log4j(P) :-
      projects(P),
      contains_jar(P, "log4j", Version),
      Version != "2.17.1",
      Version != "2.12.4",
      Version != "2.3.2".
would become:

    projects_with_vulnerable_log4j(P) :-
      projects(P),
      contains_jar(P, "log4j", Version),
      dif(Version, "2.17.1"),
      dif(Version, "2.12.4"),
      dif(Version, "2.3.2").
or, shorter and equivalently, by using full Prolog including higher-order predicates such as maplist/2:

    projects_with_vulnerable_log4j(P) :-
      projects(P),
      contains_jar(P, "log4j", Version),
      maplist(dif(Version), ["2.17.1","2.12.4","2.3.2"]).
This third version could of course be generated automatically from the snippet above. We could also define dif or many other tokens as infix operators, so that we can use operator notation as in the original snippet, all while keeping to standard Prolog syntax.

Analogously for |> and the let-bindings that occur in the sample snippets: Chances are that they can be expressed somehow also with standard Prolog syntax with suitable operator definitions, or with small conforming extensions if absolutely required.

The third example in the README is already perfectly valid Prolog code, and can be parsed and interpreted with every conforming Prolog system:

    contains_jar(P, Name, Version) :-
      contains_jar_directly(P, Name, Version).

    contains_jar(P, Name, Version) :-
      project_depends(P, Q),
      contains_jar(Q, Name, Version).
The main issue here is: This "own Datalog" is extremely close to Prolog syntax, much closer to Prolog than any other Datalog "variant" I have so far seen on HN. The question therefore is: Why not go the small extra step and just use standard Prolog syntax? One advantage of this is that such programs could then be read and reasoned about directly (i.e., without requiring any manual parsing) with every conforming Prolog system. Another advantage is that such programs would also automatically benefit from all improvements in Prolog engines, either via meta-interpretation or by virtue of being valid Prolog programs too. For instance, the first snippet shown in the README, with the small syntactic change I outlined, is already a valid Prolog program and can be run with SICStus Prolog etc.

Using Prolog syntax does not mean that that the programs have to be interpreted in the same way as Prolog would by default. It is also possible to syntactically include constructs that cannot be directly interpreted in Prolog, but require a custom interpreter. The point about syntactic compatibility stands regardless.


It was a conscious choice to stick the prolog-like syntax because there are numerous teaching resources (e.g. "foundations of databases") and also academic research papers available that all discuss datalog in this syntax.

Like tmptmpgo I consider datalog something different entirely and worthy of study in its own right; a small kernel (or toy) language that by itself is not sufficient for many day-to-day tasks.

It is enough to cover relational algebra with recursion fixpoints. This gives us a handle on SQL queries. What we want is a well-behaved, predicable mix of programming and those relational operations. We have to end this SQL madness.

IMHO, it is undeniably true that Prolog is one particular way of achieving such a mix (and standardized etc), but it is not the predictable, easy-to-use and well-behaved mix we want. We should be able to isolate syntactic fragments where it is easy to ensure safety properties, at the minimum the datalog fragment. That is why having a fragment correspond as close as possible to textbook Datalog was a design constraint for me.

That said, when we talk implementation, there are of course a lot of optimizations that apply equally to Datalog and Prolog, say the magic set transformation.


> What we want is a well-behaved, predicable mix of programming and those relational operations. We have to end this SQL madness.

These are great goals! Thank you a lot for working on them, and for taking the time to participate here!

> We should be able to isolate syntactic fragments where it is easy to ensure safety properties, at the minimum the datalog fragment.

Again a great goal! Luckily, for every given Prolog program, it can be decided statically whether it is in the usually considered Datalog fragment. Likewise, for every given Prolog program, many things can be statically decided, for example it can be said statically, i.e., by just "looking" at the program without running it, with certainty that impure constructs do not occur in a given Prolog program and that these transformations can be applied without changing the declarative meaning of the program.

I think it is a very good idea to restrict oneself to a fragment of Prolog, especially if you need a lot of leeway to optimize programs, and to reason about them. For example, query optimization is much easier if you know that no side-effects occur in a Prolog program.

The question is whether all this needs any syntax beyond what standard Prolog already provides. Prolog is a very flexible language and supports user-defined operators. If possible, I recommend to keep to this standardized syntax, and to use a sensible fragment of it to facilitate reasoning, at least for as many programs as possible. The availability of existing teaching material and research papers you mentioned is another excellent argument in favor of this approach. It is perfectly OK to only support a proper subset of full Prolog, in fact that is exactly what Datalog does!


Looks like Mangle differs from Prolog when it needs to express aggregation. I am curious what prolog dialect/library you think has good(or best?) aggregation syntax?


>pure monotonic core of the language

What does that mean? Monotonic describes a function, or a sequence, that only every increases or decreases, never a combination of both. What does an "always decreasing" or "always increasing" language core look like? This may make sense for concatenative languages.


In logic programming monotonic means that adding a fact or rule does not remove an answer.

Like SQL has monotonic queries where adding entries to a relation never gives you fewer answers.

More formally: Database1 subset Database2 implies Query(Database1) subset Query(Database2) for all databases

Which is exactly the definition of monotonicity (x =< y implies f(x) =< f(y)) applied to sets and functions on sets.


I guess that makes sense. I wonder though what other adjectives might apply to a "language core" other than "monotonic"? Why did you pick that one? Also are there any language are are monotonically decreasing? This is less useless than it sounds, since this is an elegant way to model decay, and decay is hard to model.


I agree this would be very useful. There are a number of datalog idioms (demand transformations is a big one for encoding functional programs, adding provenance, inlining relations, doing some light compile time backwards proof search) that it would be nice to have a good meta-programming/macro language to express. Prolog seems like a natural choice. I briefly tried going this route programming in prolog syntax so prolog could parse it, but generating souffle syntax out of the prolog metaprogram.


This is my project! So soon ... I was considering a Show HN but wanted to wait until things like documentation are a bit more complete, but here we are.

Ask me anything.


What is the provenance of the project? Is it for something you're working on at Google? Or is it a personal open source project?

I never got to work on anything this interesting in my 10 years at Google :-)


Mangle is used in an internal application. That application is a staffed project with internal usage.

In mid 2021, I saw an opportunity to apply some ideas on query languages to support development of that particular application. That application was an early idea and a prototype back then.

It was clear that it would need integration with many data sources and that this data would need to be queried in various ways that were not fixed in advance and would evolve. A very common situation...

With the team, we decided to try a novel approach using datalog, and it did indeed help development. Soon they started asking for features, found my bugs, optimized a few things and pushed for design changes.

As the programming language person, I tried to keep internal consistency of the language design and ease of use (static checks!) and wrote most code, yet Mangle certainly would not exist without this team, the application, it's users and management support at various levels.

I insisted on keeping the language as a separate thing since I had these other uses in mind. Open sourcing was therefore not difficult. For the mentioned application, Mangle is merely a mechanism, a library that enables flexible access to data.

So Mangle is not my or anyone's main project, yet many people (including management support at various levels) contributed to it's development.

I am glad you find it interesting! I feel very privileged to work with the larger team of people whose efforts led to this opportunity to do this kind of applied language research.


Other resources for logic programming and Go:

ichiban/prolog - ISO Prolog interpreter in pure Go, getting close to v1: https://github.com/ichiban/prolog

trealla-prolog/go - ISO Prolog interpreter embedded via WASM: https://github.com/trealla-prolog/go

guregu/pengine - library for interfacing with Pengines (SWI-Prolog's RPC protocol): https://github.com/guregu/pengine

biscuit-auth/biscuit-go - Biscuits are a fancy auth token with a little Datalog engine: https://github.com/biscuit-auth/biscuit-go

I'm a big fan of logic programming. We've been seeing a small resurgence of interest in it (for example Yarn using Prolog made some waves) and I have some optimism for its future.


Thanks for sharing Biscuit, I was collecting examples of authentication policy languages.

Datalog is also the basis for Open Policy Agent https://www.openpolicyagent.org/docs/latest/ , more specifically it's Rego language which is also implemented in go https://github.com/open-policy-agent/opa/tree/main/rego


Interesting; Google engineer previously published Datalog variants for BigQuery: https://research.google/pubs/pub43462/ & https://logica.dev/

This new language seems similar to differential-Datalog (which is sadly in maintenance mode): https://news.ycombinator.com/item?id=33521561


That's right. And btw Logica is also open source: https://github.com/evgskv/logica


Inventing a language seems to be a rite of passage for every engineer at google.

Go, Dart, Carbon, Mangle, am I missing some?

I'm not criticizing, I would not dare as I'm creating my own language as well :P


I can think of a few others…

Sawzall, a language focused around processing logs. Rob Pike led on this but use has pretty much all been replaced by Go. https://en.m.wikipedia.org/wiki/Sawzall_(programming_languag...

Dex, a language focused around array processing from the team behind the Jax machine learning library. Early stage research project. https://github.com/google-research/dex-lang

Rune, a language focused on security, early stage research project. https://github.com/google/rune

Wuffs, a language focused on writing safe file format handlers (parsing, encoding, decoding) https://github.com/google/wuffs


Cue and skylark (now starlark), too.


Also other internal languages: borgmon, borgcfg. Don't think they use borgmon anymore though.


The internal GCL language, as well as its failed replacements. There's a nice introduction in this paper: https://pure.tue.nl/ws/portalfiles/portal/46927079/638953-1.... Back when I worked there, I actually quite liked the syntax of the language, though the semantics of the language was quite a dumpster fire.

Another internal language for querying Monarch. See section 5.1 of http://www.vldb.org/pvldb/vol13/p3181-adams.pdf


4 languages doesn't really seem that many given how many engineers there are at google...


Inventing a language is a rite of passage for every engineer. Domain specific problems are best appreciated in the constraints (and flexability) of a domain specific language.



To be fair, it is a rite of passage for anyone doing a proper software engineering degree anyway.

Hence why there should be a very good reason to start an ecosystem from scratch, given how many languages get invented per year across the world universities.


Yes - my only attempt was in the early 1980s when I started to develop a language implemented on top of interpreted BASIC. It was called MAL (My Attempt at a Language). I wish I still had it - I have no recollection what my design goals were, if any! The only thing I have left is the name, and I'm quite sure that was the best part about it anyway.


Initial version of AngularJS by Miško could almost be considered it's own (compile to JS) language.


those are just the public ones


Some of the other stuff looks intriguing, but regarding the claim that "Unlike SQL, our Mangle rule projects_with_vulnerable_log4j has a name and can be referenced in other queries." goes, SQL in a VIEW or common table expression (CTE) can also be referenced in other queries.


Yes, there are CTE and recursive queries, and various DBMS also offer views. There are even table-valued functions.

These things are not widespread, and differ by implementation, and the way these are used by clients are copy-and-paste. Something as thoughtful as ZetaSQL https://github.com/google/zetasql does not have mechanisms for structuring (modules, packages, interfaces). SQL will not, cannot evolve into such a direction (or, anything that evolves, will not be recognizable as SQL).


I don’t think a CTE or a VIEW is an intended reasoning behind this claim. It’s more like choosing a part of a WHERE statement based on previous criteria and prior results.


CTE's and VIEW's can indeed express inference in relational databases. A "projects_with_vulnerable_log4j" view is a nice example, but possibly-recursive CTE's can achieve pretty much arbitrary inference, e.g. on graph-like data.


RDFox looks like the best bet for datalog databases, it computes changes incrementally, also with aggregation extensions. Logicblox, Soufflé, datomic, inter4ql, corese are also worth a look. Looks like there's a lot of innovation possible in the space, like distributed logic processing, incremental sorting, adding assert statements, figuring out why specific rules don't match recursively, etc


RDFox using datalog underneath is definitely interesting. Maybe worth making a distinction between products/services and open source projects that can be used as building blocks for such products/services or whatever custom setup one may have.


I’m probably wrong (I’ve been deep diving into RDF triplestores lately) but I think sparql does all that and has a W3C specification.

Maybe the difference is you don’t have to convert your data into (subject, property, object) triples?

Been reading this hexastore paper and they seem to be trying to solve the same problems but I’m no data scientist so who knows.


Datalog generally has n-ary relations instead of binary relations like RDF. (That's not to say there aren't Datalogs restricted to binary relations. I believe the Datomic stuff is that way, but not this one).

Datalog precedes SPARQL by years... decades. Though not standardized AFAIK. But also ... honestly... SPARQL also is not nearly as elegant to my eyes as Datalog.

I'm glad to see a recent flush of interest for this stuff. Seems like there's a new Datalog article every week.

(I am not a Datalog expert, but I'm a relational-model-nerd. Also as of a couple weeks from now I will be working at https://relational.ai/. Check out their 'Rel' language, it's got similar vibes)


Another useless and overly convoluted project just like all things google.




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: