Introduction to Datalog (michelin.io)
362 points by jgrodziski on Feb 15, 2023 | 89 comments



Datalog is great for representing authorization rules. Check out Biscuits, which are auth tokens with Datalog embedded in them. This article is what made it 'click' for me: https://www.clever-cloud.com/blog/engineering/2021/04/15/bis...

I actually thought that Datalog was so cool that I went to learn Prolog and it completely changed the way I think about programming. Highly recommend trying out logic programming if you haven't before.


Agree, but I also want to point out that people usually have a narrow view of "logic programming". Datalog can also be understood without the top-down evaluation / resolution that is typically associated with Prolog, which is why it is well known to database researchers and in finite model theory. Prolog is great, but bottom-up techniques for evaluating datalog are awesome too, and would arguably also qualify as logic programming. It is rare to see this acknowledged.


Acknowledged by whom? As far as I know, people in the logic programming community have no trouble recognising datalog as a logic programming language, whether or not it is evaluated bottom-up (you can still evaluate a datalog program top-down, by resolution, as if it were a Prolog program without "compound" terms, a.k.a. functions).

Indeed, I get the feeling that the primacy of Prolog as the logic programming language has waned. If you look at back issues of ICLP* proceedings you'll find plenty of work that has nothing to do with resolution, namely Answer Set Programming, which is wildly popular.

I tend to think it's mainly the database community that kind of ignores the logic-programming nature of datalog.

Oh and btw, when I talk about datalog, I mean definite clauses without functions, not the aberrant syntax in the article above. I'll never understand why people do that.

________________

* ICLP is the International Conference on Logic Programming.


In the last few months mentions of Datalog have increased, and I wondered how it differed from graph databases; I found a clear answer on SO [1]. I am not an incumbent, but I found both graph databases and the clause-based approach interesting.

[1] https://stackoverflow.com/questions/29192927/a-graph-db-vs-a... (2015)


XTDB, which is mentioned in the post, is subtly different from the other Clojure-based Datalog systems in this respect, because its Datalog engine executes in terms of multi-way joins using a "Worst-Case Optimal Join" implementation that is ideal for graph processing (vs. a tree of binary hash joins). Therefore, based on statistics and query planning heuristics, it will often perform graph pattern matching before resolving the logic/Horn clauses. (source: I work on the XTDB team)


Interesting architecture:

https://raw.githubusercontent.com/xtdb/xtdb/master/docs/conc...

Btw, is that 'RocksDB or ?' label for the local store current, or can other storage engines be plugged in?

P.S. this is Datomic's architecture, for comparison.

https://docs.datomic.com/on-prem/images/clientarch_orig.svg


Hmm, it seems there's some SVG rendering issue when viewing directly. It should say "LMDB" in that empty spot. It renders okay for me on the GitHub readme: https://github.com/xtdb/xtdb (and here https://docs.xtdb.com/concepts/what-is-xtdb/)


I did an interview [1] with Kevin Feeney, one of the founders (no longer active) of TerminusDB, which goes into some depth about the difference between RDF stores and (property) graph databases; the former are more closely aligned with datalog and logic programming. There are links to a really excellent series of blog posts by Kevin on this topic in the show notes.

[1] https://thesearch.space/episodes/5-kevin-feeney-on-terminusd...


I love The Search Space. Waiting patiently for new episodes!


Thank you! They're coming, I promise :)


Second that!


With RDF* and SPARQL* ("RDF-star" and "SPARQL-star") how are triple (or quad) stores still distinct from property graphs?

RDFS and SHACL (and OWL) are optional in a triple store, which expects the subject and predicate to be string URIs; the object has a datatype and an optional language:

  (?s ?p ?o <datatype> [lang])

  (?subject:URI, ?predicate:URI, ?object:datatype, object_datatype, [object_language])

RDFS introduces rdfs:domain and rdfs:range type restrictions for Properties, and rdfs:Class and rdfs:subClassOf.

`a` means `rdf:type`; which does not require RDFS:

  ("#xyz", a,        "https://schema.org/Thing")
  ("#xyz", rdf:type, "https://schema.org/Thing")

Quad stores have a graph_id string URI "?g" for Named Graphs:

  (?g ?s ?p ?o)

  ("https://example.org/ns/graphs/0", "#xyz", a, "https://schema.org/Thing")

  ("https://example.org/ns/graphs/1", "#xyz", a, "https://schema.org/ScholarlyArticle")

There's a W3C CG (Community Group) revising very many of the W3C Linked Data specs to support RDF-star: https://www.w3.org/groups/wg/rdf-star

Looks like they ended up needing to update basically most of the current specs: https://www.w3.org/groups/wg/rdf-star/tools

"RDF-star and SPARQL-star" (Draft Community Group Report; 08 December 2022) https://w3c.github.io/rdf-star/cg-spec/editors_draft.html

GH topics: rdf-star, rdfstar: https://github.com/topics/rdf-star, https://github.com/topics/rdfstar

pyDatalog does datalog with SQLAlchemy and e.g. just the SQLite database: https://github.com/pcarbonn/pyDatalog ; it has apparently been superseded by IDP-Z3: https://gitlab.com/krr/IDP-Z3/

From https://twitter.com/westurner/status/1000516851984723968 :

> A feature comparison of SQL w/ EAV, SPARQL/SPARUL, [SPARQL12 SPARQL-star, [T-SPARQL, SPARQLMT,]], Cypher, Gremlin, GraphQL, and Datalog would be a useful resource for evaluating graph query languages.

> I'd probably use unstructured text search to identify the relevant resources first.


That's a really neat example of something I'm not familiar with. Going up a tree from child to parent is often the heaviest part of dealing with regular datasets, and it usually requires a mix of queries and application logic. Flattening the data along some pattern like that is of course always possible in a relational DB, but it's not usually efficient, especially not under heavy writes. Lateral joins and window partitions can help. But this seems like an interesting approach to removing the app code completely.


I work on TypeDB (https://vaticle.com/typedb), and it sits somewhere at this intersection. The exposed query language has elements of both logic programming constructs and graph-like structures. Both amount to a kind of "constraint" programming.


A quick peek suggests it's along similar lines as TerminusDB, sorta kinda, but they have WOQL [0]. At this point I start to worry again about all the different kinds and flavours of query languages that are emerging.

[0] https://en.wikipedia.org/wiki/TerminusDB#Query_language


Why would you worry? This space has been occupied by the default for so long, it's refreshing to see people experiment with what might be possible.


Agree; my only concern is that whatever emerges here has conceptual clarity and doesn't get bastardized by people who haven't studied the foundations of the relational model.

I have this fear because there's a history of novel query languages and DB platforms tossing in network/hierarchical/"document"/object-oriented features and creating a dog's breakfast that loses the compositional/expressive power of the relational algebra. Think MongoDB or Redis. Conceptually a big mess.

RDF itself has a history of this as well. Appeals to novelty.

Or even Google's F1, which smashes hierarchical tree-structured protobufs into a SQL DB, and so has really weird behaviour on joins and projections.

Well, whatever, you know my opinions on this stuff, I think :-)

At this point I'd settle for (or ask for) a network-available tuplestore which just receives relational-algebraic operators from a client, optimizes/executes them, and returns pure tuples; the client side could formulate whatever query language (or API) it wanted on top of that. I started playing with building something like that between the two jobs, but never got far.


I really like TypeDB! Haven't been able to use it for anything serious yet, but I have a couple of projects brewing where it might fit :)


You might be interested in https://relational.ai/

Treats graph edges as binary relations ("graph normal form"), has a Datalog-ish language. Built for managing large interconnected knowledge sets in a declarative way.

I recommend this talk: https://www.youtube.com/watch?v=WRHy7M30mM4


Great project, not open source alas. This is another great talk about RelationalAI (and its precursor), highlighting how using powerful databases can simplify complex applications: https://www.hytradboi.com/2022/experience-report-building-en...


Not speaking on behalf of the company, but... It remains difficult to do open source and also pay people.


Yes, sorry, I didn't mean that to come off as "if it's not open source, it's not worthy of attention".

In the case of RelationalAI, as far as I understand there is no way to even try it out without becoming a customer? I just wish there were, because I really do like the approach!


ICYMI, there's an excellent interactive introduction to `datalog` that's referenced in the article's references.[0]

The last time I used `datalog` was years ago; I was developing an internal interactive tool used to compare different approaches to solving a certain problem at my employer. I used `datascript`[1] by way of ClojureScript to store all the experiment data, and then interrogated the `datascript` DB via `datalog`. This is something I always remember fondly.

[0] https://www.learndatalogtoday.org/ [1] https://github.com/tonsky/datascript


As mentioned in the article, Datomic is a database that uses Datalog as its query language:

https://docs.datomic.com/on-prem/query/query.html#why-datalo...

(Some ten years ago I worked at a startup that used Datomic. It seemed to work great, although the only queries I ever needed to add to the system were simple copy-paste hacks of existing ones, so I never got to dive into Datalog.)


Datascript is the open source analog for Clojure, ClojureScript and JS: https://github.com/tonsky/datascript


There are many open source alternatives in Clojure using this query language: https://github.com/simongray/clojure-graph-resources#datalog


'ish.

datahike would be the closest to datomic in terms of features/implementation (support for as-of, transactor etc).

Then in terms of maturity I think the choice is between xtdb and datascript, both are very solid/maintained but they are also vastly different.


A bit related: I just stumbled upon Flix, a functional JVM language with Datalog constraints and (somewhat?) Go-like concurrency:

https://flix.dev

HN Thread from 8 months ago: https://news.ycombinator.com/item?id=31448889


Flix definitely looks interesting! For comparison, I ported the "Datalog Enriched with Lattice Semantics" example from that homepage to XTDB's (Clojure) Datalog after I saw it posted on HN originally: https://gist.github.com/refset/21b3fc1dec9a6928943073809e133...


So I struggled with this.

I guess the intention is to be better than SQL but then I was left with "under which circumstances?".

With that question in mind I didn't feel the article addressed the issue.

The author might do better to think in terms of "what burning problem are we trying to fix and how did we fix it".


> I guess the intention is to be better than SQL but then I was left with "under which circumstances?"

Excellent question.

Two of the most common use cases for databases are "transactional processing" (manipulating small numbers of rows in real time) and "analytical processing" (querying enormous numbers of rows, typically in a read-only fashion).

SQL is generally fine for transactional workloads.

But analytical work sometimes involves multi-page queries, with lots of JOINs and CTEs. And these queries are often automatically generated.

And once you start writing actual multi-page "programs" in SQL, you may decide that it's a fairly clunky and miserable programming language. What Datalog typically buys you is a way to cleanly decompose large queries into "subroutines." And it offers a simpler syntax for many kinds of complex JOINs.
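
To make the "subroutines" point concrete, here's a minimal sketch in classic Datalog syntax (all relation names invented for illustration; comparison operators and negation are dialect-dependent extensions). Each rule is a named, independently testable piece that later rules simply call:

    % a reusable definition, analogous to a tested sub-query
    big_spender(C) :- order(C, O), order_total(O, T), T > 1000.

    % later rules build on it without restating its body
    churn_risk(C)    :- big_spender(C), last_login_days(C, D), D > 90.
    upsell_target(C) :- big_spender(C), not churn_risk(C).

In SQL you'd reach for views or CTEs to get this layering; in Datalog it's the only way to write programs, so every query decomposes this way by default.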

Unfortunately, there isn't really a standard dialect of Datalog, or even a particular dialect with mainstream traction. So choosing Datalog is a bit of a tradeoff: does it buy you enough, for your use case, that it's worth being a bit outside the mainstream? Maybe! But I'd love to see something like Logica gain more traction: https://logica.dev/


That's a good point; to be honest, I'm not sure there's a particularly good answer to that beyond "it's a much better underlying model". SQL is an ugly bastardization of the relational model, whereas Datalog is a clean application of the logical model to the relational setting.

One thing that's really nice about Datalog is that it puts the focus on modelling relations between data instead of "tables", which often end up as a jumble of data that then needs to be normalized to work properly. It pushes you into good structures by default, which is a really good property to have. It's also much, much simpler than SQL: no need to think in terms of inner joins and outer joins and whatever else; it is all relations.

It also easily produces derived data from existing data, without needing any kind of procedural process, and is highly composable.
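
For instance (a sketch with invented relations; inequality syntax varies by dialect), a derived relation is just another rule over existing facts, queried exactly like a base relation, with no procedural step to materialize it:

    parent(tom, ann).
    parent(tom, bob).

    % derived data: siblings share a parent
    sibling(X, Y) :- parent(P, X), parent(P, Y), X != Y.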

It's not so much that it's solving a burning problem that SQL has; it's just better than SQL...

(note: the Clojure syntax used on that page is much more confusing than the native Datalog syntax, IMO)


Another problem with Clojure-based Datalog is performance. Yet another is that you are pretty much tied to the Clojure ecosystem. And I don't really like the Clojure-fused Datalog syntax either. These pains spurred me to write my own: https://docs.cozodb.org/en/latest/ (FOSS), and so far I'm satisfied with my own work.


What is the performance problem?


I'm not the author, and I don't know how the author would answer this question, but some things are really on the surface:

1. Modularity. In Datalog it's very easy to name a common (sub-)query and reuse it in multiple queries (a quick sketch follows this list). You can have stored procedures / functions in SQL, but the syntax differs greatly from implementation to implementation, and users are typically afraid of using the feature (SQL/PSM does not go far enough to define what users would normally want). Also, SQL/PSM is an imperative extension of an otherwise declarative language. The two just don't work well together.

2. Structures other than primitives and tables (you get lists, vectors, hash-maps and sets, but you can also make your own). A lot of practical SQL extensions also offer some of these structures, but they aren't in the standard.

3. Operations on primitive types fall into the same category / work in the same way as operations on complex types. I.e., select / update / insert / delete in SQL apply only to tables; strings and numbers don't work the same way. It's more uniform in Datalog (not 100%, but still better).
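
For point 1, a quick sketch (invented relations, classic syntax): the shared rule plays the role of a portable stored sub-query, and any number of later queries reuse it without restating it.

    % defined once
    active_user(U) :- login(U, Day), recent(Day).

    % reused by otherwise unrelated queries
    active_admin(U)       :- active_user(U), role(U, admin).
    active_purchase(U, P) :- active_user(U), bought(U, P).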


A researchy perspective: Datalog was invented to extend relational algebra with recursion. Since it started out as an academic tool, people have been studying recursion-specific optimizations for decades, so it is extremely well suited to recursive use cases, e.g. iterative graph algorithms. Using Datalog for network algorithms won the thesis award in databases almost 20 years ago: https://boonloo.cis.upenn.edu/papers/boon_interview.pdf


This is the answer I subscribe to. CTEs and recursive CTEs are SQL's answer to a limitation of plain relational algebra: no loops.

CTEs are a great and most welcome addition to SQL, but they are a bolt-on patch compared with Datalog, where recursion is a core feature.


> Datalog was invented to extend relational algebra with recursion.

I'm not sure that is exactly right. Do you have a reference? (Not trying to put you on the spot, I'm just curious to learn the history!)


Agreed, my understanding is that Datalog has a distinct (though related) lineage that directly emerged from Prolog (i.e. logic programming, not relational algebra / database theory) - skimming the introduction of "Horn Clauses and the Fixpoint Query Hierarchy (1982)" seems to confirm this: https://dl.acm.org/doi/pdf/10.1145/588111.588137

Edit: this presentation describes things differently but it doesn't sound quite right to me "Chandra and Harel - 1982 Studied the expressive power of logic programs without function symbols on relational databases" https://www.dbai.tuwien.ac.at/datalog2.0/slides/Kolaitis.pdf


Yeah I'm not up on the Prolog history side of things. My info is based on the Wikipedia article for Fixed Point Logic: "Least fixed-point logic was first studied systematically by Yiannis N. Moschovakis in 1974,[1] and it was introduced to computer scientists in 1979, when Alfred Aho and Jeffrey Ullman suggested fixed-point logic as an expressive database query language.[2]" [2] = Universality of Data Retrieval Languages : https://dl.acm.org/doi/10.1145/567752.567763


Thanks!


Yeah, if the OP can give a reference I'd be very interested, too. I've searched for the "original" reference to datalog because I wanted to cite it, but I couldn't find anything like that.

I have a sneaking suspicion that "function-free Prolog" is as old as ordinary Prolog, and "datalog", as an idea separate to Prolog and used as a database language, was born in the database community, but like the OP I have no reference to this.


> I guess the intention is to be better than SQL but then I was left with "under which circumstances?".

The way I see it, SQL's great strength as a query language is selection and projection whereas Datalog's great strength is inference.

It takes a shift in mindset to take advantage of inference. I've started using Datalog when analyzing unfamiliar codebases. So I might set up something like:

    writes_to(microservice1, db1).
    writes_to(microservice2, microservice1).
    writes_to(microservice3, microservice2).

    depends_on(X, Y) :- writes_to(X, Z), depends_on(Z, Y).
    depends_on(X, Y) :- writes_to(X, Y).

Allowing you to query something like:

    depends_on(microservice3, X)

and get back db1, microservice1 and microservice2. Of course, you can ultimately do this in SQL as well, but it's far more compact in Datalog.

Which brings me to the second half of the motivation: the principal advantage of the logical database model is the open world assumption. The relational model in practice tends to be fairly buttoned down: data is expected to fit a particular schema (which has its own advantages in a transactional system). The open world assumption makes it easy to extend a logical database and ask more questions without changing the fact formats already specified.
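
Continuing the sketch above: under the open world assumption you can bolt on a new fact format later without touching anything that already exists. Here `reads_from` is a hypothetical new relation, and one extra rule widens `depends_on` to cover it:

    reads_from(microservice4, db1).

    % the new rule composes with the existing ones; the old
    % facts and queries are untouched
    depends_on(X, Y) :- reads_from(X, Y).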

Of course, I can ultimately do all the things that feel natural in Datalog in SQL. I can work through queries like the one above by building out the data model and writing recursive CTEs. It's about strengths, not possibilities, which I suppose brings me back to why SQL 'won'. A disproportionate amount of code is written for transactional line-of-business systems. It's really not hard to see why the validation layer and transactional focus would win in those areas.


In 2021 Google introduced Logica, a Datalog variant. In their introduction blog they contrast it with SQL, so this may answer your question:

https://opensource.googleblog.com/2021/04/logica-organizing-...

In short, it's easier to compose (and decompose), which helps with complex queries that can be assembled from simpler, independently tested parts.

No personal experience in this, I just remembered that blog contrasting Datalog and SQL.


One of the biggest advantages is re-use, extensibility and robustness to change, because you define things bottom-up. Think of aggregates as "hashmaps with typed keys" rather than "structs with places in them".

Aggregates are implicit groupings, not explicit slots. This shields you from the risk of change and speeds up your feedback loop of making changes and experimenting with ideas. Plus you can compose new aggregate variants out of the same primitives (attributes), which is a fairly typical use case.
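
A sketch of the "hashmaps with typed keys" idea in the EDN-flavoured Datalog of Datomic/DataScript (attributes invented for illustration): entities are open bags of attribute-value pairs, so adding a new attribute later doesn't break any existing query.

    ;; facts as entity-attribute-value triples
    [[1 :person/name  "Ann"]
     [1 :person/email "ann@example.com"]
     [2 :person/name  "Bo"]]

    ;; names of everyone who has an email; entities lacking the
    ;; attribute simply don't match -- no NULLs, no schema migration
    [:find ?name
     :where [?e :person/name ?name]
            [?e :person/email _]]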

Another is query composition. I'm more familiar with SQL and I'm somewhat comfortable with writing nested queries. But we all know how query builders and the like are often hard to re-use and compose. With these datalog dialects, composition, refactoring and extraction of logical units is much more straightforward, even as a beginner. Queries are "flat", and consist of clauses that can be moved around and composed. Think of having more associative and commutative operations.

On top of these, the technologies mentioned in the article are temporal and support time travel out of the box. That's something you can build on top of SQL, of course, but it's quite powerful to have it built in as a primary feature. Whenever you have an application that does things like reporting, undo, or revisions (non-exclusively), you might want that.


SQL doesn't compose at all. You can't take two SQL programs, combine them with a single operator, and get a new program.

In datalog you can. Either by using the comma or semicolon operator.
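
Concretely (classic syntax; in pure Datalog the semicolon is usually spelled as two rules with the same head):

    % conjunction: the comma glues two queries into one
    ?- employee(X), manager(X).

    % disjunction: two clauses with the same head (or `;` in Prolog)
    contact(X) :- employee(X).
    contact(X) :- contractor(X).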


You are thinking about the declarative part. SQL/PSM, the standard that defines the imperative part of SQL (stored procedures), can compose in a similar way (e.g. you can call a stored procedure from a stored procedure), but the standard doesn't go far enough, and so every practical database has its own extensions, which deters people from using stored procedures because they'd end up with non-portable code.


Stored procedures, being procedural, are the antithesis of a declarative database library.

I'm talking about this:

    Query 1: SELECT a, b, c FROM table1 JOIN table2 ...
    Query 2: SELECT d, e, f FROM (SUBSELECT ... ) AS table3 JOIN table4 ...

Now, join those queries together. How? Well, you have to analyze the join clauses in the FROMs and combine those together, then rewrite the SELECT list. Okay, great... not so hard. Now use Query 2 in a recursive CTE for Query 1. Oh goodness... much harder. Or what about:

    Query 1: SELECT a, b, c FROM table1 JOIN table2 ...
    Query 2: SELECT d, e, f FROM (SUBSELECT ... ) AS table3 JOIN table4 ...
             ORDER BY table4.column LIMIT 10

Now join 1 and 2 together... the syntax is much much much different.

Some SQL libraries in Haskell, like Beam, Opaleye, Squeal, etc., demonstrate the kind of composability I mean. In particular, all three of these offer a monadic interface where arbitrary queries can be joined using well-known relational operators (relational cross joins form a monad). I'm not sure about Opaleye and Squeal, but Beam does its best to produce a human-comprehensible query.

Moreover, the first queries given above are 'easy' in the sense that the portion after the `FROM` in a regular SQL SELECT does actually compose well, since the JOIN operators are basically a direct translation of the underlying relational algebra. However, composition starts to fail when you want to join up aggregations and the like. While it's true that aggregations and ordering are not strictly relational, it is of great use to be able to compute with such things.

Again, the libraries above demonstrate the composability. For example, the beam library offers this: https://haskell-beam.github.io/beam/user-guide/queries/aggre...

Notice how (in the second-to-last query) we are able to simply combine the query `all_ (track chinookDb)` with the complex aggregate expression. In this DSL, the entire aggregate could be floated out. In particular, notice how the queries for either the first `all_` or the second complex aggregate would look much different on their own. That's what I mean when I say SQL doesn't compose.

Of course all these DSLs have the problem that they're trying to work with an existing language SQL that is not nice and orthogonal, and so they all have their varying deficiencies.

Datalog mostly fixes these issues and provides a nice model.


> then I was left with "under which circumstances?"

Foremost, Datalog has great compositionality, which is one of SQL's big weaknesses (it was designed with completely different goals). Also, things like recursive queries are completely natural in Datalog, whereas in SQL they are bolted on via recursive common table expressions.


> Datalog has great compositionality, which is one of SQL's big weaknesses

There's a good write-up on this aspect (amongst others) here: https://www.scattered-thoughts.net/writing/against-sql/

> recursive queries are completely natural in Datalog, whereas in SQL they are bolted on with recursive common table expressions

And similarly: https://github.com/frankmcsherry/blog/blob/master/posts/2022...


Of all the anti-SQL screeds out there, this is my favorite as well. And Frank McSherry... well, everything he writes is gold.


>> The author might do better to think in terms of "what burning problem are we trying to fix and how did we fix it".

I have no idea who first came up with the name "datalog", and what exactly was their motivation (I've looked, but couldn't find the original reference to datalog), but the burning problem that datalog does fix is Prolog's semi-decidability, or in other words, its tendency to enter infinite recursions.

One half of the reason for that is Prolog's "compound terms" (known as functions, in First-Order Logic terminology) and the other half is Prolog's use of depth-first search for evaluation. Functions, along with logic variables, mean that a predicate's Herbrand base (its set of logical atoms) can be expanded forever: if f(x) is a term, f(f(x)) is a term, f(f(f(x))) is a term... and so on, ad infinitum. The depth-first search evaluation just gets stuck in left-recursions when the first body literal of a clause is a recursive literal: p(x):- p(x), q(x) loops forever, in Prolog with depth-first search.

Datalog is Prolog without functions other than constants, and it can be evaluated "bottom up", both of which overcome Prolog's semi-decidability (but eliminate its completeness).
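
A concrete instance of both halves (a sketch): the following is a perfectly good datalog program, and bottom-up evaluation saturates and stops, but Prolog's depth-first search loops on a query like path(a, c), because the recursive literal comes first in the first clause.

    path(X, Y) :- path(X, Z), edge(Z, Y).   % left-recursive: loops under DFS
    path(X, Y) :- edge(X, Y).

    edge(a, b).
    edge(b, c).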

But of course this is not a very clear motivation for its use as a database language, only its use as an alternative to Prolog.

More to the point, there are many datalog variants: some accept negation, some do not, some allow recursion, some do not, and so on. There's a lot of information on the different datalogs, seen from the point of view of database research, in this free ebook on databases:

http://webdam.inria.fr/Alice/

See chapters 12 - 15.

And see these lecture notes for a more logic-programming-flavoured discussion of the fixpoint semantics of definite logic programs:

https://www.doc.ic.ac.uk/~mjs/teaching/KnowledgeRep491/Fixpo...


> but the burning problem that datalog does fix is Prolog's semi-decidability, or in other words, its tendency to enter infinite recursions.

Prolog's built-in search is not semidecidable. https://en.wikipedia.org/wiki/Decidability_(logic)#Semidecid...

As you observe, Prolog gets stuck in left-recursions. But iterative deepening for Prolog is semidecidable (among other fair methods). This indeed is restricted to the pure, monotonic subset of (full) Prolog together with all pure, monotonic extensions.


>> As you observe, Prolog gets stuck in left-recursions. But iterative deepening for Prolog is semidecidable (among other fair methods). This indeed is restricted to the pure, monotonic subset of (full) Prolog together with all pure, monotonic extensions.

Thanks, yes, I might be fudging terminology a bit.

Generally, when I talk about Prolog I mean definite programs (sets of definite clauses and Horn goals, the latter usually called "definite goals") under SLD-resolution. That's more or less what is usually meant by "pure" Prolog: definite programs, without not/1 implementing Negation-As-Failure, without the cut, and without side effects; and executed by giving an initial Horn goal as a "query". Entailment between definite clauses and satisfiability of definite programs is undecidable.

By "semi-decidable" I mean that an SLD-refutation proof will terminate if there is a branch of the proof that ends in □ and is of finite length. If such a branch does not exist, a proof will "loop" forever. That's regardless of whether the branch succeeds or fails, which is a bit of a fudge, but, in practice, there is no other way to decide the success or failure of a proof than to search all its branches.

Left-recursion is one way in which Prolog, with depth-first search, generates branches of infinite length, but you can get those with right-recursion also. There are restrictions of Prolog, like SLG-resolution (similar to DFS with iterative deepening) that don't loop on left-recursions but the general case remains undecidable.

Fortunately, there seem to be at least as many finite proofs as there are infinite ones; in any case, I have never encountered a Prolog program that looped initially and couldn't be rewritten to terminate, at least when called in the way it was meant to be called. And that's also a bit of a fudge.


I had the exact same reaction.

I did a bit of reading about Datalog and I get that it’s useful in things like static analysis where you get the fixed point analysis for free, but apart from that I am not really sure why I’d use it.


The main reasons I'd point to are the very intuitive JOINs and queries: you give some partial relation and you get back every "fit". So as long as you mostly do equalities in your JOINs and WHEREs, the queries will be very intuitive and obvious even to the layperson.

If you work with VIEWs or CTEs a lot, then Datalog is your friend, as every derived relationship is just a VIEW. If you like to encapsulate and reuse queries, then Datalog is ideal. We've had plenty of discussions on HN about SQL code reuse; it's simply not an issue in Datalog.

And lastly, there are many ways to compile Datalog programs to "executables" that take, say, a few CSV files as input and give you the results. I'm not saying this would always be faster than loading the CSVs into SQLite and running SQL against the data, but a lot of work has been done on Datalog query optimization, and compilers can emit very efficient code.
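
As a sketch of that workflow in Soufflé-style syntax (exact I/O directive options vary; treat the CSV parameters as illustrative and check the docs):

    .decl writes_to(src: symbol, dst: symbol)
    .input writes_to(delimiter=",")         // e.g. load writes_to.csv

    .decl depends_on(src: symbol, dst: symbol)
    depends_on(x, y) :- writes_to(x, y).
    depends_on(x, y) :- writes_to(x, z), depends_on(z, y).

    .output depends_on(delimiter=",")       // emit the closure as CSV

Soufflé can interpret a program like this directly or synthesize a standalone executable from it, which is the "compiled Datalog" workflow described above.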


> If you like to encapsulate and reuse queries

This is what I always thought was the main benefit.

You are able to raise your level of abstraction much more easily, by building up a vocabulary of definitions and modelling your further queries (or definitions) using them.


I've been using Prolog a bunch recently, and also embedded and extended MicroKanren in a project. Something I came to appreciate was that Prolog's depth-first search and Kanren's lazy-stream approach are good with memory, even when generating/searching through infinite solutions. It is my understanding that Datalog, on the other hand, will iteratively expand a set of data. Isn't this a problem?


"Iteratively expand a set of data"? I'm not sure what you mean here. I think you are probably talking about the "bottom up" evaluation strategy of datalog, right? That's where datalog is evaluated by a so-called TP-operator, which derives the set of all logical consequences a datalog program by calculating its least fixed point. That's the same as the Least Herbrand Model (LHM) of the program, or, in other words, the set of atoms entailed by the program (atoms in the logical sense, of atomic formulae, not in the Prolog sense of constants).

That's the same thing that Prolog does, calculate the LHM of a logic program, but the difference is that datalog programs have finite LHMs, because they don't have functions that can be instantiated within themselves forever ( f(x), f(f(x)), f(f(f(x))), ... ), and because the bottom-up evaluation, which goes from the "body" to the "head" of a clause, avoids infinite left-recursions.

Prolog, evaluated top-down (clauses are "picked apart" head-first) can get stuck in infinite left-recursions, so datalog's finiteness, and its decidability under TP, is a big gain in efficiency, as anyone who has had to kill a Prolog console session because of an infinite loop will know.
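
As a tiny worked example of that fixed point, take this program; the TP operator starts from the empty set and repeatedly adds every head whose body atoms are already present, until nothing new appears:

    edge(a, b).
    edge(b, c).
    path(X, Y) :- edge(X, Y).
    path(X, Y) :- edge(X, Z), path(Z, Y).

    % T^1 = {edge(a,b), edge(b,c)}            the facts (empty bodies)
    % T^2 = T^1 + {path(a,b), path(b,c)}      first rule fires
    % T^3 = T^2 + {path(a,c)}                 second rule fires
    % T^4 = T^3                               fixed point reached: stop

That final set is the Least Herbrand Model, and because there are no function symbols it is guaranteed to be finite.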

Also, it is not widely recognised, but I am the author of the dumbest and most inefficient TP operator implementation in existence. Obviously I hang my head in shame and will not link to my code. I understand, however, that there are optimisations one can perform that make bottom-up execution efficient, and even quite fast. Unfortunately, I don't know what they are :P

Note that Prolog can also be evaluated without fear of left-recursion, by SLG-resolution (a.k.a. "tabling", a.k.a. memoization), but there is still the danger of infinite right recursion. Prolog is semi-decidable because it is Turing-complete. Datalog is decidable, sacrificing completeness for, well, efficiency.

So, in short, it's not a problem if you consider the alternative, but of course there are trade-offs, always. It's like growing old, vs. dying young.

(I hope all this is not completely irrelevant to your question).


Whatever language this is... this is not datalog. It looks like a particular implementation of datalog in Clojure.

Actual datalog looks like prolog.


In reality, people use "datalog" for a genre of datastore concepts based around Horn clauses, or, basically, relations + implicit joins. Datalog as a subset or dialect of Prolog is only one variant of this. And Datomic has made an s-expression-syntaxed variant built around binary relations popular, to the point where some people in this thread can't seem to tell the two apart.

I am more interested in the general category of relational data model + logic programming than I am in any purity about Datalog in particular. Specifically, I'm very excited by "data/knowledge + behaviour sitting in a tree, k-i-s-s-i-n-g"


Thanks. Is Datomic also a dialect of Prolog then?


No, and I wouldn't call Datalog a dialect of Prolog either, necessarily. It's more like: classic Datalog syntax looks Prolog-ish (similar syntax, and it's kind of a subset of Prolog), and you can do Datalog-type stuff in Prolog. But from what I've read of Datomic, it has a totally different syntax, with semantics similar to Datalog... but not Prolog...

Basically, Prolog can do more than Datalog. Datalog takes some subset of the concepts of Prolog and specializes them for relational data queries, so it can achieve some optimizations and focus on data retrieval only.

Is that confusing enough?


> No, and I wouldn’t call Datalog a dialect of Prolog either necessarily. It’s more like, classic Datalog syntax looks Prolog-ish (similar syntax, and is kind of a subset of Prolog).

Datalog originated specifically as a restricted subset of Prolog, which is why the “classic” syntax looks prolog-ish.

> Basically, Prolog can do more than Datalog. Datalog is like, some subset of concepts of Prolog, but specialized for relational data queries, so it can achieve some optimizations and focus on data retrieval only.

Datalog is a purely declarative, non-Turing complete subset of Prolog, removing the imperative features and assuring termination. This gives up lots of power, obviously, but it also, well, provides a termination guarantee. Removing imperative features (cut) from Prolog also means that Datalog implementations can use different (or multiple, switchable) strategies, while differing in what kinds of data sets and queries they perform well on, not correctness.


> Datalog implementations can use different (or multiple, switchable) strategies

Including distributed strategies, as seen in Bloom http://boom.cs.berkeley.edu/


As others have said, datalog arose as a dialect of prolog that removed turing-completeness and expressivity, thus guaranteeing termination.


Yeah I know, I'm just being pedantic.


The language discussed in TFA appears to be Datomic's proprietary Clojure DSL, but has nothing to do with Datalog/Prolog.


It is a datalog dialect, and there are multiple open-source implementations: https://clojupedia.org/#/page/Datalog


I created a datalog engine a few years back called Dr. Lojekyll: https://www.petergoodman.me/docs/dr-lojekyll.pdf

It was pretty cool; you could stream new facts into it over time and it would incrementally and differentially update itself. The key idea was that I wanted the introduction of ground facts to be messages that the database reads (e.g. off of a message bus), and I wanted the database to be able to publish its conclusions onto the same or other message buses. I also wanted to be able to delete ground facts, which meant it could publish withdrawals of the prior-published conclusions. A lot of it was inspired by Frank McSherry's work, although I didn't use timely or differential dataflow. In retrospect I probably should have!

This particular system isn't used anymore because we made a classic monotonicity mistake by making it the brain of a distributed system, and then having it publish and receive messages with a bunch of microservices. The internal consistency model of the datalog engine didn't extend out to the microservices, and the possibility of feedback loops in the system meant that the whole thing could lie to itself and diverge uncontrollably! Despite this particular application of the engine being a failure, the engine itself worked quite well and I hope to one day return to datalog.

I think what a lot of people miss with datalog, and what becomes apparent as you use it more, is just how unpredictable many engines can be in the execution behavior of rules. This is the same problem you have with a database, where the query planner makes a bad choice or you lack an index, and so performance is bad. But with datalog, the composition of rules that comes so naturally also tends to compound the issue, resulting in time spent chasing down weird performance problems and doing spooky re-ordering of your clause bodies to try to appease whatever choices the engine makes.
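
For instance (invented relations): these two rules mean exactly the same thing, but an engine that joins body literals left to right will enumerate very different intermediate results, which is exactly the kind of re-ordering being described.

    alert(U) :- login(U, IP), blacklisted(IP).   % scan logins, probe blacklist
    alert(U) :- blacklisted(IP), login(U, IP).   % scan blacklist, probe logins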


Huh, cool, I didn't realize it came from that Michelin.


I remember being surprised once that Michelin, like the star, was the same as Michelin, like the tire. It's really cool to see a company move beyond their core competency more than once in a meaningful way, even if the first time was to sell more tires. They have a really interesting blog that this is just a single article from.


I believe they started the star ratings as a way to encourage people to drive more. They saw it as part of “Travel”.


Yes, absolutely, as is the eponymous "Guide Michelin" (Guide vert) that conveniently fits in the glove compartment of your automobile. The more people drive, the more tires they consume.


This is a cool concept but I'm not sure I'd fancy upskilling a team of SQL analysts to use this.


Many SQL analysts are wasting their skills performing mental and syntactic gymnastics to get around the limitations of SQL, in order to grasp at the conceptual elegance that lies underneath it. Most people who write SQL for a living already understand at least some part of what makes the relational model powerful. But SQL is a relatively poor tool for accessing it.

I personally don't find the Sexpr-based syntax of Datomic's variant of Datalog all that useful here, and yeah, maybe someone working in SQL for a living would struggle at first with that syntax. But it's not intrinsic to the model itself. Have a waltz through my employer's documentation and see what you think: https://docs.relational.ai/rel/primer/overview

I think it's quite understandable (if a bit terse) and many people doing SQL for a living would appreciate the ability to better compose and structure things in this way, not to mention the ability to handle transitive / recursive relationships in a less awkward way.


Maybe it’s just me but I find the SQL much easier to read in all the examples given.


The problem with Datalog, and with Clojure in general, is the licenses. Terrible licenses.

Everything is about Rich Hickey. Apache 1.0.

Now that Nubank basically owns it and there's very little progress or activity as of late, I don't see why one would choose to use Clojure, Datalog, etc.

Also, a lot of functional programming concepts have since been added to big programming languages like JavaScript, and hell, even Java has lambdas now.

I'm guessing that the hardcore FP people have also moved on to Haskell, the ones that like Lisp to Racket... and only people tied to the JVM in legacy projects are left with Clojure.


> very little progress or activity as of late

There have been equally frequent releases and updates; I'm not sure how you came to this conclusion?

The best I can think of is that you're confusing core language updates with other tech. Clojure is structured differently from other languages, and core language/library updates are (and always have been) relatively rare, while the surrounding ecosystem/tooling provided by the Clojure team is active.


Quick reminder that Clojure is available on the JVM, the CLR, Node, browsers, and as a stand-alone interpreter (Babashka); it also runs on the Erlang VM, and there's a C++ implementation being actively worked on.


What are the licensing issues with First-order Horn clause logic?

Datalog is not a licensed product or software, the same way Answer Set Programming isn't.


XTDB is MIT-licensed and has clients for Java or HTTP if you don't want to write Clojure. Not every Datalog implementation is 'about Rich Hickey'.


> Also, a lot of functional programming concepts has been since added to big programming languages like Javascript and hell, even Java has lambdas now.

JavaScript has been "functional" since the beginning; it has always had statically scoped, first-class closures.

But as you say, that's not very special anymore...

If you want to do primarily FP with a data oriented touch you want stuff like:

- immutable, persistent data structures

- a comprehensive library around transforming these

- expression based syntax

- idiomatic (functional-style) code performs well

- a comprehensive library around manipulating, composing and transforming functions

- more stuff that I'm forgetting right now

Clojure gives you all of that and _more_, such as a powerful REPL, macros, multiple dispatch, spec instrumentation, proper namespaces, ...

Personally I write JS in a very straightforward, functional style. But the difference is still night and day in many dimensions. There are fundamental issues with the language that are unfortunately unfixable. Just having first-class function objects doesn't cut it.


> and only people tied to the JVM in legacy projects are with Clojure.

You say this as if it's not an absolutely massive body of programmers. A lot of large companies are institutionally committed to the JVM, and Clojure has settled into a niche as the language teams within them use when they don't want to use Java.

I expect there are more people working on projects that fit that description than there are ones working in racket and haskell combined.


Saying that Java has lambdas is vague:

it's like confessing one doesn't know what a Lisp lambda implies, in particular the creation of closures that capture free variables: Java copies the values of free variables, rather than informing the code generator about their allocation so that it can honor closure-extended lifetimes, moving them from stack allocation to heap allocation (for example).


Datalog != Datomic.

Datomic ∈ Datalog


It's true that this SPARQL-inspired view of Datalog as a triplestore query language is quite a narrow interpretation compared to something closer to the academic Prolog roots like https://souffle-lang.github.io/ - what do you feel are the most important differences?


Disclaimer: I'm not a Datalog expert. I do work for a company that makes a relational graph knowledge management system (RelationalAI) that has some Datalog-type aspects, but I only dabble at the front-end/query-language end of that; I work on the storage layer. I've also never used Datomic or its offshoots. Yes, from what I understand, Datomic works only with binary relations, while classic Datalog can work with full n-ary relations. (However, any n-ary relation can be thought of, or logically recomposed, as a series of binary relations anyway, and vice versa.)

Syntax also clearly is not the Prolog-ish syntax of Datalog "proper."

But my point above is, Datomic is but one product and implementation of these concepts. It's a wrong take to make a statement like "The problem with Datalog, and Clojure in general are the licenses": there is a field of many products, under many licenses, and it's also weird to take that swipe at "Clojure" when there's only a tangential connection between "Datalog" and "Clojure", based on one product (and its offshoots).


Thanks for the insights and clarifications - I agree with all that completely :)

I mainly responded because "Datomic ∈ Datalog" is worthy of discussion in itself ...but I now realise this is probably the wrong subthread for that.



