Trustfall: How to Query (Almost) Everything (github.com/obi1kenobi)
288 points by sbt567 on Feb 8, 2023 | 72 comments



Readers who like SQL may also enjoy Steampipe [1], an open source tool to live query 99+ services with SQL (e.g. AWS, GitHub, CSV, Kubernetes, etc). It uses Postgres Foreign Data Wrappers under the hood and supports joins etc across the services. (Disclaimer - I'm a lead on the project.)

1 - https://github.com/turbot/steampipe


+1 Steampipe's aggregators to combine multiple datasets into one are great


Very cool! Is it possible to connect to Steampipe via a JDBC connector?


I presume so: "You can also use psql, pgcli, Metabase, Tableau, or any client that can connect to Postgres."


Correct - it's just Postgres, so all your favorite clients / tools work as usual. We have docs for a number of example integrations [1] that work with Steampipe CLI (or cloud).

1 - https://steampipe.io/docs/cloud/integrations/overview
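
For JDBC specifically, the connection string is just the standard Postgres form. A sketch using Steampipe's documented defaults (port 9193, database and user both `steampipe`; verify the password and port against your own install):

```text
jdbc:postgresql://localhost:9193/steampipe?user=steampipe&password=<password shown by the steampipe service>
```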


Is it possible to join those data sources against an existing database like MySQL or SQLServer?


Steampipe runs an embedded Postgres database [1]. One approach would be to connect that Postgres database with other databases using the usual (advanced) techniques like Foreign Data Wrappers [2]. With that connection in place, joins would work as normal.

1 - https://steampipe.io/docs/develop/overview 2 - https://wiki.postgresql.org/wiki/Foreign_data_wrappers
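
A sketch of what that FDW wiring might look like from inside Steampipe's Postgres, assuming the mysql_fdw extension is available (the server, schema, and table names below are hypothetical):

```sql
-- Install the MySQL foreign data wrapper and point it at the other database.
CREATE EXTENSION mysql_fdw;
CREATE SERVER mysql_srv FOREIGN DATA WRAPPER mysql_fdw
  OPTIONS (host '127.0.0.1', port '3306');
CREATE USER MAPPING FOR CURRENT_USER SERVER mysql_srv
  OPTIONS (username 'app', password 'secret');

-- Expose a MySQL table as a local foreign table.
IMPORT FOREIGN SCHEMA inventory LIMIT TO (instance_owners)
  FROM SERVER mysql_srv INTO public;

-- Join a Steampipe table against the MySQL data as if both were local.
SELECT i.instance_id, o.owner_email
FROM aws_ec2_instance AS i
JOIN public.instance_owners AS o ON o.instance_id = i.instance_id;
```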


This is crazy cool. Instead of searching Google which returns info from literally any random source (occasional good sites among the ocean of SEO spam, malicious sites, ad-ridden clone sites, annoying trolls, paywalled, etc), you could have your own set of diverse query sources you've deemed to be actually useful and trustworthy.

I suspect this is only a basic, naive idea compared to the true potential capabilities Trustfall could unlock.


See kagi.com search engine, with custom Lens, for something that queries "occasional good sites among the ocean".


It's weird, like, does he pay you to drum up support? It's Rust code that reads a very complex JSON DSL to run queries; I haven't seen any sort of potential.

Your use case even seems to be a UNION and not even a JOIN?


It's effectively a subset of GraphQL, not a JSON DSL.


This is a nifty tool. Its existence alongside the emerging LLMs reminds me of the two diametrically opposed approaches to harnessing it all:

1. Store the knowledge in a highly structured way and interrogate it with a precise and rigorous query language to extract the exact answer you want based on a well defined set of rules

2. Store the knowledge in whatever ad hoc way it’s produced, and then rely on a higher form of intelligence to take an equally ad hoc query, feed it through the entire universe of knowledge with some attention mechanism, and magically return a (statistically significant) response

Both approaches are so satisfying when they work. Of course you also have everything in between and then you have tools like LangChain that start to bring it all together.


i.e. Approximate vs precise.

But there is another option: a tool (presumably AI/ML based) that takes the ad-hoc query and turns it into a precise query against the structured-data tool in order to return precise results.

I hope that this will be Google's answer to ChatGPT... i.e. a chat interface to their existing search tool. This should then be able to "explain", as a side effect, the query it has actually generated for the tool.


For sure, I just edited my comment to mention LangChain, which starts to get at that capability. Would be nice to see LangChain integrate with Trustfall similar to how it can integrate with Wolfram Alpha.


I imagine with these types of things the vast majority of the work is writing integrations. Could you explain how this makes writing integrations easier?


Right now, to be perfectly honest, it doesn't yet. This is by design: the purely integrations-oriented side is a crowded space, and I'm not ready to jump in there yet. Stay tuned :)

The focus instead has been in a place where it's easier to stand out: deeper query ability over the data providers than any competing solution, with stronger optimization guarantees. The goal is that any query optimization you could implement by hand, you should be able to implement within Trustfall -- while having the benefit of purely declarative type-checked queries, with integrated docs and autocomplete. If a query is too slow, you don't have to rewrite it -- you can just use Trustfall's APIs to tweak how it executes under the hood and speed it up by using caching, batching, prefetch, or an index.

For a real-world demo of that, I wrote a blog post about speeding up cargo-semver-checks (a Rust linter for semantic versioning) by 2272x without changing any query, just by making some indexes available to the execution of the existing queries. This is awesome because it empowers anyone to write queries, knowing that their query will either already be fast or can be made fast in the future in a way that is entirely invisible to the query itself. Link: https://predr.ag/blog/speeding-up-rust-semver-checking-by-ov...


I read that yesterday and felt a bit tricked. You said there was a one-line diff, which to me suggested you had made a query optimiser and added it to Trustfall so that consumer applications could transparently benefit without writing any new code. But really, as it turned out, you just added APIs for indexing a table and using those indexes, and then used those APIs to do select manual optimisations in a "fast path" in the trustfall-rustdoc-adapter crates.

What does it matter that some crates are called trustfall adapters and some are not? You still had to optimise the execution of the query manually. I can see how it’s cool you didn’t need to change the text of the query, but people like SQL because the execution engines are smart enough to optimise for you. They will build a little hash table index on the fly if it saves a lot of full table scanning. The expectations re smart optimisation in the market you’re competing in are very high. If you say it was a one line upgrade to the trustfall executor then people will believe you.

The net result is better than what most GraphQL infrastructure gives you. GraphQL doesn’t give you any tools to avoid full table scans, it just tells you “here’s the query someone wants, I will do the recursive queries with as many useless round trips as possible unless you implement all these things from scratch”. At least your API has the concept of an index now. But I think you’re trying to sell it as being as optimisable as SQL while trying to avoid telling users the bad news that it’s going to be them who has to write the optimiser, not you.


Trustfall isn't trying to compete with SQL; doing so would be suicidal and pointless. If the dataset being queried already has a statistics-gathering query optimizer, it's just the wrong call to not use that optimizer. If one wrote a Trustfall query that is partially or fully served by data in a SQL database, the correct answer is to use Trustfall to compile that piece of the query to SQL and run it as SQL (perhaps by adding some auto-pagination/parallelization before compiling to SQL, but that's beside the point).

Most uses of data don't have anything like SQL / any kind of a query language, let alone an optimizer. No tool I know of other than Trustfall can let one have optimization levers (automatic or human-in-the-loop) where one can optimize access to a REST API, a local JSON file, or a database -- all separately from how you write the query.

With Trustfall, I'm not promising "magical system that will optimize your queries for you without you having to lift a finger" -- at least not for a good long while. But I can promise, and deliver, "you can write queries over any combo of data sources, and if need be optimize them without rewriting them from scratch." This means that you can have product-facing engineers write queries, and infra-facing engineers optimize them, with both sides isolated from the other: product doesn't care if there's a cache or an index or a bulk API endpoint vs item-at-a-time endpoint, and infra has strong guarantees on execution performance and optimizability so they aren't that worried about a product query getting out of hand and wrecking the system. Trustfall buys operational freedom and leverage across your entire data footprint.

You can see this effect in play in cargo-semver-checks. We use lint-writing as an onboarding tool, because anyone can write a query, and we know we can optimize them later if need be. Both Trustfall and the adapters will get better over time, so queries get faster "naturally". We get efficient execution over many different rustdoc JSON formats simultaneously, without version-specific query logic. And while the hashtable indexing optimizations required some manual work that I didn't time exactly, it was limited to ~1-2h tops and made all queries in the repo faster automatically with no query changes. Rolling out the optimization would be operationally very simple: it's trivial to test, and thanks to the Trustfall engine, I wouldn't have to test it with every combination of filters and edge operations -- if the edge fetch logic is correct, the engine guarantees the rest. Put simply, nobody else needed to know that I made the optimization -- the only observable impact to any other dev on the project is that queries run faster now.


I know all that. I just thought you might like to do another pass editing your piece. It is your marketing material at this stage. It would be nice if it gave people a clearer impression of what the capabilities are and where trustfall is positioned in relation to SQL, GraphQL, and other stuff. I came away a bit suspicious of your claims because I didn’t understand them when I first read it.

My only question about the actual code is whether you can write these indices to do hash lookups across data sources. Can I avoid table scans when joining two data sets from different adapters?


I appreciate it! Writing is hard (especially not in my native language) and I'm always looking to improve, so feedback like this is valuable.

To be honest, that blog post was targeted at cargo-semver-checks users and r/rust readers, to give them a sense of how cargo-semver-checks is designed and why, with a motivating example of speeding up queries while supporting multiple rustdoc versions. It wasn't really meant to be "Trustfall's entrance on the world stage" even though it kind of ended up being that...

I plan to write more blog posts (and code!) about Trustfall's specific capabilities (and things it can't/shouldn't do) in the future, so hopefully those will come up first in people's searches and give folks the right impression.

Re: multiple adapters, yes, that's the plan. I have some prototype code for turning multiple adapters into a single adapter over the union of the datasets + any cross-dataset edges, and it supports the new optimizations API so the same kind of trick should work in the same way. In general, Trustfall is designed to be highly composable like this: you should be able to put Trustfall over any data source, including another instance of Trustfall, and have things keep working reasonably throughout.


that’s cool, does this API support index nested loop and hash joins? Or is it just for filtering?


Trustfall's Adapter API that data providers implement leaves the joining to the adapter implementation. It provides an iterator of rows and an edge (essentially a join predicate), and asks for an iterator of (row, iterator of rows to which it is joined) back.

The adapter is free to implement whatever join algorithm makes sense, together with any predicate pushdown, caching, batching, prefetching, or any other kind of optimization.

At a high level, Trustfall's job is to (1) perform any optimizations it can do by itself, so you don't have to do them by hand, and (2) empower the adapter implementation to add any other optimizations you wish to have, without getting in the way.
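
As a rough illustration of that contract (the names below are invented for this sketch, not Trustfall's actual Adapter API): given an iterator of rows and an edge acting as a join predicate, the adapter hands back pairs of (row, iterator of rows it joins to), using whatever join algorithm it likes. Here it builds a hash index on the fly instead of doing a nested-loop scan per input row:

```python
# Hypothetical sketch of the join contract: names are illustrative only.
def resolve_edge(rows, row_key, neighbors, neighbor_key):
    # Build a hash index over the neighbor side once (a simple hash join),
    # rather than scanning the neighbor data for every input row.
    index = {}
    for neighbor in neighbors:
        index.setdefault(neighbor[neighbor_key], []).append(neighbor)
    for row in rows:
        yield row, iter(index.get(row[row_key], []))

users = [{"id": 1, "name": "ada"}, {"id": 2, "name": "bob"}]
posts = [{"id": 10, "author": 1}, {"id": 11, "author": 1}]

# Join each user to their posts via the id -> author "edge".
by_user = {u["name"]: [p["id"] for p in ps]
           for u, ps in resolve_edge(users, "id", posts, "author")}
```

The engine never sees the join algorithm; it only consumes the resulting iterators, which is what leaves the adapter free to swap in batching, caching, or an index later.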


Trustfall author here, pleasantly surprised to find this posted!

The goal of Trustfall is to be the LLVM of data sources. There are so many ways to represent and query data -- GraphQL, OpenAPI, JSON (with a JSON schema or not), SQL, RDF/SPARQL -- and none of them can natively talk to each other. Sure, you can stick JSON into Postgres, or compile GraphQL to SQL -- I've done both in production and it's always ultimately a poor fit, because you're cramming one system into another that was never originally designed to support it.

Here's an example: tell me the GitHub or Twitter accounts of HN users that have commented on HN stories about OpenAI. The data is available from the HN APIs on Firebase (for item lookup) and Algolia (for search). I know all of us could write a script to do it -- but would we? Or is it too annoying and difficult, and not worth it? That "activation energy" barrier is something I want to eliminate. Here's that same query in the Trustfall Playground, where it took just a minute or two to put together: https://play.predr.ag/hackernews#?f=1&q=IyBDcm9zcyBBUEkgcXVl...

Trustfall is designed for interoperation from day 1. It separates the queries from the data providers, allowing the infrastructure to evolve and change how it serves queries without any of the queries noticing anything except faster execution. In practice, that means you don't have to rewrite your product to make it run faster -- which makes both the product side and the infra side happier :)

Here's a real-world example of that. The `cargo-semver-checks` Rust semantic versioning linter implements its lints as Trustfall queries over JSON files describing the package API, and I recently was able to speed up its execution by over 2000x without changing a single query -- just by changing how those queries execute under the hood. More details in my blog post here: https://predr.ag/blog/speeding-up-rust-semver-checking-by-ov...

AMA, I guess :)


> More details in my blog post here: https://predr.ag/blog/speeding-up-rust-semver-checking-by-ov...

Tl;dr: implementations of the adapter API can provide hashmap lookups, and the query engine will use them when possible instead of doing a full scan -- this is transparent to the query author.

A 3000-word essay takes approximately 10 minutes to read (assuming subject familiarity). It is frustrating when the main point is at the end and very little is discussed about the shape of the new API besides the fact that it is experimental.

I spent quite a while on the site trying to understand the features and benefits of the query language. I left unsatisfied by the long, comment-less query examples and repeated claims of being able to query anything. It'd be nice to have a discussion of what the language is, how an adapter should be implemented, and how the engine will use my adapter. Maybe that is covered in the linked video talks, but there are no slides to skim to see if that's worth my time.

OP I appreciate your passion and enthusiasm but as a potential user I can't figure out where to start with the project.


Thanks for the detailed feedback, I appreciate it!

I plan on writing more about all the topics you mentioned, and adding docs for writing adapters and queries. I was planning on doing a "Show HN" when more of that was in place, but I'm not the OP here -- someone else submitted it :) Obviously all of the below needs to be more discoverable, and is something I'll work on!

Here are the slides from my talk from the HYTRADBOI conference last year, containing some more real-world query use cases: https://docs.google.com/presentation/d/1foUdlEDOQ1WcTadhAxsA...

This is the current stable adapter API: it's a trait that adapters need to implement https://docs.rs/trustfall_core/latest/trustfall_core/interpr...

It's also available in the Python bindings: https://github.com/obi1kenobi/trustfall/blob/pytrustfall-v0....

Tl;Dr of the optimizations API is that every call into the adapter gets a reference to a `QueryInfo` struct. It's totally safe to ignore if the adapter isn't interested in optimizing, and otherwise allows the adapter to ask questions like "from this location in the query, will future operations include traversing edge X and then filtering on property Y?" If so, it's possible to use the API to resolve the expected values for that Y property and use them to speed up the lookup, as in the case with the hashtables in cargo-semver-checks.
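
The shape of that idea can be sketched in plain Python (the method and field names below are invented for illustration -- they are not Trustfall's real `QueryInfo` API): the adapter inspects a hint about upcoming filters and, if a filter on "name" is coming, serves rows from a prebuilt hash index instead of scanning everything. Both paths return the same results; only the speed differs.

```python
class ItemAdapter:
    def __init__(self, items):
        self.items = items
        # Prebuilt index: name -> items with that name.
        self.by_name = {}
        for item in items:
            self.by_name.setdefault(item["name"], []).append(item)

    def resolve_starting_vertices(self, query_info):
        # Stand-in for asking the QueryInfo-style hint object about
        # upcoming filters on the "name" property.
        wanted = query_info.get("filter_on_name")
        if wanted is not None:
            # Fast path: hash lookups, no full scan.
            for name in sorted(wanted):
                yield from self.by_name.get(name, [])
        else:
            # Slow path: full scan. Correct either way, just slower.
            yield from self.items

adapter = ItemAdapter([{"name": "alpha", "v": 1}, {"name": "beta", "v": 2}])
fast = list(adapter.resolve_starting_vertices({"filter_on_name": {"alpha"}}))
slow = list(adapter.resolve_starting_vertices({}))
```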

The optimizations API is still pretty raw and I didn't want to dig into it too much in that post (which is mostly about cargo-semver-checks' architecture, not just Trustfall internals) in case it changed dramatically. If you really want to see the current state of the API, here's a link to the draft PR: https://github.com/obi1kenobi/trustfall/pull/131/files#diff-...


A small suggestion for your docs: instead of using foo and bar, you might be better off using well known examples, like cart and items, or projects and tasks. This usually helps to understand the context. One might even add a schema of the explainer context in the documentation somewhere.


Good call, thank you! I'll almost certainly do something along those lines.


Hi there!

In https://www.hytradboi.com/2022/how-to-query-almost-everythin... @ 4:15, you mentioned that data modeled as graphs is equivalent to relational models.

I can see how relational models can be represented as graphs (e.g., a table's foreign keys to another table are "edges"; fields in the table that are not primary keys are "properties").

However, I am not quite confident in the claim that arbitrary graphs are mappable to relational models. I suspect there are certain graphs that are difficult to express as relational models.

So, a question for you: would you say all Trustfall expressions can be translated to SQL? Or are there cases where no such translation is possible?


I'm not sure about arbitrary graphs, because all kinds of weird and wonderful math exists out there :)

But from 7 years of working on graph databases, and even participating in some graph query language standardization work, I know two things:

- Anything a graph database can represent and store, a relational db can also represent and store.

- Trustfall's representational power is broader than what either of those would reasonably represent.

I say "reasonably" because, for example, with sufficient contortions you could make SQL give you function-like edges like you'd expect in an API. Table-valued functions are awesome! But if every one of your joins was via a TVF ... you'd have a number of problems.

Examples: in Trustfall, you can write a query that asks for the prime factors of all numbers in a given range, without having precomputed or stored any of that data -- it can be computed and loaded "on-demand". You can also have an edge like `divisors_that_are_multiples(of: Int!)` that requires an integer to be provided before being able to be resolved. Both of these are possible to represent in modern SQL, if you try hard enough. But should you? Absolutely not, or you won't be happy with the outcomes.
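
A tiny illustration of the "computed on demand" idea, in plain Python: prime factors are produced lazily by a generator, with nothing precomputed or stored anywhere. (In Trustfall this would be an edge whose adapter computes values as the query iterates; the code below is just the concept, not Trustfall itself.)

```python
def prime_factors(n):
    # Trial division, yielding factors lazily as they are found.
    d = 2
    while d * d <= n:
        while n % d == 0:
            yield d
            n //= d
        d += 1
    if n > 1:
        yield n

# An "edge" over a range of numbers: the work happens only when iterated.
factors = {n: list(prime_factors(n)) for n in range(10, 14)}
```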

To achieve truly datasource-agnostic querying, you can't start with a "preferred datasource" that everything else must translate to. Either all data sources are equal, or else there's a massive incentive to use the preferred one at the expense of all else -- so there are massive incentives to optimize predominantly for the preferred source, and we're right back at square one.


I am curious if you can draw any comparisons with Calcite? I think Calcite's target (mainly just databases) is more limited, but it probably tackles similar problems regarding query planning/optimization? Other thoughts?


Good question! I've spent some time looking at Calcite and similar systems.

My hope is that it isn't an "either-or" but an "and". I also believe that Calcite is primarily database-oriented, as are most similar systems. Trustfall is designed to be able to delegate queries (or portions thereof) to underlying systems: for example, you could take a portion of a query and compile it directly to SQL instead of executing it via an interpreter -- no need to worry about query planning over a SQL database when the db itself can almost certainly do that better. In principle, Calcite should be possible to be plugged in like that as well.

Same goes for other systems that do federation over data sources, whatever they may be. Whenever there's an awesome database or federation solution out there that covers some data sets, plug it right in! No need to reinvent the wheel -- Trustfall development can then focus elsewhere, and everyone wins. The goal here is to have a universal data system that works great for any data source in any programming language, not a set of small-time turf wars over whether SQL or GraphQL or $NEW_TECH is better.


What if you wanted to join a few different sources together, large ones at that, via query and have it filtered out, without reading all of it into memory first of course? Or.. running analytical queries, matches via reconciliation functions, basically doing useful stuff with different data sources available at once.


It should all be doable:

- Trustfall queries can in principle be compiled into SQL and executed on an analytical database. I haven't built this into Trustfall yet, but a prior project of mine that used very similar syntax had this capability, so I know it can be done without too much trouble.

- Trustfall's built-in query interpreter is lazily evaluated, and has very strict performance guarantees about what gets loaded into memory when and for how long. I've gone to some rather ridiculous lengths on this, and I look forward to writing it up in blog posts in the future.

- Edges in Trustfall's query graph can be parameterized: they can take an arbitrary number of type-checked arguments of a variety of types, so it should be possible to express just about any kind of join, API endpoint, reconciliation function, window function, etc.

- Whether a data source is large or small doesn't matter to Trustfall. So long as the data provider implements the Adapter interface, it can be plugged in. Whether the underlying data is read into memory all at once or progressively is the adapter implementation's business only -- Trustfall is extremely laissez-faire on this.

- Multiple data sources can be combined together, and queries written over all of them. This is something I'm working on right now, and hope to have more to share on in coming months.

It sounds like you might have a specific use case in mind, perhaps? Feel free to DM me on Twitter if you'd rather not discuss publicly.


What went into the decision to use the DSL that you did, versus (for example) SQL? I always miss writing SQL queries, so I got excited when I initially thought queries would be written in SQL.


SQL is kind of a monster language to parse properly[1], and has some unfortunate and confusing edge cases[2]. It would have cost a ton of time and energy to implement properly, and all I would have ended up with is "SQL but worse."

SQL is also really awkward around APIs, since API norms and schemas don't map cleanly to SQL tables and JOINs. For example, many APIs take arguments, and while the most natural way to express an API call in SQL is as a SELECT or JOIN, requiring user-provided arguments there is really clunky. Worse, if the API that takes an argument sets a default value if that argument is not provided, then you end up in a situation where a `SELECT * FROM API` can return fewer results than a `SELECT * FROM API WHERE <some predicate>`. Using a new language meant that I could design it to best highlight Trustfall's capabilities with minimal work.
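
That paradox can be modeled in a few lines of Python (the endpoint and its default are invented for illustration): an API whose argument has a server-side default can return *fewer* rows unfiltered than filtered, breaking the SQL intuition that adding a predicate only shrinks a result set.

```python
def search_api(created_after="2023-01-01"):
    # Hypothetical endpoint: when no date is given, a default kicks in
    # on the server side and silently narrows the results.
    data = [
        {"id": 1, "created": "2022-06-01"},
        {"id": 2, "created": "2023-02-01"},
    ]
    return [row for row in data if row["created"] >= created_after]

unfiltered = search_api()                          # the "SELECT * FROM API" case
filtered = search_api(created_after="2022-01-01")  # adding a "WHERE" widens it!
```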

Trustfall's core is language-agnostic, so it should be possible to put a new language parser in front of it in the future -- including SQL. So I haven't ruled it out forever, I just didn't want the first thing I built to be "SQL but worse" because it's in a domain poorly suited for it.

[1]: https://twitter.com/sc13ts/status/1413729761247916036

[2]: https://www.scattered-thoughts.net/writing/select-wat-from-s...


This is completely off topic but I see your username and respond with, in a raspy voice: "General Kenobi"


I'm actually just one specific Kenobi :P

Thanks, this made me laugh out loud :)


Online playground for querying Hacker News: https://play.predr.ag/hackernews


When I saw a demo of this I was blown away by how easy it is to add another data source. Great work, looking forward to using this soon.


Curious, I didn't find how to add another data source in the talk at the top of the README[1]. Do you mind sharing the demo you saw?

[1]: https://www.hytradboi.com/2022/how-to-query-almost-everythin...


I bumped into him and showed it to him in person one time :)

Realistically, the docs are still very sparse -- I wasn't planning on posting this for another few months, but someone else noticed my project and beat me to it!

New data providers need a schema and an Adapter implementation.

For Rust, this is the Adapter trait that data providers need to implement: https://docs.rs/trustfall_core/latest/trustfall_core/interpr...

This is the equivalent abstract class in Python: https://github.com/obi1kenobi/trustfall/blob/pytrustfall-v0....

For Python, the package docs have some (underwhelming) additional examples of building a schema and passing an adapter to use to run a query: https://pypi.org/project/trustfall/

The process in Rust is equivalent, here's the code powering the demo in the talk: https://github.com/obi1kenobi/trustfall/blob/main/demo-hytra...

I realize this is all extremely suboptimal, but like I said, I wasn't ready. Please stay tuned! I'm planning to spruce all of this up in the coming weeks!


Clicking through to the python repo and looking at the examples there, I guess the query language is GraphQL.

Reading further, it seems that it is actually a variation of GraphQL.

I've never used GraphQL before, maybe one can get used to it, but it doesn't look very nice for defining data queries.


The language is syntactically similar to GraphQL, but the semantics are much richer -- in terms of expressiveness Trustfall is more like SQL than GraphQL.

No language syntax makes everyone happy :) But it definitely isn't hard to get used to -- in my experience, people go from "never seen it before" to "can productively write their own queries" very quickly after trying a few examples. Try out the example queries over Hacker News in the Playground (ideally not on mobile) and see for yourself: https://play.predr.ag/hackernews

And in the long run, Trustfall is more meant to be like an "LLVM for data sets," with pluggable language frontends. There's already been interest in plugging in other languages and making them generate Trustfall representation internally, to make use of the rest of its functionality without using the language itself. This is fine too!


Nice! I wrote something similar for my workplace a couple of years ago based around Rx (as Rx has many implementations in different languages - we had a multi-node browser/server requirement, and optimization in a streaming DSL is easier than SQL, as you've done, as you can hint+order lazy materializations) and libs like https://pypi.org/project/lquery/ to do pushdown queries.

Are you planning on doing reactive/live/materialized queries? You could use a config file to specify the live queries (in the Trustfall DSL) which can be fed to the engine on startup.


Thanks! Reactive/live/materialized queries are certainly a possibility I've tried to keep open and I'd love to build, but there's so many possible things to build from here that I'm not sure when that might happen.

In particular, while that tech is really cool, it's not clear to me how many use cases for it are actually ready to adopt it. I'm hesitant to build something that looks really cool, but is somewhat complex and doesn't get used a lot.


You may be interested in RDF and SPARQL which supports federated queries. Much simpler than reinventing the wheel.

For more context, see https://ontop-vkg.org/guide/


If one is faced with the complexity and footguns of SPARQL reinventing the wheel is a good option.


Could this be used to manage ingesting data from messier sources, à la files (PDF/etc.), web pages, etc.?

edit: Admittedly the website/video tends to talk about data sources a bit hand-wavy. I'd have loved some real world examples on how one goes about adding a data source. Also how we handle problems of scale.. ie passing filters to the data source, rather than drinking from a firehose and filtering after the fact.

With that said, the idea is becoming interesting to me. At the very least i am liking the idea of a standardized query interface to "things". Just feels like edge cases might drown me.


To be completely honest, I was planning on doing a "Show HN" in a month or two, but someone posted the link today and it caught me a bit off-guard -- that's why the docs and examples aren't there yet :)

There's definitely a bit of a learning curve because the docs are few and far between, but in general, everything is much simpler than it seems. The people that have adopted it already have been able to do so within a couple of hours with minimal guidance from me.

For passing filters to the data source, the API there is still very experimental (it was just an idea in my head until last week!) but I've been working on it recently. Here's a blog post showing a small bit of that new API and using it to dramatically optimize a workload: https://predr.ag/blog/speeding-up-rust-semver-checking-by-ov...

This is my full-time project right now, so stay tuned and expect a decently steep progress trajectory :)


I'm definitely interested, will give it a try!


Please feel free to open issues on the repo whenever anything is unclear and I'd be happy to help out!


Yes, in at least two ways:

- It could be used as the "ingestion process" itself: write the ingestion as a Trustfall query (or queries) over the data sources, and store the results in a traditional database or other system.

- It could be used to make the "ingestion" a mere implementation detail: you could write a query over all the data sources, and that query could be executed against the raw data sources themselves, or some "ingested" format which would presumably be faster and more efficient. The query can't tell the difference, and this frees up the ingestion system to evolve independently of how it's being used.
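
A sketch of the first approach, with a plain generator standing in for the Trustfall query (its rows are fictional) and SQLite as the traditional database receiving the results:

```python
import sqlite3

def run_ingestion_query():
    # Stand-in for executing a Trustfall query over some messy sources.
    yield {"title": "Post A", "score": 42}
    yield {"title": "Post B", "score": 7}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (title TEXT, score INTEGER)")
# Stream the query's rows straight into the database.
conn.executemany(
    "INSERT INTO posts (title, score) VALUES (:title, :score)",
    run_ingestion_query(),
)
titles = [t for (t,) in conn.execute(
    "SELECT title FROM posts ORDER BY score DESC")]
```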

It sounds like you might have a specific use case in mind? I'd love to learn more about it if so! My Twitter DMs are open, or I could give you my email address if you'd prefer that.


Very interesting, sounds kinda like GraphQL counterpart to https://github.com/cube2222/octosql


The syntax is reminiscent of GraphQL, but semantically it's a lot more expressive and not compatible. It's actually much closer to SQL in expressive power: left joins, recursion, arbitrary filters, aggregation, etc. -- none of that works in GraphQL, but all of it is supported in Trustfall.


At first glance, the goal of the project feels very similar to SPARQL.

Somehow it didn't make it to the mainstream.


Wikidata runs a SPARQL endpoint, and I think this integrates with Wikipedia. Not the most popular thing in the world, but it is not fringe technology.


and do we know how many people use SPARQL for querying wikidata?..


FWIW, I've genuinely spent time trying to use it, and I've never managed to run any non-trivial query on it without having it time out. SPARQL is very expressive, but its implementations seem impossible to optimize and as a result impossible to use because they have to be locked down so tight.


True!


Already way too many tabs in the example. I could see this getting completely unreadable very quickly.


The tabs are due to mobile limitations, it's hard to fit a full-fledged query editor and documentation browser on such a small screen. Try it on desktop! It's much easier to manage there, more like a typical GraphQL editor. Full disclosure: I wrote it.


Isn't this the same as converting websites into a GraphQL API? Aren't there already dozens of projects and services that do the "convert websites into an API", at scale? What exactly is the innovation?


No, it isn't: it's neither about websites nor GraphQL.

The query language is substantially more powerful than GraphQL, much closer to SQL. And the idea is unifying and querying any kind of data, not just websites but also databases, raw files, ML models, anything at all.

For example, the cargo-semver-checks semver linter for Rust uses Trustfall to abstract over the different JSON formats in which different Rust versions describe a library's API. The lints are written as Trustfall queries (GraphQL isn't expressive enough for this!) and don't care about how the underlying data is represented. Recently, I was able to tweak how those JSON files are used to serve data, and sped up cargo-semver-checks by over 2000x -- without changing a single query. More info here: https://predr.ag/blog/speeding-up-rust-semver-checking-by-ov...

If you want to see more of it in action, there's a 10min talk featuring demos and real-world use cases here: https://www.hytradboi.com/2022/how-to-query-almost-everythin...


I googled “convert website into a GraphQL API” and literally found no matches. Did you have any specific projects/services in mind that attempt to do that kind of thing?


Also relevant - High Performance Open Source ELT Framework - https://github.com/cloudquery/cloudquery


This looks a lot like what Apache Drill already does.


This looks super cool. Going to put this together with Dagster tonight.

@obi1kenobi, could you comment a bit on the motivation and background to this project?


Might I direct you to a 10min talk I gave on the subject? https://www.hytradboi.com/2022/how-to-query-almost-everythin...

The motivation is that queries that are trivial in a SQL database are painfully impractical over any other data source. Most query languages are tied way too closely to a data format or representation, and fail miserably at the boundaries to other systems or representations. Yes, we can e.g. plug SQL into APIs but there's a severe impedance mismatch there -- we end up forced into all sorts of unintuitive situations and poor UX because SQL was never meant for APIs.

Based on experiences I had at work (some of them covered in that talk!), it felt like an "LLVM for data sources" could be useful -- that's what Trustfall is. Yes, it includes a query language. That's because SQL and GraphQL and the like just weren't a good fit to demonstrate true datasource-agnostic queries: we want a modern GraphQL-like user experience, with SQL-like expressiveness but without the limitations that make it hard to plug in APIs and ML tools. But the query language is just one layer of the system, and it's entirely possible to replace it with another language parser while keeping the rest of the benefits.

A key benefit of Trustfall is that you can write queries over any data sources, and then change the underlying implementation to add optimizations or to migrate to a new system without the queries needing to change. For example, cargo-semver-checks, a Rust semver linter I built on top of Trustfall, supports 4 separate format versions of the underlying data source it uses with the same set of queries, and I was recently able to speed up its execution by over 2000x with a tiny amount of code and effort and without even a single query needing to change -- just by tweaking how the data adapter executed queries to plug in some indexes where they mattered: https://predr.ag/blog/speeding-up-rust-semver-checking-by-ov...
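That kind of speedup is basically the classic "swap a linear scan for an index" story, except hidden behind the adapter so the queries themselves never change. A minimal Rust sketch of the idea -- generic illustrative code, not the actual Trustfall adapter API:

```rust
use std::collections::HashMap;

// A toy "vertex": one item in the dataset an adapter serves.
#[derive(Clone)]
struct Item {
    id: u64,
    name: String,
}

// Naive adapter strategy: linear scan over every item, O(n) per lookup.
fn find_by_id_scan(items: &[Item], id: u64) -> Option<&Item> {
    items.iter().find(|item| item.id == id)
}

// Optimized adapter strategy: a prebuilt index, O(1) per lookup.
// Queries can't tell the difference -- only the adapter changed.
struct IndexedItems<'a> {
    by_id: HashMap<u64, &'a Item>,
}

impl<'a> IndexedItems<'a> {
    fn new(items: &'a [Item]) -> Self {
        Self {
            by_id: items.iter().map(|item| (item.id, item)).collect(),
        }
    }

    fn find_by_id(&self, id: u64) -> Option<&'a Item> {
        self.by_id.get(&id).copied()
    }
}

fn main() {
    let items: Vec<Item> = (0..1000)
        .map(|id| Item { id, name: format!("item-{id}") })
        .collect();
    let indexed = IndexedItems::new(&items);

    // Both strategies return the same answer; only the cost differs.
    assert_eq!(
        find_by_id_scan(&items, 999).map(|i| &i.name),
        indexed.find_by_id(999).map(|i| &i.name),
    );
}
```

The real change was inside the rustdoc adapter rather than a toy Vec, but the shape of the win is the same: build the map once, then answer each lookup in O(1) instead of rescanning the whole dataset per query step.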

Excited to see what you build! Please post a link, or email / DM me on Twitter if you'd rather not post publicly. The docs are still sparse (I wasn't planning to do a Show HN for another couple of months, but someone beat me to it today), but I'd be happy to help with anything. There are Python and Rust bindings, and feel free to open issues for anything that's confusing:

https://pypi.org/project/trustfall/

https://crates.io/crates/trustfall_core


What join algorithms does this support? How many data sources are integrated so far (what’s the MOM growth?) What’s the largest dataset this integrates with? And how many different people do you estimate have written queries in the language?


As I wrote in reply to your other comment, the join algorithms are the adapter's choice. I wouldn't necessarily recommend using it as a SQL replacement, though.

There's no measurable MoM growth yet. This is the first major look the broader community has had at the system outside of a relatively small conference where I gave a talk, and demos I've given to friends and colleagues. I wasn't planning on submitting to HN for another few months -- but someone else beat me to it today :)

There are production-grade adapters for the HackerNews APIs and for Rust's rustdoc JSON format, used by the Playground (https://play.predr.ag/hackernews) and the Rust semver-checking linter cargo-semver-checks, respectively. There are also a few more demo adapters in the Trustfall repo itself: https://github.com/obi1kenobi/trustfall

I expect the number of adapters to grow significantly in the coming months, so please stay tuned and let me know if you have datasets you'd like to try it with!

The "largest dataset" is a bit of a trick question: how big is the dataset of "all HackerNews data available via its Firebase and Algolia APIs"? Because that's what the Playground queries.

The Rust semver linter has been used by dozens if not hundreds of crates, and the JSON payloads in question there are in the 100-400MB range. The example in this blog post runs 40 quite complex Trustfall queries (they express semver rules!) over 400MB across two JSON files in 8 seconds: https://predr.ag/blog/speeding-up-rust-semver-checking-by-ov...

You can also see more real-world use cases in this talk I gave last year: https://www.hytradboi.com/2022/how-to-query-almost-everythin...

Probably around 200-300 people or so have written queries in the language, with experience levels ranging from seasoned engineers to Excel and SQL analysts with no programming experience outside of that domain. The language itself is actually a refinement of an earlier open-source project I developed and open-sourced through my previous job. That project was used to query everything from a multi-TB SQL cluster to APIs to ML models, and with Trustfall I've taken the opportunity to revisit and update design decisions that I came to regret in the previous project.


obligatory, "written in Rust".


It is! But it has Python and JS bindings, and the web playground uses Trustfall compiled to WASM: https://play.predr.ag/



