It would be extremely handy to have a document store / graph database which had row polymorphism. I started writing a prototype, but unfortunately never got there. I wonder if any such things already exist?
For the index structure itself you can use succinct dynamic data structures to reduce the size. The bulk loading approach is particularly amenable to succinct data structures: because the node-vectors in an HNSW are monotonic, they can use an Elias-Fano encoding. The neighborhoods can use log-arrays.
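A rough sketch of the Elias-Fano idea in Python, for illustration only. It assumes the "node-vectors" are non-decreasing lists of integer ids below a known bound; the function names are made up here, and a real succinct implementation would pack the bits and add rank/select support rather than use Python lists.

```python
def elias_fano_encode(values, universe):
    """Encode a non-decreasing list of ints in [0, universe).

    Returns (low_bits, lows, highs): each value is split into low_bits
    low-order bits stored verbatim, and a high part stored in unary.
    """
    n = len(values)
    # Roughly floor(log2(universe / n)) low bits per element, never negative.
    low_bits = max(0, (universe // n).bit_length() - 1) if n else 0
    mask = (1 << low_bits) - 1
    lows = [v & mask for v in values]
    highs = []          # bit list: a 0 per bucket skipped, a 1 per element
    prev_bucket = 0
    for v in values:
        bucket = v >> low_bits
        highs.extend([0] * (bucket - prev_bucket))
        highs.append(1)
        prev_bucket = bucket
    return low_bits, lows, highs


def elias_fano_decode(low_bits, lows, highs):
    """Rebuild the original list from the two parts."""
    out = []
    bucket = 0
    i = 0
    for bit in highs:
        if bit == 0:
            bucket += 1
        else:
            out.append((bucket << low_bits) | lows[i])
            i += 1
    return out


ids = [2, 3, 5, 7, 11, 13, 24]
encoded = elias_fano_encode(ids, universe=32)
assert elias_fano_decode(*encoded) == ids
```

The saving comes from the high parts: because the list is sorted, they only ever step forward, so they cost about two bits per element regardless of how large the ids are.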
Creating a multi-index also seems possible, where you use the triangle inequality to prune the candidate vectors. This will probably require storing the neighbor distances in the bottom layer in order to be time-efficient at query time, so the index would be slightly bigger. I haven't tried it yet but I intend to.
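To make the pruning idea concrete, here is a minimal Python sketch. It is not how any existing index does it, just the bare inequality: if the distance from a node `p` to its neighbor `c` is stored, then `d(q, c) >= |d(q, p) - d(p, c)|`, so a neighbor whose lower bound already exceeds the search radius can be skipped without computing its distance. The function name, the dict layout and the Euclidean metric are all assumptions for the example.

```python
import math


def prune_with_triangle_inequality(query, pivot, neighbors, stored, radius):
    """Find neighbors of `pivot` within `radius` of `query`.

    neighbors: {name: vector}; stored: {name: precomputed d(pivot, vector)}.
    Returns (hits, skipped), where skipped counts avoided distance computations.
    """
    d_qp = math.dist(query, pivot)
    hits, skipped = [], 0
    for name, vec in neighbors.items():
        lower_bound = abs(d_qp - stored[name])  # d(q, c) >= |d(q, p) - d(p, c)|
        if lower_bound > radius:
            skipped += 1                        # cannot be within radius
            continue
        d = math.dist(query, vec)
        if d <= radius:
            hits.append((name, d))
    return hits, skipped


pivot = (0.0, 0.0)
neighbors = {"a": (1.5, 0.0), "b": (10.0, 0.0)}
stored = {name: math.dist(pivot, vec) for name, vec in neighbors.items()}
hits, skipped = prune_with_triangle_inequality((1.0, 0.0), pivot, neighbors, stored, radius=1.0)
```

This only works for a true metric; with a similarity that does not obey the triangle inequality the bound is not valid.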
There is also the possibility of using random projection for dimensionality reduction. I haven't tried this yet either but will give it a go soon. We haven't folded the parallel HNSW into our open source VectorLink yet, but we'll be doing it in the near future - we wanted a little bit of stability of approach first.
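For anyone unfamiliar with random projection, a minimal standard-library Python sketch follows. It uses a Gaussian random matrix scaled by `1/sqrt(out_dim)`, which is one common choice and not necessarily what would end up in VectorLink; the function names and the seed parameter are invented for the example.

```python
import math
import random


def random_projection_matrix(in_dim, out_dim, seed=0):
    """Build an out_dim x in_dim matrix of scaled Gaussian entries."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(matrix, vec):
    """Map a vector into the lower-dimensional space (matrix-vector product)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


matrix = random_projection_matrix(in_dim=1536, out_dim=64)
embedding = [0.01 * i for i in range(1536)]
reduced = project(matrix, embedding)   # 64 floats instead of 1536
```

The same matrix has to be used for indexed vectors and for queries. Distances are only approximately preserved, so recall would need to be measured against the full-dimension index.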
I'm going to need to research to understand most of what you said here before being able to give a coherent response. Thanks for giving me an opportunity to learn!
TerminusDB founder here. TerminusDB is now at the absolute outer edge of compact data representation. Neo4j literally takes 10 times more space for the same data, meaning that you'll need 10 times as much memory for a big graph. Succinct datatypes will have their day yet.
I was looking at various forms of indexing solutions to solve search and clustering problems with TerminusDB for clients. When I compared these solutions against embeddings from LLMs, the LLMs were just far easier to work with and got much better results. I believe traditional text indexing will die quickly, as will a lot of the Entity Resolution and traditional clustering methods, which will be replaced completely by LLMs. We found them so compelling we wrote our own open source vector database sidecar: https://github.com/terminusdb-labs/terminusdb-semantic-index...
TerminusDB represents the data using succinct data structures which reduces the required memory substantially over many other representations. Each branch needs to be capable of being loaded into memory completely - but individual revisions are loaded separately.
Diffs can be constructed between two objects, or you can get sets of diffs of objects between commits automatically. You can manually construct diffs and use them to patch branches.
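As a toy illustration of the diff-then-patch workflow, here is a Python sketch over flat dictionaries. The diff format here is invented for the example and is not TerminusDB's actual diff or patch representation; it only shows the shape of the idea, including refusing to apply a patch when the target does not match the "before" side.

```python
_MISSING = object()  # marks a key that is absent on one side


def diff(before, after):
    """Return {key: (old, new)} for every top-level key whose value differs."""
    changes = {}
    for key in before.keys() | after.keys():
        old = before.get(key, _MISSING)
        new = after.get(key, _MISSING)
        if old != new:
            changes[key] = (old, new)
    return changes


def patch(doc, changes):
    """Apply a diff to a copy of doc; raise if doc does not match the old side."""
    result = dict(doc)
    for key, (old, new) in changes.items():
        if result.get(key, _MISSING) != old:
            raise ValueError(f"conflict on {key!r}")
        if new is _MISSING:
            result.pop(key, None)
        else:
            result[key] = new
    return result


before = {"name": "Jane", "age": 30}
after = {"name": "Janine", "age": 30, "city": "Dublin"}
changes = diff(before, after)
assert patch(before, changes) == after
```

Checking the old value before writing is what lets a patch built against one branch be applied to another safely: if the other branch has diverged on that key, the patch fails rather than silently overwriting.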
We don't have a conversion tool from SQL database tables, but it's something on my list.
Mainstream languages will lag behind others forever. The average programmer in Java or C# neither cares for nor understands 'new' features or concepts. Most people I know in this area have never even heard about prolog, lisp, smalltalk, etc.
> Datomic Cloud is slow, expensive, resource intensive, designed in the baroque style of massively over-complicated CloudFormation astronautics. Hard to diagnose performance issues. Impossible to backup.
You should give TerminusDB a go (https://terminusdb.com/), it's really OSS, the cloud version is cheap, fast, there are not tons of baroque settings, and it's easy to backup using clone.
TerminusDB is a graph database with a git-like model with push/pull/clone semantics as well as a datalog.
Hi this is Gavin and I founded TerminusCMS (terminusdb.com).
CMS stands for "content management system" and headless means API-based, with no restrictions over where you use the content. Devs are you folks.
Existing headless CMS tools sometimes make it up as they go along - starting with the idea of ‘I want to build a company that delivers a headless CMS’ and then quickly slapping a bunch of technologies together and ending up with a pile of JSONs floating around a MongoDB or another similar frankenstein. As we’d already built the document graph data layer from the ground up, we could properly integrate the CMS features to give a seamless developer experience.
We are never going to send you to a screen that says, ‘TerminusCMS requires NodeJS version 10+ and a Mongo database’. It is all contained in one in-memory, highly compressed data management system that we designed and built for this specific purpose.
A few of the other offerings are just headless markdown backed by git - which is actually a good idea, but comes with the capacity and performance limitations that git implies.
Why not have a highly performant document graph content management system that incorporates the most important concepts from git in the data layer? That gives you all the version control features that content needs (clone, push, pull, branch, revert, merge), which are highly awkward in other systems.
We also thought that GraphQL was the obvious data manipulation choice for content, but found weak implementations wherever we looked. For TerminusCMS, we’ve implemented a suite of features which allows you to query a TerminusCMS project using GraphQL in such a way that deep linking can be discovered. We can use path queries with GraphQL.
To summarize our market thoughts - it seems that devs want:
* Deploy anywhere open-source
* Dev-first, in-memory, highly compressed, scalable and fast content management so you can build complex and fully featured web apps of every shape and size
* Git-like features to help with permissions and version control (and - crucially - merge)
* Best-in-class GraphQL implementation
And there was nothing offering that mix until we released TerminusCMS.
TerminusCMS is a content platform that sits at the convergence of content and knowledge. It is a model-driven, API-first approach to content management. With TerminusCMS you can use your data as content. The data employed in content has meanings associated with it, and because that content is well structured, the content can also be used as if it were data - in fact, that content is data. TerminusCMS is structured like Git, so you get all the git-like features for your content engine: history, change management, branching, non-linear development, easy backups, distributed development, and more. You can enrich your content with semantics, which can give you the ability to personalize content and build superior recommender and AI/ML systems.
Content curation needs change requests: you need to be able to add a new translation, or new content in a branch, which is viewable as an entire site, but which only goes into production when you "merge to main".
TerminusCMS gives devs a powerful way to define schema, query, and deliver content and assets to the front end. It has schema-as-code; is standards-based for interoperability; has an extremely fast publishing API; and provides workflows built from the content and network model, not wired into the product itself. It gives editors a way to manage content automatically so that devs don't need to do much to support them. You can use TerminusCMS to build your complex app, or to manage your organization's knowledge.
TerminusCMS is open-source all the way down, so if we vanish or piss you off, you can recover and continue.
The business is cloud hosting and enterprise delivery, which we hope can keep the wolf from the door. It is freemium with a generous free tier and easily cloned examples, so take a look.
This blog gives some details of how to order a wide variety of different data types using lexical embeddings. This is very useful for TerminusDB as we use dictionary structures to represent ids, but it would also be relevant in radix trees or other database structures that rely on sharing of prefixes.
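For a flavour of what an order-preserving byte encoding looks like, here is a small Python sketch for 64-bit integers and floats. These are well-known tricks and not necessarily the exact scheme the blog describes; the function names are made up. The point is that comparing the encoded bytes lexicographically gives the same order as comparing the original values, which is what a sorted dictionary or a prefix-sharing structure needs.

```python
import struct


def encode_int64(n):
    """Signed 64-bit int -> 8 bytes whose byte order matches numeric order."""
    # Shift [-2**63, 2**63) onto [0, 2**64), then write big-endian.
    return (n + (1 << 63)).to_bytes(8, "big")


def encode_float64(x):
    """IEEE 754 double -> 8 bytes whose byte order matches numeric order.

    NaN is not handled, and -0.0 sorts just before 0.0.
    """
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    if bits >> 63:
        bits ^= (1 << 64) - 1   # negative: flip every bit to reverse the order
    else:
        bits |= 1 << 63         # non-negative: set top bit to sort after negatives
    return bits.to_bytes(8, "big")


ints = [5, -1, 0, -300, 2**40]
assert sorted(ints, key=encode_int64) == sorted(ints)

floats = [2.5, -1.5, 0.0, 1e10, -1e10]
assert sorted(floats, key=encode_float64) == sorted(floats)
```

Values that are numerically close also tend to share leading bytes under this encoding, which is where the prefix-sharing benefit comes from.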