A novel approach to entity resolution using serverless technology (tilodb.com)
63 points by Major_Grooves on Aug 11, 2021 | 28 comments



Hi, I’m one of the (prospective) co-founders of TiloDB, a serverless “entity-resolution” technology.

We built TiloDB as the tech team at a European consumer credit bureau, where we faced the technical challenge of assembling hundreds of millions of records about tens of millions of people in a way that is scalable and allows fast searching, without breaking the bank.

We tried various technologies, such as graph databases, but none of them could give us satisfactory performance.

So we turned to serverless technology (AWS specifically) to build a new type of entity-resolution system.

In this article we write about the technology breakthroughs that led to TiloDB, and there is also an interactive demo where you can submit data, see it linked, and see other people submitting data in real-time.

We want to spin the tech out into a new company, release it as OSS, and so are keen to hear about potential use cases you might have.


Very cool project.

Setting up a business based on a new DB tech that has one user, though, is tricky.

Playing devil's advocate, how do you plan to make money? Who are the users, why do they turn to TiloDB, how do they learn about it, how do they adopt it, how are they convinced to pay you something for it? Etc.


Thanks for your question. You are right - it is not an easy business to start. Investors are more used to open source projects that are already released and have community adoption that they can measure. We are kinda the opposite - enterprise-ready software that wants to go open source.

So we want to make the software open source, but restrict a few modules that would be necessary for enterprise customers, such as security and auditing features.

We have quite a few companies lined up that want to do proof-of-concept trials with us. So far they are mostly big fintech companies that would use it for anti-fraud, plus AML/KYC companies that need to match and search lots of data from different sources in real time, and very large companies that need to solve their "data silo" problem.

Adoption - hopefully they start with the OSS version, play with it, then want to upgrade to the enterprise version.

One area where we have less experience is which type of OSS licence to use.


Disclosure: I used to work on Google Cloud.

I see a lot of similarities to Kafka and Confluent here. You're looking to spin out a tool that worked well for you, and offer it commercially.

I'd suggest planning more around operating TiloDB as a managed service. You happen to just need lambda, s3, and dynamo today, but the "capturable" value for many customers will be if you also manage it all for them (especially upgrades). You can still offer the open core and let folks run their own, but it sounds like a lot of the goodness comes from the way you run it.

Having said that, licensing is currently fraught in this space. Each major "database" vendor (Elastic, Redis Labs, Confluent) is basically trying to figure out how to prevent AWS (and other clouds) from just taking their code and operating it as a service.

People have very strong opinions on this topic, ranging from "open-source isn't a business plan" to "AWS is violating the spirit of the OSS community" and many more. My personal advice would be to assess more clearly why you want to be open source (you mentioned community and applications you couldn't imagine) and whether you think open source better achieves those goals than say a free tier or distributing a core binary / container image for free.

What, more specifically, do you want to get out of being open source? Contributions to the core? Contributions to the operational part? More users and feedback?


You make a very good point. In fact, we just came out of a call with a very knowledgeable investor who wants us to go down exactly that route. It could very well be that the managed service route is the way to go. Thanks for the comment. If you'd like to connect directly, I think my email address is in my profile.

Re OSS strategy - it's more users and feedback that I think of first. For instance, people keep telling us that there could be a really great use case in crypto compliance/auditing - tracking related wallet addresses, etc. We don't really know enough about blockchain to validate that, but I think an OSS community could.


I think what you might want to do is try the closed-source / SaaS route first, either by locking in that 2nd customer or taking some investment. If it doesn't work, then start down the open source route.

It seems to me that all the knowledge / experience of your staff is your real asset here, not necessarily the code. While that, in theory, should mean open-sourcing the code won't matter for your business, in practice it means you will be seeding competitors unnecessarily.

Turning your staff into highly chargeable consultants could be a more sustainable business model than trying to herd the cats of the internet into improving your product offering, when most of those guys won't have the experience of your existing team. By offering the code out as open source you are giving a bunch of people a leg up and cutting short your time as the only game in town, which puts extra pressure on sales and might not work out.


Just a quick suggestion: if you restrict the security module, do so in a way that someone can still run the OSS version in a basically secure way. If there are lots of insecure instances of your db out there, or someone else steps in and provides a solution, that doesn't reflect well on the project. This wasn't great about Elasticsearch, and they changed it later.


The idea is that typical enterprise features, like authorization for certain records or even attributes, are not publicly available. Encryption of the data in S3 and other parts may also be enterprise-only features. Other things, like API authorization and preventing public access to S3, must be included in the OSS version for the same reasons you mentioned.
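
Just to illustrate what that baseline could look like (a minimal sketch with a made-up bucket name, not actual TiloDB code), the OSS version could ship with something like the following S3 hardening by default:

    import boto3

    s3 = boto3.client("s3")
    bucket = "tilodb-entity-data"  # hypothetical bucket name

    # Block all public access to the bucket holding entity data.
    s3.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )

    # Default server-side encryption; stronger options (KMS keys,
    # per-attribute encryption) could then live in the enterprise modules.
    s3.put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration={
            "Rules": [
                {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
            ]
        },
    )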


Also highly interested! It would be awesome if the database could be accessible via a C/C++ or Rust library in order to integrate it into existing applications, if that makes sense.


You would be able to access everything via the GraphQL API to integrate into existing applications.
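
As a rough idea of what that could look like from any language with an HTTP client (the endpoint and schema below are invented for illustration, not the actual TiloDB API):

    import requests

    # Hypothetical GraphQL endpoint and schema, for illustration only.
    query = """
    query Resolve($input: RecordInput!) {
      resolveEntity(input: $input) {
        entityId
        records { source attributes }
      }
    }
    """
    variables = {"input": {"email": "jane@example.com", "lastName": "Doe"}}

    resp = requests.post(
        "https://api.example.com/graphql",
        json={"query": query, "variables": variables},
    )
    print(resp.json())

A thin C/C++ or Rust wrapper around that HTTP call would work the same way.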


This blog post from the CTO of VMware and SpringSource gives a pretty good summary of the entity-resolution field: https://blog.acolyer.org/2020/12/14/entity-resolution/


ER for identity graphs is a great use case! We see teams do this a lot and with not-great tools. (Ex: users/IPs in splunk/elastic, which are better for simpler matches.)

For one Graphistry project, we run a single node neo4j with 0.5b nodes/edges, so something in the description isn't adding up for me here wrt perf. Maybe an open benchmark would help?

I do agree indexing matters, as that was night/day for our use cases. For ML workloads, we are looking at vector indexes, which graph DBs do not currently support. The ones in this article are on text and take > 100ms, so I'm curious..


Regarding the performance of Neo4j: the challenge for an honest and fair test is how to properly compare a server-based solution with a serverless one. TiloDB automatically scales up and down without any further interaction because it uses Lambdas for all calculations. So would you compare it with a relatively small Neo4j instance or with a large cluster? I honestly don't know.

When we started doing this internally at our previous company, we obviously did test the edge cases. After all, we didn't want to build our own solution. For graph databases, the problem cases are either a lot of edges leaving one node or a long chain of edges. The first scenario was still handled OK-ish: response times of around 6 seconds for 1,000 nodes, if I remember correctly. The second scenario was a total fail. The problem with the latter lies in the transitivity, as a graph database has to jump from one node to the next and so on.

To be fair though: when it comes to dynamic entities, i.e. choosing which rules are relevant, graph databases might be the better choice - especially when response times don't matter.

The response times given in the article are for the whole process of searching and returning the entity. The indexes themselves are obviously a lot faster - to be precise, we are using DynamoDB to store the indexes, which most of the time returns results in <10ms. Compared to other databases this may still sound slow, but we know that we won't run into scaling issues this way, and that's what currently matters most for us.
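
Very roughly (the table and key layout below are guesses for illustration, not the actual TiloDB schema), the point is that a query is a plain key-value lookup instead of a traversal:

    import boto3

    dynamodb = boto3.resource("dynamodb")
    index_table = dynamodb.Table("entity-index")  # hypothetical table name

    def lookup_entity_ids(search_key: str) -> list:
        # The index maps a normalized search key (e.g. a phonetic name key or
        # a normalized address) directly to the entity ids it belongs to, so
        # resolving a search is one GetItem instead of N graph hops.
        item = index_table.get_item(Key={"search_key": search_key}).get("Item")
        return item["entity_ids"] if item else []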

I hope what I wrote makes sense.


Separate benchmark per claim and core use case :) A scale-to-zero + autoscaling graph db could be both broadly relevant and differentiated, so I'd be curious there + table stakes for regular queries.

Re: extremes, we see graph DBs doing OK for small time series (ex: 2 nodes with a bunch of event multi-edges), but not full-blown time series... where we'd use a TSDB. Some vendors demo this, but it always felt like the wrong tool.

The many-hop case is interesting! We don't see 1K-hops typically, and I get nervous even at 10-20 on graph DBs we've used. I can imagine in logistics or sciences that happening more, or maybe even some rdf systems. Partition keys start mattering fast, whether a kvdb or a mpp, but I don't have an intuition here. Probably easier to differentiate on, but too niche?


Thanks for your input.

1k hops is also not something we see on a regular basis in our old business, which is mostly about people moving house and transactional data from payment service providers. People with money issues seem to move a lot more often, and fraud cases also often involve a lot of hops.


I'd be very interested to hear people's thoughts on OSS licences. We are rather new to that world, so we are rapidly learning about the differences between Open Core, the Elastic License 2.0, the Apache License, etc.

What licence should a new company like ours adopt when we want to build a community but also want to commercialise the technology, especially when we already have an "enterprise-ready" version of the tech?


Here is CockroachDB's licensing page: https://www.cockroachlabs.com/docs/stable/licensing-faqs.htm.... Their focus seems to be on forbidding cloud vendors from selling their product as a service unless they allow it (a bit like Elastic).


Afaik Cockroach - as well as Elastic, Redis and many others - followed MongoDB and changed their licenses to a source-available license (not accepted by the OSI as open source) that aims to protect their cloud offering from cloud providers hosting it. Details vary afaik; e.g. Mongo implemented a copyleft effect in their license for that specific case.


From experience with a similar product (where we had a similar-sounding approach to entity resolution based on rule-based fuzzy indexes and fuzzy matching, and it was working for tens of millions of entities on a regular, though beefy, RDBMS more than a decade ago) - the issue isn't so much technological; it is that each customer/client has custom everything when it comes to ER, and thus scaling that business is extremely hard (that specific business collapsed primarily for that reason).


I would really love to hear more about your experience regarding the client customizations. So far the two things I can see are domain-model customization and rule customization, with rule customization obviously being the more challenging one.


Rule customization means match rules and index-generation rules, and all of these customizations are data-source specific. A large company with a lot of departments and divisions, some of them former acquisitions, will have a number of different data sources, as well as some external ones, reference sources in particular. Beside the basic issue of connecting those data sources and pulling data from them, the data in them has different reliability/quality, different standards of maintenance, etc., as well as different roles in the system - some of those sources are master sources, some are only used to extract matches without being updated, and some are [also] consumers of the match/deduplication results, so they have to be updated back. As a result, any such actually implemented ER system is totally one-off.
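
To make that concrete, per-source customization tends to end up as something like this toy sketch (the helper names, rule shapes and roles are invented, just to show the moving parts):

    # Every source brings its own index-generation and match rules, plus a
    # role that says whether match results flow back into it.
    def name_key(record):
        return (record["last_name"].lower(), record["zip"])

    def address_key(record):
        return record["address"].lower().replace(" ", "")

    SOURCE_RULES = {
        "crm_master": {
            "index_keys": [name_key],
            "match": lambda a, b: a["email"].lower() == b["email"].lower(),
            "role": "master",      # authoritative, receives merge results
        },
        "acquired_division": {
            "index_keys": [address_key],
            "match": lambda a, b: name_key(a) == name_key(b),
            "role": "reference",   # matched against, never written back
        },
    }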


Thanks a lot. Sounds similar to the issues we had in our previous company. We provide some basic ETL functionality to tackle these. It will be interesting to see where this eventually ends up, and how much our solution covers versus where it is better to use the ETL tools those companies probably already have.


The real technical challenge is the "transitive hop" problem that we describe. Matching the data is not so complicated - that can be done with any technology - but searching with data A and getting result Z, that was the tricky bit that took us years to solve and was only possible thanks to serverless tech.
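
To spell out the transitive part (a minimal union-find sketch for illustration, not how TiloDB actually implements it): if A matches B, B matches C and C matches Z, a search for A has to return the whole entity, even though A and Z never matched directly.

    # Minimal union-find: pairwise matches are unioned, and a search for any
    # record resolves to everything transitively connected to it.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Pairwise matches produced by the matching rules:
    union("A", "B")
    union("B", "C")
    union("C", "Z")

    # Searching with A resolves to the same entity as Z:
    assert find("A") == find("Z")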


In our case we didn't have an explicit "transitive hop". Instead, in this example it would just be one entity with a bunch of addresses (with related dates, if configured so), multiple known names, etc. attached. Granted, in order to get to that state the data-loading process would include a massive batch matching performed on an array of worker nodes - the number could be configured for a given run, so there is a rough technological similarity to your serverless approach.


That sounds mostly like deduplication, which is often used in marketing contexts. There are indeed some good solutions out there, but in our experience they have difficulty handling huge amounts of data (>1 billion records) and they are often batch-based, so your data is always outdated, whereas we constantly add new data in near real time.


Typically, deduplication is more like clean, standardize and match (using relatively simple match approaches). Our process was much more complex (more along the lines of what another commenter linked: https://blog.acolyer.org/2020/12/14/entity-resolution/) with richer resulting functionality, and, yes, marketing was the segment where it started, though I'd say it was only about a third of the business at its peak. Yep, performance is an issue for most of the solutions. We had to do significant re-engineering at one point to parallelize at much finer granularity and were thus able to scale much further. Our biggest number of entities, counting across all sources in the largest implementation, was just under 200M. That was on pre-Nehalem hardware.
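
(For contrast, that simpler clean/standardize/match style is roughly this in spirit - a toy sketch with invented data, nothing from either product:)

    import re
    from itertools import combinations

    raw_records = [
        {"last_name": "Doe ", "zip": "12345", "email": "Jane@Example.com"},
        {"last_name": "DOE",  "zip": "12345", "email": "jane@example.com"},
    ]

    def standardize(rec):
        # Clean/standardize: lowercase, strip punctuation, trim whitespace.
        return {k: re.sub(r"[^a-z0-9 ]", "", v.lower()).strip() for k, v in rec.items()}

    def block_key(rec):
        # Simple blocking: first three letters of the surname + zip code.
        return rec["last_name"][:3] + rec["zip"]

    records = [standardize(r) for r in raw_records]
    blocks = {}
    for r in records:
        blocks.setdefault(block_key(r), []).append(r)

    # Only compare records within a block, with a simple exact match rule.
    duplicates = [
        (a, b)
        for block in blocks.values()
        for a, b in combinations(block, 2)
        if a["email"] == b["email"]
    ]
    print(duplicates)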

The batch mode naturally had orders of magnitude higher throughput. We did have a real-time single-record mode which was pretty fast as long as the stream of incoming single records didn't saturate the worker array's capacity (here is the difference from serverless: the worker array was limited to whatever was statically configured at the moment, as adding/removing nodes wasn't an instant, on-the-fly operation).

A couple of years later I worked at another company on a similar, though somewhat simpler, project when it was in the process of a total rewrite for performance reasons - the old version was really slow - and that rewrite failed spectacularly for a lot of reasons. So, yes, performance is a noticeable factor in this domain.


Ever look into HPCC Systems?


No, I had not seen them. Thanks - I will check it out.



