Entity Resolution: Reflections on the most common data science challenge (tilores.io)
60 points by Major_Grooves on Sept 13, 2022 | 8 comments



I recently had to solve the entity resolution problem at my workplace, and here is how I went about it.

Problem Statement:

   - Find the right entity for a query among ~50m entities.

   - Queries may contain only a few of the entity attributes (if an entity has n attributes, a query can include anywhere from 1 to n of them)

   - Queries can mention attributes in various ways (partial information, typing errors, abbreviations, extra information, etc.)
Existing Solution:

    - Elasticsearch-based match, using complicated heuristics overfit on a small training set. It gets worse over time as the number of entities increases; top-20 retrieval accuracy was around 40% at the current number of entities.
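
For reference, the baseline is roughly this kind of fuzzy multi-match query; the index name, fields and boosts below are just illustrative, the real heuristics were more involved.

    # Rough sketch of an Elasticsearch fuzzy match over entity attributes.
    # Index name, fields and boosts are illustrative, not the actual heuristics.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    resp = es.search(
        index="entities",
        query={
            "multi_match": {
                "query": "acme ind. supplies berlin",    # a noisy user query
                "fields": ["name^3", "address", "city"],  # boost the name field
                "fuzziness": "AUTO",                      # tolerate typos
            }
        },
        size=20,  # top-20 candidates
    )
    candidates = [hit["_source"] for hit in resp["hits"]["hits"]]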
Implemented Solution:

     - Embedding search using a sentence-embedding model (pretrained DeBERTa, fine-tuned for this problem) trained via contrastive learning, where positive pairs are generated by applying augmentations to each attribute that best mimic real queries (chosen after going through many user queries)

     - Top-20 accuracy was around 98%. Picking the right entity from those candidates was done through heuristics and other business logic with a proper confidence measure (hyperparameters tuned on a validation set); after the final pipeline we got high-confidence top-1 results of around 99.995% precision and 86% recall.
We ended up going with Pinecone for the embedding search, and search latency was around 100ms (top 50 among ~50m embeddings).
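
For anyone curious, the training setup looks roughly like this with sentence-transformers; the augment() helper and the base checkpoint here are just illustrative, not our exact pipeline.

    # Rough sketch of the contrastive fine-tuning described above.
    # The augment() helper and base checkpoint are assumptions for illustration.
    import random
    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    def augment(record: str) -> str:
        """Hypothetical query-mimicking augmentation: drop tokens (partial info)
        and abbreviate long tokens, as seen in real user queries."""
        tokens = record.split()
        kept = [t for t in tokens if random.random() > 0.3] or tokens
        kept = [t[:3] + "." if len(t) > 6 and random.random() < 0.2 else t for t in kept]
        return " ".join(kept)

    # Placeholder canonical entity strings; in practice, the ~50m entity records.
    entity_records = [
        "ACME Industrial Supplies GmbH | Hauptstr. 1 | Berlin",
        "Foo Logistics Ltd | 12 King St | London",
    ]

    # Positive pairs: (noisy query-like text, canonical record). In-batch
    # negatives come for free with MultipleNegativesRankingLoss, which gives
    # the contrastive setup.
    train_examples = [InputExample(texts=[augment(r), r]) for r in entity_records]
    train_loader = DataLoader(train_examples, shuffle=True, batch_size=64)

    model = SentenceTransformer("microsoft/deberta-v3-base")  # wrapped with mean pooling
    loss = losses.MultipleNegativesRankingLoss(model)

    model.fit(train_objectives=[(train_loader, loss)], epochs=1, warmup_steps=1000)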


Can you add data to your solution in real time, and is it possible to delete records? How and where are you hosting your solution (on-premise, cloud, ...)? Is it highly available, and does it automatically scale?

Did you try zentity (for elastic) or zingg.ai?


It supports all CRUD operations. We used Pinecone as the vector storage; it's highly available and can be scaled to additional pods. The existing solution was using zentity.
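
Roughly, the CRUD side against Pinecone looks like this (index name, environment, IDs and metadata here are placeholders; the real pipeline wraps this with the business logic mentioned above).

    # Sketch of CRUD against a Pinecone index (2022-era pinecone-client API).
    # Index name, environment, IDs and metadata are placeholders.
    import pinecone
    from sentence_transformers import SentenceTransformer

    pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")
    index = pinecone.Index("entities")
    model = SentenceTransformer("path/to/finetuned-model")  # the fine-tuned embedder

    # Create / update: upsert an entity embedding in real time
    vec = model.encode("ACME Industrial Supplies GmbH | Hauptstr. 1 | Berlin").tolist()
    index.upsert(vectors=[("entity-123", vec, {"name": "ACME Industrial Supplies GmbH"})])

    # Read: nearest entities for a noisy query
    qvec = model.encode("acme ind. supplies berlin").tolist()
    result = index.query(vector=qvec, top_k=20, include_metadata=True)

    # Delete: remove an entity record
    index.delete(ids=["entity-123"])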


The main impediment to companies adopting entity resolution tech is the incentive structure. Companies want to show growing user numbers, transactions, leads, orders, etc. Alas, if you look closely and sift out the dupes/frauds, your growth looks a lot less impressive. So why look closely?


I think the point there is that you want to have clean data. Sure, your numbers would look better if they were bigger - like for Twitter and the bots... However, you also need to see the other side - the operational one. If you have the same customer 5 times in your data, you will also target that customer 5 times with the same marketing initiative, you will have 5 times the costs, etc.

When it comes to compliance, it gets even worse. If you have to answer a GDPR DSAR and you have 5 different records for one person but only show one, you can get into serious trouble with the authorities and also pay high fines.

So I think a smaller amount of high-quality data is worth more than a lot of trash data.


Here you can read more about the mentioned Twitter ER problem: https://www.linkedin.com/feed/update/urn:li:activity:6931955...


How are you storing and querying your graph data? Are you using a graph database underneath all that or ?


I am one of the developers (and a co-founder) of TiloRes. We are not using a graph database for that purpose, because it would be way too slow for huge entities. Instead, we're using a combination of AWS serverless services; for storing the data, mainly DynamoDB and S3.
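
To illustrate the general idea (this is a simplified sketch, not our actual schema; table name, keys and item layout are just for illustration): entities live as plain key-value items in DynamoDB, with large payloads offloaded to S3, so an entity lookup is a single key read rather than a graph traversal.

    # Minimal boto3 sketch of storing resolved entities in DynamoDB instead of a
    # graph database. Table name, key schema and item layout are illustrative only.
    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("entities")  # assumed table with partition key "entity_id"

    # Store an entity together with the IDs of the raw records merged into it.
    table.put_item(Item={
        "entity_id": "entity-123",
        "record_ids": ["rec-1", "rec-7", "rec-42"],    # duplicates resolved into this entity
        "payload_s3_key": "entities/entity-123.json",  # large payload offloaded to S3
    })

    # Look up an entity by ID: a single key-value read, no graph traversal needed.
    resp = table.get_item(Key={"entity_id": "entity-123"})
    entity = resp.get("Item")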



