Is the core of the product a bunch of regexes, or something smarter?
Odd that a product that wants you to stake your legal safety on it is not more transparent about the actual implementation on its landing page; instead it reads like an advertisement for closed-source SaaS.
Hey, I'm one of the maintainers of Presidio OSS.
We built this tool mainly for Microsoft's strategic customers and decided to open-source it for others to use. No hidden agenda here.
The engine has two main advantages: It's easily expandable and customizable, and it works well at scale. We do know of organizations who are very close to production with it.
Every organization has its own requirements for PII entities, many of them specific to the org itself, so the engine lets a developer easily add support for new PII entities using code, regex, or blacklists.
As for the productization aspect, we ran some performance tests and are confident about taking it to production. A single-instance cluster on a medium-size machine has a ~65ms response time for a 100-word sentence. Using better machines lowers the response time to ~24ms for 100 words and 150ms for a 1,000-word input.
The current service uses regex for known patterns and spaCy for named entity recognition (person names, places, etc.). Users often build custom ML models to detect new types of entities.
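To make the "code, regex or blacklist" idea concrete, here is a minimal sketch in plain Python. It is not Presidio's actual API: the `PatternRecognizer` and `Finding` classes, the `EMPLOYEE_ID` and `PROJECT_NAME` entities, and the fixed 0.6 score are all invented for illustration. It only shows how a regex and a blacklist (deny-list) can feed the same kind of finding.

```python
import re
from dataclasses import dataclass


@dataclass
class Finding:
    entity_type: str
    start: int
    end: int
    score: float


class PatternRecognizer:
    """Hypothetical recognizer: one entity type, found via regexes and a deny-list."""

    def __init__(self, entity_type, patterns=(), deny_list=(), score=0.6):
        self.entity_type = entity_type
        self.score = score  # fixed, made-up confidence for every match
        self.regexes = [re.compile(p) for p in patterns]
        if deny_list:
            # A deny-list is treated as an alternation of escaped literal terms.
            terms = "|".join(re.escape(t) for t in deny_list)
            self.regexes.append(re.compile(rf"\b(?:{terms})\b", re.IGNORECASE))

    def analyze(self, text):
        return [
            Finding(self.entity_type, m.start(), m.end(), self.score)
            for rx in self.regexes
            for m in rx.finditer(text)
        ]


# Org-specific entities, both invented for this example.
employee_id = PatternRecognizer("EMPLOYEE_ID", patterns=[r"\bEMP-\d{6}\b"])
project_name = PatternRecognizer("PROJECT_NAME", deny_list=["Bluebird", "Nightjar"])

text = "EMP-004217 was moved onto project bluebird last week."
findings = employee_id.analyze(text) + project_name.analyze(text)
```

Each finding carries character offsets, so a later step can replace `text[start:end]` with a placeholder such as `<EMPLOYEE_ID>`.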
Presidio is free, completely transparent, and fully customizable. Feel free to use it and let us know what you think.
"Simply follow the instructions" -> "Simply follow the <US_DRIVER_LICENSE>" ; same for "contribution". So I'm guessing some overly eager regex is to blame, which doesn't make you super confident about using this for something sensitive.
Thanks, this is indeed the case. The US_DRIVER_LICENSE confidence is 0.01 and the demo doesn't put any threshold on the response. We're working on fixing the demo.
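For anyone wondering what a threshold would change here, a rough sketch in plain Python follows. The result tuples, the `filter_by_score` helper, and the 0.5 cutoff are assumptions made for illustration, not the demo's real output format or Presidio's API; only the 0.01 driver-license score comes from this thread.

```python
# Hypothetical analyzer output: (entity_type, matched_text, confidence score).
results = [
    ("US_DRIVER_LICENSE", "instructions", 0.01),
    ("US_DRIVER_LICENSE", "contribution", 0.01),
    ("PERSON", "Jane Doe", 0.85),  # invented high-confidence finding
]


def filter_by_score(results, threshold=0.5):
    """Keep only findings whose confidence reaches the threshold."""
    return [r for r in results if r[2] >= threshold]


kept = filter_by_score(results)
```

With any cutoff above 0.01 the two spurious driver-license matches drop out. The trade-off is that a higher cutoff also risks dropping real but low-scoring matches.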
I can see this being used to detect accidentally leaked sensitive content (for example scanning repos, outgoing emails, shared folders).
However, using this to redact material sounds a bit risky. Are there use cases where you could accept the potential mistakes (missing something that should have been redacted)?
Although Presidio would like you to think they are cloud security (higher margins), they are really just a VAR/reseller. Weak services, but they will sell you stuff. So will I!
More importantly, this is proximate enough that they may sue for the name, OP.
This is good. It just needs a HASHBYTES-style formula for Excel to make anonymization accessible to the majority of MS customers who are fumbling around with PII haphazardly.
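Outside Excel, the same idea is a few lines of Python using only the standard library. This is a sketch under assumptions: the `pseudonymize` helper, the lowercase/trim normalization, and the placeholder key are all made up for illustration. It uses a keyed hash (HMAC-SHA-256) rather than a bare hash, because unkeyed hashes of guessable values such as emails or phone numbers can be reversed by brute force.

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes) -> str:
    """Return a keyed SHA-256 digest (hex) of a normalized value."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()


# Placeholder key for the example; a real key should be stored apart from the data.
key = b"replace-with-a-secret-key"

a = pseudonymize("Jane.Doe@example.com", key)
b = pseudonymize(" jane.doe@example.com ", key)
```

The same input always maps to the same token, so joins and counts across sheets still work, but the original value cannot be read back without the key.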