Presidio: Customizable data protection and PII data anonymization service (github.com/microsoft)
103 points by yarapavan on Aug 27, 2019 | 20 comments



The core of the product is a bunch of regex or something smarter?

Odd that a product that wants you to base your legal safety on it is not more transparent about the actual implementation on the initial page; instead it reads like an advertisement for closed-source SaaS.


Answering myself: yes, it is a pile of regexes.

https://github.com/microsoft/presidio/blob/master/presidio-a...

And I can't easily find the body of test data... that's not good.


They also have an NLP engine; the regexes are just predefined recognizers for known formats.

https://github.com/microsoft/presidio/tree/master/presidio-a...


Serious question - is there anything fundamentally wrong with using regex here?


For the specific example cited (credit card numbers), the Luhn check is very standard. So no, there isn't :)
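For anyone unfamiliar with the Luhn check, here is a minimal Python sketch of the general idea: a loose regex finds candidates and the checksum filters out most random digit runs. The names `luhn_valid`, `CARD_CANDIDATE`, and `find_card_numbers` are made up for illustration, and this is not Presidio's actual recognizer code.

```python
import re


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


# Illustrative candidate pattern: 13 to 19 digits, optionally separated
# by single spaces or hyphens. Real card formats are more constrained.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def find_card_numbers(text: str) -> list[str]:
    """Return candidate matches that also pass the Luhn check."""
    return [m.group() for m in CARD_CANDIDATE.finditer(text) if luhn_valid(m.group())]
```

The regex alone would flag any 16-digit order number; the checksum is what keeps the false-positive rate down, though roughly one in ten random digit strings will still pass it.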


So you might have more fun with this email one https://github.com/microsoft/presidio/blob/master/presidio-a...

:)

... seriously, they should have broken that down, or used the multi-line with comments pattern, if they really want people to contribute to this.
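To illustrate the "multi-line with comments" style, here is a sketch using Python's `re.VERBOSE` flag. The pattern is a deliberately simplified email matcher written for this example; it is not the regex from the Presidio repo and does not cover everything the email RFCs allow.

```python
import re

# In VERBOSE mode, whitespace outside character classes is ignored and
# everything after an unescaped '#' is a comment, so each part of the
# pattern can be documented in place.
EMAIL = re.compile(
    r"""
    \b
    [A-Za-z0-9._%+-]+      # local part (simplified: no quoted strings)
    @
    (?:[A-Za-z0-9-]+\.)+   # one or more domain labels, each followed by a dot
    [A-Za-z]{2,}           # top-level domain
    \b
    """,
    re.VERBOSE,
)

matches = EMAIL.findall("mail jane.doe@example.co.uk or bob@example.org")
```

A pattern like this is easy to review and extend, but it will miss legal oddities such as quoted local parts (`"john smith"@example.com`), which is one reason a single regex rarely works 100% of the time.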


Yeah, seriously. Are email addresses even regular to the extent that a regex would work 100% of the time?


Hey, I'm one of the maintainers of Presidio OSS. We built this tool mainly for Microsoft's strategic customers and decided to open-source it for others to use. No hidden agenda here.

The engine has two main advantages: it's easily expandable and customizable, and it works well at scale. We do know of organizations that are very close to production with it.

Every organization has its own requirements for PII entities, many of them specific to the org itself, so the engine allows a developer to easily add support for new PII entities using code, regex or black-lists.

See here: https://github.com/microsoft/presidio/blob/master/docs/custo...
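As a rough illustration of what "adding a PII entity with a regex or a black-list" can look like, here is a standalone Python sketch. `SimpleRecognizer`, `Finding`, the entity names, and the default score are all hypothetical and invented for this example; this is not Presidio's API, so check the linked docs for the real interface.

```python
import re
from dataclasses import dataclass


@dataclass
class Finding:
    entity_type: str
    start: int
    end: int
    score: float


class SimpleRecognizer:
    """Hypothetical recognizer: flags regex matches or black-listed terms."""

    def __init__(self, entity_type, pattern=None, deny_list=(), score=0.5):
        self.entity_type = entity_type
        self.score = score
        self.regexes = []
        if pattern:
            self.regexes.append(re.compile(pattern))
        if deny_list:
            # Escape each term so it is matched literally, as a whole word.
            terms = "|".join(re.escape(t) for t in deny_list)
            self.regexes.append(re.compile(rf"\b(?:{terms})\b", re.IGNORECASE))

    def analyze(self, text):
        return [
            Finding(self.entity_type, m.start(), m.end(), self.score)
            for rx in self.regexes
            for m in rx.finditer(text)
        ]


# Org-specific entities (made-up formats for illustration).
employee_id = SimpleRecognizer("EMPLOYEE_ID", pattern=r"\bEMP-\d{6}\b", score=0.8)
codename = SimpleRecognizer("PROJECT_CODENAME", deny_list=["Bluebird", "Nightjar"])

text = "EMP-123456 works on Bluebird."
findings = employee_id.analyze(text) + codename.analyze(text)
```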

As for the productization aspect, we did some performance tests and are confident about taking it to production. A single-instance cluster with a medium-size machine has a ~65ms response time for a 100-word sentence. Using better machines lowers the response time to ~24ms for 100 words and 150ms for a 1,000-word input.

The current service uses regex for known patterns and spaCy for named entity recognition (person names, places etc.). Users often build custom ML models to detect new types of entities.

Presidio is free, completely transparent, and fully customizable. Feel free to use it and let us know what you think.


Still some work to do; on their demo page [1]:

"Simply follow the instructions" -> "Simply follow the <US_DRIVER_LICENSE>"; same for "contribution". So I'm guessing some overly eager regex is to blame, which doesn't make you super confident about using this for something sensitive.

[1] https://presidio-demo.westeurope.cloudapp.azure.com


To be fair, there are different levels of detection [1]; the demo is probably using the weakest one.

[1] https://github.com/microsoft/presidio/blob/74ea983cc50ff76d7...


Thanks, this is indeed the case. The US_DRIVER_LICENSE confidence is 0.01 and the demo doesn't put any threshold on the response. We're working on fixing the demo.
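To make the threshold point concrete, here is a small Python sketch of filtering results by confidence before redacting. The `Result` tuple, the `redact` function, and the default cutoff of 0.5 are assumptions for illustration, not Presidio's response format or its recommended setting; only the 0.01 score comes from the case described above.

```python
from typing import NamedTuple


class Result(NamedTuple):
    entity_type: str
    start: int
    end: int
    score: float


def redact(text, results, min_score=0.5):
    """Replace spans scoring at least `min_score` with <ENTITY_TYPE> tags."""
    kept = [r for r in results if r.score >= min_score]
    # Replace from the end of the text so earlier offsets stay valid.
    # Overlapping spans are not handled in this sketch.
    for r in sorted(kept, key=lambda r: r.start, reverse=True):
        text = text[:r.start] + f"<{r.entity_type}>" + text[r.end:]
    return text


text = "Simply follow the instructions"
results = [Result("US_DRIVER_LICENSE", 18, 30, 0.01)]

no_threshold = redact(text, results, min_score=0.0)
with_threshold = redact(text, results)
```

With no threshold the weak match is redacted, as in the demo; with any reasonable cutoff the sentence passes through unchanged.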



I can see this being used to detect accidentally leaked sensitive content (for example scanning repos, outgoing emails, shared folders).

However, using this to redact material sounds a bit risky. Are there use cases where you could accept the potential mistakes (missing something that should have been redacted)?


My first thought was that it was related to the cloud security company Presidio, not MS.


Although Presidio would like you to think they are cloud security (higher margins) they are really just a VAR/reseller. Weak services but they will sell you stuff. So will I!

More importantly, this is proximate enough that they may sue for the name, OP.


This is good... just need a hashbytes formula for Excel to make anonymization accessible to the majority of MS customers who are fumbling around PII haphazardly.
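For reference, the hashing idea looks roughly like this in Python. This is a sketch of keyed hashing (HMAC-SHA256) rather than a plain hash, since plain hashes of low-entropy values like emails or phone numbers can be reversed by hashing guesses; the function name `pseudonymize` and the normalization step are assumptions for illustration.

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes) -> str:
    """Return a stable keyed hash of `value` as a hex string."""
    # Normalize so trivially different spellings map to the same token.
    normalized = value.strip().lower()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


secret_key = b"replace-with-a-real-secret"  # placeholder key for the example
token = pseudonymize("Jane@Example.com ", secret_key)
```

The same input and key always give the same token, so joins across datasets still work, but anyone holding the key can test guesses, so this is pseudonymization rather than true anonymization.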


Is anyone using Presidio in production?


What a horrible name


I kind of agree with you.

From the README.md: Presidio (Origin from Latin praesidium ‘protection, garrison’)

In Spanish it sounds like prison, or something related to prison. An inmate can be called a presidiario. Same root as the Latin presidium.


Please everyone:

Before you brand your software, google the name! It's not difficult!

https://www.presidio.com/



