Show HN: IPDetective – An API for IP bot detection

azalemeth · on Sept 13, 2022

I have to say, I really, really hate IPv4-based bot detection. If you end up on an IP that is labelled erroneously as "bad", your life becomes a gCaptcha hell for no real reason – my cable ISP does cGNAT and I have no control over it and frequently end up being blacklisted. I tend to use a variety of always-on VPNs to both avoid censorship and try to improve this – I can pick different endpoints.

VPN usage is increasingly common by consumers and in my country I've seen ads for it in places ranging from Mozilla in my browser to NordVPN on my TV. There's massive overshare of IPv4 addresses and it's really, really annoying to find out that you can't use a service because you're on a naughty list. I feel like yelling "I am a customer and want to buy something!" on occasion – the net result is I go elsewhere.

Stopping abuse and bot detection is one thing. Banning people for something they might have literally no control over is quite another.

AndrewCopeland · on Sept 14, 2022

I would agree with everything you are saying. I did not start this service for the sake of stopping actual people from using certain websites. I started this service to mitigate/detect bot abuse of consumer applications.

I think it really is dependent on the consumer application. But mentioned by some other folks is to add the ability to have IP exceptions, essentially an allow list, which I am in favor of. This brings up an entirely different issue, which is identifying the user as legitimate.

bragr · on Sept 13, 2022

>Let me know what your thoughts

Do you have any data showing your service is better/more accurate/better false positive rate than competing offerings?

>IPDetective collects data from about 60+ different sources such as official cloud provider endpoints and public VPN/Proxy/Tor/Bot net lists.

Are you on the up and up with all those public sources? I'm not sure which ones you're sourcing, but many do not allow commercial use, resale, or at the very least have some attribution clause to keep security companies from mooching off crowdsourced data.

>No, currently IPDetective does not support ipv6 addresses. However this feature is on the road map.

That's a major shortcoming in 2022.

Finally, as itake pointed out bellow, I'm not giving you my email just to run a simple test query and see what the results look like.

AndrewCopeland · on Sept 13, 2022

Thanks for your input.

>Do you have any data showing your service is better/more accurate/better false positive rate than competing offerings

I have signed up to some of the competitors and my service averaged about 20ms per request for a single IP address. From the competition it was typically around 200ms to 300ms. Regarding better false positives that would be rather hard to do.

> Are you on the up and up with all those public sources?

I gathered from sources that did not have any licensing related to non-commercial use. Would you recommend I reach out to all of the sources regardless even if they say you can use it for commercial use? Or do not say anything about commercial use at all?

> No, currently IPDetective does not support ipv6 addresses. However this feature is on the road map.

Yeah I know, just focusing on IPv4 right now. I am still collecting ipv6 addresses however I have not implemented it in the service yet.

Thanks a bunch for writing the comments above.

kosolam · on Sept 13, 2022

Yeap… measure the amount of requests for ipv6 support you get from your users, not hn users :-)

bragr · on Sept 13, 2022

https://www.google.com/intl/en/ipv6/statistics.html

zamadatix · on Sept 13, 2022

Wrong kind of user, customer vs end user. Still probably worth pursuing but this more shows the percentage of end users using Google via a mobile phone than anything.

Terretta · on Sept 14, 2022

>> currently IPDetective does not support ipv6 addresses. However this feature is on the road map.

> That's a major shortcoming in 2022.

In the US, it’d be nice if national ISPs (fiber, cable, and mobile) thought this.

itake · on Sept 13, 2022

1/ needs authenticationless way to search. all other ip look up tools don't require an account for me to test this. I'm not going to register.

2/ I've tried blocking vpn or proxy users (by ip hosting provider), but found too many false positives. Using a VPN is common for IT professionals or consumers trying to 'protect' their privacy and this impacted the growth of my app.

How is your tool better at detecting a bot vs a human using a vpn?

AndrewCopeland · on Sept 13, 2022

>1/ needs authenticationless way to search. all other ip look up tools don't require an account for me to test this. I'm not going to register.

Okay I will look into a solution.

2/ I've tried blocking vpn or proxy users (by ip hosting provider), but found too many false positives. Using a VPN is common for IT professionals or consumers trying to 'protect' their privacy and this impacted the growth of my app.

That good to know, I do not find that I have a lot of false postivies but I would imagine it all depends on audience.

>How is your tool better at detecting a bot vs a human using a vpn?

It does not know the difference as or right now. I was thinking about adding a user-agent validation as well which could add another layer.

michaelmior · on Sept 13, 2022

> I was thinking about adding a user-agent validation as well which could add another layer.

Presumably this service exists because bots try to avoid detection. I don't think UA validation would really help much and there are plenty of libraries already that do this.

itake · on Sept 13, 2022

I didn't have issues per se with bots, but people trying to hide their location (think Nigerian user using a US-based VPN).

My trick was to get the device's timezone (which you don't need privacy permissions for on web or mobile). If it didn't match up with the ip address's country (or was in a banned country), then the account was banned.

AndrewCopeland · on Sept 14, 2022

That is a real good idea, I really like it.

StayTrue · on Sept 14, 2022

Browser anti-fingerprinting tools may alter the timezone. Mine does.

itake · on Sept 14, 2022

My use-case is a mobile app. If you use those tools, you can't expect your web experience to be 'normal' :-/

bootsmann · on Sept 13, 2022

How do you plan to keep the list up to date and especially reliably accurate? (providers change, bot nets get cleaned, ASN especially in 3rd world countries are very unreliable, crawlers might scrap weird things etc.)

Disclaimer: I work for a company that operates in a similar space so don't go into too much secret sauce if you don't want to :)

AndrewCopeland · on Sept 13, 2022

So I have a scraper that gathers the information from about 60 different sources using a wide array of techniques which is run everyday.

The largest thing I try to look at is to make sure my sources are not older than 1 year. Sometimes these sources break and I update them or they go offline all together. I am always looking for new hosting providers, vpn and proxies.

Also I can do further port analytics on these ips as well. A lot of the VPN services have specific ports open used by VPN services. I would say that automating every source from the start was a good way to start so I can stay up to date.

ThinkBeat · on Sept 14, 2022

It would be interesting to see a breakdown of shared / unique entries across the different competitors in this space.

I have a feeling that there would be overwhelming overlap.

That might not be bad, just low hanging fruit everyone can get.

The speed at acquiring high probability threats I guess would be a better / more valuable comparison.

Perhaps running two or more of these offerings side by side for a couple of months.

I would prefer a feed of ALL, with frequent updates, instead of querying a 3rd party every time. (well obviously you can cache responses for a period of time locally at least)

AndrewCopeland · on Sept 14, 2022

How would you plan on storing the feed of IP addresses? I currently reach out to 3rd party and then aggregate all of the IP addresses once a day.

stevenicr · on Sept 13, 2022

I checked the posted privacy policy - and I have no idea what your system logs and for how long.

There are a few sites I could use this for, but some of them would also end up sending private customer ips for lookups, and for that reason I just manually check the suspicious ones and manually don't look up the ones I know have passed the 'already a member and human not hacker' gate.

I keep look for something I can just add lists to my server and check against that, but not going to spend thousands on it.

marvinblum · on Sept 13, 2022

Same here. I would like to integrate a simple blacklist into our system. Does anybody know of a good IP lists that contains known bots, data centers, and so on?

Also from a performance perspective, I can't do hundreds of millions of requests a month over the wire. That just a waste.

AndrewCopeland · on Sept 14, 2022

IPDetective collects data data from bot nets, datacenters, tor nodes, proxies and vpns. Is that what you are looking for?

Also IPDetective can be used just as a detection solution rather than a prevention solution.

Sometimes I use it against my nginx access logs and see how much of my traffic could be from bots.

marvinblum · on Sept 14, 2022

I think your service would be close to what I'm looking for. But can you handle 120+ million requests per month? I would rather integrate an open-source list into our open-source tool than using a proprietary service.

AndrewCopeland · on Sept 14, 2022

I feel you. Maybe we could work something out.

AndrewCopeland · on Sept 14, 2022

Thank you very much for your input. I will look into having the privacy be more verbose around the system logs and for how long I store them.

> I keep look for something I can just add lists to my server and check against that, but not going to spend thousands on it.

What do you mean by this? Like having the ability to create your own deny/black list of ip addresses and then you can validate against it?

toast0 · on Sept 14, 2022

>> I keep look for something I can just add lists to my server and check against that, but not going to spend thousands on it.

> What do you mean by this? Like having the ability to create your own deny/black list of ip addresses and then you can validate against it?

They want to download your database and run queries on it locally, rather than call your API. This way they don't share data with you and don't need to worry about your data practices.

AndrewCopeland · on Sept 14, 2022

Ahhh makes sense. I could definitely offer that it's pretty large being 250 million IP addresses. I wonder what the best format would be? Csv? SQLite?

stevenicr · on Sept 16, 2022

Currently to make things a little smaller I am putting many ip's in with cidr notation..

whichever format you would offer I could likely find ways to convert if needed.

I am still researching what is the fastest / lightest on server resources for blocking ips (mainly from data centers lately been the main problems) -

ip-block-file that is ready via cpanel's WHM is what I am using on one system, and iptables/uffirewall / fail2ban on another server..

I'm using wordpress firewall plugin to block some on another site - but I think it's lighter on server resources to block before the requests hit wordpress.. I wonder if it'd be lighter to pull the file from a separate server and return yay or nay - or if it's fine loading up 250 million ips via one of the firewall options already in place.

How large of a file it 250 mill ips? (non-compressed) - I suppose available RAM could also play into these things.

AndrewCopeland · on Sept 16, 2022

All of the IP are stored in memory. This takes up about 3.5 GB. I have a text file that is 445K lines that contains all of the specific IP addresses and the IP prefixes (CIDR) that make up that 250 million ips.

Uncompressed I do not know. Are you interested in the ranges and ip addresses or a list of all the ip addresses that make up all the ranges as well?

Let me know what you need and I could definitely try and accommodate for you.

whodev · on Sept 15, 2022

I wonder where you get your data from. I checked the dedicated IP I use against other services and they all come back as clean, but yours labelled it as a bot.

AndrewCopeland · on Sept 15, 2022

Is it your home dedicated ip?

whodev · on Sept 15, 2022

No it's from my VPN. That's what is making me wonder about the data you used. If other intel sources say it's a clean IP not used for bots, I want to know what makes your service say otherwise. Are you labeling all IPs from specific ASN's as bots?

AndrewCopeland · on Sept 16, 2022

Is you VPN installed in some type of datacenter or hosting service? If so thats how I am detecting it. Yes I am blocking all IPs from specific ASNs.

mxuribe · on Sept 13, 2022

I didn't see my question in the FAQ, so asking here: do you have a process to contest accidental inclusion of an IP address?

AndrewCopeland · on Sept 13, 2022

I do not, but it seems like a great feature request. I will look into adding something like this.

mxuribe · on Sept 13, 2022

Thanks so much @AndrewCopeland!

I should have added context (or at least an anecdote) to help you during any product roadmap meetings...I used to oversee web ops for global real estate company (one of the biggest in the U.S. and the woirld)...and our consumer-facing websites would show tons and tons and tons of listings of residential homes for sale. Of course we really were just showing listings from our data store as well as other real estate companies who agreed to share listings data. As in many data sharing and data synch arenas, there are data issues. The most common scenario: "hey, you're showing a home that sold X time ago...stop showing it!" And, in all cases it was "someone else's data". But to the customer, or even other realtors, we were the bad guys. Even realtors who should have known betteer that there are always data issues that we can not fully control; at least nopt at the source...would complain to us. Now, one might assume that once the source data updates properly, the "correct data" should flow through, right? Well, not in real estate! Clearly this is a different arena. But i've learned that even if its only to remove a local cahce of data, its a good idea to give users and/or customers a mechanism to at least properly communicate to you about stale data...and if appropriate, maybe evenallow self-service for a user/customer to get rid of the "bad data". Obviously this merits putting in place protocols to avoid abuse...but i hope you get the idea. Good luck!

AndrewCopeland · on Sept 14, 2022

I get the idea, seems like a hot topic in the comments and a solution is definitely possible. The "placing protocols to avoid abuse" seems like the most difficult part :).

mxuribe · on Sept 14, 2022

Yeah the controls part is always tough. Remember, a mechanism can be out in place but it needs not be fast nor easy. Maybe slowness can help? Good luck!

AndrewCopeland · on Sept 17, 2022

I ended up allowing a mechanism to exclude ip addresses for a specific user. The step after this is too have certain IPs excluded from the entire IPDetective list and not just for a specific user.

mousetree · on Sept 13, 2022

Good work on the pricing and free tier. Much cheaper than your competitors (we're busy testing out ipinfo.io). But would recommend you add a bit more data in your API response if you already have it or can built it out easily. For example, any geocoding, ASN name, whether its a proxy or a VPN etc.

AndrewCopeland · on Sept 14, 2022

Sounds good, this has been a common request and is definitely on the road map. I am trying to create a very fast service and data querying slows it down. With that being said I am thinking about adding a query parameter like `?info=true` that would end up providing more information. Like the examples you gave above.

mousetree · on Sept 14, 2022

I'm not so sure that low latency is the most important thing in each case. Personally, in our case, we wouldn't care - specifically we're only requesting (detailed) information per IP on a small set of specific events (i.e. login, signup) and additionally those calls to ipinfo/ipdetective would be async anyway. So perhaps there are two use cases here: 1. low latency, high volume). 2. Higher latency but lower volume or don't care about minimising latency to below 20ms.

exabrial · on Sept 14, 2022

Is there a way we can please for the love of god summit exceptions? We need a corporate vpn and we don’t need every friggen website blocking us.

Without this you were also probably inadvertently contributing to the de-democratization of email. I used to run my own email off a cloud server, but that time has passed.

AndrewCopeland · on Sept 17, 2022

I have added the feature for a user to add known addresses so they do not get flagged as bots. Take a look at the API docs here: https://ipdetective.io/api

Currently the known address is scoped to the user/client who is using the service.

AndrewCopeland · on Sept 14, 2022

Sounds like a great idea and an even better feature. Would that exception only work for you? Or would it also work for other users of IPDetective?

exabrial · on Sept 14, 2022

I think allowing users to register exhorting to your list as ‘known ips’ to have them stay off the list would be useful for all parties involved. If there was abuse behavior, if could go to a graylist while the matter is handled.

throwaway742 · on Sept 14, 2022

I think what you are doing is counterproductive and contrary to the free and open nature of the internet. Actual bad actors can afford clean residential IPs from dubious sources, but if I want the privacy of a VPN I am blacklisted from accessing large swathes of the internet.

RockRobotRock · on Sept 14, 2022

Congrats on the launch! I personally hate anti-bot tech but I also make money from bots and scrapers so I’m not exactly unbiased.

Do you have any plans to expand the business? Captchas as a service? A cheap version of CF’s proxy maybe?

AndrewCopeland · on Sept 15, 2022

I would like to expand. See what users need and pivot as needed. A cheap version of CF's proxy seems like an interesting start. I do not know if I have the expertise though.

asdadsdad · on Sept 14, 2022

what kinds of things do you doo with bots and scrapers, if you can disclose? I'm interested in scraping too and was curious what are some applications people pursue nowadays. It looks like a lot of the earlier ideas are so crowded now, like price monitoring, etc.

RockRobotRock · on Sept 14, 2022

A lot of it was for the endgoal of lead generation, so any place where a lot of names, phone numbers, email addresses etc were accessible. I also got into the habit of reverse engineering mobile apps, because those would typically touch endpoints which weren't monitored for rate limiting and had more interesting information. It was typical for a company to monitor its competitors.

AndrewCopeland · on Sept 13, 2022

Please feel free to leave comments if you have any questions or any feedback.

joshmn · on Sept 13, 2022

Does this handle residential or LTE proxies? Obviously harder to identify, but it's not unfeasible.

AndrewCopeland · on Sept 13, 2022

Currently we collect the data across several free public proxy lists. Definitely more difficult to do wit the proxies you mentioned. I think traffic analytics and port scanning would need to be implemented to get these residential and LTE proxies.

asdadsdad · on Sept 14, 2022

I think ip blacklisting is really a graveyard of false positives.

asdadsdad · on Sept 14, 2022

Also, how does it compare to https://focsec.com/pricing

AndrewCopeland · on Sept 15, 2022

It seems more affordable, I do not know if its more accurate.

arepaw · on Sept 14, 2022

Excelente