Let's compare this to a similar decentralized p2p network, the World Wide Web. If every node were a search engine and used web spiders, we would fill up all bandwidth and web server resources (memory, CPU) with spider requests. Selfish nodes like Google can and sometimes do cause services to go down because of overly aggressive behavior. Yet the network (and the people who operate websites) do accept some amount of crawlers, and don't compare it to stealing a dollar from our wallet (even if power + bandwidth + manpower could easily add up to just that). The cost of crawlers should normally not cause any actual latency issues, since web servers operate in a very parallel way, utilize multiple CPU cores, and bandwidth is generally wide enough (and not metered).
This is why I asked if there is real impact. In theory web spiders are extremely aggressive and non-scaling nodes, but in practice they are mostly background noise and have no impact on service quality. In the cases where they do have an impact, we have blacklists (robots.txt) for just that purpose. As such, the real-world impact of non-scaling World Wide Web nodes is minimal.
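To illustrate what I mean by that blacklist mechanism, here is a rough sketch in Python using the standard urllib.robotparser module; the rules and URLs are made-up examples:

```python
# Minimal sketch of the robots.txt mechanism, using Python's standard
# urllib.robotparser. The rules and URLs below are made-up examples.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A well-behaved crawler checks these rules before fetching a page.
print(rp.can_fetch("MyCrawler", "https://example.org/private/secret.html"))  # False
print(rp.can_fetch("MyCrawler", "https://example.org/index.html"))           # True
print(rp.crawl_delay("MyCrawler"))                                           # 10
```

The rules are advisory, of course; a crawler has to choose to honor them.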
You describe in the concrete example that there are currently a few dozen fake entries and that they cause increased latency in some implementations. A) Are there any measurements, B) does it run in parallel, and C) do the fake entries get higher priority than real nodes?
> Let's compare this to a similar decentralized p2p network, the World Wide Web.
Horribly, horribly flawed analogy. You can't compare a DHT (or most p2p networks, for that matter) and the web. The web is barely decentralized and mostly client-server, not peer-to-peer, while a DHT is distributed, not just decentralized. The traffic characteristics are dramatically different. The client-server architecture alone already assumes that there is an asymmetry, that there are resource-providers and resource-takers. That is the opposite of how Kademlia is designed.
Additionally, the web is very different from a game-theoretic perspective. Each node is responsible for its own content, so it can make cost-benefit tradeoffs for itself. In a p2p network where node A stores data on B for C to retrieve, node B has no direct incentive to store anything at all; the incentive only exists indirectly, if B is interested in the network operating smoothly. This means fairness is far more important, since otherwise nodes can become disincentivized from participating in the network.
Anyway, web crawlers are not a violation of any web standard. Aggressive DHT indexing as discussed here, on the other hand, is not merely aggressive; it blatantly violates specs in ways that are obviously detrimental to other nodes. If we wanted to compare it to something in web tech, it would be like randomly choosing HTTP servers and then flooding them with slow-read attacks and spoofed HTTP headers for... uhm... bandwidth-testing purposes or some other reason the server operators have no interest in.
> and don't compare it to stealing a dollar from our wallet (even if power + bandwidth + manpower could easily add up to just that).
So you acknowledge the equivalence but then dismiss it anyway without further arguments?
> A) is there any measurements
yes
> B) does it run in parallel?
I don't understand your question.
> C) do the fake entries get higher priority than real nodes?
Priority by which metric? There is no explicit priority in Kademlia, but many operations involve relative ordering of nodes. The fake nodes can be closer to the destination, yes, which is what slows down the lookups.
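As a rough illustration (not a benchmark): lookup candidates are ordered by XOR distance to the target, so fake entries whose IDs are crafted to sort close to the target get queried early, and every query to them is wasted on a timeout. The node IDs, timeout, and RTT values below are made-up assumptions, and a real lookup queries a few nodes in parallel, but the ordering effect is the same:

```python
# Toy model of why unresponsive fake entries slow down Kademlia lookups.
# Assumptions (not measurements): 2000 ms timeout for dead nodes, 50 ms RTT
# for live ones, and a strictly serial lookup over the k closest candidates.
import random

ID_BITS = 160          # BitTorrent DHT / Kademlia node IDs are 160 bits
RPC_TIMEOUT_MS = 2000  # assumed cost of querying a fake/unresponsive node
RPC_RTT_MS = 50        # assumed cost of querying a responsive node

def xor_distance(a: int, b: int) -> int:
    """Kademlia's distance metric is the bitwise XOR of two IDs."""
    return a ^ b

def lookup_latency_ms(target, real, fake, k=8):
    """Query the k candidates closest to the target, in distance order.
    Fake nodes never answer, so each one costs a full timeout."""
    candidates = sorted(real | fake, key=lambda n: xor_distance(n, target))[:k]
    return sum(RPC_TIMEOUT_MS if n in fake else RPC_RTT_MS for n in candidates)

random.seed(1)
target = random.getrandbits(ID_BITS)
real_nodes = {random.getrandbits(ID_BITS) for _ in range(1000)}

# Fake entries whose IDs share a long prefix with the target sort ahead of
# essentially every real node, so they dominate the closest-k candidate set.
fake_nodes = {target ^ random.getrandbits(16) for _ in range(4)}

print("lookup without fake entries:", lookup_latency_ms(target, real_nodes, set()), "ms")
print("lookup with fake entries:   ", lookup_latency_ms(target, real_nodes, fake_nodes), "ms")
```

Real implementations differ in the numbers (concurrency, adaptive timeouts), but the mechanism is the same: dead entries that sort closer absorb queries that would otherwise go to live nodes.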
> So you acknowledge the equivalence but then dismiss it anyway without further arguments?
It's an acknowledgement that we don't call it stealing even if something has a cost and is unwanted. It's just part of the normal operation of running a webserver that you get unwanted traffic, though only as long as the cost is so minimal that it does not affect service quality.
Which is the question at hand that I wanted answered: what specific impact something like Magnetico has on service quality for users who don't run their own search engine. A middle ground between the centralized model of a handful of globally known torrent websites and the completely decentralized p2p model where everyone runs their own search would be local, community-sized torrent sites all running the same search-engine crawler, like Magnetico. Would that scenario cause enough service disruption to do more harm than good to the global community, or the opposite (torrent sites are often said to be needed in order for the network to operate smoothly)?
It sounds from your answer that the damage from fake nodes is so significant that a few tens of thousands of Magnetico installations would have such an impact on the DHT network as to render it unusable and basically shut it down. That would be very bad, and you would have my support. If such a group increased latency and bandwidth use by an order of 10^3, it would clearly be too costly. If it caused a 10% increase, it would still be costly but maybe worth it. If it is around 0.1%, then I don't see any significant uproar happening in the community over it (similar to web crawlers). The question is which order of cost it is, and while I have an implied answer, it would be interesting to know a more explicit one.
It seems like you have entirely missed the point of my posts. If they want to do indexing that is perfectly fine. But they must do so while being a good citizen, i.e. be spec-compliant and contribute resources (storage, stable routing information) to the network within the same order of magnitude as they consume. There is no reason to justify a net-drain implementation when they could take zero-sum or positive-contribution approaches.