Skimming over the code, it seems like it's committing fairly egregious BitTorrent spec violations. In other words it's very selfish, and other clients may blacklist it if they detect such behavior.
> Skimming over the code, it seems like it's committing fairly egregious BitTorrent spec violations.
It's true and even understated: magneticod is not only selfish but also incredibly aggressive. It will literally contact a thousand (and, if your network allows, even more) nodes every single second and deceive them by forging thousands of IDs to create the impression that it is "very close" (in terms of the XOR distance used by Kademlia DHTs) to every single node.
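For readers unfamiliar with Kademlia, here is a minimal sketch of what "close" means there; the names and the 15-byte prefix are just for illustration:

```python
# Kademlia's notion of "closeness": distance(a, b) = a XOR b, interpreted as a
# 160-bit unsigned integer. Sharing a long common prefix with a target ID
# therefore makes a node look extremely close to that target.
import os

def xor_distance(a: bytes, b: bytes) -> int:
    """XOR distance between two 20-byte node IDs / info hashes."""
    return int.from_bytes(a, "big") ^ int.from_bytes(b, "big")

target = os.urandom(20)                   # some node ID or info hash
honest_id = os.urandom(20)                # a regular, randomly chosen node ID
forged_id = target[:15] + os.urandom(5)   # first 15 bytes copied from the target

# The forged ID is almost always orders of magnitude "closer" to the target.
print(xor_distance(forged_id, target) < xor_distance(honest_id, target))
```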
There are on average 7,000,000 DHT peers in the network, and less than a quarter of the nodes stay longer than 12 hours (i.e. the node you contacted this morning might be long gone by the evening). There are quite a few public papers[0] describing various Sybil attacks on the BitTorrent DHT, so this is nothing unknown. BEP 42[1] tries to mitigate the issue by making node IDs dependent on the IP address of the node, but as far as I know not many clients (if any) implement it.
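For reference, this is roughly how BEP 42 ties a node ID to an IPv4 address, as I read the BEP; the bit-by-bit crc32c helper and the sample address below are only for illustration:

```python
import os

def crc32c(data: bytes) -> int:
    """Naive bitwise CRC-32C (Castagnoli); real code would use a library."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def bep42_node_id(ipv4: bytes) -> bytes:
    """Sketch of BEP 42: derive a node ID bound to the node's external IPv4."""
    rand = os.urandom(1)[0]
    r = rand & 0x07
    # Mask the address, mix in 3 random bits, and hash it.
    masked = bytearray(b & m for b, m in zip(ipv4, b"\x03\x0f\x3f\xff"))
    masked[0] |= r << 5
    crc = crc32c(bytes(masked))
    node_id = bytearray(os.urandom(20))
    node_id[0] = (crc >> 24) & 0xFF          # the top 21 bits of the ID come
    node_id[1] = (crc >> 16) & 0xFF          # from the CRC of the masked IP
    node_id[2] = ((crc >> 8) & 0xF8) | (node_id[2] & 0x07)
    node_id[19] = rand                       # the random byte is echoed here
    return bytes(node_id)

# Other nodes can recompute the CRC from the sender's IP and drop peers whose
# IDs don't match, which makes large-scale ID forgery from one IP much harder.
print(bep42_node_id(bytes([203, 0, 113, 7])).hex())
```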
I've said too much already, but one last point: given the scale of the BitTorrent DHT, I doubt these attacks are disruptive, but they are more dangerous for the privacy of peers. One can gather thousands of IP addresses each day just as easily using the same technique. (See http://opentracker.blog.h3q.com/2007/02/12/perfect-deniabili...)
I'm not sure what you're trying to argue here. That it's ok to misbehave because there are papers describing how to do it? That it's ok to be selfish because there are enough compliant nodes to compensate for it?
> That it's ok to be selfish because there are enough compliant nodes to compensate for it?
Yes. Of course, I have neither the data nor the foresight to claim that its short- or long-term effects are negligible, but I really doubt that they are not.
> That it's ok to misbehave because there are papers describing how to do it?
Not because of that; I was pointing out that the technique is nothing new, and numerous real attacks of much greater scale (with the intention of actually disrupting the network) using the same techniques have probably been launched many times before magnetico. If the BitTorrent DHT survived those, surely it is more resilient than you seem to think.
----
I would love to hear the opinion of a BitTorrent developer though. I'll try to e-mail them tonight and ask for their opinion. =)
> Yes. Of course, I have neither the data nor the foresight to claim that its short- or long-term effects are negligible, but I really doubt that they are not.
And what if people see your solution and try to build an end-user thing out of it so that millions of users could run it at home?
When building something in a P2P network you should always consider whether it would still work if the majority of the network behaved like that.
> If the BitTorrent DHT survived those, surely it is more resilient than you seem to think.
Yes, the rain forest is also quite resilient. Nothing wrong with logging a few more trees, right?
I'm saying your kind of reasoning is responsible for tragedy of the commons outcomes. You're not even trying to make your nodes a good citizen in the network because "others do it too".
--
Anyway, such behavior can be detected and if it becomes more common other clients will probably implement blacklisting for it.
You haven't taken into account the resources required to do so.
1) Open ports, which already rules out a large part of those millions
2) Willingness to sacrifice bandwidth for something that will still take a long time to populate a large enough database when run on a home ISP connection
3) The average user will stop using it when too much UDP crashes the router
4) The "Crawling BitTorrent DHTs for Fun and Profit" paper only takes into account the Vuze-only DHT
5) A completely distributed torrent database system with a pretty nice search function like Tribler already exists and doesn't seem to be that popular.
I am using "does this approach work if every node in the network behaves like this" as a yardstick to gauge whether some p2p node is a good implementation. You're not just running the software on your own machine, you're running it in a network with everyone else. If the network would fail when a specific implementation becomes dominant, then it's badly engineered.
But that's just the point where harm becomes obvious. That doesn't mean there's no harm when they're only a minority of nodes. They're still making things less efficient for other nodes, draining resources, or making lookups more unreliable. This might make the network less attractive for end-users, for example.
So the extrapolation simply helps make an already-present net drain on the network's resources more apparent.
Could you explain the harm in a bit more detail? How many false identities (that don't respond) are needed to flood the routing table and cause a noticeable impact, and further, to make a resource completely inaccessible? Are we talking about 1:1 for each fake vs. real node, 10^2:1, or does it more or less need to fill up limited resources such as memory or network?
I'm not sure if you have understood my argument. The precise ratio at which it becomes noticeable or at which things start breaking down is not relevant to gauging whether an implementation is well-behaved or not.
Fundamentally a p2p network can only work if, averaged over the nodes, the following holds: resources consumed <= resources provided. If everyone tries to consume more than they provide (possibly zero) then the network does not work. This can be extended a bit further by also looking at whether some node behavior impacts how much other nodes consume. Ultimately this means selfish nodes are harmful to the network on a fundamental level. They simply cannot satisfy that inequality.
Even a single misbehaving node will inflict an - admittedly minuscule - amount of harm on the network. It's akin to stealing a dollar from your wallet; it probably won't even impact your day.
And just like stealing dollars gets worse and worse as you take more of them from people, the noticeable impact in a network gets worse and worse as the fraction of misbehaving nodes increases. It starts with wasting everyone's bandwidth (metered internet!), then increases the latency of lookups because you need to visit more hops, and then less strict implementations start to fail lookups. The most strict, defensive implementations gradually failing more and more of their lookups is just the last and most visible kind of harm.
To give a more concrete example, I have observed utorrent and libtorrent DHT implementations having polluted routing tables near their home buckets since they don't sanitize fake nodes aggressively enough. This adds maybe a dozen entries to their routing tables, which is not much. But any lookup coming their way will experience significantly increased latency because the terminal phase of a lookup should be exhaustive, so an implementation has to try all the bogus entries. There are counter-measures of course, but they increase implementation complexity and result in more maintenance traffic than otherwise necessary. Those fake nodes come from only a few thousand IPs and yet they cause a negative impact for everyone because they have indirect effects.
So in the real world the impact depends on a) fraction of bad nodes b) aggressiveness of bad nodes (not all bad behavior has the same impact) c) implementation quality of those who are trying to be good citizens.
Let's compare this to a similar decentralized p2p network, the World Wide Web. If every node were a search engine and used web spiders, we would fill up all bandwidth and web server resources (memory, CPU) with spider requests. Selfish nodes like Google can, and sometimes do, cause services to go down because of overly aggressive behavior. Yet the network, and the people who operate websites, do accept some amount of crawlers, and we don't compare it to stealing a dollar from our wallet (even if power + bandwidth + manpower could easily add up to just that). The cost of crawlers should normally not cause any actual latency issues, since web servers operate in a very parallel way, utilize multiple CPU cores, and bandwidth is generally wide enough (and not metered).
This is why I asked if there is a real impact. In theory web spiders are extremely aggressive and non-scaling nodes, but in practice they are mostly background noise and have no impact on service quality. In the cases where they do have an impact, we have blacklists (robots.txt) for just that purpose. As such, the real-world impact of non-scaling World Wide Web nodes is minimal.
You describe in the concrete example that there are currently a few dozen fake entries and that they cause increased latency in some implementations. A) Are there any measurements, B) does it run in parallel, and C) do the fake entries get higher priority than real nodes?
> Let's compare this to a similar decentralized p2p network, the World Wide Web.
Horribly horribly flawed analogy. You can't compare a DHT (or most p2p networks for that matter) and the web. The web is barely decentralized and mostly client-server, not peer-to-peer while a DHT is distributed, not just decentralized. The traffic characteristics are dramatically different. The client-server architecture alone already assumes that there is an asymmetry, that there are resource-providers and resource-takers. That is the opposite of how Kademlia is designed.
Additionally, the web is very different from a game-theoretic perspective. Each node is responsible for its own content, so it can make cost-benefit tradeoffs for itself. In a p2p network where node A stores data on node B for node C to retrieve, node B has no direct incentive to store anything at all; the incentive only exists indirectly, if B is interested in the network operating smoothly. This means fairness is far more important, since otherwise nodes can become disincentivized from participating in the network.
Anyway, web crawlers are not a violation of any web standard. The DHT indexing discussed here, on the other hand, is not merely aggressive; it blatantly violates the spec in ways that are obviously detrimental to other nodes. If we wanted to compare it to something web-tech it would be like randomly choosing HTTP servers and then flooding them with slow-read attacks and spoofed HTTP headers for... uhm... bandwidth testing purposes or some other reason which the server operators have no interest in.
> and we don't compare it to stealing a dollar from our wallet (even if power + bandwidth + manpower could easily add up to just that).
So you acknowledge the equivalence but then dismiss it anyway without further arguments?
> A) Are there any measurements
yes
> B) does it run in parallel?
I don't understand your question.
> C) do the fake entries get higher priority than real nodes?
Priority by which metric? There is no explicit priority in Kademlia, but many operations involve a relative ordering of nodes, and the fake nodes can be closer to the destination, yes, which is what slows down the lookups.
> So you acknowledge the equivalence but then dismiss it anyway without further arguments?
It's an acknowledgement that we don't call it stealing even if something has a cost and is unwanted. It's just part of the normal operation of running a webserver that you get unwanted traffic, though only as long as the cost is so minimal that it does not affect service quality.
Which is the question at hand that I wanted answered: what specific impact does something like magnetico have on service quality for users who don't run their own search engine? A middle ground between the centralized model of a handful of globally known torrent websites and the completely decentralized p2p model where everyone runs their own search would be community-sized torrent sites all running the same search-engine crawler like magnetico. Would that scenario cause enough service disruption to do more harm than good to the global community, or the opposite (torrent sites are often said to be needed in order for the network to operate smoothly)?
It sounds from your answer as if the damage from fake nodes is so significant that a few tens of thousands of magnetico installations would impact the DHT network enough to render it unusable and basically shut it down. That would be very bad and you would have my support. If such a group increased latency and bandwidth use by a factor of 10^3, it would clearly be too costly. If it caused a 10% increase, it would still be costly but maybe worth it. If it's around 0.1%, then I don't see any significant uproar happening in the community over it (similar to web crawlers). The question is which order of cost it is, and while I have an implied answer, it would be interesting to know a more explicit one.
It seems like you have entirely missed the point of my posts. If they want to do indexing that is perfectly fine. But they must do so while being a good citizen, i.e. be spec-compliant and contribute resources (storage, stable routing information) to the network within the same order of magnitude as they consume. There is no reason to justify a net-drain implementation when they could take zero-sum or positive-contribution approaches.
> DHT indexing already is possible and done in practice by passively observing get_peers queries. But that approach is inefficient, favoring indexers with lots of unique IP addresses at their disposal. It also incentivizes bad behavior such as spoofing node IDs and attempting to pollute other nodes' routing tables.
Figurative speech aside, bittorrent.org acknowledges the need & the issue, and calls it bad, not "egregious". The paper[0] mentioned in the BEP 51 proposal describes more or less how magnetico works ("horizontal attack") and states in conclusion that
> Through an extensive measurement study since December 2010, we have identified that both of these attacks are happening in the real network. We have analyzed their exact behavior through honeypots and have shown the scale of the on-going activities. We must stress that we have no concrete proof of actual malicious activities; our work only shows that the scale of attacks is large enough for this to be a concern.
Repeating for the last time (as I feel this is slowly creeping into a flame war):
1. As I have said, this technique is already in use. This means that the network is more resilient than you claim, and the chances that a few hundred (or thousand, in the future) magnetico instances will harm the network are near zero. I could answer in a more direct way if you dropped the rain forest metaphor and addressed the issues you see more directly.
2. magnetico uses a similar but also quite different technique. The horizontal (Sybil) attack requires the offending node to maintain its identity: "First, the attacker only has to answer two types of messages, `PING` and `FIND_NODE`; he can ignore every other message."[0] magneticod, in contrast, answers only `GET_PEERS` and `ANNOUNCE_PEER` queries, which are essential for the remote node to register itself as a peer downloading the torrent. magneticod will send hundreds of `FIND_NODE` queries per second[1] using forged node IDs, and that's all! According to the BitTorrent DHT protocol[2], "after 15 minutes of inactivity, a node becomes questionable." Hence, presumably, magneticod should be forgotten and removed from the routing table after it fails to answer several `PING` queries. Consider also that the `FIND_NODE` queries magnetico makes target (at every single time) totally random nodes, so the same set of nodes will never be targeted twice (statistically speaking).
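To make that concrete, this is roughly what such a `FIND_NODE` query looks like on the wire (KRPC over UDP, per BEP 5); the tiny bencoder and the bootstrap address below are only for illustration:

```python
import os
import socket

def bencode(obj) -> bytes:
    """Minimal bencoder covering only the types used in KRPC messages."""
    if isinstance(obj, int):
        return b"i%de" % obj
    if isinstance(obj, bytes):
        return b"%d:%s" % (len(obj), obj)
    if isinstance(obj, dict):
        return b"d" + b"".join(bencode(k) + bencode(v)
                               for k, v in sorted(obj.items())) + b"e"
    if isinstance(obj, list):
        return b"l" + b"".join(bencode(v) for v in obj) + b"e"
    raise TypeError(obj)

def find_node_query(our_id: bytes, target: bytes) -> bytes:
    """A KRPC find_node query (BEP 5)."""
    return bencode({
        b"t": os.urandom(2),                       # transaction ID
        b"y": b"q",
        b"q": b"find_node",
        b"a": {b"id": our_id, b"target": target},
    })

# Crawling is essentially this in a loop with a fresh random target each time,
# directed at addresses learned from earlier responses; the well-known
# bootstrap node below is just a starting point.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(find_node_query(os.urandom(20), os.urandom(20)),
            ("router.bittorrent.com", 6881))
```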
The problem is, BEP 51 is still a draft and AFAIK no clients in the wild implement it, so it's simply impractical to write a DHT search-engine depending on this particular behaviour. Once it becomes more widespread (hopefully together with BEP 42), magnetico will (and will have to) drop its current technique and support the new and proper one.
> bittorrent.org acknowledges the need & the issue
It does not acknowledge or imply that what you are doing (forged replies) is necessary. It only acknowledges that indexing happens in the wild and that, when it is happening, nodes are currently incentivized to misbehave; that does not mean they need to misbehave to achieve their goals.
> and calls it bad, not "egregious".
That's nitpicking words; a spec is obviously written in a more conservative tone. Your implementation is far outside the spec, intentionally so, to manipulate the behavior of other clients.
> This means that the network is more resilient than you claim, and the chances that a few hundred (or thousand, in the future) magnetico instances will harm the network are near zero.
I have not denied that the network has some resilience or claimed that a single instance of this code running will destroy the network. I am saying that your code is misbehaving and greedy, which makes it a bad citizen in a p2p environment.
> BEP 51 is still a draft
Hardly relevant, many widely deployed BEPs are in draft status.
> AFAIK no clients in the wild implement it
That's not entirely true, but even assuming that it is, it is not relevant either, because BEP 51 is not the only way to be a good citizen while indexing the DHT.
> Once [BEP 51] becomes more widespread (hopefully together with BEP 42), magnetico will (and will have to) drop its current technique and support the new and proper one.
Why not be part of the solution then? Make magnetico respond to BEP 51 requests properly, and you'll help migrate the DHT towards this.
I didn't see anything about "OK" -- it says attacks, right there in the comment. I think "Please Read this" is for the magnetico author, or for a potentially benevolent PR submitter.
Is there a way to accomplish this while being peer-friendly?
Or is this software by its nature only possible by violating the spec and being "rude" to other peers in the network?
Yes. a) You can run a fully compliant node, passively collect info hashes, and then do lookups on them (optionally run on multiple IPs to increase throughput), or b) implement BEP 51.
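For (b), BEP 51 defines a dedicated `sample_infohashes` query so that indexers don't have to resort to any trickery. A rough sketch of the exchange, with field names as I read the BEP (bencoding and transport omitted):

```python
# Sketch of a BEP 51 sample_infohashes exchange; illustrative only.
import os

our_id = os.urandom(20)

query = {
    b"t": os.urandom(2),
    b"y": b"q",
    b"q": b"sample_infohashes",
    b"a": {b"id": our_id, b"target": os.urandom(20)},
}

# A responding node returns, roughly:
# {
#   "r": {
#     "id":       <20-byte node ID>,
#     "interval": <seconds to wait before sampling this node again>,
#     "nodes":    <compact node info of closer nodes, as in find_node>,
#     "num":      <total number of info hashes the node stores>,
#     "samples":  <concatenated 20-byte info hashes>
#   }
# }

def split_samples(samples: bytes) -> list:
    """Split the concatenated 'samples' field into 20-byte info hashes."""
    return [samples[i:i + 20] for i in range(0, len(samples), 20)]
```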
Nice project! Last year I made a similar PoC in Elixir based on the paper "Crawling DHTs for fun and profit"[1].
In that paper the researchers use a pool of IPs to introduce multiple nodes (to cover the hash space). In my PoC I used fewer IPs but multiple ports per IP. How does this project manage to cover that?
I too read that paper, but magnetico crawls the Mainline DHT (which is the de facto standard) whilst the paper is about the Vuze DHT, though the idea is more or less the same.
This project uses only one instance of the crawler running at a time, so one IP and one port for crawling the DHT. Whenever it receives a `GET_PEERS` query from another peer in the network, it forges itself a new ID that has the first 15 bytes in common with the info hash in the query, so that the querying node thinks it has found the closest (or a close enough) node to announce to. After responding, most of the time the same querying node sends an `ANNOUNCE_PEER` query to register itself as a peer downloading that particular torrent. Upon receiving the `ANNOUNCE_PEER` query, magneticod then connects to the peer using the BitTorrent (TCP) protocol, receives the complete metadata, and closes the connection.
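In rough pseudocode the responding side looks something like this; the message shapes follow BEP 5, everything else (names, token handling, metadata fetching) is simplified for illustration:

```python
import os

def handle_get_peers(query: dict) -> dict:
    """Sketch: answer a get_peers query with an ID forged near the info hash."""
    info_hash = query[b"a"][b"info_hash"]
    forged_id = info_hash[:15] + os.urandom(5)  # looks very close in XOR space
    return {
        b"t": query[b"t"],
        b"y": b"r",
        b"r": {
            b"id": forged_id,
            b"token": os.urandom(4),  # normally derived from the sender's IP
            b"nodes": b"",            # pretend there are no closer nodes to try
        },
    }

def handle_announce_peer(query: dict, sender_ip: str) -> tuple:
    """Sketch: grab what's needed to fetch the metadata over TCP (BEP 9)."""
    info_hash = query[b"a"][b"info_hash"]
    peer_port = query[b"a"][b"port"]
    # ...connect to (sender_ip, peer_port), do the extension handshake,
    # download the metadata pieces for info_hash, then drop the connection.
    return sender_ip, peer_port, info_hash
```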
Keep repeating these steps and you'll end up with thousands of torrents at the end of each day.
Possibly, utilizing multiple IPs might increase the throughput and prevent IP-based bans, but to be honest I didn't bother, as the current solution seems to work well enough.
1. I assume it is "dangerous" to run this on an IP address that is easily traceable back to you, since it claims, to various peers, to be downloading all the things, even though it doesn't. Is this correct?
2. A site, skytorrents.in, was posted to Show HN a while ago that appears to be similar: a centrally hosted version of this, in that it builds its index of magnet links by trawling the DHT rather than relying on user submissions.
Also, I think it's going to be very difficult to find anything in the DHT (spam) chaos without some of the pre-filtering that torrent sites provide. Fun project though.
Every DHT has to be bootstrapped from something. If that makes it not distributed then I'm not sure anything is. You can use whatever bootstrap nodes you want by changing the values.
The readme says "truly decentralised", which is the point, not "not distributed". It certainly is distributed, just not truly so. It requires bootstrapping with predefined domain names. Bootstrapping is not bad (when done securely), but the false claim is.
DHT bootstrapping is an implementation detail, not part of the protocol. You don't need DNS or any dedicated bootstrapping mechanism since any node in the network can be used to join it. You could call your friend and ask for the IP and port of his BitTorrent client.
So the protocol is fully decentralized. Implementations tend to use bootstrapping mechanisms that rely on an expandable, customizable list of known bootstrap nodes, but this is not essential.
Well, it is decentralised but not truly/completely distributed.[0] I guess a distributed system would allow clients to search the network right from the querying machine instead of a server.
The adjective "truly" meant that you can actually start using it right away. Starting a traditional torrent website/tracker is not hard, and there must be open source implementations as well, but the problem is (as in all social systems) that they suffer from the initialisation trap: there is no content because you have no users, and there are no users because you don't have anything to offer them, and so on. magnetico discovers the torrents itself, so it's a truly decentralised solution that you can actually start using right away.