I'm curious about two things. I don't really understand how the randomized addressing helps with data locality. Is it just saying that XOR gives random results, and so adding random hops before doing lookups gives a higher statistical percentage of reaching nodes?
Also how does this compare to something like libp2p w/ DCUtR for NAT traversal? Is the NAT problem circumvented by the randomized walk routing?
Yes, the random walk relates to the "NAT problem". Basically, due to NAT, peers may not succeed when trying to connect to arbitrary other peers (especially if you assume no TURN servers or other infrastructure to facilitate). The same situation may also arise in mobile ad-hoc networks, where your wireless signals simply don't reach everywhere. As a result, the usual greedy routing will get stuck in a local minimum. By prefixing the greedy routing with a random walk, you basically randomize the starting position, and then end up at a random local minimum. Replicate at enough local minima (O(sqrt(n))) and do enough lookups (O(sqrt(n))) and the birthday paradox makes it increasingly likely that you succeed. Without the random walk, both the putter and the getter would always greedily route to their own local minimum, and if those differ, they would never find each other's data.
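To see why sqrt(n) on both sides is enough, here's a toy Monte Carlo sketch. The simplification is mine, not the paper's: I assume every random walk lands on one of n possible local minima uniformly at random, which is the "perfect mixing" idealization.

```python
import random

def success_prob(n, replicas, lookups, trials=2000):
    """Monte Carlo estimate: a PUT replicates at `replicas` random minima,
    a GET probes `lookups` random minima; success if any probe hits a replica."""
    hits = 0
    for _ in range(trials):
        stored = {random.randrange(n) for _ in range(replicas)}
        if any(random.randrange(n) in stored for _ in range(lookups)):
            hits += 1
    return hits / trials

n = 10_000
r = round(n ** 0.5)  # sqrt(n) replicas and sqrt(n) lookups
print(success_prob(n, r, r))  # hovers around 1 - 1/e ≈ 0.63 in this toy model
```

So with only sqrt(n) work on each side you already get a constant success probability per attempt, which you can then push up by repeating.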
Where does sqrt come from? Shouldn't it be log? I don't know much about DHTs, but Kademlia (BitTorrent's DHT) has query complexity of around O(log n), with n being the total number of nodes in the network. Kademlia also uses random node IDs and an XOR distance metric.
BitTorrent's mainline DHT is O(log n), base 2. However, you can usually skip 2-3 bits per round of queries. This means you can locate any of the ~20 million nodes in BitTorrent in about 7 hops. In practice it's about 20 queries and takes about 3-5 s. No persistent connections. Barely a few KB of memory per client. No other real-world implementation is that good.
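Rough sanity check of the hop count (my arithmetic, not a measurement): 20 million nodes populate about 24-25 bits of the ID space, so the hop count is that divided by however many bits each round of queries resolves.

```python
import math

n = 20_000_000
bits = math.log2(n)  # ~24.3 populated bits of ID space
for bits_per_hop in (1, 2, 3, 4):
    print(bits_per_hop, "bits/hop ->", math.ceil(bits / bits_per_hop), "hops")
```

At 3-4 effective bits per hop you land in the 7-9 hop range described above.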
Co-author here. If you look into the R5N paper (link in the draft), you can find this term. If you assume a routing performance of O(log n) and O(sqrt(n)) lookups needed to find the value, you get O(sqrt(n) * log n) as the overall complexity (last section of the paper).
For the argument for the sqrt(n) itself, see the paper.
Edit 1: Quote: "Assuming that T [random path length] is chosen large enough to achieve perfect mixing, sqrt(n / (c−1)) replicas would need to be created in order for a GET request to succeed with about 50% probability according to the birthday paradox."
Edit 2: In a world without local minima the term itself would be 1 (you only need a single lookup and will find the value with 100% probability).
In the restricted-route scenario you will have to do O(sqrt(n)) lookups.
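Putting the two terms together as a toy calculation (simplifications are mine: I treat the asymptotic terms as exact message counts, and c is the replication constant from the quoted formula):

```python
import math

def lookup_cost(n, c=2):
    replicas = math.sqrt(n / (c - 1))  # lookups/replicas for ~50% success
    hops_per_lookup = math.log2(n)     # greedy routing after the random walk
    return replicas * hops_per_lookup  # O(sqrt(n) * log n) overall

for n in (10**4, 10**6):
    print(n, round(lookup_cost(n)))
```

Still polynomial in sqrt(n), so noticeably worse than Kademlia's O(log n), which is the price of not assuming full connectivity.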
Aha, thanks for that. The paper's reasoning seems sound at first glance; I'll read it.
You mention sybil attacks on Kademlia, especially those that arise because nodes may set their ID and "camp around" some known key in the hash table. How is this addressed here? Glancing over the standard draft, I see the identifier is just a SHA-512 of the pubkey. Wouldn't it be feasible to brute-force the first few bits of the key and "camp around" some keys a malicious sybil actor wants to deny service to?
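To put a number on that brute force: matching the first k bits of a target identifier takes about 2^k attempts in expectation, so camping near a key is cheap for small k. Illustrative sketch only; os.urandom stands in for real keypair generation, which is the actual per-try cost an attacker would pay.

```python
import hashlib
import os

def common_prefix_bits(a: bytes, b: bytes) -> int:
    """Number of leading bits two equal-length byte strings share."""
    x = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    return len(a) * 8 - x.bit_length()

def forge_near(target_hash: bytes, prefix_bits: int):
    """Sample random 'public keys' until SHA-512 lands within prefix_bits
    of the target; expected cost is about 2**prefix_bits attempts."""
    tries = 0
    while True:
        tries += 1
        candidate = os.urandom(32)  # stand-in for generating a real keypair
        digest = hashlib.sha512(candidate).digest()
        if common_prefix_bits(digest, target_hash) >= prefix_bits:
            return candidate, tries

target = hashlib.sha512(b"record-key-to-camp-on").digest()
key, tries = forge_near(target, 12)
print(tries)  # ~4096 in expectation
```

Whether the draft adds anything on top of the plain hash-of-pubkey mapping to make this harder is exactly what I'm asking.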
On the Internet, when you don't trust third parties to route all of your traffic through and peers are behind NATs or CGNATs. Or imagine a mixed network with some peers on wireless ad-hoc networks (https://en.wikipedia.org/wiki/Wireless_ad_hoc_network), possibly with some of them connected to the Internet. The key assumption we make (for the random walk to work) is that the topology must be a small-world network (https://en.wikipedia.org/wiki/Small-world_network), and network topologies surprisingly often are.