I wonder if you could construct a hash collision for high-PageRank sites in the Google (or Bing) index. You would need to know what hash algorithm Google uses to store URLs (assuming they hash URLs for indexing, which surely they do). MD5 and SHA-1 existed when Google was founded, but hash collisions weren't a big concern until later, IIRC. You'd want a fast algorithm, because you're running it on every URL you encounter on every page, and that adds up quickly.
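To get a feel for why speed matters at crawl volume, here's a rough timing sketch in Python. crc32 is just a stand-in for a fast non-cryptographic mixing function, not a claim about what any search engine actually uses:

    # Rough speed comparison: a cryptographic hash (MD5) vs. a fast
    # non-cryptographic checksum (crc32 as an illustrative stand-in).
    import hashlib
    import timeit
    import zlib

    url = b"https://nytimes.com/2011/05/02/world/asia/osama-bin-laden-is-killed.html"

    md5_time = timeit.timeit(lambda: hashlib.md5(url).digest(), number=1_000_000)
    crc_time = timeit.timeit(lambda: zlib.crc32(url), number=1_000_000)

    print(f"MD5:   {md5_time:.2f}s per million URLs")
    print(f"crc32: {crc_time:.2f}s per million URLs")

At tens of billions of URLs per crawl, even a few hundred extra nanoseconds per hash adds up.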
The commonly cited maximum URL length is 2048 characters (a browser limit, not a formal standard), but I wouldn't be surprised if there are plenty of longer, non-compliant URLs in the wild. If you were limited to 2048 characters and a valid URL format, I suspect it would be hard, if not impossible, to build a URL with the same MD5 as an arbitrary high-ranking URL like "https://nytimes.com/". But what if you just wanted to piggyback on the PageRank of any mid- to high-rank site? Is there a URL in the top million ranked URLs you could MD5 hash collide?
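Back-of-the-envelope, assuming MD5 behaves like a random function: matching any specific target hash is a (second-)preimage search, and nobody knows how to do that faster than brute force for MD5, collision attacks notwithstanding.

    # Chance that one crafted URL's MD5 matches any of a million
    # target hashes, assuming MD5 output is uniform over 2^128.
    targets = 10**6
    p_per_attempt = targets / 2**128
    print(f"p per attempt ~ {p_per_attempt:.3e}")  # ~2.9e-33

    # Even hashing a billion candidate URLs per second:
    attempts_per_year = 10**9 * 60 * 60 * 24 * 365
    print(f"expected years ~ {1 / (p_per_attempt * attempts_per_year):.3e}")

So even with a million targets instead of one, you're looking at on the order of 10^16 years.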
I doubt Google would use a URL hash as strong and as slow as MD5. Maybe Page and Brin weren't even thinking about cryptographic hashes, just a good mixing function.
Google has developed several non-cryptographic hash functions; CityHash, FarmHash, and HighwayHash come to mind. BigQuery provides FarmHash (as FARM_FINGERPRINT) alongside the MD5 and SHA family of cryptographic hash functions. Who knows, maybe they use FarmHash today to index page URLs.
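If you want to play with it, the third-party pyfarmhash bindings expose the 64-bit fingerprint. To be clear, this is just a sketch of what URL fingerprinting might look like; whether any search index does this is speculation:

    # 64-bit FarmHash fingerprints of URLs, via the third-party
    # `pyfarmhash` package (pip install pyfarmhash).
    import farmhash

    urls = [
        "https://nytimes.com/",
        "https://nytimes.com/2011/05/02/world/asia/osama-bin-laden-is-killed.html",
    ]
    for u in urls:
        # fingerprint64 is meant to be stable across platforms,
        # unlike hash64, which can vary
        print(f"{farmhash.fingerprint64(u):016x}  {u}")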
I don't remember the details, but I think there was a legend at some point about two different search queries hashing to the same 64-bit integer and causing problems.
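That's plausible purely from the birthday bound: with a 64-bit hash you expect a collision somewhere around 2^32 (about four billion) distinct inputs, a number a search engine blows past quickly. A quick check:

    # Birthday-bound collision probability for an n-bit hash over k
    # inputs: p ~ 1 - exp(-k^2 / 2^(n+1))
    import math

    def collision_probability(k, bits):
        return 1.0 - math.exp(-k * k / 2 ** (bits + 1))

    for k in (10**6, 10**9, 4 * 10**9):
        print(f"{k:>13,} inputs: p ~ {collision_probability(k, 64):.4f}")

A billion distinct queries already gives a few percent chance of at least one 64-bit collision.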
> Is there a URL in the top million ranked URLs you could MD5 hash collide?
This is answered in the article: no.
"Q: Is it possible to make a file get an arbitrary MD2/MD4/MD5/MD6/SHA1/SHA2/SHA3, or the same hash as another file? A: No."
I'm not sure page ranking works that way.
But you could get people to share www.nytimes.com/2011/05/02/world/asia/osama-bin-laden-is-killed.html?garbage=blahblah and collide it with your site crappyproduct.com?rubbish=meh.
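That would be a chosen-prefix collision: two different meaningful prefixes, with junk suffixes computed so the hashes match. For MD5 that's actually feasible; hashclash-style attacks find chosen-prefix collisions in roughly 2^39 MD5 compressions. A sketch of the framing only, no actual attack here:

    # Chosen-prefix collision setup: fixed prefixes, attacker-chosen
    # suffixes. Finding suffixes that make the MD5s equal takes a
    # dedicated attack (hashclash), not random search; these don't match.
    import hashlib

    prefix_a = b"www.nytimes.com/2011/05/02/world/asia/osama-bin-laden-is-killed.html?garbage="
    prefix_b = b"crappyproduct.com?rubbish="
    suffix_a, suffix_b = b"blahblah", b"meh"  # what the attack would compute

    print(hashlib.md5(prefix_a + suffix_a).hexdigest())
    print(hashlib.md5(prefix_b + suffix_b).hexdigest())  # differs, of course

One wrinkle: the computed suffixes are near-arbitrary bytes, so you'd need them to survive URL encoding and whatever canonicalization the indexer applies.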