Not only that, BitTorrent also uses a Merkle tree of the torrent data split into pieces (e.g. 1 MB per piece), meaning you need to compute thousands of colliding hashes, with the added restriction that the data must be exactly the piece length.
The torrent ID is the Merkle root hash, which is a hash of all the piece-hashes hashed together in a tree structure: https://i.stack.imgur.com/JVdvj.png
If you find a collision for the smallest piece, you don't need to find collisions for any higher nodes on the Merkle tree because you'll just be hashing the same values as the legit version.
I don't know enough specifics about sha-1 or bit torrent, but it would seem to me that it depends on the data that is chosen for the base ("leaves") of the tree: if it just consists in taking the hashes, concatenating them, and move higher until you only have one hash, then yes.
However, if the data itself is concatenated for the first step, that might not be the case (depending on the sha-1 algorithm). I seem to recall that md5 operates on 512 bit chuncks (padded if necessary), and takes the result to seed the next computation. So, with md5 (if I recall correctly, still), the attack could work if each part of the torrent was a multiple of 512 bits, Merkel or not, and regardless of the base.
Of course, you cannot change the whole torrent in that case, as it would be prohibitively expensive. And you still have to keep the same file size.
I would love to hear a bit more on the topic from someone more knowledgeable about hashes and torrents.
The process is fairly simple. Files are concatenated and that blob is then chunked into multiple-of-16KiB-sized pieces. Each piece is hashed. The hashes are concatenated and embedded as string in the torrent file. Then a specific subdictionary of the torrent file including the piece hashes, filenames and the length of the data is hashed to yield the infohash.
Magnets only convey the infohash. .torrent files convey the whole metadata, including the pieces hash string.
Well, you could put most of the payload in other parts of the file. People probably wouldn't notice if they didn't go looking. And the small part could exploit a popular codec instead of being an executable itself.
Good point. It should be possible to create a video file that fills ups exactly the first n pieces, such that the last piece contains only the ambiguous data.
Most torrents use a flat list of hashes. The merkle tree is an extension. With a collision you could replace the metadata wholesale though, pointing at completely different content.
But then the torrent client would also present that different content from the start when asking you what to save.
Do you know of anybody that's actually using merkle torrents? I expect they're probably being used in some limited scenarios, but they're probably less than 1% of BitTorrent activity, so it's a bit misleading to refer to it when defending BitTorrent's security model.
The torrent ID is the Merkle root hash, which is a hash of all the piece-hashes hashed together in a tree structure: https://i.stack.imgur.com/JVdvj.png