
I'm not surprised at Google's response, since this looks to me like it's along the same lines as putting lots of images in your signature on a popular forum; although in that case it really is a DDoS.

Maybe Google should consider putting a bandwidth limiter of some sort on that (or, even better, using hashes to avoid duplicates), but I think screaming "security! vulnerability!" is not the right reaction here...




How could Google use hashes to avoid duplication? They'd have to download each link before they could hash the contents thereof, so the damage would still be done.


The damage could be limited to 3 downloads per Google Document. If 3 downloads produce 3 identical hashes, then start a rate limiter / throw up a captcha / delay further fetches to avoid heavy intra-document duplication.
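A minimal sketch of that idea in Python; the threshold of 3 and the "throttle" behaviour are just the numbers from this comment, not anything Google has described:

  import hashlib

  DUPLICATE_THRESHOLD = 3  # assumed cut-off before throttling kicks in

  def fetch_with_dedup(urls, fetch):
      """Fetch each URL, but back off if the same content keeps coming back."""
      seen_hashes = {}
      bodies = []
      for url in urls:
          body = fetch(url)  # fetch() is assumed to return the response body as bytes
          digest = hashlib.sha256(body).hexdigest()
          seen_hashes[digest] = seen_hashes.get(digest, 0) + 1
          if seen_hashes[digest] >= DUPLICATE_THRESHOLD:
              # Heavy intra-document duplication: stop and rate-limit / show a captcha
              raise RuntimeError("duplicate content threshold hit; throttle this document")
          bodies.append(body)
      return bodies

Note this only limits the damage rather than preventing it: the first few copies still get downloaded before the hashes match.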


> How could Google use hashes to avoid duplication?

Rate limit per website (e.g. don't download more than 10 images per domain per second)

Limit the total number of images it downloads per document, so a single user cannot cause too much traffic. (A sketch of both limits follows below.)
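Roughly, in Python, with made-up numbers (10 requests per domain per second, 100 images per document) since the thread doesn't pin them down:

  import time
  from collections import defaultdict
  from urllib.parse import urlparse

  MAX_IMAGES_PER_DOC = 100      # assumed per-document cap
  MAX_PER_DOMAIN_PER_SEC = 10   # assumed per-domain rate limit

  def fetch_document_images(image_urls, fetch):
      """Fetch images with a per-document cap and a crude per-domain rate limit."""
      recent_fetches = defaultdict(list)  # domain -> timestamps of fetches in the last second
      bodies = []
      for url in image_urls[:MAX_IMAGES_PER_DOC]:  # anything beyond the cap is skipped
          domain = urlparse(url).netloc
          now = time.monotonic()
          recent = [t for t in recent_fetches[domain] if now - t < 1.0]
          if len(recent) >= MAX_PER_DOMAIN_PER_SEC:
              time.sleep(1.0 - (now - recent[0]))  # wait until the 1-second window clears
          recent_fetches[domain] = recent + [time.monotonic()]
          bodies.append(fetch(url))
      return bodies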


In that case, users may notice a performance decrease in spreadsheets for images from certain websites.


http://en.wikipedia.org/wiki/HTTP_ETag

(I know that servers can be configured not to send ETags or break caches by sending random ones every time, but this could reduce the data usage considerably since most of the responses would only include the headers.)
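For illustration, a conditional request with the requests library; the URL is hypothetical, and whether the server honours If-None-Match is exactly the caveat above:

  import requests

  url = "http://example.com/large-image.png"  # hypothetical URL

  # First fetch: remember the ETag the server returns, if any.
  first = requests.get(url)
  etag = first.headers.get("ETag")

  # Later fetches: send If-None-Match; a well-behaved server answers
  # 304 Not Modified with headers only, instead of resending the body.
  if etag:
      later = requests.get(url, headers={"If-None-Match": etag})
      body = first.content if later.status_code == 304 else later.content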


The query parameters make each request different. ETags are not unique across the internet, only for a specific URL. There is no way an ETag would help here unless the same request is made again later. And even a request answered only with an ETag still returns a full set of headers, which, while not 10MB, will add up to a lot of traffic.


But they could hash the filename (a hash prevents accidental disclosure of content).


Hashing the filename doesn't help; the URL is different each time, which is why caching doesn't work.

If we ignore that ETags are tied to URLs rather than 'files', the ETag approach suggested by userbinator might work in some cases. But if the large file is dynamically generated, it's unlikely to have an ETag at all; and the default in many servers is to derive the ETag from the file's inode rather than from its contents, so if there are multiple servers behind a load balancer, they're likely to return different ETags for the same file.




