What if google just immediately fetches images for all received messages, regardless of deliverability? They can dedupe wherever possible, but the sender still ends up with no information about whether the message was delivered, much less read.
> What if google just immediately fetches images for all received messages, regardless of deliverability?
I'm trying to work out whether this is more useful as a way to get Google to DoS themselves or as a way to get them to DoS arbitrary web sites of others. Either way, isn't this a gift to trouble-makers?
Of course Google would probably develop an automated defence against such attacks quickly if they happened in practice, but it seems any such defence would necessarily involve not caching all the images in advance, which would defeat the original point.
I'm fairly sure sending an email is more expensive than sending a GET, so it should be more effective for an attacker to make the requests directly than trying to use this to get google to proxy an attack.
I also strongly suspect that google's crawling infrastructure is more than capable of fetching a bunch of images for every single message gmail receives.
But even if I'm wrong about the above, google is perfectly capable of throttling their fetching to mitigate. (The problem really ends up looking an awful lot like crawling the internet, which is an area where google seems to have a bit of experience.)
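To make the throttling idea concrete, here is a minimal sketch of a per-host token bucket, the kind of rate limit a crawler might apply per origin. Nothing here reflects how Google actually does it; the class name `PerHostThrottle`, the rate, and the burst size are all made up for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


class PerHostThrottle:
    """Hypothetical per-origin token bucket; all parameters are illustrative."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec          # tokens added per second, per host
        self.burst = burst                # maximum tokens a host can accumulate
        self.tokens = defaultdict(lambda: float(burst))
        self.last_seen = {}

    def allow(self, url, now):
        """Return True if a fetch of `url` is allowed at time `now` (seconds)."""
        host = urlsplit(url).hostname
        # Refill the bucket for the time elapsed since this host was last seen.
        elapsed = now - self.last_seen.get(host, now)
        self.last_seen[host] = now
        self.tokens[host] = min(self.burst, self.tokens[host] + elapsed * self.rate)
        if self.tokens[host] >= 1:
            self.tokens[host] -= 1
            return True
        return False                      # caller would defer or drop the fetch
```

With a rate of 1 per second and a burst of 2, a third immediate request to the same host is refused, while requests to other hosts are unaffected. That caps the load on any one origin regardless of how many messages point at it.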
Google can't tell, a priori, whether or not a series of similar e-mails sent to many thousands of people with Google Mail addresses and containing similar but different image links like the above is a genuine mail going out to someone's list or a DDoS of www.example.com in which Google is about to become an unwitting participant.
By the time they've worked out whichever trick is being used this time (in the same way that they adapt to changing black hat SEO tactics, but probably only make major changes every few months) it's not hard to see a hostile party busting the bandwidth cap for anyone on a basic, low-volume hosting plan.
Why involve Google? Aren't sites on basic, low-volume hosting plans easy to knock over, without resorting to DDoS tactics? And if you're trying to knock over bigger sites, it doesn't seem like Google would make a very good DDoS platform in any case, since the requests would be originating from a relatively small range of IPs that a bigger site could just ban. Presumably the only reason they wouldn't want to ban the requests is if they're actually the ones sending the emails in the first place, so the problem sort of solves itself.
This is an old problem with an old solution. If you have an expensive-to-generate resource that you don't want automatically retrieved en masse, you use robots.txt to deny access to it.
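For anyone who wants to see what that looks like, here is a small sketch using Python's standard `urllib.robotparser`. The `/track/` path and the `ExampleImageProxy` user agent are invented for the example, and whether Google's image fetcher actually honours robots.txt is an assumption, not something established in this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking all well-behaved fetchers to stay out of /track/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /track/
"""


def may_fetch(robots_txt, user_agent, url):
    """Check a URL against robots.txt rules the way a polite fetcher would."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleImageProxy" is a made-up user agent string for illustration.
blocked = may_fetch(ROBOTS_TXT, "ExampleImageProxy", "http://www.example.com/track/abc.png")
allowed = may_fetch(ROBOTS_TXT, "ExampleImageProxy", "http://www.example.com/logo.png")
```

This only protects you from fetchers that choose to obey the file; it is a request, not an access control.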
AND it could create a disincentive to load up an email with unique images, since as soon as you send the email out all of those gmail addresses are coming right back at your server to request the images.
You can somewhat do this with the current system. I had no problems sending an email with ten 10 MB images. Google happily fetched all ten of the images off my server.
Not sure if they limit it at some point, but if a server accepts urls such as:
Google would fetch each separately. Send this out to a bunch of people, and it seems problematic. I'm going to be optimistic, and assume they built in some sort of limiting, but who knows.
We manage opt-in mailing lists for customers of restaurant chains, and for well-timed, well-targeted campaigns they get open rates in the 30%-40% range, with the majority of opens within 15-30 minutes of the send anyway. It takes more resources for us to handle the outbound mail load than to handle the inbound image requests, because the image URLs contain enough info to do a trivial regexp-based rewrite and fetch the images from a cache. I don't think handling a 100% open rate as soon as the mail was delivered would even remotely be a challenge.
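A rough sketch of the kind of regexp rewrite described above, assuming a URL layout of `/img/<campaign>/<recipient>/<file>`. That layout, the `/cache/` prefix, and the `rewrite` function are hypothetical; the comment does not say what the real URLs look like.

```python
import re

# Assumed layout: /img/<campaign>/<recipient>/<file>; purely illustrative.
TRACKED = re.compile(
    r"^/img/(?P<campaign>\w+)/(?P<recipient>\w+)/(?P<name>[\w.-]+)$"
)


def rewrite(path):
    """Map a per-recipient image path to a shared cache path.

    Returns (cache_path, recipient_id). The recipient id can be logged as an
    "open" while the image bytes are served from the shared cache entry.
    Paths that do not match the tracked layout are returned unchanged.
    """
    m = TRACKED.match(path)
    if m is None:
        return path, None
    return f"/cache/{m['campaign']}/{m['name']}", m["recipient"]
```

The point is that every recipient's unique URL collapses to one cached file, so even a burst of requests right after the send costs little more than a log write per hit.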
With google's machine learning brain trust, I think they could still do a pretty good job deduping. Maybe not perfect, but I'd bet on them to win an arms race.
Edit: ah, codeflo and EGreg are right. I was just thinking about the task of determining that the images serve the same role in each message (which I'm sure google could do a good job of). But (as they point out) in the "Dear <user>" case they'll still have to show the right image to the right user. Although, as Nacraile and jaxn say, if they load all those images eagerly they'll remove the value of those unique tracking images, and impose a cost on the sender.
There's a huge difference between something that they might do, in theory, at some point, using vaguely magic machine learning technology, and what they are currently doing, right now, to address the privacy concerns over a change that they already rolled out to millions of users.
Are you saying Google will try to guess and reconstruct "Dear Marge" in the same font as "Dear John" instead of requesting both from the origin server?
Scary enough to the average user that it'd probably kill the technique very quickly.
I'm not sure about that. It's nothing that desktop clients such as Thunderbird haven't been doing for years. I don't see any remote images in any e-mail until I click to say load them, and this works in much the same way that plug-in elements like Flash and Java are now click-to-show in various browsers. Numerous marketers and mailing list services still use the technique to track an approximation of reader numbers, though.
Nah, nobody pays attention to warnings attached to Gmail messages after having seen so many of them.
A particular example of crying wolf that comes to mind is the yellow box that says, "HEY! THIS SENDER ISN'T WHO THEY SAY THEY ARE!", which usually means that someone just forwards their .edu address to Gmail.
Although I'm sure it's possible to fool their image hashing algorithm, I doubt this will. Image hashing algorithms are designed to be resistant to small changes in the image, and more advanced ones can generate hashes that indicate how similar one image is to another. I haven't tried this, but you can probably see a proof of this using Google image search. Add some text to an image, and see if Google image search can find the original.
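As a toy illustration of why small changes don't move such a hash much, here is a sketch of an "average hash" over a tiny grayscale grid. This is a well-known simple perceptual hash, not whatever Google uses, and real implementations downscale the image first; the sample pixel grids are invented.

```python
def average_hash(pixels):
    """One bit per pixel: 1 if brighter than the image's mean, else 0.

    `pixels` is assumed to be an already-downscaled grayscale grid
    (a list of rows of ints in 0..255).
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]


def hamming(a, b):
    """Number of differing bits; a small distance means similar images."""
    return sum(x != y for x, y in zip(a, b))


original = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [200, 200, 10, 10],
    [200, 200, 10, 10],
]
# Same picture with two pixels nudged slightly (e.g. a faint watermark).
tweaked = [row[:] for row in original]
tweaked[0][0] = 30
tweaked[3][3] = 25
# A genuinely different picture: the inverse of the original.
inverted = [[255 - p for p in row] for row in original]

near = hamming(average_hash(original), average_hash(tweaked))   # 0
far = hamming(average_hash(original), average_hash(inverted))   # 16
```

Nudging a couple of pixels leaves every bit unchanged, while a really different image flips all of them. Fooling this kind of hash takes changes big enough to alter the overall light/dark structure, which is roughly what the fractal-background idea is aiming for.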
Procedurally generated fractal backgrounds with random seed, that might work?
Anyway, I think it would be perfectly fine if Google matched up emails with similar content and where there was one image that was unique for everybody just remove it, maybe with a note to the user in that case.
Do NOT have a "click to load images" button. If users can't ever see them then it completely destroys spammers' ability to use them even for a rough sampling.
I would love to see spam as an advertising method be completely destroyed. It won't be, because even without tracking it is still easy and useful to spam out lots of ads, but this would help.
Yup. Though I suspect my university's alumni newsletter also has tracking images in it. I haven't checked, but if I were them I'd use a tracking image.
(Also, they don't currently seem to do anything like you're suggesting.)