Gmail Ran out of space (daggle.com)
49 points by prabodh on Sept 29, 2009 | 52 comments



I hate the fact that I can't search / sort by size in my gmail account. I'm sure I have at least 1gb of space that's being wasted by attachments that I no longer care about; but I don't want to go through all my messages to see what I want to delete and what I want to keep.


What you can do is add your gmail account as an IMAP account in Outlook or another email client and sort it there. Hard to believe that we must do that because of their anti-sorting stance.
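For anyone who would rather script this than install a client, here is a rough sketch using Python's standard imaplib. The host name, the "[Gmail]/All Mail" mailbox name and the 1 MB threshold are assumptions, and the two helper names are made up for illustration; only the parsing helper can be checked without a live account.

```python
import imaplib
import re

# Matches FETCH response lines such as b'12 (RFC822.SIZE 2048)'.
SIZE_RE = re.compile(rb"^(\d+) \(.*?RFC822\.SIZE (\d+)")

def parse_sizes(fetch_data):
    """Turn FETCH response lines into (sequence number, size in bytes) pairs."""
    pairs = []
    for item in fetch_data:
        if not isinstance(item, bytes):
            continue  # skip anything that is not a plain response line
        match = SIZE_RE.match(item)
        if match:
            pairs.append((int(match.group(1)), int(match.group(2))))
    return pairs

def largest_messages(user, password, min_bytes=1_000_000, host="imap.gmail.com"):
    """List (sequence number, size) for messages above min_bytes, biggest first."""
    conn = imaplib.IMAP4_SSL(host)  # host is an assumption
    try:
        conn.login(user, password)
        conn.select('"[Gmail]/All Mail"', readonly=True)  # mailbox name is an assumption
        # LARGER is a standard IMAP SEARCH key, so the server does the filtering.
        _, data = conn.search(None, "LARGER", str(min_bytes))
        ids = data[0].split()
        if not ids:
            return []
        _, fetched = conn.fetch(b",".join(ids).decode(), "(RFC822.SIZE)")
        return sorted(parse_sizes(fetched), key=lambda pair: pair[1], reverse=True)
    finally:
        conn.logout()
```

The sorting still happens on your side; the server only has to answer "which messages are larger than N bytes".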


It isn't really an anti-sorting stance. If you have worked on appengine and with Google File System you will see that it is fast almost solely because there is no complete dataset. So sorting is really tough on 64K or 64MB chunks spread around many many machines. Google is so fast and scalable exactly because there is no complete dataset easily attainable. GFS works in 1000 item chunks and that is about the best you can get with that setup. Searching, counts/increments and metadata about ALL of your data in the GFS or email in this case, is a tough problem. The same underlying data system is used for gmail, reader, etc.


Why does it have to be a 'sorting' stance? Why can't you just search for something like 'size:>1MB' to get all emails that are larger than 1MB? Even if they aren't 'sorted' one can easily scale the search to narrow in on really large attachments.


I wonder how difficult it would be for them to create a metadata file per account that would only contain things like sent from, sent to, size, subject, date sent (standard email metadata). Then if someone wants to go digging around, send the metadata down as JSON and let the sorting happen client side and let the client return the order the messages should render.
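To make the idea concrete, a minimal sketch of the "ship the metadata, sort on the client" step. In a browser this would be JavaScript; Python stands in here. The field names (id, from, subject, size) and the sample records are invented for illustration and say nothing about how Gmail actually stores anything.

```python
import json

def largest_first(metadata_json, limit=10):
    """Parse a per-account metadata dump and return the biggest messages."""
    records = json.loads(metadata_json)
    ranked = sorted(records, key=lambda record: record["size"], reverse=True)
    return [(record["id"], record["size"]) for record in ranked[:limit]]

# Hypothetical metadata file: one small record per message, no bodies.
sample = json.dumps([
    {"id": "a1", "from": "alice@example.com", "subject": "photos", "size": 10_485_760},
    {"id": "b2", "from": "bob@example.com", "subject": "lunch?", "size": 2_048},
    {"id": "c3", "from": "carol@example.com", "subject": "data dump", "size": 52_428_800},
])

print(largest_first(sample, limit=2))  # [('c3', 52428800), ('a1', 10485760)]
```

The client would then send the resulting order (or just the ids to delete) back to the server.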


It is possible but it also means creating more data. I could easily see this being a pay element to gmail in the future. Basically it could be almost like a bot/spider/worker that gets a fairly recent snapshot of your data as metadata and allows you to clean up or organize based on it. It could be pretty tasking though and may need a lot of engineering because right now your gmail data might be spread across thousands of machines. With more data/years this is only going to get more tasking and costly.

Up to now we have been living in a relatively small sandbox in terms of data. As our lives spread to terabytes of data and across many services, we too will be unable to run atomic operations on the whole of our data.


Every time you mention GFS, you're really talking about Bigtable (which happens to be built on top of GFS)


Yes I say GFS because that is the core of why it is this way. From search to reader to gmail etc. All google data operates this way because of the architecture.

It is all in 1000 buckets. Ever gone beyond page 100 in google results? You can't. Even though there are millions of results you can only get to page 99 at 10 results per page because it too is built on GFS.


You are correct: http://www.google.com/search?q=blue&start=1000

> Sorry, Google does not serve more than 1000 results for any query. (You asked for results starting from 1000.)


> Results 811 - 817 of about 687,000,000 for test [definition]. (2.19 seconds)

After seeing this, I doubt whether 687,000,000 is correct or not.


Term counting is the hello world of map-reduce examples, so while that number is obviously not exact (exactly 687 million?), it's reasonably accurate, considering the actual use of the count, which is more useless the bigger it is. No one created the opposite of Google Whack, finding the terms that yielded the most results; that's largely uninteresting when the counts are in the hundreds of millions range.
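For anyone who hasn't seen it, the "hello world" in question looks roughly like this. It is a single-process toy with made-up function names, not any real MapReduce API: the map step emits (term, 1) pairs, a shuffle groups them by term, and the reduce step sums each group.

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    """Emit a (term, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in doc.split()]

def shuffle(pairs):
    """Group the emitted counts by term, as the framework would between phases."""
    grouped = defaultdict(list)
    for term, count in pairs:
        grouped[term].append(count)
    return grouped

def reduce_phase(grouped):
    """Sum the counts for each term."""
    return {term: sum(counts) for term, counts in grouped.items()}

def term_counts(docs):
    pairs = chain.from_iterable(map_phase(doc) for doc in docs)
    return reduce_phase(shuffle(pairs))

print(term_counts(["test this", "this Test test"]))  # {'test': 3, 'this': 2}
```

In a real cluster the map and reduce calls run on different machines over different chunks, which is why a total is cheap to compute while a globally sorted listing is not.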


Found this out while working at a web consultancy with a Google appliance. Specifically this page: http://bidmc.org/Search.aspx


Thanks for that. I had no idea how GFS worked.

Seems like Google could easily treat it as a searching issue rather than a sorting issue, though. It would be easier than full text searching (which it already does).


Gmail does not support rich full-text search -- the indexer used is very limited. Among other things no synonyms, word stems, or subwords are indexed.


This is the biggest problem I have with gmail, and one of the principal reasons why I don't feel I'll be using it "permanently." My search queries very rarely find the emails that I'm looking for.


Thanks for the insight.


Just curious, but will adding my gmail account as an IMAP account in another client cause any synchronization issues with accessing my mail on mail.google.com ?


The only issue is that IMAP was never designed to deal with labels, so labels are mapped to folders. This can cause issues though. Deleting something from a 'folder' only removes that label from it. Then you have to go to 'all mail' to delete it.

You can work around this issue by using the Advanced IMAP Settings and hiding the 'all mail' folder in IMAP, then setting the 'delete mail once all visible labels are removed' setting. I had issues with using this and offlineimap to sync... Sometimes when I moved a mail from one 'folder' to another offlineimap would delete it from one folder and then create it in another, but Gmail's IMAP server would see that the mail had no labels on it (in the middle of the operation) and then move it to the trash instead. This issue shouldn't come up with a client like Outlook or Thunderbird though because of the way that they operate (I believe that there is a 'move' command in the IMAP protocol that they use).


I never had any problems. Just note that, when you read a mail on your IMAP account, it'll mark it as read on your webmail. Same for deleting/moving and whatever else you do.


Use "has:attachment" in your gmail search - not perfect (no sorting) but it at least shows you only the emails that have attachments.


and search "has:attachment filename:mov|mpg|jpg|rar|zip" and similar to find file types that are typically very large.


The problem with these is that I have thousands of emails with attachments, and the worst offenders are data dumps that don't have extensions.

I appreciate the advice though :D


Kind of misleading title - it implies that the whole of GMail ran out of storage space, whereas it was in fact just this guy.


This article is far too long for something that can be summed up in these words: I am an abnormally large user of my email. Google's "buy more space" feature is broken because of a bug. It should be fixed.


Not exactly. The point is that the ratio of gmail's storage increase is so low that more and more people will inevitably bump against the limit in a couple of years.

It's not about how full the average inbox is today, but where it's headed.


Not really. Look at the stuff he's deleting -- hundreds of Facebook notifications = 10s of MB. He goes through and deletes thousands of emails and doesn't really recover more than 100 MB. My first feeling when I read this was that he's obviously deleting the wrong stuff. What he should really be looking for are those picture album emails at 10 MB each. That's what's taking up his space and that's what most people aren't doing. Google is practicing the old idea of 80/20 and doing it very well.


I don't feel that that's necessarily the case. I think that most people delete things like Facebook/Twitter notifications. And his complaining about the size of the spam he gets is a moot point. The only reason that he cared about that was due to his proximity to his quota.

Gmail automatically deletes spam mails older than 30 days. It's not like people will fill up their Gmail account with massive amounts of spam just sitting in their spam folder.


Simple solution with an additional filter rule:

[x] Delete message after __ days

GMail's web UI is really overrated. Missing many features that desktop clients and even other webmail platforms have offered for years. I still prefer traditional clients with the web UI as a nice fallback.


I do enjoy the 'conversation' view though. It's nice that it will pull in the messages that I sent into a thread, rather than just relying on people quoting the whole message in replies.


If you use OS X, Postbox will do this.

http://www.postbox-inc.com/

I'm sure clients exist for other platforms which do the same.


I usually use mutt outside of the Gmail interface, but IIRC there is a Ruby-based console mail reader that aims to do the 'conversations' thing.



Curious to know which features you miss in GMail?


A few things:

1) Weak filtering/action rules. You can get pretty elaborate with Outlook or Mail.

2) Weak, non-real-time search. Have to wait for page reloads to re-sort things, for example.

3) No mail preview pane. Lots of back & forth to the message list.

4) No easy way to back up your entire mail store.

5) No way to change the number of lines previewed. I would prefer 4 or 5 versus a partial line after the sender/subject.

6) No saved searches or Smart Folder-ish functionality. Labels don't quite cover it for me.

If these things exist they are doing a terrible job with the GUI because I can't find them.


> 6) No saved searches or Smart Folder-ish functionality. Labels don't quite cover it for me.

How is that? Anything that you can construct a gmail search with can be used in a filter to 'tag' incoming emails. The only thing lacking is a way to 'filter' based on tags or to order the way that filters are processed. (i.e. you can't tag emails based on the other tags they are already tagged with, and even if you could it might not work since the filters for those tags might happen after the filter looking for them)


I don't think this would work for me. My saved searches are a bit too eccentric. I will try out the Google Labs deal.


And there is a lab for saved searches


You might want to make that more clear, "There is a Google Labs option in Gmail for saved searches." My initial impression was that 'lab' was a typo of 'label' before I realized what you meant.


Regarding wasting storage on crud, I would be very surprised if the gmail storage system does not de-duplicate large attachments that have been forwarded around.

It would not be difficult to store one copy of every attachment or identical newsletter and link to it. This has to decrease their storage requirements dramatically.


Off the top of my head:

1) What if the disk your canonical copy is on dies?

2) You just introduced a dependency for cross-account reference counting (Alice and Bob share CryptoBible.doc attachment, Alice deletes the message it is attached to, Bob better still be able to access it). Your programmers will love you for that one. Don't forget all sorts of fun edge cases like "Does a suspended account peg an attachment on disk for forever, on the theory they could be unsuspended?"

3) Define "identical newsletter". No, really, pretend this is a job interview question. *waits* Did you take into account the headers? *waits* How about the salutation (Dear $FIRST $LAST,)? *waits* How about the tracking pixels and unsubscribe links, which are by necessity personalized?

4) What is the performance impact of the algorithm to determine "identical", with respect to your kinda-sorta-not-really "identical" developed in question #3? Can it classify billions of messages in realtime to make nigh-instantaneous retain-or-delete-and-add-pointer decisions?

5) Does this algorithm introduce multiple single-points-of-failure into Gmail to save a fraction of a sliver of the costs of running the service?


1) Why do you think Google has the canonical copy on a disk?

2) Why do you think Google ever intentionally deletes anything? Concurrent deletes are hard and expensive no matter what, and all of Google's tools make it much more so. They've gotten in all kinds of trouble in the EU for not being able to provide guarantees about how long they retain data.

3) Store headers, body, and MIME chunks separately. Your datastore will make hashes for its own reasons, and can make the de-duplication decisions independently of your app.

4) Your datastore is using hashes as the keys for retrieving the data already.

5) No, because there is no algorithm for this feature -- you got it for free because of other architectural decisions.
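A toy version of what points 3 to 5 describe, assuming a datastore keyed by content hash. The ContentStore class and store_message helper are made up for illustration and are not a description of Gmail or Bigtable internals; SHA-256 is just a convenient choice of hash.

```python
import hashlib

class ContentStore:
    """In-memory stand-in for a datastore whose keys are content hashes."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # Identical bytes hash to the same key, so a second copy is a no-op.
        self._blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def __len__(self):
        return len(self._blobs)

def store_message(store, headers, body, attachments):
    """Store headers, body and each MIME chunk separately; keep only the keys."""
    return {
        "headers": store.put(headers),
        "body": store.put(body),
        "attachments": [store.put(chunk) for chunk in attachments],
    }

store = ContentStore()
doc = b"contents of CryptoBible.doc"
alice = store_message(store, b"To: alice@example.com", b"see attached", [doc])
bob = store_message(store, b"To: bob@example.com", b"see attached", [doc])
print(len(store))  # 4: two header blobs, one shared body, one shared attachment
```

Because the personalized headers are stored apart from the attachment, "identical" never needs a fuzzy definition: only byte-for-byte equal chunks collapse.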


> They've gotten in all kinds of trouble in the EU for not being able to provide guarantees about how long they retain data.

The thing that I don't get is why they can't just 'wipe' the data. Maybe I'm misunderstanding GFS/BigTable here, but you don't necessarily need to 'remove' the information from that database. Just 'zero out' the data in there (or overwrite it with random data).


For starters, there are tons of indexes and snippets extrapolated from the data, and they're cached all over the place.

More fundamentally, why do you think any of the records are mutable at all? You reap enormous benefits at that kind of scale by making all writes log-structured.


I think you need to read about unix hard links, since they solve all 5 of your problems at once.

The name of each hard link is the hash of the file, and you are done.

Identical means identical, not close.

Each server will have a copy of attachments for email that lives on that server. No need to do it globally, just de-dup within each server.

I don't know if it saves much, but if your filesystem supports hard links, it's pretty much free.
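Roughly what I mean, as a sketch for a single server with a local filesystem that supports hard links. The directory layout, file names and the dedup_into helper are invented for illustration, and SHA-256 is just one possible hash.

```python
import hashlib
import os
import tempfile

def dedup_into(store_dir, path):
    """Make `path` a hard link to a hash-named copy in store_dir."""
    with open(path, "rb") as fh:
        digest = hashlib.sha256(fh.read()).hexdigest()
    canonical = os.path.join(store_dir, digest)
    if not os.path.exists(canonical):
        os.link(path, canonical)   # first copy: just give it a hash name
    else:
        os.remove(path)            # duplicate: drop it, link to the stored copy
        os.link(canonical, path)
    return canonical

def demo():
    """Two mailboxes get the same attachment; return (link count, same inode?)."""
    with tempfile.TemporaryDirectory() as root:
        store = os.path.join(root, "store")
        os.mkdir(store)
        paths = []
        for user in ("alice", "bob"):
            path = os.path.join(root, user + "-CryptoBible.doc")
            with open(path, "wb") as fh:
                fh.write(b"same bytes for everyone")
            paths.append(path)
        canonical = [dedup_into(store, p) for p in paths][0]
        return os.stat(canonical).st_nlink, os.path.samefile(paths[0], paths[1])

print(demo())  # (3, True)
```

The link count is the reference count: if Alice deletes her message, her link goes away and Bob's copy is untouched. The data is only freed when the last link, including the hash-named one, is removed.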

That said, I don't know how google's GFS does things.


You're on the right track, but why the hell do you think your gmail lives on 'a server', much less in a filesystem directly?

GFS normally uses chunks 2^14 times larger than a traditional inode -- you'd use infrastructure built on top of it (like Bigtable) to store email and indexes of it. You could easily end up designing a system that stores the email bodies in Gmail in a content-addressed form just as an indexing method, and end up getting de-duplication for free.


It's called "deduplication", and is used in some enterprise networks to free up storage space.


IIRC, MS Exchange does this. At least w/ attachments. I don't know about attachments from outside of the 'Exchange Ecosystem,' but if one Exchange user sends an attachment to 3 others, only one copy is stored in the Exchange DB.


A month or so [1] after Gmail launched, The Screensavers [2] got a Gmail address and, on the air, invited viewers to send them emails to fill it up and see what happened. Unfortunately, I didn't see the end of the show.

[1] I don't know what took them so long. I know that it wasn't that soon after launch because Gmail came out in April '04, but this show was in May. (I didn't have a television at college, but my parents did). Gmail invites were scarce, but one would think they could've gotten one.

[2] This was on TechTV. It must've been just before TechTV was folded into G4 -- literally the last month of programming.


What Gmail needs is an optional rolling deletion policy. When your mail gets to the ~7GB limit Gmail just deletes the oldest mail to stay within the limit. It could suggest you turn on this feature when you reach your limit.

MythTV does this when you fill your hard drive up with TV shows. It's much better than refusing to accept any new data.
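The policy itself is only a few lines. A sketch, assuming each message is a (timestamp, size, id) tuple; the function name and the numbers in the example are made up.

```python
def rolling_delete(messages, quota_bytes):
    """Drop the oldest messages until the total size fits within quota_bytes.

    messages: list of (timestamp, size_bytes, message_id) tuples.
    Returns (kept, deleted_ids), with kept ordered oldest first.
    """
    ordered = sorted(messages)  # tuples sort by timestamp first
    total = sum(size for _, size, _ in ordered)
    deleted = []
    cut = 0
    while cut < len(ordered) and total > quota_bytes:
        _, size, msg_id = ordered[cut]
        total -= size
        deleted.append(msg_id)
        cut += 1
    return ordered[cut:], deleted

inbox = [(3, 40, "c"), (1, 50, "a"), (2, 30, "b")]
print(rolling_delete(inbox, quota_bytes=80))  # ([(2, 30, 'b'), (3, 40, 'c')], ['a'])
```

A real version would presumably skip starred or labelled mail, which is why it should be opt-in.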


What we need is a "keep for 30 days" button, rather than "archive". Sometimes I mean keep forever, but more often my reaction to mail is that I don't expect to need it, and if I don't need it in the next month, I definitely don't need it, but maybe I'll change my mind next week and want it.


Actually that is what the Delete button does by default, keep it for 30 days.


Yea, but sometimes people will empty their Trash though. If it was archived with a 'remove after 30 days' tag on it, it wouldn't really be possible to accidentally purge all such emails.



