How Mail.Ru Reduced Email Storage From 50 To 32 PB (smashingmagazine.com)
147 points by riqbal on Jan 28, 2017 | 47 comments



Reading about their fix with a "magic" number, it seems that when there was corruption the file would never be deleted again, since you've lost information at that point. (As they claim, even if the magic number went back to "0" you would still need to keep the file.)

Rather than this questionable solution, why not create a transaction-like system where a delete results in a tombstone/marker being placed on the email database, with a rollback mechanism if one or both of the services crashes?

For example: mark the email as deleted in the index, then delete the file, then, if the file was successfully deleted, actually clear the email record.

So the email table would be: {id uint64, ... deleted bool}, and the failure cases would be handled as follows (see the sketch after the list):

1. If the file delete service failed, then unmark the deleted index (or try again).

2. If marking the email as deleted failed, then you're back to square one with no change.

3. If the email is marked and the file is deleted, but actually deleting the email record failed: restart/try again/etc. until you get that pesky email record removed.
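
A rough sketch of that flow, for concreteness (all names are hypothetical; the real index, file service and error handling would differ):

  // Hypothetical sketch of the mark -> delete file -> clear record flow.
  // FileStore and EmailIndex stand in for the two real services.
  type FileStore interface {
    Delete(fileID string) error // deleting an already-deleted file reports success
  }

  type EmailIndex interface {
    MarkDeleted(emailID uint64) error
    ClearRecord(emailID uint64) error
  }

  func deleteEmail(idx EmailIndex, fs FileStore, emailID uint64, fileID string) error {
    // Step 1: tombstone the email first; if this fails, nothing has changed.
    if err := idx.MarkDeleted(emailID); err != nil {
      return err
    }
    // Step 2: delete the file; a background sweeper retries any email still marked deleted.
    if err := fs.Delete(fileID); err != nil {
      return err
    }
    // Step 3: only now clear the index record; if this fails, the marker stays
    // and the whole sequence is safe to run again.
    return idx.ClearRecord(emailID)
  }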


Transactions are "hard and slow". Distributed transactions are hard-er and slow-er. Random 32 bit integer additions are likely "good enough" and are fast and simple.

Also, assuming that their storage is directly some FS, you don't have access to transactions at all. You can't do a distributed transaction against something that, as a black box, doesn't appear to do any transactions.

In this case, doing a distributed transaction would mean using an entirely different storage solution - a transactional one supporting 2PC or similar. Do magic counters sound good enough yet? ;)
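
For a sense of what "hard and slow" means here, a bare-bones sketch of two-phase commit over hypothetical interfaces - note the two round trips per participant and the window where everyone blocks on the coordinator:

  // Hypothetical participant interface; every logical delete would need a
  // prepare round trip and a commit round trip to each participant, plus a
  // durable coordinator log in between.
  type Participant interface {
    Prepare(txID string) error // vote: promise to commit if asked
    Commit(txID string) error
    Abort(txID string) error
  }

  func twoPhaseCommit(txID string, parts []Participant) error {
    // Phase 1: collect votes; any refusal aborts everyone.
    for _, p := range parts {
      if err := p.Prepare(txID); err != nil {
        for _, q := range parts {
          q.Abort(txID) // best effort
        }
        return err
      }
    }
    // The commit decision must be durably logged here; if the coordinator
    // dies after this point, prepared participants block until it recovers.
    // Phase 2: commit everywhere, retrying until each participant acknowledges.
    for _, p := range parts {
      for p.Commit(txID) != nil {
        // retry (with backoff in any real system)
      }
    }
    return nil
  }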


Thanks for your question. In your solution there are some consistency vulnerabilities:

- (Step 1) If the "delete service failed" - does a timeout count as a failure? The decrement may well have been done, but your app never saw the response. In that case you don't know whether the decrement really happened or not, so you cannot safely retry on this error (unless you use the "magic" =))

- (Step 1) If you "unmark the deleted index", that does not mean it is durable on disk. Your app or even the server can shut down/reboot before you ensure the disk write.

- (Step 2) If marking failed, you generally cannot roll back. You can try to roll back the decrement, but that operation can fail too, and in that case the counter in the filedb will be corrupted on the next delete iteration.

- (Step 3) The indexes become the exclusive owner of the data. If you somehow lose the indexes for one mailbox, you cannot rely on the counters in the filedb anymore, so you cannot do any further decrements until a full recheck of both sides.

- Any bug in any app in this chain can destroy filedb consistency.


Your magic numbers don't seem to guarantee consistency either: after multiple faulty exchanges the file could be gone forever, because the random numbers happened to be the same, or the sum of subtractions during the faulty exchanges reached 0, or there was an overflow, etc.

Distributed algorithms are not easy, but for this case there are proven solutions; look up how Riak deletes records, for example, with "tombstones" and "reaping". Basically you have to think in terms of CRDTs and have a log of operations, a versioning scheme, synchronization, etc., all of which help you make sure that no operation is applied twice anywhere and that, in case of faulty communication, nodes can synchronize and apply missing operations. This gives you consistency, eventually, and a reliable way to know when files can be physically deleted. Or, you know, just use Riak.
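
A very rough illustration of the tombstone idea (this is not Riak's actual mechanism, just a sketch of deletes as versioned operations that are only reaped once every replica has seen them):

  // A delete is recorded as a versioned tombstone instead of removing the
  // entry right away; a later "reap" pass removes the tombstone (and the
  // file) only once every replica has acknowledged it and a grace period passed.
  type entry struct {
    version   uint64 // simplified stand-in for a vector clock / dotted version
    tombstone bool
    deletedAt int64 // unix seconds
  }

  func (e *entry) markDeleted(version uint64, now int64) {
    if version > e.version { // last-writer-wins stand-in for real conflict resolution
      e.version, e.tombstone, e.deletedAt = version, true, now
    }
  }

  // reapable reports whether the tombstone (and the underlying file) can be
  // physically removed without risking a replayed or resurrected delete.
  func (e *entry) reapable(ackedByAllReplicas bool, graceSeconds, now int64) bool {
    return e.tombstone && ackedByAllReplicas && now-e.deletedAt > graceSeconds
  }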


Or Dynamo, or Cassandra, or several others that all exhibit good behaviour around these things. Also, all of the above (and Riak) emit event streams when tombstones go away, so you can hook into those and reap the file.


There would be no rollback of the deleted marker since we don't want to undo the delete - we want to keep trying until it works.

> timeout is a fail?

No, a timeout would just trigger a retry until we got a response back (or, if there is a reader, it would keep connecting until it got through). There is no danger in deleting a file multiple times, since any additional, unneeded delete would simply report that.

> this does not mean it was fixed on disk forever

Then the system would keep seeing the marker (after it restarted from the crash), repeat the delete call, and then try to delete the email record for good. It can crash 100 times, but eventually both would get removed.

> Indexes become exclusive data owner. If you lose indexes...

Then no matter how you build this system you're toast and have to restore backups. No system can handle losing its indexes (and the error-correcting or checksum copies), whether they are btrees, file blocks, or memory addresses.


This is a surprisingly primitive and simple system. A twice-replicated file is a huge space overhead for not-very-reliable storage. (I mean, the probability of losing two disks at the same time isn't high, but it's also not immeasurably low.)

For example, Facebook has talked about how they use erasure coding to get far higher reliability at 1.4x space overhead.

https://code.facebook.com/posts/1433093613662262/-under-the-...

While erasure coding can have a performance impact (although mostly only on failures), a multi-tier storage system will eliminate that. I presume that most attachments aren't read more than a week or two after being received.

So while it is definitely nice to get some savings through deduplication, there are much bigger gains to be had from more intelligent storage of the data.
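
For a sense of where a figure like 1.4x comes from: a Reed-Solomon code with 10 data shards and 4 parity shards stores 14 shards per 10 shards of data and tolerates losing any 4 of them. A minimal sketch using the klauspost/reedsolomon Go library (the 10+4 split is just the textbook example, not what Facebook or Mail.Ru actually run):

  package main

  import (
    "bytes"
    "fmt"
    "log"

    "github.com/klauspost/reedsolomon"
  )

  func main() {
    // 10 data shards + 4 parity shards = 1.4x raw storage, tolerating 4 lost shards.
    enc, err := reedsolomon.New(10, 4)
    if err != nil {
      log.Fatal(err)
    }

    data := bytes.Repeat([]byte("attachment bytes "), 1000)
    shards, err := enc.Split(data) // 10 data shards plus room for 4 parity shards
    if err != nil {
      log.Fatal(err)
    }
    if err := enc.Encode(shards); err != nil { // fill in the parity shards
      log.Fatal(err)
    }

    // Simulate losing any 4 shards (e.g. dead disks), then rebuild them.
    shards[0], shards[3], shards[7], shards[12] = nil, nil, nil, nil
    if err := enc.Reconstruct(shards); err != nil {
      log.Fatal(err)
    }
    ok, _ := enc.Verify(shards)
    fmt.Println("reconstructed and verified:", ok)
  }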


As an aside, this provides an interesting before/after example of the editing process. The same article was on http://highscalability.com/blog/2017/1/2/efficient-storage-h... earlier this month, but has been edited here.

The title also changed from "Efficient Storage: How We Went Down From 50 PB To 32 PB" to "How Mail.Ru Reduced Email Storage From 50 To 32 PB" - the latter of which seems a lot better for places like HN where headline edits for context can be tricky to come by.


Instead of using random numbers and adding/subtracting, you could use random primes and multiply/divide. If a deletion is applied twice, the random prime is no longer a factor of the product, so you detect that and ignore it.

Of course, you need to be able to store a very large number :)
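
A toy sketch of the idea with Go's math/big, purely illustrative - the product grows with every live reference, which is exactly the "very large number" problem:

  package main

  import (
    "fmt"
    "math/big"
  )

  func main() {
    product := big.NewInt(1) // the "counter": a product of one prime per live reference

    p1, p2 := big.NewInt(101), big.NewInt(103) // random primes picked per reference
    product.Mul(product, p1)                   // add reference 1
    product.Mul(product, p2)                   // add reference 2

    remove := func(p *big.Int) {
      if new(big.Int).Mod(product, p).Sign() == 0 {
        product.Div(product, p) // prime is still a factor: apply the delete
      } else {
        fmt.Println("duplicate delete ignored for", p) // already divided out
      }
    }

    remove(p1) // first delete: applied
    remove(p1) // replayed delete: detected and ignored
    fmt.Println("file still referenced:", product.Cmp(big.NewInt(1)) != 0)
  }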


I got my email storage down to ~66% of its previous requirement by enabling the zlib plugin for Dovecot and running a batch job to gzip-compress the mail in my Maildir.


That sounds like a good idea and I've bookmarked your post for future reference. Did you notice any performance issues after switching to gzip?


I didn't, no, but it's not something I measured.


I don't get this:

If two systems (one with the index, another with the file) need to communicate in order for an email to be deleted, why would you need a random number to track email deletion in case of error? Wouldn't it make more sense to decrement the index only once the file deletion was successful?


This was answered in the comments section of the high scalability article [0]:

> Deleting an email (and generally any file) is not transactional. It can be rolled back by a server poweroff, for example. This can generate an additional decrement, and that is the problem. Generally, 2-phase commit between the "email deleter" and the "counter decrement" is not enough. You will also need 2-phase commit between the "deleter" and the physical disk (not its cache), and so on. Also, a network error can occur between these 2 phases. We do not want to lose users' data in any case.

[0] http://highscalability.com/blog/2017/1/2/efficient-storage-h...


Thanks for pointing me in the right direction! :)


The scenario they wanted to handle is not a failing deletion request, but multiple requests caused by deleting one e-mail. This can happen if the index DB thinks the request failed and retries it, while it actually went through on the file DB side.

This would cause the file to be removed too early, as the counter would be decremented more than it should be.

That said, this magic-number solution sounds pretty fiddly - preventing the index DB from retrying deletion requests would be enough to achieve the same effect (some unnecessary files kept, but no erroneous deletions).


There could be other necessary cleanups in the delete function that need to be completed aside from the deletion of the physical file. Other than that I generally agree, but I also know never to assume too much about code written by other people.

I guess another way of solving this would be to put each file in its own folder on the file system, giving the folder the same name as the original file and renaming the file to something generic, "file" or whatever. Then you create one symlink per email pointing to the file, and the email record points at that symlink. When deleting the email you delete its symlink. After deleting the symlink you check whether there are any symlinks left in the folder, and safely delete the folder if none are present.

On receiving the first email with filehash X:

  mv ./original_unique_name.file ./original_unique_name.file.tmp   # move aside so a directory can take its name
  mkdir ./original_unique_name.file
  mv ./original_unique_name.file.tmp ./original_unique_name.file/newfilename
  ln -s newfilename ./original_unique_name.file/symlink_for_email_1   # relative target resolves inside the folder
On receiving subsequent emails with filehash X:

  ln -s newfilename ./original_unique_name.file/symlink_for_email_2
  ln -s newfilename ./original_unique_name.file/symlink_for_email_3
  ln -s newfilename ./original_unique_name.file/symlink_for_email_4
  ... etc
On delete email Y:

  unlink ./original_unique_name.file/symlink_for_email_$Y
  (pseudo) if no symlinks in ./original_unique_name.file/ { rm -rf ./original_unique_name.file/ }
Thus avoiding the need for a counter, as each email links to its own symlink.

Will cost a few bytes per email, but seems like they have some to spare!


1. This has nothing to do with distributed storage.

2. Symlinks are not required to point to an existing file.

3. Symlinks are not atomic.

4. You still need to maintain the filedb (otherwise how do you resolve which storage contains a given file?).


1. That depends on your distributed storage. If you use a clustered file system, your argument is void.

2. While true, I fail to see how that is relevant. What part of my described flow would it break?

3. While true, I again fail to see the relevance. Are you just listing characteristics of symlinks? A field in a database could just as well hold a filepath pointing to a non-existent file.

4. Sure. I was describing how to avoid the counter and magic number, not the database.


How does application-level deduplication compare to ZFS deduplication?


It doesn't, really.

Applications can use much more domain knowledge to do the dedup, possibly with less resource usage or better speed (memory, e.g. because an index to dedup against already exists, or CPU, e.g. because it can use existing constraints on the data and doesn't need to hash all of it). FS-level deduplication is... well. For classic FSes it does exist [1], but it's a pretty ugly kludge, and it's pretty much inherently an offline process that can't happen while the data is first being written. For CoW FSes it's less of a problem, and e.g. ZFS and btrfs can do it, but it's often not exactly a great fit. (Although one could see their snapshotting features as deduplication too.)

[1] Through extent sharing; ext4 does this, and IIRC it's experimental in XFS. duperemove uses this.


Application-level dedup usually works at the file level, while ZFS dedups data blocks (of variable size). The ZFS way of doing it allows saving space on large files that are similar, like VM images. On the other hand it requires storing block hashes in memory.

A ZFS dedup table entry is 320 bytes in size [1], so assuming 128 KB blocks that comes to 2.5 GB / TB. Moreover, the tables are kept in the ZFS cache (ARC) and they have to compete with other data and metadata. Since metadata is by default limited to 75% of ARC [2], you'll need at least 3.3 GB per TB and on the Internet you'll find even worse figures, like 20 GB [3].

[1] http://www.oracle.com/technetwork/articles/servers-storage-a... [2] https://github.com/zfsonlinux/zfs/commit/9907cc1cc8c16fa2c7c... [3] http://constantin.glez.de/blog/2011/07/zfs-dedupe-or-not-ded...
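
For reference, the back-of-the-envelope behind those figures (assuming 320-byte entries and uniform 128 KiB blocks over 1 TiB of unique data; a real pool's average block size will vary):

  package main

  import "fmt"

  func main() {
    const (
      entryBytes = 320       // approximate size of one DDT entry
      blockBytes = 128 << 10 // 128 KiB recordsize
      poolBytes  = 1 << 40   // 1 TiB of unique data
    )
    blocks := poolBytes / blockBytes    // 8,388,608 blocks
    ddt := float64(blocks) * entryBytes // DDT size in bytes
    fmt.Printf("DDT: %.1f GiB per TiB\n", ddt/(1<<30))
    fmt.Printf("ARC needed at the 75%% metadata cap: %.1f GiB per TiB\n", ddt/0.75/(1<<30))
  }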


tl;dr: I agree with you, but would put it this way: DON'T DO DEDUP AT ALL, since even having all that RAM doesn't save you from real-world problems.

Ultimately, the deduplication table (DDT) is consulted on each write to any dataset or zvol where deduplication is not "off".

The DDT layout on the underlying storage is effectively random. It is roughly an AVL tree with each node corresponding to a block checksum. While the tree is packed into as few underlying blocks as possible, as the DDT grows a larger number of seeks are needed to reach each node. This is important because back-to-back writes even of very similar data will need to consult spatially separated parts of the DDT.

Bear in mind that the DDT is brought into ARC on a demand basis, i.e., all that memory you advise goes unused if there are no writes to a dedup!=off dataset or zvol.

Where you have enough IOPS to service a couple of random reads per write, then you do not need the DDT in memory, so it's good that it is pushed out by hotter zfs metadata. This can be the case for high-4k-random-IOPS SSDs, for instance.

On rotating media -- spinning disks -- you will NEVER have enough IOPS.

[1] In perfectly ordinary circumstances, after a reboot or pool import or a major ARC shrink, the DDT will have to be brought in by future writes, which increases write latency by roughly the track-to-track seek time for every write. The DDT reads are synchronous, affecting the entire TXG sync, so the extra latency affects all writes to the same pool, even to other (dedup=off) datasets and zvols. The extra seeking will also interfere with uncached asynchronous reads from the same pool.

[2] After a crash it is wholly possible that the DDT has to be walked during the import phase, so your import will have to wait on many, many track-to-track seeks before the pool becomes available; on openzfs, this is liable to stall all sync tasks for all pools (i.e., "zpool list" is likely to hang, and "zfs snapshot" on another already-imported pool might too). This is especially likely in the presence of a deferred destroy.

[3] Dataset and snapshot destroys require consulting the DDT for practically every DMU object in the snapshot or dataset; that's basically every file and directory at the POSIX layer, or the inside-the-zvol filesystem's metadata (and possibly many many ordinary blocks because of UNMAP). If your DDT isn't in RAM during this, that's multiple seeks per object. Additionally the DDT and the zfs metadata pointing to it needs to be updated as well, and that is almost never seek-free.

[4] 2 & 3 combine: if you crash or reboot out of frustration while a destroy of a deduped dataset or zvol is in progress, your import will have to wait until the destroy is completed, only this time NONE of the DDT will be in main memory to start with.

[5] resilvers and scrubs always walk the whole DDT first. If you have to replace a leaf vdev and you have a substantial DDT, you will have a bad time.

[6] the times involved to deal with a deferred destroy or cold DDT walk can range from hours to weeks depending on the deduplicated data and the freespace at the time each write was first committed.

Mitigations like persistent L2ARC, or vdev pinning (allocating a top level vdev for DDT metadata, for instance, so you can use a mirrored nvram to hold the DDT for a pool with rotating media holding the file data) have not yet landed in openzfs.

On fast nvram for datasets that are read-mostly and which are highly deduplicatable, dedup=skein can be a win. However, even if seeks are virtually "free", heavy writing to such a dataset is almost certainly going to make one's ARC less effective for caching other zfs metadata and file data, and main RAM is almost always much much more expensive and scarce than solid state leaf vdev hardware. Additionally, the benefits of queueing on the hardware itself tend to evaporate because of the additional random read load (random reads are the enemy of throughput!).

So even where DDT goes from merely survivable (which is never the case on media where random reads take on the order of milliseconds, because one fault will lead you to data and system unavailability) to pretty tolerable (a wide n-way-mirrored SSD-based pool) ZFS deduplication is unlikely to be worth turning on, and it's hard to back out of.

It's better to focus on shared subtree strategies (zfs clone, zfs receive -o origin=snap, checksums that allow nopwrite, etc.) and compression rather than deduplication. And where you can't, it's better to spend the money adding more top level vdevs than it is to spend the time dealing with a situation where deduplication leads to unavailability.


They realized they're mostly storing multiple copies of spam.


TFA says that they just deduped attachments with their homegrown system.

As I think about it more, it would make a lot of sense for a Warez group to use big storage email services with lots of accounts and some bot where you request a file and it mails you all of the compressed/encrypted chunks for you to put together. I've never seen anything like that in the wild, but it seems like it might work.


When I was younger and only had Internet access via a long-distance dial-up, Juno offered free e-mail with toll-free dial-up numbers. I spent many hours downloading files I had requested using "FTP by e-mail", which was sorta what you describe.


You just described Usenet; requests or fills are usually done over IRC.


Usenet was a different system entirely. It was an application layer broadcast mechanism. All of the content pushed to Usenet was replicated all across the world so people could locally access it. Fantastic system for distributing current information to thousands of people worldwide over slow backbone links. Unfortunately it was also fantastic for broadcasting advertising to the entire world for free, which is why Usenet for news is dead.


Spam is very small. Totally eliminating spam will not meaningfully help your storage situation.


Yes, I discovered that recently when I found my spam folder was only 200 MB despite storing tens of thousands of spam messages. It makes sense that spam would be as small as possible for quicker delivery.


Most of the space used by email attachments is photos and office documents.


Does Gmail do something similar too?

This should be a standard feature of mail software. At my organization, whenever a 2 MB file is mailed by management to the entire organization, the mail system crawls for some time.

EDIT: Our mail system is hosted in-house.


Organisations have different concerns; Microsoft Exchange used to deduplicate email attachments internally with SIS (Single Instance Storage). It did it within a mailbox database, not company-wide.

But when there was a single attachment and lots of people wanted to get it, it generated a lot of disk IO looking up where the attachment was stored and going to get it. Storage space is cheap, disk IO is more expensive, and in the Exchange 2007 to 2010 timeframe - seven or eight years ago - they took SIS out of Exchange, so it no longer does this. Now every user reads their own mailbox, all the data is right there, and the disk IO is reduced.

Disk IO went down, and the storage design of Exchange changed to the point where it's recommended to run entirely on cheaper SATA disks, which is a bigger saving than reducing storage size a bit.

https://social.technet.microsoft.com/Forums/exchange/en-US/e...

Mail.ru has different needs from a company-internal mail server; companies can enforce quotas and provide alternative file storage.


Willing to bet that your mail system is configured in a suboptimal way, such that the email gets scanned for viruses/spam once per recipient instead of just once overall.

Internal-to-internal mail shouldn't even need spam filtering. It should probably still have virus scanning on, though, to stop the spread of malware through an org.

Or could be something entirely different.


This only deals with storage of the file.

Your SMTP server still has to deal with every email and every file, and your mail server still has to receive every one of those before it can even start to consider deduping. This is probably what's murdering your email system.

The quickest solution for your problem would be to put the file on a share accessible to all recipients and just link to it from the email text.


Notice that the mail.ru audience is some 25% off the April 2015 peak levels. Not sure what caused the prior spikes in usage; acquisitions, maybe?

https://top.mail.ru/visits?id=250&period=2&date=&gender=0&ag...


Is there an attack one can execute by finding collisions in SHA-1 and delivering a different payload than expected?


If you could do that, you'd likely have much better and more profitable things to use your SHA-1 cracking skills on.


The Secret Service shouldn't protect the vice president, because there's a higher-value target!

This gets repeated so many times that I wonder whether we should even use crypto for anything that costs less than a billion dollars. Imagine that you can steal a billion by figuring out how to generate a SHA-1 collision. You generate one, then steal the money. Then what? You still have a way to generate collisions.

Yes, the possibility of hash collisions is real, and it's generally assumed we'll have SHA-1 collisions "real soon now". Just because there are more profitable things to attack doesn't mean that we shouldn't try protecting other systems. For example, consider MD5 collisions: anyone can generate one for free. If someone invented a way to produce SHA-1 collisions, then after profiting from the most profitable target they would probably move on to the next most profitable target, etc., and eventually we'd learn that collisions exist. What's next? The next step is to protect our systems from collisions. So, we're back to square one.

The simplest way to protect from SHA-1 collisions is, of course, to switch to a better hash function. For example, BLAKE2 - it's faster and more secure. We'll win some time until it's broken (prediction: that won't happen anytime soon).

Another way to protect against collisions is to build your system in such a way that collisions don't matter. For example, they can use HMAC with a secret key.
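
A minimal sketch of such a keyed dedup fingerprint in Go (this assumes the dedup index is private to the operator, and is not what Mail.Ru actually does):

  package main

  import (
    "crypto/hmac"
    "crypto/sha256"
    "encoding/hex"
    "fmt"
  )

  // dedupKey returns the fingerprint used to look up identical attachments.
  // Because the key is secret, an outsider cannot craft two attachments that
  // collide under it, even if the underlying hash were weakened.
  func dedupKey(secret, attachment []byte) string {
    mac := hmac.New(sha256.New, secret)
    mac.Write(attachment)
    return hex.EncodeToString(mac.Sum(nil))
  }

  func main() {
    secret := []byte("server-side secret") // hypothetical key, kept out of client reach
    fmt.Println(dedupKey(secret, []byte("same attachment bytes"))) // same input...
    fmt.Println(dedupKey(secret, []byte("same attachment bytes"))) // ...same fingerprint -> dedup hit
  }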


Are they "reinventing" reference counting and transactions?

Both concepts are in the article, but the words "reference counting" and "transaction" are nowhere to be found.


Is it uncommon for email servers to NOT use de-duplication?


ISTR something about an older version of Exchange that did dedup, but that as soon as someone _opened_ the attachment it was no longer a candidate for dedup.


Yes. Many secure email services handle data in ways that prevent identification of duplicates. If you cannot read the data, you dare not reduce it (with multiple layers of keys, the same document sent to or held by different people cannot be identified as the same).


One of the crappiest services in Russia. You could remove all the propaganda posts and double the free space.


I can understand an initial mail product not having deduplication implemented, but getting to 1 PB, let alone 50 PB, before implementing it seems a little extreme.

Glad they were able to avoid upgrading for a while now. :)


To those who, like me, were confused at first: there is a typo in the title of the article; it should read "from 50PB to 32PB".


I love such articles!



