Ceph: open source petabyte scale distributed storage (newdream.net)
26 points by marcua on Jan 22, 2010 | 6 comments



How is this different from the Hadoop Distributed File System (HDFS)?


Whereas HDFS is meant to be used programmatically or through a shell, it appears (though the documentation is sparse) that Ceph was designed to be mountable like most traditional Unix filesystems. There's the MountableHDFS [1] project for Hadoop, so the two could end up being equivalent interface-wise. At that point it all comes down to how they implement create/append/delete/seek/replication semantics, which Hadoop documents far more thoroughly than Ceph does.
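To make the interface difference concrete, here's a rough sketch (the mount point and paths are made up, assuming Ceph is mounted at /mnt/ceph like any other filesystem):

    import os

    # With Ceph mounted (or HDFS via FUSE), ordinary POSIX file I/O just works:
    with open("/mnt/ceph/logs/app.log", "a") as f:
        f.write("another line\n")

    # With plain HDFS you'd go through its shell or client API instead, e.g.
    #   hadoop fs -put local.log /logs/app.log
    # rather than the OS's own open()/write()/close().
    print(os.path.getsize("/mnt/ceph/logs/app.log"))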

The Ceph docs also imply that they have designed it so that it's easy to snapshot directories; I'm not sure whether HDFS has facilities for this.
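If I'm reading the Ceph docs right, a directory snapshot is just a mkdir inside a hidden .snap directory, so from a mounted client it would look something like this (paths made up):

    import os

    src = "/mnt/ceph/projects/foo"   # hypothetical directory on a Ceph mount

    # Freeze the directory as it exists right now.
    os.mkdir(os.path.join(src, ".snap", "before-upgrade"))

    # The snapshot is then browsable like any other (read-only) directory.
    print(os.listdir(os.path.join(src, ".snap", "before-upgrade")))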

[1] http://wiki.apache.org/hadoop/MountableHDFS


From http://ceph.newdream.net/roadmap/

"We hope to have the system usable (for non-critical applications) by the end of 2009 in a single-mds configuration. Even then, however, we would not recommend going without backups for any important data, or deploying in situations where availability is critical."

It looks interesting, but not ready for prime time. Anyone using this in the wild?


From the latest Dreamhost newsletter:

In fact, Ceph is practically what you'd call "stable" at this point (which is not to say it's "production-ready"!), and we've actually begun testing it as a backup/replacement for our poor backup.dreamhost.com server (who's been having a terrible time for months)!

If YOU would like to give Ceph a try, please, download away... it's free!

Also, we're going to be setting up a "playground" test-bed where anybody can try out a Ceph installation we set up and maintain in our data center. If you're interested, just email beta@ceph.newdream.net, and we'll send you an invitation when it's ready!

[edit: added the bit about beta testing]


How is this different from Lustre (lustre.org)?


We are using Lustre on a project. The differences that jump out at me from http://ceph.newdream.net/about/ are:

1) auto-balancing when new storage nodes are added

2) copies of objects stored across multiple storage nodes

Running HA Lustre requires a LOT of work.

1) Lustre requires tricks like copying files to rebalance

2) Lustre only keeps one copy of any section, so each OST is a SPOF. Sun recommends deploying OSTs in pairs with DRBD and Heartbeat (cutting your space in half and complicating the deployment), but at least then a box failing won't break your FS. That still doesn't help if a network partition occurs, since chunks remain local to a single rack.

For more information about how complicated Lustre is in a production environment, check out the talks by Sun employees at the last user group meeting (Lessons Learned & Best Practices; Managing High Availability on a Shine-Equipped Lustre Cluster):

    http://wiki.lustre.org/index.php/Lustre_User_Group_2009
 
I've not tested Ceph in production, but many of the people I talk to about our Lustre woes recommend Ethan Miller's work (http://users.soe.ucsc.edu/~elm/), which leads to Ceph ;)

Even if you aren't worried about availability, Lustre has issues in day-to-day use:

Lustre requires older kernels, which means compatibility with modern non-RHEL distros becomes a headache when you need the latest version of KVM (or Python).

Lustre also tries to be a POSIX filesystem, but it breaks horribly when you use features like O_APPEND from multiple nodes (file corruption!). We are still tracking down a case where Lustre breaks when we read files in a certain order (rsync is OK, but skipping around within a file while reading can leave the kernel unresponsive).
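For the curious, the append pattern that bit us is nothing exotic; each node does roughly this against the shared mount (sketch, with a made-up path):

    import os

    # Per POSIX, each O_APPEND write is positioned at the current end of file,
    # so concurrent appenders shouldn't clobber each other. On Lustre we saw
    # interleaved/garbled records instead.
    fd = os.open("/mnt/lustre/shared/job.log",
                 os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    os.write(fd, b"node-42: step finished\n")
    os.close(fd)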

After months of dealing with Lustre's issues trying to support POSIX, I have come to love the non-POSIX (S3-like API) approach to distributed filesystems.
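By "S3-like" I mean whole-object put/get with no seek, no append, and no cross-node locking to go wrong. A made-up sketch of the kind of interface I have in mind (ObjectStore and its methods are hypothetical, not any particular library):

    class ObjectStore:
        """Hypothetical S3-like client: whole objects in, whole objects out."""

        def __init__(self):
            self._data = {}  # stand-in for the remote store

        def put(self, bucket, key, payload):
            # Overwrite the whole object; nothing to seek into or append to.
            self._data[(bucket, key)] = payload

        def get(self, bucket, key):
            return self._data[(bucket, key)]

    store = ObjectStore()
    store.put("results", "run-0001/output.csv", b"a,b,c\n1,2,3\n")
    print(store.get("results", "run-0001/output.csv"))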



