Gluster is one of those projects that seem underexposed. In the early days (when I was following the project more closely) I got to know some of the core developers quite well, and I've always been impressed with the overall architecture and the set of goals.
If you feel like you're a good C programmer, have a look at the codebase and prepare to have your mind expanded.
It's a very nice piece of software and very strong proof that not all successful development is done in SV (almost all of GlusterFS before the buyout was written in India; not sure what the current situation is).
Red Hat seems to be investing a lot of (needed) effort and resources into it, and it's certainly being actively improved. The RH "product" version of Gluster is Red Hat Storage Server: http://www.redhat.com/products/storage-server/
Most development is still in Bangalore. There's one (hyper-productive) guy in CA, and several of us in MA, plus QE and other roles all over, but the "center of gravity" is still in India.
I use GlusterFS for storage across my LAN. Each machine has its own brick and uses replication for a bit of safety. It works really well when you want to go a level below "file sync" tools like Dropbox and use the network as the file system without having a dedicated NAS - just use your existing computers.
The thing that stopped me from setting this up is the incompatibility between versions. With so many different versions of Linux on my LAN I just cannot ensure that they would all have the same version of Gluster installed, so Gluster is a non-starter for me.
Have you tried more recent versions of GlusterFS? Starting with GlusterFS 3.3, all major versions are compatible. Even the recently released GlusterFS 3.5 is compatible with GlusterFS 3.3.
I tried using Gluster a couple of years ago, but gave up and went with pairs of systems with DRBD and OCFS2 (which I'm incredibly happy with), on Ubuntu. My backing store is RAID6, so it's probably overkill -- I'm going to switch to RAID1 on pairs of 4-6TB drives in the next hardware rev, on Atom C2758 or low-end Xeon systems, with 32GB RAM.
The main failure I was getting involved VMs. I know I could do something with OpenStack, but this is mainly a platform to experiment with, so I wanted to be able to support any kind of VM.
With DRBD you are stuck with 2 nodes, and OCFS2 has given me nothing but headaches, at least in trying to get it to work on CentOS 6. Apparently Oracle is "forcing" people to use their RHEL clone, or at least their kernel, to get OCFS2.
When I had to use DRBD I went with CLVM.
Having said that, starting with v3.4.0 gfapi can be used if your QEMU supports it (as is the case on CentOS 6.4+ and Ubuntu 14.04), thereby bypassing the FUSE layer. This should give you improved performance.
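For anyone curious what "talking gfapi" looks like outside of QEMU, here's a minimal sketch of my own (not from this thread) using libgfapi directly; the volume name "testvol", server "server1", and log path are placeholders:

    /* Minimal libgfapi sketch: write a file on a Gluster volume without
     * going through a FUSE mount.  "testvol" and "server1" are placeholders.
     * Build with something like: cc gfapi-demo.c -o gfapi-demo -lgfapi */
    #include <glusterfs/api/glfs.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        glfs_t *fs = glfs_new("testvol");              /* volume name */
        if (!fs)
            return 1;

        /* Any server in the trusted pool will do; 24007 is the default
         * management port. */
        glfs_set_volfile_server(fs, "tcp", "server1", 24007);
        glfs_set_logging(fs, "/tmp/gfapi.log", 7);

        if (glfs_init(fs) != 0) {                      /* fetch volfile, connect to bricks */
            perror("glfs_init");
            return 1;
        }

        glfs_fd_t *fd = glfs_creat(fs, "/hello.txt", O_WRONLY, 0644);
        if (fd) {
            const char msg[] = "written via gfapi, no FUSE involved\n";
            glfs_write(fd, msg, strlen(msg), 0);
            glfs_close(fd);
        }

        glfs_fini(fs);
        return 0;
    }

This is, roughly, what QEMU's gluster:// block driver does under the hood when it bypasses FUSE.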
OCFS2 is extremely easy to set up and use with Debian (basically just apt-get install ocfs2-console, launch it and let it guide you). On the other hand, Red Hat (and I suppose its clones) makes it easier to use GFS2, while GFS2 is incredibly complex to set up on Debian.
Take a look at ZFS as an alternative to RAID. After doing research on this I am convinced that any data worth anything should be on a filesystem that does checksumming and intelligent data repair.
Yeah, ZFS is in my future; it hasn't been supported enough on my OSes until recently. (I'd consider moving to FreeBSD instead of Linux, if needed, too.)
The iXsystems guys are doing really interesting stuff, too.
Is anyone here using GlusterFS in production? I'd be interested to hear about how people have recovered from critical hardware failure when using it...
We run hundreds of GlusterFS clusters in production on EC2. We're currently on 3.0 and in the process of fully migrating to 3.4 (and maybe 3.5 one day).
Our primary use case for Gluster is in serving persistent filesystems for Drupal. Our customers store potentially millions of files on their GlusterFS clusters.
We've built a number of tools/processes to help protect Gluster against failures in EC2 (for instance, fencing network traffic at the iptables layer to help protect GlusterFS clients from hanging while talking to down nodes), as well as to help our team perform common tasks (resizing clusters, moving customers from cluster to cluster, etc.). We haven't necessarily hit blocker issues recovering from underlying hardware failures, but our team is definitely very experienced with many possible failure modes.
Overall GlusterFS has been very reliable over the years and our research has shown it is the best option out there for when our customers can't use something such as S3 directly.
If you want more details, or would love to hack on an 8,000+ node EC2 cluster running things such as GlusterFS, feel free to ping me.
At 8k+ nodes, bare metal would likely be much cheaper. Why would you possibly want to host this on EC2?!
Disclaimer: I've been at companies with thousands of servers for my past 3 jobs. I've never once used any "cloud" service other than Linode for my personal VPS and Ganeti for the Oregon Open Source Lab VM donated to the GNOME Foundation (I'm a sysadmin alum for gnome.org).
Many of our customers do use S3 and we make use of S3 extensively ourselves. However, Drupal often expects to operate on a POSIX compatible filesystem. Drupal 7 does support PHP file streams which can be configured to use S3, but not every Drupal module follows the best practices. Plus, we support every flavor of Drupal under the sun (including custom code).
All of our enterprise customers receive a highly available setup running on multiple nodes--thus, we have the need for a persistent filesystem attached to multiple EC2 instances. We utilize GlusterFS to ensure all of our clients have the filesystem capabilities their apps may need.
Basically, they need EBS without the single-instance (writable) mount limit, if they're using it to sync Drupal installations (php files, temp folders, etc.) across multiple web servers. S3 isn't really intended for that kind of workload (it would be possible to use something like s3fs, but I don't think that will give you a POSIX-compatible file system which is required for Drupal, and it's probably too slow as well (lots of small files, etc.))
We run it in some 2-node configurations with 4-6 bricks/node (Gluster is configured as distribute+replicate). Replacing disks isn't a big deal, there's a replace-brick command which migrates the data away from a brick to a new one, or for hard fails you can just drop in a new disk, do replace-brick and force a heal to re-replicate data. A node going down and coming back is straightforward, the self-heal-daemon and disk activity get the "missed" changes re-replicated (or force it via heal) -- everything keeps working in the meantime. Native clients ignore "down" nodes without needing intervention (for NFS clients use a shared IP and fail that over to any "up" gluster node).
Performance gets better with every release, but the speed of `ls` (Gluster uses `stat()` to trigger replication/checking) is still woeful at times. 3.5 will hopefully make that a lot better. In general, Gluster is not the right answer if you have lots of small files.
I used a very early version of GlusterFS on a 5 node cluster and never lost any data (about 32T worth, quite a bit at the time). What I liked most about it was that it produced storage nodes that could be inspected with the regular toolset; even if something had gone down (which it never did), it would definitely have been possible to recover by looking at the crashed storage.
Having the 'bricks' use regular filesystems was a very smart decision; it means you don't end up in binary hell when things go south.
Split brain was one of the weaker points back in the day. I am not current with the software, so I have no idea if or how much this has improved; someone with more recent experience would be a better guide there.
Sadly, split brain is still a bit of a sore spot. It probably will be as long as GlusterFS continues to use the "fan out" replication approach, though past and future enhancements around quorum should help too. On the other hand, it's great to get such positive feedback about the "everything is regular files" doctrine. It has certainly been a bit painful sometimes, especially in terms of performance, but the feeling has always been that the ops advantages are worth it. Nice to see someone else agrees.
Well, we used it as the storage backend for VMs a few years back.
For some reason it usually failed under high load, resulting in bad split-brains, extremely slow IO and loss of data. Very bad. We contacted Red Hat about it and were told that Gluster was not VM-ready yet. We ditched Gluster and went with plain old proven iSCSI.
This was a method to replace an existing lockfile; Mercurial uses some similar code. The expected behavior was that /foo/bar/tmp/file would replace /foo/bar/baz/file. Instead, there was a race in Gluster where it got confused about which version of 'file' was correct, and it would end up with a split brain between the two nodes. This would be exacerbated by a node failure, but didn't always require one. Heavy load seemed to make the failure more likely. We couldn't reproduce the same bug moving files within the same folder; it was subfolders in the same Gluster FS that seemed to cause the issue. The frustrating part was how Gluster pointed fingers at the file operation being incompatible, despite advertising a POSIX-compatible filesystem. Apparently the bug is fixed, but we moved off Gluster, never to return.
Also, filesystems are hard, I get it, so no hard feelings :)
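For the curious, the failing operation is basically the classic atomic-replace idiom. Here's a rough sketch of that pattern (paths are illustrative, not the poster's actual code) showing what POSIX promises and what the replicated volume occasionally split-brained on:

    /* Sketch of the lockfile-replacement pattern described above.  On a
     * POSIX filesystem, rename(2) atomically replaces the target, even
     * across directories on the same mount -- readers see either the old
     * file or the new one, never a partial write.  That's the guarantee
     * the Gluster volume was advertising. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int replace_lockfile(void)
    {
        const char *tmp_path = "/foo/bar/tmp/file";    /* staging copy  */
        const char *dst_path = "/foo/bar/baz/file";    /* live lockfile */

        int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;

        const char contents[] = "lock owner: pid 1234\n";
        if (write(fd, contents, strlen(contents)) < 0 ||
            fsync(fd) < 0) {            /* make it durable before renaming */
            close(fd);
            return -1;
        }
        close(fd);

        /* The atomic step -- this is where Gluster would sometimes race
         * and end up with a split brain between the two replicas. */
        return rename(tmp_path, dst_path);
    }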
I've used it in production on EC2. The setup essentially replicated the underlying files to at least two nodes, so if a host failed the volume was still entirely available automatically, and when the dead host recovered it would replicate on read. It was great for mostly-read workloads, but sluggish when the workload involved a lot of writes, due to replication costs.
I used GlusterFS a couple of years ago (2011) for a hosting project; we had about 300 or so medium-large PHP sites on top of a GlusterFS cluster. It ran beautifully, but the pain points where it fell down were small-file performance and its configuration system (I think the latter has been made much nicer in more recent versions).
I had a similar experience trying to build a highly-available wordpress infrastructure a few years back. The terrible latency of gluster completely killed any chance of getting decent web performance out of it. Ended up implementing OCFS2 on DRBD, much better.
I had a client running PHP from Gluster storage, split across two nodes. Performance was horrific; although some of the sites were improved via Varnish, we eventually had to move to local copies of files & rsync. Not pleasant.
Gluster is nice, but you have to accept the fact that it is slow, and upgrades are painful. (We had cases where clients and servers had to be updated at the same time to avoid interoperability issues.)
(1) How do you secure GlusterFS traffic? Does GlusterFS use some kind of encryption on the wire or do you have to manually setup a VPN?
(2) How well does GlusterFS work if you host bricks on web servers? If data are sharded per user, can you make a web server always have fast access to that user data (can you guarantee a copy of the data is always hosted locally on that web server?) -- or is this something that only happens indirectly through the OS cache?
(3) Do you need something more from the underlying filesystem to guarantee data integrity, like running GlusterFS on ZFS? Or is GlusterFS enough?
> (1) How do you secure GlusterFS traffic? Does GlusterFS use some kind of encryption on the wire or do you have to manually setup a VPN?
Although I have not used the feature, there is support for SSL; look it up. I have also heard about people trying it over OpenVPN, though I'm not sure how successful they were.
> (2) How well does GlusterFS work if you host bricks on web servers? If data are sharded per user, can you make a web server always have fast access to that user data (can you guarantee a copy of the data is always hosted locally on that web server?) -- or is this something that only happens indirectly through the OS cache?
FUSE is overridden more and more with the introduction of gfapi. Qemu and Samba can now both "talk" it, so you don't need FUSE for that any more, for example.
The native client still uses it, though, and this is unlikely to change any time soon, AFAIK.
The performance hit is disputable, there aren't really better alternatives. Most distributed filesystems out there are using FUSE too (XtreemFS, MooseFS) and I am not touching Lustre.
There are issues with everything; dismissing it like that is not really fair or constructive.
> FUSE is overridden more and more with the introduction of gfapi. Qemu and Samba can now both "talk" it, so you don't need FUSE for that any more, for example.
It should be possible to avoid the userspace to kernelspace transition in the client by using LD_PRELOAD with a shim library as well.
> The performance hit is disputable, there aren't really better alternatives. Most distributed filesystems out there are using FUSE too (XtreemFS, MooseFS) and I am not touching Lustre.
The Ceph distributed filesystem has an in-kernel client which is upstream. XtreemFS and MooseFS are designed for a different use-case (WAN filesystem) and aren't really directly comparable.
Lustre runs the majority of the world's supercomputers. It seems a little unfair to dismiss them without at least giving a reason why. They have made some progress towards getting their kernel client upstream recently.
I was a Ceph developer for a while. I read about Red Hat's recent acquisition of Inktank with interest. (Inktank was founded by Sage Weil, the creator of Ceph, to commercialize the filesystem.) Since Red Hat previously acquired the main company behind Gluster, this makes things a little-- how shall we say?-- interesting. It's unclear to me whether Red Hat will want to support two distributed filesystems going forward, or whether they will try to streamline things.
> Lustre runs the majority of the world's supercomputers. It seems a little unfair to dismiss them without at least giving a reason why. They have made some progress towards getting their kernel client upstream recently.
Lustre aims at performance and thus everything is in-kernel, which makes operations a nightmare: basically everything you have is an opaque filesystem structure on disk, and a bunch of logs which can be very difficult to read. And you have to run the kernel with Lustre patches, which only apply cleanly to RHEL and SLES; work is being done on getting Debian in, but still no luck with Ubuntu or recent mainstream kernels.
Also, there is no clear upgrade path between different versions: the recommended procedure is to dump everything to another filesystem and then reload after the upgrade. There is no replication handled at the FS level. When running Lustre on top of ext4, you have to use Lustre's own versions of the ext4 utilities, which replace the ext4 utilities in the system...
In short: Lustre sacrificed a lot of ops-friendliness for the sake of high performance. This is quite the opposite of the choice that GlusterFS has made.
"It should be possible to avoid the userspace to kernelspace transition in the client by using LD_PRELOAD with a shim library as well."
We (I'm a GlusterFS developer) actually did this for a while once, and people have recently started talking about doing it again. With GFAPI it shouldn't even be that complicated, but there are still problems e.g. around fork(2).
"The Ceph distributed filesystem has an in-kernel client which is upstream."
...and it's probably not a coincidence that the file system component is the one piece of Ceph that still hasn't reached production readiness. Development velocity counts. The fact that it uses FUSE has rarely been the cause of performance problems in GlusterFS. More often than not, the real culprit is (relative) lack of caching, which is fixable in user space.
"XtreemFS and MooseFS are designed for a different use-case (WAN filesystem)"
Pretty true for XtreemFS - for which I have the utmost respect and which I promote at every opportunity - but MooseFS targets exactly the same use case as GlusterFS. OK, a subset, because they don't have all the features and integrations we do. ;) I'd also add PVFS/OrangeFS, which is contemporaneous with Lustre. It doesn't use FUSE, but has its own user-space "interceptor" which is equivalent.
"Lustre runs the majority of the world's supercomputers."
It runs a lot of the world's biggest supercomputers, because those people can afford to keep full-time Lustre developers on staff to baby-sit it. Not ops staff - developers to apply the latest patches, add their own, etc. I spent over two years at SiCortex trying to make Lustre usable for our customers. At that time, I believe LLNL had four Lustre developers; ORNL had slightly fewer. Cray, DDN, etc. each had their own as well. When it works, Lustre can be great. On the other hand, few users can afford to devote that level of staff to running a distributed file system. Those that can't will find themselves in the weeds with MDS meltdowns and "@@@ processing error" messages all the time.
Because of this, I'd say it's Lustre that's not really targeting the same market or use case as GlusterFS. We never encounter them head to head, in either the corporate or community context. The "performance at any cost" market is a hard place to make a living, and it barely overlaps at all with the "performance plus usability" market.
"It's unclear to me whether Red Hat will want to support two distributed filesystems going forward"
Why not? They've supported multiple local file systems for a long time, and there's an even bigger overlap there. When you look at both Ceph and GlusterFS in terms of distributed data systems rather than just file systems specifically, maybe things will look a bit different. Now we're talking about block and object as well as files. Maybe we're talking about integrating distributed storage with distributed computation in ways not covered by any of those interfaces. We're certainly talking about users having their own preferences in each of these areas. If there are enough APIs, and enough users who prefer one over the other for a certain kind of data or vertical market, then it makes quite a bit of sense to continue maintaining two separate systems. On the other hand, of course we want to reduce the number of components we'll have to maintain, and there will be plenty of technology sharing back and forth. Only time will tell which way the balance shifts.
I'm curious what your experience was with the LD_PRELOAD thing. By "problems with fork," you mean the possibility of forking without that environment variable set via execvpe or similar?
> MooseFS targets exactly the same use case as GlusterFS.
Yeah, actually you are quite right. I was getting MooseFS confused with a different filesystem. MooseFS looks like it has a GFS heritage. Kind of like what I am working on now... HDFS.
> ...and it's probably not a coincidence that the file system component is the one piece of Ceph that still hasn't reached production readiness.
Ceph's filesystem component is reasonably stable. It's the multi-MDS use case that still had problems (at least a few years ago when I was working on the project). The challenge was coordinating multiple metadata servers to do dynamic subtree partitioning and other distributed algorithms. Ceph has a FUSE client which you can use if you don't want to install kernel modules, and a library API.
It seems that Lustre has an in-kernel server. This might have led to some of the maintenance difficulties people seem to have with it. (I never worked on Lustre myself.) I don't think this discredits the idea of in-kernel clients, which are different beasts entirely. Especially when the client is in the mainline kernel, rather than an out-of-tree patch like with Lustre in the past.
It's a tough market out there for clustered filesystems. HPC is shrinking due to government sequesters and budget freezes. The growth area seems to be Hadoop, but most users prefer to just run HDFS, since it has a proven track record and is integrated well into the rest of the project. Moving into the object store world is one option, but that is also a very crowded space. It will be interesting to see how things unfold.
> By "problems with fork," you mean the possibility of
> forking without that environment variable set via exevpe
> or similar?
That might be a problem, but it's not the one I was thinking of. The problem with LD_PRELOAD hacks is that they need to maintain some state about the special fds on the special file system. That state immediately becomes stale when you fork (because copy on write) and gets blown away when you exec. Therefore you always end up having to store the state in shared memory (nowadays probably an mmap'ed file) with extra grot to manage that. Even then, exec and dlopen won't work on things that aren't real files. So it's not impossible, but it gets pretty tedious (especially when you have to re-do all this for the 50+ calls you need to intercept) and there are always some awkward limitations.
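To make that concrete, here's a toy version of the interception half, written by me for illustration (the /mnt/special prefix and the fd table are made up; as noted above, a real shim would keep that table in shared memory so it survives fork(), and would intercept 50-odd other calls):

    /* Toy LD_PRELOAD shim: intercept open(2) and note which fds belong to
     * the "special" filesystem so later intercepted calls could redirect
     * them.  Prefix and table are illustrative only.  Build with e.g.:
     *   cc -shared -fPIC shim.c -o shim.so -ldl
     * and run a program with LD_PRELOAD=./shim.so */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <fcntl.h>
    #include <stdarg.h>
    #include <string.h>

    #define MAX_FDS 1024
    static int is_ours[MAX_FDS];   /* process-local: goes stale across fork() */

    int open(const char *path, int flags, ...)
    {
        /* Look up the real libc open() the first time we're called. */
        static int (*real_open)(const char *, int, ...);
        if (!real_open)
            real_open = (int (*)(const char *, int, ...)) dlsym(RTLD_NEXT, "open");

        mode_t mode = 0;
        if (flags & O_CREAT) {
            va_list ap;
            va_start(ap, flags);
            mode = va_arg(ap, mode_t);
            va_end(ap);
        }

        int fd = real_open(path, flags, mode);

        /* Remember fds on "our" filesystem so the read/write/close/etc.
         * wrappers (not shown) know to redirect them. */
        if (fd >= 0 && fd < MAX_FDS &&
            strncmp(path, "/mnt/special/", 13) == 0)
            is_ours[fd] = 1;

        return fd;
    }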
With regard to in-kernel clients, I'm not trying to discredit them, but they're not the only viable alternative. User-space clients have their place too, as I'm sure you know if you work on HDFS. Every criticism of FUSE clients goes double for JVM clients, but both can actually work and perform well. It seemed like some people were dissing GlusterFS because of something they'd heard from someone else, who'd heard something from someone else, based on versions of FUSE before 2.8 (when reverse invalidations and a lot of performance improvements went in). This being HN, it seemed like they were just repeating FUD uncritically, so I sought to correct it. The fact that GlusterFS uses FUSE is simply not a big issue affecting its performance.