Ask HN: Distributed File System
28 points by ErrantX on July 10, 2009 | 34 comments
Hey peeps.

So I need some recommendations.

I've been building a distributed file system for work to store our hash tables in. These are 1GB files (about 40TB worth of them) that are write once, read many.

It needs to duplicate the data across servers and make the files available via HTTP. Oh, and it needs to scale quite well because from next month we are potentially adding another TB per month.

So far I haven't been able to find a DFS that does all the above, so I have been working on my own. But I am nervous - the files are mission critical, though I am not too worried about losing stuff per se (there are alternative backup solutions that make sure we have multiple static copies safe). I'm more worried about not being able to cope with the load. My current implementation is in Python and simply uses a central MySQL server to track file locations.
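
For a flavour of how thin the tracker is, here is a rough sketch of the kind of lookup it does (the table, columns and driver choice are illustrative, not our actual code):

    # Rough sketch only: PyMySQL stands in for whatever MySQL driver you
    # prefer, and the schema is made up for illustration.
    import random
    import pymysql

    conn = pymysql.connect(host="tracker-db", user="dfs", password="secret",
                           database="dfs")

    def locate(file_hash):
        """Return HTTP URLs for every server holding a copy of this file."""
        with conn.cursor() as cur:
            cur.execute(
                "SELECT server, path FROM file_locations WHERE file_hash = %s",
                (file_hash,),
            )
            return ["http://%s/%s" % (s, p) for s, p in cur.fetchall()]

    def pick_replica(file_hash):
        """Spread read load by picking a random replica."""
        urls = locate(file_hash)
        if not urls:
            raise KeyError(file_hash)
        return random.choice(urls)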

So: can anyone recommend a DFS option I have missed that fulfills my requirements? Or, even better, can anyone offer technical ideas to help with the development of our code?

:)




GlusterFS is a FUSE-based distributed filesystem. You can mirror the files on as many servers as you like and aggregate storage from different servers. It has no central metadata server either.

It doesn't serve via HTTP directly, but it's easy to point a web server at the filesystem (it has an Apache module to provide direct access without going through FUSE if you want higher performance).

http://www.gluster.org


I highly recommend against GlusterFS. We used to use it at my current employer and had nothing but trouble serving large files from it. It was fine until we started putting large files in it; then healing would randomly bring the whole server to its knees.


MogileFS is what you're looking for. Native http support with a MySQL tracker.


I'm not sure I understand what the problem is. You say that you've got approximately 40000 files; a database which keeps track of where those files are stored is a trivially small problem.


It needs to scale robustly to support at least a million files (with 3 duplicates each, so 4 million entries minimum) and accept really high load (at least 1,000 simultaneous requests, ideally 5,000). The main concern is that MySQL is the single point of failure.


The load you're talking about is insignificant -- even 4M entries is going to fit into RAM, so you should be able to perform lookups at wire speed.

The main concern is that MySQL is the single point of failure.

Don't make it a single point of failure, then. Set up several boxes. Heck, don't use MySQL -- your update rate is going to be low (even if you have 10Gbps of incoming traffic, that's only about one new 1GB file per second), so you could run the file-location service over DNS if you wanted.
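
To make that concrete, here is a hedged sketch of "file location over DNS", assuming each file hash gets a hostname whose A records point at the servers holding a copy (the zone name is made up):

    # Sketch only: assumes the replication step publishes one A record per
    # replica under an internal zone, e.g. <hash>.files.internal -> 10.0.0.x
    import random
    import socket

    def locate_via_dns(file_hash, zone="files.internal"):
        """Resolve a file hash to the IPs of the servers holding a copy."""
        hostname = "%s.%s" % (file_hash, zone)
        _, _, addresses = socket.gethostbyname_ex(hostname)
        return addresses

    def file_url(file_hash):
        """Pick a replica at random; with two or three resolvers there is
        no single point of failure on the lookup path."""
        ip = random.choice(locate_via_dns(file_hash))
        return "http://%s/%s" % (ip, file_hash)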


Exactly what I was thinking. On a box with a decent amount of memory, MySQL's query cache will easily handle 5k/sec or, probably, quite a bit more. From fuzzy memory of a friend's test earlier this year, I think he got about 30k/sec over a socket connection and a little less over TCP; that sounds about right to me. And that was nothing but MySQL, default installation, on an iMac. Just make sure you have plenty of memory; that's always the key.

SPOF concerns could be alleviated with a standard master/slave failover setup. You might have other needs you haven't mentioned, but otherwise it sounds like MySQL can handle your needs for now, and keeping the setup as simple and off-the-shelf as possible is always a good thing.
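
For illustration, the failover read path need not be fancy; something like the sketch below would do (the hostnames and the PyMySQL driver are stand-ins for whatever you actually run):

    # Illustrative only: try the master first, fall back to the slave on
    # connection errors.
    import pymysql

    DB_HOSTS = ["mysql-master", "mysql-slave"]  # placeholder hostnames

    def query_locations(file_hash):
        last_error = None
        for host in DB_HOSTS:
            try:
                conn = pymysql.connect(host=host, user="dfs", password="secret",
                                       database="dfs", connect_timeout=2)
                try:
                    with conn.cursor() as cur:
                        cur.execute(
                            "SELECT server, path FROM file_locations"
                            " WHERE file_hash = %s",
                            (file_hash,),
                        )
                        return cur.fetchall()
                finally:
                    conn.close()
            except pymysql.err.OperationalError as exc:
                last_error = exc  # host down? try the next one
        raise last_error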


Wicked, that makes me feel better about it. I was worried a lot about one server being the focus of so much activity. I know it should handle a lot of traffic, but it's always best to sanity-check thoughts like that (when a crashed server ultimately = lost revenue :P)


Well, don't take anyone's word for it; make sure to test it yourself! But I am fairly confident you'll be reassured by the results of those tests. Just make sure you're not actually testing the language the driver is written in ...

Do share more details of the setup you end up with eventually, though; it's an interesting topic and I, for one, will be curious to see what you come up with.


Yes I plan to.

Hopefully some part of our system will be open sourced :)


Not sure if something like Cassandra would help? http://en.wikipedia.org/wiki/Cassandra_%28database%29

I think there must be open source clones of the FS Google uses, but I don't know the names.


A GFS clone was the angle I was looking at.

Cassandra looks pretty fun - you're suggesting that as the database, right? I'm thinking a quick Python implementation for PUT (maybe DELETE) and meta operations, using Cassandra as a backend and Lighttpd for GET (high performance), might work... cheers.
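
Roughly what I have in mind for the metadata side - purely a sketch, using the DataStax cassandra-driver (which may or may not be what we end up on) and a made-up table:

    # Sketch of the PUT-side metadata write. Assumes a table like:
    #   CREATE TABLE dfs.file_locations (
    #       file_hash text, server text, path text,
    #       PRIMARY KEY (file_hash, server));
    from cassandra.cluster import Cluster

    cluster = Cluster(["cassandra-1", "cassandra-2"])  # placeholder hosts
    session = cluster.connect("dfs")

    def record_replica(file_hash, server, path):
        """After pushing the file to a storage box, note where it lives so
        the GET front end (Lighttpd plus a tiny redirector) can send
        clients to it."""
        session.execute(
            "INSERT INTO file_locations (file_hash, server, path)"
            " VALUES (%s, %s, %s)",
            (file_hash, server, path),
        )

    def replica_urls(file_hash):
        rows = session.execute(
            "SELECT server, path FROM file_locations WHERE file_hash = %s",
            (file_hash,),
        )
        return ["http://%s/%s" % (row.server, row.path) for row in rows]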


If you want a GFS clone, you /really/ want HDFS from the Hadoop project - I'm reading the new O'Reilly book on Hadoop at the moment, which is excellent.


No first-hand experience here, but my understanding is that HDFS and Hadoop in general are geared towards high throughput and not necessarily low latency. You'll likely have to put a caching layer in front of it if this is still [1] the case. There is already some [2] HTTP accessibility for HDFS; I'm again not 100% clear on its status, but if it works well enough, a nice caching proxy (Varnish) with plenty of memory should help with latency on heavily accessed files (a rough read-path sketch follows the links).

[1] http://www.nabble.com/Using-HDFS-to-serve-www-requests-td227...

[2] https://issues.apache.org/jira/browse/HADOOP-5010
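
Purely as an illustration of how thin that read path could be - this assumes a WebHDFS-style OPEN endpoint, which may differ from what [2] actually exposes, and the proxy hostname and file path are made up:

    # Hypothetical read path: client -> Varnish cache -> HDFS HTTP gateway.
    # The endpoint shape follows WebHDFS (?op=OPEN); adjust it to whatever
    # the HTTP interface you deploy actually exposes.
    import urllib.request

    CACHE_HOST = "http://varnish.internal"  # made-up caching proxy in front of HDFS

    def fetch(path):
        url = "%s/webhdfs/v1%s?op=OPEN" % (CACHE_HOST, path)
        with urllib.request.urlopen(url) as resp:
            return resp.read()  # for 1GB files you would stream this instead

    data = fetch("/hashtables/2009-07/part-00042.tbl")  # illustrative path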


That is probably what I meant, too - I read that one of those open-source projects had a clone of GFS, only better (supposedly). Must have been Hadoop, as I haven't looked at many others.


I am surprised that no one has mentioned Amazon S3. You will definitely plunk down a few bucks to store 40TB there, but it's a ready-made solution to your problem.

Edit: OK, now I see that S3 was suggested but vetoed. Unwise decision, in my opinion. If you need data security, encrypt your data.


Bandwidth costs would probably kill us TBH. Ultimately we will have 20,000 clients connecting at least 3 times a week and uploading once a fortnight. It adds up to about 500TB of new data a year.

S3 is great for us atm, but the work to encrypt it is too much because the cost doesn't scale for us long term. Potentially within a couple of years we will be spending half a million on bandwidth and half a million on storage - which will continue upwards :)

We're facing the Google model: commodity hardware on a cheap T1 connection :)


Sounds like a job for the Hadoop HDFS.


Agreed. You need to look into Hadoop HDFS (a GFS clone...).


Hmm, sounds like tahoe may do what you want: http://allmydata.org/trac/tahoe


Tahoe looks very promising, but I've been unable to install it any time I've tried. Do you know any environment where it works out of the box? I tried Fedora 11, Ubuntu 9 and Windows XP myself.


Was this with our most recent release -- Tahoe v1.4.1? Could you tell me more about what went wrong? Perhaps by opening a ticket on http://allmydata.org so that the other people who are responsible for making it easy to install can see the issue.


By the way, I should point out that our overall strategy for dealing with "packaging and installation" issues is to use automation. The creation of packages such as Python .eggs, Debian .debs, and Mac .dmgs is done automatically by our buildbot, and we also have tests run automatically by buildbots to see whether the resulting packages can be installed and run.

I'm sure that it isn't all there yet, and your bug reports would help, but you might be interested in the overall approach -- applying the principles of test driven development (if it isn't automatically tested by your buildbot, then it isn't done) to packaging.


Tahoe does all I want - and I based my implementation around their methodology.

But like you I have also failed to get it to work on anything with any degree of success.


I'm having trouble parsing this.

You mean that you tried to implement something on top of Tahoe, and it didn't work? Or Tahoe didn't work, so you went and wrote your own based on the Tahoe design? Or something else? :)


I suggest you also check out XtreemFS (http://www.xtreemfs.org/). It supports striping/parallel I/O, and your throughput will scale with the number of storage servers you deploy (we ran at GB/s, not Gbit/s, on a single file). It gives you full file system semantics (you mount it via FUSE) and is quite easy to set up.


XtreemFS can also replicate write-once files over the WAN in addition to striping. There is a screencast at http://www.youtube.com/watch?v=0co_-_e0Hq4



Ever thought about getting some super-duper SAN toys?


Teensy bit expensive :)

We have 4,000 clients atm but hope to increase that quickly (20,000 within a year). That's TBs of data a month. SAN was considered, but it's too expensive to scale :( EDIT: well, not over-the-top expensive, but commodity servers with software are cheaper/better for us.

Also, at some point we need an HTTP interface to the outside world.


Your use case doesn't sound like rocket surgery, so I'd say you're on the right track. Write once, read many is trivial to scale. Just build it according to your needs and throw in memcached hosts as needed.
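
Cache-aside is about all there is to it; a rough sketch with pymemcache (the client and key scheme are illustrative, and note it caches the small location records, not the 1GB files themselves):

    # Sketch of a cache-aside lookup: check memcached for the file's
    # locations, fall back to the tracker, then populate the cache.
    import json
    from pymemcache.client.base import Client

    cache = Client(("memcached-1", 11211))  # placeholder host

    def cached_locations(file_hash, lookup_in_db):
        """lookup_in_db(file_hash) -> list of URLs, e.g. a tracker query."""
        key = "loc:%s" % file_hash
        hit = cache.get(key)
        if hit is not None:
            return json.loads(hit)
        urls = lookup_in_db(file_hash)
        # Write-once data, so a long TTL is fine.
        cache.set(key, json.dumps(urls).encode("utf-8"), expire=3600)
        return urls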

If your access pattern allows for good caching, you could also skip all the hassle of maintaining physical spindles and just move your stuff to S3. Their storage fees are quite affordable; 100TB would set you back around $15k a month.

The killer with S3 is in the traffic bill, though. So that's only an option when most of your requests can be served from cache.


S3 would fit our needs perfectly, but the boss vetoed it :( - security issues (I know, I know, but the data is fairly high value). Sucks.


Use encryption, perhaps?


I'm about to look into somewhat similar requirements for an app I'm consulting on, and I plan to research GlusterFS first.




