Ask HN: Distributed File System
28 points by ErrantX on July 10, 2009 | 34 comments
Hey peeps.

So I need some recommendations.

I've been building a distributed file system for work to store our hash tables in. These are 1GB files (about 40TB worth of them) that are write once, read many.

It needs to duplicate the data across servers and make the files available via HTTP. Oh, and it needs to scale quite well because from next month we are potentially adding another TB per month.

So far I haven't been able to find a DFS that does all the above, so I have been working on my own. But I am nervous - the files are mission critical, though I am not too worried about losing stuff per se (there are alternative backup solutions that make sure we have multiple static copies safe). I'm more worried about not being able to cope with the load. My current implementation is in Python and simply uses a central MySQL server to track file locations.
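
For a flavour of how thin the tracker is, here is a rough sketch of the kind of lookup it does (the table, columns and driver choice are illustrative, not our actual code):

    # Rough sketch only: PyMySQL stands in for whatever MySQL driver you
    # prefer, and the schema is made up for illustration.
    import random
    import pymysql

    conn = pymysql.connect(host="tracker-db", user="dfs", password="secret",
                           database="dfs")

    def locate(file_hash):
        """Return HTTP URLs for every server holding a copy of this file."""
        with conn.cursor() as cur:
            cur.execute(
                "SELECT server, path FROM file_locations WHERE file_hash = %s",
                (file_hash,),
            )
            return ["http://%s/%s" % (s, p) for s, p in cur.fetchall()]

    def pick_replica(file_hash):
        """Spread read load by picking a random replica."""
        urls = locate(file_hash)
        if not urls:
            raise KeyError(file_hash)
        return random.choice(urls)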

So: can anyone recommend a DFS option I have missed that fulfills my requirements? Or, even better, can anyone offer technical ideas to help with the development of our code?

:)




GlusterFS is a FUSE-based distributed filesystem. You can mirror the files on as many servers as you like and aggregate storage from different servers. It has no central metadata server either.

It doesn't serve via HTTP directly, but it's easy to point a web server at the filesystem (it has an Apache module to provide direct access without going through FUSE if you want higher performance).

http://www.gluster.org


I highly recommend against GlusterFS. We used to use it at my current employer and had nothing but trouble serving large files from it. It was fine until we started putting large files in it; then healing would randomly bring the whole server to its knees.


MogileFS is what you're looking for. Native http support with a MySQL tracker.


I'm not sure I understand what the problem is. You say that you've got approximately 40000 files; a database which keeps track of where those files are stored is a trivially small problem.


It needs to scale robustly to support at least a million files (with 3 duplicates each, so 4 million entries minimum) and accept really high load (at least 1,000 simultaneous requests, ideally 5,000). The main concern is that MySQL is the single point of failure.


The load you're talking about is insignificant -- even 4M entries is going to fit into RAM, so you should be able to perform lookups at wire speed.

The main concern is that MySQL is the single point of failure.

Don't make it a single point of failure, then. Set up several boxes. Heck, don't use MySQL -- your update rate is going to be low (even if you have 10Gbps of incoming traffic, that's only about one new 1GB file per second), so you could run the file-location service over DNS if you wanted.
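
To make that concrete, here is a hedged sketch of "file location over DNS", assuming each file hash gets a hostname whose A records point at the servers holding a copy (the zone name is made up):

    # Sketch only: assumes the replication step publishes one A record per
    # replica under an internal zone, e.g. <hash>.files.internal -> 10.0.0.x
    import random
    import socket

    def locate_via_dns(file_hash, zone="files.internal"):
        """Resolve a file hash to the IPs of the servers holding a copy."""
        hostname = "%s.%s" % (file_hash, zone)
        _, _, addresses = socket.gethostbyname_ex(hostname)
        return addresses

    def file_url(file_hash):
        """Pick a replica at random; with two or three resolvers there is
        no single point of failure on the lookup path."""
        ip = random.choice(locate_via_dns(file_hash))
        return "http://%s/%s" % (ip, file_hash)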


Exactly what I was thinking. On a box with a decent amount of memory, MySQL's query cache will easily handle 5k/sec or, probably, quite a bit more. From fuzzy memory of a friend's test earlier this year, I think he got about 30k/sec over a socket connection and a little less over TCP; that sounds about right to me. And that was nothing but MySQL, default installation, on an iMac. Just make sure you have plenty of memory; that's always the key.

SPOF concerns could be alleviated with a standard master/slave failover setup. You might have other needs you haven't mentioned, but otherwise it sounds like MySQL can handle your needs for now, and keeping the setup as simple and off-the-shelf as possible is always a good thing.
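
For illustration, the failover read path need not be fancy; something like the sketch below would do (the hostnames and the PyMySQL driver are stand-ins for whatever you actually run):

    # Illustrative only: try the master first, fall back to the slave on
    # connection errors.
    import pymysql

    DB_HOSTS = ["mysql-master", "mysql-slave"]  # placeholder hostnames

    def query_locations(file_hash):
        last_error = None
        for host in DB_HOSTS:
            try:
                conn = pymysql.connect(host=host, user="dfs", password="secret",
                                       database="dfs", connect_timeout=2)
                try:
                    with conn.cursor() as cur:
                        cur.execute(
                            "SELECT server, path FROM file_locations"
                            " WHERE file_hash = %s",
                            (file_hash,),
                        )
                        return cur.fetchall()
                finally:
                    conn.close()
            except pymysql.err.OperationalError as exc:
                last_error = exc  # host down? try the next one
        raise last_error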


Wicked, that makes me feel better about it. I was worried a lot about one server being the focus of so much activity. I know it should handle a lot of traffic, but it's always best to sanity-check thoughts like that (when a crashed server ultimately = lost revenue :P)


Well, don't take anyone's word for it; make sure to test it yourself! But I am fairly confident you'll be reassured by the results of those tests. Just make sure you're not actually testing the language the driver is written in ...

Do share more details of the setup you end up with eventually, though; it's an interesting topic and I, for one, will be curious to see what you come up with.


Yes I plan to.

Hopefully some part of our system will be open sourced :)


Not sure if something like Cassandra would help? http://en.wikipedia.org/wiki/Cassandra_%28database%29

I think there must be open source clones of the FS Google uses, but I don't know the names.


A GFS clone was the angle I was looking at.

Cassandra looks pretty fun - you're suggesting that as the database, right? I'm thinking a quick Python implementation for PUT (maybe DELETE) and meta operations, using Cassandra as a backend and Lighttpd for GET (high performance), might work... cheers.
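
Roughly what I have in mind for the metadata side - purely a sketch, using the DataStax cassandra-driver (which may or may not be what we end up on) and a made-up table:

    # Sketch of the PUT-side metadata write. Assumes a table like:
    #   CREATE TABLE dfs.file_locations (
    #       file_hash text, server text, path text,
    #       PRIMARY KEY (file_hash, server));
    from cassandra.cluster import Cluster

    cluster = Cluster(["cassandra-1", "cassandra-2"])  # placeholder hosts
    session = cluster.connect("dfs")

    def record_replica(file_hash, server, path):
        """After pushing the file to a storage box, note where it lives so
        the GET front end (Lighttpd plus a tiny redirector) can send
        clients to it."""
        session.execute(
            "INSERT INTO file_locations (file_hash, server, path)"
            " VALUES (%s, %s, %s)",
            (file_hash, server, path),
        )

    def replica_urls(file_hash):
        rows = session.execute(
            "SELECT server, path FROM file_locations WHERE file_hash = %s",
            (file_hash,),
        )
        return ["http://%s/%s" % (row.server, row.path) for row in rows]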


If you want a GFS clone, you /really/ want HDFS from the Hadoop project - I'm reading the new O'Reilly book on Hadoop at the moment, which is excellent.


No first-hand experience here, but my understanding is that HDFS and Hadoop in general are geared towards high throughput and not necessarily low latency. You'll likely have to put a caching layer in front of it if this is still [1] the case. There is already some [2] HTTP accessibility for HDFS; I'm again not 100% clear on its status, but if it works well enough, a nice caching proxy (Varnish) with plenty of memory should help with latency on heavily accessed files (a rough read-path sketch follows the links).

[1] http://www.nabble.com/Using-HDFS-to-serve-www-requests-td227...

[2] https://issues.apache.org/jira/browse/HADOOP-5010
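
Purely as an illustration of how thin that read path could be - this assumes a WebHDFS-style OPEN endpoint, which may differ from what [2] actually exposes, and the proxy hostname and file path are made up:

    # Hypothetical read path: client -> Varnish cache -> HDFS HTTP gateway.
    # The endpoint shape follows WebHDFS (?op=OPEN); adjust it to whatever
    # the HTTP interface you deploy actually exposes.
    import urllib.request

    CACHE_HOST = "http://varnish.internal"  # made-up caching proxy in front of HDFS

    def fetch(path):
        url = "%s/webhdfs/v1%s?op=OPEN" % (CACHE_HOST, path)
        with urllib.request.urlopen(url) as resp:
            return resp.read()  # for 1GB files you would stream this instead

    data = fetch("/hashtables/2009-07/part-00042.tbl")  # illustrative path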


That is probably what I meant, too - I read that one of those open-source projects had a clone of GFS, only better (supposedly). Must have been Hadoop, as I haven't looked at many others.


I am surprised that no one has mentioned Amazon S3. You will definitely plunk down a few bucks to store 40TB there, but it's a ready-made solution to your problem.

Edit: OK, now I see that S3 was suggested but vetoed. Unwise decision, in my opinion. If you need data security, encrypt your data.


Bandwidth costs would probably kill us TBH. Ultimately we will have 20,000 clients connecting at least 3 times a week and uploading once a fortnight. It adds up to about 500TB of new data a year.

S3 is great for us atm, but the work to encrypt it is too much because the cost doesn't scale for us long term. Potentially within a couple of years we will be spending half a million on bandwidth and half a million on storage - which will continue upwards :)

We're facing the Google model: commodity hardware on a cheap T1 connection :)


Sounds like a job for the Hadoop HDFS.


Agreed. You need to look into Hadoop HDFS (a GFS clone...).


Hmm, sounds like tahoe may do what you want: http://allmydata.org/trac/tahoe


Tahoe looks very promising, but I've been unable to install it any time I've tried. Do you know any environment where it works out of the box? I tried Fedora 11, Ubuntu 9 and Windows XP myself.


Was this with our most recent release -- Tahoe v1.4.1? Could you tell me more about what went wrong? Perhaps by opening a ticket on http://allmydata.org so that the other people who are responsible for making it easy to install can see the issue.


By the way, I should point out that our overall strategy for dealing with "packaging and installation" issues is to use automation. The creation of packages such as Python .eggs, Debian .debs, and Mac .dmgs is done automatically by our buildbot, and we also have tests run automatically by buildbots to see whether the resulting packages can be installed and run.

I'm sure that it isn't all there yet, and your bug reports would help, but you might be interested in the overall approach -- applying the principles of test driven development (if it isn't automatically tested by your buildbot, then it isn't done) to packaging.


Tahoe does all I want - and I based my implementation around their methodology.

But like you I have also failed to get it to work on anything with any degree of success.


I'm having trouble parsing this.

You mean that you tried to implement something on top of Tahoe, and it didn't work? Or Tahoe didn't work, so you went and wrote your own based on the Tahoe design? Or something else? :)


I suggest you also check out XtreemFS (http://www.xtreemfs.org/). It supports striping/parallel I/O, and your throughput will scale with the number of storage servers you deploy (we ran at GB/s, not Gbit/s, on a single file). It gives you full file system semantics (you mount it via FUSE) and is quite easy to set up.


XtreemFS can also replicate write-once files over the WAN in addition to striping. There is a screencast at http://www.youtube.com/watch?v=0co_-_e0Hq4



Ever thought about getting some super-duper SAN toys?


Teensy bit expensive :)

We have 4,000 clients atm but hope to increase that quickly (20,000 within a year). That's TBs of data a month. SAN was considered, but it's too expensive to scale :( EDIT: well, not over-the-top expensive, but commodity servers with software are cheaper/better for us.

Also, at some point we need an HTTP interface to the outside world.


Your use case doesn't sound like rocket surgery, so I'd say you're on the right track. Write once, read many is trivial to scale. Just build it according to your needs and throw in memcached hosts as needed.
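
Cache-aside is about all there is to it; a rough sketch with pymemcache (the client and key scheme are illustrative, and note it caches the small location records, not the 1GB files themselves):

    # Sketch of a cache-aside lookup: check memcached for the file's
    # locations, fall back to the tracker, then populate the cache.
    import json
    from pymemcache.client.base import Client

    cache = Client(("memcached-1", 11211))  # placeholder host

    def cached_locations(file_hash, lookup_in_db):
        """lookup_in_db(file_hash) -> list of URLs, e.g. a tracker query."""
        key = "loc:%s" % file_hash
        hit = cache.get(key)
        if hit is not None:
            return json.loads(hit)
        urls = lookup_in_db(file_hash)
        # Write-once data, so a long TTL is fine.
        cache.set(key, json.dumps(urls).encode("utf-8"), expire=3600)
        return urls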

If your access pattern allows for good caching, you could also skip all the hassle of maintaining physical spindles and just move your stuff to S3. Their storage fees are quite affordable; 100TB would set you back around $15k a month.

The killer with S3 is in the traffic bill, though. So that's only an option when most of your requests can be served from cache.


S3 would fit our needs perfectly, but the boss vetoed it :( - security issues (I know, I know, but the data is fairly high value). Sucks.


Use encryption, perhaps?


I'm about to look into somewhat similar requirements for an app I'm consulting on, and I plan to research GlusterFS first.




