Open source cloud file system. Posix, HDFS and S3 compatible (juicefs.com)
197 points by wiradikusuma on Feb 8, 2023 | 77 comments



How does it compare to other players in the area? E.g. Ceph, Gluster or Seaweed? (I'm no expert myself, I've only used those as a consumer of already-set-up systems.)

EDIT: There is a whole comparison section in the docs that I missed: https://juicefs.com/docs/community/comparison/juicefs_vs_cep...


A bit of a weird comparison. Sure, CephFS doesn't support S3-like access... but that's because the object store is a separate service that also runs on top of the Ceph/RADOS store.


It is weird, but it's also a valid use case. I can imagine someone wanting to pull files from a FS that was populated as a regular POSIX filesystem through an S3 api. I'm not sure if you can access the CephFS files from the underlying Ceph store easily.


You could run MinIO just like they do.


It's doable to run a MinIO gateway on top of a CephFS mount point, but that has performance issues, especially for multipart upload and copy. That's why we put MinIO and the JuiceFS client together and use some internal APIs to do zero-copy uploads.


It seems to be similar to a specific part of SeaweedFS, i.e. the "filer+client" components. They use a database or key-value store for metadata and a blob store for data, and expose that as a filesystem.

The difference is that SeaweedFS has its own blob store ("volume server") while JuiceFS uses S3 (or some other protocols). SeaweedFS also decouples the server ("filer server") from the client ("mount" command or client libraries) while JuiceFS only has a single process, so the machine where you mount the filesystem talks to the metadata and data backends directly; this means you can't mount a filesystem on an untrusted machine if I understand correctly (you need full R+W access to the backends from the machine where you mount).

You can see it as similar to `rclone mount`, which allows you to mount a remote S3 bucket locally. The difference is that JuiceFS is much faster and filesystem-like, by storing the metadata in a separate faster database and by chunking your files in the backend rather than storing files unchanged in the bucket.
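A toy sketch of the general idea (purely illustrative; the chunk size, key scheme and names here are made up, not JuiceFS's actual on-disk format): the client splits file data into fixed-size chunks stored as opaque objects, while a small record in the metadata engine points at them, so operations like rename only touch metadata.

    import hashlib
    CHUNK_SIZE = 4 * 1024 * 1024  # illustrative block size
    def write_file(bucket: dict, meta: dict, name: str, data: bytes) -> None:
        """Chunk `data` into the blob store and record the layout in `meta`."""
        chunk_keys = []
        for off in range(0, len(data), CHUNK_SIZE):
            chunk = data[off:off + CHUNK_SIZE]
            key = "chunks/" + hashlib.sha256(chunk).hexdigest()  # made-up key scheme
            bucket[key] = chunk            # stands in for an object-store PUT
            chunk_keys.append(key)
        # The metadata engine (Redis/MySQL/etc. in JuiceFS) holds the name -> chunks mapping.
        meta[name] = {"size": len(data), "chunks": chunk_keys}
    bucket, meta = {}, {}
    write_file(bucket, meta, "/videos/cat.mp4", b"x" * 10_000_000)
    print(meta["/videos/cat.mp4"]["size"], len(meta["/videos/cat.mp4"]["chunks"]))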


What I really want is a filesystem I can span across geographically remote nodes that's transparently compatible. I should just be able to chuck files into it from my NAS like any other. I think Mayastor [1] might get some of the way there?

[1] https://github.com/openebs/mayastor


It compares pretty well to the proprietary WAFL/FabricPool technology by NetApp:

- have "hot" blocks on local storage (SSD cache) - "cold" blocks are stored on S3 - POSIX semantics on top

Having worked with NetApp technology for >10 years this is a welcome addition on the Open-Source side of things.


The comparisons are helpful, but I'm curious why Ozone wasn't included, since that seems like the most directly comparable alternative.


Apache Ozone is not POSIX compatible, even with the File System Optimized format [1].

https://ozone.apache.org/docs/current/feature/prefixfso.html


That's a fair point. Thanks!


>JuiceFS has introduced S3 gateway since v0.11. The feature is implemented based on the MinIO S3 Gateway.

The MinIO S3 Gateway was deprecated: https://blog.min.io/deprecation-of-the-minio-gateway/

I don't know if JuiceFS is treating it like a fork that they are maintaining, or if they have other plans.


We have a fork of MinIO at https://github.com/juicedata/minio, which will be maintained by us.


Damn, I was hoping it was from before the AGPL cut-over, especially in light of JuiceFS's Apache 2 license.


Yes, JuiceFS uses the Apache 2 fork [1] directly (master branch), but we also provide a full-featured S3 gateway (gateway branch) under AGPL as another choice.

[1] https://github.com/juicedata/minio/tree/master


You may want to specify reasons for wanting a “free as in beer” license, otherwise it just sounds greedy. (Not a swipe at you; I’ve fallen into this trap more than once.)


We have to do better to support open source with money. GitHub likes don’t pay the bills. AGPL is fine.


Adopted SeaweedFS a few months back. Never looked back since. It's fast even on HDDs.

https://github.com/seaweedfs/seaweedfs#introduction


I love this project and I'd love to switch to it. Hopefully constructive feedback: the big issue I always run into with this stuff is, what am I supposed to do if something goes wrong? The project maintainers would be wise to document procedures for when certain things go wrong and how you should deal with them, such as a server or two failing, or some unexpected corruption. Without that, "distributed storage" systems feel incomplete to me. Storage is usually "mission critical", and they had a procedure for every single thing that could go wrong on the Apollo missions.


By default, it starts up exposing the file server on all interfaces, not just localhost, AKA insecure by default.

Such a poor security choice, makes me question the entire project.


The project is still growing. That hasn't been the case for many months now.

Also, you're welcome to make a PR.


This is very good news, thank you!


Can you share your experience? What were the alternatives? Did you consider it against AWS S3?


Seaweed is excellent, use it in a bunch of places both big and small!


That looks awesome! What else did you try before going to SeaweedFS?


AWS EFS, MooseFS & Ceph.


Ah I've been quite happy with LizardFS (which is a fork of MooseFS) and I found Ceph to be a bit of a letdown (too complex to manage). Well, time to try Seaweed then :)


> 99.99999999% (10 9s) reliability SLA

Can someone tell me in practical terms what that means?

1 second of unreliability every 317 years?


99.99999999% reliability means you will not lose more than one byte in every 10 GB in a year.

JuiceFS uses S3 as the underlying data storage, so S3 provides this durability SLA.
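A quick back-of-the-envelope check of that reading (assuming the ten 9s are interpreted as an expected annual loss fraction of 1e-10, which is my interpretation rather than an official definition):

    loss_fraction_per_year = 1 - 0.9999999999   # ten 9s -> 1e-10
    for label, size_bytes in (("10 GB", 10 * 10**9), ("1 TB", 10**12)):
        expected_loss = size_bytes * loss_fraction_per_year
        print(f"{label}: ~{expected_loss:.0f} byte(s) expected lost per year")
    # 10 GB: ~1 byte(s) expected lost per year
    # 1 TB: ~100 byte(s) expected lost per year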


Important to note that S3 does not have any Durability SLA. We promise Durability and take it extremely seriously, but there is no SLA. Much more of an SLO


Also, “durability” is not a property you can delegate to another service. Plenty of corruption is caused in-transit, not just at rest.

If your system handles the data in any way, you must compute and validate checksums.

If you do not have end to end checksums for the data, you do not get to claim your service adopts S3’s Durability guarantees.

S3 has that many 9s because your data is checksummed by the SDK. Every service that touches that data in any way recomputes and validates that (or a bracketed) checksum. Soup to nuts. All the way to when the data gets read out again.
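As a concrete illustration of the client side of that (a minimal sketch using hashlib and boto3; the bucket and key names are made up, and reading the whole object back is only for demonstration; the real SDKs send checksum headers so the service can validate on ingest):

    import hashlib
    import boto3
    s3 = boto3.client("s3")
    def put_verified(bucket: str, key: str, data: bytes) -> str:
        # Compute the digest before the bytes leave the client...
        digest = hashlib.sha256(data).hexdigest()
        s3.put_object(Bucket=bucket, Key=key, Body=data)
        # ...and validate it again after reading the object back.
        echoed = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        if hashlib.sha256(echoed).hexdigest() != digest:
            raise IOError(f"checksum mismatch for s3://{bucket}/{key}")
        return digest
    put_verified("my-example-bucket", "backups/db.dump", b"example payload")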

And there is a lot more to Durability than data corruption. Protections against accidental deletions, mutations, or other data loss events come into play too. How good is your durability SLO when you accidentally overwrite one customer’s data with another’s?

Check out some of the talks S3 has given on what Durability actually means, then maybe investigate how durable your service is.

https://youtu.be/P1gGFYS9LRk

ps: I haven’t looked at the code yet, but plan to. Maybe I’m being presumptuous and your service is fully secured. I’ll let you know if I find anything!

pps: I work for amazon but all my opinions are my own and do not necessarily reflect my employer’s. I don’t speak for Amazon in any way :D


As you allude to in your response, that's usually referred to as durability, not reliability. The home page could probably use an update there to reflect that terminology.


It sounds like a not-very-practical metric, since losing one byte often makes the whole dataset useless (encryption, checksum failures).


It's an average: presumably they don't smear files across disks byte by byte, since that would be insane. But with drives randomly breaking, at some point every copy of at least one file will go at once. With, say, a terabyte of files over a thousand years, you'd expect to lose files totalling about 100 KB. So probably not even one file, with some small chance of losing half a drive.


I think the probability of losing any data in 100 TB would be a better metric.


As in there's no durability guarantee for the data? I can expect data loss at a rate of one byte per 10 GB per year?


It's unavoidable that too many disk failures in quick succession lead to data loss. For example, if you store two copies, your durability rests on being able to detect a disk failure and create another copy before the sole remaining copy dies as well.


"What do you mean you mean it can't recover from a 100% disk failure rate?

At least it's all in RAID 0, so the data's safe."


Do you know if strong read-after-write consistency is supported (as in s3)? Is an atomic put-if-absent method supported in JuiceFS (as in Azure blob storage)? If so, this could be a really cool platform for formats like Delta.io :)


It seems not; instead it provides 'close-to-open' consistency, as documented here: https://juicefs.com/docs/community/cache_management/#data-co...


That is the same as S3 then, once an upload is complete it is seen by all other clients.


JuiceFS supports create-if-not-exists through the Java SDK (HDFS compatible), so I guess it should work well with Delta.io.
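For illustration, the same create-if-absent idea can be expressed on the POSIX side with O_CREAT|O_EXCL (a minimal sketch; /jfs is a hypothetical JuiceFS mount point):

    import errno
    import os
    def create_if_absent(path: str, data: bytes) -> bool:
        """Atomically create `path`; return False if it already exists."""
        try:
            # O_EXCL makes open() fail with EEXIST when the file is already there.
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
        except OSError as e:
            if e.errno == errno.EEXIST:
                return False
            raise
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        return True
    print(create_if_absent("/jfs/_delta_log/00000000000000000000.json", b"{}"))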


Regarding the topic of "cloud storage" - could someone tell me if Juice or maybe MinIO would be a good solution to: 1. Storing multimedia data (image/video) uploaded by an user - here I would guess it can either hit it directly or via the backend for auth 2. Should be accessible by an URL exposed outside of the docker-compose so it doesn't need to go through the backend REST API 3. Some form of authentication based on the JWT token in the Header - or maybe as this is a MVP simply generating a long enough random string will be enough

Or should I simply use nginx + filesystem and not overcomplicate?

I hear S3 everywhere, but as it's a pet project I don't want to go the AWS route; instead maybe a Hetzner VPS with docker-compose to run the whole setup, with an external Postgres instance.


webdav, oauth2_proxy, nginx. That’s all you need. You can create your own issuer or also use dex I think.

Fancy authnz is easy to do with openresty instead of vanilla nginx.

Alternatively, just use ownCloud/Nextcloud.


I tried putting Postgres on JuiceFS and let's just say.. it didn't perform very well


Is that a use case they are really targeting though? Their splash page mentions big data with model generation and genomic sequencing as examples. I can really only speak to genetic sequencing. The IO pattern for these workflows is almost all streaming reads/writes. Random access takes too long when you are reading/writing 100-500GB files.

Postgres doesn’t like running on NFS either to be fair.


> Postgres doesn’t like running on NFS either to be fair.

https://news.ycombinator.com/item?id=19119991

Is that still the case after this? Or is it tribal knowledge?


Yes, JuiceFS is not a good choice for PG, unless you don't care about performance.

One interesting use case is the backup of MySQL [1].

[1] https://juicefs.com/docs/cloud/backup_mysql_in_juicefs/


postgres + rocks fdw would be a more interesting test case (or any LSM DB)


Is it 'POSIX compatible' or 'POSIX', aka 'POSIX compliant'?

It's incredibly hard to make a distributed POSIX-compatible filesystem, since you run into CAP. I believe (but am not certain) that either you are caching locally in violation of POSIX, or you are signing up for arbitrarily long stalls and a ton of latency on every read/write. (I'm not certain because I'm not sure what POSIX specifies w.r.t. stale reads and other cache consistency requirements between syncs.)

It would be interesting to hear what the tradeoffs are here, but assuming they are explicit and can be designed around this seems very useful.


It is not POSIX anything. It provides a compatibility layer that makes open, close, read, and write work, but beyond that it does not provide the type of features that would allow you to deliver mail on it with qmail or whatever. It is incredibly misleading to advertise it that way.

As you say, there is no free lunch with distributed filesystems. Application programmers have to program their way around the fact that something like POSIX atomic writes with multiple writers is never going to work, and that the only way to get reasonable efficiency out of the thing is to defer work until the file is closed.


What specifically is missing?


Agreed, it's very hard; that's why GFS and HDFS gave up some parts of POSIX compatibility.

Regarding CAP, it's addressed by the different meta engines (CP systems: Redis, MySQL, TiKV) and object stores (AP systems). When the meta engine is not available, operations on JuiceFS will be blocked for a while and finally return EIO. When the object store returns 404 (object not found), which means it's inconsistent with the meta engine, the request will be retried for a while and may return EIO if it doesn't recover.

The file format is carefully designed to work around the consistency issues of the object store and local cache. Every piece of data is written into the object store and local cache with a unique ID, so you will never get stale data once the metadata is correct [1].

Within a mount point, JuiceFS provides read-after-write consistency. Across clusters, JuiceFS provides open-after-close consistency, which should be enough for most applications and strikes a good balance between consistency and performance.

[1] https://juicefs.com/docs/community/architecture/#how-juicefs...
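As a rough illustration of what that open-after-close guarantee means in practice (a sketch, assuming the same volume is mounted at a hypothetical /jfs on two machines):

    # Machine A (writer): new content becomes visible to other mounts
    # once the file is closed, which flushes data and commits metadata.
    with open("/jfs/shared/result.json", "w") as f:
        f.write('{"status": "done"}')
    # Machine B (reader): an open() issued after the writer's close() is
    # expected to see the new content; a handle that was already open
    # may still serve locally cached data until it is reopened.
    with open("/jfs/shared/result.json") as f:
        print(f.read())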


I was actually building something similar to JuiceFS, using S3 as an object store and optionally Redis (fast) or S3 (slow) for metadata storage. Basically a log-structured filesystem using rolling-hash chunk encoding and delegations. I kinda stopped when I found JuiceFS (and to some extent SeaweedFS) as they were much further along. If you need shared storage and don't have crazy performance requirements, it makes a lot of sense to separate out metadata and just throw blobs into object storage.


I tried to take a look into the documentation but it seems to all be in Chinese?

Edit: For some reason my phone defaulted to Chinese but on my laptop it's fine. User error I guess!


Does JuiceFS scale horizontally? I can’t see anything about how the servers federate/balance load or if they can at all.

[EDIT] looks like there's an issue -- https://github.com/juicedata/juicefs/issues/345

But this still doesn't really answer it -- if I run JuiceFS S3 Gateway in 2 places, is there any way to redirect reads?


Usually the meta engine and the object storage can each scale horizontally by themselves; JuiceFS is middleware that talks to these two services.

To serve S3 requests, you can set up multiple S3 gateways and put a load balancer in front of them.


> Usually the meta engine and the object storage can each scale horizontally by themselves; JuiceFS is middleware that talks to these two services.

Thanks for confirming this -- I spent a bunch of time reading and was wondering why I couldn't find anything... this answers my question.

I think I misunderstood JuiceFS -- it's more like a better rclone than it is a distributed file system. It's a swiss army knife for mounting remote filesystems.

Assuming you're using a large object service (S3, GCP, Backblaze, etc.), the scaling issue is expected to be solved. If you're using a local filesystem or local MinIO, for example, then you have to solve the problem yourself.

> To serve S3 requests, you can set up multiple S3 gateways and put a load balancer in front of them.

This is exactly the question I had -- it occurred to me that if I make 2 s3 gateways, even if they share the metadata store they might be looking at resources that only one can serve.

So in this situation:

    [metadata (redis)]-->[ram]
      |
    [s3-0]-->[local disk]
      |
    [s3-1]-->[local disk]
In that situation, if a request came in to s3-0 for data that was stored @ s3-1, the request would fail, correct? Because s3-0 has no way of redirecting the read request to s3-1.

This could work if you had an intelligent router sitting in front of the s3s (so you could send reads/writes to the one that is known to have the data), but by default, your writes would fail, I'm assuming.

Oh I have one more question -- can you give options to the sshfs module? It seems like you can just append `?SSHOptionHere=x` to `--bucket` but I'm not sure (ex. `--bucket user@box?SshOption=1`)


Can it do full encryption from client (transfer + at-rest) with Fuse?

Currently I use an ext4 image + LUKS + NBD over an SSH tunnel; it works but is extremely slow.


Yes, the data can be encrypted [1] by the client before sending to S3, but the metadata is not encrypted.

[1] https://juicefs.com/docs/community/security/encrypt
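For a rough idea of what client-side (before-upload) encryption of this kind looks like, here is a generic envelope-encryption sketch using the Python `cryptography` package; it is illustrative, not JuiceFS's actual implementation:

    import os
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import rsa, padding
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM
    rsa_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    def encrypt_block(plaintext: bytes):
        # Fresh symmetric data key per block, AES-256-GCM for the payload...
        data_key = AESGCM.generate_key(bit_length=256)
        nonce = os.urandom(12)
        ciphertext = AESGCM(data_key).encrypt(nonce, plaintext, None)
        # ...and the data key itself wrapped with the RSA public key, so only
        # the holder of the private key can decrypt what lands in S3.
        wrapped = rsa_key.public_key().encrypt(
            data_key,
            padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                         algorithm=hashes.SHA256(), label=None))
        return nonce, wrapped, ciphertext
    nonce, wrapped, ciphertext = encrypt_block(b"file block contents")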


Is this basically a non-POSIX* FUSE for S3 and/or Redis?

* POSIX implies a whole lot of guarantees, like atomic file renames/moves, that definitely don't seem to be included here.


Atomic file/directory renames/moves are a fundamental feature of JuiceFS, which makes it truly a file system rather than a proxy to S3. Please check the docs for all the compatibility details [1].

https://github.com/juicedata/juicefs#posix-compatibility
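To see why atomic rename matters, here is a minimal sketch of the classic write-then-rename publish pattern (the mount path is hypothetical): readers observe either the old file or the new one, never a half-written state.

    import os
    def publish(path: str, data: bytes) -> None:
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:      # write the new version to a temp name
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.rename(tmp, path)            # atomically swap it into place
    publish("/jfs/config/current.json", b'{"version": 2}')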


I run a Mastodon instance, and can imagine hosting the storage of multiple instances on a JuiceFS S3 gateway... though it would need dedupe before I'd consider that.


Mastodon needs S3-compatible storage. I am not sure if there's an advantage using JuiceFS to expose an S3 backend over the S3 API.


Does it support POSIX ACLs as well?


JuiceFS Cloud supports ACLs, but the open source one does not support them yet.


Any way to mount with a forced UID/GID for all files?

Useful in container scenarios.


There is an experimental feature to do this; we're still working on it.


> but the open source one does not support them yet.

Is this on the roadmap?


I can't seem to find docs in a language other than Chinese on the site


The docs are in English and Chinese, there's a language selector in the top right.

Perhaps if your computer/browser's language isn't set to English it defaults to Chinese?


Is the company Chinese or Taiwanese?


https://juicefs.com/en/about-us

  About Juicedata Inc.
  
  Founded in April 2017, Juicedata is a globally oriented innovated distributed file system company. The team consists of senior architects, genius engineers, and consulting experts who have worked in the field of distributed systems for many years. The team members located across Hangzhou, Shanghai, Xiamen, and other cities, used to serve Facebook, Databricks, Tencent, Alibaba, Zhihu, Xiaohongshu, Douban, and other well-known high-tech enterprises around the world.
  
  Juicedata was jointly invested by China Growth Capital and Foothill Ventures.


Juicedata Inc. is a US company, registered in Delaware. The founding team is Chinese.

ps, I'm the founder of Juicedata.


I could not find it on iOS Safari; might be a UI bug then.


The button to switch language is at the bottom of the top-right menu; we will fix that.


The button doesn't appear on the mobile theme it seems (went to https://juicefs.com/docs/zh/community/introduction/ after clicking the "Community edition docs" link at the home page).

Maybe consider also changing the link on the English homepage to the English documentation?



