We moved over to Garage after running MinIO in production with about 2PB, after about 2 years of headache. MinIO doesn't deal with small files very well, rightfully so, since it doesn't keep a separate index of the files other than straight on disk. While SSDs can mask this issue to some extent, spinning rust, not so much. And speaking of replication, Garage's just works... MinIO's approach, even with synchronous mode turned on, tends to fall behind, and again, small files will pretty much break it altogether.
We saw about 20-30x performance gain overall after moving to garage for our specific use case.
Quick question for advice: we have been evaluating MinIO for an in-house deployed storage for ML data. This is financial data, so we have to comply with a crap ton of regulations.
So we wanted lots of compliance features: access logs, access approvals, short-lived (time-bound) accesses, etc.
How would you compare Garage vs MinIO on that front?
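(For what it's worth, the "short-lived (time-bound) accesses" part usually maps to presigned URLs in S3-compatible stores; a rough sketch of what we're looking at, assuming boto3, with placeholder endpoint, bucket and key names:)

```python
# Minimal sketch of time-bound access via an S3 presigned URL.
# Endpoint, credentials, bucket and key are placeholders for illustration.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example.internal",  # MinIO/Garage endpoint (placeholder)
    aws_access_key_id="EXAMPLE_KEY",
    aws_secret_access_key="EXAMPLE_SECRET",
)

# The URL is only valid for 15 minutes; after that the signature is rejected.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "ml-data", "Key": "datasets/trades-2024.parquet"},
    ExpiresIn=900,
)
print(url)
```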
As a competing theory, since both MinIO and Garage are open source, if it were my stack I'd patch them to log with whatever granularity one wished, since in my mental model the system of record will always have more information than a simple HTTP proxy in front of it.
Plus, in the spirit of open source, it's very likely that if one person has this need then others have this need, too, and thus the whole ecosystem grows versus everyone having one more point of failure in the HTTP traversal
Hmm... maybe??? If you have a central audit log, what is the probability that whatever gets implemented in all the open (and closed) source projects will be compatible?
Why not? The application logs who, when, and what happened to disk. These are application-specific audit events, and such patches should be welcome upstream.
A log scraper takes care of long-term storage, search and indexing, because you want your audit logs stored in a central location eventually. That part is not bound to the application, and upstream shouldn't be concerned with how one does it.
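To make the split concrete, a minimal sketch: the application appends structured events to a file, and a shipper (Vector, Fluent Bit, whatever) tails that file and forwards it to central storage. The field names here are made up for illustration, not taken from any project:

```python
# Minimal sketch: the application appends one JSON audit event per line;
# a log shipper tails the file and forwards it to central storage.
# Field names are illustrative assumptions, not any project's schema.
import json, time

def audit(path, actor, action, resource, outcome):
    event = {
        "ts": time.time(),
        "actor": actor,        # who (as authenticated by the application)
        "action": action,      # what operation was attempted
        "resource": resource,  # which object/bucket/repo it touched
        "outcome": outcome,    # allowed / denied / error
    }
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")

audit("/var/log/app/audit.jsonl",
      "alice", "GetObject", "ml-data/datasets/trades-2024.parquet", "allowed")
```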
That is assuming the application is aware of "who" is doing it. I can commit to GitHub with any name/email address I want, but only GitHub's proxy servers know who actually sent the commit.
That's a very specific property of git, stemming from its distributed nature: it allows one to push the history of a repo fetched from elsewhere.
The receiver of the push is still considered an application server in this case. Whether GitHub solves this with a proxy or by reimplementing the git protocol and solving it in-process is an internal detail on their end. GitHub is still "the application". Other git forges do this type of auth in the same process without any proxies, GitLab or Gerrit for example, open source and self-hosted, making this easy to confirm.
In fact, for such a hypothetical proxy to be able to solve this scenario, the proxy must have an implementation of git itself. How else would it know how to extract the committer email and cross-check that it matches the logged-in user's email?
An application almost always has the best view of what a resource is and what permissions are set on it, and it almost always has awareness of "who" is acting upon said resource.
> That's a very specific property of git, stemming from its distributed nature.
Not at all. For example, authentication by a proxy server is as old as the internet. There's a name for it, I think, "proxy authentication"?[1] I've definitely had to write support for it many times in the past. It was the way to do SSO for self-hosted apps before modern SSO.
> In fact, for such a hypothetical proxy to be able to solve this scenario, the proxy must have an implementation of git itself.
Ummm, have you ever done a `git clone` before? Note the two most common types of URLs: https and ssh. Both of these are standard implementations. Logging the user that is authenticating is literally how they do rate limiting and audit logging. The actual git server doesn't need to know anything about the current user, or whether they are authenticated at all.
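To spell out the pattern, here's a minimal sketch, assuming the proxy has already authenticated the user and forwards the identity in a trusted header (the header name and log format are just illustrative, not any particular product's convention):

```python
# Minimal sketch of proxy authentication: the reverse proxy terminates auth,
# then forwards the authenticated identity in a trusted header. The backend
# (a trivial WSGI app standing in for the git server) only reads it for
# logging/rate limiting and never re-checks credentials itself.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("audit")

def app(environ, start_response):
    user = environ.get("HTTP_X_REMOTE_USER", "anonymous")  # set by the proxy (assumed header)
    path = environ.get("PATH_INFO", "/")
    # The proxy-level view: who is talking and which URL they hit,
    # nothing about branches or committer emails.
    log.info("user=%s path=%s", user, path)
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok\n"]

if __name__ == "__main__":
    from wsgiref.simple_server import make_server
    make_server("127.0.0.1", 8000, app).serve_forever()
```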
Enough shifting the goalposts. This was about applications doing their own audit logging, and I still don't understand what's wrong with that, not made-up claims that applications or a git server don't know who is acting upon them. Yes, a proxy may know "who" and can perform additional auth and logging at that level, but it often has a much less granular view of "what". In the case of git over HTTP, I doubt nginx out of the box has any idea what a branch or a committer email is; at best you will only see a request to the repo name and the git-upload-pack URL.