We moved over to Garage after running MinIO in production with about 2PB, after about 2 years of headache. MinIO doesn't deal with small files very well, rightfully so, since it doesn't keep a separate index of the files other than straight on disk. While SSDs can mask this issue to some extent, spinning rust, not so much. And speaking of replication, Garage's just works... MinIO's approach, even with synchronous mode turned on, tends to fall behind, and again, small files will pretty much break it altogether.
We saw about 20-30x performance gain overall after moving to garage for our specific use case.
Quick question for advice: we have been evaluating MinIO for an in-house deployed storage for ML data. This is financial data, so we have to comply with a crap ton of regulations.
So we wanted lots of compliance features: access logs, access approvals, short-lived (time-bound) accesses, etc.
How would you compare Garage vs MinIO on that front?
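(For what it's worth, the "short-lived (time-bound) accesses" part usually maps to presigned URLs in S3-compatible stores; a rough sketch of what we're looking at, assuming boto3, with placeholder endpoint, bucket and key names:)

```python
# Minimal sketch of time-bound access via an S3 presigned URL.
# Endpoint, credentials, bucket and key are placeholders for illustration.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example.internal",  # MinIO/Garage endpoint (placeholder)
    aws_access_key_id="EXAMPLE_KEY",
    aws_secret_access_key="EXAMPLE_SECRET",
)

# The URL is only valid for 15 minutes; after that the signature is rejected.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "ml-data", "Key": "datasets/trades-2024.parquet"},
    ExpiresIn=900,
)
print(url)
```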
As a competing theory, since both MinIO and Garage are open source, if it were my stack I'd patch them to log with whatever granularity one wished, since in my mental model the system of record will always have more information than a simple HTTP proxy in front of it.
Plus, in the spirit of open source, it's very likely that if one person has this need then others have this need, too, and thus the whole ecosystem grows versus everyone having one more point of failure in the HTTP traversal
Hmm... maybe??? If you have a central audit log, what is the probability that whatever gets implemented in all the open (and closed) source projects will be compatible?
Why not? The application logs who, when, and what happened to disk. These are application-specific audit events, and such patches should be welcome upstream.
A log scraper takes care of long-term storage, search and indexing, because you want your audit logs stored in a central location eventually. That part is not bound to the application, and upstream shouldn't be concerned with how one does it.
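To make the split concrete, a minimal sketch: the application appends structured events to a file, and a shipper (Vector, Fluent Bit, whatever) tails that file and forwards it to central storage. The field names here are made up for illustration, not taken from any project:

```python
# Minimal sketch: the application appends one JSON audit event per line;
# a log shipper tails the file and forwards it to central storage.
# Field names are illustrative assumptions, not any project's schema.
import json, time

def audit(path, actor, action, resource, outcome):
    event = {
        "ts": time.time(),
        "actor": actor,        # who (as authenticated by the application)
        "action": action,      # what operation was attempted
        "resource": resource,  # which object/bucket/repo it touched
        "outcome": outcome,    # allowed / denied / error
    }
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")

audit("/var/log/app/audit.jsonl",
      "alice", "GetObject", "ml-data/datasets/trades-2024.parquet", "allowed")
```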
That is assuming the application is aware of "who" is doing it. I can commit to GitHub with any name/email address I want, but only GitHub's proxy servers know who actually sent the commit.
That's a very specific property of git, stemming from its distributed nature: it allows one to push the history of a repo fetched from elsewhere.
The receiver of the push is still considered an application server in this case. Whether GitHub solves this with a proxy or by reimplementing the git protocol and solving it in-process is an internal detail on their end. GitHub is still "the application". Other git forges do this type of auth in the same process without any proxies, GitLab or Gerrit for example, open source and self-hosted, making this easy to confirm.
In fact, for such a hypothetical proxy to be able to solve this scenario, the proxy must have an implementation of git itself. How else would it know how to extract the committer email and cross-check that it matches the logged-in user's email?
An application almost always has the best view of what a resource is and what permissions are set on it, and it almost always has awareness of "who" is acting upon said resource.
> That's a very specific property of git, stemming from its distributed nature.
Not at all. For example, authentication by a proxy server is as old as the internet. There's a name for it, I think, "proxy authentication"?[1] I've definitely had to write support for it many times in the past. It was the way to do SSO for self-hosted apps before modern SSO.
> In fact, for such a hypothetical proxy to be able to solve this scenario, the proxy must have an implementation of git itself.
Ummm, have you ever done a `git clone` before? Note the two most common types of URLs: https and ssh. Both of these are standard implementations. Logging the user that is authenticating is literally how they do rate limiting and audit logging. The actual git server doesn't need to know anything about the current user, or whether they are authenticated at all.
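To spell out the pattern, here's a minimal sketch, assuming the proxy has already authenticated the user and forwards the identity in a trusted header (the header name and log format are just illustrative, not any particular product's convention):

```python
# Minimal sketch of proxy authentication: the reverse proxy terminates auth,
# then forwards the authenticated identity in a trusted header. The backend
# (a trivial WSGI app standing in for the git server) only reads it for
# logging/rate limiting and never re-checks credentials itself.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("audit")

def app(environ, start_response):
    user = environ.get("HTTP_X_REMOTE_USER", "anonymous")  # set by the proxy (assumed header)
    path = environ.get("PATH_INFO", "/")
    # The proxy-level view: who is talking and which URL they hit,
    # nothing about branches or committer emails.
    log.info("user=%s path=%s", user, path)
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok\n"]

if __name__ == "__main__":
    from wsgiref.simple_server import make_server
    make_server("127.0.0.1", 8000, app).serve_forever()
```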
Enough shifting the goalposts. This was about applications doing their own audit logging, and I still don't understand what's wrong with that, not made-up claims that applications or a git server don't know who is acting upon them. Yes, a proxy may know "who" and can perform additional auth and logging at that level, but it often has a much less granular view of "what". In the case of git over HTTP, I doubt nginx out of the box has any idea what a branch or a committer email is; at best you will only see a request to the repo name and the git-upload-pack URL.