Well played. I think that just got added to my standard vocabulary. Caching has caused more errors and bugs that I've had to deal with than I can recall. My favorite was an off-by-one error where we returned nicely cached info -- just for the previous user who came through our system! :facepalm: That was a bad one.
That's because essentially, "state" and "caching" are the same thing on some level.
And the problem with state is that you have to make sure all your state transitions don't cause bugs. What we know as a "cache" is essentially creating new state representing existing state, with all new transitions...
I like to look at caching as a form of denormalization - introducing redundancy to improve performance. And whenever we have redundancy, we have to make sure all our copies are synchronized, which can be tricky, especially in a concurrent environment.
On the other hand, the whole point of normalization in databases is to avoid redundancy and have a "single source of truth".
I find the concepts of normalization and denormalization applicable and helpful outside databases as well, though a different terminology is often used.
In the efficient implementation of a pure functional language (say Haskell without MVars), what really is the difference between state and cache?
I know this is overly philosophical, and in practical scenarios we readily (although not always unambiguously) differentiate between "cache" and "state", but the point about transitions and that being a major source of bugs still stands.
> In the efficient implementation of a pure functional language (say Haskell without MVars), what really is the difference between state and cache?
If you want to unify state and cache, you might want to go down a different route:
Think of log-based filesystems (or a log-based database).
Instead of defining your operations in terms of state, you define them as pure functions of the log.
So your log is full of operations. Writing just means appending a symbolic operation like `write(key, value)` to your log.
And you define the result of `read(key)`: scan backwards through the log until you hit the last instance of `write(key, value)` and then return that `value`.
Now state means: compact your log by replacing a swath of `write` log entries with one big `snapshot` operation that encompasses many key/value pairs.
Alternatively, you can also define state to mean caching your `read` operations.
In this approach, it's no coincidence that the log is a data structure that has a linear shape: the evolution of state over time is also linear.
(With some cleverness you can replace the linear structure with e.g. a DAG, and then also think about how you merge divergent states.)
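To make the log idea concrete, here is a minimal sketch in Python. The `LogStore` class and its tuple layout are made up for illustration; it only assumes the `write`, `read` and `snapshot` operations described above.

```python
from dataclasses import dataclass, field


@dataclass
class LogStore:
    """Key/value store whose only data is an append-only log of operations."""

    # Each entry is ("write", key, value) or ("snapshot", {key: value, ...}).
    log: list = field(default_factory=list)

    def write(self, key, value):
        # Writing never touches old entries; it only appends.
        self.log.append(("write", key, value))

    def read(self, key):
        # Scan backwards until the newest entry that can answer for the key.
        for entry in reversed(self.log):
            if entry[0] == "write" and entry[1] == key:
                return entry[2]
            if entry[0] == "snapshot":
                # A snapshot summarises everything before it, so stop here.
                if key in entry[1]:
                    return entry[1][key]
                raise KeyError(key)
        raise KeyError(key)

    def compact(self):
        # "State": fold the whole log into a single snapshot entry.
        state = {}
        for entry in self.log:
            if entry[0] == "snapshot":
                state = dict(entry[1])
            else:
                state[entry[1]] = entry[2]
        self.log = [("snapshot", state)]


store = LogStore()
store.write("a", 1)
store.write("b", 2)
store.write("a", 3)
before = store.read("a")   # answered by scanning the raw log
store.compact()
after = store.read("a")    # answered by the snapshot
```

Reads give the same answer before and after compaction; the snapshot is just a cheaper representation of the same log, which is roughly the sense in which state and cache blur together here.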
I can never remember what they are, though. To avoid this problem, I think I wrote them down on a post-it, but I had too many post-its on my desk so I got rid of them all, and now I can't remember.
I think it could probably mean "Happy to help" if said in response to a thanks of some sort. Saying it before someone has said thanks is a bit presumptuous. :-)
It's amazing how often exploits come down to optimizations. The general form being "the domain logic is X, and is secure, but we faked around it in this one case to make it faster, and it turns out we made a bad assumption while doing so". Meltdown fits this description too.
I am really fascinated by the responses to this comment. So many people exclaiming how many issues are caused by caches. In ten years as a fulltime programmer the only cache issues I've seen are cache misses. It probably has to do with one's field. I'm a game developer mainly dealing with graphics programming.
The key problem (as I understand it) is that updating a cache properly requires knowing the exact graph of relations that an entry in the cache has to other entries. So that when that entry changes, you can propagate that change throughout the cache to other concerned entries which need to be recomputed. But knowing that exact graph is too complex a task to be trivial, it seems in this case. Basically it sounds like the non-visual version of rerendering UI when a state changes, which is hard enough even with visual feedback.
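A rough sketch of what "knowing the graph" means, in Python. The `DependencyCache` class and the entry names are hypothetical, and it assumes every entry honestly declares what it was derived from, which is exactly the part that tends to go wrong in practice.

```python
from collections import defaultdict


class DependencyCache:
    """Cache whose entries record which other entries they were derived from."""

    def __init__(self):
        self.values = {}
        # Maps an entry to the entries that were computed from it.
        self.dependents = defaultdict(set)

    def put(self, key, value, depends_on=()):
        self.values[key] = value
        for parent in depends_on:
            self.dependents[parent].add(key)

    def invalidate(self, key):
        # Walk the graph so everything derived from `key` is dropped as well.
        stack = [key]
        seen = set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            self.values.pop(current, None)
            stack.extend(self.dependents.pop(current, ()))
        return seen


cache = DependencyCache()
cache.put("user", {"name": "alice"})
cache.put("profile_html", "<p>alice</p>", depends_on=["user"])
cache.put("page", "<html>...</html>", depends_on=["profile_html"])
cache.put("footer", "<footer/>")
dropped = cache.invalidate("user")
```

If one `depends_on` is missing, `invalidate` quietly leaves a stale entry behind, and nothing tells you about it until a user sees old data.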
A lot of threading issues are also cache related. Forget to properly mark access to shared variables and suddenly every thread/CPU core ends up with its own locally cached version of it.
Yes, but this is such a well understood danger that I've never really been bitten by it in practice.
Along the same lines a lot of GPU programming tutorials warn of inconsistencies between threads and it has never been a problem since I just assume I cannot rely on consistency or order of execution, seeing each thread as separate and independent.
To solve this problem we would need to first understand the human mind, how it stores data, how it does computation, and how it interacts with names. So we would need the same set of information that we would need for creating AGI. A solution is probably only a couple of months/decades away.
Those aren’t formal definitions. “Formal” means, at the very least, that the specification is done in a formal language, and usually that conformance to the specification can be checked mechanically, that is, by a computer.
I don't find I ever make off-by-one errors with simple collection iteration; at some point "i < len" becomes tattooed on your brain stem. The off-by-one errors I tend to make are related to implementation details of certain data structures or algs. Really, I would describe them more as "thinking at the margins can be challenging." Correctly handling doubly linked lists, that sort of thing.
Oh, and slicing. I will never get Python slicing right the first time. The fact that the range is [begin, end) is just never the way I expect it to work.
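For anyone else who trips over this, a small Python example of the half-open convention (the list contents are arbitrary):

```python
items = ["a", "b", "c", "d", "e"]

# The slice [1:4] is the half-open range [1, 4): indices 1, 2 and 3.
middle = items[1:4]

# The length of a slice is simply end - begin.
assert len(middle) == 4 - 1

# Adjacent slices share a boundary without overlapping or leaving a gap.
assert items[:2] + items[2:] == items
```

The last two properties are the usual argument for [begin, end): lengths and adjacent ranges need no +1 or -1 adjustments.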
Per your downvotes - I used to hate jokes on Hacker News and downvote them when I saw them, but I've become more ambivalent. They're a way of amicably sharing culture and experiences with other engineers that transcend any differences in age, gender, race, background, etc.
It's barely even a joke to me anymore -- it's just too real for me to laugh.
(Cache invalidation is essentially the same problem as managing mutable state -- "Out of the Tar Pit" frames mutable state as either essential or incidental, the latter being rederivable in principle from essential state. Incidental mutable state is no more and no less than a cache, and usually one with an informal and undocumented invalidation policy.)
(And naming things has a very real technical counterpart in addressing, which comes up obviously in networking, but you can also see its shadows in quite a lot of concerns around architecture and modularity.)
Right, which is (AFAIK) not usually recommended except for side projects or ones where there is already an existing relationship under a personal email address.
Couldn’t it just as well be attributed to improper file path normalization? If we had only lower case ASCII file systems it would not have caused a problem.
Have you seen the mayhem that some of mine cause when you clone them and then type ./configure && make, like you have been socially engineered into doing?
It doesn't even have to be there... The main reasons to clone a repo are because you're about to compile and run the code there, or you already have and need to fix something.
I don't personally audit all the code I run, but I hope someone is doing it. That being said, source code being public is much better than the alternative of just downloading binaries from who knows where.
I don't trust anything absolutely, and I don't see a way past it.
In spite of my tongue-in-cheek statement, I get it.
It's huge in the context of non-programming uses of Git. If some people are just sharing some text documents with Git, then it's a big deal.
This is likely on the rise.
E.g. if you look at a site like Github, there is a lot of non-code content in it. Some people stash that content, and other people believe that content to just be harmless files that will never perpetrate an exploit just from being cloned.
Well, I clone repos to inspect code all the time, and when I run code, it’s usually not with the same permissions as the corresponding `git clone`. Maybe I should be better about sandboxing Git…
1. Download container description (Dockerfile)
2. Upon image build it "compiles things" (e.g. processes/assembles javascript)
3. Build fails, because it pulls an architecture-incompatible library (or does not pull an architecture-mandated library)
4. Fix build scripts, rebuild container image
5. Verify container
6. Pull repo
7. Reproduce changes, commit
8. Push
Nothing apart from clone-edit-push happens on the repo. The code can be executed on a remote, hardened, isolated system. With the proliferation of containers I guess this scenario will become more and more common among ops people.
For a while I tried to only run untrusted builds in Docker containers, like doing `docker run -v $PWD:/src node npm install`, but IDEs are not really configured to deal with this. Even my Vim has ALE and would just run node_modules/.bin/tsserver on my machine, which could be anything. Why aren't our tools concerned with this at all?
I get that you are not completely serious, but before the cmake/meson/... people jump on this:
If ./configure is checked in as part of the official repository of a moderately well known project, I doubt any committer would be stupid enough to insert a backdoor into ./configure or the Makefiles.
What can happen if an apostate project is not on GitHub: Some (usually several) faithful persons decide to correct the situation and put multiple unofficial mirrors on GitHub, and other faithful people clone from a random one of these.
Doesn't Git-for-Windows configure symbolic link support off by default, though? Or does this exploit work even in that case as long as the underlying file system supports symlinks?
Git-for-Windows may turn symlink support on by default under some specific circumstances. As the repo's wiki [1] says:
Short version: there is no exact equivalent for POSIX symlinks on Windows, and the closest thing is unavailable for non-admins by default unless Developer Mode is enabled and a relatively recent Windows 10 version is used. Therefore, symlink emulation support is only turned on by default when that scenario is detected.
Before Windows XP, any application could open a file with a case-sensitive flag to request that the operating system not do any case folding. Starting with XP, the same feature exists but requires a registry key to be set (and a reboot) to instruct the kernel to allow case-sensitive operations.
Starting with Windows 10, the aforementioned key still works, but there's also a per-directory case-sensitive flag that forces all DOS and Windows programs to have case-sensitive operations unconditionally. This is used to great effect in both WSL1 and Cygwin.
macOS has defaulted to be case insensitive largely due to historical and perhaps usability reasons. You can opt to make it case sensitive (and I do, which broke Steam for several years but that also freed my time).
When necessary you can make an auto-expanding volume that's case-sensitive and leave your host FS alone. I have not found that I really want to have differently-cased but otherwise identical filenames in the real world at any point though.
Ages ago, I heard from co-workers at a company that I had left that there was an issue because of some file-naming in a PHP application that they were trying to run locally on a Mac. There was foo.php which was the interface and Foo.php which had a class definition in it.
A bug that affects my teams once every few years is that a developer will create a file named "A.txt", check it into Git, realize it should be "a.txt" and rename it, and then basically everything will shit the bed and you waste a day figuring out why nothing is working.
A related bug is a developer will make a webpage called /a/ but link to /A/ and then the link will be broken in production. At this point, I have seen this same bug enough times to be able to fix it reasonably quickly, but it definitely wastes time for the team.
Yeah, my current company encourages development in a case-sensitive volume and I assume this is why. But this is more an issue of your development environment working differently than prod than a problem with the notion of case insensitivity per se.
Ashamed to admit (as an OSX user) that I didn't even realize the FS was case-insensitive (having migrated from years of Linux usage to a non-Linux desktop). It does a good job of hiding this from the user (filenames are still listed with cases, and bash autocompletion completes to the correct case as well)
MacOS by default uses a "case-preserving case-insensitive" filesystem, so you can create files with mixed case, but you can't create two files with the same name and different case. It's one of MacOS's more-egregious crimes against Unix. Fortunately it doesn't manifest that often, but it rears its head often enough to be a problem.
It may be a crime, but it is the result of a set of compromises in the design of the OSX filesystem, which had to work with a BSD variant while also being compatible with pre-OSX days. I think it’s one thing they actually did an elegant job with.
> It's one of MacOS's more-egregious crimes against Unix.
Nah. Using a file system means putting up with its semantics. HFS+ was case-insensitive; they were deploying an upgrade to millions of existing filesystems.
If you mount, say, an NFS volume, MacOS does the expected thing.
Case-sensitivity is not a "nasty holdover", it is a good design decision that continues to be proven correct (case in point, this bugfix for case-insensitive filesystems).
Why would you introduce complexity into the filesystem to try to normalize file names when you can simply, not? I mean, have you _seen_ the mess that is Unicode normalization? Hundreds of different glyphs or whatever that are all considered equivalent, but are actually composed of different bytes. The filesystem should try to make sense of all that, and consider them equivalent paths?
Even if you say "well, just capitalization, not Unicode normalization," there's the whole German letter ẞ => ss (or is it ß?) and similar friends like the Turkish dotted I that have popped up as articles on HN. Absolutely glad Linux filesystems by and large do not attempt to take that on, and treat paths as a bucket of bytes instead.
All for what benefit - so you can type File.txt in the terminal and have the OS find file.txt? That is much more appropriate for the Application layer to resolve, rather than the filesystem.
Bugs like this come from over-engineering. Filesystems should be simple, and follow the principle of least surprise.
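To show how slippery "just capitalization" gets, here is what Python's built-in, locale-independent case mappings do with the characters mentioned above. This only illustrates the Unicode default mappings; it is not a claim about what any particular filesystem implements.

```python
# German sharp s (U+00DF): uppercasing changes the length of the string.
sharp_s_upper = "\u00df".upper()          # "SS"
sharp_s_folded = "\u00df".casefold()      # "ss"

# Capital sharp s (U+1E9E) lowercases to the ordinary sharp s (U+00DF).
capital_sharp_s_lower = "\u1e9e".lower()

# Turkish dotless i (U+0131): upper() then lower() does not round-trip.
dotless_round_trip = "\u0131".upper().lower()   # "i", not the dotless i

# Turkish dotted capital I (U+0130) lowercases to two code points:
# "i" followed by a combining dot above (U+0307).
dotted_lower = "\u0130".lower()
```

So "compare names case-insensitively" already forces a choice between `lower`, `upper` and `casefold`, and each choice gives different answers for some of these names.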
It bugs me to see foo.c and Foo.c as separate files in a directory listing. I like the fact that MacOS doesn't allow this situation to ever happen. Not taking on that problem means it's left to the user to figure out what's going on when similar glyphs occur.
> Fortunately it doesn't manifest that often, but it rears its head often enough to be a problem.
IIRC, one place where it does rear its head is when a file is renamed in a git commit to a value that downcases to the same value as the prior name. For example `Foo.txt`->`foo.txt`.
I have `core.ignorecase = true` in my `.gitconfig` for this very reason.
The extraordinarily frustrating case is where you're working on a repository that has multiple files that differ only by case. Git will check out one of them, then overwrite it with the other.
Debhelper used to be one of those until I convinced them to change it: they had a Debian/ directory for the Perl module Debian::Debhelper as well as a debian/ directory for the packaging metadata. https://bugs.debian.org/873043
(I suspect I'm a little unusual in wanting to have checkouts of Linux and Debhelper on my Mac homedir.)
If linux doesn't normalize unicode at all, can you have two different files that look like they are named `josé`, depending on if the é is decomposed or not?
Yes, for linux filenames are just bytes. Apart from / and NUL characters it doesn't care what you give it, nor does it mangle them in any way; it's the only sane thing to do.
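A quick Python illustration of the two spellings, using the standard `unicodedata` module. It only compares the strings and their UTF-8 bytes and does not touch a filesystem; on a bytes-only filesystem, each of these would be its own directory entry.

```python
import unicodedata

composed = "jos\u00e9"      # e-acute as one code point (NFC form)
decomposed = "jose\u0301"   # "e" followed by a combining acute accent (NFD form)

# They render the same but are different strings with different bytes.
assert composed != decomposed
composed_bytes = composed.encode("utf-8")      # b'jos\xc3\xa9'
decomposed_bytes = decomposed.encode("utf-8")  # b'jose\xcc\x81'

# Normalizing both to the same form makes them compare equal.
same_after_nfc = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)
```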
The only sane thing to do if you don’t care about how humans (as opposed to nerds) think.
In the end, the file system doesn’t exist in isolation, it is there to support users, and most of them won’t care how many bytes “é” takes to store.
Unix, by not even defining the way to interpret the bytes of file names (one can’t even assume that names consisting of only bytes that correspond to ASCII letters and digits should be interpreted as ASCII) makes it impossible to show file names to users. That’s insane.
> The only sane thing to do if you don’t care about how humans (as opposed to nerds) think.
At the FS layer, I think that's better. Makes things simpler for programs. For non-techie humans, unicode can be normalized at upper levels, like the GUI file manager or toolkit library that does save dialogs, etc.
That's if humans being confused because of lack of normalization of unicode is a real practical issue and not just something that can happen but never does.
FYI, on MacOS, it is a property of the partition, so you can reformat and have a case-sensitive filesystem. Applications may subtly break if they weren't tested on such a filesystem, but I had used one for several years without too many issues.
Since the introduction of APFS I've taken to creating a new APFS volume formatted as case-sensitive, and put my git repositories there.
This has mostly been useful for working on shared repositories where, say, a Linux user (or other user on a case-sensitive filesystem) pushes two branches, say `feature/foo` and `feature/Foo` which works fine for them, but on a case-insensitive filesystem, git gets very upset.
Once spent way longer than I would have liked trying to debug an iOS app issue that I couldn’t reproduce and debug in the emulator because iOS devices have a case-sensitive FS, macOS devices typically don’t, and the emulator was subject to the macOS file system’s conventions.
It's a notable problem with git + Windows that has gotten better over time but still leads to a lot of WTF moments. For many, this event is the first time they hear that Windows' filesystem is case insensitive.
Sometimes it feels like corporate IT creates more security problems than it solves: Windows as development machines, SolarWinds, Fucking McAfee malware on everything.
I don't think that smugly not running as root saves normal users; while malware running as your user can't trash your laptop, it can get your Google cookie and read and send emails as you, spend your money, view your private photos, etc.
etckeeper and friends (I have a git checkout in /etc/nixos on nixos machines), portage sync on funtoo, pulling ports tree or even system source on a BSD, grabbing setup scripts during install of Arch before a non-root user exists
I did a `brew install git` and then deleted /Library/Developer/CommandLineTools/usr/bin/git. You can't delete /usr/bin/git even with sudo (system integrity policy).
After installing git via brew and removing the one in CommandLineTools, /usr/bin/git is showing the latest version.
me@local % git --version
git version 2.30.2
I don't know if this is recommended or if it will have negative consequences that I don't know about, but it seemed like the way I could accomplish it. Given that /usr/bin/git is working with the homebrew installed git, I'm hopeful that everything will be good.
I don't use Mac, so I can't speak to how the changes you made will affect your system. For future reference however, you should know that binaries are searched on your $PATH in order. Instead of deleting anything, you could have edited your $PATH variable so that the directory that brew stores binaries in is searched before other locations.
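If the "searched in order" part is unclear, this Python sketch mimics the lookup. `find_on_path` is a made-up helper and the two directories are temporary stand-ins for a brew directory and a system directory, so nothing on the real system is touched (it assumes a Unix-like system for the executable bit).

```python
import os
import tempfile


def find_on_path(command, path):
    """Return the first executable named `command` in `path`, or None."""
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, command)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


with tempfile.TemporaryDirectory() as root:
    first = os.path.join(root, "brew_bin")
    second = os.path.join(root, "system_bin")
    for directory in (first, second):
        os.mkdir(directory)
        tool = os.path.join(directory, "git")
        with open(tool, "w") as handle:
            handle.write("#!/bin/sh\n")
        os.chmod(tool, 0o755)

    # Whichever directory comes first in the search path wins.
    brew_first = find_on_path("git", os.pathsep.join([first, second]))
    system_first = find_on_path("git", os.pathsep.join([second, first]))
    brew_wins = brew_first == os.path.join(first, "git")
    system_wins = system_first == os.path.join(second, "git")
```

Both copies of `git` stay on disk; only the order of the directories decides which one runs. The standard library's `shutil.which` does the same kind of lookup against the real $PATH.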
> This vulnerability affects platforms with case-insensitive filesystems with support for symbolic links, when certain clean/smudge filters are configured globally (e.g. Git LFS).
Can we get the title changed to "on macOS and Windows?"
I was worried for a second, but this is meaningless.
Why is it meaningless? Lots of people use Git on MacOS and Windows. I'd even be willing to bet that there are more people using Git on MacOS and Windows than Linux.
It does if nobody uses it. You can't exploit Apache 2.4.2 proxy bugs if nobody runs Apache 2.4.2 in proxy mode.
Of course, you should still update because you're a config change away from being vulnerable, but GP's point of it not being a big deal if (and only if, don't know if that's correct) nobody uses it stands.
I assume the primary user base of git-lfs is folks doing things like video game development (so that they can check in image/audio assets to a repo without massively bloating it), which probably has a much higher fraction of Mac/Windows users than folks writing server-side apps or whatever.
To be honest I was a heavy Windows user until Windows 7 yet only recently learned that it has symlink support. It's not something you (used to?) really come across in the ecosystem.
That said, I did snicker at the comment :) I had no idea it was that old.
I had exactly that done to me. By a huge 200k+ employees corporate monster... You can taste the humiliation of having to justify every sudo through a ticket system. They censored the internet for employees too, in an of course absurdly broken way. Made me learn to read ASN.1 printouts & detect tampering with TLS certs. Add to that an iconic "Office Space"-ey workplace atmosphere, absolutely toxic... made me _request a headset_ from the company (employees are not allowed to bring their own); 2 weeks of ticketing again, and they deliver: an rj45-plug phone headset, with three obscure boxes on the wire (I can only presume, for surveillance). I've been testing my limits for 3 months with them, and left without saying a word. A lesson is a lesson.
There are many options for case-insensitivity on Linux. The common one would be FAT, which can't handle symbolic links, so that is moot. There are also ext4 and ZFS, which can have case-insensitive modes enabled (they aren't by default) and which do support symbolic links. ntfs-3g also has an option to mount as case-insensitive (though said option can actually subtly break access to an NTFS volume, since NTFS itself is always case-sensitive and it's just the OS's VFS layer that pretends otherwise).
The exploit can also be done with (case-sensitive) Unicode file names. All it requires is that git thinks two paths are distinct, while the file system thinks they're equivalent
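As a rough way to see which paths are at risk, this Python sketch groups names that are distinct strings but collapse to the same key. The `fold` function is only an approximation I made up (NFC plus `casefold`); real filesystems each apply their own rules, and the path list is invented.

```python
import unicodedata
from collections import defaultdict


def fold(path):
    # Approximation of a case-insensitive, normalizing filesystem's view.
    return unicodedata.normalize("NFC", path).casefold()


def colliding_paths(paths):
    """Group paths that are distinct strings but fold to the same key."""
    groups = defaultdict(list)
    for path in paths:
        groups[fold(path)].append(path)
    return [group for group in groups.values() if len(group) > 1]


tracked = [
    "Foo.txt",
    "foo.txt",
    "docs/jos\u00e9.txt",    # composed e-acute
    "docs/jose\u0301.txt",   # decomposed e-acute
    "README",
]
collisions = colliding_paths(tracked)
```

Every group in `collisions` is a set of names that a byte-comparing tool treats as separate entries while a folding filesystem would treat them as one.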
Yes, but no sensible people use case-insensitivity on Linux, and the amount of other people that do in a relevant context can probably be measured with four digits.
I have case-insensitivity enabled for DOSBox and Wine file systems.
I've actually thought about converting my whole $HOME to that way, but I do have a few files that would conflict if I did that. I honestly don't think it's that bad of an idea.
Do you store code in $HOME? If so, I wouldn't recommend it. I have a case-sensitive partition on my macOS machine because I was bitten one too many times by code that worked fine on my development machine (case-insensitive file system) only to fail in production (case-sensitive file system).
Case sensitivity is one of the things that really bothers me on Linux, it causes me to make mistakes for no reason. If I ever really switched to Linux full-time, I’d probably want to change that.
Because case insensitivity causes ambiguity and complexity for no meaningful benefit, and more often than not causes problems like in the post. This isn't a "Linux" thing for me; every UNIX and POSIX system that has been well-designed with the exception of Snow Leopard has had case sensitivity.
The benefit seems pretty clear to me. Users do not generally consider uppercase and lowercase versions of the letter completely distinct. The use cases for identical file names with different cases would seem quite limited.
Don't you think that 65 ≠ 97 is sufficient of a reason?..
I mean, 'A' ≠ 'a', in ASCII, Unicode and even EBCDIC. In computers, those are two distinct characters. This fact won't change no matter how you rationalize your expectations.
Thus, pretending that "y.txt" is the same as "Y.txt" is an elaborate lie. Even acknowledging that it's a "white", well-intentioned lie (designed to preserve the mistaken expectation that "y.txt" is the same as "Y.txt") — I don't like when computers lie to me; do you?
Like every lie, this one has weird consequences. One of them is today's RCE in the OP. Another one was CVE-2014-9390. Myriad others.
Linux rejects the whole notion of filename case-insensitivity, and demonstrates how computers actually work. It becomes easier on developers and more secure on users.
Lastly, don't feel that I'm attacking you; I'm opposing an idea. So, here's a tip: you can set up case-insensitive filename completion in bash, so that TAB will correct your casing mistakes for you. It's a simple one-line change involving putting `set completion-ignore-case on` into an inputrc.
Fair enough. But notice: systems tend to expect that humans interacting with them observe basic rules. "The capital/lowercase variants of western alphabet letters are each represented as a distinct character" is one such generic, basic rule with computer systems. Especially if we zoom out from filesystems into a broader context (HTTP, JSON, programming languages, etc.), you can't deny it; it's a fact.
We do have the option to ignore the fact and say "What bytes? I don't care. Guess what I mean, and lie to me as well as you can so I can stay happy in my ignorance", but, see, coordinating good support for that isn't easy. Minor wrinkles in it continue causing burns, sometimes RCEs. Maybe "doing in Rome as the Romans do" isn't such bad advice after all?
I didn't say it's unacceptable, nor did I mean that. In many contexts, it'd be tough without case-insensitive regex matching, for example. Reinforcing my point, //i gains issues once applied to the entirety of Unicode.
It's almost comical: people continue insisting on "letters not code points" knowing very well how computers are bad with guesswork and under-defined notions. Issues stemming from that keep coming up. What if, instead, the norm accepted that 'A' ≠ 'a' and stopped creating problems which computers are known to deal poorly with?
I've had git repositories on vfat formatted usb drives before. This isn't something I do with much frequency but it's not that exotic of a use case.
There's also the possibility of a git repo on a SMB share. That's not a use case I have, but it's not too difficult to imagine in a corporate environment.
Yep, I sensed a similar relaxation when reading this. But whatever, don't be silly, the title is long enough as it is.
Anyway, what I'm actually thinking when something like this is disclosed is how many more similar things must be known to a team of malicious professionals at Unit 8200 or whatever. I don't think I would reasonably suspect "git clone" of being capable of something like that. How many more things that I don't suspect to be dangerous actually are? It feels almost pointless to worry about it.
You'd be hard-pressed to get this on Windows/NTFS because of the way symbolic links work and if they're implemented as a link or a junction. This is really a macOS problem.
> Can we get the title changed to "on macOS and Windows?"
That isn't correct; it's a bug that manifests on case-insensitive[1] filesystems.
My colleagues who run a linux VM (our product targets Linux only) tend to git-clone onto a mounted NTFS partition so they can access the source from both host and guest. This bug will affect them even though they are running on Linux.
My other colleagues who run an actual Linux box tend to use a fast removable drive to git-clone (so they can work on it from home), and said drives tend to be FAT, which will also be susceptible to this bug.
If, on the third hand, you're running Windows and using ext4 as a filesystem (removable drive, mounted partition, whatever), then this bug should not affect you.
TL;DR: the OS doesn't matter, the filesystem does.
[1] They aren't, not really; NTFS is case-sensitive! It preserves the case when writing filenames and ignores it when reading filenames.
This should be fixed especially for those who want to inspect the code in a repository before running it. But anyone should keep in mind that malicious repositories can do a lot of bad things after cloning, even without this bug.
https://github.com/gitster/git/commit/684dd4c2b414bcf648505e...
(Surprise, the root cause is a cache)