The fact that this trick exists reveals a deeper problem with how tools like sort or grep are specified to behave: if they are supposed to act locale-specific, then this "trick" simply produces wrong results and should be avoided. If, however, these tools are supposed to act locale-independent, i.e. according to the "C" locale, then the need for this trick should be tracked as a bug and the locale sensitivity removed from the code.
How are these tools specified to act? What were the requirements for them?
An interesting clue is how these tools are used at all. Less computer-savvy users don't touch the command line anyway, so these tools are used by power users, developers, system operators and the like. HOW are they used? These tools cannot search through what regular people consider a text file, i.e. a Word document, but only what a developer would consider a "plain text" file of pure ASCII, possibly containing in-band markup; in particular, logs, configuration files, and code.
There is a strong tendency to treat these with the "C" locale, making the locale-sensitivity of the tools unnecessary and even harmful. However, another aspect comes into play: these files sometimes use not ASCII but UTF-8, e.g. Java source files. The next obvious problem is that these tools treat character encoding as part of the locale; it should be derived from the file type. The problem after that is that Unix doesn't have a solid concept of file types and cannot distinguish ASCII plain-text files from UTF-8 plain-text files, so it resorts to a crude workaround: using an unrelated setting (the locale) and making the user solve the problem.
OTOH, maybe I'm just "afraid of the command line" /rant
> If they are supposed to act locale-specific, then this "trick" simply leads to wrong results and should be avoided.
These tools are definitely supposed to act locale-specific (e.g. the '.' regular expression should match one Unicode codepoint even if it is multibyte), but in many cases one uses them on ASCII-only data (e.g. logs from system daemons that run with LC=C anyway), so using this 'trick' is fine.
Mainly, I use this trick not for the speedup, but because grep with a UTF-8 locale has issues with invalid UTF-8 sequences in the data, while grep with the C locale accepts them.
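A minimal sketch of what I mean. The 0xFF byte (written \377 for printf portability) is never valid UTF-8; the exact behavior of the UTF-8 invocation varies by grep version and by which locales are installed, so that call is shown without an expected result:

```shell
# Write a log line containing a byte (0xFF, octal \377) that is invalid UTF-8.
printf 'error: caf\377 disk full\n' > /tmp/app.log

# Under the C locale, grep treats the input as raw bytes and matches fine.
LC_ALL=C grep -c 'disk full' /tmp/app.log    # prints 1

# Under a UTF-8 locale, some grep versions handle lines that are not
# valid UTF-8 differently (behavior varies by version and platform).
LC_ALL=en_US.UTF-8 grep -c 'disk full' /tmp/app.log
```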
The problem (or one of the problems) really is that the same utilities that are used for scripting are also used interactively.
In a script you want predictability, on the command line you want convenience.
In general, you'd want these tools to behave as with the C.UTF-8 locale: support Unicode but forgo the other locale shenanigans such as alternative characters for the decimal point.
It never makes sense to treat files/documents differently depending on a system-wide or application-wide "locale" setting. The same document may be worked on by multiple people in different parts of the world.
Those tools are supposed to act locale-specific. So yes, one should only force the locale to C when one knows that the locale does not matter.
Locales and encodings (at least the ones that can be applied using LC vars) generally all behave the same as long as (1) the characters in question are in the ASCII (0-127) range and (2) searches are case-sensitive, or an occasional false match is acceptable. In my experience, this covers the vast majority of the grep/find invocations I have seen.
In other words: don't put LC_ALL=C in the script which searches your music or document collections. Most other cases are fine with it.
The same applies to things like Java source code: as long as you are searching for FactoryConstructorIndirectorSingleton, you can treat UTF-8 as ASCII, because ASCII is a subset of it. { is code 123 no matter whether it's Latin-1, the "C locale", UTF-8 or ISO-8859-3.
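A small demonstration of that point, using a hypothetical file /tmp/Demo.java: the pattern is pure ASCII, so a byte-for-byte search under the C locale finds it even though the file also contains multibyte UTF-8 text.

```shell
# A UTF-8 file mixing non-ASCII comment text with an ASCII identifier.
printf '// naïve Größe helper\nclass FactoryConstructorIndirectorSingleton {}\n' > /tmp/Demo.java

# The pattern is pure ASCII, so matching raw bytes under the C locale works:
LC_ALL=C grep -c 'FactoryConstructorIndirectorSingleton' /tmp/Demo.java    # prints 1

# And '{' really is byte 123 regardless of encoding:
printf '{' | od -An -tu1    # shows byte value 123
```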
> Those tools are supposed to act locale specific.
Another response mentioned decimal points. In a locale that uses a decimal comma, how does (for example) grep know whether an ASCII 46 symbol is a decimal point or a full stop?
Luckily, this does not matter for most tools: for "grep", ASCII 46 is neither decimal point nor full stop, it is the "match any" character. "sort" is commonly used with non-numeric data (filenames, ISO dates) or with integer values (file sizes), and most uses of "find" can avoid floating point entirely (use -mmin instead of -mtime). And since both "expr" and bash's "$((...))" don't work on non-integer data, there is a strong incentive to avoid floating point in shell scripts.
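A short sketch of those integer-only habits; nothing here depends on LC_NUMERIC, since no decimal separator ever appears:

```shell
# Numeric sort on integers: no decimal separator, so no locale surprises.
printf '10\n2\n33\n' | sort -n    # prints 2, 10, 33

# Shell arithmetic is integer-only; express "1.5 days" as minutes instead
# of fighting floating point, then use find -mmin rather than -mtime.
minutes=$(( (24 + 12) * 60 ))     # 1.5 days = 2160 minutes
# find /var/log -mmin -"$minutes"   # files modified within the last 1.5 days
```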
And yes, sometimes you have no choice but to care about locales; maybe you are pulling data from a badly designed API or parsing files generated by someone who didn't know about "stat -c" and used "ls -l" instead. But I'd argue that in this case you should be explicit about setting the locale env, and you shouldn't rely on system settings. It's not as if an API will suddenly start returning different data when you move your fetching script to a machine with a different locale.
> It's a problem when e.g. searching for caf◌́e doesn't find café.
That doesn't even display in my browser[1]; I tried it in Goland[2], and it doesn't display there either, so that's the rare 0.0001% case I wouldn't really worry about, because if the code has undisplayable Unicode sequences, there are bigger problems than searching.
OP is trying to explain that accents in Unicode can be written in two ways: either "LATIN SMALL LETTER E" followed by "COMBINING ACUTE ACCENT" (two codepoints) or "LATIN SMALL LETTER E WITH ACUTE" (one codepoint). They both render the same in all browsers, but they don't compare equal unless you use code that normalizes them.
To demonstrate this OP explicitly used "DOTTED CIRCLE" (◌) then added the "COMBINING ACUTE ACCENT" to that. Normally there would be no dotted circle.
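The two forms can be reproduced with plain printf (octal escapes used for portability); both render as "café", but byte-oriented tools see different strings:

```shell
# Precomposed (NFC): U+00E9 is the two UTF-8 bytes \303\251.
nfc=$(printf 'caf\303\251')
# Decomposed (NFD): 'e' plus U+0301 COMBINING ACUTE ACCENT (\314\201).
nfd=$(printf 'cafe\314\201')

printf '%s' "$nfc" | wc -c    # 5 bytes
printf '%s' "$nfd" | wc -c    # 6 bytes

# Byte-wise matching does not normalize, so one form won't find the other:
printf '%s\n' "$nfd" | grep -cF "$nfc"    # prints 0 (and exits nonzero)
```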
If you're using a Mac you have an easy way to reproduce the distinction: save a file called café.txt and look at what filename it has in a directory listing. (It will look the same but have a different byte sequence).
It would be very useful to have, say, a locale called U or similar with standard things like: UTF-8 encoding, ISO 8601 date and time, A4 paper size, en-US language, etc.
Does anyone have an easy recipe to define my own?
I started using a globally exported TZ=:/etc/localtime after a previous discussion, but have since stopped, because some desktop applications (Slack, for example) were not reading the timezone correctly.
No, the behavior is not correct, because changing /etc/localtime is an asynchronous operation. You're changing a file in the file system and expecting to observe a change in a running process. It's, by design, racy — and that's OK.
Taking this into account, the correct behavior is to keep a timestamp of the last stat() call, and only stat() again if it was longer than X ago. Even a few seconds would clamp the stat syscall rate down to ambient noise.
(NB: changing /etc/localtime, and then starting a new process is not asynchronous. But that's not the issue here. Two processes are doing things here — changing /etc/localtime and invoking localtime() without synchronization. You shouldn't, can't and mustn't rely on ordering of such unsynchronized events.)
> No, the behavior is not correct, because changing /etc/localtime is an asynchronous operation
Well, technically you can do both from the same process (i.e. localtime(), then unlink() + symlink() to change /etc/localtime, then another localtime()); in that case it would be synchronous.
But in general I agree; if the timeout were small enough (say < 100 ms), I guess nobody would object.
As noted in the previous thread (https://news.ycombinator.com/item?id=13701320), there is no good reason for why libc behaves differently when accessing the default file vs. one explicitly specified by TZ. Whether to stat for changes should be orthogonal to which timezone file is used (or to whether the default is used or not). There should rather be a separate way to configure the stat behavior, assuming it makes sense at all. (I don’t know how many programs can actually properly deal with intermittent changes to the timezone definitions.)
One reply comment of the above link mentions user vs. admin, but that’s not a good argument, because users specifying TZ may still want a long-running background process to observe updates of the timezone file, and also TZ may have been set system-wide rather than by the user.
Sure there's a good reason: if TZ is set, glibc can assume that the timezone is going to be fixed for the duration of the process's life. If it is not set, it will consult /etc/localtime, and will then react to changes in the system timezone.
(Now, when you use TZ=:/etc/localtime, I suppose you could expect that glibc would see that it's a symlink [maybe], and then decide to re-check it each time, but I think the current behavior of not doing that is consistent and sane.)
> glibc can assume that the timezone is going to be fixed for the duration of the process's life.
From what I'm seeing, it doesn't actually assume that. It checks whether the TZ name changed between calls and will reload the tzfile if it has. This same mechanism is what causes the file to be loaded whenever getenv("TZ") returns NULL.
Yes, and if you use setenv(3) to change what the value of TZ is after that point, then the new value of TZ will be observed and the new file will be loaded at that point.
There should rather not be magic stat'ing and loading of random files in a library function that pretends to be stateless.
If it needs to load state from a file to do its thing, it should be split into initialization, the actual operation(s), and cleanup. (You know, like creating, using, and destroying an object, but this is C so this will be explicit calls.)
I wouldn't consider the configuration file that says what the local timezone is a "random file" in the context of a function whose sole purpose is to convert a timestamp to the local timezone. In fact, loading the file is what makes the function stateless: if it didn't, it'd have to cache the timezone somewhere, and that cache could get out of sync with the actual configuration.
Yes, caching would be even worse with no control over the cache's lifetime.
You load the file into memory when I tell you to, you stringify timestamps when I tell you to, using the memory pointer I give you, and you release the memory when I tell you to.
Seems like you can safely define it for a single run of any script or anything where you don't need to factor in a timezone change. Unless I've mistaken everything I've read.
To personalize PHP execution I used to call putenv("TZ=America/Los_Angeles"), for example, at the top of a script, based on the user's desired timezone. It wound up localizing all the other time-based calls, which was great.
date_default_timezone_set() is how I do it now, I wonder if it's as efficient (I should strace it)
Why should it be? The behavior is correct: /etc/localtime could have changed since the last read, so it is necessary to check whether the old value can still be used. The real problem is calling localtime() so often.
It is not incorrect in general, but it is a kind of hack that may be incorrect in some situations (as it does not reflect a change in the system timezone), so it is not suitable as the default behavior.
Default behavior - use current system timezone
Explicit TZ - use the specific timezone defined by the TZ variable, defined either directly (e.g. TZ="NZST-12:00:00NZDT-13:00:00,M10.1.0,M3.3.0") or indirectly from a file (e.g. TZ=":/usr/share/zoneinfo/Europe/Brussels"). As this defines a specific timezone, it is not supposed to change.
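For illustration, the direct (file-free) POSIX forms can be tried with GNU date (the -d @N syntax is a GNU extension):

```shell
# A POSIX TZ string defines the zone directly; no zoneinfo file is read.
TZ=UTC0 date -d @0 +%Y-%m-%dT%H:%M    # prints 1970-01-01T00:00

# EST5 means "5 hours behind UTC", so the epoch lands on New Year's Eve:
TZ=EST5 date -d @0 +%Y-%m-%dT%H:%M    # prints 1969-12-31T19:00

# The indirect, file-based form (requires the tzdata file to exist):
TZ=:/usr/share/zoneinfo/Europe/Brussels date -d @0 +%H
```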
Syscalls can be heavier than expected. One example is when an application is run inside gVisor. Another example is when a lot of eBPF code is attached. A third example is when a program is run under strace.
Disclaimer: I'm working on ClickHouse[1], and it is used by thousands of companies in unimaginable environments. It has to work in every possible condition... That's why we set the TZ variable at startup and also embed the timezones into the binary. And we don't use the glibc functions for timezone operations because they are astonishingly slow.
If you find yourself in the position of paying over, say, $250,000/month for cloud computing, things like this can have a monetary impact that your clocks ultimately don't care about.
However, tzset(3) (or “man timezone”) says that if the filespec does not start with a slash, the file specification is relative to the system timezone directory, so e.g. “TZ=:Asia/Almaty” should give the desired effect.
https://unicode-org.atlassian.net/browse/ICU-13694
https://github.com/unicode-org/icu/pull/2213
This affects all packages that have icu as a dependency, one of them being Node.js.
https://github.com/nodejs/node/issues/37271
I discovered this the hard way when some code malfunctioned shortly after daylight savings time kicked in.