Setting the TZ environment variable avoids thousands of system calls (2017) (packagecloud.io)
198 points by zdw on Jan 11, 2023 | 63 comments



To all programmers here, the TZ=:<zonefile> syntax is currently unsupported in the icu library (International Components for Unicode):

https://unicode-org.atlassian.net/browse/ICU-13694

https://github.com/unicode-org/icu/pull/2213

This affects all packages that have icu as a dependency, one of them being Node.js.

https://github.com/nodejs/node/issues/37271

I discovered this the hard way when some code malfunctioned shortly after daylight saving time kicked in.


Why does the icu string library care about time zones?


Locale also bites people all the time with simple shell commands like sort or grep. A giant speedup can be had by setting LC_ALL=C.

https://www.inmotionhosting.com/support/website/speed-up-gre...


The fact that this trick exists reveals a deeper problem with the specification of how tools like sort or grep should behave: if they are supposed to act locale-specific, then this "trick" simply leads to wrong results and should be avoided. If, however, these tools are supposed to act locale-insensitive, i.e. according to the "C" locale, then the need for this trick should be tracked as a bug and the locale-sensitivity should be removed from the code.

How are these tools specified to act? What were the requirements for them?

An interesting clue is how these tools are used at all. Less computer-savvy users don't touch the command line anyway, so they are used by power users, developers, system operators and the like. HOW are they used? These tools cannot search through what regular people consider a text file, i.e. a Word document, but only what a developer would consider a "plain text" file with pure ASCII, possibly containing in-band markup. In particular, logs, configuration files, and code.

There is a strong tendency to treat these with the "C" locale, making the locale-sensitivity of the tools unnecessary and even harmful. However, another aspect comes into play: these files sometimes use not ASCII but UTF-8, e.g. Java source files. The next obvious problem is that these tools treat character encoding as part of the locale; it should be derived from the file type. The problem after that is that Unix doesn't have a solid concept of file types and cannot distinguish ASCII plain-text files from UTF-8 plain-text files, so it resorts to a crude workaround: using an unrelated setting (the locale) and making the user solve the problem.

OTOH, maybe I'm just "afraid of the command line" /rant


> If they are supposed to act locale-specific, then this "trick" simply leads to wrong results and should be avoided.

These tools are definitely supposed to act locale-specific (e.g. the '.' regular expression should match one Unicode codepoint even if it is multibyte), but in many cases one uses them on ASCII-only data (e.g. logs from system daemons that run with LC_ALL=C anyway), so using this 'trick' is fine.

Mainly, I use this trick not for the speedup, but because grep (with a UTF-8 locale) has issues with invalid UTF-8 sequences in data, while grep with the C locale accepts them.
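
For illustration, a minimal C sketch of how '.' depends on the locale (assuming the en_US.UTF-8 locale is installed; the locale names and helper are just examples):

    /* Sketch: POSIX regex matching is locale-sensitive.  "^.$" requires the
     * input to be exactly one character; the two-byte UTF-8 sequence for "é"
     * is one character in a UTF-8 locale but two bytes in the "C" locale. */
    #include <locale.h>
    #include <regex.h>
    #include <stdio.h>

    /* Returns 1 if "^.$" matches "é" under the given locale, 0 if not,
     * -1 if the locale is not installed or the pattern fails to compile. */
    static int dot_matches_e_acute(const char *locale_name)
    {
        if (setlocale(LC_ALL, locale_name) == NULL)
            return -1;

        regex_t re;
        if (regcomp(&re, "^.$", REG_EXTENDED) != 0)
            return -1;

        int rc = regexec(&re, "\xc3\xa9", 0, NULL, 0);  /* UTF-8 "é" */
        regfree(&re);
        return rc == 0;
    }

    int main(void)
    {
        printf("C locale:    %d\n", dot_matches_e_acute("C"));           /* 0 */
        printf("en_US.UTF-8: %d\n", dot_matches_e_acute("en_US.UTF-8")); /* 1 */
        return 0;
    }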


The problem (or one of the problems) really is that the same utilities that are used for scripting are also used interactively.

In a script you want predictability, on the command line you want convenience.

In general, you'd want these tools to behave as with the C.UTF-8 locale. Support unicode but forego the other locale shenanigans such as alternative characters for the decimal point.

It never makes sense to treat files/documents differently depending on a system-wide or application-wide "locale" setting. The same document may be worked on by multiple people in different parts of the world.


Those tools are supposed to act locale specific. So yes, one should only force locale to C when they know that locale does not matter.

Locales and encodings (at least the ones that can be applied using the LC_* vars) generally all behave the same as long as (1) the characters in question are in the ASCII (0-127) range and (2) searches are case-sensitive or an occasional false match is acceptable. In my experience, this applies to the vast majority of the grep/find invocations I have seen.

In other words: don't put LC_ALL=C in the script which searches your music or document collections. Most other cases are fine with it.

The same applies to things like Java source code: as long as you are searching for FactoryConstructorIndirectorSingleton, you can treat UTF-8 as ASCII, because ASCII is a subset of it. '{' is code 123 no matter whether it's Latin-1, the "C" locale, UTF-8 or ISO-8859-3.


> Those tools are supposed to act locale specific.

Another response mentioned decimal points. In a locale that uses a decimal comma, how does (for example) grep decide whether an ASCII 46 symbol is a decimal point or a full stop?


Luckily, this does not matter for most tools -- for "grep", ASCII 46 is neither a decimal point nor a full stop, it is the "match any" character. "sort" is commonly used with non-numeric data (filenames, ISO dates) or with integer values (file sizes), and most uses of "find" can avoid floating point entirely (use -mmin instead of -mtime). And since both "expr" and bash's "$((...))" don't work on non-integer data, there is a strong incentive to avoid floating point in shell scripts.

And yes, sometimes you have no choice but to care about locales; maybe you are pulling data from a badly designed API or parsing files generated by someone who didn't know about "stat -c" and used "ls -l" instead. But I'd argue that in this case you should be explicit about setting the locale env, and you shouldn't rely on system settings. It's not like an API will suddenly start returning different data if you move your fetching script to a machine with a different locale.


Presumably you can set LC_ALL=C.UTF-8 in most environments to deal with that problem.


C.UTF-8 doesn't save you anything locale-wise? I believe you're still on the slow path for character matching.


> These files are sometimes NOT using the ASCII encoding but UTF-8, e.g. Java source files.

Why is this a problem? In 99.9999% of Java code, the UTF-8 characters aren't going to trip up an ASCII search.


It's a problem when e.g. searching for caf◌́e doesn't find café.


> It's a problem when e.g. searching for caf◌́e doesn't find café.

That doesn't even display in my browser[1]; I tried it in Goland[2], and it doesn't display there either, so that's the rare 0.0001% case that I wouldn't really worry about, because if the code has undisplayable Unicode sequences, there are bigger problems than searching.

[1] Chrome, on Mac

[2] Also on Mac


OP is trying to explain that accents in Unicode can be written in two ways: either "LATIN SMALL LETTER E" + "COMBINING ACUTE ACCENT" (two codepoints) or "LATIN SMALL LETTER E WITH ACUTE" (one codepoint). They both render the same in all browsers. But they don't compare the same unless you use locale-aware code.

To demonstrate this OP explicitly used "DOTTED CIRCLE" (◌) then added the "COMBINING ACUTE ACCENT" to that. Normally there would be no dotted circle.

I wrote an article on this a while ago: https://richardjharris.github.io/unicode-in-five-minutes/
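
For illustration, a minimal C sketch (not from the original comments) showing that the two forms are different byte sequences and therefore don't compare byte-equal:

    /* Sketch: NFC ("é" as one codepoint) vs. NFD ("e" + combining acute)
     * render identically but are different byte sequences, so a byte-wise
     * comparison -- which is what grep in the "C" locale effectively does --
     * treats them as different strings. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *nfc = "caf\xc3\xa9";   /* U+00E9 LATIN SMALL LETTER E WITH ACUTE */
        const char *nfd = "cafe\xcc\x81";  /* 'e' followed by U+0301 COMBINING ACUTE ACCENT */

        printf("NFC: %s (%zu bytes)\n", nfc, strlen(nfc));   /* 5 bytes */
        printf("NFD: %s (%zu bytes)\n", nfd, strlen(nfd));   /* 6 bytes */
        printf("byte-equal: %s\n", strcmp(nfc, nfd) == 0 ? "yes" : "no");
        return 0;
    }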


it displays fine here on Firefox on Linux: https://i.imgur.com/MoRqYL8.png


If you're using a Mac you have an easy way to reproduce the distinction: save a file called café.txt and look at what filename it has in a directory listing. (It will look the same but have a different byte sequence).


Neither in FF nor Safari on Mac.


ASCII is a subset of UTF-8. It’s time for all tools and paths to just assume UTF-8 encoding.

Most tasks assuming ASCII work fine on UTF-8 too (e.g. sorting). We do need to get rid of those byte order marks though; they break e.g. concatenation.


Sort orders are different per locale. For example German and Swedish both have ä and ö but sort them differently relative to the other letters.


The posix locale and us-english both have `B` and `a` but sort them differently. (Posix B < a, US-english a < B) :)
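
A minimal C sketch of these collation differences, assuming the en_US, de_DE and sv_SE UTF-8 locales are installed (the helper checks setlocale()'s return value because a missing locale would silently fall back):

    /* Sketch: strcoll() compares strings according to LC_COLLATE, so the
     * same pair of strings can order differently per locale.  German treats
     * "ä" like "a" (before "z"); Swedish sorts it after "z"; the C/POSIX
     * locale orders "B" before "a", while en_US.UTF-8 does the opposite. */
    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    static void compare(const char *loc, const char *a, const char *b)
    {
        if (setlocale(LC_COLLATE, loc) == NULL) {
            printf("%-12s (locale not installed)\n", loc);
            return;
        }
        printf("%-12s \"%s\" %s \"%s\"\n", loc, a,
               strcoll(a, b) < 0 ? "<" : ">", b);
    }

    int main(void)
    {
        compare("C",           "B", "a");
        compare("en_US.UTF-8", "B", "a");
        compare("de_DE.UTF-8", "\xc3\xa4", "z");  /* "ä" vs. "z" */
        compare("sv_SE.UTF-8", "\xc3\xa4", "z");
        return 0;
    }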


> Most tasks assuming ASCII work fine on UTF-8 too (eg sorting).

cliché < clichñ < clich́e

NFC vs. NFD.


b'cliche\xcc\x81' (cliché) < b'clichz' (clichz) < b'clich\xc3\xa9' (cliché) < b'clich\xc3\xb1' (clichñ)

Combining marks like accents are placed after the base character.


You're right. Sorry, I get confused by Qt's multi-decade old bug that they're not only refusing to fix but keep porting to new major versions.

I bet they have unit tests asserting that the bug is still present.


Not to mention that you have to set the locale to C for character ranges to not be undefined behavior.

https://sourceware.org/bugzilla/show_bug.cgi?id=23393


also got bitten by this: http://xahlee.info/comp/unix_uniq_unicode_bug.html not sure if it's fixed now.



It would be very useful to have, say, a locale called U or similar with standard settings: UTF-8 encoding, ISO 8601 date/time, A4 paper size, en-US language, etc.

Does anyone have an easy recipe to define my own?


We have C.UTF-8 ("Computer English") which is not quite there (date/time and paper size do not match) but at least is common and omnipresent.


Discussed at the time:

How setting the TZ environment variable avoids thousands of system calls - https://news.ycombinator.com/item?id=13697555 - Feb 2017 (143 comments)


I started globally exporting TZ=:/etc/localtime after the previous discussion, but have since stopped, because some desktop applications (Slack, for example) were not reading the timezone correctly.


[pulled up from some levels of reply down]

No, the behavior is not correct, because changing /etc/localtime is an asynchronous operation. You're changing a file in the file system and expecting to observe a change in a running process. It's, by design, racy — and that's OK.

Taking this into account, the correct behavior is to keep a timestamp of when the last stat() call was made, and only stat() again if that was longer than X ago. Even a few seconds would clamp the stat syscall rate down to ambient noise.

(NB: changing /etc/localtime, and then starting a new process is not asynchronous. But that's not the issue here. Two processes are doing things here — changing /etc/localtime and invoking localtime() without synchronization. You shouldn't, can't and mustn't rely on ordering of such unsynchronized events.)
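
A sketch of that rate-limiting idea (the names and the interval are hypothetical, and a real libc would also need locking for thread safety):

    /* Hypothetical sketch of the rate limiting described above: remember when
     * /etc/localtime was last stat()ed and skip the syscall if that was less
     * than RECHECK_INTERVAL seconds ago.  Not glibc code; not thread-safe. */
    #include <stdbool.h>
    #include <sys/stat.h>
    #include <time.h>

    #define RECHECK_INTERVAL 5  /* seconds between stat() calls */

    /* Returns true if the cached timezone data should be reloaded. */
    static bool tzfile_maybe_changed(void)
    {
        static time_t last_check;
        static struct stat last_st;

        time_t now = time(NULL);
        if (last_check != 0 && now - last_check < RECHECK_INTERVAL)
            return false;               /* checked recently: assume unchanged */
        last_check = now;

        struct stat st;
        if (stat("/etc/localtime", &st) != 0)
            return false;               /* keep whatever is already loaded */

        bool changed = st.st_mtime != last_st.st_mtime ||
                       st.st_ino   != last_st.st_ino;
        last_st = st;
        return changed;
    }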


> No, the behavior is not correct, because changing /etc/localtime is an asynchronous operation

Well, technically you can do both from the same process (i.e. localtime(), then unlink() + symlink() to change /etc/localtime, then localtime() again); in that case it would be synchronous.

But in general I agree; if the timeout were small enough (say < 100 ms), I guess nobody would object.


As noted in the previous thread (https://news.ycombinator.com/item?id=13701320), there is no good reason why libc behaves differently when accessing the default file vs. one explicitly specified by TZ. Whether to stat for changes should be orthogonal to which timezone file is used (or to whether the default is used or not). There should rather be a separate way to configure the stat behavior, assuming it makes sense at all. (I don't know how many programs can actually properly deal with intermittent changes to the timezone definitions.)

One reply under the above link mentions user vs. admin, but that's not a good argument, because users specifying TZ may still want a long-running background process to observe updates of the timezone file, and also TZ may have been set system-wide rather than by the user.


Sure there's a good reason: if TZ is set, glibc can assume that the timezone is going to be fixed for the duration of the process's life. If it is not set, it will consult /etc/localtime, and will then react to changes in the system timezone.

(Now, when you use TZ=:/etc/localtime, I suppose you could expect that glibc would see that it's a symlink [maybe], and then decide to re-check it each time, but I think the current behavior of not doing that is consistent and sane.)


> glibc can assume that the timezone is going to be fixed for the duration of the process's life.

From what I'm seeing, it doesn't seem to actually assume that. It does check whether the TZ name changed between calls and will reload a new tzfile if it has. This same mechanism is what causes the file to be loaded whenever getenv("TZ") returns NULL.


When I set TZ=:/etc/localtime it only stats and reads the file once. Debian 11 with glibc 2.31.
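
A quick way to check this for yourself: run a small loop under strace and compare the syscall summaries with and without TZ set (a sketch; the loop count is arbitrary):

    /* Sketch: call localtime() in a loop and compare the syscall summary from
     * "strace -c ./a.out" with and without TZ=:/etc/localtime in the
     * environment.  Without TZ, expect roughly one stat-family call per
     * localtime() call; with TZ set, only a handful at startup. */
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        char buf[64] = "";
        for (int i = 0; i < 1000; i++) {
            time_t now = time(NULL);
            strftime(buf, sizeof buf, "%F %T %Z", localtime(&now));
        }
        printf("last: %s\n", buf);
        return 0;
    }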


Yes, and if you use setenv(3) to change what the value of TZ is after that point, then the new value of TZ will be observed and the new file will be loaded at that point.

See 'tzset_internal' in time/tzset.c in glibc.
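
A sketch of the behavior described above: after setenv(3) changes TZ, the next localtime() call observes the new value and loads the new zone file (the zoneinfo paths assume a standard Linux layout):

    /* Sketch of the behavior described above: after setenv() changes TZ, the
     * next localtime() call observes the new value and loads the new zone
     * file. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static void show(const char *label)
    {
        char buf[64];
        time_t now = time(NULL);
        strftime(buf, sizeof buf, "%F %T %Z", localtime(&now));
        printf("%-12s %s\n", label, buf);
    }

    int main(void)
    {
        setenv("TZ", ":/usr/share/zoneinfo/UTC", 1);
        show("UTC:");
        setenv("TZ", ":/usr/share/zoneinfo/Asia/Tokyo", 1);
        show("Asia/Tokyo:");  /* different offset/abbreviation, no explicit tzset() needed */
        return 0;
    }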


There should rather not be magic stat'ing and loading of random files in a library function that pretends to be stateless.

If it needs to load state from a file to do its thing, it should be split into initialization, the actual operation(s), and cleanup. (You know, like creating, using, and destroying an object, but this is C so this will be explicit calls.)


I wouldn't consider the configuration file that says what the local timezone is a "random file" in the context of a function whose sole purpose it is to convert a timestamp to the local timezone. In fact, loading the file makes the function stateless; because if it didn't, it'd have to cache the timezone somewhere, which can get out of sync with the actual configuration.


Yes, caching would be even worse with no control over the cache's lifetime.

You load the file into memory when I tell you to, you stringify timestamps when I tell you to, using the memory pointer I give you, and you release the memory when I tell you to.

That's what a sane API looks like.


I didn’t know that. It can be helpful to AcceptEnv TZ and SendEnv TZ over ssh. At least, ls provides relatable times.


Good call! I put this in place on my mail server and set it on my ansible base layer. Nice username too :-)


Seems like you can safely define it for a single run of any script or anything where you don't need to factor in a timezone change. Unless I've mistaken everything I've read.

To personalize PHP execution I used to putenv("TZ=America/Los_Angeles") for example at the top of a script based on the user's desired timezone. It wound up making all the other time based calls localized which was great.

date_default_timezone_set() is how I do it now, I wonder if it's as efficient (I should strace it)


> Ubuntu Precise (12.04) and Ubuntu Xenial (16.04)

Has glibc been updated to alleviate this since?


Not as of glibc 2.31


Why should it be? The behavior is correct: /etc/localtime could have changed since the last read, so it is necessary to check whether the old value can still be used. The real problem is calling localtime() so often.


[moved my comment to top-level - https://news.ycombinator.com/item?id=34353072]

TL;DR: no, not correct, put a rate limit on it.


If that behavior is correct, then the behavior with TZ=:/etc/localtime must be incorrect?


It is not incorrect in general, but it is kind of a hack that may be incorrect in some situations (as it does not reflect a change in the system timezone), so it is not suitable as the default behavior.

Default behavior - use the current system timezone.

Explicit TZ - use a specific timezone defined by the TZ variable, either directly (e.g. TZ="NZST-12:00:00NZDT-13:00:00,M10.1.0,M3.3.0") or indirectly via a file (e.g. TZ=":/usr/share/zoneinfo/Europe/Brussels"). As this defines a specific timezone, it is not supposed to change.


So basically, always set TZ=:/etc/localtime.


> Without setting TZ during normal operations yields approximately: 14,925 calls to stat over a 30 second period (or roughly 497 stats per second).

> With TZ set during the same time period results in 8 calls to stat over a 30 second period.

This is interesting, but what would be even more interesting is what that means for wall time. My gut feeling is: probably not that much.


Syscalls can be heavier than expected. One example is when an application is run inside gVisor. Another example is when a lot of eBPF code is attached. A third example is when a program is run under strace.

Disclaimer: I'm working on ClickHouse[1], and it is used by thousands of companies in unimaginable environments. It has to work in every possible condition... That's why we set the TZ variable at startup and also embed the timezones into the binary. And we don't use the glibc functions for timezone operations because they are astonishingly slow.

https://github.com/ClickHouse/ClickHouse/


If you find yourself in the position of paying over, say, $250,000/month for cloud computing, things like this can have a monetary impact that your clocks ultimately don't care about.


500 syscalls/sec is a huge amount of useless overhead for a high performance system.


This seriously impacted performance of Zoom on my linux laptop. It's still pretty bad, but significantly better with TZ set.


Wasted effort affects battery life.


More work is more work.


It's not clear whether TZ=Asia/Almaty or TZ=UTC helps to avoid those thousands of calls.

Also, what about Alpine's musl?


However, tzset(3) (or “man timezone”) says that if the filespec does not start with a slash, the file specification is relative to the system timezone directory, so e.g. “TZ=:Asia/Almaty” should give the desired effect.


You can also set TZ="" and UTC will be assumed.


This seems to be the same thing, question asked in 2010...

https://stackoverflow.com/questions/4554271/how-to-avoid-exc...


Please someone tell me this isn't the case in the default installation of Debian.



