OpenBSD cron(8) now supports random ranges with steps (undeadly.org)
149 points by gslin on May 8, 2023 | 85 comments



Given how taxing the "thundering herd" effect can be on mirrors and websites (RSS readers!), you'd think this sort of thing should've been in cron since at least the mid-90s.

Once again, OpenBSD with the simple, obvious solution that everyone else kinda overlooked. I hope every other cron out there copies and ships this as soon as possible.


>OpenBSD with the simple, obvious solution, that everyone else kinda overlooked

Systemd Timers have had this for a while.


They said simple and obvious, which much of systemd is not.

Personally I find the OpenBSD solution far more elegant and UNIXy than the systemd one, but to each their own.


Why does the complexity of the rest of systemd matter when comparing the implementation of this feature?

What would you say is wrong with RandomizedDelaySec?


The UI is too complicated: you have to type some concatenated keyword rather than something visually intuitive.


On the other hand, I wouldn't be able to understand what the hell this cron string is. I actually have no idea about the cron format despite having used it multiple times; I have to read the man page every time. Also, different software implements it differently.

  [Timer]
  OnCalendar=daily
  RandomizedDelaySec=12h
might take a few more seconds to type, but it's definitely readable without any additional documentation.


The option says Sec but the value says h? What does that mean?


> The arguments to the directives are time spans configured in seconds. Example: "OnBootSec=50" means 50s after boot-up. The argument may also include time units. Example: "OnBootSec=5h 30min" means 5 hours and 30 minutes after boot-up.

https://www.freedesktop.org/software/systemd/man/systemd.tim...

I.e. seconds if no units are specified.


Sec is a standard suffix for time values; anything ending with Sec accepts a value in seconds. 12h is shorthand for 'the number of seconds in 12 hours'.


It's standard to say 12h and expect this to read as 43200?

Sounds like a lousy standard if the correct way to use it is to say "I want to delay by 12 hour seconds". What is even "12 hour seconds"?


Makes sense once you know it.

"Sec" suffix indicates time and implements a default of seconds, where a suffix to the value indicates a change in unit.

It would be like asking for "Memory Allocation (MB)" but accepting "12G" for 12 GB.


In our code at work we have constants like HOUR=3600 and RESTART_TIME_SECS = 6 * HOUR. It makes sense to me. If it doesn't for you, feel free to use something else I guess.


Standard where? I have never seen this before and I’ve been reading *nix configuration files for 20+ years


Standard within systemd (i.e. consistent).


What's more simple than an ini file?


Reinventing the same features many times over in many different places, of course.


    RandomizedDelaySec=


But, see, a number of people hate systemd, so that Doesn't Count.

Even if their cron does...


Even when discussing OpenBSD somehow we end up debating systemd...


Yeah, but with systemd timers it's just another ad hoc hack.


I'm not hot for systemd but this https://www.freedesktop.org/software/systemd/man/systemd.tim... looks robust, far from an ad-hoc hack.


How does it look robust? Because someone wrote a man page and there's lots of boilerplate in it?

Proper design reduces complexity. The above adds a lot of complexity, and thus it's a hack. Not the appearance.


That's not a hack though; a hack is gluing things together to fix a specific bug that isn't easily solved because it stems from the design rather than a mere mistake.

Otherwise, by your standard of "hack", anything beyond hello world and baby's first input program is a hack, because everything else requires boilerplate.


It's not a hack because of the boilerplate - it's a hack because it's functionality implemented in the wrong place. I think the boilerplate made you believe it's not a hack by making it look professional(ish), i.e. someone spent time putting a lot of lipstick on that pig, man page and all.


How is it functionality implemented in the wrong place?


FreeBSD cron(8) has -j, which makes the daemon add a random sleep of up to 60 seconds to each task. This was added in FreeBSD 5.3, committed 19 years ago. (FreeBSD 5.3 was released November 6, 2004.)

https://github.com/freebsd/freebsd-src/commit/f5896baf9c429c...
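
If you want to try it, here's a minimal sketch of turning it on at boot; the rc.conf variable name and the flag syntax are from memory, so double-check rc.conf(5) and cron(8) on your release:

    # Assumed rc.conf knob for passing flags to cron(8) on FreeBSD;
    # -j adds up to the given number of seconds (max 60, per the man page) of random sleep per job.
    cron_flags="-j 30"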


A random sleep of up to 60 seconds doesn't really solve the problem the OpenBSD changes do, especially when your jobs take longer than 60s.

  For example, instead of "0-59/10" in the minutes field, "0~59/10" can be used to run a command every 10 minutes where the first command starts at a random offset in the range [0,9]. The high and low numbers are optional, "~/10" can be used instead.
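
To make that concrete, a hypothetical crontab line using the new syntax (the command path is made up):

    # Every 10 minutes, with the whole schedule shifted by a random 0-9 minute offset:
    0~59/10 * * * * /usr/local/bin/sync-mirror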


A random sleep of up to 10 seconds turned a 150gbit/sec spike into a 5gbit/sec spike on the Akamai bill from a newspaper app I once worked on...


That's surprising. You'd think spreading the workload start over 10 seconds would lower the size of spikes (integrated over a second) by at most a factor of 10.

But the above point is still true: many jobs take a few minutes to run. 60s of dispersion in start time is better than nothing, but you really want more.

(In this case, things are still quantized to a minute boundary, so you'd really want both).


> That's surprising. You'd think spreading the workload start over 10 seconds would lower the size of spikes (integrated over a second) by at most a factor of 10.

If the delay is on the reading side, away from Akamai and behind a cache, then 10 concurrent requests for X could result in ten lots of data transfer because it isn't in the cache yet, whereas 10 requests with a short delay between them are enough for the first request to prime the local cache before the rest start.

There are a number of reasons a sudden glut of activity could balloon bandwidth or CPU/memory costs more than you might expect.

Without a chunk more detail about the system in question, this is just random speculation of course.


Good points.

Thinking about this-- this is Akamai, who has historically charged for midgress. Liveness of cache could be very important.


I'm not disputing that it prevents a subset of the same class of problem. It's just wholly incomplete compared to the OpenBSD implementation, to the degree that it's disingenuous to say NetBSD already implemented it.


FreeBSD, not NetBSD.

They're on different timescales. The OpenBSD start times are still quantized to the minute, I believe.

Both solutions would complement each other.


ah you're right, my bsd


That threw me for a loop when I realized the last time I used FreeBSD was in the 4.x days - on a desktop, no less. That was actually something of a glory period, at least for the hardware I had at the time... Soundblaster OSS drivers that actually did hardware mixing, the proprietary Nvidia driver that actually gave working 3D acceleration on the card I had at the time (Geforce 2 GTS maybe?) - this was at least a year or two before that driver was released for Linux. I think it even had working Java.

It was such a breath of fresh air compared to Linux at the time because it was a coherent, engineered, documented system. When you didn't always have reliable internet (and at least for me, even when I did, it was something like 128K DSL), it was a huge deal to have well-written man pages, whereas on Linux half the time the man page would just tell you to scream into the void, err, run GNU info.

This was still in the period when the GPL scared off corps.


I first ran into randomized delays in cronie (it also uses ~ for this, and there's a RANDOM_DELAY variable too), after Red Hat switched to it at some point years ago. Personally, I never really used it, but it's nice that it's there.
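
Roughly like this, if memory serves; treat it as a sketch and check cronie's crontab(5) for the exact semantics:

    # RANDOM_DELAY: delay job startups by up to N random minutes (cronie extension).
    RANDOM_DELAY=30
    # A ~ in a range picks a random value within it, here a minute from 0-59:
    0~59 * * * * /usr/local/bin/nightly-sync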


Intriguing! I had no idea! Again, I wish the concept was more popular back when cron and the Internet were younger, and having it built-in and readily available goes a very long way. If someone doesn't introduce you to the concept, you have little chance of knowing better until you find yourself at the receiving end of a spike, yelling at the clouds.


I believe the philosophy was to leave it up to the service to behave sensibly, including things like having a circuit breaker, using some kind of backoff/retry, and generally being robust in the face of resource contention.

It kind of feels like this is putting the policy of "don't all go at once" into the cron mechanism, which is just starting jobs at desired times.
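
To be fair, that kind of client-side politeness is only a few lines of shell. A rough sketch, where "fetch-feed" is a placeholder command:

    # Retry up to 5 times, backing off between attempts instead of hammering the server.
    for i in 1 2 3 4 5; do
        fetch-feed && break      # stop as soon as one attempt succeeds
        sleep $(( i * 30 ))      # wait 30s, 60s, 90s, ... before the next try
    done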


The lazy solution is to put something like:

sleep $((RANDOM % 60))

...or longer, at the top of the script your cron is running.
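
If you put the sleep directly in the crontab entry instead of a script, note that $RANDOM is a bash/ksh/zsh feature and cron typically runs entries with /bin/sh, so you may need to set SHELL. A hypothetical sketch (paths made up):

    # Force bash so $RANDOM is available, then jitter each run by up to 5 minutes.
    SHELL=/bin/bash
    */15 * * * * sleep $((RANDOM % 300)) && /usr/local/bin/fetch-feed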


It's a neat idea to be sure but I fail to see how this will have any material impact.

Firstly, OpenBSD is a niche OS, meaning the absolute number of OpenBSD cron jobs out "in the wild" is relatively low.

Second, my understanding is that this is a client-side feature. I.e. if I run a service, this feature only benefits me if a significant portion of my users opt into it.

Third, I have an unsubstantiated suspicion that cron usage relative to systemd usage is also on the decline.


Another nicety OpenBSD's cron gained not long ago was the "-s" flag to ensure only a single instance of a command will run at a time.

https://man.openbsd.org/man5/crontab.5#s
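
As I read crontab(5), the flag goes at the start of the command field, something like this (the job path is made up):

    # -s: ensure only a single instance of this job runs at a time.
    * * * * * -s /usr/local/bin/long-running-sync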


Today I learned the -n flag. Is this also an OpenBSD exclusive?

I have to say, these flags are a really nice enhancement to cron.


FreeBSD and NetBSD also implement "-n". There doesn't seem to be a cross-platform port of OpenBSD cron like there are of doas and OpenBSD ksh. (Anybody want to try making one?) Cross-platform fcron has "erroronlymail".

https://man.freebsd.org/cgi/man.cgi?query=crontab&apropos=0&...

https://man.netbsd.org/NetBSD-9.3-STABLE/crontab.5

http://fcron.free.fr/doc/en/fcrontab.5.html#FCRONTAB.5.ERROR...


It's an external program, but there's the run-one program in the Ubuntu repos.


That's gross. If the operation shouldn't run concurrently it should use an exclusive flock or similar. It's not just cron that can cause concurrent execution, and if that matters you generally want to robustly prevent it - not just if cron is the executor.


It’s not gross. It’s a simple solution to a simple and common cron problem

Just because it doesn’t consider external executions doesn’t mean it’s a bad solution. Horses for courses, etc.


> It’s not gross. It’s a simple solution to a simple and common cron problem

No, it's gross. By providing that facility in the wrong place, it discourages people who come at the problem from the cron perspective from implementing it in the right place.

Wrap the command in a flock-running script. That script goes in the crontab entry. When you're inevitably debugging your cron-scheduled command - paydirt! The command still serializes itself while you're manually testing, instead of shitting itself.
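
Something like this is all I mean; a sketch using util-linux flock(1), with made-up paths:

    #!/bin/sh
    # Serialize my-job no matter who launches it: cron, a human, or another script.
    # -n makes a second invocation fail fast instead of queueing behind the lock.
    exec flock -n /var/lock/my-job.lock /usr/local/bin/my-job "$@"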


Isn’t that the same? Just because you check a file lock in your script doesn’t mean that other invocations of the program without the script will check the lock.


Lock files in scripts are actually pretty unreliable. I learned this the really hard way, and the lesson cost tens of thousands of dollars.


Surprised to see this, mind sharing the experience?

How did the lock fail? Was there a more reliable fix?


No, it's not the same. The crontab entry reflects what you'd run to reproduce what the crontab does.


Ok, but your original criticism stated that solving this in cron is bad because the program may be run outside of cron:

> It's not just cron that can cause concurrent execution, and if that matters you generally want to robustly prevent it - not just if cron is the executor.

So you did not like that exclusion only worked when triggered from cron. But in your case it also only works when triggered from your script. So cron just made your script an integrated feature and you’re essentially criticizing your own solution.


If your job is a real script: sure, handle locking in it.

If it's a single command or pipeline... adding the layer of indirection to have a script that runs flock is more opaque. Might as well put it in cron and trust cron to only run it once.


You don't have to create a script, running a command under flock is still a one-liner in crontab.
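
E.g. something along these lines, with a made-up job path (util-linux flock(1), so not available everywhere):

    # flock inline in the crontab entry; no wrapper script needed.
    */5 * * * * flock -n /var/lock/myjob.lock /usr/local/bin/myjob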


But it's unnecessary when you can do the same thing in 2 characters.

I like flock(1) and have known how to use it for 15 years. But there are sharp edges.

- It's not standardized. In particular, this means the OpenBSD base system doesn't even include it. It's not like the underlying flock(2) is very well behaved or consistent.

- You need to ask for nonblocking behavior.

- If your command or script can ever result in a daemon launching, it may be holding the lock even though the part of your action that is supposed to be protected by the lock (the script/immediate subprocess) has ceased. So, e.g., 'flock -n /tmp/relaunch-apache /etc/init.d/apache2 restart' could be a really bad idea. -u can fix this... in some cases.


Ah, yes, let's add exclusive flock support to all binaries in existence, just in case.

You are confusing running a process with running a task from a scheduler.


> Ah, yes, let's add exclusive flock support to all binaries in existence, just in case.

You are aware that flock(1) is a thing, right? And that a de facto core tenet of UNIX is composability of disparate programs that do one thing well?


> a de facto core tenet of UNIX is composability of disparate programs that do one thing well?

cron goes completely against that principle - after all, you can schedule jobs with the 'at' command, and to make a repeating task, you just make it exec 'at' again each time it is called. cron is for the lazy, no real UNIX hacker would dream of using such an extravagant single-use program. /s


> You are aware that flock(1) is a thing, right?

No. Maybe I've heard of it and even knew what it is, but I've never needed it.

You know why?

My task scheduler supports running only one instance of the task, so I don't need to reinvent the wheel every time I need something to run on a schedule.

Just 'somebinary args args args', 'run only one instance' and I'm done.

> flock - manage locks from shell scripts

Well, yep. Reinventing the wheel every time.

Thanks, I have more important things to do (like browsing HN) than writing shitty shell scripts.


Why is using the composable OS feature reinventing the wheel?


>Well, yep. Reinventing the wheel every time.

Erm, no - flock(1) is a UNIX tool, a component if you will. It's the exact opposite of reinventing the wheel.

Not understanding the operating system you're using is fine, but it's good to at least know what you don't know.


> It's the exact opposite of reinventing the wheel.

If I need to write a wrapper script each time I need to run a task on a timer - then it is the proverbial reinventing of the wheel. It doesn't matter if you call it a tool, a utility, or a component. Especially since this was solved decades ago.

> Not understanding the operating system you're using is fine

Ah, another mighty UNIX wizard here.


>If I need to write a wrapper script

And there's your problem - you don't realize "writing a wrapper script" is actually simpler than messing with config files.


The 'problem' here is you, serving the machine. I prefer the machine to serve me.


Using utilities included with your OS is not the machine serving you?


Using obsolete utilities for the sake of 'doing it the right UNIX way' instead of just ticking a checkbox or adding one line to the task configuration?

Come on, I'll repeat it again: this was solved decades ago. Why do you need to do things like it's 1976? Why do you insist everyone else should do it that way too and abandon the fruits of the digital age?


> Why do you need to do the things like it's 1976?

It was solved decades ago: in 1976. I don't understand why you feel like it wasn't.


Why would you use something other than cron for many types of tasks? Not everyone needs a distributed system to run a script periodically, they want to actually get things done. :)


[flagged]


> before you dig that hole of incompetence any deeper.

"Hur-dur! Look at me! Me mighty UNIX wizard! Bow to me!"

> running the scheduled commands via sh, it's inherently involving the shell. Welcome to *NIX.

Thanks, while you are running shell commands I'm running systems. I'm not interested in writing shell scripts for things that were solved decades ago.


Better look into a mirror ...

Using cron (or systemd) to ensure only 1 instance is running is a valid approach.


I think it's fine for the case where running the operation more than once at a time isn't incorrect, but potentially wastes resources.


Even if a process can be run concurrently, that doesn’t mean you necessarily want to. Besides that, your cron job itself may be something like “do-something | grep foo > /root/bar”. You obviously don’t want to run that concurrently. You could create a script, but that’s more cumbersome.


You are of course correct, but at the same time, you might (or might not) be surprised at the number of "system administrators" who are thrown into the job without really having the capability to expand too far on their knowledge. Having the option in cron may help those administrators who specifically search Google for cron usage and never come across flock.

EDIT: In addition, the Task Scheduler in Windows has this type of option, so it may help those sys admins coming from that environment, leveraging their existing knowledge


Nice! To me this is a small but meaningful differentiator for OpenBSD. It is similar to the "random" feature in my favorite cron, fcron: http://fcron.free.fr/doc/en/fcrontab.5.html#FCRONTAB.5.RANDO....

In crons that do not support something like this, you can introduce a random delay with a Perl oneliner. For example,

  # Start within the first 10 minutes of a matching hour.
  0 */8 * * * perl -e 'sleep rand 10*60' && ~/.config/jobs/fetch-tcl-logs


And if you have jot, but not Perl, like on a bare FreeBSD install:

  0 */8 * * * sleep "$(jot -r 1 0 599)" && ~/.config/jobs/fetch-tcl-logs


Might as well just use bash

    sleep $((RANDOM % 600)) && fetch-tcl-logs


How would this work with a step that isn't divisible into the range for the field? Given minutes 0~59/25 with a random offset of 0, there will be an event at 0, 25, and 50 minutes past the hour. On the next hour, does it start at 0, at 15, or at a new random offset? I.e. constant offset, constant step, or regenerated offset?


I read it as the 25 is just a simple step. So the first hour in your example would run at 0, 25, and 50. The next hour would run at 15 and 40, etc.


This is a nice feature at the minute-resolution level. I think something at the second-resolution level would be helpful too. For example, I have a cronjob on my Raspberry Pi at home that runs every minute and does a simple check-in with Heii On-Call so I get alerted if there's a FIOS outage or the pi breaks. I ended up writing a little bash script like this:

    #!/bin/bash
    set -e

    HEIIONCALL_API_KEY="redacted_api_key_goes_here"
    HEIIONCALL_TRIGGER_ID="redacted_trigger_id_goes_here"

    AUTHORIZATION_HEADER="Authorization: Bearer ${HEIIONCALL_API_KEY}"
    CHECKIN_URL="https://api.heiioncall.com./triggers/${HEIIONCALL_TRIGGER_ID}/checkin"

    if [ "$1" != "--now" ]; then
      RANDOM_SLEEP=$(( (RANDOM % 55) + 1 ))
      echo "Sleeping ${RANDOM_SLEEP} seconds before checkin..."
      sleep ${RANDOM_SLEEP}s
    fi

    echo "Checking in..."
    exec curl \
      -X POST \
      --retry 5 --retry-connrefused --retry-max-time 15 --retry-delay 1 \
      -H "${AUTHORIZATION_HEADER}" \
      "${CHECKIN_URL}"
This script ~/bin/heiioncall-checkin.sh gets called by crond every minute at exactly :00 seconds, so my expected maximum gap between check-ins is approximately 120 seconds. And I can skip the sleep with the "--now" flag for testing. But I'd much rather have this random offset behavior be something optionally built into cron, I suppose.


Why aren't you using a systemd timer unit with RandomizedDelaySec? [1]

[1]: https://www.freedesktop.org/software/systemd/man/systemd.tim...


This is useful. Many services that run periodically can end up with odd patterns emerging.

E.g. Tor network capacity varies over the month because bandwidth management tends to happen at the beginning/end of the month.


Some people call it splay. It helps to not have all running instances hug your telemetry and configuration-management servers to death.


Yes, I first learned this and the name "splay" from CFengine, back in the day.

I put together a small busybox-like collection of sysadmin tools, and one of the subcommands is "splay" to sleep for a random amount of time. It's one of those things that is useful surprisingly often, even outside cron.

https://github.com/skx/sysbox


What's emerging from this thread is that pretty much every operating system has had this in some form for a while… except maybe OpenBSD (and a few others?).

Btw, congrats to the OpenBSD people for catching up.


Doesn't make me very excited, since I strongly feel standard cron implementations should've been deprecated a long time ago anyway. I mean, consider dkron, for example. Forget k8s and the web UI and all that nonsense: its YAML configs are simply way clearer, more readable, and more powerful than the usual crontab syntax. Why can't I have the same with a plain, simple, non-distributed cron?!

Also, just as a sidenote I'm not willing to seriously discuss: I seriously doubt I'd personally ever use random ranges in production. I understand what problem it's supposed to solve, but generally I just really don't want anything random in my systems. If it conflicts with some other cronjobs or whatever, I'd like it to break down deterministically — preferably, all the time, so it's easier to spot, track down and fix it. If it causes any load spikes, I'd like these spikes to be regular, so that I can see that and manually tweak run times so that it'll be more even. If any problems arise, I'd prefer them to arise after somebody changed something, and not just magically one Saturday evening a couple of months later.

The only situation I can think of right away where this is acceptable is if I have a lot of nodes with the same cron config, so it's my attempt to spread out workers of the same type that I know would otherwise start at the same time. But then, why the fuck do I have such a degenerate architecture in the first place?! Maybe I should think about replacing that with something a little more sustainable, like, uh, a centralized scheduler? No, I mean, it's definitely a solution — a quick and easy one, at that — but even then it seems like a solution to a problem that shouldn't have existed in the first place.



