> Here's a concrete example: suppose you have millions of web pages that you want to download and save to disk for later processing. How do you do it? The cool-kids answer is to write a distributed crawler in Clojure and run it on EC2, handing out jobs with a message queue like SQS or ZeroMQ.
> The Taco Bell answer? xargs and wget. In the rare case that you saturate the network connection, add some split and rsync. A "distributed crawler" is really only like 10 lines of shell script.
I generally agree, but it's probably only 10 lines if you assume you never have to deal with any errors.
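For concreteness, the happy-path version really is about that size; a sketch, with illustrative flags and file names (`urls.txt` is assumed to hold one URL per line):

```bash
# One URL per line in urls.txt (assumed input).
# Fetch 16 pages at a time; wget mirrors each URL to host/path/... on disk.
<urls.txt xargs -P 16 -n 1 wget --quiet --tries=3 --timeout=30 \
    --force-directories --no-clobber

# To "distribute" it, split the URL list across machines and rsync the results back:
#   split -n l/4 urls.txt part.
```

Which is exactly the catch: nothing here notices, records, or retries the failures beyond wget's own attempts.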
It's not flaky at all; it's merely that most people don't code bash/etc to catch errors, retry on failure, and so on.
I will 100% agree that it has disadvantages, but it's unfair to level the above at shell scripts, because most of your complaint is about poorly coded shell scripts.
An example? sysvinit is a few C programs, all of them wrapped in bash or sh. It's far more reliable than systemd has ever been, with far better error checking.
Part of this is simplicity. 100 lines of code is better than 10k lines. The value of having the whole scope on one page can't be overstated for debugging and comprehension, and it makes error checking easier too.
Can I, with off-the-shelf OSS tooling, easily trace that code that’s “just wget and xargs”, emit metrics and traces to collectors, differentiate between all the possible network and HTTP failure modes, retry individual requests with backoff and jitter, allow individual threads to fail and retry them without borking the root program, write the results to a datastore in an idempotent way, and allow a junior developer to contribute to it with little ramp-up?
It’s not about “can bash do it”; it’s about “is there a huge ecosystem of tools, which we are probably already using in our organization, that thoroughly covers all these issues”.
The Unix way is that wget does the backoff. And wget is very, very good at retry, backoff, jitter handling, etc. Frankly, you'll not find anything better.
If wget fails, you don't retry... at least not until next run.
And wget (or curl, or others) does reply with return codes which indicate what kind of error happened. You can also parse stderr.
Of course you could programmatically handle backoff in bash too, but... why? Wget is very good at that. Very good.
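For what it's worth, both the retry knobs and the exit codes are documented wget behaviour; a minimal sketch (the URL and output name are placeholders):

```bash
url="https://example.com/page.html"   # placeholder

# Let wget own retries and backoff; these are all documented wget options.
wget --tries=5 --waitretry=10 --retry-connrefused --timeout=30 \
     --output-document=page.html "$url"
rc=$?

# wget's documented exit codes distinguish failure classes:
#   0 = success, 4 = network failure, 5 = SSL verification failure,
#   8 = server issued an error response (4xx/5xx)
case "$rc" in
    0) echo "ok" ;;
    4) echo "network failure; leave it for the next run" ;;
    8) echo "server returned an error response" ;;
    *) echo "other wget failure (exit $rc)" ;;
esac
```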
===
In terms of the 'junior dev' point: a junior dev can't contribute much to anything without ramping up first. I think you mean 'ramp up on bash' here, and that's fair... but the same can be said for any language you use. I've seen Python code with no error checking, and a gross misunderstanding of what to code for, just as with bash.
Yet like I said, I 100% agree there are issues in some cases. What you're saying is not entirely wrong. However, what you're looking for, I think, is not required much of the time, as wget + bash is "good enough" more often than you'd think.
So I think our disagreement here is over how often your route is required.
That’s fair. If you’re a grey-haired, old-school Unix wiz who’s one of a handful of devs on the team, I’d say by all means. But at a certain point technology choice is an organizational problem as well.
And while it sounds Unixy to let wget do its thing, a fully baked program like that is much less “do one thing and do it well” than the HTTP utilities in general-purpose programming languages.
That can be solved at the design level: write your get step as an idempotent “only do it if it isn’t already done” creation operation for a given output file — like a make target, but no need to actually use Make (just a `test -f || …`.)
Then run your little pipeline in a loop until it stops making progress (`find | wc` doesn’t increase.) Either it finished, or everything that’s left as input represents one or more classes of errors. Debug them, and then start it looping again :)
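A sketch of that pattern, with illustrative names (`urls.txt` in, `pages/` out, one file per URL):

```bash
mkdir -p pages

# One step = one output file; skip the work if the file is already there
# (make-style idempotence, just with test -f instead of Make).
fetch_one() {
    local url=$1
    local out="pages/$(printf '%s' "$url" | sha1sum | cut -d' ' -f1).html"
    test -f "$out" || wget -q -O "$out" "$url" || rm -f "$out"
}
export -f fetch_one

# Re-run the pipeline until a pass stops producing new files.
prev=-1
while :; do
    <urls.txt xargs -P 16 -I{} bash -c 'fetch_one "$1"' _ {}
    count=$(find pages -type f | wc -l)
    [ "$count" -eq "$prev" ] && break   # no progress: finished, or only failures remain
    prev=$count
done
```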
Not redoing steps that appear to be already done has its own challenges- for example, a transfer that broke halfway through might leave a destination file, but not represent a completion (typically dealt with by writing to a temp file and renaming).
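The usual shape of that fix, sketched (names are illustrative; the rename is only atomic within one filesystem):

```bash
# Download to a temporary name and rename only on success, so an interrupted
# transfer never leaves a file that looks "done" to the test -f check.
fetch_one() {
    local url=$1 out=$2
    test -f "$out" && return 0
    wget -q -O "$out.part" "$url" && mv "$out.part" "$out" || rm -f "$out.part"
}
```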
The issue here is that your code has no real-time adaptability. Many backends will scale with load up to a point then start returning "make fewer requests". Normally, you implement some internal logic such as randomized exponential backoff retries (amazingly, this is a remarkably effective way to automatically find the saturation point of the cluster), although I have also seen some large clients that coordinate their fetches centrally using tokens.
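For comparison, hand-rolled randomized exponential backoff is itself only a few lines of shell; a sketch (retry count and delays are arbitrary, and `RANDOM` is a bashism):

```bash
# Retry a command, sleeping roughly 1, 2, 4, 8, ... seconds (plus jitter)
# between attempts, giving up after 6 tries.
with_backoff() {
    local attempt=0 max=6 delay
    until "$@"; do
        attempt=$((attempt + 1))
        [ "$attempt" -ge "$max" ] && return 1
        delay=$(( (1 << attempt) / 2 ))              # 1, 2, 4, 8, ...
        sleep "$(( delay + RANDOM % (delay + 1) ))"  # randomized jitter
    done
}

with_backoff wget -q -O page.html "https://example.com/page.html"   # placeholder URL
```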
Having that logic in the same place as the work of actually driving the fetch/crawl, though, is a violation of Unix “small components, each doing one thing” thinking.
You know how you can rate-limit your requests? A forward proxy daemon that rate-limits upstream connections by holding them open but not serving them until the timeout has elapsed. (I.e. Nginx with five lines of config.) As long as your fetcher has a concurrency limit, stalling some of those connections will lead to decreased attempted throughput.
(This isn’t just for scripting, either; it’s also a near-optimal way to implement global per-domain upstream-API rate-limiting in a production system that has multiple shared-nothing backends. It’s Istio/Envoy “in the small.”)
Setting up the nginx server means one more server to manage (and nginx isn't particularly a small component doing one thing).
Having built several large distributed computing systems, I've found that the inner client always needs to have a fair amount of intelligence when talking to the server. That means responding to errors in a way that doesn't lead to thundering herds. The nice thing about this is that, like modern TCP, it auto-tunes to the capacity of the system, while also handling outages well.
Not really; I’m talking about running non-daemonized Nginx as part of the same pipeline. You could even fit the config into the pipeline, with sed+tee+etc, to make the whole thing stateless. Short-lived daemons are the network-packet equivalent to shell pipelines. :)
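Something like this sketch, assuming a single upstream host and that nginx can write its pid/log/temp files where pointed; `limit_req` without `nodelay` holds excess requests rather than rejecting them, so a fetcher with a concurrency cap simply slows down:

```bash
# Throw-away config generated in the pipeline; nginx runs in the foreground
# and dies with the pipeline. Port, rate and upstream are placeholders.
cat > /tmp/ratelimit.conf <<'EOF'
pid /tmp/ratelimit-nginx.pid;
error_log /tmp/ratelimit-nginx.err;
events {}
http {
    access_log off;
    # Everything arrives from 127.0.0.1, so this is one shared bucket.
    limit_req_zone $binary_remote_addr zone=crawl:10m rate=10r/s;
    server {
        listen 127.0.0.1:8888;
        location / {
            # No "nodelay": excess requests are queued, not rejected.
            limit_req zone=crawl burst=50;
            proxy_pass http://upstream.example.com;   # placeholder upstream
        }
    }
}
EOF

nginx -c /tmp/ratelimit.conf -g 'daemon off;' &
proxy_pid=$!
sleep 1   # give nginx a moment to start listening

# Fetchers hit the local proxy instead of the real host; their own -P cap plus
# the stalled connections keeps attempted throughput down.
# paths.txt is assumed to hold one URL path per line.
<paths.txt xargs -P 16 -I{} wget -q --force-directories "http://127.0.0.1:8888{}"

kill "$proxy_pid"
```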
> Having built several large distributed computing systems, I've found that the inner client always needs to have a fair amount of intelligence when talking to the server.
I disagree. The goal should be to make the server behave in such a way that a client using entirely-default semantics for the protocol it’s speaking, is nudged and/or coerced and/or tricked into doing the right thing. (E.g. like I said, not returning a 429 right away, but instead, making the client block when the server must block.) This localizes the responsibility for “knowing how the semantics of default {HTTP, gRPC, MQPP, RTP, …} map into the pragmatics of your particular finicky upstream” into one reusable black-box abstraction layer.
Isn't that also only an incredibly simplified crawler? I can't see how that works with the modern web. Try crawling many websites and they'll present enough difficulties that when you go to view what you've downloaded, you realise it's useless.