
That can be solved at the design level: write your get step as an idempotent “only do it if it isn’t already done” creation operation for a given output file — like a make target, but with no need to actually use Make (just a `test -f || …`).
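
Sketch of what I mean (the file layout and hashing are illustrative; assumes curl and coreutils):

    # Idempotent "get" step: do nothing if the output already exists,
    # so re-running the whole thing is always safe.
    mkdir -p pages
    while read -r url; do
      out="pages/$(printf '%s' "$url" | md5sum | cut -d' ' -f1).html"
      test -f "$out" || curl -fsS "$url" -o "$out"
    done < urls.txt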

Then run your little pipeline in a loop until it stops making progress (the `find | wc -l` count doesn’t increase). Either it finished, or everything that’s left as input represents one or more classes of errors. Debug them, and then start it looping again :)
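
The outer loop, roughly (pipeline.sh and the pages/ directory are stand-ins for the step above):

    # Re-run the pipeline until the output count stops growing.
    prev=-1
    count=$(find pages -type f | wc -l)
    while [ "$count" -gt "$prev" ]; do
      ./pipeline.sh || true     # individual failures are fine; we retry next pass
      prev=$count
      count=$(find pages -type f | wc -l)
    done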




Not redoing steps that appear to be already done has its own challenges: for example, a transfer that broke halfway through might leave a destination file behind without representing a completed download (typically dealt with by writing to a temp file and renaming it into place).
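
Concretely, reusing the hypothetical loop above:

    # Download to a temp name; only a completed transfer gets renamed into
    # place. Rename is atomic on the same filesystem, so `test -f` never
    # sees a half-written output.
    test -f "$out" || { curl -fsS "$url" -o "$out.tmp" && mv "$out.tmp" "$out"; }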

The issue here is that your code has no real-time adaptability. Many backends will scale with load up to a point, then start returning “make fewer requests” (HTTP 429). Normally you implement some internal logic such as randomized exponential backoff on retries (this turns out to be a remarkably effective way to automatically find the saturation point of the cluster), although I have also seen some large clients that coordinate their fetches centrally using tokens.
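
A minimal bash sketch of that backoff, with fetch_one standing in for whatever does the real request:

    # Randomized exponential backoff: the sleep is a random draw from the
    # current window ("full jitter"), and the window doubles on each
    # failure, capped at 64s.
    fetch_with_backoff() {
      local url=$1 delay=1
      until fetch_one "$url"; do
        sleep $(( RANDOM % delay + 1 ))
        delay=$(( delay < 64 ? delay * 2 : 64 ))
      done
    }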


Having that logic in the same place as the work of actually driving the fetch/crawl, though, is a violation of Unix “small components, each doing one thing” thinking.

You know how you can rate-limit your requests? A forward proxy daemon that rate-limits upstream connections by holding them open but not serving them until the rate-limit delay has elapsed. (I.e., Nginx with five lines of config.) As long as your fetcher has a concurrency limit, stalling some of those connections will decrease its attempted throughput.
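
Sketched out (the rate, port, and upstream are placeholders; this leans on nginx’s documented limit_req behavior of delaying, rather than rejecting, over-rate requests when burst is set without nodelay):

    # Inside the http {} block. Using $server_name as the key makes it
    # one global bucket; excess requests wait in the queue with their
    # connections held open until their turn comes.
    limit_req_zone $server_name zone=upstream:1m rate=5r/s;
    server {
      listen 127.0.0.1:8080;
      location / {
        limit_req zone=upstream burst=1000;
        proxy_pass https://api.example.com;
      }
    }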

(This isn’t just for scripting, either; it’s also a near-optimal way to implement global per-domain upstream-API rate-limiting in a production system that has multiple shared-nothing backends. It’s Istio/Envoy “in the small.”)


Setting up nginx means one more server to manage (and it isn’t particularly a small component doing one thing).

Having built several large distributed computing systems, I’ve found that the inner client always needs a fair amount of intelligence when talking to the server. That means responding to errors in a way that doesn’t lead to thundering herds. The nice thing about this approach is that, like modern TCP, it auto-tunes to the capacity of the system while also handling outages well.
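
In miniature, that auto-tuning is just additive-increase/multiplicative-decrease, same shape as TCP congestion control. A hedged bash sketch, with fetch_one as a stand-in executable and batches/ as an illustrative layout:

    # Grow concurrency by one after each clean batch; halve it whenever
    # the server pushes back (any non-zero exit from the batch).
    workers=1
    for batch in batches/*; do
      if xargs -n1 -P "$workers" fetch_one < "$batch"; then
        workers=$(( workers < 64 ? workers + 1 : 64 ))
      else
        workers=$(( workers > 1 ? workers / 2 : 1 ))
      fi
    done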


Not really; I’m talking about running non-daemonized Nginx as part of the same pipeline. You could even fit the config into the pipeline, with sed+tee+etc., to make the whole thing stateless. Short-lived daemons are the network-packet equivalent of shell pipelines. :)
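
Roughly, and only as a sketch (exact nginx flags and paths vary by build; the upstream is a placeholder):

    # Generate the config on the fly, run nginx in the foreground as a
    # short-lived child of the pipeline, and tear it down when done.
    conf=$(mktemp); prefix=$(mktemp -d); mkdir "$prefix/logs"
    cat > "$conf" <<'EOF'
    daemon off;
    error_log /dev/stderr;
    events {}
    http {
      access_log off;
      limit_req_zone $server_name zone=upstream:1m rate=5r/s;
      server {
        listen 127.0.0.1:8080;
        location / {
          limit_req zone=upstream burst=1000;
          proxy_pass https://api.example.com;
        }
      }
    }
    EOF
    nginx -p "$prefix" -c "$conf" & nginx_pid=$!
    trap 'kill "$nginx_pid"' EXIT
    # ...point the fetch loop at http://127.0.0.1:8080 instead of the upstream...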

> Having built several large distributed computing systems, I've found that the inner client always needs to have a fair amount of intelligence when talking to the server.

I disagree. The goal should be to make the server behave in such a way that a client using entirely default semantics for the protocol it’s speaking is nudged and/or coerced and/or tricked into doing the right thing. (E.g., as I said: not returning a 429 right away, but instead making the client block when the server must block.) This localizes the responsibility for “knowing how the semantics of default {HTTP, gRPC, MQTT, RTP, …} map onto the pragmatics of your particular finicky upstream” into one reusable black-box abstraction layer.


That's an interesting perspective, certainly not one that would have immediately come to mind. Does this pattern have a name?


"Common sense"


What happens if a page is partially downloaded?



