That does sound like quite a good approach; there are a few issues, though:
1. Some vague throttling is useful, but with Lambda I can only set this at the account level rather than per function. Calling the API in bursts of 10,000 requests could be problematic, so I'd have to cap the account limit, which would constrain our use of Lambda for other tasks (where we might be happy running that many in parallel). The default limit of 100 would probably serve well enough, though.
2. I've now got to add splitting and recombining code on each end, with concerns around failed jobs being silently missing from the final output file. That said, the extra work may leave me with a better approach to handling failed jobs from a large file. Hmm.
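For what it's worth, the "silently missing" worry in point 2 can be reduced by tracking chunk IDs explicitly at split time, so recombination fails loudly if any result is absent. A minimal sketch (the chunk naming scheme and sizes are made up for illustration, not taken from the actual setup):

```python
import os


def split_file(path, lines_per_chunk, out_dir):
    """Split an input file into numbered chunk files; return the chunk ids."""
    os.makedirs(out_dir, exist_ok=True)
    chunk_ids = []
    chunk, idx = [], 0
    with open(path) as f:
        for line in f:
            chunk.append(line)
            if len(chunk) == lines_per_chunk:
                _write_chunk(out_dir, idx, chunk)
                chunk_ids.append(idx)
                chunk, idx = [], idx + 1
    if chunk:  # trailing partial chunk
        _write_chunk(out_dir, idx, chunk)
        chunk_ids.append(idx)
    return chunk_ids


def _write_chunk(out_dir, idx, lines):
    with open(os.path.join(out_dir, f"chunk-{idx:05d}"), "w") as f:
        f.writelines(lines)


def recombine(chunk_ids, results_dir, out_path):
    """Concatenate result chunks in order; raise if any are missing."""
    missing = [i for i in chunk_ids
               if not os.path.exists(os.path.join(results_dir, f"chunk-{i:05d}"))]
    if missing:
        raise RuntimeError(f"missing result chunks: {missing}")
    with open(out_path, "w") as out:
        for i in chunk_ids:
            with open(os.path.join(results_dir, f"chunk-{i:05d}")) as f:
                out.write(f.read())
```

Because the splitter returns the full list of IDs, a chunk whose Lambda invocation failed shows up as a hard error at recombination time rather than a quietly shorter output file.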
Part of the issue is that, without the execution limit, this is an amazingly simple script: I've got a 10-20 line Python script that does the actual processing of a file (read, hit the API, store, with a pool of N concurrent requests). Lambda is impressive because it adds only a small amount of complexity to a problem and gives you a lot in return; it's just that my use case is so simple that even that small amount of complexity adds up to relatively quite a lot.
Currently the setup doesn't hit the API; it just creates a dedicated instance to process a batch of data locally. I'm hoping to simplify things by sending everything through the API and scaling & load balancing separately. Having code that automatically turns machines on and is responsible for turning them off makes me a bit nervous :) I've already missed that I'd deleted the shutdown command in a script and left a box running for a day while developing.
Thanks for the suggestion, though. I'll try to work through it in more detail today and see if I can find a clean way of dealing with the recombination; I think that's the side I'm less clear on at the moment.