To be fair I don't think you need CV in this specific case where the problem space is very limited.

1. There's no lighting, so the enemies have specific, fixed pixel colours that don't appear in any of the backgrounds. Scan and target these.

2. Enemies appear in a specific zone of the canvas, which makes the scan faster and combines with the point below.

If there's expected ambiguity, one can (a) detect a few interesting background properties by looking at pixels where enemies never appear (e.g. corners), and/or (b) use a couple of other pixels relative to the candidate match (maybe neighbours, maybe not; it could just as well be 20px down, 10px left) to discriminate.
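A minimal sketch of that scan, assuming a raw RGBA frame buffer as bytes; the colours, the zone, and the confirmation offsets are all made up for illustration, not taken from the actual captcha:

```python
# Hypothetical sprite colours -- in practice you'd dump them from the real frames.
ENEMY_COLOURS = {(163, 59, 59), (255, 183, 183)}

# Extra pixels to check relative to a candidate match, to rule out noise.
# Offsets are arbitrary here; a real version would pick them per sprite.
CONFIRM_OFFSETS = [(1, 0), (0, 1)]

def find_enemies(frame, width, height, zone):
    """Return (x, y) candidates whose colour matches a known enemy colour.

    frame: bytes, 4 bytes per pixel (RGBA).
    zone: (x0, y0, x1, y1) region where enemies can appear,
          so we never scan the whole canvas.
    """
    x0, y0, x1, y1 = zone
    hits = []
    for y in range(y0, y1):
        for x in range(x0, x1):
            i = 4 * (y * width + x)
            if tuple(frame[i:i + 3]) not in ENEMY_COLOURS:
                continue
            # Discriminate: the relative pixels must also be enemy-coloured.
            ok = True
            for dx, dy in CONFIRM_OFFSETS:
                cx, cy = x + dx, y + dy
                if not (0 <= cx < width and 0 <= cy < height):
                    ok = False
                    break
                j = 4 * (cy * width + cx)
                if tuple(frame[j:j + 3]) not in ENEMY_COLOURS:
                    ok = False
                    break
            if ok:
                hits.append((x, y))
    return hits
```

Restricting the loop to the zone and bailing out on the first non-matching confirm pixel keeps this fast enough to run every frame.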

Side story: one day my team was tasked with doing textual document content recognition for some biz. Everyone was like "oh it's going to be $$$ to pull out CV+OCR and have the OCR learn the specific font".

Turns out the document in question was:

    - an extremely standardised gov format
    - produced only by gov administration
    - of a known fixed, overall size with clear identifiable boundaries
    - printing a known, standardised list of fields at fixed positions
    - with a known, standard font specifically made for quick automatic recognition
    - containing only /[A-Za-z0-9]/ chars (plus a few I can't recall, but essentially dash, plus, slash...)
    - on a known, standardised background
    - the only variable is the quality of the scan and the size parameters

So I put up a file upload form, piped the image through a reasonable ImageMagick filter sequence to turn it into a no-background monochrome image, looked for corners/borders, resized+rotated, scanned through the image until I hit a black pixel, then looked at lit/unlit pixel patterns (think 7-segment display in reverse).

Cobbled the thing together in a couple of afternoons, with a quick, simple UI to have the user crop/rotate the doc (putting it mostly upright). It was stupidly fast to run and the success rate was very high. Interestingly enough, the failure mode was very good: it could reliably tell "ok, I can't make any sense of this", vs OCR which claimed success but output gibberish.
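The "7-segment display in reverse" step could look something like this toy sketch, assuming the image is already binarised and deskewed; the cell size, probe points and templates are invented here, where a real version would derive them from the actual font metrics:

```python
# A character cell of the (hypothetical) fixed-width font: 3x5 pixels.
CELL_W, CELL_H = 3, 5

# Probe points inside a cell: (x, y) positions we sample for lit/unlit,
# like reading the segments of a 7-segment display.
PROBES = [(1, 0), (0, 2), (2, 2), (1, 4)]

# Known lit/unlit signatures for a couple of glyphs (made up for this sketch).
TEMPLATES = {
    (True, True, True, True): "0",
    (True, False, False, True): "1",
}

def read_glyph(rows, x0):
    """Classify one character cell of a binarised image.

    rows: list of strings, '#' = lit pixel, '.' = unlit.
    x0: left edge of the cell.
    Returns the glyph, or None when no template matches -- the
    "I can't make any sense of this" failure mode, rather than gibberish.
    """
    sig = tuple(rows[y][x0 + x] == "#" for x, y in PROBES)
    return TEMPLATES.get(sig)
```

Because the signature lookup either hits a template exactly or returns None, a bad scan fails loudly instead of producing plausible-looking wrong text.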

You can get surprisingly far with very little when you have known knowns.




Nah, a proper anecdote should end with 'and you could check one checkbox at the gov site and, instead of the scan, receive the "printed" PDF/A with the text layer intact'.

But yeah, there is always a way to optimize. Even with a clean-room implementation (i.e. not looking at the source of that DOOM captcha), you can easily narrow recognition down to a couple of 2x2 blocks and just pattern-match them against a known background (i.e. not a monster).
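A rough sketch of that block comparison, assuming you have a reference frame of the known background; frames here are just 2-D lists of pixel values, and all names are made up:

```python
def changed_blocks(frame, background, block=2):
    """Return top-left coordinates of block x block tiles that differ
    from the known background -- i.e. something (a monster?) is drawn there.

    frame, background: 2-D lists of pixel values of the same shape.
    """
    hits = []
    for y in range(0, len(frame) - block + 1, block):
        for x in range(0, len(frame[0]) - block + 1, block):
            if any(frame[y + dy][x + dx] != background[y + dy][x + dx]
                   for dy in range(block) for dx in range(block)):
                hits.append((x, y))
    return hits
```

Comparing whole tiles instead of single pixels means a stray matching pixel in the background can't produce a false positive on its own.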



