It was pretty hard, to be honest, but it helped that we already had our own virtual machine infrastructure, which we use for the crowd side. Without that, it would have been a much steeper hill to climb.
As for the problem itself, at a basic level it's a balancing act between visual matching (what does and doesn't matter in an image, what counts as the match in the VM, and whether that's acceptable, and how/why), OCR (what text is there, what matters, etc.), and timing and interaction issues, plus the general complexity of it all. And for us it has to work on basically any platform we can run in KVM.
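
To make the visual matching and timing parts a bit more concrete, here is a minimal Python sketch. It is purely illustrative and is not the implementation described above: the names `match_fraction`, `wait_for_match`, and `grab_frame` are hypothetical, frames are assumed to be small grayscale grids, and the tolerance and threshold values are arbitrary. It shows two of the ideas: ignoring regions that don't matter (say, a clock), and polling until the screen matches or a timeout expires.

```python
import time
from typing import Callable, List, Optional, Set, Tuple

Image = List[List[int]]  # grayscale pixels, 0-255 (assumed frame format)


def match_fraction(frame: Image, template: Image,
                   ignore: Optional[Set[Tuple[int, int]]] = None,
                   tolerance: int = 8) -> float:
    """Fraction of compared pixels that are within `tolerance` of the template.

    Pixels whose (row, col) is in `ignore` are skipped entirely, which is one
    simple way to say "this part of the image doesn't matter".
    """
    ignore = ignore or set()
    compared = matched = 0
    for y, row in enumerate(template):
        for x, expected in enumerate(row):
            if (y, x) in ignore:
                continue
            compared += 1
            if abs(frame[y][x] - expected) <= tolerance:
                matched += 1
    return matched / compared if compared else 1.0


def wait_for_match(grab_frame: Callable[[], Image], template: Image,
                   threshold: float = 0.95, timeout: float = 5.0,
                   interval: float = 0.1,
                   ignore: Optional[Set[Tuple[int, int]]] = None) -> bool:
    """Poll `grab_frame` until the frame matches well enough or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        if match_fraction(grab_frame(), template, ignore) >= threshold:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Usage: a 2x2 "screen" where pixel (0, 1) is a clock that keeps changing.
template = [[0, 0], [255, 255]]
with_clock = [[0, 200], [255, 255]]
strict = match_fraction(with_clock, template)              # clock pixel counts
masked = match_fraction(with_clock, template, {(0, 1)})    # clock pixel ignored
```

Even in a toy like this, the hard questions are the ones the code can't answer: which pixels go in `ignore`, how loose `tolerance` and `threshold` can be before a wrong screen passes, and how long to wait before calling it a failure.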