I was wondering, has anybody found really good tools, potentially cross-platform, for GUI automation that leverage image detection from computer vision models, say convolutional neural networks?
Not really a tool, but Python is widely used for deep learning, so you can combine PyTorch, TensorFlow, [insert your DL framework here] with PyAutoGUI [1] to achieve exactly what you're asking. If you feel PyAutoGUI is too "manual", I built a kind of frontend for it [2].
> wondering how I can hook up a custom backend then for the bounding box detection, which it appears is not supported.
You can take a screenshot with:
import pyautogui
screenshot = pyautogui.screenshot()  # returns a PIL Image
Feed the screenshot to your neural network to get the coordinates of the element you want, then act on them with PyAutoGUI. In many cases a neural network is overkill; take a look at this: https://vimeo.com/352072921
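To make the "detect, then act" idea concrete, here is a minimal sketch. The `detect` callable is hypothetical: it stands in for whatever model you plug in and is assumed to return a `(left, top, width, height)` box in screenshot pixels, or `None`. It also assumes screenshot pixels map 1:1 to screen coordinates, which may not hold on scaled or HiDPI displays.

```python
from typing import Callable, Optional, Tuple

# Assumed box layout: (left, top, width, height), in screenshot pixels.
Box = Tuple[int, int, int, int]


def box_center(box: Box) -> Tuple[int, int]:
    """Return the integer center point of a bounding box."""
    left, top, width, height = box
    return left + width // 2, top + height // 2


def click_detected(detect: Callable[[object], Optional[Box]]) -> bool:
    """Screenshot the screen, run the detector, click the box center.

    `detect` is a hypothetical wrapper around your own model.
    Returns True if something was found and clicked.
    """
    import pyautogui  # imported here because it needs a display to load

    screenshot = pyautogui.screenshot()  # PIL Image of the whole screen
    box = detect(screenshot)
    if box is None:
        return False
    x, y = box_center(box)
    pyautogui.click(x, y)
    return True
```

Keeping the model behind a plain callable means you can swap frameworks without touching the automation side.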
The script in that video takes a screenshot of the webpage, recognizes the currently highlighted word with pytesseract, and types it in with PyAutoGUI. Simple.
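I can't reproduce the exact script here, but a rough sketch of that kind of OCR-and-type step might look like the following. It assumes the Tesseract binary is installed and that you already know the screen region where the current word appears; the `region` argument and both function names are made up for illustration, and the real script may locate the highlight differently.

```python
from typing import Tuple


def first_word(ocr_text: str) -> str:
    """Return the first whitespace-separated token of OCR output, or ''."""
    tokens = ocr_text.split()
    return tokens[0] if tokens else ""


def type_word_in_region(region: Tuple[int, int, int, int]) -> str:
    """OCR a screen region and type the first word found there.

    `region` is (left, top, width, height) in screen coordinates and is
    an assumption of this sketch, not something taken from the video.
    """
    import pyautogui    # imported here because it needs a display to load
    import pytesseract  # requires the Tesseract binary to be installed

    image = pyautogui.screenshot(region=region)
    word = first_word(pytesseract.image_to_string(image))
    if word:
        # Type the word followed by a space, with a short per-key delay.
        pyautogui.write(word + " ", interval=0.02)
    return word
```

Calling `type_word_in_region` in a loop would give you the basic shape of a typing bot, with no neural network involved.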
[1]: https://github.com/asweigart/pyautogui
[2]: https://github.com/rmpr/atbswp