> ML methods would require labelled data to be trained.
Not necessarily.
One could train a model on synthesized data, e.g. using Blender, some programmable 3D human models with posable hands, and some generic background images to composite the renders onto.
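A minimal sketch of why synthesized data comes with labels for free: the generator picks the pointing direction first, then "renders" it, so the label is known exactly. Everything here is a toy stand-in and an assumption for illustration (a noise grid instead of a background photo, a bright line instead of a Blender-rendered arm, the hypothetical name `make_sample`).

```python
import math
import random

def make_sample(rng, size=32):
    """Return (image, angle); image is a size x size grid of floats in [0, 1]."""
    # Toy background: dim random noise stands in for a generic background photo.
    image = [[rng.random() * 0.5 for _ in range(size)] for _ in range(size)]
    # The label is chosen before rendering, so it is exact by construction.
    angle = rng.uniform(0.0, 2.0 * math.pi)
    cx = cy = size // 2
    # Toy foreground: a bright line from the centre in the chosen direction
    # stands in for a rendered person pointing that way.
    for r in range(size // 2 - 1):
        x = cx + round(r * math.cos(angle))
        y = cy + round(r * math.sin(angle))
        image[y][x] = 1.0
    return image, angle

rng = random.Random(0)  # fixed seed so the dataset is reproducible
dataset = [make_sample(rng) for _ in range(100)]
```

With a real renderer the structure stays the same: sample the pose parameters, render, and store the parameters as the label.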
Anyway, for a Multimedia Information Retrieval course I chose to do my term project on training a neural network with synthetic data. In particular, I modded Minecraft so that when I press a button it saves two screenshots: a regular one and one where the game renders a depth map instead. I used this to generate ~1000 samples with perfectly accurate depth maps. Because of the mods, texture pack and world I used, the data was somewhat realistic: https://i.stack.imgur.com/Zai51.jpg https://i.stack.imgur.com/eamMR.png
This data was then used to train a neural network to predict the depth map of unseen images. It was relatively successful, but it would need more data and more research; I only had so much time for a term paper.
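For a setup like this, the bookkeeping before training could look roughly like the sketch below: match each regular screenshot with its depth map, then hold some pairs out so the model is evaluated on images it has not seen. The file naming scheme (`_rgb.png` / `_depth.png`) and the function names are assumptions for illustration, not what the mod actually produced.

```python
import random

def pair_samples(filenames, rgb_suffix="_rgb.png", depth_suffix="_depth.png"):
    """Match each regular screenshot with the depth map sharing its prefix."""
    names = set(filenames)
    pairs = []
    for name in sorted(names):
        if name.endswith(rgb_suffix):
            depth = name[: -len(rgb_suffix)] + depth_suffix
            # Skip screenshots whose depth map is missing.
            if depth in names:
                pairs.append((name, depth))
    return pairs

def train_val_split(pairs, val_fraction=0.1, seed=0):
    """Shuffle the pairs and hold out a fraction of them for validation."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]
```

Shuffling before the split matters here because consecutive screenshots from the same spot in the world tend to look alike.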
Labelling itself already solves the issue (i.e. a bunch of photos with people pointing in certain directions).
So basically you'd solve the problem you want to solve in order to automate a task that you already performed.