It's Reinforcement Learning. In each state, the agent chooses from a set of actions. The chosen action leads to a state transition, and the agent receives a reward, which it can use to adjust its policy. A policy is just a conditional probability distribution over actions given the state, p(a | s).
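
To make the loop concrete, here is a minimal Python sketch of a policy stored as a table of p(a | s), together with the act / transition / reward / adjust cycle. Everything in it is an assumption for illustration: the two states, the two actions, the toy `step` environment, and especially the `adjust` rule, which is an ad hoc "nudge the rewarded action and renormalize" update rather than any particular named algorithm.

```python
import random

# Hypothetical toy problem: two states, two actions.
# policy[s][a] holds p(a | s); each row is a probability distribution.
INITIAL_POLICY = {
    "s0": {"left": 0.5, "right": 0.5},
    "s1": {"left": 0.5, "right": 0.5},
}


def sample_action(policy, state, rng):
    """Draw an action a with probability p(a | state)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


def step(state, action):
    """Hypothetical environment: returns (next_state, reward).

    'right' leads to s1 and pays 1.0; 'left' leads to s0 and pays 0.0.
    """
    if action == "right":
        return "s1", 1.0
    return "s0", 0.0


def adjust(policy, state, action, reward, lr=0.1):
    """Ad hoc update (an assumption, not a standard algorithm):
    add lr * reward to the chosen action's probability, then renormalize."""
    policy[state][action] += lr * reward
    total = sum(policy[state].values())
    for a in policy[state]:
        policy[state][a] /= total


def run(num_steps=200, seed=0):
    rng = random.Random(seed)
    policy = {s: dict(row) for s, row in INITIAL_POLICY.items()}  # copy
    state = "s0"
    for _ in range(num_steps):
        action = sample_action(policy, state, rng)   # agent picks an action
        next_state, reward = step(state, action)     # state transition + reward
        adjust(policy, state, action, reward)        # reward adjusts the policy
        state = next_state
    return policy


if __name__ == "__main__":
    for state, row in run().items():
        print(state, row)
```

Since only "right" is ever rewarded in this toy environment, the probability the policy assigns to "right" can only grow, while each row still sums to 1. A real method would replace `adjust` with something principled, but the shape of the loop stays the same.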