Perhaps these findings indicate that we need more NN layers/attention blocks to perform reasoning. This project circumvented the lack of additional trained layers by looping the input through the already-trained layers more than once.
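As a rough illustration of the looping idea (not the project's actual code), the sketch below re-applies a single trained transformer block several times, adding effective depth without adding parameters. The class name, dimensions, and loop count are hypothetical choices made for this example.

```python
import torch
import torch.nn as nn


class LoopedEncoder(nn.Module):
    """Re-applies one trained transformer block several times,
    gaining effective depth with no extra trained parameters."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, n_loops: int = 3):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.n_loops = n_loops

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Feed the output back into the same block n_loops times.
        for _ in range(self.n_loops):
            x = self.block(x)
        return x


# Usage: a batch of 8 sequences, 16 tokens each, embedding size 256.
model = LoopedEncoder()
tokens = torch.randn(8, 16, 256)
out = model(tokens)
print(out.shape)  # torch.Size([8, 16, 256])
```

Whether the extra passes actually improve reasoning would, of course, need to be verified empirically for the model and task at hand.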
Also, if the objective is reasoning, we may have to look for better loss functions than those that merely train the model to predict the next token.