I think it could potentially make the model smarter, but it depends on how you collect the data used to train the reward models. Currently, companies & papers that use RLHF mostly collect "safety" rankings, for example. But you could instead collect "smartness" or "correctness" labels and train the reward model on those. (And then use that reward model to finetune the LLM you want to improve.)
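To make that concrete, here's a minimal sketch of what training a reward model on "correctness" rankings could look like, using PyTorch and Hugging Face `transformers`. Everything here is illustrative: the base model choice, the `preference_pairs` data, and the `score` helper are all assumptions, not anyone's actual pipeline. The one standard piece is the pairwise (Bradley-Terry style) loss, which is the usual way reward models are trained from ranked comparisons.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Any encoder works here; distilbert is just a small example choice.
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=1 turns the classification head into a scalar reward head.
reward_model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=1
)
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)

# Hypothetical preference data: raters judged which answer is more
# *correct* (rather than which is safer).
preference_pairs = [
    ("What is 2+2?", "2+2 equals 4.", "2+2 equals 5."),
]

def score(prompt: str, answer: str) -> torch.Tensor:
    """Scalar reward for a (prompt, answer) pair."""
    inputs = tokenizer(prompt, answer, return_tensors="pt", truncation=True)
    return reward_model(**inputs).logits.squeeze(-1)

for prompt, better, worse in preference_pairs:
    # Pairwise ranking loss: push the more-correct answer's reward
    # above the less-correct one's.
    loss = -F.logsigmoid(score(prompt, better) - score(prompt, worse)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The key point is that nothing in this setup cares *what* the rankings mean -- the loss only sees which answer the raters preferred. So whether the resulting reward model captures "safety" or "correctness" is entirely determined by the labeling instructions, which is why the data collection step matters so much.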