Nothing is easy when you cannot stop and think and have to output the next token after a fixed amount of processing. It's amazing what GPT-4 can do given that limitation.
Right? I don't find enough people amazed by that. Only one computation per layer, even with attention, and that has to fit all the abstractions (which can run deep; it gets pretty meta) and, on top of that, a good few steps of reasoning. It's so counter-intuitive to me.
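To make the constraint concrete, here's a minimal sketch of what "fixed processing per token" means. This is a toy with random numpy weights, nothing like a real transformer (no attention, no training), but it shows the structural point: every next-token prediction is exactly one pass through the same fixed stack of layers, whether the token is trivial or requires "thinking".

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, D_MODEL, N_LAYERS = 50, 16, 4

# Toy, untrained weights standing in for a trained model (hypothetical).
embed = rng.normal(size=(VOCAB, D_MODEL))
layers = [rng.normal(size=(D_MODEL, D_MODEL)) / np.sqrt(D_MODEL)
          for _ in range(N_LAYERS)]
unembed = rng.normal(size=(D_MODEL, VOCAB))

def next_token(context):
    # Real attention mixes information across positions; this toy just
    # averages the context so the fixed-depth point stays visible.
    x = embed[context].mean(axis=0)
    for w in layers:          # fixed depth: exactly N_LAYERS matmuls,
        x = np.tanh(x @ w)    # no early exit, no extra iterations
    return int(np.argmax(x @ unembed))  # greedy pick, then move on

context = [3, 14, 15]
for _ in range(5):
    context.append(next_token(np.array(context)))
print(context)
```

However hard the next token is, the cost is identical: N_LAYERS passes, then commit. Any multi-step "reasoning" has to happen either inside that fixed stack or be spread across the tokens it has already emitted.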