Same, I've been pretty impressed as well and typically give Claude a shot. Sometimes I even pass the models' outputs back and forth in an LLM collab so they generate more diverse perspectives. However, this paper from 4 days ago shows that Claude can fall apart quickly on out-of-distribution tasks. If you ask "opposite day" questions, GPT-4 is weirdly strong at them (figure 2).
https://arxiv.org/pdf/2307.02477