I wouldn't consider solving 31/41 easy problems, 21/80 medium problems, and 3/45 hard problems a failure. GPT-4 wasn't built to solve these problems, yet it can still do 3 of the 45 hard ones. Hell, if you sat down 45 random programmers, I don't know if 3 of them could solve the 3 hard problems GPT-4 was able to do, and nobody could solve them in the time it took GPT-4.