I can't think of a paper where Google didn't present sparse or entirely missing comparison metrics against its peers. They do a good job of presenting architectures they're excited about internally, with enough detail to take the concepts and run with them. They also do a good job of showing why the new architecture is generally viable. They just skip the detailed benchmark comparisons, is all. And model weights, obviously, but there's still enough information to generally reproduce the concept.
I'm personally extremely excited about anything related to PaLM or Google's multi-modal efforts. They're almost always worth the read.
Most of the GPT-4 benchmarks in their report were things like AP tests or LeetCode scores, which aren't benchmarks that a different set of researchers can reproduce, since you don't know the constituent parts of the test to run.
The GPT-4 report does have an MMLU score, which is considered one of the most important metrics for question-answering tasks. GPT-4's MMLU score is slightly higher than PaLM 2's (86 vs. 81). Google didn't compare against it in this paper.