I'm all in favour of doing reproducible work, especially against open-source ML (I'm the maintainer of a library for LLM inference), but what is the alternative here?
GPT-4 is still the flagship model "of humanity"; it is still the best model that is publicly available. The intent of the paper is to determine whether the best general model can compete with specialized models - how do you do that without using the best general model?
This is not answering that question. It is answering whether a specific large model with an unknown training set can compete with specific small models trained on known, smaller training sets.