Very few DL papers are fully reproducible, in the sense that you can get identical results and generate identical checkpoints, even when the source code is attached and benchmark datasets are used.
I would say the vast majority of published papers are not even close to being reproducible.
Full-size LLMs cannot be reproduced because the datasets are copyrighted and not distributable, and you need millions of dollars' worth of compute to fully train them.
If a company publishes enough detail about SoTA architectures for researchers to ballpark-reproduce them, that is a win in my book.
I'm not actually expecting them to be 100% reproducible. I'm more than familiar with the work required to do that.
What I'm expecting is that authors don't hold back the secret sauce, as some (or many) publications do.
My gripe is not limited to the DL/ML/AI scope either. When I try to compare my results with the papers I cite, I generally can't find the formulae or enough methodological detail to reproduce the numerical method the paper claims, and this leaves us in the dark.
All I can do is say "Paper23 reports these results, and we surpass them at this, we are even at that, and they are better at the other thing", which I'm not comfortable doing. Not because I think they are not telling the truth, but because I want to test my method against other methods on even ground.
> My gripe is not limited to the DL/ML/AI scope either. When I try to compare my results with the papers I cite, I generally can't find the formulae or enough methodological detail to reproduce the numerical method the paper claims, and this leaves us in the dark.
I agree. So many don't even bother including hyperparams, even when they publish the code. The GitHub Issues for their code are littered with questions asking about hyperparams.
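For what it's worth, recording the exact hyperparameters and seed costs almost nothing. A minimal sketch of what I'd like to see shipped alongside released code (the fields and values below are placeholders, not anyone's actual settings):

    import json
    import random

    import numpy as np
    import torch

    # Placeholder hyperparameters -- the point is just to record whatever
    # was actually used, next to the code and the checkpoint.
    config = {
        "seed": 42,
        "learning_rate": 3e-4,
        "batch_size": 256,
        "epochs": 90,
        "optimizer": "adamw",
        "weight_decay": 0.01,
    }

    # Seed everything so runs are at least approximately repeatable
    # (bitwise reproducibility also needs cuDNN/determinism settings).
    random.seed(config["seed"])
    np.random.seed(config["seed"])
    torch.manual_seed(config["seed"])

    # Ship the config with the code so nobody has to open a GitHub issue
    # to ask what the hyperparameters were.
    with open("hyperparams.json", "w") as f:
        json.dump(config, f, indent=2)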
> All I can do is say "Paper23 reports these results, and we surpass them at this, we are even at that, and they are better at the other thing", which I'm not comfortable doing.
If you are achieving better results on the same dataset, you are not cheating in any way, and others can reproduce your results, then I don't know what is wrong with saying you got a better result.
Issues do arise when you are using better hardware, with more parameters or larger batch sizes, than the original authors could have attempted. I think this accounts for the improvements reported in many papers.