OpenAI Baselines (blog.openai.com)
223 points by astdb on May 25, 2017 | 36 comments



To expand on what was written in the article, reproducibility is difficult in science generally but can be insanely difficult in machine learning. The key insight I've gained over the years in the field, especially in deep learning, is that gradient descent is highly effective at hiding my darn bugs.

I ran into this recently by accident when writing a simple RL example. With two weight matrices to learn, the first was given correct gradients while the last was only supplied with partial gradient information. Surprise, it still works, and I only discovered the bug _after_ submitting to OpenAI's Gym with quite reasonable results. I've seen similar issues in the past, such as accidentally leaving part of the network frozen (i.e. it was randomly initialized and never changed), yet the model still happily went along with it.

This is good and bad. Bad in that it makes errors difficult to catch. Good in that, if you had a reason for freezing part of the network (transfer learning, say), your model will happily learn to use it, even if that "information" is more akin to noise.
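As a toy sketch of the frozen-layer case (plain numpy, made-up shapes and learning rate, so treat it as an illustration rather than the actual bug): the first weight matrix is never updated, yet the loss still drops happily.

    import numpy as np

    np.random.seed(0)
    X = np.random.randn(256, 10)            # toy inputs
    y = X @ np.random.randn(10, 1)          # toy regression targets

    W1 = np.random.randn(10, 32) * 0.1      # "frozen" layer: never updated below
    W2 = np.random.randn(32, 1) * 0.1       # only this layer actually learns

    for step in range(500):
        h = np.tanh(X @ W1)                 # features from the frozen random layer
        err = h @ W2 - y
        loss = (err ** 2).mean()
        grad_W2 = h.T @ err * (2 / len(X))  # correct gradient, but only for W2
        W2 -= 0.1 * grad_W2                 # W1 is (accidentally) never touched
        if step % 100 == 0:
            print(step, round(float(loss), 4))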

Regarding reproducibility, most papers I've gone to reproduce take far longer than expected and usually involve deducing / requesting additional information from the lead authors. Even minor issues, such as how the loss is calculated (loss / (batch * timestep) vs loss / batch), can cause substantial confusion, and given they seem "insignificant" and papers have space constraints, they are rarely written down.
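For the curious, the two conventions above look something like this (a rough sketch with made-up tensor names); the reported numbers differ by a factor of the sequence length, which silently changes what any quoted loss or effective learning rate means.

    import numpy as np

    batch, timesteps = 32, 50
    # per-token losses, shape (batch, timesteps) -- random stand-in data
    token_losses = np.random.rand(batch, timesteps)

    loss_per_token = token_losses.sum() / (batch * timesteps)  # loss / (batch * timestep)
    loss_per_sequence = token_losses.sum() / batch             # loss / batch

    # identical model, identical data, numbers that differ by a factor of `timesteps`
    print(loss_per_token, loss_per_sequence, loss_per_sequence / loss_per_token)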

The worst I have seen recently was a state-of-the-art published result where the paper was accepted to a conference, yet they didn't include a single hyperparameter for their best performing model - and no code. There is near zero ability to reproduce that given the authors spent a small nuclear reactor's worth of compute performing grid search to get the optimal hyperparameters.

tldr: There are reproducibility issues all the way up the stack, from gradient descent working against you, to minor omissions in the papers, to full-fledged omissions that are still accepted by the community.


To add to this, there are many not-so-great side effects due to all of the complexity even if the code is all correct and perfectly reproduced.

For instance, I suspect a large fraction of claimed "better than baseline" results in papers are actually the result of a well-tuned proposed model and a not-very-well-tuned baseline. Papers usually say something along the lines of "we tune these hyperparameters with cross-validation", but you can be very thorough or not very thorough in this process and achieve very different results. Or if the author is (intentionally or unintentionally) devious, they could adjust the hyperparameter search ranges in a way that doesn't cover the optimal region, or is way too broad.
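As a contrived illustration (the validation score and ranges below are entirely made up), the same "we tuned with random search" sentence covers all three of these, and only one of them has any real chance of finding the baseline's sweet spot:

    import numpy as np

    np.random.seed(0)

    def val_score(lr):
        # pretend validation accuracy that peaks sharply around lr = 3e-4
        return 0.9 - abs(np.log10(lr) - np.log10(3e-4))

    def random_search(low, high, trials=20):
        lrs = 10 ** np.random.uniform(np.log10(low), np.log10(high), trials)
        return max(val_score(lr) for lr in lrs)

    print(random_search(1e-2, 1e-1))   # range never covers the optimum -> artificially weak "baseline"
    print(random_search(1e-5, 1e-1))   # overly broad -> trials spread thin across four decades
    print(random_search(1e-4, 1e-3))   # sensible range -> a much stronger result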

All of these problems combined are the reason that I pay much less attention to numbers in the tables of papers. Papers are, unfortunately, more of a source of cool ideas and inspiration for stuff to put into your toolbox and maybe try yourself in your next project.


+1. The "well-tuned proposed model and not-very-well-tuned baseline" is something I feel nearly every researcher is guilty of, including myself :) It's especially pronounced, however, when people compare to a baseline from paper X (usually by copying and pasting the number) which may be a year or more old.

A year or two may not sound like much, but in deep learning you might as well be comparing Heron's engine[1] to a modern combustion engine. The most cited attention paper, Bahdanau et al. 2014, was published only two and a half years ago - and their results weren't all that strong initially anyway. There are so many minor auxiliary improvements / newly found bits of domain expertise that can add up to a substantially improved result (initializations, regularization techniques, hyperparameter choices that work well for a given task / dataset, ...) even for almost exactly the same model.

As an example of this, I implemented a Keras SNLI model[2] as a teaching aid for friends only to find my simplest baseline outperformed a number of fairly recent papers. This is not necessarily an issue at all with those papers - many of the ideas and techniques they propose have gone on to be used in other tasks and models - but it does indicate how these additional best practices / more time on the baseline can really accumulate over only months. Keras is a brilliant example of that as it's "best practices included" - when you use an LSTM it'll default to strong weight initializations etc.
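To make the "best practices included" point concrete, here's roughly what that looks like (exact defaults and attribute names depend on your Keras version, so treat this as a sketch):

    from keras.layers import LSTM

    # A bare LSTM already ships with sensible defaults - e.g. a Glorot-uniform
    # kernel initializer and an orthogonal recurrent initializer - choices that a
    # from-scratch implementation (or an older paper's baseline) may well lack.
    layer = LSTM(128)
    print(layer.kernel_initializer, layer.recurrent_initializer)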

I'm vaguely hoping that in the future there can be workshops at conferences titled "Pushing the baselines" where we take some time to see what new techniques can apply to old baselines and/or tune baselines with the explicit goal of getting them to regain their edge. There's at least one at ICML on reproducibility[3] which I'm looking forward to :)

[1]: https://en.wikipedia.org/wiki/Aeolipile

[2]: https://github.com/Smerity/keras_snli

[3]: https://sites.google.com/view/icml-reproducibility-workshop/...


I wish there were more ways to unit test the gradient calculations in my code. While I try to use AD as much as possible, it isn't always feasible. Of course it isn't just gradients; if you are implementing something like Adagrad, there are additional corners where bugs can lurk.


I presume you're already familiar with computing the numerical and analytical Jacobian[1][2] and just wishing for a better way? :) They're memory intensive as all hell and pretty finicky but at least it's something. I'll admit that when floating point calculations are involved it can all go to hell anyway.
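For anyone who hasn't run into it, the core of the check is just central finite differences against your analytical gradient - a rough numpy sketch with a made-up scalar function:

    import numpy as np

    def f(W, x):
        return np.tanh(W @ x).sum()                      # any scalar-valued function of W

    def analytical_grad(W, x):
        return np.outer(1 - np.tanh(W @ x) ** 2, x)      # hand-derived dL/dW

    def numerical_grad(W, x, eps=1e-5):
        grad = np.zeros_like(W)
        for i in range(W.shape[0]):
            for j in range(W.shape[1]):
                Wp, Wm = W.copy(), W.copy()
                Wp[i, j] += eps
                Wm[i, j] -= eps
                grad[i, j] = (f(Wp, x) - f(Wm, x)) / (2 * eps)   # central difference
        return grad

    np.random.seed(0)
    W, x = np.random.randn(4, 3), np.random.randn(3)
    max_abs_err = np.abs(analytical_grad(W, x) - numerical_grad(W, x)).max()
    print(max_abs_err)   # should be tiny; anything near 1e-2 usually means a bug

It's painfully slow and memory hungry for big tensors (and, as you say, floating point makes the thresholds fuzzy), but it has caught plenty of my hand-derived mistakes.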

Recently I had to implement gradient calculations by hand (writing custom CUDA code) and had a pretty terrible time. Mixing the complications of CUDA code with my iffy manual differentiation and floating point silliness can drive you a little bonkers. I ended up implementing a slow automatically differentiated version and compared the resulting outputs and gradients to help work through my bugs.

Here's hoping that Tensorflow's XLA and other JIT style CUDA compilers/optimizers will make much of this obsolete in the near future.

For those not familiar, the overhead for calling a CUDA kernel can be insanely high, especially when you're just doing an elementwise operation such as an add. Given your neural network likely has many, many of these, fusing them into one small piece of custom CUDA can result in substantial speed increases. Unfortunately there's not really any automatic way of doing that yet. We're stuck in the days of either writing manual assembly or being fine with suboptimally compiled C.

[1]: https://www.tensorflow.org/versions/r0.11/api_docs/python/te...

[2]: https://github.com/pytorch/pytorch/blob/master/torch/autogra...


We spent a ton of time thinking about this. We have an "op executioner" in our tensor library that handles special cases like this. We call it "grid execution" where we look for opportunities for grouping ops automatically. We will be combining that with our new computation graph to automatically look for optimization opportunities like that.

Right now we hand write all of our own gradients as well.

The overhead can come from a ton of different places. This is why we wrote workspaces: http://deeplearning4j.org/workspaces

Allocation reduction and op grouping are only a few things you can do.


I did not know about gradcheck. Thanks for the pointer! I have some handwritten code that does some of this for me. But essentially, yes! I want better tooling to catch my mistakes.


I'm curious... why are you calculating gradients and implementing these models from scratch? There are a decent number of libraries that do these things for you.


If the solution works with the bugs you mentioned, all that means is that you started with a model with more complexity than it needed.


> reproducibility is difficult in science generally but can be insanely difficult for machine learning

it is a computer algorithm, so by definition it is trivial to reproduce results, you just run the program again.

> I ran into this recently by accident when writing a simple RL example. With two weight matrices to learn, the first weight matrix was given correct gradients, the last weight matrix was only supplied with partial information. Surprise, still works, and I only discovered the bug _after_ submitting to OpenAI's Gym with quite reasonable results.

so you want to say that you coded a bug, but you don't have a method of testing whether you have a bug. So you didn't code a bug. And if you didn't code a bug, you can't reproduce a bug.

So yes, reproducing a bug is difficult when you have no means of determining whether or not you have a bug...

...maybe you should look into choosing a means of determining whether or not you have a bug.


> it is a computer algorithm, so by definition it is trivial to reproduce results, you just run the program again.

Not really. Deep learning is still quite a lot of dark voodoo where random initialization and data shuffling can matter significantly. People also adapt hyperparameters manually during training, stop early with no clear metric, and don't share their code for preprocessing the data or even the exact architecture of the network.

It's certainly better than in other fields, but it's not trivial.
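As a concrete example of what "just run the program again" glosses over - even pinning every seed you can think of (a hedged, PyTorch-style sketch) still leaves data ordering, library versions and non-deterministic GPU kernels to account for:

    import random
    import numpy as np
    import torch

    def pin_seeds(seed=1234):
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # cuDNN autotuning and non-deterministic kernels can still shift results
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

    pin_seeds()
    # ...and none of this helps if the preprocessing, shuffling order or
    # hyperparameter schedule the authors used was never published.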


> dark voodoo

If I have a function f and an element of the domain x, then the value y = f(x) does not change. This is not dark voodoo.


Advertising: if you're interested in RL, subscribe to https://www.reddit.com/r/reinforcementlearning/ !


This is good advice, and just as important is for authors to release the code they used.

For my Master's thesis in AI (12 years ago, so before most of this open stuff) I compared an existing Genetic Algorithm, described in a published paper, against my improvement. My improvement was significantly better.

However, I relied on the prose description of the original algorithm. The original paper (cited many, many times) didn't even have pseudocode, let alone source code.

For my paper, I included pseudocode of both the original algorithm and my improved algorithm. But we still didn't have established practices for how to make source code available to readers, in such a way that they'd be archived long term.

Is there an established way now?


http://www.gitxiv.com

Seems to be gaining popularity.


There's various sites that promise long term archival of websites for academic citations. E.g. https://perma.cc/ I imagine you could archive a github repo like that.


zenodo.org lets you archive code and data, generates DOIs, has a GitHub integration, and is maintained by a reputable team. I recommend it wholeheartedly.


It's horrifying to think about how many published papers have incorrect results due to bugs. There was a famous incident last year where someone published a paper with amazing results and got a ton of attention. No one could reproduce it and eventually it was withdrawn.


Agreed. Admittedly the paper that you're speaking about was on the extreme level of "bugged". The results were beyond stellar, attracting a great deal of interest, and researchers who are usually very reserved quickly pointed out methodological flaws and showed through previous / current experimentation how broken it was. A friend noted that their process was so flawed it was potentially _worse_ than if you'd just trained on the test set.

Your broader point is spot on however. My general hope is that people are wary when their results are strong^, making sure they didn't get a good result via "cheating", so the majority of bugs are likely to harm performance. If a result is not reproducible (i.e. a "cheating" bug) it won't be used and built on - but if a result is bugged yet reproducible (i.e. a bug where performance was lower than it should be) then we can still move the field forward in spite of these issues.

^ When I achieved state of the art for a task - especially given it was a huge jump in accuracy for a relatively small model compared to the previous state of the art - I spent many days sitting there double checking I hadn't accidentally cheated ;)


Yeah the SARM paper was an extreme example, but that's why it got caught so quickly. How many papers have less extreme but still serious flaws, and don't get caught?


As an extreme example it actually brought me a bit of hope. No source code was released but the "peer review" via arXiv, Twitter, and other various channels ended up bringing the story to a close.

I'd like to imagine it's how effective peer review could be if given sufficient motivation ;)

As you note, though, most papers don't get anywhere in the same magnitude of focus, and others which do may still be entirely unreproducible anyway :(


Can anyone at OpenAI explain the sole focus on RL while ignoring Vision and NLP?


Our focus at OpenAI is on AGI research. Many of us believe that Vision/NLP research falls into the category of AI applications, and does not inform insights into how to achieve generally intelligent agents. Instead, the approach is to work with full-stack agents and the core challenge is to get them to develop a cognitive toolkit for general problem solving, not anything that has to do with the specifics of perception.

This is a historically backed insight. If you're interested in a good critique of the decompose-by-function-then-combine-later approach, I recommend "Intelligence without Representation" from Rodney Brooks http://www.scs.ryerson.ca/aferworn/courses/CPS607/CLASSES/su...


> Many of us believe that Vision/NLP research falls into the category of AI applications, and does not inform insights into how to achieve generally intelligent agents.

Hmm... I can agree that vision and NLP could be seen as "applications", from one point of view. But I can see another position where each simply represents a different aspect of underlying cognition. Language, in particular, seems to be closely tied up in how we (humans) think. And without proposing a strong version of the Sapir-Whorf hypothesis, I can't help but believe that a lot of human cognition is carried out in our primary language. Now, to be fair, this belief comes from not much more than obsessively trying to introspect on my own thinking and "observe" my own mental processes.

In any case, it leads me to suspect that building generally intelligent AIs will be tightly bound up with understanding how language works in the brain, the extent to which there is a "mentalese", and how - if at all - a language like English (or Mandarin or Tamil, whatever) maps onto "mentalese". Vision also seems crucial to the way humans learn, given our status as embodied agents that learn from the environment using sight, smell, sound, kinesthetic awareness, proprioception, etc.

Quite likely I'm wrong, but I have a hunch that building a truly intelligent agent may well require creating an agent that can see, hear, touch, smell, balance, etc. At least to the extent that humans serve as our model of how to construct intelligence.

On the other hand, as the old saying goes "we didn't build flying machines by creating mechanical birds with flapping wings". :-)


I'm not saying NLP/Vision is not important, but we approach these modalities in a very specific way.

When we "work on NLP" it looks something like https://blog.openai.com/learning-to-communicate/ or http://www.foldl.me/2016/situated-language-learning/, as an emergent phenomenon in the service of a greater objective - not an end but a means to an end. It does not look like doing sentiment analysis or machine translation.

When we "work on Computer Vision" it looks like including a ConvNet into the agent architecture or a robot, it doesn't look like trying to achieve higher object detection scores.


> an emergent phenomenon in the service of a greater objective - not an end but a means to an end.

Ah, sounds like we are on the same page then.


I have to disagree with you on language and cognition. If you went around asking famous mathematicians how they think, the last thing they would tell you would be "words".

Creative and inventive thought is very visual and non-linear: https://www.amazon.com/Psychology-Invention-Mathematical-Fie...


Also, thanks for the book recommendation. I ordered a copy. Looking forward to digging into it.


FWIW, I don't intend to suggest that all cognition is done in terms of "word language", just an important bit of it.


Thanks for your answer. That clears up my doubts about why OpenAI has not entered more applied areas which could benefit from having a non-profit create tools / databases.


Agree, I'm pretty sure the world could really benefit from an awesome "OpenAI-2" that is more of an application-driven, high-societal-impact AI organization. We're more focused on mitigating the existential risk associated with AGI.


This site is so pretty.


Releasing both code and models needs to be STANDARD for ML research.


s/STANDARD/ALL/


Did you substitute the wrong word?


Yes, I definitely did.



