
PyMC3 didn’t run well on GPUs last I tried. That may have changed but I find PyTorch easier to work with than Theano or TensorFlow.



Just in case other readers stumble by, neither of these perceptions of pymc is accurate.

GPU operability is well supported, and much like Keras does for TensorFlow, pymc provides well-designed abstractions on top of Theano, making the downsides of the raw backend mostly irrelevant.

I like PyTorch a lot too, but whenever I see someone say PyTorch is easier than TensorFlow, it usually just means that person has only tried PyTorch, picked up some specialized knowledge about it, and now doesn’t want to admit that a different framework might be the better choice, even if it requires giving up some of what’s nice about PyTorch.


That’s a fairly aggressive response.

Both TF and Theano require a static graph, while PyTorch lets you use Python’s regular control flow (if, for, while, etc.). This makes building modular model components much easier, since you can reason about execution mostly as if it were normal numerical Python code.
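
To make that concrete, here is a minimal sketch (illustrative toy code, names made up) of the kind of data-dependent control flow PyTorch handles natively:

    import torch

    def repeated_halving(x):
        # Plain Python while loop; the number of iterations depends on the
        # data, so there is no fixed graph to compile ahead of time.
        steps = 0
        while x.norm() > 1.0:
            x = x / 2.0
            steps += 1
        return x, steps

    x = torch.randn(8, requires_grad=True)
    y, steps = repeated_halving(x)
    y.sum().backward()  # autograd traces whichever path actually executed

In a static-graph framework you would express the same loop with dedicated graph ops (e.g. theano.scan or tf.while_loop) instead of ordinary Python.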

I have tried running PyMC3 models on GPUs (when they were on Theano; not sure if they have transitioned since), and it was slower than on CPUs, not for small models but for the big, SIMD-wide ones. When I ported the same thing to Pyro/PyTorch, it was clearly making good use of the GPU and was not bottlenecked by useless CPU-GPU transfers.

Maybe that’s changed by now; as they say, the only useful benchmark is your own code.


> “I have tried running PyMC3 models on GPUs (when they were on Theano; not sure if they have transitioned since), and it was slower than on CPUs, not for small models but for the big, SIMD-wide ones.”

Can you post a link to your code with some synthetic data of the sizes you’re talking about to demonstrate this? I hear it as a criticism a lot, but have never found it to be true (full disclosure: I work on a large-scale production system that uses pymc for huge Bayesian logistic regression and huge hierarchical models, both in GPU mode out of necessity).

> “Both TF and Theano require a static graph, while PyTorch lets you use Python’s regular control flow (if, for, while, etc.). This makes building modular model components much easier, since you can reason about execution mostly as if it were normal numerical Python code.”

I can’t tell from this whether you’ve actually looked into pymc (or Keras, for that matter), since in pymc GPU mode is just a Theano setting: you don’t write any Theano code, manipulate any graphs or sessions directly, or anything else. You just call pm.sample with the appropriate mode settings and it executes on the GPU.
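
To illustrate (a minimal sketch; the exact flags depend on your Theano version and CUDA setup), GPU execution is configured entirely outside the model code:

    import os
    # device=cuda on newer Theano/libgpuarray; older versions used device=gpu
    os.environ["THEANO_FLAGS"] = "device=cuda,floatX=float32"

    import numpy as np
    import pymc3 as pm

    x = np.random.randn(100000)
    y = 2.0 * x + np.random.randn(100000)

    with pm.Model():
        w = pm.Normal("w", mu=0.0, sd=10.0)
        sigma = pm.HalfNormal("sigma", sd=1.0)
        pm.Normal("obs", mu=w * x, sd=sigma, observed=y)
        trace = pm.sample(1000, tune=1000)  # executed on the GPU via Theano

The model code itself is identical to the CPU version; no Theano graph manipulation appears anywhere.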

Much like with Keras, where you can also easily use Python native control flow, context managers and so on, pymc doesn’t require low-level usage of underlying computation graph abstractions.
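
For example (a rough sketch, version details aside), plain Python control flow driving Keras model construction:

    from tensorflow import keras

    model = keras.Sequential()
    model.add(keras.layers.Dense(64, activation="relu", input_shape=(32,)))
    for width in [64, 32]:  # ordinary Python loop, no graph manipulation
        model.add(keras.layers.Dense(width, activation="relu"))
    model.add(keras.layers.Dense(1))
    model.compile(optimizer="adam", loss="mse")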

Again, I really like PyTorch too, but people seem to have only ever tried PyTorch, liked one or two things about it, forgiven the parts that are bad about it (like needing to explicitly write a wrapper for the backward calculation for custom layers, which you don’t need to do in Keras, for example), and generalized that into criticizing other tools.


I’ve contributed to pymc actually (https://docs.pymc.io/api/distributions/timeseries.html#pymc3...) and used it in research projects. So when I say I find Pyro/PyTorch easier to use, it’s not wishful thinking.

I don’t have the pymc code anymore, since we moved to Stan and are now starting to port code to Pyro.

> forgiven the parts that are bad about it (like needing to explicitly write a wrapper for the backward calculation for custom layers

Why do that when AD does it for you?


> like needing to explicitly write a wrapper for the backward calculation for custom layers, which you don’t need to do in Keras, for example

Not sure I understand - you will need to write a backward pass regardless of whether you use Keras, PyTorch, or anything else. With Keras, you would need to modify the underlying backend code (e.g. with tf.RegisterGradient or tf.custom_gradient); with PyTorch, you write the backward() function, which is about the same amount of effort.
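
For reference, a sketch of the PyTorch side of that claim (toy op, made-up name): an op whose gradient you want to supply yourself subclasses torch.autograd.Function and defines backward() explicitly:

    import torch

    class ClampedExp(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            y = torch.exp(x.clamp(max=10.0))
            ctx.save_for_backward(x, y)
            return y

        @staticmethod
        def backward(ctx, grad_out):
            x, y = ctx.saved_tensors
            # d/dx exp(clamp(x, max=10)) is y below the clamp, 0 above it
            return grad_out * y * (x <= 10.0).float()

    out = ClampedExp.apply(torch.randn(4, requires_grad=True))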


You missed the point entirely. When you compose operations in Keras, it automatically generates the backpropagation implementation; you do not need RegisterGradient, custom_gradient, or anything else if you are making new operations or layers as a composition of existing operations (whether that is logical indexing, concatenation, math functions, whatever).
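
For what it’s worth, a sketch of that in Keras (hypothetical toy activation; exact imports vary by version): a new op composed from existing ops gets its gradient for free:

    import tensorflow as tf
    from tensorflow import keras

    def swish_like(x):
        # composition of existing ops; no RegisterGradient needed
        return x * tf.sigmoid(x)

    inputs = keras.layers.Input(shape=(16,))
    hidden = keras.layers.Dense(16)(inputs)
    outputs = keras.layers.Lambda(swish_like)(hidden)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="sgd", loss="mse")  # backprop derived automatically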

In PyTorch, you still have to define the backward function and worry about gradient bookkeeping: clearing gradient values at the appropriate time and explicitly triggering their computation in verbose optimizer invocation code.

I encourage you to check out how this works in Keras, because it is simply factually different from what you are saying, in ways that are specifically designed to remove certain types of boilerplate, overhead, and bookkeeping that PyTorch requires.


No, you're wrong about PyTorch. If your custom op is a combination of existing ops, you don't need to define a custom backward pass. This is true for any DL framework with autodiff. For more details, see this answer [1].
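
A quick sketch of what that looks like in practice (toy module, made-up name):

    import torch
    import torch.nn as nn

    class GatedResidual(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.linear = nn.Linear(dim, dim)

        def forward(self, x):
            # composed entirely of existing ops: no backward() needed,
            # autograd differentiates the composition automatically
            gate = torch.sigmoid(self.linear(x))
            return x + gate * torch.tanh(x)

    layer = GatedResidual(8)
    x = torch.randn(4, 8, requires_grad=True)
    layer(x).sum().backward()  # gradients flow with no custom backward pass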

Regarding the more verbose PyTorch code for the update step, compare:

In TensorFlow:

    loss = tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=output_logits)
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
    sess.run(optimizer)

In PyTorch:

    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    loss = nn.CrossEntropyLoss()(output, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

In my opinion, PyTorch makes the parameter update process a lot easier to understand, control, and modify if needed. For example, what if you want to modify gradients right before the weight update? In PyTorch I'd do it right in my own code, after the loss.backward() call, while in TF I'd have to modify the optimizer code. Which option would you prefer?
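
Concretely, something like this (a sketch; clipping by global norm is just one example of an in-place gradient edit):

    optimizer.zero_grad()
    loss.backward()
    # gradients are plain tensors here; edit them before the update
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()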

[1] https://stackoverflow.com/questions/44428784/when-is-a-pytor...


> In PyTorch, you still have to define the backward function and worry about gradient bookkeeping: clearing gradient values at the appropriate time and explicitly triggering their computation in verbose optimizer invocation code

I’ve definitely never had to do that. Where do you get this from?



