mjw's comments

> A minor key has sharps and flats to create tension.

The natural minor (Aeolian mode) doesn't use any notes outside the diatonic scale (probably what you meant by "sharps and flats"). It's very possible to write sad music in the natural minor -- R.E.M.'s "Losing My Religion", for example.

> To sound sad, you MUST shift to a minor key

With sufficient skill you can write sad music in any key or scale; there are no hard-and-fast rules here. Tonality is only one of the elements you can use to shape the emotion that's conveyed, and it's all quite culturally relative too.


This is very neat. That said, the reason these methods haven't received much attention so far is that relatively few people actually need to compute Jacobians or Hessians directly.

Often only Hessian-vector products or Jacobian-vector products are required, and these can be computed via more standard autodiff techniques, usually a lot more efficiently than if you were to compute the Hessian or Jacobian directly.
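
For example, here's a minimal sketch of a Hessian-vector product via forward-over-reverse autodiff using JAX (the `loss` function and its inputs are just placeholders):

    import jax
    import jax.numpy as jnp

    def loss(w):
        # Placeholder objective; stands in for whatever model/loss you have.
        return jnp.sum(jnp.tanh(w) ** 2)

    def hvp(f, w, v):
        # Differentiate grad(f) in the direction v: gives H @ v without
        # ever forming the Hessian, at roughly the cost of a few gradients.
        return jax.jvp(jax.grad(f), (w,), (v,))[1]

    w = jnp.ones(5)
    v = jnp.arange(5.0)
    print(hvp(loss, w, v))  # same shape as w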

Also, for models with lots of parameters, the Jacobian and Hessian are usually impractically large to realise in memory (the Hessian alone is N^2 in the number of parameters).

Nevertheless, the symbolic tensor calculus approach is very appealing to me. For one thing, it could make it a lot easier to see, in readable symbolic notation, what the gradient computations in standard backprop look like, and it could perhaps make it easier to implement powerful symbolic optimizations.


True, for large-scale problems Hessian-vector products are often the way to go (or ignoring second-order information entirely). However, first computing an expression for the Hessian symbolically and then taking the product with a vector is still more efficient than using autodiff for Hessian-vector products -- it's not in the paper, though, and the gain is rather small (just a factor of two or so). But true, computing full Hessians or Jacobians is only useful for problems involving up to a few thousand parameters.


When I started out in ML I was really keen to learn about the most 'mathsy' approaches out there.

I think with hindsight, it's great to have a broad spectrum of methods available to you, but if you focus too much on methods at the hard-math end of the spectrum just for the sake of an intellectual challenge, you can end up fixated on an exotic solution looking for a problem while the rest of the field moves on, rather than doing useful engineering people care about.

Maybe you find a niche where something exotic really helps, maybe you don't -- maybe for research this is a risk worth taking. But just something to keep in mind.

IMO: breadth is good. Mathematical maturity helps. If one sticks around, one finds uses for interesting maths eventually, but it's not worth trying to force it.

Another avenue for people who want to use some hardcore maths: try to use it to find some good theory around why the things that work well, work well. Not an easy task either, by any means.


My main quibble with this paper is:

> For deeper networks, Corollary 2.4 states that there exist “bad” saddle points in the sense that the Hessian at the point has no negative eigenvalue.

To me these sound just as bad as local minima. Also, I don't think it's standard to call something a saddle point unless the Hessian has negative as well as positive eigenvalues; otherwise there's no "saddle", just something more like a valley or plateau.

They claim that these can be escaped with some perturbation:

> From the proof of Theorem 2.3, we see that some perturbation is sufficient to escape such bad saddle points.

I haven't read through the (long!) proof in detail, but it doesn't seem obvious to me why these would be any easier to escape via perturbation than a local minimum would be. I think this could use some extra explanation, as it seems like an important point for the result to be useful. Did anyone figure this bit out?


A saddle is a critical point that's not a local extremum -- the Hessian could just be zero, for example, like x^4 - y^4 at (0,0).


Ah yep, true. I'd forgotten you can still get the saddle effect from higher-order derivatives; the Hessian eigenvalues aren't enough to characterise it.

I was thinking of examples like (x-y)^2 at zero, although I guess that's still a local minimum, just not a unique local minimum in any neighbourhood.
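
A quick SymPy check of both examples, just to make the distinction concrete:

    import sympy as sp

    x, y = sp.symbols('x y')

    # x^4 - y^4: critical point at the origin with a zero Hessian,
    # yet it's a saddle (increases along x, decreases along y).
    f = x**4 - y**4
    print(sp.hessian(f, (x, y)).subs({x: 0, y: 0}))  # zero matrix

    # (x - y)^2: the Hessian is [[2, -2], [-2, 2]], eigenvalues 0 and 4,
    # and every point on the line x = y is a (non-strict) local minimum.
    g = (x - y)**2
    print(sp.hessian(g, (x, y)))
    print(sp.hessian(g, (x, y)).eigenvals())  # {0: 1, 4: 1}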


Their answer is pretty much 'because it's based on the log-odds', which to me is still only very mild motivation.

There are other non-linearities which people use to map onto (0, 1); for example, probit regression uses the normal CDF [0]. In fact you can use the CDF of any distribution supported on the whole real line, and the sigmoid is an example of this -- it's the CDF of a standard logistic distribution [1].

There's a nice interpretation of this using an extra latent variable: for probit regression, you take your linear predictor, add a standard normal noise term, and the response is determined by the sign of the result. For logistic regression, it's the same thing except with a standard logistic instead.

This then extends nicely to ordinal regression too.
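
If it helps, here's a tiny simulation of that latent-variable view (the value of the linear predictor and the sample size are arbitrary): threshold the linear predictor plus noise at zero, and the implied P(y = 1) matches the normal CDF under probit and the sigmoid under logit.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    eta = 0.7               # some value of the linear predictor
    n = 1_000_000

    # Probit: latent noise is standard normal.
    y_probit = (eta + rng.standard_normal(n)) > 0
    print(y_probit.mean(), stats.norm.cdf(eta))       # both ~0.758

    # Logit: latent noise is standard logistic, giving the sigmoid.
    y_logit = (eta + rng.logistic(size=n)) > 0
    print(y_logit.mean(), 1 / (1 + np.exp(-eta)))     # both ~0.668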

[0] https://en.wikipedia.org/wiki/Probit_model

[1] https://en.wikipedia.org/wiki/Logistic_distribution


There are other nice properties. For example, because the logit link is canonical for the binomial GLM, inference about unknown parameters using it is based on sufficient statistics.

It's certainly not the only option though, and not always the best fit.


Ah yep, I forgot it's the canonical link. That's more of a small computational convenience though, right, at least when fitting a straightforward GLM -- it should be very cheap to fit regardless.

I suppose the logistic having heavier tails than the normal is probably the main consideration in motivating one or the other as the better model for a given situation.

The logistic, being heavier-tailed, is potentially more robust to outliers. In terms of binary data, that means it might be a better choice in cases where an unexpected outcome is possible even in the most clear-cut cases. Probit regression, with its lighter normal tails, might be a better fit in cases where the response is expected to be pretty much deterministic in clear-cut cases, and where quite severe inferences can be drawn from unexpected outcomes in those cases. Sound fair?
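
To put rough numbers on the tail difference (the cutoffs are arbitrary):

    from scipy import stats

    # P(latent noise exceeds t): the logistic tail decays like exp(-t),
    # the normal tail like exp(-t^2 / 2), so the logistic is far heavier.
    for t in (2.0, 4.0, 6.0):
        print(t, stats.logistic.sf(t), stats.norm.sf(t))
    # 2.0  ~1.2e-01  ~2.3e-02
    # 4.0  ~1.8e-02  ~3.2e-05
    # 6.0  ~2.5e-03  ~9.9e-10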


Is there a natural justification for the logistic distribution though?


See the other replies above, but: the logistic has heavier tails than the normal, so it might do better in cases where we need robustness -- where unexpected outcomes remain possible even when the linear predictor is relatively big, and we want to avoid drawing extreme inferences from them.

Probit might lead to more efficient inferences in cases where the mechanism is known to become deterministic relatively quickly as the linear predictor gets big.

You could go further in either direction too (more or less robust) by using other link functions.


Pretty much any kind of mathematical modelling that involves uncertainty, really.

Making inferences and predictions from data, in the presence of uncertainty.

Analysis of the properties of procedures for doing the above.

If you want examples that avoid the feel of just "curve fitting" (I assume you mean something like "inferring parameters given noisy observations of them"), maybe look at models involving latent variables. Bayesian statistics has quite a few interesting examples.


Thanks! I had a course at uni named Probability and Statistics, but since it was the first (and only) such course in the EE curriculum it was oriented toward probability, and statistics was an afterthought (I only remember simple linear and multilinear regression). That is probably the main reason I only see curve fitting everywhere :)


If anything, to me a lot of deep learning literature seems to lack the statistical insight and theory that's available to other subfields in machine learning (whether the Bayesian/graphical models camp, the statistical learning theory camp...)

If this book is trying to do more to bring statistical or probabilistic insights to bear on deep learning, then I think that's a very good thing. It might make it less accessible to those coming from a pure computer science background, but potentially more so to those who like to think about machine learning from a probabilistic modelling perspective.

If they're using stats jargon in a gratuitous way that doesn't actually cast any light on the material then that's another thing, but from a quick skim I didn't see anything particularly bad on this front. Do you have any examples of the kind of jargon you're talking about?

To others reading, I just wanted to emphasise that statistics is really important in machine learning! Deep learning lets you get away with less of it than you might need elsewhere, but that doesn't mean one can treat it as an unnecessary inconvenience. It's a language you need to learn, especially if you want to try and get to the bottom of how and why aspects of deep learning work the way they do, as opposed to treating it as just an empirical "using GPU clusters to throw lots of clever shit at the wall and see what sticks" engineering field. Bengio seems very interested in these kinds of questions and I'm glad he's leading research in that direction, even if clear answers and intuition aren't always easy to come by at this point.


It's more an empirically verified thing than a mathematical fact; there's nothing magic about 16 bits AFAIK. Empirically, 16 bits seems to work well enough for some tasks, taking it down to 8 bits is usually taking it too far, and performance-wise there's not a lot of point playing with values in between, e.g. 12 bits.

(Half-float arithmetic is implemented natively on recent CUDA hardware, compute capability 5.3 and up, and is quite convenient; in particular it halves memory bandwidth requirements, which are often the bottleneck.)
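
For a concrete feel of the trade-off, a small numpy illustration of what half precision gives up and gains:

    import numpy as np

    # float16 keeps ~3 significant decimal digits and tops out around 65504,
    # but stores each value in 2 bytes instead of 4.
    print(np.float16(1.0001))                     # 1.0 (rounded away)
    print(np.finfo(np.float16).eps)               # ~0.000977
    print(np.finfo(np.float16).max)               # 65504.0
    a = np.ones(1_000_000, dtype=np.float32)
    print(a.nbytes, a.astype(np.float16).nbytes)  # 4000000 2000000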

Stochastic gradient descent is fairly robust to noisy gradients -- any numerical or quantisation error that you can model approximately as independent zero-mean noise can be 'rolled into the noise term' for SGD without affecting the theory around convergence [0]. It will increase the variance of course, which when taken too far could in practice mean divergence or slow convergence under a reduced learning rate, perhaps to a poorer local minimum.

With extreme quantisation (like binarisation), the error can't really be modelled as independent and zero-mean, UNLESS you do the kind of stochastic quantisation mentioned. From what I hear this works well enough to allow convergence, but accuracy can take quite a hit. I don't think it has to be 'implemented natively', although no doubt that would speed it up; a large part of the benefit of quantisation during training is not so much to speed up arithmetic as to reduce memory bandwidth and communication latency.
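
As a rough sketch of the kind of stochastic quantisation meant here (illustrative only, not any particular library's scheme): round each value up or down at random, with probability given by its position between the two neighbouring grid points, so the quantisation error is zero-mean in expectation.

    import numpy as np

    def stochastic_round(x, step, rng):
        # Snap x onto a grid of spacing `step`, rounding up with probability
        # equal to the fractional remainder, so E[rounded] == x.
        scaled = x / step
        low = np.floor(scaled)
        frac = scaled - low
        return (low + (rng.random(x.shape) < frac)) * step

    rng = np.random.default_rng(0)
    g = rng.normal(size=100_000)              # stand-in for a gradient vector
    q = stochastic_round(g, step=2.0 ** -8, rng=rng)
    print((q - g).mean())                     # ~0: error looks like zero-mean noise
    print(np.abs(q - g).max())                # bounded by the grid spacing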

[0] https://en.wikipedia.org/wiki/Stochastic_approximation#Robbi...


Yep. To elaborate: really big batch sizes can speed up training-data throughput, but usually mean that less is learned from each example seen, so time to convergence might not necessarily improve (it might even increase, if you take things too far).

Training data throughput isn't the right metric to compare -- look at time to convergence, or e.g. time to some target accuracy level on held-out data.


Warp-CTC implements one specific model (or at least, one specific loss function), it's not really a general framework in the same way as the other libraries mentioned.

