Lessons from Optics, the Other Deep Learning (argmin.net)
531 points by mxwsn on Feb 12, 2018 | 94 comments



Another field that might be interesting to compare is analog design. There is a similar stack of theories: lumped element -> transmission line -> Maxwell's equations. And yet analog IC design depends heavily on inherited mental models from mentors and modifications of well-known topologies. Outsiders think it is black magic. The physics is all understood (nearly) perfectly, and yet knowing the details of the QM that explains MOSFET operation helps not at all (or very little) when designing actual useful circuits. The real-world considerations of parasitics, coupling, etc. dominate, and extensive formal analysis is not terribly useful. The general methodology is to make changes to the design based on intuition, simple predictive models that give you a direction, and previous experience, and then simulate to see how you did.

A ton of high-quality engineering is done based on intuition, mental models, and patterns learned over years of experience. My hunch is that deep learning will be the same.

EDIT: Just reread, and I want to clarify. I'm not saying that analog design is at the same stage of development as deep learning, or that it is anywhere near as ad hoc. Deep learning probably has a long way to go, but it could potentially end up in a similar state where years of experience is critical and intuition rules.


A couple of times, I've heard Gerry Sussman at MIT (SICP author) give a talk on how bipolar transistor circuits are actually designed. You don't use SPICE or mesh analysis except in unusual circumstances (e.g. non-linear circuits) or to fine tune a completed design.

As an example, a bias design goes something like this: "Let's see, I'll pin the base at five volts with a resistor divider. The emitter will be 0.6V below that. Then the emitter current will be (5.0 - 0.6) divided by the emitter resistor. The collector current will be essentially the same, so I can pick the collector load resistor to give me an appropriate quiescent point and make sure the output impedance is less than a tenth of the input impedance of the following stage (so I can ignore the latter)."
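For what it's worth, that back-of-the-envelope reasoning fits in a few lines of Python (a rough sketch; all component values below are hypothetical, not from any particular design):

    # Rough sketch of the bias-point arithmetic described above.
    # All component values are hypothetical illustration numbers.
    V_CC = 12.0       # supply voltage [V]
    V_BASE = 5.0      # base pinned by the resistor divider [V]
    V_BE = 0.6        # assumed base-emitter drop [V]
    R_E = 1_000.0     # emitter resistor [ohm]
    R_C = 1_000.0     # collector load resistor [ohm]

    v_emitter = V_BASE - V_BE                # emitter sits ~0.6 V below the base
    i_emitter = v_emitter / R_E              # emitter current
    i_collector = i_emitter                  # collector current is essentially the same
    v_collector = V_CC - i_collector * R_C   # quiescent collector voltage

    print(f"I_C ~= {i_collector * 1e3:.1f} mA, V_C ~= {v_collector:.1f} V")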


The classic Art of Electronics by Horowitz and Hill definitely taught this approach. It's considered to be a hard textbook for undergraduates, but is also a joy to read and study for pleasure.


Can't upvote enough, that book is a treasure. Expensive but totally worth it.


Sorry, but I can't agree. It's basically a cookbook. It doesn't present the material in an analytical style.


Is that a bad thing? For me the value wasn't in the cookbook aspect but in the explanation of the practical considerations when it comes to engineering a design.

Knowing a parameter in detail isn't nearly as important as knowing if that parameter matters in the scope of the final design.

The book is also pretty honest about that and goes to pretty good lengths to guide readers to deeper material if it's an area they want to understand in greater depth.


Seconded. This book should be given to all EE undergraduates, it’s a shame how few people know about it.


Yup, I've lent/given away my various copies over the years many times.

A true testament that even in these modern days the principles taught are still solid.


https://www.infoq.com/presentations/We-Really-Dont-Know-How-... is the version of the talk I've seen. Highly recommended to everyone.

I have yet to get his classical mechanics book. Gonna have to pull the trigger on that soon.


I haven't had a chance to ask him why he didn't cover special relativity. Seemed like a great fit for a computation-based course.

Edit: Thanks for that - it's more detailed than the talks I attended (LibrePlanet).


We were taught to do it that way, without access to SPICE or computers in general, using the data sheets and maybe a curve tracer if you were lucky.


> A ton of high-quality engineering is done based on intuition, mental models, and patterns learned over years of experience. My hunch is that deep learning will be the same.

It seems deep learning is pretty much at that state now (with the possible exception of quality). The problem is that this puts inherent limits on what can be done with it. The power of digital computing is the power of modular expansion of objects. Analog circuits and computers don't have that. And current trained deep learning models don't have combinability and modularity either.

A lot of specific engineering subfields involve this "intuition, mental models, and patterns learned over years of experience", but keeping that model for deep learning indefinitely would mean a vast proliferation of such subfields, each with its own limits, as the differences in applying deep learning techniques to different subfields become evident.


Yeah, deep learning needs a design pattern catalog like the Gang of Four book [1] and all which came after.

And while we're at it, is there something similar for analog or digital design?

[1] https://en.wikipedia.org/wiki/Design_Patterns


Jung's IC Op Amp Cookbook comes to mind, though that might be more basic than you want.


Question about analog design: how well can projects be transferred amongst engineers? Is the knowledge base relatively similar across individuals, or is it hard for others to modify designs?


Electronics engineering (all engineering, as far as possible, really) does have a good amount of modularity. So understanding a circuit is about understanding each module, or simply being told their general function. You have a general geometry that, e.g., a balun or an antenna should follow, but the details very much follow from complex electromagnetic interactions, and I assume are designed through an intuition, simulation, and testing loop. So if another engineer needs to slightly modify your design, he will usually just try adding or modifying individual modules to get the desired function -- that could include modifying the geometry of a circuit-board analog filter, re-shaping an antenna for different directivity characteristics, etc.


I worked for a couple years on control systems software without any kind of experience or formal training. We would get source code, with some initial tuned parameters, then take our robot out into the field, and re-tune our control loops based on reality.

Here was my takeaway: an engineer has to understand the domain and algorithms involved at a deep level, or they will not be productive. Or, you will need to have both an engineer and somebody with the domain knowledge and experience.

It doesn't really matter what your problem domain is. If you're an engineer, and it's your job to make changes to a system, whether code or config, you need to understand it at a deep level. And your manager needs to understand this requirement.

Otherwise, you will be guessing at changes, so your productivity will be horrible or non-existent.


Very true but the post makes a more subtle point: you don't have to have a model that explains everything as long as your model is predictive for the problem you're trying to solve. You can build a good telescope without understanding quantum mechanics. I guess that's the difference between science and engineering, broadly speaking.


> I guess that's the difference between science and engineering, broadly speaking.

Science also doesn't need to have a model that explains everything. Because if it did, we wouldn't have science.


A strange thing about deep learning, is that sometimes you're creating a model that generalizes beyond a particular domain. You might be tuning parameters to improve computer-vision and your tuned solution happens to also outperform the original at translation or partial-knowledge navigation. And your goal may very well be to create something that behaves well under all of those domains, while only having any real domain knowledge in a few of them.


That's evidence nobody knows what they're doing, not evidence that people couldn't benefit from knowing what they're doing.


I thought based on your opening sentence that you were going to say having no understanding of the problem worked out okay. So are you saying you were unproductive at your old job?


In the beginning, I was definitely not productive at tuning or making changes to a PID loop. Over time, I learned via looking at the code, reading online, and getting training from very experienced automation software engineers. I'm by no means an expert now, but we were able to turn changes around in minutes and be confident in the outcome, instead of hours/days with some doubt as to what would actually happen in the physical world.


I like the parallel between optical lens systems and deep learning. I'm also kind of disappointed by the "arcane lore" status hyper-parameters have in different ML domains. I think it would be healthier for the community to make it a habit to explicitly document why a certain topology and layer sizes were selected. It's like providing documentation with your open source project: yes, knowledgeable people could use it without it, but it's much more difficult and beginner-unfriendly.


I wonder how documentable the space of hyperparameters really is (which is I think what the OP is poking at) with the current way we conceive of them, and also with how experiments currently happen.

Often, people either reuse other people's architectures, or simply try 2 or 3 and stick with the best one, only changing the learning rate and such.

I also wonder if there's a computation issue (training is long, we can only try so many things), or if it really is that we are working in the wrong hyperparameter space. Maybe there is another space we could be working in, where the HPs that we currently use (learning rate, L2 regularization, number of layers, etc.) are a projection from that other HP space where "things make more sense".


In this regard, it is similar to how natural sciences are done. The hyperparameter space of possible experiments is immense, they are expensive, so one has to go with intuition and luck. Reporting this is difficult.

[edit:] In this analogy, deep learning currently misses any sort of a general theory (in the sense of theories explaining experiments).


> In this regard, it is similar to how natural sciences are done. The hyperparameter space of possible experiments is immense, they are expensive, so one has to go with intuition and luck. Reporting this is difficult.

I'd agree it's done in a sort-of scientific way. But I don't think you can say it's done the way natural science is done. A complex field, like oceanography or climate science, may be limited in the kinds of experiments it can do and may require luck and intuition to produce a good experiment. But such science is always aiming to reproduce an underlying reality, and the experiments aim to verify (or not) a given theory.

The process of hyperparameter optimization doesn't involve any broader theory of reality. It is essentially throwing enough heuristics at a problem and tuning enough that they more or less "accidentally" work.

You use experiments to show this heuristic approximation "works", but this sort of approach can't be based on a larger theory of the domain.

And it seems logical that there can't be a single theory of how any approximation to any domain works. You can have a bunch of ad-hoc descriptions of approximations, each of which works for a number of common domains, but it seems these will remain forever not-a-theory.


Exploring even a tiny, tiny, tiny part of the hyperparam space takes thousands of GPUs. And that is for a single dataset and model---change anything and you have to redo the entire thing.

I mean, maybe some day, but right now, we're poking at like 0.00000000001% of the space, and that is state-of-the-art progress.


A DNN might be more effective at exploring the hyperparameter space than people are with their intuition and luck. Rumor is Google has achieved this.


Google simply has the computational resources to cover thousands of different hyperparameter combinations. If you don't have that, you won't ever be able to do systematic exploration, so you might as well rely on intuition and luck.


This is not accurate. Chess alone is so complex that brute force would still take an eternity, and they certainly don't have a huge incentive to waste money just to show off (because that would reflect negatively on them).

But how does it work? It's enough to outpace other implementations, alright. But the model even works on a consumer machine, if I remember correctly.

I have only read a few abstract descriptions and I have no idea about deep learning specifically. So the following is more musing than summary:

They use the Monte Carlo method to generate a sparse search space. The data structure is likely highly optimized to begin with. And it's not just a single network (if you will, any abstract syntax tree is a network, but that's not the point), but a whole architecture of networks -- modules from different lines of research pieced together, each probably with different settings. I would be surprised if that works completely unsupervised; after all, it took months to go from beating Go to chess. They can run it without training the weights, but likely because the parameters and layouts are optimized already, and, to the point of the OP, because some optimization is automatic. I guess what I'm trying to say is, if they extracted features from their own thought process (i.e. domain knowledge) and mirrored that in code, then we are back at expert systems.

PS: Instead of letting processors run small networks, take advantage of the huge neural network experts have in their heads and guide the artificial neural network in the right direction. Mostly, information processing follows insight from other fields and doesn't deliver explanations. The explanations have to be there already. It would be particularly interesting to hear how the chess play of the developers involved has evolved since, and how much they actually understand the model.


I'm curious why you believe you can tell that my comment is not accurate when you yourself admit that you have no idea about deep learning?

Note that I'm not saying that Google is doing something stupid or leaving potential gains on the table. What I'm saying is that their methods make sense when you are able to perform enough experiments to actually make data-driven decisions. There is just no way to emulate that when you don't even have the budget to try more than one value for some hyperparameters.

And since you mentioned chess: The paper https://arxiv.org/pdf/1712.01815.pdf doesn't go into detail about hyperparameter tuning, but does say that they used Bayesian optimization. Although that's better than brute force, AFAIK its sample complexity is still exponential in the number of parameters.


Your comment reminded me of myself, so maybe I read a bit too much into it. Even given Google's resources, I wouldn't be able to "solve" chess any time soon. And it's a fair guess that this applies to most people, maybe a slightly smaller share here, so I took the opportunity to provoke informed answers correcting my assumptions. I did then search papers, so your link is appreciated, but it's all lost on me.

> they used Bayesian optimization. Although that's better than brute force, AFAIK its sample complexity is still exponential in the number of parameters.

I guess the trick is to cull the search tree by making the right moves, forcing the opponent's hand?


I think you are confused about the thing being optimized.

Hyperparameters are things like the number of layers in a model, which activation functions to use, the learning rate, the strength of momentum and so on. They control the structure of the model and the training process.

This is in contrast to "ordinary" parameters which describe e.g. how strongly neuron #23 in layer #2 is activated in response to the activation of neuron #57 in layer #1. The important difference between those parameters and hyperparameters is that the influence of the latter on the final model quality is hard to determine, since you need to run the complete training process before you know it.

To specifically address your chess example, there are actually three different optimization problems involved. The first is the choice of move to make in a given chess game to win in the end. That's what the neural network is supposed to solve.

But then you have a second problem, which is to choose the right parameters for the neural network to be good at its task. To find these parameters, most neural network models are trained with some variation of gradient descent.

And then you have the third problem of choosing the correct hyperparameters for gradient descent to work well. Some choices will just make the training process take a little longer, and others will cause it to fail completely, e.g. by getting "stuck" with bad parameters. The best ways we know to choose hyperparameters are still a combination of rules of thumb and systematic exploration of possibilities.
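To make the nesting concrete, here is a toy sketch in Python (the quadratic "loss", the ranges, and the trial count are all made up purely for illustration): an outer random search over one hyperparameter, the learning rate, wrapped around an inner gradient-descent loop over an ordinary parameter.

    import random

    def train(learning_rate, steps=100):
        """Inner problem: gradient descent on an ordinary parameter w."""
        w = 5.0                               # arbitrary starting parameter
        for _ in range(steps):
            grad = 2.0 * (w - 3.0)            # gradient of the toy loss (w - 3)^2
            w -= learning_rate * grad
        return (w - 3.0) ** 2                 # loss after training

    def hyperparameter_search(trials=20):
        """Outer problem: random search over the learning rate."""
        best_lr, best_loss = None, float("inf")
        for _ in range(trials):
            lr = 10 ** random.uniform(-4, 0)  # sample log-uniformly in [1e-4, 1]
            loss = train(lr)                  # each trial runs a full training
            if loss < best_loss:
                best_lr, best_loss = lr, loss
        return best_lr, best_loss

    print(hyperparameter_search())

Each outer trial has to run the whole inner training before you learn anything about that hyperparameter, which is exactly why the outer problem is so expensive.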


Google did do this. It took ungodly amounts of computing power and only did slightly better than random search. They didn't even compare to old fashioned hill climbing or Bayesian optimization.


I've considered gradient descent for optimizing parameters on toy problems at university a few times. Never actually did it, though; it's a lot of hassle for the advantage of less interaction, at the cost of no longer building some intuition.


A step in the right direction would be to encourage sharing negative results. It's important to know what to avoid too.


It is not often the case that someone actually knows why a hyperparam or architecture choice works. We pretend, sometimes, but frankly, it's mostly made up junk to cover the fact that most ML research involves a huge amount of intuitive guesswork and trial-and-error.

And the loss surfaces vary. Even just changing the dataset or even the input size alters the loss surface and can easily break a model.

It's not called Gradient Descent by Grad Student for nothing.


>I think it would be healthier for the community to make it a habit to explicitly document why a certain topology and layer sizes were selected.

Also, which other topologies were tried and failed to produce good results. It's amazing that this information is missing from most modern ML papers.


Scary possibility: what if there is no good formal theory to explain how it works? What if intelligence, both animal and machine, is purely random trial and error and "this thing seems to work"?

I don't believe that's necessarily true, but it would sure hamper the author's hopes.


>What if intelligence, both animal and machine, is purely random trial and error and "this thing seems to work"?

Then you apply statistics. Which are the foundations of machine learning.


In the same way that there is no formal theory for the exact shapes of proteins? I think it’s possible. But as with proteins, there are probably some general aspects of the problem that can be explained in simpler terms.


This might indeed be true for deep learning in its current shape. However, in the long run I think we will see (1) models that are more composable and (2) better approaches to engineering such models; these engineered models will look a bit different than what your research scientist throws over the fence today.

At the moment we are in a phase where, to stick to the optics metaphor, we stack up lenses until we see the object on the screen. This means we end up with models that are sprawling, instead of models that were engineered.

Another trend that seems to be starting in deep learning is that layers become more constrained. I expect that in 20 years we will see much more constrained models and much more generative models.


1. deep learning models are extremely composable

2. hopefully with time we'll have better approaches to engineer all things that are engineered

No, at the moment we go for the biggest and shiniest lens that we can get our hands on and hope that it's capable enough to tackle our problem. If it is, we can waste time designing a smaller, more constrained lens to ship to consumers.


I'm curious if you two are going to be talking past each other with the first point. Any chance I could get you both to explore what you mean by composable?


These days any method that uses gradient descent to optimize a computational graph gets branded as deep learning. It's a very general paradigm that allows for almost any composition of functions as long as it's differentiable. If that's not composable then I don't know what is.
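A minimal PyTorch-flavored sketch of what "compose anything differentiable" means in practice (module shapes are arbitrary; this is an illustration, not anyone's actual model):

    import torch
    import torch.nn as nn

    # Arbitrary differentiable pieces...
    encoder = nn.Sequential(nn.Linear(10, 32), nn.ReLU())
    head_a = nn.Linear(32, 1)
    head_b = nn.Sequential(nn.Linear(32, 16), nn.Tanh(), nn.Linear(16, 1))

    x = torch.randn(8, 10)
    y = torch.randn(8, 1)

    # ...composed however you like; the whole graph stays differentiable,
    # so gradient descent can train every part end to end.
    z = encoder(x)
    loss = ((head_a(z) + head_b(z) - y) ** 2).mean()
    loss.backward()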

There's a reason why Lecun wanted to rebrand deep learning as differentiable programming. https://www.facebook.com/yann.lecun/posts/10155003011462143

I'm not sure what wirrbel meant.


My guess for what they meant was that you can't compose the trained models.

For example, a classifier that tells you cat or not can't be used with one that says running or not to get running cat.

The benefit being that you could put together more "off the shelf" models into products. Instead, you have to train up pretty much everything from the ground up. And we compare against others doing the same.

@wirrbel, that accurate?


You can do that, you just need a good embedding space.


You have good examples? All ways I can think of for doing this are worse than just training the individual models.


What makes you think this? So far, I don't think any field actually fits this description. (As such, my personal priors for this are quite low.)

Are there fields where this is an apt description?


> Scary possibility: what if there is no good formal theory to explain how it works?

It's scary, but to my thinking, inevitable. It was all well and good for the early atomic scientists to say that "Math is unreasonably effective at explaining Nature," [1] but our level of understanding of both mathematics and natural law is still superficial in several important areas. The universe doesn't owe us a formal theory of anything, much less everything.

It seems likely that we will soon start building -- and relying upon -- software that no human actually understands. The math community is already having to confront this dilemma to some extent, when an outlying figure like Mochizuki releases a massive work that takes months for anyone else to understand, much less prove or refute.

At some point we will have to give up, and let the machines maintain our models for us.

[1] https://www.dartmouth.edu/~matc/MathDrama/reading/Wigner.htm...


> what if there is no good formal theory to explain how it works?

We don't need a theory that is perfect. Each theory was partially wrong but still lets you make useful predictions about the world. We need useful models that let you reason about the world. All models are wrong, some are useful.

> What if intelligence, both animal and machine, is purely random trial and error and "this thing seems to work"?

Evolution could just be considered random trial and error. However, until we reach the singularity, we need people to speed up the evolution process by adapting and remixing pieces that worked before. We need models for what each level does so we have ideas of what to try for a new application.


I think the gp is referring to something a little more general. Let's pretend for a moment that our minds can be modeled by a formal system. Every thought has a chain of axioms grounding it. So, the scary part is, Godel showed us there are true things that can't be represented by a formal system.

Maybe the useful models exist, but we can't comprehend them, because they're true outside of the set of rules we happened to get built into our minds?

generally though, i'm on board with you. all models are wrong, some models are useful.


> Godel showed us there are true things that can't be represented by a formal system.

This is stretching things a bit. Specifically, it defines truth as 'does not lead to a contradiction in (some formal system that extends) Peano arithmetic'. Then, as there are statements that are 'true' in this sense in such a system A but not provable by that system, there are 'unprovable truths'.

But is that satisfactory as a definition of truth? It used to be, because we had hope for a complete and consistent formal system, which felt very truthy. When Godel proved that such a system cannot exist, perhaps the conclusion is that formal systems aren't the 'base' for truth.


I am saying there is no connection between the two. You don't need a single complete formal system for a useful model of what is occurring. Godel was saying there can't be one true, complete set of axioms for mathematics. This does not mean all mathematical models and theories are suddenly impossible.


>Let's pretend for a moment that our minds can be modeled by a formal system.

Since our minds are more like perception-action variational inference systems, that's a hell of a lot of pretending ;-).


Extrapolating this concept in various and biased directions... you could say that intelligence, behavior, and evolution (defined as physical iterative behavior across successive reproductive generations) are products of random changes/mutations, reinforced by gradient descent in the form of Survival of the Fittest.


Then maybe we should put more resources to study other approaches and higher-level concepts, like statistical learning research done by Vladimir Vapnik.


Look at how many times this article was submitted and how long ago it was submitted and never gained traction.

https://news.ycombinator.com/from?site=argmin.net


HN's algorithm could be improved if new stories were briefly shown on the front page so a few people could see them and have a chance of voting. Like how new comments start at the top of the thread and fall if they don't get votes.


If this is interesting to you, it will also be interesting to you that lenses perform the physical equivalent of an analog Fourier transform, and physicists exploited this to compute wave spectra well before digital computers existed.


Similarly, the human cochlea performs a physical Fourier transform of sound waves, and acoustic engineers used similar principles to create paper spectrograms back in the 1950s (check out the Kay Electric Co. Sona-Graph).
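(Aside: the digital version of what the Sona-Graph did on paper is only a few lines these days; a rough sketch, assuming SciPy is available and using a synthetic signal:)

    import numpy as np
    from scipy.signal import spectrogram

    # Decompose a signal into frequency content over time, the same job the
    # cochlea (and the Sona-Graph) does physically.
    fs = 8000                                   # sample rate [Hz]
    t = np.arange(0, 1.0, 1 / fs)
    x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

    freqs, times, power = spectrogram(x, fs=fs, nperseg=256)
    # power[i, j] is the energy near frequency freqs[i] at time times[j].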


Thanks for introducing me to this blog! This is the money quote for me:

> There’s a mass influx of newcomers to our field and we’re equipping them with little more than folklore and pre-trained deep nets, then asking them to innovate.


As one of the recent newcomers, should I feel defensive when I read something like this? I understand there are people with much more knowledge. Isn't this true of everyone, in every field?

The message I've gotten is "try things out". Innovation isn't necessarily improving specific techniques, but applying them to new fields. To apply techniques to things that are more mundane like data processing in non-AI-focused companies, you're gonna need bodies who know how to apply these newer programming techniques to solve problems.

Not every electrician has to understand electrical engineering.


i don't think so. the author here also helped co-write the "deep learning is alchemy" talk that was somewhat controversial at NIPS.

i think this is especially important if you purely want to do applications. we have a bag of tricks (dropout, batchnorm, different optimizers and learning rate schedules). we have no real theory for why any of this should work; often a proposed explanation will later turn out not to make sense.

so the choice of how to train things comes down to "folklore", the community's collective experience. and there's no guarantee that folklore will generalize to your new architecture or dataset, and no way to know whether it even should.

the presentation seems to have struck a nerve and there's papers and talks floating around now examining the performance of common architectures in very simple settings. it's probably worth paying attention to these at least in the background, as it will hopefully crystallize into a body of knowledge that will be useful for someone trying to decide on architectures and optimization techniques.


I am new as well. I think the folklore aspect is not because the established people in the field are bad teachers, but because even they don't have much rationale besides "this is what seems to work". It's a new field, that's fine. Innovations are still as simple as "Oh we used a cyclical learning rate instead of constant" and boom.


The flaw (geddit?) in this is that deep networks are processing data from different domains, with different characteristics (which change in chaotic ways). In ML the choice was always "the simpler the better" - we used statistics and information theory to apply Occam's razor. Deep networks don't work that way, and they do work well in some real domains; nature does not always prefer simple domain theories. If the laws of the universe suddenly changed, designing optics would suddenly be difficult as well.


A comment below makes an important point that I think is worth repeating:

"The power of digital computing is the power of modular expansion of objects. Analog circuits and computers don't have that. And current trained deep learning models don't have combinability and modularity either."

A point I'd like to make: the brain exhibits properties of both digital and analog computers. It also exhibits repeating units in the neocortex which do vary but are uniform enough that neuroscientists are comfortable classifying them as discrete units within the brain.

I believe we must look to how the brain implements effective modularity in the context of analog computation in order to replicate the success of digital computers with deep nets.


One of the problems with coming up with a good theory is that, at the end of the day, we're building a system that's particularly suited for a certain kind of pattern. If you're building a facial-recognition convnet, there is something about the dataset of faces that is going to influence what works and what doesn't.

When you're building digital circuits, they're expected not to care about what the bits mean, which patterns are more likely. It works for all possible inputs, with equal quality.

There are things in common with how you would process faces and how you would recognize other visual objects, and that's why there are design patterns such as "convolutional layers come before fully-connected layers".
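For the record, a bare-bones sketch of that pattern (PyTorch-style; all sizes are arbitrary, assuming 32x32 RGB inputs):

    import torch.nn as nn

    # Convolutional layers first (exploit local image structure),
    # fully-connected layers after. Sizes are illustration values only.
    model = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(32 * 8 * 8, 64), nn.ReLU(),   # 32x32 input -> 8x8 after two pools
        nn.Linear(64, 10),
    )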

In a way, the "no free lunch" theorem says that you are always paying a price when you specialize to a certain kind of pattern. It comes at the detriment of other patterns. So, any kind of stack of theories on ML/DL is going to be incomplete unless you say something about the nature of your data/patterns.

(That doesn't mean that we can't say anything useful about DL, but it does put a certain damper on those efforts.)


It could also be that the younger engineers are engineers by training, while the more senior members of the team are PhDs in the field.

What I'm trying to say is that PhDs come from an academic research background, while engineers come from a product-focused background. The deep learning field is still dealing with a lot of unknowns, counter-intuitive responses to modifications, and pure experimentation. The engineers might just not realize the need for continued experimentation, and, for them, it may just feel like an undesirable waste of time to fiddle with parameters (as in, taking away time from developing the actual product).

It's an alternate point of view, but something that I experienced.


This reminds me of ACDC, a Deep-fried Convnets-like[0] approach by some NVIDIA employees. [1] See section 1.1, where they state they could perform the operation in analog.

[0]: https://arxiv.org/abs/1412.7149

[1]: https://arxiv.org/pdf/1511.05946


I really miss having a building-block rationale for all the perception/classification/segmentation networks out there.

The only thing I've found really useful so far is to add two fully-connected layers if the classifier doesn't handle classification well... just because you needed a hidden perceptron layer for the XOR case.
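For the curious, the XOR case itself in a few lines (a PyTorch-style sketch; the hidden-layer size, learning rate, and step count are arbitrary):

    import torch
    import torch.nn as nn

    # The classic XOR case: a single linear layer can't separate it,
    # but one small hidden layer can. All training settings are arbitrary.
    x = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    y = torch.tensor([[0.], [1.], [1.], [0.]])

    model = nn.Sequential(nn.Linear(2, 4), nn.Tanh(), nn.Linear(4, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.5)
    loss_fn = nn.BCEWithLogitsLoss()

    for _ in range(2000):
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()

    print(torch.sigmoid(model(x)).round())  # should recover [0, 1, 1, 0]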

I hope to find more examples like that. If you know them, please share!!


Deep learning is the new alchemy. So perhaps we can learn from alchemists?

PS: The parallel between DL and optics is (if viewed historically) a bit misleading, because for building lenses we first had a theory.


What the author refers to as "randomization strategies" is in fact regularisation: a set of techniques to prevent overfitting.
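Dropout is probably the clearest example of such a randomization-as-regularisation trick; a minimal sketch (sizes arbitrary):

    import torch.nn as nn

    # Randomly zeroing half the activations during training discourages the
    # network from leaning on any single unit, which reduces overfitting.
    model = nn.Sequential(
        nn.Linear(100, 64), nn.ReLU(),
        nn.Dropout(p=0.5),   # active under model.train(), a no-op under model.eval()
        nn.Linear(64, 10),
    )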


Tangential to the topic, but one of the references in this article is this amusing paper:

http://nyus.joshuawise.com/batchnorm.pdf

... which references an even better one:

http://pages.cs.wisc.edu/~kovar/hall.html

We've been having a solid laughfest in the office for the past 10 minutes or so.


> http://pages.cs.wisc.edu/~kovar/hall.html

This reminds me of my bioinformatics class. The final project was to reproduce the results of a famous paper in the field.

All of us spent _weeks_ trying to do it. Nobody succeeded. The more we dug into the paper, the more holes appeared. There were variables missing in the paper, assumptions not covered, datasets not properly specified, etc. It made reproduction nearly impossible; like winning the lottery. Imagine trying to recreate the results of a deep learning paper without the paper specifying _any_ information about the layers used, their sizes, or any hyperparameters.

The professor was equally mystified.

Years later I learned this kind of pseudo-science is rife in the field of bioinformatics. I felt both a sense of relief in knowing we weren't crazy, and disappointment. I actually really liked that class; the field of bioinformatics fascinated me. But realizing what a cesspool it was left me disappointed.

I'm glad machine learning as a field has taken proactive steps to avoid these exact kinds of issues. It's now common practice in ML to publish code and models alongside your papers, and most ML libraries allow deterministic training. This makes reproduction of results easy. It's a breath of fresh air. That doesn't obviate all problems. Methodologies and conclusions are still up for debate in any given paper. But at least the experiments themselves are reproducible. And if you question the methodology or some aspect of the experiment, you can go in and augment the experiment yourself.


It's not just bioinformatics. Wet-lab biology is often similar.

A friend of mine spent a good chunk of his PhD trying to reproduce an experiment involving growing primary cells in serum-free medium (the idea was to use that experiment as a starting point, and explore more aspects of it). The protocol was:

1. Grow some regular immortal cells in serum-based media in a dish, so they coat the dish with extracellular matrix

2. Use trypsin to detach the cells from the dish and remove them

3. Wash the dish carefully to remove all traces of serum, but leaving the extracellular matrix

4. Plate the primary cells onto the dish and grow them in serum-free medium

He tried for months and couldn't get the cells to grow. Then he got sloppy, didn't wash the dishes as carefully as he should, and bingo, the primary cells grew fine, as described in the original paper.

After some subtle digging, the inescapable conclusion was that the original authors had not washed their plates all that carefully either, and the serum-free medium was not exactly that. The whole premise of the experiment was flawed.


> I'm glad machine learning as a field has taken proactive steps to avoid these exact kinds of issues.

Yeah about that, I've got some bad news...


I would say this kind of lesson is the most valuable you could learn. Having people with this kind of experience enhances any field they participate in.

Dieselgate started with a team of students attempting to reproduce VW's claimed emission numbers.


Curious, can you link the paper?


The paper was Neuronal Transcriptome of Aplysia: Neuronal Compartments and Circuitry by Moroz

This was a decade ago. Looking at the paper again I believe we only tried to reproduce a small portion of it; the phylogeny tree from the paper and its supplemental material.


As funny as this may sound, there's some deep value in this: We deeply need more incentives in the academic world to report non-findings like that. We have a strong publication bias towards positive results, which has already become a huge problem. Moreover, we should have a much stronger focus on repeatability of experiments.


And maybe this is exactly the trick, make it low key and have a sense of humor about it.

Sounds like a site dedicated to "My Ass" results would be extremely popular with grad students and real-world researchers. Being able to know "it's not just me" and maybe even avoid some of the stumbling blocks others have run into, or to not just blindly use some approach that happened to work for one experiment but seems to fail for many others.


> And maybe this is exactly the trick, make it low key and have a sense of humor about it.

I agree in general, but I'd also love to see those published as actual beautiful papers, not just ugly-formatted websites. (Okay, the website in this case isn't that bad. At least it's clearly structured and readable.)

These would be mostly short papers, for sure. But there could be a separate section in the journals for them - just like the "outtakes" section at the end of a movie.


This particular non-finding is mainly about a possible error in tensorflow contrib package, which the author self-admittedly wasn't arsed enough to research into. Probably because of time concerns?


Maybe.

But if this encourages other people to write a follow-up paper that fixes the issue, it would still serve an important purpose.


I've gotten results like the "batchnorm, my ass" paper a few times. Now I know what to cite!

Medicine has a good tradition of adverse clinical writeups. "Patient presented with X symptoms, I administered Y treatment as recommended by [Z], and the patient got worse." One such writeup isn't conclusive evidence against Y, but suggests an issue to look into.


The former paper appeared in SIGBOVIK 2017 (http://www.sigbovik.org/), where you can find other such papers. My personal favorites are "On the Turing Completeness of PowerPoint" (https://www.youtube.com/watch?v=uNjxe8ShM-8) and "Automated Distributed Execution of LLVM code using SQL JIT Compilation"

> Following the popularity of MapReduce, a whole ecosystem of Apache Incubator Projects has emerged that all solve the same problem. Famous examples include Apache Hadoop, Apache Spark, Apache Pikachu, Apache Pig, German Spark and Apache Hive [1]


I can feel for Mr. Kovar (the writer of the latter link).

Looking at his resume, he did wise up and did his master's thesis in computer science. I trust he's happier now than as an undergrad student.


Ge is easily "soldered" using indium and an ultrasonic soldering iron. If you don't have those, good luck.

The standard technique is to set up a "Kelvin probe", with four contacts on the Ge sample. Pass a current from a constant current source (an IC or FET these days) between the outer two contacts and measure the voltage across the inner ones.
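Reducing the readings is the easy part; a back-of-the-envelope sketch using the standard collinear four-point-probe formula for a thick sample (all values below are hypothetical):

    import math

    # Pass current I through the outer contacts, read voltage V across the
    # inner ones. Values are hypothetical; equal probe spacing s, bulk sample.
    I = 1e-3    # source current [A]
    V = 2.3e-3  # inner-contact voltage [V]
    s = 1e-3    # probe spacing [m]

    resistivity = 2 * math.pi * s * V / I   # [ohm*m], semi-infinite-sample formula
    print(f"rho ~= {resistivity:.3f} ohm*m")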

It doesn't sound like his lab assistant set up something at which he could succeed, and that's a shame. He couldn't even repeat the room temperature reading.


The referenced paper on "The Great Epsilon Shortage" is short but hilarious: http://www.massey.ac.nz/~ctuffley/papers/epsilon.pdf


Dropout is a good track! ...I'm not quite sure what the inference was on your first link (it's reference #8 on the Joshua Wise PDF) but found it amusing to come across.


Oh dear God why did I read those while eating ramen


The difference is the character of the non-linearity.



