Where is Noether's principle in machine learning? (cgad.ski)
296 points by cgadski 6 months ago | 76 comments



A related paper I just found and am digesting: https://arxiv.org/abs/2012.04728

Softmax gives rise to a translation symmetry, batch normalization to a scale symmetry, and homogeneous activations to a rescale symmetry. Each of these induces its own conserved quantities over the course of training.
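Those symmetries are easy to sanity-check numerically. Here's a small numpy sketch (toy shapes and random weights, not anything from the paper itself) showing the softmax translation symmetry and the ReLU rescale symmetry:

```python
import numpy as np

rng = np.random.default_rng(0)

# Translation symmetry of softmax: adding a constant to every logit changes nothing.
def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

z = rng.standard_normal(5)
print(np.allclose(softmax(z), softmax(z + 3.7)))   # True

# Rescale symmetry of a homogeneous activation (ReLU): scaling one layer up and the
# next layer down by the same positive factor leaves the network function unchanged.
W1, W2 = rng.standard_normal((8, 5)), rng.standard_normal((3, 8))
x, alpha = rng.standard_normal(5), 2.5
out1 = W2 @ np.maximum(W1 @ x, 0)
out2 = (W2 / alpha) @ np.maximum((alpha * W1) @ x, 0)
print(np.allclose(out1, out2))                     # True
```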


That's also a neat result! I'd just like to highlight that the conservation laws proved in that paper are functions of the parameters that hold over the course of gradient descent, whereas my post is talking about functions of the activations that are conserved from one layer to the next within an optimized network.

By the way, maybe I'm being too much of a math snob, but I'd argue Kunin's result is only superficially similar to Noether's theorem. (In the paper they call it a "striking similarity"!) Geometrically, what they're saying is that, if a loss function is invariant under a non-zero vector field, then the trajectory of gradient descent will be tangent to the codimension-1 distribution of vectors perpendicular to the vector field. If that distribution is integrable (in the sense of the Frobenius theorem), then any of its integrals is conserved under gradient descent. That's a very different geometric picture from Noether's theorem. For example, Noether's theorem gives a direct mapping from invariances to conserved quantities, whereas they need a special integrability condition to hold. But yes, it is a nice result, certainly worth keeping in mind when thinking about your gradient flows. :)
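To spell out the picture I have in mind (just a sketch, writing L for the loss and X for the vector field): invariance of L under X means ⟨∇L, X⟩ = 0 everywhere, so the gradient flow satisfies

$$\dot\theta = -\nabla L(\theta) \in X(\theta)^\perp.$$

If the codimension-1 distribution X^⊥ is integrable with a local integral f (so that ∇f is parallel to X), then along the flow

$$\frac{d}{dt} f(\theta(t)) = \langle \nabla f, \dot\theta \rangle = -\langle \nabla f, \nabla L \rangle \propto \langle X, \nabla L \rangle = 0.$$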

By the way, you might be interested in [1], which also studies gradient descent from the point of view of mechanics and seems to really use Noether-like results.

[1] Tanaka, Hidenori, and Daniel Kunin. “Noether’s Learning Dynamics: Role of Symmetry Breaking in Neural Networks.” In Advances in Neural Information Processing Systems, 34:25646–60. Curran Associates, Inc., 2021. https://papers.nips.cc/paper/2021/hash/d76d8deea9c19cc9aaf22....


I wouldn't call drawing a distinction between an isomorphism and an analogy to be maths snobbery. I would call it mathematics. :)


Not GP, but thanks for your detailed comment and the paper reference.


This is one of those links where just seeing the title sets you off, thinking about the implications.

I'm going to have to spend more time digesting the article, but one thing that jumps out at me, and maybe it's answered in the article and I don't understand it, is the role of time. Generally in physics, you're talking about a quantity being conserved over time, and I'm not sure what plays the role of time when you're talking about conserved quantities in machine learning -- is it conserved over training iterations or over inference layers, or what?

edit: now that I've read it again, I see that it's addressed in the second paragraph.

I'm now wondering if in something like Sora that can do a kind of physical modeling, if there's some conserved quantity in the neural network that is _directly analogous_ to conserved quantities in physics -- if there is, for example, something that represents momentum, that operates exactly as momentum as it progresses through the layers.


In physics, the symmetry isn't always time translation. Invariance under time translation specifically gives conservation of energy. Invariance under spatial translation gives conservation of momentum, invariance under spatial rotation gives conservation of angular momentum, invariance of the electromagnetic field gives conservation of current, and invariance of the wave function's phase gives conservation of charge.

I think the analogue in machine learning is conservation over changes in the training data. After all, the point of machine learning is to find general models that describe the training data given, and minimize the loss function. Assuming that a useful model can be trained, the whole point is that it generalizes to new, unseen instances with minimal losses, i.e. the model remains invariant under shifts in the instances seen.

The more interesting part to me is what this says about philosophy of physics. Noether's Theorem can be restated as "The laws of physics are invariant under X transformation", where X is the gauge symmetry associated with the conservation law. But maybe this is simply a consequence of how we do physics. After all, the point of science is to produce generalized laws from empirical observations. It's trivially easy to find a real-world situation where conservation of energy does not hold (any system with friction, which is basically all of them), but the math gets very messy if you try to actually model the real data, so we rely on approximations that are close enough most of the time. And if many people take empirical measurements at many different points in space, and time, and orientations, you get generalized laws that hold regardless of where/when/who takes the measurement.

Machine learning could be viewed as doing science on empirically measurable social quantities. It won't always be accurate, as individual machine-learning fails show. But it's accurate enough that it can provide useful models for civilization-scale quantities.


> In physics, the symmetry isn't always time translation. Invariance under time translation specifically gives conservation of energy.

That's not what I meant.

When you talk about "conservation of angular momentum", the symmetry is invariance over rotation, but the angular momentum is conserved _over time_.


> It's trivially easy to find a real-world situation where conservation of energy does not hold (any system with friction, which is basically all of them)

Conservation of energy absolutely still holds, but entropy is not conserved, so the process is irreversible. If your model doesn't include heat, then energy won't appear to be conserved in a process that produces heat, but that's your modeling choice, not a statement about physics. It is common to model such processes using a dissipation potential.


Right, but I'm saying that it's all modeling choices, all the way down. Extend the model to include thermal energy and most of the time it holds again - but then it falls down if you also have static electricity that generates a visible spark (say, a wool sweater on a slide) or magnetic drag (say, regenerative braking on a car). Then you can include models for those too, but you're introducing new concepts with each, and the math gets much hairier. We call the unified model where we abstract away all the different forms of energy "conservation of energy", but there are a good many practical systems where making tangible predictions using conservation of energy gives wrong answers.

Basically this is a restatement of Box's aphorism ("All models are wrong, but some are useful") or the ideas in Thomas Kuhn's "The Structure of Scientific Revolutions". The goal of science is to go from concrete observations to abstract principles that will, ideally, accurately predict future concrete observations. In many cases, you can do this. But not all. There is always messy data that doesn't fit into neat, simple, general laws. Usually the messy data is just ignored, because it can't be predicted and is assumed to average out or generally be irrelevant in the end. But sometimes the messy outliers bite you, or someone comes up with a new way to handle them elegantly, and then you get a paradigm shift.

And this has implications for understanding what machine learning is or why it's important. Few people would think that a model linking background color to the likelihood of clicking on ads is a fundamental physical quantity, but Google had one 15+ years ago, and it was pretty accurate, and made them a bunch of money. Similarly, most people wouldn't think of a model of the English language as a fundamental physical quantity, but that's exactly what an LLM is, and they're pretty useful too.


It's been a long time since I have cracked a physics book, but your mention of interesting "fundamental physical quantities" triggered a recollection: there is a conservation-of-information result in quantum mechanics where you can come up with an action whose equations of motion are Schrödinger's equation, and the conserved quantity is a probability current. So I wonder to what extent (if any) it might make sense to approach these things in terms of the really fundamental quantity of information itself?


Approaching physics from a pure information-flow perspective is definitely a current research topic. I suspect we see less popsci treatment of it because almost nobody understands information theory in the first place, and applying it to physics that almost nobody understands either is probably three or four bridges too far for a popsci treatment. But it's a live and active area.


This might be insultingly simplistic, but I always thought the phrase "conservation of information" just meant that the time-evolution operator in quantum mechanics was unitary. Unitary mappings are always bijective functions - so it makes intuitive sense to say that all information is preserved. However, it does not follow that this information is useful to actually quantify, like energy or momentum. There is certainly a kind of applied mathematics called "information theory", but I doubt there's any relevance to the term "conservation of information" as it's used in fundamental physics.

The links below lend credibility to my interpretation.

https://en.wikipedia.org/wiki/Time_evolution#In_quantum_mech...

https://en.wikipedia.org/wiki/Bijection

https://en.wikipedia.org/wiki/Black_hole_information_paradox


Is there any way to deduce which invariance gives which conservation? I mean for example: how can you tell that time invariance is the one paired with conservation of energy? Why is e.g. time invariance not paired with momentum, current, or anything else, but specifically energy?

I know I can remember that momentum is paired with translation simply because there's both angular momentum and linear momentum, and in space you have translation and rotation, so energy is the only one left over for time. But I'm not looking for a trick to remember it; I'm looking for the fundamental reason, as well as for how to tell what will be paired with some invariance when looking at a new one.


The conserved quantity is derived from Noether's theorem itself. One thing that is a bit hairy is that Noether's theorem only applies to continuous, smooth symmetries (physically, there is some wiggle room here).

When deriving the conservation of energy from Noether's theorem, you basically say that your Lagrangian (a function that encodes the dynamics of a physical system) is invariant over time. When you do that, you automatically get that energy is conserved. Each invariance produces a conserved quantity, as explained in the parent comment, when you apply a specific transformation that is supposed to leave the system unchanged (i.e. remain invariant).
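For concreteness, here's the one-degree-of-freedom version of the energy case (standard Euler-Lagrange assumptions). If L(q, q̇) has no explicit time dependence, then along solutions of the Euler-Lagrange equation

$$\frac{d}{dt}\Big(\dot q\,\frac{\partial L}{\partial \dot q} - L\Big) = \dot q\Big(\frac{d}{dt}\frac{\partial L}{\partial \dot q} - \frac{\partial L}{\partial q}\Big) = 0,$$

so the energy E = q̇ ∂L/∂q̇ − L is conserved.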

Now in doing this you're also invoking the principle of least action (by using Lagrangians to describe the state of a physical system) but that is a separate topic.


The key point is that energy, momentum, and angular momentum are additive constants of the motion, and this additivity is a very important property that ultimately derives from the geometry of the space-time in which the motion takes place.

> Is there any way to deduce which invariance gives which conservation?

Yes. See Landau vol 1 chapter 2 [1].

> I'm looking for the fundamental reason, as well as how to tell what will be paired with some invariance when looking at some other new invariance

I'm not sure there is such a "fundamental reason", since energy, momentum, and angular momentum are by definition the names we give to the conserved quantities associated with time, translation, and rotation.

You are asking "how to tell what will be paired with some invariance" but this is not at all obvious in the case of conservation of charge, which is related to the fact that the results of measurements do not change when all the wavefunctions are shifted by a global phase factor (which in general can depend on position).

I am not aware of any way to guess or understand which invariance is tied to which conserved quantity other than just calculating it out, at least not in a way that is intuitive to me.

[1] https://ia803206.us.archive.org/4/items/landau-and-lifshitz-...


But momentum is also conserved over time, as far as I know 'conservation' of all of these things always means over time.

"In a closed system (one that does not exchange any matter with its surroundings and is not acted on by external forces) the total momentum remains constant."

That means it's conserved over time, right? So why is energy the one associated with time and not momentum?


Conservation normally means things don't change over time, simply because in mechanics time is the go-to external parameter for studying the evolution of a system; but it's not the only one, nor the most convenient in some cases.

In Hamiltonian mechanics there is a 1:1 correspondence between functions on the phase space (coordinates and momenta) and one-parameter continuous transformations (flows). If you give me a function f(q,p), I can construct some transformation φ_s(q,p) of the coordinates that conserves f, meaning d/ds f(φ_s(q, p)) = 0. (Keeping it very simple, the transformation consists of shifting the coordinates along the lines tangent to the gradient of f.)

If f(q,p) is the Hamiltonian H(q,p) itself, φ_s turns out to be the normal flow of time, meaning φ_s(q₀,p₀) = (q(s), p(s)), i.e. s is time and dH/dt = 0 says energy is conserved, but in general f(q,p) can be almost anything.
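In Poisson-bracket notation (canonical coordinates assumed), the same statement is just

$$\frac{d}{ds}\, g(\varphi_s(q,p)) = \{g, f\}, \qquad \text{so in particular} \qquad \frac{dH}{dt} = \{H, H\} = 0:$$

any f generates a flow, H generates the flow we call time, and any function that Poisson-commutes with H is a constant of the motion.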

For example, take geometric optics (rays, refraction and such things): it's possible to write a Hamiltonian formulation of optics in which the equations of motion give the path taken by light rays (instead of particle trajectories). In this setting time is still a valid parameter but is most likely to be replaced by the optical path length or by the wave phase, because we are interested in steady conditions (say, laser turned on, beam has gone through some lenses and reached a screen). Conservation now means that quantities are constants along the ray, an example may be the frequency/color, which doesn't change even when changing between different media.


My understanding is that conservation of momentum does not mean momentum is conserved as time passes. It means that if you have a (closed) system in a certain configuration (not in an external field) and compute the total momentum, the result is independent of the configuration of the system.


It certainly means that momentum is conserved as time passes. The variation of the total momentum of a system is equal to the impulse, which is zero if there are no external fields.


In retrospect: the earliest recognition of a conserved quantity was Kepler's law of areas. Isaac Newton later showed that Kepler's law of areas is a specific instance of a property that obtains for any central force, not just the (inverse square) law of gravity.

About symmetry under change of orientation: for a given (spherically symmetric) source of gravitational interaction the amount of gravitational force is the same in any orientation.

For orbital motion the motion is in a plane, so for the case of orbital motion the relevant symmetry is cylindrical symmetry with respect to the plane of the orbit.

The very first derivation that is presented in Newton's Principia is a derivation that shows that for any central force we have: in equal intervals of time equal amounts of area are swept out.

(The swept out area is proportional to the angular momentum of the orbiting object. That is, the area law anticipated the principle of conservation of angular momentum)

A discussion of Newton's derivation, illustrated with diagrams, is available on my website: http://cleonis.nl/physics/phys256/angular_momentum.php

The thrust of the derivation is that if the force that the motion is subject to is a central force (cylindrical symmetry), then angular momentum is conserved.

So: In retrospect we see that Newton's demonstration of the area law is an instance of symmetry-and-conserved-quantity-relation being used. Symmetry of a force under change of orientation has as corresponding conserved quantity of the resulting (orbiting) motion: conservation of angular momentum.

About conservation laws:

The law of conservation of angular momentum and the law of conservation of momentum are about quantities that are associated with specific spatial characteristics, and the conserved quantity is conserved over time.

I'm actually not sure about the reason(s) for the classification of conservation of energy. My own view: kinetic energy is not associated with any form of keeping track of orientation; the velocity vector is squared, and that squaring operation discards directional information. More generally, energy is not associated with any spatial characteristic. Arguably, energy conservation is categorized as associated with symmetry under time translation because of this absence of association with any spatial characteristic.


I'm a bit skeptical about giving up conservation of energy in a system with friction. Isn't it more accurate to say that if we were to calculate every specific interaction we'd still end up with conservation of energy? Whether or not we're dealing with a closed system becomes important, but if we were able to truly model the entire physical system with friction, we'd still adhere to our conservation laws.

So they are not approximations, but are just terribly difficult calculations, no?

Maybe I'm misunderstanding your point, but this should be true regardless of our philosophy of physics correct?


It's more of an analogy: dissipative systems do not have a Lagrangian, and Noether's work applies to Lagrangian systems.

Conservation laws in particular are measurable properties of an isolated physical system that do not change as the system evolves over time.

It is important to remember that Physics is about finding useful models that make useful predictions about a system. So it is important to not confuse the map for the territory.

Gibbs free energy and Helmholtz free energy are not conserved.

As thermodynamics, energy, and entropy are difficult topics due to didactic half-truths, here is a paper that shows (in a contrived fashion) that the n-body problem runs into a similar issue and may be undecidable:

http://philsci-archive.pitt.edu/13175/

While Noether's principle often lets you see what can be simplified in an equation, it frequently lets you not just simplify "terribly difficult calculations" but actually turn them into computationally feasible ones.


A nice way to formulate (most) data augmentations is: a family of functions A = {a} such that our optimized neural network f obeys f(x) ~= f(a(x)).

So in this case, we're explicitly defining the set of desired invariances.
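A tiny numpy sketch of that definition (the "model" and the augmentation family here are made up for illustration; a trained network would stand in for model):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))

def model(x):
    # stand-in for an optimized network f
    return np.maximum(W @ x, 0).sum()

def augmentations(x):
    # A = {a}: a flip and a few circular shifts of a 1-D signal
    yield x[::-1]
    for s in (1, 2, 3):
        yield np.roll(x, s)

x = rng.standard_normal(16)
# how far is f from the desired invariance f(x) ~= f(a(x))?
print(max(abs(model(x) - model(a)) for a in augmentations(x)))
```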


I think the most profound insight I've come across while studying this particular topic is that information theory ended up being the answer to saving the 2nd law with respect to Maxwell's demon thought experiment. Not to put too fine a point on it, but essentially the knowledge organized in the mind of the demon, about the particles in its system, was calculated to offset the creation of the energy gradient.

I found the thinking of William Sidis to be a particularly thought-provoking perspective on Noether's benchmark work. In his paper The Animate and the Inanimate he posits--at a high level--that life is a "reversal of the second law of thermodynamics"; not that the 2nd law is a physical symmetry, but a mental one, in an existence where energy reversibly flows between positive and negative states.

Indeed, when considering machine learning, I think it's quite interesting to consider how the organizing of information/knowledge done during training in some real way mirrors the energy-creating information interred in the mind of Maxwell's demon.

When taking into account the possible transitive benefits of knowledge organized via machine learning, and its attendant oracle through application, it's easy to see a world where this results in a net entropy loss, the creation of a previously non-existent energy gradient.

In my mind this has interesting implications for Fermi's paradox, as it seems to imply the inevitability of the organization of information. Taken further into my own personal dogma, I think it's inevitable that we create--what we would consider--a sentient being, as I believe this is the cycle of our own origin in the larger evolutionary timeline.


I like the connection to Fermi. It seems to me eventually there has to be a concrete answer to the following question: Given the laws of physics, and the "initial conditions" (ie the state of the Universe at the moment of the Big Bang), what is the statistical likelihood of advanced (ie technological) civilizations occurring over time, and what is the likelihood that they go extinct (or revert to less technologically savvy conditions)? ISTM there are intrinsic numbers for this calculation, though it is probably impossible for us to derive them from first principles.


>at a high level--that life is a "reversal of the second law of thermodynamics";

Life temporarily displaces entropy, locally.

Life wins battles, chaos wins the war.

>Indeed, when considering machine learning, I think it's quite interesting to consider how the organizing of information/knowledge done during training in some real way mirrors the energy-creating information interred in the mind of Maxwell's demon.

This is our human bias favoring the common myth that ever-expanding complexity is an "inevitable" result of the passage of time; refer to Stephen Jay Gould's "Full House: The Spread of Excellence from Plato to Darwin"[0] for the only palatable refutation modern evolutionists can offer.

>When taking into account the possible transitive benefits of knowledge organized via machine learning, and its attendant oracle through application, it's easy to see a world where this results in a net entropy loss, the creation of a previously non-existent energy gradient.

Because it is. Randomness combined with a sieve, like a generator and a discriminator, like the primordial protein soup and our own existence as a selector, like chaos and order themselves, MAY - but DOES NOT have to - lead to temporary, localized areas of complexity, that we call 'life'.

This "energy gradient" you speak of is literally gravity pulling baryonic matter foward thru space time. All work requires a temperature gradient - Hawking's musings on the second law of thermodynamics and your own intuition can reason why.

>In my mind this has interesting implications for Fermi's paradox, as it seems to imply the inevitability of the organization of information. Taken further into my own personal dogma, I think it's inevitable that we create--what we would consider--a sentient being, as I believe this is the cycle of our own origin in the larger evolutionary timeline.

Over cosmological time spans it is a near-mathematical certainty that we either reach the universe's Omega point[1] of "our" own accord, or perish by our own hands, by our own creation's, or by our own sons'.

[0]: https://www.amazon.com/Full-House-Spread-Excellence-Darwin/d...

[1]: https://www.youtube.com/watch?v=eOxHRFN4rs0


A convolutional neural network ought to have translational symmetry, which should lead to a generalized version of momentum. If I understood the article correctly the conserved quantity would be <gx, dx>, where dx is the finite difference gradient of x.

This gives a vector with dimensions equal to however many directions you can translate a layer in and which is conserved over all (convolutional) layers.
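A sketch of what checking that could look like (untrained random weights, linear layers only, and circular padding so that convolution commutes exactly with circular shifts; with nonlinearities or zero padding the equality would only be approximate):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 3, 16, 16, requires_grad=True)
w1 = 0.1 * torch.randn(8, 3, 3, 3)
w2 = 0.1 * torch.randn(4, 8, 3, 3)

def conv(t, w):
    # circular padding makes the layer commute exactly with circular translations
    return F.conv2d(F.pad(t, (1, 1, 1, 1), mode='circular'), w)

h = conv(x, w1)
y = conv(h, w2)
loss = y.pow(2).mean()   # arbitrary stand-in loss
g_x, g_h, g_y = torch.autograd.grad(loss, (x, h, y))

def momentum(act, grad):
    # <g, dx>: gradient paired with a finite-difference horizontal shift of the activations
    d = torch.roll(act, shifts=-1, dims=3) - act
    return (grad * d).sum().item()

# three (essentially) equal numbers: the "momentum" is conserved from layer to layer
print(momentum(x, g_x), momentum(h, g_h), momentum(y, g_y))
```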


Exactly right! In fact, because that symmetry does not include an action on the parameters of the layer, your conserved quantity <gx, dx> should hold whether or not the network is stationary for a loss. This means that it'll be stationary on every single data point. (In an image classification model, these values are just telling you whether or not the loss would be improved if the input image were translated.)


Everything in the paper is about global symmetries; is there also the possibility of gauge symmetries?


Yeah, I've been thinking about similar concepts in a different context. Fascinating.

Regarding the role of time, the idea of a purely conserved quantity is that it is conserved under the conditions of the system (that's why the article frequently references Newton's First Law), so they're generally held "for all time that these symmetries exist in the system".

Specifically on time: the invariant for systems that exhibit continuous time symmetries (i.e. you move a little bit forward or backward in time and the system looks exactly the same) is energy.


Here's my ELI5 attempt of the time/energy relation:

imagine a spring at rest (not moving)

strike the spring, it's now oscillating

the system now contains energy like a battery

what is energy? it's stored work potential

the battery is storing the energy, which can then be taken out at some future time

the spring is transporting the energy through time

in fact how do we measure time? with clocks. What's a clock? It's an oscillator. The energized spring is the clock. When system energy is zero, what is time even? There's no baseline against which to measure change when nothing is changing


Symmetry exists abstractly, apart from time.

There are many machine learning problems which should have symmetries: a picture of a cow rotated 135 degrees is still a picture of a cow, the meaning of spoken words shouldn't change with the audio level, etc. If they were doing machine learning on tracks from the LHC the system ought to take account of relativistic momentum and energy.

Can a model learn a symmetry? Or should a symmetry just be built into the model from the beginning?


Equivariant machine learning is a thing that people have tried... Tends to be expensive and slow, though, and imposes invariances that our model (a universal function approximator, recall) should just learn anyway: If you don't have enough pictures of upside down cows, just train a normal model with augmentations.


Ha, my previous comment was before your new edit mentioning Sora. There is a good reason why the accompanying research report to the Sora demo isn't titled "Awesome Generative Video," but references world models. The interesting feature is how many apparently (approximations to) physical properties emerge (object permanence, linear motion, partially elastic collisions, as well as many of the elements of grammar of film), and which do not (notably material properties of solid and fluids, creation of objects from nothing, etc.)


Time is not special regarding symmetries and conserved quantities. In general you can consider any family of continuous transformations parametrised by some real variable s: be it translations by a distance x, rotations by an angle φ, etc. These are technically one-parameter subgroups of a Lie group.

Then, if your dynamical system is symmetrical under these transformations you can construct a quantity whose derivative wrt s is zero.


> now wondering...if there's some conserved quantity in the neural network that is _directly analogous_ to conserved quantities in physics

Isn't the model attempting to conserve information during training? And isn't information a physical quantity?


"I'm now wondering if in something like Sora that can do a kind of physical modeling, if there's some conserved quantity in the neural network that is _directly analogous_ to conserved quantities in physics"

My first thought on reading that was that if there was it would be interesting to see if there was some way it tied into the concept of us living in a simulation, i.e. we're all living in a complex ML network simulation.


People have mentioned the discrete/continuous tradeoff. One way to bridge that gap would be to use https://arxiv.org/abs/1806.07366 - they draw an equivalence between vanilla (FC layer) neural nets of constant width and differential equations, and then use a differential equation solver to "train" a "neural net" (from what I remember - it's been years since that paper...).
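For flavor, a minimal forward pass of such a continuous-depth block using an off-the-shelf ODE solver (the weights here are random stand-ins; in the paper they are trained, e.g. via the adjoint method):

```python
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(0)
W, b = 0.3 * rng.standard_normal((4, 4)), 0.1 * rng.standard_normal(4)

def dynamics(t, h):
    # dh/dt = f(h; W, b): one constant-width "layer" applied continuously in depth t
    return np.tanh(W @ h + b)

h0 = rng.standard_normal(4)                  # input activations
sol = solve_ivp(dynamics, (0.0, 1.0), h0, rtol=1e-6)
print(h0, sol.y[:, -1])                      # input vs. output of the continuous-depth block
```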

Another approach might be to take an information theoretic view with the infinite-width finite-entropy nets.


Another angle to look at would be the S4 models, which admit both a continuous time and recurrent discrete representation.


I liked the article and I hope that I can understand it more with some study.

I think the following sentence in the article is wrong: "Applying Noether's theorem gives us three conserved quantities—one for each degree of freedom in our group of transformations—which turn out to be horizontal, vertical, and angular momentum."

I think the correct statement is "Applying Noether's theorem gives us three conserved quantities—one for each degree of freedom in our group of transformations—which turn out to be translation, rotation, and time shifting."

I think translation leads to conservation of momentum, rotation leads to conservation of angular momentum, and time shifting leads to conservation of energy (potential+kinetic). It's been a few decades since I saw the proof, so I might be wrong.


I think your last paragraph is correct, but the statement in the article is referring to the specific 2D 2-body example given, and its original phrasing is also correct. Translation, rotation, and time-shifting are transformations (matrices), not quantities. Horizontal, vertical, and angular (2D) momentum are scalars. The article is saying that if you take the action given in the example, there exist scalar quantities (which we call horizontal momentum, vertical momentum, and angular momentum) that remain constant regardless of any horizontal, vertical, or rotational transformation of the coordinate system used to measure the 2-body problem.
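For the record, the three conserved quantities in that 2D two-body example are the familiar ones (writing m_i, (x_i, y_i) for the masses and positions):

$$P_x = \sum_i m_i \dot x_i, \qquad P_y = \sum_i m_i \dot y_i, \qquad L = \sum_i m_i (x_i \dot y_i - y_i \dot x_i),$$

coming from horizontal translation, vertical translation, and rotation respectively.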


Hi, thanks!

In that sentence I was only talking about the translations and rotations of the plane as a group of invariances for the action of the two-body problem. This group is generated by one-parameter subgroups producing vertical translation, horizontal translation, and rotation about a particular point. Those are the "three degrees of freedom" I was counting.

You're right about the correspondence from symmetries to conservation laws in general.


The application of Noether's theorem in this case refers only to the energy integral shown (KE = ME - GPE for 2D Kinetic Mechanical and Gravitational Potential Energies) over time. It's really only for that particular 2 body 2 dimensional problem.

More generically, in 3 dimensions a transformation with 3 translational, 3 rotational, and 1 time independence would provide conservation of 3 momenta, 3 angular momenta, and 1 energy.


Right, the rephrasing of the sentence is a tad more accurate. Your three entities are [invariant -> conserved quantity]: (translation -> momentum), (rotation -> angular momentum) and (time -> energy).


I'll be walking tall the day I can leisurely read articles like this! I wish I had studied this stuff; now time is short.


So I think this is a great connection that deserves more thought, as well as an absolutely gorgeous write-up.

The main problem I see with it is that most of the time you don't want the optimum for your objective function, as that frequently results in overfitting. This leads to things like early stopping being typical.


Thanks so much!

And yes, that's quite true. When parameter gradients don't quite vanish, then the equation

<g_x, d x / d eps> = <g_y, d y / d eps>

becomes

<g_x, d x / d eps> = <g_y, d y / d eps> - <g_theta, d theta / d eps>

where g_theta is the gradient with respect to theta.

In defense of my hypothesis that interesting approximate conservation laws exist in practice, I'd argue that maybe parameter gradients at early stopping are small enough that the last term is pretty small compared to the first two.

On the other hand, stepping back, the condition that our network parameters are approximately stationary for a loss function feels pretty... shallow. My impression of deep learning is that an optimized model _cannot_ be understood as just "some solution to an optimization problem," but is more like a sample from a Boltzmann distribution which happens to concentrate a lot of its probability mass around _certain_ minimizers of an energy. So, if we can prove something that is true for neural networks simply because they're "near stationary points", we probably aren't saying anything very fundamental about deep learning.


Your work here is so beautiful, but perhaps one lesson is that growth and learning result where symmetries are broken. :-D


I wonder if an energy and work metric could be derived for gradient descent. This might be useful for a more rigorous approach to hyperparameter development, and maybe for characterizing the data being learned. We say that some datasets are harder to learn, or measure difficulty by the overall compute needed to hit a quality benchmark. Something more essential would be a step forward.

Like in ANN backprop, the gradient descent algorithm can use momentum to overcome getting stuck in local minima. This was heuristically physical when I learned it... perhaps it's been developed further since. Maybe only allowing a "real" energy to the momentum would then align it with an ability-to-do-work calculation. It might also help with ensemble/Monte Carlo methods, to maintain an energy account across the ensemble.
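A toy version of that accounting, with heavy-ball momentum on a made-up loss (the "kinetic energy" reading below is only the mechanical analogy, with the learning rate playing the role of a time step; nothing canonical):

```python
import numpy as np

def loss(theta):                      # made-up non-convex loss with local minima
    return 0.5 * theta @ theta + np.sin(3 * theta).sum()

def grad(theta):
    return theta + 3 * np.cos(3 * theta)

theta, v = np.array([2.0, -1.5]), np.zeros(2)
lr, beta = 0.05, 0.9                  # hypothetical hyperparameters

for step in range(201):
    v = beta * v - lr * grad(theta)   # heavy-ball momentum update
    theta = theta + v
    potential = loss(theta)           # "potential energy" = current loss value
    kinetic = 0.5 * (v @ v) / lr      # "kinetic energy" under the time-step analogy
    if step % 50 == 0:
        print(step, round(potential, 4), round(kinetic, 6))
```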


I need to digest this but it is a seductive idea. My quick take: there may be a connection between back-propagation and reversibility, both computational and physical. For a system to be reversible implies conservation of information.

It also makes me think about the surprising success of highly quantized models (see for example the recent paper on ternary networks, where the only valid numbers are 0, 1, and -1).

Artificial Neural Networks were originally conceived as an approximation to an analog, continuous system, where floating-point numbers are stand-ins for reals. This is related to the ability to back-prop because real functions are generally differentiable. But if it turns out that we can closely approximate the same behavior with a small, discrete set of integers, it makes the whole edifice feel more like some sort of Cellular Automaton with reversible rules, rather than a set of functions over the reals.

Finally (sorry for the rabbit-holing) - how does this relate to our brains? Note that real neurons "fire" -- that is, they generate a discrete event when their internal configuration reaches a triggering state.

Lots to chew on...


Kinda like the reversibility of chained XORs and their ability to preserve information through cyclical permutations?


Yes, that is roughly correct. You may want to look up "reversible computation". It's a fundamental part of quantum computing, for one thing.

The key insight is that a (finite) discrete, reversible system will always eventually cycle back to its original state. This fact has very interesting follow-on implications for the concept of entropy and the Second Law. If it is guaranteed that a system will return to a prior state, how can it also be true that entropy (disorder) always increases?
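Here's a toy illustration of that cycling with a reversible XOR-style update (a hypothetical Feistel-like rule on a pair of bytes; because the state space is finite and the map is a bijection, the orbit must return to its start):

```python
MASK = 0xFF

def f(x):
    return (x * 5 + 3) & MASK

def step(a, b):
    # reversible update: (a, b) -> (b, a ^ f(b)); XOR undoes itself, so it can be inverted
    return b, a ^ f(b)

def inverse(a, b):
    return b ^ f(a), a

state = start = (17, 200)
assert inverse(*step(*start)) == start   # the rule really is reversible

n = 0
while True:
    state = step(*state)
    n += 1
    if state == start:
        break
print("returned to the initial state after", n, "steps")
```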


In the context of Modular math, or even a clock, is rotation around a circle considered entropic?

Does a sine wave have entropy?


Very nice article! I recently had a long chat with chatgpt on this topic, although from a slightly different perspective.

A neural network is a type of machine that solves nonlinear optimization problems, and the principle of least action is also a nonlinear optimization problem that nature solves by some kind of natural law.

This is the one thing that ChatGPT mentioned which surprised me the most and which I had not previously considered.

> Eigenvalues of the Hamiltonian in quantum mechanics correspond to energy states. In neural networks, the eigenvalues (principal components) of certain matrices, like the weight matrices in certain layers, can provide information about the dominant features or patterns. The notion of states or dominant features might be loosely analogous between the two domains.

I am skeptical that any conserved quantity besides energy would have a corresponding conserved quantity in ML, and the Reynolds operator will likely be relevant for understanding any correspondence like this.

IIRC the Reynolds operator plays an important role in Noether's theorem, and it involves an averaging operation similar to what is described in the linked article.


It has been shown that a finite difference implementation of wave propagation can be expressed as a deep neural network (e.g., [1]). These networks can have thousands of layers and yet I don't think they suffer from the exploding/vanishing gradient problem, which I imagine is because in the physical system they model there are conservation laws such as conservation of energy.

[1] https://arxiv.org/abs/1801.07232
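To illustrate the correspondence (just the forward physics viewed as a stack of identical linear "layers"; making medium parameters such as the wave speed trainable would turn it into a learnable network):

```python
import numpy as np

# 1-D wave equation u_tt = c^2 u_xx with leapfrog time stepping.
# Each time step is the same linear "layer": a convolution with the stencil [1, -2, 1].
nx, nt = 200, 1000
c, dx, dt = 1.0, 1.0, 0.5
r2 = (c * dt / dx) ** 2        # squared CFL number; < 1, so the scheme is stable

u_prev = np.zeros(nx)
u = np.exp(-0.05 * (np.arange(nx) - nx // 2) ** 2)   # initial pulse

for _ in range(nt):                                   # one "layer" per time step
    lap = np.roll(u, -1) - 2 * u + np.roll(u, 1)      # u_xx by finite differences
    u_prev, u = u, 2 * u - u_prev + r2 * lap

print(float(np.abs(u).max()))   # stays of order 1 after a thousand "layers": no blow-up
```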


As a complete amateur I was wondering if it could be possible to use that property of light ("to always choose the most optimal route") to solve the traveling salesman problem (and the whole class of those problems as a consequence). Maybe not with an algorithmic approach, but rather some smart implementation of the machine itself.


If somehow you can ensure that light can only reach a point by travelling through all other points then yes.

It's basically the same way you could use light to solve a maze, just flood the exit with light and walk in the direction which is brightest. Works better for mirror mazes.


Google up 'soap film steiner tree' for a fun, well-known variant of this.


Then follow it up with https://www.scottaaronson.com/papers/npcomplete.pdf . While reality can "solve" these problems to some extent it turns out that people overestimate reality's ability to solve it optimally.


Sounds like some of the trendy analog computing approaches, like this one for example:

https://www.microsoft.com/en-us/research/uploads/prod/2023/0...


This is pretty likely, it's been done with DNA: https://pubmed.ncbi.nlm.nih.gov/15555757/

Physics contains a lot of 'machinery' for solving for low energy states.


This sounds a bit like LIDAR implementations, I assume you mean something similar at a smaller scale, where physical obstacles provide a "path" representation of a problem space?


Yup, something like that came to my mind first: create a physical representation (like a map) of the graph you want to solve and use physics to determine the shortest path. Once you have it you could easily compute the winning path's length etc.


Completely irrelevant, but I love the way the color theme on this blog feels like a chalkboard.


See also "Noether Networks: Meta-Learning Useful Conserved Quantities" https://arxiv.org/abs/2112.03321 from 2021.

Abstract: Progress in machine learning (ML) stems from a combination of data availability, computational resources, and an appropriate encoding of inductive biases. Useful biases often exploit symmetries in the prediction problem, such as convolutional networks relying on translation equivariance. Automatically discovering these useful symmetries holds the potential to greatly improve the performance of ML systems, but still remains a challenge. In this work, we focus on sequential prediction problems and take inspiration from Noether's theorem to reduce the problem of finding inductive biases to meta-learning useful conserved quantities. We propose Noether Networks: a new type of architecture where a meta-learned conservation loss is optimized inside the prediction function. We show, theoretically and experimentally, that Noether Networks improve prediction quality, providing a general framework for discovering inductive biases in sequential problems.


How do you direct what the network learns if it all comes from supervised learning training sets?

How do you insert rules that aren't learned into what weights are learned?


There are promising methods being developed for physics-informed neural networks. Mathematical models can be integrated into the architecture of neural networks such that the parameters of the designed mathematical models can be learned. Examples include learning the frequency of a swinging pendulum from video, among more advanced ideas.

https://en.wikipedia.org/wiki/Physics-informed_neural_networ... https://www.youtube.com/watch?v=JoFW2uSd3Uo
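A minimal sketch of the PINN idea for the pendulum case (everything here is made up for illustration: toy synthetic data, a small MLP, and a learnable frequency fit by penalizing the ODE residual theta'' + omega^2 sin(theta) = 0):

```python
import torch

torch.manual_seed(0)

# Synthetic "observations" of a small-angle pendulum with true omega = 2.0 (toy data).
t_obs = torch.linspace(0, 4, 40).unsqueeze(1)
theta_obs = 0.3 * torch.cos(2.0 * t_obs)

net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 1))
log_omega = torch.nn.Parameter(torch.tensor(0.0))   # learnable frequency (log-parametrized)
opt = torch.optim.Adam(list(net.parameters()) + [log_omega], lr=1e-3)

t_col = torch.linspace(0, 4, 200).unsqueeze(1).requires_grad_(True)  # collocation points

for step in range(3000):
    opt.zero_grad()
    data_loss = ((net(t_obs) - theta_obs) ** 2).mean()   # fit the observations
    theta = net(t_col)
    d1 = torch.autograd.grad(theta.sum(), t_col, create_graph=True)[0]
    d2 = torch.autograd.grad(d1.sum(), t_col, create_graph=True)[0]
    omega = log_omega.exp()
    phys_loss = ((d2 + omega ** 2 * torch.sin(theta)) ** 2).mean()   # physics residual
    (data_loss + phys_loss).backward()
    opt.step()

print(float(log_omega.exp()))   # should drift toward the frequency implied by the data (~2.0)
```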


How does he create those animations? I'd like to make them as well for myself.


They seem to be built with some love by the author. Apparently they have written it in Haxe, judging from the comment in the page source.


Haha yeah, took some love. I have a scrappy little "framework" that I've been adjusting since I started making interactive posts last year. Writing my interactive widgets feels a bit like doing a game jam now: just copy a template and start compiling+reloading the page, seeing what I can get onto the screen. I've just been using the canvas2d API.

Besides figuring out a good way of dealing with reference frames, the only trick I'd pass on is to use CSS variables to change colors and sizes (line widths, arrow dimensions, etc.) interactively. It definitely helps to tighten the feedback loop on those decisions.


Oh that's way out of my league unfortunately. I wonder if there's a library or something that does something like this.


Not quite what you're looking for, but worth pointing out that Grant Sanderson of 3Blue1Brown has published the "framework" he uses for his math videos on GitHub.

https://github.com/3b1b/manim


I'd like to know too.

I've been using Emmy from the ClojureScript ecosystem, which works pretty well, but has a few quirks.

https://emmy-viewers.mentat.org/


I love the simple but elegant formatting of this blog.

cgadski: what did you use to make it?


Thank you!

In the beginning, I used kognise's water.css [1], so most of the smart decisions (background/text color, margins, line spacing I think) probably come from there. Since then it's been some amount of little adjustments. The font is by Jean François Porchez, called Le Monde Livre Classic [2].

I draft in Obsidian [3] and build the site with a couple python scripts and KaTeX.

[1] https://watercss.kognise.dev/

[2] https://typofonderie.com/fr/fonts/le-monde-livre-classic

[3] https://obsidian.md/


Only clue in the source:

<!-- this blog is proudly generated by, like, GNU make -->



