Executing gradient descent on the earth (fosterelli.co)
182 points by gballan on April 20, 2018 | 44 comments



One big problem with the conclusion is that intuitions from low dimensional spaces often don’t carry over to high dimensional spaces. E.g. a common example is how the ratio of the volume of the unit hypersphere to the volume of its bounding hypercube goes to zero. One funny thing I saw once was something like “the real curse of dimensionality is how different optimization in high dimensional spaces is compared to low dimensional spaces”.
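If you want to watch that ratio collapse numerically, here's a quick Python sketch (just the closed-form n-ball volume, computed in log space to avoid overflow):

    # Ratio of the volume of the unit-diameter n-ball (radius 0.5) to the
    # unit n-cube: pi^(n/2) * 0.5^n / Gamma(n/2 + 1).
    import math

    for n in (1, 2, 3, 5, 10, 20, 100):
        log_ratio = ((n / 2) * math.log(math.pi)
                     + n * math.log(0.5) - math.lgamma(n / 2 + 1))
        print(f"n = {n:3d}: ratio = {math.exp(log_ratio):.3e}")

By n = 20 the ball occupies about 2.5e-8 of the cube; essentially all of the volume has migrated out to the corners.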


> One big problem with the conclusion is that intuitions from low dimensional spaces often don’t carry over to high dimensional spaces. E.g. a common example is how the ratio of the volume of the unit hypersphere to the volume of its bounding hypercube goes to zero.

Is that un-intuitive, or am I looking at it the "wrong" way? I'm imagining the overlap of a 2D square/unit-circle versus the overlap of a 3D cube/unit-sphere, and from those two data points there's already a downward trend.

I mean, the 2D case is really just the single cross-sectional slice from the 3D case through the middle, the one that has the greatest possible overlap. All other possible slices will have less overlap.

Following that logic, a 3D sphere/cube overlap is likewise the "middle cross-section of greatest overlap" for a 4D scenario, etc.


Try this one.

I give you a megabox: a million dimensional hypercube spanning one meter along each axis. I also provide a bunch of megaspheres, million-dimensional hyperspheres scaled to be 0.99 meters in diameter. How many megaspheres can you fit into the megabox?


The first answer in my head is 1, but I'm guessing this is more akin to 'how many circles can you fit in a sphere?'


The spheres and the cubes are of the same dimension, so it's not like trying to put flat things inside of a thing with volume. (Technically I should have said million dimensional balls instead of spheres. Sorry if that was confusing.)


Then isn’t the answer 1? For 2 dimensions, one circle fits in a square, in three dimensions one sphere fits in a cube. Should be the same for any value of n?


Consider that the distance between opposite corners of a square is √2 meters, but for a cube it's √3 meters. In general, for an n-dimensional hypercube with 1m sides, the exactly-opposite corners are √n meters apart.

Meaning, in a million-dimensional hypercube with meter-long sides, exactly-opposite corners will be a kilometer apart. There's a lot more diagonal wiggle room than you think.
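(Quick check, in Python:)

    import math
    # Opposite corners of an n-cube with 1 m sides are sqrt(n) metres apart.
    for n in (2, 3, 1_000_000):
        print(n, math.sqrt(n))  # 1.414..., 1.732..., 1000.0 -- a full kilometre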


That blew my mind but it still sounds like it would generalize, wouldn't it?


I'd love to hear the correct answer to this, with explanation.


I can't tell you the exact answer, but it's at least 10^100000.

The coordinates of the centre of each megasphere have to lie in the range [0.49,0.51]. Let's just think about megaspheres which are pushed as far as they will go into some corner, so their coordinates are each either 0.49 or 0.51. For each of the one million directions, the distance between the centres of two megaspheres in that dimension will either be 0.02 or 0, depending on whether their coordinates are the same or different. So by Pythagoras' Theorem the total distance between their centres will be sqrt(n)·0.02, where n is the number of dimensions in which their coordinates differ.

In order for them to fit, the distance must be at least 0.99, so we need sqrt(n)·0.02 >= 0.99, which means we need n >= 2451. But we have 1000000 coordinates to choose from, so it's not hard at all to make sure that at least 2451 of them differ. We can start with the sphere whose centre has all coordinates equal to 0.49. Then the sphere with coordinates 0.51 in the first 2451 dimensions, and all others 0.49. Then the sphere with coordinate 0.51 in the next 2451 dimensions. And so on.

That squeezes in at least 1000000/2451 = 407 spheres, but even after that there's loads more room. How about the sphere that has coordinates 0.49 and 0.51 alternating? Or changing every 3? In fact if we just assign coordinates at random then each dimension has a 50/50 chance of being different, so we would expect them to be different in 500000 dimensions. The probability of them being different in only 2450 dimensions is given by (1000000 C 2450)/(2^1000000) = 10^-293570. The number of pairs of spheres is quadratic in the number of spheres. So if we just shove the spheres into corners at random we can expect to place about sqrt(1/10^-293570) = 10^146785 spheres before two of them collide.

And that still leaves loads of room away from the corners! The point at (0.5,...,0.5) is still a distance of sqrt(1000000)·0.01 = 10 away from the centre of any sphere we've placed so far, so there's still plenty of space for more spheres.

EDIT: The coordinates should actually be 0.495 and 0.505, but the principle is the same.
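If you want to poke at the counting argument yourself, here's a rough Python sketch of it (it keeps the 0.49/0.51 coordinates from above; per the EDIT the true values are 0.495/0.505, but the structure is identical):

    import math, random

    dims, delta, diameter = 1_000_000, 0.02, 0.99

    # Two corner-centres clear each other once sqrt(n) * delta >= diameter,
    # where n is the number of coordinates in which they differ.
    n_min = math.ceil((diameter / delta) ** 2)
    print(n_min)  # 2451

    # Two random corners differ in about half their coordinates -- vastly
    # more than n_min, so random corner placement almost never collides.
    a = [random.choice((0.49, 0.51)) for _ in range(dims)]
    b = [random.choice((0.49, 0.51)) for _ in range(dims)]
    n_diff = sum(x != y for x, y in zip(a, b))
    print(n_diff, math.sqrt(n_diff) * delta)  # ~500000 differing, ~14 m apart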


I don't know the exact answer, but the lower bound I computed had over a quarter million digits. See [1]

1: http://algassert.com/puzzle/2014/07/08/Boxing-Megaspheres.ht...


I can improve the lower bound a bit.

Let V be the volume (content) of a hypersphere of radius 0.99 (twice the radius of the spheres in the question). The total volume in which the centres of the spheres can lie is 0.01^1000000. So if the densest packing uses n spheres, then we must have nV >= 0.01^1000000 or else there would be a point at least 0.99 away from the centre of any sphere, and we could add an extra one there.

Using Stirling's approximation to bound V above gives that n is greater than 10^388118.6...
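For the curious, the same bound falls out of a few lines of Python using the exact log-Gamma instead of a Stirling upper bound (so the exponent comes out marginally sharper than the figure above):

    import math

    n, r = 1_000_000, 0.99  # dimension, radius of the doubled spheres
    # log10 of V = pi^(n/2) * r^n / Gamma(n/2 + 1)
    log10_V = ((n / 2) * math.log10(math.pi) + n * math.log10(r)
               - math.lgamma(n / 2 + 1) / math.log(10))
    # nV >= 0.01^n  =>  log10(n) >= n * log10(0.01) - log10(V)
    print(n * math.log10(0.01) - log10_V)  # ~388131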


I don't follow the logic here. How are you getting the equation nV >= 0.01^1000000? Shouldn't there be a packing efficiency constant in there? It almost looks like you're trying to pour the spheres like a liquid.


Imagine just putting the spheres in one at a time until you can't put in any more. Why can't you put in any more? It must be that every point of [0.495,0.505]^1000000 is within 0.99 of the centre of one of the spheres. Otherwise you could put in a new sphere with that point as its centre. This means that if we imagine spheres of radius 0.99 around each of our original spheres (which only have radius 0.495) those big spheres must cover all of [0.495,0.505]^1000000. Hence their combined volume must be as large as the volume of that box. Note that I defined V to be the volume of a sphere with radius 0.99, not diameter 0.99 as in the statement of the problem.


Thanks, that's much clearer and I understand now. That's a good simple lower bound.


Downward trend doesn't imply trending to zero. Would your intuition suggest a trend to zero?

Even if it does - you can build intuitions to explain unintuitive results. "Unintuitive" is subjective, and a similar word would be "surprising". Being able to look at the data and explain a surprising result doesn't make the initial result any less surprising.


Don't forget the 1D case, which has a 100% overlap.


Doesn't a sphere require all points to be separated from the centre point by the same distance r, where r>0? A point isn't a sphere.


Are you talking about the orange peel, the juicy part, or the whole fruit?

If you consider only the "orange peel", with center 0 and r=1, then the 1D version is just two points, {-1} and {1}, not a single point.

If you consider the "whole orange", then the 1D case is the segment [-1, 1], with center 0 and r=1.

In both cases it is equal to the 1D "square" equivalent.


Really depends on who you are asking for a definition. E.g. there is the term "point sphere" for the special case of a sphere that is a point.


Is there any interesting result that comes from casting a point as being any shape (presumably the point-sphere is also a point-dodecahedron, etc.)?


Define a topology on any set you'd like and a sphere is just the boundary of an open ball.


The 1D case is a line segment, not a point. A point is the 0D case and indeed a degenerate case.


> One big problem with the conclusion is that intuitions from low dimensional spaces often don’t carry over to high dimensional spaces. E.g. a common example is how the ratio of the volume of the unit hypersphere to the volume of its bounding hypercube goes to zero. One funny thing I saw once was something like “the real curse of dimensionality is how different optimization in high dimensional spaces is compared to low dimensional spaces”.

How so? Usually in pure mathematics (analytic areas) everything done in R^{1} and R^{2} is generalized to R^{N}, and much of the intuition carries over. So how does it fail here in the context of machine learning?

Note: I know nothing about ML :(


All the mechanics scale like you’d expect. You have bigger matrices containing more observations with more parameters per observation. Keep the dimensions aligned properly and everything works out just like it would with smaller vectors.

The discrepancies appear when you start dealing with probabilistic stuff. Suddenly everything’s conditioned on everything else, and your complexity skyrockets. Example: it’s easy to find people with about average height and weight, but the more physical characteristics you measure (increasing dimensionality), the more likely you are to find something statistically unusual, e.g. very long fingers. Very interesting Air Force research on this [0]

WRT optimization, higher dimensions can help avoid the local minima problem. When you have 1000s of possible directions to go, it’s unlikely that none of them will yield a better answer.

[0] https://www.thestar.com/news/insight/2016/01/16/when-us-air-...
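A toy version of that last point: if each measured characteristic is independently "typical" with probability p, the chance of being typical in every one of them dies off exponentially with the number of dimensions. A sketch, assuming roughly normal measurements and a 2-sigma cutoff:

    # P(|z| <= 2) for a standard normal is about 0.9545.
    p_typical = 0.9545
    for d in (1, 5, 10, 50, 100):
        # probability of being "unusual" in at least one of d measurements
        print(d, round(1 - p_typical ** d, 3))  # 0.046, 0.208, 0.372, 0.903, 0.991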


The actual global minimum isn't the ocean but rather the Dead Sea. I imagine this is rather hard to find, since you have to be fairly close to it before the gradient leads toward it rather than the ocean.

In fact we can be very precise. The Dead Sea has a catchment area of 41650 km^2, which is 0.0000816 of the Earth's surface. So we need on average 12200 random initializations in order to find the Dead Sea by gradient descent.
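(That figure is just the mean of a geometric distribution; quick check, taking the Earth's surface area as roughly 510 million km^2:)

    p = 41650 / 510.1e6  # chance a uniform random start drains to the Dead Sea
    print(p, 1 / p)      # ~8.17e-05, ~12250 restarts on average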


Well, the Dead Sea is certainly below mean ocean level. But the deepest oceanic trenches are far deeper (~11 km below mean ocean level) than the bottom of the Dead Sea (~0.74 km below mean ocean level).


Except it's pretty clear that for this problem, as posed, we're considering the surface of the ocean to be the "ground".


OK, sure. But the site talks about energetic boulders. And they wouldn't stop at sea level.


Finding the ocean is easy: water and gravity have formed the surface to have few local minima. It would be much more interesting to find peaks: there are many local maxima.

And the most important maxima are often surrounded by lots of local maxima. They may also have some steep slopes so gradient methods can easily overshoot. Sometimes the surface function is not even continuous (overhanging walls).


> water and gravity have formed the surface to have few local minima

From the article: "It turns out that the earth is filled with local minima"

(I suspect the deal is that you're right that water always flows downhill to the ocean; but that it often happens at scales too small to capture in the GIS data, or underground.)


Yeah you got it! Water moves in ways that are not captured by the resolution of the elevation dataset. Also, water moves into/through the ground.


Well, on a large scale, performing a gradient descent on the earth should always end at an ocean or at an endorheic basin.

Sure, when you increase the "resolution" to meters or smaller, there will be an endless number of mini-endorheic basins. For example, the former Collect Pond in NYC:

https://en.wikipedia.org/wiki/Collect_Pond


Any lake or pond is a local minimum. If it has an outlet it's only because the water level has risen to the point that it can escape.


To continue the analogy: when GD finds a local minimum, it needs to start filling that basin so it can escape.


OK, I didn't run through the math on this but I have spent a lot of time walking around the Olympics and the computed path from the summit of Olympus looks extremely inefficient. Also there's a point east of Forks that looks as if the path was close to getting trapped. The example does not seem compelling as a proof of convergence, especially if you reduce the momentum component to a point where you can't just hop over 4000' ridges as this path does.

For now I'm sticking with the path down the Hoh River. ;)

edit:typo


Maybe use a different dataset that includes ocean depths, and see if you can find the lowest point on earth (Challenger Deep) starting from the peak of Everest? :P

And now, do Adagrad, RMSProp, RMSProp + Nesterov and Adam, and maybe Newton, BFGS, L-BFGS and conjugate gradients, and then coordinate descent :)

It would be a pretty good educational tool to teach people the different gradient descent methods, though it's probably too simple of a problem for these methods to be at all useful.
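A minimal sketch of what that comparison could look like, on a made-up height function rather than real elevation data (plain gradient descent vs. momentum; the fancier optimizers just swap out the update rule):

    import math

    def grad(x, y):
        # Gradient of a toy "terrain" h(x, y) = sin(x)*cos(y) + 0.1*(x^2 + y^2)
        return (math.cos(x) * math.cos(y) + 0.2 * x,
                -math.sin(x) * math.sin(y) + 0.2 * y)

    def descend(x, y, lr=0.1, beta=0.0, steps=500):
        vx = vy = 0.0
        for _ in range(steps):
            gx, gy = grad(x, y)
            vx, vy = beta * vx - lr * gx, beta * vy - lr * gy  # beta=0 is plain GD
            x, y = x + vx, y + vy
        return x, y

    print(descend(4.0, 4.0))            # plain gradient descent
    print(descend(4.0, 4.0, beta=0.9))  # momentum can roll through shallow dips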


>The earth should actually be a very easy function to optimize. Since the earth is primarily covered by oceans, more than two thirds of possible valid inputs for this function return the optimal cost function value.

I guess if you're going to count the surface of the oceans as part of the descent, you have to remember the surface of the ocean is not uniform in altitude, due to tidal forces among other factors.


“We have success!”

Ehm, no. The lowest point should have been somewhere 413 meters below sea level (the Dead Sea depression).

Earth is actually a very difficult place to perform gradient descent on, due to its very large plateau (oceans).

(Edit: lowest point is 413 meters)


The lowest point would be in the ocean, if you want to consider elevations below 0. In this implementation everything under 0 is clamped to 0. Otherwise, you'll have negative loss.


Obviously I considered the lowest point on land.


While there are good reasons to use gradient descent on neural networks, finding the global minimum of the surface of the earth would not be my first choice. Various metaheuristics would do much better.


"We have success! It’s interesting watching the behaviour of the optimizer, it seems to fall into a valley and “roll off” each of the sides on it’s way down the mountain"

Anyone sitting at a computer right now have a few minutes to make an animation of this? I'd love to see it in action.


The last image in the article is an animation with a tiny red dot following the path.



