Gaussian splatting is pretty cool (aras-p.info)
324 points by signa11 on Sept 7, 2023 | 94 comments



> This is using gradient descent and “differentiable rendering” and all the other things that are way over my head.

then...

> And finally, they have resisted the temptation to do “neural” anything ;)

So, they are doing something similar to NeRF, in a sense, but using a different basis function and a slightly different target. The whole "neural" or not "neural" distinction is just about what you are optimizing, but it's not that different, conceptually -- they are optimizing a data-driven approximator of a 3D scene-related quantity based on images.

The big difference of course, based on reading this, is that NeRF models the whole light transmission function ("radiance field") whereas this seems to model only the boundary conditions (what the light "hits"). So given the different modelling target, then yes, a different, and perhaps simpler, representation basis definitely seems warranted, and is shown here to give good results.

But the comment "they avoided neural anything" feels a bit smug, as if "avoiding the hype" is a laudable goal on its own merits -- as if people are using neural networks for no reason but because they're cool, and not because they're an appropriate solution that has been shown to work really well in practice. They surely are cool, don't get me wrong, but the hype is so often justified, because neural networks are really good function approximators. They are also not just one thing -- the choice and breadth of architectures that we happen to call "neural networks" is huge, so explicitly ruling out that huge potential solution space without good reason feels pretty specious. Heck, you could even consider a neural network with a Gaussian "output layer", which is called an RBF network, to be something like this "splat" approach. (This is in response to the blog post, not the paper -- I'm sure the paper has plenty of justification for its modeling choices.)

Having said that, of course it's interesting to also explore simpler, perhaps more efficient approaches, but I don't see "resisting using neural anything" as a good thing, just because they are popular. It ignores that they are popular for a reason.

(nb: This assessment is just based on the blog post and my impression of the tone of the introduction; I haven't read the paper yet, so it might not be accurate with respect to the paper contents. I just found this off-hand comment in the post worth reflecting on.)
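
To make the RBF aside concrete, here is a minimal toy sketch (nothing to do with the paper's actual pipeline) of a Gaussian "output layer" fit by least squares; the centers, widths, and target function are made up for illustration:

```python
import numpy as np

def rbf_features(x, centers, sigma):
    # x: (N, D), centers: (K, D) -> (N, K) matrix of Gaussian activations
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Fit a 1D toy function with 20 Gaussian "splats" by linear least squares.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(2 * x[:, 0]) + 0.1 * rng.standard_normal(200)

centers = np.linspace(-3, 3, 20)[:, None]
Phi = rbf_features(x, centers, sigma=0.4)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # output-layer weights

x_test = np.linspace(-3, 3, 50)[:, None]
y_pred = rbf_features(x_test, centers, sigma=0.4) @ w
```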


One big difference between NeRFs and this technique is that a NeRF is a jumbled mess of interconnected nodes, while this is a collection of values that lives in 3D space. It's much easier to reason about and could probably be integrated fairly easily in other 3D software like editors and renderers, and probably even animated. I agree that neural nets are cool and useful, but the whole black box function approximator that represents a whole scene at once thing doesn't feel scalable or elegant.


I don't think NeRF as it is now needs to be scalable or elegant. It is research into how far you can push neural networks. It has merit in the research itself.


Why doesn’t it feel scalable or elegant?

It’s just a function call and you can always transform the NeRF into a new representation if you want to use traditional 3d software.


How do you edit a NeRF? Let's say I want to change material of an object in the scene, or rotate part of it, or compose several into a new scene, etc. Or animate it. It's not really a good representation of a 3D scene IMO, unless all you care about is novel view synthesis, which is what it was designed to do.

Transforming it to a mesh or SDF is cheating; then it's no longer a NeRF, and of course the same cons don't apply anymore.


This is still an active area of research. Seems like marching cubes alone are not enough, IIRC. Do you have any links I can read for NeRF to USD, for example? Saw something pretty good from Nvidia a while ago but still not great -- still a lot of loss.


I agree that ostracizing researchers using neural networks is not a good path for anyone to go down. However, even though NNs are great at modeling problems…we usually have no idea how those learned models actually work. So NNs are great at solving problems in the short term, but aren’t terribly useful for developing long term solutions due to data requirements and regular retraining.

In this instance, I’m going to choose to take an optimistic stance and believe the authors were throwing a friendly elbow jab rather than fueling a mini culture war. The researchers leveraging DNNs are doing great work, but perhaps could be reined in just a bit :)


It is incorrect that we don’t understand how NeRFs work. Unlike language models etc., they are not using deep neural networks; they are using shallow multi-layer perceptrons as function optimizers. The behavior of these is well-understood.


> The big difference of course, based on reading this, is that NeRF models the whole light transmission function ("radiance field") whereas this seems to model only the boundary conditions (what the light "hits").

The title of the paper is "3D Gaussian Splatting for Real-Time Radiance Field Rendering". Each rendered pixel weights the contribution of unbounded view-dependent Gaussians. So, no, that's not a difference.


Ah gotcha, thanks for pointing that out.


> I don't see "resisting using neural anything" as a good thing, just because they are popular

I read Aras' quip as more narrowly technical. OFC it's ambiguous and you'd have to ask the man himself.

The gist of NeRF is to obtain an NN representation of a 5D light field (3D position + 2D direction) from samples (photographs) of the real-world light field. Alarm bells ring already -- 5 dimensions isn't that many! Considering NeRF has always used a low-rank spherical harmonic representation of the directional domain, it's even more like 3D-and-change. To reconstruct a function of such low dimensionality, why choose an NN?

Then at inference time, for each pixel, you have to sample the NN repeatedly over the view ray. This part is exceedingly silly, as compact representations of light fields are a solved bread-and-butter problem in graphics.
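
(For readers unfamiliar with that step, here is a rough sketch of the per-ray sampling being described; `toy_field` is just a stand-in for the trained network, and the sample counts and ray bounds are made up.)

```python
import numpy as np

def render_ray(field, origin, direction, t_near=0.1, t_far=5.0, n_samples=128):
    # Query the field at evenly spaced points along the ray and alpha-composite.
    ts = np.linspace(t_near, t_far, n_samples)
    points = origin[None, :] + ts[:, None] * direction[None, :]
    density, rgb = field(points, direction)                       # (N,), (N, 3)
    delta = (t_far - t_near) / n_samples
    alpha = 1.0 - np.exp(-density * delta)                         # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # transmittance
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(axis=0)                    # final pixel color

def toy_field(points, view_dir):
    # Stand-in for the NN: a soft sphere at the origin with a constant color.
    density = 5.0 * np.exp(-np.linalg.norm(points, axis=-1) ** 2)
    rgb = np.tile([0.8, 0.3, 0.2], (points.shape[0], 1))
    return density, rgb

color = render_ray(toy_field, np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0]))
```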

Later on Plenoxels explicitly took the "Ne" out of "NeRF", giving far higher training and inference performance (also mentioned ITT). To be fair, and later still, Nvidia somewhat redeemed NNs here with Instant NeRF: https://nvlabs.github.io/instant-ngp/assets/mueller2022insta... ...where the twist was to interpolate fancy input embeddings, which are run through a tiny NN. That tininess is important, as the need to fetch NN weights from VRAM would kick NNs right off the Pareto frontier.

Zooming out, NNs have only seen wide adoption in graphics engineering for reconstruction from sparse data (inc. denoising). Makes sense, as that's a high-dimensional problem. Still, beware that the NN solutions rarely blow handmade algorithms out of the water. I also think using tiny NNs for compression -- closely related to reconstruction -- has a future too. Beyond that, if NNs were to set the graphics world ablaze, it would've happened by now.

Lots of graphics engineering is just approximating functions, so it's natural NNs have some place here. However, our functions tend to be more understandable, tractable, malleable. It's not an application domain where it's virtually impossible to write an algorithmic solution by hand (let alone one that performs well), like natural language understanding.


Thank you for saying this. I think once things enter the culture war, normal and sensible people will get mugged by stupid, unexamined ideas so long as they’re on the right side of the culture war. So many conversations boil down to AI=BAD or CRYPTO=BAD, which is a shame not because the opposition is wrong - that’s your opinion, and who am I to disagree - but because once people have their conclusion they just stop thinking, and once you stop thinking you’ve got no way of noticing if your thoughts on a matter are totally stupid.


My point is a bit different though. It's not about culture wars or being anti or pro AI. It's about how silly it is to call one optimization function AI and another "not AI", and then make a whole kerfuffle about bucking trends, when what you've done is actually something very similar. (Maybe very interesting in its own right and a good solution though, with its own desirable properties... again, I'm not commenting on the paper itself, just this one off-hand comment in the blog.)

In my view, calling a solution "neural" and "AI" because you are optimizing a stack of ReLUs, and then turning around, solving the same (or almost the same) problem by optimizing a bunch of Gaussian model parameters on the same data, and saying that is "not AI" and "at least it's not neural!", is just a very specious and unnecessary way to delineate things imho. They're just two solutions to the same problem, both data driven, both generating "models". They have differences, just not differences that I think warranted such a comment in the blog, and moreover I thought it weird to attribute some kind of normative evaluation based on how you categorize the solution.


One thing I will say is that

AI means you don't understand how to quantify the problem.

AI is a great hack to quickly solve classification problems and other similar things without really understanding them. But you trade massive datasets for explainability. And that's a pretty bitter pill to swallow.


Sounds like you're mixing up AI and deep learning. A* is a type of AI, as is scheduling, sudoku solvers, etc.


No, GP is correct, "AI" as a buzzword means very little at this point.


That's because AI as a buzzword means deep learning now, but there's plenty of other AI that doesn't use training data and requires good knowledge of the problem you want to solve.

If you look at one of the most widely used textbooks, machine learning is one of seven parts, and not all of that is deep or supervised learning.

https://aima.cs.berkeley.edu/


But as a long-standing (as far as computing is concerned at any rate) academic field of study, it has meaning, even if the precise boundaries aren't perfectly well defined.


That it works without you understanding it is a failure of your understanding not of the approach.


Oh please. Dumb word games aren't going to make the "is this a bird" AI approach explainable. https://xkcd.com/1425/

Even the "simple" AI's with only a few TBs of data for training still have low/no understanding what's going on to identify the features and how they relate. Sure, we can tell this neuron identifies this squiggle, but what does it mean? No fucking clue. It also shows us collectively that most of these "hard" problems we have no clue how to understand them. And although AI allows us to "do it", it's a bad hack for actually understanding, optimizing, and doing.

> That it works without you understanding it is a failure of your understanding not of the approach.

Cool. Then point me towards the tools to explain how weights in a moderately sized NN are created from training data, how they are associated with each other, and how they find the features in question.

Oh yeah. Those tools don't exist.


You seem to find the explanation unsatisfying and are getting mad at the tools. The program that defines how you recognize a bird is defined in the connectivity of the neurons in your eye and your brain. The circuit is the program.

It's not at all clear that there is or even should be a deeper explanation than "a function with sufficiently many parameters can approximate any other function with only a small change to those parameters, and thus allows tractable solving" - whether by gradient descent or some hokey approximation our neurons are probably doing.

We have even less of an idea how our own system of "understanding" works than we do of neural networks "understanding". If you're going to be mad at tools that don't provide satisfying explanations for how they work, then look no further than your own brain. Are you going to give up on thinking, just because when you look at it closely you realize you have no idea how on earth you're even doing it?


There’s no helping the incurious, but there are indeed reams of explorations into weights and their explanations.

One of my personal favourites is Zeiler and Fergus https://arxiv.org/abs/1311.2901

Tom Zahavy’s “Greying the black box” on deep RL comes to mind too https://arxiv.org/abs/1602.02658

OpenAI has their Microscope tool for investigating individual neurons https://openai.com/research/microscope; more recently OpenAI shared research and tooling leveraging a powerful language model to explain a smaller model’s neurons. https://openai.com/research/language-models-can-explain-neur...

For a true toy-box exploration, playground.tensorflow.org is worth a play

These may or may not satisfy your curiosities — your perfect tool may not exist…yet. I would argue that the sampling of extant work above are indicative that it and more are possible.


Very cool work.

As a bit of a tangent (but wondering whether someone can answer): the article also makes mention of point-based rendering, and indeed it has been a staple of particle systems for a long time. However, especially with recent games, I have noticed (purely subjectively) a very subtle shift to a new style of particle systems which are on the one hand fully point oriented (compared to (textured) fragments) but on the other behave more like physics systems.

Examples:

- Hogwarts (heavily): https://www.gamespot.com/a/uploads/original/1816/18167535/40...
- Forspoken (heavily): https://oyster.ignimgs.com/mediawiki/apis.ign.com/project-at...
- Starfield (though more rarely): https://dotesports.com/wp-content/uploads/2023/08/temple-loc...
- AC6
- FF16 (heavily)

It's more obvious when you see it 'in motion'. The common denominator seems to be particles as colored transparent points with physics. Especially on console systems it seems that developers are using this for very cheap (CPU-wise, all on GPU) effects.

Anyone in gamedev who has some insight in this?


I'm not in gamedev anymore (and didn't do graphics when I was), but my vague impression is that real-time fluid simulation has become more common over the past decade, and I think that's what you're describing as "physics" here and is what makes point-based VFX actually look cool.

Without a fluid sim, point-based particle systems just look like fireworks. It was a cool effect in, like, the early 90s, but is passé today.

The next step up from that is having each particle move independently with its own physics (some momentum, maybe a little wandering around from "wind", etc.) but then rendering them using little texture billboards. That's what games did up until relatively recently and looks pretty good for explosions, smoke, etc.

But now machines are powerful enough and physics algorithms clever enough to actually do fluid simulation in real-time where the particles all interact with each other. I think that's what you're seeing now.


Very interesting, indeed, they seem to be driven by better fluid simulations... remarkable that they find their way into games. I was always under the impression that Navier Stokes was hard in 3d, but it does seem like there are performant solutions now that are easily offloaded to the GPU, e.g. https://github.com/chrismile/cfd3d (and NVIDIA also has some blog posts about it).

Edit: I also just found this: https://www.youtube.com/live/569oSOSoKDc?si=8V5buRMoI3IKqLQp... -- which is very close to what you describe and fully matches the kind of particle systems I was hinting at, thanks!


Do you have any example of particle systems without fluid sim? (e.g. videos on youtube from old games, or names of old games that used them?)


Engine rather than a game, this video about UE4's particle system shows a lot https://www.youtube.com/watch?v=OXK2Xbd7D9w


The first thing that came to my mind was Portal 2, the "you can't portal here" shower of sparks when you're spamming the gun in the elevators. :)


I'm not exactly up-to-date on current techniques (I've been out of AAA game making for a while now), but here are some general observations that might be useful.

Way back during the transition from Doom to Quake, which in some ways really marked the transition from 2D to 3D, Quake's particle systems also relied overwhelmingly on small flat colored particles and their motions, rather than larger textured sprites and their silhouettes. (Quake did use a few sprites, but they were few and far between.)

And I think the reasoning was pretty straightforward even back then: in a 3D game world, there are a lot of conceptual and architectural benefits to only working with truly 3D primitives -- and point sprites can often be treated like nearly infinitely small 3D objects.

Whereas putting 2D sprites into a 3D scene introduces a bunch of kludges. In particular, 2D sprites with any partial transparency need to be sorted back to front in a scene for certain major blend modes, which gets really troublesome as there are more and more of them. They don't play nice with z-buffers. And because they need to be sorted, they don't always play nice with the more natural order you might prefer to batch drawing in to keep the GPU happy. And likewise, they have a habit of clipping into 3D surfaces in ways that reveal their 2D-ness. There are probably more things I'm forgetting.

These are all issues that have had lots of technical workarounds and compromises thrown at them over time, because 2d transparent textures have been so important for things like effects and grass and trees. Screen door transparency. Shaders to change how sprites are written into zbuffers. Alpha testing. Alpha to coverage. Various follow-on techniques to sand down the rough edges of these things and deal with aliasing in shaders. And so on.

And then there's the issue of VR (or so a cursory skim suggests). I haven't spent time doing VR development myself, but a quick refresher skim of articles and forum posts suggests that 2d image based rendering in 3d scenes tends to stick out a lot more, in a bad way, in VR than on a single screen. The fact that they're flat billboards is much more noticeable... which is roughly what I had guessed before I started writing this comment up.

All of those reasons taken together suggest why people would be happy to move on from effects based on 2d texture sprites in 3d scenes, to say nothing of the other benefits that come from using masses of point sprites specifically themselves (especially in terms of physics simulations and such).


I've certainly seen game engines trying to advertise much more-involved particle systems over the last few years, that is for sure. ;PPPP

Which in some ways I guess makes sense, as the shift towards AI has heavily pushed GPUs almost towards a "sizeable-reduced-instruction-set-GPU-in-GPU" approach with RT cores/tensor cores, etc.... ;P

Side note as well: in my experience at least, I think these systems may not be as hard to render as they were for the author. A few years ago, someone (at NVIDIA I think?) wrote kernels to move the SE(3) kernels to tensor cores, so I wouldn't be shocked if some of that could be ported to the spherical harmonics portion of the Gaussian splatting during both compression and runtime: https://developer.nvidia.com/blog/accelerating-se3-transform...

Also, side side note: Gaussian splatting should be quite efficient... I think? Due to technically always having support in 3D space (and hopefully not too much of a problem with having good support in 3D space). This should mean that even 'sloppy', quick-conversion calculations should work pretty decently in the end.

I say all of this knowing very little about most optimizations like billboarding, how things like nanite work, etc, etc. I do like it tho! ;PPPP


I know in Unity, they added a VFX Graph several years ago, that allows for these kinds of fine-grain particle effects. You can create a beautiful vortex of glowing points relatively simply.

I'm assuming Unreal Engine has something similar, but I don't have much experience with it.


I remember that kind of "particle effect" originally being shown off ~11 years ago by the Unreal Engine 4 demo called "Elemental": https://youtu.be/dD9CPqSKjTU

I don't think they're "from" that, but that's the first landmark I can point to.


I think this is just an artifact of people using much higher particle count these days. When you have more of them it's more noticeable that they are points.


I was super excited to see this when the Siggraph paper first dropped. For the past decade or so, I've been taking pictures of any room I've lived in from hundreds of angles, with the intention of being able to recreate these in 3d at some point in the future. Gaussian splatting is the first technique I've seen that feels like it can recreate it in a way that feels almost-real.

Once the tools around this mature a bit more, I'm super excited to revisit those old rooms and be hit with a wave of nostalgia.


I am very uneducated in this, so please forgive my ignorance. These videos always look so insanely cool, which I really like, and from what I understand the scenes/radiance fields are always static and have lighting baked in. Is there a chance this might be turned into something that can be dynamically lit and support motion?


That is a good point and the answer in short is: No.

Radiance fields have no concept of light emission, reflection, absorption, etc. Instead everything is mushed into one value: the light transported. In that sense radiance fields are just 3D photos.

You would have to perform inverse rendering / photogrammetry and estimate where the light sources and the surfaces are, what materials they have, and so on. Then you could use traditional path tracing methods on that again.

Another thing to think about might be videos (not animation): Continuously capture the radiance field over time and then try to compress away the similarities in between frames to gain temporal coherence.


Would you not be able to store and render normal maps and color instead of just albedo? Seems like you should be able to render the scene normals and do a deferred lighting pass.

Is depth not properly preserved or something?


They're not just storing the albedo; they're optimizing spherical harmonics to represent the color in an anisotropic way, which is why they're calling it a radiance field. Radiance fields capture both light intensity (including color) and direction. They explain in the paper that it's very difficult to estimate good normals from the sparse point cloud they're starting with (or rather, that's taken as a given and produced in an earlier step using COLMAP) and that the Gaussians don't use normals. You could probably make a point cloud from the Gaussians and then use one of the existing techniques to estimate their normals, as a first attempt.

Remember that it's a bit tricky to talk about depth when the Gaussians have both a position (mean value) and a size (covariance). The bicycle spokes are made up of long thin splats -- what value do you assign to one of those? That's why I think you would have to sample new points from them as a first step.

https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/
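
(To make the spherical-harmonics part above concrete, here is a sketch of how a per-splat SH color might be evaluated for a view direction; the degree, coefficient layout, and the +0.5 offset are common implementation conventions, not necessarily the paper's exact ones.)

```python
import numpy as np

# Real SH basis constants for degrees l=0 and l=1
C0, C1 = 0.28209479177387814, 0.4886025119029199

def sh_color(coeffs, view_dir):
    # coeffs: (4, 3) SH coefficients per RGB channel; view_dir: unit vector
    x, y, z = view_dir
    basis = np.array([C0, -C1 * y, C1 * z, -C1 * x])
    return np.clip(basis @ coeffs + 0.5, 0.0, 1.0)   # view-dependent RGB in [0, 1]

coeffs = np.zeros((4, 3))
coeffs[0] = [1.0, 0.4, 0.2]                           # mostly view-independent reddish splat
rgb = sh_color(coeffs, np.array([0.0, 0.0, 1.0]))
```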


I wasn't saying that you'd estimate normals from the point cloud. You'd need to estimate the normals separately and store the world position and world normal along with the color. This should be possible as these values can be represented as a color texture, so you should be able to construct something that renders a normal map and depth map from any angle just like this renders the color currently.


For the motion part, there is already an extension to Gaussian splatting for moving stuff: https://dynamic3dgaussians.github.io/

You can also just string a bunch of them together to create an animation: https://twitter.com/8Infinite8/status/1699460316529090568

For relighting, there are lots of NeRF variants that do this -- it should be possible to optimize material parameters for the splats.


I imagine it would be hard, since these are usually captured with a single camera moving around and taking lots of photos.

Here's an edge case I can imagine for dynamic lighting - Say you capture a scene indoors, and a table casts a dark shadow on the floor. But the NeRFs don't try to understand light sources and shadows yet, so it wouldn't know whether the floor is painted black, or a white surface shadowed by the table, or if there's actually a blue Stanford bunny hiding in the shadows.

The 3D scanning rigs that capture small objects like people's faces handle this by manipulating lighting and sampling the BRDF directly. If you can't manipulate the lighting, you can probably guess a BRDF, but there will be limits.

Re-animating might be easy. But capturing an animation, I think you'd need multiple cameras or you'd have to settle for guesswork, like a neural network that can hallucinate the hidden side of a person based on the fact that they're a person. If you point the camera at someone who is walking, you'll get a good view of them from one side, but when you wind the video back, the network won't know what their far side looks like at all.

A few years ago Intel had a project to capture an animated scene with multiple cameras. The pitch was something like, "Just film everything, and you can position the camera in post-processing." I think they wanted it for football games, but I never heard of it shipping. And again, multiple cameras. Matrix-style.


Sure. One way is to gather splats under various real world light conditions, then map those to the closest simulated light condition. (I.e. make the data animate over time of day.)

The data requirements might become massive, but there are ways to do the interpolation where it isn’t so bad. If a static scene is 2GB, you should be able to get to a rough time of day approximation in less than 16GB, which is renderable on modern GPUs.

Then it’s “just” a matter of spending several years optimizing it while waiting for H100s to become consumer grade devices.


It’s not actually that difficult. By differentiating on the spherical harmonics of each point/Gaussian we can approximate materials and their response to lighting.


Sure, if you want to reinvent N dot L. (In other words, yes, you can do that, but then the result will look just as fake as every other “photo realistic” scheme.)

The only hope is to measure actual photons hitting actual sensors, which is why Gaussian splatting looks so real to begin with.


> gather splats under various real world light conditions, then map those to the closest simulated light condition

is a better alternative from your perspective?


Oh yes. The key to making realistic-looking video is to sample from the real world. The more closely you do that, the more realistic it looks. The limit case is a phone camera recording a video.


It's always fun when science exceeds science fiction's expectations. In this case, I was immediately thinking of the Braindance concept[1] in Cyberpunk 2077, which - more or less - allows you to move around in the visual memories of another person, limited by that person's perception during that scene.

When moving the camera to a perspective distinct from the original one, the vision disintegrates into kind of three-dimensional pixel blobs, somewhat similar to the notion of blobs here - just a lot less polished, surprisingly, compared to this paper.

[1]: https://steelseries.com/blog/how-to-braindance-cyberpunk-207...


Splatting for volume rendering is quite old - Westover, Lee Alan (July 1991). "SPLATTING: A Parallel, Feed-Forward Volume Rendering Algorithm"


Are you saying there's nothing new in this or that it's building on an already established approach? I think the latter is not something anyone is trying to hide - while the former doesn't seem correct.


What’s new in the paper is the technique to generate the gaussians from photos. The output that is used to render uses a technique that is a decade or three old, but not terribly practical until recently.

For me, a rendering guy, that’s great! The data used at render time is very simple and flexible. Simpler than triangles even when you get into non-trivial operations.


Was the decades-old type of splatting view-dependent as well?


The view-dependent bit reflects the simplicity of the data. The data is similar to an oriented, stretchy box that you interpret as an oriented, stretchy, fuzzy ball (the Gaussian field). What you do from there is up to you.

Classically, researchers were interested in just using the splats at all. So, they just assigned a single color to each one.

This paper assigns a spherical harmonic-based color sphere instead. That gives it the view angle -> color function.

There is a second paper focused on moving splats. It just uses solid colored splats.

You could instead associate albedo/specular/roughness from real time materials and do real time lighting. But, you’d have to figure out how to generate/capture those values.
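
(A rough sketch of the kind of per-splat record being described; the field names and sizes are illustrative, not the paper's storage format.)

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Splat:
    position: np.ndarray   # (3,) Gaussian mean
    rotation: np.ndarray   # (4,) quaternion orienting the ellipsoid
    scale: np.ndarray      # (3,) per-axis standard deviations
    opacity: float         # scalar alpha
    sh_coeffs: np.ndarray  # (K, 3) spherical-harmonics color coefficients

# e.g. K = 16 coefficients per channel for SH up to degree 3
s = Splat(np.zeros(3), np.array([1.0, 0.0, 0.0, 0.0]),
          np.full(3, 0.01), 0.8, np.zeros((16, 3)))
```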


Yes, I used to think of it as throw Gaussian snowballs at the window and see what sticks.

P.S. Not sure where the insight came from, but I was living near Boston at the time, and had young children :)


Yeah, it seems like the authors of the new paper may have missed much relevant existing research?


They haven't. You can't get a paper into SIGGRAPH that misses relevant existing research.


I'm looking forward to the first native optimized WebGPU implementation of 3DGS rendering. I'm also curious how scene data could be compressed and decompressed efficiently.


I'm also looking forward to it. One of the big challenges is the sorting, for which I'm unaware of a good WebGPU implementation. I have some more notes on this question in a Zulip thread[1].

[1]: https://xi.zulipchat.com/#narrow/stream/197075-gpu/topic/Gau...


Does the sorting need to be done in the renderer, or is that part of the training process?


It needs to be done in the renderer. I think it's doable though; the FidelityFX library looks like it can be ported, it'll just run a bit slow because of the lack of subgroups. This particular library isn't based on a fancy scan implementation, as the state-of-the-art CUDA implementations are. There's a bit more followup in the linked Zulip thread.
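
(A toy CPU version of the step being discussed, just to show what gets sorted and why; a real WebGPU/CUDA renderer would do this with a GPU radix sort such as the FidelityFX one mentioned above, and would blend in the resulting order.)

```python
import numpy as np

def sort_back_to_front(positions, view_matrix):
    # positions: (N, 3) splat centers in world space; view_matrix: (4, 4) world-to-camera.
    ones = np.ones((positions.shape[0], 1))
    cam_space = np.hstack([positions, ones]) @ view_matrix.T
    depth = -cam_space[:, 2]          # camera assumed to look down -Z
    return np.argsort(-depth)         # farthest first, for painter's-style alpha blending

order = sort_back_to_front(np.random.rand(1000, 3), np.eye(4))
```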


I’m working on this right now!


I recently came across this video showing how to use Gaussian splatting: Getting Started With 3D Gaussian Splats for Windows (Beginner Guide) - https://www.youtube.com/watch?v=UXtuigy_wYc


What's the state of the art for structure from motion? Given video of a space, how does one practically turn it into a 3d scene today?


"Structure from motion" refers to the process of reconstructing camera poses and a sparse point cloud from a set of images (or video). That's the input to the scene reconstruction process described in this paper.

As far as I know, the basic approach to SfM hasn't changed much in the last decade or so. It boils down to image feature extraction (using something like SIFT feature vectors), then a heuristic matching process, then bundle adjustment and outlier rejection.
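
(A sketch of just the first two stages, feature extraction and matching, using OpenCV's SIFT; a full pipeline like COLMAP also does geometric verification, incremental pose estimation, and bundle adjustment on top of this. The image paths are placeholders.)

```python
import cv2

def match_pair(path_a, path_b, ratio=0.75):
    sift = cv2.SIFT_create()
    img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)
    kp_a, desc_a = sift.detectAndCompute(img_a, None)
    kp_b, desc_b = sift.detectAndCompute(img_b, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    # Lowe's ratio test to discard ambiguous matches
    good = [m for m, n in matcher.knnMatch(desc_a, desc_b, k=2)
            if m.distance < ratio * n.distance]
    return kp_a, kp_b, good
```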


My application is for taking videos of rooms and turning them into to-scale models in a 3d modeling software (to be quantized and re-textured manually). Basically a rough floor plan generator from walkthrough videos.

Is there something off the shelf that does this, or a library that can generate these point clouds so I can go about trying to fit simple geometry to them to reconstruct the scene?

Please forgive the noob questions; I've not worked with automated 3d stuff or photogrammetry before.


Take a look at Meshroom from https://alicevision.org/

It's a pretty decent tool, although I'm not sure if videos work well.

I once tried to build a mesh from an aerial video by extracting certain frames and the result was ok-ish


No.

I recently tested exactly this use case, with iPhone apps 3D Scanner, Polycam and magicplan.

The results were, how do I put it nicely, not very useful.

For example, the floor wasn't even flat/straight! One would think this would be a basic constraint (or “inductive bias”) of the algorithm…

I haven’t tested Luma yet; also, my example wasn't literally the most basic one (square empty room): instead there was some clutter on the floor, an irregularly shaped room, a chair in the middle of the room, and open doors.


Apple device only, but RoomPlan will do geometry estimation directly from the sensors.

https://developer.apple.com/videos/play/wwdc2023/10192/


> into to-scale models

What does this mean? Why would scale matter in a 3D model when you can scale it however you want in two seconds?


"Scale model" means things are correctly sized relative to each other, e.g. your square room doesn't end up rectangular.


Scale model generally means you make something smaller than the real thing.

Also they said 'to-scale', but none of this applies, since if you are doing some sort of photogrammetry you are getting an accurate model of what you are photographing anyway; that's the entire point.


Conventional photogrammetry gives an 'up to scale' model, i.e. the scale is arbitrary and needs to set later to match reality. This is because a real cow far away and a toy cow close up look the same in pictures.

You can supplement images with lidar, stereo images, or IMU odometry to capture true scale information at the SfM stage to get (approximately) scale-correct models.

Alternatively you can learn what the real world tends to look like and estimate absolute depth from images alone, if you have enough data. This might be defeated by toy cows and doll houses, but can work for some applications.


> gives an 'up to scale' model

This is not a term that makes sense. It makes a model; there is no sense in having it scaled in any way except for being correct. Saying 'up to scale' is like saying 'feet running' or 'mouth eating' or 'face talking'.

> the scale is arbitrary

Only if you don't measure anything. If you have the camera correct and know how high something is, or just know some depth or distances, photogrammetry makes an accurate model. 'Scale model' is a term from when scaling the model down made it easier to make. In the computer, it just means it isn't accurate.


It's actually the technical term that is used in this context.

It's used when you don't know the scale.

You can estimate all the geometry 'up to' but not including a scale term.

> and know how high something is or just know some depth or distances

There you go: you just added in some extra information to constrain the scale estimate as I described.


In the cow vs. toy cow example, you get a 3D model of a cow, but you don't know its absolute size. If you ALSO know that the cow is 2m long, or 3m away, you can infer the scaling factor for the geometry you estimated using pixels and camera intrinsics.
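
(A toy version of that scale recovery, with made-up numbers:)

```python
import numpy as np

points = np.random.rand(1000, 3)            # up-to-scale reconstruction, arbitrary units
est_cow_length = 0.37                       # measured between two model points, model units
real_cow_length = 2.0                       # known real-world length in meters
scale = real_cow_length / est_cow_length    # ~5.4 meters per model unit
points_metric = points * scale              # reconstruction now in meters
```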


Video is just a bunch of still images (except potentially with degraded quality from resolution and/or motion blur), so it's much the same as asking "how does one practically capture a 3D scene" without the video part. Regular photogrammetry -- but with interest and practicality gradually increasing for NeRF-style approaches.

(Although I am surprised that there's not more attention being paid to the extra spatial relationships that can be inferred from merely knowing something is a video. Surely it could be used as an extra constraint when inferring camera poses -- you know the camera can only move in certain ways from one frame to the next?)


> (Although I am surprised that there's not more attention being paid to the extra spatial relationships that can be inferred from merely knowing something is a video. Surely it could be used as an extra constraint when inferring camera poses -- you know the camera can only move in certain ways from one frame to the next?)

Can I ask why you say this? Camera motion models, including Kalman filters of various kinds and constraints on derivatives etc., are absolutely used in SLAM and photogrammetry afaik, and have been for a long time.


OK, maybe it's my limited experience, but the last time I tried it, video was treated as "extract frames and then only use every 50th (or so)".

This was a couple of years ago probably. Meshroom and Reality Capture. Maybe I was mistaken or maybe they've improved?

Can you point me to software that does use video for photogrammetry in the way you describe?


The Gaussian Splatting paper itself uses COLMAP, which really isn't that new. I have thoughts about marching cubes across splatted gaussians to get a mesh out, but I don't think I've seen that done yet.


Check out the GitHub repo Hierarchical-Localization, which uses DL + traditional SfM


Literally gaussian splatting (and nerfs)


I rendered voxel iso-surfaces using blobs in the late 90's. I'd scan a full 3D voxel array for surface voxels, use the local density gradient to compute a normal, and quantize the normal to one of 240 directions. Then I'd make chains of surface voxels using a table of displacement vectors. Most voxels were 2 bytes: a displacement index (from the previous voxel) and a normal vector (index). Lighting was computed for all 240 normal vectors and stuffed in a lookup. Software could paint all these little colored circles in a z-buffer very fast. The only real bummer back then was that it had to be non-perspective. The displacement vectors were converted to screen space per-frame, so it was paint a blob, offset, look up color, paint another. One displacement value was used to indicate the chain ended and I needed to store an absolute position next, but most voxels were just 2 bytes.
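
(A rough modern sketch of a couple of those steps -- gradient normals plus a quantized-normal lighting lookup; the table construction and shading here are illustrative, not the original code.)

```python
import numpy as np

def surface_normals(density):
    # Central-difference gradient of a 3D density volume, negated and normalized
    # so the normals point out of the surface; result has shape (X, Y, Z, 3).
    g = np.stack(np.gradient(density.astype(np.float32)), axis=-1)
    return -g / (np.linalg.norm(g, axis=-1, keepdims=True) + 1e-8)

# Small table of quantized normal directions with lighting precomputed once.
table = np.random.default_rng(1).standard_normal((240, 3))
table /= np.linalg.norm(table, axis=1, keepdims=True)
light_dir = np.array([0.0, 0.0, 1.0])
lighting_lut = np.clip(table @ light_dir, 0.0, 1.0)   # one Lambert shade per entry

def quantize(normal):
    return int(np.argmax(table @ normal))              # index of the closest table normal
```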


Wow, the referenced Ecstatica game is wild.


For a second I thought it was a really well done fake. The art style looked so obviously beyond 1994 software rendering. But then you start to notice that the backgrounds are all static 3D renderings and there are very few objects in motion after all, so it's certainly feasible on a 486.

Maybe you could say it's basically a 3D version of Alone in the Dark using ellipsoids instead of polygons as the primitive.

The quality of the animation is unusually good for the time, which is really what creates the coherent illusion. Wikipedia says the game had just one developer paired with a "film animation expert", which makes a lot of sense.

https://en.wikipedia.org/wiki/Ecstatica

I'm shocked I've never heard of this game until now.


Ecstatica II was an extraordinary game - funny, weird, violent, amazing sound, and super atmospheric.



Would be cool to see those splats in action from a second, free to move camera


What’s the difference between a free-to-move camera and one on a preset track if they’re rendering at 100fps?


To see the blobs forming, moving, changing


But the 3D blobs don’t move, do they? Only the splats do.


Does Plenoxels also count? I believe they also use a similar approach

https://alexyu.net/plenoxels/


Various comparisons with plenoxels are in the paper.


This paper mentions Plenoxels in the introduction and beats it.


Here's a video that may or may not use these techniques, I'd be curious for those who know to judge ... https://youtu.be/dXLCHvRsgRQ?si=wkniCf-7wNPt8DGj

There's frame-to-frame coherence in the artifacts used to render, but they may be 2D "watercolor" splotches, and some seem very manually seeded, e.g. the shapes used to fill in irises for eyes. In some scenes, though, the blobs seem to be 3D-represented.


Could someone please explain to me what the advantage of this is over traditional photogrammetry?



