That book mentions alpha-beta filters as sort of a younger sibling to full-blown Kalman filters. I recently had need of something like this at work, and started doing a bunch of reading. Eventually I realized that alpha-beta filters (and the whole Kalman family) is very focused on predicting the near future, whereas what I really needed was just a way to smooth historical data.
So I started reading in that direction, came across "double exponential smoothing" which seemed perfect for my use-case, and as I went into it I realized... it's just the alpha-beta filter again, but now with different names for all the variables :(
I can't help feeling like this entire neighborhood of math rests on a few common fundamental theories, but because different disciplines arrived at the same systems via different approaches, they end up sounding a little different and the commonality is obscured. Something about power series, Euler's number, gradient descent, filters, feedback systems, general system theory... it feels to me like there's a relatively small kernel of intuitive understanding at the heart of all that stuff, which could end up making glorious sense of a lot of mathematics if I could only grasp it.
Incidentally this is why people miss the mark when they get mad about mathematicians using single letter variable names. Short names let you focus on the structure of equations and relationships, which lets you more easily pattern match and say "wait, this is structurally the same as X other thing I already know but with different names". It's not about saving paper or making it easier to write (it is not easier to write Greek letters with super/subscripts in LaTeX using an English keyboard than it would be to use words). It is about transmitting a certain type of information to the reader that is otherwise very difficult to transmit.
While it uses letters so it looks vaguely like writing, math notation is very pictorial in nature. Long words would obscure the pictures.
I disagree. Single letter variables are meaningless.
In order to get the big picture, you have to remember
what all those meaningless letters stand for. Using
meaningful variables would make this easier.
If you work with them long enough it becomes second nature to read them, and then it is easier to manipulate and compose them. The rest of the context is the background knowledge to understand the pithy core equations. Papers are for explaining concepts, equations are for symbolic manipulation. Meaningful variable names would be middle ground and not good at either, except to help someone not familiar with the subject to understand the equation, but a lot of the symbols are so abstract that they really need to be explained in more detail elsewhere or would be arbitrarily named.
If you're in an abstract/general mathematical function, then sure: single letters. If you're doing more business logic kind of stuff (iterating through a list of db/orm objects or processing a request body) then the names should be longer
Often the actual meaning of the symbols is subordinate to the point you're trying to convey. e.g. I can tell you that `integrate(boundary(Region), form) = integrate(Region, differentiate(form))`, which is great and all, but I might write `<∂M|w> = <M|dw>` because what I'm trying to tell you is that you should think of these things as a dual-pairing of vector spaces (via integration) and that ∂ and d are somehow adjoint. They're both Stokes' theorem, but the emphasis is different, and in either case the hard part is the mountain of work it takes to define what the words even mean (limits, and integrals, and derivatives, and vectors, and covectors, and manifolds, and tangent spaces, and vector fields, and covector fields, and partitions of unity, and symmetric and alternating forms, and exterior derivatives, etc. etc. all so you can finally write one equation, which really just says that all the swirlies inside a region cancel out so if you want to add them all up, you can just add up the outer swirly).
The thing about math is you need to be comfortable viewing the same concept through a bunch of different lenses, and various notations are meant to help you do that by emphasizing different aspects of "the picture" you're looking at.
Ok, I can accept that. At the same time, my impression is that mathematicians always use single-letter variables.
It's like either they're not clear who their audience is or they're afraid to get off the beaten path. If they're explaining a classic algorithm, they use the common, single-letter variables instead of replacing them with meaningful names.
IMO your comment seems not to be addressing the point made in its parent comment. To make the point again with different words:
- Using long descriptive variable names would give them meaning, and make the particular equation/expression easier to understand or apply.
- Using short single-letter variable names allows you to forget the meaning of the variables and see the underlying structure, thus making the expression easier to connect to other situations (with completely unrelated meanings) that happen to have the same underlying structure. (The letters being meaningless, or at least not carrying their meaning so strongly, is a feature, not a bug.)
(Another way of seeing the distinction is whether you consider the equation to be the final result, to be used and applied, or as a starting point, to be manipulated further.)
You're looking for the theory of linear (or nonlinear) dynamical systems. Unfortunately it's not one kernel of intuition backed by consistent notation, it's many with no consistency. A good course on controls and signals/systems will beat those intuitions into you and you learn the math/parlance without getting attached to any one notational convention.
The real intuition is "everything is a filter." Everything else is about analysis and synthesis of that idea.
Maybe check out Probabilistic Robotics by Dieter Fox, Sebastian Thrun, and Wolfram Burgard. It has a coherent Bayesian formulation with consistent notation on many Kalman-related topics. Also with the rise of AI/ML, classic control theory ideas are being merged with reinforcement learning.
Thanks for the recommendation! It would never have occurred to me to look at robotics, but I can understand why that's very relevant.
I read Feedback Control for Computer Systems not too long ago, which felt like yet another restatement of the same ideas; I guess that counts as "classic control theory".
If Q and R are constant (as is usually the case), the gain quickly converges, such that the Kalman filter is just an exponential filter with a prediction step. For many people this is a lot easier to understand, and even matches how it is typically used, where Q and R are manually tuned until it “looks good” and never changed again. Moreover, there is just one gain to manually tune instead of multiple quantities Q and R.
Hey, I had very similar thoughts many years ago! The trick is yes, many filters boil down to alpha/beta, and the kalman filter is (edit: can be) really a way to generate those constants given a (linear) model (set of equations describing the dynamics, ie the future states) and good knowledge of the noise (variance) in the measurements. So if the measurements always have the same noise it will just reduce the constants over time, and it is only really useful when the measurement accuracy can be determined well and also changes a lot.
Interesting. Are you characterizing Kalman filters mostly as systems of control/refinement on top of alpha-beta filters?
I do feel like the core of it is essentially exponential/logarithmic growth/decay, with the option to layer multiple higher-order growth/decay series on top of one another. Maybe that's the gist...
When you start dealing with linear systems and disturbances, you end up with basically matrix math and covariance in some form and way.
The thing about Kalman filter is that its a pretty well known and exists in many software packages (just like PID) so its fairly easy to implement. But because noise is often not gaussian, and systems are often not linear, its more of a "works well enough" for most applications.
Recently tasked with implementing a Kalman filter, I found it very very difficult to find good resources that explained it in language that made sense to a developer like me. So after spending a month learning it, I wrote a couple posts on it, perhaps someone might find it helpful?
As a developer I found the maths made sense only after implementing it, ironically. I guess we learn by building on top of what we already know? Is there a term for that?
Off topic, but why are you one of those people who do a full screen “subscribe to my mailing list” overlay? I never understand why you’d wanna do that.
Oh, many years ago I bought into the whole 'write a blog, collect subscribers, create a training product and sell it to them, retire on the beach' idea and gave it a go. I now think that's exceedingly unlikely, of course. Sorry you find it annoying, i'll have to get around to removing it sometime.
I've always thought math would be much easier to learn if they used descriptive variable names. Or, at least in an interactive medium like the web, add some tooltips at a bare minimum. Whenever I study math I spend 90% of the time looking up the symbols.
Also when this person says the subscript "denotes the order of the measurement" I'm trying to figure out what kind of order he's talking about. I guess that's the index? It's been a while since I did kalman filters:-p
People always seem to forget that mathematical notation is designed to make algebraic manipulations easier to follow. It's not really intended as something that makes sense on its own, it's mostly physicists who think something like E=mc^2 should have any meaning.
The more pure the mathematics the shorter the scope of most variables. Typically a variable is defined just before it's used, with a scope no longer than the proof or derivation it's in.
Also some of the choices in this article are just plain silly. Such as using P as both variable and index, and then use it for the covariance matrix when the precision matrix is the exact inverse.
To be honest a lot of things that are incredibly easy, incredibly intuitive — things of the "I could come up with this myself"-kind — can seem impenetrable when written down in mathematical notation.
If you want people to understand, you skip the mathematical notation and use pseudocode, or at least you start with an explaination and once people know what they are looking at you go into the math notation.
I understand that mathematical notation can be very practical for describing a certain class of problems concisely, but especially if you teach or explain things, consider that being concise is meaningless if people don't understand what you describe.
Sometimes my feeling is that this is on purpose. People not understanding what you are doing (but you come across soo smart), is a feature to many. Even better: The layperson cannot tell whether you are talking utter gibberish or using the most precise language on earth.
I, however, think you can recognize real smart people by the phenomenon that every room they are in seems to become smarter in total, because they lift people up in their wake of understanding things so deeply, they can just break them down into their very simple, managable parts and translate obscure domain-specific languages into something people can grasp.
I too can go into deep domain specific lanuages, be it in philosophy, electrical regulation, film theory, programming lingo, music theory, control theory etc. But what is the point of doing that without making the person opposite understand what you're waffling on about? Because either you will seem like a knowledgeable person that has a total lack of self-perception or like a brick that doesn't care what message reaches the person opposite as long as rhey can hear themselves talk.
That being said: Domain-specific language can be okay if it is used between people within that domain. But I have met many physicists or mathematicians who also think math notation sucks, so maybe there could be something better.
Unecessary use of mathematical notation / terminology definitely happens. It's one of the things that annoy me to no end. Especially if after unwrapping the unintelligible mess of mathematical gibberish you're left with some inane argument.
I still haven't quite forgiven the writers of UMAP for writing a theorem that uses pi^n/2 / Gamma(n/2+1) instead of the more reasonable "volume of the unit sphere". It makes it so confusing that they fail to spot the theorem doesn't work for low dimensional objects embedded in high dimensional spaces, which is their exact use-case. Luckily their informal conclusion does work, mostly.
In computer science, we straddle this divide where the usual way to describe things in scientific literature is using algebraic notation, but the usual way to actually write up a program is to use meaningful variable names. It's not K_i, it's current_gains or something like that. It leads to a funny kind of style where especially computer science-related fields start writing K_{current} in scientific manuscripts, and are then chided because that is not how you use subscripts. The reverse is also true, theory-adjacent code tends to use a lot more terse naming conventions.
I often work with a guy who's a pure computer scientist. He communicates in LaTeX and pseudocode only. I don't think he could find a compiler if he tried. What I've learned to do is keep his notation whenever I implement something. I start the function with a comment that says what paper I'm referencing when I implement it and then comment all of the one-letter variables. (In this case I would use K[i]).
Usually the stuff I'm implementing is complicated enough that reading the code without reading the paper won't have much meaning. And if you're reading the paper, it's good to have as close to a 1-1 representation to help give confidence that I implemented it right.
I've had non-zero times where I took his pseudocode, pasted it into my IDE and did a search and replace of begin/end to braces and "<-" to "=" and it "just worked" on the first try. I always found that amazing that he could write this stuff without ever trying it (outside of his head).
> Also when this person says the subscript "denotes the order of the measurement" I'm trying to figure out what kind of order he's talking about.
I hate that the most when reading papers. Authors trying to sound abstract and academic, but only accomplishing being frustratingly vague. AUTHORS YOU STILL HAVE TO INSERT THE SUBJECT INTO YOUR SENTENCES FOR THEM TO MAKE SENSE.
I'm so frustrated at this aspect in research papers more than anything else. You must disambiguate. Use absolute descriptors and do not use relative descriptors. Don't tell me to look right, because I'll look left. Use absolute descriptors! "then after spinning the prism the light cone blah blah blah" SPIN!? SPIN IN WHAT DIRECTION????? LEFT?RIGHT?! LATERAL? UP? DOWN???? How fast? How slow? You imagine all of these CRITICAL ASPECTS in your head when writing such ambiguous sentences, but the reader cannot read your mind.
Having written papers like this before, a large part of the problem is that almost all CS research is published in conferences, rather than journals, and conferences frequently have extremely strict length limits, on the order of three or five pages. If you have an even slightly complicated procedure it can be nightmarishly difficult to get even the core information into your paper, and you can forget about details or tangents.
I completely agree. And my insight similar to yours is that the greatest math book that no one has written is one where the meaning of notation, variables and a clear assortment of theorems across all topics are well curated.
> Also when this person says the subscript "denotes the order of the measurement" I'm trying to figure out what kind of order he's talking about. I guess that's the index? It's been a while since I did kalman filters:-p
The order referred to is the index-in-time that a value correspond to. Eg, x_3 would be the state at the third time step. I think their subscript “p” stands for prediction. x_p at time 3 is the state we expect at time 4. But then when time 4 comes around, we incorporate new measurements and calculate x_4 including that new information. Just to be explicit, this x_4 will be different from the x_p we calculated at time 3, as our prediction is always a bit incorrect
> I've always thought math would be much easier to learn if they used descriptive variable names.
I think the variable names are already picked to be descriptive. No one is picking them to be more obscure or harder to track. The problem is that those who are starting out still haven't picked up concepts or learned the standard notations for each problem domain, thus we are left with the pain of ramping up on a topic.
I generally don't have a problem with variable names, but creating syntax and terminology that conflicts with other mathematical use is a real problem.
I agree but this is a surface level issue impacting the very beginner phase only. Once you get familiar with the vocabulary, the hard stuff will be the actual material, understanding the thing itself, not its description. This differentiates fields that have an actual subject matter from fields that are essentially terminology and categorizations all the way through, a self-referential map with no territory.
In math/physics/engineering the terminology is unfamiliar at first but very precise and hence learnable. The vast majority of STEM textbooks (at least from good US university publishers) make their best effort in presenting the material understandably without any intentional obscurantism or additional fanciness. Academic joirnal/conference papers do sometimes intentionally confuse to game the publication metrics but intro materials are an earnest effort at educating. The subject matter has some inherent complexity, there's no need to prop it up artificially for prestige (that happens more in other fields that are insecure about their level of inherent complexity).
Oh definitely! Eventually you've spent so much time that the thetas and lambdas and Lorentzes and whatever become your close intimate friends. I've most recently experienced this with learning piano and how an ocean of white and black keys all developed these individual "identities." Like, "ah yes, <looks at A4> I won't forget you. We've had so many dramatic moments before, and remember that time I kept showing up at your neighbour's house instead?" ...Okay maybe I'm just weird.
I've always thought that code would be much easier to understand with shorter, less descriptive variable names. Whenever I look at new code most of the confusion involves searching through layers of abstraction for the part that actually does the thing as opposed to the layer upon layer of connections between abstractions which would be much less necessary if the entire behavior could be encoded in a single line. You can only have a small number of descriptive variables in an expression before it becomes entirely unreadable. That is opposed to single character with sub/superscripts where you can easily see what's happening with tens of variables in a single line of math.
Here's a formula for calculating the downstream Mach number in a certain kind of supersonic flow. I cannot imagine any way to write this in "descriptive variables" which makes the formula understandable at all, you just could not see the structure. (from https://en.wikipedia.org/wiki/Oblique_shock )
Kalman filters might be one of those weird cases in mathematics where the 'simple' version is simplified beyond all recognition.
I mean what you're really doing is take a measurement then simulate the possible future states and combine this information with the next measurement and repeat.
You can imagine e.g. taking multiple pictures of a tennis-ball, estimate its position and speed from the first picture, simulate where it's going to end up, and compare this with the next picture to see which estimate is closer to the truth. Or more old school, measure the inclination of the sun and compare the resulting line of possible locations on a map to the spot you thought you were.
Of course the exact calculations are beyond impractical. So you use sampling to simplify. However that still makes it difficult so you assume the distribution is somewhat close to a Gaussian distribution. And then you simplify even more by assuming the evolution of the system is just a linear transformation. And that's how you end up with the Kalman filter discussed here.
I'd be amazed if anyone could really understand what's going on just based on the linear algebra.
I don't know what it is about the Kalman filter but so many explanations including the OP have this format: "It's very simple! <complicated obtuse explanation listing the computation steps>"
Your comment is the first I've seen actually providing intuition about what is happening. It doesn't help perhaps that the name itself is misleading as heck to computer people like me: it's not a filter as in stream processing or SQL.
- simple to predict / understand what it would do: Easily explained with pictures and hardly no math
- simple to understand at a higher level / see why it works: Not easily explained without math and fraught with bayesian vs optimization vs EE-type approaches
- simple to understand well enough to use: Not easily done without other relevant math that is not covered in KF explanation, e.g., controls, matrix analysis, Jacobians, etc
- simple to understand why the equations are named what they are, and why they work: Not easily explained without math and historical context that takes a page or so to explain.
> it's not a filter as in stream processing
As you say "filter" means 'remove noise', but it also means 'process in order of arrival', so it's similar to your def of filter.
I recall reading a very intuitive explanation including animations of a point cloud to show how it works. I've had no luck finding the article again though.
It's simpler than that. The linear algebra is actually easier.
The kalman filter tries to guess the hidden input that produced the measurements. It does so forming the minimization problem:
'minimize over x, the function [ actual_measurement - expected_measurement(x) ]^2/s^2', here 's' is sigma of noise.
This follows from the state estimation problem:
'maximize over x, the likelihood of seeing the actual_measurement', because the only term that matters in the likelihood function is -([x-expected(x)]/s )^2. (look at the exponent in the Normal distribution, or any exponential distribution really).
'actual_measurement' is a constant, so if it happens that the function 'expected_measurement' is linear, this is trivially solved directly as a convex optimization, and if you take derivative, equate to zero, and solve, you'll get the kalman filter update step.
If it so happens that the function is non-linear, well we just make a single netwton-rhapson step by linearizing the equation, minimizing, and returning the solution to the "pretend linearization".
This is basic calc + linear algebra at an undergrad level, but nobody bothered to tell you that.
---
It's also completely wrong. It's a hack from the 60s to maximize the likelihood function using a recursive, single-step linearization like this. A misreading of the Cramer Rao Lower Bound has "proven" to generations of engineers that this is optimal. It's not, not really.[^1]
Nowadays we have 10,000x more compute, and any one of the following _will_ produce better performance:
* Forming and solving the non-linear equation using many newton-rhapson steps
* Keeping a long history of measurements, and solving using many newton-rhapson steps over this batch
* Using sum-of-guassian representation to accomodate multi-modal measurement functions, esp when including the prior bullets
All of these were well covered by state estimation research from 80s to now, but again, the textbooks seem to be written in stone in 1972.
[^1]: (the cramer rao lower bound is only defined when all measurement likelihood functions are linearized at the true state - which is only possible asymptotically in a batch which preserves all the measurements - and not possible before time infinity and not possible with recursive filter)
I don't, sadly. The only coverage of this that was first-principles accessible was the course taught by Prof Stergios Romelioutis where I went to grad school.
You could do fine by reading some old books by Bar-Shalom. Any practical textbook like his would include all the "other stuff" about the EKF that helps you understand how nonperforming it often is.
But the actual derivation of the EKF is probably only one or two pages in such a textbook, which is a damn shame nobody includes it.
The background required is simply:
* Know the form of exponential family of PDFs (like Normal Gaussian)
* Bayes rule
* Recognize that to maximize f~= exp(-a), you have to minimize 'a'
* Know how to take derivative of matrix equation ('a', above)
* Solve it
* Use 'matrix inversion lemma' to transform solution to what KF/EKF provides.
Probabilistic Robotics covers Kalman Filter from a first principles probabilistic viewpoint, and its extension the EKF. It's quite readable for someone with basic understanding of linear algebra, probability, and calculus. I believe it also has a refresher on these basics in the introduction.
It all depends on your background. If you come from a Bayesian background, looking at a Kalman filter as a posterior update mechanism is very intuitive. A distribution over state that gets updated as more information arrives...
Yeah one needs to distinguish between the optimal state sequence and the optimal state at a particular time. In other words, a joint posterior (state trajectory up to now given all measurements up to now) vs a marginal posterior (state right now given all measurements up to now). the KF only gives you a marginal posterior. they are not the same
Both the Bayesian perspective and the optimization perspectice are legitimate ways of understanding the Kalman filter. I like the Bayesian perspective better.
Forgive me, I'm thoroughly confused by that dichotomy. How are they different? Approaching from bayes rule or a "maximum likelihood" approach produces the same results.
The result is identical, the understanding is different. I would suggest that the Bayesian perspective leads to insights like the UKF [1] which IME is all round much better than the apparently better known EKF for approximating non linear systems.
[1] That is, it is generally easier to approximate a distribution than a non linear function.
But IME everyone in the entire world is a "visual learner" who learns best by examples. So I'm surprised that the tutorial midway through the page doesn't put any example numbers into the formulas (maybe I glanced over it?) and the pictures only start after a page of "what is a Kalman filter" text, and the pictures are just of more formulas.
Another comment pointed out variable naming conventions as an obstacle to learning and understanding mathematical topics. I am sympathetic to that perspective, but even more so to this one you post. I am astounded by how common this is. A weaker form of this exists in software libraries that don't include code examples.
One thing that clicked for me is that two uncertain distribution of measurements (a high variance in the distribution) makes a “more certain” measurement (a narrower distribution). Use this more certain measurement and combine it with the next measurement, then rinse and repeat and boom, you have a Kalman filter.
Close your eyes and walk around for a while. Imagine where you are. Now open your eyes. Is your actual location different from where you thought you were?
That last bit, using an observation to update a belief on a state variable, that's what the Kalman filter does.
Simply put, it is an "online" model that which means that it learns on the fly. Specifically, it is the optimal online learning algorithm for linear systems with Gaussian noise.
In a way it is like a primitive RNN, it has internal state, inputs and outputs.
I've used them in robotics and for tracking satellites going overhead via radar. Apparently they're also used by economists for guessing the state of the economy, along with other filters in the standard robotics toolkit.
Ensemble kalman filter (and similar techniques like variational assimilation) are also used heavily in the geosciences to assimilate measurements and model data in order to obtain a "posterior observation" which can be understood intuitively as an interpolation between model and observation weighted by their relative uncertainty (and covariance)
Purely idle curiosity – I've heard a lot about the Kalman filter over the years, it's a popular subject here, but what are the other filters in the standard robotics toolkit?
The Kalman filter has a family of generalizations in the Extended Kalman Filter (EKF) and Unscented Kalman Filter (UKF.)
Also common in robotics applications is the Particle Filter, which uses a Monte Carlo approximation of the uncertainty in the state, rather than enforcing a (Gaussian) distribution, as in the traditional Kalman filter. This can be useful when the mechanics are highly nonlinear and/or your measurement uncertainties are, well, very non-Gaussian. Sebastian Thrun (a CMU robotics professor in the DARPA "Grand Challenge" days of self-driving cars) made an early Udacity course on Particle Filters.
You can also construct multiple hypothesis trackers from multiple Kalman Filters, but there is a little more machinery. For example, Interacting Multiple Models (IMM) trackers may use Kalman Filters or Particle Filters, and a lot of the foundational work by Bar-Shalom and others focuses on Kalman Filters.
Genuine question: why does kalman filter come up so frequently on HN? Is this something I'm missing? I'm a machine learning engineer, not a data scientist.
ML and Kalman filtering try to solve similar problems.
Some theories even suggest that Kalman filtering (or a similar algorithm) provides a basis for neurobiological learning. See predictive coding (e.g. https://arxiv.org/pdf/2102.10021.pdf)
It is considered a difficult topic and people want to show they understand it.
A similar thing happens that had the word quantic or relativistic. I'm a physicist and we hardly talk about it, but here in HN we find people bringing in up every other day
https://github.com/rlabbe/Kalman-and-Bayesian-Filters-in-Pyt...
That book mentions alpha-beta filters as sort of a younger sibling to full-blown Kalman filters. I recently had need of something like this at work, and started doing a bunch of reading. Eventually I realized that alpha-beta filters (and the whole Kalman family) is very focused on predicting the near future, whereas what I really needed was just a way to smooth historical data.
So I started reading in that direction, came across "double exponential smoothing" which seemed perfect for my use-case, and as I went into it I realized... it's just the alpha-beta filter again, but now with different names for all the variables :(
I can't help feeling like this entire neighborhood of math rests on a few common fundamental theories, but because different disciplines arrived at the same systems via different approaches, they end up sounding a little different and the commonality is obscured. Something about power series, Euler's number, gradient descent, filters, feedback systems, general system theory... it feels to me like there's a relatively small kernel of intuitive understanding at the heart of all that stuff, which could end up making glorious sense of a lot of mathematics if I could only grasp it.
Somebody help me out, here!