Is there anything deep here? There's a parametrized representation of a function and it's being interpreted as a neural network; why is this surprising? It's like saying that Newton's second law is a neural network: F = ma can be written as log(F) = log(m) + log(a). Aha, this is a neural network! The inputs are m and a, the first layer is sparsely connected with log activation functions, the second layer is fully connected with an exponential activation function:
F = exp(c1 * o1 + c2 * o2)
o1 = log(c3 * a + c4 * m)
o2 = log(c5 * a + c6 * m)
If you feed it enough data you'll find c1=c2=c3=c6=1 and c4=c5=0.
But saying that Newton's second law is a Neural Network, while correct, seems a bit deceptive in that it's not a deep idea at all.
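If you want to see it happen, here's a toy sketch of the simpler log-linear version (NumPy, synthetic data, purely illustrative):

    import numpy as np

    # Toy sketch (illustrative only): fit log F = w1*log(m) + w2*log(a) by gradient
    # descent on synthetic (m, a, F) data and check that both weights converge to 1,
    # i.e. that the "network" recovers F = m * a.
    rng = np.random.default_rng(0)
    m = rng.uniform(1.0, 10.0, 1000)
    a = rng.uniform(1.0, 10.0, 1000)
    F = m * a

    X = np.stack([np.log(m), np.log(a)], axis=1)  # inputs: log m, log a
    y = np.log(F)                                 # target: log F
    w = np.zeros(2)                               # the two weights to learn

    for _ in range(2000):
        grad = X.T @ (X @ w - y) / len(y)         # gradient of the mean squared error
        w -= 0.1 * grad

    print(w)  # ~ [1.0, 1.0]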
Op here, IMO the "deepest" bit is [1] - the network learns the DFT in order to reconstruct the signal, and is not explicitly trained on the DFT values from the FFT.
Admittedly, I should have mentioned that any linear transform can be considered to be a single layer neural network (if you want to see the world through a neural network lens), and will add this to the post at some point.
In fact, I have a series of posts planned, which will reveal that well known algorithms/models are actually neural networks...
> I should have mentioned that any linear transform can be considered to be a single layer neural network
I think this should be turned around. A single layer neural network can be considered a linear mapping (but not necessarily an orthogonal transform or change of basis, like the DFT).
A clearer example of this is adaptive filters, which are trained in real time using gradient descent.
This is an important distinction because thinking of "X is a Neural Net" doesn't provide meaningful insight, whereas "Neural Nets with X properties are a case of linear dynamic systems, here's an example of how we can equate one linear transformation to a neural net" leads you to deeper conclusions on the analysis and synthesis of ANNs in the context of dynamics - which encompasses a much larger surface area than the DFT.
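For anyone who hasn't seen one, here's roughly what that looks like - an LMS-style sketch in NumPy with a made-up 3-tap filter, purely illustrative:

    import numpy as np

    # Minimal sketch of an adaptive filter: recover the taps of an "unknown" FIR
    # filter from input/output pairs with the LMS update (gradient descent on the
    # instantaneous squared error).
    rng = np.random.default_rng(0)
    true_taps = np.array([0.2, 0.5, 0.2])            # the filter to identify
    x = rng.standard_normal(5000)                    # excitation signal
    d = np.convolve(x, true_taps)[: len(x)]          # desired (observed) output

    w = np.zeros(3)                                  # taps to learn
    mu = 0.01                                        # step size
    for n in range(2, len(x)):
        frame = x[n - 2 : n + 1][::-1]               # [x[n], x[n-1], x[n-2]]
        e = d[n] - w @ frame                         # prediction error
        w += mu * e * frame                          # LMS update

    print(w)  # ~ [0.2, 0.5, 0.2]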
I've sometimes wondered how many signal processing kernels or filter architectures you could learn via SGD given the right data and search strategy. Maybe using something like Neural Architecture Search, so not just the weights / coefficients but the architecture too.
Could it discover an interpolation filter like upfirdn? An IIR Butterworth filter (it would have to be recurrent)? A frequency transfer function derived from two signals?
I imagined that the solutions it found wouldn't be "clean" but have other non-essential operators bloating them. Could it find new architectures that haven't been created from first principles?
I think a big question here is, "what is the goal?" A key distinction is whether the NN is intended to handle the signal directly (the NN is the filter) or if it's just a technique for finding filter coefficients (the NN designs the filter for the signal).
For some tasks, like system identification (used in echo cancellation), the NN is the filter - the aforementioned adaptive filters are used for this case right now. It can also be used for black box modeling, which has numerous applications in real time or otherwise.
For others like Butterworth (and other classic designs) there's not really a good reason to use a NN. Butterworth (and Chebyshev I/II, elliptic, optimum-L, and others) are filter design formulae with a closed form (for a given filter order) that yield roots of transfer functions with desirable properties - I'm not sure how a learning approach can beat formulae that are derived from the properties of the filter they design.
There are iterative design algorithms that do not have a closed form, like Parks-McClellan. It is however quite good at what it does - it would be interesting to compare these methods against some design tricks to reduce filter order.
There are some applications of filter design where NNs can do things we don't have good traditional solutions for, like system identification; another is optimization (in terms of filter order) and iteratively designing stable IIR filters to fit arbitrary frequency curves (for FIR this is a solved problem, at significant extra cost compared to IIR).
As for topologies of the filter itself, that is an interesting angle. Topologies affect quantization effects (if you wanted an FPGA to realize the filter with the fewest bits, topologies matter), and for time-variant filters (like an adaptive IIR) there are drastic differences between the behavior of various topologies. I'm not sure how it would manifest with a NN designing the filter.
I meant learning the topology (I called it architecture) from scratch, or minimally from heuristics. What would I discover? Something that we recognize as an IIR filter? Probably not, but if it did, what would it discover that we didn't recognize over years of academic thought but could have from first principles? To me that's the intriguing angle.
I don't think I follow. These designs do come from first principles. In fact the reason we've had so much academic thought put into it is because 100 years ago we designed systems empirically and a few smart people went back to first principles to find the fundamental constraints on filter networks.
That is how we did create such topologies. But if we had not, could they have been discovered through neural architecture search [1]? NAS has been used to find architectures that outperform those that were hand designed. And in RL, Google used a method inspired by NAS [2] to learn methods built by hand (i.e. TD learning, DQN). I know it's not the same, but if you use a bit of speculative imagination you can translate what has been done to signal processing applications.
Training straight from DCT coefficients to avoid spending time learning a similar representation in the bottom layers of the net.
I've personally toyed with something similar on GANs to gauge the computational benefits of not doing convolutions in the bottom layers of a net but learning directly in a FFT-like compressed space instead.
Isn't there something like Fouriernets and spdnets already, and stuff on Riemannian manifolds and my head is spinning? Followed Daniel Brook's work for some time, e.g. https://hal.archives-ouvertes.fr/hal-02290838
As far as I can tell, you're still using a fixed inverse DFT as the reconstruction layer, so it's not just rediscovering the DFT on its own. Instead of learning a linear transformation from input-output pairs, it's learning the inverse of a linear transformation when that transformation is given as an oracle. It's not terribly surprising that this works, although there are probably some interesting issues of numerical conditioning in the general case.
Actually, "learns" here means to fit reverse linear transformation a.k.a. inverse matrix. He defines reverse FT matrix and using gradient descent numerically converges to inverse matrix. Nothing more than inefficient way to solve system of linear equations.
A neural network can also be set up to unconditionally return a fixed value with no learning feedback; so could many arbitrarily complex arrangements that do massive amounts of work only to discard it and return a constant. I don't think lower bounds on capabilities are very informative. An upper bound on what an approach is capable of is more useful: say, no matter how vast a lookup table is, it will never return a different value for the same input regardless of the prior sequence.
Perhaps it is the compsci glasses talking, but this is just one very specific instance where someone figured out a way to map the DFT problem to a neural net. I agree that it is unfortunate that it is being presented as some kind of big discovery, and that the fundamental lesson is either still undiscovered to the author or just unclearly communicated, but there is still good intention in there (sharing something you've learned and eliciting feedback).
It's a stretch to call it a neural network. It was already well-known that the Fourier transform can be seen as a matrix multiply which minimizes some least squares problem.
This can be found somewhere in S.M. Kay's Fundamentals of Statistical Signal Processing: Estimation Theory (Vol 1), Detection Theory (Vol 2).
I think this is the sort of thing that is very obvious if you are already comfortable with neural networks and the FFT. But if you're only comfortable with neural networks, and the FFT feels like arcane magic, this exercise might be very instructive.
Well, the point is that it does not make sense to DFT data before feeding it into a multilayered neural network (or to precompute the force generated by mass and acceleration, as you point out). Hard-coding those formulas buys nothing: the network can just learn them on the fly.
In fact you'll find that this does not just work for the Fourier transform, but for any FIR filter (and some other classes), and therefore neural networks can deal with signals and construct low-pass filters, high-pass filters, band-stop filters, ... as required for the task at hand without the network (or its designer) having any idea at all what is happening.
I mean, there are some basic assumptions this reasoning makes (the main one being that you need to feed in many discretized values from a time window).
Of course, a problem remains: local optima. Just because a neural network can construct a filterbank or do a DFT doesn't mean that it will actually do it when the situation warrants it. If there's a local optimum without filters ... well, you may get unlucky. If there are many local optima without filters ... sucks to be you.
I don't think I can agree with that. You do a Fourier transform when the data you're working with doesn't easily (or at all) admit certain operations in the time domain but does in the frequency domain. If you already know that to be the case, preprocessing with an FFT is a better idea than hoping a neural network with enough layers uses a few of those to much less efficiently perform a DFT. Always take advantage of pre-existing knowledge of structure in your data. With the FFT especially, depending on how your data is being ingested, you might be able to use specialized DSPs that implement the FFT directly in hardware. These are cheap and easy to find since they're used in frequency-division multiplexing.
Which is weird, because biological neural nets seem to have some evolutionary pressure to do hardware fourier transforms before neurons even get involved.
You can see this most clearly in the auditory system where the incoming signal is transformed into the frequency domain by the cochlea before the signal is received by the epithelial cells.
Neurons absolutely love working in the frequency domain, but they seem to prefer to not be the ones to do the binning in the first place.
I think I understand your sentiment, but consider our tactile sense of vibrations. We are able to distinguish different frequencies with the nerves in the skin, without cochlea doing the frequency domain transform. I think there are very different spectral limits and aliasing artifacts, but it's not like we can only sense a 5 Hz vibration either...
Again, I think that the sensor itself is detecting the frequency.
The actual signal isn't "direct".
IIRC from reading up on haptics - different tactile sensors are sensitive to different ranges of stimulus.
Different cell bodies sense different effects and frequency ranges and send an encoded signal.
We tend to forget that the underlying technology is very different and needs different encoding, signalling and computation approaches.
The main sentiment is that our nervous system is quite slow compared to our electronics and needs to use different design approaches and novel encodings, to get things done.
In much the same way, our electronics can't engage with visible optic phenomena "directly". I mean, visible light frequencies are too high to directly sample and transmit electronically; the sensor used has to deal with that.
Right, I know it isn't direct. The ability to introduce aliasing and fool this perception system is the basis for haptic feedback systems. I was responding more narrowly to the topic of a frequency domain transform as you get from the cochlea. I didn't mean to suggest that there are not other "hardware" signal transforms or feature extraction happening prior to the nervous system signal processing.
As I understand it, the touch sensing structures are more like a zero, first, and second derivatives (deformation/tension, velocity, acceleration) and these coincidentally have different frequency sensitivity curves. My reading is that in some ways this might be more similar to how our visual system integrates different photon detection signals to interpret color. Rather than the clear frequency domain transform of the cochlea, there are broadly overlapping sensitivity curves with only 2-3 peaks.
Firing rates for neurons are on the order of 10 Hz - under the minimum acoustic frequency of about 20 Hz.
Peak neuron firing rates basically only touch the lower bounds of frequencies that need to be processed.
Without the "hardware acceleration" of the cochlea that preprocesses the time domain signals to frequency domain first, basically our whole audio sensorium is out of processing range.
Normal nervous-system signal transmission speeds are on the order of the speed of sound, switching rates are in the Hz range, and voltages are in the tens of millivolts.
Neuron spiking isn't rate encoded, it's temporally encoded, i.e. the distance from the next expected regular interval is the signal, not the spiking interval itself.
Additionally there's large variations in neuron spiking rates.
Yes, it is rate encoded. Essentially the rate of firing for a neuron encodes an exponential value to be represented. This is called "spike trains".
You can see this clearly if you do an extreme slowdown of a human movement. Then, suddenly, what looks like a smooth movement, like raising an arm (and because of inertia it is smoothed of course), isn't really smooth. A pulse arrives in the muscle, and there is 20ms where the muscle is tensioned, and then it's back to neutral for 100ms. Then another spike arrives, another 20ms where a lot of tension is put on the muscle, the movement accelerates, and the muscle goes back to neutral. It's not a continuous movement at all.
But odds are good that it's not just the value that's encoded. Many experiments have shown that it matters a lot whether the signals are in phase (i.e. they encode the same value or some multiple of it, but the fact that the signals started at the exact same time matters, maybe more than the value itself). Or in the encoding: while for the value only the distance between two spikes matters, if two spikes on two different neurons occur at the exact same time, this will be interpreted as very relevant, even though both spikes may come from trains with very different firing rates.
I think the issue here is that almost anything can be represented as a neural network. You could create a neural network that does a xor operation, for example. There is nothing new about this.
The Fourier transformation is a special case of a linear transformation. Linear neural networks can represent any linear transformation. Both can also be expressed as matrix multiplication.
It's not really all that surprising if you know some of the math behind neural networks, matrix algebra, linear transformations or the Fourier transformation.
I don't think it's surprising, but it's definitely a cool little exercise. Somewhat aside from the content of the post, I could see some use cases where specifically implementing DFT layers in your architectures could lead to improved performance over using just the raw inputs or activations. Noisy datasets, or synthetic-data-based transfer learning, come to mind as potential uses for the DFT as a step in your ML pipeline.
Why do people insist on comparing only things that are the same? The same things are the same; there's no point whatsoever in comparing them: they're the same! No, you should only compare things that are different; that way you learn more about their differences.
And whether an umbrella or a toaster is better depends on one's goals: if one would like to fry some bread, a toaster is more useful; if one would like to cover from rain, an umbrella is far superior, although a toaster could be used too; if one would like to drive a nail into the wall, a toaster again is better; etc.
My statement was as approximate as the one I was replying to. If the original statement had been "you can use FFT to implement some operations involving matrix multiplication faster" I wouldn't have taken issue with it. But saying "FFT is faster than matrix multiplication" has little meaning.
The naive implementation of the DFT in terms of a matrix-vector multiplication has time complexity O(n^2) [1]. The point is that the FFT approach is much faster than that implementation.
The Fourier transform, more exactly the DFT, is a special case of matrix multiplication, when seen as a linear operator. The fact that you can do operations faster than the usual O(N^2) matrix-vector operation is due to the specific form of the underlying matrix (a Vandermonde matrix).
So they are not that different: the specific structure of the Fourier basis allows for more efficient matrix * vector operation.
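To make that concrete (a small NumPy sketch, not from the post):

    import numpy as np

    # The N-point DFT matrix is the Vandermonde matrix built from powers of the
    # N-th root of unity; multiplying by it gives exactly the FFT result.
    N = 8
    omega = np.exp(-2j * np.pi / N)
    n = np.arange(N)
    F = omega ** np.outer(n, n)        # F[j, k] = omega^(j*k)

    x = np.random.default_rng(0).standard_normal(N)
    print(np.allclose(F @ x, np.fft.fft(x)))  # True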
They are different things, but you can use both to protect you from the rain, and an umbrella does that more efficiently than a toaster. The drawback is that the umbrella doesn't generalize to the broader field of bagel toasting
I've come to internalize the Fourier transform as a type of sparse matrix algorithm. I.e. something that implements a particular matrix-vector product without requiring explicit construction of said matrix.
Do you mean the FFT? I've been trying to wrap my head around the Fourier transform lately, and I can see the matrix connection to an FFT, but not the Fourier transform in general.
It might not be surprising when they implemented it by supervised learning, but when they also solved it by reconstructing the input it was like a rediscovery.
This is kind of silly. Neural networks are universal function approximators, meaning any function "is" (more accurately, "can be represented by") a neural network. In the DFT case, it is a linear function so we could get an exact representation. Though there is also nothing stopping you from just saying, f(x) is a neural network, because I can choose my activation function to be f.
If silly but interesting things weren't allowed, there would be no point to having a website like this. The article is a neat exploration of two things that many, many people don't realise are related because they simply don't know the maths involved in either of the two subjects.
Sorry, have you seen the internet? There is no requirement of professionalism: write a catchy title, get people to read what you had to say.
HN isn't an academic publishing track, it's a reflection of what the world actually posts on the internet. You're free to not like the titles some people use to get eyeballs on their content, but this was clearly not clickbait warranting removal, and it's certainly not HN's job to police page titles on the wider web: we just upvote the cool content, and if enough people like it, it hits the front page.
So really, you're complaining that not enough people on HN mind catchy titles enough to downvote a post based on its cover, rather than the book it links to =)
I agree the result is not practical, but you asked what is known. There have been some refinements (the citations on that page give a reasonable flavor, fwiw).
That was a fun read. While it might be unsurprising to some, it's a testament to the breadth of applications of modern machine learning frameworks.
I enjoyed this sentence in particular.
> This should look familiar, because it is a neural network layer with no activation function and no bias.
I thought it should look familiar because it's matrix multiplication. That it looks like a neural network layer first and foremost to some is maybe a sign of the times.
> it's a testament to breadth of applications of modern machine learning frameworks.
More like a testament to the breadth of applications of linear algebra. It is absolutely remarkable what we're able to compute and analyze in the form of y = A x (Hilbert spaces are a wild invention).
But it really isn't a testament to modern ML frameworks in any way. The Fourier transform has been easy to compute/fit in this exact way (Fourier = linear problem + solving linear problems by optimization) by modern-at-the-time frameworks for over two centuries.
I once had to port a siamese neural network from Tensorflow to Apple's CoreML to make it run on an iPhone. Siamese neural networks have a cross convolution step which wasn't something CoreML could handle. But CoreML could multiply two layer outputs element-wise.
I implemented it using a Fourier transform (not a fast Fourier transform), with separate re and im parts, since a Fourier transform is just a matrix multiplication, and convolution is element-wise multiplication in the Fourier domain. Unsurprisingly it was very slow.
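For readers who haven't seen the trick, a tiny sketch of the identity being relied on (NumPy, plain circular convolution for simplicity; my actual CoreML code was of course different):

    import numpy as np

    # Sketch of the convolution theorem: circular convolution equals element-wise
    # multiplication of the DFT spectra.
    rng = np.random.default_rng(0)
    a = rng.standard_normal(16)
    b = rng.standard_normal(16)

    # direct circular convolution: (a * b)[n] = sum_m a[m] * b[(n - m) mod 16]
    direct = np.array([sum(a[m] * b[(n - m) % 16] for m in range(16)) for n in range(16)])
    via_fft = np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)).real
    print(np.allclose(direct, via_fft))  # True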
And as a lot of people have mentioned in here, the DFT is pretty much implicit in neural networks already because of the mathematics (especially in convolutional/correlational neural networks, which often make use of the convolution theorem (which is "just" Fourier coefficient multiplication) to do the convolution).
Extending this post it seems more interesting to look more generally at the correspondence with wavelet-transforms.
>especially in convolutional/correlation neural networks, which often make use of the convolution theorem to do the convolution
Is this true? With the learned filters being so much smaller than the input imagery/signals, and with "striding" operations and different boundary conditions being wrapped into these algorithms, it doesn't seem like a natural fit.
I wish your comment were voted up. Where do we go next? Look, addition is a neural network; look, a weighted average is a neural network; look, linear regression, logistic regression, and Poisson regression are neural networks ...
So? Machine learning is not some buzzword fad. Regression is part of ML for sure.
Some people seem to think ML means fancy deep learning GPT-3 transformer running on a TPU farm. Actually ML is a discipline that has existed for several decades and has various theoretical results too, including VC theory etc.
It is also not the same as statistics. They are adjacent but different fields.
Yeah, but no. It definitely is a buzzword that randomly gets attached to every method of data analysis these days. If you disagree, then please give me a definition that is both distinct from statistics but also means that Regression is an ML technique.
BTW, that doesn't mean there isn't also very substantial work done under the umbrella term.
Regression analysis was first introduced by Legendre in the early 19th century. I never would have dreamed to pretend it is Machine Learning 10 years ago when the term ML was still used somewhat more sparingly, and carried a bit more meaning...
But "I will use ML to analyse the data" gets more funding than "I will run a regression on the data".
I won't type up a chapter about this now but ML is about learning from data. It is concerned with training/validation/test data, the generalization error etc.
Statisticians care more about "modeling", actually estimating parameters that stand in for something real, to check assumptions and hypotheses. The cultures are very different. What makes total sense to one may baffle the other as lunacy (there's more convergence now though, by realizing the complementary nature of the two approaches). The terminology is different too: in stats, they call logistic regression regression, while in ML it would have been called classification (but the name is now stuck). ML also tends to be more Bayesian than frequentist, as opposed to classical stats.
I could write more but having taken ML courses before the big hype times, I can assure you ML doesn't just mean hyped up startup fluff vaporware.
Open up Kevin Murphy's ML book and compare the TOC with a Statistics textbook. There is overlap but it's certainly different in its goals and mentality.
It seems like some bitter bickering from the side of stats people that they didn't manage to hype up their field as much. Yeah they did have many of the same tools at their disposal but didn't use them in this way.
I am a physicist by training. I have neither an applied math/stats nor a ML/comp sci background, but I am increasingly using these methods. So I am far from an expert, but I also am not coming at this from some start-up perspective. ML-the-buzzword is now penetrating into all parts of STEM research and research funding. Some of the potential applications are exciting, too.
My conceptualization of the field is simply that ML is the design of universal function approximators to be used as part of statistical modeling/analysis. The key insight seems to be that a complex set of adapted network architectures, together with stochastic gradient descent, is unreasonably effective. Further, the effectiveness is a very non-linear function of size. As far as I can tell there is not much known about why this is the case. But as far as I can tell there really isn't anything done with these models that isn't statistical inference.
You're setting up a strawman of statistics. You're describing a small part of a large field. Stats has been more than what you're describing for longer than ML has been a phrase.
The only real useful definition these days is that stats is learning from data that happens in the stats department and ML is learning from data that happens in the CS department.
I'm fine with that. I would also hope for more collab cross-departments, but the incentives in academia are unfortunately not exactly set up the right way. People become protective of "their turf".
I think it's good if people are aware of the origins of the methods they use. Regression wasn't invented under the umbrella of ML, but is analyzed from a particular angle in ML more than in stats.
The generally accepted definition in the community is Tom Mitchell's:
> A computer program is said to learn from experience E with respect to some task T and performance measure P, if its performance at task T, as measured by P, improves with experience E.
Statistical estimation methods are one way to achieve this, but not the only way, especially when an exact function can be learned, i.e. a typical layer 2 switch is a learning device. You don't program in the mapping from connected device MAC addresses to switch port, the switch itself learns this from receiving a message from a MAC on a port and then records the mapping. That is a very simple form of non-statistical machine learning.
I'm not really sure how you can start here but then say regression is not a form of machine learning. "Regression" is a pretty broad class of techniques that just means using labeled examples to estimate a function with a continuous response, typically contrasted with "classification," where the response is discrete. The method you use to do the function approximation may or may not be statistical. A genetic algorithm is not, for instance. I'm not sure least squares, which is what Legendre invented, should really be considered statistical, either. The original purpose was for approximating the solution to an overdetermined system of equations, before statistics was even formalized. It certainly became a preferred method in mathematical statistics, but mathematical statistics wasn't even formalized until later. It didn't start being called "regression" until Galton published his paper showing children of exceptionally tall or short people tended to regress to the mean, which was 80 years later. But you're performing regression analysis whether you use the normal equations, LU decomposition, QR decomposition, a genetic algorithm, gradient descent, stochastic gradient descent, stochastic gradient descent with dropout. Doesn't matter. As long as you're doing function approximation with a continuous response, it's still regression. Whether or not it can also be considered "machine" learning just depends on whether you're doing it by hand or via machine.
Though sure, typically people tend to imagine the more exotic and newer techniques that scale to very large data sets and reduce overfitting and deal with noise automatically and involve hyperparameters, i.e. not least squares.
I agree ML is a buzzword and using a sentence like that would sound silly but I can elaborate. First, though, regression does not need a separate ML definition to make it distinct from statistics... there's tons of overlap in the field of mathematics.
Machine learning is the study of computer algorithms that improve automatically through experience and by the use of data.
This definition would include something as simple as linear regression.
Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data.
The purpose of a neural network is exactly the same as a line of best fit. That is, you are just approximating an unknown function based on input/output data. The only difference is that a NN can better approximate a nonlinear function.
> ML is a discipline that has existed for several decades
If you're going to say regression is part of ML, you can't say ML has only existed for decades. If you define it with that expansive scope, it's existed for centuries.
ML uses regression in a particular way with particular aims and has developed prior approaches like OLS and Ridge regression, lasso, kriging etc. into a particular theoretical framework and analyses them from the angle of learning, i.e. heavy focus on generalization from training data to test data.
The truth is, these are tools that many communities have used in parallel and these communities are in constant flux regarding what promising directions they find and exploit and when they make a big splash in one place, others take notice and incorporate those ideas etc. There are no rigidly predefined disciplines and fields over long timescales.
When electrical engineers and signal processing communities worked on similar things they called it "pattern recognition".
Machine learning as a field started out with more ambition than mere regression and classification (and indeed covers a lot more ground), but it turns out that supervised learning has had the most practical success. But it's a research program and community that goes beyond that.
Similarly, there are parallels and similar equations between control theory and reinforcement learning. And indeed some controllers can be expressed as reinforcement learning agents. But the aims of the two communities are not the same.
Maybe people would be happier if "statistical learning" (which is also used) was used more instead of "machine learning"? But indeed as another comment points out, ML as a paradigm does not necessarily require learning of a statistical kind.
Labels grow and wane in popularity, it doesn't mean it's the same thing repackaged, rather that the aims and the focus changes depending on what we find most productive and fruitful.
For example many of these things were also called "soft computing" a few years ago, but that term is rarely seen nowadays.
This sounds like in the end you agree that a lot of ML is applied stats?
The problem in my mind is not that ML is using a lot of stats (obviously), it's that foundational mathematical concepts get labelled as ML techniques. This is why the title of the post is so annoying. This totally obscures the structure of the field. E.g. I wouldn't call linear algebra a quantum mechanics technique. I would say that QM uses (and spurred the development of) a lot of LinAlg.
Well, if ML people didn't use the word "regression" and named their use of it differently that would also upset stats people.
The point is, when you listen to an ML person introduce regression in a lecture it will look and feel different from when a stats person does it. ML-type regression is part of ML. Stats type regression doesn't cut it. They care about different aspects, flesh out stuff that's not very relevant for ML and ignore parts that are more important for ML.
Here is the special sauce: “We can consider the discrete Fourier transform (DFT) to be an artificial neural network: it is a single layer network, with no bias, no activation function, and particular values for the weights. The number of output nodes is equal to the number of frequencies we evaluate.”
A single layer neural network is a sum of products, the basic Fourier equation is a sum of products.
In this view there are lots of single layer neural networks out there. For me, it’s the training algorithm (backprop) that sets apart the neural net.
The activation function is what keeps layers separated. When it isn't there, a pair of layers devolves to a matrix multiplication, which can be replaced by its resulting matrix, a single layer.
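Concretely (a quick NumPy sketch, not from the post):

    import numpy as np

    # Without a nonlinearity, stacking linear layers buys nothing, because the
    # composition is itself a single matrix.
    rng = np.random.default_rng(0)
    W1 = rng.standard_normal((8, 8))
    W2 = rng.standard_normal((8, 8))
    x = rng.standard_normal(8)

    two_layers = W2 @ (W1 @ x)
    one_layer = (W2 @ W1) @ x
    print(np.allclose(two_layers, one_layer))  # True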
It seems that backpropagation might be a good abstraction for biological learning rules after all: "local predictive coding converges asymptotically (and in practice rapidly) to exact backprop gradients on arbitrary computation graphs using only local learning rules" from https://arxiv.org/abs/2006.04182
Isn't "backpropagation" just a synonym for the partial derivative of the scalar cost function with respect to the weights? And in the mathematical formulation of biological neural networks these derivatives can't be computed analytically. Your comment sounds like "backpropagation" is some kind of natural phenomenon.
> We can consider the discrete Fourier transform (DFT) to be an artificial neural network: it is a single layer network, with no bias, no activation function, and particular values for the weights.
So this post essentially shows that the Fourier transform is a linear combination.
The choice of activation function isn't entirely clear to me, but I think it's definitely possible to make a network that operates entirely in the frequency domain. It would probably be pretty easy to start experimenting with such a thing with the nice complex number and FFT support in PyTorch 1.8. :)
Like you said, there's already a significant connection between convolutional networks and the Fourier domain (the convolution theorem).
Tangentially, I've recently worked on a project that focused on implementing convolution in the Fourier domain, and how that allows one to control useful properties of convolutions (like "smoothness" and orthogonality).
I don't see why processing JPEG coefficients directly as input features shouldn't have similar accuracy to a much larger NN using raw pixel data.
In particular I'm interested in the efficiency gains from avoiding convolutions and the possibility of running a compressed frequency-domain NN on CPUs.
I have trained GANs on raw JPEG coefficients with moderate success as a pet project. Read the raw DCT coefficients without decompressing and train in this space; JPEG-decompress the output of the net to reconstruct an image in pixel space. There are a few papers doing similar things for supervised learning tasks, IIRC.
Yeah it kinda works when you feed JPEG coefficients into a typical time-domain CNN, but mathematically it seems that if you're using frequencies as inputs, your convolutions should become simple multiplications. Am I wrong?
Yep, you can successfully do it without convolutions. Here is a pointer if you want to dig deeper: https://eng.uber.com/neural-networks-jpeg/ (there's prior art to that, but this one is well written)
"...multiplying a vector by the matrix returned by dft is mathematically equivalent to (but much less efficient than) the calculation performed by scipy.fft.fft."
fft-as-a-matrix-multiply is much slower to compute than a standard fft, especially on large input.
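A quick way to see both the equivalence and the gap (rough NumPy/SciPy sketch; exact timings will vary by machine):

    import numpy as np
    from scipy.linalg import dft
    from scipy.fft import fft
    from timeit import timeit

    # The dense matrix-vector product gives the same answer as the FFT, but it
    # scales as O(N^2) vs O(N log N).
    N = 4096
    A = dft(N)                                   # dense N x N DFT matrix
    x = np.random.default_rng(0).standard_normal(N)

    print(np.allclose(A @ x, fft(x)))            # True: identical result
    print(timeit(lambda: A @ x, number=100))     # dense matrix multiply
    print(timeit(lambda: fft(x), number=100))    # FFT, much faster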
You can do the linear parts of neural nets in the frequency domain, but AFAIK you can't do the nonlinearity, so you have to inverse transform back to the spatial domain for that. The nonlinearity is an absolutely essential part of pretty much every neural net layer, so there is no big win to be had unfortunately. For convolutional nets in particular there are other ways of going faster than a naive matrix multiply, e.g. winograd convolution.
But if your network contains a layer whose linear part performs an (approximate) DFT, you will get an efficiency gain by replacing it with an exact FFT.
You wouldn't want to use an FFT for most CNNs anyway because the kernels have very small support. Convolution with them is O(n) in the spatial domain as long as you recognize the sparsity.
I have always thought of it the other way around, neural networks are a kind of transform similar to the Fourier transform. But you know, instead of a sum of sinusoid components you get a sum of weighted <insert your activation function>.
This could be highly interesting. Especially when starting with the FFT butterfly structure for connections and the corresponding weights as network initialization and then adapting that to real data via training.
This is a great idea, I bet one could find a bit-faster fftw that would be close enough for many applications, but it would apply a statistical approach to each weight.
In speech recognition, we usually use log Mel features, or MFCC features, which do a Fourier transformation on the raw audio frames.
You can also train neural networks directly on the raw audio. When you do so, you can inspect the weights of the first layer (a convolutional layer), and you see that it pretty much learned the short-time Fourier transform.
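For context, the usual front end looks something like this (a rough sketch assuming librosa; the sine wave is just a stand-in for real speech, and the parameters are generic defaults rather than anything from a specific system):

    import numpy as np
    import librosa

    # Typical speech front end: waveform -> log-mel spectrogram / MFCCs,
    # both of which start from a short-time Fourier transform.
    sr = 16000
    t = np.arange(sr) / sr
    y = np.sin(2 * np.pi * 440 * t).astype(np.float32)      # stand-in for real speech

    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160)
    log_mel = librosa.power_to_db(mel)                       # log Mel features
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)       # MFCC features
    print(log_mel.shape, mfcc.shape)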
> [with regards to y = A x:] This should look familiar, because it is a neural network layer with no activation function and no bias.
This line makes me think this might be satire. Are there really people who see y = A x and think neural networks?
Otherwise, a better title would be "The Fourier transform is a linear operation. (neural networks can do linear operations too)" ... which makes it pretty boring.
Ummmm... sure. A layer of a neural network is a linear operator followed by an arbitrary (nonlinear in practice) mapping. Which means that neural networks can trivially implement linear operators (like the Fourier transform) by making the second mapping to be identity. QED
While it's all very posh to scoff while saying "well, obviously!", the author has explained it, demonstrated it, documented it. It's not about proving something new, it's about sharing something neat with the world. It's also well written and emotes a bit of a fun in the telling.
I'm reminded of xkcd's "Ten Thousand" principle[0]: (Paraphrased) "For each thing that 'Everybody knows', every day there are, on average, 10,000 people in the US learning it for the first time". Good writing makes the experience better for those learning it for the first time today.
But it explains the wrong thing. It's a fundamentally ridiculous framing. A basic neural network layer is a linear transformation + constant offset + activation function. Linear transformations are the elementary building block. And Fourier Transform is a quintessential example of a linear transform. But the article doesn't actually discuss that at all. If you don't understand this how will you ever deal with more complicated NN architectures?
The problem isn't an "everybody knows" kind of problem. People (myself included) are rolling our eyes because it's the most rube goldbergian way to say it.
It's analogous to saying that subtraction is a neural network. Addition and negation are core elements of modern state of the art neural networks, and you can express subtraction as a combination of those two modern neural network techniques. Therefore subtraction is a neural network.
The line in the article about the formula y = A x is illustrative:
> This [y = A x] should look familiar, because it is a neural network layer with no activation function and no bias.
Imagine that as:
> This [y = a + b] should look familiar, because it is a neural network layer with no activation function and no bias and no multiplication. Meanwhile, this [y = -x] should also look familiar, because it is a neural network layer with no activation function, no bias, no addition, and only negation.
You'd expect that kind of explanation of subtraction if someone had never learned about addition and negation outside of neural networks, but you'd roll your eyes and say it's just such a bizarre way to talk about addition, negation, subtraction, and neural networks.
I have zero doubt that given sufficient resources ANY function can be learned by a neural network.
Why not just initialize part of the layers in a network to do the FFT or DFT when you're setting up a system to learn, so that it doesn't have to waste megawatt-hours of power? Consider it a form of discretized / explicit transfer learning.
Why not use the fact that you could just set the coefficients in a neural network to do DFTs or FFTs to utilize neural network chips as ersatz DSP chips, without learning?
Just like when GPUs turned out to be useful for neural networks, the TPUs will turn out to be useful for DSP and other things. These neural network chips showing up can do a lot of cool things unrelated to AI, if we use them properly.
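For the "set the coefficients" idea, a rough sketch of what that weight matrix could look like (NumPy; stacking real and imaginary parts into one real-valued matrix is just one way to keep everything real, not necessarily what any particular chip or framework expects):

    import numpy as np

    # Build a real-valued weight matrix whose top half produces the real part of
    # the DFT and bottom half the imaginary part; use it as a fixed (or trainable)
    # first layer.
    N = 64
    k, n = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    F = np.exp(-2j * np.pi * k * n / N)
    W = np.concatenate([F.real, F.imag], axis=0)    # shape (2N, N), real-valued

    x = np.random.default_rng(0).standard_normal(N)
    out = W @ x                                     # first N rows: Re(DFT), last N: Im(DFT)
    print(np.allclose(out[:N] + 1j * out[N:], np.fft.fft(x)))  # True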
I don't see anything here beyond "neural networks generalize linear operators." (But of course they're more interesting for nonlinear applications)
I think neutral nets are such a generalized thing that tons of stuff can be shown to map to them. I saw a talk once about evolution being equivalent to neural networks: consider a genome as a set of neuron weights, and your iterative refinement as new generations in evolution. Maybe there's something deep there, but maybe if you make your model arbitrarily complicated enough you can make it look like any system you want. Which is the whole idea with neural nets.
of course (it's a single fully-connected layer, aka a linear map), but more interestingly, the FFT algorithm can be implemented as a very sparse network of logarithmic depth.
Anything can be expressed by a neural network to some arbitrary degree of precision. That's the whole point. The hard part isn't expressing the thing but getting the machine to figure out how to express it correctly on its own. There's also no way to prove your learning algorithm actually works outside of toy problems like this, which is why ML is really only good for data mining rather than steering a car.
If the Fourier transform is written as two real parts, the universal approximation theorem says it can be used to approximate any function, the same way NNs do.
Not to be too glib, but as a non-ML person with decent numerics ability:
... did this post just spend a ton of boring math explaining that a neural network has the ability to detect whether an image is bumpy or smooth?
I mean, Fourier analysis has a very straightforward "intuitive" interpretation. That's the point of its value in applied domains, after all. Straightforward intuitive things are "neural" by their nature, no?
Of course. What about the Haar transform? Or digital filters? It's very much all about the same thing.
You compare the original signal with some reference signals, look at how much it correlates. Then you save those correlations only. With the correlations and the reference signals, you can roughly reproduce the original signal. (Extremely paraphrased.)
The Winograd transform can be seen as a depthwise convolution with fixed weights.
And similarly to the DFT, the Winograd transform can be used to speed up convolutions: a 3x3 convolution can be implemented with a Winograd transform + a 1x1 convolution + an inverse Winograd transform. This basically reduces the number of operations performed by a factor of about 2.
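For reference, here's a minimal 1-D F(2,3) Winograd sketch (NumPy, made-up filter values, purely illustrative):

    import numpy as np

    # F(2,3) Winograd minimal filtering: two outputs of a 3-tap correlation
    # computed with 4 multiplies instead of 6, via fixed input/filter/output transforms.
    BT = np.array([[1,  0, -1,  0],
                   [0,  1,  1,  0],
                   [0, -1,  1,  0],
                   [0,  1,  0, -1]], dtype=float)
    G = np.array([[1,    0,    0],
                  [0.5,  0.5,  0.5],
                  [0.5, -0.5,  0.5],
                  [0,    0,    1]], dtype=float)
    AT = np.array([[1, 1,  1,  0],
                   [0, 1, -1, -1]], dtype=float)

    d = np.array([1.0, 2.0, 3.0, 4.0])   # 4 input samples
    g = np.array([0.25, 0.5, 0.25])      # 3-tap filter

    winograd = AT @ ((G @ g) * (BT @ d))         # element-wise product in "Winograd space"
    direct = np.array([d[0:3] @ g, d[1:4] @ g])  # plain sliding-window correlation
    print(np.allclose(winograd, direct))          # True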
Now, if a neural network is able to learn the weights of a depthwise convolution to match the coefficients of a Winograd transform, that means you should be able to do this transformation when describing the architecture of your network. That way, you don't rely on your deep learning framework to do the transformation for you.
And more importantly you might end up with something more generic: perhaps the weights in the depthwise convolutions will not converge to the weights of a Winograd transform, but to weights better suited to what you are trying to learn.
This is the same idea that led to the development of CNNs: replace the fixed weights of convolutions used in traditional computer vision algorithms with weights that are learned by the network.
I guess it is mostly an illustration that the time-domain => frequency-domain transformation is trivial for ANNs. E.g. it won't make much sense to pre-process your signal with an FFT before training an ANN, since it will do it itself if needed.
That's a fair point. My assumption was that the neural network isn't going to be as efficient at the FFT but the article doesn't include that type of assessment.
NN is doing matrix multiplications anyway. That's what it does, basically.
The property of matrix multiplications is that they are composable, i.e. `X * (Y * z) = (X * Y) * z`, so in the end you only need one matrix.
So what this means in practice is that you get the Fourier transform for free. The NN is doing a matrix multiplication anyway, and the discrete Fourier transform is a matrix multiplication, so it can simply fold the DFT together with whatever other transform it is doing - it doesn't cost anything.
Oh, then this is just approximating the values of the linear transform. I figured at least they'd use an MLP to learn each layer of the FFT butterfly network and the corresponding weights. What would be of interest is whether you could extend this to length-n FFTs using an RNN.