I've only skimmed it, but the gist seems to be that on MNIST the FF algorithm comes within about 30% of classic backpropagation. In my quick reading I didn't quite follow how the network can generate its own negative data, but that seems like something future research would focus on. Section 8 seemed to appear out of the blue; it talks about alternative hardware models for machine learning.
I was at the NeurIPS talk yesterday where he introduced this paper, and it made a bit more sense in the context of the entire work.
His argument is basically that this method is more amenable to analog implementation, and he believes we will eventually throw away the separation of hardware and software that has defined digital computation to date. Currently, we expect any piece of hardware to run the same software the same way every time, but to make things more efficient he wants to "grow" the hardware and software together, to the point that a piece of software cannot run on a different device. The program has instead been learned to run specifically on that hardware and that hardware alone.
He also makes a point that this won't displace digital computers, and instead will be another class of computation hardware.
TLDR: during training, backpropagation means we need to send data two ways: input data from input => output, and error data (wanted_result - output) from output => input, updating the weights of the network so it works better.
This is the main source of performance bottlenecks and also an obvious difference between natural and artificial neural networks. The brain does not even seem to have error signals, and we also can't seem to find any signal going in the opposite direction of propagation. Backpropagation requires an error signal ... and that means it requires knowing the right answer to the problem the algorithm is trying to solve. Also quite important to some companies: sending data in two directions through the network places serious limitations on parallelizing neural network training. It is one of the big reasons that Google/Facebook/MS(OpenAI)/... only seem to have a head start of a year or less over the rest of the industry, despite billions of investment.
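To make the "two ways" concrete, here is a minimal NumPy sketch of plain backpropagation on a tiny two-layer net. Everything in it (the layer sizes, the tanh hidden layer, the squared-error loss, the names `forward` and `backward`) is my own assumption for illustration, not something from the paper. Note that the code uses `output - target` and subtracts the gradient, which is the same thing as adding a step along `wanted_result - output`.

```python
import numpy as np

def forward(x, W1, W2):
    """Forward pass: data flows input => output."""
    h = np.tanh(x @ W1)
    y = h @ W2
    return h, y

def backward(x, h, y, target, W2):
    """Backward pass: the error flows output => input."""
    err = y - target                   # error at the output layer
    dW2 = h.T @ err                    # gradient for the last layer
    dh = (err @ W2.T) * (1 - h ** 2)   # error sent back through W2 and tanh
    dW1 = x.T @ dh                     # gradient for the first layer
    return dW1, dW2

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))            # 5 toy input vectors
target = rng.normal(size=(5, 2))       # 5 toy targets
W1 = rng.normal(size=(3, 4)) * 0.5
W2 = rng.normal(size=(4, 2)) * 0.5

def loss(W1, W2):
    _, y = forward(x, W1, W2)
    return 0.5 * np.sum((y - target) ** 2)

before = loss(W1, W2)
for _ in range(200):
    h, y = forward(x, W1, W2)                  # first direction
    dW1, dW2 = backward(x, h, y, target, W2)   # second direction
    W1 -= 0.01 * dW1
    W2 -= 0.01 * dW2
after = loss(W1, W2)
print(before, after)
```

The point to notice is that `backward` needs `W2` and the stored `h` to update `W1`: the first layer cannot learn until the error has travelled back from the output.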
Forward-Forward tries to do online learning by training the network to differentiate between real and fake-but-really-realistic signals, with data going in the same direction every time.
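On where the fake signals come from: one simple option is to supply them externally. As far as I can tell, the paper's supervised MNIST experiments do this by writing a one-hot label into the first few input entries, so that a positive example carries the correct label and a negative example carries a wrong one. A rough sketch under that assumption (the helper names `embed_label` and `make_pair` and the sizes are made up for illustration):

```python
import numpy as np

def embed_label(x, label, num_classes):
    """Overwrite the first num_classes entries of x with a one-hot label."""
    out = x.astype(float).copy()
    out[:num_classes] = 0.0
    out[label] = 1.0
    return out

def make_pair(x, true_label, num_classes, rng):
    """Return (positive, negative) versions of one input vector."""
    wrong_labels = [c for c in range(num_classes) if c != true_label]
    neg_label = rng.choice(wrong_labels)   # any label except the true one
    positive = embed_label(x, true_label, num_classes)
    negative = embed_label(x, neg_label, num_classes)
    return positive, negative

rng = np.random.default_rng(0)
x = rng.random(16)                         # stand-in for a flattened image
pos, neg = make_pair(x, true_label=2, num_classes=4, rng=rng)
print(pos[:4], neg[:4])
```

The two vectors differ only in the label slots, so the network has to relate the label to the rest of the input to tell them apart. This does not cover the case where the network generates its own negative data.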
What is being proposed appears to be the Hebbian learning rule, yet the paper does not even mention it or the Hebb network. Why? By the way, the Hebbian learning rule was proposed in 1949 and is one of the pioneering works that demonstrated neuron-based models are worth investigating.
Is this really just Hebbian learning? That's often stated as "cells that fire together wire together", and I can see how that can line up with FF increasing the 'goodness' of actual data. But decreasing the 'goodness' of negative data seems like a qualitatively different mechanism.
What you are saying is a half-truth. If a cell excites (or inhibits) another cell repeatedly, its ability to excite (or inhibit) that neighboring cell improves over time.
"Fire together" originally meant fire in the same direction, but Hebb's rule was extended quite early to mean that if some cells inhibit some others too often, wiring in that direction will get stronger.
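For what it's worth, the two views can be compared directly on one layer. Assume goodness is the sum of squared ReLU activities and the probability of "positive" is the logistic function of goodness minus a threshold. Then gradient ascent on the log-probability gives a weight change proportional to pre-activity times post-activity, scaled by (1 - p) for positive data and by -p for negative data. So the positive pass looks Hebbian and the negative pass looks anti-Hebbian. A small sketch (the function names, weights, and learning rate are my own choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_goodness(x, W):
    """Sum of squared ReLU activities for one input vector."""
    return np.sum(np.maximum(0.0, x @ W) ** 2)

def ff_layer_update(x, W, theta, positive, lr=0.1):
    """One gradient step for a single ReLU layer on one input vector x."""
    a = np.maximum(0.0, x @ W)             # post-synaptic activity
    p = sigmoid(np.sum(a ** 2) - theta)    # P(input is positive)
    hebb = np.outer(x, a)                  # pre * post: the Hebbian term
    # Ascend log p for positive data, log(1 - p) for negative data.
    scale = (1.0 - p) if positive else -p
    return W + lr * 2.0 * scale * hebb

W = np.array([[0.5, -0.2, 0.1, 0.3],
              [0.1, 0.3, -0.4, 0.2],
              [0.2, 0.4, -0.3, 0.1]])
x = np.array([1.0, 0.0, 1.0])

g0 = layer_goodness(x, W)
g_pos = layer_goodness(x, ff_layer_update(x, W, theta=1.0, positive=True))
g_neg = layer_goodness(x, ff_layer_update(x, W, theta=1.0, positive=False))
print(g0, g_pos, g_neg)
```

The Hebbian term `np.outer(x, a)` is identical in both cases; only the sign and size of the scale factor differ, and that factor depends on the layer's current goodness, which a plain Hebbian rule does not have.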
# Using the Forward-Forward algorithm to train a neural network to classify positive and negative data.
# The positive data is real data and the negative data is generated by the network itself.
# The network is trained to have high goodness for positive data and low goodness for negative data.
# The goodness is measured by the sum of the squared activities in a layer.
# The network is trained to correctly classify input vectors as positive data or negative data.
# The probability that an input vector is positive is given by applying the logistic function, σ, to the goodness minus some threshold, θ.
# The negative data may be predicted by the neural net using top-down connections, or it may be supplied externally.
import numpy as np

# Define the activation function (ReLU) and its derivative
def activation(x):
    return np.maximum(0, x)

def activation_derivative(x):
    return 1. * (x > 0)

# Define the goodness function: the sum of the squared activities in a layer,
# computed per input vector (one value per row)
def goodness(x):
    return np.sum(x**2, axis=1)

# Define the forward pass for the positive data
def forward_pass_positive(X, W1, W2):
    a1 = activation(np.dot(X, W1))
    a2 = activation(np.dot(a1, W2))
    return a1, a2

# Define the forward pass for the negative data (same computation, different input)
def forward_pass_negative(X, W1, W2):
    a1 = activation(np.dot(X, W1))
    a2 = activation(np.dot(a1, W2))
    return a1, a2

# Define the learning rate
learning_rate = 0.01
# Define the threshold for the goodness
theta = 0.1
# Define the number of epochs
epochs = 100

# Generate the positive data
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
# Generate the negative data
Xn = np.array([[0, 0, 0],
               [0, 1, 0],
               [1, 0, 0],
               [1, 1, 0]])

# Initialize the weights (the hidden sizes and the random scale are arbitrary choices)
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(3, 4))
W2 = rng.normal(scale=0.5, size=(4, 4))

# Perform the positive and negative passes for each epoch
for j in range(epochs):
    # Forward pass for the positive data
    a1, a2 = forward_pass_positive(X, W1, W2)
    # Forward pass for the negative data
    a1n, a2n = forward_pass_negative(Xn, W1, W2)
    # Calculate the goodness for the positive data
    g1 = goodness(a1)
    g2 = goodness(a2)
    # Calculate the goodness for the negative data
    g1n = goodness(a1n)
    g2n = goodness(a2n)
    # Calculate the probability that each positive input vector is positive data
    p1 = 1/(1 + np.exp(-(g1 - theta)))
    p2 = 1/(1 + np.exp(-(g2 - theta)))
    # Calculate the probability that each negative input vector is positive data
    p1n = 1/(1 + np.exp(-(g1n - theta)))
    p2n = 1/(1 + np.exp(-(g2n - theta)))
    # Calculate the error for the positive data: d(-log p)/d(goodness) = p - 1
    error2 = p2 - 1
    error1 = p1 - 1
    # Calculate the error for the negative data: d(-log(1 - p))/d(goodness) = p
    error2n = p2n - 0
    error1n = p1n - 0
    # Calculate the delta for the positive data: chain through the goodness (2 * a) and the ReLU
    delta2 = error2[:, None] * 2 * a2 * activation_derivative(a2)
    delta1 = error1[:, None] * 2 * a1 * activation_derivative(a1)
    # Calculate the delta for the negative data
    delta2n = error2n[:, None] * 2 * a2n * activation_derivative(a2n)
    delta1n = error1n[:, None] * 2 * a1n * activation_derivative(a1n)
    # Calculate the change in the weights for the positive data.
    # Each layer uses only its own goodness; no error is sent from layer 2 back to layer 1.
    # (As far as I can tell the paper also normalizes a1 before layer 2; that is left out here.)
    dW2 = learning_rate * a1.T.dot(delta2)
    dW1 = learning_rate * X.T.dot(delta1)
    # Calculate the change in the weights for the negative data
    dW2n = learning_rate * a1n.T.dot(delta2n)
    dW1n = learning_rate * Xn.T.dot(delta1n)
    # Update the weights for the positive data (gradient descent, so subtract)
    W2 -= dW2
    W1 -= dW1
    # Update the weights for the negative data
    W2 -= dW2n
    W1 -= dW1n

# Show the probabilities from the last epoch, per layer
print("P(positive) for positive data:", p1, p2)
print("P(positive) for negative data:", p1n, p2n)
I haven't checked this against the paper's results, and the update rule may still be off, so I'll look again tomorrow. (Still reading the paper.) Thanks for that; debugging a piece of code is educational!