At the risk of being pedantic... if you're not familiar with data/math work in Python, the `np` name refers to "numpy", an extension to Python that provides array and matrix math... so the OP really needs 12 lines, the first being `import numpy as np`.
No, definitely not pedantic. In fact, I have only just now realized that "numpy" means "NumPy," instead of being a nonsense word that rhymes with "lumpy." Sigh.
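For reference, here is a minimal sketch of the kind of network being discussed (written in the spirit of the OP's two-layer numpy example, not a verbatim copy; the 10000-iteration count and the implicit step size of 1 are assumptions):

import numpy as np

X = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]])  # input dataset: 4 samples, 3 features
y = np.array([[0,0,1,1]]).T                      # target outputs
syn0 = 2 * np.random.random((3,1)) - 1           # weights initialized with mean 0

for _ in range(10000):
    l1 = 1 / (1 + np.exp(-np.dot(X, syn0)))      # sigmoid forward pass
    l1_delta = (y - l1) * l1 * (1 - l1)          # error times sigmoid slope
    syn0 += np.dot(X.T, l1_delta)                # weight update

print(l1)  # outputs should approach y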
Just for comparison, this is almost the same thing using a library [1]:
import numpy as np

# same toy dataset as the numpy-only version: 4 samples, 3 input features
X = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]])
y = np.array([[0,0,1,1]]).T

# note: old Keras 0.x-style API, i.e. Dense(input_dim, output_dim, ...) and nb_epoch
from keras.models import Sequential
from keras.layers.core import Dense

# a single sigmoid unit: 3 inputs -> 1 output
model = Sequential([Dense(3, 1, init='uniform', activation='sigmoid')])
model.compile(loss='mean_absolute_error', optimizer='sgd')
model.fit(X, y, nb_epoch=10000)  # 10000 training passes
model.predict(X)                 # predictions should approach y
Thanks for this. It is really well explained. However, it did feel like it went downhill at the part where you explain how the updating happens; it was less of a walkthrough than the previous sections.
Also, lines 22 and 23 in the three-layer network are not intuitive for a beginner. Before, syn0 was (3,1) and now it's (3,4), and the new syn1 is (4,1) (see the shape sketch below). Your previous explanation for this initialization was much better.
But again, awesome work, just being nitpicky. Really appreciate it!
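Since the shape question comes up above, here is a small sketch of why those initializations look the way they do, assuming the three-layer example uses 3 input features, a 4-unit hidden layer, and 1 output unit (the layout the (3,4) and (4,1) shapes imply):

import numpy as np

n_in, n_hidden, n_out = 3, 4, 1

# syn0 connects the 3 inputs to the 4 hidden units, hence shape (3, 4);
# syn1 connects the 4 hidden units to the single output, hence shape (4, 1).
syn0 = 2 * np.random.random((n_in, n_hidden)) - 1
syn1 = 2 * np.random.random((n_hidden, n_out)) - 1

X = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]])
l1 = 1 / (1 + np.exp(-X.dot(syn0)))   # hidden activations, shape (4 samples, 4 units)
l2 = 1 / (1 + np.exp(-l1.dot(syn1)))  # output, shape (4 samples, 1 unit)
print(l1.shape, l2.shape)             # (4, 4) (4, 1)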
Great work! I have been reading a practical guide to neural networks [1], and I am amazed that it is so comprehensible. I had assumed ML would be a lot more complicated. I won't call it an easy craft, but (maybe) it isn't as hard as it looks.
I wonder if it'd work better with a leaky ReLU. Somebody wanna try? It's a _very_ simple mechanism. [1] Basically, instead of a sigmoid, take the pre-activation a = z·w for inputs z and weights w, and output f(a) = a if a > 0 else 0.01·a. Sigmoid activations have several problems; most notably, they suffer from vanishing gradients in bigger networks.
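A sketch of that experiment, swapping the sigmoid for a leaky ReLU in a tiny numpy network like the OP's two-layer one (the 0.01 negative slope is from the comment above; the learning rate, seed, and iteration count are assumptions):

import numpy as np

def leaky_relu(a):
    return np.where(a > 0, a, 0.01 * a)

def leaky_relu_deriv(a):
    return np.where(a > 0, 1.0, 0.01)

X = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]], dtype=float)
y = np.array([[0,0,1,1]], dtype=float).T

np.random.seed(1)
syn0 = 2 * np.random.random((3,1)) - 1   # same shape as the two-layer sigmoid version

lr = 0.1
for _ in range(10000):
    a = X.dot(syn0)                            # pre-activation z·w
    l1 = leaky_relu(a)                         # leaky ReLU instead of sigmoid
    l1_delta = (y - l1) * leaky_relu_deriv(a)  # error times activation slope
    syn0 += lr * X.T.dot(l1_delta)             # squared-error gradient step

print(leaky_relu(X.dot(syn0)))                 # should end up close to y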
Why do you say it is an excellent course? I started it, and so far I've found the professor uninspiring and the lectures lacking in preparation and detail.
Interesting fact about the sigmoid function: If you scale it so its slope at the origin is the same as that of the cumulative error function, it differs from that function by at most about 0.017.
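A quick numeric check of that claim, assuming "cumulative error function" means the standard normal CDF Phi(x) = (1 + erf(x/sqrt(2)))/2 (using (1 + erf(x))/2 instead only rescales the x-axis, so the bound comes out the same):

import numpy as np
from math import erf, sqrt, pi

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gauss_cdf(x):
    return np.array([(1.0 + erf(v / sqrt(2.0))) / 2.0 for v in x])

# Phi'(0) = 1/sqrt(2*pi) and sigmoid'(0) = 1/4, so matching slopes at the origin
# means scaling the sigmoid's input by k = 4/sqrt(2*pi), about 1.596.
k = 4.0 / sqrt(2.0 * pi)
x = np.linspace(-8.0, 8.0, 160001)
gap = np.abs(sigmoid(k * x) - gauss_cdf(x))
print("max difference:", gap.max())   # comes out around 0.0177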
Well, I personally presumed that neural networks would require two or three orders of magnitude more code.
When it's revealed that the code is actually much more manageable, individuals (such as myself) take the plunge into the subject, knowing that it is something we can at least get through.