Nice. Recreating these methods in simple code for yourself is definitely the way to check you understand them. This demo looks nice, clean, and straightforward. (Although I'd rename or comment the variables x and y, or give some guidance within the code itself on which way around the weight matrices are.)

It's also worth checking out existing neural net code-bases to see what tricks they have. The fine details usually aren't in papers, and they're not all in the text-books either.

The first potential problem that jumped out at me in this code was the initialization:

    self.weights = [np.array([0])] + [np.random.randn(y, x)
            for y, x in zip(sizes[1:], sizes[:-1])]

If the number of units in a layer is H, the pre-activation feeding into each unit of the layer above will typically be of size around √H (unit-variance weights, summed over H roughly unit-scale activations). For large H, the sigmoid will usually saturate, and the gradients will underflow to zero, making it impossible to learn anything. There are some tricks to avoid the numerical problems, but even if you avoid numerical underflow, things probably aren't going to work well.
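To see the saturation concretely, here's a tiny illustration (not from the demo; H, a and W are made-up names) of what happens when H unit-scale activations go through unscaled Gaussian weights:

    import numpy as np

    H = 1000                          # units in the lower layer (illustrative)
    a = np.random.randn(H)            # roughly unit-scale activations
    W = np.random.randn(1, H)         # unscaled weights, as in the snippet above
    z = W @ a                         # pre-activation, std around sqrt(H) ~ 32
    s = 1.0 / (1.0 + np.exp(-z))      # sigmoid pinned near 0 or 1
    print(s * (1 - s))                # local gradient, typically vanishingly small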

I'd multiply those initial weights by a small constant divided by the square root of the number of weights going into the same neuron (the fan-in). For multiple layers you might consider layer-by-layer pre-training. For other architectures, like recurrent nets, definitely find a reference on how to do the initialization.
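For the feed-forward case that amounts to something like this sketch (same `sizes` list as the demo; the 1/sqrt(fan-in) scaling is the only change; `init_weights` and `scale` are names I've made up):

    import numpy as np

    def init_weights(sizes, scale=1.0):
        # Divide each weight matrix by sqrt(fan-in) so the pre-activation
        # into each unit stays O(1) regardless of layer width, keeping the
        # sigmoids out of their flat, saturated regions at the start.
        return [np.array([0])] + [scale * np.random.randn(y, x) / np.sqrt(x)
                                  for y, x in zip(sizes[1:], sizes[:-1])]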

PS I would definitely add a test routine to check that the gradients from back-propagation agree with a finite difference approximation. It's so easy to get gradient code wrong, and it's so easy to test.
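Something along these lines is usually enough (a rough sketch; loss_fn and grad_fn are placeholders for whatever your network exposes, e.g. the cost on a fixed mini-batch and the back-prop gradients for that same batch):

    import numpy as np

    def check_gradients(loss_fn, grad_fn, weights, eps=1e-5, tol=1e-6):
        # Compare analytic gradients against central finite differences.
        # loss_fn(weights) -> scalar; grad_fn(weights) -> list of arrays
        # shaped like `weights`.
        analytic = grad_fn(weights)
        for W, dW in zip(weights, analytic):
            it = np.nditer(W, flags=['multi_index'])
            while not it.finished:
                idx = it.multi_index
                orig = W[idx]
                W[idx] = orig + eps
                loss_plus = loss_fn(weights)
                W[idx] = orig - eps
                loss_minus = loss_fn(weights)
                W[idx] = orig                 # restore the weight
                numeric = (loss_plus - loss_minus) / (2 * eps)
                if abs(numeric - dW[idx]) > tol * max(1.0, abs(numeric)):
                    raise AssertionError((idx, numeric, dW[idx]))
                it.iternext()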




'It's also worth checking out existing neural net code-bases to see what tricks they have. The fine details usually aren't in papers, and they're not all in the text-books either.'

Given that you are a person who is highly qualified to answer, I am genuinely curious why you think that is. Reimplementing algorithms from scratch is an efficient way to learn, understand the underlying concepts, and attempt improvements in a research context.


A lot of machine-learning papers are eight pages. Speech conference papers (heavy users of neural nets) are often only four. Some details aren't part of the main message, so they don't make it in. Often code is available, and initialization and other tweaks can be found in there (even if you aren't going to use their code).

That said, there are also whole papers, even collected volumes, on initialization and other practical details.

Textbooks aren't always up to date with the latest practical knowledge, as deep-learning practice is moving quickly. Or they simply don't want to clutter their high-level maths descriptions with code-level implementation details. Teaching is all about tradeoffs. I'm sure several books do mention the scale of weights for simple feed-forward networks though, as it's not an implementation-level detail, and it's probably been well known since the 1980s.


I'll weigh in: papers aren't necessarily worded to convey new information in an ideal manner (especially to newbies). They are worded so that expert researchers can reproduce them, especially the parts that constitute the authors' contribution to the field.

As for textbooks, I imagine that the field is moving too fast; half the stuff I use has only existed for the past year or two.


Hi! Thanks for the valuable suggestions :-) Your points make a lot of sense to me. I'm caught up with some other work at the moment, but I fully intend to make these amendments within the next 20 days.



