I seem to be in a similar situation: an experienced software engineer who has jumped into the deep end of ML. Most resources abstract away either too much detail or too little. For example, building a toy example that just calls gensim.word2vec doesn't help me transfer that knowledge to other use cases. At the other extreme, most research papers are impenetrable walls of math that obscure the forest for the trees.
Thus far, I would also recommend Andrej Karpathy's Zero to Hero course (https://karpathy.ai/zero-to-hero.html). He assumes a high level of programming knowledge but demystifies the ML side.
--
P.S.
If anyone is, by chance, interested in helping chip away at the literacy crisis (e.g., 40% of US 4th graders can't read even at a basic level), I would love to find a collaborator for evaluating the practical application of results from the ML fields of cognitive modeling and machine teaching. These seemingly simple ML models offer powerful insight into the neural basis for learning but are explained in the most obtuse ways.
I'm enrolled in their latest course via the University of Queensland; presently, they're teaching us by implementing one of the latest text-to-image papers in PyTorch. They cover the math in side lectures if you're interested and have the prerequisite knowledge, but it's not necessary if what you're keen on is programming the models.
So far, I am a week into learning ML :). I have spent ~30 hours watching various ML courses and am in the process of testing the hypothesis that teaching reading with a shallower orthography (e.g., differentiating between the short and long 'e' sounds by introducing an 'ē' grapheme) leads to improved recognition of sublexical patterns. The step I am working on is building an embedding layer to ensure that these new graphemes (e.g., 'ē', 'ā') are near their parent graphemes (e.g., 'e', 'a') in the embedding space. (Although the model seems straightforward, I could also be completely misguided in how I am tackling this problem :) ).
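Concretely, here is a minimal sketch of the idea (the four-symbol vocabulary, the 0.01 noise scale, and the lambda_prox knob are placeholder choices, not tested values): initialize each new grapheme's vector from its parent's vector, and optionally add a penalty to the task loss that keeps the pairs close during training.

    import torch
    import torch.nn as nn

    # Toy vocabulary: parent graphemes plus the new long-vowel graphemes.
    vocab = {'a': 0, 'e': 1, 'ā': 2, 'ē': 3}
    parent_of = {'ā': 'a', 'ē': 'e'}  # new grapheme -> parent grapheme

    emb = nn.Embedding(num_embeddings=len(vocab), embedding_dim=16)

    # Start each new grapheme at its parent's vector plus small noise,
    # so training begins with the pairs close together.
    with torch.no_grad():
        for child, parent in parent_of.items():
            emb.weight[vocab[child]] = (
                emb.weight[vocab[parent]]
                + 0.01 * torch.randn(emb.embedding_dim)
            )

    # Optionally keep them close during training by adding a distance
    # penalty to the task loss (lambda_prox is a made-up knob).
    def proximity_penalty(emb, lambda_prox=0.1):
        return lambda_prox * sum(
            (emb.weight[vocab[c]] - emb.weight[vocab[p]]).pow(2).sum()
            for c, p in parent_of.items()
        )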
FYI, this orthographic approach (i.e., how words are spelled using an alphabet) is used in a few highly researched literacy programs, but AFAICT there isn't direct research on the approach itself. The motivation is to initially make English a consistent language (i.e., the letters you see have a one-to-one correspondence with a particular sound). This should greatly simplify the initial roadblock in learning to read English (as seen by studies of countries with "shallow" orthographic languages) and then learners would transfer this knowledge to the normal (inconsistent) English orthography.
My main goal is to use cognitive modeling to evaluate the efficacy of interventions and inform the personalized "minimum effective dose" for a particular learner. Academically, this is well-trodden territory [0-2], but these results haven't found their way into practice. This is critically important because we know that ~30% of children will learn to read regardless of method, ~50% require explicit, systematic instruction, ~15% require prolonged explicit and systematic instruction, and up to 6% have severe cognitive impairments that make acquiring reading skills extremely difficult [3]. Yet, how much is enough?
To make this more concrete, imagine you are learning a foreign language with Duolingo. How much effort per day is necessary to reach fluency? Many people have long streaks and are no closer to it (I learned nearly nothing despite a 400-day streak). Similarly, many reading interventions are once a week and, predictably, don't meaningfully affect learning outcomes for those students.
BTW, this ML portion is part of a much larger effort (e.g., our team is a Phase II finalist in the Learning Engineering Tools Competition). If anyone is interested in collaborating, please feel free to reach out to me.
[3] Education Advisory Board. (2019). Narrowing the Third-grade Reading Gap: Embracing the Science of Reading. District Leadership Forum, research briefing.
Don’t get me wrong, I think your work is really cool and a worthy cause, but surely the literacy crisis is a socio-economic problem, not a technological one.
> surely the literacy crisis is a socio-economic problem, not a technological one.
Yes and no. It is, of course, not strictly a technological one, but the argument that it is a socio-economic one is, at best, an oversimplification. If you are interested in a more complete understanding, I highly recommend checking out APM's documentaries on this issue (https://features.apmreports.org/reading/).
From my research, the underlying causes of the literacy crisis are:
1. The mistaken belief that reading, like speaking, is biologically natural. This belief manifests as guidance to surround your child with books and read to them. Unfortunately, this isn't sufficient for the majority of children.
2. The majority of teachers lack the content knowledge to teach children to read. For example, imagine helping a child to sound out the word "father". What is the sound of the second letter? It isn't a short 'a' nor a long 'a'.
3. Many popular programs used in schools are completely debunked by science (e.g., cueing theory), but as a teacher it is difficult to identify that your approach is faulty. (If ~30% of children learn regardless of method, it is too easy to offer excuses for why the other children don't learn).
4. Helping a struggling child is a "rich man's game". If you are high SES and your child is struggling, you will pay a tutor to rectify the problem. That isn't an option for the vast majority of families.
In other words, this is a highly complex puzzle and it is completely understandable why society is seemingly no closer to solving it :). Consequently, the majority of our effort is directed at understanding these root causes and identifying how to overcome them (FWIW, we have made significant progress here). The cognitive modeling portion is a small but plausibly important part of the larger landscape.
It depends how far down the rabbit hole you want to go :). I highly recommend checking out APM's documentaries on this issue (https://features.apmreports.org/reading/). These are in-depth and accessible.
If you want to go further, you can read Moats' Speech to Print, Seidenberg's Language at the Speed of Sight, and many others. If you want to go even deeper, then welcome to the firehose that is educational research :D.
Some papers that are runnable on a laptop CPU (so long as you stick to small image sizes/tasks):
1) Generative Adversarial Networks (https://arxiv.org/abs/1406.2661). Good practice for writing custom training loops, trying different optimisers and networks, etc.
2) Neural Style Transfer (https://arxiv.org/abs/1508.06576). Nice to be able to manipulate pretrained networks and intercept intermediate layers.
3) Physics Informed Neural Networks (https://arxiv.org/abs/1711.10561). If you're interested in scientific applications, this might be fun. It's a good exercise in calculating higher-order derivatives of neural networks and using these in loss functions.
4) Vanilla Policy Gradient (https://arxiv.org/abs/1604.06778) is the easiest reinforcement learning algorithm to implement and can be used as a black-box optimiser in a lot of settings (see the minimal sketch after this list).
5) Deep Q Learning (https://arxiv.org/abs/1312.5602) is also not too hard to implement and was the first time I had heard about DeepMind, as well as being a foundational deep reinforcement learning paper.
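For a taste of item 4, here's a minimal REINFORCE-style training loop on CartPole. It's a sketch, not a reproduction of the benchmarking paper: it assumes the gymnasium package, updates from a single episode at a time, and normalizes returns (a common stabilizing trick).

    import torch
    import torch.nn as nn
    import gymnasium as gym  # assumes gymnasium is installed

    env = gym.make("CartPole-v1")
    policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
    opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

    for episode in range(500):
        obs, _ = env.reset()
        log_probs, rewards, done = [], [], False
        while not done:
            dist = torch.distributions.Categorical(
                logits=policy(torch.as_tensor(obs, dtype=torch.float32)))
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            obs, reward, terminated, truncated, _ = env.step(action.item())
            rewards.append(reward)
            done = terminated or truncated

        # Discounted reward-to-go returns, normalized for stability.
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + 0.99 * g
            returns.append(g)
        returns = torch.tensor(list(reversed(returns)))
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)

        # REINFORCE loss: -sum(log pi(a|s) * return).
        loss = -(torch.stack(log_probs) * returns).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()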
It's best to choose something you personally find interesting. For example, I'm interested in audio generation, so I'd pick papers that describe a music/voice generation model or algorithm; for you it might be something completely different.
> ideally, a list of papers that could take 2-5 hours each and a few hundred lines of code?
I think you are severely underestimating the time required, unless you are quite experienced, know exactly what to look for, or the paper is just a slight variation on previous work that you are already familiar with.
Even seasoned researchers can easily spend 30+ hours trying to reproduce a paper, because papers almost never contain all the details that went into the experiments. You are left with a lot of fiddling and iteration. Of course, if you only care about roughly reproducing what the authors did, and don't care about getting the same results, the time can be much shorter. If the code is available, that's even better, but looking at it is cheating, since wrestling with issues yourself is a big part of the learning process.
A few people here mentioned Andrej's lectures, and I also think they are amazing, but they are not a replacement for getting stuck and solving problems yourself. You can easily watch these lectures and think "I get it!" because everything is so well explained, but you'll probably still be stuck when you run into your own problems trying to reproduce papers from scratch. There's no replacement for the experience you gain by struggling :)
It's like watching a math lecture and thinking you get it, but then getting stuck at the exercise problems. The real learning happens when you force yourself to struggle through the exercises.
I completely agree (that struggling is the most important part, and also about Andrej's lectures - they almost spoon-feed you imo), but I also didn't know papers took so long. Any particular papers that you remember implementing yourself that you would recommend?
I believe which paper you implement matters less than you think. Most papers take some kind of well-known base model, make a few tweaks, and then run experiments. To implement it you need to follow the references and start with the base model, and 99% of all papers within some field will eventually lead you back to implementing the same base model, so it really doesn't matter where you start. The overlap is huge.
- “your classifier is secretly an energy-based model and you should treat it like one” paper
- self-supervision
- distance metric learning
Places where you can read implementations:
- lucidrains’ github
- timm computer vision models library
- fastai
- labml (annotated quite nicely)
Biggest foreseeable headaches:
- not super easy to do test-driven development (though see the overfit-one-batch sketch below)
- data normalization (floating point error, not using e.g. batchnorm)
- sensitivity of model performance to (hyper)params (layer sizes, learning rates, optimizer, etc)
- impatience
- lack of data
I’d also recommend watching Mark Saroufim live code in PyTorch, on YouTube. My 2 cents, you can only get really fast as well as good at this with a lot of experience. A lot of rules-of-thumb have to come together just right for the whole system to work.
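One rule-of-thumb that partially substitutes for test-driven development is the overfit-one-batch check: a correct model and loss should drive training loss to near zero on a single small batch, so if it can't, suspect a bug before blaming hyperparameters. A sketch (the threshold and step count are arbitrary):

    import torch
    import torch.nn as nn

    def can_overfit_one_batch(model, batch, steps=300, lr=1e-3):
        # A correct model/loss should memorize one small batch easily.
        x, y = batch
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(steps):
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
        return loss.item() < 1e-2  # arbitrary "memorized it" threshold

    # Toy classifier on random data; real usage would pass your model
    # and one batch from your actual dataloader.
    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))
    batch = (torch.randn(8, 10), torch.randint(0, 3, (8,)))
    print(can_overfit_one_batch(model, batch))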
Early days, but trying to solve the lack of data problem with a project called "Oxen".
At its core, it's a data version control system, built from the ground up to handle the size and scale of some of these ML datasets (unlike git-lfs or DVC, which I have found to be relatively slow and hard to work with). We're also building out a web hub, similar to GitHub, to collaborate on the data with a nice UI.
Would love any feedback on the project as it grows! Here's the github repo:
2-5 hours for a few hundred lines of tricky math code sounds like way too little. Not to mention, having to read and understand the paper first. Depending on the difficulty of the paper and your level of skill in the field, I'd say implementing a paper should take 20-200 hours.
Sorry, I mostly work on (hobbyist) 3d computer vision and not ML.
BTW, implementation time also greatly depends on what you've implemented already. Most papers are a small derivation of some preexisting idea, so if you've already implemented that idea, there isn't much work to do on top of it - just modify your existing code. But if you're just starting in some area, getting up to that point will take time.
Not sure if it's beginner friendly but I found implementing NeRF from scratch a good exercise. Especially since it reveals many details that are not immediately obvious from the paper.
I've been working on implementing this too! It's been fun trying to debug problems by figuring out how to reduce it to the "bare minimum" to repro. Ex: does it work if I disable positional encoding? Does it work if I have only one sample per ray on a single image dataset? Etc
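For anyone attempting the same reductions, the positional encoding itself is only a few lines, which makes it a cheap thing to toggle. A sketch of the NeRF-style encoding (implementations vary on whether the raw input is concatenated back in):

    import torch

    def positional_encoding(x, num_freqs=10, include_input=True):
        # Map each coordinate p to (sin(2^k * pi * p), cos(2^k * pi * p))
        # for k = 0..num_freqs-1. x: (..., D) coords, roughly in [-1, 1].
        freqs = 2.0 ** torch.arange(num_freqs) * torch.pi  # (num_freqs,)
        scaled = x.unsqueeze(-1) * freqs                   # (..., D, F)
        enc = torch.cat([scaled.sin(), scaled.cos()], dim=-1).flatten(-2)
        return torch.cat([x, enc], dim=-1) if include_input else enc

    # 3D points -> 3 + 3*2*10 = 63 features with the paper's L=10.
    pts = torch.rand(4, 3) * 2 - 1
    print(positional_encoding(pts).shape)  # torch.Size([4, 63])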
I would recommend diffusion: try starting with Lilian Weng's blog post and writing up the process for yourself. For all its abilities, the code for DDPM is surprisingly simple.
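To illustrate "surprisingly simple": the entire training objective fits in a few lines, because the forward corruption process has a closed form. A sketch under the standard DDPM setup, where model stands in for any noise-prediction network (e.g., a small U-Net) that takes the noisy input and the timestep:

    import torch
    import torch.nn.functional as F

    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)          # DDPM's linear schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative products

    def ddpm_loss(model, x0):
        # Sample a timestep and noise, corrupt x0 in closed form:
        # x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps,
        # then train the model to predict the noise that was added.
        t = torch.randint(0, T, (x0.shape[0],))
        eps = torch.randn_like(x0)
        ab = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))
        x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps
        return F.mse_loss(model(x_t, t), eps)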
I'd love for someone to do a good quality PyTorch enabled implementation of Sampled AlphaZero/MuZero [1]. RLLib has an AlphaZero, but it doesn't have the parallelized MCTS you really want to have and the "Sampled" part is another twist to it. It does implement a single player variant though, which I needed. This would be amazing for applying MCTS based RL to various hard combinatorial optimization problems. Case in point, AlphaTensor uses their internal implementation of Sampled AlphaZero.
An initial implementation might be doable in 5 hours for someone competent and familiar with RLLib's APIs, but could take much longer to really polish.
I have implemented YOLO v1 and trained/tested it on synthetic images with geometric forms. Implementing the loss function taught me a lot about how backpropagation really works. I used keras/tf.
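For a flavor of why the loss is instructive, here's a simplified sketch of it (in PyTorch rather than the keras/tf the parent used, with one predicted box per grid cell and the confidence target reduced to 0/1 instead of the paper's IoU, so it's an approximation rather than the exact YOLO v1 loss):

    import torch

    LAMBDA_COORD, LAMBDA_NOOBJ = 5.0, 0.5  # weights from the YOLO v1 paper

    def yolo_v1_loss(pred, target):
        # pred/target: (batch, S, S, 5 + C) laid out as
        # [x, y, w, h, confidence, class probs...], with w/h in [0, 1].
        obj = target[..., 4:5]    # 1 where the cell contains an object
        noobj = 1.0 - obj

        # Coordinate loss; sqrt on w/h so large boxes don't dominate.
        xy = (obj * (pred[..., :2] - target[..., :2]) ** 2).sum()
        wh = (obj * (pred[..., 2:4].clamp(min=1e-6).sqrt()
                     - target[..., 2:4].sqrt()) ** 2).sum()

        # Confidence loss, down-weighted for empty cells.
        conf = (obj * (pred[..., 4:5] - obj) ** 2).sum() \
             + LAMBDA_NOOBJ * (noobj * pred[..., 4:5] ** 2).sum()

        # Class probabilities, only where an object is present.
        cls = (obj * (pred[..., 5:] - target[..., 5:]) ** 2).sum()

        return LAMBDA_COORD * (xy + wh) + conf + cls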
Just finished assignment 2 of cs224n[1], which has you derive gradients and implement word2vec. I thought it was a pretty good exercise. You could read the glove paper and try implementing that as well.
Knowing how to step through backpropagation in a neural network gets you pretty far in conceptual understanding of a lot of architectures. Imo there’s no substitute for writing out the gradients by hand to make sure you get what’s going on, if only in a toy example.
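A toy version of that exercise in numpy, with a finite-difference check to confirm the hand-derived gradients (shapes and data are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 3))
    y = rng.normal(size=(4, 2))
    W1, W2 = rng.normal(size=(3, 5)), rng.normal(size=(5, 2))

    # Forward: h = relu(x @ W1), pred = h @ W2, loss = mean squared error.
    z = x @ W1
    h = np.maximum(z, 0)
    pred = h @ W2
    loss = ((pred - y) ** 2).mean()

    # Backward, one chain-rule step at a time.
    dpred = 2 * (pred - y) / y.size  # dL/dpred
    dW2 = h.T @ dpred                # dL/dW2
    dh = dpred @ W2.T                # dL/dh
    dz = dh * (z > 0)                # relu gradient gates the flow
    dW1 = x.T @ dz                   # dL/dW1

    # Finite-difference check on one entry of W1.
    eps = 1e-5
    W1[0, 0] += eps
    loss2 = ((np.maximum(x @ W1, 0) @ W2 - y) ** 2).mean()
    print(dW1[0, 0], (loss2 - loss) / eps)  # should match closely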
Starting out, I would recommend implementing fundamental building blocks within whatever 'subculture' of ML you are interested in, whether that be DL, kernel methods, probabilistic models, etc.
Let's say you are interested in deep learning methods (as that's something I could at least speak more confidently about). In that case, build yourself an MLP layer, then an RNN layer, then a GNN layer, then a CNN layer, and an attention layer, along with some full models with those layers on case studies exhibiting different data modalities (images, graphs, signals). This should give you a feel for the assumptions driving the inductive biases in each layer and what motivates their existence (vs. an MLP). It also gives you all the building blocks you can then extend to build every other DL layer+model out there. Another reason is that these fundamental building blocks have been implemented many times, so you have a reference to look to when you get stuck.
After building the basic building blocks these should each take about 2-5 hours (reading paper + implementation). Probably quicker at the end with all this practice. Good luck and remember to have fun!
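As one concrete example of those building blocks, here's a minimal single-head self-attention layer built from nn.Linear alone (no masking, no multi-head splitting). Note that it treats the sequence as a set - there is no notion of position - which is exactly the kind of inductive-bias observation the exercise is meant to surface:

    import torch
    import torch.nn as nn

    class SelfAttention(nn.Module):
        # Single-head scaled dot-product self-attention.
        def __init__(self, dim):
            super().__init__()
            self.q = nn.Linear(dim, dim)
            self.k = nn.Linear(dim, dim)
            self.v = nn.Linear(dim, dim)

        def forward(self, x):  # x: (batch, seq, dim)
            q, k, v = self.q(x), self.k(x), self.v(x)
            scores = q @ k.transpose(-2, -1) / x.shape[-1] ** 0.5
            return scores.softmax(dim=-1) @ v  # weighted mix of values

    x = torch.randn(2, 7, 32)
    print(SelfAttention(32)(x).shape)  # torch.Size([2, 7, 32])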
Another question: do you have a good resource for learning more about GNNs? I'm currently looking at the Stanford course; is that good enough?
Any other courses/books, or just something else that you think would be more useful?
Quick question: when you say write your own MLP, RNN, etc., is that without using PyTorch? I'm assuming so.
i.e., I'm guessing you want me to write my own NN library that handles all of this stuff, including backprop, etc.
Hey, feel free to reach out if you'd like to join an NLP project I'm working on to gain more experience. Will provide mentorship and potentially coauthorship on the publication.
Hey - I would love to help out with whatever needs doing. Happy to do grunt work. I've been deeply studying NLP for a few months now and a project is exactly what I need to help me move forward with it.
You could maybe write the whole thing in a few hours, but debugging what you wrote to recreate prior results will probably take much longer, depending on the choice of problem.
Sorry, I should've been more specific: I meant ML in general. From what I've seen, it's not hard to reimplement the pseudo-code in any ML paper; it gets tricky when you actually try to use the code you've written, usually when recreating the performance results from the paper you're implementing. It's very common for authors to leave out or downplay the role of tricks or implementation details that greatly contributed to the model's performance, in addition to just how finicky machine learning is in general.