Do simpler machine learning models exist and how can we find them?

tbalsam · on Dec 22, 2022

I recently released a codebase in beta that modernizes a tiny model that gets really good performance on CIFAR-10 in about 18.1 or so seconds on the right single GPU -- a number of years ago the world record was 10 minutes, down from several days a few years previously.

While most of my work was porting and cleaning up certain parts of the code for a different purpose (just-clone-and-hack experimentation workbench), I've spent years optimizing neural networks at a very fine grained level, and many of the lessons learned here in debugging reflected that.

I believe that there are fundamentally a few big NP-hard layers (at least two that I can define, and likely several other smaller ones) unfortunately but they are not hard blockers to progress. The model I mentioned above is extremely simple and has little "extra fat" where it is not needed. It also importantly seems to have good gradient and such flow throughout, something that's important for a model to be able to learn quickly. There are a few reasonable priors, like initializing and freezing the first convolution to whiten the inputs based upon some statistics from the training data. That does a shocking amount of work in stabilizing and speeding up training.

Ultimately, the network is simple, and there are a number of other methods to help it reach near-SOTA, but they are as simple as can be. I think as this project evolves and we get nearer to the goal (<2 seconds in a year or two), we'll keep uncovering good puzzle pieces showing exactly what it is that's allowing such a tiny network to perform so well. There's a kind of exponential value to having ultra-short training times -- you can somewhat open-endedly barrage-test your algorithm, something that's already led to a few interesting discoveries that I'd like to refine before publishing to the repo.

If you're interested, the code is here. The running code is a single .py with the upsides and downsides that come with that. If you're interested or have any questions, let me know! :D :))))

https://github.com/tysam-code/hlb-CIFAR10

m3at · on Dec 23, 2022

Nice experiment! Not to minimize your work, in fact some ideas might be complementary to yours, a couple years ago there were already some approaches reaching 30s on a single V100:

https://myrtle.ai/learn/how-to-train-your-resnet-8-bag-of-tr...

tbalsam · on Dec 23, 2022

Yes, David Page's work is lovely. This initial release is almost a bit for bit remake of the functionality of the original code, but built to be linear and hackable at basically any stage of the pipeline. Page gets a ton of respect for me for all of the novel stuff introduced, I spent like 80-90 hours plus trying to debug the minutiae it takes to get things working properly at that accuracy -- and doing that thing has been my career. It's a seriously impressive accomplishment to me and the ease of which he presents some of those changes in the blog feels like one of those baking shows where you get a good idea of how truly difficult the achievement is when you do it yourself.

I wanted to start with his baseline but as that hackable workbench for my own purposes to explore some information theory concepts w.r.t. deep learning and etc. His code is beautiful but also a framework-within-a-framework and nearly purely functional so quick hacks are basically impossible beyond a certain point. There are tradeoffs of course.

Continuing to drop bit depth and a few other improvements will probably carry things surprisingly far, so long as hardware compatibility with said hacks remains Gucci. There's also some Triton kernel hacks that we could dip into but it would taint some of the "pure simple python" goals for the project.

But yes -- this is a port of David Page's work designed for a 1-2 hour quick-sketch experimenting researcher. I've found a few other improvements that I hope to refine and contribute to the repo at some point -- after I fix a few basic, glaring bugs like the console printing the whole progress chart again each time. But yes, we're well on our way and thank you so much for linking that -- I'm a rather large fan of his work and I truly hope that some more of his wizardry comes to public light for us to glean from. :D :)))) <3

m3at · on Dec 23, 2022

Now I realise that I _somehow_ totally missed your link to Page's work right in the README…

Thanks for the detailed comment, I definitely will dive into your script soon! One early suggestion, you might want to try torch dynamo [1], anecdotally I had good speedups (~20%) on some image models; though not sure how significant the impact might be at this (relatively) small scale.

[1] https://pytorch.org/docs/master/dynamo/

tbalsam · on Dec 23, 2022

Thank you! And good feedback, the README could be condensed a bit more to be more clean and readable.

Thanks so much for the suggestion, I'll take a look at it! And much appreciated to a huge degree on interest in the script, it's still not perfectly polished stylistically but now that the baseline checks out, we definitely have a lot of performance gains to be had in the next release! :D :)

Feel free to ping me if you ever need anything, I'm not the most active on GitHub but if you ever need to reach me by email for questions/comments/thoughts/etc, hi [ period ] tysam [ the at symbol ] gmail [ period ] com is my email address. :D

m3at · on Dec 23, 2022

Will do if I have feedback :)

jacquesm · on Dec 23, 2022

Very nice work! What I've found with optimizations is that once you have exhausted one avenue that has almost magically caused another one to be opened up which you can then squash and so on. The compounding effect of such serial optimizations can be considerable.

I wonder if converting a model back into code and then optimizing that is a viable path, you could then try to drop out little bits of the code to see if they meaningfully affect the output or not.

tbalsam · on Dec 23, 2022

Thanks for the comment! As far as seeing what the core parts are, I'm really partial to the method where you have a graph network with a really high L2 communication penalty which seems to approach the intrinsic dimension of the problem for certain simple problems. How well that scales to larger problems I have no idea (probably not well, but techniques like variational dropout are pretty analogous in their own kinds of ways). Thankfully with a fast-training network you can play/dork around with the numbers a bit and see what's what.

One could distill the tiny ResNet into a graph network, which with the right constraints could theoretically accomplish the same as the original neural network, and then compress that as small as possible. There's probably an interesting tradeoff in "maximal compression" and "number of iterate rounds" for said graph network. I recently got enough runs (25) and performance difference to have a p=.0014 result or something like that in half an hour for something I was experimenting with against the baseline recently and it felt so good because I wasn't diddling with 5 runs, which in some cases for certain papers take days to finish. It's just a very satisfying feeling.

I guess back to the explainability side of things.... -- that alone I don't think would necessarily provide answers to the explainability problem but I think it would be like a refining-oil type step before diving into the L2-compressed feature representations....

jacquesm · on Dec 23, 2022

Interested in how that will pan out. Unfortunately I don't know enough about the subject matter or I would definitely give it a shot.

And if you hack this to the point where it is explainable then that in turn might generalize to larger networks and/or different problems.

tbalsam · on Dec 23, 2022

That's the hope! :D :)))) <3 <3 :)

jacquesm · on Dec 23, 2022

Best of luck with this, if you have any interesting results and write them up please do post them and ping me (email in profile).

elcritch · on Dec 23, 2022

> There are a few reasonable priors, like initializing and freezing the first convolution to whiten the inputs based upon some statistics from the training data. That does a shocking amount of work in stabilizing and speeding up training.

Wow, that sounds nifty! Could you elaborate or point to more resources for that sort of technique?

tbalsam · on Dec 23, 2022

Someone else linked this in a different tree but I do truly love this blogpost series -- you can see the section "input patch whitening" for more details. https://myrtle.ai/learn/how-to-train-your-resnet-8-bag-of-tr...

Alternatively, you can read through the direct code here if you're willing to wade through some of the mathy code on it: https://github.com/tysam-code/hlb-CIFAR10/blob/main/main.py#...

elcritch · on Dec 23, 2022

Thanks! Also that code seems pretty straightforward. Later there's this bit:

    conv_layer.weight.data = (eigenvectors/torch.sqrt(eigenvalues+eps))[-shape[0]:, :, :, :]
    ## We don't want to train this, since this is implicitly whitening over the whole dataset

Oh it looks like you then just "remove" the correlated data.. Actually, you're just pre-"padding" the basic (gaussian?) covariances. Ah so then you'd skip training basic statistical information that's already easy to calculate directly. Clever!

ayhanfuat · on Dec 23, 2022

> a few big NP-hard layers

What does NP-hard mean in the context of neural networks? You mean the loss minimization of those layers is NP-hard?

tbalsam · on Dec 23, 2022

That's a good question, and I provided scant few hints towards it in my original post.

There are a few layers in which the order of certain things matter -- basically, any chaotic system arising from the choices made in training neural networks. We oftentimes just randomly choose an answer as that's a "best guess", but in pushing the territory of world records, something more principled is in order.

Weight initialization is one, and what data we show to the network when is another. Each choice at any point influences every single choice that comes after it, so even if it was quantized into discrete decision bins (which it can be for both of those, I believe -- even the weight initialization if you're a lottery ticket hypothesis fan. I cringe in saying that though as that phrase can summon an interesting mix of people.) In that sense, calculating which order of operations/order of values is ideal is I believe an NP-hard problem by definition, and not too much weirdness if we're looking at the discrete case(s) I think.

Maybe solving that up front from a structure perspective is untenable, but if we're able to crack some of the mystery of the solution manifold and turn that into portable priors for architectures like this, then I think that opens the door to connecting things to a more universal, pure mathematical solution. And then that ends up unpacking nicely in other problem domains even if we maybe don't know up front which priors work well for that particular subdomain. If we have some sort of mathematically-connected rule, we can be more sure about it.

That's a loose form of a general workflow I follow, it's a bit more of a crapshoot though where any discrete chaotic processes are involved, unfortunately.

Hope that helps answer some of the question, there's definitely other layers to be had, though. Which means job security for a lot of people for a long time to come, I personally think. :D :))))

bmc7505 · on Dec 23, 2022

There are architectures with layers which can approximate MAXSAT [1] (or rather, an SDP relaxation thereof), but I doubt this is what OP was referring to.

[1]: https://arxiv.org/abs/1905.12149

naasking · on Dec 24, 2022

Switching to Hinton's new Forward-Forward algorithm might get you that faster training time, rather than focusing on tweaking existing approaches.

tbalsam · on Jan 3, 2023

This is an interesting statement to me, especially as the original paper from Hinton notes that it converges more slowly than traditional algorithms.

I like the new shiny shiny too, but I've been in this field too long to chase all the new stuff that comes along (and I do love me some Hinton too). I thought about FF for this application but didn't see anything that would make it work in this context, is there anything in particular that you were seeing that would benefit us in this particular usecase?

satvikpendem · on Dec 22, 2022

I just submitted an article about a paper by Deepmind whose main conclusion is that "data, not size, is the currently active constraint on language modeling performance" [0]. This means that even if we have bigger models, with billions and trillions of parameters, they are unlikely to be better than our current ones, because our amount of data is the bottleneck.

TFA though also reminds me of the phenomenon in mathematical proofs where some long winded proof eventually comes up but then over time it becomes simplified as more mathematicians try to optimize it, in much the same way as programmers with technical debt ("make it work, make it right, make it fast"), such as with the four color theorem that was until now computer assisted but it seems there is a non computer assisted proof out [1].

I wonder if the problem in TFA could itself be solved by machine learning, where models would create, train, and test other models, changing them along the way, similar to genetic programming but with "artificial selection" and not "natural selection" so to speak.

[0] https://news.ycombinator.com/item?id=34098087

[1] https://news.ycombinator.com/item?id=34082022

hooande · on Dec 22, 2022

two thoughts

People love to say "it's early" and "it will improve" about ChatGPT. but amount of training data IS the dominant factor in determining the quality of the output, usually in logarithmic terms. it's already trained on the entire internet. it's hard to see how they'll be able to significantly increase that.

And having models build models is drastically overrated. again, the accuracy/quality improvements are largely driven by the scale and diversity of the dataset. that's like 90% of the solution to any ml problem. choosing the right model and parameters is often a minor relative improvement

not2b · on Dec 22, 2022

Perhaps some sort of adversarial network approach could work better; models that learn to generate text and other models that try to distinguish AIs from humans, competing against each other. Also, children learning language benefit from constant feedback from people who have their best interest at heart ... that last part is important because of episodes like Microsoft's Tay where 4chan folks thought it would be fun to turn the chatbot into a fascist.

charcircuit · on Dec 23, 2022

Passing the Turing test isn't the goal. You can have a useful model that isn't human like and can have a useless model that you can't tell isn't a human.

not2b · on Dec 23, 2022

Certainly. Approaches that have done better on the Turing test have used various tricks to avoid their lack of understanding, like playing a paranoid person. But some of the best chatbots give themselves away by getting stuck in a loop or demonstrating lack of basic intuition about the world that a three year old would have. Those things perhaps could be caught.

ronsor · on Dec 23, 2022

I think OpenAI has already published some research showing humans preferred smaller/fewer parameter models that were "better trained" through the use of human feedback.

If there were a model that could adequately replace the role of the human, then that approach would probably work well.

geysersam · on Dec 22, 2022

The data is the bottleneck for the current generation of models. Better models/training strategies could very well change that in the next couple of decades.

tlarkworthy · on Dec 22, 2022

All the books. I think books might be better

albert_e · on Dec 23, 2022

Yes.

And (with scary privacy implications) maybe the next frontier is capturing all spoken language uttered by people in real-time and streaming it into a model that is being updated in near-real-time.

And finally brain implants extracting unspoken thoughts and neural activity and combining it all. Extend it to non human life forms as well. (universal consciousness?)

akomtu · on Dec 23, 2022

It's the future where everyone is a universal captcha solver for the needs of the Great AI. No freedom even in your thoughts.

recuter · on Dec 22, 2022

There's only about 100m books (in English). About the same volume of text as the web all total.

MonkeyClub · on Dec 22, 2022

Generally a book has deeper thinking than a webpage, though, I think that’s the crux of the GP’s clarification.

disgruntledphd2 · on Dec 23, 2022

And the distribution of text will be very different for all books vs the internet.

dr_dshiv · on Dec 22, 2022

Well, diffusion networks seemed to work because they increased the amount of data by training so many examples of added noise and it’s removal. Some similar approach might be possible with text, too. Or, i guess they will talk to itself, like alpha go played itself.

theGnuMe · on Dec 22, 2022

> it's hard to see how they'll be able to significantly increase that.

With feedback.

DenisM · on Dec 23, 2022

Yes. Which is why ChatGPT is open to public.

rhacker · on Dec 22, 2022

The next phase has got to be more and more private data sets that don't exist on the internet: Homes with Alexa, Ok Google, etc.. Beyond that, linkages into a human brain.

albertzeyer · on Dec 22, 2022

> "data, not size, is the currently active constraint on language modeling performance"

This is a bit incomplete. It goes both ways. Right in the abstract, it says:

"the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled"

You are probably referring to the statement:

"We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant."

So GPT3 and co would perform better with more data, but only up to a certain point. At that point, you would also need to scale up the model size again to get better performance.

But also, GPT3 and co would perform better when they would be scaled down a bit, when you keep the same training data. That is actually what the Chinchilla paper does. Their Chinchilla model is smaller than GPT3 and others, trained with same amount of data (as Gopher), and performs better.

I don't really see a problem in scaling up models even more. This paper just says that you should also scale up the training tokens equally.

As far as I remember, this paper does not quite address whether you can use the same training data twice. I think all statements assume that every training token is only used once.

joaogui1 · on Dec 23, 2022

Chinchilla is trained with substantially more tokens then Gopher, 1.4T Vs 300B

jacquesm · on Dec 23, 2022

And you can bet that any file timestamped prior to GPT being released will carry extra weight in the future.

DenisM · on Dec 23, 2022

It’s kind of like pre-nuclear steel that way.

https://en.m.wikipedia.org/wiki/Low-background_steel

jacquesm · on Dec 23, 2022

Ah great analogy, yes, that is exactly what it is.

bilsbie · on Dec 22, 2022

Yet humans learn on way less language data.

Enginerrrd · on Dec 22, 2022

I disagree.

On pure # of words, sure.

But for humans, language is actually just a compressed version of reality perceived through multiple senses / prediction + observation cycles / model paradigms / scales / contexts/ social cues, etc. and we get full access to the entire thing. So a single sentence is wrapped in orders of magnitude more data.

We also get multiple modes of interconnected feedback. How to describe this? Let me use an analogy. In poker, different properties a player has statistically take different amounts of data to reach convergence: Some become evident in 10s of hands, some take 100s of hands, and some take 1000s, and some even take 10,000s before you get over 90% confidence. ....And yet, if you let a good player see your behavior on just one single hand that goes to showdown, a good human player will be able to estimate your playing style, skill, and where your stats will converge to with remarkable accuracy. They get to see how you acted pre-flop, on the flop, turn, and river, with the rich context of position, pot-size, and what the other players were doing during those times, along with the stakes and location you're playing at, what you're wearing, how you move and handle your chips, etc. etc.

numpad0 · on Dec 22, 2022

We also eat. It feels to me that better food with divergent micronutrients has positive performance implications. Maybe I’m just schizophrenic, but to me it just feels that way.

fshbbdssbbgdd · on Dec 22, 2022

Try training the model with free-range, locally-sourced electrons.

marcosdumay · on Dec 22, 2022

It's less data anyway. There's no way to add up the data a person senses and get into a volume any similar to what's on the internet.

But it may be better data.

thom · on Dec 22, 2022

To quantify this a little, the human sensory system has been estimated to generate on the order of 11m bits per second of data. So 1-2 megabytes a second for most of your life. That’s probably in the region of a day of YouTube. But it’s clear that a lot of human cognition is directed towards novelty, and humans are able to run experiments interactively, not just consume data (see schemas in child development etc).

So, you take your baby AI and instead of training it just on a static corpus, you put it in a simulator. And then when you again hit a wall where you conclude data is the problem, you give them a robot body and plug them into the internet bidirectionally. Some would argue this would be a mistake.

edgyquant · on Dec 23, 2022

Language is a tiny way of encapsulating the vivid imagery humans contain about any given situation. The issue with ML models is they are very specific, a human baby collections 1-2 years worth of visual, auditory, etc data before it begins to use that in a meaningful way. Every thing a baby does is a reinforcement session and every moment it is training its neural networks. This doesn’t even get into sleep: which is where connections are solidified in an abstract way.

lovecg · on Dec 23, 2022

Humans benefit from billions of years of pre-trained hardware. It will be interesting to see if we can match that somehow.

katvan · on Dec 22, 2022

somehow we quickly can apply other models to make predictions on new unrelated datasets. I guess that's where stupidity and creativity come from.

jacquesm · on Dec 23, 2022

I'm reminded of the way computer chess has evolved. It wasn't all that long ago that you needed rooms full of all kinds of special purpose power hungry hardware and then suddenly all that was gone and the strongest programs in the world fit in your pocket on a device using as much power as a small lightbulb. An extra level of understanding of the problem domain achieved through 'slow' methods can be an enormous difference in terms of practical applications.

tabtab · on Dec 22, 2022

Re: similar to genetic programming

A genetical algorithm was also what I was thinking of. One could devise some kind of symbolic (textual) way to represent a wiring/circuit diagram (graph) and evolve the most efficient "learner" using mutation and cross-breeding (e-sex). The earliest GA I read about used Lisp dicing.

As far as "easiest" AI for humans to work with, "Factor Tables" may be a way:

https://github.com/RowColz/AI

AI tuning then becomes more like accounting instead of a lab with Doc Brown. Factor Tables are much easier to analyze, debug, and modularize than neural nets.

joaogui1 · on Dec 23, 2022

There's been work combining GA's and Architecture search for neural networks, the main keyword to search is NeuroEvolution, with NEAT being one of the first "good" algorithms for that (though scaling it up is hard)

wincy · on Dec 22, 2022

Couldn’t we start a website that just has humans tag stuff for machine learning, and make that tagged data set open? Does such a thing exist? I’ve heard the issues with Stable Diffusion and others is that the LAION-5B dataset is kind of terrible quality.

nl · on Dec 22, 2022

> I’ve heard the issues with Stable Diffusion and others is that the LAION-5B dataset is kind of terrible quality.

This is mostly wrong.

It's possible to get higher quality results in specific domains by fine-tuning on carefully annotated datasets. However these higher quality results wouldn't be possible without the vast pre-training on the huge dataset LAION-5B provides.

> humans tag stuff for machine learning, and make that tagged data set open? Does such a thing exist?

Yes, there are lots. The size of LAION-5B is the innovation here.

PoignardAzur · on Dec 22, 2022

SurgeAI's business model is to crowdsource these kinds of datasets, some of which it has release in the open.

donkeyboy · on Dec 22, 2022

Yes, they exist, and they are called Linear Regression and Decision Tree. Not everything needs to be a neural network.

Anyway, residual connections in NNs as well as distillation being only a 1% hit to performance imply our models are way too big.

PartiallyTyped · on Dec 22, 2022

> Anyway, residual connections in NNs as well as distillation being only a 1% hit to performance imply our models are way too big.

I disagree with the conclusion.

It indicates that our optimisers are just not good enough, likely because gradient descent is just weak.

The argument for residual connections is that we can create a nested family of models which enables expressing more models, but also embedding the smaller ones into them.

The smaller models may be retrieved if our model learns to produce the the identity function at later layers.

The problem though is that that is very difficult, meaning that our optimisers are simply not good enough at constructing identify functions. With the residual layers, we can embed the identity function into the structure of the model, and we now need to learn to map to 0 (since a residual is f(x) = x+g(x)), we need only to learn g(x)=0).

As for our optimisers being bad, the argument is that with an overparameterised network, there is always a descent direction, but we land on local minima that are very close to the global one. The descend direction may exist in the batch, but when considering all the batches, we are at a local minimum.

We can find many such local minima via certain symmetries.

The general problem however is that even with the full dataset, we can only make local improvements in the landscape.

Thus, it’s that the better models are embedded within the larger ones, and more parameters enable us to find them because of nested families, symmetries, and because of always having a descent direction.

visarga · on Dec 22, 2022

> It indicates that our optimisers are just not good enough, likely because gradient descent is just weak.

No, the networks are ok, what is wrong is the paradigm. If you want rule or code based exploration and learning it is possible. You need to train a model to generate code from text instructions, then fine-tune it with RL on problem solving. The code generated by the model is interpretable and generalises better than running computation in the network itself.

Neural nets can also generate problems, tests and evaluations of the test outputs. They can make a data generation loop. As an analogy, AlphaGo generated its own training data by self play and had very strong skills.

thom · on Dec 22, 2022

Prior to the rise of neural networks, Eurisko was hailed as one of the most impressive achievements in general-ish AI. It was built on self-modifying Lisp heuristics. It’d be interesting to revisit that in a loop with newer larger NN models.

PartiallyTyped · on Dec 22, 2022

I did say that the networks are okay. In fact, I am arguing that the networks are even overcompensating for the weakness of optimisers. Neural nets are great even given that they are differentiable and we can propagate gradients through them without affecting the parameters.

I don’t think that this reply takes into consideration just how inefficient RL and the likes are. In fact, RL is so inefficient that current SOTA in RL is … causal transformers that perform in-context learning without gradient updates.

Depending on the approach one takes with RL, be it policy gradients or value networks, it still relies on gradient descent (and backprop).

Policy gradients are just increasing the likelihood of useful actions given the current state. It’s a likelihood model increasing probabilities based on observed random walks.

Value networks are even worse because one needs to derive not only the quality of the behaviour but also select an action.

Sure enough, alternative methods exist such as model based RL, etc, and for example ChatGPT use RL to train some value functions and learn how to rank options, but all of these rely on gradient descent.

Gradient descent, especially stochastic, is just garbage compared to stuff that we have for fixed functions that are not very expensive to evaluate.

With stochastic gradient descent, your loss landscape depends on the example or mini batch, so a way to think about it is that the landscape is a linear combination of all the training examples, but at any time you observe only some of them and cope that the gradient doesn’t mess up too bad.

But in general gradient descent shows linear convergence rate (cf Nocedal et al Numerical Opt, or Boyd and Vanderberghe’s proof where they bound the improvement of the iterates), and that’s a best case scenario (meaning non stochastic, non partial).

Second order methods can get quadratic convergence rate but they are prohibitly expensive for large models, or require hessians (good luck lol).

None of these though address limitations imposed by loss functions, eg needing exponentially higher values to increase a prediction optimised by cross entropy (see the logarithm). Nor do they address the bound on the information that we have about the minima.

So needing exponentially more steps (assuming each update is fixed in length) while relying on linear convergence is … problematic to say the list

visarga · on Dec 23, 2022

I know RL is very hard. But we have about 1 TeraWord of text in current datasets and about 10 TeraWord could be scraped if we did a thorough job. RL is how language models can generate more. By solving many problems and training on problem solving, AI can create its own data. It's the AlphaGo way - build your own data to surpass human level.

PartiallyTyped · on Dec 23, 2022

This does not compute.

To generate more, you need to interact with an environment, and you also need an objective function. If we could magically generate more textual data, then we have a language model already and don't need to train another language model.

You can't bootstrap a model for language synthesis unless you give it access to the internet to interact with users, at which point ... you have a Tay [1]

[1] https://en.wikipedia.org/wiki/Tay_(bot)

joaogui1 · on Dec 23, 2022

RL is an immense field with lots of sub-areas and for the vast majority transformers are not SOTA. I believe they're only state of the art for Offline RL and even then there are some caveats

dr_dshiv · on Dec 22, 2022

Take the example of creating an accurate ontology. You could try to use a large language model to develop simpler, human-readable conceptual relations out of whatever mess of complexity currently constitutes an LLM concept. You could use ratings of the accuracy or reasonability of rules and cross-validated tests against the structure of human hand-crafted ontologies (ie, iteratively derive wikidata from LLMs trying to predict wikidata).

heyitsguay · on Dec 22, 2022

I think this is one of those issues where it's easy to observe from the sidelines that models "should" be smaller (it'd make my life a whole lot easier), but it's not so clear how to actually create small models that work as well as these larger models, without having the larger models first (as in distillation).

If you have any ideas to do better and aren't idly wealthy, I'd suggest pursuing them. Create a model that's within a percentage point or two of GPT3 on big NLP benchmarks, and fame and fortune will be yours.

[Edit] this of course only applies for domains like NLP or computer vision where neural networks have proven very hard to beat. If you're working on a problem that doesn't need deep learning to achieve adequate performance, don't use them!

sdenton4 · on Dec 22, 2022

We've got lots of great tricks for making audio ml run fast (we need to produce 16k samples per second on a mobile phone CPU, and sound great), but I think they haven't back propagated to the image or language communities.

heyitsguay · on Dec 23, 2022

Interesting, any material you can share?

mrguyorama · on Dec 22, 2022

It's almost like we have no clue what we are doing with NN and are just tweaking knobs and hoping it works out in the end.

And yet people still like to push this idea that we will magically and accidentally build a superintelligence on top of these systems. It's so frustrating how deep into their own koolaid the ML industry is. We don't even know how the brain learns, we don't understand intelligence, there's no valid reason to believe a NN "learns" the same way a human brain learns, and individual human neurons are infinitely more complex and "learning" than even a single layer of a NN.

heyitsguay · on Dec 22, 2022

As someone in the ML industry, who knows many people in the ML industry, we all know this. It's non-technical fundraisers that spread the hype, and non-technical laypeople that buy into it. Meanwhile, the folks building things and solving problems plug right along, aware of where limitations are and aren't.

hooande · on Dec 22, 2022

> It's almost like we have no clue what we are doing with NN and are just tweaking knobs and hoping it works out in the end.

No, we understand very well how NNs work. Look at PartiallyTyped's comment in this thread. It's a great explanation of the basic concepts behind modern machine learning.

You're quite correct that modern neural networks have nothing to do with how the brain learns or with any kind of superintelligence. And people know this. But these technologies have valuable practical applications. They're good at what they were made to do.

z3c0 · on Dec 22, 2022

I've always thought it was abundantly clear how to make smaller models perform as well as large models: keep labeling data and build a human-in-the-loop support process to keep it on track.

My perspective is more pessimistic. I think people opt for huge unsupervised models because they believe that tuning a few thousand more input features is easier than labeling copious amounts of data. Plus (in my experience) supervised models often require a more involved understanding of the math, whereas there's so many NN frameworks that ask very little of the users.

heyitsguay · on Dec 22, 2022

People have tried (and continue to try) that human-in-the-loop data growth. Basically any applied AI company is doing something like that every day, if they're getting their own training data in the course of business. It helps but it won't turn your bag-of-words model into GPT3.

Companies like Google have even spent huge amounts of time and money on enormous labeled datasets -- JFT-300M or something like that for computer vision tasks, as you might guess, ~300M labeled images. It creates value, but it creates more value for larger models with higher capacity.

z3c0 · on Dec 23, 2022

I "have tried (and continue to try) that human-in-the-loop data growth" to enormous success, bringing logistic regression models to greater than 99% accuracy. And you can chain vectorization strategies to create more input features than simply a bag-of-words, like morphology, shape, etc. We (the software company that I work for) don't need GPT-3, because it is a specialized model geared towards generating human-like text. Most NLP problems are just parsing text for actionable information, and oftentimes, supervised models can be chained to create something far more effective towards your needs than trying to shoehorn a massive general-purpose unsupervised model into a specialized problem.

janef0421 · on Dec 22, 2022

Supervised models would also require a lot more human labour, and the goal of most machine learning projects is to achieve cost-savings by eliminating human labour.

z3c0 · on Dec 22, 2022

Up front, yes, but long term, I wholly disagree. A model that performs at 95% or higher will assuredly eliminate human work, no matter how many interns you enlist to label the data.

cs702 · on Dec 22, 2022

> I wonder whether it would make sense to separate the concepts of "simpler" and "interpretable."

Interesting. I was thinking the same, after coming across a preprint proposing a credit-assignment mechanism that seems to make it possible to build deep models in a way that enables interpretability: https://arxiv.org/abs/2211.11754 (please note: the results look interesting/significant to me, but I'm still making my way through the preprint and its accompanying code).

Consider that our brains are incredibly complex organs, yet they are really good at answering questions in a way that other brains find interpretable. Meanwhile, large language models (LLMs) keep getting better and better at explaining their answers with natural language in a way that our brains find interpretable. If you ask ChatGPT to explain its answers, it will generate explanations that a human being can interpret -- even if the explanations are wrong!

Could it be that "model simplicity" and "model interpretability" are actually orthogonal to each other?

kylevedder · on Dec 22, 2022

Humans give explanations that other humans find convincing, but they can be totally wrong and non-causal. I think human explanations are often mechanistically wrong / totally acausal.

As a famous early example, this lady provided an unprompted explanation (using only the information available to her conscious part of her brain in her good eye) for some of her preferences despite the mechanism of action being subconscious observations out of her blind eye.

https://www.nature.com/articles/336766a0

not2b · on Dec 22, 2022

A key reason that we want models at least for some applications to be interpretable is to watch out for undesirable features. For example, suppose we want to train a model to figure out whether to grant or deny a loan, and we train it to match the decisions of human loan officers. Now, suppose it turns out that many loan officers have unconscious prejudices that cause them to deny loans more often to green people and grant loans more often to blue people (substitute whatever categories you like for blue/green). The model might wind up with an explicit weight that makes this implicit discrimination explicit. If the model is relatively small and interpretable this weight can be found and perhaps eliminated.

But if that model could chat with us it would replicate the speech of the loan officers, many of whom sincerely believe that they treat green people and blue people fairly. So interpretability can't be about somehow asking the model to justify itself. We may need the equivalent of a debugger.

joe_the_user · on Dec 22, 2022

I don't think anyone has come up with an unambiguous definition of "interpretable". I mean, often people assume that, for example, a statement like "it's a cat because it has fur, whiskers and pointy ears" is interpretable because it's a logical conjunction of conditions. But a logical conjunction of a thousand vague conditions could easily be completely opaque. It's a bit like the way SQL initially advanced, years ago, as "natural language interface" and simple SQL statements are a bit like natural language but large SQL statements tend to be more incomprehensible than even ordinary computer programs.

If you ask ChatGPT to explain its answers, it will generate explanations that a human being can interpret -- even if the explanations are wrong!

The funny thing is that yeah, LLMs often come up with correct method-description for wrong answers and wrong method-descriptions for right answers. Human language is quite slippery and humans do this too. Human beings tend to start loose but tighten things up over time - LLMs are kind of randomly tight and loose. Maybe this can be tuned but I think "lack of actual understanding" will make this difficult.

visarga · on Dec 22, 2022

The fact that both humans and LMs can give interpretable justifications makes me think intelligence was actually in the language. It comes from language learning and problem solving with language, and gets saved back into language as we validate more of our ideas.

bilsbie · on Dec 22, 2022

I think you’re on to something. I wonder if there’s anyone working on this idea. I’d be curious to research it more.

Llamamoe · on Dec 23, 2022

I don't understand the abstract. What does it do in plain language?

londons_explore · on Dec 22, 2022

Are people 'interpretable'?

If you ask an art expert 'how much will this painting sell for at auction', he might reply '$450k'. And when questioned, he'll probably have a long answer about the brush strokes being more detailed than this other painting by the same artist, but it being worth less due to surface damage...

If our 'black box' ML models could give a similar long answer when asked 'why', would that solve the need? Because ChatGPT is getting close to being able to do just that...

ketralnis · on Dec 22, 2022

If you tell that same art expert that it actually sold for $200k, they'll happily give you a post-hoc justification for that too. ChatGPT is equally good at that, you can ask it all sorts of "why" questions about falsehoods and it will confidently muse with the best armchair expert.

IgorPartola · on Dec 22, 2022

What I am curious about: are there GPT type large language models that don’t have the same restrictions as the ones we’ve seen so far. For example, I remember having great fun reading some political parody blogs about 10 years ago that I thought would be kinda fun to recreate with AI but all implementations I’ve seen refuse to generate anything that would be considered remotely offensive by anyone which means no satire.

screye · on Dec 23, 2022

It is actually the other way round. The models would be even better if their overlords decided to let it train and learn anything unencumbered. These limitations are deliberate and ethical.

One example is that stable diffusion (SD) (think GPT for images) does a pretty bad job of rendering humans. It also isn't trained on NSFW data. Now, people took these SD models and trained them on pornographic images. Turns out, this new model while excellent at generating NSFW images, also became really good at creating humans in general.

There are similar gains that we are ethically leaving on the table. IMO, it's for the best. The field is moving fast enough as it is.

hackernewds · on Dec 23, 2022

Food for thought if it is deliberate, who defines what is "ethical"? Models are inherently biased by all sorts of unconscious biases

totetsu · on Dec 22, 2022

Ones mans tool for automatically generating endless political satire is another mans tool for missinformation/political spam to drown out real speech/hate speech generator. You might try and find a version you can run on your own hardware.

firsttimebigboy · on Dec 22, 2022

that sounds like a purely political/administrative decision and not a technical restriction. There's plenty of offensive material to be trained from on the internet and language models should have no problem generating offensive material (plenty of stories of twitter trained bots spewing out nazi propagnda)

danuker · on Dec 22, 2022

If interpretability is sufficiently important, you could straight-up search for mathematical formulae.

My SymReg library pops to mind. I'm thinking of rewriting it in multithreaded Julia this holiday season.

https://github.com/danuker/symreg

heyitsguay · on Dec 22, 2022

How often are closed-form equations actually useful for real world problem domains? When i did my PhD in applied math, they mostly came up in abstracted toy problems. Then you get into the real world data or a need for realistic modeling and it's numerical methods everywhere.

mellavora · on Dec 23, 2022

Well, Black-Scholes has proved pretty useful. With the caveat that all models are wrong-- and most people using B-S know this.

Which is why actual option prices have the "smile", with tail prices being higher than the model would predict (because traders know that the model underestimates tail risk, and generally have a good sense of how far it underestimates it, because the model is fairly transparent).

Because B-S is closed form, you can run it backwards, to convert actual prices to an implied volatility.

Which is also known to be wrong, because historical standard deviations of returns are only somewhat predictive of future observed returns.

As one person put it, Black-Scholes is the wrong model, into which you put the wrong data, to get the right answer.

chimeracoder · on Dec 22, 2022

> How often are closed-form equations actually useful for real world problem domains? When i did my PhD in applied math, they mostly came up in abstracted toy problems. Then you get into the real world data or a need for realistic modeling and it's numerical methods everywhere.

And closed-form equations are themselves almost always simplified or abstracted models derived from real-world observations.

danuker · on Dec 22, 2022

I find them most useful when there are many variables, or when I can see there's a relationship but I don't feel like trying out equation forms manually.

It is indeed of limited use, since often I can spot the relationship visually. And once I get the general equation I can easily transform the data to get a linear regression.

version_five · on Dec 23, 2022

Engineering?

UncleOxidant · on Dec 22, 2022

Would be interested to see this in Julia.

moelf · on Dec 22, 2022

https://github.com/MilesCranmer/SymbolicRegression.jl

danuker · on Dec 22, 2022

Wow! I should probably join forces with this project instead.

fxtentacle · on Dec 22, 2022

We blow up model sizes to reduce the risk of overfitting and to speed up training. So yes, usually you can shrink the finished model by 99% with a bit of normalization, quantization and sparseness.

Also, plenty of "deep learning" tasks work equally well with decision trees if you use the right feature extractors.

optimalsolver · on Dec 23, 2022

I thought increasing model size raises the risk of overfitting, rather than the opposite?

jakearmitage · on Dec 22, 2022

What are feature extractors?

danuker · on Dec 22, 2022

I suspect features created manually from the data (as opposed to solely using the raw data): https://en.wikipedia.org/wiki/Feature_(computer_vision)#Extr...

nsxwolf · on Dec 22, 2022

"black box models have led to mistakes in bail and parole decisions in criminal justice"

Lolwut? Does your average regular person know machine learning is used to make these decisions at all?

WhitneyLand · on Dec 22, 2022

Instead of a rigorous CS oriented paper, it (the article referenced by Dr. Rudin) seems more like an editorial on the risks of using AI for consequential decisions. It proposes using simpler models and the benefits of explainable vs interpretable AI in these cases.

However it seems to deal more with problems of perception in AI and how things might be better in the ideal rather than present any specific results.

Maybe I’m missing something, not sure of the insight here? I agree it’s an important issue and laudable goal.

topspin · on Dec 22, 2022

The paper content might not be all that but the fact of its existence is interesting to me.

Lines are being drawn around what AI can be permitted to engage; if the methods aren't understood then they can't be controlled, and there are large domains where the Powers That Be won't tolerate what they can't control.

Of course, it is at least as likely that the outcomes are actually less arbitrary than the conventional, and this this is what causes consternation.

whoevercares · on Dec 22, 2022

Good luck. AWS and other big AI/ML infra providers will give exact opposite incentives - they want you to train larger models with larger clusters. And they will sponsor those folks researching on larger models for their own business.

HWR_14 · on Dec 22, 2022

Isn't TikTok's recommendation engine famously a fairly simple machine learning model? Where simple means they really honed it down to the most important factors?

alecco · on Dec 22, 2022

Do Simpler Models Exist and How Can We Find Them? (2019)

Cynthia Rudin, Professor of Computer Science, Electrical and Computer Engineering, and Statistical Science, Duke University

https://pdfs.semanticscholar.org/7b99/2c46800f8913c251259c1d...

aputsiak · on Dec 22, 2022

Petar Veličković et al has a concept of geometric deep learning, see this forthcoming book: https://geometricdeeplearning.com/ There is also the Categories for AI, cats.for.ai, course which deals with the applying category theory into ML.

derbOac · on Dec 22, 2022

Does anyone have recommendations on papers on current definitions of interpretability and explainability?

Oranguru · on Dec 23, 2022

This technical report from the European Commission was very useful in the research I conducted on this topic earlier this year.

https://publications.jrc.ec.europa.eu/repository/handle/JRC1...

henning · on Dec 23, 2022

Maybe don't use machine learning for criminal justice situations, you fucking irresponsible assholes?

Animats · on Dec 22, 2022

This is someone's commentary Actual articles on this.[1] Mostly paywalled, unfortunately.

This seems to be Rudin's thing - finding equally accurate but simpler models for things upon which deep learning can be trained. Where is there something on this that's not paywalled?

[1] https://ece.duke.edu/faculty/cynthia-rudin

alecco · on Dec 22, 2022

Do Simpler Models Exist and How Can We Find Them? (2019)

Cynthia Rudin, Professor of Computer Science, Electrical and Computer Engineering, and Statistical Science, Duke University

https://pdfs.semanticscholar.org/7b99/2c46800f8913c251259c1d...

gyulai · on Dec 23, 2022

I found the opening chapters of "Computational Learning Theory" by Kearns & Vazirani [1] really eye-opening in this regard.

It starts using the example of "rectangle" learning, where data points are encoded as points in a 2-dimensional space, just like many other learning algorithms start by encoding things as points in N-dimensional space.

But then, instead of launching straight into decision trees or k-means or Gaussians or anything like that, the book does something strange: It uses "rectangle" as a machine learning model. The idea is that some points are marked with one label (red) others with another (blue). You want to fit a model to distinguish red from blue.

So you just take the min/max of all the x/y corrdinates that you've seen in your training data to get a lower-left and upper-right point defining a rectangle and predict that all points inside the rectangle are red.

Then the book continues with all the usual stuff like figuring out error surfaces, computing precision/recall and so forth and then introduces concepts like big-O analysis of data complexity, overfitting problems by having a sense for the kolmogorov complexity in the model space versus amount of data, etc. etc.

This really blew my mind, that, at the point where the "magic" was supposed to come in, introducing neural networks or whatever, they just skip over that, sort of saying: That's not important yet, let's just use a simple thing like 'rectangle' as a placeholder for now as it will serve our purpose just as well.

So: Any method for selecting a model (e.g. particular rectangle) from a meta-model (e.g. set of all rectangles) can be a valid machine learning method and there really is no "magic" to neural networks. What really matters is whether the model is well-suited to the phenomenon being modelled, given the way the data is encoded. So it makes a lot of sense to obsess over preprocessing of data and to come up with custom models to fit custom modelling needs. It makes very little sense to chase all the newest and greatest breakthroughs in [machine learning flavour of the week].

This has been pretty much how I have been doing machine learning over the course of 10 years, working as a Data Scientist.

My advice to people starting out in this field: By walking down this path, you will have predictably successful project outcomes while at the same time flushing your career down the tube. ...your ignorant bosses will never see you as the smart guy in the room, if everyone else is talking about neural networks and you are talking about rectangles.

[1] https://a.co/d/5ZEyQKg