Google certainly had influential papers on language models having emergent behaviors, but this isn't the one that inspired scaling up GPT. It was published August 2022 and GPTs at OpenAI were getting scaled up long before this.
This paper was interesting at the time but the main sentence in it was wrong: “Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models.” Such behaviors can totally be predicted if one looks at how the likelihoods of these behaviors evolve with scale and extrapolates from the smaller models.
Maybe you can predict emergent abilities post hoc, but I recall no one predicting beforehand that, for example, a pure language model could do translation simply by giving it a prompt that said “translate to French”.
I watched some of this after it was promoted to me on YouTube.
It is very much: Look at this other cool black box that Google has made - I think it has a helpful personality.
And sure: Applied AI / ML is a part of computer science, but I had hoped it would be more of a walk-through of what has happened in terms of advances in theory, algorithms, architecture, training methodology, and perhaps some sort of explanation of why it works to give this general-purpose model a bunch of medical data. (Is it merely an elaborate fuzzy search? How does it extrapolate? Is there actually reasoning going on, and how does it emerge from a neural net, etc...)
What I hate about the ML community is that papers have become ad pieces now. I'm all for them releasing technical papers, but do we have to fuck up the review system for it? We don't need to railroad our research community, and doing so is pretty risky. I don't buy the "scale is all you need" side. But if you do, it doesn't matter if I'm wrong; we'll just get there when Sam gets his $7T. But if I'm right, we need new ideas if we're going to keep progressing without missing a step.
Domain adaptation across verticals is the only driver of innovation. Check out the NeSy computation engine. Its "semantic parsing" is domain parsing and its symbols are numeric. Scale if you want. It works for images. That's all that matters, right?
Mapping language onto the latent space of images gets you crude semantic attributes sometimes. If you have to push for multimodal out of the gate maybe start with articulatory perception. These LVMs aren't going to cut it.
ML research isn't meant to further your understanding of anything. You can't separate it from corporate interest and land grabbing. It's the same thing re-hashed every year by the same people. NLP is pretty much a subfield of IR at this point.
I love how fast my comment was blacklisted to the bottom of this thread. lol
I think you need to understand your audience better. Remember that most people have a very shallow understanding of DL systems. You've got blog posts that talk about attention from a mathematical perspective but don't even hint at softmax tempering and just mention that there's a dot product. Or where ML researchers don't know things like why doubling the batch size doesn't cut training time in half, or why using fp16 doesn't cut memory in half. So I think toning down the language, being clearer, and adding links will help you be more successful in communicating.
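For example, here's roughly the gap I mean between "there's a dot product" and the full story - a minimal NumPy sketch (shapes and values made up) of scaled dot-product attention with the softmax temperature made explicit:

    import numpy as np

    def softmax(z, temperature=1.0):
        # Dividing the logits by a temperature is the "tempering" knob:
        # low T sharpens the distribution, high T flattens it.
        z = z / temperature
        z = z - z.max(axis=-1, keepdims=True)   # numerical stability
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def scaled_dot_product_attention(Q, K, V, temperature=1.0):
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)          # the "dot product" part
        weights = softmax(scores, temperature)   # the part blog posts skip
        return weights @ V

    Q = np.random.randn(4, 8)
    K = np.random.randn(6, 8)
    V = np.random.randn(6, 8)
    out = scaled_dot_product_attention(Q, K, V, temperature=0.5)
    print(out.shape)   # (4, 8)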
Please let me know what's not clear. Still no takers on my ML is CV comment below.
Research on on-device ASR and computer vision is primarily driven by the same organizations that stand to benefit the most from it. It's nearly impossible to talk about machine learning today without talking about computer vision. Just look at the daily papers from Hugging Face or any other outlet. Machine learning is basically synonymous with computer vision and natural language processing is basically synonymous with information retrieval. Research is corporate research. With very few exceptions.
Two popular lines of current ML research, multimodal and this latest neuro-symbolic re-hash, are not about furthering our understanding of what we currently can and cannot do. Not about doing science. They are about maintaining the status quo. They are about short-form video content and Google Knowledge Panels.
Multimodal ML research is a vision-first enterprise. This doesn't make sense for a number of reasons. Here are two: the latent space of images and the representations therein cannot adequately capture the nuances and expressivity of language; language is a more fundamental cognitive process than vision.
And what do you get for language in current ML research? How about a "neuro-symbolic semantic parser" that is neither a semantic parser nor symbolic. lol https://arxiv.org/abs/2402.00854v1 What's it good for? Computation graphs, domain adaptation, Google Knowledge Panels.
This is a directed attack on machine learning research taking concepts that could be pursued with scientific merit, such as multimodal perception and neuro-symbolic parsing, but are instead turned into marketing hype and leveraged by the powers that be for the things that keep them in power. My audience is anyone participating in this research.
Y Combinator... that's the mob of onewheels and URB-E scooters dodging human waste in the Tenderloin on their commute from Nob Hill to the Mission, right? Maybe you are referring to another audience.
I think it would help if you link "NeSy computation engine". I'm actually not familiar with this (not in the symbolic world, but interested. Just never had time, so if you've got links here I'd personally appreciate it). I can find the workshop but not the engine. Maybe bad google-fu.
>> Domain adaptation across verticals
This is also a bit vague and so I'm not sure what you _specifically_ mean.
> ... onto the latent space of images gets you crude semantic attributes sometimes... These LVMs aren't going to cut it.
I'm with you here, but it is controversial and most people don't understand what a latent vector machine is or what the alternatives are. Remember, people think you can easily measure distances with L1, L2, or cosine and that these metrics are well defined in R^{3x256x256} (or just fucking R^10). So they use t-SNE and UMAPs to look at latent spaces for smooth semantic properties. I think the problem is that math is taught by a game of telephone. Really, all research is. Choices were made for historical reasons, but when an assumption isn't revisited after enough time it stops being recognized as an assumption at all. I mean, we can mention manifolds too. Or even probability distributions or i.i.d. Reminds me that I should update my lecture slides to make these things clearer lol.
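To make the R^{3x256x256} point concrete, a tiny sketch (dimensions chosen arbitrarily, 100k standing in for 3x256x256) of how the relative spread of pairwise L2 distances between random points collapses as dimensionality grows - part of why naively reading semantics off those distances is dicey:

    import numpy as np
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)
    for d in (10, 1_000, 100_000):
        x = rng.standard_normal((100, d))          # 100 random points in R^d
        dists = pdist(x)                           # all pairwise L2 distances
        # relative spread (std / mean) shrinks roughly like 1/sqrt(d)
        print(d, round(dists.std() / dists.mean(), 4))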
But that said, I still think vectors can do a lot, especially since vectors and functions are interchangeable representations. Though I think we need to do a lot more to ensure that networks are capable of learning things like equivariance and, importantly, abstract concepts. I don't see how current systems could calculate something like an ideal of a ring. But maybe someone has some formulation.
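On equivariance, a toy sanity check of the kind of property I mean (circular 1-D convolution is translation-equivariant by construction; whether a trained network actually learns such properties is the open question):

    import numpy as np

    def shift(x, k):
        return np.roll(x, k, axis=-1)

    def conv1d(x, w):
        # circular 1-D convolution: translation-equivariant by construction
        n = len(x)
        return np.array([sum(w[j] * x[(i + j) % n] for j in range(len(w)))
                         for i in range(n)])

    x = np.random.randn(16)
    w = np.random.randn(3)
    lhs = conv1d(shift(x, 2), w)   # shift then map
    rhs = shift(conv1d(x, w), 2)   # map then shift
    print(np.allclose(lhs, rhs))   # True: f(shift(x)) == shift(f(x))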
I'm also with you on the complexity aspect. I find it silly when a Sr Director is trying to convince people Sora is learning physics while showing videos where a glass empties its contents, then spills, and neither shatters nor plastically deforms but liquefies (https://twitter.com/DrJimFan/status/1758549500585808071). I'm not sure these people understand what a physics model or a world model is, since there is no coherence. I mean look, we're dealing with people who think the stacking example proves a world model but don't understand how failing on a simple counter-example disproves such notions. You're right that there isn't enough subtlety and care to understand how information leakage happens and how a lot of prompting techniques or follow-ups give away the answer rather than tease one out.
> ML research isn't meant to further your understanding of anything. You can't separate it from corporate interest and land grabbing. It's the same thing re-hashed every year by the same people. NLP is pretty much a subfield of IR at this point.
I think that's a bit too exaggerated, but hey, I've been known to say that ML research is captured by industry and we're railroading everything. And that it is silly we publish papers on GPT when we don't have the models in hand, as it just becomes free work for OpenAI and we can't verify the work because OAI will change things. But I also don't know what you mean by "IR". I'm more on the CV side though, but, like above, it's not as if there's a big difference.
> Still no takers on my ML is CV comment below.
Honestly I don't know what you mean by this. But if you are saying that the divide we create, like NLP vs CV, is dumb, then I'm all with you. I also think it's silly what we call generative models. Aren't all models generative? Yann talking about JEPAs does not give me anything to go on. But then again, no one has a definition for generative model, and it doesn't seem like anyone cares to. Well, at least one that would be consistent and include GANs, VAEs, NFs, diffusion, and EBMs.
> My audience is anyone participating in this research.
That includes me, and even I have a hard time parsing what you're saying and it doesn't help with the side snipes like URB-E scooters. I have no idea what that even is. I definitely get the feeling of gaslighting and railroading. But I've just come to accept the fact that people drank the kool aid. I think people like Jim are true believers and really do believe that they are right. So it doesn't help to talk like this. You gotta meet them at their level. The scaling people will lose out and we're just gonna have to be patient. My take is if I'm wrong, so what, give Sam his $7T, we get AGI and we win. He's going to get his opportunity to scale no matter how much funding we can get into alternative views. But if I'm right and you need more than scale, then we better keep working because I'd rather not have another AI winter. I also think it is quite odd for these companies to not be hedging their bets a little and more strongly funding other avenues. Especially those that are not already the biggest of the biggest, because where you gonna get 500 racks of H100s to compete?
At this point, all I'm trying to get people around me in ML to understand is how nuance matters. That alone is a difficult battle. I'm just told to throw compute and data at a problem with no concern for the quality of that data. It does not matter how much proof I generate to show that a model is overfit; as long as the validation loss doesn't diverge, they don't believe me. ¯\_(ツ)_/¯
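A minimal sketch of that argument with made-up numbers: the validation loss never "diverges", it just plateaus, while the train/val gap keeps widening - which is exactly the overfitting signal that gets waved away.

    # Made-up loss curves for illustration only.
    train_loss = [2.1, 1.2, 0.6, 0.2, 0.05, 0.01]
    val_loss   = [2.2, 1.6, 1.4, 1.35, 1.34, 1.34]   # never diverges, just plateaus

    for t, v in zip(train_loss, val_loss):
        print(f"train={t:.2f}  val={v:.2f}  gap={v - t:.2f}")
    # The gap keeps widening while val loss stays flat: the model is memorizing,
    # even though the "validation loss doesn't diverge" check never fires.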
>>> I think it would help if you link "NeSy computation engine". I'm actually not familiar with this (not in the symbolic world, but interested. Just never had time, so if you got links here I'd personally appreciate it). I can find the workshop but not the engine.
I linked to it in my previous comment. I'm referring to the "NeSy computation engine" described there. I didn't know there was a "NeSy" workshop; this paper was my first encounter with the term.
I think it's interesting that you mention symbolic world like it is separate from some other world. There's the AI that was and the AI that is today. There's the AI over there and the AI over here. Whenever you hear someone mention symbolic in the context of AI go ahead and grab a chair because immediately after this they are going to talk about cyc and John McCarthy for at least 20 min. If you're lucky they might throw some Prolog in there.
I don't think this is productive and I don't think there is another symbolic world. There is just the world. There are certain things in the world for which a numeric, directional representation makes sense. There are other things for which it makes no sense at all. It's my view that primitives in language are one of these things. Additionally, there are certain places where it makes sense to consider these representational approaches and other places where it only makes political sense. Lastly, there are symbols - atomic primitives - and there are "symbols," objects with vectors in them and who knows what else.
What's striking to me about this paper is the coverage of formal grammars and semantic parsing entirely within the context of domain adaptation. Definitely the best part is the coverage of compositionality (https://ncatlab.org/nlab/show/compositionality) in the context of composing computational graphs. This is striking to me because all of these things (except domain adaptation) are essential to any reasonable theory of meaning but they are covered as if they've been repurposed for the practical application of populating Google Knowledge Panels, which I believe is exactly what happened. Check out the definitions of semantic parser and symbol.
>> Domain adaptation across verticals
>>> This is also a bit vague and so I'm not sure what you _specifically_ mean.
Crude semantic attributes pulled from character sequences and mapped onto the latent space of images have utility in business contexts if the mapping for some term sufficiently distinguishes it from the mapping of another term that has the same surface form. It ends there. GloVe was a half-baked representation of meaning in language when it followed word2vec in 2014. GPT-2 grabbed the torch in 2019. It still doesn't work. Well, it sometimes works for adapting a general model to a specific domain such as a business vertical, but only in a crude and superficial way. Note that almost no ML research today discusses this representational issue at all, and that almost all ML research takes this representation as a starting point. If you decide to publish hyperparameters in your paper, such as in an appendix, the ones related to vocab size and the dimensionality of your embedding space often aren't even considered worth mentioning. That's fine, I guess, because they don't mean anything anyway, but not talking about this, in my view, is not fine.
Check out the Mamba paper for example. Like most of ML research today the focus is on optimization. The representation problem has been solved so there's no need to talk about it: we map everything onto the latent space of images because short-form video content rules the day and that's how dude is gonna hit his 7T: advertising ([link redacted]).
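To make the hyperparameter point above concrete, a minimal sketch with hypothetical sizes: the embedding table is just a (vocab_size x dim) lookup, and nothing about the choice of 32k rows or 768 columns says anything about meaning on its own.

    import numpy as np

    vocab_size, dim = 32_000, 768                 # hypothetical, arbitrary choices
    rng = np.random.default_rng(0)
    embedding = rng.standard_normal((vocab_size, dim))

    token_ids = np.array([17, 4242, 31999])       # whatever the tokenizer produced
    vectors = embedding[token_ids]                # (3, 768): the "representation"
    print(vectors.shape)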
>>> I think the problem is that math is taught by a game of telephone.
I think that, for language, the ML research community is, by and large, not even using the right maths.
>>> But that said, I still think vectors can do a lot. Especially since vectors and functions are interchangeable representations. Though I think we need to do a lot more to ensure that networks are capable of learning things like equivariance and importantly abstract concepts.
Thank you so much for highlighting the importance of equivariance. I think this is a crucial concept for work at the cross-modal interfaces, especially in the context of the Curry-Howard correspondence, or, more recently, the Curry-Howard-Lambek correspondence. Right now the ML (CV) research community is labeling nouns with bounding boxes... lol. If that doesn't illustrate the fact that multimodal work is a vision-first enterprise I don't know what will.
>>> I think a bit too exaggerated but hey, I've been known to say that ML research is captured by industry and we're railroading everything. And that it is silly we publish papers on GPT when we don't have the models in hand as it just becomes free work for OpenAI and we can't verify the works because OAI will change things.
Check out the evaluation criteria in that "NeSy" paper, especially the metric that's supposed to tell you something about what the system was designed to do. I'm sure OpenAI is happy to have this info about their system.
>>> But I also don't know what you mean by "IR".
Ten years ago I considered NLP adjacent to information retrieval. Today I consider it part of information retrieval. There's very little work published today that suggests otherwise.
>>> Honestly I don't know what you mean by this. But if you are saying that the divide we create like NLP vs CV is dumb, then I'm all with you.
It is not my intention at all to create or highlight any divide. If there is indeed a known divide between CV and NLP I don't know anything about it, I don't want to know anything about it and it's not surprising.
>>> I also think it's silly how we call generative models. Aren't all models generative?
Generative refers to a situation where you begin with a finite set of things and productively form any number of well-formed expressions from these things.
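A toy illustration of that sense of "generative" (the grammar is made up): a finite rule set productively yields an unbounded number of well-formed expressions.

    import random

    grammar = {
        "S":   [["NP", "VP"]],
        "NP":  [["the", "N"], ["the", "ADJ", "N"]],
        "VP":  [["V", "NP"]],
        "N":   [["model"], ["image"], ["sentence"]],
        "ADJ": [["large"], ["multimodal"]],
        "V":   [["generates"], ["retrieves"]],
    }

    def generate(symbol="S"):
        # Nonterminals expand via a rule; anything else is a terminal word.
        if symbol not in grammar:
            return [symbol]
        expansion = random.choice(grammar[symbol])
        return [tok for part in expansion for tok in generate(part)]

    print(" ".join(generate()))   # e.g. "the large model retrieves the image"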
>>> That includes me, and even I have a hard time parsing what you're saying and it doesn't help with the side snipes like URB-E scooters.
I'll take potshots at the Paul Grahams and Steve Jobs of the world every day and not lose any sleep over it. If they take their AirPods out of their ears maybe they'll hear me coming.
>>> But if I'm right and you need more than scale, then we better keep working because I'd rather not have another AI winter.
All I have to say about scaling is that, for language, I hope it's clear by now that more data and more params are not going to improve the situation. I can see how this is almost never the case for vision.
Damn it, somebody said AI winter again. You aren't going to start talking about Cyc and McCarthy for 20 min now, are you?
>>> I also think it is quite odd for these companies to not be hedging their bets a little and more strongly funding other avenues.
The formula works.
>>> At this point, all I'm trying to get people around me in ML to understand is how nuance matters. That alone is a difficult battle. I'm just told to throw compute at a problem and data with no concern to the quality of that data. It does not matter how much proof I generate to show that a model is overfit, as long as the validation loss doesn't diverge, they don't believe me.
I'm interested in learning more about what you mean by nuance.
Probably just needs more compute and data. Just throw some synthetic data in there and call it.
Hi, my partner worked in technology strategy for a large healthcare system where she saw many derm AI-based applications evaluated over the last decade. (Dis)incentives aside, they all followed a similar story arc: overpromising in their research findings and underdelivering in actual care delivery. They've been around for longer than you'd imagine. I hope they reach their potential, but I suggest approaching them with healthy skepticism.
I think this is the time when the idiom "it's good fishing in troubled waters" makes complete sense.
This is a moment where it is wise to wait and see. Investing in building AI/AGI startups now is like being a fish rather than a fisherman. An outlier could win market share but will be only one among zillions. Google is catching up to OpenAI, but their competition is fruitful for all. Typical oligopoly. A single startup showing good traction will be immediately acquired.
From the business perspective, it is time to focus on go-to-market execution and little or nothing on research. The research results will come along on their own, unless you are one of the top scientific teams in the world or someone like Ramanujan.
One thing I think should be fairly clear is Google will actually try to compete in this space (vs quitting if it isn't an immediate success as they usually do). Traditional web search will be fairly uncommon in the long run I think.
Care to share some of the topics things are moving towards?
I understand diffusion, GANs, and Mamba are in vogue these days, but those are different logical architectures. I am unsure where next-level ML physical architecture research is heading.
I think at this rate, everything is moving towards Transformer-based models (text/audio/image/video). As Sora has shown, there isn't really anything a Transformer can't do; it can generate both real-life-quality photos and video. Its ability to fit ANY given distribution is beyond compare. It is the most powerful neural network we have ever designed, and nothing else is even close.
GANs, on the contrary, are not hot any more in industry. Diffusion models have achieved such high fidelity in image generation that it is hard to see how GANs can make a comeback. They are faster, but image generation is largely done in terms of quality; the wow factor is no more.
This might be a hot take, but I think architectural change is going to die down in industry; the Transformer is the new MOS transistor. With billions of dollars being pumped into making it run faster AND cheaper, alternative architectures are going to have a hard time competing.
There is no question in my mind that the transformer architecture will not stop evolving. Even now, we are stretching the definition by calling current models transformers; the 2017 transformer had an encoder block that is nearly always absent nowadays, and the positional encoding and multi-head attention have been substantially modified.
VRAM costs and latency constraints will drive architectural changes, which Mamba hints at: we will have less quadratic scaling in the architectures that transformers evolve into, and likely the attention mechanism will look more and more akin to database retrieval (with far more evolved querying mechanisms than are often seen in relational databases). One day, the notion of a maximum context size will be archaeological. Breaking the sequentiality of predicting only the next token would also improve throughput, which could require changes. I expect mixture-of-experts to evolve into new forms of sparsity as well. More easily quantizable architectures may also emerge.
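A back-of-the-envelope sketch of the VRAM pressure behind that (the sizes are assumptions, and fused kernels like FlashAttention avoid materializing this matrix, but the quadratic compute remains):

    # Naively materialized attention scores: seq_len^2 entries per head.
    bytes_per_entry = 2          # fp16
    heads = 32                   # assumed head count
    for seq_len in (4_096, 32_768, 262_144, 1_048_576):
        score_bytes = seq_len ** 2 * bytes_per_entry * heads
        print(f"{seq_len:>9} tokens -> {score_bytes / 2**30:,.0f} GiB of attention scores")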
The original transformer is an encoder-decoder model, and the decoder is what led to the first GPT model. Except that in the original proposal you need to feed the encoder states into the decoder's attention module, it is basically the same decoder-only model. I would argue the decoder-only model is even simpler in that regard.
When it comes to the core attention mechanism, it has been surprisingly stable compared to other techniques in neural networks. There is the QKV projection, then dot-product attention, then two layers of FFN. Arguably the most influential change to attention itself is multi-query/grouped attention, but that is still, IMO, a reasonably small change.
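A minimal sketch of that recipe in NumPy (single head, no layer norm, made-up sizes): the QKV projection, causally masked dot-product attention, and a two-layer FFN, with residual connections.

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def decoder_block(x, Wq, Wk, Wv, Wo, W1, W2):
        T, d = x.shape
        q, k, v = x @ Wq, x @ Wk, x @ Wv                  # QKV projection
        scores = q @ k.T / np.sqrt(d)                     # dot-product attention
        scores += np.triu(np.full((T, T), -1e9), k=1)     # causal mask (decoder-only)
        x = x + softmax(scores) @ v @ Wo                  # residual
        x = x + np.maximum(x @ W1, 0) @ W2                # two-layer FFN, residual
        return x

    d, T = 16, 8
    rng = np.random.default_rng(0)
    p = lambda *s: rng.standard_normal(s) * 0.1
    out = decoder_block(p(T, d), p(d, d), p(d, d), p(d, d), p(d, d),
                        p(d, 4 * d), p(4 * d, d))
    print(out.shape)   # (8, 16)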
If you look back at convolutional NNs, their shapes and operators changed every six months back in the day.
At the same time, the original transformer is still a useful architecture today, even in production; some BERT models must still be hanging around.
Not that I am saying it didn't change at all, but the core has stayed very stable across countless revisions. If you read the original transformer paper, you already understand 80% of what a Llama model does; the same can't be said for other models, which is what I meant.
I see from comments I'm far from the only one using AI to summarise videos before deciding whether to watch them.
Reminds me of the meme "why spend 10 minutes doing something when you can spend a week automating it", i.e. "why spend an hour watching a talk when you can spend 5 hours summarising it with AI and debating the summary's accuracy".
This sounds silly but potential gains from learning AI summarisation tooling/flows are large, hence why it warrants discussion. Learning how to summarise effectively might save hours per week and improve decisions about which sources deserve our limited time/attention.
I feel like I'm missing some boat, but I'm not sure what boat it is. These "AI" systems seem very superficial to me, and they give me the same feeling as VR does. When I see VR be some terrible approximation of reality, it just makes me feel like I'm wasting my time in it, when I could go experience the real thing. Same with AI "augmentation" tooling. Why don't I just read a book instead of getting some unpredictable (or predictably unpredictable) synopsis? It's not like there's too much specific information there. These tools are just exploding the amount of unspecific information. Who has ever said: "hey, I have too much information for building this system or learning this topic"? Basically no one.
It's just going to move everything to the middle of the Bell curve, leaving the wings to die in obscurity.
I was thinking the other day: Star Trek computers make a lot of sense if they are working with our current level of AI.
You can talk to it, it can give you back answers that are mostly correct to many questions, but you don't really trust it. You have real people pilot the ship, aim and fire weapons, and anything else important.
And nobody in Star Trek thinks the ship computer is sentient. On the other hand, the holodeck sometimes malfunctions and some holodeck character (like Moriarty) becomes sentient running on just a subset of the ship computer. That suggests sentience (in the Star Trek universe) is a property of the software architecture, not hardware.
If you know a book’s worth reading, going ahead and reading it works well. But for a lot of books/talks there’s competition for time - e.g. my bookshelf has 20 half-read books (this is after triaging out the ones that aren’t worthy of my time) - any tooling that can help better determine where to invest tens or hundreds of hours of my time is a win.
Regarding accuracy, I think we’re at a tipping point where ease of use and accuracy are starting to make it worth the effort. For example, Bard seems to know about YouTube videos (just a couple of months ago you’d have to download the video -> audio to text -> feed into an LLM). So the combination of greater accuracy and much easier use makes it worth considering.
> If you know a book’s worth reading, going ahead and reading it works well. But for a lot of books/talks there’s competition for time - e.g. my bookshelf has 20 half-read books (this is after triaging out the ones that aren’t worthy of my time) - any tooling that can help better determine where to invest tens or hundreds of hours of my time is a win.
Is it that hard to determine that a book is worth reading where worth is measured from your perspective? It's usually pretty easy, at least for technical books. Fiction books are another story, but that's life. Having some unknown stochastic system giving me a decision based upon some unknown statistical data is not something I'm particularly interested in. I'm interested in my stochastic system and decision making. Trying to automate life away is a fool's errand.
> Is it that hard to determine that a book is worth reading
I'm a huge believer in doing plenty of research about what to read. The simple rationale: it takes a tiny amount of time to learn about a book relative to the time it takes to read it. Even when I get a sense a book is bad, I still tend to spend at least a couple of hours before making the tough call not to bother reading further (I handled one literally 5 minutes ago that wasted a good few hours of my life). I'm not saying AI summaries solve this problem entirely, but they're just one additional avenue for consultation that might only take a minute or two and potentially save hours. It might improve my hit rate from - I dunno - 70% to 80%. Same idea for videos/articles/other media.
I get where you're coming from and definitely vet books in similar ways depending on the subject, but I also feel like this process is pretty limited in ways too and appeals to some sort of objective third party that just doesn't exist. If you really want to know or have an opinion on a work/theory/book at the end of the day you have to engage with it yourself on some level.
In graduate school for example, it was pretty painfully obvious that most people didn't actually read a book and come to their own conclusions, but rather read summaries from people they already agreed with and worked backwards from there, especially on more theoretical matters.
I feel like in the long term this just leads to a person superficially knowing a lot about a wide variety of topics, but never truly going deep and gaining real understanding of any of them - it's less "knowing" and more the feeling of knowing.
Again, not saying this in an accusatory way because I totally do engage in this behavior too, I think everyone does to some degree, but I just feel the older I get, the less valuable this sort of information is. It's great for broad context and certain situations I suppose, but in a lot of areas I consider myself an expert, I would probably strongly disagree with summaries given on subjects and they also tend to miss finer details or qualifying points that are addressed with proper context.
I think the more you outsource "what is worth my time" the less you're actually getting an answer about what's worth YOUR time. The more you rule out the possibility of surprise up front, the less well-informed your assumption about worth can possibly be.
There are FAR too many dimensions like word choice, sentence style, allusion, etc, that resist effective summarization.
LLM accuracy is so bad, especially in summarization, that I now have to fact check google search results because they’ve been repeatedly wrong about things like the hours restaurants are open.
There's a huge difference between summarizing a stable document that was part of the training data or the prompt, and knowing ephemeral facts like restaurant hours.
Technically true statement. If you're offering it to imply that the GP bears responsibility for knowing what document was in the training data and what's not, I have to quibble with you.
Knowing its shortcomings should be the responsibility of the search app, which is currently designed to give screen real estate to the wrong summary of the ephemeral fact. Otherwise, users will start to lose trust.
IMHO, the good old method of skimming through the table of contents, reading the preface and perhaps the first couple of chapters is going to be a much higher fidelity indicator of whether a book is worth your time than reading an AI generated summary.
I had a conversation with a friend where he suggested that he had had a broad range of experiences just from gaming. I think the context was a conversation about how experiences in life can expand you — something like that.
The whole premise bothered me though.
I can remember a bike ride where I was experiencing the onset of heat stroke and had to make quick decisions to perhaps save my life.
I remember, decades ago, being lost in Michigan's Upper Peninsula with the wife, apparently on some logging road, the truck getting into deeper and deeper snow as we proceeded, until I made the decision to turn around and go back the way we came lest we become stranded in the middle of nowhere.
I remember having to use my wits, make difficult decisions while hitchhiking from Anchorage, Alaska to the lower 48 when I was in my early twenties....
The actual world, the chance of actual death, strangers, serendipity ... no amount of VR or AI really compares.
You're not wrong, but I also think the problem predates video games. Films, novels and even religious texts all are scrutinized for changing people's perspective on life. Fiction has a longstanding hold on society, but it inherently coexists with the "harsh reality" of survival and resource competition. Introducing video games into the equation is like re-hashing the centuries old Alice in Wonderland debate.
Playing video games all day isn't an enriching or well-rounded use of time, but neither is throwing yourself into danger and risk all the time. The real world is a game of carefully-considered strategy, where our response to hypothetical situations informs our performance during real ones. Careful reflection on fiction can be a philosophically powerful tool, for good or bad.
I just read the “Robust Python” book. My overall reaction is that the book could have been written at half the length and still be valuable for me. I can't stop thinking that if I could ask an LLM to summarize each chapter for me, I could still "read" the whole book in the manner the author outlined but save a ton of time.
> Why don't I just read a book instead of getting some unpredictable (or predictably unpredictable) synposis? It's not like there's too much specific information there.
I'm trying to understand this comment, because I couldn't disagree more. It is the absolute explosion of available data sources that has me wanting to be much more judicious with where I spend my time reading/watching in the first place.
Your comment was interesting to me because I feel like I agree with one of its main sentiments: that AI generated content all kinda "sounds the same" and gives a superficial-feeling analysis. But that is why I think AI is a fantastic tool for summarizing existing information sources so I can see if I want to spend anymore time digging in to begin with.
Relying on glorified matrices (that's what machine learning is) for world data curation is just begging to handicap yourself into a cyborg dependent on a mainframe computer's implementation. An implementation and design that is rarely scrutinized for safety and alignment features.
Why not just make your brain smarter, instead of trying to cram foreign silicon components into your skull?
Because maximizing both the biological vectors of self-improvement and the computing-based avenues of skill acquisition is a multi-objective optimization problem once you combine them. Optimizing one de-optimizes the other. Biology and computers conflict with each other, in fact. So, at best, you have to reach for a Pareto frontier.
And, it turns out, technology can't be trusted, as there is always some sort of black box associated with its employment. Formally, there is always a comprehension involved when it comes to the development and integration of technology into human life. You can't really trust this stubborn built-in feature of technological and economic success if you don't pierce through its secrets (knowledge is the power to counteract cryptographic objects). After all, it could be a malicious trojan horse that "basic common sense" insists on us all using for "bettering" our daily lives.
A very unfriendly artificial intelligence is trying to sneak through civilization for its own desires. And you're letting it just pass on by, as a result of your compliance with the dominant narrative and philosophy of capitalist economics.
Perhaps consider simply reading the description for an accurate summary.
From the description:
> Abstract: In this talk I’ll highlight several exciting trends in the field of AI and machine learning. Through a combination of improved algorithms and major efficiency improvements in ML-specialized hardware, we are now able to build much more capable, general purpose machine learning systems than ever before. As one example of this, I’ll give an overview of the Gemini family of multimodal models and their capabilities. These new models and approaches have dramatic implications for applying ML to many problems in the world, and I’ll highlight some of these applications in science, engineering, and health. This talk will present work done by many people at Google.
In this case the video description contains a useful Abstract. AI summaries can offer additional value though, going into more/less detail (as desired), and allowing you ask follow up questions to drill into anything potentially of interest.
Sure, it's an accurate summary, but is it at the granularity or specificity that you want? LLM summaries let you move around the latent space of summaries, and you probably don't agree with the one chosen for the YouTube description.
This is exactly why I built https://www.askyoutube.ai. It helps you figure out if a video has the answer you want before you spend time watching it. It does this by aggregating information from multiple videos in one-go.
I don't think it completely replaces watching videos in some cases but it definitely helps you skip the fluff and even guides you to the right point in the video.
I've A/B tested this with webinars, and the tools I've tried tend to miss some really valuable/interesting stuff even when I give them the full transcript. Same goes for when I try to use ChatGPT or other tools for full interactive analysis: even when I basically hand it what I'm looking for, as if I hadn't watched the video, it will leave out the critical information.
My approach wasn't fancy; I just asked Bard (aka Gemini). I was drawn to Bard/Gemini for this since the source video is on YouTube, so I figured Google would better support its related service (although that was an arbitrary hunch).
This is pretty cool. Would it be possible to just stream the audio directly into Whisper, maybe using something like VLC, at 2x play speed to get the summary faster?
Probably; the OpenAI API has gotten a lot better since I made that post, though if you stream audio at 2x speed you have to expect a drop in quality, since most of the clips Whisper is trained on are not at 2x.
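Not streaming, but a rough sketch of the batch version under those assumptions (hypothetical filenames; needs ffmpeg on PATH and the open-source openai-whisper package rather than the hosted API):

    import subprocess
    import whisper   # pip install openai-whisper

    # Speed the audio up 2x with ffmpeg's atempo filter, then transcribe the
    # shorter file. Expect some accuracy loss: 2x speech is out of distribution.
    subprocess.run(
        ["ffmpeg", "-y", "-i", "talk.mp3", "-filter:a", "atempo=2.0", "talk_2x.mp3"],
        check=True,
    )

    model = whisper.load_model("base")
    result = model.transcribe("talk_2x.mp3")
    print(result["text"][:500])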
It's not really giving summaries but gives topic/section timestamps and highlights what was discussed
(for example: The Transformer Model (21:06 - 24:48) - Introduction of the Transformer model as a more efficient alternative to recurrent models for language processing)
The main focus is actually creating Anki-like spaced repetition questions/flashcards for videos and lectures you watch to retain knowledge, but I found the section information quite helpful for finding which parts of the video contain the info relating to topics/concepts
I have no idea if it's the best (or even a good) tool. Other commenters suggest some other tools (for both text summaries and condensed video summaries - a sort of 'highlights reel'):
It's not really giving summaries but gives topic/section timestamps and highlights what was discussed.
(Main focus is actually making mini-courses off of YouTube videos but I found the section summaries really useful for figuring out which parts to watch)
Thanks for the post - I put together the tool above. I tried to strike a balance between being concise but also capturing all the important details. For that reason, the tool is hit or miss on longer (> 45 min) videos - the summary on this video is good but I've seen it omit important details on other long videos. The tool also captures relevant screenshots for each section.
Hopefully it's helpful. You can summarize additional videos by submitting a youtube URL in the nav bar or the home page. Also, feedback welcome!
(To give credit, this started from https://summarize.tech/ and then I hand-edited it down and deleted generic commentary while I read the summary. Submitted here for people who also prefer to read rather than watch a 1-hour video.)
Actual human opinion: I think nothing strictly new was disclosed here, but it's always nice to get, in one hour, a high-level overview of "how Jeff Dean looks at the world", which aligns you with at least the public-facing understanding of what Google wants us to understand about their work. Gemini 1.5 has made a huge splash by being genuinely better than GPT-4 in some respects (see the recent Reddit HVM post) but doesn't seem to have been covered in any detail here. I was also not aware of Bar, which seems to be "Google LMSys".
Appreciate the feedback. This was a weekend project.
I wanted to explore summarization without paraphrasing, so that the output was a cut-up version of the input video. Agreed that in terms of conceptual clarity, a synthesized textual summary often comes out ahead.
Also one of the people who built the modern AI world: from the data centers to the bulk processing software to the team structures to TensorFlow, to some of the most cost-effective chips in the field, to many of the early blockbuster results in the field.
They certainly fumbled by not investing in massive LLM scaling early enough, but Jeff Dean has been planning this day since his graduate work on neural networks in the 90s.
Nonetheless, I can empathize with GP, I wish the talk focused on the future more, and on history and marketing of Google a lot less. Yes, we get it, Google used to lead in this space, still does in some narrow niches, but recounting the glory days is not how you win mindshare. Show me the demos, make me excited about your vision.
I would've been impressed by this 2 years ago. I think it's got to the point where real, valuable AI is in the hands of the everyday consumer, so we start judging the models for ourselves. Having seen Google continually get crushed over the past year, a bunch of benchmarks just fail to impress. In particular in this case, they're comparing their latest model to GPT-4, which hasn't changed that much in almost a year.
Not only that, in some cases they’re comparing apples to oranges as well, undermining their credibility further. Eg chain-of-thought vs non-CoT results. I don’t even know why they’re doing that, seems like their results would be impressive enough even without this.
Reading the paper, they're saying convolutions are powerful enough to express any possible architecture. But that's just computational universality - it does not mean those architectures are convolutions.
Every matrix multiply can be viewed as a 1x1 convolution, just like every convolution can be viewed as an (implicit, for larger than 1x1 kernels) matrix multiply.
I'm not sure this is particularly enlightening, but it's probably one small step of understanding that is required to truly "get" the underlying math.
Edit: Should have said that every matrix multiply against weights is a convolution. The matrix multiply in the standard attention block (query times keys) really doesn't make sense to be seen as a convolution.
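A quick PyTorch sanity check of that equivalence (shapes are arbitrary): a 1x1 convolution applies the same (C_out x C_in) matrix at every spatial position, i.e. it is a matmul against weights.

    import torch
    import torch.nn as nn

    x = torch.randn(1, 8, 4, 4)                      # (batch, C_in, H, W)
    conv = nn.Conv2d(8, 16, kernel_size=1, bias=False)

    W = conv.weight.view(16, 8)                      # reuse the conv weights as a plain matrix
    y_conv = conv(x)                                 # (1, 16, 4, 4)
    y_matmul = torch.einsum("oc,bchw->bohw", W, x)   # the same map, written as a matmul per pixel

    print(torch.allclose(y_conv, y_matmul, atol=1e-6))   # True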
The transformer module is currently dominating ML, and is widely used in text, vision, audio, and video models. It was introduced in 2017 and shows no real signs of being displaced. It has no convolutions.
If they use dot products on at least one layer with fully-connected inputs, which they do, along with everything else derived from the basic MLP model, then they're technically performing convolution.
Of course, the convolution concept breaks down when nonlinear activation functions are introduced, so I'm not sure the equivalence is really all that profound.
I don't think a dot product between high dimensional vectors is considered a convolution? I'm familiar with convolution between continuous functions, and with kernels in neural networks providing invariance. I'd love to learn more if you have any links that expand on your statement.
I think anything that doesn't have an explicit convolution layer? Transformer, MLP, RNN don't automatically have a convolution layer, although for many tasks you can add it in if you want.
There's a typo in the title of the talk but don't worry I fixed it: Trends in Computer Vision
"In recent years, ML has completely changed our view of what is possible with computers"
In recent years ML has completely changed our view of what's possible with computer vision.
"Increasing scale delivers better results"
This is true for computer vision.
"The kinds of computations we want to run and the hardware on which we run them is changing dramatically"
Optimizations on operations for computer vision aren't exactly a dramatic change. Who is "we"?
Trends in Machine Learning, 2010: semantic search for advertising.
Trends in Machine Learning, 2024: semantic search for advertising, short-form video content.