Had this idea about 5 years ago, seems like it might be viable now, would love to see a video analyzer that creates relationship graphs using sentiment analysis. Ex. who responds to whom, what their tone is, how often. Hopefully with modern methods, it wouldn't take too many examples to pick out the dimensionality of expression (cheating out vs in, loudness + arousal, whining on the spectrum of playful to hurt, confidence, clarity)
Could test against TV shows and see if it gets an understanding of the social dynamics.
Plus could uncover a lot of the editing technique, I forget what the term is, when they create context from unrelated scenes by cutting from one to the other.
Would also pick up the general plot formula pretty quickly by mapping out the relative intensity and direction (action, tense, playful, romantic) of scenes.
I remember reading about a startup that did this or something similar for TV shows + movies a while back in the New Yorker, the idea was that they could predict how well it would do from a pilot or even the script.
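The graph idea above could be prototyped even before any video model exists. A minimal sketch, assuming the analyzer emits (speaker, responder, sentiment) tuples per conversational turn (the tuple format and toy data here are hypothetical, just to show the aggregation):

```python
# Aggregate hypothetical (speaker, responder, sentiment) turns into a
# "who responds to whom, how often, and in what tone" relationship graph.
from collections import defaultdict

def build_relationship_graph(turns):
    """turns: iterable of (speaker, responder, sentiment) tuples,
    where sentiment is a float in [-1, 1] (negative = hostile)."""
    edges = defaultdict(lambda: {"count": 0, "sentiment_sum": 0.0})
    for speaker, responder, sentiment in turns:
        e = edges[(speaker, responder)]
        e["count"] += 1
        e["sentiment_sum"] += sentiment
    # Directed edges: (speaker -> responder) with interaction count and mean tone.
    return {
        pair: {"count": e["count"],
               "mean_sentiment": e["sentiment_sum"] / e["count"]}
        for pair, e in edges.items()
    }

# Toy data standing in for per-scene model output.
turns = [
    ("alice", "bob", 0.8),
    ("alice", "bob", 0.4),
    ("bob", "alice", -0.2),
]
graph = build_relationship_graph(turns)
```

With real per-scene outputs you'd additionally keep timestamps per edge, so the social dynamics can be plotted over an episode rather than collapsed to one number.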
> Hopefully with modern methods, it wouldn't take too many examples to pick out the dimensionality of expression (cheating out vs in, loudness + arousal, whining on the spectrum of playful to hurt, confidence, clarity)
My first thought was also how this would affect the porn industry…
Huh, had no idea that existed. Looks a lot like this project.
But no, that's not it at all. The sentiment analysis buried at the bottom there is generic. It's advertised for aggregate customer mood because that's the only thing it's sensitive enough to detect: thumbs up or down.
I believe that this type of video captioning will be able to fill the gaps in LLMs' common-sense knowledge and understanding of the world and its physics. It should also be useful for robots.
Very cool work but I'm a bit perplexed by their first example/diagram from the blog post (which is presumably cherry-picked?). The event "The dogs are waiting." overlapping with the event "The dogs are pulling the sled." seems like a poor joint labeling of the events. The two obviously cannot co-occur, and this feels like a pretty easy opportunity for the model to demonstrate its understanding of event disentanglement.
The remaining examples from the paper don't do much in the way of convincing me this is a one-off issue. The recognition of multiple events globally is good, but perhaps extra care should be taken at overlapping event boundaries (e.g. additional local constraints in the loss/regularization scheme that encourage event splitting or time boundaries "snapping-to-grid" if confidence of co-occurrence is low).
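To make the suggestion concrete, here is a minimal sketch of such a local constraint (my own construction, not anything from the paper): a penalty that grows with the temporal overlap of two predicted events, scaled by how unlikely the model considers their co-occurrence. The function names and the confidence input are assumptions for illustration.

```python
# Hedged sketch of a co-occurrence regularizer: penalize temporal overlap
# between predicted events in proportion to how unlikely the model thinks
# their co-occurrence is, nudging boundaries of incompatible events apart.
def overlap(a, b):
    """Overlap length of two (start, end) intervals; 0.0 if disjoint."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def cooccurrence_penalty(events, cooccur_conf):
    """events: list of (start, end) intervals (seconds).
    cooccur_conf: dict mapping index pairs (i, j) to the model's confidence
    in [0, 1] that events i and j genuinely co-occur.
    Returns a scalar penalty: large when low-confidence pairs overlap a lot."""
    penalty = 0.0
    for (i, j), conf in cooccur_conf.items():
        penalty += (1.0 - conf) * overlap(events[i], events[j])
    return penalty

# "The dogs are waiting" vs "The dogs are pulling the sled": 5s of overlap
# with only 0.1 co-occurrence confidence yields a large penalty.
events = [(0.0, 10.0), (5.0, 20.0)]
penalty = cooccurrence_penalty(events, {(0, 1): 0.1})
```

In a real training loop the intervals and confidences would be differentiable model outputs, and this term would be added to the captioning loss with some weight; the sketch just shows the shape of the constraint.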
Anyone figured out how to run this against a video?
https://github.com/google-research/scenic/tree/main/scenic/p... has an example showing how to "train Vid2Seq on YouCook2" using "python -m scenic.projects.vid2seq.main", but I couldn't see the recipe for using it against a video to return a description.
It is perfect for Audio Description [1]. Certain broadcast TV programs here in Australia have had this to allow those with poor vision to follow along - it lets them appreciate content that was originally produced as audio-visual. [1] https://en.wikipedia.org/wiki/Audio_description?wprov=sfla1
This isn’t really a “google project” in the way I think about that term; it’s a research project. Google’s research is constantly advancing, and when things get far enough along they do tend to get used in production. Individual research papers are just a step along the way. This research seems useful for training video generation systems like transformers, and especially multimodal systems. Imagine you have a robot that needs to understand the world around it. It needs to interpret text input (likely as voice), but it also needs to understand complex scenes around it. If you can get a system to accurately describe YouTube videos (a captive data set), then it should also be able to understand a live video feed on a robot. That’s an important part of a robot. But it is not in itself a product or notable project.
I suspect this or something like it has already “taken off”, but only internally at Google. They seem to be a lot better at searching the contents of YouTube videos than simple transcripts or image detection alone would allow.