Vid2Seq: A pretrained visual language model for describing multi-event videos (googleblog.com)
87 points by og_kalu on March 17, 2023 | 16 comments



Had this idea about 5 years ago; seems like it might be viable now. Would love to see a video analyzer that creates relationship graphs using sentiment analysis. Ex. who responds to whom, what their tone is, how often. Hopefully with modern methods, it wouldn't take too many examples to pick out the dimensionality of expression (cheating out vs in, loudness + arousal, whining on the spectrum of playful to hurt, confidence, clarity).
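Concretely, you could picture it as a directed multigraph where each "A responds to B" moment becomes an edge annotated with tone and arousal. A rough sketch of that structure (all names and fields here are made up for illustration, nothing comes from Vid2Seq itself):

    import networkx as nx

    # Each observed response becomes one edge, so "how often" is just edge multiplicity.
    G = nx.MultiDiGraph()

    def record_exchange(graph, speaker, addressee, tone, arousal):
        """Log one 'speaker responds to addressee' event with its expressive read."""
        graph.add_edge(speaker, addressee, tone=tone, arousal=arousal)

    # These values would come from a captioning + sentiment pipeline, per scene.
    record_exchange(G, "Alice", "Bob", tone="playful", arousal=0.7)
    record_exchange(G, "Bob", "Alice", tone="hurt", arousal=0.4)

    print(G.number_of_edges("Alice", "Bob"))  # how often Alice responds to Bob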

Could test against TV shows and see if it gets an understanding of the social dynamics.

Plus it could uncover a lot of the editing technique - I forget what the term is - where they create context from unrelated scenes by cutting from one to the other.

Would also pick up the general plot formula pretty quickly by mapping out the relative intensity and direction (action, tense, playful, romantic) of scenes.

I remember reading in the New Yorker a while back about a startup that did this or something similar for TV shows + movies; the idea was that they could predict how well a show would do from a pilot or even the script.


> Hopefully with modern methods, it wouldn't take too many examples to pick out the dimensionality of expression (cheating out vs in, loudness + arousal, whining on the spectrum of playful to hurt, confidence, clarity)

My first thoughts were also how this would affect the porn industry…



Huh, had no idea that existed. Looks a lot like this project.

But no, that's not it at all. The sentiment analysis buried at the bottom there is generic. It's advertised for aggregate customer mood because that's the only thing it's sensitive enough to detect: thumbs up or down.

I mean something like https://m.youtube.com/watch?v=8dagzaFjHU4&t=2m, the expressions in the eyes, ex. concern - solicitous - complacent/satisfied - steely from 2:10 - 2:20.

That + the same for voice tone would be everything.


> when they create context from unrelated scenes by cutting from one to the other.

Do you mean juxtaposition?


Was thinking of https://en.m.wikipedia.org/wiki/Kuleshov_effect.

That's in the same vein, though.


I believe that this type of video captioning will be able to fill the gaps in LLMs' common-sense knowledge and understanding of the world and its physics. It should also be useful for robots.


Very cool work but I'm a bit perplexed by their first example/diagram from the blog post (which is presumably cherry-picked?). The event "The dogs are waiting." overlapping with the event "The dogs are pulling the sled." seems like a poor joint labeling of the events. The two obviously cannot co-occur, and this feels like a pretty easy opportunity for the model to demonstrate its understanding of event disentanglement.

The remaining examples from the paper don't do much in the way of convincing me this is a one-off issue. The recognition of multiple events globally is good, but perhaps extra care should be taken at overlapping event boundaries (e.g. additional local constraints in the loss/regularization scheme that encourage event splitting or time boundaries "snapping-to-grid" if confidence of co-occurrence is low).
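Something like the following toy penalty (my own sketch, not anything from the paper) captures the idea: temporal overlap between two predicted events is penalized in proportion to how implausible their co-occurrence is judged to be.

    def overlap_penalty(events, cooccur_prob):
        """events: list of (start_sec, end_sec) for predicted events.
        cooccur_prob(i, j): model's estimate that events i and j can co-occur."""
        penalty = 0.0
        for i in range(len(events)):
            for j in range(i + 1, len(events)):
                start = max(events[i][0], events[j][0])
                end = min(events[i][1], events[j][1])
                overlap = max(0.0, end - start)
                # "Waiting" overlapping "pulling the sled" should cost a lot,
                # since the two are judged unlikely to co-occur.
                penalty += overlap * (1.0 - cooccur_prob(i, j))
        return penalty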


Anyone figured out how to run this against a video?

https://github.com/google-research/scenic/tree/main/scenic/p... has an example showing how to "train Vid2Seq on YouCook2" using "python -m scenic.projects.vid2seq.main", but I couldn't see the recipe for using it against a video to return a description.
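My best guess (this isn't documented in the repo, so it's an assumption) is that you'd first have to preprocess your video the way the paper describes - frames sampled at roughly 1 fps and fed through an image encoder. A sketch of just that frame-sampling step; feature extraction and checkpoint loading are left out because I couldn't find them documented:

    import cv2

    def sample_frames(video_path, fps_out=1.0):
        """Yield frames from video_path at roughly fps_out frames per second."""
        cap = cv2.VideoCapture(video_path)
        src_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if fps is unknown
        step = max(1, int(round(src_fps / fps_out)))
        idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % step == 0:
                yield frame  # BGR ndarray; feature extraction would happen downstream
            idx += 1
        cap.release()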


One of those cool Google projects that doesn't take off because it's so hard to run and integrate with other cool things...


Taking the less cynical view, it'll end up in YouTube as audio description for people with visual impairments.


It'll become a standard option in CCTV and, by extension, a tool for overly authoritarian bosses.

"the employee is talking to X, the employee is working, the employee is going to the toilet"


It is perfect for Audio Description [1]. Certain broadcast TV programs here in Australia have had this, allowing those with poor vision to follow along - it lets them appreciate content that was originally produced as audio-visual.

[1] https://en.wikipedia.org/wiki/Audio_description?wprov=sfla1


This isn’t really a “google project” in the way I think about that term; it’s a research project. Google’s research is constantly advancing, and when things get far enough along they do tend to get used in production. Individual research papers are just a step along the way. This research seems useful for training video generation systems like transformers and especially multimodal systems.

Imagine you have a robot that needs to understand the world around it. It needs to interpret text input (likely as voice), but it also needs to understand complex scenes around it. If you can get a system to accurately describe YouTube videos (a captive data set), then it should also be able to understand a live video feed on a robot. That’s an important part of a robot. But it is not in itself a product or notable project.


I suspect this or something like it has already “taken off”, but only internally at Google. They seem to be a lot better at searching the contents of YouTube videos than simple transcripts or image detection would allow.


an augmented LM with time tokens to predict event boundaries and textual descriptions in the same output sequence
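In other words, event times are quantized into special tokens and interleaved with the caption text, so boundaries and descriptions come out as one sequence. A toy illustration of what such a target might look like (the bin count and token format here are invented for the example, not the paper's exact scheme):

    def build_target_sequence(events, video_duration, num_time_bins=100):
        """events: list of (start_sec, end_sec, caption)."""
        tokens = []
        for start, end, caption in events:
            start_bin = int(start / video_duration * (num_time_bins - 1))
            end_bin = int(end / video_duration * (num_time_bins - 1))
            tokens += [f"<time_{start_bin}>", f"<time_{end_bin}>"] + caption.split()
        return tokens

    print(build_target_sequence(
        [(0.0, 5.0, "The dogs are waiting."),
         (5.0, 20.0, "The dogs are pulling the sled.")],
        video_duration=20.0))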



