Hey HN! I'm very proud to release the deepest interview/deep dive into the SAM model I could find on the Internet (seriously, I looked on YouTube and ListenNotes and all of them were pretty superficial). The Roboflow team has spent the past week hacking on and building with SAM, and when I ran into Joseph Nelson this weekend I realized he might be the perfect non-Meta-AI person to discuss what it means for developers building with SAM.
So... enjoy! I worked really hard on the prep and editing; any feedback and suggestions/recommendations are welcome. Still new to AI and new to the podcast game.
It's easy to have someone lay down 30 points for a simple banana-shaped outline and compare segmentation to that, but how does this compare to other automatic techniques like spectral matting (which is now 16 years old)?
Deep methods are a vast improvement over classical computer vision techniques. Classical techniques can be thought of as a function of the raw pixel data; deep learning techniques understand the context and are more likely to segment the way a human would.
Spectral matting, as I understand it, is used for separating the subject/foreground from the background.
It's easy to say something is better, but good computer graphics and computer vision papers compare themselves to the state of the art.
This ignores all that and compares itself to the most manual method possible. That does happen sometimes, but it's more of a marketing stunt aimed at people who don't follow current research.
Any requests for under-covered topics? I felt like this one resonated because somehow the other podcasters/YouTubers seemed to miss how big of a deal it was. Hungry for more.
> Distilling these big, slow vision transformer models into something that can be used in realtime on the edge is going to be huge.
Something I didn't quite get to: does Roboflow do this for you, or are you pointing to more work that you'd like to happen someday (possibly done by Roboflow, possibly someone else)? Also, are you worried about the business model if people can distill models to run on their own devices (so they don't need to pay you anymore)?
This is the core of what we do! Previously our job was distilling human knowledge into a model; now that knowledge is starting to come from bigger models with humans managing the objectives vs doing the labor.
> Also, are you worried about the business model if people can distill models to run on their own devices (so they don't need to pay you anymore)?
This is probably a risk to the current business model over the long term, but we're constantly working on reinventing ourselves & finding new ways to provide value. If we don't adapt to the changing world we deserve to go out of business someday. I'd much rather help build the thing that makes us obsolete than sit idly by while someone else builds it.
I think of this risk similarly to the way that I'm marginally worried that, in the long run, AGI will obviate the need for my job. Probably true, but the opportunities it will present are far greater & it's better to focus on how to be valuable in the future than cling to how I provide value today.
Out of my depth, but can the SAM outputs be mapped to 6DOF models of the objects directly? Or would you still need to use the resulting dataset to train 6DOF (or keypoints, for that matter)?
The real hard problem in CV (and science in general) is bad papers that omit useful info.
Segment Anything requires an image embedding. They report in the paper that segmentation takes ~50ms, but conveniently leave out that computing an embedding of an image (640x480) in their model takes ~2+ seconds (on a 3080 Ti). Well, at least they released all the code and model and enough instructions to figure that part out.
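For anyone who wants to see the split themselves, here's a minimal timing sketch against the released segment_anything package; the ViT-H checkpoint filename comes from their repo, and the all-zeros image and single point prompt are just placeholders, so treat the numbers as illustrative:

```python
import time

import numpy as np
import torch
from segment_anything import sam_model_registry, SamPredictor

# Load the heavy ViT-H image encoder (checkpoint file from the official repo).
device = "cuda" if torch.cuda.is_available() else "cpu"
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to(device)
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a real 640x480 RGB frame

# Stage 1: the image embedding (this is the slow part on consumer GPUs).
t0 = time.time()
predictor.set_image(image)
if device == "cuda":
    torch.cuda.synchronize()
print(f"image embedding: {time.time() - t0:.2f}s")

# Stage 2: prompt encoding + mask decoding (the ~50 ms number from the paper).
t0 = time.time()
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),  # arbitrary single point prompt
    point_labels=np.array([1]),           # 1 = foreground point
    multimask_output=True,
)
print(f"mask decode: {time.time() - t0:.3f}s, masks: {masks.shape}")
```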
There are papers that omit useful info; this one certainly ain't one of them. The FAQ on their website mentions that computing the image embedding takes 0.15 seconds on an A100.
The SAM team released the entire codebase, the model weights, the entire dataset, and the details of their training recipe. I can't believe you're calling it a bad paper for not mentioning the embedding generation time in the paper. Seriously? It's a model with a few hundred million parameters that produces a 256x64x64 embedding; of course it's gonna take some time.
OK, I can see everybody is bothered that I said the paper was "bad". A more productive way to say it: "Many great papers still omit information that implementors would like to see spelled out".
0.15 seconds (150ms) would be almost good enough for me (I aim for 25 FPS on my microscope, and achieve that with full high quality object detection), but it's 2 seconds on my 2080 (and 2 seconds on my 3080).
These sorts of things are important for implementors (even the 0.15 seconds would have been useful to include in the paper) because we can read the paper and reject it as a solution without having to download any code or run any experiments. It even took a while to get a good answer to why inference is so slow (I was using the Python scripts, not the notebook, which mentions that image embedding is super slow).
Note they report 55ms inference; I guess that's also on an A100. My 2080 and 3080 both take over a second to do inference after the embedding. Looking at inference performance of the A100 vs. the 3080, it doesn't seem to make sense that the A100 would be that much faster (I wonder if they are running many batches and then dividing by the batch size?).
As an ex-scientist, I've come across many well-regarded papers that omitted stating something explicitly, and it's only once the first or second reimplementor finishes that we learn something important was left out. I don't think the authors were being intentionally misleading, and I'm sure this product is nice, but so far, in my hands, it has not been that great, and it would have been nice if they'd prominently stated the image embedding time, since it's absolutely necessary before prompt and mask decoding.
"It even took a while to get a good answer to why inference is so slow" No it didn't, if you read their paper they mention multiple times what parts take most of the time.
There is only so much one can write in a paper. Meta team did a great job writing down every detail that matters to people trying to reproduce the results or build on their architecture. Not a single researcher I know complained that the authors missed out important details.
You are picking on tiny, trivial details that anyone in the area can figure out in a few minutes and making a big deal out of it. It is a research paper, not a product with detailed documentation or a spec sheet.
They mention this multiple times in the paper. For example, in the “Limitations” section they write: “SAM can process prompts in real-time, but nevertheless SAM’s overall performance is not real-time when using a heavy image encoder.”
This paper is one of the highest quality papers released this year. I wish more papers were so clear and informative.
Yes, I read that section. They should have included the time required by the heavy image encoder, unless you know of a way to make SAM work with another encoder.
I don't think the paper itself is misleading. I taught SAM earlier this week for my Frontiers in Deep Learning course and showed a figure from the paper with how long each component takes, where they separate the components run on a GPU vs. on a CPU/in the web browser.
The thing is, when you segment everything in a scene, sometimes those segments are really parts of a single object, say a laptop, and it starts segmenting the trackpad, individual keys, and screen separately. Then you need another algorithm or human intervention to say this segment is pointless, etc., a noise filter.
Which is not bad imo. In fact, SAM actively proposes a fix for this: bounding-box models are relatively easy to train, and SAM can take a rough bounding box of the laptop as an input prompt and create a detailed segmentation of the laptop. Their webpage has an example with a bunch of cats.
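A minimal sketch of that flow with the released SamPredictor API; the checkpoint filename is from the official repo, and the image and box coordinates are made up here (in practice the box would come from your own detector):

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = np.zeros((720, 1280, 3), dtype=np.uint8)  # stand-in for your RGB frame
predictor.set_image(image)

# A rough laptop box (x0, y0, x1, y1) from any cheap detector; coordinates are made up here.
laptop_box = np.array([100, 150, 900, 650])

# SAM refines the coarse box into one detailed mask for the whole laptop,
# instead of splitting it into keys / trackpad / screen.
masks, scores, _ = predictor.predict(box=laptop_box, multimask_output=False)
```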
SAM is good at (i) producing detailed masks around segments and (ii) taking a wide range of prompts as input to decide what exactly the user wants segmented, and it processes those prompts with very low compute requirements.
I think SAM is a very well designed architecture and I'm not sure how it could be much better. Coming back to your question, there has to be some signal that the user wants a segmentation of the whole laptop, and SAM takes exactly that as a prompt input.
I don't know much about this field, but are the segments given as a flat list or with more structure, e.g., a tree? If the keys are given as children of the laptop, then I don't think breaking things down in detail would get in the way of a user.
I tried to run the automatic_mask_generator notebook [1], but the results are noisier than the "everything" mode on the demo website. Are the parameters used in the demo published anywhere?
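I haven't seen the demo's exact settings published either. The knobs below are the documented SamAutomaticMaskGenerator parameters; the specific values are just a starting point for tuning, not the demo's, and the image is a placeholder:

```python
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
image = np.zeros((720, 1280, 3), dtype=np.uint8)  # your RGB image here

# Tighter thresholds plus a minimum region area tend to suppress noisy masks.
mask_generator = SamAutomaticMaskGenerator(
    model=sam,
    points_per_side=32,           # density of the point-prompt grid
    pred_iou_thresh=0.92,         # drop masks the model scores as low quality
    stability_score_thresh=0.96,  # drop masks that are unstable under thresholding
    min_mask_region_area=500,     # remove tiny disconnected regions (needs opencv)
)
masks = mask_generator.generate(image)  # list of dicts: 'segmentation', 'area', 'bbox', ...
```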
The problem being solved is enabling AI to distinguish unique objects within visual data. Before SAM, you had to label data and train a model on specific objects for it to understand those objects. That becomes problematic given the variety of objects in the world, the settings they can be in, and their orientation in an image. SAM can identify objects it has never seen before, i.e., objects that were not part of its training data.
Once you can determine which pixels belong to which object automatically, you can start to utilize that knowledge for other applications.
If you have SAM showing you all the objects, you can use other models to identify what each object is, understand its shape/size, understand depth/distance, etc. It's a foundational model to build off of for any application that wants to use visual data as an input.
Yep, the value is pretty clear from his demo. It goes from dozens of clicks to identify an object within an image to a single click. SAM does almost exactly what you'd want as a human in every one of his examples.
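As a trivial illustration of that pipeline, you could crop each SAM mask out of the frame and hand it to whatever downstream model you like; the helper and the classifier below are hypothetical, but the 'bbox' (XYWH) and 'segmentation' fields are what SamAutomaticMaskGenerator.generate() returns:

```python
import numpy as np

def crop_objects(image, masks, pad=8):
    """Cut each SAM-segmented object out of the frame for a downstream model."""
    crops = []
    for m in masks:  # dicts as returned by SamAutomaticMaskGenerator.generate()
        x, y, w, h = (int(v) for v in m["bbox"])  # bbox is in XYWH format
        x0, y0 = max(x - pad, 0), max(y - pad, 0)
        x1, y1 = min(x + w + pad, image.shape[1]), min(y + h + pad, image.shape[0])
        crop = image[y0:y1, x0:x1].copy()
        crop[~m["segmentation"][y0:y1, x0:x1]] = 0  # blank out pixels outside the object
        crops.append(crop)
    return crops

# crops = crop_objects(image, masks)
# labels = [my_classifier(crop) for crop in crops]  # hypothetical downstream classifier
```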
Fun fact: I just sent some money to someone offering to do YouTube Shorts for me, so we might try this. Honestly, I don't think Shorts are conducive to deep conversation or technical topics though; people watch with their brains off.
I've had some pretty remarkable results pasting lecture transcripts from YouTube into GPT-4 and getting well-formatted, relevant Markdown summaries out of meandering and mis-transcribed content! It needs chunking up, but it's surprisingly effective. It can even generate YouTube URLs with the right timestamps if you ask it nicely.
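The chunking part is pretty mechanical; here's a rough sketch of that kind of chunk-and-summarize loop, using naive character-based chunks and the legacy openai chat API (the prompt wording and chunk size are just assumptions, not anything official):

```python
import openai  # legacy 0.x SDK style; the newer SDK uses a client object instead

def summarize_transcript(transcript: str, chunk_chars: int = 8000) -> str:
    """Naively chunk a long transcript and summarize each chunk as markdown."""
    chunks = [transcript[i:i + chunk_chars] for i in range(0, len(transcript), chunk_chars)]
    parts = []
    for chunk in chunks:
        resp = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[
                {"role": "system",
                 "content": "Summarize this lecture transcript chunk as concise markdown "
                            "bullet points, fixing obvious transcription errors."},
                {"role": "user", "content": chunk},
            ],
        )
        parts.append(resp["choices"][0]["message"]["content"])
    return "\n\n".join(parts)
```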
It's less configurable than what you're describing, but I've found this useful in at least determining if a given video has the content I'm looking for: https://www.summarize.tech/
Edit: the video demo is here in case people miss it: https://youtu.be/SZQSF-A-WkA