Hey HN! I'm very proud to release the deepest interview/deep dive into the SAM model I could find on the Internet (seriously, I looked on YouTube and ListenNotes and all of them were pretty superficial). The Roboflow team has spent the past week hacking on and building with SAM, and when I ran into Joseph Nelson this weekend I realized he might be the perfect non-Meta-AI person to discuss what it means for developers building with SAM.
So... enjoy! I worked really hard on the prep and editing; any feedback and suggestions/recommendations are welcome. Still new to AI and new to the podcast game.
It's easy to have someone lay down 30 points for a simple banana-shaped outline and compare segmentation to that, but how does this compare to other automatic techniques like spectral matting (which is now 16 years old)?
Deep methods are a vast improvement over classical computer vision techniques. Classical techniques can be thought of as a function of the raw pixel data; deep learning techniques understand the context and are more likely to segment the way a human would.
Spectral matting, as I understand it, is used for separating the subject/foreground from the background.
It's easy to say something is better, but good computer graphics and computer vision papers compare themselves to the state of the art.
This ignores all that and compares itself to the most manual method possible. That does happen sometimes, but it's more of a marketing stunt aimed at people who don't follow current research.
Any requests for under-covered topics? I felt like this one resonated because somehow the other podcasters/YouTubers seemed to miss how big of a deal it was. Hungry for more.
> Distilling these big, slow vision transformer models into something that can be used in realtime on the edge is going to be huge.
Something I didn't quite get to: does Roboflow do this for you, or are you pointing to more work that you'd like to happen someday (possibly done by Roboflow, possibly someone else)? Also, are you worried about the business model if people can distill models to run on their own devices (so they don't need to pay you anymore)?
This is the core of what we do! Previously our job was distilling human knowledge into a model; now that knowledge is starting to come from bigger models with humans managing the objectives vs doing the labor.
> Also, are you worried about the business model if people can distill models to run on their own devices (so they don't need to pay you anymore)?
This is probably a risk to the current business model over the long term, but we're constantly working on reinventing ourselves & finding new ways to provide value. If we don't adapt to the changing world we deserve to go out of business someday. I'd much rather help build the thing that makes us obsolete than sit idly by while someone else builds it.
I think of this risk similarly to the way that I'm marginally worried that, in the long run, AGI will obviate the need for my job. Probably true, but the opportunities it will present are far greater & it's better to focus on how to be valuable in the future than cling to how I provide value today.
Out of my depth, but can the SAM outputs be mapped to 6DOF models of the objects directly? Or would you still need to use the resulting dataset to train 6DOF (or keypoints, for that matter)?
The real hard problem in CV (and science in general) is bad papers that omit useful info.
Segment Anything requires an image embedding. They report in the paper that segmentation takes ~50ms, but conveniently leave out that computing an embedding of an image (640x480) in their model takes ~2+ seconds (on a 3080 Ti). Well, at least they released all the code and model and enough instructions to figure that part out.
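For anyone who wants to see the split themselves, here's a minimal timing sketch against the released segment_anything package; the ViT-H checkpoint filename comes from their repo, and the all-zeros image and single point prompt are just placeholders, so treat the numbers as illustrative:

```python
import time

import numpy as np
import torch
from segment_anything import sam_model_registry, SamPredictor

# Load the heavy ViT-H image encoder (checkpoint file from the official repo).
device = "cuda" if torch.cuda.is_available() else "cpu"
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to(device)
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a real 640x480 RGB frame

# Stage 1: the image embedding (this is the slow part on consumer GPUs).
t0 = time.time()
predictor.set_image(image)
if device == "cuda":
    torch.cuda.synchronize()
print(f"image embedding: {time.time() - t0:.2f}s")

# Stage 2: prompt encoding + mask decoding (the ~50 ms number from the paper).
t0 = time.time()
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),  # arbitrary single point prompt
    point_labels=np.array([1]),           # 1 = foreground point
    multimask_output=True,
)
print(f"mask decode: {time.time() - t0:.3f}s, masks: {masks.shape}")
```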
There are papers that omit useful info; this one certainly ain't one of them. The FAQ on their website mentions that computing the image embedding takes 0.15 seconds on an A100.
The SAM team released the entire codebase, the model weights, the entire dataset, and the details of their training recipe. I can't believe you're calling it a bad paper for not mentioning the embedding generation time in the paper. Seriously? It's a model with a few hundred million parameters that produces a 256x64x64 embedding; of course it's gonna take some time.
OK, I can see everybody is bothered that I said the paper was "bad". A more productive way to say it: "Many great papers still omit information that implementors would like to see spelled out".
0.15 seconds (150ms) would be almost good enough for me (I aim for 25 FPS on my microscope, and achieve that with full high quality object detection), but it's 2 seconds on my 2080 (and 2 seconds on my 3080).
These sorts of things are important for implementors (even the 0.15 seconds would have been useful to include in the paper) because we can read the paper and reject it as a solution without having to download any code or run any experiments. It even took a while to get a good answer to why inference is so slow (I was using the Python scripts, not the notebook, which mentions that image embedding is super slow).
Note they report 55ms inference; I guess that's also on an A100. My 2080 and 3080 both take over a second to do inference after the embedding. Looking at inference performance of the A100 vs. the 3080, it doesn't seem to make sense that the A100 would be that much faster (I wonder if they are running many batches and then dividing by the batch size?).
As an ex-scientist, I've come across many well-regarded papers that omitted stating something explicitly, and it's only once the first or second reimplementor finishes that we learn something important was left out. I don't think the authors were being intentionally misleading, and I'm sure this product is nice, but so far, in my hands, it has not been that great, and it would have been nice if they'd prominently stated the image embedding time, since it's absolutely necessary before prompt and mask decoding.
"It even took a while to get a good answer to why inference is so slow" No it didn't, if you read their paper they mention multiple times what parts take most of the time.
There is only so much one can write in a paper. Meta team did a great job writing down every detail that matters to people trying to reproduce the results or build on their architecture. Not a single researcher I know complained that the authors missed out important details.
You are picking on tiny, trivial details that anyone in the area can figure out in a few minutes and making a big deal out of it. It is a research paper, not a product with detailed documentation or a spec sheet.
They mention this multiple times in the paper. For example, in the “Limitations” section they write: “SAM can process prompts in real-time, but nevertheless SAM’s overall performance is not real-time when using a heavy image encoder.”
This paper is one of the highest quality papers released this year. I wish more papers were so clear and informative.
Yes, I read that section. They should have included the time required by the heavy image encoder, unless you know of a way to make SAM work with another encoder.
I don't think the paper itself is misleading. I taught SAM earlier this week for my Frontiers in Deep Learning course and showed a figure from the paper with how long each component takes, where they separate the components run on a GPU vs. on a CPU/in the web browser.
The thing is, when you segment everything in a scene, sometimes those segments are really parts of a single object, say a laptop, and it starts segmenting the trackpad, individual keys, and screen separately. Then you need another algorithm or human intervention to say this segment is pointless, etc., a noise filter.
Which is not bad imo. In fact, SAM actively proposes a fix for this: bounding-box models are relatively easy to train, and SAM can take a rough bounding box of the laptop as an input prompt and create a detailed segmentation of the laptop. Their webpage has an example with a bunch of cats.
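A minimal sketch of that flow with the released SamPredictor API; the checkpoint filename is from the official repo, and the image and box coordinates are made up here (in practice the box would come from your own detector):

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = np.zeros((720, 1280, 3), dtype=np.uint8)  # stand-in for your RGB frame
predictor.set_image(image)

# A rough laptop box (x0, y0, x1, y1) from any cheap detector; coordinates are made up here.
laptop_box = np.array([100, 150, 900, 650])

# SAM refines the coarse box into one detailed mask for the whole laptop,
# instead of splitting it into keys / trackpad / screen.
masks, scores, _ = predictor.predict(box=laptop_box, multimask_output=False)
```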
SAM is good at (i) producing detailed masks around segments and (ii) taking a wide range of prompts as input to decide what exactly the user wants segmented, and it processes those prompts with very low compute requirements.
I think SAM is a very well designed architecture and I'm not sure how it could be much better. Coming back to your question, there has to be some signal that the user wants a segmentation of the whole laptop, and SAM takes exactly that as a prompt input.
I don't know much about this field, but are the segments given as a flat list or with more structure, e.g., a tree? If the keys are given as children of the laptop, then I don't think breaking things down in detail would get in the way of a user.
I tried to run the automatic_mask_generator notebook [1], but the results are noisier than the "everything" mode on the demo website. Are the parameters used in the demo published anywhere?
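I haven't seen the demo's exact settings published either. The knobs below are the documented SamAutomaticMaskGenerator parameters; the specific values are just a starting point for tuning, not the demo's, and the image is a placeholder:

```python
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
image = np.zeros((720, 1280, 3), dtype=np.uint8)  # your RGB image here

# Tighter thresholds plus a minimum region area tend to suppress noisy masks.
mask_generator = SamAutomaticMaskGenerator(
    model=sam,
    points_per_side=32,           # density of the point-prompt grid
    pred_iou_thresh=0.92,         # drop masks the model scores as low quality
    stability_score_thresh=0.96,  # drop masks that are unstable under thresholding
    min_mask_region_area=500,     # remove tiny disconnected regions (needs opencv)
)
masks = mask_generator.generate(image)  # list of dicts: 'segmentation', 'area', 'bbox', ...
```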
The problem being solved is enabling AI to distinguish unique objects within visual data. Before SAM, you had to label data and train a model on specific objects for it to understand those objects. That becomes problematic given the variety of objects in the world, the settings they can be in, and their orientation in an image. SAM can identify objects it has never seen before, i.e., objects that were not part of its training data.
Once you can determine which pixels belong to which object automatically, you can start to utilize that knowledge for other applications.
If you have SAM showing you all the objects, you can use other models to identify what each object is, understand its shape/size, understand depth/distance, etc. It's a foundational model to build off of for any application that wants to use visual data as an input.
Yep, the value is pretty clear from his demo. It goes from dozens of clicks to identify an object within an image to a single click. SAM does almost exactly what you'd want as a human in every one of his examples.
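As a trivial illustration of that pipeline, you could crop each SAM mask out of the frame and hand it to whatever downstream model you like; the helper and the classifier below are hypothetical, but the 'bbox' (XYWH) and 'segmentation' fields are what SamAutomaticMaskGenerator.generate() returns:

```python
import numpy as np

def crop_objects(image, masks, pad=8):
    """Cut each SAM-segmented object out of the frame for a downstream model."""
    crops = []
    for m in masks:  # dicts as returned by SamAutomaticMaskGenerator.generate()
        x, y, w, h = (int(v) for v in m["bbox"])  # bbox is in XYWH format
        x0, y0 = max(x - pad, 0), max(y - pad, 0)
        x1, y1 = min(x + w + pad, image.shape[1]), min(y + h + pad, image.shape[0])
        crop = image[y0:y1, x0:x1].copy()
        crop[~m["segmentation"][y0:y1, x0:x1]] = 0  # blank out pixels outside the object
        crops.append(crop)
    return crops

# crops = crop_objects(image, masks)
# labels = [my_classifier(crop) for crop in crops]  # hypothetical downstream classifier
```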
Fun fact: I just sent some money to someone offering to do YouTube Shorts for me, so we might try this. Honestly, I don't think Shorts are conducive to deep conversation or technical topics though; people watch with their brains off.
I've had some pretty remarkable results pasting lecture transcripts from YouTube into GPT-4 and getting well-formatted, relevant Markdown summaries out of meandering and mis-transcribed content! It needs chunking up, but it's surprisingly effective. It can even generate YouTube URLs with the right timestamps if you ask it nicely.
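The chunking part is pretty mechanical; here's a rough sketch of that kind of chunk-and-summarize loop, using naive character-based chunks and the legacy openai chat API (the prompt wording and chunk size are just assumptions, not anything official):

```python
import openai  # legacy 0.x SDK style; the newer SDK uses a client object instead

def summarize_transcript(transcript: str, chunk_chars: int = 8000) -> str:
    """Naively chunk a long transcript and summarize each chunk as markdown."""
    chunks = [transcript[i:i + chunk_chars] for i in range(0, len(transcript), chunk_chars)]
    parts = []
    for chunk in chunks:
        resp = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[
                {"role": "system",
                 "content": "Summarize this lecture transcript chunk as concise markdown "
                            "bullet points, fixing obvious transcription errors."},
                {"role": "user", "content": chunk},
            ],
        )
        parts.append(resp["choices"][0]["message"]["content"])
    return "\n\n".join(parts)
```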
It's less configurable than what you're describing, but I've found this useful in at least determining if a given video has the content I'm looking for: https://www.summarize.tech/
Edit: the video demo is here in case people miss it: https://youtu.be/SZQSF-A-WkA