Hi from the Segment Anything team! Today we’re releasing Segment Anything Model 2! It's the first unified model for real-time promptable object segmentation in images and videos! We're releasing the code, models, dataset, research paper and a demo! We're excited to see what everyone builds! https://ai.meta.com/blog/segment-anything-2/
Code, model, and data, under Apache 2.0. Impressive.
Curious how this was allowed to be more open source compared to Llama's interesting new take on "open source". Are other projects restricted in some form due to technical/legal issues and the desire is to be more like this project? Or was there an initiative to break the mold this time round?
This argument doesn't make sense to me unless you're talking about the training material. If that is not the case, then how does this argument relate to the license Meta attempts to force on downloaders of LLaMa weights?
Yeah, but it's a signal they aren't thinking of the project as a community project. They are centralizing rights towards themselves in an unequal way. Apache 2.0 without a CLA would be fine otherwise.
Grounded SAM has become an essential tool in my toolbox (for others: it lets you mask any image using a text prompt, only). HUGE thank you to the team at Meta, I can't wait to try SAM2!
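For anyone curious what "mask any image using a text prompt" looks like in practice: Grounded SAM chains a text-conditioned detector with SAM's box-prompted mask prediction. Here's a minimal sketch of that data flow; the detector and segmenter below are stand-in stubs, not the real models, and `detect_boxes`/`segment_box` are hypothetical names.

```python
# Sketch of the Grounded SAM idea: a text-conditioned detector proposes
# boxes, and a SAM-style segmenter turns each box prompt into a pixel mask.
# Both functions below are stubs standing in for the real models.

def detect_boxes(image, text_prompt):
    """Stand-in for a text-conditioned detector (e.g. Grounding DINO).
    Returns (x0, y0, x1, y1) boxes for regions matching the prompt."""
    h, w = len(image), len(image[0])
    # Pretend the prompt matched two regions of the image.
    return [(0, 0, w // 2, h // 2), (w // 2, h // 2, w, h)]

def segment_box(image, box):
    """Stand-in for SAM's box-prompted mask prediction:
    returns a binary mask the same size as the image."""
    x0, y0, x1, y1 = box
    return [[1 if (x0 <= x < x1 and y0 <= y < y1) else 0
             for x in range(len(image[0]))] for y in range(len(image))]

def grounded_segment(image, text_prompt):
    """Text prompt in, one binary mask per matched object out."""
    return [segment_box(image, b) for b in detect_boxes(image, text_prompt)]

image = [[0] * 8 for _ in range(8)]   # dummy 8x8 "image"
masks = grounded_segment(image, "a dog")
print(len(masks))                     # one mask per detected box
```

The real pipeline works the same way: the detector resolves the text to boxes, and SAM only ever sees geometric prompts.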
I've been supporting non-computational folks (i.e. scientists) in using and finetuning SAM for biological applications, so I'm excited to see how SAM2 performs and how the video aspects work for large image stacks of 3D objects.
Considering the instant flood of noisy issues/PRs on the repo and the limited fix/update support on SAM, are there plans/buy-in for support of SAM2 on the medium-term beyond quick fixes? Either way, thank you to the team for your work on this and the continued public releases!
And it extracts segments of images where the object is in the image, as I understand it?
A segment, then, is a collection of images that follow each other in time?
So if you have a video comprised of img1, img2, img3, img4,
and the object shows up in img1, img2, and img4:
Can you catch that as a sequence img1, img2, img3, img4? And can you also catch just the object frames img1, img2, img4, but get some sort of information that there is a break between img2 and img4 (the number of images in the break, etc.)?
On edit: Or am I totally off about the segment possibilities and what it means?
Or can you only catch img1 and img2 as a sequence?
Yes, I did give it a glance, polite and clever HN member. It showed an object in a sequence of images extracted from video, and evidently followed the object through the sequence.
Perhaps however my interpretation of what happens here is way off, which is why I asked in an obviously incorrect and stupid way that you have pointed out to me without clarifying exactly why it was incorrect and stupid.
So anyway, there is the extraction of the object I referred to, but it also seems to follow the object through a sequence of scenes?
So it seems to me that they identify the object and follow it for a contiguous sequence: img1, img2, img3, img4. Is my interpretation incorrect here?
But what I am wondering is: what happens if the object is not in img3? Like perhaps two people talking, with the viewpoint shifting from the person talking to the person listening. The person talking is in img1, img2, img4. Can you get that sequence, or is the sequence just img1, img2?
It says "We extend SAM to video by considering images as a video with a single frame," which I don't know the meaning of. Does it mean that they concatenated all the video frames into a single image and identified the object in them? In that case, their example still shows contiguous images without the object ever disappearing, so my question still pertains.
So anyway, my conclusion is that what you said when addressing me was wrong. To quote: "what SAM does is immediately apparent when you view the home page." Because I (the "you" addressed) viewed the home page and still wondered about some things? Obviously wrong things, which you have identified as being wrong.
And thus my question is: If what SAM does is immediately apparent when you view the home page can you point out where my understanding has failed?
On edit: grammar fixes for last paragraph / question.
> A segment then is a collection of images that follow each other in time?
A segment is a visually distinctive... segment of an image. Segmentation is basically splitting an image into objects: https://segment-anything.com. As such, it has nothing to do with time or video.
Now, SAM 2 is about video, so they seem to add object tracking (that is, attributing the same object to the same segment throughout frames).
The videos in the main article demonstrate that it can track objects in and out of frame (the one with the bacteria, or the one with the boy going around the tree). However, they do acknowledge this part of the algorithm can produce incorrect results sometimes (the example with the horses).
The answer to your question is img1, img2, img4, as there is no reason to believe that it can only track objects in contiguous sequence.
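The bookkeeping the question is really asking about (which frames contain the object, and how long the breaks between appearances are) is straightforward once the tracker reports per-frame presence. A minimal sketch, assuming we already have a boolean per frame saying whether the tracked object's mask is non-empty:

```python
# Given, per frame, whether the tracked object's mask is non-empty,
# recover both the frames that contain the object and the breaks
# (missing-frame counts) between appearances.

def appearances_and_gaps(presence):
    """presence: list of bools, one per frame (True = object visible).
    Returns (frames_with_object, gaps), where gaps maps the frame index
    after which a break starts to the number of missing frames."""
    frames = [i for i, seen in enumerate(presence) if seen]
    gaps = {}
    for a, b in zip(frames, frames[1:]):
        if b - a > 1:
            gaps[a] = b - a - 1   # frames between two sightings
    return frames, gaps

# The scenario from the question: object in img1, img2, img4
# (0-indexed frames 0, 1, 3).
frames, gaps = appearances_and_gaps([True, True, False, True])
print(frames)  # [0, 1, 3]
print(gaps)    # {1: 1} -> a one-frame break after frame 1
```

So even if the model only reports masks per frame, recovering "img1, img2, img4 plus a break of one frame" is a trivial post-processing step.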
I wonder if it can be used with security cameras somehow. My cameras currently alert me when they detect motion. It would be neat if this would help cameras become a little smarter. They should alert me only if someone other than a family member is detected.
The recognition logic doesn't have to always be reviewing the video, but only when motion is detected.
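The gating idea above can be sketched in a few lines: run a cheap frame-difference motion test continuously, and invoke the expensive recognition model only when it fires. Everything here is a toy; `recognize` is a hypothetical stand-in for whatever segmentation/recognition model the camera would call.

```python
# Only run the (expensive) recognition step when a cheap
# frame-difference test says something moved.

def motion_detected(prev, curr, threshold=10, min_changed=5):
    """Cheap motion test: count pixels whose intensity changed a lot."""
    changed = sum(abs(a - b) > threshold
                  for row_a, row_b in zip(prev, curr)
                  for a, b in zip(row_a, row_b))
    return changed >= min_changed

def process_stream(frames, recognize):
    """Run `recognize` only on frames that follow detected motion."""
    alerts, prev = [], frames[0]
    for frame in frames[1:]:
        if motion_detected(prev, frame):
            alerts.extend(recognize(frame))
        prev = frame
    return alerts

still = [[0] * 4 for _ in range(4)]    # toy 4x4 grayscale frames
moving = [[50] * 4 for _ in range(4)]
# Recognition stub: flag anything in a changed frame as unknown.
alerts = process_stream([still, still, moving],
                        lambda f: ["unknown person"])
print(alerts)  # ['unknown person'] -- recognition ran once
```

Real cameras use fancier motion detectors (background subtraction, PIR sensors), but the control flow is the same: the model is the slow path, motion is the trigger.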
I think some cameras already try to do this, however, they are really bad at it.
Texas and Illinois. Both issued massive fines against Facebook for facial recognition, over a decade after FB first launched the feature. Segmentation is, I guess, usable to identify faces, so it may seem too close to facial recognition to launch there.
Basically the same issue the EU has with demos not launching there. You fine tech firms under vague laws often enough, and they stop doing business there.
It doesn't matter much because all the real computation happens on the GPU. But you could take their neural network and do inference using any language you want.
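To make the "any language" point concrete: once trained, a network is just stored weights plus arithmetic, so the forward pass can be rewritten wherever you like. A toy two-layer MLP inference in plain Python, with made-up weights:

```python
# A trained network reduces to weights + arithmetic. Here is inference
# for a tiny two-layer MLP, no framework needed; any language with
# loops and floats could do the same (the GPU just does it faster).

def matvec(W, x):
    """Matrix-vector product: one dot product per output row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def forward(x, W1, b1, W2, b2):
    """Two-layer MLP: relu(W1 x + b1), then W2 h + b2."""
    h = relu([a + b for a, b in zip(matvec(W1, x), b1)])
    return [a + b for a, b in zip(matvec(W2, h), b2)]

# Toy weights, just to show the mechanics end to end.
W1 = [[1.0, -1.0], [0.5, 0.5]]
b1 = [0.0, 0.0]
W2 = [[1.0, 1.0]]
b2 = [0.5]
print(forward([2.0, 1.0], W1, b1, W2, b2))  # [3.0]
```

A real model like SAM adds attention layers and convolutions, but those are still just more arithmetic over exported weights, which is why ports to other runtimes and languages are feasible.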