
Hi from the Segment Anything team! Today we’re releasing Segment Anything Model 2! It's the first unified model for real-time promptable object segmentation in images and videos! We're releasing the code, models, dataset, research paper and a demo! We're excited to see what everyone builds! https://ai.meta.com/blog/segment-anything-2/



Code, model, and data under Apache 2.0. Impressive.

Curious how this was allowed to be more open source compared to Llama's interesting new take on "open source". Are other projects restricted in some form due to technical/legal issues and the desire is to be more like this project? Or was there an initiative to break the mold this time round?


LLMs are trained on the entire internet, so on loads of copyrighted data, which Meta can't distribute and is afraid to even reference


This argument doesn't make sense to me unless you're talking about the training material. If that is not the case, then how does this argument relate to the license Meta attempts to force on downloaders of LLaMa weights?


they're literally talking about the training material.


data is creative commons


Yeah, but there's a CLA for some reason. I'm wary they will switch to a new license down the road.


So get it today. You can't retroactively change a license on someone.


Yeah, but it's a signal they aren't thinking of the project as a community project. They are centralizing rights towards themselves in an unequal way. Apache 2.0 without a CLA would be fine otherwise.


Grounded SAM has become an essential tool in my toolbox (for others: it lets you mask any image using only a text prompt). HUGE thank you to the team at Meta, I can't wait to try SAM2!
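
For anyone who hasn't seen the workflow: the gist is a text-conditioned detector proposing boxes, and SAM turning each box into a mask. A rough sketch below; the GroundingDINO helpers, checkpoint filenames and box format are my assumptions from the Grounded-SAM setup, not something SAM itself ships:

    # Sketch only: "grounded" segmentation = text prompt -> boxes (GroundingDINO,
    # assumed installed from the IDEA-Research repo) -> masks (segment-anything).
    import numpy as np
    from segment_anything import sam_model_registry, SamPredictor
    from groundingdino.util.inference import load_model, load_image, predict  # assumed helpers

    # Config/checkpoint paths are placeholders; substitute whatever you downloaded.
    dino = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    predictor = SamPredictor(sam)

    image_np, image_tensor = load_image("photo.jpg")        # RGB numpy image + model-ready tensor
    boxes, logits, phrases = predict(                        # text prompt -> candidate boxes
        model=dino, image=image_tensor, caption="a dog",
        box_threshold=0.35, text_threshold=0.25,
    )

    predictor.set_image(image_np)
    h, w = image_np.shape[:2]
    for box in boxes:                                        # boxes are (I believe) normalized cxcywh
        cx, cy, bw, bh = box.tolist()
        xyxy = np.array([(cx - bw / 2) * w, (cy - bh / 2) * h,
                         (cx + bw / 2) * w, (cy + bh / 2) * h])
        masks, _, _ = predictor.predict(box=xyxy, multimask_output=False)
        # masks[0] is a boolean HxW alpha mask for this detection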


Huge fan of the SAM work, one of the most underrated models.

My favorite use case is that it slays for memes. Try getting a good alpha mask of Fassbender Turtleneck any other way.

Keep doing stuff like this. <3


I've been supporting non-computational folks (i.e. scientists) in using and finetuning SAM for biological applications, so I'm excited to see how SAM2 performs and how the video aspects work for large image stacks of 3D objects.

Considering the instant flood of noisy issues/PRs on the repo and the limited fix/update support on SAM, are there plans/buy-in for support of SAM2 over the medium term beyond quick fixes? Either way, thank you to the team for your work on this and the continued public releases!


stupid question from a noob: what exactly is object segmentation? what does your library actually do? Does it cut clips?


Given an image, it will outline where objects are in the image.
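
Concretely, with the original segment-anything package it looks roughly like this (a sketch; the checkpoint filename is just whatever weights you downloaded):

    # Minimal sketch: one binary mask per object the model finds in the image.
    import cv2
    from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # downloaded weights
    mask_generator = SamAutomaticMaskGenerator(sam)

    image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
    masks = mask_generator.generate(image)   # list of dicts, one per object
    for m in masks:
        print(m["bbox"], m["area"])          # each dict also has a boolean "segmentation" mask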


And it extracts segments of images where the object is in the image, as I understand it?

A segment then is a collection of images that follow each other in time?

So if you have a video comprised of img1, img2, img3, img4 and the object shows up in img1, img2, and img4,

can you catch that as a sequence img1, img2, img3, img4? And can you also catch just the object as img1, img2, img4, but get some information that there is a break between img2 and img4 (the number of images in the break, etc.)?

On edit: Or am I totally off about the segment possibilities and what it means?

Or can you only catch img1 and img2 as a sequence?


I'm not in the field and what SAM does is immediately apparent when you view the home page. Did you not even give it a glance?


Yes, I did give it a glance, polite and clever HN member. It showed an object in a sequence of images extracted from video, and evidently followed the object through the sequence.

Perhaps however my interpretation of what happens here is way off, which is why I asked in an obviously incorrect and stupid way that you have pointed out to me without clarifying exactly why it was incorrect and stupid.

So anyway, there is the extraction of the object I referred to, but it also seems to follow the object through a sequence of scenes?

https://github.com/facebookresearch/segment-anything-2/raw/m...

So it seems to me that they identify the object and follow it for a contiguous sequence: img1, img2, img3, img4. Is my interpretation incorrect here?

But what I am wondering is: what happens if the object is not in img3? Like perhaps two people talking, with the viewpoint shifting from the person talking to the person listening. The person talking is in img1, img2, img4. Can you get that sequence, or is the sequence just img1, img2?

It says "We extend SAM to video by considering images as a video with a single frame." which I don't know what that means, does it mean that they concatenated all the video frames into a single image and identified the object in them, in which case their example still shows contiguous images without the object ever disappearing so my question still pertains.

So anyway my conclusion is that what you said when addressing me was wrong, to quote: "what SAM does is immediately apparent when you view the home page", because I (the "you" being addressed) viewed the home page and still wondered about some things. Obviously wrong things that you have identified as being wrong.

And thus my question is: if what SAM does is immediately apparent when you view the home page, can you point out where my understanding has failed?

On edit: grammar fixes for last paragraph / question.


> A segment then is a collection of images that follow each other in time?

A segment is a visually distinctive... segment of an image; segmentation is basically splitting an image into objects: https://segment-anything.com. As such, it has nothing to do with time or video.

Now SAM 2 is about video, so they seem to add object tracking (that is, keeping the same segment attributed to the same object throughout the frames).

The videos in the main article demonstrate that it can track objects moving in and out of frame (the one with the bacteria, or the one with the boy going around the tree). However, they do acknowledge this part of the algorithm can sometimes produce incorrect results (the example with the horses).

The answer to your question is img1, img2, img4, as there is no reason to believe that it can only track objects in a contiguous sequence.
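
For what it's worth, the new repo's README shows a video predictor roughly along these lines; the function names, config and checkpoint names below are from memory and may differ, so treat it as a sketch:

    # Sketch of SAM 2's video workflow (names/paths are assumptions; check the repo README).
    import numpy as np
    import torch
    from sam2.build_sam import build_sam2_video_predictor

    predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")

    with torch.inference_mode():
        state = predictor.init_state(video_path="frames_dir/")   # directory of JPEG frames
        # Prompt: one positive click on the object in frame 0
        predictor.add_new_points(
            inference_state=state, frame_idx=0, obj_id=1,
            points=np.array([[300, 200]], dtype=np.float32),
            labels=np.array([1], dtype=np.int32),
        )
        # Propagate through the whole video, one mask per frame per object id.
        for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
            present = (mask_logits[0] > 0).any().item()
            print(frame_idx, "object visible:", present)

Frames where the object is absent should just come back with an (effectively) empty mask for that obj_id, which is the "break" information you were asking about.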


Thanks!


Classification per pixel


will the model ever be extended to being able to segment audio (eg. different people talking, different instruments in a soundtrack?)


Check out Facebook's Demucs and, newer, the Ultimate Vocal Remover project on GitHub.
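
If you just want a quick vocal/instrumental split, Demucs has a CLI; a hedged example of calling it from Python (the flag and output layout are from the Demucs README and may differ by version):

    # Hedged sketch: invoke the Demucs CLI to split a track into vocals vs. everything else.
    import subprocess

    subprocess.run(["demucs", "--two-stems", "vocals", "song.mp3"], check=True)
    # Stems land under ./separated/<model>/song/ as vocals.wav and no_vocals.wav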


There are a ton of models that do stem separation like this. We use them all the time. Look up MvSep on Replicate.com.


That would be really cool to try out. I hope someone is doing that.


I wonder if it can be used with security cameras somehow. My cameras currently alert me when they detect motion. It would be neat if this would help cameras become a little smarter. They should alert me only if someone other than a family member is detected.

The recognition logic doesn't have to be reviewing the video all the time, only when motion is detected.

I think some cameras already try to do this, however, they are really bad at it.


Frigate uses both motion detection and object detection. Object detection is usually done with one of the YOLO models.
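
The general pattern (only run the detector on frames that tripped motion) is roughly this; sketched with the ultralytics YOLO package rather than Frigate's actual pipeline, and the threshold, stream URL and model file are placeholder assumptions:

    # Rough sketch of "motion gate, then object detection" on a camera stream.
    import cv2
    from ultralytics import YOLO

    model = YOLO("yolov8n.pt")                 # small pretrained COCO model
    cap = cv2.VideoCapture("rtsp://camera/stream")
    prev = None

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev is not None:
            motion = cv2.absdiff(gray, prev).mean() > 5.0    # crude motion gate
            if motion:
                results = model(frame, verbose=False)[0]     # run detector only on motion
                labels = {model.names[int(c)] for c in results.boxes.cls}
                if "person" in labels:
                    print("person detected")                 # alerting logic goes here
        prev = gray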


Is there a reason Texans can't use the demo?


Texas and Illinois. Both issued massive fines against Facebook for facial recognition, over a decade after FB first launched the feature. Segmentation is, I guess, usable to identify faces, so it may seem too close to facial recognition to launch there.

Basically the same issue the EU has with demos not launching there. You fine tech firms under vague laws often enough, and they stop doing business there.


[flagged]


Your suggestion is that Meta is just too ethical?


Awesome model - thank you! Are you guys planning to provide any guidance on fine-tuning?


Oh, nice!

The first one was excellent. Now part of my Gimp toolbox. Thanks for your work!


How did you add it to gimp?



Thank you for sharing it! Are there any plans to move the codebase to a more performant programming language?


Everything in machine learning uses Python.

It doesn't matter much because all the real computation happens on the GPU. But you could take their neural network and do inference using any language you want.
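
Concretely, the usual route is exporting the network to something portable like ONNX and running it with a runtime that has bindings in most languages. A generic, hedged sketch (not SAM's actual interface; I believe the SAM repo also ships its own ONNX export script for part of the model):

    # Generic sketch: export a PyTorch module to ONNX, then run it from C++, Rust, C#, etc.
    # via ONNX Runtime. Shapes/names here are illustrative only.
    import torch
    import torchvision
    import onnxruntime as ort

    model = torchvision.models.resnet18(weights="IMAGENET1K_V1").eval()
    dummy = torch.randn(1, 3, 224, 224)
    torch.onnx.export(model, dummy, "model.onnx",
                      input_names=["image"], output_names=["logits"])

    # Sanity-check the exported graph from Python before switching languages.
    sess = ort.InferenceSession("model.onnx")
    out = sess.run(None, {"image": dummy.numpy()})
    print(out[0].shape)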


It's all C, C++, and Fortran(?) under the hood, so moving languages probably won't matter as much as you expect.



