
Hi from the Segment Anything team! Today we’re releasing Segment Anything Model 2! It's the first unified model for real-time promptable object segmentation in images and videos! We're releasing the code, models, dataset, research paper and a demo! We're excited to see what everyone builds! https://ai.meta.com/blog/segment-anything-2/



Code, model, and data under Apache 2.0. Impressive.

Curious how this was allowed to be more open source compared to Llama's interesting new take on "open source". Are other projects restricted in some form due to technical/legal issues and the desire is to be more like this project? Or was there an initiative to break the mold this time round?


LLMs are trained on the entire internet, so on loads of copyrighted data, which Meta can't distribute and is afraid to even reference


This argument doesn't make sense to me unless you're talking about the training material. If that is not the case, then how does this argument relate to the license Meta attempts to force on downloaders of LLaMa weights?


they're literally talking about the training material.


data is creative commons


Yeah, but there's a CLA for some reason. I'm wary they will switch to a new license down the road.


So get it today. You can't retroactively change a license on someone.


Yeah, but it's a signal they aren't thinking of the project as a community project. They are centralizing rights towards themselves in an unequal way. Apache 2.0 without a CLA would be fine otherwise.


Grounded SAM has become an essential tool in my toolbox (for others: it lets you mask any image using only a text prompt). HUGE thank you to the team at Meta, I can't wait to try SAM2!
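
For anyone who hasn't seen the workflow: the gist is a text-conditioned detector proposing boxes, and SAM turning each box into a mask. A rough sketch below; the GroundingDINO helpers, checkpoint filenames and box format are my assumptions from the Grounded-SAM setup, not something SAM itself ships:

    # Sketch only: "grounded" segmentation = text prompt -> boxes (GroundingDINO,
    # assumed installed from the IDEA-Research repo) -> masks (segment-anything).
    import numpy as np
    from segment_anything import sam_model_registry, SamPredictor
    from groundingdino.util.inference import load_model, load_image, predict  # assumed helpers

    # Config/checkpoint paths are placeholders; substitute whatever you downloaded.
    dino = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    predictor = SamPredictor(sam)

    image_np, image_tensor = load_image("photo.jpg")        # RGB numpy image + model-ready tensor
    boxes, logits, phrases = predict(                        # text prompt -> candidate boxes
        model=dino, image=image_tensor, caption="a dog",
        box_threshold=0.35, text_threshold=0.25,
    )

    predictor.set_image(image_np)
    h, w = image_np.shape[:2]
    for box in boxes:                                        # boxes are (I believe) normalized cxcywh
        cx, cy, bw, bh = box.tolist()
        xyxy = np.array([(cx - bw / 2) * w, (cy - bh / 2) * h,
                         (cx + bw / 2) * w, (cy + bh / 2) * h])
        masks, _, _ = predictor.predict(box=xyxy, multimask_output=False)
        # masks[0] is a boolean HxW alpha mask for this detection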


Huge fan of the SAM work, one of the most underrated models.

My favorite use case is that it slays for memes. Try getting a good alpha mask of Fassbender Turtleneck any other way.

Keep doing stuff like this. <3


I've been supporting non-computational folks (i.e. scientists) in using and finetuning SAM for biological applications, so I'm excited to see how SAM2 performs and how the video aspects work for large image stacks of 3D objects.

Considering the instant flood of noisy issues/PRs on the repo and the limited fix/update support on SAM, are there plans/buy-in for support of SAM2 over the medium term beyond quick fixes? Either way, thank you to the team for your work on this and the continued public releases!


stupid question from a noob: what exactly is object segmentation? what does your library actually do? Does it cut clips?


Given an image, it will outline where objects are in the image.
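
Concretely, with the original segment-anything package it looks roughly like this (a sketch; the checkpoint filename is just whatever weights you downloaded):

    # Minimal sketch: one binary mask per object the model finds in the image.
    import cv2
    from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # downloaded weights
    mask_generator = SamAutomaticMaskGenerator(sam)

    image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
    masks = mask_generator.generate(image)   # list of dicts, one per object
    for m in masks:
        print(m["bbox"], m["area"])          # each dict also has a boolean "segmentation" mask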


And it extracts segments of images where the object is in the image, as I understand it?

A segment then is a collection of images that follow each other in time?

So if you have a video comprised of img1, img2, img3, img4 and the object shows up in img1, img2, and img4,

can you catch that as a sequence img1, img2, img3, img4? And can you also catch just the object as img1, img2, img4, but get some information that there is a break between img2 and img4 (the number of images in the break, etc.)?

On edit: Or am I totally off about the segment possibilities and what it means?

Or can you only catch img1 and img2 as a sequence?


I'm not in the field and what SAM does is immediately apparent when you view the home page. Did you not even give it a glance?


Yes, I did give it a glance, polite and clever HN member. It showed an object in a sequence of images extracted from video, and evidently followed the object through the sequence.

Perhaps however my interpretation of what happens here is way off, which is why I asked in an obviously incorrect and stupid way that you have pointed out to me without clarifying exactly why it was incorrect and stupid.

So anyway, there is the extraction of the object I referred to, but it also seems to follow the object through a sequence of scenes?

https://github.com/facebookresearch/segment-anything-2/raw/m...

So it seems to me that they identify the object and follow it for a contiguous sequence: img1, img2, img3, img4. Is my interpretation incorrect here?

But what I am wondering is: what happens if the object is not in img3? Like perhaps two people talking, with the viewpoint shifting from the person talking to the person listening. The person talking is in img1, img2, img4. Can you get that sequence, or is the sequence just img1, img2?

It says "We extend SAM to video by considering images as a video with a single frame." which I don't know what that means, does it mean that they concatenated all the video frames into a single image and identified the object in them, in which case their example still shows contiguous images without the object ever disappearing so my question still pertains.

So anyway my conclusion is that what you said when addressing me was wrong, to quote: "what SAM does is immediately apparent when you view the home page", because I (the "you" being addressed) viewed the home page and still wondered about some things. Obviously wrong things that you have identified as being wrong.

And thus my question is: if what SAM does is immediately apparent when you view the home page, can you point out where my understanding has failed?

On edit: grammar fixes for last paragraph / question.


> A segment then is a collection of images that follow each other in time?

A segment is a visually distinctive... segment of an image; segmentation is basically splitting an image into objects: https://segment-anything.com. As such, it has nothing to do with time or video.

Now SAM 2 is about video, so they seem to add object tracking (that is, keeping the same segment attributed to the same object throughout the frames).

The videos in the main article demonstrate that it can track objects moving in and out of frame (the one with the bacteria, or the one with the boy going around the tree). However, they do acknowledge this part of the algorithm can sometimes produce incorrect results (the example with the horses).

The answer to your question is img1, img2, img4, as there is no reason to believe that it can only track objects in a contiguous sequence.
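
For what it's worth, the new repo's README shows a video predictor roughly along these lines; the function names, config and checkpoint names below are from memory and may differ, so treat it as a sketch:

    # Sketch of SAM 2's video workflow (names/paths are assumptions; check the repo README).
    import numpy as np
    import torch
    from sam2.build_sam import build_sam2_video_predictor

    predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")

    with torch.inference_mode():
        state = predictor.init_state(video_path="frames_dir/")   # directory of JPEG frames
        # Prompt: one positive click on the object in frame 0
        predictor.add_new_points(
            inference_state=state, frame_idx=0, obj_id=1,
            points=np.array([[300, 200]], dtype=np.float32),
            labels=np.array([1], dtype=np.int32),
        )
        # Propagate through the whole video, one mask per frame per object id.
        for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
            present = (mask_logits[0] > 0).any().item()
            print(frame_idx, "object visible:", present)

Frames where the object is absent should just come back with an (effectively) empty mask for that obj_id, which is the "break" information you were asking about.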


Thanks!


Classification per pixel


will the model ever be extended to being able to segment audio (eg. different people talking, different instruments in a soundtrack?)


Check out Facebook's Demucs and, newer, the Ultimate Vocal Remover project on GitHub.
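
If you just want a quick vocal/instrumental split, Demucs has a CLI; a hedged example of calling it from Python (the flag and output layout are from the Demucs README and may differ by version):

    # Hedged sketch: invoke the Demucs CLI to split a track into vocals vs. everything else.
    import subprocess

    subprocess.run(["demucs", "--two-stems", "vocals", "song.mp3"], check=True)
    # Stems land under ./separated/<model>/song/ as vocals.wav and no_vocals.wav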


There are a ton of models that do stem separation like this. We use them all the time. Look up MvSep on Replicate.com.


That would be really cool to try out. I hope someone is doing that.


I wonder if it can be used with security cameras somehow. My cameras currently alert me when they detect motion. It would be neat if this would help cameras become a little smarter. They should alert me only if someone other than a family member is detected.

The recognition logic doesn't have to be reviewing the video all the time, only when motion is detected.

I think some cameras already try to do this, however, they are really bad at it.


Frigate uses both motion detection and object detection. Object detection is usually done with one of the YOLO models.
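
The general pattern (only run the detector on frames that tripped motion) is roughly this; sketched with the ultralytics YOLO package rather than Frigate's actual pipeline, and the threshold, stream URL and model file are placeholder assumptions:

    # Rough sketch of "motion gate, then object detection" on a camera stream.
    import cv2
    from ultralytics import YOLO

    model = YOLO("yolov8n.pt")                 # small pretrained COCO model
    cap = cv2.VideoCapture("rtsp://camera/stream")
    prev = None

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev is not None:
            motion = cv2.absdiff(gray, prev).mean() > 5.0    # crude motion gate
            if motion:
                results = model(frame, verbose=False)[0]     # run detector only on motion
                labels = {model.names[int(c)] for c in results.boxes.cls}
                if "person" in labels:
                    print("person detected")                 # alerting logic goes here
        prev = gray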


Is there a reason Texans can't use the demo?


Texas and Illinois. Both issued massive fines against Facebook for facial recognition, over a decade after FB first launched the feature. Segmentation is, I guess, usable to identify faces, so it may seem too close to facial recognition to launch there.

Basically the same issue the EU has with demos not launching there. You fine tech firms under vague laws often enough, and they stop doing business there.


[flagged]


Your suggestion is that Meta is just too ethical?


Awesome model - thank you! Are you guys planning to provide any guidance on fine-tuning?


Oh, nice!

The first one was excellent. Now part of my Gimp toolbox. Thanks for your work!


How did you add it to gimp?



Thank you for sharing it! Are there any plans to move the codebase to a more performant programming language?


Everything in machine learning uses Python.

It doesn't matter much because all the real computation happens on the GPU. But you could take their neural network and do inference using any language you want.
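
Concretely, the usual route is exporting the network to something portable like ONNX and running it with a runtime that has bindings in most languages. A generic, hedged sketch (not SAM's actual interface; I believe the SAM repo also ships its own ONNX export script for part of the model):

    # Generic sketch: export a PyTorch module to ONNX, then run it from C++, Rust, C#, etc.
    # via ONNX Runtime. Shapes/names here are illustrative only.
    import torch
    import torchvision
    import onnxruntime as ort

    model = torchvision.models.resnet18(weights="IMAGENET1K_V1").eval()
    dummy = torch.randn(1, 3, 224, 224)
    torch.onnx.export(model, dummy, "model.onnx",
                      input_names=["image"], output_names=["logits"])

    # Sanity-check the exported graph from Python before switching languages.
    sess = ort.InferenceSession("model.onnx")
    out = sess.run(None, {"image": dummy.numpy()})
    print(out[0].shape)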


It's all C, C++, and Fortran(?) under the hood, so moving languages probably won't matter as much as you expect.



