TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.


How does the Segment-Anything Model’s (SAM’s) decoder work?

Wei Yi
Published in TDS Archive
18 min read · Mar 24, 2024


Photo by dylan nolte on Unsplash

A deep dive into the Segment-Anything model’s decoding procedure, with a focus on how its self-attention and cross-attention mechanisms work.

This article focuses only on SAM’s decoder. Readers interested in SAM’s encoder can see my other article, “How Does the Segment-Anything Model’s (SAM’s) encoder work?”.

The Segment-Anything Model (SAM) is a 2D interactive segmentation model, also called a guided model. SAM requires user prompts to segment an image; these prompts tell the model where to segment. The output of the model is a set of segmentation masks at different levels of granularity, with a confidence score associated with each mask.

A segmentation mask is a 2D binary array with the same size as the input image. In this 2D array, the entry at location (x, y) has the value 1 if the model thinks that the pixel at location (x, y) belongs to the segmented area. Otherwise, the entry is 0. The confidence scores indicate the model’s belief in the quality of each segmentation: a higher score means higher quality.
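To make the mask and score description concrete, here is a minimal NumPy sketch. The image size, the three candidate masks, and the scores are all made-up values for illustration, not SAM outputs, and `in_mask` is a hypothetical helper. It shows a stack of binary masks with the same height and width as the image, and how one might pick the mask with the highest confidence score.

```python
import numpy as np

# Hypothetical image size; only height and width matter for the mask shape.
height, width = 4, 6

# Three made-up candidate masks, each the same size as the image.
masks = np.zeros((3, height, width), dtype=np.uint8)
masks[0, 1:3, 1:3] = 1   # a small region
masks[1, 1:3, 1:5] = 1   # a medium region
masks[2, :, :] = 1       # the whole image

# Made-up confidence scores, one per mask (higher means higher quality).
scores = np.array([0.71, 0.93, 0.40])

def in_mask(mask: np.ndarray, x: int, y: int) -> bool:
    """Return True if pixel (x, y) belongs to the segmented area.

    NumPy arrays are indexed as [row, column], i.e. [y, x].
    """
    return bool(mask[y, x])

best = int(np.argmax(scores))   # index of the highest-scoring mask
best_mask = masks[best]
area = int(best_mask.sum())     # number of pixels marked 1
```

Note the index order: the text speaks of location (x, y), while the array lookup is `mask[y, x]`.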

The network architecture of SAM consists of an encoder and a decoder:

  • The encoder takes in the image and the user prompts to produce an image embedding, an image positional embedding, and user prompt embeddings.
  • The decoder takes in these embeddings to produce segmentation masks and confidence scores.
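The data flow between the two components can be sketched as a pair of function signatures. This is only an interface sketch under loud assumptions: `EncoderOutput`, `toy_encoder`, and `toy_decoder` are hypothetical names, the bodies return random placeholders rather than anything a network would compute, and the tiny shapes (8 channels, a 4×4 grid) are chosen for readability, not taken from SAM.

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class EncoderOutput:
    image_embedding: np.ndarray             # (channels, grid, grid)
    image_positional_embedding: np.ndarray  # (channels, grid, grid)
    prompt_embeddings: np.ndarray           # (num_prompts, channels)

def toy_encoder(image: np.ndarray, num_prompts: int,
                channels: int = 8, grid: int = 4, seed: int = 0) -> EncoderOutput:
    """Placeholder encoder: returns random arrays with plausible shapes."""
    rng = np.random.default_rng(seed)
    return EncoderOutput(
        image_embedding=rng.normal(size=(channels, grid, grid)),
        image_positional_embedding=rng.normal(size=(channels, grid, grid)),
        prompt_embeddings=rng.normal(size=(num_prompts, channels)),
    )

def toy_decoder(enc: EncoderOutput, image_hw: tuple,
                num_masks: int = 3, seed: int = 1):
    """Placeholder decoder: returns random binary masks and scores."""
    rng = np.random.default_rng(seed)
    h, w = image_hw
    masks = (rng.random((num_masks, h, w)) > 0.5).astype(np.uint8)
    scores = rng.random(num_masks)
    return masks, scores

image = np.zeros((32, 48, 3), dtype=np.uint8)      # dummy RGB image
enc = toy_encoder(image, num_prompts=2)
masks, scores = toy_decoder(enc, image.shape[:2])  # one score per mask
```

The point of the sketch is the contract: three embeddings go in, and image-sized binary masks with one score each come out.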

This article focuses on how SAM’s decoder works; the encoder is covered in a separate article.

SAM’s inputs and outputs

Together with the input image to segment, SAM also requires user prompts, and it supports the following kinds of prompts:

Different kinds of input user prompts

  • Mouse clicks. A mouse click can be a positive click that tells the model to include the clicked location in the produced segmentation mask. It can also be a negative click that tells the model to avoid the clicked location. SAM accepts multiple clicks, each either positive or negative.
  • Bounding boxes. Bounding boxes are always positive signals, telling the model that the produced segmentation mask should be inside the boxed area. There are no…
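The click and box prompts described above can be represented as small arrays. The sketch below is one possible encoding, not SAM's actual interface: the `Prompts` container and the `validate` helper are hypothetical, and the conventions (label 1 for a positive click, 0 for a negative click, boxes given as x0, y0, x1, y1 in pixels) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np

@dataclass
class Prompts:
    point_coords: np.ndarray          # (N, 2) click locations as (x, y) pixels
    point_labels: np.ndarray          # (N,) 1 = positive click, 0 = negative click
    box: Optional[np.ndarray] = None  # (4,) x0, y0, x1, y1; boxes carry no label

def validate(prompts: Prompts, image_hw: tuple) -> None:
    """Raise ValueError if the prompts are malformed for this image size."""
    height, width = image_hw
    xy, labels = prompts.point_coords, prompts.point_labels
    if xy.shape != (len(labels), 2):
        raise ValueError("need one (x, y) pair per label")
    if not np.isin(labels, [0, 1]).all():
        raise ValueError("labels must be 0 (negative) or 1 (positive)")
    if ((xy < 0) | (xy >= np.array([width, height]))).any():
        raise ValueError("click outside the image")
    if prompts.box is not None:
        x0, y0, x1, y1 = prompts.box
        if not (0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height):
            raise ValueError("box must be non-empty and inside the image")

# One positive click, one negative click, and a box, on a 320x240 image.
prompts = Prompts(
    point_coords=np.array([[120, 80], [200, 150]]),
    point_labels=np.array([1, 0]),
    box=np.array([50, 40, 300, 220]),
)
validate(prompts, image_hw=(240, 320))  # passes silently
```

Keeping labels in a separate array from coordinates makes it easy to mix positive and negative clicks in a single prompt, while the box needs no label because it is always a positive signal.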

Written by Wei Yi

I'm leading the Deep Learning team at AstraZeneca. Previously I worked at SecondMind, Microsoft Research, and also was CTO of a hedge fund EQB.
