How does the Segment-Anything Model’s (SAM’s) decoder work?
A deep dive into the Segment-Anything Model’s decoding procedure, with a focus on how its self-attention and cross-attention mechanisms work.
This article only focuses on SAM’s decoder. For people interested in SAM’s encoder, please see my other article “How Does the Segment-Anything Model’s (SAM’s) encoder work?”.
The Segment-Anything Model (SAM) is a 2D interactive segmentation model, also called a guided model. SAM requires user prompts to segment an image; these prompts tell the model where to segment. The output of the model is a set of segmentation masks at different levels of granularity, together with a confidence score for each mask.
A segmentation mask is a 2D binary array with the same size as the input image. In this array, the entry at location (x, y) is 1 if the model thinks that the pixel at location (x, y) belongs to the segmented area; otherwise, the entry is 0. The confidence scores indicate the model’s belief in the quality of each segmentation: a higher score means higher quality.
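To make the output format concrete, here is a minimal sketch of a set of binary masks with one confidence score each, and of picking the mask with the highest score. The mask values, the scores, and the helper names (`mask_area`, `pixel_in_mask`, `best_mask`) are made up for illustration and are not real SAM output or part of SAM’s API. The sketch uses plain Python lists to stay dependency-free, and it assumes the common storage convention that the row index is y and the column index is x.

```python
from typing import List, Tuple

Mask = List[List[int]]  # 2D binary array: 1 = pixel is in the segmented area

# Three hypothetical candidate masks for a 4x4 image, from small to large.
small: Mask = [[0, 0, 0, 0],
               [0, 1, 0, 0],
               [0, 0, 0, 0],
               [0, 0, 0, 0]]
medium: Mask = [[0, 0, 0, 0],
                [0, 1, 1, 0],
                [0, 1, 1, 0],
                [0, 0, 0, 0]]
large: Mask = [[1, 1, 1, 0],
               [1, 1, 1, 0],
               [1, 1, 1, 0],
               [0, 0, 0, 0]]

masks = [small, medium, large]
scores = [0.71, 0.93, 0.88]  # made-up confidence scores, one per mask


def mask_area(mask: Mask) -> int:
    """Count the pixels that the mask marks as segmented."""
    return sum(sum(row) for row in mask)


def pixel_in_mask(mask: Mask, x: int, y: int) -> bool:
    """Check pixel (x, y); assumes rows are indexed by y, columns by x."""
    return mask[y][x] == 1


def best_mask(masks: List[Mask], scores: List[float]) -> Tuple[Mask, float]:
    """Return the mask with the highest confidence score, and that score."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return masks[best_index], scores[best_index]


chosen, chosen_score = best_mask(masks, scores)
print(mask_area(chosen), chosen_score)
```

In real code, the masks would usually be NumPy arrays or tensors rather than nested lists, but the idea is the same: each mask has the image’s height and width, and the scores let you rank the candidates.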
The network architecture of SAM consists of an encoder and a decoder:
- The encoder takes in the image and the user prompts and produces an image embedding, an image positional embedding, and user prompt embeddings.
- The decoder takes in these embeddings and produces segmentation masks and confidence scores.
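The data flow between the two parts can be sketched as a pair of functions. This is only an interface sketch under my own naming: `EncoderOutput`, `DecoderOutput`, `segment`, and the toy stand-in functions are hypothetical and do not correspond to SAM’s actual classes. The stand-ins return placeholder values so that the sketch runs; they do not perform any segmentation.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class EncoderOutput:
    """The three kinds of embeddings the encoder hands to the decoder."""
    image_embedding: Any
    image_positional_embedding: Any
    prompt_embeddings: Any


@dataclass
class DecoderOutput:
    """Segmentation masks and one confidence score per mask."""
    masks: List[Any]
    scores: List[float]


Encoder = Callable[[Any, Any], EncoderOutput]
Decoder = Callable[[EncoderOutput], DecoderOutput]


def segment(image: Any, prompts: Any,
            encoder: Encoder, decoder: Decoder) -> DecoderOutput:
    """Run the encoder, then feed its embeddings to the decoder."""
    embeddings = encoder(image, prompts)
    return decoder(embeddings)


# Hypothetical stand-ins that only return placeholders.
def toy_encoder(image: Any, prompts: Any) -> EncoderOutput:
    return EncoderOutput(
        image_embedding="image-embedding",
        image_positional_embedding="image-positional-embedding",
        prompt_embeddings=[f"embedding-of-{p}" for p in prompts],
    )


def toy_decoder(embeddings: EncoderOutput) -> DecoderOutput:
    # A real decoder would compute these from the embeddings.
    return DecoderOutput(masks=[[[0]]], scores=[0.5])


result = segment("some-image", ["click"], toy_encoder, toy_decoder)
print(len(result.masks), len(result.scores))
```

The point of the sketch is the contract: the decoder never sees the raw image or the raw prompts, only the embeddings, and it returns masks and scores in matching order.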
This article focuses on how SAM’s decoder works; the encoder is covered in the separate article mentioned above.
SAM’s inputs and outputs
In addition to the input image to segment, SAM requires user prompts. It supports the following kinds of prompts.
Different kinds of input user prompts
- Mouse clicks. A mouse click can be a positive click, which tells the model to include the clicked location in the produced segmentation mask, or a negative click, which tells the model to exclude the clicked location. SAM accepts multiple clicks, and each click can be either positive or negative.
- Bounding boxes. Bounding boxes are always positive signals, telling the model that the produced segmentation mask should be inside the boxed area. There are no…