Turning Point: Segment Anything

“Our intelligence is what makes us human, and AI is an extension of that quality.” — Yann LeCun

Ayyüce Kızrak, Ph.D.
Heartbeat

--

A new milestone is recorded almost every week as we experience a renaissance in artificial intelligence (AI) research and development. OpenAI has been leading the way in these significant developments, but in April of this year, Meta AI shared a revolutionary segmentation model for computer vision. The Segment Anything Model (SAM), which claims to segment anything, was published together with the SA-1B dataset of more than 1 billion masks on 11 million images.

One element that makes this study even more important is that the authors put forward an approach that adopts the ethical principles of Responsible AI.

SAM demo on a photo by Andre Hunter on Unsplash

Natural Language Processing (NLP) has been revolutionized in the last five years by large datasets and pre-trained models capable of zero-shot and few-shot generalization. To use this capability effectively in applications, the language model must be guided with the right prompts so that it produces more successful outputs. Doing this well is now defined as a profession: prompt engineering. Pre-trained language models guided by text prompts also generalize zero-shot to new visual concepts and data distributions; over the past year, such prompts have been actively used for image generation in models such as DALL-E and Midjourney.

While much progress has been made with computer vision and vision-language encoders, segmentation poses many challenges beyond their scope, most notably the need for suitable training data at scale.

The aim of the study published by Meta AI is to create a new, pre-trained foundation model for image segmentation with strong generalization capability. By bringing the prompt-engineering approach into this model, the authors aim to solve a range of downstream segmentation problems on new data distributions. The study's success rests on three key elements and the questions that follow: Task, Model, and Data.

Meta AI builds this foundation model for segmentation by introducing three interconnected components, framed as three questions:

1. Which task activates the zero-shot generalization?
2. What is the corresponding model architecture?
3. What data can power this task and model?

Task

The inspiration for this study is the zero-shot and few-shot learning achieved in NLP through "prompting" techniques, especially on new datasets. The promptable segmentation task is defined as follows: given any prompt, generate a valid segmentation mask. For example, spatial information (such as a point or a box) or text describing an object in the image can be given as a prompt. Downstream segmentation tasks are then solved with prompt engineering, as in the sketch below.
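As a concrete illustration, here is a minimal sketch of prompting SAM with a single foreground point using Meta AI's open-source segment_anything package. The checkpoint path and the example image are placeholders; the ViT-H weights are linked from the official repository.

```python
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load a pre-trained SAM model (checkpoint path is a placeholder; the ViT-H
# weights can be downloaded from the official segment-anything repository).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# SAM expects an RGB image as an HxWx3 uint8 array.
image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# The prompt: one (x, y) pixel coordinate labeled 1 (foreground).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),  # illustrative coordinates
    point_labels=np.array([1]),
    multimask_output=False,  # ask for a single mask for this prompt
)
print(masks.shape)  # (1, H, W) boolean mask for the prompted object
```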

Model

SAM is an effective and efficient tool for segmenting images and identifying objects based on user prompts. The model is designed to be flexible, allowing users to supply various prompts and receive results in real time. This is accomplished through a powerful image encoder, a prompt encoder, and a lightweight mask decoder, which together generate segmentation masks for different prompts quickly: the image is embedded once, and each new prompt only runs the lightweight decoder. SAM is also designed to be ambiguity-aware, meaning it can handle situations where a prompt has multiple possible interpretations.
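Continuing the sketch above, the snippet below illustrates this design: the heavy image encoder has already run inside set_image, so the decoder can be called repeatedly with new prompts, and multimask_output=True returns several candidate masks with quality scores when a point prompt is ambiguous. The coordinates are again placeholders.

```python
# The image embedding computed by predictor.set_image(...) is reused here,
# so each new prompt only runs the prompt encoder and the lightweight decoder.

# An ambiguous prompt: a single point can refer to a part, a whole object,
# or a group of objects, so we ask SAM for several candidate masks.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,  # typically returns three candidates
)
best_mask = masks[scores.argmax()]  # keep the highest-scoring interpretation
print(masks.shape, scores)          # (3, H, W) masks, one quality score each

# A box prompt in (x0, y0, x1, y1) format on the same, already-embedded image.
box_masks, _, _ = predictor.predict(
    box=np.array([250, 200, 750, 600]),  # illustrative box
    multimask_output=False,
)
```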

Data Engine

For SAM to be effective with new data, the model must be trained on diverse masks that go beyond the typical segmentation datasets. However, since masks are not naturally abundant, the SAM team developed a “data engine” that co-develops the model with dataset annotation. The data engine has three stages: assisted-manual, semi-automatic, and fully automatic.

  • In the first stage, SAM assists annotators in annotating masks in a classic interactive segmentation setup.
  • In the second stage, SAM automatically generates masks for a subset of objects, allowing annotators to concentrate on annotating the remaining objects, which increases mask diversity.
  • In the final stage, SAM is prompted with a regular grid of foreground points, which yields an average of around 100 high-quality masks per image (a sketch of this grid-prompting approach follows below).

This data engine strategy enables SAM to achieve strong generalization to new data distributions.
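The fully automatic stage of the data engine, in which SAM is prompted with a regular grid of points and only high-quality, stable masks are kept, is exposed in the released package as SamAutomaticMaskGenerator. The sketch below shows roughly how it is used; the parameter values are illustrative defaults, not the exact settings used to build SA-1B.

```python
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")

# Prompt the image with a 32x32 grid of points and filter the resulting
# masks by predicted quality and stability, similar in spirit to the
# fully automatic stage of the data engine.
mask_generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,           # resolution of the point-prompt grid
    pred_iou_thresh=0.88,         # drop masks the model itself rates poorly
    stability_score_thresh=0.95,  # drop masks unstable under thresholding
)

masks = mask_generator.generate(image)  # 'image' as in the earlier sketches
print(len(masks))  # often on the order of 100 masks per image
# Each entry is a dict with keys such as 'segmentation', 'area',
# 'bbox', 'predicted_iou', and 'stability_score'.
```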

Example images with overlaid masks from the SA-1B dataset

Dataset

The SA-1B dataset is a large collection of more than 1 billion masks generated through a fully automated process. The masks cover 11 million licensed and privacy-preserving images, which makes the dataset stand out among segmentation datasets: it contains 400 times more masks than comparable datasets. This makes it an essential resource for training SAM and a valuable one for researchers who want to develop new foundation models.
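SA-1B is distributed as image files plus one JSON annotation file per image, with the masks stored in COCO run-length encoding (RLE). Below is a minimal sketch of decoding them with pycocotools; the file path is a placeholder, and the exact field names should be checked against the dataset documentation.

```python
import json
from pycocotools import mask as mask_utils  # decodes COCO RLE masks

# Placeholder path: each SA-1B image ships with a JSON annotation file.
with open("sa_000000/sa_1.json") as f:
    record = json.load(f)

for ann in record["annotations"]:
    # 'segmentation' holds a COCO RLE dict: {"size": [H, W], "counts": ...}
    binary_mask = mask_utils.decode(ann["segmentation"])
    print(binary_mask.shape, ann.get("predicted_iou"), ann.get("area"))
```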

Segment Anything Model workflow by Meta AI

Where does “Responsible AI” fit into this work?

SAM holds great significance because the study delves into the crucial matter of fairness concerns and potential biases that could arise while using SA-1B and SAM. With SA-1B covering a wide range of countries with diverse economic backgrounds, and SAM showing consistent performance across different groups of people, this study is anticipated to play a pivotal role in promoting fairness in real-world applications.

Conclusion

SAM continues to evolve and improve, always striving for greater accuracy and precision. As with any tool, there is always room for refinement and optimization, and SAM is no exception. It is designed to be versatile and adaptable, allowing for a wide range of applications, but there may be instances where more specialized tools are better suited. Through continued exploration and refinement, we are confident that SAM will continue to be a valuable asset in the world of semantic and panoptic segmentation.

Like OpenAI, Meta AI continues its confident and targeted rise in the generative AI space. First it released LLaMA, a large language model with 65 billion parameters, and now it has published SAM! SAM will contribute significantly to Meta AI's broader work, which includes real-time segmentation and rendering images from text and audio, especially for Project Aria.

Feel free to follow my GitHub and Twitter accounts for more content!

Editor’s Note: Heartbeat is a contributor-driven online publication and community dedicated to providing premier educational resources for data science, machine learning, and deep learning practitioners. We’re committed to supporting and inspiring developers and engineers from all walks of life.

Editorially independent, Heartbeat is sponsored and published by Comet, an MLOps platform that enables data scientists & ML teams to track, compare, explain, & optimize their experiments. We pay our contributors, and we don’t sell ads.

If you’d like to contribute, head on over to our call for contributors. You can also sign up to receive our weekly newsletter (Deep Learning Weekly), check out the Comet blog, join us on Slack, and follow Comet on Twitter and LinkedIn for resources, events, and much more that will help you build better ML models, faster.

--

AI Specialist @Digital Transformation Office, Presidency of the Republic of Türkiye | Academics @Bahçeşehir University | http://www.ayyucekizrak.com/