Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels

Heeseong Shin¹, Chaehyun Kim¹, Sunghwan Hong², Seokju Cho¹,
Anurag Arnab†,³, Paul Hongsuck Seo†,², Seungryong Kim†,¹

¹KAIST, ²Korea University, ³Google Research
†Co-corresponding author

NeurIPS 2024

In contrast to existing methods utilizing (a) pixel-level semantic labels or (b) image-level semantic labels, we leverage unlabeled masks as supervision, which can be freely generated from vision foundation models such as SAM and DINO.

Abstract

Large-scale vision-language models like CLIP have demonstrated impressive open-vocabulary capabilities for image-level tasks, excelling at recognizing what objects are present. However, they struggle with pixel-level recognition tasks like semantic segmentation, which additionally require understanding where the objects are located. In this work, we propose a novel method, PixelCLIP, which adapts the CLIP image encoder for pixel-level understanding by guiding the model on where objects are, using unlabeled images and masks generated from vision foundation models such as SAM and DINO. To address the challenges of leveraging masks without semantic labels, we devise an online clustering algorithm that uses learnable class names to acquire general semantic concepts. PixelCLIP shows significant performance improvements over CLIP and competitive results compared to caption-supervised methods in open-vocabulary semantic segmentation.

Motivation

Recent vision foundation models (VFMs), such as DINO and SAM, can freely generate fine-grained masks. These masks can guide CLIP toward pixel-level semantic understanding. However, individual masks can be too small or incomplete to carry semantic meaning. To address this over-segmentation issue, we propose an online clustering of the masks into semantically meaningful groups that are defined globally across the given images. By guiding CLIP with the clustered masks, we can adapt the CLIP image encoder to open-vocabulary semantic segmentation without any semantic labels.
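To make the grouping idea concrete, the following is a minimal sketch, not the clustering algorithm used in PixelCLIP. It assumes a simple nearest-prototype assignment: each mask is average-pooled over per-pixel features, matched to the closest of a fixed set of prototype vectors by cosine similarity, and masks that land in the same cluster are merged. The names `mask_pool` and `group_masks`, the flattened pixel layout, and the fixed `prototypes` (standing in for embeddings of learnable class names, which would be learned during training) are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def mask_pool(features, mask):
    """Average the feature vectors of the pixels covered by a binary mask."""
    selected = [f for f, m in zip(features, mask) if m]
    dim = len(features[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


def group_masks(features, masks, prototypes):
    """Assign each mask to its nearest prototype and merge masks per cluster.

    features:   one feature vector per pixel (image flattened to a list)
    masks:      binary masks (lists of 0/1), one entry per pixel
    prototypes: one vector per cluster (assumed fixed in this sketch)
    Returns a dict mapping cluster index -> union of the masks assigned to it.
    """
    grouped = {}
    for mask in masks:
        if not any(mask):
            continue  # skip empty masks, which have no feature to pool
        pooled = mask_pool(features, mask)
        k = max(range(len(prototypes)),
                key=lambda i: cosine(pooled, prototypes[i]))
        merged = grouped.get(k, [0] * len(mask))
        grouped[k] = [a | b for a, b in zip(merged, mask)]
    return grouped


# Toy example: 4 pixels, 2-D features, three small masks, two prototypes.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
masks = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1]]
prototypes = [[1.0, 0.0], [0.0, 1.0]]
print(group_masks(features, masks, prototypes))
```

In this toy case the first two single-pixel masks fall into the same cluster and are merged into one larger region, which is the effect the over-segmentation discussion above is after.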

Architecture

PixelCLIP uses unlabeled images and masks to fine-tune the image encoder of CLIP, enabling open-vocabulary semantic segmentation. The momentum image encoder and the mask decoder are used only during training; inference requires only the image and text encoders of CLIP.
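The two ideas in this paragraph can be sketched as follows. This is an illustrative sketch under stated assumptions, not the released implementation: `ema_update` shows the exponential-moving-average rule commonly used to maintain a momentum encoder (the decay value 0.999 is an assumption), and `segment` shows inference as a per-pixel cosine-similarity argmax between image features and class-name text embeddings. Both function names and the flattened list-based data layout are hypothetical.

```python
import math


def ema_update(momentum_params, online_params, decay=0.999):
    """Move momentum parameters slowly toward the online (trained) ones.

    The decay value is an assumed default, not a value taken from the paper.
    """
    return [decay * m + (1.0 - decay) * p
            for m, p in zip(momentum_params, online_params)]


def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def segment(pixel_features, text_embeddings):
    """Label each pixel with the index of its most similar text embedding.

    pixel_features:  one feature vector per pixel, from the image encoder
    text_embeddings: one vector per class name, from the text encoder
    """
    texts = [normalize(t) for t in text_embeddings]
    labels = []
    for feature in pixel_features:
        f = normalize(feature)
        scores = [sum(a * b for a, b in zip(f, t)) for t in texts]
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels


# Toy example: three pixels, two class names.
print(segment([[1.0, 0.0], [0.2, 0.8], [0.6, 0.4]],
              [[1.0, 0.0], [0.0, 1.0]]))
```

Because `segment` needs only pixel features and text embeddings, the sketch reflects why the momentum encoder and mask decoder can be dropped at inference time.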

Qualitative Results

Qualitative results on ADE20K with 150 categories.

Qualitative results on Pascal-Context with 59 categories.

Quantitative Results

The best-performing results are presented in bold, while the second-best results are underlined.
*: Images were seen during training. †: Masks from SA-1B were used.

BibTeX

@misc{shin2024openvocabularysemanticsegmentationsemantic,
      title={Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels}, 
      author={Heeseong Shin and Chaehyun Kim and Sunghwan Hong and Seokju Cho and Anurag Arnab and Paul Hongsuck Seo and Seungryong Kim},
      year={2024},
      eprint={2409.19846},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2409.19846}, 
}