Semantic segmentation models are predominantly based on supervised or unsupervised learning, both of which require substantial effort in annotation or training. In this study, we present a novel framework that leverages multiple pre-trained foundation models to perform semantic segmentation on previously unseen images without any additional training. Our framework uses image recognition models to transform an input image into textual information. This text is then passed to an advanced Large Language Model (LLM), which predicts which classes are present in the image. The labels predicted by the LLM are subsequently processed by an open-set detection and segmentation model to produce the final segmentation. To ensure that the class information is precisely aligned with the intended context, we incorporate both a pre-refinement and a post-refinement procedure using the LLM. The segmentation model is further modified to accept both bounding boxes and point prompts, resulting in higher accuracy than the original usage, which accepts only bounding boxes as input. Our proposed framework accomplishes training-free zero-shot semantic segmentation, requiring only the input image and a customizable set of target classes for each scenario as inputs. Experiments indicate that the proposed framework performs semantic segmentation effectively across various datasets. Notably, our results surpass those of existing unsupervised models despite the absence of any training procedure.
Our framework consists of three sub-components: a) Image recognition models, namely the
Recognize Anything Plus Model (RAM++) [14] and BLIP-2 [19], which process the input image
to generate a list of tags and a caption describing its content as text. b) A GPT-4 [23]
model, which performs a pre-refinement step that converts this textual information into
predicted classes for the segmentation model. The system automatically generates the prompts
from predefined target classes specific to each dataset. c) A pre-trained open-set
segmentation model, Grounded-SAM [28], which detects and segments the predicted classes
in the image. A post-refinement step is applied during the detection phase using
the same GPT-4 model.
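
The data flow between the three sub-components can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the four models are passed in as plain callables (`tagger`, `captioner`, `llm`, `detector`), and the function names, the prompt wording, and the comma-separated reply format are all hypothetical. The post-refinement step is left out because the text above does not describe its mechanics.

```python
from typing import Any, Callable, List, Sequence


def build_prompt(tags: Sequence[str], caption: str,
                 target_classes: Sequence[str]) -> str:
    """Assemble a pre-refinement prompt (wording is an assumption)."""
    return (
        f"Image tags: {', '.join(tags)}\n"
        f"Image caption: {caption}\n"
        f"Target classes: {', '.join(target_classes)}\n"
        "List the target classes likely present, comma-separated."
    )


def parse_classes(reply: str, target_classes: Sequence[str]) -> List[str]:
    """Keep only names in the reply that match a target class.

    Matching is case-insensitive; unknown names and duplicates are dropped.
    Assumes the LLM answers with a comma-separated list.
    """
    allowed = {c.lower(): c for c in target_classes}
    kept: List[str] = []
    for token in reply.split(","):
        key = token.strip().lower()
        if key in allowed and allowed[key] not in kept:
            kept.append(allowed[key])
    return kept


def segment(image: Any,
            target_classes: Sequence[str],
            tagger: Callable[[Any], List[str]],
            captioner: Callable[[Any], str],
            llm: Callable[[str], str],
            detector: Callable[[Any, List[str]], list]) -> list:
    """Run recognition, pre-refinement, then open-set segmentation."""
    # a) Image recognition: turn the image into tags and a caption.
    tags = tagger(image)
    caption = captioner(image)
    # b) Pre-refinement: ask the LLM which target classes are present.
    reply = llm(build_prompt(tags, caption, target_classes))
    predicted = parse_classes(reply, target_classes)
    if not predicted:
        return []
    # c) Open-set detection and segmentation on the predicted classes.
    return detector(image, predicted)
```

Passing the models in as callables keeps the sketch independent of any particular model API; in practice each callable would wrap the corresponding pre-trained model.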
@inproceedings{Huang2024SemSegLLM,
author={Yuantian Huang and Satoshi Iizuka and Kazuhiro Fukui},
title={Training-Free Zero-Shot Semantic Segmentation with LLM Refinement},
booktitle={35th British Machine Vision Conference 2024, {BMVC} 2024, Glasgow, UK, November 25-28, 2024},
publisher={BMVA},
year={2024},
url={https://bmvc2024.org/proceedings/601},
}