Today we’ll dive right into a paper that builds upon the great success of CLIP in language-image pre-training and extends it to the task of object detection: GLIP — Grounded Language-Image Pre-training. We’ll cover the key concepts and findings of the paper and make them easy to understand by providing additional context and adding annotations to images and experiment results. Let’s go!
Paper: Grounded Language-Image Pre-training
Code: https://github.com/microsoft/GLIP
First Published: 7 Dec. 2021
Authors: Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, Jianfeng Gao
Category: representation learning, object detection, phrase grounding, multi-modal deep learning, computer vision, natural language processing, foundation models
- Context & Background
- Claimed Contributions
- Methodology
- Experiments
- Further Readings & Resources
GLIP (Grounded Language-Image Pre-training) is a multi-modal language-image model. Similar to CLIP (Contrastive Language-Image Pre-Training), it performs contrastive pre-training to learn semantically rich representations and align them across its modalities. While CLIP learns these representations on an image level, meaning one sentence describes the entire image, GLIP aims to extend this approach to object-level representations, meaning one sentence may correspond to multiple objects within the image. The task of identifying correspondences between single tokens in a text prompt and objects or regions in an image is called phrase grounding — hence the word “Grounded” in GLIP.
Therefore, GLIP aims to:
- Unify phrase grounding and object detection for large-scale pre-training.
- Provide a flexible framework for zero-shot object detection, where flexible means it is not restricted to a fixed set of classes.
- Build one pre-trained model that seamlessly transfers to various tasks and domains, in a zero-shot or few-shot manner.
What can you do with such a model? You could use text prompts to find objects or regions of interest within a given input image. And the best part: you are not restricted to pre-defined classes.
You could further process these detections (e.g. feed them into a tracking system) or create a custom dataset with certain classes of interest and use it to train your own supervised detection system. Not only could you cover rare or very specific classes, but you could also save a lot of time and money on the creation of manual labels. As we’ll see later, the authors of GLIP had a similar idea to boost the performance even further by introducing a teacher-student framework.
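To make this more tangible, here is a rough usage sketch of prompt-based zero-shot detection. It loosely follows the demo code in the official GLIP repository, but the config path, checkpoint name and exact class/method signatures are assumptions on my part — check the repository for the actual API.

```python
# Rough sketch (assumed paths and signatures, loosely based on the GLIP repo's demo code).
import cv2
from maskrcnn_benchmark.config import cfg
from maskrcnn_benchmark.engine.predictor_glip import GLIPDemo

cfg.merge_from_file("configs/pretrain/glip_Swin_T_O365_GoldG.yaml")            # assumed config path
cfg.merge_from_list(["MODEL.WEIGHT", "MODEL/glip_tiny_model_o365_goldg.pth"])  # assumed checkpoint

glip_demo = GLIPDemo(cfg, min_image_size=800, confidence_threshold=0.5)

image = cv2.imread("street_scene.jpg")                      # any BGR image
caption = "person. bicycle. traffic light. fire hydrant."   # free-form prompt, no fixed class list

# The prompt defines the "classes": the model returns boxes only for phrases it grounds.
result_image, top_predictions = glip_demo.run_on_web_image(image, caption, 0.5)
```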
GLIP has been adopted by many other projects and domains in deep learning. For example, GLIGEN (Grounded-Language-to-Image-Generation) uses GLIP to condition the image generation of a latent diffusion model and increase controllability. Furthermore, GLIP has been combined with other foundation models such as DINO (Self-Distillation with no Labels) and SAM (Segment Anything), resulting in GroundingDINO and Grounded-Segment-Anything respectively. GLIPv2 extends the initial GLIP model with vision-language understanding to not only improve phrase grounding but also enable visual question answering tasks.
- Large-scale pre-training for combined phrase grounding and object detection
- Providing a unified view on object detection and phrase grounding
- Deep cross-modality fusion to learn high-quality language-aware visual representations and to achieve superior transfer learning performance.
- Showing that prompt tuning is more effective in deep vision-language fusion networks (e.g. GLIP) than in shallowly fused networks (e.g. CLIP)
Having a rough idea of what can be done with GLIP, let’s take a closer look at the details of the paper.
Architectural Overview
On a high level, GLIP’s architecture is quite similar to CLIP’s in the sense that it also consists of a text encoder, an image encoder and some form of contrastive learning on the similarity of text and image features. The architecture of GLIP is shown in Fig. 2.
GLIP adds a language-image aware deep fusion module after the text and image encoders. This module performs cross-modal attention and extracts further features. A cosine similarity is calculated between the resulting region features and word features. During training, the similarity of matching pairs is maximized, while it is minimized for incorrect pairs. In contrast to CLIP, where the matching pairs lie on the diagonal of the similarity matrix, in GLIP the matching is not performed on sentence level but on (sub)word level, which usually results in off-diagonal positions.
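To illustrate what that means, here is a minimal sketch of mine (not the authors’ code) of computing region-word alignment scores; all shapes, the toy target and the loss at the end are illustrative assumptions.

```python
# Minimal sketch of region-word alignment: every region embedding is compared
# against every (sub)word embedding, so positive pairs can sit anywhere in the
# matrix, not only on the diagonal as in CLIP's image-sentence matching.
import torch
import torch.nn.functional as F

num_regions, num_words, dim = 100, 16, 256       # illustrative sizes
region_feats = torch.randn(num_regions, dim)     # from image encoder + deep fusion
word_feats = torch.randn(num_words, dim)         # from text encoder + deep fusion

# Cosine similarity = dot product of L2-normalized features.
region_feats = F.normalize(region_feats, dim=-1)
word_feats = F.normalize(word_feats, dim=-1)
alignment_scores = region_feats @ word_feats.T   # shape: (num_regions, num_words)

# target[i, j] = 1 if word j belongs to the phrase matched to region i, else 0.
# Training pushes matching pairs up and non-matching pairs down, e.g. with a
# binary cross-entropy (or focal) loss over these (possibly scaled) logits.
target = torch.zeros(num_regions, num_words)
target[0, 3:5] = 1.0                             # toy example: region 0 ↔ words 3-4
loss = F.binary_cross_entropy_with_logits(alignment_scores, target)
```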
Phrase Grounding Formulated as an Object Detection Problem
The authors noted that the problem of phrase grounding (= associating phrases with objects/regions in an image) can be formulated as an object detection objective, where the standard loss objective is the sum of a classification loss and a localization loss: L = L_cls + L_loc.
The localization loss is concerned with the quality of the predicted bounding box, which, depending on the format, can be the size and location of the box. The classification loss is the key part of the unification. By computing the classification logits from the similarity scores between text and image features instead of from a fixed image classifier, the same loss objective can be used for training.
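The following sketch contrasts the two classification heads under assumed shapes; it is meant to show the unification idea, not the paper’s implementation.

```python
# A standard detector scores each region against C fixed classes with a learned
# classifier W; GLIP instead scores each region against the word embeddings of
# the prompt, so the "classes" are just phrases in the text.
import torch

num_regions, dim = 100, 256
region_feats = torch.randn(num_regions, dim)

# Classic object detection: the class set is baked into the weight matrix.
num_classes = 80
W = torch.randn(num_classes, dim)
cls_logits = region_feats @ W.T                  # (num_regions, num_classes)

# GLIP-style grounding: the "classifier" comes from the prompt's word features,
# so the label space changes with the text while the loss stays the same.
word_feats = torch.randn(16, dim)                # encodes e.g. "person. bicycle. dog."
grounding_logits = region_feats @ word_feats.T   # (num_regions, num_words)
```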
Different Model Variants
Five different models are trained to show the effect of the authors’ design choices and model scale:
Teacher-Student Pre-Training
To boost the performance of GLIP, the authors train the GLIP-T (C) model (see Fig. 3) on human-annotated data, called GoldG, and use it to generate grounding data from text-image pairs from the web. They call this model the teacher model and subsequently train a student model, feeding it the data used to train the teacher plus the data the teacher generated. See Fig. 4 for an illustration.
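As a rough, self-contained sketch of this data flow (all names are placeholders I made up, not the authors’ training script):

```python
# Placeholder sketch of the teacher-student data flow described above.
from typing import Callable, List, Tuple

Example = Tuple[str, str]        # (image_path, caption) scraped from the web
Labeled = Tuple[str, str, list]  # (image_path, caption, grounded boxes/phrases)

def self_training_data(
    gold_data: List[Labeled],
    web_pairs: List[Example],
    teacher_predict: Callable[[str, str], list],
) -> List[Labeled]:
    """Build the student's training set: human-annotated grounding data (GoldG)
    plus web image-text pairs pseudo-labeled by the teacher."""
    pseudo = [(img, cap, teacher_predict(img, cap)) for img, cap in web_pairs]
    return gold_data + pseudo

# The student is then pre-trained on e.g.
# self_training_data(goldg, web_caption_pairs, teacher.predict)  # names assumed
```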
Note: Although the terms teacher and student are used, it is not the same process as in knowledge distillation, where a smaller student model is trained to match the output of a larger teacher model.
Interestingly, as we’ll see in the experiments, the student surpasses the teacher on many (but not all) datasets for both zero-shot and few-shot detection. Why is that? The paper hypothesizes that even though the teacher provides a prediction with low confidence (they call it an “educated guess”), it becomes the ground truth (they call it a “supervised signal”) in the generated dataset consumed by the student.
The GLIP paper presents various experiments and ablation studies, mainly concerned with:
- Zero-Shot Domain Transfer
- Data Efficiency
- Prompt Engineering
I have some doubts about some of the results and the way they are presented, and I’ll point them out in the annotations. I don’t want to diminish the achievements of GLIP, but rather view them with a critical eye.
Now let’s jump into the details!
Zero-Shot Domain Transfer
First, we’ll take a look at the results of the zero-shot domain transfer. In this task, the objective is to analyze how well the pre-trained GLIP models perform on datasets different from those used during pre-training (i.e. COCO and LVIS), and to compare them against a baseline of models that have been trained in a supervised fashion. Then, the pre-trained GLIP is further fine-tuned and evaluated on the dataset under test.
In Fig. 5 we see the results of the zero-shot domain transfer on COCO. We see that all GLIP models have better zero-shot performance than a supervised Faster RCNN. We are also presented with the result that GLIP-L outperforms the previous SOTA (at the time of the paper’s release). We also see that the larger student GLIP-L outperforms the teacher model GLIP-T (C).
Below, I list my doubts when reading these results and the claims made in the paper, where it is said that GLIP-L surpasses the best supervised model, SoftTeacher.
- The model with better metrics than SoftTeacher is GLIP-L, which is better by only 0.2 points. This small margin might not be the result of GLIP’s new method but could be due to differences in training hyperparameters.
- GLIP-L does not even use the data (Cap4M or Cap24M) generated by the teacher model, which the authors presented as a good solution.
- GLIP-L has been trained on a much larger corpus of training data than SoftTeacher.
In my opinion, the results comparing the different GLIP models and the DyHead-T they trained themselves are perfectly fine; I just have my doubts in general when different methods and models are compared under unclear or different constraints.
In Fig. 6, we see the zero-shot domain transfer performance on the LVIS dataset. We can see that the largest GLIP model, GLIP-L, outperforms all other presented supervised models.
Finally, GLIP’s phrase grounding performance on the Flickr30K entities dataset has been compared against MDETR (see Fig. 7). Both student models, GLIP-T and GLIP-L, surpass the MDETR baselines.
Data Efficiency
Another experiment is concerned with data efficiency. It aims to show how the performance (in terms of average precision) changes when fine-tuning a pre-trained model on a certain amount of task-specific data. In Fig. 8, the models are evaluated on 13 different datasets and their performance is reported as average precision averaged over these 13 datasets. Results are reported for 0-shot, 1-shot, 3-shot, 5-shot, 10-shot and “all”-shot (I doubt that’s an official term for full fine-tuning, but I guess you get the point 😅).
Prompt Engineering
Similar to CLIP, the authors also report a correlation between the model’s performance and the formulation of the input text prompt. They propose two methods to improve the performance of a pre-trained model without the need to retrain the model’s weights:
- Manual prompt tuning
- Prompt tuning
The idea of manual prompt tuning is to provide further context in the form of additional descriptive words, see Fig. 9:
Manual prompt tuning can always be used to improve performance, meaning it does not matter whether the model is fully fine-tuned or used in a zero-shot or few-shot scenario.
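As a made-up illustration (the concrete wording in the paper’s Fig. 9 may differ), such a manual prompt rewrite could look like this:

```python
# Made-up example of manual prompt tuning: same class, more descriptive context.
plain_prompt = "stingray"
tuned_prompt = "stingray, which is a flat, round fish with a long tail"
```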
The second approach, prompt tuning, requires access to ground truth labels of a downstream task and is especially suitable for scenarios where each detection task has a single prompt (e.g. “Detect car”). In that scenario, this prompt would first be translated into a feature embedding using the text encoder. Then, the image encoder and the deep fusion module are frozen and only the input embedding is optimized using the ground truth labels. The optimized embeddings would then serve as input to the model, and the text encoder could be removed.
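Here is a conceptual sketch of that procedure under assumed interfaces (e.g. a `grounding_loss` method that I made up); it is not the official implementation, but it shows that only the prompt embedding receives gradients:

```python
# Conceptual sketch of prompt tuning: the prompt embedding is the only trainable
# tensor; all model weights (image encoder, deep fusion, heads) stay frozen.
import torch

def prompt_tune(model, text_encoder, prompt, dataloader, steps=1000, lr=5e-2):
    # Initialize the learnable embedding from the text encoder once; after this,
    # the text encoder is no longer needed for this task.
    prompt_embedding = text_encoder(prompt).detach().clone().requires_grad_(True)

    for p in model.parameters():            # freeze all model weights
        p.requires_grad_(False)

    optimizer = torch.optim.AdamW([prompt_embedding], lr=lr)
    for _, (images, targets) in zip(range(steps), dataloader):
        loss = model.grounding_loss(images, prompt_embedding, targets)  # assumed API
        optimizer.zero_grad()
        loss.backward()                     # gradients flow only into the embedding
        optimizer.step()
    return prompt_embedding                 # reused as the task's fixed prompt at inference
```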
Fig. 10 shows the results of this prompt tuning for various GLIP models. When applied to models that have a deep fusion module, prompt tuning achieves almost the same performance as fine-tuning the model’s weights.
As mentioned at the beginning of this article, GLIP has been widely adopted by a large number of projects.
Following is a list of papers that built upon GLIP: