Abstract: Vision-language models such as CLIP have boosted the performance of open-vocabulary object detection, where the detector is trained on base categories but required to detect novel categories ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results