Comment by y04nn
How does CLIP compare to YOLO[1]? I haven't looked into image classification/object recognition for a while, but I remember that YOLO was quite good was working on realtime video too.
How does CLIP compare to YOLO[1]? I haven't looked into image classification/object recognition for a while, but I remember that YOLO was quite good was working on realtime video too.
CLIP and YOLO work completely differently and have different purposes. CLIP uses transformers and embeddings and can compare text with images for classification. YOLO using a CNN and is trained with bounding boxes on images and is used for image recognition.
Give an image to CLIP and you can compare the similarity between the image and a sentence like 'a vase with roses in it'. Whereas with YOLO you give it an image and get the coordinates of bounding boxes around a vase, and around roses.