Comment by Eisenstein

CLIP and YOLO work completely differently and serve different purposes. CLIP encodes text and images into a shared embedding space (its text encoder is a transformer), so it can compare text with images, which makes it useful for classification. YOLO uses a CNN, is trained on images annotated with bounding boxes, and is used for object detection.

Give an image to CLIP and you can measure the similarity between the image and a sentence like 'a vase with roses in it'. With YOLO, you give it an image and get back the coordinates of bounding boxes around the vase and around the roses.
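To make the difference in outputs concrete, here is a rough, standard-library-only sketch. It does not run either model: the embedding vectors, labels, confidences, and box coordinates are all made-up toy values, and the names `rank_captions` and `Detection` are hypothetical helpers invented for this illustration. It only shows the shape of what each kind of model hands back: a similarity score per caption versus a list of labeled boxes.

```python
import math
from dataclasses import dataclass


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_captions(image_vec, captions):
    """CLIP-style use: score every caption against one image embedding."""
    scores = {text: cosine_similarity(image_vec, vec) for text, vec in captions.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy vectors standing in for real encoder outputs (assumed values, not model output).
image_embedding = [0.9, 0.1, 0.3]
caption_embeddings = {
    "a vase with roses in it": [0.8, 0.2, 0.4],
    "a dog on a beach": [0.1, 0.9, 0.2],
}


@dataclass
class Detection:
    """YOLO-style result: one labeled box per detected object."""
    label: str
    confidence: float
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels


# Hand-written detections standing in for a detector's output (assumed values).
detections = [
    Detection("vase", 0.91, (120, 200, 260, 420)),
    Detection("roses", 0.84, (110, 60, 280, 230)),
]

for caption, score in rank_captions(image_embedding, caption_embeddings):
    print(f"{score:.3f}  {caption}")

for det in detections:
    print(det.label, det.confidence, det.box)
```

The first loop answers "how well does this sentence describe the image?", while the second answers "what objects are in the image, and where?". With real models, the embeddings would come from CLIP's image and text encoders and the detections from a trained detector.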