
The CLIP Encoder and Multimodal Bridges

Learn how the CLIP model connects text and image modalities by encoding them into a shared vector embedding space. Understand the architecture of its separate text and image encoders, the training process aligning embeddings semantically, and how this enables zero-shot learning for diverse tasks including classification and retrieval.

Modern GenAI systems increasingly work with more than one type of data. Text, images, audio, and video all carry information in different forms, yet many real-world tasks require models to reason across these modalities. For example, a system might need to determine whether an image matches a caption, retrieve images based on a text query, or decide whether an image violates a content policy described in words.

To make this possible, models need a way to connect different modalities at the semantic level. One of the most influential approaches to doing this is CLIP (Contrastive Language–Image Pretraining). Rather than merging text and images directly, CLIP learns how to represent both in a shared space where they can be meaningfully compared.

Let’s learn how that bridge is built.

Multimodal learning

Text and images are fundamentally different kinds of data. Text is discrete and sequential, composed of tokens arranged in a specific order. Images, on the other hand, are continuous and spatial, represented as grids of pixel values. Because of these differences, the techniques used to process text and images are usually very different as well.

Text is processed as ordered tokens, while images are processed as spatial pixel grids, making them not directly comparable

A language model processes sequences of tokens and learns relationships between words, phrases, and sentences. A vision model processes pixels and learns visual patterns such as edges, textures, objects, and scenes. These representations are not directly compatible. A sentence and an image cannot be compared in their raw forms.

This creates a core challenge for multimodal learning: “how can a model tell that a piece of text and an image refer to the same concept?”

The need for a common representation

To compare text and images, both must first be transformed into a format that supports comparison. In practice, this format is a vector embedding, which is a fixed-length numerical representation that captures semantic meaning.

The key idea behind CLIP-style models is simple but powerful: instead of directly merging text and image data, each modality is encoded separately into a vector, which is then placed in the same (shared) embedding space.

In this shared space:

  • Text descriptions and images that describe the same concept are close together.

  • Unrelated text and images are far apart.

Once both modalities live in the same space, matching becomes a geometric problem rather than a symbolic one. CLIP uses two separate encoders, one for text and one for images. Each encoder is specialized for its input modality and produces a vector embedding as output.

Text and images are encoded separately into vectors and compared in a shared embedding space

The text encoder takes tokenized text as input and processes it using a transformer-based architecture. Its output is a single vector that represents the semantic meaning of the entire text prompt, not individual words.

The image encoder takes image pixels as input and processes them using a vision model, such as a convolutional neural network (CNN) or a Vision Transformer (ViT). Its output is a single vector that captures the image’s semantic content.

Although the encoders are different internally, they are designed so that their outputs have the same dimensionality. This allows both text and image embeddings to exist in the same vector space.

Importantly, the encoders do not interact directly. They are only connected through the training objective, which encourages their outputs to align semantically.
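To make the two-encoder design concrete, here is a minimal PyTorch-style sketch (not CLIP’s actual implementation). The `text_encoder` and `image_encoder` arguments are placeholders for any transformer or vision backbone; a linear projection maps each output to the same embedding size so both modalities can live in one space.

```python
import torch
import torch.nn as nn

class TwoTowerSketch(nn.Module):
    """Sketch of a two-encoder (dual-tower) design: each modality has its own
    encoder, and a projection maps both outputs to a shared embedding size."""

    def __init__(self, text_encoder, image_encoder, text_dim, image_dim, embed_dim=512):
        super().__init__()
        self.text_encoder = text_encoder      # placeholder: e.g. a transformer over tokens
        self.image_encoder = image_encoder    # placeholder: e.g. a ViT or CNN over pixels
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.image_proj = nn.Linear(image_dim, embed_dim)

    def encode_text(self, token_ids):
        features = self.text_encoder(token_ids)        # (batch, text_dim)
        embeddings = self.text_proj(features)          # (batch, embed_dim)
        return nn.functional.normalize(embeddings, dim=-1)

    def encode_image(self, pixels):
        features = self.image_encoder(pixels)          # (batch, image_dim)
        embeddings = self.image_proj(features)         # (batch, embed_dim)
        return nn.functional.normalize(embeddings, dim=-1)
```

Because both `encode_text` and `encode_image` return unit-length vectors of the same size, their outputs can be compared directly with a dot product.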

Joint embedding space

The central idea behind CLIP is the joint embedding space. This is a vector space in which both text and images are represented as numerical vectors of the same size. What makes this space useful is not its dimensionality but the structure it acquires through training.

To understand this, consider the following image–text pair from the training data:

  • Image: A photograph of a dog playing in the snow.

  • Text: “A dog in the snow.”

During training, the image is passed through the image encoder, producing an image embedding. The text is passed through the text encoder, producing a text embedding. At this point, the model does not yet know whether these two embeddings should be close or far apart.

During training, matching image–text pairs are pulled closer in the joint embedding space, while mismatched pairs are pushed farther apart

The training objective tells the model that this image and this text belong together. As a result, the model adjusts both encoders so that:

  • The image embedding moves closer to the text embedding, and

  • The text embedding moves closer to the image embedding.

At the same time, the model is shown many non-matching examples. For instance:

  • The same image paired with “A cat sleeping on a couch,” or

  • The same text paired with an image of a snowy mountain with no dog.

For these mismatched pairs, the training objective pushes the corresponding embeddings farther apart. Over time, this process shapes the embedding space so that semantic similarity corresponds to geometric proximity.
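A minimal sketch of this kind of symmetric contrastive objective is shown below, assuming a batch in which the i-th image and the i-th text form a matching pair. The fixed `temperature` value is a simplification; the actual CLIP model learns this scale as a trainable parameter.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric contrastive loss over a batch of matching image-text pairs.
    Matching pairs sit on the diagonal of the similarity matrix; the loss
    pulls those together and pushes all other combinations apart."""
    # Normalize so that dot products equal cosine similarities.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # (batch, batch) matrix of cosine similarities, scaled by the temperature.
    logits = image_embeds @ text_embeds.t() / temperature

    # The i-th image matches the i-th text: targets are the diagonal indices.
    targets = torch.arange(logits.shape[0], device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Minimizing this loss is what pulls matching pairs together and pushes mismatched pairs apart across the whole batch at once.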

What the joint space looks like

After training, the joint embedding space develops structure. Images and text describing similar concepts cluster together, even when they come from different modalities.

The joint embedding space contains vectorized representations of both text and images

For example:

  • Images of dogs, sketches of dogs, and the text “a dog” all occupy nearby regions.

  • Text like “a red sports car” ends up close to images of red cars, even if that exact pairing was never seen during training.

  • Unrelated concepts, such as “a bowl of soup” and images of airplanes, end up far apart.

The model does not store explicit labels such as “dog” or “car.” Instead, these concepts emerge as regions in the embedding space shaped by large-scale alignment.

Matching vectors

Once text and images live in the same embedding space, matching becomes straightforward. The system computes a similarity score between vectors, most commonly using cosine similarity.

To see how this works in practice, consider a simple scenario. Suppose we have:

  • One image embedding (an image of a dog in snow)

  • Three text captions:

    • “A dog playing in the snow.”

    • “A cat sitting indoors.”

    • “A red sports car.”

Assume the image encoder produces the following embedding for an image of a dog playing in the snow:

Image embedding = [0.8, 0.6]

Now consider three text embeddings produced by the text encoder:

"A dog playing in the snow" = [0.82, 0.58]
"A cat sitting indoors" = [0.2, 0.9]
"A red sports car" = [-0.6, 0.1]

Cosine similarity measures how closely two vectors point in the same direction. Assuming the two vectors are called $A$ and $B$, cosine similarity is defined as:

$$\text{cosine similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

Where:

  • $A \cdot B$ is the dot product of the two vectors.

  • $\|A\|$ and $\|B\|$ are their magnitudes.

This formula measures the angle between two vectors, not their absolute size.

  • A value close to 1 means the vectors point in the same direction (high semantic similarity).

  • A value close to 0 means the vectors are largely unrelated.

  • A negative value means the vectors point in opposite directions.

When we calculate the cosine similarities in the above example, we get:

Text Caption                  | Cosine Similarity | Interpretation
“A dog playing in the snow”   | 0.9996            | Very strong match
“A cat sitting indoors”       | 0.76              | Weak/partial match
“A red sports car”            | -0.69             | No match
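These values can be reproduced with a few lines of NumPy, using the toy two-dimensional embeddings from the example above:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

image_embedding = np.array([0.8, 0.6])   # the dog-in-snow image

captions = {
    "A dog playing in the snow": np.array([0.82, 0.58]),
    "A cat sitting indoors":     np.array([0.2, 0.9]),
    "A red sports car":          np.array([-0.6, 0.1]),
}

for caption, text_embedding in captions.items():
    score = cosine_similarity(image_embedding, text_embedding)
    print(f"{caption}: {score:.4f}")
# A dog playing in the snow: 0.9996
# A cat sitting indoors: 0.7593
# A red sports car: -0.6905
```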

This same comparison is repeated at scale across thousands or millions of image–text pairs, enabling retrieval, ranking, and zero-shot classification using a single unified mechanism.

This same mechanism works in reverse. Given a text query, the system can retrieve the most relevant images by ranking image embeddings based on their similarity to the query.
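Here is a small sketch of that reverse direction, reusing the toy vectors above: given one text query embedding, score every image embedding in a collection and sort the results.

```python
import numpy as np

def rank_images(text_embedding, image_embeddings):
    """Return image indices sorted from most to least similar to the text query."""
    # Normalize so that dot products equal cosine similarities.
    text = text_embedding / np.linalg.norm(text_embedding)
    images = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    scores = images @ text
    return np.argsort(-scores), scores

query = np.array([0.82, 0.58])                       # "A dog playing in the snow"
gallery = np.array([[0.8, 0.6],                      # dog in snow
                    [0.2, 0.9],                      # cat indoors
                    [-0.6, 0.1]])                    # red sports car
order, scores = rank_images(query, gallery)
print(order)   # [0 1 2] -> the dog-in-snow image ranks first
```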

Zero-shot capabilities

One of the most powerful consequences of this setup is zero-shot learning. The model can perform tasks it was never explicitly trained for, simply by relying on similarity in the joint embedding space.

Assume we want to classify an image, but we have no task-specific classifier. Instead, we define class labels as text:

  • “a photo of a dog”

  • “a photo of a cat”

  • “a photo of a car”

Each label is passed through the text encoder to produce a text embedding. The image is passed through the image encoder to produce an image embedding. The image is then assigned the label whose text embedding is most similar to the image embedding.

Zero-shot classification using CLIP

No retraining is required. Adding a new class is as simple as adding a new text description.
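As an illustration, here is one way to run this kind of zero-shot classification with a pretrained CLIP checkpoint, assuming the Hugging Face transformers library is installed and the openai/clip-vit-base-patch32 weights are available. The image path is a placeholder.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pretrained CLIP checkpoint (assumed available for download).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
image = Image.open("dog_in_snow.jpg")   # placeholder path

# Encode the labels and the image together; the model compares them internally.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# One row per image, one column per label: higher score = better match.
probs = outputs.logits_per_image.softmax(dim=-1)
predicted = labels[probs.argmax(dim=-1).item()]
print(predicted)   # expected: "a photo of a dog"
```

Adding a fourth class means adding a fourth string to `labels`; no weights change.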

Why this works

This works because CLIP does not learn tasks. It learns relationships between modalities. Classification, retrieval, filtering, and moderation all reduce to the same operation: comparing vectors in a shared space.

This is why CLIP-style models are widely used as foundational components in multimodal systems. They provide a flexible semantic bridge between text and images without requiring task-specific supervision.

Conclusion

CLIP connects text and images by encoding each into vectors and placing them in a shared embedding space. During training, matching image–text pairs are pulled closer together, while unrelated pairs are pushed apart. At inference time, matching reduces to computing cosine similarity between vectors.

This alignment enables zero-shot capabilities: images can be classified, retrieved, or filtered using only text descriptions, without task-specific training. By learning relationships rather than tasks, CLIP serves as a general-purpose bridge between language and vision.