TL;DR
In this final blog post of the series, I would like to put everything project-related together and illustrate the following:
- Motivation: CLIP struggles when there are multiple entities, relations, and attributes in the example.
- Intuitions: The classification power of the final dot-product retrieval layer in standard contrastive models is limited by the VC-dimension of their query and key representations.
- Setup: Compare contrastive models with non-contrastive models on extensions of the well-known SET card game.
- Results: Non-contrastive models with HALF the number of parameters can solve games that contrastive models cannot.
- Analysis: Contrastive models become less able to develop preferences among keys as the vector representation dimension decreases.
- Takeaways: Vector representation dimension matters in standard contrastive models, especially when it comes to solving compositional problems in which dynamic queries are expected to extract numerous random subsets of the keys.
- Full resolution slides are here. Presentation recording is here. Code will be released shortly.
The Motivation
CLIP’s Failure Mode
Consider the following text prompt, which asks for a red cube on top of a green cube on top of a blue cube. The image forms a natural pairing with the text.

If I shuffle the colors in both the text and the image, I get another pairing.

If I continue this procedure to generate all six pairs of texts and images, and then feed them to CLIP, I expect CLIP to predict a 6 by 6 matrix with strong values along the diagonal. But the predicted logits are surprisingly even. It seems that when there are multiple entities, relations, and attributes in the example, CLIP struggles.

(FYI) In case this example looks familiar to anyone who follows OpenAI’s work, this text prompt came from the DALL-E blog. DALL-E (a generative model) also has a similar problem.
The Intuitions
The Dot-product Retrieval Layer in Contrastive Models
My intuitions for why the above happens have to do with the dot-product retrieval layer in standard contrastive models. This dot-product is taken between the query and key representation vectors. We can view this layer as performing linear classification, in which a linear boundary Q (the query) is encoded to separate the data points Ks (the keys). This image shows one particular K.

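Concretely, here is a minimal NumPy sketch of this view (the sizes and random vectors are illustrative, not from my experiments): the query plays the role of a linear boundary, and each key falls on one side of it.

```python
import numpy as np

# Minimal sketch: the dot-product retrieval layer as a linear classifier.
# Sizes and vectors are illustrative only.
rng = np.random.default_rng(0)
d, n_keys = 8, 6

keys = rng.normal(size=(n_keys, d))   # one row per key representation
query = rng.normal(size=(d,))         # a single query representation

scores = keys @ query                 # one dot-product per key
predicted_positive = scores > 0       # the query acts as a linear boundary
print(predicted_positive)
```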
Intuition 1. Vapnik-Chervonenkis (VC) Dimension
Under this linear classification perspective, the VC-dimension of the query and key vector representations limits the classification power. Suppose we have 6 keys; one query may match 3 of them as positive keys.

But there are 2^6 ways to assign some keys as positive and the remaining as negative, and matching all of them requires our hypothesis class H to produce a corresponding query for each of the 2^6 assignments. If H can indeed do that, we say that H can shatter these six keys. In general, if our query and key representations are limited to vector dimension d, VC theory states that they cannot match all possible subsets of N keys once N exceeds d.
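For reference, the relevant hypothesis class here is linear classifiers through the origin, and their VC-dimension is a standard textbook fact (my restatement, using the H, q, k, d above):

```latex
% Standard fact (my restatement, not from the post): linear classifiers
% through the origin in R^d have VC-dimension exactly d.
\[
  H = \{\, k \mapsto \operatorname{sign}(q^{\top} k) \;:\; q \in \mathbb{R}^{d} \,\},
  \qquad
  \mathrm{VCdim}(H) = d .
\]
% Consequence: once the number of keys N exceeds d, no choice of queries can
% realize all 2^N positive/negative assignments, i.e., the keys cannot be shattered.
```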

In datasets like ImageNet, there are only 1000 labels. In the contrastive learning setup, this corresponds to 1000 (so few!) static queries, for example, match all the dogs, or match all the sailboats. The models’ classification power is far from hitting the limit imposed by the vector representations’ VC-dimension. But in datasets like CLEVR, there are numerous dynamic queries, such as match scenes with exactly one cylinder, or match scenes with more than one blue object. Here, the queries extract subsets in numerous dynamic ways, so we may hit the limit imposed by the vector representations’ VC-dimension.

Intuition 2. Poor Rank Approximation
My secondary intuition is related to the poor approximation of full-rank matrices. Suppose we have a full-rank query-key distribution matrix M, and the number of keys is smaller than the number of queries, so M has rank equal to the number of keys. On the other hand, the dot-product retrieval layer in contrastive models can be viewed as reconstructing this matrix as q·k, one position at a time. Considering all the positions in this matrix, taking all the dot-products is equivalent to performing a matrix multiplication between Q (the matrix stacking all the queries, each with vector representation dimension d) and K^T (the transposed matrix stacking all the keys, each with vector representation dimension d). The product QK^T has rank at most d, which can be well below the rank of M. Therefore, as the vector representation dimension d of the queries and keys decreases, QK^T becomes a poorer approximation of M, from which the queries and keys were sampled.
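A quick NumPy sketch of this effect (the matrix M and the sizes are made up; the truncated SVD gives the best possible rank-d reconstruction, so any QK^T built from d-dimensional representations can only do worse):

```python
import numpy as np

# Sketch: how well can a rank-d product approximate a full-rank query-key
# matrix M? Sizes and M are illustrative only.
rng = np.random.default_rng(0)
n_queries, n_keys = 64, 27
M = rng.normal(size=(n_queries, n_keys))       # full rank (27) almost surely

# Best rank-d approximation via truncated SVD (Eckart-Young bound).
U, s, Vt = np.linalg.svd(M, full_matrices=False)
for d in (27, 20, 8, 4):
    M_d = (U[:, :d] * s[:d]) @ Vt[:d, :]
    err = np.linalg.norm(M - M_d) / np.linalg.norm(M)
    print(f"d={d:2d}  relative reconstruction error = {err:.3f}")
```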

Setup
Model Architecture
To validate my intuitions, I set up two architectures. One is a contrastive model which encodes queries and keys separately and then scores their compatibility using the dot-product retrieval layer. The other is a non-contrastive model which scores each query-key pair as a continuous sequence.

The contrastive model uses an 8-layer transformer to encode the query symbols and an embedding lookup to encode the key symbols. On the other hand, the non-contrastive model uses a 4-layer transformer to encode the concatenated query-key symbols. The contrastive model has 17M parameters, while the non-contrastive model has only HALF of that. The goal of my experiments is to show that contrastive models with limited vector representation dimensions are worse than non-contrastive models, which use only half the number of parameters.
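For concreteness, here is a rough PyTorch sketch of the two scoring paths. The layer counts match the description above; everything else (model width, pooling, vocabulary handling) is my own simplification rather than the exact implementation.

```python
import torch
import torch.nn as nn

class ContrastiveScorer(nn.Module):
    """Encode query symbols with a transformer, keys with an embedding lookup,
    then score compatibility with a dot-product retrieval layer."""
    def __init__(self, vocab_size, n_keys=27, d_model=512, d_repr=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=8)
        self.to_query = nn.Linear(d_model, d_repr)      # query representation
        self.key_table = nn.Embedding(n_keys, d_repr)   # key representations

    def forward(self, query_tokens):                    # (batch, seq)
        h = self.encoder(self.embed(query_tokens)).mean(dim=1)
        q = self.to_query(h)                            # (batch, d_repr)
        return q @ self.key_table.weight.T              # (batch, n_keys) scores

class NonContrastiveScorer(nn.Module):
    """Score each query-key pair by encoding it as one continuous sequence."""
    def __init__(self, vocab_size, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.score = nn.Linear(d_model, 1)

    def forward(self, query_key_tokens):                # (batch, seq) = query ++ key
        h = self.encoder(self.embed(query_key_tokens)).mean(dim=1)
        return self.score(h).squeeze(-1)                # (batch,) compatibility
```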

The SET card game
To set up the task, I borrowed the well-known SET card game. It suits my purpose because of some nice, extendable properties. In the original card deck, each card has four attributes and three possible values per attribute. These two dimensions are scalable.

Any pair of cards can form a query that evaluates to a key card. A complete set has three cards. The game rules state that, for each attribute, the three cards must either all have the same value or all have different values.
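As a quick illustration, here is a sketch of that rule in Python (the card encoding is mine; cards are tuples of attribute values):

```python
def is_set(card_a, card_b, card_c):
    # For every attribute, the three values must be all the same (1 distinct
    # value) or all different (3 distinct values).
    return all(len({a, b, c}) in (1, 3) for a, b, c in zip(card_a, card_b, card_c))

# Using the 3-attribute, 3-value deck from the example game described below,
# with each attribute encoded as 0/1/2:
print(is_set((0, 1, 2), (1, 2, 0), (2, 0, 1)))  # True: every attribute all-different
print(is_set((0, 0, 0), (0, 0, 1), (0, 0, 2)))  # True: same, same, all-different
print(is_set((0, 0, 0), (0, 0, 1), (0, 1, 2)))  # False: second attribute is 0, 0, 1
```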

The corresponding query-key distribution matrix has some regular pattern but is still full-rank.

To extend this game so that the VC-dimension intuition can be put to the test, I introduced the * regular expression, which lets queries evaluate to subsets of key cards. In addition, I introduced the set union and symmetric difference operators to make the queries and their returned subsets even more dynamic.
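To illustrate how the operators act once each card pair has been evaluated, here is a toy sketch (the per-pair subsets below are made up; in the real game each one comes from evaluating a pair, possibly containing the * wildcard, against the deck):

```python
from functools import reduce

# Made-up key subsets, one per card pair in a query.
pair_subsets = [{0, 1, 5}, {1, 2}, {5, 9, 12}, {3}]

union_result = reduce(set.union, pair_subsets)                    # set union
symdiff_result = reduce(set.symmetric_difference, pair_subsets)   # symmetric difference

all_keys = set(range(27))
positives = union_result
negatives = all_keys - positives
print(len(positives), "positive keys,", len(negatives), "negative keys")
```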

The Extended SET card game — How to play
In the following, I will show an example game and how the models play it. Here’s a simpler deck of 27 cards. Each card has three attributes (color, count, and shape) and three possible values per attribute.

Each query has eight pairs of cards and set union operators between them. Depending on the query, the matching subset of keys varies. This particular query returns 13 matching (positive) keys and 14 non-matching (negative) keys. A different query may have a smaller or larger subset of keys, such as 3+ 24-, or 21+ 6-.

A training example is a tuple of query symbols, and a key symbol sampled from the matching positive keys. We can sample different queries and their keys this way to make a batch.

The contrastive model uses the infoNCE training objective, which normalizes dot-products over all the query-key pairings in this batch-size by batch-size matrix. The scores are penalized across both columns and rows, just like CLIP.

On the other hand, the non-contrastive model uses the conventional Cross-Entropy objective, which normalizes scores across the keys in the support. The scores are penalized across rows only (each example is considered independently).
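Here is a rough sketch of the two objectives on illustrative tensor shapes (the function names and shapes are mine, not the exact implementation):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(scores_batch):
    """Contrastive objective on a (B, B) matrix of dot-products between B
    queries and their B sampled keys; symmetric over rows and columns, as in CLIP."""
    targets = torch.arange(scores_batch.size(0))
    loss_q2k = F.cross_entropy(scores_batch, targets)      # across rows
    loss_k2q = F.cross_entropy(scores_batch.T, targets)    # across columns
    return 0.5 * (loss_q2k + loss_k2q)

def support_ce_loss(scores_support, positive_key_ids):
    """Non-contrastive objective on (B, 27) scores of each query against every
    key in the support; each example is treated independently."""
    return F.cross_entropy(scores_support, positive_key_ids)
```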

At test time, the models are given queries such as this one, and the models score the compatibility of this query with each of the 27 keys in the support. Note that at convergence, the contrastive model scores and the non-contrastive scores have different interpretations in terms of probability (see earlier posts about infoNCE).

There are 13 ground-truth keys for this query so one crude metric is to measure how much the top 13 model predictions overlap with the 13 ground-truth keys. For now, I call this metric the Top K Prediction-Ground-truth Overlap.

A more sensitive metric is to compare the KL divergence between the normalized prediction scores and a ground-truth distribution constructed by dividing 1.0 evenly among all the ground-truth keys.
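A sketch of both metrics for a single query (the helper names and the softmax normalization choice are mine):

```python
import numpy as np

def top_k_overlap(scores, positive_ids):
    # Fraction of the top-k predicted keys that are ground-truth keys (k = #positives).
    k = len(positive_ids)
    top_k = np.argsort(scores)[-k:]
    return len(set(top_k) & set(positive_ids)) / k

def kl_to_ground_truth(scores, positive_ids, n_keys=27):
    # KL( ground-truth || prediction ), with 1.0 split evenly over the positive keys.
    pred = np.exp(scores - scores.max())
    pred /= pred.sum()
    truth = np.zeros(n_keys)
    truth[list(positive_ids)] = 1.0 / len(positive_ids)
    mask = truth > 0
    return float(np.sum(truth[mask] * np.log(truth[mask] / pred[mask])))
```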

Putting it all together, I have seven contrastive models with vector representation dimensions ranging from 512 down to 4, all with about 17M parameters, plus a non-contrastive model with only half the number of parameters. On the other hand, I have five different games based on a 27-card deck, from the original SET game to the extensions mentioned above, and a final game of shattering 27 numbers. These five games have increasing levels of difficulty, characterized by the total number of unique key subsets that the game queries should extract. I tried each game on all eight models.

Results, Analysis
27-card Results
Here’s the results plot. On the y-axis, we have the KL Divergence Loss. On the x-axis, we have those five games ordered from left to right by increasing level of difficulty, quantified by the total number of unique key subsets that the game queries should extract, measured in powers of two.

The first game on the left is the original SET. It requires the queries to extract 2^4.57 subsets of keys in total. All the models did very well.

The second game includes a * regular expression and requires the queries to extract 2^6 subsets of keys. We can see the contrastive model with vector dimension 4 starts to do poorly.

As we further introduce the set union and symmetric difference operators, the queries need to extract between 2^21 and 2^23 subsets of keys. We can see the contrastive models with vector dimensions 8 to 20 also perform worse and worse.

Finally, on shattering 27 numbers, the vectors need to satisfy a VC-dimension of at least 27 so that the queries can extract all 2^27 subsets of keys. We can see that all models with vector dimensions less than 27 fail to different extents.

Notice that throughout these five games, three models perform consistently well. They are the contrastive models with vector dimensions 512, 27, and the non-contrastive model with only half the parameters.

On the other hand, the Top K Prediction-Ground-truth Overlap metric shows a similar, agreeing trend. Higher is better, so this plot looks like the KL Divergence Loss plot flipped upside down.

Error Analysis (27 cards)
To understand why the models with smaller vector dimensions performed worse, I zoomed in on the third game (Extend w/ Union Op). Here’s a representative query from the erroneous predictions. It has 13 positive keys and 14 negative keys. A perfect model would normalize the dot-products and distribute the probability mass evenly among the positive keys, with nothing on the negative keys. The contrastive models with vector representation dimensions 512 and 27 are not far from that.

But as we drop the vector dimension to 20 or below, we start to see mass moving to the negative keys. By dimension 4, the predictions are about the same as an even distribution over all keys. It seems that as we drop the vector representation dimension, the models become less able to develop a preference among the keys. We can use prediction entropy as a measure of this phenomenon.
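Concretely, the measure is just the entropy of each query’s normalized prediction (a short sketch; the softmax normalization is my assumption):

```python
import numpy as np

def prediction_entropy(scores):
    # Entropy of the normalized prediction; higher means the model expresses
    # less preference among the keys.
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return float(-np.sum(p * np.log(p)))
```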

(Left) What we saw from that one example generalizes to all the queries that have 10-20 positive keys: the average prediction entropy increases as the vector dimension decreases. Notice that queries in this game return up to 2^21 subsets of keys, and starting at dimension 20 the entropy grows more dramatically. (Right) This trend is also reflected in the aggregate of all queries in the test set. Prediction entropy increases monotonically as the vector dimension decreases.

81 cards Results
I trained an 81-card version of this particular game as well. The setup is similar. The contrastive models ranged from vector representation dimension 512 down to 16, all with about 17M parameters. The non-contrastive model is only half the size.
On the plots, the trends are similar. The KL Divergence Loss decreases as the vector dimension increases. The Top K Prediction-Ground-truth Overlap metric increases with the vector dimension. Note that the game has 81 cards, so in theory, with a vector dimension of 81 or above, our models should solve any game (even shattering 81 numbers) perfectly. We do see agreeing trends here.

Thus far, these small game results validate my initial intuition. Solving compositional problems requires the models to encode compositional representations. One way to evaluate this quality is by testing whether models can effectively encode queries that can extract key subsets dynamically.
Training, Optimization
Here are a few things I learned from training these models:
1) More difficult games (longer queries, more logical operators, more subsets to be extracted) require gentler learning rate schedules. Models were trained with the Adam optimizer using the standard transformer schedule. Warm-up steps were searched on a log scale. Cosine schedules helped to unstick models from local minima in some cases.
2) Efficiency of convergence depends on the initialization scheme. Embeddings for the query symbols in the contrastive models and embeddings for the query-key symbols in the non-contrastive models were initialized with the standard transformer scheme, while the key embedding lookup was initialized with a uniform distribution whose bound is the reciprocal of the vector representation dimension (a minimal sketch follows after this list). I also tried the Word2Vec initialization, but convergence was slower.
3) Using plain dot-product, instead of cosine similarity with or without temperature, works better for this data.
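Here is what the key-embedding initialization from point 2) looks like in PyTorch (the sizes are illustrative):

```python
import torch.nn as nn

d, n_keys = 64, 27                      # illustrative sizes
key_table = nn.Embedding(n_keys, d)
# Uniform init with bound 1/d, the reciprocal of the representation dimension.
nn.init.uniform_(key_table.weight, a=-1.0 / d, b=1.0 / d)
```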
Overall, I think these modifications are more specific to the synthetic games’ nature than to contrastive models in general.
The ‘Thank You’s
I want to thank OpenAI for this opportunity. The program was fun, intense, and rewarding. I came in with a few topics of interest (reasoning, compositionality, multimodal), and my mentor Gabriel Goh helped me to crystallize them into concrete research questions. We explored alternatives, conducted sign-of-life experiments, designed multiple card games, and iterated until we saw something promising. All the lessons were worth the effort. Gabe provided instrumental guidance and lots of valuable feedback along the way. I also want to thank my fellow scholars, who provided tremendous support. I feel privileged to work with such talented and perseverant colleagues:
- Christina Kim
- Danielle Ensign
- Ellie Kitanidis
- Florentine (Tyna) Eloundou
- Jonathan Ward
- Kudzo Ahegbebu
- Sam Gbafa
- Shola Oyedele
Please see this OpenAI blog post to read about their work:
https://openai.com/blog/openai-scholars-2021-final-projects/