TL;DR

In the past couple of weeks, I took variable binding as a starting point and studied two relevant topics for the project:

- Vapnik-Chervonenkis(VC) dimension
- Contrastive learning for classification

Intuition

I am interested in measuring and demonstrating the limits of standard representation techniques to effectively encode inputs with high compositionality for downstream reasoning tasks. One such representation technique maybe Contrastive Learning for VC-dimension related reasons.

Vapnik-Chervonenkis(VC) dimension

Many common forms of contrastive learning involve encoding unimodal or cross-modal inputs into query and key vectors, scoring the query-key pairs using a similarity function, and applying a loss function to maximize the mutual information for matching pairs of queries and keys. For the similarity function, dot-product or cosine similarity (normalized dot-product) are most commonly used. This operation is where things get a little interesting because we can view this dot-product as performing linear binary classification. Each key vector is a data point, and each query vector is a linear classifier. The query vector draws a boundary to separate the matching key vectors from the non-matching key vectors.

The VC-theory states that the set of all linear classifiers (each with a bias term) for data points in dimension , can shatter at most data points. Let’s unpack this statement a bit:

**Shatter**: If there are data points in a binary classification problem, each of them can be labeled as or , so there are possible label assignments. For each of these label assignments, if there exists a linear classifier in such that no classification error is made, then can shatter these data points. Even if there’s only one particular label assignment which no classifiers in can classify without any error, cannot shatter these data points.

**Shatter at most data points**: If there exists at least one dataset — any dataset up to our choosing, in dimension of data points which can shatter, AND there exist no datasets at all in dimension of data points which can shatter, then the VC-dimension of is . The VC-theory says that . Lecture 7 of Professor Yaser Abu-Mostafa’s course at Caltech walks through a simple proof.

Putting this VC-theory together with contrastive learning, the VC-dimension poses some sort of limit on contrastive learning performance because of the latter’s use of the dot-product. If our query and key vectors are in representation space , for any dataset with more than data points, there exists at least one subset of keys that we cannot learn a query to match exclusively.

One important follow-up question is: Do we ever run into this limit in practice? Contrastive learning has achieved success for image and text representation learning, and even when we used these learned representations directly for classification. For example, CLIP is showing satisfactory zero-shot image classification results. Let’s reason about this more: shattering a dataset means drawing boundaries that separate ALL possible subsets of data. For image or image-text classification tasks, some subsets are more natural than the others—we group canines under one label, primates under one label, cars under one label, etc. But the grouping of an elephant, a phone, and a microwave oven under one label is not semantically natural. In fact, in many benchmark datasets, the data points form very few semantically meaningful subsets. As such, when we apply contrastive learning to learn representations that capture the invariant properties behind these data points, these learned invariant properties don’t have to support random subset retrieval.

However, the story may be different for highly compositional datasets. A task like CLEVR (Johnson et al. 2016), which has scenes composed of multiple objects of multiple colors, texture, and shape, allows us to subset the scenes in many possible ways as long as we issue the right questions. For example, the question *Is there anything in the scene?* returns the whole set of scenes; the question *Is there a red object in the scene?* creates a narrower subset; the question *Is there a red metal cube to the right of a green sphere?* returns an even narrower subset. Of course, this is no proof that we can create short, human-readable queries that match any random subset of scenes exclusively. With many subsets, we may have to ask a clunky question such as *Does this scene matches scene101 or scene5 or scene74 or scene13 or scene500?*. But it is now reasonable to speculate that contrastive learning models may fail to learn representations that support highly compositional reasoning, more so than other representation learning techniques such as generative models, due to VC-dimension related reasons. As a short exercise, I made two attempts to generate synthetic scenes and queries. The scenes and queries both follow symbolic templates, so to unburden the model from learning raw visual or natural language inputs.

Contrastive learning for classification

To validate the hypothesis above, we would have to evaluate contrastive learning’s accuracy on query-key matching. This evaluation is a classification problem: At test time, given a query and a key, what’s the probability that the pair match? For zero-shot image classification, given a query(e.g., image) and a set of keys(e.g., text labels), which key is the right match? However, this is not as straight-forward as we would hope. There are two blockers: 1) What should the labels be? {feline, canine} or {feline, canine, Ursidae, primate} 2) Do the loss functions used in contrastive learning indeed result in the model to predict “proper” probabilities during inference? This latter is a more complicated question, but answering it can help us interpret the numbers CLIP predicts and claims to be probabilities.

Let’s focus on the infoNCE loss function (Oord et al. 2018), which is used in training CLIP and many other contrastive learning models:

The authors claim that as the learning algorithm optimizes for this objective, converges to , where is a constant. This is based on an assumption that the optimal is:

However, there is no explanation for why this assumption is valid in the first place. On the other hand, toy experiments show that indeed converge to . If this was true, we can not consider CLIP’s outputs to be the classification probabilities we intuitively expect. Let be the text label which we are considering for the image , and refers to a complete, perfect set of labels for this classification task:

which does NOT evaluate to that is claimed on openAI’s blog. This is very curious, and suggests a principled way to improve CLIP’s zero-shot classification results, if we can train a language model and use it during inference, since:

which is the probability we actually care about.