In this final blog post of the series, I would like to put everything project-related together and illustrate the following:

Motivation: CLIP struggles when there are multiple entities, relations, and attributes in the example.

Intuitions: Classification power of the final dot-product retrieval layer in standard contrastive models is limited by their query and key representations’ VC-dimension.

Setup: Compare contrastive models with non-contrastive models on extensions of the well-known SET card game.

Results: Non-contrastive models with HALF the number of parameters can solve games that contrastive models cannot.

Analysis: Contrastive models become less able to develop preferences among keys as the vector representation dimension decreases.

Takeaways: Vector representation dimension matters in standard contrastive models, especially when it comes to solving compositional problems in which dynamic queries are expected to extract numerous random subsets of the keys.

CLIP’s Failure Mode Consider the following text prompt, which asks for a red cube on top of a green cube on top of a blue cube. The image forms a natural pairing with the text.

If I shuffle the colors in both the text and the image, I get another pairing.

If I continue this procedure to generate all six pairs of texts and images, and then feed them to CLIP, I expect CLIP to predict a 6 by 6 matrix with strong values along the diagonal. But the predicted logits are surprisingly even. It seems that when there are multiple entities, relations, and attributes in the example, CLIP struggles.

(FYI) In case this example looks familiar to anyone who follows OpenAI’s work, this text prompt came from the DALL-E blog. DALL-E (a generative model) also has a similar problem.

The Intuitions

The Dot-product Retrieval Layer in Contrastive Models My intuitions for why the above happens have to do with the dot-product retrieval layer in standard contrastive models. This dot-product is taken between the query and key representation vectors. We can view this layer as performing linear classification, in which a linear boundary Q(query) is encoded to separate the data points Ks(keys). This image is showing one particular K.

Intuition 1. Vapnik-Chervonenkis (VC) Dimension Under this linear classification perspective, the VC-dimension of query and key vector representations applies as a limitation of classification power. Suppose we have 6 keys, one query may match with 3 of them as positive keys.

But there are 2^6 ways to assign some keys as positive and the remaining as negative, as long as our hypothesis class can generate the corresponding 2^6 number of queries. If can indeed do that, we say that can shatter these six keys. In general, if our query and key representations are limited by vector dimension , the VC-theory states that they cannot match with all possible subsets of keys.

In datasets like ImageNet, there are only 1000 labels. In the contrastive learning setup, this corresponds to 1000(so few!) static queries. For example, match all the dogs, or match all the sailboats. The models’ classification power is from hitting the limit imposed by the vector representations’ VC-dimension. But in datasets like CLEVR, there are numerous dynamic queries. Such as match scenes with exactly one cylinder, or match scenes with more than one blue objects. Here, the queries extract subsets in numerous dynamic ways so we may hit the limit imposed by the vector representations’ VC-dimension.

Intuition 2. Poor Rank Approximation My secondary intuition is related to the poor approximation of full rank matrices. Suppose we have a full-rank query-key distribution matrix and the number of keys is lesser than the number of queries. This matrix has rank . On the other hand, the dot-product retrieval layer in contrastive models can be viewed as reconstructing this matrix as , one position at a time. Considering all the positions in this matrix, taking all the dot-products is equivalent to performing a matrix multiplication between –the matrix for all the keys, each with vector representation dimension , and –the matrix for all the queries, each with vector representation dimension . has rank which is less than or equal to . Therefore, as the vector representation dimension of the queries and keys decreases, becomes a poorer approximation of , where the queries and keys were sampled from.

Setup

Model Architecture To validate my intuitions, I set up two architectures. One is a contrastive model which encodes queries and keys separately and then scores their compatibility using the dot-product retrieval layer. The other is a non-contrastive model which scores each query-key pair as a continuous sequence.

The contrastive model uses an 8-layer transformer to encode the query symbols and an embeddings lookup to encode the key symbols. On the other hand, the non-contrastive model uses a 4-layer transformer to encode the concatenated query-key symbols. The contrastive model has 17M parameters, while the non-contrastive model has only HALF of that. The goal of my experiments is to show that contrastive models with limited vector representation dimensions are worse than non-contrastive models, which use only half the number of parameters.

The SET card game To set up the task, I borrowed the well-known SET card game. It suits my purpose because of some nice, extendable properties. In the original card deck, each card has four attributes and three possible values per attribute. These two dimensions are scalable.

Any pair of cards can form a query that evaluates to a key card. A complete set has three cards. The game rules state that all three cards need to have the same value or all different values for any attributes.

The corresponding query-key distribution matrix has some regular pattern but is still full-rank.

To extend this game such that the VC-dimension intuition can be put to the test. I introduced the * regular expression such that queries can evaluate to subsets of key cards. In addition, I introduced the set union and symmetric difference operators to make the queries and their returned subsets even more dynamic.

The Extended SET card game — How to play In the following, I will show an example game and how the models play it. Here’s a simpler deck of 27 cards. Each card has three attributes (color, count, and shape) and three possible values per attribute.

Each query has eight pairs of cards and set union operators between them. Depending on the query, the matching subset of keys varies. This particular query returns 13 matching(positive) keys and 14 non-matching(negative) keys. A different query may have a smaller or larger subset of keys, such as 3+ 24-, or 21+ 6-.

A training example is a tuple of query symbols, and a key symbol sampled from the matching positive keys. We can sample different queries and their keys this way to make a batch.

The contrastive model uses the infoNCE training objective, which normalizes dot-products over all the query-key pairings in this batch-size by batch-size matrix. The scores are penalized across both columns and rows, just like CLIP.

On the other hand, the non-contrastive model uses the conventional Cross-Entropy objective, which normalizes scores across keys in the support. The scores are penalized across rows (consider example independently) only.

At test time, the models are given queries such as this one, and the models score the compatibility of this query with each of the 27 keys in the support. Note that at convergence, the contrastive model scores and the non-contrastive scores have different interpretations in terms of probability (see earlier posts about infoNCE).

There are 13 ground-truth keys for this query so one crude metric is to measure how much the top 13 model predictions overlap with the 13 ground-truth keys. For now, I call this metric the Top K Prediction-Ground-truth Overlap.

A more sensitive metric is to compare the KL divergence between the normalized prediction scores and a ground-truth distribution constructed from dividing 1.0 among all the ground-truth keys evenly.

Putting it all together, I have seven contrastive models with vector representation dimensions ranging from 512 down to 4. They all have about 17M parameters and a non-contrastive model with only half the number of parameters. On the other hand, I have five different games based on a 27-card deck, from the original SET game to the extensions mentioned above, and a final game of shattering 27 numbers. These five games have increasing levels of difficulties, characterized by the respective total number of unique key subsets which the game queries should extract. I tried each game on all eight models.

Results, Analysis

27-card Results Here’s the results plot. On the y-axis, we have KL Divergence Loss. On the x-axis, we have those five games ordered from left to right with increasing levels of difficulties, quantified by the total number of unique key subsets which the game queries should extract, measured in powers of two.

The first game on the left is the original SET. It requires the queries to extract 2^4.57 subsets of keys in total. All the models did very well.

The second game includes a * regular expression and requires the queries to extract 2^6 subsets of keys. We can see the contrastive model with vector dimension 4 starts to do poorly.

As we further introduce the set union and symmetric difference operators, the queries need to extract between 2^21 and 2^23 subsets of keys. We can see the contrastive models with vector dimension 8 to 20 also performs worse and worse.

Finally, on shattering 27 numbers, the vectors need to satisfy a VC-dimension of at least 27 so that the queries can extract all 2^27 subsets of keys. We can see that all models with vector dimensions less than 27 fail to different extents.

Notice that throughout these five games, three models perform consistently well. They are the contrastive models with vector dimensions 512, 27, and the non-contrastive model with only half the parameters.

On the other hand, the Top K Prediction-Ground-truth metric shows a similar, agreeing trend. Higher is better, so this plot looks like the KL Divergence Loss plot flipped upside down.

Error Analysis (27 cards) To understand why the models with shorted vector dimensions performed worse, I zoomed into the third game (Extend w/ Union Op) to analyze. Here’s a representative query from the erroneous predictions. It has 13 positive keys and 14 negative keys. A perfect model would normalize the dot-products and distribute the probability mass evenly among the positive keys and nothing among the negative keys. The contrastive models with vector representation dimensions 512 and 27 are not far from that.

But as we drop the vector dimension to 20 or below, we start to see mass moving to the negative keys. As we hit 4, the predictions are about the same as even distribution. It seems that as we drop the vector representation dimension, the models become less able to develop a preference among the keys. We can use prediction entropy as a measure of this phenomenon.

(Left) What we saw from that one example can be generalized to all the queries that have 10-20 positive keys. The average prediction entropy increases as the vector dimension decreases. Notice that queries in this game return up to 2^21 subsets of keys, and starting with dimension 20 it grows more dramatically. (Right) This trend is also reflected in the aggregate of all queries in the test set. Prediction entropy increases monotonically as vector dimension decreases.

81 cards Results I trained an 81-card version of this particular game as well. The setup is similar. The contrastive model ranged from vector representation dimension 512 down to 16, all with 17M parameters. The non-contrastive model is only half the size.

On the plots, the trends are similar. KL Divergence Loss decreases as the vector dimension increases. The Top K Prediction-Ground-truth Overlap metric increases with the vector dimension. Note that the game has 81 cards, so in theory, with a vector dimension of 81 or above, our models should solve any games (even shattering 81 numbers) perfectly. We do see agreeing trends here.

Thus far, these small game results validate my initial intuition. Solving compositional problems requires the models to encode compositional representations. One way to evaluate this quality is by testing whether models can effectively encode queries that can extract key subsets dynamically.

Training, Optimization

Here are a few things I learned from training these models:

1) More difficult games (longer queries, more logical operators, more subsets to be extracted) require more gentle learning rate schedules. Models were trained with the Adam optimizer using the standard transformer schedule. Warm-up steps were searched on a log scale. Cosine schedules had helped to unstuck models from local minima in some cases.

2) Efficiency of convergence depends on initialization schemes. Embeddings for the query symbols in the contrastive models and embeddings for the query-key symbols in the non-contrastive models were initialized with the standard transformer scheme. While the key embedding lookup was initialized with a uniform distribution , where is the reciprocal of the vector representation dimension. I also tried the Word2Vec initialization, but convergence was slower.

3) Using plain dot-product, instead of cosine similarity with or without temperature, works better for this data.

Overall, I think these modifications are more specific to the synthetics games’ nature than contrastive models.

The ‘Thank You’s

I want to thank OpenAI for this opportunity. The program was fun, intense, and rewarding. I came in with a few topics of interest (reasoning, compositionality, multimodal), and my mentor Gabriel Goh helped me to crystallize them into concrete research questions. We explored alternatives, conducted sign-of-life experiments, designed multiple card games, and iterated until we saw something promising. All the lessons were worth the effort. Gabe provided instrumental guidance and lots of valuable feedback along the way. I also want to thank my fellow scholars, who provided tremendous support. I feel privileged to work with such talented and perseverant colleagues:

realized that point mutual information is a preferable alternative to probability density for zero-shot classification. CLIP is doing the right thing.

rolled back to the original project idea: Studying contrastive learning’s ability to solve compositional problems

set up a synthetic card game dataset and ran numerical experiments

found poor rank approximation to be another limitation of dot-product classification

The use of Point Mutual Information in CLIP and GPT-3 predictions

In the last blogpost, I explained that the InfoNCE objective converges to , and thus using to model or would be inappropriate. While the statement remains valid and results from the corresponding numerical experiments in the last blogpost still support it, using to make predictions is actually preferrable for zero-shot classification.

Imagine we are training a system to identify named entities, and for simplicity, we only care about geographical locations(city, country names). Our training corpus is all of today’s news articles. The locations “new york”, “new england”, “new zealand” are present in the corpus. Suppose a lot happened in New York yesterday, so the bigram count for “new york” is very high in today’s news, while not much has happened in “new england” and “new zealand” so the bigram counts for “new england” and “new zealand” are relatively low. As a results, is very high while and are very low. After training our system for a day, we give it a geography test with the following prompt:

Which of the following locations exists in the world? A. new york B. new zealand C. new england D. new diego E. new angeles

Answers A, B, C are all valid, so a system that’s good at identifying geography locations should score A, B, C equally high and D, E as zeros. This system should not score A much higher than B and C because of the particular distribution in today’s newspaper. Therefore, it is better to further normalize the scores by the marginal probabilities for “york”, “zealand”, “england”, “diego” and “angeles”. More formally, we only care about the amount of information shared between “new” and each of the five candidates. Therefore, using , which is the exponentiation of point mutual information, is more appropriate when we care about the identification of natural pairings more than the mapping of the particular training distribution.

CLIP was trained using the infoNCE objective, so it naturally normalizes over all the class candidates for zero-shot classification. In fact, GPT-3 also applies a similar correction. The generative model was initially trained to score . Therefore, during inference, we expect the system to normalize over all the possible completions to get . However, the researchers applied a correction. They further normalized by the unconditional probability . Empirically, the GPT-3 model performed better on some datasets with this correction. For details, please refer to section 2.4 in the GPT-3 paper.

Synthetic Card Game Data

I postulate a few basic, reasonable dimensions that quantify the degree of compositionality, including 1) the number of attributes, 2) the number of values, 3) the number of objects, and 4) the number of ways the objects can relate to each other. These dimensions exist by design in datasets like CLEVR, and exist in real-world images as well. My first synthetic dataset’s design exploits the first three dimensions, and it looks a little like the card game SET. The rules are pretty simple:

The query is a pair of cards, each with some attributes: Query card 1: “red, solid, circle” Query card 2: “blue, solid, circle” These two cards share two attributes, “solid” and “circle”, which characterize the query.

These are two example key cards considered as matches: Key card 1: “green, solid, circle” — matches both query attributes. Key card 2: “blue, solid, square” — matches only one of the query attributes.

This is an example key card not considered as a match: Key card 3: “orange, void, square” — matches none of the query attributes

This particular deck has 27 cards with 3 attributes and 3 possible values per attribute:

Interestingly, as we scale up the number of attributes and values per attribute, the distribution matrix (queries along one axis, key along the other axis) representing the matches grows in rank. We are looking at the deck of 27 cards with 3 attributes and 3 values per attribute in the following image. There are 27 cards along the y-axis and 351(27 choose 2) queries along the x-axis. Each bright spot refers to a query-key match. The rank of this matrix is 19.

If we scale to 81 cards with 4 attributes and 3 values per attribute. The rank of the matrix grows to 65.

Scaling along the attributes dimension increases the rank rapidly while scaling along the number of values per dimension increases the rank slowly.

Attributes

Values per Attribute

Rank

2

5

9

2

6

11

2

7

13

2

8

15

2

3

5

3

3

19

4

3

65

Approximation of high-rank matrices using dot-product classifiers

In an earlier blog post, I have explained why conventional contrastive learning, which uses a dot-product classifier, may fail to learn highly compositional data due to VC-dimension related reasons. Here, I am suggesting another deficiency of dot-product classifiers.

Following the setup of the card game above, if we try to solve it with a simple contrastive learning model, in which each key is represented as a learnable embedding in and do the same for each query, each position in the distribution matrix is now approximated by the dot-product between the two corresponding query and key embedding vectors. Zooming out to consider the whole by distribution matrix , we are approximating as , where is by and is by . If embedding dimension is lower than the rank of , and we continue to shorten it, our model becomes poorer and poorer at approximating the distribution matrix .

Numerical Validation

Indeed, when we code up this experiment and train the model, we will see that a deck with higher rank requires a larger embedding size to hit 100% accuracy on the training distribution. At the moment, I have a few numerical results with me, but will leave the full report along with other observations for the next blog post.

During the past two weeks, I have been following up on the hypothesis mentioned at the end of my last blog post.

Experimented on small datasets to validate that the infoNCE objective used in training CLIP converges to rather than , where are some constants. This suggests that using CLIP directly for classification, (i.e. approximating as ), is indeed not principled. This is the main focus of the this blog post.

Practiced writing proofs in statistics. This is an on-going exercise to warm up to construct a sound, step-by-step proof of the above. So far, I have been working on Rejection Sampling, and a few chapters from the “All of Statistics — A Concise Course in Statistical Inference” book.

Experiments

I ran experiments on both synthetic and natural datasets to observe what converges to under three different objectives, including infoNCE.

To start, let’s use a synthetic example to illustrate the problem. We contrive a dataset in which the joint distribution is simply the product of the marginals. That is, for every pair in the joint distribution between random variables and . Here are three visualizations made from the data:

Joint distribution

Product of marginals

Joint distribution div. by product of marginals

We construct a by parameter matrix , each value is denoted as . We define and use SGD to train the parameters according to each of the following objectives. At convergence (batch size=500, iterations=between 1k and 2k), we visualize the table for each of the objective:

Negative log likelihood

Negative pseudo log-likelihood

infoNCE used in CLIP

In this contrived dataset, training with the negative log-likelihood objective or the negative pseudo log-likelihood both converges the visualization to the visualization rather than the visualization. However, training with the infoNCE objective results in a noisy patch with no clear signal.

To quantitatively measure how well is approximating , we measure the KL-divergence between the normalized table and the groundtruth from the data.

Objective

Minimize Negative Log Likelihood

5.8801e-06

Minimize Negative Log Pseudo Likelihood

3.9604e-06

Minimize Info NCE Loss

0.0043

It is quite noticeable that approximates poorly.

Now, let’s modify the distribution such that .

From the data:

Visualization of tables at convergence:

The visual observation tells us that training with the infoNCE objective, the actually converge to a pattern much closer to rather than .

Objective

Minimize Negative Log Likelihood

9.8529e-06

Minimize Negative Log Pseudo Likelihood

6.7194e-06

Minimize Info NCE Loss

0.0037

Next, let’s look at an example that is more familiar to people who are interested probabilities. Consider the following procedure: we roll a 6-sided dice twice to collect two i.i.d. results, then collect an (=sum of results, =max of results) pair. We repeat this procedure many times to fill up a count table indexed by the support of two random variables (x-axis spans from 1 to 6, y-axis spans from 2 to 12). We look at the same set of visualization and metrics again.

From the data:

Visualization of tables at convergence:

Measure the difference between true distribution and normalized table:

birthwt (188 datapoints) (x-axis: infant’s weight in 100g, y-axis: mother’s age in years)

From the data:

Visualization of tables at convergence:

Measure the difference between true distribution and normalized table:

Objective

Minimize Negative Log Likelihood

5.6581e-07

Minimize Negative Log Pseudo Likelihood

7.3230e-06

Minimize Info NCE Loss

0.0010

These are just three of the set of small experiment results that I have. I am skipping the others in this blog post because their conclusions are all the same. Moving forward, I will be running experiments on datasets and architecture that are progressively more similar to CLIP. The end goal of these experiments, along with efforts to write the proof, is to demonstrate that there’s a principled method to improve CLIP’s classification performance, which is explained at the end of my last blog post.

In the past couple of weeks, I took variable binding as a starting point and studied two relevant topics for the project:

Vapnik-Chervonenkis(VC) dimension

Contrastive learning for classification

Intuition

I am interested in measuring and demonstrating the limits of standard representation techniques to effectively encode inputs with high compositionality for downstream reasoning tasks. One such representation technique maybe Contrastive Learning for VC-dimension related reasons.

Many common forms of contrastive learning involve encoding unimodal or cross-modal inputs into query and key vectors, scoring the query-key pairs using a similarity function, and applying a loss function to maximize the mutual information for matching pairs of queries and keys. For the similarity function, dot-product or cosine similarity (normalized dot-product) are most commonly used. This operation is where things get a little interesting because we can view this dot-product as performing linear binary classification. Each key vector is a data point, and each query vector is a linear classifier. The query vector draws a boundary to separate the matching key vectors from the non-matching key vectors.

The VC-theory states that the set of all linear classifiers (each with a bias term) for data points in dimension , can shatter at most data points. Let’s unpack this statement a bit:

Shatter: If there are data points in a binary classification problem, each of them can be labeled as or , so there are possible label assignments. For each of these label assignments, if there exists a linear classifier in such that no classification error is made, then can shatter these data points. Even if there’s only one particular label assignment which no classifiers in can classify without any error, cannot shatter these data points.

Shatter at most data points: If there exists at least one dataset — any dataset up to our choosing, in dimension of data points which can shatter, AND there exist no datasets at all in dimension of data points which can shatter, then the VC-dimension of is . The VC-theory says that . Lecture 7 of Professor Yaser Abu-Mostafa’s course at Caltech walks through a simple proof.

Putting this VC-theory together with contrastive learning, the VC-dimension poses some sort of limit on contrastive learning performance because of the latter’s use of the dot-product. If our query and key vectors are in representation space , for any dataset with more than data points, there exists at least one subset of keys that we cannot learn a query to match exclusively.

One important follow-up question is: Do we ever run into this limit in practice? Contrastive learning has achieved success for image and text representation learning, and even when we used these learned representations directly for classification. For example, CLIP is showing satisfactory zero-shot image classification results. Let’s reason about this more: shattering a dataset means drawing boundaries that separate ALL possible subsets of data. For image or image-text classification tasks, some subsets are more natural than the others—we group canines under one label, primates under one label, cars under one label, etc. But the grouping of an elephant, a phone, and a microwave oven under one label is not semantically natural. In fact, in many benchmark datasets, the data points form very few semantically meaningful subsets. As such, when we apply contrastive learning to learn representations that capture the invariant properties behind these data points, these learned invariant properties don’t have to support random subset retrieval.

However, the story may be different for highly compositional datasets. A task like CLEVR (Johnson et al. 2016), which has scenes composed of multiple objects of multiple colors, texture, and shape, allows us to subset the scenes in many possible ways as long as we issue the right questions. For example, the question Is there anything in the scene? returns the whole set of scenes; the question Is there a red object in the scene? creates a narrower subset; the question Is there a red metal cube to the right of a green sphere? returns an even narrower subset. Of course, this is no proof that we can create short, human-readable queries that match any random subset of scenes exclusively. With many subsets, we may have to ask a clunky question such as Does this scene matches scene101 or scene5 or scene74 or scene13 or scene500?. But it is now reasonable to speculate that contrastive learning models may fail to learn representations that support highly compositional reasoning, more so than other representation learning techniques such as generative models, due to VC-dimension related reasons. As a short exercise, I made two attempts to generate synthetic scenes and queries. The scenes and queries both follow symbolic templates, so to unburden the model from learning raw visual or natural language inputs.

Contrastive learning for classification

To validate the hypothesis above, we would have to evaluate contrastive learning’s accuracy on query-key matching. This evaluation is a classification problem: At test time, given a query and a key, what’s the probability that the pair match? For zero-shot image classification, given a query(e.g., image) and a set of keys(e.g., text labels), which key is the right match? However, this is not as straight-forward as we would hope. There are two blockers: 1) What should the labels be? {feline, canine} or {feline, canine, Ursidae, primate} 2) Do the loss functions used in contrastive learning indeed result in the model to predict “proper” probabilities during inference? This latter is a more complicated question, but answering it can help us interpret the numbers CLIP predicts and claims to be probabilities.

Let’s focus on the infoNCE loss function (Oord et al. 2018), which is used in training CLIP and many other contrastive learning models:

The authors claim that as the learning algorithm optimizes for this objective, converges to , where is a constant. This is based on an assumption that the optimal is:

However, there is no explanation for why this assumption is valid in the first place. On the other hand, toy experiments show that indeed converge to . If this was true, we can not consider CLIP’s outputs to be the classification probabilities we intuitively expect. Let be the text label which we are considering for the image , and refers to a complete, perfect set of labels for this classification task:

which does NOT evaluate to that is claimed on openAI’s blog. This is very curious, and suggests a principled way to improve CLIP’s zero-shot classification results, if we can train a language model and use it during inference, since:

Variable Binding — literature review and project feasibility study

Alternative project proposals — literature review on contrastive learning, risk minimization, and Vision Transformer. I will include the solidified thoughts for the next post.

NeurIPS 2020

NeurIPS 2019 was my first time attending a computer science conference. I was quite overwhelmed by its extensive program and learned that it was impossible to participate in all events. I attended workshops and tutorials relevant to my work but missed most of the spotlights and posters. This year, once the list of accepted papers was released, I skimmed through all the paper titles with my peers, read the abstracts of those which excite me, and studied 6-9 of those in greater detail. I curated my calendar to include 3-4 events each day of the week and attended most of them. Most of the events are related to either Neuro-Symbolic methods or Multimodal Navigation. Overall, it was a great learning experience, and I am grateful for OpenAI’s sponsorship.

Variable Binding

In the past year, I came across the term “Neuro-Symbolic” in MSR and a few conferences. I did not know the topic, but its premise piqued my interest. At the time, I was working on Vision-Language Navigation, and generalization to unseen environments was an open problem. I noticed that agents tend to pick up spurious visual signals from the scenes rather than perform reasoning at the spatial semantics level. Even if we feed an agent with ground-truth room labels, the agent without the support of explicit mapping modules still struggles to develop search heuristics. It seemed to my mentors and me that a well-designed computer program could capture and execute these search heuristics, using the current room type, doorway directions, and navigation history as typed-arguments. If we had pursued this approach, the research problem would have been how a DNN can be modularized to learn and solve these symbolic, discrete, logical programs on the fly.

On the other hand, the TP-transformer paper (Schlag et al. 2019) seems to have picked up some traction. It also has a symbolic flavor but attempts to address it from within the connectionist framework(contrary to the modular approach above). This transformer manages to solve math problems with SOTA, and its attention patterns show that arguments are more correctly bound to their relations. For example, in the division , can correctly attend to as its denominator, and this property is preserved even when the problem becomes more nested as in . Given OpenAI’s heavy interest in transformer models, I decided to investigate possible project directions extended from this paper.

One critical notion mentioned in this paper is Variable Binding. In the problem , the filler should be bound to through the denominator role. Existing, standard multi-head attention layers already allow this by letting parameters WQKV in each head to learn to bind fillers for one role(Goyal and Bengio 2020 use the terminology “typed argument”, which is analogous to those accepted by program functions). However, this binding mechanism is limited by the number of heads. The TP-transformer authors introduced a specific role-binding mechanism, which allows for finer role learning and binding within each head. Out of curiosity, I tested this fraction simplification problem on the GPT-3 playground:

This is a program that simplifies the division of fractions. The variables are a, b, c, and d. Terms within parentheses are simplified first.
PROMPT: a / b
RETURN: (a) / (b)
PROMPT: b / c
RETURN: (b) / (c)
-----------------------------------
PROMPT: (a / b) / c
RETURN: (a * c) / (b)
PROMPT: (c / b) / b
RETURN: (c * b) / (b)
-----------------------------------
PROMPT: (c / d) / (a /a)
RETURN: (c * a) / (d * a)
PROMPT: (d / b) / (c /a)
RETURN: (d * c) / (b * a)
-----------------------------------
PROMPT: ((a / b) / c) / (a / b)
RETURN: (a* c * b) / (b * a)
PROMPT: ((b / d) / a) / (c / d)
RETURN: (b * c * d) / (a * c * d)

Given the demonstration, the model returned the correct answers for the first three problems, which establishes its ability to recognize math symbols and perform basic fraction simplification. However, its failure on the last problem shows that it cannot bind variables recursively when the problem is sufficiently nested. While it’s unfair to evaluate GPT-3 on math problems, this observation, along with TP-transformer’s results, led to the question of whether transformer models can be benchmarked on their ability to bind variables.

However, it’s not straightforward to evaluate variable binding. Suppose is encoded into a vector, what should we do with this vector? We can use this information bottleneck to simplify the fraction into , but this metric collapses the model’s ability to come up with a logical internal representation by binding, and the model’s ability to solve the logical representation using learned arithmetics. A fairer approach is to query for the numerator(role) of the numerator(role), and see if the model returns (filler). This process is called unbinding. I did some literature review on this. Surprisingly, the most influential literature set came from the 1990s, when connectionists (most of whom in computer science) and symbolists (most of whom in cognitive science) attempted to imbue distributional representations(like word embeddings) with structural properties(like logical expressions). Here, I discuss the one I find to be the most representative.

In 1990, Paul Smolensky introduced tensor-products for variable binding. The following figures are directly extracted from his paper.

Here, a role is defined as an vector, and a filler is defined as an vector. The tensor-product of this binding is an matrix, with each element at position computed as the products of two scalars, each coming from the corresponding units filler and the role . This binding is also a rank 2 tensor.

Treating each tensor-product as a filler for higher-level binding operations, we can also bind recursively to construct representations for hierarchical structures—for instance, a complex, nested expression in lambda calculus. One caveat is that the rank of this resultant tensor-product increases — every time we bind a rank H filler with a rank 1 role, the operation returns a rank H + 1 tensor-product.

At any particular level of a hierarchical structure, the conjunction of bindings is computed as the element-wise sum of all bindings.

Theorem 3.1, 3.2, and 3.3 show two ways to unbind. The underlying assumption is that all role vectors, either designed by hand or learned, are linearly independent. This property allows us to unbind by taking the dot product between the bound tensor-product, and option 1) the dual basis of the query role vector, which returns the exact filler of interest, or option 2) the query role vector itself, which returns the sum of the filler of interest and some noise term.

One undesirable property of this binding mechanism is the tensor-product’s rank explosion when the problem’s structure becomes too nested. To mitigate this, the TP-transformer authors (which includes Smolensky himself) suggested using the Hadamard product as an approximation. With this approximation, binding an role vector with an filler vector returns an representation. We can also arrive at this representation by applying the diagonal function on the full tensor-product representation. To unbind, we take the Hadamard product between the bound representation with the query role vector itself.

Besides Smolensky’s tensor-product, other influential works includes Holographic Reduced Representation (Plate, 1994), Spatter Code (Kanerva 1994), and the Braiding operator (Gayler 1998). Across the board, researchers recognized that systems for composing structural representation need to support binding, unbinding, conjunction, and recursive construction/deconstruction. Also, note that the unbinding methods are specific to the respective binding methods. It wouldn’t be fair to evaluate one system’s ability to bind using another system’s unbinding operator.

In addition to this set of literature, I also reviewed the following to understand the definition of variable binding:

Variable Binding Tree and Lambda Calculus (Church, 1936): I find the variable binding tree’s formalism to be useful for generating synthetic data of different levels of binding complexity. The number of leaves corresponds to the number of terminal values bound as variables. The height of the tree corresponds to the number of recursive binding operations. The width of the tree at any level corresponds to the number of conjunctions at the level.

On the Binding Problem in Artificial Neural Networks (Greff et al. 2020): I find their characterization of the binding problem as three subproblems: Segregation, Representation, Composition to be insightful. They also provided a good survey of existing methods that attempts to perform binding within the connectionist framework, contrary to modular hybrid approaches, which designate some modules to deal with symbolic information content and others to deal with neural information content.

Measuring Compositionality in Representation Learning (Andreas, 2019): The author proposed a general procedure for measuring any encoding function’s ability to compose by learning an approximation function that has access to the ground-truth structure of a problem. If the encoding function doesn’t encode compositionally, the encoded representation would have no compositional properties, thus impossible for the approximation function to match using ground-truth information. While this procedure makes intuitive sense, I’m skeptical of certain aspects once the procedure is operationalized in the experiments. First, the composition function (*), is assumed to be the addition function for all experiments. There is no explanation for this collapse of conjunction and recursive binding into one single operator. Second, in some experiments, it is unclear in regards to what the encoding function is and what the approximation function is. There are no explanations for this either. These two aspects made the extension of this paper quite challenging.

In the past two (also last two) curriculum weeks, I have focused on:

Manually implementing mixed-precision training on a toy MLP problem. Ran experiments to understand when to expect memory efficiency and speedup.

Reviewing literature and drafting for my project proposal on variable binding. I will update here again, once the material is packaged up.

Mixed-Precision Training

The premise of mixed-precision training is that if all data and parameters are 16-bit instead of 32-bit, we cut the size of all tensors in half in GPU memory. Therefore, we can 1) double our batch size, and 2) increase arithmetic bandwidth within GPU and reduce network bandwidth between processes in distributed settings. NVIDIA’s Volta generation GPUs (e.g. V100) have hardware that supports these theoretical premises, and practitioners often observe a 3+X speedup in their experiments.

Mixed-precision training is becoming the status quo for research engineering, so I decided to do a short exercise to understand exactly how and when tensors are cast into 16-bit and the new class of bugs that can arise. Besides the PyTorch amp library and a few youtube videos that explain numeric representation using 16-bit versus 32-bit, I find this notebook from fast.ai quite useful. It explained away several bugs that I encountered when I naively set every tensor to 16-bit in a large model like the transformer and matches my expectation of where and when true 16-bit training fails due to numerical underflow and overflow. The tutorial suggested three techniques: 1) avoid imprecise weight update: maintain a 32-bit master copy of weight, with which we update with 32-bit gradients. 2) avoid numeric underflow in the 16-bit gradients computed during backprop, especially when the model is very deep: scale the loss as much as we can without overflowing (i.e. dynamically), backprop with this scaled-up loss to get 16-bit gradients (which are not too close to zero anymore), cast to 32-bit, downscale by the same loss scaler, and use this to update the 32-bit master weights. 3) avoid numeric overflow in activations or loss (the opposite problem for gradients): compute the loss in 32-bit, which means casting 16-bit inputs (e.g. logits) into 32-bit and use that to compute the loss. Apply the same to batchnorm layers.

To understand this workflow in detail, I tried the first two suggestions on a toy MLP problem, using the recommended apex library utilities, and without the fast.ai wrappers. The implementation is here, look for `def main_train` which breaks down the workflow into sixteen small steps.

I ran a few experiments with to observe gains in memory and wall-time efficiency. The experiments compared a 160M parameters model, 2.5M parameters model and 116K parameters model. Here are two main take-aways: 1) Wall-time speedup is only observed when the model is large(e.g. many wide layers in a MLP) . However, when the model is very large, the master copy of 32-bit weights and 32-bit gradients can take up so much memory that we are not able to scale up the batch size. 2) Memory efficiency (by measuring max memory allocated regularly) is only observed when the model is small. However, when the model is small, the number of matrix multiplications is trivial. We don’t get to take advantage of the increased arithmetic bandwidth—consequently, no wall-time speed ups.

In practice, there are several ways to implement mixed-precision training. These libraries have been optimized for speedup and memory efficiency. They are also straightforward to incorporate into any given PyTorch module code.

Certain operations are autocast into float-32. These Ops are more sensitive to numeric underflow or overflow. Hence, the list includes functions that involve logarithm, exponentiation, and normalization.

Certain operations are autocast into float-16. These Ops typically involve typical linear algebra.

The trainer class accepts arguments that support full-precision, conservative mixed precision (only ops robust to numerical issues are cast into 16-bit), mixed-precision (maintains a master copy of 32-bit weights) true 16-bit precision.

In the past two curriculum weeks, I have focused on getting my Transformer implementation to its best possible shape.

Datasets: I ran the transformer on four toy datasets, two sanity-check datasets and a subset of 1 benchmark machine translation dataset.

Experiments: I ran a set of experiments to understand the effects on weight initialization, weight tying vs. untying, and learning rate schedule on performance, and speed of convergence.

16-bit training: I started reading about 16-bit training, and ran some naive experiment to see what kinds of bugs would come up. I will leave this report for the next blog post when the whole exercise will be complete.

Datasets

The four toy datasets are designed to test if the implementation can:

1) transfer signals as-is from the encoder to the decoder, with a symbol copying task. There are 20 symbols, the first two are encoded as <go> and <stop>.

3) take advantage of additional structure on the input pattern. We should expect the model to converge sooner as it’s easier to remember a fixed sequence ordering and reuse it.

The two sanity-check datasets (Polynomial Expansion, Fake Dates) are also synthetic, and training on them further validate if the transformer can perform more complex reasoning over the entire input context window (instead of a limited local one).

Polynomial Expansion
X
(7-3*z)*(-5*z-9)
(2-2*n)*(n-1)
Y
15*z**2-8*z-63
-2*n**2+4*n-2

Fake Dates
X
april 11 1981
wednesday september 1 2004
Y
1981-04-11
2004-09-01

Finally, I downloaded the WMT-2014English-French dataset, pushed the data through a byte-pair encoding pipeline, and overfit on a very small subset (Train:1024, Val: 64). I didn’t train on the entire dataset mainly because of compute limitations. It’s also more important for my curriculum to prioritize 16-bit training and image recognition datasets.

X
<GO> les sché@@ mas et leurs inclu@@ sions sont télé@@ chargés automatiquement depuis internet la première fois qu'ils sont requ@@ is, puis gar@@ dés en cache local@@ ement. <EOS>
Y
<GO> the schem@@ as and their inclu@@ sions are automatically downloaded once from their internet location, and cach@@ ed loc@@ ally. <EOS>X
<GO> /@@ 1.1 avec connexions persist@@ antes et compression pour minimiser la charge réseau. <EOS>
Y
<GO> /@@ 1.1 with persistent connections and compression for minim@@ ising network lo@@ ad. <EOS>

If the hyperparameters are carefully chosen, the transformer should converge on all these datasets (exact match accuracy at 100% and BLEU > 0.85) within 1-4k steps. The corresponding notebooks are listed here. They follow the same transformer implementation using Heads=2 and N_layers=2, the only difference is the data (look for Dataset, Dataloader and DataModule classes).

In the process of validating the transformer on the above datasets, I realized that hyperparameters and weight initialization can have substantial effects on convergence and performance. The following plots corresponds to the second toy dataset above (reversing randomly generated input symbols).

Hyperparameterwarmup_steps: This hyperparameter directly controls the slope and peak of the learning rate schedule, and it should be adjusted according to the batch size. With a smaller batch size, the weight particles should take smaller steps.

Tying / Untying WQKVO at initialization (but not afterwards): For the toy datasets, initializing WQ and WK to the same weights seems important. One possible explanation is that doing so allows the multi-head attention module to start at a sensible baseline that is similar to a simple dot-product attention without linear projections. On the other hand, zeroing out WO was suggested in the Fixup initialization (Zhang et al. 2019) paper for residual learning without any normalization. Our architecture still maintains Layer-Norm between all sub-layers so we don’t see any gains from including this technique.

Embedding Initialization: Because the context length is short(only 8 positions), using sinusoidal embeddings did not benefit training as much as learned embeddings. Once the variance of embedding weights at initialization are scaled by N(0, 1/d_model), rather than the default N(0, 1) set in torch.nn.Embedding, the model converges smoothly at 4k rather than 20k steps. Finally, initializing the learned positional embedding to be N(0, 0.5 * variance of word embeddings) speeds up convergence further by a small margin.

In the past two curriculum focused weeks, I continued to brush up on some foundational ML topics.

Standard Transformer: I had a basic grasp of the standard transformer(Vaswani et al. 2017) architecture already, but a few other concepts crucial to successful implementation were still missing.

Distributed training: I have taken a class on scaling classic ML models before, but not specifically on distributing deep learning model training across GPUs and nodes. The mechanisms were opaque to me.

Automatic differentiation: This is a very foundational topic, yet my formal data science education never covered it!

Approach and Highlights

Nothing internalizes an ML/DL/Stats topic for me more than hands on implementation/problem solving, supplemented with lots of google search for literature, lecture notes, documentations, and discussion threads. This learning style is reminiscent of how I often let my hands(models, drawings) guide my brain(precedent review and analysis) during the early years of my education in Architecture. Conversations with my mentors remain crucial, and I checked in with my mentor Gabe whenever I’m stuck with a concept. Our discussions have been delightful.

With the Transformer, I took the model architecture diagram from the paper and refactored it into a top-down module collection. This was a relatively straight-forward first step. I also asked Alethea Power (an excellent scholar from the last cohort and current engineer) for their advice. I learned quite a few helpful tips that leverage Pytorch specifics for module organization and eliminating boilerplate training code with Lightning. Whenever a substantial number of modules are completed, I would design simple tests to ensure that they give me the expected outputs. I was especially delighted with the way matrix multiplication simplifies the code and optimizes speed and space efficiency (which the authors also claim) for multi-head attention. This use case can make a perfect textbook example for vectorization! However, for myself, the highlight of this implementation exercise is internalizing layer-norm, positional encoding, learning rate schedule, and label smoothing. I paid extra attention to their motivation, did some light literature review, and made my mentor spend spare time to discuss them with me (thanks, Gabe!). FYI, this collection is a standard set of resources that the scholars use to study and implementation of their own transformers:

With distributed training, my primary resource is the Pytorch documentation on the distributed package. Clarifying the difference between data-parallel and distributed data-parallel, understanding how, which, and why tensors are exchanged between processes and devices was a crucial first step. The primary motivations and concepts resemble how multinode training for classic ML algorithms in the map-reduce paradigm works (thanks to the W261 course. I did two light notebook exercises to understand how CUDA manages and reports memory usage for tensors and implemented a toy MLP problem to see how an implementation would align with the documentation for CPU training, single GPU with a single process multiple GPUs with individual processes. Also, Gabe and I had extensive discussions on the different trade-offs for different data, model and network challenges, when 1) data is heavy (e.g. image features); 2) model is heavy(e.g. transformer), and 3) network is slow (e.g. all-ring-reduce’s assumptions doesn’t work well with multiple nodes each with various GPUs across the ethernet). FYI, this is the seed list of my readings, from which I drilled down to the necessary details.

My short exercise to understand how CUDA allocated memory changes with creating tensors, deleting tensors and emptying cache.

My short exercise to understand how forced synchronization affects execution time.

With automatic differentiation, I solicited a list of readings from my peers. Like many beginners in DL, it was not clear why backprop is such a big deal (if it was just about the chain rule) and took so long before researchers came up with it. The difference between the forward vs. backward mode enlightened me! Also, following and practicing with Karpathy’s Micrograd implementation was really helpful. FYI, here is the list, from easy to involved:

I am currently working on validating my standard transformer implementation, with a few more toy problems. For example, sequence copying, reversing, and auto-regressive series (e.g. Fibonacci sequence). Also, other implementation checks with 16-bit training, memory profiling and torch lightning integration. This is a prerequisite to the research direction Gabe and I have in mind — bringing more clarity to the transformer type architecture through studying image recognition. More on that later 🙂

Hello, this is Legg. I am a Fall 2020 Scholar at OpenAI, working with my mentor Gabriel Goh in the Clarity team. Between October 2020 and April 2021, my peers and I will be publishing blog posts biweekly to share our journeys as scholars between October 2020 and April 2021.

I tend to get questions about career transition and whether one should invest in additional schooling or industry research residencies. Therefore, before diving into any technical discussions, I want to share a few things about myself so that the readers of this blog understand where my baseline is. Hopefully, this information, in addition to the collection of upcoming technical discussions, can help the readers make better judgments for their careers.

Between 2006-2017, I was trained and practiced as an Architect and Urban Designer. I did not receive any quantitative or engineering training from my bachelor’s and first master’s degrees. (I wouldn’t count grasshopper scripting, 3D modeling, or rendering as technical for AI purposes).

Between 2017-2019, I completed a master’s degree in data science. It is a popular degree for folks who want to transition into Data Science. The courses include statistics, machine learning, and NLP with deep learning. I also took two semesters off for full-time machine learning and data science internships.

Between 2019-2020, I was an AI resident at Microsoft Research. I focused on Vision-Language Navigation(VLN) and Instruction Generation.

Before my start date at OpenAI, I took about ten days to complete the Deep Learning Specialization (deeplearning.ai, Coursera) to refresh my memories and fill any knowledge gaps in the basics.

In summary, I am still very new to the field of AI. Through this scholar program, I am hoping to gain more experience in research engineering and developing research intuitions.

Back to documenting my first week at OpenAI, I took advantage of the general on-boarding period to learn a little more about Pytorch. I have been using Pytorch for 12 months, mostly on extending Pytorch research codebases such as ALFRED and Room2Room. I want to understand Autograd and the custom functions more. Therefore, I coded up logistic regression from scratch and compared it with the off-the-shelf implementation on a toy problem. More specifically, I implemented the following:

Linear, Sigmoid, and loss layers as custom torch.autograd.Functions

Gradient checking using torch.autograd.gradcheck

SGD optimizer as custom torch.optim.Optimizer

Toy dataset and dataloader using sklearn.make_classification, torch.utils.dataDataset, torch.utils.dataDatase

The exercise took about two days. I spent most of the time reading the Pytorch documentation, coding up my implementation, and debugging when the losses of my implementation mismatch the default implementation. The process was quite educational. The implementation ishere (feel free to comment with questions/suggestions), along with links to all the references I have used. Along this line of small exercises, I am interested in getting a similar toy problem to train on multiple GPUs. There are at least three ways to do this in Pytorch. More updates on that in about two weeks 🙂