Week 13 – VC-dimension, Contrastive Learning

TL;DR

In the past couple of weeks, I took variable binding as a starting point and studied two relevant topics for the project:

  • Vapnik-Chervonenkis (VC) dimension
  • Contrastive learning for classification

Intuition

I am interested in measuring and demonstrating the limits of standard representation techniques when they encode inputs with high compositionality for downstream reasoning tasks. One such technique may be contrastive learning, for VC-dimension-related reasons.

Vapnik-Chervonenkis (VC) dimension

Many common forms of contrastive learning involve encoding unimodal or cross-modal inputs into query and key vectors, scoring the query-key pairs with a similarity function, and applying a loss function that maximizes the mutual information between matching pairs of queries and keys. The most common similarity functions are the dot-product and cosine similarity (a normalized dot-product). This operation is where things get a little interesting, because we can view the dot-product as performing linear binary classification: each key vector is a data point, and each query vector is a linear classifier. The query vector draws a boundary that separates the matching key vectors from the non-matching ones.
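To make that view concrete, here is a minimal sketch (my own toy example, not CLIP's or any paper's code) of scoring L2-normalized query and key vectors with a dot-product; each row of the score matrix is literally a linear classifier whose weight vector is the query:

import torch
import torch.nn.functional as F

d = 8                                              # representation dimension (illustrative)
queries = F.normalize(torch.randn(4, d), dim=-1)   # 4 queries, L2-normalized
keys = F.normalize(torch.randn(16, d), dim=-1)     # 16 keys, L2-normalized

scores = queries @ keys.T                          # cosine similarities, shape (4, 16)
matches = scores > 0.0                             # thresholding = a linear decision boundary per query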

VC theory states that the set H of all linear classifiers (each with a bias term) over data points in \mathbb{R}^d can shatter at most d+1 data points. Let’s unpack this statement a bit:

Shatter: If there are N data points in a binary classification problem, each of them can be labeled as 1 or 0, so there are 2^N possible label assignments. If, for every one of these label assignments, there exists a classifier in H that makes no classification error, then H shatters these N data points. If there is even one label assignment that no classifier in H can fit without error, H cannot shatter these N data points.

Shatter at most d+1 data points: If there exists at least one dataset of N data points in \mathbb{R}^d (any dataset of our choosing) which H can shatter, AND there exists no dataset of N+1 data points in \mathbb{R}^d which H can shatter, then the VC-dimension of H is N. For linear classifiers with a bias term, VC theory says that N = d+1. Lecture 7 of Professor Yaser Abu-Mostafa’s course at Caltech walks through a simple proof.

Putting VC theory together with contrastive learning, the VC-dimension poses a limit on contrastive learning performance because of the latter’s use of the dot-product. If our query and key vectors live in representation space \mathbb{R}^d, then for any dataset with more than d+1 data points, there exists at least one subset of keys for which we cannot learn a query that matches it exclusively.
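As a sanity check of the d+1 claim, here is a brute-force sketch (my own, using scipy's linear programming routine) that tests whether every labeling of a random point set is linearly separable:

import itertools
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    # feasible iff some (w, b) satisfies y_i * (w . x_i + b) >= 1 for all i
    Xb = np.hstack([X, np.ones((len(X), 1))])          # append a bias feature
    res = linprog(c=np.zeros(Xb.shape[1]),
                  A_ub=-(y[:, None] * Xb), b_ub=-np.ones(len(X)),
                  bounds=[(None, None)] * Xb.shape[1], method="highs")
    return res.success

def shattered(X):
    # H shatters X iff every one of the 2^N labelings is separable
    return all(separable(X, np.array(labels))
               for labels in itertools.product([-1.0, 1.0], repeat=len(X)))

d = 2
rng = np.random.default_rng(0)
print(shattered(rng.standard_normal((d + 1, d))))      # True for points in general position
print(shattered(rng.standard_normal((d + 2, d))))      # always False: the VC-dimension is d+1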

One important follow-up question is: do we ever run into this limit in practice? Contrastive learning has been successful for image and text representation learning, even when the learned representations are used directly for classification. For example, CLIP shows satisfactory zero-shot image classification results. Let’s reason about this more: shattering a dataset means drawing boundaries that separate ALL possible subsets of the data. For image or image-text classification tasks, some subsets are more natural than others: we group canines under one label, primates under one label, cars under one label, etc. But grouping an elephant, a phone, and a microwave oven under one label is not semantically natural. In fact, in many benchmark datasets, the data points form very few semantically meaningful subsets. As such, when we apply contrastive learning to learn representations that capture the invariant properties behind these data points, the learned invariant properties don’t have to support random subset retrieval.

However, the story may be different for highly compositional datasets. A task like CLEVR (Johnson et al. 2016), whose scenes are composed of multiple objects with different colors, textures, and shapes, allows us to subset the scenes in many possible ways as long as we issue the right questions. For example, the question Is there anything in the scene? returns the whole set of scenes; the question Is there a red object in the scene? creates a narrower subset; the question Is there a red metal cube to the right of a green sphere? returns an even narrower subset. Of course, this is no proof that we can create short, human-readable queries that exclusively match any random subset of scenes. For many subsets, we may have to ask a clunky question such as Does this scene match scene101 or scene5 or scene74 or scene13 or scene500?. But it is now reasonable to speculate that contrastive learning models may fail to learn representations that support highly compositional reasoning, more so than other representation learning techniques such as generative models, for VC-dimension-related reasons. As a short exercise, I made two attempts to generate synthetic scenes and queries (a sketch follows below). The scenes and queries both follow symbolic templates, so as to unburden the model from learning raw visual or natural-language inputs.
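A hypothetical sketch of what such symbolic templates could look like (made-up attribute values and helper names, not my actual generator):

import random

COLORS = ["red", "green", "blue"]
TEXTURES = ["metal", "rubber"]
SHAPES = ["cube", "sphere"]

def sample_scene(n_objects=3):
    # a scene is just a list of (color, texture, shape) tuples
    return [(random.choice(COLORS), random.choice(TEXTURES), random.choice(SHAPES))
            for _ in range(n_objects)]

def matches(scene, query):
    # query: attribute constraints, e.g. {"color": "red", "shape": "cube"}
    attrs = ("color", "texture", "shape")
    return any(all(query.get(a) in (None, v) for a, v in zip(attrs, obj))
               for obj in scene)

scene = sample_scene()
print(scene, matches(scene, {"color": "red", "shape": "cube"}))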

Contrastive learning for classification

To validate the hypothesis above, we would have to evaluate contrastive learning’s accuracy on query-key matching. This evaluation is a classification problem: at test time, given a query and a key, what’s the probability that the pair match? For zero-shot image classification: given a query (e.g., an image) and a set of keys (e.g., text labels), which key is the right match? However, this is not as straightforward as we would hope. There are two blockers: 1) What should the labels be? {feline, canine} or {feline, canine, Ursidae, primate}? 2) Do the loss functions used in contrastive learning actually lead the model to predict “proper” probabilities at inference time? The latter is a more complicated question, but answering it can help us interpret the numbers CLIP predicts and claims to be probabilities.

Let’s focus on the infoNCE loss function (Oord et al. 2018), which is used to train CLIP and many other contrastive learning models:
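Restated in simplified notation: given a set X = \{x_1, \dots, x_N\} containing one positive sample drawn from p(x|c) and N-1 negatives drawn from p(x), the loss is

\mathcal{L}_N = -\mathbb{E}_X\left[\log \frac{f(x_{\text{pos}}, c)}{\sum_{x_j \in X} f(x_j, c)}\right]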

The authors claim that as the learning algorithm optimizes this objective, f(x,c) converges to k \cdot \frac{p(x,c)}{p(x)p(c)}, where k is a constant. This is based on an assumption about what the optimal value of \frac{f(x_i,c)}{\sum_{x_j \in X} f(x_j,c)} is:
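Restating the paper's argument: that assumed optimum is the posterior probability that x_i was the sample drawn from p(x|c) rather than from p(x),

\frac{f(x_i, c)}{\sum_{x_j \in X} f(x_j, c)} \to p(d = i \mid X, c) = \frac{p(x_i \mid c)\prod_{l \neq i} p(x_l)}{\sum_{j=1}^{N} p(x_j \mid c)\prod_{l \neq j} p(x_l)} = \frac{p(x_i \mid c)/p(x_i)}{\sum_{j=1}^{N} p(x_j \mid c)/p(x_j)}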

However, there is no explanation for why this assumption is valid in the first place. On the other hand, toy experiments show that f(x,c) does indeed converge to k \cdot \frac{p(x,c)}{p(x)p(c)}. If this is true, we cannot consider CLIP’s outputs to be the classification probabilities we intuitively expect. Let y_j be the text label we are considering for the image x_i, and let Y be a complete, perfect set of labels for this classification task:
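Substituting f(x_i, y) \approx k \cdot \frac{p(x_i, y)}{p(x_i)p(y)} into the softmax over candidate labels (my own algebra), the quantity CLIP normalizes is

\frac{f(x_i, y_j)}{\sum_{y \in Y} f(x_i, y)} = \frac{p(x_i, y_j)/\big(p(x_i)\,p(y_j)\big)}{\sum_{y \in Y} p(x_i, y)/\big(p(x_i)\,p(y)\big)} = \frac{p(y_j \mid x_i)/p(y_j)}{\sum_{y \in Y} p(y \mid x_i)/p(y)}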

which does NOT evaluate to the p(y_j|x_i) claimed on OpenAI’s blog. This is very curious, and it suggests a principled way to improve CLIP’s zero-shot classification results, if we can train a p(y) language model and use it during inference, since:
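Concretely, reweighting each score by the label prior before normalizing gives (again my own algebra, using the completeness of Y so that the conditionals sum to one)

\frac{f(x_i, y_j)\,p(y_j)}{\sum_{y \in Y} f(x_i, y)\,p(y)} = \frac{p(y_j \mid x_i)}{\sum_{y \in Y} p(y \mid x_i)} = p(y_j \mid x_i)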

which is the probability we actually care about.

Week 9 – NeurIPS 2020, Variable Binding

TL;DR

In the past two weeks, I have focused on:

  • NeurIPS 2020
  • Variable Binding — literature review and project feasibility study
  • Alternative project proposals — literature review on contrastive learning, risk minimization, and Vision Transformer. I will include the solidified thoughts for the next post.

NeurIPS 2020

NeurIPS 2019 was my first time attending a computer science conference. I was quite overwhelmed by its extensive program and learned that it was impossible to participate in all events. I attended workshops and tutorials relevant to my work but missed most of the spotlights and posters. This year, once the list of accepted papers was released, I skimmed through all the paper titles with my peers, read the abstracts of those that excited me, and studied 6-9 of them in greater detail. I curated my calendar to include 3-4 events each day of the week and attended most of them. Most of the events were related to either Neuro-Symbolic methods or Multimodal Navigation. Overall, it was a great learning experience, and I am grateful for OpenAI’s sponsorship.

Variable Binding

In the past year, I came across the term “Neuro-Symbolic” at MSR and a few conferences. I did not know the topic, but its premise piqued my interest. At the time, I was working on Vision-Language Navigation, and generalization to unseen environments was an open problem. I noticed that agents tend to pick up spurious visual signals from the scenes rather than perform reasoning at the level of spatial semantics. Even if we feed an agent ground-truth room labels, an agent without the support of explicit mapping modules still struggles to develop search heuristics. It seemed to my mentors and me that a well-designed computer program could capture and execute these search heuristics, using the current room type, doorway directions, and navigation history as typed arguments. If we had pursued this approach, the research problem would have been how a DNN can be modularized to learn and solve these symbolic, discrete, logical programs on the fly.

On the other hand, the TP-transformer paper (Schlag et al. 2019) seems to have picked up some traction. It also has a symbolic flavor but attempts to address the problem from within the connectionist framework (contrary to the modular approach above). This transformer achieves state-of-the-art results on math problem solving, and its attention patterns show that arguments are more correctly bound to their relations. For example, in the division a/b, a can correctly attend to b as its denominator, and this property is preserved even when the problem becomes more nested, as in (a/b)/(c/d). Given OpenAI’s heavy interest in transformer models, I decided to investigate possible project directions extending from this paper.

One critical notion mentioned in this paper is variable binding. In the problem a/b, the filler b should be bound to a through the denominator role. Existing, standard multi-head attention layers already allow this by letting the parameters WQ, WK, and WV in each head learn to bind fillers for one role (Goyal and Bengio 2020 use the terminology “typed argument”, analogous to the arguments accepted by program functions). However, this binding mechanism is limited by the number of heads. The TP-transformer authors introduced a specific role-binding mechanism, which allows for finer role learning and binding within each head. Out of curiosity, I tested this fraction simplification problem in the GPT-3 playground:

This is a program that simplifies the division of fractions. The variables are a, b, c, and d. Terms within parentheses are simplified first.

PROMPT: a / b
RETURN: (a) / (b)
PROMPT: b / c
RETURN: (b) / (c)
-----------------------------------
PROMPT: (a / b) / c 
RETURN: (a * c) / (b) 
PROMPT: (c / b) / b
RETURN: (c * b) / (b)
-----------------------------------
PROMPT: (c / d) / (a /a)
RETURN: (c * a) / (d * a)
PROMPT: (d / b) / (c /a)
RETURN: (d * c) / (b * a)
-----------------------------------
PROMPT: ((a / b) / c) / (a / b)
RETURN: (a* c * b) / (b * a)
PROMPT: ((b / d) / a) / (c / d)
RETURN: (b * c * d) / (a * c * d)

Given the demonstration, the model returned the correct answers for the first three problems, which establishes its ability to recognize math symbols and perform basic fraction simplification. However, its failure on the last problem shows that it cannot bind variables recursively when the problem is sufficiently nested. While it’s unfair to evaluate GPT-3 on math problems, this observation, along with TP-transformer’s results, led to the question of whether transformer models can be benchmarked on their ability to bind variables.

However, it’s not straightforward to evaluate variable binding. Suppose (a/b)/(c/d) is encoded into a vector: what should we do with this vector? We can use this information bottleneck to simplify the fraction into (a*d)/(b*c), but this metric conflates the model’s ability to come up with a logical internal representation by binding with its ability to solve that logical representation using learned arithmetic. A fairer approach is to query for the numerator (role) of the numerator (role), and see if the model returns a (filler). This process is called unbinding. I did some literature review on this. Surprisingly, the most influential literature came from the 1990s, when connectionists (most of them in computer science) and symbolists (most of them in cognitive science) attempted to imbue distributional representations (like word embeddings) with structural properties (like logical expressions). Here, I discuss the one I find most representative.

In 1990, Paul Smolensky introduced tensor-products for variable binding. The following figures are directly extracted from his paper.

Here, a role is defined as an \mathbb{R}^5 vector, and a filler as an \mathbb{R}^4 vector. The tensor-product of this binding is a matrix in \mathbb{R}^{4 \times 5} (20 entries), with the entry at position (\phi, \rho) computed as the product of two scalars: unit \phi of the filler vector and unit \rho of the role vector. This binding is also a rank-2 tensor.

Treating each tensor-product as a filler for higher-level binding operations, we can also bind recursively to construct representations for hierarchical structures, for instance a complex, nested expression in lambda calculus. One caveat is that the rank of the resulting tensor-product keeps increasing: every time we bind a rank-H filler with a rank-1 role, the operation returns a rank-(H+1) tensor-product.

At any particular level of a hierarchical structure, the conjunction of bindings is computed as the element-wise sum of all bindings.

1) Exact Unbinding Procedure
2) Self-addressing Procedure

Theorems 3.1, 3.2, and 3.3 show two ways to unbind. The underlying assumption is that all role vectors, whether designed by hand or learned, are linearly independent. This property allows us to unbind by taking the dot-product between the bound tensor-product and either 1) the dual basis of the query role vector, which returns the exact filler of interest, or 2) the query role vector itself, which returns the sum of the filler of interest and a noise term.
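A small numpy sketch of both procedures, with made-up dimensions matching the figures (\mathbb{R}^4 fillers, \mathbb{R}^5 roles); this is my own illustration, not code from the paper:

import numpy as np

rng = np.random.default_rng(0)
fillers = rng.standard_normal((3, 4))                 # three fillers in R^4
roles = rng.standard_normal((3, 5))                   # three linearly independent roles in R^5

# bind each filler to its role with an outer product, then superpose by summing
structure = sum(np.outer(f, r) for f, r in zip(fillers, roles))   # shape (4, 5)

# 1) exact unbinding: contract with the dual basis vector of the query role
duals = np.linalg.pinv(roles)                         # columns are the dual role vectors
exact = structure @ duals[:, 0]                       # recovers fillers[0] exactly
print(np.allclose(exact, fillers[0]))                 # True (up to numerical error)

# 2) self-addressing: contract with the query role itself; if the roles were
#    (near-)orthonormal this would return the filler plus a small noise term
noisy = structure @ roles[0]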

One undesirable property of this binding mechanism is the tensor-product’s rank explosion when the problem’s structure becomes deeply nested. To mitigate this, the TP-transformer authors (who include Smolensky himself) suggested using the Hadamard product as an approximation. With this approximation, binding an \mathbb{R}^5 role vector with an \mathbb{R}^5 filler vector returns an \mathbb{R}^5 representation. We can also arrive at this representation by applying the diagonal function to the full tensor-product representation. To unbind, we take the Hadamard product between the bound \mathbb{R}^5 representation and the \mathbb{R}^5 query role vector itself.
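The approximation in a few lines (again my own illustration):

import numpy as np

rng = np.random.default_rng(0)
filler = rng.standard_normal(5)
role = rng.standard_normal(5)

bound = filler * role                                          # Hadamard binding stays in R^5
assert np.allclose(bound, np.diag(np.outer(filler, role)))     # = diagonal of the full tensor-product

unbound = bound * role                                         # unbind with the role itself;
# this recovers the filler only up to an elementwise role**2 scaling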

Besides Smolensky’s tensor-product, other influential works include Holographic Reduced Representations (Plate, 1994), Spatter Code (Kanerva, 1994), and the Braiding operator (Gayler, 1998). Across the board, researchers recognized that systems for composing structured representations need to support binding, unbinding, conjunction, and recursive construction/deconstruction. Also, note that each unbinding method is specific to its binding method; it wouldn’t be fair to evaluate one system’s ability to bind using another system’s unbinding operator.

In addition to this set of literature, I also reviewed the following to understand the definition of variable binding:

Variable Binding Tree and Lambda Calculus (Church, 1936): I find the variable binding tree’s formalism to be useful for generating synthetic data of different levels of binding complexity. The number of leaves corresponds to the number of terminal values bound as variables. The height of the tree corresponds to the number of recursive binding operations. The width of the tree at any level corresponds to the number of conjunctions at the level.

On the Binding Problem in Artificial Neural Networks (Greff et al. 2020): I find their characterization of the binding problem as three subproblems (segregation, representation, and composition) insightful. They also provide a good survey of existing methods that attempt to perform binding within the connectionist framework, as opposed to modular hybrid approaches, which designate some modules to deal with symbolic information content and others to deal with neural information content.

Measuring Compositionality in Representation Learning (Andreas, 2019): The author proposes a general procedure for measuring any encoding function’s ability to compose, by learning an approximation function that has access to the ground-truth structure of a problem. If the encoding function doesn’t encode compositionally, the encoded representation has no compositional properties, and the approximation function cannot match it even with the ground-truth information. While this procedure makes intuitive sense, I’m skeptical of certain aspects once the procedure is operationalized in the experiments. First, the composition function (*) is assumed to be addition in all experiments; there is no explanation for this collapse of conjunction and recursive binding into one single operator. Second, in some experiments it is unclear what the encoding function and the approximation function are, and there is no explanation for this either. These two aspects make extending this paper quite challenging.

Week 7 – Mixed Precision Training

TL;DR

In the past two (and final two) curriculum weeks, I have focused on:

  • Manually implementing mixed-precision training on a toy MLP problem. Ran experiments to understand when to expect memory efficiency and speedup.
  • Reviewing literature and drafting my project proposal on variable binding. I will update here again once the material is packaged up.

Mixed-Precision Training

The premise of mixed-precision training is that if all data and parameters are 16-bit instead of 32-bit, we cut the size of every tensor in GPU memory in half. Therefore, we can 1) double our batch size, and 2) increase arithmetic bandwidth within the GPU and reduce network bandwidth between processes in distributed settings. NVIDIA’s Volta-generation GPUs (e.g. the V100) have hardware that supports these theoretical premises, and practitioners often observe a 3x or greater speedup in their experiments.

Mixed-precision training is becoming the status quo for research engineering, so I decided to do a short exercise to understand exactly how and when tensors are cast to 16-bit and what new class of bugs can arise. Besides the PyTorch amp library and a few YouTube videos that explain 16-bit versus 32-bit numeric representations, I found this notebook from fast.ai quite useful. It explained away several bugs that I encountered when I naively set every tensor to 16-bit in a large model like the transformer, and it matched my expectations of where and when true 16-bit training fails due to numerical underflow and overflow. The tutorial suggested three techniques:
1) Avoid imprecise weight updates: maintain a 32-bit master copy of the weights, which we update with 32-bit gradients.
2) Avoid numeric underflow in the 16-bit gradients computed during backprop, especially when the model is very deep: scale the loss up as much as we can without overflowing (i.e. dynamically), backprop with this scaled-up loss to get 16-bit gradients (which are no longer too close to zero), cast them to 32-bit, downscale by the same loss scaler, and use the result to update the 32-bit master weights.
3) Avoid numeric overflow in activations or the loss (the opposite problem from gradients): compute the loss in 32-bit, which means casting the 16-bit inputs (e.g. logits) to 32-bit and using those to compute the loss. Apply the same to batchnorm layers. (A compressed sketch of this workflow follows this list.)
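A compressed sketch of that workflow in plain PyTorch, with a fixed loss scale instead of apex's dynamic scaling (variable names are mine; it needs a GPU):

import torch

model = torch.nn.Linear(1024, 1024).cuda().half()                     # 16-bit model weights
master = [p.detach().clone().float() for p in model.parameters()]     # 32-bit master copy
opt = torch.optim.SGD(master, lr=1e-3)
loss_scale = 2 ** 12                                                  # fixed scale for simplicity

x = torch.randn(32, 1024, device="cuda", dtype=torch.float16)
y = torch.randn(32, 1024, device="cuda", dtype=torch.float16)

out = model(x)                                                        # 16-bit forward pass
loss = torch.nn.functional.mse_loss(out.float(), y.float())           # compute the loss in 32-bit
(loss * loss_scale).backward()                                        # backprop the scaled-up loss

opt.zero_grad()
for p16, p32 in zip(model.parameters(), master):
    p32.grad = p16.grad.float() / loss_scale                          # cast to 32-bit and unscale
    p16.grad = None
opt.step()                                                            # update the 32-bit master weights
for p16, p32 in zip(model.parameters(), master):
    p16.data.copy_(p32.data.half())                                   # copy back into the 16-bit model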

To understand this workflow in detail, I tried the first two suggestions on a toy MLP problem, using the recommended apex library utilities and without the fast.ai wrappers. The implementation is here; look for `def main_train`, which breaks the workflow down into sixteen small steps.

I ran a few experiments to observe the gains in memory and wall-time efficiency. The experiments compared a 160M-parameter model, a 2.5M-parameter model, and a 116K-parameter model. Here are the two main takeaways:
1) Wall-time speedup is only observed when the model is large (e.g. many wide layers in an MLP). However, when the model is very large, the master copy of 32-bit weights and the 32-bit gradients can take up so much memory that we are not able to scale up the batch size.
2) Memory efficiency (measured by regularly sampling the maximum memory allocated) is only observed when the model is small. However, when the model is small, the number of matrix multiplications is trivial, so we don't get to take advantage of the increased arithmetic bandwidth and consequently see no wall-time speedup.

In practice, there are several ways to implement mixed-precision training. These libraries have been optimized for speedup and memory efficiency. They are also straightforward to incorporate into any given PyTorch module code.

PyTorch Automatic Mixed Precision Package

  • Certain operations are autocast to float-32. These ops are more sensitive to numeric underflow or overflow; hence, the list includes functions that involve logarithms, exponentiation, and normalization.
  • Certain operations are autocast to float-16. These ops typically involve plain linear algebra.
  • GradScaler handles loss scaling.
  • My implementation of the MLP toy problem using torch.cuda.amp. Compared to the manual implementation, this has less educational value but is a lot simpler.
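For reference, the torch.cuda.amp pieces fit together roughly like this (a minimal sketch with a fake one-batch loader):

import torch

model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()                     # handles dynamic loss scaling
loader = [(torch.randn(32, 1024, device="cuda"), torch.randn(32, 1024, device="cuda"))]

for x, y in loader:
    opt.zero_grad()
    with torch.cuda.amp.autocast():                      # ops are autocast per the lists above
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()                        # backprop with the scaled loss
    scaler.step(opt)                                     # unscales grads; skips the step on inf/nan
    scaler.update()                                      # adjusts the scale factor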

PyTorch Lightning

  • The Trainer class accepts arguments that support full precision, conservative mixed precision (only ops robust to numerical issues are cast to 16-bit), mixed precision (maintains a master copy of 32-bit weights), and true 16-bit precision.
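For example (Lightning's 1.x API; the LightningModule here is hypothetical):

from pytorch_lightning import Trainer

trainer = Trainer(gpus=1, precision=16)    # mixed precision; precision=32 is the full-precision default
# trainer.fit(my_lightning_module)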

Week 5 – Transformer Diagnosis

TL;DR

In the past two curriculum weeks, I have focused on getting my Transformer implementation to its best possible shape.

  • Datasets: I ran the transformer on four toy datasets, two sanity-check datasets, and a subset of one benchmark machine translation dataset.
  • Experiments: I ran a set of experiments to understand the effects of weight initialization, weight tying vs. untying, and the learning rate schedule on performance and speed of convergence.
  • 16-bit training: I started reading about 16-bit training and ran some naive experiments to see what kinds of bugs would come up. I will leave this report for the next blog post, when the whole exercise is complete.

Datasets

The four toy datasets are designed to test if the implementation can:

1) transfer signals as-is from the encoder to the decoder, with a symbol copying task. There are 20 symbols; the first two are encoded as <go> and <stop>.

X
[ 1, 18,  3,  6,  6,  2],
[ 1, 15, 13, 14,  7,  2]
Y
[ 1, 18,  3,  6,  6,  2],
[ 1, 15, 13, 14,  7,  2]

2) manipulate the input pattern, with the targets reversed.

X
[ 1, 18,  3,  6,  6,  2],
[ 1, 15, 13, 14,  7,  2]
Y
[ 1,  6,  6,  3, 18,  2],
[ 1,  7, 14, 13, 15,  2]

3) take advantage of additional structure in the input pattern. We should expect the model to converge sooner, as it’s easier to remember a fixed sequence ordering and reuse it.

X
[ 1, 15, 16, 17, 18,  2],
[ 1,  8,  9, 10, 11,  2]
Y
[ 1, 15, 16, 17, 18,  2],
[ 1,  8,  9, 10, 11,  2]

4) interpret the symbols as numeric values and perform simple arithmetic on them. For example, predict Y_k as the sum of X_k and X_(k-1) (a data-generation sketch follows the examples below).

X
[ 1,  7, 10,  8,  3,  2],
[ 1,  8,  6,  6,  9,  2]
Y
[ 1,  8, 17, 18, 11,  2],
[ 1,  9, 14, 12, 15,  2]
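A sketch of how such toy batches might be generated (my own reconstruction, not the exact notebook code), covering the copy, reverse, and sum tasks:

import numpy as np

rng = np.random.default_rng(0)

def make_batch(task, batch_size=2, seq_len=4, vocab=20):
    # symbols 1 and 2 are <go> and <stop>; payload symbols are 3..vocab-1
    body = rng.integers(3, vocab, size=(batch_size, seq_len))
    go = np.full((batch_size, 1), 1)
    stop = np.full((batch_size, 1), 2)
    x = np.hstack([go, body, stop])
    if task == "copy":
        y_body = body
    elif task == "reverse":
        y_body = body[:, ::-1]
    elif task == "sum":                                  # Y_k = X_k + X_(k-1)
        y_body = body + np.hstack([go, body[:, :-1]])
    y = np.hstack([go, y_body, stop])
    return x, y

print(make_batch("reverse"))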

The two sanity-check datasets (Polynomial Expansion, Fake Dates) are also synthetic, and training on them further validates whether the transformer can perform more complex reasoning over the entire input context window (instead of a limited local one).

Polynomial Expansion

X
(7-3*z)*(-5*z-9)
(2-2*n)*(n-1)
Y
15*z**2-8*z-63
-2*n**2+4*n-2

Fake Dates

X
april 11 1981
wednesday september 1 2004
Y
1981-04-11
2004-09-01

Finally, I downloaded the WMT-2014 English-French dataset, pushed the data through a byte-pair encoding pipeline, and overfit on a very small subset (train: 1024, val: 64). I didn’t train on the entire dataset, mainly because of compute limitations; it’s also more important for my curriculum to prioritize 16-bit training and image recognition datasets.

X  
<GO> les sché@@ mas et leurs inclu@@ sions sont télé@@ chargés automatiquement depuis internet la première fois qu'ils sont requ@@ is, puis gar@@ dés en cache local@@ ement. <EOS> 
Y
<GO> the schem@@ as and their inclu@@ sions are automatically downloaded once from their internet location, and cach@@ ed loc@@ ally. <EOS>

X
<GO> /@@ 1.1 avec connexions persist@@ antes et compression pour minimiser la charge réseau. <EOS> 
Y
<GO> /@@ 1.1 with persistent connections and compression for minim@@ ising network lo@@ ad. <EOS> 

If the hyperparameters are carefully chosen, the transformer should converge on all these datasets (exact-match accuracy at 100% and BLEU > 0.85) within 1-4k steps. The corresponding notebooks are listed here. They follow the same transformer implementation with Heads=2 and N_layers=2; the only difference is the data (look for the Dataset, Dataloader, and DataModule classes).

Experiments

In the process of validating the transformer on the above datasets, I realized that hyperparameters and weight initialization can have substantial effects on convergence and performance. The following plots correspond to the second toy dataset above (reversing randomly generated input symbols).

Hyperparameter warmup_steps: This hyperparameter directly controls the slope and peak of the learning rate schedule, and it should be adjusted according to the batch size. With a smaller batch size, the weight particles should take smaller steps.

Validation Exact-Match Accuracy (student-forced decoding). Batch size = 100. With low warmup_steps (4000 red, 6000 purple), the learning rate picks up too fast and peaks too high; the weights overstep the minima of the loss surface and never recover.
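For reference, warmup_steps parameterizes the schedule from Vaswani et al.; a one-line sketch of it (the d_model default here is illustrative):

def noam_lr(step, d_model=512, warmup_steps=4000):
    # linear warmup for warmup_steps, then decay proportional to 1/sqrt(step);
    # the peak learning rate scales as 1/sqrt(warmup_steps)
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)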

Tying / untying WQKVO at initialization (but not afterwards): For the toy datasets, initializing WQ and WK to the same weights seems important. One possible explanation is that doing so lets the multi-head attention module start at a sensible baseline, similar to simple dot-product attention without linear projections. On the other hand, zeroing out WO was suggested in the Fixup initialization paper (Zhang et al. 2019) for residual learning without any normalization. Our architecture still maintains Layer-Norm between all sub-layers, so we don’t see any gains from including this technique.

Validation Exact-Match Accuracy (student-forced decoding). Batch size = 100. With WQ, WK initialized to different weights (i.e. untying, dotted blue), the model’s performance plateaus below 0.4. All other combinations of weight untying verify that this lower performance is indeed due to untying WQ and WK only. Zeroing out WO (solid blue, dotted orange) hurts convergence in our architecture which maintains Layer-Norm.
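The initialization-time tying itself is a one-liner; a sketch with made-up module names (the weights are free to diverge during training):

import torch
import torch.nn as nn

d_model = 64                                 # illustrative size
w_q = nn.Linear(d_model, d_model, bias=False)
w_k = nn.Linear(d_model, d_model, bias=False)
w_o = nn.Linear(d_model, d_model, bias=False)

with torch.no_grad():
    w_k.weight.copy_(w_q.weight)             # tie WQ and WK at initialization only
    # nn.init.zeros_(w_o.weight)             # Fixup-style zeroing of WO; hurt convergence here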

Embedding Initialization: Because the context length is short (only 8 positions), using sinusoidal embeddings did not benefit training as much as learned embeddings. Once the embedding weights are initialized from N(0, 1/d_model), rather than the default N(0, 1) used by torch.nn.Embedding, the model converges smoothly at 4k rather than 20k steps. Finally, initializing the learned positional embeddings from N(0, 0.5 * variance of the word embeddings) speeds up convergence further by a small margin.

Validation Exact-Match Accuracy (student-forced decoding). Batch size = 100. Sinusoidal positional embeddings (dotted green) are inferior to learned embeddings (all other colors) at this short context length (8), in both convergence speed and performance at 20k steps. Properly scaled N(0, 0.01) embeddings (red, green) converge much faster than N(0, 1) embeddings (purple, grey). Finally, setting the initialization variance of the positional embeddings to half that of the word embeddings improves convergence speed further (red>green, purple>grey).
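The corresponding initialization, sketched with torch.nn.init (sizes are illustrative):

import torch.nn as nn

d_model, vocab, max_len = 64, 20, 8
tok_emb = nn.Embedding(vocab, d_model)
pos_emb = nn.Embedding(max_len, d_model)

nn.init.normal_(tok_emb.weight, mean=0.0, std=d_model ** -0.5)          # variance 1/d_model
nn.init.normal_(pos_emb.weight, mean=0.0, std=(0.5 / d_model) ** 0.5)   # half the word-embedding variance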

Week 3 – Transformer, Distributed Training, Automatic Differentiation

TL;DR

In the past two curriculum focused weeks, I continued to brush up on some foundational ML topics.

  • Standard Transformer: I had a basic grasp of the standard transformer architecture (Vaswani et al. 2017) already, but a few other concepts crucial to a successful implementation were still missing.
  • Distributed training: I have taken a class on scaling classic ML models before, but not specifically on distributing deep learning model training across GPUs and nodes. The mechanisms were opaque to me.
  • Automatic differentiation: This is a very foundational topic, yet my formal data science education never covered it!


Approach and Highlights

Nothing internalizes an ML/DL/Stats topic for me more than hands-on implementation and problem solving, supplemented with lots of Google searches for literature, lecture notes, documentation, and discussion threads. This learning style is reminiscent of how I often let my hands (models, drawings) guide my brain (precedent review and analysis) during the early years of my education in Architecture. Conversations with my mentors remain crucial, and I checked in with my mentor Gabe whenever I was stuck on a concept. Our discussions have been delightful.

With the Transformer, I took the model architecture diagram from the paper and refactored it into a top-down module collection. This was a relatively straightforward first step. I also asked Alethea Power (an excellent scholar from the last cohort and a current engineer) for their advice, and learned quite a few helpful tips that leverage PyTorch specifics for module organization and for eliminating boilerplate training code with Lightning. Whenever a substantial number of modules was completed, I would design simple tests to ensure that they gave me the expected outputs. I was especially delighted by the way matrix multiplication simplifies the code and optimizes speed and space efficiency (which the authors also claim) for multi-head attention; this use case could make a perfect textbook example for vectorization! For me, though, the highlight of this implementation exercise was internalizing layer-norm, positional encoding, the learning rate schedule, and label smoothing. I paid extra attention to their motivations, did some light literature review, and made my mentor spend spare time discussing them with me (thanks, Gabe!). FYI, this collection is a standard set of resources that the scholars use to study and implement their own transformers:
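As an aside, the vectorization point above can be made concrete with a compressed sketch of batched multi-head self-attention (my own toy version, without masking or dropout, not my actual module code):

import torch

def multi_head_attention(x, w_qkv, w_o, n_heads):
    B, T, D = x.shape
    q, k, v = (x @ w_qkv).chunk(3, dim=-1)               # one matmul produces Q, K, V
    # reshape to (B, heads, T, D/heads) so every head is computed in parallel
    split = lambda t: t.view(B, T, n_heads, D // n_heads).transpose(1, 2)
    q, k, v = split(q), split(k), split(v)
    scores = q @ k.transpose(-2, -1) / (D // n_heads) ** 0.5
    out = scores.softmax(dim=-1) @ v                     # (B, heads, T, D/heads)
    out = out.transpose(1, 2).reshape(B, T, D)           # concatenate the heads back
    return out @ w_o

x = torch.randn(2, 8, 64)
y = multi_head_attention(x, torch.randn(64, 192), torch.randn(64, 64), n_heads=2)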

With distributed training, my primary resource was the PyTorch documentation on the distributed package. Clarifying the difference between data-parallel and distributed data-parallel, and understanding how, which, and why tensors are exchanged between processes and devices, was a crucial first step. The primary motivations and concepts resemble how multi-node training of classic ML algorithms works in the map-reduce paradigm (thanks to the W261 course). I did two light notebook exercises to understand how CUDA manages and reports memory usage for tensors, and implemented a toy MLP problem to see how an implementation would align with the documentation for CPU training, a single GPU with a single process, and multiple GPUs with individual processes. Also, Gabe and I had extensive discussions on the trade-offs between different data, model, and network challenges: when 1) the data is heavy (e.g. image features); 2) the model is heavy (e.g. a transformer); and 3) the network is slow (e.g. ring all-reduce's assumptions don't hold well across multiple nodes, each with several GPUs, communicating over Ethernet). FYI, this is the seed list of my readings, from which I drilled down to the necessary details.
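A minimal sketch of the distributed-data-parallel setup described in those docs (one process per GPU; the address, port, and sizes are placeholders):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = DDP(torch.nn.Linear(1024, 1024).cuda(rank), device_ids=[rank])
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    x = torch.randn(32, 1024, device=rank)
    loss = model(x).pow(2).mean()
    loss.backward()                                      # DDP hooks all-reduce the gradients here
    opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    torch.multiprocessing.spawn(main, args=(world_size,), nprocs=world_size)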

With automatic differentiation, I solicited a list of readings from my peers. Like many beginners in DL, I did not understand why backprop is such a big deal (if it is just the chain rule) or why it took so long for researchers to come up with it. The difference between forward and backward mode enlightened me! Also, following and practicing with Karpathy’s micrograd implementation was really helpful. FYI, here is the list, from easy to involved:
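To make the backward-mode point concrete, here is a toy reverse-mode sketch in the spirit of micrograd (add and mul only, my own simplification):

class Value:
    def __init__(self, data, parents=()):
        self.data, self.grad = data, 0.0
        self._parents, self._backward = parents, lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # visit nodes in reverse topological order so each node's gradient is
        # complete before it is propagated to its parents (the chain rule, backwards)
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b = Value(2.0), Value(3.0)
y = a * b + a            # dy/da = b + 1 = 4, dy/db = a = 2
y.backward()
print(a.grad, b.grad)    # 4.0 2.0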


Next Steps

I am currently working on validating my standard transformer implementation with a few more toy problems: for example, sequence copying, reversing, and auto-regressive series (e.g. the Fibonacci sequence). I am also running other implementation checks: 16-bit training, memory profiling, and PyTorch Lightning integration. This is a prerequisite to the research direction Gabe and I have in mind: bringing more clarity to transformer-type architectures through studying image recognition. More on that later 🙂

Week 1 – Custom Logistic Regression

Hello, this is Legg. I am a Fall 2020 Scholar at OpenAI, working with my mentor Gabriel Goh on the Clarity team. Between October 2020 and April 2021, my peers and I will be publishing biweekly blog posts to share our journeys as scholars.

I tend to get questions about career transition and whether one should invest in additional schooling or industry research residencies. Therefore, before diving into any technical discussions, I want to share a few things about myself so that the readers of this blog understand where my baseline is. Hopefully, this information, in addition to the collection of upcoming technical discussions, can help the readers make better judgments for their careers.

  1. Between 2006-2017, I trained and practiced as an Architect and Urban Designer. I did not receive any quantitative or engineering training from my bachelor’s and first master’s degrees. (I wouldn’t count Grasshopper scripting, 3D modeling, or rendering as technical for AI purposes.)
  2. Between 2017-2019, I completed a master’s degree in data science. It is a popular degree for folks who want to transition into Data Science. The courses include statistics, machine learning, and NLP with deep learning. I also took two semesters off for full-time machine learning and data science internships.
  3. Between 2019-2020, I was an AI resident at Microsoft Research. I focused on Vision-Language Navigation (VLN) and Instruction Generation.
  4. Before my start date at OpenAI, I took about ten days to complete the Deep Learning Specialization (deeplearning.ai, Coursera) to refresh my memory and fill any knowledge gaps in the basics.

In summary, I am still very new to the field of AI. Through this scholar program, I am hoping to gain more experience in research engineering and developing research intuitions.


Back to documenting my first week at OpenAI: I took advantage of the general on-boarding period to learn a little more about PyTorch. I have been using PyTorch for 12 months, mostly extending PyTorch research codebases such as ALFRED and Room2Room. I wanted to understand Autograd and custom functions better. Therefore, I coded up logistic regression from scratch and compared it with the off-the-shelf implementation on a toy problem. More specifically, I implemented the following:

  • Linear, Sigmoid, and loss layers as custom torch.autograd.Function subclasses (see the sketch after this list)
  • Gradient checking using torch.autograd.gradcheck
  • SGD optimizer as custom torch.optim.Optimizer
  • A toy dataset and dataloader using sklearn.datasets.make_classification, torch.utils.data.Dataset, and torch.utils.data.DataLoader
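A minimal sketch of one of those pieces, Sigmoid as a custom torch.autograd.Function, verified with gradcheck (my own toy version, not the full implementation):

import torch

class Sigmoid(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        out = 1.0 / (1.0 + torch.exp(-x))
        ctx.save_for_backward(out)                       # cache the output for the backward pass
        return out

    @staticmethod
    def backward(ctx, grad_output):
        out, = ctx.saved_tensors
        return grad_output * out * (1.0 - out)           # d(sigmoid)/dx = s * (1 - s)

x = torch.randn(4, dtype=torch.double, requires_grad=True)
print(torch.autograd.gradcheck(Sigmoid.apply, (x,)))     # True if analytic and numeric grads match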

The exercise took about two days. I spent most of the time reading the PyTorch documentation, coding up my implementation, and debugging when the losses of my implementation mismatched those of the default implementation. The process was quite educational. The implementation is here (feel free to comment with questions/suggestions), along with links to all the references I used. Along this line of small exercises, I am interested in getting a similar toy problem to train on multiple GPUs. There are at least three ways to do this in PyTorch. More updates on that in about two weeks 🙂