Week 5 – Transformer Diagnosis


Over the past two curriculum weeks, I have focused on getting my Transformer implementation into its best possible shape.

  • Datasets: I ran the transformer on four toy datasets, two sanity-check datasets, and a subset of one benchmark machine translation dataset.
  • Experiments: I ran a set of experiments to understand the effects of weight initialization, weight tying vs. untying, and the learning rate schedule on performance and speed of convergence.
  • 16-bit training: I started reading about 16-bit training and ran some naive experiments to see what kinds of bugs would come up. I will leave this report for the next blog post, when the whole exercise is complete.


The four toy datasets are designed to test if the implementation can:

1) transfer signals as-is from the encoder to the decoder, with a symbol-copying task. There are 20 symbols; the first two are encoded as <go> and <stop>.

[ 1, 18,  3,  6,  6,  2],
[ 1, 15, 13, 14,  7,  2]
[ 1, 18,  3,  6,  6,  2],
[ 1, 15, 13, 14,  7,  2]

2) manipulate the input pattern, with the targets reversed.

[ 1, 18,  3,  6,  6,  2],
[ 1, 15, 13, 14,  7,  2]
[ 1,  6,  6,  3, 18,  2],
[ 1,  7, 14, 13, 15,  2]

3) take advantage of additional structure in the input pattern. We should expect the model to converge sooner, as it is easier to remember a fixed sequence ordering and reuse it.

[ 1, 15, 16, 17, 18,  2],
[ 1,  8,  9, 10, 11,  2]
[ 1, 15, 16, 17, 18,  2],
[ 1,  8,  9, 10, 11,  2]

4) interpret the symbols as numeric values and perform simple arithmetic on them. For example, predict Y_k as the sum of X_k and X_(k-1).

[ 1,  7, 10,  8,  3,  2],
[ 1,  8,  6,  6,  9,  2]
[ 1,  8, 17, 18, 11,  2],
[ 1,  9, 14, 12, 15,  2]
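The four tasks above can be generated in a few lines of Python. A minimal sketch (my actual data pipeline may differ; the symbol range and token IDs are taken from the samples above, everything else is an assumption):

```python
import random

GO, STOP = 1, 2  # per the samples above: 1 = <go>, 2 = <stop>

def make_pair(task, length=4, rng=random):
    """Return one (source, target) pair for the copy / reverse / sum toy tasks."""
    body = [rng.randint(3, 19) for _ in range(length)]  # content symbols 3..19
    if task == "copy":
        out = body
    elif task == "reverse":
        out = body[::-1]
    elif task == "sum":
        # Y_k = X_k + X_(k-1), where X_0 is the <go> token (matches the samples)
        out = [a + b for a, b in zip(body, [GO] + body[:-1])]
    else:
        raise ValueError(task)
    return [GO] + body + [STOP], [GO] + out + [STOP]
```

Note that for the sum task the targets can exceed the 20-symbol input vocabulary, so the output vocabulary has to be sized accordingly.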

The two sanity-check datasets (Polynomial Expansion, Fake Dates) are also synthetic, and training on them further validates that the transformer can perform more complex reasoning over the entire input context window (instead of a limited local one).

Polynomial Expansion

Fake Dates (sample inputs)

april 11 1981
wednesday september 1 2004

Finally, I downloaded the WMT-2014 English-French dataset, pushed the data through a byte-pair encoding pipeline, and overfit on a very small subset (train: 1024, val: 64). I didn't train on the entire dataset, mainly because of compute limitations. It's also more important for my curriculum to prioritize 16-bit training and image recognition datasets.

<GO> les sché@@ mas et leurs inclu@@ sions sont télé@@ chargés automatiquement depuis internet la première fois qu'ils sont requ@@ is, puis gar@@ dés en cache local@@ ement. <EOS> 
<GO> the schem@@ as and their inclu@@ sions are automatically downloaded once from their internet location, and cach@@ ed loc@@ ally. <EOS>

<GO> /@@ 1.1 avec connexions persist@@ antes et compression pour minimiser la charge réseau. <EOS> 
<GO> /@@ 1.1 with persistent connections and compression for minim@@ ising network lo@@ ad. <EOS> 

If the hyperparameters are carefully chosen, the transformer should converge on all of these datasets (exact-match accuracy at 100% and BLEU > 0.85) within 1-4k steps. The corresponding notebooks are listed here. They follow the same transformer implementation with Heads=2 and N_layers=2; the only difference is the data (look for the Dataset, Dataloader, and DataModule classes).


In the process of validating the transformer on the above datasets, I realized that hyperparameters and weight initialization can have substantial effects on convergence and performance. The following plots correspond to the second toy dataset above (reversing randomly generated input symbols).

Hyperparameter warmup_steps: This hyperparameter directly controls the slope and peak of the learning rate schedule, and it should be adjusted according to the batch size: with a smaller batch size, the weight particles should take smaller steps.

Validation exact-match accuracy (student-forced decoding), batch size = 100. With low warmup_steps (4000, red; 6000, purple), the learning rate picks up too fast and peaks too high; the weights overstep the minima of the loss surface and never recover.
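For reference, this is the schedule from Vaswani et al. (2017): warmup_steps sets both the ramp slope and the peak, since the learning rate peaks exactly at step = warmup_steps.

```python
def noam_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate schedule from "Attention Is All You Need":
    linear warmup to a peak at step == warmup_steps, then inverse-sqrt decay."""
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

Note that halving warmup_steps raises the peak learning rate by a factor of sqrt(2), which is why too-low warmup values can overshoot.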

Tying / Untying WQKVO at initialization (but not afterwards): For the toy datasets, initializing WQ and WK to the same weights seems important. One possible explanation is that doing so lets the multi-head attention module start at a sensible baseline, similar to simple dot-product attention without linear projections. On the other hand, zeroing out WO was suggested in the Fixup initialization paper (Zhang et al. 2019) for residual learning without any normalization. Our architecture still maintains Layer-Norm between all sub-layers, so we don't see any gains from including this technique.

Validation exact-match accuracy (student-forced decoding), batch size = 100. With WQ and WK initialized to different weights (i.e., untied; dotted blue), the model's performance plateaus below 0.4. All other combinations of weight untying verify that this lower performance is indeed due to untying WQ and WK only. Zeroing out WO (solid blue, dotted orange) hurts convergence in our architecture, which maintains Layer-Norm.
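A sketch of what I mean by tying at initialization (NumPy for brevity; the real code uses torch.nn.Linear layers, and the init scale here is an assumption):

```python
import numpy as np

def init_attention_weights(d_model, tie_qk=True, zero_o=False, seed=0):
    """Draw W_Q, W_K, W_V, W_O. tie_qk copies W_Q into W_K at init only;
    the two matrices are free to diverge during training."""
    rng = np.random.default_rng(seed)
    scale = d_model ** -0.5  # assumption: Xavier-like scale
    draw = lambda: rng.normal(0.0, scale, (d_model, d_model))
    w_q = draw()
    w_k = w_q.copy() if tie_qk else draw()
    w_v = draw()
    w_o = np.zeros((d_model, d_model)) if zero_o else draw()
    return w_q, w_k, w_v, w_o
```

With W_Q == W_K, the pre-softmax score between tokens x_i and x_j is (x_i W)(x_j W)^T, a symmetric bilinear form that is closer in spirit to plain dot-product attention than two independent random projections.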

Embedding Initialization: Because the context length is short (only 8 positions), sinusoidal embeddings did not benefit training as much as learned embeddings. Once the embedding weights are initialized from N(0, 1/d_model), rather than the default N(0, 1) set in torch.nn.Embedding, the model converges smoothly at 4k rather than 20k steps. Finally, initializing the learned positional embeddings from N(0, 0.5 * variance of word embeddings) speeds up convergence further by a small margin.

Validation exact-match accuracy (student-forced decoding), batch size = 100. Sinusoidal positional embeddings (dotted green) are inferior to learned embeddings (all other colors) at this short context length (8), in both convergence speed and performance at 20k steps. Properly scaled N(0, 0.01) embeddings (red, green) converge much faster than N(0, 1) embeddings (purple, grey). Finally, setting the initialization variance of the positional embeddings to half that of the word embeddings improves convergence speed further (red > green, purple > grey).
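Concretely, the initialization that worked best looks like this (NumPy sketch; note that the second argument of N(·, ·) above is a variance, while NumPy's normal takes a standard deviation, hence the square roots):

```python
import numpy as np

def init_embeddings(vocab_size, n_positions, d_model, seed=0):
    """Word embeddings ~ N(0, 1/d_model); positional embeddings get
    half the word-embedding variance."""
    rng = np.random.default_rng(seed)
    word_var = 1.0 / d_model
    word = rng.normal(0.0, word_var ** 0.5, (vocab_size, d_model))
    pos = rng.normal(0.0, (0.5 * word_var) ** 0.5, (n_positions, d_model))
    return word, pos
```

In PyTorch this amounts to calling nn.init.normal_ on each nn.Embedding.weight with std = d_model ** -0.5 (the default is std = 1).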

Week 3 – Transformer, Distributed Training, Automatic Differentiation


Over the past two curriculum-focused weeks, I continued to brush up on some foundational ML topics.

  • Standard Transformer: I already had a basic grasp of the standard transformer (Vaswani et al. 2017) architecture, but a few other concepts crucial to a successful implementation were still missing.
  • Distributed training: I have taken a class on scaling classic ML models before, but not specifically on distributing deep learning model training across GPUs and nodes. The mechanisms were opaque to me.
  • Automatic differentiation: This is a very foundational topic, yet my formal data science education never covered it!

Approach and Highlights

Nothing internalizes an ML/DL/Stats topic for me more than hands-on implementation and problem solving, supplemented with lots of Google searches for literature, lecture notes, documentation, and discussion threads. This learning style is reminiscent of how I often let my hands (models, drawings) guide my brain (precedent review and analysis) during the early years of my education in Architecture. Conversations with my mentors remain crucial, and I checked in with my mentor Gabe whenever I was stuck on a concept. Our discussions have been delightful.

With the Transformer, I took the model architecture diagram from the paper and refactored it into a top-down module collection. This was a relatively straightforward first step. I also asked Alethea Power (an excellent scholar from the last cohort and a current engineer) for advice, and learned quite a few helpful tips that leverage PyTorch specifics for module organization and eliminate boilerplate training code with Lightning. Whenever a substantial number of modules were completed, I would design simple tests to ensure that they gave me the expected outputs. I was especially delighted with the way matrix multiplication simplifies the code and optimizes speed and space efficiency (which the authors also claim) for multi-head attention. This use case would make a perfect textbook example of vectorization! For me, though, the highlight of this implementation exercise was internalizing layer-norm, positional encoding, the learning rate schedule, and label smoothing. I paid extra attention to their motivation, did some light literature review, and asked my mentor to spend spare time discussing them with me (thanks, Gabe!). FYI, this collection is a standard set of resources that the scholars use to study and implement their own transformers:
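To illustrate the vectorization point: all heads can be computed with one reshape and one batched matmul instead of a Python loop over heads. A NumPy sketch of the attention weights (single sequence, no masking, and no output projection; function name is mine):

```python
import numpy as np

def multi_head_weights(x, w_q, w_k, n_heads):
    """Attention weights for all heads at once. x: (t, d_model);
    w_q, w_k: (d_model, d_model); returns (n_heads, t, t)."""
    t, d = x.shape
    d_head = d // n_heads
    # One matmul projects all heads; the reshape splits them apart.
    q = (x @ w_q).reshape(t, n_heads, d_head).transpose(1, 0, 2)
    k = (x @ w_k).reshape(t, n_heads, d_head).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / d_head ** 0.5  # batched over heads
    scores -= scores.max(axis=-1, keepdims=True)       # numerically stable softmax
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)
```

The same trick extends to a batch dimension, which is where the speed and memory savings the authors mention really show up.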

With distributed training, my primary resource was the PyTorch documentation on the distributed package. Clarifying the difference between data-parallel and distributed data-parallel, and understanding how, which, and why tensors are exchanged between processes and devices, was a crucial first step. The primary motivations and concepts resemble how multi-node training for classic ML algorithms works in the map-reduce paradigm (thanks to the W261 course). I did two light notebook exercises to understand how CUDA manages and reports memory usage for tensors, and implemented a toy MLP problem to see how an implementation would align with the documentation for CPU training, a single GPU with a single process, and multiple GPUs with individual processes. Also, Gabe and I had extensive discussions on the trade-offs under different data, model, and network challenges: 1) data is heavy (e.g., image features); 2) model is heavy (e.g., transformer); and 3) network is slow (e.g., ring all-reduce's assumptions don't hold well across multiple nodes, each with various GPUs, communicating over Ethernet). FYI, this is the seed list of my readings, from which I drilled down to the necessary details.
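The core identity behind data-parallel training can be checked in a few lines: for a loss that averages over examples, averaging per-shard gradients (what all-reduce computes across ranks) equals the full-batch gradient. A single-process NumPy simulation (no actual torch.distributed involved):

```python
import numpy as np

def mse_grad(w, X, y):
    """Gradient of mean squared error (1/n)||Xw - y||^2 with respect to w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(64, 5)), rng.normal(size=64), rng.normal(size=5)

# "All-reduce": each of 4 equal shards computes a local gradient, then average.
shard_grads = [mse_grad(w, Xs, ys)
               for Xs, ys in zip(np.split(X, 4), np.split(y, 4))]
avg_grad = np.mean(shard_grads, axis=0)
```

The equivalence holds exactly only for equal shard sizes and per-example-mean losses, which is one reason uneven last batches need care in real DDP setups.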

With automatic differentiation, I solicited a list of readings from my peers. Like many beginners in DL, it was not clear to me why backprop was such a big deal (if it was just the chain rule) and why it took researchers so long to come up with it. The difference between forward and backward mode enlightened me! Also, following and practicing with Karpathy's Micrograd implementation was really helpful. FYI, here is the list, from easy to involved:
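The punchline of reverse mode is that a single backward pass yields the gradient with respect to every input at once. A Micrograd-style sketch with just two ops (addition and multiplication), in the spirit of Karpathy's implementation rather than a copy of it:

```python
class Value:
    """A scalar that records how it was computed, for reverse-mode autodiff."""

    def __init__(self, data, parents=()):
        self.data, self.grad = data, 0.0
        self._parents, self._grad_fn = parents, None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():  # d(out)/d(self) = d(out)/d(other) = 1
            self.grad += out.grad
            other.grad += out.grad
        out._grad_fn = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():  # product rule: route out.grad scaled by the other factor
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._grad_fn = _backward
        return out

    def backward(self):
        # Topological order ensures each node's grad is complete before use.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            if v._grad_fn:
                v._grad_fn()
```

For z = x*y + x, one call to z.backward() fills in both dz/dx = y + 1 and dz/dy = x, which is exactly why reverse mode wins when there are many inputs and one scalar loss.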

Next Steps

I am currently working on validating my standard transformer implementation with a few more toy problems, for example sequence copying, reversing, and auto-regressive series (e.g., the Fibonacci sequence), along with other implementation checks: 16-bit training, memory profiling, and PyTorch Lightning integration. This is a prerequisite to the research direction Gabe and I have in mind: bringing more clarity to transformer-type architectures through studying image recognition. More on that later 🙂

Week 1 – Custom Logistic Regression

Hello, this is Legg. I am a Fall 2020 Scholar at OpenAI, working with my mentor Gabriel Goh on the Clarity team. Between October 2020 and April 2021, my peers and I will be sharing biweekly blog posts about our learning journey.

I tend to get questions about career transition and whether one should invest in additional schooling or industry research residencies. Therefore, before diving into any technical discussions, I want to share a few things about myself so that the readers of this blog understand where my baseline is. Hopefully, this information, in addition to the collection of upcoming technical discussions, can help the readers make better judgments for their careers.

  1. Between 2006-2017, I trained and practiced as an Architect and Urban Designer. I did not receive any quantitative or engineering training in my bachelor's and first master's degrees. (I wouldn't count Grasshopper scripting, 3D modeling, or rendering as technical for AI purposes.)
  2. Between 2017-2019, I completed a master’s degree in data science. It is a popular degree for folks who want to transition into Data Science. The courses include statistics, machine learning, and NLP with deep learning. I also took two semesters off for full-time machine learning and data science internships.
  3. Between 2019-2020, I was an AI resident at Microsoft Research. I focused on Vision-Language Navigation (VLN) and Instruction Generation. My mentors (Alex Polozov, Yonatan Bisk) and I just submitted our (my first) paper to AAAI 2021.
  4. Before my start date at OpenAI, I took about ten days to complete the Deep Learning Specialization (deeplearning.ai, Coursera) to refresh my memory and fill any knowledge gaps in the basics.

In summary, I am still very new to the field of AI. Through this scholar program, I am hoping to gain more experience in research engineering and developing research intuitions.

Back to documenting my first week at OpenAI: I took advantage of the general on-boarding period to learn a little more about PyTorch. I have been using PyTorch for 12 months, mostly extending PyTorch research codebases such as ALFRED and Room2Room. I wanted to understand Autograd and custom functions better, so I coded up logistic regression from scratch and compared it with the off-the-shelf implementation on a toy problem. More specifically, I implemented the following:

  • Linear, Sigmoid, and loss layers as custom torch.autograd.Functions
  • Gradient checking using torch.autograd.gradcheck
  • SGD optimizer as custom torch.optim.Optimizer
  • Toy dataset and dataloader using sklearn.datasets.make_classification, torch.utils.data.Dataset, and torch.utils.data.DataLoader
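The gradient check can also be done by hand with finite differences, which is essentially what torch.autograd.gradcheck automates. A NumPy sketch for the logistic-regression loss (function names are mine, not from my actual notebook):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grad(w, X, y):
    """Binary cross-entropy for logistic regression, with analytic gradient."""
    p = sigmoid(X @ w)
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    grad = X.T @ (p - y) / len(y)
    return loss, grad

def grad_check(w, X, y, eps=1e-6):
    """Max abs difference between analytic and central-difference gradients."""
    _, grad = loss_and_grad(w, X, y)
    num = np.zeros_like(w)
    for i in range(len(w)):
        wp, wm = w.copy(), w.copy()
        wp[i] += eps
        wm[i] -= eps
        num[i] = (loss_and_grad(wp, X, y)[0] - loss_and_grad(wm, X, y)[0]) / (2 * eps)
    return np.max(np.abs(num - grad))
```

A mismatch here is exactly the kind of bug that shows up as diverging losses between a custom implementation and the library default.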

The exercise took about two days. I spent most of the time reading the PyTorch documentation, coding up my implementation, and debugging when the losses of my implementation mismatched the default implementation's. The process was quite educational. The implementation is here (feel free to comment with questions/suggestions), along with links to all the references I used. Along this line of small exercises, I am interested in getting a similar toy problem to train on multiple GPUs. There are at least three ways to do this in PyTorch. More updates on that in about two weeks 🙂