Week 3 – Transformer, Distributed Training, Automatic Differentiation

TL;DR

In the past two curriculum-focused weeks, I continued to brush up on some foundational ML topics.

  • Standard Transformer: I already had a basic grasp of the standard Transformer architecture (Vaswani et al., 2017), but a few other concepts crucial to a successful implementation were still missing.
  • Distributed training: I have taken a class on scaling classic ML models before, but not specifically on distributing deep learning model training across GPUs and nodes. The mechanisms were opaque to me.
  • Automatic differentiation: This is a very foundational topic, yet my formal data science education never covered it!


Approach and Highlights

Nothing internalizes an ML/DL/Stats topic for me more than hands-on implementation and problem solving, supplemented with lots of Google searches for literature, lecture notes, documentation, and discussion threads. This learning style is reminiscent of how I often let my hands (models, drawings) guide my brain (precedent review and analysis) during the early years of my education in Architecture. Conversations with my mentors remain crucial, and I checked in with my mentor Gabe whenever I was stuck on a concept. Our discussions have been delightful.

With the Transformer, I took the model architecture diagram from the paper and refactored it into a top-down module collection. This was a relatively straightforward first step. I also asked Alethea Power (an excellent scholar from the last cohort and current engineer) for their advice, and picked up quite a few helpful tips on leveraging PyTorch specifics for module organization and on eliminating boilerplate training code with Lightning. Whenever a substantial number of modules were completed, I designed simple tests to ensure that they gave me the expected outputs. I was especially delighted by the way matrix multiplication simplifies the code and improves speed and memory efficiency (as the authors also claim) for multi-head attention; this use case could make a perfect textbook example for vectorization! For me, though, the highlight of this implementation exercise was internalizing layer norm, positional encoding, the learning rate schedule, and label smoothing. I paid extra attention to their motivation, did some light literature review, and asked my mentor to spend spare time discussing them with me (thanks, Gabe!). FYI, this collection is the standard set of resources that the scholars use to study and implement their own transformers:
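Aside from that resource list, here is a minimal sketch of the vectorized multi-head attention I have in mind: all heads are computed with batched matrix multiplications and a reshape, with no loop over heads. The class and variable names are my own illustration, not my actual module code.

```python
import math
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    """Minimal multi-head attention; all heads share one batched matmul per step."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # One projection per role; heads are split out by reshaping, not by looping.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        batch, q_len, _ = q.shape

        def split_heads(x):
            # (batch, seq, d_model) -> (batch, n_heads, seq, d_head)
            return x.view(batch, -1, self.n_heads, self.d_head).transpose(1, 2)

        q = split_heads(self.w_q(q))
        k = split_heads(self.w_k(k))
        v = split_heads(self.w_v(v))

        # Scaled dot-product attention for every head at once.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = scores.softmax(dim=-1)

        out = attn @ v                                         # (batch, n_heads, q_len, d_head)
        out = out.transpose(1, 2).reshape(batch, q_len, -1)    # concatenate the heads
        return self.w_o(out)
```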

With distributed training, my primary resource was the PyTorch documentation on the distributed package. Clarifying the difference between data-parallel and distributed data-parallel, and understanding how, which, and why tensors are exchanged between processes and devices, was a crucial first step. The primary motivations and concepts resemble how multi-node training for classic ML algorithms works in the map-reduce paradigm (thanks to the W261 course). I did two light notebook exercises to understand how CUDA manages and reports memory usage for tensors, and implemented a toy MLP problem to see how an implementation would align with the documentation for CPU training, a single GPU with a single process, and multiple GPUs with one process per device. Gabe and I also had extensive discussions on the trade-offs under different data, model, and network constraints: 1) the data is heavy (e.g. image features); 2) the model is heavy (e.g. a transformer); and 3) the network is slow (e.g. ring all-reduce's assumptions don't hold well across multiple nodes, each with several GPUs, connected over Ethernet). FYI, this is the seed list of my readings, from which I drilled down to the necessary details.
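Beyond those readings, here is roughly the shape of the multi-GPU, one-process-per-device variant of that toy MLP exercise. It is a minimal sketch following the PyTorch DistributedDataParallel tutorial, assuming a single machine with at least one CUDA GPU; the dimensions and random data are made up for illustration, not my actual notebook code.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def run(rank: int, world_size: int):
    # One process per GPU; DDP all-reduces gradients across ranks during backward().
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1)).to(rank)
    ddp_model = DDP(model, device_ids=[rank])
    opt = torch.optim.SGD(ddp_model.parameters(), lr=1e-2)
    loss_fn = nn.MSELoss()

    for _ in range(10):
        # In a real job each rank would read its own shard of the data.
        x = torch.randn(16, 32, device=rank)
        y = torch.randn(16, 1, device=rank)
        opt.zero_grad()
        loss_fn(ddp_model(x), y).backward()   # gradient all-reduce happens here
        opt.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    assert world_size >= 1, "this sketch assumes at least one CUDA device"
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)
```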

With automatic differentiation, I solicited a list of readings from my peers. Like many beginners in DL, I did not see at first why backprop is such a big deal (if it is just the chain rule) and why it took researchers so long to come up with it. Understanding the difference between forward and reverse mode is what enlightened me! Also, following and practicing with Karpathy's micrograd implementation was really helpful. FYI, here is the list, from easy to involved:
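Separate from that reading list, here is a stripped-down, micrograd-style sketch of reverse-mode autodiff on scalars. It only supports addition and multiplication and is meant to illustrate the idea, not reproduce Karpathy's code: a single reverse pass yields the derivative of the output with respect to every input, which is why reverse mode fits many-parameter, scalar-loss models so well.

```python
class Value:
    """Tiny reverse-mode autodiff on scalars, in the spirit of micrograd."""

    def __init__(self, data, _parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = _parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad            # d(out)/d(self) = 1
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(out)/d(self) = other
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topological order so a node's grad is complete before its parents use it.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()


x, w = Value(3.0), Value(-2.0)
loss = x * w + w          # loss = w * (x + 1)
loss.backward()
print(x.grad, w.grad)     # -2.0, 4.0
```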


Next Steps

I am currently validating my standard transformer implementation with a few more toy problems, such as sequence copying, sequence reversing, and auto-regressive series (e.g. the Fibonacci sequence), along with other implementation checks: 16-bit training, memory profiling, and PyTorch Lightning integration. This is a prerequisite for the research direction Gabe and I have in mind: bringing more clarity to transformer-type architectures through studying image recognition. More on that later 🙂
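As an illustration of the kind of toy problem I mean, here is a hypothetical helper for the copy task, where the target sequence is simply the source sequence; the function name and dimensions are made up for this sketch. A transformer that attends and masks correctly should reach near-perfect token accuracy on this task very quickly, so failure usually points at a masking, positional-encoding, or target-shifting bug.

```python
import torch


def make_copy_batch(batch_size=64, seq_len=10, vocab_size=11, pad_id=0):
    """Hypothetical helper: random token sequences whose target equals the source."""
    src = torch.randint(1, vocab_size, (batch_size, seq_len))  # 0 reserved for padding
    tgt = src.clone()
    return src, tgt


src, tgt = make_copy_batch()
print(src.shape, tgt.shape)  # torch.Size([64, 10]) torch.Size([64, 10])
```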
