In the past two curriculum focused weeks, I continued to brush up on some foundational ML topics.
- Standard Transformer: I had a basic grasp of the standard transformer(Vaswani et al. 2017) architecture already, but a few other concepts crucial to successful implementation were still missing.
- Distributed training: I have taken a class on scaling classic ML models before, but not specifically on distributing deep learning model training across GPUs and nodes. The mechanisms were opaque to me.
- Automatic differentiation: This is a very foundational topic, yet my formal data science education never covered it!
Approach and Highlights
Nothing internalizes an ML/DL/Stats topic for me more than hands on implementation/problem solving, supplemented with lots of google search for literature, lecture notes, documentations, and discussion threads. This learning style is reminiscent of how I often let my hands(models, drawings) guide my brain(precedent review and analysis) during the early years of my education in Architecture. Conversations with my mentors remain crucial, and I checked in with my mentor Gabe whenever I’m stuck with a concept. Our discussions have been delightful.
With the Transformer, I took the model architecture diagram from the paper and refactored it into a top-down module collection. This was a relatively straight-forward first step. I also asked Alethea Power (an excellent scholar from the last cohort and current engineer) for their advice. I learned quite a few helpful tips that leverage Pytorch specifics for module organization and eliminating boilerplate training code with Lightning. Whenever a substantial number of modules are completed, I would design simple tests to ensure that they give me the expected outputs. I was especially delighted with the way matrix multiplication simplifies the code and optimizes speed and space efficiency (which the authors also claim) for multi-head attention. This use case can make a perfect textbook example for vectorization! However, for myself, the highlight of this implementation exercise is internalizing layer-norm, positional encoding, learning rate schedule, and label smoothing. I paid extra attention to their motivation, did some light literature review, and made my mentor spend spare time to discuss them with me (thanks, Gabe!). FYI, this collection is a standard set of resources that the scholars use to study and implementation of their own transformers:
- The Annotated Transformer
- The illustrated Transformer
- Karl Stratos’s notes
- Transformers from Scratch
- Vaswani’s guest lecture at Stanford
- My first sketch of the standard transformer, runs on toydatasets
With distributed training, my primary resource is the Pytorch documentation on the distributed package. Clarifying the difference between data-parallel and distributed data-parallel, understanding how, which, and why tensors are exchanged between processes and devices was a crucial first step. The primary motivations and concepts resemble how multinode training for classic ML algorithms in the map-reduce paradigm works (thanks to the W261 course. I did two light notebook exercises to understand how CUDA manages and reports memory usage for tensors and implemented a toy MLP problem to see how an implementation would align with the documentation for CPU training, single GPU with a single process multiple GPUs with individual processes. Also, Gabe and I had extensive discussions on the different trade-offs for different data, model and network challenges, when 1) data is heavy (e.g. image features); 2) model is heavy(e.g. transformer), and 3) network is slow (e.g. all-ring-reduce’s assumptions doesn’t work well with multiple nodes each with various GPUs across the ethernet). FYI, this is the seed list of my readings, from which I drilled down to the necessary details.
- Overview of the distributed package
- How to implement training across multiple GPUs?
- My implementation and outputs of a toy MLP using the DDP magical wrapper.
- My short exercise to understand how CUDA allocated memory changes with creating tensors, deleting tensors and emptying cache.
- My short exercise to understand how forced synchronization affects execution time.
With automatic differentiation, I solicited a list of readings from my peers. Like many beginners in DL, it was not clear why backprop is such a big deal (if it was just about the chain rule) and took so long before researchers came up with it. The difference between the forward vs. backward mode enlightened me! Also, following and practicing with Karpathy’s Micrograd implementation was really helpful. FYI, here is the list, from easy to involved:
- Chis Olah’s blog on backprop
- A quick, easy to follow summary
- Micrograd toy implementation by Karpathy
- Wikipedia on automatic differentiation
- University of Toronto lecturer notes on autograd and autodiff
- Michael Nielsen’s textbook chapter on the backprop algorithm
I am currently working on validating my standard transformer implementation, with a few more toy problems. For example, sequence copying, reversing, and auto-regressive series (e.g. Fibonacci sequence). Also, other implementation checks with 16-bit training, memory profiling and torch lightning integration. This is a prerequisite to the research direction Gabe and I have in mind — bringing more clarity to the transformer type architecture through studying image recognition. More on that later 🙂