# 10-423/623/723: Generative Al Lecture 17 – Distributed Training

Matt Gormley & Aran Nayebi

10/29/25

(Slide credits: Henry Chai & Pat Virtue)

#### Reminders

- Homework 4: Multimodal Foundation Models
  - Out: Thu, Oct 23
  - Due: Mon, Nov 3 at 11:59pm
  - Please be mindful of your grace day usage!
- HW623:
  - Out: Mon, Nov 3
  - Due: Mon, Dec 1 at 11:59pm
  - Only students enrolled in 10-623 and 10-723 should complete HW623; please do not submit HW623 if you are enrolled in 10-423

10/29/25

#### Llama-1

When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on 2048 A100 GPU with 80GB of RAM. This means that training over our dataset containing 1.4T tokens takes approximately 21 days.

#### Llama-2

|         |     | Time<br>(GPU hours) | Power<br>Consumption (W) | Carbon Emitted (tCO <sub>2</sub> eq) |
|---------|-----|---------------------|--------------------------|--------------------------------------|
| Llama 2 | 7B  | 184320              | 400                      | 31.22                                |
|         | 13B | 368640              | 400                      | 62.44                                |
|         | 34B | 1038336             | 350                      | 153.90                               |
|         | 70B | 1720320             | 400                      | 291.42                               |
| Total   |     | 3311616             |                          | 539.00                               |

#### Llama-3

Compute. Llama 3 405B is trained on up to 16K H100 GPUs, each running at 700W TDP with 80GB HBM3, using Meta's Grand Teton AI server platform (Matt Bowman, 2022). Each server is equipped with eight GPUs and two CPUs. Within a server, the eight GPUs are connected via NVLink. Training jobs are scheduled

#### Recall: How much did it cost to train LLaMa?



 "the newly announced clusters both contain 24,576 Nvidia Tensor Core H100 GPUs. This is a significant increase over the original clusters, which contained 16,000 Nvidia A100 GPUs."

#### Recall: How much did it cost to train LLaMa?

## GPU Comparison

|                        | NVIDIA A100       | NVIDIA H100       | NVIDIA B200       |  |
|------------------------|-------------------|-------------------|-------------------|--|
| Memory size            | 80 GB             | 80 GB             | 192 GB            |  |
| BF16<br>Performance    | 0.6 PFLOPS        | 1.9 PFLOPS        | 4.5 PFLOPS        |  |
| Peak memory bandwidth  | 1.6 TB / second   | 1.6 TB / second   | 8 TB / second     |  |
| Inter-GPU<br>bandwidth | ~ 0.9 TB / second | ~ 0.6 TB / second | ~ 1.8 TB / second |  |
| Price                  | ~ \$10K           | ~ \$25K           | ~\$50k            |  |

 Key takeaway: inter-GPU communication is the primary bottleneck in distributed systems (and speeding it up is worth a lot of money!)

# Parallelism in LLM Training

- Goal: divide the work of training an LLM across multiple
   GPUs such that inter-GPU communication is minimized
- Good news: Transformer-based LLM architectures are highly parallelizable!
- Easily exploitable parallelism in LLM training includes:
  - Data parallelism
  - Model or tensor parallelism
  - Pipeline parallelism
  - Optimizer-based parallelism
  - Token parallelism
  - Expert parallelism

10/29/25

#### Data Parallelism

 Approach: during training, split each minibatch of data points evenly across multiple GPUs and have each GPU compute the forward and backward pass for its data points.



- The simplest, most effective form of parallelism if attainable
  - Only one inter-GPU communication per iteration

Question: What is the fundamental problem here?

## Model Parallelism

- Approach: for a batch of data points, partition the forward and backward computations within a layer across multiple GPUs
  - Also called tensor parallelism
- Transformer based architectures have two primary modules that can be parallelized

13

- 1. MLP blocks
- 2. Attention blocks

10/29/25

#### Model Parallelism: MLP Blocks



Figure 3. Blocks of Transformer with Model Parallelism. f and g are conjugate. f is an identity operator in the forward pass and all reduce in the backward pass while g is an all reduce in the forward pass and identity in the backward pass.

#### Model Parallelism: MLP Blocks

Given a batch of training data points  $X \in \mathbb{R}^{N \times D}$ , and weight  $W \in \mathbb{R}^{D \times M}$  the operations to parallelize a Linear layer the MLP blocks

- f(X) = XW in the forward pass and
- $\partial \ell / \partial X = GW^T$  in the backward pass,  $G = \partial \ell / \partial f$

#### Two ways of partitioning W across multiple GPUs

• Column-parallel:  $W = [c_1, c_2, ..., c_M]$ 

• Row-parallel: 
$$W = \begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_D \end{bmatrix}$$

# Aside: Parallel Matrix Multiplication

#### • Standard:

$$Z = XW$$

$$Z_{i,j} = \sum_{k} X_{i,k} W_{k,j}$$







#### • Split *W* by column:

$$egin{array}{ll} oldsymbol{Z}_A &= oldsymbol{X} oldsymbol{W}_A \ oldsymbol{Z}_B &= oldsymbol{X} oldsymbol{W}_B \end{array} & Z = egin{bmatrix} oldsymbol{Z}_A \ oldsymbol{Z}_B \end{bmatrix}$$















## Aside: Parallel Matrix Multiplication

#### • Standard:

$$Z = XW$$

$$Z_{i,j} = \sum_{k} X_{i,k} W_{k,j}$$







#### • Split *W* by row:

$$Z_A = X_L W_A 
Z_B = X_R W_B$$

$$Z = Z_A + Z_B$$











#### Model Parallelism: MLP Blocks

Column-parallel: 
$$W = [c_1, c_2, ..., c_M]$$

- Forward pass:  $XW = [Xc_1, ..., Xc_M]$
- Backward pass:  $[g_1, g_2, \dots, g_M]$   $\begin{bmatrix} c_1^T \\ c_2^T \\ \vdots \\ c_M^T \end{bmatrix} = \sum_{m=1}^M g_m c_m^T$  Forward pass:  $[x_1, x_2, \dots, x_D]$   $\begin{bmatrix} r_1^T \\ r_2^T \\ \vdots \\ r_D^T \end{bmatrix} = \sum_{d=1}^D x_d r_d^T$

Row-parallel: 
$$W = \begin{bmatrix} r_1^T \\ r_2^T \\ \vdots \\ r_D^T \end{bmatrix}$$

- Backward pass:  $GW^T = [Gr_1, ..., Gr_D]$

- f(X) = XW in the forward pass
- $\partial \ell / \partial X = GW^T$  in the backward pass

#### Model Parallelism: MLP Blocks

Column-parallel: 
$$W = [c_1, c_2, ..., c_M]$$

• Forward pass: 
$$XW = [Xc_1, ..., Xc_M]$$

• Backward pass: 
$$[g_1, g_2, \dots, g_M]$$
  $\begin{bmatrix} c_1^T \\ c_2^T \\ \vdots \\ c_M^T \end{bmatrix} = \sum_{m=1}^M g_m c_m^T$    
• Forward pass:  $[x_1, x_2, \dots, x_D]$   $\begin{bmatrix} r_1^T \\ r_2^T \\ \vdots \\ r_D^T \end{bmatrix} = \sum_{d=1}^D x_d r_d^T$ 

Row-parallel: 
$$W = \begin{bmatrix} r_1^T \\ r_2^T \\ \vdots \\ r_D^T \end{bmatrix}$$

• Forward pass: 
$$[x_1, x_2, \dots, x_D]$$
  $\begin{bmatrix} r_1 \\ r_2^T \\ \vdots \\ r_D^T \end{bmatrix} = \sum_{d=1}^D x_d r_d^T$ 

• Backward pass:  $GW^T = [Gr_1, ..., Gr_D]$ 

- f(X) = XW in the forward pass
- $\partial \ell / \partial X = GW^T$  in the backward pass

#### Two key observations:

The column-parallel and rowparallel forward and backward computations are identical, just reversed

Consecutive MLPs can alternate between columnparallel and row-parallel implementations to minimize the sums across GPUs!

# Model Parallelism: Attention Blocks

- Multi-headed attention blocks trivially parallelize across attention heads
- Assuming attention heads are concatenated horizontally,
   the output can be passed to a row-parallel MLP directly



#### (b) Self-Attention

Figure 3. Blocks of Transformer with Model Parallelism. f and g are conjugate. f is an identity operator in the forward pass and all reduce in the backward pass while g is an all reduce in the forward pass and identity in the backward pass.

# Model Parallelism: Attention Blocks



Figure 4. Communication operations in a transformer layer. There are 4 total communication operations in the forward and backward pass of a single model parallel transformer layer.

- The previous examples of tensor parallelism also implicitly leverage *pipeline parallelism*, where different layers or modules are put on different GPUs
- Issue: the computation across layers is inherently sequential, not parallel!





 Naïve implementation has a ton of idle GPU time!

- The previous examples of tensor parallelism also implicitly leverage *pipeline parallelism*, where different layers or modules are put on different GPUs
- Issue: the computation across layers is inherently sequential, not parallel!





Idea: work on multiple microbatches concurrently!

- The previous examples of tensor parallelism also implicitly leverage *pipeline parallelism*, where different layers or modules are put on different GPUs
- Issue: the computation across layers is inherently sequential, not parallel!





Idea: work on multiple
 microbatches concurrently!



• The schedule above-right is the one-forward-one-backward (1F1B) mechanism: in the steady state, each GPU alternates between one forward pass and one backward pass



- Issue: if weights are updated in every backward pass, the weights used to compute the forward pass for a microbatch can be different from the weights used to compute the backward pass
  - The divergence is worse for earlier GPUs in the pipeline
  - Can lead to poor model convergence/optimization
- Solution: weight stashing after every forward pass, store the weights and reload them for the corresponding backward pass



- Indicates a pipeline flush between minibatches; the sync is required to compute the parameters for the next minibatch's forward pass
- The 1F1B is still highly efficient even when forward and backward passes take different amounts of compute (as is typically the case)

- Another interesting question that arises with pipeline parallelism: how should we partition the layers across GPUs?
- Solution: profile the code and apply dynamic programming!



- Another interesting question that arises with pipeline parallelism: how should we partition the layers across GPUs?
- Solution: profile the code and apply dynamic programming!

Let A(j, m) denote the time taken by the slowest stage in the optimal pipeline between layers 1 and j using mmachines. The goal of our algorithm is to find A(N, M), and the corresponding partitioning. Let  $T(i \rightarrow j, m)$ denote the time taken by a single stage spanning layers ithrough j, replicated over m machines.

$$T(i \to j, m) = \frac{1}{m} \max \left( \sum_{l=i}^{j} T_{l}, \sum_{l=i}^{j} \underline{W_{l}^{m}} \right)$$

where the left term inside the max is the total computation time for all the layers in the stage, and the right term is the total communication time for all the layers in the stage.

The optimal pipeline consisting of layers from 1 through j using m machines could either be a single stage replicated m times, or be composed of multiple stages.

Case 1: The optimal pipeline contains only one stage, replicated *m* times. In this case,

$$A(j,m) = T(1 \to j,m)$$

Case 2: The optimal pipeline contains more than one stage. In this case, it can be broken into an optimal subpipeline consisting of layers from 1 through i with m-m' machines followed by a single stage with layers i+1 through j replicated over m' machines. Then, using the optimal sub-problem property, we have

$$A(j, m) = \min_{1 \le i < j} \min_{1 \le m' < m} \max \begin{cases} A(i, m - m') \\ 2 \cdot C_i \\ T(i + 1 \rightarrow j, m') \end{cases}$$

#### The story so far:

Ultimately, data parallelism is still king and tensor / pipeline parallelism are used to support pushing more data through the forward and backward passes



10/29/25 Figure courtesy of Yuanzhi Li

- If we only need to compute/store the parameters and gradients, then data + model + pipeline parallelism (typically) suffices
  - E.g., for mini-batch SGD, the update only requires the gradients and the current weights:

38

$$W^{(t+1)} \leftarrow W^{(t)} - \gamma \nabla \ell^{(B)} (W^{(t)})$$

 However, many advanced optimization algorithms require storing additional intermediate or *state variables* to perform the parameter update

10/29/25

# Adam (Adaptive Moment Estimation)

 High-level intuition: Adam combines SGD with momentum (memory of previous gradient steps) and RMSProp (scaled step sizes based on previous gradients)

$$W^{(t+1)} \leftarrow W^{(t)} - \frac{\gamma}{\sqrt{S_t/(1-\beta_2^t)}} \odot \left(\frac{M_t}{1-\beta_1^t}\right)$$

where

• 
$$M_t = \beta_1 M_{t-1} + (1 - \beta_1) \nabla \ell^{(B)} (W^{(t)})$$

• 
$$S_t = \beta_2 S_{t-1} + (1 - \beta_2) \left( \nabla \ell^{(B)} (W^{(t)}) \odot \nabla \ell^{(B)} (W^{(t)}) \right)$$

- $\beta_1$  and  $\beta_2$  are *decay* parameters that dictate how much  $M_t$  and  $S_t$  are defined by previous time steps
  - $M_{-1}$  and  $S_{-1}$  are initialized to matrices of all zeros

# Adam (Adaptive Moment Estimation)

 High-level intuition: Adam combines SGD with momentum (memory of previous gradient steps) and RMSProp (scaled step sizes based on previous gradients)

$$W^{(t+1)} \leftarrow W^{(t)} - \frac{\gamma}{\sqrt{S_t/(1-\beta_2^t)}} \odot \left(\frac{M_t}{1-\beta_1^t}\right)$$

where

• 
$$M_t = \beta_1 M_{t-1} + (1 - \beta_1) \nabla \ell^{(B)} (W^{(t)})$$

• 
$$S_t = \beta_2 S_{t-1} + (1 - \beta_2) \left( \nabla \ell^{(B)} (W^{(t)}) \odot \nabla \ell^{(B)} (W^{(t)}) \right)$$

• Mixed-precision training:  $M_t$  and  $S_t$  are the same dimensionality as  $W^{(t)}$  and are typically stored in FP32 (instead of FP16) because of the squared term in  $S_t$ 



• For Adam, K=12=3\*4 because each parameter in  $\nabla l(W^{(t)})$ ,  $M_t$ , and  $S_t$  is stored in FP32 = 4 bytes per parameter

Parameters — Gradients — Optimizer States

Figure 1: Comparing the per-device memory consumption of model states, with three stages of ZeRO-DP optimizations.  $\Psi$  denotes model size (number of parameters), K denotes the memory multiplier of optimizer states, and  $N_d$  denotes DP degree. In the example, we assume a model size of  $\Psi = 7.5B$  and DP of  $N_d = 64$  with K = 12 based on mixed-precision training with Adam optimizer.



Figure 1: Comparing the per-device memory consumption of model states, with three stages of ZeRO-DP optimizations.  $\Psi$  denotes model size (number of parameters), K denotes the memory multiplier of optimizer states, and  $N_d$  denotes DP degree. In the example, we assume a model size of  $\Psi = 7.5B$  and DP of  $N_d = 64$  with K = 12 based on mixed-precision training with Adam optimizer.

(ZeRO-DP = Zero redundancy optimizer powered data parallelism)

| DP   | <b>7.5B Model</b> (GB) |            | <b>128B Model</b> (GB) |          |            | 1T Model (GB) |          |            |                                                                 |
|------|------------------------|------------|------------------------|----------|------------|---------------|----------|------------|-----------------------------------------------------------------|
|      | $P_{os}$               | $P_{os+g}$ | $P_{os+g+p}$           | $P_{os}$ | $P_{os+g}$ | $P_{os+g+p}$  | $P_{os}$ | $P_{os+g}$ | $\left[\begin{array}{c} \mathrm{P}_{os+g+p} \end{array}\right]$ |
| 1    | 120                    | 120        | 120                    | 2048     | 2048       | 2048          | 16000    | 16000      | 16000                                                           |
| 4    | 52.5                   | 41.3       | 30                     | 896      | 704        | 512           | 7000     | 5500       | 4000                                                            |
| 16   | 35.6                   | 21.6       | 7.5                    | 608      | 368        | 128           | 4750     | 2875       | 1000                                                            |
| 64   | 31.4                   | 16.6       | 1.88                   | 536      | 284        | 32            | 4187     | 2218       | 250                                                             |
| 256  | 30.4                   | 15.4       | 0.47                   | 518      | 263        | 8             | 4046     | 2054       | 62.5                                                            |
| 1024 | 30.1                   | 15.1       | 0.12                   | 513      | 257        | 2             | 4011     | 2013       | <b>15.6</b>                                                     |

Table 1: Per-device memory consumption of different optimizations in ZeRO-DP as a function of DP degree . Bold-faced text are the combinations for which the model can fit into a cluster of 32GB V100 GPUs.

#### 5.4 Implication on Model Size

The three phases of partitioning  $P_{os}$ ,  $P_{os+g}$ , and  $P_{os+g+p}$  reduces the memory consumption of each data parallel process on model states by up to 4x, 8x, and  $N_d$  respectively. Table 1 analyzes model-state memory consumption of a few example models under the 3 stages of ZeRO-DP optimizations for varying DP degree. Without ZeRO, the memory consumption is equal to the first row in the table, regardless of the DP degree. Note that, with  $N_d = 64$ , ZeRO can train models with up to 7.5B, 14B, and 128B parameters using  $P_{os}$ ,  $P_{os+g}$ , and  $P_{os+g+p}$ , respectively. When  $N_d = 1024$ , ZeRO with all of its optimizations enabled ( $P_{os+g+p}$ ) could train models with 1 Trillion parameters! Or potentially, models with Arbitrary size! Without ZeRO, the largest model DP alone can run has less than 1.5 Billion parameters.



The three phases of partitioning  $P_{os}$ ,  $P_{os+g}$ , and  $P_{os+g+p}$  reduces the memory consumption of each data parallel process on model states by up to 4x, 8x, and  $N_d$  respectively. Table 1 analyzes model-state memory consumption of a few example models under the 3 stages of ZeRO-DP optimizations for varying DP degree. Without ZeRO, the memory consumption is equal to the first row in the table, regardless of the DP degree. Note that, with  $N_d = 64$ , ZeRO can train models with up to 7.5B, 14B, and 128B parameters using  $P_{os}$ ,  $P_{os+g}$ , and  $P_{os+g+p}$ , respectively. When  $N_d = 1024$ , ZeRO with all of its optimizations enabled ( $P_{os+g+p}$ ) could train models with 1 Trillion parameters! Or potentially, models with Arbitrary size! Without ZeRO, the largest model DP alone can run has less than 1.5 Billion parameters.

# Recall: How large are LLMs?

| Model                  | Creators | Year of release | Training Data<br>(# tokens) | Model Size<br>(# parameters) |
|------------------------|----------|-----------------|-----------------------------|------------------------------|
| GPT-2                  | OpenAl   | 2019            | ~10 billion (40Gb)          | 1.5 billion                  |
| GPT-3<br>(cf. ChatGPT) | OpenAl   | 2020            | 300 billion                 | 175 billion                  |
| PaLM                   | Google   | 2022            | 780 billion                 | 540 billion                  |
| Chinchilla             | DeepMind | 2022            | 1.4 trillion                | 70 billion                   |
| LaMDA<br>(cf. Bard)    | Google   | 2022            | 1.56 trillion               | 137 billion                  |
| LLaMA                  | Meta     | 2023            | 1.4 trillion                | 65 billion                   |
| LLaMA-2                | Meta     | 2023            | 2 trillion                  | 70 billion                   |
| GPT-4                  | OpenAl   | 2023            | ?                           | ? (1.76 trillion)            |
| Gemini (Ultra)         | Google   | 2023            | ?                           | ? (1.5 trillion)             |
| LLaMA-3                | Meta     | 2024            | 15 trillion                 | 405 billion                  |

10/29/25

# Parallelism in LLM Training

- Goal: divide the work of training an LLM across multiple
   GPUs such that inter-GPU communication is minimized
- Good news: Transformer-based LLM architectures are highly parallelizable!
- Easily exploitable parallelism in LLM training includes:
  - Data parallelism
  - Model or tensor parallelism
  - Pipeline parallelism
  - Optimizer-based parallelism
  - Token parallelism
  - Expert parallelism

10/29/25

## Token Parallelism

- Insight: in a transformer LM with causal attention, the computation for each *token* in some transformer block only depends on the previous tokens at that layer
- Idea: instead of waiting for the entire previous layer to finish, start working on token t in layer l as soon as tokens 1 through t-1 in layer l-1 are done



(a) Transformer-based LM



(b) Operation partitioning (Megatron-LM)



(c) Microbatch-based pipeline parallelism (GPipe)



(d) Token-based pipeline parallelism (TeraPipe)

## Token Parallelism

- Insight: in a transformer LM, the computation for each token in some transformer block only depends on the previous tokens at that layer
- Idea: instead of waiting for the entire previous layer to finish, start working on token t in layer l as soon as tokens 1 through t-1 in layer l-1 are done

- Intuition: increasing pipeline granularity reduces idle time in the pipeline!
  - Efficiency gains scale with sequence lengths and models have been moving towards longer contexts

## Token Parallelism



 Note: this figures shows an all-forwardall-backward mechanism, which eliminates the need for weight stashing but introduces more idle time

- Intuition: increasing pipeline granularity reduces idle time in the pipeline!
  - Efficiency gains scale with sequence lengths and models have been moving towards longer contexts

# Expert Parallelism

Approach: just put each expert on a different GPU!



- Key consideration: to enforce a balanced load across GPUs, each expert is assigned a capacity
  - If an expert is assigned more tokens than its capacity, those tokens are just passed directly to the next layer via residual connections

# Expert Parallelism

Approach: just put each expert on a different GPU!



If the capacity factor is too large,
 some GPUs will likely be underutilized

- Key consideration: to
   enforce a balanced load
   across GPUs, each expert
   is assigned a capacity
  - If an expert is assigned more tokens than its capacity, those tokens are just passed directly to the next layer via residual connections