[Feature] Hide 75% of the communication in tensor parallelism using DoMiNo #292
base: main
Conversation
…_mbs2_and_gbs_300k_and_input_splitting_and_commit_23f2_but_remove_call_is_async_comm_twice_and_keep_not_async_bwd.layer_mlp_1__and_bwd.layer_attn_0
- execute backward comm in a separate stream
- make the comm stream in the backward pass wait for the compute stream before running backward comm
- make WaitComm's compute stream wait for the comm stream
(a minimal sketch of this stream ordering follows the commit list below)
…omm, and remove torch.cuda.synchronize() in WaitComm
…e_cuda_syncronize_in_wait_comm_bwd_and_add_comm_syncronize_in_waitcomm_and_commit_543ef56
…x_stream_not_sync_exp2a1c7_and_commit_23f2_and_75_percent_bwd_overlapping_with_cuda_stream_sync_bwd
…eturning it directly in linear modules
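For readers following along, here is a minimal, hypothetical sketch of the stream ordering described in the commits above. The function names and the module-level comm stream are assumptions for illustration, not the PR's actual code:

import torch
import torch.distributed as dist

compute_stream = torch.cuda.current_stream()
comm_stream = torch.cuda.Stream()

def launch_backward_comm(grad: torch.Tensor) -> dist.Work:
    # The comm stream waits for the compute stream so the gradient is fully
    # produced before the all-reduce is enqueued.
    comm_stream.wait_stream(compute_stream)
    with torch.cuda.stream(comm_stream):
        return dist.all_reduce(grad, async_op=True)

def wait_comm() -> None:
    # WaitComm: the compute stream waits for the comm stream instead of a
    # full torch.cuda.synchronize(), so only the dependent work is blocked.
    compute_stream.wait_stream(comm_stream)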
hidden_states0 = self.input_layernorm(hidden_states0)
hidden_states1 = self.input_layernorm(hidden_states1)
Following up on #285 (comment):
I think we still need to add a TODO comment here, because ideally we want to interleave (overlap) this layernorm with some other op (either the following fwd, the backward, or both). A toy version of this overlap is sketched below.
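For illustration only, a self-contained toy of the overlap pattern the TODO would point at. This is a sketch under assumptions, not the PR's implementation; the helper name and the plain all-reduce on the layernorm output are made up:

import torch
import torch.distributed as dist
from torch import nn

def overlapped_layernorm_toy(ln: nn.LayerNorm, x0: torch.Tensor, x1: torch.Tensor):
    # Split 0: compute, then kick off its all-reduce asynchronously.
    y0 = ln(x0)
    handle = dist.all_reduce(y0, async_op=True)
    # Split 1's layernorm does not depend on y0's all-reduce, so it runs
    # while that communication is in flight (this is what gets overlapped).
    y1 = ln(x1)
    # Only wait right before y0 is actually consumed.
    handle.wait()
    return y0, y1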
It would also be nice to add more comments in this Domino class about what is overlapped (either at the top of the fwd, or before each op being overlapped).
src/nanotron/models/llama.py
@@ -687,51 +701,39 @@ def forward(
attention_output = (
    attention_output.contiguous().view(batch_size, q_length, self.n_local_q_heads * self.d_v).transpose(0, 1)
)
# output, work = self.o_proj(attention_output, op_name=op_name)
Please clean up this commented-out line.
    self.tp_linear_async_communication is False
), "Domino requires TP linear async communication to be False"
# TODO: support REDUCE_SCATTER mode for Domino
assert self.tp_mode == TensorParallelLinearMode.ALL_REDUCE, "Domino requires TP mode to be ALL_REDUCE"
Please add a new ticket in the tracker to add support for REDUCE_SCATTER.
from torch import nn
from torch.nn.parallel import DistributedDataParallel
This is unrelated to this PR, right? I'm refactoring this engine, so I can take care of this change.
BWD_ATTN_OP_NAME = "bwd.layer_attn_{}_batch_{}"
BWD_MLP_OP_NAME = "bwd.layer_mlp_{}_batch_{}"

_operation_context = threading.local()
is this necessary?
Re BWD_ATTN_OP_NAME: because we reference these names in many places in the code, I want to keep them consistent, so if we change a name we don't have to manually replace it everywhere else. A small illustration of how the templates are used is below.
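For context, here is how the templates are meant to be formatted; the layer and micro-batch indices below are made-up examples:

BWD_ATTN_OP_NAME = "bwd.layer_attn_{}_batch_{}"
BWD_MLP_OP_NAME = "bwd.layer_mlp_{}_batch_{}"

# Every call site formats the same template, so renaming an op only means
# editing the constant, not hunting for hard-coded strings.
attn_op = BWD_ATTN_OP_NAME.format(2, 1)  # "bwd.layer_attn_2_batch_1"
mlp_op = BWD_MLP_OP_NAME.format(2, 0)    # "bwd.layer_mlp_2_batch_0"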
performs all-reduce asynchronously in tensor parallelism
"""
NON_ASYNC_HANDLE_IDX = [
    # "fwd.layer_mlp_{}_batch_1",
cleanup?
""" | ||
Determine whether a module (e.g., mlp, attention) | ||
performs all-reduce asynchronously in tensor parallelism | ||
""" |
Please continue the description of this function: how do we determine it, and what do we check? A guess at what the check might look like is sketched below.
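As a guess at the answer (the entries and the matching rule here are assumptions, not necessarily the PR's actual logic), the check presumably compares the op name against the NON_ASYNC_HANDLE_IDX templates and treats everything else as asynchronous:

import re

# Illustrative entries only; the PR's real list may differ.
NON_ASYNC_HANDLE_IDX = [
    "fwd.layer_mlp_{}_batch_1",
    "bwd.layer_attn_{}_batch_0",
]

def is_domino_async_comm(op_name: str) -> bool:
    """Return True if this op's all-reduce is expected to run asynchronously."""
    # Turn each template into a regex where "{}" matches any layer index.
    patterns = [re.escape(tmpl).replace(re.escape("{}"), r"\d+") for tmpl in NON_ASYNC_HANDLE_IDX]
    is_non_async = any(re.fullmatch(p, op_name) for p in patterns)
    return not is_non_async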
class AsyncCommBucket:
    """
    Store asynchronous communication operations.
    """

    def __init__(self):
        self._async_op: Dict[int, "dist.Work"] = {}
        self._copy_async_op: Dict[int, "dist.Work"] = {}

    def add(self, op_name: int, work: "dist.Work"):
        assert op_name not in self._async_op, f"Operation with name: {op_name} already exists"
        assert work is not None
        self._async_op[op_name] = work
        self._copy_async_op[op_name] = work
are we sure we don't have an equivalent of this class in torch? o.O
not_finished = []
for k, v in self._copy_async_op.items():
    assert is_domino_async_comm(k) is True, f"Operation with name {k} wasn't executed asynchronously!"
I don't like the mention of Domino here; this CommBucket should be independent of Domino. One possible decoupling is sketched below.
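One possible way to decouple them (a sketch under assumptions, not the PR's code): inject the "is this op async?" predicate into the bucket instead of importing is_domino_async_comm inside it:

from typing import Callable, Dict

import torch.distributed as dist

class AsyncCommBucket:
    """Stores asynchronous communication handles; knows nothing about Domino."""

    def __init__(self, is_async_op: Callable[[str], bool] = lambda name: True):
        self._is_async_op = is_async_op
        self._async_op: Dict[str, "dist.Work"] = {}
        self._copy_async_op: Dict[str, "dist.Work"] = {}

    def add(self, op_name: str, work: "dist.Work"):
        assert op_name not in self._async_op, f"Operation with name: {op_name} already exists"
        assert work is not None
        self._async_op[op_name] = work
        self._copy_async_op[op_name] = work

    def assert_all_async(self):
        for op_name in self._copy_async_op:
            assert self._is_async_op(op_name), f"Operation with name {op_name} wasn't executed asynchronously!"

# Domino then wires in its own policy at construction time:
# bucket = AsyncCommBucket(is_async_op=is_domino_async_comm)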
Reproducing the paper "Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping" (https://arxiv.org/abs/2409.15241).
The losses match after 20B tokens with a 2M batch size, 20k steps, and the FineWeb dataset, with 75% of the tensor-parallel communication hidden.
This first PR is ready for review (I split the work into two PRs); some work is left for the next PR.
Profiling results:
/fsx/phuc/new_workspace/experiments/nanotron_domino/profilings/exp7a11_like_exp7a6_but_remove_fwd_pass_cuda_syncronization_and_remove_cuda_syncronize_in_wait_comm_bwd_and_add_comm_syncronize_in_waitcomm_and_remove_explicite_async_op_arg_and_commit_600f01/20250228-160428/ip-26-0-161-142_51797.1740758749919300440.pt.trace.json