BioNeMo Framework Version

c61ef42

Bug Description

ESM2 training crashes when it resumes from a checkpoint.
Steps to Reproduce

1. Train for 20 steps:

python sub-packages/bionemo-esm2/src/bionemo/esm2/scripts/train_esm2.py --train-cluster-path=/data/train_clusters.parquet --train-database-path=/data/train.db --valid-cluster-path=/data/valid_clusters.parquet --valid-database-path=/data/validation.db --micro-batch-size=16 --num-nodes=1 --num-gpus=1 --limit-val-batches=1 --min-seq-length=1024 --max-seq-length=1024 --num-layers=33 --hidden-size=1280 --num-attention-heads=20 --ffn-hidden-size=5120 --create-tensorboard-logger --wandb-project=fix-tensorboard-logs --val-check-interval 10 --wandb-group main --num-steps=20 --resume-if-exists

2. Rerun the identical command with --num-steps=30; because of --resume-if-exists it restores the step-20 checkpoint, and the resumed run crashes:

python sub-packages/bionemo-esm2/src/bionemo/esm2/scripts/train_esm2.py --train-cluster-path=/data/train_clusters.parquet --train-database-path=/data/train.db --valid-cluster-path=/data/valid_clusters.parquet --valid-database-path=/data/validation.db --micro-batch-size=16 --num-nodes=1 --num-gpus=1 --limit-val-batches=1 --min-seq-length=1024 --max-seq-length=1024 --num-layers=33 --hidden-size=1280 --num-attention-heads=20 --ffn-hidden-size=5120 --create-tensorboard-logger --wandb-project=fix-tensorboard-logs --val-check-interval 10 --wandb-group main --num-steps=30 --resume-if-exists
Error Messages and Logs

The resumed run fails while loading the distributed checkpoint:

megatron.core.dist_checkpointing.core.CheckpointingException: Cannot find global shape metadata for N-D flattened tensor ShardedTensor(key='optimizer.state.exp_avg.module.lm_head.dense.weight', dtype=torch.float32, local_shape=(1280, 1280), global_shape=(1280, 1280), global_offset=(0, 0), axis_fragmentations=(1, 1), replica_id=(0, 0, 0), prepend_axis_num=0, allow_shape_mismatch=False, flattened_range=slice(0, 1638400, None)) in checkpoint metadata: {'module.embedding.word_embeddings.weight': {}, 'module.encoder.layers.self_attention.linear_proj.weight': {}, 'module.encoder.layers.self_attention.linear_proj.bias': {}, 'module.encoder.layers.self_attention.linear_qkv.layer_norm_weight': {}, 'module.encoder.layers.self_attention.linear_qkv.layer_norm_bias': {}, 'module.encoder.layers.self_attention.linear_qkv.weight': {}, 'module.encoder.layers.self_attention.linear_qkv.bias': {}, 'module.encoder.layers.mlp.linear_fc1.layer_norm_weight': {}, 'module.encoder.layers.mlp.linear_fc1.layer_norm_bias': {}, 'module.encoder.layers.mlp.linear_fc1.weight': {}, 'module.encoder.layers.mlp.linear_fc1.bias': {}, 'module.encoder.layers.mlp.linear_fc2.weight': {}, 'module.encoder.layers.mlp.linear_fc2.bias': {}, 'module.encoder.final_layernorm.weight': {}, 'module.encoder.final_layernorm.bias': {}, 'module.lm_head.dense.weight': {}, 'module.lm_head.dense.bias': {}, 'module.lm_head.layer_norm.weight': {}, 'module.lm_head.layer_norm.bias': {}, 'module.output_layer.bias': {}}
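Every entry in the metadata dict printed above is empty ({}), suggesting the checkpoint was written without per-tensor global-shape information for the flattened optimizer state. A minimal diagnostic sketch to confirm what shape metadata was actually persisted, assuming the checkpoint uses the torch_dist format; the checkpoint path below is a made-up placeholder, not a path from this run:

from torch.distributed.checkpoint import FileSystemReader

# Hypothetical path; point this at the weights/ directory of the step-20
# checkpoint produced by the first run.
ckpt_dir = "results/esm2/checkpoints/step-20/weights"

reader = FileSystemReader(ckpt_dir)
metadata = reader.read_metadata()

# Print what shape information (if any) was persisted for the tensors
# named in the exception.
for key, md in metadata.state_dict_metadata.items():
    if "lm_head" in key:
        print(key, getattr(md, "size", None))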
Docker Image

No response
System Information

Environment Details:
GPU Details:
Additional Context

Full resume log attached: log_resume_training_esm2_c61ef42b0bff5efc79b3f873c71f39893921f7a9.txt
This looks like checkpoint corruption. As @jstjohn indicated, it may be worth checking whether fixes like https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/models/gpt/gpt_model.py#L331-L355 or https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/gpt/model/megatron/hyena/hyena_utils.py#L1145-L1165 address this issue.
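The fixes linked above live in code that customizes how a module's tensors are registered for distributed checkpointing. A minimal sketch of that general pattern, i.e. giving the affected module an explicit sharded_state_dict() so each tensor is saved as a ShardedTensor carrying its full global shape; the class name and sizes are illustrative assumptions, not the actual BioNeMo ESM2 code:

import torch
from megatron.core.transformer.utils import make_sharded_tensors_for_checkpoint


class LMHeadWithShardedStateDict(torch.nn.Module):
    # Hypothetical stand-in for the module that owns lm_head.dense; a real
    # fix would go on the corresponding BioNeMo/NeMo module instead.
    def __init__(self, hidden_size: int = 1280):
        super().__init__()
        self.dense = torch.nn.Linear(hidden_size, hidden_size)

    def sharded_state_dict(self, prefix="", sharded_offsets=(), metadata=None):
        state_dict = self.state_dict(prefix="", keep_vars=True)
        # No tensor-parallel axis map is passed: dense is treated as
        # replicated, so every tensor becomes a non-sharded ShardedTensor
        # that still records its global shape, which is the metadata the
        # resume path fails to find in this bug.
        return make_sharded_tensors_for_checkpoint(
            state_dict, prefix, sharded_offsets=sharded_offsets
        )

Whether this is the right shape for the ESM2 lm_head is something the linked Megatron-LM and NeMo changes would need to be checked against; the sketch only illustrates the mechanism they rely on.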