[BUG] ESM2 training does not resume from checkpoint using train_esm2 #757

Open
dorotat-nv opened this issue Mar 14, 2025 · 1 comment
Labels: bug, ESM2

Comments

@dorotat-nv
Collaborator

BioNeMo Framework Version

c61ef42

Bug Description

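ESM2 training with train_esm2.py fails when it resumes from an existing checkpoint. The second run raises a megatron.core.dist_checkpointing CheckpointingException (full message under Error Messages and Logs).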

Steps to Reproduce

  1. Run the first training with --num-steps=20 and --val-check-interval 10:
python sub-packages/bionemo-esm2/src/bionemo/esm2/scripts/train_esm2.py --train-cluster-path=/data/train_clusters.parquet --train-database-path=/data/train.db --valid-cluster-path=/data/valid_clusters.parquet --valid-database-path=/data/validation.db --micro-batch-size=16 --num-nodes=1 --num-gpus=1  --limit-val-batches=1  --min-seq-length=1024 --max-seq-length=1024 --num-layers=33 --hidden-size=1280 --num-attention-heads=20 --ffn-hidden-size=5120 --create-tensorboard-logger --wandb-project=fix-tensorboard-logs --val-check-interval 10 --wandb-group main --num-steps=20 --resume-if-exists
  2. Run the same command again with --num-steps=30 so that training resumes from the first run's checkpoint:
python sub-packages/bionemo-esm2/src/bionemo/esm2/scripts/train_esm2.py --train-cluster-path=/data/train_clusters.parquet --train-database-path=/data/train.db --valid-cluster-path=/data/valid_clusters.parquet --valid-database-path=/data/validation.db --micro-batch-size=16 --num-nodes=1 --num-gpus=1 --limit-val-batches=1  --min-seq-length=1024 --max-seq-length=1024 --num-layers=33 --hidden-size=1280 --num-attention-heads=20 --ffn-hidden-size=5120 --create-tensorboard-logger --wandb-project=fix-tensorboard-logs --val-check-interval 10 --wandb-group main --num-steps=30 --resume-if-exists
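The two commands above differ only in the value of --num-steps. A minimal sketch of a helper that builds both invocations from one shared flag list makes that explicit. The names `build_command` and `reproduce` are hypothetical and not part of BioNeMo; the flags are copied verbatim from the commands above, and the sketch assumes it is run from the repository root with the same /data paths available.

```python
import shlex
import subprocess

SCRIPT = "sub-packages/bionemo-esm2/src/bionemo/esm2/scripts/train_esm2.py"

# Flags shared by both runs, copied from the commands in this report.
COMMON_FLAGS = [
    "--train-cluster-path=/data/train_clusters.parquet",
    "--train-database-path=/data/train.db",
    "--valid-cluster-path=/data/valid_clusters.parquet",
    "--valid-database-path=/data/validation.db",
    "--micro-batch-size=16",
    "--num-nodes=1",
    "--num-gpus=1",
    "--limit-val-batches=1",
    "--min-seq-length=1024",
    "--max-seq-length=1024",
    "--num-layers=33",
    "--hidden-size=1280",
    "--num-attention-heads=20",
    "--ffn-hidden-size=5120",
    "--create-tensorboard-logger",
    "--wandb-project=fix-tensorboard-logs",
    "--val-check-interval", "10",
    "--wandb-group", "main",
]


def build_command(num_steps: int) -> list[str]:
    """Return the argv for one training run (hypothetical helper)."""
    return ["python", SCRIPT, *COMMON_FLAGS,
            f"--num-steps={num_steps}", "--resume-if-exists"]


def reproduce() -> None:
    """Run 20 steps, then rerun with 30 steps so the second run resumes."""
    for num_steps in (20, 30):
        cmd = build_command(num_steps)
        print(shlex.join(cmd))
        # check=True stops at the first failing run; per this report the
        # second run is the one expected to fail.
        subprocess.run(cmd, check=True)
```

Calling `reproduce()` launches both runs in order; nothing is executed on import.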

Error Messages and Logs

megatron.core.dist_checkpointing.core.CheckpointingException: Cannot find global shape metadata for N-D flattened tensor ShardedTensor(key='optimizer.state.exp_avg.module.lm_head.dense.weight', dtype=torch.float32, local_shape=(1280, 1280), global_shape=(1280, 1280), global_offset=(0, 0), axis_fragmentations=(1, 1), replica_id=(0, 0, 0), prepend_axis_num=0, allow_shape_mismatch=False, flattened_range=slice(0, 1638400, None)) in checkpoint metadata: {'module.embedding.word_embeddings.weight': {}, 'module.encoder.layers.self_attention.linear_proj.weight': {}, 'module.encoder.layers.self_attention.linear_proj.bias': {}, 'module.encoder.layers.self_attention.linear_qkv.layer_norm_weight': {}, 'module.encoder.layers.self_attention.linear_qkv.layer_norm_bias': {}, 'module.encoder.layers.self_attention.linear_qkv.weight': {}, 'module.encoder.layers.self_attention.linear_qkv.bias': {}, 'module.encoder.layers.mlp.linear_fc1.layer_norm_weight': {}, 'module.encoder.layers.mlp.linear_fc1.layer_norm_bias': {}, 'module.encoder.layers.mlp.linear_fc1.weight': {}, 'module.encoder.layers.mlp.linear_fc1.bias': {}, 'module.encoder.layers.mlp.linear_fc2.weight': {}, 'module.encoder.layers.mlp.linear_fc2.bias': {}, 'module.encoder.final_layernorm.weight': {}, 'module.encoder.final_layernorm.bias': {}, 'module.lm_head.dense.weight': {}, 'module.lm_head.dense.bias': {}, 'module.lm_head.layer_norm.weight': {}, 'module.lm_head.layer_norm.bias': {}, 'module.output_layer.bias': {}}
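The message above is hard to scan, so here is a small sketch that only restates what it contains: the tensor being looked up is an optimizer-state entry, while every key listed in the printed checkpoint metadata starts with `module.`. This is a reading of the log, not a diagnosis of the root cause, and it assumes the dictionary printed in the message is the complete metadata that was consulted. The helper name `top_level_prefixes` is hypothetical.

```python
from collections import Counter

# Key of the ShardedTensor that could not be resolved (from the error message).
missing_key = "optimizer.state.exp_avg.module.lm_head.dense.weight"

# Keys of the checkpoint metadata dict printed in the same message.
metadata_keys = [
    "module.embedding.word_embeddings.weight",
    "module.encoder.layers.self_attention.linear_proj.weight",
    "module.encoder.layers.self_attention.linear_proj.bias",
    "module.encoder.layers.self_attention.linear_qkv.layer_norm_weight",
    "module.encoder.layers.self_attention.linear_qkv.layer_norm_bias",
    "module.encoder.layers.self_attention.linear_qkv.weight",
    "module.encoder.layers.self_attention.linear_qkv.bias",
    "module.encoder.layers.mlp.linear_fc1.layer_norm_weight",
    "module.encoder.layers.mlp.linear_fc1.layer_norm_bias",
    "module.encoder.layers.mlp.linear_fc1.weight",
    "module.encoder.layers.mlp.linear_fc1.bias",
    "module.encoder.layers.mlp.linear_fc2.weight",
    "module.encoder.layers.mlp.linear_fc2.bias",
    "module.encoder.final_layernorm.weight",
    "module.encoder.final_layernorm.bias",
    "module.lm_head.dense.weight",
    "module.lm_head.dense.bias",
    "module.lm_head.layer_norm.weight",
    "module.lm_head.layer_norm.bias",
    "module.output_layer.bias",
]


def top_level_prefixes(keys):
    """Count keys by their first dotted component (hypothetical helper)."""
    return Counter(key.split(".", 1)[0] for key in keys)


prefixes = top_level_prefixes(metadata_keys)

# The model parameter that the optimizer-state entry refers to.
model_key = missing_key.removeprefix("optimizer.state.exp_avg.")

# local_shape=(1280, 1280) and flattened_range=slice(0, 1638400, None) in the
# message describe the same number of elements.
local_shape = (1280, 1280)
flattened_stop = 1638400
covers_whole_tensor = local_shape[0] * local_shape[1] == flattened_stop

print(dict(prefixes))
print(missing_key in metadata_keys, model_key in metadata_keys)
print(covers_whole_tensor)
```

Under that assumption, the model weight `module.lm_head.dense.weight` appears in the metadata but no `optimizer.*` entry does, and the flattened range spans the full 1280 x 1280 tensor.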

Docker Image

No response

System Information

Environment Details:

  • OS: [e.g., Ubuntu 20.04]
  • CPU: [e.g., Intel i9-12900K]
  • RAM: [e.g., 64GB]

GPU Details:

  • GPU Model: [e.g., NVIDIA RTX 4090]
  • GPU Memory: [e.g., 24GB]
  • CUDA Version: [e.g., 12.1]
  • CUDA Driver: [e.g., 525.85.05]
  • cuDNN Version: [e.g., 8.9.0]

Additional Context

log_resume_training_esm2_c61ef42b0bff5efc79b3f873c71f39893921f7a9.txt

dorotat-nv added the bug label on Mar 14, 2025