BioNeMo Framework Version

c61ef42

Bug Description

ESM2 training crashes when it resumes from a checkpoint.
Steps to Reproduce

1. Train for 20 steps:

python sub-packages/bionemo-esm2/src/bionemo/esm2/scripts/train_esm2.py --train-cluster-path=/data/train_clusters.parquet --train-database-path=/data/train.db --valid-cluster-path=/data/valid_clusters.parquet --valid-database-path=/data/validation.db --micro-batch-size=16 --num-nodes=1 --num-gpus=1 --limit-val-batches=1 --min-seq-length=1024 --max-seq-length=1024 --num-layers=33 --hidden-size=1280 --num-attention-heads=20 --ffn-hidden-size=5120 --create-tensorboard-logger --wandb-project=fix-tensorboard-logs --val-check-interval 10 --wandb-group main --num-steps=20 --resume-if-exists

2. Rerun the identical command with --num-steps=30; because of --resume-if-exists it restores the step-20 checkpoint, and the resumed run crashes:

python sub-packages/bionemo-esm2/src/bionemo/esm2/scripts/train_esm2.py --train-cluster-path=/data/train_clusters.parquet --train-database-path=/data/train.db --valid-cluster-path=/data/valid_clusters.parquet --valid-database-path=/data/validation.db --micro-batch-size=16 --num-nodes=1 --num-gpus=1 --limit-val-batches=1 --min-seq-length=1024 --max-seq-length=1024 --num-layers=33 --hidden-size=1280 --num-attention-heads=20 --ffn-hidden-size=5120 --create-tensorboard-logger --wandb-project=fix-tensorboard-logs --val-check-interval 10 --wandb-group main --num-steps=30 --resume-if-exists
Error Messages and Logs

The resumed run fails while loading the distributed checkpoint:

megatron.core.dist_checkpointing.core.CheckpointingException: Cannot find global shape metadata for N-D flattened tensor ShardedTensor(key='optimizer.state.exp_avg.module.lm_head.dense.weight', dtype=torch.float32, local_shape=(1280, 1280), global_shape=(1280, 1280), global_offset=(0, 0), axis_fragmentations=(1, 1), replica_id=(0, 0, 0), prepend_axis_num=0, allow_shape_mismatch=False, flattened_range=slice(0, 1638400, None)) in checkpoint metadata: {'module.embedding.word_embeddings.weight': {}, 'module.encoder.layers.self_attention.linear_proj.weight': {}, 'module.encoder.layers.self_attention.linear_proj.bias': {}, 'module.encoder.layers.self_attention.linear_qkv.layer_norm_weight': {}, 'module.encoder.layers.self_attention.linear_qkv.layer_norm_bias': {}, 'module.encoder.layers.self_attention.linear_qkv.weight': {}, 'module.encoder.layers.self_attention.linear_qkv.bias': {}, 'module.encoder.layers.mlp.linear_fc1.layer_norm_weight': {}, 'module.encoder.layers.mlp.linear_fc1.layer_norm_bias': {}, 'module.encoder.layers.mlp.linear_fc1.weight': {}, 'module.encoder.layers.mlp.linear_fc1.bias': {}, 'module.encoder.layers.mlp.linear_fc2.weight': {}, 'module.encoder.layers.mlp.linear_fc2.bias': {}, 'module.encoder.final_layernorm.weight': {}, 'module.encoder.final_layernorm.bias': {}, 'module.lm_head.dense.weight': {}, 'module.lm_head.dense.bias': {}, 'module.lm_head.layer_norm.weight': {}, 'module.lm_head.layer_norm.bias': {}, 'module.output_layer.bias': {}}
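Every entry in the metadata dict printed above is empty ({}), suggesting the checkpoint was written without per-tensor global-shape information for the flattened optimizer state. A minimal diagnostic sketch to confirm what shape metadata was actually persisted, assuming the checkpoint uses the torch_dist format; the checkpoint path below is a made-up placeholder, not a path from this run:

from torch.distributed.checkpoint import FileSystemReader

# Hypothetical path; point this at the weights/ directory of the step-20
# checkpoint produced by the first run.
ckpt_dir = "results/esm2/checkpoints/step-20/weights"

reader = FileSystemReader(ckpt_dir)
metadata = reader.read_metadata()

# Print what shape information (if any) was persisted for the tensors
# named in the exception.
for key, md in metadata.state_dict_metadata.items():
    if "lm_head" in key:
        print(key, getattr(md, "size", None))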
Docker Image

No response
System Information

Environment Details:
GPU Details:
Additional Context

Full resume log attached: log_resume_training_esm2_c61ef42b0bff5efc79b3f873c71f39893921f7a9.txt
This looks like checkpoint corruption. As @jstjohn indicated, it may be worth checking whether fixes like https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/models/gpt/gpt_model.py#L331-L355 or https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/gpt/model/megatron/hyena/hyena_utils.py#L1145-L1165 address this issue.
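The fixes linked above live in code that customizes how a module's tensors are registered for distributed checkpointing. A minimal sketch of that general pattern, i.e. giving the affected module an explicit sharded_state_dict() so each tensor is saved as a ShardedTensor carrying its full global shape; the class name and sizes are illustrative assumptions, not the actual BioNeMo ESM2 code:

import torch
from megatron.core.transformer.utils import make_sharded_tensors_for_checkpoint


class LMHeadWithShardedStateDict(torch.nn.Module):
    # Hypothetical stand-in for the module that owns lm_head.dense; a real
    # fix would go on the corresponding BioNeMo/NeMo module instead.
    def __init__(self, hidden_size: int = 1280):
        super().__init__()
        self.dense = torch.nn.Linear(hidden_size, hidden_size)

    def sharded_state_dict(self, prefix="", sharded_offsets=(), metadata=None):
        state_dict = self.state_dict(prefix="", keep_vars=True)
        # No tensor-parallel axis map is passed: dense is treated as
        # replicated, so every tensor becomes a non-sharded ShardedTensor
        # that still records its global shape, which is the metadata the
        # resume path fails to find in this bug.
        return make_sharded_tensors_for_checkpoint(
            state_dict, prefix, sharded_offsets=sharded_offsets
        )

Whether this is the right shape for the ESM2 lm_head is something the linked Megatron-LM and NeMo changes would need to be checked against; the sketch only illustrates the mechanism they rely on.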