[Feature] Inquiry Regarding Multi-Node Training Support in BioNeMo Framework #700
Comments
We regularly test in a Slurm cluster environment. At a high level, you have a top-level sbatch script that configures how many nodes to use, etc., and then an srun command that specifies the container image to use on each node. Slurm should handle pulling the container and running it on each node. The sbatch script contains the Python command that specifies the number of nodes and GPUs. If you're familiar with launching jobs using PyTorch Lightning, it's very similar. Under the hood, the NeMo2 Megatron strategy manages setting up the distributed cluster state and communicating to the job on each node what it's responsible for. Good call-out that this part could use more documentation and examples! The hard part is that everyone's Slurm environment may be a bit different.
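For concreteness, here is a minimal sketch of the pattern described above. It assumes the pyxis/enroot Slurm plugin for the --container-image and --container-mounts flags; the image tag, mount paths, and training-script path are illustrative placeholders, not the framework's documented interface.

```bash
#!/bin/bash
# Top-level sbatch script: requests nodes/GPUs, then launches the same
# container on every node. Adapt partition, image, and paths to your cluster.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8   # one task per GPU, as PyTorch Lightning expects
#SBATCH --gpus-per-node=8
#SBATCH --time=04:00:00

# srun runs one task per GPU on each node; pyxis pulls the container image
# on every node. The Python command forwards the node/GPU counts to the
# trainer, and the NeMo2 Megatron strategy uses the resulting distributed
# environment to tell each rank what it is responsible for.
srun --container-image=nvcr.io/nvidia/clara/bionemo-framework:2.3 \
     --container-mounts=/data:/data \
     python /workspace/train.py \
       --num-nodes="${SLURM_JOB_NUM_NODES}" \
       --devices=8
```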
Thank you for your detailed response regarding multi-node training support in the BioNeMo Framework. I have a few follow-up questions to further understand the deployment options:
I appreciate your assistance and look forward to your insights.
Kindly pinging here, @jstjohn
@hbsun2113 we added some Slurm documentation, along with some of the training scripts we used for Evo2 training, into this PR: #746. Does that look reasonable?
Problem & Motivation
The BioNeMo Framework's README primarily details launching a single Docker container for model training. However, the codebase and benchmarks (e.g. --num-nodes=${nodes}) suggest potential support for multi-node training. Given the computational demands of training large biomolecular models, leveraging a self-managed cluster could offer significant advantages in resource optimization and cost-effectiveness. Understanding the framework's capabilities for multi-node training within a private cluster is therefore crucial for efficient model development.
BioNeMo Framework Version
v2.3
Category
Model/Training
Proposed Solution
I propose an enhancement to the BioNeMo Framework's documentation to include:
Detailed Instructions: Step-by-step guidance on configuring and initiating multi-node training sessions within a self-managed cluster environment.
Configuration Examples: Sample configuration files and command-line parameters tailored for multi-node setups.
Best Practices: Recommendations on optimizing performance and ensuring seamless communication between nodes during training.
Expected Benefits
Implementing this enhancement would:
Broaden Accessibility: Enable users without access to cloud services like DGX Cloud to fully utilize the BioNeMo Framework.
Enhance Flexibility: Allow researchers to tailor training environments to their specific hardware configurations.
Improve Resource Efficiency: Facilitate the effective use of existing infrastructure, potentially reducing operational costs.
By providing comprehensive guidance on multi-node training, the BioNeMo Framework can better serve a wider range of users and use cases.
Code Example
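As a sketch only: based on the benchmark scripts cited in the motivation above, a multi-node invocation might look like the following. Only the --num-nodes flag comes from those scripts; the script name and the remaining flag are hypothetical placeholders.

```bash
# Hypothetical multi-node launch command; only --num-nodes appears in the
# benchmark scripts referenced in the motivation above.
nodes=2
gpus_per_node=8
python train.py \
    --num-nodes="${nodes}" \
    --devices="${gpus_per_node}"
```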