
[Feature] Inquiry Regarding Multi-Node Training Support in BioNeMo Framework #700

Open
hbsun2113 opened this issue Feb 24, 2025 · 5 comments

Comments

@hbsun2113

Problem & Motivation

The BioNeMo Framework's README primarily details launching a single Docker container for model training. However, the codebase and benchmarks (e.g., --num-nodes=${nodes}) suggest support for multi-node training. Given the computational demands of training large biomolecular models, leveraging a self-managed cluster could offer significant advantages in resource optimization and cost-effectiveness, so understanding the framework's capabilities for multi-node training within a private cluster is crucial for efficient model development.

BioNeMo Framework Version

v2.3

Category

Model/Training

Proposed Solution

I propose an enhancement to the BioNeMo Framework's documentation to include:

Detailed Instructions: Step-by-step guidance on configuring and initiating multi-node training sessions within a self-managed cluster environment.

Configuration Examples: Sample configuration files and command-line parameters tailored for multi-node setups (see the Code Example section below).

Best Practices: Recommendations on optimizing performance and ensuring seamless communication between nodes during training.

Expected Benefits

Implementing this enhancement would:

Broaden Accessibility: Enable users without access to cloud services like DGX Cloud to fully utilize the BioNeMo Framework.

Enhance Flexibility: Allow researchers to tailor training environments to their specific hardware configurations.

Improve Resource Efficiency: Facilitate the effective use of existing infrastructure, potentially reducing operational costs.

By providing comprehensive guidance on multi-node training, the BioNeMo Framework can better serve a wider range of users and use cases.

Code Example
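For illustration, the kind of multi-node launch I would like to see documented might look like the sketch below. Only --num-nodes is taken from the published benchmarks; the script name and every other flag here are placeholders rather than actual BioNeMo CLI options.

    # Hypothetical invocation on a 4-node cluster with 8 GPUs per node.
    # "train_my_model" and all flags except --num-nodes are illustrative placeholders.
    train_my_model \
        --num-nodes=4 \
        --num-gpus-per-node=8 \
        --data-dir=/data/my_dataset \
        --result-dir=/results/my_run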

@jstjohn
Collaborator

jstjohn commented Feb 24, 2025

We regularly test in a Slurm cluster environment. At a high level, you have a top-level sbatch script that configures how many nodes to use, etc., and then an srun command that specifies the image to use on each node. Slurm should handle pulling the container and running it on each node. The sbatch script contains the python command that specifies the number of nodes and GPUs. If you're familiar with launching jobs with PyTorch Lightning, it's very similar: under the hood, the nemo2 megatron strategy manages setting up the distributed cluster state and communicating to the job on each node what it's responsible for. Good callout that this part could use more documentation and examples! The hard part is that everyone's Slurm environment may be a bit different.
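Roughly, a sketch of what that looks like (this assumes a Slurm cluster with the pyxis/enroot container plugin; the partition, mounts, image tag, and training entrypoint below are placeholders you'd adapt to your own environment):

    #!/bin/bash
    #SBATCH --nodes=2                 # number of nodes for the job
    #SBATCH --ntasks-per-node=8       # one task per GPU
    #SBATCH --gpus-per-node=8
    #SBATCH --partition=batch         # placeholder partition name
    #SBATCH --time=04:00:00

    # srun starts the container on every node; --container-image and --container-mounts
    # come from the pyxis plugin, which not every Slurm install has enabled.
    srun --container-image=nvcr.io/nvidia/clara/bionemo-framework:2.3 \
         --container-mounts=/data:/data,/results:/results \
         bash -c "train_my_model --num-nodes=2 --num-gpus-per-node=8"

The idea is that the training command inside the container is the same one you'd run single-node; the strategy picks up the per-node rank information that Slurm sets.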

@hbsun2113
Author

hbsun2113 commented Feb 28, 2025

@jstjohn

Thank you for your detailed response regarding multi-node training support in the BioNeMo Framework. I have a few follow-up questions to further understand the deployment options:

  1. Slurm Cluster Compatibility: Is the BioNeMo Framework compatible with any private Slurm clusters, or is it specifically optimized for NVIDIA DGX Cloud environments?

  2. Kubernetes Deployment: Does BioNeMo support deployment on Kubernetes clusters, similar to the Run NeMo Framework on Kubernetes setup? Are there specific configurations or considerations for such deployments?

  3. Direct Multi-Node Training: Is it possible to utilize native PyTorch Lightning or Megatron methods to run BioNeMo across multiple nodes without relying on orchestration platforms like Slurm or Kubernetes? If so, could you provide guidance or examples on setting up such an environment?

I appreciate your assistance and look forward to your insights.
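For question 3, what I have in mind is something along the lines of PyTorch Lightning's on-prem multi-node launch, where the same command is started by hand on every node and the rendezvous is set through environment variables. A minimal sketch, assuming that approach (the address, port, entrypoint, and flags are placeholders):

    # Run the same commands on every node, changing NODE_RANK per node
    # (0 on the first node, 1 on the second, ...).
    export MASTER_ADDR=10.0.0.1   # placeholder: address of the rank-0 node
    export MASTER_PORT=29500      # placeholder: any free port
    export NODE_RANK=0

    # Placeholder entrypoint and flags; the point is that the script's trainer
    # would be configured for 2 nodes and 8 devices per node on every node.
    python train.py --num-nodes=2 --devices=8

Would something like this be expected to work with the BioNeMo training scripts, or is an orchestrator like Slurm assumed?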

@hbsun2113
Author

A gentle ping here, @jstjohn.

@jstjohn
Collaborator

jstjohn commented Mar 13, 2025

@hbsun2113 we added some Slurm documentation, along with some of the training scripts we used for evo2 training, in PR #746. Does that look reasonable?

@jstjohn
Collaborator

jstjohn commented Mar 13, 2025
