
[Feature] Inquiry Regarding Multi-Node Training Support in BioNeMo Framework #700

Open
hbsun2113 opened this issue Feb 24, 2025 · 5 comments

Comments

@hbsun2113

Problem & Motivation

The BioNeMo Framework's README primarily details launching a single Docker container for model training. However, the codebase and benchmarks (e.g., --num-nodes=${nodes}) suggest support for multi-node training. Given the computational demands of training large biomolecular models, leveraging a self-managed cluster could offer significant advantages in resource optimization and cost-effectiveness, so understanding the framework's capabilities for multi-node training within a private cluster is crucial for efficient model development.

BioNeMo Framework Version

v2.3

Category

Model/Training

Proposed Solution

I propose an enhancement to the BioNeMo Framework's documentation to include:

Detailed Instructions: Step-by-step guidance on configuring and initiating multi-node training sessions within a self-managed cluster environment.

Configuration Examples: Sample configuration files and command-line parameters tailored for multi-node setups (see the Code Example section below).

Best Practices: Recommendations on optimizing performance and ensuring seamless communication between nodes during training.

Expected Benefits

Implementing this enhancement would:

Broaden Accessibility: Enable users without access to cloud services like DGX Cloud to fully utilize the BioNeMo Framework.

Enhance Flexibility: Allow researchers to tailor training environments to their specific hardware configurations.

Improve Resource Efficiency: Facilitate the effective use of existing infrastructure, potentially reducing operational costs.

By providing comprehensive guidance on multi-node training, the BioNeMo Framework can better serve a wider range of users and use cases.

Code Example
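For illustration, the kind of multi-node launch I would like to see documented might look like the sketch below. Only --num-nodes is taken from the published benchmarks; the script name and every other flag here are placeholders rather than actual BioNeMo CLI options.

    # Hypothetical invocation on a 4-node cluster with 8 GPUs per node.
    # "train_my_model" and all flags except --num-nodes are illustrative placeholders.
    train_my_model \
        --num-nodes=4 \
        --num-gpus-per-node=8 \
        --data-dir=/data/my_dataset \
        --result-dir=/results/my_run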

@jstjohn
Collaborator

jstjohn commented Feb 24, 2025

We regularly test in a Slurm cluster environment. At a high level, you have a top-level sbatch script that configures how many nodes to use, etc., and then an srun command that specifies the image to use on each node. Slurm should handle pulling the container and running it on each node. The sbatch script contains the python command that specifies the number of nodes and GPUs. If you're familiar with launching jobs with PyTorch Lightning, it's very similar: under the hood, the nemo2 megatron strategy manages setting up the distributed cluster state and communicating to the job on each node what it's responsible for. Good callout that this part could use more documentation and examples! The hard part is that everyone's Slurm environment may be a bit different.
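Roughly, a sketch of what that looks like (this assumes a Slurm cluster with the pyxis/enroot container plugin; the partition, mounts, image tag, and training entrypoint below are placeholders you'd adapt to your own environment):

    #!/bin/bash
    #SBATCH --nodes=2                 # number of nodes for the job
    #SBATCH --ntasks-per-node=8       # one task per GPU
    #SBATCH --gpus-per-node=8
    #SBATCH --partition=batch         # placeholder partition name
    #SBATCH --time=04:00:00

    # srun starts the container on every node; --container-image and --container-mounts
    # come from the pyxis plugin, which not every Slurm install has enabled.
    srun --container-image=nvcr.io/nvidia/clara/bionemo-framework:2.3 \
         --container-mounts=/data:/data,/results:/results \
         bash -c "train_my_model --num-nodes=2 --num-gpus-per-node=8"

The idea is that the training command inside the container is the same one you'd run single-node; the strategy picks up the per-node rank information that Slurm sets.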

@hbsun2113
Author

hbsun2113 commented Feb 28, 2025

@jstjohn

Thank you for your detailed response regarding multi-node training support in the BioNeMo Framework. I have a few follow-up questions to further understand the deployment options:

  1. Slurm Cluster Compatibility: Is the BioNeMo Framework compatible with any private Slurm clusters, or is it specifically optimized for NVIDIA DGX Cloud environments?

  2. Kubernetes Deployment: Does BioNeMo support deployment on Kubernetes clusters, similar to the Run NeMo Framework on Kubernetes setup? Are there specific configurations or considerations for such deployments?

  3. Direct Multi-Node Training: Is it possible to utilize native PyTorch Lightning or Megatron methods to run BioNeMo across multiple nodes without relying on orchestration platforms like Slurm or Kubernetes? If so, could you provide guidance or examples on setting up such an environment?

I appreciate your assistance and look forward to your insights.
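For question 3, what I have in mind is something along the lines of PyTorch Lightning's on-prem multi-node launch, where the same command is started by hand on every node and the rendezvous is set through environment variables. A minimal sketch, assuming that approach (the address, port, entrypoint, and flags are placeholders):

    # Run the same commands on every node, changing NODE_RANK per node
    # (0 on the first node, 1 on the second, ...).
    export MASTER_ADDR=10.0.0.1   # placeholder: address of the rank-0 node
    export MASTER_PORT=29500      # placeholder: any free port
    export NODE_RANK=0

    # Placeholder entrypoint and flags; the point is that the script's trainer
    # would be configured for 2 nodes and 8 devices per node on every node.
    python train.py --num-nodes=2 --devices=8

Would something like this be expected to work with the BioNeMo training scripts, or is an orchestrator like Slurm assumed?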

@hbsun2113
Author

A gentle ping here, @jstjohn.

@jstjohn
Collaborator

jstjohn commented Mar 13, 2025

@hbsun2113 we added some Slurm documentation, along with some of the training scripts we used for evo2 training, in PR #746. Does that look reasonable?

@jstjohn
Collaborator

jstjohn commented Mar 13, 2025
