
[BUG] "7b_arc_longcontext" evo2 training unit test on 1 gpu too memory consuming #731

Open
dorotat-nv opened this issue Mar 7, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@dorotat-nv
Collaborator

BioNeMo Framework Version

4b59b06

Bug Description

The unit test sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py::test_train_single_gpu[7b_arc_longcontext] fails on an NVIDIA L40 GPU: the pytest process is killed with exit code 137, i.e. it is terminated for consuming too much memory.

Steps to Reproduce

  1. Run the test on an L40 GPU with the following configuration; a reproduction sketch follows the nvidia-smi output below.

```
Fri Mar  7 11:12:10 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40                     On  |   00000000:C1:00.0 Off |                    0 |
| N/A   31C    P8             33W /  300W |       1MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```
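A minimal reproduction sketch, assuming it is run from the bionemo-framework repo root (the test node id is taken from the failure log below):

```python
# Sketch: run only the failing test via pytest's Python API.
# Equivalent to passing the same node id on the pytest command line.
import pytest

pytest.main([
    "sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py"
    "::test_train_single_gpu[7b_arc_longcontext]",
    "-v",
])
```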

Error Messages and Logs

```
12:23:15  sub-packages/bionemo-evo2/tests/bionemo/evo2/data/test_tokenizer.py::test_tokenizer_processes_special_characters PASSED [ 40%]
12:24:01  sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_infer.py::test_run_infer PASSED [ 43%]
12:24:11  sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_inference.py::test_infer_model_generates_expected_single_token_output PASSED [ 46%]
12:25:48  sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_predict.py::test_predict_evo2_runs PASSED [ 50%]
12:28:09  sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py::test_train_evo2_runs PASSED [ 53%]
12:28:48  sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py::test_train_single_gpu[7b_nv] PASSED [ 56%]
12:29:02  sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py::test_train_single_gpu[7b_arc_longcontext] ci/scripts/run_pytest.sh: line 112:  9935 Killed                  pytest "${PYTEST_OPTIONS[@]}" --junitxml=$(basename $dir).junit.xml -o junit_family=legacy "$dir"
12:29:02  + exit_code=137
12:29:02  + [[ 137 -ne 0 ]]
12:29:02  + [[ false == true ]]
12:29:02  + echo 'Error: pytest failed with exit code 137'
12:29:02  Error: pytest failed with exit code 137
12:29:02  + error=true
12:29:02  + clean_pycache ./sub-packages/bionemo-evo2/
12:29:02  + local base_dir=./sub-packages/bionemo-evo2/
12:29:02  + echo 'Cleaning Python cache files in ./sub-packages/bionemo-evo2/...'
```
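For reference, exit code 137 is 128 + 9 (SIGKILL): the pytest process was killed by the kernel rather than failing with a Python-level CUDA OOM, which on Linux CI usually points at the host OOM killer or the container runtime. A minimal sketch for decoding such exit codes:

```python
# Sketch: decode a shell exit code above 128 into the terminating signal.
import signal

exit_code = 137  # value reported by ci/scripts/run_pytest.sh above
if exit_code > 128:
    sig = signal.Signals(exit_code - 128)
    print(f"pytest was killed by {sig.name}")  # SIGKILL -> typically the OOM killer
```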

Docker Image

No response

System Information

Environment Details:

  • OS: [e.g., Ubuntu 20.04]
  • CPU: [e.g., Intel i9-12900K]
  • RAM: [e.g., 64GB]

GPU Details:

  • GPU Model: [e.g., NVIDIA RTX 4090]
  • GPU Memory: [e.g., 24GB]
  • CUDA Version: [e.g., 12.1]
  • CUDA Driver: [e.g., 525.85.05]
  • cuDNN Version: [e.g., 8.9.0]

Additional Context

No response

@dorotat-nv dorotat-nv added the bug Something isn't working label Mar 7, 2025
@jstjohn
Collaborator

jstjohn commented Mar 7, 2025

I'm 90% sure that this failure isn't what you think it is. These two architectures should use nearly the same amount of memory: the only difference is a slight increase in the number of parameters in the FFN layer so that its width is divisible by 64 and then 16, rather than only by 16 followed by 16, which enables TP=64. Both cases use a sequence length of only 128, and both are limited to 4 layers.

My guess is that the real problem is underlying instability or competition between jobs in our CI runners.
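For illustration, a minimal sketch of the divisibility argument above; the base width is hypothetical and does not reflect the actual 7b configs:

```python
def pad_ffn_width(ffn_width: int, multiple: int) -> int:
    """Round the FFN width up to the nearest multiple so it splits evenly across TP ranks."""
    return ((ffn_width + multiple - 1) // multiple) * multiple

base = 10_000  # hypothetical FFN width, for illustration only
print(pad_ffn_width(base, 16))  # 10000: already divisible by 16 -> supports TP up to 16
print(pad_ffn_width(base, 64))  # 10048: padded to a multiple of 64 -> also enables TP=64
# The two variants differ by at most (multiple - 1) columns per FFN layer,
# so their parameter counts and memory footprints are nearly identical.
```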

dorotat-nv added a commit that referenced this issue Mar 14, 2025
### Description
Marking the unit test
sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py::test_train_single_gpu[7b_arc_longcontext]
as xfail due to its issues on certain GPUs.

Follow-up issue: #731

### Type of changes
<!-- Mark the relevant option with an [x] -->

- [x]  Bug fix (non-breaking change which fixes an issue)
- [ ]  New feature (non-breaking change which adds functionality)
- [ ]  Refactor
- [ ]  Documentation update
- [ ]  Other (please describe):

### CI Pipeline Configuration
Configure CI behavior by applying the relevant labels:

- [SKIP_CI](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#skip_ci) - Skip all continuous integration tests
- [INCLUDE_NOTEBOOKS_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_notebooks_tests) - Execute notebook validation tests in pytest
- [INCLUDE_SLOW_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_slow_tests) - Execute tests labelled as slow in pytest for extensive testing


> [!NOTE]
> By default, the notebook validation tests are skipped unless explicitly enabled.

### Usage
<!--- How does a user interact with the changed code -->
```python
TODO: Add code snippet
```
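The change itself is roughly the following sketch; the real test in test_train.py is parametrized over model configs, so the actual decorator and its placement may differ:

```python
import pytest

# Sketch only: mark the known-problematic case as an expected failure.
@pytest.mark.xfail(reason="Issue #731: pytest killed (exit 137) on some CI GPUs, e.g. L40")
def test_train_single_gpu_7b_arc_longcontext():
    ...
```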

### Pre-submit Checklist
<!--- Ensure all items are completed before submitting -->

 - [ ] I have tested these changes locally
 - [ ] I have updated the documentation accordingly
 - [ ] I have added/updated tests as needed
 - [ ] All existing tests pass successfully

---------

Signed-off-by: Dorota Toczydlowska <[email protected]>