
[BUG] "7b_arc_longcontext" evo2 training unit test on 1 gpu too memory consuming #731

Open
dorotat-nv opened this issue Mar 7, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@dorotat-nv
Collaborator

BioNeMo Framework Version

4b59b06

Bug Description

The unit test sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py::test_train_single_gpu[7b_arc_longcontext] fails on an NVIDIA L40 GPU: the pytest process is killed with exit code 137, i.e. it is terminated for consuming too much memory.

Steps to Reproduce

  1. Run the test on an L40 GPU with the following configuration; a reproduction sketch follows the nvidia-smi output below.

```
Fri Mar  7 11:12:10 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40                     On  |   00000000:C1:00.0 Off |                    0 |
| N/A   31C    P8             33W /  300W |       1MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```
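A minimal reproduction sketch, assuming it is run from the bionemo-framework repo root (the test node id is taken from the failure log below):

```python
# Sketch: run only the failing test via pytest's Python API.
# Equivalent to passing the same node id on the pytest command line.
import pytest

pytest.main([
    "sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py"
    "::test_train_single_gpu[7b_arc_longcontext]",
    "-v",
])
```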

Error Messages and Logs

```
12:23:15  sub-packages/bionemo-evo2/tests/bionemo/evo2/data/test_tokenizer.py::test_tokenizer_processes_special_characters PASSED [ 40%]
12:24:01  sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_infer.py::test_run_infer PASSED [ 43%]
12:24:11  sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_inference.py::test_infer_model_generates_expected_single_token_output PASSED [ 46%]
12:25:48  sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_predict.py::test_predict_evo2_runs PASSED [ 50%]
12:28:09  sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py::test_train_evo2_runs PASSED [ 53%]
12:28:48  sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py::test_train_single_gpu[7b_nv] PASSED [ 56%]
12:29:02  sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py::test_train_single_gpu[7b_arc_longcontext] ci/scripts/run_pytest.sh: line 112:  9935 Killed                  pytest "${PYTEST_OPTIONS[@]}" --junitxml=$(basename $dir).junit.xml -o junit_family=legacy "$dir"
12:29:02  + exit_code=137
12:29:02  + [[ 137 -ne 0 ]]
12:29:02  + [[ false == true ]]
12:29:02  + echo 'Error: pytest failed with exit code 137'
12:29:02  Error: pytest failed with exit code 137
12:29:02  + error=true
12:29:02  + clean_pycache ./sub-packages/bionemo-evo2/
12:29:02  + local base_dir=./sub-packages/bionemo-evo2/
12:29:02  + echo 'Cleaning Python cache files in ./sub-packages/bionemo-evo2/...'
```
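For reference, exit code 137 is 128 + 9 (SIGKILL): the pytest process was killed by the kernel rather than failing with a Python-level CUDA OOM, which on Linux CI usually points at the host OOM killer or the container runtime. A minimal sketch for decoding such exit codes:

```python
# Sketch: decode a shell exit code above 128 into the terminating signal.
import signal

exit_code = 137  # value reported by ci/scripts/run_pytest.sh above
if exit_code > 128:
    sig = signal.Signals(exit_code - 128)
    print(f"pytest was killed by {sig.name}")  # SIGKILL -> typically the OOM killer
```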

Docker Image

No response

System Information

Environment Details:

  • OS: [e.g., Ubuntu 20.04]
  • CPU: [e.g., Intel i9-12900K]
  • RAM: [e.g., 64GB]

GPU Details:

  • GPU Model: [e.g., NVIDIA RTX 4090]
  • GPU Memory: [e.g., 24GB]
  • CUDA Version: [e.g., 12.1]
  • CUDA Driver: [e.g., 525.85.05]
  • cuDNN Version: [e.g., 8.9.0]

Additional Context

No response

@dorotat-nv dorotat-nv added the bug Something isn't working label Mar 7, 2025
@jstjohn
Collaborator

jstjohn commented Mar 7, 2025

I'm 90% sure that this failure isn't what you think it is. These two architectures should use nearly the same amount of memory: the only difference is a slight increase in the number of parameters in the FFN layer so that its width is divisible by 64 and then 16, rather than only by 16 followed by 16, which enables TP=64. Both cases use a sequence length of only 128, and both are limited to 4 layers.

My guess is that the real problem is underlying instability or competition between jobs in our CI runners.
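For illustration, a minimal sketch of the divisibility argument above; the base width is hypothetical and does not reflect the actual 7b configs:

```python
def pad_ffn_width(ffn_width: int, multiple: int) -> int:
    """Round the FFN width up to the nearest multiple so it splits evenly across TP ranks."""
    return ((ffn_width + multiple - 1) // multiple) * multiple

base = 10_000  # hypothetical FFN width, for illustration only
print(pad_ffn_width(base, 16))  # 10000: already divisible by 16 -> supports TP up to 16
print(pad_ffn_width(base, 64))  # 10048: padded to a multiple of 64 -> also enables TP=64
# The two variants differ by at most (multiple - 1) columns per FFN layer,
# so their parameter counts and memory footprints are nearly identical.
```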

dorotat-nv added a commit that referenced this issue Mar 14, 2025
### Description
Marking the unit test
sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py::test_train_single_gpu[7b_arc_longcontext]
as xfail due to its issues on certain GPUs.

Follow-up issue: #731

### Type of changes
<!-- Mark the relevant option with an [x] -->

- [x]  Bug fix (non-breaking change which fixes an issue)
- [ ]  New feature (non-breaking change which adds functionality)
- [ ]  Refactor
- [ ]  Documentation update
- [ ]  Other (please describe):

### CI Pipeline Configuration
Configure CI behavior by applying the relevant labels:

- [SKIP_CI](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#skip_ci) - Skip all continuous integration tests
- [INCLUDE_NOTEBOOKS_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_notebooks_tests) - Execute notebook validation tests in pytest
- [INCLUDE_SLOW_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_slow_tests) - Execute tests labelled as slow in pytest for extensive testing


> [!NOTE]
> By default, the notebook validation tests are skipped unless explicitly enabled.

### Usage
<!--- How does a user interact with the changed code -->
```python
TODO: Add code snippet
```
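The change itself is roughly the following sketch; the real test in test_train.py is parametrized over model configs, so the actual decorator and its placement may differ:

```python
import pytest

# Sketch only: mark the known-problematic case as an expected failure.
@pytest.mark.xfail(reason="Issue #731: pytest killed (exit 137) on some CI GPUs, e.g. L40")
def test_train_single_gpu_7b_arc_longcontext():
    ...
```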

### Pre-submit Checklist
<!--- Ensure all items are completed before submitting -->

 - [ ] I have tested these changes locally
 - [ ] I have updated the documentation accordingly
 - [ ] I have added/updated tests as needed
 - [ ] All existing tests pass successfully

---------

Signed-off-by: Dorota Toczydlowska <[email protected]>