I'm 90% sure that this failure isn't what you think it is. These two architectures should use nearly the same amount of memory: the only difference is a slight increase in the number of parameters in the FFN layer so that it can be divided by 64 and then 16, rather than only by 16 followed by 16, which is what enables TP=64. Both cases use a sequence length of only 128, and both are limited to 4 layers.
My guess is that the real problem is underlying instability or contention between jobs on our CI runners.
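A rough back-of-the-envelope check of that claim (a minimal sketch only; the `hidden` and `ffn_base` sizes below are placeholders rather than the actual Evo2 7B configuration, and a plain two-matrix FFN is assumed):

```python
# Minimal sketch: how much does padding the FFN width to a multiple of 64
# (instead of 16) change the parameter count? Sizes are placeholders, not
# the real Evo2 7B config; a simple two-matrix FFN (h -> ffn -> h) is assumed.

def round_up(x: int, multiple: int) -> int:
    """Round x up to the nearest multiple of `multiple`."""
    return ((x + multiple - 1) // multiple) * multiple

hidden = 4096         # hypothetical model hidden size
ffn_base = 10992      # hypothetical unpadded FFN width

ffn_mult16 = round_up(ffn_base, 16)   # divisible by 16 only
ffn_mult64 = round_up(ffn_base, 64)   # divisible by 64 (enables TP=64)

params_16 = 2 * hidden * ffn_mult16   # up-projection + down-projection weights
params_64 = 2 * hidden * ffn_mult64

print(f"FFN params, mult-of-16 width: {params_16:,}")
print(f"FFN params, mult-of-64 width: {params_64:,}")
print(f"relative increase: {params_64 / params_16 - 1:.4%}")
```

Padding to a multiple of 64 adds at most 48 extra columns per FFN matrix compared with padding to a multiple of 16, so the memory footprints of the two configurations should be essentially identical.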
### Description
Marks the unit test
`sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py::test_train_single_gpu[7b_arc_longcontext]`
as xfail due to its failures on certain GPUs.
Follow-up issue: #731
### Type of changes
<!-- Mark the relevant option with an [x] -->
- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Refactor
- [ ] Documentation update
- [ ] Other (please describe):
### CI Pipeline Configuration
Configure CI behavior by applying the relevant labels:
- [SKIP_CI](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#skip_ci) - Skip all continuous integration tests
- [INCLUDE_NOTEBOOKS_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_notebooks_tests) - Execute notebook validation tests in pytest
- [INCLUDE_SLOW_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_slow_tests) - Execute tests labelled as slow in pytest for extensive testing
> [!NOTE]
> By default, the notebook validation tests are skipped unless explicitly enabled.
### Usage
<!--- How does a user interact with the changed code -->
```python
TODO: Add code snippet
```
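As a minimal sketch of how the affected case might be marked: the parametrization below is a simplified stand-in for the real test in `test_train.py`, and the exact marker arguments are assumptions, not the literal change in this PR.

```python
# Illustrative only: marking a single parametrized case as an expected failure.
# The parametrization here is a simplified stand-in for the real test.
import pytest


@pytest.mark.parametrize(
    "model_size",
    [
        "7b",
        pytest.param(
            "7b_arc_longcontext",
            marks=pytest.mark.xfail(
                reason="Fails on certain GPUs (e.g. L40); tracked in #731.",
                strict=False,
            ),
        ),
    ],
)
def test_train_single_gpu(model_size):
    ...  # the actual training smoke test lives in test_train.py
```

With `strict=False`, the case is reported as XFAIL when it fails and as XPASS when it happens to pass on hardware where the issue does not reproduce.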
### Pre-submit Checklist
<!--- Ensure all items are completed before submitting -->
- [ ] I have tested these changes locally
- [ ] I have updated the documentation accordingly
- [ ] I have added/updated tests as needed
- [ ] All existing tests pass successfully
---------
Signed-off-by: Dorota Toczydlowska <[email protected]>
### BioNeMo Framework Version
4b59b06
### Bug Description
The unit test `sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py::test_train_single_gpu[7b_arc_longcontext]` fails on the NVIDIA L40 GPU.
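The failing case can be selected directly with pytest (a sketch; it assumes the bionemo-framework repository and its test dependencies are installed in the current environment):

```python
# Run only the affected parametrization of the training test.
import pytest

pytest.main([
    "sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py"
    "::test_train_single_gpu[7b_arc_longcontext]",
    "-v",
])
```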
### Steps to Reproduce
```
12:12:10 Fri Mar 7 11:12:10 2025
12:12:10 +-----------------------------------------------------------------------------------------+
12:12:10 | NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.8 |
12:12:10 |-----------------------------------------+------------------------+----------------------+
12:12:10 | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
12:12:10 | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
12:12:10 | | | MIG M. |
12:12:10 |=========================================+========================+======================|
12:12:10 | 0 NVIDIA L40 On | 00000000:C1:00.0 Off | 0 |
12:12:10 | N/A 31C P8 33W / 300W | 1MiB / 46068MiB | 0% Default |
12:12:10 | | | N/A |
12:12:10 +-----------------------------------------+------------------------+----------------------+
12:12:10
12:12:10 +-----------------------------------------------------------------------------------------+
12:12:10 | Processes: |
12:12:10 | GPU GI CI PID Type Process name GPU Memory |
12:12:10 | ID ID Usage |
12:12:10 |=========================================================================================|
12:12:10 | No running processes found |
12:12:10 +-----------------------------------------------------------------------------------------+
```
### Error Messages and Logs
### Docker Image
No response
### System Information
Environment Details:
GPU Details:
### Additional Context
No response