Parallelizes `MDAnalysis.analysis.InterRDF` and `MDAnalysis.analysis.InterRDF_s` #4884

tanishy7777 · 2025-01-07T19:47:33Z

Changes made in this Pull Request:

Parallelized both rdf.InterRDF and rdf.InterRDF_s

TL;DR of the comments below: I initially thought rdf wasn't parallelizable, but it turns out both classes in rdf can be parallelized.

PR Checklist

Tests?
Docs?
CHANGELOG updated?
Issue raised/referenced?

Developers certificate of origin

I certify that this contribution is covered by the LGPLv2.1+ license as defined in our LICENSE and adheres to the Developer Certificate of Origin.

📚 Documentation preview 📚: https://mdanalysis--4884.org.readthedocs.build/en/4884/

pep8speaks · 2025-01-07T19:47:39Z

Hello @tanishy7777! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file package/MDAnalysis/analysis/rdf.py:

Line 592:1: W293 blank line contains whitespace

In the file testsuite/MDAnalysisTests/analysis/test_rdf.py:

Line 160:1: E302 expected 2 blank lines, found 1

In the file testsuite/MDAnalysisTests/analysis/test_rdf_s.py:

Line 179:1: E302 expected 2 blank lines, found 1

Comment last updated at 2025-01-20 20:35:32 UTC

codecov · 2025-01-07T19:56:53Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93.42%. Comparing base (35d9d2e) to head (ca06fd2).

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #4884      +/-   ##
===========================================
- Coverage    93.42%   93.42%   -0.01%     
===========================================
  Files          177      189      +12     
  Lines        21859    22950    +1091     
  Branches      3078     3080       +2     
===========================================
+ Hits         20422    21440    +1018     
- Misses         986     1059      +73     
  Partials       451      451


marinegor · 2025-01-08T12:25:18Z

@tanishy7777 thanks for the prompt PR! I have some comments though:

self.volume_cum is accumulated across the frames, so we can't parallelize simply using the split-apply-combine technique.

Can we just sum it up separately though? I mean, make it a part of self.results for each worker's trajectory, and in _conclude write sum of it to self.volume_cum.

tanishy7777 · 2025-01-08T17:34:01Z

Can we just sum it up separately though? I mean, make it a part of self.results for each worker's trajectory, and in _conclude write sum of it to self.volume_cum.

Got it! I have made analysis.rdf.InterRDF parallelizable with this approach, but analysis.rdf.InterRDF_s needs a bit more work.
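
The approach discussed above can be sketched as follows. This is a minimal, self-contained illustration and not the MDAnalysis implementation: `run_worker` and `conclude` are hypothetical names standing in for one worker's frame loop and for `_conclude`, and the per-frame volumes are made-up numbers.

```python
# Hypothetical stand-ins: each worker accumulates its own partial volume,
# and the partial sums are combined only once, at the end.

def run_worker(frame_volumes):
    """Accumulate the box volume over one worker's share of the frames."""
    results = {"volume_cum": 0.0}
    for volume in frame_volumes:
        results["volume_cum"] += volume
    return results


def conclude(worker_results):
    """Combine the per-worker partial sums into the final cumulative volume."""
    return sum(r["volume_cum"] for r in worker_results)


# Made-up per-frame volumes for a six-frame trajectory.
volumes = [100.0, 101.0, 99.0, 100.5, 100.0, 99.5]

# One worker over all frames versus two workers over half the frames each.
serial_total = conclude([run_worker(volumes)])
parallel_total = conclude([run_worker(volumes[:3]), run_worker(volumes[3:])])
```

Because the cumulative volume is a plain sum over frames, the order in which frames are grouped does not change the total, so the quantity can live in the per-worker results and be summed during aggregation.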

tanishy7777 · 2025-01-08T19:06:33Z

When trying to make analysis.rdf.InterRDF_s parallelizable, I am running into an error when aggregating results.count.

FAILED testsuite\MDAnalysisTests\analysis\test_rdf_s.py::test_nbins[client_InterRDF_s1] - ValueError: setting an array element with a 
sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous p...

I tried to find out why this is happening by looking at the result that is being aggregated under 'count'.

The arrs passed to the ResultsGroup.ndarray_vstack function can be broken into arrs = [arr, arr1] (see the screenshots).

I worked out the dimensions of the arrays manually because arrs would not convert to a NumPy array.

It cannot be converted to a NumPy array because of inconsistent dimensions. I am not sure how to resolve this.

Here arr is made up of 2 arrays of shape (1,2,412) and (2,2,412),
and arr1 is likewise made up of 2 arrays of shape (1,2,412) and (2,2,412).

tanishy7777 · 2025-01-08T19:24:40Z

based on the above comment #4884 (comment)

I think we can mark analysis.rdf.InterRDF_s as non-parallelizable and mark analysis.rdf.InterRDF as parallelizable,

because it's not possible to convert an array of inhomogeneous dimensions to a NumPy array, and rdf.InterRDF_s needs NumPy arrays to be passed. Even if we concatenate the arrays as plain lists, performing even basic operations would be hard.

After the _get_aggregator method, when _conclude is run, operations like

self.results.count[i] / norm

won't be possible, as division is not supported between list and int.

marinegor · 2025-01-09T22:46:55Z

@tanishy7777 the class should be marked as non-parallelizable only if the algorithm to run it is not actually parallelizable, which I'm not yet convinced is the case for all the mentioned classes.

But I think you're on the right path here, you just need to implement a custom aggregation function, instead of those implemented among ResultsGroup staticmethods -- they obviously don't cover all the possible cases for aggregation, just the basic ones for user's convenience.

Can you describe what kind of arrays you're trying to aggregate and can not find an appropriate function for? I didn't quite get it from your screenshots.

tanishy7777 · 2025-01-13T20:03:12Z

But I think you're on the right path here, you just need to implement a custom aggregation function, instead of those implemented among ResultsGroup staticmethods -- they obviously don't cover all the possible cases for aggregation, just the basic ones for user's convenience.

Can you describe what kind of arrays you're trying to aggregate and can not find an appropriate function for? I didn't quite get it from your screenshots.

So, the array is something like this

The [412] denotes an array with 412 elements.

That is why it can't be processed by NumPy directly. But I think if I use a custom aggregation function like you mentioned, I can sum the entries (by adding the 3 blue arrays of shape 2x412 as in the picture) and convert the result to a 1x2x412 array? I am not sure about the final dimension it needs to be converted to.

@marinegor
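
One way to sum such nested results is sketched below. It assumes each worker returns a list with one array per position, and that this list layout is the same on every worker. The name `nested_array_sum` and its body are illustrative only, not the code in this PR, and the arrays use 4 bins instead of 412 to keep the example small.

```python
import numpy as np


def nested_array_sum(arrs):
    """Sum per-worker lists of arrays position by position.

    arrs holds one entry per worker. Each entry is a list of arrays whose
    shapes may differ from one position to the next, but must match across
    workers at the same position.
    """
    # zip(*arrs) groups the arrays that sit at the same position in every
    # worker's list; each group has a uniform shape and can be summed.
    return [np.sum(group, axis=0) for group in zip(*arrs)]


# Two hypothetical workers, each holding arrays of shape (1, 2, 4) and (2, 2, 4).
worker_a = [np.ones((1, 2, 4)), np.ones((2, 2, 4))]
worker_b = [2 * np.ones((1, 2, 4)), 3 * np.ones((2, 2, 4))]

total = nested_array_sum([worker_a, worker_b])
shapes = [t.shape for t in total]
```

The output keeps the list-of-arrays layout of a single worker, so each element is still an array and expressions such as `self.results.count[i] / norm` remain valid.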

marinegor · 2025-01-13T20:32:20Z

I am not sure about the final dimension it needs to be converted to

it should be the same as if you'd run it without parallelization. And I assume you want to stack/sum/whatever along the dimension that corresponds to the timestep -- you can probably guess which one it is if you run it on some example with known number of frames. Example trajectories you can find in MDAnalysisTests.

tanishy7777 · 2025-01-13T20:57:09Z

it should be the same as if you'd run it without parallelization. And I assume you want to stack/sum/whatever along the dimension that corresponds to the timestep -- you can probably guess which one it is if you run it on some example with known number of frames. Example trajectories you can find in MDAnalysisTests.

Got it. Will work on that!

tanishy7777 · 2025-01-18T11:15:46Z

it should be the same as if you'd run it without parallelization. And I assume you want to stack/sum/whatever along the dimension that corresponds to the timestep -- you can probably guess which one it is if you run it on some example with known number of frames. Example trajectories you can find in MDAnalysisTests

I tried making the custom_aggregator but some tests are still failing

These 2 lines specifically are causing the errors in all the 12 tests
assert_allclose(max(rdf.results.rdf[0][0][0]), value)
assert rdf.results.edges[0] == rmin

I am not sure how to resolve this, I will dig through the docs a bit more. Will update if I can figure it out.

tanishy7777 · 2025-01-18T16:52:13Z

Used a custom aggregator for results.count and switched the aggregation method for
results.edges and results.bins to ResultsGroup.ndarray_mean
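
Why these two aggregation choices agree with a serial run can be checked on a toy histogram. The sketch below uses only NumPy and made-up distances. `histogram_frames` is a hypothetical helper, and the example assumes every worker bins with the same range and number of bins, so the edges are identical on each worker and averaging them returns them unchanged, while the counts simply add.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: 6 frames with 50 distances each.
distances = rng.uniform(0.0, 15.0, size=(6, 50))


def histogram_frames(frames, nbins=10, rdf_range=(0.0, 15.0)):
    """Accumulate a distance histogram over a block of frames."""
    counts = np.zeros(nbins)
    edges = None
    for frame in frames:
        frame_counts, edges = np.histogram(frame, bins=nbins, range=rdf_range)
        counts += frame_counts
    return counts, edges


# Serial run over all frames.
serial_counts, serial_edges = histogram_frames(distances)

# Two workers, three frames each.
parts = [histogram_frames(distances[:3]), histogram_frames(distances[3:])]
combined_counts = np.sum([c for c, _ in parts], axis=0)  # counts add up
combined_edges = np.mean([e for _, e in parts], axis=0)  # edges are identical
```

Summing the edges instead would scale them by the number of workers, which would be consistent with an assertion such as `rdf.results.edges[0] == rmin` failing when `rmin` is nonzero.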

TL;DR: Parallelized both InterRDF and InterRDF_s classes in rdf.py

Thanks @marinegor for your help!

tanishy7777 · 2025-01-21T12:09:55Z

@orbeckst @RMeli @marinegor I think this is ready to be merged, can you please review it?

marinegor

hi @tanishy7777, sorry for the long review (again).

good work, thanks for your contribution! I've added my comments, main action points below:

move your custom aggregation function from the class to a standalone function, name appropriately and test
revert changes to test_xdr.py and core/selection.py (I'm guessing they were introduced by black or smth)
make sure you don't need to track self.volume_cum, since you're tracking self.results.volume_cum and assigning self.volume_cum in _conclude. I made suggestions regarding that but might have missed something; please make sure until _conclude only self.results.volume_cum is used.

marinegor · 2025-02-20T08:52:11Z

package/MDAnalysis/analysis/rdf.py

+    def func(arrs):
+        r"""Custom aggregator for nested arrays


please make it a separate function since it's not really needed for the class to function, and also to avoid potential issues with serialization that class methods (even static) sometimes have

and change the name to something more descriptive, e.g. nested_array_sum, that reflects the nature of the function.

finally, this function must be tested in test_rdf.py

marinegor · 2025-02-20T08:55:58Z

package/MDAnalysis/core/selection.py

revert it back to the original state since this change is not related to the PR. you can just commit on top to make this change disappear, that's fine.

marinegor · 2025-02-20T08:56:54Z

testsuite/MDAnalysisTests/coordinates/test_xdr.py

revert it back to the original state since this change is not related to the PR. you can just commit on top to make this change disappear, that's fine.

tanishy7777 · 2025-03-14T20:47:36Z

hi @tanishy7777, sorry for the long review (again).

good work, thanks for your contribution! I've added my comments, main action points below:

move your custom aggregation function from the class to a standalone function, name appropriately and test

revert changes to test_xdr.py and core/selection.py (I'm guessing they were introduced by black or smth)

make sure you don't need to track self.volume_cum, since you're tracking self.results.volume_cum and assigning self.volume_cum in _conclude. I made suggestions regarding that but might have missed something; please make sure until _conclude only self.results.volume_cum is used.

Sorry for the late reply. I will get to work on these changes! I had semester examinations so was quite busy the last 2 weeks.

mark analysis.rdf.InterRDF and analysis.rdf.InterRDF_s as not paralli…

6f186f3

…zable

tanishy7777 added 2 commits January 8, 2025 22:30

Adds parallization for analysis.rdf.InterRDF

4819757

Minor changes

6f10067

Merge branch 'MDAnalysis:develop' into parallize_rdf

de5d5e0

Adds custom aggregegator for InterRDF_s

c584b0b

tanishy7777 added 2 commits January 18, 2025 21:26

Fixes aggregation of results.edges

2c919ea

Parallizes InterRDF_s

ea6d4ae

tanishy7777 added 4 commits January 18, 2025 22:40

Minor changes

a897ca9

Fixes linter

eb91475

Fixes linter

8686aaa

Tests for parallization

76c2468

tanishy7777 changed the title ~~mark analysis.rdf.InterRDF and analysis.rdf.InterRDF_s as not parallizable~~ parallelize analysis.rdf.InterRDF and analysis.rdf.InterRDF_s Jan 20, 2025

Update CHANGELOG

9177338

tanishy7777 changed the title ~~parallelize analysis.rdf.InterRDF and analysis.rdf.InterRDF_s~~ Parallelizes analysis.rdf.InterRDF and analysis.rdf.InterRDF_s Jan 20, 2025

tanishy7777 changed the title ~~Parallelizes analysis.rdf.InterRDF and analysis.rdf.InterRDF_s~~ Parallelizes MDAnalysis.InterRDF and MDAnalysis.InterRDF_s Jan 20, 2025

tanishy7777 changed the title ~~Parallelizes MDAnalysis.InterRDF and MDAnalysis.InterRDF_s~~ Parallelizes MDAnalysis.analysis.InterRDF and MDAnalysis.analysis.InterRDF_s Jan 20, 2025

tanishy7777 mentioned this pull request Jan 25, 2025

Parallelizes MDAnalysis.analysis.msd #4896

Open

5 tasks

marinegor requested changes Feb 20, 2025

tanishy7777 and others added 2 commits February 22, 2025 17:31

Update package/MDAnalysis/analysis/rdf.py

4971c50

Co-authored-by: Egor Marin <[email protected]>

Merge branch 'develop' into parallize_rdf

724dc1e

orbeckst added Component-Analysis parallelization labels Mar 14, 2025

marinegor self-assigned this Mar 16, 2025

tanishy7777 and others added 4 commits March 24, 2025 02:20

Merge branch 'develop' into parallize_rdf

fc34fe7

refactor custom aggregator for rdf

14cb40e

remove uneccesary variables

e9b599c

Merge branch 'parallize_rdf' of https://github.com/tanishy7777/mdanal…

ca06fd2

…ysis into parallize_rdf
