New Gene Diagnostic Analysis #1461

mpg19 · 2025-02-10T20:00:07Z

This PR request adds a variety of analysis scripts to the model that help one analyze the “diagnostic” effects of inserting new genes into the E. coli chromosome. Visualizing the differences between outcomes of simulations that contain a new gene inserted to the chromosome (experimental variant) and do not contain a new gene (control variant) can provide valuable insights in understanding how the cell environment changes in response to genetically engineered modifications from a whole-cell perspective. Analysis scripts within this PR allow one to observe how the average total protein counts in a control simulation differ from those in an experimental simulation with a given translation efficiency and expression index for a new gene. Some examples of a variant script plot output can be found here:
variant_comparison_scatter_plot.pdf

In addition, this PR also provides new cohort analysis scripts that allow one to better observe the free, complexed, and total (free and complex forms combined) monomer counts across all cell generations for each seed within a given simulation. These plots allow one to visualize how the counts of a particular protein(s) may change over the course of generations within a simulation. See some examples of these plot outputs here: protein_count_visualization_plots.pdf

Lastly, this PR allows one to save the total average protein count data for cohort and variant simulations. This can be helpful when wanting to have the protein counts in excel form to analyze in different softwares outside the model and for pulling data down from HPCs where it may expire (for example, Sherlock SCRATCH data expires after 3 months). This data can also be useful for generating some interactive single-use plots locally.

…eins over a simulation's duration

…s in the titles

…es of code

… in multiple variants

…plots

rjuenemann

@mpg19, congrats on submitting your first big PR!

Overall the code looks great! Thank you for breaking it up into digestible functions, it made things very readable. There are a lot of cool and useful plots here.

I had some suggestions on how to structure some files for improved generalizability/efficiency and to reduce redundant code to make the code easier to maintain going forward. Happy to chat if you have any questions.

rjuenemann · 2025-02-16T06:55:59Z

models/ecoli/analysis/AnalysisPaths.py

+		mask = np.logical_or(self._path_data['variant'] ==
+							  variant if variant is not None else slice(None),
+							  self._path_data['seed'] ==
+							  seed if seed is not None else slice(None))


remember to correct the typing here to fix the mypy Jenkins test

rjuenemann · 2025-02-16T06:58:34Z

models/ecoli/analysis/cohort/__init__.py

+	"free_monomer_counts.py",
+	"complexed_monomer_counts.py",
+	"total_monomer_counts.py",


Could you put these in their alphabetical order within the list to match the code style?

rjuenemann · 2025-02-16T07:03:19Z

models/ecoli/analysis/cohort/complexed_monomer_counts.py

If the complexed_monomer_counts script needs to load in the free monomer counts and the total monomer counts, would it make sense to merge these three scripts into one monomer_counts script with a user specified option at the top where you could choose to generate total monomer counts, free monomer counts, and/or complexed monomer counts? You could implement this either as 3 stacked plots in the same PDF or three separate PDF files?

I would imagine in most cases you would want to run all 3 and this could be more convenient in terms of being able to run it with just one script call and also more memory/compute efficient because you wouldn't have to read in the data three separate times. This would also mean there is less redundant code between the scripts so if there was ever a change you would have to update fewer lines/files.

rjuenemann · 2025-02-16T07:05:55Z

models/ecoli/analysis/cohort/complexed_monomer_counts.py

+
+# Specifiy generations to be skipped if desired:
+SKIP_GENERATIONS = 0 # 0 -> no generations are skipped
+PLOT_AVERAGES = 0 # 0 -> no, 1 -> yes (if SKIP_GENERATIONS = 0)


Why does skip generations have to be 0 for plot averages to be 1? If this is always the case you might want to force it with an assert statement to prevent user error or undefined behavior

rjuenemann · 2025-02-16T07:07:47Z

models/ecoli/analysis/cohort/complexed_monomer_counts.py

+""" END USER INPUTS """
+
+class Plot(cohortAnalysisPlot.CohortAnalysisPlot):
+	def extract_data(self, simOutDir, cell_paths, monomer_ids, cistron_ids):


If this function is shared between multiple different files it might make sense to define it like this in one file and import it to the others to reduce code redundancy - that way if changes ever need to be made (e.g. if a listener name changed or something) they will only have to be made in one place

rjuenemann · 2025-02-16T08:54:36Z