-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New Gene Diagnostic Analysis #1461
base: master
Are you sure you want to change the base?
Conversation
…eins over a simulation's duration
… in multiple variants
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mpg19, congrats on submitting your first big PR!
Overall the code looks great! Thank you for breaking it up into digestible functions, it made things very readable. There are a lot of cool and useful plots here.
I had some suggestions on how to structure some files for improved generalizability/efficiency and to reduce redundant code to make the code easier to maintain going forward. Happy to chat if you have any questions.
mask = np.logical_or(self._path_data['variant'] == | ||
variant if variant is not None else slice(None), | ||
self._path_data['seed'] == | ||
seed if seed is not None else slice(None)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remember to correct the typing here to fix the mypy Jenkins test
"free_monomer_counts.py", | ||
"complexed_monomer_counts.py", | ||
"total_monomer_counts.py", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you put these in their alphabetical order within the list to match the code style?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the complexed_monomer_counts script needs to load in the free monomer counts and the total monomer counts, would it make sense to merge these three scripts into one monomer_counts script with a user specified option at the top where you could choose to generate total monomer counts, free monomer counts, and/or complexed monomer counts? You could implement this either as 3 stacked plots in the same PDF or three separate PDF files?
I would imagine in most cases you would want to run all 3 and this could be more convenient in terms of being able to run it with just one script call and also more memory/compute efficient because you wouldn't have to read in the data three separate times. This would also mean there is less redundant code between the scripts so if there was ever a change you would have to update fewer lines/files.
|
||
# Specifiy generations to be skipped if desired: | ||
SKIP_GENERATIONS = 0 # 0 -> no generations are skipped | ||
PLOT_AVERAGES = 0 # 0 -> no, 1 -> yes (if SKIP_GENERATIONS = 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does skip generations have to be 0 for plot averages to be 1? If this is always the case you might want to force it with an assert statement to prevent user error or undefined behavior
""" END USER INPUTS """ | ||
|
||
class Plot(cohortAnalysisPlot.CohortAnalysisPlot): | ||
def extract_data(self, simOutDir, cell_paths, monomer_ids, cistron_ids): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this function is shared between multiple different files it might make sense to define it like this in one file and import it to the others to reduce code redundancy - that way if changes ever need to be made (e.g. if a listener name changed or something) they will only have to be made in one place
inv_monomer_idx_dict.get(monomer_id) for monomer_id in protein_idxs] | ||
return protein_ids | ||
|
||
def get_LogData(self, protein_idxs, interest_protein_counts, index_vals=[]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
snake case
|
||
return avg_log_interest_proteins | ||
|
||
def filter_data(self, nonfiltered_protein_counts, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as filter_data in save_control_and_experimental_monomer_count_data
return nonzero_PCs, nonzero_ids, nonzero_idxs | ||
|
||
|
||
def generate_plot(self, var_num, all_protein_counts, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as before - since this doesn't export the plot within the function, plotting objects should either be passed in the function or something should be explicitly returned
plt.figure(figsize=(10, 10)) | ||
|
||
# find the location of the new gene's PCs within the dataframe: | ||
new_gene_PCs = var1_y[-1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we know for sure that this will always be the last entry? What if there are multiple new genes added? Maybe there is a safer way to check
otherstr = "y = " + str(round(slope[0], 3)) + "x + 0" | ||
|
||
# format the plot: | ||
#plt.axis('square') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be uncommented? or deleted?
This PR request adds a variety of analysis scripts to the model that help one analyze the “diagnostic” effects of inserting new genes into the E. coli chromosome. Visualizing the differences between outcomes of simulations that contain a new gene inserted to the chromosome (experimental variant) and do not contain a new gene (control variant) can provide valuable insights in understanding how the cell environment changes in response to genetically engineered modifications from a whole-cell perspective. Analysis scripts within this PR allow one to observe how the average total protein counts in a control simulation differ from those in an experimental simulation with a given translation efficiency and expression index for a new gene. Some examples of a variant script plot output can be found here:
variant_comparison_scatter_plot.pdf
In addition, this PR also provides new cohort analysis scripts that allow one to better observe the free, complexed, and total (free and complex forms combined) monomer counts across all cell generations for each seed within a given simulation. These plots allow one to visualize how the counts of a particular protein(s) may change over the course of generations within a simulation. See some examples of these plot outputs here: protein_count_visualization_plots.pdf
Lastly, this PR allows one to save the total average protein count data for cohort and variant simulations. This can be helpful when wanting to have the protein counts in excel form to analyze in different softwares outside the model and for pulling data down from HPCs where it may expire (for example, Sherlock SCRATCH data expires after 3 months). This data can also be useful for generating some interactive single-use plots locally.