Protein function prediction with GO - Part 3 #64

Draft
wants to merge 73 commits into base: dev

aditya0by0
Collaborator

@aditya0by0 aditya0by0 commented Nov 4, 2024

Note: The above issue will be implemented in 3 PRs:

Changes to be done in this PR

evaluation: Evaluate using the same metrics as DeepGO for comparing the models

From comment #36 (comment)

  • on a new branch: metrics for evaluation (I talked to Martin about the Fmax score: although it has some methodological issues, we should include it in our evaluation so that we can compare with DeepGO)
  • DeepGO-SE (paper): use these results as a baseline and integrate their data into our pipeline (there is a link to the dataset on their GitHub page)
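
For orientation, the Fmax score mentioned above can be sketched as follows. This is a minimal illustration of the protein-centric definition commonly used in CAFA-style evaluations, not the DeepGO evaluation script itself: the function name `fmax_score` and the threshold grid are assumptions, and the sketch does not propagate annotations along the GO hierarchy.

```python
import numpy as np


def fmax_score(y_true, y_score, thresholds=None):
    """Protein-centric Fmax (hypothetical helper, not part of chebai or DeepGO).

    y_true:  (n_proteins, n_labels) binary ground-truth matrix
    y_score: (n_proteins, n_labels) predicted scores in [0, 1]
    Returns (best F1 over all thresholds, threshold that achieved it).
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    if thresholds is None:
        # Assumed grid: 0.01, 0.02, ..., 1.00
        thresholds = np.linspace(0.01, 1.0, 100)

    n_true = y_true.sum(axis=1)
    has_labels = n_true > 0  # recall is averaged over proteins with annotations
    best_f, best_t = 0.0, None
    if not has_labels.any():
        return best_f, best_t

    for t in thresholds:
        y_pred = y_score >= t
        tp = (y_pred & y_true).sum(axis=1)
        n_pred = y_pred.sum(axis=1)
        covered = n_pred > 0  # precision is averaged over proteins with predictions
        if not covered.any():
            continue
        precision = (tp[covered] / n_pred[covered]).mean()
        recall = (tp[has_labels] / n_true[has_labels]).mean()
        if precision + recall > 0:
            f = 2 * precision * recall / (precision + recall)
            if f > best_f:
                best_f, best_t = f, float(t)
    return best_f, best_t
```

Note that precision is averaged only over proteins with at least one prediction above the threshold, while recall is averaged over all annotated proteins. This asymmetry is part of the definition.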

@aditya0by0 aditya0by0 self-assigned this Nov 4, 2024
@aditya0by0 aditya0by0 linked an issue Nov 4, 2024 that may be closed by this pull request
@aditya0by0 aditya0by0 requested a review from sfluegel05 November 7, 2024 10:15
@aditya0by0
Collaborator Author

I have made the suggested changes for migration. Please check.

Config for DeepGO1:

class_path: chebai.preprocessing.datasets.go_uniprot.DeepGO1MigratedData
init_args:
  go_branch: "MF"
  max_sequence_length: 1002
  reader_kwargs: {n_gram: 3}

Config for DeepGO2:

class_path: chebai.preprocessing.datasets.go_uniprot.DeepGO2MigratedData
init_args:
  go_branch: "MF"
  max_sequence_length: 1000
  reader_kwargs: {n_gram: 3}
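
As an illustration of how `n_gram: 3` and `max_sequence_length` might interact, here is a minimal sketch assuming overlapping n-grams in the style of the original DeepGO. The helper `to_ngrams` is hypothetical, and the actual chebai reader may tokenize differently.

```python
def to_ngrams(sequence: str, n: int = 3) -> list[str]:
    """Split a protein sequence into overlapping n-grams (hypothetical helper)."""
    # A sequence of length L yields L - n + 1 overlapping n-grams.
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


print(to_ngrams("MKVLA"))            # three overlapping trigrams
print(len(to_ngrams("A" * 1002)))    # number of trigrams for a 1002-residue sequence
```

Under this assumption, a sequence of 1002 residues produces 1000 trigram tokens.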

@aditya0by0 aditya0by0 marked this pull request as draft February 21, 2025 16:03
- prop annotations has both direct and transitive annotations
@aditya0by0
Collaborator Author

aditya0by0 commented Feb 28, 2025

@sfluegel05, I have made the suggested changes for SCOPe. Please check.
The notebook tutorial for SCOPe is also complete. Please review the notebook.

For DeepGO2, I re-checked the code and didn't find any discrepancy between our implementation and theirs.

@sfluegel05
Collaborator

I have generated a new SCOPe50 dataset, but there still seem to be labels which have 0 protein sequences assigned to them. Could you have a look at that?

@aditya0by0
Collaborator Author

@sfluegel05, I have resolved this issue. Please check.

@sfluegel05
Collaborator

Now, the number of instances per label is at least 1, but still fewer than 50 in many cases. The main issue seems to be that the threshold is applied before most of the processing. The function graph_to_raw_dataset() takes the graph as input and applies the threshold to it. Only after that do you resolve domains to class labels and sequences.

This should be the other way round:

  1. Find the sequences and all labels that can be applied to each sequence
  2. Based on that information, construct the graph
  3. Based on the graph, select the labels that pass the threshold
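
The three steps above can be sketched roughly as follows. This is only an illustration using plain dictionaries: the function names, the `parents` mapping (child label to parent labels) and the `direct_labels` mapping (sequence ID to directly assigned labels) are assumptions, not the data structures used in graph_to_raw_dataset().

```python
from collections import Counter


def ancestors(label, parents):
    """Return the label together with all of its ancestors in the hierarchy."""
    seen, stack = set(), [label]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def select_labels(direct_labels, parents, threshold):
    # 1. Find all labels (direct and inherited) that apply to each sequence.
    all_labels = {
        seq_id: set().union(*(ancestors(label, parents) for label in labels))
        for seq_id, labels in direct_labels.items()
    }

    # 2. Construct the graph from the labels that actually occur in the data.
    used = set().union(*all_labels.values())
    graph = {label: [p for p in parents.get(label, ()) if p in used] for label in used}

    # 3. Count instances per label and keep those that pass the threshold.
    counts = Counter(label for labels in all_labels.values() for label in labels)
    selected = {label for label in graph if counts[label] >= threshold}
    return all_labels, graph, selected
```

Because the counts are taken from the fully resolved sequence-to-label assignment, every selected label has at least `threshold` sequences by construction.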

I hope this helps!

@aditya0by0
Collaborator Author

Thanks for the suggestion. I have fixed the issue, and now all labels have at least 50 true instances for SCOPe50.
I also started a training run for it and am facing an error related to ELECTRA. Please check here. Please let me know if you have any suggestions on how to resolve this.

Also, I have made the suggested changes to the SCOPe notebook.

Please check.

@aditya0by0 aditya0by0 mentioned this pull request Mar 17, 2025
@sfluegel05
Collaborator

My first guess is that you have to change model.config.max_position_embeddings. This is set to 1,800 at the moment; apparently you need ~2,500 instead.
Thanks for making the changes to the notebook.
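
One way to estimate the required value before training is to check the longest tokenized input against the limit. The sketch below assumes one token per residue plus a fixed number of special tokens. Both the helper name `position_budget` and these tokenization assumptions are hypothetical and should be adapted to the actual reader.

```python
def position_budget(sequences, max_position_embeddings, n_special_tokens=1):
    """Return the required number of positions and the indices of inputs that exceed the limit.

    Assumes one token per residue plus `n_special_tokens` (e.g. a classification token).
    """
    lengths = [len(seq) + n_special_tokens for seq in sequences]
    required = max(lengths, default=0)
    too_long = [i for i, n in enumerate(lengths) if n > max_position_embeddings]
    return required, too_long


# Example with made-up sequence lengths
sequences = ["A" * 2499, "A" * 2600, "A" * 100]
print(position_budget(sequences, max_position_embeddings=2500))
print(position_budget(sequences, max_position_embeddings=3000))
```

If any index is reported as too long, the limit either needs to be raised or the corresponding sequences need to be filtered or truncated.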

@aditya0by0
Collaborator Author

@sfluegel05, I increased the max_position_embeddings of ELECTRA to 3000 in (2b0ed0a) since it was throwing the same error at 2500.

I have already started the training, but the issue now is that only 5 epochs have been completed in 17 hours.

@aditya0by0
Collaborator Author

Please check here for the results after 24 hours of training: only 6 epochs were completed. The batch file has a maximum timeout of 24 hours.

Successfully merging this pull request may close these issues.

Add SCOPe dataset to our pipeline
Protein function prediction with GO