Protein function prediction with GO - Part 3 #64

aditya0by0 · 2024-11-04T12:07:11Z

PR for the Issue Protein function prediction with GO #36

Note: The above issue will be implemented in 3 PRs:

Protein function prediction with GO #39 (Merged)
Protein function prediction with GO - Part 2 #57 (Merged)
Protein function prediction with GO - Part 3 #64

PR for the issue Add SCOPe dataset to our pipeline #67

Changes to be done in this PR

evaluation: Evaluate using the same metrics as DeepGO for comparing the models

on a new branch: metrics for evaluation (I talked to Martin about the Fmax score: although it has some methodological issues, we should include it in our evaluation to allow a comparison with DeepGO)

DeepGO-SE (paper): use these results as a baseline, integrate their data into our pipeline (there is a link to the dataset on their GitHub page)
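
For reference, the Fmax score mentioned above can be sketched as follows. This is a minimal illustration that assumes the protein-centric definition used in CAFA-style evaluations (precision averaged over proteins with at least one prediction, recall averaged over all annotated proteins, maximum F-measure over a threshold grid). The input format, the 0.01 threshold grid, and the function name `fmax` are assumptions made for this example; it is not the evaluation script added in this PR.

```python
def fmax(y_true, y_score, thresholds=None):
    """Protein-centric Fmax.

    y_true:  list of sets with the true labels of each protein
    y_score: list of dicts mapping label -> predicted score in [0, 1]
    Returns (best F-measure, threshold at which it was reached).
    """
    if thresholds is None:
        # Assumed grid: 0.01, 0.02, ..., 1.00
        thresholds = [t / 100 for t in range(1, 101)]
    best_f, best_t = 0.0, None
    for t in thresholds:
        precisions, recalls = [], []
        for true, scores in zip(y_true, y_score):
            if not true:
                continue  # proteins without any true label are skipped
            pred = {label for label, s in scores.items() if s >= t}
            tp = len(pred & true)
            recalls.append(tp / len(true))
            if pred:
                # precision only counts proteins with at least one prediction
                precisions.append(tp / len(pred))
        if not precisions:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            f = 2 * p * r / (p + r)
            if f > best_f:
                best_f, best_t = f, t
    return best_f, best_t


# Toy data (hypothetical labels and scores)
y_true = [{"GO:1", "GO:2"}, {"GO:3"}]
y_score = [
    {"GO:1": 0.9, "GO:2": 0.4, "GO:3": 0.1},
    {"GO:3": 0.8, "GO:1": 0.3},
]
score, threshold = fmax(y_true, y_score)
```

Because the threshold is chosen on the same data the score is reported on, Fmax is an optimistic summary, which is one of the methodological concerns alluded to above.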

aditya0by0 · 2024-11-13T22:45:41Z

I have made the suggested changes for migration. Please check.

Config for DeepGO1:

class_path: chebai.preprocessing.datasets.go_uniprot.DeepGO1MigratedData
init_args:
  go_branch: "MF"
  max_sequence_length: 1002
  reader_kwargs: {n_gram: 3}

Config for DeepGO2:

class_path: chebai.preprocessing.datasets.go_uniprot.DeepGO2MigratedData
init_args:
  go_branch: "MF"
  max_sequence_length: 1000
  reader_kwargs: {n_gram: 3}
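
To illustrate what `n_gram: 3` means for a protein sequence, here is a minimal sketch. It assumes overlapping n-grams with a stride of 1; whether the reader in chebai tokenizes exactly this way is an assumption, and `ngram_tokens` is a hypothetical helper name.

```python
def ngram_tokens(sequence, n=3):
    """Return the overlapping n-grams (stride 1) of a protein sequence."""
    if len(sequence) < n:
        return []
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


# "X" (unknown amino acid) is treated like any other residue here.
tokens = ngram_tokens("MKTAXLV")
# → ['MKT', 'KTA', 'TAX', 'AXL', 'XLV']

# Under this assumption, a sequence of 1002 residues yields 1000 trigrams.
n_trigrams = len(ngram_tokens("A" * 1002))
```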

aditya0by0 · 2025-02-28T11:15:19Z

@sfluegel05, I have made the suggested changes for scope. Please check.
The tutorial notebook for scope is also complete. Please review the notebook.

For DeepGO2, I re-checked the code and didn't find any discrepancy between our implementation and theirs.

chebai/models/ffn.py

sfluegel05 · 2025-03-04T08:34:47Z

> @sfluegel05, I have made the suggested changes for scope. Please check.

I have generated a new SCOPe50 dataset, but there still seem to be labels which have 0 protein sequences assigned to them. Could you have a look at that?

aditya0by0 · 2025-03-10T12:13:26Z

> > @sfluegel05, I have made the suggested changes for scope. Please check.
>
> I have generated a new SCOPe50 dataset, but there still seem to be labels which have 0 protein sequences assigned to them. Could you have a look at that?

@sfluegel05, I have resolved this issue. Please check.

sfluegel05 · 2025-03-12T16:51:54Z

Now, the number of instances per label is at least 1, but still less than 50 in many cases. The main issue seems to be that the threshold is applied before most of the processing. In the function graph_to_raw_dataset(), the graph is given as an input. Based on that graph, the threshold is applied and only after that, you do all the resolving from domains to class labels and sequences.

This should be the other way round:

1. Find the sequences and all labels that can be applied to each sequence
2. Based on that information, construct the graph
3. Based on the graph, select the labels that pass the threshold

I hope this helps!
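
The ordering described above can be sketched roughly like this. It is a toy illustration only: the hierarchy, the sequence names, and the helpers `with_ancestors` and `select_labels` are hypothetical and do not reproduce the code in graph_to_raw_dataset().

```python
from collections import Counter

# Hypothetical toy hierarchy: child label -> parent label (None for a root).
PARENT = {
    "a.1.1": "a.1",
    "a.1.2": "a.1",
    "a.1": "a",
    "b.1": "b",
    "a": None,
    "b": None,
}


def with_ancestors(label, parent):
    """Return the label together with all of its ancestors."""
    labels = set()
    while label is not None:
        labels.add(label)
        label = parent[label]
    return labels


def select_labels(seq_to_labels, parent, threshold):
    # Step 1: resolve every label (including ancestors) for each sequence.
    full = {
        seq: set().union(*(with_ancestors(l, parent) for l in labels))
        for seq, labels in seq_to_labels.items()
    }
    # Step 2: build the graph from the labels that actually occur in the data.
    nodes = set().union(*full.values())
    edges = {(parent[n], n) for n in nodes if parent[n] is not None}
    # Step 3: only now count sequences per label and apply the threshold.
    counts = Counter(label for labels in full.values() for label in labels)
    selected = {n for n in nodes if counts[n] >= threshold}
    return full, edges, selected


seq_to_labels = {"SEQ1": {"a.1.1"}, "SEQ2": {"a.1.2"}, "SEQ3": {"b.1"}}
full, edges, selected = select_labels(seq_to_labels, PARENT, threshold=2)
# selected == {"a", "a.1"}: only these labels have at least 2 sequences
```

Since the counts are taken after sequences have been resolved to labels, every selected label is guaranteed to have at least `threshold` sequences in the final dataset.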

aditya0by0 · 2025-03-15T17:26:21Z

> Now, the number of instances per label is at least 1, but still less than 50 in many cases. The main issue seems to be that the threshold is applied before most of the processing. In the function graph_to_raw_dataset(), the graph is given as an input. Based on that graph, the threshold is applied and only after that, you do all the resolving from domains to class labels and sequences.
>
> This should be the other way round:
>
> 1. Find the sequences and all labels that can be applied to each sequence
> 2. Based on that information, construct the graph
> 3. Based on the graph, select the labels that pass the threshold
>
> I hope this helps!

Thanks for the suggestion. I have fixed the issue, and now all labels have at least 50 true instances for SCOPe50.
I have also started a training run for it and am facing an error related to ELECTRA. Please check here. Please let me know if you have any suggestions on how to resolve this.

Also, I have made the suggested changes to the SCOPe notebook.

Please check.

sfluegel05 · 2025-03-18T15:50:35Z

My first guess is that you have to change model.config.max_position_embeddings. This is set to 1,800 at the moment; apparently you need ~2,500 instead.
Thanks for making the changes to the notebook.
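
A small check like the following could help pick the value instead of guessing. It is a sketch that assumes one token per residue plus a fixed number of special tokens; the helper names and the toy sequences are hypothetical, and the real tokenization in the pipeline may differ.

```python
def required_positions(sequences, n_special_tokens=1):
    """Number of positions needed for the longest input (assumed: one token per residue)."""
    return max(len(seq) for seq in sequences) + n_special_tokens


def check_position_budget(sequences, max_position_embeddings, n_special_tokens=1):
    """Raise if the longest input does not fit into the configured position table."""
    needed = required_positions(sequences, n_special_tokens)
    if needed > max_position_embeddings:
        raise ValueError(
            f"longest input needs {needed} positions, "
            f"but only {max_position_embeddings} are configured"
        )
    return needed


# Toy sequences of length 10 and 25 (not real data)
toy_sequences = ["M" * 10, "M" * 25]
needed = check_position_budget(toy_sequences, max_position_embeddings=30)
```

Running such a check over the dataset before training would surface the mismatch early rather than as an indexing error inside the embedding layer.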

aditya0by0 · 2025-03-19T11:35:51Z

@sfluegel05, I increased the max_position_embeddings of ELECTRA to 3000 in (2b0ed0a) since it was throwing the same error at 2500.

I have already started the training, but the issue now is that only 5 epochs have been completed in 17 hours.

aditya0by0 · 2025-03-19T20:09:11Z

> @sfluegel05, I increased the max_position_embeddings of ELECTRA to 3000 in (2b0ed0a) since it was throwing the same error at 2500.
>
> I have already started the training, but the issue now is that only 5 epochs have been completed in 17 hours.

Please check the results here: after 24 hours of training, only 6 epochs were completed. The batch file has a maximum timeout of 24 hours.

script to evaluate go predictions

bdba442

aditya0by0 self-assigned this Nov 4, 2024

aditya0by0 mentioned this pull request Nov 4, 2024

Protein function prediction with GO #36

Open

aditya0by0 linked an issue Nov 4, 2024 that may be closed by this pull request

Protein function prediction with GO #36

Open

aditya0by0 added 12 commits November 4, 2024 15:22

Merge branch 'dev' into protein_prediction

264bd94

add fmax to evaluation script

6c0fce1

Merge branch 'dev' into protein_prediction

154e827

add base code for deep_go data migration

58ae92d

- migration from deep go format to chebai->go_uniprot format

varry fmax threshold as per paper

78a38de

go_uniprot: add sequence len to docstring

3a4e007

update experiment evidence codes as per DeepGo SE

227a014

- #36 (comment)

Merge branch 'dev' into protein_prediction

33436e8

consIder X as a valid amino acid as per DeepGO-SE

c6d60cd

- #36 (comment)

deepgo se mirgration : add class to migrate

ca5461f

Merge branch 'dev' into protein_prediction

af54954

migration: rectify errors

dfb9430

aditya0by0 requested a review from sfluegel05 November 7, 2024 10:15

aditya0by0 added 9 commits November 7, 2024 13:25

protein trigram containing tokenS with X

085b13b

- #36 (comment)

protein token unigram contain X

3e0bae0

- #36 (comment)

add migration for deepgo1 - 2018 paper

99b5af1

deepgo1: create non-exclusive val set as a placeholder

a15d492

deepgo1: further split train set into train and val for

e0a8524

- +migration structure changes

migration script update

093be28

add classes to use migrated deepgo data

14db9d6

deepgo: minor code change

8922d4d

modify prints to display actual file name

796356c

aditya0by0 added 3 commits November 17, 2024 23:42

create sub dir for deego dataset and move rel files

3c11a69

update imports as per new deepGO dir

2b571c5

update import dir for pretrain test

f75e30b

aditya0by0 added 6 commits February 16, 2025 00:47

scope: data filtering update

d3fd0f2

- consider proteins domain in the dataset which maps to any selected node irrespective of the hierarchy level

scope: avoid data fragmentation and add progress bar

c791893

scope: vectorized operation instead of df.itterows

aad16d9

- https://stackoverflow.com/a/24871316/17626445

scope: fix multiple chain filtering

13b8795

scope: tutorial for scope data exploration

4572272

scope: update tutorial

eba0417

aditya0by0 marked this pull request as draft February 21, 2025 16:03

aditya0by0 added 2 commits February 21, 2025 18:02

scope: add more scope details to tutorial

dad6f76

minor changes: deepgo configs + scope

fd6dd01

aditya0by0 mentioned this pull request Feb 23, 2025

set out_dim dynamically #74

Draft

deepgo2 migration: exp_annoations not needed

4a8f821

- prop annotations has both direct and transitive annotations

aditya0by0 commented Mar 2, 2025

chebai/models/ffn.py (outdated)

fix scope version in scope50.yml

1c432de

sfluegel05 and others added 3 commits March 4, 2025 09:51

modify notebook introduction

f13e935

ffn: fix error for loss kwargs

36e6162

scope: fix for no True labels for some classes/columns

93c7fc5

aditya0by0 added 4 commits March 14, 2025 10:44

Merge branch 'dev' into protein_prediction

6d7b467

scope: fix for true values less given threshold for some labels

767b210

go_notebook: update import statement

081b44d

scope notebook: add scope description and minor changes

81c1348

aditya0by0 mentioned this pull request Mar 17, 2025

Ensemble Models #77

Draft

electra config: increase max_postional_embeddings to 3000

2b0ed0a
