Improvements for processing prokaryotic datasets using the nf-core/rnaseq pipeline #1512
Comments
Hi @sebschulz1! Thank you for the beautifully thought-out summary 🤩 I completely agree with you that the pipeline hasn't handled prokaryotic genomes optimally, partly because of poor annotation consistency and partly to keep the maintenance overhead down. From an implementation perspective, has anyone tried running STAR directly against the genome rather than the transcriptome and then using featureCounts downstream? I'm trying to figure out whether we can avoid adding another aligner to the pipeline if we don't see any explicit gains.
Indeed, thanks a lot for this thorough assessment. It would be most welcome if the three of you were eager to work in this direction! I wonder what the difference between using
Also, by setting an extremely high penalty on the opening of gaps, STAR can presumably be used as a splice-unaware aligner:
Thanks for all the above everyone :-). It's been on my list for a while to improve the prokaryotic stuff to incorporate suggestions from previous discussion, but I maybe hadn't been planning changes quite as extensive as some of what's here, so it's good to have the proposals clear.
Adding a new aligner would be quite a bit of additional complexity: we'd have new params, new genome indexing steps etc. It may be worth it, but when we start heading that way I start to wonder if maybe we should be thinking of a specialist workflow. Preprocessing is already a subworkflow, for example, so could be re-used in a new workflow easily. Before we start thinking that way, let's see if existing workarounds (with improvements as suggested by @MatthiasZepper) do the trick.
Stage 1
The first step is to see how effective adapting the current components for prokaryotic data can be, using existing workarounds. The Salmon error doesn't invalidate the approach, it just means we need to investigate/debug as necessary. I think we probably need a prokaryotic profile. That could override the existing workflow's parameterisation (e.g.
Stage 2
If parameterisation is insufficient and we need e.g. preprocessing steps specifically for prokaryotic data, we could add some new pipeline logic, maybe moving eukaryotic and prokaryotic alignment / quantification to their own local subworkflows,
Stage 3
The above have proved insufficient, and we move to a completely different toolset for prokaryotic data, e.g. Bowtie2 -> featureCounts. Again, at that point I'd be asking if it was actually appropriate to do both types of analysis in the same pipeline. Might be, but the question would need to be asked.
Description of feature
Authors: Juliana Assis, Sebastian Schulz and Albert Palleja
We would like to emphasize the importance of facilitating the processing of prokaryotic bulk RNAseq data using the v3.x nf-core/rnaseq pipelines, as proposed by Matthias Zepper previously (#1085).
Given the growing importance of (large-scale) transcriptomic analyses in microbial biology and biotechnology, there is a need for standardized and thus reproducible processing of prokaryotic RNAseq data (bacterial and archaeal).
Although bacterial RNAseq data have been processed with nf-core/rnaseq pipelines previously, this was – to the best of our knowledge – either performed using the older v1.4.2 pipeline (Muehler et al., 2022 and Muehler et al., 2024) or using a workaround that required prior manipulation of input files in combination with setting specific pipeline parameters (summarized by Marine Cambon in a Slack thread).
However, our recent attempts to run the nf-core/rnaseq pipeline v3.16.1, launched from the Seqera platform, using said workaround failed (for details see section "Pipeline testing").
Conclusion
Neither the use of old pipeline versions nor the manipulation of input files is desirable from the perspective of using the most up-to-date software pipelines and standardized data processing workflows, respectively.
Pipeline testing
We followed the suggested workaround procedure for processing prokaryotic RNAseq data sets by using a genome fasta, a modified gtf and a self-generated transcript fasta file as input to the pipeline (v3.16.1) and by setting the following pipeline parameters:
--extra_star_align_args "--sjdbGTFfeatureExon CDS" (suggested in slack post)
--featurecounts_feature_type "CDS" (modified by us to use CDS from the gtf file; pipeline default setting is "exon")
The pipeline failed at the Salmon quantification stage. The error message reports a mismatch of sequence lengths between entries of the user-provided transcript fasta file and the pipeline-generated SAM file.
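One way to narrow down a length mismatch of this kind is to compare the sequence lengths in the transcript fasta with the lengths implied by the gtf. The following is a minimal diagnostic sketch, not part of the pipeline. It assumes that the first token of each fasta header equals the gtf "transcript_id" and that transcript lengths in the SAM header are the summed CDS lengths per transcript; both are assumptions that may not hold for a given annotation. All function names are hypothetical.

```python
import re
from collections import defaultdict

# Captures a non-empty, quoted transcript_id value from the GTF attribute column.
_TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def gtf_feature_lengths(gtf_lines, feature_type="CDS"):
    """Sum the lengths of all features of one type per transcript_id."""
    lengths = defaultdict(int)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature_type:
            continue
        match = _TRANSCRIPT_ID.search(fields[8])
        if match:
            # GTF coordinates are 1-based and inclusive.
            lengths[match.group(1)] += int(fields[4]) - int(fields[3]) + 1
    return dict(lengths)


def fasta_lengths(fasta_lines):
    """Return sequence length per record, keyed by the first header token."""
    lengths = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            name = (line[1:].split() or [""])[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def length_mismatches(gtf_lines, fasta_lines, feature_type="CDS"):
    """Map transcript_id -> (gtf length, fasta length) wherever the two differ.

    A value of None means the id is missing from that file.
    """
    expected = gtf_feature_lengths(gtf_lines, feature_type)
    observed = fasta_lengths(fasta_lines)
    return {
        tid: (expected.get(tid), observed.get(tid))
        for tid in sorted(set(expected) | set(observed))
        if expected.get(tid) != observed.get(tid)
    }
```

Calling length_mismatches() with the lines of the modified gtf and of the self-generated transcript fasta returns only the disagreeing entries, which may help to decide whether the problem lies in the fasta generation or in the annotation itself.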
Error message
Steps to reproduce the error
Download the Escherichia coli RNAseq data set PRJNA554579 published by Rychel et al., 2023.
Download the reference genome fasta and the annotation gtf file for Escherichia coli str. K-12 substr. MG1655 (accession: GCF_000005845.2).
Extract and modify the gtf file by deleting all lines with an empty "transcript_id" field.
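The gtf modification in the last step can be sketched as follows. This is an illustration of the filtering rule only, not the exact commands we ran; the function names and the file names in the usage note are hypothetical, and keeping "#" header lines is our assumption.

```python
import re

# Matches a transcript_id attribute whose quoted value is non-empty.
_TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def has_transcript_id(gtf_line):
    """Return True for header lines and for records with a non-empty transcript_id."""
    if gtf_line.startswith("#"):
        return True  # keep header/comment lines (assumption)
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return False  # blank or malformed record: drop it
    return _TRANSCRIPT_ID.search(fields[8]) is not None


def filter_gtf(lines):
    """Keep only the lines accepted by has_transcript_id()."""
    return [line for line in lines if has_transcript_id(line)]
```

For example, reading a file such as genomic.gtf, passing its lines through filter_gtf() and writing the result to a new file yields an annotation in which every remaining record carries a transcript_id.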
Command
Parameters
Recommendations for v3.x pipeline development
The following points are subject to discussion with the nf-core community.
Prokaryotic vs. eukaryotic mode
It might be beneficial to have a parameter to run the pipeline in either --prokaryotic or --eukaryotic mode, as suggested previously (#1085).
Integration of featureCounts
Re-introducing featureCounts into v3.x pipelines might be beneficial for processing prokaryotic datasets. This has been suggested recently (Perelo et al., 2024):
(text snippet from Perelo et al., 2024)
Alignment and Quantification Tools
Using STAR might not be optimal for prokaryotic datasets. A splice-unaware aligner might be the better choice since bacterial transcripts lack introns and are therefore not (alternatively) spliced (Mahmud et al., 2021):
(text snippet from Mahmud et al., 2021)
We suggest integrating Bowtie/Bowtie2 as a splice-unaware aligner and featureCounts for quantification. This combination is used in other prokaryotic RNAseq processing pipelines (see the iModulon pipeline and the ProkSeq pipeline).
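To make the suggested combination concrete, here is a rough sketch of the three command lines involved (index building, alignment, counting), written as Python helpers that only assemble the arguments. It is not how the iModulon or ProkSeq pipelines are implemented and not a proposal for the actual nf-core modules. The helper names and file names are hypothetical, counting on "CDS" features grouped by "gene_id" is an assumption carried over from the workaround above, and --countReadPairs is only available in recent featureCounts releases.

```python
import shlex


def bowtie2_build_cmd(genome_fasta, index_prefix):
    """Command to build a Bowtie2 index from the genome fasta."""
    return ["bowtie2-build", genome_fasta, index_prefix]


def bowtie2_align_cmd(index_prefix, reads, out_sam, threads=4):
    """Command to align single-end (one file) or paired-end (two files) reads."""
    cmd = ["bowtie2", "-p", str(threads), "-x", index_prefix]
    if len(reads) == 2:
        cmd += ["-1", reads[0], "-2", reads[1]]
    else:
        cmd += ["-U", reads[0]]
    return cmd + ["-S", out_sam]


def featurecounts_cmd(gtf, alignments, out_counts, paired=False,
                      feature_type="CDS", attribute="gene_id", threads=4):
    """Command to count reads per feature; CDS/gene_id are assumed defaults."""
    cmd = ["featureCounts", "-T", str(threads),
           "-t", feature_type, "-g", attribute,
           "-a", gtf, "-o", out_counts]
    if paired:
        # Count fragments instead of reads; needs a recent featureCounts release.
        cmd += ["-p", "--countReadPairs"]
    return cmd + [alignments]


if __name__ == "__main__":
    # Print the commands instead of running them; file names are placeholders.
    for cmd in (
        bowtie2_build_cmd("genome.fasta", "ecoli_index"),
        bowtie2_align_cmd("ecoli_index", ["sample_1.fastq.gz"], "sample.sam"),
        featurecounts_cmd("genes.gtf", "sample.sam", "counts.txt"),
    ):
        print(shlex.join(cmd))
```

Strandedness (featureCounts -s) is left at its default here and would need to be set per library.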