Improved gene annotation of the fungal wheat pathogen Zymoseptoria tritici based on combined Iso-Seq and RNA-Seq evidence
bioRxiv, 2023•biorxiv.org
Abstract
Despite large omics datasets, the establishment of a reliable gene annotation is still challenging for eukaryotic genomes. Here, we used the reference genome of the major fungal wheat pathogen Zymoseptoria tritici (isolate IPO323) as a case study to develop methods to improve eukaryotic gene prediction. Four previous IPO323 annotations identified 10,933 to 13,260 gene models, but only one third of these coding sequences (CDS) have identical structures. To resolve these discrepancies and improve gene models, we generated full-length transcripts using long-read sequencing. This dataset was used together with other evidence (RNA-Seq transcripts and protein sequences) to generate novel ab initio gene models. The selection of the best structure among novel and existing gene models was performed according to transcript and protein evidence using InGenAnnot, a novel bioinformatics suite. Overall, 13,414 re-annotated gene models (RGMs) were predicted, including 671 new genes, among which 53 encoded effector candidates. This process corrected many of the errors (15%) observed in previous gene models (coding sequence fusions, false introns, missing exons). While fungal genomes have poor annotations of untranslated regions (UTRs), our Iso-Seq long-read sequences outlined 5' and 3' UTRs for 73% of the RGMs. Alternative transcripts were identified for 13% of RGMs, mostly due to intron retention (75%), likely corresponding to unprocessed pre-mRNAs. A total of 353 genes displayed alternative transcripts with combinations of previously predicted or novel exons. Long non-coding transcripts (lncRNAs) and double-stranded RNAs from two fungal viruses were also identified. Most lncRNAs corresponded to antisense transcripts of genes (52%). lncRNAs that were up- or down-regulated during infection were enriched in antisense transcripts (70%), suggesting their involvement in the control of gene expression.
Our results showed that combining different ab initio gene predictions and evidence-driven curation using InGenAnnot improved the quality of gene annotations of a compact eukaryotic genome. Our analysis also provided new insights into the transcriptional landscape of Z. tritici, helping develop an increasingly complex picture of its biology.
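The abstract reports that only a fraction of CDS structures are identical across the previous annotations. A minimal sketch of how such a comparison could be set up follows. It assumes each gene model is reduced to a key made of chromosome, strand, and sorted CDS intervals. The function names, data layout, and toy coordinates are hypothetical illustrations, not part of InGenAnnot or the paper's actual method.

```python
from typing import Iterable, List, Set, Tuple

Interval = Tuple[int, int]
CdsKey = Tuple[str, str, Tuple[Interval, ...]]


def cds_key(chrom: str, strand: str, cds: Iterable[Interval]) -> CdsKey:
    """Reduce a gene model to a hashable key (hypothetical representation).

    Two gene models get the same key only if they lie on the same
    chromosome and strand and have exactly the same CDS intervals.
    """
    return (chrom, strand, tuple(sorted(cds)))


def shared_structures(annotations: List[Set[CdsKey]]) -> Set[CdsKey]:
    """Return the CDS structures present in every annotation."""
    if not annotations:
        return set()
    return set.intersection(*annotations)


# Toy data (invented coordinates): two annotations with two gene models each.
ann_a = {
    cds_key("chr1", "+", [(100, 200), (300, 400)]),
    cds_key("chr1", "-", [(900, 1200)]),
}
ann_b = {
    cds_key("chr1", "+", [(300, 400), (100, 200)]),  # same structure, other order
    cds_key("chr1", "-", [(900, 1100)]),             # different CDS end
}

shared = shared_structures([ann_a, ann_b])
all_structures = ann_a | ann_b
fraction_identical = len(shared) / len(all_structures)
print(f"{len(shared)} of {len(all_structures)} distinct CDS structures are shared")
```

In practice the intervals would be parsed from the CDS features of each annotation file. Exact matching on interval tuples is a deliberately strict criterion: a single shifted splice site makes two models count as different.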