Authors
Andy Kong
Publication date
2017
Description
Proteogenomics is an area of proteomics concerned with the detection of novel peptides and peptide variants nominated by genomics and transcriptomics experiments. While the term primarily refers to studies that use a customized protein database derived from select sequencing experiments, proteogenomics methods can also be applied to identify previously unobserved, or missing, proteins in a reference protein database. The identification of novel peptides is difficult, and results can be dominated by false positives if conventional computational and statistical approaches for shotgun proteomics are applied directly, without consideration of the challenges involved in proteogenomics analyses. In this dissertation, I systematically distill the sources of false positives in peptide identification and present potential remedies, including computational strategies that are necessary to make these approaches feasible for large datasets. In the first part, I analyze high-scoring decoys, which are false identifications assigned high confidence, using multiple peptide identification strategies to understand how they are generated and to develop strategies for reducing false positives. I also demonstrate that modified peptides can cause violations of the target-decoy assumptions, which are a cornerstone of error rate estimation in shotgun proteomics, leading to potential underestimation of the number of false positives. Second, I address computational bottlenecks in proteogenomics workflows through the development of two database search engines: EGADS and MSFragger. EGADS aims to address issues relating to the large sequence space …