A study on optimizing MarkDuplicate in genome sequencing pipeline

Q Zhao - Proceedings of the 5th International Conference on Bioinformatics Research …, 2018 - dl.acm.org
MarkDuplicate is typically one of the most time-consuming operations in the whole-genome sequencing pipeline. The Picard tool, which biologists widely use to sort reads in genome data and to mark duplicate reads in the sorted data, performs relatively poorly on MarkDuplicate because of its single-threaded, sequential Java implementation, and this has seriously affected current bioinformatics research. To accelerate MarkDuplicate in Picard, we present a two-stage optimization as a preliminary study on next-generation bioinformatics software tools that better serve bioinformatics research. In the first stage, we improve the original algorithm for tracking optical duplicate reads by eliminating a large number of redundant operations. As a result, we achieve up to a 50X speedup for the second step alone and a 9.57X speedup for the overall process. In the second stage, we redesign the I/O processing mechanism of MarkDuplicate, which transforms between on-disk genome files and in-memory genome data, by using the ADAM format instead of the previous SAM format, and we implement a cloud-scale MarkDuplicate application in Scala. Our evaluation is performed on a Spark cluster with 25 worker nodes and the Hadoop Distributed File System. According to the evaluation results, our cloud-scale MarkDuplicate provides the same output with better performance than the original Picard tool and other existing similar tools. Specifically, across the 13 sets of real whole-genome data we used for evaluation at both stages, the best improvement is a reduction in runtime of 92 hours in total, and the average reduction is 48.69 hours.
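
The abstract does not spell out the duplicate-marking algorithm, so the following is only a minimal sketch of the general idea, not the paper's or Picard's implementation. It assumes single-end reads whose unclipped 5' position is already known, groups them by chromosome, position, and strand, and keeps the read with the highest base-quality sum in each group. The `Read` type, its fields, and `mark_duplicates` are hypothetical names introduced for illustration; optical-duplicate tracking and paired-end handling are left out.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical, simplified read record (not a SAM/ADAM type)."""
    name: str
    chrom: str
    pos: int        # unclipped 5' position, assumed precomputed
    reverse: bool   # True if the read maps to the reverse strand
    qual_sum: int   # sum of base qualities, used as the score


def mark_duplicates(reads):
    """Return the set of read names flagged as duplicates."""
    # Reads sharing chromosome, 5' position, and strand are candidates
    # for being copies of the same original fragment.
    groups = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.pos, read.reverse)].append(read)

    duplicates = set()
    for group in groups.values():
        # Keep the highest-scoring read; on ties, max() keeps the first seen.
        best = max(group, key=lambda r: r.qual_sum)
        duplicates.update(r.name for r in group if r is not best)
    return duplicates


reads = [
    Read("a", "chr1", 100, False, 300),
    Read("b", "chr1", 100, False, 350),
    Read("c", "chr1", 100, True, 200),
    Read("d", "chr2", 100, False, 100),
]
duplicate_names = mark_duplicates(reads)
```

In this sketch only reads `a` and `b` share a key, so the lower-scoring one is flagged. Each group is processed independently, which is one reason a grouping step like this lends itself to a distributed setting such as the Spark cluster the abstract mentions.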