A study on optimizing markduplicate in genome sequencing pipeline- 学术资源搜索

A study on optimizing markduplicate in genome sequencing pipeline

Q Zhao - Proceedings of the 5th International Conference on …, 2018 - dl.acm.org

Proceedings of the 5th International Conference on Bioinformatics Research …, 2018•dl.acm.org

MarkDuplicate is typically one of the most time-consuming operations in the whole genome sequencing pipeline. Picard tool, which is widely used by biologists to sort reads in genome data and mark duplicate reads in sorted genome data, has relatively low performance on MarkDuplicate due to its single-thread sequential Java implementation, which has caused serious impact on nowadays bioinformatic researches. To accelerate MarkDuplicate in Picard, we present our two-stage optimization solution as a preliminary study on next generation bioinformatic software tools to better serve bioinformatic researches. In the first stage, we improve the original algorithm of tracking optical duplicate reads by eliminating large redundant operations. As a consequence, we achieve up to 50X speedup for the second step only and 9.57X overall process speedup. At the next stage, we redesign the I/O processing mechanism of MarkDuplicate as transforming between on-disk genome file and in-memory genome data by using ADAM format instead of previous SAM format, and implement cloud-scale MarkDuplicate application by Scala. Our evaluation is performed on top of Spark cluster with 25 worker nodes and Hadoop distributed file system. According to the evaluation results, our cloudscale MarkDuplicate can provide not only the same output but also better performance compared with the original Picard tool and other existing similar tools. Specifically, among the 13 sets of real whole genome data we used for evaluation at both stages, the best improvement we gain is reducing runtime by 92 hours in total. Average improvement reaches 48.69 decreasing hours.

ACM Digital Library

展开收起

被引用次数：6 相关文章

以上显示的是最相近的搜索结果。查看全部搜索结果

Google学术搜索按钮

安装不用了

example.edu/paper.pdf

查找

获取 PDF 文件

引用

References