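1. College of Computer, National University of Defense Technology, Changsha 410073, Hunan
2. College of Information Science and Engineering, Hunan University, Changsha 410082, Hunan
3. National Supercomputing Center in Changsha, Changsha 410082, Hunan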
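CUI Yingbo (1989- ), male, PhD, assistant research fellow at the College of Computer, National University of Defense Technology. His main research interests include high-performance computing and biological big data mining.
HUANG Chun (1973- ), female, PhD, research fellow at the College of Computer, National University of Defense Technology. Her main research interests include high-performance computing systems, parallel compilation, parallel programming, and high-performance math libraries.
TANG Tao (1984- ), male, PhD, associate research fellow at the College of Computer, National University of Defense Technology. His main research interests include compilers, parallel computing, and high-performance computing.
YANG Canqun (1968- ), male, PhD, research fellow at the College of Computer, National University of Defense Technology. His main research interests include high-performance computing and parallel programming.
LIAO Xiangke (1963- ), male, PhD, research fellow at the College of Computer, National University of Defense Technology. His main research interests include high-performance computing and system software.
PENG Shaoliang (1979- ), male, PhD, professor at the College of Information Science and Engineering, Hunan University, and deputy director of the National Supercomputing Center in Changsha. His main research interests include high-performance computing, bioinformatics, big data mining, and blockchain.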
Published online: 2020-09
Published in print: 2020-09-15
Yingbo CUI, Chun HUANG, Tao TANG, et al. Parallel optimization of variation detection algorithms for large-scale genome data[J]. Big Data Research, 2020, 6(5): 2020041-1. DOI: 10.11959/j.issn.2096-0271.2020041.
Sequence alignment and variation detection are the basic steps of genomic data analysis. They are prerequisites for the functional analyses that follow, and they are also the most time-consuming steps. To process the massive genomic data produced by high-throughput sequencing effectively, the sequence alignment algorithm and the SNP detection algorithm were parallelized at multiple levels using OpenMP, MPI, and related technologies, and the algorithms themselves were improved. In tests on different data sets and at different parallel scales, the core algorithms achieved speedups of more than 9x, and parallel efficiency remained above 60% in large-scale tests. The optimized algorithms deliver good parallel performance and scalability while preserving accuracy, which effectively improves the capacity for variation detection on large-scale genomic data.
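The two metrics quoted in the abstract, speedup and parallel efficiency, are conventionally defined as S = T_baseline / T_parallel and E = S / k, where k is the factor by which the parallel run increases the number of workers over the baseline. The following is a minimal sketch of that arithmetic only. The function names, the timings, and the 16x scale factor are hypothetical and are not taken from the paper, and the abstract does not state which baseline the authors used.

```python
def speedup(t_baseline: float, t_parallel: float) -> float:
    """Speedup S = T_baseline / T_parallel."""
    if t_baseline <= 0 or t_parallel <= 0:
        raise ValueError("run times must be positive")
    return t_baseline / t_parallel


def parallel_efficiency(t_baseline: float, t_parallel: float,
                        scale_factor: float) -> float:
    """Efficiency E = S / k, where k is how many times more workers
    the parallel run used than the baseline run."""
    if scale_factor <= 0:
        raise ValueError("scale_factor must be positive")
    return speedup(t_baseline, t_parallel) / scale_factor


if __name__ == "__main__":
    # Hypothetical timings in seconds, NOT measurements from the paper.
    t_base = 1200.0   # baseline run
    t_par = 125.0     # run that uses 16x as many workers (assumed)
    s = speedup(t_base, t_par)
    e = parallel_efficiency(t_base, t_par, 16)
    print(f"speedup = {s:.2f}x, efficiency = {e:.0%}")
```

With these illustrative inputs the run is about 9.6 times faster on 16 times the workers, which corresponds to an efficiency of roughly 60%. An efficiency below 100% reflects overheads such as communication, load imbalance, and serial sections.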