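1. College of Computer Science and Technology, Heilongjiang Institute of Technology, Harbin, Heilongjiang 150050
2. School of Software, Harbin University of Science and Technology, Harbin, Heilongjiang 150040
3. Department of Biomedical Engineering, Peking University, Beijing 100871
Zhan Xiaojuan (1978-), female, lecturer at Heilongjiang Institute of Technology. Her main research interests are data mining, machine learning, and bioinformatics.
Yao Dengju (1980-), male, associate professor at Harbin University of Science and Technology. His main research interests are data mining, machine learning, and bioinformatics.
Zhu Huaiqiu (1970-), male, professor at Peking University. His main research interests are biomedical informatics and computational systems biology.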
Published online: 2016-03
Published in print: 2016-03-20
Zhan Xiaojuan, Yao Dengju, Zhu Huaiqiu. Bioinformatics methods for high-throughput DNA sequencing data[J]. Big Data Research, 2016, 2(2): 76-87. DOI: 10.11959/j.issn.2096-0271.2016021.
DNA sequence data generated by high-throughput sequencing technology consist of short reads, and the volume of data is enormous. This paper analyzes the challenges and opportunities that big data presents in the high-throughput sequencing environment, and summarizes and discusses research results, including algorithms and tools, on data compression, metagenomic sequence assembly, and metagenomic sequence analysis. Finally, it discusses future trends in research on short-read DNA sequence data from high-throughput sequencing.