Characterization of sequence-specific errors in various next-generation sequencing systems

Sunguk Shin, Joonhong Park

Research output: Contribution to journal (Article)

11 Citations (Scopus)

Abstract

Next-generation sequencing (NGS) is a popular method for assessing the molecular diversity of microbial communities without cultivation, for identifying polymorphisms in populations, and for comparing genomes and transcriptomes. However, sequence-specific errors (SSEs) introduced by NGS systems can result in genome mis-assembly, overestimation of diversity in microbial community analyses, and false polymorphism discovery. SSEs can be particularly problematic because of rich microbial biodiversity and genomes containing frequent repeats. In this study, SSEs in public data from all popular NGS systems were detected using a Markov chain model, and hotspots for sequence errors were identified. Deletion errors were frequently preceded by homopolymers in non-Illumina NGS systems, such as GS FLX+. Substitution errors were often related to high GC content and long G/C homopolymers in Illumina sequencing systems such as HiSeq. After removal of long G/C homopolymers in HiSeq, average contig length and average SNP quality increased. SSEs were selectively removed from our mock community data by quality filtering, and a bias against specific microbes was identified. Our findings provide a scientific basis for filtering poor-quality reads, correcting deletion errors, preventing genome mis-assembly, and accurately assessing microbial community compositions and polymorphisms.
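As a rough illustration of the two sequence features the abstract links to Illumina substitution errors (GC content and long G/C homopolymers), the sketch below flags reads by their longest G or C run. It is not the authors' method: the function names and the `MIN_RUN` threshold are hypothetical, the abstract does not say what length counts as "long", and it does not say whether whole reads were dropped or the runs were trimmed. This sketch drops whole reads.

```python
import re

# Hypothetical threshold: the abstract does not define how long a "long" run is.
MIN_RUN = 8


def gc_fraction(seq: str) -> float:
    """Fraction of bases in seq that are G or C (0.0 for an empty read)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def longest_gc_homopolymer(seq: str) -> int:
    """Length of the longest run of one repeated base, G or C."""
    runs = re.findall(r"G+|C+", seq.upper())
    return max((len(run) for run in runs), default=0)


def drop_long_gc_homopolymer_reads(reads, min_run=MIN_RUN):
    """Keep only reads whose longest G or C run is shorter than min_run."""
    return [read for read in reads if longest_gc_homopolymer(read) < min_run]


# Toy reads, invented for illustration only.
reads = ["ACGTACGTAC", "ATGGGGGGGGGTA", "CCCCCCCCAT", "ATATGCGCGC"]
kept = drop_long_gc_homopolymer_reads(reads)
```

With real data the reads would come from a FASTQ parser, and the threshold would need to be chosen from the observed error profile of the platform in question.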

Original language: English
Pages (from-to): 914-922
Number of pages: 9
Journal: Molecular BioSystems
Volume: 12
Issue number: 3
DOI: 10.1039/c5mb00750j
Publication status: Published - 2016 Mar 1

Fingerprint

Genome
Microbial Genome
Markov Chains
Biodiversity
Base Composition
Transcriptome
Single Nucleotide Polymorphism
Population
Data Accuracy

All Science Journal Classification (ASJC) codes

  • Biotechnology
  • Molecular Biology

Cite this

@article{a834eb9877a1413ebb3eeb7da1ef5c4a,
title = "Characterization of sequence-specific errors in various next-generation sequencing systems",
author = "Sunguk Shin and Joonhong Park",
year = "2016",
month = "3",
day = "1",
doi = "10.1039/c5mb00750j",
language = "English",
volume = "12",
pages = "914--922",
journal = "Molecular BioSystems",
issn = "1742-206X",
publisher = "Royal Society of Chemistry",
number = "3",

}

Characterization of sequence-specific errors in various next-generation sequencing systems. / Shin, Sunguk; Park, Joonhong.

In: Molecular BioSystems, Vol. 12, No. 3, 01.03.2016, p. 914-922.

TY - JOUR

T1 - Characterization of sequence-specific errors in various next-generation sequencing systems

AU - Shin, Sunguk

AU - Park, Joonhong

PY - 2016/3/1

Y1 - 2016/3/1

UR - http://www.scopus.com/inward/record.url?scp=84959235640&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84959235640&partnerID=8YFLogxK

U2 - 10.1039/c5mb00750j

DO - 10.1039/c5mb00750j

M3 - Article

VL - 12

SP - 914

EP - 922

JO - Molecular BioSystems

JF - Molecular BioSystems

SN - 1742-206X

IS - 3

ER -