Combination of Multiple Spectral Libraries Improves the Current Search Methods Used to Identify Missing Proteins in the Chromosome-Centric Human Proteome Project

Jin Young Cho, Hyoung Joo Lee, Seul Ki Jeong, Kwang Youl Kim, Kyung Hoon Kwon, Jong Shin Yoo, Gilbert S. Omenn, Mark S. Baker, William S. Hancock, Young-Ki Paik

Research output: Contribution to journal (Article)

9 Citations (Scopus)

Abstract

The human reference genome, approximately 2.9 billion base pairs long, is known to encode some 20000 representative proteins. However, 3000 proteins, that is, ∼15% of all proteins, have no or very weak proteomic evidence and are still missing. Missing proteins may be present in rare samples at very low abundance or be expressed only temporarily, which complicates their detection and protein profiling. In particular, some technical limitations cause missing proteins to remain unassigned. For example, current mass spectrometry techniques have high detection limits and error rates when applied to complex biological samples. Insufficient proteome coverage in reference sequence databases and spectral libraries also raises major issues. Thus, a better strategy that gives greater sensitivity and accuracy in the search for missing proteins is necessary. To this end, we used a new strategy, which combines a reference spectral library search and a simulated spectral library search, to identify missing proteins. We built the human iRefSPL, which contains the original human reference spectral library and additional peptide sequence-spectrum match entries from other species. We also constructed the human simSPL, which contains the simulated spectra of 173907 human tryptic peptides determined by MassAnalyzer (version 2.3.1). To demonstrate the enhanced analytical performance of the combined human iRefSPL and simSPL methods for the identification of missing proteins, we reanalyzed the placental tissue data set (PXD000754). The data from each experiment were analyzed using PeptideProphet, and the results were combined using iProphet. For quality control, we applied the class-specific false-discovery rate filtering method. All of the results were filtered at a false-discovery rate of <1% at the peptide and protein levels. The quality-controlled results were then cross-checked with the neXtProt DB (2014-09-19 release).
The two spectral libraries, iRefSPL and simSPL, were designed to ensure no overlap of the proteome coverage. They were shown to be complementary in spectral library searching and significantly increased the number of matches. From this trial, 12 new missing proteins were identified that passed the following criterion: at least 2 peptides of 7 or more amino acids in length or one of 9 or more amino acids in length with one or more unique sequences. Thus, the iRefSPL and simSPL combination can be used, with improved sensitivity and a low error rate, to help identify peptides that conventional sequence database searches have not detected.
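
The identification criterion stated in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the `Peptide` type, its `is_unique` flag, and the function name `passes_criterion` are hypothetical, and the sketch assumes one reading of the criterion, namely that a protein needs either two distinct peptides of 7 or more residues or one of 9 or more residues, and that at least one of the qualifying peptides must be unique to that protein.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Peptide:
    """Hypothetical record for one identified peptide."""
    sequence: str    # amino acid sequence, one letter per residue
    is_unique: bool  # assumed flag: True if the sequence maps to a single protein


def passes_criterion(peptides: Iterable[Peptide]) -> bool:
    """Check the abstract's criterion under the assumptions stated above."""
    # Count each distinct sequence once, and ignore peptides shorter than 7 residues.
    distinct = {p.sequence: p for p in peptides}
    qualifying = [p for p in distinct.values() if len(p.sequence) >= 7]

    # Assumed reading: at least one qualifying peptide must be unique.
    if not any(p.is_unique for p in qualifying):
        return False

    has_two_of_seven = len(qualifying) >= 2
    has_one_of_nine = any(len(p.sequence) >= 9 for p in qualifying)
    return has_two_of_seven or has_one_of_nine
```

Under this reading, a protein supported by a single unique 8-residue peptide would not pass, whereas a single unique 9-residue peptide would. Other readings of the criterion are possible, and the full article should be consulted for the exact rule.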

Original language: English
Pages (from-to): 4959-4966
Number of pages: 8
Journal: Journal of Proteome Research
Volume: 14
Issue number: 12
DOI: 10.1021/acs.jproteome.5b00578
Publication status: Published - 2015 Dec 4

Fingerprint

Human Chromosomes
Proteome
Chromosomes
Libraries
Proteins
Peptides
Databases
Amino Acids
Peptide Library
Human Genome
Base Pairing
Quality Control
Proteomics
Mass Spectrometry
Genes
Tissue

All Science Journal Classification (ASJC) codes

  • Biochemistry
  • Chemistry(all)

Cite this

Cho, Jin Young ; Lee, Hyoung Joo ; Jeong, Seul Ki ; Kim, Kwang Youl ; Kwon, Kyung Hoon ; Yoo, Jong Shin ; Omenn, Gilbert S. ; Baker, Mark S. ; Hancock, William S. ; Paik, Young-Ki. / Combination of Multiple Spectral Libraries Improves the Current Search Methods Used to Identify Missing Proteins in the Chromosome-Centric Human Proteome Project. In: Journal of Proteome Research. 2015 ; Vol. 14, No. 12. pp. 4959-4966.
@article{1d3b310cc3434f4b993235685fbaed75,
title = "Combination of Multiple Spectral Libraries Improves the Current Search Methods Used to Identify Missing Proteins in the Chromosome-Centric Human Proteome Project",
abstract = "Approximately 2.9 billion long base-pair human reference genome sequences are known to encode some 20000 representative proteins. However, 3000 proteins, that is, ∼15{\%} of all proteins, have no or very weak proteomic evidence and are still missing. Missing proteins may be present in rare samples in very low abundance or be only temporarily expressed, causing problems in their detection and protein profiling. In particular, some technical limitations cause missing proteins to remain unassigned. For example, current mass spectrometry techniques have high limits and error rates for the detection of complex biological samples. An insufficient proteome coverage in a reference sequence database and spectral library also raises major issues. Thus, the development of a better strategy that results in greater sensitivity and accuracy in the search for missing proteins is necessary. To this end, we used a new strategy, which combines a reference spectral library search and a simulated spectral library search, to identify missing proteins. We built the human iRefSPL, which contains the original human reference spectral library and additional peptide sequence-spectrum match entries from other species. We also constructed the human simSPL, which contains the simulated spectra of 173907 human tryptic peptides determined by MassAnalyzer (version 2.3.1). To prove the enhanced analytical performance of the combination of the human iRefSPL and simSPL methods for the identification of missing proteins, we attempted to reanalyze the placental tissue data set (PXD000754). The data from each experiment were analyzed using PeptideProphet, and the results were combined using iProphet. For the quality control, we applied the class-specific false-discovery rate filtering method. All of the results were filtered at a false-discovery rate of <1{\%} at the peptide and protein levels. The quality-controlled results were then cross-checked with the neXtProt DB (2014-09-19 release). 
The two spectral libraries, iRefSPL and simSPL, were designed to ensure no overlap of the proteome coverage. They were shown to be complementary to spectral library searching and significantly increased the number of matches. From this trial, 12 new missing proteins were identified that passed the following criterion: at least 2 peptides of 7 or more amino acids in length or one of 9 or more amino acids in length with one or more unique sequences. Thus, the iRefSPL and simSPL combination can be used to help identify peptides that have not been detected by conventional sequence database searches with improved sensitivity and a low error rate.",
author = "Cho, {Jin Young} and Lee, {Hyoung Joo} and Jeong, {Seul Ki} and Kim, {Kwang Youl} and Kwon, {Kyung Hoon} and Yoo, {Jong Shin} and Omenn, {Gilbert S.} and Baker, {Mark S.} and Hancock, {William S.} and Paik, Young-Ki",
year = "2015",
month = "12",
day = "4",
doi = "10.1021/acs.jproteome.5b00578",
language = "English",
volume = "14",
pages = "4959--4966",
journal = "Journal of Proteome Research",
issn = "1535-3893",
publisher = "American Chemical Society",
number = "12",

}

TY - JOUR

T1 - Combination of Multiple Spectral Libraries Improves the Current Search Methods Used to Identify Missing Proteins in the Chromosome-Centric Human Proteome Project

AU - Cho, Jin Young

AU - Lee, Hyoung Joo

AU - Jeong, Seul Ki

AU - Kim, Kwang Youl

AU - Kwon, Kyung Hoon

AU - Yoo, Jong Shin

AU - Omenn, Gilbert S.

AU - Baker, Mark S.

AU - Hancock, William S.

AU - Paik, Young-Ki

PY - 2015/12/4

Y1 - 2015/12/4

AB - Approximately 2.9 billion long base-pair human reference genome sequences are known to encode some 20000 representative proteins. However, 3000 proteins, that is, ∼15% of all proteins, have no or very weak proteomic evidence and are still missing. Missing proteins may be present in rare samples in very low abundance or be only temporarily expressed, causing problems in their detection and protein profiling. In particular, some technical limitations cause missing proteins to remain unassigned. For example, current mass spectrometry techniques have high limits and error rates for the detection of complex biological samples. An insufficient proteome coverage in a reference sequence database and spectral library also raises major issues. Thus, the development of a better strategy that results in greater sensitivity and accuracy in the search for missing proteins is necessary. To this end, we used a new strategy, which combines a reference spectral library search and a simulated spectral library search, to identify missing proteins. We built the human iRefSPL, which contains the original human reference spectral library and additional peptide sequence-spectrum match entries from other species. We also constructed the human simSPL, which contains the simulated spectra of 173907 human tryptic peptides determined by MassAnalyzer (version 2.3.1). To prove the enhanced analytical performance of the combination of the human iRefSPL and simSPL methods for the identification of missing proteins, we attempted to reanalyze the placental tissue data set (PXD000754). The data from each experiment were analyzed using PeptideProphet, and the results were combined using iProphet. For the quality control, we applied the class-specific false-discovery rate filtering method. All of the results were filtered at a false-discovery rate of <1% at the peptide and protein levels. The quality-controlled results were then cross-checked with the neXtProt DB (2014-09-19 release). 
The two spectral libraries, iRefSPL and simSPL, were designed to ensure no overlap of the proteome coverage. They were shown to be complementary to spectral library searching and significantly increased the number of matches. From this trial, 12 new missing proteins were identified that passed the following criterion: at least 2 peptides of 7 or more amino acids in length or one of 9 or more amino acids in length with one or more unique sequences. Thus, the iRefSPL and simSPL combination can be used to help identify peptides that have not been detected by conventional sequence database searches with improved sensitivity and a low error rate.

UR - http://www.scopus.com/inward/record.url?scp=84948954063&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84948954063&partnerID=8YFLogxK

U2 - 10.1021/acs.jproteome.5b00578

DO - 10.1021/acs.jproteome.5b00578

M3 - Article

VL - 14

SP - 4959

EP - 4966

JO - Journal of Proteome Research

JF - Journal of Proteome Research

SN - 1535-3893

IS - 12

ER -