Integrated Proteomic Pipeline Using Multiple Search Engines for a Proteogenomic Study with a Controlled Protein False Discovery Rate

Gun Wook Park, Heeyoun Hwang, Kwang Hoe Kim, Ju Yeon Lee, Hyun Kyoung Lee, Ji Yeong Park, Eun Sun Ji, Sung Kyu Robin Park, John R. Yates, Kyung Hoon Kwon, Young Mok Park, Hyoung Joo Lee, Young Ki Paik, Jin Young Kim, Jong Shin Yoo

Research output: Contribution to journal › Article

15 Citations (Scopus)

Abstract

In the Chromosome-Centric Human Proteome Project (C-HPP), false-positive identification by peptide spectrum matches (PSMs) after database searches is a major issue for proteogenomic studies that use large-scale proteomic profiling based on liquid chromatography and mass spectrometry. Here we developed a simple strategy for protein identification, with a controlled false discovery rate (FDR) at the protein level, using an integrated proteomic pipeline (IPP) that consists of four steps. First, using three different search engines, SEQUEST, MASCOT, and MS-GF+, individual proteomic searches were performed against the neXtProt database. Second, the search results from the PSMs were combined using statistical evaluation tools including DTASelect and Percolator. Third, the peptide search scores were converted into E-scores normalized using an in-house program. Fourth, ProteinInferencer was used to filter the proteins containing two or more peptides with a controlled FDR of 1.0% at the protein level. We then compared the performance of the IPP to a conventional proteomic pipeline (CPP) for protein identification using a controlled FDR of <1% at the protein level. Using the IPP, a total of 5756 proteins (vs 4453 using the CPP), including 477 alternative splicing variants (vs 182 using the CPP), were identified from human hippocampal tissue. In addition, a total of 10 missing proteins (vs 7 using the CPP) were identified with two or more unique peptides, and their tryptic peptides were validated using MS/MS spectral patterns from a repository database or their corresponding synthetic peptides. This study shows that the IPP effectively improved the identification of proteins, including alternative splicing variants and missing proteins, in human hippocampal tissues for the C-HPP. All RAW files used in this study were deposited in ProteomeXchange (PXD000395).
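
The protein-level filtering step described in the abstract can be illustrated with a generic target-decoy estimate. The sketch below is not the ProteinInferencer algorithm and not the authors' in-house code. It assumes that each protein already carries a single combined score (higher is better), a count of supporting peptides, and a decoy flag. The `Protein` class, the `filter_proteins` function, and the estimate FDR = decoys / targets are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Protein:
    """Hypothetical record for one protein-level identification."""
    accession: str
    score: float      # assumed combined score, higher is better
    n_peptides: int   # number of distinct supporting peptides
    is_decoy: bool    # True if the hit comes from the decoy database


def filter_proteins(proteins: Iterable[Protein],
                    max_fdr: float = 0.01,
                    min_peptides: int = 2) -> List[Protein]:
    """Return target proteins that pass a target-decoy FDR threshold.

    Proteins with fewer than `min_peptides` peptides are dropped first.
    The remaining proteins are ranked by score, and the longest ranked
    prefix whose estimated FDR (decoys / targets) stays at or below
    `max_fdr` is kept.
    """
    candidates = [p for p in proteins if p.n_peptides >= min_peptides]
    ranked = sorted(candidates, key=lambda p: p.score, reverse=True)

    targets = 0
    decoys = 0
    cutoff = 0  # number of ranked entries to keep
    for position, protein in enumerate(ranked, start=1):
        if protein.is_decoy:
            decoys += 1
        else:
            targets += 1
        # Simple target-decoy estimate; other estimators exist.
        if targets > 0 and decoys / targets <= max_fdr:
            cutoff = position

    # Report only target proteins from the accepted prefix.
    return [p for p in ranked[:cutoff] if not p.is_decoy]
```

With the defaults, `filter_proteins(hits)` mirrors the two settings named in the abstract (two or more peptides, 1.0% protein-level FDR), where `hits` is any iterable of `Protein` records. How scores from several search engines are merged into one value per protein is left outside this sketch.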

Original language: English
Pages (from-to): 4082-4090
Number of pages: 9
Journal: Journal of Proteome Research
Volume: 15
Issue number: 11
DOI: 10.1021/acs.jproteome.6b00376
Publication status: Published - 2016 Nov 4

Fingerprint

Search engines
Proteomics
Pipelines
Peptides
Proteins
Alternative Splicing
Human Chromosomes
Databases
Proteome
Chromosomes
Proteogenomics
Tissue
Liquid chromatography
Mass spectrometry

All Science Journal Classification (ASJC) codes

  • Biochemistry
  • Chemistry (all)

Cite this

Park, Gun Wook; Hwang, Heeyoun; Kim, Kwang Hoe; Lee, Ju Yeon; Lee, Hyun Kyoung; Park, Ji Yeong; Ji, Eun Sun; Park, Sung Kyu Robin; Yates, John R.; Kwon, Kyung Hoon; Park, Young Mok; Lee, Hyoung Joo; Paik, Young Ki; Kim, Jin Young; Yoo, Jong Shin. / Integrated Proteomic Pipeline Using Multiple Search Engines for a Proteogenomic Study with a Controlled Protein False Discovery Rate. In: Journal of Proteome Research. 2016; Vol. 15, No. 11, pp. 4082-4090.
@article{32d8dd6ff8db43c2848547706dc36ed8,
title = "Integrated Proteomic Pipeline Using Multiple Search Engines for a Proteogenomic Study with a Controlled Protein False Discovery Rate",
abstract = "In the Chromosome-Centric Human Proteome Project (C-HPP), false-positive identification by peptide spectrum matches (PSMs) after database searches is a major issue for proteogenomic studies using liquid-chromatography and mass-spectrometry-based large proteomic profiling. Here we developed a simple strategy for protein identification, with a controlled false discovery rate (FDR) at the protein level, using an integrated proteomic pipeline (IPP) that consists of four engrailed steps as follows. First, using three different search engines, SEQUEST, MASCOT, and MS-GF+, individual proteomic searches were performed against the neXtProt database. Second, the search results from the PSMs were combined using statistical evaluation tools including DTASelect and Percolator. Third, the peptide search scores were converted into E-scores normalized using an in-house program. Last, ProteinInferencer was used to filter the proteins containing two or more peptides with a controlled FDR of 1.0{\%} at the protein level. Finally, we compared the performance of the IPP to a conventional proteomic pipeline (CPP) for protein identification using a controlled FDR of <1{\%} at the protein level. Using the IPP, a total of 5756 proteins (vs 4453 using the CPP) including 477 alternative splicing variants (vs 182 using the CPP) were identified from human hippocampal tissue. In addition, a total of 10 missing proteins (vs 7 using the CPP) were identified with two or more unique peptides, and their tryptic peptides were validated using MS/MS spectral pattern from a repository database or their corresponding synthetic peptides. This study shows that the IPP effectively improved the identification of proteins, including alternative splicing variants and missing proteins, in human hippocampal tissues for the C-HPP. All RAW files used in this study were deposited in ProteomeXchange (PXD000395).",
author = "Park, {Gun Wook} and Heeyoun Hwang and Kim, {Kwang Hoe} and Lee, {Ju Yeon} and Lee, {Hyun Kyoung} and Park, {Ji Yeong} and Ji, {Eun Sun} and Park, {Sung Kyu Robin} and Yates, {John R.} and Kwon, {Kyung Hoon} and Park, {Young Mok} and Lee, {Hyoung Joo} and Paik, {Young Ki} and Kim, {Jin Young} and Yoo, {Jong Shin}",
year = "2016",
month = "11",
day = "4",
doi = "10.1021/acs.jproteome.6b00376",
language = "English",
volume = "15",
pages = "4082--4090",
journal = "Journal of Proteome Research",
issn = "1535-3893",
publisher = "American Chemical Society",
number = "11",

}

Park, GW, Hwang, H, Kim, KH, Lee, JY, Lee, HK, Park, JY, Ji, ES, Park, SKR, Yates, JR, Kwon, KH, Park, YM, Lee, HJ, Paik, YK, Kim, JY & Yoo, JS 2016, 'Integrated Proteomic Pipeline Using Multiple Search Engines for a Proteogenomic Study with a Controlled Protein False Discovery Rate', Journal of Proteome Research, vol. 15, no. 11, pp. 4082-4090. https://doi.org/10.1021/acs.jproteome.6b00376

Integrated Proteomic Pipeline Using Multiple Search Engines for a Proteogenomic Study with a Controlled Protein False Discovery Rate. / Park, Gun Wook; Hwang, Heeyoun; Kim, Kwang Hoe; Lee, Ju Yeon; Lee, Hyun Kyoung; Park, Ji Yeong; Ji, Eun Sun; Park, Sung Kyu Robin; Yates, John R.; Kwon, Kyung Hoon; Park, Young Mok; Lee, Hyoung Joo; Paik, Young Ki; Kim, Jin Young; Yoo, Jong Shin.

In: Journal of Proteome Research, Vol. 15, No. 11, 04.11.2016, p. 4082-4090.

TY - JOUR
T1 - Integrated Proteomic Pipeline Using Multiple Search Engines for a Proteogenomic Study with a Controlled Protein False Discovery Rate
AU - Park, Gun Wook
AU - Hwang, Heeyoun
AU - Kim, Kwang Hoe
AU - Lee, Ju Yeon
AU - Lee, Hyun Kyoung
AU - Park, Ji Yeong
AU - Ji, Eun Sun
AU - Park, Sung Kyu Robin
AU - Yates, John R.
AU - Kwon, Kyung Hoon
AU - Park, Young Mok
AU - Lee, Hyoung Joo
AU - Paik, Young Ki
AU - Kim, Jin Young
AU - Yoo, Jong Shin
PY - 2016/11/4
Y1 - 2016/11/4
N2 - In the Chromosome-Centric Human Proteome Project (C-HPP), false-positive identification by peptide spectrum matches (PSMs) after database searches is a major issue for proteogenomic studies using liquid-chromatography and mass-spectrometry-based large proteomic profiling. Here we developed a simple strategy for protein identification, with a controlled false discovery rate (FDR) at the protein level, using an integrated proteomic pipeline (IPP) that consists of four engrailed steps as follows. First, using three different search engines, SEQUEST, MASCOT, and MS-GF+, individual proteomic searches were performed against the neXtProt database. Second, the search results from the PSMs were combined using statistical evaluation tools including DTASelect and Percolator. Third, the peptide search scores were converted into E-scores normalized using an in-house program. Last, ProteinInferencer was used to filter the proteins containing two or more peptides with a controlled FDR of 1.0% at the protein level. Finally, we compared the performance of the IPP to a conventional proteomic pipeline (CPP) for protein identification using a controlled FDR of <1% at the protein level. Using the IPP, a total of 5756 proteins (vs 4453 using the CPP) including 477 alternative splicing variants (vs 182 using the CPP) were identified from human hippocampal tissue. In addition, a total of 10 missing proteins (vs 7 using the CPP) were identified with two or more unique peptides, and their tryptic peptides were validated using MS/MS spectral pattern from a repository database or their corresponding synthetic peptides. This study shows that the IPP effectively improved the identification of proteins, including alternative splicing variants and missing proteins, in human hippocampal tissues for the C-HPP. All RAW files used in this study were deposited in ProteomeXchange (PXD000395).

UR - http://www.scopus.com/inward/record.url?scp=84994591691&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84994591691&partnerID=8YFLogxK
U2 - 10.1021/acs.jproteome.6b00376
DO - 10.1021/acs.jproteome.6b00376
M3 - Article
C2 - 27537616
AN - SCOPUS:84994591691
VL - 15
SP - 4082
EP - 4090
JO - Journal of Proteome Research
JF - Journal of Proteome Research
SN - 1535-3893
IS - 11
ER -