Clustering protein environments for function prediction

Finding PROSITE motifs in 3D

Sungroh Yoon, Jessica C. Ebert, Eui-Young Chung, Giovanni De Micheli, Russ B. Altman

Research output: Contribution to journal (Article)

23 Citations (Scopus)

Abstract

Background: Structural genomics initiatives are producing increasing numbers of three-dimensional (3D) structures for which there is little functional information. Structure-based annotation of molecular function is therefore becoming critical. We previously presented FEATURE, a method for describing microenvironments around functional sites in proteins. However, FEATURE uses supervised machine learning and so is limited to building models for sites of known importance and location. We hypothesized that there are a large number of sites in proteins that are associated with function that have not yet been recognized. Toward that end, we have developed a method for clustering protein microenvironments in order to evaluate the potential for discovering novel sites that have not been previously identified.

Results: We have prototyped a computational method for rapid clustering of millions of microenvironments in order to discover residues whose surrounding environments are similar and which may therefore share a functional or structural role. We clustered nearly 2,000,000 environments from 9,600 protein chains and defined 4,550 clusters. As a preliminary validation, we asked whether known 3D environments associated with PROSITE motifs were "rediscovered". We found examples of clusters highly enriched for residues that share PROSITE sequence motifs.

Conclusion: Our results demonstrate that we can cluster protein environments successfully using a simplified representation and K-means clustering algorithm. The rediscovery of known 3D motifs allows us to calibrate the size and intercluster distances that characterize useful clusters. This information will then allow us to find new clusters with similar characteristics that represent novel structural or functional sites.
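
As a rough illustration of the two steps the abstract describes (K-means clustering of environment vectors, then checking whether a cluster is enriched for residues that share a known motif), a minimal Python sketch follows. It is not the authors' implementation. The four-dimensional toy vectors, the farthest-point initialization, the `has_motif` labels, and the hypergeometric tail probability used as an enrichment score are all assumptions made for illustration; the abstract does not specify the representation or the enrichment statistic.

```python
import math
import random


def dist2(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=100):
    """Plain Lloyd's K-means; returns (centroids, labels)."""
    # Farthest-point initialization: an assumption, chosen so the demo is
    # deterministic. Start at the first point, then repeatedly add the
    # point farthest from all centroids chosen so far.
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(dist2(p, c) for c in centroids))
        centroids.append(list(far))
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # assignments stable, so converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def hypergeom_tail(N, K, n, x):
    """P(X >= x) when n items are drawn from N, of which K carry the motif."""
    total = sum(math.comb(K, i) * math.comb(N - K, n - i)
                for i in range(x, min(K, n) + 1))
    return total / math.comb(N, n)


# Toy data (hypothetical): two well-separated groups of "environments".
rng = random.Random(0)


def blob(center, n):
    return [[c + rng.uniform(-0.5, 0.5) for c in center] for _ in range(n)]


points = blob([0.0] * 4, 60) + blob([4.0] * 4, 40)
# Hypothetical motif annotation: 5 of the first 60 residues and 30 of the
# last 40 are marked as belonging to some sequence motif.
has_motif = [i < 5 for i in range(60)] + [i < 30 for i in range(40)]

centroids, labels = kmeans(points, k=2)
pvals = {}
for c in range(2):
    idx = [i for i, lab in enumerate(labels) if lab == c]
    hits = sum(has_motif[i] for i in idx)
    pvals[c] = hypergeom_tail(len(points), sum(has_motif), len(idx), hits)
    print(f"cluster {c}: size={len(idx)} motif_hits={hits} p={pvals[c]:.3g}")
```

A small tail probability for a cluster indicates that it holds more motif-bearing residues than random assignment would explain, which is one plausible way to read "highly enriched". At the scale reported in the abstract (millions of environments, thousands of clusters), a vectorized or approximate K-means would be needed in place of this pure-Python loop.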

Original language: English
Article number: S10
Journal: BMC Bioinformatics
Volume: 8
Issue number: SUPPL. 4
DOI: 10.1186/1471-2105-8-S4-S10
Publication status: Published - 2007 May 22

Fingerprint

Cluster Analysis
Clustering
Proteins
Protein
Prediction
K-means Algorithm
K-means Clustering
Supervised Learning
Computational methods
Genomics
Clustering algorithms
Computational Methods
Clustering Algorithm
Annotation
Learning systems
Machine Learning
Three-dimensional
Evaluate
Demonstrate

All Science Journal Classification (ASJC) codes

  • Medicine(all)
  • Structural Biology
  • Applied Mathematics

Cite this

Yoon, Sungroh; Ebert, Jessica C.; Chung, Eui-Young; De Micheli, Giovanni; Altman, Russ B. / Clustering protein environments for function prediction: Finding PROSITE motifs in 3D. In: BMC Bioinformatics. 2007; Vol. 8, No. SUPPL. 4.
@article{8c6793d5cace4cd28fdd01fbca52a698,
title = "Clustering protein environments for function prediction: Finding PROSITE motifs in 3D",
author = "Sungroh Yoon and Ebert, {Jessica C.} and Eui-Young Chung and {De Micheli}, Giovanni and Altman, {Russ B.}",
year = "2007",
month = "5",
day = "22",
doi = "10.1186/1471-2105-8-S4-S10",
language = "English",
volume = "8",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",
number = "SUPPL. 4",

}

Clustering protein environments for function prediction: Finding PROSITE motifs in 3D. / Yoon, Sungroh; Ebert, Jessica C.; Chung, Eui-Young; De Micheli, Giovanni; Altman, Russ B.

In: BMC Bioinformatics, Vol. 8, No. SUPPL. 4, S10, 22.05.2007.

TY - JOUR

T1 - Clustering protein environments for function prediction

T2 - Finding PROSITE motifs in 3D

AU - Yoon, Sungroh

AU - Ebert, Jessica C.

AU - Chung, Eui-Young

AU - De Micheli, Giovanni

AU - Altman, Russ B.

PY - 2007/5/22

Y1 - 2007/5/22

UR - http://www.scopus.com/inward/record.url?scp=34447637608&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=34447637608&partnerID=8YFLogxK

U2 - 10.1186/1471-2105-8-S4-S10

DO - 10.1186/1471-2105-8-S4-S10

M3 - Article

VL - 8

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

IS - SUPPL. 4

M1 - S10

ER -