CODENSE
Abstract

Motivation: The rapid accumulation of biological network data has led to an urgent need in computational methods for graph pattern mining. One important problem is to iden-tify recurrent patterns across multiple networks to discover biological modules. However, existing algorithms for fre-quent pattern mining become very costly in time and space as the pattern sizes and network numbers increase. Cur-rently, no efficient algorithm is available for mining recur-rent patterns across large collections of genome-wide net-works.
Results: We developed a novel algorithm, CODENSE, to efficiently mine frequent coherent dense subgraphs across large numbers of massive graphs. Compared to previous methods, our approach is scalable in the number and size of the input graphs, flexible in mining either weighted or unweighted graphs, and adjustable in terms of exact or ap-proximate pattern mining. Applying CODENSE to 39 co-expression networks derived from microarray data sets, we demonstrated that our software discovered a large number of functionally homogenous clusters, and made functional predictions for 83 uncharacterized yeast genes.
Supplemental material
1. Microarray Data Used in This Study
We integrated all yeast microarray data sets (up to January 2003), each containing at least 8 experiments, from Stanford Microarray Database (SMD) (19 cDNA array data sets corresponding to 19 SMD Subcategories) from the NCBI Gene Expression Omnibus (GEO) (4 Affymetrix data sets). We also included two cDNA array data sets (11390663 , 10657304) and the Rosetta Compendium data generated by Rosetta Inpharmatics Company. Besides Rosetta Compendium, all data sets each contained a set of expression profiles measured under relevant and coherent conditions. The Rosetta Compendium includes 300 knockout and chemical treatment experiments. Among those experiments, we only took experiments in which the deleted genes have known functions, and we further classify this subset of experiments into 14 data sets based on the GeneOntology (GO) biological process categories of the deleted genes. This resulted in a total of 39 data sets comprising 618 expression profiles. Below is the detailed description of the data sets.
# | Publication | Platform | Experiments |
---|---|---|---|
1 | 9843569 | cDNA(SMD) | Alpha factor release |
2 | 9843569 | cDNA(SMD) | cdc15 block release6 |
3 | 11102521 | cDNA(SMD) | DTT Exposure |
4 | 9843569 | cDNA(SMD) | Elutriation |
5 | 9843569 | cDNA(SMD) | Forkhead regulation |
6 | 11102521 | cDNA(SMD) | Gamma radiation |
7 | 11102521 | cDNA(SMD) | Menadione exposure |
8 | 11598186 | cDNA(SMD) | DNA damage (MMS) response |
9 | 11102521 | cDNA(SMD) | Nitrogen depletion |
10 | 11102521 | cDNA(SMD) | Nutrition limitation |
11 | 11102521 | cDNA(SMD) | Osmotic shock |
12 | 11455386 | cDNA(SMD) | SIR proteins (Chromatin Silencing) |
13 | 11102521 | cDNA(SMD) | Sorbitol effects |
14 | 11102521 | cDNA(SMD) | H2O2 response |
15 | 11102521 | cDNA(SMD) | Heat shock |
16 | 11102521 | cDNA(SMD) | Heat steady |
17 | 11206552 | cDNA(SMD) | CellCycle Factor |
18 | 11102521 | cDNA(SMD) | YPD Stationary phase |
19 | 11102521 | cDNA(SMD) | Zinc homoeostasis |
20 | 12875747 | Affymetrix(GEO) | Aging |
21 | 14555471 | Affymetrix(GEO) | Chitin synthesis |
22 | 12702272 | Affymetrix(GEO) | Fermentation time course |
23 | 12370439 | Affymetrix(GEO) | Ume6 regulon |
24 | 10929718 | Affymetrix(Rosetta Compendium) | Cell cycle control |
25 | 10929718 | Affymetrix(Rosetta Compendium) | Cell wall organization |
26 | 10929718 | Affymetrix(Rosetta Compendium) | Chromatin assembly |
27 | 10929718 | Affymetrix(Rosetta Compendium) | Ion homeostasis |
28 | 10929718 | Affymetrix(Rosetta Compendium) | Nucleotide metabolism |
29 | 10929718 | Affymetrix(Rosetta Compendium) | Organelle biogenesis |
30 | 10929718 | Affymetrix(Rosetta Compendium) | Perception of external stimulus |
31 | 10929718 | Affymetrix(Rosetta Compendium) | Protein biosynthesis |
32 | 10929718 | Affymetrix(Rosetta Compendium) | Protein degradation |
33 | 10929718 | Affymetrix(Rosetta Compendium) | Protein metabolism |
34 | 10929718 | Affymetrix(Rosetta Compendium) | Protein phosphorylation |
35 | 10929718 | Affymetrix(Rosetta Compendium) | Protein transport |
36 | 10929718 | Affymetrix(Rosetta Compendium) | Pseudohyphal growth |
37 | 10929718 | Affymetrix(Rosetta Compendium) | Steroid metabolism |
38 | 11390663 | cDNA(Rosetta Inpharmatics Company) | Pseudohyphal growth |
39 | 10657304 | cDNA(Rosetta Inpharmatics Company) | MAPK pathway |
2. Functional annotation of uncharacterized genes
The large number of functionally homogenous clusters identified by CODENSE provides a solid foundation for functional annotation of uncharacterized genes. For those clusters containing unknown genes, we annotated them with the most dominating functional category. To assess the prediction accuracy of our method, we employed a "leave-one-out" approach by masking a known gene to be unknown, and assign its function based on the remaining known genes in the cluster.
We have assigned functions to 448 known genes, and achieved a prediction accuracy of 50%.
Downloads
CODENSE is a software package to mine coherent dense subgraphs from multiple biological networks. CODENSE is short for Mining Coherent Dense Subgraphs. By simplifying the problem of identifying coherent dense subgraphs across n graphs into a problem of identifying dense subgraphs in two special graphs: the summary graph and the second-order graph, CODENSE can efficiently mine frequent coherent dense subgraphs across large numbers of massive graphs.
Authors
- Haiyan Hu¹
- Xifeng Yan²
- Yu Huang¹
- Jiawei Han²
- Xianghong Jasmine Zhou¹
¹Program in Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
²Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
Contact
Xianghong Jasmine Zhou
Professor
Pathology and Laboratory Medicine
University of California at Los Angeles
Office: Boyer Hall, Suite 520
Phone: 310-267-0363
Email: xjzhou@mednet.ucla.edu