Abstract

CODENSE logo with overlapping grid patterns in red, green, and blue, and bold yellow-orange text.

Motivation: The rapid accumulation of biological network data has led to an urgent need in computational methods for graph pattern mining. One important problem is to iden-tify recurrent patterns across multiple networks to discover biological modules. However, existing algorithms for fre-quent pattern mining become very costly in time and space as the pattern sizes and network numbers increase. Cur-rently, no efficient algorithm is available for mining recur-rent patterns across large collections of genome-wide net-works.

Results: We developed a novel algorithm, CODENSE, to efficiently mine frequent coherent dense subgraphs across large numbers of massive graphs. Compared to previous methods, our approach is scalable in the number and size of the input graphs, flexible in mining either weighted or unweighted graphs, and adjustable in terms of exact or ap-proximate pattern mining. Applying CODENSE to 39 co-expression networks derived from microarray data sets, we demonstrated that our software discovered a large number of functionally homogenous clusters, and made functional predictions for 83 uncharacterized yeast genes.

Supplemental material

1. Microarray Data Used in This Study

We integrated all yeast microarray data sets (up to January 2003), each containing at least 8 experiments, from Stanford Microarray Database (SMD) (19 cDNA array data sets corresponding to 19 SMD Subcategories) from the NCBI Gene Expression Omnibus (GEO) (4 Affymetrix data sets). We also included two cDNA array data sets (11390663 , 10657304) and the Rosetta Compendium data generated by Rosetta Inpharmatics Company. Besides Rosetta Compendium, all data sets each contained a set of expression profiles measured under relevant and coherent conditions. The Rosetta Compendium includes 300 knockout and chemical treatment experiments. Among those experiments, we only took experiments in which the deleted genes have known functions, and we further classify this subset of experiments into 14 data sets based on the GeneOntology (GO) biological process categories of the deleted genes. This resulted in a total of 39 data sets comprising 618 expression profiles. Below is the detailed description of the data sets.

# Publication Platform Experiments
1 9843569 cDNA(SMD) Alpha factor release
2 9843569 cDNA(SMD) cdc15 block release6
3 11102521 cDNA(SMD) DTT Exposure
4 9843569 cDNA(SMD) Elutriation
5 9843569 cDNA(SMD) Forkhead regulation
6 11102521 cDNA(SMD) Gamma radiation
7 11102521 cDNA(SMD) Menadione exposure
8 11598186 cDNA(SMD) DNA damage (MMS) response
9 11102521 cDNA(SMD) Nitrogen depletion
10 11102521 cDNA(SMD) Nutrition limitation
11 11102521 cDNA(SMD) Osmotic shock
12 11455386 cDNA(SMD) SIR proteins (Chromatin Silencing)
13 11102521 cDNA(SMD) Sorbitol effects
14 11102521 cDNA(SMD) H2O2 response
15 11102521 cDNA(SMD) Heat shock
16 11102521 cDNA(SMD) Heat steady
17 11206552 cDNA(SMD) CellCycle Factor
18 11102521 cDNA(SMD) YPD Stationary phase
19 11102521 cDNA(SMD) Zinc homoeostasis
20 12875747 Affymetrix(GEO) Aging
21 14555471 Affymetrix(GEO) Chitin synthesis
22 12702272 Affymetrix(GEO) Fermentation time course
23 12370439 Affymetrix(GEO) Ume6 regulon
24 10929718 Affymetrix(Rosetta Compendium) Cell cycle control
25 10929718 Affymetrix(Rosetta Compendium) Cell wall organization
26 10929718 Affymetrix(Rosetta Compendium) Chromatin assembly
27 10929718 Affymetrix(Rosetta Compendium) Ion homeostasis
28 10929718 Affymetrix(Rosetta Compendium) Nucleotide metabolism
29 10929718 Affymetrix(Rosetta Compendium) Organelle biogenesis
30 10929718 Affymetrix(Rosetta Compendium) Perception of external stimulus
31 10929718 Affymetrix(Rosetta Compendium) Protein biosynthesis
32 10929718 Affymetrix(Rosetta Compendium) Protein degradation
33 10929718 Affymetrix(Rosetta Compendium) Protein metabolism
34 10929718 Affymetrix(Rosetta Compendium) Protein phosphorylation
35 10929718 Affymetrix(Rosetta Compendium) Protein transport
36 10929718 Affymetrix(Rosetta Compendium) Pseudohyphal growth
37 10929718 Affymetrix(Rosetta Compendium) Steroid metabolism
38 11390663 cDNA(Rosetta Inpharmatics Company) Pseudohyphal growth
39 10657304 cDNA(Rosetta Inpharmatics Company) MAPK pathway

2. Functional annotation of uncharacterized genes

The large number of functionally homogenous clusters identified by CODENSE provides a solid foundation for functional annotation of uncharacterized genes. For those clusters containing unknown genes, we annotated them with the most dominating functional category. To assess the prediction accuracy of our method, we employed a "leave-one-out" approach by masking a known gene to be unknown, and assign its function based on the remaining known genes in the cluster.

We have assigned functions to 448 known genes, and achieved a prediction accuracy of 50%.

Predict function for known genes (.xls)

Predict function for unknown genes (.xls) 

Downloads

CODENSE is a software package to mine coherent dense subgraphs from multiple biological networks. CODENSE is short for Mining Coherent Dense Subgraphs. By simplifying the problem of identifying coherent dense subgraphs across n graphs into a problem of identifying dense subgraphs in two special graphs: the summary graph and the second-order graph, CODENSE can efficiently mine frequent coherent dense subgraphs across large numbers of massive graphs.

Download CODENSE v1.0 Manual

Authors

  • Haiyan Hu¹
  • Xifeng Yan²
  • Yu Huang¹
  • Jiawei Han²
  • Xianghong Jasmine Zhou¹

¹Program in Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA

²Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA

Contact

Xianghong Jasmine Zhou
Professor
Pathology and Laboratory Medicine
University of California at Los Angeles

Office: Boyer Hall, Suite 520 
Phone: 310-267-0363
Email: xjzhou@mednet.ucla.edu