CODENSE

Abstract

Motivation: The rapid accumulation of biological network data has led to an urgent need in computational methods for graph pattern mining. One important problem is to iden-tify recurrent patterns across multiple networks to discover biological modules. However, existing algorithms for fre-quent pattern mining become very costly in time and space as the pattern sizes and network numbers increase. Cur-rently, no efficient algorithm is available for mining recur-rent patterns across large collections of genome-wide net-works.

Results: We developed a novel algorithm, CODENSE, to efficiently mine frequent coherent dense subgraphs across large numbers of massive graphs. Compared to previous methods, our approach is scalable in the number and size of the input graphs, flexible in mining either weighted or unweighted graphs, and adjustable in terms of exact or ap-proximate pattern mining. Applying CODENSE to 39 co-expression networks derived from microarray data sets, we demonstrated that our software discovered a large number of functionally homogenous clusters, and made functional predictions for 83 uncharacterized yeast genes.

Supplemental material

1. Microarray Data Used in This Study

We integrated all yeast microarray data sets (up to January 2003), each containing at least 8 experiments, from Stanford Microarray Database (SMD) (19 cDNA array data sets corresponding to 19 SMD Subcategories) from the NCBI Gene Expression Omnibus (GEO) (4 Affymetrix data sets). We also included two cDNA array data sets (11390663 , 10657304) and the Rosetta Compendium data generated by Rosetta Inpharmatics Company. Besides Rosetta Compendium, all data sets each contained a set of expression profiles measured under relevant and coherent conditions. The Rosetta Compendium includes 300 knockout and chemical treatment experiments. Among those experiments, we only took experiments in which the deleted genes have known functions, and we further classify this subset of experiments into 14 data sets based on the GeneOntology (GO) biological process categories of the deleted genes. This resulted in a total of 39 data sets comprising 618 expression profiles. Below is the detailed description of the data sets.

#	Publication	Platform	Experiments
1	9843569	cDNA(SMD)	Alpha factor release
2	9843569	cDNA(SMD)	cdc15 block release6
3	11102521	cDNA(SMD)	DTT Exposure
4	9843569	cDNA(SMD)	Elutriation
5	9843569	cDNA(SMD)	Forkhead regulation
6	11102521	cDNA(SMD)	Gamma radiation
7	11102521	cDNA(SMD)	Menadione exposure
8	11598186	cDNA(SMD)	DNA damage (MMS) response
9	11102521	cDNA(SMD)	Nitrogen depletion
10	11102521	cDNA(SMD)	Nutrition limitation
11	11102521	cDNA(SMD)	Osmotic shock
12	11455386	cDNA(SMD)	SIR proteins (Chromatin Silencing)
13	11102521	cDNA(SMD)	Sorbitol effects
14	11102521	cDNA(SMD)	H2O2 response
15	11102521	cDNA(SMD)	Heat shock
16	11102521	cDNA(SMD)	Heat steady
17	11206552	cDNA(SMD)	CellCycle Factor
18	11102521	cDNA(SMD)	YPD Stationary phase
19	11102521	cDNA(SMD)	Zinc homoeostasis
20	12875747	Affymetrix(GEO)	Aging
21	14555471	Affymetrix(GEO)	Chitin synthesis
22	12702272	Affymetrix(GEO)	Fermentation time course
23	12370439	Affymetrix(GEO)	Ume6 regulon
24	10929718	Affymetrix(Rosetta Compendium)	Cell cycle control
25	10929718	Affymetrix(Rosetta Compendium)	Cell wall organization
26	10929718	Affymetrix(Rosetta Compendium)	Chromatin assembly
27	10929718	Affymetrix(Rosetta Compendium)	Ion homeostasis
28	10929718	Affymetrix(Rosetta Compendium)	Nucleotide metabolism
29	10929718	Affymetrix(Rosetta Compendium)	Organelle biogenesis
30	10929718	Affymetrix(Rosetta Compendium)	Perception of external stimulus
31	10929718	Affymetrix(Rosetta Compendium)	Protein biosynthesis
32	10929718	Affymetrix(Rosetta Compendium)	Protein degradation
33	10929718	Affymetrix(Rosetta Compendium)	Protein metabolism
34	10929718	Affymetrix(Rosetta Compendium)	Protein phosphorylation
35	10929718	Affymetrix(Rosetta Compendium)	Protein transport
36	10929718	Affymetrix(Rosetta Compendium)	Pseudohyphal growth
37	10929718	Affymetrix(Rosetta Compendium)	Steroid metabolism
38	11390663	cDNA(Rosetta Inpharmatics Company)	Pseudohyphal growth
39	10657304	cDNA(Rosetta Inpharmatics Company)	MAPK pathway

2. Functional annotation of uncharacterized genes

The large number of functionally homogenous clusters identified by CODENSE provides a solid foundation for functional annotation of uncharacterized genes. For those clusters containing unknown genes, we annotated them with the most dominating functional category. To assess the prediction accuracy of our method, we employed a "leave-one-out" approach by masking a known gene to be unknown, and assign its function based on the remaining known genes in the cluster.

We have assigned functions to 448 known genes, and achieved a prediction accuracy of 50%.

Predict function for known genes (.xls)

Predict function for unknown genes (.xls)

Downloads

CODENSE is a software package to mine coherent dense subgraphs from multiple biological networks. CODENSE is short for Mining Coherent Dense Subgraphs. By simplifying the problem of identifying coherent dense subgraphs across n graphs into a problem of identifying dense subgraphs in two special graphs: the summary graph and the second-order graph, CODENSE can efficiently mine frequent coherent dense subgraphs across large numbers of massive graphs.

Download CODENSE v1.0 Manual

Authors

Haiyan Hu¹
Xifeng Yan²
Yu Huang¹
Jiawei Han²
Xianghong Jasmine Zhou¹

¹Program in Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA

²Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA

Contact

Xianghong Jasmine Zhou
Professor
Pathology and Laboratory Medicine
University of California at Los Angeles

Office: Boyer Hall, Suite 520
Phone: 310-267-0363
Email: xjzhou@mednet.ucla.edu