
Clustering Genes by RNA Expression: How to Get Started and Where to Go Next

Isaac S. Kohane, Atul Butte, and Ben Reis

Children's Hospital Informatics Program (Harvard)


The massively parallel acquisition of RNA expression data is rapidly becoming streamlined and dropping in price. In the near future, we can expect biologists and clinicians in many institutions to measure such data routinely. Analysis of these data sets to characterize biological systems, identify high-yield candidate genes/ESTs for further biological investigation, or quantify a patient's health risks, to name just a few tasks, will therefore become a standard part of the investigational armamentarium. Many algorithms have been developed that take RNA expression data sets and generate clusters putatively reflecting functional dependencies. These algorithms range in complexity from simple fold-difference calculations to comprehensive pair-wise comparisons and model construction. This tutorial is designed to teach the basics of the bioinformatics methodologies available for analyzing RNA expression data sets, while approaching the subject from a practical standpoint so that attendees can immediately put these algorithms to use.


Goals of the tutorial:

By the end of the session, attendees will be able to:

1. Understand the formats of expression data files produced by Affymetrix software and Incyte software.

2. Explain the different types of genomic clustering available, including intervention fold differences, self-organizing maps, and phylogenetic-type trees, and know the advantages and disadvantages of each.

3. Calculate the correlation coefficient, mutual information, entropy, and other measures of information.

4. Interpret the results of each clustering method, and know what possible next steps are available in analyzing the results.

5. Understand all the types of experiments done with microarrays to date, and the potential variety of experiments possible.
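
As a rough illustration of the measures named in goal 3, the following minimal Python sketch computes a Pearson correlation coefficient, Shannon entropy, and mutual information for two toy expression profiles. It is not taken from the tutorial materials: the function names, the equal-width binning used to discretize expression levels, and the example values are all assumptions made for illustration only.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def discretize(values, bins=2):
    """Assign each value to one of `bins` equal-width bins (an assumed scheme)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid division by zero for flat profiles
    return [min(int((v - lo) / width), bins - 1) for v in values]


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information in bits: H(A) + H(B) - H(A, B)."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))


# Hypothetical expression levels for two genes across four samples.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]

bins_a, bins_b = discretize(gene_a), discretize(gene_b)
print("correlation:", pearson(gene_a, gene_b))
print("entropy of gene_a:", entropy(bins_a))
print("mutual information:", mutual_information(bins_a, bins_b))
```

Entropy and mutual information are defined here on discretized values, so the results depend on the number of bins chosen. A real analysis would need to justify that choice and would typically use many more samples.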

Structure of the tutorial (3 parts):

1. Review

The first part of the tutorial will be the most didactic. It will include a review of:
  • The nature and format of the expression data files generated by the two most common technologies. Particular emphasis will be placed on the different characteristics of these measurement systems, including noise profiles, and on how normalization of the data sets can be approached (and common mistakes).
  • The typical flow of an investigation of the functional genomics of a biological domain, going from hypothesis generation to hypothesis validation.
  • A description of the most frequently used clustering techniques. Their strengths and weaknesses will be summarized. The questions for which each might be better suited will be addressed, as well as reasonable approaches to the interpretation of results generated by these techniques.
  • Review of several instances of the clustering techniques applied to various biological systems.

2. Questions and Answers

During this segment of the tutorial, participants will be encouraged to explore how they might use these techniques in domains that are of interest to them. Also, the instructors will moderate a more detailed discussion of the problems associated with each of the techniques reviewed and where the current research challenges lie.

3. Example Analysis

A publicly available data set will be introduced. The instructors will lead the participants step by step through several analyses of this data set. This will provide a very concrete sense of what is involved in performing the analyses introduced in the Review part of the tutorial.


The instructors for this course, listed below, are involved in investigations of gene expression with collaborators in multiple academic centers in Boston and elsewhere. These collaborations involve the study of the functional genomics of organ transplantation and rejection, cardiac disease, angiogenesis, tumorigenesis, neurodevelopment, neuromuscular disease, and neuroendocrine circadian rhythmicity, to mention just a few of the established application domains.

Isaac S. Kohane, MD, PhD

Associate Professor of Pediatrics

Harvard Medical School

Director, Children's Hospital Informatics Program

Atul Butte, MD

Fellow in Informatics

Division of Health Sciences and Technology, Harvard/MIT and

Children's Hospital Informatics Program

Ben Reis, PhD

Fellow in Informatics

Division of Health Sciences and Technology, Harvard/MIT and

Children's Hospital Informatics Program