Organizational Features of Eukaryotic Promoters and Applications of in Silico Promoter Analysis in the Annotation of the Human Genome

Thomas Werner, Institute of Mammalian Genetics, GSF-National Research Centre for Environment and Health, Ingolstaedter Landstr. 1, D-85764 Neuherberg

Synopsis

The tutorial is meant as an INTRODUCTION to the problems of in silico promoter analysis, both for computer scientists without a profound biological background and for biologists interested in the BASIC principles of promoter analysis algorithms. Specific problems of chromosome-scale promoter analysis will be discussed on the basis of current results from the human genome project. The tutorial is intended for a general interdisciplinary audience.

Details

The tutorial will outline the basic events of transcriptional initiation on polymerase II promoters, such as binding of activating factors, assembly of a preinitiation complex, and formation of a fully competent initiation complex. Other promoter-related events (chromatin alterations) will be mentioned. The major elements of transcription control sequences and their basic functional features will be introduced. The most important of these elements are transcription factor binding sites. They consist of about 10 to 30 nucleotides, not all of which are equally important for protein binding. Other important elements frequently found in regulatory regions are secondary structure elements (not necessarily binding proteins), which have the potential to form hairpin structures (after transcription into RNA) or to distort the DNA into so-called "cruciform DNA", where the DNA strands separate and form symmetrical strand-internal hairpins. Cruciform DNA is associated with certain matrix attachment regions (MARs). Direct repeats are quite common within regulatory DNA regions. They consist either of short sequences that are repeated two or more times within a short region, or of complex repeats in which a pattern of two or more elements is repeated. Repeat structures are often associated with enhancers. The last category is represented by three-dimensional structures, which are currently hard to assess computationally and therefore cannot be discussed in much detail.
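The point that not all positions of a binding site contribute equally to protein binding is commonly captured with a position weight matrix. The following is a minimal sketch of that general idea, not the specific method presented in the tutorial. The example sites, the uniform background frequency, the pseudocount, and all function names are assumptions made for illustration.

```python
import math

BASES = "ACGT"

# Hypothetical aligned binding sites, invented for illustration only.
EXAMPLE_SITES = ["TATAAA", "TATAAA", "TATATA", "TATAAG", "TTTAAA"]


def build_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Return one dict per site position mapping base -> log2 odds score."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    matrix = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        matrix.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in BASES
        })
    return matrix


def score_window(window, matrix):
    """Sum the per-position scores; unknown symbols (e.g. N) count as 0."""
    return sum(column.get(base, 0.0) for base, column in zip(window, matrix))


def scan(sequence, matrix, threshold):
    """Return (start, score) for every window scoring at least threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = score_window(sequence[start:start + width], matrix)
        if score >= threshold:
            hits.append((start, score))
    return hits
```

Positions that are invariant in the training sites receive a higher weight for the conserved base than variable positions do, so a mismatch there costs more. The threshold passed to `scan` is a free parameter and has to be chosen against control sets.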

Computational methods used to describe the first three elements of transcription control regions will be demonstrated and briefly discussed.

A very important aspect of the bioinformatics of regulatory regions is the definition of significance used to determine "functional" elements. Recognition of sequence patterns is usually based on optimization schemes that ensure the best correlation of the methods with the data (positive and negative training sets). However, in most cases it is not possible to collect sufficient data to perform a rigorous correlation analysis. Therefore, bioinformatics methods often rely on statistical analysis of their training sequences and optimize for the statistically most significant features. Unfortunately, this kind of optimization does not always reflect the evolutionary optimization of regulatory sequences.

Appropriate selection of control sets is also a very important step in the evaluation of algorithms and programs. Usually, true positive elements/regions (i.e. functionally verified in the laboratory) are not abundant (10 to 100), and true negative regions are even scarcer. Negative often means just "no positive functions found", which can be due to failures in detection as well as to true absence of these functions.
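The role of control sets in evaluation can be illustrated with a small sketch. It assumes that predictions and control regions are represented by hashable identifiers; the function name and the dictionary keys are hypothetical. Because "negative" regions are often only presumed negative, the specificity computed this way may understate a method's real performance.

```python
def evaluate(predicted, verified_positive, assumed_negative):
    """Compare a set of predictions against positive and negative control sets.

    predicted, verified_positive and assumed_negative are sets of region
    identifiers. assumed_negative holds regions with no known function,
    which is weaker than a verified absence of function.
    """
    true_pos = len(predicted & verified_positive)
    false_neg = len(verified_positive - predicted)
    false_pos = len(predicted & assumed_negative)
    true_neg = len(assumed_negative - predicted)

    def ratio(numerator, denominator):
        # Avoid division by zero when a control set is empty.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(true_pos, true_pos + false_neg),
        "specificity": ratio(true_neg, true_neg + false_pos),
        "n_positive_controls": len(verified_positive),
        "n_negative_controls": len(assumed_negative),
    }
```

With only 10 to 100 verified positives, each individual control region shifts the sensitivity by one to ten percentage points, so the control set sizes are returned alongside the ratios.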

The second part of the tutorial will introduce the functional structures and sequence features of promoter sequences, with special emphasis on the organizational restrictions imposed by the described sequence of initiation events. Differences between promoter analysis and the exon/intron finding problem, as well as protein sequence analysis, will be illustrated. Functional conservation over longer sequence stretches in homologous promoters and its exploitation in phylogenetic footprinting will be illustrated. The lack of overall sequence similarity between functionally related but non-homologous promoters will also be shown, together with its consequences for the choice of methods (e.g. alignment programs). Available training data (EPD, GenBank/EMBL annotations) will be summarized, along with an estimate of the reliability of such data collections.
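The idea behind phylogenetic footprinting, that functional stretches stay conserved between homologous promoters, can be sketched as follows. The sketch assumes the sequences have already been aligned by an external program and padded with "-" for gaps. It reports only perfectly conserved, gap-free blocks, which is a deliberate simplification. The function name and the `min_length` parameter are hypothetical.

```python
def conserved_blocks(aligned, min_length=6):
    """Return (start, end) alignment column ranges, end exclusive, in which
    every sequence carries the same non-gap symbol."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    blocks = []
    start = None
    # Iterate one column past the end so a block touching the end is closed.
    for col in range(length + 1):
        conserved = (
            col < length
            and aligned[0][col] != "-"
            and all(seq[col] == aligned[0][col] for seq in aligned)
        )
        if conserved and start is None:
            start = col
        elif not conserved and start is not None:
            if col - start >= min_length:
                blocks.append((start, col))
            start = None
    return blocks
```

This only works for homologous promoters that can be aligned at all. Functionally related but non-homologous promoters lack the overall similarity such an alignment depends on.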

Several basic types of regulatory regions will be described in functional terms where possible. This will include chromatin loop organization (briefly), focusing mostly on organizational elements like matrix attachment regions (MARs). Enhancers, silencers, and promoters represent the best-analyzed regions of transcription control and will be discussed in more detail, based on a recent review (Werner 1999). All of these regulatory regions contain a variety of transcription factor binding sites as their basic elements. The biological function of these regions is generated by hierarchical organization of the individual elements into modules and complete regulatory units. The proteins binding to the individual elements must interact with each other as well as with the basal transcription machinery (assembled around the core promoter), which imposes various restrictions on the arrangement of individual elements. Two levels of functional organization have to be considered.

The concept of functional promoter modules will be introduced. Hallmarks and organizational features of such modules will be described, and the consequences of these features for promoter organization will be illustrated by a couple of experimentally verified examples. Implementation of a modular concept in promoter analysis algorithms will be explained, and at least two different approaches to describing promoter modules by bioinformatics methods will be demonstrated in detail.

Termination/polyadenylation is also involved in transcriptional regulation and will be discussed, especially to illustrate the organizational similarities with other regulatory regions. This part of the tutorial will also cover a synopsis of several available tools for regulatory region analysis, especially for promoter analysis. Both the theoretical background of the methods and some selected application results will be explained during the tutorial. Emphasis will be given to the elucidation of the biological models and the degree to which they are represented by the computational model, as well as to the practical applicability of the methods to different examples. A strong selection criterion for the range of methods considered will be their availability through a WWW interface. Experience has shown that programs requiring local installation are rarely used, even by computer scientists.

Special emphasis will be given to the trade-off between generality and specificity, which still appears inevitable. Large-scale application of the approach to human chromosome 22 sequences will be discussed. The results of using several independent annotation strategies will be illustrated. Results obtained from the application of such strategies can be federated as independent data. For example, exon finding algorithms and promoter finding programs can operate independently of each other. However, biologically meaningful results have to locate a promoter closely upstream of a valid exon model.

There are two principal ways to combine such results:

Federation will yield a synopsis of all results in which compatible annotations are combined and conflicting annotations are either ignored or reported as conflicting. The second, fundamentally different strategy is to use a hierarchical order of analysis in which the results of one method are used to restrict the input for the next method. The difficulty with this approach is that the ranking of methods can be critical for the final results obtained. Therefore, reliable and general quality criteria are required to select the order of application yielding the best overall results. Consequences of both strategies will be illustrated with specific examples. The applicability of the approaches to other genomes will also be discussed, which will be of special interest because several non-mammalian genomes are also being sequenced.
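The two combination strategies described above can be sketched for the promoter/exon example. This is an illustration under stated assumptions, not the pipeline used for chromosome 22. The record types, the `max_distance` cutoff of 1000 bp, the example coordinates, and the `find_promoters` callable are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Promoter:
    position: int  # genomic coordinate of the predicted promoter
    strand: str    # "+" or "-"


@dataclass(frozen=True)
class GeneModel:
    five_prime: int  # genomic coordinate of the 5' end of the first exon
    strand: str      # "+" or "-"


def is_compatible(promoter, gene, max_distance=1000):
    """True if the promoter lies closely upstream of the gene model."""
    if promoter.strand != gene.strand:
        return False
    if gene.strand == "+":
        distance = gene.five_prime - promoter.position
    else:
        distance = promoter.position - gene.five_prime
    return 0 < distance <= max_distance


def federate(promoters, genes, max_distance=1000):
    """Combine independent predictions and report what could not be paired."""
    pairs = [(p, g) for p in promoters for g in genes
             if is_compatible(p, g, max_distance)]
    paired_promoters = {p for p, _ in pairs}
    paired_genes = {g for _, g in pairs}
    return {
        "compatible": pairs,
        "unmatched_promoters": [p for p in promoters if p not in paired_promoters],
        "unmatched_genes": [g for g in genes if g not in paired_genes],
    }


def hierarchical(genes, find_promoters, max_distance=1000):
    """Use gene models first, then search for promoters only upstream of them.

    find_promoters(window, strand) is a caller-supplied function; window is a
    half-open (start, end) coordinate range.
    """
    results = []
    for gene in genes:
        if gene.strand == "+":
            window = (max(0, gene.five_prime - max_distance), gene.five_prime)
        else:
            window = (gene.five_prime + 1, gene.five_prime + max_distance + 1)
        results.append((gene, find_promoters(window, gene.strand)))
    return results


# Hypothetical example data, invented for illustration only.
EXAMPLE_PROMOTERS = [Promoter(900, "+"), Promoter(5000, "+"), Promoter(3200, "-")]
EXAMPLE_GENES = [GeneModel(1000, "+"), GeneModel(3000, "-"), GeneModel(8000, "+")]
```

In the federated variant, unmatched predictions stay visible and can be reported as conflicts. In the hierarchical variant, a gene model missed by the first method is never examined for promoters, which is why the ranking of methods matters.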

Participants will also be provided with internet URLs for a selection of available programs with WWW-interfaces as well as resources providing the actual data of currently available human genome promoter annotation.