Introduction to Computational Sequence Analysis
Most of the new data entering the biological databases now come from whole-genome sequencing projects. From the genes themselves to the structural or functional properties of the proteins these genes are predicted to encode, most of the biological features discovered through genome sequencing projects are inferred by computational analysis of the sequences. To take advantage of this huge amount of new data and tools, biologists must be able to evaluate the relevance of the available data and computational tools to their own biological questions.
This tutorial is aimed at biologists who do not have any strong background in bioinformatics, and who wish to understand the models and methods underlying the main approaches used in computational sequence analysis.
The goals of the tutorial are:
To cover some algorithms and models in reasonable depth, the tutorial will focus on one theme: "biological databases and search for similarities". "Biological databases" will include primary sequence data as well as derived data such as "motifs" or "domains". Around this main theme, other aspects such as "gene discovery" or "molecular phylogeny" will be briefly reviewed in the introduction and conclusion.
The tutorial will last 4 hours, including a coffee break.
1: Pairwise sequence comparison
Basic sequence comparison: dot-plots, "diagonals" methods.
Score of an alignment, similarity vs distance measures. Dynamic programming (Needleman & Wunsch and Smith & Waterman algorithms).
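As an illustration of the dynamic programming idea named above, the following is a minimal sketch that computes only the optimal score (no traceback) for a global alignment in the style of Needleman & Wunsch and for a local alignment in the style of Smith & Waterman. The function names and the scoring values (match +1, mismatch -1, linear gap penalty -2) are assumptions chosen for the example, not parameters prescribed by the tutorial.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score (Needleman & Wunsch style, linear gaps)."""
    # prev[j] = best score for aligning the prefix of a processed so far with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal local alignment score (Smith & Waterman style, linear gaps)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores are floored at 0 so an alignment can start anywhere.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


print(global_alignment_score("ACGT", "AGT"))
print(local_alignment_score("xxACGTxx", "yyACGTyy"))
```

The only differences between the two functions are the initialization of the first row and column, the floor at zero, and where the final score is read from (the last cell versus the best cell anywhere in the matrix).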
1.2: Scoring models:
Modeling of gap weights.
Protein similarity matrices.
Matrices for nucleic acids.
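To make the notions of gap weights and similarity matrices concrete, here is a small sketch that scores an already gapped pairwise alignment. It assumes one common affine convention, in which a gap of length k costs gap_open + (k - 1) * gap_extend; other conventions exist. The function names, the toy nucleotide scoring function and all numeric values are hypothetical and chosen only for the example.

```python
def simple_dna_score(x, y):
    """Toy nucleic acid 'matrix': +2 for identity, -1 otherwise (assumed values)."""
    return 2 if x == y else -1


def score_alignment(row_a, row_b, sub_score, gap_open=-10, gap_extend=-1):
    """Score two equal-length aligned rows ('-' marks a gap) with affine gaps."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    open_side = None  # which row currently carries an open gap: "a", "b" or None
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gaps")
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            # First position of a gap pays gap_open, later ones pay gap_extend.
            total += gap_extend if open_side == side else gap_open
            open_side = side
        else:
            total += sub_score(x, y)
            open_side = None
    return total


print(score_alignment("AC--GT", "ACTTGT", simple_dna_score))
```

For proteins, `sub_score` would typically look up a pair of residues in a similarity matrix instead of testing for identity.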
1.3: Statistics of alignment scores
Distribution of alignment scores.
Theoretical and empirical models for evaluating the statistical significance of similarity scores.
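One theoretical model commonly used for local alignment scores is the Karlin-Altschul extreme-value model, in which the expected number of chance alignments with score at least S between sequences of lengths m and n is E = K·m·n·exp(-λS). The sketch below applies that formula. The parameters λ (`lam`) and K (`k`) depend on the scoring system and the sequence composition; the values used in the example call are made up for illustration and are not taken from any real program.

```python
import math


def expected_hits(raw_score, m, n, lam, k):
    """E-value: expected number of chance hits scoring >= raw_score."""
    return k * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """Normalized score, so that E = m * n * 2 ** (-bit_score)."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def p_value(e_value):
    """Probability of observing at least one such hit by chance."""
    return -math.expm1(-e_value)  # 1 - exp(-E), computed accurately for small E


# Hypothetical parameters: query of 300 residues, database of 1,000,000 residues.
e = expected_hits(raw_score=50, m=300, n=1_000_000, lam=0.3, k=0.1)
print(e, bit_score(50, lam=0.3, k=0.1), p_value(e))
```

The theory strictly covers ungapped local alignments; for gapped alignments, the parameters are generally estimated empirically.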
2: Looking for similarities in sequence databases
2.1: Sequence Databases:
Nucleic acid vs protein, general vs specialized, annotations, exhaustiveness, redundancy...
2.2: Fasta & Blast programs:
Algorithms (fasta, blast1, ncbi-blast2, wu-blast2).
Statistics and significance of a search.
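The heuristic programs listed above avoid running full dynamic programming against every database entry by first looking for short exact word matches. The following sketch illustrates only that general idea: it indexes the words of length k in a query and counts word hits per diagonal in a subject sequence. It is a simplified illustration, not the actual FASTA or BLAST algorithm; the function name and the default word length are assumptions.

```python
from collections import Counter, defaultdict


def best_diagonal(query, subject, k=3):
    """Return (diagonal, hit_count) for the diagonal with most shared k-words."""
    # Index every word of length k in the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)

    # A word found at query position i and subject position j lies on diagonal j - i.
    diagonals = Counter()
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            diagonals[j - i] += 1

    if not diagonals:
        return None
    return diagonals.most_common(1)[0]


print(best_diagonal("ACGTACGA", "TTACGTACGA", k=4))
```

A diagonal with many word hits marks a region worth examining with a more careful alignment method.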
3: Motifs (domains, patterns, etc.) and Multiple alignments
Representation of the information extracted from several aligned sequences: consensus, regular expressions, profiles, HMMs.
Algorithms used for building multiple alignments and/or inferring motifs (briefly reviewed or treated in more depth, depending on the time).
Available data resources: databases like prosite, blocks, pfam... and programs.
Use of these concepts in the context of a sequence database search: examples of PSI-blast and PHI-blast.
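Among the motif representations listed in this part, patterns written as regular expressions are the simplest to experiment with. The sketch below converts a pattern in PROSITE-style syntax into a Python regular expression and searches a sequence with it. The converter handles only the basic syntax elements (x, [..], {..}, repeat counts, < and > anchors), and both the example pattern and the test sequence are illustrative assumptions rather than entries taken from a database.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    regex = pattern.rstrip(".")          # drop the optional trailing period
    regex = regex.replace("-", "")       # '-' only separates positions
    regex = regex.replace("x", ".")      # 'x' means any residue
    regex = regex.replace("{", "[^").replace("}", "]")  # {ABC}: any residue except A, B, C
    regex = regex.replace("(", "{").replace(")", "}")   # x(2,4): repeat counts
    regex = regex.replace("<", "^").replace(">", "$")   # sequence-end anchors
    return regex


# Illustrative zinc-finger-like pattern written in PROSITE-style syntax.
pattern = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
regex = prosite_to_regex(pattern)
print(regex)

# Made-up sequence containing one occurrence of the pattern.
sequence = "MKT" + "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"
match = re.search(regex, sequence)
print(match.start() if match else None)
```

Profiles and HMMs generalize this idea by attaching position-specific scores or probabilities to each residue instead of a strict yes/no match.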
Frédérique Galisson: She holds a PhD in Molecular Biology (1993), followed by post-doctoral work in genomics and a diploma in computer science from the Pasteur Institute (1995). For the past 4 years, she has worked in the "Service d'Informatique Scientifique" at the Pasteur Institute, where she has developed bioinformatics courses for biologists. The courses aim to explain the theoretical basis, the algorithmic and mathematical methods, and the biological models and hypotheses underlying the programs used in "sequence analysis"; to introduce biologists to the bioinformatics resources available locally on the Pasteur Institute's servers and give them a practical introduction to their use; and to present the bioinformatics field and its related research activities. For the last 4 years, she has also been involved in service and research activities.