10 Steps to Success in Bioinformatics
by Webb Miller
2009 ISCB Accomplishment by a Senior Scientist Award
I felt it was appropriate for the recipient of the ISCB Accomplishment by a Senior Scientist Award for 2009 to record some impressions of the bioinformatics field, and I thank the conference organizers for giving me that opportunity. I have arranged my thoughts as ten principles that guide me in setting my research agenda. Of course, other scientists may follow substantially different principles. Also, there are components other than research to a successful scientific career, including educational work such as developing curricula, and involvement in professional activities such as organizing conferences. The suggestions below are simply the rules that underlie my research program. There are only two fundamental principles, from which the others follow, somewhat like two axioms and eight theorems. In particular, principles 3-5 follow from #1, while principles 6-10 follow from #2. After describing the 10 principles, I will recount some of the twists and turns of my professional life that led me to adopt them.
1. Become a biologist. This message comes directly from a keynote address given by David Botstein at the 1997 RECOMB meeting. My interpretation of the adage is to organize one’s value system so that the driving principle becomes solving biological problems, rather than looking for biological problems to illustrate research in other areas. That is, be driven to solve problems concerning cancer, AIDS, ageing, development, evolution, etc., not by guidelines used in other disciplines, such as "my proof is harder than your proof".
However, this is still quite different from what I call the traditional way of doing biology, which is to work on the same problem for years, applying any techniques (bioinformatics, sequencing, mouse transgenics, etc.) needed to get answers. Sometimes scientists who take that approach become experts in a required technique, but often they assign the task to a lab member or buy the service from a vendor.
As a bioinformatics specialist, you will likely not work that way. Instead, you might bring your acquired tool set to various biological problems with the arrival of a new collaborator, set of data, or technology. For instance, your institution might buy a new kind of sequencing instrument, which would permit your local cancer biologist to address novel hypotheses if only she could find someone to analyze the sequence data. (That would be you.)
Becoming a biologist allows you to anticipate which computational challenges are most worthy of a substantial investment in software development. Moreover, communications with collaborators become much easier. Finally, I predict that as you get older, cancer and other aspects of human health will come to seem more interesting than, e.g., algorithm analysis.
2. Value your number of citations above your number of publications. Bioinformatics is an excellent field for attracting high numbers of citations in the scientific literature. For instance, only a small proportion of papers in the journal Genome Research are devoted to describing bioinformatic software and/or web servers, but they account for 8 of the journal’s 12 most-cited papers (out of 2,663 publications as of 1 January 2009); the 12 are just the papers published in that journal that have been cited at least 500 times. Also, bioinformatics has some of the biggest citation monsters in all of science, such as the 1994 paper on Clustal-W (Thompson et al. 1994) with over 25,000 citations, despite the fact that the journals with the highest impact factors (e.g., Nature and Science) do not publish this kind of paper. By comparison, no 1994 paper from any discipline in Nature or Science has over 6,000 citations, and only four of the 5,858 papers published that year in those journals have over 3,000 citations. One lesson that I take from comparing these numbers is that citation counts for bioinformatic papers are not directly comparable to those for papers that present scientific discoveries. Moreover, even among bioinformatic papers, citation counts are by no means guaranteed to provide an accurate assessment of a paper’s impact. However, citation counts have the advantage of being objective and easily obtained, and I believe that by observing characteristics shared by the most highly cited bioinformatic papers, one can glean valuable guidelines for structuring research programs.
3. Collaborate, and do it with great collaborators. In my experience, collaborative projects are the best way to focus all of the necessary expertise on a biological question that has a major bioinformatic component. Also, it is essential (but not always easy) to find collaborators with whom you can be productive. Here is a list of 14 great collaborators that basically traces my career in bioinformatics. If you get an opportunity to work with one of these people, I recommend that you jump at the chance. Listed in approximately chronological order, they are: Gene Myers, David Lipman, Bill Pearson, Ross Hardison, Richard Gibbs, Eric Green, Ladeana Hillier, David Haussler, Jim Kent, Mathieu Blanchette, Adam Siepel, Tom Pringle, Bill Murphy, and Stephan Schuster. I also have a list of collaborators to avoid. Maybe someday over a beer I’ll give you those names.
4. Do not expect a warm welcome from everyone. Some biologists will welcome you with open arms, some will be pleasant but too busy, and some will resent your trying to get lots of money to work on “their” problem or experimental system. Issues may arise because a traditional biologist thinks of bioinformatics as a routine service. Once, on an NIH proposal-review panel, I remarked that a bioinformatics person applying for funding was not listed as a co-author on some of the papers where he claimed to be a collaborator. A biologist on the panel replied, “I never put my bottle-washers on my papers.” My first exposure to real hostility in this field came as quite a shock, because my initial collaborations had been so agreeable.
5. Be a good collaborator. Everyone knows how to do this. Basically, it is just what you learned in kindergarten: maintain humility and a sense of humor, do more than your share, and deliver on time. However, not everyone adheres to these rules; you probably remember that “other list” mentioned at the end of principle 3.
6. Distribute and maintain software and/or run web servers that you personally continue to use. When I look for the characteristics that distinguish the highly cited papers describing bioinformatic software from the less popular ones, this pops out as the strongest correlation. Writing software to be used only by collaborators or customers, or for a task that won’t interest you next year, just does not seem to work well; you may be able to get a publication out of it, but in 10 years it will probably look like a waste of time as judged by citation count.
7. Alternate between working on specific datasets and writing general-purpose software. This principle is implied by #6. In computer science jargon, the distinction is between an instance of a problem and a full-fledged computational problem. Thus, the folks responsible for Clustal have also written biology papers about protein families, where it just so happens that they used Clustal to do the analysis. Incidentally, papers that focus on important datasets are how you can become a media star (another potential objective function).
8. Write some of your own software. This is the most controversial of my suggestions. A bioinformatics leader about half my age once expressed amazement that I still write programs. However, relegating all programming to others causes problems, both for working on a specific dataset and for writing general-purpose software. When you get an idea for a program adjustment that might work better for the current dataset, you won’t want to wait until your student comes to the office to find out. More importantly, when a change is needed in software you have been distributing for five years, you won’t want to be without the services of someone who knows the code.
9. Don't give up. I'm suggesting that you write papers, look at them in 5-10 years to see how many times they were cited, and adjust your research program accordingly. Obviously, this takes time. Fortunately it gets easier as you learn more biology, develop a pool of collaborators, and get a reputation for playing fair.
10. Be excited about your work. This is essential for maintaining a long-running research program. We all know that “burn-out” happens. The trick is to see when it is headed your way and make the sacrifices needed to avoid it. Over the years, different strategies have worked for me; below I outline my current approach.
How I came to adopt these principles: My professional career led me to bioinformatics by an extremely circuitous route, along which I experimented with a number of approaches for picking research problems. It was a long time before I adopted principles 1-10.
I went to a small liberal-arts college, Whitman College in Walla Walla, Washington, where I eventually majored in mathematics. As a junior, I realized I was running out of classes that I wanted to take, so I started exploring other topics and became interested in the idea that certain well-formulated tasks cannot be solved by any computational method whatsoever. This was well before NP-completeness had been defined, and the ideas were more akin to the unsolvability of the Turing-machine halting problem. As a senior, I succeeded in writing a publishable paper. Thus began a period of many years when a major criterion for picking my research problems was that nobody was working on anything too similar. In some sense, my goal was to minimize the number of papers cited in my publications, given that one typically cites papers on a similar topic. After obtaining graduate degrees in mathematics and starting a job as an Assistant Professor of Computer Science, I made a major shift in research direction, toward reasoning about floating-point computation, but retained the ideal of working in isolation. All this was fairly successful by many criteria: I published papers and a research monograph, regularly obtained NSF funding, and became a Full Professor at a university next to a warm beach.
Next, I moved to the Computer Science Department at the University of Arizona, where I spent most of four years writing textbooks. The two most important developments were that I started writing a book on Unix-related software tools (Miller 1987), which did wonders for my programming skills, and I met Gene Myers. Gene had started working in what is now called bioinformatics while still a graduate student, and after leaving Tucson I eventually followed his lead.
In 1985 I returned to Penn State, where I had gone for my first faculty position. When interviewing in 1985, I told Penn State officials that I would spend a year finishing a book and then make a radical career shift, though at that time I had no idea where I would land. Gene Myers took a sabbatical leave at Penn State in the academic year 1987-1988, which started my career in bioinformatics. Over several years, Gene and I coauthored a number of papers, of which I remain proud, but the motivation was still from a computer scientist’s perspective – elegant proofs of difficult results were often a main goal. However, this experience set me on the path to adopting the 10 principles. Ross Overbeek once told me that it took him three years to become a biologist, but it took me much longer.
During Gene’s time at Penn State, we contacted a few people with similar interests whose names we knew from reading their papers. One of those people was David Lipman, who visited Penn State to give a colloquium. Meeting David led to my involvement on the two main papers about Blast (Altschul et al. 1990, 1997), and guaranteed that I’ll never have to worry about having enough citations. However, to be honest, the reason those papers have been wildly successful is that for 20 years, scientists at the National Center for Biotechnology Information have labored tirelessly to provide Blast access to GenBank.
There isn’t space here to recount all of the marvelous collaborations that followed, but I will mention the three most important ones. Since 1991 I have collaborated with Ross Hardison, a prominent Penn State biologist. Because of his influence, I picked the computational problem that has occupied me for nearly 20 years – how to compare long DNA sequences. We have bounced back and forth between using sequence-comparison tools to learn about biology on one hand, and building the tools to facilitate that job on the other. We are still co-investigators on each other’s NIH grants. The biological question lying at the center of these oscillations is gene regulation. In particular, our goal has been to use inter-species sequence conservation to help predict the location and properties of genomic signals that regulate gene transcription.
David Haussler is another fantastic collaborator. I started working with David and Jim Kent on the mouse genome project, around the year 2000, using a pairwise aligner called Blastz (Schwartz et al. 2003) from my lab. (The latest version, called Lastz, is still under active development, driven by the anticipated read-length increase for ultra-short-read sequencing technologies.) We then moved quickly to the rat genome. The computational problem was to compare three genome sequences – a moderately well assembled human genome, a mouse genome that was not so accurate, and a rat sequence that initially was relatively unreliable. I built a program called Multiz to align those three sequences; it and an alternative tool are described by Blanchette et al. (2004). The Santa Cruz Browser has been a godsend for my software-development program; through it, alignments produced by my tools (e.g., Miller et al. 2007) have reached a huge audience. Most recently, the software has been used to align 44 vertebrate genomes, which is far beyond the original design specifications, and new ideas will be needed long before we reach David’s dream of having 10,000 vertebrate genome sequences.
A third superb collaborator is Stephan Schuster, another Penn State biologist. When we met, in 2004, we both began looking for a way to collaborate. Stephan’s expertise was in sequencing, with a focus on bacterial genomes, while my interests were limited to mammals, or at least vertebrates. On November 18, 2005, Stephan bridged the gap when he walked into my office with sequence data from a woolly mammoth, and my career immediately took a 90-degree turn. Our biological motivation is extinction, that is, we want to learn about the genomic properties that affect or are affected by the process of extinction of species. The technical challenges include sequencing 50,000-year-old specimens and accelerating sequence analysis to keep pace with faster sequencing methods. I must admit to having strayed from some of the principles stated above, but starting projects that require years to reach fruition is not very practical now that I’ve reached 65. On the other hand, my projects with Stephan have been effective at attracting citations in the popular media. For instance, our paper on the woolly mammoth genome (Miller et al. 2008) was chosen as one of Time Magazine’s Top 10 Science Stories of 2008. Our project to understand extinction is continuing (e.g., Miller et al. 2009), and we are turning to the problem of preventing the extinction of endangered species.
This is not to say that I have given up on my bread-and-butter projects. Ewan Birney has initiated a friendly competition, actually more like a collaboration, to see which of UCSC and Ensembl can produce the best set of whole-genome vertebrate alignments. We’re working to keep ahead of him, though in any case, all will surely benefit.
My current approach to staying excited about research is to mix long-term projects, which provide the continuity that Ph.D. students need, with novel, short-term projects where I am a main programmer and sometimes the only Penn State participant (e.g., Murphy et al. 2007). Without the latter kind of project, i.e., if I were simply managing other programmers, I would find it much more difficult to get out of bed in the morning.
A variety of criteria can be used to measure the success of a career in scientific research, including the numbers of publications, Impact Factor of the journals where you publish, citations to your papers, research dollars, employees, invited talks, graduated Ph.D. students, service to the professional community, or appearances of your name in the New York Times. I know – I’ve tried most of them. Once you settle on an “objective function”, there are many optimization strategies. Here I describe the criteria and strategies that have worked best for me. I offer these observations in the hope that some of you will re-think your basic assumptions. Or at least I might help convince you that it is never too late to find a new and exciting scientific project.
References
S. Altschul, W. Gish, W. Miller, E. Myers and D. Lipman (1990) A basic local alignment search tool. Journal of Molecular Biology 215, 403-410.
S. Altschul, T. Madden, A. Schaffer, J. Zhang, Z. Zhang, W. Miller and D. Lipman (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25, 3389-3402.
M. Blanchette, E. Green, W. Miller and D. Haussler (2004) Reconstructing large regions of an ancestral mammalian genome in silico. Genome Research 14, 2412-2423.
W. Miller (1987) A Software Tool Sampler, Prentice-Hall, Englewood Cliffs, New Jersey.
W. Miller, K. Rosenbloom, R. C. Hardison, M. Hou, J. Taylor, B. Raney, R. Burhans, D. C. King, R. Baertsch, D. Blankenberg, S. L. Kosakovsky Pond, A. Nekrutenko, B. Giardine, R. S. Harris, S. Tyekucheva, M. Diekhans, T. H. Pringle, W. J. Murphy, A. Lesk, G. M. Weinstock, K. Lindblad-Toh, R. A. Gibbs, E. S. Lander, A. Siepel, D. Haussler and W. J. Kent (2007) 28-way vertebrate alignment and conservation track at the UCSC Genome Browser. Genome Research 17, 1797-1808.
W. Miller, D. Drautz, A. Ratan, B. Pusey, J. Qi, A. M. Lesk, L. Tomsho, M. Packard, F. Zhao, A. Sher, A. Tikhonov, B. Raney, N. Patterson, K. Lindblad-Toh, E. S. Lander, J. R. Knight, G. P. Irzyk, K. M. Fredrikson, T. T. Harkins, S. Sheridan, T. Pringle and S. C. Schuster (2008) Sequencing the nuclear genome of the extinct woolly mammoth. Nature 456, 387-390.
W. Miller, D. Drautz, J. Janecka, A. Lesk, A. Ratan, L. Tomsho, M. Packard, Y. Zhang, L. McClellan, J. Qi, F. Zhao, M. T. G. Gilbert, L. Dalén, J. Arsuaga, P. Erickson, D. Huson, K. Helgen, W. J. Murphy, A. Götherström and S. C. Schuster (2009) The mitochondrial genome sequence of the Tasmanian tiger (Thylacinus cynocephalus). Genome Research 19, 213-220.
W. J. Murphy, T. H. Pringle, T. A. Crider, M. S. Springer and W. Miller (2007) Using genomic data to unravel the root of the placental mammal tree. Genome Research 17, 413-421.
S. Schwartz, W. J. Kent, A. Smit, Z. Zhang, R. Baertsch, R. C. Hardison, D. Haussler and W. Miller (2003) Human-mouse alignments with Blastz. Genome Research 13, 103-107.
J. D. Thompson, D. G. Higgins and T. J. Gibson (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22, 4673-4680.