Exploring Molecular Evolution

If you have a computer with an Internet connection and access to a good library, you can do cutting-edge evolutionary genetics research without spending any money. Seriously. Sequencing genes has become a routine procedure in labs all over the world, and these sequences get submitted to a public database, GenBank. The number of sequences in GenBank is growing exponentially, and as of mid-2005 there were over 46 million of them. All of them have been analyzed to some extent, as part of the study for which they were determined, but there are all kinds of interesting comparisons among related sequences that the initial researchers never bothered with. So, with the help of free downloadable software, you can search the database and explore the evolution of genes in ways that perhaps no one has before. If you like biology and computers and are willing to put a little time into understanding what you're doing, you could have a hobby that probes the basis of life itself and potentially generates publishable scientific results, even if you are not a professional scientist (even if you're not an adult!). Who knows what you could discover?

To begin, pick an interesting gene. At the GenBank nucleotide search, you could search for genes from your favorite species by typing either its common name (e.g. "northern leopard frog") or its scientific name (e.g. "Rana pipiens"). Alternatively, you could search for a particular kind of gene by typing in the name of the protein it codes for, such as "hemoglobin" (which carries oxygen in your blood), "collagen" (which holds your body together and is used to make Jell-O), or "insulin" (which regulates sugar metabolism).

To study molecular evolution, you need multiple related sequences that can be aligned and compared. Your initial search might give you several "hits," but genes with similar names are not necessarily closely related. Copy one of your sequences and paste it into BLAST, which will search GenBank for closely related genes. Are the most similar genes from closely related species (usually they are, but not always)? Do they all seem to perform the same function (you can't always tell, but sometimes the names and notes about them give clues)? Is anything surprising?

Once you have several related gene sequences, you need to align them. The idea is to line up the similar parts of the sequences, which are presumably descended from the same ancestral sequence. BioEdit is a free program you can download for sequence alignment. The BioEdit website explains how the program works. How similar are your sequences? Are they more alike in some regions than in others? Regions of a gene that are especially important to its function are often highly conserved among sequences.
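
If you'd like to put numbers on "how similar" and "which regions," here is a minimal Python sketch. It assumes you have already aligned your sequences (in BioEdit, say) so that they all have the same length, with "-" marking gaps. The three short sequences and their names are made up for illustration, and the function names are my own, not part of any program mentioned here.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percent of columns that match, ignoring columns where either sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def conserved_columns(seqs):
    """Indices of columns where every sequence has the same non-gap character."""
    return [
        i
        for i, column in enumerate(zip(*seqs))
        if len(set(column)) == 1 and column[0] != "-"
    ]


# Hypothetical toy alignment, not real GenBank data.
alignment = {
    "frog": "ATGGCC-TGA",
    "toad": "ATGGCTCTGA",
    "newt": "ATGACTCTGA",
}

for name1, name2 in combinations(alignment, 2):
    identity = percent_identity(alignment[name1], alignment[name2])
    print(f"{name1} vs {name2}: {identity:.1f}% identical")

print("fully conserved columns:", conserved_columns(list(alignment.values())))
```

Runs of neighboring conserved columns are the places to look for the functionally important parts of the gene. Note that ignoring gapped columns is only one of several reasonable ways to treat gaps.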

To see how your sequences are related, you need to reconstruct a phylogeny. BioEdit has a limited ability to do this. Better phylogeny programs include PHYLIP and PAUP* (the latter is not free, but it is cheap, hence the name). If you are using genes from different species, does your phylogeny match the known evolutionary relationships among them (a good site that shows phylogenies among species is the Tree of Life)? If not, this could be evidence of a gene duplication or some other evolutionary phenomenon (for example, perhaps in one lineage this gene has evolved remarkably fast, making it quite dissimilar even from closely related sequences). Alternatively, your phylogeny might not be well supported (in other words, there's not enough information in the sequences to determine the true phylogeny with statistical confidence), and thus parts of it may be in error. Most phylogeny programs can estimate support using methods such as bootstrapping, so don't conclude too much before you look into that.
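
To give a feel for what bootstrapping does, here is a deliberately simplified Python sketch. Real phylogeny programs rebuild a whole tree for every bootstrap replicate. This sketch asks a much smaller question (which sequence is most similar to a chosen "focal" sequence?) but uses the same trick of resampling alignment columns with replacement. The sequences, names, and function names are all made up for illustration.

```python
import random
from collections import Counter


def p_distance(a, b):
    """Proportion of differing sites among columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def nearest_neighbor(seqs, focal):
    """Name of the sequence closest to `focal`; ties go to the alphabetically first name."""
    others = sorted(name for name in seqs if name != focal)
    return min(others, key=lambda name: p_distance(seqs[focal], seqs[name]))


def bootstrap_support(seqs, focal, replicates=1000, seed=0):
    """Fraction of column-resampled alignments in which each sequence is nearest to `focal`."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    length = len(next(iter(seqs.values())))
    counts = Counter()
    for _ in range(replicates):
        # Draw `length` column indices with replacement, then rebuild every sequence.
        columns = [rng.randrange(length) for _ in range(length)]
        resampled = {
            name: "".join(seq[i] for i in columns) for name, seq in seqs.items()
        }
        counts[nearest_neighbor(resampled, focal)] += 1
    return {name: count / replicates for name, count in counts.items()}


# Hypothetical toy alignment, not real GenBank data.
alignment = {
    "frog": "ATGGCCCTGAAGTCA",
    "toad": "ATGGCTCTGAAGTCA",
    "newt": "ATGACTCTGAGGTCT",
}

print(bootstrap_support(alignment, focal="frog"))
```

If one sequence comes out as the nearest neighbor in nearly every replicate, the data support that relationship strongly. If the answer flips back and forth between replicates, there is not enough information in the sequences to say.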

It's also possible to see how natural selection has affected the gene's evolution. Genes, which are strings of DNA bases, code for proteins, which are strings of amino acids. As a gene evolves, some base substitutions don't change the amino acid sequence. Presumably these have no effect on the fitness of the organism, and we call these synonymous substitutions. Other base substitutions, though, do change the amino acid sequence, and we call these nonsynonymous substitutions. The rate of synonymous substitution is (usually and approximately) equal to the mutation rate, whereas the nonsynonymous substitution rate will depend on natural selection. If most nonsynonymous mutations are harmful, then few will be fixed, and the nonsynonymous substitution rate will be less than the synonymous substitution rate. This is true for most genes in the genome. But if a gene is under positive natural selection to change, such that most nonsynonymous mutations are actually advantageous, then the nonsynonymous substitution rate will exceed the synonymous substitution rate. It is unusual to see this, and thus it's exciting when you find a gene family that shows this pattern. Free downloadable programs that look at nonsynonymous and synonymous differences among sequences include PAML, DnaSP, and MEGA.
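
To make the distinction concrete, here is a rough Python sketch that walks through two aligned coding sequences codon by codon and sorts the codons that differ into synonymous and nonsynonymous. It assumes the standard genetic code and that both sequences start in the correct reading frame. The sequences and the function name are made up for illustration. This is only a crude count of differing codons, not the substitution rates that programs like PAML estimate; those also account for how many synonymous and nonsynonymous sites there are and for multiple changes at the same site.

```python
from itertools import product

# Standard genetic code, with codons in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_differences(seq1, seq2):
    """Return (synonymous, nonsynonymous) counts of codons that differ.

    Codons containing gaps or ambiguous bases are skipped. A codon that
    differs at two or three positions is still counted only once.
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    synonymous = nonsynonymous = 0
    for i in range(0, len(seq1) - len(seq1) % 3, 3):
        codon1 = seq1[i:i + 3].upper()
        codon2 = seq2[i:i + 3].upper()
        if codon1 == codon2:
            continue
        if codon1 not in CODON_TABLE or codon2 not in CODON_TABLE:
            continue
        if CODON_TABLE[codon1] == CODON_TABLE[codon2]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


# Hypothetical toy sequences: GCT -> GCC keeps alanine, AAA -> AGA swaps lysine for arginine.
print(count_codon_differences("ATGGCTCTGAAA", "ATGGCCCTGAGA"))  # (1, 1)
```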

If you have a gene sequence that codes for a protein, translate it into the amino acid sequence using BioEdit. Each of the twenty amino acids can be represented by a single-letter code. These are colored according to their biochemical properties, but it would be useful to become more familiar with the features of the amino acids, which are discussed in any basic biochemistry textbook. What sorts of evolutionary changes in amino acids have occurred in your sequences? Do your amino acids tend to change to other amino acids of the same charge, or of similar size, or do these biochemical features vary? Find the sites where the amino acid is the same across all sequences. Are certain amino acids more likely than others to be conserved across sequences?
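
The same questions can be asked in a few lines of Python. The sketch below translates coding sequences with the standard genetic code and then labels each aligned site as conserved (identical everywhere), conservative (different amino acids from the same biochemical group), or radical (amino acids from different groups). The four groups are one common textbook simplification, and other groupings (by size, for instance) are just as defensible. The sequences and function names are made up for illustration.

```python
from itertools import product

# Standard genetic code, with codons in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Assumed grouping by side-chain chemistry; a simplification, not the only option.
GROUPS = {
    "nonpolar": "AVLIMFWPG",
    "polar": "STCYNQ",
    "positive": "KRH",
    "negative": "DE",
}
GROUP_OF = {aa: group for group, members in GROUPS.items() for aa in members}


def translate(dna):
    """Translate a coding sequence; unrecognized codons become 'X', stops become '*'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def compare_sites(proteins):
    """Label each aligned amino acid site as conserved, conservative, or radical."""
    labels = []
    for column in zip(*proteins):
        groups = {GROUP_OF.get(aa) for aa in column}
        if len(set(column)) == 1:
            labels.append("conserved")
        elif len(groups) == 1 and None not in groups:
            labels.append("conservative")
        else:
            labels.append("radical")
    return labels


# Hypothetical toy coding sequences, already in frame and of equal length.
genes = ["ATGGCTCTGAAA", "ATGGCCCTGAGA", "ATGGCTCTGGAA"]
proteins = [translate(gene) for gene in genes]
print(proteins)                     # ['MALK', 'MALR', 'MALE']
print(compare_sites(proteins[:2]))  # last site is K vs R, both positively charged
print(compare_sites(proteins))      # adding E (negatively charged) makes the last site radical
```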

Obviously, molecular evolutionary theory is complex, and I can't explain everything on this webpage. I'm just hoping to show that exploring molecular evolution on your own is possible and can be fascinating; if I sparked your curiosity, you'll probably need to do a lot of background reading. One good reference is Molecular Evolution by Wen-Hsiung Li. In addition, not all of these programs are as user-friendly as they could be, and you'll need to take some time to read the manuals and maybe some papers that have used these methods. A good way to find papers is through PubMed, on the same website that maintains GenBank (NCBI). For example, if you want to find research papers that used PAML in their methods, search for "PAML." You'll need access to a good academic library to actually read the papers, unless you want to buy them individually from the publisher.
