Scott Edwards (Harvard) Part 1: Gene trees and phylogeography
August 30, 2019
Hi, my name is Scott Edwards and I’m a professor of organismic and evolutionary biology at Harvard University. And today we’re going to learn about gene trees and phylogeography. Phylogeography is a very exciting field combining genetics and ecology and increasingly, information from geographic information systems. And so we’re going to learn how to combine genetic variation with these other fields how to measure genetic variation and how to use the patterns in the gene trees to infer something about population history. So, what is phylogeography? We’re going to first talk about that and then we’re going to talk about how do we measure genetic variation within a species. This is really important because the amount of genetic variation within a species can tell us a lot about how healthy that species might be, whether it has the possibility of adapting to environmental change such as climate change. And then, we’re going to talk about an interesting way of looking at genetics diversity and that is through so called gene trees. As we’ll see, gene trees are a very intuitive and natural way to look at genetic variation within a species. And then finally, we’re going to talk about how do we look at natural selection. How can we tell whether genetic variation is not being influenced by just neutral processes, just random processes of population genetics but might actually be influenced by deterministic processes like natural selection. So, to begin, phylogeography is using genetic variation and looking at it in the context of gene trees or genealogies of alleles and linking those genealogies with the geography of the genetic variation. As its name implies, phylogeography combines a little bit of phylogenetics, looking at the relationships of alleles and genes to each other, as well as geography. How do we lay those lineages on a map and infer something about the history of populations? 
So let’s dive right in and talk about some of the history of phylogeography and why it’s such a special field today. Looking at genetic variation within species is an old pastime. It’s answered a lot of riddles about the nature of species themselves. And this slide here shows a picture of a type of data called allozyme electrophoresis. It was a very common tool used in the 1960s and early 70s to look at the amount of genetic variation within species. What you can see here are bands on a gel. You can see that the bands vary in their distance from the top to the bottom. And what these are is protein variants. Each lane, going from top to bottom, represents a different individual and the different bands you can see are the different alleles within one particular locus, one particular gene, in this case, in the fruit fly Drosophila pseudoobscura. What you can see is that there is a lot of genetic variation; different individuals look different. These alleles are being separated by their charge, not the size of the molecule, but the charge: whether it’s a negatively charged allele or a positively charged allele. And you can see there’s a lot of genetic variation. And this was scientists’ first view of the amount and nature of genetic variation within species. It was a very exciting time and, as you can see, this paper was published in 1966. Let’s fast forward now to a different approach which was introduced to the study of natural populations in the late 1970s. One of the pioneers in this field was a man named John Avise, and this image is from a book that he published in the mid-90s summarizing these results. This new technique was called restriction enzyme analysis. It used what molecular biologists call restriction enzymes to cut DNA into fragments. 
Each restriction enzyme (these are proteins that come from bacteria) comes from a specific species of bacteria and cuts the DNA at a specific recognition site, which is indicated by anywhere from 4 to usually 6 letters in the DNA sequence. Every time that protein sees that sequence of letters, it will cut the DNA at that site. And what you can see again on this gel is that each lane going up and down represents a different individual. In this case, we’re using the restriction enzyme called EcoRI and that tells us it’s from E. coli. You can also see on this gel that there’s a lot of genetic variation between individuals. You can see that the different lanes look different. So, for example, in lanes D, C, and E, you can see there’s a different banding pattern than in lanes A and B. And what that means is that lanes D, C and E have an extra sequence in their genome that allows this protein to cut there. It’s a way of detecting genetic variation. And this new way of detecting variation was quite different from the previous way I showed you, allozyme electrophoresis. And critically, as you can see on the bottom of this slide, by cutting the DNA in particular places, we can begin to form ideas about how the different variants are related to each other. They’re not simply alleles in the population that are different from each other and that we can score as the same or different. We can actually begin to relate those alleles to each other in a simple genealogy. And that’s what you see at the bottom of the slide. You can see a network of connections between the different so-called haplotypes indicated in each lane of this gel. What we’ve done is move away from what we could call the older population genetics, which was based primarily on what we call allele frequencies. We could count the number of individuals that have allele A in a population, maybe 40%, and the number of individuals that have allele B, which may be 60%. 
And for a long time, that’s how population geneticists and phylogeographers would make inferences about population history. Allele A would be present at 40% in one population, 60% in another population, and from that we could make inferences about what had happened to those populations. That was the old way of doing things. The new way, which I think is much more exciting, takes into account the genealogy of the alleles themselves. Instead of just counting the frequencies of alleles in populations, we can actually relate those alleles to each other in a genealogical tree or in a phylogenetic tree. So the new view…of course we can calculate the frequencies if we want, but we have this additional power of connecting the alleles together in a genealogical tree, and as you’ll see, this allows us to really infer some interesting things about population history. Also, an interesting difference from the earlier population genetics is that the older population genetic approaches tended to think of genetic variation going forward in time. We would simulate what gene frequencies would look like from an ancestral population to a more recent population. This new approach is actually called coalescent theory, because the alleles are coalescing into common ancestors. In this new approach, we’re actually going to think of it going backwards in time. We can actually make inferences from just the sample of alleles that we have collected. And that greatly simplifies thinking about phylogeography and population genetics. We don’t have to worry about the entire population, just the sample that we have looked at in our particular study. OK, so here’s an example of a gene tree. This is, in fact, the first gene tree, published in the late 1970s by John Avise and colleagues. It relates alleles in a group of rodents called deer mice. You can see Peromyscus polionotus, Peromyscus maniculatus, and Peromyscus leucopus. 
And the particular gene we’re looking at is a very important molecule called mitochondrial DNA. And this was a really remarkable paper because it showed how you could relate different alleles to each other in a genealogical tree. We won’t talk about how we actually construct those relationships. That’s another field called phylogenetics. Take it for now that we can assemble the different genetic variants into a genealogy. And what you can see here, for example, is that the alleles within maniculatus are most closely related to the alleles in polionotus. They share a common ancestor. Outside of that we have additional alleles from Peromyscus leucopus, which is the most distantly related species in this group. This opened up a whole new way of looking at genetic variation and was really quite exciting for starting off the field of phylogeography. Now, an important concept to think about in much of the subsequent discussion is this idea of effective population size. It’s a very important term in population genetics and in phylogeography. It’s usually abbreviated as a capital N or a capital N sub e. What is effective population size? The effective population size of a species is the size of an ideal population whose dynamics, whose evolutionary and population dynamics, mimic the actual population that we’re studying. The assumption here is that no population in nature is simple. There are a lot of complicated dynamics going on. Populations are going extinct and being re-colonized by new individuals. Populations are experiencing flow of genes into them and out of them. Populations are changing in size constantly. All of these are very complicated dynamics. What we want to do is to summarize a basic characteristic of those real populations in a single variable, and that variable is effective population size. It’s the size of a very simple population that has the same dynamics as the very complex population that we’re studying in nature. 
So, by dynamics I mean things like how much genetic diversity there is in a population, what the rate of loss of genetic variation is if that population becomes isolated from other populations, or it can be something like the change in allele frequencies over time in those real populations. By ideal, I mean a single population where there’s no natural selection at all. Everything is neutral, meaning allele A and allele B don’t change the fitness of the individuals bearing those alleles. Every allele is, in terms of fitness, the same. There’s no population structure, meaning there’s no subdivision of this population into finer units. And we have completely random mating within that single population. So, it might be surprising that we can actually encapsulate all the complexities of a real species into a single variable and, in fact, there are a lot of assumptions that go into this exercise. But, still, the effective population size is a very important way of describing populations. Let’s look at some examples of effective population size and how they differ from real populations. Here’s an example of an effective population size which is based on the sex ratio in the population. For example, say you have an elephant seal, and we know of course that elephant seals typically have one or a few males that mate with all of the females in the population. So, you may have 25 individuals total, with 1 of them being a male and 24 of them being female. So our census size in that case is 25, the actual number of individuals. However, the effective population size of that species one can calculate by the equation that you see, which is simply 4 times the number of males, which is 1, times the number of females, which is 24. So that’s 4 times 24, which is 96. Divide that now by the sum of the number of males and females, which is 25. And what you’re going to get is 3.84, a number approximately 4. 
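The elephant seal arithmetic above can be checked with a few lines of Python. This is a minimal sketch of the sex-ratio formula as described in the talk (4 times males times females, divided by their sum); the function name is hypothetical and chosen only for illustration.

```python
def sex_ratio_effective_size(n_males: int, n_females: int) -> float:
    """Effective size under an unequal sex ratio: 4 * Nm * Nf / (Nm + Nf)."""
    if n_males <= 0 or n_females <= 0:
        raise ValueError("need at least one breeding male and one breeding female")
    return 4 * n_males * n_females / (n_males + n_females)


# Elephant seal example from the talk: 1 breeding male, 24 females.
census_size = 1 + 24
ne = sex_ratio_effective_size(1, 24)
print(census_size, ne)  # 25 3.84
```

With an equal sex ratio the formula returns the census size itself (for example, 50 males and 50 females give 100), so the shortfall here comes entirely from the skew toward a single breeding male.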
And now you can see how the effective population size is much, much smaller than the census size which is 25. And that’s because there’s this deviation from equal sex ratios. In this population we’ve got 24 females and a single male. Let’s look at another example of effective population size. Here’s a case where we have an isolated population. It’s not getting any new genetic diversity from other populations. And it’s also not getting any new diversity through new mutations. Let’s talk about what’s going to happen to genetic diversity in that population. It’s going to decline over time just by chance. Alleles will get lost. Individuals with those alleles won’t breed. And you can see here that our measure of diversity, which is H, how that declines with time, which is the little t (time is measured in generations), and, you can see our friend N, our effective population size. We can see how we would predict the loss of heterozygosity over time with each generation if this population were an ideal population of a certain size. And that’s what you can see on the graph here and that is an example of effective population size based not on males and females, but now based on the loss of genetic diversity over time. Another example is what we call the variance effective population size. In this case, the variance in the number of matings of say males will cause the effective size to shift from the census size. This effective size is similar to the sex ratio effective population size that we saw 2 slides ago. In this effective population size, N is now our number of breeding individuals in the population. Could be our census size. V sub k is the variance in the number of matings, for example, per individual or per male. When the variance is very high as it often is in species like these social birds. We have a Sage Grouse up on top and a Red-winged Blackbird below. The Sage Grouse is a lekking species. Again, like the elephant seals, a few males get all of the matings. 
Red-winged Blackbirds are polygamous and so again, one male may have several females in its territory. So the variance in the mating successes is going to be quite high. Some males are going to get a lot of matings. Other males will get very few matings. That will cause the variance to increase. And as you can see from this equation, if the variance increases, the effective population size will decrease. So, again, as is usually the case, the effective population size is much less than the census size; much less than the actual number of individuals who breed at least once in the population. Finally, our last example is how the effective population size changes, if the population size is changing over time. What you can see here are four populations sampled at four different times. Each one has a census size; N sub 1, N sub 2, N sub 3 etcetera. You can see that the size of this population is changing. In the time intervals 2 and 3, the population is quite small whereas in 1 and 4 its much larger. We can encapsulate all those dynamics and changes of population size into a single number by looking at what’s called the harmonic mean of the population sizes that we’ve sampled. The harmonic mean is illustrated by the equation at the bottom. It turns out that the harmonic mean is dominated by small Ns, by small populations. It’s essentially the reciprocal of the average of the reciprocals of the population size. And, the fact that we have bottlenecks at generations 2 and 3 tells us that overall, our effective population size is going to be much smaller than the actual sizes. And in this case, our effective population size is going to be dominated by those time intervals when the population size was small. OK, so we’ve gone through effective population size and it’s a really important concept and that’s why I spent some time on it. It’s a really important concept for measuring genetic diversity. 
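Two of the effective-size ideas above can be written down directly. The sketch below assumes the standard textbook forms, which may not be laid out exactly as on the slides: heterozygosity in an isolated ideal population decays as H_t = H_0 (1 - 1/(2N))^t, and the effective size across fluctuating generations is the harmonic mean of the per-generation sizes. The function names and the four population sizes are hypothetical, chosen to mimic a bottleneck in intervals 2 and 3.

```python
def heterozygosity_after(h0: float, ne: float, generations: int) -> float:
    """Expected heterozygosity after t generations of drift with no
    mutation or migration: H_t = H_0 * (1 - 1/(2*Ne))**t."""
    return h0 * (1 - 1 / (2 * ne)) ** generations


def harmonic_mean_ne(sizes):
    """Harmonic mean: the reciprocal of the average of the reciprocals."""
    return len(sizes) / sum(1 / n for n in sizes)


# Hypothetical sizes: large, bottleneck, bottleneck, large.
sizes = [1000, 10, 10, 1000]
arithmetic_mean = sum(sizes) / len(sizes)  # 505.0
ne = harmonic_mean_ne(sizes)               # a little under 20

print(arithmetic_mean, round(ne, 2))
print(heterozygosity_after(0.5, ne, 10))   # diversity left after 10 generations
```

The harmonic mean lands close to the bottleneck sizes rather than near the arithmetic mean, which is the sense in which small population sizes dominate.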
And what we’re going to do now is talk about two ways of measuring genetic diversity and remember, genetic diversity is important because it’s a basic description of the genetics of a population. Our first measure of genetic diversity is going to be called theta. It’s the Greek letter θ. Theta is equivalent to 4 times the effective population size times the mutation rate; that’s our little mu you can see there. N is usually a fairly big number, often 100 or 1000. Mu might be a fairly small number like 10 to the -6 or 10 to the -7 depending on how we’re measuring it. Multiply those together, and we get what’s called the population mutation rate. Essentially, it’s the number of new mutations occurring in a population each generation. And what I’ve shown you here is a way of measuring this from very straightforward genetic data from a population. So, let’s say we have sampled 3 alleles, 3 forms of a gene, in a population. What we’re going to do is count the number of differences, the number of single nucleotide differences between those alleles. In this example here, you can see there are 3 variable sites. So our S is going to equal 3. You can see at that first nucleotide, the A in the others differs from the G in allele 3. Later on we’ve got allele 1 differing with a C where the others have a T. And then finally, in that second to last site, we’ve got allele number 3 differing with a C where the others have a T. So, S is 3 and we can simply calculate theta by plugging it into this straightforward equation, where we put a 3 in the numerator and below we’re going to calculate a sum of reciprocals, a harmonic series, where we go from i equals 1 all the way up to i equals n minus 1, where n is our sample size. In our case, n is 3, so the sum stops at i equals 2. So, we’re going to simply divide 3 by this expansion of 1 over 1 plus 1 over 2, which is 1.5, giving a theta of 2. 
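Here is a minimal sketch of that calculation in Python. The three sequences are hypothetical: the slide’s actual sequences are not in the transcript, so these were made up to reproduce the pattern described (allele 3 differs at the first site, allele 1 at a middle site, allele 3 again at the second-to-last site). The function names are also illustrative only.

```python
def segregating_sites(alleles):
    """Count alignment columns where at least two alleles differ (S)."""
    if len({len(a) for a in alleles}) != 1:
        raise ValueError("alleles must be aligned to the same length")
    return sum(1 for column in zip(*alleles) if len(set(column)) > 1)


def watterson_theta(alleles):
    """S divided by the sum of 1/i for i = 1 .. n-1, with n the sample size."""
    n = len(alleles)
    if n < 2:
        raise ValueError("need at least two alleles")
    a_n = sum(1 / i for i in range(1, n))
    return segregating_sites(alleles) / a_n


# Hypothetical aligned alleles with three variable sites.
alleles = [
    "ACGCTATG",  # allele 1
    "ACGTTATG",  # allele 2
    "GCGTTACG",  # allele 3
]

print(segregating_sites(alleles))  # 3
print(watterson_theta(alleles))    # 3 / (1 + 1/2) = 2.0
```

This version returns theta for the whole locus; dividing by the sequence length would give a per-site value, which is the form usually used when comparing loci of different lengths.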
This estimator, S divided by that sum, is a very simple way of measuring the population mutation rate. It was developed by Watterson and is also called Watterson’s theta. So it’s a very straightforward measure. We can sequence DNA, count the differences, and estimate the combination of effective population size and mutation rate. And the important thing to remember is that theta is always going to be a locus-specific estimate. Some genes will have a high mutation rate. Some genes will have a low mutation rate. Most genes under neutrality should all have the same effective population size, although we’ll see some interesting differences from that later on. But, the N should always be the same from gene to gene. What might differ though is the mutation rate, and so it’s important to remember that theta is always a gene-specific measurement. OK, another way of calculating theta: same parameter but calculated a different way. We’re also measuring 4 times N times mu and again we’re going to have three alleles in our population. This measure, which is called pi, is going to take advantage not of the number of polymorphic sites that you saw in the earlier slide; instead we’re going to look at what is called the number of pairwise differences, the number of differences between each pair of alleles. Now, again, if we sample three alleles, we have three different pairs of alleles. We have 1 versus 2, 1 versus 3, and 2 versus 3. So three different pairs. We can count the number of differences between each of those pairs and those numbers of differences are indicated by the ks: k sub 1,2; k sub 2,3; k sub 1,3. That’s what’s in the numerator. In the denominator we have what is written as three choose two. How many ways are there to choose two objects when we have three? It turns out that that number is simply 3. Right? There are three pairs in our sample. So, again, it’s a very simple measurement. We’re going to take the average of the pairwise differences among those three pairs. 
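The pairwise measure just described can be sketched the same way. The sequences below are hypothetical (the slide’s own sequences are not in the transcript) and were chosen to have three variable sites among three alleles; the function names are illustrative.

```python
from itertools import combinations


def pairwise_differences(a: str, b: str) -> int:
    """Number of sites at which two aligned alleles differ (one k value)."""
    if len(a) != len(b):
        raise ValueError("alleles must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def mean_pairwise_difference(alleles):
    """Sum of k over all pairs, divided by the number of pairs (n choose 2)."""
    pairs = list(combinations(alleles, 2))
    if not pairs:
        raise ValueError("need at least two alleles")
    return sum(pairwise_differences(a, b) for a, b in pairs) / len(pairs)


# Hypothetical aligned alleles.
alleles = [
    "ACGCTATG",  # allele 1
    "ACGTTATG",  # allele 2
    "GCGTTACG",  # allele 3
]

# k_12 = 1, k_13 = 3, k_23 = 2, so the average over 3 pairs is 2.0
print(mean_pairwise_difference(alleles))  # 2.0
```

Identical copies simply contribute a k of 0 to the sum while still counting as a pair in the denominator.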
The denominator is simply our average: we’re going to divide by the total number of pairs. So our overall measurement is simply the average pairwise difference between all alleles in the sample. Even when those alleles are the same, we want to count them as separate copies. So those will enter as 0 for k. And so this is a second way to measure theta, called pi, and it’s very simple. You can sequence DNA from several individuals, look at the genetic differences, and it’s a very quick window into this combination of mutation rate and effective population size. Now, we won’t be getting into it, but suffice it to say that there are some very creative ways of comparing pi and theta, our two measures of genetic variation, to ask whether or not the population is experiencing natural selection. It turns out pi and theta should be the same under neutrality. You should get the same number. It turns out that if those numbers differ, that’s one signal that the population could be experiencing some deviation from that ideal model. It may be that there’s natural selection going on. It may be that there’s some other deviation, like population substructure, maybe two populations when you thought you just had one. We won’t be going into detail on that measure, but it’s called Tajima’s D. It’s a very common measure for looking at natural selection when you have genetic variation from natural populations. OK, let’s look at some examples of theta, or what’s also called nucleotide diversity: the amount of genetic variation within a species. You can see on this slide nucleotide diversity measured in a bunch of different mammalian groups. For example, you can see Dasyuromorpha, which are a group of marsupials. You can see Carnivora, which are carnivores like cats and dogs. You can see Cetartiodactyla, which is the big group to which hoofed mammals and whales belong. They’re actually related to each other. 
You can see Primates, Chiroptera or bats, and what you can see is that nucleotide diversity, on average, differs quite a lot between these different groups. In the rodents, at the far right, you see very high genetic variation, very high levels of nucleotide diversity. Whereas the groups towards the left, the Perissodactyls and the Dasyuromorpha, have very low genetic diversity. And the question is, why do these groups of mammals differ in their diversity? It turns out there are multiple explanations. Rodents may have a larger N. They may have a larger effective population size and be able to sustain more genetic variation than other groups. It may also be that for some reason the mutation rates in rodents are higher on a per generation basis or on a per year basis than those of other groups. Remember, mutation rate will also influence nucleotide diversity. But surveys such as these are very important for getting a broad view of genetic variation within different groups. Here’s an example comparing the nucleotide diversity in many different mammalian groups to that of birds, and what you can see again is that the birds generally have lower genetic diversity within their species than do mammals. This is really, really interesting. It could mean that effective population sizes in birds are lower than those in mammals. It may also be that birds have a lower mutation rate overall, across their genome, across all their loci, than do mammals. And so, again, teasing out these different sources of differences in nucleotide diversity can be very challenging, but it’s really an essential part of understanding what this simple statistic might mean. OK, let’s talk now a bit about gene trees and another way of looking at genetic variation. One thing we’re going to see is that the gene tree is a measure of the relatedness of alleles to each other. We’ve seen that in some of the earlier slides. 
And two interesting patterns emerge from decades of studies of gene trees. One is that gene trees…the patterns in the relationships among alleles in a population don’t always match the species population histories that they’re sampled from. We might think that gene trees should be the same as the relationship of species to each other but in fact they can sometimes be different and we’ll see some examples of that. Second, that the divergence times of alleles as they coalesce back into the past, those divergence times can differ from the divergence times of the populations harboring them. It’s a very non-intuitive result that was first discovered through empirical analysis of simple molecules like mitochondrial DNA. What we are going to see is how the gene tree patterns sometimes differ from the population histories but how we can use these patterns to understand the nature of population divergence. So what you can see here are two examples of population history. In both examples you have single ancestral populations splitting into three populations. First, splitting into two at time T1 and then splitting one of the lineages into another two populations at time T2. What you can see in the image on the left is that the bold lines, i.e. the particular ancestry of alleles we’re looking at actually have a set of relationships that’s different from the population history. You can see that alleles sampled from species 2 and species 3 down there at the bottom those are actually more closely related to each other in the gene tree than are the two populations that branched off more recently in the scenario which are species 1 and 2. You can see that there’s a conflict between the genetic history and the actual population history. Now, on the gene tree on the right, you can see that, in fact, the gene tree history matches the species history. Both the gene tree and the species tree have populations 1 and 2 being most closely related and then population 3 branching off earlier. 
Now, this was a big surprise to geneticists when they discovered this phenomenon. It’s a phenomenon called incomplete lineage sorting, lineages being the genetic lineages percolating through populations. It turns out that we don’t need to invoke any special phenomena like hybridization or gene flow to explain these discordant patterns in the gene trees. We can actually simply invoke a succession of very rapid speciation events such that the genetic variation in the ancestral population doesn’t quite catch up, doesn’t come to fixation in the population, before the next speciation event has occurred, and it’s a perfectly normal process. It happens all the time in populations, especially when they diverge in rapid succession from each other. You can see here these parameters on the far right showing twice the effective population size and then also the time (T) divided by 4Ne. It turns out that if the interval between the speciation events is very, very short, on the order of or less than 4N generations, where N, again, is our effective population size, we’re going to have a good likelihood of seeing this incomplete lineage sorting. The alleles will not have sorted out in time. Another phenomenon that we see is this sort of overshoot of the coalescence time from the time of ancestry of the populations. So you can see, for example, that in both of these scenarios, the ancestry of the alleles (those bold lines) goes way farther back than does the ancestry of the actual populations. So you can see that the common ancestor of all the populations begins at time T1. It’s that upper line in both those scenarios, but you can see that the genetic variation, the genetic coalescence of alleles, goes back even farther. Of course it’s this coalescence that we see in genetic data when we sample it. And what we want to be aware of is that this is actually a somewhat older time point of divergence than the actual population history. 
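To put rough numbers on the point about short intervals between speciation events, here is a small sketch. It assumes the standard three-species coalescent result, which is not stated in the talk: with one allele sampled per species, a diploid ancestral population of effective size Ne, and t generations between the two splits, the chance that the gene tree disagrees with the species tree is (2/3)·exp(−t/(2Ne)). The function name and the example values are hypothetical.

```python
import math


def discordance_probability(t_generations: float, ne: float) -> float:
    """Probability that a gene tree for three species mismatches the
    species tree, assuming P = (2/3) * exp(-t / (2 * Ne))."""
    if t_generations < 0 or ne <= 0:
        raise ValueError("t must be non-negative and Ne must be positive")
    return (2 / 3) * math.exp(-t_generations / (2 * ne))


ne = 1000  # hypothetical effective population size
for multiple in (0, 0.5, 1, 2, 4, 8):
    t = multiple * ne
    print(f"t = {multiple} * Ne -> {discordance_probability(t, ne):.3f}")
```

Under this assumption, discordance is most likely when the interval is a small fraction of Ne and falls off exponentially as the interval grows, so mismatched gene trees become rare once the internal branch is many multiples of Ne long.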
And the reason for that is that the ancestral population had some genetic variation in it already before it underwent any split. It had some depth to its gene tree. It had some genetic variation. And so some of that is going to persist throughout all these population splits. In fact, we expect, on average, to see about 2N extra generations in this gene tree, which will overshoot the actual population history by 2N generations. So these are just some interesting patterns that phylogeographers learned about once they started investigating patterns of genetic variation within species. Let’s talk about some population processes now. The first I want to point out is what happens when we look at multiple populations and we see a discord between the gene boundaries and the population boundaries. So, again, here’s a gene tree. You can see the labels at the top of the tree. Those are the populations from which the alleles were sampled. You can see population 1, population 2. You can see that this tree does not follow the population boundaries. There are some alleles from population 2, for example, that are more closely related to alleles from population 1. The alleles from population 1 don’t all cluster together and the alleles from population 2 don’t all cluster together. Again, this could have a number of sources. One could be incomplete lineage sorting. That’s one possible source of this pattern. Another source, as we’ll see, could be gene flow between populations: movement of individuals from population 1 to population 2 and successful breeding in that new site. So, both incomplete lineage sorting and gene flow can cause these discordances between the gene tree and the population tree. We can count the number of discordances in this gene tree by simply moving down the tree and noting when we have a population switch. So you can see where the black dots are. Those are instances where we have a switch from population 1 to 2 or population 2 to 1. 
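One way to do this counting by computer is Fitch parsimony, which finds the minimum number of population switches needed to explain the tip labels on a tree. That choice is an assumption made here for illustration; the talk does not say how the dots on the slide were placed. The trees below are hypothetical, written as nested pairs with each tip labelled by its population.

```python
def count_population_switches(tree) -> int:
    """Minimum number of population switches on a binary gene tree.

    A tree is either a population label (a string, for a tip) or a pair
    (left_subtree, right_subtree).
    """
    def fitch(node):
        if isinstance(node, str):
            return {node}, 0
        left, right = node
        left_set, left_cost = fitch(left)
        right_set, right_cost = fitch(right)
        shared = left_set & right_set
        if shared:
            # Children can agree on a population: no extra switch needed.
            return shared, left_cost + right_cost
        # Children disagree: one switch must occur on a branch below here.
        return left_set | right_set, left_cost + right_cost + 1

    return fitch(tree)[1]


# Hypothetical gene trees for two populations, "1" and "2".
interleaved = ((("1", "1"), "2"), (("2", "1"), "2"))
sorted_tree = (("1", "1"), ("2", "2"))

print(count_population_switches(interleaved))  # 2
print(count_population_switches(sorted_tree))  # 1
```

When each population’s alleles form their own cluster, the count is 1, the single switch separating the two groups; interleaved alleles push the count higher.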
And that number is going to be useful for understanding the dynamics of these populations. So here is a case where we’re going to use s, which you see on the y-axis. This is a different s than what we saw earlier. I hope you’re not confused. S is on the y-axis and on the x-axis we see time since divergence of the two populations. That divergence might be very recent, in which case t over N is zero, or it may be very old, in which case t over N is large. But you can see how we expect s, this number of intercalations, to be very high at the beginning of the speciation event, and it will gradually decline to 1 as the divergence time increases. So this is a way of using this discordance to understand how old two populations are. How long ago did they split? Here’s an example from primates where we’re going to actually see a discordance between the gene tree and the population tree in real data, and in fact data that’s very close to our hearts because it involves humans. Here is a gene tree embedded within the actual population history. What you can see is that the gene tree matches the species tree. Humans and chimpanzees are most closely related, followed by gorillas and orangutans. You can see that we’ve looked at the number of generations between the splits, indicated by t. So, we have a gene tree that’s compatible with the species tree. This is in fact what we see at the majority of genes in the human genome. However, we also see situations where the gene tree doesn’t match the species tree. Here’s a case where you can see that there’s an allele sampled from a chimpanzee which traces back not to an allele from humans but to an allele from gorillas. We know that gorillas and chimps are not each other’s closest relatives. Chimps should be most closely related to humans. Nonetheless, some of their genes trace back to an ancestry with gorillas. This, again, is entirely expected. It could be due simply to rapid speciation as we talked about before. 
However, scientists are increasingly thinking that hybridization in the past could have led to such patterns. So it’s still a point of debate among scientists. Again, we’re going to count the number of these interspecific coalescence events, our s. This time not to look at the age of populations but to look at how much gene flow, how much movement of individuals, is going on between populations. Here we’re going to use s as an index of gene flow. Again, you can see s on the y-axis and you can see our measure of gene flow, which is N (effective population size) times m, which in this case is the fraction of the two populations which are exchanging migrants. So m is going to be something like .2 or .4. Again, N is going to be something like 100 or 1000. Multiply those together, and we get the average number of migrants exchanged between populations per generation. So a number like .5 or 1 or 2. You can see that as s increases, our estimate of gene flow (Nm) also increases. However, it gets challenging to distinguish moderate levels of gene flow, where Nm is about 20, from very high levels of gene flow, where Nm is say 40. It’s very tough to tell the difference between those two with population genetic data. OK, let’s look at an example where we are using s to look at gene flow in the next slide. Here we have some work that I did in Australia on a group of birds called Grey-crowned Babblers. You can see I have sampled these populations throughout many regions of Western and Northern Australia. You can see that the colors of each population on the map match the colors on the gene tree. And you can also see, on the gene tree, that not all the colors form individual clusters. The red cluster, for example, falls in two different places. You can see the orange lineages towards the bottom of the tree falling in multiple different places. You can see the light blue lineages not forming a single group. That’s lineage H. 
So, these could all be due to either incomplete lineage sorting or gene flow. In this particular case, gene flow seemed like the best explanation, and we can use our s statistic to estimate Nm, or the amount of gene flow between these populations. And of course this gene flow will erode the monophyly, the distinctness of each population. Let’s talk about another measure of genetic variation, and that is Fst. Fst is what’s called a fixation index, developed by Sewall Wright in the 1930s and 40s. We can use our measures of genetic variation, theta or pi, to estimate Fst. In this case, we’ve got multiple populations. We can use the theta calculated between different populations (that’s our subscript b) or theta calculated within individual populations (that’s our subscript w). The ratio of those tells us about how much genetic variation is found among different populations. If every population is very different, Fst is going to be very high. It’s going to be close to 1. By contrast, if every population is similar, is very much the same, all having the same genetic patterns, Fst is going to be very low, close to zero. We can use these measures of diversity to estimate this apportionment of genetic diversity. So, imagine this is the total variation within a species. If we look at individual populations and find just a small amount of genetic variation, with the consequence that most of the genetic variation is actually found between different populations, then populations are quite different and Fst is going to be very high, close to 1. By contrast, again if we imagine the total amount of genetic variation within a species, now let’s look at individual populations within that species, we might actually find a lot of variation within the individual populations, almost as much as we would find in the entire species. In that case, the amount of variation between different populations is very low. And in this case, Fst would be close to zero, very low.
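As a rough illustration of that apportionment, here is a minimal sketch that estimates Fst from aligned sequences using average pairwise differences, with Fst = (theta_b - theta_w) / theta_b. That form is one common sequence-based estimator and is an assumption here, since the exact formula from the slide is not given in the transcript. The function names and the short toy sequences are hypothetical.

```python
from itertools import combinations, product

def mean_pairwise_diff(pairs):
    """Average number of differing sites over a collection of sequence pairs."""
    diffs = [sum(a != b for a, b in zip(x, y)) for x, y in pairs]
    return sum(diffs) / len(diffs)

def fst(pop1, pop2):
    """Estimate Fst = (theta_b - theta_w) / theta_b for two populations."""
    within = list(combinations(pop1, 2)) + list(combinations(pop2, 2))
    between = list(product(pop1, pop2))
    theta_w = mean_pairwise_diff(within)   # diversity within populations
    theta_b = mean_pairwise_diff(between)  # diversity between populations
    if theta_b == 0:
        return 0.0  # no differences at all, so nothing to apportion
    return (theta_b - theta_w) / theta_b

# Hypothetical samples: two very distinct populations, then two similar ones.
distinct = (["AAAA", "AAAA", "AAAT"], ["GGGG", "GGGG", "GGGC"])
similar = (["AAAA", "AAAT", "AATT"], ["AAAA", "AATA", "ATTA"])

print(round(fst(*distinct), 3))
print(round(fst(*similar), 3))
```

The distinct pair gives a value near 1 because almost all differences lie between the populations, while the similar pair gives a value near zero because most of the variation is already present within each population.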
So it’s a basic measure of how distinct different populations are. We can use Fst to ask about natural selection. It turns out that if everything is neutral, we expect Fst to be the same for most or all of the genes in the genome. And we can use so-called outliers, Fst outliers, to target individual genes which might be subject to natural selection and might be deviating from neutral processes. So, here are some fictitious diagrams showing how Fst would be expected to change with the average allele frequency. We won’t worry about the details here, but this dotted line on the panel to the right indicates the expected maximum Fst given a particular allele frequency for a particular locus. Each dot in this diagram is a different gene from the same set of populations. What you can see is that above that dotted line we have two red dots which show an abnormally high Fst given their allele frequencies. These are good candidates for alleles that have been driven apart not just by neutral processes but by natural selection. They show conspicuously high differences in frequency between populations. What you can see on the left here is a distribution of Fsts over different loci. Again, we expect Fst to be the same, pretty much, among different genes, with a little noise. Much of population genetics is stochastic variation caused by genetic drift and the coalescence process. But we expect there to be a mode, a single peak for Fst. However, genes that are very high in their Fst or very low in their Fst are good candidates for loci under different kinds of natural selection. So let’s look at an example. Here’s an example from a fish called killifish. It’s very common across the Eastern US. You can see these researchers sampled them from several different sites along the Eastern US, in particular at so-called Superfund sites, which are highly polluted areas that may have caused natural selection in these fish because of their high levels of pollution.
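The outlier idea can be sketched in a few lines. The diagrams in the talk compare each locus to an expected maximum Fst, which is not reproduced here. As a stand-in, this sketch flags loci that sit far from the genome-wide median Fst, measured in median absolute deviations. The cutoff of 5 deviations, the function name, and the locus names and values are all hypothetical, chosen only to illustrate the logic.

```python
import statistics

def fst_outliers(fst_by_locus, n_mads=5.0):
    """Return loci whose Fst lies far from the median across all loci.

    Distance is measured in median absolute deviations (MADs); this is a
    simple stand-in for comparing each locus to a neutral expectation.
    """
    values = list(fst_by_locus.values())
    center = statistics.median(values)
    mad = statistics.median([abs(v - center) for v in values])
    if mad == 0:
        return {}  # no spread among loci, so nothing can be flagged
    return {
        locus: value
        for locus, value in fst_by_locus.items()
        if abs(value - center) / mad > n_mads
    }

# Hypothetical per-locus Fst values: most cluster near 0.1, one is very high.
loci = {
    "locus_01": 0.08, "locus_02": 0.11, "locus_03": 0.10, "locus_04": 0.09,
    "locus_05": 0.12, "locus_06": 0.10, "locus_07": 0.62, "locus_08": 0.11,
}

print(fst_outliers(loci))
```

A locus flagged this way is only a candidate for selection. Because drift alone produces some scatter, real studies compare against a neutral expectation for each allele frequency rather than a single cutoff.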
And what you can see is again a distribution of Fsts across many, many loci. Each point on these graphs is a different gene. You can see the allele frequencies on the bottom axis. On the y-axis you can see the Fst, and our expected maximum Fst is the line going across from left to right. You can see some conspicuous outliers. For example, locus B109, way at the top in that top panel, is showing a very high Fst given its allele frequencies of about 0.5. This tells us that that gene may be undergoing natural selection, possibly driven by the pollution in these Superfund sites. Similarly, on the bottom diagram you can also see a number of outliers that may be subject to natural selection. Here are some examples from human data. You can see that in this case these researchers looked at 8500 different places in the genome and they calculated Fst among different human populations. You can see that most of these loci show an Fst of about 0.1. That says that human populations are very, very mixed up. There are not a lot of genetic differences between different human populations or so-called human races. However, there’s a small number of genes, which you can see towards the left of this diagram, which show a very high Fst, Fsts approaching 1. And these, again, are our Fst outliers. Again, most of the human genome shows very little differentiation between populations. Only a few loci show high levels of differentiation. Another example of natural selection is when we see a rapid approach to what we call reciprocal monophyly. That’s the situation we saw earlier when alleles form separate clusters within different populations or species. And what these researchers have done is they’ve looked at gene trees of different loci along a chromosome in these European Corn Borers, a type of moth. What you can see is that the red and the blue lineages are from different populations, and for most of these loci they are very mixed up.
There’s not a good distinction between the red and the blue populations. However, you can see for the gene tree right in the center, there’s a strong differentiation between the red and the blue populations. There is what we call reciprocal monophyly. They are showing separate clusters in those two lineages. This is an indication that natural selection may have driven these alleles to rapid fixation in these two populations and caused shifts in allele frequencies, to be sure, but also shifts in the genealogical pattern. Whereas most loci show a mixed-up pattern, perhaps due to incomplete lineage sorting, the alleles in that panel in the center are showing increased reciprocal monophyly, increased fixation. So again, a signature of natural selection. OK, so we’ve seen different patterns of genetic variation, different patterns in gene trees. And what I want to end with is the ways in which scientists are using climate history to understand patterns of genetic variation within species. What you can see here are projections of the niche, the ecological niche, of a land snail in Northern Australia projected back in time. So what you can see is this land snail likes moist, wet habitats such as rainforest, and you can see in the dark green areas the habitats where it lives in Northern Australia. You can see that the extent of rainforest is a patchwork in the present day, closer to the right. You can see in the center panel that it was quite extensive about 7,000 or 8,000 years ago. Whereas at the last glacial maximum, or LGM, about 20,000 years ago, rainforest was quite restricted. On the extreme left, you can see the areas that have shown the most stable patches of rainforest over time, across all the other 4 maps. And what was intriguing was that these researchers found that those areas of Northern Australia that showed the most stability in the climate and the habitat also showed the most genetic variation.
This is a pattern which has now been seen in several different species, both vertebrates and invertebrates, on different continents such as Australia and South America. And it’s one of the ways in which climate scientists are using genetics and information from climate history to tell what the influences on genetic variation within species are. Why are some populations more diverse genetically than others? Perhaps it’s because those populations have been stable over time, and they haven’t experienced a lot of extinction and recolonization, which would tend to reduce genetic diversity. So I hope what you’ve seen today is a nice overview of the ways in which scientists can use genetic diversity to study the history of populations. We’ve seen some basic measures of genetic diversity. Once scientists realized that alleles could be looked at genealogically, and that we could count the differences between alleles, we could find different ways of measuring genetic diversity and getting a window into that very important population genetic parameter, the effective population size. We’ve also seen how we can use patterns in the gene trees, whether alleles are clustering by population or species or whether they’re mixed up between populations and species, to ask how recently populations have diverged or whether they are experiencing gene flow. We’ve also seen how we can use statistics, such as Fst, to ask about natural selection. Have certain loci been driven to fixation, to high or low frequency, in a particular population because of natural selection? That’s what Fst outliers can show us. And then finally, we’ve seen this example from climate studies showing how climate history can be linked with phylogeography to better understand the determinants of genetic diversity in natural populations. So thank you very much and I hope you’ve enjoyed our quick tour through gene trees and phylogeography.