Introduction to Population Genetics – Lynn Jorde (2014)

Tyra Wolfsberg:
Good morning, everyone. Welcome to week seven of our current topic series. Thank you for
coming. This week we’re honored to have with us Dr. Lynn Jorde from the University of Utah,
School of Medicine, where he holds the H.A. and Edna Benning Presidential Endowed Chair
in the Department of Human Genetics, and he’s also the appointed Chair in the Department
of Human Genetics. Dr. Jorde received his degrees from the University of New Mexico.
His lab studies the evolution of mobile elements, and the effects of these elements on the human
genome. He’s also interested in natural selection in humans, and has identified genes that have
helped Tibetan populations adapt to living at high altitudes. Finally, he’s used whole
genome sequencing to uncover disease-causing mutations, and to estimate the human mutation
rate. Dr. Jorde served on several advisory panels
for the National Science Foundation and the NIH, and in 2012 he was elected as a fellow
of the American Association for the Advancement of Science. Finally, Dr. Jorde has received
12 teaching awards from the University of Utah, as well as one from the American Society
of Human Genetics. I’m pleased to say he’ll be bringing that excellent teaching style
here to NIH this morning, and I’m sure you’ll enjoy learning a lot from today’s talk, which
is intended to provide you with an overview of population genetics. Please join me in
welcoming Dr. Jorde to the NIH this morning. [applause]

Lynn Jorde:
Well, thanks very much, Tyra. It’s a pleasure to be here again. And before I start, let
me say that I’m happy to entertain questions at any point in the talk. So if something
comes up that you’d like to know more about, don’t be shy about asking a question. This
discloses that I have no commercial interest related to this presentation. So this morning what I’d like to talk to you
about is, first of all, an overview of patterns of human genetic variation, both among populations,
because really that is the essence of population genetics, but also now, particularly with
whole genome data, we can really dissect patterns of variation, similarities and differences
at the individual level, giving us, I think, a very different and much fuller perspective
of human genetic variation. We’ll talk about the implications of our findings in human
population genetics for the concept of race, something that I think always stirs
a certain amount of controversy, and something that I think can be illuminated by our genetic
data. We’ll talk about linkage disequilibrium, a fundamental population genetic process that
has been very important in disease gene identification. Throughout, we’ll be talking about the relevance
of genome sequencing data for these topics. So there are a number of applications of human
genetic variation. One is in deciphering human history, because really, the history of our
species is written in our genome. And more and more, we have the technology to make inferences
about that history. And I’ll be giving you a few examples of how genetic data can be
used to infer human history going back hundreds of thousands of years. We can infer individual
ancestry. I’ll give you some examples of that. And this is something that I think is much
more informative than traditional self-identified population categories. Genetic variation is
used commonly now, as you know, in the field of forensics. Tens of thousands of cases every
year are solved using DNA data. So this is a very important, and to some extent unanticipated
application of basic population genetics, application of things like Hardy-Weinberg
equilibrium, linkage disequilibrium to help exonerate the innocent and to convict the
guilty. And finally, perhaps most importantly, principles of population genetics are used
to find, to identify, and to understand disease-causing genes. And we’ll be talking about some of
those applications. So, of course, mutation is the fundamental
source of genetic variation in our species and others. We now can estimate the human
mutation rate directly by sequencing families. We sequenced a human family from Utah a few
years ago, and estimated the human mutation rate to be about 1.3 x 10^-8 per base pair
per generation. And there have been now several estimates using families that all come up
with about the same number: roughly one in 100 million base pairs per generation for
single nucleotide variants. So what that means is that we transmit about 30 new DNA variants
each time we make a gamete. And I really like this quote from Lewis Thomas, the science
writer, about mutation. He said, “The capacity to blunder slightly is the real marvel of
DNA. Without this special attribute, we would still be anaerobic bacteria, and there would
be no music.” So I think we should be thankful for our mutations, because some mutations
under natural selection lead to adaptation to a changing environment; others, of course,
cause disease. Another thing we’ve learned by sequencing
families is that the mutation rate goes up substantially with advanced paternal age.
We’ve known for some time that certain autosomal dominant diseases increase in frequency with
the age of the father, but now, by looking at sequence information, whole genome sequences
in families, we estimate that there are about two additional mutations
with each additional year of paternal age after around age 30 as a result of spermatogonia
continuing to undergo mitotic divisions throughout the life of the male. So at least three quarters
of all new mutations in mammalian species can be attributed to males. So in addition
to wreaking a lot of the havoc in the world in general, males also wreak most of the havoc
in the genome, at least at the level of single nucleotide variants. So given that these mutations are happening
all the time, that we’re transmitting them from generation to generation, a natural question
is, “Well, at the DNA level, if we look at aligned DNA bases, how much do
we actually differ?” Well, identical twins are nature’s clones, so for all intents and
purposes, they differ at none of their DNA base pairs. There are, of course, somatic
mutations that cause small differences, but we can say that they are, essentially, genetically
identical. You probably know that for any pair of unrelated humans, we differ at about
one in 1,000 of our base pairs. And I think that’s a very important result, because it
tells us that at the level of DNA, the most fundamental biological unit, we are 99.9 percent
identical. If we compare ourselves to our nearest evolutionary relative, the chimp,
we are about 99 percent identical. We are about 99 percent chimp at the DNA level. Mouse,
as you would expect with 70 million years of separation, we differ at one-sixth to a
third of our base pairs. And if we look at something very different, broccoli, we are
thankfully, mostly different from broccoli. Well, a small number of differences, then,
proportionally, only one in 1,000, but because, as you know, we have 3 billion base pairs
in a haploid genome, that means that between any pair of haploid genomes, including the
two genomes that you get from your parents, there are about 3 million single nucleotide
polymorphisms, or variant differences. So actually a lot of variation for evolution to work with. Now, we can put this in context a little bit
by comparing the amount of variation in humans with that of other great ape species. And
this is a paper published just last year, sequencing 79 great apes. And we see that
for humans on average there are around 3 million single nucleotide variants per individual
when we compare an individual genome to the reference. For common chimps, it’s nearly double. For
gorilla, it’s more than double. For orang, it’s about three times as much. So humans,
at least relative to other great ape species, are somewhat depauperate in genetic variation,
and what this suggests is that we were founded by a relatively small number of individuals
not so very long ago. So we haven’t had as much time to accumulate variation. Now, another important kind of genetic variation,
and one that population geneticists are using more and more, is copy number variants. So
here we have a couple of genes, A and B, that exist in extra copies in a genome. And these
are often defined as deletions or duplications greater than 1,000, sometimes greater than
500 base pairs. And all together, they account for a substantial amount of inter-individual
variation, each human being heterozygous for at least 100 copy number variants, or more
if you define them as being a bit smaller; but another important source of variation,
and one that is traced, in some cases to the causation of diseases like schizophrenia and
autism. So we’ve asked how much individuals differ from each other.
We can also ask the question, “Well, how much do
populations differ from each other?” And of course, this has been really a central focus of population
genetics for a long time. So I’ll show you some data from a fairly widely-distributed
series of human populations. We’ve collected many of these over the years. Eight hundred
fifty individuals in 40 different populations distributed across the major continents of
the world. And of course, there’s a substantial amount of phenotypic variation in these individuals.
And these are photographs of some of the people that were sampled in the course of these studies. So one of the ways that we can look at variation
among populations is with a simple tabulation of allele frequencies. So if we have, let’s
say, three populations here, and let’s suppose for simplicity, we’re looking at three single
nucleotide variants. These are the major allele frequencies, the allele with higher frequency.
We can assess variation among populations simply by looking at the frequencies of these
alleles and comparing them. And one of the things that we typically do is to estimate
average heterozygosity, which is a fundamental measure of variation, so that for each locus,
we assess the proportion of heterozygous individuals, typically by direct counting, or we can make
a Hardy-Weinberg calculation, and then we can average that heterozygosity across loci. So one of the ways that we apply this is to
estimate a quantity called FST, and this is something used very often in population
genetic analysis. And we can think of FST as the proportion of genetic variation in a whole
population, a whole sample (let’s say the whole world), that arises because of differences
among populations, that arises because of subdivision. So a simple measure of FST is shown here.
We look at the total heterozygosity in our sample; let’s say all the heterozygosity in
humans across the world, the average heterozygosity. And then we subtract from that the average
heterozygosity within each subpopulation. So if we divide our populations into continents,
we would look at the average heterozygosity in each continent, subtract that from the
total, and then normalize by dividing by the total. So you could imagine that if this within-population quantity
were very high, in fact if there were as much variation within populations as there is in
the whole sample, then FST would be zero. What that says is there is really no differentiation
across human populations. Every subpopulation has just as much variation as the entire population.
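
The calculation just described can be written as FST = (H_T - H_S) / H_T, where H_T is the heterozygosity of the pooled sample and H_S is the average heterozygosity within subpopulations. Below is a minimal sketch of that formula for a single biallelic locus. It assumes Hardy-Weinberg heterozygosity, 2p(1 - p), and equally sized subpopulations; the function names and the allele frequencies are hypothetical and are not taken from the talk's data.

```python
def heterozygosity(p):
    """Expected heterozygosity 2p(1-p) at a biallelic locus (Hardy-Weinberg assumption)."""
    return 2.0 * p * (1.0 - p)


def fst(subpop_freqs):
    """FST = (H_T - H_S) / H_T for one locus.

    subpop_freqs: allele frequency in each subpopulation.
    Assumes equally sized subpopulations (an illustrative simplification).
    """
    p_bar = sum(subpop_freqs) / len(subpop_freqs)
    h_total = heterozygosity(p_bar)  # heterozygosity of the pooled sample
    h_within = sum(heterozygosity(p) for p in subpop_freqs) / len(subpop_freqs)
    if h_total == 0.0:
        raise ValueError("locus is monomorphic in the whole sample; FST is undefined")
    return (h_total - h_within) / h_total


# Hypothetical allele frequencies, chosen only to show the range of the statistic.
no_differentiation = fst([0.6, 0.6, 0.6])    # every subpopulation looks like the whole
complete_differentiation = fst([1.0, 0.0])   # each subpopulation is fixed for one allele
moderate = fst([0.7, 0.5, 0.3])              # roughly 0.107

print(no_differentiation, complete_differentiation, moderate)
```

In practice the heterozygosities would be averaged over many loci before the ratio is taken, and sample-size corrections are usually applied; neither is shown here.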
No differentiation. On the other hand, if all variation exists between populations,
in other words, if this quantity is always zero, every subpopulation is essentially a
clone, then FST would be one. So this is a way of saying how much variation in a sample
is due to subdivision; due to the fact that this is not a completely random mating population. So if we look at some measures of FST using
different kinds of genetic systems. These are short tandem repeats; these are a couple
kinds of mobile element systems. Here’s a 250k SNP chip. What we see, actually, is very consistent
across different kinds of genetic systems; that FST, the amount of variation due to subdivision,
typically runs between 10 and about 15 percent. We see similar results for sequence data as
well. So most of the variation in human populations would be found within any major subdivision;
within, let’s say, Asia or within Africa — a little more in Africa, but the bottom line
is that if we look at the variation within one major human population, we see 90 percent
of human genetic variation in that population. We only get an extra 10 percent if we look
at the rest of the world. So we are really somewhat minimally differentiated, which I
think is another important point with some real social implications. Now, we can compare FST in these genetic systems
with FST for a measure of skin pigmentation, which is highly differentiated across continents.
And we see essentially the opposite result: 90 percent of variation is found between major
continents. So for this very visible indicator that people often use to essentially classify
populations, there is a lot of variation among populations. Essentially, the reverse of what
we see for genetic systems. And if we now look at some of the genes that underlie skin
pigmentation, they also vary tremendously among populations, as you would
expect. So here is a tabulation that we did on
the samples I showed you earlier with a 250k SNP chip simply to ask the question, “Well,
what proportion of alleles are shared among populations?” And we divided our populations
into Sub-Saharan Africa, Europe, East Asia, and the Indian subcontinent. And what we found
with that SNP chip, which, of course, consists mostly of common SNPs, where the minor allele
frequency exceeds 5 percent, about 80 percent of the minor alleles were shared
in all four groups: 88 percent in at least three, 92 percent in at least two; 7 percent
were African specific, and less than 1 percent were specific to any of the three non-African
populations. So the bottom line here is that for these SNPs with frequencies greater than
5 percent or so, they typically are old polymorphisms. A polymorphism
typically has to have some age to attain a higher frequency. They tend to be shared among
populations. And in fact, none of these SNPs were fixed present in one population, fixed
absent in another. So none of them could be used actually on its own to differentiate
populations. And this is a similar result from the 1000
Genomes data. In an earlier version of dbSNP that consisted mostly of common SNPs, these
are the Asian 1000 Genomes populations, the European-derived (this is actually a
sample from Utah), and then African. And most of these SNPs are shared in all three
populations, with somewhat more found in Africa, relative to Europe and Asia, but
mostly shared. And these are relatively common SNPs where the average allele
frequency difference between populations is right around 15 percent. But now more recently,
we can look at rarer SNPs identified by sequencing. And now you see a very different pattern.
Most of these are not shared among populations. They’re rare enough so that they arose relatively
recently, and therefore tend not to be shared among continental populations. And in fact,
for SNPs where the minor allele frequency is less than 5 percent,
less than 2 percent of those are actually shared across continents. So it’s much, much
more common to see population specificity with these rare alleles, which is what we
would expect given population history, but a very different picture from one that we
see for the more common SNPs. So we can look at differences among populations
using a simple, genetic distance measure. And I’ll just take you through how we estimate
those to give you the basic principle. The simplest form of a genetic distance, if we’re
estimating the distance between population I and J, is to simply take the absolute value
of the difference in allele frequencies. So the allele frequency in population I, minus
the allele frequency in population J. So if we look back at our little matrix of
allele frequencies, our distance for locus one would simply be this number minus that
one, the absolute value. And then, we can just average this over all of our SNVs (we
might have a half a million or a billion of them) to get the distance, the genetic distance
between that pair of populations. And you could imagine that this starts to get much
more complex to evaluate as we get more and more populations. If we have 50 populations,
then we’ve got a 50 by 50 matrix of genetic distances. So we can use these genetic distances to build
a population network that displays similarities of populations. So let’s take that first single
nucleotide variant. Here are our three populations. And we can subtract P2
from P1 here, so these two SNV frequencies. And we can take that difference
to place a node between populations one and two. And then, a commonly used approach
averages these two allele frequencies, the ones from P1 and P2, and then subtracts that
from P3, this frequency, to give us the distance between these two populations
averaged, here, and the third population. So we can see, very simply, that populations
one and two are more closely related; three is a bit more distantly related. And that’s
essentially how these networks are built. Now, this is kind of a whimsical analysis
that my colleague Steve Guthrie [spelled phonetically] did a few years ago, just illustrating how
you can use this technique to understand not just genetic distances, but all kinds of variation.
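
For readers who want to see the distance calculation and the averaging step from the three-population example in concrete form, here is a minimal sketch. It computes the mean absolute allele-frequency difference for each pair of populations, finds the closest pair, averages their frequencies, and measures the distance from that average to the remaining population. The allele frequencies and function names are hypothetical, and this is only the simple averaging step described in the talk, not a full neighbor-joining implementation.

```python
from itertools import combinations


def mean_abs_distance(freqs_i, freqs_j):
    """Average over loci of |p_i - p_j|, the simplest genetic distance."""
    if len(freqs_i) != len(freqs_j):
        raise ValueError("populations must be scored at the same loci")
    return sum(abs(a - b) for a, b in zip(freqs_i, freqs_j)) / len(freqs_i)


# Hypothetical major-allele frequencies at three SNVs in three populations.
allele_freqs = {
    "P1": [0.80, 0.60, 0.90],
    "P2": [0.75, 0.65, 0.85],
    "P3": [0.50, 0.90, 0.60],
}

# Pairwise distance table (the N-by-N matrix mentioned in the talk, stored by pair).
distances = {
    frozenset((a, b)): mean_abs_distance(allele_freqs[a], allele_freqs[b])
    for a, b in combinations(allele_freqs, 2)
}

# The closest pair is joined first.
closest_pair = min(distances, key=distances.get)
first, second = sorted(closest_pair)

# Average the joined pair's frequencies, then compare with the remaining population.
merged = [(x + y) / 2 for x, y in zip(allele_freqs[first], allele_freqs[second])]
remaining = next(name for name in allele_freqs if name not in closest_pair)
distance_to_remaining = mean_abs_distance(merged, allele_freqs[remaining])

print(first, second, "join first;", remaining, "is", distance_to_remaining, "away")
```

With these made-up numbers, P1 and P2 are joined first and P3 sits further away, which is the same qualitative picture as in the example.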
The New York Times published this matrix of disagreements on decisions in the U.S. Supreme
Court a few years ago. So it’s a nine by nine matrix showing the percent of time that each
pair of justices disagrees. This would be just like a genetic distance, except in this
case it’s a disagreement distance. So you see that Justices Thomas and Scalia disagreed
only 9 percent of the time. Well, that makes sense. Whereas, Thomas and Stevens disagreed
most of the time; Scalia and Stevens disagreed most of the time. But you have to stare at
a matrix like this for a while before you can really intuit the pattern. So what Steve
did — he was interested in learning some of these techniques — he put this matrix
into a program that made a neighbor joining network. And you can immediately see the two
wings of the court: conservative here, more liberal here, and the swing vote, Justice
Kennedy. So these networks can very easily portray relationships among individuals or
populations. And that’s one of the reasons we like to use them. So here’s an application of that technique,
a neighbor joining network, using 100 autosomal Alu polymorphisms. So these are mobile elements
that insert into the genome. There are thousands of polymorphic Alus, where they are present
in some individuals, absent in others. We like them for these kinds of studies, because
we know that if two people share an Alu at a given spot in the genome, then they share
a common ancestor in whom that Alu occurred. So these give us, essentially, polarity. We
know that the absence of the Alu is the ancestral state; presence of the Alu is the derived
state. And they are virtually never precisely deleted. So they’re very good markers of events
in population history. So we looked at this series of populations,
made a neighbor joining network using the techniques I just described, and we see some
interesting patterns in a diagram like this. Here are African populations, and we see quite
a lot of variation among these populations. Here’s a group of European populations, substantially
less variation; East Asian, South Indian populations, giving us a nice portrayal of human genetic
diversity in the Old World. And we also see that there’s quite a long branch separating
these Sub-Saharan African populations from the others and, as I mentioned, more variation
here. And the ancestral state, which would be absence of Alus, is closest to this group
of populations, suggesting that this would be the ancestral — the descendants of the
ancestral population for modern humans. These are bootstrap support levels telling us that
this result is supported 100 percent of the time; this branch 97 percent; this branch
97 percent. So with just 100 polymorphisms, we have really quite good confidence in this
result. Now, here’s a similar exercise done with a
250k SNP chip on 40 populations. And we see very much the same patterns again. Here’s
a series of African populations. Here are the European populations. Here are populations
from the Indian subcontinent and Pakistan. Here you see a fairly long branch length for
Native American populations, but branching off an Asian cluster, as we would expect.
And down here are a couple of South Pacific populations, again, with a long branch length,
indicating a founder effect, as they were founded by a relatively small number of individuals,
but a pattern in general quite consistent with what we saw for those Alu polymorphisms.
This is a completely different set of populations published a few years ago in Nature, where
once again we see a very, very similar pattern both for a half a million SNPs, geographic
patterning to genetic distances, and also for a smaller number of copy number variance.
So the bottom line here is that we see a very consistent picture of human genetic variation,
regardless of the sampling frame, regardless of the kinds of genetic systems that we examine. And another thing that we see very clearly
from these data is that as we go — if we look at heterozygosity — in this case we’re
looking at haplotype heterozygosity, so these are groups of linked SNPs. And we’re asking
how much they vary. We see the greatest variation in Africa, and then a progressive decline
in variation as we go from Africa to Europe to East Asia, and then the more recently founded
Polynesian and American populations. So this is a very reproducible pattern. And what it
reflects is what’s termed a serial founder effect. So the largest ancestral population,
being in Africa, a subset of that population going out to found Europe and Asia, so a founder
effect there. Another subset of that population going out to found the Americas, so a continued
serial founder effect as humans spread across the globe, resulting in less and less genetic
variation, essentially, the further we go from Africa. And this is a nice diagram published in a
review a couple of years ago that just outlines those major patterns. An out-of-Africa movement
something like 80,000, maybe 100,000 years ago; then going into Eurasia; and finally
about 20,000 years ago into the Americas; very recently into Polynesia. And one of the
interesting questions, and something I’ll come back to in a minute is whether these
anatomically modern humans, people who looked just like you and me, as they came out of
Africa and encountered Neanderthals in Europe, was there mixture with that population? And
we’ll come back to what genomic results tell us about that in just a minute. Now, that’s a nice summary of essentially
the origins of modern humans across the world, but there are other sources of information
on our origins. The supermarket shelf is a good one. So I ran across this at the supermarket
10 years or so ago, and I was surprised to learn that Adam and Eve’s skeletons had been
stolen — I didn’t know they had been discovered — but because there were more amazing photos
inside, I actually bought this, and this is what I learned: all that’s left was Eve’s
leg, and it looks like the identity of the perpetrator may have been established. It’s
kind of interesting what you can learn from supermarket tabloids. Well, another way that we can look at genetic
variation is through something we call principal components analysis, and we should go through
this, because this is a way that genetic data, population, individual data,
are often displayed now. And what it is basically is a data reduction technique, because
imagine that you’re looking at 1,000 individuals and you want to assess the genetic patterns,
the differences and similarities in those 1,000 individuals. You have 1,000 by 1,000
matrix to try to explore. We need some way of reducing the variation in that matrix down
to something we can actually look at. That’s what principle components is. And here’s a
very simple example. Let’s imagine we’re looking at height and weight. We can diagram it like
this and we can run just a standard regression line through that set of points, and that’s
the line that accounts for as much variation in height and weight as possible; it’s probably
a representation of overall size. And then, we could run another line through to try to
account for the next greatest amount of variation. And that’s what principal components analysis
does. It takes a huge matrix, in this case 850 by 850; each of these dots is an individual.
We look at the amount of the allele sharing between each pair of individuals, and then
we run a line through that multi-dimensional matrix and ask, what single line accounts
for as much variation among individuals as possible? And we plot the individuals along
that line. And so, the first principal component here, we can see separates this group of sub-Saharan
African individuals from other populations — so consistent with there being a founder
event in which a subset of the ancestors of this population went on to found the rest
of the world — and if you look at the second axis, it’s basically a west-to-east axis:
Europe, west Asia, Central Asia, all the way out to East Asia, with these groups plotting
in here closest to their ancestral population. So, it’s a very convenient way in just two
dimensions of representing as much variation in human diversity as we can.
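
To make the idea of "the single line that accounts for as much variation as possible" concrete, here is a minimal sketch that finds the first principal component of a small genotype table. Several things here are assumptions rather than the method used for the plots in the talk: genotypes are coded as 0, 1, or 2 copies of an allele, the data are centered per SNP, the axis is found by power iteration (a simple stand-in for a full eigen-decomposition), and the six individuals are invented.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (scores, axis) for the first principal component.

    rows: one list of numbers per individual, all the same length.
    Uses power iteration on the centered data (illustrative, not optimized).
    """
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]

    axis = [1.0 / math.sqrt(m)] * m  # deterministic starting direction
    for _ in range(iterations):
        # Project individuals onto the current axis, then back-project.
        # This multiplies the axis by X^T X without forming that matrix.
        scores = [sum(c[j] * axis[j] for j in range(m)) for c in centered]
        new_axis = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(v * v for v in new_axis))
        if norm == 0.0:
            raise ValueError("starting direction carries no variance; try another start")
        axis = [v / norm for v in new_axis]

    scores = [sum(c[j] * axis[j] for j in range(m)) for c in centered]
    return scores, axis


# Hypothetical genotypes (rows: individuals, columns: SNPs, values: allele counts).
genotypes = [
    [2, 2, 0, 0],  # individuals 0-2: one made-up population
    [2, 1, 0, 0],
    [1, 2, 0, 1],
    [0, 0, 2, 2],  # individuals 3-5: another made-up population
    [0, 1, 2, 2],
    [1, 0, 2, 1],
]

scores, axis = first_principal_component(genotypes)
for individual, score in enumerate(scores):
    print(individual, round(score, 3))
```

The two invented groups fall on opposite sides of zero along this first axis. A second axis would be found the same way after removing the variation already explained by the first.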
Here’s a plot for just Eurasian populations, and what you see here is that this creates
essentially a map of Eurasia. So here is northern Europe, southern Europe, Central Asia, East
Asia, Southeast Asia, and then the Indian subcontinent with Nepalese out here distributed
quite widely. So this tells us that geographic patterning does affect genetic relationships
among populations, because for the vast majority of our history we’re much more likely to mate
with someone five kilometers away than with someone 5,000 kilometers away. And we still
see the signatures of that relative degree of isolation when we look at genetic variation
in populations. Over the last few hundred years, of course, this is beginning to change
and to break down. And we’ll show you some examples of that and how that affects our
genomes. But in many cases we can distinguish between
fairly closely related populations. So we published this just recently looking at
a couple of Tibetan populations. They speak different dialects; they’re largely discernible
from one another on a plot like this. And here are different Mongolian populations,
here and here; again, distinguishable on a principal components plot. So if we’re looking,
for example, for population stratification, if we’re doing an association study, this
kind of a display helps us to determine, helps us to detect stratification in populations.
And then, we can use the loadings on these axes to essentially control for that stratification
if we need to. Here’s a great example. This is published
by Carlos Bustamante’s Group a few years ago, looking at 3,000 individuals from Europe.
And what you see here, these are color coded. Each of these is an individual. These are
two principal components. They used a 500k SNP chip, looked at allele sharing among pairs
of individuals, and this essentially reconstructs a map of Europe. So the countries here pretty
much correspond to the locations of the individuals here, although some individuals fall closer
to members of other populations. So as a result of gene flow through time, this is not by
any means perfect, but they estimated that for the majority of the individuals
in their sample they could trace their birthplace to within a few hundred kilometers based on
their genetic profile. So in many important ways our history is written in our genomes. Now, I like to compare this
plot from 2008 to one that we published, now 30 years ago, doing pretty much the same thing,
but with only 15 loci instead of 500,000. We were not able to look at individuals. You
wouldn’t have adequate resolution with just 15 loci, so we looked at allele frequencies
in populations, but what you see again, with just 15 loci, is a map of Europe. So it’s
quite interesting to see this reproduced on a much grander scale, and at the individual
level with a larger number of populations. So, so far I’ve been talking about data based
primarily on microarrays, SNP arrays, but as I’m sure you’re aware, SNP arrays
miss an important part of variation; that is, variation due to less common alleles.
They’re also typically selected for diversity in a specific population, usually populations
of European ancestry. So we worry about biases, ascertainment biases, in the data that we
get from SNP microarrays. Sequences, on the other hand, give us information about rare
variants, and in most ways we can consider them to be unbiased. So they do permit a number
of inferences that simply aren’t possible from microarray data. The reason is shown
here. This was an early study done by Andy Clark comparing the allele frequency spectrum
(so these are alleles with minor allele count of one, two, three, four) through this sample.
This is what you would expect at equilibrium; that is for a constant population you expect
an excess of rare alleles. For the HapMap data, which were based on SNP microarrays
you can see that there is a real deficiency of these rare alleles, because these SNPs
were really designed for more common SNPs. And then, for two sequence data sets at that
time, Perlegen and NIEHS, there was actually an excess of rare
alleles over what you would expect at equilibrium. But it’s this class of alleles that tell us
a lot of things about population history, about population size, and about growth rates.
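
As a rough illustration of what "expected at equilibrium" looks like, the sketch below computes the expected allele frequency spectrum by minor allele count. It assumes the standard neutral model for a constant-size population, in which the expected number of sites with derived-allele count i is proportional to 1/i; that model, the scaling parameter `theta`, and the function name are assumptions introduced here and are not taken from the study shown on the slide.

```python
def expected_folded_sfs(n_chromosomes, theta=1.0):
    """Expected number of sites at each minor allele count k = 1 .. n//2.

    Assumes the standard neutral model in a constant-size population, where the
    expected number of sites with derived count i is theta / i. Folding adds the
    classes i = k and i = n - k, because either allele could be the minor one.
    """
    n = n_chromosomes
    spectrum = {}
    for k in range(1, n // 2 + 1):
        if k == n - k:
            spectrum[k] = theta / k  # the two classes coincide
        else:
            spectrum[k] = theta * (1.0 / k + 1.0 / (n - k))
    return spectrum


# Hypothetical sample of 10 chromosomes with theta = 1.
spectrum = expected_folded_sfs(10)
total = sum(spectrum.values())
for count, expected_sites in spectrum.items():
    print(count, round(expected_sites / total, 3))
```

Even under these constant-size assumptions, singletons are the largest class. Array data that fall below this curve at low counts, or sequence data that rise above it, show up as departures from these proportions.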
So sequence data give us this information that really the microarray data don’t give
us accurately. One of the things that this allows — this is from the 1000 Genomes data
— is an accurate inference of population sizes and migration rates through time for
human populations. So these bars represent the size of populations. This is the African
founder population. This is the effective size of that population. The estimate here
is that about 50,000 years ago a small piece of that population went out to found Eurasia,
and then there was rapid expansion of that derived population; very, very rapid population
growth from an initial bottleneck with migration among populations subsequently. So although
we think of out-of-Africa as a single event, it was probably multiple events, and there
were probably back-to-Africa events as well, at least to some extent. But
with sequence data we can really portray human history much more accurately in greater detail. So here is an allele frequency spectrum like
the one I just showed you now for 200,400 exomes from the Seattle Group. And we see
again this excess of very rare variants; in fact, more than we would expect in a constant
population. What this reflects is population growth, and I’ll show you an example of that
in a second. But one of the interesting findings of this study is that 73 percent of all protein
coding single nucleotide variants, and 86 percent of the deleterious SNVs are very young.
They’ve arisen within the past 5,000 to 10,000 years as human populations exploded, because
a growing population does not successfully eliminate these rare variants, including the
deleterious ones. And another interesting finding from this study is that we see more
deleterious single nucleotide variants in European and Asian populations than in African
populations. The reason for that is that European and Asian populations had this incredible
bottleneck as they came out of Africa, and then expanded very, very rapidly retaining
those rare variants, including the ones that are deleterious, not necessarily lethal — those
would be eliminated quickly, but other deleterious variants. And this, from a population genetic
perspective, helps to explain why we see more rare variants, more deleterious rare variants,
in European and Asian than in African populations. So to understand why population expansions
increase the frequency of rare variants, let me use this little example here. So here
we have an individual who has had two children and, of course, if that individual has received
a new variant, a de novo variant from one of his parents, if he has just two children,
there’s a chance with each child that he will not pass on the new variant. So the extinction
probability when he has only two children is one-half times one-half, or one-quarter.
So there’s a good chance that that new variant is simply going to go extinct in one generation.
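The arithmetic in this example generalizes to any family size. A minimal sketch, assuming a heterozygous parent who transmits the new variant to each child independently with probability one-half; the function name `extinction_probability` and the family sizes tried are illustrative assumptions.

```python
def extinction_probability(n_children):
    """Chance that a carrier of a new variant passes it to none of n children.

    Assumes each child independently inherits the variant with probability 1/2.
    """
    return 0.5 ** n_children

# Hypothetical family sizes: a small family and a large one.
small_family = extinction_probability(2)    # one-half times one-half
large_family = extinction_probability(10)   # one-half to the tenth power

print(f"2 children:  {small_family}")
print(f"10 children: {large_family}")
```

The probability falls geometrically with family size, which is why variants that arise during rapid population growth are less likely to be lost in their first generation.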
On the other hand, let’s say he’s from Utah and he’s got 10 children. Now, the extinction
probability goes to one-half to the tenth. In other words, the chance is only about one
in a thousand that that allele will go extinct in this generation. So this would represent
a rapidly growing population, and if this person’s descendants also have a lot of descendants,
that extinction probability is low. So for rare variants, for variants that arise in
a time of rapid population growth, they tend not to be eliminated, simply because the extinction
probability in any generation is really quite low. And that helps to explain why we see
this excess of rare variants in human populations, and particularly in human populations that
have undergone a bottleneck and extreme expansion.

Now, I said that we’d come back to the issue
of mixture with Neanderthals, because people are naturally interested in this. As our ancestors
came out of Africa, did they mix with Neanderthals? And the separation of human and Neanderthal
ancestors took place something like 300,000 or 400,000 years ago, but the question is,
when these populations were near each other some 50,000 to 60,000 — 70,000 years ago,
was there gene flow between them? And we now have, actually, very good evidence from nuclear
sequencing of Neanderthal skeletons that about 1 to 3 or 4 percent of modern human DNA has
Neanderthal origin, but only among non-Africans, so that as humans went out-of-Africa, encountered
Neanderthals — probably first in the mid-East — there was a small amount of mixture. So
instead of the African replacement hypothesis, we now refer to a leaky replacement hypothesis.
Neanderthals were mostly replaced, but probably not entirely, and in fact, we see Neanderthal
DNA in pretty much all non-African populations. And one of the interesting questions is, could
some of the shared sequences have adaptive significance? And there is now some evidence
based on surveys of the 1000 Genomes data that in fact they do. For example, genes that
encode keratin filaments appear to have been selected for in these Neanderthal modern human
mixed populations. So here is a plot showing probability of Neanderthal ancestry in CEU
Europeans, CHB East Asians, and in sub-Saharan Africans. You can see that there are sections
in this individual that are very, very likely, with almost a 100 percent probability, of Neanderthal
origin, and likewise in this European individual; whereas for sub-Saharan Africans, typically, you see
no evidence of Neanderthal contribution. So another interesting application of the 1000
Genomes data, searching for Neanderthal genes, searching for those that may have been selected
for adaptation in this new environment as populations were coming out of Africa many
thousands of years ago. And, of course, you can send your DNA into some of the direct-to-consumer
testing companies, and they will estimate your portion of Neanderthal genes, typically
between 1 and 3 percent. So this is finding your inner Neanderthal, as they say.

So one of the interesting questions that arises
as we’re looking at population similarities and differences is what can genetics tell
us about the concept of race? And I put this in quotes, because it’s a term I personally
don’t use in writing, but certainly it is used, and I think often misunderstood. Here’s
a quote from an editorial in the New England Journal in the last decade, stating quite
unequivocally that race is biologically meaningless. There was a response in the New York Times
by a Sally Satel, a psychiatrist, who said, “I am” — and this was deliberately provocative
— “I am a racially profiling doctor.” Her argument was that self-identified population
affiliation gave information about response to some of the drugs that she prescribes as
a psychiatrist. So the question is, how useful a concept is this, and what can genetics do
to illuminate our understanding?

Back about 10 years ago this article made
the cover of Scientific American; Steve Olson, the science writer, and Mike Bamshad,
my colleague, were the co-authors. And the question was, does race exist, and according
to Scientific American, science has the answer. I always get suspicious anytime they say that
science has the answer, but I think science, genetics can give us, at least, some insight.
So, we can start by looking at DNA sequence differences among individuals. And we’ve kind
of gone over this concept, but if we have DNA sequences — and I thought I would use
some political figures for this example. Let’s say we have a sequence from Rick Santorum,
Mitt Romney, Hillary Clinton, and, I almost hated to — I hated to put him in, but John
Edwards. And our question is, how different are they? We can make a matrix of DNA sequence
differences. We see that Romney and Santorum differ at just two bases here. Clinton and
Santorum differ at five. And we see that Edwards and Santorum differ at six. Edwards and Clinton
at only one. This is a hypothetical example, but now we can put this pattern in a tree,
a network, as we did before, and again, we see some very discernible patterns, a clustering.

So we can do the same thing with real DNA
sequence. And we did this with sequence at the angiotensinogen locus some time ago. It’s
14 kb of sequence, so a relatively small amount of sequence. But we’re asking the question,
for these major population groups — Asia, Europe, Africa — how similar are people to
one another for the DNA at this locus? And what we see is that for this gene sometimes
an individual of African descent is actually more similar to people from Asia or Europe
than to others from Africa. Now, partly this reflects the fact that this is a relatively
small amount of genetic variation, but it says that for any given gene it’s very difficult
to trace population origin from that gene, and conversely, if we know your population
origin, we can’t predict necessarily your genotype or genotypes at that locus. This
also reflects the sharing of DNA that has gone on through the history of our species,
because human populations have mixed and migrated fairly extensively throughout our history.
And the mosaic patterns that we see in many of these diagrams are a reflection of that.

And this is actually something that Darwin
himself was aware of. He said a long time ago that “it may be doubted whether
any character can be named which is distinctive of a race and is constant.” In other words,
characters tend to be shared across populations and any single character is not going to delineate
a specific population group. So we then took that same group of individuals and we used
about 200 loci, and again, made a diagram. And now you see that for these individuals,
who are from Africa, Europe, and East Asia, so the groups are geographically
somewhat separated, every individual now falls into a group that is consistent with
their continent of origin.

Now, one of the things you notice here is
that the lengths of these branches are much, much greater within populations and that’s
consistent with that FST estimate that says most differentiation, most variation occurs
within populations, but there is enough between population variation here so that we can begin
to see a pattern according to ancestry. And that may seem a little bit paradoxical when
you compare this to the diagram I just showed you for angiotensinogen, but it makes sense
if you think in terms of this being a lot more information about ancestry, about population
history. So in a way it’s like looking at, say, just height in males and females. If
we measure everyone’s height and try to determine sex from that, we’re going to be wrong a
lot of the time, but there is on average a difference. If we add another characteristic,
like waist/hip ratio, well, then we have more accurate separation of our two groups. And
the more characters we look at the more accurately we can discern these two different groups.
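The idea that more characters give more accurate separation can be sketched with a small simulation. Everything in it is an assumption made for illustration: two hypothetical groups, allele frequencies of 0.6 and 0.4 at every locus, independent loci, and a simple midpoint rule for assigning individuals. It is not the analysis used for the diagrams in the talk.

```python
import random

def simulate_accuracy(n_loci, n_per_group=500, freq_a=0.6, freq_b=0.4, seed=1):
    """Fraction of simulated diploid individuals assigned to the correct group.

    Each individual is summarized by the number of copies of one allele
    carried across n_loci independent loci (two chromosomes per locus).
    """
    rng = random.Random(seed)

    def allele_count(freq):
        # Draw 2 * n_loci alleles, each present with probability freq.
        return sum(rng.random() < freq for _ in range(2 * n_loci))

    # Midpoint between the two groups' expected allele counts.
    threshold = n_loci * (freq_a + freq_b)
    correct = 0
    for _ in range(n_per_group):
        correct += allele_count(freq_a) > threshold    # individual from group A
        correct += allele_count(freq_b) <= threshold   # individual from group B
    return correct / (2 * n_per_group)

# Hypothetical numbers of loci, from a single locus up to a few hundred.
for n_loci in (1, 10, 200):
    print(n_loci, round(simulate_accuracy(n_loci), 3))
```

With a single locus the groups overlap heavily and many individuals are misassigned; with a couple of hundred loci the same modest frequency difference separates the groups almost completely.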
So, that’s essentially what we’re saying, is that with more genetic information we can
more accurately discern the histories, at least, on a very basic level, of these continental
populations.

So here’s another example, now using more
single nucleotide variants, and you can see in this neighbor-joining network it appears
that there are groups, and in fact they do correspond to various worldwide populations.
These are New World populations, Asian populations, African, a Spanish population, a south Indian
population. But we shouldn’t get too misled by this, because we can add populations with
a more complex history, such as African Americans, where some fall into this group with African
populations; others trend toward other groups because of the complex history of this population.
The same thing if we look at, say, Puerto Ricans who, again, have a complex history,
complex ancestral history, where some fall in with the Spanish group, others fall in
closer to an African group. So the point here is that, especially as human populations become
more mobile, it’s very difficult to classify every individual into a nice, neat category.

Here’s a similar exercise that my graduate
student Wilfred Wu carried out a year or so ago with the Complete Genomics data. So this
is whole genome sequence. And we see very much the same kind of pattern where in general
individuals — and these are individuals from the 1000 Genomes Project, sequenced by Complete
Genomics — and we can see that in general these population groups do tend to fall together,
but there are interesting exceptions. For example, individuals from Mexico are distributed
in various places throughout the graph, once again illustrating their complex demographic
history. Another thing Wilfred did that was kind of interesting, just from a genomics
point of view, was to compare — he included here the same subject sequenced in the 1000
Genomes database with their sequence in Complete Genomics, and on average the between-platform
differences were about 348,000 variants. A lot of that has to do with relatively low
coverage in the 1000 Genomes database, so we would expect it. And it’s actually kind
of encouraging that each of these pairs, which are the same individuals on two different
platforms, did at least cluster together. But you can see that between-platform difference
quite clearly in this slide.

So here’s just one more example of the point
I’m making here. This is a principal components plot for American populations of African,
European, Asian, and Hispanic descent. And again, you see that some individuals, for
example of African descent, are closer to members of other populations than they are
to many of the other individuals of African descent. So very difficult to put a self-identified
group into a nice, neat little compartment. So what this tells us is that if we look at
multiple polymorphisms, if we look at a lot of SNPs or single nucleotide variants, if
we look at enough of them we can often learn something about population affiliation, kind
of the non-overlapping parts of these circles. But the converse — and this is where people
sometimes get confused — the converse is not true. If we know your population affiliation,
we can’t predict your SNP genotype, because these populations typically differ just in
frequency of SNPs and there’s a lot of overlap. So I think that’s a very important point that
we need to make, especially to the general public. And it really points up, I think,
the fallacy of thinking typologically, which is what racial categories tend to be. Humans
really don’t fall into discrete groups like this. They’re — what the genetic data tell
us is that there’s a tremendous amount of overlap in genetic information across human
populations.

But here’s a good example of that, or also
of how self-identified population affiliation can be misleading. Wayne Joseph was a principal,
is a principal, in the school system in California. He was raised in a family in Louisiana that
was self-identified African American. He sent his saliva in to a direct-to-consumer testing
company, and this is what he learned: that, at least according to their estimates, he
was 57 percent European, 39 percent Native American, maybe 4 percent East Asian, although
that could just be an error term, but no apparent African ancestry. So in his case his self-identified
population affiliation appears to have been really completely wrong. Now, this didn’t
change anything importantly for him. Culturally he maintained his same affiliation, but it
shows how that self-identified affiliation can be wrong, can be misleading.

So I think a much more useful concept than
race is individual ancestry, because we can now estimate genetic ancestry for individuals,
at least at a broad level. And someone with this apportionment of ancestry would likely
self-identify as African American as very likely would someone with this. And yet their
ancestries and their genetic makeup could be really quite different. And that’s why
I think it’s much better really to assess ancestry at the individual level rather than
to use these categories. I’ll just give you an example from my own genetic testing because
I sent my DNA into one of the companies — I guess this was 23andMe — a few years ago.
And they will assess your paternal and maternal ancestry. This is based on Y-chromosomes.
So I have this particular Y-haplogroup, I1*. And it was kind of amusing to learn I share
it with Jimmy Buffett and Warren Buffett. They don’t know that.

[laughter]

Lynne Jorde:
And it hasn’t done anything for my singing or my investing ability. But my grandparents
all came from Norway. So this is consistent with what I know about my ancestry. My maternal
line, my mitochondrial DNA again, the haplogroup I have is quite common in Europe, fairly widely
spread throughout Europe. So that, again, makes sense. And then using ancestry informative
markers across the genome, they attempt to essentially paint your chromosomes with ancestry.
And I was hoping that I would have something exotic, but according to this, at least, my
ancestry derives 100 percent from Europe. I was hoping that my kind of rambunctious
Viking ancestors might have brought something interesting into the genome, but it doesn’t
look like that’s the case. But here we’re looking at the ancestry of a Berber female
from North Africa. So this is an African, but where 86 percent of the ancestry is predicted
to be European-derived. And we see quite a lot of mixture in that ancestry, even more
so for a self-identified African American. And the important point here is that for this
individual some regions of the genome would be African-derived, other regions of the genome
would be European-derived. And if we’re interested in disease susceptibility that is genetically
related, what we really want to do is to look at those specific regions and look at their
genetic makeup rather than assessing self-identified population affiliation.

So for biomedicine I think these findings
do have some important implications. First of all, they tell us that if we look at a
large number of independent polymorphisms we can learn about population history. We
can learn about ancestry. But, and very importantly, these variants typically differ only in their
frequency and they typically overlap a lot among populations. And here’s an example of
that. This was a study done on response to ACE inhibitors in African American and European
American populations — a very large meta-analysis — and it addressed the issue or the question
of whether African Americans tend to respond less to ACE inhibitors for lowering blood
pressure than European Americans. And what we see here is that the decrease in blood
pressure, in systolic pressure in response to ACE inhibitors, is a few millimeters less
in African Americans than European Americans, but that there’s a large distribution here,
a large amount of overlap, and so as you can see, many of the African American patients
would respond better to an ACE inhibitor than would many of the European American patients.

So far better than using this average difference
as an indicator of who should get this drug, it’d be much better to be able to look directly
at genotypes in individuals to predict response. And we see a good example of that with EGFR
inhibitors and non-small cell lung cancer. So EGFR inhibitors, like gefitinib and erlotinib,
inhibit tyrosine kinase activity, and they’re estimated to be effective in treating this
condition in roughly 10 percent of Europeans, but a higher percentage of Asians. So one
might imagine using population affiliation as an indicator for who should get this drug
to treat non-small cell lung cancer. But it’s interesting that if you look at somatic mutations
in EGFR — these are gain-of-function mutations — we see those in about 10 percent of European
patients with this condition, a higher percentage of Japanese and other Asian individuals, and
in fact, 70 to 80 percent of those who have the mutations respond to the drug; fewer than
10 percent of those without respond. So you can see that looking at the gene itself, looking
for gain-of-function mutations is a much better indicator of who is going to be a good responder
than is the more blunt population category.

This is one more example of that: response to,
or the calibration of, warfarin dosage. So this is a standard clinical algorithm that
takes population affiliation into account, but here are the results of looking — of
doing genetic testing for VKORC1, CYP2C9; they are both involved in warfarin metabolism.
And here you see a much, much bigger difference between this genotype category and this genotype
category than we do across a population. So again, individual testing giving us a much
better prediction of response than the use of population affiliation. So what this, I
think, tells us is that genetic variation, we’ve seen, is correlated with geographic
location, but it tends to be distributed continuously across space. It’s difficult to delineate
specific borders or boundaries between populations. So race — going back to a question raised
earlier — it may not be biologically meaningless, but it’s biologically very imprecise. It is
a very blunt tool, and we can use better tools, genetic tools, to infer individual ancestry,
and that, I think, will provide more medically relevant and useful information.

And I want to go on now to the topic of linkage
disequilibrium, but everyone has been sitting very still for about an hour, and so I’d like
to invite you to just stand up and stretch for a minute before we do the last half hour
of the lecture. So I think it’s cruel and unusual punishment to make you sit here for
90 minutes. It violates your — what is it? Eighth Amendment rights.

Female Speaker:
Can I ask a question while we’re —

Lynne Jorde:
Oh, yes. Absolutely.

Female Speaker:
— not on. Is it on? What’s the — how did you define what were Neanderthal genotypes?

Lynne Jorde:
Okay. That’s a —

Female Speaker:
Since there are no Neanderthal around.

Lynne Jorde:
Oh, but several have been sequenced.

Female Speaker:
From frozen material, as it were? [laughs]

Lynne Jorde:
Well, it was — I’m not so sure about the exact provenance, but several Neanderthal
specimens have been sequenced, including one at 42X coverage, so that has given a baseline
for the Neanderthal genomes.

Female Speaker:
And they’re taken from geographically somewhat diverse areas?

Lynne Jorde:
Yeah. One was in Croatia, another much further east. I don’t remember the exact location.

Female Speaker:
But the point is they’re closer to each other than to anything else.

Lynne Jorde:
Yes.

Female Speaker:
Okay.

Lynne Jorde:
They’re much — the sequences, the Neanderthal sequences are much more similar to each other
and quite divergent from human.

Male Speaker:
I had another question on the — so the findings of much greater diversity, genetic diversity
within Africa than other populations, how much of that is due to population substructure
within Africa, FST values between different populations in Africa?

Lynne Jorde:
Yeah. So there is, as you’d probably expect, more substructure as we look across Africa,
that population having been resident in Africa longer, and having more time to subdivide
and differentiate. It also has a larger effective size, and the larger the effective size of
a population, the more variation you see. So in all the different kinds of systems we’ve
looked at, we tend to see about 20 to 25 percent more variation in samples of persons of African
ancestry than in non-African — those of non-African ancestry.

Okay. Well, it looks like everyone has sat
back down, so we’ll go on to talk about what I think is a very interesting application
of a population genetic concept, linkage disequilibrium, to disease gene mapping. Let me ask, how many
of you are familiar with the concept of linkage disequilibrium? Okay, I see just a few hands.
So let’s go through this, because this turns out to be very important for understanding
not just SNP data, but also genome data. So basically what linkage disequilibrium is,
it can be described as the non-random association of alleles at linked loci. So let’s imagine
that we have here two loci, A and B, and their alleles are big A and little A, and big B and little B. At equilibrium
we’re going to see all possible combinations, but in disequilibrium where there’s a non-random
association of alleles, we see big A and big B together, little A and little B together,
but very seldom do we see the other combinations. And that, in essence, is what we mean by linkage
disequilibrium.

Now, we can quantify this by looking at the
allele frequencies of big A and little A, 60 and 40 percent; big B and little B, 70
and 30 percent. Now, what we would expect under equilibrium is that in this population,
haplotypes having this combination would be seen 42 percent of the time. That is the frequency
of big A multiplied by the frequency of big B. That’s essentially random association,
very much like Hardy-Weinberg, except now extended to two loci. Similarly, we would
expect the frequency of big A and little B together on the same chromosome copy, the
same haplotype, to be 18 percent, 60 percent times this frequency, 30 percent. So that’s
what we would expect under linkage equilibrium, but let’s suppose we assay a population and
we see that we have a real excess of this haplotype and an excess of this haplotype,
and then a paucity of the other two haplotypes. Well, that would be linkage disequilibrium.
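The comparison between expected and observed haplotype frequencies can be written out as a short calculation. This is a sketch under stated assumptions: the allele frequencies (0.60 and 0.70) are the ones given above, but the observed haplotype frequency of 0.55 is a made-up value, and the summary measures D and r-squared are standard statistics that the talk does not name. The function name `ld_statistics` is also hypothetical.

```python
def ld_statistics(p_A, p_B, observed_AB):
    """Compare an observed haplotype frequency with its equilibrium expectation."""
    expected_AB = p_A * p_B                  # random association of alleles
    D = observed_AB - expected_AB            # disequilibrium coefficient
    r_squared = D ** 2 / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return expected_AB, D, r_squared

# Allele frequencies from the example: big A = 0.60, big B = 0.70.
# The observed big A / big B haplotype frequency (0.55) is hypothetical.
expected_AB, D, r_squared = ld_statistics(0.60, 0.70, 0.55)

# Expected frequency of big A with little B under equilibrium.
expected_Ab = 0.60 * (1 - 0.70)

print(f"expected AB = {expected_AB:.2f}, expected Ab = {expected_Ab:.2f}")
print(f"D = {D:.2f}, r^2 = {r_squared:.2f}")
```

A value of D near zero corresponds to linkage equilibrium; an excess of the big A and big B haplotype over its expected frequency gives a positive D.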
We’re finding these alleles in combination much more often than we would expect given
their frequency.

So what this suggests, most of the time, is
that the alleles that have higher linkage disequilibrium have had less opportunity for
crossover to occur between their respective loci. So over many generations we’re going
to find big B and big C together on the same haplotype more often than big A and big B,
because being further apart, these two loci will have had their alleles broken up by recombination
much more frequently than this pair. So what that implies is that we can look at linkage
disequilibrium patterns as a way of inferring how close together any two loci are. It’s
another way of doing linkage analysis.

But it has some advantages. We don’t need
family data necessarily. If you’re doing traditional linkage, you’re, of course, counting recombinants
from one generation to the next. We can use microarrays or sequence data so we can look
at a large number of single nucleotide variants spaced as closely as a kb or so. And we can
do association studies that effectively incorporate not just the last two or three generations
of recombination to map loci, but essentially all of the generations of recombination that
have occurred since a variant arose, because, really, for any given mutation, populations
are in essence just one big pedigree. So if all of these individuals in these different
families inherited a mutation from this founder back here, what linkage disequilibrium allows
us to do is to look at recombinations that have occurred between this mutation and nearby
SNPs throughout the generations. So in principle, it allows us to more finely map loci than
we could map if we were just doing recombination mapping, linkage analysis, in say three generations
of a family. So that’s the advantage of linkage disequilibrium, and that’s one of the reasons
why, if you look at the number of papers published over the last 30 years on linkage disequilibrium,
back in the 1980s — this was when I first became interested in that topic — there were
about 50 papers a year published on linkage disequilibrium. You could read a paper a week,
and you knew everything that was going on. Now, this has kind of plateaued, but around
1,200 to 1,400 or 1,500 papers a year are published on this topic. So it has gained a lot of interest,
a lot of popularity as a gene mapping tool.

But there are a lot of factors that can influence
linkage disequilibrium patterns. One is chromosome location, just as with recombination: because
recombination is more common near telomeres, the relationship between the linkage disequilibrium
between two loci and their actual distance is going to vary. We also know that there
is less recombination within genes than in extragenic regions that will, again, affect
the relationship between linkage disequilibrium and physical distance. Sequence patterns affect
recombination and therefore linkage disequilibrium. So GC content is associated with more recombination,
and the presence of inserts like Alus is associated with more recombination. We know now of recombination
hotspots every 50 to 100 kb in the genome; in particular, motifs that are bound by this
zinc finger protein, PRDM9, are associated with a high proportion of hotspots. It’s interesting
that there is more variation in PRDM9, in a repeat unit in PRDM9 in African populations
than non-African populations, one of the reasons why we see more recombinations in African
populations, that and their population history. And finally, the evolutionary factors that
we’re interested in in population genetics — natural selection, gene flow, mutation,
and genetic drift — all influence the pattern of linkage disequilibrium.

So linkage disequilibrium can be rather complex
to interpret. Here’s an example: the age of a population. And of course, populations really
all have the same age, but when we talk about an older population, we’re really referring
to a population that was founded longer ago like the current African population, and in
such populations there have been many generations for recombinations to occur. So that means
that there’ll be a lot of different haplotypes in relatively smaller blocks in a population
like that. On the other hand, if we look at, for example, a Finnish population, most of
which was founded relatively recently from a small number of individuals, there haven’t
been as many generations passing for recombinations to have occurred, so we tend to see larger
blocks of haplotypes, more disequilibrium. And that means that a mutation here will be
associated with more nearby polymorphisms even after many — even in modern populations,
whereas a mutation that occurred in this population will tend to occur in association with a smaller
number of polymorphisms.

And if we look at patterns of disequilibrium
in these populations — we’re looking now back at the angiotensinogen locus, and each
of these little units here is a SNP at that locus, and we can interpret this plot much
like we do a mileage chart. For those of you that remember mileage charts from atlases,
this might be, say, San Francisco, this would be New York, and here would be the distance
between San Francisco and New York. If this were Los Angeles, then this would be the distance
between San Francisco and Los Angeles. Well, for these SNPs, this is the amount of linkage
disequilibrium between these two SNPs, this pair of SNPs, and this is the amount of linkage
disequilibrium between these two rather distant SNPs. Red indicates high disequilibrium, and
what you see here is much more disequilibrium in this locus in the more recently founded
Eurasian population than in the African sample; so consistent with what we know about population
history.

So one of the questions that we want to ask
is “Well, how general are these patterns across the genome? And how much does linkage disequilibrium
vary with genomic location and with population?” And I would say that about 10 or 12 years
ago, our knowledge of that, of haplotype structure across the genome, was kind of like this map
of the world in 1544. I think these maps are fascinating. At the time Europe was reasonably
mapped out, Asia to some extent, North America was not even on the map, so it was a — it
was a pretty low-resolution and misinformed map of the world. Well, the HapMap Project,
which I know all of you have heard about, really sought to create a better, more accurate
map, haplotype map, of the human genome. So it started with 600,000 SNPs. That was expanded.
The populations were three: 90 Utah CEPH individuals representing people of European ancestry,
90 Yorubans from Nigeria, and 90 East Asians; by no means a complete sample of human diversity,
but a small subsample. And the idea was to evaluate patterns of linkage disequilibrium
and haplotype structure across the genome in these different populations.

And I think the result was a map that looks
more like this. By 1688 we had a much better resolved map of the world. Somehow California
still escaped notice here, but for the most part we had a much, much better map of linkage
disequilibrium. And this has led to some very useful applications: first of all, understanding
human genome-wide haplotype diversity; detecting recombination hotspots; detecting genes that
have experienced strong natural selection; and of course, detecting disease-causing mutations.
And in this last part of the talk I’ll go through a few examples of those. Certainly
one of the take-home messages from that project was that SNPs, many SNPs throughout the genome
are redundant. So if you have this SNP here, then you almost certainly have these alleles
here, so these tag SNPs are really all we have to genotype. The others are effectively
uninformative, because they’re in strong linkage disequilibrium, meaning that we don’t have
to type nearly as many polymorphisms to get complete coverage of the human genome, more
in individuals of African descent, but still far fewer than the total number of SNPs that
have been discovered. And that has led to this success story, I think, and you’ve all
seen this slide or some version of it: the many, many hundreds of replicated associations
across the genome using SNPs designated from the HapMap Project.

Now, it also — these data also allow us to
detect hotspots of recombination, because what we will see often is blocks of linkage
disequilibrium for this group of sequence, where there are strong associations among
alleles, but no association between essentially this block and this block, because of a recombination
hotspot, where recombination is elevated at least tenfold over the rest of the genome,
rapidly disassociating those groups of alleles from one another. And of course, that is going
to influence our estimates of distance among loci. If there’s an intervening hotspot, we’re
going to have unexpectedly low linkage disequilibrium. So we estimate that there are as many as 50,000
or so recombination hotspots throughout the genome, and that about 60 percent of all recombination
occurs in just 6 percent of the genome, much of it focused at these highly active hotspot
areas. And what’s very interesting is that hotspots vary among species. In fact, in chimp,
the location of hotspots is completely different from that of humans. PRDM9 is not active in
chimps, so that explains part of it. And we also see variation even among human populations
in the location and activity of hotspots. So this is really helping us to understand
this very, very important property of genomes, how frequently, and where they shuffle, recombine. Now, another thing that these linkage disequilibrium
patterns allow us to do is to detect regions that have undergone very strong natural selection.
And the idea is diagrammed here. If we imagine a new DNA variant arising on a haplotype background,
then if it’s neutral, that is, if it does not undergo natural selection,
it will in some cases slowly increase in frequency, but as it does so, that background haplotype,
that is, the other SNPs associated with it, becomes smaller and smaller due to recombination
that’s occurring generation after generation. So for a neutral variant, if it attains high
frequency it will have relatively low disequilibrium with other nearby SNPs, because of recombination. But now imagine that this is an advantageous
variant that sweeps very rapidly to high frequency. What it’s going to do is to carry
those other variants along with it, also at high frequency, and we’re going to see long
regions of homozygosity in populations, because of selection not only of this advantageous
variant, but of nearby SNPs. So we can look for regions that have this signature as a
way of detecting SNPs, detecting variants that have undergone very rapid positive selection.
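One common way to quantify this signature is extended haplotype homozygosity (EHH): the probability that two randomly chosen chromosomes carrying a given core allele are identical across a surrounding window. The talk does not name a specific statistic, so the following is only a minimal sketch under stated assumptions: phased haplotypes coded as strings of '0'/'1', a window measured in SNP positions rather than base pairs, and a hypothetical function name `ehh` with toy data invented for illustration.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core_index, core_allele, distance):
    """Simplified extended haplotype homozygosity around a core SNP.

    haplotypes: list of equal-length strings of '0'/'1' alleles (phased).
    Returns the probability that two randomly chosen haplotypes carrying
    core_allele are identical at every site within `distance` SNPs of the core.
    """
    carriers = [h for h in haplotypes if h[core_index] == core_allele]
    if len(carriers) < 2:
        return 0.0
    start = max(0, core_index - distance)
    end = min(len(carriers[0]), core_index + distance + 1)
    # Group carriers by their exact sequence over the window.
    groups = Counter(h[start:end] for h in carriers)
    # Count pairs of carriers that fall in the same group.
    same_pairs = sum(comb(n, 2) for n in groups.values())
    return same_pairs / comb(len(carriers), 2)


# Toy data (invented): site 3 is the core SNP.
# Allele '1' sits on a nearly uniform background, as after a rapid sweep;
# allele '0' sits on many different backgrounds, as for an old neutral variant.
haplotypes = [
    "0101101", "0101101", "0101101", "0101100",
    "1100010", "0010110", "1000001", "0110011",
]

for d in (1, 3):
    print(d, ehh(haplotypes, 3, "1", d), ehh(haplotypes, 3, "0", d))
```

In this toy data the haplotypes carrying allele '1' at the core stay identical over a wider window than those carrying '0', which is the pattern expected when a variant has risen to high frequency faster than recombination could break up its background.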
So this is another illustration of the idea. If there’s positive selection for this variant,
it will pull the adjacent variant along with it that it’s associated with here, and after
a while, most, maybe all, members of the population will have this combination of variants. You’ll
see a region of homozygosity here. We can compare that, for example, to purifying selection
where variants occur, but because they’re deleterious, they simply get eliminated. So this approach is now being used in a number
of very interesting applications. For example, to show that the variation at the G6PD locus
was selected for very strongly for malaria protection. This cytochrome P450 locus underwent
selection for sodium retention. A very interesting story is the lactase enhancer: independent
populations, some in Europe, some in Africa, that are herding populations have hereditary
lactase persistence so that they can digest milk throughout their lives. There’s an enhancer
element that has undergone strong selection in those independent populations, a good example
of convergent evolution just within the last 10,000 years. Several skin pigmentation loci
have, again, undergone rapid selection as humans encountered new environments, and
Tyra mentioned work that we and others have done on high-altitude hypoxia response in
Tibetan populations, because Tibet is one of those great, essentially natural experiments
done on humans where humans moved to an altitude of 15,000 feet or so and successfully
adapted, in part by altering their response to high-altitude hypoxia. And so we’ve discovered
selection, and now specific variants, at these members of the HIF pathway that helped
to confer that high-altitude adaptation. So these were all discovered by exploiting
these properties of linkage disequilibrium in populations. And I’ll say that population genetics is also
guiding the development of sequence analysis, as we are now analyzing more and more exomes
and whole genomes. The 1000 Genomes Project provides a very useful set of control sequences,
because whenever we sequence a group of patients, one of the questions is, if we find a variant
in that group, is it a variant that is absent in other populations, or at least very rare?
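A minimal sketch of this kind of control filtering, assuming variants are keyed by a simple position-and-allele string and that control allele frequencies are already available as a lookup table. The variant identifiers, the frequencies, the function name `rare_candidates`, and the 1 percent cutoff are all hypothetical choices for illustration, not values from the talk or from any real control panel.

```python
def rare_candidates(patient_variants, control_freqs, max_control_freq=0.01):
    """Return patient variants that are absent or rare in the control panel.

    patient_variants: iterable of variant identifiers seen in the patient group.
    control_freqs: dict mapping variant identifier -> allele frequency in controls.
    A variant missing from control_freqs is treated as absent (frequency 0.0).
    """
    return [
        v for v in patient_variants
        if control_freqs.get(v, 0.0) <= max_control_freq
    ]


# Hypothetical data, invented for illustration only.
patient_variants = ["chr1:1000:A>G", "chr2:2000:C>T", "chr7:3000:G>A"]
control_freqs = {
    "chr1:1000:A>G": 0.32,   # common in controls, so filtered out
    "chr2:2000:C>T": 0.002,  # rare in controls, so kept
    # chr7:3000:G>A is not seen in controls at all, so kept
}

candidates = rare_candidates(patient_variants, control_freqs)
print(candidates)
```

The cutoff is a design choice: a stricter threshold removes more benign polymorphisms but risks discarding a disease variant that happens to be present at low frequency in the controls.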
And the 1000 Genomes Project has given us one of the important sets of control sequences
for that kind of variant analysis. And I think we need our population genetic analysis to
inform us about the nature, the behavior of rare variants, because these rare variants
often are the ones that we are especially interested in, in terms of their association
with disease now that we’re able to do whole genome sequencing. And evolutionary principles,
population genetic principles, help us to determine when a variant is actually functionally
significant, because we can find associated variants, but figuring out which ones actually
have functional relevance is, in many cases, quite a challenge. So we incorporate, and others do this as well,
purifying selection. We look at regions that have undergone purifying selection as a way
of prioritizing candidate variants when we’re doing genome analysis. And I’ll just mention
this software that we’ve developed in the last few years: VAAST, and now Pedigree VAAST,
which has just come out. So this is a tool for analyzing sequence data, and Pedigree
VAAST makes use of sequence data in pedigrees. So that’s just coming out in Nature
Biotechnology, but one of the things we use is evidence of purifying selection to assign
functional significance, and of course, evolutionary conservation among species, which is again very
useful in deciding which variants might actually have functional relevance and significance. So I’ll just wrap up by saying that what I
hope you’ve seen today is that genetic variation does contain a lot of useful information about
the history of our species. I think it gives us a more subtle and nuanced view of issues
like race and how relevant they are or are not to medicine. I think it gives us some
useful alternatives. And really, population genetic analysis, our understanding of concepts
like linkage disequilibrium has actually been of fundamental significance in gene mapping
efforts, and now as we’re trying to understand the role of rare and common variants in disease,
again, understanding the evolutionary processes that give rise to those variants is turning
out to be of key significance. And I hope you’ve seen that population genetics, which
sometimes people associate with a lot of heavy math, is actually fun. So I hope I’ve convinced
you of that. I’ll leave you with this picture of the lovely Wasatch Mountains. This is my
back yard where I enjoy playing, and here are some of the people that contributed to
some of the work I told you about, and I want to thank you for your kind attention, and
I’m happy to take any questions. [applause]
Lynn Jorde:
We’ve got about three or four minutes here. Yes, sir?
Female Speaker:
Can you use the microphone, please? Can you use the microphone?
Male Speaker:
Like plants and microorganisms, do humans have significant numbers of mobile elements,
transposons and such, and how does this complicate genetic analysis?
Lynn Jorde:
Yeah. That’s a great question. We estimate that at least half of our genome is derived
from mobile element insertions. So if you look at Alus, it’s about 11 percent. There
are more than a million 300-base Alus in the human genome, mostly inserted earlier in the
course of primate evolution, but some of them since humans diverged along their own independent
lineage. Another 17 to 20 percent are LINE-1 elements, and one of the interesting questions
is what effect these have on the genome. We know that occasionally you can have, for example,
transduction of other genetic elements, as an L1 pops out and goes someplace else; sometimes
it takes other material with it because it has a rather weak poly(A) signal, so it is sometimes
involved in the transfer of other genetic elements. Because these are highly methylated
sequences, they’re CG-rich, they may affect gene regulation depending on where they land. So we think that they do occasionally have
effects on the genome. And, of course, we’ve got some very good examples now in which these
elements have inserted into a specific gene and caused loss of function. And there are some
good examples in which they mediate unequal crossover. The BRCA1 gene is full of Alu elements,
and one of the reasons why you see so many deletions in BRCA1 is that these Alus
are mediating unequal crossover and causing deletion. Also they do have some interesting
effects, and I think, because they’ve been difficult to identify easily, they have been
somewhat challenging to understand, but a lot of work is being done in that area. Other questions? Okay. Well, thanks very much. [applause]
