Analysis of Genetic Association Studies

[Dr. Teri Manolio]
What we thought we’d do this
morning is get a little bit more into some of the details of
genetic association studies. And I think we forgot to mention
yesterday that we did want to get some course evaluations
from you, so we’ll be passing those out at the break,
and if you would fill them out, at the end, obviously,
that would be very helpful, and just leave them
at the front there. Okay, so the things that I was
going to cover this hour were discrete and
quantitative traits, measures of association,
false positives, quality control, which is a big
issue, then sort of initial looks at the data with what
are called Q-Q plots, odds ratios, which you all
are quite familiar with, but how they’re calculated
in genetic studies, and then a little bit of
transmission and interactions. And because we’re
being quantitative, “Yes, I know that, Sidney,
everybody knows that, but look, four wrongs squared
minus two wrongs to the fourth, yes — divided by three, by this
formula do make a right.” So Gary was also a very
quantitative person. Quantitative genetics is
actually something that we don’t tend to learn in
high school and college. We tend to focus more on
the qualitative components, but the quantitative is
concerned with inheritance of differences between
individuals that are of degree rather
than of kind. I kind of like the
way they put this, so I quoted Falconer and Mackay;
this textbook of quantitative genetics is really —
it’s very readable, and really the classic in the
field, so if you’re interested in reading more, I would
highly suggest it. So what are the
differences? There are continuous gradations
obviously, rather than sharply demarcated types. The effects of the genes
generally are small, in order to give you a smooth
distribution as opposed to, say, multiple modes, whereas in qualitative traits
the effects are large. And usually there are many
genes in quantitative traits, whereas there tend to be single
genes inherited in Mendelian ratios for discrete traits,
and I put a question mark there because, actually,
you know, that was kind of the central dogma until a few
years ago when it became very clear that there were important
complex traits that were inherited in families in very
sort of predictable ways that didn’t seem to follow
Mendelian ratios. Models for single traits,
so you have a big A allele that gives you a pink flower
and a little a allele that gives you a white flower,
and if you have three different genotype groups,
if A — if big A is dominant, anywhere that you have a big A
you’re going to have a pink flower, as you
see here. Anywhere that you have —
if A is recessive, you’d need to have two copies
of big A in order to have a pink flower. And if A is codominant,
there are a variety of terms for this, but codominant
is one that you tend to see fairly often, you’d get —
with two copies you’d get a really pink flower,
with one copy you’d get sort of a pinkish-whitish
flower, and then with no copies you’d have
your white flower. For quantitative traits,
say that your big A allele gives you X units
increase in height, and your little a allele gives
you X units decrease in height. And say these are
of equal frequency; your population mean then would
be some centered value, zero, and minus X and plus X would be
the extreme values for your little a and big A
homozygotes. So if you have two copies of
little a, you'd be at this end, and if big A is completely dominant,
both your big A homozygote and the heterozygote would be
at the other end of the spectrum. If A is only partially dominant,
the heterozygote wouldn't be right down here at the middle, but would be
sort of a little bit closer over to the big A homozygote. If A is not dominant, that is,
codominant, you'd be, you know, sort of smack in the
middle, and you can even have, although it's rare, A being
overdominant, with the heterozygote outside the homozygote range. And this may be some kinds of
interactions between the two homozygotes or some
other thing. Again, this doesn’t happen very
often, but it is wise to be aware that it
could happen. The quantitative traits that
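The single-locus model just described, with the two homozygotes at minus a and plus a and the heterozygote at a dominance deviation d, can be sketched in a few lines. The numbers and names here are illustrative, not from the lecture:

```python
# Sketch of the single-locus model for a quantitative trait: the aa and AA
# homozygotes sit at -a and +a, and the heterozygote sits at d, the
# dominance deviation (Falconer and Mackay's convention).

def genotype_values(a, d):
    """Return the genotypic values for the aa, Aa, and AA genotypes."""
    return {"aa": -a, "Aa": d, "AA": +a}

# d = 0      -> no dominance (codominant): heterozygote exactly intermediate
# d = a      -> A completely dominant: heterozygote matches the AA homozygote
# 0 < d < a  -> partial dominance: heterozygote shifted toward AA
# d > a      -> overdominance (rare): heterozygote outside the homozygote range
no_dom   = genotype_values(a=2.0, d=0.0)
complete = genotype_values(a=2.0, d=2.0)
partial  = genotype_values(a=2.0, d=1.0)
overdom  = genotype_values(a=2.0, d=2.5)
```

With d = 0 the three genotype means fall on a line, which is what the additive models later in this lecture assume.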
to date have had published genome-wide studies, actually
yesterday there was another one in protein levels,
but for the time being, there were these
that we can list. The Framingham study,
as I mentioned, had these 18 groups of traits,
many of which were quantitative, and I never quite know how
to count them, is it — you know, is it 16 or is it 34,
but at least it’s around 20ish. And we went through
an example yesterday, in terms of how one looks at
associations to get allelic odds ratios. So this again for a discrete
trait for a myocardial infarction, 55 percent of your
cases have the C allele, 45 percent — sorry, 47 percent
of the controls have the C allele. This gives you a chi-square of 55
and a p-value
of 10 to the minus 13th. You can calculate what's
called an allelic odds ratio, which is the odds of having
a disease if you carry one or two copies
of the allele. So it doesn’t matter
how many you have, just having the presence of the
allele, and here that would be 1.4. You can also calculate
these by genotype group. Again, we did
this yesterday. So the CC homozygote, 31 percent
of the cases and 23 percent of the controls,
while the GG homozygote, 20 percent of the cases,
28 percent of the controls. And again, a strong chi-square
with two degrees of freedom now, and a p-value of 10
to the minus 14th. Here you can calculate the
heterozygote odds ratio specifically, which sort of
lets these float free. It might be that the homozygote
doesn’t have all that much increased odds over
the heterozygote, it would be a little hard to
explain biologically unless you truly have a dominant trait,
which we tend to be thinking now may not be very common
at all these days, at least not in
complex diseases. But calculating a heterozygote
odds ratio between these two groups gives 1.5, and a homozygote
between this group and that group would
be 1.9. And this is the way that
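Those odds ratios can be reworked from the quoted percentages. Because these are rounded percentages rather than the study's raw counts, the heterozygote value comes out near 1.4 rather than the quoted 1.5:

```python
# Cross-product odds ratios from the case/control percentages quoted above.
# Percentages stand in for the raw counts, so small rounding differences
# from the quoted values are expected.

def odds_ratio(exp_cases, unexp_cases, exp_controls, unexp_controls):
    """Cross-product odds ratio for a 2x2 table."""
    return (exp_cases * unexp_controls) / (exp_controls * unexp_cases)

# Allelic OR: 55 percent of case alleles vs 47 percent of control alleles are C.
allelic_or = odds_ratio(55, 45, 47, 53)   # about 1.38, quoted as 1.4

# Genotypic ORs with GG as the reference genotype.
# Cases: CC 31%, CG 49%, GG 20%; controls: CC 23%, CG 49%, GG 28%.
hom_or = odds_ratio(31, 20, 23, 28)       # about 1.9, as quoted
het_or = odds_ratio(49, 20, 49, 28)       # about 1.4; quoted as 1.5
```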
these data are displayed, and again we talked about this
yesterday with the log of the p-value here, and this is a nice
example, because they really found just sort of one very
strong association on chromosome 9 — 9P21,
and this association has been found in many
other studies as well — and mentioned that there
are other ways of looking at these in addition,
and here’s another one — one for a continuous trait now,
serum uric acid levels. What this group did was
to regress inverse-normalized levels
against the number of alleles; it was inverse normalized
just to transform the levels into a normal distribution. This was an additive model,
so if you have one allele, you’d get, you know, sort
of part of the effect, and two alleles, you get twice
the effect, and they’re sort of forced into that kind
of a distribution. And even one can use covariates,
and you can insert any covariates that you
want in here. So it’s nice to see that,
while with the qualitative traits they tend to be
relatively simple analyses, just with chi-squares,
essentially, or even Fisher’s exact test,
with the quantitative traits, people are getting a
little bit more sophisticated, in terms of linear regression
and adjusting for covariates. This one is another example
in the uric acid literature, and showing here in two
different cohorts, here now are the mean uric acid
levels in the three genotype groups — AA, AG, and GG in
these two different cohorts. And showing the additive effect
for a single allele, having one allele — sorry,
having one G allele drops your genotype mean,
as you can see here, in both of these
cohorts. Association methods for
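A minimal sketch of this kind of additive-model analysis, assuming a rank-based (Blom) inverse normal transform and simple least squares; the simulated data and helper names are illustrative, not the published pipeline:

```python
import math
import random

def _norm_ppf(p):
    """Standard normal quantile by bisection on the CDF; fine for a sketch."""
    lo, hi = -8.0, 8.0
    for _ in range(80):
        mid = (lo + hi) / 2
        if 0.5 * (1 + math.erf(mid / math.sqrt(2))) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def inverse_normal_transform(values):
    """Blom-style rank-based inverse normal transform of a phenotype."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    z = [0.0] * n
    for rank, i in enumerate(order, start=1):
        z[i] = _norm_ppf((rank - 0.375) / (n + 0.25))
    return z

def additive_beta(allele_counts, phenotype):
    """Least-squares slope of phenotype on allele count (0, 1, or 2)."""
    n = len(phenotype)
    mx = sum(allele_counts) / n
    my = sum(phenotype) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(allele_counts, phenotype))
    sxx = sum((x - mx) ** 2 for x in allele_counts)
    return sxy / sxx

# Simulated example: each copy of the allele lowers the trait a little.
random.seed(1)
g = [random.choice([0, 1, 1, 2]) for _ in range(500)]
y = [-0.5 * gi + random.gauss(0, 1) for gi in g]
beta = additive_beta(g, inverse_normal_transform(y))
```

Covariates would be handled the same way with multiple regression; this keeps only the single genotype term to show the additive coding.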
quantitative traits that tend to be used, as I mentioned,
linear regression, sometimes multivariable adjusted
residuals, just to take out the noise effects
of covariates. One can use linear regression
of log transformed or centralized BMI; this was
done in the Frayling paper. Variance components was very,
very popular at one point in linkage analyses, and it’s sort
of carried over into these studies as well. And so you can do a
z-score analysis, as Sanna did at quantile
normalized height, but there are, as I said, a
variety of ways of doing this. Ways of dealing with multiple
testing — and I think we talked about that a little
bit yesterday — the family-wise error rate is
sort of the general term that’s given to corrections for
either the Bonferroni or the Šidák correction, which is a way
of sort of correcting for the number of tests,
dividing your type 1 error across
the universe of possible type 1 errors. I mentioned also the false
discovery rate, the proportion of significant associations that
are actually false positives, and the false positive report
probability of Wacholder et al., which sounds very much
like the false discovery rate but is a little
bit different. There’s also a Bayes factor
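The family-wise corrections just mentioned are easy to compute; for a hypothetical scan of 500,000 tests at an overall alpha of 0.05:

```python
# Per-test significance thresholds that hold the family-wise error rate
# at alpha across m tests. The m here is a hypothetical scan size.
m = 500_000
alpha = 0.05

bonferroni = alpha / m                # 1e-7, the familiar genome-wide cutoff
sidak = 1 - (1 - alpha) ** (1 / m)    # slightly less conservative than Bonferroni
```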
analysis; Bayesian analyses are sort of creeping into
analysis of genome-wide studies, particularly
with U.K. influence, so the Wellcome Trust did part
of this in one of their papers. It is a challenge, because you
have to, you know, basically identify a reasonable
alternative model that you’re comparing your
data to. So not a lot of people use it,
but you may see more and more of it. So, moving on to quality
control, as these folks are doing, “Hold your horses,
everyone, let’s let it run for a minute and see if
it gets any colder.” So, unfortunately
with genotyping, we tend to sort of take
it at face value, and there’s sort of this in
silico truth that you just assume that it must be right
if it comes out of a machine. And there are actually a number
of things that one needs to do to be sure that the data that
you get are not artifactual. These include being sure that
the samples that you have are the right samples and that
they're high-quality samples. So we talked yesterday about
how microsatellites, variable number of tandem
repeats, are used in forensic genotyping to
identify individuals. And there are automated kits
that are very easy to use and relatively
inexpensive. Identifiler is
one of them. I’m sorry, I don’t know the
company name or their location, but some of the genotyping
labs will run these SNPs — I mean, these markers,
they’re highly polymorphic, there are, I don’t know,
13 to 20 of them, and use that essentially as
a bar code on the sample, so that whenever then you go
to test that sample again, you run this bar code on it,
and make sure that you have the right sample and it’s a
better label than anything you could stick on that
side of the tube. Blind duplicates, of course,
make sense. What tends to be done, rather
than blind duplicates from within the study is duplicates
of known specimens like the subsamples, or any other
standard samples from the HapMap. Gender checks, it’s surprising
how many women end up, you know, in the prostate cancer samples,
and how many men end up in the female breast cancer
samples, et cetera. This is a little bit
more of a scary one, because while you have a very
good genotype measure, generally for sex, you
don’t have the best, you know, recording
of that information, and it’s sometimes incomplete or
incorrect, surprisingly enough. And so those corrections can be
made, but you wonder what kinds of other corrections needed
to be made that are not — that cannot be picked
up genotypically. Cryptic relatedness,
or what we sort of refer to as outbreaks of twinning, arises
when there are duplicate samples in there that
you didn’t realize, and suddenly you have people
that have exactly the same markers that probably are not
twins, but those are also things to be looked for,
and we’ll talk a little bit about that. Degradation of the samples or
fragmentation is something that the laboratories
should be testing for. Call rates greater than
80 to 90 percent, and we’ll talk about those,
heterozygosity and plate/batch calling effects. So all of those are things that
one looks at in the samples, and then, once you’ve done the
samples, you also want to look at the SNPs within samples,
so again you would look at duplicate concordance rates
for a single SNP, because some SNPs are hard to
type, just like some samples are hard to type. Mendelian errors, if you have
family data, you can then see if there are errors
in transmission, so you might have siblings
where there are more than four alleles, which
you shouldn’t have, or, you know, certainly
parent-offspring where the offspring has things that
the parents couldn't possibly have transmitted to them. Usually, only one of these errors will
be tolerated for a given SNP, and sometimes not even that
many, but usually not more than that. Hardy-Weinberg errors, we talked
a little bit yesterday about the expected proportions of the
common allele, homozygote, the heterozygote and
the variant allele, homozygote and those should be
in binomial proportions, if those conditions that
I mentioned apply, so things like random mating,
which actually doesn't quite apply, no in- or out-migration,
no selection, and so on. But in general, most SNPs
tend to be in Hardy-Weinberg equilibrium, and usually the
threshold for throwing out a SNP on violations of
Hardy-Weinberg are fairly high, so, you know,
10 to the minus seventh, you have to have a pretty strong
difference in distributions in order to question
that SNP. Heterozygosity, and we’ll talk
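The Hardy-Weinberg check described above can be sketched as a one-degree-of-freedom chi-square against the expected p-squared, 2pq, q-squared proportions; the counts and thresholds below are illustrative:

```python
import math

# One-df chi-square test for departure from Hardy-Weinberg proportions.
# For 1 df, the survival function is erfc(sqrt(x/2)), so no SciPy is needed.

def hwe_chi2_p(n_aa, n_ab, n_bb):
    """P-value for departure of genotype counts from Hardy-Weinberg."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)        # frequency of the A allele
    q = 1 - p
    expected = [p * p * n, 2 * p * q * n, q * q * n]
    observed = [n_aa, n_ab, n_bb]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return math.erfc(math.sqrt(chi2 / 2))  # survival function, 1 df

# Counts exactly at the expectation for p = 0.6: no evidence of departure.
p_ok = hwe_chi2_p(360, 480, 160)

# A big heterozygote deficit fails even the very stringent 1e-7 threshold
# mentioned above (illustrative counts).
fails_at_1e7 = hwe_chi2_p(500, 200, 300) < 1e-7
```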
a little bit about this, they’re, you know, based on
population genetics, there should be a proportion
of people who are heterozygous in a given SNP, it’s usually
in the 30 percent range, it seems to sort
of fall there. A little bit higher for
populations of recent African ancestry, a little bit
lower for populations that are younger, as it were,
or inbred populations where you wouldn’t expect to
have quite so many alleles floating in a population. Call rates for a given SNP,
there are some SNPs that you just can’t — you’re just
not sure from the intensity data that are generated,
what to call one of them, and so you leave one out, and
I’ll show you some examples of that. And you’d really like those
call rates to be greater than 98 percent, and that tends
to be the standard these days. It had been lower in the past,
and a lot of those that were lower actually turned
out to be artifact, and so we’ve sort of
learned a lesson there. Minor allele frequency,
which is the frequency in a population of the uncommon
allele, and generally alleles less than about a 1 percent
frequency, there’s some real difficulties in trying
to genotype those. Now, this is not going to be
a challenge once we get to sequencing, even for rarer and
rarer SNPs that may be only .5 percent or .1 percent
of the population. They may not be able to be typed
on the existing platforms, because they’re not terribly
sensitive to rare alleles. And then, perhaps the most
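The per-SNP call-rate and minor-allele-frequency filters just described might look like this; the 0/1/2 encoding with None for a no-call is an assumption for illustration:

```python
# Drop SNPs with a call rate under 98 percent or a minor allele frequency
# under 1 percent, the thresholds discussed above. Genotypes are coded as
# 0/1/2 copies of one allele, with None marking a no-call.

def snp_passes(genotypes, min_call_rate=0.98, min_maf=0.01):
    called = [g for g in genotypes if g is not None]
    call_rate = len(called) / len(genotypes)
    if call_rate < min_call_rate:
        return False
    maf = sum(called) / (2 * len(called))
    maf = min(maf, 1 - maf)          # fold to the minor allele
    return maf >= min_maf

good = [0, 1, 0, 2, 1, 0, 0, 1, 0, 1] * 10   # full call rate, MAF ~0.3
sparse = ([0] * 90) + [None] * 10             # call rate 0.90: fails
```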
important thing is validating the most critical results on an
independent genotyping platform. So you take those — the results
from your Affymetrix array or whatever, and then type
them on something that’s just specific to
a couple of SNPs. And I mentioned Hardy-Weinberg
yesterday, and again here it is again, so the ideal
conditions for it, and these are the proportions
that you would expect to see. So these are some data from the
genetic association information network, testing the two
platforms that we used on the HapMap data. The HapMap samples are available
from Coriell Repository. There’s a small fee, I believe,
for requesting them and sending them out. But they’re — they’ve been
genotyped by everybody in the world, so the data
are very stable. And what we showed here is with
the Perlegen platform of 481,000 SNPs, the Affymetrix
platform of 439,000 that were used — and this is
the 5.0 platform — that were used in the GAIN
study, these estimates of coverage are
the number of SNPs in the genome estimated to be
covered with an R squared
of .8 or greater. And this was really quite good
for the European population. A little bit lower is expected
for the Yoruba population. One can use just a single marker
estimating one marker in LD with another marker not on the
platform, or you can use information from a couple
of markers nearby and use a multi-marker imputation,
which gives you much more information, using the LD that’s
surrounding that SNP. And there you can — you know,
you can increase your coverage slightly, and the Affymetrix
platform, sort of similar numbers, a little bit lower. The average call rate in
both of these platforms was over 98 percent,
which is great. Homozygous genotypes,
the concordance with the HapMap was 99.8 percent
in the pearlision for both, and a little bit higher
for the homozygotes. Tends to be a little bit tougher
to type the heterozygotes, and I’ll show you
why in a second. Okay, and then some of the
sample and QC metrics for the Affymetrix 5.0, which
had about 500,000 markers, and the 6.0, which has about a
million markers, just shown here; these are
data, again, from GAIN. These are — this was actually
after we started doing this study. The previous ones were sort
of gearing up for it and looking at the
HapMap samples. And what you see here is that
we threw out about .4 percent of total samples because
they didn’t pass a variety of QC metrics, and .55 percent
didn’t have a greater than 98 percent call rate. These numbers are obviously not
exclusive, and we need to have sort of a flow chart of things
as they kind of drop down through these metrics. With the 6.0, the percent
failing was higher, but there were many more SNPs,
and many of them are much tougher to type, so about
4 percent of the samples failing, and about 1.4 percent
without a 98 percent call rate. And that’s 98 percent for that
sample across all of the SNPs that are typed. And then, looking at the
SNP-by-SNP quality — data quality information,
again about 500,000 and about a million showing here,
we dropped out about 6 percent of samples that did not —
sorry, 6 percent of SNPs that did not pass QC,
only .04 percent of the HapMap — sorry, of the 5.0
SNPs — had a minor allele frequency
of less than 1 percent, while on the much more dense
platform, as you would expect, a larger proportion of them had
less than 1 percent minor allele frequency. On the call-rate criterion,
greater than 98 percent, 8 percent of the SNPs fell
out on this metric, 9 percent on the
6.0 platform. If you drop that level to
95 percent, people tend to want to do this because they
don’t want to lose data, and there are many various
combinations of babies and bathwaters talked about
in discussing this. But at any rate, you’d lose
about 4 percent in the 5.0 platform, and about
4 percent here. The Hardy-Weinberg here now
set at 10 to the minus sixth, so really very stringent and
less than a percent dropped out for that. Mendel errors, in this
particular study with the 5.0 platform, we dropped out
a lot because we had family samples, and in most studies
you don’t have family samples and you really can’t tell. About the only way you can tell
is if you have more than two sibs, and you can actually
compare, you know, to see that they have more
than four alleles. So here we lost a lot,
and less than or equal to one duplicate error,
again, relatively small. So if you can see here that it’s
the call rate that really tends to drop out your SNPs,
and also the Mendelian error in that one
particular study. This is a plot of
heterozygosity. What one does is just
basically, you know, calculate the proportion
of heterozygotes in the population for each SNP. And as you can see, it really
kind of clusters around this .27 to .30 range. You can't see very well,
but there actually are SNPs shown out along the
tails here. And what I've done here is just
to kind of blow up this area in the 100 — you know,
the frequency of 100, kind of dropped out those,
so that now that you can see, there are some samples that
are coming out here, and some out on
the side as well. So we did have some outliers,
and this tends to be done as basically a — right now,
it’s just kind of — you kind of look at it and say,
“Hmm, those look a little bit odd, particularly
way out here.” Reasons for loss of
heterozygosity tend to be — well, cancers can
do it, but when you’re looking at germ line cells,
decreased heterozygosity usually is due to
genotyping error. Increased heterozygosity hadn’t
been worried about terribly in the past, and we learned
a harsh lesson from GAIN that probably what this
is telling us is sample mixing or contamination. Then it turned out
there were about eight samples that actually had been
contaminated with other samples in this particular GAIN
study, and we didn’t pick this up initially, and realized,
“Hey, there’s a real problem here,” went back
and looked at these. And you can see that these are
very small numbers of samples. In fact, there probably are
about eight of them here, and that was what
the problem was. So all of these metrics, again,
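A per-sample heterozygosity screen along these lines can be sketched as follows; the median-based outlier rule and the example rates are illustrative, not the GAIN thresholds:

```python
# Per-sample heterozygosity: fraction of called genotypes (0/1/2 coding)
# that are heterozygous. Contaminated samples tend to look too heterozygous,
# so we flag values far from the bulk of the distribution. A median-based
# rule is used because a big outlier would inflate a mean/SD rule.

def heterozygosity(genotypes):
    called = [g for g in genotypes if g is not None]
    return sum(1 for g in called if g == 1) / len(called)

def flag_outliers(het_rates, n_mads=6.0):
    """Indices of samples whose rate is far from the median (MAD rule)."""
    s = sorted(het_rates)
    med = s[len(s) // 2]
    mad = sorted(abs(h - med) for h in het_rates)[len(het_rates) // 2]
    return [i for i, h in enumerate(het_rates) if abs(h - med) > n_mads * mad]

# Most samples near the roughly 30 percent mentioned above; one far too high.
rates = [0.28, 0.30, 0.29, 0.27, 0.31, 0.30, 0.28, 0.29, 0.30, 0.55]
suspect = flag_outliers(rates)
```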
are evolving as experience accrues, and one of the things
that we really want to do in GAIN is write up this experience
and make it available to the scientific community. These are intensity plots
for the genotyping data, what’s calculated, and I don’t
understand the chemistry, and I don’t pretend to
be able to explain it, but I know that what you
basically produce — for each one of the 500,000 or
million SNPs that is typed, you produce a cluster plot
like this, where you have — if there are two alleles of the
SNP, you have allele A frequency plotted down here, and allele B
frequency plotted here. And as you might imagine,
this would be the AA homozygote, because there’s
no B frequency to sort of spread this out. This would be the BB homozygote
because there’s no A allele there. And then this guy is
the heterozygote. And you can understand why it
might be difficult to type the heterozygote, because you do
have some bleeding over into this group. And these Xs are things
that are not typed. And basically, this is all
automated; you can’t sit and look at 500,000 of these,
but basically the calling algorithm, and there are a
variety of calling algorithms. There are probably about eight
or 10 of them out there, but the companies, you know,
sort of settle on one that they use, and here the calling
algorithm decided, “I’m not going to call these,
they’re just too close on the border.” These ellipses are kind of what
the boundaries of the — of what the algorithm were
called, but as you can see, sometimes it doesn’t call even
something that’s inside of an ellipse, so that’s what
these look like. This is a very nice looking one,
although there aren’t that many AA homozygotes, and this
actually is the — allele A is the T variant,
and here, you know, lots of homozygotes here,
and nicely separated from the heterozygote group. Here is a little
bit tougher one. This is a rare allele,
so there are no AA homozygotes,
but there is a heterozygote cluster, and there’s
a homozygote cluster for the common allele,
and again, no calls. But here you’ve got much
more of sort of a mess. So this is a, you know,
badly clustering SNP, badly performing, and probably
most labs would repeat this one and try to get it to —
the intensity to be much closer along the B line
without anything in the A line intensity. And here you’ve got — maybe
these are the homozygotes and maybe these are
the heterozygotes, but it’s really very
difficult to tell. What’s recommended now, having
gone through many of these studies, is that when you —
whenever you see an associated SNP, you know, something that
you want to call it 10 to the minus seventh or whatever
level it might be, that you look at these plots and
make absolutely sure that they look good, that you have
three distinct clusters, so you don’t — that means
you don’t have to look at, you know, all 500,000 or a
million, but you do have to look at maybe 25 or 50,
and if you’re carrying through 25,000 of them,
that makes it a little bit tougher. And most people wait until
they’ve gone through multiple stages before they
start looking at cluster plots. And just one sort of kicker
in this process, this is a paper published by
Clayton, et al., in 2005 that actually colored their
SNP intensities by whether something — a sample came
from a case or a control. So the red are cases,
the blue are the controls. You can see the reds kind
of clustering down here, and even difficult to see them
there — and then these — this is just a different way
of — it’s a normalization of the intensity data. And you can see that there
is a systematic difference between cases
and controls. This could arise by different
treatment of the case and control DNA samples. They tried very hard to control
for that and make sure that the — you know, the protocols
were exactly the same, they were done in the
same laboratories at the same time on the same
plates and all of that, and they still ran into this,
and basically concluded that it probably had something to do
with the collection at the site. Happily, this is something
that can be adjusted for, so you can kind of calculate
vectors associated with caseness and controlness,
at least in — for a given SNP and
correct for it. But it is a cumbersome problem,
and one that needs to be looked for. Another thing that can introduce
a systematic error is if you’ve amplified the samples in one
group and not in the other, or amplified some of the
samples and not others. Amplification is when you have a
very, very small amount of DNA and you use polymerase
chain reaction, PCR, to make as much of
it as you like. You don’t always get good copies
of that, and it often gives you artifacts, and sometimes
they look like this. So the laboratory that does
these — this testing should be aware of these problems and
should discuss them with you, but just so you’re
aware of them. I wanted to get in a little bit
into the problem of sort of unknown relateds or
cryptic relatedness. Tom mentioned yesterday
population stratification, and looking at a population
that’s made up of a mixture of different groups. What’s shown here is
from the CGEM study, actually a little experiment
that they did where they mixed together a whole bunch of
groups of different geographic ancestry, and they started
with the HapMap sample. So a lot of times when people
do these kinds of analyses, they start with three population
groups — or samples that are known to sort of cluster
tightly and to distribute out differently
from each other. So that’s — this is the
Yoruba from the HapMap, the CEPH population and then
the Asian populations. And they also typed an
African American group, and as one might expect,
they sort of fell on the midrange between — in this
principal component analysis, between the Yoruba and
the CEPH population. They also typed a Native
American and Latino group, who tend to have more
Asian ancestry alleles, or at least to cluster more
with Asian ancestry groups, and you can see where the
Native Americans actually clustered quite tightly here,
Latino a little bit spread out a little
bit more. And then they did a third
principal component, so the way — and it’s a shame
that Kang [phonetic sp] isn’t here, because he could explain
this much better than I can, but anyway, what one does is
to look for vectors that are basically separating the
samples in the most effective ways, essentially. And you can do this up to,
you know, hundreds of principal components, but
after the first several, you tend to run
out of steam. So it sounds like we have
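The principal-component idea described above can be illustrated on simulated genotypes; here two artificial populations with very different allele frequencies separate cleanly on the first component (toy data, not the CGEMS analysis):

```python
import numpy as np

# Toy ancestry PCA: center a samples-by-SNPs genotype matrix (0/1/2 coding)
# and take the leading right singular vectors, which are the directions that
# separate the samples most effectively.

rng = np.random.default_rng(0)

def simulate(n, freqs):
    """n samples of 0/1/2 genotypes at the given allele frequencies."""
    return rng.binomial(2, freqs, size=(n, len(freqs)))

f1 = np.full(200, 0.2)   # population 1 allele frequencies (illustrative)
f2 = np.full(200, 0.8)   # population 2: very different, for a clear split
G = np.vstack([simulate(50, f1), simulate(50, f2)]).astype(float)

G -= G.mean(axis=0)                       # center each SNP
_, _, vt = np.linalg.svd(G, full_matrices=False)
pc1 = G @ vt[0]                           # projection on the first component

# Samples 0-49 and 50-99 should land on opposite sides on PC1.
group1, group2 = pc1[:50], pc1[50:]
separated = (group1.max() < group2.min()) or (group2.max() < group1.min())
```

Real analyses also thin SNPs for linkage disequilibrium and scale each SNP by its allele frequency, which this sketch omits.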
some competition here. I hope you can all still
hear me, but, okay. So in doing this then,
with the — you noticed that the CEPH population for
the third principal component actually moved in quite a bit,
so they got kind of pushed in. The Asians and the Yoruba
stayed pretty much the same; the Yoruba shifted
over a little bit. And here are your African
Americans where you’d expect them, sort of between the
two — these two groups. And now the Latinos tend to sort
of take off in this direction, between the Asians, and when you
go on to the fourth and fifth principal components, you have
this tremendous sort of clustering of samples that you
really can’t separate out, and then you have a few down
here that seem to be driving this, but it’s not clear,
and they tried going to a fifth principal component,
and still didn't find much — and they're kind of wondering,
you know, what the heck is going on. So in looking at this group,
they looked very carefully, particularly at these samples
that seem to be outliers, and that will be driving
the vector calculation. And what they found, actually,
was that many of these people were related. So these were two parent-child
groups, and then there were half sibs here, and these were
half sibs all along through here, who were all basically
unsuspected in this sample. When they then randomly chose
one person from each of the related groups, they were able
to separate out their clusters very nicely. So this was kind of a
surprise to many of us. This was presented at one of
the last GAIN meetings by Gilles Thomas, and it’s
something that you need to be aware of, that
unrelated — or sorry, that related people do
end up in your studies, even if you don’t expect them
to, and you may not always pick them up. What they pointed out was that
the studies that they were comparing, this again from
CGEMS, they had multiple studies around the world,
the ACS, the PLCO, the — this is a European study and
another northern European and another
European study. And you wouldn’t think that
there would be people in common across those studies,
but as it turned out, certainly in the U.S. studies,
there was a sib here that was participating in both the
Harvard professional study — the health professional
study and the ACS. There were seven people,
what we called socially conscious
people, who were participating in both studies. There were sib pairs in the
PLCO and three people in the PLCO and the health
professionals. So social consciousness is a
risk factor for unexpected twinning in your data,
and then sib pairs here, and a father/son
pair here. So this does happen;
it's probably in your studies even if you're not aware
of it, and you will find it with genotyping data. And if you go looking for a
structured population when you have two people that
are that closely related, it really sort of points
the vector at it. And it can also do other things
in terms of clustering when you’re trying to cluster your
SNPs and your data that really can mess up your
analysis. So, kind of summary points for
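One simple way to catch duplicates and close relatives like these is pairwise identity-by-state concordance across SNPs; the threshold and simulated samples here are illustrative:

```python
import random

# Pairwise identity-by-state (IBS) screen: duplicates and identical twins
# have genotype concordance near 1.0, while unrelated samples sit far lower
# (around one third for this uniform toy coding).

def ibs_concordance(g1, g2):
    """Fraction of SNPs with identical 0/1/2 genotype calls."""
    same = sum(1 for a, b in zip(g1, g2) if a == b)
    return same / len(g1)

def find_duplicates(samples, threshold=0.99):
    flagged = []
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            if ibs_concordance(samples[i], samples[j]) >= threshold:
                flagged.append((i, j))
    return flagged

random.seed(2)
a = [random.choice([0, 1, 2]) for _ in range(1000)]
b = [random.choice([0, 1, 2]) for _ in range(1000)]   # unrelated to a
dup = list(a)                                          # an unsuspected duplicate
pairs = find_duplicates([a, b, dup])
```

The loop is quadratic in the number of samples; real toolkits do the same comparison with heavy optimization.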
genotyping quality control: sample checks are done for
identity, for gender error, for cryptic relatedness. Sample handling differences
can usually be adjusted for, but you have to be
aware of them. And doing an association
analysis is often the quickest way to find your
genotyping errors. You’d like to think that that’s
how you would find your real signals, but it’s not. One of the things that we tried
to do in GAIN was actually to make all of the data,
the association data, the results, available to
everybody at the same time. We wanted GAIN to be, you know,
as forthright and sort of public community resource
as possible. And in doing that, we basically
said to the principal investigators and the genotyping
labs, “We’re not going to tell you who’s a case and who’s
a control, or what SNPs are associated with what,
we’ll just show you the, you know, the data,
kind of blinded, as it were, or masked
to SNP identity.” And it made it very, very
difficult for us to be able to sort of, you know,
separate out these things, and certainly doing quality
control and figuring out who was related and who wasn’t,
and we decided that blinding was really not the best
way to go about this, and so we now have a — sort of
a back and forth with the investigators who provided
the samples, and we allow a very short time for that,
because obviously quality control can go on for
months and months. But it is a
challenge. And rare SNPs are the most
difficult to call with these kinds of platforms,
and just be aware of that, particularly as we get into
rarer and rarer SNPs as we do more and more
sequencing. And the inspection of genotyping
cluster plots really is crucial. Okay, once you have your
data cleaned up, what tends to be done is to look
at the association statistics and taking an easy example,
this is again from the Easton study, you basically take
your chi-square statistics, you plot them along — there’s
an expected distribution, I didn’t label this, sorry. So this is expected chi-square,
and remember if you have no association whatsoever
and you just keep choosing random
samples, you’ll end up with the chi-squared
distribution, the normal distribution
squared. And this is the observed
chi-square, so if your observed population basically
followed the chi-squared distribution, that is,
there’s no association, you would expect all
of your plots — because you’re just lining
up your chi-squares — all of them to fall along
this little gray line here, which is the identity line. If they don’t fall on that line,
it means that they don’t — they’re not following that
distribution for some reason, which could be genotyping
artifact, or it could be that there’s a real association
there that’s sort of pulling them upward. And what you see in this
black line here is that, indeed, there are some
departures from the chi-squared distribution,
and some of them are actually, like this one here,
relatively dramatic from what would be expected in
the chi-squared distribution. These red dots are corrections
for the population structure in this population. So what tends to happen is
if you have differences in ancestry between cases and
controls, they will inflate these statistics; you can
correct for that. One way of correcting for it
is what’s called a genomic control or Lambda value —
we won’t go into this in detail, but you just
basically divide all of your statistics by whatever
the Lambda control value is, and that was done here,
and that’s what you end up seeing. So this one actually for Easton
didn’t look all that dramatic, but when they started doing
their replications of their top SNPs, it did sort of
pop out some results. You’ll also see people publish
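The genomic-control step just described — estimate lambda from the median test statistic and divide — can be sketched in a few lines. This is a generic illustration in Python on simulated data, not the Easton analysis pipeline:

```python
import numpy as np
from scipy import stats

def genomic_control(chisq):
    """Estimate the genomic-control inflation factor (lambda) and return
    corrected statistics. Under the null, 1-df chi-square statistics have
    median ~0.455, so lambda is the observed median over that value."""
    chisq = np.asarray(chisq, dtype=float)
    lam = np.median(chisq) / stats.chi2.ppf(0.5, df=1)
    return lam, chisq / max(lam, 1.0)  # only deflate, never inflate

def qq_points(chisq):
    """Expected vs. observed quantiles for a Q-Q plot against chi2(1 df)."""
    obs = np.sort(chisq)
    n = len(obs)
    expected = stats.chi2.ppf((np.arange(1, n + 1) - 0.5) / n, df=1)
    return expected, obs

# Simulate null statistics inflated by population structure (factor 1.1).
rng = np.random.default_rng(0)
raw = 1.1 * rng.chisquare(df=1, size=100_000)
lam, corrected = genomic_control(raw)
expected, observed = qq_points(corrected)
print(f"lambda = {lam:.2f}")  # close to the simulated inflation of 1.1
```

Plotting `observed` against `expected` gives the Q-Q plot; points hugging the identity line mean the corrected statistics follow the null distribution.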
tables like this, where they just basically plot
out a table of the observed, adjusted for that
genomic control, and the expected values by
the significance level, so .01 to .05, you’d expect
to see 934 based on the chi-squared distribution,
and they observe 1,239 corrected down
to this level. So, you know, a modest excess
of modestly significant, or really not very
significant at all SNPs. And then going down to
the increasing levels of significance until they got
to 10 to the minus fifth, and they had 13 times as many,
they would have expected to see one and they saw 13,
which suggested that they actually did have some important
associations outside of chance. You tend to see these tables
much less commonly. People like to look at pictures,
and so you tend to see Q-Q plots much more commonly. This is a very nice one
from Hafler et al., looking at multiple sclerosis,
and you can see the identity line here, it’s this red line. This gray line is their
initial associations, so their expected distributions,
and then what they observed, and you notice this
dramatic takeoff. Multiple sclerosis has long
been known to be associated with MHC — the major
histocompatibility complex — alleles, which are very, very diverse
and are very heavily genotyped on these platforms. So you have lots
and lots of SNPs. This is actually not all that
many, but it looks like a lot that are sort of taken off for
very high chi-squared values. If you then take out the major
histocompatibility complex SNPs and just plot what’s left,
you end up with this here. And you’re still having some
strong departures from the expected distribution. So you need to be aware that
at some — sometimes people will take out a particular
locus, or a locus that’s already known to be associated,
and kind of see what’s left. This was done for prostate
cancer, and the Icelandic study, again here’s the —
shown on a different scale, so here’s your identity level,
the blue is corrected, the red is uncorrected. You have, you know, very strong
associations shown here, but knowing that chromosome 8,
that 8q24 region we mentioned yesterday is known to be
strongly associated with prostate cancer. They basically took out all of
chromosome 8 and then looked again, and still you have a
departure from the expected level. Most of these kind of
fall out once you correct,
but you still have a few SNPs that are outside
the expected distribution. And this is the Q-Q plot for
myocardial infarction that I showed you yesterday —
I think I showed it to you yesterday, but it’s from
the Samani study in the “New England Journal” that was
part of the Wellcome Trust Case-Control Consortium. And here you can see there’s
quite a strong departure — a number of SNPs
that are departing from the expected distribution. And these actually are all
of the SNPs that were — had a p-value of less than
10 to the minus seventh, which they sort of declared
as their initial threshold. And when one looks at the actual
plot now arrayed along the chromosome, those 30-some
SNPs are these here. So this is a nice plot because
they’re all sort of, you know, hanging right on top of
a particular chromosome. And one can then sort of look
at not only this level of significance, but this group
is also departing from what you would expect on the —
in the expected distribution. And that is it’s basically
corresponding to this level of association. So these were SNPs that the
Wellcome Trust group said, “You know, they’re not decisive
in our study, but they are sort of interesting.” And as we explained yesterday
with the 20 — the SNP number 24,000 in the initial
association ending up being one of the top SNPs
in a replication study, they said, you know,
“We certainly would want to look at these as SNPs
of interest, as it were, in a subsequent study.” And then there’s this area here
that has a modest departure from the expected distribution,
and that is probably all of this here. And what those are most likely
are SNPs related to the structure of that population,
so differences in ancestry that probably are just noise,
but one needs to be aware of and adjust for. And then this is the
plot from that study. These are the 39 SNPs that they
found to be strongly associated on chromosome 9, and here’s the
one that I mentioned yesterday, 3049, known by its nickname. So this is the most strongly
associated SNP, and this is just another study that they
did of — a replication study. And here again is that —
what I showed you yesterday, that Boston to Providence
sort of map distances. And you can see that here
you’ve got one linkage block, so maybe these SNPs are
independent of the associations with these,
because you do have a recombination hotspot here. And these are all sort of clues
that you can get as to where the kind of operative or
important SNP might be. Something that I think Tom
mentioned yesterday was the winner’s curse, which is the
tendency for initial studies to overestimate the effect —
the magnitude of an association, and this happens in all
kinds of studies. It was, you know, described
most colorfully, I think, in genetic studies. And it’s shown here for what
was probably the very, very first genome-wide
association study. It was done with only 20,000
SNPs, so we tend not to count it in our lists of studies,
but it was actually Ozaki et al. in a Japanese study
of myocardial infarction looking at lymphotoxin A. And you’ll see in their study
they found a strong — stronger odds ratio than any
of the others found: 1.71, quite significant. And then in subsequent
replication studies, these all sort of fell
back toward essentially no association. When a meta-analysis was done
of all these studies, it was significant — sorry, it was
not significant, .98. And probably this was a
spurious association that looked good because, you know,
when one does lots of these, you end up picking the extremes,
and those are the things that get published. So again, a little publication
bias, a little winner’s curse. Getting into, very briefly,
gene-gene interaction kinds of studies, this is a
genome-wide scan looking for — looking at Alzheimer’s
disease, 861 cases, 600 controls. This is a little bit hard to
see, but you probably can pick out this one red
line going up here. This is on chromosome 19,
and it’s the APOE locus. And it’s very, very, very
strongly associated with Alzheimer’s disease. This was known before
the study was done. It had been identified
in linkage studies, as we talked about
yesterday. And as you can see, the p-value
here is probably about 10 to the minus 40th,
so very, very strong association, but there —
you know, is there something else in here
or isn’t there? What was done by this group then
was to — sorry, there it is — was to then stratify and look
just at the people who were carrying APOE 4, so sort of get
rid of the impact of APOE 4, they actually stratified
people with it and without, and now, of course, vastly
expanded the y-axis. But you can see that there
are a couple that do sort of pop out as being associated
with — particularly here, with Alzheimer’s disease
in the E4 carriers, suggesting that there might
be some interaction, and these were not associated
in the E4 non-carriers and the sizes of the groups were,
you know, reasonable enough that you would expect power
was not an issue in the other groups. So they then looked at the odds
ratio of late onset Alzheimer’s disease associated with
this particular SNP, 3115 in the GG homozygotes
by their E4 status, and found that people who
did not carry any of the E4 alleles had no association
with Alzheimer’s disease with the SNP, and those who were
carriers of the E4 allele had about a — almost a
threefold increased risk if they also carried
this particular — if they also were
homozygotes. These are small numbers and
they don’t give you the numbers, but you can see
the confidence interval is relatively generous. And in looking at everyone
together, there still was an association of this
particular SNP, so all of that sort of coming
together to suggest that there may be some interactions between
this particular variant, which is the GAB2,
I’m not familiar with it, and it’s being looked into
by the Reiman group, in terms of what that
interaction might be. These are — this is a
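Stratum-specific odds ratios like these APOE-stratified ones come from simple 2×2 tables. Here is a sketch with invented counts — not the study’s actual numbers — using a Woolf confidence interval:

```python
import math

def odds_ratio(a, b, c, d):
    """2x2 table: a = exposed cases, b = exposed controls,
    c = unexposed cases, d = unexposed controls.
    Returns the odds ratio and a Woolf (log-scale) 95% CI."""
    or_ = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = or_ * math.exp(-1.96 * se_log)
    hi = or_ * math.exp(1.96 * se_log)
    return or_, (lo, hi)

# Hypothetical counts: risk genotype vs. disease within each APOE stratum.
strata = {"E4 carriers": (60, 30, 40, 60),
          "E4 non-carriers": (50, 50, 50, 50)}
for name, counts in strata.items():
    or_, (lo, hi) = odds_ratio(*counts)
    print(f"{name}: OR = {or_:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

An interaction shows up as stratum odds ratios that differ beyond what their confidence intervals can explain; a formal test would put a product term in a logistic regression model.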
genome-wide scan for age-related macular
degeneration. I think I showed
this yesterday. It was probably the first —
the March 2005 — you know, earliest truly genome-wide
scan at 100,000 or so. They found actually two of
their SNPs that went above their genome-wide
cutoff level. It turned out that this
one was genotyping error, and it went away when they
looked more carefully at the genotype, so you know,
one out of two ending up being genotyping error. One can calculate population
attributable risks based on odds ratios and prevalences
from these studies. That was done here, and you
can see the — here are the associations — these are two
SNPs that were sort of right next to each other; you couldn’t
tell that on the plot, but they sort of calculated
these for both of them; here’s the more strongly
associated SNP. The frequency was quite high,
and the odds ratio for a dominant model was
also quite high. So population attributable
risk, which is essentially a function of the prevalence
and the odds ratio, was very high for the homozygote, and
even for the heterozygote — sorry, for the heterozygote,
the homozygote odds ratio was higher, but obviously
prevalence was lower, so the attributable
risk was lower. So the feeling was that this
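The population attributable risk she describes — a function of prevalence and effect size — is usually computed with Levin’s formula. A sketch with invented numbers, not the published AMD estimates:

```python
def attributable_risk(prevalence, relative_risk):
    """Levin's population attributable risk: the fraction of cases that
    would not occur if the risk factor were absent. In a GWAS context the
    odds ratio is often substituted for the relative risk."""
    excess = prevalence * (relative_risk - 1.0)
    return excess / (excess + 1.0)

# Illustrative only: a common genotype (35% prevalence) with OR 2 yields a
# larger PAR than a rarer genotype (10% prevalence) with a bigger OR of 4.
print(round(attributable_risk(0.35, 2.0), 3))  # 0.259
print(round(attributable_risk(0.10, 4.0), 3))  # 0.231
```

This is exactly the trade-off in the slide: the homozygote odds ratio is higher, but its lower prevalence can leave it with a smaller attributable risk.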
was — this may even be, you know, almost a
single-gene disease. These odds ratios have come down
somewhat in replication studies, so it now looks like
it’s about two or so, but it still accounts
probably for a large amount of this disease. What was interesting about this
variant was that it was then taken forward into the
Nurses’ Health Study, and various environmental
factors that might be interacting with this
very strong risk factor were examined. And these are our data —
Schaumberg et al. published these — these are
data that are available in any cohort study and that could
be looked at for any of these SNPs, you know, if they’ve typed
the SNPs, and it’s work that, you know, is begging to be done,
and really needs to be looked into. So what they did was just to
stratify based on obesity, less than 30 or greater
than 30 BMI, and use the group that had
the — this is actually — I believe this is the variant
allele that’s protective, but it’s the ancestral allele that is
the one that’s the risk allele. But at any rate, this group
being the comparison group, and you can see that as you
carry more copies of the H allele, your risk goes up
whether you’re lean or obese, but your risk goes up
more if you’re obese. And similarly, if you’re
a non-smoker, you have increased risk; if you’re a
smoker you get considerably more increased risk, based on
both smoking and on the prevalence — the carriage
of this allele, so there’s an
interaction here. Neither of these were quite
statistically significant; they didn’t reach a
.05 level, but they were certainly suggestive. And again, this is work
that is not going to be done by geneticists;
it is work that could be done by epidemiologists, and while at times when
we raise — you know, we sort of raise the specter of
“Don’t you think you should be looking for gene-environment
interactions?” Very often the answer we get
back is, “Yes, but there are — you know, there’s multiple
comparison problems.” And think about it, and you say,
“Wait a minute, you did, you know, a million SNPs,
I’m only doing five environmental factors —
or a thousand environmental factors,” it’s still not going
to be nearly as bad. So that’s work that you should
just be aware it needs to be done, and could
readily be done. One of the nicest examples that
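The multiple-comparisons arithmetic in that exchange is easy to make concrete. Assuming a simple Bonferroni correction (my choice of adjustment, not one named in the talk), the environmental testing burden is orders of magnitude lighter:

```python
# Bonferroni-corrected significance thresholds for the two testing burdens
# being compared: a GWAS's million SNPs versus a (generously counted)
# thousand environmental factors.
alpha = 0.05
for label, n_tests in [("1,000,000 SNPs", 1_000_000),
                       ("1,000 environmental factors", 1_000)]:
    print(f"{label}: per-test threshold {alpha / n_tests:.0e}")
```

The SNP correction lands at the familiar 5×10⁻⁸ genome-wide threshold; the environmental one is a thousand times less stringent.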
I’ve seen of gene-environment interaction has to do with some
work that Jose Ordovas did on the hepatic lipase, or LIPC,
genotype in relationship to HDL, and basically these are
smooth plots as you might expect, and what he showed
was a relationship between fat intake and HDL that varied
by the genotype at this particular locus. And so basically,
in the TT genotype, the more fat you eat,
the lower your HDL. In the CC genotype,
the more fat you eat, the higher your HDL. Now these are ecologic data,
they’re not interventional data, so you know, forgive me
for implying causality, but that was what was
inferred, essentially. What’s interesting about this is
say you looked at this middle band, at people in a population
who were eating about 30 percent fat, which is about
the average American diet. You would conclude that there
was no association between LIPC genotype and HDL level,
if that was the question that you were asking. Whereas if you looked at
a population with very low fat intake, such as from
developing nations and that, you would conclude that the
TT genotype has high HDLs, the CC has low HDLs,
and that this is codominant with the CT
being in the middle. If you were then to look at a
population with a very high fat intake such as Scandinavia,
you’d conclude that, no, it’s not the TT that’s
associated with HDL, it’s the CC, and it’s actually a
dominant effect as you can see it here in the heterozygote
and then the variant — I’m sorry, the TT has the
lowest HDL levels. So this is how you can,
you know, tend to miss these kinds of associations or
get inconsistent relationships between genotypes
and phenotypes, and we need to be aware that
this happens all the time, and is rarely
looked for. So gene-environment
interactions can be an explanation for inconsistency
in associations. And I did come across another
example of this. This was endotoxin exposure
and allergic sensitization by CD14 genotype, and,
again, if you looked at the relationship of this
genotype to sensitive — probability of sensitization
to this particular allergen, you’d conclude that there was
no relationship with genotype. If you had a sort of middle
of the range endotoxin load, if you had a low range,
you’d conclude one thing about the CC homozygote,
and if you had a very high range, you’d conclude
the opposite about the CC. So gene-environment interaction
can be quite important, and it’s something we
should be pursuing. So I think I’ll close by just
saying there are challenges in studying gene-environment
interactions, as you might imagine. The genes are actually
pretty easy to measure, that’s why we can measure so
many of them at such low cost. The environment is very, very
difficult to measure, and you know, hopefully
it’s getting easier with more automated measures,
but it’s something that the National Institute of Environmental
Health Sciences is investing a fair amount of money
in, and I know a lot of work has been done here
on that as well. Variability over time,
it’s hard to argue that the DNA sequence varies over
time, whether it’s turned on or turned off. It probably does vary with time,
otherwise we would all keep growing and keep developing and
that, but for the most part, the sequence itself has little-to-no
variability.
has — can often have high variability — oops. And recall bias, there’s
probably none in the genes, it’s certainly possible
in the environment, temporal relationship to
disease, as I mentioned yesterday, is pretty easy for
the genes, it’s kind of hard for the environment. And just to close with being
aware of one’s environment, “I guess I’ll have the ham and
eggs, to the surprise of the chickens and pork here.” So thanks, and I’ll be happy
to answer any questions. I have to say I was delighted to
see that it was sort of raining and gray today, because I felt
so guilty about keeping you all in on a beautiful,
sunny day yesterday, so — yes? [Female Speaker]
When you’re talking about
sample handling differences, where along the chain are you
talking about, anywhere, or — [Dr. Teri Manolio]
Really anywhere, and you know,
some of that may be — you know, it may have to do with
the participants themselves. So samples can be different from
young people versus old people, just in the way your
DNA might be isolated, and that’s certainly in the
transformability of it, and all. But really any step
along the way. And I think there is
investigation going on in many of the laboratories now,
trying to figure out what are the steps that — you know,
what are the things that really perturb
it a lot? But granted that we can’t really
figure out what those steps are, at least look for it and try
to adjust for it if you can, if there are systematic
differences. [Female Speaker]
And then what about —
you mentioned genotyping errors — is that on the
platform itself or — [Dr. Teri Manolio]
It’s probably calling error
more than on the platform. You know, it’s basically that
the chemistry didn’t work well enough to be able to separate
out the intensity of the two alleles. Now, there are other things that
can cause genotyping error. What if you have
three alleles? That would be one. Or if you have a null allele,
you’re not going to have any intensity for that person,
it’s not clear how to count — you know, it’s deleted
basically, so you don’t have that SNP. Or if you have copy
number variants. You may have four or five
or six copies of it, and where does that show
up in your intensity? And there are algorithms being
attempted to sort of come up with those. But most of it is genotyping
error and clustering error. [Male Speaker]
Can we look at a principal
component analysis of one of the [unintelligible]
data you showed there? [Dr. Teri Manolio]
You want to go back? [Male Speaker]
Yeah. [Dr. Teri Manolio]
Okay, I’m really bad at
explaining the principal components, but I’ll
do my best. Sorry — don’t look at the
screen if you tend to have photic seizures. So this one, or — [Male Speaker]
No. [Dr. Teri Manolio]
Before that? [Male Speaker]
Previous — yeah, first and
second, third and fourth. [Dr. Teri Manolio]
First and second, okay, I’ll make these — [Male Speaker]
Okay, that’s fine. [Dr. Teri Manolio]
Okay? [Male Speaker]
So, what kind of variables
were used to construct this principal component,
and what kind of — [Dr. Teri Manolio]
This is allele frequency,
so it would be frequency of — and I believe what the
principal components does is pick out the SNPs that have
the most differences between groups. [Male Speaker]
Does it pick a few alleles
or all of the alleles? [Teri Manolio]
I believe it picks, you know,
whichever alleles are the most different between groups,
and it could be one, but it’s probably
many of them. [Male Speaker]
Okay, all right. Thank you. [Dr. Teri Manolio]
Sure. Yeah, and you know, the nicest
thing about principal components is, because they’re linked,
you know, they’re in LD, you would get — a bunch of them
would sort of come down with one component. [Male Speaker]
Yeah, I guess [unintelligible]
several allele, because here there’s several
principal components, so this — [Dr. Teri Manolio]
Oh, there are many, yeah. Yeah, I mean it can go out to,
you know, 100 principal components or more, and then you
start, you know, separating out sort of clans and towns
and things like that. [Male Speaker]
Okay, thank you. [Dr. Teri Manolio]
Sure. [Female Speaker]
You just mentioned that the
gene-environment interaction can be an explanation for
the different results in the different studies. And of course, through yesterday
and today, over and over it was mentioned that we need
replication studies. So how do you know, then,
if we know that several cohorts [unintelligible] collect
the same environmental factors, so one example, for example,
in the Nurses’ Health Study showed a strong
positive association, and nobody else shows that
[unintelligible] this association was real,
so how do you account for these differences,
because then you cannot make a statement, “Yes, this
is a false association, therefore it’s not –” [Dr. Teri Manolio]
Yeah, no, it’s a
huge problem. I mean, obviously, we can’t
make all of the cohort studies collect the
same information. You’d like to be able to at
least have some comparability between them, or collect some
kind of watered-down version, and the problem then, of course,
is then it’s watered-down, and so are you missing a lot of
the variability within there? I think what tends to happen
when there are inconsistencies is people just dismiss them,
and that I personally think is a mistake, you know,
there may be some very important information there. The problem is, you know,
when you have a million possibilities, you want to at
least separate down — you know, distill down the ones
that you really think are really interesting or
potentially important and try to pursue those. But you have to be,
you know, selective. [Female Speaker]
And this goes back to, then,
would it be necessary? And I know that it’s
already out there, the idea of having this huge
cohort that your institute was thinking about, therefore,
because of these issues, would it be necessary for the
future to have this huge cohort? [Dr. Teri Manolio]
Yeah, we think it’s
really necessary. We don’t see how we’re going
to get sort of the bottom of this without a very large
cohort that is well phenotyped. The large cohorts that are
going on now are not that well phenotyped, and the
environmental characteristics that are collected
are even poorer. But, yeah, we think
it’s necessary.
