15. Gene Regulatory Networks

PROFESSOR: So you’ll
recall last time we were working on
protein-protein interactions. We’re going to do a little
bit to finish that up, with a topic that will be a
good transition to the study of gene regulatory networks. And the precise things we’re
going to discuss today, we’re going to start off
with Bayesian networks of protein-protein
interaction prediction. And then we’re going to get
into gene expression data, at several different levels. We’ll talk about
some basic questions of how to compare
the expression vectors of two genes
using distance metrics. We’ll talk about how to
cluster gene expression data. The idea of
identifying signatures of sets of genes, that
might be predictive of some biological property. For example, a
susceptibility to a disease. And then we’ll talk about
a number of different ways that people have
developed to try to identify gene
regulatory networks. That often goes by
the name of modules. I don’t particularly
like that name. But that’s what you’ll
find in the literature. And we’re going to
focus on a few of these, that have recently been
compared head to head, using both synthetic and real data. And we’ll see some
of the results from that head to
head comparison. So let’s just launch into it. Remember last time we had
started this unit looking at the structural
predictions for proteins. And we started talking
about how to predict protein-protein interactions. Last time we talked about
both computational methods, and also experimental
data, that could give us information about
protein-protein interactions. Ostensibly measuring
direct interactions, but we saw that there were
possibly very, very high error rates. So we needed ways of integrating
lots of different kinds of data in a probabilistic framework so
we could predict for any pair of proteins what’s the
probability that they interact. Not just the fact that
they were detected in one assay or the other. And we started to talk
about Bayesian networks in this context. Both useful as we’ll see
today, for predicting protein-protein
interactions, and also for the gene regulatory
network problem. So the Bayesian
networks are a tool for reasoning probabilistically. That’s the fundamental purpose. And we saw that they consisted
of a graph, the network. And then the probabilities
associated with each edge, the conditional
probability tables. And that we can learn
these from the data, either in a completely
objective way, where we learn both the structure
and the probability. Or where we impose the
structure initially, and then we simply learn
the probability tables. And we had nodes that
represented the variables. They could be
hidden nodes, where we don’t know what the true
answer is, and observed nodes, where we do. So in our case, we’re
trying to predict protein-protein interactions. There’s some hidden
variable that represents whether proteins
A and B truly interact. We don’t know that answer. But we do know whether
that interaction was detected in experiments
one, two, three, or four. Those are the
effects, the observed nodes. And so we want to
reason backwards from the observations,
to the hidden causes. So last time we talked about
the high throughput experiments that directly
measure protein-protein interactions. We talked about yeast two
hybrid and affinity capture mass spec– here
listed as pull-downs. And those could
be used to predict protein-protein
interactions, by themselves. But we want to find out
what other kinds of data we can use to amplify
these results, to give us independent information about
whether two proteins interact. And one thing you
could look at is whether the expression
of the two genes that you think might interact
are similar. So if you look over many,
many different conditions, you might expect
the two proteins that interact with
each other, would be expressed under
similar conditions. Certainly if you saw
two proteins that had exactly opposite
expression patterns, you would be very unlikely to
believe that they interacted. So the question is, how much
is true at the other end of the spectrum? If things are very
highly correlated, do they have a high
probability of interaction? So this graph is a
histogram for proteins that are known to
interact, proteins that were shown in these
high throughput experiments to interact, and proteins that
are known not to interact, of how similar
the expression is. On the far right
are things that have extremely different expression
patterns, a high distance. And we’ll talk
specifically about what distance is in just a minute. But these are very dissimilar
expression patterns. These are very similar ones. So what do you
see from this plot we looked at the last time? We saw that the
interacting proteins are shifted a bit to the left. So the interacting ones
have a higher probability of having similar
expression patterns than the ones that don’t interact. But we couldn’t
draw any cut off, and say everything with this
level of expression similarity is guaranteed to interact. There’s no way to divide these. So this will be useful in
a probabilistic setting. But by itself, it would
not be highly predictive. We also talked about
evolutionary patterns, and we discussed whether the
red or the green patterns here, would be more predictive. And which one was
it, anyone remember? How many people thought the
red was more predictive? How many the green? Right, the greens win. And we talked about the
coevolution in other ways. So the paper that, I
think, was one of the first to do this really
nicely, try to predict protein-protein
interaction patterns using Bayesian networks, is this
one from Mark Gerstein’s lab. And they start off as we
talked about previously, we need some gold
standard interactions, where we know two proteins
really do interact or don’t. They built their gold
standard data set. The positive training
data, they took from a database
called MIPS, which is a hand-curated database
that digs into the literature quite deeply, to find out
whether two proteins interact or not. And then the negative
data they took were proteins that
were identified as being localized to
different parts of the cell. And this was done
in yeast, to where there is pretty good data
for a lot of proteins, to subcellular localization. So these are the data that
went into their prediction. These were the experiments
we’ve already talked about, the affinity capture mass
spec and the yeast two hybrid. And then the other
kinds of data they used were expression correlation,
one just talked about. They also looked at
annotations, whether proteins had the same annotation
for function. And essentiality. So in yeast, it’s
pretty easy to go through every gene
in the genome, knock it out, and determine
whether that kills the cell or not. So they can label
every gene in yeast, as to whether it’s essential
for survival or not. And you can see here, the
number of interactions that were involved. And they decided
to break this down into two separate
prediction problems. So one was an
experimental problem, using the four
different large scale data sets in yeast from
protein-protein interactions, to predict interactions. The other one used
these other kinds of data, that were less direct. And they used slightly different
kinds of Bayesian networks. So for this one, they
used a naive Bayes. And what’s the underlying
assumption of the naive Bayes? The underlying assumption
is that all the data are independent. So we looked at this previously. We discussed how
you could, if you’re trying to identify
the likelihood ratio, and use it to rank things. You primarily need to
focus on this term. Because this term will be the
same for every pair of proteins that you’re examining. Yes? AUDIENCE: Could you state
again whether in a naive Bayes, all data are dependent
or independent? PROFESSOR: Independent. AUDIENCE: OK. PROFESSOR: OK. So let’s actually look
at some of their data. So in this table, they’re
looking at the likelihood ratio that two proteins
interact, based on whether the two
proteins are essential: both essential,
one essential and one not, or both nonessential. So that’s what these
codes here mean. EE, both essential. NN, both nonessential.
NE, one essential and one not. And so they’ve computed
for all those protein pairs, how many in
their gold standard, are EE, how many are
EN, how many are NN? So here are the
numbers for the EE. There are just over 1,000, out
of the 2,000, roughly 2,000 that are EE. So that comes up with a
probability of both being essential, given that I know
that the pair is a positive, that it’s in the gold standard,
of roughly 50%, right? And you can assume something
similar for the negatives. So these are the ones that
definitely don’t interact. So the probability of
both being essential, given that it’s negative,
is about 15%, 14%. And so then the likelihood ratio
comes out to just under four. So there’s a fourfold
increase in probability that something is interacting,
given that it’s essential, than not. And this is the table
for all of the terms, for all of the different things
that they were considering, that were not
direct experiments. So this is the essentiality. This is expression correlation,
with various values for the threshold, how similar
the expression had to be. And these are the terms from
the databases for annotation. And then for each
of these, then we get a likelihood ratio
of how predictive it is. So it’s kind of informative to
look at some of these numbers. We already saw that
essentiality is pretty weakly predictive. The fact
that two genes are both essential only gives you a
slightly increased chance that they’re
interacting than not. But if two things, two genes
have extremely high expression correlation, then they’re
more than a hundredfold more likely to interact than not. And the numbers
for the annotations are significantly
less than that. So this is a naive Bayes. We’re going to multiply all
those probabilities together.
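To make that multiplication concrete, here is a minimal sketch in Python. The feature names and likelihood ratios below are invented for illustration; they are not the values from the Gerstein lab paper.

```python
# Minimal sketch of naive Bayes evidence combination for predicting
# protein-protein interactions. Likelihood ratios are invented for
# illustration, not the published values.
# L(f) = P(feature observed | interact) / P(feature observed | no interaction)
likelihood_ratios = {
    "both_essential": 3.6,
    "high_expression_correlation": 100.0,
    "shared_annotation": 8.0,
}

def combined_likelihood_ratio(observed_features):
    """Under the naive Bayes independence assumption, the combined
    likelihood ratio is just the product of the individual ratios."""
    ratio = 1.0
    for feature in observed_features:
        ratio *= likelihood_ratios[feature]
    return ratio

# A hypothetical pair that is both essential and highly co-expressed:
print(combined_likelihood_ratio(["both_essential",
                                 "high_expression_correlation"]))  # 360.0
```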
Now for the experimental data, they said, well, these are
probably not all independent. The probability that you pick
something up in one two hybrid experiment,
is probably highly correlated with the
probability that you pick it up in another two
hybrid experiment. And one would hope that there’s
some correlation between things identified in two hybrid
and in affinity capture mass spec. Although we’ll see whether
or not that’s the case. So they used what they refer
to as a fully connected Bayes. And what do we mean by that? Remember, this was
the naive Bayes, where everything is independent. So the probability
of some observation is the product of all the
individual probabilities. But in a fully
connected Bayes, we don’t have that
independence assumption. So you need to actually
explicitly compute what the probability
is for an interaction, based on all the possible
outcomes in those experiments. So that’s not that much harder. We simply have a table now,
where these columns represent each of the experimental
data types– the affinity capture mass
spec and the two hybrids. A one indicates that
it was detected, a zero that it was not. And then we simply look
again in our gold standard, and see, for a pair that
had been detected in whatever the setting is here, in all
of them except Ito, how many of the
gold positives we get. And how many of
the gold negatives? And then we can compute
the probabilities. Now it’s important
to look at some of the numbers in these
tables and dig in. Because you’ll see the numbers
here are really, really small. So they have to be
interpreted with caution. So some of the
things that might not hold up with much
larger data sets. You might imagine the things
that are experimentally detected in all of the
high-throughput assays would be the most confident. That doesn’t turn
out to be the case. So these are sorted by the
log of the likelihood ratio, and the best one is not 1, 1, 1. It’s up there. But it’s not the
top of the pack. And that’s probably just the
statistics of small numbers. If the databases were larger,
experiments were larger, it probably would
work out that way. So any question about how
they formulated this problem, as a Bayesian network, or
how they implemented it? OK. So the results then– so once
we have these likelihood ratios, we can try to choose a threshold
for deciding what we’re going to consider to be a
true interaction and not. So here they’ve plotted for
different likelihood ratio thresholds. On the x-axis, how many of the
true positives you get right, versus how many you get wrong. So the true positive
over the false positive. And you can
arbitrarily decide, OK, well I want to be more– I want
to get more right than wrong. Not a bad way to decide things. So your passing
grade here is 50%. So if I draw a line,
a horizontal line, and wanted to get
more right than wrong, you’ll see that any
of the individual signals that they were
using, essentiality, database annotation, and so on–
all of those fall below that. So individually, they predict
more wrongs than rights. But if you combine the data
using this Bayesian network, then you can choose a
likelihood threshold, where you do get more
right than wrong. And you can set your
threshold wherever you want. Similarly for the direct
experimental data, you do better by combining–
these are light pink lines, than you would with any of
the individual data sets. So this shows the utility
of combining the data, and reasoning from the
data probabilistically. Any questions? So we’ll return to
Bayesian networks in a bit in the
context of discovering gene regulatory networks. So we now want to move
to gene expression data. And the primary reason to be so
interested in gene expression data is simply that there’s a
huge amount of it out there. So just a short time ago
we passed the million mark, with the number of
expression data sets that had been collected
in the databases. There’s much less of any other
kind of high throughput data. So if you look at proteomics
or high-throughput genetic screens, there are
tiny numbers, compared to gene expression data. So obviously techniques for
analyzing gene expression data are going to play a very
important role for a long time to come. Some of what I’m
going to discuss today is covered in your textbooks. I encourage you to look
at text section 16.2. The fundamental thing that
we’re interested in doing, is seeing how much
biological knowledge we can infer from the
gene expression data. So we might imagine
that genes that are coexpressed under
particular sets of conditions, have functional
similarity, reflect common regulatory mechanisms,
and our goal then, is to discover those mechanisms. So fundamental to this then, any
time we have a pair of genes– and we look at their
gene expression data– we want to decide
how similar they are. So let’s imagine that we had
these data for four genes. And it’s a time
series experiment. And we’re looking at the
different expression levels. And we want some
quantitative measure to decide which two
genes are most similar. Well, it turns out
it’s a lot more subtle than we might think. So at first glance,
oh, it’s pretty obvious that these two are
the most similar. But it really depends on
what kind of similarity you’re asking about. So we can describe
any expression data set for any gene as simply
a multi-dimensional vector. Where this is the set
of expression values we detected for the
first gene, across all the different experimental
conditions and so on, for the second. And what would be the
most intuitive way of describing the
distance between two multi-dimensional vectors? It would simply be
Euclidean distance, right? So that’s perfectly reasonable. So we can decide that the
distance between two gene expression data sets, is
simply the square root of the sum of the
squares of the distances. So we’ll take the sum over all
the experimental conditions that we’ve looked at. Maybe it’s a time series. Maybe it’s different
perturbations. And look at the difference in
expression of gene A and gene B in that condition, K.
And then evaluating this will tell us how similar two
genes are in their expression profiles.
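As a quick illustration, here is that distance written out in code. This is a minimal sketch; the expression vectors are made up for the example.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two expression vectors: the square root
    of the sum over conditions k of (a_k - b_k) squared."""
    return math.sqrt(sum((ak - bk) ** 2 for ak, bk in zip(a, b)))

# Hypothetical expression values for two genes over five conditions.
gene_a = [2.0, 4.0, 6.0, 4.0, 2.0]
gene_b = [2.5, 4.5, 6.5, 4.5, 2.5]
print(euclidean_distance(gene_a, gene_b))  # ~1.12
```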
Well, that’s a specific example of a distance metric. It turns out that there’s
a formal definition for a distance metric. Distances have the
following properties. They’re always
greater than zero. We never have
negative distances. They are equal to zero under
exactly one condition– the two data points are the same. And they’re symmetric. So the distance from
A to B is the same as the distance from B to A. Now, to be a true
distance, then you also have to satisfy the
triangle inequality, that the distance from
x to z is less than or equal to the sum
of the distances through a third point. But we will find out
that we don’t actually need that for
similarity measures. So we can have either
a true distance metric for comparing gene
expression data sets, or similarity measures as well. So let’s go back to
the simple example. So we decided that the
red and the blue genes were nearly identical, in terms
of their distance metrics. But that’s not always
exactly what we care about. So in biological settings,
frequently the absolute level of gene expression is
on some arbitrary scale. Certainly with
expression arrays, it was completely arbitrary. It had to do with
fluorescence properties, and how well probes
hybridize to each other. But even with mRNA,
how do we really know that 1,000 copies
is fundamentally different from 1,200 copies
of an RNA in the cell? We don’t. So we might be more interested
in distance metrics that capture not just the
similarity of these two. But the fact that these
two are also quite similar, in terms of the trajectory
of the plot to this one. So can we come up with measures
that capture this one as well? A very common one for this
is Pearson correlation. So in Pearson correlation,
we’re gonna look at not just the expression of a
gene across conditions. But we’re gonna look at
the z-score of that gene. So we’ll take all
of the data for all of the genes in a
particular condition. And we’ll compute the
z-score by looking at the difference
between the expression of a particular gene and
the average expression across the whole data set. And we’re going to normalize
it by the standard deviation. Yes? AUDIENCE: [INAUDIBLE]
square there? PROFESSOR: Yes, you’re right. There should be a square there. Thank you. So then to compute the
Pearson correlation between
two genes, A and B, we’re going to take the
sum over all experiments of the product of the z-score for A
and the z-score for B.
And these values, as
we’ll see in a second, are going to range
from plus 1, which would be a perfect
correlation, to minus 1, which would be a perfect
anti-correlation. And then we’re going to find the
distance is 1 minus this value.
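Here is a minimal sketch of that calculation, with made-up vectors. The z-scores here are computed for each gene across the experiments, which is the standard Pearson normalization.

```python
import math

def z_scores(values):
    """Subtract the mean and divide by the standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def pearson_distance(a, b):
    """Pearson r is the average product of z-scores; distance = 1 - r,
    so 0 means perfectly correlated and 2 means perfectly anti-correlated."""
    za, zb = z_scores(a), z_scores(b)
    r = sum(x * y for x, y in zip(za, zb)) / len(a)
    return 1.0 - r

# gene_c tracks gene_a's shape, but at a different absolute scale.
gene_a = [2.0, 4.0, 6.0, 4.0, 2.0]
gene_c = [10.0, 20.0, 30.0, 20.0, 10.0]
print(pearson_distance(gene_a, gene_c))  # ~0.0
```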
So things that are perfectly correlated, then, would have a distance of zero. And things that
are anti-correlated would have a large one. So if we take a
look at these two obviously by Euclidean
distance, they’d be quite different
from each other. But the z-scores have
converted the expression values into z-scores over
here, you can see that the z-scores
obviously, this one is the most negative
of all of the ones. And this as the lowest
one in all of these. This one’s the highest. And similarly for the red
one, lowest to the highest. So the z-scores track very well. And when I take the
product of this, the signs of the z-score for
A and B are always the same. So I summed the product of the
z-scores, I get a large number. And then the
normalization guarantees that it comes out to one. And so the red
and blue here will have a very high
correlation coefficient. In this case, it’s going
to be an r correlation coefficient of 1. Whereas compared to this one,
which is relatively flat, the correlation coefficient
will be approximately zero. Any questions on that? So what about, say
the blue and the red? Well, their z-scores
are going to have almost the opposite
sign every single time. And so that’s going to add
up to a large negative value. So for these, they’ll be
highly anti-correlated. So A, the blue and the red,
have a correlation coefficient of minus 1. OK. So we have these
two different ways of computing distance measures. We can compute the
Euclidean distance, which would make the
red and blue the same, but treat the green one as
being completely different. Or we have the
correlation, which would group all of these
together, as being similar. What you want to do is going
to depend on your setting. If you look in your
textbook, you’ll see a lot of other definitions
of distance as well. Now what if you’re missing
a particular data point? This used to be a lot more
of a problem with arrays than it is with RNA-seq.
With arrays, you’d often have dirt on the array, that
it actually would literally cover up spots. But you have a bunch of choices. The most extreme would
just be to ignore that row or column of your
matrix across old data sets. That’s usually not
what we want to do. You could put in some
arbitrary small value. But frequently we will do what’s
called imputing, where we’ll try to identify the
genes that have the most similar expression, and replace
the value for the missing one with a value from
the ones that we do know. Distance metrics,
pretty straightforward. Now we want to
use these distance metrics to actually
cluster the data. And what’s the idea here? That if we look across
enough data sets, we might find certain groups of
genes that function similarly across all those
data sets, that might be revealing as to their
biological function. So this is an example of an
unsupervised learning problem. We don’t know what the
classes are, before we go in. We don’t even know
how many there are. We want to learn from the data. This is a very large
area of machine learning. We’re just gonna
scrape the surface. Some of you may be
familiar with the fact that these kinds of
machine learning algorithms are used widely
outside of biology. They’re used by
Netflix to tell you what movie to choose next. Or Amazon, to try to
sell you new products. And all the advertisers who send
pop-up ads on your computer. But in our biological
setting then, we have our gene expression
data, collected possibly over very large
numbers of conditions. And we want to find
groups of genes that have some similarity. This is a figure from one of
these very early papers, that sort of established how
people present these data. So you’ll almost always see
the same kind of presentation. Typically you’ll get a heat
map, where genes are rows. And the different
experiments, here time points, but they could be different
perturbations, are the columns. And genes that go up
in expression are red, and genes that go down in
expression are green. And apologies to anyone
who’s colorblind. But that’s just what the
convention has become. OK, so then why cluster? So if we cluster
across the rows, then we’ll get sets of genes
that potentially behave– that hopefully if
we do this properly, behave similarly across
different subsets of the experiments. And those might represent
similar functions. And if we cluster
the columns, then we get different experiments
that show similar responses. So that might be in this
case, different times that are similar. Hopefully those are ones
that are close to each other. But if we have lots
of different patients, as we’ll see in a
second, they might represent patients who have a
similar version of a disease. And in fact, the clustering
of genes does work. So even in this
very early paper, they were able to identify a
bunch of subsets of genes that showed similar expression
at different time points, and turned out to be enriched
in different categories. These ones were enriched in
cholesterol biosynthesis, whereas these were enriched
in wound healing, and so on. So how do you actually
do clustering? This kind of clustering
is called hierarchical. That’s pretty straightforward. There are two versions of
hierarchical clustering. There’s what’s called
agglomerative and divisive. In agglomerative, you start
off with each data point in its own cluster. And then you search for the
most similar data point to it, and you group those together. And you keep doing
that iteratively, building up larger
and larger clusters. So we’ve discussed how to
compare our individual genes. But you should be
able to, right now, to find, if I gave you
the vector of expression for a single gene, to find
the other genes in the data set that’s most similar,
by either say, Euclidean or Pearson correlation,
or what have you. But once you’ve grouped
two genes together, how do you decide
whether a third gene is similar to those two? So now we have to
make some choices. And so there are number
of different choices that are commonly made. So let’s say these are our data. We’ve got these two
clusters, Y and Z. And each circle represents a
data point in those clusters. So we’ve got four
genes in each cluster. Now we want to decide
on a distance measure to compare cluster Y to
cluster Z. So what could we do? So what are some possibilities? What might you do? AUDIENCE: We could take
the average of all points. PROFESSOR: You could take the
average of all points, right. What else could you do? Only a limited number
of possibilities. AUDIENCE: Centroid? PROFESSOR: Yeah,
so centroid, you could take some sort
of average, right. Any other possibilities? AUDIENCE: You can
pick a representative from each set [INAUDIBLE]. PROFESSOR: So you could pick
a representative, right? How would you decide in advance
what that would be though? So maybe you have
a way, maybe not. And what other
possibilities are there? Yeah? AUDIENCE: Measure all
the distances [INAUDIBLE] to all the nodes in the other. PROFESSOR: Right. So you could do all to all. What else could you do? You can take the minimum
of all those values. You can take the maximum
of all those values. And we’ll see that all those
are things that people do. So this clustering,
there are already rather uninformative
terms for some of these kinds of decisions. So it’s called
single linkage when you decide that the distance
between two clusters is based on the minimum distance
between any member of cluster Y and any member of cluster Z. Complete linkage takes
the maximum distance. And then the extremely
unfortunately named Unweighted Pair Group Method
using Centroids– UPGMC, I won’t try to say
that very often– takes the centroid, which was an
early suggestion from the class. And then the UPGMA,
Unweighted Pair Group Method with Arithmetic Mean,
takes the average of all the distances,
all suggestions that people have made.
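To make those definitions concrete, here is a minimal sketch of the between-cluster distances, using made-up points and plain Euclidean distance.

```python
import math
from itertools import product

def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(Y, Z):
    """Minimum distance between any member of Y and any member of Z."""
    return min(dist(y, z) for y, z in product(Y, Z))

def complete_linkage(Y, Z):
    """Maximum distance between any member of Y and any member of Z."""
    return max(dist(y, z) for y, z in product(Y, Z))

def average_linkage(Y, Z):
    """UPGMA: the average of all pairwise distances between the clusters."""
    return sum(dist(y, z) for y, z in product(Y, Z)) / (len(Y) * len(Z))

# Two hypothetical clusters (2-D points for readability).
Y = [(0.0, 0.0), (1.0, 0.0)]
Z = [(3.0, 0.0), (6.0, 0.0)]
print(single_linkage(Y, Z), complete_linkage(Y, Z), average_linkage(Y, Z))
# 2.0 6.0 4.0
```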
So when would you use one versus the other? Well, a priori, you
don’t necessarily know. But it’s good to know
how they’ll behave. So what do you imagine
is going to happen if you use single linkage,
versus complete linkage? Remember, single linkage
is the minimum distance. And complete linkage is
the maximum distance. So what’s going to
happen in this case, if I use the minimum distance. Which two groups will I combine? AUDIENCE: The blue and the red. PROFESSOR: The blue
and the red, right? Whereas if I use the
maximum distance, then I’ll combine the
green and the red. So it’s important
to recognize, then, that the single linkage has this
property of chaining together clusters, based on points
that are near each other. Whereas the complete linkage
is resistant to grouping things together, if they have outliers. So they’ll behave differently. Now, if your data are
compact, and you really do have tight clusters,
it’s not going to matter too much
would you use. But in most biological settings,
we’re dealing with much noisier data. So you actually will
get different results based on this. And as far as I know, there’s
no really principled way to figure out, if you have no
prior knowledge, which to use. Now all of these
hierarchical clustering come with what’s
called a dendrogram. And you’ll see these at the
top of all the clustering. And this represents
the process by which the data were clustered. So the things that
are most similar are most tightly connected
in this dendrogram. So these two data
points, one and two, you have to go up very little in the
y-axis, to get from one to two. Whereas if you want
to go from one to 16, you have to traverse
the entire dendrogram. So the distance
between two samples is how far vertically you have
to go to connect between them. Now the good thing
about the dendrogram is that you can then understand
the clustering of the data. So I can cut this dendrogram
at any particular distance, and get clear divisions
among my data sets. So if I cut here at
this distance level, then I have two groups. One small one,
consisting of these data. And one large one,
consisting of these. Whereas if I cut down here, I
have more groups of my data. So it doesn’t require me in
advance to know how many groups I have. I can look at the
dendrogram and infer it. The one risk is that you
always get a dendrogram that’s hierarchical, regardless of
whether the data were hierarchical or not. So it’s more a reflection of
how you did your clustering than any fundamental
structure of the data. So the fact that you get
a hierarchical dendrogram means really nothing
about your data. It’s simply a tool
that you can use to try to divide it up
into different groups. Any questions on the
hierarchical clustering? Yes? AUDIENCE: If each data
point is its own cluster, then won’t that be consistent
across, like, single linkage, complete linkage– like,
why would you cluster? Does that question make sense? Like if you cut it down below,
then haven’t you minimized– don’t you successively
minimize the variance, I guess, up to your clusters, by– PROFESSOR: So if I cut
it at the lowest level, everybody is their own cluster. That’s true. Right. I’m interested in
finding out whether there are genes that behave
similarly across the data sets. Or– AUDIENCE: My question
is, how would you go about determining how
many clusters your want? PROFESSOR: Oh, OK. So we’ll come to
that in a second. So hierarchical clustering,
you don’t actually have any objective
way of doing that. But we’ll talk about
other means right now, where it’s a little bit clearer. But actually
fundamentally, there aren’t a lot of good ways
of knowing a priori what the right number of clusters is. But we’ll look at some
measures in a second that help. So hierarchical clustering,
as your question implies, doesn’t really tell you how
many clusters there are. Another approach is
to decide in advance how many clusters you expect. And then see whether you can
get the data of the group into that number or not. And an example of
that is something called k-means clustering. So the nice thing
about it, is it does give you the
sharp divisions. But again if you
chose k incorrectly, we’ll see in a
second, you will get– you’ll never less
still get K-clusters. So K refers the
number of clusters that you tell the algorithm
you expect to get. So you specify that in advance. And then you try to
find a set of clusters that minimizes the distance. So everybody’s assigned
to a particular cluster, and the center of that cluster. Is that clear? So that’s what these
equations represent. So the center of the
cluster, the centroid, is just the average coordinates,
over all the components of that cluster. And we’re trying to find
this set of clusters, C, that minimizes the sum of
the square of the distances between each member of that
cluster and the centroid. Any questions on how
we’re doing this? OK. All right. So what’s the actual algorithm? That’s remarkably simple. I’m choosing that initial
set of random positions. And then I have the simple loop,
I repeat until convergence. For every point, I assign
it to the nearest centroid. So if my starting
centroids would be circles, I look at every data
point, and I ask, how close is it to
any one of these? That’s what the boundaries
are, defined by these lines. So everything above this
line belongs to the centroid. Everything over here
belongs to this centroid. So I divide the data up by which
centroid you are closest to. And I assign you
to that centroid. That’s step one. And step two, I
compute new centroids. And that’s what these
triangles represent. So after I did
that partitioning, it turns out that most
of the things that were assigned to the triangular
cluster live over here. So the centroid moves
from being here to here. And I iterate this process. That’s the entire K-means
clustering algorithm.
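Written out, the loop really is only a few lines. This is a minimal sketch with made-up one-dimensional data and hand-picked starting centroids, not an optimized implementation.

```python
def kmeans(points, centroids, n_iter=100):
    """Plain K-means: repeatedly (1) assign each point to its nearest
    centroid and (2) recompute each centroid as the mean of its points."""
    for _ in range(n_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:  # assignments stopped changing
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 1-D data drawn from around three centers (1, 5, and 9).
data = [0.9, 1.1, 1.0, 4.8, 5.2, 5.0, 8.9, 9.1, 9.0]
print(kmeans(data, centroids=[0.0, 4.0, 10.0]))
```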
So here’s an example where I generated data from three Gaussians.
I chose initial data points, which are the circles. I follow that protocol. Here’s the first step. It computes new triangles. Second step, and
then it converges. The distance stops changing. Now this question’s
already come up. So what happens if you
choose the wrong K? So I believe there
are three clusters. And really that’s not the case. So what’s going to happen? So in this data set,
there really were. How many? There really
were five clusters. Here, they’re
clustered correctly. What if I told the
algorithm to do K-means clustering
with a K of three? It would still find a way to
come up with three clusters. So now it’s grouped these
two things, which are clearly generated from different
Gaussians together. It’s grouped these two,
which were generated from different Gaussians
together, and so on. All right. So K-means clustering
will do what you tell it to do, regardless of
whether that’s the right answer or not. And if you tell it there are
more clusters than you expect– than really are
there, then it’ll start chopping up well-defined
clusters into sub-clusters. So here it split this elongated
one into two sub-clusters. It split this one
arbitrarily into two. Just so it gets the final
number that we asked for. Then how do you know what to do? Well, as I said, you don’t–
there’s no guarantee to know. But one thing you can
do is make this kind of plot, which shows for
different values of K on the x-axis, the sum of the
distances within the cluster. So the distance to the
centroid within each cluster on the y-axis. And as I increase the number
of K’s, when I’m correctly [? purchasing ?] my
data, when there really are more subgroups than
I’ve already defined, then I’ll see big drops. So I go from saying there are
two to three in that case. I get a big drop in the distance
between members of the cluster. Because I’m no longer including
a data point over here. And in this cluster, with a
data point in that cluster. But once I go beyond the
correct number, which was five, you see that the benefits
really start to trail off. So there’s an
inflection point here. There’s an elbow– sometimes
it’s called an elbow plot. After I go past
the right number, I get less and less benefit
from each additional clustering. So this gives us
an empirical way of choosing approximately
a correct value for K.
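In code, the elbow heuristic just means running the clustering for several values of K and looking at the within-cluster sum of squares. A minimal sketch, reusing the kmeans function and the made-up data from the sketch above:

```python
def within_cluster_ss(centroids, clusters):
    """Sum over clusters of squared distances from each point to its centroid."""
    return sum((p - c) ** 2
               for c, cluster in zip(centroids, clusters)
               for p in cluster)

# Run K-means for K = 1..6 (uses kmeans() and data from the earlier sketch)
# and look for the elbow where the within-cluster sum of squares stops
# dropping sharply.
for k in range(1, 7):
    start = [min(data) + i * (max(data) - min(data)) / max(k - 1, 1)
             for i in range(k)]
    centroids, clusters = kmeans(data, start)
    print(k, round(within_cluster_ss(centroids, clusters), 2))
```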
Any questions on K-means? Yes? AUDIENCE: Does K-means
recapitulate the clusters that you would
get if you cut off your dendrogram from hierarchical
clustering at a certain level? PROFESSOR: Not necessarily. AUDIENCE: OK. But maybe. I don’t know. It sort of seems to me
as if you picked a level where you have a certain
number of clusters, that that’s similar, at least by
centroid, by using the center? PROFESSOR: Yeah, I think because
of the way that you do it, you’re not even guaranteed
to have a level, where you have exactly the
right– other questions? Yes? AUDIENCE: Could you
just very quickly go over how you
initialized where the starting points
are, and the break ups? PROFESSOR: All right,
so the question is how do you initialize
the starting points? In fact, you have to make some
arbitrary decisions about how the initialize the
starting points. So they’re usually
chose in a random. And you will get
different results, depending on how you do that. So that’s another–
so when you do it, it’s non-deterministic
in that sense. And you often want to
initialize multiple times. And make sure you
get similar results. Very good question. And in fact, that
was not a set up. But what happens if you
choose pathologically bad initial conditions? So you have the potential to
converge to the right answer. But you’re not guaranteed to
converge to the right answer. So here’s an example where
I had– I guess there really are three clusters in the data. I chose [INAUDIBLE]
three, but I stuck all my initial coordinates down
in the lower right-hand corner. And then when I do the
clustering, if things go well, I get the right answer. But we’re not guaranteed. But one thing we are guaranteed,
is we always get convergence. So the algorithm will converge. Because at each
step, it’s either reducing the objective function,
or it’s leaving it the same. So we’re guaranteed convergence. But it may be as we’ve seen
previously in other settings, we may end up with local
minimum, rather than the global optimum. And the way to fix
that then would be to initialize again,
with new starting positions. Other questions? What about a setting like this? Where we’ve got two
well-defined clusters, and somebody who lives
straight in the middle. So what’s the
algorithm going to do? Well, sometimes it’ll put it
in one cluster. And sometimes it’ll end
up in the other. So an alternative to
K-means clustering, which has to make one or the
other arbitrary decision, is something that’s called
fuzzy K-means, which can give something,
quite literally, membership in both clusters. And it’s very similar in
structure to the K-means, with one important
difference, which is a membership variable,
that tells you for every data point, how much it belongs
to the cluster one, cluster two, cluster three, and so on. So in both algorithms,
we start off by choosing initial
points as the cluster means, and looping through
each of them. Now previously, we would make
a hard assignment of each data point x sub i to
a single cluster. And here we’re going to
calculate the probability that each data point
belongs to a cluster. And that’s where you
get the fuzziness, because you could have a non
unit, or nonzero, probability of belonging to any
of the clusters. Now, in K-means,
we recalculated the mean value, by just looking at the average
of everybody in that cluster. Now in fuzzy K-means, we don’t
have everybody in the cluster. Because everybody belongs
partially to the cluster. So we’re going to take
a weighted average. So here are the details
of how you do that. In K-means, we are
minimizing this function. We were trying to decide the
class structure, the class memberships, that would minimize
the distance of every member of that cluster, to the defined
centroid of that cluster. Here it looks almost the same. Except we now have
this new variable, mu, which is the membership. It’s the membership of
point j, in cluster i. So I’m trying to minimize
a very similar function. But now if mu is one–
if all my mus are one, then what do I get? K-means, right? But as soon as the mus are
allowed to vary from one, they can be between
zero and one, then points can
contribute more or less. So that point there was stuck in
the middle of the two clusters, if it had a mu of
0.5 for each, it would contribute half to each. And then both the
centroids would move a little bit
towards the middle.
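Here is a minimal sketch of those two steps. The lecture does not spell out the exact update rule, so this uses the common fuzzy c-means formulas with fuzzifier m = 2 and made-up one-dimensional data; treat it as one standard choice rather than the method on the slide.

```python
def memberships(points, centroids, m=2.0):
    """mu[j][i]: how much point j belongs to cluster i (each row sums to 1).
    Standard fuzzy c-means rule with fuzzifier m."""
    mus = []
    for p in points:
        d = [max(abs(p - c), 1e-12) for c in centroids]  # avoid divide-by-zero
        row = [1.0 / sum((d[i] / d[k]) ** (2.0 / (m - 1.0))
                         for k in range(len(centroids)))
               for i in range(len(centroids))]
        mus.append(row)
    return mus

def update_centroids(points, mus, m=2.0):
    """Each centroid is a weighted average, weighted by membership**m."""
    n_clusters = len(mus[0])
    return [sum(mu[i] ** m * p for p, mu in zip(points, mus)) /
            sum(mu[i] ** m for mu in mus)
            for i in range(n_clusters)]

points = [1.0, 1.2, 5.0, 5.2, 3.1]  # 3.1 sits between the two groups
centroids = [1.0, 5.0]
for _ in range(10):
    mus = memberships(points, centroids)
    centroids = update_centroids(points, mus)
print(centroids, [round(mu[0], 2) for mu in mus])
```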
So what’s the result of K-means– I’m sorry, fuzzy
K-means clustering? We still get K clusters. But now every gene or every
object that we’re clustering has a partial membership. So here’s an example
of that, where they did fuzzy K-means
clustering, with these six different clusters. But now every
profile, every gene, has a color associated with it,
that represents this mu value. It goes from zero to
one, with these rainbow colors, to the things that
are reddish, or pink– those are the high
confidence things that are very strongly,
only in that cluster. Whereas the things that are
more towards the yellow end of the spectrum are
partially in this cluster and partially in other clusters. Questions? Any questions? So K-means, we’ve defined in
terms of Euclidean distance. And that has clear advantages,
in terms of computing things very easily. But it has some
disadvantages as well. So one of the disadvantages
is because we’re using the squared
distance, then outliers have a very big effect. Because I’m squaring the
difference between vectors. That may not be the worst thing. But they also
restrict us to things for which we can
compute a centroid. We have to have data
for which you can actually
compute the mean value of all members of the cluster. Sometimes you want
to cluster things that we only have
qualitative data. Where instead of having
a distance measure, we have similarity. This doesn’t come up
quite as often in– well, it certainly doesn’t come
up in gene expression data or RNA-seq. But you can
imagine more qualitative data, where you ask people
about similarity between different things
or behavioral features, where you know the similarity
between two objects. But you have no way of
calculating the average object. One setting that you might
[INAUDIBLE] have looked at– if you’re trying to
cluster say, sequence motifs that you’ve computed
with the EM algorithm. So what’s the average
sequence motif? That doesn’t necessarily
represent any true object, right? You might be better off–
you can calculate it. But it doesn’t mean anything. You might be better
off calculating using rather than the
average motif, the most central of the motifs that
you actually observed. So that would be called
a medoid, or an exemplar. It’s a member of your cluster
that’s closest to the middle, even if it’s not
smack dab in the middle. So instead of K-means, we can
just think, well, K-medoids. So in K-means, we actually
computed a centroid. And in medoids, we’ll
choose the existing data point that’s most central. So what does that mean? If these are my data, the true
mean is somewhere over here. But this one is the medoid. It’s an exemplar that’s
close to the central point. But if there actually
isn’t anything here, then there isn’t. So we’re going to use
the thing that’s closest. So if these were
all sequence motifs, rather than using some
sequence motif that doesn’t exist as the
center of your cluster, you would use a sequence motif
that actually does exist, and it’s close to the center. So it’s a simple
variation on the K-means. Instead choosing K
points in arbitrary space as our starting
positions, we’re going to choose K examples from the
data as our starting medoids. And then we’re going to place
each point in the cluster that has the closest medoid,
rather than median. And then when we
do the update step, instead of choosing the
average position to represent the cluster, we’ll
choose the medoid. The exemplar that’s
closest to the middle.
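As a minimal sketch of that update step, assume all we have is a pairwise distance function between objects; the small distance matrix here is made up.

```python
def medoid(members, dist):
    """The medoid of a cluster: the member whose total distance
    to the other members is smallest."""
    return min(members, key=lambda i: sum(dist[i][j] for j in members))

# Hypothetical symmetric distance matrix over five objects (e.g. motifs),
# where no meaningful "average object" exists.
dist = [
    [0.0, 1.0, 2.0, 6.0, 7.0],
    [1.0, 0.0, 1.5, 6.5, 7.5],
    [2.0, 1.5, 0.0, 5.0, 6.0],
    [6.0, 6.5, 5.0, 0.0, 1.0],
    [7.0, 7.5, 6.0, 1.0, 0.0],
]
print(medoid([0, 1, 2], dist))  # object 1 is the most central of {0, 1, 2}
```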
Any questions on this? Yes? AUDIENCE: So if you use the medoid, do you lose the
guaranteed convergence? Because I can
picture a situation where you’re sort of
oscillating because now you have a discrete stack. PROFESSOR: That’s
a good question. That’s probably right. Actually, I should
think about that. I’m not sure. Yeah, that’s probably right. Other questions? OK. There are a lot of other
techniques for clustering. Your textbook talks about
self organizing maps, which were popular at
one point quite a lot. And there’s also
a nice technique called affinity
propagation, which is a little bit outside
the scope of this course, but has proved quite
useful for clustering. OK. So why bother to do
all this clustering? Our goal is to try to find
some biological information, not just to find
groups of genes. So what can you do
with these things? Well, one thing that
was identified early on, is if I could find sets of genes
that behave similarly, maybe those could be used
in a predictive way, to predict outcomes
for patients, or some biological function. So we’re going to
look at that first. So one of the early
papers in this field did clustering of
microarrays for patients who had B-cell lymphoma. The patients had different
kinds of B-cell lymphomas. And so they took their
data, they clustered it. Again, each row
represents a gene. And each column
represents a patient here. And with this projector, it’s
a little bit hard to see. But when you look at
the notes separately, you’ll be able see
that in the dendogram, there’s a nice, sharp
division between two large groups of patients. And it turns out that when
you look at the pathologist’s annotations for these
patients, which was completely independent of the
gene expression data, all of patients in
the left hand group– almost all the patients
in the left hand group, had one kind of lymphoma. And all the patients
in the right hand group had a different
kind of lymphoma. And this got people
very excited. Because it suggested that
the pure molecular features might be at least as good
as pathological studies. So maybe you could completely
automate the identification of different tumor types. Now the next thing that got
people even more excited, was the idea that maybe
you could actually use these patterns not
just to recapitulate what a pathologist would
find, but go beyond it, and actually make predictions
from the patients. So in these plots–
I don’t know if we’ve seen these before
yet in the class. But on the x-axis is survival. In the y-axis are the
fraction of patients in a particular group,
who survived that long. So as the patient’s
die, obviously the curve is dropping down. Each one of these
drops represents the death of a patient,
or the loss of the patient to the study for other reasons. And so in the middle,
let’s start with this one. This is what the clinicians
would have decided. Here are patients
that they defined by clinical standards as
being likely to do well, versus patients whom they
defined by clinical standards, as likely to do poorly. And you could see there
is a big difference in the plots for the
low clinical risk patients at the top, and
the high clinical risk patients at the bottom. On the left hand
side, or what you get when you use purely
gene expression data to cluster the patients into
groups that you turn out to be high risk or low risk. And you can see that
it’s a little bit more statistically significant
for the clinical risk. But it’s pretty
good over here, too. Now the really
impressive thing is, what if you take the
patients that the clinicians define as low clinical risk? And then you look at their
gene expression data. Could you separate
out the patients in that allegedly
low clinical risk who are actually at high risk? And maybe then they
would be diverted to have more aggressive
therapy than patients who really and truly
are low risk patients. And what they were able to
show, with just barely statistical significance,
is that even among the clinically
defined low risk patients, there is– based on
these gene signatures– the ability to
distinguish patients who are going to do
better, and patients who are going to do worse. So this was over a decade ago. And it really set off a
frenzy of people looking for gene signatures for
all sorts of things, that might be highly predictive. Now the fact that
something is correlated, doesn’t of course
prove any causality. So one of the questions is, if
I find a gene signature that is predictive of an outcome
in one of the studies, can I use it then to go
backwards, and actually define a therapy? In the ideal setting, I would
have these gene signatures. I’d discover that
they are clinically associated with outcome. I could dig in and
discover what makes the patients who do worse, do worse. And go and treat that. So is that the case or not? So let me show you some data
from a breast cancer data set. Here’s a breast cancer data set. Again the same kind of plot,
where we’ve got the survival statistic on the y-axis, the
number of years on the x-axis. And based on a gene
signature, this group has defined a group
that does better, and a group that does worse,
the p value is significant. And it has a ratio, the
death rate versus control is approximately two. OK. So does this lead us to
any mechanistic insight into breast cancer? Well, it turns out in this
case, the gene signature was defined based on
postprandial laughter. So after dinner humor. Here’s a gene set
that defined something that has absolutely nothing
to do with breast cancer, and it’s predicting the outcome
of breast cancer patients. Which leads to
something of a joke about testing
whether laughter really is the best medicine. OK. So they went on– they
tried other genes sets. Here’s the data
set– gene set that’s not even defined in humans. It’s the homologs of
genes that are associated with social defeat in mice. And once again, you get a
statistically significant p-value, and good hazard ratios. So what’s going on? Well, these are not from
a study that’s actually trying to predict an
outcome in breast cancer. It’s a study that shows
that most gene expression– most randomly selected
sets of genes in the genome will give an outcome that’s
correlated– a result that’s correlated with a patient
outcome in breast cancer. Yes? AUDIENCE: I’m a little confused. In the previous graph, could you
just explain what is the black and what is the red? Is that individuals or groups? PROFESSOR: So the
black are people that have the genes
set signature, who have high levels
of the genes that are defined in this gene set. And the red are ones have
low, or the other way around. But it’s defining all patients
into two groups, based on whether they have a
particular level of expression in this gene set,
and then following those patients over time. Do they do better or worse? And similarly for
all these plots. And he had another one which is
a little less amusing, location of skin fibroblasts. The real critical point is this. Here, they compared
the probability based on expectation
an that all genes are independent of each
other, the probability that that gene signatures
correlated with outcome, for genes there were chosen
at random or genes that were chosen from a database
of gene signatures, that people have identified as
being associated with pathways. And you get a very,
very large fraction. So this is the p-value. So negative log of
p-value, so negative values are more significant. A huge fraction
of all genes sets that you pull at
random from the genome, or that you pull from a
compendium of known pathways, are going to be
associated with outcome, in this breast cancer data set. So it’s not just well
annotated cancer pathways that are associated. It’s gene sets
associated as we’ve seen, with laughter or
social defeat in mice, and so on– all sorts
of crazy things, that have no mechanistic
link to breast cancer. Let’s take a second
for that to sink in. I pull genes at random
from the genome. I define patients
based on whether they have high levels of expression
of a random set of genes, or low levels of expression
of that random set of genes. And I’m extremely likely
to be able to predict the outcome in breast cancer. So that should be rather
disturbing, right? So it turns out– before
we get to the answer then– so this is not
unique to breast cancer. They went through a whole bunch
of data sets in the literature. Each row is a different
previously published study, where someone had claimed
to identify a signature for a particular kind
of disease or outcome. And they took their
random gene sets and asked how well the random
genes sets did in predicting the outcome
in these patients? And so these yellow
plots represent the probability distribution
for the random gene sets– again on
this projector, it’s hard to see– but there’s a
highlight in the left hand side at where the 5%, the best
5% of the random gene sets are. This blue line is
the near measure of statistical significance. It turns out that a
few of these studies didn’t even reach a normal level
of statistical significance, let alone comparing
to random gene sets. But for most of
these, you don’t do better than a good fraction
of the randomly selected gene sets. So how could this be? So it turns out there is an
answer to why this happens. And it’s really
quite fascinating. So here, we’re using
the hazard ratio, which is the death rate
for the patients who have the signature,
over the control group. So high hazard ratio means
it’s a very, very dissociative outcome. And they’ve plotted that against
the correlation of the genes in the gene signature, with
the expression of a gene called PCNA, Proliferating
Cell Nuclear Antigen And it turns out a very, very
large fraction of the genome is coexpressed. So genes are not expressed like
random, completely independent random variables. There are lots of genes that
show very similar expression levels, across
all the data sets. Now PCNA is a gene that’s
been known by pathologists for a long time,
as having higher levels than most
digressive tumors. So a very, very large
fraction of the genome is coexpressed with PCNA. Then high levels of
randomly selected genes are going to be a very good
predictor of tumor outcome. Because high levels of
randomly expressed genes also means a very
high probability of having a high level PCNA,
which is a tumor marker. So we have to proceed
with a lot of caution. We can find things that are
highly correlated with outcome, that could have good value in
terms of prognostic indicators. But there are going to
be a lot of possibilities for sets of genes that
have that property, they’re good
predictors of outcome. And many of them will
have absolutely nothing to causally, with the
process of the disease. So at the very least, it means
don’t start a drug company over every set of genes,
if you identify this as associated with outcome. But the worst case
scenario, it also means that those predictions
will break down under settings that we haven’t yet examined. And so that’s the
real fear, that you have a gene set
signature that you think has a highly predictive outcome. It’s only because you looked at
a particular set of patients. But you look at a
different set of patients, and that correlation
will break down. So this is an area of research
that’s still quite in flux, in terms of how
much utility there will be in identifying
genes set signatures, in this completely
objective way. And what we’ll see in the
course of this lecture and the next one,
is it’s probably going to be much more
useful to incorporate other kinds of information
that will constrain us to be more mechanistic. Any questions? All right. So now we’re going to
really get into the meat of the identification
of gene modules. And we’re going to try to
see how much we can learn about regulatory structure
from the gene expression data. So we’re going to move up from
just the pure expression data– say these genes at the
bottom, to try to figure out what set of transcription
factors we’re driving, and maybe what signaling
pathways lived upstream in those transcription
factors, and turn them on. And the fundamental difference
then between clustering– which is what we’ve been looking in
until now, and these modules, as people like to call
them– is that you can have a whole bunch
of genes, and we’ve just seen that, that are
correlated with each other, without being causally
linked to each other. So we like to figure out which
ones are actually functionally related, and not just
statistically related. And the paper that’s going
to serve as our organizing principle in the
rest of this lecture, maybe bleeding into
the next lecture, is this paper,
recently published that’s called The
DREAM5 Challenge. And this, like some of these
other challenges that we’ve seen before, is the case where
the organizers have data sets, where are they know
the answer to what the regulatory structure is. They send out the data. People try to make the
best predictions they can. And then they unseal
the data, to let people know how well they did. And so you can get a
relatively objective view of how well different
kinds of approaches work. So this is the overall
structure of this challenge. They had four different
kinds of data. Three are real data sets from
different organisms, E. coli, yeast, and
Staphylococcus aureus. And then the fourth one,
the one at the top here, is completely synthetic
data that they generated it. And you get a sense of the
scale of the data sets. So how many genes are involved,
how many potential regulators. In some cases, they’ve given
you specific information on knockouts, antibiotics,
toxins, that are perturbing the system. And again here, the
number of conditions that are being looked at,
the number of arrays. So then they provide
this data in a way that’s very hard
for the groups that are analyzing it to trace it
back to particular genes. Because you don’t want people to
use external data necessarily, to make their predictions. So everyone makes
their predictions. They also, as part
of this challenge, actually made
their own metapredictions, based on the
individual predictions by different groups. And we’ll take a look
at that in a second. And then they score
how well they did. Now we’ll get into the details
of the scoring a little bit later. But what they found
at the highest level was that different kinds of
methods behaved similarly. So the main groups
that they found were these regression-based
techniques. We’ll talk about
those in a second. Bayesian networks,
which we’ve already discussed in a
different context. A hodgepodge of different
kinds of things. And then mutual information
and correlation. So we’re going to look at
each of these main categories of prediction methods. So we’re going to start with
the Bayesian networks, which we just finished talking
about in a completely different context. Here, instead of trying to
predict whether an interaction is true, based on the
experimental data, we’re going to try to predict
whether a particular protein is involved in regulating a set of
genes, based on the expression data. So in this context– let’s
say I have cancer data sets, and I want to decide whether p53 is activated in those tumors. So this
is a known pathway for p53. So if I told you
the pathway, how might you figure out if p53
is active from gene expression data? I tell you this pathway, give
you this expression data– what’s kind of a simple
thing that you could do right away, to decide whether you
think p53 is active or not? p53 is a transcriptional
activator, so it should be turning on its target genes when it’s on. So what’s an
obvious thing to do? AUDIENCE: Check the expression
levels from the targets. PROFESSOR: Thank you. Right, so we could check
the expression levels of the targets, compute some simple statistics, right? OK. Well, that could work. But of course there could
be other transcriptional regulators that regulate
a similar set of genes. So that’s not a
guarantee that p53 is on. It might be some other
transcriptional regulator. We could look for the
pathways that activate p53. We could ask whether
those genes are on. So we’ve got in this
pathway, a bunch of kinases– ATM, CHK1, and so
on, that activate p53. Now if we had proteomic
data, we could actually look whether those proteins
are phosphorylated. But we have much, much
less proteomic data. And most of these settings
only have gene expression data. But you could look at: is
that gene expressed? Has the expression of one
of these activating proteins gone up? And you can try to
make an inference then: if there’s more of these activating proteins, then maybe p53 is active, and therefore it’s turning on its targets. That’s one step removed, though. So just the fact that there’s
a lot of ATM mRNA around doesn’t mean that there’s a
lot of the ATM protein, which certainly doesn’t mean that
the ATM is phosphorylated and turning on its target. So again, we don’t
have a guarantee there. We could look more specifically
whether the genes are differentially expressed. So the fact that they’re on
may not be as informative as if they were uniquely
on in this tumor, and not on in control cells
from the same patient. So that can be informative. But again changes
in gene expression are not uniquely related to
changes in protein level. So we’re going to have to
behave with a bit of caution. So the first step we’re going
to take in this direction, is try to build a
Bayesian network. That’s going to give us a way
to reason probabilistically over all of these kinds of data,
which by themselves are not great guarantees that we’re
getting the right answer. Just like in the protein
interaction prediction problem, where individually coexpression
wasn’t all that great, essentiality wasn’t
all that great. But taken together, they
could be quite helpful. So we want to compute
the probability that the p53 pathway is
active, given the data. And the only data we’re
going to have in the setting is gene expression data. So we’re going to assume
that for the targets of a transcription
factor to be active, the transcription
factor itself has to be expressed
at a higher level. That’s a restriction that’s very commonly used when analyzing these kinds of data. So we’re going to try to
compute the probability that p53 is activated, given the data. So how would I compute
the probability that, given that some transcription factor is on, I see expression
from target genes? How would I do this? I would just go into the data,
and just count in the same way that we did in our
previous setting. We could just look over
all the experiments and tabulate: when one of the targets is up in expression, how often is the transcription factor that’s potentially activating it also up? And how often does each of the other possible combinations occur? And then we can use
Bayesian statistics to try to compute
the probability that a transcription
factor is up– activated– given that I’ve seen the gene expression data. Is that clear? Good.
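To make that counting concrete, here’s a minimal sketch in Python– this is not from the lecture, and it assumes the expression data have already been binarized into up/not-up calls; the toy numbers are invented purely for illustration.

import numpy as np

# Toy binarized expression calls (1 = "up", 0 = "not up") across experiments.
# Rows are experiments; columns are [transcription factor, one of its targets].
calls = np.array([
    [1, 1], [1, 1], [1, 0], [0, 0],
    [0, 0], [0, 1], [1, 1], [0, 0],
])
tf, target = calls[:, 0], calls[:, 1]

# Tabulate the conditional probability table P(target | TF) by counting,
# with a pseudocount of 1 to avoid zero probabilities.
cpt = np.ones((2, 2))
for t, g in zip(tf, target):
    cpt[t, g] += 1
cpt /= cpt.sum(axis=1, keepdims=True)   # rows: TF state, columns: target state

# Prior on the TF being up, also estimated by counting.
p_tf = np.array([np.mean(tf == 0), np.mean(tf == 1)])

# Bayes' rule: P(TF up | target up).
p_joint = p_tf * cpt[:, 1]              # P(TF = s) * P(target up | TF = s)
print("P(TF up | target up) =", round(p_joint[1] / p_joint.sum(), 2))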
So we want to try not to include just the downstream targets. Because possibly there are multiple transcription factors that are equally likely to be driving expression of this set of genes. We want to include the
upstream regulators as well. And so here, we’re going
to take advantage of one of the properties
of Bayesian nets that we looked at: explaining away. And you’ll remember this example, where we decided that if I see
that the grass is wet, and I know that
it’s raining, then I can consider less likely
that the sprinklers were on. Even though there’s no causal
relationship between them. So if I see that a set of
targets of transcription factor A are on, and I have evidence
that the pathway upstream of A is on, that reduces my
inferred probability that the transcription
factor B is responsible. So that’s one of the nice things
about Bayesian networks that gives us a way of
reasoning automatically, over all the data, and not
just the downstream targets. And the Bayesian networks
can have multiple layers. So we can have one
transcription factor turning on another one, which turns on another one, which turns on another one. Again, we can have as
many layers as necessary. But one thing we
can’t have are cycles. So we can’t have a
transcription factor that’s at the bottom of this,
going back and activating things that are at the top. And that’s a
fundamental limitation of Bayesian networks. We’ve already talked
about the fact that in Bayesian networks,
with these two problems that we have to solve, we have to be able to define the structure if we don’t know it a priori. Here, we don’t know it a priori. So we’re going to have to learn
the structure of the network. And then with the
structure of the network, we’re going to have to
learn all the probabilities. So the conditional
probability tables that relate each
variable to every other one. And then just two more
small points about it. So if I just give
you expression data, without any interventions–
just the observations, then I can’t decide what is a
cause and what is an effect. So here this was done in
the context of proteomics, but the same is true for
gene expression data. If I have two variables, x and
y, that are highly correlated, it could be that x activates y. It could be that y activates x. But if I perturb the system,
and I block the activity of one of these two genes
or proteins, then I can start to tell
the difference. In this case, if
you inhibit x, you don’t see any activation of y. That’s the yellow,
all down here. But if you inhibit y,
you see the full range of activity of x. So that implies that x
is the activator of y. And so in these settings, if
you want to learn a Bayesian network from data, you need
more than just a compendium of gene expression data. If you want to get the
directions correct, you need perturbations
where someone has actually inhibited particular
genes or proteins.
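As a toy illustration of why interventions matter– this is a made-up simulation, not data from the lecture– suppose x really does activate y; then clamping x silences y, but clamping y leaves x untouched:

import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Simulated ground truth: x activates y.
x = rng.normal(size=n)
y = 2 * x + rng.normal(scale=0.5, size=n)
print("observational corr(x, y):", round(np.corrcoef(x, y)[0, 1], 2))  # high, but direction unknown

# Intervention 1: inhibit (clamp) x. y collapses, because y depends on x.
y_do_x = 2 * np.zeros(n) + rng.normal(scale=0.5, size=n)
print("std of y when x is inhibited:", round(y_do_x.std(), 2))

# Intervention 2: inhibit (clamp) y. x keeps its full range, because x does not depend on y.
x_do_y = rng.normal(size=n)
print("std of x when y is inhibited:", round(x_do_y.std(), 2))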
Now, in a lot of these Bayesian networks, we’re not going
to try to include every possible gene and
every possible protein. Either because we don’t
have measurements of it, or because we need
a compact network. So there will often
be cases where the true regulator
in some causal chain, is missing from our data. So imagine this is the true
causal chain– x activates y, which then activates z and w. But either because we
don’t have the data on y, or because we left it out to
make our models more compact, it’s not in the model. We can still pick up the
relationships between x and z, and x and w. But the data will
be much noisier. Because we’re missing
that information in the conditional probability tables relating x to y, and then y to its targets. So Bayesian networks, we’ve
already seen quite a lot. We now have some idea of how to
transfer them from one domain to the domain of
gene expression data. The next approach
we want to look at is a regression-based approach. So the regression-based
approaches are founded on a
simple idea, which is that the expression of a gene is going to be some function of the expression levels of its regulators. We’re going to
actually try to come up with a formula that
relates the activity levels of the
transcription factors, and the activity
level of the target. In this cartoon, I’ve
got a gene that’s on under one
condition, that’s off under some other conditions. What transforms it
from being off to on, is the introduction of
more of these transcription factors, that are
binding to the promoter. So in general, I have
some predicted level of expression for the gene. It’s called the
predicted level y. And it’s some function,
unspecified at this point, f of g, of all the expression
levels of the transcription factors that regulate that gene. So just so the nomenclature is straight, x sub g is going to be the expression of gene g. This capital X, sub t of g
is the set of transcription factors, that I believe
are regulating that gene. And then f is an
arbitrary function. We’re going to have
a noise term as well. Because this is the
observed gene expression, not some sort of platonic view
of the true gene expression. Now frequently, we’ll
have a specific function. So the simplest one
you can imagine, which is a linear function. So the expression of
any particular gene is going to be a
linear function, a sum, of the expression of
all of its regulators, where each one has associated
with it a coefficient beta. And that beta
coefficient tells us how much particular a
regulator influences that gene. So say, p53 might have
a very large value. Some other
transcriptional regulator might have a small
value, representing their relative influence. Now, I don’t know the
beta values in advance. So that’s one of the things
that I need to learn. So I want to be able to
find a setting that tells me what the beta values are for
every possible transcription factor. If the algorithm sets
the beta value to zero, what does that tell me? If a beta value
is zero here, what does that tell me about that
transcriptional regulator? No influence, right. And the higher the value, then
the greater the influence. OK. So how do we discover these? The typical approach then is to come up with some
objective function that we’re going
to try to optimize. And an obvious
objective function is the difference between
the observed expression value for each gene,
and the expected one, based on that linear function. And we’re going to choose
a set of beta parameters that minimizes the difference between the observed and the expected– that minimizes
the sum of the squares. So the residual sum
of the squares error, between the predicted
and the observed. So this is a relatively standard regression problem, just in a different setting.
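In symbols, the linear model says the observed expression of gene g is approximately the sum over its candidate regulators of beta_j times regulator j’s expression, plus noise, and least squares picks the betas minimizing the residual sum of squares. Here is a minimal sketch on simulated data– the matrix sizes and the true betas are invented for illustration:

import numpy as np

rng = np.random.default_rng(1)
n_experiments, n_tfs = 50, 5

# Made-up expression matrix for candidate regulators (experiments x TFs).
X = rng.normal(size=(n_experiments, n_tfs))

# Simulate a target gene driven by TF 0 and TF 2 only, plus noise.
true_beta = np.array([1.5, 0.0, -0.8, 0.0, 0.0])
y = X @ true_beta + rng.normal(scale=0.3, size=n_experiments)

# Ordinary least squares: minimize the residual sum of squares ||y - X @ beta||^2.
beta_hat, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print("estimated betas:", np.round(beta_hat, 2))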
Now, one of the problems with a standard regression problem is that we’ll
typically get a lot of very small values of beta. So we won’t get all
zeros or all ones, meaning the algorithm is
100% certain that these are the drivers
and these are not. We’ll get a lot of small values
for many, many transcription factors. And OK, that could
represent the reality. But the bad thing is
that those beta values are going to be unstable. So small changes in
the training data will give you big changes, in
which transcription factors have which values. So that’s not a desirable setting. There’s a whole field
built up around trying to come up with
better solutions. I’ve given you some
references here. One of them is to a
paper that did well in the DREAM challenge. The other one is to
a very good textbook, Elements of
Statistical Learning. And there are various
techniques that allow you to try to
limit the number of betas that are non-zero. And by doing that, you get
more robust predictions. At a cost, right, because there
could be a lot of transcription factors that really do
have small influences. But we’ll trade
that off, by getting more accurate predictions
from the ones that have the big influences.
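One common way of limiting the number of non-zero betas is an L1 (lasso) penalty. The sketch below uses scikit-learn’s Lasso purely as an example of the regularization techniques those references describe– it is not necessarily what any particular DREAM5 entrant used, and the alpha value is an arbitrary choice:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n_experiments, n_tfs = 50, 20

# Made-up regulator expression, and a target driven by only two of the regulators.
X = rng.normal(size=(n_experiments, n_tfs))
true_beta = np.zeros(n_tfs)
true_beta[[3, 11]] = [2.0, -1.2]
y = X @ true_beta + rng.normal(scale=0.3, size=n_experiments)

# The L1 penalty (strength alpha) drives most betas exactly to zero, trading a
# little bias for a much more stable choice of which regulators are selected.
model = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_)
print("non-zero betas at regulators:", selected, np.round(model.coef_[selected], 2))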
Are there any questions on regression? So the last of the methods that we’re examining is mutual information.
information in the course. So information
content is related to the probability of observing
some variable in an alphabet. So in most languages,
the probability of observing letters
is quite variable. So Es are very common
in the English language. Other letters are less common. As anyone who plays Hangman or
watches Wheel of fortune knows. And we defined the
entropy as the sum, over all possible outcomes, of the probability of observing some value times the information content of that value; we can define it in the discrete case or in the continuous case. And the critical thing is the mutual information between two variables. That’s the difference between the entropy of those variables taken independently and their joint entropy. So a high mutual information means that one variable gives you significant knowledge of what the other variable is doing. It reduces my uncertainty. That’s the critical idea. OK.
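Written out, the discrete entropy is H(X) = -sum_x p(x) log p(x), and the mutual information is I(X;Y) = H(X) + H(Y) - H(X,Y). Here is a minimal sketch of estimating it from continuous expression values by binning– the equal-width bins and their number are arbitrary choices for illustration, and real tools are more careful about that:

import numpy as np

def entropy(p):
    """Entropy in bits of a discrete distribution (zero-probability bins are ignored)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(x, y, bins=10):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from a 2D histogram of the two profiles."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    joint /= joint.sum()
    p_x, p_y = joint.sum(axis=1), joint.sum(axis=0)
    return entropy(p_x) + entropy(p_y) - entropy(joint.ravel())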
So we looked at correlation before. There could be
settings where you have very low correlation
between two variables, but have high
mutual information. So consider these two genes,
protein A and protein B, and the blue dots are the
relationship between them. You can see that there’s
a lot of information content in these two variables. Knowing the value of A
gives me a high confidence in the value of B. But
there’s no linear relationship that describes these. So if I use mutual
information, I can capture
situations like this, that I can’t capture
with correlation.
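For instance, a symmetric, non-monotonic relationship– here a made-up parabola– has essentially zero Pearson correlation but clearly non-zero mutual information:

import numpy as np

rng = np.random.default_rng(3)
a = rng.uniform(-1, 1, size=2000)                  # "protein A"
b = a ** 2 + rng.normal(scale=0.05, size=a.size)   # "protein B" tracks A, but not linearly

print("Pearson correlation:", round(np.corrcoef(a, b)[0, 1], 2))   # close to 0

# Histogram-based mutual information estimate (same recipe as the sketch above).
joint, _, _ = np.histogram2d(a, b, bins=10)
joint /= joint.sum()
h = lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0]))
mi = h(joint.sum(axis=1)) + h(joint.sum(axis=0)) - h(joint.ravel())
print("Mutual information (bits):", round(mi, 2))                  # clearly above 0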
And these kinds of situations actually occur. So for example, in a
feed-forward loop– say we’ve got a regulator
A, and it directly activates B. It also directly
activates C. But C inhibits B. So you’ve got the path on
the left-hand side that is pressing the accelerator, and the path on the right-hand side pressing the brake pedal. That’s called an incoherent
feed-forward loop. And you can get under different
settings, different kinds of results, and this is one of those examples. You can get much more complicated behavior. There are papers that have really mapped out these behaviors across many parameter settings. You can get switches
in the behavior. But in a lot of
these settings, you will have high
mutual information between two
variables, even if you don’t have any correlation,
linear correlation between them. A well-publicized algorithm
that uses mutual information to infer gene regulatory
networks is called ARACNe. They go through and they
compute the mutual information between all pairs of
genes in their data set. And now one question you have
with mutual information is, what defines a significant
level of mutual information? So an obvious way to do
this, to try to figure out what’s significant, is
to do randomizations. And so that’s what they did. They shuffled the
expression data, to compute mutual information
among pairs of genes, where there isn’t
actually a need– there shouldn’t be
any relationships. Because the data
had been shuffled. And then you can decide whether
the observed mutual information is significantly
greater than what you get from the randomized data.
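The permutation test itself is easy to sketch: shuffle one gene’s profile to destroy any real dependence, recompute the mutual information many times, and take a high quantile of that null distribution as the significance threshold. This is a generic sketch of the idea, not ARACNe’s actual code; the number of permutations and the quantile are arbitrary choices:

import numpy as np

def mi(x, y, bins=10):
    # Histogram-based mutual information, as sketched earlier.
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    joint /= joint.sum()
    h = lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return h(joint.sum(axis=1)) + h(joint.sum(axis=0)) - h(joint.ravel())

def mi_null_threshold(x, y, n_perm=1000, quantile=0.99, seed=0):
    """Shuffle y to break any real dependence; return a high quantile of the null MI values."""
    rng = np.random.default_rng(seed)
    null = [mi(x, rng.permutation(y)) for _ in range(n_perm)]
    return np.quantile(null, quantile)

# A gene pair is kept as a candidate edge only if its observed MI exceeds this threshold.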
Now, the other thing that happens with mutual information is that indirect effects still produce appreciable mutual information. So let’s consider the set of
genes that are shown on this. So you’ve got G2,
which is actually a regulator of G1 and G3. So G2 is going to have
high mutual information with G1, and with G3. Now, what about G1 and G3? They’re going to behave
very similarly, as well. So it’ll be a high degree
of mutual information between G1 and G3. So if I just rely on
mutual information, I can’t tell what’s a
regulator and what’s just a fellow target at the same
level of regulation. They’re both being affected
by something above them. I can’t tell the difference
between those two. So they use what’s called the
data processing inequality, where they say, well, these
regulatory interactions should have higher
mutual information than this one, which is just between two common targets of the same parent. And so they drop from their network those edges which are the lowest of the three in a triangle.
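A minimal sketch of that pruning step, given precomputed pairwise mutual information values– the data structure and the tolerance parameter are illustrative assumptions, not ARACNe’s actual implementation:

from itertools import combinations

def dpi_prune(mi, genes, tolerance=0.0):
    """mi maps frozenset({gene1, gene2}) -> mutual information.
    For every fully connected triangle, drop the weakest of its three edges."""
    edges = set(mi)
    to_drop = set()
    for g1, g2, g3 in combinations(genes, 3):
        triangle = [frozenset({g1, g2}), frozenset({g2, g3}), frozenset({g1, g3})]
        if all(e in edges for e in triangle):
            weakest = min(triangle, key=lambda e: mi[e])
            others = [mi[e] for e in triangle if e != weakest]
            # Drop the weakest edge if it falls clearly below the other two.
            if mi[weakest] < (1 - tolerance) * min(others):
                to_drop.add(weakest)
    return edges - to_drop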
So that was the original ARACNe algorithm, and then they modified
it a little bit, to try to be more specific in
terms of the regulators that were being picked up. And so they called
this approach MINDy. And the core idea here,
is that in addition to the transcription
factors, you might have another protein that
turns a transcription factor on or off. So if I look over
different concentrations of the transcription factor,
different levels of expression between transcription
factors, I might find that there are some cases
where this other protein turns it on, and other cases
where it turns it off. So here, consider
these two data sets. Looking at different
concentrations of a particular transcription
factor and different expression levels, and in one
case– the blue ones, the modulator isn’t
present at all, or present at its
lowest possible level. And in the red case, it’s
present as a high level. And you can see that when
the modulator is present only in low levels, there’s no
relationship between a target and its transcription factor. But when the modulator is
present at a high level, then there’s this
linear response of the target to its
transcription factor. So this modulator seems to
be a necessary component. So they went through and
defined a whole bunch of settings like this, and then systematically searched
the data for these modulators. So they started off with the
expression data set, genes in rows, experiments in columns. They do a set of
filtering to remove things that are going to be
problematic for the analysis. They look, for
example, for settings where you have– they had to
start with a list of modulators and transcription factors,
and they removed the ones where there isn’t enough
variation, and so on. And then they examine, for every
modulator and transcription factor pair, cases
where the modulator is present at its highest
level, and where it’s present at
its lowest level. So when the modulator is present
at a high level– let’s say, when the modulator is
present at a high level, there’s a high
mutual information between the transcription
factor and the target. When the modulator is absent,
there’s no mutual information. That’s a setting we
looked at before. That would suggest that the
modulator is an activator. It’s a positive modulator. You can have the
opposite situation, where when the modulator
is present at low levels, there’s mutual information
between a transcription factor and its target. When the modulator is
present at a high level, you don’t see anything. That would suggest
that the modulator is a negative regulator. And then there are scenarios
where there’s either uniformly high
information content between transcription factor
and target, or uniformly low. So the modulator doesn’t
seem to be doing anything. So we break it down
into these categories. And you can look at all
the different categories in their supplemental tables.
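The core computation is easy to sketch: split the samples into the lowest and highest tails of the modulator’s expression, estimate the mutual information between transcription factor and target separately in each tail, and compare. The 25% tail cutoff and the histogram MI estimator here are assumptions for illustration, not MINDy’s exact parameters:

import numpy as np

def mi(x, y, bins=8):
    # Histogram-based mutual information, as sketched earlier.
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    joint /= joint.sum()
    h = lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return h(joint.sum(axis=1)) + h(joint.sum(axis=0)) - h(joint.ravel())

def modulator_effect(tf, target, modulator, tail=0.25):
    """Compare MI(TF, target) in the samples where the modulator is lowest vs highest."""
    lo = modulator <= np.quantile(modulator, tail)
    hi = modulator >= np.quantile(modulator, 1 - tail)
    return mi(tf[lo], target[lo]), mi(tf[hi], target[hi])

# MI high only in the high-modulator samples suggests a positive modulator;
# MI high only in the low-modulator samples suggests a negative one.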
One thing that’s kind of interesting is that they assume that, regardless
of how high the transcription factor goes, you’ll
always see an increase in the expression of the target. So there is no
saturation, which is an unnatural assumption
in these data sets. OK. So I think I’ll close with this
example, from their experiment. And then in the
next lecture, we’ll look at how these
different methods fare against each other in
the DREAM challenge. So they specifically wanted
to find regulators of MYC. So here’s data for a
particular regulator, STK38. Here’s the set of
expression of tumors where STK38
expression is lowest. And a set of tumors where
STK38 expression is highest. And they’re sorted by the
expression level of MYC. So on the left hand
side, you’ll see there’s no particular
relationship between the expression level
of MYC and the targets. In the right hand side,
there is a relationship between the expression
level of MYC and its targets. So apparently– at least at this level of mutual information– having higher levels of STK38 causes a relationship to occur. That would be an example
of an activator. OK. So this technique has
a lot of advantages, and allows you to search rapidly
over very large data sets, to find potential target
transcription factor relationships, and also
potential modulators. It has some limitations. The key limitation is that the signal has to be present in the
expression data set. So in the case of
a protein like p53, which we know is activated by all sorts of other processes such as phosphorylation– or NF-kappaB, which is regulated by phosphorylation– you
might not get any signal. So there has to be a case
where the transcription factor itself is
changing expression. It also won’t work if
the modulator is always highly correlated
with its target, for some other
biological reason. If the modulator is always on, for other reasons, when the target is, then
you’ll never be able to divide the
data in this way. One of the other
things I think that is problematic with
these networks is that you get such
large networks, and they’re very
hard to interpret. So in this case, this
is the nearest neighbors of just one node in ARACNe. This is the mutual information
network of microRNA modulators that has a quarter of
a million interactions. And in these data
sets, often you end up selecting a very,
very large fraction of all the potential modulators. So of all the candidate
transcription factors and modulators, it
comes up with an answer that’s roughly 10% to 20%
of them are regulating any particular gene,
which seems awfully high. OK. So any questions on the
methods we’ve seen so far? OK. So when we come
back on Thursday, we’ll take a look
at a head-to-head comparison of how these different methods
perform on both the synthetic and the real data sets.
