# 15. Gene Regulatory Networks

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So you’ll

recall last time we were working on

protein-protein interactions. We’re going to do a little

bit to finish that up, with a topic that will be a

good transition to the study of the gene regulatory networks. And the precise things we’re

going to discuss today, we’re going to start off

with Bayesian networks of protein-protein

interaction prediction. And then we’re going to get

into gene expression data, at several different levels. We’ll talk about

some basic questions of how to compare the expression vectors of two genes, using distance metrics. We’ll talk about how to

cluster gene expression data. The idea of

identifying signatures of sets of genes, that

might be predictive of some biological property, for example, susceptibility to a disease. And then we’ll talk about

a number of different ways that people have

developed to try to identify gene

regulatory networks. That often goes by

the name of modules. I don’t particularly

like that name. But that’s what you’ll

find in the literature. And we’re going to

focus on a few of these, that have recently been

compared head to head, using both synthetic and real data. And we’ll see some

of the results from that head to

head comparison. So let’s just launch into it. Remember last time we had

started this unit looking at the structural

predictions for proteins. And we started talking

about how to predict protein-protein interactions. Last time we talked about

both computational methods, and also experimental

data, that could give us information about

protein-protein interactions. Ostensibly measuring

direct interactions, but we saw that there were

possibly very, very high error rates. So we needed ways of integrating

lots of different kinds of data in a probabilistic framework so

we could predict for any pair of proteins what’s the

probability that they interact. Not just the fact that

they were detected in one assay or the other. And we started to talk

about Bayesian networks in this context. Both useful as we’ll see

today, for predicting protein-protein

interactions, and also for the gene regulatory

network problem. So the Bayesian

networks are a tool for reasoning probabilistically. That’s the fundamental purpose. And we saw that they consisted of a graph, the network, and then the conditional probability tables that represent the probability associated with each edge. And that we can learn

these from the data, either in a completely

objective way, where we learn both the structure

and the probability. Or where we impose the

structure initially, and then we simply learn

the probability tables. And we had nodes that

represented the variables. They could be

hidden nodes, where we don’t know what the true

answer is, and observed nodes, where we do. So in our case, we’re

trying to predict protein-protein interactions. There’s some hidden

variable that represents whether protein

A and B truly interact. We don’t know that answer. But we do know whether

that interaction was detected in experiment one, two, three, or four. Those are the

effects, the observed. And so we want to

reason backwards from the observations,

to the hidden causes. So last time we talked about

the high-throughput experiments that directly measure protein-protein interactions. We talked about yeast two

hybrid and affinity capture mass spec– here

listed as pull-downs. And those could

be used to predict protein-protein interactions by themselves. But we want to find out

what other kinds of data we can use to amplify

these results, to give us independent information about

whether two proteins interact. And one thing you

could look at is whether the expression

of the two genes that you think might interact

are similar. So if you look over many,

many different conditions, you might expect

the two proteins that interact with

each other, would be expressed under

similar conditions. Certainly if you saw

two proteins that had exactly opposite

expression patterns, you would be very unlikely to

believe that they interacted. So the question is, how much

is true at the other end of the spectrum? If things are very

highly correlated, do they have a high

probability of interaction? So this graph is a

histogram for proteins that are known to

interact, proteins that were shown in these

high throughput experiments to interact, and proteins that

are known not to interact, of how similar

the expression is. On the far right

are things that have extremely different expression

patterns, a high distance. And we’ll talk

specifically about what distance is in just a minute. But these are very dissimilar

expression patterns. These are very similar ones. So what do you see from this plot we looked at last time? We saw that the

interacting proteins are shifted a bit to the left. So the interacting ones

have a higher probability of having similar

expression patterns than the ones that don’t interact. But we couldn’t

draw any cut off, and say everything with this

level of expression similarity is guaranteed to interact. There’s no way to divide these. So this will be useful in

a probabilistic setting. But by itself, it would

not be highly predictive. We also talked about

evolutionary patterns, and we discussed whether the

red or the green patterns here, would be more predictive. And which one was

it, anyone remember? How many people thought the

red was more predictive? How many the green? Right, the greens win. And we talked about the

coevolution in other ways. So the paper that, I

think, was one of the first to do this really

nicely, try to predict protein-protein

interaction patterns using Bayesian networks, is this

one from Mark Gerstein’s lab. And they start off as we

talked about previously, we need some gold

standard interactions, where we know two proteins

really do interact or don’t. They built their gold

standard data set. The positive training

data, they took from a database

called MIPS, which is a hand-curated database

that digs into the literature quite deeply, to find out

whether two proteins interact or not. And then the negative

data they took were proteins that

were identified as being localized to

different parts of the cell. And this was done

in yeast, where there is pretty good data

for a lot of proteins, to subcellular localization. So these are the data that

went into their prediction. These were the experiments

we’ve already talked about, the affinity capture mass

spec and the yeast two hybrid. And then the other

kinds of data they used were expression correlation,

which we just talked about. They also looked at

annotations, whether proteins had the same annotation

for function. And essentiality. So in yeast, it’s

pretty easy to go through every gene

in the genome, knock it out, and determine

whether that kills the cell or not. So they can label

every gene in yeast, as to whether it’s essential

for survival or not. And you can see here, the

number of interactions that were involved. And they decided

to break this down into two separate

prediction problems. So one was an

experimental problem, using the four

different large scale data sets in yeast from

protein-protein interactions, to predict interactions. The other one used these other kinds of data, that were less direct. And they used slightly different

kinds of Bayesian networks. So for this one, they

used a naive Bayes. And what’s the underlying

assumption of the naive Bayes? The underlying assumption

is that all the data are independent. So we looked at this previously. We discussed how

you could, if you’re trying to identify

the likelihood ratio, and use it to rank things. You primarily need to

focus on this term. Because this term will be the

same for every pair of proteins that you’re examining. Yes? AUDIENCE: Could you state

again whether in a naive Bayes, all data are dependent

or independent? PROFESSOR: Independent. AUDIENCE: OK. PROFESSOR: OK. So let’s actually look

at some of their data. So in this table, they’re

looking at the likelihood ratio that two proteins

interact, based on whether the two

proteins are essential. One is essential, and

one is a nonessential. Both are nonessential. So that’s what these

two codes here mean. EE, both essential. NN, both nonessential,

and any one and the other. And so they’ve computed

for all those protein pairs, how many in

their gold standard, are EE, how many are

EN, how many are NN? So here are the

numbers for the EE. There are just over 1,000, out of the roughly 2,000 gold-standard positives, that are EE. So that comes out to a probability of both being essential, given that the pair is a positive in the gold standard, of roughly 50%, right? And you can assume something

similar for the negatives. So these are the ones that

definitely don’t interact. So the probability of

both being essential, given that it’s negative,

is about 15%, 14%. And so then the likelihood ratio

comes out to just under four. So there’s a nearly fourfold increase in probability that two proteins interact, given that both are essential, than not. And this is the table

for all of the terms, for all of the different things

that they were considering, that were not

direct experiments. So this is the essentiality. This is expression correlation,

with various values for the threshold, how similar

the expression had to be. And these are the terms from

the databases for annotation. And then for each

of these, then we get a likelihood ratio

of how predictive it is. So it’s kind of informative to

look at some of these numbers. We already saw that

essentiality is pretty weakly predictive. The fact that two genes are essential only gives you a slightly increased chance that they’re interacting than not. But if two genes

have extremely high expression correlation, then they’re

more than a hundredfold more likely to interact than not. And the numbers

for the annotations are significantly less than that. So this is a naive Bayes: we’re going to multiply all those likelihood ratios together.
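That multiplication can be made concrete with a small sketch. All counts and ratios below are invented to echo the lecture’s example, not the paper’s actual values:

```python
# Naive Bayes combination of evidence for protein-protein interaction
# prediction. Each feature contributes a likelihood ratio
# P(feature | gold positive) / P(feature | gold negative); under the
# naive independence assumption, the combined ratio is their product.

def likelihood_ratio(pos_with, pos_total, neg_with, neg_total):
    """Likelihood ratio for one feature value, from gold-standard counts."""
    return (pos_with / pos_total) / (neg_with / neg_total)

def combined_lr(ratios):
    """Multiply per-feature likelihood ratios (the naive Bayes step)."""
    product = 1.0
    for r in ratios:
        product *= r
    return product

# Hypothetical counts: ~1,000 of ~2,000 gold positives are EE (both
# essential), versus ~14% of gold negatives -- a ratio just under 4.
lr_essential = likelihood_ratio(1000, 2000, 140, 1000)

# Combine with invented ratios for expression correlation and annotation.
print(combined_lr([lr_essential, 100.0, 2.0]))
```

Because only ratios are multiplied, the ranking of candidate pairs doesn’t require the prior term, just as described above.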

Now for the experimental data, they said, well, these are probably not all independent. The probability that you pick

something up in one two hybrid experiment,

is probably highly correlated with the

probability that you pick it up in another two

hybrid experiment. And one would hope that there’s some correlation between things identified in two-hybrid and affinity capture mass spec. Although we’ll see whether

or not that’s the case. So they used what they refer

to as a fully connected Bayes. And what do we mean by that? Remember, this was

the naive Bayes, where everything is independent. So the probability

of some observation is the product of all the

individual probabilities. But in a fully

connected Bayes, we don’t have that

independence assumption. So you need to actually

explicitly compute what the probability

is for an interaction, based on all the possible

outcomes in those experiments. So that’s not that much harder. We simply have a table now,

where these columns represent each of the experimental

data types– the affinity capture mass

spec and the two hybrids. A one indicates that it was detected, a zero that it was not. And then we simply look

again in our gold standard, and see, for a protein pair detected in whatever the setting is here, in all of them except Ito, how many of the gold positives we get, and how many of the gold negatives. And then we can compute the probabilities.
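A minimal sketch of that fully connected version follows. The detection patterns and counts are invented, and the four columns merely stand in for the four experiments:

```python
# Fully connected Bayes: instead of multiplying per-experiment ratios,
# tabulate each joint detection pattern directly in the gold standard
# and compute its likelihood ratio from the table.

from collections import Counter

# (exp1, exp2, exp3, exp4) pattern -> counts among gold positives/negatives
gold_pos = Counter({(1, 1, 0, 0): 16, (1, 0, 0, 0): 400, (0, 0, 1, 0): 60})
gold_neg = Counter({(1, 1, 0, 0): 2, (1, 0, 0, 0): 1000, (0, 0, 1, 0): 650})

n_pos = sum(gold_pos.values())
n_neg = sum(gold_neg.values())

def joint_lr(pattern):
    """P(pattern | positive) / P(pattern | negative), read off the table."""
    return (gold_pos[pattern] / n_pos) / (gold_neg[pattern] / n_neg)

# With these made-up counts, a pattern seen in two assays outranks a
# pattern seen in only one.
print(joint_lr((1, 1, 0, 0)) > joint_lr((1, 0, 0, 0)))
```

Note that with sparse tables like this, rare patterns get very noisy ratios, which is exactly the small-numbers caution raised next.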

Now it’s important to look at some of the numbers in these tables and dig in. Because you’ll see the numbers

here are really, really small, so they have to be interpreted with caution, and some of the conclusions might not hold up with much larger data sets. You might imagine the things

that are experimentally detected in all of the

high-throughput assays would be the most confident. That doesn’t turn

out to be the case. So these are sorted by the likelihood ratio, and the best one is not 1, 1, 1. It’s up there. But it’s not the

top of the pack. And that’s probably just the

statistics of small numbers. If the databases were larger,

experiments were larger, it probably would

work out that way. So any question about how

they formulated this problem, as a Bayesian network, or

how they implemented it? OK. So the results then– so once

we have these likelihood ratios, we can try to choose a threshold

for deciding what we’re going to consider to be a

true interaction and not. So here they’ve plotted for

different likelihood ratio thresholds. On the x-axis, how many of the

true positives you get right, versus how many you get wrong. So the true positive

over the false positive. And you can

arbitrarily decide, OK, well I want to be more– I want

to get more right than wrong. Not a bad way to decide things. So your passing

grade here is 50%. So if I draw a line,

a horizontal line, and wanted to get

more right than wrong, you’ll see that any

of the individual signals that they were

using, essentiality, database annotation, and so on, all of those fall below that. So individually, they predict

more wrongs than rights. But if you combine the data

using this Bayesian network, then you can choose a likelihood threshold where you do get more right than wrong. And you can set your threshold wherever you want.
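That cutoff choice can be sketched like this, where the scores are invented likelihood ratios rather than the paper’s:

```python
# Scan likelihood-ratio cutoffs and count gold positives (TP) versus
# gold negatives (FP) scoring at or above each cutoff. "More right than
# wrong" simply means TP > FP at that cutoff.

def tp_fp(pos_scores, neg_scores, cutoff):
    tp = sum(s >= cutoff for s in pos_scores)
    fp = sum(s >= cutoff for s in neg_scores)
    return tp, fp

pos_scores = [600.0, 45.0, 12.0, 3.0, 0.8]   # hypothetical gold positives
neg_scores = [9.0, 2.0, 0.5, 0.1, 0.05]      # hypothetical gold negatives

for cutoff in (1.0, 10.0, 100.0):
    print(cutoff, tp_fp(pos_scores, neg_scores, cutoff))
```

Raising the cutoff trades away true positives for fewer false positives, which is the curve being plotted here.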

Similarly for the direct experimental data, you do better by combining, these are the light pink lines, than you would with any of the individual data sets. So this shows the utility

of combining the data, and reasoning from the

data probabilistically. Any questions? So we’ll return to

Bayesian networks in a bit in the

context of discovering gene regulatory networks. So we now want to move

to gene expression data. And the primary reason to be so

interested in gene expression data is simply that there’s a

huge amount of it out there. So just a short time ago

we passed the million mark, with the number of expression data sets that have been collected

in the databases. There’s much less of any other

kind of high throughput data. So if you look at proteomics

or high-throughput genetic screens, there are

tiny numbers, compared to gene expression data. So obviously techniques for

analyzing gene expression data are going to play a very

important role for a long time to come. Some of what I’m

going to discuss today is covered in your textbooks. I encourage you to look

at text section 16.2. The fundamental thing that

we’re interested in doing, is seeing how much

biological knowledge we can infer from the

gene expression data. So we might imagine

that genes that are coexpressed under

particular sets of conditions have functional

similarity, reflect common regulatory mechanisms,

and our goal then, is to discover those mechanisms. So fundamental to this then, any

time we have a pair of genes– and we look at their

gene expression data– we want to decide

how similar they are. So let’s imagine that we had

these data for four genes. And it’s a time

series experiment. And we’re looking at the

different expression levels. And we want some

quantitative measure to decide which two

genes are most similar. Well, it turns out

it’s a lot more subtle than we might think. So at first glance,

oh, it’s pretty obvious that these two are

the most similar. But it really depends on

what kind of similarity you’re asking about. So we can describe any expression data set for any gene as simply a multi-dimensional vector. Where this is the set

of expression values we detected for the

first gene, across all the different experimental

conditions and so on, for the second. And what would be the

most intuitive way of describing the

distance between two multi-dimensional vectors? It would simply be

Euclidean distance, right? So that’s perfectly reasonable. So we can decide that the

distance between two gene expression data sets, is

simply the square root of the sum of the

squares of the distances. So we’ll take the sum over all the experimental conditions that we’ve looked at. Maybe it’s a time series, maybe it’s different perturbations. And look at the difference in expression of gene A and gene B in that condition, k. And then evaluating this will tell us how similar two genes are in their expression profiles.
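That sum is only a few lines of code. This is a sketch with invented expression values:

```python
# Euclidean distance between two expression profiles: the square root of
# the sum, over conditions k, of the squared differences e_Ak - e_Bk.

import math

def euclidean_distance(a, b):
    assert len(a) == len(b), "profiles must cover the same conditions"
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

gene_a = [1.0, 2.0, 3.0, 2.0, 1.0]   # hypothetical time course
gene_b = [1.1, 2.1, 2.9, 2.0, 1.0]
print(euclidean_distance(gene_a, gene_b))
```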

Well, that’s a specific example of a distance metric. It turns out that there’s

a formal definition for a distance metric. Distances have the

following properties. They’re always

greater than zero. We never have

negative distances. They are equal to zero under

exactly one condition– the two data points are the same. And they’re symmetric. So the distance from

A to B is the same as the distance from B to A. Now, to be a true

distance, then you also have to satisfy the

triangle inequality, that the distance from

x to z is less than or equal to the sum

of the distances through a third point. But we will find out

that we don’t actually need that for

similarity measures. So we can have either

a true distance metric for comparing gene

expression data sets, or similarity measures as well. So let’s go back to

the simple example. So we decided that the

red and the blue genes were nearly identical, in terms

of their distance metrics. But that’s not always

exactly what we care about. So in biological settings,

frequently the absolute level of gene expression is

on some arbitrary scale. Certainly with

expression arrays, it was completely arbitrary. It had to do with

fluorescence properties, and how well probes

hybridize to each other. But even with mRNA,

how do we really know that 1,000 copies

is fundamentally different from 1,200 copies

of an RNA in the cell? We don’t. So we might be more interested

in distance metrics that capture not just the

similarity of these two. But the fact that these

two are also quite similar, in terms of the trajectory

of the plot to this one. So can we come up with measures

that capture this one as well? A very common one for this

is Pearson correlation. So in Pearson correlation,

we’re gonna look at not just the expression of a

gene across conditions. But we’re gonna look at

the z-score of that gene. So we’ll take all

of the data for all of the genes in a

particular condition. And we’ll compute the

z-score by looking at the difference

between the expression of a particular gene, in

the average expression across the whole data set. And we’re going to normalize

it by the standard deviation. Yes? AUDIENCE: [INAUDIBLE]

square there? PROFESSOR: Yes, you’re right. There should be a square there. Thank you. So then to compute the

Pearson correlation between two genes, A and B, we’re going to take the product of the z-score for A and the z-score for B, summed over all the experiments. And these values, as

we’ll see in a second, are going to range

from plus 1, which would be a perfect

correlation, to minus 1, which would be a perfect

anti-correlation. And then we’re going to define the distance as 1 minus this value. So things that are perfectly correlated would have a distance of zero, and things that are anti-correlated would have a large distance, close to 2.
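In code, the z-score-based distance looks roughly like this. The profiles are invented, and note that the same trajectory on different scales gives a distance near 0:

```python
# Pearson-correlation distance: z-score each profile, average the
# products of z-scores to get r in [-1, 1], then define distance = 1 - r.

import math

def zscores(v):
    n = len(v)
    mean = sum(v) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in v) / n)
    return [(x - mean) / sd for x in v]

def pearson_distance(a, b):
    za, zb = zscores(a), zscores(b)
    r = sum(x * y for x, y in zip(za, zb)) / len(a)
    return 1.0 - r

same_shape = pearson_distance([1, 2, 3, 2, 1], [10, 20, 30, 20, 10])
opposite = pearson_distance([1, 2, 3], [3, 2, 1])
print(same_shape, opposite)  # near 0.0 and 2.0
```

Compare this with the Euclidean distance, which would call the two same-shape profiles very different.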

So if we take a look at these two: obviously, by Euclidean distance they’d be quite different from each other. But when we convert the expression values into z-scores over here, you can see that this one is the most negative of all of them, and this is the lowest one in all of these. This one’s the highest. And similarly for the red

one, lowest to the highest. So the z-scores track very well. And when I take the

product of this, the signs of the z-score for

A and B are always the same. So I summed the product of the

z-scores, I get a large number. And then the

normalization guarantees that it comes out to one. And so the red

and blue here will have a very high

correlation coefficient. In this case, it’s going

to be an r correlation coefficient of 1. Whereas compared to this one,

which is relatively flat, the correlation coefficient

will be approximately zero. Any questions on that? So what about two genes with opposite expression patterns? Well, their z-scores are going to have almost the opposite

sign every single time. And so that’s going to add

up to a large negative value. So for these, they’ll be

highly anti-correlated. So this pair would have a correlation coefficient of minus 1. OK. So we have these

two different ways of computing distance measures. We can compute the

Euclidean distance, which would make the

red and blue the same, but treat the green one as

being completely different. Or we have the

correlation, which would group all of these

together, as being similar. What you want to do is going

to depend on your setting. If you look in your

textbook, you’ll see a lot of other definitions

of distance as well. Now what if you’re missing

a particular data point? This used to be a lot more of a problem with arrays than it is with RNA-seq. With arrays, you’d often have dirt on the array that would literally cover up spots. But you have a bunch of choices. The most extreme would

just be to ignore that row or column of your matrix across all data sets. That’s usually not

what we want to do. You could put in some arbitrary small value. But frequently we will do what’s called imputing, where we’ll try to identify the genes that have the most similar expression, and replace the missing value with a value from the ones that we do know. Distance metrics: pretty straightforward.
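A toy version of that imputation, with invented data and `None` marking the missing measurement:

```python
# Nearest-neighbor imputation: fill a missing value using the gene whose
# profile is closest over the conditions both genes were measured in.

import math

def distance_over_shared(a, b):
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return math.sqrt(sum((x - y) ** 2 for x, y in shared))

def impute(target, others):
    """Return a copy of `target` with each None filled from the nearest gene."""
    filled = list(target)
    for i, value in enumerate(target):
        if value is None:
            candidates = [g for g in others if g[i] is not None]
            nearest = min(candidates, key=lambda g: distance_over_shared(target, g))
            filled[i] = nearest[i]
    return filled

gene = [1.0, None, 3.0]
neighbors = [[1.1, 2.2, 3.1], [9.0, 9.0, 9.0]]
print(impute(gene, neighbors))  # takes 2.2 from the much closer first neighbor
```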

Now we want to use these distance metrics to actually cluster the data. And what’s the idea here? That if we look across

enough data sets, we might find certain groups of

genes that function similarly across all those

data sets, that might be revealing as to their

biological function. So this is an example of an

unsupervised learning problem. We don’t know what the

classes are, before we go in. We don’t even know

how many there are. We want to learn from the data. This is a very large

area of machine learning. We’re just gonna

scrape the surface. Some of you may be

familiar with the fact that these kinds of

machine learning algorithms are used widely

outside of biology. They’re used by

Netflix to tell you what movie to choose next. Or Amazon, to try to

sell you new products. And all the advertisers who send

pop-up ads on your computer. But in our biological

setting then, we have our gene expression

data, collected possibly over very large

numbers of conditions. And we want to find

groups of genes that have some similarity. This is a figure from one of

these very early papers, that sort of established how people present these data. So you’ll almost always see

the same kind of presentation. Typically you’ll get a heat

map, where genes are rows. And the different experiments, here time points, but they could be different perturbations, are the columns. And genes that go up in expression are red, and genes that go down in expression are green. And apologies to anyone

who’s colorblind. But that’s just what the

convention has become. OK, so then why cluster? So if we cluster

across the rows, then we’ll get sets of genes

that hopefully, if we do this properly, behave similarly across different subsets of the experiments. And those might represent

similar functions. And if we cluster

the columns, then we get different experiments

that show similar responses. So that might be in this

case, different times that are similar. Hopefully those are ones

that are close to each other. But if we have lots

of different patients, as we’ll see in a

second, they might represent patients who have a

similar version of a disease. And in fact, the clustering

of genes does work. So even in this

very early paper, they were able to identify a

bunch of subsets of genes that showed similar expression

at different time points, and turned out to be enriched

in different categories. These ones were enriched in

cholesterol biosynthesis, whereas these were enriched

in wound healing, and so on. So how do you actually

do clustering? This kind of clustering

is called hierarchical. That’s pretty straightforward. There are two versions of

hierarchical clustering. There’s what’s called

agglomerative and divisive. In agglomerative, you start off with each data point in its own cluster. And then you search for the most similar data point to it, and you group those together. And you keep doing that iteratively, building up larger and larger clusters.
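That loop can be sketched directly. Single linkage is used here just to make the merge rule concrete; the points and the distance function are invented:

```python
# Agglomerative clustering: every point starts as its own cluster, and
# the two closest clusters are merged repeatedly until the requested
# number of clusters remains.

def agglomerate(points, n_clusters, dist):
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None  # (distance, i, j) of the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

dist = lambda a, b: abs(a - b)
print(agglomerate([0.0, 0.2, 5.0, 5.1, 9.0], 3, dist))
```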

So we’ve discussed how to compare our individual genes. And you should be able, right now, if I gave you the vector of expression for a single gene, to find the other genes in the data set that are most similar, by either, say, Euclidean distance or Pearson correlation, or what have you. But once you’ve grouped

two genes together, how do you decide

whether a third gene is similar to those two? So now we have to

make some choices. And so there are number

of different choices that are commonly made. So let’s say these are our data. We’ve got these two

clusters, Y and Z. And each circle represents a

data point in those clusters. So we’ve got four

genes in each cluster. Now we want to decide

on a distance measure to compare cluster Y to

cluster Z. So what could we do? So what are some possibilities? What might you do? AUDIENCE: We could take

the average of all points. PROFESSOR: You could take the

average of all points, right. What else could you do? Only a limited number

of possibilities. AUDIENCE: Centroid? PROFESSOR: Yeah,

so centroid, you could take some sort

of average, right. Any other possibilities? AUDIENCE: You can

pick a representative from each set [INAUDIBLE]. PROFESSOR: So you could pick

a representative, right? How would you decide in advance

what that would be though? So maybe you have

a way, maybe not. And what other

possibilities are there? Yeah? AUDIENCE: Measure all

the distances [INAUDIBLE] to all the nodes in the other. PROFESSOR: Right. So you could do all to all. What else could you do? You can take the minimum

of all those values. You can take the maximum

of all those values. And we’ll see that all those

are things that people do. Now, there are some rather uninformative terms for these kinds of decisions. So it’s called single linkage when you decide that the distance

between two clusters is based on the minimum distance

between any member of cluster Y and any member of cluster Z. Complete linkage takes

the maximum distance. And then the extremely

unfortunately named Unweighted Pair Group Method

using Centroids– UPGMC, I won’t try to say

that very often– takes the centroid, which was an

early suggestion from the class. And then the UPGMA,

Unweighted Pair Group Method with Arithmetic Mean,

takes the average of all the distances,

all suggestions that people have made. So when would you use

one versus the other? Well, a priori, you

don’t necessarily know. But it’s good to know

how they’ll behave. So what do you imagine

is going to happen if you use single linkage,

versus complete linkage. Remember, single linkage

is the minimum distance. And complete linkage is

the maximum distance. So what’s going to

happen in this case, if I use the minimum distance. Which two groups will I combine? AUDIENCE: The blue and the red. PROFESSOR: The blue

and the red, right? Whereas if I use the

maximum distance, then I’ll combine the

green and the red. So it’s important

to recognize, then, that single linkage has this property of chaining together clusters based on points that are near each other, whereas complete linkage is resistant to grouping things together if they have outliers. So they’ll behave differently.
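The two rules differ only in which pairwise distance they report. Here are 1-D toy points, where the second cluster has one member near Y and one far outlier:

```python
# Single linkage scores cluster distance by the closest pair of members;
# complete linkage by the farthest pair. An outlier dominates complete
# linkage but is ignored by single linkage.

def single_linkage(ys, zs, dist):
    return min(dist(y, z) for y in ys for z in zs)

def complete_linkage(ys, zs, dist):
    return max(dist(y, z) for y in ys for z in zs)

dist = lambda a, b: abs(a - b)
y = [0.0, 1.0, 2.0]
z = [2.5, 6.0]

print(single_linkage(y, z, dist))    # 0.5 -- the near member dominates
print(complete_linkage(y, z, dist))  # 6.0 -- the outlier dominates
```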

Now, if your data are compact, and you really do have tight clusters, it’s not going to matter too much which you use. But in most biological settings, we’re dealing with much noisier data. So you actually will get different results based on this. And as far as I know, there’s no really principled way to figure out, if you have no

prior knowledge, which to use. Now all of these hierarchical clusterings come with what’s called a dendrogram. And you’ll see these at the top of all the clusterings. And this represents

the process by which the data were clustered. So the things that

are most similar are most tightly connected

in this dendrogram. So these two data

points, one and two, you have to go up very little in the

y-axis, to get from one to two. Whereas if you want

to go from one to 16, you have to traverse

the entire dendrogram. So the distance

between two samples is how far vertically you have

to go to connect between them. Now the good thing about the dendrogram is, you can then understand the clustering of the data. So I can cut this dendrogram at any particular distance, and get clear divisions among my data sets. So if I cut here at

this distance level, then I have two groups. One small, one

consisting of these data. And one large, one

consisting of these. Whereas if I cut down here, I

have more groups of my data. So it doesn’t require me in

advance to know how many groups I have. I can look at the dendrogram and infer it. The one risk is that you always get a dendrogram that’s hierarchical, regardless of whether the data were hierarchical or not. So it’s more a reflection of

how you did your clustering than any fundamental

structure of the data. So the fact that you get a hierarchical dendrogram means really nothing

about your data. It’s simply a tool

that you can use to try to divide it up

into different groups. Any questions on the

hierarchical clustering? Yes? AUDIENCE: If each data

point is its own cluster, then won’t that be consistent

across, like, single linkage, complete linkage– like,

why would you cluster? Does that question make sense? Like if you cut it down below,

then haven’t you minimized– don’t you successively

minimize the variance, I guess, up to your clusters, by– PROFESSOR: So if I cut

it at the lowest level, everybody is their own cluster. That’s true. Right. I’m interested in

finding out whether there are genes that behave

similarly across the data sets. Or– AUDIENCE: My question

is, how would you go about determining how

many clusters your want? PROFESSOR: Oh, OK. So we’ll come to

that in a second. So hierarchical clustering,

you don’t actually have any objective

way of doing that. But we’ll talk about

other means right now, where it’s a little bit clearer. But actually

fundamentally, there aren’t a lot of good ways

of knowing a priori what the right number of clusters is. But we’ll look at some

measures in a second that help. So hierarchical clustering,

as your question implies, doesn’t really tell you how

many clusters there are. Another approach is

to decide in advance how many clusters you expect. And then see whether you can

get the data of the group into that number or not. And an example of

that is something called k-means clustering. So the nice thing

about it, is it does give you the

sharp divisions. But again, if you chose k incorrectly, as we’ll see in a second, you’ll nevertheless still get k clusters. So k refers to the number of clusters that you tell the algorithm

you expect to get. So you specify that in advance. And then you try to find a set of clusters that minimizes the distance between each point assigned to a particular cluster and the center of that cluster. Is that clear? So that’s what these

equations represent. So the center of the

cluster, the centroid, is just the average coordinates,

over all the components of that cluster. And we’re trying to find

this set of clusters, C, that minimizes the sum of

the square of the distances between each member of that

cluster and the centroid. Any questions on how

we’re doing this? OK. All right. So what’s the actual algorithm? It’s remarkably simple. I choose an initial set of random positions. And then I have this simple loop,

I repeat until convergence. For every point, I assign

it to the nearest centroid. So say my starting centroids are these circles. I look at every data

point, and I ask, how close is it to

any one of these? That’s what the boundaries

are, defined by these lines. So everything above this

line belongs to the centroid. Everything over here

belongs to this centroid. So I divide the data up by which

centroid you are closest to. And I assign you

to that centroid. That’s step one. And step two, I

compute new centroids. And that’s what these

triangles represent. So after I did

that partitioning, it turns out that most

of the things that were assigned to the triangular

cluster live over here. So the centroid moves

from being here to here. And I iterate this process. That’s the entire K-means

clustering algorithm. So here’s an example

where I generated data from three Gaussians.
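That two-step loop, and the within-cluster sum of squares it is minimizing, can be sketched in plain Python. This is an illustrative sketch, not code from the lecture; the three-Gaussian data, the seed, and the range of k values are all made up:

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mean(pts):
    """Centroid: the coordinate-wise average of a set of points."""
    return tuple(sum(coord) / len(pts) for coord in zip(*pts))

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)              # arbitrary initialization
    for _ in range(iters):
        # Step 1: assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda c: dist2(p, centroids[c]))].append(p)
        # Step 2: recompute each centroid as the mean of its members.
        new = [mean(cl) if cl else centroids[i] for i, cl in enumerate(clusters)]
        if new == centroids:                       # converged: nothing moved
            break
        centroids = new
    # The objective: total within-cluster sum of squared distances (this is
    # what the elbow plot later in the lecture puts on the y-axis).
    wss = sum(min(dist2(p, c) for c in centroids) for p in points)
    return centroids, wss

# Three made-up Gaussian blobs; the drop in WSS flattens out past k = 3.
rng = random.Random(1)
data = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
        for cx, cy in [(0, 0), (10, 10), (0, 10)] for _ in range(30)]
for k in (1, 2, 3, 4):
    print(k, round(kmeans(data, k)[1], 1))
```

Note that because the initialization is random, different seeds can give different local minima, exactly as discussed below.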

I chose initial data points, which are the circles. I follow that protocol. Here’s the first step. It computes new triangles. Second step, and

then it converges. The distance stops changing. Now this question’s

already come up. So what happens if you

choose the wrong K? So I believe there

are three clusters. And really that’s not the case. So what’s going to happen? So in this data set, there really were– how many? There really were five clusters. Here, they’re

clustered correctly. What if I told the

algorithm to do K-means clustering

with a K of three? It would still find a way to

come up with three clusters. So now it’s grouped these

two things, which are clearly generated from different

Gaussians together. It’s grouped these two, which were generated from different Gaussians

together, and so on. All right. So K-means clustering

will do what you tell it to do, regardless of

whether that’s the right answer or not. And if you tell it there are

more clusters than you expect– than really are

there, then it’ll start chopping up well-defined

clusters into sub-clusters. So here it split this elongated

one into two sub-clusters. It split this one

arbitrarily into two. Just so it gets the final

number that we asked for. Then how do you know what to do? Well, as I said, you don’t– there’s no guaranteed way to know. But one thing you can

do is make this kind of plot, which shows for

different values of K on the x-axis, the sum of the

distances within the cluster. So the distance to the

centroid within each cluster on the y-axis. And as I increase the number

of K’s, when I’m correctly partitioning my

data, when there really are more subgroups than

I’ve already defined, then I’ll see big drops. So I go from saying there are

two to three in that case. I get a big drop in the distance

between members of the cluster. Because I’m no longer grouping a data point over here in this cluster with a data point in that cluster. But once I go beyond the

correct number, which was five, you see that the benefits

really start to trail off. So there’s an

inflection point here. There’s an elbow– sometimes

it’s called an elbow plot. After I go past

the right number, I get less and less benefit

from each additional cluster. So this gives us

an empirical way of choosing approximately

a correct value for K. Any questions on K-means? Yes? AUDIENCE: Does K-means

recapitulate the clusters that you would

get if you cut off your dendrogram from hierarchical

clustering at a certain level? PROFESSOR: Not necessarily. AUDIENCE: OK. But maybe. I don’t know. It sort of seems to me

as if you picked a level where you have a certain

number of clusters, that that’s similar, at least by

centroid, by using the center? PROFESSOR: Yeah, I think because

of the way that you do it, you’re not even guaranteed

to have a level, where you have exactly the

right– other questions? Yes? AUDIENCE: Could you

just very quickly go over how you

initialized where the starting points

are, and the break ups? PROFESSOR: All right,

so the question is how do you initialize

the starting points? In fact, you have to make some

arbitrary decisions about how to initialize the starting points. So they’re usually chosen at random. And you will get

different results, depending on how you do that. So that’s another–

so when you do it, it’s non-deterministic

in that sense. And you often want to

initialize multiple times. And make sure you

get similar results. Very good question. And in fact, that

was not a set up. But what happens if you

choose pathologically bad initial conditions? So you have the potential to

converge to the right answer. But you’re not guaranteed to

converge to the right answer. So here’s an example where

I had– I guess there really are three clusters in the data. I chose [INAUDIBLE]

three, but I stuck all my initial coordinates down

in the lower right-hand corner. And then when I do the

clustering, if things go well, I get the right answer. But we’re not guaranteed. But one thing we are guaranteed,

is we always get convergence. So the algorithm will converge. Because at each

step, it’s either reducing the objective function,

or it’s leaving it the same. So we’re guaranteed convergence. But it may be as we’ve seen

previously in other settings, we may end up with a local minimum, rather than the global optimum. And the way to fix

that then would be to initialize again,

with new starting positions. Other questions? What about a setting like this? Where we’ve got two

well-defined clusters, and somebody who lives

straight in the middle. So what’s the

algorithm going to do? Well, sometimes it’ll put it in one cluster. And sometimes it’ll end up in the other. So an alternative to

K-means clustering, which has to make one or the

other arbitrary decision, is something that’s called

fuzzy K-means, which can give something partial membership in both clusters. And it’s very similar in

structure to the K-means, with one important

difference, which is a membership variable,

that tells you for every data point, how much it belongs

to cluster one, cluster two, cluster three, and so on. So in both algorithms,

we start off by choosing initial

points as the cluster means, and looping through

each of them. Now previously, we would make

a hard assignment of each data point x sub i to

a single cluster. And here we’re going to

calculate the probability that each data point

belongs to a cluster. And that’s where you

get the fuzziness, because you could have a non-unit, nonzero probability of belonging to any of the clusters. Now in K-means, we recalculated the mean value by just looking at the average

of everybody in that cluster. Now in fuzzy K-means, we don’t

have everybody in the cluster. Because everybody belongs

partially to the cluster. So we’re going to take

a weighted average. So here are the details

of how you do that. In K-means, we are

minimizing this function. We were trying to decide the

class structure, the class memberships, that would minimize

the distance of every member of that cluster, to the defined

centroid of that cluster. Here it looks almost the same. Except we now have

this new variable, mu, which is the membership. It’s the membership of

point j, in cluster i. So I’m trying to minimize

a very similar function. But now if mu is one–

if all my mus are one, then what do I get? K-means, right? But as soon as the mus are

allowed to vary from one, they can be between

zero and one, then points can

contribute more or less. So that point there was stuck in

the middle of the two clusters, if it had a mu of

0.5 for each, it would contribute half to each. And then both the

centroids would move a little bit

towards the middle. So what’s the

result of K-means– I’m sorry, fuzzy

K-means clustering? We still get K clusters. But now every gene or every

object that we’re clustering has a partial membership. So here’s an example

of that, where they did fuzzy K-means

clustering, with these six different clusters. But now every

profile, every gene, has a color associated with it,

that represents this mu value. It goes from zero to one, with these rainbow colors. The things that

are reddish, or pink– those are the high

confidence things that are very strongly,

only in that cluster. Whereas the things that are

more towards the yellow end of the spectrum are

partially in this cluster and partially in other clusters. Questions? Any questions? So K-means, we’ve defined in

terms of Euclidean distance. And that has clear advantages,

in terms of computing things very easily. But it has some

disadvantages as well. So one of the disadvantages

is because we’re using the squared

distance, then outliers have a very big effect. Because I’m squaring the

difference between vectors. That may not be the worst thing. But they also

restrict us to things for which we can

compute a centroid. We have to have data

that are– four or more, you can actually

compute the mean value of all members of the cluster. Sometimes you want

to cluster things that we only have

qualitative data. Where instead of having

a distance measure, we have similarity. This doesn’t come up

quite as often in– well, it certainly doesn’t come

up in gene expression data or RNA-seq. But you can

imagine more qualitative data, where you ask people

about similarity between different things

or behavioral features, where you know the similarity

between two objects. But you have no way of

calculating the average object. One setting that you might

[INAUDIBLE] have looked at– if you’re trying to

cluster say, sequence motifs that you’ve computed

with the EM algorithm. So what’s the average

sequence motif? That doesn’t necessarily

represent any true object, right? You might be better off–

you can calculate it. But it doesn’t mean anything. You might be better

off calculating using rather than the

average motif, the most central of the motifs that

you actually observed. So that would be called

a medoid, or an exemplar. It’s a member of your cluster

that’s closest to the middle, even if it’s not

smack dab in the middle. So instead of K-means, we can

just think, well, K-medoids. So in K-means, we actually

computed a centroid. And in medoids, we’ll

choose the existing data point that’s most central. So what does that mean? If these are my data, the true

mean is somewhere over here. But this one is the medoid. It’s an exemplar that’s

close to the central point. But if there actually

isn’t anything here, then there isn’t. So we’re going to use

the thing that’s closest. So if these were

all sequence motifs, rather than using some

sequence motif that doesn’t exist as the

center of your cluster, you would use a sequence motif

that actually does exist and is close to the center. So it’s a simple

variation on the K-means. Instead of choosing K

points in arbitrary space as our starting

positions, we’re going to choose K examples from the

data as our starting medoids. And then we’re going to place

each point in the cluster that has the closest medoid, rather than the closest mean. And then when we

do the update step, instead of choosing the

average position to represent the cluster, we’ll

choose the medoid. The exemplar that’s

closest to the middle. Any questions on this? Yes? AUDIENCE: So if

you use the medoid, do you lose the

guaranteed convergence? Because I can

picture a situation where you’re sort of

oscillating because now you have a discrete step. PROFESSOR: That’s

a good question. That’s probably right. Actually, I should

think about that. I’m not sure. Yeah, that’s probably right. Other questions? OK. There are a lot of other
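The medoid variant only ever consults pairwise distances, never an average, which is the whole point: it works for sequence motifs, or anything else with a distance measure but no meaningful mean. A minimal sketch with a made-up toy distance matrix (illustrative only):

```python
import random

def kmedoids(dist, k, iters=50, seed=0):
    """dist: full precomputed pairwise distance matrix. The algorithm never
    averages objects -- it only compares distances."""
    n = len(dist)
    medoids = random.Random(seed).sample(range(n), k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assign every object to its nearest medoid.
        clusters = [[] for _ in range(k)]
        for j in range(n):
            clusters[min(range(k), key=lambda c: dist[j][medoids[c]])].append(j)
        # New medoid = the member with the smallest total distance to the
        # other members of its cluster (an exemplar, not a computed mean).
        new = [min(cl, key=lambda cand: sum(dist[cand][o] for o in cl))
               if cl else medoids[i] for i, cl in enumerate(clusters)]
        if new == medoids:
            break
        medoids = new
    return medoids, clusters

# Toy 1-D example: two tight groups; each medoid comes out as the middle
# member of its group, an actual data point rather than an average.
values = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in values] for a in values]
meds, _ = kmedoids(dist, 2)
print(sorted(values[i] for i in meds))
```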

techniques for clustering. Your textbook talks about

self-organizing maps, which at one point were quite popular. And there’s also

a nice technique called affinity

propagation, which is a little bit outside

the scope of this course, but has proved quite

useful for clustering. OK. So why bother to do

all this clustering? Our goal is to try to find

some biological information, not just to find

groups of genes. So what can you do

with these things? Well, one thing that

was identified early on, is if I could find sets of genes

that behave similarly, maybe those could be used

in a predictive way, to predict outcomes

for patients, or some biological function. So we’re going to

look at that first. So one of the early

papers in this field did clustering of

microarrays for patients who had B-cell lymphoma. The patients had different

kinds of B-cell lymphomas. And so they took their

data, they clustered it. Again, each row

represents a gene. And each column

represents a patient here. And with this projector, it’s

a little bit hard to see. But when you look at

the notes separately, you’ll be able see

that in the dendrogram,

division between two large groups of patients. And it turns out that when

you look at the pathologist’s annotations for these

patients, which was completely independent of the

gene expression data, all of the patients in the left hand group– almost all the patients

in the left hand group, had one kind of lymphoma. And all the patients

in the right hand group had a different

kind of lymphoma. And this got people

very excited. Because it suggested that

the pure molecular features might be at least as good

as pathological studies. So maybe you could completely

automate the identification of different tumor types. Now the next thing that got

people even more excited, was the idea that maybe

you could actually use these patterns not

just to recapitulate what a pathologist would

find, but go beyond it, and actually make predictions

for the patients. So in these plots–

I don’t know if we’ve seen these before

yet in the class. But on the x-axis is survival time. On the y-axis is the fraction of patients in a particular group,

who survived that long. So as the patients

die, obviously the curve is dropping down. Each one of these

drops represents the death of a patient,

or the loss of the patient to the study for other reasons. And so in the middle,

let’s start with this one. This is what the clinicians

would have decided. There are here, patients

that they defined by clinical standards as

being likely to do well, versus patients whom they

defined by clinical standards, as likely to do poorly. And you could see there

is a big difference in the plots for the

low clinical risk patients at the top, and

the high clinical risk patients at the bottom. On the left hand

side is what you get when you use purely

gene expression data to cluster the patients into

groups that turn out to be high risk or low risk. And you can see that

it’s a little bit more statistically significant

for the clinical risk. But it’s pretty

good over here, too. Now the really

impressive thing is, what if you take the

patients that the clinicians define as low clinical risk? And then you look at their

gene expression data. Could you separate

out the patients in that allegedly

low clinical risk who are actually at high risk? And maybe then they

would be diverted to have more aggressive

therapy than patients who really and truly

are low risk patients. And what they will

show with just barely statistical significance,

is that even among the clinically

defined low risk patients, there is– based on

these gene signatures– the ability to

distinguish patients who are going to do

better, and patients who are going to do worse. So this was over a decade ago. And it really set off a

frenzy of people looking for gene signatures for

all sorts of things, that might be highly predictive. Now the fact that

something is correlated, doesn’t of course

prove any causality. So one of the questions is, if

I find a gene signature that is predictive of an outcome

in one of the studies, can I use it then to go

backwards, and actually define a therapy? In the ideal setting, I would

have these gene signatures. I’d discover that

they are clinically associated with outcome. I could dig in and

discover what makes the patients to do worse, worse. And go and treat that. So is that the case or not? So let me show you some data

from a breast cancer data set. Here’s a breast cancer data set. Again the same kind of plot,

where we’ve got the survival statistic on the y-axis, the

number of years on the x-axis. And based on a gene

signature, this group has defined a group

that does better, and a group that does worse,

the p-value is significant. And the hazard ratio, the death rate versus the control group, is approximately two. OK. So does this lead us to

any mechanistic insight into breast cancer? Well, it turns out in this

case, the gene signature was defined based on

postprandial laughter. So after dinner humor. Here’s a gene set

that defined something that has absolutely nothing

to do with breast cancer, and it’s predicting the outcome

of breast cancer patients. Which leads to the somewhat morbid joke that they’re testing whether laughter really is the best medicine. OK. So they went on– they

tried other gene sets. Here’s the data

set– gene set that’s not even defined in humans. It’s the homologs of

genes that are associated with social defeat in mice. And once again, you get a

statistically significant p-value, and good hazard ratios. So what’s going on? Well, these are not from

a study that’s actually trying to predict an

outcome in breast cancer. It’s a study that shows

that most gene expression– most randomly selected

sets of genes in the genome will give an outcome that’s

correlated– a result that’s correlated with a patient

outcome in breast cancer. Yes? AUDIENCE: I’m a little confused. In the previous graph, could you

just explain what is the black and what is the red? Is that individuals or groups? PROFESSOR: So the

black are people that have the gene set signature, who have high levels

of the genes that are defined in this gene set. And the red are the ones that have low, or the other way around. But it’s dividing all patients into two groups, based on whether they have a

particular level of expression in this gene set,

and then following those patients over time. Do they do better or worse? And similarly for

all these plots. And he had another one which is

a little less amusing, location of skin fibroblasts. The real critical point is this. Here, they compared

the probability based on expectation

an that all genes are independent of each

other, the probability that that gene signatures

correlated with outcome, for genes there were chosen

at random or genes that were chosen from a database

of gene signatures, that people have identified as

being associated with pathways. And you get a very,

very large fraction. So this axis is the log of the p-value, so negative values are more significant. A huge fraction

of all gene sets that you pull at

random from the genome, or that you pull from a

compendium of known pathways, are going to be

associated with outcome, in this breast cancer data set. So it’s not just well

annotated cancer pathways, that are associated. Its gene sets

associated as we’ve seen, with laughter or

social defeat in mice, and so on– all sorts

of crazy things, that have no mechanistic

link to breast cancer. Let’s take a second

for that to sink in. I pull genes at random

from the genome. I define patients

based on whether they have high levels of expression

of a random set of genes, or low levels of expression

of that random set of genes. And I’m extremely likely

to be able to predict the outcome in breast cancer. So that should be rather

disturbing, right? So it turns out– before

we get to the answer then– so this is not

unique to breast cancer. They went through a whole bunch

of data sets in the literature. Each row is a different

previously published study, where someone had claimed

to identify a signature for a particular kind

of disease or outcome. And they took their

random gene sets and asked how well the random gene sets did in predicting the outcome in these patients. And so these yellow

plots represent the probability distribution

for the random gene sets– again on

this projector, it’s hard to see– but there’s a highlight on the left hand side at where the best 5% of the random gene sets are. This blue line is

the nominal measure of statistical significance. It turns out that a

few of these studies didn’t even reach a normal level

of statistical significance, let alone comparing

to random gene sets. But for most of

these, you don’t do better than a good fraction

of the randomly selected gene sets. So how could this be? So it turns out there is an

answer to why this happens. And it’s really

quite fascinating. So here, we’re using

the hazard ratio, which is the death rate

for the patients who have the signature,

over the control group. So a high hazard ratio means the signature is strongly associated with a bad outcome. And they’ve plotted that against

the correlation of the genes in the gene signature, with

the expression of a gene called PCNA, Proliferating

Cell Nuclear Antigen. And it turns out a very, very

large fraction of the genome is coexpressed. So genes are not expressed like

random, completely independent random variables. There are lots of genes that

show very similar expression levels, across

all the data sets. Now PCNA is a gene that’s

been known by pathologists for a long time,

as having higher levels than most

digressive tumors. So a very, very large

fraction of the genome is coexpressed with PCNA. Then high levels of

randomly selected genes are going to be a very good

predictor of tumor outcome. Because high levels of

randomly selected genes also means a very high probability of having a high level of PCNA,

which is a tumor marker. So we have to proceed

with a lot of caution. We can find things that are

highly correlated with outcome, that could have good value in

terms of prognostic indicators. But there are going to

be a lot of possibilities for sets of genes that

have that property: they’re good

predictors of outcome. And many of them will have absolutely nothing to do, causally, with the process of the disease. So at the very least, it means

don’t start a drug company over every set of genes that you identify as associated with outcome. But the worst-case

scenario, it also means that those predictions

will break down under settings that we haven’t yet examined. And so that’s the

real fear, that you have a gene set signature that you think is highly predictive of outcome. But it’s only predictive because you looked at a particular set of patients. Then you look at a

different set of patients, and that correlation

will break down. So this is an area of research

that’s still quite in flux, in terms of how

much utility there will be in identifying

gene set signatures, in this completely

objective way. And what we’ll see in the

course of this lecture and the next one,

is it’s probably going to be much more

useful to incorporate other kinds of information

that will constrain us to be more mechanistic. Any questions? All right. So now we’re going to

really get into the meat of the identification

of gene modules. And we’re going to try to

see how much we can learn about regulatory structure

from the gene expression data. So we’re going to move up from

just the pure expression data– say these genes at the

bottom, to try to figure out what set of transcription factors were driving them, and maybe what signaling pathways lived upstream of those transcription factors and turned them on. And the fundamental difference

then between clustering– which is what we’ve been looking at until now, and these modules, as people like to call

them– is that you can have a whole bunch

of genes, and we’ve just seen that, that are

correlated with each other, without being causally

linked to each other. So we’d like to figure out which

ones are actually functionally related, and not just

statistically related. And the paper that’s going

to serve as our organizing principle in the

rest of this lecture, maybe bleeding into

the next lecture, is this recently published paper, called the DREAM5 challenge. And this, like some of these

other challenges that we’ve seen before, is the case where

the organizers have data sets, where are they know

the answer to what the regulatory structure is. They send out the data. People try to make the

best predictions they can. And then they unseal

the data, to let people know how well they did. And so you can get a

relatively objective view of how well different

kinds of approaches work. So this is the overall

structure of this challenge. They had four different

kinds of data. Three are real data sets from

different organisms, E. coli, yeast, and

Staphylococcus aureus. And then the fourth one,

the one at the top here, is completely synthetic

data that they generated. And you get a sense of the

scale of the data sets. So how many genes are involved,

how many potential regulators. In some cases, they’ve given

you specific information on knockouts, antibiotics,

toxins, that are perturbing the system. And again here, the

number of conditions that are being looked at,

the number of arrays. So then they provide

this data in a way that’s very hard

for the groups that are analyzing to trace it

back to particular genes. Because you don’t want people to

use external data necessarily, to make their predictions. So everyone makes

their predictions. They also, as part

of this challenge, they actually made

their own metapredictions, based on the

individual predictions by different groups. And we’ll take a look

at that in a second. And then they score

how well they did. Now we’ll get into the details

of the scoring a little bit later. But what they found at the highest level is that different kinds of methods behaved similarly. So the main groups

that they found were these regression-based

techniques. We’ll talk about

those in a second. Bayesian networks,

which we’ve already discussed in a

different context. A hodgepodge of different

kinds of things. And then mutual information

and correlation. So we’re going to look in

each of these main categories of prediction methods. So we’re going to start with

the Bayesian networks, which we just finished talking

about in a completely different context. Here, instead of trying to

predict whether an interaction is

experimental data, we’re going to try to predict

whether a particular protein is involved in regulating a set of

genes, based on the expression data. So in this context– let’s

say I have cancer data sets, and I wanted to decide

whether p53 is activated in those tumors, So this

is a known pathway for p53. So if I told you

the pathway, how might you figure out if p53

is active from gene expression data? I tell you this pathway, give

you this expression data– what’s kind of a simple

thing that you could do right away, to decide whether you

think p53 is active or not? p53 is a transcriptional activator, so it should be turning on its target genes when it’s on. So what’s an

obvious thing to do? AUDIENCE: Check the expression

levels from the targets. PROFESSOR: Thank you. Right, so we could check

the expression levels. The targets compute some

simple statistics, right? OK. Well, that could work. But of course there could

be other transcriptional regulators that regulate

a similar set of genes. So that’s not a

guarantee that p53 is on. It might be some other

transcriptional regulator. We could look for the

pathways that activate p53. We could ask whether

those genes are on. So we’ve got in this

pathway, a bunch of kinases, an ATM, CHK1, and so

on, that activate p53. Now if we had proteomic

data, we could actually look whether those proteins

are phosphorylated. But we have much, much

less proteomic data. And most of these settings

only have gene expression data. But you look at, is

that gene expressed? Has the expression of one

of these activating proteins gone up? And you can try to

make an inference then. From whether there’s more of

these activating proteins, then maybe p53 is active. And therefore it’s

turning on it’s targets. That’s one step removed. So just the fact that there’s

a lot of ATM mRNA around doesn’t mean that there’s a

lot of the ATM protein, which certainly doesn’t mean that

the ATM is phosphorylated and turning on its target. So again, we don’t

have a guarantee there. We could look more specifically

whether the genes are differentially expressed. So the fact that they’re on

may not be as informative as if they were uniquely

on in this tumor, and not on in control cells

from the same patient. So that can be informative. But again changes

in gene expression are not uniquely related to

changes in protein level. So we’re going to have to

behave with a bit of caution. So the first step we’re going

to take in this direction, is try to build a

Bayesian network. That’s going to give us a way

to reason probabilistically over all of these kinds of data,

which by themselves are not great guarantees that we’re

getting the right answer. Just like in the protein

prediction interaction problem, where individually coexpression

wasn’t all that great, essentiality wasn’t

all that great. But taken together, they

could be quite helpful. So we want to compute

the probability that the p53 pathway is

active, given the data. And the only data we’re

going to have in the setting is gene expression data. So we’re going to assume

that for the targets of a transcription

factor to be active, the transcription

factor itself has to be expressed

at a higher level. That’s a restriction

of analyzing these kinds of data

that’s very commonly used. So we’re going to try to

compute the probability that p53 is activated, given the data. So how would I compute

the probability, that given that some

transcription factor is on, I see expression

from target genes? How would I do this? I would just go into the data,

and just count in the same way that we did in our

previous setting. We could just look over

all the experiments and tabulate whether one of the

targets is up in expression, how often is the transcription

factor that’s potentially activating it up? And how often are all the

possible combinations the case? And then we can use

Bayesian statistics to try to compute

the probability that a transcription

factor is up, activated, given that I’ve seen the

gene expression data. Is that clear? Good. So we want to try to not include
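The tabulation and the Bayes step just described can be sketched concretely. Everything below is invented for illustration (the 0/1 "up" calls, the ten experiments); real data would be discretized expression values:

```python
# Hypothetical discretized calls across ten experiments: 1 = "up", 0 = "not up".
tf     = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]   # the candidate transcription factor
target = [1, 1, 0, 0, 1, 1, 0, 0, 1, 0]   # one of its putative targets

def freq(xs):
    """Marginal frequency of being up, estimated by counting."""
    return sum(xs) / len(xs)

def joint(xs, ys):
    """Frequency of both being up in the same experiment."""
    return sum(1 for x, y in zip(xs, ys) if x and y) / len(xs)

p_target_given_tf = joint(tf, target) / freq(tf)        # P(target up | TF up)
# Bayes' rule: P(TF up | target up) = P(target up | TF up) P(TF up) / P(target up)
p_tf_given_target = p_target_given_tf * freq(tf) / freq(target)
print(p_target_given_tf, p_tf_given_target)
```

In a full Bayesian network these counted frequencies become the entries of the conditional probability tables, one table per variable.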

just the downstream factors. Because possibly there are multiple transcription factors that are equally likely to be driving expression of that set of genes. We want to include the

upstream regulators as well. And so here, we’re going

to take advantage of one of the properties

of Bayesian nets that we looked at: explaining away. And you’ll remember

this example, where we decided that if I see

that the grass is wet, and I know that

it’s raining, then I can consider it less likely

that the sprinklers were on. Even though there’s no causal

relationship between them. So if I see that a set of

targets of transcription factor A are on, and I have evidence

that the pathway upstream of A is on, that reduces my

inferred probability that the transcription

factor B is responsible. So that's one of the nice things about Bayesian networks– it gives us a way of reasoning automatically, over all the data, and not just the downstream targets. And the Bayesian networks
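The counting and explaining-away ideas above can be sketched in a few lines of Python. The conditional probabilities below are made-up, illustrative numbers for the rain/sprinkler/wet-grass example, not values estimated from any real data set; the posterior is computed by brute-force enumeration of the joint distribution:

```python
from itertools import product

# Hypothetical CPTs for the rain/sprinkler/wet-grass network
# (illustrative assumptions, not estimated from any data).
P_rain = {1: 0.2, 0: 0.8}
P_sprinkler = {1: 0.3, 0: 0.7}
# P(wet = 1 | rain, sprinkler)
P_wet = {(0, 0): 0.01, (0, 1): 0.85, (1, 0): 0.80, (1, 1): 0.99}

def joint(r, s, w):
    """Joint probability P(rain=r, sprinkler=s, wet=w)."""
    pw = P_wet[(r, s)]
    return P_rain[r] * P_sprinkler[s] * (pw if w == 1 else 1 - pw)

def posterior_sprinkler(evidence):
    """P(sprinkler=1 | evidence), by enumerating all assignments."""
    num = den = 0.0
    for r, s, w in product([0, 1], repeat=3):
        assignment = {"rain": r, "sprinkler": s, "wet": w}
        if any(assignment[k] != v for k, v in evidence.items()):
            continue  # inconsistent with the observed evidence
        p = joint(r, s, w)
        den += p
        if s == 1:
            num += p
    return num / den

p_wet_only = posterior_sprinkler({"wet": 1})
p_wet_and_rain = posterior_sprinkler({"wet": 1, "rain": 1})
print(p_wet_only, p_wet_and_rain)
```

Conditioning on rain lowers the sprinkler's posterior (from about 0.69 to about 0.35 with these numbers), which is exactly the explaining-away effect: the rain already accounts for the wet grass.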

can have multiple layers. So we can have one transcription factor turning on another one, which turns on another one, which turns on another one. Again, we can have as

many layers as necessary. But one thing we

can’t have are cycles. So we can’t have a

transcription factor that’s at the bottom of this,

going back and activating things that are at the top. And that’s a

fundamental limitation of Bayesian networks. We’ve already talked

about the fact that in Bayesian networks, there are two problems that we have to solve. We have to be able to define the structure, if we don't know it a priori. Here, we don't know it a priori. So we're going to have to learn

the structure of the network. And then with the

structure of the network, we’re going to have to

learn all the probabilities. So the conditional

probability tables that relate each variable to its parents. And then just two more

small points about it. So if I just give

you expression data, without any interventions–

just the observations, then I can’t decide what is a

cause and what is an effect. So here this was done in

the context of proteomics, but the same is true for

gene expression data. If I have two variables, x and

y, that are highly correlated, it could be that x activates y. It could be that y activates x. But if I perturb the system,

and I block the activity of one of these two genes

or proteins, then I can start to tell

the difference. In this case, if

you inhibit x, you don’t see any activation of y. That’s the yellow,

all down here. But if you inhibit y,

you see the full range of activity of x. So that implies that x

is the activator of y. And so in these settings, if

you want to learn a Bayesian network from data, you need

more than just a compendium of gene expression data. If you want to get the

directions correct, you need perturbations

where someone has actually inhibited particular

genes or proteins. Now, in a lot of these

Bayesian networks, we’re not going

to try to include every possible gene and

every possible protein. Either because we don’t

have measurements of it, or because we need

a compact network. So there will often

be cases where the true regulator

in some causal chain, is missing from our data. So imagine this is the true

causal chain– x activates y, which then activates z and w. But either because we

don’t have the data on y, or because we left it out to

make our models more compact, it’s not in the model. We can still pick up the

relationships between x and z, and x and w. But the data will

be much noisier. Because we're missing that information in the conditional probability tables, relating x to y, and then y to its two targets. So Bayesian networks, we've

already seen quite a lot. We now have some idea of how to

transfer them from one domain to the domain of

gene expression data. The next approach

we want to look at is a regression-based approach. So the regression-based

approaches are founded on a

simple idea, which is that the expression of a gene is going to be some function of the expression levels of its regulators. We're going to

actually try to come up with a formula that

relates the activity levels of the

transcription factors, and the activity

level of the target. In this cartoon, I’ve

got a gene that’s on under one

condition, that’s off under some other conditions. What transforms it

from being off to on, is the introduction of

more of these transcription factors, that are

binding to the promoter. So in general, I have

some predicted level of expression for the gene. It’s called the

predicted level y. And it’s some function,

unspecified at this point, f of g, of all the expression

levels of the transcription factors that regulate that gene. So just again,

nomenclature is straight, x sub g is going to be

the expression of gene x– I’m sorry, expression of gene g. This capital X, sub t of g

is the set of transcription factors, that I believe

are regulating that gene. And then f is an

arbitrary function. We’re going to have

a noise term as well. Because this is the

observed gene expression, not some sort of platonic view

of the true gene expression. Now frequently, we’ll

have a specific function. So the simplest one

you can imagine, which is a linear function. So the expression of

any particular gene is going to be a

linear function, a sum, of the expression of

all of it’s regulators, where each one has associated

with it a coefficient beta. And that beta

coefficient tells us how much a particular regulator influences that gene. So say, p53 might have

a very large value. Some other

transcriptional regulator might have a small

value, representing their relative influence. Now, I don’t know the

beta values in advance. So that’s one of the things

that I need to learn. So I want to be able to

find a setting that tells me what the beta values are for

every possible transcription factor. If the algorithm sets

the beta value to zero, what does that tell me? If a beta value

is zero here, what does that tell me about that

transcriptional regulator? No influence, right. And the higher the value, then

the greater the influence. OK. So how do we discover these? So the typical approach then is to come up with some

objective function that we’re going

to try to optimize. And an obvious

objective function is the difference between

the observed expression value for each gene,

and the expected one, based on that linear function. And we’re going to choose

a set of beta parameters that minimize the difference

between the observed and the expected, minimize

the sum of the squares. So the residual sum

of the squares error, between the predicted

and the observed. So this is a relatively

standard regression problem, just in a different setting. Now one of the problems with
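To make the linear model and its least-squares objective concrete, here is a minimal sketch with synthetic numbers– two hypothetical regulators and a target generated from them without noise, nothing from a real expression compendium– fitting the two beta coefficients by solving the ordinary least squares normal equations directly:

```python
# Synthetic example: expression of one target gene across 6 experiments,
# generated as y = 2.0 * tf1 + 0.5 * tf2 (hypothetical regulators, no noise),
# so least squares should recover the betas exactly.
tf1 = [0.0, 1.0, 2.0, 3.0, 1.5, 2.5]
tf2 = [1.0, 0.0, 2.0, 1.0, 3.0, 0.5]
y = [2.0 * a + 0.5 * b for a, b in zip(tf1, tf2)]

# Normal equations for two predictors (no intercept):
#   [S11 S12] [b1]   [S1y]
#   [S12 S22] [b2] = [S2y]
S11 = sum(a * a for a in tf1)
S22 = sum(b * b for b in tf2)
S12 = sum(a * b for a, b in zip(tf1, tf2))
S1y = sum(a * v for a, v in zip(tf1, y))
S2y = sum(b * v for b, v in zip(tf2, y))

det = S11 * S22 - S12 * S12
beta1 = (S1y * S22 - S2y * S12) / det
beta2 = (S2y * S11 - S1y * S12) / det

# Residual sum of squares for the fitted betas
rss = sum((v - beta1 * a - beta2 * b) ** 2
          for a, b, v in zip(tf1, tf2, y))
print(beta1, beta2, rss)
```

Because the target was generated exactly from the model, the fit recovers betas of 2.0 and 0.5 with essentially zero residual; with real, noisy data the betas are only estimates, and regularized variants add a penalty that pushes most betas to exactly zero, which is the kind of robustness fix the lecture's references describe.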

a standard regression problem, is that we’ll

typically get a lot of very small values of beta. So we won’t get all

zeros or all ones, meaning the algorithm is

100% certain that these are the drivers

and these are not. We’ll get a lot of small values

for many, many transcription factors. And OK, that could

represent the reality. But the bad thing is that those beta values are going to be unstable. So small changes in

the training data will give you big changes, in

which transcription factors have which values. So that's not a desirable setting. There's a whole field

built up around trying to come up with

better solutions. I’ve given you some

references here. One of them is to a

paper that did well in the DREAM challenge. The other one is to

a very good textbook, Elements of

Statistical Learning. And there are various

techniques that allow you to try to

limit the number of betas that are non-zero. And by doing that, you get

more robust predictions. At a cost, right, because there

could be a lot of transcription factors that really do

have small influences. But we’ll trade

that off, by getting more accurate predictions

from the ones that have the big influences. Are there any questions

on regression? So the last of the methods

that we’re examining– this is a mutual information. We’ve already seen mutual

information in the course. So information

content is related to the probability of observing

some variable in an alphabet. So in most languages,

the probability of observing letters

is quite variable. So Es are very common

in the English language. Other letters are less common. As anyone who plays Hangman or

watches Wheel of Fortune knows. And we defined the

entropy as the sum, over all possible outcomes, of the probability of observing that outcome times its information content. We can define it in the discrete case, or in the continuous case. And the critical

thing is the mutual information between two variables. So that's the difference between

the entropy of those variables independently, and

then the joint entropy. So a high mutual information means that one variable gives significant knowledge of what the other variable is doing. It reduces my uncertainty. That's the critical idea. OK. So we looked at

correlation before. There could be

settings where you have very low correlation

between two variables, but have high

mutual information. So consider these two genes,

protein A and protein b, and the blue dots are the

relationship between them. You can see that there’s

a lot of information content in these two variables. Knowing the value of A

gives me a high confidence in the value of B. But

there’s no linear relationship that describes these. So if I use mutual

information, I can capture

situations like this, that I can’t capture

with correlation. And these kinds of

situations actually occur. So for example in a

feed-forward loop– say we’ve got a regulator

A, and it directly activates B. It also directly

activates C. But C inhibits B. So you've got the path on the left-hand side that's pressing the accelerator. And the path on the right-hand side pressing the brake pedal. That's called an incoherent

feed-forward loop. And under different settings, you can get different kinds of results– this is one of those examples. You can get much

more complicated behavior. [INAUDIBLE] are papers

that have really mapped out these behaviors across

many parameter settings. You can get switches

in the behavior. But in a lot of

these settings, you will have high

mutual information between two

variables, even if you don’t have any correlation,

linear correlation between them. A well-publicized algorithm

that uses mutual information to infer gene regulatory

networks is called ARACNe. They go through and they

compute the mutual information between all pairs of

genes in their data set. And now one question you have

with mutual information is, what defines a significant

level of mutual information? So an obvious way to do

this, to try to figure out what’s significant, is

to do randomizations. And so that’s what they did. They shuffled the

expression data, to compute mutual information

among pairs of genes, where there isn’t

actually a need– there shouldn’t be

any relationships. Because the data

had been shuffled. And then you can decide whether

the observed mutual information is significantly

greater than what you get from the randomized data. Now, the other thing that

happens with mutual information is that indirect effects still produce high degrees of mutual information. So let's consider the set of

genes that are shown on this slide. So you've got G2,

which is actually a regulator of G1 and G3. So G2 is going to have

high mutual information with G1, and with G3. Now, what about G1 and G3? They're going to behave

very similarly, as well. So it’ll be a high degree

of mutual information between G1 and G3. So if I just rely on

mutual information, I can’t tell what’s a

regulator and what’s a fellow at the same

level of regulation. They’re both being affected

by something above them. I can’t tell the difference

between those two. So they use what’s called the

data processing inequality, where they say, well, these

regulatory interactions should have higher

mutual information than this one, which is just between two common targets of the same parent. And so they drop from their

network, those things which are the lower of the

three in a triangle. So that was the original

ARACNe algorithm, and then they modified

it a little bit, to try to be more specific in

terms of the regulators that were being picked up. And so they called

this approach MINDy. And the core idea here,

is that in addition to the transcription

factors, you might have another protein that

turns a transcription factor on or off. So if I look over

different concentrations of the transcription factor– that is, different expression levels of the transcription factor– I might find that there are some cases

where this other protein turns it on, and other cases

where it turns it off. So here, consider

these two data sets. Looking at different

concentrations of a particular transcription

factor and different expression levels, and in one

case– the blue ones, the modulator isn’t

present at all, or present at it’s

lowest possible level. And in the red case, it’s

present as a high level. And you can see that when

the modulator is present only in low levels, there’s no

relationship between a target and it’s transcription factor. Or when the modulator is

present at a high level, then there’s this

linear response of the target to it’s

transcription factor. So this modulator seems to

be a necessary component. So they went through and

defined a whole bunch of settings like this. And then systematically searched the data for these modulators. So they started off with the

expression data set, genes in rows, experiments in columns. They do a set of filtering steps to remove things that are going to be

problematic for the analysis. They look, for

example, for settings where you have– they had to

start with a list of modulators and transcription factors,

and they moved the ones where there isn’t enough

variation, and so on. And then they examine, for every

modulator and transcription factor pair, cases

where the modulator is present at its highest

level, and where it’s present at

it’s lowest level. So when the modulator is present

at a high level– let’s say, when the modulator is

present at a high level, there’s a high

mutual information between the transcription

factor and the target. When the modulator is absent,

there’s no mutual information. That’s a setting we

looked at before. That would suggest that the

modulator is an activator. It’s a positive modulator. You can have the

opposite situation, where when the modulator

is present at low levels, there’s mutual information

between a transcription factor and it’s target. When the modulator is

present at a high level, you don’t see anything. That would suggest

that the modulator is a negative regulator. And then there are scenarios

where there’s either uniformly high

information content between transcription factor and target, or uniformly low. So the modulator doesn't

seem to be doing anything. So we break it down

into these categories. And you can look at all

the different categories, in their supplemental tables. One thing that’s

kind of interesting is they assume that regardless

of how high the transcription factor goes, you’ll

always see an increase in the expression of the target. So there is no

saturation, which is an unnatural assumption

in these data sets. OK. So I think I’ll close with this

example, from their experiment. And then in the

next lecture, we’ll look at how these

different methods fare against each other in

the DREAM challenge. So they specifically wanted

to find regulators of MYC. So here's data for a particular regulator, STK38. Here's the set of

expression of tumors where STK38 expression is lowest. And a set of tumors where STK38 expression is highest. And they're sorted by the

expression level of MYC. So on the left hand

side, you’ll see there’s no particular

relationship between the expression level

of MYC and the targets. On the right-hand side,

there is a relationship between the expression

level of MYC and targets. So apparently– at least at this level of mutual information– having higher levels of STK38 causes a relationship to occur. That would be an example

of an activator. OK. So this technique has

a lot of advantages, and allows you to search rapidly

over very large data sets, to find potential transcription factor-target relationships, and also

potential modulators. It has some limitations. The key limitation is that the signal has to be present in the expression data set. So in the case of

a protein like p53, where we know it’s activated

by all sort of other processes, phosphorylation or NF-kappaB,

where it’s regulated by phosphorylation, you

might not get any signal. So there has to be a case

where the transcription factor itself, is

changing expression. It also won’t work if

the modulator is always highly correlated

with its target, for some other

biological reason. If the modulator is always on, for other reasons, when the target is, then you'll never be able to divide the data in this way. One of the other

things I think that is problematic with

these networks is that you get such

large networks, and they’re very

hard to interpret. So in this case, these are the nearest neighbors of just one node in ARACNe. This is the mutual information

network of microRNA modulators that has a quarter of

a million interactions. And in these data

sets, often you end up selecting a very,

very large fraction of all the potential modulators. So of all the candidate

transcription factors and modulators, it

comes up with an answer that’s roughly 10% to 20%

of them are regulating any particular gene,

which seems awfully high. OK. So any questions on the

methods we’ve seen so far? OK. So when we come

back on Thursday, we’ll take a look

at head to head of how these different methods

perform on both the synthetic and the real data sets.
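To close the loop on the correlation-versus-mutual-information point from this lecture, here is a small sketch. The data are a toy, hand-discretized nonlinear relationship– an assumption for illustration only; ARACNe itself uses more careful density estimation and the randomization test described above– in which the linear correlation is essentially zero but the mutual information is large:

```python
import math
from collections import Counter

# Toy data: a symmetric, nonlinear relationship. y depends entirely on x,
# but the linear correlation is ~0. Values are discretized so probabilities
# can be estimated by simple counting.
xs = [i / 20 for i in range(21)]            # 0.00, 0.05, ..., 1.00
ys = [round((x - 0.5) ** 2, 2) for x in xs]

def pearson(a, b):
    """Pearson linear correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    sa = math.sqrt(sum((u - ma) ** 2 for u in a))
    sb = math.sqrt(sum((v - mb) ** 2 for v in b))
    return cov / (sa * sb)

def mutual_information(a, b):
    """I(A;B) = sum p(a,b) * log2(p(a,b) / (p(a) p(b))), from counts."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log2((c / n) / ((pa[u] / n) * (pb[v] / n)))
               for (u, v), c in pab.items())

print(pearson(xs, ys))             # near zero: no linear relationship
print(mutual_information(xs, ys))  # large: knowing x pins down y
```

With these 21 points, knowing x determines y exactly, so the mutual information equals the entropy of y (a bit over 3 bits here), while the symmetric shape makes the Pearson correlation vanish.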
