Gene Expression Matrices


We will now talk about how
biologists study gene expression to analyze genes implicated
in the diauxic shift. Let’s consider a simple experiment: let’s measure the expression of
three genes at seven checkpoints before and after the diauxic shift. Element (i, j) of the resulting
three-by-seven matrix is the expression level of
gene i at checkpoint j. What is the expression level? Well, one gene in a genome may generate many transcripts, while another
may generate few transcripts. Expression level is a relative
measurement of the number of transcripts in the cell. We can visualize each seven-element expression vector as
a plot over seven points, and looking at this plot, you can
immediately see that the blue gene most likely has nothing to do
with the diauxic shift because the expression of this gene doesn’t
change before and after the diauxic shift. However, the green gene is most likely implicated in the diauxic shift
because its expression increases, and the red gene is also most likely
implicated in the diauxic shift because its expression decreases. Instead of analyzing expression levels
directly, biologists often prefer to study the logarithm of expression level,
and we will follow this approach. Below this line, you see the expression
levels represented as logarithms. We just need to remember that positive
values, after the logarithmic transformation, correspond to an
increase in the expression of a gene, and negative values,
after the logarithmic transformation, correspond to a decrease
in its expression. And here’s a slightly larger
ten-by-seven gene expression matrix, and these ten genes are also
shown below as plots. Can we cluster these genes into groups
of genes with similar behavior? Each row in the gene expression matrix
corresponds to a gene expression vector, and you can see the plot of this
gene expression vector at the bottom. Clearly, just looking
at this picture at the bottom reveals three types of behavior: increasing expression, corresponding to
green genes; flat expression, corresponding to blue genes; and decreasing expression, corresponding to
red genes. In fact, in 1997,
Joseph DeRisi constructed a much larger gene expression matrix by measuring
gene expression for 6,400 yeast genes, nearly all yeast genes, before and
after the diauxic shift. And the goal he tried to achieve by
using this gene expression matrix was to partition all
yeast genes into clusters so that genes in the same
cluster have similar behavior, and genes in different clusters
have different behavior. We can also represent genes as points
in multi-dimensional space, so that an n x m gene expression matrix will
turn into n points in m-dimensional space, and our ten genes will
turn into these ten points, and we clearly see that blue points, whose expression remains
flat, cluster together. The same is true for
green points and red points. However, I’m cheating here because
our points are actually points in seven-dimensional space, since we measured
gene expression at seven checkpoints. But here, I show these points
in two-dimensional space. How have I done it? In fact, the clustering problem is much
harder than it looks because of the so-called curse of dimensionality, which arises when clustering points in
multi-dimensional spaces. So far, we talked about yeast gene expression, but we can also talk about expression
of genes implicated in cancer. In 1999, Uri Alon measured
the expression of 2,000 genes from 40 samples of colon tumors
from cancer patients and compared it with the gene expression
matrix constructed for the same 2,000 genes in healthy individuals. In this case,
the result is not one but two 2,000-by-40 gene expression matrices, and the goal is
to find genes with significantly different expression vectors in tumor
patients as compared to healthy people. And these genes, if found, would represent potential cancer biomarkers that
can be used for cancer diagnostics. So if we look at the
genes of a healthy person in multi-dimensional space, then for,
let’s say, the same ten genes (let’s not change the expression matrix),
we will see a picture like this. But if we superimpose on it
the expression levels for a cancer patient, which I represent with a different type of
circle, we see that for blue genes, the expression levels
in cancer patients and in healthy people are roughly the same. But for green genes and red genes, they
differ, and thus green genes and red genes represent potential,
I emphasize potential, cancer biomarkers. In fact, there are already
approaches to cancer diagnostics, such as MammaPrint,
a test that evaluates the likelihood of breast cancer recurrence based on
the expression levels of just 70 genes. The question that we will be interested
in is: How did scientists discover these 70 human genes
implicated in breast cancer? But to answer this question, we need to formally define clustering
as an optimization problem.
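
To make the ideas in this section a bit more concrete, here is a minimal Python sketch of a tiny gene expression matrix. Everything in it is an assumption made for illustration: the gene names and numbers are made up (they are not measured data), the helper names `log_vector`, `trend`, and `euclidean` are hypothetical, the base-2 logarithm is just one common choice, and the twofold-change threshold is arbitrary. It takes the logarithm of each expression vector, labels each gene as increasing, flat, or decreasing, and treats the genes as points in seven-dimensional space so that we can compare distances between them.

```python
import math

# Made-up relative expression levels (illustration only, not real data).
# Each row is one gene; each column is one of seven checkpoints.
expression = {
    "gene_a": [1.0, 1.1, 1.3, 2.0, 3.1, 4.2, 6.0],   # rising
    "gene_b": [1.0, 1.0, 0.9, 1.1, 1.0, 0.9, 1.0],   # flat
    "gene_c": [1.0, 0.9, 0.7, 0.5, 0.3, 0.2, 0.15],  # falling
    "gene_d": [1.0, 1.2, 1.5, 2.2, 3.0, 4.0, 5.5],   # rising
}


def log_vector(levels):
    """Transform an expression vector to log2 scale (base 2 is an assumption)."""
    return [math.log2(x) for x in levels]


def trend(log_levels, threshold=1.0):
    """Label a gene by its log2 change from the first to the last checkpoint.

    The threshold of 1.0 (a twofold change) is an arbitrary assumption.
    """
    change = log_levels[-1] - log_levels[0]
    if change > threshold:
        return "increasing"
    if change < -threshold:
        return "decreasing"
    return "flat"


def euclidean(u, v):
    """Euclidean distance between two points in m-dimensional space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Each gene becomes a point in seven-dimensional space.
log_expression = {gene: log_vector(v) for gene, v in expression.items()}
labels = {gene: trend(v) for gene, v in log_expression.items()}

for gene, label in labels.items():
    print(gene, label)

# Genes with similar behavior should be closer together than genes
# with different behavior.
print(euclidean(log_expression["gene_a"], log_expression["gene_d"]))
print(euclidean(log_expression["gene_a"], log_expression["gene_c"]))
```

With these made-up numbers, the two rising genes end up much closer to each other than a rising gene is to a falling one, which is the intuition behind grouping genes by their expression vectors. This is only a hand-made labeling rule, not a clustering algorithm.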
