11. RNA Secondary Structure; Biological Functions and Predictions


The following
content is provided under a Creative
Commons license. Your support will help MIT
OpenCourseWare continue to offer high quality
educational resources for free. To make a donation or
view additional materials from hundreds of MIT courses,
visit MIT OpenCourseWare at ocw.mit.edu. PROFESSOR: All right. We should probably get started. So RNA plays important
regulatory and catalytic roles in biology, and so it’s
important to understand its function. And so that’s going to be the
main theme of today’s lecture. But before we get
to that, I wanted to briefly review what
we went over last time. So we talked about hidden
Markov models, some of the terminology, thinking
of them as generative models, terminology of the different
types of parameters, the initiation probabilities
and transition probabilities and so forth. And Viterbi algorithm, just
sort of the core algorithm used whenever you apply HMMs. Essentially, you always
use the Viterbi algorithm. And then we gave as an
example the CpG Island HMM, which is admittedly a
bit of a toy example. It’s not really used in practice, but it illustrates the principles. And then today we’re going
to talk about a couple of real world HMMs. But before we get
to that, I just wanted to– sort
of toward the end, we talked about the
computational complexity of the algorithm, and concluded
that if you have a k-state HMM run on a sequence of length
L, it’s order k squared L. And this diagram is helpful
to many people in sort of thinking about that. So you can have transitions
from any state– for example, from this state–
to any of the other five states in this five-state HMM. And when you’re
doing the Viterbi, you have to maximize over the
five possible input transitions into each state. And so the full set of computations that you have to do going from position i to i plus 1 is k squared. Does that make sense? And then there are L different transitions you have to do, so it’s k squared L. Any questions about that? OK.
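Just to make that scaling concrete, here is a minimal Viterbi sketch in Python (my own illustration, not from the lecture– the array shapes and names are invented). The two nested loops over states inside the loop over positions are exactly where the k squared times L work comes from.

```python
import numpy as np

def viterbi(obs, init, trans, emit):
    """Minimal Viterbi over log-probabilities.

    obs   : list of observation indices, length L
    init  : (k,) initial-state log-probabilities
    trans : (k, k) transition log-probabilities
    emit  : (k, m) emission log-probabilities
    """
    k, L = len(init), len(obs)
    v = np.full((L, k), -np.inf)        # best log-prob of any path ending here
    ptr = np.zeros((L, k), dtype=int)   # back-pointers for the traceback
    v[0] = init + emit[:, obs[0]]
    for i in range(1, L):               # L positions...
        for j in range(k):              # ...times k target states...
            scores = v[i - 1] + trans[:, j]   # ...times k source states: O(k^2 L)
            ptr[i, j] = int(np.argmax(scores))
            v[i, j] = scores[ptr[i, j]] + emit[j, obs[i]]
    path = [int(np.argmax(v[-1]))]      # trace back the optimal path
    for i in range(L - 1, 0, -1):
        path.append(int(ptr[i, path[-1]]))
    return path[::-1]
```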
All right, so the example that we gave is shown here. And what we did was to take an
example sort of where you could sort of see the answer–
not immediately see it, but by thinking about it a little, figure out the answer. And then we talked about how
the Viterbi algorithm actually works, and why it makes the
transitions at the right place. It seems intuitively like it
would make a transition later, but actually transitions
at the right place. And one way to
think about that is that these are not hard and
fast decisions because you’re optimizing two different paths. At every state, you’re
considering two possibilities. And so you explore the
possibility of– the first time you hit a c, you explore the
possibility of transitioning from genome to
island, but you’re not committing to that decision until you get to the end and see whether that
path ends up having a higher probability at the end of the
sequence than the alternative. So that’s sort of one way
of thinking about that. Any questions about
this sort of thing, how to understand when a
transition will be made? And I want to emphasize,
for this simple HMM, we talked about
you can kind of see what the answer’s going to be. But if you have any HMM, any
sort of interesting real world HMM with multiple
states, there’s no way you’re going
to be able to see it. Maybe you could guess
what the answer might be, but you’re not going to be able
to be confident of what that is, which is why you have
to actually implement it. All right, good. Let’s talk about a couple
of real world HMMs. So I mentioned gene finding. That’s been a popular
application of HMMs, both in prokaryotes
and eukaryotes. There’s some examples
discussed in the text. Another very popular application
are so-called profile HMMs. And so this is a
hidden Markov model that’s made based on a multiple
alignment of proteins which have a related function
or share a common domain. For example, there’s
a database called Pfam, which includes
profile HMMs for hundreds of different types
of protein domains. And so once you have many
dozens or hundreds or thousands of examples of a
protein domain, you can learn lots of
things about it– not just what the
frequencies of each residue are in each position,
but how likely you are to have an
insertion at each position. And if you do have
an insertion, what types of amino acid residues
are likely to be inserted in that position,
and how often you are likely to have a
deletion at each position in the multiple alignment. And so the challenge then
is to take a query protein and to thread it through all
of these profile HMMs and ask, does it have a significant
match to any of them? And so that’s basically
how Pfam works. And the nice thing about
HMMs is that they allow you to– if you want to have
the same probability of an insertion at each position
in your multiple alignment, you can do that. But if you have enough data
to observe that there’s a five-fold higher likelihood of
having an insertion at position three in a multiple alignment
than there is at position two, you can put that in. You just change
those probabilities. So in this HMM, each
of the hidden states is either an M state, which is
a match state, or an I state, or an insert state. And so those will emit
actual amino acid residues. Or it could be a
delete state, which is thought of as emitting
a dash, a placeholder in the multiple alignment. So these are also widely used.
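As a toy illustration of where those position-specific numbers come from, here is a sketch (mine, not Pfam’s actual training code– the alignment is invented) that tabulates per-column residue frequencies and gap fractions from a multiple alignment, which are the raw counts behind match-state emissions and deletion probabilities.

```python
from collections import Counter

# A made-up protein multiple alignment; '-' marks a gap (deletion).
alignment = [
    "MKV-LA",
    "MRVTLA",
    "MKV-LG",
    "MKISLA",
]

for i in range(len(alignment[0])):
    column = [seq[i] for seq in alignment]
    residues = [c for c in column if c != "-"]
    freqs = {aa: n / len(residues) for aa, n in Counter(residues).items()}
    gap_fraction = column.count("-") / len(column)  # crude delete-state estimate
    print(f"column {i}: emissions {freqs}, deletion fraction {gap_fraction:.2f}")
```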
And then one of my favorite examples– it’s fairly simple, but it turns out to
be quite useful– is the so-called
TMHMM for prediction of transmembrane
helices in proteins. So we know that many,
especially eukaryotic proteins, are embedded in membranes. And there’s one famous family
of seven transmembrane helix proteins, and there
are others that have one or a few
transmembrane helices. And knowing that a protein
has at least one transmembrane helix is very useful in terms
of predicting its function. You can predict its localization. And knowing that it’s a seven
transmembrane helix protein is also useful. And so you want to predict
whether the protein has transmembrane helices and
what their orientation is. That is, proteins can
have their N-terminus either inside the cell
or outside the cell. And then of course, where
exactly those helices are. And this program has
about a 97% accuracy, according to [? the author. ?]
So it works very well. So what properties
do you think– we said before that
you have to have strongly different
emission probabilities in the different hidden states
to have a chance of being able to predict
things accurately. So what properties do
you think are captured in a model of
transmembrane helices? What types of
emission probabilities would you when you have for the
different states in this model? Anyone? So for this protein,
what kind of residues would you have in here? Oops, sorry. I’m having trouble
with this thing. All right, here in the
middle of the membrane, what kind of residues are
you going to see there? AUDIENCE: [INAUDIBLE] PROFESSOR: Those are
going to be hydrophobic. Exactly. And what about right
where the helix emerges from the membrane? [INAUDIBLE] charged residues there to kind of anchor it and prevent it from sliding back into the membrane. And then in general, both on
the exterior and interior, you’ll tend to have more
hydrophilic residues. So that’s sort of
the basis of TMHMM. So this is the structure. And you’ll notice that these are
not exactly the hidden states that correspond to individual
amino acid residues. These are like meta
states, just to illustrate the overall structure. I’ll show you the actual
states on the next slide. But these were the
types of states that the author, Anders
Krogh, decided to model. So he focuses
here on the helix core. There’s also a cytoplasmic
cap and a non-cytoplasmic cap. Oops, didn’t mean that. And then there’s sort of a
globular domain on each side– both on the cytoplasmic
side, or you could have one on the
non-cytoplasmic side. OK, so there’s going to be
different compositions in each of these regions. Now one of the things we
talked about with HMMs is that if you were– now let’s
think about the helix core. The simplest model
you might think of would be to have sort
of a helix state, and then to allow that
state to recur to itself. OK, so this type of thing where
you then have some transition to some sort of cap state
after, this would allow you to model helices of any length. But now how long are
transmembrane helices? What does that
distribution look like? Anyone have an idea? There’s a certain
physical dimension. [INAUDIBLE] It takes a certain number of residues to get across here, and that number
is about 20-ish. So transmembrane helices
tend to be sort of on the order of 20
plus or minus a few. And so it’s totally unrealistic
to have a transmembrane helix that’s, like, five
residues long. So if you run this algorithm
in generative mode, what distribution of helix
lengths will you produce? We’re running in generative mode where, remember, we’re going to generate a series of hidden states and then associated amino acid sequences. It starts from some,
let’s say– I don’t know. What kind of states are there
here? [INAUDIBLE] cytoplasmic. Let’s say it goes into the helix, hangs out here. I’m sorry, is there an
answer to this question? Anyone? I don’t know how
long– if I let it run, it’ll generate a random number. It depends on what this
probability is here. Let’s call this probability
p, and then this would be 1 minus p. OK, so obviously if
1 minus p is bigger, it’ll tend to produce
longer helices. But in general,
what is the shape of the distribution there of
consecutive helical states that this model will generate? AUDIENCE: Binomial. PROFESSOR: Binomial. OK, can you explain why? AUDIENCE: Because
the helix would have to have probable–
the helix of length n would occur 1 minus
p to the n power. PROFESSOR: OK, so say we call L the length of the helix; then the probability that L equals n is 1 minus
p to the n, right? Is that binomial? Someone else? AUDIENCE: Yeah. Is it a negative binomial? PROFESSOR: Negative binomial. OK. AUDIENCE: [INAUDIBLE] states and
a helix state before moving out [INAUDIBLE]. PROFESSOR: Yeah. So the distribution is
going to be like that. You have to stay in here
for n and then leave. So this is the
simplest case– a special case of the negative binomial. But in general,
this distribution is called the
geometric distribution. Or a continuous version would
be the exponential distribution. So what is the shape
of this distribution? If I were to plot n down here on
this axis, and the probability that L equals n on this
axis, what kind of shape– could someone draw in the air? So you had up and then down? OK, so actually, it’s
going to be just down. Like that, right? Because as n increases,
this goes down because 1 minus
p is less than 1. So it just steadily goes down. And what is the mean
of this distribution? Anyone remember this? Yeah, so there’s sort
of two versions of this that you’ll see. One of them is (1 minus p) to the (n minus 1) times p, and one of them is (1 minus p) to the n times p. And so this one is the number of failures before a success, if you will– success here being leaving the helix. And this one is the number of
trials till the first success. So one of them has
a mean that’s 1/p, and the other has a mean
that’s 1 minus p over p. So usually, p is small, and
so those are about the same. So 1/p. You could think that
1/p is roughly right. And so if we were to model
transmembrane helices, and if transmembrane
helices are about– I said about 20
residues long– you would set p to what value
to get the right mean? AUDIENCE: 0.05. PROFESSOR: Yeah. 0.05. 1/20, so that 1 over that
will be about 20, right? And then 1 minus p would, of course, be 0.95. So if I were to do that,
I would get a distribution that looks about like
this with a mean of 20.
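Here’s a quick simulation of that (not from the lecture– just the p equals 0.05 example in code), showing that a single self-looping state generates geometrically distributed lengths with mean about 1 over p, including lots of unrealistically short helices.

```python
import random

p = 0.05  # probability of leaving the helix state at each step
lengths = []
for _ in range(100_000):
    n = 1
    while random.random() < 1 - p:   # self-transition: stay in the helix
        n += 1
    lengths.append(n)

print(sum(lengths) / len(lengths))                       # about 20, i.e. 1/p
print(sum(1 for n in lengths if n < 15) / len(lengths))  # about half are too short
```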
But if I were to then look at real transmembrane helices and look at their
distribution, I would see something
totally different. It would probably
look like that. It would have a mean around 20. But the probability of anything
less than 15 would be 0. That’s too short. It can’t go across the membrane. And then again, you don’t
have ones that are 40. They don’t kind of wiggle around
in there and then come out. They tend to just
go straight across. So there’s a problem here. You can see that if you want
to make a more accurate model, you want to not only get the
right emission probabilities with the right probabilities of
hydrophobics and hydrophilics and the different
states, but you also want to get the length right. And so the trick that–
well, actually, yeah. Can anyone think of tricks
to get the right length distribution here? How do we do better than this? Basically, hidden
Markov models where you have a state that
will recur to itself, it will always be a
geometric distribution. The only choice you have is
what is that probability. And so you can get
any mean you want, but you always get this shape. So if you want a
more general shape, what are some tricks
that you could do? How could you change the model? Any ideas? Yeah, go ahead. AUDIENCE: [INAUDIBLE] have
multiple helix states. PROFESSOR: Multiple
helix states. OK. How many? AUDIENCE: Proportional to the
length we want, [INAUDIBLE]. PROFESSOR: Like one for
each possible length. AUDIENCE: It’d be less than one per length. PROFESSOR: Or less than one. OK. So you could have
something like– I mean, let’s say you have like this. Helix begin– or,
helix 1, helix 2. You allow each of these
to recur to themselves. What does that get you? This actually gets you
something a little bit better. It gives you a little bit of a peak– it’s more like that. So that’s better. But if I want to get the exact
distribution, then actually one– so this is the solution
that the authors actually used. They made essentially 25
different helix states, and then they allowed various
different transitions here. So it’s a bit arbitrary here, but they have this special state three
that can kind of take a jump. So it can just
continue on to four, and that’ll make your
maximum length helix core. Or it can skip one, go
to five, and that’ll make a helix core that’s one
residue shorter than that, or it can skip
two, and so forth. And you can set
any probabilities you want on these transitions. And so you can fit basically
an arbitrary distribution within a fixed range
of lengths that’s determined by how
many states you have. OK, so they really wanted to get
the length distribution right, and that’s what they did.
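A minimal sketch of why chaining states works (my own illustration– these are not the actual TMHMM transition probabilities): the total time spent in a chain of self-looping states is a sum of geometrics, which gives a peaked, negative binomial-like length distribution instead of a monotonically decreasing one.

```python
import random

def geometric(p):
    """Steps spent in one self-looping state with exit probability p."""
    n = 1
    while random.random() < 1 - p:
        n += 1
    return n

# A chain of 5 helix states, each with mean length 4, still has total mean 20,
# but the length distribution is now peaked rather than geometric.
lengths = [sum(geometric(0.25) for _ in range(5)) for _ in range(100_000)]
print(sum(lengths) / len(lengths))                       # about 20
print(sum(1 for n in lengths if n < 10) / len(lengths))  # short helices now rare
```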
What’s the cost of this? What’s the downside? Simona? AUDIENCE: I was just going to ask, it looks like from
this your minimum helix length could be four. PROFESSOR: Yeah. That’s a good question. Well, we don’t know what probabilities they set on those transitions. Well, did they really mean that? And also, that’s only the core,
and maybe these cap things can be– yeah, that seems
a little short to me. So yeah, I agree. I’m not sure. It could just be for the
sake of illustration, but they don’t
actually use those. But anyway, I’d probably have to reread the paper. I haven’t read this paper in many years, so I don’t remember
exactly the answer to that. But I have a citation. You can look it up
if you’re curious. But the main point I
wanted to make with this is just that by setting an
arbitrary number of states and putting in possible
transitions between them, you can actually construct any length distribution you want. But there is a downside,
and what is that downside? AUDIENCE: Computational cost. PROFESSOR: Yeah, the
computational cost. Instead of having
one helix state, now we’ve got 25 or something. And the time goes up with the square of the number of states, so it’s going to run slower. And you also have to estimate
all these parameters. OK, so here’s an example
of the output of the TMHMM program for a mouse
chloride channel gene, CLC6. So the program
predicts that there are seven transmembrane
helices, as shown by these little red blocks here. You can see they’re all about
the same– about 20 or so– and that the program starts
outside and ends inside. So let’s say you were going
to do some experiments on this protein to
test this prediction. So one of the types of
experiments people do is they put some
sort of modifiable or modified residue
into one of the spaces between the
transmembrane helices. And then you can test, by treating the cell with a membrane-impermeable chemical, whether you can modify that protein. So only if that stretch is on the outside of the cell will you be able to modify it. So that’s a way of
testing the topology. So if you were doing those
types of experiments, you might actually– like
maybe you’re not sure if every transmembrane
helix is correct. There could be some
where the boundaries were a little off, or
even a wrong helix. And so one of the
things that you often want with a
prediction is not only to know what is the optimal
or most likely prediction, but also how confident
is the algorithm in each of the parts of its prediction. How confident is it in the
location of transmembrane helix three or the probability
that actually there is a transmembrane helix three. And so the way that this program
does that is using something called the
forward-backward algorithm. So those of you who read
the Rabiner tutorial, it’s described
pretty well there. The basic idea is
that I mentioned this P(O)– the probability of the observable sequence, summing over all
possible HMM structures or all possible sequences
of hidden states– that is possible to calculate. And the way that
you do it is you run an algorithm that’s
similar to the Viterbi, but instead of taking
the maximum entering each hidden state at
intermediate positions, you sum those inputs. So you just do the
sum at every point. And it turns out that the sum of the k values at the end will be equal to the sum of
the probabilities of generating the observable sequence
over all possible sequences of hidden states. OK, so that’s useful. And then you can also
run it backwards. There’s no reason it has to be
only going in one direction. And so what you do is you run
these sort of summing versions of the Viterbi in both
the forward direction and also run one in
the backward direction. And then you take a
particular position here– like let’s say this is your
helix state, for example. And we’re interested
in this position somewhere in the
middle of the protein. Is that a helix or not? And so basically
you take the value that you get here
from the forward in your forward
algorithm and the value that you get here in
the backward algorithm, and multiply those two
together, and divide by this P(O). And that gives you
the probability. So that ends up being
a way of calculating the sum of all
the parses that go through this particular
position i in the sequence in that particular state.
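Here is a small forward-backward sketch for a toy two-state HMM (my own– all the numbers are invented, and a real implementation like TMHMM would work in log space or with scaling to avoid underflow). It computes the posterior probability of each state at each position as forward times backward divided by P(O).

```python
import numpy as np

def posterior(obs, init, trans, emit):
    """Posterior state probabilities by the forward-backward algorithm.

    obs: observation indices; init: (k,); trans: (k, k); emit: (k, m).
    Plain probabilities (not logs); fine for short toy sequences.
    """
    k, L = len(init), len(obs)
    fwd = np.zeros((L, k))
    bwd = np.zeros((L, k))
    fwd[0] = init * emit[:, obs[0]]
    for i in range(1, L):                     # like Viterbi, but sum instead of max
        fwd[i] = (fwd[i - 1] @ trans) * emit[:, obs[i]]
    bwd[-1] = 1.0
    for i in range(L - 2, -1, -1):            # the same recursion run backwards
        bwd[i] = trans @ (emit[:, obs[i + 1]] * bwd[i + 1])
    p_obs = fwd[-1].sum()                     # P(O), summed over all paths
    return fwd * bwd / p_obs                  # each row sums to 1

# Toy example: 2 hidden states, 2 observable symbols.
init = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1], [0.1, 0.9]])
emit = np.array([[0.8, 0.2], [0.2, 0.8]])
print(posterior([0, 0, 1, 1, 1], init, trans, emit))
```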
I mean, I realize that may not have been totally clear, and I don’t want to take more
time to totally go into it, but it is pretty well described in Rabiner. And I’ll just give
you an example. So if you’re motivated,
please take a look at that. And if you have
further questions, I’d be happy to discuss
during office hours next week. And this is what it looks like
for this particular protein. So you get something called the
posterior probability, which is the sum of the probabilities
of all the parses. And they’ve plotted it for
the particular state that is in the Viterbi path, that
is in the optimal parse– so for example, in blue here. Well, actually, they’ve done
it for all the different states here. So blue is the probability
that you’re outside. OK, so it’s very, very
confident that the end terminus of the protein is
outside the cell. It’s very, very confident
in the locations of transmembrane
helices one and two. It actually more
often than not thinks there’s actually a
third helix right here, but that didn’t make it into the optimal parse. That actually occurs in
the majority of parses, but not in the optimal. And it’s probably because it
would then cause other things to be flipped later on if you
had transmembrane helix there. It’s not sure whether
there’s a helix there or not, but then it’s
confident in this one. OK, so this gives you an idea. Now if you wanted to do some
sort of test of the prediction, you want to test probably
first the higher confidence predictions, so you might
do something right here. Or if maybe from
experience you know that when it has a
probability that’s that high, it’s always right, so
there’s no point testing it. So you should test one of these
kind of less confident regions. So this actually makes the
prediction much more useful to have some degree
of confidence assigned to each part of the prediction. So for the remainder
of today, I want to turn to the topic of
RNA secondary structure. So at the beginning,
I will sort of get through some nomenclature. And then to motivate the topic,
give some biological examples of RNA structure. Gives me an excuse to show some
pretty pictures of structure. And then we’ll talk about
two approaches which are two of the most widely used
approaches toward predicting structure. So using evolution
to predict structure by the method of co-variation,
which works well when you have many homologous sequences. And then using sort
of first principles thermodynamics to predict
secondary structure by energy minimization
where obviously you don’t need to have a
homologous sequence present. And the Nature Biotechnology primer on RNA folding
that I recommended is a good intro to the
energy minimization approach. So what is RNA
secondary structure? So you all know that
RNAs, like proteins, have a three-dimensional
tertiary fold structure that, in many cases, determines
their function. But there’s also sort of
a simpler representation of this structure where you just
describe which pairs of bases are hydrogen bonded
to one another. OK, and so for RNA– here’s a famous example of an RNA structure, this
sort of clover leaf structure that all tRNAs have. The secondary structure of the
tRNA is the set of base pairs. So it’s this base pair
here between the first base and this one toward the end, and then this base right here, and so forth. And so if you specify
all those base pairs, then you can draw a picture
like this, which gives you a good idea of what parts of
the RNA molecule are accessible. So for example,
it won’t tell you where the anticodon
loop is, which is sort of the business
end of the tRNA. But it narrows it down
to three possibilities. You might consider that,
or that, or down here. It’s unlikely to be
something in here because these bases
are already paired. They can’t pair to message. So it gives you sort of a first
approximation toward the 3D structure, and so
it’s quite useful. So how do we represent
secondary structure? So there’s a few different
common representations that you’ll see. So one is– and this is sort
of a computer-friendly but not terribly human-friendly
representation, I would say– is
this sort of dot-and-parentheses notation here. So the dot is an unpaired
base and the parenthesis is a paired base. And how do you know– chalk
is sort of non-uniformly distributed here– so if you
have a structure like this and you have these
three parentheses, what are they paired to? Well, you don’t know yet
until you get further down. And then each left
parenthesis has to have a right
parenthesis somewhere. So now if we see
this, then we know that there are two
unpaired bases here, and then there’s
going to be three in a row that are
paired– these guys. We don’t know what
they’re paired to yet. Then there’s going to be a
five-base loop, maybe a little pentagon type thing. Two, three, four–
oops– four, five. And this one would be
the right parentheses that pair with the left
parentheses over here. I should probably
draw this coming out to make it clearer
that it’s not paired. So this notation you
can convert to this. So after a while, it’s
relatively easy to do this, except when they’re super long. So that’s what the left part
of that would look like. So what about the right part? So the right part, we have
something like one, two, three, four, bunch of dots, and then
we have two, and then a dot, and then two. What does that thing look like? So that’s going to look like
four bases here in a stem. Big loop, and then there’s
going to be two bases that are paired, and then
a bulge, and then two more that are paired. These things happen
in real structures.
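Here’s a small sketch (mine, not from the lecture) that converts the dot-and-parentheses notation into a list of base pairs. The stack captures exactly the rule that each left parenthesis has to have a right parenthesis somewhere downstream.

```python
def parse_dot_bracket(structure):
    """Return the list of (i, j) base pairs encoded in a dot-bracket string."""
    stack, pairs = [], []
    for j, ch in enumerate(structure):
        if ch == "(":
            stack.append(j)                 # an as-yet-unmatched left parenthesis
        elif ch == ")":
            pairs.append((stack.pop(), j))  # pair it with the most recent "("
        # "." is an unpaired base: nothing to do
    assert not stack, "unbalanced structure"
    return pairs

# Two unpaired bases, a three-base-pair stem, and a five-base loop:
print(parse_dot_bracket("..(((.....)))"))   # [(4, 10), (3, 11), (2, 12)]
```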
OK, and then the arc notation is a little more human-friendly. It actually draws an
arc between each pair of bases that are
hydrogen bonded. So I’m sure you can imagine
what those structures would look like. And it turns out that the
arcs are very important. Like whether those
arcs cross each other or not is sort of a fundamental
classification of RNA secondary structures, into
the ones that are tractable and the ones that
are really difficult. So pretty pictures of RNA. So this is a lower
resolution cryo-EM structure of the bacterial ribosome. Remember, ribosomes have two
sub-units– a large sub-unit, 50S, and a small sub-unit, 30S. And if you crack it open–
OK, so you basically split. You sort of break the ribosome
like that, and you look inside, they’re full of tRNAs. So there are three
pockets that are normally distinguished within ribosomes. The A site– this
is the site where the tRNA enters
that’s going to add a new amino acid to the
growing peptide chain. The P site– this tRNA will have it [INAUDIBLE] with the actual growing peptide. And then the exit– the E site, which holds the tRNA whose amino acid was added a couple of residues ago. So people often think
of RNA structure just in terms of these
secondary structures because they’re much
easier to generate than tertiary structures, and
they give you– like for tRNA, it gives you some pretty good
information about how it works. But for a large and complex
structure like the ribosome, it turns out that
RNA is actually not bad at building
complex structures. I would say it’s not
as good as protein, but it is capable of
constructing something like a long tube. And in fact, in
the ribosome, you find such a long
tube right here. That is where the peptide
that’s been synthesized exits the ribosome. And you’ll notice it’s not
a large cavity in which the protein might start folding. It’s a skinny tube that is thin
enough that the polypeptide has to remain linear, cannot
start folding back on itself. So you sort of
extrude the protein in a linear, unfolded
confirmation, and let it fold outside
of the ribosome. If it could fold inside
that, that might clog it up. That’s probably one reason why
it’s not designed that way. I’m sure that was tried
by evolution and rejected. So if you look at the
ribosome– now remember, the ribosome is composed
of both RNA and protein– you’ll see that it’s much
more of one than the other. And so it’s really much more
of the fettuccine, which is the RNA part, than the
linguini of the protein. And if you also look
at the distribution of the proteins on
the ribosome, you’ll see that they’re
not in the core. They’re kind of decorated
around the edges. It really looks like something
that was originally made out of RNA, and then you sort of
added proteins as accessories later. And that’s probably
what happened. This is based on
the structures that were solved a few years ago. If you then look at where
the nearest proteins are to the active site– actual
catalytic site– remember, the ribosome catalyzes the addition of an amino acid to a growing peptide, so
peptide bond formation– you’ll find that the
nearest proteins are around 18 to 20 angstroms away. And this is too far
to do any chemistry, so the active site
residues or molecules need to be within
a few angstroms to do any useful chemistry. And so this basically
proves that the ribosome is a ribozyme. That is, it’s an RNA enzyme. RNA is [INAUDIBLE]. So here is the
structure of a ribosome. It’s very kind of beautiful,
and it’s impressive that somebody can actually
solve the structure of something this big. But what is actually
the practical use of this structure? Turns out there’s quite an
important practical application of knowing the structure. Any ideas? AUDIENCE: Antibiotics. PROFESSOR: Antibiotics. Exactly. So many antibiotics work by
taking advantage of differences between the prokaryotic
ribosome structure and eukaryotic
ribosome structure. So if you can make
a small molecule– these are some examples–
that will inhibit prokaryotic ribosomes
but hopefully not inhibit eukaryotic
ribosome, then you can kill bacteria that
might be infecting you. So non-coding RNA. So there’s many different
families of non-coding RNAs, and I’m going to list
some in a moment. And I’m going to
actually challenge you, see if you can
come up with any more families of non-coding RNAs. But they’re receiving
increasing interest, I would say, ever since microRNAs were discovered. Sort of a boom in looking at different types of non-coding RNAs. LincRNAs are also important and interesting, as well as many of the classical RNAs like tRNAs and rRNAs and snoRNAs. There may be new aspects of
their regulation and function that will be interesting. And so when you’re
studying a non-coding RNA, it’s very, very helpful to know its structure. If it’s going to base pair in trans with some other RNA– as tRNAs do, as microRNAs do, for example, or snRNAs and snoRNAs– then
you want to know which parts of the
molecule are free and which are internally base paired. And if you want to predict non-coding RNA genes in a genome, you may want to look
for regions that are under selection for
conservation of RNA structure, for conservation
of the potential to base pair at some distance. If you see that,
it’s much more likely that that region of the genome encodes a non-coding RNA than that it contains, for example, a coding exon or a transcription factor binding site or something like that
that functions at the DNA level. So having this
notion of structure– even just secondary structure–
is helpful for that application as well, and predicting
functions as well, as I mentioned. So co-variation. So let’s take a look
at these sequences. So imagine you’ve discovered a new class of mini microRNAs. They’re only eight bases long, and you’ve sequenced five homologues from your
five favorite mammals. And these are the
sequences that you get. And you know that
they’re homologous by synteny– they’re in the same place in the genome, and they seem
to have the same function. What could you say about
their secondary structure based on this
multiple alignment? You have to stare at it a
little bit to see the pattern. There’s a pattern here. Any ideas? Anyone have a guess about
what the structure is? Yeah, go ahead. AUDIENCE: There’s a two
base pair stem, and then a four base loop. PROFESSOR: Two base pair
stem, four base loop, and co-variation in the stem. So how do you know that?
look at the first two and last two bases
of each sequence, the first and the
eighth nucleotides can pair with each other, and so
can the second and the seventh. PROFESSOR: Yeah. Everyone see that? So in the first
column you have AUACG, and that’s
complementary to UAUGC. Each base is complementary. And the second position is
CAGGU complementary to GUCUA. There’s one slight
exception there. AUDIENCE: [INAUDIBLE] PROFESSOR: Yeah. Well, it turns out that that
RNA– although the Watson Crick pairs GC and AU are the
most stable– GU pairs are only a little bit
less stable than AU pairs, and they occur in
natural RNA molecules. So GU is allowed in RNA
even though you would never see that in DNA. OK, so everyone see that? So the structure is–
I think I have it here. This would be co-variation
You’re changing the bases, but preserving the
ability to pair. So when one base change– when
the first base changes from A to U, the last base
changes from U to A in order to preserve
that pairing. You wouldn’t know that if
you just had two sequences, but once you get
several sequences, it can be pretty
compelling and allow you to make a pretty
strong inference that that is the structure
of that molecule. So how would you do this? So imagine you had a more
realistic example where you’ve got a non-coding RNA
that’s 100 or a few hundred bases long, and you might have
a multiple alignment of 50 homologous sequences. You’re not going to be able to see it by eye. You need sort of a more
objective criterion. So one method
that’s commonly used is this statistic, the mutual information. So if you look in your
multiple alignment– I’ll just draw this here. You have many sequences. You consider every
pair of columns– this is a multiple alignment,
so this column and this column– and you calculate
what we’re going to call– what are we
going to call it? f ix. That would be the frequency
of a nucleotide x. You’re in column i, so you just
count how many A’s, C’s, G’s, and T’s there are. And similarly, f jy for
all the possible values of x and all the
possible values of y. So these are the base
frequencies in each column. And then you calculate the
dinucleotide frequencies xy at each pair of columns. So in this colony, you say if
there’s an A here and a C here, and then there’s
another AC down here, and there’s a total of one,
two, three, four, five, six, seven sequences,
then f AC ij is 2/7. So you just calculate the
frequency of each dinucleotide. These are no longer consecutive
dinucleotides in a sequence necessarily there. They can be in
arbitrary spacing. OK, so you calculate
those and then you throw them
into this formula, and out comes a number. So what does this
formula remind of? Have you seen a
similar formula before? AUDIENCE: [INAUDIBLE] PROFESSOR: Someone said
[INAUDIBLE] Yeah, go ahead. AUDIENCE: It reminds me of the
Shannon entropy [INAUDIBLE]. PROFESSOR: Yeah, it looks
like Shannon entropy, but there’s a log
of a ratio in there, so it’s not exactly
Shannon entropy. So what other formula has
a log of a ratio in it? AUDIENCE: [INAUDIBLE] PROFESSOR: Relative. So it actually looks
like relative entropy. So relative entropy
of what versus what? Who can sort of say more
precisely if it’s– we’ll say it’s relative entropy of
something versus a p versus q. And what is p and what is q? Yeah, in the back. AUDIENCE: Is it relative
entropy of co-occurrence versus independent occurrence? PROFESSOR: Good. Yeah. Co-occurrence– everyone get that? Co-occurrence of a pair of
nucleotides x and y at positions i and j. And q is the independent occurrence. So if x and y occurred
independently, they would have this frequency. So if you think about it,
you calculate the frequency of each base at each column
in the multiple alignment. And this is like
your null hypothesis. You’re going to assume, what if
they’re evolving independently? So if it’s not a folded
RNA– or if it’s a folded RNA but those two columns
don’t happen to interact– there’s no reason to suspect
that those bases would have any relationship
to each other. So this is like
your expected value of the frequency of
xy in position ij. And then this p is
your observed value. So you’re taking relative
entropy of basically observed over expected. And so relative entropy
has– I haven’t proved this, but it’s non-negative. It can be 0, and then it
goes up to some maximum, a positive value, but
it’s never negative. And what would it be if,
in fact, p were equal to q? What would this formula give? This is where we’re
saying suppose. Suppose this. In general, this won’t
be sure, but suppose it was equal to that. We’ve got mi ij equals
summation of what? That log of this,
which is equal to this, so it’s fx i fy j
over the same thing– hope you can see that– log
of– log of 1 is 0, right? So it’s just 0. So if the nucleotides
of the two columns occur completely independently,
mutual information is 0. And that’s one reason it’s
called mutual information. There’s no information. Knowing what’s in
column i gives you no information about column j. So remember, relative entities
are measures of information, not entropy. And what is the maximum value
that the mutual information could have? Any ideas on that? Any guesses? Joe, yeah. AUDIENCE: You could have log base 2 of 1 over f sub x, f sub y. PROFESSOR: Of 1? OK, so you’re saying if one of
the particular dinucleotides had a frequency of 1? AUDIENCE: Yeah. So if they’re always the same
whenever there’s– like an A, there’s always going to be a T. PROFESSOR: Right. So whenever there’s an A,
there’s always a G or a T. AUDIENCE: So then you’d
get a 1 in the numerator, and they’re relative
probably in the bottom, which would be maximized if
they were all even. PROFESSOR: If they were all? [INTERPOSING VOICES] PROFESSOR: If they were uniform. Yeah. So did everyone get that? So the maximum occurs if f x i and f y j are both uniform, so they’re a quarter for every base at both positions. That’s the maximum entropy in the background distribution. But then f xy ij equals 1/4 when, for example, x equals y– or in our case, we’re not interested in that. We’re interested in x equals c of y, the complement of y– and 0 otherwise, for x not equal to the complement of y. OK, so for example, if we have
only the dinucleotides AT, CG, GC, and TA occur,
and each of them occurs with a
frequency of 1/4, then you’ll have four terms in
the sum because, remember, the 0 log 0 is 0. So you’ll have four terms
in the sum, and each of them will look like 1/4 times the log of 1/4 over 1/4 times 1/4. That ratio is 4, and log base 2 of 4 is 2. And so you have four terms that are each 1/4 times 2. And so you’ll get 2. Well, this is not a sum. These are the four terms. These are the individual
nonzero terms in that sum. Does that make sense? Everyone get this? So that’s why this is a useful
measure of co-variation. If what’s in one
column really strongly influences what’s
in the other column, and there’s a lot of
variation in the two columns, and so you can really see
that co-variation well, then mutual information
is maximized. And that’s basically
what we just said, is written down here. So it’s maximal. They don’t have to
be complementary. It would achieve this maximum
of 2 if they are complementary, but it would be also if they
had some other very specific relationship between
the nucleotides.
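Here is that statistic as a sketch in code (my own illustration– the toy alignment is built to match the pairing pattern of the mini-microRNA example from earlier, with the loop columns filled in arbitrarily). It computes the mutual information for every pair of columns and reports the highest-scoring ones.

```python
import math
from collections import Counter
from itertools import combinations

# Toy alignment: columns 1/8 and 2/7 (1-based) co-vary as complementary pairs.
aln = ["ACGAAAGU",
       "UAGACAUA",
       "AGGAAACU",
       "CGGACAUG",
       "GUGAAAAC"]

def mutual_information(i, j, aln):
    n = len(aln)
    fi = Counter(s[i] for s in aln)            # single-column frequencies f_i(x)
    fj = Counter(s[j] for s in aln)
    fij = Counter((s[i], s[j]) for s in aln)   # pair frequencies f_ij(x, y)
    return sum((c / n) * math.log2((c / n) / ((fi[x] / n) * (fj[y] / n)))
               for (x, y), c in fij.items())

ncols = len(aln[0])
ranked = sorted(((mutual_information(i, j, aln), i, j)
                 for i, j in combinations(range(ncols), 2)), reverse=True)
for mi, i, j in ranked[:3]:
    print(f"columns {i + 1} and {j + 1}: MI = {mi:.2f} bits")
# Columns 1/8 and 2/7 come out on top, as expected for the two-base-pair stem.
```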
So if you’re going to use this, the way you would use it is to take your multiple
alignment, calculate the mutual information
of each pair of columns– so you actually have to
make a table, i versus j, all possible pairs
of columns– and then you’re going to look for
the really high values. And then when you find
those high values, when you look at what actual bases
are tending to occur together, you’ll want to see
that they’re bases that are complementary
to one another. And another thing
that you’d want to see is you’d want to see that
consecutive positions in one part of the alignment
are co-varying with consecutive positions in
another part of the alignment in the right way, in this sort
of inverse complementary way that RNA likes to pair. Does that make sense? So in a sort of nested way
in your multiple alignment, if you saw that this
one co-varied with that, and then you also saw that
the next base co-varied with the base right
before this one, and this one co-varies
with that one, that starts to look like a stem. It’s much more likely that
you have a three-base stem than that you just
have some isolated base pair out in the
middle of nowhere. It turns out it takes
a few bases to make a good thermodynamically
stable stem, and so you want to look
for blocks of these things. And so this works pretty well. Yeah, actually, one point
I want to make first is that mutual
information is nice because it’s kind
of a useful concept and it also relates to some
of the entropy and relative entropy that we’ve been talking
about in the course before. But it’s not the only statistic
that would work in practice. You can use any measure of
basically non-independence between distributions. A chi square statistic
would probably work equally well in practice. And so here is a
multiple alignment of a bunch of sequences. And what I’ve done is
put boxes around columns that have significant mutual information with
other sets of columns. So for example, this set of
columns here at the left– the far left– has significant
mutual information with the ones at the far right. And these ones,
these four positions co-vary with these
four, and so forth. So can you tell,
based on looking at this pattern of
co-variation, what the structure is going to be? OK, let’s say we start up here. The first is going to
pair with the last, with something at the end. Then we’re going
to have something here in the middle that pairs
with something else nearby. Then we have something
here that pairs with something else nearby,
then we have another like that. Does that make sense? So that there’s these
three pairs of columns in the middle– these two, these
two, and these two– and then they’re surrounded
by this thing, the first pairing with the last. And so it’s a clover
leaf, so that’s tRNA. Yeah? AUDIENCE: So with that previous
slide, this table here, you could create a
co-variation matrix. How would that– or,
and it could be– PROFESSOR: How does that co-variation matrix– how do you convert it to this representation? AUDIENCE: I’m just wondering
how this would go up. Like let’s say you took
the co-variation matrix– PROFESSOR: Oh, what
would it look like? AUDIENCE: –and visualized
it as a heat map– PROFESSOR: In the
co-variation matrix. AUDIENCE: Yeah. What would it look like in
this particular example? PROFESSOR: Yeah,
that’s a good question. OK, let’s do that. I haven’t thought
about that before, so you’ll have to
help me on this. So here’s the beginning. We’re going to write
the sequence from 1 to n in both dimensions. And so here’s the beginning,
and it co-varies with the end. So this first would have a
co-variation with the last, and then the second would
co-vary with the second to last, and so forth. So you get a little
diagonal down here. That’s this top stem here. And then what about
the second stem? So then you have
something down here that’s going to co-vary with
something kind of near by it. So block two is going to
co-vary with block three. And again, it’s going to be
this inverse complementary kind of thing like that. It’s symmetrical, so
you get this with that. But you only have
to do one half, so you can just do
this upper half here. So you get that. So it would look
something like that. AUDIENCE: So with the
diagonal line orthogonal to the diagonal of the matrix– PROFESSOR: Yeah, that’s because
they’re inverse complementary. AUDIENCE: OK. PROFESSOR: That make sense? Good question. But we’ll see an
example like that later actually, as it turns out. All right, so here’s
my question for you. You’re studying this non-coding RNA. It has some length. You have some
number of sequences. They might have some structure. Is this method going to
work for you, or is it not? What is required for it to work? For example, would
I want to isolate this gene– this
non-coding RNA gene– just from primates, from
like human, gorilla, chimp, orangutan, and
do that alignment? Or would I want to go further? Would I want to go back to
the rodents and dog, horse– how far do you want to go? Yeah, question. AUDIENCE: I think we a need a
very strong sequence alignment for this, so we
cannot go very far, because if you don’t have
a high percentage homology, then you will see all
sorts of false positives. PROFESSOR: Absolutely. So if you go too far, your
alignment will suffer, and you need an
alignment in order to identify the
corresponding columns. So that puts an upper limit
on how far you can go. But excellent point. Is there a lower limit? Do you want to go as
close as possible, like this example I gave
with human, chimp, orangutan? Or is that too close? Why is too close bad? Tim? AUDIENCE: Maybe if
you’re too close, then the sequence is
having to [INAUDIBLE] to give you enough
information [INAUDIBLE]. PROFESSOR: Yeah, exactly. They’re all the same. Actually, you’ll
get 1 times 1 over 1 in that mutual information statistic, and the log of that is going to be 0. There’s zero mutual information
if they’re all the same. So there has to
be some variation, and the structure
has to be conserved. That’s key. You have to assume that the
structure is well conserved and you have to have
a good alignment and there has to
be some variation, a certain amount of variation. Those are basically
the three keys. Secondary structure more highly conserved than sequence. Sufficient divergence so that you have these variations, a sufficient number of homologues to get good statistics, and not so far that your alignment is bad. Sorry about that. Sally?
like another thing that we assume here is that
you can project it onto a plane and it will lie flat. So if you have some very
important, weird folding that allows you to, say,
crisscross the rainbow thing. PROFESSOR: Yeah,
crisscross the rainbow. Yeah, very good question. So in the example
of tRNA, if you were to do that arc
diagram for tRNA, it would look like
another big arc– that’s the first and
last– and then you have these three nested arcs. Nothing crisscrossing. What if I saw– [INAUDIBLE]–
two blocks of sequence that have a relationship like that? Is that OK? With this method, the
co-variation, that’s OK. There’s no problem there. What does this
structure look like? So [INAUDIBLE] you have a
stem, then you have a loop, and then a stem. So this is 1 pairs with 3. That’s 1. That’s 3. Then you’ve got 2 up
here, but 2 pairs with 4. So here’s 4 over
here, so 4 is going to have to come back up
here and pair with 2. This is 2 over here. So that is called a pseudoknot. It’s not really a knot
because this thing doesn’t go through the
loop, but it kind of behaves like a
knot in some ways.
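Here’s a tiny sketch of the crossing test (mine, not from the lecture): two base pairs (i, j) and (k, l) have crossing arcs– a pseudoknot– exactly when one arc starts inside the other but ends outside it.

```python
from itertools import combinations

def has_pseudoknot(pairs):
    """True if any two base pairs (i, j) and (k, l), each with i < j,
    have crossing arcs, i.e. i < k < j < l."""
    for a, b in combinations(pairs, 2):
        (i, j), (k, l) = sorted([a, b])
        if i < k < j < l:      # arc k-l starts inside arc i-j but ends outside
            return True
    return False

print(has_pseudoknot([(0, 9), (1, 8), (2, 7)]))  # nested stem: False
print(has_pseudoknot([(0, 5), (3, 8)]))          # crossing arcs: True
```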
And so do these actually occur in natural RNAs? Yes, Tim is nodding. And are they important? Can you give me an example
where they are important biologically? AUDIENCE: [INAUDIBLE] [INTERPOSING VOICES] PROFESSOR: Riboswitches. We’re going to come
to what riboswitches are in a moment for
those not familiar. And I think I have
an example later of a pseudoknot
that’s important. So that’s a good question. I think I should have added
to this list the point that you made in
the back that they have to be close enough that
you can get a good alignment. I should add that to this list. Thanks. It’s a good point. All right, so classes
of non-coding RNAs. As promised, my
favorites listed here. Everyone knows tRNAs, rRNAs. You can think of UTRs as being non-coding RNAs. They often have
structure that can be involved in
regulating the message. snRNAs are involved in splicing. snoRNAs– small
nucleolar RNAs– are involved in directing
modification of other RNAs, such as ribosomal
RNAs and snRNAs, for example. Terminators of
transcription in prokaryotes are like little stem
loop structures. RNaseP is an important enzyme. SRP is involved in targeting
proteins with signal peptides to the export machinery. We won’t go into tmRNA. microRNAs and lincRNAs, you probably know, and riboswitches. So Tim, can you tell us
what a riboswitch is? AUDIENCE: A riboswitch
is any RNA structure that changes conformation according to some stimulus [INAUDIBLE] or something in the cell. It could be an ion that triggers changes in the structure. [INAUDIBLE] PROFESSOR: Yeah, that was great. So just for those that
may not have heard, I’ll just say it again. So a riboswitch is
any RNA that can have multiple conformations, and changes conformation
temperature, binding of some ligand, small molecules,
something like that, et cetera. And often, one of
those structures will block a particular
regulatory element. I’ll show an
example in a moment. And so when it’s in
one conformation, the gene will be repressed. And when it’s in the other, it’ll be on. So it’s a way of using
RNA’s secondary structure to sense what’s
going on in the cell and to appropriately
regulate gene expression. All right, so now we’re going
to talk about a second approach. So this would be the approach. You’ve got some RNA. It may not do something,
and maybe you can’t find any homologues. It might be some newly
evolved species-specific RNA, or you’re studying some obscure species where you don’t have a lot
of genomic sequence around. So you want to use the
first-principles approach, the energy minimization approach. Or maybe you have
the homologues, but you don’t trust
your alignment. You want a second
opinion on what the structure is going to be. So just in the way
that protein folding– you could think of
an equilibrium model where it’s determined
by folding free energy, and enthalpy will
favor base pairing. You gain some enthalpy when you form a hydrogen bond, and entropy will tend
to favor unfolding. So an RNA molecule
that’s linear has all this conformational flexibility, and loses some of that when it forms a stem. It forms a helix. Those things don’t have as much flexibility. And even the nucleotides in the loop are a little bit constrained conformationally– they’re not as flexible
at high temperatures, it’ll favor unfolding. So the earliest
approaches were approaches that sought to maximize
the number of base pairs. So they basically ignore entropy
and focus on the enthalpy that you gain from
forming base pairs. And so Ruth Nussinov
described the first algorithm to figure out what is the
maximum number of base pairs that you can form in an RNA. And so a way to
think about this is imagine you’ve
got this sequence. What is the largest
number of base pairs I can form with this sequence? I could just draw all
possible base pairs. That A can pair with that T.
This A can pair with that T. They can’t both pair
simultaneously, right? And this C can pair with that G.
So if we don’t allow crossing, which– coming back
to Sally’s point– this would cross this, right? So we’re not going
to allow that. So the best you could do would be to have this A pair with this T and this C pair with this G and form this little structure. This is not realistic because RNA loops can’t be one base. The minimum is about three. But just for the
sake of argument, you can list all these
out, but imagine now you’ve got 100 bases here. Every base will on
average potentially be able to pair with
24 or 25 other bases. So you’re just going to have
just an incredible mishmash of possible lines
all crisscrossing. So how do you figure out how
to maximize that pairing? Any ideas? Don, yeah? AUDIENCE: You look for
sections of homology. PROFESSOR: We’re
not using homology. We’re doing [INAUDIBLE] AUDIENCE: I’m sorry, not
homology, but sections where– PROFESSOR: Complementary? AUDIENCE: Complementary. Yeah, that’s the
word I was thinking. PROFESSOR: The blocks
are complementary. AUDIENCE: And then so– PROFESSOR: You could blast
the sequence against inverse complements itself and
look for little blocks. You could do that. That’s not what
people generally do, mostly because the blocks of
complementarity in real RNA structures are really short. They can be two,
three, four, bases. Sally, yeah? AUDIENCE: Could you use
[INAUDIBLE] approach where you just start with a
very small case and build up? PROFESSOR: So we’ve seen that
work for protein sequence alignment. We’ve seen it work for
the Viterbi algorithm. So that is sort of the go-to
approach in bioinformatics, is to use some sort of
dynamic programming. Now this one for RNA
secondary structure that Nussinov came up
with is a little bit different than the others. So you’ll see it has a
kind of different flavor. It turns out to be a little hard to get your head around at the beginning, but it’s actually
easier to do by hand. So let’s take a look at that. OK, so recursive
maximization of base pairing. Now the thing about
base pairing that’s different from
these other problems is that the first
base in the sequence can base pair with the last. How do you chop up a sequence? Remember with Needleman-Wunsch
and with Viterbi we go from the
beginning to the end, and that’s a logical order. But with base pairing, that’s
actually not a logical order. You can’t really do it that way. So instead, you go
from the inside out. You start in the
middle of a sequence and work your way outwards
in both directions. Or another way to think about
it is you start with you write the sequence from 1
to n on both axes, and then actually we’ll see that
we initialize the diagonal all to 0’s. And then we think about
these positions here next. So 1 versus 2. Could 1 pair with 2? And could 2 pair with 3? Those are like little
bits of possible RNA secondary structure. Again, we’re ignoring the fact that loops have a certain minimum length. This is sort of a
simplified case. And then you build outwards. So you conclude that base 4
here could pair with base 5, so we’re going to put a 1 there. And then we’re going
to build outward from that toward the
beginning of the sequence and toward the end, adding
additional base pairs when we can. That’s basically the way
the [INAUDIBLE] works. And so that’s one
key idea, that we go from sort of
close sequences, work outward, to faraway sequences. And the second key idea
is that the relationship that, as you add more bases
on the outside of what you’ve already got, that the optimal
structure in that larger portion of sequence
space is related to the optimal structures
of smaller portions of it in one of four different ways. And these are the four ways. So let’s look at these. So the first one is
probably the simplest where if you’re doing this,
you’re here somewhere, meaning you’ve compared
sequences from position, let’s say, i minus
1 to j minus 1 here. And then we’re going to
consider adding– actually, it depends how you
number your sequence. Let me see how this is done. Sorry. i plus 1. i plus 1 to j minus 1. We figured out what the
optimal structure is in here, let’s suppose. And now we’re going to
consider adding one more base on either end. We’re going to add j
down here, and we’re going to ask if it pairs with i. And if so, we’re going to take
whatever the optimal structure was in here and we’re
going to add one base pair, and we’re going to
add plus 1 because now it’s got one additional. We’re counting base pairs. So that’s that first case there. And then the second case is
you could also consider just adding one unpaired base onto
whatever structure you had, and then you don’t add one. And you could go in
either direction. You can go sort of toward of
the beginning of the sequence or toward the end
of the sequence. And then the third
one is the tricky one, is what’s called a bifurcation. You could consider
that actually i and j are both paired, but
not with each other. That i pairs with something
that was inside here and j pairs with something
that was inside here. So your optimal parse
from i to j, if you will, is not going to come from the
optimal parse from i plus 1 to j minus 1. It’s going to come from
rethinking this and doing the optimal parse from here
to here and from here to here, and combining those two. So you’re probably
confused by now, so let me try to do an example. And then I have an analogy
that will confuse you further. So ask me for that one. This was the simplest
one I could come up with that has this property. OK, so we said before that
if you were doing the optimal from 1 to 5, that it would be
the AC pairing with the GT. We do that one. And now if you notice, this guy
is kind of a similar sequence. I just added a T at the
beginning and an A at the end. And so you can probably imagine
that the best structure of this is here, those three. You’ve got three pairs of
this sub-sequence here. That’s as good as you
can do with seven bases. You can only get three pairs. And this is as good as
you can do with five, so these are clearly optimal. So the issue comes that if
you’re starting from somewhere in the middle here– let’s
say you are– let’s see, so how would you be doing this? You start here. Let’s suppose the first two
you consider are these two. You consider pairing
that T with that A. You can see this is
not going to go well. You might end up with that
as your optimal substructure of this region. Remember, you’re working
from the inside out, so you’re going from here to
here, and you end up with that. And what do you do here? You don’t have a G
to pair the C to, so you add another
unpaired base. Now you’ve got this
optimal substructure of a sequence that’s
almost the whole sequence. It’s just missing the
first and last bases, but it only has
three base pairs. So when you go to add
this, you can say, oh, I can’t add any more base
pairs, so I’ve only got three. But you should consider
that we’ve already solved the optimal
structure of that, and we had two nice pairs here. We had that pair and
that pair, and we already solved the optimal structure of this portion here, and you had those three pairs. And so you can combine those
two and all of a sudden you can do much better. So that’s what that
bifurcation thing is about. So this is the
recursion written out, and you can see that's
the base pairing one. You can add one, or you can
just add an unpaired base and you don’t add anything. Or you consider all
the possible locations of bifurcations in-between the
two positions you’re adding, i and j, and you consider
all the possible pairs. You don't sum them up; you consider all of those cases, and then you take the maximum.
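Written out, using S(i, j) for the maximum number of base pairs achievable on the subsequence from position i to j, and delta(i, j) equal to 1 if bases i and j can pair and 0 otherwise (standard notation for this recurrence, not the slides' own), the recursion is:

$$
S(i,j) = \max \begin{cases}
S(i+1,\,j-1) + \delta(i,j) & \text{pair } i \text{ with } j \\
S(i+1,\,j) & \text{leave } i \text{ unpaired} \\
S(i,\,j-1) & \text{leave } j \text{ unpaired} \\
\max_{i<k<j} \left[ S(i,k) + S(k+1,j) \right] & \text{bifurcation at } k
\end{cases}
$$

with S(i, i) = 0 and S(i, i-1) = 0 as the initialization, which is the diagonal and sub-diagonal of the matrix that come up in a moment.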
All right, so the algorithm is to take an n by n matrix, initialize the diagonal to 0,
and initialize the sub-diagonal to 0 also. Just don’t think
too much about that. Just do it. And then fill in this
matrix recursively from the diagonal
up and to the right. And it actually doesn’t matter
what order you fill it in as long as you’re kind
of working your way up and to the right. You have to have the thing
to the left and the thing below already filled in if
you’re going to fill in a box. And then you keep track of
the optimal score, which is going to be the
number of base pairs. And then you also keep
track of how you got there. What base pair did you add
so that you can trace back? And then when you get up to the upper right corner of this matrix, you then trace back. A minimal code sketch of the whole fill and traceback is given below.
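Here is a minimal Python sketch of that fill and traceback, counting every allowed pair (Watson-Crick or G-U wobble) as 1. The function names and the toy sequence are illustrative, not from the lecture, and real folders also enforce a minimum hairpin loop size, which is omitted here for simplicity.

    def can_pair(a, b):
        """Return 1 if bases a and b form a Watson-Crick or G-U wobble pair."""
        return 1 if {a, b} in ({'A', 'U'}, {'C', 'G'}, {'G', 'U'}) else 0

    def fill_matrix(seq):
        """Fill S so that S[i][j] = max number of base pairs on seq[i..j]."""
        n = len(seq)
        S = [[0] * n for _ in range(n)]   # diagonal and sub-diagonal start at 0
        for span in range(1, n):          # inside out: short subsequences first
            for i in range(n - span):
                j = i + span
                best = max(S[i + 1][j],   # leave i unpaired
                           S[i][j - 1])   # leave j unpaired
                best = max(best, S[i + 1][j - 1] + can_pair(seq[i], seq[j]))
                for k in range(i + 1, j): # bifurcation: i and j pair, but not together
                    best = max(best, S[i][k] + S[k + 1][j])
                S[i][j] = best
        return S

    def trace_back(S, seq, i, j, pairs):
        """Recover one optimal set of base pairs from the filled matrix."""
        if i >= j:
            return
        if S[i][j] == S[i + 1][j]:        # i was left unpaired
            trace_back(S, seq, i + 1, j, pairs)
        elif S[i][j] == S[i][j - 1]:      # j was left unpaired
            trace_back(S, seq, i, j - 1, pairs)
        elif can_pair(seq[i], seq[j]) and S[i][j] == S[i + 1][j - 1] + 1:
            pairs.append((i, j))          # the diagonal move: add the pair (i, j)
            trace_back(S, seq, i + 1, j - 1, pairs)
        else:                             # otherwise it must be a bifurcation
            for k in range(i + 1, j):
                if S[i][j] == S[i][k] + S[k + 1][j]:
                    trace_back(S, seq, i, k, pairs)
                    trace_back(S, seq, k + 1, j, pairs)
                    return

    seq = "GGGAAAUCCC"                    # illustrative toy hairpin
    S = fill_matrix(seq)
    pairs = []
    trace_back(S, seq, 0, len(seq) - 1, pairs)
    print(S[0][len(seq) - 1], pairs)      # optimal pair count, one optimal pairing

Note the triple loop over i, j, and the bifurcation point k; that is exactly where the n-cubed running time discussed in a moment comes from.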
So here is a partially filled-in matrix. This is from the Nature Biotechnology review. And the 0's are filled in. So here's what I want
you to do at home, is print out, photocopy or
whatever– make this matrix, or make a bigger
version of it perhaps– and look at the sequence
and fill in this matrix, and fill in the little arrows
every time you add a base pair. It's actually not that hard. There are no bifurcations in this one– bifurcation is the tricky case, so you can ignore it. You'll just be
adding base pairs. It’ll be pretty easy. And then you can
reconstruct the structure. So here it is filled in. And the answer is given,
so you can check yourself. But do it without
looking at the answer. And then you go to the
upper right corner. That cell corresponds to the
optimal structure from the beginning of the
sequence to the end– which, of course, was our
goal all along. And then you trace
back and you can see whenever you’re
moving diagonally here, you’re adding a base pair. Remember, you add
one on each end, and so you’re moving diagonally
and adding the base pair, and you get this
little structure here. So computational complexity
of the algorithm. You could think about this
but I’ll just tell you. It’s memory n squared
because you’ve got to fill in this
matrix, so it's the square of the length of the sequence. Time is n cubed. That's the bad part. And why is it n cubed? It's n cubed because you have to
fill in a matrix that’s n by n. And then when you do
that maximization step, that check for bifurcations,
that’s sort of of order n, as well. So n cubed– so this means
that RNA folding is slow. And in fact, some
of the servers won’t allow you to fold anything
more than a thousand bases because they’ll take
forever or something like that. And it cannot
handle pseudoknots. If you think through
the recursion, pseudoknots will be a problem. I’m going to just
show you– yeah, I’ll get to this– that
these are from the viruses. Real viruses, some of
them have pseudoknots like these ones shown
here, and some even have these kissing loops,
which is another type where the two stem loops,
the loops interact. And the pseudoknots
in particular are important in the
viral life cycle. They can actually cause
programmed ribosomal frame shifting. When the ribosome hits one of these structures, normally it just denatures the RNA secondary structure. When it hits a
pseudoknot, it’ll actually get knocked back by
one and will start translating in a
different frame. And that’s actually
useful to the virus to do that under
certain circumstances. That’s how HIV makes the
polymerase it needs for replication– by doing a frame shift on the ribosome using a pseudoknot. So these things are important. And there are fancier
methods that use more sophisticated
thermodynamic models where GC counts more than AU. And I won’t go into
the details, but I just wanted to show you some
pretty pictures here that the Zuker
algorithm– this is a real world RNA folding
algorithm– calculates not only the minimum energy fold,
but also sub-optimal folds, and the probabilities of
particular base pairs, summing over all the possible
structures that RNA could form, weighted by their free energy. So it's computing the full partition function; the standard weighting is written out below.
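In the standard thermodynamic-ensemble formulation (this is textbook statistical mechanics, not anything specific to these slides), each structure s gets a Boltzmann weight, and the pair probabilities shown in the dot plot come from summing those weights:

$$
P(s) = \frac{e^{-E(s)/RT}}{Z}, \qquad Z = \sum_{s'} e^{-E(s')/RT}, \qquad p_{ij} = \sum_{s \,\ni\, (i,j)} P(s)
$$

where E(s) is the free energy of structure s, R is the gas constant, T is the temperature, Z is the partition function, and p_ij is the probability that bases i and j are paired in the ensemble.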
partition function. It’s not perfectly accurate. It gets about 70% of
base pairs correct, which means it usually
gets things right, but occasionally totally wrong. And there’s a website for the
Mfold server, which is actually one of the most beautiful
websites in bioinformatics, I would say. And also if you want
to run it locally, you should download the
Vienna RNAfold package, which has a very similar algorithm; a sketch of its Python interface is below.
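For what it's worth, folding a sequence through the Vienna package's Python bindings looks roughly like the sketch below; treat the exact API as an assumption to verify against the documentation for your installed version.

    # Sketch using the ViennaRNA Python bindings (assumes the ViennaRNA
    # package is installed; the API may differ between versions).
    import RNA

    seq = "GGGAAAUCCC"              # illustrative toy sequence, not from the lecture
    structure, mfe = RNA.fold(seq)  # minimum free energy fold
    print(structure)                # dot-bracket string, e.g. "(((....)))"
    print(mfe)                      # predicted free energy in kcal/mol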
And I just wanted to show you one or two examples. So this is the U5 snRNA. This is the output of Mfold. It predicts this structure. And then this is what's called
the energy dot plot, which shows the bases in the optimal
structure down below here and then sort of these
suboptimal structures here. And you can see
there’s no ambiguity. It’s totally confident
in this structure. Then I ran the lysine
riboswitch through this program, and I got this. I got the minimum
free energy structure down in the lower left. And then you see there's a
lot of other colored dots. Those are from the
suboptimal structures. So it looks like this thing
has multiple structures, which of course it does. So the way that this one works
is, in the absence of lysine, it forms this structure
where the ribosome binding sequence– this is
prokaryotic– is exposed. And so the ribosome
can enter and translate these lysine
biosynthetic enzymes. But then when lysine
accumulates to a certain level, it can interact with the
RNA and shift it’s structure so that you now form
this stem, which sequesters the ribosome
binding sequence and blocks lysine biosynthesis. So a very clever system. And it turns out
that there’s dozens of these things in
bacterial genomes, and they control a
lot of metabolism. So they’re very important. And there may be some
in eukaryotes, too, and that would be good. If anyone’s looking for
a project, not happy with their current
project, you might think about looking
for more riboswitches. So I’m going to
have to end there. And thank you guys
for your attention, and good luck on the midterm.