# 11. RNA Secondary Structure; Biological Functions and Predictions

The following

content is provided under a Creative

Commons license. Your support will help MIT

OpenCourseWare continue to offer high quality

educational resources for free. To make a donation or

view additional materials from hundreds of MIT courses,

visit MIT OpenCourseWare at ocw.mit.edu. PROFESSOR: All right. We should probably get started. So RNA plays important

regulatory and catalytic roles in biology, and so it’s

important to understand its function. And so that’s going to be the

main theme of today’s lecture. But before we get

to that, I wanted to briefly review what

we went over last time. So we talked about hidden

Markov models, some of the terminology, thinking

of them as generative models, terminology of the different

types of parameters, the initiation probabilities

and transition probabilities and so forth. And Viterbi algorithm, just

sort of the core algorithm used whenever you apply HMMs. Essentially, you always

use the Viterbi algorithm. And then we gave as an

example the CpG Island HMM, which is admittedly a

bit of a toy example. It’s not really

used in practice, but it illustrates the principles. And then today we’re going

to talk about a couple of real world HMMs. But before we get

to that, I just wanted to– sort

of toward the end, we talked about the

computational complexity of the algorithm, and concluded

that if you have a k-state HMM run on a sequence of length

L, it’s order k squared L. And this diagram is helpful

to many people in sort of thinking about that. So you can have transitions

from any state– for example, from this state–

to any of the other states, in this

five-state HMM. And when you’re

doing the Viterbi, you have to maximize over the

five possible input transitions into each state. And so the full

set of computations that you have to do in going

from position i to i plus 1 is k squared. Does that make sense? And then there’s L different

transitions you have to do, so it’s k squared L. Any questions about that? OK.
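To make that k squared L counting concrete, here is a minimal Viterbi sketch in Python. It is not from the lecture– the array layout, log-space handling, and the assumption that all probabilities are nonzero are mine– but the two loops show exactly where the L positions and the k-squared transition comparisons come from.

```python
import numpy as np

def viterbi(obs, init, trans, emit):
    """Toy Viterbi: k states, sequence of length L, O(k^2 * L) time.
    init: (k,) initial probs; trans: (k,k) transition probs;
    emit: (k, n_symbols) emission probs; obs: list of symbol indices.
    Assumes all probabilities are nonzero (we take logs)."""
    k, L = len(init), len(obs)
    logv = np.log(init) + np.log(emit[:, obs[0]])   # best log-prob per state
    back = np.zeros((L, k), dtype=int)              # best predecessor per state
    for i in range(1, L):                           # L positions...
        scores = logv[:, None] + np.log(trans)      # ...times k^2 transitions
        back[i] = scores.argmax(axis=0)
        logv = scores.max(axis=0) + np.log(emit[:, obs[i]])
    path = [int(logv.argmax())]                     # trace back the optimal path
    for i in range(L - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return path[::-1]
```

The inner maximization over the k possible predecessors, done for each of the k current states, is the k squared step described above. All right. And so the example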

that we gave is shown here. And what we did was to take an

example sort of where you could sort of see the answer–

not immediately see it, but by thinking about it a little, you could figure out the answer. And then we talked about how

the Viterbi algorithm actually works, and why it makes the

transitions at the right place. It seems intuitively like it

would make a transition later, but actually transitions

at the right place. And one way to

think about that is that these are not hard and

fast decisions because you’re optimizing two different paths. At every state, you’re

considering two possibilities. And so you explore the

possibility of– the first time you hit a c, you explore the

possibility of transitioning from genome to

island, but you’re not confirming whether you’re going

to do that yet until you get to the end and see whether that

path ends up having a higher probability at the end of the

sequence than the alternative. So that’s sort of one way

of thinking about that. Any questions about

this sort of thing, how to understand when a

transition will be made? And I want to emphasize,

for this simple HMM, we talked about

you can kind of see what the answer’s going to be. But if you have any HMM, any

sort of interesting real world HMM with multiple

states, there’s no way you’re going

to be able to see it. Maybe you could guess

what the answer might be, but you’re not going to be able

to be confident of what that is, which is why you have

to actually implement it. All right, good. Let’s talk about a couple

of real world HMMs. So I mentioned gene finding. That’s been a popular

application of HMMs, both in prokaryotes

and eukaryotes. There’s some examples

discussed in the text. Another very popular application

are so-called profile HMMs. And so this is a

hidden Markov model that’s made based on a multiple

alignment of proteins which have a related function

or share a common domain. For example, there’s

a database called Pfam, which includes

profile HMMs for hundreds of different types

of protein domains. And so once you have many

dozens or hundreds or thousands of examples of a

protein domain, you can learn lots of

things about it– not just what the

frequencies of each residue are in each position,

but how likely you are to have an

insertion at each position. And if you do have

an insertion, what types of amino acid residues

are likely to be inserted in that position,

and how often you are likely to have a

deletion at each position in the multiple alignment. And so the challenge then

is to take a query protein and to thread it through all

of these profile HMMs and ask, does it have a significant

match to any of them? And so that’s basically

how Pfam works. And the nice thing about

HMMs is that they allow you to– if you want to have

the same probability of an insertion at each position

in your multiple alignment, you can do that. But if you have enough data

to observe that there’s a five-fold higher likelihood of

having an insertion at position three in a multiple alignment

than there is at position two, you can put that in. You just change

those probabilities. So in this HMM, each

of the hidden states is either an M state, which is

a match state, or an I state, or an insert state. And so those will emit

actual amino acid residues. Or it could be a

delete state, which is thought of as emitting

a dash, a placeholder in the multiple alignment. So these are also widely used.
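Just to make the state layout concrete, here is a toy sketch in Python of the states and transitions of a small profile HMM. This is my own illustration, not Pfam’s actual parameterization– the naming and topology are assumptions for teaching purposes.

```python
def profile_hmm_topology(L):
    """Enumerate the states of a length-L profile HMM: match M1..ML,
    insert I0..IL, delete D1..DL (toy layout for illustration).
    M and I states emit residues; D states emit the gap placeholder."""
    states = ["I0"] + [f"{t}{i}" for i in range(1, L + 1) for t in "MID"]
    transitions = []
    for i in range(L + 1):
        sources = ["I0"] if i == 0 else [f"M{i}", f"I{i}", f"D{i}"]
        for s in sources:
            transitions.append((s, f"I{i}"))              # insert (self-loop)
            if i < L:
                transitions += [(s, f"M{i+1}"), (s, f"D{i+1}")]
    return states, transitions

states, transitions = profile_hmm_topology(3)
print(states)              # ['I0', 'M1', 'I1', 'D1', 'M2', ...]
print(len(transitions))    # 24 edges for a length-3 model
```

Making an insertion five-fold likelier after column three than after column two, as described above, is then just a matter of giving the M3-to-I3 edge a bigger probability than the M2-to-I2 edge. And then one of my favorite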

examples– it’s fairly simple, but it turns out to

be quite useful– is the so-called

TMHMM for prediction of transmembrane

helices in protein. So we know that many,

especially eukaryotic proteins, are embedded in membranes. And there’s one famous family

of seven transmembrane helix proteins, and there

are others that have one or a few

transmembrane helices. And knowing that a protein

has at least one transmembrane helix is very useful in terms

of predicting its function. It helps you predict its localization. And knowing that it’s a seven

transmembrane helix protein is also useful. And so you want to predict

whether the protein has transmembrane helices and

what their orientation is. That is, proteins can

have their N-terminus either inside the cell

or outside the cell. And then of course, where

exactly those helices are. And this program has

about a 97% accuracy, according to the authors.

So it works very well. So what properties

do you think– we said before that

you have to have strongly different

emission probabilities in the different hidden states

to have a chance of being able to predict

things accurately. So what properties do

you think are captured in a model of

transmembrane helices? What types of

emission probabilities would you want to have for the

different states in this model? Anyone? So for this protein,

what kind of residues would you have in here? Oops, sorry. I’m having trouble

with this thing. All right, here in the

middle of the membrane, what kind of residues are

you going to see there? AUDIENCE: [INAUDIBLE] PROFESSOR: Those are

going to be hydrophobic. Exactly. And what about right

where the helix emerges from the membrane? [INAUDIBLE] Charged residues are there to kind of anchor it and prevent it from sliding back into the membrane. And then in general, both on

the exterior and interior, you’ll tend to have more

hydrophilic residues. So that’s sort of

the basis of TMHMM. So this is the structure. And you’ll notice that these are

not exactly the hidden states that correspond to individual

amino acid residues. These are like meta

states, just to illustrate the overall structure. I’ll show you the actual

states on the next slide. But these were the

types of states that the author, Anders

Krogh, decided to model. So he has sort of a– focuses

here on the helix core. There’s also a cytoplasmic

cap and a non-cytoplasmic cap. Oops, didn’t mean that. And then there’s sort of a

globular domain on each side– both on the cytoplasmic

side, or you could have one on the

non-cytoplasmic side. OK, so there’s going to be

different compositions in each of these regions. Now one of the things we

talked about with HMMs is that if you were– now let’s

think about the helix core. The simplest model

you might think of would be to have sort

of a helix state, and then to allow that

state to recur to itself. OK, so this type of thing where

you then have some transition to some sort of cap state

after, this would allow you to model helices of any length. But now how long are

transmembrane helices? What does that

distribution look like? Anyone have an idea? There’s a certain

physical dimension. [INAUDIBLE] It takes a certain number of

residues to get across here, and then that number

is about 20-ish. So transmembrane helices

tend to be sort of on the order of 20

plus or minus a few. And so it’s totally unrealistic

to have a transmembrane helix that’s, like, five

residues long. So if you run this algorithm

in generative mode, what distribution of helix

lengths will you produce? We’re running in

generative mode where, remember, we’re going to generate a series of hidden

states and then associated amino acid sequences. It’s coming from some,

let’s say– I don’t know. What kind of states are there

here? The cytoplasmic one. Let’s say it goes into the

helix, hangs out here. I’m sorry, is there an

answer to this question? Anyone? I don’t know how

long– if I let it run, it’ll generate a random number. It depends on what this

probability is here. Let’s call this probability

p, and then this would be 1 minus p. OK, so obviously if

1 minus p is bigger, it’ll tend to produce

longer helices. But in general,

what is the shape of the distribution there of

consecutive helical states that this model will generate? AUDIENCE: Binomial. PROFESSOR: Binomial. OK, can you explain why? AUDIENCE: Because

the helix would have to have probable–

the helix of length n would occur 1 minus

p to the n power. PROFESSOR: OK, so for a helix of length n– say, let’s call it L, for the length of the helix– the probability that L equals n is 1 minus

p to the n, right? Is that binomial? Someone else? AUDIENCE: Yeah. Is it a negative binomial? PROFESSOR: Negative binomial. OK. AUDIENCE: [INAUDIBLE] states and

a helix state before moving out [INAUDIBLE]. PROFESSOR: Yeah. So the distribution is

going to be like that. You have to stay in here

for n and then leave. So this is the simplest case– it’s a special case of the negative binomial. But in general,

this distribution is called the

geometric distribution. Or a continuous version would

be the exponential distribution. So what is the shape

of this distribution? If I were to plot n down here on

this axis, and the probability that L equals n on this

axis, what kind of shape– could someone draw in the air? So you had up and then down? OK, so actually, it’s

going to be just down. Like that, right? Because as n increases,

this goes down because 1 minus

p is less than 1. So it just steadily goes down. And what is the mean

of this distribution? Anyone remember this? Yeah, so there’s sort

of two versions of this that you’ll see. One of them is P(L = n) = (1 − p)^(n−1) p, and the other is P(L = n) = (1 − p)^n p. And so this second one is the number of

failures before a success, if you will. Successes lead to the helix. And this is the number of

trials till the first success. So one of them has

a mean that’s 1/p, and the other has a mean

that’s 1 minus p over p. So usually, p is small, and

so those are about the same. So 1/p. You could think that

1/p is roughly right. And so if we were to model

transmembrane helices, and if transmembrane

helices are about– I said about 20

residues long– you would set p to what value

to get the right mean? AUDIENCE: 0.05. PROFESSOR: Yeah. 0.05. 1/20, so that 1 over that

will be about 20, right? And then 1 minus p

would, of course, be 0.95. So if I were to do that, I would get a distribution that looks about like this with a mean of 20.
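You can check that shape and mean numerically– a quick sketch, with p = 0.05 taken from the discussion above:

```python
import numpy as np

p = 0.05                          # probability of leaving the helix state
n = np.arange(1, 201)
pmf = (1 - p) ** (n - 1) * p      # geometric: P(L = n), trials-to-success form
print(pmf.argmax() + 1)           # mode is 1: the curve only decays
print((n * pmf).sum())            # mean ~ 1/p = 20 (tiny truncation error)
```

The mode sits at n equals 1 and the density just decays– nothing like a sharp peak at 20. But if I were to then look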

at real transmembrane helices and look at their

distribution, I would see something

totally different. It would probably

look like that. It would have a mean around 20. But the probability of anything

less than 15 would be 0. That’s too short. It can’t go across the membrane. And then again, you don’t

have ones that are 40. They don’t kind of wiggle around

in there and then come out. They tend to just

go straight across. So there’s a problem here. You can see that if you want

to make a more accurate model, you want to not only get the

right emission probabilities with the right probabilities of

hydrophobics and hydrophilics and the different

states, but you also want to get the length right. And so the trick that–

well, actually, yeah. Can anyone think of tricks

to get the right length distribution here? How do we do better than this? Basically, hidden

Markov models where you have a state that

will recur to itself, it will always be a

geometric distribution. The only choice you have is

what is that probability. And so you can get

any mean you want, but you always get this shape. So if you want a

more general shape, what are some tricks

that you could do? How could you change the model? Any ideas? Yeah, go ahead. AUDIENCE: [INAUDIBLE] have

multiple helix states. PROFESSOR: Multiple

helix states. OK. How many? AUDIENCE: Proportional to the

length we want, [INAUDIBLE]. PROFESSOR: Like one for

each possible length. AUDIENCE: It’d be

less than one length. PROFESSOR: Or less than one. OK. So you could have

something like– I mean, let’s say you have like this. Helix begin– or,

helix 1, helix 2. You allow each of these

to recur to themselves. What does that get you? This actually gets you

something a little bit better. It gives you a little bit of a peak– it’s more like that. So that’s better. But if I want to get the exact

distribution, then actually one– so this is the solution

that the authors actually used. They made essentially 25

different helix states, and then they allowed various

different transitions here. So it’s a little arbitrary here, but they have this special state three

that can kind of take a jump. So it can just

continue on to four, and that’ll make your

maximum length helix core. Or it can skip one, go

to five, and that’ll make a helix core that’s one

residue shorter than that, or it can skip

two, and so forth. And you can set

any probabilities you want on these transitions. And so you can fit basically an arbitrary distribution within a fixed range of lengths that’s determined by how many states you have.
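Here is a little simulation of that idea– a toy of my own, not the actual TMHMM parameters. A jump state picks how many of the chained helix states to skip, so the length distribution within the allowed range is whatever you set the skip probabilities to.

```python
import numpy as np

rng = np.random.default_rng(0)
N_CHAIN = 25                                   # chained helix states

# Hypothetical skip distribution: helix length = N_CHAIN - skip,
# so these weights put all the mass on lengths 18..25, peaked near 21-22.
skip_weights = np.array([2, 4, 6, 8, 8, 6, 4, 2], dtype=float)
skip_probs = skip_weights / skip_weights.sum()

lengths = N_CHAIN - rng.choice(len(skip_probs), size=10000, p=skip_probs)
print(np.bincount(lengths, minlength=N_CHAIN + 1)[15:])
# zero counts below 18, a hump around 21-22, nothing above 25
```

With a single self-looping state you could only pick the mean; with explicit skip transitions you pick the whole shape. OK, so they really wanted to get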

the length distribution right, and that’s what they did. What’s the cost of this? What’s the downside? Simona? AUDIENCE: I was

just going to ask, it looks like from

this your minimum helix length could be four. PROFESSOR: Yeah. That’s a good question. Well, we don’t know what

the probabilities are– they didn’t say. Well, did they really mean that? And also, that’s only the core,

and maybe these cap things can be– yeah, that seems

a little short to me. So yeah, I agree. I’m not sure. It could just be for the

sake of illustration, but they don’t

actually use those. But anyway, I’d probably have to reread the paper. I haven’t read this

paper for many years so I don’t remember

exactly the answer to that. But I have a citation. You can look it up

if you’re curious. But the main point I

wanted to make with this is just that by setting an

arbitrary number of states and putting in possible

transitions between them, you can actually construct

any length of distribution you want. But there is a downside,

and what is that downside? AUDIENCE: Computational cost. PROFESSOR: Yeah, the

computational cost. Instead of having

one helix state, now we’ve got 25 or something. And the time goes up with the

square of the number of states, so it’s going to run slower. And you also have to estimate

all these parameters. OK, so here’s an example

of the output of the TMHMM program for a mouse

chloride channel gene, CLC6. So the program

predicts that there are seven transmembrane

helices, as shown by these little red blocks here. You can see they’re all about

the same– about 20 or so– and that the program starts

outside and ends inside. So let’s say you were going

to do some experiments on this protein to

test this prediction. So one of the types of

experiments people do is they put some

sort of modifiable or modified residue

into one of the spaces between the

transmembrane helices. And then you can test,

by treating the cell with a membrane-impermeable chemical, can you modify that protein? Only if that stretch is on the outside of the cell will you be able to modify it. So that’s a way of

testing the topology. So if you were doing those

types of experiments, you might actually– like

maybe you’re not sure if every transmembrane

helix is correct. There could be some

where the boundaries were a little off, or

even a wrong helix. And so one of the

things that you often want with a

prediction is not only to know what is the optimal

or most likely prediction, but also how confident

is the algorithm in each of the parts of its prediction. How confident is it in the

location of transmembrane helix three or the probability

that actually there is a transmembrane helix three. And so the way that this program

does that is using something called the

forward-backward algorithm. So those of you who read

the Rabiner tutorial, it’s described

pretty well there. The basic idea is

that I mentioned that this P(O)– the probability

of the observable sequence summing over all

possible HMM structures or all possible sequences

of hidden states– that is possible to calculate. And the way that

you do it is you run an algorithm that’s

similar to the Viterbi, but instead of taking

the maximum entering each hidden state at

intermediate positions, you sum those inputs. So you just do the

sum at every point. And it turns out that the sum of the k values at the end will be equal to the sum of

the probabilities of generating the observable sequence

over all possible sequences of hidden states. OK, so that’s useful. And then you can also

run it backwards. There’s no reason it has to be

only going in one direction. And so what you do is you run

these sort of summing versions of the Viterbi in both

the forward direction and also run one in

the backward direction. And then you take a

particular position here– like let’s say this is your

helix state, for example. And we’re interested

in this position somewhere in the

middle of the protein. Is that a helix or not? And so basically

you take the value that you get here

from the forward in your forward

algorithm and the value that you get here in

the backward algorithm, and multiply those two

together, and divide by this P(O). And that gives you

the probability. So that ends up being

a way of calculating the sum of all

the parses that go through this particular

position i in the sequence in that particular state.
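For those who want it written down, here is a minimal forward-backward sketch in Python, using the same toy conventions as the Viterbi sketch above. It is my own illustration: real implementations rescale each column to avoid underflow on long sequences.

```python
import numpy as np

def posterior(obs, init, trans, emit):
    """Forward-backward sketch: posterior P(state k at position i | O)."""
    k, L = len(init), len(obs)
    fwd = np.zeros((L, k)); bwd = np.zeros((L, k))
    fwd[0] = init * emit[:, obs[0]]
    for i in range(1, L):                    # like Viterbi, but sum, not max
        fwd[i] = (fwd[i - 1] @ trans) * emit[:, obs[i]]
    bwd[L - 1] = 1.0
    for i in range(L - 2, -1, -1):
        bwd[i] = trans @ (emit[:, obs[i + 1]] * bwd[i + 1])
    p_obs = fwd[-1].sum()                    # P(O), summed over all paths
    return fwd * bwd / p_obs                 # each row sums to 1
```

The forward pass is exactly the Viterbi recursion with max replaced by sum, and fwd[-1].sum() is the P(O) being divided by. I mean, I realize that may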

not have been totally clear, and I don’t want to take more

time to totally go into it, but it is pretty well

described in Rabiner. And I’ll just give

you an example. So if you’re motivated,

please take a look at that. And if you have

further questions, I’d be happy to discuss

during office hours next week. And this is what it looks like

for this particular protein. So you get something called the

posterior probability, which is the sum of the probabilities

of all the parses. And they’ve plotted it for

the particular state that is in the Viterbi path, that

is in the optimal parse– so for example, in blue here. Well, actually, they’ve done

it for all the different states here. So blue is the probability

that you’re outside. OK, so it’s very, very

confident that the N-terminus of the protein is

outside the cell. It’s very, very confident

in the locations of transmembrane

helices one and two. It actually more

often than not thinks there’s actually a

third helix right here, but that didn’t make it

in the optimal parse. That actually occurs in

the majority of parses, but not in the optimal. And it’s probably because it

would then cause other things to be flipped later on if you

had a transmembrane helix there. It’s not sure whether

there’s a helix there or not, but then it’s

confident in this one. OK, so this gives you an idea. Now if you wanted to do some

sort of test of the prediction, you want to test probably

first the higher confidence predictions, so you might

do something right here. Or if maybe from

experience you know that when it has a

probability that’s that high, it’s always right, so

there’s no point testing it. So you should test one of these

kind of less confident regions. So this actually makes the

prediction much more useful to have some degree

of confidence assigned to each part of the prediction. So for the remainder

of today, I want to turn to the topic of

RNA secondary structure. So at the beginning,

I will sort of get through some nomenclature. And then to motivate the topic,

give some biological examples of RNA structure. Gives me an excuse to show some

pretty pictures of structure. And then we’ll talk about

two approaches which are two of the most widely used

approaches toward predicting structure. So using evolution

to predict structure by the method of co-variation,

which works well when you have many homologous sequences. And then using sort

of first principles thermodynamics to predict

secondary structure by energy minimization

where obviously you don’t need to have a

homologous sequence present. And the Nature Biotechnology primer on RNA folding

that I recommended is a good intro to the

energy minimization approach. So what is RNA

secondary structure? So you all know that

RNAs, like proteins, have a three-dimensional

tertiary fold structure that, in many cases, determines

their function. But there’s also sort of

a simpler representation of this structure where you just

describe which pairs of bases are hydrogen bonded

to one another. OK, and so for RNA– here’s a famous example of an RNA structure, this

sort of clover leaf structure that all tRNAs have. The secondary structure of the

tRNA is the set of base pairs. So it’s this base pair

here between the first base and this one toward

the end, and then this base right here, and so forth. And so if you specify

all those base pairs, then you can then draw a picture

like this, which gives you a good idea of what parts of

the RNA molecule are accessible. So for example,

it won’t tell you where the anticodon

loop is, which is sort of the business

end of the tRNA. But it narrows it down

to three possibilities. You might consider that,

or that, or down here. It’s unlikely to be

something in here because these bases

are already paired. They can’t pair to the message. So it gives you sort of a first

approximation toward the 3D structure, and so

it’s quite useful. So how do we represent

secondary structure? So there’s a few different

common representations that you’ll see. So one is– and this is sort

of a computer-friendly but not terribly human-friendly

representation, I would say– is

this sort of dot-parenthesis notation here. So the dot is an unpaired

base and the parenthesis is a paired base. And how do you know– chalk

is sort of non-uniformly distributed here– so if you

have a structure like this and you have these

three parentheses, what are they paired to? Well, you don’t know yet

until you get further down. And then each left

parenthesis has to have a right

parenthesis somewhere. So now if we see

this, then we know that there are two

unpaired bases here, and then there’s

going to be three in a row that are

paired– these guys. We don’t know what

they’re paired to yet. Then there’s going to be a

five-base loop, maybe a little pentagon-type thing. Two, three, four–

oops– four, five. And this one would be

the right parentheses that pair with the left

parentheses over here. I should probably

draw this coming out to make it clearer

that it’s not paired. So this notation you

can convert to this. So after a while, it’s

relatively easy to do this, except when they’re super long. So that’s what the left part

of that would look like. So what about the right part? So the right part, we have

something like one, two, three, four, bunch of dots, and then

we have two, and then a dot, and then two. What does that thing look like? So that’s going to look like

four bases here in a stem. Big loop, and then there’s

going to be two bases that are paired, and then

a bulge, and then two more that are paired. These things happen

in real structures.
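Converting the dot-parenthesis string into explicit base pairs is a classic stack exercise– each right parenthesis closes the most recent unmatched left one. A short sketch; the function name and example string are mine:

```python
def pairs_from_dotbracket(db):
    """Convert dot-bracket notation to a sorted list of 0-based base pairs.
    Crossing arcs (pseudoknots) cannot be written in this notation."""
    stack, pairs = [], []
    for i, c in enumerate(db):
        if c == '(':
            stack.append(i)         # remember the open position
        elif c == ')':
            pairs.append((stack.pop(), i))   # close the most recent open
    return sorted(pairs)

print(pairs_from_dotbracket("..(((....)))"))
# [(2, 11), (3, 10), (4, 9)] — a three-base-pair stem with a four-base loop
```

OK, and then the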

arced notation is a little more human-friendly. It actually draws an

arc between each pair of bases that are

hydrogen bonded. So I’m sure you can imagine

what those structures would look like. And it turns out that the

arcs are very important. Like whether those

arcs cross each other or not is sort of a fundamental

classification of RNA secondary structures, into

the ones that are tractable and the ones that

are really difficult. So pretty pictures of RNA. So this is a lower

resolution cryo-EM structure of the bacterial ribosome. Remember, ribosomes have two

sub-units– a large sub-unit, 50S, and a small sub-unit, 30S. And if you crack it open–

OK, so you basically split. You sort of break the ribosome

like that, and you look inside, they’re full of tRNAs. So there are three

pockets that are normally distinguished within ribosomes. The A site– this

is the site where the tRNA enters

that’s going to add a new amino acid to the

growing peptide chain. The P site, where this tRNA will have its [INAUDIBLE] with the

actual growing peptide. And then the exit– the E site– where this tRNA will eventually go; that’s the tRNA whose amino acid was added a couple

of residues ago. So people often think

of RNA structure just in terms of these

secondary structures because they’re much

easier to generate than tertiary structures, and

they give you– like for tRNA, it gives you some pretty good

information about how it works. But for a large and complex

structure like the ribosome, it turns out that

RNA is actually not bad at building

complex structures. I would say it’s not

as good as protein, but it is capable of

constructing something like a long tube. And in fact, in

the ribosome, you find such a long

tube right here. That is where the peptide

that’s been synthesized exits the ribosome. And you’ll notice it’s not

a large cavity in which the protein might start folding. It’s a skinny tube that is thin

enough that the polypeptide has to remain linear, cannot

start folding back on itself. So you sort of

extrude the protein in a linear, unfolded

conformation, and let it fold outside

of the ribosome. If it could fold inside

that, that might clog it up. That’s probably one reason why

it’s not designed that way. I’m sure that was tried

by evolution and rejected. So if you look at the

ribosome– now remember, the ribosome is composed

of both RNA and protein– you’ll see that it’s much

more of one than the other. And so it’s really much more

of the fettuccine, which is the RNA part, than the

linguini of the protein. And if you also look

at the distribution of the proteins on

the ribosome, you’ll see that they’re

not in the core. They’re kind of decorated

around the edges. It really looks like something

that was originally made out of RNA, and then you sort of

added proteins as accessories later. And that’s probably

what happened. This is based on

the structures that were solved a few years ago. If you then look at where

the nearest proteins are to the active site– actual

catalytic site– remember, the ribosome catalyzes the addition of an amino acid to a growing peptide, so

peptide bond formation– you’ll find that the

nearest proteins are around 18 to 20 angstroms away. And this is too far

to do any chemistry, so the active site

residues or molecules need to be within

a few angstroms to do any useful chemistry. And so this basically

proves that the ribosome is a ribozyme. That is, it’s an RNA enzyme. [INAUDIBLE] So here is the

structure of a ribosome. It’s very kind of beautiful,

and it’s impressive that somebody can actually

solve the structure of something this big. But what is actually

the practical use of this structure? Turns out there’s quite an

important practical application of knowing the structure. Any ideas? AUDIENCE: Antibiotics. PROFESSOR: Antibiotics. Exactly. So many antibiotics work by

taking advantage of differences between the prokaryotic

ribosome structure and eukaryotic

ribosome structure. So if you can make

a small molecule– these are some examples–

that will inhibit prokaryotic ribosomes

but hopefully not inhibit eukaryotic

ribosomes, then you can kill bacteria that

might be infecting you. So non-coding RNA. So there’s many different

families of non-coding RNAs, and I’m going to list

some in a moment. And I’m going to

actually challenge you, see if you can

come up with any more families of non-coding RNAs. But they’re receiving

increasing interest, I would say, ever since

microRNAs were discovered. There’s been sort of a boom in looking

at different types of non-coding RNAs. lincRNAs are also important and interesting, as well as many of the classical RNAs like

tRNAs and rRNAs and snoRNAs. There may be new aspects of

their regulation and function that will be interesting. And so when you’re

studying a non-coding RNA, it’s very, very helpful

to know its structure. If it’s going to base pair in

trans with some other RNA– as tRNAs do, as microRNAs

do, for example, or snRNAs and snoRNAs– then

you want to know which parts of the

molecule are free and which are

internally base-paired. And if you want to predict

non-coding RNA genes in a genome, you may want to look

for regions that are under selection for

conservation of RNA structure, for conservation

of the potential to base pair at some distance. If you see that,

it’s much more likely that that region of the genome

encodes a non-coding RNA than that it’s, for example, a coding exon, or that it’s a

transcription factor binding site or something like that

that functions at the DNA level. So having this

notion of structure– even just secondary structure–

is helpful for that application as well, and predicting

functions as well, as I mentioned. So co-variation. So let’s take a look

at these sequences. So imagine you’ve discovered a

new class of mini microRNAs. They’re only eight bases long, and you’ve sequenced five homologues from your

five favorite mammals. And these are the

sequences that you get. And you know that

they’re homologous by synteny–

they’re in the same place in the genome, and they seem

to have the same function. What could you say about

their secondary structure based on this

multiple alignment? You have to stare at it a

little bit to see the pattern. There’s a pattern here. Any ideas? Anyone have a guess about

what the structure is? Yeah, go ahead. AUDIENCE: There’s a two

base pair stem, and then a four base loop. PROFESSOR: Two base pair

stem, four base loop, and then the other side of the stem. So how do you know that? AUDIENCE: So if you

look at the first two and last two bases

of each sequence, the first and the

eighth nucleotide can pair with each other, and so

can the second and the seventh. PROFESSOR: Yeah. Everyone see that? So in the first

column you have AUACG, and that’s

complementary to UAUGC. Each base is complementary. And the second position is

CAGGU complementary to GUCUA. There’s one slight

exception there. AUDIENCE: [INAUDIBLE] PROFESSOR: Yeah. Well, it turns out that that

RNA– although the Watson Crick pairs GC and AU are the

most stable– GU pairs are only a little bit

less stable than AU pairs, and they occur in

natural RNA molecules. So GU is allowed in RNA

even though you would never see that in DNA. OK, so everyone see that? So the structure is–

I think I have it here. This would be co-variation. You’re changing the bases, but preserving the

ability to pair. So when one base changes– when

the first base changes from A to U, the last base

changes from U to A in order to preserve

that pairing. You wouldn’t know that if

you just had two sequences, but once you get

several sequences, it can be pretty

compelling and allow you to make a pretty

strong inference that that is the structure

of that molecule. So how would you do this? So imagine you had a more

realistic example where you’ve got a non-coding RNA

that’s 100 or a few hundred bases long, and you might have

a multiple alignment of 50 homologous sequences. With something like that, you’re not going to be able to see it by eye. You need sort of a more

objective criterion. So one method

that’s commonly used is this statistic, the

mutual information. So if you look in your

multiple alignment– I’ll just draw this here. You have many sequences. You consider every

pair of columns– this is a multiple alignment,

so this column and this column– and you calculate

what we’re going to call– what are we going to call it? f_i(x). That would be the frequency

of a nucleotide x. You’re in column i, so you just

count how many A’s, C’s, G’s, and T’s there are. And similarly, f_j(y) for

all the possible values of x and all the

possible values of y. So these are the base

frequencies in each column. And then you calculate the

dinucleotide frequencies f_ij(x,y) at each pair of columns. So in this column, you say if

there’s an A here and a C here, and then there’s

another AC down here, and there’s a total of one,

two, three, four, five, six, seven sequences,

then f_ij(A,C) is 2/7. So you just calculate the

frequency of each dinucleotide. These are no longer consecutive

dinucleotides in a sequence necessarily there. They can be in

arbitrary spacing. OK, so you calculate

those and then you throw them

into this formula, and out comes a number.
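Here is that calculation as a small Python function– my own sketch of the formula M_ij = sum over x,y of f_ij(x,y) log2( f_ij(x,y) / (f_i(x) f_j(y)) ), with the frequencies estimated by simple counting from two alignment columns:

```python
from collections import Counter
from math import log2

def mutual_information(col_i, col_j):
    """M_ij for two aligned columns given as equal-length strings."""
    n = len(col_i)
    fi = Counter(col_i); fj = Counter(col_j)        # per-column base counts
    fij = Counter(zip(col_i, col_j))                # joint pair counts
    return sum((c / n) * log2((c / n) / ((fi[x] / n) * (fj[y] / n)))
               for (x, y), c in fij.items())

# Perfect covariation with uniform base usage gives the maximum of 2 bits:
print(mutual_information("ACGU", "UGCA"))   # 2.0
# Identical, invariant columns carry no information:
print(mutual_information("AAAA", "UUUU"))   # 0.0
```

The two printed cases preview the discussion below: perfectly covarying, uniform columns give the maximum of 2 bits, and invariant columns give 0. So what does this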

formula remind you of? Have you seen a

similar formula before? AUDIENCE: [INAUDIBLE] PROFESSOR: Someone said

[INAUDIBLE] Yeah, go ahead. AUDIENCE: It reminds me of the

Shannon entropy [INAUDIBLE]. PROFESSOR: Yeah, it looks

like Shannon entropy, but there’s a log

of a ratio in there, so it’s not exactly

Shannon entropy. So what other formula has

a log of a ratio in it? AUDIENCE: [INAUDIBLE] PROFESSOR: Relative. So it actually looks

like relative entropy. So relative entropy

of what versus what? Who can sort of say more

precisely if it’s– we’ll say it’s relative entropy of

something versus a p versus q. And what is p and what is q? Yeah, in the back. AUDIENCE: Is it relative

entropy of co-occurrence versus independent occurrence? PROFESSOR: Good. Yeah. Co-occurrence–

everyone get that? Co-occurrence of a pair of

nucleotides x,y at positions i,j. Versus q is an

independent occurrence. So if x and y occurred

independently, they would have this frequency. So if you think about it,

you calculate the frequency of each base at each column

in the multiple alignment. And this is like

your null hypothesis. You’re going to assume, what if

they’re evolving independently? So if it’s not a folded

RNA– or if it’s a folded RNA but those two columns

don’t happen to interact– there’s no reason to suspect

that those bases would have any relationship

to each other. So this is like

your expected value of the frequency of

xy in position ij. And then this p is

your observed value. So you’re taking relative

entropy of basically observed over expected. And so relative entropy

has– I haven’t proved this, but it’s non-negative. It can be 0, and then it

goes up to some maximum, a positive value, but

it’s never negative. And what would it be if,

in fact, p were equal to q? What would this formula give? This is where we’re

saying suppose. Suppose this. In general, this won’t be true, but suppose it was equal to that. We’ve got M_ij equals the summation of what? That times log of this, which is equal to this– so it’s f_i(x) f_j(y) over the same thing– hope you can see that– and log of 1 is 0, right? So it’s just 0. So if the nucleotides

of the two columns occur completely independently,

mutual information is 0. And that’s one reason it’s

called mutual information. There’s no information. Knowing what’s in

column i gives you no information about column j. So remember, relative entropies

are measures of information, not entropy. And what is the maximum value

that the mutual information could have? Any ideas on that? Any guesses? Joe, yeah. AUDIENCE: You could have log base 2 of 1 over f sub x, f sub y. PROFESSOR: Of 1? OK, so you’re saying if one of

the particular dinucleotides had a frequency of 1? AUDIENCE: Yeah. So if they’re always the same

whenever there’s– like an A, there’s always going to be a T. PROFESSOR: Right. So whenever there’s an A,

there’s always a G or a T. AUDIENCE: So then you’d

get a 1 in the numerator, and their relative probabilities in the bottom, which would be maximized if

they were all even. PROFESSOR: If they were all? [INTERPOSING VOICES] PROFESSOR: If they were uniform. Yeah. So did everyone get that? So the maximum occurs if f_i(x) and f_j(y)– they’re both uniform, so they’re a quarter for

every base at both positions. That’s the maximum entropy in

the background distribution. But then if f_ij(x,y) equals 1/4– for example, for x equals y, or in our case, we’re not interested in that; we’re interested in x equals c(y), where c(y) is the complement of y– and 0 otherwise, for x not equal to c(y). OK, so for example, if we have

only the dinucleotides AT, CG, GC, and TA occur,

and each of them occurs with a

frequency of 1/4, then you’ll have four terms in

the sum because, remember, the 0 log 0 is 0. So you’ll have four terms

in the sum, and each of them will look like 1/4 times log2 of 1/4 over 1/4 times 1/4. And so this ratio will be 4, and log2 of 4 is 2. And so you have four terms

that are each 1/4 times 2. And so you’ll get 2. Well, this is not a sum. These are the four terms. These are the individual

nonzero terms in that sum. Does that make sense? Everyone get this? So that’s why this is a useful

measure of co-variation. If what’s in one

column really strongly influences what’s

in the other column, and there’s a lot of

variation in the two columns, and so you can really see

that co-variation well, then mutual information

is maximized. And that’s basically

what we just said, is written down here. So it’s maximal. They don’t have to

be complementary. It would achieve this maximum

of 2 if they are complementary, but it would also be maximal if they

had some other very specific relationship between

the nucleotides. So if you’re going to use

this, the way you would use it is take your multiple

alignment, calculate the mutual information

of each pair of columns– so you actually have to

make a table, i versus j, all possible pairs

of columns– and then you’re going to look for

the really high values. And then when you find

those high values, when you look at what actual bases

are tending to occur together, you’ll want to see

that they’re bases that are complementary

to one another. And another thing

that you’d want to see is you’d want to see that

consecutive positions in one part of the alignment

are co-varying with consecutive positions in

another part of the alignment in the right way, in this sort

of inverse complementary way that RNA likes to pair. Does that make sense? So in a sort of nested way

in your multiple alignment, if you saw that this

one co-varied with that, and then you also saw that

the next base co-varied with the base right

before this one, and this one co-varies

with that one, that starts to look like a stem. It’s much more likely that

you have a three-base stem than that you just

have some isolated base pair out in the

middle of nowhere. It turns out it takes

a few bases to make a good thermodynamically

stable stem, and so you want to look

for blocks of these things. And so this works pretty well. Yeah, actually, one point

I want to make first is that mutual

information is nice because it’s kind

of a useful concept and it also relates to some

of the entropy and relative entropy that we’ve been talking

about in the course before. But it’s not the only statistic

that would work in practice. You can use any measure of

basically non-independence between distributions. A chi square statistic

would probably work equally well in practice. And so here is a

multiple alignment of a bunch of sequences. And what I’ve done is

put boxes around columns that have significant mutual information with

other sets of columns. So for example, this set of

columns here at the left– the far left– has significant

mutual information with the ones at the far right. And these ones,

these four positions co-vary with these

four, and so forth. So can you tell,

based on looking at this pattern of

co-variation, what the structure is going to be? OK, let’s say we start up here. The first is going to

pair with the last, with something at the end. Then we’re going

to have something here in the middle that pairs

with something else nearby. Then we have something

here that pairs with something else nearby,

then we have another like that. Does that make sense? So that there’s these

three pairs of columns in the middle– these two, these

two, and these two– and then they’re surrounded

by this thing, the first pairing with the last. And so it’s a clover

leaf, so that’s tRNA. Yeah? AUDIENCE: So with that previous

slide, this table here, you could create a

co-variation matrix. How would that– or,

and it could be– PROFESSOR: How does that co-variation matrix– how do you convert it to this representation? AUDIENCE: I’m just wondering

how this would go up. Like let’s say you took

the co-variation matrix– PROFESSOR: Oh, what

would it look like? AUDIENCE: –and visualized

it as a heat map– PROFESSOR: In the

co-variation matrix. AUDIENCE: Yeah. What would it look like in

this particular example? PROFESSOR: Yeah,

that’s a good question. OK, let’s do that. I haven’t thought

about that before, so you’ll have to

help me on this. So here’s the beginning. We’re going to write

the sequence from 1 to n in both dimensions. And so here’s the beginning,

and it co-varies with the end. So this first would have a

co-variation with the last, and then the second would

co-vary with the second to last, and so forth. So you get a little

diagonal down here. That’s this top stem here. And then what about

the second stem? So then you have

something down here that’s going to co-vary with

something kind of near by it. So block two is going to

co-vary with block three. And again, it’s going to be

this inverse complementary kind of thing like that. It’s symmetrical, so

you get this with that. But you only have

to do one half, so you can just do

this upper half here. So you get that. So it would look

something like that. AUDIENCE: So with the

diagonal line orthogonal to the diagonal of the matrix– PROFESSOR: Yeah, that’s because

they’re inverse complementary. AUDIENCE: OK. PROFESSOR: That make sense? Good question. But we’ll see an

example like that later actually, as it turns out.
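If you wanted to build that heat map, it is just the mutual information computed for every pair of columns. A sketch reusing the mutual_information function from the earlier snippet– the loop structure is my own illustration:

```python
import numpy as np

def mi_matrix(alignment):
    """All-pairs column mutual information for a list of equal-length
    aligned sequences. Returns an upper-triangular L x L matrix."""
    L = len(alignment[0])
    cols = ["".join(seq[i] for seq in alignment) for i in range(L)]
    M = np.zeros((L, L))
    for i in range(L):
        for j in range(i + 1, L):        # symmetric, fill upper triangle only
            M[i, j] = mutual_information(cols[i], cols[j])
    return M
```

Stems show up as short anti-diagonal streaks of high M_ij– exactly the bands perpendicular to the main diagonal just described. All right, so here’s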

my question for you. You’re studying this non-coding RNA. It has some length. You have some

number of sequences. They might have some structure. Is this method going to

work for you, or is it not? What is required for it to work? For example, would

I want to isolate this gene– this

non-coding RNA gene– just from primates, from

like human, gorilla, chimp, orangutan, and

do that alignment? Or would I want to go further? Would I want to go back to

the rodents and dog, horse– how far do you want to go? Yeah, question. AUDIENCE: I think we need a

very strong sequence alignment for this, so we

cannot go very far, because if you don’t have

a high percentage homology, then you will see all

sorts of false positives. PROFESSOR: Absolutely. So if you go too far, your

alignment will suffer, and you need an

alignment in order to identify the

corresponding columns. So that puts an upper limit

on how far you can go. But excellent point. Is there a lower limit? Do you want to go as

close as possible, like this example I gave

with human, chimp, orangutan? Or is that too close? Why is too close bad? Tim? AUDIENCE: Maybe if

you’re too close, then the sequence is

having to [INAUDIBLE] to give you enough

information [INAUDIBLE]. PROFESSOR: Yeah, exactly. They’re all the same. Actually, you’ll

get 1 times 1 over 1 in that mutual information statistic, and the log of that is going to be 0. There’s zero mutual information

if they’re all the same. So there has to

be some variation, and the structure

has to be conserved. That’s key. You have to assume that the

structure is well conserved and you have to have

a good alignment and there has to

be some variation, a certain amount of variation. Those are basically

the three keys. Secondary structure more highly conserved than sequence. Sufficient divergence so that you have these variations, and a sufficient number of homologues to get good statistics– and not so far

that your alignment is bad. Sorry about that. Sally? AUDIENCE: It seems

like another thing that we assume here is that

you can project it onto a plane and it will lie flat. So if you have some very

important, weird folding that allows you to, say,

crisscross the rainbow thing. PROFESSOR: Yeah,

crisscross the rainbow. Yeah, very good question. So in the example

of tRNA, if you were to do that arc

diagram for tRNA, it would look like

another big arc– that’s the first and

last– and then you have these three nested arcs. Nothing crisscrossing. What if I saw– [INAUDIBLE]–

two blocks of sequence that have a relationship like that? Is that OK? With this method, the

co-variation, that’s OK. There’s no problem there. What does this

structure look like? So [INAUDIBLE] you have a

stem, then you have a loop, and then a stem. So this is 1 pairs with 3. That’s 1. That’s 3. Then you’ve got 2 up

here, but 2 pairs with 4. So here’s 4 over

here, so 4 is going to have to come back up

here and pair with 2. This is 2 over here. So that is called a pseudoknot. It’s not really a knot

because this thing doesn’t go through the

loop, but it kind of behaves like a

knot in some ways. And so do these actually

occur in natural RNAs? Yes, Tim is nodding. And are they important? Can you give me an example

where they are important biologically? AUDIENCE: [INAUDIBLE] [INTERPOSING VOICES] PROFESSOR: Riboswitches. We’re going to come

to what riboswitches are in a moment for

those not familiar. And I think I have

an example later of a pseudoknot

that’s important. So that’s a good question. I think I should have added

to this list the point that you made in

the back that they have to be close enough that

you can get a good alignment. I should add that to this list. Thanks. It’s a good point. All right, so classes

of non-coding RNAs. As promised, my

favorites listed here. Everyone knows tRNAs, rRNAs. You can think of UTRs as being non-coding RNAs. They often have

structure that can be involved in

regulating the message. snRNAs are involved in splicing. snoRNAs– small

nucleolar RNAs– are involved in directing

modification of other RNAs, such as ribosomal

RNAs and snRNAs, for example. Terminators of

transcription in prokaryotes are like little stem

loop structures. RNaseP is an important enzyme. SRP is involved in targeting

proteins with signal peptides to the export machinery. We won’t go into tmRNA. microRNAs and lincRNAs, you probably know, and riboswitches. So Tim, can you tell us

what a riboswitch is? AUDIENCE: A riboswitch

is any RNA structure that changes

conformation according to some stimulus [INAUDIBLE]

or something in the cell. It could be an ion, critical

changes in the structure. [INAUDIBLE] PROFESSOR: Yeah, that was great. So just for those that

may not have heard, I’ll just say it again. So a riboswitch is

any RNA that can have multiple conformations, and changes conformation in response to some stimulus–

temperature, binding of some ligand, small molecules,

something like that, et cetera. And often, one of

those structures will block a particular

regulatory element. I’ll show an

example in a moment. And so when it’s in

one conformation, the gene will be repressed. And when it’s in

the other, it’ll be on. So it’s a way of using

RNA’s secondary structure to sense what’s

going on in the cell and to appropriately

regulate gene expression. All right, so now we’re going

to talk about a second approach. So this would be the approach. You’ve got some RNA– it may do something– and maybe you can’t find any homologues. It might be some newly

evolved species-specific RNA, or you’re studying

some obscure species where you don’t have a lot

of genomic sequence around. So you want to use the

first-principles approach, the energy

minimization approach. Or maybe you have

the homologues, but you don’t trust

your alignment. You want a second

opinion on what the structure is going to be. So just in the way

that protein folding– you could think of

an equilibrium model where it’s determined

by folding free energy, and enthalpy will

favor base pairing. You gain some enthalpy when you form a hydrogen bond, and entropy will tend

to favor unfolding. So an RNA molecule

that’s linear has all this conformational flexibility, and loses some of that

when you form a stem. It forms a helix. Those things don’t have

as much flexibility. And even the nucleotides in

the loop are a little bit conformationally constrained– they’re not as flexible as they were when it was linear. So that means that

at high temperatures, it’ll favor unfolding. So the earliest

approaches were approaches that sought to maximize

the number of base pairs. So they basically ignore entropy

and focus on the enthalpy that you gain from

forming base pairs. And so Ruth Nussinov

described the first algorithm to figure out what is the

maximum number of base pairs that you can form in an RNA. And so a way to

think about this is imagine you’ve

got this sequence. What is the largest

number of base pairs I can form with this sequence? I could just draw all

possible base pairs. That A can pair with that T.

This A can pair with that T. They can’t both pair

simultaneously, right? And this C can pair with that G.

So if we don’t allow crossing, which– coming back

to Sally’s point– this would cross this, right? So we’re not going

to allow that. So the best you could do would be to have this A pair with this T and this C pair with this G

and form this little structure. This is not realistic because

RNA loops can’t be one base. The minimum is about three. But just for the

sake of argument, you can list all these

out, but imagine now you’ve got 100 bases here. Every base will on

average potentially be able to pair with

24 or 25 other bases. So you’re just going to have

just an incredible mishmash of possible lines

all crisscrossing. So how do you figure out how

to maximize that pairing? Any ideas? Don, yeah? AUDIENCE: You look for

sections of homology. PROFESSOR: We’re

not using homology. We’re doing [INAUDIBLE] AUDIENCE: I’m sorry, not

homology, but sections where– PROFESSOR: Complementary? AUDIENCE: Complementary. Yeah, that’s the

word I was thinking. PROFESSOR: The blocks

are complementary. AUDIENCE: And then so– PROFESSOR: You could blast

the sequence against its own inverse complement and

look for little blocks. You could do that. That’s not what

people generally do, mostly because the blocks of

complementarity in real RNA structures are really short. They can be two,

three, four, bases. Sally, yeah? AUDIENCE: Could you use

[INAUDIBLE] approach where you just start with a

very small case and build up? PROFESSOR: So we’ve seen that

work for protein sequence alignment. We’ve seen it work for

the Viterbi algorithm. So that is sort of the go-to

approach in bioinformatics, is to use some sort of

dynamic programming. Now this one for RNA

secondary structure that Nussinov came up

with is a little bit different than the others. So you’ll see it has a

kind of different flavor. It turns out to be

actually it’s a little hard to get your head around

at the beginning, but it’s actually

easier to do by hand. So let’s take a look at that. OK, so recursive

maximization of base pairing. Now the thing about

base pairing that’s different from

these other problems is that the first

base in the sequence can base pair with the last. How do you chop up a sequence? Remember with Needleman-Wunsch

and with Viterbi we go from the

beginning to the end, and that’s a logical order. But with base pairing, that’s

actually not a logical order. You can’t really do it that way. So instead, you go

from the inside out. You start in the

middle of a sequence and work your way outwards

in both directions. Or another way to think about

it is you write the sequence from 1

to n on both axes, and then actually we’ll see that

we initialize the diagonal all to 0’s. And then we think about

these positions here next. So 1 versus 2. Could 1 pair with 2? And could 2 pair with 3? Those are like little

bits of possible RNA secondary structure. Again, we’re ignoring

the fact that loops have to have a certain minimum size. This is sort of a

simplified case. And then you build outwards. So you conclude that base 4

here could pair with base 5, so we’re going to put a 1 there. And then we’re going

to build outward from that toward the

beginning of the sequence and toward the end, adding

additional base pairs when we can. That’s basically the way

the [INAUDIBLE] works. And so that’s one

key idea, that we go from sort of

close sequences, work outward, to faraway sequences. And the second key idea

is that the relationship that, as you add more bases

on the outside of what you’ve already got, that the optimal

structure in that larger portion of sequence

space is related to the optimal structures

of smaller portions of it in one of four different ways. And these are the four ways. So let’s look at these. So the first one is

probably the simplest where if you’re doing this,

you’re here somewhere, meaning you’ve compared

sequences from position, let’s say, i minus

1 to j minus 1 here. And then we’re going to

consider adding– actually, it depends how you

number your sequence. Let me see how this is done. Sorry. i plus 1. i plus 1 to j minus 1. We figured out what the

optimal structure is in here, let’s suppose. And now we’re going to

consider adding one more base on either end. We’re going to add j

down here, and we’re going to ask if it pairs with i. And if so, we’re going to take

whatever the optimal structure was in here and we’re

going to add one base pair, and we’re going to

add plus 1 because now it’s got one additional. We’re counting base pairs. So that’s that first case there. And then the second case is

you could also consider just adding one unpaired base onto

whatever structure you had, and then you don’t add one. And you could go in

either direction. You can go sort of toward of

the beginning of the sequence or toward the end

of the sequence. And then the third

one is the tricky one, is what’s called a bifurcation. You could consider

that actually i and j are both paired, but

not with each other. That i pairs with something

that was inside here and j pairs with something

that was inside here. So your optimal parse

from i to j, if you will, is not going to come from the

optimal parse from i plus 1 to j minus 1. It’s going to come from

rethinking this and doing the optimal parse from here

to here and from here to here, and combining those two. So you’re probably

confused by now, so let me try to do an example. And then I have an analogy

that will confuse you further. So ask me for that one. This was the simplest

one I could come up with that has this property. OK, so we said before that

if you were doing the optimal from 1 to 5, that it would be

the AC pairing with the GT. We do that one. And now if you notice, this guy

is kind of a similar sequence. I just added a T at the

beginning and an A at the end. And so you can probably imagine

that the best structure of this is here, those three. You’ve got three pairs of

this sub-sequence here. That’s as good as you

can do with seven bases. You can only get three pairs. And this is as good as

you can do with five, so these are clearly optimal. So the issue comes that if

you’re starting from somewhere in the middle here– let’s

say you are– let’s see, so how would you be doing this? You start here. Let’s suppose the first two

you consider are these two. You consider pairing

that T with that A. You can see this is

not going to go well. You might end up with that

as your optimal substructure of this region. Remember, you’re working

from the inside out, so you’re going from here to

here, and you end up with that. And what do you do here? You don’t have a G

to pair the C to, so you add another

unpaired base. Now you’ve got this

optimal substructure of a sequence that’s

almost the whole sequence. It’s just missing the

first and last bases, but it only has

three base pairs. So when you go to add

this, you can say, oh, I can’t add any more base

pairs, so I’ve only got three. But you should consider

that we’ve already solved the optimal

structure of that, and we had two nice pairs here. We had that pair and

that pair, and we already solved the substructure

of the optimal structure of this portion here, and

you had those three pairs. And so you can combine those

two and all of a sudden you can do much better. So that’s what that

bifurcation thing is about. So this is the recursion written out, and you can see that’s

the base pairing one. You can add one, or you can

just add an unpaired base and you don’t add anything. Or you consider all

the possible locations of bifurcations in between the two positions you’re adding, i and j. For each possible split point, you add up the scores of the two halves– you don’t sum across the different splits, you consider them all, and then you take the maximum.
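To put the four cases in one place, here’s the recurrence he’s been describing, written out in standard notation (the notation is mine, not the slide’s): let \(S(i,j)\) be the maximum number of base pairs on the subsequence from position \(i\) to position \(j\), and let \(\delta(i,j)\) be 1 if bases \(i\) and \(j\) can pair and 0 otherwise.

$$
S(i,j) = \max
\begin{cases}
S(i+1,\,j-1) + \delta(i,j) & \text{($i$ pairs with $j$)} \\
S(i+1,\,j) & \text{($i$ left unpaired)} \\
S(i,\,j-1) & \text{($j$ left unpaired)} \\
\max\limits_{i<k<j}\; S(i,k) + S(k+1,\,j) & \text{(bifurcation at $k$)}
\end{cases}
$$

All right, so the algorithm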

is to take an n by n matrix, initialize the diagonal to 0,

and initialize the sub-diagonal to 0 also. Just don’t think

too much about that. Just do it. And then fill in this

matrix recursively from the diagonal

up and to the right. And it actually doesn’t matter

what order you fill it in as long as you’re kind of working your way up and to the right. You have to have the thing

to the left and the thing below already filled in if

you’re going to fill in a box. And then you keep track of

the optimal score, which is going to be the number of base pairs. And then you also keep

track of how you got there. What base pair did you add

so that you can trace back? And then when you get

up to the upper right corner of this matrix,

you then trace back.
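Before looking at that, here’s the whole procedure as a short Python sketch. This is my own minimal implementation of the base-pair-maximization dynamic program just described, not code from the course; it keeps the lecture’s simplifications (+1 per base pair, no minimum loop length, Watson-Crick pairs on the DNA-style letters used on the slides), and the function names, variable names, and toy sequence are mine.

```python
def can_pair(a, b):
    """Watson-Crick pairs only (DNA-style letters, as on the slides)."""
    return {a, b} in ({"A", "T"}, {"G", "C"})

def nussinov(seq):
    n = len(seq)
    # S[i][j] = max number of base pairs in seq[i..j] (0-based, inclusive).
    # Initializing everything to 0 covers the diagonal and sub-diagonal.
    S = [[0] * n for _ in range(n)]
    # Work from short subsequences outward (inside out).
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            best = max(S[i + 1][j],              # case: i left unpaired
                       S[i][j - 1])              # case: j left unpaired
            if can_pair(seq[i], seq[j]):         # case: i pairs with j
                best = max(best, S[i + 1][j - 1] + 1)
            for k in range(i + 1, j):            # case: bifurcation at k
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    # Trace back from the upper right corner to recover one optimal structure.
    pairs, stack = [], [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if S[i][j] == S[i + 1][j]:
            stack.append((i + 1, j))
        elif S[i][j] == S[i][j - 1]:
            stack.append((i, j - 1))
        elif can_pair(seq[i], seq[j]) and S[i][j] == S[i + 1][j - 1] + 1:
            pairs.append((i, j))
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):            # must be a bifurcation
                if S[i][j] == S[i][k] + S[k + 1][j]:
                    stack.extend([(i, k), (k + 1, j)])
                    break
    return S[0][n - 1], sorted(pairs)

# Toy usage (made-up sequence, not the one on the slide):
score, pairs = nussinov("TACGTGCA")
print(score, pairs)
```

Note the fill order: the outer loop runs over subsequence lengths, which is one concrete way of working your way up and to the right. So here is a partially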

filled-in version of this matrix. This is from that Nature Biotechnology review. And the 0’s are filled in. So here’s what I want

you to do at home, is print out, photocopy or

whatever– make this matrix, or make a bigger

version of it perhaps– and look at the sequence

and fill in this matrix, and fill in the little arrows

every time you add a base pair. It’s actually not that hard. There are no bifurcations in

this, so that’s the tricky one. Ignore that one. You’ll just be

adding base pairs. It’ll be pretty easy. And then you can

reconstruct the sequence. So here it is filled in. And the answer is given,

so you can check yourself. But do it without

looking at the answer. And then you go to the

upper right corner. That cell gives you the score of the optimal structure from the beginning of the

sequence to the end– which, of course, was our

goal all along. And then you trace

back and you can see whenever you’re

moving diagonally here, you’re adding a base pair. Remember, you add

one on each end, and so you’re moving diagonally

and adding the base pair, and you get this

little structure here. So computational complexity

of the algorithm. You could think about this

but I’ll just tell you. It’s memory n squared

because you’ve got to fill in this

matrix, so square of the length of the sequence. Time n cubed. This is bad now. And why is it n cubed? It’s n cubed because you have to

fill in a matrix that’s n by n. And then when you do

that maximization step, that check for bifurcations,

that’s sort of order n, as well.
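As a quick back-of-the-envelope check (the standard counting argument, spelled out):

$$
\underbrace{O(n^2)}_{\text{cells }(i,j)} \times \underbrace{O(n)}_{\text{split points }k} = O(n^3)\ \text{time}, \qquad O(n^2)\ \text{memory}.
$$

So n cubed– so this means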

that RNA folding is slow. And in fact, some

of the servers won’t allow you to fold anything

more than a thousand bases because they’ll take

forever or something like that. And it cannot

handle pseudoknots. If you think through

the recursion, pseudoknots will be a problem. I’m going to just

show you– yeah, I’ll get to this– that these are from viruses. Real viruses, some of

them have pseudoknots like these ones shown

here, and some even have these kissing loops,

which is another type of interaction, where the loops of the two stem-loops pair with each other. And the pseudoknots

in particular are important in the

viral life cycle. They can actually cause

programmed ribosomal frameshifting. Normally, when the ribosome hits RNA secondary structure, it just denatures it. When it hits a

pseudoknot, it’ll actually get knocked back by one nucleotide and will start translating in a

different frame. And that’s actually

useful to the virus to do that under

certain circumstances. That’s how HIV makes its polymerase– by doing a frameshift on

the ribosome using a pseudoknot. So these things are important. And there’s fancier

methods that use more sophisticated

thermodynamic models where GC counts more than AU. And I won’t go into

the details, but I just wanted to show you some

pretty pictures here that the Zuker

algorithm– this is a real world RNA folding

algorithm– calculates not only the minimum energy fold,

but also sub-optimal folds, and the probabilities of

particular base pairs, summing over all the possible

structures that RNA could form, weighted by their free energy. So it’s the full

partition function.
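In symbols, that Boltzmann-weighted sum is the standard McCaskill-style partition function (the notation here is mine: \(E(s)\) is the free energy of structure \(s\), \(R\) the gas constant, \(T\) the temperature):

$$
Z = \sum_{s} e^{-E(s)/RT}, \qquad
P(i \cdot j) = \frac{1}{Z} \sum_{s \,\ni\, (i,j)} e^{-E(s)/RT}.
$$

It’s not perfectly accurate. It gets about 70% of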

base pairs correct, which means it usually

gets things right, but occasionally totally wrong. And there’s a website for the

Mfold server, which is actually one of the most beautiful

websites in bioinformatics, I would say. And also if you want

to run it locally, you should download the

Vienna RNAfold package, which has a very

similar algorithm.
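If you do grab the Vienna package, the basic fold is a one-liner from Python. A minimal sketch, assuming the ViennaRNA Python bindings are installed and that RNA.fold works as I recall (returning a dot-bracket structure and the minimum free energy); the toy sequence is made up:

```python
# Sketch: minimum free energy fold via the ViennaRNA Python bindings.
import RNA  # ViennaRNA's Python module (assumed installed)

seq = "GCGGAUUUAGCUCAGUUGGG"    # made-up toy sequence, not from the lecture
structure, mfe = RNA.fold(seq)  # dot-bracket string, free energy (kcal/mol)
print(structure, mfe)
```

And I just wanted to show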

you one or two examples. So this is the U5 snRNA. This is the output of Mfold. It predicts this structure. And then this is what’s called

the energy dot plot, which shows the base pairs in the optimal structure down below here and then sort of these

suboptimal structures here. And you can see

there’s no ambiguity. It’s totally confident

in this structure. Then I ran the lysine

riboswitch through this program, and I got this. I got the minimum free energy structure down in the lower left. And then you see there’s a

lot of other colored dots. Those are from the

suboptimal structures. So it looks like this thing

has multiple structures, which of course it does. So the way that this one works

is, in the absence of lysine, it forms this structure

where the ribosome binding sequence– this is prokaryotic– is exposed. And so the ribosome

can enter and translate these lysine

biosynthetic enzymes. But then when lysine

accumulates to a certain level, it can interact with the

RNA and shift its structure so that you now form

this stem, which sequesters the ribosome

binding sequence and blocks lysine biosynthesis. So a very clever system. And it turns out

that there’s dozens of these things in

bacterial genomes, and they control a

lot of metabolism. So they’re very important. And there may be some

in eukaryotes, too, and that would be good. If anyone’s looking for

a project, not happy with their current

project, you might think about looking

for more riboswitches. So I’m going to

have to end there. And thank you guys

for your attention, and good luck on the midterm.
