Hello. I’m Vivian Cheung. I’m an investigator of the Howard Hughes Medical Institute and a professor of pediatric neurology at the Life Sciences Institute at the University of Michigan. Today, in a three-part series, I’m going to talk about RNA variation, from gene expression to RNA sequences. On our planet, there are tens of thousands of species, from plants to land and marine mammals. We see this in nature just from the wide range of colors of trees, and the sizes and shapes of those trees. Of course, all this biodiversity gives us the beauty of nature, but at the same time it’s also fundamental to our survival, because it provides us with the food and fuels that we need. In order for us to have this biodiversity, we need a lot of building blocks. So, we need, at the molecular level, the same extent of variation. What I’d like to talk about today is that variation in RNA.
We know from the genome project that the sizes of genomes and the number of genes by themselves don’t really give us all the answers to this complex set of functions. So, on this slide, I show the number of genes in roundworms as well as in humans. We see that worms and humans basically have the same number of genes. And if we compare wheat to human, the genome size of wheat and the number of genes in wheat are actually much larger than in human. So, the complexity is probably not just from how big our genomes are or how many genes we have. We have known for a long time that there is not a straightforward, one-to-one relationship between DNA, RNA, and protein. What we see, in fact, is that one DNA sequence can be made into different forms of RNA transcript. These different forms come either from alternative splicing or from RNA editing, which actually changes the sequences of RNA from their corresponding DNA. And from these RNA transcripts, then, different isoforms of proteins are made.
So, from one DNA sequence, we end up having different forms of RNA transcripts, as well as different forms of proteins. So, today, what I would like to discuss is mostly variation in RNA, at the gene expression level as well as in RNA sequences. Geneticists really need this diversity in order to study the genetic basis of a trait, but in mechanistic studies this extensive variation can be viewed almost as a nuisance. What I hope to convey is that this variation is not an unwanted complication. In fact, it allows us to understand regulation.
So, let me start by talking about variation in RNA, in particular, in gene expression levels. In this experiment, we did something very simple: we measured the expression level of genes in B cells from blood samples. We collected blood samples from 50 unrelated individuals. Then, we extracted B cells and measured the expression levels in those B cells. Shown here are examples of genes that showed very little variability across individuals, such as Parkin-7, here, where the range between the individuals with the lowest and the highest expression levels is very small. Whereas, in the rest of the genes, shown as representative examples, we see a tremendous amount of variability. On the y axis is gene expression on a log2 scale. Each individual is shown as a dot. So, the difference between the individual with the lowest expression level, say, in this gene, and the individual with the highest expression level is at least 10- to 20-fold, or even more. What I’ve shown you on this slide are data from unrelated individuals. If we want to know if there is a genetic basis to this variation, we need to look at people who are related, such as siblings in a family, or people who are genetically identical, such as monozygotic twins. So, we did exactly that.
In addition to the unrelated individuals, we also measured gene expression levels in siblings in families, as well as in monozygotic twins. Here, I’ve plotted five examples, showing the variance across unrelated individuals in green, the variance across siblings in orange, and the variance across monozygotic twins in blue. In each case, what we see is that in unrelated individuals there’s a lot more variability than in individuals who are genetically identical, such as twins, suggesting that there is a genetic basis to this variation in gene expression. But we can look at this much more formally by calculating heritability. In order for us to calculate heritability, we can ask, what does it really mean for a trait to be genetically controlled? If that’s the case, then when children inherit a trait from their parents, they should be more similar to their parents. So, here, we’re asking the same thing of expression levels: are the children more like their parents in terms of gene expression level? I’ll give you an example, here, for GSTM2, where on the y axis is the expression level of individuals, and on the horizontal axis is the average of their two parents. We see that individuals with low expression come from parents who also have low expression levels of GSTM2, whereas individuals who have high expression levels of GSTM2 come from parents who have high expression levels of GSTM2. As a result, when we look at the slope of this graph, we see a slope of 0.75, which is rather high. And we see, also listed here, examples of other genes that have high heritability. We’re certainly not saying that, for these genes, the expression level is completely genetically controlled, but it certainly suggests that there is a strong genetic component. So, what does this really mean?
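The GSTM2 plot just described is a regression of each child’s expression level on the average of the two parents, and the slope of that line is the heritability estimate. A minimal sketch of that calculation follows. The function name and the numbers are hypothetical, made up so that the slope comes out at the 0.75 mentioned for GSTM2; they are not data from the study.

```python
from statistics import mean

def midparent_slope(midparent, offspring):
    """Least-squares slope of offspring values on mid-parent values.

    Under the usual simplifying assumptions, this slope is read as an
    estimate of narrow-sense heritability.
    """
    mx, my = mean(midparent), mean(offspring)
    sxy = sum((x - mx) * (y - my) for x, y in zip(midparent, offspring))
    sxx = sum((x - mx) ** 2 for x in midparent)
    return sxy / sxx

# Hypothetical log2 expression values (illustration only, not study data).
midparent = [6.0, 7.0, 8.0, 9.0, 10.0]   # average of the two parents
offspring = [6.5, 7.0, 8.0, 8.5, 9.5]    # expression level in the child

h2_estimate = midparent_slope(midparent, offspring)
print(round(h2_estimate, 2))  # 0.75 for these made-up numbers
```

A slope near 1 would mean children closely track their parents; a slope near 0 would mean parental expression tells us little about the child.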
We usually do not think of gene expression levels as a phenotype, but I think our analysis shows that they are very similar to the quantitative phenotypes that we’re all familiar with, such as glucose level or blood pressure, or the sizes and shapes of fruits. So, in that sense, just as individuals differ in height, the expression level of genes differs across people, and there’s a genetic basis to this variability. This gives us an opportunity to take the expression levels and decompose them genetically. We can use different types of genetic mapping methods: we look at sites in the genome where people differ in DNA sequence, and ask whether any of those sequence differences allow us to decompose this range of gene expression. So, for example, here, there’s a site where some individuals have the DNA sequence CC, some individuals have gotten a C from dad and a G from mom, let’s say, so they are CG at that site, and some individuals have gotten two Gs from their parents, so they are GG. And we see here that individuals who have gotten two Gs from their parents have high expression levels and people who have two Cs at that DNA site have lower expression levels. Then we go and ask, where is that DNA sequence? If it is close to the gene whose expression level we’re measuring, then we suggest that the gene is cis-regulated. If that DNA sequence is on another chromosome, say, in a transcription factor, we will have found that the expression level is trans-regulated and identified the transcription factor that influences the expression level of that gene. So, we did this systematically for all the genes whose expression levels we’re interested in. Let me walk you through what we do. We treat the expression level of each gene as a quantitative phenotype.
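The CC, CG, GG comparison above can be made concrete by counting the G alleles each person carries (0, 1, or 2) and regressing expression on that count. This is only a sketch of the general idea under an additive model, which is an assumption here; the helper names and all values are hypothetical.

```python
from statistics import mean

def allele_dosage(genotype, allele="G"):
    """Number of copies of `allele` in a two-letter genotype such as 'CG'."""
    return genotype.count(allele)

def additive_effect(genotypes, expression, allele="G"):
    """Slope of expression on allele dosage: change in expression per copy."""
    dosage = [allele_dosage(g, allele) for g in genotypes]
    mx, my = mean(dosage), mean(expression)
    sxy = sum((x - mx) * (y - my) for x, y in zip(dosage, expression))
    sxx = sum((x - mx) ** 2 for x in dosage)
    return sxy / sxx

# Hypothetical genotypes and log2 expression values (illustration only).
genotypes = ["CC", "CC", "CG", "CG", "GG", "GG"]
expression = [7.0, 7.2, 8.0, 8.2, 9.0, 9.2]

effect_per_g = additive_effect(genotypes, expression)
print(round(effect_per_g, 2))  # 1.0: each G adds about one log2 unit here
```

In this made-up example, GG individuals sit about two log2 units above CC individuals, which is the kind of stepwise pattern described for the slide.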
So, we identify families and measure the expression level in all members of those families. We then treat those levels as a quantitative phenotype, and put them, first, through a genetic linkage analysis. In these genetic linkage analyses, we’re basically scanning throughout the whole genome and asking which of the chromosomal regions are inherited together with the trait of interest. The way that we can track the different parts of the genome is to identify sites with different DNA sequences. This gives us markers, or basically signposts, to track different regions of the genome, and ask which parts of the genome are often inherited with high or low expression level of a particular gene. And just to give you some details, the method that allows us to track both the chromosomal region and the phenotype is an algorithm developed by Haseman and Elston, called the Haseman-Elston algorithm. Basically, as I said, it searches the markers in the genome for ones that segregate, that is, are passed along in families, with high or low expression level of genes. But when we look at these large families using these linkage-analysis methods, because there are only so many times that chromosomes are swapped within a family, we end up with very large linkage regions. So, we somehow need to be able to verify that those regions are indeed correct and narrow them. In order to narrow those regions, we carry out another type of analysis called association analysis. This method, instead of just taking advantage of recombinations or segregation of chromosomal regions in families, takes advantage of all the historical recombination, so we get much better resolution. We do this both at the population level and at the family level, using a method called the transmission disequilibrium test, or TDT.
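To give a feel for the transmission disequilibrium test, here is a minimal sketch of its classic form, which asks whether heterozygous parents pass one allele to their children more often than the 50% expected by chance. Note the assumption: this is the textbook version for a yes-or-no trait, while an expression level is a quantitative trait and needs a quantitative extension that is not shown here. The function names and counts are hypothetical.

```python
import math

def tdt_chi_square(transmitted, untransmitted):
    """Classic TDT statistic: (b - c)^2 / (b + c).

    b = times heterozygous parents transmitted the allele of interest,
    c = times they transmitted the other allele instead.
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("need at least one informative transmission")
    return (transmitted - untransmitted) ** 2 / total

def chi_square_p_value_1df(statistic):
    """Upper-tail p-value of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))

# Hypothetical counts (illustration only): allele passed on 60 times, not passed 40 times.
chi2 = tdt_chi_square(60, 40)
p_value = chi_square_p_value_1df(chi2)
print(chi2, round(p_value, 4))
```

If the marker has nothing to do with the trait, transmitted and untransmitted counts should be roughly equal and the statistic stays close to zero.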
With the TDT, we are able to improve the resolution of our mapping and narrow down the regulatory regions much more. Shown here are examples. Of the 3,500 expression phenotypes we analyzed, over 2,000 gave us significant regions of linkage. These are regions that segregate with the expression level of genes, and on the y axis are the p-values showing how strongly those chromosomal regions segregate with the expression of a particular gene. So, for example, here, for CHI3L2, the gene is on chromosome 1. By genetic linkage analysis, we identified the regulatory region. It’s also on chromosome 1, actually very close to the gene itself, suggesting it’s cis-regulated. Here’s another example where the gene is on chromosome 18, but the regulatory region is on chromosome 3. So, of course, it’s not regulated by something close to the gene but by something on a different chromosome. But you will see that this linkage peak is very broad, basically covering the entire chromosome. That doesn’t give us much information about where to look for the regulation. So, we follow the linkage with association analysis, tracking many millions of markers across the genome. The result for each marker is shown as a black bar. And the marker that is most associated with the expression level of, in this case, IRF5, is shown here, and it is actually right over the IRF5 gene itself, suggesting that IRF5 is cis-regulated. This approach allows us to really pinpoint the regulatory regions, oftentimes to nucleotide resolution. So, overall, of those 2,000 phenotypes that we looked at, for about 20% the linkage peak is very close to the gene itself, suggesting they are cis-regulated. For the majority, the peak is somewhere else in the genome, suggesting those are trans-regulated.
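The cis versus trans call described above comes down to comparing where the gene sits with where the mapping peak sits. A minimal sketch follows; the 5-megabase window is an assumption chosen for illustration, not a threshold stated in the talk, and the coordinates are hypothetical.

```python
def classify_regulation(gene_chrom, gene_pos, peak_chrom, peak_pos,
                        window=5_000_000):
    """Label a mapping peak 'cis' if it lies near the gene, else 'trans'.

    `window` (in base pairs) is a hypothetical cutoff for "close to the gene".
    """
    same_chromosome = gene_chrom == peak_chrom
    if same_chromosome and abs(gene_pos - peak_pos) <= window:
        return "cis"
    return "trans"

# Hypothetical coordinates (illustration only).
near_gene = classify_regulation("chr1", 111_000_000, "chr1", 111_200_000)
other_chrom = classify_regulation("chr18", 20_000_000, "chr3", 50_000_000)
print(near_gene, other_chrom)  # cis trans
```

Applied to every mapped phenotype, a rule like this is what yields the proportion of cis-regulated and trans-regulated genes.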
And we were very pleased with this 20-80 split, because if every gene were cis-regulated we wouldn’t need to do all this exercise, looking throughout the genome for the regulation; we could just look close to the gene itself. But if most of the genes are trans-regulated, using this genetic approach is very powerful because we don’t need to know in advance what the mechanism is. We can just search systematically throughout the genome for the regulation, using a combination of the linkage and association-based methods.
Now, let me give you one concrete example. We found, through this genetic mapping, that KDM4C is a master regulator of the expression level of hundreds of genes. So, what’s KDM4C? KDM4C is a demethylase in the Jumonji family. Most of the time, histone marks regulate gene expression. And, in this case, when histone 3 has three methyl groups it acts as a repressive mark; it decreases the expression level of genes. What KDM4C does is remove methylation marks from the histones, therefore activating genes, since it removes the repressive marks. What we found, when we looked at the expression level of KDM4C, was a tremendous amount of variability. And when we looked not only at the gene expression but also at the protein expression of KDM4C, we saw that individuals who have low KDM4C gene expression also have low protein levels of KDM4C, and individuals who have high KDM4C gene expression also have high protein levels of KDM4C. We then carried out our linkage and association-based methods and identified a set of sequence differences in the 3′-UTR of KDM4C. Individuals with the A variant in the 3′-UTR of KDM4C have higher expression levels of KDM4C, and individuals who have the G variant in the 3′-UTR have lower expression levels of KDM4C. So, we know that KDM4C is cis-regulated.
KDM4C is on chromosome 9, so we looked at the rest of our genetic mapping results and found that many of the linkage peaks are actually on chromosome 9, right over the KDM4C gene itself. One example, here, is MEF2C. MEF2C is a gene on chromosome 6 and its linkage peak maps to chromosome 9, suggesting that KDM4C is what regulates the expression level of MEF2C. And, by our association method, we were able to narrow the region and confirm that, indeed, MEF2C is regulated by KDM4C. And it’s not only these three genes that have linkage peaks on chromosome 9. What we found was that there are over 390 genes that have linkage and association peaks mapping to KDM4C, suggesting that they are target genes of KDM4C. We then went on to experimentally validate that KDM4C indeed regulates the expression levels of these genes, by showing that KDM4C binds to the promoters of these genes and removes methyl groups from the histone 3 covering those target genes.
So, what are these target genes of KDM4C? They include many genes that activate cell growth, including well-known genes such as MYC and HRAS, other genes that promote cell cycle progression, and genes that regulate translation. So, if KDM4C indeed regulates the expression levels of these genes that influence cell growth, we should be able to see that individuals who have high KDM4C levels and individuals who have low KDM4C levels have different growth rates, or at least that their cells have different growth rates. We measured growth using two different assays: one is a simple assay, basically measuring how fast their cells grow using a growth curve; the other uses a BrdU assay, which also allows us to measure how fast cells grow. We see here that cells from individuals with high KDM4C levels grow faster, both by the growth curve and by the BrdU assay, compared to cells from individuals who have low KDM4C levels.
We then over-expressed KDM4C and, by this BrdU assay, we see that once we have over-expressed KDM4C, those cells indeed grow faster. If KDM4C indeed regulates cell growth, then in diseases such as cancer, which is characterized by rapid cell growth, we should see higher KDM4C levels. We turned to 18 different types of cancers and compared KDM4C levels between the cancer and its normal counterpart. We found that, out of those 18 different types of cancers, 10 showed higher KDM4C in the cancer tissue compared to the normal tissue. We then asked: in these cancer tissues, if we knock down KDM4C levels, do we slow down cell growth? Here’s an example where we knocked down KDM4C in colorectal cancer. First, we see that target genes such as MYC, following our KDM4C knockdown, have a lower expression level and, as a consequence, the cells also grow more slowly.
So, what I’ve shown you here is that we started with a simple observation that KDM4C level varies across individuals. Then, we asked whether we can decompose this range of KDM4C genetically, and we found that KDM4C levels are cis-regulated by sequence variants in the 3′-UTR of KDM4C. We then identified a set of target genes that are regulated by KDM4C and worked out the regulatory mechanism, showing that KDM4C binds to their promoters and removes methylation marks on those histones, and that high and low KDM4C levels actually have phenotypic effects on cell growth. So, what I’ve shown you is that DNA sequence differences influence the expression level of genes. And this is basically the essence of all genetic studies: to identify sequence differences and ask which of those sequence differences affect disease susceptibility or phenotypic differences, whether it’s gene expression level, coat color, or other types of phenotype differences.
So, genetics, basically, asks: what are the genetic blueprints that influence many of our phenotypes? We are now familiar with the phrase, “It’s in our DNA.” What I’d like to do is go one step further and say that, beyond our DNA, there is much information in our RNA sequences as well. So, what do I mean by that? First, of course, there are many things that are beyond DNA. There are chromatin modifications that influence phenotype. There are ways that genes are alternatively spliced, so that one DNA sequence can end up having two transcript isoforms, and then, consequently, two protein isoforms. But when I say, “It’s in our RNA,” I really mean that it’s in our RNA sequences, that there are RNA sequences that differ from our underlying DNA sequences.
Perhaps this is not so surprising when we think about what is already known. The term RNA editing was coined by Robert Benne’s group in the 1980s, when they found uridine insertions and deletions in the mitochondrial RNA of trypanosomes. Those RNA sequences were not encoded in the DNA. Since then, other examples of RNA editing have been found in different organisms, including in humans, where there are two types of deaminases: one removes an amine group from cytidine, resulting in uridine, and the other removes an amine group from adenosine, resulting in inosine. Beyond these two types of deaminases, what our group has identified recently is that there are other ways in which the RNA sequences can differ from the underlying DNA. Since these are not driven by these deaminases, we just descriptively call them RNA-DNA sequence differences, or RDDs. I have already shown you that individuals differ in terms of gene expression levels. What I’m going to show you next is that editing level also varies across individuals.
At the identical RNA editing site, some people have high editing levels and some people have low editing levels. Shown here is such a graph. It’s very similar to the graph I showed you before, except that, here, what is plotted are the RNA editing or RDD levels. On the y axis is the editing or RDD level, with each individual shown as a black dot. For example, in the first column here, there is an A-to-G editing site in a gene, an elongation factor, where some individuals have low editing levels, with only about 40% of the elongation factor RNA in the G form, and in some individuals 100% of the elongation factor RNA is in the G form rather than the A form. And that’s not true for this one example only, but for a variety of examples where there is an A-to-G editing site or another type of RDD. We see quite a bit of individual difference. And just as with gene expression levels, in individuals who are genetically identical, such as twins, we see that their editing and RDD levels are much more similar to each other than, say, those of unrelated individuals.
So, what I’ve shown you in this section is that there’s a lot of variation in RNA, and I described in quite a bit of detail the extent of gene expression level differences across individuals, and how that affects cellular phenotypes. What I will talk about in the next two sessions is how RNA sequences differ across individuals and what their phenotypic consequences are.
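As a small illustration of what an editing level of 40% or 100% means at an A-to-G site, here is a sketch that computes the fraction of RNA sequencing reads carrying the G form. The assumption here is that the level is measured from read counts in this way; the function name and counts are hypothetical.

```python
def editing_level(unedited_reads, edited_reads):
    """Fraction of reads at a site that carry the edited base.

    Returns None when no reads cover the site, since the level is undefined.
    """
    total = unedited_reads + edited_reads
    if total == 0:
        return None
    return edited_reads / total

# Hypothetical read counts at one A-to-G site in two individuals (illustration only).
low_editor = editing_level(unedited_reads=60, edited_reads=40)    # 60 A reads, 40 G reads
high_editor = editing_level(unedited_reads=0, edited_reads=100)   # all reads are G
print(low_editor, high_editor)  # 0.4 1.0
```

Plotting this fraction for each person at the same site gives the kind of dot plot described for the editing and RDD levels.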