RNA sequencing – For ATVB|PVD 2016


– [Kiran] Welcome to the
RNA Sequencing Bootcamp sponsored by the FGTB Council. Prior to attending the classroom
session of the bootcamp, please review the following
presentation and slides. This presentation will give you a primer for understanding the
basics of RNA sequencing, so that you’ll be prepared not
only for the bootcamp session but also for designing your own RNA sequencing experiments in the future. To begin, you may be wondering, what is RNA sequencing, or RNA-seq? And why do investigators choose to do it? RNA-seq is a method for performing unbiased quantitative
profiling of the transcriptome. By leveraging the massively
parallel sequencing capacity of next-generation sequencing technology, RNA-seq allows for the large
amount of nucleotide data contained in the human transcriptome to be captured in a time-efficient manner. The data generated from RNA-seq can be used for multiple purposes, because RNA-seq technology allows for an investigator to look
at the transcriptome of experimental or clinical samples with single-base resolution. As you can imagine, this opens the door for far more than differential
expression analysis, which can be captured through
older microarray methods. For example, in addition to looking at protein-coding gene expression, investigators use RNA-seq for studying non-coding RNAs, novel unannotated transcripts, alternative splicing, alternative polyadenylation, RNA editing, and allele-specific expression
of genetic variants, to name a few areas of RNA biology. As an investigator about to embark on an RNA-seq experiment, you’ll be going through multiple steps from the wet lab to the dry lab to ensure that you generate the data you need for your project. These steps are summarized on this slide. First, you’ll design your
experiment or RNA collection from a clinical cohort to ensure you have enough samples to power your study and that the choices you make
at each of the following steps facilitate your RNA-seq goals. Next, you’ll isolate RNA from the samples you want to sequence, and either you or a
core service laboratory will generate cDNA libraries that get read on a high-throughput sequencer, usually at a core facility. Once the sequencing is complete, you’ll then be able to download and analyze your data, if you have some experience
in computer programming. If this isn’t feasible,
you may collaborate with a computational group, or pay for computational
analysis services. To give you an idea of what happens after you isolate RNA, we’ll go through a general
overview of the workflow. The schematic depicted here, from a review by Wong et al., illustrates the steps of how RNA is processed for RNA-seq. For the sake of simplicity,
here we’ll focus on long RNAs, such as
messenger RNAs, or mRNAs. We’ll touch upon long
non-coding RNAs later. Briefly, after RNA
isolation, mRNAs undergo poly(A) selection or
ribosomal RNA depletion and then are converted
into a library of cDNA, complementary DNA fragments ideally greater than 100
base pairs in length. Sequencing adapters,
depicted here in blue, are subsequently added to
each cDNA library fragment. These adapters contain
identifier bar codes that allow for multiplexing
of different samples to run in the same lane. A short sequence is read from each cDNA, using high-throughput
sequencing technology, which we’ll discuss using
the Illumina platform as an example. The resulting sequencing reads are aligned to the reference genome, for example, hg19 for human samples or mm9 for mouse samples. These mapped reads are then analyzed for differential expression of genes between control and disease or experimental conditions. There are different types of sequencers you can choose for your
RNA-seq experiment. A commonly used one is
the Illumina platform, which is depicted on this slide. Illumina sequencers use
solid-phase amplification on glass flow cell lanes that have a lawn of primers embedded on them. These primers anneal to the cDNA libraries that are then amplified in clusters. The amplified clusters are then read through cyclic reversible termination using fluorescently modified nucleotides that are imaged as each
nucleotide is added. RNA-seq is a powerful tool for
studying the transcriptome, from its regulation, to
different types of RNA that are differentially expressed. Knowing the types of questions you’ll want to ask of your RNA samples is critical for making
methodological choices that will facilitate
answering those questions in your experimental design. For example, one basic question is which RNA species do you want to study? If you want to study small
RNAs such as microRNAs, you’ll want to make sure
your RNA isolation prep is optimized to capture small RNA species, as not all methods or kits
are designed to do this. Another question to ask is if you’ll be looking
beyond differential expression of transcripts. If you know that you’ll want
to study biological functions such as alternative splicing, you’ll need a read depth of at least 30 million reads per sample, and ideally 75
million reads per sample. After your RNA-seq reads are mapped, you’ll then need to choose the appropriate software program to analyze
your data for your questions. Once you have answered
those fundamental questions, it’s important to consider the number of biological and technical replicates you need to power your experiment. Although larger sample sizes provide more statistical power, it’s not always feasible
from a cost perspective to sequence very large numbers of samples. Unfortunately, the field
has not reached a consensus in terms of how to calculate adequately powered sample sizes, but there are a few studies that look at numbers of replicates needed. For example, it’s
recommended that you have at least four replicates, preferably biological replicates, per experimental group if you’re
looking at messenger RNAs, and six replicates per
group for non-coding RNAs. More replicates lead to more power, whereas increased sequencing depth does not increase power for differential expression
analysis of messenger RNAs once you’ve reached the threshold of about 20 million reads per sample, given that messenger RNAs do not require deep sequencing to quantitate. This is in contrast to non-coding RNAs, splicing events, and editing events that do require more depth for detection and differential analysis. Now that you know how you want to set up your RNA-seq experiment in terms of how many samples to use, it’s time to isolate
RNA from those samples. Again, the type of RNA you want to study will affect your approach
in RNA isolation. If you want total RNA, TRIzol
can capture everything, although you’ll need to make modifications in ethanol precipitation
and use of carriers to ensure that you capture
enough of the small RNAs if you want to study them. Alternatively, you can use commercial kits that are designed to enrich
for different fractions. Regardless of the method of
RNA isolation you want to use, your total RNA should have
an RNA Integrity Number, or RIN, greater than 7, to ensure that you’re sequencing mostly high-quality, undegraded RNA. RIN values can be
assessed on a Bioanalyzer, a process that can be
performed on most campuses. With a low RIN, you may have reads with
significant 3 prime bias, which may decrease confidence in your downstream analysis. Once you have ensured that
you have high-quality RNA, you can aliquot RNA amounts according to required input, which varies according
to the type of library you choose to make. Some examples, which we’ll describe next, are listed on the slide. Prior to isolating RNA, you have to have an idea of what you want to study. For example, studying microRNAs or single-cell transcriptomics will require specialized
library prep methods that do not cross over to
other types of library preps. If you’re looking at long RNAs, such as messenger RNAs, you can choose between poly(A) selection, which will capture messenger RNAs and long non-coding RNAs, many of which are polyadenylated, and ribosomal RNA depletion,
which will include total RNA, except for ribosomal RNAs. If you choose to pursue a
ribosomal depletion method, you’ll need to ensure adequate
DNase treatment of your RNA, and be prepared to have a slightly higher ribosomal RNA fraction in your reads. The advantage of using ribosomal depletion is that you’ll capture
antisense transcripts that often are not polyadenylated, as well as other non-coding species, such as back-spliced circular RNAs. Both poly(A) and ribosomal
depleted libraries must be stranded if
you want to do analyses beyond protein coding
differential expression analysis. If you want to capture all
species of RNA in one go, you could consider TGIRT libraries, which are generated from a different reverse transcription process. More information can be
found in the reference listed at the bottom of the slide. When making your sequencing libraries, there are tricks that deviate from the manufacturer’s protocol to enhance the quality of your reads. Certainly, collaboration
with RNA biologists who do frequent RNA-seq experiments can be very useful in
learning more specific tricks. Here we share a few things we’ve picked up from our colleagues. One is that if you’re choosing to pay for 100 base pair reads for your experiment, you want to make sure that your libraries are actually over 100 base pairs long so that the sequencing depth you paid for is not wasted. One way to do this would be to decrease the fragmentation
time by 2 minutes in the library generation protocol. This usually ensures most fragments are at least 150 base pairs long. Another concern in RNA-seq is potential sources of bias at the library preparation stage. A source of bias is
certain runs of sequences that get amplified more. For example, GC-rich
and GC-poor transcripts may be underrepresented. Reduction of PCR cycles
for library amplification reduces this bias effect. Even differences in bar code sequences used for multiplexing can lead to differences in adapter
ligation and library prep as well as amplification biases. So bar codes chosen for your multiplexing will need to be considered carefully. For more information, references
are listed on this slide. Now that your libraries are made, you have some choices in terms of how you want to sequence them. For example, different analytical goals will require different
depths of sequencing, from fewer reads for messenger RNA differential expression analysis, to very deep sequencing for de novo assembly of transcripts,
which is important if you are, for example,
trying to annotate novel non-coding RNAs. You can also choose whether
you want paired-end reads, which will give you better
resolution at each base and more confidence in calling events such as alternative splicing. Likewise, longer reads will give you better coverage of a transcript, which will be needed for
more sophisticated analyses. With all of these choices, we’re sometimes limited by cost. The website listed on the slide provides a price comparison among different sequencing cores and
companies around the country. One type of sequencer
that’s frequently used is the Illumina HiSeq 2500, which contains two lanes per
flow cell for a rapid run, or eight for a high-output run. With more than one lane available to you, you can multiplex libraries to run multiple samples down one lane. With multiplexing, you
then have the freedom to pool many libraries
with different bar codes on a given lane. For the HiSeq 2500 rapid run, you want to avoid batch effects by mixing your samples
as much as possible. For example, if you have an experiment consisting of eight samples
with four replicates per group, running all eight samples on each lane gives you the same depth
as splitting them up so that all of one group is in one lane, versus the other group in the other lane. As you can see, you may be
introducing a batch effect if you were to choose
to pool your libraries in the approach shown on the left. To minimize batch effect, you want to pool your libraries in the
approach used on the right to decrease bias from
unpredicted differences in each lane’s run. After your sequencing run is complete, it’s time to process and analyze the data. Here’s an overview of
how data is processed from the sequencer to a text output of differential expression analysis. First, the facility that
performs your sequencing will release to you the raw data in the form of FASTQ files, usually along with some basic quality control metrics such as the percentage of read nucleotides meeting a Q30 quality score, which corresponds to a 99.9% base-calling accuracy. Because your samples were multiplexed with adapters and bar codes, the reads need to be trimmed
so that the adapters do not cause significant
mismatches in alignment to the reference genome. Aligning reads to the reference genome is the way one identifies to which gene a transcript read belongs. After alignment, you can
do further library QC to assess whether you have
good library complexity of uniquely mapped reads. A high-duplicate read
rate for messenger RNAs means that many of
those reads are artifacts of amplification clones, rather than true reads
of transcript abundance. This is because the RNA
fragmentation process for library generation will generate many random breaks in transcript
copies of the same gene, and thus unique fragments, rather than fragments of
the exact same sequence. If library complexity is
poor, you may need to consider redoing and troubleshooting
your libraries. Additional library QC includes looking at 3 prime bias to assess whether your sequenced transcripts
are intact or degraded. Once you’re happy with the
quality of your libraries, you will then proceed with
quantifying transcript abundance and then testing for
differential expression between sample groups. Here are examples of commonly used adapter trimming programs. With the adapters trimmed
off your FASTQ files, you must then align your reads to the relevant genome. This concept is reviewed
on the next slide. In the case of our bootcamp,
we’ll be using human data and thus aligning to the human genome. Two commonly used
programs are listed here. To review what we mean by
aligning to the genome, let’s say one RNA-seq read is represented by the blue puzzle piece. This piece must be matched to the corresponding portion of the genome from which the original
RNA was transcribed. Identifying the position of the genome to which the puzzle piece aligns allows for identification or mapping of the relevant gene or transcript, as many genes and transcripts
are already annotated. Mapping of an aligned
read to an annotated gene or non-coding RNA occurs
in the analytical phase. You’ll find that there are
many publicly available annotation files from which to choose, and they can include protein coding genes, as well as various species
of non-coding RNAs. Commonly used annotation
files are those from Ensembl, GENCODE, UCSC, and RefSeq. After aligning your
data, you are not quite ready to begin your analysis. Prior to analyzing your data, you need to ensure that the quality of your libraries is acceptable. For long RNAs, you want complex libraries that have a low duplicate read rate. What do we mean by duplicate reads? Think back to what we said about fragmenting RNA for generating libraries. Fragmentation is a random process, and it would be extremely rare for one transcript copy to be fragmented into the exact same greater
than 150 base pair pieces as a different copy of
that same transcript. Therefore, a high duplicate read rate, which is commonly defined
as greater than 30%, although there’s no consensus
on what’s acceptable, is indicative of PCR clones in the library amplification step that get preferentially amplified, rather than necessarily truly higher transcriptomic expression. Another quality measure we
examine is 3 prime bias, which is indicative of RNA degradation. This may affect the confidence
you have in your data. The graphs here represent read coverage along the length of transcripts. Even coverage is represented
in the graph on the left, where you can see that the
5 prime and 3 prime ends are represented fairly equally. The graph on the right
is an example of having more reads on the 3 prime end, which may not be suitable for analysis. Once you’re happy with your libraries, you’ll proceed to analytical programs to generate the data you want for your papers and grants. This slide lists the
two most commonly used differential expression programs, DESeq2 and Cuffdiff2, as well as new, up-and-coming programs
Kallisto and Sailfish. Once you or a computational
research specialist have run your processed RNA-seq files through your program of choice, you’ll receive a data set output, from which you can run further analyses to find interesting targets. To get you comfortable with
the output you might see, we’ll show you examples from DESeq2, which will be used for
the classroom portion of the bootcamp, and Cuffdiff2. There’s no right or wrong program to use. Your choice may be based on which statistical methods
you believe in more. Here’s an example output from DESeq2. The base mean is the mean
of normalized read counts for an annotated transcript. That transcript has a
fold change in expression between experimental groups, and here the fold change is
expressed on the log2 scale and is called the log2 fold change. lfcSE stands for log fold change standard error. You also receive the output
of the Wald test statistic, the P-value, and the P-value adjusted
for a false discovery rate, or FDR. We’ll touch more on FDR in a bit. This is an example of
output from Cuffdiff2. You can see the control
group and the variable group, as well as the corresponding
FPKM values of each group. What do we mean by FPKM? FPKM is Fragments per
Kilobase of transcript per Million mapped reads. In essence, it’s a
calculation to normalize the number of reads by sequencing depth and by the
length of the transcript, as a longer transcript may have more total read fragments mapped to it by virtue of it being long. The FDR-adjusted P value, or Q value, is more rigorous
statistically than a P value. For IDO-1, the first gene in the list, there is a 0.05% chance
among all expressed genes, which can be in the thousands, that IDO-1’s called
differential expression pattern would be a false positive, whereas the Q value
implies that 0.25% of genes with Q values less than 0.0025, a small number of genes,
will be false positives. Because Q values are more rigorous, they are often the values
reported in manuscripts. We hope that this
presentation has given you background for what you need to know for the classroom portion of our bootcamp, where we’ll have you
perform additional analyses, such as gene ontology analysis on a real RNA-seq data set processed for differential
messenger RNA expression. See you at the bootcamp.
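
As a supplement to the FPKM definition given earlier, here is a minimal Python sketch of the calculation. The function name and all of the numbers are hypothetical and chosen only for illustration; they are not taken from the slides or from Cuffdiff2 itself, which estimates fragment counts with its own statistical model before normalizing.

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments.

    Hypothetical helper for illustration; not part of any RNA-seq package.
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total mapped fragments must be positive")
    length_kb = transcript_length_bp / 1_000               # transcript length in kilobases
    mapped_millions = total_mapped_fragments / 1_000_000   # library size in millions
    return fragment_count / (length_kb * mapped_millions)


# Made-up example: a 1 kb transcript and a 4 kb transcript in the same
# 20-million-fragment library. The longer transcript collects four times
# as many fragments simply because it is four times as long.
short_tx = fpkm(fragment_count=500, transcript_length_bp=1_000,
                total_mapped_fragments=20_000_000)
long_tx = fpkm(fragment_count=2_000, transcript_length_bp=4_000,
               total_mapped_fragments=20_000_000)

print(short_tx, long_tx)
```

Both transcripts end up with the same FPKM in this example, which is the point of dividing by length: the raw fragment counts differ fourfold, but the normalized abundance does not.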

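To make the relationship between P values and FDR-adjusted values more concrete, the following sketch applies the Benjamini-Hochberg procedure, which is the default adjustment DESeq2 uses for its adjusted P-value column. This is an assumption about which adjustment is meant here, the function name is hypothetical, and the five P values are invented for illustration; a real analysis would adjust across thousands of genes.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order.

    Illustrative sketch only; in practice use the adjusted values
    reported by your differential expression program.
    """
    n = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P value down, scaling each by n / rank and
    # enforcing that adjusted values never increase as P values decrease.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * n / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Made-up P values for five genes.
p_values = [0.0005, 0.01, 0.03, 0.04, 0.5]
q_values = benjamini_hochberg(p_values)

for p, q in zip(p_values, q_values):
    print(f"P = {p:<7} adjusted = {q:.4f}")
```

Each adjusted value is at least as large as its raw P value, which is why a gene that looks significant by P value alone may no longer pass the threshold after adjustment.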