NCBI Targeted Loci: RefSeq Ribosomal RNA Sequences for Identification and Phylogenetic Analysis


We will always make those questions and
answers available after the webinar. That will be linked to our webinars page when
the webinar is archived and there’s also a link, that compressed link, to the
materials on our ftp site and that’s where I’ll put the Q&A document, and a
copy of the recording will be there as well. So what we’re going to talk about
today is our Targeted Loci project. In particular we’ll talk about the RefSeq
ribosomal RNA sequences. And we’re going to talk about these topics briefly with
slides and then I’ll switch over to the web browser, where we’ll do a couple of examples
and demonstrate some things with these. I’ll talk a little bit about what I’m talking about,
what are targeted loci anyway, and one kind of targeted locus is of course ribosomal
RNA genes. Those are the ones I’m going to be talking about today. And I’ll review a
little bit about the structure of those. Then I’ll talk about our NCBI RefSeq
ribosomal RNA gene projects, two of them in particular, the 16s bacterial and
archaeal project, and the fungal ITS (internal transcribed spacer) region
project. And then we’ll look at how you can get to the data and retrieve them, download them, search them, things like
that. So a targeted locus is some kind of a marker sequence or a bar code
sequence. Majda, in our service desk, likes to call these measuring tape sequences.
These are used typically for phylogenetic studies or simply to
identify the particular taxon or organism that you’re working with. Simple examples
are things that most people are familiar with: cytochrome oxidase bar code
sequences. There are various protein-coding genes that are routinely
used for phylogenetic analysis, and they have different degrees of conservation
so they’re used for different phylogenetic distances. There are mitochondrial D-loop
sequences, and then of course the main topic for today is the structural
RNA genes. These are organized into operons in prokaryotes and
transcriptional units in eukaryotes. There is a 16s, a 23s and a 5s gene in
prokaryotes, and in eukaryotes there’s an 18s, a 5.8s and a 28s or 25s,
depending on whether we’re talking about plants or animals, organized into a transcriptional
unit. And in particular we’re really focusing on the ITS1 and ITS2 sequences for eukaryotes when we’re talking about
these genes. So let’s quickly look at those, and I’ve got some screenshots from
our graphical sequence viewer just to illustrate that. This is a view of
the Escherichia coli O157:H7 genome, and this shows you one of the ribosomal RNA
operons. There’s the 16s sequence, the 23s and the 5s sequence. The 16s
sequence, as many of you I’m sure know, is widely used in bacterial taxonomy
and identification. And so I’ve got an alignment there showing you the 16s
RefSeq RNA aligned to that 16s sequence in E. coli. In E. coli there are
seven copies of this operon. The 16s genes in this particular species of
bacterium are similar enough that we make just one of those for RefSeq. We’ll come
back to that in a moment. Here is a eukaryote, the structural RNA genes of the
transcriptional unit for Saccharomyces cerevisiae. And you can see this sort of
highly annotated region. You have the 25s, well, we will start at the other
end where you have the 18s, the 5.8s, and then there’s the internal transcribed
spacer region one and the internal transcribed spacer region two. Between the 18s and the
5.8s is ITS1; between the 5.8s and the 25s is ITS2. And those are the
regions that we try to capture for the fungal sequences. They are variable
and can be used in fungal taxonomy and identification. They’re more
variable than the conserved RNA genes themselves. So let’s take a little bit, a
minute to talk about rRNA targeted loci records, what’s in our databases for those things.
So we make curated targeted loci sequences for the bacterial 16s and the
fungal ITS regions as I’ve talked about. These are human curated, and they’re based on
qualified sequences from the International Nucleotide Sequence Database
Collaboration, which I’m going to call GenBank in this talk if I ever mention it
again. These are submitted records. We choose these based on ones that are accurate and have the
right characteristics. We’re striving for reproducible
data. The main point here is to make a clear association between the name, the
specimen, and the sequence. It’s become quite challenging in fact to make sure
that GenBank is accurate in terms of the sequence identification. So what we try
to do here is to emphasize sequences that have links to the taxonomic type or
verified material. Those are going to be cultures in the case of fungi and
bacteria. And there’s a little Entrez query switch that you can use. It’s a
filter: when you search the nucleotide database you can use this
sequence_from_type[filter] term and that will restrict your search to those
particular kinds of sequences. Now those can be GenBank sequences as well as
the reference sequences that we’re talking about, and all these records are going to
have links to the type strains and culture collections. We make these
records in collaboration with outside taxonomy experts, using the biomedical
literature and those kinds of things. And here are some feature tables from two
different records. The top one is from Lactobacillus delbrueckii, which is
one of the bacteria that’s used in yoghurt cultures. And you can see right
there there’s a link to the culture collection ATCC. So you can compare, and
if you want to order this culture you could get it right from that entity or
vendor. And the second one is Trichophyton rubrum, which is a skin
infectious agent that causes things like athlete’s foot. And
you can see a link to the CBS database in the Netherlands, and you can order
the culture from there. And then of course you can see on the feature table
the various things that are there, the rRNA genes plus the ITS regions. We’ll
come back to that a little bit later on. So here are some recent statistics about
what we have for the bacterial sequences. We’re closing in on twenty thousand records
and the vast majority of those are from type material. You can see the
distribution of the major groups of bacteria that are in the 16s database.
Right now the 16s archaea is a much smaller set of data, about 900 records.
Most of them are from type material, and if they’re not from type material we try to make them from some kind
of verified material. On the fungal records, there are about 4,500 of those; again
most of those are from type material. A couple of comments. You’ll find that there’s often more than
one sequence for a species or strain. That’s because there may be different
clones that represent maybe different members of the cassette with those
different genes; we’ll see an example of that with a bacterium. You’ll see for the
bacteria that there are also two kinds of RefSeq accession numbers, the NR
representing the ribosomal RNA, and sometimes there’s an
NG RefSeq. That’s because there are some bacterial and archaeal 16s regions that
have introns in them, so we have both kinds of sequences for those. So how do you get
access to these data? We do have a Targeted Loci page that’s linked
to our RefSeq page, and that link there will take you to it. When I show links in
these webinars, that ncbi in the angle brackets
represents our home page URL and those are the directories that are under that. And
so from this page you can see the 16s project and link to all the sequences in
the nucleotide database there. There’s the ITS project and then here are
some tools, in particular blast and MOLE-BLAST, which are two tools that I’m
going to use today to search these data. You can also look at the BioProjects
themselves, and I’ve got the BioProject IDs here so you could use this kind of
a query to get these from BioProject or to get the records from the nucleotide
database. Again that’s the URL to our BioProject page there. If I want to get
these records, they’re all nucleotide records and I can download them pretty
easily from the Entrez nucleotide database, or the nuccore database as we
call it. So using any of these query terms here I could get the set of
sequences and download them in whatever format I want, FASTA format if you
want to do sequence analysis with them. The largest set of these is the
bacterial one; that’s 18,000 16s rRNA genes, and it should be no trouble to download that
through Entrez. You can also get them from the ftp site. These are part of the RefSeq release. And
there is a blast database available that you can get from our blast page. The
ftp link there is, of course, the root directory of the ftp site and
those are the directories under that you need to go to. That’s a fully formatted blast database;
you can use that with some of the utilities that come with blast to generate
these sequences in FASTA format if you want to. As I said, at this
point it’s no problem to download these directly from the Entrez database. The
main way you’re going to access these data on our website of course is to use
blast. Now there is a dedicated 16s, or rather a
dedicated ribosomal RNA targeted loci, blast page, but I think it’s probably best to
use our standard blast pages to do this. So I can do this quite easily if
I want to search the 16s right on the basic nucleotide blast page. I can also use
those BioProject queries, if I want to, to restrict to those from the nr
database. Notice that I’ve circled here one of the check boxes you can use to
modify these: if you want only the sequences from type material, which, as
I pointed out for these datasets, is most of them, you can check that box. There is
another tool that I want to remind you about or tell you about if you’ve never
heard of it before. It’s particularly suited for working with these kinds of
sequences. That tool is called MOLE-BLAST and it’s linked to our blast homepage. That’s the direct URL there, where the
blast in the angle brackets is the URL for the blast homepage. This is a tool that’s useful for
identifying the sources of the 16s sequences. It’s used in house here at
NCBI to sort of verify GenBank submissions. It is a blast search followed by a global
multiple alignment. It clusters your
query sequences plus the most similar
database sequences and it’s going to give you taxonomic units out of that.
And you can also of course restrict those sequences to type material as we
did with blast. This is a perfect tool for searching against the 16s or the
ITS databases. We did a webinar on this a little bit over a year ago. If you want
to watch the recording of that, it’s available on our YouTube channel and
that’s a link to it right there at the bottom of this slide. So just to show you
the MOLE-BLAST interface, this is what it looks like. I can put a bunch of
sequences in there, and in fact this is a set of data I’m going to demonstrate in a little bit. And I can pick the
database I want. So I can pick the 16s there because I know these are
bacterial sequences. And notice that in the advanced parameters I can check that
I want to see only sequences from type material, and I could even restrict that a
little further to those that have a binomial name, that is, ones that actually have a
clear name and are not sort of tentative right now.

Okay, so that’s all the slides I want to
show, so we’re going to go over to some live examples. We’re going to take a look at some of the
sequences for an archaeon, Haloarcula marismortui, which is from the Dead
Sea. It’s a halophilic archaeon and it has divergent 16s genes; I just want to show you
that. We will take a fungal clone and try to identify it using the 16s [ITS]
database. And then we’ll cluster some targeted sequences (sorry, that
fungal clone will of course use the ITS database, not the 16s) and then we’ll use
MOLE-BLAST to cluster targeted sequences. So I’ll stop here for a minute just to
see if there are any questions at this point before we go over to the web pages. So
type some questions in there if you want them answered now; we can also have some time
at the end to answer some questions. Okay, nothing now, so I’m going to go ahead
and go over to the live examples. And what I’m going to do here, just to get
out of this, is click this link, which will open that compressed URL that I had
earlier. And so this is the directory that has these slides if you want them. It also has the demos I’m about to
do, and I’m actually going to cheat and leave that open because I need that. So
at the top of this, by the way, are those three queries that we talked about, which
are useful, so I put them in this file. And then the examples I’m going to do
are spelled out for you here. So I’ll first start off by going to
the NCBI homepage here, and I’m going to the nucleotide database and we’re going
to retrieve some 16s sequences for a particular species of Archaea. I’m just going
to go to the nucleotide database first. Now let’s do a search. I could type that in but my typing is
horrible, so what I’ll do is just copy and paste
this thing to save some time. And so that’s the BioProject for the 16s
sequences, and this is the scientific name for an archaeon, Haloarcula marismortui.
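
The search pasted in here combines a BioProject restriction with an organism name. As a rough sketch of how such a query could be put together outside the browser, the Python below builds the search term and an E-utilities esearch URL without contacting NCBI. The BioProject accession PRJNA33317 for the archaeal 16s project is an assumption recalled from NCBI’s public pages, not something stated in this recording, so check it against the handout. The helper names are made up for illustration.

```python
from urllib.parse import urlencode

# Public E-utilities search endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def targeted_loci_term(bioproject, organism=None, type_material_only=False):
    """Join a BioProject restriction with optional organism and type-material limits."""
    parts = [f"{bioproject}[BioProject]"]
    if organism:
        parts.append(f'"{organism}"[Organism]')
    if type_material_only:
        # The filter mentioned earlier in the talk.
        parts.append("sequence_from_type[filter]")
    return " AND ".join(parts)


def esearch_url(term, retmax=20):
    """Build (but do not send) an esearch request against the nuccore database."""
    params = {"db": "nuccore", "term": term, "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# PRJNA33317 is an assumed accession for the archaeal 16s project (see note above).
term = targeted_loci_term("PRJNA33317", organism="Haloarcula marismortui")
print(term)
print(esearch_url(term))
```

Pasting the printed term into the nucleotide search box should behave like the query used in the demo, provided the accession is right. Setting type_material_only=True adds the sequence_from_type[filter] restriction.
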
And so I now have six of these records. Notice that they are for
different strains, and I’ll pick the ones that have the ATCC number on them. So
there are two from each one of these three different strains, so I’ve got two
of them, and let’s go ahead and analyze these by running blast. So this link here is
on all of our sequence records and you can get it for, you know, a combination of
sequence records like this. So I’ll go ahead and throw these over to the blast
search page. And so now I’m going to have two queries. What I want to do is
align these to the genome. We have a genomic sequence for this
particular strain of this organism. So the RefSeq version of that is going to
be in the RefSeq genomic database. I have to watch myself because it remembers what
I’ve done before. So I’m going to take that stuff off because I don’t want that. And so I’m going to add the organism
here. It doesn’t really matter in this case because we’re only
going to have one in here, but I can go ahead and pick the particular genome
that I want right here. See, it does match once I learn how to type, and then I go
ahead and run that. Now this shouldn’t be too bad because we
did restrict it, but the RefSeq database is large, so if time gets tight I
can go ahead and retrieve those results myself, and I have the RID here in the
handout. So I’ll go ahead and get those. By the way, these RIDs
should be valid for a little while anyway. Well, it came in, that’s okay. So notice I started with two
sequences, so the results are separate and it paginates the output,
which you are probably familiar with in blast. So this is an interesting phenomenon. This
particular archaeon has two chromosomes and there are actually copies of the
16s on both of them. In fact on chromosome 1 there are two copies. So
this is one of them. You can see the very good match here, which is a
perfect match, and here’s the second one where the match is not so perfect. And
that’s why, as I was trying to point out in this example, we made two records for
this particular organism, because there are two distinct and quite divergent 16s genes.
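
To put a number on "perfect" versus "not so perfect", one common measure is percent identity: identical columns divided by alignment length. The sketch below is only an illustration of that arithmetic on two already-aligned strings. The short sequences are invented toy data, not the Haloarcula 16s genes, and the function name is hypothetical.

```python
def percent_identity(aligned_query, aligned_subject):
    """Percent of alignment columns where both sequences carry the same base.

    Both inputs must already be aligned (same length, '-' for gaps).
    Gap columns count toward the length but never as matches.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    matches = sum(
        1
        for q, s in zip(aligned_query, aligned_subject)
        if q == s and q != "-"
    )
    return 100.0 * matches / len(aligned_query)


# Toy sequences for illustration only.
query = "ACGTACGTAC"
perfect_copy = "ACGTACGTAC"
divergent_copy = "ACGAACGTTC"

print(percent_identity(query, perfect_copy))    # 100.0
print(percent_identity(query, divergent_copy))  # 80.0
```
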
One way to look at this, which I think is fairly useful to do, is to go to
the graphics view here. One advantage of this is it’s loaded the
entire blast results so we can see the hits from both of our sequences by
clicking here. So I click on graphics, and you can see they’re kind of hard to see because
we’re kind of zoomed out, but there’s a hit over here and there’s one over here.
So these are the two different ribosomal RNA genes. I can just sort of zoom
in to look at those. There’s the 16s gene and you can see the
two different query sequences. One aligns perfectly, one has a bunch of mismatches,
and if you want to, you can go back and retrieve these yourself. Look at the
other one and you’ll see that it’s the opposite, they’re
flipped: it’s perfect along one of them, not perfect on the other one.

Okay, so that’s one example. Let’s do the
second example, which is to identify a fungal clone. So what we’re going to do is
retrieve a particular sequence. I’ll go ahead and clean up my tabs here. Actually, it
might be easier, to save us a little bit of time, to go straight to blast
because I can use the accession number to do this. This is an uncultured fungal
sequence from a population set, that’s an environmental study. I’ll go ahead and go
to the blast page here. I can get my accession number here, which is the
uncultured fungal clone. This is a typo which I fixed; this is the actual
accession number here. I did not type that very well. I’m going to copy that
and paste this into nucleotide blast. Now we can run this against nr if we
want to; that would be the default when you come here. Something that’s kind of
useful to do sometimes is to reset the page. We can search the nucleotide collection. If
we want to, we can select sequences from type material. You can run this yourself
without doing that and what you’ll find is you get a lot of hits to the uncultured
clones of fungi. I can check this box here if I want to and it will give me a
much cleaner set of search results. And the fungal ITS sequences are in the nr
database and so we can get them quite easily by doing this particular
search and hitting them. Okay, so I got them before I could get the URL
together for you. So notice that I have an NR_ sequence
here. This is one of our RefSeq ITS regions. This is a sequence from GenBank, but
these are all going to be associated with type material. And so I have an
uncultured fungus and my best hit is to Penicillium subrubescens. I can take
a look at that down here. You might like to know how different that is from
the clone that I searched with, so you can use the formatting options here. Change this to
pairwise with dots for identities, a useful view in particular for trying
to identify something. So we can see that there is a single mismatch and this is
probably the right species. If I wanted to take a look at this record, which I
think is a useful thing to do here, we can go back and take a look at it in
the nucleotide database. You can see it has information about the
fact that this is the type material for this particular species, where it was
collected from, and there’s a link to the database in the Netherlands where I
could order that particular clone. And you can see what’s on here. It is ITS1, the
5.8s ribosomal RNA, and part of the 28s ribosomal RNA. So it’s got both
of these internal transcribed spacer regions, which are the more variable
parts of this that aid in species identification. If you want to see that
in some kind of a more easy-to-understand way than the feature
table, you can always click on this graphics link to show you that, which
we’ve already done once today. And here it is: ITS1 here, 5.8s here,
ITS2, and the 28s here.

Okay, for the last example we’re going to get a set
of sequences from our popset database. These are from a wastewater treatment
plant and I’ve used this before with MOLE-BLAST, but it’s a useful example just to show
you how that works. And this will be a search against the 16s sequences. Let me
go ahead and get those sequences for you. Basically I’m just going to use a link
to retrieve these from popset, which is one of our Entrez databases. And I don’t need to do anything because
I’ve got the URL. What I’d like to have are just these nucleotide sequences so I can
work with them in a blast search, particularly a MOLE-BLAST search. So I’m
just going to follow the link to nucleotide. And there are 437 of those. I could
potentially try to cluster all of them with blast, but that’s rather expensive and
time-consuming, so we’ll just use a subset of them to show you how this
works. One way to do this is to get the accession list; that’s just going to
give me the first 20. I can take the first ten of these if I want to or I can
try all 20; I have MOLE-BLAST results already saved for this. MOLE-BLAST
takes a few minutes to run, longer than we really have for this webinar.
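
Trimming an accession list down to a subset, as done here with the first 20, is easy to script if you save the list to a file. A minimal sketch, assuming the list is plain text with one accession per line; the accessions below are made-up placeholders, not records from this popset, and the function name is hypothetical.

```python
def first_accessions(accession_text, n=20):
    """Return the first n distinct, non-blank accessions, keeping their order."""
    if n <= 0:
        return []
    seen = set()
    picked = []
    for line in accession_text.splitlines():
        acc = line.strip()
        if not acc or acc in seen:
            continue  # skip blank lines and repeats
        seen.add(acc)
        picked.append(acc)
        if len(picked) == n:
            break
    return picked


# Placeholder accessions for illustration only.
example_list = """
XX000001.1
XX000002.1

XX000002.1
XX000003.1
"""

subset = first_accessions(example_list, n=2)
# One accession per line is a convenient form to paste into a query box.
print("\n".join(subset))
```
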
But I’m just going to copy those to the clipboard. I’m going to go over here, sorry,
go to the NCBI homepage, go to blast, and the MOLE-BLAST page is linked from our new blast
homepage, which you’ve probably seen,
down here in the lower right-hand corner. So I now have these 20
accession numbers here, for these basically uncultured bacterial clones from this
wastewater treatment plant. And I’ll search that against the 16s ribosomal sequences.
This is a blast database and it contains both the bacterial and archaeal
projects. Notice that I could also do the same thing with the fungi, and in that
document on the ftp site there’s a set of fungal sequences, the same one that I
got that clone from earlier. You can play around with trying to cluster those if you
want to. Now what I’d like to do is go down here to the advanced parameters
because that’s where I can do things like make sure that I’m only looking at
sequences from type, which is a good thing to be able to do. I can adjust
other parameters of the MOLE-BLAST search down here. And then I can go
ahead and run that. This takes a few minutes to run, so instead
of doing that, I ran some MOLE-BLAST searches earlier today. One of
them did not work; don’t worry about that one. But I can go
ahead and retrieve the results so you can see what that looks like. So this is our new tree viewer, which is
a little bit more deluxe than anything we’ve had for this purpose before. Let me
go ahead and maximize that, but there are a lot of leaves here. One thing you can
do is click on this optimal zoom button here, and that lets you see things a
little bit more clearly. So now we have all these uncultured
bacterial clones, but you can see that they’ve now been classified for me, at
least to high-level bacterial groups here. You can see that I can at least
figure out whether they’re in beta proteobacteria or other things like that, or
other groups that I recognize like cyanobacteria. For some of them I might even
feel like I’ve got to the level of the genus, like this one, Exiguobacterium; there is
a little bit of a distance here. I can display that alignment if I want to just
by clicking this link here and it will open that in a new tab. And I can
evaluate what I think about the relationship among those sequences. So
notice that it’s got one of my query sequences here, and it’s got these
sequences here. This is actually the title of the record. If you wanted to, if
you were really trying to be good about checking a taxonomic name, you could use
that as the label on your node there. If I do that it’s going to redraw my tree and I’d have to start
over again, so I’m not going to do that. Just keep in mind that if you’re really
trying to identify something, select this as the label for your leaf nodes. Let’s
go ahead and show that alignment there. And it got me with the pop-up blocker. There it is. This is the alignment and you can see my
query sequence embedded in those 16s sequences. You can actually change this
to dots for identities like we did previously to see where the differences are.
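
The idea behind a dots-for-identities view is simple enough to sketch: every column where a sequence agrees with the query is drawn as a dot, so only the differences stand out. The snippet below is a toy illustration of that idea and not how the blast or MOLE-BLAST viewers are implemented. The sequences are invented and the function name is hypothetical.

```python
def dots_for_identities(aligned_query, aligned_subject):
    """Render the subject with '.' wherever it matches the query column for column."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    return "".join(
        "." if s == q else s
        for q, s in zip(aligned_query, aligned_subject)
    )


# Toy aligned sequences for illustration only.
query = "ACGTACGTAC"
subject = "ACGAACGTTC"

print(query)
print(dots_for_identities(query, subject))  # ...A....T.
```
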
So it agrees with several of these sequences at this point where it’s
different. And then there are some changes towards the end of the sequence that may
or may not be important; you’d have to evaluate that. And from here you could
download this alignment if you want to, or you could download the entire thing.

Okay, so I think that’s all I
wanted to show you for today. To remind you, if you need help with any of
this stuff you can go to our Learn page, which has links to the webinars and
courses page, where you can find the materials for this. We have a set of fact
sheets that are very helpful. Check our YouTube channel for this video and
other recordings of webinars and other helpful videos. If you have any questions
about this webinar, the content of it, write to info or write to me, [email protected]
or [email protected]. If you have questions about the webinar
program or technicalities about the webinar itself, you can write to
[email protected], and I’ll leave the webinar open for a few minutes
for questions. So we’ll stay online for a few minutes if anybody has any questions.
Okay, well, we’re not hearing from anybody right now, but please feel free to write
if you have any questions. I’m going to go ahead and close the webinar for now
and we’ll talk to you soon.
