NCBI Minute: Human Population Genetic Data at NCBI


[Peter Cooper] Hello everybody and welcome to today’s webinar. This is Peter Cooper from NCBI. This is a webinar on human population genetic
variation data at NCBI. Lon Phan will be the presenter today. I’m going to turn it over to Lon and he will
take you through the information today.

[Lon Phan] Thank you, Peter. Welcome everyone. These are the topics that will
be covered today. We will start with dbSNP, then we will go into this new project, the
Allele Frequency Aggregator (ALFA) project, then a quick demo on the web and also on the API,
and then we will follow with Q&A. dbSNP just celebrated its 20th anniversary. It currently has over 680 million reference
SNPs (RS) from over 2 billion submissions, annotated on both GRCh37 and GRCh38. Over half a billion RS have frequency data. In addition to mapping to GRCh37 and GRCh38, we
also annotate the RS on the mRNA and protein, along with the functional consequences, ClinVar
clinical significance, and allele frequency. You can find out more about the RS and other
attributes that we annotate at the link, https://go.usa.gov/xpPet. The topic today is allele frequency, so
this is what we currently have in dbSNP. The name is a misnomer in that dbSNP contains both
common and rare variants. This is a table [Diverse Populations] of the
projects currently in dbSNP: the project, the number of subjects, and the variants that contribute
to each one of those projects. dbSNP aggregates data from global
survey studies including gnomAD, TOPMed, ExAC, PAGE, GO-ESP, and 1000 Genomes. We also include regional or cohort-specific studies,
including the UK parent and child cohort (ALSPAC) and the twin cohort (TWINSUK). We also include specific populations
(Estonian, Vietnamese, Northern Sweden), and as we identify more studies we add
them to dbSNP. The ALFA project will also
be included in future dbSNP builds. ALFA is on the same scale; it is one of the
largest projects in dbSNP. How do you search for this data? You can begin your search by going to the dbSNP
homepage. The search box is on top, and there are some examples provided down here. You can search
by reference SNP, gene, genomic location, clinical significance, and allele frequency
range. To get you started, other terms are listed
here (link name). There’s a handout that you can download, a
fact sheet that can get you started with searching and displaying the records. When you enter a
search term in dbSNP, what you get back is a list of variants in the center here. There are some widgets over here to help manage
filters and search queries. On the left here is where all the filters are for the attributes
I described earlier. You can filter by variant type, clinical
significance, and also allele frequency. Once you narrow down your search results,
what is shown on the Entrez display is some summary information about the variant, variant
type, map location, gene, functional consequence, also the clinical significance but it also
includes the allele frequency down here. This is the study wide allele frequency of
these projects but if you want to look at the subpopulation let’s say Asian you have
to go to the reference SNP page and you can follow this link here (rs number link) to
go to the full refsnp page. You get the summary information but the detailed
information is provided in these tabs. So there is a tab for each category of information
and you can drill down to the information for each tab. For frequency, if you click on it you get this
big long table that shows you the studies, the subpopulations in the studies, the
samples, and the allele frequency. If you zoom in on one particular region, for
instance here, this shows the 1000 Genomes study: the global frequency data, the
sample size, the ref allele, and the alt allele, and then also the subpopulations. There are a couple of other tools on this table
that help you narrow down to your population or study of interest. That includes a search filter box up here. So for instance you can start typing the study
name and it will narrow itself down to just that study and show you the results. The other thing is that the columns here
are sortable, which allows you to sort by any of these attributes. For instance if you sort by the Alt allele,
either ascending or descending whatever your preference, in this case we sorted by descending
order and what this shows is that this allele is most frequently recorded in Asian populations
across these studies. That may give you more confidence in terms
of the prevalence of your allele for your populations of interest by looking across
different studies. You can also download your results by clicking
on Send to, and there are different formats. The other advantage of using the Entrez
system is that you have access to the eUtils API, so every search that you do on the web you
can also do with the API and retrieve the results. We have a GitHub page that you can go to;
there is a demo for eUtils that you can take a look at, and other examples of how to parse
and access the SNP data. You can also launch that demo on Binder and
interact with it live. I will go over some of the demo later. Here’s an example of the notebook; there’s
code here to access dbSNP. You just enter the same term
you would normally enter on the web, then retrieve the results. The data is also available for FTP download
from the dbSNP homepage in VCF and JSON format. That concludes the dbSNP part of this presentation.
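
As a rough illustration of the eUtils access described above, here is a minimal Python sketch that builds an ESearch request against the dbSNP database and parses the JSON reply. The function names are hypothetical, the example field tag "[Gene Name]" is an assumption that should be checked against the dbSNP search help, and this is not the code from the NCBI notebook.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=20):
    """Build an ESearch URL for the dbSNP database (db=snp)."""
    params = {"db": "snp", "term": term, "retmode": "json", "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


def parse_esearch(json_text):
    """Return (total hit count, list of rs IDs) from an ESearch JSON reply.

    Assumption: dbSNP UIDs are the numeric part of the rs number,
    so we prefix them with "rs" for readability.
    """
    result = json.loads(json_text)["esearchresult"]
    return int(result["count"]), ["rs" + uid for uid in result["idlist"]]


def search_dbsnp(term, retmax=20):
    """Run the search over the network (not called automatically)."""
    with urlopen(build_esearch_url(term, retmax)) as resp:
        return parse_esearch(resp.read().decode("utf-8"))
```

Calling something like search_dbsnp("TP53[Gene Name]") would use the same term you would type into the web search box; NCBI asks that heavy users keep request rates low or register an API key.
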
Next I will describe the ALFA project with dbGaP. dbGaP is a database housed at NCBI
that holds the results of studies that look at the interaction of human genotype and phenotype. It contains a lot of data, including over 1.5
million subjects and trillions of genotypes. dbGaP has two access levels,
open and controlled. When it started in 2007, the Genomic
Summary Results (GSR), which included p-values and allele frequencies, were initially open access. Shortly after it went live, a publication
showed that, using statistical methods,
you can identify an individual in a study if you have that person’s genome. In response, to protect the participants
and to evaluate the policy, NIH moved the GSR data under controlled access. However, recently this policy has been reversed,
so now we are again allowed to share summary result information, including allele frequency. We embarked on this project called the Allele
Frequency Aggregator, or ALFA. It aggregates frequency data from dbGaP
and makes it available to dbSNP for variant interpretation and analysis. I will walk you through how we process the data. This will give you an overview of the processing
pipeline. All the studies that have been approved for
open access go through QA/QC. We then compute the ancestry for all the subjects
in the studies. Then we convert all the data into a standard
VCF format. Then they get fed into the dbSNP build pipeline for normalizing variants, which
utilizes a couple of new algorithms and variant representations, SPDI and VOCA. You can read about the algorithms
we used to normalize the variants in the publication that just came out this month (PubMed:
31738401). Once we aggregate and normalize
the variants, we can aggregate individuals and genotypes, compute the allele frequency,
and make it available through the normal dbSNP resources. There are a couple of challenges with this
project. One of course is the large amount of data in dbGaP, and the other is harmonizing the
different subjects and populations across studies. Most of the populations are self-asserted or self-reported,
or have no ancestry reported. The other challenge is of course normalizing
the variants, as they come in different formats like VCF and PLINK, they may come from
different assembly versions, and they are called by different pipelines and are of mixed quality. To handle the large volume of data and normalize
the variants, we used the existing dbSNP and dbGaP curation and build pipelines. And to harmonize the populations we used a program
called GRAF-pop that was developed by dbGaP. What it does is use Ancestry Informative
Markers (AIMs) to validate the sample ancestry, remove duplicates, and assign subjects
with common ancestry to populations. I will walk you through briefly how this is
done. This shows you an example of how GRAF-pop
assigns an individual to a particular population. GRAF-pop creates a genetic fingerprint
for each individual from the genotypes, computes genetic distances, and plots them on a graph. What this shows you is a comparison between
what GRAF-pop computed and what was self-reported, shown in the different colors. Individuals with common ancestry are grouped
together and shown at the vertices, and
individuals with mixed ancestry fall along the axes. Shown in the blue ring is the 95 percent interval;
those individuals share a common ancestry and are assigned to the populations at the vertices:
African, European, or East Asian. Here are subjects with mixed ancestry. For example, there are subjects reported
as Hispanic that GRAF-pop separated into two ancestry groups. One is a Hispanic group with Afro-Caribbean
ancestry, and one is a Hispanic group that has more European ancestry. So that is how we assigned the individuals
to the different populations, and there are 12 different populations that GRAF-pop assigns
them to. Any individuals of mixed ancestry not within
the 95th percentile are assigned to Other. We also use the AIM markers to QC the variants
and the allele frequency reported by each study. We took the markers from the studies and
compared their frequencies to those reported by 1000 Genomes. This is a plot of the variant allele frequency
in the study versus 1000 Genomes. Here they are in concordance,
so that is good, at least for these European subjects. This next one shows a study that did not pass
QC because, for more than half of the variants, the allele had been flipped.
That’s why you see that the axes have essentially been inverted, shown in red here. Then there are more subtle issues that we identified,
where there seems to be some systematic error: frequencies that you would expect to be
random are instead assigned 0.3 or 0.7, so that failed QC also. Even more subtle is where 1000 Genomes
reported that some of the markers are heterozygous, but the dbGaP study reported them as homozygous. Typically these are derived from microarray
assays, and probably these are probes that failed in the assay to detect the heterozygosity. So that illustrates the process and
the steps that we go through for the QC of the subjects and the variants. Once we have done all of the QC, as I mentioned
before, we convert the data to a standard format and that goes into the dbSNP pipeline, where
variants get normalized and either assigned new RS numbers if they are novel or merged into existing
RS numbers, and are then distributed through the usual product
pipeline. This is a summary of a build we did recently of
about 50 studies from dbGaP that have been approved for open access. It comprises
over 140,000 subjects and over half a trillion genotypes. These are normalized and aggregated to
almost half a billion RefSNPs. The majority of them already exist in dbSNP
and about 20 million are novel. This is a breakdown of the populations from
that build. Most of it is European, about 75%, and there
is some representation of the other continental populations. This only represents about 10% of the potential
1.5 million subjects in dbGaP so far, and our goal is to increase the representation of underrepresented
populations in future builds and releases. That may include new cohorts and new populations. They will be included with the quarterly dbSNP
builds. The data for ALFA will be reported on the
RefSNP web page, just as I showed before for existing data from other projects. The data will also be accessible by API. You can get the metadata for the populations
and the studies. You can also query by position range and
by RS number. Of course the data will also be available
for download from the FTP site. It is a standard VCF format: accession, position,
RS ID, alleles, then a column for each of the populations with the allele
counts. As I mentioned, the first release of the
ALFA data will be coming out in January 2020, so look for that announcement. With that we will move on to the demo section
of the webinar. First I will demonstrate the Entrez search
then we will go into the API. You can start from the dbSNP homepage. I’m going
to use the example on our help page. If you go here, like I mentioned before, there
are additional supported search terms, and there are some examples shown here of more complex
queries. I am going to look at this first example, which
is kind of interesting: how you can use the global minor allele frequency to interrogate
whether an allele is suspect or not. I’ll click on this, then explain it in a bit. Typically what I like to do is look at the
outliers, so look for variants where the allele is not detected in any of the projects or
is rare. So here I am querying for variants where the frequencies
are zero. What this suggests is that there
may be an assembly bias, and that the assembly may have been incorrectly annotated at that
position, so that a comparison against the assembly calls a variant there, but
when they go and genotype it, it fails and they do not detect heterozygosity or allele
frequency. For instance, if you look at the first one
here, you can see on GRCh38 the alleles reported at the position are C-G. And then if you look at the allele frequency,
in every project the variant allele is non-detectable. But then if you look at the HGVS, which we report
on both GRCh37 and GRCh38, this is GRCh38 and you see the C to G here. If you look at GRCh37 you see the alleles
have flipped. Then if you interrogate further and look at the
dbSNP web page, you see no evidence for the variant allele, whether it is an artifact
or real. So if you look here, here are GRCh37 and GRCh38, and
you see the allele is flipped. You
can also see the evidence for this in the sequence viewer by adding
the alignment tracks for GRCh37 and GRCh38. Go to Configure Tracks, then go to Sequence,
then scroll all the way down. We have alignments for GRCh38, all of the alternate assemblies,
and the older versions of the GRC assemblies. Go all the way down to GRCh37, the alignment
between 38 and 37, and add that. There you can see a mismatch between GRCh37
and GRCh38. So this is an example of how you can use allele
frequency. It is probably not a normal use case, but it is kind of interesting. Now back to dbSNP. I want to show you how to access the data
using eUtils on GitHub, so click on GitHub there. There are a couple of Jupyter notebooks. One that we just recently developed shows how
to query and parse the allele frequency from MafGraph, and you can launch it and interact live
by clicking on these buttons (launch and binder). I already have one launched
right here. I have to rerun it. I will just walk you through briefly what
it does. dbSNP is part of the Entrez system, which allows
you to search across not just dbSNP but all the databases and link all the information
together. What this example does is
search for the genes of interest in the Gene
database, then use that information to search dbSNP, and then extract the frequency
data. This is the code. Here we are searching the Gene database first;
we are searching for all ACMG genes. It finds that there are 60 genes, shown here. This is just a simple widget we developed
to demonstrate the search. From there you can iterate through each of the genes
and search for its variants in dbSNP. The search for that starts here. It found that there are 832 SNPs in this
gene here. It’s actually TP53. I’m not sure why the widget didn’t show up;
let me try to run it again. I’m going to have to restart it; it might
have timed out. While that spins up I can still explain the
code. Anyway, this allows you to search across multiple
databases and integrate the data. Essentially what it does is search dbSNP,
grab each of the RS, and then plot the allele frequency for, for example, one RS here. It’s still spinning up. Anyway, I guess we can move to the Q&A while
this is spinning up and I can go back to it later. Do we have any questions?

[Peter Cooper] If anybody has any questions (this is Peter Cooper again), go ahead and type them in the pod. We don’t have any questions at the moment. Lon, while this is spinning up, do you want to
show the other slide so we can share some other URLs they can go to if they want to? Just go to the last slide. This last slide has a number of URLs you can
visit to find out more information. The first two are Variation Services and SPDI
annotation; these are useful for understanding the various API things that Lon was talking
about today. Just to remind everybody, we have a blog
called NCBI Insights, which is where you can go to find out the latest information about
NCBI services and databases. We post articles there several times a week. There’s also a Learn page that you can go
to to find documentation and things like that. There is a very nice set of fact sheets, PDFs
that are on the FTP site. These are the handouts that we give out at
conferences. They take you through a particular resource
and show you how to do things. And we have a YouTube channel. That’s where this webinar will be, and there
is a nice playlist of webinars plus some other tutorials. Okay Lon, we do have one question, is there
a way to check the allele mismatch, GRCh38 versus GRCh37, in the dbSNP build 153 downloaded
VCF file?

[Lon Phan] The only way you could do that,
I guess, is to compare the annotation on 37 and 38. We did have a list; if you go to the dbSNP FTP
site we have it in the archives, the list of the variants that are a mismatch between 37 and
38. If you go to FTP download, there is a pre-build
152. Go to organism, then human; it doesn’t matter, I don’t think. Go to b151_GRCh38p7/known_issues. So here is the one where we report all the
variants where the alleles have flipped between 37 and 38, right there. That should not have changed too much over the
last couple of builds. Okay?

[Peter Cooper] Here is a general question:
is the rate of discovery of novel SNPs declining or accelerating?

[Lon Phan] The rate is declining. I think we have probably reached somewhere near the
threshold right now. Essentially all the common variants have
been found. The novel variants are the rare variants,
and it really depends on the study size: the larger the study, the more
rare variants can be detected. Hard to say, but if I had to guess it’s probably
starting to plateau. The rate is decreasing.

[Peter Cooper] Thanks Lon, I think you addressed
this, but I will read the question anyway: can we explore unique variants, those that
are individual-specific SNPs?

[Lon Phan] I’m not sure what you mean by unique.
Do you mean the ones specific to a population?

[Peter Cooper] I think he means for an individual,
so that would be controlled access data.

[Lon Phan] It’s going to be hard. I assume
you could look at the allele count; if it’s like one or two, it is probably only
coming from one or two individuals, so those are the rare variants. So we could probably
try to search
on dbSNP for really rare variants. So we’ll just change this from zero to, let’s
say, 1 out of 10 thousand. Let’s see if we find any. It’s not many, but
there are a few there. Actually we should get rid of the zero; let’s try
that. Essentially you just have to search for the
really, really rare variants. If their allele count is one or two, it is probably because
they only come from one or two individuals.

[Peter Cooper] Someone asked a question,
but really I think they are making a point: that a SNP reported in one individual is maybe
an artifact rather than a true variant, which we showed earlier when you were doing the
zero frequency example. I don’t see any more questions. Lon, did you ever get this to run? Let’s do that.

[Lon Phan] It looks like it did run finally. You can probably run this on other services,
download the notebooks and run them on Azure or Google, which will probably be faster, but
I think all the code is there for you to play around with. It was supposed to plot the allele frequency
of the variants, and there is supposed to be a drop-down menu where you can select which variant you
want to plot. It is supposed to allow you to dynamically
change the query and play around with it, but anyway, that is about it.

[Peter Cooper] One last question for today:
they wanted to find a way to explore SNPs that overlap with multiple regulatory sites,
for example from ENCODE, or transcription factor binding sites. You would have to look at the annotation
on the genome to find out where those were, I think.

[Lon Phan] Very good question. We do plan on adding regulatory region
annotations to dbSNP, probably coming out this summer. Then it will be just another attribute that
you can filter on. For instance, let’s just use all[sb] here. It will have annotated regulatory regions, just like
this; there will be a list here that shows you all the regulatory regions, enhancers,
silencers, and so on. That will be based on RefSeq annotations. RefSeq already has a project to annotate the
regulatory regions, so that will be included in dbSNP, hopefully later in the summer.

[Peter Cooper] Okay, thanks everybody for coming,
and thank you Lon. I will go ahead and end the webinar now. Thank
you all for coming.
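
The aggregation step described in the talk, where allele counts from many studies are pooled per population before a frequency is computed, can be sketched roughly as follows in Python. This is only an illustration under simple assumptions: the function name, the input layout (one tuple per study and population), and the example numbers are hypothetical and are not taken from the ALFA pipeline.

```python
from collections import defaultdict


def aggregate_frequencies(records):
    """Pool allele counts across studies and return per-population frequencies.

    records: iterable of (population, alt_allele_count, total_allele_count),
    one entry per study and population (hypothetical layout).
    """
    alt_counts = defaultdict(int)
    total_counts = defaultdict(int)
    for population, alt_count, total_count in records:
        alt_counts[population] += alt_count
        total_counts[population] += total_count
    # Frequency is pooled alt count over pooled total count; populations
    # with no observed alleles are skipped to avoid dividing by zero.
    return {
        pop: alt_counts[pop] / total_counts[pop]
        for pop in total_counts
        if total_counts[pop] > 0
    }


# Made-up counts for one variant reported by three studies.
example = [
    ("European", 30, 200),
    ("European", 10, 200),
    ("East Asian", 5, 50),
]
frequencies = aggregate_frequencies(example)
```

Pooling counts rather than averaging per-study frequencies weights each study by its sample size, which is presumably why the FTP files described in the talk carry allele counts per population.
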

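The flipped-allele QC check mentioned in the talk, where a study fails when more than half of its markers look inverted relative to 1000 Genomes, could be approximated with a sketch like the one below. The function names, the 0.2 informativeness gap, and the example frequencies are assumptions made for illustration; the actual dbGaP and dbSNP QC may work differently.

```python
def fraction_flipped(study_freqs, reference_freqs, min_gap=0.2):
    """Fraction of informative markers whose study frequency looks flipped.

    A marker counts as flipped when the study frequency is closer to
    1 - reference than to the reference itself. Markers whose reference
    frequency is within min_gap / 2 of 0.5 are skipped, because f and
    1 - f are then too close to tell apart (min_gap is a hypothetical cutoff).
    """
    informative = 0
    flipped = 0
    for marker, f_study in study_freqs.items():
        f_ref = reference_freqs.get(marker)
        if f_ref is None or abs(f_ref - 0.5) < min_gap / 2:
            continue
        informative += 1
        if abs(f_study - (1 - f_ref)) < abs(f_study - f_ref):
            flipped += 1
    return flipped / informative if informative else 0.0


def passes_flip_qc(study_freqs, reference_freqs, max_flipped=0.5):
    """Pass only when at most max_flipped of the informative markers look flipped."""
    return fraction_flipped(study_freqs, reference_freqs) <= max_flipped


# Made-up frequencies keyed by placeholder marker names.
study = {"m1": 0.9, "m2": 0.8, "m3": 0.15, "m4": 0.5}
reference = {"m1": 0.1, "m2": 0.2, "m3": 0.1, "m4": 0.5}
```

In this toy input two of the three informative markers look inverted, so the study would fail the check, mirroring the inverted scatter plot described in the presentation.
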