Comparing genes and species in Ensembl


Hello, my name’s Emily and I’d like to guide you through some of the comparative genomics tools in Ensembl. We’ll be looking at the human EGFR gene, and we’ll be comparing it with other genes, both in human and in other species. Epidermal Growth Factor Receptor (EGFR) is an important gene in cell growth and has been implicated in some cancers. In this demo, we’re using Ensembl version 70. When a new version goes live, version 70 will still be available through our archive sites. Let’s start at the main page. Enter “human EGFR” in the search box.
Click “Go”. Click “Gene” then “Human”. Now let’s click on the EGFR Ensembl ID to go to the gene summary page. To see how this gene compares to the same locus in other species, click on “Genomic alignments” at the left hand side. By default the exons are highlighted in red and only the human sequence is shown. This is because an alignment has not yet been selected.
Let’s select an alignment. There are multiple pre-calculated whole genome alignments available. The first options are multiple alignments for certain taxonomic groups; for example, the primates, eutherian mammals, or vertebrates. We only see the alignments available for human. If you were looking at a zebrafish gene, you would see the option for the fish alignments. Pecan and EPO refer to the computational methods used to determine the alignments. Further down the list are whole genome alignments done on a pairwise basis, using BLASTz-net (for a comparison at the nucleotide level) or translated BLAT (at the amino acid level) to compare two genomes. Click on the (i) icon for more information, or go directly to the Comparative Genomics section of the “Help and Documentation”. For now, we’ll continue to explore the page.
I’m going to choose the 13 eutherian mammals EPO and click “Go”. Listed here are the regions of the species that could be aligned. The genomic assembly, chromosome, base-pair range and strand are indicated. For example, the orangutan alignment is on chromosome seven, on the reverse strand, as indicated by a minus one. Underneath are the aligned sequences themselves. Again, exons are shown in red. Dashes are gaps in the alignment, and dots indicate no known sequence for a region.
We can highlight conserved residues. Click on “Configure this page” in the side menu. Select “conservation regions” and “all conserved regions”. Other options are available, including variations and line numbering, but we’ll leave those for now. Click on the tick to save. Now we can see conserved nucleotides highlighted in blue.
These are just single nucleotides that are the same in more than half the species. Exons are still red. Whole genome alignments are also available from the location tab, in a zoomable display. We’ll take a look later on.
For now let’s look at orthologues and paralogues. Homology is determined using all the genes in all species in Ensembl. The longest reliable protein for each gene is chosen as a representative transcript. Click on the “Gene tree (image)” link at the left. There’s another help page here to guide you through the graphic if you need it. This is the collapsed form of the tree. Our gene of interest is marked in red. Collapsed nodes are shaped like funnels. At the right hand side are the protein alignments. Green bars represent alignments. The light green bars correspond to matches to an individual protein. Dark green bars are consensus sequences corresponding to collapsed nodes of the tree. White spaces are alignment gaps.
Let’s expand the “African mammals” node. Click on the node, then click “Expand this subtree”. The node I clicked on was blue, which is a speciation event. Red nodes indicate duplication events. Let’s click on one of those. The duplication confidence score is shown. Ideally, if a species is found on one side of the duplication node it should also be found on the other, if no gene loss has occurred over time. The duplication confidence score is the fraction of species that are found on both sides of the node. Turquoise nodes indicate an unsupported duplication, which we consider to be a speciation event. The confidence score for these is zero.
From this red node, let’s follow the link to Jalview. At the time of recording, Jalview requires Java 7, which does not work with Google Chrome on a Mac operating system. If you’re using Google Chrome on a Mac, switch to another browser, such as Firefox or Safari, for the next part. Click on the node, then “Start Jalview”. Jalview is an alignment editor. Edit or export sequence alignments using this Java plugin. Gene trees are also available in text Newick format. We can also look at the Gene gain/loss tree, which shows whether the gene family contracted or expanded over evolutionary time.
Orthologues and paralogues are determined from the trees and listed in tables in the pages here. Let’s take a look at the orthologues. You can choose which species you’d like to see orthologues from. I’m interested in mouse orthologues, so I’ll select “rodents”. I can see lots of information about the orthologues, and I can click on Ensembl IDs or locations to see the genes in the browser. I can click on this link to see protein alignments of all the orthologues of human EGFR.
Next, let’s click on “Protein families” in the left hand menu. These families are generated using Markov clustering of all isoforms in Ensembl and additional metazoan proteins in UniProt. Jalview is available for cross-species alignment. Click “all proteins in a family” to get a list of members.
There are two families we can choose: RECEPTOR TYROSINE KINASE ERBB PRECURSOR, or EPIDERMAL GROWTH FACTOR RECEPTOR.
Let’s choose the second one. This view includes both UniProt and Ensembl proteins, and allows users to compare data between the two datasets.
So far, we’ve confined our exploration of comparative genomics views to one gene. But we can also explore whole genome alignments and synteny in a zoomable region. To do this, let’s go to the location tab. This is the genomic region of EGFR, on chromosome 7. I’m in the “Region in detail” view. At the top is chromosome 7, with alternate patches and haplotypes shaded in green and red. The location of EGFR is indicated by this red box. Below is an image centred on EGFR, with other genes indicated in a 1 Mb region. The third image is a zoomable region. For more about the Region in Detail view, see our video.
You can turn on comparative genomics tracks in the bottom view by clicking on “Configure this page”. Here’s the comparative genomics menu. I’m interested in the Constrained elements and the Conservation score for the 36 eutherian mammals. The constrained elements are switched on by default, but if you’ve been exploring Ensembl and editing these tracks you may need to switch them on. I’ll switch on the conservation score. Click on the tick, and let’s look at the tracks. The pink histogram indicates the GERP scores, which are a measure of how much conservation there is between the 36 eutherian mammals. These pinkish-brown bars can be considered as a summary of the GERP scores. They highlight regions of high GERP scores, known as constrained elements.
There are other views in the Location tab dedicated to comparative genomics. I can view alignments as text or as an image, or I can view regions in the browser or by synteny. First, let’s have a look at the alignments image. At the moment, no alignment is specified. I’ll choose the 13 eutherian mammals again. At the top is the human region. Other species are shown underneath. At the very bottom, a note indicates which species could not be aligned in this region.
Looking at the image, vertical peach stripes show alignment, and white stripes are gaps in the alignment. Coral-coloured arrows represent breaks in the alignment. The blue horizontal bar represents the contigs. Contigs are the blocks of sequence that make up the whole genome assembly. In some species in this alignment, we can see gaps in the blue bar. Where there is a corresponding gap in the coral bar, this is a gap in the alignment. Where there is no gap in the coral bar, this is a gap in the sequence. Hollow coral bars represent the reverse strand.
Now let’s look at the text alignment. This has the same display as the “Genomic alignments” option from the gene tab. This time, though, we’re not confined to one gene and we could look at intergenic regions. The 13 eutherian mammals EPO is already selected. Again, chromosomal regions for these species are listed above and the alignments are listed below, with exons in red. Click “Configure this page”. I’m going to choose “show variations”, then “yes and show links”. Click on the tick to save and close. Now any variants in this region for any of the species are highlighted.
Let’s take a look at this variant. The variant of interest is rs17335717 and the source is dbSNP. These buttons allow you to explore the variant. The links in the left hand menu navigate to the same places as the buttons. Click on the “phylogenetic context” button. Select the 13 eutherian mammals alignment again. This is the alignment centred on the variant. Differences between species are highlighted in pink, whilst known variants are highlighted according to the legend at the top. In this particular alignment, all the variants are intronic. Did you know that Ensembl calculates ancestral sequences? See the article in our “Help and Documentation” page for more information. Let’s go back to the location tab and take a look at the “Region Comparison” view.
At the moment we can see a view similar to the “Region in Detail” view. To compare to other species we have to add them using this button. I’m going to choose mouse. Now click on the tick. At the top there’s a zoomed-out view of the region. Here’s mouse EGFR. Scroll down to see how the regions align. The pink bars show alignment, with the green linkers connecting them between the two species. The white spaces are unaligned sequences. Gaps in the blue bar indicate regions that have not been sequenced yet; fortunately there are no gaps in the assembly in human and mouse at this region.
You can also add features to this view, by clicking on “Configure this page”. For example, I could add CpG islands. This will only add them to human. If I want mouse CpG islands too, I have to select mouse from this drop-down menu and select the mouse track. Now when I return to the view I can see the CpG islands on both species in pink. You can also upload your own data to this view, allowing you to compare between species. For example, you could compare the binding locations of a protein from ChIP-seq experiments in the two species.
The “Synteny” view is similar to “Region Comparison”, but at a larger scale with less detail. Let’s take a look. Syntenic regions are 100 kb or more of conserved sequence across two species in Ensembl. Change the species of comparison and chromosome of interest at the right hand side. At the moment it’s set to human and mouse. Human chromosome seven is shown in the centre and the gene of interest is marked in red. Mouse chromosomes with syntenic blocks are scattered alongside. Coloured blocks correspond to synteny on these mouse chromosomes. For example, syntenic blocks to mouse chromosome six are labelled pink. For a gene list, you can choose upstream or downstream, or centre on your gene of interest. This will give you a list of genes shared by the two species to compare. Another great way to look at syntenic regions is using our scrollable view.
I’m going to switch to automatic track height so I can see everything in each track. You can add synteny tracks by going to “Configure this page”. I’m going to turn on “synteny with mouse”. The coloured bars at the bottom show synteny with mouse. I’m going to zoom out by scrolling with my mouse wheel. From here I can see synteny with mouse chromosomes 6 and 11. You can draw synteny in Region in Detail, which was the first view we saw in this Location tab. Region Overview is very similar, but allows you to zoom out more than 1 Mb. Synteny can be drawn in either view using “Configure this page”.
That’s it for the browser, but I’d just like to show you how to use BioMart to export homology information. I’ll walk you through this, and we also have a video tutorial explaining how BioMart works if you’d like to explore it further. Let’s click on BioMart in the top banner. As the database I’ll choose “Ensembl genes”. And I’m going to select “Homo sapiens genes”. I’m not interested in all the genes in the human genome, I’m only interested in one, so I’m going to filter them. If I go into “Filters”, and then open up “Gene”, I can narrow it down to the HGNC symbol and enter “EGFR”. Click “Count”. This shows that there is only one gene that passes my filter. Now I can pick attributes and select the homologues page. You can now choose paralogues. I’d also like to see if there are any orthologues. I’m interested in chicken and mouse. I don’t need separate results for each Ensembl transcript so I’ll turn that off. Click “Results” for a preview table. Yes, there’s both a chicken and a mouse orthologue for this human gene, as well as some human paralogues. You can export all the results to a file if you want, or view all the rows as HTML.
There is more information about our comparative genomics analysis in our Help and Documentation. The other way that you can access comparative genomics data is via our Perl API. Find out how to install and use our API here.
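The BioMart query we just built by clicking can also be written down as a BioMart XML query, which is handy if you want to repeat the export later. Below is a minimal Python sketch that only builds the XML text; it does not contact any server. The dataset, filter and attribute names (`hsapiens_gene_ensembl`, `hgnc_symbol`, `mmusculus_homolog_ensembl_gene`, `ggallus_homolog_ensembl_gene`, `hsapiens_paralog_ensembl_gene`) are assumptions based on BioMart's usual internal naming and may differ in version 70 or in later releases, so check them against the XML that BioMart itself shows for your query. The function name `build_homology_query` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed BioMart internal names; verify them in the BioMart interface.
DATASET = "hsapiens_gene_ensembl"          # "Homo sapiens genes"
FILTER_NAME = "hgnc_symbol"                # Filters > Gene > HGNC symbol
ATTRIBUTES = [
    "ensembl_gene_id",                     # gene-level ID only, no transcript ID
    "mmusculus_homolog_ensembl_gene",      # mouse orthologue
    "ggallus_homolog_ensembl_gene",        # chicken orthologue
    "hsapiens_paralog_ensembl_gene",       # human paralogues
]


def build_homology_query(symbol):
    """Return a BioMart XML query string for the homologues of one gene."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",                # tab-separated output
        "header": "1",                     # include a header row
        "uniqueRows": "1",                 # drop duplicate rows
        "datasetConfigVersion": "0.6",
    })
    dataset = ET.SubElement(query, "Dataset",
                            {"name": DATASET, "interface": "default"})
    # One filter: restrict the gene list to the requested HGNC symbol.
    ET.SubElement(dataset, "Filter", {"name": FILTER_NAME, "value": symbol})
    # One Attribute element per output column.
    for name in ATTRIBUTES:
        ET.SubElement(dataset, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


if __name__ == "__main__":
    print(build_homology_query("EGFR"))
```

Leaving the transcript ID out of the attribute list mirrors the step where I turned off separate results for each Ensembl transcript, so each row describes the gene and its homologues only.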
There are tutorials here to help you with BioMart, the API and other features of Ensembl. Thank you for watching and please contact us if you have any queries about this or other features of Ensembl.
