Hello everyone and welcome to today’s webinar on submitting BioSamples to the NCBI database. My name is Bonnie Maidak and I will be introducing our presenter today and also going through some of the logistics for the webinar. Our presenter is John Anderson. He has been at NCBI for the last 15 years and has been working on the BioSample database for the last three to four years. The slides are available at the FTP site directory that is listed there, or there is a shorter link, which is 1.usa.gov / 1RBERXx. All of the contents, including a video recording, will be available after the webinar via a link on our webinars page, which is shown there on the screen. Any time you see a bracket, ncbi bracket, you need to replace it with the full web URL as shown in the comments on the screen. Now I am going to introduce John again, and he will talk about submitting BioSamples to NCBI. Good afternoon everybody. In the last few years, new technologies have made it easier and easier to produce large amounts of sequence data from biological materials. One of the challenges for NCBI has been to find a way to organize these data that is accessible and useful to researchers. We have created a database to collect and store the information about the biological specimens used to create these data. It’s called the BioSample database and it is the repository for biological source information for the many primary archives at NCBI. Today I would like to introduce you to the BioSample database and explain some of the concepts, and then I will take you through a demonstration of a couple of actual submissions. Let’s begin by considering a simple hypothetical example. Imagine I have a grant to study the effects of arsenopyrite as an environmental contaminant. We know that arsenopyrite affects metabolic pathways in bacteria, so I want to characterize the effects of different concentrations on the expression of specific genes in a particular species of bacterium.
In my mind I understand the organization of my study and I can describe all its components. I know what the project is about and I know what my experiments were and I have the sequence results. The project is a transcriptome project, a study of expression under specific arsenic conditions and my method is RNA sequencing. I have a list of samples with the information about the arsenic treatments since that is central to the project. And I have recorded other information that might be relevant or useful as well. When I try to submit my sequences to NCBI they tell me I need to submit a BioProject for the project and BioSamples for the sample metadata for the source material. What are those? In NCBI terminology the BioProject is a record that provides a container for a formalized version of my project description. It has its own rules and its own database and submission of a BioProject is fairly simple, but we won’t go into that today. The data, of course means the sequence data, and these have their own archives and submission procedures. The BioSample is a record that stores the information about the samples. This slide illustrates how NCBI organizes the records for this information. Under the BioProject, you have the BioSamples and then the data that is derived from those. So what is a BioSample? The sample is the actual biological material that the data come from. The BioSample is the data file that contains the metadata for your sample. Sample metadata is analogous to metadata for a digital photo file. When you take a picture with your digital camera it records the image, that is the data, but it also records other information like the date the picture was taken and various camera settings. That’s called the photo metadata. Similarly, when you generate sequence data there’s information about the sample that is useful or critical to have. Why not store the sample information directly in the data file like you do with a digital photo? 
The problem with that is the sample can be used to generate different kinds of data and those data files go to many different databases. If we record the sample metadata on each data file it becomes very hard to maintain. Different databases may collect different information that is hard to reconcile, and any updates to the information would have to be made to every instance of the record. Having a master sample record that all data records point to simplifies all of this. NCBI requires a BioSample record for sequence data submitted to our major sequence archives: SRA, TSA and WGS. GenBank and GEO also use BioSample information, but they have a separate method for collecting information. Here we see a typical BioSample record as it appears in the public pages at NCBI. It has a sample title that is descriptive, and then a set of identifiers. The main identifier is this number that begins with SAMN, the BioSample accession number. It is the permanent, stable identifier and it is how you should always refer to your BioSample records. There is also a sample name; that is a name chosen by the submitter and is required to be unique among the samples submitted by one person. This particular record also has an SRS number because it has data in the SRA database. Below that we see the organism name. All BioSample records are required to have a valid scientific name from the NCBI taxonomy database. Then there is package; we will talk about that later. And if you look down lower you will see links to the BioProject that the sample is associated with and also a link that can retrieve all samples from this project. Below that is the information about the submitter and the data submission. And up in the right corner are links to the BioProject, the SRA data and the taxonomy information. The main thing I want to show on this page is the attributes. This is the metadata, the information that is being stored about the sample.
We see here several things like the name of the breed of the mouse, the age of the individual mouse, who provided the biomaterial, the sex of the mouse, the tissue that was used and so on. All of these are predefined attributes. Having predefined names for the attributes lets us standardize the information we get and makes it possible to aggregate samples with similar characteristics. We do want the attributes to be flexible, so we allow custom attribute names as well. That will let you enter any attributes you need to comprehensively describe your samples. The previous slide showed what a record looks like, but what do we consider a sample? A BioSample should hold information that is biologically meaningful and basically can be anything that is relevant to your study. I will illustrate this with examples in the next few slides. Different individuals are obviously different samples. Here we see an example set of samples that are all the same species of mouse but collected at different locations. Clearly each of these would need a separate BioSample record. Different tissues from the same individual are also different samples. For a genomic study the tissue might not matter, but for a transcriptome or epigenetic study identifying differences between tissues may be the whole point of the study. If the difference is relevant for your study you need to record it. For a single individual sampled at different time points, each time point can be considered a different sample. Here we have a single mouse sampled at four different ages. Since these are distinct in ways that are relevant to the study, the information should be recorded as separate BioSamples. The same organism treated with different conditions would be different samples. Two cultures of the same bacterial strain, but grown under different conditions, would need separate BioSample records.
On the other hand, multiple plates of the same clone not handled in any way that would affect your study would not be considered to be different samples. That is, you don’t need to submit a new BioSample record every time you use this strain for an experiment. A sample used to generate different data types would still just be a single sample. If you use the same sample to generate transcriptome and genomic sequences, there’s no reason to register two BioSample records. The data files can both point to the same BioSample record as the source material. A sample used to generate repeats of the same data is still just a single sample. Next generation sequencing platforms deliver hundreds of files for a single sample but you don’t need a BioSample record for each. Similarly, if you repeat the experiment, sequencing on a different platform, those may be different experiments but there is still only one BioSample record needed. So far we have talked about organismal samples. There is a special class of biological samples, environmental or clinical samples where you expect to find multiple organisms. We refer to these as metagenomes. The term metagenome is partially historical since the first instances of this sample type were for genomic sequencing. It now includes any samples of this type regardless of the type of data that would be generated. Even though a sample may contain thousands of organisms of many species, it is still a single physical sample. The information in the BioSample record refers to that sample, not the individual organisms. Metagenomes are defined by location, substance sampled, time of sampling, that sort of thing. On the next slide, I have an example metagenome. Here we have a hypothetical set of samples that are water samples taken from different places in the same river. The goal of this study might be to look for differences in bacterial species composition in different locations.
If we look at the data here, you see they were all collected on the same date, and the difference between samples is the location. Arrows on the slide show the actual locations that are described in the geographic location information and also in the sample name: creek_mouth, further up the creek, the shoreline, and the deep water of the river. The latitude and longitude coordinates give the precise location of each of the samples. And here there would be additional information that might be important for interpreting the results of the study. However, metagenome results are not separate BioSamples. A single metagenome sample can produce hundreds of separate sequence files and identify many sequences but it is still just one sample. Usually the sequences that are generated from a metagenomic study are short reads and are submitted to SRA. Organisms are identified by sequence homology to other sequences. Often identifying the organisms present in the sample is the goal of the study. In these cases you do not need a separate BioSample for each identified organism. That information goes with the sequence record. The sequence files derived from the single physical environmental sample would all refer to the same BioSample record as the source material. There is an exception to this. If you align enough short reads from a single taxon in your metagenome sample, you may be able to assemble a complete genome for that organism. If you submit that assembly to NCBI, the rules for NCBI genome submission require that you identify the species and register a separate BioSample record for the uncultured organism. A note is made on the record that it was derived from the original metagenome sample. In an effort to improve the quality of the information we collect for BioSample, we have organized samples into categories based on what type of data we need to collect for each type of sample. Metagenomes are one example of a sample type.
The goal is to improve the quality of the information gathered. We use the term sample type synonymously with package, and actually package is the preferred term and I will use that from now on. Here we see a screenshot of the descriptions of the sample type packages that you see in the submission portal, and I will show you the slides later on. These are mostly based on organism types and are designed to make it possible to ask for the attributes that are most appropriate for that type of sample. When you’re submitting to BioSample, you should just read the descriptions and choose the package that is most appropriate for your organism. I will have some more examples of this in a few slides. First I want to point out that some packages have subcategories. The pathogen package, for example, which is for organisms that are important for public health, is divided into clinical and environmental sample types, and we ask for slightly different information for those. If you look at the bottom of the page, the Genomic Standards Consortium has a set of sample types that are similar to NCBI packages, and we mirror those. These subtypes overlap with the NCBI standards (the packages) and have stricter requirements in most cases. If you want to follow the GSC standards and have your records marked MIxS compliant then you should choose one of these packages. You can get a set of definitions for each of the packages by choosing one on the page I showed you before and then clicking on the definitions button in the lower left corner. That will pop up a window that gives definitions for that specific template. Here’s a closer view of those two that were on the previous slide. I want to point out this is for the pathogen clinical type. The notation of one asterisk indicates an attribute that is mandatory, and two asterisks indicate that you have a choice between two. I’ll explain that better in just a second. First of all, sample name is always required.
I’ve mentioned sample name before, and organism is always required as well. For a clinical pathogen, you need to choose either a strain name or an isolate name but not both. Below that you see several attributes with single asterisks, and those are ones that have been chosen by the pathogen people as being relevant and necessary for a good pathogen record. And below that, there is a long list of predefined attributes, which I mentioned earlier. Here is a different example; this is a plant package. It has a different set of requirements that are appropriate for plants. Sample name and organism are required, but now you see that we offer ecotype and cultivar, which are terms used in botany, or isolate. You need to provide a name for one of those three for your sample. Similarly, age and developmental stage are options; one or the other may be appropriate depending on the lifecycle of your plant. There are just a couple of other required attributes for this package, geographic location and tissue type, and below that again you see the long list of optional predefined attributes. We will pause for questions. If you do have a question, please type it in the question pod. You can do that as we go through the webinar. We will answer those that we can during the webinar while John is presenting information. Or we might have some questions that we will have John answer at this time. This is Peter Cooper. We have one question, and the question is: are these forms available as a checklist in order to plan what metadata to capture before collecting the sample? You mean the required attributes? Yes. Yes, you can download that list I was showing on one of the earlier slides, and you can also download a template file, which I will be showing later on, that shows all that information about what is required and what is available. Another question is whether BioSamples are required for dbGaP submissions. Yes they are, but that is handled through dbGaP.
Human samples have special considerations because of privacy concerns, so those are handled completely by dbGaP, and they handle submissions for those. This will be the last question we will take at this point in the webinar. On one of the first slides the BioSample included as a property, RNA sequencing. Is that not a property of the experiment? In general, yes. That is actually a good question because the BioSample is meant to describe the sample, not the experiment that was conducted on the sample. So when you submit one you don’t need to specify the experimental type like RNA sequencing, and in fact we prefer you don’t, because the same sample could be used for RNA sequencing at one time and DNA sequencing at another time. So if I had it in a slide that was a mistake. I think this slide was trying to just describe the study itself, and not necessarily a characteristic of the BioSample record. I see. That’s right where I said my method is RNA sequencing. That information is a property of the project. I didn’t talk about how to submit BioProjects, but one of the things you do is choose the method and the scope of the BioProject, and that information goes there. At this point I want to switch to a live demonstration of submitting a BioSample to NCBI. I have a link here to the submission portal and you can see the URL. This is where you need to go to submit your BioSamples. You need to have already created an NCBI username, and then you need to log in. Once you’re logged in, you will see your username at the top. This is the landing page for the entire submission portal. You can see there are links to several different databases and to the submissions for those, but we want BioSample so I’ll click on that. This is your personal submission list, a list of the submissions that I have submitted under this username. You create a new submission by clicking on new submission at the top.
That launches the submission wizard and creates the new submission. You see the submission ID at the top. The submission ID is a temporary ID and only applies to the submission session itself. You should not use it in any publications. You should use it to communicate with us about your submission. The submission wizard software takes you through a series of tasks to collect your information. Each one needs to be completed before proceeding to the next. We start on the submitter tab. That contains the submitter information and it’s auto filled from your profile. You can also make corrections or edits here. Once you are satisfied that everything is right on any of the submission tabs, you click continue to move to the next tab. The second page is the general info tab. There are two choices to make here. The first is when your records will be released. It can either be the preferred method, which is that immediately after the processing is complete the record is made public, or, if you are not ready for your data to be made public yet, you can specify a release date in the future. The record will be held private until either the specified date is reached or the accession numbers are published or linked data are made public. The second choice you have to make is to specify whether you are submitting a single sample or a file containing multiple samples. We will do both but we will start with a single BioSample example. Click continue to move to the next tab, which is the sample type tab. This has all the choices that I showed you earlier. For this demonstration I will choose the model organism or animal sample type. That is valid for model organisms and most mammals. I choose one, click continue and move on to the next tab. That is the attributes tab. This is a web form with many fields and, as I showed you earlier, we have an indication of the required attributes and several predefined but optional attributes.
Also, notice if you mouse over these little question mark bubbles you get a definition for that attribute. That can help you decide what information to put in each one. For this demonstration I will fill in some information. I’ll fill in sample name. I wanted to demonstrate that for organism, when you start typing a name, a query is sent to the taxonomy database and it begins to auto fill based on what you have typed so far. If I type canis l, I find Canis latrans, the coyote, and that is the one I want to use for this demonstration. In the next section, at least one of these is required: isolate, breed, cultivar, or ecotype. They are fairly general at this point. For this organism, I will just put wildtype for the breed. We need to put an age or developmental stage. Note that age doesn’t have units. That gives you the freedom to put in whatever units are appropriate for your organism, whether it be two hours, two days, two years. In this case I will say two years. You are not required to put units at all. In fact we could just put two, because it’s not validated at all. Some of the attributes are validated. Sex is validated and there is a specified list of acceptable values that you can enter here: male, female, pooled, neuter, hermaphrodite, not determined. If you don’t have the information for this or any of the required attributes you can put missing, not applicable or not collected. We understand you don’t always have all the information that we would like to collect, so the option is there to leave it out if you don’t have it. I’ll choose male, then for tissue, just to demonstrate, I will put missing. And that is acceptable. Then there is the long list of attributes. You should scroll through this and find the information that you want to add. I would like to put in a value for latitude and longitude because that is an interesting property. This one is linked to a map that shows the location that you type in.
That makes it easier to double check yourself if you have made a mistake. If I put East instead of West, I’m on a different continent. I intend this to be North America so that’s a correct value there. The last thing I wanted to show is custom attributes. If you look through all the predefined attributes and don’t find what you want, you can enter something here. If I had ear tag information for this animal, I could enter that here and just put the number. Once I’m satisfied that everything is exactly the way I want it to be, I click continue at the bottom of the page and move on to the next part. The next tab is the BioProject tab. If you have already submitted a BioProject you can enter the accession number here. It is not required at this point; the link can be created later, or created automatically when data point to the same BioProject and BioSample. We will leave this blank for now and we can just move right along. This tab is the comments tab. There’s a title automatically generated based on the package that you have chosen and the organism name, but you can also enter a custom title at this point. There’s a text box for a public description, which will appear on the public record, and also a text box for private comments to NCBI; only the curators will see those and they will not be included in the final record. Finally, when you have gone through all the tabs you come to an overview page that shows you everything that you have entered: the submission release date, the submitter information, the organism, the package, and all the attributes that I entered. This is your chance to double check them and make sure everything is the way you want it, and when it is, you click the submit button. When you click the submit button the record leaves the submission portal and goes into the BioSample database. A set of validations is done and if there are any errors, the record is flagged for curator review.
Otherwise it will process automatically and be released in just a few minutes. Now we see that this submission that I was working on is listed as awaiting processing. If I click on the submission ID, it opens up the overview page, which indicates it is awaiting processing. After processing, this will change to successfully completed. I don’t think it will finish in a timely manner right now; it usually takes only a couple of minutes. But that’s the only change to the page. Once the submission is processed you will be sent an e-mail that gives you the accession number and shows you that your submission was successfully processed, and there is a URL that will take you to the record in the public pages once it becomes public. That link won’t work until the record is released. And that is how to submit a single BioSample. Now I will go back to the BioSample submission page, start a new submission, and this time submit a batch of BioSamples. The submitter page of course is the same and we’ll move right through that. I will choose a future date, and this time I choose batch submission. The next couple of tabs are the same. I will choose the microbe package this time and hit Continue. Now we’re at the attributes page. Before, this was a web form where you filled in all the different values. But now for a batch you are offered the opportunity to upload a file with your sample data. The templates for those files can be downloaded by clicking on these links right here, and I will download the Excel format. This is what the template file looks like in Excel format. The first several rows are instructions. The submission template indicates that this one is for the microbe sample, which is what I had chosen. Fields that are highlighted in green are required, fields in blue are those where you have to choose one of several options to fill in, and yellow fields are optional.
And if you mouse over any of the column headers you get a brief description of them. Of the attributes. So you would simply enter sample names and all the other information. And here I have a file that I have filled in previously. Then you can see that there is a unique sample name for each one, culture_a1, a2, etc. I’ve entered organism name, each has a unique isolate name, the isolation source is bench swab for all of them, collection date for all is 11 May 2015. Geographic location, which is a country name, and sample type is required and those are all filled in. The next step once you have done this is to save it of course. When you upload to the webpage you cannot upload an Excel file. You have to save it as a tab delimited text file. Here’s what this looks like in Excel on a Windows machine. Some other platforms may be different. But if you look in the pull down menu for type for Excel, here is text tab delimited. That might be hard to see. Choose that type. And then save. And then confirm we want that type. And now I’ve saved the text version of this file that I will use to upload. To upload first you find the file by using Browse. And here is the file I saved. My upload file. I open this file, that uploads it to the website. But it’s not been processed yet. I wanted to point out that if you find you have uploaded the wrong file or you decide to make a change at some point you can click the delete button here and it removes that file and you are free to upload again. So once you have uploaded your file, click continue and it’s processed and you go on to the rest of the submission. The comments page and the overview page. Now you don’t see a list of your attributes because it would be too difficult to show the entire table. You just have your file and you can look at that file. And when you are ready, click submit and it is processed just the same way as the single sample process. At this point I want to go back to the slideshow. 
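For readers who would rather build the upload file with a script than save it from Excel, here is a minimal sketch in Python of writing a tab-delimited text file like the one described above. The column headers, the organism, the isolate names, the country, the sample type value, and the ISO-style date are all assumptions for illustration; the real headers and accepted values should be copied from the template downloaded in the submission portal.

```python
import csv
import io

# Column names are assumptions for illustration only; copy the real
# headers from the template downloaded in the submission portal.
COLUMNS = ["sample_name", "organism", "isolate", "isolation_source",
           "collection_date", "geo_loc_name", "sample_type"]


def build_rows(n):
    """Return n sample rows, each with a unique sample name and isolate."""
    rows = []
    for i in range(1, n + 1):
        rows.append({
            "sample_name": f"culture_a{i}",     # unique per sample
            "organism": "Escherichia coli",     # hypothetical organism
            "isolate": f"isolate_{i}",          # hypothetical isolate names
            "isolation_source": "bench swab",
            "collection_date": "2015-05-11",    # unambiguous date (assumed format)
            "geo_loc_name": "USA",              # hypothetical country
            "sample_type": "cell culture",      # hypothetical value
        })
    return rows


def to_tsv(rows):
    """Serialize the rows as tab-delimited text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Print the table; redirect the output to a .txt file to upload it.
    print(to_tsv(build_rows(6)), end="")
```

Writing the table directly as text avoids the Excel "save as" step, but the result still has to pass the same checks in the portal as any other uploaded file.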
This is what the overview page looks like once the submission has been successfully processed. That is what is indicated here, and instead of a single accession number you now see a list of six objects. Those are the six samples and six accession numbers. You get an e-mail that shows that same set of accession numbers and also links to the BioSamples. I forgot to point out earlier that there is a BioSample objects file attached to the e-mail and that has all this information in a text file. For a batch submission that is beyond a certain size, you might have a batch of hundreds of thousands of samples. That would be too many to display so all the information is in the BioSample object attachment. That’s how to do a single and a batch submission, and now I will take a few minutes and show some common error messages that you might see. Just a few examples here. In this example I have changed the file by adding an organism name that is nonsense; it is not a real organism and it’s not in the BioSample [note: meant to say Taxonomy] database. If I save this file, upload it and do the processing, the validator checks the file and issues this warning. It says organism not found and gives the value of the organism name that caused the problem. This is a warning rather than an error message. A warning means that you can proceed; you don’t have to correct this possible problem. We make this a warning because we want to allow people to submit new species or unknown species or species that just haven’t been entered in the taxonomy database yet. This is a chance to check the spelling. If it’s a simple misspelling you can correct that. But if you have it right then you can continue, and the submission can be completed; it will be reviewed by curators, taxonomy will be assigned, and your sample will be released with the new name. This is another type of common error that we see: bad format. Date is one that is commonly entered wrong.
Here I’ve entered the date in the standard American-style of month, day, and year. 5/11/2015 for three of the samples and if you look at the last three it is 5/22/2015. If I upload that file and submit it, I get two types of messages. The first is an error. An error message is for a mistake that you have to correct before you can proceed. You have to go in and correct mistakes. The problem here is that this date is ambiguous. You can read that as month / day so it is May 11 or read it as day / month so that it would be November 5. We can’t guess what you meant by that so you have to go back in your file and correct that. The other set, the 5/22/2015 are unambiguous because there is not a month 22 so it’s obviously the 5 refers to May and 22 refers to the day. So the wizard can suggest an auto correction. When you see this, that means that the format will be automatically changed and doesn’t require any further action on your part and for warnings you can click continue but for errors you cannot. Other places that we commonly see mistakes are in lat lon values and geo loc name values. So those are things to watch out for as well. One more error type I want to point out. Doesn’t apply to single submissions but only batch submissions. And that is the no unique information error. If you look at this file, I’ve edited it so that everything except for the sample name is the same across all the samples. If you try to upload this file you will get this error message and this one tends to give users a lot of trouble. This error message indicates that the problem with the submission is that the information in the rows is identical. We implemented this check to try to encourage submitters to include distinguishing metadata in their samples. The information in the sample name and the description and the sample title are not considered because the free text is not part of the controlled vocabulary. 
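The two batch checks described above (the ambiguous date and the "no unique information" error) can be tried locally before uploading. The following Python sketch is not NCBI's validator; it only mirrors the reasoning as explained in the talk. The function names, the ignored field names (`sample_name`, `sample_title`, `description`), the ISO output format, and the handling of dates where day and month are equal are all assumptions.

```python
import re
from datetime import date


def classify_date(value):
    """Classify an N/N/YYYY date as 'error', 'autocorrect' or 'invalid'.

    'error' means day and month cannot be told apart (for example 5/11/2015).
    'autocorrect' means only one reading is possible (for example 5/22/2015).
    """
    match = re.fullmatch(r"(\d{1,2})/(\d{1,2})/(\d{4})", value)
    if not match:
        return ("invalid", None)
    a, b, year = (int(g) for g in match.groups())
    if a <= 12 and b <= 12 and a != b:
        return ("error", None)          # month/day or day/month: ambiguous
    if a <= 12:
        month, day = a, b               # second number must be the day
    elif b <= 12:
        month, day = b, a               # first number must be the day
    else:
        return ("invalid", None)        # neither number can be a month
    try:
        fixed = date(year, month, day)  # rejects values like 2/30 or 0/5
    except ValueError:
        return ("invalid", None)
    return ("autocorrect", fixed.isoformat())


def rows_without_unique_info(rows,
                             ignored=("sample_name", "sample_title",
                                      "description")):
    """Return sample names of rows identical to another row once the
    free-text fields in `ignored` are left out of the comparison."""
    groups = {}
    for row in rows:
        key = tuple(sorted((k, v) for k, v in row.items()
                           if k not in ignored))
        groups.setdefault(key, []).append(row["sample_name"])
    return [name for names in groups.values() if len(names) > 1
            for name in names]


if __name__ == "__main__":
    print(classify_date("5/11/2015"))
    print(classify_date("5/22/2015"))
```

As in the wizard's message, the sample names returned by the second function are not the problem themselves; they only point to the rows that need distinguishing attributes.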
The sample names are listed in the error message not because they are wrong but rather just to indicate which rows have the error. If you see this message you should go back into your file and add distinguishing information. And so here is what we have done today. I’ve tried to give an introduction to the BioSample database and explain the terms and the concepts. I’ve shown the submission of a single record and what attributes are, the submission of a batch record and templates, and talked about error messages. Before we take more questions, I want to point out some additional resources. NCBI has a YouTube channel; the link is there. You can also get more documentation and help by going to the BioSample help documents. And we have NCBI Factsheets that can be useful for related material. There is a BioProject Factsheet. The NCBI helpdesk address is [email protected]. If you have general questions about NCBI resources, you can write a message there. If you have a comment or question about webinars, you can use the [email protected] address. And if you need help with your BioSample data submission then please write to [email protected]. We do have a few more questions and an earlier question that we want to clarify. So we will ask those questions now. One of the questions is whether you can reuse a BioSample record to create additional BioSample records, for example if only the time or the treatment changes, and whether you can go back and edit the Excel file after it has been submitted. So I will rephrase this question: you have done a BioSample submission and you are going to do another BioSample submission. The data are basically the same as in the earlier study, perhaps from a month ago, and you want to reuse that BioSample record, to copy or edit it as a template for the new data submission. In the submission portal you can create a copy of a submission and then use that as a new submission. You will have most of the information.
You have to keep in mind that every sample needs a unique sample name, so you have to provide a new sample name along with the new information. And the last part of the question was whether you could edit the Excel file. That brings up an important point. Once a submission is completed and a BioSample has moved into the BioSample database, all edits to the existing BioSamples have to be done by our curators. We don’t have any provision for users to edit their records yet. So you send a message to BioSample help to get your records edited. That sounds good, John, thank you. The answer that you just provided addresses this other question, which is whether there is a way to update or modify released BioSamples. There is, but it has to be done by NCBI staff. You need to write to BioSample help to tell us what you need to have changed. A new question is whether there is a way to submit BioSamples using scripts. Is batch submission through an Excel spreadsheet, or actually the tab-delimited text files, the only way? What I was talking about today is using the submission portal to do your submissions. There is also a UI-less route to submit, where you have to create XML files and send those to us. That is usually set up for large-scale users, and you would have to write to BioSample help. I don’t usually handle those, so I would have to put you in touch with the people who do. But it can be done. There is a question that came earlier that I want to have John clarify. I’m going to show him the slide to verify that it is the one that he and I both agree the question relates to, and then we will go back in the PowerPoint file to that slide. Give us just a moment. Right, I think this is the one that caused the confusion earlier. The second bullet point there is Method, RNA sequencing. I should have made that clearer. The first bullet point is talking about the description of the project.
And the method is included in the project description. The third bullet point, Samples, is referring to what will go in the BioSample. Method does not go into the BioSample; it goes with the BioProject. To the person who asked that question: we may need to have you write to the BioSample help address, or the webinars address would be better, to clarify that. We think we have addressed your question, but we’re getting the sense you still have a remaining question. That will conclude today’s webinar. Thank you very much for attending and again, if you have any questions about a BioSample submission, write to [email protected]. If you have questions about webinars, write to [email protected], and if you have other questions or comments about NCBI resources, please write to [email protected]. The submission portal URL is listed on this screen, and that is where you will start when you do a BioSample submission.