Updating Information on GenBank Genome Records
You can update your existing GenBank prokaryotic and eukaryotic genome records at any time using the file types described below. If you are updating multiple records, please list all accessions to be updated at the top of your request. Save the update files as plain text and email them to us at genomes@ncbi.nlm.nih.gov.
You can also request a change in the release date of your genomes, depending on their status. Send an email request with the genome accessions to us at genomes@ncbi.nlm.nih.gov.
If you submitted to our collaborators at ENA or DDBJ, please see their instructions for update formats.
See these instructions for updating GenBank records that are not prokaryotic or eukaryotic genomes.
See SRA for information about updating SRA submissions.
Update Formats
- Changing the release date
- Updating Publication Information
- Source Information
- A new assembly of the genome (sequence update)
- Changing text or adding new qualifiers for existing features
- Adding or removing a few features
- Re-annotating existing records (adding or removing or moving many features)
Changing the release date
Your genome will be released on the day that you selected during the submission or when its accession or description is publicly available, whichever is first. You can request a change in the release date, depending on the status. If needed, send an email request with the genome accessions to us at genomes@ncbi.nlm.nih.gov.
If you are requesting the release of a genome because a manuscript has been accepted for publication or is published, please also provide the publication information so that we can include that on the genome.
Note that release of the genome will automatically trigger the release of its BioProject and BioSample. However, the reverse is not true; the release of a BioProject or BioSample will not automatically trigger the release of associated data.
Updating Publication Information
[a] If the PMID or DOI is publicly available, please send the information as a tab-delimited table as follows:
acc. num. PMID
CP002501 29980901
JAARTP00000000 29985341
or
acc. num. DOI
CP002501 10.1000/xyz456
JAARTP00000000 https://doi.org/10.1000/xyz123doi
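A table like the ones above can be typed by hand, but for many records it may be easier to generate it. The following is a minimal Python sketch, not an NCBI tool; the function name `publication_table` is hypothetical, and the header text simply mirrors the examples above.

```python
import csv
import io


def publication_table(rows, id_type="PMID"):
    """Return accession / PMID-or-DOI pairs as tab-delimited plain text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["acc. num.", id_type])  # header row, as in the examples above
    for accession, identifier in rows:
        writer.writerow([accession, identifier])
    return buf.getvalue()


table = publication_table([
    ("CP002501", "29980901"),
    ("JAARTP00000000", "29985341"),
])
print(table, end="")
```

The resulting text can be saved as a plain-text file and attached to the email request.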
[b] For all other updates, please provide the revised information in a tab-delimited table. You must replace any non-ASCII characters (for example, characters with accents and umlauts) with the appropriate English letters.
The complete list of revised author names should be provided in the format first_initial middle_initial surname, with multiple authors separated by commas. For example:
acc. num. authors title
ARTP00000000 J. A. Smith Analysis of the ABCD genome
CP002341 X. P. Weng, J. Doe Comparison of gut genomes
If only the author list is changing, the title column can be omitted. For example:
acc. num. authors
ARTP00000000 J. A. Smith
CP002341 X. P. Weng, J. Doe
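The requirement to replace non-ASCII characters can be checked before sending the table. The sketch below is one possible approach, assuming that simply dropping accents and umlauts is acceptable (for example, "ü" becomes "u" rather than "ue"); the function name `fold_to_ascii` is hypothetical. Characters that have no accent-free decomposition are reported rather than guessed, so that you can choose the replacement yourself.

```python
import unicodedata


def fold_to_ascii(text):
    """Strip accents and umlauts; return the folded text and any characters left non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks that NFKD separates from their base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Letters such as "Ø" have no decomposition and need a manual replacement.
    leftovers = sorted({ch for ch in stripped if ord(ch) > 127})
    return stripped, leftovers


folded, leftovers = fold_to_ascii("J. Müller, A. Núñez")
print(folded)
print(fold_to_ascii("S. Ø. Berg")[1])
```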
Source Information
Send updates to the source information (e.g., strain, cultivar, geo_loc_name, specimen_voucher) in a multi-column tab-delimited table, and we will update the genome and its BioSample. An example table is:
acc. num. strain geo_loc_name
XXXX00000000 82 USA
XXXY00000000 ABC Canada
A new assembly of the genome (sequence update)
If you have updated the sequence and the chromosomes of the genome are still in multiple pieces, then create a new genome file as you did before, following the instructions. Submit a new genome submission in the Genome Submission Portal and select 'yes' that it is an update to an existing genome at the prompt. Include the WGS accession XXXX00000000 in the box. Be sure to provide the BioProject and BioSample IDs from the original genome. Choose option 2 (wgs genomes) on the Files tab and finish the submission. The sequences will be assigned new accession numbers and the master accession will increment to the next version, e.g., XXXX01000000 would update to XXXX02000000. See more information about WGS accession numbers.
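To illustrate how the master accession increments, here is a small Python sketch. It assumes that a WGS master accession consists of a 4- or 6-letter prefix, a two-digit version, and a run of zeros, as in the examples in this document; the function name `next_wgs_master` is hypothetical, and the new accession is always assigned by GenBank, not by the submitter.

```python
import re

# Assumed layout: prefix (4 or 6 letters) + two-digit version + zeros.
WGS_MASTER = re.compile(r"^([A-Z]{4}|[A-Z]{6})(\d{2})(0+)$")


def next_wgs_master(accession):
    """Return the master accession expected after one sequence update."""
    match = WGS_MASTER.match(accession)
    if match is None:
        raise ValueError(f"not a WGS master accession: {accession!r}")
    prefix, version, zeros = match.groups()
    new_version = int(version) + 1
    if new_version > 99:
        raise ValueError("version does not fit in two digits")
    return f"{prefix}{new_version:02d}{zeros}"


print(next_wgs_master("XXXX01000000"))
```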
If the chromosomes of the genome are now each in a single sequence, then create a new genome file as per the instructions. Submit a new genome submission in the Genome Submission Portal and select 'yes' that it is an update to an existing genome at the prompt. Include the previous genome accession (e.g., CP000001 or XXXX00000000) in the box. Be sure to provide the BioProject and BioSample IDs from the original genome. Choose option 1 on the Files tab and finish the submission.
If you are including annotation, then see the prokaryotic or eukaryotic annotation instructions.
Changing text or adding new qualifiers for existing features
Use a tab-delimited table for simple updates to existing features (e.g., changing product names or adding EC_numbers to existing CDS features). The first row in the table would be the headers, with subsequent rows for each qualifier being modified or added. The first column is the accession or contig name, the second is locus_tag, and subsequent columns are the qualifiers being changed. For example:
Accession Locus_tag gene_name CDS_product CDS note gene note
XXXX01000001 Abc_xxxx lacZ beta-galactosidase present in multiple copies
XXXX01000010 Abc_xxxy helicase required for replication
Also indicate whether blank cells mean 'delete what is present' or 'no change'. A blank cell in the CDS_product can never mean 'delete' since that is a required field. You only need to include the features that are changing. New product names will need to follow the protein naming conventions; see the prokaryotic and eukaryotic annotation guidelines.
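The blank-cell rule above can be checked before the table is sent. This is a minimal sketch, assuming the table has already been read into a header list and a list of rows; the function name `check_qualifier_table`, the third row, and the placement of the note text in particular columns are illustrative assumptions only.

```python
def check_qualifier_table(header, rows, blank_means_delete):
    """Return a list of problems found in a qualifier-update table."""
    problems = []
    product_col = header.index("CDS_product")
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} columns, found {len(row)}"
            )
        elif blank_means_delete and not row[product_col].strip():
            # CDS_product is required, so a blank cell cannot mean 'delete'.
            problems.append(
                f"line {line_no}: CDS_product is required and cannot be deleted"
            )
    return problems


header = ["Accession", "Locus_tag", "gene_name", "CDS_product", "CDS note", "gene note"]
rows = [
    # Column placement of the notes is a guess made for illustration.
    ["XXXX01000001", "Abc_xxxx", "lacZ", "beta-galactosidase", "", "present in multiple copies"],
    ["XXXX01000010", "Abc_xxxy", "", "helicase", "required for replication", ""],
    ["XXXX01000011", "Abc_xxxz", "", "", "hypothetical row with a blank product", ""],
]
print(check_qualifier_table(header, rows, blank_means_delete=True))
```

With `blank_means_delete=False` the same table passes, because a blank CDS_product then means 'no change'.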
NOTE: You cannot add new features (e.g., a new CDS or gene), remove features, or change feature locations this way. If you want to make those changes, then see the instructions below.
Adding or removing a few features
If you are only adding a few new features, then you could send a small 5-column Feature Table .tbl file that has only the new features. However, if there are many changes, then follow the instructions below for re-annotating existing records. For more information about this table format see the prokaryotic or eukaryotic annotation instructions.
If you are only removing a few features, send us a list of the locus_tags for the features that need to be removed.
We will let you know if we find any issues when the update file is processed, e.g., if a CDS overlaps an rRNA feature. Email the files to genomes@ncbi.nlm.nih.gov and include the WGS accession in the request.
Re-annotating existing records (adding or removing or moving many features)
If the genome has been released with annotation and you want to update the annotation (but not change the sequences), create a new annotated submission as you did originally, and submit the update via the Genome Submission Portal. We will replace all of the existing annotation with the annotation in the new file.
The fasta header in the update must include the contig identifier (SeqID) used in the original submission and the accession numbers. The correct format for the identifiers of a WGS genome in such an update is:
gnl|WGS:XXXX|SeqID|gb|XXXX01xxxxxx
gnl|WGS:XXXXXX|SeqID|gb|XXXXXX01xxxxxx
where XXXX (or XXXXXX) is the WGS accession prefix and XXXX01xxxxxx (or XXXXXX01xxxxxx) is the contig accession number. A file containing the contig identifiers and accession numbers was posted to the Submission Portal when the genome was released. The file is named xxxx_accs, where xxxx is the WGS accession prefix. Let us know if you need a copy of this file.
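The identifier format above can be assembled from the contig identifier and accession number pairs. The following Python sketch assumes that the accs file has two tab-separated columns (contig identifier, then accession number) and that contig accessions begin with the 4- or 6-letter WGS prefix followed by digits; both are assumptions to verify against your own file. The function names and the ABCD accessions are hypothetical.

```python
import re

# Assumed layout: prefix (4 or 6 letters) + two-digit version + contig number.
CONTIG_ACCESSION = re.compile(r"^([A-Z]{4}|[A-Z]{6})\d{2}\d+$")


def update_fasta_id(seqid, accession):
    """Build the FASTA identifier for one contig in an annotation update."""
    match = CONTIG_ACCESSION.match(accession)
    if match is None:
        raise ValueError(f"unexpected contig accession: {accession!r}")
    prefix = match.group(1)
    return f"gnl|WGS:{prefix}|{seqid}|gb|{accession}"


def ids_from_accs(accs_text):
    """Map each contig SeqID to its update identifier (two-column input assumed)."""
    ids = {}
    for line in accs_text.splitlines():
        if not line.strip():
            continue
        seqid, accession = line.split("\t")[:2]
        ids[seqid] = update_fasta_id(seqid, accession)
    return ids


accs = "contig_1\tABCD01000001\ncontig_2\tABCD01000002\n"
for fasta_id in ids_from_accs(accs).values():
    print(">" + fasta_id)
```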
Please see the submission guide for instructions about how to generate a submission. In addition, if you are including annotation, be sure to read the prokaryotic or eukaryotic annotation guidelines.
In the standard situation for WGS genomes, annotation is not tracked from the previous version to the new version. The locus_tag prefix always remains the same; however, the locus_tags would need to be unique in the new annotation, both within the update and compared to the previous annotation. A simple way to ensure uniqueness is to use a different number of digits after the underscore in the locus_tag. For example, if the registered locus_tag prefix is ABC and the previous annotation has 4 digits after the underscore (ABC_0001), then the new annotation could have 5 digits (ABC_00001). Similarly, the protein_ids must be unique compared to the previous assembly. By using the locus_tag in the protein_id, this uniqueness could be maintained, for example:
gnl|WGS:XXXX|ABC_00001
where XXXX is the accession number prefix of the project.
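As an illustration of the digit-count approach, the sketch below generates locus_tags with a chosen number of digits, confirms they do not collide with the previous set, and derives protein_ids from them. The function names are hypothetical, and numbering from 1 is an assumption.

```python
def make_locus_tags(prefix, count, digits):
    """Return locus_tags such as ABC_00001, numbered from 1."""
    return [f"{prefix}_{n:0{digits}d}" for n in range(1, count + 1)]


def protein_id(wgs_prefix, locus_tag):
    """Build a protein_id from the WGS accession prefix and a locus_tag."""
    return f"gnl|WGS:{wgs_prefix}|{locus_tag}"


old_tags = set(make_locus_tags("ABC", 3, digits=4))  # previous annotation
new_tags = make_locus_tags("ABC", 3, digits=5)       # updated annotation

# Uniqueness check: no new locus_tag may repeat one from the previous annotation.
assert old_tags.isdisjoint(new_tags)

print(new_tags[0])
print(protein_id("XXXX", new_tags[0]))
```

The explicit `isdisjoint` check is worth keeping: if the previous annotation had 10,000 or more features, its 4-digit scheme would already have produced 5-digit tags that could clash with the new ones.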
Alternatively, you could track the annotation from the previous version to this update. Note that this is not required. Track both the locus_tags and protein_ids so that they are included when the gene/CDS is retained in the new annotation, even if the nucleotide location is modified slightly (e.g., the start codon is being extended upstream). To track the proteins, the protein_ids must have the format:
gnl|WGS:XXXX|SeqID|gb|accession_number
where XXXX is the accession number prefix of the project, SeqID is the protein SeqID (column 1 of the p2g file) and accession_number is the protein accession number (column 2 of the p2g file). You should have received a p2g file with the release letter for the genome. We can send this file again if you need it.
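The tracked protein_ids can be built directly from the p2g file. This sketch assumes the p2g file is tab-delimited, with the protein SeqID in column 1 and the protein accession number in column 2 as described above; the function name and the example accession are hypothetical.

```python
def tracked_protein_ids(p2g_text, wgs_prefix):
    """Map each protein SeqID to a protein_id that keeps its accession number."""
    ids = {}
    for line in p2g_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        seqid, accession = fields[0], fields[1]
        ids[seqid] = f"gnl|WGS:{wgs_prefix}|{seqid}|gb|{accession}"
    return ids


# Hypothetical p2g content: protein SeqID, then protein accession number.
p2g = "ABC_0001\tEFG00001\nABC_0002\tEFG00002\n"
for pid in tracked_protein_ids(p2g, "XXXX").values():
    print(pid)
```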
If you are adding a new protein, it would not have a protein accession number. You would need to use a new locus_tag that was not in the previous annotation and you would need to give the new protein a unique identifier (usually the same as the new locus_tag). For example, if you used ABC_6000 as the new locus_tag, you could use:
gnl|WGS:XXXX|ABC_6000
Please include a summary of the expected protein fates (new proteins, same proteins, changed proteins, removed proteins) so we will know what to expect.
If you are modifying an existing protein (for example, just moving the start codon), use the same locus_tag and protein_id that are in the previous annotation. The protein will also keep its protein accession number.
If you find that two adjacent proteins should be combined into a single protein and part of the translation stays the same, choose one set of locus_tag/protein_id/protein accession from the previous annotation to use for the new protein (preferably the one that had the similar translation) and remove the other identifiers. Alternatively, you could add the removed locus_tag to the /old_locus_tag qualifier and include a note explaining that two proteins were combined.
If you are completely changing a protein (for example, changing the reading frame) such that the new translation is completely different, then it is a new protein with a new locus_tag and a new protein_id, and it will be assigned a new accession upon release into GenBank.
If you remove a protein, do not reuse its locus_tag/protein_id/protein accession for a different protein. The identifiers are meant to represent a single unique feature and should not be moved to different proteins.
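The summary of expected protein fates requested above could be drafted by comparing the previous and updated annotations. The sketch below assumes each annotation is available as a mapping from protein_id to translated sequence. The function name and the working definitions are assumptions: "same" means an identical translation under the same protein_id, and "changed" means a different translation under the same protein_id.

```python
def protein_fates(old, new):
    """Classify protein_ids as new, same, changed, or removed."""
    fates = {"new": [], "same": [], "changed": [], "removed": []}
    for pid in sorted(old.keys() | new.keys()):
        if pid not in old:
            fates["new"].append(pid)
        elif pid not in new:
            fates["removed"].append(pid)
        elif old[pid] == new[pid]:
            fates["same"].append(pid)
        else:
            fates["changed"].append(pid)
    return fates


# Hypothetical identifiers and toy translations.
old = {"ABC_0001": "MKT", "ABC_0002": "MSL", "ABC_0003": "MVA"}
new = {"ABC_0001": "MKT", "ABC_0002": "MASL", "ABC_6000": "MQQ"}
for fate, ids in protein_fates(old, new).items():
    print(fate, len(ids), ", ".join(ids))
```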
Please contact us at genomes@ncbi.nlm.nih.gov if you have questions about generating the submission files, or about details of annotation.