Submission of annotation using a 5-column tab-delimited feature table
Introduction
The 5-column, tab-delimited feature table (referred to as a "feature table" in this document) format allows different kinds of features (e.g., gene, coding region) and qualifiers (e.g., /product, /note) to be included with sequence submissions. The valid features and qualifiers are restricted to those approved by the International Nucleotide Sequence Database Collaboration.
Please refer to the section that applies to your use case for more information.
- Feature tables for use in BankIt, table2asn, and Accession update requests
- Feature tables for use in the GenBank Submission Portal
When submitting an annotated prokaryotic or eukaryotic genome, please review the genome guidelines and appropriate annotation details for prokaryotes or eukaryotes.
Feature tables for use in BankIt, table2asn, and Accession update requests
Table Layout
The five-column, tab-delimited feature table specifies the location and type of each feature. The first line of the table contains the following basic information:
>Feature SeqId table_name
The sequence identifier (SeqId) must be the same as that used on the sequence. The table_name is optional. Subsequent lines of the table list the features. Each feature is on a separate line. Qualifiers describing that feature are on the line below. Columns are separated by tabs.
Line 1:
Column 1: Start location of feature
Column 2: Stop location of feature
Column 3: Feature key
Line 2:
Column 4: Qualifier key
Column 5: Qualifier value
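The layout above can be read with a few lines of Python. The following is a minimal sketch, not an NCBI tool: the names `Feature` and `parse_feature_table` are hypothetical, positions are kept as strings so that any "<" or ">" symbols survive, and bracketed directives such as [offset=2000] are skipped rather than applied. It assumes the table starts with a >Feature line.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    key: str                                        # column 3, e.g. "gene" or "CDS"
    intervals: list = field(default_factory=list)   # [(start, stop), ...] as written
    qualifiers: list = field(default_factory=list)  # [(key, value), ...]


def parse_feature_table(text):
    """Return {SeqId: [Feature, ...]} from five-column feature table text."""
    tables, features = {}, None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith(">Feature"):
            seq_id = line.split()[1]
            features = tables.setdefault(seq_id, [])
            continue
        if line.strip().startswith("["):
            continue  # directives such as [offset=2000] are ignored in this sketch
        cols = line.split("\t")
        cols += [""] * (5 - len(cols))      # pad short lines to five columns
        start, stop, key, qual_key, qual_val = cols[:5]
        if start and stop and key:          # first line of a feature
            features.append(Feature(key, [(start, stop)]))
        elif start and stop:                # additional interval of the same feature
            features[-1].intervals.append((start, stop))
        elif qual_key:                      # qualifier line (columns 4 and 5)
            features[-1].qualifiers.append((qual_key, qual_val))
    return tables


sample = (
    ">Feature Sc_16\n"
    "<1\t1009\tCDS\n"
    "\t\t\tproduct\tacid trehalase\n"
    "\t\t\tcodon_start\t2\n"
    "2626\t2590\ttRNA\n"
    "2570\t2535\n"
    "\t\t\tproduct\ttRNA-Phe\n"
)
tables = parse_feature_table(sample)
```

Here `tables["Sc_16"]` holds two features: a CDS with one interval and two qualifiers, and a tRNA with two intervals and one qualifier.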
Figure 1 shows a sample table and illustrates several points about the table format. The GenBank flatfile corresponding to this table is shown in Figure 2.
- Features that are on the complementary strand, such as the gene YPR027C and tRNA-Phe, are indicated by reversing the interval locations.
- Locations of partial (incomplete) features are indicated with a ">" or "<" in front of the nucleotide location. The “<” symbol always appears in column 1 and “>” always appears in column 2, regardless of the strandedness of the feature. In this example, the first gene, CDS, and mRNA all begin upstream of the start of the nucleotide sequence. The "<" symbol indicates that they are 5' partial features.
- For a CDS that is partial at its 5' end, the protein translates correctly only if the qualifier "codon_start" gives the position, counted within the CDS, of the first base of the first complete codon. This value is not the reading frame of the entire sequence. In the example, nucleotide 2 begins the first complete codon of the acid trehalase CDS, so codon_start is 2. The default codon_start is 1, so there is no need to specify it on complete CDSs, where translation always begins at the first nucleotide of the interval.
- If a feature contains multiple intervals, like the spliced tRNA-Phe or the Yip2p CDS, each interval is listed on a separate line by its start and stop position before subsequent qualifier lines.
- Gene features are always a single interval, and their location should cover the intervals of all the relevant features. For example, the gene YIP2 is as long as its mRNA, and is thus longer than its CDS.
- If the gene feature spans the intervals of the CDS or mRNA features for that gene, there is no need to include gene qualifiers on those features in the table, because they will be picked up by overlap. For example, in the flatfile, the gene names ATH1 and YPR027C are present as /gene on the overlapping CDS, even though they are not explicitly listed as gene qualifiers on those CDSs in the table. This option can be suppressed by adding a gene qualifier with the value '-' to the feature. Suppressing the overlapping /gene is important when, for example, a tRNA is encoded within an intron of a housekeeping gene.
- If a protein has more than one name, each name can be listed as a separate product qualifier on the CDS in the table. The value of the first product qualifier becomes the /product on the CDS in the flatfile, and any additional product qualifiers are shown as a /note on the CDS in the flatfile. See the first CDS, which has two product qualifiers, acid trehalase and Ath1p. All CDS features must have at least one product.
- A flatfile /note can be added to any feature using the qualifier note in the table. A note has been added to the second CDS.
- Published citations are added using the REFERENCE feature. For most publications, the start and stop of the feature are the first and last nucleotides of the sequence. The qualifier key is PubMed, and the value is the PubMed Identifier (PMID), which can be found in PubMed.
- The [offset] is used to add a specified number to all subsequent nucleotide intervals. In this example, the record was annotated in two pieces, each piece starting from residue number 1. The sequences themselves were joined together in the FASTA file. The [offset=2000] adds 2000 nt to the location of all features that follow it, sparing the submitter the need to recalculate the location of each feature. This option could be used if the feature intervals for two arms of a chromosome or adjacent contigs are stored separately, but need to be joined for the final submission.
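The interval rules in this list (reversed coordinates for the complementary strand, "<" and ">" for partial ends, multiple intervals, and [offset]) can be illustrated by converting table intervals into flatfile-style location strings. This is a sketch only: the function name `flatfile_location` is hypothetical, it is not how NCBI software is implemented, and it assumes that on the complementary strand the partial symbols follow coordinate order in the flatfile (a "<" in column 1 of the table ends up as ">" on the higher coordinate).

```python
def flatfile_location(intervals, offset=0):
    """Convert table intervals, e.g. [("2626", "2590"), ("2570", "2535")],
    into a flatfile-style location string, adding `offset` to every position."""
    parsed = []
    for start, stop in intervals:
        five_partial = start.startswith("<")   # "<" only appears in column 1
        three_partial = stop.startswith(">")   # ">" only appears in column 2
        a = int(start.lstrip("<")) + offset
        b = int(stop.lstrip(">")) + offset
        parsed.append((a, b, five_partial, three_partial))

    # Reversed coordinates (start > stop) mark the complementary strand.
    minus = any(a > b for a, b, _, _ in parsed)

    spans = []
    for a, b, p5, p3 in parsed:
        if minus:
            lo, hi, lo_partial, hi_partial = b, a, p3, p5
        else:
            lo, hi, lo_partial, hi_partial = a, b, p5, p3
        spans.append(f"{'<' if lo_partial else ''}{lo}..{'>' if hi_partial else ''}{hi}")
    if minus:
        spans.reverse()  # list complement intervals in ascending coordinate order

    loc = spans[0] if len(spans) == 1 else "join(" + ",".join(spans) + ")"
    return f"complement({loc})" if minus else loc


print(flatfile_location([("<1", "1009")]))
print(flatfile_location([("1253", "420")], offset=2000))
print(flatfile_location([("2626", "2590"), ("2570", "2535")], offset=2000))
print(flatfile_location([("3522", "3572"), ("3706", "4197")], offset=2000))
```

The four calls use intervals from Figure 1 and produce the ATH1 CDS, YPR027C gene, tRNA-Phe, and YIP2 CDS locations as they appear in Figure 2.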
Figure 1 : Example feature table
>Feature Sc_16
1	7000	REFERENCE
			PubMed	8849441
<1	1050	gene
			gene	ATH1
<1	1009	CDS
			product	acid trehalase
			product	Ath1p
			codon_start	2
<1	1050	mRNA
			product	acid trehalase
[offset=2000]
1253	420	gene
			gene	YPR027C
1253	420	CDS
			product	Ypr027cp
			note	hypothetical protein
1253	420	mRNA
			product	Ypr027cp
2626	2535	gene
			gene	trnF
2626	2590	tRNA
2570	2535
			product	tRNA-Phe
2626	2590	exon
			number	1
2570	2535	exon
			number	2
3450	4536	gene
			gene	YIP2
3522	3572	CDS
3706	4197
			product	Yip2p
			prot_desc	similar to human polyposis locus protein 1 (YPD)
3450	3572	mRNA
3706	4536
			product	Yip2p
Figure 2 : GenBank flatfile
LOCUS       Sc_16        7000 bp    DNA             PLN       08-MAY-2000
DEFINITION  Saccharomyces cerevisiae strain S288C chromosome XVI, partial sequence.
ACCESSION   Sc_16
VERSION
KEYWORDS    .
SOURCE      baker's yeast.
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Hemiascomycetes; Saccharomycetales;
            Saccharomycetaceae; Saccharomyces.
REFERENCE   1  (bases 1 to 7000)
  AUTHORS   Goffeau,A., Barrell,B.G., Bussey,H., Davis,R.W., Dujon,B.,
            Feldmann,H., Galibert,F., Hoheisel,J.D., Jacq,C., Johnston,M.,
            Louis,E.J., Mewes,H.W., Murakami,Y., Philippsen,P., Tettelin,H. and
            Oliver,S.G.
  TITLE     Life with 6000 genes
  JOURNAL   Science 274 (5287), 546 (1996)
  PUBMED    8849441
REFERENCE   2  (bases 1 to 7000)
  AUTHORS   Ouellette,B.F.F.
  TITLE     Direct Submission
  JOURNAL   Submitted (08-MAY-2000) NCBI/NLM, National Institutes of Health,
            Building 38A, Room 8N805, Bethesda, MD 20894, USA
FEATURES             Location/Qualifiers
     source          1..7000
                     /organism="Saccharomyces cerevisiae"
                     /strain="S288C"
                     /chromosome="XVI"
     mRNA            <1..1050
                     /gene="ATH1"
                     /product="acid trehalase"
     gene            <1..1050
                     /gene="ATH1"
     CDS             <1..1009
                     /gene="ATH1"
                     /note="Ath1p"
                     /codon_start=2
                     /product="acid trehalase"
                     /translation="DHNGTIVHKSGDVPIHIKIPNRSLIHDQDINFYNGSENERKPNL
                     ERRDVDRVGDPMRMDRYGTYYLLKPKQELTVQLFKPGLNARNNIAENKQITNLTAGVP
                     GDVAFSALDGNNYTHWQPLDKIHRAKLLIDLGEYNEKEITKGMILWGQRPAKNISISI
                     LPHSEKVENLFANVTEIMQNSGNDQLLNETIGQLLDNAGIPVENVIDFDGIEQEDDES
                     LDDVQALLHWKKEDLAKLIEQIPRLNFLKRKFVKILDNVPVSPSEPYYEASRNQSLIE
                     ILPSNRTTFTIDYDKLQVGDKGNTDWRKTRYIVVAVQGVYDDYDDDNKGATIKEIVLN
                     D"
     mRNA            complement(2420..3253)
                     /gene="YPR027C"
                     /product="Ypr027cp"
     gene            complement(2420..3253)
                     /gene="YPR027C"
     CDS             complement(2420..3253)
                     /gene="YPR027C"
                     /note="hypothetical protein"
                     /codon_start=1
                     /product="Ypr027cp"
                     /translation="MVGIYRILASFVPLLGLLFAFHDDDMIDTVTIIKTVYETVTSTS
                     TAPAPAATKSVSEKKLDDTKLTLQVIQTMVSCFSVGENPANMISCGLGVVILMFSLII
                     ELINKLENDGINEPQRLYDLIKPKYVELPSNYVNEKIKTTFEPLDLYLGVNMNTSGSE
                     LNQNCLILKLGEKTALPFPGLAQQICYTKGASNEFTNYKLSDIQGNLNENSQGIANGV
                     FQKISNIRKISGNFKSQLYQISEKITDENWDGSAVGFTAHGREKGPNKSQISVSFYRD
                     N"
     gene            complement(4535..4626)
                     /gene="trnF"
     tRNA            complement(join(4535..4570,4590..4626))
                     /product="tRNA-Phe"
                     /gene="trnF"
     exon            complement(4535..4570)
                     /number=1
     exon            complement(4590..4626)
                     /number=2
     mRNA            join(5450..5572,5706..6536)
                     /gene="YIP2"
                     /product="Yip2p"
     gene            5450..6536
                     /gene="YIP2"
     CDS             join(5522..5572,5706..6197)
                     /gene="YIP2"
                     /note="similar to human polyposis locus protein 1 (YPD)"
                     /codon_start=1
                     /product="Yip2p"
                     /translation="MSEYASSIHSQMKQFDTKYSGNRILQQLENKTNLPKSYLVAGLG
                     FAYLLLIFINVGGVGEILSNFAGFVLPAYLSLVALKTPTSTDDTQLLTYWIVFSFLSV
                     IEFWSKAILYLIPFYWFLKTVFLIYIALPQTGGARMIYQKIVAPLTDRYILRDVSKTE
                     KDEIRASVNEASKATGASVH"
BASE COUNT     2201 a   1276 c   1255 g   2268 t
ORIGIN
        1 cgaccacaat ggtacgattg ttcataaatc aggagatgtt cctattcata taaagatacc
       61 aaacagatct ctaatacatg accaggatat caacttctat aatggttccg aaaacgaaag
      121 aaaaccaaat ctagagcgta gagacgtcga ccgtgttggt gatccaatga ggatggatag [etc.]
Applying the feature table
A feature table file may be imported on the Features page in a BankIt submission. On the Features step, select “Add features by uploading five column feature table file”, then “Choose File”, select your file to upload, and click the Upload file button. Your file will be validated. If there are no errors, the features will be imported, and you can then review the annotations and submit your data to GenBank.
The command line program table2asn can be used to automate part of the submission process and is available via FTP. table2asn reads a template, along with the sequence and feature table files, and outputs ASN.1 for submission to GenBank.
Feature tables can be exported from existing public sequence records to help you prepare an update to your accessioned GenBank record. From the search result in Entrez, save the feature table using the Send to file option from the toolbar, specifying the format as feature table. Send your edited feature table to gb-admin@ncbi.nlm.nih.gov.
Feature tables for use in the GenBank Submission Portal
Currently, the GenBank Submission Portal accepts 5-column, tab-delimited feature tables for mRNA sequence submissions. Only gene, CDS, 5'UTR, and 3'UTR features are accepted. Other features are not yet supported but will be added in the future. If you include other feature types in your file, the annotation for those other features will not be applied to your submission. If you have questions about this, please reach out to: gb-admin@ncbi.nlm.nih.gov.
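Because annotation for unsupported feature types is silently left out, it can help to check a table before uploading it. The following is a minimal sketch of such a check; the names `SUPPORTED_KEYS` and `unsupported_feature_keys` are hypothetical and are not part of any NCBI software. It assumes the columns are separated by real tab characters.

```python
SUPPORTED_KEYS = {"gene", "CDS", "5'UTR", "3'UTR"}


def unsupported_feature_keys(table_text):
    """Return the feature keys in a table that the Submission Portal would not apply."""
    found = set()
    for line in table_text.splitlines():
        if line.startswith(">"):
            continue  # ">Feature sequence_ID" header line
        cols = line.split("\t")
        # A feature line has a start, a stop, and a feature key in column 3.
        if len(cols) >= 3 and cols[0] and cols[1] and cols[2]:
            found.add(cols[2].strip())
    return sorted(found - SUPPORTED_KEYS)


table = (
    ">Feature Seq1\n"
    "1\t1210\tCDS\n"
    "\t\t\tproduct\tacid trehalase 1\n"
    "1300\t1372\ttRNA\n"
    "\t\t\tproduct\ttRNA-Phe\n"
)
print(unsupported_feature_keys(table))
```

For this input the function reports the tRNA feature, which is outside the four accepted feature types.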
Preparing the feature table file for a GenBank Submission Portal Submission
File type
The file must be a plain ASCII text file. Do not save your file as .rtf, .xls, .docx, or any binary file type, as our software cannot read these. If you use a spreadsheet program to prepare your feature table, save it as tab-delimited text.
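A quick local check can catch the most common file-type problems before upload. The sketch below is illustrative only and does not reproduce the validation NCBI performs; the function name `check_plain_text` is hypothetical. It reports the first non-ASCII byte, if any, and flags feature or qualifier lines that contain no tab characters, which usually means the file was saved with spaces instead of tabs.

```python
def check_plain_text(data: bytes):
    """Return a list of problems found in the raw bytes of a feature table file."""
    try:
        text = data.decode("ascii")
    except UnicodeDecodeError as err:
        return [f"non-ASCII byte at offset {err.start}"]

    problems = []
    for n, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.startswith(">Feature"):
            continue  # blank lines and header lines need no tabs
        if "\t" not in line:
            problems.append(f"line {n}: no tab characters found")
    return problems


good = b">Feature Seq1\n1\t1210\tCDS\n\t\t\tproduct\tacid trehalase 1\n"
spaces = b">Feature Seq1\n1 1210 CDS\n"
print(check_plain_text(good))
print(check_plain_text(spaces))
```

To check a file on disk, read it in binary mode, for example `check_plain_text(open(path, "rb").read())`, where `path` is the location of your table.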
General format for a feature table
Review the example tables in Figure 3 while you read this section to help you understand the feature table format.
The first line of the feature table contains the following basic information, like this example:
>Feature sequence_ID
The sequence identifiers (sequence_ID) must match exactly the sequence_IDs in your nucleotide FASTA file. The next lines contain information about the features and qualifiers:
- Each feature is on a separate line.
- Qualifiers describing a feature are on the lines below that feature and its intervals.
- Each column is separated by a tab.
Column 1: Start location (first nucleotide) of a feature
Column 2: Stop location (last nucleotide) of a feature
Column 3: Feature key (for example, 'CDS' or 'gene')
Column 4: Qualifier key (for example, 'product' or 'gene' or 'note')
Column 5: Qualifier value
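Since each sequence_ID in the table must match a sequence_ID in the nucleotide FASTA file exactly, a comparison of the two sets of identifiers is a useful pre-submission check. This is a minimal sketch with hypothetical function names. It assumes the FASTA sequence_ID is the first whitespace-delimited token after ">" on each definition line, and it treats the match as case-sensitive.

```python
def fasta_ids(fasta_text):
    """IDs from FASTA definition lines: the first token after '>'."""
    return {line[1:].split()[0] for line in fasta_text.splitlines()
            if line.startswith(">") and line[1:].strip()}


def table_ids(table_text):
    """IDs from '>Feature sequence_ID' lines of a feature table."""
    return {line.split()[1] for line in table_text.splitlines()
            if line.startswith(">Feature") and len(line.split()) > 1}


def unmatched_ids(fasta_text, table_text):
    """Table IDs that have no exact match in the FASTA file."""
    return sorted(table_ids(table_text) - fasta_ids(fasta_text))


fasta = ">Seq1 acid trehalase 1 mRNA\nATGACC\n>Seq2\nCCGGTA\n"
table = (
    ">Feature Seq1\n"
    "1\t1210\tCDS\n"
    "\t\t\tproduct\tacid trehalase 1\n"
    ">Feature seq2\n"
    "<1\t>1050\tgene\n"
    "\t\t\tgene\tATH2\n"
)
print(unmatched_ids(fasta, table))
```

In this example the table uses "seq2" while the FASTA file uses "Seq2", so "seq2" is reported as unmatched.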
Figure 3 : Example table
>Feature Seq1
1	1210	CDS
			product	acid trehalase 1
>Feature Seq2
<1	>1050	gene
			gene	ATH2
<1	>1050	CDS
			product	acid trehalase 2
			codon_start	3
>Feature Seq3
<1	1600	gene
			gene	ATH4
<1	9	5'UTR
10	1550	CDS
			product	acid trehalase 4
1551	1600	3'UTR
>Feature Seq4
1150	1	gene
			gene	ATH3
1150	1	CDS
			product	acid trehalase 3
			note	alternatively spliced
Example feature table explained
The examples in Figure 3 illustrate several points about the table format:
Complete and partial features
Complete features - the sequence contains the complete feature.
- The features are complete in the Seq1 and Seq4 examples.
- A complete CDS begins with the first codon that is translated and ends with the terminal stop codon.
- For a complete CDS, there should be no use of ">" or "<" next to the nucleotide start and stop position numbers. You do not need to indicate codon_start for a complete CDS, as it is assumed that the translation starts at the first nucleotide of the interval.
Partial features - the sequence does not contain the full feature at either (or both) ends.
- Locations of partial (incomplete) features are indicated with a ">" or "<" next to the first and last nucleotide position numbers.
- In the Seq2 example, both the gene and CDS features are partial at the 5' and 3' ends of the sequence. In other words, the CDS begins upstream of the start of the nucleotide sequence and ends downstream of the nucleotide sequence (the sequence does not contain the start codon or stop codon). The "<" symbol next to the first nucleotide location indicates a 5' partial feature and the ">" symbol next to the last nucleotide location indicates the CDS feature is 3' partial. Furthermore, for the protein to translate correctly, the correct reading frame must be indicated with the qualifier "codon_start" on the CDS if the sequence is 5' partial. The valid qualifier value for codon_start is 1, 2, or 3. The codon_start qualifier is only valid for a CDS feature.
- In the Seq3 example, the sequence contains part of the 5'UTR. The 5'UTR is 5' partial as indicated by the < next to the nucleotide position 1. Since the 5'UTR for this gene is 5' partial, the gene is also 5' partial. The CDS and 3'UTR are complete in this example and do not have the < or > symbols.
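The effect of codon_start can be seen by translating the start of a 5' partial CDS. The sketch below uses the first 40 nucleotides of the ORIGIN block in Figure 2, where the acid trehalase CDS has codon_start 2. The function name `translate_partial` is hypothetical, and the standard genetic code is assumed; other translation tables are not handled.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def translate_partial(cds_nt, codon_start=1):
    """Translate a CDS whose first complete codon begins at position codon_start."""
    if codon_start not in (1, 2, 3):
        raise ValueError("codon_start must be 1, 2, or 3")
    seq = cds_nt.upper()[codon_start - 1:]       # skip the leading partial codon
    usable = len(seq) - len(seq) % 3             # ignore trailing incomplete codon
    codons = (seq[i:i + 3] for i in range(0, usable, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


cds_start = "cgaccacaatggtacgattgttcataaatcaggagatgtt"
print(translate_partial(cds_start, codon_start=2))
```

With codon_start 2 the result is DHNGTIVHKSGDV, which matches the beginning of the /translation shown for the acid trehalase CDS in Figure 2. With codon_start 1 or 3 the same nucleotides are read in a different frame and give an unrelated peptide.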
Gene features
- The Seq2, Seq3, and Seq4 examples have gene features.
- In the examples, 'gene' is both a feature and a qualifier and must be entered in two separate lines and columns.
- Gene features are always a single interval.
- The gene intervals should cover the intervals of all the features associated with that gene. For example, Seq3 has gene intervals which include the 5'UTR, CDS, and 3'UTR.
- The gene partialness follows the partialness of the features within the gene. See the section "Complete and partial features" for information on annotating partial features.
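The two gene rules above (the gene interval spans all associated features, and the gene is partial wherever one of those features is partial) can be checked mechanically. The following is a sketch under simplifying assumptions: the function name `gene_covers` is hypothetical, intervals are given as (start, stop) string pairs exactly as written in the table, and strand agreement between the gene and its features is not checked.

```python
def gene_covers(gene, children):
    """Return True if the gene interval spans every child interval and is
    marked partial at each end where a child is partial."""
    def parse(start, stop):
        a, b = int(start.lstrip("<")), int(stop.lstrip(">"))
        # "<" in column 1 means 5' partial, ">" in column 2 means 3' partial.
        return min(a, b), max(a, b), start.startswith("<"), stop.startswith(">")

    g_lo, g_hi, g_p5, g_p3 = parse(*gene)
    for child in children:
        lo, hi, p5, p3 = parse(*child)
        if lo < g_lo or hi > g_hi:
            return False          # child extends beyond the gene
        if (p5 and not g_p5) or (p3 and not g_p3):
            return False          # child is partial but the gene is not
    return True


# Seq3 from Figure 3: 5'UTR, CDS, and 3'UTR inside a 5' partial gene.
seq3_children = [("<1", "9"), ("10", "1550"), ("1551", "1600")]
print(gene_covers(("<1", "1600"), seq3_children))
print(gene_covers(("1", "1600"), seq3_children))
```

The first call passes because the gene spans positions 1 to 1600 and carries the "<" of the 5'UTR. The second fails because the gene is written as complete while its 5'UTR is 5' partial.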
CDS features
- Protein-coding sequences should have a CDS feature.
- CDS features must have a product qualifier and qualifier value. The CDS product is the name of the protein encoded by the sequence. For example, see Seq1 which has the value 'acid trehalase 1' for the CDS product qualifier.
- If a sequence does not contain the translation start codon and/or a stop codon, the CDS is partial. See the section "Complete and partial features" for information on annotating partial features.
Note qualifier
- A note can be added to any feature using the 'note' qualifier. See Seq4 for an example of a CDS feature with a note qualifier.
Features on the complementary strand
- Features on the complementary strand are indicated by reversing the interval locations. An example of this is shown in Seq4.
Applying a feature table in the GenBank Submission Portal
A feature table file may be imported in the GenBank Submission Portal for submission workflows that do not have an automated feature annotation component. These workflows request feature information from you on the Features page. Upload a feature table on the Features step by selecting “Add features by file upload”, then “5-column tab-delimited feature table”, and then “Upload file”. Your file will be validated. If there are no errors, the features will be imported, and you can then review the annotations and submit your data to GenBank.