Submission of annotation using a 5-column tab-delimited feature table

Introduction

The 5-column, tab-delimited feature table (referred to as a "feature table" in this document) format allows different kinds of features (e.g., gene, coding region) and qualifiers (e.g., /product, /note) to be included with sequence submissions. The valid features and qualifiers are restricted to those approved by the International Nucleotide Sequence Database Collaboration.

Please refer to the section that applies to your use case for more information.

Feature tables for use in BankIt, table2asn, and Accession update requests
Feature tables for use in the GenBank Submission Portal
When submitting an annotated prokaryotic or eukaryotic genome, please review the genome guidelines and appropriate annotation details for prokaryotes or eukaryotes.

Feature tables for use in BankIt, table2asn, and Accession update requests

Table Layout

The five-column, tab-delimited feature table specifies the location and type of each feature. The first line of the table contains the following basic information:

>Feature SeqId table_name

The sequence identifier (SeqId) must be the same as that used on the sequence. The table_name is optional. Subsequent lines of the table list the features. Each feature is on a separate line. Qualifiers describing that feature are on the line below. Columns are separated by tabs.

Line 1:
Column 1: Start location of feature
Column 2: Stop location of feature
Column 3: Feature key
Line 2:
Column 4: Qualifier key
Column 5: Qualifier value
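
As a rough illustration of the layout above, the following Python sketch classifies each line of a feature table by which tab-separated columns are filled. The function name classify_line and the tuples it returns are hypothetical and are not part of any NCBI tool.

```python
def classify_line(line):
    """Classify one feature-table line by which tab-separated columns are filled."""
    line = line.rstrip("\r\n")
    if line.startswith(">Feature"):
        # Header line: ">Feature SeqId [table_name]"
        parts = line.split()
        seq_id = parts[1] if len(parts) > 1 else None
        table_name = parts[2] if len(parts) > 2 else None
        return ("header", seq_id, table_name)
    cols = line.split("\t")
    cols += [""] * (5 - len(cols))  # pad short lines to five columns
    start, stop, key, qual, value = cols[:5]
    if start and stop and key:
        return ("feature", start, stop, key)
    if start and stop:
        # Additional interval of a multi-interval feature
        return ("interval", start, stop)
    if qual:
        return ("qualifier", qual, value)
    # Anything else, such as an [offset=...] line or a blank line
    return ("other", line)


print(classify_line("<1\t1050\tgene"))
print(classify_line("\t\t\tproduct\tacid trehalase"))
```

Lines that do not fit the three expected shapes are returned as "other" rather than rejected, so a caller can decide how to treat them.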

Figure 1 shows a sample table and illustrates several points about the table format. The GenBank flatfile corresponding to this table is shown in Figure 2.

Features that are on the complementary strand, such as the gene YPR027C and tRNA-Phe, are indicated by reversing the interval locations.
Locations of partial (incomplete) features are indicated with a ">" or "<" in front of the nucleotide location. The “<” symbol always appears in column 1 and “>” always appears in column 2, regardless of the strandedness of the feature. In this example, the first gene, CDS, and mRNA all begin upstream of the start of the nucleotide sequence. The "<" symbol indicates that they are 5' partial features.
For a CDS that is partial at its 5' end to translate correctly, the first base of the first complete codon must be indicated with the qualifier "codon_start". The value is not the reading frame of the entire sequence; it is the nucleotide position within the CDS. In the example, nucleotide 2 begins the first complete codon of the acid trehalase CDS. The default codon_start is 1, so there is no need to indicate codon_start on complete CDSs, where translation always begins at the first nucleotide of the interval.
If a feature contains multiple intervals, like the spliced tRNA-Phe or the Yip2p CDS, each interval is listed on a separate line by its start and stop position before subsequent qualifier lines.
Gene features are always a single interval, and their location should cover the intervals of all the relevant features. For example, the gene YIP2 is as long as its mRNA, and is thus longer than its CDS.
If the gene feature spans the intervals of the CDS or mRNA features for that gene, there is no need to include gene qualifiers on those features in the table, because they will be picked up by overlap. For example, in the flatfile, the gene names ATH1 and YPR027C are present as /gene on the overlapping CDS, even though they are not explicitly listed as gene qualifiers on those CDSs in the table. This behavior can be suppressed by adding a gene qualifier with the value '-' to the feature. Suppressing the overlapping /gene is important when, for example, a tRNA is encoded within an intron of a housekeeping gene.
If a protein has more than one name, each can be listed as a separate product qualifier on the CDS in the table. The value of the first product qualifier will become the /product on the CDS in the flatfile, and any additional product qualifiers will be shown as a /note on the CDS in the flatfile. See the first CDS, which has two product qualifiers, acid trehalase and Ath1p. All CDS features must have at least one product.
A flatfile /note can be added to any feature using the qualifier note in the table. A note has been added to the second CDS.
Published citations are added using the REFERENCE feature. For most publications, the start and stop of the feature are the first and last nucleotides of the sequence. The qualifier key is PubMed, and the value is the PubMed Identifier (PMID), which can be found in PubMed.
The [offset] is used to add a specified number to all subsequent nucleotide intervals. In this example, the record was annotated in two pieces, each piece starting from residue number 1. The sequences themselves were joined together in the FASTA file. The [offset=2000] adds 2000 nt to the location of all features that follow it, sparing the submitter the need to recalculate the location of each feature. This option could be used if the feature intervals for two arms of a chromosome or adjacent contigs are stored separately, but need to be joined for the final submission.
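
The interval rules above (reversed intervals for the complementary strand, "<" and ">" for partial ends, multiple intervals, and [offset]) can be sketched in Python as a conversion from table intervals to a flatfile-style location string. This is a minimal sketch, not the logic used by NCBI software; the function name to_flatfile_location is hypothetical, and the placement of partial symbols for complementary-strand features is an assumption based on the usual flatfile convention.

```python
def to_flatfile_location(intervals, offset=0):
    """Convert (start, stop) string pairs from a feature table into a
    flatfile-style location string, adding `offset` to every position."""
    spans = []
    minus = False
    for start, stop in intervals:
        five_partial = start.startswith("<")   # "<" is always in column 1
        three_partial = stop.startswith(">")   # ">" is always in column 2
        a = int(start.lstrip("<>")) + offset
        b = int(stop.lstrip("<>")) + offset
        if a > b:
            # Reversed interval: feature is on the complementary strand.
            # Assumption: the 5' end is then the higher coordinate.
            minus = True
            low, high = b, a
            low_mark, high_mark = three_partial, five_partial
        else:
            low, high = a, b
            low_mark, high_mark = five_partial, three_partial
        text = f"{'<' if low_mark else ''}{low}..{'>' if high_mark else ''}{high}"
        spans.append((low, text))
    if minus:
        spans.sort()  # list lower coordinates first inside complement(...)
    parts = [text for _, text in spans]
    location = parts[0] if len(parts) == 1 else "join(" + ",".join(parts) + ")"
    return f"complement({location})" if minus else location


# The spliced tRNA-Phe from Figure 1, which follows [offset=2000]
print(to_flatfile_location([("2626", "2590"), ("2570", "2535")], offset=2000))
```

With the Figure 1 intervals and an offset of 2000, the results can be compared against the locations shown in Figure 2.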

Figure 1 : Example feature table

>Feature Sc_16
1    7000    REFERENCE
                        PubMed         8849441
<1    1050    gene
                        gene           ATH1
<1    1009    CDS
                        product        acid trehalase
                        product        Ath1p
                        codon_start    2
<1    1050    mRNA
                        product        acid trehalase
[offset=2000]
1253    420    gene
                        gene           YPR027C
1253    420    CDS
                        product        Ypr027cp
                        note           hypothetical protein
1253    420    mRNA
                        product        Ypr027cp
2626    2535    gene
                        gene           trnF
2626    2590    tRNA
2570    2535
                        product        tRNA-Phe
2626    2590    exon
                        number         1
2570    2535    exon
                        number         2
3450    4536    gene
                        gene           YIP2
3522    3572    CDS
3706    4197
                        product        Yip2p
                        prot_desc      similar to human polyposis locus protein 1 (YPD)
3450    3572    mRNA
3706    4536
                        product        Yip2p

Figure 2 : GenBank flatfile

LOCUS       Sc_16        7000 bp    DNA             PLN       08-MAY-2000
DEFINITION  Saccharomyces cerevisiae strain S288C chromosome XVI, partial sequence.
ACCESSION   Sc_16
VERSION
KEYWORDS    .
SOURCE      baker's yeast.
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Hemiascomycetes; Saccharomycetales;
            Saccharomycetaceae; Saccharomyces.
REFERENCE   1  (bases 1 to 7000)
  AUTHORS   Goffeau,A., Barrell,B.G., Bussey,H., Davis,R.W., Dujon,B.,
            Feldmann,H., Galibert,F., Hoheisel,J.D., Jacq,C., Johnston,M.,
            Louis,E.J., Mewes,H.W., Murakami,Y., Philippsen,P., Tettelin,H. and
            Oliver,S.G.
  TITLE     Life with 6000 genes
  JOURNAL   Science 274 (5287), 546 (1996)
   PUBMED   8849441
REFERENCE   2  (bases 1 to 7000)
  AUTHORS   Ouellette,B.F.F.
  TITLE     Direct Submission
  JOURNAL   Submitted (08-MAY-2000) NCBI/NLM, National Institutes of Health,
            Building 38A, Room 8N805, Bethesda, MD 20894, USA
FEATURES             Location/Qualifiers
     source          1..7000
                     /organism="Saccharomyces cerevisiae"
                     /strain="S288C"
                     /chromosome="XVI"
     mRNA            <1..1050
                     /gene="ATH1"
                     /product="acid trehalase"
     gene            <1..1050
                     /gene="ATH1"
     CDS             <1..1009
                     /gene="ATH1"
                     /note="Ath1p"
                     /codon_start=2
                     /product="acid trehalase"
                     /translation="DHNGTIVHKSGDVPIHIKIPNRSLIHDQDINFYNGSENERKPNL
                     ERRDVDRVGDPMRMDRYGTYYLLKPKQELTVQLFKPGLNARNNIAENKQITNLTAGVP
                     GDVAFSALDGNNYTHWQPLDKIHRAKLLIDLGEYNEKEITKGMILWGQRPAKNISISI
                     LPHSEKVENLFANVTEIMQNSGNDQLLNETIGQLLDNAGIPVENVIDFDGIEQEDDES
                     LDDVQALLHWKKEDLAKLIEQIPRLNFLKRKFVKILDNVPVSPSEPYYEASRNQSLIE
                     ILPSNRTTFTIDYDKLQVGDKGNTDWRKTRYIVVAVQGVYDDYDDDNKGATIKEIVLN
                     D"
     mRNA            complement(2420..3253)
                     /gene="YPR027C"
                     /product="Ypr027cp"
     gene            complement(2420..3253)
                     /gene="YPR027C"
     CDS             complement(2420..3253)
                     /gene="YPR027C"
                     /note="hypothetical protein"
                     /codon_start=1
                     /product="Ypr027cp"
                     /translation="MVGIYRILASFVPLLGLLFAFHDDDMIDTVTIIKTVYETVTSTS
                     TAPAPAATKSVSEKKLDDTKLTLQVIQTMVSCFSVGENPANMISCGLGVVILMFSLII
                     ELINKLENDGINEPQRLYDLIKPKYVELPSNYVNEKIKTTFEPLDLYLGVNMNTSGSE
                     LNQNCLILKLGEKTALPFPGLAQQICYTKGASNEFTNYKLSDIQGNLNENSQGIANGV
                     FQKISNIRKISGNFKSQLYQISEKITDENWDGSAVGFTAHGREKGPNKSQISVSFYRD
                     N"
     gene            complement(4535..4626)
                     /gene="trnF"
     tRNA            complement(join(4535..4570,4590..4626))
                     /product="tRNA-Phe"
                     /gene="trnF"
     exon            complement(4535..4570)
                     /number=2
     exon            complement(4590..4626)
                     /number=1
     mRNA            join(5450..5572,5706..6536)
                     /gene="YIP2"
                     /product="Yip2p"
     gene            5450..6536
                     /gene="YIP2"
     CDS             join(5522..5572,5706..6197)
                     /gene="YIP2"
                     /note="similar to human polyposis locus protein 1 (YPD)"
                     /codon_start=1
                     /product="Yip2p"
                     /translation="MSEYASSIHSQMKQFDTKYSGNRILQQLENKTNLPKSYLVAGLG
                     FAYLLLIFINVGGVGEILSNFAGFVLPAYLSLVALKTPTSTDDTQLLTYWIVFSFLSV
                     IEFWSKAILYLIPFYWFLKTVFLIYIALPQTGGARMIYQKIVAPLTDRYILRDVSKTE
                     KDEIRASVNEASKATGASVH"
BASE COUNT     2201 a   1276 c   1255 g   2268 t
ORIGIN
        1 cgaccacaat ggtacgattg ttcataaatc aggagatgtt cctattcata taaagatacc
       61 aaacagatct ctaatacatg accaggatat caacttctat aatggttccg aaaacgaaag
      121 aaaaccaaat ctagagcgta gagacgtcga ccgtgttggt gatccaatga ggatggatag [etc.]

Applying the feature table

BankIt

A feature table file may be imported on the Features page in a BankIt submission. On the Features step, select “Add features by uploading five column feature table file”, then “Choose File”, select your file to upload, and click the Upload file button. Your file will be validated. If there are no errors, the features will be imported, and you can then review the annotations and submit your data to GenBank.

table2asn

The command-line program table2asn can be used to automate part of the submission process and is available via FTP. table2asn reads a template, along with the sequence and feature table files, and outputs ASN.1 for submission to GenBank.
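
A possible way to script this step is sketched below in Python. The flag names (-t for the template, -i for the FASTA file, -f for the feature table, -o for the ASN.1 output) are given as assumptions and should be checked against the help output of your installed table2asn version; the file names and the function build_table2asn_command are hypothetical.

```python
import shlex


def build_table2asn_command(template, fasta, table, output):
    """Assemble a table2asn command line as a list of arguments.
    The flag names are assumptions to verify with `table2asn -help`."""
    return ["table2asn", "-t", template, "-i", fasta, "-f", table, "-o", output]


cmd = build_table2asn_command("template.sbt", "Sc_16.fsa", "Sc_16.tbl", "Sc_16.sqn")
print(shlex.join(cmd))  # show the command before running it
```

Once the tool is installed and the flags are confirmed, the same list can be passed to subprocess.run(cmd, check=True).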

Updates

Feature tables can be exported from existing public sequence records to help you prepare an update to your accessioned GenBank record. From the search result in Entrez, save the feature table using the Send to file option from the toolbar, specifying the format as feature table. Send your edited feature table to gb-admin@ncbi.nlm.nih.gov.

Feature tables for use in the GenBank Submission Portal

Currently, the GenBank Submission Portal accepts 5-column, tab-delimited feature tables for mRNA sequence submissions. Only gene, CDS, 5'UTR, and 3'UTR features are accepted. Other features are not yet supported but will be added in the future. If you include other feature types in your file, the annotation for those other features will not be applied to your submission. If you have questions about this, please reach out to: gb-admin@ncbi.nlm.nih.gov.
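
Before uploading, it may help to check which feature keys in a table fall outside the four accepted types. The Python sketch below is one way to do that, assuming a tab-delimited table held as a list of lines; the names PORTAL_FEATURE_KEYS and unsupported_feature_keys are hypothetical.

```python
PORTAL_FEATURE_KEYS = {"gene", "CDS", "5'UTR", "3'UTR"}


def unsupported_feature_keys(lines):
    """Return the feature keys in a table that are not in the accepted set,
    in order of first appearance."""
    found = []
    for line in lines:
        cols = line.rstrip("\r\n").split("\t")
        # A feature line has a start, a stop, and a feature key in column 3.
        if len(cols) >= 3 and cols[0] and cols[1] and cols[2].strip():
            key = cols[2].strip()
            if key not in PORTAL_FEATURE_KEYS and key not in found:
                found.append(key)
    return found


table = [
    ">Feature Seq1",
    "1\t1210\tCDS",
    "\t\t\tproduct\tacid trehalase 1",
    "1\t1210\tmRNA",
    "1\t50\texon",
]
print(unsupported_feature_keys(table))
```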

Preparing the feature table file for a GenBank Submission Portal Submission

File type

The file must be a plain ASCII text file. Do not save your file as .rtf, .xls, .docx, or any binary file type, as our software cannot read these. If you use a spreadsheet program to prepare your feature table, save it as tab-delimited text.
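
A quick local check of these requirements could look like the following Python sketch. It inspects the raw bytes of a file for non-ASCII content, NUL bytes, and the absence of tabs; the function name check_plain_text and the wording of the messages are hypothetical, and passing this check does not guarantee that the file will pass validation.

```python
def check_plain_text(data):
    """Return a list of problems found in the raw bytes of a feature table file."""
    problems = []
    if not data.isascii():
        problems.append("file contains non-ASCII bytes")
    if b"\x00" in data:
        problems.append("file contains NUL bytes and may be binary")
    if b"\t" not in data:
        problems.append("no tab characters found; columns may be space-separated")
    return problems


print(check_plain_text(b">Feature Seq1\n1\t1210\tCDS\n"))
```

To check a file on disk, read it in binary mode first, for example with open(path, "rb").read().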

General format for a feature table

Review the example tables in Figure 3 while you read this section to help you understand the feature table format.

The first line of the feature table contains the following basic information, like this example:

>Feature sequence_ID

The sequence identifiers (sequence_ID) must match exactly the sequence_IDs in your nucleotide FASTA file. The next lines contain information about the features and qualifiers:

Each feature is on a separate line.
Qualifiers describing a feature are on the lines below that feature and its intervals.
Each column is separated by a tab.

Column 1: Start location (first nucleotide) of a feature 
Column 2: Stop location (last nucleotide) of a feature 
Column 3: Feature key (for example, 'CDS' or 'gene') 
Column 4: Qualifier key (for example, 'product' or 'gene' or 'note') 
Column 5: Qualifier value
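
If the table is generated by a script rather than a spreadsheet, the column layout above can be produced directly. The Python sketch below writes one feature as tab-delimited lines; the function name format_feature is hypothetical.

```python
def format_feature(intervals, key, qualifiers=()):
    """Render one feature as tab-delimited lines: intervals first, then qualifiers."""
    lines = []
    for i, (start, stop) in enumerate(intervals):
        # Only the first interval line carries the feature key in column 3.
        lines.append(f"{start}\t{stop}\t{key}" if i == 0 else f"{start}\t{stop}")
    for name, value in qualifiers:
        # Qualifier lines leave columns 1 to 3 empty.
        lines.append(f"\t\t\t{name}\t{value}")
    return lines


table = [">Feature Seq1"] + format_feature(
    [("1", "1210")], "CDS", [("product", "acid trehalase 1")]
)
print("\n".join(table))
```

Writing the tabs explicitly avoids the common problem of an editor silently replacing them with spaces.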

Figure 3 : Example table

>Feature Seq1
1     1210    CDS
                        product        acid trehalase 1
>Feature Seq2                        
<1    >1050    gene
                        gene           ATH2
<1    >1050    CDS
                        product        acid trehalase 2
                        codon_start    3
>Feature Seq3                        
<1    1600    gene
                        gene           ATH4
<1    9       5'UTR                         
10    1550    CDS
                        product        acid trehalase 4                         
1551  1600    3'UTR
>Feature Seq4                        
1150  1       gene
                        gene           ATH3
1150  1       CDS        
                        product        acid trehalase 3
                        note           alternatively spliced

Example feature table explained

The examples in Figure 3 illustrate several points about the table format:

Complete and partial features

Complete features - the sequence contains the complete feature.

The features are complete in the Seq1 and Seq4 examples.
A complete CDS begins with the first codon that is translated and ends with the terminal stop codon.
For a complete CDS, there should be no use of ">" or "<" next to the nucleotide start and stop position numbers. You do not need to indicate codon_start for a complete CDS, as it is assumed that the translation starts at the first nucleotide of the interval.

Partial features - the sequence does not contain the full feature at one or both ends.

Locations of partial (incomplete) features are indicated with a ">" or "<" next to the first and last nucleotide position numbers.
In the Seq2 example, both the gene and CDS features are partial at the 5' and 3' ends of the sequence. In other words, the CDS begins upstream of the start of the nucleotide sequence and ends downstream of the nucleotide sequence (the sequence does not contain the start codon or stop codon). The "<" symbol next to the first nucleotide location indicates a 5' partial feature and the ">" symbol next to the last nucleotide location indicates the CDS feature is 3' partial. Furthermore, for the protein to translate correctly, the correct reading frame must be indicated with the qualifier "codon_start" on the CDS if the sequence is 5' partial. The valid qualifier value for codon_start is 1, 2, or 3. The codon_start qualifier is only valid for a CDS feature.
In the Seq3 example, the sequence contains part of the 5'UTR. The 5'UTR is 5' partial as indicated by the < next to the nucleotide position 1. Since the 5'UTR for this gene is 5' partial, the gene is also 5' partial. The CDS and 3'UTR are complete in this example and do not have the < or > symbols.
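
To make the effect of codon_start concrete, the Python sketch below splits a 5' partial CDS into complete codons, starting at the position given by codon_start. It is an illustration only and does not translate the sequence; the function name codons_from is hypothetical. The example uses the first 20 bases of the ORIGIN in Figure 2, where the acid trehalase CDS has codon_start 2.

```python
def codons_from(cds_nt, codon_start=1):
    """Split a CDS nucleotide string into complete codons, beginning at the
    1-based position given by codon_start."""
    if codon_start not in (1, 2, 3):
        raise ValueError("codon_start must be 1, 2, or 3")
    usable = cds_nt[codon_start - 1:]
    usable = usable[: len(usable) - len(usable) % 3]  # drop a trailing partial codon
    return [usable[i:i + 3] for i in range(0, len(usable), 3)]


# First 20 bases of the sequence in Figure 2, read with codon_start=2
print(codons_from("cgaccacaatggtacgattg", 2))
```

The codons returned for this example (gac, cac, aat, ggt, acg, att) correspond to the first residues, DHNGTI, of the /translation shown in Figure 2. For Seq2 in Figure 3, codon_start 3 means the first complete codon begins at the third nucleotide of the CDS.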

Gene features

The Seq2, Seq3, and Seq4 examples have gene features.
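In the examples, 'gene' is both a feature key and a qualifier key. The two must be entered on separate lines and in different columns: the feature key in column 3 and the qualifier key in column 4 of the line below.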
Gene features are always a single interval.
The gene interval should cover the intervals of all the features associated with that gene. For example, Seq3 has a gene interval that includes the 5'UTR, CDS, and 3'UTR.
The gene partialness follows the partialness of the features within the gene. See the section "Complete and partial features" for information on annotating partial features.
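
The coverage rule for gene features can be checked with a short Python sketch such as the one below. It ignores strand and partial symbols when comparing coordinates, which is a simplification; the helper names span and gene_covers are hypothetical.

```python
def span(start, stop):
    """Return (low, high) coordinates for an interval, ignoring '<' and '>'
    and the order in which start and stop were written."""
    a, b = int(start.lstrip("<>")), int(stop.lstrip("<>"))
    return min(a, b), max(a, b)


def gene_covers(gene, children):
    """True if the single gene interval contains every child interval."""
    g_low, g_high = span(*gene)
    return all(
        g_low <= low and high <= g_high
        for low, high in (span(s, e) for s, e in children)
    )


# Seq3 from Figure 3: the gene spans the 5'UTR, CDS, and 3'UTR
print(gene_covers(("<1", "1600"), [("<1", "9"), ("10", "1550"), ("1551", "1600")]))
```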

CDS features

Protein-coding sequences should have a CDS feature.
CDS features must have a product qualifier and qualifier value. The CDS product is the name of the protein encoded by the sequence. For example, see Seq1 which has the value 'acid trehalase 1' for the CDS product qualifier.
If a sequence does not contain the translation start codon and/or a stop codon, the CDS is partial. See the section "Complete and partial features" for information on annotating partial features.

Note qualifier

A note can be added to any feature using the 'note' qualifier. See Seq4 for an example of a CDS feature with a note qualifier.

Features on the complementary strand

Features on the complementary strand are indicated by reversing the interval locations. An example of this is shown in Seq4.

Applying a feature table in the GenBank Submission Portal

A feature table file may be imported in the GenBank Submission Portal for submission workflows that do not have an automated feature annotation component. These workflows will request this information from you on the Features page. Upload a feature table on the Features step by selecting “Add features by file upload”, then “5-column tab-delimited feature table”, and then “Upload file”. Your file will be validated. If there are no errors, the features will be imported, and you can then review the annotations and submit your data to GenBank.
