TSA Submission Guide
Requirements
- Raw reads should be submitted to SRA prior to submitting your transcriptome. The SRA run accession(s) (SRRXXXXXX) and associated BioProject (PRJNAXXXXXX) and BioSample(s) (SAMNXXXXXX) are required for TSA submission.
- Assembly Data structured comment. This information is input directly in the Submission Portal dialogs.
- If a multi-step assembly was performed, a description of the assembly process should be provided in the COMMENT section.
- If annotation is provided, the product names should follow the International Protein Nomenclature Guidelines.
- Annotation must be biologically valid.
- The keyword 'Targeted' and feature annotation should be included for all targeted subsets of transcriptome data. See Targeted vs. Non-targeted TSA Studies for more information.
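Before opening the Submission Portal, it can help to confirm that the three required accessions are on hand and look plausible. The following is a minimal sketch, assuming only the prefixes shown in the requirement above (SRR, PRJNA, SAMN) followed by digits; the function and field names are hypothetical, and accessions with other prefixes are not covered.

```python
import re

# Patterns follow only the prefixes shown in the requirements list;
# any other accession prefix is deliberately not handled in this sketch.
ACCESSION_PATTERNS = {
    "sra_run": re.compile(r"SRR\d+"),
    "bioproject": re.compile(r"PRJNA\d+"),
    "biosample": re.compile(r"SAMN\d+"),
}


def missing_or_malformed(accessions):
    """Return the names of accession fields that are absent or do not match."""
    problems = []
    for name, pattern in ACCESSION_PATTERNS.items():
        value = accessions.get(name, "")
        if not pattern.fullmatch(value):
            problems.append(name)
    return problems
```

An empty list means all three fields were supplied in the expected shape; it does not confirm that the accessions exist or belong to the same study.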
Creating the TSA submission file
[1] The BioProject accession, BioSample accession(s), SRA run accession(s) and Assembly Structured Comment data are entered using the Submission Portal dialogs.
[2] If submitting a Targeted subset of your data see the additional requirements under Targeted vs. Non-targeted TSA.
[3] All TSA submissions are submitted through the TSA Submission Portal.
[4] Submit either a fasta or .sqn file.
Preparing a fasta file for submission:
- Sequences should be in fasta format.
- Files have the suffix .fsa.
- fasta defline component: [moltype=transcribed_RNA]
- Each sequence has a definition line beginning with a unique identifier, e.g., contig001, contig002, etc.
- Use concise names for the unique identifier that do not include length or coverage information.
- The unique identifier cannot exceed 50 characters.
- If uploading multiple fasta file sets in one submission (SUB) the sequence identifier must be unique across all sets.
- The unique identifier appears in the DEFINITION line of the TSA flatfile view.
- Contigs should be longer than 199 nt.
- Remove any n's from the beginning or end of each sequence.
- Do not include internal n's that represent a gap of unknown length. These sequences should be split at the gap.
- Any n stretch longer than 14 nucleotides will need an assembly gap feature. The Submission Portal will provide a prompt to set up the assembly gap feature.
- Scan the sequences using the NCBI Foreign Contamination Screen FCS-GX to avoid delays after submission.
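Several of the fasta rules above are mechanical and can be checked before upload. The following is a minimal sketch, not an NCBI tool: the helper names are hypothetical, the identifier is assumed to be the first whitespace-delimited token of the definition line, and only the identifier uniqueness, identifier length, terminal n, and minimum length rules are covered.

```python
def parse_fasta(text):
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    identifier, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            # Assumption: the identifier is the first token of the defline.
            tokens = line[1:].split()
            identifier, chunks = (tokens[0] if tokens else ""), []
        else:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


def trim_terminal_n(sequence):
    """Remove n/N characters from the beginning and end of a sequence."""
    return sequence.strip("nN")


def check_records(records, max_id_length=50, min_length=200):
    """Return (identifier, problem) tuples for the rules checked here."""
    problems, seen = [], set()
    for identifier, sequence in records:
        if identifier in seen:
            problems.append((identifier, "duplicate identifier"))
        seen.add(identifier)
        if len(identifier) > max_id_length:
            problems.append((identifier, f"identifier longer than {max_id_length} characters"))
        if len(trim_terminal_n(sequence)) < min_length:
            problems.append((identifier, f"shorter than {min_length} nt after trimming"))
    return problems
```

When several fasta file sets go into one submission, passing the records from all sets through a single `check_records` call keeps the uniqueness check global.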
Preparing an .asn file using table2asn for submission:
- table2asn reads a template.sbt along with the sequence and table files, and outputs ASN.1 for submission to TSA through the portal.
- Annotation may be included using a Feature table. See table2asn.
fasta defline components:
- [moltype=transcribed_RNA]
- [tech=TSA]
- To add Source information see table2asn Source table format.
Sample command line:
table2asn -t template.sbt -indir . -Y comment -M t
The validator output (*.val) should be reviewed before submitting. Submissions with validator errors that were not resolved beforehand may be stopped in the Submission Portal. See Submitting the file to TSA Submission Portal for more information.
table2asn command line arguments:
- -Y: To import the Assembly Description Comment
- -M t: To run the standard validator and additional TSA checks
- -j: Allows the addition of source qualifiers that will be the same for each submission. Example: -j "[organism=Homo sapiens] [tissue-type=liver]"
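For scripted pipelines, the sample command line can be assembled as an argument list so that the -j qualifier string is quoted correctly. This is a minimal sketch: the function name and its parameters are hypothetical, the flags are only those shown in the sample command and the argument list, and the sketch builds the command without running it.

```python
import shlex


def build_table2asn_command(template="template.sbt", indir=".", source_qualifiers=None):
    """Assemble the sample table2asn invocation as an argument list."""
    command = ["table2asn", "-t", template, "-indir", indir, "-Y", "comment", "-M", "t"]
    if source_qualifiers:
        # -j takes one string of bracketed qualifiers applied to every sequence.
        joined = " ".join(f"[{key}={value}]" for key, value in source_qualifiers.items())
        command += ["-j", joined]
    return command


command = build_table2asn_command(
    source_qualifiers={"organism": "Homo sapiens", "tissue-type": "liver"}
)
# shlex.join shows the command with shell-safe quoting for review.
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` on a machine where table2asn is installed.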
Submitting the file to TSA Submission Portal
All files must be submitted via the Submission Portal.
When the file is uploaded it will undergo a series of validation checks. The following will stop your submission in the portal:
- Sequences less than 200 bp
- Sequences with UniVec hits that are for Next-Gen sequencing primers
- Sequences that are more than 10% n's or have more than 14 n's in a row
- Files that are incorrectly formatted or have biologically invalid annotation
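The length and n-content conditions in the list above can be approximated locally before upload. The following is a minimal sketch with a hypothetical function name; it does not reproduce the portal's actual validation, it ignores whether a long n run is covered by an assembly gap feature, and it does not address the UniVec, formatting, or annotation checks.

```python
import re


def portal_stop_reasons(sequence, min_length=200, max_n_fraction=0.10, max_n_run=14):
    """Return the reasons a sequence would likely be stopped, per the list above."""
    reasons = []
    if len(sequence) < min_length:
        reasons.append(f"shorter than {min_length} bp")
    n_count = sequence.lower().count("n")
    if sequence and n_count / len(sequence) > max_n_fraction:
        reasons.append("more than 10% n's")
    # Longest uninterrupted run of n/N characters (0 if there are none).
    longest_run = max((len(m.group()) for m in re.finditer(r"[nN]+", sequence)), default=0)
    if longest_run > max_n_run:
        reasons.append(f"more than {max_n_run} n's in a row")
    return reasons
```

An empty list only means that none of these three conditions was triggered.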
Submission statuses in the Submission Portal:
- Queued: The submission is successful and waiting for review by TSA staff. If there are any issues the submitter will be contacted with a list of revisions and/or inquiries.
- Error: The TSA staff has reported any error(s) to the submitter. The corrections need to be made and a new file uploaded using the Fix button.
- Processing: The submission has been successfully completed and an accession number for the project has been assigned.
- Processed: The project has been released to the database.
Metagenomic Transcriptome Studies
For metagenomic transcriptome studies the regular submission process should be followed with the following adjustments:
- Use a metagenome organism name that describes the sample from which the DNA was isolated (e.g., soil metagenome or gut metagenome). The NCBI Taxonomy database lists several metagenomic names. The name is selected in the first steps when submitting to SRA or registering your BioSamples.
- To save processing time please remove potential contaminant sequences prior to submitting. Scan the sequences using the NCBI Foreign Contamination Screen FCS-GX.
- Since TSA runs several validation checks on each transcriptome assembly, you may need to split large assembly files into smaller files.
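Splitting a large assembly on record boundaries keeps every sequence whole. The following is a minimal sketch, assuming the input is an iterable of FASTA lines; the function name is hypothetical, and the number of records per file is a placeholder because the guide does not state a size limit.

```python
def split_fasta_records(lines, records_per_file):
    """Group FASTA lines into chunks holding at most records_per_file records each."""
    if records_per_file < 1:
        raise ValueError("records_per_file must be at least 1")
    chunks, current, count = [], [], 0
    for line in lines:
        if line.startswith(">"):
            # Start a new chunk once the current one is full.
            if count == records_per_file:
                chunks.append(current)
                current, count = [], 0
            count += 1
        current.append(line)
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be written to its own .fsa file; sequence identifiers still need to be unique across all files in the submission.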
Targeted vs. Non-targeted TSA Studies
It is expected that submissions to TSA would comprise a large-scale comprehensive study of the complete transcriptome of an organism. However, some scientists do targeted studies of their transcriptome data and only want to submit this small subset. For targeted studies the regular submission process should be followed with the following requirements:
- The keyword 'Targeted' should be added to the submission file. Using tbl2asn this can be done by including [keyword=Targeted] in the fasta definition line.
- Annotation must be included showing the focus of the targeted study. This can be done with a gene, misc_feature, or RNA feature.
- If coding regions are provided the product names should follow the International Protein Nomenclature Guidelines. If misc_features are provided then the /note should be in the following format "similar to product_name".
- Set the molecule type (moltype) to the appropriate RNA type: mRNA, rRNA, ncRNA, or transcribed RNA.
*SRA cannot release a subset of your data to match your targeted subset. The entire SRA dataset will be released upon release of your subset.
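The definition line for a targeted sequence combines the unique identifier with the keyword and molecule type modifiers. This is a minimal sketch with a hypothetical helper name; the exact spelling of each moltype value inside the brackets is an assumption except for transcribed_RNA, which appears in the fasta instructions.

```python
# Assumed spellings of the moltype values; only transcribed_RNA is
# shown in bracketed form elsewhere in this guide.
ALLOWED_MOLTYPES = {"mRNA", "rRNA", "ncRNA", "transcribed_RNA"}


def targeted_defline(identifier, moltype):
    """Build a fasta definition line carrying the Targeted keyword."""
    if moltype not in ALLOWED_MOLTYPES:
        raise ValueError(f"unexpected moltype: {moltype}")
    return f">{identifier} [moltype={moltype}] [keyword=Targeted]"
```

For example, `targeted_defline("contig001", "mRNA")` gives `>contig001 [moltype=mRNA] [keyword=Targeted]`. The feature annotation required for targeted studies still has to be supplied separately.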
Assembly Gaps
Sequences with known gaps can be submitted to TSA provided that each gap is annotated with an assembly_gap feature.
The required qualifiers for the assembly_gap feature are:
- estimated_length
- linkage_evidence, with one of the following values:
  - paired-ends: paired sequences from the two ends of a DNA fragment.
  - align-genus: alignment to a reference genome within the same genus.
  - align-xgenus: alignment to a reference genome within another genus.
  - align-trnscpt: alignment to a transcript from the same species.
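To see which stretches of a sequence would need an assembly_gap feature, the n runs longer than 14 nucleotides can be located and paired with the two required qualifiers. The following is a minimal sketch: the function name and the dictionary layout are hypothetical and are not a submission format, the linkage evidence values are only the four listed above, and the run length is used as the estimated length.

```python
import re

LINKAGE_EVIDENCE = {"paired-ends", "align-genus", "align-xgenus", "align-trnscpt"}


def assembly_gaps(sequence, linkage_evidence, min_run=15):
    """Describe each n run of at least min_run bases as a gap record."""
    if linkage_evidence not in LINKAGE_EVIDENCE:
        raise ValueError(f"unexpected linkage_evidence: {linkage_evidence}")
    gaps = []
    for match in re.finditer(r"[nN]{%d,}" % min_run, sequence):
        gaps.append({
            "start": match.start() + 1,  # 1-based, inclusive
            "end": match.end(),          # 1-based, inclusive
            "estimated_length": match.end() - match.start(),
            "linkage_evidence": linkage_evidence,
        })
    return gaps
```

The default `min_run=15` corresponds to "greater than 14 nucleotides"; shorter n runs are left alone.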
Updating TSA submissions
See Update TSA Records for instructions on how to update your assemblies. Contact gb-admin@ncbi.nlm.nih.gov with any additional questions.