RefSeq curation and annotation of the human reference genome
Overview
The human reference genome sequence is an essential resource for clinical, forensic, and research uses. Accurate annotation of known genes, and prediction of novel genes based on available transcript evidence, provides indispensable functional context that supports use of this sequence resource. NCBI has supported human genome assembly and annotation needs since the 1990s.
Select human genome and annotation publications with NCBI authors:
- Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. O'Leary et al., Nucleic Acids Res. 2016 44(D1):D733-45.
- Extending reference assembly models. Church et al., Genome Biol. 2015 Jan 24;16:13.
- RefSeq: an update on mammalian reference sequences. Pruitt et al., Nucleic Acids Res. 2014 Jan;42(Database issue):D756-63.
- Modernizing reference genome assemblies. Church et al., PLoS Biol. 2011 Jul;9(7):e1001091.
- Finishing the euchromatic sequence of the human genome. International Human Genome Sequencing Consortium. Nature. 2004 Oct 21;431(7011):931-45.
- Accessing the human genome. Church D, Pruitt KD. Curr Protoc Hum Genet. 2002 Nov;Chapter 6:Unit 6.9.
- NCBI's LocusLink and RefSeq. Maglott et al., Nucleic Acids Res. 2000 Jan 1;28(1):126-8.
- Introducing RefSeq and LocusLink: curated human genome resources at the NCBI. Pruitt et al., Trends Genet. 2000 Jan;16(1):44-7.
- Coordination of human genome sequencing via a consensus framework map. Bentley et al., Trends Genet. 1998 Oct;14(10):381-4.
RefSeq human reference genome assembly
The human reference genome assembly and sequence information that is represented in the RefSeq collection is identical to the sequence data available in databases maintained by the International Nucleotide Sequence Database Collaboration (INSDC) including NCBI's GenBank database.
The assembly and sequence are maintained by the international Genome Reference Consortium (GRC), whose activities have included providing updated versions of the reference genome assembly and providing alternate representations (alternate loci) for regions that are too complex to be represented by a single path. In addition, the GRC releases regional fixes known as patches. The GRC website, hosted by NCBI, includes rich information about assembly updates and issues reported for the current assembly.
The RefSeq version of the human genome differs from the GRC and GenBank version in terms of genome annotation and record descriptors.
Information about the human reference genome assembly structure and statistics is available through NCBI's Assembly resource. The RefSeq assembly accession number for the human reference genome is GCF_000001405. The assembly version number provides essential specificity: GCF_000001405.33 refers to the RefSeq copy of the GRCh38.p7 assembly (where p7 indicates the seventh patch release for the GRCh38 assembly). The GRC-maintained GenBank version of the reference assembly is tracked with the assembly accession number GCA_000001405.
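The accession layout described above (a GCF or GCA prefix, a numeric identifier, and a version suffix) can be sketched in a few lines of Python. This is only an illustration: the function name `parse_assembly_accession` and the field names are assumptions made for this example, not part of any NCBI tool, and the pattern assumes the nine-digit form seen in the accessions quoted in this section.

```python
import re
from typing import NamedTuple, Optional


class AssemblyAccession(NamedTuple):
    source: str              # "RefSeq" for GCF_, "GenBank" for GCA_
    base: str                # accession without version, e.g. "GCF_000001405"
    version: Optional[int]   # None when no ".N" suffix is present


# Assumed layout: prefix, underscore, nine digits, optional ".version".
_ACCESSION_RE = re.compile(r"^(GC[AF])_(\d{9})(?:\.(\d+))?$")


def parse_assembly_accession(text: str) -> AssemblyAccession:
    """Split an assembly accession into source, base accession, and version."""
    match = _ACCESSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not an assembly accession: {text!r}")
    prefix, digits, version = match.groups()
    source = "RefSeq" if prefix == "GCF" else "GenBank"
    return AssemblyAccession(
        source=source,
        base=f"{prefix}_{digits}",
        version=int(version) if version is not None else None,
    )


if __name__ == "__main__":
    print(parse_assembly_accession("GCF_000001405.33"))
    print(parse_assembly_accession("GCA_000001405"))
```

Keeping the version as a separate field makes it harder to accidentally compare results computed against different patch releases of the same assembly.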
Human genome annotation process
The RefSeq project generates comprehensive genome annotation results for the reference assembly approximately every 12 to 18 months. Annotation is recalculated with each update in order to integrate newly available data. Provision of human reference genome annotation is a core mission of the RefSeq project. NCBI's eukaryotic genome annotation pipeline incorporates the known (curated) RefSeq data set and analyzes cDNA, RNA-Seq, and protein alignments to predict new alternatively spliced transcripts (model RefSeqs), as well as new genes that are not represented in the curated RefSeq collection.
Each annotation run is tracked numerically and is accompanied by a detailed annotation release report. The annotation release number can be cited in publications to refer to use of a specific annotation result set.
Products of the annotation process include:
- Annotation tracks which are displayed in the Gene resource (for example see PTEN, GeneID:5728), the new Genome Data Viewer, and Map Viewer (excluding RNA-Seq tracks)
  - Gene, transcript, CDS (protein) annotations
  - Consensus CDS (CCDS) Collaboration tracked features
  - CpG islands
  - GRC tracked assembly regions, alternate loci, patches, and reported issues
  - RNA-Seq aggregate tracks and per-sample tracks
- Detailed annotation release report which includes information about annotation reagents (per RNA-Seq sample), alignment quality, and annotation results. For example, Annotation release 108 reports RefSeq annotation results for assembly version GRCh38.p7 (GCF_000001405.33).
- FTP release which includes sequence data, annotation data in multiple formats, and assembly reports
In addition to full annotation releases, RefSeq produces interim annotation releases approximately every 3 months to coincide with GRC patch releases. Interim annotations incorporate new and updated known RefSeq transcripts into the genome annotation and carry over model RefSeqs from the full annotation release. The interim annotation is available in GFF3 format by FTP and as an additional annotation track in most of NCBI's graphical annotation viewers.
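For readers who want to work with the GFF3 files directly, the following is a minimal sketch of reading one feature line. It relies only on the general GFF3 layout of nine tab-separated columns with `key=value` attributes separated by semicolons. The function name `parse_gff3_line` and the sample line are illustrative assumptions: the sequence ID, source, and coordinates are placeholders rather than values from an actual NCBI release, and only the PTEN GeneID comes from this page.

```python
from urllib.parse import unquote

GFF3_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_gff3_line(line):
    """Return a dict for one GFF3 feature line, or None for comments/blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != len(GFF3_COLUMNS):
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    # Coordinates are 1-based and inclusive in GFF3.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for pair in record["attributes"].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            # Attribute text may be percent-encoded; multi-valued attributes
            # (comma-separated) are left as a single string in this sketch.
            attributes[unquote(key)] = unquote(value)
    record["attributes"] = attributes
    return record


# Placeholder line for illustration only; not copied from an NCBI file.
EXAMPLE_LINE = "seq1\texample\tgene\t100\t200\t.\t+\t.\tID=gene0;Dbxref=GeneID:5728;Name=PTEN"

if __name__ == "__main__":
    feature = parse_gff3_line(EXAMPLE_LINE)
    print(feature["type"], feature["start"], feature["end"], feature["attributes"])
```

For production use an established GFF3 parser is likely a better choice, since it will handle directives, multi-valued attributes, and parent-child relationships that this sketch ignores.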
Mitochondrial genome
RefSeq uses the Revised Cambridge Reference Sequence. The RefSeq sequence (NC_012920.1) is identical to the INSDC record (J01415.2) in both sequence and feature annotation. The RefSeq record was modified to include official nomenclature details as provided by the HUGO Gene Nomenclature Committee (HGNC).
Curation
NCBI staff scientists review sequence alignments, publications, extensive quality assurance tests, and non-NCBI data resources while generating RefSeq transcript, protein, pseudogene, and genomic region records. The human transcript and protein data set is classified into two categories:
- Known RefSeq: This category is supported by extensive manual curation. Records are primarily derived from INSDC cDNA, EST, and Transcriptome Shotgun Assembly (TSA) records. These records may also be derived from genomic sequences based on aligned transcripts including RNA-Seq data. RefSeq pseudogene records may be derived from either INSDC DNA or RNA data types. Access known RefSeq transcripts in NCBI's Nucleotide or Protein databases.
- Model RefSeq: This category is computationally predicted based on aligned evidence. Records are primarily derived from genomic sequence, although some records include data from INSDC transcript records in order to represent a more complete record in the region of assembly gaps or indels. RefSeq curation of this category is limited to converting the best-supported model RefSeqs to known RefSeqs. Access model RefSeq transcripts in NCBI's Nucleotide or Protein databases.
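The two categories can usually be told apart from the accession prefix. That convention is not stated in this section; it comes from the general RefSeq accession documentation, where NM_, NR_, and NP_ denote known (curated) transcript and protein records and XM_, XR_, and XP_ denote model records. The sketch below applies it; the function name `refseq_category` and the sample accessions in the usage lines are placeholders chosen for illustration.

```python
# Prefix convention taken from general RefSeq accession documentation,
# not from the text of this section.
KNOWN_PREFIXES = {"NM_", "NR_", "NP_"}   # curated mRNA, non-coding RNA, protein
MODEL_PREFIXES = {"XM_", "XR_", "XP_"}   # predicted mRNA, non-coding RNA, protein


def refseq_category(accession):
    """Classify a RefSeq transcript/protein accession as 'known', 'model', or 'other'."""
    prefix = accession.strip()[:3].upper()
    if prefix in KNOWN_PREFIXES:
        return "known"
    if prefix in MODEL_PREFIXES:
        return "model"
    # Genomic records (for example NC_ or NG_) and non-RefSeq accessions land here.
    return "other"


if __name__ == "__main__":
    # Placeholder accessions, used only to exercise the function.
    for acc in ("NM_123456.1", "XM_123456.1", "NC_012920.1"):
        print(acc, refseq_category(acc))
```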
Known RefSeq records undergo two levels of curation:
- Validated: The record has undergone sequence analysis which includes an evaluation of all INSDC transcript data with a high-quality alignment to the gene locus, review of RNA-Seq alignment summary data, and a review of the protein sequence as relevant. This review may result in a RefSeq record that is compiled from more than one INSDC accession in order to represent both the most supported sequence data per nucleotide and well supported exon combinations. The curation process may include addition of RefSeq attribute data to the internal curation database, which are incorporated into the record using automatic processes. Validated records are considered to be curated.
- Reviewed: In addition to the sequence review carried out for validated records, reviewed records have undergone additional functional review. This may include a literature review, writing a brief gene summary, incorporating RefSeq attribute data, adding feature annotation to the sequence record, adding gene symbols and alternate names, or revising the protein name.
For both of the above review categories, NCBI curation staff may contact collaborating groups to reconcile data conflicts or inconsistencies. We collaborate with the human (HGNC), mouse (MGI), and rat (RGD) official nomenclature groups.
RefSeq is also a partner in the Consensus CDS (CCDS) collaboration which aims to harmonize protein-coding gene annotation at the major genome browsers available at NCBI, Ensembl, and UCSC.
Summary statistics with links to Nucleotide & Protein resources (August 2016):
RefSeq record content
We value clear communication regarding data provenance. RefSeq records include specific information in the COMMENT section that details which INSDC accessions were used to compile the represented sequence. The COMMENT section also includes an Evidence Data report of additional INSDC accessions or RNA-Seq samples that support the exon combination (or exon pairs) reflected in the transcript. We also report the Evidence and Conclusion Ontology (ECO) identifier in this section. In addition, curated RefSeq attributes are reported in this section of the record.
- See example records with Evidence Data comments.
- See example records with RefSeq Attribute comments.
All RefSeq transcript, protein, and genomic region records include HGNC provided official gene symbols and names (as available) and the NCBI Gene ID which provides a link to NCBI's Gene resource. The NCBI Gene ID is a unique gene identifier and we recommend using this instead of the gene symbol (which may change over time) for tracking.
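The recommendation to track genes by Gene ID rather than by symbol can be illustrated with a small sketch. The `GeneTable` class and the replacement symbol `NEWSYM` are hypothetical, invented for this example; only the pairing of PTEN with GeneID 5728 comes from this page.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GeneEntry:
    gene_id: int                      # stable NCBI Gene ID
    symbol: str                       # current symbol, which may change
    previous_symbols: List[str] = field(default_factory=list)


class GeneTable:
    """Toy lookup table keyed by Gene ID so that symbol changes do not break references."""

    def __init__(self) -> None:
        self._by_id: Dict[int, GeneEntry] = {}

    def add(self, gene_id: int, symbol: str) -> None:
        self._by_id[gene_id] = GeneEntry(gene_id, symbol)

    def rename(self, gene_id: int, new_symbol: str) -> None:
        entry = self._by_id[gene_id]
        entry.previous_symbols.append(entry.symbol)
        entry.symbol = new_symbol

    def get(self, gene_id: int) -> GeneEntry:
        return self._by_id[gene_id]


if __name__ == "__main__":
    table = GeneTable()
    table.add(5728, "PTEN")
    # "NEWSYM" is a made-up symbol used only to simulate a nomenclature change.
    table.rename(5728, "NEWSYM")
    print(table.get(5728))
```

Any data stored against the key 5728 remains reachable after the rename, whereas a table keyed by symbol would need every reference updated.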
RefSeq records include relevant feature annotation including the coding sequence region (CDS), publications (a limited set; more are available in NCBI's Gene resource), and in some cases functional regions such as sequence motifs. When a RefSeq protein sequence is the same length as, and nearly identical to, a UniProtKB/Swiss-Prot record, we propagate curated Swiss-Prot preferred names and select feature annotations to the RefSeq record. RefSeq protein records for human preferentially use UniProtKB/Swiss-Prot records as the name authority.
The reported RefSeq attributes include the following:
- bicistronic transcript
- CDS uses downstream in-frame AUG
- endogenous retrovirus
- gene product(s) localized to mito.
- gene product(s) localized to plastid
- imprinted gene
- inferred exon combination
- multifunctional gene product(s)
- NMD candidate
- non-AUG initiation codon
- polyA required for stop codon
- polymorphic pseudogene
- protein contains selenocysteine
- protein has antimicrobial activity
- readthrough transcript
- regulatory uORF
- replication-dependent histone
- replication-independent histone
- ribosomal slippage
- stop codon readthrough
- undergoes RNA editing
RefSeqGene project
RefSeqGene, a subset of NCBI's Reference Sequence (RefSeq) project, provides genomic sequences to be used as reference standards for well-characterized genes. These sequences, labeled with the keyword RefSeqGene in NCBI's Nucleotide database, serve as a stable foundation for reporting sequence variants and for establishing conventions for numbering exons and introns. RefSeqGene records typically include 5 kb of 5' flanking sequence and 2 kb of 3' flanking sequence. RefSeqGene records include representation of a subset of RefSeq (or, at times, GenBank) mRNAs and coding regions that have been selected to serve as reference standards. RefSeqGene records provide a stable coordinate system, which is easier to manage than chromosome coordinates, for reporting exonic, intronic, or flanking sequence variation using the established Human Genome Variation Society (HGVS) standard.
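To make the idea of a gene-specific coordinate system concrete, the sketch below converts a position on a RefSeqGene-style record to a chromosome position and formats a simple HGVS genomic substitution. Everything here is an assumption made for illustration: the `Placement` class, the function names, the accession `NG_000000.1`, and the coordinates are hypothetical, and the conversion assumes an ungapped placement of the record on the chromosome, which real alignments do not guarantee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Hypothetical ungapped placement of a gene record on a chromosome (1-based, inclusive)."""
    chrom_start: int
    chrom_end: int
    strand: str  # "+" or "-"


def record_to_chrom(pos: int, placement: Placement) -> int:
    """Map a 1-based record position to a 1-based chromosome position."""
    length = placement.chrom_end - placement.chrom_start + 1
    if not 1 <= pos <= length:
        raise ValueError(f"position {pos} is outside the record (length {length})")
    if placement.strand == "+":
        return placement.chrom_start + pos - 1
    if placement.strand == "-":
        # Record position 1 corresponds to the highest chromosome coordinate.
        return placement.chrom_end - pos + 1
    raise ValueError(f"unknown strand: {placement.strand!r}")


def hgvs_substitution(accession: str, pos: int, ref: str, alt: str) -> str:
    """Format a genomic substitution in HGVS style, e.g. 'ACC:g.123A>G'."""
    return f"{accession}:g.{pos}{ref}>{alt}"


if __name__ == "__main__":
    plus = Placement(chrom_start=100_000, chrom_end=119_999, strand="+")
    # With 5 kb of 5' flanking sequence, position 5001 would be the first base after the flank.
    print(record_to_chrom(5001, plus))
    print(hgvs_substitution("NG_000000.1", 5001, "A", "G"))
```

The record-level description stays the same even if the chromosome coordinates shift in a later assembly version; only the `Placement` values would need updating. Note that for a minus-strand placement the bases themselves would also need to be complemented when moving between the two coordinate systems, which this sketch does not do.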