About the NCBI RefSeqGene Project
RefSeqGene, a subset of NCBI's Reference Sequence (RefSeq) project, defines genomic sequences to be used as reference standards for well-characterized genes. These sequences, labeled with the keyword RefSeqGene in NCBI's nucleotide database, serve as a stable foundation for reporting mutations, for establishing conventions for numbering exons and introns, and for defining the coordinates of other variations. RefSeq mRNA and protein sequences have long been used for this purpose, but they do not provide explicit coordinates for flanking or intronic sequence. RefSeq chromosome sequences do provide explicit coordinates regardless of any gene annotation, but their coordinate values are awkwardly large and change whenever the sequence is updated after a re-assembly. Sequences of the RefSeqGene project address both of these drawbacks by providing a more stable, gene-specific genomic sequence for each gene, including upstream and downstream flanking regions. If a RefSeqGene sequence must be modified, the sequence will be versioned and tools will be provided to facilitate conversion of coordinates. The RefSeqGene sequences are aligned to reference chromosomes, and both current and previous chromosome coordinates are available from those alignments.
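The relationship between gene-specific and chromosome coordinates described above can be sketched in a few lines of Python. This is only an illustration, not NCBI's conversion tooling: it assumes a single ungapped placement of the gene-specific sequence on the chromosome, and the `Placement` class, the function names, and all coordinate values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Hypothetical ungapped placement of a gene-specific sequence on a
    chromosome. Coordinates are 1-based and inclusive."""
    chrom_start: int  # lowest chromosome coordinate covered
    chrom_end: int    # highest chromosome coordinate covered
    strand: int       # +1: same direction as the chromosome, -1: reverse

    @property
    def length(self) -> int:
        return self.chrom_end - self.chrom_start + 1


def gene_to_chrom(pos: int, placement: Placement) -> int:
    """Convert a position on the gene-specific sequence to a chromosome position."""
    if not 1 <= pos <= placement.length:
        raise ValueError(f"position {pos} is outside the placed sequence")
    if placement.strand == 1:
        return placement.chrom_start + pos - 1
    return placement.chrom_end - pos + 1


def chrom_to_gene(pos: int, placement: Placement) -> int:
    """Convert a chromosome position to a position on the gene-specific sequence."""
    if not placement.chrom_start <= pos <= placement.chrom_end:
        raise ValueError(f"position {pos} is outside the placement")
    if placement.strand == 1:
        return pos - placement.chrom_start + 1
    return placement.chrom_end - pos + 1


# Made-up placements of the same gene-specific sequence on two assemblies.
previous_assembly = Placement(chrom_start=1_000_001, chrom_end=1_050_000, strand=1)
current_assembly = Placement(chrom_start=1_002_501, chrom_end=1_052_500, strand=1)

# The gene-specific coordinate stays fixed; only the chromosome value moves.
variant_pos = 5001
previous_chrom_pos = gene_to_chrom(variant_pos, previous_assembly)
current_chrom_pos = gene_to_chrom(variant_pos, current_assembly)
```

In this sketch a variant reported at position 5001 of the gene-specific sequence keeps that coordinate across assemblies, while its chromosome coordinate shifts with the placement. Real alignments may contain gaps or mismatches, which this simplified model ignores.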
The RefSeqGene project gratefully acknowledges the leadership and interest of Dr. M.L. Gulley and the Molecular Pathology Resource Committee of the College of American Pathologists.
See also M.L. Gulley et al., Clinical laboratory reports in molecular pathology.
RefSeqGene and LRG (Locus Reference Genomic)
The RefSeqGene project is an active member of the Locus Reference Genomic project. More details about the relationship between RefSeqGene and LRG are available here.
Sequence Selection
Sequences in the RefSeqGene set are well-supported and, to the extent possible, represent a prevalent, 'standard' allele.
Criterion 1. Well-supported
The default implementation of 'well-supported genomic sequence' is the sequence from the public reference assembly. The rationale for this definition is the quality of the genomic product. If the current public reference assembly is not well supported, then an alternate sequence will be selected, in consultation with gene-specific experts as available. When feasible, RefSeqGene sequences will be derived from a single clone, based on the assumption that no sequence errors were introduced in cloning, and that a single insert represents an example of a naturally occurring haplotype.
Another aspect of being well-supported is the placement of exons and coding regions on the RefSeqGene. The mRNAs used to define the exons are selected in consultation with locus-specific database curators and other domain experts. Almost all have also been reviewed by the Consensus CDS (CCDS) project, so the placement of splice junctions and of the initiation and termination codons is well-supported.
Criterion 2. Standard allele
The default implementation of 'standard allele' is the sequence from the public reference assembly. If, however, there is published evidence, evidence from locus-specific databases, or evidence from clinical testers that the sequence in the reference assembly is not standard, the RefSeqGene sequence will be constructed from an alternate source sequence or locally modified.
Transcript annotation
RefSeqGenes are annotated with one or more standard transcripts; other transcripts for the gene are represented by alignments.
Reference standard transcripts
For each gene, RefSeqGene selects one or more of the known transcripts for specific annotation. When the RefSeqGene is also an LRG, this selection is made in collaboration with LRG. The reference standard transcripts are selected based on usage by the clinical community, the scope of the exons represented, and requests by domain experts. Once they are selected, the placement of their exons and coding region is held stable.
Selecting standard transcripts encourages consistent use of a single reference when a cDNA-based HGVS expression is used to represent an alternate allele. The choice of reference standard is nonetheless somewhat arbitrary; other transcripts known for a gene are likely to be just as valid.
Aligned transcripts
Alignments of other transcripts for the gene are provided to facilitate mapping from the transcripts to the RefSeqGene/LRG.
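As a rough illustration of how an exon-level alignment supports this kind of mapping, the Python sketch below converts a position on a transcript to a position on the genomic sequence. It is a simplified model, not the project's own tooling: it assumes ungapped exon alignments, a transcript in the same orientation as the genomic sequence, and exons listed in transcript order. The function name and the exon coordinates are hypothetical.

```python
from typing import List, Tuple

# Each exon is (genomic_start, genomic_end), 1-based inclusive,
# listed in transcript order. The values are made up for illustration.
Exon = Tuple[int, int]


def transcript_to_genomic(tx_pos: int, exons: List[Exon]) -> int:
    """Map a 1-based transcript position to a 1-based genomic position
    by walking the exons and accumulating their lengths."""
    if tx_pos < 1:
        raise ValueError("transcript positions are 1-based")
    consumed = 0  # transcript bases covered by the exons already passed
    for start, end in exons:
        exon_len = end - start + 1
        if tx_pos <= consumed + exon_len:
            # The position falls in this exon; offset from the exon start.
            return start + (tx_pos - consumed - 1)
        consumed += exon_len
    raise ValueError(f"position {tx_pos} is beyond the end of the transcript")


# Hypothetical three-exon transcript aligned to a gene-specific sequence.
example_exons = [(5001, 5200), (7001, 7150), (9001, 9400)]

first_base = transcript_to_genomic(1, example_exons)
first_base_of_exon_2 = transcript_to_genomic(201, example_exons)
last_base = transcript_to_genomic(750, example_exons)
```

With these made-up exons, transcript position 201 lands on the first base of the second exon. Mapping an HGVS cDNA coordinate would additionally require the offset of the coding start within the transcript, which is omitted here.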