NCBI Embiotoca jacksoni Annotation Release GCF_022577435.1-RS_2024_10

The genome sequence records for Embiotoca jacksoni RefSeq assembly GCF_022577435.1 (fEmbJac1.0.p) were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
BUSCO results: Annotation completeness assessed with BUSCO
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as "GCF_022577435.1-RS_2024_10".

Date of Entrez queries for transcripts and proteins: Oct 11 2024
Date of submission of annotation to the public databases: Oct 16 2024
Software version: 10.3
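
The release name above appears to combine the RefSeq assembly accession with a year and month tag. A minimal sketch of splitting it into parts follows; the pattern is an assumption inferred from this single name, not a documented NCBI format, and parse_release_name is a hypothetical helper.

```python
import re

# Assumed layout: <RefSeq assembly accession>-RS_<year>_<month>.
# Inferred from the one name in this report, not from a formal specification.
RELEASE_NAME = re.compile(r"^(GCF_\d{9}\.\d+)-RS_(\d{4})_(\d{2})$")


def parse_release_name(name):
    """Split an annotation release name into accession, year and month."""
    match = RELEASE_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized annotation release name: {name!r}")
    accession, year, month = match.groups()
    return {"accession": accession, "year": int(year), "month": int(month)}


parsed = parse_release_name("GCF_022577435.1-RS_2024_10")
print(parsed)
```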

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
fEmbJac1.0.p	GCF_022577435.1	UCLA	03-15-2022	Reference	unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	fEmbJac1.0.p
Genes and pseudogenes	24,580
protein-coding	22,292
non-coding	1,848
Transcribed pseudogenes	0
Non-transcribed pseudogenes	361
genes with variants	8,237
Immunoglobulin/T-cell receptor gene segments	71
other	8
mRNAs	40,320
fully-supported	38,859
with > 5% ab initio	631
partial	130
with filled gap(s)	36
known RefSeq (NM_)	0
model RefSeq (XM_)	40,320
non-coding RNAs	2,632
fully-supported	1,595
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	2,031
pseudo transcripts	0
fully-supported	0
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	0
CDSs	40,391
fully-supported	38,859
with > 5% ab initio	726
partial	124
with major correction(s)	909
known RefSeq (NP_)	0
model RefSeq (XP_)	40,320
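
As a consistency check, the gene-level rows of the table above can be summed and compared with the reported total. This sketch assumes the six rows listed in the code are disjoint categories and that "genes with variants" is an overlapping subset rather than a separate category; the report does not state this explicitly.

```python
# Gene-level counts copied from the feature counts table above.
gene_counts = {
    "protein-coding": 22_292,
    "non-coding": 1_848,
    "transcribed pseudogenes": 0,
    "non-transcribed pseudogenes": 361,
    "Ig/TCR gene segments": 71,
    "other": 8,
}
reported_total = 24_580  # "Genes and pseudogenes" row

# Assumption: the categories are disjoint, so their sum should equal the total.
category_sum = sum(gene_counts.values())
print(category_sum, category_sum == reported_total)

# Share of mRNAs that are fully supported by evidence.
mrnas = 40_320
fully_supported = 38_859
fully_supported_share = 100.0 * fully_supported / mrnas
print(f"{fully_supported_share:.1f}% of mRNAs are fully supported")
```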

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	24,148	16,950	7,682	55	658,484
All transcripts	42,952	3,771	3,040	55	96,148
mRNA	40,320	3,930	3,170	186	96,148
misc_RNA	583	3,341	2,799	159	14,150
tRNA	601	74	73	71	87
lncRNA	1,012	1,393	935	92	17,015
snoRNA	195	115	94	60	332
snRNA	90	139	141	55	195
rRNA	143	544	119	119	3,935
Single-exon transcripts	904	1,949	1,696	186	28,698
coding transcripts (NM_/XM_)	904	1,949	1,696	186	28,698
CDSs	40,320	2,272	1,581	114	94,851
Exons	270,326	303	136	1	29,418
in coding transcripts (NM_/XM_)	266,369	301	136	1	29,418
in non-coding transcripts (NR_/XR_)	8,917	305	132	10	14,072
Introns	246,014	1,658	397	30	327,067
in coding transcripts (NM_/XM_)	243,379	1,645	396	30	305,561
in non-coding transcripts (NR_/XR_)	7,517	2,011	475	30	327,067

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.8	1	1	50
Number of exons per transcript	13.45	10	1	251
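
The mean number of transcripts per gene can be approximately reproduced from counts reported earlier. This sketch assumes the mean is taken over the 24,148 non-pseudogene genes and all 42,952 transcripts, which the report does not spell out.

```python
# Counts copied from the feature tables above.
genes = 24_148          # "Genes" row (pseudogenes excluded)
mrnas = 40_320          # "mRNA" row
noncoding_rnas = 2_632  # "non-coding RNAs" row
transcripts = 42_952    # "All transcripts" row

# mRNAs plus non-coding RNAs account for all transcripts.
assert mrnas + noncoding_rnas == transcripts

# Assumption: the reported mean is simply transcripts divided by genes.
mean_transcripts_per_gene = transcripts / genes
print(f"{mean_transcripts_per_gene:.2f} transcripts per gene")
```

Rounded to one decimal place, the value agrees with the 1.8 given in the table.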

BUSCO analysis of gene annotation

BUSCO v5.7.1 was run in "protein" mode on the annotated gene set, using the single longest protein per gene and the actinopterygii_odb10 lineage dataset. Results are reported for the gene set from the primary assembly unit and are presented in BUSCO notation.

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 22,292 coding genes, 21,164 genes had a protein with an alignment covering 50% or more of the query and 10,108 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
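
The coverage definitions above can be sketched as a single function applied to either the query or the target. The coordinates below are hypothetical, and 1-based inclusive positions for a single ungapped alignment span are an assumption; the pipeline's own calculation may treat gaps or multiple alignment segments differently.

```python
def coverage_percent(aligned_start, aligned_end, sequence_length):
    """Percent of a sequence included in one alignment.

    Coordinates are assumed to be 1-based and inclusive.
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    if not 1 <= aligned_start <= aligned_end <= sequence_length:
        raise ValueError("alignment coordinates fall outside the sequence")
    aligned_length = aligned_end - aligned_start + 1
    return 100.0 * aligned_length / sequence_length


# Hypothetical alignment: residues 11-410 of a 500-residue annotated protein
# (query) aligned to residues 1-400 of a 420-residue Swiss-Prot protein (target).
query_cov = coverage_percent(11, 410, 500)
target_cov = coverage_percent(1, 400, 420)
print(f"query coverage {query_cov:.1f}%, target coverage {target_cov:.1f}%")

# Shares of coding genes above the two query-coverage thresholds,
# using the counts reported in the paragraph above.
coding_genes = 22_292
share_cov50 = 100.0 * 21_164 / coding_genes
share_cov95 = 100.0 * 10_108 / coding_genes
print(f"{share_cov50:.1f}% of genes at >=50%, {share_cov95:.1f}% at >=95%")
```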

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker (if calculated), for each assembly. RepeatMasker results are only calculated for organisms with complete Dfam HMM model collections.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with WindowMasker
fEmbJac1.0.p	GCF_022577435.1	28.09%
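
A masked percentage like the one above can be estimated from a soft-masked FASTA file. The sketch below assumes masked bases are written in lowercase, which is a common convention but is not stated in this report; masked_percent and the toy sequence are hypothetical.

```python
import io


def masked_percent(fasta_handle):
    """Return the percentage of lowercase (assumed soft-masked) bases."""
    masked = 0
    total = 0
    for line in fasta_handle:
        if line.startswith(">"):
            continue  # skip sequence headers
        sequence = line.strip()
        total += len(sequence)
        masked += sum(1 for base in sequence if base.islower())
    if total == 0:
        raise ValueError("no sequence found")
    return 100.0 * masked / total


# Toy input: 20 bases, 8 of them lowercase.
toy_fasta = io.StringIO(">scaffold_1\nACGTacgtAC\nGTACGTacgt\n")
pct = masked_percent(toy_fasta)
print(f"{pct:.2f}% masked")
```

For a real assembly, the same function could be given an open file handle instead of the in-memory example.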

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez Nucleotide, Entrez Protein, and SRA, and aligned to the genome.

Transcript alignments

The alignments of the following transcripts with Splign were used for gene prediction:

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species TSA	67,839	67,330 (99.25%)	62,773 (92.53%)	99.75%	99.79%
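
The "Number (%)" cells in the table above can be parsed and the percentages recomputed. This sketch assumes both percentages are relative to the number of sequences retrieved from Entrez, which matches the figures here but is not stated explicitly; parse_count_percent is a hypothetical helper.

```python
import re

# Matches cells such as "67,330 (99.25%)".
CELL = re.compile(r"^([\d,]+) \((\d+(?:\.\d+)?)%\)$")


def parse_count_percent(cell):
    """Parse a 'count (percent%)' cell into (int, float)."""
    match = CELL.match(cell.strip())
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    count = int(match.group(1).replace(",", ""))
    percent = float(match.group(2))
    return count, percent


retrieved = 67_839  # Same-species TSA sequences retrieved from Entrez
aligned, aligned_pct = parse_count_percent("67,330 (99.25%)")
passed, passed_pct = parse_count_percent("62,773 (92.53%)")

# Assumption: the denominator for both percentages is the retrieved count.
recomputed_aligned = round(100.0 * aligned / retrieved, 2)
recomputed_passed = round(100.0 * passed / retrieved, 2)
print(recomputed_aligned, recomputed_passed)
```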

RNA-Seq alignments

The alignments of the following RNA-Seq reads with STAR were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	Aggregate of all aligned samples	975,860,844	85%	22%	277,275
SAMN20959966	replicate_4A (Embiotoca jacksoni, SAMN20959966)	16,896,989	81%	21%	183,823
SAMN20959967	replicate_5A (Embiotoca jacksoni, SAMN20959967)	18,360,819	83%	22%	180,694
SAMN20959968	replicate_1A (Embiotoca jacksoni, SAMN20959968)	14,062,823	81%	22%	182,707
SAMN20959969	replicate_2A (Embiotoca jacksoni, SAMN20959969)	14,265,644	81%	22%	188,718
SAMN20959970	replicate_3A (Embiotoca jacksoni, SAMN20959970)	14,641,971	78%	21%	181,680
SAMN20959971	replicate_4B (Embiotoca jacksoni, SAMN20959971)	15,745,242	78%	22%	185,212
SAMN20959972	replicate_5B (Embiotoca jacksoni, SAMN20959972)	16,144,150	80%	21%	188,887
SAMN20959973	replicate_1B (Embiotoca jacksoni, SAMN20959973)	14,233,291	81%	21%	182,214
SAMN20959974	replicate_2B (Embiotoca jacksoni, SAMN20959974)	16,931,799	84%	21%	190,697
SAMN20959975	replicate_3B (Embiotoca jacksoni, SAMN20959975)	22,971,966	86%	21%	193,147
SAMN20959976	replicate_4C (Embiotoca jacksoni, SAMN20959976)	19,418,299	87%	21%	185,188
SAMN20959977	replicate_5C (Embiotoca jacksoni, SAMN20959977)	20,367,941	86%	22%	190,596
SAMN20959978	replicate_1C (Embiotoca jacksoni, SAMN20959978)	20,300,681	85%	21%	188,746
SAMN20959979	replicate_2C (Embiotoca jacksoni, SAMN20959979)	19,932,004	86%	22%	196,489
SAMN20959980	replicate_3C (Embiotoca jacksoni, SAMN20959980)	19,022,226	87%	21%	191,465
SAMN20959981	replicate_4D (Embiotoca jacksoni, SAMN20959981)	19,809,302	87%	19%	174,053
SAMN20959982	replicate_5D (Embiotoca jacksoni, SAMN20959982)	19,687,675	87%	21%	194,000
SAMN20959983	replicate_1D (Embiotoca jacksoni, SAMN20959983)	17,275,903	87%	23%	192,791
SAMN20959984	replicate_2D (Embiotoca jacksoni, SAMN20959984)	20,512,310	88%	22%	187,349
SAMN20959985	replicate_3D (Embiotoca jacksoni, SAMN20959985)	19,122,863	88%	22%	193,655
SAMN20959986	replicate_09-1_brain (Embiotoca jacksoni, SAMN20959986)	16,363,592	88%	18%	166,511
SAMN20959987	replicate_09-2_brain (Embiotoca jacksoni, SAMN20959987)	12,887,602	89%	19%	169,187
SAMN20959988	replicate_10-1_brain (Embiotoca jacksoni, SAMN20959988)	13,288,019	88%	15%	155,989
SAMN20959989	replicate_10-2_brain (Embiotoca jacksoni, SAMN20959989)	11,772,102	85%	18%	80,399
SAMN20959990	replicate_11-1_brain (Embiotoca jacksoni, SAMN20959990)	16,407,788	95%	34%	86,917
SAMN20959991	replicate_12-1_brain (Embiotoca jacksoni, SAMN20959991)	17,837,003	95%	33%	70,605
SAMN20959992	replicate_12-2_brain (Embiotoca jacksoni, SAMN20959992)	17,593,174	94%	35%	61,470
SAMN20959993	replicate_12-3_brain (Embiotoca jacksoni, SAMN20959993)	16,627,944	95%	35%	77,957
SAMN20959994	replicate_09-1_muscle (Embiotoca jacksoni, SAMN20959994)	12,798,146	93%	34%	53,196
SAMN20959995	replicate_09-2_muscle (Embiotoca jacksoni, SAMN20959995)	11,370,942	92%	36%	75,322
SAMN20959996	replicate_10-1_muscle (Embiotoca jacksoni, SAMN20959996)	11,153,890	94%	37%	59,082
SAMN20959997	replicate_10-2_muscle (Embiotoca jacksoni, SAMN20959997)	10,164,618	95%	36%	48,836
SAMN20959999	replicate_12-1_muscle (Embiotoca jacksoni, SAMN20959999)	18,870,286	91%	16%	121,834
SAMN20960000	replicate_12-2_muscle (Embiotoca jacksoni, SAMN20960000)	18,097,244	89%	17%	133,946
SAMN20960001	replicate_12-3_muscle (Embiotoca jacksoni, SAMN20960001)	18,752,290	90%	17%	143,737
SAMN40863702	muscle (Embiotoca jacksoni, SAMN40863702)	116,554,626	85%	26%	191,191
SAMN41792258	brain (Embiotoca jacksoni, SAMN41792258)	100,144,102	79%	15%	209,716
SAMN41792259	gills (Embiotoca jacksoni, SAMN41792259)	87,055,576	82%	14%	162,574
SAMN41792260	liver (Embiotoca jacksoni, SAMN41792260)	88,418,002	82%	17%	134,364
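
Rows of the per-sample table can be loaded for further summaries, such as a read-weighted alignment rate. The sketch below uses only the first three rows as an illustration and assumes tab-separated fields in the column order shown in the table header; parse_rows is a hypothetical helper.

```python
import csv
import io

# First three rows of the per-sample table above (tab-separated, no header).
rows_text = (
    "SAMN20959966\treplicate_4A (Embiotoca jacksoni, SAMN20959966)\t16,896,989\t81%\t21%\t183,823\n"
    "SAMN20959967\treplicate_5A (Embiotoca jacksoni, SAMN20959967)\t18,360,819\t83%\t22%\t180,694\n"
    "SAMN20959968\treplicate_1A (Embiotoca jacksoni, SAMN20959968)\t14,062,823\t81%\t22%\t182,707\n"
)


def parse_rows(text):
    """Convert table rows into dictionaries with numeric fields."""
    samples = []
    for sample_id, track, reads, aligned, with_introns, introns in csv.reader(
        io.StringIO(text), delimiter="\t"
    ):
        samples.append(
            {
                "sample": sample_id,
                "track": track,
                "reads": int(reads.replace(",", "")),
                "pct_aligned": int(aligned.rstrip("%")),
                "pct_with_introns": int(with_introns.rstrip("%")),
                "introns": int(introns.replace(",", "")),
            }
        )
    return samples


samples = parse_rows(rows_text)
total_reads = sum(s["reads"] for s in samples)

# Weight each sample's alignment rate by its read count.
weighted_aligned = sum(s["reads"] * s["pct_aligned"] for s in samples) / total_reads
print(total_reads, f"{weighted_aligned:.1f}%")
```

Because the table reports whole-number percentages, any weighted figure derived this way is only approximate.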

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR15597406	SRX11895051	SRP333880	SAMN20959966	16,896,989	81%	21%
SRR15597405	SRX11895052	SRP333880	SAMN20959967	18,360,819	83%	22%
SRR15597394	SRX11895063	SRP333880	SAMN20959968	14,062,823	81%	22%
SRR15597383	SRX11895074	SRP333880	SAMN20959969	14,265,644	81%	22%
SRR15597376	SRX11895081	SRP333880	SAMN20959970	14,641,971	78%	21%
SRR15597375	SRX11895082	SRP333880	SAMN20959971	15,745,242	78%	22%
SRR15597374	SRX11895083	SRP333880	SAMN20959972	16,144,150	80%	21%
SRR15597373	SRX11895084	SRP333880	SAMN20959973	14,233,291	81%	21%
SRR15597372	SRX11895085	SRP333880	SAMN20959974	16,931,799	84%	21%
SRR15597371	SRX11895086	SRP333880	SAMN20959975	22,971,966	86%	21%
SRR15597404	SRX11895053	SRP333880	SAMN20959976	19,418,299	87%	21%
SRR15597403	SRX11895054	SRP333880	SAMN20959977	20,367,941	86%	22%
SRR15597402	SRX11895055	SRP333880	SAMN20959978	20,300,681	85%	21%
SRR15597401	SRX11895056	SRP333880	SAMN20959979	19,932,004	86%	22%
SRR15597400	SRX11895057	SRP333880	SAMN20959980	19,022,226	87%	21%
SRR15597399	SRX11895058	SRP333880	SAMN20959981	19,809,302	87%	19%
SRR15597398	SRX11895059	SRP333880	SAMN20959982	19,687,675	87%	21%
SRR15597397	SRX11895060	SRP333880	SAMN20959983	17,275,903	87%	23%
SRR15597396	SRX11895061	SRP333880	SAMN20959984	20,512,310	88%	22%
SRR15597395	SRX11895062	SRP333880	SAMN20959985	19,122,863	88%	22%
SRR15597393	SRX11895064	SRP333880	SAMN20959986	16,363,592	88%	18%
SRR15597392	SRX11895065	SRP333880	SAMN20959987	12,887,602	89%	19%
SRR15597391	SRX11895066	SRP333880	SAMN20959988	13,288,019	88%	15%
SRR15597390	SRX11895067	SRP333880	SAMN20959989	11,772,102	85%	18%
SRR15597389	SRX11895068	SRP333880	SAMN20959990	16,407,788	95%	34%
SRR15597388	SRX11895069	SRP333880	SAMN20959991	17,837,003	95%	33%
SRR15597387	SRX11895070	SRP333880	SAMN20959992	17,593,174	94%	35%
SRR15597386	SRX11895071	SRP333880	SAMN20959993	16,627,944	95%	35%
SRR15597385	SRX11895072	SRP333880	SAMN20959994	12,798,146	93%	34%
SRR15597384	SRX11895073	SRP333880	SAMN20959995	11,370,942	92%	36%
SRR15597382	SRX11895075	SRP333880	SAMN20959996	11,153,890	94%	37%
SRR15597381	SRX11895076	SRP333880	SAMN20959997	10,164,618	95%	36%
SRR15597379	SRX11895078	SRP333880	SAMN20959999	18,870,286	91%	16%
SRR15597378	SRX11895079	SRP333880	SAMN20960000	18,097,244	89%	17%
SRR15597377	SRX11895080	SRP333880	SAMN20960001	18,752,290	90%	17%
SRR29376841	SRX24891212	SRP513449	SAMN40863702	116,554,626	85%	26%
SRR29376844	SRX24891209	SRP513449	SAMN41792258	100,144,102	79%	15%
SRR29376843	SRX24891210	SRP513449	SAMN41792259	87,055,576	82%	14%
SRR29376842	SRX24891211	SRP513449	SAMN41792260	88,418,002	82%	17%

Protein alignments

The alignments of the following proteins with ProSplign were used for gene prediction:

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Betta splendens high-quality model RefSeq (XP_)	18,289	18,017 (98.51%)	18,017 (98.51%)	70.73%	80.81%
Xiphophorus couchianus high-quality model RefSeq (XP_)	18,371	18,129 (98.68%)	18,129 (98.68%)	70.40%	80.40%
Actinopterygii GenBank	95,375	90,427 (94.81%)	90,427 (94.81%)	69.25%	80.32%
Actinopterygii known RefSeq (NP_)	25,910	24,537 (94.70%)	24,537 (94.70%)	68.26%	78.00%
Danio rerio high-quality model RefSeq (XP_)	8,031	7,681 (95.64%)	7,681 (95.64%)	62.83%	71.69%
Esox lucius high-quality model RefSeq (XP_)	18,508	18,002 (97.27%)	18,002 (97.27%)	67.78%	76.27%
Homo sapiens known RefSeq (NP_)	67,680	57,387 (84.79%)	57,387 (84.79%)	66.22%	68.15%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
BUSCO: Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. Molecular Biology and Evolution 2021, 38(10):4647-4654
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-4.0. 2013-2015. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
STAR: Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. Bioinformatics 2013, 29(1):15-21
Minimap2: Li H. Bioinformatics 2018, 34(18):3094-3100
