Contamination in Sequence Databases

New: Try the NCBI Foreign Contamination Screen (FCS) yourself!

Definition

A contaminated sequence is one that does not faithfully represent the genetic information from the biological source organism/organelle because it contains one or more sequence segments of foreign origin.

Sources of Contamination

The most common sources of contamination are accessory DNAs deliberately attached to the DNA/RNA under investigation:

Vectors: DNA/cDNAs from the biological source organism/organelle are usually inserted into a cloning vector (e.g., plasmid, phage, cosmid, BAC, PAC, YAC) so that they can be cloned, propagated, and manipulated. Sequencing of such constructs frequently produces raw sequences that include segments derived from vector. Failure to identify and remove all of the vector sequence results in a finished sequence that is contaminated.
Adapters, linkers, and PCR primers: Various oligonucleotides can be attached to the DNA/RNA under investigation as part of the cloning or amplification process. The sequences of these oligonucleotides are therefore often present in raw sequences and will contaminate the finished sequence unless they are identified and removed.

Unintended events can also introduce contamination from other sources:

Transposons and insertion sequences: A transposable element from the cloning host (generally Escherichia coli or yeast) occasionally will insert itself into the cloned DNA/RNA while the clone is being propagated. The foreign transposable element will then be sequenced as if it were part of the DNA/RNA under investigation. The chance of a transposon or insertion sequence inserting into the clone increases with the size of the DNA insert; thus, large genomic clones, e.g., BACs, are more likely to be the target for insertion of transposable elements than cDNA clones.
Impurities in the DNA/RNA under investigation: Nucleic acid preparations may contain DNA/RNA from sources other than the intended one. Sequences obtained from an impure nucleic acid preparation may be partly or entirely derived from the impurities. Examples of such impurities include:
- nucleic acids from an organelle present in a cellular DNA/RNA preparation, or cellular nucleic acids present in a preparation of DNA/RNA from an organelle (e.g., mitochondrial DNA contamination in genomic DNA preparations)
- DNA/RNA present in a reagent used in the isolation, purification, or cloning procedures (e.g., yeast genomic DNA present in tRNA used as a carrier in the preparation of cDNAs)
- nucleic acids from other organisms present in the material from which the DNA/RNA under investigation was isolated (e.g., microbes growing in contaminated cell cultures)
- other DNAs/RNAs used in the laboratory (e.g., from accidental mixing of samples or cross contamination from dirty pipettes, tips, tubes, or equipment)

Consequences of Contamination

The sources of contamination discussed in the previous section all result in contaminated nucleotide sequences. Contamination, therefore, has its greatest impact on nucleotide sequence analyses. Protein sequence analyses can also be affected by contamination, however, because the foreign sequence may add or extend an open reading frame, thus altering the predicted translation product(s).

The primary consequences of contamination are:

Time and effort wasted on meaningless analyses: The interpretation of any analysis performed on a contaminated sequence can be confounded by the presence of segments of foreign origin. For example, a similarity search can produce hits that are based only on shared segments of foreign sequence. Reviewing such hits to determine their possible significance wastes time and effort.
Erroneous conclusions drawn about the biological significance of the sequence: The foreign segments in a contaminated sequence can generate misleading data when the sequence is analyzed for possible functions and evolutionary relationships.
Misassembly of sequence contigs and false clustering of Expressed Sequence Tags (ESTs): Sequences contaminated with the same foreign sequence can be aligned via the shared foreign segment. This can lead to joining or grouping of unrelated sequences.
Delay in the release of the sequence in a public database: Failure to remove segments of foreign origin before submission significantly increases the time needed to process the submission, hence delaying the release of the sequence.
Pollution of public databases: Sequences from public databases are widely used for many different types of analyses. If contaminated sequences are deposited in a public database, they can confound subsequent analyses of any data sets that include the contaminated sequences.

Detection of Contamination

Vector Contamination

The primary approach to screening nucleic acid sequences for vector contamination is to run a sequence similarity search against a database of vector sequences. The preferred tool for conducting such a search is NCBI's VecScreen. VecScreen detects contamination by running a BLAST sequence similarity search against the UniVec vector sequence database. VecScreen then categorizes the matches, eliminates redundant hits, and shows the location of contaminating and suspect segments on a simple graphical display. Screens for vector contamination may also be conducted by running a sequence similarity search, such as BLAST, against other vector sequence databases, for example the artificial sequences subset of NCBI's nr/nt database, or the EMVEC vector database from the European Bioinformatics Institute (EBI).
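As a rough illustration of the "eliminate redundant hits and locate the contaminated segments" step, the sketch below collapses BLAST hits into non-overlapping intervals on each query. It is not VecScreen's actual algorithm. It assumes the hits are in the standard 12-column BLAST tabular format (query start and end in columns 7 and 8, bit score in column 12); the function name `vector_segments` and the `min_bitscore` parameter are hypothetical.

```python
from collections import defaultdict


def vector_segments(blast_tab_text, min_bitscore=0.0):
    """Merge tabular BLAST hits into non-overlapping query intervals.

    Returns {query_id: [(start, end), ...]} using the 1-based, inclusive
    coordinates that BLAST reports.
    """
    hits = defaultdict(list)
    for line in blast_tab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id = fields[0]
        qstart, qend = int(fields[6]), int(fields[7])
        if float(fields[11]) < min_bitscore:
            continue  # ignore hits below the (illustrative) score cutoff
        hits[query_id].append((min(qstart, qend), max(qstart, qend)))

    merged = {}
    for query_id, intervals in hits.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= out[-1][1] + 1:  # overlapping or directly adjacent
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[query_id] = [tuple(iv) for iv in out]
    return merged
```

Each returned interval marks a stretch of the query that matched at least one vector sequence, so overlapping matches to several vectors are reported once.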

Another method used to detect vector contamination is to search the sequence for restriction sites. (Software for restriction site analysis is widely available. Sequences can also be analyzed via the Internet using Webcutter.) Clusters of restriction sites often indicate sequence derived from the multiple cloning site (MCS) of a vector. The presence of even a single site known to have been used for cloning indicates a potential junction between vector sequence and sequence from the biological source organism/organelle. The restriction site analysis method has the advantage of being able to reveal contamination with an MCS, even if the vector sequence is not in any database. The ability to detect vector contamination using this approach alone is limited, however, because it is hard to distinguish between a single cloning site and a naturally occurring restriction site, and because the cloning process does not always recreate the sites used for cloning. Restriction site analysis is therefore best used to supplement a sequence similarity based search, such as VecScreen.
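The idea of flagging clusters of restriction sites can be sketched in a few lines. This is a minimal example, not a replacement for dedicated restriction analysis software. The enzyme list is a small illustrative subset, and the `window` and `min_distinct` values are arbitrary assumptions that would need tuning. All listed recognition sequences are palindromic, so scanning one strand is sufficient.

```python
# A small, illustrative subset of enzymes commonly found in multiple cloning sites.
SITES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
    "PstI": "CTGCAG",
    "SalI": "GTCGAC",
    "XbaI": "TCTAGA",
    "KpnI": "GGTACC",
    "SacI": "GAGCTC",
}


def find_sites(seq, sites=SITES):
    """Return a sorted list of (0-based position, enzyme name) for every site."""
    seq = seq.upper()
    found = []
    for name, motif in sites.items():
        pos = seq.find(motif)
        while pos != -1:
            found.append((pos, name))
            pos = seq.find(motif, pos + 1)
    return sorted(found)


def site_clusters(seq, window=60, min_distinct=3, sites=SITES):
    """Report (start, end, enzymes) where at least `min_distinct` different
    sites begin within `window` bases of the first site in the group."""
    found = find_sites(seq, sites)
    clusters = []
    i = 0
    while i < len(found):
        start = found[i][0]
        # found is sorted by position, so this group is a contiguous run
        group = [hit for hit in found[i:] if hit[0] - start < window]
        names = sorted({name for _, name in group})
        if len(names) >= min_distinct:
            last_pos, last_name = group[-1]
            clusters.append((start, last_pos + len(sites[last_name]), names))
            i += len(group)
        else:
            i += 1
    return clusters
```

A lone site produces no cluster, which reflects the limitation noted above: a single site cannot be distinguished from a naturally occurring one by this method alone.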

Adapter, Linker, and Primer Contamination

VecScreen can be used to detect contamination with many of the adapters, linkers, and PCR primers used in the most popular cDNA cloning strategies because the UniVec database includes the sequences for such oligonucleotides. The most reliable method for detecting adapter, linker, and primer contamination, however, is to run a nucleotide sequence similarity search, such as BLAST, using the sequences of the specific oligonucleotides used to clone and/or amplify the DNA/RNA under investigation.
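For short oligonucleotides whose sequences are known, even a simple scan can serve as a first check before running BLAST. The sketch below looks for an oligonucleotide on both strands and tolerates a few mismatches. It is a brute-force substitute for a similarity search, not BLAST itself: it does not handle insertions or deletions. The function names and the `max_mismatches` default are assumptions for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_oligo(seq, oligo, max_mismatches=1):
    """Return sorted (0-based position, strand, mismatches) for every window
    of `seq` that matches `oligo` or its reverse complement with at most
    `max_mismatches` substitutions."""
    seq = seq.upper()
    hits = []
    for strand, probe in (("+", oligo.upper()), ("-", reverse_complement(oligo))):
        n = len(probe)
        for i in range(len(seq) - n + 1):
            mismatches = sum(1 for a, b in zip(seq[i:i + n], probe) if a != b)
            if mismatches <= max_mismatches:
                hits.append((i, strand, mismatches))
    return sorted(hits)
```

Hits near either end of a read are the usual signature of adapter or primer carry-over. An oligonucleotide that is its own reverse complement will be reported on both strands.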

Other Contamination

Nucleic acid sequences can be screened for other segments of foreign origin by running a similarity search, e.g., BLAST, against databases of sequences from potential contaminants. Screens useful for detecting contamination include BLAST searches against databases of mitochondrial sequences, yeast sequences, and Escherichia coli sequences. A similarity search against an inclusive database, such as the BLAST nr database, can reveal contamination with sequences from other organisms. Sequences can be screened against the nr database from NCBI's BLAST Web page.

NCBI's Foreign Contamination Screen Tools

FCS-adaptor detects adaptor and vector contamination in genome sequences. FCS-GX detects contamination from foreign organisms in genome sequences. Users may also run these programs themselves to detect contamination in their sequences.