Filter DNA - TheBiologyBro

Input DNA

Paste any DNA text — numbered sequences, plain sequence, or FASTA. Input limit: 500,000,000 characters.

💡 Quick Summary

Filter DNA removes or replaces unwanted characters from a DNA sequence using a choice of preset patterns, then optionally converts the remaining text to uppercase or lowercase. It is the fastest way to strip line numbers, spaces, digits, or non-IUPAC characters from copied sequence text so it is ready for downstream tools.

📋 How to Use

Paste any DNA text — numbered sequences, FASTA bodies, or raw text — into the input area.
Choose what to Remove: pick the pattern that matches the characters you want to discard. The most common choice is "Remove non-GATCN characters", which strips everything except valid DNA bases and N.
Choose what to Replace with: by default the matched characters are deleted. You can instead substitute each one with a placeholder such as N, a gap character (-), or any other single character.
Choose a Case conversion: leave case unchanged, convert all bases to uppercase, or convert all to lowercase.
Click Filter. The output is a FASTA entry whose header line states the final sequence length.
Use Copy to copy the result to your clipboard.
Click Load Example to try the tool with a numbered DNA sequence — a format frequently encountered when copying from databases or textbooks.
Click Clear to reset.

🧮 Formulas & Logic

Core operation

filtered = input.replace( /pattern/g, replacementChar )
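The core operation above can be sketched as a small function. This is a minimal illustration, not the tool's actual source: the name `filterSequence`, its parameters, and the `caseMode` values ("none", "upper", "lower") are assumptions made for the example.

```javascript
// Hypothetical helper; the name, parameters, and caseMode values are assumptions.
function filterSequence(input, pattern, replacement = "", caseMode = "none") {
  // Force the global flag so every match is handled, not only the first.
  const flags = pattern.flags.includes("g") ? pattern.flags : pattern.flags + "g";
  const globalPattern = new RegExp(pattern.source, flags);
  // A replacer function keeps "$" in the replacement from being read as a special token.
  const filtered = input.replace(globalPattern, () => replacement);
  if (caseMode === "upper") return filtered.toUpperCase();
  if (caseMode === "lower") return filtered.toLowerCase();
  return filtered;
}

console.log(filterSequence("1 gatc nnta 8", /[^gatcn]/i, "", "upper")); // GATCNNTA
```

Case conversion is applied after replacement here, matching the order described in the Quick Summary.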

Characters replaced

count of characters that matched the selected remove pattern before replacement
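One way to obtain this count is to collect the matches before any replacement happens. The sketch below assumes a hypothetical helper named `countMatches`; the tool may compute the figure differently.

```javascript
// Hypothetical helper: counts how many characters match the remove pattern.
function countMatches(input, pattern) {
  // Add the global flag if it is missing so match() returns every hit.
  const flags = pattern.flags.includes("g") ? pattern.flags : pattern.flags + "g";
  const matches = input.match(new RegExp(pattern.source, flags));
  // match() returns null when nothing matches.
  return matches ? matches.length : 0;
}

// Two digits and three spaces do not belong to g, a, t, c, or n.
console.log(countMatches("1 gatc nnta 8", /[^gatcn]/i)); // 5
```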

Output length

length of the filtered string after replacement and case conversion

FASTA header

>filtered DNA sequence consisting of N bases.

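In the header shown above, N stands for the output length, not the base code N. A minimal sketch of assembling the entry follows; the function name `toFastaEntry` is hypothetical, and writing the sequence on a single line is an assumption, since the page does not say whether the tool wraps long sequences.

```javascript
// Hypothetical helper: builds a FASTA entry whose header reports the sequence length.
function toFastaEntry(sequence) {
  const header = `>filtered DNA sequence consisting of ${sequence.length} bases.`;
  // Assumption: the sequence is written on one line, without wrapping.
  return `${header}\n${sequence}\n`;
}

console.log(toFastaEntry("GATCNNTA"));
```
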
📊 Result Interpretation

Filtered Length

Number of characters in the output sequence after all replacements and case conversion.

Characters Replaced

Number of input characters that matched the selected removal pattern. If "replace with nothing" was chosen, these were deleted; otherwise each was swapped for the chosen placeholder.

🔬 Applications

Removing line numbers and whitespace from numbered sequences copied from databases or textbooks
Stripping all non-IUPAC characters from mixed text before BLAST or alignment
Converting a DNA sequence to uppercase for tools that require it
Replacing T with U to prepare a sequence for RNA-focused tools that expect U
Masking ambiguous positions by replacing non-GATCN characters with N

⚠️ Common Mistakes & Warnings

The tool does not validate the biological meaning of the result

Replacing T with U, or stripping all non-lowercase letters, produces a new character string but no biological check is performed. Review the output before using it in an analysis.

FASTA headers in the input are filtered too

If your input includes a FASTA ">" header line, the characters in that line are also subject to the selected filter. Strip the header before filtering if you want to preserve it, or use the output header generated by the tool.
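If you prepare input in a script, the header can be set aside before filtering. The sketch below is an illustration only: `splitFastaHeader` is a hypothetical name, and it handles a single record whose first line is the header.

```javascript
// Hypothetical helper: separates a leading ">" header line from the sequence body.
// Assumption: the text holds at most one FASTA record.
function splitFastaHeader(text) {
  const lines = text.split(/\r?\n/);
  if (lines[0].startsWith(">")) {
    return { header: lines[0], body: lines.slice(1).join("\n") };
  }
  return { header: null, body: text };
}

const record = splitFastaHeader(">seq1 test\nGATC\nNNTA");
console.log(record.header); // >seq1 test
console.log(record.body.replace(/[^gatcn]/gi, "")); // GATCNNTA

// Filtering the header itself leaves stray bases behind:
console.log(">seq1 test".replace(/[^gatcn]/gi, "")); // tt
```

The last line shows why the header matters: the two t characters in "test" survive a non-GATCN filter and would be counted as sequence.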

Input size limit: 500,000,000 characters

This matches the limit of the original Sequence Manipulation Suite (SMS) and is intentionally larger than the limits of the EMBL tools, since raw sequence text files can be very large.

❓ Frequently Asked Questions

What does "remove non-GATCN characters" mean?

This option keeps only the characters G, A, T, C, and N (in upper or lower case) and deletes everything else — spaces, digits, newlines, punctuation, and any other letter. It is the most common cleaning step when you have a numbered or formatted sequence that you want to turn into a plain DNA string.

What is the difference between "remove" and "replace"?

When you select "replace with nothing", matched characters are simply deleted and the sequence gets shorter. When you choose a replacement character (N, -, ?, etc.), each matched character is swapped for that character, so the total length stays the same. Replacement is useful when you want to mark positions rather than collapse them.
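The effect on length can be checked with a short example. The input string and the non-GATCN pattern below are chosen for illustration; the tool's exact patterns are an assumption.

```javascript
// Illustrative input: one gap character and one space among the bases.
const raw = "GA-TC NN";

// "Replace with nothing": matches are deleted, so the string shrinks.
const removed = raw.replace(/[^gatcn]/gi, "");

// "Replace with N": each match becomes N, so the length is preserved.
const masked = raw.replace(/[^gatcn]/gi, "N");

console.log(removed, removed.length); // GATCNN 6
console.log(masked, masked.length);   // GANTCNNN 8
```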

What are IUPAC DNA characters?

The full IUPAC DNA alphabet covers: G A T C (definite bases), R Y S W K M (two-base ambiguity codes), B D H V (three-base ambiguity codes), and N X (any base / unknown). The "Remove non-IUPAC DNA" option keeps all of these; the simpler "Remove non-GATCN" option keeps only the four definite bases plus N.
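The two options can be compared as regular expressions. The patterns below are a sketch built from the character lists in this answer, not copied from the tool, and the constant names are hypothetical.

```javascript
// Assumed patterns, derived from the alphabets listed above.
const NON_IUPAC_DNA = /[^gatcryswkmbdhvnx]/gi;
const NON_GATCN = /[^gatcn]/gi;

const mixed = "GATCRY-12 zz";

// R and Y are ambiguity codes, so only the IUPAC filter keeps them.
console.log(mixed.replace(NON_IUPAC_DNA, "")); // GATCRY
console.log(mixed.replace(NON_GATCN, ""));     // GATC
```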

Can I use this to convert DNA to RNA?

Partially. Selecting "Remove T" with "Replace with U" converts every T/t to U/u, producing an RNA-like string. This is a character substitution only: the tool does not check reading frames or perform transcription in a biological sense.
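A case-preserving substitution of this kind can be sketched with a replacer function. The name `dnaToRnaLike` is hypothetical, and this shows one way to get the T/t to U/u behavior, not necessarily how the tool does it.

```javascript
// Hypothetical helper: swaps T for U and t for u, leaving all other characters alone.
function dnaToRnaLike(sequence) {
  return sequence.replace(/t/gi, (match) => (match === "T" ? "U" : "u"));
}

console.log(dnaToRnaLike("GATtaca")); // GAUuaca
```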

Why does the output always start with a ">" line?

The output is in FASTA format, which requires a header line starting with ">". The header states the final sequence length so you can quickly confirm how many bases remain after filtering.