Filter Protein
Remove or replace characters in a protein sequence, then convert case

Paste any protein text: numbered sequences, plain sequences, or FASTA sequence bodies. Input limit: 500,000,000 characters.

💡 Quick Summary

Filter Protein removes or replaces unwanted characters in a protein sequence using one of several preset patterns, then optionally converts the remaining text to uppercase or lowercase. It is a quick way to strip line numbers, spaces, digits, or non-amino-acid characters from copied sequence text so that it is ready for downstream tools.

📋 How to Use
  1. Paste any protein text — numbered sequences, FASTA bodies, or raw text — into the input area.
  2. Choose what to Remove: pick the pattern that matches the characters you want to discard. The most common choice is "Remove non-ACDEFGHIKLMNPQRSTVWY characters", which strips everything except the 20 standard amino acid codes.
  3. Choose what to Replace with: by default, matched characters are deleted. You can instead substitute each one with a placeholder such as X (unknown residue), a gap character (-), or a stop symbol (*).
  4. Choose a Case conversion: leave case unchanged, convert all residues to uppercase, or convert all to lowercase.
  5. Click Filter. The output is a FASTA entry whose header states the final sequence length in residues.
  6. Use Copy to copy the result to your clipboard.
  7. Click Load Example to try the tool with a numbered protein sequence — a common format when copying from databases or papers.
  8. Click Clear to reset.
🧮 Formulas & Logic
Core operation

filtered = input.replace( /pattern/g, replacementChar )

Characters replaced

count of characters matching the selected pattern before replacement

Output length

number of characters in the filtered string after replacement and case conversion

FASTA header

>filtered protein sequence consisting of N residues.
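The logic above can be sketched end to end in JavaScript. This is a minimal illustration, not the tool's actual source: the function name `filterProtein`, its parameters, and the `caseMode` values are hypothetical, and the regular expression is an assumed form of the "non-ACDEFGHIKLMNPQRSTVWY" option.

```javascript
// Hypothetical helper illustrating the documented logic; names are not from the tool.
function filterProtein(input, pattern, replacement = "", caseMode = "none") {
  // "Characters replaced": count matches before any replacement happens.
  const matches = input.match(pattern); // pattern must carry the g flag
  const replacedCount = matches ? matches.length : 0;

  // Core operation: replace every matching character.
  let filtered = input.replace(pattern, replacement);

  // Optional case conversion.
  if (caseMode === "upper") filtered = filtered.toUpperCase();
  else if (caseMode === "lower") filtered = filtered.toLowerCase();

  // "Output length" is measured on the final string and reported in the header.
  const header = `>filtered protein sequence consisting of ${filtered.length} residues.`;
  return { filtered, replacedCount, fasta: `${header}\n${filtered}` };
}

// Assumed regex for the standard-20 option (case-insensitive).
const result = filterProtein("1 mkvl 5\n6 aage 9", /[^ACDEFGHIKLMNPQRSTVWY]/gi, "", "upper");
console.log(result.fasta);
console.log(result.replacedCount);
```

Here the nine digits, spaces, and newline are counted as replaced, and the eight remaining letters are uppercased before the header is built.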
📊 Result Interpretation
Filtered Length (aa)

Number of amino acid residues in the output sequence after all replacements and case conversion.

Characters Replaced

Number of input characters that matched the selected removal pattern. If "replace with nothing" was chosen, these were deleted; otherwise each was swapped for the chosen placeholder.

🔬 Applications
  • Removing line numbers and whitespace from numbered protein sequences copied from databases or publications
  • Stripping all non-standard residues before running alignment or secondary-structure prediction
  • Converting a protein sequence to uppercase for tools that require it
  • Replacing ambiguous or non-standard residues (B, J, O, U, Z) with X for downstream compatibility
  • Removing stop asterisks (*), or replacing them with X, before BLAST submission
⚠️ Common Mistakes & Warnings
FASTA headers are subject to the same filter

If your input includes a FASTA ">" header line, its characters are also processed by the selected pattern. Paste only the sequence body, or use the FASTA header generated by the tool instead.
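To see why this matters, the sketch below filters the same FASTA text with and without its header line. The helper `stripFastaHeaders` is hypothetical (the tool itself does not offer it), and the regular expression is an assumed form of the standard-20 option.

```javascript
// Assumed regex for the standard-20 option; the tool's exact pattern is not shown on this page.
const STANDARD_ONLY = /[^ACDEFGHIKLMNPQRSTVWY]/gi;

// Hypothetical pre-processing step: drop every line that starts with ">".
function stripFastaHeaders(text) {
  return text
    .split(/\r?\n/)
    .filter((line) => !line.startsWith(">"))
    .join("\n");
}

const fastaInput = ">test protein\nMKVL\nAAGE";

// Filtering the raw text lets header letters leak into the sequence.
const withHeader = fastaInput.replace(STANDARD_ONLY, "");

// Filtering only the body gives the intended sequence.
const bodyOnly = stripFastaHeaders(fastaInput).replace(STANDARD_ONLY, "");

console.log(withHeader); // header letters that look like residues survive
console.log(bodyOnly);
```

Any header letter that is also a valid residue code survives the filter, so the contaminated result is longer than the real sequence and gives no visible warning.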

The tool does not validate biological meaning

Replacing residues with X or converting case produces a new string, but the tool performs no structural or functional check on it. Always verify the output before using it in analysis.

Input size limit: 500,000,000 characters

This matches the limit of the original Sequence Manipulation Suite (SMS) tool and is intentionally generous, since proteome-wide FASTA files can be very large.

❓ Frequently Asked Questions

What does "remove non-ACDEFGHIKLMNPQRSTVWY characters" mean?
This option keeps only the 20 standard IUPAC single-letter amino acid codes (in upper or lower case) and deletes everything else: spaces, digits, newlines, punctuation, and ambiguity or non-standard codes such as B, J, O, U, X, and Z. It is the most common cleaning step when preparing a protein sequence for downstream tools.
When should I include the stop codon (*)?
Some databases and translation tools append a "*" to mark the stop codon. If you need to preserve it, choose "remove non-ACDEFGHIKLMNPQRSTVWY* characters". If you are submitting to BLAST or alignment tools that do not accept "*", choose the version without it.
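The two options differ only in whether "*" is part of the kept set. A minimal sketch, assuming the options map to negated character classes like the ones below (the constant names are illustrative):

```javascript
// Assumed regex forms of the two options; inside a character class, * is a literal asterisk.
const WITHOUT_STOP = /[^ACDEFGHIKLMNPQRSTVWY]/gi;
const WITH_STOP = /[^ACDEFGHIKLMNPQRSTVWY*]/gi;

const translated = "MKVL AAGE*";

const stopDropped = translated.replace(WITHOUT_STOP, ""); // space and * removed
const stopKept = translated.replace(WITH_STOP, "");       // only the space removed

console.log(stopDropped);
console.log(stopKept);
```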
What is the difference between "remove" and "replace"?
When you select "replace with nothing", matched characters are deleted and the sequence gets shorter. When you choose a replacement character (X, -, *, etc.), each matched character is swapped one-for-one, so the total length stays the same. Replacement is useful when you want to mark uncertain positions rather than collapse them.
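The length difference is easy to check on a short example. The sketch below assumes the standard-20 pattern and a made-up input containing the ambiguity codes B and Z:

```javascript
// Assumed regex for the standard-20 option.
const nonStandard = /[^ACDEFGHIKLMNPQRSTVWY]/gi;

const raw = "MKBVLZAG"; // illustrative input; B and Z are not among the 20 standard codes

const removed = raw.replace(nonStandard, "");   // matched characters deleted
const replaced = raw.replace(nonStandard, "X"); // matched characters swapped one-for-one

console.log(removed, removed.length);
console.log(replaced, replaced.length);
```

With replacement, residue positions in the output still line up with positions in the input, which is not true after deletion.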
Why does the output always start with a ">" line?
The output is in FASTA format, which requires a header line starting with ">". The header states the final residue count so you can quickly confirm how many amino acids remain after filtering.
What are the extended alphabet options?
The "remove non-ABCDEFGHIJKLMNOPQRSTUVWXYZ.-" options keep all letters plus gap and stop characters. These are useful when working with alignment files, where gap characters (- and .) are meaningful positional placeholders.
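As a rough illustration, the option named above can be read as a negated character class that keeps letters, periods, and hyphens. The regex below is an assumption based on that option name only, and the aligned row is invented for the example:

```javascript
// Assumed regex for the "non-ABCDEFGHIJKLMNOPQRSTUVWXYZ.-" option as named.
const EXTENDED = /[^A-Z.\-]/gi;
// Assumed regex for the standard-20 option, for comparison.
const STANDARD = /[^ACDEFGHIKLMNPQRSTVWY]/gi;

const alignedRow = "12 MK--VL..AG 21"; // illustrative numbered alignment row

const gapsKept = alignedRow.replace(EXTENDED, "");    // digits and spaces removed
const gapsDropped = alignedRow.replace(STANDARD, ""); // gaps removed as well

console.log(gapsKept);
console.log(gapsDropped);
```

Keeping the gap characters preserves column positions, so the cleaned row can still be compared against other rows of the same alignment.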