Fuzzy Search Protein - TheBiologyBro

💡 Quick Summary

Fuzzy Search Protein finds sites in a target protein sequence that are identical or similar to a short query sequence. Similarity is scored using a selected amino-acid substitution matrix (PAM30, PAM70, BLOSUM80, BLOSUM62, or BLOSUM45) plus a configurable gap penalty.

📋 How to Use

Paste the target protein sequence (raw or FASTA format) into the top textarea. Input limit is 2,000 characters.
Type the query sequence into the query field (max 30 amino acids). This is the short motif you want to find.
Choose a scoring matrix. BLOSUM62 is the standard choice for most protein homology searches.
Adjust the gap value and the number of hits to report if needed, then click Run.
Each hit shows the aligned query and target segment with the alignment score.
Click Load Example to try searching for GAD in a sample protein sequence.

🧮 Formulas & Logic

Cell score

max( diagonal + matrix_score(aa1, aa2), left − gap, up − gap, 0 )

Alignment score

Sum of substitution matrix scores along the traceback path minus gap penalties

Hit ranking

Hits are returned in descending score order; previously matched cells are zeroed before the next iteration
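
The three rules above can be sketched in Python as follows. This is a minimal illustration, not the tool's actual code: `score_pair` is a hypothetical stand-in that uses a toy match/mismatch score instead of a real PAM or BLOSUM lookup, and the names `fill_matrix`, `traceback`, and `find_hits` are invented for this example. It assumes a positive gap value.

```python
def score_pair(a, b):
    """Toy stand-in for a substitution-matrix lookup (hypothetical values)."""
    return 2 if a == b else -1


def fill_matrix(query, target, gap, blocked):
    """Fill the local-alignment matrix; cells in `blocked` stay at 0."""
    rows, cols = len(query) + 1, len(target) + 1
    H = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if (i, j) in blocked:
                continue  # cell used by an earlier hit: zeroed
            H[i][j] = max(
                H[i - 1][j - 1] + score_pair(query[i - 1], target[j - 1]),  # diagonal
                H[i][j - 1] - gap,  # left: gap in the query
                H[i - 1][j] - gap,  # up: gap in the target
                0,
            )
    return H


def traceback(H, query, target, gap, i, j):
    """Walk back from cell (i, j) until a zero cell is reached."""
    q_aln, t_aln, path = [], [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        path.append((i, j))
        if H[i][j] == H[i - 1][j - 1] + score_pair(query[i - 1], target[j - 1]):
            q_aln.append(query[i - 1])
            t_aln.append(target[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i][j - 1] - gap:
            q_aln.append("-")
            t_aln.append(target[j - 1])
            j -= 1
        else:
            q_aln.append(query[i - 1])
            t_aln.append("-")
            i -= 1
    return "".join(reversed(q_aln)), "".join(reversed(t_aln)), path


def find_hits(query, target, gap=2, max_hits=3):
    """Return up to `max_hits` local alignments in descending score order."""
    blocked, hits = set(), []
    for _ in range(max_hits):
        H = fill_matrix(query, target, gap, blocked)
        score, i, j = max(
            (H[r][c], r, c) for r in range(len(H)) for c in range(len(H[0]))
        )
        if score <= 0:
            break  # nothing left to report
        q_aln, t_aln, path = traceback(H, query, target, gap, i, j)
        hits.append({
            "score": score,
            "query": q_aln,
            "target": t_aln,
            # Matrix indices double as 1-based sequence positions.
            "query_from": path[-1][0], "query_to": path[0][0],
            "target_from": path[-1][1], "target_to": path[0][1],
        })
        blocked.update(path)  # zero these cells before the next pass
    return hits


for hit in find_hits("GAD", "MKGADLLGEDPPGAD"):
    print(hit["score"], hit["target_from"], hit["target_to"], hit["query"], hit["target"])
```

With the toy scores, the two exact GAD matches score 6 each and the near match GED scores 3. With a real substitution matrix the numbers would differ, but the order of operations would be the same.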

📊 Result Interpretation

Hits found

Number of distinct local alignments returned (up to the requested maximum).

Query from … to …

1-based start and end positions of the aligned portion within the query.

Target from … to …

1-based start and end positions of the aligned portion within the target.

Score

Sum of the substitution-matrix scores along the alignment, minus any gap penalties. Higher scores indicate greater similarity.

Gaps (−)

A dash in one aligned sequence marks a position where the other sequence has an extra residue (an insertion relative to the sequence showing the dash).

🔬 Applications

Finding approximate occurrences of a short functional motif (e.g. an active-site residue pattern) within a full-length protein
Identifying regions of a protein that could be mutated to match a desired epitope
Comparing how well a peptide query matches different regions of a target protein
Detecting conserved short motifs across distantly related sequences where exact matches are unlikely

⚠️ Common Mistakes & Warnings

Target sequence is limited to 2,000 characters

The algorithm allocates an O(n × m) scoring matrix, where n is the target length and m is the query length; at the limits that is about 2,000 × 30 = 60,000 cells. Protein scoring with full substitution matrices is more expensive than nucleotide identity checks, so the target limit is kept small to ensure the calculation completes in reasonable time.

Query length is limited to 30 amino acids

This keeps the scoring matrix manageable and ensures near-real-time results even when the target is at its 2,000-character limit.

Non-protein characters are stripped

Any character that is not a standard single-letter amino acid code is removed from both the target and query before the search.

Only the first FASTA record is searched

If you paste a multi-FASTA file, only the first sequence is used as the target.
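
The two input rules above (character stripping and first-record-only) can be sketched like this. It is an illustration under stated assumptions, not the tool's code: the function names are invented, "standard" is assumed to mean the 20 common residues, and lowercase input is assumed to be uppercased before filtering.

```python
# Assumption: "standard" means the 20 common amino acids.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def first_fasta_record(text):
    """Return the sequence of the first FASTA record (or the raw text if no header)."""
    seq_lines = []
    seen_header = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if seen_header or seq_lines:
                break  # a second record starts here; ignore the rest
            seen_header = True
            continue
        seq_lines.append(line)
    return "".join(seq_lines)


def clean_protein(text):
    """Keep only standard amino-acid letters from the first record."""
    raw = first_fasta_record(text).upper()  # assumption: case is normalised
    return "".join(ch for ch in raw if ch in STANDARD_AA)


# Hypothetical two-record input: digits, spaces, '*' and the second record are dropped.
print(clean_protein(">example_1\nMKGAD 123\nllge*\n>example_2\nAAAA"))
```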

❓ Frequently Asked Questions

Which scoring matrix should I choose?

BLOSUM62 is the default and a good general-purpose choice for moderately diverged sequences; it is built from alignment blocks clustered at 62% identity. Use BLOSUM80 or PAM30/PAM70 for more closely related sequences (higher identity). Use BLOSUM45 for more distantly related sequences where you expect fewer identical residues. The PAM matrices are extrapolated from a model of evolutionary divergence, while BLOSUM matrices are derived directly from aligned protein block statistics.

How does the gap value work?

A positive gap value (displayed as "−1", "−2", etc.) subtracts that many points for each gap position introduced into the alignment. The default of −2 applies a 2-point penalty per gap position. A negative gap value (displayed as "+1", "+2", etc.) adds points per gap position instead, so gapped alignments are preferred; this is unusual but sometimes useful.
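
To make the sign convention concrete, here is a small arithmetic sketch. The function name and the pair scores are hypothetical; only the rule "score = sum of pair scores minus gap value times gap positions" comes from the formulas on this page.

```python
def alignment_score(pair_scores, gap_positions, gap_value):
    """Sum of substitution scores minus the gap value for each gap position."""
    return sum(pair_scores) - gap_value * gap_positions


pair_scores = [6, 4, 6]  # hypothetical substitution scores for three aligned pairs

# Gap value 2 (shown as "−2"): one gap position costs 2 points.
print(alignment_score(pair_scores, gap_positions=1, gap_value=2))   # 14

# Gap value −1 (shown as "+1"): one gap position adds 1 point.
print(alignment_score(pair_scores, gap_positions=1, gap_value=-1))  # 17
```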

Why are subsequent hits scored lower?

After each hit is recorded, the cells used in its traceback path are zeroed in the scoring matrix. The matrix is re-scored to find the next best non-overlapping alignment. Each successive hit is therefore the best remaining alignment that does not reuse the matrix cells of earlier hits, and because zeroing cells can only lower the scores that depend on them, its score can be no higher than the previous hit's.