Fuzzy Search DNA - TheBiologyBro

💡 Quick Summary

Fuzzy Search DNA finds sites in a target DNA sequence that are identical or similar to a short query sequence. Scoring is controlled by match, mismatch, and gap parameters. Use it to locate sequences that can be mutated into a restriction site, or to find approximate occurrences of a motif.

📋 How to Use

Paste the target sequence (raw or FASTA format) into the top textarea. Input limit is 2,000,000 characters.
Type the query sequence into the query field (max 30 characters). This is the short motif you want to find.
Adjust the scoring parameters if needed: Match value (reward for identical bases), Mismatch value (reward/penalty for non-identical bases), Gap value (cost for insertions/deletions), and the number of hits to report.
Click Run. Each hit shows the aligned query and target segment with its alignment score.
Click Load Example to try searching for cccggg (an SmaI restriction site) in a sample sequence.

🧮 Formulas & Logic

Cell score

max( diagonal + match/mismatch, left − gap, up − gap, 0 )

Alignment score

Sum of match/mismatch values along the traceback path minus gap penalties

Hit ranking

Hits are returned in descending score order; the cells on each reported hit's traceback path are zeroed before the next hit is searched for
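
The cell-score rule above is the same recurrence used in Smith-Waterman local alignment with a linear gap cost. The following is a minimal Python sketch of the matrix fill and traceback for a single best hit. The function name, the default values (match=2, mismatch=-1, gap=2), and the tie-breaking order in the traceback are assumptions made for illustration, not the tool's actual code.

```python
def best_local_alignment(query, target, match=2, mismatch=-1, gap=2):
    """Return (score, aligned_query, aligned_target, query_span, target_span).

    Spans are 1-based inclusive (start, end) tuples.
    Scoring defaults are illustrative assumptions.
    """
    q, t = query.lower(), target.lower()
    rows, cols = len(q) + 1, len(t) + 1
    # Row 0 and column 0 stay at zero: an alignment may start anywhere.
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if q[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(
                H[i - 1][j - 1] + sub,  # diagonal + match/mismatch
                H[i][j - 1] - gap,      # left - gap
                H[i - 1][j] - gap,      # up - gap
                0,                      # restart the local alignment
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    aligned_q, aligned_t = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if q[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            aligned_q.append(q[i - 1])
            aligned_t.append(t[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i][j - 1] - gap:
            aligned_q.append("-")       # gap in the query
            aligned_t.append(t[j - 1])
            j -= 1
        else:
            aligned_q.append(q[i - 1])
            aligned_t.append("-")       # gap in the target
            i -= 1

    return (
        best,
        "".join(reversed(aligned_q)),
        "".join(reversed(aligned_t)),
        (i + 1, best_pos[0]),
        (j + 1, best_pos[1]),
    )
```

Under these assumed values, searching for cccggg in aaacccgggttt gives a score of 12 (six matches at 2 points each), with the query aligned from 1 to 6 and the target from 4 to 9.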

📊 Result Interpretation

Hits found

Number of distinct local alignments returned (up to the requested maximum).

Query from … to …

1-based start and end positions of the aligned portion within the query.

Target from … to …

1-based start and end positions of the aligned portion within the target.

Score

Cumulative alignment score. Higher scores indicate closer matches.

Gaps (−)

A dash in one aligned sequence marks a position where the opposite sequence has an extra base (an insertion relative to the dashed sequence).

🔬 Applications

Finding near-matches to a restriction enzyme recognition site so you can plan a silent mutation to introduce the site
Locating binding sites for one concrete variant of a degenerate primer in a template sequence (ambiguity codes in the query are not expanded)
Identifying approximate occurrences of a short regulatory motif across a genomic region
Checking whether a synthesised oligo will bind off-target sites with only a few mismatches

⚠️ Common Mistakes & Warnings

Query length is limited to 30 characters

The algorithm uses O(n × m) memory, where n is the target length and m is the query length. Capping the query at 30 characters keeps the alignment matrix manageable even for a target at the 2,000,000-character limit.
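
To make the memory bound concrete, this short Python sketch counts the cells in the scoring matrix. The extra boundary row and column (the "+ 1" terms) are an assumption based on the usual dynamic-programming layout; the tool's exact matrix size may differ.

```python
def matrix_cells(target_len, query_len):
    """Number of cells in an alignment matrix with one boundary row and column."""
    return (target_len + 1) * (query_len + 1)


# Largest case allowed by the stated limits: 2,000,000-character target, 30-character query.
largest = matrix_cells(2_000_000, 30)
print(f"{largest:,} cells")
```

At the stated limits this comes to roughly 62 million cells, which is why the short query cap matters more than the target length.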

Non-DNA characters are stripped

Any character that is not a recognised nucleotide code (the bases A, T, G, C, and U, plus the ambiguity codes N, R, Y, S, W, K, M, B, D, H, V) is removed from both the target and query before the search.
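
A cleaning step of this kind can be sketched in Python with a single regular expression. The function name is hypothetical, and treating lowercase letters as valid (and leaving their case unchanged) is an assumption; only the list of allowed characters comes from the description above.

```python
import re

# Characters kept by the cleaning step, as listed in the text above.
ALLOWED = "ATGCNURYSWKMBDHV"
NOT_ALLOWED = re.compile(f"[^{ALLOWED}]", flags=re.IGNORECASE)


def clean_sequence(text):
    """Remove every character that is not in ALLOWED (case-insensitive)."""
    return NOT_ALLOWED.sub("", text)


print(clean_sequence("AC GT-12\nxnn"))
```

Because positions in the results refer to the cleaned sequence, stripped characters such as spaces, digits, and line breaks do not count toward the reported coordinates.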

Only the first FASTA record is searched

If you paste a multi-FASTA file, only the first sequence is used as the target. Use a single-sequence FASTA or a raw sequence.
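
If you need to pull the first record out of a multi-FASTA file yourself, a minimal Python sketch might look like the following. The function name is hypothetical and the parsing rules (a record starts at a line beginning with ">", input without a header is treated as raw sequence) are assumptions about typical FASTA handling, not a description of the tool's parser.

```python
def first_fasta_record(text):
    """Return the sequence of the first FASTA record, or the raw text joined up."""
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        # No header line: treat the input as a raw sequence and drop whitespace.
        return "".join(text.split())
    sequence_lines = []
    for line in lines[1:]:
        if line.startswith(">"):
            break  # a second record begins here; ignore everything after it
        sequence_lines.append(line.strip())
    return "".join(sequence_lines)


multi = ">first\nACGT\nAC\n>second\nTTTT"
print(first_fasta_record(multi))
```

To search the other records, split the file and run each sequence separately.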

❓ Frequently Asked Questions

How does the gap value work?

A positive gap value subtracts points for each gap (insertion or deletion) introduced into the alignment, which is the typical setting for penalising indels. A negative gap value adds points for each gap, so gapped alignments are preferred over mismatches. The default applies a 2-point penalty per gap, which usually produces clean alignments for short motif searches.
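
The effect of the gap value's sign can be checked by scoring an alignment by hand. This Python sketch sums match and mismatch values along two pre-aligned strings and subtracts the gap value at every dash. The function name and the default values (match=2, mismatch=-1, gap=2) are assumptions chosen for the example.

```python
def alignment_score(aligned_query, aligned_target, match=2, mismatch=-1, gap=2):
    """Score two equal-length aligned strings; '-' marks a gap."""
    score = 0
    for a, b in zip(aligned_query, aligned_target):
        if a == "-" or b == "-":
            score -= gap          # positive gap value costs points
        elif a.lower() == b.lower():
            score += match
        else:
            score += mismatch
    return score


# One mismatch, no gaps: 5 matches and 1 mismatch.
print(alignment_score("cccggg", "cccgag"))
# One gap, no mismatches: 6 matches and 1 gap.
print(alignment_score("ccc-ggg", "cccaggg"))
# Same gapped alignment with a negative gap value, which rewards the gap.
print(alignment_score("ccc-ggg", "cccaggg", gap=-2))
```

With the assumed values, the mismatched alignment scores 9 and the gapped one scores 10. Switching the gap value to -2 raises the gapped alignment to 14, which shows how a negative value makes gaps attractive.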

What do the start/end positions mean?

Positions are 1-based. "Query from 1 to 6" means the alignment covers the entire 6-character query. "Target from 50 to 55" means the matching region in the target runs from base 50 to base 55 (inclusive) of the cleaned sequence.
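
If you want to extract a reported region in a language with 0-based indexing, the coordinates need a small conversion. This Python sketch shows one way to do it; the helper name is hypothetical.

```python
def slice_1_based(sequence, start, end):
    """Return the bases from start to end, both 1-based and inclusive."""
    # Python slices are 0-based and exclude the end index,
    # so only the start needs shifting by one.
    return sequence[start - 1:end]


cleaned_target = "aacccgggtt"
print(slice_1_based(cleaned_target, 3, 8))
```

Remember to apply this to the cleaned sequence, since stripped characters are not counted in the reported positions.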

Why are subsequent hits scored lower?

After each hit is recorded, the cells used in its traceback path are zeroed in the scoring matrix. The matrix is then re-scored to find the next best non-overlapping alignment. Each successive hit is therefore the best remaining alignment that does not reuse those matrix cells, so its score cannot exceed that of the hit before it.
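
The zero-and-rescore loop described here can be sketched in Python as follows. This is an illustration under stated assumptions rather than the tool's implementation: the function name, the default values (match=2, mismatch=-1, gap=2), and the choice to refill the whole matrix on every round are all hypothetical.

```python
def top_hits(query, target, max_hits=3, match=2, mismatch=-1, gap=2):
    """Return up to max_hits tuples of (score, target_start, target_end), 1-based."""
    q, t = query.lower(), target.lower()
    blocked = set()  # cells used by earlier hits; these are forced to zero
    hits = []

    for _ in range(max_hits):
        H = [[0] * (len(t) + 1) for _ in range(len(q) + 1)]
        best, pos = 0, None
        for i in range(1, len(q) + 1):
            for j in range(1, len(t) + 1):
                if (i, j) in blocked:
                    continue  # leave the cell at zero
                sub = match if q[i - 1] == t[j - 1] else mismatch
                H[i][j] = max(H[i - 1][j - 1] + sub,
                              H[i][j - 1] - gap,
                              H[i - 1][j] - gap,
                              0)
                if H[i][j] > best:
                    best, pos = H[i][j], (i, j)
        if pos is None:
            break  # nothing left that scores above zero

        # Trace back, collecting the cells on this hit's path.
        i, j = pos
        path = []
        while i > 0 and j > 0 and H[i][j] > 0:
            path.append((i, j))
            sub = match if q[i - 1] == t[j - 1] else mismatch
            if H[i][j] == H[i - 1][j - 1] + sub:
                i, j = i - 1, j - 1
            elif H[i][j] == H[i][j - 1] - gap:
                j -= 1
            else:
                i -= 1
        blocked.update(path)
        hits.append((best, j + 1, pos[1]))

    return hits


print(top_hits("cccggg", "aacccgggaaaacccgagaa", max_hits=2))
```

In this example the exact site at target positions 3 to 8 is reported first with a score of 12, followed by the one-mismatch site at positions 13 to 18 with a score of 9. Zeroing cells can only lower the scores of the remaining cells, which is why later hits never outscore earlier ones.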

Can I use IUPAC ambiguity codes in the query?

Ambiguity codes are accepted as input, but they are treated literally. The scoring matrix uses a simple identity check: two characters match only if they are identical after lowercasing. Codes such as R or Y are not expanded, so an R in the query matches only an R in the target, and a Y matches only a Y.
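
The identity check described above amounts to a case-insensitive equality test. A minimal Python sketch, with a hypothetical function name:

```python
def bases_match(a, b):
    """True only when the two characters are identical after lowercasing."""
    return a.lower() == b.lower()


print(bases_match("A", "a"))  # same base, different case
print(bases_match("R", "r"))  # same ambiguity code
print(bases_match("R", "A"))  # R is not expanded to A or G
```

The last call returns False even though R stands for A or G in IUPAC notation, so a degenerate query should be written out as concrete sequences and each one searched separately.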