Split FASTA - TheBiologyBro

💡 Quick Summary

Split FASTA divides FASTA sequence records into smaller fragments of a size you specify. An optional overlap value can be used to create consecutive fragments that share bases with their neighbours — useful for sliding-window analyses, read simulation, or tiling PCR design.

📋 How to Use

Paste one or more FASTA sequences into the input area. The input limit is 500,000,000 characters.
Set the Fragment length — the number of bases in each output fragment (default 100).
Optionally set an Overlap — the number of bases shared between consecutive fragments (default 0). Must be less than the fragment length.
Click Run. Each input sequence is tiled into fragments; each fragment is output as a separate FASTA entry with its position and length recorded in the title.
The Summary panel shows the number of source sequences processed and the total fragments produced.
Use the Copy button to copy all fragments to your clipboard.
Click Load Example to try with a 1,000-base random sequence split into 100-base fragments.
Click Clear to reset.

🧮 Formulas & Logic

Fragment start (1-based)

start = j + 1, where j is the 0-based offset into the sequence

Fragment end (1-based)

end = start + fragment_length − 1, capped at source_length when the fragment would run past the end of the sequence

Step between starts

step = fragment_length − overlap

Number of fragments

⌈ source_length / step ⌉ (last fragment may be shorter than fragment_length)
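
The four formulas above can be combined into a short Python sketch. This is an illustration only, not the tool's actual code: the function name tile_coordinates is hypothetical, and capping the end at the source length is an assumption based on the note that the last fragment may be shorter.

```python
import math


def tile_coordinates(source_length, fragment_length=100, overlap=0):
    """Return 1-based (start, end) pairs for each fragment of one sequence."""
    if not 0 <= overlap < fragment_length:
        raise ValueError("overlap must be >= 0 and less than fragment_length")
    step = fragment_length - overlap
    coords = []
    for j in range(0, source_length, step):  # j is the 0-based offset
        start = j + 1
        # Assumption: a fragment that would run past the end is truncated.
        end = min(start + fragment_length - 1, source_length)
        coords.append((start, end))
    return coords


coords = tile_coordinates(250, fragment_length=100, overlap=0)
print(coords)  # [(1, 100), (101, 200), (201, 250)]
print(len(coords) == math.ceil(250 / 100))  # True
```

The loop visits one offset per step, so the number of pairs it produces matches ⌈ source_length / step ⌉.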

📊 Result Interpretation

Sequences Processed

Number of FASTA source records successfully tiled.

Total Fragments

Total number of output fragments across all source sequences.

Fragment Length / Overlap

The configured fragment size and overlap used for this run, shown as "N bp" or "N bp / M bp overlap".

🔬 Applications

Generating tiled overlapping fragments for primer design or cloning strategies
Simulating short sequencing reads from a reference sequence
Creating sliding-window sub-sequences for compositional or structural analyses
Preparing BLAST queries by breaking a long sequence into manageable chunks
Splitting a chromosome-scale sequence into segments for upload to tools with size limits

⚠️ Common Mistakes & Warnings

Non-letter characters are stripped

Digits, whitespace, gap characters (- .), and any other non-alphabetic characters are removed from each sequence before tiling. This strips GenBank/EMBL line numbering automatically but also removes alignment gaps — use Split FASTA on ungapped sequences.
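
A minimal sketch of this kind of cleaning, assuming "letter" means the ASCII letters A–Z and a–z (the tool's exact rule, and whether it changes case, are not documented here). The function name strip_non_letters is hypothetical.

```python
import re


def strip_non_letters(raw_sequence):
    """Remove every character that is not an ASCII letter."""
    return re.sub(r"[^A-Za-z]", "", raw_sequence)


# GenBank-style numbering, spaces, and gap characters all disappear.
genbank_like = "        1 acgt-acgt ..nn\n       61 ttga"
print(strip_non_letters(genbank_like))  # acgtacgtnnttga
```

Because gaps are dropped along with the numbering, positions in the cleaned sequence no longer correspond to alignment columns.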

The last fragment may be shorter than the specified length

When a fragment would extend past the end of the source sequence, it is truncated there and contains fewer bases than requested. With no overlap, this happens whenever the source length is not an exact multiple of the fragment length. The actual length is recorded in the FASTA title.

Overlap must be less than the fragment length

Setting overlap equal to or greater than the fragment length would cause fragments to never advance along the sequence. The tool validates this and returns an error if the values are incompatible.
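
The check described here might look like the following sketch. The function name validate_parameters and the error messages are hypothetical, and the additional checks for a positive fragment length and a non-negative overlap are assumptions rather than documented behaviour.

```python
def validate_parameters(fragment_length, overlap):
    """Return the step size, or raise ValueError for incompatible values."""
    if fragment_length < 1:
        raise ValueError("fragment length must be at least 1")
    if overlap < 0:
        raise ValueError("overlap cannot be negative")
    if overlap >= fragment_length:
        # A step of zero or less would never advance along the sequence.
        raise ValueError("overlap must be less than the fragment length")
    return fragment_length - overlap


print(validate_parameters(100, 20))  # 80
```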

❓ Frequently Asked Questions

What does the overlap parameter do?

With overlap = 0 (default), consecutive fragments are adjacent and non-overlapping. With overlap = N, each new fragment starts N bases before the end of the previous one, so the two fragments share N bases. For example, with fragment length 100 and overlap 20, fragment 1 covers bases 1–100, fragment 2 covers bases 81–180, fragment 3 covers bases 161–260, and so on.
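
To see the shared bases directly, here is a small sketch on a 10-base toy sequence with fragment length 4 and overlap 2. The function name split_sequence is hypothetical, and the slicing loop is an assumption about how such a tool could work, not its actual implementation.

```python
def split_sequence(seq, fragment_length, overlap=0):
    """Slice seq into fragments whose starts are fragment_length - overlap apart."""
    if not 0 <= overlap < fragment_length:
        raise ValueError("overlap must be >= 0 and less than fragment_length")
    step = fragment_length - overlap
    return [seq[j:j + fragment_length] for j in range(0, len(seq), step)]


fragments = split_sequence("ACGTACGTAC", fragment_length=4, overlap=2)
print(fragments)  # ['ACGT', 'GTAC', 'ACGT', 'GTAC', 'AC']

# Each full-length fragment ends with the 2 bases that begin the next one.
print(fragments[0][-2:], fragments[1][:2])  # GT GT
```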

Why must overlap be less than the fragment length?

The step between consecutive fragment start positions equals fragment_length minus overlap. If overlap ≥ fragment_length, the step would be zero or negative, and the tool would never advance along the sequence.

Can I process multiple sequences at once?

Yes. Paste any number of FASTA-formatted sequences (each starting with ">title") and all will be tiled independently in a single run.
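
As an illustration of how several records can be separated before tiling, the sketch below splits pasted text on ">" title lines. The function name parse_fasta is hypothetical, and ignoring any text that appears before the first title is an assumption.

```python
def parse_fasta(text):
    """Return a list of (title, sequence) tuples from multi-FASTA text."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:].strip(), []
        elif title is not None and line:
            chunks.append(line)  # sequence lines are joined per record
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


example = ">seq1\nACGT\nAC\n>seq2 sample\nTT\n"
print(parse_fasta(example))  # [('seq1', 'ACGTAC'), ('seq2 sample', 'TT')]
```

Each returned sequence can then be cleaned and tiled on its own, which is what "tiled independently" means in practice.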

How is each fragment named?

Each fragment is given a FASTA title in the format: >fragment_N;source_title_start=X;end=Y;length=Z;source_length=L — where N is the fragment number within its source sequence, X and Y are 1-based positions in the original sequence, Z is the fragment length (which may be shorter for the last fragment), and L is the full source sequence length.
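
A sketch of building such a title in Python. It assumes the template is read literally, with the source title followed directly by "_start="; the function name fragment_title is hypothetical, and the tool's real output may differ in detail.

```python
def fragment_title(n, source_title, start, end, source_length):
    """Format a fragment title following the template described above."""
    length = end - start + 1  # inclusive 1-based coordinates
    return (
        f">fragment_{n};{source_title}_start={start};end={end};"
        f"length={length};source_length={source_length}"
    )


print(fragment_title(2, "seq1", 81, 180, 260))
# >fragment_2;seq1_start=81;end=180;length=100;source_length=260
```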

Are gap characters kept in the fragments?

No. All non-letter characters — including alignment gaps (- .) — are removed before tiling. If your input contains gapped alignments, remove the gaps before using this tool.