Sample DNA - TheBiologyBro

💡 Quick Summary

Sample DNA builds a new sequence of the desired length by randomly drawing bases from a guide sequence, with replacement. Because each base is drawn in proportion to its frequency in the guide, the output sequences match the guide's nucleotide frequencies on average. Sampled sequences serve as composition-matched controls for evaluating sequence analysis results.

📋 How to Use

Enter the desired output length in bases (default: 100; maximum: 10,000,000).
Paste a raw or FASTA guide sequence into the textarea. The tool samples from the bases in this sequence. Input limit: 10,000,000 characters.
Choose how many sequences to generate (1, 10, 50, or 100).
Click Run. Each output sequence is sampled independently from the guide and written as a FASTA record. Use Copy to copy the plain-text result.

🧮 Formulas & Logic

Sampling probability

P(base X) = count(X in guide) / length(guide). Sampling is with replacement, so the same guide position can be picked multiple times.
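
The rule above can be sketched in a few lines of Python. This is only an illustration that assumes the guide has already been cleaned; the function name sample_dna and the seed parameter are hypothetical and are not taken from the tool itself.

```python
import random
from collections import Counter

def sample_dna(guide: str, length: int, seed=None) -> str:
    """Draw `length` bases from `guide` uniformly at random, with replacement."""
    if not guide:
        raise ValueError("guide sequence is empty")
    rng = random.Random(seed)  # seed is only here to make the example repeatable
    # Picking a uniformly random guide position is equivalent to drawing
    # base X with probability count(X in guide) / len(guide).
    return "".join(rng.choices(guide, k=length))

guide = "GGGCCCAATT"
# Per-base sampling probabilities implied by the guide composition.
probs = {base: n / len(guide) for base, n in Counter(guide).items()}
# The output may be longer than the guide because positions are reused.
sampled = sample_dna(guide, 25, seed=1)
```

Here probs gives 0.3 for G and C and 0.2 for A and T, and the 25-base output is longer than the 10-base guide, which is possible only because positions can be drawn more than once.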

📊 Result Interpretation

With-replacement sampling

Each base is drawn independently and uniformly from all positions in the guide. The guide is not consumed — the same position can be selected multiple times.

Composition preservation

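If the guide is 60% G+C, the output sequences will on average also be ~60% G+C. The exact composition varies from sequence to sequence because of random sampling; the variation is larger for short outputs and shrinks as the output length increases.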
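
To get a feel for how much the G+C fraction can drift, the sketch below uses the standard binomial approximation: if each base is G or C with probability p, the G+C fraction of an output of length n has standard deviation sqrt(p(1-p)/n). The function name gc_fraction_sd is hypothetical and the calculation is an estimate under that independence assumption, not something the tool reports.

```python
import math

def gc_fraction_sd(p_gc: float, length: int) -> float:
    """Standard deviation of the G+C fraction for `length` independent draws."""
    if length <= 0:
        raise ValueError("length must be positive")
    return math.sqrt(p_gc * (1.0 - p_gc) / length)

# Expected spread for a 60% G+C guide at several output lengths.
spread = {n: gc_fraction_sd(0.6, n) for n in (100, 1000, 10000)}
```

For a 60% G+C guide this gives a spread of about 0.049 (roughly five percentage points) at 100 bases and about 0.005 at 10,000 bases, so short outputs can deviate noticeably from the guide.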

Output length vs guide length

The output length can be shorter or longer than the guide. The guide only defines the sampling pool, not the output size.

🔬 Applications

Generating composition-matched null sequences for statistical testing of motif enrichment
Producing background sequences that reflect the GC content of a specific organism or genomic region
Creating synthetic sequences with a defined base composition for benchmarking tools
Testing analysis pipelines with sequences that share properties with real data

⚠️ Common Mistakes & Warnings

Non-DNA characters in the guide are stripped

Digits, spaces, and any other characters that are not IUPAC nucleotide codes are removed from the guide sequence before sampling. Only the standard bases and any IUPAC ambiguity codes present in the guide remain in the sampling pool.
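
The stripping step might look like the following Python sketch. The exact alphabet and case handling used by the tool are not documented here, so the IUPAC_DNA constant, the case-insensitive matching, and the function name clean_guide are all assumptions made for illustration.

```python
import re

# Assumed alphabet: the four bases plus the IUPAC DNA ambiguity codes.
IUPAC_DNA = "ACGTRYSWKMBDHVN"
_NON_IUPAC = re.compile(f"[^{IUPAC_DNA}]", re.IGNORECASE)

def clean_guide(raw: str) -> str:
    """Remove every character that is not an IUPAC DNA code (case kept as entered)."""
    return _NON_IUPAC.sub("", raw)

# Line numbers, spaces, and stray symbols from a pasted sequence are dropped.
cleaned = clean_guide("1 ACGT nnRY\n61 xx-GG*")
```

With this input, cleaned is "ACGTnnRYGG": the digits, whitespace, "x", "-", and "*" are removed, while the ambiguity codes n, R, and Y stay in the pool.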

Only the first FASTA record in the guide is used

If multiple FASTA records are pasted as the guide, only the first sequence is used as the sampling source.
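
As a rough illustration of this behavior, the Python sketch below keeps only the sequence lines of the first record and also accepts raw input with no header. The function name first_fasta_record is hypothetical, and the tool's own parser may differ in details.

```python
def first_fasta_record(text: str) -> str:
    """Return the sequence of the first FASTA record; raw input passes through."""
    seq_lines = []
    seen_header = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if seen_header or seq_lines:
                break  # a second record starts here, so stop reading
            seen_header = True
            continue
        seq_lines.append(line)
    return "".join(seq_lines)

guide = first_fasta_record(">seq1\nACGT\nGG\n>seq2\nTTTT\n")
```

Here guide is "ACGTGG"; the bases from seq2 never enter the sampling pool. To sample from several records at once, concatenate them into a single sequence before pasting.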

❓ Frequently Asked Questions

How does this differ from Random DNA Sequence?

Random DNA Sequence draws each base from a uniform distribution (25% each for G, A, C, T). Sample DNA draws from the actual base composition of your guide sequence, so the output reflects the frequency bias of the guide rather than a flat distribution.
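
The difference can be seen in a small Python comparison. This is a sketch of the two sampling schemes as described above, not the code of either tool; the helper gc_fraction and the example guide are made up for illustration.

```python
import random

def gc_fraction(seq: str) -> float:
    """Fraction of bases in `seq` that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)

rng = random.Random(0)            # fixed seed so the example is repeatable
guide = "GGGGCCCCAT"              # hypothetical guide with 80% G+C

# Uniform scheme: each of G, A, C, T has probability 0.25.
uniform_seq = "".join(rng.choices("GACT", k=20000))
# Guide-based scheme: probabilities follow the guide composition.
guided_seq = "".join(rng.choices(guide, k=20000))

uniform_gc = gc_fraction(uniform_seq)
guided_gc = gc_fraction(guided_seq)
```

The uniform sequence comes out close to 50% G+C, while the guide-based sequence stays close to the guide's 80% G+C.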

Can I use a coding sequence as a guide to generate composition-matched controls?

Yes. Paste your coding sequence as the guide and set the desired output length. The output sequences will match the overall base composition of your coding sequence on average, making them useful as intragenic composition-matched controls. Because each base is drawn independently, codon order and reading frame are not preserved.

What happens if the guide contains IUPAC ambiguity codes (N, R, Y, etc.)?

Ambiguity codes are treated as single characters in the sampling pool. For example, if the guide contains "N", the output may also contain "N" in proportion to how often it appeared in the guide.