Sample Protein - TheBiologyBro

💡 Quick Summary

Sample Protein builds a new sequence of the desired length by randomly drawing residues from a guide sequence, with replacement. Because each residue's chance of being drawn is proportional to its frequency in the guide, the output sequences match the guide's amino acid composition on average. Sampled sequences serve as composition-matched controls for evaluating sequence analysis results.

📋 How to Use

Enter the desired output length in residues (default: 100; maximum: 100,000,000).
Paste a raw or FASTA guide sequence into the textarea. The tool samples from the residues in this sequence. Input limit: 100,000,000 characters.
Choose how many sequences to generate (1, 10, 50, or 100).
Click Run. Each output sequence is independently sampled from the guide and output as a FASTA record. Use Copy to copy the plain-text result.
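
The steps above can be approximated offline with a short script. This is a minimal sketch, not the tool's actual code: the function name `sample_fasta`, the header format `>sample_N length=L`, the 60-character line width, and the `seed` parameter are all assumptions made for illustration.

```python
import random


def sample_fasta(guide: str, length: int, count: int, seed=None, width: int = 60) -> str:
    """Return `count` FASTA records, each sampled independently from `guide`."""
    if not guide:
        raise ValueError("guide must contain at least one residue")
    rng = random.Random(seed)  # seed is optional; pass one for reproducible output
    records = []
    for i in range(1, count + 1):
        # Draw `length` residues from the guide, with replacement.
        seq = "".join(rng.choices(guide, k=length))
        # Wrap the sequence at `width` characters per line (assumed layout).
        wrapped = "\n".join(seq[j:j + width] for j in range(0, length, width))
        records.append(f">sample_{i} length={length}\n{wrapped}")
    return "\n".join(records)


if __name__ == "__main__":
    print(sample_fasta("MKTAYIAKQRQISFVKSHFSRQ", length=100, count=3, seed=42))
```

Each record is sampled separately, so the sequences in one run differ from one another even though they share the same guide.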

🧮 Formulas & Logic

Sampling probability

P(residue X) = count(X in guide) / length(guide). Sampling is with replacement, so the same guide position can be picked multiple times.
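
As a rough illustration of the formula, the sketch below computes the per-residue probabilities and draws a sequence by picking random guide positions. The function names `residue_probabilities` and `sample_protein` are hypothetical and are not taken from the tool itself.

```python
import random
from collections import Counter
from typing import Dict, Optional


def residue_probabilities(guide: str) -> Dict[str, float]:
    """P(X) = count(X in guide) / length(guide) for each residue X in the guide."""
    counts = Counter(guide)
    return {residue: n / len(guide) for residue, n in counts.items()}


def sample_protein(guide: str, length: int, rng: Optional[random.Random] = None) -> str:
    """Draw `length` residues from the guide, with replacement."""
    if not guide:
        raise ValueError("guide must contain at least one residue")
    rng = rng or random.Random()
    # Picking a uniformly random guide position gives each residue a
    # probability equal to its frequency in the guide.
    return "".join(rng.choice(guide) for _ in range(length))


if __name__ == "__main__":
    guide = "AAKL"
    print(residue_probabilities(guide))  # A has twice the probability of K or L
    print(sample_protein(guide, 20))
```

Choosing a random position, rather than a random residue type, is what makes the draw proportional to composition: a residue that occupies half of the guide positions is picked half of the time.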

📊 Result Interpretation

With-replacement sampling

Each residue is drawn independently and uniformly from all positions in the guide. The guide is not consumed — the same position can be selected multiple times.

Composition preservation

If the guide is 20% leucine, the output sequences will on average also be ~20% leucine. The exact composition varies from sequence to sequence because of random sampling, and the variation is larger for shorter outputs.
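
To get a feel for how much the composition can vary, note that with independent draws the number of times a residue appears in an output of length n follows a binomial distribution, so the standard deviation of its observed fraction is sqrt(p(1 - p) / n). The sketch below evaluates this for the 20% example; the function name `fraction_sd` is an assumption for illustration.

```python
import math


def fraction_sd(p: float, length: int) -> float:
    """Standard deviation of a residue's observed fraction in one sampled sequence.

    p      -- the residue's frequency in the guide (0.20 for 20%)
    length -- the output sequence length
    """
    return math.sqrt(p * (1.0 - p) / length)


if __name__ == "__main__":
    for n in (50, 100, 1000):
        print(f"length {n}: sd of fraction = {fraction_sd(0.20, n):.4f}")
```

For a 100-residue output and a 20% guide frequency, the standard deviation is about 0.04, so individual sequences commonly land a few percentage points above or below 20%.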

Output length vs guide length

The output length can be shorter or longer than the guide. The guide only defines the sampling pool, not the output size.

🔬 Applications

Generating composition-matched null sequences for statistical testing of motif enrichment
Producing background sequences that reflect the residue bias of a specific protein family or domain
Creating synthetic sequences with a defined amino acid composition for benchmarking tools
Testing analysis pipelines with sequences that share compositional properties with real proteins

⚠️ Common Mistakes & Warnings

Non-protein characters in the guide are stripped

Digits, spaces, and non-IUPAC amino acid characters are removed from the guide sequence before sampling. Only valid residues remain in the sampling pool.
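
A comparable cleaning step can be sketched as follows. The exact character set the tool keeps is not stated here, so this example assumes the 20 standard one-letter codes and assumes lowercase input is converted to uppercase; both `STANDARD_RESIDUES` and `clean_guide` are hypothetical names.

```python
# Assumption: only the 20 standard amino acid codes are kept.
# Widen this set (for example with B, Z, X, U, O) if ambiguity codes should survive.
STANDARD_RESIDUES = frozenset("ACDEFGHIKLMNPQRSTVWY")


def clean_guide(raw: str, allowed=STANDARD_RESIDUES) -> str:
    """Remove digits, whitespace, and any character not in `allowed`."""
    return "".join(ch for ch in raw.upper() if ch in allowed)


if __name__ == "__main__":
    print(clean_guide("1 MKTAYIAKQR 10\n11 QISFVKSHFS 20"))
```

Cleaning before sampling matters because every character left in the guide becomes part of the sampling pool.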

Only the first FASTA record in the guide is used

If multiple FASTA records are pasted as the guide, only the first sequence is used as the sampling source.
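
The first-record behavior can be mimicked with a small parser. This is a sketch under the assumption that records are delimited by lines starting with ">" and that a guide without any header is treated as a single raw sequence; `first_fasta_sequence` is a hypothetical helper name.

```python
def first_fasta_sequence(text: str) -> str:
    """Return the sequence of the first FASTA record, or the raw text if no header."""
    parts = []
    seen_header = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if seen_header or parts:
                break  # a second record starts here; ignore the rest
            seen_header = True
            continue  # header lines are not part of the sequence
        parts.append(line)
    return "".join(parts)


if __name__ == "__main__":
    pasted = ">domain_1\nMKTAYI\nAKQRQI\n>domain_2\nGGGGGG\n"
    print(first_fasta_sequence(pasted))
```

To sample from several proteins at once, concatenate their sequences into a single record before pasting.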

❓ Frequently Asked Questions

How does this differ from Random Protein Sequence?

Random Protein Sequence draws each residue from a uniform distribution (5% each for all 20 standard amino acids). Sample Protein draws from the actual residue composition of your guide sequence, so the output reflects the frequency bias of the guide rather than a flat distribution.
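
The difference between the two tools comes down to the distribution each residue is drawn from. The sketch below prints both distributions side by side for a made-up guide; the guide "LLLLKKAAGE" and the function names are illustrative assumptions only.

```python
from collections import Counter

STANDARD_RESIDUES = "ACDEFGHIKLMNPQRSTVWY"


def uniform_distribution() -> dict:
    """Flat distribution: every standard residue has probability 1/20."""
    return {res: 1 / len(STANDARD_RESIDUES) for res in STANDARD_RESIDUES}


def guide_distribution(guide: str) -> dict:
    """Guide-based distribution: residues absent from the guide get probability 0."""
    counts = Counter(guide)
    return {res: counts.get(res, 0) / len(guide) for res in STANDARD_RESIDUES}


if __name__ == "__main__":
    flat = uniform_distribution()
    biased = guide_distribution("LLLLKKAAGE")  # hypothetical leucine-rich guide
    for res in STANDARD_RESIDUES:
        print(f"{res}  uniform={flat[res]:.2f}  guide={biased[res]:.2f}")
```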

Can I use a protein domain as a guide to generate composition-matched controls?

Yes. Paste your domain sequence as the guide and set the desired output length. The output sequences will have approximately the same residue composition as your domain, making them useful as composition-matched controls.

What if my guide sequence contains only a few distinct residues?

The output sequences will contain only residues present in the guide. For example, a guide of "ACHKLMG" produces output sequences that contain only A, C, H, K, L, M, and G. Each of these appears once in the guide, so each has an expected frequency of 1/7 in the output.