DNA Stats - TheBiologyBro

💡 Quick Summary

DNA Stats returns the number and percentage of occurrences of each nucleotide residue in the sequences you enter. It covers all 16 IUPAC single bases, all 25 G/A/T/C/N dinucleotide combinations, and key base groups including GC content and degenerate code totals. Use it to quickly compare sequence composition.

📋 How to Use

Paste a raw DNA sequence or one or more FASTA sequences into the input area. The input limit is 500,000,000 characters.
Click Run. Each sequence produces three sections: single residue composition, dinucleotide composition, and group composition.
Use the Copy button to copy all results to your clipboard.
Click Load Example to try the tool with a balanced 50-nt sample sequence.

🧮 Formulas & Logic

Single base %

base_count / sequence_length × 100

Dinucleotide %

dinucleotide_count / (sequence_length − 1) × 100

Group %

sum_of_matching_base_counts / sequence_length × 100

GC content

(G_count + C_count) / sequence_length × 100
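
The single base and GC formulas above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the tool's own code; the function names `base_percentages` and `gc_content` and the 10-nt example sequence are made up for this sketch, and the input is assumed to be already cleaned.

```python
from collections import Counter


def base_percentages(seq):
    """Return {base: percent} using base_count / sequence_length * 100."""
    seq = seq.upper()
    if not seq:
        return {}  # avoid dividing by zero on empty input
    counts = Counter(seq)
    return {base: 100 * n / len(seq) for base, n in counts.items()}


def gc_content(seq):
    """Return (G_count + C_count) / sequence_length * 100."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100 * (seq.count("G") + seq.count("C")) / len(seq)


example = "GATTACAGGC"  # hypothetical input: 3 G, 3 A, 2 T, 2 C
print(base_percentages(example))
print(gc_content(example))
```

Here the sequence holds 5 G or C residues out of 10, so the GC content is 50%.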

📊 Result Interpretation

Sequences Processed

Number of FASTA records successfully analysed.

Total Bases

Sum of all cleaned sequence lengths across all records.

Single residue table

Count and % for each of the 16 IUPAC DNA bases (G, A, T, C, N, U, R, Y, S, W, K, M, B, D, H, V).

Dinucleotide table

Count and % for all 25 G/A/T/C/N × G/A/T/C/N pairs. Dinucleotides are counted in overlapping windows: every position except the last starts exactly one dinucleotide.
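
A minimal sketch of overlapping dinucleotide counting, assuming a cleaned sequence and the length − 1 denominator described on this page. The name `dinucleotide_table` is hypothetical and the tool's actual implementation may differ.

```python
from itertools import product

ALPHABET = "GATCN"  # the five letters used for the 25 tabulated pairs


def dinucleotide_table(seq):
    """Return {pair: (count, percent)} for all 25 G/A/T/C/N pairs."""
    seq = seq.upper()
    counts = {a + b: 0 for a, b in product(ALPHABET, repeat=2)}
    windows = len(seq) - 1  # number of overlapping 2-base windows
    for i in range(windows):
        pair = seq[i:i + 2]
        if pair in counts:  # pairs containing other letters (e.g. U) are skipped
            counts[pair] += 1
    return {
        pair: (n, 100 * n / windows if windows > 0 else 0.0)
        for pair, n in counts.items()
    }


table = dinucleotide_table("GCGCG")  # windows: GC, CG, GC, CG
print(table["GC"], table["CG"])
```

For `GCGCG` there are four windows, two `GC` and two `CG`, so each accounts for 50%.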

Group composition

Aggregated counts for GC, AT, 2-fold degenerate bases (R, Y, S, W, K, M), 3/4-fold degenerate bases (B, D, H, V, N), and all degenerate bases combined.
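
The group totals can be sketched as sums over per-base counts. The group names and the `group_composition` function below are illustrative assumptions; in particular, M is placed in the 2-fold group here because the FAQ on this page lists M=A/C among the 2-fold codes.

```python
from collections import Counter

# Group membership as described on this page (M in the 2-fold set is an assumption
# taken from the FAQ's list of 2-fold codes).
GROUPS = {
    "GC": "GC",
    "AT": "AT",
    "2-fold degenerate": "RYSWKM",
    "3/4-fold degenerate": "BDHVN",
}
GROUPS["all degenerate"] = GROUPS["2-fold degenerate"] + GROUPS["3/4-fold degenerate"]


def group_composition(seq):
    """Return {group: (count, percent)} with sequence length as the denominator."""
    seq = seq.upper()
    counts = Counter(seq)
    result = {}
    for name, members in GROUPS.items():
        total = sum(counts[base] for base in members)
        percent = 100 * total / len(seq) if seq else 0.0
        result[name] = (total, percent)
    return result


print(group_composition("GGCCAATTRN"))
```

In this 10-nt example, GC and AT each cover 4 bases (40%), and the two ambiguous codes R and N together make up 20%.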

🔬 Applications

Measuring GC content to assess primer melting temperature or genomic bias
Checking CpG (cg) dinucleotide frequency relative to other dinucleotides
Comparing base composition between two sequences or organisms
Assessing the frequency of degenerate IUPAC codes in a mixed population or consensus sequence
Quality-checking raw sequencing output for unexpected base distributions

⚠️ Common Mistakes & Warnings

Non-DNA characters are stripped

Any character that is not a valid IUPAC DNA/RNA letter is removed before counting. Position numbers in the output reflect the cleaned sequence.
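
The cleaning step can be approximated with a single regular expression. This is a sketch only: `clean_sequence` is a hypothetical name, and converting the result to upper case is an assumption about how letters are normalised.

```python
import re

# Anything that is not one of the 16 IUPAC DNA/RNA letters is dropped.
NON_IUPAC = re.compile(r"[^GATCNURYSWKMBDHV]", re.IGNORECASE)


def clean_sequence(raw):
    """Strip digits, whitespace, punctuation and non-IUPAC letters."""
    return NON_IUPAC.sub("", raw).upper()


print(clean_sequence("1 gat tac\n61 a-x!"))  # digits, spaces, '-', 'x', '!' are removed
```

Because counting happens after cleaning, the sequence length used in every percentage is the length of the cleaned string, not of the pasted text.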

U (uracil) is counted but not converted

If your sequence contains U residues they are counted under U in the single residue table. They are not converted to T. Dinucleotides involving U are not tabulated.

Dinucleotide percentages use length − 1 as the denominator

A sequence of length n contains n − 1 overlapping dinucleotide positions (for example, a 50-nt sequence has 49). This is the denominator for all dinucleotide percentages.

❓ Frequently Asked Questions

What is GC content and why does it matter?

GC content is the percentage of bases that are either guanine (G) or cytosine (C). Because G-C base pairs form three hydrogen bonds (compared to two for A-T), higher GC content correlates with higher melting temperature, which affects PCR primer design, hybridisation, and genomic stability. Organisms differ substantially in their overall GC content.
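
To make the link between GC content and melting temperature concrete, the sketch below uses the Wallace rule, a rough rule of thumb for short oligonucleotides (Tm ≈ 2 °C per A or T plus 4 °C per G or C). DNA Stats does not report Tm; the function name `wallace_tm` is hypothetical, and dedicated primer design tools use more accurate nearest-neighbour models.

```python
def wallace_tm(seq):
    """Rough Tm estimate in degrees C for a short oligo: 2*(A+T) + 4*(G+C)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


# Two hypothetical 10-mers of equal length but different GC content
print(wallace_tm("AATTAATTAA"))  # 0% GC
print(wallace_tm("GGCCGGCCGG"))  # 100% GC
```

At equal length, the GC-rich oligo gets an estimate twice as high as the AT-rich one, which is the trend the paragraph above describes.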

Why are dinucleotide counts interesting?

In vertebrate genomes, the CpG dinucleotide (cg in the tool) is under-represented relative to what you would expect from the individual C and G frequencies, because methylated cytosines at CpG sites tend to deaminate spontaneously to thymine, turning CpG into TpG over evolutionary time. An observed/expected CpG ratio well below 1.0 is therefore expected in vertebrate bulk DNA, while CpG islands (often found in promoter regions) show ratios closer to 1.0.
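
The observed/expected ratio is not part of the tool's output, but it can be derived from the counts it reports. The sketch below uses a commonly cited definition, (CpG count × sequence length) / (C count × G count); the function name `cpg_obs_exp` is an assumption made for this example.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (CpG count * length) / (C count * G count)."""
    seq = seq.upper()
    c = seq.count("C")
    g = seq.count("G")
    if c == 0 or g == 0:
        return 0.0  # ratio is undefined without both bases; report 0 here
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so str.count is safe
    return cpg * len(seq) / (c * g)


print(cpg_obs_exp("CCGG"))  # one CpG, two C, two G, length 4
print(cpg_obs_exp("CATGCATG"))  # no CpG at all
```

A value near 1.0 means CpG occurs about as often as the C and G frequencies alone would predict; values well below 1.0 indicate depletion.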

What are the degenerate base groups?

IUPAC codes beyond G, A, T, C represent ambiguous positions: R=A/G, Y=C/T, S=C/G, W=A/T, K=G/T, M=A/C (all 2-fold); B=C/G/T, D=A/G/T, H=A/C/T, V=A/C/G (3-fold); N=any (4-fold). The group table sums these so you can quickly see what proportion of the sequence is ambiguous.
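
The code definitions above translate directly into a lookup table, and the fold of each code is simply the number of bases it can stand for. The names `IUPAC_AMBIGUITY` and `codes_by_fold` are illustrative and not taken from the tool.

```python
# Ambiguity codes and the concrete bases each one can represent
IUPAC_AMBIGUITY = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def codes_by_fold():
    """Group the ambiguity codes by how many bases they cover (2, 3 or 4)."""
    folds = {}
    for code, bases in IUPAC_AMBIGUITY.items():
        folds.setdefault(len(bases), []).append(code)
    return {fold: "".join(sorted(codes)) for fold, codes in folds.items()}


print(codes_by_fold())
```

Grouping by fold reproduces the classes listed above: six 2-fold codes, four 3-fold codes, and N as the only 4-fold code.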

Can I process multiple sequences at once?

Yes. Paste any number of FASTA-formatted sequences. Each record is analysed independently and its composition table is listed separately in the output.
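
Splitting pasted text into independent records might look like the sketch below. It assumes standard FASTA conventions (a header line starting with `>` followed by sequence lines) and treats headerless input as a single record; the function name `parse_fasta` and the placeholder title `untitled` are hypothetical.

```python
def parse_fasta(text):
    """Return a list of (title, sequence) tuples, one per record."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if title is not None or chunks:
                records.append((title or "untitled", "".join(chunks)))
            title, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if title is not None or chunks:
        records.append((title or "untitled", "".join(chunks)))
    return records


pasted = ">seq_a\nGATT\nACA\n>seq_b\nCCGG\n"
for title, seq in parse_fasta(pasted):
    print(title, len(seq))
```

Keeping header lines out of the sequence matters because header text often contains letters such as A, C, or N that would otherwise be counted as bases.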