CpG Islands
Detect CpG island regions in DNA sequences using the Gardiner-Garden & Frommer method

Paste a raw sequence or one or more FASTA sequences. Non-DNA characters are stripped automatically. Input limit: 100,000,000 characters.

💡 Quick Summary

CpG Islands scans a DNA sequence for regions where CpG dinucleotides are much less depleted than in the rest of the genome (observed/expected ratio above the cutoff) and the GC content is elevated. These regions, called CpG islands, are often found near the promoters of vertebrate genes and can indicate transcription start sites.

📋 How to Use
  1. Paste a raw DNA sequence or one or more FASTA sequences into the input area. Input limit is 100,000,000 characters.
  2. Optionally adjust the Window size (default 200 bp) and Obs/Exp cutoff (default 0.6).
  3. Click Run. Each detected CpG island is reported with its start and end position, Obs/Exp ratio, and %GC.
  4. Use the Copy button to copy the results to your clipboard.
  5. Click Load Example to try with a sample genomic sequence.
🧮 Formulas & Logic
Expected CpG
count(C) × count(G) / window_length
Obs/Exp
count(CpG) / expected_CpG
%GC
( count(G) + count(C) ) / window_length × 100
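
As a worked example, a 200 bp window containing 60 C, 60 G and 60 CpG dinucleotides gives an expected CpG count of 60 × 60 / 200 = 18, an Obs/Exp of 60 / 18 ≈ 3.33, and a %GC of 120 / 200 × 100 = 60%. The three formulas can be sketched in Python as below. The function name `window_stats` is illustrative, and returning 0 for Obs/Exp when the window has no C or no G is an assumption, not necessarily what this tool does.

```python
def window_stats(window: str) -> tuple[float, float, float]:
    """Return (expected_cpg, obs_exp, percent_gc) for one window of DNA."""
    seq = window.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    # "CG" cannot overlap with itself, so str.count gives the exact CpG count.
    cpg = seq.count("CG")
    expected = c * g / n if n else 0.0
    # Assumption: report 0 when the ratio is undefined (no C or no G).
    obs_exp = cpg / expected if expected else 0.0
    percent_gc = (c + g) / n * 100 if n else 0.0
    return expected, obs_exp, percent_gc
```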
📊 Result Interpretation
Sequences Processed

Number of FASTA records successfully scanned.

Islands Detected

Total number of windows that met both the Obs/Exp and %GC thresholds.

Obs/Exp > 0.6

The standard threshold indicating CpG dinucleotides are not strongly suppressed.

%GC > 50%

The region is GC-rich, a hallmark of CpG islands in vertebrate genomes.

🔬 Applications
  • Identifying potential gene promoter regions in vertebrate genomic sequences
  • Locating transcription start sites upstream of known or predicted genes
  • Assessing methylation patterns — CpG islands in promoters are often unmethylated in expressed genes
  • Annotating novel genomic sequences from sequencing projects
  • Comparing CpG island density between organisms or genomic regions
⚠️ Common Mistakes & Warnings
Sequence shorter than window size

If the input sequence is shorter than the window size (default 200 bp), no analysis can be performed. Use a longer sequence or reduce the window size.

Non-DNA characters are stripped

Any character that is not a valid IUPAC DNA letter is removed before analysis. FASTA header lines are ignored automatically.
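
The cleaning step described above might look roughly like the following sketch. The exact set of accepted letters is an assumption (the standard IUPAC nucleotide codes without gap characters), and `clean_sequence` is a hypothetical name, not the tool's own function.

```python
# Assumption: the standard IUPAC nucleotide codes, uppercase, no gap symbols.
IUPAC_DNA = set("ACGTRYSWKMBDHVN")

def clean_sequence(text: str) -> str:
    """Drop FASTA header lines and every character that is not an IUPAC DNA letter."""
    kept = []
    for line in text.splitlines():
        if line.startswith(">"):
            continue  # FASTA header: ignored entirely
        kept.extend(ch for ch in line.upper() if ch in IUPAC_DNA)
    return "".join(kept)
```

Digits, whitespace, and punctuation from numbered sequence listings are removed this way, so the reported positions refer to the cleaned sequence rather than to the pasted text.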

Each window is reported independently

Adjacent windows that all meet the thresholds are listed separately, not merged into a single island region. This matches the original Gardiner-Garden & Frommer method.
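
A minimal sketch of this per-window reporting is shown below, assuming 1-based inclusive coordinates and strict "greater than" comparisons for both thresholds. The function and parameter names are illustrative. For brevity it recounts each window from scratch, which is slow on long sequences; a production scanner would update the counts incrementally.

```python
def scan_windows(seq, window=200, min_obs_exp=0.6, min_gc=50.0):
    """Return (start, end, obs_exp, percent_gc) for every qualifying window.

    Positions are 1-based and inclusive (an assumption about the output format).
    """
    seq = seq.upper()
    hits = []
    # The range is empty when the sequence is shorter than the window.
    for i in range(len(seq) - window + 1):
        w = seq[i:i + window]
        c, g, cpg = w.count("C"), w.count("G"), w.count("CG")
        if c == 0 or g == 0:
            continue  # expected CpG is zero, so the ratio is undefined
        obs_exp = cpg / (c * g / window)
        percent_gc = (c + g) / window * 100
        if obs_exp > min_obs_exp and percent_gc > min_gc:
            hits.append((i + 1, i + window, obs_exp, percent_gc))
    return hits
```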

❓ Frequently Asked Questions

What is a CpG island?
A CpG island is a region of DNA where cytosine (C) followed directly by guanine (G), the CpG dinucleotide, occurs much more often than in the rest of the genome. In vertebrate genomes, CpG dinucleotides are underrepresented genome-wide because methylated cytosines spontaneously deaminate to thymine. CpG islands, however, are protected from this suppression and are enriched at gene promoters.
What do the Obs/Exp and %GC thresholds mean?
The Obs/Exp ratio compares the observed number of CpG dinucleotides in a window to the number expected if C and G were randomly distributed. A value above 0.6 indicates the CpGs are not strongly suppressed. The %GC threshold (>50%) ensures the region is genuinely GC-rich. Both criteria must be met simultaneously for a window to be called a CpG island.
Can I change the window size and cutoff?
Yes. The default 200 bp window and 0.6 Obs/Exp cutoff follow Gardiner-Garden & Frommer (1987). Takai & Jones (2002) recommend stricter criteria to reduce false positives: a minimum length of 500 bp (used here as the window size), a higher Obs/Exp cutoff (0.65), and a minimum %GC of 55%. Adjust the parameters in the input panel to match the criteria you need.
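
To make the two sets of criteria concrete, they can be written down as named presets. This is only a sketch: `Criteria`, the preset names, and `passes` are hypothetical, and using a strict "greater than" comparison at the boundary is an assumption.

```python
from typing import NamedTuple

class Criteria(NamedTuple):
    window: int         # window size in bp
    min_obs_exp: float  # Obs/Exp cutoff
    min_gc: float       # %GC cutoff

# Values as quoted in the answer above.
GARDINER_GARDEN_FROMMER = Criteria(window=200, min_obs_exp=0.6, min_gc=50.0)
TAKAI_JONES = Criteria(window=500, min_obs_exp=0.65, min_gc=55.0)

def passes(obs_exp: float, percent_gc: float, criteria: Criteria) -> bool:
    """True when a window's statistics exceed both cutoffs (strict comparison assumed)."""
    return obs_exp > criteria.min_obs_exp and percent_gc > criteria.min_gc
```

A window with Obs/Exp 0.62 and 52% GC would pass the first preset but not the second.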
Why are so many overlapping windows reported?
The algorithm slides the window one base at a time and reports every window that meets the criteria. Consecutive windows that all qualify will therefore appear as a long list of overlapping ranges. This is the standard output of the original method — downstream tools typically merge overlapping ranges into a single island.
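
If you want a single range per island, the overlapping windows can be merged after the fact. The sketch below assumes the ranges are 1-based inclusive (start, end) pairs; `merge_ranges` is an illustrative name, and treating directly adjacent ranges as one island is a design choice rather than part of the method.

```python
def merge_ranges(ranges):
    """Merge overlapping or directly adjacent inclusive (start, end) ranges."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Note that Obs/Exp and %GC are properties of each window; for a merged region they would need to be recomputed over the whole span.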
Can I process multiple sequences at once?
Yes. Paste any number of FASTA-formatted sequences. Each record is scanned independently and its results are listed separately in the output.
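
For readers scripting a similar workflow, splitting pasted text into records could be sketched as follows. `parse_fasta` is a hypothetical helper, and treating text with no header line as one unnamed raw sequence is an assumption about how raw input is handled.

```python
def parse_fasta(text):
    """Split pasted text into (header, sequence) records.

    Text without any '>' line is returned as one record with an empty header.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the record collected so far, if any.
            if header is not None or chunks:
                records.append((header or "", "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line.upper())
    if header is not None or chunks:
        records.append((header or "", "".join(chunks)))
    return records
```

Each returned sequence can then be cleaned and scanned on its own, which keeps windows from spanning the boundary between two records.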