GenBank to FASTA
Convert GenBank flat-file records to FASTA sequences

Paste one or more GenBank records. Each must begin with a "LOCUS" line, contain a DEFINITION and ORIGIN section, and end with "//". Input limit: 200,000,000 characters.

💡 Quick Summary

GenBank to FASTA reads one or more GenBank flat-file records and returns each DNA sequence as a separate FASTA entry, using the DEFINITION line as the title. It is a quick way to strip annotation text and obtain a clean sequence for downstream tasks such as BLAST searches, alignment, or primer design.

📋 How to Use
  1. Paste the contents of one or more GenBank files into the input area. Each record must start with a LOCUS line and contain both a DEFINITION section and an ORIGIN section ending with //.
  2. Click Convert to FASTA. Each GenBank record is converted to a single FASTA entry; the DEFINITION line becomes the FASTA title.
  3. The Summary panel shows how many records were converted and the total number of bases extracted.
  4. Use the Copy button to copy all FASTA sequences to your clipboard.
  5. Click Load Example to try with a real GenBank record — the Strongylocentrotus purpuratus fascin (FSCN1) mRNA from NCBI.
  6. Click Clear to reset.
🧮 Formulas & Logic
Title
FASTA title = filterFastaTitle( DEFINITION line text )
Sequence
DNA = removeNonDna( text after ORIGIN header, before // )
Wrapping
Output sequence is wrapped at 60 characters per line
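
The three rules above can be sketched in Python roughly as follows. This is an illustration, not the tool's own source: `record_to_fasta` is a hypothetical name, and the behaviour of `filter_fasta_title` (dropping ">" and collapsing whitespace) is an assumption, since its exact rules are not described here.

```python
import re

# Character set quoted in the FAQ for removeNonDna.
NON_DNA = re.compile(r"[^GATCURYSWKMBDHVNXgatcuryswkmbdhvnx]")


def filter_fasta_title(text: str) -> str:
    """Assumed behaviour: drop '>' and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.replace(">", "")).strip()


def record_to_fasta(record: str, width: int = 60) -> str:
    """Convert one GenBank record (LOCUS ... //) to a FASTA entry."""
    # DEFINITION may continue on indented lines; stop at the next
    # keyword that starts in column 1.
    definition = re.search(r"^DEFINITION\s+(.*?)(?=^\S)", record, re.M | re.S)
    # Sequence text sits between the ORIGIN header line and the // line.
    origin = re.search(r"^ORIGIN[^\n]*\n(.*?)^//", record, re.M | re.S)
    if definition is None or origin is None:
        raise ValueError("record lacks a DEFINITION or ORIGIN section")
    title = filter_fasta_title(definition.group(1))
    dna = NON_DNA.sub("", origin.group(1))
    # Wrap the cleaned sequence at `width` characters per line.
    lines = [dna[i:i + width] for i in range(0, len(dna), width)]
    return f">{title}\n" + "\n".join(lines)
```

Calling `record_to_fasta` on a record with a two-line DEFINITION and a 70-base ORIGIN block would give a one-line title followed by a 60-character line and a 10-character line.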
📊 Result Interpretation
Records Converted

Number of GenBank records (LOCUS … //) successfully parsed and converted to FASTA.

Total Bases (bp)

Sum of all DNA characters extracted across all converted records, after stripping digits, spaces, and non-IUPAC characters from the ORIGIN block.
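
As a cross-check, both Summary figures can be recomputed from the FASTA output itself. The sketch below uses a hypothetical helper, `summarize_fasta`, and assumes the sequence lines have already been cleaned, so every character on a non-title line counts as a base.

```python
def summarize_fasta(fasta: str) -> tuple[int, int]:
    """Return (records, total_bases) for already-cleaned FASTA text."""
    records = 0
    total_bases = 0
    for line in fasta.splitlines():
        if line.startswith(">"):
            records += 1  # one title line per converted record
        else:
            total_bases += len(line.strip())  # blank separator lines add 0
    return records, total_bases


print(summarize_fasta(">first\nACGT\nAC\n\n>second\nGG\n"))  # (2, 8)
```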

🔬 Applications
  • Stripping GenBank annotation to obtain a bare FASTA sequence for BLAST submission
  • Preparing sequences for multiple sequence alignment tools that require FASTA input
  • Extracting full genomic sequences from downloaded GenBank files before primer design
  • Batch-converting multiple GenBank records in a single paste operation
  • Converting GenBank sequences to FASTA as a pre-processing step for bioinformatics pipelines
⚠️ Common Mistakes & Warnings
Only the ORIGIN block is extracted

The tool reads only the sequence data between the ORIGIN header and the // terminator. Annotation sections (FEATURES, REFERENCE, etc.) are ignored entirely.

DNA only — protein GenBank records are not supported

The ORIGIN section is cleaned with an IUPAC DNA filter. GenBank records where ORIGIN contains amino acids (rare) will produce an empty or incorrect sequence.

DEFINITION is used as the FASTA title

Only the DEFINITION section text is used for the FASTA header; the ACCESSION or VERSION line is not appended automatically. If the DEFINITION text is empty, the title will be blank.

Incomplete records are skipped

Records missing a DEFINITION, ACCESSION, or ORIGIN section cannot be parsed and are skipped with a warning.
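
A minimal sketch of this kind of check, assuming records are delimited by a LOCUS line and a "//" line and that the required sections are the three named above. The function name `split_and_check` and the warning wording are illustrative only, not the tool's actual messages.

```python
import re

# One record: from a line starting with LOCUS to the next line starting with //.
RECORD = re.compile(r"^LOCUS\b.*?^//", re.M | re.S)
# Section keywords named in the warning text above.
REQUIRED = ("DEFINITION", "ACCESSION", "ORIGIN")


def split_and_check(text: str) -> tuple[list[str], list[str]]:
    """Split pasted text into records; return (complete_records, warnings)."""
    complete, warnings = [], []
    for number, match in enumerate(RECORD.finditer(text), start=1):
        record = match.group(0)
        # A section counts as present if its keyword starts a line.
        missing = [k for k in REQUIRED
                   if not re.search(rf"^{k}\b", record, re.M)]
        if missing:
            warnings.append(
                f"record {number} skipped: missing {', '.join(missing)}")
        else:
            complete.append(record)
    return complete, warnings
```

With two pasted records where the second has no ACCESSION line, this returns one complete record and one warning naming the missing section.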

❓ Frequently Asked Questions

Where do I get GenBank-format files?
GenBank flat files are available from NCBI at www.ncbi.nlm.nih.gov. Search for an accession number, open the record, and use the "Send to → File → Format: GenBank (full)" option to download the flat file.
Can I paste multiple GenBank records at once?
Yes. Paste any number of complete records (each starting with "LOCUS" and ending with "//"). Each record is converted independently and all FASTA entries are output together, separated by a blank line.
What is in the ORIGIN section?
The ORIGIN section of a GenBank file contains the actual nucleotide sequence, formatted as lines of up to 60 bases. Each line starts with the position number of its first base and is split into space-separated groups of 10. The tool removes the numbers and spaces automatically, leaving only the IUPAC nucleotide characters.
Why does the FASTA title come from DEFINITION and not ACCESSION?
The DEFINITION line contains a human-readable description of the sequence (e.g. "Strongylocentrotus purpuratus fascin mRNA, complete cds"), which is more informative as a FASTA title than the bare accession number. This matches the behaviour of the original Sequence Manipulation Suite (SMS) tool.
What does removeNonDna do to the sequence?
It strips every character that is not in the tool's nucleotide alphabet: the IUPAC codes G, A, T, C, U, R, Y, S, W, K, M, B, D, H, V, N, plus X, in upper or lower case. This removes the position numbers, spaces, and any stray punctuation from the ORIGIN block.
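
The filter can be approximated with a single regular expression. The Python sketch below is not the tool's own code; it simply applies the character set listed in this answer to a sample ORIGIN line and to a short protein string.

```python
import re


def remove_non_dna(sequence: str) -> str:
    """Keep only the listed nucleotide codes; case is preserved."""
    return re.sub(r"[^GATCURYSWKMBDHVNXgatcuryswkmbdhvnx]", "", sequence)


# Position number, spaces, and punctuation are all dropped.
print(remove_non_dna("       61 acgtnnryac gg-ta*cc?t\n"))  # acgtnnryacggtacct

# Amino-acid letters that are also nucleotide codes survive the filter,
# which is why protein input gives an incorrect sequence, not an error.
print(remove_non_dna("MKLVEFQ"))  # MKV
```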