EMBL Feature Extractor Tool

💡 Quick Summary

EMBL Feature Extractor reads the feature table (FT lines) of one or more EMBL records and returns each annotated feature (CDS, mRNA, gene, exon, and more) as a separate FASTA entry. A feature-type breakdown panel lists every feature key found and how often it occurs. Two output modes are available: Separated (the isolated feature sequence) and Uppercased in context (the full sequence with the feature capitalised).

📋 How to Use

Paste the contents of one or more EMBL files into the input area.
Choose an Output mode: Separated returns only the nucleotides within the feature coordinates — ideal for downstream analysis. Uppercased in context returns the full genomic sequence in lowercase with the feature region in uppercase — useful for visually locating the feature.
Click Extract Features. Each annotated feature is output as a FASTA entry; records are separated by a "=== title ===" header line.
The Feature Types Found panel shows every distinct feature key (CDS, mRNA, source, etc.) with its count across all records.
Use the Copy button to copy all extracted sequences to your clipboard.
Click Load Example to try with a compact synthetic record that demonstrates source, gene, CDS (join coordinates), and a complement-strand feature.
Click Clear to reset.

🧮 Formulas & Logic

Separated mode

feature_seq = dna[ start − 1 : stop ] (1-based, inclusive EMBL coordinates converted to a 0-based, end-exclusive substring)
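
The coordinate conversion above can be sketched in Python as follows. This is a minimal illustration, not the tool's own code; the function name extract_range and the bounds check are assumptions made for the example.

```python
def extract_range(dna: str, start: int, stop: int) -> str:
    """Return the bases from start to stop, both 1-based and inclusive."""
    if not (1 <= start <= stop <= len(dna)):
        raise ValueError(f"range {start}..{stop} is outside 1..{len(dna)}")
    # EMBL numbers the first base 1; Python indexes it 0, hence start - 1.
    # The stop value needs no adjustment because a slice end is exclusive.
    return dna[start - 1:stop]


print(extract_range("aaccggtt", 3, 6))  # ccgg
```

A feature annotated 3..6 therefore yields four bases, matching the inclusive EMBL convention.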

Uppercased in context

output = lowercase_left_context + UPPERCASE_FEATURE + lowercase_right_context
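
As an illustration of the formula above, the following Python sketch builds the in-context output for a simple start..stop feature. The function name uppercase_in_context is hypothetical, and the sketch does not cover join() or complement features.

```python
def uppercase_in_context(dna: str, start: int, stop: int) -> str:
    """Lowercase the whole sequence, then capitalise bases start..stop (1-based, inclusive)."""
    left = dna[:start - 1].lower()            # everything before the feature
    feature = dna[start - 1:stop].upper()     # the feature itself
    right = dna[stop:].lower()                # everything after the feature
    return left + feature + right


print(uppercase_in_context("aaccggtt", 3, 6))  # aaCCGGtt
```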

Complement strand

feature_seq = reverse( IUPAC_complement( extracted_range ) )
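
One way to express the complement-strand formula in Python is shown below. It is a sketch under the assumption that case is preserved and that U is treated like T; the table covers the standard IUPAC nucleotide codes, and the name reverse_complement is chosen for the example only.

```python
# Standard IUPAC nucleotide codes mapped to their complements, in both cases.
# Degenerate codes pair up as R<->Y, K<->M, B<->V, D<->H; S, W and N map to themselves.
_COMPLEMENT = str.maketrans(
    "ACGTURYSWKMBDHVNacgturyswkmbdhvn",
    "TGCAAYRSWMKVHDBNtgcaayrswmkvhdbn",
)


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGCRN"))  # NYGCAT
```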

join() coordinates

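join(x1..y1, x2..y2, …): each range is extracted and the fragments are concatenated in order. For complement features the ranges are processed from last to first and each fragment is reverse-complemented, which gives the same result as reverse-complementing the in-order concatenation.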
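
The join() and complement rules can be combined in a short Python sketch. It assumes locations of the form start..stop, join(...), complement(start..stop) or complement(join(...)), handles only the four plain bases for brevity, and uses the hypothetical name extract_feature; it is not the tool's actual parser.

```python
import re

# Plain bases only, to keep the sketch short.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_feature(dna: str, location: str) -> str:
    """Extract a feature for simple, join() and complement() locations."""
    location = location.replace(" ", "")
    is_complement = location.startswith("complement(") and location.endswith(")")
    if is_complement:
        location = location[len("complement("):-1]
    if location.startswith("join(") and location.endswith(")"):
        location = location[len("join("):-1]

    ranges = []
    for part in location.split(","):
        match = re.fullmatch(r"(\d+)\.\.(\d+)", part)
        if match is None:
            raise ValueError(f"unsupported location part: {part!r}")
        ranges.append((int(match.group(1)), int(match.group(2))))

    if not is_complement:
        return "".join(dna[start - 1:stop] for start, stop in ranges)
    # Reverse strand: visit ranges last to first and reverse-complement each one.
    return "".join(
        dna[start - 1:stop].translate(_COMPLEMENT)[::-1]
        for start, stop in reversed(ranges)
    )


dna = "aacgttgcatgg"
print(extract_feature(dna, "join(2..4,8..10)"))              # acgcat
print(extract_feature(dna, "complement(join(2..4,8..10))"))  # atgcgt
```

The second result is the reverse complement of the first, which is the check that the range order and the per-fragment reversal are consistent.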

📊 Result Interpretation

Records Processed

Number of EMBL records (ID … //) successfully found and parsed.
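
A record boundary check of this kind might look like the Python sketch below. It assumes a record opens with a line starting "ID " and closes with a line containing only "//", and that an unterminated trailing record is ignored; the name split_records is hypothetical.

```python
def split_records(text: str) -> list:
    """Return each complete ID ... // block as one string."""
    records = []
    current = None  # lines of the record being collected, or None between records
    for line in text.splitlines():
        if line.startswith("ID "):
            current = [line]
        elif current is not None:
            current.append(line)
            if line.strip() == "//":
                records.append("\n".join(current))
                current = None
    return records


sample = "ID   A; SV 1\nXX\n//\nID   B; SV 1\n//\nID   C; SV 1\n"
print(len(split_records(sample)))  # 2
```

The third record in the sample has no terminating "//" line, so it is not counted.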

Features Extracted

Total number of feature table entries converted to FASTA sequences. Features with unsupported position formats are excluded and listed in the Processing Warnings panel.

Feature Types

Count of distinct feature keys (e.g. CDS, mRNA, gene) across all processed records.
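
The tally behind this figure can be sketched in Python with a Counter. The sketch relies on the EMBL flat-file layout in which a feature key starts in column 6 of an FT line while qualifier lines leave that column blank; the sample lines and the name count_feature_keys are made up for illustration.

```python
import re
from collections import Counter

# "FT", three spaces, then the feature key. Qualifier lines such as
# /organism=... have a blank in column 6, so they do not match.
_FEATURE_LINE = re.compile(r"^FT {3}(\S+)\s+\S")


def count_feature_keys(text: str) -> Counter:
    """Count how often each feature key occurs in the FT lines of text."""
    counts = Counter()
    for line in text.splitlines():
        match = _FEATURE_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


SAMPLE_FT = """\
FT   source          1..60
FT                   /organism="synthetic construct"
FT   gene            5..40
FT   CDS             join(5..20,30..40)
FT   CDS             complement(45..58)
"""

counts = count_feature_keys(SAMPLE_FT)
print(dict(counts))  # {'source': 1, 'gene': 1, 'CDS': 2}
print(len(counts))   # 3 distinct feature keys
```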

🔬 Applications

Extracting CDS sequences from EMBL records for codon-usage analysis or translation
Reconstructing spliced mRNA from multi-exon join() coordinates
Deriving the exact nucleotide sequence for a single annotated feature before BLAST or primer design
Visually inspecting where a feature sits within its genomic context using Uppercased in context mode
Processing multiple EMBL records in a single pass to collect all features of a given type

⚠️ Common Mistakes & Warnings

Complex position types are skipped

Positions using one-of(), order(), bond(), or other advanced EMBL location descriptors cannot be represented as a simple sequence. They are skipped, and a warning is shown in the Processing Warnings panel. Simple positions (e.g. 1..100) and join() coordinates are fully supported.
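
A filter along these lines could be sketched in Python as follows. It assumes partial-position markers have already been removed, accepts only start..stop, join(...) and their complement(...) forms, and uses hypothetical names (is_supported, split_supported) and a made-up warning text.

```python
import re

_RANGE = r"\d+\.\.\d+"
_JOIN = rf"join\({_RANGE}(?:,{_RANGE})*\)"
_SUPPORTED = re.compile(
    rf"(?:{_RANGE}|{_JOIN}|complement\((?:{_RANGE}|{_JOIN})\))"
)


def is_supported(location: str) -> bool:
    """True for simple ranges, join() lists, and complement() of either."""
    return _SUPPORTED.fullmatch(location.replace(" ", "")) is not None


def split_supported(locations):
    """Separate usable locations from those that should raise a warning."""
    supported, warnings = [], []
    for loc in locations:
        if is_supported(loc):
            supported.append(loc)
        else:
            warnings.append(f"Skipped unsupported position: {loc}")
    return supported, warnings


ok, warns = split_supported(
    ["1..100", "join(265..402,673..781)", "order(1..3,7..9)", "bond(12,63)"]
)
print(ok)     # ['1..100', 'join(265..402,673..781)']
print(warns)  # two warnings, one per skipped location
```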

Partial-position markers are stripped

The "<" and ">" markers (indicating a partial or fuzzy position boundary) are removed before extraction. The extracted sequence covers only the stated coordinates, so it may be shorter than the full feature that the annotation implies.
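
In EMBL locations the partial markers are the characters "<" and ">". A minimal Python sketch of the stripping step, using the hypothetical name strip_partial_markers, is:

```python
def strip_partial_markers(location: str) -> str:
    """Remove the '<' and '>' partial-boundary markers from a location string."""
    return location.replace("<", "").replace(">", "")


print(strip_partial_markers("<1..>200"))              # 1..200
print(strip_partial_markers("join(<5..20,30..>40)"))  # join(5..20,30..40)
```

After stripping, the location is handled like any other range, so only bases 1 to 200 would be extracted in the first example even though the annotated feature extends beyond both ends.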

Only the first DE line is used as the record title

Multi-line DE (description) fields are truncated to the first line. If the description continues on subsequent DE lines, those are not appended.
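
The title rule described here can be sketched in Python. The function name record_title and the fallback text for records without a DE line are assumptions for the example.

```python
def record_title(record: str) -> str:
    """Return the text of the first DE line; later DE lines are ignored."""
    for line in record.splitlines():
        if line.startswith("DE "):
            return line[2:].strip()
    return "Untitled record"  # hypothetical fallback, not taken from the tool


record = "ID   X; SV 1\nDE   First line of description\nDE   continued here\n//"
print(record_title(record))  # First line of description
```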

DNA only — protein-sequence EMBL records are not supported

The SQ block is treated as a DNA sequence. EMBL records where the SQ section contains amino acids (rare) will produce incorrect output.
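
To screen records before pasting them, a check like the Python sketch below may help. It reads the lines between the SQ header and "//", drops the spacing and the trailing position numbers, and tests whether every character is an IUPAC nucleotide code. The names read_sequence and looks_like_dna are hypothetical, and the test is only a heuristic: a short peptide made entirely of letters that are also nucleotide codes would pass.

```python
import re

IUPAC_NUCLEOTIDES = set("acgturyswkmbdhvn")


def read_sequence(record: str) -> str:
    """Concatenate the sequence lines that follow the SQ header line."""
    in_sq = False
    chunks = []
    for line in record.splitlines():
        if line.startswith("SQ "):
            in_sq = True
        elif line.strip() == "//":
            break
        elif in_sq:
            # Remove block spacing and the running base count at line end.
            chunks.append(re.sub(r"[\s\d]", "", line))
    return "".join(chunks).lower()


def looks_like_dna(seq: str) -> bool:
    """True if seq is non-empty and uses only IUPAC nucleotide codes."""
    return bool(seq) and set(seq.lower()) <= IUPAC_NUCLEOTIDES


record = "ID   X; SV 1\nSQ   Sequence 12 BP;\n     aaccggttac gt        12\n//"
seq = read_sequence(record)
print(seq)                     # aaccggttacgt
print(looks_like_dna(seq))     # True
print(looks_like_dna("mkvl"))  # False, 'l' is not a nucleotide code
```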

❓ Frequently Asked Questions

What is the difference between Separated and Uppercased in context?

"Separated" extracts only the nucleotides within the feature coordinates — the FASTA output contains just the feature sequence, ready for downstream tools like BLAST or primer design. "Uppercased in context" outputs the full genomic sequence in lowercase with the feature region in uppercase letters, making it easy to see where the feature sits relative to surrounding sequence.

How are join() coordinates handled?

A join position such as join(265..402,673..781) is split on commas and each range is extracted separately, then the fragments are concatenated. This correctly reconstructs spliced sequences like mature mRNA or multi-exon CDS entries. For complement join features the ranges are processed in reverse order and each fragment is reverse-complemented, so the result equals the reverse complement of the in-order concatenation.

What does complement() mean in the feature table?

A feature annotated as complement(start..stop) is encoded on the strand opposite to the one shown in the record. The tool extracts the specified range, takes the IUPAC nucleotide complement (A↔T, G↔C, with full degenerate-base support), then reverses the result, which yields the feature sequence in its own 5' to 3' orientation.

Can I paste multiple EMBL records at once?

Yes. Paste any number of complete records (each starting with "ID " and ending with "//"). Each record is processed independently and all features are output together, separated by a "=== record title ===" header line.

Where do I get EMBL-format files?

EMBL flat files are available from the European Nucleotide Archive (ENA) at www.ebi.ac.uk/ena. Search for an accession number and choose "EMBL" as the download format. GenBank records can be converted to EMBL format using tools such as Biopython's SeqIO or EMBOSS seqret.

Why are some features missing from the output?

Features using unsupported position descriptors (one-of, order, bond) cannot be represented as a plain sequence and are skipped. A warning message is shown in the Processing Warnings panel for each skipped feature.